Data Engineer

Apache Kafka Partition assignments

Optimize the performance of the Apache Kafka cluster by intelligently assigning topic partitions to the various brokers. The goal is to minimize the latency between brokers (for instance, during replication or leader-election workloads) while dividing the partitions roughly evenly over the brokers.

As a Data Engineer at a software company, you are tasked with optimizing the performance of your Apache Kafka cluster. Kafka topics are used to manage data streams, and each topic is divided into partitions. Efficient partitioning of these topics can significantly reduce latency and improve throughput. Your goal is to determine the optimal number of partitions for each topic and assign these partitions to brokers in a way that balances the load and minimizes data transfer between brokers.

Given the following data about your Kafka cluster and topics, use integer linear programming to find the optimal solution.

Objective:
Maximize throughput by minimizing data transfer between brokers.
This is calculated by, for every topic, taking the average transfer cost between all brokers, weighted by the number of partitions of that topic assigned to each broker, and summing these averages over all topics.
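
One possible formalization (an assumption, since the statement leaves the exact weighting open): let x_{t,b} be an integer variable for the number of partitions of topic t placed on broker b, and r_{ij} the Transfer_Rate from kafka_rates.csv. The objective would then read

\[
\min \sum_{t \in T} \frac{1}{|P|} \sum_{(i,j) \in P} r_{ij}\, x_{t,i}\, x_{t,j}
\]

where P is the set of broker pairs. The product x_{t,i} x_{t,j} is quadratic, so to keep the model an integer linear program it would have to be linearized, for example via binary indicators for whether a broker holds any partition of a topic.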

Constraints:
- Partition Capacity Constraint: The number of partitions assigned to each broker must not exceed the broker's maximum capacity.
- Topic Partition Constraint: The number of partitions assigned to each topic must lie within the specified range (Min_Partitions to Max_Partitions), and every topic must be assigned.
- Broker Load Balance Constraint: The total number of partitions assigned to each broker should be as balanced as possible, with no broker deviating more than 20% from the broker average.
- Partition Distribution Constraint: For each topic, no broker can hold more than 40% of the partitions.

Data:
The brokers can be found in kafka_brokers.csv, which has the following columns: Broker_ID,Max_Partitions
The topics can be found in kafka_topics.csv, which has the following columns: Topic_ID,Min_Partitions,Max_Partitions
The data transfer rates can be found in kafka_rates.csv, which has the following columns: From_Broker,To_Broker,Transfer_Rate
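
The sketch below shows one way to set this model up with PuLP. The tiny inline dictionaries only stand in for kafka_brokers.csv, kafka_topics.csv and kafka_rates.csv (the numbers are made up for illustration), and the objective is a simple linear placeholder that penalizes outgoing transfer rates rather than the pairwise-average term discussed above, which would need linearization.

```python
# Minimal sketch with PuLP; illustrative data only, not the real CSV contents.
import pulp

# Hypothetical data standing in for kafka_brokers.csv / kafka_topics.csv / kafka_rates.csv
brokers = {"B1": 10, "B2": 8, "B3": 12}        # Broker_ID -> Max_Partitions
topics = {"T1": (3, 6), "T2": (2, 5)}          # Topic_ID -> (Min_Partitions, Max_Partitions)
rates = {("B1", "B2"): 5, ("B1", "B3"): 2,     # (From_Broker, To_Broker) -> Transfer_Rate
         ("B2", "B3"): 4}

prob = pulp.LpProblem("kafka_partition_assignment", pulp.LpMinimize)

# x[t][b] = number of partitions of topic t placed on broker b
x = pulp.LpVariable.dicts("x", (topics, brokers), lowBound=0, cat="Integer")

# Placeholder linear objective: penalize placing partitions on brokers with high
# outgoing transfer rates (the true pairwise-average objective needs linearization).
out_rate = {b: sum(r for (i, j), r in rates.items() if i == b) for b in brokers}
prob += pulp.lpSum(out_rate[b] * x[t][b] for t in topics for b in brokers)

# Partition Capacity Constraint: respect each broker's Max_Partitions
for b, cap in brokers.items():
    prob += pulp.lpSum(x[t][b] for t in topics) <= cap

# Topic Partition Constraint: every topic's total lies within its range
for t, (lo, hi) in topics.items():
    prob += pulp.lpSum(x[t][b] for b in brokers) >= lo
    prob += pulp.lpSum(x[t][b] for b in brokers) <= hi

# Partition Distribution Constraint: no broker holds more than 40% of a topic
for t in topics:
    total_t = pulp.lpSum(x[t][b] for b in brokers)
    for b in brokers:
        prob += x[t][b] <= 0.4 * total_t

# Broker Load Balance Constraint: each broker within +/-20% of the broker average
total_all = pulp.lpSum(x[t][b] for t in topics for b in brokers)
avg = total_all * (1.0 / len(brokers))
for b in brokers:
    load_b = pulp.lpSum(x[t][b] for t in topics)
    prob += load_b <= 1.2 * avg
    prob += load_b >= 0.8 * avg

prob.solve(pulp.PULP_CBC_CMD(msg=False))
for t in topics:
    for b in brokers:
        print(t, b, round(x[t][b].value()))
```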

Snowflake Optimizing clustering keys

As a Data Engineer at a software company, you are tasked with optimizing the performance of a critical application running on Snowflake. The database handles a large volume of read and write operations, and the current table structures and clustering keys are not optimized, leading to slow query performance and higher costs associated with compute time and storage.

Your goal is to design the most efficient clustering strategy and optimize materialized views to enhance query performance while minimizing the impact on storage and compute costs.

You have a set of queries that are frequently run against the database. Each query has a different frequency and selectivity (the percentage of rows filtered by the query). Optimizing clustering keys and materialized views on the columns used in these queries can significantly improve their performance but will also increase storage usage and maintenance costs.

Objective: Maximize the total query performance improvement. This is calculated by taking the sum of the chosen Performance_Improvement values and dividing it by the total number of queries.
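
Written out, assuming a binary variable y_q that indicates whether the optimization for query q is chosen and P_q its Performance_Improvement (%), the objective is

\[
\max \; \frac{1}{|Q|} \sum_{q \in Q} P_q\, y_q
\]

where Q is the set of queries in the attached CSV file.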

Constraints:
- Compute Maintenance Cost Constraint: The total additional compute cost introduced by the clustering keys and materialized views maintenance should not exceed 20 units.
- Storage Cost Constraint: The total storage used by all optimizations should not exceed 45 GB.
- Query Performance Constraint: The total query performance improvement must be at least 20%.

Data:
The data for this problem is in the attached CSV file with the following columns: Query_ID,Frequency,Selectivity (%),Compute_Maintenance_Cost,Storage_Cost (GB),Performance_Improvement (%)
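
A minimal PuLP sketch of this selection problem, with made-up rows standing in for the attached CSV (column names follow the statement; the binary per-query selection variable is the modeling assumption from above):

```python
# Knapsack-style sketch with PuLP; rows below are illustrative, not the real CSV.
import pulp

# Hypothetical rows: Query_ID -> (Compute_Maintenance_Cost, Storage_Cost (GB), Performance_Improvement (%))
queries = {
    "Q1": (4, 10, 8),
    "Q2": (6, 12, 9),
    "Q3": (3, 8, 5),
    "Q4": (5, 9, 7),
    "Q5": (7, 15, 11),
}

prob = pulp.LpProblem("snowflake_clustering_keys", pulp.LpMaximize)

# y[q] = 1 if the clustering-key / materialized-view optimization for query q is built
y = pulp.LpVariable.dicts("y", queries, cat="Binary")

n = len(queries)
total_improvement = pulp.lpSum(perf * y[q] for q, (_, _, perf) in queries.items())

# Objective: average performance improvement over all queries
prob += total_improvement * (1.0 / n)

# Compute Maintenance Cost Constraint: at most 20 units
prob += pulp.lpSum(cost * y[q] for q, (cost, _, _) in queries.items()) <= 20

# Storage Cost Constraint: at most 45 GB
prob += pulp.lpSum(gb * y[q] for q, (_, gb, _) in queries.items()) <= 45

# Query Performance Constraint: total improvement of at least 20%
prob += total_improvement >= 20

prob.solve(pulp.PULP_CBC_CMD(msg=False))
chosen = [q for q in queries if y[q].value() > 0.5]
print("Chosen optimizations:", chosen, "avg improvement:", pulp.value(prob.objective))
```

Because the objective and the Query Performance Constraint both use the same total-improvement expression, it is defined once and reused in both places.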