DevOps Engineer

Scheduling deployments

Schedule resource-heavy deployments on customer-facing hardware at low-traffic times so that customers are least likely to be impacted.

As a DevOps engineer, you are responsible for scheduling deployments throughout the day while minimizing the maximum load on the servers at any given time. You have data on the expected load during the day from regular customer usage, which consumes part of the available load on the servers. Additionally, you have several deployments planned, each with its own load requirements and duration.

The challenge is to schedule these deployments such that the load deviation on the servers is minimized. You also need to ensure that the load never surpasses 100%.

Objective: Minimize the total deviation from the average load on the servers.

Constraints:
- All deployments have to be executed and each deployment's start time must be in the range given in `Deployment Start Window Start` and `Deployment Start Window End`
- The total load at any given time (customer load + deployment load) should not exceed the server capacity.
- Deployments must be non-preemptive (i.e., once started, a deployment must run to completion).

Data:
The customer load can be found in scheduling_deployments_base_load.csv and has the following columns: Time,Customer Load
The deployments can be found in scheduling_deployments_deployments.csv, which has the following columns: Deployment ID,Deployment Load,Deployment Duration,Deployment Start Window Start,Deployment Start Window End
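To make the objective and constraints concrete, here is a minimal sketch of how a candidate schedule could be evaluated. It is not a solver. It assumes that time is discretized into equal slots indexed from 0, that durations and start windows are expressed in those slots, and that "total deviation" means the sum of absolute differences from the mean load. None of this is stated in the text, and the names `load_profile`, `total_deviation`, and `is_feasible` are hypothetical.

```python
from statistics import mean

def load_profile(base_load, deployments, starts):
    """Add each deployment's load to the customer load for the slots it occupies.

    base_load:   list of customer load values, one per time slot (assumed layout)
    deployments: {id: (load, duration, window_start, window_end)}
    starts:      {id: chosen start slot}
    """
    total = list(base_load)
    for dep_id, start in starts.items():
        load, duration, win_start, win_end = deployments[dep_id]
        if not (win_start <= start <= win_end):
            raise ValueError(f"{dep_id}: start {start} is outside its window")
        if start + duration > len(total):
            raise ValueError(f"{dep_id}: runs past the end of the horizon")
        # Non-preemptive: the deployment occupies consecutive slots.
        for t in range(start, start + duration):
            total[t] += load
    return total

def total_deviation(profile):
    """Sum of absolute deviations from the mean load (assumed objective form)."""
    avg = mean(profile)
    return sum(abs(x - avg) for x in profile)

def is_feasible(profile, capacity=100.0):
    """The combined load must never exceed the server capacity."""
    return max(profile) <= capacity
```

An optimizer would search over the `starts` dictionary, keeping only schedules for which `is_feasible` holds and preferring the one with the smallest `total_deviation`.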

Assigning workloads

With only a limited number of machines available for workloads, assign the jobs so as to minimize the number of machines impacted.

A DevOps Engineer is responsible for scheduling workloads on a limited number of onsite machines. Each machine has specific capacities in terms of virtual CPUs (vCPU), RAM, and GPU FLOPS. Each workload requires a certain amount of these resources to run.

Objective: Maximize the total resources of machines without any workloads assigned to them.

Constraints:
- vCPU constraint: The total vCPU requirement of the workloads assigned to a machine must not exceed the vCPU capacity of that machine.
- RAM constraint: The total RAM requirement of the workloads assigned to a machine must not exceed the RAM capacity of that machine.
- GPU FLOPS constraint: The total GPU FLOPS requirement of the workloads assigned to a machine must not exceed the GPU FLOPS capacity of that machine.
- All workloads must be scheduled.

Data:
The machines can be found in assigning_workloads_machines.csv, which has the following columns: Machines,vCPU Capacity,RAM Capacity (GB),GPU Capacity (GFLOPS)
The workloads can be found in assigning_workloads_workloads.csv, which has the following columns: Workloads,vCPU Requirement,RAM Requirement (GB),GPU Requirement (GFLOPS)
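The constraints above can be checked for any proposed assignment with a few lines of code. The following is a minimal sketch, not an optimizer. The dictionary layout and the names `RESOURCES`, `is_valid_assignment`, and `idle_resources` are hypothetical. The text does not say how the three resource types are combined into a single objective value, so the idle totals are reported per resource.

```python
RESOURCES = ("vcpu", "ram_gb", "gpu_gflops")  # hypothetical key names

def is_valid_assignment(machines, workloads, assignment):
    """Check that every workload is placed and no machine capacity is exceeded.

    machines:   {machine: {resource: capacity}}
    workloads:  {workload: {resource: requirement}}
    assignment: {workload: machine}
    """
    used = {m: dict.fromkeys(RESOURCES, 0) for m in machines}
    for workload, machine in assignment.items():
        for r in RESOURCES:
            used[machine][r] += workloads[workload][r]
    all_scheduled = set(assignment) == set(workloads)
    within_capacity = all(
        used[m][r] <= machines[m][r] for m in machines for r in RESOURCES
    )
    return all_scheduled and within_capacity

def idle_resources(machines, assignment):
    """Total capacity, per resource, of machines with no workload assigned."""
    busy = set(assignment.values())
    return {
        r: sum(cap[r] for m, cap in machines.items() if m not in busy)
        for r in RESOURCES
    }
```

A search procedure would look for a valid assignment that leaves the largest idle totals, for example by packing workloads onto as few machines as the capacities allow.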

Incident Response Planning

Disaster Recovery Planning

A complex system of internal and customer-facing services with many interdependencies should be brought online efficiently in case of a disaster. The customer-facing services are assigned a priority value, which determines the order in which the services should be brought back online.

To induce urgency, we utilize the following formula that states that customer-facing services should be brought online as quickly as possible, with more important services getting a higher priority:

\(V(t) = V_0 \cdot e^{-0.0398t}\)

You are a DevOps Engineer responsible for developing an optimized incident response plan to prioritize critical systems and allocate resources efficiently during outages.
You need to plan the recovery of all 60 interconnected systems as far as possible.
Each system has dependencies on other systems, and only systems with higher numbers (customer-facing systems) have priority scores.

The goal is to get systems with a priority score up and running as quickly as possible. We start at t=0, and the time required for every recovery is indicated by "Recovery Time (minutes)".
As time goes by, the value of each system decreases. The value of a system at time t is given by the following function:
\(V(t) = V_0 \cdot e^{-0.0398t}\), where t is the time (in minutes) at which the system finishes recovering and \(V_0\) is its initial priority score.

Objective: Maximize the sum of the time-discounted priority scores V(t) over all systems, using the function above.

Data:
The systems and their dependencies can be found in incident_response.json, which has the following fields: System, Priority, Dependencies, Recovery Time (minutes)
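The decay function and the dependency rule can be combined into a small evaluator for a proposed recovery order. This is a minimal sketch under a strong assumption that the text does not confirm: systems are recovered one at a time, so finish times accumulate along the order. The dictionary keys (`priority`, `deps`, `recovery`) and the function names are hypothetical and do not reflect the exact layout of incident_response.json.

```python
import math

DECAY = 0.0398  # decay constant from the value function V(t)

def value_at(v0, t):
    """Value of a system with initial priority v0 that finishes recovering at time t."""
    return v0 * math.exp(-DECAY * t)

def evaluate_order(systems, order):
    """Total discounted value of recovering systems sequentially in `order`.

    systems: {name: {"priority": number or None, "deps": [names], "recovery": minutes}}
    Assumes one recovery at a time (not stated in the problem text).
    """
    finished = set()
    clock = 0.0
    total = 0.0
    for name in order:
        info = systems[name]
        missing = [d for d in info["deps"] if d not in finished]
        if missing:
            raise ValueError(f"{name} started before its dependencies: {missing}")
        clock += info["recovery"]  # this system finishes at the current clock
        # Internal systems without a priority score contribute no value.
        total += value_at(info.get("priority") or 0, clock)
        finished.add(name)
    return total
```

With this decay constant, a system's value halves roughly every 17.4 minutes (ln 2 / 0.0398), which is why high-priority systems and the dependencies they need should come early in the order.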

Testing strategy optimization

Smartly decide which machines to run tests on and what kind of testing environment to simulate.

A DevOps Engineer wants to optimize the testing strategy for a software application. The application needs to be tested on four operating systems: Linux64, Armlinux64, MacOS, and Windows. Each operating system must be tested once. There are 10 testing environments available, numbered from 1 to 10, and each OS must be assigned a unique testing environment. Additionally, there are 40 testing machines, with 10 machines dedicated to each OS. Each machine can only handle a subset of the testing environments.

The goal is to find the optimal combination of {machine, testing environment} for each operating system to maximize the total score. The score for each {machine, testing environment} combination is calculated by taking the number of days since the Testing Date and then multiplying that number by its modifier.

Objective: Maximize the total score.

Constraints:
- Each operating system (Linux64, Armlinux64, MacOS, Windows) must be tested exactly once on one of its OS-specific machines.
- Each testing environment can be chosen at most once.
- Each {machine, testing environment} combination can only be chosen if the machine supports the given environment.

Data:
Use the data from testing_strategy.csv with the following columns: OS,Testing Environment,Machine,Testing Date,Modifier
The data also show which combinations of {machine, testing environment} are available.
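Because every machine is dedicated to a single OS, only the "each environment at most once" rule couples the operating systems. One way to exploit this, sketched below, is to keep the best-scoring machine for each (OS, environment) pair and then try every assignment of distinct environments to the operating systems. This is an exhaustive search chosen for illustration, not necessarily the intended method. The reference date `today` used to count "days since the Testing Date" is an assumption, as are the row layout and the names `score` and `best_plan`.

```python
from datetime import date
from itertools import permutations

def score(testing_date, modifier, today):
    """Days since the testing date, multiplied by the modifier."""
    return (today - testing_date).days * modifier

def best_plan(rows, today):
    """Pick one (machine, environment) per OS, with distinct environments.

    rows: iterable of (os, environment, machine, testing_date, modifier);
          only combinations present in the data are considered available.
    """
    best = {}  # (os, env) -> (score, machine): best machine for that pair
    for os_name, env, machine, tested, modifier in rows:
        s = score(tested, modifier, today)
        if (os_name, env) not in best or s > best[(os_name, env)][0]:
            best[(os_name, env)] = (s, machine)

    oses = sorted({o for o, _ in best})
    envs = sorted({e for _, e in best})
    top_total, top_plan = None, None
    # Each permutation assigns a distinct environment to every OS.
    for chosen in permutations(envs, len(oses)):
        pairs = list(zip(oses, chosen))
        if any(p not in best for p in pairs):
            continue  # some OS has no machine supporting that environment
        total = sum(best[p][0] for p in pairs)
        if top_total is None or total > top_total:
            top_total = total
            top_plan = {o: (best[(o, e)][1], e) for o, e in pairs}
    return top_total, top_plan
```

With 4 operating systems and 10 environments this checks at most 10 × 9 × 8 × 7 = 5040 environment assignments, which is small enough to enumerate directly.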