Solving the Set Cover Problem with Python and OR-Tools

Introduction

The Set Cover Problem is a fundamental optimization problem in computer science and operations research, with many practical applications, such as network design, resource allocation, and scheduling.

In this tutorial, we'll explore how to use Python and Google's OR-Tools library to solve it.

Intended audience
This blog post is intended for beginner-to-intermediate Python developers.
You don't need any prior knowledge of optimization or OR-Tools, but it will help if you understand basic Python concepts like loops, functions and lists.

Problem description

Imagine that a university wants to schedule a set of courses and needs to ensure that all topics are covered by the available professors. Each professor can cover multiple topics, and our goal is to select the minimum number of professors required to cover all topics.

Here’s the basic timetable:

Subject	Time Slot	Teacher
Maths	9:00 - 10:00 AM	?
Physics	10:00 - 11:00 AM	?
Chemistry	11:00 - 12:00 PM	?
Biology	1:00 - 2:00 PM	?

4 rows x 3 columns

And here are the professors and the topics they’re able to cover:

Professor	Topics Covered
Prof. A	(Math, Physics)
Prof. B	(Physics, Chemistry)
Prof. C	(Math, Biology)
Prof. D	(Biology, Chemistry)
Prof. E	(Math, Chemistry, Biology)

5 rows x 2 columns

The simplest way to ensure that all courses are covered would be to assign a different professor to each class, e.g.:

Subject	Time Slot	Teacher
Maths	9:00 - 10:00 AM	Prof. A
Physics	10:00 - 11:00 AM	Prof. B
Chemistry	11:00 - 12:00 PM	Prof. D
Biology	1:00 - 2:00 PM	Prof. E

4 rows x 3 columns

But, while it might be the “fairest” solution, it would also be incredibly inefficient; we’d need to pay 4 professors for a teaching day, even though each professor only teaches for a single hour.

Instead, we can use mathematical programming to to find the smallest set of professors that cover all the topics (Math, Physics, Chemistry, and Biology).

Isn't this an unrealistic problem?
When creating timetables in the real world, we’ll likely have much more complex requirements relating to room availability, concurrent classes, and multi-day or multi-week timetables. The reason we’ve articulated the problem in this simplistic way is to illustrate the basic principles of the Set Cover Problem.
If you’re interested in using linear programming to solve more complex timetabling problems, check out some of the other articles on this site:

Article 1

Article 2

Mathematics

To solve this problem using mathematical optimization, we need to translate our constraints and goals into mathematical language. Let's break this down step by step.

Sets

First, let’s define the sets: the “populations” or “groups” of data which frame the problem.

In this problem, we have two main sets: the set of professors, and the set of topics. We’ll use mathematical notation to denote them:

Let $P$ be the set of all professors
Let $T$ be the set of all topics

Parameters

Next we’ll define a parameter which refers to each professor’s topics:

Let $S_p$ be the subset of topics covered by professor $p$

Decision variables

Decision variables are variables which have an as-yet-unknown value; later, when we solve the problem, we’ll work out which value they should take on.

In this problem, we’ll need one Boolean decision variable for each professor. Each variable will take the value 1 if the corresponding professor is chosen, and 0 if not:

Let $x_p$ be a binary variable representing whether professor $p$ is included in the solution (1 if included, 0 otherwise)

Constraints

Next we need to define the constraints:

Coverage constraint (every topic must be covered by at least one professor):
$\sum_{p \in P} x_p \cdot \bold{1}_{t \in S_p} \geq 1, \space \space \space \space \space \forall t \in T$

What’s going on here? 🤔

The goal of the Set Cover Problem is to ensure that every topic is covered by at least one selected professor. Mathematically, this can be expressed with the above constraint. There are two parts to this:

Summation: $\sum_{p \in P} x_p \cdot \bold{1}_{t \in S_p}$ : For each professor $p$ , if they are selected ( $x_p = 1$ ), and if they cover the topic $t$ ( $\bold{1}_{t \in S_p} = 1$ ), then that professor contributes 1 to the sum. If either the professor is not selected or does not cover the topic, they contribute 0 to the sum.
- Note: the $\bold{1}$ in $\bold{1}_{t \in S_p}$ is not a “1”; it’s the indicator function. This might be a bit confusing if you’re new to maths. But the “1” in $\bold{1}_{t \in S_p}$ is not a constant value; rather, it signifies that the indicator function outputs 1 (true) or 0 (false) based on the condition $t \in S_p$ .
Constraint: Finally, the constraint ≥1 ensures that the sum is at least 1, meaning that at least one selected professor covers the topic $t$ .

Objective function

Having defined our variables and constraints, we can now proceed to articulating our goals.

Remember: our goal is to minimize the number of selected professors. We can express this goal via the following objective function:

\min \sum_{p \in P} x_p

Solving with Python

Now that we've formulated the problem mathematically, let's implement the solution using Python and OR-Tools.

First, we'll import the necessary libraries:

from ortools.linear_solver import pywraplp
import pandas as pd

Next, let's define our data:

data = {
    'professor': ['Prof. A', 'Prof. B', 'Prof. C', 'Prof. D', 'Prof. E'],
    'topics': [
        {'Math', 'Physics'},
        {'Physics', 'Chemistry'},
        {'Math', 'Biology'},
        {'Biology', 'Chemistry'},
        {'Math', 'Chemistry', 'Biology'}
    ]
}

df = pd.DataFrame(data)
topics = {'Math', 'Physics', 'Chemistry', 'Biology'}

Now, let's create our solver and define the variables:

# Create the solver
solver = pywraplp.Solver.CreateSolver('SCIP')

# Create variables
x = {}
for i in df.index:
    x[i] = solver.BoolVar(f'x_{i}')

# Print variables to verify
print('Number of variables =', solver.NumVariables())

Next, we'll add our constraints:

# Coverage constraints
for topic in topics:
    solver.Add(solver.Sum(x[i] for i in df.index if topic in df.loc[i, 'topics']) >= 1)

# Print constraints to verify
print('Number of constraints =', solver.NumConstraints())

Now, let's define our objective function:

# Objective function
objective = solver.Objective()

for i in df.index:
    objective.SetCoefficient(x[i], 1)

objective.SetMinimization()

Finally, we can solve the problem and print the results:

# Solve the problem
status = solver.Solve()

# Print solution
if status == pywraplp.Solver.OPTIMAL:
    print('Solution:')
    print('Objective value =', solver.Objective().Value())
    for i in df.index:
        if x[i].solution_value() > 0:
            print(f"{df.loc[i, 'professor']}: {x[i].solution_value()}")
else:
    print('The problem does not have an optimal solution.')

When we run this code, we get the following output:

Number of variables = 5
Number of constraints = 4

Solution:
Objective value = 2.0
Prof. A: 1.0
Prof. D: 1.0

Interpreting the results

Our optimization model suggests that the university should select Prof. A and Prof. D. This combination will ensure that all topics (Math, Physics, Chemistry, and Biology) are covered with the minimum number of professors (2).

Conclusion

In this article, we've explored how to solve a real-world optimization problem using mathematical programming techniques with Python and Google OR-Tools. We've seen how to:

Formulate a problem mathematically
Implement the solution using OR-Tools
Interpret the results

This approach can be applied to a wide range of optimization problems beyond just scheduling professors. Whether you're optimizing resource allocation, network design, or other combinatorial problems, the principles remain the same.

Remember, while our model provides a mathematically optimal solution, real-world considerations might lead to different choices. For example, the university might have preferences for certain professors or other constraints not considered in the model.

Happy optimizing, and good luck with your scheduling!