Solving the Set Cover Problem with Python and OR-Tools

By Matt Chapman

July 16, 20246 min read

Introduction

The Set Cover Problem is a fundamental optimization problem in computer science and operations research, with many practical applications, such as network design, resource allocation, and scheduling.

In this tutorial, we'll explore how to use Python and Google's OR-Tools library to solve it.

Intended audience

This blog post is intended for beginner-to-intermediate Python developers.

You don't need any prior knowledge of optimization or OR-Tools, but it will help if you understand basic Python concepts like loops, functions and lists.

Problem description

Imagine that a university wants to schedule a set of courses and needs to ensure that all topics are covered by the available professors. Each professor can cover multiple topics, and our goal is to select the minimum number of professors required to cover all topics.

Here’s the basic timetable:

SubjectTime SlotTeacher
Maths9:00 - 10:00 AM?
Physics10:00 - 11:00 AM?
Chemistry11:00 - 12:00 PM?
Biology1:00 - 2:00 PM?
4 rows x 3 columns

And here are the professors and the topics they’re able to cover:

ProfessorTopics Covered
Prof. A(Math, Physics)
Prof. B(Physics, Chemistry)
Prof. C(Math, Biology)
Prof. D(Biology, Chemistry)
Prof. E(Math, Chemistry, Biology)
5 rows x 2 columns

The simplest way to ensure that all courses are covered would be to assign a different professor to each class, e.g.:

SubjectTime SlotTeacher
Maths9:00 - 10:00 AMProf. A
Physics10:00 - 11:00 AMProf. B
Chemistry11:00 - 12:00 PMProf. D
Biology1:00 - 2:00 PMProf. E
4 rows x 3 columns

But, while it might be the “fairest” solution, it would also be incredibly inefficient; we’d need to pay 4 professors for a teaching day, even though each professor only teaches for a single hour.

Instead, we can use mathematical programming to to find the smallest set of professors that cover all the topics (Math, Physics, Chemistry, and Biology).

Isn't this an unrealistic problem?

When creating timetables in the real world, we’ll likely have much more complex requirements relating to room availability, concurrent classes, and multi-day or multi-week timetables. The reason we’ve articulated the problem in this simplistic way is to illustrate the basic principles of the Set Cover Problem.

If you’re interested in using linear programming to solve more complex timetabling problems, check out some of the other articles on this site:

  • Article 1
  • Article 2

Mathematics

To solve this problem using mathematical optimization, we need to translate our constraints and goals into mathematical language. Let's break this down step by step.

Sets

First, let’s define the sets: the “populations” or “groups” of data which frame the problem.

In this problem, we have two main sets: the set of professors, and the set of topics. We’ll use mathematical notation to denote them:

  • Let PP be the set of all professors
  • Let TT be the set of all topics

Parameters

Next we’ll define a parameter which refers to each professor’s topics:

  • Let SpS_p be the subset of topics covered by professor pp

Decision variables

Decision variables are variables which have an as-yet-unknown value; later, when we solve the problem, we’ll work out which value they should take on.

In this problem, we’ll need one Boolean decision variable for each professor. Each variable will take the value 1 if the corresponding professor is chosen, and 0 if not:

  • Let xpx_p be a binary variable representing whether professor pp is included in the solution (1 if included, 0 otherwise)

Constraints

Next we need to define the constraints:

  • Coverage constraint (every topic must be covered by at least one professor):

    pPxp1tSp1,     tT\sum_{p \in P} x_p \cdot \bold{1}_{t \in S_p} \geq 1, \space \space \space \space \space \forall t \in T
What’s going on here? 🤔

The goal of the Set Cover Problem is to ensure that every topic is covered by at least one selected professor. Mathematically, this can be expressed with the above constraint. There are two parts to this:

  1. Summation: pPxp1tSp\sum_{p \in P} x_p \cdot \bold{1}_{t \in S_p} : For each professor pp, if they are selected (xp=1x_p = 1), and if they cover the topic tt (1tSp=1\bold{1}_{t \in S_p} = 1), then that professor contributes 1 to the sum. If either the professor is not selected or does not cover the topic, they contribute 0 to the sum.

    • Note: the 1\bold{1} in 1tSp\bold{1}_{t \in S_p} is not a “1”; it’s the indicator function. This might be a bit confusing if you’re new to maths. But the “1” in 1tSp\bold{1}_{t \in S_p} is not a constant value; rather, it signifies that the indicator function outputs 1 (true) or 0 (false) based on the condition tSpt \in S_p.
  2. Constraint: Finally, the constraint ≥1 ensures that the sum is at least 1, meaning that at least one selected professor covers the topic tt.

Objective function

Having defined our variables and constraints, we can now proceed to articulating our goals.

Remember: our goal is to minimize the number of selected professors. We can express this goal via the following objective function:

minpPxp\min \sum_{p \in P} x_p

Solving with Python

Now that we've formulated the problem mathematically, let's implement the solution using Python and OR-Tools.

First, we'll import the necessary libraries:

from ortools.linear_solver import pywraplp import pandas as pd

Next, let's define our data:

data = { 'professor': ['Prof. A', 'Prof. B', 'Prof. C', 'Prof. D', 'Prof. E'], 'topics': [ {'Math', 'Physics'}, {'Physics', 'Chemistry'}, {'Math', 'Biology'}, {'Biology', 'Chemistry'}, {'Math', 'Chemistry', 'Biology'} ] } df = pd.DataFrame(data) topics = {'Math', 'Physics', 'Chemistry', 'Biology'}

Now, let's create our solver and define the variables:

# Create the solver solver = pywraplp.Solver.CreateSolver('SCIP') # Create variables x = {} for i in df.index: x[i] = solver.BoolVar(f'x_{i}') # Print variables to verify print('Number of variables =', solver.NumVariables())

Next, we'll add our constraints:

# Coverage constraints for topic in topics: solver.Add(solver.Sum(x[i] for i in df.index if topic in df.loc[i, 'topics']) >= 1) # Print constraints to verify print('Number of constraints =', solver.NumConstraints())

Now, let's define our objective function:

# Objective function objective = solver.Objective() for i in df.index: objective.SetCoefficient(x[i], 1) objective.SetMinimization()

Finally, we can solve the problem and print the results:

# Solve the problem status = solver.Solve() # Print solution if status == pywraplp.Solver.OPTIMAL: print('Solution:') print('Objective value =', solver.Objective().Value()) for i in df.index: if x[i].solution_value() > 0: print(f"{df.loc[i, 'professor']}: {x[i].solution_value()}") else: print('The problem does not have an optimal solution.')

When we run this code, we get the following output:

Number of variables = 5 Number of constraints = 4 Solution: Objective value = 2.0 Prof. A: 1.0 Prof. D: 1.0

Interpreting the results

Our optimization model suggests that the university should select Prof. A and Prof. D. This combination will ensure that all topics (Math, Physics, Chemistry, and Biology) are covered with the minimum number of professors (2).

Conclusion

In this article, we've explored how to solve a real-world optimization problem using mathematical programming techniques with Python and Google OR-Tools. We've seen how to:

  • Formulate a problem mathematically
  • Implement the solution using OR-Tools
  • Interpret the results

This approach can be applied to a wide range of optimization problems beyond just scheduling professors. Whether you're optimizing resource allocation, network design, or other combinatorial problems, the principles remain the same.

Remember, while our model provides a mathematically optimal solution, real-world considerations might lead to different choices. For example, the university might have preferences for certain professors or other constraints not considered in the model.

Happy optimizing, and good luck with your scheduling!