Introduction to Mathematical Optimization With Python and OR-Tools

Introduction

Mathematical optimisation, also known as mathematical programming, is a way to solve problems which involve choosing the best option from a set of possible choices.

In this tutorial, we’ll learn how to formulate and solve optimisation problems with Python and OR-Tools, a fantastic open source library developed by Google.

We’ll focus on a specific type of mathematical optimisation - linear programming - and illustrate the concepts by designing an algorithm to select an optimal 15-man team for the Fantasy Premier League.

What's Fantasy Premier League?
Fantasy Premier League (FPL) is the official fantasy football game of the English Premier League. (For any Americans or Australians out there, “football” means “soccer” in this context!) With over 10 million players, it is the largest fantasy football game of any domestic football league.
Given that there are hundreds of players in the Premier League, there are gazillions of different 15-man combinations you could choose. Even if you have extensive domain expertise about football, this can feel like an impossible task; like trying to find the needle in a haystack. But with the power of mathematical optimisation and Python, this problem becomes easily solvable with just a few lines of code.

Problem description

Before we write any code or mathematical equations, let’s describe the problem in plain English, enumerating all the constraints and assumptions. The more specific we are at this stage, the easier it will be to translate the problem into mathematical language later on.

Task: Select a fantasy football squad of exactly 15 players.

Overall aim: Maximise expected points. What do we mean by this? Well, once you’ve picked an FPL team, you’ll collect points from your players depending on how they perform. The better a player performs, the more points they generate. The goal of FPL is to pick a combination of players which will earn as many points as possible.

Constraints: The FPL rulebook stipulates a few important constraints:

We must have a certain number of players in each position:
- 2 Goalkeepers
- 5 Defenders
- 5 Midfielders
- 3 Forwards
The total value of your initial squad must not exceed £100 million
You can select no more than 3 players from a single Premier League team. For example, you couldn't select all 4 of Haaland, Walker, Stones and Grealish (since they all play for Manchester City). You could only select 3 of them.

Formulating the problem mathematically

Next, we need to translate these constraints and goals into mathematical language which will form the basis for our Python code later on.

Did you say... math? 😰

If you've a strong aversion to math, the prospect of “formulating a problem in mathematical terms” might seem daunting. But math is much less scary than it first appears. As its core, math is just a way of describing things in concise, precise terms. If you understood the “plain English” constraints and goals specified above, then you've already understood all the ideas we'll be expressing in these mathematical formulas.

If you're new to math, my advice would be: try to read through this section, but don't worry if you don't understand every symbol/notation. The main reason we start with the math is to make it easier to write the code later on. But, if the math just confuses you even more, you can just jump straight into the code.

To formulate the problem mathematically, we need to define five things:

Sets
Parameters
Decision variables
Constraints
Objective function(s)

Let’s walk through these one-by-one.

Sets

A set is a collection of non-repeated items. In our FPL problem, there are three sets: the set of players, the set of teams, and the set of positions. We’ll use mathematical symbols to denote them:

Let $P$ be the set of all available players
Let $T$ be the set of all teams
Let $Y = (\text{GK}, \text{DEF}, \text{MID}, \text{FWD})$ be the set of positions (goalkeeper, defender, midfielder, forward)

Why don't we specify the size of each set? 🤔

At this stage, we don’t need to specify the size of each set. For example, even though we know that there are exactly 20 teams in the Premier League, we don’t need to specify that $T = 20$ . We can just define $T$ in a slightly more abstract way as “the set of all teams.”

Generalisations like this are useful because we won’t always know the size of a set in advance. For example, at the start of the season, we might not know the exact number of players in the Premier League. Luckily, we don’t have to specify the size of $P$ . Instead, we can just define $P$ more abstractly as “the set of all available players."

Parameters

Next, we need to define the parameters:

Let $c_p$ be the cost of player $p$
Let $\text{team}_p \in T$ be the team of player $p$
Let $\text{pos}_p \in Y$ be the position of player $p$
Let $m_y$ be the number of players needed in position $y$
Let $\text{points}_p$ be the number of points player $p$ is expected to get

Later, we’ll use these parameters to define the constraints and goals of our optimisation problem.

Decision variables

Decision variables are the controllable inputs of our model.

For our task, we need to define one binary decision variable for each player. If a given player is selected for our 15-man team, their corresponding decision variable will equal 1. If they are not selected, their decision variable will equal 0.

Mathematically, we can express this as follows:

Let $x_p$ $x_{p}$ be a binary variable for each player $p$ $p$ in $P$ $P$
- $x_p = 1$ if player $p$ is selected, 0 otherwise

For example, if we had 700 players to choose from, we would have 700 decision variables:

x_1, x_2, x_3, \cdots , x_{688}, x_{689}, x_{700}

For the 15 players we end up choosing, their decision variables will equal 1. For the 685 players we do not choose, their decision variables will equal 0. For example, if we selected Erling Haaland (player 335), the variable $x_{335}$ would equal 1. If we didn’t select Haaland, the variable $x_{335}$ would equal 0.

Constraints

Now that we’ve defined our sets, parameters and decision variables, we can specify the constraints. In total, we’ll have four constraints:

We must select exactly 15 players

\sum_{p \in P} x_p = 15

We need a specific number of players in each position: 2 goalkeepers, 5 defenders, 5 midfields, and 3 forwards

\forall y \in Y: \sum_{p \in P: \text{pos}_p = y} x_p = m_y

Note: technically, satisfying this constraint implies that we will also satisfy the first one, so we wouldn’t need to specify the first constraint as well. But for simplicity/completeness I’ll include them both.

There can be no more than 3 players from the same team

\forall t \in T: \sum_{p \in P: \text{team}_p = t} x_p <= 3

We cannot spend more than our total budget (£100m)

\sum_{p \in P} c_p <= 1000

Objective function

Now that we’ve defined our constraints, we can define our objective function. In mathematical optimisation problems like ours, the objective function is the thing we want to maximise (or minimise).

In our case, the objective function is the sum of expected points. We can express our desire to maximise this sum via the following formula:

\text{Maximize} \space \space \space \sum_{p \in P} \text{points}_p \cdot x_p

Code

Now, let's translate the mathematical formulas into code.

Load data

First, we need to fetch some data. Let's use FPL data from Kaggle.

The dataset contains 75 attributes for each of the 778 players in the Premier League, including their team, position, cost (now_cost) and points_per_game.

To follow along with this tutorial, you’ll also need 3 libraries installed: numpy, pandas, and Google’s popular linear programming library OR-Tools . If you haven’t got these installed, you can get them using pip:

pip install numpy, pandas, ortools

Next, we’ll import these libraries and load the data we just downloaded:

import numpy as np
import pandas as pd
from ortools.sat.python import cp_model

# Fetch data
df = pd.read_csv('players.csv') # Replace with the path to your downloaded file

# Preview
display(df.head())

id	name	now_cost	position	team	clean sheets per 90	threat rank type	expected assists per 90	...
3	Granit Xhaka	48	MID	Arsenal	0.39	38	0.04	...
4	Mohamed Elneny	41	MID	Arsenal	0	233	0	...
5	Rob Holding	42	DEF	Arsenal	0	118	0.01	...
6	Thomas Partey	47	MID	Arsenal	0.4	90	0.05	...
7	Martin Ødegaard	69	MID	Arsenal	0.37	8	0.13	...

5 rows x 75 columns

Note that the column indicating each player’s cost (now_cost) is in units of £100k. E.g., the first player in that preview (Granit Xhaka) will cost 48 * £100k = £4.8m to buy.

Note: the data get updated every week, so you might get slightly different numbers than I do, depending on when you’re reading this.

Define the model

We'll initialise a CP (constraint programming) model:

model = cp_model.CpModel()

Add decision variables

We'll create one Boolean decision variable for each player in our data set:

# Create an empty list to hold the decision variables
decision_variables = {}

# Create decision variables for each of the players
for player_id in df['id']:

    player_variables = {player_id: model.NewBoolVar(f"player{player_id}")}

    decision_variables.update(player_variables) # Add to dict

After running that code, we will have added all our Boolean player variables to the model. We can take a quick look at these decision variables to check they’ve been added:

print(decision_variables.values())
print(f"Number of decision variables: {len(decision_variables)}")

result
[player0(0..1), player1(0..1), ... player777(0..1)]
Number of decision variables: 778

Add constraints

First, let's add the constraint that we need 2 goalkeepers, 5 defenders, 5 midfielders and 3 forwards (15 players in total):

# Define the number of players we want to select in each position
POSITION_MAP = {
    'Goalkeeper': {'code': 'GKP', 'count': 2},
    'Defender': {'code': 'DEF', 'count': 5},
    'Midfielder': {'code': 'MID', 'count': 5},
    'Forward': {'code': 'FWD', 'count': 3}
}

# Add the constraint: per position, only allowed the specified count of players
for position, details in POSITION_MAP.items():
    players_in_position = list(df[df['position'] == details['code']].id.values)
    player_variables = {i: decision_variables[i] for i in players_in_position}

    # Fetch the number of players we're allowed to pick in this position
    player_count = details['count']

    # Add the constraint for each position
    model.Add(sum(player_variables.values()) == player_count)

Next, let's add the constraint that we can't spend more than 1000 (£100m):

This one’s reasonably straightforward: first, we set our budget (in units of £100k), and then we define a new decision variable which will equal the total amount we’ve spent. How do we calculate this variable? We calculate the sum of (points_per_game)*(1 or 0) for each player (depending on whether the player is selected). As a linear equation, this looks like:

totalcost = player1_selected * player1_cost + … + player778_selected * player778_cost

Each variable $playerx_{selected}$ will be assigned a value 1 if that player is selected, and 0 if not. Here it is in code:

BUDGET = 1000
player_costs = {player: df[df['id']==player]['now_cost'].values[0] for player in df['id']}
model.Add(sum(var * player_costs[i] for i, var in decision_variables.items()) <= BUDGET)

Finally, we'll add the constraint that we can have no more than 3 players from the same team.

We can do this by defining one linear equation for each team using a for loop:

MAX_PLAYERS_PER_TEAM = 3
teams = df['team'].unique()
for team in teams:
    eligible_players = df[df['team']==team].id.values
    model.Add(sum(decision_variables[i] for i in eligible_players) <= MAX_PLAYERS_PER_TEAM)

Add objective function

If you’ve done all that, well done! We’ve successfully added the constraints (i.e. we’ve told the model the minimum requirements which must be met when selecting players).

If the model follows all these constraints, it will narrow down the search space and identify only those options that are feasible.

But you'll still be left with millions of options left. Which of these should we pick?

That’s where the objective function comes in. The objective function specifies the thing we’re trying to optimise for, which will guide the algorithm so that it selects the optimal solution out of all the possible options.

Since we don't know each player's expected points per game in the upcoming season, we'll use a simple proxy: their average points scored per game last season.

It’s not a perfect proxy, but it’s good enough for our purposes; it’s reasonable to expect that players who’ve scored a high number of points previously will continue getting high scores.

In our DataFrame, this data point is shown in the points_per_game column, which shows the average number of points each player has achieved across all the games they’ve played so far this season.

With a little sorting, we can see the players who have the highest points_per_game, including Erling Haaland and Mo Salah:

df.sort_values(by=['points_per_game'], ascending=False)[['name', 'team', 'now_cost', 'points_per_game']].head(10)

ID	Name	Team	Now Cost	Points per Game
279	Asmir Begović	Everton	38	9
489	Erling Haaland	Man City	124	7.8
494	Stefan Ortega Moreno	Man City	37	7.7
673	Harry Kane	Spurs	115	6.9
445	Mohamed Salah	Liverpool	131	6.3
516	Marcus Rashford	Man Utd	72	5.9
4	Martin Ødegaard	Arsenal	69	5.9
474	Kevin De Bruyne	Man City	121	5.7
15	Gabriel Martinelli Silva	Arsenal	65	5.5
122	Ivan Toney	Brentford	71	5.5

10 rows x 5 columns

Using this field, let's define an objective function which maximises this points_per_game field:

# For each player, collect their average `points_per_game`
player_points_per_game = {player: df[df['id']==player]['points_per_game'].values[0] for player in decision_variables.keys()}

# Add another decision variable: the total `points_per_game` of the selected players
# This var is the sum of (points_per_game)*(1 or 0) for each player (depending on whether the player is selected)
total_points = sum(var * player_points_per_game[i] for i, var in decision_variables.items())

# Instruct the model to maximise this
model.Maximize(total_points)

Solve

Finally, we instruct the model to find the optimal solution, given the constraints and objective function:

solver = cp_model.CpSolver()
status = solver.Solve(model)

And we show the solution:

def show_solution(status):
    """
    Print out the solution (if any).
    """
    if status == cp_model.OPTIMAL:

        print("Optimal solution found. Players selected:\\n")
        total_cost = 0
        total_points_per_game = 0

        players = {
            'GKP': [],
            'DEF': [],
            'MID': [],
            'FWD': []
        }
        for i, var in decision_variables.items():
            if solver.Value(var) == 1:
                player_position = df[df['id']==i].position.values[0]
                players[player_position].append(i)

        for position, ids in players.items():
            print(f"\\nPlayers in {position}:")
            for i in ids:
                player_name = df[df['id']==i]['name'].values[0]
                player_cost = df[df['id']==i]['now_cost'].values[0] / 10
                player_team = df[df['id']==i]['team'].values[0]
                player_points_per_game = df[df['id']==i]['points_per_game'].values[0]
                print(f"{player_name}: {player_team}, £{player_cost}, {player_points_per_game}")
                total_cost += player_cost
                total_points_per_game += player_points_per_game

        print("\\nTotal cost: ", total_cost)
        print("\\nTotal points_per_game: ", total_points_per_game)

    else:
        print("No solution found.")

show_solution(status)

Optimal solution found. Players selected:

Goalkeeper:
Asmir Begović: Everton, £3.8, 9.0
Stefan Ortega Moreno: Man City, £3.7, 7.7

Defender:
Yerry Mina: Everton, £4.3, 4.7
João Cancelo: Man City, £7.1, 4.3
Luke Shaw: Man Utd, £5.3, 4.0
Diogo Dalot Teixeira: Man Utd, £4.8, 3.9
Kieran Trippier: Newcastle, £6.0, 5.2

Midfielder:
Martin Ødegaard: Arsenal, £6.9, 5.7
Bukayo Saka: Arsenal, £8.0, 5.3
Gabriel Martinelli Silva: Arsenal, £6.5, 5.5
Marcus Rashford: Man Utd, £7.2, 5.9
Miguel Almirón Rejala: Newcastle, £5.4, 4.6

Forward:
Ivan Toney: Brentford, £7.1, 5.5
Erling Haaland: Man City, £12.4, 7.8
Harry Kane: Spurs, £11.5, 6.9

Total cost: 100.0
Total points per game: 86.0

Hooray! We’ve successfully picked a team within our budget, with a total predicted points per game of 86.0. Of course, this won't be the truly optimal team, since we used last season's average points_per_game as the proxy for expected points this season, and that's unlikely to match up completely with points achieved next season. But nonetheless, this provides a great starting point.

Conclusion

This article has demonstrated how to apply the concepts of mathematical optimisation and linear programming using Python and OR-Tools.

Thanks for reading!