Introduction to Mathematical Optimization With Python and OR-Tools

By Matt Chapman

July 12, 20246 min read

Introduction

Mathematical optimisation, also known as mathematical programming, is a way to solve problems which involve choosing the best option from a set of possible choices.

In this tutorial, we’ll learn how to formulate and solve optimisation problems with Python and OR-Tools, a fantastic open source library developed by Google.

We’ll focus on a specific type of mathematical optimisation - linear programming - and illustrate the concepts by designing an algorithm to select an optimal 15-man team for the Fantasy Premier League.

What's Fantasy Premier League?

Fantasy Premier League (FPL) is the official fantasy football game of the English Premier League. (For any Americans or Australians out there, “football” means “soccer” in this context!) With over 10 million players, it is the largest fantasy football game of any domestic football league.

Given that there are hundreds of players in the Premier League, there are gazillions of different 15-man combinations you could choose. Even if you have extensive domain expertise about football, this can feel like an impossible task; like trying to find the needle in a haystack. But with the power of mathematical optimisation and Python, this problem becomes easily solvable with just a few lines of code.

Problem description

Before we write any code or mathematical equations, let’s describe the problem in plain English, enumerating all the constraints and assumptions. The more specific we are at this stage, the easier it will be to translate the problem into mathematical language later on.

Task: Select a fantasy football squad of exactly 15 players.

Overall aim: Maximise expected points. What do we mean by this? Well, once you’ve picked an FPL team, you’ll collect points from your players depending on how they perform. The better a player performs, the more points they generate. The goal of FPL is to pick a combination of players which will earn as many points as possible.

Constraints: The FPL rulebook stipulates a few important constraints:

  1. We must have a certain number of players in each position:
    • 2 Goalkeepers
    • 5 Defenders
    • 5 Midfielders
    • 3 Forwards
  2. The total value of your initial squad must not exceed £100 million
  3. You can select no more than 3 players from a single Premier League team. For example, you couldn't select all 4 of Haaland, Walker, Stones and Grealish (since they all play for Manchester City). You could only select 3 of them.

Formulating the problem mathematically

Next, we need to translate these constraints and goals into mathematical language which will form the basis for our Python code later on.

Did you say... math? 😰

If you've a strong aversion to math, the prospect of “formulating a problem in mathematical terms” might seem daunting. But math is much less scary than it first appears. As its core, math is just a way of describing things in concise, precise terms. If you understood the “plain English” constraints and goals specified above, then you've already understood all the ideas we'll be expressing in these mathematical formulas.

If you're new to math, my advice would be: try to read through this section, but don't worry if you don't understand every symbol/notation. The main reason we start with the math is to make it easier to write the code later on. But, if the math just confuses you even more, you can just jump straight into the code.

To formulate the problem mathematically, we need to define five things:

  • Sets
  • Parameters
  • Decision variables
  • Constraints
  • Objective function(s)

Let’s walk through these one-by-one.

Sets

A set is a collection of non-repeated items. In our FPL problem, there are three sets: the set of players, the set of teams, and the set of positions. We’ll use mathematical symbols to denote them:

  1. Let PP be the set of all available players
  2. Let TT be the set of all teams
  3. Let Y=(GK,DEF,MID,FWD)Y = (\text{GK}, \text{DEF}, \text{MID}, \text{FWD}) be the set of positions (goalkeeper, defender, midfielder, forward)
Why don't we specify the size of each set? 🤔

At this stage, we don’t need to specify the size of each set. For example, even though we know that there are exactly 20 teams in the Premier League, we don’t need to specify that T=20T = 20. We can just define TT in a slightly more abstract way as “the set of all teams.”

Generalisations like this are useful because we won’t always know the size of a set in advance. For example, at the start of the season, we might not know the exact number of players in the Premier League. Luckily, we don’t have to specify the size of PP. Instead, we can just define PP more abstractly as “the set of all available players."

Parameters

Next, we need to define the parameters:

  1. Let cpc_p be the cost of player pp
  2. Let teampT\text{team}_p \in T be the team of player pp
  3. Let pospY\text{pos}_p \in Y be the position of player pp
  4. Let mym_y be the number of players needed in position yy
  5. Let pointsp\text{points}_p be the number of points player pp is expected to get

Later, we’ll use these parameters to define the constraints and goals of our optimisation problem.

Decision variables

Decision variables are the controllable inputs of our model.

For our task, we need to define one binary decision variable for each player. If a given player is selected for our 15-man team, their corresponding decision variable will equal 1. If they are not selected, their decision variable will equal 0.

Mathematically, we can express this as follows:

  • Let xpx_p be a binary variable for each player pp in PP
    • xp=1x_p = 1 if player pp is selected, 0 otherwise

For example, if we had 700 players to choose from, we would have 700 decision variables:

x1,x2,x3,,x688,x689,x700x_1, x_2, x_3, \cdots , x_{688}, x_{689}, x_{700}

For the 15 players we end up choosing, their decision variables will equal 1. For the 685 players we do not choose, their decision variables will equal 0. For example, if we selected Erling Haaland (player 335), the variable x335x_{335} would equal 1. If we didn’t select Haaland, the variable x335x_{335} would equal 0.

Constraints

Now that we’ve defined our sets, parameters and decision variables, we can specify the constraints. In total, we’ll have four constraints:

  1. We must select exactly 15 players
pPxp=15\sum_{p \in P} x_p = 15
  1. We need a specific number of players in each position: 2 goalkeepers, 5 defenders, 5 midfields, and 3 forwards
yY:pP:posp=yxp=my\forall y \in Y: \sum_{p \in P: \text{pos}_p = y} x_p = m_y

Note: technically, satisfying this constraint implies that we will also satisfy the first one, so we wouldn’t need to specify the first constraint as well. But for simplicity/completeness I’ll include them both.

  1. There can be no more than 3 players from the same team
tT:pP:teamp=txp<=3\forall t \in T: \sum_{p \in P: \text{team}_p = t} x_p <= 3
  1. We cannot spend more than our total budget (£100m)
pPcp<=1000\sum_{p \in P} c_p <= 1000

Objective function

Now that we’ve defined our constraints, we can define our objective function. In mathematical optimisation problems like ours, the objective function is the thing we want to maximise (or minimise).

In our case, the objective function is the sum of expected points. We can express our desire to maximise this sum via the following formula:

Maximize   pPpointspxp\text{Maximize} \space \space \space \sum_{p \in P} \text{points}_p \cdot x_p

Code

Now, let's translate the mathematical formulas into code.

Load data

First, we need to fetch some data. Let's use FPL data from Kaggle.

The dataset contains 75 attributes for each of the 778 players in the Premier League, including their team, position, cost (now_cost) and points_per_game.

To follow along with this tutorial, you’ll also need 3 libraries installed: numpy, pandas, and Google’s popular linear programming library OR-Tools . If you haven’t got these installed, you can get them using pip:

pip install numpy, pandas, ortools

Next, we’ll import these libraries and load the data we just downloaded:

import numpy as np import pandas as pd from ortools.sat.python import cp_model # Fetch data df = pd.read_csv('players.csv') # Replace with the path to your downloaded file # Preview display(df.head())
idnamenow_costpositionteamclean sheets per 90threat rank typeexpected assists per 90...
3Granit Xhaka48MIDArsenal0.39380.04...
4Mohamed Elneny41MIDArsenal02330...
5Rob Holding42DEFArsenal01180.01...
6Thomas Partey47MIDArsenal0.4900.05...
7Martin Ødegaard69MIDArsenal0.3780.13...
5 rows x 75 columns

Note that the column indicating each player’s cost (now_cost) is in units of £100k. E.g., the first player in that preview (Granit Xhaka) will cost 48 * £100k = £4.8m to buy.

Note: the data get updated every week, so you might get slightly different numbers than I do, depending on when you’re reading this.

Define the model

We'll initialise a CP (constraint programming) model:

model = cp_model.CpModel()

Add decision variables

We'll create one Boolean decision variable for each player in our data set:

# Create an empty list to hold the decision variables decision_variables = {} # Create decision variables for each of the players for player_id in df['id']: player_variables = {player_id: model.NewBoolVar(f"player{player_id}")} decision_variables.update(player_variables) # Add to dict

After running that code, we will have added all our Boolean player variables to the model. We can take a quick look at these decision variables to check they’ve been added:

print(decision_variables.values()) print(f"Number of decision variables: {len(decision_variables)}")
result
[player0(0..1), player1(0..1), ... player777(0..1)] Number of decision variables: 778

Add constraints

First, let's add the constraint that we need 2 goalkeepers, 5 defenders, 5 midfielders and 3 forwards (15 players in total):

# Define the number of players we want to select in each position POSITION_MAP = { 'Goalkeeper': {'code': 'GKP', 'count': 2}, 'Defender': {'code': 'DEF', 'count': 5}, 'Midfielder': {'code': 'MID', 'count': 5}, 'Forward': {'code': 'FWD', 'count': 3} } # Add the constraint: per position, only allowed the specified count of players for position, details in POSITION_MAP.items(): players_in_position = list(df[df['position'] == details['code']].id.values) player_variables = {i: decision_variables[i] for i in players_in_position} # Fetch the number of players we're allowed to pick in this position player_count = details['count'] # Add the constraint for each position model.Add(sum(player_variables.values()) == player_count)

Next, let's add the constraint that we can't spend more than 1000 (£100m):

This one’s reasonably straightforward: first, we set our budget (in units of £100k), and then we define a new decision variable which will equal the total amount we’ve spent. How do we calculate this variable? We calculate the sum of (points_per_game)*(1 or 0) for each player (depending on whether the player is selected). As a linear equation, this looks like:

totalcost = player1_selected * player1_cost + + player778_selected * player778_cost

Each variable playerxselectedplayerx_{selected} will be assigned a value 1 if that player is selected, and 0 if not. Here it is in code:

BUDGET = 1000 player_costs = {player: df[df['id']==player]['now_cost'].values[0] for player in df['id']} model.Add(sum(var * player_costs[i] for i, var in decision_variables.items()) <= BUDGET)

Finally, we'll add the constraint that we can have no more than 3 players from the same team.

We can do this by defining one linear equation for each team using a for loop:

MAX_PLAYERS_PER_TEAM = 3 teams = df['team'].unique() for team in teams: eligible_players = df[df['team']==team].id.values model.Add(sum(decision_variables[i] for i in eligible_players) <= MAX_PLAYERS_PER_TEAM)

Add objective function

If you’ve done all that, well done! We’ve successfully added the constraints (i.e. we’ve told the model the minimum requirements which must be met when selecting players).

If the model follows all these constraints, it will narrow down the search space and identify only those options that are feasible.

But you'll still be left with millions of options left. Which of these should we pick?

That’s where the objective function comes in. The objective function specifies the thing we’re trying to optimise for, which will guide the algorithm so that it selects the optimal solution out of all the possible options.

Since we don't know each player's expected points per game in the upcoming season, we'll use a simple proxy: their average points scored per game last season.

It’s not a perfect proxy, but it’s good enough for our purposes; it’s reasonable to expect that players who’ve scored a high number of points previously will continue getting high scores.

In our DataFrame, this data point is shown in the points_per_game column, which shows the average number of points each player has achieved across all the games they’ve played so far this season.

With a little sorting, we can see the players who have the highest points_per_game, including Erling Haaland and Mo Salah:

df.sort_values(by=['points_per_game'], ascending=False)[['name', 'team', 'now_cost', 'points_per_game']].head(10)
IDNameTeamNow CostPoints per Game
279Asmir BegovićEverton389
489Erling HaalandMan City1247.8
494Stefan Ortega MorenoMan City377.7
673Harry KaneSpurs1156.9
445Mohamed SalahLiverpool1316.3
516Marcus RashfordMan Utd725.9
4Martin ØdegaardArsenal695.9
474Kevin De BruyneMan City1215.7
15Gabriel Martinelli SilvaArsenal655.5
122Ivan ToneyBrentford715.5
10 rows x 5 columns

Using this field, let's define an objective function which maximises this points_per_game field:

# For each player, collect their average `points_per_game` player_points_per_game = {player: df[df['id']==player]['points_per_game'].values[0] for player in decision_variables.keys()} # Add another decision variable: the total `points_per_game` of the selected players # This var is the sum of (points_per_game)*(1 or 0) for each player (depending on whether the player is selected) total_points = sum(var * player_points_per_game[i] for i, var in decision_variables.items()) # Instruct the model to maximise this model.Maximize(total_points)

Solve

Finally, we instruct the model to find the optimal solution, given the constraints and objective function:

solver = cp_model.CpSolver() status = solver.Solve(model)

And we show the solution:

def show_solution(status): """ Print out the solution (if any). """ if status == cp_model.OPTIMAL: print("Optimal solution found. Players selected:\\n") total_cost = 0 total_points_per_game = 0 players = { 'GKP': [], 'DEF': [], 'MID': [], 'FWD': [] } for i, var in decision_variables.items(): if solver.Value(var) == 1: player_position = df[df['id']==i].position.values[0] players[player_position].append(i) for position, ids in players.items(): print(f"\\nPlayers in {position}:") for i in ids: player_name = df[df['id']==i]['name'].values[0] player_cost = df[df['id']==i]['now_cost'].values[0] / 10 player_team = df[df['id']==i]['team'].values[0] player_points_per_game = df[df['id']==i]['points_per_game'].values[0] print(f"{player_name}: {player_team}, £{player_cost}, {player_points_per_game}") total_cost += player_cost total_points_per_game += player_points_per_game print("\\nTotal cost: ", total_cost) print("\\nTotal points_per_game: ", total_points_per_game) else: print("No solution found.") show_solution(status)
Optimal solution found. Players selected: Goalkeeper: Asmir Begović: Everton, £3.8, 9.0 Stefan Ortega Moreno: Man City, £3.7, 7.7 Defender: Yerry Mina: Everton, £4.3, 4.7 João Cancelo: Man City, £7.1, 4.3 Luke Shaw: Man Utd, £5.3, 4.0 Diogo Dalot Teixeira: Man Utd, £4.8, 3.9 Kieran Trippier: Newcastle, £6.0, 5.2 Midfielder: Martin Ødegaard: Arsenal, £6.9, 5.7 Bukayo Saka: Arsenal, £8.0, 5.3 Gabriel Martinelli Silva: Arsenal, £6.5, 5.5 Marcus Rashford: Man Utd, £7.2, 5.9 Miguel Almirón Rejala: Newcastle, £5.4, 4.6 Forward: Ivan Toney: Brentford, £7.1, 5.5 Erling Haaland: Man City, £12.4, 7.8 Harry Kane: Spurs, £11.5, 6.9 Total cost: 100.0 Total points per game: 86.0

Hooray! We’ve successfully picked a team within our budget, with a total predicted points per game of 86.0. Of course, this won't be the truly optimal team, since we used last season's average points_per_game as the proxy for expected points this season, and that's unlikely to match up completely with points achieved next season. But nonetheless, this provides a great starting point.

Conclusion

This article has demonstrated how to apply the concepts of mathematical optimisation and linear programming using Python and OR-Tools.

Thanks for reading!