Introduction
Mathematical optimisation, also known as mathematical programming, is a way to solve problems which involve choosing the best option from a set of possible choices.
In this tutorial, we’ll learn how to formulate and solve optimisation problems with Python and OR-Tools, a fantastic open source library developed by Google.
We’ll focus on a specific type of mathematical optimisation - linear programming - and illustrate the concepts by designing an algorithm to select an optimal 15-man team for the Fantasy Premier League.
What's Fantasy Premier League?Fantasy Premier League (FPL) is the official fantasy football game of the English Premier League. (For any Americans or Australians out there, “football” means “soccer” in this context!) With over 10 million players, it is the largest fantasy football game of any domestic football league.
Given that there are hundreds of players in the Premier League, there are gazillions of different 15-man combinations you could choose. Even if you have extensive domain expertise about football, this can feel like an impossible task; like trying to find the needle in a haystack. But with the power of mathematical optimisation and Python, this problem becomes easily solvable with just a few lines of code.
Problem description
Before we write any code or mathematical equations, let’s describe the problem in plain English, enumerating all the constraints and assumptions. The more specific we are at this stage, the easier it will be to translate the problem into mathematical language later on.
Task: Select a fantasy football squad of exactly 15 players.
Overall aim: Maximise expected points. What do we mean by this? Well, once you’ve picked an FPL team, you’ll collect points from your players depending on how they perform. The better a player performs, the more points they generate. The goal of FPL is to pick a combination of players which will earn as many points as possible.
Constraints: The FPL rulebook stipulates a few important constraints:
- We must have a certain number of players in each position:
- 2 Goalkeepers
- 5 Defenders
- 5 Midfielders
- 3 Forwards
- The total value of your initial squad must not exceed £100 million
- You can select no more than 3 players from a single Premier League team. For example, you couldn't select all 4 of Haaland, Walker, Stones and Grealish (since they all play for Manchester City). You could only select 3 of them.
Formulating the problem mathematically
Next, we need to translate these constraints and goals into mathematical language which will form the basis for our Python code later on.
Did you say... math? 😰
If you've a strong aversion to math, the prospect of “formulating a problem in mathematical terms” might seem daunting. But math is much less scary than it first appears. As its core, math is just a way of describing things in concise, precise terms. If you understood the “plain English” constraints and goals specified above, then you've already understood all the ideas we'll be expressing in these mathematical formulas.
If you're new to math, my advice would be: try to read through this section, but don't worry if you don't understand every symbol/notation. The main reason we start with the math is to make it easier to write the code later on. But, if the math just confuses you even more, you can just jump straight into the code.
To formulate the problem mathematically, we need to define five things:
- Sets
- Parameters
- Decision variables
- Constraints
- Objective function(s)
Let’s walk through these one-by-one.
Sets
A set is a collection of non-repeated items. In our FPL problem, there are three sets: the set of players, the set of teams, and the set of positions. We’ll use mathematical symbols to denote them:
- Let be the set of all available players
- Let be the set of all teams
- Let be the set of positions (goalkeeper, defender, midfielder, forward)
Why don't we specify the size of each set? 🤔
At this stage, we don’t need to specify the size of each set. For example, even though we know that there are exactly 20 teams in the Premier League, we don’t need to specify that . We can just define in a slightly more abstract way as “the set of all teams.”
Generalisations like this are useful because we won’t always know the size of a set in advance. For example, at the start of the season, we might not know the exact number of players in the Premier League. Luckily, we don’t have to specify the size of . Instead, we can just define more abstractly as “the set of all available players."
Parameters
Next, we need to define the parameters:
- Let be the cost of player
- Let be the team of player
- Let be the position of player
- Let be the number of players needed in position
- Let be the number of points player is expected to get
Later, we’ll use these parameters to define the constraints and goals of our optimisation problem.
Decision variables
Decision variables are the controllable inputs of our model.
For our task, we need to define one binary decision variable for each player. If a given player is selected for our 15-man team, their corresponding decision variable will equal 1. If they are not selected, their decision variable will equal 0.
Mathematically, we can express this as follows:
- Let be a binary variable for each player in
- if player is selected, 0 otherwise
For example, if we had 700 players to choose from, we would have 700 decision variables:
For the 15 players we end up choosing, their decision variables will equal 1. For the 685 players we do not choose, their decision variables will equal 0. For example, if we selected Erling Haaland (player 335), the variable would equal 1. If we didn’t select Haaland, the variable would equal 0.
Constraints
Now that we’ve defined our sets, parameters and decision variables, we can specify the constraints. In total, we’ll have four constraints:
- We must select exactly 15 players
- We need a specific number of players in each position: 2 goalkeepers, 5 defenders, 5 midfields, and 3 forwards
Note: technically, satisfying this constraint implies that we will also satisfy the first one, so we wouldn’t need to specify the first constraint as well. But for simplicity/completeness I’ll include them both.
- There can be no more than 3 players from the same team
- We cannot spend more than our total budget (£100m)
Objective function
Now that we’ve defined our constraints, we can define our objective function. In mathematical optimisation problems like ours, the objective function is the thing we want to maximise (or minimise).
In our case, the objective function is the sum of expected points. We can express our desire to maximise this sum via the following formula:
Code
Now, let's translate the mathematical formulas into code.
Load data
First, we need to fetch some data. Let's use FPL data from Kaggle.
The dataset contains 75 attributes for each of the 778 players in the Premier League, including their team, position, cost (now_cost) and points_per_game.
To follow along with this tutorial, you’ll also need 3 libraries installed: numpy, pandas, and Google’s popular linear programming library OR-Tools . If you haven’t got these installed, you can get them using pip:
pip install numpy, pandas, ortools
Next, we’ll import these libraries and load the data we just downloaded:
import numpy as np import pandas as pd from ortools.sat.python import cp_model # Fetch data df = pd.read_csv('players.csv') # Replace with the path to your downloaded file # Preview display(df.head())
id | name | now_cost | position | team | clean sheets per 90 | threat rank type | expected assists per 90 | ... |
---|---|---|---|---|---|---|---|---|
3 | Granit Xhaka | 48 | MID | Arsenal | 0.39 | 38 | 0.04 | ... |
4 | Mohamed Elneny | 41 | MID | Arsenal | 0 | 233 | 0 | ... |
5 | Rob Holding | 42 | DEF | Arsenal | 0 | 118 | 0.01 | ... |
6 | Thomas Partey | 47 | MID | Arsenal | 0.4 | 90 | 0.05 | ... |
7 | Martin Ødegaard | 69 | MID | Arsenal | 0.37 | 8 | 0.13 | ... |
Note that the column indicating each player’s cost (now_cost) is in units of £100k. E.g., the first player in that preview (Granit Xhaka) will cost 48 * £100k = £4.8m to buy.
Note: the data get updated every week, so you might get slightly different numbers than I do, depending on when you’re reading this.
Define the model
We'll initialise a CP (constraint programming) model:
model = cp_model.CpModel()
Add decision variables
We'll create one Boolean decision variable for each player in our data set:
# Create an empty list to hold the decision variables decision_variables = {} # Create decision variables for each of the players for player_id in df['id']: player_variables = {player_id: model.NewBoolVar(f"player{player_id}")} decision_variables.update(player_variables) # Add to dict
After running that code, we will have added all our Boolean player variables to the model. We can take a quick look at these decision variables to check they’ve been added:
print(decision_variables.values()) print(f"Number of decision variables: {len(decision_variables)}")
result[player0(0..1), player1(0..1), ... player777(0..1)] Number of decision variables: 778
Add constraints
First, let's add the constraint that we need 2 goalkeepers, 5 defenders, 5 midfielders and 3 forwards (15 players in total):
# Define the number of players we want to select in each position POSITION_MAP = { 'Goalkeeper': {'code': 'GKP', 'count': 2}, 'Defender': {'code': 'DEF', 'count': 5}, 'Midfielder': {'code': 'MID', 'count': 5}, 'Forward': {'code': 'FWD', 'count': 3} } # Add the constraint: per position, only allowed the specified count of players for position, details in POSITION_MAP.items(): players_in_position = list(df[df['position'] == details['code']].id.values) player_variables = {i: decision_variables[i] for i in players_in_position} # Fetch the number of players we're allowed to pick in this position player_count = details['count'] # Add the constraint for each position model.Add(sum(player_variables.values()) == player_count)
Next, let's add the constraint that we can't spend more than 1000 (£100m):
This one’s reasonably straightforward: first, we set our budget (in units of £100k), and then we define a new decision variable which will equal the total amount we’ve spent. How do we calculate this variable? We calculate the sum of (points_per_game)*(1 or 0) for each player (depending on whether the player is selected). As a linear equation, this looks like:
totalcost = player1_selected * player1_cost + … + player778_selected * player778_cost
Each variable will be assigned a value 1 if that player is selected, and 0 if not. Here it is in code:
BUDGET = 1000 player_costs = {player: df[df['id']==player]['now_cost'].values[0] for player in df['id']} model.Add(sum(var * player_costs[i] for i, var in decision_variables.items()) <= BUDGET)
Finally, we'll add the constraint that we can have no more than 3 players from the same team.
We can do this by defining one linear equation for each team using a for loop:
MAX_PLAYERS_PER_TEAM = 3 teams = df['team'].unique() for team in teams: eligible_players = df[df['team']==team].id.values model.Add(sum(decision_variables[i] for i in eligible_players) <= MAX_PLAYERS_PER_TEAM)
Add objective function
If you’ve done all that, well done! We’ve successfully added the constraints (i.e. we’ve told the model the minimum requirements which must be met when selecting players).
If the model follows all these constraints, it will narrow down the search space and identify only those options that are feasible.
But you'll still be left with millions of options left. Which of these should we pick?
That’s where the objective function comes in. The objective function specifies the thing we’re trying to optimise for, which will guide the algorithm so that it selects the optimal solution out of all the possible options.
Since we don't know each player's expected points per game in the upcoming season, we'll use a simple proxy: their average points scored per game last season.
It’s not a perfect proxy, but it’s good enough for our purposes; it’s reasonable to expect that players who’ve scored a high number of points previously will continue getting high scores.
In our DataFrame, this data point is shown in the points_per_game column, which shows the average number of points each player has achieved across all the games they’ve played so far this season.
With a little sorting, we can see the players who have the highest points_per_game, including Erling Haaland and Mo Salah:
df.sort_values(by=['points_per_game'], ascending=False)[['name', 'team', 'now_cost', 'points_per_game']].head(10)
ID | Name | Team | Now Cost | Points per Game |
---|---|---|---|---|
279 | Asmir Begović | Everton | 38 | 9 |
489 | Erling Haaland | Man City | 124 | 7.8 |
494 | Stefan Ortega Moreno | Man City | 37 | 7.7 |
673 | Harry Kane | Spurs | 115 | 6.9 |
445 | Mohamed Salah | Liverpool | 131 | 6.3 |
516 | Marcus Rashford | Man Utd | 72 | 5.9 |
4 | Martin Ødegaard | Arsenal | 69 | 5.9 |
474 | Kevin De Bruyne | Man City | 121 | 5.7 |
15 | Gabriel Martinelli Silva | Arsenal | 65 | 5.5 |
122 | Ivan Toney | Brentford | 71 | 5.5 |
Using this field, let's define an objective function which maximises this points_per_game field:
# For each player, collect their average `points_per_game` player_points_per_game = {player: df[df['id']==player]['points_per_game'].values[0] for player in decision_variables.keys()} # Add another decision variable: the total `points_per_game` of the selected players # This var is the sum of (points_per_game)*(1 or 0) for each player (depending on whether the player is selected) total_points = sum(var * player_points_per_game[i] for i, var in decision_variables.items()) # Instruct the model to maximise this model.Maximize(total_points)
Solve
Finally, we instruct the model to find the optimal solution, given the constraints and objective function:
solver = cp_model.CpSolver() status = solver.Solve(model)
And we show the solution:
def show_solution(status): """ Print out the solution (if any). """ if status == cp_model.OPTIMAL: print("Optimal solution found. Players selected:\\n") total_cost = 0 total_points_per_game = 0 players = { 'GKP': [], 'DEF': [], 'MID': [], 'FWD': [] } for i, var in decision_variables.items(): if solver.Value(var) == 1: player_position = df[df['id']==i].position.values[0] players[player_position].append(i) for position, ids in players.items(): print(f"\\nPlayers in {position}:") for i in ids: player_name = df[df['id']==i]['name'].values[0] player_cost = df[df['id']==i]['now_cost'].values[0] / 10 player_team = df[df['id']==i]['team'].values[0] player_points_per_game = df[df['id']==i]['points_per_game'].values[0] print(f"{player_name}: {player_team}, £{player_cost}, {player_points_per_game}") total_cost += player_cost total_points_per_game += player_points_per_game print("\\nTotal cost: ", total_cost) print("\\nTotal points_per_game: ", total_points_per_game) else: print("No solution found.") show_solution(status)
Optimal solution found. Players selected: Goalkeeper: Asmir Begović: Everton, £3.8, 9.0 Stefan Ortega Moreno: Man City, £3.7, 7.7 Defender: Yerry Mina: Everton, £4.3, 4.7 João Cancelo: Man City, £7.1, 4.3 Luke Shaw: Man Utd, £5.3, 4.0 Diogo Dalot Teixeira: Man Utd, £4.8, 3.9 Kieran Trippier: Newcastle, £6.0, 5.2 Midfielder: Martin Ødegaard: Arsenal, £6.9, 5.7 Bukayo Saka: Arsenal, £8.0, 5.3 Gabriel Martinelli Silva: Arsenal, £6.5, 5.5 Marcus Rashford: Man Utd, £7.2, 5.9 Miguel Almirón Rejala: Newcastle, £5.4, 4.6 Forward: Ivan Toney: Brentford, £7.1, 5.5 Erling Haaland: Man City, £12.4, 7.8 Harry Kane: Spurs, £11.5, 6.9 Total cost: 100.0 Total points per game: 86.0
Hooray! We’ve successfully picked a team within our budget, with a total predicted points per game of 86.0. Of course, this won't be the truly optimal team, since we used last season's average points_per_game as the proxy for expected points this season, and that's unlikely to match up completely with points achieved next season. But nonetheless, this provides a great starting point.
Conclusion
This article has demonstrated how to apply the concepts of mathematical optimisation and linear programming using Python and OR-Tools.
Thanks for reading!