5.10. Exercise: Airlines Problem (Optional)#

In this tutorial, we show you an example of a large-scale and (sort of) practical linear programming problem using real data.

Note

Gurobi cannot be loaded in this online book, so download this notebook to work on it with Gurobi locally installed. Instruction on how to do that are in the README.md and PA_2_4_gurobilicious.ipynb of week 2.4.

Problem description#

Data description#

The dataset we will use for this example can be found here. Description of the data including column names is provided here. We also use monthly leasing rates for aircrafts, which can be found here.

The flight dataset includes monthly records of all flights in the US for each Origin-Destination city pair from the beginning of 1990 till the end of 2009. This data is used to model a realistic problem, with some simplifying assumptions, that can be formulated as a linear programming problem and solved using Gurobi.

Problem definition#

Our task is to specify the fleet size and the frequency of flights for each route and each aircraft type in each time period that minimizes the total operating and purchasing cost of the entire fleet while satisfying travel demand and operating constraints.

Assumptions (to make the problem manageable)#

  • We want to model the problem as a linear program with continous decision variables

  • There are no constraints on the type of aeroplane that can land at an airport; all airplanes can land at any airport, and we can operate in all cities.

  • There is no minimum flight frequency.

  • All aircrafts are leased, allowing us to change fleet size from one month to the next.

Data preprocessing#

You need to download the dataset and upload it on the folder where your notebook is.

import gurobipy as gp
import pandas as pd
import numpy as np

# read data
data = pd.read_csv('flight_edges.tsv', sep='\t', header=None)

# add column names (they are not included in the dataset but can be found in dataset description)
data.columns = ['Origin', 'Destination', 'Origin_City', 'Destination_City', 'Passengers', 'Seats',
                'Flights', 'Distance', 'Fly_Date', 'Origin_Population', 'Destination_Population']

# let's check out the first three rows of the data
data.head(3)
Origin Destination Origin_City Destination_City Passengers Seats Flights Distance Fly_Date Origin_Population Destination_Population
0 MHK AMW Manhattan, KS Ames, IA 21 30 1 254.0 200810 122049 86219
1 EUG RDM Eugene, OR Bend, OR 41 396 22 103.0 199011 284093 76034
2 EUG RDM Eugene, OR Bend, OR 88 342 19 103.0 199012 284093 76034
# check for missing values (if the resulting dataframe is empty, it means there is no missing value)
missing_rows = data[data.isna().any(axis=1)]
missing_rows
Origin Destination Origin_City Destination_City Passengers Seats Flights Distance Fly_Date Origin_Population Destination_Population
# eliminate columns we don't need (for this study)
data = data.drop(columns=['Origin_City', 'Destination_City', 'Origin_Population', 'Destination_Population'])

# sort data based on OD and month
data = data.sort_values(['Origin', 'Destination', 'Fly_Date'])

# eliminate flights with the same OD (not sure why these records existed, but it makes sense to eliminate them)
data = data[data['Origin'] != data['Destination']]

# calculate sum of passengers for each record and add to data (they are spread in multiple lines for some reason)
data['passenger_sum'] = data.groupby(['Origin', 'Destination', 'Fly_Date'])['Passengers'].transform('sum')
data['Seats'] = data.groupby(['Origin', 'Destination', 'Fly_Date'])['Seats'].transform('sum')
data['Flights'] = data.groupby(['Origin', 'Destination', 'Fly_Date'])['Flights'].transform('sum')
data['Passengers'] = data.groupby(['Origin', 'Destination', 'Fly_Date'])['Passengers'].transform('sum')

# eliminate duplicates (since aggregated group values, first row in each group is enough to keep
data = data.drop_duplicates(subset=['Origin', 'Destination', 'Fly_Date'], keep='first')

# eliminate rows where the number of passengers is less than 5
data = data[data['passenger_sum'] > 5]

# some useful variables
demand_grouped_monthly = data.groupby(['Origin', 'Destination', 'Fly_Date'])['passenger_sum'].agg(np.sum)

# let's check out how this grouped object looks like
demand_grouped_monthly.head(3)
Origin  Destination  Fly_Date
ABE     ACY          199007      73
        AGS          200402      47
                     200508      57
Name: passenger_sum, dtype: int64
# save the distance between each OD pair (since there will be duplicate OD pairs with the same distance, we take the mean here)
distance = data.groupby(['Origin', 'Destination'])['Distance'].mean()

# let's examine the distance data
distance.head(3)
Origin  Destination
ABE     ACY             94.0
        AGS            618.0
        ALB            167.0
Name: Distance, dtype: float64
# save all unique OD pairs in a list; this gives us the full list of routes
# note: pandas unique function preserves order (which is the desired behavior here)
routes = pd.unique(distance.index.values.tolist())

# check the first 5 rows of the route list
routes[:5]
array([('ABE', 'ACY'), ('ABE', 'AGS'), ('ABE', 'ALB'), ('ABE', 'ATL'),
       ('ABE', 'AVL')], dtype=object)
# find all months that appear in the dataset and select the last (in this case) 6 months to use in the model
# note: numpy unique function sorts the output (which is the desired behavior here)
months = np.unique(data['Fly_Date'])[-6:]
months
array([200907, 200908, 200909, 200910, 200911, 200912], dtype=int64)
# let's make a table with origin-destination pairs (routes) on rows and monthly demand on columns

# empty dataset with origin-destination pairs (routes) as rows and monthly demand as columns
demand_data = pd.DataFrame(index=routes, columns=[month for month in months])

# find monthly demand for each route and add to monthly demand dataset
# note that iterating through rows of a dataframe can be prohibitively slow, so we use a more efficient way
for month in months:
    # initiate relevant column with zeros
    demand_data[month].values[:] = 0
    # temporary array to store iteration results
    demand_temp_all_routes = demand_data[month]
    # temporary array to store demand for routes that have none-zero demand values
    demand_temp_nonzero = demand_grouped_monthly[demand_grouped_monthly.index.get_level_values(2) == month]
    # eliminate month from array index (all values here have the same month)
    demand_temp_nonzero.index = demand_temp_nonzero.index.droplevel(2)
    # insert none-zero demand values into relevant rows
    demand_temp_all_routes[demand_temp_all_routes.index.intersection(demand_temp_nonzero.index)] = demand_temp_nonzero
    # assign iteration results (monthly demand) to monthly demand dataset
    demand_data[month] = demand_temp_all_routes

# convert demand data to a numpy array for efficiency
demand = demand_data.values

# let's see how it looks like
demand_data.head(4)
200907 200908 200909 200910 200911 200912
(ABE, ACY) 0 0 0 0 0 0
(ABE, AGS) 0 0 0 0 0 0
(ABE, ALB) 0 0 0 0 0 0
(ABE, ATL) 3854 3307 2407 2855 2739 2344

Now we have the set of routes (OD pairs), set of months, distance for each route and demand for each route in each month. So we are ready to build the model in Gurobi.

Model construction#

Let’s build a Gurobi model object first. Then we will add all required components to the model step by step.

# create a gurobi model object
model = gp.Model()

# just to avoid cluttering the notebook with unnecessary logging output
model.Params.LogToConsole = 0
Set parameter Username
Academic license - for non-commercial use only - expires 2023-05-06

Parameters#

We need average aircraft speed (\(S\)), number of working days per period (\(N\)), and daily operating hours (\(H\)) to use in the model. It is good Python practice to use meaningful names instead of notations, so we use full names for parameters. They are introduced below.

We also need the list of aircraft types (\(j \in J\)), their capacity (\(C_j\)), their hourly operating cost (\(O_j\)) and their monthly leasing cost (\(L_j\)). We store these in a multidictionary, which is a convenient way of storing related parameters and sets in Gurobi.

average_aircraft_speed = 500
working_day_per_period = 30
daily_operating_hours = 20

aircraft_type, aircraft_capacity, aircraft_hourly_operating_cost, aircraft_leasing_cost = gp.multidict({
    'A320-200 CEO': [180, 4378, 196 * 1000],
    'A320 NEO':     [188, 4378, 315 * 1000],
    'A321-200 CEO': [230, 4432, 310 * 1000],
    'A321 NEO':     [236, 5149, 359 * 1000],
    'A330-200':     [436, 6564, 515 * 1000]
})

Sets#

We need the following sets to model this problem:

  • set of routes (OD pairs) that we extracted from the data: \(i \in I\)

  • set of months that we extracted from the data: \(m \in M\)

  • set of aircraft types (extracted from the references mentioned in data description and listed above): \(j \in J\)

Decision variables#

The decision variables are:

  • frequency of flights for each route and each aircraft type in each time period (\(x_{ijm}\); i: route, j: aircraft type, m: month)

  • fleet size for each route and each aircraft type in each time period (\(y_{ijm}\); i: route, j: aircraft type, m: month)

## decision variables: 

# frequency: Xijm; i: route, j: aircraft type, m: month
frequency = model.addVars(len(routes), len(aircraft_type), len(months), vtype=gp.GRB.CONTINUOUS, name='x')

# fleet size: Yijm; i: route, j: aircraft type, m: month
fleet_size = model.addVars(len(routes), len(aircraft_type), len(months), vtype=gp.GRB.CONTINUOUS, name='y')

Objective function#

The objective function includes:

  • aircraft leasing cost (sum for all routes, aircraft types and time periods): \(\sum_{ijm}{y_{ijm} . L_j}\)

  • aircraft operating cost (frequency times hourly operating cost times flight duration, calculated as distance divided by average speed): \(\sum_{ijm}{x_{ijm} . O_j . (D_i/S_j)}\)

# objective 1 (aircraft leasing cost)
obj_leasing_cost = gp.quicksum(fleet_size[i, j, m] * aircraft_leasing_cost[aircraft_type[j]]
                               for i in range(len(routes))
                               for j in range(len(aircraft_type))
                               for m in range(len(months)))
# objective 2 (aircraft operating cost)
obj_operating_cost = gp.quicksum(frequency[i, j, m] * aircraft_hourly_operating_cost[aircraft_type[j]] * (distance[i] / average_aircraft_speed)
                                 for i in range(len(routes))
                                 for j in range(len(aircraft_type))
                                 for m in range(len(months)))

# objective function
model.setObjective((obj_leasing_cost + obj_operating_cost), gp.GRB.MINIMIZE)

Constraints#

The constraints are:

  • demand satisfaction constraints: the sum of frequencies multiplied by the capacity of all aircraft types should be greater than or equal to demand for each route in each time period

\[ \sum_{j}{x_{ijm} . C_j} \geq D_{im}; \forall i \in I, \forall m \in M\]
  • flight frequency constraint: the frequency of flights is restricted by the fleet size and available operating hours (per route, plane type and month)

\[ x_{ijm} . (D_i/S_j) \leq (y_{ijm} . H . N) ; \forall i \in I, \forall j \in J, \forall m \in M\]
# demand satisfaction constraints
model.addConstrs(
    gp.quicksum(frequency[i, j, m] * aircraft_capacity[aircraft_type[j]] for j in range(len(aircraft_type)))
    >= demand[i, m]
    for i in range(len(routes))
    for m in range(len(months)))

# flight frequency constraint
model.addConstrs(frequency[i, j, m] * distance[i] / average_aircraft_speed
                 <= fleet_size[i, j, m] * daily_operating_hours * working_day_per_period
                 for i in range(len(routes))
                 for j in range(len(aircraft_type))
                 for m in range(len(months)))

model.update()

Complete model formulation#

\[\begin{split}\begin{align} & min_{x,y}\\ \\ &\sum_{ijm}{y_{ijm} . L_j} + \sum_{ijm}{x_{ijm} . O_j . (D_i/S_j)} \\ \\ & s.t. \\ \\ & \sum_{j}{x_{ijm} . C_j} \geq D_{im}; &\forall i \in I, \forall m \in M\\ & x_{ijm} . (D_i/S_j) \leq (y_{ijm} . H . N) ; &\forall i \in I, \forall j \in J, \forall m \in M\\ & x_{ijm}, y_{ijm} \geq 0\\ \end{align}\end{split}\]

Solving the model#

For solving linear programming problems with continuous decision variables, possible solution algorithms (available with Gurobi) are:

  • primal simplex

  • dual simplex

  • barrier algorithm (which is a specific type of interior point method).

When we ask Gurobi to solve a model without specifying the algorithm, it selects what it thinks is the best algorithm to solve each specific problem. It uses the concurrent optimizer, which is the default option for the solution algorithm.

Here we solve the model without specifying the solution method (letting Gurobi decide which algorithm to use). In the next tutorial, we discuss different algorithms.

# to see the output (remember we turned this off before)
model.Params.LogToConsole = 1

# solve the model
model.optimize()
Set parameter LogToConsole to value 1
Gurobi Optimizer version 9.5.1 build v9.5.1rc2 (win64)
Thread count: 4 physical cores, 8 logical processors, using up to 8 threads
Optimize a model with 959616 rows, 1599360 columns and 2399040 nonzeros
Model fingerprint: 0x97205460
Coefficient statistics:
  Matrix range     [4e-03, 6e+02]
  Objective range  [2e+01, 5e+05]
  Bounds range     [0e+00, 0e+00]
  RHS range        [6e+00, 1e+05]

Concurrent LP optimizer: dual simplex and barrier
Showing barrier log only...

Presolve removed 959616 rows and 1599360 columns
Presolve time: 1.44s
Presolve: All rows and columns removed
Iteration    Objective       Primal Inf.    Dual Inf.      Time
       0    7.9354517e+09   0.000000e+00   0.000000e+00      3s

Solved with dual simplex
Solved in 0 iterations and 2.73 seconds (0.88 work units)
Optimal objective  7.935451713e+09

Exercise#

Try to verify feasibility of the optimal solution as well as the objective value of this solution. You can access the optimal values of decision variables via their x attribute (e.g., fleet_size[i, j, m].x)

Analysis#

Let’s check out the optimal solution.

optimal_results = pd.DataFrame(columns=["Origin", "Destination", "Aircraft Type", "Month", "Fleet size", "Frequency"])
for i in range(len(routes)):
    for j in range(len(aircraft_type)):
        for m in range(len(months)):
            if frequency[i, j, m].x > 1e-6 or fleet_size[i, j, m].x > 1e-6:
                optimal_results = optimal_results.append({"Origin": routes[i][0],
                                                          "Destination": routes[i][1],
                                                          "Aircraft Type": aircraft_type[j],
                                                          "Month": months[m],
                                                          "Fleet size": fleet_size[i,j,m].x,
                                                          "Frequency": frequency[i,j,m].x}, ignore_index=True)
optimal_results.head(20)
Origin Destination Aircraft Type Month Fleet size Frequency
0 ABE ATL A330-200 200907 0.020390 8.839450
1 ABE ATL A330-200 200908 0.017496 7.584862
2 ABE ATL A330-200 200909 0.012734 5.520642
3 ABE ATL A330-200 200910 0.015104 6.548165
4 ABE ATL A330-200 200911 0.014491 6.282110
5 ABE ATL A330-200 200912 0.012401 5.376147
6 ABE AVL A330-200 200907 0.000186 0.105505
7 ABE AVP A330-200 200908 0.000018 0.107798
8 ABE CLE A330-200 200907 0.007348 6.502294
9 ABE CLE A330-200 200908 0.007749 6.857798
10 ABE CLE A330-200 200909 0.006122 5.417431
11 ABE CLE A330-200 200910 0.007177 6.350917
12 ABE CLE A330-200 200911 0.006018 5.325688
13 ABE CLE A330-200 200912 0.004999 4.424312
14 ABE CLT A330-200 200907 0.019529 12.201835
15 ABE CLT A330-200 200908 0.018623 11.635321
16 ABE CLT A330-200 200909 0.018193 11.366972
17 ABE CLT A330-200 200910 0.018681 11.672018
18 ABE CLT A330-200 200911 0.017940 11.208716
19 ABE CLT A330-200 200912 0.017474 10.917431

Did you notice something odd about the results? The frequencies and fleet sizes are represented as decimal numbers. This occurs because we specified the decision variable types as continuous during their declaration.

While integer decision variables are more realistic for representing quantities such as flight frequencies or fleet sizes, they significantly increase the complexity of the problem. Large-scale linear programs with integer decision variables can be much more challenging to solve and, in some cases, practically infeasible.

Using continuous variables in this example allows us to simplify the problem and focus on learning core optimization techniques. Such simplifications are also common in practice, especially when approximate solutions are sufficient for decision-making.

Now let’s try to force Gurobi to only find solutions with integer decision variable values. In this case, we can do that by iterating through decision variables and changing their type from continuous to integer (it’s not always that simple).

# discard existing solution information from the model
model.reset(0)

# iterate over decision variables and modify their type to integer (remember we declare them continuous before)
for var in model.getVars():
    var.vtype = gp.GRB.INTEGER
Discarded solution information
# optimize the modified model
model.optimize()
Gurobi Optimizer version 9.5.1 build v9.5.1rc2 (win64)
Thread count: 4 physical cores, 8 logical processors, using up to 8 threads
Optimize a model with 959616 rows, 1599360 columns and 2399040 nonzeros
Model fingerprint: 0xb003a3e3
Variable types: 0 continuous, 1599360 integer (0 binary)
Coefficient statistics:
  Matrix range     [4e-03, 6e+02]
  Objective range  [2e+01, 5e+05]
  Bounds range     [0e+00, 0e+00]
  RHS range        [6e+00, 1e+05]
Found heuristic solution: objective 2.043082e+10
Presolve removed 809084 rows and 1599350 columns (presolve time = 35s) ...
Presolve removed 959610 rows and 1599350 columns
Presolve time: 35.19s
Presolved: 6 rows, 10 columns, 15 nonzeros
Found heuristic solution: objective 1.588337e+10
Variable types: 0 continuous, 10 integer (5 binary)

Explored 0 nodes (0 simplex iterations) in 35.82 seconds (2.52 work units)
Thread count was 8 (of 8 available processors)

Solution count 2: 1.58834e+10 2.04308e+10 

Optimal solution found (tolerance 1.00e-04)
Best objective 1.588336894802e+10, best bound 1.588282454130e+10, gap 0.0034%

The optimal objective function value is higher (less desirable) compared to the continuous problem, which was to be expected. Can you guess why?

Also note that the computation time is significantly higher compared to the continuous problem.

Now let’s make a table and look at the results.

optimal_results_integer = pd.DataFrame(columns=["Origin", "Destination", "Aircraft Type", "Month", "Fleet size", "Frequency"])
for i in range(len(routes)):
    for j in range(len(aircraft_type)):
        for m in range(len(months)):
            if frequency[i, j, m].x > 1e-6 or fleet_size[i, j, m].x > 1e-6:
                optimal_results_integer = optimal_results_integer.append({"Origin": routes[i][0],
                                                                          "Destination": routes[i][1],
                                                                          "Aircraft Type": aircraft_type[j],
                                                                          "Month": months[m],
                                                                          "Fleet size": fleet_size[i,j,m].x,
                                                                          "Frequency": frequency[i,j,m].x}, ignore_index=True)
optimal_results_integer.head(20)
Origin Destination Aircraft Type Month Fleet size Frequency
0 ABE ATL A320-200 CEO 200907 1.0 22.0
1 ABE ATL A320-200 CEO 200908 1.0 19.0
2 ABE ATL A320-200 CEO 200909 1.0 14.0
3 ABE ATL A320-200 CEO 200910 1.0 16.0
4 ABE ATL A320-200 CEO 200911 1.0 16.0
5 ABE ATL A320-200 CEO 200912 1.0 14.0
6 ABE AVL A320-200 CEO 200907 1.0 1.0
7 ABE AVP A320-200 CEO 200908 1.0 1.0
8 ABE CLE A320-200 CEO 200907 1.0 16.0
9 ABE CLE A320-200 CEO 200908 1.0 17.0
10 ABE CLE A320-200 CEO 200909 1.0 14.0
11 ABE CLE A320-200 CEO 200910 1.0 16.0
12 ABE CLE A320-200 CEO 200911 1.0 13.0
13 ABE CLE A320-200 CEO 200912 1.0 11.0
14 ABE CLT A320-200 CEO 200907 1.0 30.0
15 ABE CLT A320-200 CEO 200908 1.0 29.0
16 ABE CLT A320-200 CEO 200909 1.0 28.0
17 ABE CLT A320-200 CEO 200910 1.0 29.0
18 ABE CLT A320-200 CEO 200911 1.0 28.0
19 ABE CLT A320-200 CEO 200912 1.0 27.0

Now fleet size and frequencies make more sense. If you still haven’t figured out why the objective function value of the integer problem is a bit greater, comparing fleet size and frequency values of integer and continuous variations should help you understand that (hint: aircraft leasing cost depends on the fleet size and aircraft operating cost depends on the frequency).

Exercise#

As you saw above, computation time of solving the model changed when we changed decision variable types from continuous to integer. On the other hand, computation time usually increases with problem size. Try to change the size of this problem (simply by changing the number of months to include in input cell 5 from 6 to a different number and rerunning the code), solve both continuous and integer variations, and record computation times. You can go all the way up to 240 months (but note that this might take quite some time). Make a figure to see the trends properly (e.g., number of months on the x-axis, computation time on the y-axis, and a separate line for each problem variation). What can you conclude?

Attribution

This chapter is written by Gonçalo Homem de Almeida Correia, Maria Nogal Macho, Jie Gao and Bahman Ahmadi. Find out more here.