Exercise: Airlines Problem (Optional)

5.10. Exercise: Airlines Problem (Optional)#

In this tutorial, we show you an example of a large-scale and (sort of) practical linear programming problem using real data.

MUDE Exam Information

The airlines problem serves as an example for solving more complex optimization problems and is included for practice. You’re not expected to know the problem’s details.

Note

Gurobi cannot be loaded in this online book, so download this notebook to work on it with Gurobi locally installed. Instruction on how to do that are in the README.md and PA_2_4_gurobilicious.ipynb of week 2.4.

Problem description#

Data description#

The dataset we will use for this example can be found here. Description of the data including column names is provided here. We also use monthly leasing rates for aircrafts, which can be found here.

The flight dataset includes monthly records of all flights in the US for each Origin-Destination city pair from the beginning of 1990 till the end of 2009. This data is used to model a realistic problem, with some simplifying assumptions, that can be formulated as a linear programming problem and solved using Gurobi.

Problem definition#

Our task is to specify the fleet size and the frequency of flights for each route and each aircraft type in each time period that minimizes the total operating and purchasing cost of the entire fleet while satisfying travel demand and operating constraints.

Assumptions (to make the problem manageable)#

We want to model the problem as a linear program with continous decision variables
There are no constraints on the type of aeroplane that can land at an airport; all airplanes can land at any airport, and we can operate in all cities.
There is no minimum flight frequency.
All aircrafts are leased, allowing us to change fleet size from one month to the next.

Data preprocessing#

You need to download the dataset and upload it on the folder where your notebook is.

import gurobipy as gp
import pandas as pd
import numpy as np

# read data
data = pd.read_csv('flight_edges.tsv', sep='\t', header=None)

# add column names (they are not included in the dataset but can be found in dataset description)
data.columns = ['Origin', 'Destination', 'Origin_City', 'Destination_City', 'Passengers', 'Seats',
                'Flights', 'Distance', 'Fly_Date', 'Origin_Population', 'Destination_Population']

# let's check out the first three rows of the data
data.head(3)

	Origin	Destination	Origin_City	Destination_City	Passengers	Seats	Flights	Distance	Fly_Date	Origin_Population	Destination_Population
0	MHK	AMW	Manhattan, KS	Ames, IA	21	30	1	254.0	200810	122049	86219
1	EUG	RDM	Eugene, OR	Bend, OR	41	396	22	103.0	199011	284093	76034
2	EUG	RDM	Eugene, OR	Bend, OR	88	342	19	103.0	199012	284093	76034

# check for missing values (if the resulting dataframe is empty, it means there is no missing value)
missing_rows = data[data.isna().any(axis=1)]
missing_rows

	Origin	Destination	Origin_City	Destination_City	Passengers	Seats	Flights	Distance	Fly_Date	Origin_Population	Destination_Population

# eliminate columns we don't need (for this study)
data = data.drop(columns=['Origin_City', 'Destination_City', 'Origin_Population', 'Destination_Population'])

# sort data based on OD and month
data = data.sort_values(['Origin', 'Destination', 'Fly_Date'])

# eliminate flights with the same OD (not sure why these records existed, but it makes sense to eliminate them)
data = data[data['Origin'] != data['Destination']]

# calculate sum of passengers for each record and add to data (they are spread in multiple lines for some reason)
data['passenger_sum'] = data.groupby(['Origin', 'Destination', 'Fly_Date'])['Passengers'].transform('sum')
data['Seats'] = data.groupby(['Origin', 'Destination', 'Fly_Date'])['Seats'].transform('sum')
data['Flights'] = data.groupby(['Origin', 'Destination', 'Fly_Date'])['Flights'].transform('sum')
data['Passengers'] = data.groupby(['Origin', 'Destination', 'Fly_Date'])['Passengers'].transform('sum')

# eliminate duplicates (since aggregated group values, first row in each group is enough to keep
data = data.drop_duplicates(subset=['Origin', 'Destination', 'Fly_Date'], keep='first')

# eliminate rows where the number of passengers is less than 5
data = data[data['passenger_sum'] > 5]

# some useful variables
demand_grouped_monthly = data.groupby(['Origin', 'Destination', 'Fly_Date'])['passenger_sum'].agg(np.sum)

# let's check out how this grouped object looks like
demand_grouped_monthly.head(3)

Origin  Destination  Fly_Date
ABE     ACY          199007      73
        AGS          200402      47
                     200508      57
Name: passenger_sum, dtype: int64

# save the distance between each OD pair (since there will be duplicate OD pairs with the same distance, we take the mean here)
distance = data.groupby(['Origin', 'Destination'])['Distance'].mean()

# let's examine the distance data
distance.head(3)

Origin  Destination
ABE     ACY             94.0
        AGS            618.0
        ALB            167.0
Name: Distance, dtype: float64

# save all unique OD pairs in a list; this gives us the full list of routes
# note: pandas unique function preserves order (which is the desired behavior here)
routes = pd.unique(distance.index.values.tolist())

# check the first 5 rows of the route list
routes[:5]

array([('ABE', 'ACY'), ('ABE', 'AGS'), ('ABE', 'ALB'), ('ABE', 'ATL'),
       ('ABE', 'AVL')], dtype=object)

# find all months that appear in the dataset and select the last (in this case) 6 months to use in the model
# note: numpy unique function sorts the output (which is the desired behavior here)
months = np.unique(data['Fly_Date'])[-6:]
months

array([200907, 200908, 200909, 200910, 200911, 200912], dtype=int64)

# let's make a table with origin-destination pairs (routes) on rows and monthly demand on columns

# empty dataset with origin-destination pairs (routes) as rows and monthly demand as columns
demand_data = pd.DataFrame(index=routes, columns=[month for month in months])

# find monthly demand for each route and add to monthly demand dataset
# note that iterating through rows of a dataframe can be prohibitively slow, so we use a more efficient way
for month in months:
    # initiate relevant column with zeros
    demand_data[month].values[:] = 0
    # temporary array to store iteration results
    demand_temp_all_routes = demand_data[month]
    # temporary array to store demand for routes that have none-zero demand values
    demand_temp_nonzero = demand_grouped_monthly[demand_grouped_monthly.index.get_level_values(2) == month]
    # eliminate month from array index (all values here have the same month)
    demand_temp_nonzero.index = demand_temp_nonzero.index.droplevel(2)
    # insert none-zero demand values into relevant rows
    demand_temp_all_routes[demand_temp_all_routes.index.intersection(demand_temp_nonzero.index)] = demand_temp_nonzero
    # assign iteration results (monthly demand) to monthly demand dataset
    demand_data[month] = demand_temp_all_routes

# convert demand data to a numpy array for efficiency
demand = demand_data.values

# let's see how it looks like
demand_data.head(4)

	200907	200908	200909	200910	200911	200912
(ABE, ACY)	0	0	0	0	0	0
(ABE, AGS)	0	0	0	0	0	0
(ABE, ALB)	0	0	0	0	0	0
(ABE, ATL)	3854	3307	2407	2855	2739	2344

Now we have the set of routes (OD pairs), set of months, distance for each route and demand for each route in each month. So we are ready to build the model in Gurobi.

Model construction#

Let’s build a Gurobi model object first. Then we will add all required components to the model step by step.

# create a gurobi model object
model = gp.Model()

# just to avoid cluttering the notebook with unnecessary logging output
model.Params.LogToConsole = 0

Set parameter Username
Academic license - for non-commercial use only - expires 2023-05-06

Parameters#

We need average aircraft speed (\(S\)), number of working days per period (\(N\)), and daily operating hours (\(H\)) to use in the model. It is good Python practice to use meaningful names instead of notations, so we use full names for parameters. They are introduced below.

We also need the list of aircraft types (\(j \in J\)), their capacity (\(C_j\)), their hourly operating cost (\(O_j\)) and their monthly leasing cost (\(L_j\)). We store these in a multidictionary, which is a convenient way of storing related parameters and sets in Gurobi.

average_aircraft_speed = 500
working_day_per_period = 30
daily_operating_hours = 20

aircraft_type, aircraft_capacity, aircraft_hourly_operating_cost, aircraft_leasing_cost = gp.multidict({
    'A320-200 CEO': [180, 4378, 196 * 1000],
    'A320 NEO':     [188, 4378, 315 * 1000],
    'A321-200 CEO': [230, 4432, 310 * 1000],
    'A321 NEO':     [236, 5149, 359 * 1000],
    'A330-200':     [436, 6564, 515 * 1000]
})

Sets#

We need the following sets to model this problem:

set of routes (OD pairs) that we extracted from the data: \(i \in I\)
set of months that we extracted from the data: \(m \in M\)
set of aircraft types (extracted from the references mentioned in data description and listed above): \(j \in J\)

Decision variables#

The decision variables are:

frequency of flights for each route and each aircraft type in each time period (\(x_{ijm}\); i: route, j: aircraft type, m: month)
fleet size for each route and each aircraft type in each time period (\(y_{ijm}\); i: route, j: aircraft type, m: month)

## decision variables: 

# frequency: Xijm; i: route, j: aircraft type, m: month
frequency = model.addVars(len(routes), len(aircraft_type), len(months), vtype=gp.GRB.CONTINUOUS, name='x')

# fleet size: Yijm; i: route, j: aircraft type, m: month
fleet_size = model.addVars(len(routes), len(aircraft_type), len(months), vtype=gp.GRB.CONTINUOUS, name='y')

Objective function#

The objective function includes:

aircraft leasing cost (sum for all routes, aircraft types and time periods): \(\sum_{ijm}{y_{ijm} . L_j}\)
aircraft operating cost (frequency times hourly operating cost times flight duration, calculated as distance divided by average speed): \(\sum_{ijm}{x_{ijm} . O_j . (D_i/S_j)}\)

# objective 1 (aircraft leasing cost)
obj_leasing_cost = gp.quicksum(fleet_size[i, j, m] * aircraft_leasing_cost[aircraft_type[j]]
                               for i in range(len(routes))
                               for j in range(len(aircraft_type))
                               for m in range(len(months)))
# objective 2 (aircraft operating cost)
obj_operating_cost = gp.quicksum(frequency[i, j, m] * aircraft_hourly_operating_cost[aircraft_type[j]] * (distance[i] / average_aircraft_speed)
                                 for i in range(len(routes))
                                 for j in range(len(aircraft_type))
                                 for m in range(len(months)))

# objective function
model.setObjective((obj_leasing_cost + obj_operating_cost), gp.GRB.MINIMIZE)

Constraints#

The constraints are:

demand satisfaction constraints: the sum of frequencies multiplied by the capacity of all aircraft types should be greater than or equal to demand for each route in each time period

\[ \sum_{j}{x_{ijm} . C_j} \geq D_{im}; \forall i \in I, \forall m \in M\]

flight frequency constraint: the frequency of flights is restricted by the fleet size and available operating hours (per route, plane type and month)

\[ x_{ijm} . (D_i/S_j) \leq (y_{ijm} . H . N) ; \forall i \in I, \forall j \in J, \forall m \in M\]

# demand satisfaction constraints
model.addConstrs(
    gp.quicksum(frequency[i, j, m] * aircraft_capacity[aircraft_type[j]] for j in range(len(aircraft_type)))
    >= demand[i, m]
    for i in range(len(routes))
    for m in range(len(months)))

# flight frequency constraint
model.addConstrs(frequency[i, j, m] * distance[i] / average_aircraft_speed
                 <= fleet_size[i, j, m] * daily_operating_hours * working_day_per_period
                 for i in range(len(routes))
                 for j in range(len(aircraft_type))
                 for m in range(len(months)))

model.update()

Complete model formulation#

\[\begin{split}\begin{align} & min_{x,y}\\ \\ &\sum_{ijm}{y_{ijm} . L_j} + \sum_{ijm}{x_{ijm} . O_j . (D_i/S_j)} \\ \\ & s.t. \\ \\ & \sum_{j}{x_{ijm} . C_j} \geq D_{im}; &\forall i \in I, \forall m \in M\\ & x_{ijm} . (D_i/S_j) \leq (y_{ijm} . H . N) ; &\forall i \in I, \forall j \in J, \forall m \in M\\ & x_{ijm}, y_{ijm} \geq 0\\ \end{align}\end{split}\]

Solving the model#

For solving linear programming problems with continuous decision variables, possible solution algorithms (available with Gurobi) are:

primal simplex
dual simplex
barrier algorithm (which is a specific type of interior point method).

When we ask Gurobi to solve a model without specifying the algorithm, it selects what it thinks is the best algorithm to solve each specific problem. It uses the concurrent optimizer, which is the default option for the solution algorithm.

Here we solve the model without specifying the solution method (letting Gurobi decide which algorithm to use). In the next tutorial, we discuss different algorithms.

# to see the output (remember we turned this off before)
model.Params.LogToConsole = 1

# solve the model
model.optimize()

Set parameter LogToConsole to value 1
Gurobi Optimizer version 9.5.1 build v9.5.1rc2 (win64)
Thread count: 4 physical cores, 8 logical processors, using up to 8 threads
Optimize a model with 959616 rows, 1599360 columns and 2399040 nonzeros
Model fingerprint: 0x97205460
Coefficient statistics:
  Matrix range     [4e-03, 6e+02]
  Objective range  [2e+01, 5e+05]
  Bounds range     [0e+00, 0e+00]
  RHS range        [6e+00, 1e+05]

Concurrent LP optimizer: dual simplex and barrier
Showing barrier log only...

Presolve removed 959616 rows and 1599360 columns
Presolve time: 1.44s
Presolve: All rows and columns removed
Iteration    Objective       Primal Inf.    Dual Inf.      Time
       0    7.9354517e+09   0.000000e+00   0.000000e+00      3s

Solved with dual simplex
Solved in 0 iterations and 2.73 seconds (0.88 work units)
Optimal objective  7.935451713e+09

Exercise#

Try to verify feasibility of the optimal solution as well as the objective value of this solution. You can access the optimal values of decision variables via their x attribute (e.g., fleet_size[i, j, m].x)

Analysis#

Let’s check out the optimal solution.

optimal_results = pd.DataFrame(columns=["Origin", "Destination", "Aircraft Type", "Month", "Fleet size", "Frequency"])
for i in range(len(routes)):
    for j in range(len(aircraft_type)):
        for m in range(len(months)):
            if frequency[i, j, m].x > 1e-6 or fleet_size[i, j, m].x > 1e-6:
                optimal_results = optimal_results.append({"Origin": routes[i][0],
                                                          "Destination": routes[i][1],
                                                          "Aircraft Type": aircraft_type[j],
                                                          "Month": months[m],
                                                          "Fleet size": fleet_size[i,j,m].x,
                                                          "Frequency": frequency[i,j,m].x}, ignore_index=True)
optimal_results.head(20)

	Origin	Destination	Aircraft Type	Month	Fleet size	Frequency
0	ABE	ATL	A330-200	200907	0.020390	8.839450
1	ABE	ATL	A330-200	200908	0.017496	7.584862
2	ABE	ATL	A330-200	200909	0.012734	5.520642
3	ABE	ATL	A330-200	200910	0.015104	6.548165
4	ABE	ATL	A330-200	200911	0.014491	6.282110
5	ABE	ATL	A330-200	200912	0.012401	5.376147
6	ABE	AVL	A330-200	200907	0.000186	0.105505
7	ABE	AVP	A330-200	200908	0.000018	0.107798
8	ABE	CLE	A330-200	200907	0.007348	6.502294
9	ABE	CLE	A330-200	200908	0.007749	6.857798
10	ABE	CLE	A330-200	200909	0.006122	5.417431
11	ABE	CLE	A330-200	200910	0.007177	6.350917
12	ABE	CLE	A330-200	200911	0.006018	5.325688
13	ABE	CLE	A330-200	200912	0.004999	4.424312
14	ABE	CLT	A330-200	200907	0.019529	12.201835
15	ABE	CLT	A330-200	200908	0.018623	11.635321
16	ABE	CLT	A330-200	200909	0.018193	11.366972
17	ABE	CLT	A330-200	200910	0.018681	11.672018
18	ABE	CLT	A330-200	200911	0.017940	11.208716
19	ABE	CLT	A330-200	200912	0.017474	10.917431

Did you notice something odd about the results? The frequencies and fleet sizes are represented as decimal numbers. This occurs because we specified the decision variable types as continuous during their declaration.

While integer decision variables are more realistic for representing quantities such as flight frequencies or fleet sizes, they significantly increase the complexity of the problem. Large-scale linear programs with integer decision variables can be much more challenging to solve and, in some cases, practically infeasible.

Using continuous variables in this example allows us to simplify the problem and focus on learning core optimization techniques. Such simplifications are also common in practice, especially when approximate solutions are sufficient for decision-making.

Now let’s try to force Gurobi to only find solutions with integer decision variable values. In this case, we can do that by iterating through decision variables and changing their type from continuous to integer (it’s not always that simple).

# discard existing solution information from the model
model.reset(0)

# iterate over decision variables and modify their type to integer (remember we declare them continuous before)
for var in model.getVars():
    var.vtype = gp.GRB.INTEGER

Discarded solution information

# optimize the modified model
model.optimize()

Gurobi Optimizer version 9.5.1 build v9.5.1rc2 (win64)
Thread count: 4 physical cores, 8 logical processors, using up to 8 threads
Optimize a model with 959616 rows, 1599360 columns and 2399040 nonzeros
Model fingerprint: 0xb003a3e3
Variable types: 0 continuous, 1599360 integer (0 binary)
Coefficient statistics:
  Matrix range     [4e-03, 6e+02]
  Objective range  [2e+01, 5e+05]
  Bounds range     [0e+00, 0e+00]
  RHS range        [6e+00, 1e+05]
Found heuristic solution: objective 2.043082e+10
Presolve removed 809084 rows and 1599350 columns (presolve time = 35s) ...
Presolve removed 959610 rows and 1599350 columns
Presolve time: 35.19s
Presolved: 6 rows, 10 columns, 15 nonzeros
Found heuristic solution: objective 1.588337e+10
Variable types: 0 continuous, 10 integer (5 binary)

Explored 0 nodes (0 simplex iterations) in 35.82 seconds (2.52 work units)
Thread count was 8 (of 8 available processors)

Solution count 2: 1.58834e+10 2.04308e+10 

Optimal solution found (tolerance 1.00e-04)
Best objective 1.588336894802e+10, best bound 1.588282454130e+10, gap 0.0034%

The optimal objective function value is higher (less desirable) compared to the continuous problem, which was to be expected. Can you guess why?

Also note that the computation time is significantly higher compared to the continuous problem.

Now let’s make a table and look at the results.

optimal_results_integer = pd.DataFrame(columns=["Origin", "Destination", "Aircraft Type", "Month", "Fleet size", "Frequency"])
for i in range(len(routes)):
    for j in range(len(aircraft_type)):
        for m in range(len(months)):
            if frequency[i, j, m].x > 1e-6 or fleet_size[i, j, m].x > 1e-6:
                optimal_results_integer = optimal_results_integer.append({"Origin": routes[i][0],
                                                                          "Destination": routes[i][1],
                                                                          "Aircraft Type": aircraft_type[j],
                                                                          "Month": months[m],
                                                                          "Fleet size": fleet_size[i,j,m].x,
                                                                          "Frequency": frequency[i,j,m].x}, ignore_index=True)
optimal_results_integer.head(20)

	Origin	Destination	Aircraft Type	Month	Fleet size	Frequency
0	ABE	ATL	A320-200 CEO	200907	1.0	22.0
1	ABE	ATL	A320-200 CEO	200908	1.0	19.0
2	ABE	ATL	A320-200 CEO	200909	1.0	14.0
3	ABE	ATL	A320-200 CEO	200910	1.0	16.0
4	ABE	ATL	A320-200 CEO	200911	1.0	16.0
5	ABE	ATL	A320-200 CEO	200912	1.0	14.0
6	ABE	AVL	A320-200 CEO	200907	1.0	1.0
7	ABE	AVP	A320-200 CEO	200908	1.0	1.0
8	ABE	CLE	A320-200 CEO	200907	1.0	16.0
9	ABE	CLE	A320-200 CEO	200908	1.0	17.0
10	ABE	CLE	A320-200 CEO	200909	1.0	14.0
11	ABE	CLE	A320-200 CEO	200910	1.0	16.0
12	ABE	CLE	A320-200 CEO	200911	1.0	13.0
13	ABE	CLE	A320-200 CEO	200912	1.0	11.0
14	ABE	CLT	A320-200 CEO	200907	1.0	30.0
15	ABE	CLT	A320-200 CEO	200908	1.0	29.0
16	ABE	CLT	A320-200 CEO	200909	1.0	28.0
17	ABE	CLT	A320-200 CEO	200910	1.0	29.0
18	ABE	CLT	A320-200 CEO	200911	1.0	28.0
19	ABE	CLT	A320-200 CEO	200912	1.0	27.0

Now fleet size and frequencies make more sense. If you still haven’t figured out why the objective function value of the integer problem is a bit greater, comparing fleet size and frequency values of integer and continuous variations should help you understand that (hint: aircraft leasing cost depends on the fleet size and aircraft operating cost depends on the frequency).

Exercise#

As you saw above, computation time of solving the model changed when we changed decision variable types from continuous to integer. On the other hand, computation time usually increases with problem size. Try to change the size of this problem (simply by changing the number of months to include in input cell 5 from 6 to a different number and rerunning the code), solve both continuous and integer variations, and record computation times. You can go all the way up to 240 months (but note that this might take quite some time). Make a figure to see the trends properly (e.g., number of months on the x-axis, computation time on the y-axis, and a separate line for each problem variation). What can you conclude?

Attribution

This chapter is written by Gonçalo Homem de Almeida Correia, Maria Nogal Macho, Jie Gao and Bahman Ahmadi. Find out more here.

Exercise: Airlines Problem (Optional)

Contents

5.10. Exercise: Airlines Problem (Optional)#

Problem description#

Data description#

Problem definition#

Assumptions (to make the problem manageable)#

Data preprocessing#

Model construction#

Parameters#

Sets#

Decision variables#

Objective function#

Constraints#

Complete model formulation#

Solving the model#

Exercise#

Analysis#

Exercise#