5.10. Exercise Airlines Problem#
In this tutorial, we show you an example of a large-scale and (sort of) practical linear programming problem using real data.
Problem description#
Data description#
The dataset we will use for this example can be found here. Description of the data including column names is provided here. Some information about airplane types, their capacity, hourly operation cost and purchasing cost (that use as model parameters) can be found here. We also use monthly leasing rates for aircrafts, which can be found here.
The data we use is a large dataset including all the flights per month in the US for several years for each Origin-Destination pair of cities (from the beginning of 1990 till the end of 2009). We use this information to model a realistic problem (including some simplifying assumptions) that can be modeled and solved as a linear programming with Pyomo +Highs.
Problem definition#
Our task is to specify the fleet size and the frequency of flights for each route and each aircraft type in each time period that minimizes the total operating and purchasing cost of the entire fleet while satisfying travel demand and operating constraints.
Assumptions (to make the problem manageable)#
We want to model the problem as a linear program with continuous decision variables; of course you can’t break airplanes, but this assumption is taken for simplicity.
There are no constraints on the type of airplane that can land at an airport. All airplanes can land anywhere and we can operate on all cities.
There is no minimum flight frequency.
All aircrafts are leased, so we can change fleet size from one month to the next.
Data preprocessing#
You need to download the dataset and upload it on the folder where your notebook is. In case you have not installed pandas and numpy packages, you have to install them before running this notebook (using pip install). If you use the latest version of Anaconda navigator, it comes with these packages preinstalled.
Note In the next few cells, we briefly use pandas, a Python library for working with tabular data.
You’ll explore pandas in more detail in Week 7, so you do not need to fully understand Cells 1–7.
These cells simply prepare the data needed for Pyomo + HiGHS.
import numpy as np
import pandas as pd
# Read data (Since the data is massive, we just take 1000 rows for this purpose)
data = pd.read_csv('flight_edges.tsv', sep='\t', header=None, nrows=1000)
# Ddd column names (they are not included in the dataset but can be found in dataset description)
data.columns = ['Origin', 'Destination', 'Origin_City', 'Destination_City', 'Passengers', 'Seats',
'Flights', 'Distance', 'Fly_Date', 'Origin_Population', 'Destination_Population']
# Let's check out the first three rows of the data
data.head(3)
| Origin | Destination | Origin_City | Destination_City | Passengers | Seats | Flights | Distance | Fly_Date | Origin_Population | Destination_Population | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | MHK | AMW | Manhattan, KS | Ames, IA | 21 | 30 | 1 | 254.0 | 200810 | 122049 | 86219 |
| 1 | EUG | RDM | Eugene, OR | Bend, OR | 41 | 396 | 22 | 103.0 | 199011 | 284093 | 76034 |
| 2 | EUG | RDM | Eugene, OR | Bend, OR | 88 | 342 | 19 | 103.0 | 199012 | 284093 | 76034 |
# Check for missing values (if the resulting dataframe is empty, it means there is no missing value)
missing_rows = data[data.isna().any(axis=1)]
missing_rows
| Origin | Destination | Origin_City | Destination_City | Passengers | Seats | Flights | Distance | Fly_Date | Origin_Population | Destination_Population |
|---|
# Eliminate columns we don't need (for this study)
data = data.drop(columns=['Origin_City', 'Destination_City', 'Origin_Population', 'Destination_Population'])
# Sort data based on OD and month
data = data.sort_values(['Origin', 'Destination', 'Fly_Date'])
# Eliminate flights with the same OD (not sure why these records existed, but it makes sense to eliminate them)
data = data[data['Origin'] != data['Destination']]
# Calculate sum of passengers for each record and add to data (they are spread in multiple lines for some reason)
data['passenger_sum'] = data.groupby(['Origin', 'Destination', 'Fly_Date'])['Passengers'].transform('sum')
data['Seats'] = data.groupby(['Origin', 'Destination', 'Fly_Date'])['Seats'].transform('sum')
data['Flights'] = data.groupby(['Origin', 'Destination', 'Fly_Date'])['Flights'].transform('sum')
# Eliminate duplicates (since aggregated group values, first row in each group is enough to keep
data = data.drop_duplicates(subset=['Origin', 'Destination', 'Fly_Date'], keep='first')
# Eliminate rows where the number of passengers is less than 5
data = data[data['passenger_sum'] > 5]
# Some useful variables
demand_grouped_monthly = data.groupby(['Origin', 'Destination', 'Fly_Date'])['passenger_sum'].agg('sum')
# Let's examine this data (Note that .groupby transforms a dataframe into a series)
demand_grouped_monthly.head(3)
Origin Destination Fly_Date
AST RDM 199410 6
BFI RDM 200310 32
BIL RDM 200408 63
Name: passenger_sum, dtype: int64
# Save the distance between each OD pair (since there will be duplicate OD pairs with the same distance, we take the mean here)
distance = data.groupby(['Origin', 'Destination'])['Distance'].mean()
# Let's examine the distance data
distance.head(3)
Origin Destination
AST RDM 187.0
BFI RDM 233.0
BIL RDM 626.0
Name: Distance, dtype: float64
# Save all unique OD pairs in a list; so we have the full list of routes (Note that pandas unique function preserves order, which is desired here)
routes = distance.index.unique().tolist()
# Check the first 5 rows of the route list
routes[:5]
[('AST', 'RDM'),
('BFI', 'RDM'),
('BIL', 'RDM'),
('CLM', 'RDM'),
('EAT', 'RDM')]
# Find all months in the dataset and select the last 6 months to use here (Note that numpy unique function preserves order, which is desired here)
months = np.unique(data['Fly_Date'])[-6:]
months
array([200508, 200509, 200510, 200511, 200512, 200810])
# Let's make a table with origin-destination pairs (routes) on rows and monthly demand on columns
# Empty dataset with origin-destination pairs (routes) as rows and monthly demand as columns
demand_data = pd.DataFrame(index=routes, columns=[month for month in months])
# Find monthly demand for each route and add to monthly demand dataset (Iterating through dataframe rows can be slow, so we use a more efficient way)
for month in months:
# Initiate relevant column with zeros
demand_data[month].values[:] = 0
# Temporary array to store iteration results
demand_temp_all_routes = demand_data[month]
# Temporary array to store demand for routes that have none-zero demand values
demand_temp_nonzero = demand_grouped_monthly[demand_grouped_monthly.index.get_level_values(2) == month]
# Eliminate month from array index (all values here have the same month)
demand_temp_nonzero.index = demand_temp_nonzero.index.droplevel(2)
# Insert none-zero demand values into relevant rows
demand_temp_all_routes[demand_temp_all_routes.index.intersection(demand_temp_nonzero.index)] = demand_temp_nonzero
# Assign iteration results (monthly demand) to monthly demand dataset
demand_data[month] = demand_temp_all_routes
# Convert demand data to a numpy array for efficiency
demand = demand_data.values
# Let's see how it looks like
demand_data.head(4)
| 200508 | 200509 | 200510 | 200511 | 200512 | 200810 | |
|---|---|---|---|---|---|---|
| (AST, RDM) | 0 | 0 | 0 | 0 | 0 | 0 |
| (BFI, RDM) | 0 | 0 | 0 | 0 | 0 | 0 |
| (BIL, RDM) | 0 | 0 | 0 | 0 | 0 | 0 |
| (CLM, RDM) | 0 | 0 | 0 | 0 | 0 | 0 |
Now we have the set of routes (OD pairs), set of months, distance for each route and demand for each route in each month. So we are ready to build the model in Pyomo+Highs.
from pyomo.environ import (
ConcreteModel, Set, Param, Var, Objective, Constraint,
NonNegativeReals, NonNegativeIntegers, minimize, SolverFactory, value
)
Model construction#
Let’s build a Pyomo model and then add all required components step by step.
m = ConcreteModel()
Parameters#
We need average aircraft speed (\(S\)), number of working days per period (\(N\)), and daily operating hours (\(H\)). We also need the list of aircraft types (\(j \in J\)), their capacities (\(C_j\)), their hourly operating costs (\(O_j\)), and their monthly leasing costs (\(L_j\)).
We’ll store these in Python structures and then pass them to Pyomo parameters.
# Scalars
average_aircraft_speed = 500
working_day_per_period = 30
daily_operating_hours = 20
# Aircraft data
aircraft_specs = {
'A320-200 CEO': (180, 4378, 196 * 1000),
'A320 NEO': (188, 4378, 315 * 1000),
'A321-200 CEO': (230, 4432, 310 * 1000),
'A321 NEO': (236, 5149, 359 * 1000),
'A330-200': (436, 6564, 515 * 1000)
}
aircraft_type = list(aircraft_specs.keys())
aircraft_capacity = {k: aircraft_specs[k][0] for k in aircraft_type}
aircraft_hourly_operating_cost = {k: aircraft_specs[k][1] for k in aircraft_type}
aircraft_leasing_cost = {k: aircraft_specs[k][2] for k in aircraft_type}
# Build index ranges to mimic i in I, j in J, m in M
I = range(len(routes))
J = range(len(aircraft_type))
M = range(len(months))
# Map distance to an index-based dict consistent with 'routes'. This keeps 'distance[i]' semantics consistent with the original code’s intent.
distance_i = {i: float(distance.loc[routes[i]]) for i in I}
# Demand is a numpy array with shape [len(routes), len(months)]. create a dict keyed by (i, m) so Pyomo can index it directly.
demand_im = {(i, m): float(demand[i, m]) for i in I for m in M}
Sets#
We need the following sets to model this problem:
set of routes (OD pairs) that we extractedd from the data: \(i \in I\)
set of months that we extracted from the data: \(m \in M\)
set of aircraft types (extracted from the references mentioned in data description and listed above): \(j \in J\)
m.I = Set(initialize=I)
m.J = Set(initialize=J)
m.M = Set(initialize=M)
# Parameters as Pyomo Params
m.distance = Param(m.I, initialize=distance_i, within=NonNegativeReals)
m.demand = Param(m.I, m.M, initialize=demand_im, within=NonNegativeReals)
# Aircraft parameters
# Build index-aligned dicts for aircraft params (j is an integer index)
cap_dict = {j: aircraft_capacity[aircraft_type[j]] for j in J}
opc_dict = {j: aircraft_hourly_operating_cost[aircraft_type[j]] for j in J}
lease_dict = {j: aircraft_leasing_cost[aircraft_type[j]] for j in J}
m.cap = Param(m.J, initialize=cap_dict, within=NonNegativeReals)
m.opc = Param(m.J, initialize=opc_dict, within=NonNegativeReals)
m.lease = Param(m.J, initialize=lease_dict, within=NonNegativeReals)
# Scalars: we can just keep these as Python scalars and use them inside expressions
S = average_aircraft_speed
N = working_day_per_period
H = daily_operating_hours
Decision variables#
The decision variables are:
frequency of flights for each route and each aircraft type in each time period (\(x_{ijm}\); i: route, j: aircraft type, m: month)
fleet size for each route and each aircraft type in each time period (\(y_{ijm}\); i: route, j: aircraft type, m: month)
# x_{ijm}: flight frequency for route i, aircraft j, month m
m.frequency = Var(m.I, m.J, m.M, domain=NonNegativeReals)
# y_{ijm}: fleet size allocated to route i, aircraft j, month m
m.fleet_size = Var(m.I, m.J, m.M, domain=NonNegativeReals)
Objective function#
The objective function includes:
aircraft leasing cost (sum for all routes, aircraft types and time periods): \(\sum_{ijm}{y_{ijm} . L_j}\)
aircraft operating cost (frequency times hourly operating cost times flight duration, which is distance divided by average speed): \(\sum_{ijm}{x_{ijm} . O_j . (D_i/S_j)}\)
def total_cost_rule(m):
lease_cost = sum(m.fleet_size[i, j, k] * m.lease[j] for i in m.I for j in m.J for k in m.M)
# flight hours for route i is distance_i / S
oper_cost = sum(m.frequency[i, j, k] * m.opc[j] * (m.distance[i] / S)
for i in m.I for j in m.J for k in m.M)
return lease_cost + oper_cost
m.obj = Objective(rule=total_cost_rule, sense=minimize)
Constraints#
The constraints are:
demand satisfaction constraints; sum of frequency times capacity of all aircraft types should be greater than or equal to demand for each route in each time period
flight frequency constraint; frequency is restricted by the fleet size and available operating hours (per route, plane type and month)
# 1) Demand satisfaction
def demand_satisfaction_rule(m, i, k):
return sum(m.frequency[i, j, k] * m.cap[j] for j in m.J) >= m.demand[i, k]
m.demand_satisfaction = Constraint(m.I, m.M, rule=demand_satisfaction_rule)
# 2) Utilization (x is flights, time per flight on route i is distance_i / S, total time <= fleet * hours):
def frequency_utilization_rule(m, i, j, k):
return m.frequency[i, j, k] * (m.distance[i] / S) <= m.fleet_size[i, j, k] * H * N
m.frequency_utilization = Constraint(m.I, m.J, m.M, rule=frequency_utilization_rule)
Complete model formulation#
Solving the model#
result = SolverFactory("highs").solve(m, tee=False) # or: SolverFactory("appsi_highs").solve(m, tee=True)
print("Status:", result.solver.status)
print("Termination:", result.solver.termination_condition)
Status: ok
Termination: optimal
Analysis#
Let’s check out the optimal solution.
# Collect a compact table of non-negligible decisions
rows = []
for i in I:
for j in J:
for k in M:
x_val = m.frequency[i, j, k].value or 0.0
y_val = m.fleet_size[i, j, k].value or 0.0
if abs(x_val) > 1e-6 or abs(y_val) > 1e-6:
rows.append({
"Origin": routes[i][0],
"Destination": routes[i][1],
"Aircraft Type": aircraft_type[j],
"Month": months[k],
"Fleet size": y_val,
"Frequency": x_val
})
results_continuous = pd.DataFrame(rows)
print("Objective (total cost):", float(m.obj()))
results_continuous.head(10)
Objective (total cost): 554584.269603858
| Origin | Destination | Aircraft Type | Month | Fleet size | Frequency | |
|---|---|---|---|---|---|---|
| 0 | MHK | AMW | A330-200 | 200810 | 0.000041 | 0.048165 |
| 1 | PDX | RDM | A330-200 | 200508 | 0.007085 | 18.323394 |
| 2 | PDX | RDM | A330-200 | 200509 | 0.006386 | 16.516055 |
| 3 | PDX | RDM | A330-200 | 200510 | 0.006404 | 16.561927 |
| 4 | PDX | RDM | A330-200 | 200511 | 0.005555 | 14.366972 |
| 5 | PDX | RDM | A330-200 | 200512 | 0.007363 | 19.041284 |
| 6 | SEA | RDM | A330-200 | 200508 | 0.007980 | 10.500000 |
| 7 | SEA | RDM | A330-200 | 200509 | 0.007080 | 9.316514 |
| 8 | SFO | RDM | A330-200 | 200508 | 0.006676 | 4.334862 |
| 9 | SFO | RDM | A330-200 | 200509 | 0.004680 | 3.038991 |
Did you notice something odd about the results? We have decimal numbers for frequencies and fleet sizes. This is because we specified decision variable types as continuous when declaring them.
Now let’s try to force to find solutions with integer decision variable values. In this case, we can do that by changing decision variable types to integer. As you will see later, linear programs with integer decision variables are more challenging to solve. Large scale problems with integer decision variables can even be practically infeasible to solve.
Parameters (integer version)#
m_int = ConcreteModel()
m_int.I = Set(initialize=I)
m_int.J = Set(initialize=J)
m_int.M = Set(initialize=M)
m_int.distance = Param(m_int.I, initialize=distance_i, within=NonNegativeReals)
m_int.demand = Param(m_int.I, m_int.M, initialize=demand_im, within=NonNegativeReals)
m_int.cap = Param(m_int.J, initialize=cap_dict, within=NonNegativeReals)
m_int.opc = Param(m_int.J, initialize=opc_dict, within=NonNegativeReals)
m_int.lease = Param(m_int.J, initialize=lease_dict, within=NonNegativeReals)
Decision Variables#
m_int.frequency = Var(m_int.I, m_int.J, m_int.M, domain=NonNegativeIntegers)
m_int.fleet_size = Var(m_int.I, m_int.J, m_int.M, domain=NonNegativeIntegers)
Objective Function#
def total_cost_rule_int(m):
lease_cost = sum(m.fleet_size[i, j, k] * m.lease[j] for i in m.I for j in m.J for k in m.M)
oper_cost = sum(m.frequency[i, j, k] * m.opc[j] * (m.distance[i] / S)
for i in m.I for j in m.J for k in m.M)
return lease_cost + oper_cost
m_int.obj = Objective(rule=total_cost_rule_int, sense=minimize)
Constraints#
def demand_satisfaction_rule_int(m, i, k):
return sum(m.frequency[i, j, k] * m.cap[j] for j in m.J) >= m.demand[i, k]
m_int.demand_satisfaction = Constraint(m_int.I, m_int.M, rule=demand_satisfaction_rule_int)
def utilization_rule_int(m, i, j, k):
return m.frequency[i, j, k] * (m.distance[i] / S) <= m.fleet_size[i, j, k] * H * N
m_int.frequency_utilization = Constraint(m_int.I, m_int.J, m_int.M, rule=utilization_rule_int)
result_int = SolverFactory("highs").solve(m_int, tee=False)
print("Status:", result_int.solver.status)
print("Termination:", result_int.solver.termination_condition)
Status: ok
Termination: optimal
Continuous vs. Integer Computation Time#
As you have already noticed, the integer model takes considerably longer to solve than the continuous version.
This is expected: enforcing integrality turns the problem into a Mixed-Integer Linear Program (MILP), which requires the solver to explore discrete combinations rather than a smooth feasible region. As a result, solution time can increase quickly.
Next, we explore how the optimal results change when integrality is imposed.
Analysis#
rows_int = []
for i in I:
for j in J:
for k in M:
x_val = m_int.frequency[i, j, k].value or 0
y_val = m_int.fleet_size[i, j, k].value or 0
if abs(x_val) > 0 or abs(y_val) > 0:
rows_int.append({
"Origin": routes[i][0],
"Destination": routes[i][1],
"Aircraft Type": aircraft_type[j],
"Month": months[k],
"Fleet size": int(y_val),
"Frequency": int(x_val)
})
results_integer = pd.DataFrame(rows_int)
print("Objective (total cost) [integer]:", float(m_int.obj()))
results_integer.head(10)
Objective (total cost) [integer]: 4345316.763661545
| Origin | Destination | Aircraft Type | Month | Fleet size | Frequency | |
|---|---|---|---|---|---|---|
| 0 | MHK | AMW | A320-200 CEO | 200810 | 1 | 1 |
| 1 | PDX | RDM | A320-200 CEO | 200508 | 1 | 45 |
| 2 | PDX | RDM | A320-200 CEO | 200509 | 1 | 40 |
| 3 | PDX | RDM | A320-200 CEO | 200510 | 1 | 41 |
| 4 | PDX | RDM | A320-200 CEO | 200511 | 1 | 34 |
| 5 | PDX | RDM | A320-200 CEO | 200512 | 1 | 47 |
| 6 | PDX | RDM | A320 NEO | 200511 | 0 | 0 |
| 7 | PDX | RDM | A321 NEO | 200508 | 0 | 0 |
| 8 | SEA | RDM | A320-200 CEO | 200508 | 1 | 25 |
| 9 | SEA | RDM | A320-200 CEO | 200509 | 1 | 23 |
Comparing Continuous vs. Integer Results#
Comparing the two solutions, we observe that the objective value of the integer model is higher (worse) than that of the continuous model.
This makes sense: the continuous version can choose fractional fleet sizes and flight frequencies, allowing it to satisfy demand more efficiently.
In contrast, the integer model must use whole units, which can lead to:
extra aircraft being required,
over-capacity on certain routes,
and generally higher total cost.
This difference highlights a common trade-off in optimization:
Integer models are more realistic, but typically more expensive and harder to solve.
Exercise#
Earlier, we observed that switching from continuous to integer decision variables led to noticeably longer computation times.
A second factor that strongly influences performance is problem size.
In this exercise, you will investigate how computation time scales with the size of the model.
Task#
Go back to the data-preprocessing step where we selected the last 6 months of data.
Change this number (e.g., use 3, 6, 12, 24, …).For each problem size:
Rebuild the model.
Solve the continuous version.
Solve the integer version.
Record the computation time of each run.
Create a figure showing:
x-axis: number of months included
y-axis: computation time (seconds)
Two curves: one for the continuous model and one for the integer model
Comment on the results.
Attribution
This page is written by Nadia Pourmohammadzia Find out more here.