5.10. Exercise Airlines Problem#

In this tutorial, we show you an example of a large-scale and (sort of) practical linear programming problem using real data.

Problem description#

Data description#

The dataset we will use for this example can be found here. Description of the data including column names is provided here. Some information about airplane types, their capacity, hourly operation cost and purchasing cost (that use as model parameters) can be found here. We also use monthly leasing rates for aircrafts, which can be found here.

The data we use is a large dataset including all the flights per month in the US for several years for each Origin-Destination pair of cities (from the beginning of 1990 till the end of 2009). We use this information to model a realistic problem (including some simplifying assumptions) that can be modeled and solved as a linear programming with Pyomo +Highs.

Problem definition#

Our task is to specify the fleet size and the frequency of flights for each route and each aircraft type in each time period that minimizes the total operating and purchasing cost of the entire fleet while satisfying travel demand and operating constraints.

Assumptions (to make the problem manageable)#

  • We want to model the problem as a linear program with continuous decision variables; of course you can’t break airplanes, but this assumption is taken for simplicity.

  • There are no constraints on the type of airplane that can land at an airport. All airplanes can land anywhere and we can operate on all cities.

  • There is no minimum flight frequency.

  • All aircrafts are leased, so we can change fleet size from one month to the next.

Data preprocessing#

You need to download the dataset and upload it on the folder where your notebook is. In case you have not installed pandas and numpy packages, you have to install them before running this notebook (using pip install). If you use the latest version of Anaconda navigator, it comes with these packages preinstalled.

Note In the next few cells, we briefly use pandas, a Python library for working with tabular data.
You’ll explore pandas in more detail in Week 7, so you do not need to fully understand Cells 1–7.
These cells simply prepare the data needed for Pyomo + HiGHS.

import numpy as np
import pandas as pd

# Read data (Since the data is massive, we just take 1000 rows for this purpose)
data = pd.read_csv('flight_edges.tsv', sep='\t', header=None, nrows=1000)

# Ddd column names (they are not included in the dataset but can be found in dataset description)
data.columns = ['Origin', 'Destination', 'Origin_City', 'Destination_City', 'Passengers', 'Seats',
                'Flights', 'Distance', 'Fly_Date', 'Origin_Population', 'Destination_Population']

# Let's check out the first three rows of the data
data.head(3)
Origin Destination Origin_City Destination_City Passengers Seats Flights Distance Fly_Date Origin_Population Destination_Population
0 MHK AMW Manhattan, KS Ames, IA 21 30 1 254.0 200810 122049 86219
1 EUG RDM Eugene, OR Bend, OR 41 396 22 103.0 199011 284093 76034
2 EUG RDM Eugene, OR Bend, OR 88 342 19 103.0 199012 284093 76034
# Check for missing values (if the resulting dataframe is empty, it means there is no missing value)
missing_rows = data[data.isna().any(axis=1)]
missing_rows
Origin Destination Origin_City Destination_City Passengers Seats Flights Distance Fly_Date Origin_Population Destination_Population
# Eliminate columns we don't need (for this study)
data = data.drop(columns=['Origin_City', 'Destination_City', 'Origin_Population', 'Destination_Population'])

# Sort data based on OD and month
data = data.sort_values(['Origin', 'Destination', 'Fly_Date'])

# Eliminate flights with the same OD (not sure why these records existed, but it makes sense to eliminate them)
data = data[data['Origin'] != data['Destination']]

# Calculate sum of passengers for each record and add to data (they are spread in multiple lines for some reason)
data['passenger_sum'] = data.groupby(['Origin', 'Destination', 'Fly_Date'])['Passengers'].transform('sum')
data['Seats'] = data.groupby(['Origin', 'Destination', 'Fly_Date'])['Seats'].transform('sum')
data['Flights'] = data.groupby(['Origin', 'Destination', 'Fly_Date'])['Flights'].transform('sum')

# Eliminate duplicates (since aggregated group values, first row in each group is enough to keep
data = data.drop_duplicates(subset=['Origin', 'Destination', 'Fly_Date'], keep='first')

# Eliminate rows where the number of passengers is less than 5
data = data[data['passenger_sum'] > 5]

# Some useful variables
demand_grouped_monthly = data.groupby(['Origin', 'Destination', 'Fly_Date'])['passenger_sum'].agg('sum')

# Let's examine this data (Note that .groupby transforms a dataframe into a series)
demand_grouped_monthly.head(3)
Origin  Destination  Fly_Date
AST     RDM          199410       6
BFI     RDM          200310      32
BIL     RDM          200408      63
Name: passenger_sum, dtype: int64
# Save the distance between each OD pair (since there will be duplicate OD pairs with the same distance, we take the mean here)
distance = data.groupby(['Origin', 'Destination'])['Distance'].mean()

# Let's examine the distance data
distance.head(3)
Origin  Destination
AST     RDM            187.0
BFI     RDM            233.0
BIL     RDM            626.0
Name: Distance, dtype: float64
# Save all unique OD pairs in a list; so we have the full list of routes (Note that pandas unique function preserves order, which is desired here)
routes = distance.index.unique().tolist()

# Check the first 5 rows of the route list
routes[:5]
[('AST', 'RDM'),
 ('BFI', 'RDM'),
 ('BIL', 'RDM'),
 ('CLM', 'RDM'),
 ('EAT', 'RDM')]
# Find all months in the dataset and select the last 6 months to use here (Note that numpy unique function preserves order, which is desired here)
months = np.unique(data['Fly_Date'])[-6:]
months
array([200508, 200509, 200510, 200511, 200512, 200810])
# Let's make a table with origin-destination pairs (routes) on rows and monthly demand on columns

# Empty dataset with origin-destination pairs (routes) as rows and monthly demand as columns
demand_data = pd.DataFrame(index=routes, columns=[month for month in months])

# Find monthly demand for each route and add to monthly demand dataset (Iterating through dataframe rows can be slow, so we use a more efficient way)
for month in months:
    # Initiate relevant column with zeros
    demand_data[month].values[:] = 0
    # Temporary array to store iteration results
    demand_temp_all_routes = demand_data[month]
    # Temporary array to store demand for routes that have none-zero demand values
    demand_temp_nonzero = demand_grouped_monthly[demand_grouped_monthly.index.get_level_values(2) == month]
    # Eliminate month from array index (all values here have the same month)
    demand_temp_nonzero.index = demand_temp_nonzero.index.droplevel(2)
    # Insert none-zero demand values into relevant rows
    demand_temp_all_routes[demand_temp_all_routes.index.intersection(demand_temp_nonzero.index)] = demand_temp_nonzero
    # Assign iteration results (monthly demand) to monthly demand dataset
    demand_data[month] = demand_temp_all_routes

# Convert demand data to a numpy array for efficiency
demand = demand_data.values
# Let's see how it looks like
demand_data.head(4)
200508 200509 200510 200511 200512 200810
(AST, RDM) 0 0 0 0 0 0
(BFI, RDM) 0 0 0 0 0 0
(BIL, RDM) 0 0 0 0 0 0
(CLM, RDM) 0 0 0 0 0 0

Now we have the set of routes (OD pairs), set of months, distance for each route and demand for each route in each month. So we are ready to build the model in Pyomo+Highs.

from pyomo.environ import (
    ConcreteModel, Set, Param, Var, Objective, Constraint,
    NonNegativeReals, NonNegativeIntegers, minimize, SolverFactory, value
)

Model construction#

Let’s build a Pyomo model and then add all required components step by step.

m = ConcreteModel()

Parameters#

We need average aircraft speed (\(S\)), number of working days per period (\(N\)), and daily operating hours (\(H\)). We also need the list of aircraft types (\(j \in J\)), their capacities (\(C_j\)), their hourly operating costs (\(O_j\)), and their monthly leasing costs (\(L_j\)).

We’ll store these in Python structures and then pass them to Pyomo parameters.

# Scalars
average_aircraft_speed = 500
working_day_per_period = 30
daily_operating_hours = 20

# Aircraft data
aircraft_specs = {
    'A320-200 CEO': (180, 4378, 196 * 1000),
    'A320 NEO':     (188, 4378, 315 * 1000),
    'A321-200 CEO': (230, 4432, 310 * 1000),
    'A321 NEO':     (236, 5149, 359 * 1000),
    'A330-200':     (436, 6564, 515 * 1000)
}


aircraft_type = list(aircraft_specs.keys())
aircraft_capacity = {k: aircraft_specs[k][0] for k in aircraft_type}
aircraft_hourly_operating_cost = {k: aircraft_specs[k][1] for k in aircraft_type}
aircraft_leasing_cost = {k: aircraft_specs[k][2] for k in aircraft_type}

# Build index ranges to mimic i in I, j in J, m in M
I = range(len(routes))
J = range(len(aircraft_type))
M = range(len(months))

# Map distance to an index-based dict consistent with 'routes'. This keeps 'distance[i]' semantics consistent with the original code’s intent.
distance_i = {i: float(distance.loc[routes[i]]) for i in I}

# Demand is a numpy array with shape [len(routes), len(months)]. create a dict keyed by (i, m) so Pyomo can index it directly.
demand_im = {(i, m): float(demand[i, m]) for i in I for m in M}

Sets#

We need the following sets to model this problem:

  • set of routes (OD pairs) that we extractedd from the data: \(i \in I\)

  • set of months that we extracted from the data: \(m \in M\)

  • set of aircraft types (extracted from the references mentioned in data description and listed above): \(j \in J\)

m.I = Set(initialize=I)
m.J = Set(initialize=J)
m.M = Set(initialize=M)
# Parameters as Pyomo Params
m.distance = Param(m.I, initialize=distance_i, within=NonNegativeReals)
m.demand   = Param(m.I, m.M, initialize=demand_im, within=NonNegativeReals)

# Aircraft parameters 
# Build index-aligned dicts for aircraft params (j is an integer index)
cap_dict   = {j: aircraft_capacity[aircraft_type[j]] for j in J}
opc_dict   = {j: aircraft_hourly_operating_cost[aircraft_type[j]] for j in J}
lease_dict = {j: aircraft_leasing_cost[aircraft_type[j]] for j in J}

m.cap   = Param(m.J, initialize=cap_dict,   within=NonNegativeReals)
m.opc   = Param(m.J, initialize=opc_dict,   within=NonNegativeReals)
m.lease = Param(m.J, initialize=lease_dict, within=NonNegativeReals)

# Scalars: we can just keep these as Python scalars and use them inside expressions
S = average_aircraft_speed
N = working_day_per_period
H = daily_operating_hours

Decision variables#

The decision variables are:

  • frequency of flights for each route and each aircraft type in each time period (\(x_{ijm}\); i: route, j: aircraft type, m: month)

  • fleet size for each route and each aircraft type in each time period (\(y_{ijm}\); i: route, j: aircraft type, m: month)

# x_{ijm}: flight frequency for route i, aircraft j, month m
m.frequency  = Var(m.I, m.J, m.M, domain=NonNegativeReals)

# y_{ijm}: fleet size allocated to route i, aircraft j, month m
m.fleet_size = Var(m.I, m.J, m.M, domain=NonNegativeReals)

Objective function#

The objective function includes:

  • aircraft leasing cost (sum for all routes, aircraft types and time periods): \(\sum_{ijm}{y_{ijm} . L_j}\)

  • aircraft operating cost (frequency times hourly operating cost times flight duration, which is distance divided by average speed): \(\sum_{ijm}{x_{ijm} . O_j . (D_i/S_j)}\)

def total_cost_rule(m):
    lease_cost = sum(m.fleet_size[i, j, k] * m.lease[j] for i in m.I for j in m.J for k in m.M)
    # flight hours for route i is distance_i / S
    oper_cost  = sum(m.frequency[i, j, k] * m.opc[j] * (m.distance[i] / S)
                     for i in m.I for j in m.J for k in m.M)
    return lease_cost + oper_cost

m.obj = Objective(rule=total_cost_rule, sense=minimize)

Constraints#

The constraints are:

  • demand satisfaction constraints; sum of frequency times capacity of all aircraft types should be greater than or equal to demand for each route in each time period

\[ \sum_{j}{x_{ijm} . C_j} \geq D_{im}; \forall i \in I, \forall m \in M\]
  • flight frequency constraint; frequency is restricted by the fleet size and available operating hours (per route, plane type and month)

\[ x_{ijm} . (D_i/S_j) \leq (y_{ijm} . H . N) ; \forall i \in I, \forall j \in J, \forall m \in M\]
# 1) Demand satisfaction
def demand_satisfaction_rule(m, i, k):
    return sum(m.frequency[i, j, k] * m.cap[j] for j in m.J) >= m.demand[i, k]
m.demand_satisfaction = Constraint(m.I, m.M, rule=demand_satisfaction_rule)

# 2) Utilization (x is flights, time per flight on route i is distance_i / S, total time <= fleet * hours):
def frequency_utilization_rule(m, i, j, k):
    return m.frequency[i, j, k] * (m.distance[i] / S) <= m.fleet_size[i, j, k] * H * N
m.frequency_utilization = Constraint(m.I, m.J, m.M, rule=frequency_utilization_rule)

Complete model formulation#

\[\begin{split}\begin{align} & min \sum_{ijm}{y_{ijm} . L_j} + \sum_{ijm}{x_{ijm} . O_j . (D_i/S_j)} \\ \\ & s.t. \\ \\ & \sum_{j}{x_{ijm} . C_j} \geq D_{im}; &\forall i \in I, \forall m \in M\\ & x_{ijm} . (D_i/S_j) \leq (y_{ijm} . H . N) ; &\forall i \in I, \forall j \in J, \forall m \in M\\ & x_{ijm}, y_{ijm} \geq 0\\ \end{align}\end{split}\]

Solving the model#

result = SolverFactory("highs").solve(m, tee=False)   # or: SolverFactory("appsi_highs").solve(m, tee=True)

print("Status:", result.solver.status)
print("Termination:", result.solver.termination_condition)
Status: ok
Termination: optimal

Analysis#

Let’s check out the optimal solution.

# Collect a compact table of non-negligible decisions
rows = []
for i in I:
    for j in J:
        for k in M:
            x_val = m.frequency[i, j, k].value or 0.0
            y_val = m.fleet_size[i, j, k].value or 0.0
            if abs(x_val) > 1e-6 or abs(y_val) > 1e-6:
                rows.append({
                    "Origin":        routes[i][0],
                    "Destination":   routes[i][1],
                    "Aircraft Type": aircraft_type[j],
                    "Month":         months[k],
                    "Fleet size":    y_val,
                    "Frequency":     x_val
                })

results_continuous = pd.DataFrame(rows)
print("Objective (total cost):", float(m.obj()))
results_continuous.head(10)
Objective (total cost): 554584.269603858
Origin Destination Aircraft Type Month Fleet size Frequency
0 MHK AMW A330-200 200810 0.000041 0.048165
1 PDX RDM A330-200 200508 0.007085 18.323394
2 PDX RDM A330-200 200509 0.006386 16.516055
3 PDX RDM A330-200 200510 0.006404 16.561927
4 PDX RDM A330-200 200511 0.005555 14.366972
5 PDX RDM A330-200 200512 0.007363 19.041284
6 SEA RDM A330-200 200508 0.007980 10.500000
7 SEA RDM A330-200 200509 0.007080 9.316514
8 SFO RDM A330-200 200508 0.006676 4.334862
9 SFO RDM A330-200 200509 0.004680 3.038991

Did you notice something odd about the results? We have decimal numbers for frequencies and fleet sizes. This is because we specified decision variable types as continuous when declaring them.
Now let’s try to force to find solutions with integer decision variable values. In this case, we can do that by changing decision variable types to integer. As you will see later, linear programs with integer decision variables are more challenging to solve. Large scale problems with integer decision variables can even be practically infeasible to solve.

Parameters (integer version)#

m_int = ConcreteModel()

m_int.I = Set(initialize=I)
m_int.J = Set(initialize=J)
m_int.M = Set(initialize=M)

m_int.distance = Param(m_int.I,              initialize=distance_i, within=NonNegativeReals)
m_int.demand   = Param(m_int.I, m_int.M,     initialize=demand_im,  within=NonNegativeReals)
m_int.cap      = Param(m_int.J, initialize=cap_dict,   within=NonNegativeReals)
m_int.opc      = Param(m_int.J, initialize=opc_dict,   within=NonNegativeReals)
m_int.lease    = Param(m_int.J, initialize=lease_dict, within=NonNegativeReals)

Decision Variables#

m_int.frequency  = Var(m_int.I, m_int.J, m_int.M, domain=NonNegativeIntegers)
m_int.fleet_size = Var(m_int.I, m_int.J, m_int.M, domain=NonNegativeIntegers)

Objective Function#

def total_cost_rule_int(m):
    lease_cost = sum(m.fleet_size[i, j, k] * m.lease[j] for i in m.I for j in m.J for k in m.M)
    oper_cost  = sum(m.frequency[i, j, k]  * m.opc[j]   * (m.distance[i] / S)
                     for i in m.I for j in m.J for k in m.M)
    return lease_cost + oper_cost

m_int.obj = Objective(rule=total_cost_rule_int, sense=minimize)

Constraints#

def demand_satisfaction_rule_int(m, i, k):
    return sum(m.frequency[i, j, k] * m.cap[j] for j in m.J) >= m.demand[i, k]
m_int.demand_satisfaction = Constraint(m_int.I, m_int.M, rule=demand_satisfaction_rule_int)

def utilization_rule_int(m, i, j, k):
    return m.frequency[i, j, k] * (m.distance[i] / S) <= m.fleet_size[i, j, k] * H * N
m_int.frequency_utilization = Constraint(m_int.I, m_int.J, m_int.M, rule=utilization_rule_int)
result_int = SolverFactory("highs").solve(m_int, tee=False)

print("Status:", result_int.solver.status)
print("Termination:", result_int.solver.termination_condition)
Status: ok
Termination: optimal

Continuous vs. Integer Computation Time#

As you have already noticed, the integer model takes considerably longer to solve than the continuous version.
This is expected: enforcing integrality turns the problem into a Mixed-Integer Linear Program (MILP), which requires the solver to explore discrete combinations rather than a smooth feasible region. As a result, solution time can increase quickly.

Next, we explore how the optimal results change when integrality is imposed.

Analysis#

rows_int = []
for i in I:
    for j in J:
        for k in M:
            x_val = m_int.frequency[i, j, k].value or 0
            y_val = m_int.fleet_size[i, j, k].value or 0
            if abs(x_val) > 0 or abs(y_val) > 0:
                rows_int.append({
                    "Origin":        routes[i][0],
                    "Destination":   routes[i][1],
                    "Aircraft Type": aircraft_type[j],
                    "Month":         months[k],
                    "Fleet size":    int(y_val),
                    "Frequency":     int(x_val)
                })

results_integer = pd.DataFrame(rows_int)
print("Objective (total cost) [integer]:", float(m_int.obj()))
results_integer.head(10)
Objective (total cost) [integer]: 4345316.763661545
Origin Destination Aircraft Type Month Fleet size Frequency
0 MHK AMW A320-200 CEO 200810 1 1
1 PDX RDM A320-200 CEO 200508 1 45
2 PDX RDM A320-200 CEO 200509 1 40
3 PDX RDM A320-200 CEO 200510 1 41
4 PDX RDM A320-200 CEO 200511 1 34
5 PDX RDM A320-200 CEO 200512 1 47
6 PDX RDM A320 NEO 200511 0 0
7 PDX RDM A321 NEO 200508 0 0
8 SEA RDM A320-200 CEO 200508 1 25
9 SEA RDM A320-200 CEO 200509 1 23

Comparing Continuous vs. Integer Results#

Comparing the two solutions, we observe that the objective value of the integer model is higher (worse) than that of the continuous model.
This makes sense: the continuous version can choose fractional fleet sizes and flight frequencies, allowing it to satisfy demand more efficiently.

In contrast, the integer model must use whole units, which can lead to:

  • extra aircraft being required,

  • over-capacity on certain routes,

  • and generally higher total cost.

This difference highlights a common trade-off in optimization:

Integer models are more realistic, but typically more expensive and harder to solve.

Exercise#

Earlier, we observed that switching from continuous to integer decision variables led to noticeably longer computation times.
A second factor that strongly influences performance is problem size.

In this exercise, you will investigate how computation time scales with the size of the model.

Task#

  1. Go back to the data-preprocessing step where we selected the last 6 months of data.
    Change this number (e.g., use 3, 6, 12, 24, …).

  2. For each problem size:

    • Rebuild the model.

    • Solve the continuous version.

    • Solve the integer version.

    • Record the computation time of each run.

  3. Create a figure showing:

    • x-axis: number of months included

    • y-axis: computation time (seconds)

    • Two curves: one for the continuous model and one for the integer model

  4. Comment on the results.

Attribution

This page is written by Nadia Pourmohammadzia Find out more here.