Analysis traffic dataset

Analysis traffic dataset#

Case 2: Traffic density near Helsinki#

Openstreetmap screenshot of the location of the measurement station from which this data has been sourced. Map data © OpenStreetMap contributors, licensed under ODbL.

What’s the propagated uncertainty? How large is the traffic density?

In this project, you have chosen to work on the traffic density on a highway close to Helsinki, Finland. You have observations of the flow of vehicles \(F\) [#/h] and their average velocity \(v\) [km/h]. The dataset has been accessed here. In this exercise, we want to compute the density of traffic \(D\) [#/km] along this path in both directions:

\[ D = \frac{F}{v} \]

The goal of this project is:

Choose a reasonable distribution function for \(F\) and \(v\).
Fit the chosen distributions to the observations of \(F\) and \(v\).
Assuming \(F\) and \(v\) are independent, propagate their distributions to obtain a distribution for the traffic density \(D\).
Analyze the distribution of \(D\).

Importing packages#

import numpy as np              # For math
import matplotlib.pyplot as plt # For plotting
from scipy import stats         # For math
from math import ceil, trunc    # For plotting

# This is just cosmetic - it updates the font size for our plots
plt.rcParams.update({'font.size': 14})

1. Explore the data#

The first step in the analysis is exploring the data, visually and through statistics.

Tip: In the workshop files, you have used the pandas .describe() function to obtain the statistics of a data vector. scipy.stats has a similar function.

import os
from urllib.request import urlretrieve

def findfile(fname):
    if not os.path.isfile(fname):
        print(f"Downloading {fname}...")
        urlretrieve('http://files.mude.citg.tudelft.nl/GA1.4/'+fname, fname)

findfile('dataset_traffic.csv')

# Import the data from the .csv file
F, v = np.genfromtxt('dataset_traffic.csv', delimiter=",", unpack=True, skip_header=True)

# Plot the time series for the number of heavy vehicles H
fig, ax = plt.subplots(2, 1, figsize=(10, 7), layout = 'constrained')
ax[0].plot(F,'k')
ax[0].set_xlabel('Time')
ax[0].set_ylabel('Traffic flow $F$ [#/h]')
ax[0].grid()

# Plot the time series for the number of cars C
ax[1].plot(v,'k')
ax[1].set_xlabel('Time')
ax[1].set_ylabel('Average velocity $v$ [km/h]')
ax[1].grid()

# Statistics for F
print(stats.describe(F))

# Statistics for v
print(stats.describe(v))

** Task 1: **

Describe the data based on the previous statistics:

Which variable features a higher variability? Also consider the magnitudes of the different variables.
What does the skewness coefficient represent? Which kind of distribution functions should we consider to fit based on this coefficient?

2. Empirical distribution functions#

Now, we are going to compute and plot the empirical PDF and CDF for each variable. Note that you have the pseudo-code for the empirical CDF in the reader.

Task 2:

Define a function to compute the empirical CDF. Plot your empirical PDF and CDF.

# def ecdf(YOUR_CODE_HERE):
#     """Write a function that returns [non_exceedance_probabilities, sorted_values]."""
#     YOUR_CODE_HERE # may be more than one line
#     return [non_exceedance_probabilities, sorted_values]

### YOUR PLOTS HERE ###

Task 3:

Based on the results of Task 1 and the empirical PDF and CDF, select one distribution to fit to each variable.

For \(F\), select between an exponential or lognormal distribution.
For \(v\) choose between a Gaussian and Gumbel distribution.

3. Fitting a distribution#

Task 4:

Fit the selected distributions to the observations using MLE (Maximum Likelihood Estimation).

Hint: Use Scipy’s built-in functions (be careful with the parameter definitions!).

### YOUR CODE HERE ###

4. Assessing goodness of fit#

Task 5:

Assess the goodness of fit of the selected distribution using:

One graphical method: QQplot or Logscale. Choose one.
The Kolmogorov-Smirnov test.

Hint: The Kolmogorov-Smirnov test is implemented in Scipy.

### YOUR PLOTS HERE ###

### YOUR CODE HERE ###

Task 6:

Interpret the results of the GOF techniques. How does the selected parametric distribution perform?

5. Propagating the uncertainty#

Using the fitted distributions, we are going to propagate the uncertainty from \(F\) and \(v\) to compute the density of traffic \(D\) assuming that \(F\) and \(v\) are independent.

Task 7:

Draw 10,000 random samples from the fitted distribution functions for \(F\) and \(v\).
Compute the traffic density \(D\) for each pair of the generated samples.
Compute the traffic density \(D\) for the observations.
Plot the PDF and exceedance curve in logscale of traffic density \(D\) computed using both the simulations and the observations.

# Draw random samples
rs_F = ### YOUR CODE HERE ###
rs_v = ### YOUR CODE HERE ###

# Compute D
rs_D = ### YOUR CODE HERE ###

# Repeat for observations
D = ### YOUR CODE HERE ###

# Plot the PDF and the CDF

Task 8:

Interpret the figures above, answering the following questions:

Are there differences between the two computed distributions for the traffic density \(D\)?
What are the advantages and disadvantages of using the simulations?

If you run the code in the cell below, you will obtain a scatter plot of both variables. Explore the relationship between both variables and answer the following questions:

Task 9:

Observe the plot below. What differences do you observe between the generated samples and the observations?
What can you improve into the previous analysis? Do you have any ideas/suggestions on how to implement those suggestions?

fig, axes = plt.subplots(1, 1, figsize=(7, 7))
axes.scatter(rs_F, rs_v, 40, 'k', label = 'Simulations')
axes.scatter(F, v, 40, 'r', marker = 'x', label = 'Observations')
axes.set_xlabel('Traffic flow $F$ [#/h]')
axes.set_ylabel('Average velocity $v$ [km/h]')
axes.legend(loc = "upper right")
axes.grid()
plt.savefig("scatterplot.png",dpi=300)

By Max Ramgraber, Patricia Mares Nasarre and Robert Lanzafame, Delft University of Technology. CC BY 4.0, more info on the Credits page of Workbook.