Assessing the uncertainties in the compressive strength of the produced concrete is key for the safety of infrastructures and buildings. However, a lot of boundary conditions influence the final resistance of the concrete, such the cement content, the environmental temperature or the age of the concrete. Probabilistic tools can be applied to model this uncertainty. In this workshop, you will work with a dataset of observations of the compressive strength of concrete (you can read more about the dataset here).
The goal of this project is:
- Choose a reasonable distribution function for the concrete compressive strength analyzing the statistics of the observations.
- Fit the chosen distributions by moments.
- Assess the fit computing probabilities analytically.
- Assess the fit using goodness of fit techniques and computer code.
The project will be divided into 3 parts: 1) data analysis, 2) pen and paper stuff (math practice!), and 3) programming.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from scipy import stats
from math import ceil, trunc
plt.rcParams.update({'font.size': 14})
Part 1: Explore the data¶
First step in the analysis is exploring the data, visually and through its statistics.
# Import
data = np.genfromtxt('dataset_concrete.csv', delimiter=",", skip_header=True)
# Clean
data = data[~np.isnan(data)]
# plot time series
plt.figure(figsize=(10, 6))
plt.plot(data,'ok')
plt.xlabel('# observation')
plt.ylabel('Concrete compressive strength [MPa]')
plt.grid()
weights = 5*np.ones(len(data))
plt.hist(data, orientation='horizontal', weights=weights, color='lightblue', rwidth=0.9)
In the figure above, you can see all the observations of concrete compressive strength. You can see that there is no clear pattern in the observations. Let's see how the statistics look like!
# Statistics
df_describe = pd.DataFrame(data)
df_describe.describe()
Task 1.1: Using ONLY the statistics calculated in the previous lines:
Your answer here.
Part 2: Use pen and paper!¶
Once you have selected the appropriate distribution, you are going to fit it by moments manually and check the fit by computing some probabilities analytically. Remember that you have all the information you need in the textbook. Do not use any computer code for this section, you have to do in with pen and paper. You can use the notebook as a calculator.
Task 2.1: Fit the selected distribution by moments.
Your answer here.
We can now check the fit by computing manually some probabilities from the fitted distribution and comparing them with the empirical ones.
Task 2.2: Check the fit of the distribution:
Minimum value | P25% | P50% | P75% | Maximum value | |
---|---|---|---|---|---|
Non-exceedance probability | |||||
Empirical quantiles | |||||
Predicted quantiles |
Part 3: Let's do it with Python!¶
Now, let's assess the performance using further goodness of fit metrics and see whether they are consistent with the previously done analysis.
Task 3.1: Prepare a function to compute the empirical cumulative distribution function.
def ecdf(YOUR_CODE_HERE):
YOUR_CODE_HERE # may be more than one line
return YOUR_CODE_HERE
Task 3.2: Transform the fitted parameters for the selected distribution to loc-scale-shape.
Hint: the distributions are in our online textbook, but it is also critical to make sure that the formulation in the book is identical to that of the Python package we are using. You can do this by finding the page of the relevant distribution in the Scipy.stats documentation.
Task 3.3: Assess the goodness of fit of the fitted distribution by:
Hint: Use Scipy built in functions (watch out with the parameters definition!).
loc = YOUR_CODE_HERE
scale = YOUR_CODE_HERE
fig, axes = plt.subplots(1, 1, figsize=(10, 5))
axes.hist(YOUR_CODE_HERE,
edgecolor='k', linewidth=0.2, color='cornflowerblue',
label='Empirical PDF', density = True)
axes.plot(YOUR_CODE_HERE, YOUR_CODE_HERE,
'k', linewidth=2, label='YOUR_DISTRIBUTION_NAME_HERE PDF')
axes.set_xlabel('Compressive strength [MPa]')
axes.set_title('PDF', fontsize=18)
axes.legend()
fig, axes = plt.subplots(1, 1, figsize=(10, 5))
axes.step(YOUR_CODE_HERE, YOUR_CODE_HERE,
color='k', label='Empirical CDF')
axes.plot(YOUR_CODE_HERE, YOUR_CODE_HERE,
color='cornflowerblue', label='YOUR_DISTRIBUTION_NAME_HERE CDF')
axes.set_xlabel('Compressive strength [MPa]')
axes.set_ylabel('${P[X > x]}$')
axes.set_title('Exceedance plot in log-scale', fontsize=18)
axes.set_yscale('log')
axes.legend()
axes.grid()
fig, axes = plt.subplots(1, 1, figsize=(10, 5))
axes.plot([0, 120], [0, 120], 'k')
axes.scatter(YOUR_CODE_HERE, YOUR_CODE_HERE,
color='cornflowerblue', label='Gumbel')
axes.set_xlabel('Observed compressive strength [MPa]')
axes.set_ylabel('Estimated compressive strength [MPa]')
axes.set_title('QQplot', fontsize=18)
axes.set_xlim([0, 120])
axes.set_ylim([0, 120])
axes.set_xticks(np.arange(0, 121, 20))
axes.grid()