Lecture 1.2 Activity: The Best Model? You Bet!¶
CEGM1000 MUDE: Week 1.2, Monday, Sep 9, 2024. This assignment does not need to be turned in.
Overview¶
In this assignment we will fit two models to observations of ice break-up date and reflect on their performance.
The data we are working with represents the number of days since the new year that it took for the ice in a river to completely melt and break apart. The record goes from 1917 to 2019, which is in total 103 years of measurements.
Goal of the assignment: reflect on model performance and model accuracy in the context of its use.
We will follow these steps:
- Import data using `numpy`;
- Explore the data;
- Fit a linear regression model and reflect on its goodness of fit;
- Apply confidence intervals to the model;
- Fit a non-linear model and reflect on its goodness of fit.
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as sci
import scipy.optimize as opt
Part 1: Import data¶
We will import the dataset using the `numpy` function `loadtxt`. If you open the `data/days.csv` file, you will notice that the comma is used as the decimal separator for the number of recorded days. For this reason, we will import the data in the following steps:

1. Read the file `data/days.csv` as strings. To this end, we will use the option `dtype=str` within the function `loadtxt`. Note that we also need to set the delimiter between columns as `delimiter=';'` and specify to skip the first row (which contains the column headers) as `skiprows=1`.
2. Replace the commas by dots. We will use the `numpy` function `char.replace` to this end. The first argument is the numpy array of strings where we want to make changes, the second argument is the character we want to look for, and the third argument is what we want to use as replacement.
3. Transform it back to `float` data. We will use `data.astype(float)` for this.
And our data is ready to start playing!
We also printed for you the first 10 elements of the array, so you can see how it actually looks.
data = np.loadtxt('data/days.csv', dtype=str, delimiter=';', skiprows=1)
data = np.char.replace(data, ',', '.')
data = data.astype(float)
data[0:10]
array([[1917. , 119.4791667], [1918. , 130.3979167], [1919. , 122.60625 ], [1920. , 131.4486111], [1921. , 130.2791667], [1922. , 131.5555556], [1923. , 128.0833333], [1924. , 131.6319444], [1925. , 126.7722222], [1926. , 115.66875 ]])
Part 2: Preliminary Analysis¶
One of the first steps when getting familiar with new data is to check its dimensions. To this end, we can use the `numpy` function `shape`.
np.shape(data)
(103, 2)
The result is a (103, 2) array, i.e., a matrix with 103 rows and 2 columns. The first column contains the year of record, while the second one contains the measured data.
We can also compute the mean and the standard deviation of the variable of interest (second column) to get a sense of how the variable behaves.
mean = np.mean(data[:,1])
std = np.std(data[:,1])
print(f'Mean: {mean:.3f}\n\
Standard deviation: {std:.3f}')
Mean: 123.647 Standard deviation: 6.516
We can also quickly plot them to see the scatter of the data and the evolution in time.
plt.scatter(data[:, 0], data[:, 1], label='Measured data')
plt.xlabel('Year [-]')
plt.ylabel('Number of days/year [-]')
plt.title(f'Number of days per year between {data[0,0]:.0f}-{data[-1,0]:.0f}')
plt.grid()
In the figure above, we have plotted the year of the measurement in the x-axis and the number of days until the ice broke during that year in the y-axis. We can see that there is a significant scatter. Also, there seems to be a trend over time: as we go ahead in time (higher values in the x-axis), the number of days until the ice broke seems to decrease.
We have identified a trend, but can we model it?
Part 3: Fit a linear regression model: is it a good model?¶
We are going to create a model which allows us to predict the number of days until the ice broke as a function of the year. For that, we are going to assume a linear relationship between the variables (a linear model) and we will fit it using linear regression. That is, we will fit a regression model $days=m\cdot year+q$, where $m$ represents the slope of the line, and $q$ is the intercept.
We will do this using functions that have already been coded for us: the `scipy.stats` library contains the `linregress` function. For more info see here.

In the next code block we define the function `regression`, which requires x and y values as input and computes the values of $R^2$, $m$, and $q$.

Remember that $R^2$ is a goodness-of-fit metric and you should have read about it here.
def regression(x, y):
'''
Determine linear regression
Input: x = array, x values
y = array, y values
Output: r_sq = coefficient of determination
q = intercept of the line
m = slope of the line
'''
regression = sci.linregress(x, y)
r_sq = regression.rvalue**2
q = regression.intercept
m = regression.slope
print(f'Coefficient of determination R^2 = {r_sq:.3f}')
print(f'Intercept q = {q:.3f} \nSlope m = {m:.3f}')
return r_sq, q, m
r_sq, q, m = regression(data[:,0], data[:,1])
Coefficient of determination R^2 = 0.151 Intercept q = 291.239 Slope m = -0.085
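As a sanity check on what $R^2$ means, here is a small sketch (on synthetic data, not the ice record) comparing the value reported by `linregress` with the definition $R^2 = 1 - SS_{res}/SS_{tot}$:

```python
import numpy as np
import scipy.stats as sci

# Synthetic data with a known linear trend plus noise (illustration only)
rng = np.random.default_rng(42)
x = np.arange(1917, 2020)
y = -0.1 * x + 330 + rng.normal(0, 5, size=x.size)

res = sci.linregress(x, y)
pred = res.slope * x + res.intercept

# R^2 from its definition: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y - pred) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r_sq_manual = 1 - ss_res / ss_tot

# For simple linear regression with an intercept, rvalue**2 equals this ratio
print(np.isclose(res.rvalue ** 2, r_sq_manual))
```

The identity holds for ordinary least-squares lines with an intercept, which is exactly what `linregress` fits.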
Task 3.1: Based on the obtained coefficient of determination, assess the performance of the model.
- What does the coefficient of determination mean in this context?
- Is the developed model accurate?
$\textbf{Solution}$
- Coefficient of determination $ R^2 $ measures the percentage of the variance in our observations explained by the model. Thus, the higher, the better. As we can see, the value of $R^2$ is quite low. Only $\approx15\%$ of the variance is explained by the model, which is very low. Therefore, the linear model is not able to explain the scatter in our observations.
- Based on the answer to the previous question, the linear model is not an accurate model. Whether this low level of accuracy is good enough or not, depends on the use we want to give to the model.
We can also plot the data and the fitted model to see how the fit looks. To do so, we can make computations using the previous equation $days=m\cdot year+q$ with the fitted intercept $q$ and slope $m$. We have already defined a function which does it for you.
def calculate_line(x, m, q):
'''
Determine y values from linear regression
Input: x = array
m = slope of the line
q = intercept of the line
Output: y = array
'''
y = m * x + q
return y
line = calculate_line(data[:,0], m, q)
fig, axes = plt.subplots(1, 2,figsize = (12, 4))
axes[0].scatter(data[:,0], data[:,1], label = 'Observations')
axes[0].plot(data[:,0], line, color='r', label='Fitted line')
axes[0].set_ylabel('Number of days/year [-]')
axes[0].set_xlabel('Year [-]')
axes[0].grid()
axes[0].legend()
axes[0].set_title('(a) Number of days as function of the year')
axes[1].scatter(data[:,1], line)
axes[1].plot([105, 145], [105, 145], color='k')
axes[1].set_xlim([105, 145])
axes[1].set_ylim([105, 145])
axes[1].set_ylabel('Predicted number of days/year [-]')
axes[1].set_xlabel('Observed number of days/year [-]')
axes[1].grid()
axes[1].set_title('(b) Observed and predicted number of days');
Task 3.2:
Interpret the figures above. Do the previous plots fit with the result given by the coefficient of determination? Are they aligned?
$\textbf{Solution}$
In the plot (a), we observe that the observations have a high scatter around the fitted line, and visually it looks consistent with the $R^2$ value found above.
In the plot (b), we compare the observations with the predictions of the model. The perfect fit would correspond to the 45-degrees line in black. Thus, the model has a poor performance as we already quantified using the coefficient of determination. Both results are aligned.
We can also assess the scatter using the Root Mean Square Error ($RMSE$). Don't you remember it? Go back to the book!
Let's see how our model performs for this metric.
def RMSE(data, fit_data):
'''
Compute the RMSE
RMSE = [sum{(data - fit_data)^2} / N]^(1/2)
Input: data = array with real measured data
fit_data = array with predicted data
Output: RMSE
'''
diff_n = (data - fit_data)**2
mean = np.mean(diff_n)
error = mean**(1/2)
print(f'RMSE = {error:.3f}')
return error
RMSE_line = RMSE(data[:,1], line)
RMSE = 6.003
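To make the $RMSE$ formula concrete, here is a tiny worked example with made-up numbers (not the ice data):

```python
import numpy as np

obs = np.array([10.0, 12.0, 14.0])
pred = np.array([11.0, 12.0, 12.0])  # errors: -1, 0, +2

# RMSE = sqrt(mean of squared errors) = sqrt((1 + 0 + 4) / 3)
rmse = np.sqrt(np.mean((obs - pred) ** 2))
print(round(rmse, 3))  # sqrt(5/3) ≈ 1.291
```

Note that squaring before averaging means large errors dominate: a single error of 2 days contributes four times as much as an error of 1 day.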
Task 3.3: Based on the obtained RMSE, assess the performance of the model. Answer in the cell below using markdown:
- What does RMSE mean in this context?
- Is the developed model accurate according to RMSE?
$\textbf{Solution}$
- $RMSE$ quantifies the typical magnitude of the error between the observations and the predictions of the model. Here it tells us that predictions are off by $\approx 6$ days on average. Therefore, the linear model is not able to explain the scatter in our observations.
- Based on the previous interpretation, the linear model is not accurate. Whether this low level of accuracy is good enough or not, depends on the use we want to give to the model.
Finally, we can compute the bias of our model using $rbias$.
def rbias(data, fit_data):
'''
Compute the relative bias
rbias = [sum{(fit_data-data) / |data|}]/N
Input: data = array with real measured data
fit_data = array with predicted data
Output: relative bias
'''
bias = np.mean((fit_data-data)/data)
print(f'rbias = {bias:.3f}')
return bias
rbias_line = rbias(data[:,1], line)
rbias = 0.002
Task 3.4: Based on the obtained relative bias, assess the performance of the model. Answer in the cell below using markdown:
- What does `rbias` mean in this context?
$\textbf{Solution}$
`rbias` provides a standardized measure of the systematic tendency of our model to under- or over-predict. It is very low for our model, so the model does not have a clear tendency to under- or overestimate and, thus, does not seem to be biased.
Part 4: Confidence Intervals¶
One way of assessing the uncertainty around the predictions of a model is with confidence intervals. They give us insight into the precision of the predictions by transforming them into probabilities. In short, the 95% confidence interval (significance $\alpha=0.05$) shows the range of values within which an observation would fall with a probability of 95%. Here, we want you to focus on their interpretation. In the following week (1.3), you will learn more about how to compute them.
Note: The confidence intervals as computed here and later in part 5 are based on a simplification, in week 1.3 you will see how to compute the confidence intervals correctly. This will also result in a different 'shape' (curved intervals).
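The key ingredient in the simplified interval used below is the standard-normal quantile $z_{1-\alpha/2}$. As a quick aside, this is where the familiar factor of about 1.96 for a 95% interval comes from:

```python
import scipy.stats as sci

alpha = 0.05
# Two-sided quantile: 2.5% probability in each tail
z = sci.norm.ppf(1 - alpha / 2)
print(round(z, 3))  # ≈ 1.96
```

The interval half-width is then this factor times the standard deviation of the residuals, which is the simplification mentioned in the note above.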
def conf_int(x, y, alpha):
'''
Compute the confidence intervals
Input: x = array, observations
y = array, predictions
alpha = float, confidence interval
Output: k = float, width of the confidence interval
'''
sd_error = (y - x).std()
k = sci.norm.ppf(1-alpha/2)*sd_error
return k
k = conf_int(data[:,1], line, 0.05)
ci_low = line - k
ci_up = line + k
#plot
plt.scatter(data[:,0], data[:,1], label = 'Observations')
plt.plot(data[:,0], line, color='r', label='Fitted line')
plt.plot(data[:,0], ci_low, '--k')
plt.plot(data[:,0], ci_up, '--k')
plt.ylabel('Number of days/year [-]')
plt.xlabel('Year [-]')
plt.grid()
plt.legend()
plt.title('Number of days as function of the year')
Task 4.1: What can you conclude from the previous plot? Think about the accuracy of the prediction compared to the precision you need in your bet (up to minutes!).
Solution: If you consider that you need to place a bet with not only the day but also the hour and minute at which the ice would break, the model is not accurate enough. You can see that the confidence interval spans more than 20 days!
Part 5: Non-linear Models¶
As we have seen, the data-driven linear model is not really a good choice for representing the data we have. Let's try with one which is slightly more complicated: a non-linear model.
In this section, we will analyze the fitting of a quadratic model as $days = A \cdot year^2 + B \cdot year + C$. The steps are the same as in the previous section, so we will go fast through the code to focus on the interpretation and comparison between the two models.
You do not need to worry about this right now, but in case you are curious: we will make use of the `scipy.optimize` library, which contains the `curve_fit` function. For further info on the function see here.
def parabola(x, a, b, c):
'''
Compute the quadratic model
y = a * x^2 + b * x + c
Input: x = array, independent variable
a, b, c = parameters to be optimized
Output: y = array, dependent variable
'''
y = a * x**2 + b * x + c
return y
popt_parabola, pcov_parabola = opt.curve_fit(parabola, data[:,0], data[:,1])
print(f'Optimal estimation for parameters:\n\
a = {popt_parabola[0]:.3e}, b = {popt_parabola[1]:.3f}, c = {popt_parabola[2]:.3f}\n')
print(f'Covariance matrix for parameters:\n\
Sigma = {pcov_parabola}')
Optimal estimation for parameters: a = -1.277e-03, b = 4.942, c = -4654.244 Covariance matrix for parameters: Sigma = [[ 5.60366689e-07 -2.20560328e-03 2.16981823e+00] [-2.20560328e-03 8.68165065e+00 -8.54118418e+03] [ 2.16981823e+00 -8.54118418e+03 8.40337452e+06]]
Therefore, our parabola now looks like $days = -1.277 \cdot 10^{-3} \cdot year^2 + 4.942 \cdot year - 4654.244$.
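As an aside (not part of the assignment): because the quadratic model is linear in its parameters $a$, $b$, $c$, `curve_fit` solves the same least-squares problem that `np.polyfit` solves directly. A small sketch on synthetic data, illustrating that the two agree:

```python
import numpy as np
import scipy.optimize as opt

def parabola(x, a, b, c):
    return a * x**2 + b * x + c

# Synthetic quadratic data with noise (illustration only)
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 * x**2 - 3.0 * x + 1.0 + rng.normal(0, 0.5, size=x.size)

popt, _ = opt.curve_fit(parabola, x, y)
coeffs = np.polyfit(x, y, 2)  # least-squares degree-2 polynomial, highest power first

print(np.allclose(popt, coeffs, rtol=1e-4))
```

For models that are genuinely non-linear in their parameters, `curve_fit` is the more general tool, which is why it is introduced here.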
Now that we have fitted it, we can use it to compute predictions.
fitted_parabola = parabola(data[:,0], *popt_parabola)
We can also determine the confidence intervals for this fit and see how it looks!
k = conf_int(data[:,1], fitted_parabola, 0.05)
ci_low_2 = fitted_parabola - k
ci_up_2 = fitted_parabola + k
#plot
plt.scatter(data[:,0], data[:,1], label = 'Observations')
plt.plot(data[:,0], fitted_parabola, color='r', label='Fitted line')
plt.plot(data[:,0], ci_low_2, '--k')
plt.plot(data[:,0], ci_up_2, '--k')
plt.ylabel('Number of days/year [-]')
plt.xlabel('Year [-]')
plt.grid()
plt.legend()
plt.title('Number of days as function of the year')
And finally compute the GOF metrics.
RMSE_parabola = RMSE(data[:,1], fitted_parabola)
R2_parabola = 1-((data[:,1]-fitted_parabola)**2).mean()/(data[:,1].var())
print(f'Coefficient of determination = {R2_parabola:.3f}')
rbias_parabola = rbias(data[:,1], fitted_parabola)
RMSE = 5.918 Coefficient of determination = 0.175 rbias = 0.002
Task 5.1: Based on the previous plot and the computed GOF metrics, is the quadratic model better than the linear one?
Solution: No! The GOF metrics are actually similar while the non-linear model is more complicated (e.g., it includes more parameters to fit).
At the same time we can see that the observations do fall quite well within the 95% confidence bounds. Moreover, the 'misfit' is clearly due to large year-to-year variability in the observed number of days. This variability might be explainable by the year-to-year weather conditions (e.g., number of heatwaves). The question is then: can we find a model by including a weather-related explanatory variable to get a better fit? But also, is this really needed? That depends on whether you are interested in predicting the number of days on the short term (e.g., next year) or the number of days in the long term (e.g., in 10 years or longer).
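One common way to fold model complexity into the comparison (not used in this assignment, shown only as an aside) is the adjusted $R^2 = 1 - (1 - R^2)\frac{n-1}{n-p-1}$, which penalizes extra parameters $p$. A sketch using the $R^2$ values printed above:

```python
def adjusted_r2(r_sq, n, p):
    # Penalize extra parameters: 1 - (1 - R^2) * (n - 1) / (n - p - 1)
    return 1 - (1 - r_sq) * (n - 1) / (n - p - 1)

n = 103  # years of record
adj_linear = adjusted_r2(0.151, n, p=1)    # linear model: one predictor (year)
adj_parabola = adjusted_r2(0.175, n, p=2)  # quadratic model: two predictors

print(round(adj_linear, 3), round(adj_parabola, 3))
```

Both adjusted values remain small and close to each other, consistent with the conclusion that the extra parameter does not buy a meaningfully better model.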
We will revisit this problem in the part on Sensing and Observation Theory in week 1.4.