GA 2.4: Beary Icy


CEGM1000 MUDE: Week 2.4, Time Series Analysis. For: December 6, 2024

Winter is coming and it is time to start getting our models ready for the ice classic. Our first goal is to improve the temperature model, as that seems to be an important factor in determining breakup day. Temperature is notoriously hard to predict, but we can analyze historical data to get a better understanding of the patterns.

In this assignment we will analyze a time series from a single year; in fact, only the first 152 days of the year, from January 1 until June 1. This is the period of interest for the ice classic: the ice forms in this period, reaching its maximum thickness between January and March, and then starts melting, with breakup day typically happening in April or May.

Remember that we have until April 5 to place a bet. Why, then, do we want to fit a model several months beyond this point? Fitting over the full period lets us assess the model's ability to predict temperature, so that when we use it on April 5 to make predictions about the future, we understand the uncertainty associated with them.

Let's start by loading the data and plotting it, then we will determine which components should be used to detrend it.

In [ ]:
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf
from scipy.signal import periodogram

Part 1: Load the data and plot it

Task 1.1:

Do the following:

  • load the data
  • create time vector
  • plot the data

In [ ]:
YOUR_CODE_HERE

data = YOUR_CODE_HERE # Temperature data
time_days = YOUR_CODE_HERE # Time in days
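If you are stuck, the cell could be structured as in the minimal sketch below. The filename temperature.csv is hypothetical, as is the assumption that the series starts on January 1 and is sampled once per day; adapt both to the data file provided with this assignment.

# Minimal sketch (hypothetical filename; daily sampling assumed)
data = np.loadtxt('temperature.csv')      # Temperature data [°C]
time_days = np.arange(1, len(data) + 1)   # Time in days (day 1 = January 1)

plt.figure(figsize=(10, 3))
plt.plot(time_days, data)
plt.xlabel('Time [days]')
plt.ylabel('Temperature [°C]')
plt.title('Temperature data Nenana, Alaska')
plt.grid(True)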

Task 1.2:

Use the Markdown cell below to describe the data (you can use a few bullet points). For example, confirm relevant characteristics like number of points, units, describe the values (qualitatively), etc.

Your answer here.

Part 2: Extract the Dominant Patterns

We clearly see that the data contains a strong pattern (the general increase in temperature from winter to summer). We will start by fitting a functional model to the data in order to make it stationary. To find the frequency of the seasonal pattern we will use the power spectrum of the data.

We will reuse the function find_seasonal_pattern from the workshop.

Remember that to run this function we need to predefine the A-matrix used to detrend the data. Since the data only contains the first 5 months of the year, we see that the temperature is increasing over time. What type of model would be most appropriate to remove the seasonal trend?
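As an illustration only (not necessarily the best choice here), an A-matrix with an intercept and a linear term would be built as:

# Illustration: intercept plus linear trend as columns of the A-matrix
A_linear = np.column_stack((np.ones(len(time_days)), time_days))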

Task 2.1:

Define functions to help carry out this analysis, for example, fit_model and find_frequency.

In [ ]:
def fit_model(data, time, A, plot=False):
    '''
    Function to find the least squares solution of the data
    data: input data
    time: time vector
    A: A-matrix to fit the data
    plot: boolean to plot the results or not
    '''

    x_hat = YOUR_CODE_HERE # least squares solution
    y_hat = YOUR_CODE_HERE # model prediction
    e_hat = YOUR_CODE_HERE # residuals

    if plot:
        plt.figure(figsize=(10, 5))
        plt.subplot(211)
        plt.plot(time, data, label='Data')
        plt.plot(time, y_hat, label='Estimated data')
        plt.xlabel('Time [days]')
        plt.ylabel('Temperature [°C]')
        plt.title('Data vs Estimated data')
        plt.grid(True)
        plt.legend()
        plt.subplot(212)
        plt.plot(time, e_hat, label='Residuals')
        plt.xlabel('Time [days]')
        plt.ylabel('Temperature [°C]')
        plt.title('Residuals')
        plt.grid(True)
        plt.legend()
        plt.tight_layout()

    return x_hat, y_hat, e_hat

def find_frequency(data, time, A, fs, plot=True):
    '''
    Function to find the dominant frequency of the signal
    data: input data
    time: time vector
    A: A-matrix to detrend the data (prior to spectral analysis)
    fs: sampling frequency
    plot: boolean to plot the psd or not
    '''
    # Detrending the data
    _, _, e_hat = fit_model(YOUR_CODE_HERE)

    N = len(data)

    # Finding the dominant frequency in e_hat
    freqs, pxx = periodogram(YOUR_CODE_HERE, fs=YOUR_CODE_HERE, window='boxcar',
                                nfft=N, return_onesided=False,
                                scaling='density')

    # finding the dominant frequency and amplitude
    # Note: there are many ways to do this
    amplitude = YOUR_CODE_HERE # PSD value at the dominant frequency
    dominant_frequency = YOUR_CODE_HERE # Dominant frequency


    # Plotting the PSD
    if plot:
        plt.figure(figsize=(10, 5))
        plt.subplot(211)
        plt.plot(time, e_hat)
        plt.title('Residuals')
        plt.ylabel('Temperature [°C]')
        plt.grid(True)
        plt.subplot(212)
        plt.plot(freqs[freqs>0], pxx[freqs>0], label='PSD of residuals')
        plt.xlabel('Frequency')
        plt.ylabel('PSD')
        plt.title('Power Spectral Density')
        plt.grid(True)
        plt.plot(dominant_frequency, amplitude, 'ro', label='Dominant Frequency')
        plt.yscale('log')
        plt.xscale('log')
        plt.legend()
        plt.tight_layout()

    return dominant_frequency
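As a hint for the missing lines in fit_model: a minimal sketch of the ordinary least-squares steps, assuming A has full column rank, is:

# Ordinary least squares (sketch of the three missing lines in fit_model)
x_hat = np.linalg.solve(A.T @ A, A.T @ data)   # least squares solution
y_hat = A @ x_hat                              # model prediction
e_hat = data - y_hat                           # residuals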

Task 2.2:

Now provide an A-matrix that removes the trend from the data. There are multiple answers that will work, but some are better than others.

First, use the Markdown cell below to define your A-matrix and include a brief explanation justifying your choice.

Your answer here.

Task 2.3:

Now define the A-matrix in code and extract the seasonal pattern. Continue extracting components until the time series is stationary (you will then summarize your findings in the next task).

In [ ]:
# YOUR_CODE_HERE
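One possible structure for this cell is sketched below, assuming a daily sampling frequency (fs=1, since time is in days) and the detrending matrix from Task 2.2; how many components to extract is for you to decide based on the residuals.

# Sketch: extract dominant frequencies one at a time (two passes here is
# an assumption; stop when the residuals look stationary)
A = A_linear
for _ in range(2):
    f_dom = find_frequency(data, time_days, A, fs=1)
    A = np.column_stack((A,
                         np.cos(2 * np.pi * f_dom * time_days),
                         np.sin(2 * np.pi * f_dom * time_days)))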

Task 2.4:

Describe how you have detrended the time series. Include at least: a) the number and types of components used (and their parameters; in task 2.5 you will print those), b) how you decided to stop extracting components.

Your answer here.

Fitting the Functional Model

In the next cell we will fit the model to generate stationary residuals. Above, you may have found a periodic signal, where for each dominant frequency $f_i$ ($i=1,2$) the model contains the terms:

$$a_i \cos(2\pi f_i t) + b_i \sin(2\pi f_i t)$$

However, to report the periodic signals we would like to have the amplitude, phase shift and frequency of each signal, which can be recovered from the equivalent form:

$$A_i \cos(2\pi f_i t + \theta_i)$$

where the amplitude is $A_i = \sqrt{a_i^2 + b_i^2}$ and the phase shift is $\theta_i = \arctan(-b_i/a_i)$.

Note: this was shown in Section 4.1 of the book, where the angular frequency $\omega = 2\pi f$ was used.
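A minimal sketch of this conversion is given below; np.arctan2 is used instead of np.arctan so that the phase ends up in the correct quadrant.

# Sketch of the amplitude/phase conversion for one seasonal component
def rewrite_seasonal_comp_sketch(ak, bk):
    Ak = np.sqrt(ak**2 + bk**2)        # amplitude A_i
    theta_k = np.arctan2(-bk, ak)      # phase shift theta_i
    return Ak, theta_k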

Task 2.5:

Complete the code cell below to create the functional model.

In [ ]:
def rewrite_seasonal_comp(ak, bk):
    '''
    Function to rewrite the seasonal component in terms of sin and cos
    ak: seasonal component coefficient for cos
    bk: seasonal component coefficient for sin

    returns: Ak, theta_k
    '''
    YOUR_CODE_HERE

# creating the A matrix of the functional model
A = YOUR_CODE_HERE
x_hat, y_hat, e_hat = YOUR_CODE_HERE

# Plotting the data and the estimated trend
plt.figure(figsize=(10, 3))
plt.plot(time_days, data, label='Original data')
plt.plot(time_days, y_hat, label='Estimated trend')
plt.xlabel('Time [days]')
plt.ylabel('Temperature [°C]')
plt.title('Temperature data Nenana, Alaska')
plt.grid(True)
plt.legend()

# Plotting the residuals
plt.figure(figsize=(10, 3))
plt.plot(time_days, e_hat)
plt.xlabel('Time [days]')
plt.ylabel('Temperature [°C]')
plt.title('Residuals')
plt.grid(True)

# Extracting the seasonal component coefficients from the estimated parameters
a_i = YOUR_CODE_HERE
b_i = YOUR_CODE_HERE
freqs = YOUR_CODE_HERE

print(f'Estimated Parameters:')
for i in range(len(x_hat)):
    print(f'x{i} = {x_hat[i]:.3f}')

print('\nThe seasonal component is rewritten as:')
i = 0
for a, b, f in zip(a_i, b_i, freqs):
    A_i, theta_i = rewrite_seasonal_comp(a, b)
    i += 1
    print(f'A_{i} = {A_i:.3f}, theta_{i} = {theta_i:.3f}, f_{i} = {f:.3f}')

Task 2.6:

Are the residuals stationary? State yes or no and describe why in the cell below.

Your answer here.

Part 3: Finding the grizzly

When we look at the residuals after removing the periodic pattern(s), we see that there is still a pattern in the data. From researchers in the Nenana area we have heard that there is a grizzly bear that likes to take a nap (hibernate) in the area. We suspect that the grizzly bear has slept too close to the temperature sensor and has influenced the data.

In the next cell we will write an offset detection algorithm to find the offset in the data. The offset detection algorithm is based on the likelihood ratio test framework. However, due to the presence of autocorrelation in the residuals, the traditional critical values for the likelihood ratio test are not valid. Therefore, we will use a bootstrap approach to estimate the critical values. Luckily, this is not the first time we have had to remove a grizzly bear from our data, so we know that the estimated critical value is approximately 100 (i.e. you do not have to find this value yourself!).

The offset detection algorithm

The offset detection algorithm is based on the likelihood ratio test framework. The test statistic of the likelihood ratio test is given by:

$$\Lambda = n \log \left( \frac{SSR_0}{SSR_1} \right)$$

$$SSR_j = \sum_{t=1}^{n} \hat{e}_{t,j}^2, \quad j = 0, 1$$

where $SSR_0$ is the sum of the squared residuals for the model without an offset, $SSR_1$ is the sum of the squared residuals for the model with an offset, and $n$ is the number of data points. The likelihood ratio test statistic is compared to a critical value to determine if an offset is present in the data.
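In code, the test statistic is a short computation; a minimal sketch for two residual vectors:

# Sketch of the LR test statistic for residuals under H0 (e0) and H1 (e1)
def lr_statistic(e0, e1):
    n = len(e0)
    return n * np.log(np.sum(e0**2) / np.sum(e1**2))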

The cell below defines several functions which roughly accomplish the following:

  1. Calculate the sum of the squared residuals for the model without an offset, $SSR_0$.
  2. Calculate the sum of the squared residuals for the model with an offset at each possible point, $SSR_1$:
    1. For each possible offset location, calculate the sum of the squared residuals for the model with an offset at that data point.
    2. The A-matrix for the model with an offset is the same as the A-matrix for the model without an offset, but with an additional column that is 0 up to the data point and 1 after it (a sketch of this column follows the list).
  3. At each possible offset location, calculate the likelihood ratio test statistic and store it in the results vector.
  4. Find the offset location that maximizes the likelihood ratio test statistic, i.e. the location where an offset is most likely.
  5. Include the offset in the model and repeat the process until the likelihood ratio test statistic is below the critical value.
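A minimal sketch of the step column described in the list above (the helper name is illustrative):

# Sketch: append a column that is 0 before the break point and 1 from
# the break point onward
def A1_matrix_sketch(A0, break_point):
    step = np.zeros(A0.shape[0])
    step[break_point:] = 1.0
    return np.column_stack((A0, step))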

Task 3.1:

Using the description above and the comments and docstring in the code, fill in the code below to complete the offset detection algorithm.

In [ ]:
def A1_matrix(A0, break_point):
    '''
    Function to create the A1 matrix
    A0: A matrix under H0
    break_point: break point location
    return: A1 matrix
    '''
    # create the new column and stack it to the A0 matrix
    YOUR_CODE_HERE
    
    return YOUR_CODE_HERE


def LR(e0, e1, cv=100, verbose=True):
    '''
    Function to perform the LR test
    e0: residuals under H0
    e1: residuals under H1
    cv: critical value
    '''
    n = YOUR_CODE_HERE
    SSR0 = YOUR_CODE_HERE
    SSR1 = YOUR_CODE_HERE
    test_stat = YOUR_CODE_HERE
    
    if test_stat > cv:
        if verbose:
            print(f'Test Statistic: {test_stat:.3f} > Critical Value: {cv:.3f}')
            print('Reject the null hypothesis')
    else:
        if verbose:
            print(f'Test Statistic: {test_stat:.3f} < Critical Value: {cv:.3f}')
            print('Fail to reject the null hypothesis')
    return test_stat

def jump_detection(data, time, A, cv=100, plot=True):
    '''
    Function to detect the jump in the data
    data: input data
    time: time vector
    A: A matrix under H0
    cv: critical value
    plot: boolean to plot the results or not
    '''
    # initialize the results vector
    results = YOUR_CODE_HERE
    # find the residuals under H0
    YOUR_CODE_HERE

    # loop over the data points
    for i in range(1, len(data)):
        # create the A1 matrix
        A1 = YOUR_CODE_HERE

        # We need this statement to avoid singular matrices
        if np.linalg.matrix_rank(A1) < A1.shape[1]:
            pass
        else:
            # find the residuals under H1
            _, _, e_hat1 = YOUR_CODE_HERE
            test_stat = YOUR_CODE_HERE
            results[i] = YOUR_CODE_HERE

    results = np.array(results)
    
    # finding the offset location. 
    # Offset is the location where the test statistic is maximum
    location = YOUR_CODE_HERE
    value = YOUR_CODE_HERE

    if plot:
        plt.figure(figsize=(10, 3))
        plt.plot(time, results)
        plt.plot(time[location], value, 'ro', label='offset')
        plt.plot([0, max(time)], [cv, cv], 'k--', label='Critical Value')
        plt.xlabel('Time [days]')
        plt.ylabel('Test Statistic')
        plt.title('LR Test')
        plt.grid(True)
        plt.legend()

    return location, value

Task 3.2:

Before we implement the offset detection algorithm, use the Markdown cell below to answer the following in a few sentences or bullet points:

How is this process similar to the one we used to find a periodic pattern? How is it different?

Your answer here.

Task 3.3:

Now we will implement the offset detection algorithm, using the functions defined above to find the offset in the data. These functions will produce figures from which you will be able to determine the offset.

In [ ]:
YOUR_CODE_HERE
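One way to structure this cell is sketched below, assuming A is the detrending A-matrix from Part 2 and using the critical value of 100 given above.

# Sketch: detect an offset, absorb it into the model, and repeat until
# the maximum test statistic drops below the critical value
cv = 100
location, value = jump_detection(data, time_days, A, cv=cv)
while value > cv:
    A = A1_matrix(A, location)
    location, value = jump_detection(data, time_days, A, cv=cv)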

Task 3.4:

Write your chosen offset in the cell below (report both the size and location of the offset).

My offset is: ...

Task 3.5:

Once you have found the offset location, update your A-matrix to include it in the model. Then repeat the process until the likelihood ratio test statistic falls below the critical value.

In [ ]:
A2 = YOUR_CODE_HERE
x_hat, y_hat, e_hat = fit_model(YOUR_CODE_HERE)

# Plotting the data and the estimated trend
plt.figure(figsize=(10, 3))
plt.plot(time_days, data, label='Original data')
plt.plot(time_days, y_hat, label='Estimated trend')
plt.xlabel('Time [days]')
plt.ylabel('Temperature [°C]')
plt.title('Temperature data Nenana, Alaska')
plt.grid(True)
plt.legend()

# Plotting the residuals
plt.figure(figsize=(10, 3))
plt.plot(time_days, e_hat)
plt.xlabel('Time [days]')
plt.ylabel('Temperature [°C]')
plt.title('Residuals')
plt.grid(True)

# Extracting the seasonal component coefficients from the estimated parameters
a_i = YOUR_CODE_HERE
b_i = YOUR_CODE_HERE
freqs = YOUR_CODE_HERE

print(f'Estimated Parameters:')
for i in range(len(x_hat)):
    print(f'x{i} = {x_hat[i]:.3f}')

print('\nThe seasonal component is rewritten as:')
i = 0
for a, b, f in zip(a_i, b_i, freqs):
    A_i, theta_i = rewrite_seasonal_comp(a, b)
    i += 1
    print(f'A_{i} = {A_i:.3f}, theta_{i} = {theta_i:.3f}, f_{i} = {f:.3f}')

Task 3.6:

Use the Markdown cell below to summarize the location and size of the offset(s) you have found. Include the number of components used in the final model.

Your answer here.

Part 4: Analyzing the residuals

Now that we have our residuals, we can fit an AR model to them. We will start by plotting the ACF of the residuals, then fit an AR model and report its parameters. Using the likelihood ratio test framework, we will determine the order of the AR model.

In [ ]:
# Let's start with the ACF plot
fig, ax = plt.subplots(1, 1, figsize=(10, 3))
plot_acf(e_hat, ax=ax, lags=20);
ax.grid()

Task 4.1:

Begin by completing the function below to define the AR(1) model (hint: you did this on Wednesday).
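As a reminder of the lag construction inside the function (a minimal sketch; s is the input series, as in the docstring below):

# Sketch: regress the series on its own first lag
y = s[1:]          # observations at t = 2, ..., n
y_lag_1 = s[:-1]   # the same series shifted by one time step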

In [ ]:
def AR1(s, time, plot=True):
    '''
    Function to find the AR(1) model of the given data
    s: input data
    return: x_hat, e_hat
    '''
    y = YOUR_CODE_HERE
    y_lag_1 = YOUR_CODE_HERE
    A = np.atleast_2d(y_lag_1).T
    x_hat, y_hat, e_hat = fit_model(YOUR_CODE_HERE)

    if plot:
        fig, ax = plt.subplots(2, 1, figsize=(10, 5))
        ax[0].plot(time[1:], y, label='Original Residuals')
        ax[0].plot(time[1:], y_hat, label='Estimated Residuals')
        ax[0].set_xlabel('Time [days]')
        ax[0].set_ylabel('Temperature [°C]')
        ax[0].set_title('Original Data vs Estimated Data')
        ax[0].grid(True)
        ax[0].legend()
        plot_acf(e_hat, ax=ax[1], lags=20)
        ax[1].grid()
        fig.tight_layout()
        
    print(f'Estimated Parameters:')
    print(f'phi = {x_hat[0]:.4f}')

    return x_hat, e_hat

# Estimating the AR(1) model
phi_hat_ar1, e_hat_ar1 = AR1(YOUR_CODE_HERE)

Task 4.2:

  • As you can see, the next task asks you to implement AR(2). State why this is necessary, using the results from the cell above.
  • Based on the ACF plot, will the $\phi_2$ parameter in the AR(2) model be positive or negative? Why?

Your answer here.

Task 4.3:

Now complete the function to set up AR(2).

In [ ]:
def AR2(s, time, plot=True):
    '''
    Function to find the AR(2) model of the given data
    s: input data
    return: x_hat, e_hat
    '''
    y = YOUR_CODE_HERE
    y_lag_1 = YOUR_CODE_HERE
    y_lag_2 = YOUR_CODE_HERE
    A = YOUR_CODE_HERE
    x_hat, y_hat, e_hat = fit_model(YOUR_CODE_HERE)

    if plot:
        fig, ax = plt.subplots(2, 1, figsize=(10, 5))
        ax[0].plot(time[2:], y, label='Original Residuals')
        ax[0].plot(time[2:], y_hat, label='Estimated Residuals')
        ax[0].set_xlabel('Time [days]')
        ax[0].set_ylabel('Temperature [°C]')
        ax[0].set_title('Original Data vs Estimated Data')
        ax[0].grid(True)
        ax[0].legend()
        plot_acf(e_hat, ax=ax[1], lags=20)
        ax[1].grid()
        fig.tight_layout()

    print(f'Estimated Parameters:')
    print(f'phi_1 = {x_hat[0]:.4f}, phi_2 = {x_hat[1]:.4f}')

    return x_hat, e_hat

# Estimating the AR(2) model
phi_hat_ar2, e_hat_ar2 = AR2(YOUR_CODE_HERE)

Part 5: Report the Results

Note: you did this on Wednesday! It was optional then, so you are not expected to know this for the exam; however, you should implement the code using the workshop (WS) as a template, and your interpretation at the end will be part of the grade for this assignment.

Now that we have found the periodic signals in the data and fitted an AR model to the residuals, we can report the results. By including the AR (noise) process in the model, we obtain residuals that are white noise. When the model has white noise residuals, we can also report the confidence intervals of the model; the estimated variance is only consistent when the residuals are white noise.

We will use the unbiased estimate of the variance of the residuals to calculate the confidence intervals. The unbiased estimate of the variance is given by:

$$\hat{\sigma}^2 = \frac{1}{n-p} \sum_{t=1}^{n} \hat{e}_t^2$$

where $n$ is the number of observations and $p$ is the number of parameters in the model.

The covariance matrix of the parameters is given by:

$$\hat{\Sigma} = \hat{\sigma}^2 (\mathbf{A}^T \mathbf{A})^{-1}$$

where $\mathbf{A}$ is the design matrix of the model.
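In code these two formulas become, for residuals e and design matrix A (a minimal sketch; the helper name is illustrative):

# Sketch: unbiased variance, parameter covariance and standard errors
def standard_errors(e, A):
    n, p = A.shape
    sigma2 = np.sum(e**2) / (n - p)           # unbiased variance estimate
    Cov = sigma2 * np.linalg.inv(A.T @ A)     # covariance of the parameters
    return np.sqrt(np.diag(Cov))              # standard errors (se)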

Task 5.1:

Complete the missing parts of the code cell below. Note that you will need to add one additional term, compared to Wednesday.

In [ ]:
# combine ar2 and functional model
A_final = YOUR_CODE_HERE
x_hat, y_hat, e_hat_final = fit_model(YOUR_CODE_HERE)

# Plotting the acf of the residuals
fig, ax = plt.subplots(1, 1, figsize=(10, 3))
plot_acf(YOUR_CODE_HERE, ax=ax, lags=20);
ax.grid()

# compute the standard errors
N = YOUR_CODE_HERE
p = YOUR_CODE_HERE
sigma2 = YOUR_CODE_HERE
Cov = YOUR_CODE_HERE
se = YOUR_CODE_HERE

# Extracting the seasonal component coefficients from the estimated parameters
a_i = YOUR_CODE_HERE
b_i = YOUR_CODE_HERE
freqs = YOUR_CODE_HERE

# Check if the number of coefficients match the number of frequencies
assert len(a_i) == len(b_i) == len(freqs), 'The number of coefficients do not match'

print(f'Estimated Parameters (standard deviation):')
for i in range(len(x_hat)):
    print(f'x{i} = {x_hat[i]:.3f}\t\t ({se[i]:.3f})')

print('\nThe seasonal component is rewritten as:')
i = 0
for a, b, f in zip(a_i, b_i, freqs):
    A_i, theta_i = rewrite_seasonal_comp(a, b)
    i += 1
    print(f'A_{i} = {A_i:.3f}, theta_{i} = {theta_i:.3f}, f_{i} = {f:.3f}')

Task 5.2:

Now we have the complete functional model. Reflect on its suitability for capturing the time-dependent variation of temperature throughout the spring. Comment specifically on the time series components that were included and which ones have the most significant influence on the result.

Compare your final parameters to the ones you found in the previous tasks (i.e. model without offset, model with offset). Are they similar? If not, why do you think that is?

Comment also on the suitability of this model for predicting the temperature beyond the betting deadline of April 5, assuming that you have data up until that date. Remember that the ice typically breaks apart 2 to 6 weeks after the betting deadline.

Your answer here.

End of notebook.


By MUDE Team © 2024 TU Delft. CC BY 4.0. doi: 10.5281/zenodo.16782515.