Analysis#

Import packages#

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from urllib.request import urlretrieve

from scipy import stats 
from scipy.signal import find_peaks
import os

plt.rcParams.update({'font.size': 14})

Part 1. Preprocess and explore the data#

This section shows the preprocessing of the data (just FYI), which includes:

  • Loading the data;

  • Identifying the city with the highest measurement of rain;

  • Extracting the data corresponding to that city;

  • Selecting the columns we need from the dataset for further analysis;

  • Formatting the date;

  • Plotting the data.

Let’s start by loading the data.

def findfile(fname):
    if not os.path.isfile(fname):
        print(f"Downloading {fname}...")
        urlretrieve('https://github.com/TUDelft-MUDE/source-files/raw/main/file/GA2.7/'+fname, fname)

findfile('ABD_Rainfall_1981_2025_PAK.csv')

data = pd.read_csv('ABD_Rainfall_1981_2025_PAK.csv')
data.head()
Downloading ABD_Rainfall_1981_2025_PAK.csv...
Date PCODE Country Rainfall - (MM) Province City Year Month Quarter
0 01-01-81 PK2 Pakistan 6.0 Balochistan Balochistan 1981.0 January Q3
1 11-01-81 PK2 Pakistan 6.0 Balochistan Balochistan 1981.0 January Q3
2 21-01-81 PK2 Pakistan 9.0 Balochistan Balochistan 1981.0 January Q3
3 01-02-81 PK2 Pakistan 15.0 Balochistan Balochistan 1981.0 February Q3
4 11-02-81 PK2 Pakistan 5.0 Balochistan Balochistan 1981.0 February Q3

As you can see, the dataset includes information about the date, the location of the measurement, and the cumulative precipitation in mm. As previously mentioned, you are going to analyze the city with the highest recorded precipitation. Let’s identify it!

data.loc[data['Rainfall - (MM)'].idxmax(), 'City']
'Islamabad'

Let’s now extract the data only from that location.

data_loc = data[data['City'] == data.loc[data['Rainfall - (MM)'].idxmax(), 'City']]
data_loc.head()
Date PCODE Country Rainfall - (MM) Province City Year Month Quarter
54570 01-01-81 PK401 Pakistan 33.0 Islamabad Islamabad 1981.0 January Q3
54571 11-01-81 PK401 Pakistan 7.0 Islamabad Islamabad 1981.0 January Q3
54572 21-01-81 PK401 Pakistan 95.0 Islamabad Islamabad 1981.0 January Q3
54573 01-02-81 PK401 Pakistan 52.0 Islamabad Islamabad 1981.0 February Q3
54574 11-02-81 PK401 Pakistan 17.0 Islamabad Islamabad 1981.0 February Q3

You do not need all the columns from the dataset for further analysis so you can simplify the data to the two needed columns: date and precipitation.

cols_to_keep = ['Date', 'Rainfall - (MM)']
data_loc = data_loc[cols_to_keep]
data_loc.head()
Date Rainfall - (MM)
54570 01-01-81 33.0
54571 11-01-81 7.0
54572 21-01-81 95.0
54573 01-02-81 52.0
54574 11-02-81 17.0

Once you have extracted the needed information, you can format the date, remove NaNs and leave the data ready for further use.

data_format = data_loc.copy()
data_format['Date'] = pd.to_datetime(data_format['Date'], format='mixed', dayfirst=True)  # dekadal data: dates fall on the 1st, 11th and 21st, so they are day-first
data_format = data_format.dropna(how = 'any')
data_format=data_format.sort_values('Date')
data_format = data_format.reset_index(drop=True)
data_format.head()
Date Rainfall - (MM)
0 1981-01-01 6.0
1 1981-01-01 14.0
2 1981-01-01 0.0
3 1981-01-01 10.0
4 1981-01-01 2.0

Let’s explore it a bit by plotting it.

# plot time series
plt.figure(figsize=(10, 6))
plt.plot(data_format['Date'], data_format['Rainfall - (MM)'],'k', label = 'Rainfall')
plt.xlabel('Time')
plt.ylabel('Rainfall [mm]')
plt.grid()
plt.legend()

Note that from now on you will use the DataFrame called data_format for further analysis.

Part 2. Yearly maxima#

Task 2.1:

Sample the yearly maxima. Check that you have extracted one maximum per year by checking the size of the obtained sample. Plot those maxima on the timeseries.

idx_max = data_format.groupby(pd.DatetimeIndex(data_format['Date']).year)['Rainfall - (MM)'].idxmax()
yearly_list = data_format.loc[idx_max]

print('The number of extracted yearly maxima is:', yearly_list.shape[0])

# plot the maxima in the timeseries

plt.figure(figsize=(10, 6))
plt.plot(data_format['Date'], data_format['Rainfall - (MM)'],'k', label = 'Rainfall')
plt.scatter(yearly_list['Date'], yearly_list['Rainfall - (MM)'], 40, 'r', label = 'Yearly Maxima')
plt.xlabel('Time')
plt.ylabel('Rainfall [mm]')
plt.grid()
plt.legend()
The number of extracted yearly maxima is: 45

Task 2.2:

Compute the empirical cumulative distribution function of the sampled maxima. Fit the appropriate distribution function and assess the fit by plotting both in an exceedance plot with a log-scale y-axis. Interpret the parameters of the obtained distribution and the goodness of fit.

# function for ECDF
def ecdf(var):
    x = np.sort(var)  # sort the values from small to large
    n = x.size  # number of data points
    y = np.arange(1, n + 1) / (n + 1)  # empirical non-exceedance probabilities
    return [y, x]

# Fit GEV
params_YM = stats.genextreme.fit(yearly_list['Rainfall - (MM)'])
print('The parameters of the fitted distribution are:', *params_YM)

#Plot in logscale
x_range = np.linspace(0, 300, 500)
plt.figure(figsize=(10, 6))
plt.step(ecdf(yearly_list['Rainfall - (MM)'])[1], 1-ecdf(yearly_list['Rainfall - (MM)'])[0],'cornflowerblue', label = 'Yearly maxima')
plt.plot(x_range, 1-stats.genextreme.cdf(x_range, *params_YM),
             '--k', label='GEV')
plt.xlabel('Precipitation [mm]')
plt.ylabel('${P[X > x]}$')
plt.yscale('log') 
plt.grid()
plt.legend()
The parameters of the fitted distribution are: 0.18749320385057244 157.42294756046465 43.76604726074726

Solution 2.2:

The shape parameter of the GEV distribution is ξ ≈ -0.19 (note that scipy’s genextreme reports c = -ξ ≈ 0.19), which means that the obtained GEV is a Reverse Weibull and presents an upper bound.

Regarding the goodness of fit, the tail of the fitted distribution seems to follow the trend of the tail of the empirical distribution. The largest discrepancies are observed around 150 mm of precipitation.
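Since a negative shape parameter implies a bounded upper tail, the implied upper bound can be computed directly from the fitted parameters: in scipy’s convention (c = -ξ > 0) the support is bounded above at loc + scale/c. A minimal sketch using the fitted values printed above:

```python
import numpy as np
from scipy import stats

# fitted GEV parameters from the cell above (scipy convention: c = -xi)
c, loc, scale = 0.18749320385057244, 157.42294756046465, 43.76604726074726

# for c > 0 the support of the GEV is bounded above at loc + scale / c
upper_bound = loc + scale / c
print(f'Implied upper bound of the fitted GEV: {upper_bound:.1f} mm')

# sanity check: the CDF reaches 1 at the bound
assert np.isclose(stats.genextreme.cdf(upper_bound, c, loc, scale), 1.0)
```

No return level predicted by this fitted GEV can exceed this bound (roughly 390 mm), which explains the flattening of the YM curve at large return periods in the return-level plot of Task 4.2.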

Part 3. Peak Over Threshold#

Task 3.1:

Apply POT to the timeseries using a threshold of 125 mm and a declustering time of 1 month. Note that the dataset contains three observations per month. Plot the sampled extremes on the timeseries. What differences do you see compared to the extremes sampled with Yearly Maxima?

Tip: plot in the timeseries the extremes obtained with both Yearly Maxima and POT. It will help in the comparison.

Hint: you already used the function find_peaks in the programming assignment.

# POT
threshold = 125 # can be defined by the user
distance = 3 # declustering: minimum separation of 3 observations = 1 month
peaks, _ = find_peaks(data_format['Rainfall - (MM)'], height=threshold, distance=distance)
print('Using POT, the number of maxima obtained is:', peaks.shape[0])

plt.figure(figsize=(10, 6))
plt.plot(data_format['Date'], data_format['Rainfall - (MM)'],'k', label = 'Rainfall')
plt.scatter(yearly_list['Date'], yearly_list['Rainfall - (MM)'], 70, 'r', marker = 'x', label = 'Yearly Maxima')
plt.scatter(data_format.iloc[peaks, 0], data_format.iloc[peaks, 1], 40, 'cornflowerblue', label = 'POT')
plt.xlabel('Time')
plt.ylabel('Rainfall [mm]')
plt.grid()
plt.legend()
Using POT, the number of maxima obtained is: 136

Solution 3.1:

POT samples many more extremes than Yearly Maxima. In many years there is more than one extreme precipitation event, and Yearly Maxima misses them. Note that if we wanted to limit our analysis to a seasonal event that occurs every year, Yearly Maxima could be a better choice.
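The declustering behaviour of find_peaks can be illustrated on a toy series (the values below are made up for illustration): exceedances of the threshold that fall within the declustering window are merged into the single largest one.

```python
import numpy as np
from scipy.signal import find_peaks

# toy series with three local maxima above 125 mm; the first two are
# only 2 observations apart (i.e., within one month at 3 obs/month)
toy = np.array([10, 130, 10, 140, 10, 10, 150, 10])

peaks_raw, _ = find_peaks(toy, height=125)               # no declustering
peaks_decl, _ = find_peaks(toy, height=125, distance=3)  # 1-month declustering

print('Without declustering:', toy[peaks_raw])   # [130 140 150]
print('With declustering:   ', toy[peaks_decl])  # [140 150] - 130 merged into the larger 140
```

With `distance=3`, find_peaks discards the smaller of any two peaks closer than 3 observations, which is exactly the 1-month declustering used above.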

Task 3.2:

Compute the empirical cumulative distribution function of the maxima sampled using POT. Fit the appropriate distribution function and assess the fit by plotting both in an exceedance plot with a log-scale y-axis.

# Fit GPD
params_POT = stats.genpareto.fit(data_format.iloc[peaks, 1]-threshold, floc = 0)
print('The parameters of the fitted distribution are:', *params_POT)

#Plot in logscale
x_range = np.linspace(0, 200, 500)
plt.figure(figsize=(10, 6))
plt.step(ecdf(data_format.iloc[peaks, 1])[1], 1-ecdf(data_format.iloc[peaks, 1])[0],'cornflowerblue', label = 'POT maxima')
plt.plot(x_range+threshold, 1-stats.genpareto.cdf(x_range, *params_POT),
             '--k', label='GPD')
plt.xlabel('Precipitation [mm]')
plt.ylabel('${P[X > x]}$')
plt.yscale('log') 
plt.grid()
plt.legend()
The parameters of the fitted distribution are: 0.1024897478178134 0 29.201408920614618

Solution 3.2:

The GPD seems to capture the general trend of the empirical distribution. However, it starts to deviate in the upper tail, from approximately 225 mm. The fit of the GPD is therefore somewhat worse than that of the GEV.

Part 4. Return levels#

Task 4.1:

Assuming that we want to design a device with a return period RT = 100 years, what are the design values predicted by both EVA approaches? Compare them.

#YM
YM_design_value = stats.genextreme.ppf(1-1/100, *params_YM)
print('The design value using Yearly maxima is', np.round(YM_design_value, 3), 'mm')

#POT
average_n_excesses = len(data_format.iloc[peaks, 1])/45 #number of sampled extremes divided by the number of years
non_exc_prob = 1- 1/(100*average_n_excesses)
POT_design_value = stats.genpareto.ppf(non_exc_prob, *params_POT)+threshold
print('The design value for a RT = 100 years computed using POT is:', np.round(POT_design_value, 3), 'mm')
The design value using Yearly maxima is 292.319 mm
The design value for a RT = 100 years computed using POT is: 351.681 mm

Solution 4.1:

In this case, POT predicts higher return levels than Yearly Maxima: for a return period of 100 years, there is a difference of around 60 mm. Note that this conclusion is case-specific.
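The POT design value can be cross-checked by inverting the return-period relation RT = 1 / (λ · P[X > x]), where λ is the average number of excesses per year. A minimal sketch using the fitted GPD parameters printed in Task 3.2 and the λ = 136/45 used above:

```python
from scipy import stats

shape, loc, scale = 0.1024897478178134, 0.0, 29.201408920614618  # fitted GPD (Task 3.2)
lam = 136 / 45    # average number of excesses per year
threshold = 125   # POT threshold [mm]

# design value for RT = 100 years (non-exceedance probability 1 - 1/(RT*lam))
design = stats.genpareto.ppf(1 - 1 / (100 * lam), shape, loc, scale) + threshold

# invert: return period implied by that design value
RT = 1 / (lam * stats.genpareto.sf(design - threshold, shape, loc, scale))
print(f'Design value: {design:.1f} mm -> implied RT: {RT:.1f} years')
```

This reproduces the ≈351.7 mm reported above and recovers RT = 100 years exactly, confirming that the quantile and exceedance-rate computations are mutually consistent.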

Task 4.2:

Compute and plot the return level plot for both approaches (Yearly Maxima and POT): values of the random variable in the x-axis and return periods on the y-axis. Y-axis in log-scale.

Hint: it is very similar to the previous task but for multiple values!

# There are multiple approaches to do this task. We suggest the following steps but feel free to do it in your own way!

#range of RT
RT_range = np.linspace(1, 500, 500)

#YM&GEV
YM_range = stats.genextreme.ppf(1-1/RT_range, *params_YM)

#POT&GPD
average_n_excesses = len(data_format.iloc[peaks, 1])/45 #number of sampled extremes divided by the number of years
non_exc_prob_range = 1- 1/(RT_range*average_n_excesses)
POT_range = stats.genpareto.ppf(non_exc_prob_range, *params_POT)+threshold

#plot
plt.figure(figsize=(10, 6))
plt.plot(YM_range, RT_range, '--r', label = 'YM')
plt.plot(POT_range, RT_range, 'cornflowerblue', label = 'POT')
plt.xlabel('Precipitation [mm]')
plt.ylabel('RT [years]')
plt.yscale('log') 
plt.grid()
plt.legend()

Task 4.3:

Answer the following questions:

  • Which differences do you observe between the results with the two approaches?

  • Why do you think those differences happen?

  • Which design value would you pick? Justify your answer

  • How can you improve the performed analysis?

Solution 4.3:

  • In this case, POT is significantly more conservative than YM. It is also possible to see these differences in the plot in Task 4.2: the tails of the fitted distributions diverge from approximately 220 mm. From that point, the return levels predicted by the POT method become more conservative than those predicted by YM. This effect increases for increasing values of the return period.

  • POT quantifies the distribution using many more observations than YM, which may lead to differences in the fitted distribution. In this case, there are significant differences in the shape of the tail, which leads to significantly different inferred return levels. Note that this is case-specific.

  • If no additional information about the physical phenomenon we are studying is available, the safest option would be to use the most conservative approach. In this case, it is POT, but that is case specific. Also, we can argue that the GPD was quantified using more data, which makes the fitting more reliable. On the other hand, we have not performed here a formal analysis to choose the parameters of POT (threshold and declustering time), so we should check whether they are appropriate as the results are dependent on the selected parameters.

  • As mentioned in the previous point, the parameters of the POT (threshold and declustering time) should be further analyzed to determine whether they are appropriate (whether the sampled extremes are independent and identically distributed).
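The divergence of the two fitted tails described above can be checked numerically. A minimal sketch using the fitted parameters printed earlier (GEV from Task 2.2, GPD from Task 3.2) with the rate λ = 136/45 and the 125 mm threshold used above:

```python
from scipy import stats

params_YM = (0.18749320385057244, 157.42294756046465, 43.76604726074726)  # GEV (Task 2.2)
params_POT = (0.1024897478178134, 0.0, 29.201408920614618)                # GPD (Task 3.2)
lam, threshold = 136 / 45, 125

for RT in (100, 500):
    ym = stats.genextreme.ppf(1 - 1 / RT, *params_YM)
    pot = stats.genpareto.ppf(1 - 1 / (RT * lam), *params_POT) + threshold
    print(f'RT = {RT:3d} years: YM = {ym:6.1f} mm, POT = {pot:6.1f} mm, diff = {pot - ym:5.1f} mm')
```

At RT = 100 years the difference is about 60 mm, and at RT = 500 years it grows to well over 100 mm: the fitted GEV tail is bounded above (ξ < 0) while the fitted GPD tail (ξ > 0) is not, so the gap keeps widening with the return period.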

Part 5. Bonus (not mandatory!)#

Task 5.1:

Not mandatory!

Analyze whether the selected parameters to perform POT (threshold and declustering time) are appropriate. Go to the page on parameter selection in the MUDE book and perform the analysis presented there.

#new pandas dataframe with the peaks and dates
peaks_year = data_format.iloc[peaks, :]

#count the number of extremes sampled each year
count = peaks_year.groupby(pd.DatetimeIndex(peaks_year['Date']).year)['Rainfall - (MM)'].count()

#Mean=variance?
mean_count = count.mean()
var_count = count.var()
print(f'Mean = {mean_count:.3f} and variance = {var_count:.3f}')
Mean = 3.238 and variance = 6.283
k = np.arange(count.min(), count.max() + 1)
res = stats.poisson.pmf(k, mean_count)  # Poisson rate taken as the sample mean of the yearly counts
occur = count.groupby(count).size()
occur = pd.DataFrame(occur, columns=['Rainfall - (MM)'])

occur_df = pd.DataFrame(index=k, columns=['Count'])
occur_df = occur_df.join(occur)
occur_df = occur_df['Rainfall - (MM)'].fillna(0)

# empirical frequencies: occurrences divided by the number of years with exceedances
# (years without any exceedance are not included here)
chi = np.sum((occur_df.values / len(count) - res)**2 / res)
p_value = 1 - stats.chi2.cdf(x=chi, df=1)
print('The p-value computed for the Chi Squared test is:', p_value)

plt.figure(figsize=(10, 6))
plt.hist(count, bins=25, label='Data', density=False)
plt.scatter(k, res * len(count), label='Fit', color='k')  # expected number of years per count
plt.legend()
plt.grid()
plt.ylabel('Number of years')
plt.xlabel(f'Number of times the threshold of {threshold} mm is exceeded in a year')
The p-value computed for the Chi Squared test is: 0.0

Solution 5.1:

One of the properties of the Poisson distribution is that mean = variance = the parameter of the distribution. The sampled number of extremes per year does not fulfill this property (the variance is almost twice the mean), so it is unlikely that it follows a Poisson distribution.

The empirical probability mass function (pmf) and the fitted Poisson pmf are plotted in the previous figure. The number of exceedances per year does not seem to follow a Poisson distribution.

Moreover, the chi-squared test is applied to determine whether the fit is a good representation of the data. The chi-squared statistic is the sum of the squared differences between the data and the fit, divided by the fit; from it, the p-value of the fit can be computed. The null hypothesis is that the fit is a good representation of the data; the alternative hypothesis is that it is not. If the p-value is smaller than the significance level (here 0.05), the null hypothesis is rejected in favor of the alternative. Here, a p-value \(\approx\) 0 is obtained, so the Poisson distribution is rejected. It can therefore be concluded that the selected values for the threshold and the declustering time are not optimal, since the number of exceedances per year does not clearly follow a Poisson distribution and, thus, the sampled events may not be independent.
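The mean-variance property used above is easy to verify empirically: for a genuine Poisson sample the dispersion index (variance/mean) is close to 1, whereas the observed counts give roughly 6.28/3.24 ≈ 1.9. A minimal sketch with simulated data (the rate 3.24 mirrors the observed mean; the seed is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(42)
sample = rng.poisson(lam=3.24, size=100_000)

# dispersion index of a true Poisson sample: close to 1
dispersion = sample.var() / sample.mean()
print(f'Dispersion index of a simulated Poisson sample: {dispersion:.3f}')

# the observed yearly exceedance counts are clearly overdispersed
observed_dispersion = 6.283 / 3.238
print(f'Observed dispersion index: {observed_dispersion:.3f}')
```

A dispersion index well above 1 (overdispersion) is a quick first diagnostic that the sampled exceedances are unlikely to be Poisson, before any formal goodness-of-fit test.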

By Patricia Mares Nasarre, Delft University of Technology. CC BY 4.0, more info on the Credits page of Workbook