Analysis#
Import packages#
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from urllib.request import urlretrieve
from scipy import stats
from scipy.signal import find_peaks
import os
plt.rcParams.update({'font.size': 14})
Part 1. Preprocess and explore the data#
This section shows the preprocessing of the data (just FYI), which includes:
Loading the data;
Identifying the city with the highest measurement of rain;
Extracting the data corresponding to that city;
Selecting the columns we need from the dataset for further analysis;
Formatting the date;
Plotting the data.
Let’s start by loading the data.
def findfile(fname):
    if not os.path.isfile(fname):
        print(f"Downloading {fname}...")
        urlretrieve('https://github.com/TUDelft-MUDE/source-files/raw/main/file/GA2.7/'+fname, fname)
findfile('ABD_Rainfall_1981_2025_PAK.csv')
data = pd.read_csv('ABD_Rainfall_1981_2025_PAK.csv')
data.head()
As you can see, the dataset includes information about the date, the location of the measurement and the cumulative precipitation in mm. As previously mentioned, you are going to analyze the city with the highest recorded precipitation. Let’s identify it!
data.iloc[data['Rainfall - (MM)'].idxmax(),5]
Let’s now extract the data only from that location.
data_loc = data[data['City']==data.iloc[data['Rainfall - (MM)'].idxmax(),5]]
data_loc.head()
You do not need all the columns of the dataset for the further analysis, so you can reduce the data to the two columns you need: date and precipitation.
cols_to_keep = ['Date', 'Rainfall - (MM)']
data_loc = data_loc[cols_to_keep]
data_loc.head()
Once you have extracted the needed information, you can format the date, remove NaNs and leave the data ready for further use.
data_format = data_loc.copy()
data_format['Date'] = pd.to_datetime(data_format['Date'], format='mixed')
data_format = data_format.dropna(how='any')
data_format = data_format.sort_values('Date')
data_format = data_format.reset_index(drop=True)
data_format.head()
Let’s explore it a bit by plotting it.
# plot time series
plt.figure(figsize=(10, 6))
plt.plot(data_format['Date'], data_format['Rainfall - (MM)'],'k', label = 'Rainfall')
plt.xlabel('Time')
plt.ylabel('Rainfall [mm]')
plt.grid()
plt.legend()
Note that from now on you will use the DataFrame called data_format for all further analysis.
Part 2. Yearly maxima#
Task 2.1:
Sample the yearly maxima. Check that you extracted one maximum per year by checking the size of the obtained sample. Plot those maxima on the time series.
idx_max = ### YOUR CODE HERE ###
yearly_list = ### YOUR CODE HERE ###
print('The number of extracted yearly maxima is:', yearly_list.shape[0])
# plot the maxima in the timeseries
### YOUR CODE HERE ###
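If you get stuck, the following is one possible sketch (not the only valid approach, and the _sketch names are illustrative rather than part of the template above): it groups the rainfall by calendar year with pandas and keeps the row index of each year’s maximum so the extremes can be marked on the time series.
# One possible way to sample the yearly maxima (illustrative variable names):
# group by calendar year and keep the row index of each year's maximum
idx_max_sketch = data_format.groupby(data_format['Date'].dt.year)['Rainfall - (MM)'].idxmax()
yearly_sketch = data_format.loc[idx_max_sketch]
print('Number of yearly maxima (sketch):', yearly_sketch.shape[0])
# mark the maxima on the time series
plt.figure(figsize=(10, 6))
plt.plot(data_format['Date'], data_format['Rainfall - (MM)'], 'k', label='Rainfall')
plt.scatter(yearly_sketch['Date'], yearly_sketch['Rainfall - (MM)'], color='r', zorder=3, label='Yearly maxima')
plt.xlabel('Time')
plt.ylabel('Rainfall [mm]')
plt.grid()
plt.legend()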
Task 2.2:
Compute the empirical cumulative distribution function of the sampled maxima. Fit the appropriate distribution function and assess the fit by plotting both in an exceedance plot with a log scale. Interpret the parameters of the fitted distribution and the goodness of fit.
# function for ECDF
def ecdf(var):
### YOUR CODE HERE ###
return [y, x]
# Fit GEV
params_YM = ### YOUR CODE HERE ###
print('The parameters of the fitted distribution are:', *params_YM)
#Plot in logscale
### YOUR CODE HERE ###
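A minimal sketch of one way to complete this task, assuming the yearly maxima from the previous sketch: the ECDF is built from ranked observations, a GEV is fitted with scipy.stats.genextreme, and both are compared in an exceedance plot with a logarithmic y-axis. Again, the _sketch names are illustrative only.
def ecdf_sketch(var):
    # ranked non-exceedance probabilities i/(n+1) and the sorted values
    x = np.sort(np.asarray(var))
    n = x.size
    y = np.arange(1, n + 1) / (n + 1)
    return [y, x]
rain_max_sketch = yearly_sketch['Rainfall - (MM)'].values
# fit a GEV to the yearly maxima: parameters are (shape c, location, scale)
params_GEV_sketch = stats.genextreme.fit(rain_max_sketch)
print('GEV parameters (sketch):', *params_GEV_sketch)
# exceedance plot in log scale: empirical points vs fitted GEV
y_emp, x_emp = ecdf_sketch(rain_max_sketch)
x_plot = np.linspace(x_emp.min(), x_emp.max(), 200)
plt.figure(figsize=(10, 6))
plt.semilogy(x_emp, 1 - y_emp, 'ko', label='Empirical')
plt.semilogy(x_plot, 1 - stats.genextreme.cdf(x_plot, *params_GEV_sketch), 'r', label='Fitted GEV')
plt.xlabel('Rainfall [mm]')
plt.ylabel('Exceedance probability')
plt.grid()
plt.legend()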
Part 3. Peak Over Threshold#
Task 3.1:
Apply POT on the time series using a threshold of 125 mm and a declustering time of 1 month. Note that the dataset contains three observations per month. Plot the sampled extremes on the time series. What difference do you see when compared to the extremes sampled with Yearly Maxima?
Tip: plot the extremes obtained with both Yearly Maxima and POT on the same time series; it will help with the comparison.
Hint: you already used the function find_peaks in the programming assignment.
# POT
threshold = ### YOUR CODE HERE ###
distance = ### YOUR CODE HERE ###
peaks, _ = find_peaks ### YOUR CODE HERE ###
print('Using POT, the number of maxima obtained is:', peaks.shape[0])
### YOUR CODE HERE ###
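One possible sketch, assuming a threshold of 125 mm and a declustering distance of 3 samples (three observations per month, so roughly one month): scipy.signal.find_peaks with its height and distance arguments does the sampling, and both sets of extremes are plotted together for comparison. It reuses yearly_sketch from the previous sketch, and the _sketch names remain illustrative.
threshold_sketch = 125   # mm
distance_sketch = 3      # 3 samples ~ 1 month (three observations per month)
peaks_sketch, _ = find_peaks(data_format['Rainfall - (MM)'], height=threshold_sketch, distance=distance_sketch)
print('Number of POT extremes (sketch):', peaks_sketch.shape[0])
# compare both sampling strategies on the time series
plt.figure(figsize=(10, 6))
plt.plot(data_format['Date'], data_format['Rainfall - (MM)'], 'k', label='Rainfall')
plt.scatter(data_format['Date'].iloc[peaks_sketch], data_format['Rainfall - (MM)'].iloc[peaks_sketch], color='b', marker='x', zorder=3, label='POT extremes')
plt.scatter(yearly_sketch['Date'], yearly_sketch['Rainfall - (MM)'], color='r', zorder=3, label='Yearly maxima')
plt.xlabel('Time')
plt.ylabel('Rainfall [mm]')
plt.grid()
plt.legend()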
Task 3.2:
Compute the empirical cumulative distribution function of the maxima sampled with POT. Fit the appropriate distribution function and assess the fit by plotting both in an exceedance plot with a log scale.
# Fit GPD
params_POT = ### YOUR CODE HERE ###
print('The parameters of the fitted distribution are:', *params_POT)
### YOUR CODE HERE ###
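A sketch of one way to fit the GPD, reusing the POT sample and the ecdf_sketch helper from the sketches above: the excesses over the threshold are fitted with scipy.stats.genpareto (floc=0 anchors the distribution at the threshold) and compared with the empirical distribution in a log-scale exceedance plot.
# excesses over the threshold; floc=0 anchors the GPD at the threshold
excesses_sketch = data_format['Rainfall - (MM)'].iloc[peaks_sketch].values - threshold_sketch
params_GPD_sketch = stats.genpareto.fit(excesses_sketch, floc=0)
print('GPD parameters (sketch):', *params_GPD_sketch)
# exceedance plot in log scale (x-axis shifted back to rainfall units)
y_emp, x_emp = ecdf_sketch(excesses_sketch)
x_plot = np.linspace(0, x_emp.max(), 200)
plt.figure(figsize=(10, 6))
plt.semilogy(x_emp + threshold_sketch, 1 - y_emp, 'ko', label='Empirical')
plt.semilogy(x_plot + threshold_sketch, 1 - stats.genpareto.cdf(x_plot, *params_GPD_sketch), 'r', label='Fitted GPD')
plt.xlabel('Rainfall [mm]')
plt.ylabel('Exceedance probability')
plt.grid()
plt.legend()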
Part 4. Return levels#
Task 4.1:
Assuming that we want to design a device with a return period RT = 100 years, what are the design values predicted by both EVA approaches? Compare them.
#YM
YM_design_value = ### YOUR CODE HERE ###
print('The design value using Yearly maxima is', np.round(YM_design_value, 3), 'mm')
#POT
average_n_excesses = ### YOUR CODE HERE ###
non_exc_prob = ### YOUR CODE HERE ###
POT_design_value = ### YOUR CODE HERE ###
print('The design value for a RT = 100 years computed using POT is:', np.round(POT_design_value, 3), 'mm')
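A sketch under the usual EVA relations, reusing the fitted parameters from the sketches above: the yearly-maxima design value is the GEV quantile at a non-exceedance probability of 1 - 1/RT, while the POT design value converts the return period into a GPD non-exceedance probability via the average number of threshold exceedances per year.
RT_sketch = 100                                         # return period [years]
n_years_sketch = data_format['Date'].dt.year.nunique()  # years in the record
# Yearly maxima & GEV: quantile at non-exceedance probability 1 - 1/RT
YM_design_sketch = stats.genextreme.ppf(1 - 1/RT_sketch, *params_GEV_sketch)
print('Design value, yearly maxima (sketch):', np.round(YM_design_sketch, 3), 'mm')
# POT & GPD: convert the return period using the average number of excesses per year
avg_excesses_sketch = peaks_sketch.shape[0] / n_years_sketch
non_exc_prob_sketch = 1 - 1/(RT_sketch * avg_excesses_sketch)
POT_design_sketch = threshold_sketch + stats.genpareto.ppf(non_exc_prob_sketch, *params_GPD_sketch)
print('Design value, POT (sketch):', np.round(POT_design_sketch, 3), 'mm')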
Task 4.2:
Compute and plot the return level curve for both approaches (Yearly Maxima and POT): values of the random variable on the x-axis and return periods on the y-axis, with the y-axis in log scale.
Hint: it is very similar to the previous task but for multiple values!
# There are multiple approaches to do this task. We suggest the following steps but feel free to do it in your own way!
#range of RT
RT_range = ### YOUR CODE HERE ###
#YM&GEV
YM_range = ### YOUR CODE HERE ###
#POT&GPD
average_n_excesses = ### YOUR CODE HERE ###
non_exc_prob_range = ### YOUR CODE HERE ###
POT_range = ### YOUR CODE HERE ###
#plot
### YOUR CODE HERE ###
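A sketch that repeats the previous computation for a range of return periods and plots both return level curves, reusing avg_excesses_sketch and the fitted parameters from the sketches above.
# range of return periods [years]; assumes on average at least one exceedance
# per year, otherwise the smallest return periods give invalid probabilities
RT_range_sketch = np.linspace(1.1, 500, 500)
# Yearly maxima & GEV
YM_range_sketch = stats.genextreme.ppf(1 - 1/RT_range_sketch, *params_GEV_sketch)
# POT & GPD
non_exc_range_sketch = 1 - 1/(RT_range_sketch * avg_excesses_sketch)
POT_range_sketch = threshold_sketch + stats.genpareto.ppf(non_exc_range_sketch, *params_GPD_sketch)
# return level plot: random variable on the x-axis, return period (log) on the y-axis
plt.figure(figsize=(10, 6))
plt.plot(YM_range_sketch, RT_range_sketch, 'r', label='Yearly maxima & GEV')
plt.plot(POT_range_sketch, RT_range_sketch, 'b', label='POT & GPD')
plt.yscale('log')
plt.xlabel('Rainfall [mm]')
plt.ylabel('Return period [years]')
plt.grid()
plt.legend()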
Task 4.3:
Answer the following questions:
What differences do you observe between the results of the two approaches?
Why do you think those differences occur?
Which design value would you pick? Justify your answer.
How could you improve the analysis?
Part 5. Bonus (not mandatory!)#
Task 5.1:
Not mandatory!
Analyze whether the parameters selected to perform POT (threshold and declustering time) are appropriate. Go to the page on parameter selection in the MUDE book and carry out the analysis presented there.
By Patricia Mares Nasarre, Delft University of Technology. CC BY 4.0, more info on the Credits page of Workbook