Pandas#
Overview#
This assignment introduces you to the package pandas. We only use a selection of its features here, to help you get familiar with it before using it more in the coming weeks. The primary purpose is to import, process, and export tabular data easily. This is accomplished with a data type unique to pandas: a DataFrame. It also makes it very easy to export data to a *.csv file.
If you want to learn more about pandas after finishing this assignment, the Getting Started page is a great resource.
In this assignment we’ll focus on the following:
- Pandas core functionality relies on two objects: DataFrame and Series
- A dictionary-like syntax is used to access data, e.g., df['key'] returns values
- You can use the attribute values to access the data in a Series, e.g., df['key'].values
- We used (but did not explain) the method loc to find the “rows” of the DataFrame subject to certain criteria, e.g., see Task 0.2 of PA15
- Import and export are carried out with the read_csv and to_csv methods
- A few methods exist for exploring the data set once imported into a notebook, e.g., describe() or head()
- See one example of how to address date and time data in a way that makes it “easy” for Pandas to interpret it
- Extract maximum values within a set period of interest using groupby; specifically, the maximum value observed in each month
- Extract maximum values that are defined by a threshold and with a minimum separation in time using scipy.find_peaks
The last two items are key for block maxima and peak over threshold methods used in Extreme Value Analysis; the focus is on understanding the syntax and process of what is happening rather than implementing the entire workflow yourself from scratch.
After this programming assignment, our journey with Pandas in MUDE ends. Note that we only scratched the surface of its capabilities, focusing on fundamental data types and a few simple operations, along with pre-processed data sets that are easy to handle. If you find yourself dealing with unprocessed data in the future, especially a time series, consider using Pandas; there are many free tutorials available online.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.signal import find_peaks
import os
from urllib.request import urlretrieve
plt.rcParams.update({'font.size': 14})
Part 1: Introduction to pandas#
Pandas dataframes are considered by some to be difficult to use. For example, here is a line of code from the MUDE assignment archive. Can you understand what it is doing?
net_data.loc[net_data['capacity'] <= 0, 'capacity'] = 0
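To see what that line does, here is a minimal sketch with a made-up miniature version of net_data (the column names match the line above; the data itself is invented for illustration):

```python
import pandas as pd

# Hypothetical miniature version of net_data: three links,
# one of which has an (invalid) negative capacity
net_data = pd.DataFrame({'link': ['a', 'b', 'c'],
                         'capacity': [10.0, -2.5, 0.0]})

# Select the rows where capacity <= 0 and overwrite their
# 'capacity' entries with 0 (clipping out invalid values)
net_data.loc[net_data['capacity'] <= 0, 'capacity'] = 0

print(net_data['capacity'].tolist())  # → [10.0, 0.0, 0.0]
```

In words: the first argument to loc selects rows with a boolean condition, the second selects the column, and the assignment overwrites only those entries.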
One of the reasons for this is that the primary pandas data type, a DataFrame object, uses a dictionary-like syntax to access and store elements. For example, remember that a dictionary is defined using curly braces.
my_dict = {}
type(my_dict)
Also remember that you can add items as a key-value pair:
my_dict = {'key': 5}
The item key was added with value 5. We can access it like this:
my_dict['key']
This is useful because if we have something like a list as the value, we can simply add the index to the end of the call to the dictionary. For example:
my_dict['array'] = [34, 634, 74, 7345]
my_dict['array'][3]
And now that you see the “double brackets” above, i.e., [ ][ ], you can see where the notation starts to get a little more complicated. Here’s a fun nested example:
shell = ['chick']
shell = {'shell': shell}
shell = {'shell': shell}
shell = {'shell': shell}
nest = {'egg': shell}
nest['egg']['shell']['shell']['shell'][0]
Don’t worry about that too much…as long as you keep dictionaries and their syntax in mind, it becomes easier to “read” the complicated pandas syntax.
Now let’s go through a few simple tasks that will illustrate what a DataFrame is (when constructed from a dictionary), and some of its fundamental methods and characteristics.
\(\text{Task 1.1:}\)
Run the cell below and check what kind of object was created using the function type.
new_dict = {'names': ['Gauss', 'Newton', 'Lagrange', 'Euler'],
'birth year': [1777, 1643, 1736, 1707]}
# YOUR_CODE_HERE
\(\text{Task 1.2:}\)
Run the cell below and check what kind of object was created using the function type.
df = pd.DataFrame(new_dict)
# YOUR_CODE_HERE
\(\text{Task 1.3:}\)
Just like you can access values in a dictionary using keys (e.g., new_dict['names']), you can access columns in a pandas DataFrame using the column name as a key.
Use indexing to extract the 'birth year' column from the DataFrame df and print it.
Also, try accessing the 'names' column in the same way.
df_birth_year = #YOUR_CODE_HERE
df_names = #YOUR_CODE_HERE
print(df_birth_year)
print(df_names)
\(\text{Task 1.4:}\)
The loc method in pandas is used to access a group of rows and columns by labels or a boolean array.
Given the df dataframe created from new_dict, try the following:
- Manually create a boolean array that selects all rows where the birth year is after 1700.
- Use loc to select all rows where the birth year is after 1700.
- Print the result.
df_after_1700 = df.loc[[YOUR CODE HERE]]
print(df_after_1700)
\(\text{Task 1.5:}\)
Pandas provides the iloc method to select rows and columns by their integer position (i.e., by number), rather than by label. Indexing like you would for arrays doesn’t work, because DataFrame indexing with a single integer selects columns by label, not rows by position. Furthermore, loc selects by label, so if your DataFrame index is not a simple range 0, 1, 2, … you may get unexpected results. Use iloc to reliably select rows by their integer position.
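As a quick illustration (with a made-up toy DataFrame, not the assignment data), the difference between label-based and position-based selection looks like this:

```python
import pandas as pd

# Toy DataFrame with a non-default index (labels 10, 20, 30),
# to show that loc and iloc select different rows
toy = pd.DataFrame({'x': ['a', 'b', 'c']}, index=[10, 20, 30])

print(toy.loc[20, 'x'])   # selects by *label* 20  → 'b'
print(toy.iloc[0]['x'])   # selects by *position* 0 → 'a'
```

With a default 0, 1, 2, … index the two happen to agree, which is why the difference often goes unnoticed until the index changes (e.g., after filtering rows).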
- Attempt to select the first row of df_after_1700 using df_after_1700[0]. What happens?
- Attempt to select the second row of df_after_1700 using df_after_1700.loc[1]. What happens?
- Remove the non-functional code.
- Fix the code by using df_after_1700.iloc[YOUR CODE HERE] to select the second row. Print the result.
df_after_1700[0] #REMOVE_AFTER TRYING
df_after_1700.loc[1] #REMOVE_AFTER TRYING
second_row = YOUR CODE HERE
print(second_row)
\(\text{Task 1.6:}\)
Read the code below and try to predict what the answer should be before you run it and view the output. Then run the cell to confirm your guess, and in the second cell check what kind of object was created using the function type.
guess = df.loc[df['birth year'] <= 1700, 'names']
print(guess)
# YOUR_CODE_HERE
Note that this is a Series data type, which is part of the pandas package (you can read about it here). If you need to use the values stored in the Series, you can use the attribute values, which returns the underlying data; the example below shows that the names in the DataFrame are stored in a Series whose underlying data has type ndarray.
print(type(df.loc[df['birth year'] <= 1700, 'names']))
print(type(df.loc[df['birth year'] <= 1700, 'names'].values))
print('The value in the series is an ndarray with first item:',
df.loc[df['birth year'] <= 1700, 'names'].values[0])
\(\text{Task 1.7:}\)
Use the attribute values to extract the birth year of the second row from Task 1.5.
# Extract the birth year of the second row of task 1.5 using .values
birth_year_second_row = int(#YOUR_CODE_HERE) #don't remove the int() conversion, required for grading
print(birth_year_second_row)
Another useful feature of pandas is to be able to quickly look at the contents of the data frame. You can quickly see which columns are present:
df.head()
You can also get summary information easily:
df.describe()
Finally, it is also very easy to read and write dataframes to a *.csv file, which you can do (and have already done) using the following commands:
df = pd.read_csv('dams.csv')
To write, the method is similar; the keyword argument index=False avoids adding a numbered index as an extra column in the csv:
df.to_csv('dams.csv', index=False)
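Putting the two together, a minimal round-trip sketch looks like this (the file name roundtrip_example.csv and the toy data are invented for illustration):

```python
import pandas as pd

# Toy DataFrame, written to and read back from a (hypothetical) csv file
df_out = pd.DataFrame({'names': ['Gauss', 'Newton'],
                       'birth year': [1777, 1643]})
df_out.to_csv('roundtrip_example.csv', index=False)

# Reading it back recovers the same columns and values
df_in = pd.read_csv('roundtrip_example.csv')
print(df_in.equals(df_out))  # → True
```

Without index=False, the row index would be written as an extra unnamed column, and the round trip would no longer return an identical DataFrame.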
Task 2: Evaluate and process the data#
For this assignment we will use a small file, dams.csv, which we can download with the following code.
def findfile(fname):
if not os.path.isfile(fname):
print(f"Downloading {fname}...")
urlretrieve('http://files.mude.citg.tudelft.nl/'+fname, fname)
findfile('dams.csv')
\(\text{Task 2.1:}\)
Import the dataset as a DataFrame, then explore it and learn about its contents (use the methods presented above; you can also look inside the csv file).
# df = YOUR_CODE_HERE
# YOUR_CODE_HERE.head()
\(\text{Task 2.2:}\)
Using the example above, find the dams in the DataFrame that are of type earth fill.
# names_of_earth_dams = YOUR_CODE_HERE
# print('The earth fill dams are:', names_of_earth_dams)
\(\text{Task 2.3:}\)
Create a new dataframe that only includes the earth fill dams. Save it as a new csv file called earth_dams.csv.
\(\text{Tip:}\)
You only need to remove a small thing from the code for your answer to the task above.
# df_earth = YOUR_CODE_HERE
# df_earth.YOUR_CODE_HERE
\(\text{Task 2.4:}\)
Check the contents of the new csv file to make sure you created it correctly.
Part 3: Finding Maximum Values (2 methods)#
You may have struggled to imagine how to extract the maximum values, besides simply trying a method like h['Chlfa'].values.max(). We will help you out here by showing a useful method, groupby, which uses the datetime object to easily get the maximum value that occurred in each month.
To showcase how to do it, we will use a dataset of concentrations of chlorophyll (\(Chlfa\)) in the North Sea. \(Chlfa\) is a proxy for the amount of algae in the water. Very high concentrations of \(Chlfa\) can indicate an overgrowth of algae due to an excess of nutrients or too little water movement and renewal, among other causes, potentially leading to situations such as eutrophication. If the concentrations are too low, it may indicate a lack of nutrients, potentially leading to starvation of species.
If you want to know more about this dataset, you can go to Santjer et al. (2024).
\(\text{Task 3.1:}\)
Import the dataset as a DataFrame, then explore it and learn about its contents (use the methods presented above; you can also look inside the csv file). Use the commented lines of code to interpret the contents.
def findfile(fname):
if not os.path.isfile(fname):
print(f"Downloading {fname}...")
urlretrieve('http://files.mude.citg.tudelft.nl/'+fname, fname)
findfile('chla.csv')
h = YOUR_CODE_HERE
h.columns=['Date', 'Chlfa']
h['Date'] = pd.to_datetime(h[YOUR CODE HERE], format= YOUR_CODE_HERE)
# h.describe()
# h.head()
# h['Date']
plt.figure(figsize=(10, 6))
plt.plot(h['Date'], h['Chlfa'],'k.')
plt.xlabel('Date')
plt.ylabel('Concentrations [mg/m3]')
plt.grid()
plt.title('Concentrations of Chlorophyll in the North Sea');
\(\text{Task 3.2:}\)
Read the code cell below and explore using it until you understand what is happening. Then modify the code cell to extract the monthly maxima instead of the yearly maxima.
idx_max = h.groupby(pd.DatetimeIndex(h['Date']).year)['Chlfa'].idxmax()
\(\text{Note:}\)
Once modified, the code looks for the maximum value of Chlfa for each month in the dataset. Note that this dataset contains observations from a single year, so there is only one March, for instance. If the dataset were composed of more than one year, we would need to group by both year and month.
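Grouping by both year and month can be sketched as follows, using synthetic daily data spanning two years (the demo DataFrame and its values are invented for illustration):

```python
import numpy as np
import pandas as pd

# Synthetic two-year daily series, mimicking the structure of h
dates = pd.date_range('2020-01-01', '2021-12-31', freq='D')
rng = np.random.default_rng(42)
demo = pd.DataFrame({'Date': dates,
                     'Chlfa': rng.random(len(dates))})

# Group by (year, month) so that e.g. March 2020 and March 2021
# each get their own maximum
idx = pd.DatetimeIndex(demo['Date'])
idx_max = demo.groupby([idx.year, idx.month])['Chlfa'].idxmax()

print(len(idx_max))  # → 24 groups: 12 months x 2 years
```

Passing a list of keys to groupby creates one group per unique (year, month) combination, which is exactly what a multi-year block maxima analysis needs.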
\(\text{Task 3.3:}\)
Complete the code cell below to plot the monthly maximum values.
plt.figure(figsize=(10, 6))
plt.plot(h['Date'], h['Chlfa'],'k.', markersize=4)
plt.xlabel('Date')
plt.ylabel('Concentrations [mg/m3]')
plt.grid()
plt.title('Concentrations of Chlorophyll in the North Sea');
plt.plot(h[YOUR_CODE_HERE][YOUR_CODE_HERE], h[YOUR_CODE_HERE][YOUR_CODE_HERE], 'y.', markersize=16, markeredgecolor='k')
As you may have noticed, this does not find all of the peaks! We need another method.
We can use the scipy function find_peaks to help get the “true” maximum value of each peak in chlorophyll concentration. It takes two additional arguments:
- height specifies a lower bound, above which values are considered peaks, and
- distance makes sure that all peaks are separated by a minimum number of samples.
Note that the one_day variable defines the duration of 24 hours as a number of data points, since the observations are spaced 1 hour apart.
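To see how the two arguments interact, here is a minimal sketch on a short synthetic signal (the values are invented; this is not the assignment data):

```python
import numpy as np
from scipy.signal import find_peaks

# Synthetic signal with local maxima at indices 1, 3, 5, 7
# (heights 3, 7, 4, 9)
signal = np.array([0, 3, 0, 7, 0, 4, 0, 9, 0])

# height keeps only peaks above the threshold
peaks_high, _ = find_peaks(signal, height=5)
print(peaks_high)  # → [3 7]

# distance additionally enforces a minimum spacing: smaller peaks
# within 4 samples of a larger one are discarded
peaks_spaced, _ = find_peaks(signal, height=2, distance=4)
print(peaks_spaced)  # → [3 7]
```

find_peaks resolves distance conflicts by keeping the highest peaks first, so the small local maxima at indices 1 and 5 are dropped in favor of their taller neighbors.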
\(\text{Task 3.4:}\)
The code cell below is set up with an initial set of parameters, along with a print and figure summary of the results. Alter the arguments to try and match the most significant peak values in the data set. Note that the answer is subjective; try to get around 20 of the biggest values.
one_day = 24
idx_peaks, _ = find_peaks(h['Chlfa'], height=5, distance=one_day)
print(h.loc[idx_peaks].describe())
plt.figure(figsize=(10, 6))
plt.plot(h['Date'], h['Chlfa'],'k.', markersize=4)
plt.xlabel('Date')
plt.ylabel('Concentrations [mg/m3]')
plt.grid()
plt.title('Concentrations of Chlorophyll in the North Sea');
plt.plot(h['Date'][idx_max], h['Chlfa'][idx_max],'y.', markersize=16, markeredgecolor='k')
plt.plot(h['Date'][idx_peaks], h['Chlfa'][idx_peaks],'r.', markersize=16, markeredgecolor='k');
That’s it for Pandas for now. In the next section we will learn about formatted tables, then practice using the tables by presenting the results of our peaks analysis.
By Tom van Woudenberg, Robert Lanzafame and Patricia Mares Nasarre, Delft University of Technology. CC BY 4.0, more info on the Credits page of Workbook.