Bivariate Gaussian Distributions#
Part 1: Introduction and set up#
In this workshop, you will work with data on water quality of offshore waters, namely the concetration of nitrogen (\(DIN\)) and the concentration of phosphorus (\(DIP\)). As you probably know, these variables are nutrients which are key for the growth of species of algae and molluscs that feed on them. In order to guarantee the survival of these species, a minimum amount of nutrients in the water is needed so they do not starve.
Currently, there is a growing interest in installing farms of species such as mussels and seaweed offshore due to the limited space in coastal waters. When selecting the location of one of these farms, the availability of enough nutrients in the water is checked so the farmed species can survive and the farm is economically feasible. We can check that by computing the probability of meeting the minimum amount of nutrients in the water.
Here, you will first quantify the relationship between \(DIN\) and \(DIP\) (if they were independent, you could model them using independent univariate distributions!). As you can imagine, there is a relationship between the presence of both nutrients so, afterwards, you will model the bivariate distribution of \(DIN\) and \(DIP\) using a bivariate Gaussian distribution. Finally, you will use it to compute the probabilities of meeting some conditions.
import pandas as pd
import numpy as np
from scipy.stats import multivariate_normal
from scipy.stats import norm
import matplotlib.pyplot as plt
from statistics import mean, stdev
import os
from urllib.request import urlretrieve
def findfile(fname):
    if not os.path.isfile(fname):
        print(f"Downloading {fname}...")
        urlretrieve('http://files.mude.citg.tudelft.nl/'+fname, fname)
findfile('S_FINO3_DIN_DIP.csv')
data = np.loadtxt('S_FINO3_DIN_DIP.csv', delimiter = ',', skiprows=1)
data
Part 2: Covariance and correlation#
Task 2.1:
Plot variable X against variable Y. How do you expect the correlation to be? Make sure to label your plot!
### YOUR CODE HERE ###
Task 2.2:
It looks like a relationship exists between both variables. Let’s quantify it!
Define a function which calculates the covariance between two variables. The function must take as input the two vectors of observations. The output of the function must be the value for the covariance. Apply the function to the dataset to calculate the covariance between \(DIN\) and \(DIP\). Interpret the obtained value of the covariance. You may want to write out the calculations steps on paper before starting to code.
Important: you have to code the function yourself, do not use a prebuilt function from a package.
Hint: you can use the function zip to multiply two lists.
def calculate_covariance(X1, X2):
    ### YOUR CODE HERE ###
    return covariance
print('The covariance between DIN and DIP is', ### YOUR CODE HERE ###)
Task 2.3:
Define a function which calculates Pearson’s correlation coefficient between two variables. The function must take as input the two vectors of observations. The output of the function must be the value for the correlation coefficient. Apply the function to the dataset to calculate the correlation coefficient between \(DIN\) and \(DIP\). Interpret the obtained value of the Pearson’s correlation coefficient and compare it with the obtained covariance.
Hint: You may want to use the function for finding the covariance that you defined above
def pearson_correlation(X1, X2):
    covariance = calculate_covariance(X1, X2)
    correl_coeff = ### YOUR CODE HERE ###
    return correl_coeff
print("Pearson's correlation coefficient between DIN and DIP is", 
      ### YOUR CODE HERE ###)
Part 3: Bivariate Gaussian distribution#
Task 3.1:
Model the joint distribution of \(DIN\) and \(DIP\) using a bivariate Gaussian distribution. Follow the following steps:
Define the vector of means
Define the covariance matrix
Define the bivariate Gaussian distribution and draw 100 samples to compare with the observations.
Do you see differences between the observations and the samples from the bivariate Gaussian distribution?
# Define the vector of means
mu1 = ### YOUR CODE HERE ###
mu2 = ### YOUR CODE HERE ###
mu = [mu1, mu2] #vector of means
# Define the covariance matrix
s1 = ### YOUR CODE HERE ###
s2 = ### YOUR CODE HERE ###
covariance = ### YOUR CODE HERE ###
sigma = ### YOUR CODE HERE ###
# Draw 100 samples from a bivariate Gaussian distribution
samples = multivariate_normal(mean=mu, cov=sigma).rvs(size=100)
# Scatter plot against observations
fig, axs = plt.subplots(1, 1)
axs.scatter(data[:,0], data[:,1], 40, 'k')
axs.scatter(samples[:,0], samples[:,1], 40, 'r')
axs.set_ylabel('${DIP} [g/m^3]$')
axs.set_xlabel('${DIN} [g/m^3]$')
fig.set_size_inches(5, 5)
axs.grid();
Task 3.2:
Using the defined bivariate Gaussian distribution, compute and plot the bivariate CDF as contours where joint probabilities are projected. This is, the x- and y-axis are the values of \(DIN\) and \(DIP\) and the contours in the plot represent values of joint probabilities.
# Define the mesh of values where we want to evaluate the random variables
n = 200 #size of the mesh
values_DIN = np.linspace(### YOUR CODE HERE ###)
values_DIP = np.linspace(### YOUR CODE HERE ###)
# Define the grid
X1,X2 = np.meshgrid(### YOUR CODE HERE ###)
X = ### YOUR CODE HERE ###
# Evaluate the CDF
Z = ### YOUR CODE HERE ###
# Create contours plot
### YOUR CODE LINES HERE
Task 3.3:
Using the defined bivariate Gaussian distribution, compute the following probabilities:
\(P[DIN>0.3]\)
\(P[DIN \leq 0.3 \ AND \ DIP \leq 0.005]\)
\(P[DIN \leq 0.3 \ OR \ DIP \leq 0.005]\)
\(P[DIN > 0.3 \ AND \ DIP > 0.005]\)
The last probability would represent the probability of meeting the minimum requirement of nutrients to ensure the survival of the species (note that the values of the requirements are not realistic). Based on the result, would your species starve? Would you build the farm?
### YOUR CODE HERE ###
Part 4: Conditionalizing the Bivariate Gaussian distribution#
Task 4.1:
Compute the distribution of \(DIP\), given that the value of \(DIN = 0.05\). Plot the PDFs for both the unconditional distribution of \(DIP\) and the conditionalized distribution of \(DIP\). Comment on the differences of the two distributions and explain why those changes appear.
Compute the parameters of the conditional distribution of \(DIP\) using pen and paper and make use of the properties of the Gaussian distribution.
### YOUR CODE HERE ###
By Patricia Mares Nasarre, Delft University of Technology. CC BY 4.0, more info on the Credits page of Workbook.