Objective¶
This workshop focuses on fundamental skills for building, using, and understanding multivariate distributions, in particular when our variables are no longer statistically independent.
For our case study we will use a Thingamajig: an imaginary object for which we have limited information. One thing we do know, however, is that it is strongly influenced by two random variables, $X_1$ and $X_2$: high values of these variables can cause the Thingamajig to fail. We will use a multivariate probability distribution to compute the probability of interest under various cases (we aren't sure which one is relevant, so we consider them all!). We will also validate the model by comparing distributions drawn from our multivariate probability model with the empirical distributions.
Multivariate Distribution¶
In this assignment we will build a multivariate distribution, which is defined by a probability density function. From now on, we will call it bivariate, since there are only two random variables:
$$ f_{X_1,X_2}(x_1,x_2) $$

This distribution is implemented in scipy.stats.multivariate_normal. The bivariate normal distribution is defined by 5 parameters: the mean and standard deviation of the Gaussian distributions of $X_1$ and $X_2$, as well as the correlation coefficient between them, $\rho_{X_1,X_2}$. In this case we often refer to $X_1$ and $X_2$ as the marginal variables (univariate) and to the bivariate distribution as the joint distribution. We will use the bivariate PDF to create contour plots of probability density, as well as the CDF to evaluate the probabilities of the different cases:
Cases¶
We will consider three different cases and see how the probability of interest is different for each, as well as how they are influenced by the dependence structure of the data. The cases are described here; although they vary slightly, they have something in common: they are all integrals of the bivariate PDF over some domain of interest $\Omega$.
Case 1: Union (OR)¶
The union case is relevant if the Thingamajig fails when either random variable (or both) exceeds a specified value:
$$ P[X_1>x_1 \;\cup\; X_2>x_2] $$

This is also called the "OR" probability because it considers either one variable or the other or both exceeding a specified value.
Case 2: Intersection (AND)¶
The intersection case is relevant if the Thingamajig fails when the specified values for both random variables are exceeded together:
$$ P[X_1>x_1 \;\cap\; X_2>x_2] $$

This is also called the "AND" probability because it considers both variables exceeding a specified value.
Case 3: Function of Random Variables¶
Often it is not possible to describe a region of interest $\Omega$ as a simple union or intersection probability. Instead, there are many combinations of $X_1$ and $X_2$ that define $\Omega$. If we can integrate the probability density function over this region we can evaluate the probability.
Luckily, it turns out there is some extra information about the Thingamajig: a function that describes some aspect of its behavior that we are very interested in:
$$ Z(X_{1},X_{2}) = 800 - X_{1}^2 - 20X_{2} $$

where the condition in which we are interested occurs when $Z(X_{1},X_{2})<0$. Thus, the probability of interest is:

$$ P[X_1,X_2:\; Z<0] $$

Evaluating Probabilities in Task 2¶
Cases 1 and 2 can be evaluated with the bivariate CDF directly, because the integration bounds are relatively simple (be aware that some arithmetic and thinking is required; it is not as simple as a single call to multivariate_normal.cdf()).
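As a hint, the required arithmetic follows from the fact that the CDF gives non-exceedance probabilities; by the complement rule and inclusion-exclusion (standard identities, stated here for reference):

$$ P[X_1>x_1 \;\cup\; X_2>x_2] = 1 - F_{X_1,X_2}(x_1,x_2) $$

$$ P[X_1>x_1 \;\cap\; X_2>x_2] = 1 - F_{X_1}(x_1) - F_{X_2}(x_2) + F_{X_1,X_2}(x_1,x_2) $$

where $F_{X_1,X_2}$ is the joint CDF and $F_{X_1}$, $F_{X_2}$ are the marginal CDFs.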
Case 3 is not easy to evaluate because the PDF must be integrated over a complicated region. Instead, we will approximate the integral numerically using Monte Carlo simulation (MCS). This is also how we will evaluate the distribution of the function of random variables in Task 3. Remember, there are four essential steps to MCS (a minimal code sketch follows the list below):
- Define distributions for random variables (probability density over a domain)
- Generate random samples
- Do something with the samples (deterministic calculation)
- Evaluate the results: e.g., “empirical” PDF, CDF of samples, etc.
Note that as we now have a multivariate distribution we can no longer sample from the univariate distributions independently!
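A minimal sketch of these four steps, assuming bivar_dist is the fitted bivariate distribution built in Part 1 (the sample size is arbitrary; $Z$ is the function defined above):
samples = bivar_dist.rvs(size=10000)  # Steps 1-2: sample the distribution; each row is one (x1, x2) pair
x1, x2 = samples[:, 0], samples[:, 1]
z = 800 - x1**2 - 20*x2               # Step 3: deterministic calculation with the samples
p_z_negative = np.mean(z < 0)         # Step 4: evaluate the results, e.g., the fraction of samples with Z < 0
print(f'P[Z<0] is approximately {p_z_negative:.5f}')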
Validating the Bivariate Distribution in Task 3¶
The final task of this assignment is to use the (univariate) distribution of the function of random variables to validate the bivariate distribution, by comparing the empirical distribution to our model. Once the sample is generated, this involves the same goodness-of-fit tools that we used last week.
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as st
from helper import plot_contour
Part 1: Creating a Bivariate Distribution¶
We need to represent our two dependent random variables with a bivariate distribution; a simple model is the bivariate Gaussian distribution, which is readily available via scipy.stats.multivariate_normal. To use it in this case study, we first need to check that the marginal distributions are each Gaussian, as well as compute the covariance and correlation coefficient.
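As a minimal sketch of the API (the parameter values below are placeholders for illustration only, not fitted to the data), a bivariate normal is built from a mean vector and a covariance matrix assembled from the 5 parameters:
mu = [10.0, 12.0]    # marginal means (placeholder values)
sigma = [3.0, 4.0]   # marginal standard deviations (placeholder values)
rho = 0.5            # correlation coefficient (placeholder value)
cov_matrix_example = [[sigma[0]**2, rho*sigma[0]*sigma[1]],
                      [rho*sigma[0]*sigma[1], sigma[1]**2]]
example_dist = st.multivariate_normal(mean=mu, cov=cov_matrix_example)
print(example_dist.pdf([10, 12]))  # joint density at a point
print(example_dist.cdf([10, 12]))  # P[X1 <= 10 and X2 <= 12]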
Task 1.1:
Import the data in data.csv, then find the parameters of a normal distribution to fit to the data for each marginal. Quickly check the goodness of fit and state whether you think it is an appropriate distribution (we will keep using it anyway, regardless of the answer).
Don't spend more than a few minutes on this; you should be able to quickly reuse some of your code from last week.
data = np.genfromtxt('data.csv', delimiter=";")
data.shape
data_x1 = data[:, 0]  # observations of X1 (assumes X1 is the first column)
data_x2 = data[:, 1]  # observations of X2 (assumes X2 is the second column)
YOUR_CODE_HERE # probably many lines
Task 1.2: Write two functions to compute the covariance and correlation coefficient between the two random variables. Print the results.
The input arguments should be NumPy arrays, and you should calculate each value without using a pre-existing Python package or method (the standard sample definitions are given below for reference).
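For reference, the standard sample definitions are, with $n$ the number of data points and $\bar{x}_1$, $\bar{x}_2$ the sample means:

$$ \mathrm{Cov}(X_1,X_2) = \frac{1}{n-1}\sum_{i=1}^{n}(x_{1,i}-\bar{x}_1)(x_{2,i}-\bar{x}_2), \qquad \rho_{X_1,X_2} = \frac{\mathrm{Cov}(X_1,X_2)}{\sigma_{X_1}\sigma_{X_2}} $$

where $\sigma_{X_1}$ and $\sigma_{X_2}$ are the sample standard deviations.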
def calculate_covariance(X1, X2):
    '''
    Covariance of two random variables X1 and X2 (numpy arrays).
    '''
    YOUR_CODE_HERE # may be more than one line
    return covariance

def pearson_correlation(X1, X2):
    '''
    Correlation coefficient of two random variables X1 and X2 (numpy arrays).
    '''
    YOUR_CODE_HERE # may be more than one line
    return correl_coeff
covariance = calculate_covariance(data_x1, data_x2)
print(f'The covariance of X1 and X2 is {covariance:.5f}')
correl_coeff = pearson_correlation(data_x1, data_x2)
print(f'The correlation coefficient of X1 and X2 is {correl_coeff:.5f}')
Task 1.3:
Build the bivariate distribution using scipy.stats.multivariate_normal (as well as the mean vector and covariance matrix). To validate the result, create a plot that shows contours of the joint PDF, compared with the data (see note below). Comment on the quality of the fit in 2-3 sentences or bullet points.
Use the helper function plot_contour in helper.py; it was already imported above. Either look in the file to read it, or view the documentation in the notebook with plot_contour?
Hint: for this Task, use the optional data argument!
# plot_contour? # uncomment and run to read docstring
mean_vector = YOUR_CODE_HERE
cov_matrix = YOUR_CODE_HERE
bivar_dist = YOUR_CODE_HERE
plot_contour(YOUR_CODE_HERE, [0, 30, 0, 30], data=YOUR_CODE_HERE);
Part 2: Using the Bivariate Distribution¶
Now that we have the distribution, we will use it to compute probabilities related to the three cases presented above:
- $P[X_1>20 \;\cup\; X_2>20]$
- $P[X_1>20 \;\cap\; X_2>20]$
- $P[X_1,X_2:\; Z<0]$
Task 2:
For each of the three cases, do the following:
- Compute the requested probability using the empirical distribution.
- Compute the requested probability using the bivariate distribution.
- Create a bivariate plot that includes PDF contours and the region of interest.
- Repeat the calculations for additional values of the correlation coefficient (for example, change $\rho$ to +0.9, 0.0, then -0.9) to see how the answer changes (you can simply regenerate the plot; you don't need to make multiple versions). You can save this sub-task for later if you are running out of time; it is more important to get through Task 3 during the in-class session.
- Write two or three sentences that summarize the results and explain the quantitative impact of the correlation coefficient. Make a particular note about whether one case is affected more or less than the others.
Note that the optional arguments in the helper function plot_contour will be useful here, as well as for the Project on Friday!
Here is some example code that shows what it can do (the values are meaningless):
region_example = np.array([[0, 5, 12, 20, 28, 30],
                           [5, 20, 0, 18, 19, 12]])

plot_contour(bivar_dist, [0, 30, 0, 30],
             case=[23, 13],
             region=region_example);
Tasks 2.1 and 2.2: create cells below to carry out the OR and AND calculations.
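One possible approach for the empirical probabilities (a sketch, assuming the arrays data_x1 and data_x2 from Task 1.1): count the fraction of observations falling in the region of interest.
p_or_empirical = np.mean((data_x1 > 20) | (data_x2 > 20))   # union (OR)
p_and_empirical = np.mean((data_x1 > 20) & (data_x2 > 20))  # intersection (AND)
print(f'Empirical P[OR]  = {p_or_empirical:.5f}')
print(f'Empirical P[AND] = {p_and_empirical:.5f}')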
YOUR_CODE_HERE
# DEFINITELY more than one line.
# probably several cells too ;)
Task 2.3: create cells below to carry out the Case 3 calculations.
Note that in this case you need to make the plot to visualize the region over which we want to integrate. We need to define the boundary of the region of interest by solving the equation $Z(X_1,X_2)=0$ for $X_2$.
The equation can be defined as follows:
$$ \textrm{WRITE THE EQUATION HERE} $$

which is then defined in Python and included in the plot_contour function as an array for the keyword argument region.
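One way to construct this array is sketched below (note: the boundary expression here is simply $Z=0$ solved for $x_2$, i.e. $x_2 = (800 - x_1^2)/20$; the axis limits match the plots above):
x1_plot = np.linspace(0, 30, 100)    # x1 values across the plot domain
x2_boundary = (800 - x1_plot**2)/20  # x2 on the boundary Z = 0
region_Z = np.array([x1_plot, x2_boundary])
plot_contour(bivar_dist, [0, 30, 0, 30],
             region=region_Z);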
YOUR_CODE_HERE
# DEFINITELY more than one line.
# probably several cells too ;)
Note: the bivariate figures are an important concept for the exam, so if the code is too difficult to use when studying on your own, try sketching the figures on paper.
Part 3: Validate Bivariate with Monte Carlo Simulation¶
Now that we have seen how the different cases give different values of probability, let's focus on the function of random variables. This is a more interesting case because we can use the samples of $Z$ to approximate the distribution $f_Z(z)$ and use the empirical distribution of $Z$ to help validate the bivariate model.
Task 3:
Do the following:
- Use Monte Carlo Simulation to create a sample of $Z(X_1,X_2)$ and compare this distribution to the empirical distribution.
- Write 2-3 sentences assessing the quality of the distribution from MCS, and whether the bivariate distribution is acceptable for this problem. Use qualitative and quantitative measures from last week to support your arguments.
Note: it would be interesting to fit a parametric distribution to the MCS sample, but it is not required for this assignment.
Task 3.1: Plot histograms of $Z$ based on the Monte Carlo samples, and based on the data. Note that you probably already computed the samples in Part 2.
plot_values = np.linspace(-100, 1000, 30)

fig, ax = plt.subplots(1)
ax.hist([YOUR_CODE_HERE, YOUR_CODE_HERE],
        bins=plot_values,
        density=True,
        edgecolor='black');
ax.legend(['Data', 'MCS Sample'])
ax.set_xlabel('$Z(X_1,X_2)$')
ax.set_ylabel('Empirical Density [--]')
ax.set_title('Comparison of Distributions of $Z$');
Task 3.2: Define a function to compute the ecdf.
def ecdf(var):
    x = YOUR_CODE_HERE # sort the values from small to large
    n = YOUR_CODE_HERE # determine the number of datapoints
    y = YOUR_CODE_HERE # compute the non-exceedance probabilities
    return [y, x]
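For reference, one common way to complete this function (a sketch using plotting positions $i/n$; other conventions exist):
def ecdf_reference(var):
    x = np.sort(var)           # sort the values from small to large
    n = x.size                 # determine the number of datapoints
    y = np.arange(1, n+1) / n  # non-exceedance probabilities 1/n, 2/n, ..., 1
    return [y, x]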
Task 3.3: Create a semi-log plot of the non-exceedance probability.
fig, axes = plt.subplots(1, 1, figsize=(8, 5))
axes.step(YOUR_CODE_HERE, YOUR_CODE_HERE,
          color='k', label='Data')
axes.step(YOUR_CODE_HERE, YOUR_CODE_HERE,
          color='r', label='MCS Sample')
axes.set_xlabel('$Z(X_1,X_2)$')
axes.set_ylabel(r'CDF, $\mathrm{P}[Z < z]$')
axes.set_title('Comparison: CDF (log scale expands lower tail)')
axes.set_yscale('log')
axes.legend()
axes.grid()
Task 3.4: Create a semi-log plot of the exceedance probability.
fig, axes = plt.subplots(1, 1, figsize=(8, 5))
axes.step(YOUR_CODE_HERE, YOUR_CODE_HERE,
          color='k', label='Data')
axes.step(YOUR_CODE_HERE, YOUR_CODE_HERE,
          color='r', label='MCS Sample')
axes.set_xlabel('$Z(X_1,X_2)$')
axes.set_ylabel(r'Exceedance Probability, $\mathrm{P}[Z > z]$')
axes.set_title('Comparison: CDF (log scale expands upper tail)')
axes.set_yscale('log')
axes.legend()
axes.grid()
In case you are wondering, the data for this exercise were computed with a Clayton copula. A copula is a useful way of modelling non-linear dependence. If you would like to learn more about this, you should consider the 2nd-year cross-over module CEGM2005 Probabilistic Modelling of real-world phenomena through ObseRvations and Elicitation (MORE).