Resit 22/23 Q1¶

CEGM1000 MUDE

Part 1: Coding¶

The code below defines a ParkingLot Python class implementing the functioning of a parking lot in a simplistic way. The following description is provided from the documentation:

  • ParkingLot objects are created by specifying the overall capacity of the parking lot, stored in self.total_spots.
  • self.spots_available is used to keep track of the current number of free spots; this value changes in time based on whether a vehicle leaves and enters the facility. Upon creation, ParkingLot objects are empty, i.e., there are no cars in the parking lot.
  • book_spot and free_spot are used to either decrease or increase by one the count of spots available, e.g., whenever a vehicle exists or enters the parking lot. Checks are put in place with if statements to ensure that the self.spots_available counter does not go below zero or exceeds the parking lot capacity.
  • visualize is used to print the current number of available spots, or that there are no spots available. This method is called once when the object is first created, and whenever the number of available spots changes.

A. There are four different errors in the implementation below. Localize these errors (you can use the line numbers as reference), define their type (e.g., syntax error, exception or logical error) and explain briefly how to fix them. image.png

Model answer:

  • There is a logical error in line 8; self.spots_available should be equal to total_spots
  • There is a logical error in line 12, must remove the = after >
  • An exception is raised in line 17, should be self.total_spots
  • There is a logical error in line 23: replace [ ] with {}

B. With respect to the ParkingLot code, which lines define the constructor of the class?

  • Lines 1-9
  • Lines 1-4
  • Lines 6-9
  • Lines 6-25

Model answer:

  • Lines 6-9

C. With respect to the ParkingLot code, which line contains an attribute of the class?

  • Line 6
  • Line 7
  • Line 19
  • Line 21

Model answer:

  • Line 7

D. Imagine that a correct implementation of the ParkingLot class is available in the my_classes.py file. Given the code below,

image.png

which of the following statement is FALSE

  • The import statement in (A) will correctly load the class
  • (A) will fail to create an object due to an exception
  • The outcome of running (C) and (D) is the same
  • (C) will correctly load the class and create the object
  • (B) will raise an exception

Model answer:

  • (B) will raise an exception

E. We want to add a check on the code defining the ParkingLot class to limit the overall capacity of the parking facility to 200 parking slots. The code should stop and return an error when this occurs. Which of these statements would implement the check correctly? Regarding these results, mark all the options that are TRUE; consider that each wrong answer will result in negative points, but the lowest score for this sub-question is 0 (we will not subtract points from the rest of the exam):

image.png

Model answer:

  • A
  • C
  • D

Part 2: Probability¶

As an engineer in a consultancy firm, you need to perform the Extreme Value Analysis on the hourly wave height measurements from a buoy in the middle of the North Sea. You have applied both sampling methods in order to find a probability distribution to model the maximum wave height observed per year: Block Maxima and Peak Over Threshold. Recall that the objective is to select statistically independent and identically distributed samples; the Block Maxima method uses fixed increments of time and the Peak Over Threshold methods filter maxima using threshold values and a minimum time between peaks. In the following table, the outputs of the describe function (from pandas) applied on the sampled maxima are shown.

image.png

A. Using the summary provided, which statement is the most valid?

  • Block Maxima is the best method because it is a simple and fast method which ensures independent extreme observations.
  • Block Maxima is the best method because wave height data is probably Gaussian distributed, so Generalized Extreme Value distribution would be a good fit for its extremes.
  • Peak Over Threshold is the best method because it is a simple and fast method which ensures independent extreme observations.
  • Peak Over Threshold is the best method because it maximizes the number of extremes sampled from our data and, in this case, our time series seems to be short.

Model answer:

  • Peak Over Threshold is the best method because it maximizes the number of extremes sampled from our data and, in this case, our time series seems to be short.

Assume that a colleague did the sampling using the Peak Over Threshold method and asks your advice on the result, as shown in the plot below. The plot represents the sampled extremes for 3 weeks using different values for the threshold and the declustering time.

0.png

B. Based on such plot, which pair of values would you choose?

  • Threshold = 3m and declustering time = 24h.
  • Threshold = 4.25m and declustering time = 48h.

Model answer:

  • Threshold = 4.25m and declustering time = 48h.

C. Explain why.

Model answer:

Basic assumption for EVA is that the sampled events are independent and identically distributed. With th=3m and dcl=24h, three events are sampled within the same cluster (group) of extreme observations. That means that they come from the same storm and, thus, they are not independent.


Part 3: Probability¶

For the same problem as in Part 2, assume that you now have the extreme observations. You want to fit a parametric distribution to them, and you can do it using different methods.

A. Which of the following methods cannot be applied to this end?

  • Maximum Loglikelihood Estimator
  • Inverse Modelling
  • Using moments of data to match moments of the parametric distribution

Model answer:

  • Inverse Modelling

Since you are not sure about the distribution that would be a good model for your data, you fit a Generalized Pareto distribution and a Normal distribution. After fitting them, you assess their goodness of fit using the Q-Q plot and the Kolmogorov-Smirnov test.

image.png

  • Kolmogorov-Smirnov test:
    • Normal: p-value=0.294
    • Generalized Pareto: p-value=0.892

B. Based on the results above, which distribution fits best your data? How have you determined it? Explain your answer.

Model answer:

GPD best fit. QQplot: GPD closer to 45-degrees line, so better fit to observations. KS test: p-value of GPD > p-value Normal


Part 4: Probability¶

For the same problem as in Part 2 and 3, assume you choose the Gumbel distribution (see formula sheet).

A. Compute the distribution parameters using the information from Peak Over Threshold analysis. Round to two decimals.

Model answer:

From the table: 

  • $\mu = 5.09$ and $\sigma = 0.87$

From the formula sheet: 

  • $\mu = \alpha + \beta \lambda$
  • $\sigma^2 = \pi^2/6 \beta^2$

Use the second equation to determine $\beta$:

  • $\beta = \sqrt 6\sigma^2/\pi^2 = 0.68$

Then use the first one to determine $\alpha$:

  • $\alpha = \mu - \beta\lambda = 4.70$

In case you did not get an answer to the previous question, assume you obtained α=1 and β=2.

B. Calculate P(X>5) using Gumbel distribution.

Results with $\alpha = 1$ and $\beta = 2$:

$z=(x- 𝛼)/𝛽 =(5-1)/2=2$

$P(X<5)=exp(-exp(-2))=0.87$

$P(X>5)=1-P(X<5)=0.13$

Results with $\alpha = 4.70$ and $\beta = 0.68$:

$z=(x- 𝛼)/𝛽 =(5-4.7)/0.68=0.44$

$P(X<5)=exp(-exp(-0.44))=0.53$

$P(X>5)=1-P(X<5)=0.47$


C. Calculate the value of the wave height (X) whose exceedance probability is 0.05.

Model answer:

Results with $\alpha = 1$ and $\beta = 2$:

$P(X>x)=0.05$

$P(X<x)=0.95$

$P(X<x)=exp(-exp(-z))=0.95$

$-exp(-z)=-0.051 => z=2.97$

$x = 2z + 1 = 6.94$

Results with $\alpha = 4.70$ and $\beta = 0.68$:

$P(X>x)=0.05$

$P(X<x)=0.95$

$P(X<x)=exp(-exp(-z))=0.95$

$-exp(-z)=-0.051 => z=2.97$

$x = 0.68z + 4.7 = 6.72$


You are also interested in the wave periods (Y) for your design. Assume that they follow an exponential distribution with λ=0.05.

D. Assume that wave heights (X) and wave periods (Y) are independent, compute the probability of P(X<5,Y<7).

Model answer:

Since they are independent $P(X<5,Y<7)= P(X<5)P(Y<7)$.

Results with $\alpha = 1$ and $\beta = 2$:

$P(X<5)=exp(-exp(-(5-1)/2)=0.873$

$P(Y<3)=1-exp(-0.05x7)=0.295$

$P(X<5)P(Y<7)=0.873 x 0.295 = 0.26$

Results with $\alpha = 4.70$ and $\beta = 0.68$:

$P(X<5)=exp(-exp(-(5-4.7)/0.68)=0.526$

$P(Y<3)=1-exp(-0.05x7)=0.295$

$P(X<5)P(Y<7)=0.526 x 0.295 = 0.16$


E. Assume that wave heights (X) and wave periods (Y) are independent, compute the probability of P(X<5∣Y<7).

Model answer:

Since they are independent $P(X<5|Y<7)= P(X<5)=0.873$ ($\alpha = 1$ and $\beta = 2$) or 0.525 ($\alpha = 4.70$ and $\beta = 0.68$)


F. However, you know wave heights and periods are not independent. Given the plot below, calculate the empirical value of P(X>6,Y>8). Note that there are 35 observations in the plot.

image.png

Model answer:

There are two points in the area of observations $X>6$ and $Y>8$ (upper right rectangle). Thus: $P(X>6,Y>8)=2/35=0.06$


Part 5: Numerical Modelling: Validation and Verification¶

A. When can you actually say that a numerical model is verified?

  1. When the mathematical part of the model is correct, and the code is free of errors.
  2. When the model matches observed experimental data (or simulated data from validated models).
  3. When the numerical model matches analytical results obtained for simpler configurations of the system.

Model answer:

Right answer is 1. Answer 2 is for validation, while answer 3 is partly true for a verified model, it depends on how many simpler configurations you can compare with.


B. By definition, a sensitivity analysis implies:

  • The validation of a model due to uncertainties of its parameters.
  • Assessing the influence of certain parameters to the investigated response.
  • Identifying code errors.
  • Matching numerical modelling results to experimental results.

Model answer:

  • Assessing the influence of certain parameters to the investigated response.

C. A sensitivity analysis is not useful for:

  • Dimension reduction of the model.
  • Model validation.
  • Dividing and allocating the uncertainty of the output in our model to the different sources of variability of the input parameters.
  • Implementation of a model

Model answer:

  • Implementation of a model

D. Which of the following is generally true? A model that has this form: $\dot x+3x=0$ and that has been derived from Newton’s second law can be classified as

  • A dynamic model, since it contains a first order derivative and depends on time.
  • A static model, since it contains a first order derivative.
  • A static model, since it doesn’t contain a second order derivative.
  • A dynamic model, since it has been derived from Newton’s second law.

Model answer:

  • A dynamic model, since it contains a first order derivative and depends on time.

Part 6: Numerical Methods¶

A. Using Taylor expansion, derive the backward Euler approximation for the first derivative. Show the truncation error introduced by the approximation.

Model answer:

$f(x−h)≈f(x)−hf′(x)+O(h^2)$

$hf′(x)=f(x)−f(x−h)+O(h^2)$

$f′(x)= \frac{f(x)−f(x−h)}{h} + O(h)$

Truncation error is O(h) due to division by h


B. Derive the discrete form of the following ODE using the backward Euler approximation and calculate first 5 time steps of the solution using dt=0.2:

$y′=y+tcost$

$y(0)=1$

Model answer:

$ y$n+1$ = y_n + \Delta t (y$n+1$ + t$n+1$cos(t$n+1$))$

$ y$n+1$ = (y_n + \Delta t $tn+1$ cos($tn+1$))/(1 - \Delta t)$

image.png


Part 7: Sensing and Observation¶

The distance x between a fixed benchmark and a moving benchmark on a landslide is measured at times t=0,2,4,6,8,10 months. The observations are shown in the figure. The precision of the first 3 observations is 0.5 mm, the precision of the last 3 observations is 0.2 mm. All observables are independent and normally distributed.

It is assumed that normally the distance is changing at a constant rate. It is known, however, that at t=5 months there was a sudden slip of the landslide, causing an additional change in distance at that time.

Two engineers are asked to estimate the parameters that characterize the motion of the landslide.

  • Engineer 1 uses as nominal model (null hypothesis) that before and after the slip, the rate of change was the same. The 3 unknown parameters she is estimating are: the distance $x_0$ [mm] at t=0, the constant rate of distance change $v_0$ [mm/month], and the distance change s during the sudden slip event. See figure.
  • Engineer 2 uses as nominal model (null hypothesis) that before and after the slip, the rate of change was different. This means that there will be 4 unknown parameters to estimate.

image.png

A. Give the functional model (in the form E(Y)=A⋅x) and stochastic model corresponding to the null hypothesis of Engineer 1.

Model answer:

For $t\leq4$ we have: $x(t) = x_0 + v_0 * t_i$

For $t\geq6$ we have: $x(t) = x_0 + v_0 * t_i + s$

image-2.png

$\Sigma_Y = diag(0.25,0.25,0.25,0.04,0.04,0.04)$


B. Give the functional model corresponding corresponding to the null hypothesis of Engineer 2 in the form E(Y)=A⋅x. Explain (describe) what the 4 unknown parameters are.

Hint: there are multiple options, you only need to give one option. For example you can choose either $x_s$ or $s$ as one of the unknown parameters.

Model answer:

Option 1:

For $t\leq4$ we have: $x(t) = x_0 + v_0 * t_i$

For $t\geq6$ we have: $x(t) = x_0 + v_s * (t_i-5)$

image.png

The unknown parameters are the distances $x_0=x(0)$ and $x_s=x(5)$ [mm], and the constant rates of change $v_0$ and $v_s$ [mm/month] before and after the slip, respectively.

Option 2:

same as option 1, but instead of $x_s$ estimate the distance change during the slip event $s$ (note that $x_s = x_0 + 5·v_0 + s$):

image-2.png

Option 3:

formulate the model with the same parameters as model 1, plus the differential velocity $v_d = v_s - v_0$ where $v_s$ is the constant rate of change after the slip event.

image-3.png


C. Which test do the engineers need to apply to check the validity of their model?

Model answer:

Overall model test


D. What are the corresponding critical values used by the engineers with α=0.05?

Model answer:

  • Engineer 1: $k_\alpha = \chi^2_\alpha (6 - 3) = 7.8147$
  • Engineer 2: $k_\alpha = \chi^2_\alpha (6 - 4) = 5.9915$

E. Do you generally expect the test statistic of Engineer 1 to be larger or smaller than the test statistic of Engineer 2? Explain.

Model answer:

With the model of Engineer 2 you have one extra parameter, hence smaller residuals (better fit) can be expected: hence the test statistic with model 2 is expected to be smaller. Fitting 2 lines with a different slope to the first set and the second set, should indeed result in a better fit. Note: that is also why the critical value is smaller.


Part 8: Sensing and Observation¶

A company is producing instruments for measuring the CO2 concentration [ppm] in air. Their top range model has a precision of 2 ppm (for a single measurement). Scientists need to know the CO2 concentration with a 98% confidence interval of ± 0.75 ppm. Therefore they use repeated measurements to estimate the concentration.

The company would like to improve their instrument, such that only 5 repeated measurements will be needed to meet the requirement. What is the required precision of a single measurement in that case?

Model answer:

$ k$0.5$\alpha$$ = 2.326$ from the standard normal distribution: $P(Z<k$0.5$\alpha$$)=1−0.5\alpha$

Required is k0.5$\alpha$$1/5\sigma$new = 0.75 , from which we can compute that the precision must be $\sigma$new = 0.72 ppm.


Part 9: Simulation¶

Monte Carlo Simulations are a broad class of computational algorithms that rely on repeated random sampling to obtain a set of numerical results, and ultimately to quantify the uncertainty in the output.

A. State-of-the art algorithms for generating random samples from a distribution are based on the optimized pseudo-number generation of identically distributed samples. If the input variable has a Gaussian distribution, which is the most appropriate sampling method to be used?

  • Inverse Transform
  • Box-Mueller
  • Standard Accept/Reject method

Model answer:

  • Box-Mueller

B. Why are the other two methods not appropriate? State one reason for each.

Model answer:

  • The inverse CDF of the Gaussian Distribution is not known in closed-form, therefore the inverse transform cannot be applied accurately
  • The standard Accept/Reject method will not be as efficient.
  • The Ziggurat algorithm is a more advanced rejection algorithm that is currently implemented in Numpy.

Part 10: Simulation¶

The second step of the Box-Mueller algorithm requires to compute

$r = \sqrt(-2 ln[1-u_1])$

$\theta = 2 \pi u_2$

with u being samples generated from the random variables:

$U_1 \approx U(0, 1)$

$U_2 \approx U(0, 1)$

r is often written in the code as

r = np.sqrt(-2 * np.log(U1))

A. Explain why it is correct to use the $log(u_1)$ instead of $log(1-u_1)$

Model answer:

Because u1 is a random number between 0 and 1, therefore also 1-u1 is a random number between 0 and 1.


B. You have generated enough samples from $N(\mu,\sigma^2)$, so that your output mean is now converged to a value. Make two sketches of the empirical PDFs (probability density functions) for samples with a very small variance, and the other with a large variance.

Model answer:

image-3.png image-2.png

  • If the variance is small, the number of samples generated would give a good approximation of the standard deviation. If the variance is large, this would not be the case.
  • To obtain the full distribution many more sample will be needed, especially in the case with large variance.
In [ ]: