Part 1: Coding¶
A. What does the acronym FAIR stand for?
- Flexible, Available, Interoperable, Reusable
- Findable, Accessible, Interoperable, Repeatable
- Flexible, Available, Interoperable, Repeatable
- Findable, Accessible, Interoperable, Reusable
Model answer:
- Findable, Accessible, Interoperable, Reusable
B. What does the adjective "Interoperable" stand for in FAIR data?
- It means that data can be used by multiple operators concurrently, regardless of who they are (e.g., researchers, publishers, stakeholders, ...).
- It means that data can be integrated with other data. To this end, data must be standardised.
- It means that data can be used across multiple operating systems (e.g., Windows, Linux, ...).
- It means that data can be operated regardless of its origins.
Model answer:
- It means that data can be integrated with other data. To this end, data must be standardised.
C. Is FAIR data Open Data?
- Sometimes it can be the case. But FAIR and Open Data must be kept separate
- Yes, every FAIR data is Open Data
- Yes, and the converse is also true: every Open Data is FAIR
- FAIR data is never Open data
Model answer:
- Sometimes it can be the case. But FAIR and Open Data must be kept separate
D. Can FAIR principles prevent the fabrication of data (i.e., deliberate creation of data to propound a particular hypothesis with greater conviction)?
- Yes / No
Model answer:
- No
Part 2: Finite Difference Methods¶
A. Using the finite difference method, this PDE can be rewritten in the following discretized mathematical formulation:
What kind of approximation of the time derivative is used in the equation above?
- Forward difference
- Central difference
- Backward difference
Model answer:
- Forward difference
B. The same equation can be discretized with different choices for the finite difference approximations. Below a vectorized implementation is given:
Give a mathematical expression for the update of $u^{n+1}_{i,j}$ from relevant $u$ values at time step $n$ that is implemented in the code block above. The expected answer is an equation in a similar notation as the equation given in part (a) above.
Model answer:
C. If nothing is added to the code above, what kind of boundary conditions are applied by default? Motivate your answer.
Model answer:
Dirichlet boundary conditions, where u is fixed at the initial values, because u at the boundary is always copied from the previous time step and never updated.
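As a minimal sketch (the grid, coefficient, and time-stepping parameters below are hypothetical, not the exam's code), a vectorized explicit update only ever writes to the interior of the array, so the boundary rows and columns silently keep their initial values:

```python
import numpy as np

# Hypothetical 2D diffusion-type problem on a small grid.
nx, ny, nt = 21, 21, 50
alpha, dx, dt = 1.0, 0.05, 0.0005     # alpha*dt/dx**2 = 0.2, a stable choice
u = np.zeros((ny, nx))
u[0, :] = 1.0                          # initial values, including the boundary

for n in range(nt):
    un = u.copy()
    # vectorized forward-in-time update of INTERIOR points only
    u[1:-1, 1:-1] = (un[1:-1, 1:-1]
                     + alpha * dt / dx**2 * (un[1:-1, 2:] - 2*un[1:-1, 1:-1] + un[1:-1, :-2])
                     + alpha * dt / dx**2 * (un[2:, 1:-1] - 2*un[1:-1, 1:-1] + un[:-2, 1:-1]))
    # boundary rows/columns are never assigned, so they keep their
    # initial values: implicit Dirichlet boundary conditions

print(u[0, 0], u[-1, -1])   # boundary values unchanged after 50 steps
```

The boundary entries are never on the left-hand side of an assignment, which is exactly why the default behavior is a Dirichlet condition at the initial values.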
Part 3: Finite Element Methods¶
Recall the weak form of the Poisson equation is given as:
$$ \int_\Omega \nu \nabla w \cdot \nabla u \, d\Omega - \int_\Gamma w\nu \nabla u \cdot \mathbf{n} \, d\Gamma = \int_\Omega w f \, d\Omega $$

Assume a Robin boundary condition is applied to the boundary $\Gamma$:
$$ \alpha u + \nu \nabla u \cdot \mathbf{n} = \beta $$

Following the conventional notation, let the $\mathbf{N}$-matrix contain shape functions and the $\mathbf{B}$-matrix contain shape function derivatives. In the discretized form with the finite element method, terms with the coefficients $\nu$, $f$, $\alpha$, and $\beta$ appear. Select the correct form of the following terms.
A. With $\nu$?
- $\int_{\Omega}\mathbf{N}^T\nu \, d\Omega$
- $\int_{\Omega}\mathbf{B}^T\nu \, d\Omega$
- $\int_{\Omega}\mathbf{N}^T\nu \mathbf{N} \, d\Omega$
- $\int_{\Omega}\mathbf{B}^T\nu \mathbf{B} \, d\Omega$
Model answer:
Substitute $\nu \nabla u\cdot \mathbf{n}=\beta - \alpha u$ into the boundary term and discretize with $u=\mathbf{Nu}$, $w=\mathbf{Nw}$, $\nabla u=\mathbf{Bu}$, $\nabla w=\mathbf{Bw}$. We can then eliminate $\mathbf{w}$ from all terms and move $\mathbf{u}$ out of the integrals: $$\left(\int_\Omega \mathbf{B}^T\nu\mathbf{B}\,\mathrm{d}\Omega + \int_\Gamma \mathbf{N}^T \alpha\mathbf{N}\, \mathrm{d}\Gamma\right) \mathbf{u} = \int_\Omega \mathbf{N}^T f\, \mathrm{d}\Omega + \int_\Gamma \mathbf{N}^T \beta\, \mathrm{d}\Gamma$$
So the answer is: $\int_{\Omega}\mathbf{B}^T\nu \mathbf{B} \, d\Omega$
B. With $\alpha$?
Model answer: See the model answer in part (a); the answer is $\int_\Gamma\mathbf{N}^T\alpha \mathbf{N} \, d\Gamma$
C. For a 4-node quadrilateral (2D) element, what is the size of the B-matrix?
Model answer:
- [2×4]
- $\mathbf{B}$ relates $\nabla u$ (size 2) to the vector of nodal u-values for the element (size 4). For a four-node element there are 4 shape functions, and in 2D each shape function has derivatives in 2 directions. The $\mathbf{B}$-matrix, which contains these derivatives, is therefore of size 2×4. It cannot be 4×2, because the element stiffness matrix defined as $\mathbf{B}^T\mathbf{B}$ should be 4×4 (related to the number of degrees of freedom in the element).
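These sizes are easy to verify numerically. A sketch for the reference element only (a real FE code would also multiply by the inverse Jacobian of the element mapping):

```python
import numpy as np

def B_matrix(xi, eta):
    """Shape-function derivatives of a 4-node quadrilateral on the
    reference element [-1, 1]^2, nodes ordered counter-clockwise from
    (-1,-1). Derivatives are w.r.t. the reference coordinates only."""
    dN_dxi  = 0.25 * np.array([-(1 - eta),  (1 - eta), (1 + eta), -(1 + eta)])
    dN_deta = 0.25 * np.array([-(1 - xi),  -(1 + xi),  (1 + xi),   (1 - xi)])
    return np.vstack([dN_dxi, dN_deta])   # 2 gradient components x 4 nodes

B = B_matrix(0.0, 0.0)        # evaluate at the element center
print(B.shape)                # (2, 4)
print((B.T @ B).shape)        # (4, 4): one row/column per element dof
```

The row sums of `B` are zero at any point, reflecting that the shape functions form a partition of unity (their derivatives sum to zero).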
Part 4: Optimization 1¶
Three cities ($C_1, C_2, C_3$) are supplied with water from three different sources ($S_1, S_2, S_3$). The first ($S_1$) is a major reservoir; the other two are local sources ($S_2, S_3$). Source $S_2$ can supply cities $C_1$ and $C_3$, while $S_1$ and $S_3$ can supply all cities. The cities have a minimum consumption of $R_1, R_2, R_3$ respectively. The local sources can only supply a maximum water volume of $Q_2$ and $Q_3$. The reservoir can supply a maximum of $Q_1$, but a minimum supply of $Q_{min}$ must be imposed.
Establish the model that allows obtaining the optimum solution for the problem of supplying the cities in the most economical way, knowing that the cost of supplying city $j$ from source $i$ is given by $c_{ij}$, expressed in monetary units (m.u.) per unit of water volume.
Model answer:
Alternatively, this question may be solved as follows:
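Not part of the exam's model answer, but for illustration the same model can be written down and solved numerically. All numbers below (costs $c_{ij}$, demands $R_j$, capacities $Q_i$, and $Q_{min}$) are hypothetical placeholders:

```python
from scipy.optimize import linprog

# Decision variables x[i][j] = volume supplied from source i to city j.
# S2 cannot supply C2, so x22 is excluded from the model.
# Variable order: x11, x12, x13, x21, x23, x31, x32, x33
c = [4, 5, 6, 2, 3, 3, 4, 2]        # hypothetical unit costs c_ij
R = [50, 60, 40]                    # hypothetical minimum demands R_j
Q = [200, 30, 40]                   # hypothetical maximum supplies Q_i
Qmin = 80                           # hypothetical minimum reservoir supply

# linprog uses A_ub @ x <= b_ub, so ">=" constraints are negated.
A_ub = [
    [ 1,  1,  1,  0,  0,  0,  0,  0],   # x11+x12+x13 <= Q1
    [ 0,  0,  0,  1,  1,  0,  0,  0],   # x21+x23     <= Q2
    [ 0,  0,  0,  0,  0,  1,  1,  1],   # x31+x32+x33 <= Q3
    [-1, -1, -1,  0,  0,  0,  0,  0],   # x11+x12+x13 >= Qmin
    [-1,  0,  0, -1,  0, -1,  0,  0],   # supply to C1 >= R1
    [ 0, -1,  0,  0,  0,  0, -1,  0],   # supply to C2 >= R2
    [ 0,  0, -1,  0, -1,  0,  0, -1],   # supply to C3 >= R3
]
b_ub = [Q[0], Q[1], Q[2], -Qmin, -R[0], -R[1], -R[2]]

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None)] * 8)
print(res.status, round(res.fun, 2))   # 0 520.0 with these placeholder numbers
```

The model itself (objective plus the supply, demand, and minimum-reservoir constraints) is what the question asks for; the numbers are only there to make the sketch runnable.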
Part 5: Optimization 2¶
Consider the following table of the SIMPLEX method for solving an LP maximization
problem:
Solve the problem
Model answer:
- X2 enters the basis and S1 leaves the basis.
- In the next table we choose S3 to leave, but it could have been S2 as well, because they have the same ratio of the right-hand-side (independent) term to the coefficient in the column of the variable that is going to enter the basis.
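The tie described above comes from the minimum-ratio test. A generic sketch (the column and right-hand-side values below are hypothetical, not the exam's table):

```python
import numpy as np

def min_ratio_test(rhs, col):
    """Return the rows achieving the minimum ratio rhs_i / col_i over
    rows with a positive coefficient in the entering variable's column."""
    ratios = np.where(col > 0, rhs / np.where(col > 0, col, 1), np.inf)
    best = ratios.min()
    return np.flatnonzero(np.isclose(ratios, best)), best

# Hypothetical column/RHS reproducing a tie: 6/2 = 12/4 = 3, so either
# of the first two basic variables may be chosen to leave the basis.
rows, ratio = min_ratio_test(np.array([6.0, 12.0, 5.0]),
                             np.array([2.0, 4.0, -1.0]))
print(rows, ratio)   # [0 1] 3.0
```

Rows with a non-positive coefficient are skipped (their ratio is set to infinity), which is why the third row never competes here.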
Part 6: Optimization 3¶
In the next diagram you can see the solving process of the branch and bound for the minimization of an integer programming problem with two decision variables. The number in the upper right corner
represents the solving order in the tree:
Is the process finished? That is, are there more nodes to be explored? Why?
Model answer:
The process is finished. This is a minimization problem, and three nodes have been explored. Node 2, found after branching the relaxed solution on variable X1, has produced an integer solution x1 = 5, x2 = 0 with an objective function value of 60. The next solution (node 3), x1 = 4, x2 = 3.3, is not an integer solution and at the same time results in a worse objective function value (65), which means that it is not worth branching the problem on variable X2.
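The pruning rule used in this answer fits in a couple of lines (values taken from the tree described above):

```python
# Node 2 gives an integer-feasible incumbent with objective 60;
# node 3's LP relaxation has objective 65 (fractional x2 = 3.3).
incumbent = 60.0
node3_bound = 65.0

# In a minimization, a node whose relaxation is already worse than the
# incumbent can never contain a better integer solution, so it is pruned
# and no further branching is needed.
prune = node3_bound >= incumbent
print(prune)   # True
```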
Part 7: Signal Processing¶
A continuous time signal $x(t)$ is given as:
- $x(t) = A_1 \cos(2\pi f_1 t) + A_2 \cos(2\pi f_2 t)$
With $A_1 = 1$, $A_2 = 0.1$, $f_1 = 10$ Hz and $f_2 = 80$ Hz
The signal has been sampled in three experiments, each time using a different sampling frequency $f_s$ and a different measurement duration $T_{meas}$. The frequency-domain plots (magnitude spectrum on a logarithmic scale, as a result of the DFT) are shown below; the spectrum is double-sided, but only shown for positive frequencies and, as commonly done in practice, up to the Nyquist frequency. The values of $X_k$, straight from the fft implementation, have been divided by N, the number of samples.
Determine, for each experiment, the sampling frequency $f_s$ as well as the measurement duration $T_{meas}$. Only the final numerical answers are asked!
Some useful formulas: $f_{Nyquist} = f_s/2$, $T_{meas} = 1/f_0$, with $f_0$ the frequency resolution
A. Experiment A
Model answer:
- $f_s$ = 100 Hz ; $T_{meas}$ = 2 s
- Explanation (not asked):
A peak at 10 Hz (correct for $f_1$ = 10 Hz), and a peak at 20 Hz (which must be an alias of $f_2$ = 80 Hz), both matching the given (halved) amplitudes. Observing that this magnitude spectrum is given up to 50 Hz (= $f_s/2$, the Nyquist frequency), you can verify that $f_s$ = 100 Hz and that $f_2$ then indeed produces an alias at 100 − 80 = 20 Hz. From the left part of the graph you can see that the spectrum is computed at 0.5 Hz intervals (the spacing of the crosses), hence $T_{meas}$ = 1/0.5 = 2 seconds.
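This reasoning can be reproduced numerically; a sketch regenerating Experiment A from the inferred settings ($f_s$ = 100 Hz, $T_{meas}$ = 2 s):

```python
import numpy as np

A1, A2, f1, f2 = 1.0, 0.1, 10.0, 80.0
fs, Tmeas = 100.0, 2.0
N = int(fs * Tmeas)                     # 200 samples, 0.5 Hz bin spacing
t = np.arange(N) / fs
x = A1 * np.cos(2 * np.pi * f1 * t) + A2 * np.cos(2 * np.pi * f2 * t)

X = np.abs(np.fft.fft(x)) / N           # double-sided magnitude, divided by N
freqs = np.fft.fftfreq(N, d=1 / fs)

# f1 = 10 Hz appears at its own bin with magnitude A1/2 = 0.5;
# f2 = 80 Hz > fs/2 aliases to fs - f2 = 20 Hz with magnitude A2/2 = 0.05.
print(round(X[np.argmin(np.abs(freqs - 10))], 3))   # 0.5
print(round(X[np.argmin(np.abs(freqs - 20))], 3))   # 0.05
```

Both signal frequencies fall exactly on DFT bins here, so the peak magnitudes equal the halved amplitudes without leakage.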
B. Experiment B
Model answer:
- $f_s$ between 160 and 200 Hz ; $T_{meas}$ = 0.5 s
- Explanation (not asked): A peak at 10 Hz (correct for $f_1$ = 10 Hz), and a peak at 80 Hz (correct for $f_2$ = 80 Hz), hence $f_s > 2 \cdot 80 = 160$ Hz. The amplitudes also match the given (halved) values. The spectrum is given up to $f_s/2$, in this case 85 Hz, so $f_s$ = 170 Hz, but this is maybe a bit hard to see/tell. Hence, any value of $f_s$ larger than 160 Hz and smaller than 200 Hz (as the graph clearly stops before 100 Hz) is correct. From the left part of the graph you can see that the spectrum is computed at 2 Hz intervals (the spacing of the crosses), hence $T_{meas}$ = 1/2 = 0.5 seconds.
C. Experiment C
Model answer:
- $f_s$ = 90 Hz ; $T_{meas}$ = 1 s
- Explanation (not asked): A single peak at 10 Hz, but now with a different amplitude; that is fishy! From this amplitude you may guess that the two cosines ended up (possibly aliased) at the very same frequency. The spectrum is given up to $f_s/2$, in this case 45 Hz, hence $f_s$ = 90 Hz, and indeed $f_2$ = 80 Hz then gets aliased to 90 − 80 = 10 Hz, so the two peaks are on top of each other. From the left part of the graph you can see that the spectrum is computed at 1 Hz intervals (the spacing of the crosses), hence $T_{meas}$ = 1/1 = 1 second.
Part 8: Time Series¶
A tide gauge station has been installed to measure the hourly sea-level variations relative to a vertical datum (reference system). These measurements are usually connected to a stable benchmark next to the tide gauge station (just to show variations with respect to a reference system). Therefore, there is a shift of approximately 3 m (the correct value should be determined) between the mean sea level (MSL) and the benchmark.
Sea level is subject to variations due to many variables like wind speed, pressure, and global warming (sea-level rise). One such variation is caused by the forces induced by celestial bodies like the Moon and Sun (main contributors), called the tide. The two major tidal constituents are the so-called $M_2$ (semi-lunar constituent) and $S_2$ (semi-solar constituent). Their periods are $T_{M2}$ = 12.4206 hours and $T_{S2}$ = 12 hours. We only use these two major constituents, and therefore ignore others.
A high-end tide gauge has been installed to measure the sea-level variations in the vertical direction. Up to now, we have collected 45 days of hourly data (so $m$ = 24 ⋅ 45 = 1080 observations). We assume that the measurements have been collected independently with a precision of σ = 5 cm (independent and normally distributed). The time series of the measurements y = [$y_1, ..., y_m$]$^T$ at time instances t = [$1, ..., m$]$^T$ (so t in hours) is as follows:
Based on the information provided above, and the fact that sea-level rise can be neglected here because it cannot be determined from this time series (45 days of data are too short; usually much longer time spans are required), we are interested in the functional and stochastic model $y = Ax + e$, $D(y) = Q_{yy}$
A. Specify the first and last row of the A-matrix and its dimensions. Also, specify $Q_{yy}$.
Model answer:
The model of observation equations should include an intercept, a sine and cosine pair for $M_2$, a sine and cosine pair for $S_2$, and noise. From this, we find as observation equation $y(t) = y_0 + a_1 \cos(2\pi f_{M2} t) + b_1 \sin(2\pi f_{M2} t) + a_2 \cos(2\pi f_{S2} t) + b_2 \sin(2\pi f_{S2} t) + e_t$. In here, the vector of unknowns is x = [$y_0, a_1, b_1, a_2, b_2$]$^T$ of size n = 5. The frequencies $f_{M2}$ and $f_{S2}$ belong to the tidal constituents $M_2$ and $S_2$, respectively.
As a result, we can write the first row of A as the elements of the observation equation for the first time point t = 1, so $A_1 = [1, \cos(2\pi f_{M2}), \sin(2\pi f_{M2}), \cos(2\pi f_{S2}), \sin(2\pi f_{S2})]$. The last row of A is for the last time point t = 1080, so $A_{1080} = [1, \cos(2160\pi f_{M2}), \sin(2160\pi f_{M2}), \cos(2160\pi f_{S2}), \sin(2160\pi f_{S2})]$.
Dimension of A is 1080×5.
The covariance matrix is of size m×m = 1080×1080:
$Q_{yy} = \sigma^2 I_{1080} = 25\, \mathrm{cm}^2 \cdot I_{1080} = 0.0025\, \mathrm{m}^2 \cdot I_{1080}$, with $I_{1080}$ an identity matrix of size 1080.
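A sketch of this design matrix in code (names follow the notation above):

```python
import numpy as np

m = 24 * 45                      # 1080 hourly observations
t = np.arange(1, m + 1)          # time in hours, t = 1..1080
f_M2 = 1 / 12.4206               # cycles/hour, M2 constituent
f_S2 = 1 / 12.0                  # cycles/hour, S2 constituent

A = np.column_stack([
    np.ones(m),                                                   # intercept y0
    np.cos(2 * np.pi * f_M2 * t), np.sin(2 * np.pi * f_M2 * t),   # M2 pair
    np.cos(2 * np.pi * f_S2 * t), np.sin(2 * np.pi * f_S2 * t),   # S2 pair
])
sigma = 0.05                     # 5 cm precision, in metres
Qyy = sigma**2 * np.eye(m)       # 0.0025 * I_1080 (m^2)

print(A.shape)                   # (1080, 5)
print(round(Qyy[0, 0], 4))       # 0.0025
```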
Assume that the two tidal constituents are not known a priori, so we want to use a spectral analysis technique to identify them. We want to compute the power spectral density (PSD) using least-squares harmonic estimation (LS-HE).
B. Sketch a plot of the expected PSD from the data set, where the horizontal axis is the frequency (cycle/hour). Add relevant numerical values on the horizontal axis.
Model answer:
We will have two signals with periods of $T_{M2}$ = 12.4206 hours and $T_{S2}$ = 12 hours. This leads to frequencies of $f_{M2}$ = 1/12.4206 = 0.080511 cycle/hour and $f_{S2}$ = 1/12 = 0.083333 cycle/hour. Therefore we have two peaks at these two frequencies:
Based on the observations in the past 45 days, we are now going to predict sea levels for the future. The functional part of the prediction comes from the settings of question a. For the stochastic part, we have
computed the normalized Auto-Covariance Function (ACF) and partial ACF (PACF) of the least squares
residuals $\hat e$. They are as follows:
We want to use the available data of the time series to predict the sea levels for the coming day (so 24 hours, from $t = m + 1$ to $t = m + 24$). Both a functional and a stochastic part contribute to the prediction ($y_P = y_F + y_S$).
C. For the stochastic part, we need to specify the ARMA(p,q) process. How do you determine appropriate orders for the ARMA(p,q) process? So p, q = ? What kind of parameters ($\beta_i$'s or $\theta_i$'s) of the ARMA process should we estimate?
Model answer:
There is a tail-off in the ACF and a cut-off in the PACF at lag 3 (i.e., two non-zero lags). Therefore the best ARMA model is ARMA(2,0), which is indeed AR(2). There is consequently no $\theta_i$ to estimate; we only need to estimate $\beta_1$ and $\beta_2$.
D. Based on the results for Question c, write an expression for $y_S$ at the time instance t = m + 1, i.e. $y_S(m+1) = $?
Model answer:
The AR(2) process for the time instance t = m + 1 is $y_S(m+1) = \beta_1 y_S(m) + \beta_2 y_S(m-1)$. (This is the process as described in the hint of 8c, dropping all terms except those containing $\beta_1$ and $\beta_2$.)
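In code, with hypothetical coefficients and residual values (the $\beta$'s and the last two stochastic components are placeholders, not given in the exam):

```python
# AR(2) one-step prediction of the stochastic part, as in the expression above.
m = 1080
beta1, beta2 = 0.8, -0.3                # hypothetical AR(2) coefficients
y_S = {m - 1: 0.04, m: 0.06}            # hypothetical last two values (metres)

y_S[m + 1] = beta1 * y_S[m] + beta2 * y_S[m - 1]
print(round(y_S[m + 1], 3))             # 0.8*0.06 - 0.3*0.04 = 0.036
```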
Part 9: Machine learning¶
A. Assume you have a dataset with N=100 data points and would like to train a linear basis function model with weights w which could potentially be too complex and overfit the data. You then decide to introduce an $L_2$ regularization term λ to the loss function and do a model selection study. Assume the number of basis functions is fixed and you cannot afford to collect more data.
- You allocate 20 samples for training, 40 for validation and 40 for testing. You use the training loss to calibrate λ, the validation loss to calibrate w and the test set to assess the final model.
- You allocate all 100 samples for training and use those to obtain both λ and w at the same time.
- You allocate 70 samples for training, 20 for validation and 10 for testing. You use the validation loss to calibrate λ, the training loss to calibrate w and the test set to assess the final model.
- You allocate all 100 samples to the training set and use those to obtain w. Then you move the samples for validation and use those to obtain λ. Finally, you move the samples to the test set and assess the final model.
Model answer:
- You allocate 70 samples for training, 20 for validation and 10 for testing. You use the validation loss to calibrate λ, the training loss to calibrate w and the test set to assess the final model.
B. A regularization term λ is added to the loss function of a neural network and a model selection study is performed by computing the mean squared error (MSE) over a validation dataset for different values of λ. The results of this study are shown above.
Regarding these results, mark all the options that are TRUE; consider that each wrong answer will result in negative points, but the lowest score for this sub-question is 0 (we will not subtract points from the rest of the exam):
- High values of λ lead to very rigid models
- Even without regularization, this specific model would already be resistant to overfitting
- The weights w of the neural net most likely increase as λ is decreased
- Increasing the size of the validation dataset (N) would make the "U"-shaped behavior of this curve less pronounced
- For λ=$10^3$, training the model on a different dataset of the same size will lead to a very different model
Model answer:
- High values of λ lead to very rigid models
- The weights w of the neural net most likely increase as λ is decreased
- Increasing the size of the validation dataset (N) would make the "U"-shaped behavior of this curve less pronounced
C. Consider the dataset with five data samples {$x_1, x_2, x_3, x_4, x_5$} = {−1.6, −0.2, 0, 1.6, 2.2} shown above. Using the Euclidean distance, we perform K-means clustering to find the global optimum (minimum objective) for cluster number K = 3. Which single data sample forms a cluster on its own? (Euclidean distance between a and b: $d = \sqrt{(a-b)^2} = |a-b|$.)
- x1
- x2
- x3
- x4
- x5
Model answer:
- x1
D. Consider again the previous dataset. This time, K-means clustering with Euclidean distance is used to find the global optimum (minimum objective) for K = 2. What are the centroids of the final clusters?
- [-0.9, 1.9]
- [-0.6, 1.9]
- [-1.0, 2.0]
- [-0.9, 1.3]
- [-0.2, 1.6]
Model answer:
- [-0.6, 1.9]
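Both answers (C and D) can be checked by brute force: with only five points, every assignment into K clusters can be enumerated and scored by the within-cluster sum of squares, which is the K-means objective:

```python
import numpy as np
from itertools import product

x = np.array([-1.6, -0.2, 0.0, 1.6, 2.2])

def best_assignment(K):
    """Exhaustively search all K^5 labelings for the global K-means optimum."""
    best, best_cost = None, np.inf
    for labels in product(range(K), repeat=len(x)):
        lab = np.array(labels)
        if len(set(labels)) < K:               # every cluster must be non-empty
            continue
        cost = sum(((x[lab == k] - x[lab == k].mean()) ** 2).sum()
                   for k in range(K))
        if cost < best_cost:
            best, best_cost = lab, cost
    return best, best_cost

lab3, _ = best_assignment(3)
sizes = np.bincount(lab3)
singleton = float(x[sizes[lab3] == 1][0])      # the sample alone in its cluster
print(singleton)                               # -1.6: x1 forms its own cluster

lab2, _ = best_assignment(2)
centroids = sorted(float(x[lab2 == k].mean()) for k in range(2))
print([round(c, 1) for c in centroids])        # [-0.6, 1.9]
```

For K = 2 the optimal split is {−1.6, −0.2, 0} vs {1.6, 2.2}, whose means are exactly the centroids in the model answer.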
E. We perform principal component analysis on a given dataset. Consider the explained variance ratio with respect to the principal component number shown in the figure. What is the lowest dimension that guarantees a total explained variance ratio of 95%?
- 2
- 3
- 4
- 5
Model answer:
- 4
Part 10: Risk and Reliability¶
You are asked to evaluate the system reliability of a 2 m diameter oil pipeline that is currently operating in an earthquake region. The main objective is to evaluate the probability of failure, which in this case is defined as the annual probability of a leak from the pipeline due to one of three different failure modes caused by an earthquake, $M_i$:
- $M_1$: buckling of the pipe from longitudinal stress
- $M_2$: high pressure failure (hoop stress)
- $M_3$: failure of welded joint
Each failure mode is dependent on whether or not an earthquake occurs, which has an annual probability of occurrence of 10%. For simplicity, consider each failure mode to be mutually exclusive, and that damage can only occur once per year per failure scenario. In other words:
$ P(M_1 \cup M_2 \cup M_3) = P(M_1) + P(M_2) + P(M_3)$
The probabilities of each failure mode have already been assessed and are summarized in the following table:
Note that the probabilities in the table don't sum to 1, which was a mistake. Nevertheless, the computations proceed without issue, so please ignore it.
A. Construct the FD curve (i.e., the FN curve, except with damage in place of fatalities on the x-axis) for leakage of the pipeline segment due to each of the 3 failure modes. Don’t worry about the scale of your plot being precise, so long as the FD values are clearly indicated at each point.
Model answer:
B. Failure of one of the pipeline segments is a function of the horizontal and vertical acceleration, $X_1$ and $X_2$, respectively. The limit state can be described by a function, illustrated in the figure, where the failure region is represented by $\Omega$. If $f_{X_1,X_2}(x_1, x_2)$ is the multivariate probability distribution of the random variables $X_1$ and $X_2$, which of the following best defines the probability of failure?
Model answer:
The correct answer is 4. A short explanation for each choice is provided here:
- This finds the union ("or") of the two random variables, which is only correct when they are mutually exclusive; this is not the failure probability. In addition, the integral is over the failure region $\Omega$, which would partially remove some of the "or" part of the variable space.
- This finds the intersection ("and") probability for the joint exceedance case of $X_i = x_i^*$
- This finds the probability of $X_1 > x_1^*$ given that $X_2 = x_2^*$. This is incorrect because it does not consider all possible failure conditions, only that where $X_2 = x_2^*$.
- This is a very simple way to formulate the failure probability: it integrates the joint PDF over the failure region, $\Omega$. It is the only correct answer.
As the pipeline is made up of many individual segments, you would like to perform a system reliability analysis to evaluate the probability of failure for the entire pipeline. Should you consider the pipeline to be a series or parallel system, and what will be the role of dependence between segments on the calculated failure probability for the entire pipeline? (use this information for the next 2 questions)
C. Should you consider the pipeline to be a series or parallel system?
Model answer:
A multi-segment pipeline is a series system, since if one segment fails, the entire pipeline has a leak and no longer functions as designed.
D. What is the quantitative effect of positive dependence between segments on the calculated failure probability? (Choose only one)
- Increase failure probability
- Decrease failure probability
- No change (they are independent)
- No change (they are mutually exclusive)
Model answer:
Consider 2 events, A and B, using the independent case as a frame of reference, i.e., when $P(A\cap B)=P(A)P(B)$. Positive dependence will cause an increase in the joint ("and") probability, $P(A\cap B)$, which in turn leads to a decrease in the union ("or") probability, $P(A\cup B)=P(A) + P(B) - P(A\cap B)$. This means that the series failure probability decreases with positive dependence, so "Decrease failure probability" is the correct choice.
Answer 3. is incorrect because the problem is asking about dependence, why would you assume an independent case?!
Answer 4. is incorrect because: 1) failures in the segments are probably not mutually exclusive (there can be multiple leaks), and 2) even if they were mutually exclusive, the probability would increase, since this can be considered an extreme case of negative dependence between elements. Note that this increase only applies to the case of mutually exclusive events quantified in the limit ρ = −1; you should not interpret this as meaning answer 1. is correct!
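The argument above can be checked with a two-event toy example (the probability values are hypothetical):

```python
# Two segment-failure events with P(A) = P(B) = 0.1 (hypothetical values).
P_A = P_B = 0.1

P_and_indep = P_A * P_B                  # 0.01: independent reference case
P_and_pos = 0.05                         # positive dependence raises the joint prob.

P_or_indep = P_A + P_B - P_and_indep     # union under independence
P_or_pos = P_A + P_B - P_and_pos         # union under positive dependence

# The series-system ("or") failure probability drops with positive dependence:
print(round(P_or_indep, 2), round(P_or_pos, 2))   # 0.19 0.15
```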
The annual probability of one or more leaks is $P_1 = 0.2$ and is based on the current operating procedure of inspecting the pipeline once per year (n = 1). However, experience within the pipeline industry indicates the failure probability can be reduced with additional inspections, such that $P_n =P_1/n$. Environmental consequences of a leak have been estimated to be D=100,000 euros, and each inspection costs 1,000 euros. Repair costs are negligible.
E. Find the optimal number of inspections per year, n, that minimizes total annual expected cost due to a pipeline leak.
Model answer:
Total annual expected cost is given by: $C_{tot}(n) = \frac{P_1}{n} D + (n-1) C_I$
Where $C_I$ is the inspection cost. The optimum is found by solving:
$\frac{dC_{tot}(n)}{dn} = -\frac{P_1}{n^2} D + C_I = 0$
$n = \sqrt{P_1 D / C_I} = 4.47 \;\Rightarrow\;$ 5 inspections
Technically the problem should be formulated with $(n-1) * C_I$ instead of $n * C_I$, but this makes almost no difference when finding the optimum, and is not explicitly stated in the exam question so no points were taken off for this.
Also, to determine whether the number of inspections should be 4 or 5, the best approach is to compare the total expected costs for both and choose the lower; rounding up or down does not guarantee that 4 or 5 is the optimal choice. Here n = 4 is actually better, since the total investment is less. No points were taken off for making a proper choice between 4 and 5.
Note: if calculation is done for a long project lifetime, an interest rate r can be assumed and D and $C_I$ should be multiplied by 1/r. Terms cancel, resulting in 5 inspections. It is incorrect to compare the investment to either total risk or change in total risk.
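The comparison of n = 4 and n = 5 discussed above is easy to tabulate, using the exam's values $P_1$ = 0.2, D = 100,000 euros, and $C_I$ = 1,000 euros:

```python
P1, D, C_I = 0.2, 100_000, 1_000

def C_tot(n):
    """Total annual expected cost: residual leak risk plus extra inspections."""
    return P1 / n * D + (n - 1) * C_I

costs = {n: C_tot(n) for n in range(1, 11)}
print(costs[4], costs[5])          # 8000.0 8000.0: the two candidates tie exactly
print(min(costs, key=costs.get))   # 4: with equal totals, fewer inspections win
```

The tie in total expected cost is why the model answer prefers n = 4: the up-front inspection outlay is smaller for the same total.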