Resit 22/23 Q2¶

CEGM1000 MUDE

Part 1: Coding¶

A. Who benefits from applying the FAIR principles?

  • Primarily system admins and data stewards, so as to keep the data well organized
  • Researchers, publishers, and software builders, among other relevant stakeholders
  • FAIR is a principle that guides data storage and retrieval, so it is only relevant for the algorithm in charge of storage and retrieval
  • Only corporations profit from data. FAIR data is private data and, therefore, behind a paywall

Model answer:

  • Researchers, publishers, and software builders, among other relevant stakeholders

B. What does it mean that data is Findable?

  • It means that data is easily accessed by the users. To this end, users need an authentication process

  • It means that data is easy to locate for its retrieval. To this end, data must be appropriately labeled

  • It means that data can be integrated with other data. To this end, data must be standardized

  • It means that data can be used and reused by diverse users. To this end, it is important to make clear its origins and conditions of (re)use

Model answer:

  • It means that data is easy to locate for its retrieval. To this end, data must be appropriately labeled

C. Can private data be FAIR?

  • Yes. Accessible data does not mean "open data". FAIR data might only be accessible by the relevant stakeholders (for instance, physicians have access to medical records which are not open to the public)

  • All FAIR data is private data. So yes, private data can be FAIR

  • All FAIR data is open data. So no, private data cannot be FAIR

Model answer:

  • Yes. Accessible data does not mean "open data". FAIR data might only be accessible by the relevant stakeholders (for instance, physicians have access to medical records which are not open to the public)

D. What is the value of implementing the FAIR principles?

  • The value of the FAIR principles is that they keep the data well-structured and easy to find
  • There are multiple values in the FAIR principles. Here are just some: it recognizes research by making data findable; it democratizes access to data by allowing numerous researchers to access the same qualitative data regardless of their institution or origin; and it facilitates the reproducibility of research, paving the way to more reliable scientific research
  • The value of the FAIR principles is economic: institutions, governments, and companies save millions of dollars annually by structuring their data to be findable, accessible, interoperable, and reusable
  • There is no value in the FAIR principles other than technical value: data can be found quickly and effectively, data can be shared using similar formats, and data can be reused by the same researchers.

Model answer:

  • There are multiple values in the FAIR principles. Here are just some: it recognizes research by making data findable; it democratizes access to data by allowing numerous researchers to access the same qualitative data regardless of their institution or origin; and it facilitates the reproducibility of research, paving the way to more reliable scientific research

Part 2: Finite Difference Method¶

Consider the following implementation of a finite difference time stepper:

image.png

A. What kind of approximation of the time derivative is used in the given implementation?

  • Forward difference

  • Central difference

  • Backward difference

Model answer:

  • Forward difference

B. Running this code in a loop over time steps gives an approximate solution for a PDE. This PDE has a term with $k_x$ (kx in the code). To which PDE term does kx correspond?

  • $k_x x$
  • $k_x\frac{\partial u}{\partial x}$
  • $k_x\frac{\partial^2 u}{\partial x^2}$
  • $k_x\frac{\partial^4 u}{\partial x^4}$

Model answer:

  • $k_x\frac{\partial^2 u}{\partial x^2}$

C. On which line(s) in the code block are Neumann boundary conditions applied?

Model answer:

  • Lines 11 and 12
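The code block itself is only available as an image. As a hedged sketch of what a forward-difference time stepper with Neumann boundary conditions could look like (the function name, the model PDE $u_t = k_x\, u_{xx}$, and the zero-flux boundary treatment are all assumptions, not the exam's actual code):

```python
import numpy as np

def step(u, kx, dx, dt):
    """Advance u one time step for u_t = kx * u_xx (hypothetical sketch).

    Forward (explicit Euler) difference in time, central difference
    in space; all names and details here are assumptions, since the
    exam's code is only shown as an image.
    """
    u_new = u.copy()
    # interior points: forward in time, central in space
    u_new[1:-1] = u[1:-1] + dt * kx * (u[2:] - 2 * u[1:-1] + u[:-2]) / dx**2
    # zero-flux Neumann boundary conditions at both ends
    u_new[0] = u_new[1]
    u_new[-1] = u_new[-2]
    return u_new
```

With this layout, the two lines applying the boundary conditions are the analogue of lines 11 and 12 in the original listing.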

Part 3: Finite Element Method 1¶

A. Two-point Gauss integration on a 1D subdomain running from $\xi \in [-1,1]$ is performed with $\xi_{ip} = \pm\frac{1}{\sqrt{3}}$. For a general element defined on the domain $x \in [a, b]$, what are the $x$ positions of the integration points?

  • $x_{ip} = \frac{a+b}{2} \pm \frac{b-a}{2\sqrt 3}$
  • $x_{ip} = \frac{b-a}{2} \pm \frac{a+b}{2\sqrt 3}$
  • $x_{ip} = \frac{a+b}{2} \pm \frac{b-a}{\sqrt 3}$
  • $x_{ip} = \frac{b-a}{2} \pm \frac{a+b}{\sqrt 3}$

Model answer:

  • $x_{ip} = \frac{a+b}{2} \pm \frac{b-a}{2\sqrt 3}$

B. What is the order of polynomial that can be integrated exactly with the two-point Gauss integration scheme?

  • First order (linear function)

  • Second order (quadratic function)

  • Third order (cubic function)

  • None, numerical integration always results in an approximation

Model answer:

  • Third order (cubic function)
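Both answers can be checked numerically. A minimal sketch (the function name is illustrative) that maps $\xi_{ip}=\pm 1/\sqrt{3}$ onto $[a,b]$ and integrates a cubic exactly:

```python
import numpy as np

def gauss2(f, a, b):
    """Two-point Gauss-Legendre quadrature of f on [a, b]."""
    mid, half = (a + b) / 2, (b - a) / 2
    # integration points mapped from xi = ±1/sqrt(3) on [-1, 1]
    x = mid + half * np.array([-1.0, 1.0]) / np.sqrt(3)
    # both points carry weight 1 on [-1, 1]; scale by the Jacobian
    return half * (f(x[0]) + f(x[1]))

# cubics are integrated exactly: integral of x^3 over [0, 2] is 4
print(gauss2(lambda x: x**3, 0.0, 2.0))  # ~4.0, up to rounding
```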

Part 4: Finite Element Method 2¶

Consider a PDE of the type $-\alpha \frac{\partial^2 u}{\partial x^2} + \beta u - \gamma = 0$, where $u(x)$ is the primary unknown and $\alpha(x)$, $\beta(x)$, and $\gamma(x)$ are given and independent of $u$. Following the conventional notation, let the N-matrix contain the shape functions and the B-matrix contain the shape function derivatives. In the discretized form with the finite element method, terms with the parameters $\alpha$, $\beta$, and $\gamma$ appear. What form do these terms have:

A. with $\alpha$?

  • $\int N^T \alpha dx$
  • $\int B^T \alpha dx$
  • $\int N^T \alpha N dx$
  • $\int B^T \alpha B dx$

Model answer:

Since the $\alpha$ term contains the second-order derivative of $u$, the weak form produces shape function derivatives, collected in the B-matrix. With $u$ factored out, the term takes the form $\int B^T \alpha B \, dx$.


B. with $\beta$?

  • $\int N^T \beta dx$
  • $\int B^T \beta dx$
  • $\int N^T \beta N dx$
  • $\int B^T \beta B dx$

Model answer:

For the $\beta u$ term, the shape functions in the N-matrix are applied directly; with $u$ factored out, the term takes the form $\int N^T \beta N \, dx$.


C. with $\gamma$?

  • $\int N^T \gamma dx$
  • $\int B^T \gamma dx$
  • $\int N^T \gamma N dx$
  • $\int B^T \gamma B dx$

Model answer:

With $u$ absent from the $\gamma$ term, only the weight function contributes, and the discretized term takes the form $\int N^T \gamma \, dx$.
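For concreteness, the three term types can be written out for one linear element. The sketch below assumes constant $\alpha$, $\beta$, and $\gamma$ on the element and standard linear shape functions; it is an illustration, not the course implementation:

```python
import numpy as np

def element_terms(x1, x2, alpha, beta, gamma):
    """Element integrals for -alpha u'' + beta u - gamma = 0, using
    linear shape functions N = [(x2-x)/L, (x-x1)/L] on [x1, x2] and
    constant alpha, beta, gamma (an illustrative sketch)."""
    L = x2 - x1
    # alpha term: integral of B^T alpha B dx, with B = [-1/L, 1/L]
    K = alpha / L * np.array([[1.0, -1.0], [-1.0, 1.0]])
    # beta term: integral of N^T beta N dx
    M = beta * L / 6.0 * np.array([[2.0, 1.0], [1.0, 2.0]])
    # gamma term: integral of N^T gamma dx
    f = gamma * L / 2.0 * np.array([1.0, 1.0])
    return K, M, f
```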


Part 5: Optimization¶

A. Consider the example of an animal feed producer that intends to produce a special type of product with specific characteristics. This feed is constituted of two types of cereals and one binding component. 10 kg of this feed must contain a minimum of 900 g of nutrient A, 500 g of nutrient B, and 30 g of nutrient C so that it can be sold to farms. Knowing that 1) the amount of cereal 2 cannot exceed double the combined amount of cereal 1 and the binding material; 2) there must always be at least 1 kg of binding material per 500 g of cereal 1 and cereal 2; and 3) the cereals have the following nutrients and costs:

image.png

Formulate the problem that would decide on the amount of Cereal 1, Cereal 2, and Binder that this producer should select per 10 kg of animal feed. Do not solve the problem! Only the formulation is required.

Model answer:

  • $x_i$: amount of component $i \in \{$Cereal 1, Cereal 2, Binder$\}$ in kg.
  • $c_i$: cost of component $i$ per kg.
  • $\min Z = \sum_{i} c_i x_i$, subject to: image.png

NOTE: there is an error in the figure above: the equation for $x_3$ should read $x_3\leq 2 (x_1+x_2)$.


B. Solve the following mathematical programming problem using the graphical solution method. Clearly state the solution and the value of the objective function in your answer as well as the intermediate steps you need to take.

$Minimize(F) = 6X + 5Y$

$Y \leq \frac{5}{6}X + 5$

$X \leq 3$

$Y \leq 5$

$Y \geq 0; X \in \mathbb{R}$

image-2.png

Model answer:

image.png

image-2.png
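Since the model answer is given only as figures, the graphical result can be cross-checked numerically; a sketch using scipy.optimize.linprog (assuming SciPy is available):

```python
from scipy.optimize import linprog

# minimize F = 6X + 5Y subject to
#   Y <= (5/6)X + 5   ->   -(5/6)X + Y <= 5
#   X <= 3,  Y <= 5,  Y >= 0,  X free
res = linprog(
    c=[6, 5],
    A_ub=[[-5/6, 1], [1, 0], [0, 1]],
    b_ub=[5, 3, 5],
    bounds=[(None, None), (0, None)],  # X unrestricted, Y >= 0
)
print(res.x, res.fun)  # optimum at (X, Y) = (-6, 0), F = -36
```

The binding constraints at the optimum are $Y=0$ and $Y \leq \frac{5}{6}X + 5$, whose intersection gives $X=-6$ and hence $F=-36$.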


C. The following is the solving process of a mixed integer program (some variables are integers and others are continuous) using branch and bound, with the objective of maximization. $x_1$ and $x_2$ are integer variables and $x_3$ is continuous. Five nodes have been explored in the tree, in the order shown in the upper right corner of each node. Has the optimal solution been obtained? Justify your answer with the information you see in the tree.

image.png

Model answer:

The optimal solution has been found: it is node 4. Solution 4 has both integer variables within the required domain, as well as the continuous variable. Its objective function value is 55, which implies that node 2, which has not been explored further (by pruning on variable ), does not need to be explored: its value of 53 means it can never surpass 55. This is because it is a maximization problem: child nodes always have an objective value no better than their parent's.


Part 6: Signal Processing¶

A continuous-time signal $x(t)$ is given as $x(t)=A_1\cos(2\pi f_1 t)+A_2\cos(2\pi f_2 t)$, with $A_1=1$, $A_2=0.1$, $f_1=10$ Hz, and $f_2=80$ Hz. In three experiments the signal has been sampled, each time using a different sampling frequency ($f_s$) and a different measurement duration ($T_{meas}$). The frequency domain plots (magnitude spectrum in logarithmic scale, as a result of the DFT) are shown below; the spectrum is double sided, but only shown for positive frequencies, and, as commonly done in practice, up to the Nyquist frequency. The values $X_k$, straight from the FFT, have been divided by $N$, the number of samples.

Determine, for each experiment/plot, the sampling frequency ($f_s$) as well as the measurement duration ($T_{meas}$). Only the final numerical answers are asked!

A.

image.png

Model answer:

  • $f_s$: 110 Hz
  • $T_{meas}$: 2 s

B.

image.png

Model answer:

  • $f_s$: 180 Hz
  • $T_{meas}$: 0.2 s

C.

image.png

Model answer:

  • $f_s$: 70 Hz
  • $T_{meas}$: 1 s
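The answers follow from two relations: the frequency resolution of the DFT is $\Delta f = 1/T_{meas}$, and the plotted axis ends at the Nyquist frequency $f_s/2$. A short sketch reproducing experiment A with the answer values above (the setup itself is illustrative):

```python
import numpy as np

fs, Tmeas = 110.0, 2.0             # experiment A: the answers from above
N = int(fs * Tmeas)                # number of samples
t = np.arange(N) / fs
x = 1.0 * np.cos(2*np.pi*10*t) + 0.1 * np.cos(2*np.pi*80*t)
X = np.fft.fft(x) / N              # scaled by N, as in the exam statement
f = np.fft.fftfreq(N, d=1/fs)
df = f[1] - f[0]                   # frequency resolution = 1/Tmeas = 0.5 Hz
# note: f2 = 80 Hz exceeds fs/2 = 55 Hz, so it aliases to fs - 80 = 30 Hz
print(df)
```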

Part 7: Time Series Analysis¶

The deformation pattern (in mm) of the East component of a global navigation satellite system (GNSS) station is expressed as the following equation: $y(t)=15+5t+3\cos 2\pi t+4\sin 2\pi t+\cos 6\pi t+2\sin 6\pi t$, where $t$ is time in years. In order to verify all the coefficients and frequencies of this deformation pattern, we have measured 2 years of daily positions of this East component. The time series is then $y=[y(t_1),\dots,y(t_{730})]^T$, with $t_1=1/365$, $t_2=2/365$, and $t_{730} = 2$ (years). Further, we assume that the measurements contain ARMA(2,0) noise with a standard deviation of $\sigma=5$ mm. You are required to answer the following questions:

A. Assuming the expression for the deformation pattern given above, compute the amplitude and initial phase of the periodic signals of this time series (A=?,θ=?). You may make use of the following formulae:

  • $y(t)=a\cos(\omega t)+b\sin(\omega t)=A\sin(\omega t+\theta)$
  • $A = \sqrt{a^2 + b^2}$
  • $\theta=\arctan\left(\frac{a}{b}\right)$

Model answer:

  • For the annual signal we have:

$A = \sqrt{a^2 + b^2} = \sqrt{3^2 + 4^2} = 5$ mm

$\theta=\arctan\left(\frac{a}{b}\right)=\arctan\left(\frac{3}{4}\right)=0.6435 \text{ rad} = 36.87°$

  • For the tri-annual signal we have:

$A = \sqrt{a^2 + b^2} = \sqrt{1^2 + 2^2} = \sqrt{5}$ mm

$\theta=\arctan\left(\frac{a}{b}\right)=\arctan\left(\frac{1}{2}\right)=0.4636 \text{ rad} = 26.57°$
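These conversions are easy to verify in a few lines; note that np.arctan2(a, b) matches the $\theta=\arctan(a/b)$ convention used above (a sketch):

```python
import numpy as np

def amp_phase(a, b):
    """Convert a*cos(wt) + b*sin(wt) into A*sin(wt + theta)."""
    return np.hypot(a, b), np.arctan2(a, b)

A1, th1 = amp_phase(3, 4)   # annual signal:     A = 5 mm,       theta = 36.87 deg
A3, th3 = amp_phase(1, 2)   # tri-annual signal: A = sqrt(5) mm, theta = 26.57 deg
print(A1, np.degrees(th1), A3, np.degrees(th3))
```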


B. Assume that we applied the least-squares harmonic estimation (LS-HE) to compute the power spectral density (PSD) for this time series. Sketch (rough drawing) to illustrate its PSD, and its critical value in a given confidence level. Label the horizontal and vertical axes, and indicate the values on the horizontal axis corresponding to the locations of the PSD peaks.

Model answer:

image.png


C. If we assume that the frequencies are given, but the coefficients of the periodic, linear, and bias terms are unknown, what would be your suggestion to estimate such unknown parameters? Write in particular the first and last rows of the design matrix A for the given time series.

Model answer:

We need to set up the linear model of observation equations $y=Ax+e$ and apply Best Linear Unbiased Estimation (BLUE) to estimate $x=[y_0,r,a_1,b_1,a_3,b_3]^T$. For the time instant $t_i$ the observation equation is: $y(t_i)=y_0+rt_i+a_1\cos 2\pi t_i+b_1\sin 2\pi t_i+a_3\cos 6\pi t_i+b_3\sin 6\pi t_i+e(t_i)$.

The first row of A becomes:

$A_1=[1,\; 1/365,\; \cos(2\pi/365),\; \sin(2\pi/365),\; \cos(6\pi/365),\; \sin(6\pi/365)]$

The last row of A is:

$A_{730}=[1,\; 2,\; \cos 4\pi,\; \sin 4\pi,\; \cos 12\pi,\; \sin 12\pi]=[1,2,1,0,1,0]$
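As a numerical sketch of this setup (variable names are illustrative), the full design matrix can be assembled in a vectorized way and the first and last rows checked:

```python
import numpy as np

# design matrix for y(t) = y0 + r*t + a1 cos 2πt + b1 sin 2πt
#                              + a3 cos 6πt + b3 sin 6πt
t = np.arange(1, 731) / 365          # daily epochs t_1 = 1/365, ..., t_730 = 2
A = np.column_stack([
    np.ones_like(t), t,
    np.cos(2*np.pi*t), np.sin(2*np.pi*t),
    np.cos(6*np.pi*t), np.sin(6*np.pi*t),
])
print(A[0])   # first row, with t = 1/365
print(A[-1])  # last row: [1, 2, 1, 0, 1, 0] up to rounding
```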


D. In order to implement prediction, the noise characteristics of time series should be determined using the normalized auto-covariance function (ACF) and partial ACF (PACF). Sketch the ACF and PACF for the given time series.

Model answer:

image.png


Part 8: Machine Learning 1¶

Devise a suitable machine learning approach for the problem described below concerning the prediction of water salinity in a river.

How would you frame this problem from a machine learning perspective, e.g., type(s) of learning and task(s)? Which techniques would you employ and why? What are the main steps needed to implement them correctly? Describe your approach based on what you learned in class in max 300 words. Use this number to gauge the level of detail in your description. Note that extra details on the case study or knowledge of this specific topic are not needed to answer correctly. While you can approach this problem using the tools of Time Series Analysis, here we ask you to use what you have learned in Machine Learning.

  • Problem description

To face increasing drinking water demands due to urbanization, the water managers of a given metropolitan area decide to draw water from a large river flowing nearby. Due to the high levels of dissolved minerals in the river’s catchment (i.e., the area of land that drains into a particular river), the water pumped from the river may experience high levels of salinity. Using water with high salinity has adverse effects on both domestic and industrial users. By knowing salinity values in advance, operators can change the pumping policies so that more water is pumped on days of low salinity and less water is pumped on days of high salinity. YOUR TASK IS TO DEVELOP A MACHINE LEARNING MODEL THAT FORECASTS THE AVERAGE DAILY SALINITY IN THE RIVER NEARBY THE CITY, ONE WEEK IN ADVANCE, THAT IS AT TIME t + 7, WHERE t IDENTIFIES THE CURRENT DAY.

The water utility provides you with complete daily average time series data for salinity, flow, and water levels for multiple locations in the river and its tributaries (see map). This includes salinity data at the target location, measured nearby the city. You are also provided with the following information:

  1. Some physical processes governing the river can be approximated as linear; others are nonlinear.
  2. All measured variables are continuous; they have different units of measurement and distribution.
  3. All measured variables, at all locations in the catchment, may help predict salinity nearby the city.
  4. For each variable at each location, different lagged values (e.g., variable measured at t, t − 1, t − 2, ...) may provide additional explanatory power. This increases the overall amount of variables in the dataset to a few hundreds.
  5. The final dataset is given in tabular format, where each column is a variable measured in a certain location at a given time lag. Given the nature of the problem and the different lags, there is likely a high linear correlation between many variables.
  6. You have no access to other sources of data.

image.png

Model answer:

  • Main points that must be considered:

Since we have the output variable, this prediction problem is framed as supervised machine learning. Since the target variable is continuous, we consider a regression task. Since some processes linking inputs and outputs are nonlinear, we must use an ML model that can account for these nonlinearities, such as a neural network. Given that we have many correlated variables, we can preprocess our dataset using principal component analysis (i.e., dimensionality reduction via unsupervised ML) to reduce the number of inputs. Since it is specified that the variables have different distributions and different units of measure, we first normalize the data. Model selection is essential to ensure the model does not overfit. We thus divide our dataset into training, validation, and test sets (e.g., 70/20/10%) and perform training with early stopping or regularization, monitoring performance on the validation set to prevent overfitting. In this way, we also select the optimal hyperparameters (e.g., the number of neurons of the neural network) that yield the best validation performance. Training can be performed using stochastic gradient descent.
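The description above could be sketched, for illustration only, as a scikit-learn pipeline (assuming scikit-learn is available; the exam asks for the description, not code):

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPRegressor

model = make_pipeline(
    StandardScaler(),                  # normalize: different units and distributions
    PCA(n_components=0.95),            # dimensionality reduction of correlated inputs
    MLPRegressor(hidden_layer_sizes=(50,),
                 early_stopping=True,  # guard against overfitting via a validation split
                 random_state=0),
)
# model.fit(X_train, y_train) would then train it on the lagged tabular inputs
```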


Part 9: Machine Learning 2¶

image.png

Looking at the change of objective function when performing K-means clustering on a dataset, which number of clusters is optimal using the Elbow method?

  • 3
  • 4
  • 5
  • 10

Model answer:

  • 5

Part 10: Machine Learning 3¶

When building machine learning models for regression, a number of crucial aspects related to model selection and bias/variance tradeoff must be considered. From the list of statements below, identify all the ones that are true. Each wrong answer will result in the loss of one point, with the lowest possible score being zero (i.e., we will not subtract points from the rest of the exam).

  • Increasing the number and/or size of the layers in a neural network can make it fit arbitrarily complex functions. This in turn means neural nets can achieve an expected loss of zero for arbitrarily noisy datasets of these complex functions;

  • Fitting noisy functions using neural networks with a large number of parameters tends to lead to models with high variance, that is, with high sensitivity to specific realizations of small datasets;

  • Switching from large neural networks to smaller linear basis function models should translate to significant increases in bias, leading to models that are more robust to changes in dataset;

  • When adding an L2 regularization term $\frac{λ}{2} w^Tw$ to the loss function, higher values of $λ$ should lead to models with higher variance and lower bias;

  • Given a model with a single regularization hyperparameter $λ$, an appropriate model selection procedure would be to pick the value of $λ$ that minimizes the error computed over a test dataset;

Model answer:

  • Fitting noisy functions using neural networks with a large number of parameters tends to lead to models with high variance, that is, with high sensitivity to specific realizations of small datasets;

  • Switching from large neural networks to smaller linear basis function models should translate to significant increases in bias, leading to models that are more robust to changes in dataset;


Part 11: Reliability 1¶

You are asked to evaluate the system reliability of an industrial facility, specifically with respect to limiting the chance that people working and living near the plant may die due to one of three different failure modes, $M_i$ (note: the descriptions of the failure modes and the numbers in the table are not realistic, but are meant to illustrate how the system behaves).

  • $M_1$ : a toxic chemical leaks into the groundwater system, poisoning the entire community
  • $M_2$ : a toxic gas leak occurs and the detection system does not provide a warning
  • $M_3$ : an explosion occurs and the automatic system to put out the fire fails

All failure modes are mutually exclusive, and the probability of occurrence is provided in the following table, along with the expected fatalities:

image.png

A. Construct the FN curve for the industrial facility. Don’t worry about the scale of your plot being precise, as long as the FN values are clearly indicated at each point.

Model answer:

image.png


B. Which of the following statements best describes the dependence between each failure mode?

  • Statistically independent

  • Strong positive dependence

  • Strong negative dependence

  • Not possible to determine with information provided

Model answer:

  • Strong negative dependence

The regulatory agency in charge of the safety standards in this area has specified a limit line of the form $C/n^α$, where $C=0.2$ and $α=0.5$, which indicates the system may not be safe.

C. Explain why the system is not safe (include a quantitative result in your explanation and show your work; you may draw the limit line on the plot from your previous answers)

Model answer:

image.png


D. Propose one mitigation measure that you can implement to make the system safe. Be sure to specify exactly how your measure would influence the calculations made in the previous 3 questions.

Model answer:

Decrease $N$ and/or decrease $p(N)$. Increasing the number of acceptable fatalities is not a good approach, because the question asks for making the system safer, not for changing the definition of ‘safe enough’.


Part 12: Reliability 2¶

You are designing a system that can be illustrated as follows, where the failure probabilities of components 1, 2 and 3 are 0.1, 0.2 and 0.5, respectively.

image.png

Recall that the failure probability of a series and parallel system can be determined using the following equations:

  • Parallel: $p_f = \prod_{i}^{n}P(F_i)$

  • Series: $p_f = 1 - \prod_{i}^{n}(1- P(F_i))$

image-2.png

A. What is the failure probability of the system?

Model answer:

$P_{f,sys}=0.1 \cdot 0.2 \cdot 0.5=0.01$
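The arithmetic, together with the series formula given above for contrast, can be sketched as follows (assuming, per the model answer, that the three components form a purely parallel system):

```python
from math import prod

def p_fail_parallel(ps):
    # parallel: the system fails only if every component fails
    return prod(ps)

def p_fail_series(ps):
    # series: the system fails if any single component fails
    return 1 - prod(1 - p for p in ps)

ps = [0.1, 0.2, 0.5]
print(p_fail_parallel(ps))  # 0.01, matching the model answer
print(p_fail_series(ps))    # 0.64, for comparison
```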


B. How would positive dependence between the components influence the system failure probability?

  • Increase

  • Decrease

  • No change

  • Not enough information provided

Model answer:

  • Increase