Parametric distributions from empirical data

Parametric distributions from empirical data#

Introduction#

Risk and reliability play a major role not only in engineering, but in insurance applications. In this group assignment, your team will derive some basic risk quantities both from parametric distributions and from empirical data.

As always, let us start by importing a few necessary libraries.

import numpy as np
import matplotlib.pyplot as plt
import scipy.stats

# Default config for the plots
plt.rcParams["figure.figsize"] = (8, 4.5)
plt.rcParams["axes.grid"] = True

Initial calculations#

A reliability engineer is hired by an insurance company for his knowledge of probability and statistics and his affinity with numerical reasoning. The company will start working in a new market where they want to introduce their products. They start with simple models to analyze this new market. The senior actuarial is sure that the hazard rate for the population is not constant. However, she wants to start testing some simple models for their products.

To begin, she wants to know \(P(T>95)\), that is the probability of surviving to 95 years assuming a constant failure rate of \(\frac{1}{18}\).

Task 1:

Compute the probability \(P(T>95)\) assuming a constant failure rate of \(\frac{1}{18}\).

lambda_exp = 1/18  # hazard rate for t in years
t_target = 95 # target age t in years

S95 = np.exp(-lambda_exp * t_target)
print(f"The probability of surviving 95 years assuming a constant rate is {100*S95:.2f}%.")

The probability of surviving 95 years assuming a constant rate is 0.51%.

Task 2:

Compute the probability \(P(T>95)\) assuming the data follows Weibull survival times with \(α=1.2\) and \(λ=\frac{1}{20}\).

Hint: The Weibull function is scipy.stats.weibull_min. Check how the density is defined in scipy.stats and adapt the parameters, if necessary.

Solution:

The survival function \(\operatorname{sf}(x)\) is defined as \(1-\operatorname{cdf}(x)\). You can also use \(1-\operatorname{cdf}(x)\), if you prefer.

Erratum:

We made a mistake in this exercise: scale = 1/lambda is only true if the shape parameter is 1.

Explanation:

scipy.stats defines the weibull_min function’s PDF as:

\[f(x)=\frac{\operatorname{shape}}{\operatorname{scale}}\left(\frac{x}{\operatorname{scale}}\right)^{\operatorname{shape}-1}e^{-\left(\frac{x}{\operatorname{scale}}\right)^{\operatorname{shape}}}\]

With reformulation we get:

\[f(x)=\frac{\operatorname{shape}}{\operatorname{scale}}\left(\frac{x^{\operatorname{shape}-1}}{\operatorname{scale}^{\operatorname{shape}-1}}\right)e^{-\left(\frac{x^{\operatorname{shape}}}{\operatorname{scale}^{\operatorname{shape}}}\right)}\]

\[f(x)=\operatorname{shape}\left(\frac{x^{\operatorname{shape}-1}}{\operatorname{scale}^{\operatorname{shape}}}\right)e^{-\left(\frac{x^{\operatorname{shape}}}{\operatorname{scale}^{\operatorname{shape}}}\right)}\]

\[f(x)=\frac{1}{\operatorname{scale}^{\operatorname{shape}}}\operatorname{shape}x^{\operatorname{shape}-1}e^{-\frac{1}{\operatorname{scale}^{\operatorname{shape}}}x^{\operatorname{shape}}}\]

In the book, we have the formulation:

\[f(x)=\lambda \alpha x^{\alpha-1}e^{-\lambda x^\alpha},\]

where \(\alpha = \operatorname{shape}\). Comparing the two equations, we can see that:

\[\operatorname{lambda} = \frac{1}{\operatorname{scale}^{\operatorname{shape}}}\]

solve for \(\operatorname{shape}\) and we get:

\[\operatorname{scale}^{\operatorname{shape}} = \frac{1}{\operatorname{lambda}}\]

\[\operatorname{scale} = \left(\frac{1}{\operatorname{lambda}}\right)^{\frac{1}{\operatorname{shape}}}\]

\[\boxed{\operatorname{scale} = \frac{1}{\operatorname{lambda}^{\frac{1}{\operatorname{shape}}}}}\]

and, as we can see, \(\operatorname{scale} = \frac{1}{\operatorname{lambda}}\) is only true if \(\operatorname{shape}=1\), which is not the case in our example.

A big thanks to everyone that alerted us to that error, and please accept our apologies for the mistake!

alpha = 1.2                # shape
lambda_weib = 1 / 20       # lambda
scale = 1 / lambda_weib    # scale (inv of lambda)

S95_weib = scipy.stats.weibull_min.sf(t_target, c=alpha, scale=scale)
print(f"The probability of surviving 95 years assuming a changing rate is {100*S95_weib:.2f}%.")

The probability of surviving 95 years assuming a changing rate is 0.15%.

Task 3:

Plot both hazard rates within the \([0,99]\) interval in years.

ages = np.arange(0, 100, 1)

r_exp = np.full_like(ages, fill_value=lambda_exp, dtype=float)

pdf_w = scipy.stats.weibull_min.pdf(ages, c=alpha, scale=scale) # density of death at time t (pdf-based)
sf_w  = scipy.stats.weibull_min.sf(ages,  c=alpha, scale=scale) # survival function // reliability function -> probability of still being alive past time t (cdf-based)

r_weib = pdf_w / sf_w # hazard rate -> instant probability of failure

plt.figure()
plt.plot(ages, r_exp, label="Constant")
plt.plot(ages, r_weib, label="Weibull")
plt.xlabel("Age")
plt.ylabel(r"Instant probability of failure $r(t)$ (hazard rate)")
plt.legend()
plt.show()

../../_images/cca1523a314fb68e6578a62e397a9c96a65812f57889d09a7f3ac16995af565e.png

Now we have data!#

After some time, data becomes available to the team. It is provided by authorities in the following table where \(t\) indicates age in years and \(L(t)\) number of individuals alive at age \(t\).

t	L(t)	t	L(t)	t	L(t)	t	L(t)
0	1,023,102	15	962,270	50	810,900	85	78,221
1	1,000,000	20	951,483	55	754,191	90	21,577
2	994,230	25	939,197	60	677,771	95	3,011
3	990,114	30	924,609	65	577,822	99	125
4	986,767	35	906,554	70	454,548	100	-
5	983,817	40	883,342	75	315,982
10	971,804	45	852,554	80	181,765

The empirical density function may be estimated as:

\[ f(t)\approx \frac{\text{number of deaths during }[t,t +Δt)}{N \cdot Δt}=\frac{n(t+Δt)-n(t)}{N \cdot Δt} \]

The data is provided below:

Age = np.array([0, 1, 2, 3, 4, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65,
                70, 75, 80, 85, 90, 95, 99, 100], dtype=float)

Lives = np.array([1023102, 1000000, 994230, 990114, 986767, 983817, 971804, 962270,
                 951483, 939197, 924609, 906554, 883342, 852554, 810900, 754191,
                 677771, 577822, 454548, 315982, 181765, 78221, 21577, 3011,
                 125, 0], dtype=float)

Task 4:

Estimate and plot \(f(t)\) based on the available data.

Deaths = Lives[:-1] - Lives[1:] # vector of # of deaths
Delta  = Age[1:] - Age[:-1] # vector of associated \delta_t

n0 = Lives[0]  # initial sample size

f_hat = Deaths / (Delta * n0) # density estimation in each interval

plt.figure()
plt.plot(Age[:-1], f_hat, marker="o")
plt.xlabel("Age")
plt.ylabel(r"Estimated density $\hat{f}(t)$")
plt.show()

../../_images/620da850fb8ddbf402c204f448618280621e3a67bbb8650d01b2dbcc5f206626.png

Now we are interested in the reliability function. Based on the empirical data, this function \(\bar{F}(t)\) can be estimated empirically as the number of survivals at time t divided by the total population.

Task 5:

Estimate the reliability function \(\bar{F}(t)\) and the hazard-rate \(r(t)\) based on the available data.

F_hat = Lives / n0 # reliability function
r_hat = f_hat / F_hat[:-1]

Task 6:

Plot the computed hazard rate \(r(t)\) against the hazard rates you plotted in Task 3.

age_data = Age[:-1]

r_exp_data = np.full_like(age_data, lambda_exp, dtype=float)

pdf_w_data = scipy.stats.weibull_min.pdf(age_data, c=alpha, scale=scale)
sf_w_data  = scipy.stats.weibull_min.sf(age_data,  c=alpha, scale=scale)
r_weib_data = pdf_w_data / sf_w_data

plt.figure()
plt.plot(age_data, r_exp_data, marker="o", label="Constant")
plt.plot(age_data, r_weib_data, marker="*", label="Weibull")
plt.plot(age_data, r_hat, marker="^", label="Empirical")
plt.xlabel("Age")
plt.ylabel(r"Instant probability of failure $r(t)$ (hazard rate)")
plt.legend()
plt.show()

../../_images/562ff0dc4bed311d5b5606469c6d79598979b668915a95e60c553517c5758dc2.png

Task 7:

Discuss the hazard rate functions for the constant failure rate model and the Weibull model in comparison with the empirical (from data) estimation. What do you observe a) in the beginning of life and b) by the end of life?

Solution:

In the constant failure rate model, failure rates are (surprise, surprise) constant at all times. In the Weilbull model, failure rates are initially very low but increase until they plateau at a fixed value. In the empirical model, we see that failure rates are initially high (infant mortality), then reach a plateau of stability during childhood and maturity, and eventually increase again in old age, resulting in a “bathtub” curve.

Task 8:

Plot the reliability function \(\bar{F}(t)\) against the reliability functions of the distributions in Task 1 and Task 2.

S_exp = np.exp(-lambda_exp * Age)
S_weib = scipy.stats.weibull_min.sf(Age, c=alpha, scale=scale)
S_data = F_hat

plt.figure()
plt.plot(Age, S_exp, marker="o", label="Constant")
plt.plot(Age, S_weib, marker="*", label="Weibull")
plt.plot(Age, S_data, marker="^", label="Empirical")
plt.xlabel("Age")
plt.ylabel(r"Reliability function $F(t)$")
plt.legend()
plt.show()

../../_images/61c6580648727d61a8ddaf83b5ab65ef93ff262adab76c0e12bd2cf43b14898f.png

Task 9:

Discuss the reliability functions for the constant failure rate model and the Weibull model in comparison with the empirical (from data) estimation.

It’s all about the money#

Now that you have analyzed the data, let’s think about some implications of the failure rate models we have on the life insurance pricing our company offers. Answer these questions in the report.

Task 10:

How should life insurance be priced by the company under the three different models? Consider insurance policies that change their premiums with age.

Task 11:

Think of other examples in engineering or everyday life and discuss what failure curves these examples follow. Explain why. Extra points for finding engineering examples of “bathtub” curves.

By Max Ramgraber, Renan Barros and Oswaldo Morales Napoles, Delft University of Technology. CC BY 4.0, more info on the Credits page of Workbook