6.1. Introduction and k-Nearest Neighbors#

# pip install packages that are not in Pyodide
%pip install ipympl==0.9.3
%pip install seaborn==0.12.2
# Import the necessary libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsRegressor
from mude_tools import magicplotter
from cycler import cycler
import seaborn as sns

%matplotlib widget

# Set the color scheme
sns.set_theme()
colors = [
    "#0076C2",
    "#EC6842",
    "#A50034",
    "#009B77",
    "#FFB81C",
    "#E03C31",
    "#6CC24A",
    "#EF60A3",
    "#0C2340",
    "#00B8C8",
    "#6F1D77",
]
plt.rcParams["axes.prop_cycle"] = cycler(color=colors)

This page contains interactive Python elements: click --> Live Code in the top right corner to activate them.

Introduction#

A very general problem often encountered is predicting some target variable \(t\) as a function of a predictor variable \(x\), but we only have inaccurate measurements of our target variable. Some examples of possible problems where we can apply this kind of framework are given below:

| Problem | Target variable \(t\) | Predictor variable \(x\) |
| --- | --- | --- |
| Structural health monitoring of a bridge | Displacement | Location along the bridge |
| Fatigue testing of a steel specimen | Stress | Number of loading cycles |
| Flood control of the river Maas | Volumetric flow rate | Precipitation in the Ardennes |
| Radioactive decay of a radon-222 sample | Number of \(\alpha\)-particles emitted | Time |
| Cooling rate of my coffee cup | Temperature | Time |

To strip the problem down to its bare essentials, we will use a toy problem, where we actually know the ground truth \(f(x)\). In our case, this will be a simple sine wave, of which we make 100 noisy measurements from \(x=0\) to \(x=2 \pi\). Note that, in a practical setting, we would only have access to these noisy measurements and not to the true function that generated the data. Finding a good estimate of \(f(x)\) based on this contaminated data is one of our main objectives.

# The true function relating t to x
def f_truth(x, freq=1, **kwargs):
    # Return a sine with a frequency of freq
    return np.sin(x * freq)


# The data generation function
def f_data(epsilon=0.7, N=100, **kwargs):
    # Apply a seed if one is given
    if "seed" in kwargs:
        np.random.seed(kwargs["seed"])

    # Get the minimum and maximum
    xmin = kwargs.get("xmin", 0)
    xmax = kwargs.get("xmax", 2 * np.pi)

    # Generate N evenly spaced observation locations
    x = np.linspace(xmin, xmax, N)

    # Generate N noisy observations (1 at each location)
    t = f_truth(x, **kwargs) + np.random.normal(0, epsilon, N)

    # Return both the locations and the observations
    return x, t
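
As a quick check, the sketch below (the variable names are only illustrative) generates one noisy data set and plots it together with the ground truth, which should produce a figure like Fig. 6.24 below.

# Illustrative usage sketch: generate one noisy data set and plot it with the ground truth
x, t = f_data(seed=0)
x_plot = np.linspace(0, 2 * np.pi, 1000)

plt.figure()
plt.plot(x, t, "x", label="noisy measurements")
plt.plot(x_plot, f_truth(x_plot), "k-", label="ground truth")
plt.xlabel("x")
plt.ylabel("t")
plt.legend()
plt.show()
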
https://files.mude.citg.tudelft.nl/fig0.png

Fig. 6.24 100 noisy measurements from sine wave ground truth#

k-nearest neighbors#

A (perhaps naive) approach to find \(y(x)\) (i.e. the approximation of \(f(x)\)) would be to simply look at the surrounding data points and take their average to get an estimate of \(f(x)\). This approach is called k-nearest neighbors, where \(k\) refers to the number of surrounding points we are looking at.
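
To make this averaging idea concrete, here is a minimal sketch for a single query point (the helper knn_average and the chosen x0 and k are purely illustrative, not part of the course code):

# Illustrative sketch of the k-NN idea at a single query point x0
def knn_average(x, t, x0, k):
    # Distances from the query point x0 to all observation locations
    distances = np.abs(x - x0)
    # Indices of the k closest observations
    nearest = np.argsort(distances)[:k]
    # The estimate is simply the average of their observed targets
    return np.mean(t[nearest])

x, t = f_data(seed=0)
print(knn_average(x, t, x0=np.pi / 2, k=5))  # roughly sin(pi/2) = 1, up to noise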

Implementing this efficiently is not trivial, but thankfully we can leverage existing implementations. The KNeighborsRegressor class from sklearn.neighbors can be used to fit our data and get \(y(x)\) to make predictions.

# Define the prediction locations
# (note that these are different from the locations where we observed our data)
x_pred = np.linspace(0, 2 * np.pi, 1000)


# Define a function that makes a KNN prediction at the given locations, based on the given (x,t) data
def KNN(x, t, x_pred, k=1, **kwargs):
    # Convert x and x_pred to column vectors in order for KNeighborsRegressor to work
    X = x.reshape(-1, 1)
    X_pred = x_pred.reshape(-1, 1)

    # Train the KNN based on the given (x,t) data
    neigh = KNeighborsRegressor(k)
    neigh.fit(X, t)

    # Make a prediction at the locations given by x_pred
    y = neigh.predict(X_pred)

    # Check if the regressor itself should be returned
    if kwargs.get("return_regressor", False):
        # If so, return the fitted KNN regressor
        return neigh

    else:
        # If not, return the predicted values
        return y
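
As a usage sketch (the specific values of \(k\) are chosen to match the questions below), predictions like those in Fig. 6.25 can be produced along these lines:

# Usage sketch: compare k-NN predictions for a few values of k
x, t = f_data(seed=0)

plt.figure()
plt.plot(x, t, "x", label="noisy measurements")
for k in (1, 30, 100):
    plt.plot(x_pred, KNN(x, t, x_pred, k=k), label=f"k = {k}")
plt.xlabel("x")
plt.ylabel("y")
plt.legend()
plt.show()
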
https://files.mude.citg.tudelft.nl/fig1.png

Fig. 6.25 k-nearest neighbors regression for different values of \(k\)#

Looking at the previous plots, a few questions might pop up:

  • For \(k=1\), we see that our prediction matches the observed data exactly at all data points. Is this desirable?

  • For \(k=30\), what is going on around \(x=0\) and \(x=2 \pi\)?

  • For \(k=100\), why is our prediction constant with respect to \(x\)? (See the quick check after this list.)
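
The last question can be checked directly: with \(k\) equal to the number of observations, every prediction averages over the entire data set, so \(y(x)\) reduces to the global mean of the targets. A quick, illustrative check:

# Illustrative check: with k = N, every prediction equals the mean of all targets
x, t = f_data(seed=0, N=100)
y_full = KNN(x, t, x_pred, k=100)
print(np.allclose(y_full, np.mean(t)))  # expected to print True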

Varying our model parameters#

Clearly, some value of \(k\) between 1 and 100 would give us the best predictions. In the next interactive plot the following variables can be adjusted:

  • \(N\), the size of the training data set

  • \(\varepsilon\), the level of noise associated with the data

  • \(k\), the number of neighbors over which the average is taken

  • The seed can be updated to generate new random data sets

  • The oscillation frequency of the underlying ground truth model, which controls how nonlinear the approximation should be

  • A probing location \(x_0\), allowing you to see which neighbors are used to get the average response \(y(x_0)\) of the kNN estimator

# Build an interactive plot from the data generator, the ground truth and the k-NN model
plot1 = magicplotter(f_data, f_truth, KNN, x_pred)
plot1.fig.canvas.toolbar_visible = False  # hide the matplotlib toolbar
plot1.add_sliders("epsilon", "k", "N", "freq")  # sliders for noise level, k, data size, frequency
plot1.add_buttons("truth", "seed", "reset")  # show the truth, draw a new data set, reset
plot1.add_probe()  # probing location x_0
plot1.show()

Playing around with the plots#

By visual inspection, use the slider of \(k\) to find its optimal value. The following questions might be interesting to ask yourself:

  • If the training size \(N\) increases/decreases, how does this affect my optimal value of \(k\)?

  • If the frequency increases/decreases, how does this affect my optimal value of \(k\)?

  • If my measurements are less/more noisy, how does this affect my optimal value of \(k\)?

  • If I generate new data by changing the seed, how is my prediction affected for small values of \(k\)? What about large values of \(k\)?

  • So far, all observations were distributed uniformly over \(x\). How would our predictions change if our observed data were more clustered? (One way to try this is sketched after this list.)

  • If I do not know the truth, how do I figure out what my value of \(k\) should be?
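
For the clustering question, one possible experiment is sketched below; the sampling scheme (concentrating observations near \(x=0\)) and the chosen \(k\) are arbitrary, purely illustrative choices.

# Illustrative sketch: k-NN with non-uniform (clustered) observation locations
np.random.seed(0)
x_clustered = np.sort(np.random.uniform(0, 2 * np.pi, 100) ** 2 / (2 * np.pi))
t_clustered = f_truth(x_clustered) + np.random.normal(0, 0.7, x_clustered.size)

plt.figure()
plt.plot(x_clustered, t_clustered, "x", label="clustered measurements")
plt.plot(x_pred, KNN(x_clustered, t_clustered, x_pred, k=10), label="k-NN, k = 10")
plt.plot(x_pred, f_truth(x_pred), "k-", label="ground truth")
plt.legend()
plt.show()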

Final remarks#

So far, we have looked at our k-nearest neighbors regressor mostly qualitatively. However, it is possible to apply a more quantitative framework to our model and find the optimal value for \(k\) in a more structured way. The following video and accompanying chapter will discuss how this is done. See you there!

Attribution

This chapter is written by Iuri Rocha, Anne Poot, Joep Storm and Leon Riccius. Find out more here.