Model Flexibility and the Risk of Overfitting

Introduction

When using a regression model, you've got to be super careful that you aren't overfitting your data. But what does that even mean? What causes an "overfitted" model?

Well, the answer to that question is excessive model flexibility. But again, "excessive flexibility" is not the most understandable of responses. Hopefully this post will shed a bit of light on the matter!

Please note: this analysis was inspired by Figure 2.9 of An Introduction to Statistical Learning by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani.

In [8]:
# Some usual setup for working in a notebook
%matplotlib nbagg

import numpy as np
import matplotlib.pyplot as plt
from scipy.interpolate import UnivariateSpline

plt.style.use('ggplot')

An Unknown Function

Imagine we've got some unknown function. Let's say the function describes "How much Donald Trump wants to build a wall depending on how many tacos he's eaten today". We don't know what this function really looks like, but we can sample it by feeding Trump different numbers of tacos and asking how much he wants to start hiring bricklayers.

Now we can be fairly sure that the number of tacos isn't the only thing that affects Trump's wall-building proclivities - perhaps it also depends on whether or not he's wearing his comfortable wig today. Unfortunately (fortunately?!) we have no way of measuring that, so there's going to be a bit of noise associated with our measurements, simply because other factors affect the outcome that we don't have access to.

In [2]:
# Create a random function that will be our 'unknown function' 
def unknown_function(x):
    return ((20*x - 2*x**2 + 0.018*x**3) / 100) + 15

np.random.seed(1)

x = np.linspace(0, 100, 10000)
x_sample = x[::100]

y_clean = unknown_function(x)
y_clean_sample = unknown_function(x_sample)

y_noisy = y_clean + np.random.normal(scale = 3, size = len(x))
y_noisy_sample = y_clean_sample + np.random.normal(scale = 3, size = len(x_sample))

In the following plot, the blue dots represent our measurements (which are a bit noisy), and the orange line represents the true function (which we could never really know in reality).

In [3]:
fig, ax = plt.subplots()

ax.plot(x, y_clean, label = 'Unknown function')
ax.scatter(x_sample, y_noisy_sample, label = 'Samples')

ax.set_ylabel('How much DT wants the wall')
ax.set_xlabel('How many tacos DT has eaten today')

plt.legend(loc = 3)
Out[3]:
<matplotlib.legend.Legend at 0x106b00e90>

Modelling the Unknown Function

So remember that what we really want to do is to try to model the original function - the orange line in the plot above. One way of doing that is to use a 'Smoothing Spline' - see the appendix below for a bit more information about these things. The need-to-know fact is that these splines are built from a number of "knots" - the points at which the polynomial pieces that make up the spline join together. The more knots you have, the more 'wiggly' your modelled function can be.

Here we specify two smoothing spline functions:

In [4]:
spline_overfit = UnivariateSpline(x_sample, y_noisy_sample, k = 3,  s = 100)
spline_justright = UnivariateSpline(x_sample, y_noisy_sample, k = 3,  s = 1000)
# Note that k defines the degree of the spline, so k = 3 is a cubic spline.
# s defines the smoothing factor, which in turn is used to define the number of knots. 
# A lower smoothing factor allows more knots, and therefore a wigglier spline.
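As a quick sanity check of that last comment, we can count the knots scipy actually chose for each smoothing factor (get_knots is part of UnivariateSpline's API; the exact counts depend on the data, so treat the output as indicative):

# Count the knots scipy chose for each smoothing factor.
# The lower smoothing factor (s = 100) should end up with noticeably more knots.
print('s = 100 :', len(spline_overfit.get_knots()), 'knots')
print('s = 1000:', len(spline_justright.get_knots()), 'knots')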

When we plot these two splines with different smoothing factors, we see some rather dramatic differences. The spline with lots of knots (in blue) is extremely wiggly, and passes through lots of the sampled points. It's clearly not a particularly good representation of the original 'unknown' function (in orange).

The spline with a large smoothing factor (in purple) - and just a few knots - does a much better job of matching the 'unknown' function.

In [5]:
fig, ax = plt.subplots()

ax.plot(x, y_clean, label = 'Unknown Function')
ax.scatter(x_sample, y_noisy_sample, label = 'Samples')

ax.plot(x, spline_overfit(x), label = 'Spline with lots of knots')
ax.plot(x, spline_justright(x), label = 'Spline with few knots')

ax.set_ylabel('How much DT wants the wall')
ax.set_xlabel('How many tacos DT has eaten today')

plt.legend(loc = 3)
Out[5]:
<matplotlib.legend.Legend at 0x106aa97d0>

Overfitting and Error

So this, fundamentally, is what is meant by "overfitting". The spline with lots of knots has too much flexibility - it's super wiggly and chases lots of the individual samples. The spline with few knots, on the other hand, is constrained to broad strokes - it largely ignores the noise in the samples.

This means that the overfitted spline - as the name suggests - fits the sample points extremely well. But what if we take a completely new sample? Now the wiggles are most likely going to be in completely the wrong places, and the discrepancy between the new samples and the model is going to be large. In other words, the overfitted spline is pretty much useless for anything other than the sample points on which it was fit.

We can formalise this idea using the mean squared error, or MSE. First, however, we should define a bit of jargon. In many data applications, we split the data we have into "training" and "testing" sets. We fit the model on the former, and then use the latter - data the model has never seen - to check that we haven't overfit. (This is a pretty big topic, so apologies for the brevity here!)
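To make that jargon a touch more concrete, here's a minimal sketch of one common way to split data - a random 80/20 split using numpy. The seed, the proportions and the variable names are illustrative assumptions rather than anything this post relies on; in what follows we'll simply treat the samples we already have as the training set, and a fresh set of noisy measurements as the testing set.

# Illustrative only: a random 80/20 train/test split with numpy.
rng = np.random.RandomState(42)            # hypothetical fixed seed, for reproducibility
shuffled = rng.permutation(len(x_sample))  # shuffle the sample indices
cutoff = int(0.8 * len(x_sample))          # 80% for training, 20% for testing

train_idx, test_idx = shuffled[:cutoff], shuffled[cutoff:]
x_train, y_train = x_sample[train_idx], y_noisy_sample[train_idx]
x_test, y_test = x_sample[test_idx], y_noisy_sample[test_idx]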

To demonstrate the effect of overfitting, we can do the following:

  1. Fit splines on the samples we already have for a wide range of smoothing factors. These samples become the training dataset.
  2. For each of these splines, we find out how many knots they have, and also what the MSE is with respect to the training dataset (the MSE formula is written out just after this list).
  3. Also for each of these splines, we find out what the MSE is with respect to the testing dataset - here, a dense set of noisy measurements that were not used when fitting the splines.
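
As a reminder, the MSE is just the average of the squared differences between the measurements $y_i$ and the model's predictions $\hat{y}_i$ over the $n$ points in a dataset:

$$ \mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 $$

This is exactly what the mean_squared_error helper in the next cell computes.
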
In [6]:
def mean_squared_error(y_true, y_pred):
    return np.mean((y_true - y_pred)**2)

training_MSEs = []
testing_MSEs = []
number_of_knots = []

for s in range(2000):
    this_spline = UnivariateSpline(x_sample, y_noisy_sample, k = 3,  s = s)
    knot_count = len(this_spline.get_knots())
    
    training_MSE = mean_squared_error(y_noisy_sample, this_spline(x_sample))
    testing_MSE = mean_squared_error(y_noisy, this_spline(x))
    
    training_MSEs.append(training_MSE)
    testing_MSEs.append(testing_MSE)
    number_of_knots.append(knot_count)

Excessive Flexibility

Remember that the number of knots is a measure of the amount of "flexibility" in a model. So... more knots leads to more flexibility, and more flexibility leads to a higher risk of overfitting. In other words, if we plot the MSE for the training and the testing datasets against the number of knots in the spline, we should see the training MSE keep decreasing as the number of knots increases, while the testing MSE should eventually start to increase.

Let's have a look (each point represents a single spline model - we tested 2000 of them):

In [9]:
fig, ax = plt.subplots()

ax.scatter(number_of_knots, training_MSEs, label = 'Training')
ax.scatter(number_of_knots, testing_MSEs, color = 'r', label = 'Testing')

ax.set_xlim([0,32.5])
ax.set_ylim([4.5,11])

plt.legend(loc=5)

ax.set_ylabel('Mean Squared Error (MSE)')
ax.set_xlabel('Number of knots')
Out[9]:
<matplotlib.text.Text at 0x106f63690>

As expected, more model flexibility generally leads to worse performance on the testing (i.e. unseen) dataset. Note also that the testing error never dips below about 9 - the variance of the noise we added to the unknown function (standard deviation 3), i.e. the irreducible error that no model, however flexible, can get below.
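If you want to convince yourself of where that 9 comes from, a quick check with the arrays we already have is to compare the empirical variance of the added noise with the smallest testing MSE we recorded - both should come out close to 9:

# The noise was drawn with scale (standard deviation) 3, so its variance should be ~9,
# and no spline should achieve a testing MSE much below that.
print('Variance of the added noise:', np.var(y_noisy - y_clean))
print('Lowest testing MSE:         ', min(testing_MSEs))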

So this is what is meant by excessive flexibility leading to overfitting: when you've got a model that is flexible enough to fit the noise in your training data, it almost certainly will not be a model that generalises to new, unseen data points. And that's bad, because then you'd never be able to predict how much Donald really wants that wall of his.

Appendix

Smoothing Splines

A smoothing spline is the function $g$ that minimizes the following cost function:

$$ \sum_{i=1}^{n} (y_i - g(x_i))^2 + s \int g''(t)^2 dt $$

$g$ in our example here is how much Trump wants to build a wall as a function of the number of tacos he's eaten. As in lasso or ridge regression, $s \int g''(t)^2 dt$ is a penalty term. Instead of shrinking the regression coefficients, however, it penalises the 'wiggliness' of the function, as measured by the integral of the square of the function's second derivative.

The larger the smoothing factor s, the more the function is penalised for being wiggly.
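To make the penalty term a little more concrete, here is a rough numerical comparison of $\int g''(t)^2 dt$ for the two splines we fitted earlier. UnivariateSpline's derivative method gives us $g''$ as another spline, and a simple trapezoidal rule approximates the integral - the exact numbers are only indicative, but the wiggly spline should come out with a far larger penalty:

# Approximate the 'wiggliness' penalty integral for each of our two splines.
for name, spline in [('Spline with lots of knots (s = 100) ', spline_overfit),
                     ('Spline with few knots     (s = 1000)', spline_justright)]:
    second_derivative = spline.derivative(n = 2)       # g'' as another spline
    roughness = np.trapz(second_derivative(x)**2, x)   # ~ integral of g''(t)^2 dt
    print(name, ':', roughness)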
