Overfitting with scissors and the utility of a train/test split.

class date


Learning through examples is one of the most important ways we acquire knowledge. Have you ever given someone an example to illustrate a concept, but that person gets too caught up in the details of the example and learns the wrong lesson? There is an expression for this: “missing the forest for the trees”1. Before reading these notes please listen to Act 3 of an episode of the podcast “This American Life” called “Kids Look Back” (10 minutes).

This American Life: Kids Look Back

Almost all predictive modeling – from linear regression to artificial intelligence based on deep learning – is accomplished though learning by example. In simple linear regression we show our algorithm a bunch of pairs of examples \((x_1, y_1), \dots, (x_n, y_n)\) and ask the linear model to best approximate \(y\) through the linear function \(\widehat{y} = b_0 + b_1 \cdot x\). The “lesson” that the model learns are the values of \(b_0\) and \(b_1\) that define the model, and those are composed of nothing more than the original observations2.

Missing the forest for the trees turns out to be one of the most important concepts predictive modeling. In statistics we call it overfitting.

The practice of using a predictive model is very effective at explaining the data used to fit it and correspondingly poor at making predictions on new data. Usually a result of applying a model that is too complex.

Let’s illustrate overfitting with an example 😉.

Example: Overfitting with Polynomials

Here we collected data about the association between number of hours studied and students’ test scores in a math class. Our goal is to predict the exam score from number of hours studied. Both plots below show the same data, but show the predictions from two different predictive models.


n_samples <- 10

min_n_hours <- 5
max_n_hours <- 10

exam_avg <- 75
point_per_hour <- 5

intercept <- exam_avg - point_per_hour * (min_n_hours + max_n_hours) / 2
noise <-  rnorm(n=n_samples, mean=0, sd=5)

# Setup data
math <- tibble(hours=runif(n=n_samples, min=min_n_hours, max=max_n_hours),
             score=point_per_hour * hours + intercept + noise)

# same x data, but much finer grid. useful for plotting lines
math_filled_in_range <- tibble(hours=seq(from=min_n_hours, to=max_n_hours, by=.01))

# fit models
lm_lin <- lm(score ~ hours, data=math)
lm_poly <- lm(score ~ poly(hours, degree=20, raw=T), data=math)

# add predictions into model
math <- math %>% 
    mutate(y_pred_linear = predict(lm_lin, newdata=math),
           y_pred_poly = predict(lm_poly, newdata=math))

math_filled_in_range_filled_in_range <- math_filled_in_range %>% 
    mutate(y_pred_linear = predict(lm_lin, newdata=math_filled_in_range),
           y_pred_poly = predict(lm_poly, newdata=math_filled_in_range))

plot_linear <- math %>% 
    ggplot(aes(x=hours, y=score)) +
    geom_point() +
    geom_line(aes(x=hours, y=y_pred_linear), color='blue') +
    lims(x=c(5, 10), y=c(65, 90)) +
    ggtitle("Math class data") +

plot_poly <- math %>% 
    ggplot(aes(x=hours, y=score)) +
    geom_point() +
    geom_line(data=math, aes(x=hours, y=y_pred_poly), color='red')+
    lims(x=c(5, 10), y=c(65, 90)) +

plot_linear + plot_poly

Which model looks more appropriate? More specifically,

  • Does it make sense that there should be a big difference between studying 6.7 hours vs studying 6.8 hours?

  • Should studying a little more make your score go down?

The predictive model on the left seems more reasonable: studying more should steadily increase your score. The predictive model on the right seems like it took the particular data points too seriously!

This is a clarion case of overfitting.

Model Complexity

With great modeling power comes great responsibility to not overfit.

The blue model on the left above is fairly simple (just a line!) while the red overfit model on the right looks much more complex (it is fairly irregular and goes up and down). We created the overfitted predictive model on the right by fitting a polynomial with a high degree.

Polynomials are quite powerful models and are capable of creating very complex predictive functions. The higher the polynomial degree, the more complex function it can create.

Let’s illustrate by fitting polynomial models with progressively higher degrees to the data set above.

plot_deg_1 <- math_filled_in_range %>% 
    mutate(y_pred = predict(lm(score ~ poly(hours, degree=1, raw=T), data=math), newdata=math_filled_in_range)) %>% 
    ggplot( aes(x=hours, y=score)) +
    geom_line(aes(x=hours, y=y_pred), color='red')+
    geom_point(data=math, aes(x=hours, y=score)) +
    lims(x=c(5, 10), y=c(65, 90)) +
    ggtitle("Degree 1 polynomial") +

plot_deg_3 <- math_filled_in_range %>%
    mutate(y_pred = predict(lm(score ~ poly(hours, degree=3, raw=T), data=math), newdata=math_filled_in_range)) %>%
    ggplot(aes(x=hours, y=score)) +
    geom_line(aes(x=hours, y=y_pred), color='red')+
    geom_point(data=math, aes(x=hours, y=score)) +
    lims(x=c(5, 10), y=c(65, 90)) +
    ggtitle("Degree 3 polynomial") +

plot_deg_5 <- math_filled_in_range %>%
    mutate(y_pred = predict(lm(score ~ poly(hours, degree=5, raw=T), data=math), newdata=math_filled_in_range)) %>%
    ggplot(aes(x=hours, y=score)) +
    geom_line(aes(x=hours, y=y_pred), color='red') +
    geom_point(data=math, aes(x=hours, y=score)) +
    lims(x=c(5, 10), y=c(65, 90)) +
    ggtitle("Degree 5 polynomial") +

plot_deg_1 + plot_deg_3 + plot_deg_5

The higher the polynomial degree, the closer the prediction function comes to perfectly fitting the data3. We can quantitatively evaluate this by looking at the \(R^2\) value. The plot below shows the \(R^2\) value for models fit with different polynomial degrees.

# Digression: for loops #

# We have not learned for loops. Here is a simple example where we calculate the squares of the
# first 10 integers
# squares <- c()
# for (i in 1:10){
#     squares <- c(squares, i^2)
# }
# squares

# Evaluate Rsquared #
degrees2evaluate <- 1:10

Rsq_values <- c()
# each iteration of this for loop evaluates the model for a different polynomial degree 
for (degree in degrees2evaluate) {
    # fit the model on the training data with the specified degree
    model <- lm(score~poly(hours, degree=degree, raw=T), data=math)
    rsq <- glance(model)$r.squared
    # track the Rsquared values
    # vector <- c(vector, value) will add the value to the end of the vector
    # i.e. c does concatenation
    Rsq_values <- c(Rsq_values, rsq)

           Rsq=Rsq_values) %>% 
  ggplot(aes(x=degree, y=Rsq)) + 
  geom_point() + 

The \(R^2\) value goes steadily upwards as the polynomial degree goes up. In fact this is mathematically guaranteed to happen: for a fixed data set the \(R^2\) value for a polynomial model with higher degree will always be higher than a polynomial model with lower degree.

This should be disconcerting. From visually inspecting the above plots we can see that higher degree polynomials seem to be worse models. Does that mean \(R^2\) is a good not metric to evaluate our model?

Training versus Testing

We can rescue \(R^2\) by changing our workflow to bring it into alignment with the nature of prediction.

Previously we both fit the model (set the \(b_0, b_1\) coefficients) and evaluated the model (calculated \(R^2\)) with the same data set. This \(R^2\), then, captures the degree to which model can explain the structure in the observed data. The task of prediction, however, involves formulating an estimate for a data point that is unobserved. This is a more difficult task.

We can mirror this important distinction in the structure of our data. Instead of thinking in terms of a single data set, we can partition the observations of the data set into two separate sets.

Training Set
The set of observations used to fit a predictive model; i.e. estimate the model coefficients.
Testing Set
The set of observations used to assess the accuracy of a predictive model. This set is disjoint from the training set.

Our new workflow for assessing the predictive ability of our model will have three steps.

  1. Split the observations into a training and a testing set
  2. Fit the model to the training set
  3. Evaluate the model (calculate \(R^2\)) using the testing set

In step one we must decide how large each set should be. We could put half of the observations in each set (a 50-50 split), however it can be useful to spend more of the data budget on estimating the coefficients. A standard partition is to dedicate 80% of the observations to the training set and the remainder to the testing set (a 80-20 split). The other question is how best to assign the observations to the two sets. In general, it is best to do this randomly to avoid one set that is categorically different than the other.

The partition of a data frame into training and testing sets is illustrated by the diagram below.

y x1 x2 x3

The original data frame consists of 10 observations. For each observation we have recorded a response variable, \(y\), and three predictors, \(x_1, x_2\), and \(x_3\). If we do an 80-20 split, then 8 of the rows will randomly be assigned to the training set (in blue). The 2 remaining rows (rows 2 and 6) are assigned to the testing set (in gold).

Example: Predicting Final Exam Scores

Let’s illustrate the training and testing approaching to evaluating predictions for the exam scores from a biology class with 200 students using as a predictor the number of hours that they have studied. Here we are going to compare two models: a simple linear model versus a 5th degree polynomial, both fit using the method of least squares.

  • \(\textbf{Model 1:} \quad \widehat{score} = b_0 + b_1 \times hours\)
  • \(\textbf{Model 2:} \quad \widehat{score} = b_0 + b_1 \times hours + b_2 \times hours^2 + b_3 \times hours^3 + b_4 \times hours^4 + b_5 \times hours^4\)

The first step is to split the observations into a training and a testing set. We’ll use an 80-20 split, with each observation assigned to its set randomly.

# Simulate a data set
n_samples_big <- 200
noise <-  rnorm(n = n_samples_big, mean = 0, sd = 5)

biology <- tibble(hours = runif(n = n_samples_big,
                                min = min_n_hours,
                  score = point_per_hour * hours + intercept + noise)
# fix the random seed so we get the same train/test split every time

# randomly sample train/test set split
set_type <- sample(x = c('train', 'test'), 
                   size = 200, 
                   replace = TRUE, 
                   prob = c(0.8, 0.2))

biology <- biology %>% 
    mutate(set_type = set_type)
# A tibble: 200 × 3
   hours score set_type
   <dbl> <dbl> <chr>   
 1  9.78  88.7 train   
 2  5.26  60.8 train   
 3  9.84  86.3 train   
 4  6.39  61.7 train   
 5  7.19  72.4 test    
 6  6.43  61.3 train   
 7  5.71  66.4 train   
 8  9.53  92.4 train   
 9  6.92  71.7 test    
10  7.72  80.8 train   
# ℹ 190 more rows

Let’s visualize these data.

biology %>% 
    ggplot(aes(x = hours, y = score, color = set_type)) +
    geom_point() + 
    ggtitle("Biology class") +

Next let’s split the original data frame into a training data frame and test data frame.

biology_train <- biology %>% 
    filter(set_type == 'train')

biology_test <- biology %>% 
    filter(set_type == 'test')

Now fit two models on the training data.

# notice we use the training data to fit the model!
lm_slr <- lm(score ~ hours, data = biology_train)
lm_poly <- lm(score ~ poly(hours, degree = 20, raw = T), data = biology_train)

Next we visually inspect the predictions.

# same x data, but much finer grid. useful for plotting lines
biology_filled_in_range <- tibble(hours = seq(from = min_n_hours, to = max_n_hours, by = .01))

# this is useful for plotting
biology_filled_in_range <- biology_filled_in_range %>% 
    mutate(y_pred_linear = predict(lm_slr,
                                   newdata = biology_filled_in_range),
           y_pred_poly = predict(lm_poly, 
                                 newdata = biology_filled_in_range)) 

biology %>% 
    ggplot(aes(x=hours, y=score, color=set_type)) +
    geom_point() +
    geom_line(data=biology_filled_in_range, aes(x=hours, y=y_pred_linear), color='red') +
    geom_line(data=biology_filled_in_range, aes(x=hours, y=y_pred_poly), color='blue') +
    lims(x=c(5, 10), y=c(65, 90)) +

First let’s evaluate the \(R^2\) for the training data. We can do this just like before with glance. Which model do you expect to have a better training set \(R^2\) value?

# this calculates the training r^2
glance(lm_slr) %>%
# A tibble: 1 × 1
1     0.544
glance(lm_poly) %>%
# A tibble: 1 × 1
1     0.588

Just as we might have guessed from looking at the model fits, the polynomial model has a better \(R^2\) value when evaluated on the training set. Next let’s evaluate these models on the test set.

# get predictions on test set
score_pred_linear <- predict(lm_slr, newdata = biology_test)
score_pred_poly <- predict(lm_poly, newdata = biology_test)
# calculate R squared for test set
biology_test %>%
    mutate(score_pred_linear = score_pred_linear,
           score_pred_poly = score_pred_poly,
           resid_sq_linear = (score - score_pred_linear)^2,
           resid_sq_poly = (score - score_pred_poly)^2) %>%
    summarize(TSS = sum((score - mean(score))^2),
              RSS_linear = sum(resid_sq_linear),
              RSS_poly = sum(resid_sq_poly)) %>%
    mutate(Rsq_linear = 1 - RSS_linear/TSS,
           Rsq_poly = 1 - RSS_poly/TSS) %>%
    select(Rsq_linear, Rsq_poly)
# A tibble: 1 × 2
  Rsq_linear Rsq_poly
       <dbl>    <dbl>
1      0.586    0.538

Voila the linear model’s test set \(R^2\) is better than the polynomial model’s test \(R^2\)!

So which is the better predictive model: Model 1 or Model 2? In terms of training \(R^2\), Model 2 came out of top, but Model 1 won out in testing \(R^2\). While training \(R^2\) can tell us how well a predictive model explains the structure in the data set upon which it was trained, it is deceptive to use as a metric of true predictive accuracy. The task of prediction is fundamentally one applied to unseen data, so testing \(R^2\) is the appropriate metric. Model 1, the simpler model, is the better predictive model.


This lecture is about overfitting: what happens when your model takes the particular data set it was built on too seriously. The more complex a model is, the more prone to overfitting it is. Polynomial models are able to create very complex functions thus high-degree polynomial models can easily overfit. Fitting a model and evaluating it on the same data set can be problematic; if the model is overfitted the evaluation metric (e.g. \(R^2\)) might be very good, but the model might be lousy on predictions on new data.

A better way to approach predictive modeling is to fit the model to a training set then evaluate it with a separate testing set.

Materials from class



  1. Definition (Merriam-Webster): to not understand or appreciate a larger situation, problem, etc., because one is considering only a few parts of it.↩︎

  2. Recall \(b_1 = r \frac{s_y}{s_x}\) and each of the statistics involved are functions of \(x_i, y_i\) and \(n\)).↩︎

  3. We say a function that perfectly predicts each data point interpolates the data. See the first red curve for the math class exams.↩︎