Overfitting with scissors and the utility of a train/test split.
Class date: 11/16/23
Learning through examples is one of the most important ways we acquire knowledge. Have you ever given someone an example to illustrate a concept, but that person gets too caught up in the details of the example and learns the wrong lesson? There is an expression for this: “missing the forest for the trees”1. Before reading these notes please listen to Act 3 of an episode of the podcast “This American Life” called “Kids Look Back” (10 minutes).
Almost all predictive modeling – from linear regression to artificial intelligence based on deep learning – is accomplished through learning by example. In simple linear regression we show our algorithm a bunch of pairs of examples \((x_1, y_1), \dots, (x_n, y_n)\) and ask the linear model to best approximate \(y\) through the linear function \(\widehat{y} = b_0 + b_1 \cdot x\). The “lesson” that the model learns is the pair of values \(b_0\) and \(b_1\) that define the model, and those values are computed from nothing more than the original observations2.
Missing the forest for the trees turns out to be one of the most important concepts in predictive modeling. In statistics we call it overfitting.
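To make “learning by example” concrete, here is a minimal sketch (using a tiny simulated data set that is not part of these notes; the names are illustrative) in which the fitted values of \(b_0\) and \(b_1\) are computed entirely from the observed example pairs:

library(tidyverse)

# ten simulated (x, y) example pairs with a known linear trend plus noise
set.seed(1)
examples <- tibble(x = 1:10,
                   y = 3 + 2 * x + rnorm(10))

# "learning" here is just estimating b0 and b1 from the examples
fit <- lm(y ~ x, data = examples)
coef(fit)  # b0 (intercept) and b1 (slope), built from nothing but the data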
Overfitting
When a predictive model is very effective at explaining the data used to fit it but correspondingly poor at making predictions on new data. Overfitting is usually the result of applying a model that is too complex.
Let’s illustrate overfitting with an example 😉.
Example: Overfitting with Polynomials
Here we have collected data on the association between the number of hours studied and students’ test scores in a math class. Our goal is to predict exam score from the number of hours studied. Both plots below show the same data but display the predictions from two different predictive models.
library(tidyverse)
library(patchwork)
library(broom)

set.seed(1)

n_samples <- 10
min_n_hours <- 5
max_n_hours <- 10

exam_avg <- 75
point_per_hour <- 5
intercept <- exam_avg - point_per_hour * (min_n_hours + max_n_hours) / 2

noise <- rnorm(n = n_samples, mean = 0, sd = 5)

# Setup data
math <- tibble(hours = runif(n = n_samples, min = min_n_hours, max = max_n_hours),
               score = point_per_hour * hours + intercept + noise)

# same x data, but much finer grid. useful for plotting lines
math_filled_in_range <- tibble(hours = seq(from = min_n_hours, to = max_n_hours, by = .01))

# fit models
lm_lin <- lm(score ~ hours, data = math)
lm_poly <- lm(score ~ poly(hours, degree = 20, raw = T), data = math)

# add predictions to the observed data and to the fine grid
math <- math %>%
  mutate(y_pred_linear = predict(lm_lin, newdata = math),
         y_pred_poly = predict(lm_poly, newdata = math))

math_filled_in_range <- math_filled_in_range %>%
  mutate(y_pred_linear = predict(lm_lin, newdata = math_filled_in_range),
         y_pred_poly = predict(lm_poly, newdata = math_filled_in_range))

# left: simple linear model (blue); right: high-degree polynomial (red)
plot_linear <- math %>%
  ggplot(aes(x = hours, y = score)) +
  geom_point() +
  geom_line(data = math_filled_in_range, aes(x = hours, y = y_pred_linear), color = 'blue') +
  lims(x = c(5, 10), y = c(65, 90)) +
  ggtitle("Math class data") +
  theme_bw()

plot_poly <- math %>%
  ggplot(aes(x = hours, y = score)) +
  geom_point() +
  geom_line(data = math_filled_in_range, aes(x = hours, y = y_pred_poly), color = 'red') +
  lims(x = c(5, 10), y = c(65, 90)) +
  theme_bw()

plot_linear + plot_poly
Which model looks more appropriate? More specifically,
Does it make sense that there should be a big difference between studying 6.7 hours vs studying 6.8 hours?
Should studying a little more make your score go down?
The predictive model on the left seems more reasonable: studying more should steadily increase your score. The predictive model on the right seems like it took the particular data points too seriously!
This is a classic case of overfitting.
Model Complexity
With great modeling power comes great responsibility to not overfit.
The blue model on the left above is fairly simple (just a line!) while the red overfit model on the right looks much more complex (it is fairly irregular and goes up and down). We created the overfitted predictive model on the right by fitting a polynomial with a high degree.
Polynomials are quite powerful models and are capable of creating very complex predictive functions. The higher the polynomial degree, the more complex a function it can create.
Let’s illustrate by fitting polynomial models with progressively higher degrees to the data set above.
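The rendered notes show these fits as a row of plots. As a rough sketch of how such plots could be produced (reusing the `math` data and the fine `hours` grid defined above; the particular degrees shown are an illustrative choice, not necessarily those in the original figure):

# fit polynomials of increasing degree and collect predictions on the fine grid
degrees <- c(1, 3, 5, 9)

poly_fits <- map_dfr(degrees, function(d) {
  model <- lm(score ~ poly(hours, degree = d, raw = TRUE), data = math)
  math_filled_in_range %>%
    select(hours) %>%
    mutate(degree = d,
           y_pred = predict(model, newdata = math_filled_in_range))
})

# one panel per degree: higher-degree fits hug the observed points more closely
poly_fits %>%
  ggplot(aes(x = hours, y = y_pred)) +
  geom_line(color = 'red') +
  geom_point(data = math, aes(x = hours, y = score)) +
  facet_wrap(~degree) +
  lims(y = c(65, 90)) +
  theme_bw()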
The higher the polynomial degree, the closer the prediction function comes to perfectly fitting the data3. We can quantitatively evaluate this by looking at the \(R^2\) value. The plot below shows the \(R^2\) value for models fit with different polynomial degrees.
#########################
# Digression: for loops #
#########################
# We have not learned for loops. Here is a simple example where we calculate
# the squares of the first 10 integers:
# squares <- c()
# for (i in 1:10){
#   squares <- c(squares, i^2)
# }
# squares

######################
# Evaluate R-squared #
######################
degrees2evaluate <- 1:10
Rsq_values <- c()

# each iteration of this for loop evaluates the model for a different polynomial degree
for (degree in degrees2evaluate) {
  # fit the model with the specified degree
  model <- lm(score ~ poly(hours, degree = degree, raw = T), data = math)
  rsq <- glance(model)$r.squared

  # track the R-squared values
  # vector <- c(vector, value) will add the value to the end of the vector,
  # i.e. c does concatenation
  Rsq_values <- c(Rsq_values, rsq)
}

data.frame(degree = degrees2evaluate, Rsq = Rsq_values) %>%
  ggplot(aes(x = degree, y = Rsq)) +
  geom_point() +
  scale_x_continuous(breaks = degrees2evaluate) +
  theme_bw()
The \(R^2\) value goes steadily upwards as the polynomial degree goes up. In fact this is mathematically guaranteed to happen: for a fixed data set, the \(R^2\) of a higher-degree polynomial model is always at least as high as that of a lower-degree one, because the higher-degree model can reproduce the lower-degree fit by setting its extra coefficients to zero.
This should be disconcerting. From visually inspecting the plots above we can see that higher-degree polynomials seem to be worse models. Does that mean \(R^2\) is not a good metric for evaluating our model?
Training versus Testing
We can rescue \(R^2\) by changing our workflow to bring it into alignment with the nature of prediction.
Previously we both fit the model (set the \(b_0, b_1\) coefficients) and evaluated the model (calculated \(R^2\)) with the same data set. That \(R^2\), then, captures the degree to which the model can explain the structure in the observed data. The task of prediction, however, involves formulating an estimate for a data point that is unobserved. This is a more difficult task.
We can mirror this important distinction in the structure of our data. Instead of thinking in terms of a single data set, we can partition the observations of the data set into two separate sets.
Training Set
The set of observations used to fit a predictive model; i.e. estimate the model coefficients.
Testing Set
The set of observations used to assess the accuracy of a predictive model. This set is disjoint from the training set.
Our new workflow for assessing the predictive ability of our model will have three steps.
Split the observations into a training and a testing set
Fit the model to the training set
Evaluate the model (calculate \(R^2\)) using the testing set
In step one we must decide how large each set should be. We could put half of the observations in each set (a 50-50 split), but it is usually better to spend more of the data budget on estimating the coefficients. A standard choice is to dedicate 80% of the observations to the training set and the remainder to the testing set (an 80-20 split). The other question is how best to assign the observations to the two sets. In general, it is best to do this randomly, to avoid producing one set that is categorically different from the other.
The partition of a data frame into training and testing sets is illustrated by the diagram below.
[Diagram: a 10-row data frame with columns \(y\), \(x_1\), \(x_2\), \(x_3\); training rows shaded blue, testing rows shaded gold.]
The original data frame consists of 10 observations. For each observation we have recorded a response variable, \(y\), and three predictors, \(x_1, x_2\), and \(x_3\). If we do an 80-20 split, then 8 of the rows will randomly be assigned to the training set (in blue). The 2 remaining rows (rows 2 and 6) are assigned to the testing set (in gold).
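As a minimal sketch of such a split (using a made-up 10-row data frame that mirrors the diagram; exactly which rows land in each set depends on the random seed):

library(tidyverse)

# a toy data frame like the one in the diagram: 10 observations of y, x1, x2, x3
set.seed(123)
df <- tibble(y = rnorm(10), x1 = rnorm(10), x2 = rnorm(10), x3 = rnorm(10))

# randomly choose 80% of the row indices for the training set
train_rows <- sample(1:nrow(df), size = round(0.8 * nrow(df)))

df_train <- df %>% slice(train_rows)   # 8 rows, used to fit the model
df_test  <- df %>% slice(-train_rows)  # the other 2 rows, used to evaluate it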
Example: Predicting Final Exam Scores
Let’s illustrate the training and testing approach to evaluating predictions, using exam scores from a biology class with 200 students and the number of hours each student studied as the predictor. Here we are going to compare two models: a simple linear model (Model 1) versus a 20th-degree polynomial (Model 2), both fit using the method of least squares.
The first step is to split the observations into a training and a testing set. We’ll use an 80-20 split, with each observation assigned to its set randomly.
# Simulate a data set
n_samples_big <- 200

set.seed(1233)
noise <- rnorm(n = n_samples_big, mean = 0, sd = 5)

biology <- tibble(hours = runif(n = n_samples_big, min = min_n_hours, max = max_n_hours),
                  score = point_per_hour * hours + intercept + noise)
# fix the random seed so we get the same train/test split every time
set.seed(13)

# randomly assign each observation to the training or testing set (roughly 80-20)
set_type <- sample(x = c('train', 'test'), size = 200, replace = TRUE, prob = c(0.8, 0.2))

biology <- biology %>%
  mutate(set_type = set_type)
biology

# split into the training and testing data frames used below
biology_train <- biology %>% filter(set_type == 'train')
biology_test <- biology %>% filter(set_type == 'test')
# notice we use the training data to fit the model!
lm_slr <- lm(score ~ hours, data = biology_train)
lm_poly <- lm(score ~ poly(hours, degree = 20, raw = T), data = biology_train)
Next we visually inspect the predictions.
# same x data, but much finer grid. useful for plotting lines
biology_filled_in_range <- tibble(hours = seq(from = min_n_hours, to = max_n_hours, by = .01))

# add both models' predictions on the fine grid; this is useful for plotting
biology_filled_in_range <- biology_filled_in_range %>%
  mutate(y_pred_linear = predict(lm_slr, newdata = biology_filled_in_range),
         y_pred_poly = predict(lm_poly, newdata = biology_filled_in_range))

# points colored by train/test set; red line = linear model, blue line = polynomial model
biology %>%
  ggplot(aes(x = hours, y = score, color = set_type)) +
  geom_point() +
  geom_line(data = biology_filled_in_range, aes(x = hours, y = y_pred_linear), color = 'red') +
  geom_line(data = biology_filled_in_range, aes(x = hours, y = y_pred_poly), color = 'blue') +
  lims(x = c(5, 10), y = c(65, 90)) +
  theme_bw()
First let’s evaluate the \(R^2\) for the training data. We can do this just like before with glance. Which model do you expect to have a better training set \(R^2\) value?
# this calculates the training R^2 for each model
glance(lm_slr) %>% select(r.squared)
glance(lm_poly) %>% select(r.squared)
Just as we might have guessed from looking at the model fits, the polynomial model has a better \(R^2\) value when evaluated on the training set. Next let’s evaluate these models on the test set.
# get predictions on the test set
score_pred_linear <- predict(lm_slr, newdata = biology_test)
score_pred_poly <- predict(lm_poly, newdata = biology_test)
# calculate R squared for the test set
biology_test %>%
  mutate(score_pred_linear = score_pred_linear,
         score_pred_poly = score_pred_poly,
         resid_sq_linear = (score - score_pred_linear)^2,
         resid_sq_poly = (score - score_pred_poly)^2) %>%
  summarize(TSS = sum((score - mean(score))^2),
            RSS_linear = sum(resid_sq_linear),
            RSS_poly = sum(resid_sq_poly)) %>%
  mutate(Rsq_linear = 1 - RSS_linear / TSS,
         Rsq_poly = 1 - RSS_poly / TSS) %>%
  select(Rsq_linear, Rsq_poly)
Voilà: the linear model’s test set \(R^2\) is better than the polynomial model’s test set \(R^2\)!
So which is the better predictive model: Model 1 or Model 2? In terms of training \(R^2\), Model 2 came out on top, but Model 1 won out in testing \(R^2\). While training \(R^2\) can tell us how well a predictive model explains the structure in the data set it was trained on, it is a deceptive metric of true predictive accuracy. The task of prediction is fundamentally about unseen data, so testing \(R^2\) is the appropriate metric. Model 1, the simpler model, is the better predictive model.
Summary
This lecture is about overfitting: what happens when your model takes the particular data set it was built on too seriously. The more complex a model is, the more prone it is to overfitting. Polynomial models can create very complex functions, so high-degree polynomial models can easily overfit. Fitting a model and evaluating it on the same data set can be problematic: if the model is overfitted, the evaluation metric (e.g. \(R^2\)) might look very good even though the model makes lousy predictions on new data.
A better way to approach predictive modeling is to fit the model to a training set then evaluate it with a separate testing set.