Can Linear Models Overfit?
We know that regularization is important for linear models, but what does overfitting mean in this context? I discuss this question.
If you’re reading this, you probably already know what overfitting means in statistics and machine learning. Overfitting occurs when a model corresponds too closely to its training data and therefore fails to generalize to test data. The classic example of this is fitting a high-degree polynomial to linear data (Figure 1).

[Figure 1: a high-degree polynomial fit to noisy linear data, compared with a linear fit.]
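To make this concrete, here is a minimal sketch (assuming NumPy and scikit-learn; the data-generating process is my own illustrative choice) that fits both a line and a degree-15 polynomial to noisy linear data. The high-degree fit typically achieves lower training error but higher test error:

```python
# A minimal sketch, assuming NumPy and scikit-learn are available.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
x_train = rng.uniform(-1, 1, size=(30, 1))
y_train = 2 * x_train.ravel() + rng.normal(scale=0.3, size=30)  # truly linear data
x_test = rng.uniform(-1, 1, size=(200, 1))
y_test = 2 * x_test.ravel() + rng.normal(scale=0.3, size=200)

for degree in (1, 15):
    poly = PolynomialFeatures(degree)
    model = LinearRegression().fit(poly.fit_transform(x_train), y_train)
    train_mse = mean_squared_error(y_train, model.predict(poly.transform(x_train)))
    test_mse = mean_squared_error(y_test, model.predict(poly.transform(x_test)))
    print(f"degree {degree:2d}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")
```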
A model that overfits does not adhere to Occam’s razor in its explanation of the data. I think of conspiracy theories as a good example of humans overfitting to data. For example, consider the following conspiracy theory:
Theory: John F. Kennedy was assassinated by aliens.
Is this possible? Sure. So what’s wrong with the theory? Why don’t most people believe it? The theory could be true, but it demands far more evidence than simpler explanations, such as that Kennedy was assassinated by Lee Harvey Oswald. Put differently, the alien explanation is more complex than the data warrant. If your mental model of the world allows for highly improbable events, you can easily overfit to random noise.
The question I want to address in this post is: can a linear model overfit to data? This will motivate future posts on techniques such as Bayesian priors, Tikhonov regularization, and Lasso.
Overfitting in linear models
Consider fitting classical linear regression to 2D data $\mathbf{x} = (x_1, x_2)$:

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon.$$

Importantly, we are assuming this model is too complex because the true relationship depends only on $x_1$; the second feature $x_2$ is pure noise. In Figure 2, I fit this model to data generated in this way. The estimated coefficient $\hat{\beta}_2$ is nonzero, not because $x_2$ carries any signal, but because it happens to partially explain the noise in the training sample. This is what overfitting looks like in a linear model: a coefficient that should be zero is instead spent on noise.

[Figure 2: linear regression fit to data in which only $x_1$ carries signal.]
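As a sketch of this setup (again assuming scikit-learn; the specific coefficient and noise scale are illustrative), we can generate data in which only $x_1$ carries signal and inspect the fitted coefficients:

```python
# A minimal sketch, assuming scikit-learn; the data-generating process is illustrative.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
n = 20
x1 = rng.uniform(-1, 1, n)
x2 = rng.uniform(-1, 1, n)                   # pure noise, unrelated to y
y = 3 * x1 + rng.normal(scale=0.5, size=n)   # the true model uses only x1

X = np.column_stack([x1, x2])
model = LinearRegression().fit(X, y)
print(model.coef_)  # the second coefficient is typically nonzero: it fits noise
```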
Shrinkage and priors
Understanding overfitting in linear models—understanding that certain coefficients should be smaller because they just explain noise—motivates regularizing techniques such as shrinkage and Bayesian priors. In my mind, regularization is the process of adding constraints to an under-constrained problem. For example, both convolutional filters (shared weights) in deep neural networks and an $\ell_2$ penalty on regression weights act as constraints that restrict the set of models the data are allowed to select.
In linear models, two common forms of regularization are shrinkage methods and Bayesian priors. Shrinkage methods such as the lasso (Tibshirani, 1996) and ridge regression (Tikhonov, 1943) penalize large weights; we can think of this as biasing the model toward small coefficients. A Bayesian prior such as a zero-centered normal distribution on the weights achieves a similar goal: the data must overwhelm the prior before a coefficient can grow large. I will discuss these techniques in future posts.
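As a rough sketch of shrinkage in action (scikit-learn again; the penalty strengths `alpha` are arbitrary illustrative values), we can compare ordinary least squares with ridge and lasso fits on the noise-feature data from above:

```python
# A rough sketch of shrinkage, assuming scikit-learn; alpha values are arbitrary.
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge

rng = np.random.default_rng(1)
n = 20
x1 = rng.uniform(-1, 1, n)
x2 = rng.uniform(-1, 1, n)                   # noise feature
y = 3 * x1 + rng.normal(scale=0.5, size=n)
X = np.column_stack([x1, x2])

for name, model in [("OLS  ", LinearRegression()),
                    ("ridge", Ridge(alpha=1.0)),
                    ("lasso", Lasso(alpha=0.1))]:
    print(name, model.fit(X, y).coef_)
# The penalized fits shrink the coefficient on x2 toward zero;
# the lasso may set it to exactly zero.
```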
- Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1), 267–288.
- Tikhonov, A. N. (1943). On the stability of inverse problems. Dokl. Akad. Nauk SSSR, 39, 195–198.