Ordinary Least Squares

I discuss ordinary least squares, i.e. linear regression in which the optimal coefficients minimize the residual sum of squares. I then discuss various properties and interpretations of this classic model.

Linear regression

Suppose we have a regression problem with independent variables $\{\mathbf{x}_n\}_{n=1}^N$ and dependent variables $\{y_n\}_{n=1}^N$. Each independent variable $\mathbf{x}_n$ is a $P$-dimensional vector of predictors, while each $y_n$ is a scalar response or target variable. In linear regression, we assume our response variables are a linear function of our predictor variables. Let $\boldsymbol{\beta}$ denote a $P$-vector of unknown parameters (or “weights” or “coefficients”) and let $\varepsilon_n$ denote the $n$th observation’s scalar error term. Then linear regression is

$$y_n = \beta_1 x_{n1} + \beta_2 x_{n2} + \dots + \beta_P x_{nP} + \varepsilon_n.$$

Written as vectors, this is

$$y_n = \boldsymbol{\beta}^{\top} \mathbf{x}_n + \varepsilon_n.$$

If we stack the independent variables into an $N \times P$ matrix $\mathbf{X}$, sometimes called the design matrix, and stack the dependent variables and error terms into $N$-vectors $\mathbf{y}$ and $\boldsymbol{\varepsilon}$, then we can write the model in matrix form as

$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}.$$

In classical linear regression, $N > P$, and therefore $\mathbf{X}$ is tall and skinny. We can add an intercept to this linear model by introducing a new parameter and adding a constant predictor as the first column of $\mathbf{X}$. I will discuss the intercept later in this post.
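To make the matrix form concrete, here is a minimal NumPy sketch that simulates data from this model. The dimensions, coefficients, and noise scale are arbitrary choices for illustration, and later sketches in this post reuse these variables.

```python
import numpy as np

rng = np.random.default_rng(0)

N, P = 100, 3                           # N observations, P predictors (arbitrary choices)
X = rng.normal(size=(N, P))             # design matrix, one row per observation
beta_true = np.array([2.0, -1.0, 0.5])  # "true" coefficients for the simulation
eps = rng.normal(scale=0.3, size=N)     # additive error terms
y = X @ beta_true + eps                 # matrix form: y = X beta + epsilon
```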

Errors and residuals

Before we discuss how to estimate the model parameters $\boldsymbol{\beta}$, let’s explore the linear assumption and introduce some useful notation. Given estimated parameters $\hat{\boldsymbol{\beta}}$, linear regression predicts

$$\hat{y}_n = \hat{\boldsymbol{\beta}}^{\top} \mathbf{x}_n.$$

However, our predictions will most likely not match the response variables exactly. The residuals $r_n$ are the differences between the true response variables and what we predict, or

$$r_n = y_n - \hat{y}_n,$$

where $\hat{y}_n = \hat{\boldsymbol{\beta}}^{\top}\mathbf{x}_n$. Note that the residuals are not the error terms $\varepsilon_n$. The residuals are the differences between the true response variables and what we predict, while the errors are the differences between the true data-generating process and what we observe,

$$\varepsilon_n = y_n - \boldsymbol{\beta}^{\top}\mathbf{x}_n.$$

Thus, errors are related to the true data-generating process $\boldsymbol{\beta}^{\top}\mathbf{x}_n$, while residuals are related to the estimated model $\hat{\boldsymbol{\beta}}^{\top}\mathbf{x}_n$ (Figure 1). Clearly, if $\hat{y}_n = y_n$, then the residuals are zero, but the errors are still unknown.

Figure 1. (Left) Ground-truth (unobserved) univariate data $\mathbf{X}\boldsymbol{\beta}$, noisy observations $\mathbf{y}$, and estimated model $\mathbf{X}\hat{\boldsymbol{\beta}}$. (Middle) Errors between the ground-truth data and the noisy observations. (Right) Residuals between the noisy observations and the model predictions.
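Continuing the simulation above, the sketch below makes this distinction concrete: the errors compare the observations to the unobserved ground truth $\mathbf{X}\boldsymbol{\beta}$, while the residuals compare the observations to the fitted values $\mathbf{X}\hat{\boldsymbol{\beta}}$ (here the fit comes from `np.linalg.lstsq`; solving for the coefficients is covered in the next section).

```python
# Fit the model; beta_hat generally differs from beta_true.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

errors = y - X @ beta_true     # observation minus ground truth (unknown in practice)
residuals = y - X @ beta_hat   # observation minus model prediction (computable)

print(np.allclose(errors, residuals))   # False in general
```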

Normal equation

Now that we understand the basic model, let’s discuss solving for $\hat{\boldsymbol{\beta}}$. Since $\mathbf{X}$ is a tall and skinny matrix, solving $\mathbf{X}\boldsymbol{\beta} = \mathbf{y}$ exactly amounts to solving a linear system of $N$ equations with $P$ unknowns. Such a system is overdetermined, and it is unlikely to have an exact solution. Classical linear regression is sometimes called ordinary least squares (OLS) because the best-fit coefficients are defined as those that solve the following minimization problem:

$$\hat{\boldsymbol{\beta}} = \arg\min_{\boldsymbol{\beta}} \sum_{n=1}^{N} \big(y_n - \boldsymbol{\beta}^{\top}\mathbf{x}_n\big)^2.$$

Thus, the OLS estimator minimizes the sum of squared residuals (Figure 2). For a single data point, the squared error is zero if the prediction is exactly correct. Otherwise, the penalty increases quadratically, meaning classical linear regression heavily penalizes outliers. Other loss functions induce other linear models. See my previous post on interpreting these kinds of optimization problems.

Figure 2. Ordinary least squares linear regression on Scikit-learn's make_regression dataset. Data points are in blue. The predicted hyperplane is the red line. (Left) Red dashed vertical lines represent the residuals. (Right) Light red boxes represent the squared residuals.

In vector form, this optimization problem is

$$\hat{\boldsymbol{\beta}} = \arg\min_{\boldsymbol{\beta}} \lVert \mathbf{y} - \mathbf{X}\boldsymbol{\beta} \rVert_2^2.$$

Linear regression has an analytic or closed-form solution known as the normal equation,

$$\hat{\boldsymbol{\beta}} = (\mathbf{X}^{\top}\mathbf{X})^{-1}\mathbf{X}^{\top}\mathbf{y}.$$

See A1 for a complete derivation of the normal equation.
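Here is a minimal sketch of the normal equation on the simulated data from above, using `np.linalg.solve` on $\mathbf{X}^{\top}\mathbf{X}$ rather than forming the inverse explicitly; the result should match NumPy’s least-squares solver.

```python
# Normal equation: (X^T X) beta_hat = X^T y.
beta_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Compare against NumPy's least-squares solver.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(beta_normal, beta_lstsq))   # True
```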

Hat matrix and residual maker

We’ll see in a moment where the name “normal equation” comes from. However, we must first understand some basic properties of OLS. Let’s define a matrix $\mathbf{H}$ as

$$\mathbf{H} = \mathbf{X}(\mathbf{X}^{\top}\mathbf{X})^{-1}\mathbf{X}^{\top}.$$

We’ll call this the hat matrix, since it “puts a hat” on $\mathbf{y}$, i.e. since it maps the response variables $\mathbf{y}$ onto the predicted values $\hat{\mathbf{y}}$:

$$\mathbf{H}\mathbf{y} = \mathbf{X}(\mathbf{X}^{\top}\mathbf{X})^{-1}\mathbf{X}^{\top}\mathbf{y} = \mathbf{X}\hat{\boldsymbol{\beta}} = \hat{\mathbf{y}}.$$

The second equality is just the normal equation. Note that $\mathbf{H}$ is an orthogonal projection. See A2 for a proof. Furthermore, let’s call

$$\mathbf{M} = \mathbf{I}_N - \mathbf{H}$$

the residual maker, since it constructs the residuals, i.e. since it makes the residuals $\mathbf{r} = \mathbf{y} - \hat{\mathbf{y}}$ from the response variables $\mathbf{y}$:

$$\mathbf{M}\mathbf{y} = (\mathbf{I}_N - \mathbf{H})\mathbf{y} = \mathbf{y} - \hat{\mathbf{y}} = \mathbf{r}.$$

The residual maker is also an orthogonal projector. See A3 for a proof.

Finally, note that $\mathbf{H}$ and $\mathbf{M}$ are orthogonal to each other:

$$\mathbf{H}\mathbf{M} = \mathbf{H}(\mathbf{I}_N - \mathbf{H}) = \mathbf{H} - \mathbf{H}\mathbf{H} = \mathbf{H} - \mathbf{H} = \mathbf{0}.$$
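These properties are easy to check numerically. Continuing the running example, the sketch below forms $\mathbf{H}$ and $\mathbf{M}$ explicitly (fine for small $N$, though one would rarely do this in practice) and verifies idempotence, symmetry, and $\mathbf{H}\mathbf{M} = \mathbf{0}$.

```python
H = X @ np.linalg.solve(X.T @ X, X.T)   # hat matrix, N x N
M = np.eye(N) - H                       # residual maker

print(np.allclose(H @ H, H), np.allclose(H, H.T))   # H is an orthogonal projection
print(np.allclose(M @ M, M), np.allclose(M, M.T))   # so is M
print(np.allclose(H @ M, 0))                        # H and M are orthogonal
print(np.allclose(H @ y, X @ beta_hat))             # H "puts a hat" on y
print(np.allclose(M @ y, y - X @ beta_hat))         # M makes the residuals
```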

Geometric view of OLS

There is a nice geometric interpretation of all this. When we multiply the response variables $\mathbf{y}$ by $\mathbf{H}$, we are projecting $\mathbf{y}$ into the space spanned by the columns of $\mathbf{X}$. This makes sense, since the model is constrained to live in the space of linear combinations of the columns of $\mathbf{X}$,

$$\hat{\mathbf{y}} = \mathbf{X}\hat{\boldsymbol{\beta}},$$

and an orthogonal projection gives the closest point to $\mathbf{y}$ in Euclidean distance that we can get while staying in this constrained space. (One can find many nice visualizations of this fact online.) I’m pretty sure this is why the normal equation is so named, since “normal” is another word for “perpendicular” in geometry: the residual $\mathbf{y} - \hat{\mathbf{y}}$ is normal to the column space of $\mathbf{X}$.

Thus, we can summarize the predictions made by OLS by saying that we project our response variables $\mathbf{y}$ onto the linear subspace spanned by the columns of $\mathbf{X}$, using the orthogonal projection matrix $\mathbf{H}$.
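The geometric picture can also be verified numerically with the matrices from the previous sketch: the residual vector should be orthogonal (normal) to every column of $\mathbf{X}$.

```python
residual = y - H @ y                    # equivalently, M @ y
print(np.allclose(X.T @ residual, 0))   # residual is perpendicular to col(X)
```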

OLS with an intercept

Notice that our model so far does not include an intercept. In other words, our linear model is constrained such that the hyperplane goes through the origin. However, we often want to model a shift in the response variable, since this can dramatically improve the goodness of fit of the model (Figure 3). Thus, we want OLS with an intercept.

Figure 3. OLS without (red solid line) and with (red dashed line) an intercept. The model's goodness-of-fit changes dramatically depending on this modeling assumption.

In this section, I’ll first discuss partitioned regression, i.e. OLS when the predictors and corresponding coefficients can be logically separated into groups via block matrices, and then use that result for the specific case in which one of the blocks of the design matrix is a constant column, i.e. a constant term for an intercept.

Partitioned regression

Let’s rewrite the matrix form of the model by splitting $\mathbf{X}$ and $\boldsymbol{\beta}$ into block matrices. Let $\mathbf{X} = [\mathbf{X}_1 \;\; \mathbf{X}_2]$ (we’re stacking the columns horizontally) and $\boldsymbol{\beta} = \begin{bmatrix} \boldsymbol{\beta}_1 \\ \boldsymbol{\beta}_2 \end{bmatrix}$ (we’re stacking the rows vertically). Then partitioned regression is

$$\mathbf{y} = \mathbf{X}_1 \boldsymbol{\beta}_1 + \mathbf{X}_2 \boldsymbol{\beta}_2 + \boldsymbol{\varepsilon}.$$

The normal equation can then be written as

$$\begin{bmatrix} \mathbf{X}_1^{\top}\mathbf{X}_1 & \mathbf{X}_1^{\top}\mathbf{X}_2 \\ \mathbf{X}_2^{\top}\mathbf{X}_1 & \mathbf{X}_2^{\top}\mathbf{X}_2 \end{bmatrix} \begin{bmatrix} \hat{\boldsymbol{\beta}}_1 \\ \hat{\boldsymbol{\beta}}_2 \end{bmatrix} = \begin{bmatrix} \mathbf{X}_1^{\top}\mathbf{y} \\ \mathbf{X}_2^{\top}\mathbf{y} \end{bmatrix},$$

which in turn can be written as two separate equations:

$$\begin{aligned}
\mathbf{X}_1^{\top}\mathbf{X}_1 \hat{\boldsymbol{\beta}}_1 + \mathbf{X}_1^{\top}\mathbf{X}_2 \hat{\boldsymbol{\beta}}_2 &= \mathbf{X}_1^{\top}\mathbf{y}, \\
\mathbf{X}_2^{\top}\mathbf{X}_1 \hat{\boldsymbol{\beta}}_1 + \mathbf{X}_2^{\top}\mathbf{X}_2 \hat{\boldsymbol{\beta}}_2 &= \mathbf{X}_2^{\top}\mathbf{y}.
\end{aligned}$$

We can then solve for $\hat{\boldsymbol{\beta}}_1$ and $\hat{\boldsymbol{\beta}}_2$, since we have two equations and two unknowns. Let’s first solve for $\hat{\boldsymbol{\beta}}_1$. From the first equation, we have

$$\hat{\boldsymbol{\beta}}_1 = (\mathbf{X}_1^{\top}\mathbf{X}_1)^{-1}\mathbf{X}_1^{\top}(\mathbf{y} - \mathbf{X}_2\hat{\boldsymbol{\beta}}_2).$$

We can then isolate $\hat{\boldsymbol{\beta}}_2$ by substituting this expression for $\hat{\boldsymbol{\beta}}_1$ into the second equation:

$$\mathbf{X}_2^{\top}\mathbf{X}_1(\mathbf{X}_1^{\top}\mathbf{X}_1)^{-1}\mathbf{X}_1^{\top}(\mathbf{y} - \mathbf{X}_2\hat{\boldsymbol{\beta}}_2) + \mathbf{X}_2^{\top}\mathbf{X}_2\hat{\boldsymbol{\beta}}_2 = \mathbf{X}_2^{\top}\mathbf{y}.$$

If we define $\mathbf{H}_1$ and $\mathbf{M}_1$, the hat matrix and residual maker for $\mathbf{X}_1$, as

$$\mathbf{H}_1 = \mathbf{X}_1(\mathbf{X}_1^{\top}\mathbf{X}_1)^{-1}\mathbf{X}_1^{\top}, \qquad \mathbf{M}_1 = \mathbf{I}_N - \mathbf{H}_1,$$

then the previous equation can be rewritten as:

$$\mathbf{X}_2^{\top}\mathbf{H}_1(\mathbf{y} - \mathbf{X}_2\hat{\boldsymbol{\beta}}_2) + \mathbf{X}_2^{\top}\mathbf{X}_2\hat{\boldsymbol{\beta}}_2 = \mathbf{X}_2^{\top}\mathbf{y}
\quad\Longrightarrow\quad
\mathbf{X}_2^{\top}\mathbf{M}_1\mathbf{X}_2\hat{\boldsymbol{\beta}}_2 = \mathbf{X}_2^{\top}\mathbf{M}_1\mathbf{y}.$$

And we have solved for $\hat{\boldsymbol{\beta}}_2$ in terms of the residual maker for $\mathbf{X}_1$, denoted $\mathbf{M}_1$:

$$\hat{\boldsymbol{\beta}}_2 = (\mathbf{X}_2^{\top}\mathbf{M}_1\mathbf{X}_2)^{-1}\mathbf{X}_2^{\top}\mathbf{M}_1\mathbf{y}.$$

As we’ll see in the next section, solving for $\hat{\boldsymbol{\beta}}_2$ in terms of $\mathbf{M}_1$ will make it easier to understand and compute the optimal parameters of OLS with an intercept.

These results are part of the Frisch–Waugh–Lovell (FWL) theorem. The basic idea of the FWL theorem is that we can obtain $\hat{\boldsymbol{\beta}}_2$ by regressing $\mathbf{M}_1\mathbf{y}$ onto $\mathbf{M}_1\mathbf{X}_2$, where these variables are just the residuals after regressing $\mathbf{y}$ and $\mathbf{X}_2$ onto $\mathbf{X}_1$. See (Greene, 2003) for further discussion.
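Here is a small numerical check of the FWL theorem under the simulated setup from earlier: splitting the columns of $\mathbf{X}$ into two blocks, the coefficients on the second block from the full regression match those obtained by regressing $\mathbf{M}_1\mathbf{y}$ onto $\mathbf{M}_1\mathbf{X}_2$. The particular split is an arbitrary choice for illustration.

```python
# Partition the design matrix into two blocks of columns.
X1, X2 = X[:, :1], X[:, 1:]

# Full regression on [X1, X2].
beta_full, *_ = np.linalg.lstsq(np.hstack([X1, X2]), y, rcond=None)

# Residual maker for X1, applied to y and X2.
M1 = np.eye(N) - X1 @ np.linalg.solve(X1.T @ X1, X1.T)
beta2_fwl, *_ = np.linalg.lstsq(M1 @ X2, M1 @ y, rcond=None)

print(np.allclose(beta_full[1:], beta2_fwl))   # True: FWL recovers beta_2
```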

Partitioned regression with a constant predictor

OLS with an intercept is just a special case of partitioned regression. Suppose that we add an intercept to our OLS model. This means we want to estimate a new parameter $\alpha$ such that

$$y_n = \alpha + \boldsymbol{\beta}^{\top}\mathbf{x}_n + \varepsilon_n.$$

Notice that if we simply prepend a constant $1$ to each predictor $\mathbf{x}_n$, we can “push” $\alpha$ into the dot product. In matrix notation, we have

$$\mathbf{y} = \begin{bmatrix} \mathbf{1}_N & \mathbf{X} \end{bmatrix} \begin{bmatrix} \alpha \\ \boldsymbol{\beta} \end{bmatrix} + \boldsymbol{\varepsilon},$$

which is the same as our original matrix-form model but with a column of ones prepended to $\mathbf{X}$. This means we can write our model as partitioned regression with $\mathbf{X}_1 = \mathbf{1}_N$ and $\mathbf{X}_2 = \mathbf{X}$:

$$\mathbf{y} = \mathbf{1}_N\,\alpha + \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}.$$

We can use the results of the previous subsection to straightforwardly solve for the scalar $\hat{\alpha}$ and the $P$-vector $\hat{\boldsymbol{\beta}}$. First, note that the hat matrix for $\mathbf{X}_1 = \mathbf{1}_N$ is just an $N \times N$ matrix filled with the value $1/N$:

$$\mathbf{H}_1 = \mathbf{1}_N(\mathbf{1}_N^{\top}\mathbf{1}_N)^{-1}\mathbf{1}_N^{\top} = \frac{1}{N}\mathbf{1}_N\mathbf{1}_N^{\top}.$$

With this in mind, let’s compute $\hat{\boldsymbol{\beta}}$ using the partitioned-regression solution for $\hat{\boldsymbol{\beta}}_2$. We know that $\mathbf{M}_1 = \mathbf{I}_N - \mathbf{H}_1$, so $\mathbf{M}_1\mathbf{X} = \mathbf{X} - \mathbf{H}_1\mathbf{X}$. Furthermore, we can simplify $\mathbf{H}_1\mathbf{X}$ as

$$\mathbf{H}_1\mathbf{X} = \bar{\mathbf{X}},$$

where $\bar{\mathbf{X}}$ is an $N \times P$ matrix in which each column is an $N$-vector with the mean of the respective column of $\mathbf{X}$ repeated $N$ times, since

$$[\mathbf{H}_1\mathbf{X}]_{np} = \frac{1}{N}\sum_{m=1}^{N} x_{mp} = \bar{x}_p,$$

where $\bar{x}_p$ is defined as

$$\bar{x}_p = \frac{1}{N}\sum_{n=1}^{N} x_{np},$$

or the mean of the $p$th column of $\mathbf{X}$. Thus, $\mathbf{M}_1\mathbf{X} = \mathbf{X} - \bar{\mathbf{X}}$ just mean-centers the predictors, and likewise $\mathbf{M}_1\mathbf{y} = \mathbf{y} - \bar{\mathbf{y}}$ mean-centers the targets, where $\bar{\mathbf{y}}$ is an $N$-vector whose entries are all the mean of $\mathbf{y}$. This gives us

$$\hat{\boldsymbol{\beta}} = \big[(\mathbf{X} - \bar{\mathbf{X}})^{\top}(\mathbf{X} - \bar{\mathbf{X}})\big]^{-1}(\mathbf{X} - \bar{\mathbf{X}})^{\top}(\mathbf{y} - \bar{\mathbf{y}}).$$

In other words, when OLS has an intercept, the optimal coefficients $\hat{\boldsymbol{\beta}}$ (the parameters associated with the predictors in our design matrix) are just the result of the normal equation after mean-centering our targets and predictors. Intuitively, this means that the hyperplane defined by $\hat{\boldsymbol{\beta}}$ goes through the origin of the mean-centered data (Figure 4).

Figure 4. OLS with an intercept (solid line) can be decomposed into OLS without an intercept (dashed-and-dotted line) and a bias term (dashed line). Without an intercept, OLS goes through the origin. With an intercept, the hyperplane is shifted by the distance between the original hyperplane and the mean of the data.

Finally, we can solve for the scalar $\hat{\alpha}$ (which is really the intercept parameter in our model) using the partitioned-regression solution for $\hat{\boldsymbol{\beta}}_1$ with $\mathbf{X}_1 = \mathbf{1}_N$:

$$\hat{\alpha} = (\mathbf{1}_N^{\top}\mathbf{1}_N)^{-1}\mathbf{1}_N^{\top}(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}) = \bar{y} - \bar{\mathbf{x}}^{\top}\hat{\boldsymbol{\beta}},$$

where $\bar{\mathbf{x}}$ is a row of $\bar{\mathbf{X}}$, i.e. a $P$-vector of means for each predictor, and $\bar{y}$ is the mean of the response variables. The interpretation of this bias parameter becomes clearer if we diagram these quantities (Figure 4). As we can see, OLS with an intercept can be decomposed into OLS without an intercept, where the hyperplane defined by $\hat{\boldsymbol{\beta}}$ passes through the origin of the mean-centered data, and the bias parameter $\hat{\alpha}$, which shifts the hyperplane from the origin to the mean of the data.
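The sketch below checks this decomposition numerically on the running example: fitting OLS with an explicit column of ones gives the same slopes as the normal equation on mean-centered data, and the intercept equals the mean of the targets minus the mean of the predictors dotted with the slopes. Shifting the targets by 5 is just an arbitrary choice so that the intercept is nonzero.

```python
y_shift = y + 5.0                      # arbitrary shift so the intercept matters

# OLS with an explicit intercept column of ones.
X_ones = np.hstack([np.ones((N, 1)), X])
coef, *_ = np.linalg.lstsq(X_ones, y_shift, rcond=None)
alpha_hat, beta_hat_int = coef[0], coef[1:]

# Same slopes from the normal equation on mean-centered data.
Xc, yc = X - X.mean(axis=0), y_shift - y_shift.mean()
beta_centered = np.linalg.solve(Xc.T @ Xc, Xc.T @ yc)
print(np.allclose(beta_hat_int, beta_centered))   # True

# Intercept is the mean of y minus the mean of x dotted with the slopes.
print(np.isclose(alpha_hat, y_shift.mean() - X.mean(axis=0) @ beta_centered))   # True
```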

A probabilistic perspective

OLS can be viewed from a probabilistic perspective. Recall the linear model,

$$y_n = \boldsymbol{\beta}^{\top}\mathbf{x}_n + \varepsilon_n.$$

If we assume our error is additive Gaussian noise, $\varepsilon_n \sim \mathcal{N}(0, \sigma^2)$, then this induces a conditionally Gaussian assumption on our response variables,

$$y_n \mid \mathbf{x}_n \sim \mathcal{N}(\boldsymbol{\beta}^{\top}\mathbf{x}_n, \sigma^2).$$

If our data are i.i.d., we can write the likelihood as

$$p(\mathbf{y} \mid \mathbf{X}, \boldsymbol{\beta}, \sigma^2) = \prod_{n=1}^{N} \mathcal{N}(y_n \mid \boldsymbol{\beta}^{\top}\mathbf{x}_n, \sigma^2).$$

In this statistical framework, maximum likelihood (ML) estimation gives us the same optimal parameters as the normal equation. To compute the ML estimate, we take the derivative of the log-likelihood function with respect to the parameters $\boldsymbol{\beta}$, set it equal to zero, and solve for $\boldsymbol{\beta}$. We can represent the log-likelihood compactly using a multivariate normal distribution,

$$\log p(\mathbf{y} \mid \mathbf{X}, \boldsymbol{\beta}, \sigma^2) = -\frac{N}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\lVert \mathbf{y} - \mathbf{X}\boldsymbol{\beta} \rVert_2^2.$$

See A4 for a complete derivation of this expression. If we take the derivative of this log-likelihood with respect to the parameters, the first term vanishes because it does not depend on $\boldsymbol{\beta}$, and the constant $1/(2\sigma^2)$ does not affect our optimization. Thus, we are looking for

$$\hat{\boldsymbol{\beta}}_{\text{ML}} = \arg\max_{\boldsymbol{\beta}} \big\{ -\lVert \mathbf{y} - \mathbf{X}\boldsymbol{\beta} \rVert_2^2 \big\}.$$

Of course, maximizing the negation of a function is the same as minimizing the function directly. Thus, this is the same optimization problem that OLS solves, and the ML estimate equals the OLS estimate.
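A quick numerical check, assuming SciPy is available: minimizing the negative log-likelihood with a generic optimizer recovers, up to numerical tolerance, the same coefficients as the normal equation. The noise scale is treated as known and matches the earlier simulation.

```python
from scipy.optimize import minimize
from scipy.stats import norm

sigma = 0.3   # treat the noise scale as known for this check

def neg_log_likelihood(beta):
    # Negative log-likelihood of y given X, beta under i.i.d. Gaussian noise.
    return -norm.logpdf(y, loc=X @ beta, scale=sigma).sum()

result = minimize(neg_log_likelihood, x0=np.zeros(P))
print(np.allclose(result.x, np.linalg.solve(X.T @ X, X.T @ y), atol=1e-4))   # True
```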

Furthermore, let $\boldsymbol{\beta}$ and $\sigma^2$ be the true generative parameters. Then

$$\mathbb{E}[y_n \mid \mathbf{x}_n] = \boldsymbol{\beta}^{\top}\mathbf{x}_n, \qquad \mathbb{V}[y_n \mid \mathbf{x}_n] = \sigma^2.$$

See A5 for a derivation. Since we know that the conditional expectation is the minimizer of the mean squared loss (see my previous post if needed), we know that predicting with $\boldsymbol{\beta}^{\top}\mathbf{x}_n$ would be the best we can do given our model. An interpretation of the conditional variance in this context is that it is the smallest expected squared prediction error.
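Here is a small Monte Carlo check of this interpretation, reusing the true parameters from the simulation above: predicting with the true conditional mean gives an average squared prediction error close to $\sigma^2$ (here $0.3^2 = 0.09$).

```python
# Average squared error of the ideal predictor E[y | x] = beta^T x.
x_new = rng.normal(size=(100_000, P))
y_new = x_new @ beta_true + rng.normal(scale=0.3, size=100_000)
print(np.mean((y_new - x_new @ beta_true) ** 2))   # approximately 0.3**2 = 0.09
```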

Conclusion

Classical linear regression or ordinary least squares is a linear model in which the estimated parameters minimize the sum of squared residuals. Geometrically, we can interpret OLS as orthogonally projecting our response variables onto a hyperplane defined by these linear coefficients. OLS typically includes an intercept, which shifts the hyperplane so that it goes through the targets’ mean, rather than through the origin. In a probabilistic view of OLS, the maximum likelihood estimator is equivalent to the solution to the normal equation.

   

Acknowledgements

I thank Andrei Margeloiu for correcting an error in an earlier version of this post. When $\hat{\mathbf{y}} = \mathbf{y}$, the residuals are zero, but the true errors are still unknown.

   

Appendix

A1. Normal equation

We want to find the parameters or coefficients $\boldsymbol{\beta}$ that minimize the sum of squared residuals,

$$\hat{\boldsymbol{\beta}} = \arg\min_{\boldsymbol{\beta}} \sum_{n=1}^{N} \big(y_n - \boldsymbol{\beta}^{\top}\mathbf{x}_n\big)^2.$$

Note that we can write this objective as

$$\sum_{n=1}^{N} \big(y_n - \boldsymbol{\beta}^{\top}\mathbf{x}_n\big)^2 = \lVert \mathbf{y} - \mathbf{X}\boldsymbol{\beta} \rVert_2^2 = (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^{\top}(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}).$$

This can be easily seen by writing out the vectorization explicitly. Let $\mathbf{z}$ be the $N$-vector such that

$$\mathbf{z} = \mathbf{y} - \mathbf{X}\boldsymbol{\beta}, \qquad z_n = y_n - \boldsymbol{\beta}^{\top}\mathbf{x}_n.$$

The squared L2-norm $\lVert \mathbf{z} \rVert_2^2$ sums the squared components of $\mathbf{z}$, which is equivalent to taking the dot product $\mathbf{z}^{\top}\mathbf{z}$. Now define the function $J(\boldsymbol{\beta})$ such that

$$J(\boldsymbol{\beta}) = (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^{\top}(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}).$$

To minimize $J(\boldsymbol{\beta})$, we take its derivative with respect to $\boldsymbol{\beta}$, set it equal to zero, and solve for $\boldsymbol{\beta}$,

$$\begin{aligned}
\frac{\partial J(\boldsymbol{\beta})}{\partial \boldsymbol{\beta}}
&\stackrel{(1)}{=} \frac{\partial}{\partial \boldsymbol{\beta}} \operatorname{tr}\!\left[ (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^{\top}(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) \right]
\\
&\stackrel{(2)}{=} \frac{\partial}{\partial \boldsymbol{\beta}} \operatorname{tr}\!\left[ \mathbf{y}^{\top}\mathbf{y} \right]
- \frac{\partial}{\partial \boldsymbol{\beta}} \operatorname{tr}\!\left[ \mathbf{y}^{\top}\mathbf{X}\boldsymbol{\beta} \right]
- \frac{\partial}{\partial \boldsymbol{\beta}} \operatorname{tr}\!\left[ \boldsymbol{\beta}^{\top}\mathbf{X}^{\top}\mathbf{y} \right]
+ \frac{\partial}{\partial \boldsymbol{\beta}} \operatorname{tr}\!\left[ \boldsymbol{\beta}^{\top}\mathbf{X}^{\top}\mathbf{X}\boldsymbol{\beta} \right]
\\
&\stackrel{(3)}{=} -2 \frac{\partial}{\partial \boldsymbol{\beta}} \operatorname{tr}\!\left[ \mathbf{y}^{\top}\mathbf{X}\boldsymbol{\beta} \right]
+ \frac{\partial}{\partial \boldsymbol{\beta}} \operatorname{tr}\!\left[ \boldsymbol{\beta}^{\top}\mathbf{X}^{\top}\mathbf{X}\boldsymbol{\beta} \right]
\\
&\stackrel{(4)}{=} -2 \mathbf{X}^{\top}\mathbf{y} + 2 \mathbf{X}^{\top}\mathbf{X}\boldsymbol{\beta}.
\end{aligned}$$

In step (1), we use the fact that the trace of a scalar is the scalar. In step (2), we use the linearity of differentiation and of the trace operator. In step (3), we use the fact that $\mathbf{y}^{\top}\mathbf{X}\boldsymbol{\beta} = \boldsymbol{\beta}^{\top}\mathbf{X}^{\top}\mathbf{y}$, since the transpose of a scalar is the scalar, and we drop the term that does not depend on $\boldsymbol{\beta}$. In step (4), we take the derivatives of the left and right terms using standard trace-derivative identities from (Petersen et al., 2008).

If we set this derivative equal to zero and divide both sides of the equation by two, we get the normal equation:

$$\mathbf{X}^{\top}\mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{X}^{\top}\mathbf{y}
\quad\Longrightarrow\quad
\hat{\boldsymbol{\beta}} = (\mathbf{X}^{\top}\mathbf{X})^{-1}\mathbf{X}^{\top}\mathbf{y}.$$
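As a sanity check on this derivation, the analytic gradient $-2\mathbf{X}^{\top}\mathbf{y} + 2\mathbf{X}^{\top}\mathbf{X}\boldsymbol{\beta}$ can be compared against a finite-difference approximation of $J(\boldsymbol{\beta})$, reusing the simulated data from the main text:

```python
def J(beta):
    # Sum of squared residuals.
    r = y - X @ beta
    return r @ r

beta0 = rng.normal(size=P)
grad_analytic = -2 * X.T @ y + 2 * X.T @ X @ beta0

# Central finite differences, one coordinate at a time.
eps_fd = 1e-6
grad_numeric = np.array([
    (J(beta0 + eps_fd * e) - J(beta0 - eps_fd * e)) / (2 * eps_fd)
    for e in np.eye(P)
])
print(np.allclose(grad_analytic, grad_numeric, atol=1e-4))   # True
```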

A2. Hat matrix is an orthogonal projection

A square matrix $\mathbf{P}$ is a projection if $\mathbf{P}\mathbf{P} = \mathbf{P}$. Thus, the hat matrix is a projection since

$$\mathbf{H}\mathbf{H} = \mathbf{X}(\mathbf{X}^{\top}\mathbf{X})^{-1}\mathbf{X}^{\top}\mathbf{X}(\mathbf{X}^{\top}\mathbf{X})^{-1}\mathbf{X}^{\top} = \mathbf{X}(\mathbf{X}^{\top}\mathbf{X})^{-1}\mathbf{X}^{\top} = \mathbf{H}.$$

A real-valued projection is orthogonal if $\mathbf{P} = \mathbf{P}^{\top}$. Thus, the hat matrix is an orthogonal projection since

$$\mathbf{H}^{\top} = \big[\mathbf{X}(\mathbf{X}^{\top}\mathbf{X})^{-1}\mathbf{X}^{\top}\big]^{\top} = \mathbf{X}\big[(\mathbf{X}^{\top}\mathbf{X})^{-1}\big]^{\top}\mathbf{X}^{\top} = \mathbf{X}(\mathbf{X}^{\top}\mathbf{X})^{-1}\mathbf{X}^{\top} = \mathbf{H},$$

where we use the fact that $\mathbf{X}^{\top}\mathbf{X}$ is symmetric, and therefore so is its inverse.

A3. Residual maker is an orthogonal projector

A square matrix $\mathbf{P}$ is a projection if $\mathbf{P}\mathbf{P} = \mathbf{P}$. For the residual maker $\mathbf{M} = \mathbf{I}_N - \mathbf{H}$, we have:

$$\mathbf{M}\mathbf{M} = (\mathbf{I}_N - \mathbf{H})(\mathbf{I}_N - \mathbf{H}) = \mathbf{I}_N - 2\mathbf{H} + \mathbf{H}\mathbf{H} \stackrel{*}{=} \mathbf{I}_N - 2\mathbf{H} + \mathbf{H} = \mathbf{I}_N - \mathbf{H} = \mathbf{M}.$$

The starred step holds because we know that the hat matrix is a projection, i.e. $\mathbf{H}\mathbf{H} = \mathbf{H}$. A real-valued projection is orthogonal if $\mathbf{P} = \mathbf{P}^{\top}$. For the residual maker $\mathbf{M}$, we have:

$$\mathbf{M}^{\top} = (\mathbf{I}_N - \mathbf{H})^{\top} = \mathbf{I}_N - \mathbf{H}^{\top} \stackrel{*}{=} \mathbf{I}_N - \mathbf{H} = \mathbf{M}.$$

Again, the starred step holds because $\mathbf{H}$ is an orthogonal projection, and therefore symmetric. Thus, $\mathbf{M}$ is an orthogonal projection.

A4. Multivariate normal representation of the log likelihood

The probability density function for a $D$-dimensional multivariate normal distribution is

$$\mathcal{N}(\mathbf{z} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \frac{1}{(2\pi)^{D/2} \lvert \boldsymbol{\Sigma} \rvert^{1/2}} \exp\!\Big( -\frac{1}{2} (\mathbf{z} - \boldsymbol{\mu})^{\top} \boldsymbol{\Sigma}^{-1} (\mathbf{z} - \boldsymbol{\mu}) \Big).$$

The mean parameter $\boldsymbol{\mu}$ is a $D$-vector, and the covariance matrix $\boldsymbol{\Sigma}$ is a $D \times D$ positive definite matrix. In the probabilistic view of classical linear regression, the data are i.i.d. Therefore, we can represent the likelihood function as

$$p(\mathbf{y} \mid \mathbf{X}, \boldsymbol{\beta}, \sigma^2) = \prod_{n=1}^{N} \mathcal{N}(y_n \mid \boldsymbol{\beta}^{\top}\mathbf{x}_n, \sigma^2) = \mathcal{N}(\mathbf{y} \mid \mathbf{X}\boldsymbol{\beta}, \sigma^2 \mathbf{I}_N).$$

The above formulation leverages two properties from linear algebra. First, if the dimensions of the covariance matrix are independent (in our case, each dimension is a sample), then $\boldsymbol{\Sigma}$ is diagonal, and its matrix inverse is just a diagonal matrix with each value replaced by its reciprocal. Second, the determinant of a diagonal matrix is just the product of the diagonal elements.

The log-likelihood is then

$$\log p(\mathbf{y} \mid \mathbf{X}, \boldsymbol{\beta}, \sigma^2) = -\frac{N}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2} (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^{\top}(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}),$$

as desired.
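The equivalence between the factorized likelihood and the single multivariate normal density is easy to confirm numerically on the simulated data, assuming SciPy is available:

```python
from scipy.stats import multivariate_normal, norm

log_lik_factored = norm.logpdf(y, loc=X @ beta_true, scale=0.3).sum()
log_lik_mvn = multivariate_normal.logpdf(y, mean=X @ beta_true, cov=0.3**2 * np.eye(N))
print(np.isclose(log_lik_factored, log_lik_mvn))   # True
```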

A5. Conditional expectation and variance

Under the probabilistic model, $y_n = \boldsymbol{\beta}^{\top}\mathbf{x}_n + \varepsilon_n$ with $\varepsilon_n \sim \mathcal{N}(0, \sigma^2)$ independent of $\mathbf{x}_n$. Conditioned on $\mathbf{x}_n$, the term $\boldsymbol{\beta}^{\top}\mathbf{x}_n$ is a constant, so

$$\mathbb{E}[y_n \mid \mathbf{x}_n] = \boldsymbol{\beta}^{\top}\mathbf{x}_n + \mathbb{E}[\varepsilon_n] = \boldsymbol{\beta}^{\top}\mathbf{x}_n, \qquad \mathbb{V}[y_n \mid \mathbf{x}_n] = \mathbb{V}[\varepsilon_n] = \sigma^2.$$

References

  1. Greene, W. H. (2003). Econometric analysis. Pearson Education India.
  2. Petersen, K. B., Pedersen, M. S., & others. (2008). The matrix cookbook. Technical University of Denmark, 7(15), 510.