The Gauss–Markov Theorem

I discuss and prove the Gauss–Markov theorem, which states that under certain conditions, the least squares estimator is the minimum-variance linear unbiased estimator of the model parameters.

Informally, the Gauss–Markov theorem states that, under certain conditions, the ordinary least squares (OLS) estimator is the best linear unbiased estimator we can use. This is a powerful claim. Formally, the theorem states the following:

Gauss–Markov theorem. In a linear regression with response vector $\mathbf{y}$ and design matrix $\mathbf{X}$, the least squares estimator $\hat{\boldsymbol{\beta}} \triangleq (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^{\top} \mathbf{y}$ is the minimum-variance linear unbiased estimator of the model parameter $\boldsymbol{\beta}$, under the ordinary least squares assumptions.

Here, we can see that “best” is defined as both minimum variance and unbiased, and that the regularity conditions are the assumptions of OLS. There is no guarantee that a nonlinear method, for example, will not be better for our data by some other metric, but if we want to use an unbiased linear model and if the OLS assumptions hold for our data, then we should just use OLS.
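
To make this concrete, here is a minimal NumPy sketch (the dimensions, coefficients, and noise level are arbitrary choices for illustration) that computes the OLS estimator $(\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^{\top} \mathbf{y}$ on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(0)

N, P = 100, 3                        # number of observations and predictors
X = rng.normal(size=(N, P))          # design matrix
beta = np.array([2.0, -1.0, 0.5])    # true coefficients
eps = rng.normal(scale=0.3, size=N)  # i.i.d. errors
y = X @ beta + eps                   # linear model

# OLS estimator beta_hat = (X^T X)^{-1} X^T y. Solving the normal
# equations is preferred to forming the inverse explicitly.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)                      # close to [2.0, -1.0, 0.5]
```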

An estimator that is optimal in this way is sometimes referred to as “BLUE”, for best linear unbiased estimator. The Gauss–Markov theorem could be stated even more succinctly as: “Under the OLS assumptions, the OLS estimator is BLUE.”

Obviously, if the OLS assumptions do not hold, then the OLS estimator is not necessarily BLUE. If our data have heteroscedasticity, for example, then a least squares regression fit to our data will not necessarily be optimal as defined above.
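
As a quick numerical illustration of this point (an aside, with all quantities below chosen arbitrarily and the per-observation error variances assumed known), we can compare the exact conditional covariance of OLS under heteroscedastic errors with that of weighted least squares, which is also a linear unbiased estimator:

```python
import numpy as np

rng = np.random.default_rng(1)
N, P = 200, 2
X = rng.normal(size=(N, P))

# Heteroscedastic errors: each observation gets its own variance, so
# Cov[eps | X] = Omega is no longer sigma^2 * I.
sigma2 = rng.uniform(0.1, 5.0, size=N)
Omega = np.diag(sigma2)

# Exact conditional covariance of OLS under Omega (the "sandwich" form):
# (X^T X)^{-1} X^T Omega X (X^T X)^{-1}.
XtX_inv = np.linalg.inv(X.T @ X)
var_ols = XtX_inv @ X.T @ Omega @ X @ XtX_inv

# Weighted least squares with weights 1/sigma_i^2 is another linear
# unbiased estimator; its conditional covariance is (X^T Omega^{-1} X)^{-1}.
var_wls = np.linalg.inv(X.T @ np.diag(1.0 / sigma2) @ X)

print(np.diag(var_ols))  # coefficient variances under OLS...
print(np.diag(var_wls))  # ...are elementwise at least as large as under WLS
```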

Proof

This proof is from Greene (2003). Consider a second linear and unbiased estimator $\hat{\boldsymbol{\beta}}_0$:

$$
\hat{\boldsymbol{\beta}}_0 = \mathbf{C} \mathbf{y}. \tag{1}
$$

Here, $\mathbf{C}$ is a $P \times N$ matrix, and therefore $\hat{\boldsymbol{\beta}}_0$ is a linear estimator: it is a linear function of the response, and so are the predictions $\hat{\mathbf{y}}$, since

$$
\hat{\mathbf{y}} = \mathbf{X} \hat{\boldsymbol{\beta}}_0 = \mathbf{X} \mathbf{C} \mathbf{y}. \tag{2}
$$

A linear function of a linear function is still a linear function, so $\mathbf{X} \mathbf{C} \mathbf{y}$ is simply a linear map of our response variables $\mathbf{y}$ into the space spanned by the columns of $\mathbf{X} \mathbf{C}$. Thus, this estimator adheres to the first assumption of OLS, linearity.
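
A tiny sketch makes the linearity explicit: for any fixed $\mathbf{C}$ (here just a random matrix), the map $\mathbf{y} \mapsto \mathbf{X}\mathbf{C}\mathbf{y}$ satisfies superposition.

```python
import numpy as np

rng = np.random.default_rng(2)
N, P = 50, 3
X = rng.normal(size=(N, P))
C = rng.normal(size=(P, N))   # any P x N matrix defines a linear estimator Cy

y1, y2 = rng.normal(size=N), rng.normal(size=N)
a, b = 2.0, -3.0

# Superposition: the map y -> X C y is linear in y.
lhs = X @ C @ (a * y1 + b * y2)
rhs = a * (X @ C @ y1) + b * (X @ C @ y2)
print(np.allclose(lhs, rhs))  # True
```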

We assumed this new estimator $\hat{\boldsymbol{\beta}}_0$, like the OLS estimator $\hat{\boldsymbol{\beta}}$, is unbiased. (Although we proved, not assumed, that the original OLS estimator is unbiased.) This implies

$$
\begin{aligned}
\boldsymbol{\beta}
&= \mathbb{E}[\hat{\boldsymbol{\beta}}_0 \mid \mathbf{X}]
\\
&= \mathbb{E}[\mathbf{C}\mathbf{y} \mid \mathbf{X}]
\\
&= \mathbb{E}[\mathbf{C} (\mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}) \mid \mathbf{X}]
\\
&= \mathbb{E}[\mathbf{C} \mathbf{X} \boldsymbol{\beta} + \mathbf{C} \boldsymbol{\varepsilon} \mid \mathbf{X}].
\end{aligned} \tag{3}
$$

Thus, it must be true that

$$
\begin{aligned}
\mathbf{C} \mathbf{X} &= \mathbf{I},
\\
\mathbf{C} \mathbb{E}[\boldsymbol{\varepsilon} \mid \mathbf{X}] &= \mathbf{0},
\\
\hat{\boldsymbol{\beta}}_0 &= \boldsymbol{\beta} + \mathbf{C} \boldsymbol{\varepsilon}.
\end{aligned} \tag{4}
$$

In other words, the properties in Equation 4 are true since we assume that $\hat{\boldsymbol{\beta}}_0$ is unbiased. Note that $\mathbb{E}[\boldsymbol{\varepsilon} \mid \mathbf{X}] = \mathbf{0}$ is the second assumption of OLS, strict exogeneity.
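
To see these conditions in a concrete case, here is a sketch (my own illustration, not part of Greene's proof) that builds one particular $\mathbf{C}$ with $\mathbf{C}\mathbf{X} = \mathbf{I}$ besides the OLS choice, namely $(\mathbf{X}^{\top}\mathbf{W}\mathbf{X})^{-1}\mathbf{X}^{\top}\mathbf{W}$ for an arbitrary positive diagonal weight matrix $\mathbf{W}$, and checks its unbiasedness by simulation:

```python
import numpy as np

rng = np.random.default_rng(3)
N, P = 60, 3
X = rng.normal(size=(N, P))
beta = np.array([1.0, 2.0, -0.5])

# One concrete C with C X = I (besides the OLS choice A):
# C = (X^T W X)^{-1} X^T W for any positive-definite weight matrix W.
W = np.diag(rng.uniform(0.5, 2.0, size=N))
C = np.linalg.solve(X.T @ W @ X, X.T @ W)
print(np.allclose(C @ X, np.eye(P)))  # True: C X = I

# Monte Carlo check of unbiasedness, conditional on X (X held fixed,
# errors redrawn with E[eps | X] = 0 each time).
draws = np.array([C @ (X @ beta + rng.normal(size=N)) for _ in range(20_000)])
print(draws.mean(axis=0))             # approximately [1.0, 2.0, -0.5]
```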

Now the variance of this new linear estimator is

$$
\begin{aligned}
\mathbb{V}[\hat{\boldsymbol{\beta}}_0 \mid \mathbf{X}]
&= \mathbb{V}[\boldsymbol{\beta} + \mathbf{C} \boldsymbol{\varepsilon} \mid \mathbf{X}]
\\
&= \mathbb{V}[\mathbf{C} \boldsymbol{\varepsilon} \mid \mathbf{X}]
\\
&= \mathbf{C} \mathbb{V}[\boldsymbol{\varepsilon} \mid \mathbf{X}] \mathbf{C}^{\top}
\\
&= \sigma^2 \mathbf{C} \mathbf{C}^{\top}.
\end{aligned} \tag{5}
$$

Here, we have used the fact that $\boldsymbol{\beta}$ is a constant and that $\mathbf{C}$ is non-random given $\mathbf{X}$. Furthermore, notice that we assumed that

$$
\mathbb{V}[\boldsymbol{\varepsilon} \mid \mathbf{X}] = \sigma^2 \mathbf{I}. \tag{6}
$$

This is the fourth assumption of OLS, spherical errors. Now let’s define two new matrices $\mathbf{A}$ and $\mathbf{D}$:

$$
\begin{aligned}
\mathbf{A} &\triangleq (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^{\top},
\\
\mathbf{D} &\triangleq \mathbf{C} - \mathbf{A}.
\end{aligned} \tag{7}
$$

$\mathbf{A}$ is the OLS analog of $\mathbf{C}$: it is the linear map that the OLS estimator applies to $\mathbf{y}$. Then we can rewrite Equation 5 as:

$$
\begin{aligned}
\mathbb{V}[\hat{\boldsymbol{\beta}}_0 \mid \mathbf{X}]
&= \sigma^2 \mathbf{C} \mathbf{C}^{\top}
\\
&= \sigma^2 (\mathbf{D} + \mathbf{A})(\mathbf{D} + \mathbf{A})^{\top}
\\
&= \sigma^2 \left(\mathbf{D}\mathbf{D}^{\top} + \mathbf{A}\mathbf{D}^{\top} + \mathbf{D}\mathbf{A}^{\top} + \mathbf{A}\mathbf{A}^{\top} \right)
\\
&= \sigma^2 \left(\mathbf{D}\mathbf{D}^{\top} + \cancel{\mathbf{A}\mathbf{D}^{\top}} + \cancel{\mathbf{D}\mathbf{A}^{\top}} + \cancel{(\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^{\top} \mathbf{X}} \, (\mathbf{X}^{\top} \mathbf{X})^{-1} \right)
\\
&= \sigma^2 \mathbf{D}\mathbf{D}^{\top} + \sigma^2 (\mathbf{X}^{\top}\mathbf{X})^{-1}
\\
&\geq \mathbb{V}[\hat{\boldsymbol{\beta}} \mid \mathbf{X}].
\end{aligned} \tag{8}
$$

The inequality follows, in the sense that the difference between the two sides is positive semidefinite, because $\mathbf{D}\mathbf{D}^{\top}$ is a positive semidefinite matrix. So in words, the conditional variance of this new estimator $\hat{\boldsymbol{\beta}}_0$ is greater than or equal to the variance of $\hat{\boldsymbol{\beta}}$. This proves that $\hat{\boldsymbol{\beta}}$ is the minimum-variance linear unbiased estimator, or BLUE.
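
We can check this decomposition numerically (a sketch reusing the weighted $\mathbf{C}$ from the earlier snippet, with an arbitrary $\sigma^2$): the two variances differ by exactly $\sigma^2 \mathbf{D}\mathbf{D}^{\top}$, and that gap has no negative eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(4)
N, P = 60, 3
X = rng.normal(size=(N, P))
sigma2 = 0.5

A = np.linalg.solve(X.T @ X, X.T)           # OLS map: (X^T X)^{-1} X^T
W = np.diag(rng.uniform(0.5, 2.0, size=N))  # arbitrary positive weights
C = np.linalg.solve(X.T @ W @ X, X.T @ W)   # another map with C X = I
D = C - A

var_ols = sigma2 * np.linalg.inv(X.T @ X)   # sigma^2 (X^T X)^{-1}
var_alt = sigma2 * (C @ C.T)                # sigma^2 C C^T

# Equation 8: the alternative estimator's variance exceeds the OLS
# variance by exactly sigma^2 D D^T...
print(np.allclose(var_alt, var_ols + sigma2 * (D @ D.T)))    # True

# ...and the gap is positive semidefinite, which is the sense in which
# V[beta_hat_0 | X] >= V[beta_hat | X].
print(np.linalg.eigvalsh(var_alt - var_ols).min() > -1e-10)  # True
```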

So why are the cross terms in Equation 8 zero? Because

$$
\begin{aligned}
\mathbf{D}\mathbf{X}
&= (\mathbf{C} - \mathbf{A}) \mathbf{X}
\\
&= \mathbf{C}\mathbf{X} - \mathbf{A}\mathbf{X}
\\
&= \mathbf{C}\mathbf{X} - (\mathbf{X}^{\top}\mathbf{X})^{-1}\mathbf{X}^{\top} \mathbf{X}
\\
&= \mathbf{I} - \mathbf{I}
\\
&= \mathbf{0}.
\end{aligned} \tag{9}
$$

Note that $\mathbf{C}\mathbf{X} = \mathbf{I}$ by the assumption that $\hat{\boldsymbol{\beta}}_0$ is unbiased. Clearly, if $\mathbf{D}\mathbf{X} = \mathbf{0}$, then $\mathbf{X}^{\top} \mathbf{D}^{\top} = \mathbf{0}^{\top}$; both $\mathbf{0}$ and $\mathbf{0}^{\top}$ are $P \times P$ matrices of all zeros. Therefore the cross terms can each be written as

$$
\begin{aligned}
\mathbf{D}\mathbf{A}^{\top}
&= \mathbf{D} \mathbf{X} (\mathbf{X}^{\top} \mathbf{X})^{-1}
\\
&= \mathbf{0},
\\ \\
\mathbf{A} \mathbf{D}^{\top}
&= (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^{\top} \mathbf{D}^{\top}
\\
&= \mathbf{0}.
\end{aligned} \tag{10}
$$
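
A quick numerical confirmation of Equations 9 and 10, again using a weighted $\mathbf{C}$ with $\mathbf{C}\mathbf{X} = \mathbf{I}$ as a stand-in for an arbitrary unbiased linear estimator:

```python
import numpy as np

rng = np.random.default_rng(5)
N, P = 60, 3
X = rng.normal(size=(N, P))

A = np.linalg.solve(X.T @ X, X.T)           # OLS map
W = np.diag(rng.uniform(0.5, 2.0, size=N))
C = np.linalg.solve(X.T @ W @ X, X.T @ W)   # satisfies C X = I
D = C - A

# D X = C X - A X = I - I = 0, so both cross terms in Equation 8 vanish.
print(np.allclose(D @ X, np.zeros((P, P))))    # True
print(np.allclose(D @ A.T, np.zeros((P, P))))  # True
print(np.allclose(A @ D.T, np.zeros((P, P))))  # True
```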

Notice that implicit in the existence of the inverse of $\mathbf{X}^{\top} \mathbf{X}$ is the assumption that $\mathbf{X}$ is full rank. This is the third OLS assumption, no multicollinearity. Thus, the Gauss–Markov theorem holds when we adhere to the four assumptions of OLS: linearity, strict exogeneity, no multicollinearity, and spherical errors. If we make these four assumptions, then $\hat{\boldsymbol{\beta}}$ is BLUE, the best (minimum-variance) linear unbiased estimator.

  1. Greene, W. H. (2003). Econometric analysis. Pearson Education India.