Consistency of the OLS Estimator

A consistent estimator converges in probability to the true value. I discuss this idea in general and then prove that the ordinary least squares estimator is consistent.

Consider the linear model

$$\mathbf{y} = \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\varepsilon}, \tag{1}$$

where $\mathbf{y}$ is an $N$-vector of response variables, $\mathbf{X}$ is an $N \times P$ matrix of $P$-dimensional predictors, $\boldsymbol{\beta}$ specifies a $P$-dimensional hyperplane, and $\boldsymbol{\varepsilon}$ is an $N$-vector of noise terms. The ordinary least squares (OLS) estimator of $\boldsymbol{\beta}$ is

$$\hat{\boldsymbol{\beta}} = (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^{\top} \mathbf{y}. \tag{2}$$

In a post on the sampling distribution of the OLS estimator, I proved that $\hat{\boldsymbol{\beta}}$ is unbiased and derived some of its other properties, such as its variance and its distribution under a normality assumption. However, this estimator is also consistent. The goal of this post is to understand what that means and then to prove that it is true.
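As a concrete illustration, here is a minimal NumPy sketch that simulates data from Equation 1 and computes Equation 2 directly. The dimensions, coefficients, and noise scale are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

N, P = 1000, 3
beta_true = np.array([1.0, -2.0, 0.5])   # arbitrary "true" coefficients
X = rng.normal(size=(N, P))              # N x P matrix of predictors
eps = rng.normal(scale=0.5, size=N)      # noise terms
y = X @ beta_true + eps                  # Equation 1

# Equation 2: beta_hat = (X'X)^{-1} X'y. In practice, prefer a linear solve
# (or np.linalg.lstsq) over forming the inverse explicitly.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)                          # close to beta_true for large N
```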

Consistency in general

Let’s first discuss consistency in general. Let $\boldsymbol{\theta}$ be a parameter of interest. Let $\hat{\boldsymbol{\theta}}_N$ be an estimator of $\boldsymbol{\theta}$. The subscript $N$ makes it clear that $\hat{\boldsymbol{\theta}}_N$ is a random variable that is a function of the sample size $N$. The estimator $\hat{\boldsymbol{\theta}}_N$ is consistent if it converges in probability to $\boldsymbol{\theta}$. Formally, this means

$$\lim_{N \rightarrow \infty} \mathbb{P}\left(|\boldsymbol{\theta} - \hat{\boldsymbol{\theta}}_N| \geq \varepsilon\right) = 0, \quad \text{for all } \varepsilon > 0. \tag{3}$$

The way I think about this is as follows: pick any $\varepsilon > 0$ and any probability threshold you would like; then I can find an $N$ large enough that $|\boldsymbol{\theta} - \hat{\boldsymbol{\theta}}_N|$ is less than $\varepsilon$ with probability as high as you demanded. In other words, this is a claim about how $\hat{\boldsymbol{\theta}}_N$ behaves as $N$ increases. In particular, the claim is that $\hat{\boldsymbol{\theta}}_N$ is well-behaved in the sense that we can make it arbitrarily close to $\boldsymbol{\theta}$, with arbitrarily high probability, by increasing $N$.
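To make Equation 3 concrete, here is a small Monte Carlo sketch of my own (the distribution, tolerance, and sample sizes are arbitrary choices) that estimates $\mathbb{P}(|\boldsymbol{\theta} - \hat{\boldsymbol{\theta}}_N| \geq \varepsilon)$ for the sample mean of standard normal data. The estimated probability shrinks toward zero as $N$ grows.

```python
import numpy as np

rng = np.random.default_rng(1)
theta = 0.0        # true mean of the data-generating distribution
eps_tol = 0.1      # the epsilon in Equation 3
n_trials = 2000    # Monte Carlo repetitions per sample size

for N in [10, 100, 1000, 5000]:
    # each row is one dataset of size N; theta_hat is the sample mean per dataset
    theta_hat = rng.normal(loc=theta, scale=1.0, size=(n_trials, N)).mean(axis=1)
    prob = np.mean(np.abs(theta - theta_hat) >= eps_tol)
    print(N, prob)  # estimated P(|theta - theta_hat_N| >= eps); decreases with N
```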

This is different from unbiasedness. An unbiased estimator is one such that

$$\mathbb{E}[\hat{\boldsymbol{\theta}}_N] = \boldsymbol{\theta}. \tag{4}$$

This is true for all $N$. So whether I have thirty data points or three million, I know that the statistic $\hat{\boldsymbol{\theta}}_N$, if unbiased, will not be too high or too low relative to the true value on average. (This average is over many samples $\mathbf{X}$ of size $N$.) Consistency, by contrast, is a property in which, as $N$ increases, the value of $\hat{\boldsymbol{\theta}}_N$ gets arbitrarily close to the true value $\boldsymbol{\theta}$ with high probability.
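To sharpen the distinction, consider a toy example of my own: for i.i.d. data, the "first observation" estimator $\hat{\theta}_N = X_1$ is unbiased for the mean at every $N$ but is not consistent, whereas the sample mean is both. A quick simulation, under these assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
mu = 5.0          # true mean
n_trials = 5000   # number of simulated datasets per sample size

for N in [10, 1000]:
    data = rng.normal(loc=mu, scale=1.0, size=(n_trials, N))
    first_obs = data[:, 0]           # unbiased for mu, but its spread ignores N
    sample_mean = data.mean(axis=1)  # unbiased for mu, and its spread shrinks with N
    print(N,
          round(first_obs.mean(), 3), round(first_obs.std(), 3),
          round(sample_mean.mean(), 3), round(sample_mean.std(), 3))
```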

One way to think about consistency is that it is a statement about the estimator’s variance as $N$ increases. For instance, Chebyshev’s inequality states that for any random variable $X$ with finite expected value $\mu$ and finite variance $\sigma^2 > 0$, the following inequality holds for any $\alpha > 0$:

$$\mathbb{P}(|X - \mu| > \alpha) \leq \frac{\sigma^2}{\alpha^2}. \tag{5}$$

So if $X$ is an unbiased estimator, then $\mathbb{E}[X] = \mu$. If we can show that $\sigma^2$ goes to zero as $N \rightarrow \infty$ ($X$ is a function of $N$ here), then we can prove consistency. Of course, a biased estimator can be consistent, but I think this illustrates a scenario in which proving consistency is intuitive (Figure 1).
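For example, for the sample mean $\bar{X}_N$ of $N$ i.i.d. observations with mean $\mu$ and variance $\sigma^2$, we have $\operatorname{Var}(\bar{X}_N) = \sigma^2 / N$, and Chebyshev’s inequality gives

$$\mathbb{P}\left( |\bar{X}_N - \mu| > \alpha \right) \leq \frac{\sigma^2}{N \alpha^2},$$

which goes to zero as $N \rightarrow \infty$, so the sample mean is a consistent estimator of $\mu$.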

Figure 1. The sampling distribution of the OLS coefficient $\hat{\beta}$ fit to $N \in \{10, 100, 1000\}$. The ground-truth coefficient is $\beta = 2$ and the model is correctly specified, i.e. $\mathbf{y} = 2 \mathbf{x} + \boldsymbol{\varepsilon}$. Since the OLS estimator is consistent, the sampling distribution becomes more concentrated as $N$ increases.
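Here is a sketch of a simulation along the lines of Figure 1 (my own reconstruction of the setup described in the caption, not the original figure code); the number of trials is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(3)
beta_true = 2.0   # ground-truth coefficient from the figure caption
n_trials = 2000   # number of simulated datasets per sample size

for N in [10, 100, 1000]:
    x = rng.normal(size=(n_trials, N))
    eps = rng.normal(size=(n_trials, N))
    y = beta_true * x + eps
    # OLS slope for a single no-intercept predictor, computed per dataset:
    # beta_hat = (x'y) / (x'x).
    beta_hat = (x * y).sum(axis=1) / (x * x).sum(axis=1)
    print(N, round(beta_hat.mean(), 3), round(beta_hat.std(), 3))  # spread shrinks with N
```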

The notation in Equation 3 is a bit clunky, and it is often simplified as

$$\plim \hat{\boldsymbol{\theta}}_N = \boldsymbol{\theta}. \tag{6}$$

Two useful properties of $\plim$, which we will use below, are:

$$\begin{aligned} \plim (\mathbf{a} + \mathbf{b}) &= \plim(\mathbf{a}) + \plim(\mathbf{b}), \\ \plim (\mathbf{a} \mathbf{b}) &= \plim(\mathbf{a}) \plim(\mathbf{b}), \end{aligned} \tag{7}$$

where $\mathbf{a}$ and $\mathbf{b}$ are scalars, vectors, or matrices. In particular, I find the second property surprising. Unfortunately, proving these properties would require a bigger dive into asymptotics than I am prepared to make right now. You can find a deeper discussion and proofs in textbooks on mathematical statistics, such as (Shao, 2003).
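As a quick, informal sanity check of the second property (my own illustration, with arbitrary distributions): for two independent i.i.d. sequences, the product of the sample means settles down to the product of the population means.

```python
import numpy as np

rng = np.random.default_rng(4)

for N in [100, 10_000, 1_000_000]:
    a = rng.exponential(scale=1.0, size=N).mean()  # plim of this sample mean is 1
    b = rng.normal(loc=3.0, size=N).mean()         # plim of this sample mean is 3
    print(N, round(a * b, 4))                      # approaches 1 * 3 = 3
```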

Consistency in OLS

We want to show that

$$\plim \hat{\boldsymbol{\beta}} = \boldsymbol{\beta}. \tag{8}$$

First, let’s write down the definition of $\hat{\boldsymbol{\beta}}$ and do some algebraic manipulation:

$$\begin{aligned} \plim \hat{\boldsymbol{\beta}} &= \plim \left\{ (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^{\top} \mathbf{y} \right\} \\ &= \plim \left\{ (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^{\top} (\mathbf{X} \boldsymbol{\beta} + \boldsymbol{\varepsilon}) \right\} \\ &= \plim \left\{ (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^{\top} \mathbf{X} \boldsymbol{\beta} + (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^{\top} \boldsymbol{\varepsilon} \right\} \\ &= \plim \boldsymbol{\beta} + \plim \left\{ (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^{\top} \boldsymbol{\varepsilon} \right\} \\ &= \boldsymbol{\beta} + \plim \left\{ (\mathbf{X}^{\top} \mathbf{X})^{-1} \right\} \plim \left\{ \mathbf{X}^{\top} \boldsymbol{\varepsilon} \right\}. \end{aligned} \tag{9}$$

Here, we have done nothing more than apply Equations 1 and 2, do some matrix algebra, and use some basic properties of probability limits. At this point, you may notice that as $N$ gets arbitrarily big, the sums in $\mathbf{X}^{\top} \mathbf{X}$ will get arbitrarily large as well. However, we can simply multiply the rightmost term in the last line of Equation 9 by $N/N$, placing a factor of $1/N$ inside each probability limit, to introduce normalizing terms:

$$\plim \hat{\boldsymbol{\beta}} = \boldsymbol{\beta} + \plim \left(\frac{1}{N} \mathbf{X}^{\top} \mathbf{X} \right)^{-1} \plim \left( \frac{1}{N} \mathbf{X}^{\top} \boldsymbol{\varepsilon} \right). \tag{10}$$

At this point, the standard assumption is that

$$\plim \left(\frac{1}{N} \mathbf{X}^{\top} \mathbf{X} \right) = \mathbf{Q} \tag{11}$$

for some positive definite matrix $\mathbf{Q}$. In words, this just means that our data are “well-behaved” in the sense that the law of large numbers applies. To see this, recall that the weak law of large numbers (WLLN) is a statement about a probability limit. Let $\mathbf{w}_1, \dots, \mathbf{w}_N$ be i.i.d. random vectors with finite mean; then the WLLN states

$$\plim \left[ \frac{1}{N} \sum_{n=1}^N \mathbf{w}_n \right] = \mathbb{E}[\mathbf{w}_n]. \tag{12}$$

The assumption in Equation 11 just says that the WLLN applies to each average in the matrix $\frac{1}{N} \mathbf{X}^{\top} \mathbf{X}$. Because of the structure of $\mathbf{X}^{\top} \mathbf{X}$, the limit $\mathbf{Q}$ is automatically positive semi-definite; assuming it is positive definite amounts to assuming the predictors are not perfectly collinear in the limit. So if $\mathbf{Q}$ exists (and we assume it does), its inverse exists since it is positive definite, and because matrix inversion is continuous at any invertible matrix, $\plim \left( \frac{1}{N} \mathbf{X}^{\top} \mathbf{X} \right)^{-1} = \mathbf{Q}^{-1}$. We can then write Equation 10 as

$$\plim \hat{\boldsymbol{\beta}} = \boldsymbol{\beta} + \mathbf{Q}^{-1} \plim \left( \frac{1}{N} \mathbf{X}^{\top} \boldsymbol{\varepsilon} \right). \tag{13}$$

Thus, we only need to show that

$$\plim \left( \frac{1}{N} \mathbf{X}^{\top} \boldsymbol{\varepsilon} \right) = \mathbf{0}, \tag{14}$$

where $\mathbf{0}$ is a $P$-vector of zeros, and we’re done. We can write the matrix-vector multiplication in Equation 14 as a sum:

$$\mathbf{X}^{\top} \boldsymbol{\varepsilon} = \begin{bmatrix} x_{11} & \dots & x_{1N} \\ \vdots & \ddots & \vdots \\ x_{P1} & \dots & x_{PN} \end{bmatrix} \begin{bmatrix} \varepsilon_{1} \\ \vdots \\ \varepsilon_{N} \end{bmatrix} = \begin{bmatrix} x_{11} \varepsilon_1 + \dots + x_{1N} \varepsilon_N \\ \vdots \\ x_{P1} \varepsilon_1 + \dots + x_{PN} \varepsilon_N \end{bmatrix} = \sum_{n=1}^N \mathbf{x}_n \varepsilon_n, \tag{15}$$

where $\mathbf{x}_n$ denotes the $n$-th row of $\mathbf{X}$, written as a $P$-vector.
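Equation 15 is easy to sanity-check numerically; a minimal sketch with arbitrary dimensions:

```python
import numpy as np

rng = np.random.default_rng(5)
N, P = 500, 4                               # arbitrary sizes for the check
X = rng.normal(size=(N, P))
eps = rng.normal(size=N)

lhs = X.T @ eps                             # the matrix-vector product
rhs = sum(X[n] * eps[n] for n in range(N))  # sum_n x_n * eps_n, with x_n the n-th row of X
print(np.allclose(lhs, rhs))                # True
```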

Thus, by the WLLN, we can write the probability limit in Equation 14 as an expectation,

$$\plim \left( \frac{1}{N} \mathbf{X}^{\top} \boldsymbol{\varepsilon} \right) = \plim \left( \frac{1}{N} \sum_{n=1}^N \mathbf{x}_n \varepsilon_n \right) = \mathbb{E}[\mathbf{x}_n \varepsilon_n]. \tag{16}$$

So if we can show that this expectation is equal to $\mathbf{0}$, we are done. To show this, we just apply the law of total expectation:

$$\begin{aligned} \mathbb{E}[\mathbf{x}_n \varepsilon_n] &= \mathbb{E}\left[ \mathbb{E}[ \mathbf{x}_n \varepsilon_n \mid \mathbf{X} ] \right] \\ &= \mathbb{E}\left[ \mathbf{x}_n \mathbb{E}[ \varepsilon_n \mid \mathbf{X} ] \right] \\ &\stackrel{\star}{=} \mathbb{E}[ \mathbf{x}_n \cdot 0 ] \\ &= \mathbf{0}. \end{aligned} \tag{17}$$

In step $\star$, we use the strict exogeneity assumption of OLS, namely that $\mathbb{E}[\boldsymbol{\varepsilon} \mid \mathbf{X}] = \mathbf{0}$.

To summarize: by the WLLN, the probability limit in Equation 16 is equal to an expectation, which we just showed is $\mathbf{0}$. Therefore, the rightmost term in Equation 13 is zero, and we have

$$\plim \hat{\boldsymbol{\beta}} = \boldsymbol{\beta} \tag{18}$$

as desired. Intuitively, I think this result makes sense. The OLS estimator is unbiased because we assume the predictors are uncorrelated with the noise terms. Thus, as $N$ increases, the WLLN simply kicks in, and the estimator converges in probability to the true value $\boldsymbol{\beta}$.
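Finally, the pieces of the argument can be checked together in a short simulation (a sketch under the usual i.i.d. assumptions, with arbitrary parameter choices): $\frac{1}{N}\mathbf{X}^{\top}\mathbf{X}$ settles down to a fixed matrix, $\frac{1}{N}\mathbf{X}^{\top}\boldsymbol{\varepsilon}$ shrinks toward zero, and $\hat{\boldsymbol{\beta}}$ approaches $\boldsymbol{\beta}$.

```python
import numpy as np

rng = np.random.default_rng(6)
beta_true = np.array([1.0, -2.0, 0.5])   # arbitrary true coefficients
P = beta_true.size

for N in [100, 10_000, 1_000_000]:
    X = rng.normal(size=(N, P))
    eps = rng.normal(size=N)
    y = X @ beta_true + eps

    Q_hat = X.T @ X / N                  # approaches E[x x'] (the identity matrix here)
    cross = X.T @ eps / N                # approaches the zero vector (Equation 14)
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

    print(N)
    print("  1/N X'X   ~\n", np.round(Q_hat, 3))
    print("  1/N X'eps ~", np.round(cross, 4))
    print("  beta_hat  ~", np.round(beta_hat, 4))
```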

  1. Shao, J. (2003). Mathematical Statistics. Springer.