Consistency of the OLS Estimator

A consistent estimator converges in probability to the true value. I discuss this idea in general and then prove that the ordinary least squares estimator is consistent.

Consider the linear model

$$\mathbf{y} = \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\varepsilon}, \tag{1}$$

where $\mathbf{y}$ is an $N$-vector of response variables, $\mathbf{X}$ is an $N \times P$ matrix of $P$-dimensional predictors, $\boldsymbol{\beta}$ specifies a $P$-dimensional hyperplane, and $\boldsymbol{\varepsilon}$ is an $N$-vector of noise terms. The ordinary least squares (OLS) estimator of $\boldsymbol{\beta}$ is

$$\hat{\boldsymbol{\beta}} = (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^{\top} \mathbf{y}. \tag{2}$$

In a post on the sampling distribution of the OLS estimator, I proved that $\hat{\boldsymbol{\beta}}$ is unbiased and derived some of its other properties, such as its variance and its distribution under a normality assumption. However, this estimator is also consistent. The goal of this post is to understand what that means and then to prove that it is true.
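As a concrete illustration, here is a minimal NumPy sketch that simulates data from Equation 1 and computes Equation 2 directly. The dimensions, coefficients, and noise scale are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

N, P = 1000, 3
beta_true = np.array([1.0, -2.0, 0.5])   # arbitrary "true" coefficients
X = rng.normal(size=(N, P))              # N x P matrix of predictors
eps = rng.normal(scale=0.5, size=N)      # noise terms
y = X @ beta_true + eps                  # Equation 1

# Equation 2: beta_hat = (X'X)^{-1} X'y. In practice, prefer a linear solve
# (or np.linalg.lstsq) over forming the inverse explicitly.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)                          # close to beta_true for large N
```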

Consistency in general

Let’s first discuss consistency in general. Let $\boldsymbol{\theta}$ be a parameter of interest. Let $\hat{\boldsymbol{\theta}}_N$ be an estimator of $\boldsymbol{\theta}$. The subscript $N$ makes it clear that $\hat{\boldsymbol{\theta}}_N$ is a random variable that is a function of the sample size $N$. The estimator $\hat{\boldsymbol{\theta}}_N$ is consistent if it converges in probability to $\boldsymbol{\theta}$. Formally, this means

$$\lim_{N \rightarrow \infty} \mathbb{P}\left(|\boldsymbol{\theta} - \hat{\boldsymbol{\theta}}_N| \geq \varepsilon\right) = 0, \quad \text{for all } \varepsilon > 0. \tag{3}$$

The way I think about this is as follows: pick any $\varepsilon > 0$ and any probability threshold you would like; then I can find an $N$ large enough that $|\boldsymbol{\theta} - \hat{\boldsymbol{\theta}}_N|$ is less than $\varepsilon$ with probability as high as you demanded. In other words, this is a claim about how $\hat{\boldsymbol{\theta}}_N$ behaves as $N$ increases. In particular, the claim is that $\hat{\boldsymbol{\theta}}_N$ is well-behaved in the sense that we can make it arbitrarily close to $\boldsymbol{\theta}$, with arbitrarily high probability, by increasing $N$.
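To make Equation 3 concrete, here is a small Monte Carlo sketch of my own (the distribution, tolerance, and sample sizes are arbitrary choices) that estimates $\mathbb{P}(|\boldsymbol{\theta} - \hat{\boldsymbol{\theta}}_N| \geq \varepsilon)$ for the sample mean of standard normal data. The estimated probability shrinks toward zero as $N$ grows.

```python
import numpy as np

rng = np.random.default_rng(1)
theta = 0.0        # true mean of the data-generating distribution
eps_tol = 0.1      # the epsilon in Equation 3
n_trials = 2000    # Monte Carlo repetitions per sample size

for N in [10, 100, 1000, 5000]:
    # each row is one dataset of size N; theta_hat is the sample mean per dataset
    theta_hat = rng.normal(loc=theta, scale=1.0, size=(n_trials, N)).mean(axis=1)
    prob = np.mean(np.abs(theta - theta_hat) >= eps_tol)
    print(N, prob)  # estimated P(|theta - theta_hat_N| >= eps); decreases with N
```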

This is different from unbiasedness. An unbiased estimator is one such that

$$\mathbb{E}[\hat{\boldsymbol{\theta}}_N] = \boldsymbol{\theta}. \tag{4}$$

This is true for all $N$. So whether I have thirty data points or three million, I know that the statistic $\hat{\boldsymbol{\theta}}_N$, if unbiased, will not be too high or too low relative to the true value on average. (This average is over many samples $\mathbf{X}$ of size $N$.) Consistency, by contrast, is a property in which, as $N$ increases, the value of $\hat{\boldsymbol{\theta}}_N$ gets arbitrarily close to the true value $\boldsymbol{\theta}$ with high probability.
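To sharpen the distinction, consider a toy example of my own: for i.i.d. data, the "first observation" estimator $\hat{\theta}_N = X_1$ is unbiased for the mean at every $N$ but is not consistent, whereas the sample mean is both. A quick simulation, under these assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
mu = 5.0          # true mean
n_trials = 5000   # number of simulated datasets per sample size

for N in [10, 1000]:
    data = rng.normal(loc=mu, scale=1.0, size=(n_trials, N))
    first_obs = data[:, 0]           # unbiased for mu, but its spread ignores N
    sample_mean = data.mean(axis=1)  # unbiased for mu, and its spread shrinks with N
    print(N,
          round(first_obs.mean(), 3), round(first_obs.std(), 3),
          round(sample_mean.mean(), 3), round(sample_mean.std(), 3))
```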

One way to think about consistency is that it is a statement about the estimator’s variance as $N$ increases. For instance, Chebyshev’s inequality states that for any random variable $X$ with finite expected value $\mu$ and finite variance $\sigma^2 > 0$, the following inequality holds for any $\alpha > 0$:

$$\mathbb{P}(|X - \mu| > \alpha) \leq \frac{\sigma^2}{\alpha^2}. \tag{5}$$

So if $X$ is an unbiased estimator, then $\mathbb{E}[X] = \mu$. If we can show that $\sigma^2$ goes to zero as $N \rightarrow \infty$ ($X$ is a function of $N$ here), then we can prove consistency. Of course, a biased estimator can be consistent, but I think this illustrates a scenario in which proving consistency is intuitive (Figure 1).
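For example, for the sample mean $\bar{X}_N$ of $N$ i.i.d. observations with mean $\mu$ and variance $\sigma^2$, we have $\operatorname{Var}(\bar{X}_N) = \sigma^2 / N$, and Chebyshev’s inequality gives

$$\mathbb{P}\left( |\bar{X}_N - \mu| > \alpha \right) \leq \frac{\sigma^2}{N \alpha^2},$$

which goes to zero as $N \rightarrow \infty$, so the sample mean is a consistent estimator of $\mu$.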

Figure 1. The sampling distribution of the OLS coefficient $\hat{\beta}$ fit to $N \in \{10, 100, 1000\}$. The ground-truth coefficient is $\beta = 2$ and the model is correctly specified, i.e. $\mathbf{y} = 2 \mathbf{x} + \boldsymbol{\varepsilon}$. Since the OLS estimator is consistent, the sampling distribution becomes more concentrated as $N$ increases.
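Here is a sketch of a simulation along the lines of Figure 1 (my own reconstruction of the setup described in the caption, not the original figure code); the number of trials is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(3)
beta_true = 2.0   # ground-truth coefficient from the figure caption
n_trials = 2000   # number of simulated datasets per sample size

for N in [10, 100, 1000]:
    x = rng.normal(size=(n_trials, N))
    eps = rng.normal(size=(n_trials, N))
    y = beta_true * x + eps
    # OLS slope for a single no-intercept predictor, computed per dataset:
    # beta_hat = (x'y) / (x'x).
    beta_hat = (x * y).sum(axis=1) / (x * x).sum(axis=1)
    print(N, round(beta_hat.mean(), 3), round(beta_hat.std(), 3))  # spread shrinks with N
```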

The notation in Equation 3 is a bit clunky, and it is often simplified as

$$\plim \hat{\boldsymbol{\theta}}_N = \boldsymbol{\theta}. \tag{6}$$

Two useful properties of $\plim$, which we will use below, are:

$$\begin{aligned} \plim (\mathbf{a} + \mathbf{b}) &= \plim(\mathbf{a}) + \plim(\mathbf{b}), \\ \plim (\mathbf{a} \mathbf{b}) &= \plim(\mathbf{a}) \plim(\mathbf{b}), \end{aligned} \tag{7}$$

where $\mathbf{a}$ and $\mathbf{b}$ are scalars, vectors, or matrices. In particular, I find the second property surprising. Unfortunately, proving these properties would require a bigger dive into asymptotics than I am prepared to make right now. You can find a deeper discussion and proofs in textbooks on mathematical statistics, such as (Shao, 2003).
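As a quick, informal sanity check of the second property (my own illustration, with arbitrary distributions): for two independent i.i.d. sequences, the product of the sample means settles down to the product of the population means.

```python
import numpy as np

rng = np.random.default_rng(4)

for N in [100, 10_000, 1_000_000]:
    a = rng.exponential(scale=1.0, size=N).mean()  # plim of this sample mean is 1
    b = rng.normal(loc=3.0, size=N).mean()         # plim of this sample mean is 3
    print(N, round(a * b, 4))                      # approaches 1 * 3 = 3
```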

Consistency in OLS

We want to show that

$$\plim \hat{\boldsymbol{\beta}} = \boldsymbol{\beta}. \tag{8}$$

First, let’s write down the definition of $\hat{\boldsymbol{\beta}}$ and do some algebraic manipulation:

$$\begin{aligned} \plim \hat{\boldsymbol{\beta}} &= \plim \left\{ (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^{\top} \mathbf{y} \right\} \\ &= \plim \left\{ (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^{\top} (\mathbf{X} \boldsymbol{\beta} + \boldsymbol{\varepsilon}) \right\} \\ &= \plim \left\{ (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^{\top} \mathbf{X} \boldsymbol{\beta} + (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^{\top} \boldsymbol{\varepsilon} \right\} \\ &= \plim \boldsymbol{\beta} + \plim \left\{ (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^{\top} \boldsymbol{\varepsilon} \right\} \\ &= \boldsymbol{\beta} + \plim \left\{ (\mathbf{X}^{\top} \mathbf{X})^{-1} \right\} \plim \left\{ \mathbf{X}^{\top} \boldsymbol{\varepsilon} \right\}. \end{aligned} \tag{9}$$

Here, we have done nothing more than apply Equations 1 and 2, do some matrix algebra, and use some basic properties of probability limits. At this point, you may notice that as $N$ gets arbitrarily big, the sums in $\mathbf{X}^{\top} \mathbf{X}$ will get arbitrarily large as well. However, we can simply multiply the rightmost term in the last line of Equation 9 by $N/N$, placing a factor of $1/N$ inside each probability limit, to introduce normalizing terms:

$$\plim \hat{\boldsymbol{\beta}} = \boldsymbol{\beta} + \plim \left(\frac{1}{N} \mathbf{X}^{\top} \mathbf{X} \right)^{-1} \plim \left( \frac{1}{N} \mathbf{X}^{\top} \boldsymbol{\varepsilon} \right). \tag{10}$$

At this point, the standard assumption is that

$$\plim \left(\frac{1}{N} \mathbf{X}^{\top} \mathbf{X} \right) = \mathbf{Q} \tag{11}$$

for some positive definite matrix $\mathbf{Q}$. In words, this just means that our data are “well-behaved” in the sense that the law of large numbers applies. To see this, recall that the weak law of large numbers (WLLN) is a statement about a probability limit. Let $\mathbf{w}_1, \dots, \mathbf{w}_N$ be i.i.d. random vectors with finite mean; then the WLLN states

$$\plim \left[ \frac{1}{N} \sum_{n=1}^N \mathbf{w}_n \right] = \mathbb{E}[\mathbf{w}_n]. \tag{12}$$

The assumption in Equation 11 just says that the WLLN applies to each average in the matrix $\frac{1}{N} \mathbf{X}^{\top} \mathbf{X}$. Because of the structure of $\mathbf{X}^{\top} \mathbf{X}$, the limit $\mathbf{Q}$ is automatically positive semi-definite; assuming it is positive definite amounts to assuming the predictors are not perfectly collinear in the limit. So if $\mathbf{Q}$ exists (and we assume it does), its inverse exists since it is positive definite, and because matrix inversion is continuous at any invertible matrix, $\plim \left( \frac{1}{N} \mathbf{X}^{\top} \mathbf{X} \right)^{-1} = \mathbf{Q}^{-1}$. We can then write Equation 10 as

$$\plim \hat{\boldsymbol{\beta}} = \boldsymbol{\beta} + \mathbf{Q}^{-1} \plim \left( \frac{1}{N} \mathbf{X}^{\top} \boldsymbol{\varepsilon} \right). \tag{13}$$

Thus, we only need to show that

$$\plim \left( \frac{1}{N} \mathbf{X}^{\top} \boldsymbol{\varepsilon} \right) = \mathbf{0}, \tag{14}$$

where $\mathbf{0}$ is a $P$-vector of zeros, and we’re done. We can write the matrix-vector multiplication in Equation 14 as a sum:

$$\mathbf{X}^{\top} \boldsymbol{\varepsilon} = \begin{bmatrix} x_{11} & \dots & x_{1N} \\ \vdots & \ddots & \vdots \\ x_{P1} & \dots & x_{PN} \end{bmatrix} \begin{bmatrix} \varepsilon_{1} \\ \vdots \\ \varepsilon_{N} \end{bmatrix} = \begin{bmatrix} x_{11} \varepsilon_1 + \dots + x_{1N} \varepsilon_N \\ \vdots \\ x_{P1} \varepsilon_1 + \dots + x_{PN} \varepsilon_N \end{bmatrix} = \sum_{n=1}^N \mathbf{x}_n \varepsilon_n, \tag{15}$$

where $\mathbf{x}_n$ denotes the $n$-th row of $\mathbf{X}$, written as a $P$-vector.
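Equation 15 is easy to sanity-check numerically; a minimal sketch with arbitrary dimensions:

```python
import numpy as np

rng = np.random.default_rng(5)
N, P = 500, 4                               # arbitrary sizes for the check
X = rng.normal(size=(N, P))
eps = rng.normal(size=N)

lhs = X.T @ eps                             # the matrix-vector product
rhs = sum(X[n] * eps[n] for n in range(N))  # sum_n x_n * eps_n, with x_n the n-th row of X
print(np.allclose(lhs, rhs))                # True
```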

Thus, by the WLLN, we can write the probability limit in Equation 14 as an expectation,

$$\plim \left( \frac{1}{N} \mathbf{X}^{\top} \boldsymbol{\varepsilon} \right) = \plim \left( \frac{1}{N} \sum_{n=1}^N \mathbf{x}_n \varepsilon_n \right) = \mathbb{E}[\mathbf{x}_n \varepsilon_n]. \tag{16}$$

So if we can show that this expectation is equal to $\mathbf{0}$, we are done. To show this, we just apply the law of total expectation:

$$\begin{aligned} \mathbb{E}[\mathbf{x}_n \varepsilon_n] &= \mathbb{E}\left[ \mathbb{E}[ \mathbf{x}_n \varepsilon_n \mid \mathbf{X} ] \right] \\ &= \mathbb{E}\left[ \mathbf{x}_n \mathbb{E}[ \varepsilon_n \mid \mathbf{X} ] \right] \\ &\stackrel{\star}{=} \mathbb{E}[ \mathbf{x}_n \cdot 0 ] \\ &= \mathbf{0}. \end{aligned} \tag{17}$$

In step $\star$, we use the strict exogeneity assumption of OLS, namely that $\mathbb{E}[\boldsymbol{\varepsilon} \mid \mathbf{X}] = \mathbf{0}$.

To summarize: by the WLLN, the probability limit in Equation 16 is equal to an expectation, which we just showed is $\mathbf{0}$. Therefore, the rightmost term in Equation 13 is zero, and we have

$$\plim \hat{\boldsymbol{\beta}} = \boldsymbol{\beta} \tag{18}$$

as desired. Intuitively, I think this result makes sense. The OLS estimator is unbiased because we assume the predictors are uncorrelated with the noise terms. Thus, as $N$ increases, the WLLN simply kicks in, and the estimator converges in probability to the true value $\boldsymbol{\beta}$.
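Finally, the pieces of the argument can be checked together in a short simulation (a sketch under the usual i.i.d. assumptions, with arbitrary parameter choices): $\frac{1}{N}\mathbf{X}^{\top}\mathbf{X}$ settles down to a fixed matrix, $\frac{1}{N}\mathbf{X}^{\top}\boldsymbol{\varepsilon}$ shrinks toward zero, and $\hat{\boldsymbol{\beta}}$ approaches $\boldsymbol{\beta}$.

```python
import numpy as np

rng = np.random.default_rng(6)
beta_true = np.array([1.0, -2.0, 0.5])   # arbitrary true coefficients
P = beta_true.size

for N in [100, 10_000, 1_000_000]:
    X = rng.normal(size=(N, P))
    eps = rng.normal(size=N)
    y = X @ beta_true + eps

    Q_hat = X.T @ X / N                  # approaches E[x x'] (the identity matrix here)
    cross = X.T @ eps / N                # approaches the zero vector (Equation 14)
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

    print(N)
    print("  1/N X'X   ~\n", np.round(Q_hat, 3))
    print("  1/N X'eps ~", np.round(cross, 4))
    print("  beta_hat  ~", np.round(beta_hat, 4))
```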

  1. Shao, J. (2003). Mathematical Statistics. Springer.