Sampling Distribution of the OLS Estimator

I derive the mean and variance of the OLS estimator, as well as an unbiased estimator of the error variance. I then show that the OLS estimator is normally distributed if we assume the error terms are normally distributed.

As introduced in my previous posts on ordinary least squares (OLS), the linear regression model has the form

y_n = \beta_0 + \beta_1 x_{n,1} + \dots + \beta_P x_{n,P} + \varepsilon_n. \tag{1}

To perform tasks such as hypothesis testing for a given estimated coefficient \hat{\beta}_p, we need to pin down the sampling distribution of the OLS estimator \hat{\boldsymbol{\beta}} = [\hat{\beta}_1, \dots, \hat{\beta}_P]^{\top}. To do this, we need to make some assumptions. We can then use those assumptions to derive some basic properties of \hat{\boldsymbol{\beta}}.

I’ll start this post by working through the standard OLS assumptions. I’ll then show how these assumptions imply some established properties of the OLS estimator \hat{\boldsymbol{\beta}}. Finally, I’ll show that if we assume our error terms are normally distributed, we can pin down the distribution of \hat{\boldsymbol{\beta}} exactly.
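Throughout, it may help to keep a small numerical example in mind. Here is a minimal NumPy sketch, not part of the derivations below, in which the dimensions, coefficients, and noise level are arbitrary choices for illustration. It simulates data from Equation 1 and computes \hat{\boldsymbol{\beta}} via the normal equations; I will reuse this style of simulation later to sanity-check the results we derive.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions and parameter values, chosen only for illustration.
N, P = 100, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, P - 1))])  # design matrix with intercept
beta = np.array([1.0, 2.0, -0.5])                               # "true" coefficients
sigma = 0.3                                                     # noise standard deviation

eps = rng.normal(scale=sigma, size=N)  # error terms
y = X @ beta + eps                     # Equation 1 in matrix form: y = X beta + eps

# OLS estimate via the normal equations: beta_hat = (X^T X)^{-1} X^T y.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)  # should be close to beta
```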

Standard OLS assumptions

The standard assumptions of OLS are:

  1. Linearity
  2. Strict exogeneity
  3. No multicollinearity
  4. Spherical errors
  5. Normality (optional)

Assumptions 1 and 3 are not terribly interesting here. Assumption 1 is just Equation 1; it means that we have correctly specified our model. Assumption 3 is that our design matrix \mathbf{X} is full rank; this property is not relevant for this post, but I have another post on the topic for the curious.

Assumptions 2 and 4 are more interesting here. Assumption 2, strict exogeneity, is that the conditional expectation of the error term is zero:

\mathbb{E}[\varepsilon_n \mid \mathbf{X}] = 0, \quad n \in \{1, \dots, N\}. \tag{2}

An exogenous variable is a variable that is not determined by other variables or parameters in the model. Here is a nice example of why Equation 2 captures this intuition.

Assumption 4 can be broken into two assumptions. The first is homoskedasticity, meaning that our error terms have a constant variance \sigma^2:

\mathbb{V}[\varepsilon_n \mid \mathbf{X}] = \sigma^2, \quad n \in \{1, \dots, N\}. \tag{3}

The second is that our error terms are uncorrelated:

\mathbb{E}[\varepsilon_n \varepsilon_m \mid \mathbf{X}] = 0, \quad n,m \in \{1, \dots, N\}, \quad n \neq m. \tag{4}

Taken together, these two sub-assumptions are typically stated as just spherical errors, since we can formalize both at once as

\mathbb{V}[\boldsymbol{\varepsilon} \mid \mathbf{X}] = \sigma^2 \mathbf{I}_N. \tag{5}
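For concreteness, with N = 3 observations, Equation 5 says that the conditional covariance matrix of the errors is diagonal with a common variance:

\mathbb{V}[\boldsymbol{\varepsilon} \mid \mathbf{X}] =
\begin{bmatrix}
\sigma^2 & 0 & 0 \\
0 & \sigma^2 & 0 \\
0 & 0 & \sigma^2
\end{bmatrix}.

The equal diagonal entries are homoskedasticity (Equation 3), and the zero off-diagonal entries are uncorrelated errors (Equation 4).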

Finally, assumption 5 is that our error terms are normally distributed. This assumption is not required for OLS theory, but some sort of distributional assumption about the noise is required for hypothesis testing in OLS. As we will see, the normality assumption will imply that the OLS estimator \hat{\boldsymbol{\beta}} is normally distributed.

With these properties in mind, let’s prove some important facts about the OLS estimator \hat{\boldsymbol{\beta}}.

OLS estimator is unbiased

First, let’s prove that \hat{\boldsymbol{\beta}} is unbiased, i.e. that

\mathbb{E}[\hat{\boldsymbol{\beta}} \mid \mathbf{X}] = \boldsymbol{\beta}. \tag{6}

Equivalently, we just need to show that

\mathbb{E}[\hat{\boldsymbol{\beta}} - \boldsymbol{\beta} \mid \mathbf{X}] = \mathbf{0}. \tag{7}

The term in the expectation, \hat{\boldsymbol{\beta}} - \boldsymbol{\beta}, is sometimes called the sampling error, and we can write it in terms of the predictors and noise terms:

\begin{aligned}
\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}
&\stackrel{\star}{=} (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^{\top} \mathbf{y} - \boldsymbol{\beta}
\\
&\stackrel{\dagger}{=} (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^{\top} (\mathbf{X} \boldsymbol{\beta} + \boldsymbol{\varepsilon}) - \boldsymbol{\beta}
\\
&= (\mathbf{X}^{\top} \mathbf{X})^{-1} (\mathbf{X}^{\top} \mathbf{X}) \boldsymbol{\beta} + (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^{\top} \boldsymbol{\varepsilon} - \boldsymbol{\beta}
\\
&= \boldsymbol{\beta} + (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^{\top} \boldsymbol{\varepsilon} - \boldsymbol{\beta}
\\
&= (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^{\top} \boldsymbol{\varepsilon}.
\end{aligned} \tag{8}

Step \star is the normal equation, and step \dagger is the matrix form of our linearity assumption, \mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}. Since we condition on \mathbf{X}, we can treat (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^{\top} as non-random and pull it out of the expectation, and we’re done:

\mathbb{E}[\hat{\boldsymbol{\beta}} - \boldsymbol{\beta} \mid \mathbf{X}] = (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^{\top} \mathbb{E}[ \boldsymbol{\varepsilon} \mid \mathbf{X}] = \mathbf{0}. \tag{9}

As we can see, we require strict exogeneity to prove that \hat{\boldsymbol{\beta}} is unbiased.
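As a quick sanity check, not part of the proof, here is a sketch with arbitrary simulated values that verifies both the sampling-error identity in Equation 8 and the unbiasedness claim in Equation 9 numerically:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical setup, mirroring the notation above; values are arbitrary.
N, P = 100, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, P - 1))])  # fixed design matrix
beta = np.array([1.0, 2.0, -0.5])
sigma = 0.3

A = np.linalg.inv(X.T @ X) @ X.T  # the matrix (X^T X)^{-1} X^T

# Check the sampling-error identity (Equation 8) on a single draw.
eps = rng.normal(scale=sigma, size=N)
beta_hat = A @ (X @ beta + eps)
assert np.allclose(beta_hat - beta, A @ eps)

# Check unbiasedness (Equation 9) by averaging the sampling error over many
# error draws, holding X fixed.
errors = [A @ (X @ beta + rng.normal(scale=sigma, size=N)) - beta for _ in range(20000)]
print(np.mean(errors, axis=0))  # each entry should be close to zero
```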

Variance of the OLS estimator

This proof is from (Hayashi, 2000). The variance of the OLS estimator is

\begin{aligned}
\mathbb{V}[\hat{\boldsymbol{\beta}} \mid \mathbf{X}]
&\stackrel{\star}{=} \mathbb{V}[\hat{\boldsymbol{\beta}} - \boldsymbol{\beta} \mid \mathbf{X}]
\\
&\stackrel{\dagger}{=} \mathbb{V}[\overbrace{(\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^{\top}}^{\mathbf{A}} \boldsymbol{\varepsilon} \mid \mathbf{X}]
\\
&\stackrel{\ddagger}{=} \mathbf{A} \mathbb{V}[\boldsymbol{\varepsilon} \mid \mathbf{X}] \mathbf{A}^{\top}
\\
&\stackrel{*}{=} \mathbf{A} (\sigma^2 \mathbf{I}_N) \mathbf{A}^{\top}
\\
&= \sigma^2 \mathbf{A} \mathbf{A}^{\top}
\\
&= \sigma^2 (\mathbf{X}^{\top} \mathbf{X})^{-1}.
\end{aligned} \tag{10}

Step \star is because the true value \boldsymbol{\beta} is non-random; step \dagger substitutes the sampling error from Equation 8; step \ddagger is because \mathbf{A} is non-random given \mathbf{X}; and step * is assumption 4, spherical errors (Equation 5). The last equality follows since \mathbf{A}\mathbf{A}^{\top} = (\mathbf{X}^{\top}\mathbf{X})^{-1} \mathbf{X}^{\top} \mathbf{X} (\mathbf{X}^{\top}\mathbf{X})^{-1} = (\mathbf{X}^{\top}\mathbf{X})^{-1}.

As we can see, the basic idea of the proof is to write \hat{\boldsymbol{\beta}} in terms of the random vector \boldsymbol{\varepsilon}, since this is the quantity with constant variance \sigma^2.
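We can again check this numerically. The sketch below, with arbitrary simulated values, compares the empirical covariance of \hat{\boldsymbol{\beta}} across repeated error draws to the theoretical covariance \sigma^2 (\mathbf{X}^{\top} \mathbf{X})^{-1} from Equation 10:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical setup; values are arbitrary.
N, P = 50, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, P - 1))])
beta = np.array([1.0, 2.0, -0.5])
sigma = 0.3

# Empirical covariance of beta_hat over repeated error draws, with X held fixed.
estimates = [
    np.linalg.solve(X.T @ X, X.T @ (X @ beta + rng.normal(scale=sigma, size=N)))
    for _ in range(50000)
]
empirical_cov = np.cov(np.array(estimates), rowvar=False)

# Theoretical covariance from Equation 10.
theoretical_cov = sigma**2 * np.linalg.inv(X.T @ X)

print(np.max(np.abs(empirical_cov - theoretical_cov)))  # should be small
```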

Unbiased variance estimator

This section is not strictly necessary for understanding the sampling distribution of \hat{\boldsymbol{\beta}}, but it’s a useful property of the finite sample distribution, e.g. it shows up when computing t-statistics for OLS. This proof is also from (Hayashi, 2000), but I’ve organized and expanded it to be more explicit.

An unbiased estimator of the variance \sigma^2 is s^2, defined as

s^2 = \frac{\mathbf{e}^{\top} \mathbf{e}}{N - P}, \tag{11}

where \mathbf{e} is a vector of residuals, i.e. e_n \triangleq y_n - \hat{\boldsymbol{\beta}}^{\top} \mathbf{x}_n. To prove that s^2 is unbiased, it suffices to show that

\begin{aligned}
\mathbb{E}[s^2 \mid \mathbf{X}] &= \sigma^2,
\\
&\Downarrow
\\
\mathbb{E}\left[ \mathbf{e}^{\top} \mathbf{e} \mid \mathbf{X} \right] &= (N-P) \sigma^2.
\end{aligned} \tag{12}

We will prove this in three steps. First, we will show

\mathbf{e}^{\top} \mathbf{e} = \boldsymbol{\varepsilon}^{\top} \mathbf{M} \boldsymbol{\varepsilon}, \tag{13}

where \mathbf{M} is the residual maker,

\mathbf{M} = \mathbf{I}_N - \mathbf{X} (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^{\top}, \tag{14}

which I discussed in my first post on OLS. Second, we will show

\mathbb{E}[\boldsymbol{\varepsilon}^{\top} \mathbf{M} \boldsymbol{\varepsilon} \mid \mathbf{X}] = \text{trace}(\mathbf{M}) \sigma^2. \tag{15}

Third and finally, we will show

\text{trace}(\mathbf{M}) = N - P. \tag{16}

If each step is true, then the proof is complete.

Step 1. \mathbf{e}^{\top} \mathbf{e} = \boldsymbol{\varepsilon}^{\top} \mathbf{M} \boldsymbol{\varepsilon}

This subsection relies on facts about the residual maker \mathbf{M}, which I discussed in my first post on OLS. The proof is

\begin{aligned}
\boldsymbol{\varepsilon}^{\top} \mathbf{M} \boldsymbol{\varepsilon}
&= (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^{\top} \mathbf{M} (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})
\\
&\stackrel{\star}{=} \mathbf{y}^{\top} \mathbf{M} \mathbf{y} + \cancel{\boldsymbol{\beta}^{\top} \mathbf{X}^{\top} \mathbf{M} \mathbf{X} \boldsymbol{\beta}} - \cancel{\mathbf{y}^{\top}\mathbf{M}\mathbf{X}\boldsymbol{\beta}} - \cancel{\boldsymbol{\beta}^{\top} \mathbf{X}^{\top} \mathbf{M} \mathbf{y}}
\\
&= \mathbf{y}^{\top} \mathbf{M} \mathbf{y}
\\
&\stackrel{\dagger}{=} \mathbf{y}^{\top} \mathbf{M} \mathbf{M} \mathbf{y}
\\
&\stackrel{\ddagger}{=} \mathbf{e}^{\top} \mathbf{e}.
\end{aligned} \tag{17}

The middle and right cancellations in step \star hold since \mathbf{X}^{\top} \mathbf{e} = \mathbf{0} by the normal equation. Transposing the (scalar) middle term, using the symmetry of \mathbf{M}, and noting that \mathbf{M}\mathbf{y} = \mathbf{e}, we have

\begin{aligned}
\mathbf{y}^{\top} \mathbf{M} \mathbf{X} \boldsymbol{\beta}
&= \boldsymbol{\beta}^{\top} \mathbf{X}^{\top} \mathbf{M} \mathbf{y}
\\
&= \boldsymbol{\beta}^{\top} \mathbf{X}^{\top} \mathbf{e}
\\
&= \mathbf{0}.
\end{aligned} \tag{18}

The same argument applies to the right term, \boldsymbol{\beta}^{\top} \mathbf{X}^{\top} \mathbf{M} \mathbf{y}. The left cancellation is because \mathbf{M}\mathbf{X} = \mathbf{0}, where \mathbf{H} = \mathbf{X} (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^{\top} is the projection (hat) matrix:

\begin{aligned}
\mathbf{M}\mathbf{X} &= (\mathbf{I}_N - \mathbf{H}) \mathbf{X}
\\
&= \mathbf{X} - \mathbf{H} \mathbf{X}
\\
&= \mathbf{X} - \mathbf{X} (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^{\top} \mathbf{X}
\\
&= \mathbf{0}.
\end{aligned} \tag{19}

Step \dagger holds since \mathbf{M} is an orthogonal projection matrix and thus idempotent. And step \ddagger holds because \mathbf{M} is symmetric and \mathbf{M} \mathbf{y} = \mathbf{e}, since \mathbf{M} is the residual maker.
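Before moving on, here is a small numerical check, using an arbitrary simulated design, of the residual-maker facts used above, including Equation 13 itself:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical setup; values are arbitrary.
N, P = 30, 4
X = rng.normal(size=(N, P))
beta = rng.normal(size=P)
eps = rng.normal(size=N)
y = X @ beta + eps

H = X @ np.linalg.inv(X.T @ X) @ X.T           # projection ("hat") matrix
M = np.eye(N) - H                              # residual maker (Equation 14)
e = y - X @ np.linalg.solve(X.T @ X, X.T @ y)  # residuals

assert np.allclose(M @ X, 0)             # Equation 19: M annihilates X
assert np.allclose(M @ M, M)             # M is idempotent
assert np.allclose(M @ y, e)             # M y gives the residuals
assert np.isclose(e @ e, eps @ M @ eps)  # Equation 13
print("all checks passed")
```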

Step 2. \mathbb{E}[\boldsymbol{\varepsilon}^{\top} \mathbf{M} \boldsymbol{\varepsilon} \mid \mathbf{X}] = \text{trace}(\mathbf{M}) \sigma^2

Let’s write out the quadratic form explicitly:

\begin{aligned}
\mathbb{E}[\boldsymbol{\varepsilon}^{\top} \mathbf{M} \boldsymbol{\varepsilon} \mid \mathbf{X}]
&= \mathbb{E}\left[ \begin{bmatrix} \varepsilon_1 & \dots & \varepsilon_N \end{bmatrix} \begin{bmatrix} M_{11} & \dots & M_{1N} \\ \vdots & \ddots & \vdots \\ M_{N1} & \dots & M_{NN} \end{bmatrix} \begin{bmatrix} \varepsilon_1 \\ \vdots \\ \varepsilon_N \end{bmatrix} \middle| \mathbf{X} \right]
\\
&= \mathbb{E}\left[ \begin{bmatrix} \varepsilon_1 & \dots & \varepsilon_N \end{bmatrix} \begin{bmatrix} \sum_{i=1}^N M_{1i} \varepsilon_i \\ \vdots \\ \sum_{i=1}^N M_{Ni} \varepsilon_i \end{bmatrix} \middle| \mathbf{X} \right]
\\
&= \mathbb{E}\left[ \sum_{j=1}^N \sum_{i=1}^N M_{ji} \varepsilon_j \varepsilon_i \,\middle|\, \mathbf{X} \right].
\end{aligned} \tag{20}

Now notice that \mathbf{M} is just a function of \mathbf{X}, which we’re conditioning on. So we can move the M_{ji} terms out of the expectation to get

\mathbb{E}[\boldsymbol{\varepsilon}^{\top} \mathbf{M} \boldsymbol{\varepsilon} \mid \mathbf{X}] = \sum_{j=1}^N \sum_{i=1}^N M_{ji} \, \mathbb{E} \left[ \varepsilon_j \varepsilon_i \middle| \mathbf{X} \right]. \tag{21}

Finally, observe that

\mathbb{E} \left[ \varepsilon_j \varepsilon_i \middle| \mathbf{X} \right] = \begin{cases} \sigma^2 & \text{if $i = j$,} \\ 0 & \text{otherwise.} \end{cases} \tag{22}

This is assumption 4, spherical errors. Thus, we can write Equation 21 as

\mathbb{E}[\boldsymbol{\varepsilon}^{\top} \mathbf{M} \boldsymbol{\varepsilon} \mid \mathbf{X}] = \sum_{i=1}^N M_{ii} \sigma^2 = \text{trace}(\mathbf{M}) \sigma^2. \tag{23}
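A quick Monte Carlo check of this step, again with arbitrary simulated values: holding \mathbf{X} fixed, the average of \boldsymbol{\varepsilon}^{\top} \mathbf{M} \boldsymbol{\varepsilon} over many error draws should approach \text{trace}(\mathbf{M}) \sigma^2.

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical setup; values are arbitrary.
N, P = 30, 4
X = rng.normal(size=(N, P))
M = np.eye(N) - X @ np.linalg.inv(X.T @ X) @ X.T  # residual maker
sigma = 1.5

# Monte Carlo estimate of E[eps^T M eps | X], with X (and hence M) held fixed.
draws = [eps @ M @ eps for eps in rng.normal(scale=sigma, size=(100_000, N))]
print(np.mean(draws))          # should be close to trace(M) * sigma^2
print(np.trace(M) * sigma**2)  # equals (N - P) * sigma^2 up to floating point
```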

Step 3. \text{trace}(\mathbf{M}) = N - P

This proof uses basic properties of the trace operator, namely its linearity and its invariance under cyclic permutations:

\begin{aligned}
\text{trace}(\mathbf{M}) &= \text{trace}(\mathbf{I}_N - \mathbf{H})
\\
&= \text{trace}(\mathbf{I}_N) - \text{trace}(\mathbf{H})
\\
&= N - \text{trace}\left( \mathbf{X} (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^{\top} \right)
\\
&= N - \text{trace}\left( \mathbf{X}^{\top} \mathbf{X} (\mathbf{X}^{\top} \mathbf{X})^{-1} \right)
\\
&= N - \text{trace}\left( \mathbf{I}_P \right)
\\
&= N - P.
\end{aligned} \tag{24}

And we’re done.
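Putting the three steps together, we can numerically confirm both that \text{trace}(\mathbf{M}) = N - P and that s^2 is unbiased for \sigma^2. Here is a sketch with arbitrary simulated values:

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical setup; values are arbitrary.
N, P = 40, 5
X = rng.normal(size=(N, P))
beta = rng.normal(size=P)
sigma = 0.7

# trace(M) = N - P (Equation 16), up to floating-point error.
M = np.eye(N) - X @ np.linalg.inv(X.T @ X) @ X.T
print(np.trace(M), N - P)

# Monte Carlo check that s^2 = e^T e / (N - P) is unbiased for sigma^2 (Equation 12).
s2_draws = []
for _ in range(100_000):
    y = X @ beta + rng.normal(scale=sigma, size=N)
    e = y - X @ np.linalg.solve(X.T @ X, X.T @ y)
    s2_draws.append(e @ e / (N - P))
print(np.mean(s2_draws), sigma**2)  # the two values should be close
```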

Sampling distribution of \hat{\boldsymbol{\beta}}

If we make assumption 5, that the error terms are normally distributed, then \hat{\boldsymbol{\beta}} is also normally distributed. To see this, note that assumptions 2 and 4 already specify the mean and variance of \boldsymbol{\varepsilon}. If we assume normality in the errors, then clearly

\boldsymbol{\varepsilon} \mid \mathbf{X} \sim \mathcal{N}(\mathbf{0}, \sigma^2 \mathbf{I}_N), \tag{25}

since the normal distribution is fully specified by its mean and variance. Since this conditional distribution does not depend on \mathbf{X}, the marginal distribution of \boldsymbol{\varepsilon} is also normal,

\boldsymbol{\varepsilon} \sim \mathcal{N}(\mathbf{0}, \sigma^2 \mathbf{I}_N). \tag{26}

Finally, note that Equation 8 lets us write the sampling error in terms of the error terms:

\hat{\boldsymbol{\beta}} - \boldsymbol{\beta} = (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^{\top} \boldsymbol{\varepsilon}. \tag{27}

Given \mathbf{X}, the matrix (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^{\top} is fixed, so the sampling error is a linear transformation of \boldsymbol{\varepsilon}. A linear transformation of a normally distributed random vector is still normally distributed, meaning that \hat{\boldsymbol{\beta}} - \boldsymbol{\beta} is normally distributed. We know the mean of \hat{\boldsymbol{\beta}} - \boldsymbol{\beta} from Equation 9, and we know the variance from Equation 10. Therefore we have:

\hat{\boldsymbol{\beta}} - \boldsymbol{\beta} \sim \mathcal{N}(\mathbf{0}, \sigma^2 (\mathbf{X}^{\top} \mathbf{X})^{-1}). \tag{28}

Using basic properties of the normal distribution, we can immediately derive the distribution of the OLS estimator:

\hat{\boldsymbol{\beta}} \sim \mathcal{N}(\boldsymbol{\beta}, \sigma^2 (\mathbf{X}^{\top} \mathbf{X})^{-1}). \tag{29}

In summary, we have derived a standard result for the OLS estimator when assuming normally distributed errors.
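As a final sanity check, here is a sketch with arbitrary simulated values that standardizes one coefficient using the covariance in Equation 29 and confirms that, across repeated normal error draws, it behaves like a standard normal random variable:

```python
import numpy as np

rng = np.random.default_rng(6)

# Hypothetical setup; values are arbitrary.
N, P = 50, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, P - 1))])
beta = np.array([1.0, 2.0, -0.5])
sigma = 0.3

cov = sigma**2 * np.linalg.inv(X.T @ X)  # covariance from Equation 29

# Repeatedly refit OLS on fresh normal errors and standardize one coefficient.
p = 1  # the coefficient on the first non-intercept predictor
z = []
for _ in range(20_000):
    y = X @ beta + rng.normal(scale=sigma, size=N)
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    z.append((beta_hat[p] - beta[p]) / np.sqrt(cov[p, p]))
z = np.array(z)

print(z.mean(), z.std())          # should be close to 0 and 1
print(np.mean(np.abs(z) < 1.96))  # should be close to 0.95, as for a standard normal
```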

Conclusion

OLS makes a few important assumptions (assumptions 1–4), which mathematically imply some basic properties of the OLS estimator \hat{\boldsymbol{\beta}}. For example, the unbiasedness of \hat{\boldsymbol{\beta}} is due to strict exogeneity, assumption 2. However, without assuming a distribution on the noise (assumption 5), we cannot pin down the sampling distribution of \hat{\boldsymbol{\beta}}. If we assume normally distributed errors, then \hat{\boldsymbol{\beta}} is itself normally distributed. Knowing this distribution is useful when analyzing the results of linear models, such as when performing hypothesis testing for a given estimated coefficient \hat{\beta}_p.


Acknowledgements

I thank Mattia Mariantoni for pointing out a typo in Equation 20.

  1. Hayashi, F. (2000). Econometrics. Princeton University Press. Section 1, 60–69.