Sampling Distribution of the OLS Estimator

I derive the mean and variance of the OLS estimator, as well as an unbiased estimator of the OLS estimator's variance. I then show that the OLS estimator is normally distributed if we assume the error terms are normally distributed.

As introduced in my previous posts on ordinary least squares (OLS), the linear regression model has the form

y_n = \beta_0 + \beta_1 x_{n,1} + \dots + \beta_P x_{n,P} + \varepsilon_n. \tag{1}

To perform tasks such as hypothesis testing for a given estimated coefficient \hat{\beta}_p, we need to pin down the sampling distribution of the OLS estimator \hat{\boldsymbol{\beta}} = [\hat{\beta}_1, \dots, \hat{\beta}_P]^{\top}. To do this, we need to make some assumptions. We can then use those assumptions to derive some basic properties of \hat{\boldsymbol{\beta}}.
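A concrete numerical sketch may help fix ideas before we get started. The snippet below simulates data from Equation 1 and computes the OLS estimate via the normal equations, \hat{\boldsymbol{\beta}} = (\mathbf{X}^{\top}\mathbf{X})^{-1}\mathbf{X}^{\top}\mathbf{y}. The sizes, coefficients, noise scale, and seed are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: N observations, P predictors plus an intercept column.
N, P = 200, 3
beta_true = np.array([1.0, 2.0, -0.5, 0.3])   # [beta_0, beta_1, ..., beta_P]

X = np.column_stack([np.ones(N), rng.normal(size=(N, P))])  # design matrix
eps = rng.normal(scale=1.0, size=N)                         # error terms
y = X @ beta_true + eps                                     # Equation 1 in matrix form

# OLS estimate from the normal equations: solve (X'X) beta_hat = X'y.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)   # close to beta_true, but not exact in any single sample
```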

I’ll start this post by working through the standard OLS assumptions. I’ll then show how these assumptions imply some established properties of the OLS estimator \hat{\boldsymbol{\beta}}. Finally, I’ll show that if we assume our error terms are normally distributed, we can pin down the distribution of \hat{\boldsymbol{\beta}} exactly.

Standard OLS assumptions

The standard assumptions of OLS are:

  1. Linearity
  2. Strict exogeneity
  3. No multicollinearity
  4. Spherical errors
  5. Normality (optional)

Assumptions 1 and 3 are not terribly interesting here. Assumption 1 is just Equation 1; it means that we have correctly specified our model. Assumption 3 is that our design matrix \mathbf{X} is full rank; this property is not relevant for this post, but I have another post on the topic for the curious.

Assumptions 2 and 4 are more interesting here. Assumption 2, strict exogeneity, is that the conditional expectation of each error term given the design matrix is zero:

\mathbb{E}[\varepsilon_n \mid \mathbf{X}] = 0, \quad n \in \{1, \dots, N\}. \tag{2}

An exogenous variable is a variable that is not determined by other variables or parameters in the model. Here is a nice example of why Equation 2 captures this intuition.

Assumption 4 can be broken into two assumptions. The first is homoskedasticity, meaning that our error terms have a constant variance \sigma^2:

\mathbb{V}[\varepsilon_n \mid \mathbf{X}] = \sigma^2, \quad n \in \{1, \dots, N\}. \tag{3}

The second is that our error terms are uncorrelated:

\mathbb{E}[\varepsilon_n \varepsilon_m \mid \mathbf{X}] = 0, \quad n,m \in \{1, \dots, N\}, \quad n \neq m. \tag{4}

Taken together, these two sub-assumptions are typically stated as just spherical errors, since we can formalize both at once as

\mathbb{V}[\boldsymbol{\varepsilon} \mid \mathbf{X}] = \sigma^2 \mathbf{I}_N. \tag{5}

Finally, assumption 5 is that our error terms are normally distributed. This assumption is not required for OLS theory, but some sort of distributional assumption about the noise is required for hypothesis testing in OLS. As we will see, the normality assumption will imply that the OLS estimator \hat{\boldsymbol{\beta}} is normally distributed.
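To make assumptions 4 and 5 concrete, here is a small sketch (arbitrary dimension, variance, and number of draws) that samples many error vectors satisfying spherical, normal errors and checks that their sample covariance is close to \sigma^2 \mathbf{I}_N, as in Equation 5.

```python
import numpy as np

rng = np.random.default_rng(1)

N, sigma2 = 5, 2.0        # hypothetical dimension and error variance
n_draws = 200_000

# Independent, homoskedastic, normal errors: one row per realization of eps.
eps = rng.normal(scale=np.sqrt(sigma2), size=(n_draws, N))

# Sample covariance across realizations should be close to sigma^2 * I_N.
print(np.cov(eps, rowvar=False).round(2))
```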

With these properties in mind, let’s prove some important facts about the OLS estimator \hat{\boldsymbol{\beta}}.

OLS estimator is unbiased

First, let’s prove that \hat{\boldsymbol{\beta}} is unbiased, i.e. that

\mathbb{E}[\hat{\boldsymbol{\beta}} \mid \mathbf{X}] = \boldsymbol{\beta}. \tag{6}

Equivalently, we just need to show that

\mathbb{E}[\hat{\boldsymbol{\beta}} - \boldsymbol{\beta} \mid \mathbf{X}] = \mathbf{0}. \tag{7}

The term in the expectation, \hat{\boldsymbol{\beta}} - \boldsymbol{\beta}, is sometimes called the sampling error, and we can write it in terms of the predictors and noise terms:

\begin{aligned} \hat{\boldsymbol{\beta}} - \boldsymbol{\beta} &\stackrel{\star}{=} (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^{\top} \mathbf{y} - \boldsymbol{\beta} \\ &\stackrel{\dagger}{=} (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^{\top} (\mathbf{X} \boldsymbol{\beta} + \boldsymbol{\varepsilon}) - \boldsymbol{\beta} \\ &= (\mathbf{X}^{\top} \mathbf{X})^{-1} (\mathbf{X}^{\top} \mathbf{X}) \boldsymbol{\beta} + (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^{\top} \boldsymbol{\varepsilon} - \boldsymbol{\beta} \\ &= \boldsymbol{\beta} + (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^{\top} \boldsymbol{\varepsilon} - \boldsymbol{\beta} \\ &= (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^{\top} \boldsymbol{\varepsilon}. \end{aligned} \tag{8}

Step \star is the normal equation, and step \dagger is the matrix form of our linearity assumption, \mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}. Since we are conditioning on \mathbf{X}, the matrix (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^{\top} is non-random given \mathbf{X}, so we can pull it out of the expectation, and we’re done:

\mathbb{E}[\hat{\boldsymbol{\beta}} - \boldsymbol{\beta} \mid \mathbf{X}] = (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^{\top} \mathbb{E}[\boldsymbol{\varepsilon} \mid \mathbf{X}] = \mathbf{0}. \tag{9}

As we can see, we require strict exogeneity to prove that \hat{\boldsymbol{\beta}} is unbiased.
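This is easy to check by simulation. In the sketch below (arbitrary sizes, coefficients, and noise scale), the design matrix is held fixed while the errors are re-drawn many times; averaging \hat{\boldsymbol{\beta}} over datasets gives something very close to \boldsymbol{\beta}, as Equation 6 predicts.

```python
import numpy as np

rng = np.random.default_rng(2)

N, P = 100, 3                        # hypothetical sizes
beta = np.array([1.0, -2.0, 0.5])
X = rng.normal(size=(N, P))          # fix the design matrix across simulations
XtX_inv_Xt = np.linalg.solve(X.T @ X, X.T)   # (X'X)^{-1} X'

# Average beta_hat over many datasets that share X but have fresh errors.
n_sims = 20_000
beta_hats = np.empty((n_sims, P))
for s in range(n_sims):
    eps = rng.normal(scale=1.5, size=N)
    y = X @ beta + eps
    beta_hats[s] = XtX_inv_Xt @ y

print(beta_hats.mean(axis=0))        # should be close to beta (Equation 6)
```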

Variance of the OLS estimator

This proof is from (Hayashi, 2000). The variance of the OLS estimator is

\begin{aligned} \mathbb{V}[\hat{\boldsymbol{\beta}} \mid \mathbf{X}] &\stackrel{\star}{=} \mathbb{V}[\hat{\boldsymbol{\beta}} - \boldsymbol{\beta} \mid \mathbf{X}] \\ &\stackrel{\dagger}{=} \mathbb{V}[\overbrace{(\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^{\top}}^{\mathbf{A}} \boldsymbol{\varepsilon} \mid \mathbf{X}] \\ &\stackrel{\ddagger}{=} \mathbf{A} \mathbb{V}[\boldsymbol{\varepsilon} \mid \mathbf{X}] \mathbf{A}^{\top} \\ &\stackrel{*}{=} \mathbf{A} (\sigma^2 \mathbf{I}_N) \mathbf{A}^{\top} \\ &= \sigma^2 \mathbf{A}\mathbf{A}^{\top} \\ &= \sigma^2 (\mathbf{X}^{\top} \mathbf{X})^{-1}. \end{aligned} \tag{10}

Step \star is because the true value \boldsymbol{\beta} is non-random; step \dagger is just applying Equation 8 from above; step \ddagger is because \mathbf{A} is non-random given \mathbf{X}; and step * is assumption 4, spherical errors.

As we can see, the basic idea of the proof is to write \hat{\boldsymbol{\beta}} in terms of the random variables \boldsymbol{\varepsilon}, since this is the quantity with constant variance \sigma^2.
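The same simulation idea can be used to check Equation 10: the empirical covariance of \hat{\boldsymbol{\beta}} across repeated error draws (with \mathbf{X} fixed) should match \sigma^2 (\mathbf{X}^{\top}\mathbf{X})^{-1}. The numbers below are again arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)

N, P, sigma = 50, 2, 1.5             # hypothetical sizes and noise scale
beta = np.array([2.0, -1.0])
X = rng.normal(size=(N, P))
A = np.linalg.solve(X.T @ X, X.T)    # A = (X'X)^{-1} X'

# Empirical covariance of beta_hat over many simulated datasets with X fixed.
n_sims = 50_000
beta_hats = np.array([A @ (X @ beta + rng.normal(scale=sigma, size=N))
                      for _ in range(n_sims)])

print(np.cov(beta_hats, rowvar=False))    # empirical covariance of beta_hat
print(sigma**2 * np.linalg.inv(X.T @ X))  # theoretical sigma^2 (X'X)^{-1}
```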

Unbiased variance estimator

This section is not strictly necessary for understanding the sampling distribution of \hat{\boldsymbol{\beta}}, but it’s a useful property of the finite-sample distribution, e.g. it shows up when computing t-statistics for OLS. This proof is also from (Hayashi, 2000), but I’ve organized and expanded it to be more explicit.

An unbiased estimator of the variance \sigma^2 is s^2, defined as

s^2 = \frac{\mathbf{e}^{\top} \mathbf{e}}{N - P}, \tag{11}

where \mathbf{e} is the vector of residuals, i.e. e_n \triangleq y_n - \hat{\boldsymbol{\beta}}^{\top} \mathbf{x}_n. To prove that s^2 is unbiased, it suffices to show that

\begin{aligned} \mathbb{E}[s^2 \mid \mathbf{X}] &= \sigma^2 \\ &\Updownarrow \\ \mathbb{E}\left[ \mathbf{e}^{\top} \mathbf{e} \mid \mathbf{X} \right] &= (N-P) \, \sigma^2. \end{aligned} \tag{12}

We will prove this in three steps. First, we will show

\mathbf{e}^{\top} \mathbf{e} = \boldsymbol{\varepsilon}^{\top} \mathbf{M} \boldsymbol{\varepsilon}, \tag{13}

where \mathbf{M} is the residual maker,

\mathbf{M} = \mathbf{I}_N - \mathbf{H} = \mathbf{I}_N - \mathbf{X} (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^{\top}, \tag{14}

which I discussed in my first post on OLS. Second, we will show

\mathbb{E}[\boldsymbol{\varepsilon}^{\top} \mathbf{M} \boldsymbol{\varepsilon} \mid \mathbf{X}] = \text{trace}(\mathbf{M}) \, \sigma^2. \tag{15}

Third and finally, we will show

\text{trace}(\mathbf{M}) = N - P. \tag{16}

If each step is true, then the proof is complete.

Step 1. \mathbf{e}^{\top} \mathbf{e} = \boldsymbol{\varepsilon}^{\top} \mathbf{M} \boldsymbol{\varepsilon}

This subsection relies on facts about the residual maker \mathbf{M}, which I discussed in my first post on OLS. The proof is

\begin{aligned} \boldsymbol{\varepsilon}^{\top} \mathbf{M} \boldsymbol{\varepsilon} &= (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^{\top} \mathbf{M} (\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) \\ &\stackrel{\star}{=} \mathbf{y}^{\top} \mathbf{M} \mathbf{y} + \cancel{\boldsymbol{\beta}^{\top} \mathbf{X}^{\top} \mathbf{M} \mathbf{X} \boldsymbol{\beta}} - \cancel{\mathbf{y}^{\top}\mathbf{M}\mathbf{X}\boldsymbol{\beta}} - \cancel{\boldsymbol{\beta}^{\top} \mathbf{X}^{\top} \mathbf{M} \mathbf{y}} \\ &= \mathbf{y}^{\top} \mathbf{M} \mathbf{y} \\ &\stackrel{\dagger}{=} \mathbf{y}^{\top} \mathbf{M} \mathbf{M} \mathbf{y} \\ &\stackrel{\ddagger}{=} \mathbf{e}^{\top} \mathbf{e}. \end{aligned} \tag{17}

The middle and right cancellations in step \star hold since \mathbf{X}^{\top} \mathbf{e} = \mathbf{0} by the normal equation. That implies

\begin{aligned} \mathbf{y}^{\top} \mathbf{M} \mathbf{X} \boldsymbol{\beta} &= \boldsymbol{\beta}^{\top} \mathbf{X}^{\top} \mathbf{M} \mathbf{y} \\ &= \boldsymbol{\beta}^{\top} \mathbf{X}^{\top} \mathbf{e} \\ &= 0. \end{aligned} \tag{18}

The same property holds for \boldsymbol{\beta}^{\top} \mathbf{X}^{\top} \mathbf{M} \mathbf{y}, since it is just the transpose of this scalar. The left cancellation is because \mathbf{M}\mathbf{X} = \mathbf{0}:

\begin{aligned} \mathbf{M}\mathbf{X} &= (\mathbf{I}_N - \mathbf{H}) \mathbf{X} \\ &= \mathbf{X} - \mathbf{H} \mathbf{X} \\ &= \mathbf{X} - \mathbf{X} (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^{\top} \mathbf{X} \\ &= \mathbf{0}. \end{aligned} \tag{19}

Step \dagger holds since \mathbf{M} is an orthogonal projection matrix and thus idempotent. And step \ddagger holds because \mathbf{M} is the residual maker, i.e. \mathbf{M} \mathbf{y} = \mathbf{e}.
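These identities are easy to verify numerically. The sketch below (arbitrary sizes and seed) builds \mathbf{H} and \mathbf{M} explicitly and checks that \mathbf{M}\mathbf{X} = \mathbf{0}, that \mathbf{M}\mathbf{y} = \mathbf{e}, and that \mathbf{e}^{\top}\mathbf{e} = \boldsymbol{\varepsilon}^{\top}\mathbf{M}\boldsymbol{\varepsilon}.

```python
import numpy as np

rng = np.random.default_rng(4)

N, P, sigma = 30, 4, 0.7                     # hypothetical sizes and noise scale
beta = rng.normal(size=P)
X = rng.normal(size=(N, P))
eps = rng.normal(scale=sigma, size=N)
y = X @ beta + eps

H = X @ np.linalg.solve(X.T @ X, X.T)        # hat matrix H = X (X'X)^{-1} X'
M = np.eye(N) - H                            # residual maker M = I_N - H

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ beta_hat                         # residuals

print(np.allclose(M @ X, 0.0))               # MX = 0 (Equation 19)
print(np.allclose(M @ y, e))                 # My = e
print(np.allclose(e @ e, eps @ M @ eps))     # e'e = eps' M eps (Equation 13)
```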

Step 2. \mathbb{E}[\boldsymbol{\varepsilon}^{\top} \mathbf{M} \boldsymbol{\varepsilon} \mid \mathbf{X}] = \text{trace}(\mathbf{M}) \, \sigma^2

Let’s write out the quadratic form \boldsymbol{\varepsilon}^{\top} \mathbf{M} \boldsymbol{\varepsilon} explicitly:

\begin{aligned} \mathbb{E}[\boldsymbol{\varepsilon}^{\top} \mathbf{M} \boldsymbol{\varepsilon} \mid \mathbf{X}] &= \mathbb{E}\left[ \begin{bmatrix} \varepsilon_1 & \dots & \varepsilon_N \end{bmatrix} \begin{bmatrix} M_{11} & \dots & M_{1N} \\ \vdots & \ddots & \vdots \\ M_{N1} & \dots & M_{NN} \end{bmatrix} \begin{bmatrix} \varepsilon_1 \\ \vdots \\ \varepsilon_N \end{bmatrix} \,\middle|\, \mathbf{X} \right] \\ &= \mathbb{E}\left[ \begin{bmatrix} \varepsilon_1 & \dots & \varepsilon_N \end{bmatrix} \begin{bmatrix} \sum_{i=1}^N M_{1i} \varepsilon_i \\ \vdots \\ \sum_{i=1}^N M_{Ni} \varepsilon_i \end{bmatrix} \,\middle|\, \mathbf{X} \right] \\ &= \mathbb{E}\left[ \sum_{j=1}^N \sum_{i=1}^N M_{ji} \varepsilon_j \varepsilon_i \,\middle|\, \mathbf{X} \right]. \end{aligned} \tag{20}

Now notice that \mathbf{M} is just a function of \mathbf{X}, which we’re conditioning on. So we can move the M_{ji} terms out of the expectation to get

\mathbb{E}[\boldsymbol{\varepsilon}^{\top} \mathbf{M} \boldsymbol{\varepsilon} \mid \mathbf{X}] = \sum_{j=1}^N \sum_{i=1}^N M_{ji} \, \mathbb{E} \left[ \varepsilon_j \varepsilon_i \,\middle|\, \mathbf{X} \right]. \tag{21}

Finally, observe that

\mathbb{E} \left[ \varepsilon_j \varepsilon_i \,\middle|\, \mathbf{X} \right] = \begin{cases} \sigma^2 & \text{if $i = j$,} \\ 0 & \text{otherwise.} \end{cases} \tag{22}

This is assumption 4, spherical errors. Thus, we can write Equation 21 as

\mathbb{E}[\boldsymbol{\varepsilon}^{\top} \mathbf{M} \boldsymbol{\varepsilon} \mid \mathbf{X}] = \sum_{i=1}^N M_{ii} \, \sigma^2 = \text{trace}(\mathbf{M}) \, \sigma^2. \tag{23}
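Equation 23 only relies on spherical errors, but we can illustrate it with normal draws. In the sketch below (arbitrary values), averaging the quadratic form \boldsymbol{\varepsilon}^{\top}\mathbf{M}\boldsymbol{\varepsilon} over many error draws gives roughly \text{trace}(\mathbf{M})\,\sigma^2.

```python
import numpy as np

rng = np.random.default_rng(5)

N, P, sigma2 = 25, 3, 2.0                           # hypothetical values
X = rng.normal(size=(N, P))
M = np.eye(N) - X @ np.linalg.solve(X.T @ X, X.T)   # residual maker

# Average the quadratic form eps' M eps over many error draws.
n_sims = 100_000
eps = rng.normal(scale=np.sqrt(sigma2), size=(n_sims, N))
quad = np.sum((eps @ M) * eps, axis=1)              # eps_s' M eps_s for each draw s

print(quad.mean())                                  # close to trace(M) * sigma^2
print(np.trace(M) * sigma2)                         # equals (N - P) * sigma^2, as shown next
```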

Step 3. \text{trace}(\mathbf{M}) = N - P

This proof uses basic properties of the trace operator:

\begin{aligned} \text{trace}(\mathbf{M}) &= \text{trace}(\mathbf{I}_N - \mathbf{H}) \\ &= \text{trace}(\mathbf{I}_N) - \text{trace}(\mathbf{H}) \\ &= N - \text{trace}\left( \mathbf{X} (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^{\top} \right) \\ &= N - \text{trace}\left( \mathbf{X}^{\top} \mathbf{X} (\mathbf{X}^{\top} \mathbf{X})^{-1} \right) \\ &= N - \text{trace}\left( \mathbf{I}_P \right) \\ &= N - P. \end{aligned} \tag{24}

And we’re done.
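Putting the three steps together, here is a sketch (arbitrary values) that checks \text{trace}(\mathbf{M}) = N - P directly and confirms by simulation that s^2 = \mathbf{e}^{\top}\mathbf{e} / (N - P) averages out to \sigma^2.

```python
import numpy as np

rng = np.random.default_rng(6)

N, P, sigma2 = 40, 5, 3.0                    # hypothetical values
beta = rng.normal(size=P)
X = rng.normal(size=(N, P))
M = np.eye(N) - X @ np.linalg.solve(X.T @ X, X.T)   # residual maker

print(np.trace(M), N - P)                    # trace(M) = N - P (Equation 16)

# Monte Carlo check that s^2 = e'e / (N - P) is unbiased for sigma^2.
n_sims = 50_000
eps = rng.normal(scale=np.sqrt(sigma2), size=(n_sims, N))
Y = X @ beta + eps                           # each row is one simulated dataset
E = Y @ M                                    # e = M y for each dataset (M is symmetric)
s2 = np.sum(E * E, axis=1) / (N - P)

print(s2.mean(), sigma2)                     # should be close (Equation 12)
```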

Sampling distribution of \hat{\boldsymbol{\beta}}

If we make assumption 5, that the error terms are normally distributed, then \hat{\boldsymbol{\beta}} is also normally distributed. To see this, note that assumptions 2 and 4 already specify the mean and variance of \boldsymbol{\varepsilon}. If we assume normality in the errors, then clearly

\boldsymbol{\varepsilon} \mid \mathbf{X} \sim \mathcal{N}(\mathbf{0}, \sigma^2 \mathbf{I}_N), \tag{25}

since the normal distribution is fully specified by its mean and variance. Since the random variable \boldsymbol{\varepsilon} does not depend on \mathbf{X}, clearly the marginal distribution is also normal,

\boldsymbol{\varepsilon} \sim \mathcal{N}(\mathbf{0}, \sigma^2 \mathbf{I}_N). \tag{26}

Finally, note that Equation 8 means we can write the sampling error in terms of the error terms:

\hat{\boldsymbol{\beta}} - \boldsymbol{\beta} = (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^{\top} \boldsymbol{\varepsilon}. \tag{27}

Given \mathbf{X}, the matrix (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^{\top} is fixed, so \boldsymbol{\varepsilon} \mapsto (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^{\top} \boldsymbol{\varepsilon} is a linear transformation. A linear transformation of a normal random variable \boldsymbol{\varepsilon} is still normally distributed, meaning that \hat{\boldsymbol{\beta}} - \boldsymbol{\beta} is normally distributed. We know the mean of \hat{\boldsymbol{\beta}} - \boldsymbol{\beta} from Equation 9, and we know the variance from Equation 10. Therefore we have:

\hat{\boldsymbol{\beta}} - \boldsymbol{\beta} \sim \mathcal{N}(\mathbf{0}, \sigma^2 (\mathbf{X}^{\top} \mathbf{X})^{-1}). \tag{28}

Using basic properties of the normal distribution, we can immediately derive the distribution of the OLS estimator:

\hat{\boldsymbol{\beta}} \sim \mathcal{N}(\boldsymbol{\beta}, \sigma^2 (\mathbf{X}^{\top} \mathbf{X})^{-1}). \tag{29}

In summary, we have derived a standard result for the OLS estimator when assuming normally distributed errors.
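As a final sanity check (again with arbitrary values), we can standardize one coordinate of \hat{\boldsymbol{\beta}} across many simulated datasets using the theoretical variance from Equation 29; the standardized values should behave like draws from a standard normal.

```python
import numpy as np

rng = np.random.default_rng(7)

N, P, sigma = 60, 3, 1.0                     # hypothetical values
beta = np.array([0.5, -1.0, 2.0])
X = rng.normal(size=(N, P))
A = np.linalg.solve(X.T @ X, X.T)            # A = (X'X)^{-1} X'
V = sigma**2 * np.linalg.inv(X.T @ X)        # theoretical covariance of beta_hat

# Simulate many datasets with X fixed, then standardize one coordinate.
p = 0
n_sims = 100_000
beta_hats = np.array([A @ (X @ beta + rng.normal(scale=sigma, size=N))
                      for _ in range(n_sims)])
z = (beta_hats[:, p] - beta[p]) / np.sqrt(V[p, p])

print(z.mean(), z.std())                     # close to 0 and 1
print(np.mean(np.abs(z) > 1.96))             # close to 0.05 under normality
```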

Conclusion

OLS makes a few important assumptions (assumptions 1-4), which mathematically imply some basic properties of the OLS estimator \hat{\boldsymbol{\beta}}. For example, the unbiasedness of \hat{\boldsymbol{\beta}} is due to strict exogeneity, assumption 2. However, without assuming a distribution on the noise (assumption 5), we cannot pin down a sampling distribution for \hat{\boldsymbol{\beta}}. If we assume normally distributed errors, then \hat{\boldsymbol{\beta}} is itself normally distributed. Knowing this distribution is useful in analyzing the results of linear models, such as when performing hypothesis testing for a given estimated parameter \hat{\beta}_p.

   

Acknowledgements

I thank Mattia Mariantoni for pointing out a typo in Equation 20.

  1. Hayashi, F. (2000). Econometrics. Princeton University Press. Section 1, pp. 60–69.