Hypothesis Testing for OLS

When can we be confident in our estimated coefficients when using OLS? We typically use a $t$-statistic to quantify whether an inferred coefficient was likely to have happened by chance. I discuss hypothesis testing and $t$-statistics for OLS.

Imagine we fit ordinary least squares (OLS),

y_n = \beta_0 + \beta_1 x_{n,1} + \dots + \beta_{P} x_{n,P} + \varepsilon_n, \tag{1}

and find that the $p$-th estimated coefficient $\hat{\beta}_p$ is some value, say $\hat{\beta}_p = 1.2$. How confident are we in this result? For example, imagine that the true parameter is actually zero, meaning that there is no relationship between our independent variables $\mathbf{x}_n$ and our dependent variable $y_n$. Then clearly the estimate $\hat{\beta}_p = 1.2$ is wrong and simply happened by chance, due to some properties of our finite sample $\mathbf{X} = \{\mathbf{x}_1, \dots, \mathbf{x}_N\}$. In hypothesis testing, we address this problem by answering the question, “What is the probability that $\hat{\beta}_p = 1.2$ just happened by chance?” Answering this question allows us to calibrate our confidence in our model’s inferences.

In this post, I discuss hypothesis testing for OLS in detail. The three big concepts at play in this discussion are $p$-values, $z$-scores, and $t$-statistics. While these ideas are not specific to linear models, I will ground the discussion in OLS. This is because the derivations for $z$-scores and $t$-statistics require assuming a distribution on $\hat{\beta}_p$, resulting in many OLS-specific calculations.

Null hypothesis, $p$-values, and test statistics

In statistical hypothesis testing, the null hypothesis is that $\beta_p = b_p$, i.e. that the true value of the $p$-th coefficient is some hypothesized value $b_p$. The alternative hypothesis is $\beta_p \neq b_p$. These options are typically denoted as

\begin{aligned} \textsf{H}_0 &: \beta_p = b_p, && \text{(null hypothesis)} \\ \textsf{H}_1 &: \beta_p \neq b_p. && \text{(alternative hypothesis)} \end{aligned} \tag{2}

The $p$-value for this test is the probability that we observe a result at least as extreme as $\hat{\beta}_p$ under the null hypothesis $\textsf{H}_0$. Imagine we have some way to calculate such a $p$-value. If it is small, then this means it is unlikely that our estimate $\hat{\beta}_p$ occurred by chance.

Of course, even if this $p$-value is small, it is still possible that our estimated $\hat{\beta}_p$ did in fact occur by chance. We can specify our tolerance for this kind of error using a significance level $\alpha \in (0, 1)$. For example, $\alpha = 0.05$ means our result is significant if the $p$-value is less than $5\%$. In other words, under the null hypothesis, we would see a result as extreme as the one we observed less than $5\%$ of the time.

To compute $p$-values—which are ultimately just probabilities—with respect to $\hat{\beta}_p$, we need to assume a distribution on $\hat{\beta}_p$. If we assume normally distributed error terms $\varepsilon_n$, then we can show that the sampling error $\hat{\beta}_p - b_p$ is normally distributed:

\hat{\beta}_p - b_p \sim \mathcal{N} \left( 0, \sigma^2 \left[ (\mathbf{X}^{\top} \mathbf{X})^{-1} \right]_{pp} \right). \tag{3}

The notation $[(\mathbf{X}^{\top} \mathbf{X})^{-1}]_{pp}$ denotes the element in the $p$-th row and $p$-th column of the inverted Gram matrix. See my previous post on finite sample properties of the OLS estimator for a detailed derivation of Equation $3$.
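As a sanity check, here is a minimal simulation sketch of Equation $3$ in Python (assuming numpy is available; the constants, coefficients, and choice of $p$ are arbitrary). It compares the theoretical standard deviation of $\hat{\beta}_p - \beta_p$ to the empirical one across many simulated datasets.

import numpy as np

# Monte Carlo sketch of Equation 3: with normal errors, beta_hat_p - beta_p
# should be normal with variance sigma^2 [(X'X)^{-1}]_{pp}.
rng = np.random.default_rng(0)
N, P, sigma = 100, 3, 2.0                 # arbitrary sample size, regressors, noise sd
beta = np.array([1.0, -0.5, 1.2])         # arbitrary true coefficients
X = rng.normal(size=(N, P))               # fixed design, drawn once

p = 2
theory_sd = sigma * np.sqrt(np.linalg.inv(X.T @ X)[p, p])

errors = []
for _ in range(5000):
    y = X @ beta + rng.normal(scale=sigma, size=N)
    beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
    errors.append(beta_hat[p] - beta[p])

print(theory_sd, np.std(errors))          # the two numbers should be close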

Equation $3$ immediately implies that $\hat{\beta}_p$ is normally distributed, and we’ll use this fact in the next section to construct test statistics. A test statistic is the output of a function of our finite sample $\mathbf{X}$. This is useful because a test statistic typically has a well-defined distribution. We can then use this distribution to compute the appropriate $p$-value to see how likely it is that our observed value happened by chance. For example, if we have some test statistic $S(\mathbf{X})$ which follows a well-defined distribution, and we observe $S(\mathbf{X}) = s$, we can back out the $p$-value, since we can compute the probability of a value at least as extreme as $s$ under the null hypothesis.

In the next two sections, we’ll construct two common test statistics, and see how they apply in OLS.

Standard score ($z$-score)

In statistics, a standard score or $z$-score is any quantity

z \triangleq \frac{x - \mu}{\sigma}, \tag{4}

where $x$ is the value to be standardized, sometimes called the raw score; $\mu$ is its mean; and $\sigma$ is its standard deviation. The difference $x - \mu$ is in the same units as the standard deviation, e.g. if $x$ is in meters, then $x - \mu$ is in meters. $z$ is positive when the raw score is above the mean and negative when it is below the mean. Since both the numerator and denominator have the same units, $z$ is dimensionless. For example, $z = 2$ means that $x - \mu = 2\sigma$, a difference of two standard deviations.

In OLS, if we know the variance $\sigma^2$, we can compute a standard score for the $p$-th coefficient by normalizing the sampling error $\hat{\beta}_p - b_p$:

z_p \triangleq \frac{\hat{\beta}_p - b_p}{\sqrt{\sigma^2 \left[ (\mathbf{X}^{\top} \mathbf{X})^{-1} \right]_{pp}}}. \tag{5}

Since we know that $\hat{\beta}_p - b_p$ is normally distributed with variance $\sigma^2 [(\mathbf{X}^{\top} \mathbf{X})^{-1}]_{pp}$, clearly

z_p \sim \mathcal{N}(0, 1). \tag{6}

Thus, for a particular true value $b_p$, estimated parameter $\hat{\beta}_p$, and known $\sigma^2$, we can compute $z_p$. If $z_p$ is large relative to its mean of zero, it suggests the sampling error $\hat{\beta}_p - b_p$ is large. In other words, a larger $z_p$ is more surprising. We can see how this relates to hypothesis testing: a surprising $z_p$ is deemed significant if the probability of it taking or exceeding its observed value by chance is less than $\alpha$. We can easily compute this probability using the cumulative distribution function (CDF) of the normal distribution.
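As a small illustration (a sketch, not part of the original post’s code), here is how one might compute $z_p$ and its two-sided $p$-value in Python when $\sigma^2$ is known; the function name and arguments are my own.

import numpy as np
from scipy.stats import norm

def z_score_and_pvalue(X, beta_hat_p, b_p, sigma, p):
    # Equation 5: standardize the sampling error using the known sigma^2.
    var_p = sigma**2 * np.linalg.inv(X.T @ X)[p, p]
    z_p = (beta_hat_p - b_p) / np.sqrt(var_p)
    # Two-sided p-value: probability of a standard normal at least this extreme.
    p_value = 2 * norm.sf(abs(z_p))
    return z_p, p_value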

$t$-statistic

What if we don’t know the variance $\sigma^2$? We can instead use the OLS estimator of $\sigma^2$, denoted $s^2$:

s^2 \triangleq \frac{\mathbf{e}^{\top} \mathbf{e}}{N-P}. \tag{7}

Here, $\mathbf{e}$ is an $N$-vector of residuals. See my previous post on finite sample properties of the OLS estimator for a proof that $s^2$ is unbiased, i.e. that $\mathbb{E}[s^2] = \sigma^2$. Thus, we can replace the standard deviation in Equation $5$ with the standard error,

\text{se}(\hat{\beta}_p) \triangleq \sqrt{s^2 \left[ (\mathbf{X}^{\top} \mathbf{X})^{-1} \right]_{pp}}. \tag{8}

A $t$-statistic, often called a “$t$-stat”, is like a standard score, but it replaces the (unknown) population standard deviation $\sigma$ with the standard error:

t_p \triangleq \frac{\hat{\beta}_p - b_p}{\text{se}(\hat{\beta}_p)}. \tag{9}

Similar to the standard score, the $t$-statistic’s numerator is in units of the standard error. So $t_p = 2$ means that $\hat{\beta}_p - b_p = 2 \cdot \text{se}(\hat{\beta}_p)$.

When computing a $z$-score in Equation $5$, we divided a normally distributed random variable $\hat{\beta}_p - b_p$ by the non-random quantities $\sigma^2$ and $\mathbf{X}$. Thus, $z_p$ was normally distributed. However, when computing a $t$-statistic in Equation $9$, we divide the normally distributed random variable $\hat{\beta}_p - b_p$ by the standard error, which is itself random (it depends on the residuals through $s^2$), and therefore $t_p$ is not normally distributed. However, one can prove that $t_p$ is $t$-distributed—hence its name—with $N-P$ degrees of freedom:

t_p \sim t(N - P). \tag{10}

Here, $t(k)$ denotes the $t$-distribution with $k$ degrees of freedom. See A1 for a proof of this claim. The main point is that, again, we have a well-behaved distribution for our test statistic, and we can therefore compute $p$-values. Note that $N$ and $P$ are both predetermined by our data, and we can compute everything we need in Equation $9$.

Note that $t$-stats can be negative; this simply means that $b_p > \hat{\beta}_p$.
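To make Equations $7$–$9$ concrete, here is a minimal sketch of how one might compute $t$-statistics under the common null $b_p = 0$ (the function name is mine, and $\mathbf{X}$ is assumed to already include a column of ones if an intercept is desired).

import numpy as np

def ols_t_stats(X, y):
    N, P = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y          # OLS estimates
    e = y - X @ beta_hat                  # residuals
    s2 = (e @ e) / (N - P)                # Equation 7
    se = np.sqrt(s2 * np.diag(XtX_inv))   # Equation 8
    return beta_hat / se                  # Equation 9 with b_p = 0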

Testing the null hypothesis

Now that we know the distribution of $t$-statistics, we have a method for deciding if an estimated coefficient $\hat{\beta}_p$ is statistically significant. First, given a hypothesized true value $b_p$, compute the $t$-statistic $t_p$. Then, for a given significance level $\alpha$, for example $\alpha = 0.05$, we can compute the critical values, which are the boundaries of the region in which we would accept the null hypothesis. Accept the null hypothesis if $t_p$ is less extreme than the critical values, i.e. if it is in the acceptance region. Reject the null hypothesis if $t_p$ is more extreme than the critical values (Figure 1).

Figure 1. $t$-distribution with critical values defined by the significance level $\alpha$ and the inverse CDF $F^{-1}(\cdot)$.

For example, imagine that $b_p = 0$, $t_p = 1$, $\alpha = 0.05$, and $N - P = 30$. We want to compute the critical values, $x_1$ and $x_2$. We can do this using the CDF $F$ and the inverse CDF $F^{-1}$:

\begin{aligned} \alpha/2 &= F(x_1) &\implies x_1 &= F^{-1}(0.05/2), \\ \alpha/2 &= 1 - F(x_2) &\implies x_2 &= F^{-1}(1 - 0.05/2). \end{aligned} \tag{11}

We can then compute $x_1$ and $x_2$ to see if $t_p = 1$ is outside of the acceptance region. Historically, before easy access to statistical software, one would look up the critical values in a $t$-table. However, today it is easy to quickly compute the critical values, e.g. in Python:

>>> from scipy.stats import t
>>> tdist = t(df=30)
>>> alpha = 0.05
>>> x1 = tdist.ppf(alpha/2)
>>> x2 = tdist.ppf(1 - alpha/2)
>>> (x1, x2)
(-2.042272456301238, 2.0422724563012373)

We can see that $t_p = 1$ is not significant at level $\alpha = 0.05$ with $N - P = 30$.
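Equivalently, we could compute the two-sided $p$-value directly and compare it to $\alpha$. A small continuation of the session above (with tdist and alpha as already defined):

>>> t_p = 1.0
>>> p_value = 2 * tdist.sf(abs(t_p))   # P(|T| >= 1) under the null
>>> p_value > alpha                    # roughly 0.33, well above 0.05
True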

$z$-scores vs. $t$-statistics

At this point, we have enough of an understanding of $z$-scores and $t$-statistics to say why we typically use $t$-statistics in hypothesis testing for OLS. First, we typically don’t know the population standard deviation $\sigma$. Second, if $N - P$ is sufficiently large, then the $t$-distribution is approximately normal (Figure 2). Given that both these conditions are often true, it makes sense to just use $t$-statistics by default.

Figure 2. $t$-distribution with varying values of the degrees-of-freedom parameter (blue lines) versus the standard normal distribution (red line).

Note that when $N - P$ is reasonably large, where “reasonably large” is traditionally taken to be around $30$, the $t$-distribution is well-approximated by the standard normal distribution, $\mathcal{N}(0, 1)$. For the standard normal, roughly $95\%$ of the mass lies within two standard deviations of the mean (Figure 3). Thus, as a first approximation without using $t$-tables or statistical software, we can say that a $t$-statistic greater than $2$ in absolute value is significant when $N - P > 30$.
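We can check this rule of thumb numerically, in the same style as above: the $t$ critical values approach the normal critical value of roughly $1.96$ as the degrees of freedom grow.

>>> from scipy.stats import norm, t
>>> round(norm.ppf(0.975), 2)          # standard normal critical value
1.96
>>> [round(t(df=k).ppf(0.975), 3) for k in (5, 10, 30, 100)]
[2.571, 2.228, 2.042, 1.984]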

Figure 3. The normal distribution with mean $\mu$ and standard deviation $\sigma$. Roughly $95\%$ of the mass is within two standard deviations of the mean.

Statistical significance without knowing $\hat{\beta}$

What if we know that the correlation between our regression targets and a single predictor is $\rho_{xy} = 0.1$? Without knowing $\hat{\beta}$, can we compute the minimum number of samples $N$ we need to have a statistically significant inference? In fact, we can. This is a fun result that’s worth knowing about.

In a previous post, I re-derived a standard result that shows that the residual sum of squares (RSS) can be written in terms of the un-normalized sample variance and Pearson’s correlation:

\textsf{RSS} = S_y^2 (1 - \rho_{xy}^2). \tag{12}

And the OLS estimator of the variance, $s^2$, can be written in terms of the RSS,

s^2 = \frac{\mathbf{e}^{\top} \mathbf{e}}{N-P} = \frac{\textsf{RSS}}{N-P} = \frac{S_y^2 (1 - \rho_{xy}^2)}{N-P}. \tag{13}

Furthermore, we know that $\hat{\beta}$ can be written in terms of the correlation and the un-normalized sample standard deviations,

\hat{\beta} = \rho_{xy} \left( \frac{S_y}{S_x} \right). \tag{14}

See my previous post on simple linear regression and correlation for a derivation of this fact. Putting this together with Equation $8$ and assuming $b_p = 0$, we can re-write the $t$-statistic as

\begin{aligned} t &= \frac{\hat{\beta}}{\sqrt{s^2 / S_x^2}} \\ &= \frac{\rho_{xy} \left( S_y / S_x \right)}{\frac{S_y}{S_x} \sqrt{\frac{1 - \rho_{xy}^2}{N-P}}} \\ &= \frac{\rho_{xy} \sqrt{N-P}}{\sqrt{1 - \rho_{xy}^2}}. \end{aligned} \tag{15}

In simple linear regression, $P = 1$. And we know $\rho_{xy} = 0.1$, so we have

t = \frac{0.1 \sqrt{N-1}}{\sqrt{1 - 0.01}}. \tag{16}

Setting this expression equal to the critical value for significance level $\alpha = 0.05$ (roughly $2$ when $N$ is large) and solving for $N$, we find that we need $N \approx 400$ samples to have a statistically significant result. This section is a bit tangential to the main ideas of this post, but it is a fun result.
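Here is a small sketch of this calculation in Python (my own code, using the exact $t$ critical value rather than the rule-of-thumb value of $2$, which is why it lands a bit under $400$):

>>> import numpy as np
>>> from scipy.stats import t
>>> rho, alpha, P = 0.1, 0.05, 1
>>> def significant(N):
...     t_stat = rho * np.sqrt(N - P) / np.sqrt(1 - rho**2)
...     return t_stat > t(df=N - P).ppf(1 - alpha / 2)
...
>>> N_min = next(N for N in range(P + 2, 10_000) if significant(N))
>>> # N_min is a bit under 400, consistent with the estimate above.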

Conclusion

In OLS, assuming normal error terms implies that our estimated coefficients are normally distributed. This allows us to construct standard scores and $t$-statistics with well-defined distributions. We can use these test statistics to back out $p$-values, which quantify the probability that we observe a result at least as extreme as $\hat{\beta}_p$ under the null hypothesis.

   

Acknowledgements

I thank Mattia Mariantoni for pointing out a typo in Equation $15$.

   

Appendix

A1. Proof that $t$-statistics are $t$-distributed with $N-P$ degrees of freedom

This proof is from (Hayashi, 2000). We can write the $t$-statistic for the $p$-th predictor in Equation $9$ in terms of its $z$-score:

\begin{aligned} t_p &= \frac{\hat{\beta}_p - b_p}{\sqrt{\sigma^2 [(\mathbf{X}^{\top} \mathbf{X})^{-1}]_{pp}}} \cdot \sqrt{\frac{\sigma^2}{s^2}} \\ &= z_p \cdot \sqrt{\frac{\sigma^2}{s^2}} \\ &\stackrel{\star}{=} \frac{z_p}{\sqrt{\frac{\mathbf{e}^{\top} \mathbf{e} / (N-P)}{\sigma^2}}} \\ &\stackrel{\dagger}{=} \frac{z_p}{\sqrt{\frac{q}{N-P}}}. \end{aligned} \tag{A1.1}

Step $\star$ uses the definition of $s^2$ (Equation $7$). Step $\dagger$ introduces a new variable, $q \triangleq \mathbf{e}^{\top} \mathbf{e} / \sigma^2$.

Now the logic of the proof is as follows. First, we know that $z_p$ has distribution $\mathcal{N}(0, 1)$. We will then show that $q \mid \mathbf{X} \sim \chi^2(N - P)$, i.e. that $q$ conditioned on the predictors is chi-squared distributed with $N - P$ degrees of freedom. Next, we will prove that, conditional on $\mathbf{X}$, $z_p$ and $q$ are independent. This immediately implies that $t_p$ is $t$-distributed, since the $t$-distribution is precisely the distribution of a standard normal random variable divided by the square root of an independent chi-squared random variable that has itself been divided by its degrees of freedom.

Step 1. $q \mid \mathbf{X} \sim \chi^2(N - P)$

The proof that the quadratic form in Equation $\text{A1.4}$ below is chi-squared is from here. A chi-squared random variable with $K$ degrees of freedom is the sum of $K$ squared independent standard normal random variables. Formally, if $z_1, \dots, z_K$ are i.i.d. from $\mathcal{N}(0, 1)$, then $g$, where

g \triangleq \sum_{k=1}^K z_k^2, \tag{A1.2}

is chi-squared distributed. We want to show that

q = \frac{\mathbf{e}^{\top} \mathbf{e}}{\sigma^2} \tag{A1.3}

is chi-squared distributed. We saw in a previous post (see Equation $17$ here) that

\mathbf{e}^{\top} \mathbf{e} = \boldsymbol{\varepsilon}^{\top} \mathbf{M} \boldsymbol{\varepsilon}, \tag{A1.4}

where $\mathbf{M}$ is the residual maker matrix (see Equation $12$ here for a definition). Since $\mathbf{M} = \mathbf{I} - \mathbf{H}$ and since $\mathbf{H}$ is symmetric (see Equation $\text{A2.2}$ here), clearly $\mathbf{M}$ is symmetric. Thus, $\mathbf{M}$ is diagonalizable with an orthogonal matrix $\mathbf{P}$,

\mathbf{P}^{\top} \mathbf{M} \mathbf{P} = \boldsymbol{\Lambda} = \begin{bmatrix} \lambda_1 & & & \\ & \lambda_2 & & \\ & & \ddots & \\ & & & \lambda_N \end{bmatrix}. \tag{A1.5}

Since $\mathbf{M}$ is idempotent, its eigenvalues are either zero or one, and the number of non-zero eigenvalues is equal to the rank of $\mathbf{M}$. Now consider the distribution of

\mathbf{v} \triangleq \mathbf{P}^{\top} \boldsymbol{\varepsilon}. \tag{A1.6}

Since $\boldsymbol{\varepsilon}$ is normally distributed, and since $\mathbf{P}^{\top}$ is a linear map, we know that $\mathbf{v}$ is normally distributed. The normal distribution is fully specified by its mean and variance, which for $\mathbf{v}$ are

\begin{aligned} \mathbb{E}[\mathbf{v}] &= \mathbb{E}[\mathbf{P}^{\top} \boldsymbol{\varepsilon}] \\ &= \mathbf{P}^{\top} \mathbb{E}[\boldsymbol{\varepsilon}] \\ &= \mathbf{0}, \\\\ \mathbb{V}[\mathbf{v}] &= \mathbb{V}[\mathbf{P}^{\top} \boldsymbol{\varepsilon}] \\ &= \mathbf{P}^{\top} \mathbb{V}[\boldsymbol{\varepsilon}] \mathbf{P} \\ &= \mathbf{P}^{\top} \left( \sigma^2 \mathbf{I} \right) \mathbf{P} \\ &= \sigma^2 \mathbf{I}. \end{aligned} \tag{A1.7}

So we have shown that

\mathbf{v} \sim \mathcal{N}(\mathbf{0}, \sigma^2 \mathbf{I}). \tag{A1.8}

Now we can write $q$ in Equation $\text{A1.3}$ in terms of $\mathbf{v}$,

\begin{aligned} q &= \frac{\mathbf{e}^{\top} \mathbf{e}}{\sigma^2} \\ &= \frac{\boldsymbol{\varepsilon}^{\top} \mathbf{M} \boldsymbol{\varepsilon}}{\sigma^2} \\ &= \frac{\mathbf{v}^{\top} \mathbf{P}^{\top} \mathbf{M} \mathbf{P} \mathbf{v}}{\sigma^2} \\ &= \frac{1}{\sigma^2} \mathbf{v}^{\top} \boldsymbol{\Lambda} \mathbf{v}. \end{aligned} \tag{A1.9}

As we said above, since $\mathbf{M}$ is idempotent, its eigenvalues are either zero or one, and it has $K \triangleq \text{rank}(\mathbf{M})$ non-zero eigenvalues. So the last line of Equation $\text{A1.9}$ can be written as

q = \frac{1}{\sigma^2} \mathbf{v}^{\top} \boldsymbol{\Lambda} \mathbf{v} = \frac{1}{\sigma^2} \sum_{k=1}^K v_k^2 = \sum_{k=1}^K \left( \frac{v_k}{\sigma} \right)^2. \tag{A1.10}

Since each $v_k / \sigma \sim \mathcal{N}(0, 1)$ and the components of $\mathbf{v}$ are independent, $q$ is chi-squared distributed with $K$ degrees of freedom. We saw in a previous post that $K = N - P$ (see Equation $24$ here).
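As a quick numerical check of this step (a simulation sketch with arbitrary constants), the scaled residual sum of squares should have the mean and variance of a $\chi^2(N-P)$ random variable, namely $N-P$ and $2(N-P)$.

import numpy as np

rng = np.random.default_rng(0)
N, P, sigma = 50, 4, 1.5
X = rng.normal(size=(N, P))
H = X @ np.linalg.inv(X.T @ X) @ X.T      # projection ("hat") matrix
M = np.eye(N) - H                         # residual maker matrix

qs = []
for _ in range(20000):
    eps = rng.normal(scale=sigma, size=N)
    e = M @ eps                           # residuals, e = M eps
    qs.append(e @ e / sigma**2)           # q = e'e / sigma^2

print(np.mean(qs), N - P)                 # both close to N - P = 46
print(np.var(qs), 2 * (N - P))            # both close to 2(N - P) = 92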

Step 2. $z_p$ and $q$ are independent given $\mathbf{X}$

We want to prove that $z_p$ and $q$ are independent. We’ll do this indirectly, by proving that $\hat{\boldsymbol{\beta}}$ and $\mathbf{e}$ are jointly Gaussian with zero covariance, and therefore independent. Since $z_p$ is a function of $\hat{\boldsymbol{\beta}}$ and $q$ is a function of $\mathbf{e}$, this implies that $z_p$ and $q$ are independent.

The noise terms $\boldsymbol{\varepsilon}$ are multivariate normal,

\boldsymbol{\varepsilon} \sim \mathcal{N}(\mathbf{0}, \sigma^2 \mathbf{I}). \tag{A1.11}

Furthermore, both the OLS estimator $\hat{\boldsymbol{\beta}}$ and the residuals $\mathbf{e}$ are linear functions of $\boldsymbol{\varepsilon}$. To see this, recall that we proved the following relationship for the sampling error $\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}$:

\hat{\boldsymbol{\beta}} - \boldsymbol{\beta} = (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^{\top} \boldsymbol{\varepsilon}. \tag{A1.12}

(See Equation $8$ here.) Then we can write both $\hat{\boldsymbol{\beta}}$ and $\mathbf{e}$ as

\begin{aligned} \hat{\boldsymbol{\beta}} &= (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^{\top} \boldsymbol{\varepsilon} + \boldsymbol{\beta}, \\ \\ \mathbf{e} &= \mathbf{y} - \mathbf{X} \hat{\boldsymbol{\beta}} \\ &= \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\varepsilon} - \mathbf{X} \left[ (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^{\top} \boldsymbol{\varepsilon} + \boldsymbol{\beta} \right] \\ &= \left[ \mathbf{I} - \mathbf{X} (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^{\top} \right] \boldsymbol{\varepsilon} \\ &= \mathbf{M} \boldsymbol{\varepsilon}. \end{aligned} \tag{A1.13}

So both $\hat{\boldsymbol{\beta}}$ and $\mathbf{e}$ are normal. Now, a necessary and sufficient condition for $\hat{\boldsymbol{\beta}}$ and $\mathbf{e}$ to be jointly Gaussian is that for every pair of vectors $\mathbf{a} \in \mathbb{R}^P$ and $\mathbf{b} \in \mathbb{R}^N$, the scalar linear combination $\mathbf{a}^{\top} \hat{\boldsymbol{\beta}} + \mathbf{b}^{\top} \mathbf{e}$ is normally distributed. We have:

\begin{aligned} \mathbf{a}^{\top} \hat{\boldsymbol{\beta}} + \mathbf{b}^{\top} \mathbf{e} &= \mathbf{a}^{\top} \left[ (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^{\top} \boldsymbol{\varepsilon} + \boldsymbol{\beta} \right] + \mathbf{b}^{\top} \mathbf{M} \boldsymbol{\varepsilon} \\ &= \left[ \mathbf{a}^{\top} (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^{\top} + \mathbf{b}^{\top} \mathbf{M} \right] \boldsymbol{\varepsilon} + \mathbf{a}^{\top} \boldsymbol{\beta}. \end{aligned} \tag{A1.14}

This is clearly Gaussian, since $\boldsymbol{\varepsilon}$ is the only random quantity and it enters only through a linear map plus a constant. Thus, conditional on $\mathbf{X}$, $\hat{\boldsymbol{\beta}}$ and $\mathbf{e}$ are jointly normal. Their covariance is zero (shown below), and jointly normal random vectors with zero covariance are independent. Thus, $z_p$ and $q$ are independent.
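For completeness, here is the zero-covariance step referenced above, written only with quantities already defined (this line is implicit in the original argument). Since $\mathbf{e} = \mathbf{M} \boldsymbol{\varepsilon}$, $\mathbf{M}$ is symmetric, and $\mathbf{M} \mathbf{X} = \mathbf{0}$,

\text{Cov}(\hat{\boldsymbol{\beta}}, \mathbf{e} \mid \mathbf{X}) = \mathbb{E} \left[ (\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}) \mathbf{e}^{\top} \mid \mathbf{X} \right] = (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^{\top} \mathbb{E}[\boldsymbol{\varepsilon} \boldsymbol{\varepsilon}^{\top}] \mathbf{M} = \sigma^2 (\mathbf{X}^{\top} \mathbf{X})^{-1} (\mathbf{M} \mathbf{X})^{\top} = \mathbf{0}.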

Taken together, steps $1$ and $2$ prove that the $t$-statistic is $t$-distributed with $N-P$ degrees of freedom.

  1. Hayashi, F. (2000). Econometrics. Princeton University Press. Section 1, pp. 60–69.