OLS with Heteroscedasticity

The ordinary least squares estimator is inefficient when the homoscedasticity assumption does not hold, and its usual standard errors are no longer valid. I provide a simple example of a nonsensical $t$-statistic from data with heteroscedasticity and discuss why this happens in general.

Consider this linear relationship with true coefficient $\beta = 0.01$ and no intercept (Figure 1):

from numpy.random import RandomState

rng   = RandomState(seed=0)
x     = rng.randn(1000)
beta  = 0.01
noise = rng.randn(1000) + x*rng.randn(1000)
y     = beta * x + noise

Let’s fit ordinary least squares (OLS) to these data and look at the estimated $\hat{\beta}$ and $t$-statistic:

import statsmodels.api as sm

ols = sm.OLS(y, x).fit()
print(f'{ols.params[0]:.4f}')   # beta  : -0.1634
print(f'{ols.tvalues[0]:.4f}')  # t-stat: -3.7863

Remarkably, we see that the sign of the coefficient is wrong, yet the $t$-statistic is large in magnitude. Assuming normally distributed error terms, a $t$-statistic of $-3.78$ indicates that our estimated coefficient, $\hat{\beta} = -0.1634$, is nearly four standard errors away from the null hypothesis value (here, zero). (See my previous post on hypothesis testing for OLS if this claim does not make sense.) This is obviously wrong, but what’s going on here? The problem is in the noise term:

noise = rng.randn(1000) + x*rng.randn(1000)

Each element in the vector noise is a function of its corresponding element in x. This breaks an assumption in OLS and results in incorrectly computed $t$-statistics. The goal of this post is to look at this phenomenon (heteroscedasticity) and some solutions in general.

Figure 1. OLS fit to data with heteroscedastic errors. The sign of the coefficient is wrong, yet the model has estimated a large $t$-statistic due to unaccounted-for heteroscedasticity.
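We can verify that the simulated noise is heteroscedastic: since noise is a standard normal draw plus x times an independent standard normal draw, its conditional variance is $1 + x_n^2$. Below is a quick check that regenerates the same data (same seed as above) and compares the sample variance of the noise for observations with small versus large $\lvert x_n \rvert$; the cutoffs of 0.5 and 1.5 are arbitrary:

import numpy as np
from numpy.random import RandomState

rng   = RandomState(seed=0)
x     = rng.randn(1000)
noise = rng.randn(1000) + x*rng.randn(1000)

# Var[noise_n | x_n] = 1 + x_n^2, so the spread of the noise grows with |x_n|.
print(f'{np.var(noise[np.abs(x) < 0.5]):.2f}')   # a bit above 1
print(f'{np.var(noise[np.abs(x) > 1.5]):.2f}')   # several times larger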

Non-spherical errors

Recall the OLS model,

$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}, \tag{1}$$

where $\mathbf{y}$ is an $N$-vector of response variables, $\mathbf{X}$ is an $N \times P$ matrix of $P$-dimensional predictors, $\boldsymbol{\beta}$ specifies a $P$-dimensional hyperplane, and $\boldsymbol{\varepsilon}$ is an $N$-vector of noise terms. We could include an intercept by adding a column of ones to $\mathbf{X}$. A standard assumption of OLS is spherical errors (see Assumption 4 here). Formally, the spherical errors assumption is that

$$\mathbb{V}[\boldsymbol{\varepsilon} \mid \mathbf{X}] = \sigma^2 \mathbf{I}. \tag{2}$$

In words, all noise terms $\varepsilon_n$ for $n \in \{1,\dots,N\}$ share a constant variance (Figure 2, left). This assumption may be reasonable when our data are approximately i.i.d., but this is often not the case. Let’s look at a simple scenario in which this assumption does not hold: when the error terms are heteroscedastic.

In statistics, scedasticity refers to the distribution of the error terms. OLS assumes homoscedasticity, which means that each sample’s error term has the same constant variance ($\sigma^2$ in Equation 2). With heteroscedasticity, each sample $\mathbf{x}_n$ has an error term with its own variance. Formally, Equation 2 becomes

$$\mathbb{V}[\boldsymbol{\varepsilon} \mid \mathbf{X}] = \begin{bmatrix} \sigma_1^2 & 0 & \dots & 0 \\ 0 & \sigma_2^2 & \dots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \dots & \sigma_N^2 \end{bmatrix}. \tag{3}$$

There are multiple forms of heteroscedasticity. Consider, for example, data which can be partitioned into distinct groups. These types of data may exhibit discrete heteroscedasticity, since each group may have its own variance (Figure 2, middle). A real-world instance of discrete heteroscedasticity might arise when data are collected from multiple sources, such as when each city or province in a country is responsible for reporting its own census data.

Alternatively, consider data in which each datum’s variance depends on a single predictor. These data may exhibit heteroscedasticity along that dimension (Figure 2, right). For example, in time series data, variance might change over time. This type of heteroscedasticity was the source of our problems in the leading example.

Figure 2. Data exhibiting homoscedasticity (left), discrete heteroscedasticity (middle), and heteroscedasticity of the form $\varepsilon_n \sim \mathcal{N}(0, x_n^2)$ (right).
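To make these two forms concrete, here is a small simulation sketch. The group labels and the per-group standard deviations (1 and 3) are made up purely for illustration:

import numpy as np
from numpy.random import RandomState

rng = RandomState(seed=0)
N   = 1000
x   = rng.randn(N)

# Discrete heteroscedasticity: two groups, each with its own noise variance.
groups      = rng.randint(0, 2, size=N)        # hypothetical group labels
group_scale = np.where(groups == 0, 1.0, 3.0)  # hypothetical standard deviations of 1 and 3
noise_disc  = group_scale * rng.randn(N)

# Heteroscedasticity along a predictor: eps_n ~ N(0, x_n^2), as in Figure 2 (right).
noise_x = np.abs(x) * rng.randn(N)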

A second challenge for OLS is when the error terms are correlated, inducing autocorrelation in the observations. Then Equation 2 becomes

$$\mathbb{V}[\boldsymbol{\varepsilon} \mid \mathbf{X}] = \begin{bmatrix} \sigma^2 & \rho_{12} & \dots & \rho_{1N} \\ \rho_{21} & \sigma^2 & \dots & \rho_{2N} \\ \vdots & \vdots & \ddots & \vdots \\ \rho_{N1} & \rho_{N2} & \dots & \sigma^2 \end{bmatrix}. \tag{4}$$

We could, of course, have both heteroscedasticity and autocorrelation, in which case the diagonal in Equation 4 would be replaced with the diagonal in Equation 3. In this post, I will limit the discussion to only the first scenario, heteroscedastic errors.

Incorrect estimation with OLS

So what happens when we apply classic OLS to data with variance

$$\mathbb{V}[\boldsymbol{\varepsilon} \mid \mathbf{X}] = \sigma^2 \boldsymbol{\Omega}, \tag{5}$$

instead of spherical errors? Here, $\boldsymbol{\Omega}$ is a positive definite matrix that allows for heteroscedasticity or autocorrelation. First, observe that the estimator from classic OLS,

$$\hat{\boldsymbol{\beta}} = (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^{\top} \mathbf{y}, \tag{6}$$

is still unbiased, even with a generalized noise term:

$$\begin{aligned} \mathbb{E}[\hat{\boldsymbol{\beta}} \mid \mathbf{X}] &= \mathbb{E}[(\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^{\top} \mathbf{y} \mid \mathbf{X}] \\ &= (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^{\top} \mathbb{E}[ \mathbf{y} \mid \mathbf{X}] \\ &= (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^{\top} \mathbf{X} \boldsymbol{\beta} \\ &= \boldsymbol{\beta}. \end{aligned} \tag{7}$$

This holds because we still assume that $\mathbb{E}[\boldsymbol{\varepsilon} \mid \mathbf{X}] = \mathbf{0}$: the errors may be heteroscedastic, but their conditional expectation is still zero.
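Equation 7 is easy to check by simulation. Here is a minimal sketch that repeatedly redraws heteroscedastic data like the leading example and averages the OLS estimates; the number of replications is arbitrary:

import numpy as np
from numpy.random import RandomState

rng  = RandomState(seed=0)
beta = 0.01

estimates = []
for _ in range(2000):
    x     = rng.randn(1000)
    noise = rng.randn(1000) + x*rng.randn(1000)
    y     = beta*x + noise
    estimates.append((x @ y) / (x @ x))   # Equation 6 for a single predictor, no intercept

print(f'{np.mean(estimates):.4f}')        # should be close to the true beta = 0.01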

However, despite being unbiased, the OLS estimator still has a problem. Deriving the OLS variance,

$$\mathbb{V}[\hat{\boldsymbol{\beta}} \mid \mathbf{X}] = \sigma^2 (\mathbf{X}^{\top} \mathbf{X})^{-1}, \tag{8}$$

requires assuming homoscedasticity (see Equation 10 here). Without this assumption, the correct variance would be

$$\mathbb{V}[\hat{\boldsymbol{\beta}} \mid \mathbf{X}] = \sigma^2 (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^{\top} \boldsymbol{\Omega} \mathbf{X} (\mathbf{X}^{\top} \mathbf{X})^{-1}. \tag{9}$$

This follows immediately from plugging $\mathbb{V}[\boldsymbol{\varepsilon} \mid \mathbf{X}] = \sigma^2 \boldsymbol{\Omega}$ into Equation 10 in the previous post, and we cannot simplify it further without assuming some structure for the error terms.
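Since we simulated the leading example ourselves, we know the error covariance exactly: taking $\sigma^2 = 1$, $\boldsymbol{\Omega}$ is diagonal with entries $1 + x_n^2$. Here is a minimal sketch comparing the naive standard error from Equation 8 (as reported by statsmodels) with the one implied by Equation 9:

import numpy as np
import statsmodels.api as sm
from numpy.random import RandomState

rng   = RandomState(seed=0)
x     = rng.randn(1000)
noise = rng.randn(1000) + x*rng.randn(1000)
y     = 0.01*x + noise
ols   = sm.OLS(y, x).fit()

X       = x[:, None]                            # N x 1 design matrix
Omega   = np.diag(1.0 + x**2)                   # exact Var[eps | X], with sigma^2 = 1
XtX_inv = np.linalg.inv(X.T @ X)
correct = XtX_inv @ X.T @ Omega @ X @ XtX_inv   # Equation 9

print(f'{ols.bse[0]:.4f}')               # naive standard error (Equation 8 with s^2)
print(f'{np.sqrt(correct[0, 0]):.4f}')   # correct standard error; larger here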

The second problem is that $s^2$ is no longer an unbiased estimate of $\sigma^2$. Logically, this holds since there is no longer a single $\sigma^2$. Furthermore, the derivation for the unbiasedness of $s^2$ depends on homoscedasticity. (See Equation 22 here.)

We can now see why we estimated incorrect $t$-statistics in the leading example. Recall that the $t$-statistic for the $p$-th coefficient of $\hat{\boldsymbol{\beta}}$ is

$$t_p = \frac{\hat{\beta}_p - b_p}{\sqrt{s^2 \left[ (\mathbf{X}^{\top} \mathbf{X})^{-1} \right]_{pp}}}, \tag{10}$$

where $b_p$ is the hypothesized true value and the denominator is the standard error. (See my previous post for details.) Both the estimated variance and $s^2$ are incorrect without homoscedasticity. In fact, it’s impossible to know if the $t$-statistic will be too large, too small, or correct without knowing more about $\sigma^2 \boldsymbol{\Omega}$.
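To tie Equation 10 back to the leading example, we can reproduce statsmodels’ $t$-statistic by hand (with the null hypothesis $b_p = 0$); the only wrinkle is that $s^2$ uses $N - P$ degrees of freedom, here $N - 1$ since there is no intercept:

import numpy as np
import statsmodels.api as sm
from numpy.random import RandomState

rng   = RandomState(seed=0)
x     = rng.randn(1000)
noise = rng.randn(1000) + x*rng.randn(1000)
y     = 0.01*x + noise
ols   = sm.OLS(y, x).fit()

resid = y - ols.params[0] * x
s2    = resid @ resid / (len(y) - 1)   # s^2 with N - P degrees of freedom (P = 1)
se    = np.sqrt(s2 / (x @ x))          # [(X'X)^{-1}]_{pp} is 1 / sum(x_n^2) here

print(f'{ols.params[0] / se:.4f}')     # should match ols.tvalues[0] from before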

Asymptotic behavior

Finally, let’s consider the asymptotic behavior of the OLS estimator. A standard assumption is that

$$\plim \frac{1}{N} \mathbf{X}^{\top} \mathbf{X} = \mathbf{Q}, \tag{11}$$

for some positive definite matrix $\mathbf{Q}$. This is a reasonable assumption because of the weak law of large numbers. The basic idea is that our data are “well-behaved” in the sense that they converge to their expected values. (See this post on OLS consistency for discussion.)
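For a single standard normal predictor, $\frac{1}{N}\mathbf{X}^{\top}\mathbf{X}$ is just the sample second moment of $x$, so $\mathbf{Q}$ is the scalar $\mathbb{E}[x_n^2] = 1$. A quick sketch of the convergence (the sample sizes are arbitrary):

import numpy as np
from numpy.random import RandomState

rng = RandomState(seed=0)
for N in [100, 10_000, 1_000_000]:
    x = rng.randn(N)
    print(f'{(x @ x) / N:.4f}')   # approaches Q = E[x_n^2] = 1 as N grows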

However, notice that $\mathbf{X}^{\top} \boldsymbol{\Omega} \mathbf{X}$ (the middle term in Equation 9) is more subtle. Without knowing anything about $\boldsymbol{\Omega}$, there is no guarantee that $\mathbf{X}^{\top} \boldsymbol{\Omega} \mathbf{X}$ will converge in probability to some matrix. Even if we did assume that

$$\plim \frac{1}{N} \mathbf{X}^{\top} \boldsymbol{\Omega} \mathbf{X} = \mathbf{C} \tag{12}$$

for some matrix $\mathbf{C}$, all we could say is that the OLS estimator converges in distribution to

$$\sqrt{N}(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}) \stackrel{d}{\rightarrow} \mathcal{N}(\mathbf{0}, \sigma^2 \mathbf{Q}^{-1} \mathbf{C} \mathbf{Q}^{-1}). \tag{13}$$

But we cannot simplify the covariance matrix, and $\plim s^2 \neq \sigma^2$.
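We can still illustrate Equation 13 by simulation, because in the leading example we know $\boldsymbol{\Omega}$. The sketch below compares $N$ times the sampling variance of $\hat{\beta}$ against the plug-in sandwich $\mathbf{Q}^{-1} \mathbf{C} \mathbf{Q}^{-1}$ (everything is scalar here, with $\sigma^2 = 1$ folded into $\boldsymbol{\Omega}$); the replication count is arbitrary:

import numpy as np
from numpy.random import RandomState

rng  = RandomState(seed=0)
N    = 1000
beta = 0.01

estimates = []
for _ in range(2000):
    x     = rng.randn(N)
    noise = rng.randn(N) + x*rng.randn(N)
    y     = beta*x + noise
    estimates.append((x @ y) / (x @ x))   # Equation 6 for a single predictor

Q = (x @ x) / N                           # plug-in estimate of Q (from the last replication)
C = np.sum(x**2 * (1.0 + x**2)) / N       # plug-in estimate of C, with Omega = diag(1 + x_n^2)

print(f'{N * np.var(estimates):.2f}')     # sampling variance of sqrt(N) * (beta_hat - beta)
print(f'{C / Q**2:.2f}')                  # Q^{-1} C Q^{-1}; the two should be similar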

Summary

In short, without homoscedasticity, the nice properties of the OLS estimator do not hold, and inference using OLS on heteroscedastic data will be incorrect. There are a few solutions to this problem. The first is to use generalized least squares (GLS) if we know the error covariance (the matrix $\boldsymbol{\Omega}$). This works because GLS is the best linear unbiased estimator under generalized error terms. Another approach is to keep the OLS estimator but replace the usual standard errors with heteroscedasticity-robust standard errors. Finally, in some situations, we can use weighted least squares. I will discuss these topics in more detail in future posts.
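As a quick preview of the robust-standard-error approach, statsmodels can compute heteroscedasticity-consistent (White) standard errors for the leading example via the cov_type argument; I use 'HC0' below, though other variants exist:

import statsmodels.api as sm
from numpy.random import RandomState

rng   = RandomState(seed=0)
x     = rng.randn(1000)
noise = rng.randn(1000) + x*rng.randn(1000)
y     = 0.01*x + noise

robust = sm.OLS(y, x).fit(cov_type='HC0')   # heteroscedasticity-consistent covariance
print(f'{robust.params[0]:.4f}')            # same point estimate as before
print(f'{robust.tvalues[0]:.4f}')           # less extreme than the naive -3.7863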