Generalized Least Squares

I discuss generalized least squares (GLS), which extends ordinary least squares to handle heteroscedastic errors with a known covariance structure. I prove some basic properties of GLS, particularly that it is the best linear unbiased estimator, and work through a complete example.

Ordinary least squares (OLS), when all its assumptions hold, is the best linear unbiased estimator (BLUE). This is the main result of the Gauss–Markov theorem. However, applying OLS to heteroscedastic data leads to incorrect and inefficient inference. In particular, the t-statistics are unreliable, and the estimator is no longer BLUE. See this post for a simple example and code.

However, if we do know the covariance structure of our data, we can use generalized least squares (GLS), developed by Alexander Aitken (Aitken, 1935). The basic idea of GLS is to normalize our data using the known covariance such that the normalized data adhere to the OLS assumptions. We can then apply the OLS estimator, which is BLUE, to these transformed data.

The goal of this post is to walk through GLS in detail. I’ll present the model, an example, and then prove some basic properties.

Aitken’s generalized least squares

In generalized least squares, we assume the following model:

\begin{aligned} \mathbf{y} &= \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\varepsilon}, \\ \mathbb{E}[\boldsymbol{\varepsilon} \mid \mathbf{X}] &= \mathbf{0}, \\ \mathbb{V}[\boldsymbol{\varepsilon} \mid \mathbf{X}] &= \sigma^2 \boldsymbol{\Omega} = \boldsymbol{\Sigma}. \end{aligned} \tag{1}

Here, \boldsymbol{\Omega} is a positive definite matrix, and we assume it is known. This is called generalized least squares because OLS is just a special case of this model, when \boldsymbol{\Omega} = \mathbf{I}.

The GLS optimization problem is to minimize the following Gaussian kernel (a squared Mahalanobis distance):

\hat{\boldsymbol{\beta}}_{\textsf{GLS}} = \arg\!\min_{\boldsymbol{\beta}} \left\{ (\mathbf{y} - \mathbf{X} \boldsymbol{\beta})^{\top} \boldsymbol{\Omega}^{-1} (\mathbf{y} - \mathbf{X} \boldsymbol{\beta}) \right\}. \tag{2}

This is different from the OLS optimization problem, in which the variance is constant and therefore does not affect the optimal coefficients, since it can be factored out:

\hat{\boldsymbol{\beta}}_{\textsf{OLS}} = \arg\!\min_{\boldsymbol{\beta}} \left\{ (\mathbf{y} - \mathbf{X} \boldsymbol{\beta})^{\top} (\mathbf{y} - \mathbf{X} \boldsymbol{\beta}) \right\}. \tag{3}

Now we could take the gradient of Equation 2, set it equal to zero, and solve for \boldsymbol{\beta}_{\textsf{GLS}}. However, in my mind, there is a simpler way. Simply notice that since \boldsymbol{\Omega} is positive definite, its inverse exists, is itself positive definite, and therefore has a Cholesky decomposition of the form

\boldsymbol{\Omega}^{-1} = \mathbf{L} \mathbf{L}^{\top}. \tag{4}
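
As a quick numerical check (my own sketch, not part of the original derivation), we can compute this factor with NumPy for a small, arbitrary positive definite \boldsymbol{\Omega}:

import numpy as np

# A small, arbitrary positive definite Omega (equicorrelated errors).
Omega = np.array([
    [1.0, 0.5, 0.5],
    [0.5, 1.0, 0.5],
    [0.5, 0.5, 1.0],
])

# Cholesky factor of the inverse: Omega^{-1} = L L^T (Equation 4).
L = np.linalg.cholesky(np.linalg.inv(Omega))
assert np.allclose(L @ L.T, np.linalg.inv(Omega))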

And we can rewrite the Gaussian kernel in Equation 2 as

\begin{aligned} &(\mathbf{y} - \mathbf{X} \boldsymbol{\beta})^{\top} \boldsymbol{\Omega}^{-1} (\mathbf{y} - \mathbf{X} \boldsymbol{\beta}) \\ &= (\mathbf{y} - \mathbf{X} \boldsymbol{\beta})^{\top} \mathbf{L} \mathbf{L}^{\top} (\mathbf{y} - \mathbf{X} \boldsymbol{\beta}) \\ &= (\mathbf{L}^{\top} \mathbf{y} - \mathbf{L}^{\top} \mathbf{X} \boldsymbol{\beta})^{\top} (\mathbf{L}^{\top} \mathbf{y} - \mathbf{L}^{\top} \mathbf{X} \boldsymbol{\beta}). \end{aligned} \tag{5}

In other words, the objective with transformed data \mathbf{L}^{\top} \mathbf{y} and \mathbf{L}^{\top} \mathbf{X} is just the OLS objective, and the transformed errors are homoscedastic,

\begin{aligned} \mathbb{E}[\mathbf{L}^{\top} \boldsymbol{\varepsilon} \boldsymbol{\varepsilon}^{\top} \mathbf{L} \mid \mathbf{L}^{\top} \mathbf{X}] &= \mathbf{L}^{\top} \mathbb{E}[\boldsymbol{\varepsilon} \boldsymbol{\varepsilon}^{\top} \mid \mathbf{L}^{\top} \mathbf{X}] \mathbf{L} \\ &= \mathbf{L}^{\top} \left( \sigma^2 \boldsymbol{\Omega} \right) \mathbf{L} \\ &= \sigma^2 \mathbf{I}. \end{aligned} \tag{6}

Why does this work? The univariate analog of \boldsymbol{\Sigma} is \sigma^2, and \mathbf{L}^{\top} is the multivariate analog of 1/\sigma. In other words, \mathbf{L}^{\top} is a linear transformation of our data such that the transformed data's variance is normalized.
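
We can also see this normalization empirically. In the sketch below (my own illustration, with an arbitrary \boldsymbol{\Omega} and \sigma^2), we draw many error vectors with covariance \sigma^2 \boldsymbol{\Omega}, whiten them with \mathbf{L}^{\top}, and check that their sample covariance is roughly \sigma^2 \mathbf{I}:

import numpy as np

rng    = np.random.default_rng(0)
sigma2 = 2.0
Omega  = np.array([
    [1.0, 0.7, 0.4],
    [0.7, 1.0, 0.7],
    [0.4, 0.7, 1.0],
])
L = np.linalg.cholesky(np.linalg.inv(Omega))

# Draw many error vectors with covariance sigma^2 * Omega, then whiten each one.
eps   = rng.multivariate_normal(np.zeros(3), sigma2 * Omega, size=100_000)
eps_w = eps @ L  # each row is (L^T eps)^T

# The sample covariance of the whitened errors is approximately sigma^2 * I.
print(np.cov(eps_w, rowvar=False).round(2))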

Furthermore, note that \mathbf{L}^{\top} \mathbf{y} and \mathbf{L}^{\top} \mathbf{X} adhere to the standard assumptions of OLS: \mathbf{L}^{\top} \mathbf{X} is non-random since \mathbf{X} is non-random and \mathbf{L} is non-random (because \boldsymbol{\Omega} is known); \mathbf{L}^{\top} \mathbf{X} is full-rank since \mathbf{X} is full-rank and \mathbf{L}^{\top} is invertible; and the conditional mean of our observations \mathbf{L}^{\top} \mathbf{y} is what we would expect, i.e.

\mathbb{E}[\mathbf{L}^{\top} \mathbf{y} \mid \mathbf{L}^{\top} \mathbf{X}] = \mathbf{L}^{\top} \mathbf{X} \boldsymbol{\beta}. \tag{7}

So GLS is OLS but with \mathbf{y} and \mathbf{X} mapped by the linear function \mathbf{L}^{\top}. To make this point clear, it is common to use the following notation:

\begin{aligned} \mathbf{y}_{*} &= \mathbf{L}^{\top} \mathbf{y}, \\ \mathbf{X}_{*} &= \mathbf{L}^{\top} \mathbf{X}, \end{aligned} \tag{8}

and to rewrite the GLS objective in Equation 2 as

\hat{\boldsymbol{\beta}}_{\textsf{GLS}} = \arg\!\min_{\boldsymbol{\beta}} \left\{ (\mathbf{y}_{*} - \mathbf{X}_{*} \boldsymbol{\beta})^{\top} (\mathbf{y}_{*} - \mathbf{X}_{*} \boldsymbol{\beta}) \right\}. \tag{9}

Since the OLS estimator,

\hat{\boldsymbol{\beta}}_{\textsf{OLS}} = (\mathbf{X}_{*}^{\top} \mathbf{X}_{*})^{-1} \mathbf{X}_{*}^{\top} \mathbf{y}_{*}, \tag{10}

is BLUE, this implies that the GLS estimator, which is just Equation 10 expressed in terms of the untransformed data, is also BLUE:

\begin{aligned} \hat{\boldsymbol{\beta}}_{\textsf{GLS}} &= (\mathbf{X}^{\top} \mathbf{L} \mathbf{L}^{\top} \mathbf{X})^{-1} \mathbf{X}^{\top} \mathbf{L} \mathbf{L}^{\top} \mathbf{y} \\ &= (\mathbf{X}^{\top} \boldsymbol{\Omega}^{-1} \mathbf{X})^{-1} \mathbf{X}^{\top} \boldsymbol{\Omega}^{-1} \mathbf{y}. \end{aligned} \tag{11}

So the GLS estimator is BLUE in the presence of heteroscedastic errors, governed by \boldsymbol{\Omega}. The trick was to normalize the data by the known covariance and then apply the OLS estimator to these transformed data.
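
The equivalence between Equations 10 and 11 is easy to check numerically. Here is a small self-contained sketch (with arbitrary simulated data of my own) showing that OLS on the transformed data matches the closed-form GLS estimator:

import numpy as np

rng  = np.random.default_rng(1)
N, P = 50, 3
X    = rng.normal(size=(N, P))
beta = np.array([0.7, -0.3, 1.2])

# An arbitrary positive definite Omega and correlated errors drawn from it.
A     = rng.normal(size=(N, N))
Omega = A @ A.T + N * np.eye(N)
y     = X @ beta + rng.multivariate_normal(np.zeros(N), Omega)

# Equation 10: OLS on the transformed data (L^T y, L^T X).
L     = np.linalg.cholesky(np.linalg.inv(Omega))
b_ols = np.linalg.lstsq(L.T @ X, L.T @ y, rcond=None)[0]

# Equation 11: the closed-form GLS estimator on the original data.
Oinv  = np.linalg.inv(Omega)
b_gls = np.linalg.solve(X.T @ Oinv @ X, X.T @ Oinv @ y)

assert np.allclose(b_ols, b_gls)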

Example

Before discussing GLS any further, let's look at a simple example of data with non-spherical errors and show that GLS outperforms OLS on these data. First, let's construct a positive definite covariance matrix with dependencies between the samples, breaking the spherical-errors assumption of OLS:

import numpy as np
from   sklearn.datasets import make_s_curve
from   sklearn.metrics.pairwise import rbf_kernel

size = 100
rng  = np.random.RandomState(seed=1)

# Generate points along an S-curve, drop the flat dimension, and standardize.
X, t = make_s_curve(size, random_state=rng)
X    = np.delete(X, obj=1, axis=1)
X    = X / np.std(X, axis=0)
X    = X[t.argsort()]

# Covariance between samples: an RBF kernel plus jitter for numerical stability.
O    = rbf_kernel(X) + 1e-6 * np.eye(size)

This constructs a covariance matrix that is clearly not spherical (Figure 1, left). If we then assume the error terms are multivariate normal, our observations are multivariate normal as well, and we can generate random data (Figure 1, right) as

from   scipy.stats import multivariate_normal as mvn

# True coefficients, a fresh design matrix, and observations y ~ N(Xb, O).
b = np.array([0.7, -0.3])
X = rng.randn(size, 2)
y = mvn(X @ b, O).rvs(random_state=rng)

Now let’s apply OLS to these data:

import statsmodels.api as sm

ols = sm.OLS(y, sm.add_constant(X)).fit()
print(ols.params[1:])   # [0.71, -0.24]
print(ols.tvalues[1:])  # [8.03, -3.01]

We can see that the inference is not great. The t-statistics are large, but the inferred parameter values are off. For example, the second parameter has a t-statistic over 3 but is off from the true value by roughly 20%.

Figure 1. (Left) Heatmap of the covariance matrix for the multivariate normally distributed error terms. (Right) 100 samples drawn from \mathbf{y} \sim \mathcal{N}(\mathbf{X}\boldsymbol{\beta}, \boldsymbol{\Omega}). The red plane is the true hyperplane, \boldsymbol{\beta} = [0.7, -0.3].

Finally, let’s apply GLS rather than OLS to these data, specifying the structure of our noise terms using the sigma argument:

gls = sm.GLS(y, sm.add_constant(X), sigma=O).fit()
print(gls.params[1:])   # [0.70, -0.30]
print(gls.tvalues[1:])  # [4945.06, -2346.34]

We can see that the inference is improved. The estimated values are correct, and the t-statistics are huge. This empirically validates the argument so far: if we know the structure of our error terms, we can still obtain efficient inference by applying OLS to the appropriately transformed data.
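
As a sanity check, and assuming the variables X, y, and O from the snippets above are still in scope, we can compare statsmodels' estimates with the closed-form estimator in Equation 11 computed by hand; the two should agree up to numerical precision:

# Closed-form GLS (Equation 11), with an intercept column to match statsmodels.
Xc    = sm.add_constant(X)
Oinv  = np.linalg.inv(O)
b_gls = np.linalg.solve(Xc.T @ Oinv @ Xc, Xc.T @ Oinv @ y)

print(b_gls)       # should match...
print(gls.params)  # ...statsmodels' GLS estimates.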

Sampling distribution

Finally, let’s derive some basic properties of GLS.

Unbiasedness. We know that the GLS estimator is unbiased, meaning that

\mathbb{E}\left[\hat{\boldsymbol{\beta}}_{\textsf{GLS}}\right] = \boldsymbol{\beta}. \tag{12}

This is because Equation 10 is unbiased, since OLS is unbiased. However, we can also follow the standard derivation for OLS unbiasedness to prove this directly:

\begin{aligned} \mathbb{E}\left[\hat{\boldsymbol{\beta}}_{\textsf{GLS}} - \boldsymbol{\beta} \mid \mathbf{X}\right] &= \mathbb{E}[(\mathbf{X}^{\top} \boldsymbol{\Omega}^{-1} \mathbf{X})^{-1} \mathbf{X}^{\top} \boldsymbol{\Omega}^{-1} \mathbf{y} - \boldsymbol{\beta} \mid \mathbf{X}] \\ &= \mathbb{E}[(\mathbf{X}^{\top} \boldsymbol{\Omega}^{-1} \mathbf{X})^{-1} \mathbf{X}^{\top} \boldsymbol{\Omega}^{-1} (\mathbf{X} \boldsymbol{\beta} + \boldsymbol{\varepsilon}) - \boldsymbol{\beta} \mid \mathbf{X}] \\ &= \mathbb{E}[\boldsymbol{\beta} + (\mathbf{X}^{\top} \boldsymbol{\Omega}^{-1} \mathbf{X})^{-1} \mathbf{X}^{\top} \boldsymbol{\Omega}^{-1} \boldsymbol{\varepsilon} - \boldsymbol{\beta} \mid \mathbf{X}] \\ &= \mathbb{E}[(\mathbf{X}^{\top} \boldsymbol{\Omega}^{-1} \mathbf{X})^{-1} \mathbf{X}^{\top} \boldsymbol{\Omega}^{-1} \boldsymbol{\varepsilon} \mid \mathbf{X}] \\ &= (\mathbf{X}^{\top} \boldsymbol{\Omega}^{-1} \mathbf{X})^{-1} \mathbf{X}^{\top} \boldsymbol{\Omega}^{-1} \mathbb{E}[\boldsymbol{\varepsilon} \mid \mathbf{X}] \\ &= \mathbf{0}. \end{aligned} \tag{13}

GLS estimator variance. To compute the variance of the GLS estimator, we can follow the standard derivation for the OLS variance:

\begin{aligned} \mathbb{V}\left[\hat{\boldsymbol{\beta}}_{\textsf{GLS}} \mid \mathbf{X}\right] &= \mathbb{V} \left[\hat{\boldsymbol{\beta}}_{\textsf{GLS}} - \boldsymbol{\beta} \mid \mathbf{X} \right] \\ &= \mathbb{V} [ \overbrace{(\mathbf{X}^{\top} \boldsymbol{\Omega}^{-1} \mathbf{X})^{-1} \mathbf{X}^{\top} \boldsymbol{\Omega}^{-1}}^{\mathbf{A}} \boldsymbol{\varepsilon} \mid \mathbf{X} ] \\ &= \mathbf{A} \mathbb{V}[\boldsymbol{\varepsilon} \mid \mathbf{X}] \mathbf{A}^{\top} \\ &= \sigma^2 \mathbf{A} \boldsymbol{\Omega} \mathbf{A}^{\top} \\ &= \sigma^2 (\mathbf{X}^{\top} \boldsymbol{\Omega}^{-1} \mathbf{X})^{-1} \mathbf{X}^{\top} \boldsymbol{\Omega}^{-1} \boldsymbol{\Omega} \boldsymbol{\Omega}^{-1} \mathbf{X} (\mathbf{X}^{\top} \boldsymbol{\Omega}^{-1} \mathbf{X})^{-1} \\ &= \sigma^2 (\mathbf{X}^{\top} \boldsymbol{\Omega}^{-1} \mathbf{X})^{-1}. \end{aligned} \tag{14}

Or we could have proved this more directly, since

\begin{aligned} \mathbb{V}\left[\hat{\boldsymbol{\beta}}_{\textsf{OLS}} \mid \mathbf{X}_{*}\right] &= \sigma^2 (\mathbf{X}_{*}^{\top} \mathbf{X}_{*})^{-1} \\ &= \sigma^2 (\mathbf{X}^{\top} \mathbf{L}\mathbf{L}^{\top} \mathbf{X})^{-1} \\ &= \sigma^2 (\mathbf{X}^{\top} \boldsymbol{\Omega}^{-1} \mathbf{X})^{-1} \\ &= \mathbb{V}\left[\hat{\boldsymbol{\beta}}_{\textsf{GLS}} \mid \mathbf{X}\right]. \end{aligned} \tag{15}
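
Equations 12 and 14 are also easy to verify by simulation. The following self-contained sketch (my own, with an arbitrary design and \boldsymbol{\Omega}) holds \mathbf{X} and \boldsymbol{\Omega} fixed, resamples the errors many times, and compares the empirical mean and covariance of the GLS estimates with \boldsymbol{\beta} and \sigma^2 (\mathbf{X}^{\top} \boldsymbol{\Omega}^{-1} \mathbf{X})^{-1}:

import numpy as np

rng    = np.random.default_rng(2)
N      = 20
sigma2 = 1.5
beta   = np.array([0.7, -0.3])
X      = rng.normal(size=(N, 2))

# A fixed positive definite Omega.
A     = rng.normal(size=(N, N))
Omega = A @ A.T / N + np.eye(N)
Oinv  = np.linalg.inv(Omega)

# Resample the errors many times and recompute the GLS estimator each time.
eps  = rng.multivariate_normal(np.zeros(N), sigma2 * Omega, size=20_000)
Y    = X @ beta + eps                                        # one simulated y per row
Bhat = np.linalg.solve(X.T @ Oinv @ X, X.T @ Oinv @ Y.T).T   # one GLS estimate per row

print(Bhat.mean(axis=0))                       # close to beta (Equation 12)
print(np.cov(Bhat, rowvar=False))              # close to the matrix below (Equation 14)
print(sigma2 * np.linalg.inv(X.T @ Oinv @ X))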

Unbiased GLS variance estimator. Let \mathbf{e} denote the residuals, i.e.

\mathbf{e} = \mathbf{y} - \mathbf{X} \hat{\boldsymbol{\beta}}. \tag{16}

In OLS, an unbiased estimator of the error variance \sigma^2, provided the OLS assumptions hold, is

s_{\textsf{OLS}}^2 = \frac{\mathbf{e}^{\top} \mathbf{e}}{N-P}, \tag{17}

where N is the number of observations and P is the data's dimension. Of course, we are now assuming Equation 1, meaning the OLS assumptions do not hold. However, we can immediately derive an equivalent estimator, since

s^2 = \frac{\mathbf{e}_{*}^{\top} \mathbf{e}_{*}}{N-P} \tag{18}

is unbiased when \mathbf{e}_{*} = \mathbf{y}_{*} - \mathbf{X}_{*} \hat{\boldsymbol{\beta}}. Thus, an unbiased estimator of the error variance in the GLS model is

\begin{aligned} s_{\textsf{GLS}}^2 &= \frac{1}{N-P} (\mathbf{y}_{*} - \mathbf{X}_{*} \hat{\boldsymbol{\beta}}_{\textsf{GLS}})^{\top} (\mathbf{y}_{*} - \mathbf{X}_{*} \hat{\boldsymbol{\beta}}_{\textsf{GLS}}) \\ &= \frac{1}{N-P} (\mathbf{y} - \mathbf{X} \hat{\boldsymbol{\beta}}_{\textsf{GLS}})^{\top} \boldsymbol{\Omega}^{-1} (\mathbf{y} - \mathbf{X} \hat{\boldsymbol{\beta}}_{\textsf{GLS}}). \end{aligned} \tag{19}
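
For the running example, we can compute Equation 19 directly, reusing Xc and Oinv from the earlier check and assuming y and the fitted gls result are still in scope. Since the example data were generated with \sigma^2 = 1, the estimate should land near 1; if I read statsmodels' conventions correctly, it reports the same quantity as gls.scale.

# Equation 19: whitened residual sum of squares divided by the degrees of freedom.
resid  = y - Xc @ gls.params
s2_gls = resid @ Oinv @ resid / (len(y) - Xc.shape[1])

print(s2_gls)     # should be near the true sigma^2 = 1
print(gls.scale)  # statsmodels' estimate of the error variance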

Sampling distribution. We now know the mean and variance of the sampling distribution. If we assume that the noise terms are multivariate normal,

\boldsymbol{\varepsilon} \sim \mathcal{N}\left( \mathbf{0}, \sigma^2 \boldsymbol{\Omega} \right), \tag{20}

then our observations are multivariate normal,

\mathbf{y} \mid \mathbf{X} \sim \mathcal{N}\left( \mathbf{X} \boldsymbol{\beta}, \sigma^2 \boldsymbol{\Omega} \right). \tag{21}

And the sampling distribution for the GLS estimator is also multivariate normal:

\hat{\boldsymbol{\beta}}_{\textsf{GLS}} \mid \mathbf{X} \sim \mathcal{N}\left(\boldsymbol{\beta}, \sigma^2 (\mathbf{X}^{\top} \boldsymbol{\Omega}^{-1} \mathbf{X})^{-1} \right). \tag{22}

Once again, this logic just follows the standard derivations for the OLS sampling distribution.
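
Equation 22 also explains the t-statistics reported in the example. Plugging s_{\textsf{GLS}}^2 in for the unknown \sigma^2, the standard errors are the square roots of the diagonal of s_{\textsf{GLS}}^2 (\mathbf{X}^{\top} \boldsymbol{\Omega}^{-1} \mathbf{X})^{-1}, and each t-statistic is a coefficient divided by its standard error. The sketch below (reusing Xc, Oinv, and s2_gls from the snippets above) should reproduce statsmodels' gls.bse and gls.tvalues:

# Plug the unbiased variance estimate into the covariance from Equation 22.
cov_gls = s2_gls * np.linalg.inv(Xc.T @ Oinv @ Xc)
se_gls  = np.sqrt(np.diag(cov_gls))

print(se_gls)               # compare with gls.bse
print(gls.params / se_gls)  # compare with gls.tvalues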

Conclusion

When our data have heteroscedastic (or correlated) errors and we know the structure of that covariance, we can use generalized least squares. This is just OLS applied to the normalized data. In practice, I suspect that knowing \boldsymbol{\Omega} precisely is rare, although perhaps it can be estimated from held-out data. Another approach, which I will discuss in a future post, is to use standard-error estimators that are robust to heteroscedasticity.

  1. Aitken, A. C. (1935). On least squares and linear combination of observations. Proceedings of the Royal Society of Edinburgh, 55, 42–48.