Weighted Least Squares

Weighted least squares (WLS) is a generalization of ordinary least squares in which each observation is assigned a weight, which scales the squared residual error. I discuss WLS and then derive its estimator in detail.

In ordinary least squares (OLS), we assume homoscedasticity, i.e. that our observations have constant variance. Let $n \in \{1, 2, \dots, N\}$ index independent samples, and let $\varepsilon_n$ denote the noise term for the $n$-th sample. Then this assumption can be expressed as

$$
\mathbb{V}[\varepsilon_n] = \sigma^2. \tag{1}
$$

However, in many practical problems of interest, the assumption of homoscedasticity does not hold. If we know the covariance structure of our data, then we can use generalized least squares (GLS) (Aitken, 1935). The GLS objective is to estimate linear coefficients $\boldsymbol{\beta}$ that minimize the sum of squared residuals while accounting for the covariance structure of the errors:

$$
\hat{\boldsymbol{\beta}}_{\textsf{GLS}} = \arg\!\min_{\boldsymbol{\beta}} \left\{ (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^{\top} \boldsymbol{\Omega}^{-1} (\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) \right\}. \tag{2}
$$

Here, $\mathbf{y}$ is an $N$-vector of responses, $\mathbf{X}$ is an $N \times P$ matrix of predictors, $\boldsymbol{\beta}$ is a $P$-vector of linear coefficients, and $\boldsymbol{\Omega}$ is an $N \times N$ matrix specifying the covariance structure of the error term, $\mathbb{V}[\boldsymbol{\varepsilon} \mid \mathbf{X}] = \sigma^2 \boldsymbol{\Omega}$.

A special case of GLS is weighted least squares (WLS), which assumes heteroscedasticity but with uncorrelated errors, i.e. the cross-covariance terms in $\boldsymbol{\Omega}$ are zero. Here, each observation is assigned a weight $w_n$ that scales the squared residual error:

$$
\hat{\boldsymbol{\beta}}_{\textsf{WLS}} = \arg\!\min_{\boldsymbol{\beta}} \left\{ \sum_{n=1}^N w_n \left( y_n - \mathbf{x}_n^{\top} \boldsymbol{\beta} \right)^2 \right\}. \tag{3}
$$

Clearly, when $w_n = 1 / \sigma_n^2$, we get a special case of GLS with uncorrelated errors. Thus, if we know $\sigma_n^2$ for each sample and use weights equal to the reciprocals of the variances, the WLS estimator is the best linear unbiased estimator (BLUE), since the GLS estimator is BLUE. Alternatively, if $w_n = 1$ for all $n$, WLS reduces to OLS.
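
To make the connection between Equations 2 and 3 concrete, here is a small NumPy sketch (synthetic data, illustrative variable names) checking that the weighted sum of squared residuals with $w_n = 1/\sigma_n^2$ equals the GLS quadratic form with a diagonal $\boldsymbol{\Omega}$:

```python
import numpy as np

rng = np.random.default_rng(0)
N, P = 100, 3

# Synthetic heteroscedastic data: each sample has its own noise variance.
X = rng.normal(size=(N, P))
beta_true = np.array([1.0, -2.0, 0.5])
sigma2 = rng.uniform(0.1, 2.0, size=N)        # per-sample variances
y = X @ beta_true + rng.normal(scale=np.sqrt(sigma2))

# Any candidate coefficient vector; the identity holds for all beta.
beta = rng.normal(size=P)
resid = y - X @ beta

# Equation 3's objective with inverse-variance weights ...
w = 1.0 / sigma2
wls_obj = np.sum(w * resid**2)

# ... equals Equation 2's quadratic form with Omega = diag(sigma_n^2).
Omega = np.diag(sigma2)
gls_obj = resid @ np.linalg.inv(Omega) @ resid

assert np.isclose(wls_obj, gls_obj)
```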

The goal of this post is to derive the WLS estimator $\hat{\boldsymbol{\beta}}_{\textsf{WLS}}$ in detail. I'll first work through the case of simple weighted linear regression and then the multivariate case.

Simple regression

Consider linear regression with a single independent variable, or simple linear regression,

$$
y_n = \alpha + \beta x_n + \varepsilon_n. \tag{4}
$$

$\beta$ is the model's slope, and $\alpha$ is the model's intercept. The standard objective is to minimize the sum of squared residuals,

$$
\begin{aligned}
J(\beta, \alpha) &= \sum_{n=1}^N (y_n - \alpha - \beta x_n)^2, \\
\hat{\alpha}, \hat{\beta} &= \arg\!\min_{\alpha, \beta} J(\beta, \alpha).
\end{aligned} \tag{5}
$$

However, in WLS, we minimize the weighted sum of squared residuals, where $J$ is now

$$
J(\beta, \alpha) = \sum_{n=1}^N w_n (y_n - \alpha - \beta x_n)^2. \tag{6}
$$

As with simple linear regression, we find the minimizers for $\alpha$ and $\beta$ by differentiating $J$ w.r.t. each parameter and setting the derivative equal to zero.

First, let's solve for the intercept $\alpha$. We take the derivative of the objective function w.r.t. $\alpha$,

$$
\begin{aligned}
\frac{d J}{d \alpha} &= \sum_{n=1}^N \frac{d}{d \alpha} w_n (y_n - \alpha - \beta x_n)^2 \\
&= \sum_{n=1}^N 2 w_n (y_n - \alpha - \beta x_n)(-1) \\
&= 2 \left( \sum_{n=1}^N \beta w_n x_n - w_n y_n + w_n \alpha \right),
\end{aligned} \tag{7}
$$

set this equal to zero, and then solve for $\alpha$:

$$
\begin{aligned}
0 &= 2 \left( \sum_{n=1}^N \beta w_n x_n - \sum_{n=1}^N w_n y_n + N_w \alpha \right), \\
&\Downarrow \\
\hat{\alpha} &= \bar{y}_w - \beta \bar{x}_w,
\end{aligned} \tag{8}
$$

where $N_w = \sum_n w_n$ and where $\bar{x}_w$ and $\bar{y}_w$ are the weighted sample means, e.g.

$$
\bar{x}_w = \frac{1}{N_w} \sum_{n=1}^N w_n x_n. \tag{9}
$$
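
As an aside, the weighted sample mean in Equation 9 is exactly what NumPy's `np.average` computes when given weights, and Equation 8 is easy to verify numerically. A tiny sketch with made-up numbers:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 2.5, 4.0, 5.5])
w = np.array([0.1, 0.4, 0.4, 0.1])   # arbitrary positive weights

# Equation 9: weighted sample mean.
N_w = w.sum()
x_bar_w = np.sum(w * x) / N_w
assert np.isclose(x_bar_w, np.average(x, weights=w))

# Equation 8: at alpha = y_bar_w - beta * x_bar_w (for any fixed beta),
# the derivative dJ/d(alpha) = -2 * sum(w * (y - alpha - beta * x)) vanishes.
beta = 0.7                            # any fixed slope
y_bar_w = np.average(y, weights=w)
alpha_hat = y_bar_w - beta * x_bar_w
assert np.isclose(np.sum(w * (y - alpha_hat - beta * x)), 0.0)
```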

Next, let's solve for the slope $\beta$. We take the derivative of our objective function w.r.t. $\beta$,

$$
\begin{aligned}
\frac{d J}{d \beta} &= \sum_{n=1}^N \frac{d}{d \beta} w_n (y_n - \alpha - \beta x_n)^2 \\
&= \sum_{n=1}^N 2 w_n (y_n - \alpha - \beta x_n)(-x_n) \\
&= 2 \left( \sum_{n=1}^N \beta w_n x_n^2 - w_n x_n y_n + \alpha w_n x_n \right),
\end{aligned} \tag{10}
$$

set this equal to zero, and solve, plugging in the value of $\alpha$ that we just computed:

$$
\begin{aligned}
0 &= 2 \left[ \sum_{n=1}^N \beta w_n x_n^2 - w_n x_n y_n + \alpha w_n x_n \right], \\
&\Downarrow \\
\alpha \sum_{n=1}^N w_n x_n + \beta \sum_{n=1}^N w_n x_n^2 &= \sum_{n=1}^N w_n x_n y_n \\
(\bar{y}_w - \bar{x}_w \beta) \sum_{n=1}^N w_n x_n + \beta \sum_{n=1}^N w_n x_n^2 &= \sum_{n=1}^N w_n x_n y_n \\
\bar{y}_w \sum_{n=1}^N w_n x_n - \bar{x}_w \beta \sum_{n=1}^N w_n x_n + \beta \sum_{n=1}^N w_n x_n^2 &= \sum_{n=1}^N w_n x_n y_n \\
\beta \left[ \sum_{n=1}^N w_n x_n^2 - \frac{1}{N_w} \sum_{n=1}^N w_n x_n \sum_{n=1}^N w_n x_n \right] &= \sum_{n=1}^N w_n x_n y_n - \frac{1}{N_w} \sum_{n=1}^N w_n y_n \sum_{n=1}^N w_n x_n.
\end{aligned} \tag{11}
$$

Solving for $\hat{\beta}$, we get

$$
\hat{\beta} = \frac{\sum_{n=1}^N w_n x_n y_n - (1/N_w) \sum_{n=1}^N w_n y_n \sum_{n=1}^N w_n x_n}{\sum_{n=1}^N w_n x_n^2 - (1/N_w) \sum_{n=1}^N w_n x_n \sum_{n=1}^N w_n x_n}. \tag{12}
$$

However, the WLS estimator $\hat{\beta}$ is often written as

$$
\hat{\beta} = \frac{\sum_{n=1}^N w_n (x_n - \bar{x}_w)(y_n - \bar{y}_w)}{\sum_{n=1}^N w_n (x_n - \bar{x}_w)^2}. \tag{13}
$$

We can derive this representation from Equation 12 through some algebraic manipulation of both the numerator,

$$
\begin{aligned}
&\sum_{n=1}^N w_n x_n y_n - \frac{1}{N_w} \sum_{n=1}^N w_n y_n \sum_{n=1}^N w_n x_n \\
&= \sum_{n=1}^N w_n x_n y_n - \bar{y}_w \sum_{n=1}^N w_n x_n - \bar{x}_w \sum_{n=1}^N w_n y_n + N_w \frac{1}{N_w} \sum_{n=1}^N w_n y_n \frac{1}{N_w} \sum_{n=1}^N w_n x_n \\
&= \sum_{n=1}^N w_n x_n y_n - \bar{y}_w \sum_{n=1}^N w_n x_n - \bar{x}_w \sum_{n=1}^N w_n y_n + N_w \bar{y}_w \bar{x}_w \\
&= \sum_{n=1}^N w_n (x_n - \bar{x}_w)(y_n - \bar{y}_w),
\end{aligned} \tag{14}
$$

and the denominator:

$$
\begin{aligned}
&\sum_{n=1}^N w_n x_n^2 - \frac{1}{N_w} \sum_{n=1}^N w_n x_n \sum_{n=1}^N w_n x_n \\
&= \sum_{n=1}^N w_n x_n^2 - 2 \frac{1}{N_w} \sum_{n=1}^N w_n x_n \sum_{n=1}^N w_n x_n + N_w \frac{1}{N_w} \sum_{n=1}^N w_n x_n \frac{1}{N_w} \sum_{n=1}^N w_n x_n \\
&= \sum_{n=1}^N w_n x_n^2 - 2 \bar{x}_w \sum_{n=1}^N w_n x_n + N_w \bar{x}_w^2 \\
&= \sum_{n=1}^N w_n (x_n - \bar{x}_w)^2.
\end{aligned} \tag{15}
$$

To summarize, the WLS estimators $\hat{\alpha}$ and $\hat{\beta}$ are

$$
\begin{aligned}
\hat{\alpha} &= \bar{y}_w - \hat{\beta} \bar{x}_w, \\
\hat{\beta} &= \frac{\sum_{n=1}^N w_n (x_n - \bar{x}_w)(y_n - \bar{y}_w)}{\sum_{n=1}^N w_n (x_n - \bar{x}_w)^2}.
\end{aligned} \tag{16}
$$
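
As a sanity check on Equation 16, here is a short NumPy sketch (synthetic data, illustrative names). It computes $\hat{\alpha}$ and $\hat{\beta}$ from the weighted means and compares them to an ordinary least-squares fit on $\sqrt{w_n}$-scaled data, since minimizing Equation 6 is equivalent to minimizing $\sum_n \big( \sqrt{w_n}\, y_n - \sqrt{w_n}(\alpha + \beta x_n) \big)^2$:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 200

# Heteroscedastic simple-regression data.
x = rng.uniform(-3, 3, size=N)
sigma = 0.2 + 0.5 * np.abs(x)                    # noise grows with |x|
y = 1.5 + 2.0 * x + rng.normal(scale=sigma)
w = 1.0 / sigma**2                               # inverse-variance weights

# Closed-form WLS estimates (Equation 16).
N_w = w.sum()
x_bar_w = np.sum(w * x) / N_w
y_bar_w = np.sum(w * y) / N_w
beta_hat = np.sum(w * (x - x_bar_w) * (y - y_bar_w)) / np.sum(w * (x - x_bar_w)**2)
alpha_hat = y_bar_w - beta_hat * x_bar_w

# Equivalent OLS problem on sqrt(w)-scaled data, solved by least squares.
sw = np.sqrt(w)
A = np.column_stack([sw, sw * x])                # scaled [1, x] design
alpha_ls, beta_ls = np.linalg.lstsq(A, sw * y, rcond=None)[0]

assert np.allclose([alpha_hat, beta_hat], [alpha_ls, beta_ls])
```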

Multivariate linear regression

In weighted least squares with multivariate predictors, the objective is to minimize

$$
J(\boldsymbol{\beta}) = \sum_{n=1}^N w_n \left( y_n - \mathbf{x}_n^{\top} \boldsymbol{\beta} \right)^2. \tag{17}
$$

Here, we are ignoring the bias term $\alpha$, since this can be handled by adding a dummy predictor of all ones. We can represent this objective function via matrix-vector multiplications as

$$
J(\boldsymbol{\beta}) = \left( \mathbf{y} - \mathbf{X} \boldsymbol{\beta} \right)^{\top} \mathbf{W} \left( \mathbf{y} - \mathbf{X} \boldsymbol{\beta} \right), \tag{18}
$$

where $\mathbf{y}$ is an $N$-vector of response variables, $\mathbf{X}$ is an $N \times P$ matrix of predictors, $\boldsymbol{\beta}$ is a $P$-vector of coefficients, and $\mathbf{W}$ is an $N \times N$ diagonal matrix whose diagonal is filled with the $N$ weights. To convince ourselves that Equation 18 is correct, we can write it out explicitly:

$$
\begin{bmatrix} y_1 - \mathbf{x}_1^{\top} \boldsymbol{\beta} & \dots & y_N - \mathbf{x}_N^{\top} \boldsymbol{\beta} \end{bmatrix}
\begin{bmatrix} w_1 & 0 & \dots & 0 \\ 0 & w_2 & \dots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \dots & w_N \end{bmatrix}
\begin{bmatrix} y_1 - \mathbf{x}_1^{\top} \boldsymbol{\beta} \\ \vdots \\ y_N - \mathbf{x}_N^{\top} \boldsymbol{\beta} \end{bmatrix}. \tag{19}
$$
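
To convince ourselves numerically as well, here is a quick sketch (synthetic data, illustrative names) checking that the quadratic form in Equation 18 equals the weighted sum in Equation 17:

```python
import numpy as np

rng = np.random.default_rng(2)
N, P = 50, 4

X = rng.normal(size=(N, P))
y = rng.normal(size=N)
w = rng.uniform(0.5, 2.0, size=N)
beta = rng.normal(size=P)

# Equation 17: weighted sum of squared residuals.
sum_form = np.sum(w * (y - X @ beta)**2)

# Equation 18: quadratic form with W = diag(w).
W = np.diag(w)
matrix_form = (y - X @ beta) @ W @ (y - X @ beta)

assert np.isclose(sum_form, matrix_form)
```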

To minimize $J(\cdot)$, we take its derivative with respect to $\boldsymbol{\beta}$, set it equal to zero, and solve for $\boldsymbol{\beta}$,

$$
\begin{aligned}
&\nabla J(\boldsymbol{\beta}) \\
&\stackrel{1}{=} \nabla \Big[ (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^{\top} \mathbf{W} (\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) \Big] \\
&\stackrel{2}{=} \nabla \Big[ (\mathbf{y}^{\top} - \boldsymbol{\beta}^{\top} \mathbf{X}^{\top}) \mathbf{W} (\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) \Big] \\
&\stackrel{3}{=} \nabla \Big[ \boldsymbol{\beta}^{\top} \mathbf{X}^{\top} \mathbf{W} \mathbf{X}\boldsymbol{\beta} - \mathbf{y}^{\top} \mathbf{W} \mathbf{X} \boldsymbol{\beta} + \mathbf{y}^{\top} \mathbf{W} \mathbf{y} - \boldsymbol{\beta}^{\top} \mathbf{X}^{\top} \mathbf{W} \mathbf{y} \Big] \\
&\stackrel{4}{=} \nabla \,\text{tr}\big( \boldsymbol{\beta}^{\top} \mathbf{X}^{\top} \mathbf{W} \mathbf{X}\boldsymbol{\beta} - \mathbf{y}^{\top} \mathbf{W} \mathbf{X} \boldsymbol{\beta} + \mathbf{y}^{\top} \mathbf{W} \mathbf{y} - \boldsymbol{\beta}^{\top} \mathbf{X}^{\top} \mathbf{W} \mathbf{y} \big) \\
&\stackrel{5}{=} \nabla \,\text{tr}\big( \boldsymbol{\beta}^{\top} \mathbf{X}^{\top} \mathbf{W} \mathbf{X} \boldsymbol{\beta} \big) - \nabla \,\text{tr}\big(\mathbf{y}^{\top} \mathbf{W} \mathbf{X} \boldsymbol{\beta}\big) + \cancel{\nabla \,\text{tr}\big(\mathbf{y}^{\top} \mathbf{W} \mathbf{y}\big)} - \nabla \,\text{tr}\big(\boldsymbol{\beta}^{\top} \mathbf{X}^{\top} \mathbf{W} \mathbf{y}\big) \\
&\stackrel{6}{=} \nabla \,\text{tr}\big( \boldsymbol{\beta}^{\top} \mathbf{X}^{\top} \mathbf{W} \mathbf{X}\boldsymbol{\beta} \big) - 2 \nabla \,\text{tr}\big(\boldsymbol{\beta}^{\top}\mathbf{X}^{\top} \mathbf{W} \mathbf{y}\big) \\
&\stackrel{7}{=} 2 \mathbf{X}^{\top} \mathbf{W} \mathbf{X} \boldsymbol{\beta} - 2 \mathbf{X}^{\top} \mathbf{W} \mathbf{y}.
\end{aligned} \tag{20}
$$

This is identical to the derivation of the OLS normal equation, except that we must carry the matrix $\mathbf{W}$ through the derivation. In step 4, we use the fact that the trace of a scalar is the scalar itself. In step 5, we use the linearity of differentiation and of the trace operator. In step 6, we use the fact that $\text{tr}(\mathbf{A}) = \text{tr}(\mathbf{A}^{\top})$. In step 7, we take the derivatives of the left and right terms using identities 108 and 103 from (Petersen et al., 2008), respectively.

If we set line 7 equal to zero and solve for $\boldsymbol{\beta}$, we get the WLS normal equation:

$$
\boldsymbol{\beta} = \left( \mathbf{X}^{\top} \mathbf{W} \mathbf{X} \right)^{-1} \mathbf{X}^{\top} \mathbf{W} \mathbf{y}. \tag{21}
$$
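
In code, I would avoid forming the inverse explicitly and instead solve the linear system $(\mathbf{X}^{\top}\mathbf{W}\mathbf{X})\boldsymbol{\beta} = \mathbf{X}^{\top}\mathbf{W}\mathbf{y}$. Here is a short NumPy sketch (synthetic data, illustrative names) that does so and cross-checks the result against ordinary least squares on $\sqrt{w_n}$-scaled data:

```python
import numpy as np

rng = np.random.default_rng(3)
N, P = 500, 3

# Heteroscedastic data; the first column of X is all ones for the intercept.
X = np.column_stack([np.ones(N), rng.normal(size=(N, P - 1))])
beta_true = np.array([0.5, 2.0, -1.0])
sigma = rng.uniform(0.2, 2.0, size=N)
y = X @ beta_true + rng.normal(scale=sigma)
w = 1.0 / sigma**2                               # inverse-variance weights

# Equation 21, computed by solving (X^T W X) beta = X^T W y.
XtW = X.T * w                                    # same as X.T @ diag(w)
beta_wls = np.linalg.solve(XtW @ X, XtW @ y)

# Cross-check: OLS on sqrt(w)-scaled data minimizes the same objective.
sw = np.sqrt(w)[:, None]
beta_check = np.linalg.lstsq(sw * X, np.sqrt(w) * y, rcond=None)[0]

assert np.allclose(beta_wls, beta_check)
```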

And we’re done.

  1. Aitken, A. C. (1935). On least squares and linear combination of observations. Proceedings of the Royal Society of Edinburgh, 55, 42–48.
  2. Petersen, K. B., Pedersen, M. S., & others. (2008). The matrix cookbook. Technical University of Denmark, 7(15), 510.