Weighted Least Squares

Weighted least squares (WLS) is a generalization of ordinary least squares in which each observation is assigned a weight, which scales the squared residual error. I discuss WLS and then derive its estimator in detail.

In ordinary least squares (OLS), we assume homoscedasticity, i.e. that our observations have constant variance. Let $n \in \{1, 2, \dots, N\}$ index independent samples, and let $\varepsilon_n$ denote the noise term for the $n$-th sample. Then this assumption can be expressed as

$$
\mathbb{V}[\varepsilon_n] = \sigma^2. \tag{1}
$$

However, in many practical problems of interest, the assumption of homoscedasticity does not hold. If we know the covariance structure of our data, then we can use generalized least squares (GLS) (Aitken, 1935). The GLS objective is to estimate linear coefficients $\boldsymbol{\beta}$ that minimize the sum of squared residuals while accounting for the covariance structure of the errors:

$$
\hat{\boldsymbol{\beta}}_{\textsf{GLS}} = \arg\!\min_{\boldsymbol{\beta}} \left\{ (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^{\top} \boldsymbol{\Omega}^{-1} (\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) \right\}. \tag{2}
$$

Here, $\mathbf{y}$ is an $N$-vector of responses, $\mathbf{X}$ is an $N \times P$ matrix of predictors, $\boldsymbol{\beta}$ is a $P$-vector of linear coefficients, and $\boldsymbol{\Omega}$ is an $N \times N$ matrix specifying the covariance structure of the error term, $\mathbb{V}[\boldsymbol{\varepsilon} \mid \mathbf{X}] = \sigma^2 \boldsymbol{\Omega}$.

A special case of GLS is weighted least squares (WLS), which assumes heteroscedasticity but with uncorrelated errors, i.e. the cross-covariance terms in $\boldsymbol{\Omega}$ are zero. Here, each observation is assigned a weight $w_n$ that scales the squared residual error:

$$
\hat{\boldsymbol{\beta}}_{\textsf{WLS}} = \arg\!\min_{\boldsymbol{\beta}} \left\{ \sum_{n=1}^N w_n \left( y_n - \mathbf{x}_n^{\top} \boldsymbol{\beta} \right)^2 \right\}. \tag{3}
$$

Clearly, when $w_n = 1 / \sigma_n^2$, we get a special case of GLS with uncorrelated errors. Thus, if we know $\sigma_n^2$ for each sample and use weights equal to the reciprocals of the variances, the WLS estimator is the best linear unbiased estimator (BLUE), since the GLS estimator is BLUE. Alternatively, if $w_n = 1$ for all $n$, WLS reduces to OLS.
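
To make the connection between Equations 2 and 3 concrete, here is a small NumPy sketch (synthetic data, illustrative variable names) checking that the weighted sum of squared residuals with $w_n = 1/\sigma_n^2$ equals the GLS quadratic form with a diagonal $\boldsymbol{\Omega}$:

```python
import numpy as np

rng = np.random.default_rng(0)
N, P = 100, 3

# Synthetic heteroscedastic data: each sample has its own noise variance.
X = rng.normal(size=(N, P))
beta_true = np.array([1.0, -2.0, 0.5])
sigma2 = rng.uniform(0.1, 2.0, size=N)        # per-sample variances
y = X @ beta_true + rng.normal(scale=np.sqrt(sigma2))

# Any candidate coefficient vector; the identity holds for all beta.
beta = rng.normal(size=P)
resid = y - X @ beta

# Equation 3's objective with inverse-variance weights ...
w = 1.0 / sigma2
wls_obj = np.sum(w * resid**2)

# ... equals Equation 2's quadratic form with Omega = diag(sigma_n^2).
Omega = np.diag(sigma2)
gls_obj = resid @ np.linalg.inv(Omega) @ resid

assert np.isclose(wls_obj, gls_obj)
```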

The goal of this post is to derive the WLS estimator $\hat{\boldsymbol{\beta}}_{\textsf{WLS}}$ in detail. I'll first work through the case of simple weighted linear regression and then the multivariate case.

Simple regression

Consider linear regression with a single independent variable, or simple linear regression,

$$
y_n = \alpha + \beta x_n + \varepsilon_n. \tag{4}
$$

$\beta$ is the model's slope, and $\alpha$ is the model's intercept. The standard objective is to minimize the sum of squared residuals,

$$
\begin{aligned}
J(\beta, \alpha) &= \sum_{n=1}^N (y_n - \alpha - \beta x_n)^2, \\
\hat{\alpha}, \hat{\beta} &= \arg\!\min_{\alpha, \beta} J(\beta, \alpha).
\end{aligned} \tag{5}
$$

However, in WLS, we minimize the weighted sum of squared residuals, where $J$ is now

$$
J(\beta, \alpha) = \sum_{n=1}^N w_n (y_n - \alpha - \beta x_n)^2. \tag{6}
$$

As with simple linear regression, we find the minimizers for $\alpha$ and $\beta$ by differentiating $J$ w.r.t. each parameter and setting the derivative equal to zero.

First, let's solve for the intercept $\alpha$. We take the derivative of the objective function w.r.t. $\alpha$,

$$
\begin{aligned}
\frac{d J}{d \alpha} &= \sum_{n=1}^N \frac{d}{d \alpha} w_n (y_n - \alpha - \beta x_n)^2 \\
&= \sum_{n=1}^N 2 w_n (y_n - \alpha - \beta x_n)(-1) \\
&= 2 \left( \sum_{n=1}^N \beta w_n x_n - w_n y_n + w_n \alpha \right),
\end{aligned} \tag{7}
$$

set this equal to zero, and then solve for $\alpha$:

$$
\begin{aligned}
0 &= 2 \left( \sum_{n=1}^N \beta w_n x_n - \sum_{n=1}^N w_n y_n + N_w \alpha \right), \\
&\Downarrow \\
\hat{\alpha} &= \bar{y}_w - \beta \bar{x}_w,
\end{aligned} \tag{8}
$$

where $N_w = \sum_n w_n$ and where $\bar{x}_w$ and $\bar{y}_w$ are the weighted sample means, e.g.

$$
\bar{x}_w = \frac{1}{N_w} \sum_{n=1}^N w_n x_n. \tag{9}
$$
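
As an aside, the weighted sample mean in Equation 9 is exactly what NumPy's `np.average` computes when given weights, and Equation 8 is easy to verify numerically. A tiny sketch with made-up numbers:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 2.5, 4.0, 5.5])
w = np.array([0.1, 0.4, 0.4, 0.1])   # arbitrary positive weights

# Equation 9: weighted sample mean.
N_w = w.sum()
x_bar_w = np.sum(w * x) / N_w
assert np.isclose(x_bar_w, np.average(x, weights=w))

# Equation 8: at alpha = y_bar_w - beta * x_bar_w (for any fixed beta),
# the derivative dJ/d(alpha) = -2 * sum(w * (y - alpha - beta * x)) vanishes.
beta = 0.7                            # any fixed slope
y_bar_w = np.average(y, weights=w)
alpha_hat = y_bar_w - beta * x_bar_w
assert np.isclose(np.sum(w * (y - alpha_hat - beta * x)), 0.0)
```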

Next, let's solve for the slope $\beta$. We take the derivative of our objective function w.r.t. $\beta$,

$$
\begin{aligned}
\frac{d J}{d \beta} &= \sum_{n=1}^N \frac{d}{d \beta} w_n (y_n - \alpha - \beta x_n)^2 \\
&= \sum_{n=1}^N 2 w_n (y_n - \alpha - \beta x_n)(-x_n) \\
&= 2 \left( \sum_{n=1}^N \beta w_n x_n^2 - w_n x_n y_n + \alpha w_n x_n \right),
\end{aligned} \tag{10}
$$

set this equal to zero, and solve, plugging in the value of $\alpha$ that we just computed:

$$
\begin{aligned}
0 &= 2 \left[ \sum_{n=1}^N \beta w_n x_n^2 - w_n x_n y_n + \alpha w_n x_n \right], \\
&\Downarrow \\
\alpha \sum_{n=1}^N w_n x_n + \beta \sum_{n=1}^N w_n x_n^2 &= \sum_{n=1}^N w_n x_n y_n \\
(\bar{y}_w - \bar{x}_w \beta) \sum_{n=1}^N w_n x_n + \beta \sum_{n=1}^N w_n x_n^2 &= \sum_{n=1}^N w_n x_n y_n \\
\bar{y}_w \sum_{n=1}^N w_n x_n - \bar{x}_w \beta \sum_{n=1}^N w_n x_n + \beta \sum_{n=1}^N w_n x_n^2 &= \sum_{n=1}^N w_n x_n y_n \\
\beta \left[ \sum_{n=1}^N w_n x_n^2 - \frac{1}{N_w} \sum_{n=1}^N w_n x_n \sum_{n=1}^N w_n x_n \right] &= \sum_{n=1}^N w_n x_n y_n - \frac{1}{N_w} \sum_{n=1}^N w_n y_n \sum_{n=1}^N w_n x_n.
\end{aligned} \tag{11}
$$

Solving for $\hat{\beta}$, we get

$$
\hat{\beta} = \frac{\sum_{n=1}^N w_n x_n y_n - (1/N_w) \sum_{n=1}^N w_n y_n \sum_{n=1}^N w_n x_n}{\sum_{n=1}^N w_n x_n^2 - (1/N_w) \sum_{n=1}^N w_n x_n \sum_{n=1}^N w_n x_n}. \tag{12}
$$

However, the WLS estimator $\hat{\beta}$ is often written as

$$
\hat{\beta} = \frac{\sum_{n=1}^N w_n (x_n - \bar{x}_w)(y_n - \bar{y}_w)}{\sum_{n=1}^N w_n (x_n - \bar{x}_w)^2}. \tag{13}
$$

We can derive this representation from Equation 12 through some algebraic manipulation of both the numerator,

$$
\begin{aligned}
&\sum_{n=1}^N w_n x_n y_n - \frac{1}{N_w} \sum_{n=1}^N w_n y_n \sum_{n=1}^N w_n x_n \\
&= \sum_{n=1}^N w_n x_n y_n - \bar{y}_w \sum_{n=1}^N w_n x_n - \bar{x}_w \sum_{n=1}^N w_n y_n + N_w \frac{1}{N_w} \sum_{n=1}^N w_n y_n \frac{1}{N_w} \sum_{n=1}^N w_n x_n \\
&= \sum_{n=1}^N w_n x_n y_n - \bar{y}_w \sum_{n=1}^N w_n x_n - \bar{x}_w \sum_{n=1}^N w_n y_n + N_w \bar{y}_w \bar{x}_w \\
&= \sum_{n=1}^N w_n (x_n - \bar{x}_w)(y_n - \bar{y}_w),
\end{aligned} \tag{14}
$$

and the denominator:

$$
\begin{aligned}
&\sum_{n=1}^N w_n x_n^2 - \frac{1}{N_w} \sum_{n=1}^N w_n x_n \sum_{n=1}^N w_n x_n \\
&= \sum_{n=1}^N w_n x_n^2 - 2 \frac{1}{N_w} \sum_{n=1}^N w_n x_n \sum_{n=1}^N w_n x_n + N_w \frac{1}{N_w} \sum_{n=1}^N w_n x_n \frac{1}{N_w} \sum_{n=1}^N w_n x_n \\
&= \sum_{n=1}^N w_n x_n^2 - 2 \bar{x}_w \sum_{n=1}^N w_n x_n + N_w \bar{x}_w^2 \\
&= \sum_{n=1}^N w_n (x_n - \bar{x}_w)^2.
\end{aligned} \tag{15}
$$

To summarize, the WLS estimators $\hat{\alpha}$ and $\hat{\beta}$ are

$$
\begin{aligned}
\hat{\alpha} &= \bar{y}_w - \hat{\beta} \bar{x}_w, \\
\hat{\beta} &= \frac{\sum_{n=1}^N w_n (x_n - \bar{x}_w)(y_n - \bar{y}_w)}{\sum_{n=1}^N w_n (x_n - \bar{x}_w)^2}.
\end{aligned} \tag{16}
$$
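
As a sanity check on Equation 16, here is a short NumPy sketch (synthetic data, illustrative names). It computes $\hat{\alpha}$ and $\hat{\beta}$ from the weighted means and compares them to an ordinary least-squares fit on $\sqrt{w_n}$-scaled data, since minimizing Equation 6 is equivalent to minimizing $\sum_n \big( \sqrt{w_n}\, y_n - \sqrt{w_n}(\alpha + \beta x_n) \big)^2$:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 200

# Heteroscedastic simple-regression data.
x = rng.uniform(-3, 3, size=N)
sigma = 0.2 + 0.5 * np.abs(x)                    # noise grows with |x|
y = 1.5 + 2.0 * x + rng.normal(scale=sigma)
w = 1.0 / sigma**2                               # inverse-variance weights

# Closed-form WLS estimates (Equation 16).
N_w = w.sum()
x_bar_w = np.sum(w * x) / N_w
y_bar_w = np.sum(w * y) / N_w
beta_hat = np.sum(w * (x - x_bar_w) * (y - y_bar_w)) / np.sum(w * (x - x_bar_w)**2)
alpha_hat = y_bar_w - beta_hat * x_bar_w

# Equivalent OLS problem on sqrt(w)-scaled data, solved by least squares.
sw = np.sqrt(w)
A = np.column_stack([sw, sw * x])                # scaled [1, x] design
alpha_ls, beta_ls = np.linalg.lstsq(A, sw * y, rcond=None)[0]

assert np.allclose([alpha_hat, beta_hat], [alpha_ls, beta_ls])
```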

Multivariate linear regression

In weighted least squares with multivariate predictors, the objective is to minimize

$$
J(\boldsymbol{\beta}) = \sum_{n=1}^N w_n \left( y_n - \mathbf{x}_n^{\top} \boldsymbol{\beta} \right)^2. \tag{17}
$$

Here, we are ignoring the bias term $\alpha$, since this can be handled by adding a dummy predictor of all ones. We can represent this objective function via matrix-vector multiplications as

$$
J(\boldsymbol{\beta}) = \left( \mathbf{y} - \mathbf{X} \boldsymbol{\beta} \right)^{\top} \mathbf{W} \left( \mathbf{y} - \mathbf{X} \boldsymbol{\beta} \right), \tag{18}
$$

where $\mathbf{y}$ is an $N$-vector of response variables, $\mathbf{X}$ is an $N \times P$ matrix of predictors, $\boldsymbol{\beta}$ is a $P$-vector of coefficients, and $\mathbf{W}$ is an $N \times N$ diagonal matrix whose diagonal is filled with the $N$ weights. To convince ourselves that Equation 18 is correct, we can write it out explicitly:

$$
\begin{bmatrix} y_1 - \mathbf{x}_1^{\top} \boldsymbol{\beta} & \dots & y_N - \mathbf{x}_N^{\top} \boldsymbol{\beta} \end{bmatrix}
\begin{bmatrix} w_1 & 0 & \dots & 0 \\ 0 & w_2 & \dots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \dots & w_N \end{bmatrix}
\begin{bmatrix} y_1 - \mathbf{x}_1^{\top} \boldsymbol{\beta} \\ \vdots \\ y_N - \mathbf{x}_N^{\top} \boldsymbol{\beta} \end{bmatrix}. \tag{19}
$$
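
To convince ourselves numerically as well, here is a quick sketch (synthetic data, illustrative names) checking that the quadratic form in Equation 18 equals the weighted sum in Equation 17:

```python
import numpy as np

rng = np.random.default_rng(2)
N, P = 50, 4

X = rng.normal(size=(N, P))
y = rng.normal(size=N)
w = rng.uniform(0.5, 2.0, size=N)
beta = rng.normal(size=P)

# Equation 17: weighted sum of squared residuals.
sum_form = np.sum(w * (y - X @ beta)**2)

# Equation 18: quadratic form with W = diag(w).
W = np.diag(w)
matrix_form = (y - X @ beta) @ W @ (y - X @ beta)

assert np.isclose(sum_form, matrix_form)
```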

To minimize $J(\cdot)$, we take its derivative with respect to $\boldsymbol{\beta}$, set it equal to zero, and solve for $\boldsymbol{\beta}$,

$$
\begin{aligned}
&\nabla J(\boldsymbol{\beta}) \\
&\stackrel{1}{=} \nabla \Big[ (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^{\top} \mathbf{W} (\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) \Big] \\
&\stackrel{2}{=} \nabla \Big[ (\mathbf{y}^{\top} - \boldsymbol{\beta}^{\top} \mathbf{X}^{\top}) \mathbf{W} (\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) \Big] \\
&\stackrel{3}{=} \nabla \Big[ \boldsymbol{\beta}^{\top} \mathbf{X}^{\top} \mathbf{W} \mathbf{X}\boldsymbol{\beta} - \mathbf{y}^{\top} \mathbf{W} \mathbf{X} \boldsymbol{\beta} + \mathbf{y}^{\top} \mathbf{W} \mathbf{y} - \boldsymbol{\beta}^{\top} \mathbf{X}^{\top} \mathbf{W} \mathbf{y} \Big] \\
&\stackrel{4}{=} \nabla \,\text{tr}\big( \boldsymbol{\beta}^{\top} \mathbf{X}^{\top} \mathbf{W} \mathbf{X}\boldsymbol{\beta} - \mathbf{y}^{\top} \mathbf{W} \mathbf{X} \boldsymbol{\beta} + \mathbf{y}^{\top} \mathbf{W} \mathbf{y} - \boldsymbol{\beta}^{\top} \mathbf{X}^{\top} \mathbf{W} \mathbf{y} \big) \\
&\stackrel{5}{=} \nabla \,\text{tr}\big( \boldsymbol{\beta}^{\top} \mathbf{X}^{\top} \mathbf{W} \mathbf{X} \boldsymbol{\beta} \big) - \nabla \,\text{tr}\big(\mathbf{y}^{\top} \mathbf{W} \mathbf{X} \boldsymbol{\beta}\big) + \cancel{\nabla \,\text{tr}\big(\mathbf{y}^{\top} \mathbf{W} \mathbf{y}\big)} - \nabla \,\text{tr}\big(\boldsymbol{\beta}^{\top} \mathbf{X}^{\top} \mathbf{W} \mathbf{y}\big) \\
&\stackrel{6}{=} \nabla \,\text{tr}\big( \boldsymbol{\beta}^{\top} \mathbf{X}^{\top} \mathbf{W} \mathbf{X}\boldsymbol{\beta} \big) - 2 \nabla \,\text{tr}\big(\boldsymbol{\beta}^{\top}\mathbf{X}^{\top} \mathbf{W} \mathbf{y}\big) \\
&\stackrel{7}{=} 2 \mathbf{X}^{\top} \mathbf{W} \mathbf{X} \boldsymbol{\beta} - 2 \mathbf{X}^{\top} \mathbf{W} \mathbf{y}.
\end{aligned} \tag{20}
$$

This is identical to the derivation of the OLS normal equation, except that we must carry the matrix $\mathbf{W}$ through the derivation. In step 4, we use the fact that the trace of a scalar is the scalar itself. In step 5, we use the linearity of differentiation and of the trace operator. In step 6, we use the fact that $\text{tr}(\mathbf{A}) = \text{tr}(\mathbf{A}^{\top})$. In step 7, we take the derivatives of the left and right terms using identities 108 and 103 from (Petersen et al., 2008), respectively.

If we set line 7 equal to zero and solve for $\boldsymbol{\beta}$, we get the WLS normal equation:

$$
\boldsymbol{\beta} = \left( \mathbf{X}^{\top} \mathbf{W} \mathbf{X} \right)^{-1} \mathbf{X}^{\top} \mathbf{W} \mathbf{y}. \tag{21}
$$
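
In code, I would avoid forming the inverse explicitly and instead solve the linear system $(\mathbf{X}^{\top}\mathbf{W}\mathbf{X})\boldsymbol{\beta} = \mathbf{X}^{\top}\mathbf{W}\mathbf{y}$. Here is a short NumPy sketch (synthetic data, illustrative names) that does so and cross-checks the result against ordinary least squares on $\sqrt{w_n}$-scaled data:

```python
import numpy as np

rng = np.random.default_rng(3)
N, P = 500, 3

# Heteroscedastic data; the first column of X is all ones for the intercept.
X = np.column_stack([np.ones(N), rng.normal(size=(N, P - 1))])
beta_true = np.array([0.5, 2.0, -1.0])
sigma = rng.uniform(0.2, 2.0, size=N)
y = X @ beta_true + rng.normal(scale=sigma)
w = 1.0 / sigma**2                               # inverse-variance weights

# Equation 21, computed by solving (X^T W X) beta = X^T W y.
XtW = X.T * w                                    # same as X.T @ diag(w)
beta_wls = np.linalg.solve(XtW @ X, XtW @ y)

# Cross-check: OLS on sqrt(w)-scaled data minimizes the same objective.
sw = np.sqrt(w)[:, None]
beta_check = np.linalg.lstsq(sw * X, np.sqrt(w) * y, rcond=None)[0]

assert np.allclose(beta_wls, beta_check)
```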

And we’re done.

  1. Aitken, A. C. (1935). On least squares and linear combination of observations. Proceedings of the Royal Society of Edinburgh, 55, 42–48.
  2. Petersen, K. B., Pedersen, M. S., & others. (2008). The matrix cookbook. Technical University of Denmark, 7(15), 510.