Weighted least squares (WLS) is a generalization of ordinary least squares in which each observation is assigned a weight that scales its squared residual error. I discuss WLS and then derive its estimator in detail.
Published
09 August 2022
In ordinary least squares (OLS), we assume homoscedasticity, meaning that our observations have constant variance. Let $n \in \{1, 2, \dots, N\}$ index independent samples, and let $\varepsilon_n$ denote the noise term for the $n$-th sample. Then this assumption can be expressed as

$$
\mathbb{V}[\varepsilon_n] = \sigma^2. \tag{1}
$$
However, in many practical problems of interest, the assumption of homoscedasticity does not hold. If we know the covariance structure of our data, then we can use generalized least squares (GLS) (Aitken, 1935). The GLS objective is to estimate linear coefficients $\beta$ that minimize the sum of squared residuals while accounting for the covariance structure of the errors:

$$
\hat{\beta}_{\text{GLS}} = \arg\min_{\beta} \left\{ (y - X\beta)^{\top} \Omega^{-1} (y - X\beta) \right\}. \tag{2}
$$

Here, $y$ is an $N$-vector of responses, $X$ is an $N \times P$ matrix of predictors, $\beta$ is a $P$-vector of linear coefficients, and $\Omega$ is an $N \times N$ matrix specifying the covariance structure of the error term, $\mathbb{V}[\varepsilon \mid X] = \sigma^2 \Omega$.
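Although this post only derives the WLS case, it is worth noting for reference that the minimizer of Equation 2 is the standard GLS estimator, which has the same form as the WLS estimator derived below but with $\Omega^{-1}$ in place of a diagonal weight matrix:

$$
\hat{\beta}_{\text{GLS}} = \left( X^{\top} \Omega^{-1} X \right)^{-1} X^{\top} \Omega^{-1} y.
$$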
A special case of GLS is weighted least squares (WLS), which assumes heteroscedasticity but with uncorrelated errors, i.e., the cross-covariance (off-diagonal) terms in $\Omega$ are zero. Here, each observation is assigned a weight $w_n$ that scales its squared residual error:

$$
\hat{\beta}_{\text{WLS}} = \arg\min_{\beta} \left\{ \sum_{n=1}^{N} w_n \left( y_n - x_n^{\top} \beta \right)^2 \right\}. \tag{3}
$$

Clearly, when $w_n = 1/\sigma_n^2$, we get a special case of GLS with uncorrelated errors. Thus, if we know $\sigma_n^2$ for each sample and use weights that are the reciprocals of the variances, the WLS estimator is the best linear unbiased estimator (BLUE), since the GLS estimator is BLUE. Alternatively, if $w_n = 1$ for all $n$, WLS reduces to OLS.
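To make the connection to Equation 2 explicit, suppose the errors are uncorrelated so that $\Omega$ is diagonal. Choosing $W = \Omega^{-1}$, i.e., $w_n = 1/\sigma_n^2$, turns the GLS objective into the WLS objective:

$$
\Omega = \begin{bmatrix} \sigma_1^2 & & \\ & \ddots & \\ & & \sigma_N^2 \end{bmatrix}
\quad \Longrightarrow \quad
(y - X\beta)^{\top} \Omega^{-1} (y - X\beta)
= \sum_{n=1}^{N} \frac{\left( y_n - x_n^{\top} \beta \right)^2}{\sigma_n^2}
= \sum_{n=1}^{N} w_n \left( y_n - x_n^{\top} \beta \right)^2.
$$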
The goal of this post is to derive the WLS estimator $\hat{\beta}_{\text{WLS}}$ in detail. I’ll first work through the case of simple weighted linear regression and then work through the multivariate case.
Let’s start with simple weighted linear regression, with a scalar predictor $x_n$, slope $\beta$, and intercept $\alpha$. In WLS, we minimize the weighted sum of squared residuals, so the objective function $J$ is

$$
J(\beta, \alpha) = \sum_{n=1}^{N} w_n \left( y_n - \alpha - \beta x_n \right)^2. \tag{6}
$$
As in ordinary (unweighted) simple linear regression, we find the minimizers of $\alpha$ and $\beta$ by differentiating $J$ w.r.t. each parameter and setting that derivative equal to zero.

First, let’s solve for the intercept $\alpha$. We take the derivative of the objective function w.r.t. $\alpha$ and set it equal to zero,

$$
\frac{\partial J}{\partial \alpha} = -2 \sum_{n=1}^{N} w_n \left( y_n - \alpha - \beta x_n \right) = 0,
$$

and solve for $\alpha$:

$$
\alpha = \frac{\sum_{n=1}^{N} w_n y_n - \beta \sum_{n=1}^{N} w_n x_n}{\sum_{n=1}^{N} w_n} = \bar{y}_w - \beta \bar{x}_w,
$$

where $\bar{y}_w = \sum_n w_n y_n / \sum_n w_n$ and $\bar{x}_w = \sum_n w_n x_n / \sum_n w_n$ denote the weighted means of the responses and predictors, respectively.
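An analogous computation for the slope, differentiating $J$ w.r.t. $\beta$, setting the derivative to zero, and substituting the expression for $\alpha$ above, gives the weighted counterpart of the familiar OLS slope estimator (stated here without the intermediate algebra):

$$
\hat{\beta} = \frac{\sum_{n=1}^{N} w_n (x_n - \bar{x}_w)(y_n - \bar{y}_w)}{\sum_{n=1}^{N} w_n (x_n - \bar{x}_w)^2},
\qquad
\hat{\alpha} = \bar{y}_w - \hat{\beta} \bar{x}_w.
$$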
In weighted least squares with multivariate predictors, the objective is to minimize
$$
J(\beta) = \sum_{n=1}^{N} w_n \left( y_n - x_n^{\top} \beta \right)^2. \tag{17}
$$

Here, we are ignoring the bias term $\alpha$, since this can be handled by adding a dummy predictor of all ones. We can represent this objective function via matrix-vector multiplications as:

$$
J(\beta) = (y - X\beta)^{\top} W (y - X\beta), \tag{18}
$$

where $y$ is an $N$-vector of response variables, $X$ is an $N \times P$ matrix of predictors, $\beta$ is a $P$-vector, and $W$ is an $N \times N$ diagonal matrix whose diagonal is filled with the $N$ weights. To convince ourselves that Equation 18 is correct, we can write this out explicitly:

$$
(y - X\beta)^{\top} W (y - X\beta)
=
\begin{bmatrix} y_1 - x_1^{\top}\beta & \cdots & y_N - x_N^{\top}\beta \end{bmatrix}
\begin{bmatrix} w_1 & & \\ & \ddots & \\ & & w_N \end{bmatrix}
\begin{bmatrix} y_1 - x_1^{\top}\beta \\ \vdots \\ y_N - x_N^{\top}\beta \end{bmatrix}
= \sum_{n=1}^{N} w_n \left( y_n - x_n^{\top} \beta \right)^2. \tag{19}
$$
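To find the minimizer, we take the gradient of Equation 18 with respect to $\beta$, one step per line (in the third line, we drop $y^{\top} W y$ since it does not depend on $\beta$):

$$
\begin{aligned}
\nabla_{\beta} J(\beta)
&= \nabla_{\beta} \left[ (y - X\beta)^{\top} W (y - X\beta) \right]
\\
&= \nabla_{\beta} \left[ y^{\top} W y - y^{\top} W X \beta - \beta^{\top} X^{\top} W y + \beta^{\top} X^{\top} W X \beta \right]
\\
&= \nabla_{\beta} \left[ \beta^{\top} X^{\top} W X \beta - y^{\top} W X \beta - \beta^{\top} X^{\top} W y \right]
\\
&= \nabla_{\beta} \, \text{tr}\!\left( \beta^{\top} X^{\top} W X \beta - y^{\top} W X \beta - \beta^{\top} X^{\top} W y \right)
\\
&= \nabla_{\beta} \, \text{tr}\!\left( \beta^{\top} X^{\top} W X \beta \right) - \nabla_{\beta} \, \text{tr}\!\left( y^{\top} W X \beta \right) - \nabla_{\beta} \, \text{tr}\!\left( \beta^{\top} X^{\top} W y \right)
\\
&= \nabla_{\beta} \, \text{tr}\!\left( \beta^{\top} X^{\top} W X \beta \right) - 2 \nabla_{\beta} \, \text{tr}\!\left( \beta^{\top} X^{\top} W y \right)
\\
&= 2 X^{\top} W X \beta - 2 X^{\top} W y.
\end{aligned}
\tag{20}
$$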
This is identical to the derivation of the OLS normal equation, except that we must carry the matrix $W$ through each step. In step 4, we use the fact that the trace of a scalar is the scalar. In step 5, we use the linearity of differentiation and of the trace operator. In step 6, we use the fact that $\text{tr}(A) = \text{tr}(A^{\top})$ and that $W$ is symmetric. In step 7, we take the derivatives of the left and right terms using identities 108 and 103 from (Petersen & Pedersen, 2008), respectively.
If we set line 7 equal to zero and solve for $\beta$, we get the WLS normal equation:

$$
\beta = \left( X^{\top} W X \right)^{-1} X^{\top} W y. \tag{21}
$$
And we’re done.
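As a quick numerical sanity check, here is a minimal NumPy sketch (the simulated data and variable names are my own, not part of the derivation) that computes Equation 21 directly and compares it against the equivalent approach of rescaling each row by $\sqrt{w_n}$ and running OLS on the transformed data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate heteroscedastic data: y = X @ beta_true + noise with per-sample variance.
N, P = 200, 3
X = rng.normal(size=(N, P))
beta_true = np.array([2.0, -1.0, 0.5])
sigma2 = rng.uniform(0.1, 2.0, size=N)        # per-sample noise variances
y = X @ beta_true + rng.normal(scale=np.sqrt(sigma2))

# WLS normal equation (Equation 21) with weights w_n = 1 / sigma_n^2.
w = 1.0 / sigma2
W = np.diag(w)
beta_wls = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)

# Equivalent formulation: scale each row by sqrt(w_n), then solve ordinary least squares.
X_scaled = X * np.sqrt(w)[:, None]
y_scaled = y * np.sqrt(w)
beta_ols_scaled, *_ = np.linalg.lstsq(X_scaled, y_scaled, rcond=None)

assert np.allclose(beta_wls, beta_ols_scaled)
print(beta_wls)
```

Scaling each row by $\sqrt{w_n}$ is the standard way to reduce WLS to OLS, since the transformed residuals are homoscedastic.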
Aitken, A. C. (1935). On least squares and linear combination of observations. Proceedings of the Royal Society of Edinburgh, 55, 42–48.
Petersen, K. B., & Pedersen, M. S. (2008). The matrix cookbook. Technical University of Denmark.