Multicollinearity

Multicollinearity occurs when two or more predictors are linearly dependent, or nearly so. This can harm the interpretability of a linear model's estimated coefficients, as well as the stability of estimating them. I discuss this phenomenon in detail.

Linearly dependent predictors

In my first post on linear models, we derived the normal equation for linear regression or ordinary least squares (OLS). Let \mathbf{y} be an N-vector of targets, and let \mathbf{X} be an N \times P design matrix of predictors. Then the normal equation for the OLS coefficients \hat{\boldsymbol{\beta}} = [\hat{\beta}_1, \dots, \hat{\beta}_P]^{\top} is

\hat{\boldsymbol{\beta}} = (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^{\top} \mathbf{y}. \tag{1}

Note that estimating \hat{\boldsymbol{\beta}} depends on the Gram matrix \mathbf{X}^{\top} \mathbf{X} being invertible. Furthermore, the variance of this estimator,

\mathbb{V}[\hat{\boldsymbol{\beta}} \mid \mathbf{X}] = \sigma^2 (\mathbf{X}^{\top} \mathbf{X})^{-1}, \tag{2}

also depends on \mathbf{X}^{\top} \mathbf{X} being invertible. Therefore, a key question for OLS is: when is \boldsymbol{\Sigma} \triangleq \mathbf{X}^{\top} \mathbf{X} invertible? \boldsymbol{\Sigma} is invertible if and only if \mathbf{X} is full rank, and \mathbf{X} is full rank when none of its P columns can be represented as a linear combination of the other columns. See A1 for a proof of this claim. (Notice that this immediately implies that N \geq P, since the P columns of \mathbf{X} cannot be linearly independent if their length N is less than P. See my discussion of the linear–dimension inequality in my post on linear dependence if needed. Thus, OLS requires at least as many observations as predictors.)
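
To make Equation 1 concrete, here is a minimal NumPy sketch that solves the normal equation on synthetic data; the dimensions and coefficients are illustrative, not from this post.

import numpy as np

# Synthetic, illustrative data for solving the normal equation (Eq. 1).
rng = np.random.default_rng(0)
N, P = 100, 3
X = rng.normal(size=(N, P))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.1, size=N)

gram = X.T @ X                              # the P x P Gram matrix
assert np.linalg.matrix_rank(X) == P        # full column rank => Gram matrix is invertible
beta_hat = np.linalg.solve(gram, X.T @ y)   # solve the normal equation
print(beta_hat)                             # close to beta_true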

If the predictors of \mathbf{X} are linearly dependent, then we say they are multicollinear, and we call this phenomenon exact multicollinearity. In practice, our predictors rarely exhibit exact multicollinearity, since that would mean at least one predictor is an exact linear combination of the others. However, multicollinearity can still be a problem without a perfect linear relationship between predictors. First, models can be difficult to interpret, since multicollinear predictors can be predictive of the same response. Second, the matrix inversion of \boldsymbol{\Sigma} may be impossible at worst and numerically unstable at best. If the inversion is unstable, the estimated \hat{\boldsymbol{\beta}} can change dramatically given a small change in the input.

Let’s discuss both problems in turn.

Uninterpretable features

The first problem with multicollinearity is that it makes it difficult to disambiguate the effects of two or more multicollinear predictors. Imagine, for example, that we use a linear model to predict housing prices and that two predictors, x_1 and x_2, both measure the size of the house. One is the area in square feet, and the other is the area in square meters. Since x_1 and x_2 are collinear, x_1 = K x_2 for some constant K, and therefore:

\begin{aligned} y &= \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_P x_P + \varepsilon \\ &= \beta_0 + \overbrace{(\beta_1 K + \beta_2)}^{\beta_{\star}} x_2 + \dots + \beta_P x_P + \varepsilon. \end{aligned} \tag{3}

Here, there is no unique solution to the normal equation, since the Gram matrix \mathbf{X}^{\top} \mathbf{X} is singular. For any choice of \beta_1, there is a corresponding \beta_2 that leaves \beta_{\star} \triangleq \beta_1 K + \beta_2 unchanged, so infinitely many pairs (\beta_1, \beta_2) fit the data equally well. Given this fact, it is difficult to properly interpret \beta_1 and \beta_2 since, in principle, they can take any value. This also has downstream effects. If the variance of \hat{\boldsymbol{\beta}} is high, then it will be harder to reject the null hypothesis that a given coefficient is zero.
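
Here is a small sketch of this non-identifiability with illustrative numbers; K is the approximate square-feet-per-square-meter conversion, and the data are synthetic.

import numpy as np

# Two exactly collinear predictors: x1 = K * x2 (illustrative data).
rng = np.random.default_rng(0)
N, K = 50, 10.7639
x2 = rng.normal(size=N)
x1 = K * x2

# Different (beta1, beta2) pairs with the same beta_star = beta1 * K + beta2
# produce identical fitted values, so the data cannot distinguish them.
beta_star = 3.0
for beta1 in (0.0, 1.0):
    beta2 = beta_star - beta1 * K
    y_hat = beta1 * x1 + beta2 * x2
    print(beta1, beta2, y_hat[:3])   # the same fitted values in both cases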

As an example of this behavior, consider predicting blood pressure using the data in A2. The predictors are age (years), weight (kilograms), body-surface area (square meters), duration of hypertension (years), basal pulse (beats per minute), and stress index. Since body-surface area (BSA) and weight are highly correlated (Figure 1), we hypothesize that these data exhibit multicollinearity. (I’ll discuss a quantitative test for multicollinearity later.)

Figure 1. Correlation matrix for predictors in the blood pressure data set. Weight and BSA are highly correlated.
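
To reproduce something like Figure 1, we can compute the correlation matrix directly. This sketch assumes the table in A2 has been saved as a whitespace-delimited file named bloodpress.txt, a hypothetical filename.

import pandas as pd

# Hypothetical setup: the table from A2 saved as "bloodpress.txt" with its column names.
df = pd.read_csv("bloodpress.txt", sep=r"\s+")
predictors = ["Age", "Weight", "BSA", "Dur", "Pulse", "Stress"]
print(df[predictors].corr().round(2))   # Weight and BSA show a high pairwise correlation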

To see the effects of multicollinearity, I fit OLS to three different subsets of predictors: all predictors, only weight, and only BSA. As we can see in Figure 2, the OLS coefficients can be tricky to interpret. Based on the coefficients from fitting OLS to all predictors (Figure 2, left), we might expect BSA to be the most important feature for predicting blood pressure. However, looking at the coefficients for OLS fit to just weight (Figure 2, middle) and just BSA (Figure 2, right) complicates the picture. We find that weight is actually more predictive of blood pressure, as measured by the coefficient of determination R^2. This is because BSA, weight, and blood pressure are all correlated, but weight is more highly correlated with blood pressure than BSA is. However, when both predictors are included in the model, multicollinearity muddies the interpretation of the individual coefficients.

Figure 2. OLS coefficients and goodness-of-fit (R^2) on blood pressure data for (left) all predictors, (middle) just the weight predictor, and (right) just the BSA predictor.

Also notice that the BSA-only model’s coefficient is quite large. My understanding is that extreme values, as well as sign changes, suggest that a data set may suffer from multicollinearity.
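
For reference, here is a sketch of the comparison behind Figure 2, again assuming the hypothetical bloodpress.txt file described above.

import pandas as pd
import statsmodels.api as sm

# Fit OLS to all predictors, to weight only, and to BSA only (Figure 2).
df = pd.read_csv("bloodpress.txt", sep=r"\s+")
y = df["BP"]
for cols in (["Age", "Weight", "BSA", "Dur", "Pulse", "Stress"], ["Weight"], ["BSA"]):
    res = sm.OLS(y, sm.add_constant(df[cols])).fit()
    print(cols, round(res.rsquared, 3))   # goodness-of-fit for this subset
    print(res.params.round(3))            # estimated coefficients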

Unstable parameter estimates

The second problem with multicollinearity is that our parameter estimates will be algorithmically unstable. To see this, let’s quantify multicollinearity using the condition number of a matrix. In general, a well-conditioned problem is one in which a small change in the input \mathbf{x} results in a small change in the output f(\mathbf{x}); an ill-conditioned problem is one in which a small change in the input results in a large change in the output.

The condition number of a matrix \mathbf{A}, denoted \kappa(\mathbf{A}), is the ratio of its largest to smallest singular values,

\kappa(\mathbf{A}) \triangleq \frac{\sigma_1(\mathbf{A})}{\sigma_P(\mathbf{A})}, \tag{4}

where \sigma_1(\mathbf{A}) \geq \sigma_2(\mathbf{A}) \geq \dots \geq \sigma_P(\mathbf{A}) are the singular values of \mathbf{A}. As we can see, when the smallest singular value \sigma_P(\mathbf{A}) is close to zero, the condition number of \mathbf{A} is big.
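
In code, Equation 4 is just a ratio of singular values. Here is a sketch on an illustrative matrix with one nearly dependent column.

import numpy as np

# An illustrative matrix whose last column is nearly a linear combination of the first two.
rng = np.random.default_rng(0)
A = rng.normal(size=(100, 3))
dependent = (A[:, [0]] + A[:, [1]]) / 2 + 1e-8 * rng.normal(size=(100, 1))
A = np.hstack([A, dependent])

s = np.linalg.svd(A, compute_uv=False)   # singular values, sorted in descending order
print(s[0] / s[-1])                      # kappa(A), as in Eq. 4
print(np.linalg.cond(A))                 # NumPy's built-in 2-norm condition number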

But what’s the intuition behind this definition? Consider the singular value decomposition (SVD) of a matrix \mathbf{A}:

\mathbf{A} = \mathbf{U}\mathbf{S}\mathbf{V}^{\top}, \tag{5}

where \mathbf{U} and \mathbf{V} are orthogonal matrices, and \mathbf{S} is a diagonal matrix of singular values:

\mathbf{S} = \begin{bmatrix} \sigma_1(\mathbf{A}) & & & \\ & \sigma_2(\mathbf{A}) & & \\ & & \ddots & \\ & & & \sigma_P(\mathbf{A}) \end{bmatrix}. \tag{6}

Then the inverse of \mathbf{A} can be written as

\mathbf{A}^{-1} = \mathbf{V}\mathbf{S}^{-1}\mathbf{U}^{\top}, \tag{7}

where, since \mathbf{S} is a diagonal matrix, its inverse is simply the diagonal matrix of the reciprocals of its diagonal elements:

\mathbf{S}^{-1} = \begin{bmatrix} 1/\sigma_1(\mathbf{A}) & & & \\ & 1/\sigma_2(\mathbf{A}) & & \\ & & \ddots & \\ & & & 1/\sigma_P(\mathbf{A}) \end{bmatrix}. \tag{8}

As we can see, when any singular value of \mathbf{A} is zero, the matrix inverse does not exist. And when any singular value is small, matrix inversion can be numerically unstable, because 1/\sigma_p(\mathbf{A}) is extremely sensitive to small changes in \sigma_p(\mathbf{A}) and because we are representing continuous values on discrete machines. For example, when \sigma_p(\mathbf{A}) is tiny, a small perturbation can change 1/\sigma_p(\mathbf{A}) dramatically, and \sigma_p(\mathbf{A}) may even underflow to zero in floating point.

Thus, the size of the singular values matters both for the mathematical existence of the matrix inverse \mathbf{A}^{-1} and for the algorithmic stability of computing that inverse even when it does technically exist, i.e. when no singular value is exactly zero. See (Trefethen & Bau III, 1997) for a thorough discussion of condition numbers and algorithmic stability.
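
Here is a small sketch of this instability on illustrative data: with two nearly collinear predictors, a tiny perturbation of the design matrix produces a large swing in the estimated coefficients.

import numpy as np

# Nearly collinear predictors make the normal equation ill-conditioned (illustrative data).
rng = np.random.default_rng(0)
N = 100
x1 = rng.normal(size=N)
x2 = x1 + 1e-6 * rng.normal(size=N)     # x2 is nearly identical to x1
X = np.column_stack([x1, x2])
y = x1 + 0.1 * rng.normal(size=N)       # the response depends only on x1

def ols(X, y):
    return np.linalg.solve(X.T @ X, X.T @ y)   # normal equation (Eq. 1)

print(np.linalg.cond(X.T @ X))          # an enormous condition number
print(ols(X, y))                        # typically huge, offsetting coefficients
print(ols(X + 1e-7 * rng.normal(size=X.shape), y))   # shifts substantially after a tiny perturbation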

This is why Python libraries such as statsmodels will warn us about small eigenvalues of the Gram matrix (the squares of the singular values of \mathbf{X}) if we fit OLS to data with multicollinearity, e.g.:

import numpy as np
import statsmodels.api as sm

N = 100
X = np.random.normal(size=(N, 1))
X = np.hstack([X, 2*X])  # the second column is exactly twice the first: exact multicollinearity
y = np.random.normal(size=N)

ols = sm.OLS(y, sm.add_constant(X))
res = ols.fit()
print(res.summary())

# The notes section of the summary includes a warning like:
# The smallest eigenvalue is 1.88e-30. This might indicate that there are
# strong multicollinearity problems or that the design matrix is singular.

It’s not necessarily correlation

As a caveat, note that multicollinearity is about linear dependence between the columns of \mathbf{X}, which is not the same thing as correlation between the predictors, although the two often coincide. In fact, a set of columns of \mathbf{X} can be multicollinear without significant correlation between predictors. Consider, for example, three i.i.d. Gaussian random vectors and a fourth vector that is the average of the first three. The vectors are not highly correlated (Figure 3), but the design matrix has a large condition number. Running this experiment once on my machine, the condition number of the correlation matrix of the first three predictors (the top-left block of Figure 3) is approximately 1.5, while the condition number computed with all four predictors is approximately 1.3 \times 10^{15}.

Figure 3. Pairwise relationships of three predictors which are i.i.d. Gaussian and a fourth which is the average of the first three. The design matrix has a large condition number, but the predictors are not highly correlated.
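
Here is a sketch of the experiment behind Figure 3; the exact numbers depend on the random draw.

import numpy as np

# Three i.i.d. Gaussian predictors plus a fourth that is their average.
rng = np.random.default_rng(0)
N = 100
X3 = rng.normal(size=(N, 3))
X4 = np.hstack([X3, X3.mean(axis=1, keepdims=True)])

print(np.corrcoef(X4, rowvar=False).round(2))          # only moderate pairwise correlations
print(np.linalg.cond(np.corrcoef(X3, rowvar=False)))   # small: the first three are well-behaved
print(np.linalg.cond(X4))                              # huge: the fourth column is exactly dependent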

Treatment

Now that we understand multicollinearity and how to detect it, how can we treat the issue? An obvious thing to try would be to just drop a predictor if it is highly correlated with another one. However, this may not be the best approach. As we saw above, it is possible for a set of predictors to exhibit multicollinearity without any two predictors being too highly correlated. Furthermore, all things being equal, we’d probably prefer to use data rather than discard it.

One useful treatment is dimension reduction. Methods such as partial least squares regression and principal component regression both perform linear dimension reduction on the predictors. For example, I fit OLS to both the blood pressure data (Figure 4, left) and the latent factors after fitting principal component analysis (PCA) to the blood pressure data (Figure 4, right). Both approaches have a similar coefficient of determination, but the second approach does not exhibit multicollinearity.

Figure 4. The correlation matrices of the blood pressure data (left) and latent factors from PCA (right).
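
Here is a sketch of principal component regression on these data, using NumPy's SVD for the PCA step and the same hypothetical bloodpress.txt file as above.

import numpy as np
import pandas as pd
import statsmodels.api as sm

# PCA on the standardized predictors, then OLS on the resulting latent factors.
df = pd.read_csv("bloodpress.txt", sep=r"\s+")
predictors = ["Age", "Weight", "BSA", "Dur", "Pulse", "Stress"]
Z = (df[predictors] - df[predictors].mean()) / df[predictors].std()

U, S, Vt = np.linalg.svd(Z.values, full_matrices=False)
factors = Z.values @ Vt.T    # principal component scores (all components retained)

res = sm.OLS(df["BP"], sm.add_constant(factors)).fit()
print(round(res.rsquared, 3))                         # same fit as using the raw predictors
print(np.corrcoef(factors, rowvar=False).round(2))    # approximately the identity: no multicollinearity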

Another approach to mitigating the effects of multicollinearity is to use ridge regression. Recall that ridge regression introduces Tikhonov regularization, or an \ell_2-norm penalty, on the weights \boldsymbol{\beta}. Ridge regression is so named because the new optimization amounts to adding values along the diagonal (or “ridge”) of the Gram matrix,

\hat{\boldsymbol{\beta}}_{\textsf{ridge}} = (\alpha \mathbf{I} + \mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^{\top} \mathbf{y}, \tag{9}

where \alpha > 0 is a hyperparameter. Adding a positive constant to the diagonal of \boldsymbol{\Sigma} ensures that the matrix is full rank, and thus invertible. This is also useful in low-data scenarios, when P > N.
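
Here is a minimal sketch of Equation 9 on illustrative, nearly collinear data; in practice, \alpha would typically be chosen by cross-validation.

import numpy as np

# Ridge regression via Eq. 9 on nearly collinear, illustrative data.
rng = np.random.default_rng(0)
N = 100
x1 = rng.normal(size=N)
X = np.column_stack([x1, x1 + 1e-6 * rng.normal(size=N)])
y = x1 + 0.1 * rng.normal(size=N)

def ridge(X, y, alpha):
    P = X.shape[1]
    return np.linalg.solve(alpha * np.eye(P) + X.T @ X, X.T @ y)

print(ridge(X, y, alpha=0.0))   # alpha = 0 recovers OLS: typically wild, offsetting coefficients
print(ridge(X, y, alpha=1.0))   # ridge splits the shared effect, roughly 0.5 on each predictor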

Conclusion

Multicollinearity is an important challenge when fitting and interpreting linear models. The phenomenon is not necessarily about correlation between predictors, and it is not necessarily an issue for the goodness-of-fit or the quality of the predictions. Rather, multicollinearity is about linear dependence between predictors in the design matrix \mathbf{X}, which results in numerical instability when estimating \boldsymbol{\beta} by inverting the Gram matrix \boldsymbol{\Sigma}. We can detect multicollinearity by analyzing the singular values of \mathbf{X}, and we can mitigate its effects through techniques such as dimension reduction and regularization.

   

Appendix

A1. Proof that \mathbf{G} = \mathbf{A}^{\top} \mathbf{A} is invertible iff \mathbf{A} is full rank

Any Gram matrix \mathbf{G} \triangleq \mathbf{A}^{\top} \mathbf{A} is uninvertible if and only if the columns of \mathbf{A} are linearly dependent. This is straightforward to prove. First, assume that the columns of \mathbf{A} are linearly dependent. By definition of linear dependence, there exists some vector \mathbf{x} \neq \mathbf{0} such that \mathbf{Ax} = \mathbf{0}. It follows immediately that

\begin{aligned} \mathbf{A} \mathbf{x} &= \mathbf{0} \\ \mathbf{A}^{\top} \mathbf{A} \mathbf{x} &= \mathbf{0} \\ \mathbf{G} \mathbf{x} &= \mathbf{0}. \end{aligned} \tag{A1.1}

Since \mathbf{x} \neq \mathbf{0} by assumption, \mathbf{G} is uninvertible; recall that one definition of an invertible matrix \mathbf{G} is that the equation \mathbf{G}\mathbf{x} = \mathbf{0} has only the trivial solution \mathbf{x} = \mathbf{0}. Now assume that \mathbf{G} is uninvertible, so there exists a vector \mathbf{x} \neq \mathbf{0} such that \mathbf{Gx} = \mathbf{0}. Then

\begin{aligned} \mathbf{G} \mathbf{x} &= \mathbf{0} \\ \mathbf{x}^{\top} \mathbf{G} \mathbf{x} &= 0 \\ \lVert \mathbf{Ax} \rVert_2^2 &= 0. \end{aligned} \tag{A1.2}

Therefore \mathbf{Ax} = \mathbf{0} while \mathbf{x} \neq \mathbf{0}, so the columns of \mathbf{A} are linearly dependent.

A2. Blood pressure data

These data are taken from here:

Pt  BP   Age  Weight  BSA   Dur  Pulse  Stress
1   105  47   85.4    1.75  5.1  63     33
2   115  49   94.2    2.10  3.8  70     14
3   116  49   95.3    1.98  8.2  72     10
4   117  50   94.7    2.01  5.8  73     99
5   112  51   89.4    1.89  7.0  72     95
6   121  48   99.5    2.25  9.3  71     10
7   121  49   99.8    2.25  2.5  69     42
8   110  47   90.9    1.90  6.2  66     8
9   110  49   89.2    1.83  7.1  69     62
10  114  48   92.7    2.07  5.6  64     35
11  114  47   94.4    2.07  5.3  74     90
12  115  49   94.1    1.98  5.6  71     21
13  114  50   91.6    2.05  10.2 68     47
14  106  45   87.1    1.92  5.6  67     80
15  125  52   101.3   2.19  10.0 76     98
16  114  46   94.5    1.98  7.4  69     95
17  106  46   87.0    1.87  3.6  62     18
18  113  46   94.5    1.90  4.3  70     12
19  110  48   90.5    1.88  9.0  71     99
20  122  56   95.7    2.09  7.0  75     99
  1. Trefethen, L. N., & Bau III, D. (1997). Numerical linear algebra (Vol. 50). SIAM.