Coefficient of Determination

In ordinary least squares, the coefficient of determination quantifies the fraction of the variation in the dependent variables that can be explained by the model. However, this interpretation rests on a few assumptions that are worth understanding. I explore this metric and these assumptions in detail.

Consider a linear model

$$
\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}, \tag{1}
$$

where $\mathbf{y} = [y_1, \dots, y_N]^{\top}$ are dependent or target variables, $\mathbf{X} = [\mathbf{x}_1, \dots, \mathbf{x}_N]^{\top}$ is an $N \times P$ design matrix of independent or predictor variables, $\boldsymbol{\varepsilon} = [\varepsilon_1, \dots, \varepsilon_N]^{\top}$ are error terms, and $\boldsymbol{\beta} = [\beta_1, \dots, \beta_P]^{\top}$ are linear coefficients or model parameters. If we estimate $\boldsymbol{\beta}$ by solving the following quadratic minimization problem,

$$
\hat{\boldsymbol{\beta}} = \arg\!\min_{\boldsymbol{\beta}} \lVert \mathbf{y} - \mathbf{X} \boldsymbol{\beta} \rVert_2^2, \tag{2}
$$

then this model is ordinary least squares (OLS). See my previous post on OLS for details.
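To make this concrete, here is a minimal NumPy sketch of solving Equation 2 on synthetic data. The data, seed, and variable names are my own choices for illustration, not anything from the model above.

```python
import numpy as np

# Synthetic data: N observations, P predictors (illustrative values).
rng = np.random.default_rng(0)
N, P = 100, 3
X = rng.normal(size=(N, P))
beta_true = np.array([2.0, -1.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.3, size=N)

# Solve the quadratic minimization problem in Equation 2.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta_hat
print(beta_hat)  # Close to beta_true.
```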

The goal of this post is to explain a common goodness-of-fit metric for OLS, called the coefficient of determination or $R^2$. I’ll start by deriving $R^2$ from the fraction of variance unexplained, without stating all my assumptions. Then I’ll state the assumptions and show that they hold for OLS with an intercept but that they are not true in general. I’ll end with a couple of discussion points about $R^2$, such as when $R^2$ can be negative and its relationship to Pearson’s correlation coefficient.

Fraction of variance unexplained

A common metric for the goodness-of-fit of a statistical model is the fraction of the variance of the target variables $\mathbf{y}$ which cannot be explained by the model’s predictions $\hat{\mathbf{y}}$. The fraction of variance unexplained (FVU) is

$$
\text{FVU} = \frac{1/N \sum_{n=1}^N (y_n - \hat{y}_n)^2}{1/N \sum_{n=1}^N (y_n - \bar{y})^2}. \tag{3}
$$

The numerator is the variance of our residuals, $\mathbf{e} \triangleq \mathbf{y} - \hat{\mathbf{y}}$, and the denominator is the variance of our target variables. Clearly, the $1/N$ terms cancel, and we can write FVU in vector notation as

$$
\text{FVU} = \frac{(\mathbf{y} - \hat{\mathbf{y}})^{\top} (\mathbf{y} - \hat{\mathbf{y}})}{(\mathbf{y} - \bar{y})^{\top} (\mathbf{y} - \bar{y})}. \tag{4}
$$

If we have perfect predictions, then $\hat{\mathbf{y}} = \mathbf{y}$, and FVU is zero. In other words, there is no variation in the targets that our model cannot explain. This is a perfect fit. What should our model do if it has no predictive power? The best thing to try is to simply predict the mean of our targets, $\bar{y} = (1/N) \sum_n y_n$. In this case, FVU is one. If the model does worse than predicting the mean of the targets, then FVU can be greater than one.
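As a quick sketch of these three regimes, we can compute FVU directly from Equation 3. The toy data below is made up for illustration; the helper name `fvu` is mine.

```python
import numpy as np

def fvu(y, y_hat):
    """Fraction of variance unexplained (Equation 3)."""
    return np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

rng = np.random.default_rng(0)
y = rng.normal(loc=5.0, size=50)

print(fvu(y, y))                          # Perfect predictions: 0.
print(fvu(y, np.full_like(y, y.mean())))  # Predicting the mean: 1.
print(fvu(y, np.full_like(y, -100.0)))    # Worse than the mean: > 1.
```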

Coefficient of determination

Under certain assumptions, we can derive a new metric from FVU. For now, assume these as-yet-unstated assumptions hold. Then the denominator in Equation 4 can be decomposed as

$$
\begin{aligned}
\sum_{n=1}^N (y_n - \bar{y})^2
&= \sum_{n=1}^N (\hat{y}_n + e_n - \bar{y})^2 \\
&= \sum_{n=1}^N \left( \hat{y}_n^2 + \bar{y}^2 + e_n^2 + 2 \hat{y}_n e_n - 2 \hat{y}_n \bar{y} - 2 e_n \bar{y} \right) \\
&= \sum_{n=1}^N (\hat{y}_n - \bar{y})^2 + \sum_{n=1}^N e_n^2 + \cancel{2 \sum_{n=1}^N \hat{y}_n e_n} - \cancel{2 \bar{y} \sum_{n=1}^N e_n}.
\end{aligned} \tag{5}
$$

The cancellations are not true in general, and depend on the aforementioned assumptions. But if they hold, then we can write Equation 5 in vector notation in terms of FVU in Equation 4:

$$
\frac{(\hat{\mathbf{y}} - \bar{y})^{\top} (\hat{\mathbf{y}} - \bar{y})}{(\mathbf{y} - \bar{y})^{\top} (\mathbf{y} - \bar{y})} = 1 - \overbrace{\frac{(\mathbf{y} - \hat{\mathbf{y}})^{\top} (\mathbf{y} - \hat{\mathbf{y}})}{(\mathbf{y} - \bar{y})^{\top} (\mathbf{y} - \bar{y})}}^{\text{FVU}}. \tag{6}
$$

The fraction on the left-hand side of Equation 6 is just one minus FVU. This fraction is called the coefficient of determination or $R^2$:

$$
R^2 = 1 - \text{FVU}. \tag{7}
$$

Thus, $R^2$ is simply one minus the fraction of variance unexplained. When $R^2$ is one, our model has perfect predictive power. This occurs when the residuals $\mathbf{e}$ are all zero. When $R^2$ is zero, our model has simply predicted the mean, and $\hat{y}_n = \bar{y}$ for all $n$.

The terms in Equation 6 are important enough to all have names. The numerator in the FVU term is called the residual sum of squares (RSS), since it is the sum of squared differences between the true and predicted values, i.e. the residuals. The denominator in the FVU term is called the total sum of squares (TSS), since it is the total variation in our target variables. Finally, the numerator on the left-hand side of Equation 6 is called the explained sum of squares (ESS), which is the sum of squared differences between the predictions and the mean of our targets. To summarize:

$$
\begin{aligned}
&(\mathbf{y} - \bar{y})^{\top} (\mathbf{y} - \bar{y}), && \quad\text{total sum of squares (TSS)} \\
&(\mathbf{y} - \hat{\mathbf{y}})^{\top} (\mathbf{y} - \hat{\mathbf{y}}), && \quad\text{residual sum of squares (RSS)} \\
&(\hat{\mathbf{y}} - \bar{y})^{\top} (\hat{\mathbf{y}} - \bar{y}). && \quad\text{explained sum of squares (ESS)}
\end{aligned} \tag{8}
$$

Using this terminology, Equation 6 says that the total sum of squares decomposes into the explained and residual sums of squares. In other words, under OLS the variation in our observations splits into the variation explained by our regression predictions and the variation left in our residuals.
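We can check this decomposition numerically for OLS with an intercept. The following sketch uses synthetic data of my own choosing; the point is only that TSS, ESS, and RSS line up the way Equation 6 claims.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 200
x = rng.uniform(-2, 2, size=N)
y = 1.5 + 0.8 * x + rng.normal(scale=0.5, size=N)

# OLS with an intercept: a column of ones plus the predictor.
X = np.column_stack([np.ones(N), x])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta_hat

tss = np.sum((y - y.mean()) ** 2)
rss = np.sum((y - y_hat) ** 2)
ess = np.sum((y_hat - y.mean()) ** 2)

print(np.isclose(tss, ess + rss))  # True: the decomposition in Equation 6.
print(1 - rss / tss, ess / tss)    # Two equal ways of computing R^2.
```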

OLS with an intercept

In the previous section, we derived $R^2$ in Equation 5 through two cancellations. When do these cancellations hold? They hold when we assume OLS with an intercept. Let’s see this.

The first cancellation is

$$
2 \sum_{n=1}^N \hat{y}_n e_n = 0. \tag{9}
$$

This is true for OLS because of the normal equation:

$$
\begin{aligned}
\mathbf{X}^{\top} \mathbf{X} \hat{\boldsymbol{\beta}} &= \mathbf{X}^{\top} \mathbf{y} \\
\mathbf{0} &= \mathbf{X}^{\top} (\mathbf{y} - \mathbf{X} \hat{\boldsymbol{\beta}}) \\
\mathbf{0} &= \mathbf{X}^{\top} \mathbf{e}.
\end{aligned} \tag{10}
$$

This implies that

$$
\hat{\mathbf{y}}^{\top} \mathbf{e} = \hat{\boldsymbol{\beta}}^{\top} \mathbf{X}^{\top} \mathbf{e} = 0. \tag{11}
$$

Thus, the first term cancels because our statistical model is OLS. If we did not assume OLS, i.e. a linear model whose coefficients minimize the sum of squared residuals (Equation 2), then the decomposition relating $R^2$ to FVU would not necessarily hold. This doesn’t mean that the metric $R^2$ would be meaningless, but it would lose its interpretation as the fraction of variation in the dependent variables that is explained by our model’s predictions.
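Here is a quick numerical sanity check of Equations 10 and 11, on made-up data of my own:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 4))
y = X @ rng.normal(size=4) + rng.normal(size=100)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta_hat
e = y - y_hat

print(np.allclose(X.T @ e, 0))   # Normal equation: X^T e = 0 (Equation 10).
print(np.isclose(y_hat @ e, 0))  # Hence y_hat^T e = 0 (Equation 11).
```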

The second cancellation is

$$
2 \bar{y} \sum_{n=1}^N e_n = 0. \tag{12}
$$

This is true if we assume a constant predictor, i.e. when

$$
\mathbf{X} = \begin{bmatrix} \mathbf{c} & \mathbf{X}_2 \end{bmatrix}, \tag{13}
$$

where $\mathbf{X}_2$ is our original design matrix and $\mathbf{c}$ is an $N$-vector filled with the constant $c$. If $\mathbf{X}$ has a constant predictor, then the last line of Equation 10 can be written as

$$
\begin{bmatrix}
c & \dots & c \\
x_{11} & \dots & x_{1N} \\
\vdots & \ddots & \vdots \\
x_{P1} & \dots & x_{PN}
\end{bmatrix}
\begin{bmatrix} e_1 \\ \vdots \\ e_N \end{bmatrix}
=
\begin{bmatrix} 0 \\ \vdots \\ 0 \end{bmatrix}. \tag{14}
$$

The first row of this system immediately implies that the sum of the residuals is zero, provided $c \neq 0$:

$$
\sum_{n=1}^N c\, e_n = c \sum_{n=1}^N (y_n - \hat{y}_n) = 0. \tag{15}
$$

To summarize, $R^2$ can only be interpreted in terms of the fraction of variance unexplained if we assume OLS and, moreover, OLS with an intercept. Without these assumptions, the decomposition in Equation 5 is no longer valid, and $R^2$ loses this standard interpretation.
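The contrast between the two assumptions is easy to see numerically. In this sketch (again with data I made up), the residuals sum to zero only when the design matrix contains a constant column:

```python
import numpy as np

rng = np.random.default_rng(3)
N = 100
x = rng.uniform(0, 1, size=N)
y = 3.0 + 2.0 * x + rng.normal(scale=0.2, size=N)

def ols_residuals(X, y):
    """Fit OLS and return the residual vector e = y - X beta_hat."""
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ beta_hat

e_with = ols_residuals(np.column_stack([np.ones(N), x]), y)  # With intercept.
e_without = ols_residuals(x[:, None], y)                     # Without intercept.

print(np.isclose(e_with.sum(), 0))     # True: Equation 15.
print(np.isclose(e_without.sum(), 0))  # Generally False.
```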

Uncentered $R^2$

What if our model does not contain an intercept? The decomposition in Equation 5 is no longer valid. However, the uncentered total sum of squares still decomposes as

$$
\begin{aligned}
\mathbf{y}^{\top} \mathbf{y} &= (\hat{\mathbf{y}} + \mathbf{e})^{\top} (\hat{\mathbf{y}} + \mathbf{e}) \\
&= \hat{\mathbf{y}}^{\top} \hat{\mathbf{y}} + \mathbf{e}^{\top} \mathbf{e} + 2 \hat{\mathbf{y}}^{\top} \mathbf{e} \\
&= \hat{\mathbf{y}}^{\top} \hat{\mathbf{y}} + \mathbf{e}^{\top} \mathbf{e} + 2 \hat{\boldsymbol{\beta}}^{\top} \mathbf{X}^{\top} \mathbf{e} \\
&\stackrel{\star}{=} \hat{\mathbf{y}}^{\top} \hat{\mathbf{y}} + \mathbf{e}^{\top} \mathbf{e}.
\end{aligned} \tag{16}
$$

Again, the step labeled $\star$ holds because $\mathbf{0} = \mathbf{X}^{\top} \mathbf{e}$, meaning that we still assume OLS, just not OLS with an intercept. See A1 for an alternative derivation of this decomposition. We can rearrange Equation 16 to derive the uncentered $R^2$, which I’ll denote as $R^2_{\textsf{uc}}$:

$$
R_{\textsf{uc}}^2 \triangleq \frac{\hat{\mathbf{y}}^{\top} \hat{\mathbf{y}}}{\mathbf{y}^{\top} \mathbf{y}} = 1 - \frac{\mathbf{e}^{\top} \mathbf{e}}{\mathbf{y}^{\top} \mathbf{y}}. \tag{17}
$$

Since $\hat{\mathbf{y}}^{\top} \hat{\mathbf{y}} \geq 0$ and $\mathbf{e}^{\top} \mathbf{e} \geq 0$ and since

$$
R_{\textsf{uc}}^2 = \frac{\hat{\mathbf{y}}^{\top} \hat{\mathbf{y}}}{\hat{\mathbf{y}}^{\top} \hat{\mathbf{y}} + \mathbf{e}^{\top} \mathbf{e}}, \tag{18}
$$

then $0 \leq R_{\textsf{uc}}^2 \leq 1$. As we can see, this measures the goodness-of-fit of OLS. If the residuals are small, then $R_{\textsf{uc}}^2$ is close to $1$. If the residuals are big, then $R_{\textsf{uc}}^2$ is close to $0$.
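Here is a short sketch computing $R_{\textsf{uc}}^2$ for OLS without an intercept, using both forms in Equation 17. The data is synthetic and my own.

```python
import numpy as np

rng = np.random.default_rng(4)
N = 100
x = rng.uniform(0, 2, size=N)
y = 1.7 * x + rng.normal(scale=0.4, size=N)

# OLS without an intercept.
X = x[:, None]
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta_hat
e = y - y_hat

r2_uc = (y_hat @ y_hat) / (y @ y)
print(r2_uc)                                     # Always between 0 and 1.
print(np.isclose(r2_uc, 1 - (e @ e) / (y @ y)))  # Both forms of Equation 17 agree.
```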

Negative $R^2$

As we have seen, the uncentered metric $R_{\textsf{uc}}^2$ cannot be negative. This is simply an algebraic impossibility given Equation 18. However, in practice, one might find that the coefficient of determination or the uncentered $R^2$ comes out negative. Why does this occur? There are two scenarios. Let’s look at each in turn.

When we use OLS

If our predictors include a constant, then $R^2$ cannot be negative. We know $R^2$ cannot be negative because this would require

$$
\begin{aligned}
1 - \frac{(\mathbf{y} - \hat{\mathbf{y}})^{\top} (\mathbf{y} - \hat{\mathbf{y}})}{(\mathbf{y} - \bar{y})^{\top} (\mathbf{y} - \bar{y})} &< 0 \\
&\Downarrow \\
(\mathbf{y} - \bar{y})^{\top} (\mathbf{y} - \bar{y}) &< (\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}})^{\top} (\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}).
\end{aligned} \tag{19}
$$

However, the OLS estimator $\hat{\boldsymbol{\beta}}$ is the minimizer of the sum of squared residuals (see Equation 2), and predicting $\bar{y}$ is achievable when the model contains an intercept. If Equation 19 were true, then $\hat{\boldsymbol{\beta}}$ would no longer be the minimizer. Indeed, the smallest value $R^2$ can take is zero, which occurs when the predictions are simply the target mean. It’s easy to see that this occurs when $\mathbf{X}$ is un-informative, i.e. when the predictors are constant. See A2 for a proof.

Figure 1. OLS without (red solid line) and with (red dashed line) an intercept. The model's parameters change depending on this modeling assumption.

However, if we do not include an intercept and if our software package still computes $R^2$ using Equation 6, then $R^2$ can be negative. Intuitively, this is because the results of OLS can change dramatically when an intercept is added or removed (Figure 1). In other words, when we do not include an intercept, our model can do much worse than simply predicting the sample mean. Arguably, a negative $R^2$ with OLS is a user error, since it implies the model would do better with the right modeling assumption.
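This failure mode is easy to reproduce. In the sketch below, I generate toy data with a large offset that an intercept would absorb, fit OLS without an intercept, and then compute the centered $R^2$ from Equation 6; the value comes out negative.

```python
import numpy as np

rng = np.random.default_rng(5)
N = 100
x = rng.uniform(0, 1, size=N)
y = 10.0 - 2.0 * x + rng.normal(scale=0.1, size=N)  # Large offset an intercept would absorb.

# OLS *without* an intercept: y_hat = beta * x.
beta_hat, *_ = np.linalg.lstsq(x[:, None], y, rcond=None)
y_hat = x * beta_hat[0]

r2 = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
print(r2)  # Negative: this fit is worse than predicting the sample mean.
```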

When we do not use OLS

As we saw, the decomposition of the total sum of squares, centered or uncentered, depends on the normal equation, or at least it depends on $\mathbf{X}^{\top} \mathbf{e} = \mathbf{0}$. Thus, when using a model that is not linear regression, such as a non-linear deep neural network, $R^2$ loses its interpretation as the fraction of variation in the target variables that is explained by our predictions, and we have no guarantee that $R^2$ or $R_{\textsf{uc}}^2$ cannot be negative.
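For example, in the sketch below the “model” is an arbitrary constant guess that was never fit by least squares, and the centered $R^2$ formula comes out negative. The data and the constant are my own choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(6)
y = rng.normal(loc=5.0, scale=1.0, size=100)

# Predictions from some model that was never fit by least squares;
# here, simply a constant guess far from the sample mean.
y_hat = np.full_like(y, 0.0)

r2 = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
print(r2)  # Negative: the predictions do worse than the mean.
```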

Correlation coefficient squared

The coefficient of determination is called “$R^2$” because it is the square of the Pearson correlation coefficient, commonly denoted $R$, between the targets $\mathbf{y}$ and predictions $\hat{\mathbf{y}}$. To see this, let’s first write down the correlation coefficient:

$$
R \triangleq \frac{\sum_{n=1}^N (y_n - \bar{y})(\hat{y}_n - \bar{y})}{\sqrt{\sum_{n=1}^N (y_n - \bar{y})^2} \cdot \sqrt{\sum_{n=1}^N (\hat{y}_n - \bar{y})^2}}. \tag{20}
$$

Note that we have written $\bar{y}$ for the mean of the predictions, rather than $\bar{\hat{y}}$. This is because yet another important property of OLS with an intercept is that the mean of the predictions equals the mean of the observations, i.e. $\bar{y} = \bar{\hat{y}}$. Why? Since $e_n = y_n - \hat{y}_n$ by definition, we can write down a relationship between the means of our residuals, predictions, and observations:

$$
\begin{aligned}
\frac{1}{N} \sum_{n=1}^N e_n &= \frac{1}{N} \sum_{n=1}^N y_n - \frac{1}{N} \sum_{n=1}^N \hat{y}_n \\
0 &= \bar{y} - \bar{\hat{y}} \\
&\Downarrow \\
\bar{y} &= \bar{\hat{y}}.
\end{aligned} \tag{21}
$$

Again, this depends on the fact that the residuals sum to zero, as they do when we have a constant predictor in $\mathbf{X}$. Now let’s apply the following simplification to the numerator of Equation 20:

$$
\begin{aligned}
\sum_{n=1}^N (y_n - \bar{y})(\hat{y}_n - \bar{y})
&= \sum_{n=1}^N \left[(y_n - \hat{y}_n) + (\hat{y}_n - \bar{y})\right](\hat{y}_n - \bar{y}) \\
&= \sum_{n=1}^N (y_n - \hat{y}_n)(\hat{y}_n - \bar{y}) + \sum_{n=1}^N (\hat{y}_n - \bar{y})^2 \\
&\stackrel{\star}{=} \sum_{n=1}^N (\hat{y}_n - \bar{y})^2.
\end{aligned} \tag{22}
$$

Again, step $\star$ holds because, when we use OLS with an intercept, $\hat{\mathbf{y}}^{\top} \mathbf{e} = 0$ and the residuals sum to zero. See A3 for a complete derivation of this step. Thus, we can write $R$ as

$$
\begin{aligned}
R &= \frac{\sum_{n=1}^N (\hat{y}_n - \bar{y})^2}{\sqrt{\sum_{n=1}^N (y_n - \bar{y})^2} \cdot \sqrt{\sum_{n=1}^N (\hat{y}_n - \bar{y})^2}} \\
&= \sqrt{\frac{\sum_{n=1}^N (\hat{y}_n - \bar{y})^2}{\sum_{n=1}^N (y_n - \bar{y})^2}}.
\end{aligned} \tag{23}
$$

If we square this quantity, we get $R^2$ as in Equation 6.

What this means is that if the targets and predictions are highly negatively or positively correlated, then OLS with an intercept will have a good fit or high $R^2$. As the magnitude of the correlation decreases (regardless of sign), the goodness-of-fit decreases.
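We can confirm the relationship numerically. In this sketch with synthetic data of my own, the centered $R^2$ of an OLS fit with an intercept matches the squared Pearson correlation between $\mathbf{y}$ and $\hat{\mathbf{y}}$.

```python
import numpy as np

rng = np.random.default_rng(7)
N = 200
x = rng.normal(size=N)
y = 4.0 - 3.0 * x + rng.normal(scale=0.7, size=N)

# OLS with an intercept.
X = np.column_stack([np.ones(N), x])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta_hat

r2 = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
pearson_r = np.corrcoef(y, y_hat)[0, 1]

print(np.isclose(r2, pearson_r ** 2))  # True: R^2 is the squared correlation.
```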

Summary

The coefficient of determination is an important metric for quantifying the goodness-of-fit of ordinary least squares. However, its interpretation as the fraction of variance explained by the model depends on the assumption that we are either using OLS with an intercept or we have mean-centered data. A negative $R^2$ is only possible when we are either not using linear regression or not using linear regression with an intercept. We refer to the coefficient of determination as “$R^2$” because it is the square of Pearson’s correlation coefficient $R$. The more strongly our predictions are correlated with our regression targets, whether negatively or positively, the better a linear model fits the data.

   

Appendix

A1. Decomposition of the sum of squares

In my previous post on OLS, we saw the hat and residual maker matrices,

$$
\begin{aligned}
\mathbf{H} &\triangleq \mathbf{X} (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^{\top}, \\
\mathbf{M} &\triangleq \mathbf{I} - \mathbf{H},
\end{aligned} \tag{A1.1}
$$

are orthogonal projectors that are also orthogonal to each other. We can write the sum of squares as

$$
\begin{aligned}
\mathbf{y}^{\top} \mathbf{y} &= (\mathbf{H}\mathbf{y} + \mathbf{M} \mathbf{y})^{\top} (\mathbf{H}\mathbf{y} + \mathbf{M} \mathbf{y}) \\
&= \mathbf{y}^{\top} \mathbf{H} \mathbf{H} \mathbf{y} + \cancel{\mathbf{y}^{\top} \mathbf{H} \mathbf{M} \mathbf{y}} + \cancel{\mathbf{y}^{\top} \mathbf{M} \mathbf{H} \mathbf{y}} + \mathbf{y}^{\top} \mathbf{M} \mathbf{M} \mathbf{y} \\
&= \hat{\mathbf{y}}^{\top} \hat{\mathbf{y}} + \mathbf{e}^{\top} \mathbf{e}.
\end{aligned} \tag{A1.2}
$$

The cancellations arise because the matrices are orthogonal to each other, i.e. $\mathbf{M}\mathbf{H} = \mathbf{0}$.
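A small numerical sketch of these properties, with a made-up design matrix:

```python
import numpy as np

rng = np.random.default_rng(8)
N, P = 50, 3
X = rng.normal(size=(N, P))
y = rng.normal(size=N)

H = X @ np.linalg.inv(X.T @ X) @ X.T  # Hat matrix.
M = np.eye(N) - H                     # Residual maker.

print(np.allclose(H @ H, H))          # Idempotent (an orthogonal projector).
print(np.allclose(M @ H, 0))          # Orthogonal to each other.
print(np.isclose(y @ y, (H @ y) @ (H @ y) + (M @ y) @ (M @ y)))  # Equation A1.2.
```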

A2. OLS parameters with un-informative predictors

Consider the scenario in which $\mathbf{X}$ is an $N \times 1$ matrix of completely un-informative predictors, i.e. a column vector filled with a constant $c$. Let’s write this as $\mathbf{X} \triangleq \mathbf{c} = [c, \dots, c]^{\top}$. Then clearly

$$
\begin{aligned}
\hat{\boldsymbol{\beta}} &= (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^{\top} \mathbf{y} \\
&= (\mathbf{c}^{\top} \mathbf{c})^{-1} \mathbf{c}^{\top} \mathbf{y} \\
&= \left( \frac{1}{N c^2} \right) \mathbf{c}^{\top} \mathbf{y} \\
&= \begin{bmatrix} \tfrac{1}{Nc} & \dots & \tfrac{1}{Nc} \end{bmatrix} \mathbf{y} \\
&= \frac{\bar{y}}{c}.
\end{aligned} \tag{A2.1}
$$

In words, when $\mathbf{X}$ is a single constant predictor, the OLS predictions are all equal to the mean of the dependent variables, since $\hat{y}_n = c \hat{\beta} = \bar{y}$ for every $n$. (In the common case $c = 1$, the estimator itself is $\bar{y}$.)
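A quick numerical check of this calculation; the constant $c$ and the data are arbitrary choices of mine:

```python
import numpy as np

rng = np.random.default_rng(9)
N, c = 100, 2.5
y = rng.normal(loc=3.0, size=N)
X = np.full((N, 1), c)  # A single, constant predictor.

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.isclose(beta_hat[0], y.mean() / c))  # beta_hat = y_bar / c (Equation A2.1).
print(np.allclose(X @ beta_hat, y.mean()))    # Every prediction equals y_bar.
```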

A3. Lemma for correlation coefficient derivation

We want to show that

$$
\sum_{n=1}^N (y_n - \hat{y}_n)(\hat{y}_n - \bar{y}) = 0. \tag{A3.1}
$$

Let’s expand the terms,

$$
\begin{aligned}
\sum_{n=1}^N (y_n - \hat{y}_n)(\hat{y}_n - \bar{y})
&= \sum_{n=1}^N \left( y_n \hat{y}_n - y_n \bar{y} - \hat{y}_n^2 + \hat{y}_n \bar{y} \right) \\
&= \sum_{n=1}^N y_n \hat{y}_n - \sum_{n=1}^N \hat{y}_n^2 - \bar{y} \sum_{n=1}^N y_n + \bar{y} \sum_{n=1}^N \hat{y}_n \\
&= \sum_{n=1}^N \hat{y}_n (y_n - \hat{y}_n) - \bar{y} \sum_{n=1}^N (y_n - \hat{y}_n).
\end{aligned} \tag{A3.2}
$$

Note that the first term in this last line is just

$$
\sum_{n=1}^N \hat{\boldsymbol{\beta}}^{\top} \mathbf{x}_n e_n = \hat{\mathbf{y}}^{\top} \mathbf{e}, \tag{A3.3}
$$

which we already saw was equal to zero (Equations 10 and 11). The second term is $\bar{y} \sum_{n=1}^N e_n$, which is zero because the residuals sum to zero when $\mathbf{X}$ contains a constant predictor (Equation 15). Together, these give Equation A3.1.