In ordinary least squares, the coefficient of determination quantifies the fraction of variation in the dependent variable that can be explained by the model. However, this interpretation relies on a few assumptions that are worth understanding. I explore this metric and these assumptions in detail.
Published
09 August 2021
Consider a linear model
$$y = X\beta + \varepsilon, \tag{1}$$
where y=[y1,…,yN]⊤ are dependent or target variables, X=[x1,…,xN]⊤ is an N×P design matrix of independent or predictor variables, ε=[ε1,…,εN]⊤ are error terms, and β=[β1,…,βP]⊤ are linear coefficients or model parameters. If we estimate β by solving the following quadratic minimization problem,

$$\hat{\beta} = \arg\min_{\beta}\, (y - X\beta)^{\top} (y - X\beta), \tag{2}$$

then we call the model ordinary least squares (OLS).
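To make the setup concrete, here is a minimal NumPy sketch (the synthetic data, seed, and variable names are my own) that estimates β by solving the normal equations and checks the result against NumPy's least-squares solver:

```python
import numpy as np

rng = np.random.default_rng(0)
N, P = 100, 3

X = rng.normal(size=(N, P))                        # design matrix of predictors
beta_true = np.array([2.0, -1.0, 0.5])             # ground-truth coefficients
y = X @ beta_true + rng.normal(scale=0.3, size=N)  # noisy targets

# OLS estimate: minimize the sum of squared residuals (Equation 2)
# by solving the normal equations X^T X beta = X^T y.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Sanity check against NumPy's built-in least-squares solver.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
assert np.allclose(beta_hat, beta_lstsq)
print(beta_hat)
```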
The goal of this post is to explain a common metric for goodness-of-fit for OLS, called the coefficient of determination or R2. I’ll start by deriving R2 from the fraction of variance unexplained, without stating all my assumptions. Then I’ll state the assumptions and show that they hold for OLS with an intercept but that they are not true in general. I’ll end with a couple discussion points about R2, such as when R2 can be negative and its relationship to Pearson’s correlation coefficient.
Fraction of variance unexplained
A common metric for the goodness-of-fit of a statistical model is the fraction of the variance of the target variables y which cannot be explained by the model’s predictions y^. The fraction of variance unexplained (FVU) is

$$\text{FVU} = \frac{\frac{1}{N} \sum_{n=1}^{N} (y_n - \hat{y}_n)^2}{\frac{1}{N} \sum_{n=1}^{N} (y_n - \bar{y})^2}. \tag{3}$$
The numerator is the variance of our residuals, e≜y−y^, and the denominator is the variance of our target variables. Clearly, the 1/N terms cancel, and we can write FVU in vector notation as
$$\text{FVU} = \frac{(y - \hat{y})^{\top} (y - \hat{y})}{(y - \bar{y})^{\top} (y - \bar{y})}. \tag{4}$$
If we have perfect predictions, then y^=y, and FVU is zero. In other words, there is no variation in the targets that our model cannot explain. This is a perfect fit. What should our model do if it has no predictive power? The best thing to try is to simply predict the mean of our targets, yˉ=(1/N)∑nyn. In this case, FVU is one. If the model does worse than predicting the mean of the targets, then FVU can be greater than one.
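Here is a small illustrative sketch (with toy data of my own choosing) of these three regimes, computing FVU directly from Equation 4:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 200
x = rng.uniform(-2, 2, size=N)
y = 1.5 * x + rng.normal(scale=0.5, size=N)

def fvu(y, y_hat):
    """Fraction of variance unexplained (Equation 4)."""
    return np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

print(fvu(y, y))                           # perfect predictions -> 0
print(fvu(y, np.full_like(y, y.mean())))   # predicting the mean -> 1
print(fvu(y, -y))                          # worse than the mean -> greater than 1
```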
Coefficient of determination
Under certain assumptions, we can derive a new metric from FVU. For now, assume these unspoken assumptions hold. Then the denominator in Equation 4 can be decomposed as

$$
\begin{aligned}
\sum_{n=1}^{N} (y_n - \bar{y})^2
&= \sum_{n=1}^{N} \left[ (y_n - \hat{y}_n) + (\hat{y}_n - \bar{y}) \right]^2
\\
&= \sum_{n=1}^{N} (y_n - \hat{y}_n)^2 + 2 \sum_{n=1}^{N} (y_n - \hat{y}_n)(\hat{y}_n - \bar{y}) + \sum_{n=1}^{N} (\hat{y}_n - \bar{y})^2
\\
&= \sum_{n=1}^{N} (y_n - \hat{y}_n)^2 + \underbrace{2 \sum_{n=1}^{N} \hat{y}_n e_n}_{=\,0} - \underbrace{2 \bar{y} \sum_{n=1}^{N} e_n}_{=\,0} + \sum_{n=1}^{N} (\hat{y}_n - \bar{y})^2
\\
&= \sum_{n=1}^{N} (y_n - \hat{y}_n)^2 + \sum_{n=1}^{N} (\hat{y}_n - \bar{y})^2.
\end{aligned} \tag{5}
$$
The cancellations are not true in general; they depend on the aforementioned assumptions. But if they hold, then we can write Equation 5 in vector notation in terms of FVU in Equation 4:

$$\frac{(\hat{y} - \bar{y})^{\top} (\hat{y} - \bar{y})}{(y - \bar{y})^{\top} (y - \bar{y})} = 1 - \frac{(y - \hat{y})^{\top} (y - \hat{y})}{(y - \bar{y})^{\top} (y - \bar{y})}. \tag{6}$$
The fraction on the left-hand side of Equation 6 is just one minus FVU. This fraction is called the coefficient of determination or R2:
$$R^2 = 1 - \text{FVU}. \tag{7}$$
Thus, R2 is simply one minus the fraction of variance unexplained. When R2 is one, our model has perfect predictive power. This occurs when the residuals e are all zero. When R2 is zero, our model has simply predicted the mean, and y^n=yˉ for all n.
The terms in Equation 6 are important enough to all have names. The numerator in the FVU term is called the residual sum of squares (RSS), since it is the sum of squared differences between the true and predicted values, i.e. the residuals. The denominator in the FVU term is called the total sum of squares (TSS), since it is the total variation in our target variables. Finally, the numerator on the left-hand side of Equation 6 is called the explained sum of squares (ESS), which is the sum of squared differences between the predictions and the mean of our targets. To summarize:
$$
\begin{aligned}
\text{total sum of squares (TSS)} &\triangleq (y - \bar{y})^{\top} (y - \bar{y}), \\
\text{residual sum of squares (RSS)} &\triangleq (y - \hat{y})^{\top} (y - \hat{y}), \\
\text{explained sum of squares (ESS)} &\triangleq (\hat{y} - \bar{y})^{\top} (\hat{y} - \bar{y}).
\end{aligned} \tag{8}
$$
Using this terminology, Equation 6 says that the total sum of squares decomposes into the explained and residual sums of squares. In other words, under OLS the variation in our observations splits into the variation in our residuals plus the variation in our regression predictions.
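As a quick numerical check (again with made-up data of my own choosing), the sketch below verifies that TSS = ESS + RSS for OLS with an intercept and computes R2 as one minus FVU:

```python
import numpy as np

rng = np.random.default_rng(2)
N = 150
x = rng.normal(size=N)
y = 3.0 + 2.0 * x + rng.normal(scale=1.0, size=N)

X = np.column_stack([np.ones(N), x])     # design matrix with an intercept column
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ beta_hat

tss = np.sum((y - y.mean()) ** 2)        # total sum of squares
rss = np.sum((y - y_hat) ** 2)           # residual sum of squares
ess = np.sum((y_hat - y.mean()) ** 2)    # explained sum of squares

assert np.isclose(tss, ess + rss)        # the decomposition in Equations 5 and 6
r_squared = 1.0 - rss / tss              # Equation 7
print(r_squared)
```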
OLS with an intercept
In the previous section, we derived R2 in Equation 5 through two cancellations. When do these cancellations hold? They hold when we assume OLS has an intercept. Let’s see this.
The first cancellation is
$$2 \sum_{n=1}^{N} \hat{y}_n e_n = 0. \tag{9}$$
This is true for OLS because of the normal equation:
$$
\begin{aligned}
X^{\top} X \hat{\beta} &= X^{\top} y \\
0 &= X^{\top} (y - X\hat{\beta}) \\
0 &= X^{\top} e.
\end{aligned} \tag{10}
$$
This implies that
$$\hat{y}^{\top} e = \hat{\beta}^{\top} X^{\top} e = 0. \tag{11}$$
Thus, the first term cancels because our statistical model is OLS. If we did not assume OLS or a linear model that minimizes the sum of squared residuals (Equation 2), then we could not necessarily write FVU in terms of R2. This doesn’t mean that the metric R2 would be meaningless, but it would lose its interpretation as the amount of variation in the dependent variables that is explained by our model’s predictions.
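Here is a short sketch (synthetic data assumed) verifying this orthogonality numerically for an OLS fit:

```python
import numpy as np

rng = np.random.default_rng(3)
N, P = 100, 4
X = rng.normal(size=(N, P))
y = X @ rng.normal(size=P) + rng.normal(size=N)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)  # OLS via the normal equations
y_hat = X @ beta_hat
e = y - y_hat                                 # residuals

print(np.max(np.abs(X.T @ e)))   # ~0: the normal equations (Equation 10)
print(y_hat @ e)                 # ~0: the first cancellation (Equation 11)
```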
The second cancellation is
$$2 \bar{y} \sum_{n=1}^{N} e_n = 0. \tag{12}$$
This is true if we assume a constant predictor, i.e. when
$$X = \begin{bmatrix} c & X_2 \end{bmatrix}, \tag{13}$$
where X2 is our original design matrix and c is an N-vector filled with the constant c. If X has a constant predictor, then the last line of Equation 10 can be written as

$$0 = X^{\top} e = \begin{bmatrix} c^{\top} e \\ X_2^{\top} e \end{bmatrix}. \tag{14}$$
This immediately implies that the sum of the residuals is zero:
$$\sum_{n=1}^{N} c\, e_n = c \sum_{n=1}^{N} (y_n - \hat{y}_n) = 0. \tag{15}$$
To summarize, R2 can only be interpreted in terms of the fraction of variance unexplained if we both use OLS and include an intercept (a constant predictor). Without these assumptions, the decomposition in Equation 5 is no longer valid, and R2 loses this standard interpretation.
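A small sketch (toy data of my own choosing) makes the role of the constant predictor visible by comparing the sum of the residuals with and without an intercept column:

```python
import numpy as np

rng = np.random.default_rng(4)
N = 100
x = rng.normal(size=N)
y = 5.0 + 2.0 * x + rng.normal(size=N)

def residuals(X, y):
    """Residuals of an OLS fit computed via the normal equations."""
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    return y - X @ beta_hat

e_with = residuals(np.column_stack([np.ones(N), x]), y)  # constant predictor included
e_without = residuals(x[:, None], y)                     # no intercept

print(e_with.sum())     # ~0, as in Equation 15
print(e_without.sum())  # generally far from zero
```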
Uncentered R2
What if our model does not contain an intercept? The decomposition in Equation 5 is no longer valid. However, the uncentered total sum of squares still decomposes as

$$y^{\top} y = (\hat{y} + e)^{\top} (\hat{y} + e) = \hat{y}^{\top} \hat{y} + 2 \hat{y}^{\top} e + e^{\top} e \overset{\star}{=} \hat{y}^{\top} \hat{y} + e^{\top} e. \tag{16}$$
Again, the step labeled ⋆ holds because 0=X⊤e, meaning that we still assume OLS, just not OLS with an intercept. See A1 for an alternative derivation of this decomposition. We can rewrite Equation 16 to rederive the uncentered R2, which I’ll denote as Ruc2:
$$R^2_{\text{uc}} \triangleq \frac{\hat{y}^{\top} \hat{y}}{y^{\top} y} = 1 - \frac{e^{\top} e}{y^{\top} y}. \tag{17}$$
Since y^⊤y^≥0 and e⊤e≥0 and since
$$R^2_{\text{uc}} = \frac{\hat{y}^{\top} \hat{y}}{\hat{y}^{\top} \hat{y} + e^{\top} e}, \tag{18}$$
then 0≤Ruc2≤1. As we can see, this measures the goodness-of-fit of OLS. If the residuals are small, then Ruc2 is close to 1. If the residuals are large, then Ruc2 is close to 0.
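Here is a brief numerical sketch (assumed toy data) of the uncentered R2 for an OLS fit without an intercept:

```python
import numpy as np

rng = np.random.default_rng(5)
N = 100
x = rng.normal(size=N)
y = 2.0 * x + rng.normal(scale=0.5, size=N)

X = x[:, None]                                # no intercept column
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ beta_hat
e = y - y_hat

r2_uc = (y_hat @ y_hat) / (y @ y)             # Equation 17
assert np.isclose(r2_uc, 1.0 - (e @ e) / (y @ y))
assert 0.0 <= r2_uc <= 1.0                    # guaranteed by Equation 18
print(r2_uc)
```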
Negative R2
As we have seen, the uncentered metric Ruc2 cannot be negative; given Equation 18, a negative value is algebraically impossible. However, in practice, one might find that the coefficient of determination or the uncentered R2 is negative. Why does this occur? There are two scenarios. Let’s look at each in turn.
When we use OLS
If our predictors include a constant, then R2 cannot be negative. We know R2 cannot be negative because this would require

$$(y - \hat{y})^{\top} (y - \hat{y}) > (y - \bar{y})^{\top} (y - \bar{y}), \tag{19}$$

i.e. the fitted model would have a larger sum of squared residuals than a model that simply predicts the sample mean.
However, the OLS estimator β^ is the minimizer of the sum of squared residuals (see Equation 2). If Equation 19 were true, then β^ would not be the minimizer, since simply predicting the mean (which is achievable with a constant predictor) would have a smaller sum of squared residuals. Indeed, the smallest value R2 can take is zero, which occurs when the predictions are simply the target mean. It’s easy to see that this occurs when X is un-informative, i.e. when the predictors are constant. See A2 for a proof.
Figure 1. OLS without (red solid line) and with (red dashed line) an intercept. The model's parameters change depending on this modeling assumption.
However, if we do not include an intercept and if our software package still computes R2 using Equation 6, then R2 can be negative. Intuitively, this is because the results of OLS can change dramatically when an intercept is added or removed (Figure 1). In other words, when we do not include an intercept, our model can do much worse than simply predicting the sample mean. Arguably, a negative R2 with OLS is a user error, since it implies the model would do better with the right modeling assumption.
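The following sketch (with deliberately offset toy data of my own construction) shows how R2, still computed with the centered formula, goes negative when the intercept is omitted:

```python
import numpy as np

rng = np.random.default_rng(6)
N = 100
x = rng.normal(size=N)
y = 10.0 + 0.1 * x + rng.normal(scale=0.5, size=N)  # large offset, weak slope

X = x[:, None]                                # fit WITHOUT an intercept
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ beta_hat

rss = np.sum((y - y_hat) ** 2)
tss = np.sum((y - y.mean()) ** 2)
print(1.0 - rss / tss)   # negative: the no-intercept fit is worse than predicting the mean
```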
When we do not use OLS
As we saw, the decomposition of the total sum of squares, centered or uncentered, depends on the normal equation, or at least on X⊤e=0. Thus, when using a model that is not linear regression, such as a non-linear deep neural network, R2 loses its interpretation as the amount of variation in the target variables that is explained by our predictions, and we have no guarantee that R2 or Ruc2 is non-negative.
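As a sketch (with arbitrary, hypothetical predictions standing in for a non-OLS model), nothing prevents a negative R2 once we leave OLS:

```python
import numpy as np

rng = np.random.default_rng(7)
y = rng.normal(loc=5.0, scale=1.0, size=100)

# Arbitrary (non-OLS) predictions that are worse than just predicting the mean.
y_hat = np.zeros_like(y)

rss = np.sum((y - y_hat) ** 2)
tss = np.sum((y - y.mean()) ** 2)
print(1.0 - rss / tss)   # negative: no OLS guarantee protects us here
```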
Correlation coefficient squared
The coefficient of determination is called “R2” because it is the square of the Pearson correlation coefficient, which is commonly denoted R, between the targets y and predictions y^. To see this, let’s first write down the correlation coefficient:

$$R = \frac{\sum_{n=1}^{N} (y_n - \bar{y})(\hat{y}_n - \bar{y})}{\sqrt{\sum_{n=1}^{N} (y_n - \bar{y})^2} \sqrt{\sum_{n=1}^{N} (\hat{y}_n - \bar{y})^2}}. \tag{20}$$
Note that we have written yˉ for the mean of the predictions, rather than y^ˉ. This is because yet another important property of OLS with an intercept is that the mean of the predictions equals the mean of the observations, i.e. yˉ=y^ˉ. Why? Since en=yn−y^n by definition, we can write down a relationship between the means of our residuals, predictions, and observations:

$$\bar{y} = \frac{1}{N} \sum_{n=1}^{N} y_n = \frac{1}{N} \sum_{n=1}^{N} (\hat{y}_n + e_n) = \bar{\hat{y}} + \bar{e} = \bar{\hat{y}}. \tag{21}$$
Again, this depends on the fact that the residuals sum to zero, as they do when we have a constant predictor in X. Now let’s apply the following simplification to the numerator of Equation 20:

$$\sum_{n=1}^{N} (y_n - \bar{y})(\hat{y}_n - \bar{y}) = \sum_{n=1}^{N} (\hat{y}_n + e_n - \bar{y})(\hat{y}_n - \bar{y}) = \sum_{n=1}^{N} (\hat{y}_n - \bar{y})^2 + \sum_{n=1}^{N} e_n (\hat{y}_n - \bar{y}) \overset{\star}{=} \sum_{n=1}^{N} (\hat{y}_n - \bar{y})^2. \tag{22}$$
Again, step ⋆ holds since, if we use OLS with an intercept, the residuals sum to zero. See A3 for a complete derivation of this step. Thus, we can write R as

$$R = \frac{\sum_{n=1}^{N} (\hat{y}_n - \bar{y})^2}{\sqrt{\sum_{n=1}^{N} (y_n - \bar{y})^2} \sqrt{\sum_{n=1}^{N} (\hat{y}_n - \bar{y})^2}} = \sqrt{\frac{\sum_{n=1}^{N} (\hat{y}_n - \bar{y})^2}{\sum_{n=1}^{N} (y_n - \bar{y})^2}}. \tag{23}$$
If we square this quantity, we get R2 as in Equation 6.
What this means is that if the targets and predictions are highly negatively or positively correlated, then OLS with an intercept will have a good fit or high R2. As the correlation decreases (regardless of sign), the goodness-of-fit decreases.
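A short numerical check (toy data assumed) that, for OLS with an intercept, R2 equals the squared Pearson correlation between y and y^:

```python
import numpy as np

rng = np.random.default_rng(8)
N = 200
x = rng.normal(size=N)
y = 1.0 - 3.0 * x + rng.normal(scale=2.0, size=N)

X = np.column_stack([np.ones(N), x])          # OLS with an intercept
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ beta_hat

rss = np.sum((y - y_hat) ** 2)
tss = np.sum((y - y.mean()) ** 2)
r_squared = 1.0 - rss / tss                   # Equation 7

pearson_r = np.corrcoef(y, y_hat)[0, 1]       # Pearson's R between targets and predictions
assert np.isclose(r_squared, pearson_r ** 2)
print(r_squared, pearson_r ** 2)
```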
Summary
The coefficient of determination is an important metric for quantifying the goodness-of-fit of ordinary least squares. However, its interpretation as the fraction of variance explained by the model depends on the assumption that we are either using OLS with an intercept or we have mean-centered data. A negative R2 is only possible when either not using linear regression or not using linear regression with an intercept. We refer to the coefficient of determination as “R2” because it is the square of Pearson’s correlation coefficient R. When our predictions are either negatively or positively correlated with our regression targets, then a linear model will fit the data better.
A1. Alternative derivation of the decomposition

Write the OLS predictions and residuals as y^=Hy and e=My, where H≜X(X⊤X)−1X⊤ is the hat matrix and M≜I−H. The cancellations arise because these matrices are orthogonal to each other, i.e. MH=0.
A2. OLS parameters with un-informative predictors
Consider the scenario in which X is an N×1 matrix of completely un-informative predictors, i.e. a column vector with a constant c. Let’s write this as X≜c=[c,…,c]⊤. Then clearly

$$\hat{\beta} = (X^{\top} X)^{-1} X^{\top} y = \frac{c^{\top} y}{c^{\top} c} = \frac{c \sum_{n=1}^{N} y_n}{N c^2} = \frac{\bar{y}}{c},$$

and therefore y^n=cβ^=yˉ for every n. In other words, when the predictors are un-informative, OLS simply predicts the sample mean of the targets.