Simple Linear Regression and Correlation

In simple linear regression, the slope parameter is a simple function of the correlation between the targets and predictors. I derive this result and discuss a few consequences.

Consider simple linear regression, i.e. linear regression with a single independent variable,

$$
y_n = \alpha + \beta x_n + \varepsilon_n. \tag{1}
$$

$\beta$ is the model's slope, and $\alpha$ is the model's intercept. There is an interesting relationship between the estimated linear coefficient $\hat{\beta}$ and Pearson's correlation coefficient between the predictors and targets, $\rho_{xy}$. The goal of this post is to understand this relationship better.

Univariate normal equation

Let’s rederive the normal equations for ordinary least squares (OLS), which minimizes the sum of squared residuals:

$$
\begin{aligned}
J(\beta, \alpha) &= \sum_{n=1}^N (y_n - \alpha - \beta x_n)^2, \\
\hat{\alpha}, \hat{\beta} &= \arg\!\min_{\alpha, \beta} J(\beta, \alpha).
\end{aligned} \tag{2}
$$

We can find the minimizers for $\beta$ and $\alpha$ by differentiating $J$ w.r.t. these parameters and solving for them after setting each derivative equal to zero. (We are ignoring the endpoints and second-order conditions since this is an established result.)

First, let's solve for the intercept $\alpha$. We take the derivative of the objective function w.r.t. $\alpha$,

$$
\begin{aligned}
\frac{d J}{d \alpha}
&= \sum_{n=1}^N \frac{d}{d \alpha} (y_n - \alpha - \beta x_n)^2 \\
&= \sum_{n=1}^N 2 (y_n - \alpha - \beta x_n)(-1) \\
&= 2 \left( \sum_{n=1}^N \beta x_n - y_n + \alpha \right),
\end{aligned} \tag{3}
$$

set this equal to zero, and then solve for $\alpha$:

$$
\begin{aligned}
0 &= 2 \left( \sum_{n=1}^N \beta x_n - \sum_{n=1}^N y_n + N \alpha \right), \\
&\Downarrow \\
\hat{\alpha} &= \bar{y} - \beta \bar{x},
\end{aligned} \tag{4}
$$

where $\bar{x}$ and $\bar{y}$ are the sample means, e.g. $\bar{x} = (1/N) \sum_{n=1}^N x_n$.

Next, let's solve for the slope $\beta$. We take the derivative of our objective function w.r.t. $\beta$,

$$
\begin{aligned}
\frac{d J}{d \beta}
&= \sum_{n=1}^N \frac{d}{d \beta} (y_n - \alpha - \beta x_n)^2 \\
&= \sum_{n=1}^N 2 (y_n - \alpha - \beta x_n)(-x_n) \\
&= 2 \left( \sum_{n=1}^N \beta x_n^2 - x_n y_n + \alpha x_n \right),
\end{aligned} \tag{5}
$$

set this equal to zero, and solve, plugging in the value of $\alpha$ that we just computed:

$$
\begin{aligned}
0 &= 2 \left[ \sum_{n=1}^N \beta x_n^2 - x_n y_n + \alpha x_n \right], \\
&\Downarrow \\
\alpha \sum_{n=1}^N x_n + \beta \sum_{n=1}^N x_n^2 &= \sum_{n=1}^N x_n y_n \\
(\bar{y} - \bar{x} \beta) \sum_{n=1}^N x_n + \beta \sum_{n=1}^N x_n^2 &= \sum_{n=1}^N x_n y_n \\
\bar{y} \sum_{n=1}^N x_n - \bar{x} \beta \sum_{n=1}^N x_n + \beta \sum_{n=1}^N x_n^2 &= \sum_{n=1}^N x_n y_n \\
\beta \left[ \sum_{n=1}^N x_n^2 - \frac{1}{N} \sum_{n=1}^N x_n \sum_{n=1}^N x_n \right] &= \sum_{n=1}^N x_n y_n - \frac{1}{N} \sum_{n=1}^N y_n \sum_{n=1}^N x_n.
\end{aligned} \tag{6}
$$

To summarize, the OLS estimates for $\hat{\alpha}$ and $\hat{\beta}$ are

$$
\begin{aligned}
\hat{\alpha} &= \bar{y} - \hat{\beta} \bar{x}, \\
\hat{\beta} &= \frac{\sum_{n=1}^N x_n y_n - \frac{1}{N} \sum_{n=1}^N y_n \sum_{n=1}^N x_n}{\sum_{n=1}^N x_n^2 - \frac{1}{N} \sum_{n=1}^N x_n \sum_{n=1}^N x_n}.
\end{aligned} \tag{7}
$$
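
As a sanity check, here is a minimal NumPy sketch (with made-up data) that computes $\hat{\alpha}$ and $\hat{\beta}$ directly from Equation 7 and compares them against `np.polyfit`:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100
x = rng.normal(size=N)
y = 2.0 + 1.5 * x + rng.normal(scale=0.5, size=N)  # made-up data

# Equation 7: closed-form OLS estimates.
beta_hat = (np.sum(x * y) - np.sum(y) * np.sum(x) / N) / (
    np.sum(x**2) - np.sum(x) * np.sum(x) / N
)
alpha_hat = y.mean() - beta_hat * x.mean()

# Compare against a library fit (degree-1 polynomial: slope, then intercept).
beta_ref, alpha_ref = np.polyfit(x, y, deg=1)
assert np.isclose(beta_hat, beta_ref) and np.isclose(alpha_hat, alpha_ref)
```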

Figure 1. OLS with an intercept (solid line) can be decomposed into OLS without an intercept (dotted line) and an intercept term (dashed line). Without an intercept, OLS goes through the origin. With an intercept, the hyperplane is shifted by the distance between the original hyperplane and the mean of the data.

I find it useful to visualize these two parameter estimates (Figure 1). Without an intercept, the fitted line with slope $\hat{\beta}$ passes through the origin $(0, 0)$; the intercept $\hat{\alpha}$ shifts this line so that it passes through the data's mean, $(\bar{x}, \bar{y})$.

$\hat{\beta}$ in terms of correlation

We can write $\hat{\beta}$ in terms of Pearson's correlation between our targets $\mathbf{y}$ and predictors $\mathbf{x}$, denoted $\rho_{xy}$. First, let's denote the sample standard deviations for $\mathbf{x}$ and $\mathbf{y}$ as $S_x$ and $S_y$ respectively, i.e.

$$
S_x \triangleq \sqrt{\frac{1}{N} \sum_{n=1}^N (x_n - \bar{x})^2}, \qquad
S_y \triangleq \sqrt{\frac{1}{N} \sum_{n=1}^N (y_n - \bar{y})^2}. \tag{8}
$$

Now note that $\hat{\beta}$ in Equation 7 can be written in terms of the covariance of $\mathbf{x}$ and $\mathbf{y}$ (numerator) and the variance of $\mathbf{x}$ (denominator). This is because

$$
\begin{aligned}
\sum_{n=1}^N x_n y_n - \frac{1}{N} \sum_{n=1}^N y_n \sum_{n=1}^N x_n &= \sum_{n=1}^N (x_n - \bar{x})(y_n - \bar{y}) \triangleq N S_{xy}, \\
\sum_{n=1}^N x_n^2 - \frac{1}{N} \sum_{n=1}^N x_n \sum_{n=1}^N x_n &= \sum_{n=1}^N (x_n - \bar{x})^2 \triangleq N S_x^2.
\end{aligned} \tag{9}
$$

See A1 for complete derivations of each. Then we can write $\hat{\beta}$ in terms of covariance and variance,

$$
\hat{\beta} = \frac{\sum_{n=1}^N (x_n - \bar{x})(y_n - \bar{y})}{\sum_{n=1}^N (x_n - \bar{x})^2} = \frac{S_{xy}}{S_x^2}, \tag{10}
$$

which in turn can be written in terms of Pearson's correlation coefficient $\rho_{xy}$,

$$
\begin{aligned}
\hat{\beta} &= \overbrace{\left( \frac{\sum_{n=1}^N (x_n - \bar{x})(y_n - \bar{y})}{\sqrt{\sum_{n=1}^N (x_n - \bar{x})^2} \sqrt{\sum_{n=1}^N (y_n - \bar{y})^2}} \right)}^{\rho_{xy}} \left( \frac{\sqrt{\sum_{n=1}^N (y_n - \bar{y})^2}}{\sqrt{\sum_{n=1}^N (x_n - \bar{x})^2}} \right) \\
&= \rho_{xy} \frac{\sqrt{\sum_{n=1}^N (y_n - \bar{y})^2}}{\sqrt{\sum_{n=1}^N (x_n - \bar{x})^2}} \\
&= \rho_{xy} \frac{S_y}{S_x}.
\end{aligned} \tag{11}
$$

In other words, if we standardize our data, the estimated slope $\hat{\beta}$ is just $\rho_{xy}$, the correlation between $\mathbf{x}$ and $\mathbf{y}$. Note that if we were to use OLS without an intercept, we would also need to mean-center our data for this claim to hold. This is because $\hat{\beta}$ without an intercept is

$$
\hat{\beta} = \frac{\sum_{n=1}^N x_n y_n}{\sum_{n=1}^N x_n^2}. \tag{12}
$$

See A2 for a derivation.
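
To make Equation 11 concrete, here is a minimal NumPy sketch (again with made-up data) checking that the fitted slope equals $\rho_{xy} S_y / S_x$, and that on standardized data the slope is just the correlation:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=200)
y = -0.7 * x + rng.normal(scale=1.0, size=200)  # made-up data

rho = np.corrcoef(x, y)[0, 1]
S_x, S_y = x.std(), y.std()  # 1/N normalization, as in Equation 8

# Slope of OLS with an intercept equals rho * S_y / S_x (Equation 11).
beta_hat = np.polyfit(x, y, deg=1)[0]
assert np.isclose(beta_hat, rho * S_y / S_x)

# With standardized data, the fitted slope is the correlation itself.
x_std = (x - x.mean()) / S_x
y_std = (y - y.mean()) / S_y
assert np.isclose(np.polyfit(x_std, y_std, deg=1)[0], rho)
```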

Since OLS predicts $\hat{y} = \hat{\alpha} + \hat{\beta} x$, we can write our predictions in terms of the correlation:

$$
\begin{aligned}
\hat{y} &= \hat{\alpha} + \hat{\beta} x \\
\hat{y} &= (\bar{y} - \hat{\beta} \bar{x}) + \hat{\beta} x \\
\hat{y} &= \bar{y} + \hat{\beta}(x - \bar{x}) \\
\hat{y} - \bar{y} &= \left( \rho_{xy} \frac{S_y}{S_x} \right)(x - \bar{x}) \\
&\Downarrow \\
\frac{\hat{y} - \bar{\hat{y}}}{S_y} &= \rho_{xy} \left( \frac{x - \bar{x}}{S_x} \right).
\end{aligned} \tag{13}
$$

We saw in a previous post that $\bar{y} = \bar{\hat{y}}$.

Figure 2. Slope parameter $\hat{\beta}$ for OLS fit to targets and predictors $(\mathbf{x}, \mathbf{y})$ (top left), $(\mathbf{x}, 2\mathbf{y})$ (top right), and $(2\mathbf{x}, \mathbf{y})$ (bottom left). The slope halves or doubles, depending on how the standard deviation terms change.

Understanding $\hat{\beta}$ in this way makes it easier to see how changes to our data affect the fit. For example, imagine that we doubled our predictors, i.e. we fit OLS to $2\mathbf{x} = [2x_1, \dots, 2x_N]^{\top}$, or that we doubled our targets, i.e. we fit OLS to $2\mathbf{y} = [2y_1, \dots, 2y_N]^{\top}$. How would this change our OLS estimates? The correlation would not change, but the relevant mean and standard deviation would double, so $\hat{\beta}$ would halve when $\mathbf{x}$ is doubled and double when $\mathbf{y}$ is doubled. We can see this directly in Equation 11, and I have visualized it in Figure 2.
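
Here is a minimal NumPy sketch of this scaling behavior, using made-up data: doubling $\mathbf{y}$ doubles the fitted slope, while doubling $\mathbf{x}$ halves it.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=150)
y = 0.8 * x + rng.normal(scale=0.3, size=150)  # made-up data


def slope(a, b):
    """Return the OLS slope of b regressed on a (with an intercept)."""
    return np.polyfit(a, b, deg=1)[0]


beta = slope(x, y)
assert np.isclose(slope(x, 2 * y), 2 * beta)  # doubling y doubles the slope
assert np.isclose(slope(2 * x, y), beta / 2)  # doubling x halves the slope
```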


Acknowledgements

I thank Andrei Margeloiu for pointing out some confusing text regarding when $\hat{\beta}$ is equal to $\rho_{xy}$.


Appendix

A1. Rewriting Equation 7

Equation 7's numerator can be written as the un-normalized sample covariance between $\mathbf{x}$ and $\mathbf{y}$:

$$
\begin{aligned}
&\sum_{n=1}^N x_n y_n - \frac{1}{N} \sum_{n=1}^N y_n \sum_{n=1}^N x_n \\
&= \sum_{n=1}^N x_n y_n - \bar{y} \sum_{n=1}^N x_n - \bar{x} \sum_{n=1}^N y_n + N \frac{1}{N} \sum_{n=1}^N y_n \frac{1}{N} \sum_{n=1}^N x_n \\
&= \sum_{n=1}^N x_n y_n - \bar{y} \sum_{n=1}^N x_n - \bar{x} \sum_{n=1}^N y_n + N \bar{y} \bar{x} \\
&= \sum_{n=1}^N (x_n - \bar{x})(y_n - \bar{y}).
\end{aligned} \tag{A1.1}
$$

Equation 7's denominator can be written as the un-normalized sample variance of $\mathbf{x}$:

$$
\begin{aligned}
&\sum_{n=1}^N x_n^2 - \frac{1}{N} \sum_{n=1}^N x_n \sum_{n=1}^N x_n \\
&= \sum_{n=1}^N x_n^2 - 2 \frac{1}{N} \sum_{n=1}^N x_n \sum_{n=1}^N x_n + N \frac{1}{N} \sum_{n=1}^N x_n \frac{1}{N} \sum_{n=1}^N x_n \\
&= \sum_{n=1}^N x_n^2 - 2 \bar{x} \sum_{n=1}^N x_n + N \bar{x}^2 \\
&= \sum_{n=1}^N (x_n - \bar{x})(x_n - \bar{x}).
\end{aligned} \tag{A1.2}
$$
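
Both identities are easy to confirm numerically. Here is a minimal NumPy sketch with made-up data:

```python
import numpy as np

rng = np.random.default_rng(3)
x, y = rng.normal(size=(2, 50))  # made-up data

# Equation A1.1: un-normalized sample covariance.
lhs_cov = np.sum(x * y) - np.sum(y) * np.sum(x) / len(x)
rhs_cov = np.sum((x - x.mean()) * (y - y.mean()))
assert np.isclose(lhs_cov, rhs_cov)

# Equation A1.2: un-normalized sample variance.
lhs_var = np.sum(x**2) - np.sum(x) * np.sum(x) / len(x)
rhs_var = np.sum((x - x.mean()) ** 2)
assert np.isclose(lhs_var, rhs_var)
```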

A2. OLS estimator without an intercept

Without an intercept $\alpha$, the optimal $\beta$ is

$$
\begin{aligned}
\frac{d J}{d \beta}
&= \sum_{n=1}^N \frac{d}{d \beta} (y_n - \beta x_n)^2 \\
&= \sum_{n=1}^N 2 (y_n - \beta x_n)(-x_n) \\
&= 2 \left( \sum_{n=1}^N \beta x_n^2 - x_n y_n \right), \\
&\Downarrow \\
0 &= 2 \left[ \sum_{n=1}^N \beta x_n^2 - x_n y_n \right], \\
&\Downarrow \\
\beta &= \frac{\sum_{n=1}^N x_n y_n}{\sum_{n=1}^N x_n^2}.
\end{aligned}
$$

This makes sense. Here, the optimal $\beta$ does not mean-center the data, since an intercept is not part of the model.
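
As a quick check (with made-up data), the closed-form ratio $\sum_n x_n y_n / \sum_n x_n^2$ matches a least-squares fit whose design matrix has a single column of $x_n$'s and no intercept column:

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(size=100)
y = 1.2 * x + rng.normal(scale=0.4, size=100)  # made-up data

beta_closed_form = np.sum(x * y) / np.sum(x**2)

# Least squares with a single predictor column and no intercept column.
beta_lstsq = np.linalg.lstsq(x[:, None], y, rcond=None)[0][0]
assert np.isclose(beta_closed_form, beta_lstsq)
```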