Correlation in detail

Correlation is conceptually straightforward, but I want to explore its mathematical formulation in detail. To do so, I build up to the formalization from first principles.

Published

14 June 2018

Formalizing intuition

The layman’s idea of correlation is a dependence relationship between events or variables. For example, every time the sun rises, the earth gets warmer. When the president talks of war, the stock market dips. The fact that such a relationship is not necessarily causative has been canonized with the phrase, “Correlation does not imply causation.” These ideas are fairly easy to understand at a high level, but we want to formalize our thinking. By thinking mathematically, we can make exact statements, prove things, and build complex systems with guarantees. Rather than simply provide the equation, I would like to build up to the equation from first principles.

As a motivating example, consider Figure 1, which consists of four time series with varying degrees of correlation. Each series represents a random variable over time, so each $y$-coordinate is an observation. Intuitively, we say that two time series are more correlated the closer the lines are to each other. For example, the red and blue series (squares and stars) are highly correlated with each other, while the green series (triangles) is moderately correlated with the red and blue series. The purple series (dots) is negatively correlated with the red and blue series.

Figure 1: TODO.

But what does correlation mean mathematically? To begin, let’s try to develop our own measure of correlation that captures our intuition about the time series above. One idea is to sum the differences between the two series:

$\text{corr}_{\text{diff}}(\textbf{x}, \textbf{y}) = \sum_{i} (\textbf{x}_i - \textbf{y}_i)$

Here, we just measure the total difference between one variable and another. If the two time series were identical, the correlation would be measured as $0$. But this measure does not work well for a few reasons. For starters, the measure depends on the order of its arguments; the correlation may be negative or positive depending on whether $\textbf{y}$ is greater than $\textbf{x}$ on average. For example, $\text{corr}_{\text{diff}}(\color{red}{\text{r}}, \color{blue}{\text{b}}) < 0$ while $\text{corr}_{\text{diff}}(\color{blue}{\text{b}}, \color{red}{\text{r}}) > 0$. Positive and negative differences also cancel, so two series that are far from identical can still sum to $0$. We can visualize this as the total area between the two series, where the vertical and horizontal hatching denote negative and positive area:

Figure 2: TODO.

This does not, I think, capture our intuition about correlation.
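To make the sign problem concrete, here is a minimal sketch in Python. The function name `corr_diff` and the two short series are made up for illustration; they are not the data behind Figure 1.

```python
def corr_diff(x, y):
    """Sum of element-wise differences between two equal-length series."""
    return sum(xi - yi for xi, yi in zip(x, y))

# Two toy series that move together, with b sitting slightly above r.
r = [1.0, 2.0, 3.0, 4.0]
b = [1.5, 2.5, 3.5, 4.5]

forward = corr_diff(r, b)   # negative, because b is larger on average
backward = corr_diff(b, r)  # same magnitude, opposite sign
identical = corr_diff(r, r) # zero for identical series

print(forward, backward, identical)
```

The two series rise in lockstep, yet the measure reports $-2$ or $+2$ depending only on which series is passed first.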

We would like a measure such that strictly greater or lesser values mean strictly greater or lesser correlation. One idea would be to perform element-wise multiplication of the two time series. This way, two data points that are both positive or both negative contribute positively to the measure, while two data points with different signs contribute negatively:

$\text{corr}_{\text{prod}}(\textbf{x}, \textbf{y}) = \sum_{i} \textbf{x}_i \textbf{y}_i = \textbf{x} \cdot \textbf{y}$

This is okay. With this measure, perfect correlation would have some value, say $Z$, and the closer the correlation is to $Z$, the more correlated the two series are. Negative correlation would be close to $-Z$. For example, here are the numbers for the time series in Figure 1:

$\begin{align} \text{corr}(\color{red}{\text{r}}, \color{red}{\text{r}}) &\approx 55278 \\ \text{corr}(\color{red}{\text{r}}, \color{blue}{\text{b}}) = \text{corr}(\color{blue}{\text{b}}, \color{red}{\text{r}}) &\approx 63400 \\ \text{corr}(\color{red}{\text{r}}, \color{green}{\text{g}}) = \text{corr}(\color{green}{\text{g}}, \color{red}{\text{r}}) &\approx 79342 \\ \text{corr}(\color{red}{\text{r}}, \color{purple}{\text{p}}) = \text{corr}(\color{purple}{\text{p}}, \color{red}{\text{r}}) &\approx -63757 \end{align}$

This is reasonable. But we still have a problem. With this measure, the magnitude of the result is hard to interpret. It is not clear what a number like $63400$ means in isolation. If someone told you, “The two series are correlated with measure $63400$”, you would not know what this meant, because the number depends on incidental details like the units and magnitudes of the variables being measured.
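A small sketch of this scale problem, assuming a hypothetical helper `corr_prod` and toy data chosen only for illustration: the same two series give wildly different scores depending on the units they are recorded in.

```python
def corr_prod(x, y):
    """Dot product of two equal-length series."""
    return sum(xi * yi for xi, yi in zip(x, y))

x = [1.0, -2.0, 3.0]
y = [2.0, -1.0, 4.0]

base = corr_prod(x, y)

# Same data, but with both series expressed in units 100 times smaller
# (for example, centimetres instead of metres).
x_cm = [100 * v for v in x]
y_cm = [100 * v for v in y]
rescaled = corr_prod(x_cm, y_cm)

print(base, rescaled, rescaled / base)
```

Nothing about the relationship between the two series changed, but the score grew by a factor of $100^2$.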

What if we could scale our measure between $-1$ and $1$? That way, if we said that two variables were correlated with measure $1$, it would mean the data are perfectly correlated, while measure $0$ would mean no correlation and $-1$ would mean perfect negative correlation. This suggests normalization, meaning we want to divide the measure by a quantity that reflects the scale of the data, so that the range of possible values is constrained. One idea would be to divide the measure by a series’ correlation with itself:

$\text{corr}_{\text{norm}}(\textbf{x}, \textbf{y}) = \frac{\sum_{i} \textbf{x}_i \textbf{y}_i}{\sum_{i} \textbf{x}_i \textbf{x}_i}$
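It can help to try this normalization on a small example before going further. The sketch below is only an illustration: `corr_norm` is a hypothetical name for the formula above, and the data are made up.

```python
def corr_norm(x, y):
    """Dot product of x and y divided by the dot product of x with itself."""
    num = sum(xi * yi for xi, yi in zip(x, y))
    den = sum(xi * xi for xi in x)
    return num / den

x = [1.0, -2.0, 3.0]
x2 = [2 * v for v in x]  # the same series at twice the scale

same = corr_norm(x, x)      # a series against itself
doubled = corr_norm(x, x2)  # normalized by the smaller series
halved = corr_norm(x2, x)   # normalized by the larger series

print(same, doubled, halved)
```

A series scores $1$ against itself, as hoped, but the score against a rescaled copy is $2$ or $0.5$ depending on argument order, because only the first argument's scale enters the denominator.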

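This normalization is still not quite right: it accounts only for the scale of $\textbf{x}$, so it is not symmetric in its arguments and is not guaranteed to lie between $-1$ and $1$. The standard definitions fix this by normalizing with respect to both variables. There are several types of numerical measures for correlation, called correlation coefficients. You can think of each coefficient as being a different way of mathematically expressing the same statistical idea. For simplicity, I will focus on the most common one, the Pearson correlation coefficient. Given random variables $\textbf{x}$ and $\textbf{y}$ with means $\mu_{x}$ and $\mu_y$ respectively, the Pearson correlation coefficient is defined as: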

$\text{corr}(\textbf{x}, \textbf{y}) = \frac{\text{cov}(\textbf{x}, \textbf{y})}{\sqrt{\text{var}(\textbf{x})} \sqrt{\text{var}(\textbf{y})}} = \frac{\mathbb{E}[(\textbf{x} - \mu_x)(\textbf{y} - \mu_y)]}{\sqrt{\mathbb{E}[(\textbf{x} - \mu_x)^2]} \sqrt{\mathbb{E}[(\textbf{y} - \mu_y)^2]}}$

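Correlation is therefore a function of covariance and variance. Variance measures the expected squared deviation of a random variable from its mean, and covariance measures the expected product of two variables’ deviations from their respective means. Dividing by the two standard deviations constrains the result to lie between $-1$ and $1$.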
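As a rough sketch of the definition, the expectations can be replaced by averages over observed samples. The helper names (`mean`, `cov`, `pearson`) and the data below are assumptions made for illustration, not part of any particular library.

```python
import math

def mean(v):
    return sum(v) / len(v)

def cov(x, y):
    """Average product of deviations from the means (population covariance)."""
    mx, my = mean(x), mean(y)
    return sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / len(x)

def pearson(x, y):
    """Covariance divided by the product of the two standard deviations."""
    # cov(v, v) is the variance of v.
    return cov(x, y) / (math.sqrt(cov(x, x)) * math.sqrt(cov(y, y)))

x = [1.0, 2.0, 3.0, 4.0, 5.0]
up = [2.0, 4.0, 6.0, 8.0, 10.0]    # rises with x
down = [5.0, 4.0, 3.0, 2.0, 1.0]   # falls as x rises
noisy = [2.0, 1.0, 4.0, 3.0, 5.0]  # rises with x, but not in lockstep

print(pearson(x, up), pearson(x, down), pearson(x, noisy))
```

The three results land near $1$, $-1$, and somewhere in between, which is the behavior the earlier hand-built measures were reaching for.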

The geometric interpretation

Now that we have a good grasp of what correlation means mathematically, let’s explore its geometric interpretation. To simplify the math, let’s assume our data are mean centered, i.e. the mean has already been subtracted from each variable, so that $\mu_x = \mu_y = 0$. When this is the case, the Pearson correlation coefficient is equivalent to the cosine similarity, the cosine of the angle $\theta$ between the two vectors:

$\begin{align} \rho &= \frac{\sum (x_n - \mu_x)(y_n - \mu_y)}{\sqrt{\sum (x_n - \mu_x)^2} \sqrt{\sum (y_n - \mu_y)^2}} \\ \\ &= \frac{\sum x_n y_n}{\sqrt{\sum x_n^2} \sqrt{\sum y_n^2}} \\ \\ &= \frac{\textbf{x} \cdot \textbf{y}}{\lVert \textbf{x} \rVert_2 \lVert \textbf{y} \rVert_2} \\ \\ &= \cos \theta \end{align}$
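The equivalence can be checked numerically. In this sketch, `center` and `cosine_similarity` are hypothetical helpers and the data are made up; the point is only that centering must happen first for the cosine to match the Pearson coefficient.

```python
import math

def center(v):
    """Subtract the mean from every element."""
    m = sum(v) / len(v)
    return [vi - m for vi in v]

def cosine_similarity(x, y):
    """Dot product divided by the product of the Euclidean norms."""
    dot = sum(xi * yi for xi, yi in zip(x, y))
    norm_x = math.sqrt(sum(xi * xi for xi in x))
    norm_y = math.sqrt(sum(yi * yi for yi in y))
    return dot / (norm_x * norm_y)

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 1.0, 4.0, 3.0, 5.0]

rho = cosine_similarity(center(x), center(y))  # Pearson correlation of x and y
theta = math.degrees(math.acos(rho))           # angle between centered vectors
raw = cosine_similarity(x, y)                  # uncentered: a different number

print(rho, theta, raw)
```

Geometrically, perfectly correlated data correspond to centered vectors pointing in the same direction ($\theta = 0$), uncorrelated data to orthogonal vectors, and perfectly negatively correlated data to vectors pointing in opposite directions.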

Correlation is scale invariant

A property that is critical to the concept of correlation is that it is scale invariant. This means that two variables can be perfectly correlated even though one variable changes at a scale that is disproportionate to the other. For example, these two time series are perfectly correlated:

[TODO: Image of two time series that are perfectly correlated but at different scales]

We can see this mathematically by computing the correlation between two vectors $\textbf{x}$ and $\textbf{y} = \alpha \textbf{x}$, where $\alpha$ is a positive scalar. We take $\alpha > 0$ so that $\sqrt{\alpha^2} = \alpha$; for negative $\alpha$, $\sqrt{\alpha^2} = -\alpha$ and the same steps give $-1$ instead.

$\begin{align} \text{corr}(\textbf{x}, \alpha \textbf{x}) &= \frac{\mathbb{E}[(\textbf{x} - \mathbb{E}[\textbf{x}])(\alpha \textbf{x} - \mathbb{E}[\alpha \textbf{x}])]}{\sqrt{\text{var}(\textbf{x})} \sqrt{\text{var}(\alpha \textbf{x})}} \\ \\ &= \frac{\mathbb{E}[(\textbf{x} - \mathbb{E}[\textbf{x}])(\alpha(\textbf{x} - \mathbb{E}[\textbf{x}]))]}{\sqrt{\text{var}(\textbf{x})} \sqrt{\alpha^2 \text{var}(\textbf{x})}} \\ \\ &= \frac{\alpha \mathbb{E}[(\textbf{x} - \mathbb{E}[\textbf{x}])(\textbf{x} - \mathbb{E}[\textbf{x}])]}{\alpha \sqrt{ \text{var}(\textbf{x})} \sqrt{\text{var}(\textbf{x})}} \\ \\ &= \text{corr}(\textbf{x}, \textbf{x}) \\ \\ &= 1 \end{align}$
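A quick numerical check of this property, as a sketch under assumed data: `pearson` here is a hypothetical sample-based helper, and the multipliers are arbitrary. The negative multiplier is included to show that the sign of the scalar matters even though its magnitude does not.

```python
import math

def pearson(x, y):
    """Sample Pearson correlation of two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

x = [0.5, 1.0, 4.0, 2.5, 3.0]
stretched = [1000.0 * v for v in x]  # same shape, much larger scale
flipped = [-3.0 * v for v in x]      # scaled and mirrored

print(pearson(x, stretched), pearson(x, flipped))
```

Up to floating-point rounding, the first value is $1$ and the second is $-1$, regardless of how large the multiplier is.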