Canonical Correlation Analysis in Detail

Canonical correlation analysis is conceptually straightforward, but I want to define its objective and derive its solution in detail, both mathematically and programmatically.

Canonical correlation analysis (CCA) is a multivariate statistical method for finding two linear projections, one for each set of observations in a paired dataset, such that the projected data points are maximally correlated. Provided the data are mean-centered, this procedure can be visualized fairly easily (Figure 1). CCA finds linear projections such that paired data points are close to each other in a lower-dimensional space. And since mean-centered points are vectors with tails at the origin, CCA can be thought of as minimizing the cosine distance between paired points.

Figure 1: A visualization of canonical correlation analysis. Let $n$ be the number of observations. Two datasets $\mathbf{X}_a \in \mathbb{R}^{n \times p}$ and $\mathbf{X}_b \in \mathbb{R}^{n \times q}$ are transformed by projections $\mathbf{W}_a$ and $\mathbf{W}_b$ such that the paired embeddings, $\mathbf{Z}_a = \mathbf{X}_a \mathbf{W}_a$ and $\mathbf{Z}_b = \mathbf{X}_b \mathbf{W}_b$, are maximally correlated with unit length in the projected space.

The goal of this post is to go through the CCA procedure in painstaking detail, denoting the dimensionality of every vector and matrix, explaining the objective and constraints, walking through a derivation that proves how to solve the objective, and providing working code.

CCA objective

Consider two paired datasets, $\mathbf{X}_a \in \mathbb{R}^{n \times p}$ and $\mathbf{X}_b \in \mathbb{R}^{n \times q}$. By “paired”, I mean that the $i$-th sample consists of two row vectors, $\mathbf{x}_a^{(i)}$ and $\mathbf{x}_b^{(i)}$, with dimensionality $p$ and $q$ respectively. Thus, there are $n$ paired empirical multivariate observations. We assume the data are mean-centered. See the Appendix, Section 1 for a mathematical discussion of what this means.
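To make the setup concrete, here is a small sketch of what such a paired, mean-centered dataset might look like in NumPy. The data are synthetic and the sizes $n = 100$, $p = 5$, $q = 3$ are arbitrary; the array names Xa and Xb match the code in the Appendix.

import numpy as np

n, p, q = 100, 5, 3                      # n paired observations.
rng = np.random.RandomState(0)

Xa = rng.randn(n, p)                     # First view:  n x p.
Xb = Xa[:, :q] + 0.5 * rng.randn(n, q)   # Second view: n x q, correlated with the first.

# Mean-center every column so that each variable has zero mean.
Xa = Xa - Xa.mean(axis=0)
Xb = Xb - Xb.mean(axis=0)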

Let $\mathbf{w}_a \in \mathbb{R}^{p}$ and $\mathbf{w}_b \in \mathbb{R}^{q}$ denote linear transformations of the data (note that these are vectors, not matrices as in Figure 1; this will be explained). A pair of canonical variables $\mathbf{z}_a \in \mathbb{R}^{n}$ and $\mathbf{z}_b \in \mathbb{R}^{n}$ are the images of $\mathbf{X}_a$ and $\mathbf{X}_b$ after projection, i.e.:

$$\mathbf{z}_a = \mathbf{X}_a \mathbf{w}_a \qquad\text{and}\qquad \mathbf{z}_b = \mathbf{X}_b \mathbf{w}_b$$

The goal of CCA is to learn linear transformations $\mathbf{w}_a$ and $\mathbf{w}_b$ such that this pair of canonical variables is maximally correlated. Let $\mathbf{C}_{ij}$ denote the covariance matrix between datasets $\mathbf{X}_i$ and $\mathbf{X}_j$ for $i, j \in \{a, b\}$. Then the CCA objective is:

$$\rho = \max_{\mathbf{w}_a, \mathbf{w}_b} \frac{\mathbf{w}_a^{\top} \mathbf{C}_{ab}\, \mathbf{w}_b}{\sqrt{\mathbf{w}_a^{\top} \mathbf{C}_{aa}\, \mathbf{w}_a}\,\sqrt{\mathbf{w}_b^{\top} \mathbf{C}_{bb}\, \mathbf{w}_b}}$$

where the numerator is the cross-covariance of the projected data and the denominator contains the standard deviations of the projected data, both in vectorized form. Notice that since the following derivation is true (using the fact that the data are mean-centered),

$$\frac{\mathbf{w}_a^{\top}\mathbf{C}_{ab}\,\mathbf{w}_b}{\sqrt{\mathbf{w}_a^{\top}\mathbf{C}_{aa}\,\mathbf{w}_a}\,\sqrt{\mathbf{w}_b^{\top}\mathbf{C}_{bb}\,\mathbf{w}_b}} = \frac{\mathbf{w}_a^{\top}\mathbf{X}_a^{\top}\mathbf{X}_b\,\mathbf{w}_b}{\sqrt{\mathbf{w}_a^{\top}\mathbf{X}_a^{\top}\mathbf{X}_a\,\mathbf{w}_a}\,\sqrt{\mathbf{w}_b^{\top}\mathbf{X}_b^{\top}\mathbf{X}_b\,\mathbf{w}_b}} = \frac{\mathbf{z}_a^{\top}\mathbf{z}_b}{\|\mathbf{z}_a\|\,\|\mathbf{z}_b\|} = \cos\theta,$$

where $\theta$ is the angle between $\mathbf{z}_a$ and $\mathbf{z}_b$,

we can rewrite the objective as finding the minimum angle $\theta$ between the two canonical variables $\mathbf{z}_a$ and $\mathbf{z}_b$:

$$\rho = \max_{\mathbf{w}_a, \mathbf{w}_b} \frac{\mathbf{z}_a^{\top}\mathbf{z}_b}{\|\mathbf{z}_a\|\,\|\mathbf{z}_b\|}$$

I like this formulation because it has a nice geometric interpretation of what is happening. We want projections such that our embedded data are vectors pointing in the same direction in lower-dimensional space (again, Figure 1). See my post on the dot product if the previous statement did not make sense.
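To see this equivalence numerically, here is a quick sketch with synthetic data and arbitrary (not yet optimized) projection vectors wa and wb: the Pearson correlation of the projections equals the cosine of the angle between the mean-centered canonical variables.

import numpy as np

rng = np.random.RandomState(0)
n, p, q = 100, 5, 3
Xa = rng.randn(n, p)
Xb = Xa[:, :q] + 0.5 * rng.randn(n, q)
Xa = Xa - Xa.mean(axis=0)
Xb = Xb - Xb.mean(axis=0)

wa = rng.randn(p)                        # Any projection vectors will do here.
wb = rng.randn(q)
za = Xa @ wa
zb = Xb @ wb

corr   = np.corrcoef(za, zb)[0, 1]
cosine = za @ zb / (np.linalg.norm(za) * np.linalg.norm(zb))
assert np.isclose(corr, cosine)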

There are two additional constraints to the above CCA objective. First, the objective is scale invariant, so we can constrain the canonical variables $\mathbf{z}_a$ and $\mathbf{z}_b$ to have unit length. See Appendix, Section 2 for a mathematical derivation of why correlation is scale invariant. This further simplifies our objective to:

$$\rho = \max_{\mathbf{w}_a, \mathbf{w}_b} \; \mathbf{z}_a^{\top}\mathbf{z}_b \qquad \text{subject to} \quad \|\mathbf{z}_a\| = \|\mathbf{z}_b\| = 1$$

So far, we have only talked about finding $p$- and $q$-dimensional projections to $n$-dimensional canonical variables. But we can find multiple pairs of canonical variables that satisfy the above objective (we’ll see why later). If we would like to find a second pair of canonical variables, the second constraint is that these must be orthogonal to, or uncorrelated with, the first pair. In general, for the $i$-th and $j$-th pairs of canonical variables with $i \neq j$, we have:

$$\mathbf{z}_a^{(i)\top} \mathbf{z}_a^{(j)} = 0 \qquad\text{and}\qquad \mathbf{z}_b^{(i)\top} \mathbf{z}_b^{(j)} = 0$$

The number of pairs of canonical variables, call this $r$, cannot be greater than the minimum of $p$ and $q$ (again, we’ll see why later):

$$r \leq \min(p, q)$$

Putting this objective and the constraints in one place, we have, for each pair $i \in \{1, \dots, r\}$:

$$
\begin{aligned}
&\max_{\mathbf{w}_a^{(i)},\, \mathbf{w}_b^{(i)}} \;\; \mathbf{z}_a^{(i)\top} \mathbf{z}_b^{(i)} \\
&\;\;\text{subject to} \;\; \big\|\mathbf{z}_a^{(i)}\big\| = \big\|\mathbf{z}_b^{(i)}\big\| = 1, \qquad \mathbf{z}_a^{(i)\top} \mathbf{z}_a^{(j)} = \mathbf{z}_b^{(i)\top} \mathbf{z}_b^{(j)} = 0 \;\; (j \neq i)
\end{aligned}
$$

With this notation, we define the goal of CCA as finding linear projections $\mathbf{w}_a^{(i)}$ and $\mathbf{w}_b^{(i)}$, for $i \in \{1, \dots, r\}$, that satisfy the above constraints. Now, let’s see how to find these projections.

Solving for the objective

There are multiple ways to solve for the projections $\mathbf{w}_a$ and $\mathbf{w}_b$. I will discuss the one proposed by Hotelling (Hotelling, 1936), which uses the standard eigenvalue problem, solving for an eigenvector $\mathbf{v}$ and eigenvalue $\lambda$ given a matrix $\mathbf{A}$:

$$\mathbf{A}\mathbf{v} = \lambda\,\mathbf{v}$$

Or equivalently, solving the characteristic equation:

$$\det(\mathbf{A} - \lambda\,\mathbf{I}) = 0$$
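As a quick reminder of what this looks like in code (a toy example, not yet specific to CCA), np.linalg.eig returns eigenvalues and column eigenvectors satisfying the equation above:

import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
eigvals, eigvecs = np.linalg.eig(A)

# Each column v of `eigvecs` satisfies A v = lambda v.
for lam, v in zip(eigvals, eigvecs.T):
    assert np.allclose(A @ v, lam * v)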

First, recall our definition of $\mathbf{C}_{ij}$ as the empirical covariance matrix between the variables in $\mathbf{X}_i$ and $\mathbf{X}_j$, so:

$$\mathbf{C}_{ij} = \mathbf{X}_i^{\top} \mathbf{X}_j, \qquad i, j \in \{a, b\},$$

where the normalizing constant ($1/n$ or $1/(n-1)$) is dropped because it cancels everywhere in what follows.

And the joint covariance matrix is:

$$\mathbf{C} = \begin{bmatrix} \mathbf{C}_{aa} & \mathbf{C}_{ab} \\ \mathbf{C}_{ba} & \mathbf{C}_{bb} \end{bmatrix}$$
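In code, the joint covariance matrix and its four blocks can be computed with np.cov, as in the Appendix. The helper name covariance_blocks is just for illustration; Xa and Xb are assumed to be arrays of shape (n, p) and (n, q).

import numpy as np

def covariance_blocks(Xa, Xb):
    # Return Caa, Cab, Cba, Cbb from the joint covariance of (Xa, Xb).
    p = Xa.shape[1]
    C = np.cov(Xa.T, Xb.T)    # Joint covariance, shape (p + q, p + q).
    Caa = C[:p, :p]
    Cab = C[:p, p:]
    Cba = C[p:, :p]
    Cbb = C[p:, p:]
    return Caa, Cab, Cba, Cbb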

Recall that we constrained the canonical variables to have unit length, which implies:

$$\mathbf{z}_a^{\top}\mathbf{z}_a = \mathbf{w}_a^{\top}\mathbf{X}_a^{\top}\mathbf{X}_a\,\mathbf{w}_a = \mathbf{w}_a^{\top}\mathbf{C}_{aa}\,\mathbf{w}_a = 1 \qquad\text{and}\qquad \mathbf{w}_b^{\top}\mathbf{C}_{bb}\,\mathbf{w}_b = 1$$

Furthermore, note:

$$\mathbf{z}_a^{\top}\mathbf{z}_b = \mathbf{w}_a^{\top}\mathbf{X}_a^{\top}\mathbf{X}_b\,\mathbf{w}_b = \mathbf{w}_a^{\top}\mathbf{C}_{ab}\,\mathbf{w}_b$$

Substituting these covariance expressions and the unit-length constraints into the objective, we get:

$$\max_{\mathbf{w}_a, \mathbf{w}_b} \; \mathbf{w}_a^{\top}\mathbf{C}_{ab}\,\mathbf{w}_b \qquad\text{subject to}\quad \mathbf{w}_a^{\top}\mathbf{C}_{aa}\,\mathbf{w}_a = 1, \quad \mathbf{w}_b^{\top}\mathbf{C}_{bb}\,\mathbf{w}_b = 1$$

Let’s ignore the orthogonality constraints for now. We’ll see in a minute that these are handled by how we solve the problem. We can frame this constrained optimization problem using Lagrange multipliers to get:

$$\mathcal{L}(\mathbf{w}_a, \mathbf{w}_b) = \mathbf{w}_a^{\top}\mathbf{C}_{ab}\,\mathbf{w}_b - \frac{\lambda_a}{2}\left(\mathbf{w}_a^{\top}\mathbf{C}_{aa}\,\mathbf{w}_a - 1\right) - \frac{\lambda_b}{2}\left(\mathbf{w}_b^{\top}\mathbf{C}_{bb}\,\mathbf{w}_b - 1\right)$$

where $\lambda_a$ and $\lambda_b$ are the Lagrange multipliers, which we divide by $2$ to make taking the derivative easier. Taking the derivative of the loss with respect to $\mathbf{w}_a$, we get:

$$\frac{\partial \mathcal{L}}{\partial \mathbf{w}_a} = \frac{\partial}{\partial \mathbf{w}_a}\left[\mathbf{w}_a^{\top}\mathbf{C}_{ab}\,\mathbf{w}_b\right] - \frac{\lambda_a}{2}\,\frac{\partial}{\partial \mathbf{w}_a}\left[\mathbf{w}_a^{\top}\mathbf{C}_{aa}\,\mathbf{w}_a\right]$$

Let’s solve each term separately. I’ll write out the first term in painful detail just to be sure we understand the derivation. Written as a sum over components, it is:

$$\mathbf{w}_a^{\top}\mathbf{C}_{ab}\,\mathbf{w}_b = \sum_{k=1}^{p}\sum_{l=1}^{q} w_{a,k}\, [\mathbf{C}_{ab}]_{kl}\, w_{b,l}$$

By the linearity of differentiation, we can compute the derivative as the sum of the derivatives of the terms above. When we differentiate with respect to a single component $w_{a,m}$, the terms not containing $w_{a,m}$ drop out, and we get:

$$\frac{\partial}{\partial w_{a,m}}\left[\mathbf{w}_a^{\top}\mathbf{C}_{ab}\,\mathbf{w}_b\right] = \sum_{l=1}^{q} [\mathbf{C}_{ab}]_{ml}\, w_{b,l} \quad\Longrightarrow\quad \frac{\partial}{\partial \mathbf{w}_a}\left[\mathbf{w}_a^{\top}\mathbf{C}_{ab}\,\mathbf{w}_b\right] = \mathbf{C}_{ab}\,\mathbf{w}_b$$

I won’t work through the other derivatives in the same detail because the reasoning is the same, but you can convince yourself that the derivatives are:

$$\frac{\partial}{\partial \mathbf{w}_a}\left[\mathbf{w}_a^{\top}\mathbf{C}_{aa}\,\mathbf{w}_a\right] = 2\,\mathbf{C}_{aa}\,\mathbf{w}_a, \qquad \frac{\partial}{\partial \mathbf{w}_b}\left[\mathbf{w}_a^{\top}\mathbf{C}_{ab}\,\mathbf{w}_b\right] = \mathbf{C}_{ba}\,\mathbf{w}_a, \qquad \frac{\partial}{\partial \mathbf{w}_b}\left[\mathbf{w}_b^{\top}\mathbf{C}_{bb}\,\mathbf{w}_b\right] = 2\,\mathbf{C}_{bb}\,\mathbf{w}_b$$

The derivatives with respect to $\mathbf{w}_b$ are symmetric. Putting it all together and setting the gradient to zero, we get:

$$\frac{\partial \mathcal{L}}{\partial \mathbf{w}_a} = \mathbf{C}_{ab}\,\mathbf{w}_b - \lambda_a\,\mathbf{C}_{aa}\,\mathbf{w}_a = \mathbf{0}$$

And for $\mathbf{w}_b$, we get:

$$\frac{\partial \mathcal{L}}{\partial \mathbf{w}_b} = \mathbf{C}_{ba}\,\mathbf{w}_a - \lambda_b\,\mathbf{C}_{bb}\,\mathbf{w}_b = \mathbf{0}$$
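These gradients are easy to sanity-check numerically. The sketch below compares the closed-form derivative of the Lagrangian with respect to wa against a central finite-difference approximation; the data are synthetic and lam_a stands in for $\lambda_a$ with an arbitrary value.

import numpy as np

rng = np.random.RandomState(0)
n, p, q = 200, 4, 3
Xa = rng.randn(n, p); Xa = Xa - Xa.mean(axis=0)
Xb = rng.randn(n, q); Xb = Xb - Xb.mean(axis=0)
Caa = Xa.T @ Xa
Cab = Xa.T @ Xb

wa, wb = rng.randn(p), rng.randn(q)
lam_a  = 0.7

def lagrangian_a(wa):
    # Terms of the Lagrangian that depend on wa.
    return wa @ Cab @ wb - 0.5 * lam_a * (wa @ Caa @ wa - 1)

grad_closed = Cab @ wb - lam_a * (Caa @ wa)

eps = 1e-6
grad_fd = np.array([
    (lagrangian_a(wa + eps * e) - lagrangian_a(wa - eps * e)) / (2 * eps)
    for e in np.eye(p)
])
assert np.allclose(grad_closed, grad_fd, atol=1e-4)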

Now, note that the Lagrange multipliers $\lambda_a$ and $\lambda_b$ are the same. If we multiply the first equation by $\mathbf{w}_a^{\top}$ and use the constraint $\mathbf{w}_a^{\top}\mathbf{C}_{aa}\,\mathbf{w}_a = 1$, we get:

$$\mathbf{w}_a^{\top}\mathbf{C}_{ab}\,\mathbf{w}_b - \lambda_a\,\mathbf{w}_a^{\top}\mathbf{C}_{aa}\,\mathbf{w}_a = 0 \quad\Longrightarrow\quad \lambda_a = \mathbf{w}_a^{\top}\mathbf{C}_{ab}\,\mathbf{w}_b$$

The same logic applies if we multiply the second equation by $\mathbf{w}_b^{\top}$. Putting these two equations together, we get:

$$\lambda_a = \mathbf{w}_a^{\top}\mathbf{C}_{ab}\,\mathbf{w}_b = \mathbf{w}_b^{\top}\mathbf{C}_{ba}\,\mathbf{w}_a = \lambda_b$$

Let $\lambda := \lambda_a = \lambda_b$. Now we have two equations,

$$\mathbf{C}_{ab}\,\mathbf{w}_b - \lambda\,\mathbf{C}_{aa}\,\mathbf{w}_a = \mathbf{0} \qquad\text{and}\qquad \mathbf{C}_{ba}\,\mathbf{w}_a - \lambda\,\mathbf{C}_{bb}\,\mathbf{w}_b = \mathbf{0},$$

and three unknowns, $\mathbf{w}_a$, $\mathbf{w}_b$, and $\lambda$. Let’s solve for $\mathbf{w}_a$ in the first equation and then substitute it into the second. First, solving for $\mathbf{w}_a$ (assuming $\mathbf{C}_{aa}$ is invertible):

$$\mathbf{w}_a = \frac{1}{\lambda}\,\mathbf{C}_{aa}^{-1}\mathbf{C}_{ab}\,\mathbf{w}_b$$

Next, substituting this into the second equation so we have one equation and two unknowns:

$$\mathbf{C}_{ba}\,\mathbf{C}_{aa}^{-1}\mathbf{C}_{ab}\,\mathbf{w}_b = \lambda^2\,\mathbf{C}_{bb}\,\mathbf{w}_b$$

The standard eigenvalue method assumes $\mathbf{C}_{bb}$ is invertible, and we have:

$$\mathbf{C}_{bb}^{-1}\mathbf{C}_{ba}\,\mathbf{C}_{aa}^{-1}\mathbf{C}_{ab}\,\mathbf{w}_b = \lambda^2\,\mathbf{w}_b$$

We can see that this is the standard eigenvalue problem with matrix $\mathbf{M} = \mathbf{C}_{bb}^{-1}\mathbf{C}_{ba}\,\mathbf{C}_{aa}^{-1}\mathbf{C}_{ab}$ and eigenvalue $\rho^2 = \lambda^2$, where $\rho = \lambda = \mathbf{w}_a^{\top}\mathbf{C}_{ab}\,\mathbf{w}_b$ is exactly the canonical correlation. This implies that $\mathbf{w}_b$ is an eigenvector that satisfies this equation:

$$\mathbf{M}\,\mathbf{w}_b = \rho^2\,\mathbf{w}_b$$

We can solve for $\mathbf{w}_a$ using $\mathbf{w}_a = \frac{1}{\rho}\,\mathbf{C}_{aa}^{-1}\mathbf{C}_{ab}\,\mathbf{w}_b$ after solving for $\mathbf{w}_b$.

See the Appendix, Section 3 for code.

Discussion

Now that we know how to solve for the linear projections $\mathbf{w}_a$ and $\mathbf{w}_b$ using the standard eigenvalue method, I think a few points make more sense, and I’d like to review them.

$i$-th pair of canonical variables

In my experience, people typically talk about finding multiple vector projections, each $\mathbf{w}_a^{(i)}$, rather than a single matrix projection, $\mathbf{W}_a$, for the $\mathbf{X}_a$ view (and likewise for $\mathbf{X}_b$). This makes more sense once you realize that each eigenvector of $\mathbf{M}$ is a solution to the optimization objective. The final projections $\mathbf{W}_a$ and $\mathbf{W}_b$ are indeed matrices, but their columns represent different solutions to the same problem.

The number of pairs of canonical variables

Furthermore, this also explains why the number of canonical variable pairs is at most $\min(p, q)$, where $p$ and $q$ are the dimensionalities of $\mathbf{X}_a$ and $\mathbf{X}_b$ respectively. Now we can say that $r$ is simply the maximum number of linearly independent eigenvectors with nonzero eigenvalues. Recall that our matrix in the characteristic equation was:

$$\mathbf{M} = \mathbf{C}_{bb}^{-1}\mathbf{C}_{ba}\,\mathbf{C}_{aa}^{-1}\mathbf{C}_{ab}$$

This is a $q \times q$ matrix, but its rank is at most $\min(p, q)$ because it is bounded by the rank of $\mathbf{C}_{ab}$.
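A quick numerical illustration with synthetic data (the sizes are arbitrary, with $p < q$ on purpose): the matrix $\mathbf{M}$ is $q \times q$, but it only has $\min(p, q)$ linearly independent eigenvectors with nonzero eigenvalues.

import numpy as np

rng = np.random.RandomState(0)
n, p, q = 100, 3, 6
Xa = rng.randn(n, p); Xa = Xa - Xa.mean(axis=0)
Xb = rng.randn(n, q); Xb = Xb - Xb.mean(axis=0)

Caa, Cab = Xa.T @ Xa, Xa.T @ Xb
Cbb, Cba = Xb.T @ Xb, Xb.T @ Xa

M = np.linalg.inv(Cbb) @ Cba @ np.linalg.inv(Caa) @ Cab   # q x q.
print(M.shape, np.linalg.matrix_rank(M))                  # (6, 6) and 3 == min(p, q).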

Orthogonality

Finally, we can show that our respective canonical variables are orthogonal to each other. Consider the two equations from the Lagrangian after rearranging terms and setting $\lambda = \rho$:

$$\mathbf{C}_{ab}\,\mathbf{w}_b = \rho\,\mathbf{C}_{aa}\,\mathbf{w}_a \qquad\text{and}\qquad \mathbf{C}_{ba}\,\mathbf{w}_a = \rho\,\mathbf{C}_{bb}\,\mathbf{w}_b$$

Consider a new pair of canonical variables, denoted $\mathbf{z}_a'$ and $\mathbf{z}_b'$, with projections $\mathbf{w}_a'$ and $\mathbf{w}_b'$ and canonical correlation $\rho'$. We want to show that the respective canonical variables are uncorrelated, i.e.:

$$\mathbf{w}_a^{\top}\mathbf{C}_{aa}\,\mathbf{w}_a' = 0 \qquad\text{and}\qquad \mathbf{w}_b^{\top}\mathbf{C}_{bb}\,\mathbf{w}_b' = 0$$

Let’s multiply the first equation by $\mathbf{w}_a'^{\top}$:

$$\mathbf{w}_a'^{\top}\mathbf{C}_{ab}\,\mathbf{w}_b = \rho\,\mathbf{w}_a'^{\top}\mathbf{C}_{aa}\,\mathbf{w}_a$$

And the second by $\mathbf{w}_b'^{\top}$:

$$\mathbf{w}_b'^{\top}\mathbf{C}_{ba}\,\mathbf{w}_a = \rho\,\mathbf{w}_b'^{\top}\mathbf{C}_{bb}\,\mathbf{w}_b$$

Now, notice that we can switch the ordering of the canonical pairs. This is because both pairs of canonical variables satisfy the same stationarity equations. Multiplying the analogous equation for the primed pair by $\mathbf{w}_a^{\top}$ gives:

$$\mathbf{w}_a^{\top}\mathbf{C}_{ab}\,\mathbf{w}_b' = \rho'\,\mathbf{w}_a^{\top}\mathbf{C}_{aa}\,\mathbf{w}_a'$$

And by $\mathbf{w}_b^{\top}$:

$$\mathbf{w}_b^{\top}\mathbf{C}_{ba}\,\mathbf{w}_a' = \rho'\,\mathbf{w}_b^{\top}\mathbf{C}_{bb}\,\mathbf{w}_b'$$

Since the dot product is commutative and $\mathbf{C}_{ba} = \mathbf{C}_{ab}^{\top}$, the left-hand side of the second equation equals the left-hand side of the third, and the left-hand side of the first equals the left-hand side of the fourth. Setting the corresponding right-hand sides equal gives:

$$\rho\,\mathbf{w}_b'^{\top}\mathbf{C}_{bb}\,\mathbf{w}_b = \rho'\,\mathbf{w}_a^{\top}\mathbf{C}_{aa}\,\mathbf{w}_a' \qquad\text{and}\qquad \rho\,\mathbf{w}_a'^{\top}\mathbf{C}_{aa}\,\mathbf{w}_a = \rho'\,\mathbf{w}_b^{\top}\mathbf{C}_{bb}\,\mathbf{w}_b'$$

These equations are both true only when $\rho = \rho'$ or when $\mathbf{w}_a^{\top}\mathbf{C}_{aa}\,\mathbf{w}_a' = \mathbf{w}_b^{\top}\mathbf{C}_{bb}\,\mathbf{w}_b' = 0$ (combining the two gives $(\rho^2 - \rho'^2)\,\mathbf{w}_a^{\top}\mathbf{C}_{aa}\,\mathbf{w}_a' = 0$ and likewise for the $b$ side). In words, the correlations among canonical variables are zero except when they are associated with the same canonical correlation.
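We can also check this numerically using the fit_standard_eigenvalue_method function from the Appendix, Section 3 (a sketch with synthetic data): after projecting the standardized data and normalizing each canonical variable to unit length, the cross-correlation matrix of the canonical variables is diagonal, with the canonical correlations on the diagonal.

import numpy as np

rng = np.random.RandomState(0)
n, p, q = 500, 5, 3
Xa = rng.randn(n, p)
Xb = Xa[:, :q] + rng.randn(n, q)

Wa, Wb = fit_standard_eigenvalue_method(Xa, Xb)

# Standardize the data the same way the fitting function does, then project.
Za = ((Xa - Xa.mean(axis=0)) / Xa.std(axis=0)) @ Wa
Zb = ((Xb - Xb.mean(axis=0)) / Xb.std(axis=0)) @ Wb
Za = Za / np.linalg.norm(Za, axis=0)
Zb = Zb / np.linalg.norm(Zb, axis=0)

# Off-diagonal entries are (numerically) zero; the diagonal holds the rhos.
print(np.round(Za.T @ Zb, 3))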

Conclusion

CCA is a powerful tool for finding relationships between high-dimensional datasets. There are many natural extensions to this method, such as probabilistic CCA (Bach & Jordan, 2005), which uses maximum likelihood estimates of the parameters to fit a statistical model, and deep CCA (Andrew et al., 2013), which uses nonlinear projections, i.e. neural networks, while constraining the outputs of the networks to be maximally linearly correlated. The former has the benefit of handling uncertainty and modeling the generative process for the data using probabilistic machine learning. The latter leverages the rich representational power of deep learning.

Acknowledgements

I relied heavily on Hotelling’s original paper (Hotelling, 1936) and Uurtio’s fantastic survey (Uurtio et al., 2017) while writing this post and implementing CCA.

   

Appendix

1. Mean-centered data

To visualize mean-centering, let’s consider two time series datasets. We can think of time-series data as $1$-dimensional, where each time step represents an observation. If two series are identical except for a difference in their $y$-intercepts (Figure 2), we would expect them to have a correlation coefficient of $1$. Note that if their $y$-intercepts are different, their means will differ by the same offset.

Figure 2: Two 1-D time series datasets that are perfectly correlated but have different $y$-intercepts and therefore different means.

The standard mathematical way to deal with this problem is to subtract the mean from each series. To see why this works, look at what happens mathematically when one series is simply the other series with some additive offset, i.e. $b_i = a_i + c$ for every observation $i$. First, let’s put the mean of $\mathbf{b}$, $\bar{b}$, in terms of the mean of $\mathbf{a}$, $\bar{a}$:

$$\bar{b} = \frac{1}{n}\sum_{i=1}^{n} b_i = \frac{1}{n}\sum_{i=1}^{n} (a_i + c) = \bar{a} + c$$

Now let’s compute the correlation between $\mathbf{a}$ and $\mathbf{b}$:

$$\text{corr}(\mathbf{a}, \mathbf{b}) = \frac{\sum_{i=1}^{n} (a_i - \bar{a})(b_i - \bar{b})}{\sqrt{\sum_{i=1}^{n} (a_i - \bar{a})^2}\,\sqrt{\sum_{i=1}^{n} (b_i - \bar{b})^2}} = \frac{\sum_{i=1}^{n} (a_i - \bar{a})(a_i + c - \bar{a} - c)}{\sqrt{\sum_{i=1}^{n} (a_i - \bar{a})^2}\,\sqrt{\sum_{i=1}^{n} (a_i + c - \bar{a} - c)^2}} = \frac{\sum_{i=1}^{n} (a_i - \bar{a})^2}{\sum_{i=1}^{n} (a_i - \bar{a})^2} = 1$$

In words, by subtracting the mean, we allow the series to move up or down without affecting the measure of correlation. If we had a higher-dimensional dataset, we would subtract the mean from every dimension.
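A two-line check of this in NumPy, using an arbitrary toy series and an arbitrary offset of 5:

import numpy as np

a = np.sin(np.linspace(0, 4 * np.pi, 100))   # A toy time series.
b = a + 5.0                                  # Same series, shifted upward.
print(np.corrcoef(a, b)[0, 1])               # 1.0 (up to floating point).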

2. Scale invariance

Another important property of correlation is scale invariance. This means that two variables can be perfectly correlated even though one variable changes at a scale that is disproportionate to the other. For example, the two time series in Figure 3 are perfectly correlated.

Figure 3: Two time series that are perfectly correlated. Every value in the red time series is multiplied by a positive constant to generate the blue series.

We can see this property mathematically by exploring what happens when we compute the correlation between two vectors $\mathbf{a}$ and $\mathbf{b} = k\mathbf{a}$, where $k > 0$ is a scalar. First, let’s again note the mean of $\mathbf{b}$:

$$\bar{b} = \frac{1}{n}\sum_{i=1}^{n} k\, a_i = k\,\bar{a}$$

Then the correlation is:

$$\text{corr}(\mathbf{a}, \mathbf{b}) = \frac{\sum_{i=1}^{n} (a_i - \bar{a})(k a_i - k\bar{a})}{\sqrt{\sum_{i=1}^{n} (a_i - \bar{a})^2}\,\sqrt{\sum_{i=1}^{n} (k a_i - k\bar{a})^2}} = \frac{k \sum_{i=1}^{n} (a_i - \bar{a})^2}{k \sum_{i=1}^{n} (a_i - \bar{a})^2} = 1$$

In words, $\mathbf{b}$ is perfectly correlated with $\mathbf{a}$ despite the two series being scaled differently.
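And the analogous check for scaling, with an arbitrary positive constant of 3:

import numpy as np

a = np.sin(np.linspace(0, 4 * np.pi, 100))   # A toy time series.
b = 3.0 * a                                  # Same series, rescaled.
print(np.corrcoef(a, b)[0, 1])               # 1.0 (up to floating point).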

3. Code

For completeness, here is the code to solve for $\mathbf{W}_a$ and $\mathbf{W}_b$ using Hotelling’s standard eigenvalue problem method. See my GitHub repository for a complete working version.

import numpy as np

# ------------------------------------------------------------------------------

inv  = np.linalg.inv
mm   = np.matmul

# ------------------------------------------------------------------------------

def fit_standard_eigenvalue_method(Xa, Xb):
    """
    Fits CCA parameters using the standard eigenvalue problem.
    
    :param Xa: Observations with shape (n_samps, p_dim).
    :param Xb: Observations with shape (n_samps, q_dim).
    :return:   Linear transformations Wa and Wb.
    """
    N, p = Xa.shape
    N, q = Xb.shape
    r    = min(p, q)

    # Standardize each view (without modifying the caller's arrays in place).
    Xa = (Xa - Xa.mean(axis=0)) / Xa.std(axis=0)
    Xb = (Xb - Xb.mean(axis=0)) / Xb.std(axis=0)

    C   = np.cov(Xa.T, Xb.T)
    Caa = C[:p, :p]
    Cbb = C[p:, p:]
    Cab = C[:p, p:]
    Cba = C[p:, :p]

    # Either branch results in an r x r matrix where r = min(p, q). Its
    # eigenvectors are the projections for the lower-dimensional view; the
    # other view's projections are recovered from them afterward.
    if q <= p:
        M = mm(mm(inv(Cbb), Cba), mm(inv(Caa), Cab))
    else:
        M = mm(mm(inv(Caa), Cab), mm(inv(Cbb), Cba))

    # Solving the characteristic equation,
    #
    #     det(M - rho^2 I) = 0
    #
    # is equivalent to solving for rho^2, which are the eigenvalues of the
    # matrix.
    eigvals, eigvecs = np.linalg.eig(M)
    rhos = np.sqrt(eigvals)

    # Ensure we go through eigenvectors in descending order of correlation.
    inds = (-rhos).argsort()
    rhos = rhos[inds]
    # NumPy returns each eigenvector as a column in a matrix.
    eigvecs = eigvecs[:, inds]

    if q <= p:
        # The eigenvectors are the w_b's; recover each w_a via
        # w_a = inv(Caa) Cab w_b / rho.
        Wb = eigvecs
        Wa = np.zeros((p, r))
        for i, (rho, wb_i) in enumerate(zip(rhos, Wb.T)):
            Wa[:, i] = mm(mm(inv(Caa), Cab), wb_i) / rho
    else:
        # The eigenvectors are the w_a's; recover each w_b via
        # w_b = inv(Cbb) Cba w_a / rho.
        Wa = eigvecs
        Wb = np.zeros((q, r))
        for i, (rho, wa_i) in enumerate(zip(rhos, Wa.T)):
            Wb[:, i] = mm(mm(inv(Cbb), Cba), wa_i) / rho

    # Sanity check: the canonical correlations are equal to the rhos.
    Za = mm(Xa, Wa)
    Zb = mm(Xb, Wb)
    Za = Za / np.linalg.norm(Za, axis=0)
    Zb = Zb / np.linalg.norm(Zb, axis=0)
    CCs = np.zeros(r)

    for i in range(r):
        CCs[i] = np.dot(Za[:, i], Zb[:, i])
    assert np.allclose(CCs, rhos)

    return Wa, Wb

Note that according to NumPy’s documentation on the eig function, eigenvalues and eigenvectors are not necessarily ordered, hence the ordering using argsort.
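For reference, here is one way to call this function on synthetic paired data; this usage sketch is not part of the original implementation, and the sizes are arbitrary.

import numpy as np

rng = np.random.RandomState(0)
n, p, q = 500, 5, 3
Xa = rng.randn(n, p)
Xb = Xa[:, :q] + rng.randn(n, q)   # A correlated second view.

Wa, Wb = fit_standard_eigenvalue_method(Xa, Xb)
print(Wa.shape, Wb.shape)          # (5, 3) and (3, 3).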

  1. Hotelling, H. (1936). Relations between two sets of variates. Biometrika, 28(3/4), 321–377.
  2. Bach, F. R., & Jordan, M. I. (2005). A probabilistic interpretation of canonical correlation analysis.
  3. Andrew, G., Arora, R., Bilmes, J., & Livescu, K. (2013). Deep canonical correlation analysis. International Conference on Machine Learning, 1247–1255.
  4. Uurtio, V., Monteiro, J. M., Kandola, J., Shawe-Taylor, J., Fernandez-Reyes, D., & Rousu, J. (2017). A tutorial on canonical correlation methods. ACM Computing Surveys (CSUR), 50(6), 95.