Canonical Correlation Analysis in Detail

Canonical correlation analysis is conceptually straightforward, but I want to define its objective and derive its solution in detail, both mathematically and programmatically.

Canonical correlation analysis (CCA) is a multivariate statistical method for finding two linear projections, one for each set of observations in a paired dataset, such that the projected data points are maximally correlated. Provided the data are mean-centered, this procedure can be visualized fairly easily (Figure 1). CCA finds linear projections such that paired data points are close to each other in a lower-dimensional space. And since mean-centered points are vectors with tails at $(0, 0)$, CCA can be thought of as minimizing the cosine distance between paired points.

Figure 1: A visualization of canonical correlation analysis. Let $n = 2$ be the number of observations. Two datasets $\mathbf{X}_a \in \mathbb{R}^{n \times 3}$ and $\mathbf{X}_b \in \mathbb{R}^{n \times 2}$ are transformed by projections $W_a \in \mathbb{R}^{3 \times 2}$ and $W_b \in \mathbb{R}^{2 \times 2}$ such that the paired embeddings, $(\textbf{v}_a, \textbf{v}_b)$ and $(\textbf{u}_a, \textbf{u}_b)$, are maximally correlated with unit length in the projected space.

The goal of this post is to go through the CCA procedure in painstaking detail, denoting the dimensionality of every vector and matrix, explaining the objective and constraints, walking through a derivation that proves how to solve the objective, and providing working code.

CCA objective

Consider two paired datasets, $\mathbf{X}_a \in \mathbb{R}^{n \times p}$ and $\mathbf{X}_b \in \mathbb{R}^{n \times q}$. By “paired”, I mean that the $i$-th sample consists of two row vectors, $(\textbf{x}_{a}^{i}, \textbf{x}_{b}^{i})$, with dimensionality $p$ and $q$ respectively. Thus, there are $n$ paired empirical multivariate observations. We assume the data are mean-centered. See the Appendix, Section 1 for a mathematical discussion of what this means.

Let $\textbf{w}_a \in \mathbb{R}^{p}$ and $\textbf{w}_b \in \mathbb{R}^{q}$ denote linear transformations of the data (note that these are vectors, not matrices as in Figure 1; this will be explained). A pair of canonical variables $\textbf{z}_a \in \mathbb{R}^{n}$ and $\textbf{z}_b \in \mathbb{R}^{n}$ are the images of $\mathbf{X}_a$ and $\mathbf{X}_b$ after projection, i.e.:

$$\textbf{z}_a = \mathbf{X}_a \textbf{w}_a \qquad \textbf{z}_b = \mathbf{X}_b \textbf{w}_b$$

The goal of CCA is to learn linear transformations $\textbf{w}_a$ and $\textbf{w}_b$ such that this pair of canonical variables is maximally correlated. Let $\boldsymbol{\Sigma}_{ij}$ denote the covariance matrix between datasets $\mathbf{X}_i$ and $\mathbf{X}_j$ for $i, j \in \{a, b\}$. Then the CCA objective is:

$$\begin{aligned} \textbf{w}_a^{\star}, \textbf{w}_b^{\star} &= \arg\!\max_{\textbf{w}_a, \textbf{w}_b} \text{corr}(\mathbf{X}_a \textbf{w}_a, \mathbf{X}_b \textbf{w}_b) \\ &= \arg\!\max_{\textbf{w}_a, \textbf{w}_b} \frac{ \textbf{w}_a^{\top} \boldsymbol{\Sigma}_{ab} \textbf{w}_b }{ \sqrt{\textbf{w}_a^{\top} \boldsymbol{\Sigma}_{aa} \textbf{w}_a} \sqrt{\textbf{w}_b^{\top} \boldsymbol{\Sigma}_{bb} \textbf{w}_b} } \end{aligned}$$

where the numerator is the cross-covariance of the projected data and the denominator is the product of their standard deviations, both in vectorized form. Notice that since the following holds (dropping the $1/(n-1)$ scaling in $\boldsymbol{\Sigma}_{ij}$, which cancels between the numerator and denominator),

$$\begin{aligned} \textbf{w}_i^{\top} \boldsymbol{\Sigma}_{ij} \textbf{w}_j &= \textbf{w}_i^{\top} \mathbf{X}_i^{\top} \mathbf{X}_j \textbf{w}_j \\ &= (\mathbf{X}_i \textbf{w}_i)^{\top} \mathbf{X}_j \textbf{w}_j \\ &= \textbf{z}_i^{\top} \textbf{z}_j \end{aligned}$$

we can rewrite the objective as finding the minimum angle $\theta$ between the two canonical variables $\textbf{z}_a$ and $\textbf{z}_b$:

$$\begin{aligned} \cos \theta &= \max_{\textbf{z}_a, \textbf{z}_b} \frac{ \textbf{z}_a^{\top} \textbf{z}_b }{ \sqrt{\textbf{z}_a^{\top} \textbf{z}_a} \sqrt{\textbf{z}_b^{\top} \textbf{z}_b} } \\ &= \max_{\textbf{z}_a, \textbf{z}_b} \frac{ \textbf{z}_a^{\top} \textbf{z}_b }{ \lVert \textbf{z}_a \rVert_2 \lVert \textbf{z}_b \rVert_2 } \end{aligned}$$

I like this formulation because it has a nice geometric interpretation of what is happening. We want projections such that our embedded data are vectors pointing in the same direction in the lower-dimensional space (again, Figure 1). See my post on the dot product if the previous statement did not make sense.
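To make this equivalence concrete, here is a minimal NumPy sketch, with made-up data and arbitrary (not optimal) projection vectors, showing that for mean-centered data the correlation of the projections equals the cosine of the angle between the canonical variables:

import numpy as np

rng = np.random.RandomState(0)
n, p, q = 100, 4, 3
Xa = rng.randn(n, p)
Xb = rng.randn(n, q)
Xa -= Xa.mean(axis=0)   # mean-center each view
Xb -= Xb.mean(axis=0)

wa = rng.randn(p)       # arbitrary (not optimal) projections
wb = rng.randn(q)
za = Xa @ wa
zb = Xb @ wb

corr   = np.corrcoef(za, zb)[0, 1]
cosine = za @ zb / (np.linalg.norm(za) * np.linalg.norm(zb))
assert np.isclose(corr, cosine)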

There are two additional constraints on the above CCA objective. First, the objective is scale invariant, so we can constrain the canonical variables $\textbf{z}_a$ and $\textbf{z}_b$ to have unit length without loss of generality. See the Appendix, Section 2 for a mathematical derivation of why correlation is scale invariant. This further simplifies our objective to:

$$\cos \theta = \max_{\textbf{z}_a, \textbf{z}_b} \{ \textbf{z}_a^{\top} \textbf{z}_b \}, \qquad \lVert \textbf{z}_a \rVert_2 = 1, \quad \lVert \textbf{z}_b \rVert_2 = 1$$

So far, we have only talked about finding $p$- and $q$-dimensional projections to $n$-dimensional canonical variables. But we can find multiple pairs of canonical variables that satisfy the above objective (we’ll see why later). If we would like to find a second pair of canonical variables, the second constraint is that it must be orthogonal to, or equivalently uncorrelated with, the first pair. In general, for the $i$-th and $j$-th pairs of canonical variables, we have:

$${\textbf{z}_a^i}^{\top} \textbf{z}_a^j = 0, \qquad {\textbf{z}_b^i}^{\top} \textbf{z}_b^j = 0$$

The number of pairs of canonical variables, call this $r$, cannot be greater than the minimum of $p$ and $q$ (again, we’ll see why later):

$$r = \min(p, q)$$

Putting this objective and the constraints in one place, we have:

$$\begin{aligned} \cos \theta_i &= \max_{\textbf{z}_a, \textbf{z}_b} \{ {\textbf{z}_a^i}^{\top} \textbf{z}_b^i \}, \\ \lVert \textbf{z}_a^i \rVert_2 &= 1, \qquad \lVert \textbf{z}_b^i \rVert_2 = 1, \\ {\textbf{z}_a^i}^{\top} \textbf{z}_a^j &= 0, \qquad {\textbf{z}_b^i}^{\top} \textbf{z}_b^j = 0 \qquad \forall j \neq i: \; i, j \in \{1, 2, \dots, r \}. \end{aligned} \tag{1}$$

With this notation, we define the goal of CCA as finding $r$ pairs of linear projections, $(\textbf{w}_a^i, \textbf{w}_b^i)$ for $i \in \{1, 2, \dots, r \}$, that satisfy the above constraints. Now, let’s see how to find these projections.

Solving for the objective

There are multiple ways to solve for the projections $\textbf{w}_a$ and $\textbf{w}_b$. I will discuss the one proposed by Hotelling (Hotelling, 1936), which uses the standard eigenvalue problem: solving for an eigenvector $\textbf{v}_i$ and eigenvalue $\lambda_i$ given a matrix $\mathbf{A}$:

$$\mathbf{A} \textbf{v}_i = \lambda_i \textbf{v}_i$$

Or equivalently, solving the characteristic equation:

$$\text{det}(\mathbf{A} - \lambda_i \mathbf{I}) = 0 \quad \text{where} \quad (\mathbf{A} - \lambda_i \mathbf{I}) \textbf{v}_i = \mathbf{0}$$
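As a quick refresher, here is a minimal sketch, using an arbitrary example matrix, of what np.linalg.eig returns and how it relates to the equation above:

import numpy as np

A = np.array([[2., 1.],
              [1., 3.]])
eigvals, eigvecs = np.linalg.eig(A)

# Each column of `eigvecs` is an eigenvector v_i with eigenvalue eigvals[i],
# i.e. A v_i = lambda_i v_i.
for lam, v in zip(eigvals, eigvecs.T):
    assert np.allclose(A @ v, lam * v)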

First, recall our definition of $\boldsymbol{\Sigma}_{ij}$ as the empirical covariance matrix between variables in $\mathbf{X}_i$ and $\mathbf{X}_j$, so:

$$\boldsymbol{\Sigma}_{aa} = \frac{1}{n-1} \mathbf{X}_a^{\top} \mathbf{X}_a \qquad \boldsymbol{\Sigma}_{ab} = \frac{1}{n-1} \mathbf{X}_a^{\top} \mathbf{X}_b \qquad \boldsymbol{\Sigma}_{bb} = \frac{1}{n-1} \mathbf{X}_b^{\top} \mathbf{X}_b$$

And the joint covariance matrix is:

$$\boldsymbol{\Sigma} = \begin{bmatrix} \boldsymbol{\Sigma}_{aa} & \boldsymbol{\Sigma}_{ab} \\ \boldsymbol{\Sigma}_{ba} & \boldsymbol{\Sigma}_{bb} \end{bmatrix}$$
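Here is a small sketch, with made-up data, of how these blocks can be computed, either directly from the centered data matrices or by slicing the joint covariance matrix returned by np.cov:

import numpy as np

rng = np.random.RandomState(0)
n, p, q = 100, 4, 3
Xa = rng.randn(n, p)
Xb = rng.randn(n, q)
Xa -= Xa.mean(axis=0)
Xb -= Xb.mean(axis=0)

# Directly from the definitions.
Saa = Xa.T @ Xa / (n - 1)
Sab = Xa.T @ Xb / (n - 1)
Sbb = Xb.T @ Xb / (n - 1)

# Equivalently, slice the joint (p+q) x (p+q) covariance matrix.
C = np.cov(Xa.T, Xb.T)
assert np.allclose(C[:p, :p], Saa)
assert np.allclose(C[:p, p:], Sab)
assert np.allclose(C[p:, p:], Sbb)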

Recall that we constrained the canonical variables to have unit length, which implies (again ignoring the $1/(n-1)$ factor):

$$\begin{aligned} \lVert \textbf{z}_a \rVert_2^2 &= \textbf{z}_a^{\top} \textbf{z}_a \\ &= \textbf{w}_a^{\top} \mathbf{X}_a^{\top} \mathbf{X}_a \textbf{w}_a \\ &= \textbf{w}_a^{\top} \boldsymbol{\Sigma}_{aa} \textbf{w}_a \\ &= 1 \end{aligned} \tag{2}$$

$$\begin{aligned} \lVert \textbf{z}_b \rVert_2^2 &= \textbf{z}_b^{\top} \textbf{z}_b \\ &= \textbf{w}_b^{\top} \mathbf{X}_b^{\top} \mathbf{X}_b \textbf{w}_b \\ &= \textbf{w}_b^{\top} \boldsymbol{\Sigma}_{bb} \textbf{w}_b \\ &= 1 \end{aligned} \tag{3}$$

Furthermore, note:

$$\begin{aligned} \textbf{z}_a^{\top} \textbf{z}_b &= \textbf{w}_a^{\top} \mathbf{X}_a^{\top} \mathbf{X}_b \textbf{w}_b \\ &= \textbf{w}_a^{\top} \boldsymbol{\Sigma}_{ab} \textbf{w}_b \end{aligned} \tag{4}$$

Substituting Equations (2), (3), and (4) into (1), we get:

$$\cos \theta_i = \max_{\textbf{w}_a, \textbf{w}_b} \{ \textbf{w}_a^{\top} \boldsymbol{\Sigma}_{ab} \textbf{w}_b \}, \qquad \lVert \textbf{z}_a \rVert_2 = \sqrt{\textbf{w}_a^{\top} \boldsymbol{\Sigma}_{aa} \textbf{w}_a} = 1, \qquad \lVert \textbf{z}_b \rVert_2 = \sqrt{\textbf{w}_b^{\top} \boldsymbol{\Sigma}_{bb} \textbf{w}_b} = 1 \tag{6}$$

Let’s ignore the orthogonality constraints for now. We’ll see in a minute that these are handled by how we solve the problem. We can frame our constrained optimization problem using Lagrange multipliers to get:

$$\mathcal{L} = \overbrace{\textbf{w}_a^{\top} \boldsymbol{\Sigma}_{ab} \textbf{w}_b}^{\text{Optimize}} - \overbrace{\frac{\rho_1}{2}(\textbf{w}_a^{\top} \boldsymbol{\Sigma}_{aa} \textbf{w}_a - 1)}^{\text{Constraint}} - \overbrace{\frac{\rho_2}{2}(\textbf{w}_b^{\top} \boldsymbol{\Sigma}_{bb} \textbf{w}_b - 1)}^{\text{Constraint}}$$

where $\rho_1$ and $\rho_2$ are the Lagrange multipliers, which we divide by $2$ to make taking the derivative easier. Taking the derivative of the Lagrangian with respect to $\textbf{w}_a$, we get:

$$\frac{\partial \mathcal{L}}{\partial \textbf{w}_a} = \color{#bc2612}{\frac{\partial}{\partial \textbf{w}_a} \Big( \textbf{w}_a^{\top} \boldsymbol{\Sigma}_{ab} \textbf{w}_b \Big)} - \color{#11accd}{\frac{\partial}{\partial \textbf{w}_a} \Big( \frac{\rho_1}{2}(\textbf{w}_a^{\top} \boldsymbol{\Sigma}_{aa} \textbf{w}_a - 1) \Big)} - \color{#807504}{\frac{\partial}{\partial \textbf{w}_a} \Big( \frac{\rho_2}{2}(\textbf{w}_b^{\top} \boldsymbol{\Sigma}_{bb} \textbf{w}_b - 1) \Big)}$$

Let’s solve each term separately. I’ll write out the first term (in red) in painful detail just to be sure we understand the derivation:

$$\begin{aligned} \color{#bc2612}{\frac{\partial}{\partial \textbf{w}_a} \Big( \textbf{w}_a^{\top} \boldsymbol{\Sigma}_{ab} \textbf{w}_b \Big)} &= \frac{\partial}{\partial \textbf{w}_a} \begin{bmatrix} w_a^{(1)} & \dots & w_a^{(p)} \end{bmatrix} \begin{bmatrix} \boldsymbol{\Sigma}_{ab}^{(11)} & \dots & \boldsymbol{\Sigma}_{ab}^{(1q)} \\ \vdots & \ddots & \vdots \\ \boldsymbol{\Sigma}_{ab}^{(p1)} & \dots & \boldsymbol{\Sigma}_{ab}^{(pq)} \end{bmatrix} \begin{bmatrix} w_b^{(1)} \\ \vdots \\ w_b^{(q)} \end{bmatrix} \\ &= \frac{\partial}{\partial \textbf{w}_a} \begin{bmatrix} w_a^{(1)} \boldsymbol{\Sigma}_{ab}^{(11)} + \dots + w_a^{(p)} \boldsymbol{\Sigma}_{ab}^{(p1)}, & \dots, & w_a^{(1)} \boldsymbol{\Sigma}_{ab}^{(1q)} + \dots + w_a^{(p)} \boldsymbol{\Sigma}_{ab}^{(pq)} \end{bmatrix} \begin{bmatrix} w_b^{(1)} \\ \vdots \\ w_b^{(q)} \end{bmatrix} \\ &= \frac{\partial}{\partial \textbf{w}_a} \Big[ w_b^{(1)} \Big( w_a^{(1)} \boldsymbol{\Sigma}_{ab}^{(11)} + \dots + w_a^{(p)} \boldsymbol{\Sigma}_{ab}^{(p1)} \Big) + \dots + w_b^{(q)} \Big( w_a^{(1)} \boldsymbol{\Sigma}_{ab}^{(1q)} + \dots + w_a^{(p)} \boldsymbol{\Sigma}_{ab}^{(pq)} \Big) \Big] \end{aligned}$$

By the linearity of differentiation, we can differentiate the sum above term by term. The partial derivative with respect to each component $w_a^{(k)}$ keeps only that component’s coefficients, so the $w_a$ terms drop out and we get:

$$\color{#bc2612}{\frac{\partial}{\partial \textbf{w}_a} \Big( \textbf{w}_a^{\top} \boldsymbol{\Sigma}_{ab} \textbf{w}_b \Big) = \boldsymbol{\Sigma}_{ab} \textbf{w}_b}$$

I won’t work through the other derivatives in the same detail because the reasoning is the same, but you can convince yourself that the derivatives are:

$$\color{#11accd}{\frac{\partial}{\partial \textbf{w}_a} \Big( \frac{\rho_1}{2}(\textbf{w}_a^{\top} \boldsymbol{\Sigma}_{aa} \textbf{w}_a - 1) \Big) = \rho_1 \boldsymbol{\Sigma}_{aa} \textbf{w}_a}$$

$$\color{#807504}{\frac{\partial}{\partial \textbf{w}_a} \Big( \frac{\rho_2}{2}(\textbf{w}_b^{\top} \boldsymbol{\Sigma}_{bb} \textbf{w}_b - 1) \Big) = 0}$$
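If you would rather not trust the matrix calculus, here is a minimal finite-difference sketch, with made-up covariance matrices and an arbitrary value for $\rho_1$, checking the first two derivatives numerically:

import numpy as np

rng = np.random.RandomState(0)
p, q = 4, 3
Saa = np.cov(rng.randn(100, p).T)   # made-up symmetric covariance matrix
Sab = rng.randn(p, q)               # made-up cross-covariance matrix
wa, wb = rng.randn(p), rng.randn(q)
rho1 = 0.7
eps = 1e-6

def f1(w):  # first (red) term
    return w @ Sab @ wb

def f2(w):  # second (blue) term
    return 0.5 * rho1 * (w @ Saa @ w - 1)

# Compare analytic gradients against central finite differences.
for f, grad in [(f1, Sab @ wb), (f2, rho1 * Saa @ wa)]:
    num = np.array([(f(wa + eps * e) - f(wa - eps * e)) / (2 * eps)
                    for e in np.eye(p)])
    assert np.allclose(num, grad, atol=1e-5)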

The derivatives with respect to $\textbf{w}_b$ are symmetric. Putting it all together, we get:

$$\frac{\partial \mathcal{L}}{\partial \textbf{w}_a} = \color{#bc2612}{\boldsymbol{\Sigma}_{ab} \textbf{w}_b} - \color{#11accd}{\rho_1 \boldsymbol{\Sigma}_{aa} \textbf{w}_a} = \textbf{0} \tag{7}$$

And for $\textbf{w}_b$, we get:

$$\frac{\partial \mathcal{L}}{\partial \textbf{w}_b} = \color{#bc2612}{\boldsymbol{\Sigma}_{ba} \textbf{w}_a} - \color{#807504}{\rho_2 \boldsymbol{\Sigma}_{bb} \textbf{w}_b} = \textbf{0} \tag{8}$$

Now, note that the Lagrange multipliers $\rho_1$ and $\rho_2$ are actually equal. To see this, left-multiply (7) by $\textbf{w}_a^{\top}$:

$$\textbf{w}_a^{\top}(\boldsymbol{\Sigma}_{ab} \textbf{w}_b - \rho_1 \boldsymbol{\Sigma}_{aa} \textbf{w}_a) = \textbf{w}_a^{\top} \boldsymbol{\Sigma}_{ab} \textbf{w}_b - \rho_1 \underbrace{\textbf{w}_a^{\top} \boldsymbol{\Sigma}_{aa} \textbf{w}_a}_{\lVert \textbf{z}_a \rVert_2^2 = 1}$$

The same logic applies if we left-multiply (8) by $\textbf{w}_b^{\top}$. Since $\textbf{w}_a^{\top} \boldsymbol{\Sigma}_{ab} \textbf{w}_b$ is a scalar and equals its own transpose, $\textbf{w}_b^{\top} \boldsymbol{\Sigma}_{ba} \textbf{w}_a$, putting these two equations together gives:

$$0 = \textbf{w}_a^{\top} \boldsymbol{\Sigma}_{ab} \textbf{w}_b - \rho_1 \\ 0 = \textbf{w}_b^{\top} \boldsymbol{\Sigma}_{ba} \textbf{w}_a - \rho_2 \\ \downarrow \\ \rho_1 = \rho_2$$

Let $\rho = \rho_1 = \rho_2$. Now we have two equations, (7) and (8), and three unknowns, $\textbf{w}_a$, $\textbf{w}_b$, and $\rho$. Let’s solve for $\textbf{w}_a$ in (7) and then substitute it into (8). First, solving for $\textbf{w}_a$ (assuming $\boldsymbol{\Sigma}_{aa}$ is invertible):

$$\begin{aligned} \textbf{0} &= \boldsymbol{\Sigma}_{ab} \textbf{w}_b - \rho \boldsymbol{\Sigma}_{aa} \textbf{w}_a \\ \rho \boldsymbol{\Sigma}_{aa} \textbf{w}_a &= \boldsymbol{\Sigma}_{ab} \textbf{w}_b \\ \textbf{w}_a &= \frac{\boldsymbol{\Sigma}_{aa}^{-1} \boldsymbol{\Sigma}_{ab} \textbf{w}_b}{\rho} \end{aligned} \tag{9}$$

Next, substituting $\textbf{w}_a$ into (8) so that we have one equation and two unknowns:

$$\begin{aligned} \textbf{0} &= \boldsymbol{\Sigma}_{ba} \textbf{w}_a - \rho \boldsymbol{\Sigma}_{bb} \textbf{w}_b \\ &= \boldsymbol{\Sigma}_{ba} \Bigg( \frac{\boldsymbol{\Sigma}_{aa}^{-1} \boldsymbol{\Sigma}_{ab} \textbf{w}_b}{\rho} \Bigg) - \rho \boldsymbol{\Sigma}_{bb} \textbf{w}_b \\ \rho \boldsymbol{\Sigma}_{bb} \textbf{w}_b &= \boldsymbol{\Sigma}_{ba} \Bigg( \frac{\boldsymbol{\Sigma}_{aa}^{-1} \boldsymbol{\Sigma}_{ab} \textbf{w}_b}{\rho} \Bigg) \\ \rho^2 \boldsymbol{\Sigma}_{bb} \textbf{w}_b &= \boldsymbol{\Sigma}_{ba} \boldsymbol{\Sigma}_{aa}^{-1} \boldsymbol{\Sigma}_{ab} \textbf{w}_b \end{aligned}$$

The standard eigenvalue method assumes $\boldsymbol{\Sigma}_{bb}$ is invertible, and we have:

$$\begin{aligned} \rho^2 \boldsymbol{\Sigma}_{bb} \textbf{w}_b &= \boldsymbol{\Sigma}_{ba} \boldsymbol{\Sigma}_{aa}^{-1} \boldsymbol{\Sigma}_{ab} \textbf{w}_b \\ \rho^2 \textbf{w}_b &= \big(\boldsymbol{\Sigma}_{bb}^{-1} \boldsymbol{\Sigma}_{ba} \boldsymbol{\Sigma}_{aa}^{-1} \boldsymbol{\Sigma}_{ab} \big) \textbf{w}_b \end{aligned}$$

We can see that this is the standard eigenvalue problem with $\lambda = \rho^2$ and $\mathbf{A} = \boldsymbol{\Sigma}_{bb}^{-1} \boldsymbol{\Sigma}_{ba} \boldsymbol{\Sigma}_{aa}^{-1} \boldsymbol{\Sigma}_{ab}$, which implies that $\textbf{w}_b$ is an eigenvector that satisfies this equation:

$$\big( \boldsymbol{\Sigma}_{bb}^{-1} \boldsymbol{\Sigma}_{ba} \boldsymbol{\Sigma}_{aa}^{-1} \boldsymbol{\Sigma}_{ab} - \rho^2 \mathbf{I} \big) \textbf{w}_b = \mathbf{0}$$

We can solve for $\textbf{w}_a$ using (9) after solving for $\textbf{w}_b$.

See the Appendix, Section 3 for code.

Discussion

Now that we know how to solve for the linear projections $\textbf{w}_a$ and $\textbf{w}_b$ using the standard eigenvalue method, I think a few points make more sense, and I’d like to review them.

$i$-th pair of canonical variables

In my experience, people typically talk about finding multiple vector projections, each $\textbf{w}_b^i \in \mathbb{R}^q$, rather than a single matrix projection, $W_b \in \mathbb{R}^{q \times r}$, for the view $\mathbf{X}_b$ (and likewise for $\mathbf{X}_a$). This makes more sense once you realize that each eigenvector $\textbf{w}_b^{i}$ is a solution to the optimization objective. The final projections $W_a$ and $W_b$ are indeed matrices, but their columns represent different solutions to the same problem.

The number of pairs of canonical variables

Furthermore, this also explains why the number of canonical variable pairs is $r = \min(p, q)$, where $p$ and $q$ are the dimensionalities of $\mathbf{X}_a$ and $\mathbf{X}_b$ respectively. Now we can say that $r$ is simply the maximum number of linearly independent eigenvectors with nonzero eigenvalues: $\boldsymbol{\Sigma}_{ab}$ has rank at most $\min(p, q)$, and so does the matrix $\mathbf{A}$ in the characteristic equation $\text{det}(\mathbf{A} - \lambda \mathbf{I}) = 0$, which was:

$$\mathbf{A} = \boldsymbol{\Sigma}_{bb}^{-1} \boldsymbol{\Sigma}_{ba} \boldsymbol{\Sigma}_{aa}^{-1} \boldsymbol{\Sigma}_{ab} \in \mathbb{R}^{q \times q}$$
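Here is a quick numerical sketch of this rank argument, using made-up data with $q > p$:

import numpy as np

rng = np.random.RandomState(0)
n, p, q = 100, 3, 5            # q > p, so A should have rank p = min(p, q)
Xa = rng.randn(n, p)
Xb = rng.randn(n, q)
Xa -= Xa.mean(axis=0)
Xb -= Xb.mean(axis=0)

Saa = Xa.T @ Xa / (n - 1)
Sbb = Xb.T @ Xb / (n - 1)
Sab = Xa.T @ Xb / (n - 1)
Sba = Sab.T

A = np.linalg.inv(Sbb) @ Sba @ np.linalg.inv(Saa) @ Sab   # (q x q) matrix
assert A.shape == (q, q)
# A loose tolerance treats numerically tiny singular values as zero.
assert np.linalg.matrix_rank(A, tol=1e-10) == min(p, q)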

Orthogonality

Finally, we can show that our respective canonical variables are orthogonal to each other. Consider Equations (7) and (8) after rearranging terms and setting $\rho_1 = \rho_2 = \rho$:

$$\boldsymbol{\Sigma}_{ab} \textbf{w}_b = \rho \boldsymbol{\Sigma}_{aa} \textbf{w}_a \tag{10}$$

$$\boldsymbol{\Sigma}_{ba} \textbf{w}_a = \rho \boldsymbol{\Sigma}_{bb} \textbf{w}_b \tag{11}$$

Consider a new pair of canonical variables, denoted $\textbf{z}_a^{\star}$ and $\textbf{z}_b^{\star}$, with projections $\textbf{w}_a^{\star}$ and $\textbf{w}_b^{\star}$ and canonical correlation $\rho^{\star}$. We want to show that the respective canonical variables are uncorrelated, i.e.:

$$\textbf{z}_a^{\top} \textbf{z}_a^{\star} = 0, \qquad \textbf{z}_b^{\top} \textbf{z}_b^{\star} = 0$$

Let’s multiply (10) by $\textbf{w}_a^{\star}$:

$$\begin{aligned} \boldsymbol{\Sigma}_{ab} \textbf{w}_b &= \rho \boldsymbol{\Sigma}_{aa} \textbf{w}_a \\ (\boldsymbol{\Sigma}_{ab} \textbf{w}_b)^{\top} \textbf{w}_a^{\star} &= \rho (\boldsymbol{\Sigma}_{aa} \textbf{w}_a)^{\top} \textbf{w}_a^{\star} \\ \textbf{w}_b^{\top} \boldsymbol{\Sigma}_{ab}^{\top} \textbf{w}_a^{\star} &= \rho \textbf{w}_a^{\top} \boldsymbol{\Sigma}_{aa}^{\top} \textbf{w}_a^{\star} \\ \textbf{z}_b^{\top} \textbf{z}_a^{\star} &= \rho \textbf{z}_a^{\top} \textbf{z}_a^{\star} \end{aligned} \tag{12}$$

And (11) by $\textbf{w}_b^{\star}$:

$$\begin{aligned} \boldsymbol{\Sigma}_{ba} \textbf{w}_a &= \rho \boldsymbol{\Sigma}_{bb} \textbf{w}_b \\ (\boldsymbol{\Sigma}_{ba} \textbf{w}_a)^{\top} \textbf{w}_b^{\star} &= \rho (\boldsymbol{\Sigma}_{bb} \textbf{w}_b)^{\top} \textbf{w}_b^{\star} \\ \textbf{w}_a^{\top} \boldsymbol{\Sigma}_{ba}^{\top} \textbf{w}_b^{\star} &= \rho \textbf{w}_b^{\top} \boldsymbol{\Sigma}_{bb}^{\top} \textbf{w}_b^{\star} \\ \textbf{z}_a^{\top} \textbf{z}_b^{\star} &= \rho \textbf{z}_b^{\top} \textbf{z}_b^{\star} \end{aligned} \tag{13}$$

Now, notice that we can switch the roles of the two canonical pairs: the starred pair satisfies its own versions of Equations (10) and (11) with multiplier $\rho^{\star}$, and multiplying those by the unstarred projections gives:

$$\begin{aligned} {\textbf{w}_b^{\star}}^{\top} \boldsymbol{\Sigma}_{ab}^{\top} \textbf{w}_a &= \rho^{\star} {\textbf{w}_a^{\star}}^{\top} \boldsymbol{\Sigma}_{aa}^{\top} \textbf{w}_a \\ {\textbf{z}_b^{\star}}^{\top} \textbf{z}_a &= \rho^{\star} {\textbf{z}_a^{\star}}^{\top} \textbf{z}_a \end{aligned} \tag{14}$$

And:

$$\begin{aligned} {\textbf{w}_a^{\star}}^{\top} \boldsymbol{\Sigma}_{ba}^{\top} \textbf{w}_b &= \rho^{\star} {\textbf{w}_b^{\star}}^{\top} \boldsymbol{\Sigma}_{bb}^{\top} \textbf{w}_b \\ {\textbf{z}_a^{\star}}^{\top} \textbf{z}_b &= \rho^{\star} {\textbf{z}_b^{\star}}^{\top} \textbf{z}_b \end{aligned} \tag{15}$$

Since the dot product is commutative, we can set (12) equal to (15) and (13) equal to (14). To make it easy on the eyes, I’ve color-coded the same values in the equations below:

$$\rho (\color{#11accd}{\textbf{z}_a^{\top} \textbf{z}_a^{\star}}) = \rho^{\star} (\color{#bc2612}{ {\textbf{z}_b^{\star}}^{\top} \textbf{z}_b}) \\ \rho (\color{#bc2612}{\textbf{z}_b^{\top} \textbf{z}_b^{\star}}) = \rho^{\star} (\color{#11accd}{ {\textbf{z}_a^{\star}}^{\top} \textbf{z}_a})$$

These equations can both hold only when $\rho = \rho^{\star}$ or when $\textbf{z}_a^{\top} \textbf{z}_a^{\star} = \textbf{z}_b^{\top} \textbf{z}_b^{\star} = 0$. In words, the correlations among canonical variables are zero except when they are associated with the same canonical correlation.
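Here is a small numerical sketch of this result, using the fit_standard_eigenvalue_method function from the Appendix, Section 3, and synthetic two-view data with a shared latent signal (the data-generating recipe is just an illustration):

import numpy as np

rng = np.random.RandomState(0)
n, p, q, k = 500, 5, 3, 2
Z  = rng.randn(n, k)                             # shared latent signal
Xa = Z @ rng.randn(k, p) + 0.1 * rng.randn(n, p)
Xb = Z @ rng.randn(k, q) + 0.1 * rng.randn(n, q)

Wa, Wb = fit_standard_eigenvalue_method(Xa, Xb)
Za = ((Xa - Xa.mean(axis=0)) / Xa.std(axis=0)) @ Wa
Zb = ((Xb - Xb.mean(axis=0)) / Xb.std(axis=0)) @ Wb

# Distinct canonical variables within each view are (nearly) uncorrelated.
assert np.allclose(np.corrcoef(Za.T), np.eye(min(p, q)), atol=1e-6)
assert np.allclose(np.corrcoef(Zb.T), np.eye(min(p, q)), atol=1e-6)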

Conclusion

CCA is a powerful tool for finding relationships between high-dimensional datasets. There are many natural extensions to this method, such as probabilistic CCA (Bach & Jordan, 2005), which uses maximum likelihood estimates of the parameters to fit a statistical model, and deep CCA (Andrew et al., 2013), which uses nonlinear projections, i.e. neural networks, while constraining the outputs of the networks to be maximally linearly correlated. The former has the benefit of handling uncertainty and modeling the generative process for the data using probabilistic machine learning. The latter leverages the rich representational power of deep learning.

Acknowledgements

I relied heavily on Hotelling’s original paper (Hotelling, 1936) and Uurtio’s fantastic survey (Uurtio et al., 2017) while writing this post and implementing CCA.


Appendix

1. Mean-centered data

To visualize mean-centering, let’s consider two time series datasets. We can think of time-series data as $1$-dimensional, where each time step represents an observation. If two series are identical except for a difference in their $y$-intercepts (Figure 2), we would expect them to have a correlation coefficient of $1$. Note that if their $y$-intercepts are different, their means will be proportionally different.

Figure 2: Two 1-D time series datasets that are perfectly correlated but have different $y$-intercepts and therefore different means.

The standard mathematical way to deal with this problem is to subtract the mean from each series. To see why this works, look at what happens mathematically when one series is simply the other series with some additive offset, i.e. $\textbf{y} = \textbf{x} + \alpha$. First, let’s put the mean of $\textbf{y}$, $\mu_y$, in terms of the mean of $\textbf{x}$, $\mu_x$:

$$\begin{aligned} \mu_y &= \frac{\sum_{i}^n (x_i + \alpha)}{n} = \frac{\sum_i^n x_i}{n} + \frac{n \alpha}{n} \\ &= \mu_x + \alpha \end{aligned}$$

Now let’s compute the correlation between $\textbf{x}$ and $\textbf{y}$:

$$\begin{aligned} \text{corr}(\textbf{x}, \textbf{y}) &= \text{corr}(\textbf{x}, \textbf{x} + \alpha) \\ \\ &= \frac{\sum_i (x_i - \mu_x)(x_i + \alpha - \mu_y)}{\sqrt{\sum_i (x_i - \mu_x)^2} \sqrt{\sum_i (x_i + \alpha - \mu_y)^2}} \\ \\ &= \frac{\sum_i (x_i - \mu_x)\big(x_i + \alpha - (\mu_x + \alpha)\big)}{\sqrt{\sum_i (x_i - \mu_x)^2} \sqrt{\sum_i \big(x_i + \alpha - (\mu_x + \alpha)\big)^2}} \\ \\ &= \frac{\sum_i (x_i - \mu_x)(x_i - \mu_x)}{\sqrt{\sum_i (x_i - \mu_x)^2} \sqrt{\sum_i (x_i - \mu_x)^2}} \\ \\ &= \text{corr}(\textbf{x}, \textbf{x}) \end{aligned}$$

In words, by subtracting the mean, we allow the series to move up or down without affecting the measure of correlation. If we had a $K$-dimensional dataset, we would subtract the mean from each of the $K$ dimensions.
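Here is a tiny numerical sketch of this fact (the series values are made up):

import numpy as np

x = np.array([1.0, 3.0, 2.0, 5.0, 4.0])
y = x + 10.0                      # same series, shifted up by a constant

# Pearson correlation subtracts the means, so the offset has no effect.
assert np.isclose(np.corrcoef(x, y)[0, 1], 1.0)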

2. Scale invariance

Another important property of correlation is scale invariance. This means that two variables can be perfectly correlated even though one variable changes at a scale that is disproportionate to the other. For example, the two time series in Figure 3 are perfectly correlated.

Figure 3: Two time series that are perfectly correlated. Every value in the red time series is multiplied by 3 to generate the blue series.

We can see this property mathematically by exploring what happens when we compute the correlation between two vectors $\textbf{x}$ and $\textbf{y} = \alpha \textbf{x}$, where $\alpha$ is a positive scalar. First, let’s again note the mean of $\textbf{y}$:

$$\mu_y = \frac{\sum_i^n \alpha x_i}{n} = \alpha \frac{\sum_i^n x_i}{n} = \alpha \mu_x$$

$$\begin{aligned} \text{corr}(\textbf{x}, \textbf{y}) &= \text{corr}(\textbf{x}, \alpha \textbf{x}) \\ \\ &= \frac{\sum_i (x_i - \mu_x)(\alpha x_i - \alpha \mu_x)}{\sqrt{\sum_i (x_i - \mu_x)^2} \sqrt{\sum_i (\alpha x_i - \alpha \mu_x)^2}} \\ \\ &= \frac{\alpha \sum_i (x_i - \mu_x)(x_i - \mu_x)}{\alpha \sqrt{\sum_i (x_i - \mu_x)^2} \sqrt{\sum_i (x_i - \mu_x)^2}} \\ \\ &= \frac{\sum_i (x_i - \mu_x)(x_i - \mu_x)}{\sqrt{\sum_i (x_i - \mu_x)^2} \sqrt{\sum_i (x_i - \mu_x)^2}} \\ \\ &= \text{corr}(\textbf{x}, \textbf{x}) \end{aligned}$$

In words, $\textbf{y}$ is perfectly correlated with $\textbf{x}$ despite the two series being scaled differently.
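And a matching sketch for scale invariance (again with made-up values):

import numpy as np

x = np.array([1.0, 3.0, 2.0, 5.0, 4.0])
y = 3.0 * x                        # same series, scaled by alpha = 3

# The scale factor cancels in the correlation.
assert np.isclose(np.corrcoef(x, y)[0, 1], 1.0)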

3. Code

For completeness, here is the code to solve for $\textbf{w}_a$ and $\textbf{w}_b$ using Hotelling’s standard eigenvalue problem method. See my GitHub repository for a complete working version.

import numpy as np

# ------------------------------------------------------------------------------

inv  = np.linalg.inv
mm   = np.matmul

# ------------------------------------------------------------------------------

def fit_standard_eigenvalue_method(Xa, Xb):
    """
    Fits CCA parameters using the standard eigenvalue problem.
    
    :param Xa: Observations with shape (n_samps, p_dim).
    :param Xb: Observations with shape (n_samps, q_dim).
    :return:   Linear transformations Wa and Wb.
    """
    N, p = Xa.shape
    N, q = Xb.shape
    r    = min(p, q)

    # The derivation solves a (q x q) eigenvalue problem for Wb and then
    # recovers Wa from Wb via Equation 9. If q > p, swap the two views so
    # that we always solve the smaller (r x r) problem and avoid dividing
    # by zero canonical correlations; swap the results back at the end.
    swapped = q > p
    if swapped:
        Xa, Xb = Xb, Xa
        p, q = q, p

    # Standardize the data: mean-center and scale to unit variance.
    Xa = (Xa - Xa.mean(axis=0)) / Xa.std(axis=0)
    Xb = (Xb - Xb.mean(axis=0)) / Xb.std(axis=0)

    # Block covariance matrices of the joint data.
    C   = np.cov(Xa.T, Xb.T)
    Caa = C[:p, :p]
    Cbb = C[p:, p:]
    Cab = C[:p, p:]
    Cba = C[p:, :p]

    # Solving the characteristic equation,
    #
    #     det(M - rho^2 I) = 0,
    #
    # is equivalent to finding the eigenvalues rho^2 of the (q x q) matrix
    # M = Cbb^{-1} Cba Caa^{-1} Cab.
    M = mm(mm(inv(Cbb), Cba), mm(inv(Caa), Cab))
    eigvals, eigvecs = np.linalg.eig(M)
    rhos = np.sqrt(eigvals)

    # Ensure we go through eigenvectors in descending order of correlation.
    inds = (-rhos).argsort()
    rhos = rhos[inds]
    # NumPy returns each eigenvector as a column in a matrix.
    Wb   = eigvecs[:, inds]

    # Recover each wa from the corresponding wb using Equation 9.
    Wa = np.zeros((p, r))
    for i, (rho, wb_i) in enumerate(zip(rhos, Wb.T)):
        Wa[:, i] = mm(mm(inv(Caa), Cab), wb_i) / rho

    # Sanity check: the canonical correlations equal the rhos.
    Za  = mm(Xa, Wa)
    Zb  = mm(Xb, Wb)
    CCs = np.zeros(r)
    for i in range(r):
        za = Za[:, i] / np.linalg.norm(Za[:, i])
        zb = Zb[:, i] / np.linalg.norm(Zb[:, i])
        CCs[i] = np.dot(za, zb)
    assert np.allclose(CCs, rhos)

    if swapped:
        Wa, Wb = Wb, Wa
    return Wa, Wb

Note that according to NumPy’s documentation on the eig function, eigenvalues and eigenvectors are not necessarily ordered, hence the ordering using argsort.
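As a quick usage sketch, here is one way to call the function on synthetic two-view data (the shared latent matrix Z and the noise scale are arbitrary choices for illustration):

import numpy as np

rng = np.random.RandomState(0)
n, p, q, k = 500, 5, 3, 2
Z  = rng.randn(n, k)                             # shared latent signal
Xa = Z @ rng.randn(k, p) + 0.1 * rng.randn(n, p)
Xb = Z @ rng.randn(k, q) + 0.1 * rng.randn(n, q)

Wa, Wb = fit_standard_eigenvalue_method(Xa, Xb)
print(Wa.shape)   # (p, r)
print(Wb.shape)   # (q, r)

# The first pair of canonical variables should be strongly correlated
# because the two views share a latent signal.
za = ((Xa - Xa.mean(axis=0)) / Xa.std(axis=0)) @ Wa[:, 0]
zb = ((Xb - Xb.mean(axis=0)) / Xb.std(axis=0)) @ Wb[:, 0]
print(np.corrcoef(za, zb)[0, 1])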

  1. Hotelling, H. (1936). Relations between two sets of variates. Biometrika, 28(3/4), 321–377.
  2. Bach, F. R., & Jordan, M. I. (2005). A probabilistic interpretation of canonical correlation analysis.
  3. Andrew, G., Arora, R., Bilmes, J., & Livescu, K. (2013). Deep canonical correlation analysis. International Conference on Machine Learning, 1247–1255.
  4. Uurtio, V., Monteiro, J. M., Kandola, J., Shawe-Taylor, J., Fernandez-Reyes, D., & Rousu, J. (2017). A tutorial on canonical correlation methods. ACM Computing Surveys (CSUR), 50(6), 95.