Canonical Correlation Analysis in Detail

Canonical correlation analysis is conceptually straightforward, but I want to define its objective and derive its solution in detail, both mathematically and programmatically.

Canonical correlation analysis (CCA) is a multivariate statistical method for finding two linear projections, one for each set of observations in a paired dataset, such that the projected data points are maximally correlated. Provided the data are mean-centered, this procedure can be visualized fairly easily (Figure 1). CCA finds linear projections such that paired data points are close to each other in a lower-dimensional space. And since mean-centered points are vectors with tails at the origin, CCA can be thought of as minimizing the cosine distance between paired points.

Figure 1: A visualization of canonical correlation analysis. Let $n$ be the number of observations. Two datasets $\mathbf{X}_a \in \mathbb{R}^{n \times p}$ and $\mathbf{X}_b \in \mathbb{R}^{n \times q}$ are transformed by projections $\mathbf{W}_a$ and $\mathbf{W}_b$ such that the paired embeddings, $\mathbf{Z}_a = \mathbf{X}_a \mathbf{W}_a$ and $\mathbf{Z}_b = \mathbf{X}_b \mathbf{W}_b$, are maximally correlated with unit length in the projected space.

The goal of this post is to go through the CCA procedure in painstaking detail, denoting the dimensionality of every vector and matrix, explaining the objective and constraints, walking through a derivation that proves how to solve the objective, and providing working code.

CCA objective

Consider two paired datasets, $\mathbf{X}_a \in \mathbb{R}^{n \times p}$ and $\mathbf{X}_b \in \mathbb{R}^{n \times q}$. By “paired”, I mean that the $i$-th sample consists of two row vectors, $\mathbf{x}_a^{(i)}$ and $\mathbf{x}_b^{(i)}$, with dimensionality $p$ and $q$ respectively. Thus, there are $n$ paired empirical multivariate observations. We assume the data are mean-centered. See the Appendix, Section 1 for a mathematical discussion of what this means.
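To make the setup concrete, here is a small sketch of what such a paired, mean-centered dataset might look like in NumPy. The data are synthetic and the sizes $n = 100$, $p = 5$, $q = 3$ are arbitrary; the array names Xa and Xb match the code in the Appendix.

import numpy as np

n, p, q = 100, 5, 3                      # n paired observations.
rng = np.random.RandomState(0)

Xa = rng.randn(n, p)                     # First view:  n x p.
Xb = Xa[:, :q] + 0.5 * rng.randn(n, q)   # Second view: n x q, correlated with the first.

# Mean-center every column so that each variable has zero mean.
Xa = Xa - Xa.mean(axis=0)
Xb = Xb - Xb.mean(axis=0)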

Let $\mathbf{w}_a \in \mathbb{R}^{p}$ and $\mathbf{w}_b \in \mathbb{R}^{q}$ denote linear transformations of the data (note that these are vectors, not matrices as in Figure 1; this will be explained). A pair of canonical variables $\mathbf{z}_a \in \mathbb{R}^{n}$ and $\mathbf{z}_b \in \mathbb{R}^{n}$ are the images of $\mathbf{X}_a$ and $\mathbf{X}_b$ after projection, i.e.:

$$\mathbf{z}_a = \mathbf{X}_a \mathbf{w}_a \qquad\text{and}\qquad \mathbf{z}_b = \mathbf{X}_b \mathbf{w}_b$$

The goal of CCA is to learn linear transformations $\mathbf{w}_a$ and $\mathbf{w}_b$ such that this pair of canonical variables is maximally correlated. Let $\mathbf{C}_{ij}$ denote the covariance matrix between datasets $\mathbf{X}_i$ and $\mathbf{X}_j$ for $i, j \in \{a, b\}$. Then the CCA objective is:

$$\rho = \max_{\mathbf{w}_a, \mathbf{w}_b} \frac{\mathbf{w}_a^{\top} \mathbf{C}_{ab}\, \mathbf{w}_b}{\sqrt{\mathbf{w}_a^{\top} \mathbf{C}_{aa}\, \mathbf{w}_a}\,\sqrt{\mathbf{w}_b^{\top} \mathbf{C}_{bb}\, \mathbf{w}_b}}$$

where the numerator is the cross-covariance of the projected data and the denominator contains the standard deviations of the projected data, both in vectorized form. Notice that since the following derivation is true (using the fact that the data are mean-centered),

$$\frac{\mathbf{w}_a^{\top}\mathbf{C}_{ab}\,\mathbf{w}_b}{\sqrt{\mathbf{w}_a^{\top}\mathbf{C}_{aa}\,\mathbf{w}_a}\,\sqrt{\mathbf{w}_b^{\top}\mathbf{C}_{bb}\,\mathbf{w}_b}} = \frac{\mathbf{w}_a^{\top}\mathbf{X}_a^{\top}\mathbf{X}_b\,\mathbf{w}_b}{\sqrt{\mathbf{w}_a^{\top}\mathbf{X}_a^{\top}\mathbf{X}_a\,\mathbf{w}_a}\,\sqrt{\mathbf{w}_b^{\top}\mathbf{X}_b^{\top}\mathbf{X}_b\,\mathbf{w}_b}} = \frac{\mathbf{z}_a^{\top}\mathbf{z}_b}{\|\mathbf{z}_a\|\,\|\mathbf{z}_b\|} = \cos\theta,$$

where $\theta$ is the angle between $\mathbf{z}_a$ and $\mathbf{z}_b$,

we can rewrite the objective as finding the minimum angle $\theta$ between the two canonical variables $\mathbf{z}_a$ and $\mathbf{z}_b$:

$$\rho = \max_{\mathbf{w}_a, \mathbf{w}_b} \frac{\mathbf{z}_a^{\top}\mathbf{z}_b}{\|\mathbf{z}_a\|\,\|\mathbf{z}_b\|}$$

I like this formulation because it has a nice geometric interpretation of what is happening. We want projections such that our embedded data are vectors pointing in the same direction in lower-dimensional space (again, Figure 1). See my post on the dot product if the previous statement did not make sense.
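To see this equivalence numerically, here is a quick sketch with synthetic data and arbitrary (not yet optimized) projection vectors wa and wb: the Pearson correlation of the projections equals the cosine of the angle between the mean-centered canonical variables.

import numpy as np

rng = np.random.RandomState(0)
n, p, q = 100, 5, 3
Xa = rng.randn(n, p)
Xb = Xa[:, :q] + 0.5 * rng.randn(n, q)
Xa = Xa - Xa.mean(axis=0)
Xb = Xb - Xb.mean(axis=0)

wa = rng.randn(p)                        # Any projection vectors will do here.
wb = rng.randn(q)
za = Xa @ wa
zb = Xb @ wb

corr   = np.corrcoef(za, zb)[0, 1]
cosine = za @ zb / (np.linalg.norm(za) * np.linalg.norm(zb))
assert np.isclose(corr, cosine)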

There are two additional constraints to the above CCA objective. First, the objective is scale invariant, so we can constrain the canonical variables $\mathbf{z}_a$ and $\mathbf{z}_b$ to have unit length. See Appendix, Section 2 for a mathematical derivation of why correlation is scale invariant. This further simplifies our objective to:

$$\rho = \max_{\mathbf{w}_a, \mathbf{w}_b} \; \mathbf{z}_a^{\top}\mathbf{z}_b \qquad \text{subject to} \quad \|\mathbf{z}_a\| = \|\mathbf{z}_b\| = 1$$

So far, we have only talked about finding $p$- and $q$-dimensional projections to $n$-dimensional canonical variables. But we can find multiple pairs of canonical variables that satisfy the above objective (we’ll see why later). If we would like to find a second pair of canonical variables, the second constraint is that these must be orthogonal to, or uncorrelated with, the first pair. In general, for the $i$-th and $j$-th pairs of canonical variables with $i \neq j$, we have:

$$\mathbf{z}_a^{(i)\top} \mathbf{z}_a^{(j)} = 0 \qquad\text{and}\qquad \mathbf{z}_b^{(i)\top} \mathbf{z}_b^{(j)} = 0$$

The number of pairs of canonical variables, call this $r$, cannot be greater than the minimum of $p$ and $q$ (again, we’ll see why later):

$$r \leq \min(p, q)$$

Putting this objective and the constraints in one place, we have, for each pair $i \in \{1, \dots, r\}$:

$$
\begin{aligned}
&\max_{\mathbf{w}_a^{(i)},\, \mathbf{w}_b^{(i)}} \;\; \mathbf{z}_a^{(i)\top} \mathbf{z}_b^{(i)} \\
&\;\;\text{subject to} \;\; \big\|\mathbf{z}_a^{(i)}\big\| = \big\|\mathbf{z}_b^{(i)}\big\| = 1, \qquad \mathbf{z}_a^{(i)\top} \mathbf{z}_a^{(j)} = \mathbf{z}_b^{(i)\top} \mathbf{z}_b^{(j)} = 0 \;\; (j \neq i)
\end{aligned}
$$

With this notation, we define the goal of CCA as finding linear projections $\mathbf{w}_a^{(i)}$ and $\mathbf{w}_b^{(i)}$, for $i \in \{1, \dots, r\}$, that satisfy the above constraints. Now, let’s see how to find these projections.

Solving for the objective

There are multiple ways to solve for the projections $\mathbf{w}_a$ and $\mathbf{w}_b$. I will discuss the one proposed by Hotelling (Hotelling, 1936), which uses the standard eigenvalue problem, solving for an eigenvector $\mathbf{v}$ and eigenvalue $\lambda$ given a matrix $\mathbf{A}$:

$$\mathbf{A}\mathbf{v} = \lambda\,\mathbf{v}$$

Or equivalently, solving the characteristic equation:

$$\det(\mathbf{A} - \lambda\,\mathbf{I}) = 0$$
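As a quick reminder of what this looks like in code (a toy example, not yet specific to CCA), np.linalg.eig returns eigenvalues and column eigenvectors satisfying the equation above:

import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
eigvals, eigvecs = np.linalg.eig(A)

# Each column v of `eigvecs` satisfies A v = lambda v.
for lam, v in zip(eigvals, eigvecs.T):
    assert np.allclose(A @ v, lam * v)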

First, recall our definition of $\mathbf{C}_{ij}$ as the empirical covariance matrix between the variables in $\mathbf{X}_i$ and $\mathbf{X}_j$, so:

$$\mathbf{C}_{ij} = \mathbf{X}_i^{\top} \mathbf{X}_j, \qquad i, j \in \{a, b\},$$

where the normalizing constant ($1/n$ or $1/(n-1)$) is dropped because it cancels everywhere in what follows.

And the joint covariance matrix is:

$$\mathbf{C} = \begin{bmatrix} \mathbf{C}_{aa} & \mathbf{C}_{ab} \\ \mathbf{C}_{ba} & \mathbf{C}_{bb} \end{bmatrix}$$
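In code, the joint covariance matrix and its four blocks can be computed with np.cov, as in the Appendix. The helper name covariance_blocks is just for illustration; Xa and Xb are assumed to be arrays of shape (n, p) and (n, q).

import numpy as np

def covariance_blocks(Xa, Xb):
    # Return Caa, Cab, Cba, Cbb from the joint covariance of (Xa, Xb).
    p = Xa.shape[1]
    C = np.cov(Xa.T, Xb.T)    # Joint covariance, shape (p + q, p + q).
    Caa = C[:p, :p]
    Cab = C[:p, p:]
    Cba = C[p:, :p]
    Cbb = C[p:, p:]
    return Caa, Cab, Cba, Cbb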

Recall that we constrained the canonical variables to have unit length, which implies:

$$\mathbf{z}_a^{\top}\mathbf{z}_a = \mathbf{w}_a^{\top}\mathbf{X}_a^{\top}\mathbf{X}_a\,\mathbf{w}_a = \mathbf{w}_a^{\top}\mathbf{C}_{aa}\,\mathbf{w}_a = 1 \qquad\text{and}\qquad \mathbf{w}_b^{\top}\mathbf{C}_{bb}\,\mathbf{w}_b = 1$$

Furthermore, note:

$$\mathbf{z}_a^{\top}\mathbf{z}_b = \mathbf{w}_a^{\top}\mathbf{X}_a^{\top}\mathbf{X}_b\,\mathbf{w}_b = \mathbf{w}_a^{\top}\mathbf{C}_{ab}\,\mathbf{w}_b$$

Substituting these covariance expressions and the unit-length constraints into the objective, we get:

$$\max_{\mathbf{w}_a, \mathbf{w}_b} \; \mathbf{w}_a^{\top}\mathbf{C}_{ab}\,\mathbf{w}_b \qquad\text{subject to}\quad \mathbf{w}_a^{\top}\mathbf{C}_{aa}\,\mathbf{w}_a = 1, \quad \mathbf{w}_b^{\top}\mathbf{C}_{bb}\,\mathbf{w}_b = 1$$

Let’s ignore the orthogonality constraints for now. We’ll see in a minute that these are handled by how we solve the problem. We can frame this constrained optimization problem using Lagrange multipliers to get:

$$\mathcal{L}(\mathbf{w}_a, \mathbf{w}_b) = \mathbf{w}_a^{\top}\mathbf{C}_{ab}\,\mathbf{w}_b - \frac{\lambda_a}{2}\left(\mathbf{w}_a^{\top}\mathbf{C}_{aa}\,\mathbf{w}_a - 1\right) - \frac{\lambda_b}{2}\left(\mathbf{w}_b^{\top}\mathbf{C}_{bb}\,\mathbf{w}_b - 1\right)$$

where $\lambda_a$ and $\lambda_b$ are the Lagrange multipliers, which we divide by $2$ to make taking the derivative easier. Taking the derivative of the loss with respect to $\mathbf{w}_a$, we get:

$$\frac{\partial \mathcal{L}}{\partial \mathbf{w}_a} = \frac{\partial}{\partial \mathbf{w}_a}\left[\mathbf{w}_a^{\top}\mathbf{C}_{ab}\,\mathbf{w}_b\right] - \frac{\lambda_a}{2}\,\frac{\partial}{\partial \mathbf{w}_a}\left[\mathbf{w}_a^{\top}\mathbf{C}_{aa}\,\mathbf{w}_a\right]$$

Let’s solve each term separately. I’ll write out the first term in painful detail just to be sure we understand the derivation. Written as a sum over components, it is:

$$\mathbf{w}_a^{\top}\mathbf{C}_{ab}\,\mathbf{w}_b = \sum_{k=1}^{p}\sum_{l=1}^{q} w_{a,k}\, [\mathbf{C}_{ab}]_{kl}\, w_{b,l}$$

By the linearity of differentiation, we can compute the derivative as the sum of the derivatives of the terms above. When we differentiate with respect to a single component $w_{a,m}$, the terms not containing $w_{a,m}$ drop out, and we get:

$$\frac{\partial}{\partial w_{a,m}}\left[\mathbf{w}_a^{\top}\mathbf{C}_{ab}\,\mathbf{w}_b\right] = \sum_{l=1}^{q} [\mathbf{C}_{ab}]_{ml}\, w_{b,l} \quad\Longrightarrow\quad \frac{\partial}{\partial \mathbf{w}_a}\left[\mathbf{w}_a^{\top}\mathbf{C}_{ab}\,\mathbf{w}_b\right] = \mathbf{C}_{ab}\,\mathbf{w}_b$$

I won’t work through the other derivatives in the same detail because the reasoning is the same, but you can convince yourself that the derivatives are:

$$\frac{\partial}{\partial \mathbf{w}_a}\left[\mathbf{w}_a^{\top}\mathbf{C}_{aa}\,\mathbf{w}_a\right] = 2\,\mathbf{C}_{aa}\,\mathbf{w}_a, \qquad \frac{\partial}{\partial \mathbf{w}_b}\left[\mathbf{w}_a^{\top}\mathbf{C}_{ab}\,\mathbf{w}_b\right] = \mathbf{C}_{ba}\,\mathbf{w}_a, \qquad \frac{\partial}{\partial \mathbf{w}_b}\left[\mathbf{w}_b^{\top}\mathbf{C}_{bb}\,\mathbf{w}_b\right] = 2\,\mathbf{C}_{bb}\,\mathbf{w}_b$$

The derivatives with respect to $\mathbf{w}_b$ are symmetric. Putting it all together and setting the gradient to zero, we get:

$$\frac{\partial \mathcal{L}}{\partial \mathbf{w}_a} = \mathbf{C}_{ab}\,\mathbf{w}_b - \lambda_a\,\mathbf{C}_{aa}\,\mathbf{w}_a = \mathbf{0}$$

And for $\mathbf{w}_b$, we get:

$$\frac{\partial \mathcal{L}}{\partial \mathbf{w}_b} = \mathbf{C}_{ba}\,\mathbf{w}_a - \lambda_b\,\mathbf{C}_{bb}\,\mathbf{w}_b = \mathbf{0}$$
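These gradients are easy to sanity-check numerically. The sketch below compares the closed-form derivative of the Lagrangian with respect to wa against a central finite-difference approximation; the data are synthetic and lam_a stands in for $\lambda_a$ with an arbitrary value.

import numpy as np

rng = np.random.RandomState(0)
n, p, q = 200, 4, 3
Xa = rng.randn(n, p); Xa = Xa - Xa.mean(axis=0)
Xb = rng.randn(n, q); Xb = Xb - Xb.mean(axis=0)
Caa = Xa.T @ Xa
Cab = Xa.T @ Xb

wa, wb = rng.randn(p), rng.randn(q)
lam_a  = 0.7

def lagrangian_a(wa):
    # Terms of the Lagrangian that depend on wa.
    return wa @ Cab @ wb - 0.5 * lam_a * (wa @ Caa @ wa - 1)

grad_closed = Cab @ wb - lam_a * (Caa @ wa)

eps = 1e-6
grad_fd = np.array([
    (lagrangian_a(wa + eps * e) - lagrangian_a(wa - eps * e)) / (2 * eps)
    for e in np.eye(p)
])
assert np.allclose(grad_closed, grad_fd, atol=1e-4)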

Now, note that the Lagrange multipliers $\lambda_a$ and $\lambda_b$ are the same. If we multiply the first equation by $\mathbf{w}_a^{\top}$ and use the constraint $\mathbf{w}_a^{\top}\mathbf{C}_{aa}\,\mathbf{w}_a = 1$, we get:

$$\mathbf{w}_a^{\top}\mathbf{C}_{ab}\,\mathbf{w}_b - \lambda_a\,\mathbf{w}_a^{\top}\mathbf{C}_{aa}\,\mathbf{w}_a = 0 \quad\Longrightarrow\quad \lambda_a = \mathbf{w}_a^{\top}\mathbf{C}_{ab}\,\mathbf{w}_b$$

The same logic applies if we multiply the second equation by $\mathbf{w}_b^{\top}$. Putting these two equations together, we get:

$$\lambda_a = \mathbf{w}_a^{\top}\mathbf{C}_{ab}\,\mathbf{w}_b = \mathbf{w}_b^{\top}\mathbf{C}_{ba}\,\mathbf{w}_a = \lambda_b$$

Let $\lambda := \lambda_a = \lambda_b$. Now we have two equations,

$$\mathbf{C}_{ab}\,\mathbf{w}_b - \lambda\,\mathbf{C}_{aa}\,\mathbf{w}_a = \mathbf{0} \qquad\text{and}\qquad \mathbf{C}_{ba}\,\mathbf{w}_a - \lambda\,\mathbf{C}_{bb}\,\mathbf{w}_b = \mathbf{0},$$

and three unknowns, $\mathbf{w}_a$, $\mathbf{w}_b$, and $\lambda$. Let’s solve for $\mathbf{w}_a$ in the first equation and then substitute it into the second. First, solving for $\mathbf{w}_a$ (assuming $\mathbf{C}_{aa}$ is invertible):

$$\mathbf{w}_a = \frac{1}{\lambda}\,\mathbf{C}_{aa}^{-1}\mathbf{C}_{ab}\,\mathbf{w}_b$$

Next, substituting this into the second equation so we have one equation and two unknowns:

$$\mathbf{C}_{ba}\,\mathbf{C}_{aa}^{-1}\mathbf{C}_{ab}\,\mathbf{w}_b = \lambda^2\,\mathbf{C}_{bb}\,\mathbf{w}_b$$

The standard eigenvalue method assumes $\mathbf{C}_{bb}$ is invertible, and we have:

$$\mathbf{C}_{bb}^{-1}\mathbf{C}_{ba}\,\mathbf{C}_{aa}^{-1}\mathbf{C}_{ab}\,\mathbf{w}_b = \lambda^2\,\mathbf{w}_b$$

We can see that this is the standard eigenvalue problem with matrix $\mathbf{M} = \mathbf{C}_{bb}^{-1}\mathbf{C}_{ba}\,\mathbf{C}_{aa}^{-1}\mathbf{C}_{ab}$ and eigenvalue $\rho^2 = \lambda^2$, where $\rho = \lambda = \mathbf{w}_a^{\top}\mathbf{C}_{ab}\,\mathbf{w}_b$ is exactly the canonical correlation. This implies that $\mathbf{w}_b$ is an eigenvector that satisfies this equation:

$$\mathbf{M}\,\mathbf{w}_b = \rho^2\,\mathbf{w}_b$$

We can solve for $\mathbf{w}_a$ using $\mathbf{w}_a = \frac{1}{\rho}\,\mathbf{C}_{aa}^{-1}\mathbf{C}_{ab}\,\mathbf{w}_b$ after solving for $\mathbf{w}_b$.

See the Appendix, Section 3 for code.

Discussion

Now that we know how to solve for the linear projections $\mathbf{w}_a$ and $\mathbf{w}_b$ using the standard eigenvalue method, I think a few points make more sense, and I’d like to review them.

$i$-th pair of canonical variables

In my experience, people typically talk about finding multiple vector projections, each $\mathbf{w}_a^{(i)}$, rather than a single matrix projection, $\mathbf{W}_a$, for the $\mathbf{X}_a$ view (and likewise for $\mathbf{X}_b$). This makes more sense once you realize that each eigenvector of $\mathbf{M}$ is a solution to the optimization objective. The final projections $\mathbf{W}_a$ and $\mathbf{W}_b$ are indeed matrices, but their columns represent different solutions to the same problem.

The number of pairs of canonical variables

Furthermore, this also explains why the number of canonical variable pairs is at most $\min(p, q)$, where $p$ and $q$ are the dimensionalities of $\mathbf{X}_a$ and $\mathbf{X}_b$ respectively. Now we can say that $r$ is simply the maximum number of linearly independent eigenvectors with nonzero eigenvalues. Recall that our matrix in the characteristic equation was:

$$\mathbf{M} = \mathbf{C}_{bb}^{-1}\mathbf{C}_{ba}\,\mathbf{C}_{aa}^{-1}\mathbf{C}_{ab}$$

This is a $q \times q$ matrix, but its rank is at most $\min(p, q)$ because it is bounded by the rank of $\mathbf{C}_{ab}$.
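A quick numerical illustration with synthetic data (the sizes are arbitrary, with $p < q$ on purpose): the matrix $\mathbf{M}$ is $q \times q$, but it only has $\min(p, q)$ linearly independent eigenvectors with nonzero eigenvalues.

import numpy as np

rng = np.random.RandomState(0)
n, p, q = 100, 3, 6
Xa = rng.randn(n, p); Xa = Xa - Xa.mean(axis=0)
Xb = rng.randn(n, q); Xb = Xb - Xb.mean(axis=0)

Caa, Cab = Xa.T @ Xa, Xa.T @ Xb
Cbb, Cba = Xb.T @ Xb, Xb.T @ Xa

M = np.linalg.inv(Cbb) @ Cba @ np.linalg.inv(Caa) @ Cab   # q x q.
print(M.shape, np.linalg.matrix_rank(M))                  # (6, 6) and 3 == min(p, q).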

Orthogonality

Finally, we can show that our respective canonical variables are orthogonal to each other. Consider the two equations from the Lagrangian after rearranging terms and setting $\lambda = \rho$:

$$\mathbf{C}_{ab}\,\mathbf{w}_b = \rho\,\mathbf{C}_{aa}\,\mathbf{w}_a \qquad\text{and}\qquad \mathbf{C}_{ba}\,\mathbf{w}_a = \rho\,\mathbf{C}_{bb}\,\mathbf{w}_b$$

Consider a new pair of canonical variables, denoted $\mathbf{z}_a'$ and $\mathbf{z}_b'$, with projections $\mathbf{w}_a'$ and $\mathbf{w}_b'$ and canonical correlation $\rho'$. We want to show that the respective canonical variables are uncorrelated, i.e.:

$$\mathbf{w}_a^{\top}\mathbf{C}_{aa}\,\mathbf{w}_a' = 0 \qquad\text{and}\qquad \mathbf{w}_b^{\top}\mathbf{C}_{bb}\,\mathbf{w}_b' = 0$$

Let’s multiply the first equation by $\mathbf{w}_a'^{\top}$:

$$\mathbf{w}_a'^{\top}\mathbf{C}_{ab}\,\mathbf{w}_b = \rho\,\mathbf{w}_a'^{\top}\mathbf{C}_{aa}\,\mathbf{w}_a$$

And the second by $\mathbf{w}_b'^{\top}$:

$$\mathbf{w}_b'^{\top}\mathbf{C}_{ba}\,\mathbf{w}_a = \rho\,\mathbf{w}_b'^{\top}\mathbf{C}_{bb}\,\mathbf{w}_b$$

Now, notice that we can switch the ordering of the canonical pairs. This is because both pairs of canonical variables satisfy the same stationarity equations. Multiplying the analogous equation for the primed pair by $\mathbf{w}_a^{\top}$ gives:

$$\mathbf{w}_a^{\top}\mathbf{C}_{ab}\,\mathbf{w}_b' = \rho'\,\mathbf{w}_a^{\top}\mathbf{C}_{aa}\,\mathbf{w}_a'$$

And by $\mathbf{w}_b^{\top}$:

$$\mathbf{w}_b^{\top}\mathbf{C}_{ba}\,\mathbf{w}_a' = \rho'\,\mathbf{w}_b^{\top}\mathbf{C}_{bb}\,\mathbf{w}_b'$$

Since the dot product is commutative and $\mathbf{C}_{ba} = \mathbf{C}_{ab}^{\top}$, the left-hand side of the second equation equals the left-hand side of the third, and the left-hand side of the first equals the left-hand side of the fourth. Setting the corresponding right-hand sides equal gives:

$$\rho\,\mathbf{w}_b'^{\top}\mathbf{C}_{bb}\,\mathbf{w}_b = \rho'\,\mathbf{w}_a^{\top}\mathbf{C}_{aa}\,\mathbf{w}_a' \qquad\text{and}\qquad \rho\,\mathbf{w}_a'^{\top}\mathbf{C}_{aa}\,\mathbf{w}_a = \rho'\,\mathbf{w}_b^{\top}\mathbf{C}_{bb}\,\mathbf{w}_b'$$

These equations are both true only when $\rho = \rho'$ or when $\mathbf{w}_a^{\top}\mathbf{C}_{aa}\,\mathbf{w}_a' = \mathbf{w}_b^{\top}\mathbf{C}_{bb}\,\mathbf{w}_b' = 0$ (combining the two gives $(\rho^2 - \rho'^2)\,\mathbf{w}_a^{\top}\mathbf{C}_{aa}\,\mathbf{w}_a' = 0$ and likewise for the $b$ side). In words, the correlations among canonical variables are zero except when they are associated with the same canonical correlation.
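We can also check this numerically using the fit_standard_eigenvalue_method function from the Appendix, Section 3 (a sketch with synthetic data): after projecting the standardized data and normalizing each canonical variable to unit length, the cross-correlation matrix of the canonical variables is diagonal, with the canonical correlations on the diagonal.

import numpy as np

rng = np.random.RandomState(0)
n, p, q = 500, 5, 3
Xa = rng.randn(n, p)
Xb = Xa[:, :q] + rng.randn(n, q)

Wa, Wb = fit_standard_eigenvalue_method(Xa, Xb)

# Standardize the data the same way the fitting function does, then project.
Za = ((Xa - Xa.mean(axis=0)) / Xa.std(axis=0)) @ Wa
Zb = ((Xb - Xb.mean(axis=0)) / Xb.std(axis=0)) @ Wb
Za = Za / np.linalg.norm(Za, axis=0)
Zb = Zb / np.linalg.norm(Zb, axis=0)

# Off-diagonal entries are (numerically) zero; the diagonal holds the rhos.
print(np.round(Za.T @ Zb, 3))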

Conclusion

CCA is a powerful tool for finding relationships between high-dimensional datasets. There are many natural extensions to this method, such as probabilistic CCA (Bach & Jordan, 2005), which uses maximum likelihood estimates of the parameters to fit a statistical model, and deep CCA (Andrew et al., 2013), which uses nonlinear projections, i.e. neural networks, while constraining the outputs of the networks to be maximally linearly correlated. The former has the benefit of handling uncertainty and modeling the generative process for the data using probabilistic machine learning. The latter leverages the rich representational power of deep learning.

Acknowledgements

I relied heavily on Hotelling’s original paper (Hotelling, 1936) and Uurtio’s fantastic survey (Uurtio et al., 2017) while writing this post and implementing CCA.

   

Appendix

1. Mean-centered data

To visualize mean-centering, let’s consider two time series datasets. We can think of time-series data as $1$-dimensional, where each time step represents an observation. If two series are identical except for a difference in their $y$-intercepts (Figure 2), we would expect them to have a correlation coefficient of $1$. Note that if their $y$-intercepts are different, their means will differ by the same offset.

Figure 2: Two 1-D time series datasets that are perfectly correlated but have different $y$-intercepts and therefore different means.

The standard mathematical way to deal with this problem is to subtract the mean from each series. To see why this works, look at what happens mathematically when one series is simply the other series with some additive offset, i.e. $b_i = a_i + c$ for every observation $i$. First, let’s put the mean of $\mathbf{b}$, $\bar{b}$, in terms of the mean of $\mathbf{a}$, $\bar{a}$:

$$\bar{b} = \frac{1}{n}\sum_{i=1}^{n} b_i = \frac{1}{n}\sum_{i=1}^{n} (a_i + c) = \bar{a} + c$$

Now let’s compute the correlation between $\mathbf{a}$ and $\mathbf{b}$:

$$\text{corr}(\mathbf{a}, \mathbf{b}) = \frac{\sum_{i=1}^{n} (a_i - \bar{a})(b_i - \bar{b})}{\sqrt{\sum_{i=1}^{n} (a_i - \bar{a})^2}\,\sqrt{\sum_{i=1}^{n} (b_i - \bar{b})^2}} = \frac{\sum_{i=1}^{n} (a_i - \bar{a})(a_i + c - \bar{a} - c)}{\sqrt{\sum_{i=1}^{n} (a_i - \bar{a})^2}\,\sqrt{\sum_{i=1}^{n} (a_i + c - \bar{a} - c)^2}} = \frac{\sum_{i=1}^{n} (a_i - \bar{a})^2}{\sum_{i=1}^{n} (a_i - \bar{a})^2} = 1$$

In words, by subtracting the mean, we allow the series to move up or down without affecting the measure of correlation. If we had a higher-dimensional dataset, we would subtract the mean from every dimension.
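A two-line check of this in NumPy, using an arbitrary toy series and an arbitrary offset of 5:

import numpy as np

a = np.sin(np.linspace(0, 4 * np.pi, 100))   # A toy time series.
b = a + 5.0                                  # Same series, shifted upward.
print(np.corrcoef(a, b)[0, 1])               # 1.0 (up to floating point).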

2. Scale invariance

Another important property of correlation is scale invariance. This means that two variables can be perfectly correlated even though one variable changes at a scale that is disproportionate to the other. For example, the two time series in Figure 3 are perfectly correlated.

Figure 3: Two time series that are perfectly correlated. Every value in the red time series is multiplied by a positive constant to generate the blue series.

We can see this property mathematically by exploring what happens when we compute the correlation between two vectors $\mathbf{a}$ and $\mathbf{b} = k\mathbf{a}$, where $k > 0$ is a scalar. First, let’s again note the mean of $\mathbf{b}$:

$$\bar{b} = \frac{1}{n}\sum_{i=1}^{n} k\, a_i = k\,\bar{a}$$

Then the correlation is:

$$\text{corr}(\mathbf{a}, \mathbf{b}) = \frac{\sum_{i=1}^{n} (a_i - \bar{a})(k a_i - k\bar{a})}{\sqrt{\sum_{i=1}^{n} (a_i - \bar{a})^2}\,\sqrt{\sum_{i=1}^{n} (k a_i - k\bar{a})^2}} = \frac{k \sum_{i=1}^{n} (a_i - \bar{a})^2}{k \sum_{i=1}^{n} (a_i - \bar{a})^2} = 1$$

In words, $\mathbf{b}$ is perfectly correlated with $\mathbf{a}$ despite the two series being scaled differently.
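And the analogous check for scaling, with an arbitrary positive constant of 3:

import numpy as np

a = np.sin(np.linspace(0, 4 * np.pi, 100))   # A toy time series.
b = 3.0 * a                                  # Same series, rescaled.
print(np.corrcoef(a, b)[0, 1])               # 1.0 (up to floating point).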

3. Code

For completeness, here is the code to solve for $\mathbf{W}_a$ and $\mathbf{W}_b$ using Hotelling’s standard eigenvalue problem method. See my GitHub repository for a complete working version.

import numpy as np

# ------------------------------------------------------------------------------

inv  = np.linalg.inv
mm   = np.matmul

# ------------------------------------------------------------------------------

def fit_standard_eigenvalue_method(Xa, Xb):
    """
    Fits CCA parameters using the standard eigenvalue problem.
    
    :param Xa: Observations with shape (n_samps, p_dim).
    :param Xb: Observations with shape (n_samps, q_dim).
    :return:   Linear transformations Wa and Wb.
    """
    N, p = Xa.shape
    N, q = Xb.shape
    r    = min(p, q)

    # Standardize each view (without modifying the caller's arrays in place).
    Xa = (Xa - Xa.mean(axis=0)) / Xa.std(axis=0)
    Xb = (Xb - Xb.mean(axis=0)) / Xb.std(axis=0)

    C   = np.cov(Xa.T, Xb.T)
    Caa = C[:p, :p]
    Cbb = C[p:, p:]
    Cab = C[:p, p:]
    Cba = C[p:, :p]

    # Either branch results in an r x r matrix where r = min(p, q). Its
    # eigenvectors are the projections for the lower-dimensional view; the
    # other view's projections are recovered from them afterward.
    if q <= p:
        M = mm(mm(inv(Cbb), Cba), mm(inv(Caa), Cab))
    else:
        M = mm(mm(inv(Caa), Cab), mm(inv(Cbb), Cba))

    # Solving the characteristic equation,
    #
    #     det(M - rho^2 I) = 0
    #
    # is equivalent to solving for rho^2, which are the eigenvalues of the
    # matrix.
    eigvals, eigvecs = np.linalg.eig(M)
    rhos = np.sqrt(eigvals)

    # Ensure we go through eigenvectors in descending order of correlation.
    inds = (-rhos).argsort()
    rhos = rhos[inds]
    # NumPy returns each eigenvector as a column in a matrix.
    eigvecs = eigvecs[:, inds]

    if q <= p:
        # The eigenvectors are the w_b's; recover each w_a via
        # w_a = inv(Caa) Cab w_b / rho.
        Wb = eigvecs
        Wa = np.zeros((p, r))
        for i, (rho, wb_i) in enumerate(zip(rhos, Wb.T)):
            Wa[:, i] = mm(mm(inv(Caa), Cab), wb_i) / rho
    else:
        # The eigenvectors are the w_a's; recover each w_b via
        # w_b = inv(Cbb) Cba w_a / rho.
        Wa = eigvecs
        Wb = np.zeros((q, r))
        for i, (rho, wa_i) in enumerate(zip(rhos, Wa.T)):
            Wb[:, i] = mm(mm(inv(Cbb), Cba), wa_i) / rho

    # Sanity check: the canonical correlations are equal to the rhos.
    Za = mm(Xa, Wa)
    Zb = mm(Xb, Wb)
    Za = Za / np.linalg.norm(Za, axis=0)
    Zb = Zb / np.linalg.norm(Zb, axis=0)
    CCs = np.zeros(r)

    for i in range(r):
        CCs[i] = np.dot(Za[:, i], Zb[:, i])
    assert np.allclose(CCs, rhos)

    return Wa, Wb

Note that according to NumPy’s documentation on the eig function, eigenvalues and eigenvectors are not necessarily ordered, hence the ordering using argsort.
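For reference, here is one way to call this function on synthetic paired data; this usage sketch is not part of the original implementation, and the sizes are arbitrary.

import numpy as np

rng = np.random.RandomState(0)
n, p, q = 500, 5, 3
Xa = rng.randn(n, p)
Xb = Xa[:, :q] + rng.randn(n, q)   # A correlated second view.

Wa, Wb = fit_standard_eigenvalue_method(Xa, Xb)
print(Wa.shape, Wb.shape)          # (5, 3) and (3, 3).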

  1. Hotelling, H. (1936). Relations between two sets of variates. Biometrika, 28(3/4), 321–377.
  2. Bach, F. R., & Jordan, M. I. (2005). A probabilistic interpretation of canonical correlation analysis.
  3. Andrew, G., Arora, R., Bilmes, J., & Livescu, K. (2013). Deep canonical correlation analysis. International Conference on Machine Learning, 1247–1255.
  4. Uurtio, V., Monteiro, J. M., Kandola, J., Shawe-Taylor, J., Fernandez-Reyes, D., & Rousu, J. (2017). A tutorial on canonical correlation methods. ACM Computing Surveys (CSUR), 50(6), 95.