Canonical Correlation Analysis in Detail
Canonical correlation analysis is conceptually straightforward, but I want to define its objective and derive its solution in detail, both mathematically and programmatically.
Canonical correlation analysis (CCA) is a multivariate statistical method for finding two linear projections, one for each set of observations in a paired dataset, such that the projected data points are maximally correlated. Provided the data are mean-centered, this procedure can be visualized fairly easily (Figure

The goal of this post is to go through the CCA procedure in painstaking detail, denoting the dimensionality of every vector and matrix, explaining the objective and constraints, walking through a derivation that shows how to solve the objective, and providing working code.
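To make the idea concrete before diving into notation, here is a small sketch using scikit-learn's CCA on made-up synthetic data (the shared signal z, the sizes, and the choice of two components are all arbitrary; the rest of the post derives and implements the method from scratch rather than relying on scikit-learn):

import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
z  = rng.normal(size=(500, 1))                         # shared latent signal
Xa = z @ rng.normal(size=(1, 6)) + rng.normal(size=(500, 6))
Xb = z @ rng.normal(size=(1, 4)) + rng.normal(size=(500, 4))

# Learn one linear projection per dataset and embed both into two dimensions.
Za, Zb = CCA(n_components=2).fit_transform(Xa, Xb)

# The first pair of projected variables should be strongly correlated.
print(np.corrcoef(Za[:, 0], Zb[:, 0])[0, 1])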
CCA objective
Consider two paired datasets,
Let
The goal of CCA is to learn linear transformations
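For reference, the objective can be written as follows. I use w_a and w_b here for the two projection vectors (this choice of symbols is mine, analogous to the columns of Wa and Wb in the Appendix code), and I assume the data matrices are mean-centered:

\rho
= \text{corr}(\mathbf{X}_a \mathbf{w}_a,\, \mathbf{X}_b \mathbf{w}_b)
= \frac{\mathbf{w}_a^{\top} \mathbf{X}_a^{\top} \mathbf{X}_b \mathbf{w}_b}{\sqrt{\mathbf{w}_a^{\top} \mathbf{X}_a^{\top} \mathbf{X}_a \mathbf{w}_a}\,\sqrt{\mathbf{w}_b^{\top} \mathbf{X}_b^{\top} \mathbf{X}_b \mathbf{w}_b}}.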
Where the numerator is the cross-covariance of the projected data and the denominator is the product of the standard deviations of the projected data, both in vectorized form. Notice that since the following derivation holds,
we can rewrite the objective as finding the minimum angle
I like this formulation because it has a nice geometric interpretation for what is happening. We want projections such that our embedded data are vectors pointing in the same direction in lower-dimensional space (again, Figure
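To convince ourselves of this correlation-as-cosine view numerically, here is a small sketch; the random vectors are stand-ins for a pair of projected, mean-centered canonical variables:

import numpy as np

rng = np.random.default_rng(0)
za = rng.normal(size=100)
zb = rng.normal(size=100)
za -= za.mean()
zb -= zb.mean()

# For mean-centered vectors, Pearson correlation is the cosine of the angle
# between them.
corr = np.corrcoef(za, zb)[0, 1]
cos  = np.dot(za, zb) / (np.linalg.norm(za) * np.linalg.norm(zb))
assert np.isclose(corr, cos)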
There are two additional constraints to the above CCA objective. First, the objective is scale invariant, and therefore the canonical variables
So far, we have only talked about finding
The number of pairs of canonical variables, call this
Putting this objective and the constraints in one place, we have:
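In symbols (again using my w_a, w_b shorthand and assuming mean-centered data), one way to state the problem for a new pair of projections is:

\max_{\mathbf{w}_a, \mathbf{w}_b} \;\; \mathbf{w}_a^{\top} \mathbf{X}_a^{\top} \mathbf{X}_b \mathbf{w}_b
\quad \text{subject to} \quad
\|\mathbf{X}_a \mathbf{w}_a\| = \|\mathbf{X}_b \mathbf{w}_b\| = 1,

with the additional requirement, for every pair after the first, that the new canonical variables X_a w_a and X_b w_b are orthogonal to all previously found canonical variables.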
With this notation, we define the goal of CCA as finding
Solving for the objective
There are multiple ways to solve for the projections
Or equivalently, solving the characteristic equation:
First, recall our definition of
And the joint covariance matrix is:
Recall that we constrained the canonical variables to have unit length, which implies:
Furthermore, note:
Substituting Equations (
Let’s ignore the orthogonality constraints for now; we’ll see in a minute that they are handled by how we solve the problem. We can frame our constrained optimization problem using Lagrange multipliers to get:
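For concreteness, the Lagrangian has the following form. Here I write Σ_aa, Σ_bb, and Σ_ab for the covariance and cross-covariance blocks (matching Caa, Cbb, and Cab in the Appendix code); because the objective is scale invariant, the unit-length constraint on the canonical variables can be replaced by the unit-variance constraints w_a^T Σ_aa w_a = w_b^T Σ_bb w_b = 1, and the factors of 1/2 are just a convenience that cleans up the derivatives:

\mathcal{L}(\mathbf{w}_a, \mathbf{w}_b, \lambda_a, \lambda_b)
= \mathbf{w}_a^{\top} \Sigma_{ab} \mathbf{w}_b
- \frac{\lambda_a}{2}\left(\mathbf{w}_a^{\top} \Sigma_{aa} \mathbf{w}_a - 1\right)
- \frac{\lambda_b}{2}\left(\mathbf{w}_b^{\top} \Sigma_{bb} \mathbf{w}_b - 1\right).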
Where
Let’s solve each term separately. I’ll write out the first term (in red) in painful detail just to be sure we understand the derivation:
By the linearity of differentiation, we can compute the derivative as the sum of the terms above. So the
I won’t work through the other derivatives in the same detail because the reasoning is the same, but you can convince yourself that the derivatives are:
The derivatives with respect to
And for
Now, note that the Lagrange multipliers
The same logic applies if we multiply
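To spell this step out in the notation above, setting each derivative to zero and left-multiplying by the matching projection vector isolates each multiplier:

\Sigma_{ab} \mathbf{w}_b - \lambda_a \Sigma_{aa} \mathbf{w}_a = \mathbf{0}
\;\;\Rightarrow\;\;
\lambda_a = \mathbf{w}_a^{\top} \Sigma_{ab} \mathbf{w}_b,
\qquad
\Sigma_{ba} \mathbf{w}_a - \lambda_b \Sigma_{bb} \mathbf{w}_b = \mathbf{0}
\;\;\Rightarrow\;\;
\lambda_b = \mathbf{w}_b^{\top} \Sigma_{ba} \mathbf{w}_a,

where the unit-variance constraints were used. These two scalars are transposes of each other, so λ_a = λ_b, and both equal the canonical correlation ρ.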
Let
Next, substituting
The standard eigenvalue method assumes
We can see that this is the standard eigenvalue problem where
We can solve for
See the Appendix, Section 3 for code.
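To summarize the end result in the same notation, the stationarity conditions combine into two standard eigenvalue problems,

\Sigma_{aa}^{-1}\Sigma_{ab}\Sigma_{bb}^{-1}\Sigma_{ba}\,\mathbf{w}_a = \rho^2\,\mathbf{w}_a,
\qquad
\Sigma_{bb}^{-1}\Sigma_{ba}\Sigma_{aa}^{-1}\Sigma_{ab}\,\mathbf{w}_b = \rho^2\,\mathbf{w}_b,

and in practice we only need to solve the smaller of the two; once one projection is known, the other follows from, e.g., w_a = (1/ρ) Σ_aa^{-1} Σ_ab w_b. This is what the Appendix code does with Caa, Cab, Cbb, and Cba.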
Discussion
Now that we know how to solve for the linear projections
-th pair of canonical variables
In my experience, people typically talk about finding multiple vector projections, each
The number of pairs of canonical variables
Furthermore, this also explains why the number of canonical variable pairs is
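A quick numerical way to see this dimension argument, with made-up sizes p = 5 and q = 3:

import numpy as np

rng = np.random.default_rng(0)
Xa = rng.normal(size=(1000, 5))                   # p = 5
Xb = rng.normal(size=(1000, 3))                   # q = 3

C = np.cov(Xa.T, Xb.T)
Caa, Cab = C[:5, :5], C[:5, 5:]
Cba, Cbb = C[5:, :5], C[5:, 5:]

# The matrix we eigendecompose is only min(p, q) x min(p, q), so there can be
# at most min(p, q) eigenvalue/eigenvector pairs, i.e. canonical variable pairs.
M = np.linalg.inv(Cbb) @ Cba @ np.linalg.inv(Caa) @ Cab
print(M.shape)                                    # (3, 3)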
Orthogonality
Finally, we can show that our respective canonical variables are orthogonal to each other. Consider Equations (
Consider a new pair of canonical variables, denoted
Let’s multiply (
And (
Now, notice that we can switch the ordering of the canonical pairs. This is because both pairs of canonical variables satisfy the same constraints, namely Equations (
And:
Since the dot product is commutative, we can set (
These equations are both true only when
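We can also check this orthogonality empirically. The sketch below assumes the fit_standard_eigenvalue_method function from the Appendix (Section 3) is in scope and uses made-up random data; note that the function standardizes its inputs in place:

import numpy as np

rng = np.random.default_rng(0)
Xa = rng.normal(size=(500, 5))
Xb = rng.normal(size=(500, 4))

Wa, Wb = fit_standard_eigenvalue_method(Xa, Xb)

# Xa and Xb were standardized in place by the fit, so these are the canonical
# variables for the fitted data.
Za = Xa @ Wa
Zb = Xb @ Wb

# Correlations between non-matching canonical variables are numerically zero;
# only matched pairs (and each variable with itself) are correlated.
print(np.round(np.corrcoef(Za.T, Zb.T), 2))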
Conclusion
CCA is a powerful tool for finding relationships between high-dimensional datasets. There are many natural extensions to this method, such as probabilistic CCA (Bach & Jordan, 2005), which uses maximum likelihood estimates of the parameters to fit a statistical model, and deep CCA (Andrew et al., 2013), which uses nonlinear projections, i.e., neural networks, while constraining the outputs of the networks to be maximally linearly correlated. The former has the benefit of handling uncertainty and modeling the generative process for the data using probabilistic machine learning. The latter leverages the rich representational power of deep learning.
Acknowledgements
I relied heavily on Hotelling’s original paper (Hotelling, 1936) and Uurtio’s fantastic survey (Uurtio et al., 2017) while writing this post and implementing CCA.
Appendix
1. Mean-centered data
To visualize mean-centering, let’s consider two time series datasets. We can think of time series data as

The standard mathematical way to deal with this problem is to subtract the mean from each series. To see why this works, look at what happens mathematically when one series is simply the other series with some additive offset, i.e.
Now let’s compute the correlation between
In words, by subtracting the mean, we allow the series to move up or down without affecting the measure of correlation. If we had an
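Here is a tiny numerical check of this offset invariance, with a made-up series and an arbitrary offset of 5:

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = x + 5.0   # the same series, shifted up by a constant

# Pearson correlation subtracts each mean internally, so the offset is irrelevant.
assert np.isclose(np.corrcoef(x, y)[0, 1], 1.0)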
2. Scale invariance
Another important property for correlation is scale invariance. This means that two variables can be perfectly correlated even though one variable changes at a scale that is disproportionate to the other. For example, the two time series in Figure

We can see this property mathematically by exploring what happens when we compute the correlation between two vectors
In words,
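And the analogous numerical check for scale invariance, with a made-up series scaled by an arbitrary factor of 3:

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 3.0 * x   # the same series at a different scale

# The scale factor appears in both the numerator and the denominator and cancels.
assert np.isclose(np.corrcoef(x, y)[0, 1], 1.0)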
3. Code
For completeness, here is the code to solve for
import numpy as np
# ------------------------------------------------------------------------------
inv = np.linalg.inv
mm = np.matmul
# ------------------------------------------------------------------------------
def fit_standard_eigenvalue_method(Xa, Xb):
    """Fits CCA parameters using the standard eigenvalue problem.

    :param Xa: Observations with shape (n_samps, p_dim).
    :param Xb: Observations with shape (n_samps, q_dim).
    :return:   Linear transformations Wa and Wb.
    """
    N, p = Xa.shape
    N, q = Xb.shape
    r = min(p, q)

    # Standardize each column (note: this modifies Xa and Xb in place).
    Xa -= Xa.mean(axis=0)
    Xa /= Xa.std(axis=0)
    Xb -= Xb.mean(axis=0)
    Xb /= Xb.std(axis=0)

    C   = np.cov(Xa.T, Xb.T)
    Caa = C[:p, :p]
    Cbb = C[p:, p:]
    Cab = C[:p, p:]
    Cba = C[p:, :p]
    # Either branch results in an r x r matrix, where r = min(p, q).
    if q < p:
        M = mm(mm(inv(Cbb), Cba), mm(inv(Caa), Cab))
    else:
        M = mm(mm(inv(Caa), Cab), mm(inv(Cbb), Cba))

    # Solving the characteristic equation,
    #
    #     det(M - rho^2 I) = 0
    #
    # is equivalent to solving for rho^2, which are the eigenvalues of the
    # matrix.
    eigvals, eigvecs = np.linalg.eig(M)
    rhos = np.sqrt(eigvals)

    # Ensure we go through the eigenvectors in descending order of rho.
    inds = (-rhos).argsort()
    rhos = rhos[inds]
    # NumPy returns each eigenvector as a column in a matrix.
    eigvecs = eigvecs[:, inds]

    if q < p:
        # The eigenvectors of M are the projections for Xb; recover Wa from them.
        Wb = eigvecs
        Wa = np.zeros((p, r))
        for i, (rho, wb_i) in enumerate(zip(rhos, Wb.T)):
            Wa[:, i] = mm(mm(inv(Caa), Cab), wb_i) / rho
    else:
        # The eigenvectors of M are the projections for Xa; recover Wb from them.
        Wa = eigvecs
        Wb = np.zeros((q, r))
        for i, (rho, wa_i) in enumerate(zip(rhos, Wa.T)):
            Wb[:, i] = mm(mm(inv(Cbb), Cba), wa_i) / rho
    # Sanity check: the canonical correlations equal the rhos.
    Za = mm(Xa, Wa)
    Zb = mm(Xb, Wb)
    # The projected data are mean-centered, so normalizing each column to unit
    # length makes the dot product of a pair equal to its correlation.
    Za /= np.linalg.norm(Za, axis=0)
    Zb /= np.linalg.norm(Zb, axis=0)
    CCs = np.zeros(r)
    for i in range(r):
        CCs[i] = np.dot(Za[:, i], Zb[:, i])
    assert np.allclose(CCs, rhos)

    return Wa, Wb
Note that according to NumPy’s documentation for the eig function, the eigenvalues and eigenvectors are not necessarily ordered, hence the sorting with argsort.
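Here is a short usage sketch on made-up synthetic data with a shared latent signal (the sizes, noise level, and variable names are arbitrary); it reuses the import and function defined above:

rng = np.random.default_rng(0)
N = 1000
z  = rng.normal(size=(N, 1))                              # shared latent signal
Xa = z @ rng.normal(size=(1, 4)) + 0.5 * rng.normal(size=(N, 4))
Xb = z @ rng.normal(size=(1, 3)) + 0.5 * rng.normal(size=(N, 3))

Wa, Wb = fit_standard_eigenvalue_method(Xa, Xb)
print(Wa.shape, Wb.shape)   # (4, 3) and (3, 3): r = min(4, 3) = 3 pairs

# The first canonical correlation should be large because of the shared signal.
print(np.corrcoef((Xa @ Wa)[:, 0], (Xb @ Wb)[:, 0])[0, 1])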
- Hotelling, H. (1936). Relations between two sets of variates. Biometrika, 28(3/4), 321–377.
- Bach, F. R., & Jordan, M. I. (2005). A probabilistic interpretation of canonical correlation analysis.
- Andrew, G., Arora, R., Bilmes, J., & Livescu, K. (2013). Deep canonical correlation analysis. International Conference on Machine Learning, 1247–1255.
- Uurtio, V., Monteiro, J. M., Kandola, J., Shawe-Taylor, J., Fernandez-Reyes, D., & Rousu, J. (2017). A tutorial on canonical correlation methods. ACM Computing Surveys (CSUR), 50(6), 95.