Probabilistic Canonical Correlation Analysis in Detail
Probabilistic canonical correlation analysis is a reinterpretation of CCA as a latent variable model, which has benefits such as generative modeling, handling uncertainty, and composability. I define the model and derive its solution in detail.
Published
10 September 2018
Standard CCA
Canonical correlation analysis (CCA) is a multivariate statistical method for finding two linear projections, one for each set of observations in a paired dataset, such that the projected data points are maximally correlated. For a thorough explanation, please see my previous post.
I will present an abbreviated explanation here for completeness and notation. Let $X_a \in \mathbb{R}^{n \times p}$ and $X_b \in \mathbb{R}^{n \times q}$ be two datasets with $n$ samples each and dimensionality $p$ and $q$ respectively. Let $w_a \in \mathbb{R}^{p}$ and $w_b \in \mathbb{R}^{q}$ be two linear projections and $z_a = X_a w_a$ and $z_b = X_b w_b$ be a pair of $n$-dimensional “canonical variables”. Then the CCA objective is:

$$
\max_{w_a, w_b} \; w_a^{\top} X_a^{\top} X_b w_b
\quad \text{subject to} \quad
w_a^{\top} X_a^{\top} X_a w_a = 1,
\quad
w_b^{\top} X_b^{\top} X_b w_b = 1.
$$
Since $w_a^{\top} X_a^{\top} X_b w_b = z_a^{\top} z_b$, the objective can be visualized as finding linear projections $w_a$ and $w_b$ such that $z_a$ and $z_b$ are close to each other on a unit ball in $\mathbb{R}^{n}$. If we find $r$ such projections where $r \leq \min(p, q)$, then we can embed our two datasets into $r$-dimensional space.
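To make this concrete, here is a minimal NumPy sketch of standard CCA (not the implementation linked below): it whitens each view with a Cholesky factor and takes an SVD of the whitened cross-covariance. The function name and signature are my own, and it assumes both views are mean-centered with full-rank covariance.

```python
import numpy as np

def cca(Xa, Xb, r=1):
    """Minimal CCA sketch for mean-centered views Xa (n x p) and Xb (n x q)."""
    Caa = Xa.T @ Xa
    Cbb = Xb.T @ Xb
    Cab = Xa.T @ Xb
    # Whiten each view via a Cholesky factor, then SVD the cross-covariance.
    La = np.linalg.cholesky(Caa)            # Caa = La @ La.T
    Lb = np.linalg.cholesky(Cbb)            # Cbb = Lb @ Lb.T
    La_inv = np.linalg.inv(La)
    Lb_inv = np.linalg.inv(Lb)
    U, s, Vt = np.linalg.svd(La_inv @ Cab @ Lb_inv.T)
    Wa = La_inv.T @ U[:, :r]                # projections for Xa
    Wb = Lb_inv.T @ Vt[:r].T                # projections for Xb
    return Wa, Wb, s[:r]                    # s[:r] are the achieved correlations
```

The columns of $X_a W_a$ and $X_b W_b$ satisfy the unit-length constraints above, and `s[:r]` are the corresponding canonical correlations.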
Probabilistic interpretation of CCA
A probabilistic interpretation of CCA (PCCA) is one in which our two datasets, $X_a$ and $X_b$, are viewed as two sets of $n$ observations of two random variables, $x_a$ and $x_b$, that are generated by a shared latent variable $z$. Rather than use linear algebra to set up an objective and then solve for two linear projections $w_a$ and $w_b$, we instead write down a model that captures these probabilistic relationships and use maximum likelihood estimation to fit its parameters. See my previous post on probabilistic machine learning if that statement is not clear. The model is:

$$
\begin{aligned}
z &\sim \mathcal{N}(0, I_r), \\
x_a \mid z &\sim \mathcal{N}(W_a z + \mu_a, \Psi_a), \\
x_b \mid z &\sim \mathcal{N}(W_b z + \mu_b, \Psi_b),
\end{aligned} \tag{1}
$$
Where $W_a \in \mathbb{R}^{p \times r}$ and $W_b \in \mathbb{R}^{q \times r}$ are two arbitrary matrices, and $\Psi_a$ and $\Psi_b$ are both positive semi-definite. Bach and Jordan (2005) proved that the resulting maximum likelihood estimates are equivalent, up to rotation and scaling, to CCA.
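To make the generative story concrete, here is a short NumPy sketch that samples a paired dataset from Equation (1). The dimensions and the diagonal choice of $\Psi_a$ and $\Psi_b$ are arbitrary illustrations; the model only requires them to be positive semi-definite.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, q, r = 1000, 5, 3, 2

# Arbitrary parameters for illustration.
Wa = rng.normal(size=(p, r))
Wb = rng.normal(size=(q, r))
Psi_a = np.diag(rng.uniform(0.5, 1.0, size=p))   # diagonal just for simplicity
Psi_b = np.diag(rng.uniform(0.5, 1.0, size=q))
mu_a, mu_b = np.zeros(p), np.zeros(q)            # mean-centered case

# z ~ N(0, I_r); each view is a noisy linear map of the *same* z.
Z  = rng.normal(size=(n, r))
Xa = Z @ Wa.T + mu_a + rng.multivariate_normal(np.zeros(p), Psi_a, size=n)
Xb = Z @ Wb.T + mu_b + rng.multivariate_normal(np.zeros(q), Psi_b, size=n)
```

Because both views are driven by the same draw of $z$, the sampled datasets are correlated across views, which is exactly the structure CCA looks for.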
It is worth being explicit about the differences between this probabilistic framing and the standard framing. In CCA, we take our data and perform matrix multiplications to get lower-dimensional representations $z_a$ and $z_b$:
$$
z_a = X_a w_a, \qquad z_b = X_b w_b
$$
The objective is to find projections $w_a$ and $w_b$ such that $z_a$ and $z_b$ are maximally correlated.
But the probabilistic model, Equation (1), is a function of random variables. If we look at either $x_a$ or $x_b$ on its own (marginalizing over the other view) and drop the subscripts, its generative model is:
$$
x = W z + u \tag{2}
$$
Where $u \sim \mathcal{N}(\mu, \Psi)$. If we assume our data are mean-centered, meaning $\mu = 0$ (we make the same assumption in CCA), and rename $W$ to $\Lambda$, we can see that PCCA is just group factor analysis with two groups (Klami et al., 2015):

$$
\begin{aligned}
x_a &= \Lambda_a z + u_a, \qquad u_a \sim \mathcal{N}(0, \Psi_a), \\
x_b &= \Lambda_b z + u_b, \qquad u_b \sim \mathcal{N}(0, \Psi_b).
\end{aligned}
$$
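Equation (2) also tells us that the marginal covariance of $x$ is $W W^{\top} + \Psi$, which is what the appendix derives block-by-block for the joint vector. Here is a quick numerical sanity check of that claim with made-up parameters (a sketch, not part of the linked implementation):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, r = 200_000, 4, 2

# Hypothetical parameters; data are assumed mean-centered (mu = 0).
W   = rng.normal(size=(p, r))
Psi = np.diag(rng.uniform(0.2, 0.8, size=p))

# x = W z + u with z ~ N(0, I_r) and u ~ N(0, Psi).
Z = rng.normal(size=(n, r))
U = rng.multivariate_normal(np.zeros(p), Psi, size=n)
X = Z @ W.T + U

# The empirical covariance approaches W W^T + Psi as n grows.
print(np.abs(np.cov(X, rowvar=False) - (W @ W.T + Psi)).max())
```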
It’s worth thinking about how the properties of CCA are converted to probabilistic assumptions in PCCA. First, in CCA, $z_a$ and $z_b$ are a pair of embeddings that we correlate. The assumption is that both datasets have similar low-rank approximations. In PCCA, this property is modeled by having a shared latent variable $z$.
Furthermore, in CCA, we proved that the canonical variables are orthogonal. In PCCA, there is no such orthogonality constraint. Instead, we assume the latent variables are independent with an isotropic covariance matrix:
$$
z \sim \mathcal{N}(0, I_r)
$$
This independence assumption is the probabilistic equivalent of orthogonality. The covariance matrix of the latent variables is diagonal, meaning there is no covariance between the $i$-th and $j$-th variables for $i \neq j$.
The final constraint of the CCA objective is that the canonical variables have unit length. In probabilistic terms, this is analogous to unit variance, which we have since the identity matrix $I_r$ is an isotropic matrix with each diagonal term equal to 1.
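As a small numerical illustration of both points (an illustration only, nothing model-specific): sampling from $\mathcal{N}(0, I_r)$ and computing the empirical covariance gives near-zero off-diagonal entries (independence, the analogue of orthogonality) and near-one diagonal entries (unit variance, the analogue of unit length).

```python
import numpy as np

rng = np.random.default_rng(2)
Z = rng.normal(size=(100_000, 3))        # samples of z ~ N(0, I_3)
print(np.round(np.cov(Z, rowvar=False), 2))
# Approximately the 3 x 3 identity: zero covariance, unit variance.
```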
Code
For an implementation of PCCA, please see my GitHub repository of machine learning algorithms, specifically this file.
Appendix
1. Derivations for $\mu_x$ and $\Sigma_x$
Let’s solve for $\mu_x$ and $\Sigma_x$ for the density:

$$
p\left( \begin{bmatrix} x_a \\ x_b \end{bmatrix} \right) = \mathcal{N}(\mu_x, \Sigma_x)
$$
First, note that $\mathbb{E}[z] = 0$ and $\mathbb{E}[u] = \mu$. Then:

$$
\mu_x = \mathbb{E}[x] = \mathbb{E}[W z + u] = W \mathbb{E}[z] + \mathbb{E}[u] = \mu = \begin{bmatrix} \mu_a \\ \mu_b \end{bmatrix}
$$
If the data are mean-centered, as we assume, then:

$$
\begin{bmatrix} \mu_a \\ \mu_b \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}
$$
To understand the covariance matrix $\Sigma_x$,

$$
\Sigma_x = \begin{bmatrix} \Sigma_{aa} & \Sigma_{ab} \\ \Sigma_{ba} & \Sigma_{bb} \end{bmatrix},
$$

let’s consider $\Sigma_{aa}$ and $\Sigma_{ab}$. The remaining block matrices, $\Sigma_{bb}$ and $\Sigma_{ba}$, have identical proofs, respectively, but with the variables renamed. Both derivations will use the fact that $\mathbb{E}[XY] = \mathbb{E}[X] \cdot \mathbb{E}[Y]$ if $X$ and $Y$ are independent. First, let’s consider $\Sigma_{aa}$:
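Under the mean-centered assumption we can write $x_a = W_a z + u_a$ and $x_b = W_b z + u_b$, where $u_a \sim \mathcal{N}(0, \Psi_a)$ and $u_b \sim \mathcal{N}(0, \Psi_b)$ are independent of $z$ and of each other. A sketch of the two computations is then:

$$
\begin{aligned}
\Sigma_{aa} &= \mathbb{E}[x_a x_a^{\top}] = \mathbb{E}\big[(W_a z + u_a)(W_a z + u_a)^{\top}\big] \\
&= W_a \mathbb{E}[z z^{\top}] W_a^{\top} + W_a \mathbb{E}[z] \mathbb{E}[u_a]^{\top} + \mathbb{E}[u_a] \mathbb{E}[z]^{\top} W_a^{\top} + \mathbb{E}[u_a u_a^{\top}] \\
&= W_a W_a^{\top} + \Psi_a,
\\[8pt]
\Sigma_{ab} &= \mathbb{E}[x_a x_b^{\top}] = \mathbb{E}\big[(W_a z + u_a)(W_b z + u_b)^{\top}\big] \\
&= W_a \mathbb{E}[z z^{\top}] W_b^{\top} + W_a \mathbb{E}[z] \mathbb{E}[u_b]^{\top} + \mathbb{E}[u_a] \mathbb{E}[z]^{\top} W_b^{\top} + \mathbb{E}[u_a] \mathbb{E}[u_b]^{\top} \\
&= W_a W_b^{\top},
\end{aligned}
$$

where the cross terms vanish because the variables are independent and zero-mean (this is where the independence fact above is used), and $\mathbb{E}[z z^{\top}] = \operatorname{Cov}(z) = I_r$.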
Bach, F. R., & Jordan, M. I. (2005). A probabilistic interpretation of canonical correlation analysis. Technical Report 688, Department of Statistics, University of California, Berkeley.
Klami, A., Virtanen, S., Leppäaho, E., & Kaski, S. (2015). Group factor analysis. IEEE Transactions on Neural Networks and Learning Systems, 26(9), 2136–2147.