Probabilistic Canonical Correlation Analysis in Detail

Probabilistic canonical correlation analysis is a reinterpretation of CCA as a latent variable model, which has benefits such as generative modeling, handling uncertainty, and composability. I define and derive its solution in detail.

Standard CCA

Canonical correlation analysis (CCA) is a multivariate statistical method for finding two linear projections, one for each set of observations in a paired dataset, such that the projected data points are maximally correlated. For a thorough explanation, please see my previous post.

I will present an abbreviated explanation here for completeness and to establish notation. Let $X_a \in \mathbb{R}^{n \times p}$ and $X_b \in \mathbb{R}^{n \times q}$ be two datasets with $n$ samples each and dimensionality $p$ and $q$ respectively. Let $\textbf{w}_a \in \mathbb{R}^{p}$ and $\textbf{w}_b \in \mathbb{R}^{q}$ be two linear projections and $\textbf{z}_a = X_a \textbf{w}_a$ and $\textbf{z}_b = X_b \textbf{w}_b$ be a pair of $n$-dimensional “canonical variables”. Then the CCA objective is:

$$\textbf{w}_a^{\star}, \textbf{w}_b^{\star} = \arg\!\max_{ \textbf{w}_a, \textbf{w}_b } \{ \textbf{w}_a^{\top} X_a^{\top} X_b \textbf{w}_b \}$$

With the constraints that:

$$\lVert \textbf{z}_a \rVert_2 = \sqrt{\textbf{w}_a^{\top} X_a^{\top} X_a \textbf{w}_a} = 1 \qquad \lVert \textbf{z}_b \rVert_2 = \sqrt{\textbf{w}_b^{\top} X_b^{\top} X_b \textbf{w}_b} = 1$$

Since $\textbf{w}_a^{\top} X_a^{\top} X_b \textbf{w}_b = \textbf{z}_a^{\top} \textbf{z}_b$, the objective can be visualized as finding linear projections $\textbf{w}_a$ and $\textbf{w}_b$ such that $\textbf{z}_a$ and $\textbf{z}_b$ are close to each other on the unit sphere in $\mathbb{R}^n$. If we find $r$ such projections where $r \leq \min(p, q)$, then we can embed our two datasets into $r$-dimensional space.
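To make this concrete, here is a minimal NumPy sketch of one standard route to the solution (whitening each view, then taking an SVD of the whitened cross-covariance). The function names and the assumption that the inputs are mean-centered are mine; treat this as a sketch, not a reference implementation:

```python
import numpy as np

def inv_sqrt(C):
    """Inverse matrix square root of a symmetric positive definite matrix."""
    vals, vecs = np.linalg.eigh(C)
    return vecs @ np.diag(1.0 / np.sqrt(vals)) @ vecs.T

def cca(Xa, Xb, r):
    """Sketch of standard CCA: top-r projections and canonical correlations.

    Assumes Xa (n x p) and Xb (n x q) are mean-centered.
    """
    n = Xa.shape[0]
    Caa = Xa.T @ Xa / n                     # covariance of view a
    Cbb = Xb.T @ Xb / n                     # covariance of view b
    Cab = Xa.T @ Xb / n                     # cross-covariance
    Caa_isqrt, Cbb_isqrt = inv_sqrt(Caa), inv_sqrt(Cbb)
    U, s, Vt = np.linalg.svd(Caa_isqrt @ Cab @ Cbb_isqrt)
    Wa = Caa_isqrt @ U[:, :r]               # projections for view a
    Wb = Cbb_isqrt @ Vt[:r].T               # projections for view b
    return Wa, Wb, s[:r]                    # s[:r] are the canonical correlations
```

Roughly, the whitening absorbs the normalization constraints, and the SVD then maximizes the resulting inner product, so the singular values are the canonical correlations.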

Probabilistic interpretation of CCA

A probabilistic interpretation of CCA (PCCA) is one in which our two datasets, $X_a$ and $X_b$, are viewed as two sets of $n$ observations of two random variables, $\textbf{x}_a$ and $\textbf{x}_b$, that are generated by a shared latent variable $\textbf{z}$. Rather than use linear algebra to set up an objective and then solve for two linear projections $\textbf{w}_a$ and $\textbf{w}_b$, we instead write down a model that captures these probabilistic relationships and fit its parameters by maximum likelihood. See my previous post on probabilistic machine learning if that statement is not clear. The model is:

$$\begin{aligned} \textbf{z} &\sim \mathcal{N}(0, I_r), \quad r \leq \min(p, q) \\ \textbf{x}_a \mid \textbf{z} &\sim \mathcal{N}(W_a \textbf{z} + \boldsymbol{\mu}_a, \Psi_a) \\ \textbf{x}_b \mid \textbf{z} &\sim \mathcal{N}(W_b \textbf{z} + \boldsymbol{\mu}_b, \Psi_b) \end{aligned} \tag{1}$$

Where $W_a \in \mathbb{R}^{p \times r}$ and $W_b \in \mathbb{R}^{q \times r}$ are two arbitrary matrices, and $\Psi_a$ and $\Psi_b$ are both positive semi-definite. Bach & Jordan (2005) proved that the resulting maximum likelihood estimates are equivalent, up to rotation and scaling, to CCA.
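To make the generative story concrete, here is a small sketch of ancestral sampling from Equation (1) in NumPy. The sizes, the zero means, and the diagonal choice for $\Psi_a$ and $\Psi_b$ are illustrative assumptions, not requirements of the model:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, q, r = 1000, 5, 4, 2                      # illustrative sizes

Wa = rng.normal(size=(p, r))                    # arbitrary loading matrices
Wb = rng.normal(size=(q, r))
Psi_a = np.diag(rng.uniform(0.5, 1.5, size=p))  # any PSD matrices work
Psi_b = np.diag(rng.uniform(0.5, 1.5, size=q))

# Ancestral sampling: draw z, then draw each view conditioned on z.
Z = rng.normal(size=(n, r))                     # z ~ N(0, I_r)
Xa = Z @ Wa.T + rng.multivariate_normal(np.zeros(p), Psi_a, size=n)
Xb = Z @ Wb.T + rng.multivariate_normal(np.zeros(q), Psi_b, size=n)
```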

It is worth being explicit about the differences between this probabilistic framing and the standard framing. In CCA, we take our data and perform matrix multiplications to get lower-dimensional representations $\textbf{z}_a$ and $\textbf{z}_b$:

$$\begin{aligned} \textbf{z}_a &= X_a \textbf{w}_a \\ \textbf{z}_b &= X_b \textbf{w}_b \end{aligned}$$

The objective is to find projections $\textbf{w}_a$ and $\textbf{w}_b$ such that $\textbf{z}_a$ and $\textbf{z}_b$ are maximally correlated.
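As a quick illustration, we can run the `cca` sketch from above on the sampled `Xa` and `Xb` (both assumed to still be in scope); the first pair of canonical variables should have a correlation approximately equal to the first canonical correlation, up to mean-centering effects:

```python
Wa_hat, Wb_hat, rho = cca(Xa, Xb, r=2)
za = Xa @ Wa_hat[:, 0]                     # first pair of canonical variables
zb = Xb @ Wb_hat[:, 0]
print(np.corrcoef(za, zb)[0, 1], rho[0])   # approximately equal
```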

But the probabilistic model, Equation (1), is a function of random variables. If we consider either $\textbf{x}_a$ or $\textbf{x}_b$ on its own, Equation (1) says it is generated as:

$$\textbf{x} = W \textbf{z} + \textbf{u} \tag{2}$$

Where $\textbf{u} \sim \mathcal{N}(\boldsymbol{\mu}, \Psi)$. If we assume our data is mean-centered, meaning $\boldsymbol{\mu} = 0$ (we make the same assumption in CCA), and rename $W$ to $\Lambda$, we can see that PCCA is just group factor analysis with two groups (Klami et al., 2015):

$$\begin{aligned} \textbf{x}_a &= \Lambda_a \textbf{z} + \textbf{u}_a \\ \textbf{x}_b &= \Lambda_b \textbf{z} + \textbf{u}_b \end{aligned} \tag{3}$$
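As a quick check that each view really is a factor analysis model on its own, marginalizing $\textbf{z}$ out of the first line of Equation (3) (an affine function of a Gaussian is Gaussian, and we assume mean-centered data) gives

$$p(\textbf{x}_a) = \int p(\textbf{x}_a \mid \textbf{z}) \, p(\textbf{z}) \, d\textbf{z} = \mathcal{N}(\textbf{0}, \Lambda_a \Lambda_a^{\top} + \Psi_a)$$

which is exactly the $\Sigma_{aa}$ block derived in the Appendix.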

This reframing is convenient because we can reuse the maximum likelihood updates we derived for factor analysis. To see this, let’s use block matrices to represent our data and parameters:

$$\textbf{x} = \begin{bmatrix} \textbf{x}_a \\ \textbf{x}_b \end{bmatrix}, \qquad \Lambda = \begin{bmatrix} \Lambda_a \\ \Lambda_b \end{bmatrix}, \qquad \textbf{u} = \begin{bmatrix} \textbf{u}_a \\ \textbf{u}_b \end{bmatrix}$$

Where $\textbf{x} \in \mathbb{R}^{p + q}$, $\Lambda \in \mathbb{R}^{(p+q) \times r}$, and $\textbf{u} \in \mathbb{R}^{p + q}$, and

$$\textbf{u} \sim \mathcal{N} \Big( \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} \Psi_a & 0 \\ 0 & \Psi_b \end{bmatrix} \Big)$$
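In code, this stacking is just concatenation. A minimal sketch, reusing the arrays `Xa`, `Xb`, `Wa`, `Wb`, `Psi_a`, and `Psi_b` from the sampling snippet above (after the renaming, `Wa` and `Wb` play the roles of $\Lambda_a$ and $\Lambda_b$):

```python
import numpy as np
from scipy.linalg import block_diag

X = np.hstack([Xa, Xb])          # each row is x_i = [x_a; x_b] in R^(p+q)
Lmbda = np.vstack([Wa, Wb])      # Lambda is (p+q) x r
Psi = block_diag(Psi_a, Psi_b)   # block-diagonal: no noise covariance across views
```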

Then our PCCA updates are identical to our updates for factor analysis:

$$\begin{aligned} \Lambda^{\star} &= \Big( \sum_{i=1}^{n} \textbf{x}_i \mathbb{E}_{\textbf{z} \mid \textbf{x}} \big[ \textbf{z} \mid \textbf{x}_i \big]^{\top} \Big) \Big( \sum_{i=1}^{n} \mathbb{E}_{\textbf{z} \mid \textbf{x}} \big[ \textbf{z} \textbf{z}^{\top} \mid \textbf{x}_i \big] \Big)^{-1} \\ \\ \Psi^{\star} &= \frac{1}{n} \text{diag}\Bigg( \sum_{i=1}^{n} \textbf{x}_i \textbf{x}_i^{\top} - \Lambda^{\star} \mathbb{E}_{\textbf{z} \mid \textbf{x}} \big[ \textbf{z} \mid \textbf{x}_i \big] \textbf{x}_i^{\top} \Bigg) \end{aligned}$$
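Here is a hedged NumPy sketch of these updates, run on the stacked matrix `X` from the previous snippet. The E-step uses the standard factor-analysis posterior moments of $\textbf{z}$ given $\textbf{x}_i$, and the $\text{diag}(\cdot)$ restricts $\Psi$ to be diagonal, as in factor analysis; Bach & Jordan's PCCA allows full within-view noise covariances, so treat this as a simplified sketch:

```python
import numpy as np

def pcca_em(X, r, n_iters=100, seed=0):
    """EM for PCCA via the stacked factor-analysis updates (a sketch).

    X: n x (p+q) matrix of stacked, mean-centered observations.
    Returns the loading matrix Lambda ((p+q) x r) and diagonal noise Psi.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    Lmbda = rng.normal(size=(d, r))
    Psi = np.ones(d)                                   # diagonal entries of Psi

    for _ in range(n_iters):
        # E-step: posterior moments of z given each x_i.
        Psi_inv_Lmbda = Lmbda / Psi[:, None]           # Psi^{-1} Lambda
        G = np.linalg.inv(np.eye(r) + Lmbda.T @ Psi_inv_Lmbda)  # posterior covariance
        Ez = X @ Psi_inv_Lmbda @ G                     # rows are E[z | x_i]
        Ezz_sum = n * G + Ez.T @ Ez                    # sum_i E[z z^T | x_i]

        # M-step: the Lambda* and Psi* updates above.
        Lmbda = (X.T @ Ez) @ np.linalg.inv(Ezz_sum)
        Psi = np.diag(X.T @ X - Lmbda @ (Ez.T @ X)) / n

    return Lmbda, Psi
```

Running `pcca_em(X, r=2)` on the samples generated earlier should recover the loadings only up to a rotation of the latent space, which is the same sense in which the maximum likelihood estimates match CCA.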

Furthermore, this definition gives us a joint density for $\textbf{x}$ that should look similar to the density $p(\textbf{x})$ in factor analysis:

$$p(\textbf{x}) = \mathcal{N} \Big( \begin{bmatrix} 0 \\ 0 \end{bmatrix} , \begin{bmatrix} \Lambda_a \Lambda_a^{\top} + \Psi_a & \Lambda_a \Lambda_b^{\top} \\ \Lambda_b \Lambda_a^{\top} & \Lambda_b \Lambda_b^{\top} + \Psi_b \end{bmatrix} \Big)$$

See the Appendix for a derivation.
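As a quick numerical check of this density, the empirical covariance of the stacked samples from earlier should approach the block matrix above (a sketch reusing `X`, `Lmbda`, and `Psi` from the stacking snippet):

```python
Sigma = Lmbda @ Lmbda.T + Psi               # model covariance: Lambda Lambda^T + Psi
Sigma_hat = X.T @ X / X.shape[0]            # empirical covariance (data ~ mean-centered)
print(np.max(np.abs(Sigma - Sigma_hat)))    # small for large n
```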

It’s worth thinking about how the properties of CCA are converted to probabilistic assumptions in PCCA. First, in CCA, $\textbf{z}_a$ and $\textbf{z}_b$ are a pair of embeddings that we correlate. The assumption is that both datasets have similar low-rank approximations. In PCCA, this property is modeled by having a shared latent variable $\textbf{z}$.

Furthermore, in CCA, we proved that the canonical variables are orthogonal. In PCCA, there is no such orthogonality constraint. Instead, we assume the latent variables are independent with an isotropic covariance matrix:

$$\textbf{z} \sim \mathcal{N}(0, I_r)$$

This independence assumption is the probabilistic equivalent of orthogonality. The covariance matrix of the latent variables is diagonal, meaning there is no covariance between the $i$-th and $j$-th latent variables for $i \neq j$.

The final constraint of the CCA objective is that the vectors have unit length. In probabilistic terms, this is analogous to unit variance, which we have since the identity matrix $I_r$ is an isotropic matrix with each diagonal term equal to $1$.

Code

For an implementation of PCCA, please see my GitHub repository of machine learning algorithms, specifically this file.


Appendix

1. Derivations for $\boldsymbol{\mu}_x$ and $\Sigma_x$

Let’s solve for $\boldsymbol{\mu}_x$ and $\Sigma_x$ for the density:

$$p\Big(\begin{bmatrix} \textbf{x}_a \\ \textbf{x}_b \end{bmatrix}\Big) = \mathcal{N}(\boldsymbol{\mu}_x, \Sigma_x)$$

First, note that $\mathbb{E}[\textbf{z}] = 0$ and $\mathbb{E}[\textbf{u}] = \boldsymbol{\mu}$. Then:

$$\begin{aligned} \boldsymbol{\mu}_x = \mathbb{E}[\textbf{x}] &= \mathbb{E}[\Lambda \textbf{z} + \textbf{u}] \\ &= \Lambda \mathbb{E}[\textbf{z}] + \mathbb{E}[\textbf{u}] \\ &= \boldsymbol{\mu} \\ &= \begin{bmatrix} \boldsymbol{\mu}_a \\ \boldsymbol{\mu}_b \end{bmatrix} \end{aligned}$$

If the data are mean-centered, as we assume, then:

$$\begin{bmatrix} \boldsymbol{\mu}_a \\ \boldsymbol{\mu}_b \end{bmatrix} = \begin{bmatrix} \textbf{0} \\ \textbf{0} \end{bmatrix}$$

To understand the covariance matrix $\Sigma_x$,

$$\Sigma_x = \begin{bmatrix} \Sigma_{aa} & \Sigma_{ab} \\ \Sigma_{ba} & \Sigma_{bb} \end{bmatrix}$$

let’s consider $\Sigma_{aa}$ and $\Sigma_{ab}$. The remaining blocks, $\Sigma_{bb}$ and $\Sigma_{ba}$, have identical derivations with the variables renamed. Both derivations will use the fact that $\mathbb{E}[XY] = \mathbb{E}[X] \cdot \mathbb{E}[Y]$ if $X$ and $Y$ are independent. First, let’s consider $\Sigma_{aa}$:

$$\begin{aligned} \Sigma_{aa} &= \mathbb{E}[(\textbf{x}_a - \boldsymbol{\mu}_a)(\textbf{x}_a - \boldsymbol{\mu}_a)^{\top}] \\ &= \mathbb{E}[\textbf{x}_a \textbf{x}_a^{\top}] - \boldsymbol{\mu}_a \boldsymbol{\mu}_a^{\top} \\ &= \mathbb{E}[(\Lambda_a \textbf{z} + \textbf{u}_a)(\Lambda_a \textbf{z} + \textbf{u}_a)^{\top}] - \boldsymbol{\mu}_a \boldsymbol{\mu}_a^{\top} \\ &= \mathbb{E}[ \Lambda_a \textbf{z} \textbf{z}^{\top} \Lambda_a^{\top} + \textbf{u}_a \textbf{z}^{\top} \Lambda_a^{\top} + \Lambda_a \textbf{z} \textbf{u}_a^{\top} + \textbf{u}_a \textbf{u}_a^{\top} ] - \boldsymbol{\mu}_a \boldsymbol{\mu}_a^{\top} \\ &= \underbrace{\Lambda_a \mathbb{E}[\textbf{z} \textbf{z}^{\top}] \Lambda_a^{\top}}_{\Lambda_a I_r \Lambda_a^{\top}} + \underbrace{\mathbb{E}[\textbf{u}_a] \mathbb{E}[\textbf{z}^{\top}] \Lambda_a^{\top}}_{0} + \underbrace{\Lambda_a \mathbb{E}[\textbf{z}] \mathbb{E}[\textbf{u}_a^{\top}]}_{0} + \underbrace{\mathbb{E}[\textbf{u}_a \textbf{u}_a^{\top}]}_{\Psi_a + \boldsymbol{\mu}_a \boldsymbol{\mu}_a^{\top}} - \boldsymbol{\mu}_a \boldsymbol{\mu}_a^{\top} \\ &= \Lambda_a \Lambda_a^{\top} + \Psi_a \end{aligned}$$

So for the two views, $a$ and $b$, we have:

$$\begin{aligned} \Sigma_{aa} &= \Lambda_a \Lambda_a^{\top} + \Psi_a \\ \Sigma_{bb} &= \Lambda_b \Lambda_b^{\top} + \Psi_b \end{aligned}$$

Now, let’s consider $\Sigma_{ab}$:

$$\begin{aligned} \Sigma_{ab} &= \mathbb{E}[(\textbf{x}_a - \boldsymbol{\mu}_a)(\textbf{x}_b - \boldsymbol{\mu}_b)^{\top}] \\ &= \mathbb{E}[\textbf{x}_a \textbf{x}_b^{\top}] - \boldsymbol{\mu}_a \boldsymbol{\mu}_b^{\top} \\ &= \mathbb{E}[(\Lambda_a \textbf{z} + \textbf{u}_a)(\Lambda_b \textbf{z} + \textbf{u}_b)^{\top}] - \boldsymbol{\mu}_a \boldsymbol{\mu}_b^{\top} \\ &= \mathbb{E}[ \Lambda_a \textbf{z} \textbf{z}^{\top} \Lambda_b^{\top} + \textbf{u}_a \textbf{z}^{\top} \Lambda_b^{\top} + \Lambda_a \textbf{z} \textbf{u}_b^{\top} + \textbf{u}_a \textbf{u}_b^{\top} ] - \boldsymbol{\mu}_a \boldsymbol{\mu}_b^{\top} \\ &= \underbrace{\Lambda_a \mathbb{E}[\textbf{z} \textbf{z}^{\top}] \Lambda_b^{\top}}_{\Lambda_a I_r \Lambda_b^{\top}} + \underbrace{\mathbb{E}[\textbf{u}_a] \mathbb{E}[\textbf{z}^{\top}] \Lambda_b^{\top}}_{0} + \underbrace{\Lambda_a \mathbb{E}[\textbf{z}] \mathbb{E}[\textbf{u}_b^{\top}]}_{0} + \underbrace{\mathbb{E}[\textbf{u}_a \textbf{u}_b^{\top}]}_{\text{cov}(\textbf{u}_a, \textbf{u}_b) + \boldsymbol{\mu}_a \boldsymbol{\mu}_b^{\top}} - \boldsymbol{\mu}_a \boldsymbol{\mu}_b^{\top} \\ &= \Lambda_a \Lambda_b^{\top} \end{aligned}$$

Since $\textbf{u}_a$ and $\textbf{u}_b$ are independent (the noise covariance is block-diagonal), $\text{cov}(\textbf{u}_a, \textbf{u}_b) = 0$, which gives the final line. Thus, the full covariance matrix for $\textbf{x}$ is:

$$\Sigma = \begin{bmatrix} \Lambda_a \Lambda_a^{\top} + \Psi_a & \Lambda_a \Lambda_b^{\top} \\ \Lambda_b \Lambda_a^{\top} & \Lambda_b \Lambda_b^{\top} + \Psi_b \end{bmatrix}$$

  1. Bach, F. R., & Jordan, M. I. (2005). A probabilistic interpretation of canonical correlation analysis.
  2. Klami, A., Virtanen, S., Leppäaho, E., & Kaski, S. (2015). Group factor analysis. IEEE Transactions on Neural Networks and Learning Systems, 26(9), 2136–2147.