Probabilistic Canonical Correlation Analysis in Detail

Probabilistic canonical correlation analysis is a reinterpretation of CCA as a latent variable model, which has benefits such as generative modeling, handling uncertainty, and composability. I define and derive its solution in detail.

Standard CCA

Canonical correlation analysis (CCA) is a multivariate statistical method for finding two linear projections, one for each set of observations in a paired dataset, such that the projected data points are maximally correlated. For a thorough explanation, please see my previous post.

I will present an abbreviated explanation here for completeness and to establish notation. Let $X_a \in \mathbb{R}^{n \times p}$ and $X_b \in \mathbb{R}^{n \times q}$ be two datasets with $n$ samples each and dimensionality $p$ and $q$ respectively. Let $\textbf{w}_a \in \mathbb{R}^{p}$ and $\textbf{w}_b \in \mathbb{R}^{q}$ be two linear projections and $\textbf{z}_a = X_a \textbf{w}_a$ and $\textbf{z}_b = X_b \textbf{w}_b$ be a pair of $n$-dimensional “canonical variables”. Then the CCA objective is:

$$\textbf{w}_a^{\star}, \textbf{w}_b^{\star} = \arg\!\max_{ \textbf{w}_a, \textbf{w}_b } \{ \textbf{w}_a^{\top} X_a^{\top} X_b \textbf{w}_b \}$$

With the constraints that:

$$\lVert \textbf{z}_a \rVert_2 = \sqrt{\textbf{w}_a^{\top} X_a^{\top} X_a \textbf{w}_a} = 1 \qquad \lVert \textbf{z}_b \rVert_2 = \sqrt{\textbf{w}_b^{\top} X_b^{\top} X_b \textbf{w}_b} = 1$$

Since $\textbf{w}_a^{\top} X_a^{\top} X_b \textbf{w}_b = \textbf{z}_a^{\top} \textbf{z}_b$, the objective can be visualized as finding linear projections $\textbf{w}_a$ and $\textbf{w}_b$ such that $\textbf{z}_a$ and $\textbf{z}_b$ are close to each other on the unit sphere in $\mathbb{R}^n$. If we find $r$ such projections where $r \leq \min(p, q)$, then we can embed our two datasets into $r$-dimensional space.
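To make this concrete, here is a minimal NumPy sketch of one standard route to the solution (whitening each view, then taking an SVD of the whitened cross-covariance). The function names and the assumption that the inputs are mean-centered are mine; treat this as a sketch, not a reference implementation:

```python
import numpy as np

def inv_sqrt(C):
    """Inverse matrix square root of a symmetric positive definite matrix."""
    vals, vecs = np.linalg.eigh(C)
    return vecs @ np.diag(1.0 / np.sqrt(vals)) @ vecs.T

def cca(Xa, Xb, r):
    """Sketch of standard CCA: top-r projections and canonical correlations.

    Assumes Xa (n x p) and Xb (n x q) are mean-centered.
    """
    n = Xa.shape[0]
    Caa = Xa.T @ Xa / n                     # covariance of view a
    Cbb = Xb.T @ Xb / n                     # covariance of view b
    Cab = Xa.T @ Xb / n                     # cross-covariance
    Caa_isqrt, Cbb_isqrt = inv_sqrt(Caa), inv_sqrt(Cbb)
    U, s, Vt = np.linalg.svd(Caa_isqrt @ Cab @ Cbb_isqrt)
    Wa = Caa_isqrt @ U[:, :r]               # projections for view a
    Wb = Cbb_isqrt @ Vt[:r].T               # projections for view b
    return Wa, Wb, s[:r]                    # s[:r] are the canonical correlations
```

Roughly, the whitening absorbs the normalization constraints, and the SVD then maximizes the resulting inner product, so the singular values are the canonical correlations.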

Probabilistic interpretation of CCA

A probabilistic interpretation of CCA (PCCA) is one in which our two datasets, $X_a$ and $X_b$, are viewed as two sets of $n$ observations of two random variables, $\textbf{x}_a$ and $\textbf{x}_b$, that are generated by a shared latent variable $\textbf{z}$. Rather than use linear algebra to set up an objective and then solve for two linear projections $\textbf{w}_a$ and $\textbf{w}_b$, we instead write down a model that captures these probabilistic relationships and fit its parameters by maximum likelihood. See my previous post on probabilistic machine learning if that statement is not clear. The model is:

$$\begin{aligned} \textbf{z} &\sim \mathcal{N}(0, I_r), \quad r \leq \min(p, q) \\ \textbf{x}_a \mid \textbf{z} &\sim \mathcal{N}(W_a \textbf{z} + \boldsymbol{\mu}_a, \Psi_a) \\ \textbf{x}_b \mid \textbf{z} &\sim \mathcal{N}(W_b \textbf{z} + \boldsymbol{\mu}_b, \Psi_b) \end{aligned} \tag{1}$$

Where $W_a \in \mathbb{R}^{p \times r}$ and $W_b \in \mathbb{R}^{q \times r}$ are two arbitrary matrices, and $\Psi_a$ and $\Psi_b$ are both positive semi-definite. Bach & Jordan (2005) proved that the resulting maximum likelihood estimates are equivalent, up to rotation and scaling, to CCA.
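To make the generative story concrete, here is a small sketch of ancestral sampling from Equation (1) in NumPy. The sizes, the zero means, and the diagonal choice for $\Psi_a$ and $\Psi_b$ are illustrative assumptions, not requirements of the model:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, q, r = 1000, 5, 4, 2                      # illustrative sizes

Wa = rng.normal(size=(p, r))                    # arbitrary loading matrices
Wb = rng.normal(size=(q, r))
Psi_a = np.diag(rng.uniform(0.5, 1.5, size=p))  # any PSD matrices work
Psi_b = np.diag(rng.uniform(0.5, 1.5, size=q))

# Ancestral sampling: draw z, then draw each view conditioned on z.
Z = rng.normal(size=(n, r))                     # z ~ N(0, I_r)
Xa = Z @ Wa.T + rng.multivariate_normal(np.zeros(p), Psi_a, size=n)
Xb = Z @ Wb.T + rng.multivariate_normal(np.zeros(q), Psi_b, size=n)
```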

It is worth being explicit about the differences between this probabilistic framing and the standard framing. In CCA, we take our data and perform matrix multiplications to get lower-dimensional representations $\textbf{z}_a$ and $\textbf{z}_b$:

$$\begin{aligned} \textbf{z}_a &= X_a \textbf{w}_a \\ \textbf{z}_b &= X_b \textbf{w}_b \end{aligned}$$

The objective is to find projections $\textbf{w}_a$ and $\textbf{w}_b$ such that $\textbf{z}_a$ and $\textbf{z}_b$ are maximally correlated.
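As a quick illustration, we can run the `cca` sketch from above on the sampled `Xa` and `Xb` (both assumed to still be in scope); the first pair of canonical variables should have a correlation approximately equal to the first canonical correlation, up to mean-centering effects:

```python
Wa_hat, Wb_hat, rho = cca(Xa, Xb, r=2)
za = Xa @ Wa_hat[:, 0]                     # first pair of canonical variables
zb = Xb @ Wb_hat[:, 0]
print(np.corrcoef(za, zb)[0, 1], rho[0])   # approximately equal
```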

But the probabilistic model, Equation (1), is a function of random variables. If we consider either $\textbf{x}_a$ or $\textbf{x}_b$ on its own, Equation (1) says it is generated as:

$$\textbf{x} = W \textbf{z} + \textbf{u} \tag{2}$$

Where $\textbf{u} \sim \mathcal{N}(\boldsymbol{\mu}, \Psi)$. If we assume our data is mean-centered, meaning $\boldsymbol{\mu} = 0$ (we make the same assumption in CCA), and rename $W$ to $\Lambda$, we can see that PCCA is just group factor analysis with two groups (Klami et al., 2015):

$$\begin{aligned} \textbf{x}_a &= \Lambda_a \textbf{z} + \textbf{u}_a \\ \textbf{x}_b &= \Lambda_b \textbf{z} + \textbf{u}_b \end{aligned} \tag{3}$$
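As a quick check that each view really is a factor analysis model on its own, marginalizing $\textbf{z}$ out of the first line of Equation (3) (an affine function of a Gaussian is Gaussian, and we assume mean-centered data) gives

$$p(\textbf{x}_a) = \int p(\textbf{x}_a \mid \textbf{z}) \, p(\textbf{z}) \, d\textbf{z} = \mathcal{N}(\textbf{0}, \Lambda_a \Lambda_a^{\top} + \Psi_a)$$

which is exactly the $\Sigma_{aa}$ block derived in the Appendix.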

This reframing is convenient because we can reuse the maximum likelihood updates we derived for factor analysis. To see this, let’s use block matrices to represent our data and parameters:

$$\textbf{x} = \begin{bmatrix} \textbf{x}_a \\ \textbf{x}_b \end{bmatrix}, \qquad \Lambda = \begin{bmatrix} \Lambda_a \\ \Lambda_b \end{bmatrix}, \qquad \textbf{u} = \begin{bmatrix} \textbf{u}_a \\ \textbf{u}_b \end{bmatrix}$$

Where $\textbf{x} \in \mathbb{R}^{p + q}$, $\Lambda \in \mathbb{R}^{(p+q) \times r}$, and $\textbf{u} \in \mathbb{R}^{p + q}$, and

$$\textbf{u} \sim \mathcal{N} \Big( \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} \Psi_a & 0 \\ 0 & \Psi_b \end{bmatrix} \Big)$$
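In code, this stacking is just concatenation. A minimal sketch, reusing the arrays `Xa`, `Xb`, `Wa`, `Wb`, `Psi_a`, and `Psi_b` from the sampling snippet above (after the renaming, `Wa` and `Wb` play the roles of $\Lambda_a$ and $\Lambda_b$):

```python
import numpy as np
from scipy.linalg import block_diag

X = np.hstack([Xa, Xb])          # each row is x_i = [x_a; x_b] in R^(p+q)
Lmbda = np.vstack([Wa, Wb])      # Lambda is (p+q) x r
Psi = block_diag(Psi_a, Psi_b)   # block-diagonal: no noise covariance across views
```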

Then our PCCA updates are identical to our updates for factor analysis:

$$\begin{aligned} \Lambda^{\star} &= \Big( \sum_{i=1}^{n} \textbf{x}_i \mathbb{E}_{\textbf{z} \mid \textbf{x}} \big[ \textbf{z} \mid \textbf{x}_i \big]^{\top} \Big) \Big( \sum_{i=1}^{n} \mathbb{E}_{\textbf{z} \mid \textbf{x}} \big[ \textbf{z} \textbf{z}^{\top} \mid \textbf{x}_i \big] \Big)^{-1} \\ \\ \Psi^{\star} &= \frac{1}{n} \text{diag}\Bigg( \sum_{i=1}^{n} \textbf{x}_i \textbf{x}_i^{\top} - \Lambda^{\star} \mathbb{E}_{\textbf{z} \mid \textbf{x}} \big[ \textbf{z} \mid \textbf{x}_i \big] \textbf{x}_i^{\top} \Bigg) \end{aligned}$$
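Here is a hedged NumPy sketch of these updates, run on the stacked matrix `X` from the previous snippet. The E-step uses the standard factor-analysis posterior moments of $\textbf{z}$ given $\textbf{x}_i$, and the $\text{diag}(\cdot)$ restricts $\Psi$ to be diagonal, as in factor analysis; Bach & Jordan's PCCA allows full within-view noise covariances, so treat this as a simplified sketch:

```python
import numpy as np

def pcca_em(X, r, n_iters=100, seed=0):
    """EM for PCCA via the stacked factor-analysis updates (a sketch).

    X: n x (p+q) matrix of stacked, mean-centered observations.
    Returns the loading matrix Lambda ((p+q) x r) and diagonal noise Psi.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    Lmbda = rng.normal(size=(d, r))
    Psi = np.ones(d)                                   # diagonal entries of Psi

    for _ in range(n_iters):
        # E-step: posterior moments of z given each x_i.
        Psi_inv_Lmbda = Lmbda / Psi[:, None]           # Psi^{-1} Lambda
        G = np.linalg.inv(np.eye(r) + Lmbda.T @ Psi_inv_Lmbda)  # posterior covariance
        Ez = X @ Psi_inv_Lmbda @ G                     # rows are E[z | x_i]
        Ezz_sum = n * G + Ez.T @ Ez                    # sum_i E[z z^T | x_i]

        # M-step: the Lambda* and Psi* updates above.
        Lmbda = (X.T @ Ez) @ np.linalg.inv(Ezz_sum)
        Psi = np.diag(X.T @ X - Lmbda @ (Ez.T @ X)) / n

    return Lmbda, Psi
```

Running `pcca_em(X, r=2)` on the samples generated earlier should recover the loadings only up to a rotation of the latent space, which is the same sense in which the maximum likelihood estimates match CCA.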

Furthermore, this definition gives us a joint density for $\textbf{x}$ that should look similar to the density $p(\textbf{x})$ in factor analysis:

$$p(\textbf{x}) = \mathcal{N} \Big( \begin{bmatrix} 0 \\ 0 \end{bmatrix} , \begin{bmatrix} \Lambda_a \Lambda_a^{\top} + \Psi_a & \Lambda_a \Lambda_b^{\top} \\ \Lambda_b \Lambda_a^{\top} & \Lambda_b \Lambda_b^{\top} + \Psi_b \end{bmatrix} \Big)$$

See the Appendix for a derivation.
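As a quick numerical check of this density, the empirical covariance of the stacked samples from earlier should approach the block matrix above (a sketch reusing `X`, `Lmbda`, and `Psi` from the stacking snippet):

```python
Sigma = Lmbda @ Lmbda.T + Psi               # model covariance: Lambda Lambda^T + Psi
Sigma_hat = X.T @ X / X.shape[0]            # empirical covariance (data ~ mean-centered)
print(np.max(np.abs(Sigma - Sigma_hat)))    # small for large n
```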

It’s worth thinking about how the properties of CCA are converted to probabilistic assumptions in PCCA. First, in CCA, $\textbf{z}_a$ and $\textbf{z}_b$ are a pair of embeddings that we correlate. The assumption is that both datasets have similar low-rank approximations. In PCCA, this property is modeled by having a shared latent variable $\textbf{z}$.

Furthermore, in CCA, we proved that the canonical variables are orthogonal. In PCCA, there is no such orthogonality constraint. Instead, we assume the latent variables are independent with an isotropic covariance matrix:

$$\textbf{z} \sim \mathcal{N}(0, I_r)$$

This independence assumption is the probabilistic equivalent of orthogonality. The covariance matrix of the latent variables is diagonal, meaning there is no covariance between the $i$-th and $j$-th latent variables for $i \neq j$.

The final constraint of the CCA objective is that the vectors have unit length. In probabilistic terms, this is analogous to unit variance, which we have since the identity matrix $I_r$ is an isotropic matrix with each diagonal term equal to $1$.

Code

For an implementation of PCCA, please see my GitHub repository of machine learning algorithms, specifically this file.


Appendix

1. Derivations for $\boldsymbol{\mu}_x$ and $\Sigma_x$

Let’s solve for $\boldsymbol{\mu}_x$ and $\Sigma_x$ for the density:

$$p\Big(\begin{bmatrix} \textbf{x}_a \\ \textbf{x}_b \end{bmatrix}\Big) = \mathcal{N}(\boldsymbol{\mu}_x, \Sigma_x)$$

First, note that $\mathbb{E}[\textbf{z}] = 0$ and $\mathbb{E}[\textbf{u}] = \boldsymbol{\mu}$. Then:

$$\begin{aligned} \boldsymbol{\mu}_x = \mathbb{E}[\textbf{x}] &= \mathbb{E}[\Lambda \textbf{z} + \textbf{u}] \\ &= \Lambda \mathbb{E}[\textbf{z}] + \mathbb{E}[\textbf{u}] \\ &= \boldsymbol{\mu} \\ &= \begin{bmatrix} \boldsymbol{\mu}_a \\ \boldsymbol{\mu}_b \end{bmatrix} \end{aligned}$$

If the data are mean-centered, as we assume, then:

$$\begin{bmatrix} \boldsymbol{\mu}_a \\ \boldsymbol{\mu}_b \end{bmatrix} = \begin{bmatrix} \textbf{0} \\ \textbf{0} \end{bmatrix}$$

To understand the covariance matrix $\Sigma_x$,

$$\Sigma_x = \begin{bmatrix} \Sigma_{aa} & \Sigma_{ab} \\ \Sigma_{ba} & \Sigma_{bb} \end{bmatrix}$$

let’s consider $\Sigma_{aa}$ and $\Sigma_{ab}$. The remaining blocks, $\Sigma_{bb}$ and $\Sigma_{ba}$, have identical derivations with the variables renamed. Both derivations will use the fact that $\mathbb{E}[XY] = \mathbb{E}[X] \cdot \mathbb{E}[Y]$ if $X$ and $Y$ are independent. First, let’s consider $\Sigma_{aa}$:

$$\begin{aligned} \Sigma_{aa} &= \mathbb{E}[(\textbf{x}_a - \boldsymbol{\mu}_a)(\textbf{x}_a - \boldsymbol{\mu}_a)^{\top}] \\ &= \mathbb{E}[\textbf{x}_a \textbf{x}_a^{\top}] - \boldsymbol{\mu}_a \boldsymbol{\mu}_a^{\top} \\ &= \mathbb{E}[(\Lambda_a \textbf{z} + \textbf{u}_a)(\Lambda_a \textbf{z} + \textbf{u}_a)^{\top}] - \boldsymbol{\mu}_a \boldsymbol{\mu}_a^{\top} \\ &= \mathbb{E}[ \Lambda_a \textbf{z} \textbf{z}^{\top} \Lambda_a^{\top} + \textbf{u}_a \textbf{z}^{\top} \Lambda_a^{\top} + \Lambda_a \textbf{z} \textbf{u}_a^{\top} + \textbf{u}_a \textbf{u}_a^{\top} ] - \boldsymbol{\mu}_a \boldsymbol{\mu}_a^{\top} \\ &= \underbrace{\Lambda_a \mathbb{E}[\textbf{z} \textbf{z}^{\top}] \Lambda_a^{\top}}_{\Lambda_a I_r \Lambda_a^{\top}} + \underbrace{\mathbb{E}[\textbf{u}_a] \mathbb{E}[\textbf{z}^{\top}] \Lambda_a^{\top}}_{0} + \underbrace{\Lambda_a \mathbb{E}[\textbf{z}] \mathbb{E}[\textbf{u}_a^{\top}]}_{0} + \underbrace{\mathbb{E}[\textbf{u}_a \textbf{u}_a^{\top}]}_{\Psi_a + \boldsymbol{\mu}_a \boldsymbol{\mu}_a^{\top}} - \boldsymbol{\mu}_a \boldsymbol{\mu}_a^{\top} \\ &= \Lambda_a \Lambda_a^{\top} + \Psi_a \end{aligned}$$

So for the two views, $a$ and $b$, we have:

$$\begin{aligned} \Sigma_{aa} &= \Lambda_a \Lambda_a^{\top} + \Psi_a \\ \Sigma_{bb} &= \Lambda_b \Lambda_b^{\top} + \Psi_b \end{aligned}$$

Now, let’s consider $\Sigma_{ab}$:

$$\begin{aligned} \Sigma_{ab} &= \mathbb{E}[(\textbf{x}_a - \boldsymbol{\mu}_a)(\textbf{x}_b - \boldsymbol{\mu}_b)^{\top}] \\ &= \mathbb{E}[\textbf{x}_a \textbf{x}_b^{\top}] - \boldsymbol{\mu}_a \boldsymbol{\mu}_b^{\top} \\ &= \mathbb{E}[(\Lambda_a \textbf{z} + \textbf{u}_a)(\Lambda_b \textbf{z} + \textbf{u}_b)^{\top}] - \boldsymbol{\mu}_a \boldsymbol{\mu}_b^{\top} \\ &= \mathbb{E}[ \Lambda_a \textbf{z} \textbf{z}^{\top} \Lambda_b^{\top} + \textbf{u}_a \textbf{z}^{\top} \Lambda_b^{\top} + \Lambda_a \textbf{z} \textbf{u}_b^{\top} + \textbf{u}_a \textbf{u}_b^{\top} ] - \boldsymbol{\mu}_a \boldsymbol{\mu}_b^{\top} \\ &= \underbrace{\Lambda_a \mathbb{E}[\textbf{z} \textbf{z}^{\top}] \Lambda_b^{\top}}_{\Lambda_a I_r \Lambda_b^{\top}} + \underbrace{\mathbb{E}[\textbf{u}_a] \mathbb{E}[\textbf{z}^{\top}] \Lambda_b^{\top}}_{0} + \underbrace{\Lambda_a \mathbb{E}[\textbf{z}] \mathbb{E}[\textbf{u}_b^{\top}]}_{0} + \underbrace{\mathbb{E}[\textbf{u}_a \textbf{u}_b^{\top}]}_{\text{cov}(\textbf{u}_a, \textbf{u}_b) + \boldsymbol{\mu}_a \boldsymbol{\mu}_b^{\top}} - \boldsymbol{\mu}_a \boldsymbol{\mu}_b^{\top} \\ &= \Lambda_a \Lambda_b^{\top} \end{aligned}$$

Since $\textbf{u}_a$ and $\textbf{u}_b$ are independent (the noise covariance is block-diagonal), $\text{cov}(\textbf{u}_a, \textbf{u}_b) = 0$, which gives the final line. Thus, the full covariance matrix for $\textbf{x}$ is:

$$\Sigma = \begin{bmatrix} \Lambda_a \Lambda_a^{\top} + \Psi_a & \Lambda_a \Lambda_b^{\top} \\ \Lambda_b \Lambda_a^{\top} & \Lambda_b \Lambda_b^{\top} + \Psi_b \end{bmatrix}$$

  1. Bach, F. R., & Jordan, M. I. (2005). A probabilistic interpretation of canonical correlation analysis.
  2. Klami, A., Virtanen, S., Leppäaho, E., & Kaski, S. (2015). Group factor analysis. IEEE Transactions on Neural Networks and Learning Systems, 26(9), 2136–2147.