A Unifying Review of EM for Gaussian Latent Factor Models

The expectation–maximization (EM) updates for several Gaussian latent factor models (factor analysis, probabilistic principal component analysis, probabilistic canonical correlation analysis, and inter-battery factor analysis) are closely related. I explore these relationships in detail.

In previous posts, I have discussed a number of Gaussian latent factor models such as factor analysis, probabilistic PCA, and probabilistic CCA. However, while I have mentioned connections between these models at various times, I have never formalized them. The goal of this post is to show how these models are closely related. First, I will derive expectation–maximization (EM) updates for factor analysis and then show how these are related to the EM updates for probabilistic PCA (Tipping & Bishop, 1999), probabilistic canonical correlation analysis (Bach & Jordan, 2005), and inter-battery factor analysis (Klami & Kaski, 2006).

The title of this post is a reference to Sam Roweis's excellent overview of linear-Gaussian models (Roweis & Ghahramani, 1999).

EM for factor analysis

The factor analysis model is

\begin{aligned} \mathbf{x}_n &= \mathbf{W} \mathbf{z}_n + \mathbf{u}_n, \\ \mathbf{z}_n &\sim \mathcal{N}_K(\mathbf{0}, \mathbf{I}), \\ \mathbf{u}_n &\sim \mathcal{N}_P(\mathbf{0}, \boldsymbol{\Psi}), \end{aligned} \tag{1}

where \mathbf{x}_n \in \mathbb{R}^P, \mathbf{z}_n \in \mathbb{R}^K, \mathbf{W} \in \mathbb{R}^{P \times K}, and \boldsymbol{\Psi} \in \mathbb{R}^{P \times P}, a diagonal matrix. We can also write this in vectorized notation as

\mathbf{X} = \mathbf{W}\mathbf{Z} + \mathbf{U}, \tag{2}

where \mathbf{X} \in \mathbb{R}^{P \times N} and \mathbf{Z} \in \mathbb{R}^{K \times N}. In words, we assume our P-dimensional observations are linear functions of K-dimensional latent variables plus some additive Gaussian noise. The Gaussian assumptions on \mathbf{z} and \mathbf{u} imply that our data are Gaussian distributed. Since \boldsymbol{\Psi} is diagonal, we assume the noise is uncorrelated across the P observed dimensions.
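
To make the model concrete, here is a minimal NumPy sketch (my own illustration, not code from any of the referenced papers) that simulates data from Eqs. 1–2. The sizes P, K, N and the parameter values are arbitrary choices.

```python
import numpy as np

# Simulate data from the factor analysis generative model (Eqs. 1-2).
rng = np.random.default_rng(0)
P, K, N = 5, 2, 1000                                 # observed dim, latent dim, samples

W_true = rng.normal(size=(P, K))                     # loadings W
Psi_true = np.diag(rng.uniform(0.1, 0.5, size=P))    # diagonal noise covariance Psi

Z = rng.normal(size=(K, N))                          # z_n ~ N_K(0, I), stacked columnwise
U = rng.multivariate_normal(np.zeros(P), Psi_true, size=N).T  # u_n ~ N_P(0, Psi)
X = W_true @ Z + U                                   # X = WZ + U, shape (P, N)
```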

To compute the EM updates, we first write down the expected complete log likelihood, keeping only terms that depend on the parameters \boldsymbol{\theta} = \{\mathbf{W}, \boldsymbol{\Psi}\} that we are maximizing, where the expectation is taken w.r.t. the posterior p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta}):

\begin{aligned} Q &= \mathbb{E} \Big[ -\frac{1}{2} \sum_{n=1}^{N}(\mathbf{x}_n - \mathbf{W} \mathbf{z}_n)^{\top} \boldsymbol{\Psi}^{-1} (\mathbf{x}_n - \mathbf{W} \mathbf{z}_n) - \frac{N}{2} \ln |\boldsymbol{\Psi}| + C \Big] \\ &\propto -\frac{1}{2} \sum_{n=1}^{N} \mathbb{E} \big[ \mathbf{x}_n^{\top} \boldsymbol{\Psi}^{-1} \mathbf{x}_n - 2 \mathbf{z}_n^{\top} \mathbf{W}^{\top} \boldsymbol{\Psi}^{-1} \mathbf{x}_n + \mathbf{z}_n^{\top} \mathbf{W}^{\top} \boldsymbol{\Psi}^{-1} \mathbf{W} \mathbf{z}_n \big] - \frac{N}{2} \ln |\boldsymbol{\Psi}| \\ &= -\frac{1}{2} \sum_{n=1}^{N} \Big[ \underbrace{\vphantom{\Big[}\mathbf{x}_n^{\top} \boldsymbol{\Psi}^{-1} \mathbf{x}_n}_{A} - 2 \underbrace{\vphantom{\Big[}\mathbb{E}[\mathbf{z}_n]^{\top} \mathbf{W}^{\top} \boldsymbol{\Psi}^{-1} \mathbf{x}_n}_{B} + \underbrace{\vphantom{\Big[}\text{tr}\big( \mathbf{W}^{\top} \boldsymbol{\Psi}^{-1} \mathbf{W} \mathbb{E}[ \mathbf{z}_n \mathbf{z}_n^{\top}] \big)}_{C} \Big] - \underbrace{\vphantom{\Big[}\frac{N}{2} \ln |\boldsymbol{\Psi}|}_{D}. \end{aligned} \tag{3}

See my post on EM if the big picture does not make sense. To compute EM updates for either \mathbf{W} or \boldsymbol{\Psi}, we compute the derivative of Eq. 3 w.r.t. the parameter, set the equation equal to zero, and solve for the parameter. We can take derivatives of the terms labeled A, B, C, and D in Eq. 3 w.r.t. both parameters using the excellent Matrix Cookbook,

\begin{aligned} \frac{\partial B}{\partial \mathbf{W}} &= \boldsymbol{\Psi}^{-1} \mathbf{x}_n \mathbb{E}[\mathbf{z}_n]^{\top}, && \text{MC.71} \\ \frac{\partial C}{\partial \mathbf{W}} &= 2 \boldsymbol{\Psi}^{-1} \mathbf{W} \mathbb{E}[\mathbf{z}_n \mathbf{z}_n^{\top}], && \text{MC.117} \\ \\ \frac{\partial A}{\partial \boldsymbol{\Psi}^{-1}} &= \mathbf{x}_n \mathbf{x}_n^{\top}, && \text{MC.72} \\ \frac{\partial B}{\partial \boldsymbol{\Psi}^{-1}} &= \mathbf{x}_n \mathbb{E}[\mathbf{z}_n]^{\top} \mathbf{W}^{\top}, && \text{MC.70} \\ \frac{\partial C}{\partial \boldsymbol{\Psi}^{-1}} &= \mathbf{W} \mathbb{E}[\mathbf{z}_n \mathbf{z}_n^{\top}] \mathbf{W}^{\top}, && \text{MC.102} \\ \frac{\partial D}{\partial \boldsymbol{\Psi}^{-1}} &= -\frac{N}{2} \boldsymbol{\Psi}. && \text{MC.56} \end{aligned} \tag{4}

MC.X denotes Eq. X in the Matrix Cookbook. First, we can solve for the optimal \mathbf{W}^{\star}:

\begin{aligned} \frac{\partial Q}{\partial \mathbf{W}} &= -\sum_{n=1}^{N} \boldsymbol{\Psi}^{-1} \mathbf{x}_n \mathbb{E}[\mathbf{z}_n]^{\top} + \sum_{n=1}^{N} \boldsymbol{\Psi}^{-1} \mathbf{W} \mathbb{E}[\mathbf{z}_n \mathbf{z}_n^{\top}], \\ &\Downarrow \\ \mathbf{W}^{\star} &= \Big(\sum_{n=1}^{N} \mathbf{x}_n \mathbb{E}[\mathbf{z}_n]^{\top} \Big) \Big( \sum_{n=1}^{N} \mathbb{E}[\mathbf{z}_n \mathbf{z}_n^{\top}] \Big)^{-1}. \end{aligned} \tag{5}

Next, we plug \mathbf{W}^{\star} into the expected complete log likelihood in Eq. 3 and solve for \boldsymbol{\Psi}:

\begin{aligned} \frac{\partial Q}{\partial \boldsymbol{\Psi}^{-1}} &= -\frac{1}{2} \sum_{n=1}^{N} \Big[ \mathbf{x}_n \mathbf{x}_n^{\top} - 2 \mathbf{x}_n \mathbb{E}[\mathbf{z}_n]^{\top} [\mathbf{W}^{\star}]^{\top} + \mathbf{W}^{\star} \mathbb{E}[\mathbf{z}_n \mathbf{z}_n^{\top}] [\mathbf{W}^{\star}]^{\top} \Big] + \frac{N}{2} \boldsymbol{\Psi}, \\ &\Downarrow \\ \boldsymbol{\Psi}^{\star} &= \frac{1}{N} \sum_{n=1}^{N} \Big[ \mathbf{x}_n \mathbf{x}_n^{\top} - 2 \mathbf{x}_n \mathbb{E}[\mathbf{z}_n]^{\top} [\mathbf{W}^{\star}]^{\top} + \mathbf{W}^{\star} \mathbb{E}[\mathbf{z}_n \mathbf{z}_n^{\top}] [\mathbf{W}^{\star}]^{\top} \Big] \\ &= \frac{1}{N} \sum_{n=1}^{N} \mathbf{x}_n \mathbf{x}_n^{\top} - \frac{1}{N} \sum_{n=1}^{N} 2 \mathbf{x}_n \mathbb{E}[\mathbf{z}_n]^{\top} [\mathbf{W}^{\star}]^{\top} + \frac{1}{N} \sum_{n=1}^{N} \mathbf{W}^{\star} \mathbb{E}[\mathbf{z}_n \mathbf{z}_n^{\top}] [\mathbf{W}^{\star}]^{\top}. \end{aligned} \tag{6}

We can simplify Eq. 6 with the following observation based on the equality of Eq. 5:

\frac{1}{N} \sum_{n=1}^N \mathbf{W}^{\star} \mathbb{E}[\mathbf{z}_n \mathbf{z}_n^{\top}] [\mathbf{W}^{\star}]^{\top} = \frac{1}{N} \sum_{n=1}^{N} \mathbf{x}_n \mathbb{E}[\mathbf{z}_n]^{\top} [\mathbf{W}^{\star}]^{\top}. \tag{7}

So we can simplify Eq. 6 as

\boldsymbol{\Psi}^{\star} = \frac{1}{N} \sum_{n=1}^{N} \mathbf{x}_n \mathbf{x}_n^{\top} - \frac{1}{N} \sum_{n=1}^{N} \mathbf{x}_n \mathbb{E}[\mathbf{z}_n]^{\top} [\mathbf{W}^{\star}]^{\top}. \tag{8}

To derive the posterior moments in Eq. 4, recall that the Gaussian assumptions on both \mathbf{z} and \mathbf{u} induce a conditionally Gaussian assumption on \mathbf{x},

\mathbf{x} \mid \mathbf{z} \sim \mathcal{N}_P(\mathbf{W} \mathbf{z}, \boldsymbol{\Psi}). \tag{9}

Since our prior on \mathbf{z} is also Gaussian, the posterior is Gaussian,

\mathbf{z} \mid \mathbf{x} \sim \mathcal{N}_K(\mathbf{M} \mathbf{W}^{\top} \boldsymbol{\Psi}^{-1} \mathbf{x}, \mathbf{M}), \tag{10}

where

\mathbf{M} = (\mathbf{I} + \mathbf{W}^{\top}\boldsymbol{\Psi}^{-1}\mathbf{W})^{-1}. \tag{11}

See Sec. 2.3.3 in (Bishop, 2006) for details. We can use the covariance and mean to solve for the second moment:

\mathbb{E}[\mathbf{z} \mathbf{z}^{\top}] = \mathbf{M} + \mathbb{E}[\mathbf{z}] \mathbb{E}[\mathbf{z}]^{\top}, \tag{12}

where the first moment is clearly

\mathbb{E}[\mathbf{z}] = \mathbf{M} \mathbf{W}^{\top}\boldsymbol{\Psi}^{-1}\mathbf{x}. \tag{13}

For the curious, these posterior moments are equivalent to those in my previous post on factor analysis, albeit in a different and vectorized form. See A1 for a proof of this claim. Finally, let's define the vectorized quantity \mathbf{S} as

\begin{aligned} \mathbf{S} &= \left[\begin{array}{c|c|c} \mathbb{E}[\mathbf{z}_1] & \dots & \mathbb{E}[\mathbf{z}_N] \end{array}\right] \\ &= \mathbf{M} \mathbf{W}^{\top}\boldsymbol{\Psi}^{-1} \mathbf{X}. \end{aligned} \tag{14}

So \mathbf{S} \in \mathbb{R}^{K \times N} is a matrix of first posterior moments. The EM updates in Eq. 5 and 8 match those given in (Ghahramani & Hinton, 1996). However, we can vectorize them as follows:

\begin{aligned} \mathbf{W}^{\star} &= \Big(\sum_{n=1}^{N} \mathbf{x}_n \mathbb{E}[\mathbf{z}_n]^{\top} \Big) \Big( \sum_{n=1}^{N} \mathbb{E}[\mathbf{z}_n \mathbf{z}_n^{\top}] \Big)^{-1} \\ &= \Big(\sum_{n=1}^{N} \mathbf{x}_n \mathbf{x}_n^{\top} \boldsymbol{\Psi}^{-1} \mathbf{W} \mathbf{M} \Big) \Big( \sum_{n=1}^{N} \Big[ \mathbf{M} + \mathbf{M}\mathbf{W}^{\top}\boldsymbol{\Psi}^{-1} \mathbf{x}_n \mathbf{x}_n^{\top} \boldsymbol{\Psi}^{-1} \mathbf{W} \mathbf{M} \Big] \Big)^{-1} \\ &= \mathbf{X}\mathbf{S}^{\top} (N \mathbf{M} + \mathbf{S}\mathbf{S}^{\top})^{-1}, \\ \\ \boldsymbol{\Psi}^{\star} &= \frac{1}{N} \sum_{n=1}^{N} \mathbf{x}_n \mathbf{x}_n^{\top} - \frac{1}{N} \sum_{n=1}^{N} \mathbf{x}_n \mathbb{E}[\mathbf{z}_n]^{\top} [\mathbf{W}^{\star}]^{\top} \\ &= \tilde{\boldsymbol{\Sigma}} - \frac{1}{N} \sum_{n=1}^{N} \mathbf{x}_n \mathbf{x}_n^{\top} \boldsymbol{\Psi}^{-1} \mathbf{W} \mathbf{M} [\mathbf{W}^{\star}]^{\top} \\ &= \tilde{\boldsymbol{\Sigma}} - \tilde{\boldsymbol{\Sigma}} \boldsymbol{\Psi}^{-1} \mathbf{W} \mathbf{M} [\mathbf{W}^{\star}]^{\top}, \end{aligned} \tag{15}

where \tilde{\boldsymbol{\Sigma}} = N^{-1} \mathbf{X} \mathbf{X}^{\top}. Throughout this post, I repeatedly use the fact that the sum of outer products is equal to a matrix-matrix multiplication. In summary, we can write the EM algorithm for factor analysis as:

E-step:

\begin{aligned} \mathbf{M} &= (\mathbf{I} + \mathbf{W}^{\top}\boldsymbol{\Psi}^{-1}\mathbf{W})^{-1}, \\ \mathbf{S} &= \mathbf{M} \mathbf{W}^{\top}\boldsymbol{\Psi}^{-1} \mathbf{X}, \\ \mathbf{A} &= N \mathbf{M} + \mathbf{S} \mathbf{S}^{\top}. \end{aligned} \tag{16}

M-step:

\begin{aligned} \mathbf{W}^{\star} &= \mathbf{X} \mathbf{S}^{\top} \mathbf{A}^{-1}, \\ \boldsymbol{\Psi}^{\star} &= \text{diag} \left( \tilde{\boldsymbol{\Sigma}} - \tilde{\boldsymbol{\Sigma}} \boldsymbol{\Psi}^{-1} \mathbf{W} \mathbf{M} [\mathbf{W}^{\star}]^{\top} \right). \end{aligned} \tag{17}

We need the \text{diag}(\cdot) operator when finding the optimal value of \boldsymbol{\Psi} because we only care about its diagonal elements. In other words, this is a logical rather than a linear-algebraic constraint.
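
To make the algorithm concrete, here is a minimal NumPy sketch (mine, not Ghahramani and Hinton's implementation) of one EM iteration implementing Eqs. 16–17. The usage lines at the bottom reuse the simulated X, P, and K from the earlier snippet.

```python
import numpy as np

def fa_em_step(X, W, Psi):
    """One EM iteration for factor analysis (E-step, Eq. 16; M-step, Eq. 17)."""
    P, N = X.shape
    K = W.shape[1]
    Sigma = (X @ X.T) / N                            # sample second-moment matrix

    # E-step (Eq. 16).
    Psi_inv = np.diag(1.0 / np.diag(Psi))
    M = np.linalg.inv(np.eye(K) + W.T @ Psi_inv @ W)
    S = M @ W.T @ Psi_inv @ X                        # first posterior moments, (K, N)
    A = N * M + S @ S.T                              # summed second posterior moments

    # M-step (Eq. 17).
    W_new = X @ S.T @ np.linalg.inv(A)
    Psi_new = np.diag(np.diag(Sigma - Sigma @ Psi_inv @ W @ M @ W_new.T))
    return W_new, Psi_new

# Usage, reusing X, P, K from the simulation above.
rng = np.random.default_rng(1)
W, Psi = rng.normal(size=(P, K)), np.eye(P)
for _ in range(200):
    W, Psi = fa_em_step(X, W, Psi)
```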

EM for probabilistic PCA

We will now see that the EM updates for probabilistic PCA are closely related. The generative model for probabilistic PCA is

\begin{aligned} \mathbf{x}_n &= \mathbf{W} \mathbf{z}_n + \mathbf{u}_n, \\ \mathbf{z}_n &\sim \mathcal{N}_K(\mathbf{0}, \mathbf{I}), \\ \mathbf{u}_n &\sim \mathcal{N}_P(\mathbf{0}, \sigma^2 \mathbf{I}). \end{aligned} \tag{18}

The only difference between Eq. 18 and Eq. 1 is that the noise covariance in probabilistic PCA is isotropic. Therefore, the update for \mathbf{W} is exactly the same as in factor analysis:

\mathbf{W}^{\star} = \mathbf{X} \mathbf{S}^{\top} \mathbf{A}^{-1}. \tag{19}

See A2 for a proof that this is equivalent to Tipping's EM update for \mathbf{W}. However, we need a slightly different derivation for \sigma^2, since it is a scalar. In probabilistic PCA, the expected complete log likelihood is

Q = \sum_{n=1}^N \Big[-\frac{1}{2 \sigma^2} \mathbf{x}_n^{\top} \mathbf{x}_n + \frac{1}{\sigma^2} \mathbb{E}[\mathbf{z}_n]^{\top} \mathbf{W}^{\top} \mathbf{x}_n - \frac{1}{2 \sigma^2} \text{tr} \Big( \mathbf{W}^{\top} \mathbf{W} \mathbb{E}[\mathbf{z}_n \mathbf{z}_n^{\top}]\Big) - \frac{P}{2} \log \sigma^2 \Big]. \tag{20}

Let's take the derivative of Q w.r.t. \sigma^2, set it equal to zero, and solve for \sigma^2, using the optimal value of \mathbf{W}:

(\sigma^2)^{\star} = \frac{1}{NP} \sum_{n=1}^N \Big[\mathbf{x}_n^{\top} \mathbf{x}_n - 2 \mathbb{E}[\mathbf{z}_n]^{\top} [\mathbf{W}^{\star}]^{\top} \mathbf{x}_n + \text{tr} \Big( [\mathbf{W}^{\star}]^{\top} \mathbf{W}^{\star} \mathbb{E}[\mathbf{z}_n \mathbf{z}_n^{\top}]\Big) \Big]. \tag{21}

Notice that

\begin{aligned} \sum_{n=1}^{N} \text{tr}([\mathbf{W}^{\star}]^{\top} \mathbf{W}^{\star} \mathbb{E}[\mathbf{z}_n \mathbf{z}_n^{\top}]) &= \text{tr}([\mathbf{W}^{\star}]^{\top} \mathbf{W}^{\star} \sum_{n=1}^{N} \mathbb{E}[\mathbf{z}_n \mathbf{z}_n^{\top}]) \\ &= \text{tr}([\mathbf{W}^{\star}]^{\top} \mathbf{W}^{\star} \mathbf{A}) \\ &= \text{tr}([\mathbf{W}^{\star}]^{\top} \mathbf{X} \mathbf{S}^{\top}), \\ \\ \sum_{n=1}^{N} \mathbb{E}[\mathbf{z}_n]^{\top} [\mathbf{W}^{\star}]^{\top} \mathbf{x}_n &= \sum_{n=1}^{N} \text{tr}(\mathbf{x}_n \mathbb{E}[\mathbf{z}_n]^{\top} [\mathbf{W}^{\star}]^{\top}) \\ &= \text{tr} \Big(\sum_{n=1}^{N} \mathbf{x}_n \mathbb{E}[\mathbf{z}_n]^{\top} [\mathbf{W}^{\star}]^{\top} \Big) \\ &= \text{tr}([\mathbf{W}^{\star}]^{\top} \mathbf{X} \mathbf{S}^{\top}). \end{aligned} \tag{22}

So we can write the optimal value for \sigma^2 as

\begin{aligned} (\sigma^2)^{\star} &= \frac{1}{NP} \sum_{n=1}^N \Big\{\mathbf{x}_n^{\top} \mathbf{x}_n - 2 \mathbb{E}[\mathbf{z}_n]^{\top} [\mathbf{W}^{\star}]^{\top} \mathbf{x}_n + \text{tr} \Big( [\mathbf{W}^{\star}]^{\top} \mathbf{W}^{\star} \mathbb{E}[\mathbf{z}_n \mathbf{z}_n^{\top}]\Big) \Big\} \\ &= \frac{1}{NP} \Big\{ \text{tr}(\mathbf{XX}^{\top}) - \text{tr}([\mathbf{W}^{\star}]^{\top} \mathbf{X} \mathbf{S}^{\top}) \Big\} \\ &= \frac{1}{P} \text{tr}\Big(\tilde{\boldsymbol{\Sigma}} - \tilde{\boldsymbol{\Sigma}} \mathbf{W} \frac{1}{\sigma^2} \mathbf{M} [\mathbf{W}^{\star}]^{\top} \Big). \end{aligned} \tag{23}

This matches the EM updates in (Tipping & Bishop, 1999).
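
As a sketch (again mine, not Tipping and Bishop's code), the probabilistic PCA iteration differs from the factor analysis one above only in how the noise term enters; X is a P × N data matrix as before.

```python
import numpy as np

def ppca_em_step(X, W, sigma2):
    """One EM iteration for probabilistic PCA (Eq. 19 for W, Eq. 23 for sigma^2)."""
    P, N = X.shape
    K = W.shape[1]
    Sigma = (X @ X.T) / N

    # E-step: factor analysis with Psi = sigma^2 * I.
    M = np.linalg.inv(np.eye(K) + (W.T @ W) / sigma2)
    S = (M @ W.T @ X) / sigma2
    A = N * M + S @ S.T

    # M-step.
    W_new = X @ S.T @ np.linalg.inv(A)
    sigma2_new = np.trace(Sigma - Sigma @ W @ M @ W_new.T / sigma2) / P
    return W_new, sigma2_new
```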

EM for probabilistic CCA

Next, we'll see that the EM updates for probabilistic CCA are essentially the updates for factor analysis, with some block structure enforced. The generative model for probabilistic CCA is

\begin{aligned} \mathbf{z} &\sim \mathcal{N}_K(\mathbf{0}, \mathbf{I}), \quad K \leq \min(P_1, P_2), \\ \mathbf{x}_1 \mid \mathbf{z} &\sim \mathcal{N}_{P_1}(\mathbf{W}_1 \mathbf{z}, \boldsymbol{\Psi}_1), \\ \mathbf{x}_2 \mid \mathbf{z} &\sim \mathcal{N}_{P_2}(\mathbf{W}_2 \mathbf{z}, \boldsymbol{\Psi}_2). \end{aligned} \tag{24}

Note that in probabilistic CCA, the covariance matrices \boldsymbol{\Psi}_1 and \boldsymbol{\Psi}_2 are not constrained to be diagonal. In my post on probabilistic CCA, I observed that if we appropriately tiled the data,

\mathbf{x} = \begin{bmatrix} \mathbf{x}_1 \\ \mathbf{x}_2 \end{bmatrix}, \qquad \mathbf{W} = \begin{bmatrix} \mathbf{W}_1 \\ \mathbf{W}_2 \end{bmatrix}, \qquad \boldsymbol{\Psi} = \begin{bmatrix} \boldsymbol{\Psi}_1 & 0 \\ 0 & \boldsymbol{\Psi}_2 \end{bmatrix}, \quad \tilde{\boldsymbol{\Sigma}} = \begin{bmatrix} \tilde{\boldsymbol{\Sigma}}_{11} & \tilde{\boldsymbol{\Sigma}}_{12} \\ \tilde{\boldsymbol{\Sigma}}_{21} & \tilde{\boldsymbol{\Sigma}}_{22} \end{bmatrix}, \tag{25}

where P = P_1 + P_2, \mathbf{x} \in \mathbb{R}^{P}, and \mathbf{W} \in \mathbb{R}^{P \times K}, then the EM updates for probabilistic CCA in (Bach & Jordan, 2005) would be equivalent to those of factor analysis. However, I did not explicitly show this equivalence, and there are subtleties I ignored. Let's make the connection explicit now. The optimal update for \mathbf{W}, again given by our factor analysis update in Eq. 17, is

\begin{aligned} \mathbf{W}^{\star} &= \mathbf{X} \mathbf{S}^{\top} \mathbf{A}^{-1} \\ &= \mathbf{X} (\mathbf{M} \mathbf{W}^{\top}\boldsymbol{\Psi}^{-1} \mathbf{X})^{\top} (N \mathbf{M} + \mathbf{S} \mathbf{S}^{\top})^{-1} \\ &= \mathbf{X} \mathbf{X}^{\top} \boldsymbol{\Psi}^{-1} \mathbf{W} \mathbf{M} \Big(N \mathbf{M} + \mathbf{M} \mathbf{W}^{\top}\boldsymbol{\Psi}^{-1} \mathbf{X} \mathbf{X}^{\top} \boldsymbol{\Psi}^{-1} \mathbf{W} \mathbf{M} \Big)^{-1} \\ &= \frac{1}{N} \mathbf{X} \mathbf{X}^{\top} \boldsymbol{\Psi}^{-1} \mathbf{W} \mathbf{M} \Big(\mathbf{M} + \mathbf{M} \mathbf{W}^{\top}\boldsymbol{\Psi}^{-1} \frac{1}{N} \mathbf{X} \mathbf{X}^{\top} \boldsymbol{\Psi}^{-1} \mathbf{W} \mathbf{M} \Big)^{-1} \\ &= \tilde{\boldsymbol{\Sigma}} \boldsymbol{\Psi}^{-1} \mathbf{W} \mathbf{M} \Big(\mathbf{M} + \mathbf{M} \mathbf{W}^{\top}\boldsymbol{\Psi}^{-1} \tilde{\boldsymbol{\Sigma}} \boldsymbol{\Psi}^{-1} \mathbf{W} \mathbf{M} \Big)^{-1}. \end{aligned} \tag{26}

In Eq. 26, we use the fact that (c \mathbf{A})^{-1} = c^{-1} \mathbf{A}^{-1} for any nonzero scalar c. This works because the block structure of \boldsymbol{\Psi} ensures we update each block of \mathbf{W} correctly, e.g.:

\overbrace{\vphantom{\Bigg[}\begin{bmatrix} \mathbf{W}_1^{\top} & \mathbf{W}_2^{\top} \end{bmatrix}}^{\mathbf{W}^{\top}} \overbrace{\vphantom{\Bigg[}\begin{bmatrix} \boldsymbol{\Psi}_1 & \mathbf{0} \\ \mathbf{0} & \boldsymbol{\Psi}_2 \end{bmatrix}}^{\boldsymbol{\Psi}} = \begin{bmatrix} \mathbf{W}_1^{\top} \boldsymbol{\Psi}_1 & \mathbf{W}_2^{\top} \boldsymbol{\Psi}_2 \end{bmatrix}. \tag{27}

The update for \boldsymbol{\Psi} is trickier only because we need to maintain the block-diagonal structure. For \boldsymbol{\Psi}_i^{\star} with i \in \{1, 2\} indexing each data modality, we take the i-th diagonal block of the factor analysis update:

\boldsymbol{\Psi}_i^{\star} = \big[ \tilde{\boldsymbol{\Sigma}} - \tilde{\boldsymbol{\Sigma}} \boldsymbol{\Psi}^{-1} \mathbf{W} \mathbf{M} [\mathbf{W}^{\star}]^{\top} \big]_{ii} = \tilde{\boldsymbol{\Sigma}}_{ii} - \big( \tilde{\boldsymbol{\Sigma}}_{i1} \boldsymbol{\Psi}_1^{-1} \mathbf{W}_1 + \tilde{\boldsymbol{\Sigma}}_{i2} \boldsymbol{\Psi}_2^{-1} \mathbf{W}_2 \big) \mathbf{M} [\mathbf{W}_i^{\star}]^{\top}, \tag{28}

where [\,\cdot\,]_{ii} denotes the i-th P_i \times P_i diagonal block.

In other words, the EM updates for probabilistic CCA immediately fall out of the EM updates for factor analysis:

\begin{aligned} \mathbf{W}^{\star} &= \tilde{\boldsymbol{\Sigma}} \boldsymbol{\Psi}^{-1} \mathbf{W} \mathbf{M} \Big(\mathbf{M} + \mathbf{M} \mathbf{W}^{\top}\boldsymbol{\Psi}^{-1} \tilde{\boldsymbol{\Sigma}} \boldsymbol{\Psi}^{-1} \mathbf{W} \mathbf{M} \Big)^{-1}, \\ \boldsymbol{\Psi}^{\star} &= \begin{bmatrix} \big[\tilde{\boldsymbol{\Sigma}} - \tilde{\boldsymbol{\Sigma}} \boldsymbol{\Psi}^{-1} \mathbf{W} \mathbf{M} [\mathbf{W}^{\star}]^{\top}\big]_{11} & \mathbf{0}_{12} \\ \mathbf{0}_{21} & \big[\tilde{\boldsymbol{\Sigma}} - \tilde{\boldsymbol{\Sigma}} \boldsymbol{\Psi}^{-1} \mathbf{W} \mathbf{M} [\mathbf{W}^{\star}]^{\top}\big]_{22} \end{bmatrix}, \end{aligned} \tag{29}

where \mathbf{0}_{12} \in \mathbb{R}^{P_1 \times P_2} and \mathbf{0}_{21} = \mathbf{0}_{12}^{\top} are both all zeros. How I think of this is that by restricting the off-diagonal block matrices to be all zero, we enforce our modeling assumption that there is no correlation between the noise terms for each modality, \mathbf{x}_1 and \mathbf{x}_2.
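
Here is a minimal NumPy sketch (my own reading of the updates above, not Bach and Jordan's code) of one probabilistic CCA iteration, where X1 and X2 are P1 × N and P2 × N data matrices and all variable names are mine. The returned Psi blocks are the diagonal blocks in Eq. 29.

```python
import numpy as np

def pcca_em_step(X1, X2, W1, W2, Psi1, Psi2):
    """One EM iteration for probabilistic CCA via the tiled factor analysis updates (Eq. 29)."""
    P1, N = X1.shape
    P2 = X2.shape[0]
    K = W1.shape[1]

    # Tile the data and parameters as in Eq. 25.
    X = np.vstack([X1, X2])
    W = np.vstack([W1, W2])
    Psi = np.block([[Psi1, np.zeros((P1, P2))],
                    [np.zeros((P2, P1)), Psi2]])
    Sigma = (X @ X.T) / N

    # Factor analysis E-step quantities for the tiled model.
    Psi_inv = np.linalg.inv(Psi)
    M = np.linalg.inv(np.eye(K) + W.T @ Psi_inv @ W)

    # M-step: unconstrained factor analysis updates, keeping only
    # the diagonal blocks of the Psi update (Eqs. 28-29).
    T = Sigma @ Psi_inv @ W @ M
    W_new = T @ np.linalg.inv(M + M @ W.T @ Psi_inv @ T)
    R = Sigma - T @ W_new.T                 # Sigma - Sigma Psi^{-1} W M (W*)^T
    return W_new[:P1], W_new[P1:], R[:P1, :P1], R[P1:, P1:]
```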

EM for IBFA

Inter-battery factor analysis (IBFA) is similar to probabilistic CCA except that it explicitly models a shared latent variable \mathbf{z}_0 and view-specific latent variables \mathbf{z}_1 and \mathbf{z}_2:

\begin{aligned} \mathbf{z}_0 &\sim \mathcal{N}_{K_0}(\mathbf{0}, \mathbf{I}), \\ \mathbf{z}_1 &\sim \mathcal{N}_{K_1}(\mathbf{0}, \mathbf{I}), \\ \mathbf{z}_2 &\sim \mathcal{N}_{K_2}(\mathbf{0}, \mathbf{I}), \\ \mathbf{x}_1 \mid \mathbf{z}_0, \mathbf{z}_1, \mathbf{z}_2 &\sim \mathcal{N}_{P_1}(\mathbf{W}_1 \mathbf{z}_0 + \mathbf{B}_1 \mathbf{z}_1, \sigma_1^2 \mathbf{I}), \\ \mathbf{x}_2 \mid \mathbf{z}_0, \mathbf{z}_1, \mathbf{z}_2 &\sim \mathcal{N}_{P_2}(\mathbf{W}_2 \mathbf{z}_0 + \mathbf{B}_2 \mathbf{z}_2, \sigma_2^2 \mathbf{I}). \end{aligned} \tag{30}

The latent variable \mathbf{z}_0 can be viewed as modeling shared variation through the linear maps \mathbf{W}_1 and \mathbf{W}_2, while the \mathbf{z}_1 and \mathbf{z}_2 variables model view-specific variation through \mathbf{B}_1 and \mathbf{B}_2 respectively. Also, notice that the covariance matrices are isotropic, not just diagonal.

So what's the relationship between IBFA and probabilistic CCA? We can treat the view-specific latent variables as nuisance parameters and marginalize them out. For \mathbf{z}_1, we have:

\begin{aligned} \int p(\mathbf{x}_1, \mathbf{x}_2, \mathbf{z}_0, \mathbf{z}_1, \mathbf{z}_2) d \mathbf{z}_1 &= \int p(\mathbf{x}_1 \mid \mathbf{z}_0, \mathbf{z}_1) p(\mathbf{x}_2 \mid \mathbf{z}_0, \mathbf{z}_2) p(\mathbf{z}_0) p(\mathbf{z}_1) p(\mathbf{z}_2) d \mathbf{z}_1 \\ &= p(\mathbf{x}_2 \mid \mathbf{z}_0, \mathbf{z}_2) p(\mathbf{z}_2) p(\mathbf{z}_0) \int p(\mathbf{x}_1 \mid \mathbf{z}_0, \mathbf{z}_1) p(\mathbf{z}_1) d \mathbf{z}_1. \end{aligned} \tag{31}

Notice that \mathbf{z}_0 is a constant in the integration. Let \tilde{\mathbf{x}}_1 = \mathbf{x}_1 - \mathbf{W}_1 \mathbf{z}_0. Then we can easily integrate

\int p(\tilde{\mathbf{x}}_1 \mid \mathbf{z}_1) p(\mathbf{z}_1) d\mathbf{z}_1 \tag{32}

since both densities are Gaussian:

\begin{aligned} \tilde{\mathbf{x}}_1 \mid \mathbf{z}_1 &\sim \mathcal{N}_{P_1}(\mathbf{B}_1 \mathbf{z}_1, \sigma_1^2 \mathbf{I}), \\ \mathbf{z}_1 &\sim \mathcal{N}_{K_1}(\mathbf{0}, \mathbf{I}), \\ &\Downarrow \\ \tilde{\mathbf{x}}_1 &\sim \mathcal{N}_{P_1}(\mathbf{0}, \mathbf{B}_1 \mathbf{B}_1^{\top} + \sigma_1^2 \mathbf{I}), \\ &\Downarrow \\ \mathbf{x}_1 \mid \mathbf{z}_0 &\sim \mathcal{N}_{P_1}(\mathbf{W}_1 \mathbf{z}_0, \mathbf{B}_1 \mathbf{B}_1^{\top} + \sigma_1^2 \mathbf{I}). \end{aligned} \tag{33}

Notice that \mathbf{B}_1 \mathbf{B}_1^{\top} + \sigma_1^2 \mathbf{I} plays the role of the unconstrained covariance \boldsymbol{\Psi}_1 in probabilistic CCA (Eq. 24). If we applied this same logic to \mathbf{x}_2, we would get the same generative model as probabilistic CCA:

\begin{aligned} \mathbf{z}_0 &\sim \mathcal{N}_{K_0}(\mathbf{0}, \mathbf{I}), \\ \mathbf{x}_1 \mid \mathbf{z}_0 &\sim \mathcal{N}_{P_1}(\mathbf{W}_1 \mathbf{z}_0, \mathbf{B}_1 \mathbf{B}_1^{\top} + \sigma_1^2 \mathbf{I}), \\ \mathbf{x}_2 \mid \mathbf{z}_0 &\sim \mathcal{N}_{P_2}(\mathbf{W}_2 \mathbf{z}_0, \mathbf{B}_2 \mathbf{B}_2^{\top} + \sigma_2^2 \mathbf{I}). \end{aligned} \tag{34}

Thus, Bach's formulation of probabilistic CCA assumes that the dimensionalities of \mathbf{z}_1 and \mathbf{z}_2 are large enough that these view-specific covariance matrices are effectively unconstrained. To marginalize over \mathbf{z}_0, i.e. to compute

p(\mathbf{x}_1, \mathbf{x}_2 \mid \mathbf{z}_1, \mathbf{z}_2) = \int p(\mathbf{x}_1 \mid \mathbf{z}_0, \mathbf{z}_1) p(\mathbf{x}_2 \mid \mathbf{z}_0, \mathbf{z}_2) p(\mathbf{z}_0) \, d \mathbf{z}_0, \tag{35}

we can apply the same trick as in Eq. 33. First, let's tile our data as in probabilistic CCA, where additionally

\mathbf{B} = \begin{bmatrix} \mathbf{B}_1 & \mathbf{0} \\ \mathbf{0} & \mathbf{B}_2 \end{bmatrix}, \quad \sigma^2 \mathbf{I} = \begin{bmatrix} \sigma_1^2 \mathbf{I} & \mathbf{0} \\ \mathbf{0} & \sigma_2^2 \mathbf{I} \end{bmatrix}. \tag{36}

Next, let's write the product p(\mathbf{x}_1 \mid \mathbf{z}_0, \mathbf{z}_1) p(\mathbf{x}_2 \mid \mathbf{z}_0, \mathbf{z}_2) as the density of a P-variate Gaussian:

\mathbf{x} \mid \mathbf{z}_0, \mathbf{z} \sim \mathcal{N}_{P}( \mathbf{W} \mathbf{z}_0 + \mathbf{B} \mathbf{z}, \sigma^2 \mathbf{I}). \tag{37}

Finally, let's apply the same trick, where now \mathbf{z} is constant with respect to the integration. Let \hat{\mathbf{x}} = \mathbf{x} - \mathbf{B} \mathbf{z}; then

\begin{aligned} \hat{\mathbf{x}} \mid \mathbf{z}_0 &\sim \mathcal{N}_P(\mathbf{W} \mathbf{z}_0, \sigma^2 \mathbf{I}), \\ \mathbf{z}_0 &\sim \mathcal{N}_{K_0}(\mathbf{0}, \mathbf{I}), \\ &\Downarrow \\ \hat{\mathbf{x}} &\sim \mathcal{N}_{P}(\mathbf{0}, \mathbf{W} \mathbf{W}^{\top} + \sigma^2 \mathbf{I}), \\ &\Downarrow \\ \mathbf{x} \mid \mathbf{z} &\sim \mathcal{N}_{P}(\mathbf{B} \mathbf{z}, \mathbf{W} \mathbf{W}^{\top} + \sigma^2 \mathbf{I}), \\ &\Downarrow \\ \begin{bmatrix} \mathbf{x}_1 \\ \mathbf{x}_2 \end{bmatrix} \Bigg| \begin{bmatrix} \mathbf{z}_1 \\ \mathbf{z}_2 \end{bmatrix} &\sim \mathcal{N}_P \Bigg( \begin{bmatrix} \mathbf{B}_1 & \mathbf{0} \\ \mathbf{0} & \mathbf{B}_2 \end{bmatrix} \begin{bmatrix} \mathbf{z}_1 \\ \mathbf{z}_2 \end{bmatrix}, \begin{bmatrix} \mathbf{W}_1 \mathbf{W}_1^{\top} + \sigma_1^2 \mathbf{I} & \mathbf{W}_1 \mathbf{W}_2^{\top} \\ \mathbf{W}_2 \mathbf{W}_1^{\top} & \mathbf{W}_2 \mathbf{W}_2^{\top} + \sigma_2^2 \mathbf{I} \end{bmatrix} \Bigg), \\ &\Downarrow \\ \mathbf{x}_i \mid \mathbf{z}_i &\sim \mathcal{N}_{P_i}(\mathbf{B}_i \mathbf{z}_i, \mathbf{W}_i \mathbf{W}_i^{\top} + \sigma_i^2 \mathbf{I}). \end{aligned} \tag{38}

We can see that the last lines of Eq. 33 and Eq. 38 are both essentially the same conditionally Gaussian assumption as in factor analysis (Eq. 9). Thus, the M-step updates proposed by (Klami & Kaski, 2006) are, yet again, essentially those of factor analysis:

M-step:

Marginalize over \mathbf{z}_1 and \mathbf{z}_2 and optimize \mathbf{W}. These are just the updates for probabilistic CCA:

\begin{aligned} \boldsymbol{\Psi} &= \begin{bmatrix} \mathbf{B}_1 \mathbf{B}_1^{\top} + \sigma_1^2 \mathbf{I} & \mathbf{0}_{12} \\ \mathbf{0}_{21} & \mathbf{B}_2 \mathbf{B}_2^{\top} + \sigma_2^2 \mathbf{I} \end{bmatrix}, \\ \mathbf{M} &= (\mathbf{I} + \mathbf{W}^{\top} \boldsymbol{\Psi}^{-1} \mathbf{W})^{-1}, \\ \mathbf{W}^{\star} &= \tilde{\boldsymbol{\Sigma}} \boldsymbol{\Psi}^{-1} \mathbf{W} \mathbf{M} \Big(\mathbf{M} + \mathbf{M} \mathbf{W}^{\top}\boldsymbol{\Psi}^{-1} \tilde{\boldsymbol{\Sigma}} \boldsymbol{\Psi}^{-1} \mathbf{W} \mathbf{M} \Big)^{-1}. \end{aligned} \tag{39}

Next, marginalize over \mathbf{z}_0 and optimize each \mathbf{B}_i. Note that these steps are not probabilistic CCA since we've integrated out the shared latent variable. Instead, this can be thought of as two separate factor analysis models. So we update each \mathbf{B}_i separately. For each i \in \{1,2\}, do:

\begin{aligned} \boldsymbol{\Psi}_i &= \mathbf{W}_i \mathbf{W}_i^{\top} + \sigma_i^2 \mathbf{I}, \\ \mathbf{M}_i &= (\mathbf{I} + \mathbf{B}_i^{\top} \boldsymbol{\Psi}_i^{-1} \mathbf{B}_i)^{-1}, \\ \mathbf{B}_i^{\star} &= \tilde{\boldsymbol{\Sigma}}_{ii} \boldsymbol{\Psi}_i^{-1} \mathbf{B}_i \mathbf{M}_i \Big(\mathbf{M}_i + \mathbf{M}_i \mathbf{B}_i^{\top}\boldsymbol{\Psi}_i^{-1} \tilde{\boldsymbol{\Sigma}}_{ii} \boldsymbol{\Psi}_i^{-1} \mathbf{B}_i \mathbf{M}_i \Big)^{-1}. \end{aligned} \tag{40}

Finally, Klami proposes these updates for \sigma_i^2:

\sigma^2_i = \frac{1}{P_i} \text{tr}\left(\tilde{\boldsymbol{\Sigma}}_{ii} - \tilde{\boldsymbol{\Sigma}}_{ii} \boldsymbol{\Psi}_i^{-1} \mathbf{B}_i \mathbf{M}_i \mathbf{B}_i^{\top} - \mathbf{W}_i \mathbf{W}_i^{\top} \right). \tag{41}

Unfortunately, I do not know how Klami derived these updates. If the covariance matrix were just \sigma^2_i \mathbf{I}, then the maximum likelihood estimate would be the same as for probabilistic PCA. However, we have to deal with this term,

(\mathbf{W}_i \mathbf{W}_i^{\top} + \sigma_i^2 \mathbf{I})^{-1}. \tag{42}

I'm not sure how to compute the derivative of this term with respect to \sigma_i^2, or even how to isolate \sigma_i^2, since the Woodbury matrix identity will keep it inside the inverse. Please email me if you understand how to derive Eq. 41.
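
For what it's worth, here is how I would transcribe the IBFA M-step above into NumPy: a sketch under my reading of Eqs. 39–41, not Klami and Kaski's code, with Eq. 41 taken at face value since I have not derived it. Xi is a Pi × N data matrix, Wi and Bi are the current loadings, and si is the current σi².

```python
import numpy as np

def ibfa_m_step(X1, X2, W1, W2, B1, B2, s1, s2):
    """One IBFA M-step in the spirit of Eqs. 39-41."""
    P1, N = X1.shape
    P2 = X2.shape[0]
    K0 = W1.shape[1]
    X = np.vstack([X1, X2])
    Sigma = (X @ X.T) / N
    S11, S22 = Sigma[:P1, :P1], Sigma[P1:, P1:]

    # Step 1: marginalize z_1, z_2 and update the shared loadings W (Eq. 39).
    Psi = np.block([
        [B1 @ B1.T + s1 * np.eye(P1), np.zeros((P1, P2))],
        [np.zeros((P2, P1)), B2 @ B2.T + s2 * np.eye(P2)],
    ])
    W = np.vstack([W1, W2])
    Psi_inv = np.linalg.inv(Psi)
    M = np.linalg.inv(np.eye(K0) + W.T @ Psi_inv @ W)
    T = Sigma @ Psi_inv @ W @ M
    W_new = T @ np.linalg.inv(M + M @ W.T @ Psi_inv @ T)

    # Step 2: marginalize z_0 and update each B_i as its own factor analyzer (Eq. 40),
    # then update sigma_i^2 (Eq. 41), using the current W_i in both.
    out = []
    for Sii, Wi, Bi, si, Pi in [(S11, W1, B1, s1, P1), (S22, W2, B2, s2, P2)]:
        Psi_i = Wi @ Wi.T + si * np.eye(Pi)
        Psi_i_inv = np.linalg.inv(Psi_i)
        Ki = Bi.shape[1]
        Mi = np.linalg.inv(np.eye(Ki) + Bi.T @ Psi_i_inv @ Bi)
        Ti = Sii @ Psi_i_inv @ Bi @ Mi
        Bi_new = Ti @ np.linalg.inv(Mi + Mi @ Bi.T @ Psi_i_inv @ Ti)
        si_new = np.trace(Sii - Sii @ Psi_i_inv @ Bi @ Mi @ Bi.T - Wi @ Wi.T) / Pi
        out.append((Bi_new, si_new))

    (B1_new, s1_new), (B2_new, s2_new) = out
    return W_new[:P1], W_new[P1:], B1_new, B2_new, s1_new, s2_new
```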

Conclusion

Several Gaussian latent variable models are strikingly similar, resulting in similar EM updates. This is useful to know, since authors often take it for granted. For example, in (Bach & Jordan, 2005), the authors provide no derivation of their EM updates in Section 4.1, simply stating that their model “readily yields the following EM update equations.” I suspect this is because those updates are identical to those for factor analysis, aside from the block structure. Furthermore, I suspect the EM updates for other models such as group factor analysis (Klami et al., 2015) and partial least squares (Murphy, 2012) are also closely related.


Acknowledgements

I thank Arto Klami for kindly replying to several emails explaining steps in these derivations.


Appendix

A1. Equivalence with Ghahramani and Hinton’s moments

In this post, we computed the first and second posterior moments as:

\begin{aligned} \mathbb{E}[\mathbf{z}] &= \mathbf{M} \mathbf{W}^{\top}\boldsymbol{\Psi}^{-1}\mathbf{x}, \\ \mathbb{E}[\mathbf{z} \mathbf{z}^{\top}] &= \mathbf{M} + \mathbb{E}[\mathbf{z}] \mathbb{E}[\mathbf{z}]^{\top}. \end{aligned} \tag{A1.1}

In the last post, we computed the first and second posterior moments as:

\begin{aligned} \mathbb{E}[\mathbf{z}] &= \mathbf{W}^{\top} (\boldsymbol{\Psi} + \mathbf{WW}^{\top})^{-1} \mathbf{x}, \\ \mathbb{E}[\mathbf{z} \mathbf{z}^{\top}] &= \mathbf{I} - \mathbf{W}^{\top} (\boldsymbol{\Psi} + \mathbf{WW}^{\top})^{-1} \mathbf{W} + \mathbb{E}[\mathbf{z}] \mathbb{E}[\mathbf{z}]^{\top}. \end{aligned} \tag{A1.2}

We now show these are equivalent:

\begin{aligned} \mathbb{E}[\mathbf{z}] &= \mathbf{W}^{\top} (\boldsymbol{\Psi} + \mathbf{WW}^{\top})^{-1} \mathbf{x} \\ &\stackrel{\dagger}{=} \mathbf{W}^{\top} \big[\boldsymbol{\Psi}^{-1} - \boldsymbol{\Psi}^{-1} \mathbf{W} \underbrace{(\mathbf{I} + \mathbf{W}^{\top} \boldsymbol{\Psi}^{-1} \mathbf{W})^{-1}}_{\mathbf{M}} \mathbf{W}^{\top} \boldsymbol{\Psi}^{-1}\big] \mathbf{x} \\ &= \mathbf{W}^{\top} \boldsymbol{\Psi}^{-1} \mathbf{x} - \mathbf{W}^{\top} \boldsymbol{\Psi}^{-1} \mathbf{W} \mathbf{M} \mathbf{W}^{\top} \boldsymbol{\Psi}^{-1} \mathbf{x} \\ &= [\mathbf{I} - \mathbf{W}^{\top} \boldsymbol{\Psi}^{-1} \mathbf{W} \mathbf{M}] \mathbf{W}^{\top} \boldsymbol{\Psi}^{-1} \mathbf{x} \\ &\stackrel{\star}{=} \mathbf{M} \mathbf{W}^{\top} \boldsymbol{\Psi}^{-1} \mathbf{x}, \\ \\ \mathbb{E}[\mathbf{z} \mathbf{z}^{\top}] &= \mathbf{I} - \mathbf{W}^{\top} (\boldsymbol{\Psi} + \mathbf{WW}^{\top})^{-1} \mathbf{W} + \mathbb{E}[\mathbf{z}] \mathbb{E}[\mathbf{z}]^{\top} \\ &\stackrel{\dagger}{=} (\mathbf{I} + \mathbf{W}^{\top} \boldsymbol{\Psi}^{-1} \mathbf{W})^{-1} + \mathbb{E}[\mathbf{z}] \mathbb{E}[\mathbf{z}]^{\top} \\ &= \mathbf{M} + \mathbb{E}[\mathbf{z}] \mathbb{E}[\mathbf{z}]^{\top}. \end{aligned} \tag{A1.3}

In step \star, we use the fact that

\begin{aligned} \mathbf{I} &= (\mathbf{I} + \mathbf{W}^{\top} \boldsymbol{\Psi}^{-1} \mathbf{W}) (\mathbf{I} + \mathbf{W}^{\top} \boldsymbol{\Psi}^{-1} \mathbf{W})^{-1} \\ \mathbf{I} &= (\mathbf{I} + \mathbf{W}^{\top} \boldsymbol{\Psi}^{-1} \mathbf{W}) \mathbf{M} \\ \mathbf{I} &= \mathbf{M} + \mathbf{W}^{\top} \boldsymbol{\Psi}^{-1} \mathbf{W} \mathbf{M} \\ \mathbf{M} &= \mathbf{I} - \mathbf{W}^{\top} \boldsymbol{\Psi}^{-1} \mathbf{W} \mathbf{M}. \end{aligned} \tag{A1.4}

Both \dagger steps just use the Woodbury matrix identity.
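
A quick numerical check (mine) of these equivalences, under arbitrary random parameters:

```python
import numpy as np

# Numerically verify the equivalences in Eq. A1.3.
rng = np.random.default_rng(0)
P, K = 6, 3
W = rng.normal(size=(P, K))
Psi = np.diag(rng.uniform(0.5, 2.0, size=P))
x = rng.normal(size=P)

M = np.linalg.inv(np.eye(K) + W.T @ np.linalg.inv(Psi) @ W)

mean_a = M @ W.T @ np.linalg.inv(Psi) @ x                  # this post (Eq. 13)
mean_b = W.T @ np.linalg.inv(Psi + W @ W.T) @ x            # previous post (Eq. A1.2)
cov_a = M
cov_b = np.eye(K) - W.T @ np.linalg.inv(Psi + W @ W.T) @ W

print(np.allclose(mean_a, mean_b), np.allclose(cov_a, cov_b))   # True True
```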

A2. Equivalence with Tipping and Bishop’s update

We want to show that the EM update for \mathbf{W} in (Tipping & Bishop, 1999),

\mathbf{W}^{\star} = \tilde{\boldsymbol{\Sigma}} \mathbf{W} (\sigma^2 \mathbf{I} + \mathbf{G}^{-1} \mathbf{W}^{\top} \tilde{\boldsymbol{\Sigma}} \mathbf{W})^{-1}, \tag{A2.1}

where \mathbf{G} = \mathbf{W}^{\top} \mathbf{W} + \sigma^2 \mathbf{I} and \tilde{\boldsymbol{\Sigma}} = \frac{1}{N} \mathbf{X} \mathbf{X}^{\top}, is equivalent to our update:

\mathbf{W}^{\star} = \mathbf{X} \mathbf{S}^{\top} \mathbf{A}^{-1}. \tag{A2.2}

It is:

\begin{aligned} \mathbf{W}^{\star} &= \mathbf{X} \mathbf{S}^{\top} \mathbf{A}^{-1} \\ &= \mathbf{X} (\mathbf{M} \mathbf{W}^{\top} \boldsymbol{\Psi}^{-1} \mathbf{X})^{\top} (N \mathbf{M} + (\mathbf{M} \mathbf{W}^{\top} \boldsymbol{\Psi}^{-1} \mathbf{X}) (\mathbf{M} \mathbf{W}^{\top} \boldsymbol{\Psi}^{-1} \mathbf{X})^{\top})^{-1} \\ &= \mathbf{X}\mathbf{X}^{\top} \boldsymbol{\Psi}^{-1} \mathbf{W}\mathbf{M} \Big( N \Big[ \mathbf{M} + \frac{1}{N} \mathbf{M} \mathbf{W}^{\top} \boldsymbol{\Psi}^{-1} \mathbf{X} \mathbf{X}^{\top} \boldsymbol{\Psi}^{-1} \mathbf{W} \mathbf{M} \Big] \Big)^{-1} \\ &= \tilde{\boldsymbol{\Sigma}} \boldsymbol{\Psi}^{-1} \mathbf{W}\mathbf{M} \Big( \mathbf{M} + \mathbf{M} \mathbf{W}^{\top} \boldsymbol{\Psi}^{-1} \tilde{\boldsymbol{\Sigma}} \boldsymbol{\Psi}^{-1} \mathbf{W} \mathbf{M} \Big)^{-1} \\ &= \tilde{\boldsymbol{\Sigma}} \boldsymbol{\Psi}^{-1} \mathbf{W}\mathbf{M} \Big( \Big[ \mathbf{I} + \mathbf{M} \mathbf{W}^{\top} \boldsymbol{\Psi}^{-1} \tilde{\boldsymbol{\Sigma}} \boldsymbol{\Psi}^{-1} \mathbf{W} \Big] \mathbf{M} \Big)^{-1} \\ &\stackrel{\star}{=} \tilde{\boldsymbol{\Sigma}} \boldsymbol{\Psi}^{-1} \mathbf{W} \Big( \mathbf{I} + \mathbf{M} \mathbf{W}^{\top} \boldsymbol{\Psi}^{-1} \tilde{\boldsymbol{\Sigma}} \boldsymbol{\Psi}^{-1} \mathbf{W} \Big)^{-1} \\ &\stackrel{\dagger}{=} \tilde{\boldsymbol{\Sigma}} \mathbf{W} \Big( \sigma^2 \Big[ \mathbf{I} + \mathbf{M} \mathbf{W}^{\top} \frac{1}{\sigma^2} \tilde{\boldsymbol{\Sigma}} \frac{1}{\sigma^2} \mathbf{W} \Big] \Big)^{-1} \\ &= \tilde{\boldsymbol{\Sigma}} \mathbf{W} \Big( \sigma^2 \mathbf{I} + \frac{1}{\sigma^2} \mathbf{M} \mathbf{W}^{\top} \tilde{\boldsymbol{\Sigma}} \mathbf{W} \Big)^{-1} \\ &= \tilde{\boldsymbol{\Sigma}} \mathbf{W} \Big( \sigma^2 \mathbf{I} + \frac{1}{\sigma^2} \Big( \mathbf{W}^{\top} \frac{1}{\sigma^2} \mathbf{W} + \mathbf{I} \Big)^{-1} \mathbf{W}^{\top} \tilde{\boldsymbol{\Sigma}} \mathbf{W} \Big)^{-1} \\ &= \tilde{\boldsymbol{\Sigma}} \mathbf{W} ( \sigma^2 \mathbf{I} + (\mathbf{W}^{\top} \mathbf{W} + \sigma^2 \mathbf{I})^{-1} \mathbf{W}^{\top} \tilde{\boldsymbol{\Sigma}} \mathbf{W})^{-1} \\ &= \tilde{\boldsymbol{\Sigma}} \mathbf{W}( \sigma^2 \mathbf{I} + \mathbf{G}^{-1} \mathbf{W}^{\top} \tilde{\boldsymbol{\Sigma}} \mathbf{W})^{-1}. \end{aligned} \tag{A2.3}

In step \star, we used the fact that for two invertible matrices \mathbf{A} and \mathbf{B} of the same size:

(\mathbf{AB})^{-1} = \mathbf{B}^{-1} \mathbf{A}^{-1}. \tag{A2.4}

In step \dagger, we substitute \sigma^2 \mathbf{I} for \boldsymbol{\Psi}.
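
Again, a quick numerical check (mine) that Eq. A2.2 reproduces Eq. A2.1 when \boldsymbol{\Psi} = \sigma^2 \mathbf{I}, for arbitrary random data and parameters:

```python
import numpy as np

# Numerically verify that X S^T A^{-1} matches Tipping and Bishop's update (Eq. A2.1).
rng = np.random.default_rng(0)
P, K, N = 6, 3, 500
X = rng.normal(size=(P, N))
W = rng.normal(size=(P, K))
sigma2 = 0.7

Sigma = (X @ X.T) / N
M = np.linalg.inv(np.eye(K) + (W.T @ W) / sigma2)
S = (M @ W.T @ X) / sigma2
A = N * M + S @ S.T
W_ours = X @ S.T @ np.linalg.inv(A)                        # Eq. A2.2

G = W.T @ W + sigma2 * np.eye(K)
W_tipping = Sigma @ W @ np.linalg.inv(sigma2 * np.eye(K) + np.linalg.inv(G) @ W.T @ Sigma @ W)

print(np.allclose(W_ours, W_tipping))                      # True
```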

  1. Tipping, M. E., & Bishop, C. M. (1999). Probabilistic principal component analysis. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 61(3), 611–622.
  2. Bach, F. R., & Jordan, M. I. (2005). A probabilistic interpretation of canonical correlation analysis.
  3. Klami, A., & Kaski, S. (2006). Generative models that discover dependencies between data sets. 2006 16th IEEE Signal Processing Society Workshop on Machine Learning for Signal Processing, 123–128.
  4. Roweis, S., & Ghahramani, Z. (1999). A unifying review of linear Gaussian models. Neural Computation, 11(2), 305–345.
  5. Bishop, C. M. (2006). Pattern Recognition and Machine Learning.
  6. Ghahramani, Z., & Hinton, G. E. (1996). The EM algorithm for mixtures of factor analyzers. Technical Report CRG-TR-96-1, University of Toronto.
  7. Klami, A., Virtanen, S., Leppäaho, E., & Kaski, S. (2015). Group factor analysis. IEEE Transactions on Neural Networks and Learning Systems, 26(9), 2136–2147.
  8. Murphy, K. P. (2012). Machine learning: a probabilistic perspective. MIT press.