In previous posts, I have discussed a number of Gaussian latent factor models such as factor analysis, probabilistic PCA, and probabilistic CCA. However, while I have mentioned connections between these models at various times, I have never formalized them. The goal of this post is to show how these models are closely related. First, I will derive expectation–maximization (EM) updates for factor analysis and then show how these are related to the EM updates for probabilistic PCA (Tipping & Bishop, 1999), probabilistic canonical correlation analysis (Bach & Jordan, 2005), and inter-battery factor analysis (Klami & Kaski, 2006).
The title of this post is a reference to Sam Roweis’s excellent overview of linear-Gaussian models (Roweis & Ghahramani, 1999).
EM for factor analysis
The factor analysis model is
$$
\begin{aligned}
\mathbf{x}_n &= \mathbf{W} \mathbf{z}_n + \mathbf{u}_n,
\\
\mathbf{z}_n &\sim \mathcal{N}_K(\mathbf{0}, \mathbf{I}),
\\
\mathbf{u}_n &\sim \mathcal{N}_P(\mathbf{0}, \boldsymbol{\Psi}),
\end{aligned} \tag{1}
$$
where $\mathbf{x}_n \in \mathbb{R}^P$, $\mathbf{z}_n \in \mathbb{R}^K$, $\mathbf{W} \in \mathbb{R}^{P \times K}$, and $\boldsymbol{\Psi} \in \mathbb{R}^{P \times P}$ is a diagonal matrix. We can also write this in vectorized notation as
$$
\mathbf{X} = \mathbf{W}\mathbf{Z} + \mathbf{U}, \tag{2}
$$
where $\mathbf{X} \in \mathbb{R}^{P \times N}$ and $\mathbf{Z} \in \mathbb{R}^{K \times N}$. In words, we assume our $P$-dimensional observations are linear functions of $K$-dimensional latent variables plus some additive Gaussian noise. The Gaussian assumptions on $\mathbf{z}$ and $\mathbf{u}$ imply that our data are Gaussian distributed. Since $\boldsymbol{\Psi}$ is diagonal, we assume the noise components are uncorrelated.
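To make the model concrete, here is a minimal NumPy sketch of sampling from the generative model in Eq. 2. The dimensions and parameter values are arbitrary choices for illustration, not anything prescribed by the model:

```python
import numpy as np

rng = np.random.default_rng(0)
P, K, N = 5, 2, 100_000              # observed dim, latent dim, number of samples

W = rng.normal(size=(P, K))          # loading matrix
psi = rng.uniform(0.1, 1.0, size=P)  # diagonal of the noise covariance Psi

Z = rng.normal(size=(K, N))                          # z_n ~ N_K(0, I)
U = np.sqrt(psi)[:, None] * rng.normal(size=(P, N))  # u_n ~ N_P(0, Psi)
X = W @ Z + U                                        # Eq. 2

# The implied marginal covariance of x is W W^T + Psi.
print(np.abs(np.cov(X) - (W @ W.T + np.diag(psi))).max())  # small for large N
```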
To compute the EM updates, we first write down the expected complete log likelihood $Q$, where the expectation is taken with respect to the posterior $p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta})$ and we keep only terms that depend on the parameters $\boldsymbol{\theta} = \{\mathbf{W}, \boldsymbol{\Psi}\}$:
$$
\begin{aligned}
Q &=
\mathbb{E} \big[
-\frac{1}{2} \sum_{n=1}^{N}(\textbf{x}_n - \mathbf{W} \textbf{z}_n)^{\top} \boldsymbol{\Psi}^{-1} (\textbf{x}_n - \mathbf{W} \textbf{z}_n) - \frac{N}{2} \ln |\boldsymbol{\Psi}| + C \big]
\\
&\propto
-\frac{1}{2} \sum_{n=1}^{N}
\mathbb{E} \big[
\textbf{x}_n^{\top} \boldsymbol{\Psi}^{-1} \textbf{x}_n - 2 \textbf{z}_n^{\top} \mathbf{W}^{\top} \boldsymbol{\Psi}^{-1} \textbf{x}_n + \textbf{z}_n^{\top} \mathbf{W}^{\top} \boldsymbol{\Psi}^{-1} \mathbf{W} \textbf{z}_n
\big] - \frac{N}{2} \ln |\boldsymbol{\Psi}|.
\\
&=
-\frac{1}{2} \sum_{n=1}^{N} \big[
\underbrace{\vphantom{\Big[}\textbf{x}_n^{\top} \boldsymbol{\Psi}^{-1} \textbf{x}_n}_{A} - 2 \underbrace{\vphantom{\Big[}\mathbb{E}[\textbf{z}_n]^{\top} \mathbf{W}^{\top} \boldsymbol{\Psi}^{-1} \textbf{x}_n}_{B} + \underbrace{\vphantom{\Big[}\text{tr}\big( \boldsymbol{\mathbf{W}}^{\top} \boldsymbol{\Psi}^{-1} \mathbf{W} \mathbb{E}[ \textbf{z}_n \mathbf{z}_n^{\top}] \big) \big]}_{C} - \underbrace{\vphantom{\Big[}\frac{N}{2} \ln |\boldsymbol{\Psi}|}_{D}.
\end{aligned} \tag{3}
$$
See my post on EM if the big picture does not make sense. To compute EM updates for either $\mathbf{W}$ or $\boldsymbol{\Psi}$, we compute the derivative of Eq. 3 w.r.t. the parameter, set the equation equal to zero, and solve for the parameter. We can take derivatives of the terms labeled $A$, $B$, $C$, and $D$ in Eq. 3 w.r.t. both parameters using the excellent Matrix Cookbook,
$$
\begin{aligned}
\frac{\partial B}{\partial \mathbf{W}} &= \boldsymbol{\Psi}^{-1} \mathbf{x}_n \mathbb{E}[\mathbf{z}_n]^{\top}, && \text{MC.71}
\\
\frac{\partial C}{\partial \mathbf{W}} &= 2 \boldsymbol{\Psi}^{-1} \mathbf{W} \mathbb{E}[\mathbf{z}_n \mathbf{z}_n^{\top}], && \text{MC.117}
\\
\\
\frac{\partial A}{\partial \boldsymbol{\Psi}^{-1}} &= \mathbf{x}_n \mathbf{x}_n^{\top}, && \text{MC.72}
\\
\frac{\partial B}{\partial \boldsymbol{\Psi}^{-1}} &= \mathbf{x}_n \mathbb{E}[\mathbf{z}_n]^{\top} \mathbf{W}^{\top}, && \text{MC.70}
\\
\frac{\partial C}{\partial \boldsymbol{\Psi}^{-1}} &= \mathbf{W} \mathbb{E}[\mathbf{z}_n \mathbf{z}_n^{\top}] \mathbf{W}^{\top}, && \text{MC.102}
\\
\frac{\partial D}{\partial \boldsymbol{\Psi}^{-1}} &= -\frac{N}{2} \boldsymbol{\Psi}. && \text{MC.56}
\end{aligned} \tag{4}
$$
MC.X denotes Eq. X in the Matrix Cookbook. First, we can solve for the optimal $\mathbf{W}^{\star}$:
$$
\begin{aligned}
\frac{\partial Q}{\partial \mathbf{W}} &= \sum_{n=1}^{N} \boldsymbol{\Psi}^{-1} \mathbf{x}_n \mathbb{E}[\mathbf{z}_n]^{\top} - \sum_{n=1}^{N} \boldsymbol{\Psi}^{-1} \mathbf{W} \mathbb{E}[\mathbf{z}_n \mathbf{z}_n^{\top}],
\\
&\Downarrow
\\
\mathbf{W}^{\star} &= \Big(\sum_{n=1}^{N} \mathbf{x}_n \mathbb{E}[\mathbf{z}_n]^{\top} \Big) \Big( \sum_{n=1}^{N} \mathbb{E}[\mathbf{z}_n \mathbf{z}_n^{\top}] \Big)^{-1}.
\end{aligned} \tag{5}
$$
Next, we plug $\mathbf{W}^{\star}$ into the expected complete log likelihood in Eq. 3 and solve for $\boldsymbol{\Psi}$:
$$
\begin{aligned}
\frac{\partial Q}{\partial \boldsymbol{\Psi}^{-1}} &= -\frac{1}{2} \sum_{n=1}^{N} \Big[ \mathbf{x}_n \mathbf{x}_n^{\top} - 2 \mathbf{x}_n \mathbb{E}[\mathbf{z}_n]^{\top} [\mathbf{W}^{\star}]^{\top} + \mathbf{W}^{\star} \mathbb{E}[\mathbf{z}_n \mathbf{z}_n^{\top}] [\mathbf{W}^{\star}]^{\top} \Big] + \frac{N}{2} \boldsymbol{\Psi},
\\
&\Downarrow
\\
\boldsymbol{\Psi}^{\star} &= \frac{1}{N} \sum_{n=1}^{N} \Big[ \mathbf{x}_n \mathbf{x}_n^{\top} - 2 \mathbf{x}_n \mathbb{E}[\mathbf{z}_n]^{\top} [\mathbf{W}^{\star}]^{\top} + \mathbf{W}^{\star} \mathbb{E}[\mathbf{z}_n \mathbf{z}_n^{\top}] [\mathbf{W}^{\star}]^{\top} \Big]
\\
&= \frac{1}{N} \sum_{n=1}^{N} \mathbf{x}_n \mathbf{x}_n^{\top} - \frac{1}{N} \sum_{n=1}^{N} 2 \mathbf{x}_n \mathbb{E}[\mathbf{z}_n]^{\top} [\mathbf{W}^{\star}]^{\top} + \frac{1}{N} \sum_{n=1}^{N} \mathbf{W}^{\star} \mathbb{E}[\mathbf{z}_n \mathbf{z}_n^{\top}] [\mathbf{W}^{\star}]^{\top}.
\end{aligned} \tag{6}
$$
We can simplify Eq. 6 with the following observation, based on the equality in Eq. 5:
$$
\frac{1}{N} \sum_{n=1}^N \mathbf{W}^{\star} \mathbb{E}[\mathbf{z}_n \mathbf{z}_n^{\top}] [\mathbf{W}^{\star}]^{\top} = \frac{1}{N} \sum_{n=1}^{N} \mathbf{x}_n \mathbb{E}[\mathbf{z}_n]^{\top} [\mathbf{W}^{\star}]^{\top}. \tag{7}
$$
So we can simplify Eq. 6 as
$$
\boldsymbol{\Psi}^{\star} = \frac{1}{N} \sum_{n=1}^{N} \mathbf{x}_n \mathbf{x}_n^{\top} - \frac{1}{N} \sum_{n=1}^{N} \mathbf{x}_n \mathbb{E}[\mathbf{z}_n]^{\top} [\mathbf{W}^{\star}]^{\top}. \tag{8}
$$
To derive the posterior moments appearing in Eq. 4, recall that the Gaussian assumptions on both $\mathbf{z}$ and $\mathbf{u}$ induce a conditionally Gaussian assumption on $\mathbf{x}$,
$$
\mathbf{x} \mid \mathbf{z} \sim \mathcal{N}_P(\mathbf{W} \mathbf{z}, \boldsymbol{\Psi}), \tag{9}
$$
Since our prior on $\mathbf{z}$ is also Gaussian, the posterior is Gaussian
$$
\mathbf{z} \mid \mathbf{x} \sim \mathcal{N}_K(\mathbf{M} \mathbf{W}^{\top} \boldsymbol{\Psi}^{-1} \mathbf{x}, \mathbf{M}) \tag{10}
$$
where
$$
\mathbf{M} = (\mathbf{I} + \mathbf{W}^{\top}\boldsymbol{\Psi}^{-1}\mathbf{W})^{-1}. \tag{11}
$$
See Sec. 2.3.3 in (Bishop, 2006) for details. We can use the covariance and mean to solve for the second moment:
$$
\mathbb{E}[\textbf{z} \textbf{z}^{\top}] = \mathbf{M} + \mathbb{E}[\textbf{z}] \mathbb{E}[\textbf{z}]^{\top}, \tag{12}
$$
where the first moment is clearly
$$
\mathbb{E}[\mathbf{z}] = \mathbf{M} \mathbf{W}^{\top}\boldsymbol{\Psi}^{-1}\mathbf{x}. \tag{13}
$$
For the curious, these posterior moments are equivalent to those in my previous post on factor analysis, albeit in a different and vectorized form. See A1 for a proof of this claim. Finally, let’s define the vectorized quantity $\mathbf{S}$ as
$$
\begin{aligned}
\mathbf{S}
&= \left[\begin{array}{c|c|c}
\mathbb{E}[\mathbf{z}_1] & \dots & \mathbb{E}[\mathbf{z}_N]
\end{array}\right]
\\
&= \mathbf{M} \mathbf{W}^{\top}\boldsymbol{\Psi}^{-1} \mathbf{X}.
\end{aligned} \tag{14}
$$
So $\mathbf{S} \in \mathbb{R}^{K \times N}$ is a matrix of first posterior moments. The EM updates in Eqs. 5 and 8 match those given in (Ghahramani & Hinton, 1996). However, we can vectorize them as follows:
$$
\begin{aligned}
\mathbf{W}^{\star} &= \Big(\sum_{n=1}^{N} \mathbf{x}_n \mathbb{E}[\mathbf{z}_n]^{\top} \Big) \Big( \sum_{n=1}^{N} \mathbb{E}[\mathbf{z}_n \mathbf{z}_n^{\top}] \Big)^{-1}
\\
&= \Big(\sum_{n=1}^{N} \mathbf{x}_n \mathbf{x}_n^{\top} \boldsymbol{\Psi}^{-1} \mathbf{W} \mathbf{M} \Big) \Big( \sum_{n=1}^{N} \Big[ \mathbf{M} + \mathbf{M}\mathbf{W}^{\top}\boldsymbol{\Psi}^{-1} \mathbf{x}_n \mathbf{x}_n^{\top} \boldsymbol{\Psi}^{-1} \mathbf{W} \mathbf{M} \Big] \Big)^{-1}
\\
&= \mathbf{X}\mathbf{S}^{\top} (N \mathbf{M} + \mathbf{S}\mathbf{S}^{\top})^{-1},
\\
\\
\boldsymbol{\Psi}^{\star} &= \frac{1}{N} \sum_{n=1}^{N} \mathbf{x}_n \mathbf{x}_n^{\top} - \frac{1}{N} \sum_{n=1}^{N} \mathbf{x}_n \mathbb{E}[\mathbf{z}_n]^{\top} [\mathbf{W}^{\star}]^{\top}
\\
&= \tilde{\boldsymbol{\Sigma}} - \frac{1}{N} \sum_{n=1}^{N} \mathbf{x}_n \mathbf{x}_n^{\top} \boldsymbol{\Psi}^{-1} \mathbf{W} \mathbf{M} [\mathbf{W}^{\star}]^{\top}
\\
&= \tilde{\boldsymbol{\Sigma}} - \tilde{\boldsymbol{\Sigma}} \boldsymbol{\Psi}^{-1} \mathbf{W} \mathbf{M} [\mathbf{W}^{\star}]^{\top},
\end{aligned} \tag{15}
$$
where $\tilde{\boldsymbol{\Sigma}} = N^{-1} \mathbf{X} \mathbf{X}^{\top}$. Throughout this post, I repeatedly use the fact that a sum of outer products is equal to a matrix-matrix multiplication. In summary, we can write the EM algorithm for factor analysis as:
E-step:
$$
\begin{aligned}
\mathbf{M} &= (\mathbf{I} + \mathbf{W}^{\top}\boldsymbol{\Psi}^{-1}\mathbf{W})^{-1},
\\
\mathbf{S} &= \mathbf{M} \mathbf{W}^{\top}\boldsymbol{\Psi}^{-1} \mathbf{X},
\\
\mathbf{A} &= N \mathbf{M} + \mathbf{S} \mathbf{S}^{\top}.
\end{aligned} \tag{16}
$$
M-step:
$$
\begin{aligned}
\mathbf{W}^{\star} &= \mathbf{X} \mathbf{S}^{\top} \mathbf{A}^{-1},
\\
\boldsymbol{\Psi}^{\star} &= \text{diag} \left( \tilde{\boldsymbol{\Sigma}} - \tilde{\boldsymbol{\Sigma}} \boldsymbol{\Psi}^{-1} \mathbf{W} \mathbf{M} [\mathbf{W}^{\star}]^{\top} \right).
\end{aligned} \tag{17}
$$
We need the $\text{diag}$ operator when finding the optimal value of $\boldsymbol{\Psi}$ because we only care about the diagonal elements. In other words, this is a logical rather than linear algebraic constraint.
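For readers who prefer code, here is a minimal NumPy sketch of the E-step (Eq. 16) and M-step (Eq. 17). It assumes the data matrix has shape $P \times N$; the function name and the initialization are my own choices, not part of any reference implementation:

```python
import numpy as np

def fit_factor_analysis(X, K, n_iters=100, seed=0):
    """EM for factor analysis, following Eqs. 16-17. X has shape (P, N)."""
    rng = np.random.default_rng(seed)
    P, N = X.shape
    Sigma_tilde = X @ X.T / N            # sample second-moment matrix
    W = rng.normal(size=(P, K))
    psi = np.ones(P)                     # diagonal of Psi

    for _ in range(n_iters):
        # E-step (Eq. 16).
        Psi_inv = np.diag(1.0 / psi)
        M = np.linalg.inv(np.eye(K) + W.T @ Psi_inv @ W)
        S = M @ W.T @ Psi_inv @ X
        A = N * M + S @ S.T

        # M-step (Eq. 17).
        W_new = X @ S.T @ np.linalg.inv(A)
        psi = np.diag(Sigma_tilde - Sigma_tilde @ Psi_inv @ W @ M @ W_new.T).copy()
        W = W_new

    return W, psi
```

Note that, as with any factor analysis fit, the recovered loading matrix is only identified up to a rotation of the latent space.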
EM for probabilistic PCA
We will now see that the EM updates for probabilistic PCA are closely related. The generative model for probabilistic PCA is
$$
\begin{aligned}
\mathbf{x}_n &= \mathbf{W} \mathbf{z}_n + \mathbf{u}_n,
\\
\mathbf{z}_n &\sim \mathcal{N}_K(\mathbf{0}, \mathbf{I}),
\\
\mathbf{u}_n &\sim \mathcal{N}_P(\mathbf{0}, \sigma^2 \mathbf{I}).
\end{aligned} \tag{18}
$$
The only difference between Eq. 18 and Eq. 1 is that the noise covariance in probabilistic PCA is isotropic. Therefore, the update for $\mathbf{W}$ is exactly the same as in factor analysis:
$$
\mathbf{W}^{\star} = \mathbf{X} \mathbf{S}^{\top} \mathbf{A}^{-1}. \tag{19}
$$
See A2 for a proof that this is equivalent to Tipping’s EM update for $\mathbf{W}$. However, we need a slightly different derivation for $\sigma^2$, since it is a scalar. In probabilistic PCA, the expected complete log likelihood is
$$
Q = \sum_{n=1}^N \Big[-\frac{1}{2 \sigma^2} \mathbf{x}_n^{\top} \mathbf{x}_n + \frac{1}{\sigma^2} \mathbb{E}[\mathbf{z}_n]^{\top} \mathbf{W}^{\top} \mathbf{x}_n - \frac{1}{2 \sigma^2} \text{tr} \Big( \mathbf{W}^{\top} \mathbf{W} \mathbb{E}[\mathbf{z}_n \mathbf{z}_n^{\top}]\Big) - \frac{P}{2} \log \sigma^2 \Big]. \tag{20}
$$
Let’s take the derivative of $Q$ w.r.t. $\sigma^2$, set it equal to zero, and solve for $\sigma^2$, using the optimal value of $\mathbf{W}$:
$$
(\sigma^2)^{\star} = \frac{1}{NP} \sum_{n=1}^N \Big[\mathbf{x}_n^{\top} \mathbf{x}_n - 2 \mathbb{E}[\textbf{z}_n]^{\top} [\mathbf{W}^{\star}]^{\top} \textbf{x}_n + \text{tr} \Big( [\mathbf{W}^{\star}]^{\top} \mathbf{W}^{\star} \mathbb{E}[\mathbf{z}_n \mathbf{z}_n^{\top}]\Big) \Big] \tag{21}
$$
Notice that
$$
\begin{aligned}
\sum_{n=1}^{N} \text{tr}([\mathbf{W}^{\star}]^{\top} \mathbf{W}^{\star} \mathbb{E}[\mathbf{z}_n \mathbf{z}_n^{\top}])
&= \text{tr}([\mathbf{W}^{\star}]^{\top} \mathbf{W}^{\star} \sum_{n=1}^{N} \mathbb{E}[\mathbf{z}_n \mathbf{z}_n^{\top}])
\\
&= \text{tr}([\mathbf{W}^{\star}]^{\top} \mathbf{W}^{\star} \mathbf{A})
\\
&= \text{tr}([\mathbf{W}^{\star}]^{\top} \mathbf{X} \mathbf{S}^{\top}),
\\
\\
\sum_{n=1}^{N} \mathbb{E}[\mathbf{z}_n]^{\top} [\mathbf{W}^{\star}]^{\top} \mathbf{x}_n
&= \sum_{n=1}^{N} \text{tr}(\mathbf{x}_n \mathbb{E}[\mathbf{z}_n]^{\top} [\mathbf{W}^{\star}]^{\top})
\\
&= \text{tr} \Big(\sum_{n=1}^{N} \mathbf{x}_n \mathbb{E}[\mathbf{z}_n]^{\top} [\mathbf{W}^{\star}]^{\top} \Big)
\\
&= \text{tr}([\mathbf{W}^{\star}]^{\top} \mathbf{X} \mathbf{S}^{\top}).
\end{aligned} \tag{22}
$$
So we can write the optimal value for $\sigma^2$ as
$$
\begin{aligned}
(\sigma^2)^{\star} &= \frac{1}{NP} \sum_{n=1}^N \Big\{\mathbf{x}_n^{\top} \mathbf{x}_n - 2 \mathbb{E}[\mathbf{z}_n]^{\top} [\mathbf{W}^{\star}]^{\top} \mathbf{x}_n + \text{tr} \Big( [\mathbf{W}^{\star}]^{\top} \mathbf{W}^{\star} \mathbb{E}[\mathbf{z}_n \mathbf{z}_n^{\top}]\Big) \Big\}
\\
&= \frac{1}{NP} \Big\{ \text{tr}(\mathbf{XX}^{\top}) - \text{tr}([\mathbf{W}^{\star}]^{\top} \mathbf{X} \mathbf{S}^{\top}) \Big\}
\\
&= \frac{1}{P} \text{tr}\Big(\tilde{\boldsymbol{\Sigma}} - \tilde{\boldsymbol{\Sigma}} \mathbf{W} \frac{1}{\sigma^2} \mathbf{M} [\mathbf{W}^{\star}]^{\top} \Big).
\end{aligned} \tag{23}
$$
This matches the EM update for $\sigma^2$ in (Tipping & Bishop, 1999).
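As a sanity check on this correspondence, here is the same EM loop specialized to probabilistic PCA, i.e., factor analysis with $\boldsymbol{\Psi} = \sigma^2 \mathbf{I}$ and the $\sigma^2$ update from Eq. 23. Again, this is only a sketch with my own naming, not Tipping and Bishop’s reference implementation:

```python
import numpy as np

def fit_ppca(X, K, n_iters=100, seed=0):
    """EM for probabilistic PCA: factor analysis with Psi = sigma^2 I. X has shape (P, N)."""
    rng = np.random.default_rng(seed)
    P, N = X.shape
    Sigma_tilde = X @ X.T / N
    W = rng.normal(size=(P, K))
    sigma2 = 1.0

    for _ in range(n_iters):
        # E-step: Eq. 16 with Psi^{-1} = I / sigma^2.
        M = np.linalg.inv(np.eye(K) + W.T @ W / sigma2)
        S = M @ W.T @ X / sigma2
        A = N * M + S @ S.T

        # M-step: Eq. 19 for W, Eq. 23 for sigma^2.
        W_new = X @ S.T @ np.linalg.inv(A)
        sigma2 = np.trace(Sigma_tilde - Sigma_tilde @ W @ M @ W_new.T / sigma2) / P
        W = W_new

    return W, sigma2
```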
EM for probabilistic CCA
Next, we’ll see that the EM updates for probabilistic CCA are essentially the updates for factor analysis, with some block structure enforced. The generative model for probabilistic CCA is
$$
\begin{aligned}
\textbf{z} &\sim \mathcal{N}_K(\mathbf{0}, \mathbf{I}), \quad K \leq \min(P_1, P_2),
\\
\textbf{x}_1 \mid \textbf{z} &\sim \mathcal{N}_{P_1}(\mathbf{W}_1 \textbf{z}, \boldsymbol{\Psi}_1),
\\
\textbf{x}_2 \mid \textbf{z} &\sim \mathcal{N}_{P_2}(\mathbf{W}_2 \textbf{z}, \boldsymbol{\Psi}_2).
\end{aligned} \tag{24}
$$
Note that in probabilistic CCA, the covariance matrices $\boldsymbol{\Psi}_1$ and $\boldsymbol{\Psi}_2$ are not constrained to be diagonal. In my post on probabilistic CCA, I observed that if we appropriately tiled the data,
$$
\textbf{x} = \begin{bmatrix} \textbf{x}_1 \\ \textbf{x}_2 \end{bmatrix}, \qquad \mathbf{W} = \begin{bmatrix} \mathbf{W}_1 \\ \mathbf{W}_2 \end{bmatrix},
\qquad
\boldsymbol{\Psi} = \begin{bmatrix} \boldsymbol{\Psi}_1 & 0 \\ 0 & \boldsymbol{\Psi}_2 \end{bmatrix},
\quad
\tilde{\boldsymbol{\Sigma}} = \begin{bmatrix} \tilde{\boldsymbol{\Sigma}}_{11} & \tilde{\boldsymbol{\Sigma}}_{12}
\\
\tilde{\boldsymbol{\Sigma}}_{21} & \tilde{\boldsymbol{\Sigma}}_{22} \end{bmatrix}, \tag{25}
$$
where $P = P_1 + P_2$, $\mathbf{x} \in \mathbb{R}^{P}$, and $\mathbf{W} \in \mathbb{R}^{P \times K}$, then the EM updates for probabilistic CCA in (Bach & Jordan, 2005) would be equivalent to those of factor analysis. However, I did not explicitly show this equivalence, and there are subtleties I ignored. Let’s make the connection explicit now. The optimal update for $\mathbf{W}$, again given by our factor analysis update in Eq. 17, is
$$
\begin{aligned}
\mathbf{W}^{\star}
&= \mathbf{X} \mathbf{S}^{\top} \mathbf{A}^{-1}
\\
&= \mathbf{X} (\mathbf{M} \mathbf{W}^{\top}\boldsymbol{\Psi}^{-1} \mathbf{X})^{\top} (N \mathbf{M} + \mathbf{S} \mathbf{S}^{\top})^{-1}
\\
&= \mathbf{X} \mathbf{X}^{\top} \boldsymbol{\Psi}^{-1} \mathbf{W} \mathbf{M} \Big(N \mathbf{M} + \mathbf{M} \mathbf{W}^{\top}\boldsymbol{\Psi}^{-1} \mathbf{X} \mathbf{X}^{\top} \boldsymbol{\Psi}^{-1} \mathbf{W} \mathbf{M} \Big)^{-1}
\\
&= \frac{1}{N} \mathbf{X} \mathbf{X}^{\top} \boldsymbol{\Psi}^{-1} \mathbf{W} \mathbf{M} \Big(\mathbf{M} + \mathbf{M} \mathbf{W}^{\top}\boldsymbol{\Psi}^{-1} \frac{1}{N} \mathbf{X} \mathbf{X}^{\top} \boldsymbol{\Psi}^{-1} \mathbf{W} \mathbf{M} \Big)^{-1}
\\
&= \tilde{\boldsymbol{\Sigma}} \boldsymbol{\Psi}^{-1} \mathbf{W} \mathbf{M} \Big(\mathbf{M} + \mathbf{M} \mathbf{W}^{\top}\boldsymbol{\Psi}^{-1} \tilde{\boldsymbol{\Sigma}} \boldsymbol{\Psi}^{-1} \mathbf{W} \mathbf{M} \Big)^{-1}.
\end{aligned} \tag{26}
$$
In Eq. 26, we use the fact that $(c \mathbf{A})^{-1} = c^{-1} \mathbf{A}^{-1}$ for any nonzero constant $c$. This works because the block structure of $\boldsymbol{\Psi}$ ensures we update each block of $\mathbf{W}$ correctly, e.g.:
$$
\overbrace{\vphantom{\Bigg[}\begin{bmatrix}
\mathbf{W}_1^{\top} & \mathbf{W}_2^{\top}
\end{bmatrix}}^{\mathbf{W}^{\top}}
\overbrace{\vphantom{\Bigg[}\begin{bmatrix} \boldsymbol{\Psi}_1 & \mathbf{0} \\ \mathbf{0} & \boldsymbol{\Psi}_2 \end{bmatrix}}^{\boldsymbol{\Psi}}
=
\begin{bmatrix}
\mathbf{W}_1^{\top} \boldsymbol{\Psi}_1 & \mathbf{W}_2^{\top} \boldsymbol{\Psi}_2
\end{bmatrix}. \tag{27}
$$
The update for $\boldsymbol{\Psi}$ is trickier only because we need to maintain the diagonal block structure. For $\boldsymbol{\Psi}_i^{\star}$ with $i \in \{1, 2\}$ indexing each data modality, we have:
$$
\boldsymbol{\Psi}_i^{\star} = \tilde{\boldsymbol{\Sigma}}_{ii} - \tilde{\boldsymbol{\Sigma}} \boldsymbol{\Psi}^{-1} \mathbf{W} \mathbf{M}_i [\mathbf{W}_i^{\star}]^{\top}. \tag{28}
$$
In other words, the EM updates for probabilistic CCA immediately fall out of the EM updates for factor analysis:
$$
\begin{aligned}
\mathbf{W}^{\star}
&= \tilde{\boldsymbol{\Sigma}} \boldsymbol{\Psi}^{-1} \mathbf{W} \mathbf{M} \Big(\mathbf{M} + \mathbf{M} \mathbf{W}^{\top}\boldsymbol{\Psi}^{-1} \tilde{\boldsymbol{\Sigma}} \boldsymbol{\Psi}^{-1} \mathbf{W} \mathbf{M} \Big)^{-1},
\\
\boldsymbol{\Psi}^{\star} &= \begin{bmatrix}
\tilde{\boldsymbol{\Sigma}}_{11} - \tilde{\boldsymbol{\Sigma}}_{11} \boldsymbol{\Psi}_1^{-1} \mathbf{W}_1 \mathbf{M}_1 [\mathbf{W}_1^{\star}]^{\top} & \mathbf{0}_{12}
\\
\mathbf{0}_{21} & \tilde{\boldsymbol{\Sigma}}_{22} - \tilde{\boldsymbol{\Sigma}}_{22} \boldsymbol{\Psi}_2^{-1} \mathbf{W}_2 \mathbf{M}_2 [\mathbf{W}_2^{\star}]^{\top}
\end{bmatrix},
\end{aligned} \tag{29}
$$
where $\mathbf{0}_{12} \in \mathbb{R}^{P_1 \times P_2}$ and $\mathbf{0}_{21} = \mathbf{0}_{12}^{\top}$ are matrices of all zeros. I think of it this way: by restricting the off-diagonal blocks to be all zeros, we enforce our modeling assumption that there is no correlation between the noise terms for the two modalities, $\mathbf{x}_1$ and $\mathbf{x}_2$.
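To convince ourselves that the rewritten update in Eq. 26 really is the factor analysis update in disguise, we can compare the two forms numerically for a random block-structured $\boldsymbol{\Psi}$. This is a quick verification sketch with arbitrary dimensions, not an implementation of Bach and Jordan’s full algorithm:

```python
import numpy as np

rng = np.random.default_rng(1)
P1, P2, K, N = 4, 3, 2, 50
P = P1 + P2

X = rng.normal(size=(P, N))          # tiled data (Eq. 25), arbitrary values
W = rng.normal(size=(P, K))
R1 = rng.normal(size=(P1, P1))
R2 = rng.normal(size=(P2, P2))
Psi = np.block([                     # block-diagonal noise covariance (Eq. 25)
    [R1 @ R1.T + np.eye(P1), np.zeros((P1, P2))],
    [np.zeros((P2, P1)),     R2 @ R2.T + np.eye(P2)],
])

Psi_inv = np.linalg.inv(Psi)
Sigma_tilde = X @ X.T / N
M = np.linalg.inv(np.eye(K) + W.T @ Psi_inv @ W)
S = M @ W.T @ Psi_inv @ X
A = N * M + S @ S.T

# Factor analysis form of the update (Eq. 17)...
W_star_fa = X @ S.T @ np.linalg.inv(A)
# ...and the rewritten form used for probabilistic CCA (Eqs. 26 and 29).
W_star_cca = Sigma_tilde @ Psi_inv @ W @ M @ np.linalg.inv(
    M + M @ W.T @ Psi_inv @ Sigma_tilde @ Psi_inv @ W @ M)

print(np.allclose(W_star_fa, W_star_cca))  # True
```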
EM for IBFA
Inter-battery factor analysis (IBFA) is similar to probabilistic CCA except that it explicitly models a shared latent variable $\mathbf{z}_0$ and view-specific latent variables $\mathbf{z}_1$ and $\mathbf{z}_2$:
$$
\begin{aligned}
\mathbf{z}_0 &\sim \mathcal{N}_{K_0}(\mathbf{0}, \mathbf{I}),
\\
\mathbf{z}_1 &\sim \mathcal{N}_{K_1}(\mathbf{0}, \mathbf{I}),
\\
\mathbf{z}_2 &\sim \mathcal{N}_{K_2}(\mathbf{0}, \mathbf{I}),
\\
\mathbf{x}_1 \mid \mathbf{z}_0, \mathbf{z}_1, \mathbf{z}_2 &\sim \mathcal{N}_{P_1}(\mathbf{W}_1 \mathbf{z}_0 + \mathbf{B}_1 \mathbf{z}_1, \sigma_1^2 \mathbf{I}),
\\
\mathbf{x}_2 \mid \mathbf{z}_0, \mathbf{z}_1, \mathbf{z}_2 &\sim \mathcal{N}_{P_2}(\mathbf{W}_2 \mathbf{z}_0 + \mathbf{B}_2 \mathbf{z}_2, \sigma_2^2 \mathbf{I}).
\end{aligned} \tag{30}
$$
The latent variable $\mathbf{z}_0$ can be viewed as modeling shared variation through the linear maps $\mathbf{W}_1$ and $\mathbf{W}_2$, while the $\mathbf{z}_1$ and $\mathbf{z}_2$ variables model view-specific variation through $\mathbf{B}_1$ and $\mathbf{B}_2$ respectively. Also, notice that the covariance matrices are isotropic, not just diagonal.
So what’s the relationship between IBFA and probabilistic CCA? We can treat the view-specific latent variables as nuisance parameters and marginalize them out. For $\mathbf{z}_1$, we have:
$$
\begin{aligned}
\int p(\mathbf{x}_1, \mathbf{x}_2, \mathbf{z}_0, \mathbf{z}_1, \mathbf{z}_2) d \mathbf{z}_1
&= \int p(\mathbf{x}_1 \mid \mathbf{z}_0, \mathbf{z}_1) p(\mathbf{x}_2 \mid \mathbf{z}_0, \mathbf{z}_2) p(\mathbf{z}_0) p(\mathbf{z}_1) p(\mathbf{z}_2) d \mathbf{z}_1
\\
&= p(\mathbf{x}_2 \mid \mathbf{z}_0, \mathbf{z}_2) p(\mathbf{z}_2) p(\mathbf{z}_0) \int p(\mathbf{x}_1 \mid \mathbf{z}_0, \mathbf{z}_1) p(\mathbf{z}_1) d \mathbf{z}_1.
\end{aligned} \tag{31}
$$
Notice that $\mathbf{z}_0$ is a constant in the integration. Let $\tilde{\mathbf{x}}_1 = \mathbf{x}_1 - \mathbf{W}_1 \mathbf{z}_0$. Then we can easily integrate
$$
\int p(\tilde{\mathbf{x}}_1 \mid \mathbf{z}_1) p(\mathbf{z}_1) d\mathbf{z}_1 \tag{32}
$$
since both densities are Gaussian:
$$
\begin{aligned}
\tilde{\mathbf{x}}_1 \mid \mathbf{z}_1 &\sim \mathcal{N}_{P_1}(\mathbf{B}_1 \mathbf{z}_1, \sigma_1^2 \mathbf{I}),
\\
\mathbf{z}_1 &\sim \mathcal{N}_{K_1}(\mathbf{0}, \mathbf{I}),
\\
&\Downarrow
\\
\tilde{\mathbf{x}}_1 &\sim \mathcal{N}_{P_1}(\mathbf{0}, \mathbf{B}_1 \mathbf{B}_1^{\top} + \sigma_1^2 \mathbf{I}),
\\
&\Downarrow
\\
\mathbf{x}_1 \mid \mathbf{z}_0 &\sim \mathcal{N}_{P_1}(\mathbf{W}_1 \mathbf{z}_0, \mathbf{B}_1 \mathbf{B}_1^{\top} + \sigma_1^2 \mathbf{I}).
\end{aligned} \tag{33}
$$
Notice that if $\mathbf{B}_1 \mathbf{B}_1^{\top} + \sigma_1^2 \mathbf{I}$ were full rank, we could write it as $\boldsymbol{\Psi}_1$ as in probabilistic CCA (Eq. 24). If we applied this same logic to $\mathbf{x}_2$, we would get the same generative model as probabilistic CCA:
$$
\begin{aligned}
\mathbf{z}_0 &\sim \mathcal{N}_{K_0}(\mathbf{0}, \mathbf{I}),
\\
\mathbf{x}_1 \mid \mathbf{z}_0 &\sim \mathcal{N}_{P_1}(\mathbf{W}_1 \mathbf{z}_0, \mathbf{B}_1 \mathbf{B}_1^{\top} + \sigma_1^2 \mathbf{I}),
\\
\mathbf{x}_2 \mid \mathbf{z}_0 &\sim \mathcal{N}_{P_2}(\mathbf{W}_2 \mathbf{z}_0, \mathbf{B}_2 \mathbf{B}_2^{\top} + \sigma_2^2 \mathbf{I}).
\end{aligned} \tag{34}
$$
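We can check Eqs. 33 and 34 numerically by sampling from the IBFA model in Eq. 30 and comparing the empirical covariance of the stacked data against the covariance implied by the marginalized model. All dimensions and parameter values below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(2)
P1, P2, K0, K1, K2, N = 4, 3, 2, 2, 2, 200_000

W1 = rng.normal(size=(P1, K0)); W2 = rng.normal(size=(P2, K0))
B1 = rng.normal(size=(P1, K1)); B2 = rng.normal(size=(P2, K2))
s1, s2 = 0.5, 0.8                     # sigma_1^2 and sigma_2^2

# Sample from the IBFA generative model (Eq. 30).
Z0 = rng.normal(size=(K0, N))
Z1 = rng.normal(size=(K1, N))
Z2 = rng.normal(size=(K2, N))
X1 = W1 @ Z0 + B1 @ Z1 + np.sqrt(s1) * rng.normal(size=(P1, N))
X2 = W2 @ Z0 + B2 @ Z2 + np.sqrt(s2) * rng.normal(size=(P2, N))

# Covariance implied by the marginalized model (Eq. 34):
# W W^T plus block-diagonal view-specific covariances B_i B_i^T + sigma_i^2 I.
W = np.vstack([W1, W2])
noise = np.block([
    [B1 @ B1.T + s1 * np.eye(P1), np.zeros((P1, P2))],
    [np.zeros((P2, P1)),          B2 @ B2.T + s2 * np.eye(P2)],
])
implied = W @ W.T + noise

empirical = np.cov(np.vstack([X1, X2]))
print(np.abs(empirical - implied).max())  # small, shrinking as N grows
```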
Thus, Bach’s formulation of probabilistic CCA assumes that the true dimensionalities of $\mathbf{z}_1$ and $\mathbf{z}_2$ are big enough that the covariance matrices $\mathbf{B}_i \mathbf{B}_i^{\top} + \sigma_i^2 \mathbf{I}$ are effectively unconstrained. To marginalize over $\mathbf{z}_0$,
$$
p(\mathbf{z}_1) p(\mathbf{z}_2) \int p(\mathbf{x}_1 \mid \mathbf{z}_0, \mathbf{z}_1) p(\mathbf{x}_2 \mid \mathbf{z}_0, \mathbf{z}_2) p(\mathbf{z}_0) d \mathbf{z}_0, \tag{35}
$$
we can apply the same trick as in Eq. 33. First, let’s tile our data as in probabilistic CCA, where additionally
$$
\mathbf{B} = \begin{bmatrix}
\mathbf{B}_1 & \mathbf{0}
\\
\mathbf{0} & \mathbf{B}_2
\end{bmatrix},
\quad
\sigma^2 \mathbf{I} = \begin{bmatrix}
\sigma_1^2 \mathbf{I} & \mathbf{0}
\\
\mathbf{0} & \sigma_2^2 \mathbf{I}
\end{bmatrix}. \tag{36}
$$
Next, let’s write the product $p(\mathbf{x}_1 \mid \mathbf{z}_0, \mathbf{z}_1)\,p(\mathbf{x}_2 \mid \mathbf{z}_0, \mathbf{z}_2)$ as a single $P$-variate Gaussian:
$$
\mathbf{x} \mid \mathbf{z}_0, \mathbf{z} \sim \mathcal{N}_{P}( \mathbf{W} \mathbf{z}_0 + \mathbf{B} \mathbf{z}, \sigma^2 \mathbf{I}). \tag{37}
$$
Finally, let’s apply the trick again, where now $\mathbf{z}$ is a constant of integration. Let $\hat{\mathbf{x}} = \mathbf{x} - \mathbf{B} \mathbf{z}$. Then
$$
\begin{aligned}
\hat{\mathbf{x}} \mid \mathbf{z}_0 &\sim \mathcal{N}_P(\mathbf{W} \mathbf{z}_0, \sigma^2 \mathbf{I}),
\\
\mathbf{z}_0 &\sim \mathcal{N}_{K_0}(\mathbf{0}, \mathbf{I}),
\\
&\Downarrow
\\
\hat{\mathbf{x}} &\sim \mathcal{N}_{P}(\mathbf{0}, \mathbf{W} \mathbf{W}^{\top} + \sigma^2 \mathbf{I}),
\\
&\Downarrow
\\
\mathbf{x} &\sim \mathcal{N}_{P}(\mathbf{B} \mathbf{z}, \mathbf{W} \mathbf{W}^{\top} + \sigma^2 \mathbf{I}),
\\
&\Downarrow
\\
\begin{bmatrix}
\mathbf{x}_1 \\ \mathbf{x}_2
\end{bmatrix} &\sim \mathcal{N}_P \Bigg(
\begin{bmatrix}
\mathbf{B}_1 & \mathbf{0}
\\
\mathbf{0} & \mathbf{B}_2
\end{bmatrix}
\begin{bmatrix} \mathbf{z}_1 \\ \mathbf{z}_2 \end{bmatrix},
\begin{bmatrix}
\mathbf{W}_1 \mathbf{W}_1^{\top} + \sigma_1^2 \mathbf{I} & \mathbf{W}_1 \mathbf{W}_2^{\top}
\\
\mathbf{W}_2 \mathbf{W}_1^{\top} & \mathbf{W}_2 \mathbf{W}_2^{\top} + \sigma_2^2 \mathbf{I}
\end{bmatrix}
\Bigg),
\\
&\Downarrow
\\
\mathbf{x}_i \mid \mathbf{z}_i &\sim \mathcal{N}_{P_i}(\mathbf{B}_i \mathbf{z}_i, \mathbf{W}_i \mathbf{W}_i^{\top} + \sigma_i^2 \mathbf{I}).
\end{aligned} \tag{38}
$$
The last lines of Eqs. 33 and 38 are both essentially the same conditionally Gaussian assumption as in factor analysis (Eq. 9). Thus, the M-step updates proposed by (Klami & Kaski, 2006) are, yet again, essentially those of factor analysis:
M-step:
Marginalize over $\mathbf{z}_1$ and $\mathbf{z}_2$ and optimize $\mathbf{W}$. These are just the updates for probabilistic CCA:
$$
\begin{aligned}
\boldsymbol{\Psi} &= \begin{bmatrix}
\mathbf{B}_1 \mathbf{B}_1^{\top} + \sigma_1^2 \mathbf{I} & \mathbf{0}_{12}
\\
\mathbf{0}_{21} & \mathbf{B}_2 \mathbf{B}_2^{\top} + \sigma_2^2 \mathbf{I}
\end{bmatrix},
\\
\mathbf{M} &= (\mathbf{I} + \mathbf{W}^{\top} \boldsymbol{\Psi}^{-1} \mathbf{W})^{-1},
\\
\mathbf{W}^{\star} &= \tilde{\boldsymbol{\Sigma}} \boldsymbol{\Psi}^{-1} \mathbf{W} \mathbf{M} \Big(\mathbf{M} + \mathbf{M} \mathbf{W}^{\top}\boldsymbol{\Psi}^{-1} \tilde{\boldsymbol{\Sigma}} \boldsymbol{\Psi}^{-1} \mathbf{W} \mathbf{M} \Big)^{-1}.
\end{aligned} \tag{39}
$$
Next, marginalize over $\mathbf{z}_0$ and optimize each $\mathbf{B}_i$. Note that these steps are not probabilistic CCA since we’ve integrated out the shared latent variable. Instead, this can be thought of as two separate factor analysis models. So we update each $\mathbf{B}_i$ separately. For each $i \in \{1,2\}$, do:
$$
\begin{aligned}
\boldsymbol{\Psi}_i &= \mathbf{W}_i \mathbf{W}_i^{\top} + \sigma_i^2 \mathbf{I},
\\
\mathbf{M}_i &= (\mathbf{I} + \mathbf{B}_i^{\top} \boldsymbol{\Psi}_i^{-1} \mathbf{B}_i)^{-1},
\\
\mathbf{B}_i^{\star} &= \tilde{\boldsymbol{\Sigma}}_{ii} \boldsymbol{\Psi}_i^{-1} \mathbf{B}_i \mathbf{M}_i \Big(\mathbf{M}_i + \mathbf{M}_i \mathbf{B}_i^{\top}\boldsymbol{\Psi}_i^{-1} \tilde{\boldsymbol{\Sigma}}_{ii} \boldsymbol{\Psi}_i^{-1} \mathbf{B}_i \mathbf{M}_i \Big)^{-1},
\end{aligned} \tag{40}
$$
Finally, Klami proposes these updates for $\sigma_i^2$:
$$
\sigma^2_i = \frac{1}{P_i} \text{trace}\left(\tilde{\boldsymbol{\Sigma}}_{ii} - \tilde{\boldsymbol{\Sigma}}_{ii} \boldsymbol{\Psi}^{-1} \mathbf{B}_i \mathbf{M} \mathbf{B}_i^{\top} - \mathbf{W}_i \mathbf{W}_i^{\top} \right). \tag{41}
$$
Unfortunately, I do not know how Klami derived these updates. If the covariance matrix were just $\sigma^2_i \mathbf{I}$, then the maximum likelihood estimate would be the same as for probabilistic PCA. However, we have to deal with this term,
$$
(\mathbf{W}_i \mathbf{W}_i^{\top} + \sigma_i^2 \mathbf{I})^{-1}. \tag{42}
$$
I’m not sure how to compute the derivative of this term with respect to $\sigma_i^2$, or even how to isolate it, since the Woodbury matrix formula will keep $\sigma_i^2$ inside the inverse. Please email me if you understand how to derive Eq. 41.
Conclusion
Several Gaussian latent variable models are strikingly similar, resulting in similar EM updates. This is useful to know, since authors often take this similarity for granted. For example, in (Bach & Jordan, 2005), the authors provide no derivation of their EM updates in Section 4.1, simply stating that their model “readily yields the following EM update equations.” I suspect this is because those updates are identical to those for factor analysis sans the block structure. Furthermore, I suspect the EM updates for other models such as group factor analysis (Klami et al., 2015) and partial least squares (Murphy, 2012) are also closely related.
Acknowledgements
I thank Arto Klami for kindly replying to several emails explaining steps in these derivations.
Appendix
A1. Equivalence with Ghahramani and Hinton’s moments
In this post, we computed the first and second posterior moments as:
$$
\begin{aligned}
\mathbb{E}[\textbf{z}] &= \mathbf{M} \mathbf{W}^{\top}\boldsymbol{\Psi}^{-1}\mathbf{x}
\\
\mathbb{E}[\textbf{z} \textbf{z}^{\top}] &= \mathbf{M} + \mathbb{E}[\textbf{z}] \mathbb{E}[\textbf{z}]^{\top}
\end{aligned} \tag{A1.1}
$$
In the last post, we computed the first and second posterior moments as:
$$
\begin{aligned}
\mathbb{E}[\textbf{z}] &= \mathbf{W}^{\top} (\boldsymbol{\Psi} + \mathbf{WW}^{\top})^{-1} \mathbf{x},
\\
\mathbb{E}[\textbf{z} \textbf{z}^{\top}] &= \mathbf{I} - \mathbf{W}^{\top} (\boldsymbol{\Psi} + \mathbf{WW}^{\top})^{-1} \mathbf{W} + \mathbb{E}[\textbf{z}] \mathbb{E}[\textbf{z}]^{\top}
\end{aligned} \tag{A1.2}
$$
We now show these are equivalent:
$$
\begin{aligned}
\mathbb{E}[\textbf{z}] &= \mathbf{W}^{\top} (\boldsymbol{\Psi} + \mathbf{WW}^{\top})^{-1} \mathbf{x}
\\
&\stackrel{\dagger}{=} \mathbf{W}^{\top} [\boldsymbol{\Psi}^{-1} - \boldsymbol{\Psi}^{-1} \mathbf{W} \underbrace{(\mathbf{I} + \mathbf{W}^{\top} \boldsymbol{\Psi}^{-1} \mathbf{W})^{-1}}_{\mathbf{M}} \mathbf{W}^{\top} \boldsymbol{\Psi}^{-1}] \mathbf{x}
\\
&= \mathbf{W}^{\top} \boldsymbol{\Psi}^{-1} \mathbf{x} - \mathbf{W}^{\top} \boldsymbol{\Psi}^{-1} \mathbf{W} \mathbf{M} \mathbf{W}^{\top} \boldsymbol{\Psi}^{-1} \mathbf{x}
\\
&= [\mathbf{I} - \mathbf{W}^{\top} \boldsymbol{\Psi}^{-1} \mathbf{W} \mathbf{M}] \mathbf{W}^{\top} \boldsymbol{\Psi}^{-1} \mathbf{x}
\\
&\stackrel{\star}{=} \mathbf{M} \mathbf{W}^{\top} \boldsymbol{\Psi}^{-1} \mathbf{x}
\\
\\
\mathbb{E}[\mathbf{z} \mathbf{z}^{\top}] &= \mathbf{I} - \mathbf{W}^{\top} (\boldsymbol{\Psi} + \mathbf{WW}^{\top})^{-1} \mathbf{W} + \mathbb{E}[\mathbf{z}] \mathbb{E}[\mathbf{z}]^{\top}
\\
&\stackrel{\dagger}{=} (\mathbf{I} + \mathbf{W}^{\top} \boldsymbol{\Psi}^{-1} \mathbf{W})^{-1} + \mathbb{E}[\mathbf{z}] \mathbb{E}[\mathbf{z}]^{\top}
\\
&= \mathbf{M} + \mathbb{E}[\mathbf{z}] \mathbb{E}[\mathbf{z}]^{\top}
\end{aligned} \tag{A1.3}
$$
In step $\star$, we use the fact that
$$
\begin{aligned}
\mathbf{I} &= (\mathbf{I} + \mathbf{W}^{\top} \boldsymbol{\Psi}^{-1} \mathbf{W}) (\mathbf{I} + \mathbf{W}^{\top} \boldsymbol{\Psi}^{-1} \mathbf{W})^{-1}
\\
\mathbf{I} &= (\mathbf{I} + \mathbf{W}^{\top} \boldsymbol{\Psi}^{-1} \mathbf{W}) \mathbf{M}
\\
\mathbf{I} &= \mathbf{M} + \mathbf{W}^{\top} \boldsymbol{\Psi}^{-1} \mathbf{W} \mathbf{M}
\\
\mathbf{M} &= \mathbf{I} - \mathbf{W}^{\top} \boldsymbol{\Psi}^{-1} \mathbf{W} \mathbf{M}
\end{aligned} \tag{A1.4}
$$
Both $\dagger$ steps just use the Woodbury matrix identity.
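Here is a quick numerical check of both equivalences with random parameters; the diagonal $\boldsymbol{\Psi}$ and the dimensions are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(3)
P, K = 6, 3
W = rng.normal(size=(P, K))
Psi = np.diag(rng.uniform(0.5, 2.0, size=P))
x = rng.normal(size=P)

Psi_inv = np.linalg.inv(Psi)
M = np.linalg.inv(np.eye(K) + W.T @ Psi_inv @ W)

# Posterior mean: this post's form (Eq. 13) vs. Ghahramani & Hinton's form (Eq. A1.2).
mean_ours = M @ W.T @ Psi_inv @ x
mean_gh = W.T @ np.linalg.inv(Psi + W @ W.T) @ x
print(np.allclose(mean_ours, mean_gh))  # True

# Posterior covariance: M vs. I - W^T (Psi + W W^T)^{-1} W (the dagger step).
cov_gh = np.eye(K) - W.T @ np.linalg.inv(Psi + W @ W.T) @ W
print(np.allclose(M, cov_gh))           # True
```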
A2. Equivalence with Tipping and Bishop’s update
We want to show that the EM update for $\mathbf{W}$ in (Tipping & Bishop, 1999),
$$
\mathbf{W}^{\star} = \tilde{\boldsymbol{\Sigma}} \mathbf{W} (\sigma^2 \mathbf{I} + \mathbf{G}^{-1} \mathbf{W}^{\top} \tilde{\boldsymbol{\Sigma}} \mathbf{W})^{-1}, \tag{A2.1}
$$
where $\mathbf{G} = \mathbf{W}^{\top} \mathbf{W} + \sigma^2 \mathbf{I}$ and $\tilde{\boldsymbol{\Sigma}} = \frac{1}{N} \mathbf{X} \mathbf{X}^{\top}$, is equivalent to our update:
$$
\mathbf{W}^{\star} = \mathbf{X} \mathbf{S}^{\top} \mathbf{A}^{-1}. \tag{A2.2}
$$
It is:
$$
\begin{aligned}
\mathbf{W}^{\star}
&= \mathbf{X} \mathbf{S}^{\top} \mathbf{A}^{-1}
\\
&= \mathbf{X} (\mathbf{M} \mathbf{W}^{\top} \boldsymbol{\Psi}^{-1} \mathbf{X})^{\top} (N \mathbf{M} + (\mathbf{M} \mathbf{W}^{\top} \boldsymbol{\Psi}^{-1} \mathbf{X}) (\mathbf{M} \mathbf{W}^{\top} \boldsymbol{\Psi}^{-1} \mathbf{X})^{\top})^{-1}
\\
&= \mathbf{X}\mathbf{X}^{\top} \boldsymbol{\Psi}^{-1} \mathbf{W}\mathbf{M} \Big( N \Big[ \mathbf{M} + \frac{1}{N} \mathbf{M} \mathbf{W}^{\top} \boldsymbol{\Psi}^{-1} \mathbf{X} \mathbf{X}^{\top} \boldsymbol{\Psi}^{-1} \mathbf{W} \mathbf{M} \Big] \Big)^{-1}
\\
&= \tilde{\boldsymbol{\Sigma}} \boldsymbol{\Psi}^{-1} \mathbf{W}\mathbf{M} \Big( \mathbf{M} + \mathbf{M} \mathbf{W}^{\top} \boldsymbol{\Psi}^{-1} \tilde{\boldsymbol{\Sigma}} \boldsymbol{\Psi}^{-1} \mathbf{W} \mathbf{M} \Big)^{-1}
\\
&= \tilde{\boldsymbol{\Sigma}} \boldsymbol{\Psi}^{-1} \mathbf{W}\mathbf{M} \Big( \Big[ \mathbf{I} + \mathbf{M} \mathbf{W}^{\top} \boldsymbol{\Psi}^{-1} \tilde{\boldsymbol{\Sigma}} \boldsymbol{\Psi}^{-1} \mathbf{W} \Big] \mathbf{M} \Big)^{-1}
\\
&\stackrel{\star}{=} \tilde{\boldsymbol{\Sigma}} \boldsymbol{\Psi}^{-1} \mathbf{W} \Big( \mathbf{I} + \mathbf{M} \mathbf{W}^{\top} \boldsymbol{\Psi}^{-1} \tilde{\boldsymbol{\Sigma}} \boldsymbol{\Psi}^{-1} \mathbf{W} \Big)^{-1}
\\
&\stackrel{\dagger}{=} \tilde{\boldsymbol{\Sigma}} \mathbf{W} \Big( \sigma^2 \Big[ \mathbf{I} + \mathbf{M} \mathbf{W}^{\top} \frac{1}{\sigma^2} \tilde{\boldsymbol{\Sigma}} \frac{1}{\sigma^2} \mathbf{W} \Big] \Big)^{-1}
\\
&= \tilde{\boldsymbol{\Sigma}} \mathbf{W} \Big( \sigma^2 \mathbf{I} + \frac{1}{\sigma^2} \mathbf{M} \mathbf{W}^{\top} \tilde{\boldsymbol{\Sigma}} \mathbf{W} \Big)^{-1}
\\
&= \tilde{\boldsymbol{\Sigma}} \mathbf{W} \Big( \sigma^2 \mathbf{I} + \frac{1}{\sigma^2} \Big( \mathbf{W}^{\top} \frac{1}{\sigma^2} \mathbf{W} + \mathbf{I} \Big)^{-1} \mathbf{W}^{\top} \tilde{\boldsymbol{\Sigma}} \mathbf{W} \Big)^{-1}
\\
&= \tilde{\boldsymbol{\Sigma}} \mathbf{W} ( \sigma^2 \mathbf{I} + (\mathbf{W}^{\top} \mathbf{W} + \sigma^2 \mathbf{I})^{-1} \mathbf{W}^{\top} \tilde{\boldsymbol{\Sigma}} \mathbf{W})^{-1}
\\
&= \tilde{\boldsymbol{\Sigma}} \mathbf{W}( \sigma^2 \mathbf{I} + \mathbf{G}^{-1} \mathbf{W}^{\top} \tilde{\boldsymbol{\Sigma}} \mathbf{W})^{-1}
\end{aligned} \tag{A2.3}
$$
In step $\star$, we used the fact that for two invertible matrices $\mathbf{A}$ and $\mathbf{B}$ of the same size:
$$
(\mathbf{AB})^{-1} = \mathbf{B}^{-1} \mathbf{A}^{-1}. \tag{A2.4}
$$
In step $\dagger$, we substitute $\sigma^2 \mathbf{I}$ for $\boldsymbol{\Psi}$.
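And a similar numerical check for A2, comparing our vectorized update with Tipping and Bishop’s form for a random $\mathbf{W}$, $\sigma^2$, and data matrix:

```python
import numpy as np

rng = np.random.default_rng(4)
P, K, N = 5, 2, 40
X = rng.normal(size=(P, N))
W = rng.normal(size=(P, K))
sigma2 = 0.7

Sigma_tilde = X @ X.T / N
M = np.linalg.inv(np.eye(K) + W.T @ W / sigma2)
S = M @ W.T @ X / sigma2
A = N * M + S @ S.T
G = W.T @ W + sigma2 * np.eye(K)

W_star_ours = X @ S.T @ np.linalg.inv(A)           # our update (Eq. A2.2)
W_star_tb = Sigma_tilde @ W @ np.linalg.inv(       # Tipping & Bishop's form (Eq. A2.1)
    sigma2 * np.eye(K) + np.linalg.inv(G) @ W.T @ Sigma_tilde @ W)
print(np.allclose(W_star_ours, W_star_tb))  # True
```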