Factor Analysis in Detail

Factor analysis is a statistical method for modeling high-dimensional data using a smaller number of latent variables. It is deeply related to other probabilistic models such as probabilistic PCA and probabilistic CCA. I define the model and then show in detail how to fit it.

Factor analysis is a statistical method for describing observed variables with a smaller number of unobserved variables called factors. The key idea is that by modeling the correlations in our data, we can represent it with fewer variables or dimensions.

As an example, imagine we modeled a random vector $\textbf{x}$, normally distributed, in which two variables were strongly correlated:

$$ \textbf{x} = \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} \sim \mathcal{N} \Bigg( \begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix}, \begin{bmatrix} 1 & 0.9 & 0 \\ 0.9 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} \Bigg). $$

The correlation between $x_1$ and $x_2$ allows us to model the covariance of $\textbf{x}$ as a low-rank approximation using a matrix called the loadings, plus a noise term. Simulating $\textbf{x}$ with $10{,}000$ samples and running factor analysis, we have (values are rounded):

$$ \overbrace{ \begin{bmatrix} 0.99 & 0.9 & 0.02 \\ 0.9 & 1.01 & 0.03 \\ 0.02 & 0.03 & 1.03 \end{bmatrix} }^{\text{Empirical covariance}} \approx \overbrace{ \begin{bmatrix} -0.94 & 0.01 \\ -0.95 & -0.0 \\ -0.03 & -0.16 \\ \end{bmatrix} }^{\text{Loadings $\Lambda$}} \times \overbrace{ \begin{bmatrix} -0.94 & -0.95 & -0.03 \\ 0.01 & -0.0 & -0.16 \end{bmatrix} }^{\Lambda^{\top}} + \overbrace{ \begin{bmatrix} 0.1 & 0 & 0 \\ 0 & 0.1 & 0 \\ 0 & 0 & 1.0 \end{bmatrix} }^{\text{Noise}}. $$

This is the main intuition behind factor analysis. The loadings matrix allows us to quickly read off which variables are correlated, while the diagonal noise matrix accounts for the per-variable variance that is not shared with the other variables.
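To make this concrete, here is a minimal sketch of the simulation above using NumPy and scikit-learn's `FactorAnalysis`. The exact fitted values will differ slightly from those shown, since they depend on the random samples.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)

# Covariance from the example above: x1 and x2 are strongly correlated.
Sigma = np.array([[1.0, 0.9, 0.0],
                  [0.9, 1.0, 0.0],
                  [0.0, 0.0, 1.0]])
X = rng.multivariate_normal(mean=np.zeros(3), cov=Sigma, size=10_000)

fa = FactorAnalysis(n_components=2).fit(X)
Lambda = fa.components_.T            # loadings, shape (3, 2)
Psi = np.diag(fa.noise_variance_)    # diagonal noise, shape (3, 3)

print(np.round(np.cov(X, rowvar=False), 2))   # empirical covariance
print(np.round(Lambda @ Lambda.T + Psi, 2))   # low-rank-plus-diagonal approximation
```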

Notation

Now that we have the big idea, let’s formalize the model. Consider a random vector $\textbf{x} \in \mathbb{R}^{p}$. We want to represent this variable with an unobserved latent variable $\textbf{z} \in \mathbb{R}^{k}$ where typically $k \ll p$. The generative model is

$$ \textbf{x} = \Lambda \textbf{z} + \textbf{u} \tag{1} $$

where $\Lambda \in \mathbb{R}^{p \times k}$ is the loadings matrix and $\textbf{u} \in \mathbb{R}^{p}$ is the noise or uncertainty. The factors $\textbf{z}$ are assumed to be normal with unit variance, i.e. $\textbf{z} \sim \mathcal{N}(\textbf{0}, I_k)$. This is the probabilistic analog of vectors that are of unit length and orthogonal to each other. The uncertainty $\textbf{u}$ is distributed $\mathcal{N}(\textbf{0}, \Psi)$ where $\Psi$ is a diagonal matrix. This diagonality means we assume the noise to be uncorrelated across the $p$ observed dimensions.

Using Equation 1, we can compute $p(\textbf{x})$. By the properties of affine transformations of normal random variables, we have

$$ \Lambda \textbf{z} \sim \mathcal{N}(\textbf{0}, \Lambda \Lambda^{\top}). $$

Then, using the property of the sum of independent normal random vectors, we have

$$ \Lambda \textbf{z} + \textbf{u} \sim \mathcal{N}(\textbf{0}, \Lambda \Lambda^{\top} + \Psi). $$

Hence, $p(\textbf{x}) = \mathcal{N}(\textbf{0}, \Lambda \Lambda^{\top} + \Psi)$, and so $\textbf{x}$ is also Gaussian distributed.
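As a quick sanity check of this claim, the following sketch samples from the generative model in Equation 1 with arbitrary, made-up parameters and compares the sample covariance of $\textbf{x}$ against $\Lambda \Lambda^{\top} + \Psi$.

```python
import numpy as np

rng = np.random.default_rng(1)
p, k, n = 3, 2, 100_000

# Arbitrary parameters, chosen only for illustration.
Lambda = rng.normal(size=(p, k))               # loadings
Psi = np.diag(rng.uniform(0.1, 1.0, size=p))   # diagonal noise covariance

# Generative model: x = Lambda z + u with z ~ N(0, I_k) and u ~ N(0, Psi).
Z = rng.normal(size=(n, k))
U = rng.multivariate_normal(np.zeros(p), Psi, size=n)
X = Z @ Lambda.T + U

# The sample covariance of x should be close to Lambda Lambda^T + Psi.
print(np.round(np.cov(X, rowvar=False), 2))
print(np.round(Lambda @ Lambda.T + Psi, 2))
```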

It’s important to think about this modeling assumption. Since $\Lambda \Lambda^{\top}$ is a low-rank approximation, it cannot model all possible covariance matrices for $\textbf{x}$. But if $\Psi$ were a full matrix, then we could model any covariance matrix for $\textbf{x}$. The restriction that $\Psi$ be diagonal, then, means that factor analysis models our observations as a constrained Gaussian.

EM parameter updates

A factor analysis model can be fit using the Expectation–Maximization (EM) algorithm (Dempster et al., 1977). To use EM, we need to be able to compute the expected log-likelihood, $\mathbb{E}\big[\mathcal{L}(\Lambda, \Psi \mid \textbf{x}, \textbf{z})\big]$, with respect to the conditional distribution of our latent variables $\textbf{z}$ given $\textbf{x}$ and our current estimates of the parameters (E-step),

$$ Q(\Lambda, \Psi \mid \Lambda^{t}, \Psi^{t}) = \mathbb{E}_{\textbf{z} \mid \textbf{x}, \Lambda^t, \Psi^t}\big[\log p(\textbf{x}, \textbf{z} \mid \Lambda, \Psi)\big]. $$

Then we want to find the parameters that maximize this quantity (M-step),

$$ \Lambda^{t+1}, \Psi^{t+1} = \arg\!\max_{\Lambda, \Psi} Q(\Lambda, \Psi \mid \Lambda^{t}, \Psi^{t}). $$

These steps are repeated for a set number of iterations or until convergence. There are plenty of resources that state the EM updates for factor analysis, but here I show how to derive these updates in detail, skipping over as few steps as possible.

Log-likelihood of a multivariate normal

To begin, consider the joint normal distribution of $\textbf{x}$ and $\textbf{z}$:

$$ p \bigg( \begin{bmatrix} \textbf{x} \\ \textbf{z} \end{bmatrix} \bigg) = \mathcal{N}\bigg( \begin{bmatrix} 0 \\ 0 \end{bmatrix} , \begin{bmatrix} \Lambda \Lambda^{\top} + \Psi & \Lambda \\ \Lambda^{\top} & I \end{bmatrix} \bigg). $$

To see why the covariance matrix is so defined, see the Appendix, Section 1. Using this joint distribution, we can compute the conditional distributions using convenient properties of joint Gaussians; see Section 8.1.3 of the Matrix Cookbook (Petersen et al., 2008) (henceforth the Cookbook). In particular, we have the conditional distribution

$$ p(\textbf{x} \mid \textbf{z}) = \mathcal{N}(\Lambda \textbf{z}, \Psi). $$

See the Appendix, Section 2 for a derivation. We want this distribution because we can frame the log-likelihood as follows:

$$ \log p(\textbf{x}, \textbf{z}) = \log p(\textbf{x} \mid \textbf{z}) + \log p(\textbf{z}). $$

Let’s solve for each term separately. See the Appendix, Section 3 for a derivation of this standard form of the log of a multivariate normal density. For the prior on $\textbf{z}$, with zero mean and identity covariance, we have

$$ \begin{aligned} \log p(\textbf{z}) &= -\frac{1}{2} \sum_{i=1}^{n}(\textbf{z}_i - \textbf{0})^{\top} I_k^{-1} (\textbf{z}_i - \textbf{0}) - \frac{n}{2} \ln |I_k| + \text{const} \\ &= -\frac{1}{2} \sum_{i=1}^{n} \textbf{z}_i^{\top} \textbf{z}_i - \frac{n}{2} \ln |I_k| + \text{const}. \end{aligned} $$

Now notice that when we take derivatives with respect to the model parameters $\Lambda$ and $\Psi$, this prior on $\textbf{z}$ contributes nothing; its derivative is $0$. So it’s reasonable to focus on the conditional distribution $p(\textbf{x} \mid \textbf{z})$, which is what the authors do in (Ghahramani & Hinton, 1996). That log-likelihood term is

$$ \log p(\textbf{x} \mid \textbf{z}) = -\frac{1}{2} \sum_{i=1}^{n}(\textbf{x}_i - \Lambda \textbf{z}_i)^{\top} \Psi^{-1} (\textbf{x}_i - \Lambda \textbf{z}_i) - \frac{n}{2} \ln |\Psi| + c $$

where $c$ is a constant. Taking the expectation with respect to $\textbf{z}$ given $\textbf{x}$, we have

$$ \begin{aligned} Q &= \mathbb{E} \big[ -\frac{1}{2} \sum_{i=1}^{n}(\textbf{x}_i - \Lambda \textbf{z}_i)^{\top} \Psi^{-1} (\textbf{x}_i - \Lambda \textbf{z}_i) - \frac{n}{2} \ln |\Psi| + c \big] \\ &= -\frac{1}{2} \sum_{i=1}^{n} \mathbb{E} \big[ (\textbf{x}_i - \Lambda \textbf{z}_i)^{\top} \Psi^{-1} (\textbf{x}_i - \Lambda \textbf{z}_i)\big] - \frac{n}{2} \ln |\Psi| + c \\ &= -\frac{1}{2} \sum_{i=1}^{n} \mathbb{E} \big[ \textbf{x}_i^{\top} \Psi^{-1} \textbf{x}_i - \color{#bc2612}{\textbf{z}_i^{\top} \Lambda^{\top} \Psi^{-1} \textbf{x}_i} - \color{#bc2612}{\textbf{x}_i^{\top} \Psi^{-1} \Lambda \textbf{z}_i} + \color{#11accd}{\textbf{z}_i^{\top} \Lambda^{\top} \Psi^{-1} \Lambda \textbf{z}_i} \big] - \frac{n}{2} \ln |\Psi| + c. \end{aligned} $$

We can simplify the red terms because both quantities are scalars and therefore $a = a^{\top}$:

$$ \overbrace{\color{#bc2612}{\textbf{z}_i^{\top} \Lambda^{\top} \Psi^{-1} \textbf{x}_i}}^{a^{\top}} + \overbrace{\color{#bc2612}{\textbf{x}_i^{\top} \Psi^{-1} \Lambda \textbf{z}_i}}^{a} = \overbrace{\color{#bc2612}{2 \textbf{x}_i^{\top} \Psi^{-1} \Lambda \textbf{z}_i}}^{2a}. $$

Next, we can use the trace trick (see the Appendix, Section 4) to write the blue term as:

$$ \color{#11accd}{\textbf{z}_i^{\top} \Lambda^{\top} \Psi^{-1} \Lambda \textbf{z}_i} = \color{#11accd}{\text{tr}(\Lambda^{\top} \Psi^{-1} \Lambda \textbf{z}_i \textbf{z}_i^{\top})}. $$

These forms are convenient because we can now move the expectation with respect to $\textbf{z}$ as close as possible to the terms containing $\textbf{z}$:

$$ -\frac{1}{2} \sum_{i=1}^{n} \Big( \underbrace{\textbf{x}_i^{\top} \Psi^{-1} \textbf{x}_i}_{\text{A}} \underbrace{- 2 \textbf{x}_i^{\top} \Psi^{-1} \Lambda \mathbb{E}[\textbf{z}_i]}_{\text{B}} + \underbrace{\text{tr}(\Lambda^{\top} \Psi^{-1} \Lambda \mathbb{E}[ \textbf{z}_i \textbf{z}_i^{\top}])}_{\text{C}} \Big) \underbrace{- \frac{n}{2} \ln |\Psi|}_{\text{D}} + c. \tag{2} $$

This is the E-step of the EM algorithm: we’ve computed the conditional expectation of the log-likelihood. The M-step is to find the parameters that maximize this equation. We do that by taking the derivative of this equation with respect to each parameter $\theta$, setting it equal to $0$ to find the stationary point, and then solving for the update $\theta^{\star}$.
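Before doing that, note that Equation 2 is easy to evaluate numerically, which is handy for monitoring the convergence of EM. Here is a minimal sketch, omitting the constant $c$ and assuming the per-observation expectations have been precomputed and stored as arrays; the function name is my own.

```python
import numpy as np

def expected_complete_ll(X, Lambda, Psi, Ez, Ezz):
    """Evaluate Equation 2 up to the additive constant c.

    X   : (n, p) observations.
    Ez  : (n, k) rows are E[z_i | x_i].
    Ezz : (n, k, k) slices are E[z_i z_i^T | x_i].
    """
    n = X.shape[0]
    Psi_inv = np.linalg.inv(Psi)
    term_A = np.einsum('ij,jk,ik->', X, Psi_inv, X)                     # sum_i x_i^T Psi^{-1} x_i
    term_B = -2.0 * np.einsum('ij,jk,kl,il->', X, Psi_inv, Lambda, Ez)  # sum_i -2 x_i^T Psi^{-1} Lambda E[z_i]
    M = Lambda.T @ Psi_inv @ Lambda
    term_C = np.einsum('jk,ikj->', M, Ezz)                              # sum_i tr(M E[z_i z_i^T])
    _, logdet = np.linalg.slogdet(Psi)
    return -0.5 * (term_A + term_B + term_C) - 0.5 * n * logdet
```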

Derivative w.r.t. $\Lambda$

Because of the linearity of differentiation, we can differentiate the terms $A$, $B$, $C$, and $D$ in Equation 2 separately. Clearly the derivatives of $A$ and $D$ are $0$ because they do not depend on $\Lambda$. Then we have:

$$ \frac{\partial Q}{\partial \Lambda} = \frac{\partial B}{\partial \Lambda} + \frac{\partial C}{\partial \Lambda} $$

$$ \frac{\partial B}{\partial \Lambda} = \frac{\partial}{\partial \Lambda} \Big( - 2 \textbf{x}_i^{\top} \Psi^{-1} \Lambda \mathbb{E}_{\textbf{z} \mid \textbf{x}}[\textbf{z}] \Big). $$

Since $(\Psi^{-1})^{\top} = \Psi^{-1}$, we can use Equation 70 from the Matrix Cookbook (Petersen et al., 2008) to get:

$$ \frac{\partial B}{\partial \Lambda} = -2\Psi^{-1} \textbf{x}_i \mathbb{E}_{\textbf{z} \mid \textbf{x}}[\textbf{z}]^{\top}. $$

For $C$, we can use Equation 117 from the Matrix Cookbook to get

$$ \begin{aligned} \frac{\partial C}{\partial \Lambda} &= \frac{\partial}{\partial \Lambda}\Big( \text{tr}(\Lambda^{\top} \Psi^{-1} \Lambda \mathbb{E}_{\textbf{z} \mid \textbf{x}}[ \textbf{z} \textbf{z}^{\top}] ) \Big) \\ &\stackrel{\star}{=} (\Psi^{-1} \Lambda + \Psi^{-1} \Lambda) \mathbb{E}_{\textbf{z} \mid \textbf{x}}[ \textbf{z} \textbf{z}^{\top}] \\ &= 2 \Psi^{-1} \Lambda \mathbb{E}_{\textbf{z} \mid \textbf{x}}[ \textbf{z} \textbf{z}^{\top}]. \end{aligned} $$

Step $\star$ above just uses the fact that both $\Psi^{-1}$ and $\mathbb{E}_{\textbf{z} \mid \textbf{x}}[\textbf{z} \textbf{z}^{\top}]$ are symmetric. Combining these two derivatives and distributing the $-\frac{1}{2}$ and the summation, we get

$$ \frac{\partial Q}{\partial \Lambda} = \sum_{i=1}^{n} \Psi^{-1} \textbf{x}_i \mathbb{E}_{\textbf{z} \mid \textbf{x}}[\textbf{z}]^{\top} - \sum_{i=1}^{n} \Psi^{-1} \Lambda \mathbb{E}_{\textbf{z} \mid \textbf{x}}[ \textbf{z} \textbf{z}^{\top}] = \textbf{0}. $$

We can solve for $\Lambda$ to get our update, $\Lambda^{\star}$:

$$ \Lambda^{\star} = \Big( \sum_{i=1}^{n} \textbf{x}_i \mathbb{E}_{\textbf{z} \mid \textbf{x}}[\textbf{z}]^{\top} \Big) \Big( \sum_{i=1}^{n} \mathbb{E}_{\textbf{z} \mid \textbf{x}}[ \textbf{z} \textbf{z}^{\top}] \Big)^{-1}. \tag{3} $$
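In NumPy, the update in Equation 3 is a couple of matrix products. This hypothetical `update_Lambda` helper assumes the per-observation expectations have already been computed in the E-step.

```python
import numpy as np

def update_Lambda(X, Ez, Ezz):
    """M-step update for the loadings (Equation 3).

    X   : (n, p) observations.
    Ez  : (n, k) rows are E[z_i | x_i].
    Ezz : (n, k, k) slices are E[z_i z_i^T | x_i].
    """
    A = X.T @ Ez         # sum_i x_i E[z_i | x_i]^T, shape (p, k)
    B = Ezz.sum(axis=0)  # sum_i E[z_i z_i^T | x_i], shape (k, k)
    # B is symmetric, so A B^{-1} = solve(B, A^T)^T; avoids forming the inverse.
    return np.linalg.solve(B, A.T).T
```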

Derivative w.r.t. $\Psi^{-1}$

Next, we want to compute the derivative of $Q$ with respect to $\Psi^{-1}$ and solve for $\Psi^{\star}$. Differentiating with respect to $\Psi^{-1}$ rather than $\Psi$ works out nicely because the derivative of the log-determinant term produces $\Psi$ itself. For this calculation, we need to compute the derivatives of each of the terms $A$, $B$, $C$, and $D$,

$$ \frac{\partial Q}{\partial \Psi^{-1}} = \frac{\partial A}{\partial \Psi^{-1}} + \frac{\partial B}{\partial \Psi^{-1}} + \frac{\partial C}{\partial \Psi^{-1}} + \frac{\partial D}{\partial \Psi^{-1}}. $$

Let’s tackle each term separately. For $A$, we can use Equation 72 from the Matrix Cookbook to get:

$$ \frac{\partial A}{\partial \Psi^{-1}} = \textbf{x}_i \textbf{x}_i^{\top}. $$

For $B$, we first use the fact that the term is a scalar, so $a = a^{\top}$, to rearrange the order (we’ll see why in a minute):

$$ \begin{aligned} -2 \textbf{x}_i^{\top} \Psi^{-1} \Lambda^{\star} \mathbb{E}_{\textbf{z} \mid \textbf{x}}[\textbf{z}] &= \Big(- 2 \textbf{x}_i^{\top} \Psi^{-1} \Lambda^{\star} \mathbb{E}_{\textbf{z} \mid \textbf{x}}[\textbf{z}]\Big)^{\top} \\ &= -2 \big( \Lambda^{\star} \mathbb{E}_{\textbf{z} \mid \textbf{x}}[\textbf{z}]\big)^{\top} \Psi^{-1} \textbf{x}_i. \end{aligned} $$

Then, using Equation 70 from the Cookbook, we have:

$$ \frac{\partial B}{\partial \Psi^{-1}} = -2 \Lambda^{\star} \mathbb{E}_{\textbf{z} \mid \textbf{x}}[\textbf{z}] \textbf{x}_i^{\top}. $$

For $C$, we use Equation 102 from the Cookbook:

$$ \frac{\partial C}{\partial \Psi^{-1}} = \Lambda^{\star} \mathbb{E}_{\textbf{z} \mid \textbf{x}}[ \textbf{z} \textbf{z}^{\top}] (\Lambda^{\star})^{\top}. $$

Finally, for $D$, we use Equation 56 from the Cookbook:

$$ \frac{\partial D}{\partial \Psi^{-1}} = \frac{n}{2} \Psi^{\star}. $$

If this last one is confusing, note that since $\Psi$ is diagonal,

$$ \log |\Psi^{\top} \Psi| = \log |\Psi||\Psi| = 2 \log |\Psi| \quad\implies\quad \log |\Psi| = \frac{1}{2} \log |\Psi^{\top}\Psi|. $$

Adding these derivatives together and solving for $\Psi^{\star}$, we have

$$ \begin{aligned} \frac{\partial Q}{\partial \Psi^{-1}} &= -\frac{1}{2} \sum_{i=1}^{n} \Big( \textbf{x}_i \textbf{x}_i^{\top} - 2 \Lambda^{\star} \mathbb{E}_{\textbf{z} \mid \textbf{x}}[\textbf{z}] \textbf{x}_i^{\top} + \Lambda^{\star} \mathbb{E}_{\textbf{z} \mid \textbf{x}}[ \textbf{z} \textbf{z}^{\top}] (\Lambda^{\star})^{\top} \Big) + \frac{n}{2} \Psi^{\star} = \textbf{0} \\ \\ \Psi^{\star} &= \frac{1}{n} \sum_{i=1}^{n} \Big( \textbf{x}_i \textbf{x}_i^{\top} - 2 \Lambda^{\star} \mathbb{E}_{\textbf{z} \mid \textbf{x}}[\textbf{z}] \textbf{x}_i^{\top} + \Lambda^{\star} \mathbb{E}_{\textbf{z} \mid \textbf{x}}[ \textbf{z} \textbf{z}^{\top}] (\Lambda^{\star})^{\top} \Big) \\ \\ \Psi^{\star} &= \frac{1}{n} \sum_{i=1}^{n} \Big( \textbf{x}_i \textbf{x}_i^{\top} \Big) - \frac{2}{n} \Lambda^{\star} \sum_{i=1}^{n} \Big( \mathbb{E}_{\textbf{z} \mid \textbf{x}}[\textbf{z}] \textbf{x}_i^{\top} \Big) + \frac{1}{n} \color{#bc2612}{\Lambda^{\star} \sum_{i=1}^{n} \Big( \mathbb{E}_{\textbf{z} \mid \textbf{x}}[ \textbf{z} \textbf{z}^{\top}] \Big)} \color{black}{(\Lambda^{\star})^{\top}}. \end{aligned} $$

Now note that if we rearrange Equation 3, we can get the term in red above,

$$ \color{#bc2612}{\Lambda^{\star} \Big( \sum_{i=1}^{n} \mathbb{E}_{\textbf{z} \mid \textbf{x}}[ \textbf{z} \textbf{z}^{\top}] \Big)} = \color{#bc2612}{\Big( \sum_{i=1}^{n} \textbf{x}_i \mathbb{E}_{\textbf{z} \mid \textbf{x}}[\textbf{z}]^{\top} \Big)}. $$

Performing this substitution, we get

$$ \Psi^{\star} = \frac{1}{n} \sum_{i=1}^{n} \Big( \textbf{x}_i \textbf{x}_i^{\top} \Big) - \frac{2}{n} \color{#807504}{\Lambda^{\star} \sum_{i=1}^{n} \Big(\mathbb{E}_{\textbf{z} \mid \textbf{x}}[\textbf{z}] \textbf{x}_i^{\top} \Big)} + \frac{1}{n} \color{#807504}{\Big( \sum_{i=1}^{n} \textbf{x}_i \mathbb{E}_{\textbf{z} \mid \textbf{x}}[\textbf{z}]^{\top} \Big) (\Lambda^{\star})^{\top}}. $$

And since $(ABC)^{\top} = C^{\top} B^{\top} A^{\top}$, the two terms in yellow above are transposes of one another. Because we will ultimately keep only the diagonal of this expression (see the diagonal constraint below), and a matrix and its transpose share the same diagonal, we can combine them, giving us:

$$ \begin{aligned} \Psi^{\star} &= \frac{1}{n} \sum_{i=1}^{n} \Big( \textbf{x}_i \textbf{x}_i^{\top} \Big) - \frac{1}{n} \Lambda^{\star} \sum_{i=1}^{n} \Big(\mathbb{E}_{\textbf{z} \mid \textbf{x}}[\textbf{z}] \textbf{x}_i^{\top} \Big) \\ &= \frac{1}{n} \sum_{i=1}^{n} \Big( \textbf{x}_i \textbf{x}_i^{\top} - \Lambda^{\star} \mathbb{E}_{\textbf{z} \mid \textbf{x}}[\textbf{z}] \textbf{x}_i^{\top} \Big). \end{aligned} $$

Finally, we need to add a diagonal constraint to our update. Let $\text{diag}(A)$ diagonalize a matrix $A$, meaning set the off-diagonal elements to $0$. Then the update is

$$ \Psi^{\star} = \frac{1}{n} \text{diag}\Bigg( \sum_{i=1}^{n} \textbf{x}_i \textbf{x}_i^{\top} - \Lambda^{\star} \mathbb{E}_{\textbf{z} \mid \textbf{x}}[\textbf{z}] \textbf{x}_i^{\top} \Bigg). $$
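As with the loadings, the $\Psi$ update is a short NumPy computation. This hypothetical `update_Psi` helper takes the freshly updated loadings $\Lambda^{\star}$ as an argument.

```python
import numpy as np

def update_Psi(X, Ez, Lambda_star):
    """M-step update for the diagonal noise covariance.

    X           : (n, p) observations.
    Ez          : (n, k) rows are E[z_i | x_i].
    Lambda_star : (p, k) the updated loadings from Equation 3.
    """
    n = X.shape[0]
    S = X.T @ X                          # sum_i x_i x_i^T
    C = Lambda_star @ (Ez.T @ X)         # Lambda* sum_i E[z_i | x_i] x_i^T
    return np.diag(np.diag(S - C)) / n   # keep only the diagonal
```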

Summary

The derivations above are fairly tedious, but I wanted to give them in full. My derivations follow the notation and structure of (Ghahramani & Hinton, 1996), but with my own reasoning filling in the gaps. In summary, here are the parameter updates that maximize the expected log-likelihood $Q$ given the data and the previous parameters,

$$ \begin{aligned} \Lambda^{\star} &= \Big( \sum_{i=1}^{n} \textbf{x}_i \mathbb{E}_{\textbf{z} \mid \textbf{x}}[\textbf{z}]^{\top} \Big) \Big( \sum_{i=1}^{n} \mathbb{E}_{\textbf{z} \mid \textbf{x}}[ \textbf{z} \textbf{z}^{\top}] \Big)^{-1} \\ \\ \Psi^{\star} &= \frac{1}{n} \text{diag}\Bigg( \sum_{i=1}^{n} \textbf{x}_i \textbf{x}_i^{\top} - \Lambda^{\star} \mathbb{E}_{\textbf{z} \mid \textbf{x}}[\textbf{z}] \textbf{x}_i^{\top} \Bigg). \end{aligned} $$

Notice that to compute $\Lambda^{\star}$ and $\Psi^{\star}$, we need to be able to compute $\mathbb{E}_{\textbf{z} \mid \textbf{x}}[\textbf{z}]$ and $\mathbb{E}_{\textbf{z} \mid \textbf{x}}[\textbf{z} \textbf{z}^{\top}]$. These are fairly straightforward, but you need to know about some convenient properties of multivariate normal distributions. Both derivations below rely on Equation 354 from the Cookbook. Also, following the notation of (Ghahramani & Hinton, 1996), let $\beta = \Lambda^{\top}(\Lambda \Lambda^{\top} + \Psi)^{-1}$. Then we have:

$$ \begin{aligned} \mathbb{E}_{\textbf{z} \mid \textbf{x}}[\textbf{z}] &= \mathbb{E}[\textbf{z}] + \Lambda^{\top}(\Lambda \Lambda^{\top} + \Psi)^{-1} (\textbf{x} - \mathbb{E}[\textbf{x}]) \\ &= \Lambda^{\top}(\Lambda \Lambda^{\top} + \Psi)^{-1} \textbf{x} \\ &= \beta \textbf{x} \end{aligned} $$

$$ \begin{aligned} \mathbb{E}_{\textbf{z} \mid \textbf{x}}[\textbf{z} \textbf{z}^{\top}] &= \mathbb{E}_{\textbf{z} \mid \textbf{x}}[\textbf{z}] (\mathbb{E}_{\textbf{z} \mid \textbf{x}}[\textbf{z}])^{\top} + \mathbb{V}_{\textbf{z} \mid \textbf{x}}[\textbf{z}] \\ &= \beta \textbf{x} \textbf{x}^{\top} \beta^{\top} + \mathbb{V}_{\textbf{z} \mid \textbf{x}}[\textbf{z}] \\ &= I - \Lambda^{\top} (\Lambda \Lambda^{\top} + \Psi)^{-1} \Lambda + \beta \textbf{x} \textbf{x}^{\top} \beta^{\top} \\ &= I - \beta \Lambda + \beta \textbf{x} \textbf{x}^{\top} \beta^{\top}. \end{aligned} $$

Note that $(\Lambda \Lambda^{\top} + \Psi)^{-1}$ can be computed using the Woodbury identity or matrix inversion lemma (see the Cookbook, Section 3.2.2). And we’re done.
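Putting the E-step expectations and the two M-step updates together, here is a self-contained sketch of EM for factor analysis. It assumes the data have been mean-centered and is written for clarity rather than numerical robustness; the function name and initialization scheme are my own choices, not those of (Ghahramani & Hinton, 1996).

```python
import numpy as np

def factor_analysis_em(X, k, n_iters=200, seed=0):
    """EM for factor analysis, following the updates derived above.

    X is (n, p) and assumed to be mean-centered. Returns (Lambda, Psi).
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    Lambda = rng.normal(scale=0.1, size=(p, k))
    Psi = np.diag(X.var(axis=0))   # initialize the noise with the sample variances

    for _ in range(n_iters):
        # E-step: beta = Lambda^T (Lambda Lambda^T + Psi)^{-1}.
        beta = np.linalg.solve(Lambda @ Lambda.T + Psi, Lambda).T   # (k, p)
        Ez = X @ beta.T                     # row i is E[z_i | x_i] = beta x_i
        cov_z = np.eye(k) - beta @ Lambda   # shared term I - beta Lambda
        sum_Ezz = n * cov_z + Ez.T @ Ez     # sum_i E[z_i z_i^T | x_i]

        # M-step: Equation 3 and the diagonal Psi update.
        A = X.T @ Ez                              # sum_i x_i E[z_i | x_i]^T
        Lambda = np.linalg.solve(sum_Ezz, A.T).T  # A (sum_Ezz)^{-1}
        Psi = np.diag(np.diag(X.T @ X - Lambda @ Ez.T @ X)) / n

    return Lambda, Psi
```

For high-dimensional data, the $p \times p$ solve inside the loop is where the Woodbury identity mentioned above would be used instead.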

Factor analysis, PPCA, PCCA

Factor analysis is related to probabilistic PCA (Tipping & Bishop, 1999) and probabilistic CCA (Bach & Jordan, 2005) in interesting ways. Recall that the model for factor analysis is

$$ p(\textbf{x}) = \mathcal{N}(\textbf{0}, \Lambda \Lambda^{\top} + \Psi) $$

where $\Psi$ is a diagonal matrix. The main conceptual difference between factor analysis and probabilistic PCA is that in probabilistic PCA, $\Psi$ is isotropic, meaning all the values on the diagonal are the same. In other words, probabilistic PCA assumes the noise in each dimension is uncorrelated with the others and has the same variance. This is analogous to standard PCA, in which the principal components are orthogonal unit vectors.
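One way to see the isotropic-versus-diagonal distinction numerically is to fit both models to data whose noise level differs across dimensions. The sketch below uses scikit-learn, where, as I understand it, `FactorAnalysis.noise_variance_` is a per-feature vector while `PCA.noise_variance_` is a single scalar estimated under the probabilistic PCA interpretation.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis, PCA

rng = np.random.default_rng(2)
n, p, k = 5_000, 5, 2
Lambda = rng.normal(size=(p, k))
psi = np.array([0.1, 0.2, 0.5, 1.0, 2.0])   # noise variance differs per dimension

Z = rng.normal(size=(n, k))
X = Z @ Lambda.T + rng.normal(size=(n, p)) * np.sqrt(psi)

fa = FactorAnalysis(n_components=k).fit(X)
pca = PCA(n_components=k).fit(X)

print(np.round(fa.noise_variance_, 2))   # one noise variance per dimension
print(np.round(pca.noise_variance_, 2))  # a single (isotropic) noise variance
```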

In probabilistic CCA, we have two views of our data, $\textbf{x}_a$ and $\textbf{x}_b$, but a shared latent variable $\textbf{z}$:

$$ \begin{aligned} \textbf{x}_a &= \Lambda_a \textbf{z} + \textbf{u}_a \\ \textbf{x}_b &= \Lambda_b \textbf{z} + \textbf{u}_b, \end{aligned} $$

where each view has its own loadings and noise term.

I find these kinds of comparisons helpful because I can better chunk an idea when I see how it is related to other similar methods.

   

Acknowledgements

I thank Kaiqian Zhang for finding and correcting several typos and a mistake in a derivation.

   

Appendix

1. Covariance matrix of joint normal distribution

Given the joint distribution

$$ p \bigg( \begin{bmatrix} \textbf{x} \\ \textbf{z} \end{bmatrix} \bigg) = \mathcal{N}\bigg( \begin{bmatrix} 0 \\ 0 \end{bmatrix} , \begin{bmatrix} \Lambda \Lambda^{\top} + \Psi & \Lambda \\ \Lambda^{\top} & I \end{bmatrix} \bigg), $$

we can see that $\text{Var}(\textbf{x})$ and $\text{Var}(\textbf{z})$ follow directly from the definitions of $\textbf{x}$ and $\textbf{z}$. The covariances $\text{Cov}(\textbf{x}, \textbf{z})$ and $\text{Cov}(\textbf{z}, \textbf{x})$ can also be computed from their definitions. I show $\text{Cov}(\textbf{x}, \textbf{z})$, but the derivation for $\text{Cov}(\textbf{z}, \textbf{x})$ is nearly identical:

$$ \begin{aligned} \text{Cov}(\textbf{x}, \textbf{z}) &= \mathbb{E}[(\textbf{x} - \mathbb{E}[\textbf{x}])(\textbf{z} - \mathbb{E}[\textbf{z}])^{\top}] \\ &= \mathbb{E}[(\textbf{x} - 0)(\textbf{z} - 0)^{\top}] \\ &= \mathbb{E}[(\Lambda \textbf{z} + \textbf{u})(\textbf{z})^{\top}] \\ &= \mathbb{E}[\Lambda \textbf{z} \textbf{z}^{\top} + \textbf{u} \textbf{z}^{\top}] \\ &= \Lambda \mathbb{E}[\textbf{z} \textbf{z}^{\top}] + \mathbb{E}[\textbf{u} \textbf{z}^{\top}] \\ &= \Lambda. \end{aligned} $$

We use the fact that $\mathbb{E}[\textbf{u} \textbf{z}^{\top}] = \mathbb{E}[\textbf{u}]\mathbb{E}[\textbf{z}]^{\top} = \textbf{0}$ because $\textbf{z}$ and $\textbf{u}$ are independent random variables, and $\mathbb{E}[\textbf{z}\textbf{z}^{\top}] = I_k$ because:

$$ \begin{aligned} \text{Var}(\textbf{z}) &= \mathbb{E}[\textbf{z}\textbf{z}^{\top}] - \mathbb{E}[\textbf{z}] \mathbb{E}[\textbf{z}]^{\top} \\ I_k &= \mathbb{E}[\textbf{z}\textbf{z}^{\top}] - \textbf{0}. \end{aligned} $$
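A quick Monte Carlo check of this identity: sampling $(\textbf{x}, \textbf{z})$ pairs from the generative model with made-up parameters and forming the empirical cross-covariance recovers $\Lambda$.

```python
import numpy as np

rng = np.random.default_rng(3)
p, k, n = 3, 2, 500_000
Lambda = rng.normal(size=(p, k))
Psi = np.diag(rng.uniform(0.1, 1.0, size=p))

Z = rng.normal(size=(n, k))
X = Z @ Lambda.T + rng.multivariate_normal(np.zeros(p), Psi, size=n)

# Cov(x, z) = E[x z^T] since both means are zero; compare to Lambda.
print(np.round(X.T @ Z / n, 2))
print(np.round(Lambda, 2))
```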

2. Conditional distributions

Given the joint distribution

$$ p \bigg( \begin{bmatrix} \textbf{x} \\ \textbf{z} \end{bmatrix} \bigg) = \mathcal{N}\bigg( \begin{bmatrix} 0 \\ 0 \end{bmatrix} , \begin{bmatrix} \Lambda \Lambda^{\top} + \Psi & \Lambda \\ \Lambda^{\top} & I \end{bmatrix} \bigg), $$

we can use convenient properties of joint Gaussians to compute the conditional distributions. See the Matrix Cookbook (Petersen et al., 2008), Section 8.1.3, especially Equations 353 and 354. Below, I derive $p(\textbf{x} \mid \textbf{z})$ using Equation 353:

$$ \begin{aligned} p(\textbf{x} \mid \textbf{z}) &= \mathcal{N}(\mathbb{E}[\textbf{x}] + \Lambda I_k^{-1} (\textbf{z} - \mathbb{E}[\textbf{z}]), \; \Lambda \Lambda^{\top} + \Psi - \Lambda I_k^{-1} \Lambda^{\top}) \\ &= \mathcal{N}(\textbf{0} + \Lambda(\textbf{z} - \textbf{0}), \; \Lambda \Lambda^{\top} - \Lambda \Lambda^{\top} + \Psi) \\ &= \mathcal{N}(\Lambda \textbf{z}, \Psi). \end{aligned} $$

The conditional distribution $p(\textbf{z} \mid \textbf{x})$ can be computed analogously using Equation 354.

3. Log of the multivariate normal

If $p(\textbf{x} \mid \boldsymbol{\mu}, \Sigma)$ is a multivariate normal distribution with mean $\boldsymbol{\mu}$ and covariance $\Sigma$, then the likelihood of the parameters given the data is

$$ p(X \mid \boldsymbol{\mu}, \Sigma) = \prod_{i=1}^{n} p(\textbf{x}_i \mid \boldsymbol{\mu}, \Sigma), $$

which holds because we assume our samples are i.i.d. The log-likelihood, then, is

$$ \begin{aligned} \ln p(X \mid \boldsymbol{\mu}, \Sigma) &= \sum_{i=1}^{n} \ln p(\textbf{x}_i \mid \boldsymbol{\mu}, \Sigma) \\ &= \sum_{i=1}^{n} \ln \Big( \frac{1}{\sqrt{(2 \pi)^k |\Sigma |}} \exp \big[ -\frac{1}{2} (\textbf{x}_i - \boldsymbol{\mu})^{\top} \Sigma^{-1} (\textbf{x}_i - \boldsymbol{\mu}) \big] \Big) \\ &= \sum_{i=1}^{n} \Big( - \frac{1}{2} (\textbf{x}_i - \boldsymbol{\mu})^{\top} \Sigma^{-1} (\textbf{x}_i - \boldsymbol{\mu}) - \ln \sqrt{(2 \pi)^k |\Sigma|} \Big) \\ &= - \frac{1}{2} \sum_{i=1}^{n} (\textbf{x}_i - \boldsymbol{\mu})^{\top} \Sigma^{-1} (\textbf{x}_i - \boldsymbol{\mu}) - n \ln \sqrt{(2 \pi)^k |\Sigma|} \\ &= - \frac{1}{2} \sum_{i=1}^{n}(\textbf{x}_i - \boldsymbol{\mu})^{\top} \Sigma^{-1} (\textbf{x}_i - \boldsymbol{\mu}) - \frac{n}{2} \ln (2 \pi)^k |\Sigma| \\ &= - \frac{1}{2} \sum_{i=1}^{n}(\textbf{x}_i - \boldsymbol{\mu})^{\top} \Sigma^{-1} (\textbf{x}_i - \boldsymbol{\mu}) - \frac{nk}{2} \ln 2 \pi - \frac{n}{2} \ln |\Sigma|. \end{aligned} $$

This is often written as

$$ \ln p(X \mid \boldsymbol{\mu}, \Sigma) = -\frac{1}{2} \sum_{i=1}^{n}(\textbf{x}_i - \boldsymbol{\mu})^{\top} \Sigma^{-1} (\textbf{x}_i - \boldsymbol{\mu}) - \frac{n}{2} \ln |\Sigma| + \text{const} $$

where we call $\frac{nk}{2} \ln 2 \pi$ a constant because its derivative with respect to the parameters is $0$.
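As a sanity check on this expression, the following sketch evaluates it directly and compares against SciPy's multivariate normal log-density.

```python
import numpy as np
from scipy.stats import multivariate_normal

def mvn_log_likelihood(X, mu, Sigma):
    """Log-likelihood of n i.i.d. rows of X under N(mu, Sigma)."""
    n, k = X.shape
    diffs = X - mu
    quad = np.einsum('ij,jk,ik->', diffs, np.linalg.inv(Sigma), diffs)
    return -0.5 * quad - 0.5 * n * k * np.log(2 * np.pi) - 0.5 * n * np.log(np.linalg.det(Sigma))

rng = np.random.default_rng(4)
mu = np.zeros(3)
Sigma = np.array([[1.0, 0.9, 0.0],
                  [0.9, 1.0, 0.0],
                  [0.0, 0.0, 1.0]])
X = rng.multivariate_normal(mu, Sigma, size=100)

print(mvn_log_likelihood(X, mu, Sigma))
print(multivariate_normal(mean=mu, cov=Sigma).logpdf(X).sum())  # should match
```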

4. The trace trick

The “trace trick” is the following relationship that is often applied to Gaussian distributions. Given a covariance matrix $\Sigma$ and an observation $\textbf{x}$, we have

$$ \textbf{x}^{\top} \Sigma^{-1} \textbf{x} = \text{tr}( \Sigma^{-1} \textbf{x} \textbf{x}^{\top}) $$

where $\text{tr}(\cdot)$ is the trace of a matrix, which for an arbitrary square matrix $A$ is defined as

$$ \text{tr}(A) = \sum_{i} a_{ii} = a_{11} + a_{22} + \dots + a_{nn}. $$

In words, it is the sum of the diagonal elements. There are two important properties of the trace operation. First, for a scalar $k$,

$$ \text{tr}(k) = k. $$

This makes sense because a scalar is a $1 \times 1$ matrix whose only diagonal element is itself. Second, the order of a matrix product inside the trace can be swapped when the dimensions allow it, i.e.

$$ \text{tr}(AB) = \text{tr}(BA). $$

You can see this by considering that we just sum over the elements in a different order:

$$ \begin{aligned} \text{tr}(AB) &= \sum_{i=1}^n \sum_{j=1}^m a_{ij} b_{ji} \\ &= \sum_{j=1}^m \sum_{i=1}^n b_{ji} a_{ij} \\ &= \text{tr}(BA). \end{aligned} $$

Since $\textbf{x}^{\top} \Sigma^{-1} \textbf{x} \in \mathbb{R}$, i.e. a $1 \times 1$ matrix, we have the following proof of the trace trick:

$$ \begin{aligned} \textbf{x}^{\top} \Sigma^{-1} \textbf{x} &= \text{tr}(\textbf{x}^{\top} \Sigma^{-1} \textbf{x}) \\ &= \text{tr}(\Sigma^{-1} \textbf{x} \textbf{x}^{\top}). \end{aligned} $$
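And a one-line numerical check of the trace trick, using an arbitrary positive definite $\Sigma$:

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.normal(size=3)
A = rng.normal(size=(3, 3))
Sigma = A @ A.T + np.eye(3)   # a valid (positive definite) covariance matrix
Sigma_inv = np.linalg.inv(Sigma)

lhs = x @ Sigma_inv @ x                      # x^T Sigma^{-1} x
rhs = np.trace(Sigma_inv @ np.outer(x, x))   # tr(Sigma^{-1} x x^T)
print(np.isclose(lhs, rhs))                  # True
```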

  1. Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society. Series B (Methodological), 1–38.
  2. Petersen, K. B., Pedersen, M. S., & others. (2008). The matrix cookbook. Technical University of Denmark, 7(15), 510.
  3. Ghahramani, Z., & Hinton, G. E. (1996). The EM algorithm for mixtures of factor analyzers. Technical Report CRG-TR-96-1, University of Toronto.
  4. Tipping, M. E., & Bishop, C. M. (1999). Probabilistic principal component analysis. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 61(3), 611–622.
  5. Bach, F. R., & Jordan, M. I. (2005). A probabilistic interpretation of canonical correlation analysis.