Factor analysis is a statistical method for modeling high-dimensional data using a smaller number of latent variables. It is deeply related to other probabilistic models such as probabilistic PCA and probabilistic CCA. I define the model and show how to fit it in detail.
Published
08 August 2018
Factor analysis is a statistical method for describing observed variables with a smaller number of unobserved variables called factors. The key idea is that by modeling the correlations in our data, we can represent it with fewer variables or dimensions.
As an example, imagine we modeled a random vector x, normally distributed, in which two variables were strongly correlated:
The correlation between x1 and x2 allows us to model the covariance of x as a low-rank approximation using a matrix called the loadings, plus a noise term. Simulating x with 10,000 samples and running factor analysis, we have (values are rounded):
This is the main intuition behind factor analysis. The loadings matrix allows us to quickly read off which variables are correlated, and the diagonal noise matrix accounts for arbitrary and uncorrelated offsets.
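To make this concrete, here is a small sketch of that kind of experiment: it uses NumPy to simulate a Gaussian in which two of three variables are strongly correlated, then fits a one-factor model with scikit-learn's FactorAnalysis. The specific covariance values are illustrative choices of mine, not the ones from the example above.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

# Simulate a 3-dimensional Gaussian in which x1 and x2 are strongly
# correlated and x3 is independent of both.
rng = np.random.default_rng(0)
cov = np.array([[1.0, 0.9, 0.0],
                [0.9, 1.0, 0.0],
                [0.0, 0.0, 1.0]])
X = rng.multivariate_normal(mean=np.zeros(3), cov=cov, size=10_000)

# Fit a one-factor model. The loadings should be large for x1 and x2
# (they share a factor) and near zero for x3.
fa = FactorAnalysis(n_components=1).fit(X)
print(np.round(fa.components_.T, 2))    # loadings Λ, shape (3, 1)
print(np.round(fa.noise_variance_, 2))  # diagonal of Ψ, shape (3,)
```

The fitted loadings are large for the two correlated variables and near zero for the third, with the diagonal noise term picking up the remaining per-variable variance.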
Notation
Now that we have the big idea, let’s formalize the model. Consider a random vector x∈Rp. We want to represent this variable with an unobserved latent variable z∈Rk where typically k≪p. The generative model is
x = Λz + u    (1)
where Λ∈Rp×k is the loadings matrix and u∈Rp is the noise or uncertainty. The factors z are assumed to be normal with unit variance, i.e. z∼N(0,Ik). This is the probabilistic analog of vectors that are of unit length and orthogonal to each other. The uncertainty u is distributed N(0,Ψ) where Ψ is a diagonal matrix. This diagonality means we assume the noise to be uncorrelated across the observed dimensions.
Hence, p(x)=N(0,ΛΛ⊤+Ψ), and so x is also Gaussian distributed.
It’s important to think about this modeling assumption. Since ΛΛ⊤ is a low-rank approximation, it cannot model all possible covariance matrices for x. But if Ψ were a full matrix, then we could model any covariance matrix for x. The diagonality of Ψ, then, means that factor analysis is modeling our observations as a constrained Gaussian.
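As a sanity check on Equation 1 and the marginal covariance ΛΛ⊤ + Ψ, the following sketch (with arbitrary, made-up Λ and Ψ) samples z ∼ N(0, Ik) and u ∼ N(0, Ψ), forms x = Λz + u, and confirms that the empirical covariance of x approaches ΛΛ⊤ + Ψ.

```python
import numpy as np

rng = np.random.default_rng(1)
p, k, n = 5, 2, 100_000

# Arbitrary loadings and diagonal noise covariance, chosen for illustration.
Lambda = rng.normal(size=(p, k))
Psi = np.diag(rng.uniform(0.1, 0.5, size=p))

# Generate x = Λz + u with z ~ N(0, I_k) and u ~ N(0, Ψ).
Z = rng.normal(size=(n, k))
U = rng.multivariate_normal(np.zeros(p), Psi, size=n)
X = Z @ Lambda.T + U

# The empirical covariance of x should be close to ΛΛ⊤ + Ψ.
print(np.round(np.cov(X, rowvar=False) - (Lambda @ Lambda.T + Psi), 2))
```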
EM parameter updates
A factor analysis model can be fit using the Expectation–Maximization (EM) algorithm (Dempster et al., 1977). To use EM, we need to be able to compute the expected log-likelihood, E[L(Λ,Ψ∣x,z)], with respect to the conditional distribution of our latent variables z given x and our current estimates of the parameters (E-step),
Q(Λ,Ψ ∣ Λt,Ψt) = E_{z∣x,Λt,Ψt}[log p(x,z ∣ Λ,Ψ)].
Then we want to find the parameters that maximize this quantity (M-step),
Λt+1, Ψt+1 = arg max_{Λ,Ψ} Q(Λ,Ψ ∣ Λt,Ψt).
These steps are repeated for a set number of iterations or until convergence. There are plenty of resources covering the EM updates for factor analysis, but I show how to derive these updates in detail, skipping over as few steps as possible.
Log-likelihood of a multivariate normal
To begin, consider the joint normal distribution of x and z:
p\left( \begin{bmatrix} x \\ z \end{bmatrix} \right) = \mathcal{N}\left( \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} \Lambda\Lambda^{\top} + \Psi & \Lambda \\ \Lambda^{\top} & I \end{bmatrix} \right).
To see why the covariance matrix is so defined, see the Appendix, Section 1. Using this joint distribution, we can compute the conditional distributions using convenient properties of joint Gaussians; see Section 8.1.3 of the Matrix Cookbook (Petersen et al., 2008) (henceforth Cookbook). The conditional distribution p(z∣x) gives us the conditional expectation Ez∣x[z] that we will need later, and we also get the conditional distribution p(x∣z):
p(x∣z)=N(Λz,Ψ).
See the Appendix, Section 2 for a derivation. We want this distribution, because we can frame the log-likelihood as follows:
logp(x,z)=logp(x∣z)+logp(z).
Let’s solve for each term separately. See the Appendix, Section 3 for a derivation of this standard formulation of the log of a multivariate normal in general,
Now notice that when we take derivatives with respect to the model parameters, this prior on z contributes nothing because it does not depend on Λ or Ψ. So it’s reasonable to focus on the conditional distribution p(x∣z), which is what the authors do in (Ghahramani & Hinton, 1996). That log-likelihood term is
This is the E-step of the EM algorithm: we’ve computed the conditional expectation of the log-likelihood. The M-step is to find the parameters that maximize this equation. We do that by taking the derivative of this equation with respect to each parameter θ, setting it equal to 0 to find the stationary point, and then solving for the update θ⋆.
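For reference, the expected complete-data log-likelihood being maximized has the following standard form for n observations (following Ghahramani & Hinton, 1996); up to how the constants are grouped, the three summands correspond to the terms A, B, and C referred to below, and the log-determinant term to D:

Q(\Lambda, \Psi \mid \Lambda_t, \Psi_t) = c + \frac{n}{2} \log |\Psi^{-1}| - \frac{1}{2} \sum_{i=1}^{n} \left( x_i^{\top} \Psi^{-1} x_i - 2\, x_i^{\top} \Psi^{-1} \Lambda\, \mathbb{E}_{z \mid x}[z] + \operatorname{tr}\!\left( \Lambda^{\top} \Psi^{-1} \Lambda\, \mathbb{E}_{z \mid x}[z z^{\top}] \right) \right)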
Derivative w.r.t. Λ
Because of the linearity of differentiation, we can solve for the terms A, B, C, and D in Equation 2 separately. Clearly the derivatives of A and D are 0 because they do not depend on Λ. Then we have:
∂Q/∂Λ = ∂B/∂Λ + ∂C/∂Λ
∂B/∂Λ = ∂/∂Λ ( −2 xi⊤Ψ−1Λ Ez∣x[z] ).
Since (Ψ−1)⊤ = Ψ−1, we can use Equation 70 from the Matrix Cookbook (Petersen et al., 2008) to get:
∂B/∂Λ = −2 Ψ−1 xi Ez∣x[z]⊤
For C, we can use Equation 117 from the Matrix Cookbook to get
Step ⋆ above just uses the fact that both Ψ−1 and Ez∣x[zz⊤] are symmetric. Combining these two derivatives and distributing the 1/2 and the summation, we get
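Setting this combined derivative to zero and solving for Λ gives the update in its standard form (as in Ghahramani & Hinton, 1996), where the expectations are taken with respect to z given x_i:

\Lambda^{\star} = \left( \sum_{i=1}^{n} x_i\, \mathbb{E}_{z \mid x}[z]^{\top} \right) \left( \sum_{i=1}^{n} \mathbb{E}_{z \mid x}[z z^{\top}] \right)^{-1}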
Next, we want to compute the derivative of Q with respect to Ψ−1 and solve for Ψ⋆, which works out nicely because one of the derivative terms will result in inverting Ψ−1. For this calculation, we need to compute the derivatives for each of the terms A, B, C, and D,
∂Q/∂Ψ−1 = ∂A/∂Ψ−1 + ∂B/∂Ψ−1 + ∂C/∂Ψ−1 + ∂D/∂Ψ−1.
Let’s tackle each term separately. For A, we can use Equation 72 from the Matrix Cookbook to get:
∂A/∂Ψ−1 = xi xi⊤.
For B, we first use the fact that the term is a scalar, so a=a⊤, to re-arrange the order—we’ll see why in a minute:
Finally, we need to add a diagonal constraint to our update. Let diag(A) denote the operation that sets the off-diagonal elements of a matrix A to 0. Then the update is
Ψ⋆ = (1/n) diag( ∑_{i=1}^{n} xi xi⊤ − Λ⋆ Ez∣x[z] xi⊤ ).
Summary
The derivations above are fairly tedious, but I wanted to give them in full. My derivations follow the notation and structure of (Ghahramani & Hinton, 1996), but with my own reasoning filling in the gaps. In summary, here are the parameter updates that maximize the likelihood of the parameters given the data and previous parameters,
Notice that to compute Λ⋆ and Ψ⋆, we need to be able to compute Ez∣x[z] and Ez∣x[zz⊤]. These are fairly straightforward, but you need to know about some convenient properties of multivariate normal distributions. Both derivations below rely on Equation 354 from the Cookbook. Also, following the notation of (Ghahramani & Hinton, 1996), let β=Λ⊤(ΛΛ⊤+Ψ)−1. Then we have:
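These are the standard expressions, written here for a single observation x; the first follows from Equation 354 of the Cookbook, and the second from E[zz⊤∣x] = Var(z∣x) + E[z∣x]E[z∣x]⊤:

\mathbb{E}_{z \mid x}[z] = \beta x, \qquad \mathbb{E}_{z \mid x}[z z^{\top}] = I_k - \beta\Lambda + \beta x x^{\top} \beta^{\top}.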
Note that (ΛΛ⊤+Ψ)−1 can be computed using the Woodbury identity or matrix inversion lemma (see the Cookbook, section 3.2.2). And we’re done.
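Putting the E-step expectations and the two M-step updates together, here is a minimal NumPy sketch of the whole EM loop. The function and variable names are my own, it assumes the data have already been centered, and it omits convergence checks and log-likelihood monitoring.

```python
import numpy as np

def factor_analysis_em(X, k, n_iters=100, seed=0):
    """Fit Λ (p×k) and the diagonal of Ψ by EM. X is (n, p), assumed zero-mean."""
    n, p = X.shape
    rng = np.random.default_rng(seed)
    Lambda = rng.normal(size=(p, k))
    Psi = np.ones(p)  # diagonal of Ψ

    for _ in range(n_iters):
        # E-step: β = Λ⊤(ΛΛ⊤ + Ψ)⁻¹; E[z|x_i] = βx_i;
        # E[zz⊤|x_i] = I − βΛ + βx_i x_i⊤ β⊤.
        Sigma = Lambda @ Lambda.T + np.diag(Psi)
        beta = Lambda.T @ np.linalg.inv(Sigma)  # (k, p); Woodbury could be used here for large p
        Ez = X @ beta.T                         # (n, k), row i is E[z|x_i]
        Ezz_sum = n * (np.eye(k) - beta @ Lambda) + Ez.T @ Ez  # ∑_i E[zz⊤|x_i]

        # M-step: Λ⋆ = (∑_i x_i E[z|x_i]⊤)(∑_i E[zz⊤|x_i])⁻¹ and
        # Ψ⋆ = (1/n) diag(∑_i x_i x_i⊤ − Λ⋆ E[z|x_i] x_i⊤).
        Lambda = (X.T @ Ez) @ np.linalg.inv(Ezz_sum)
        Psi = np.diag(X.T @ X - Lambda @ (Ez.T @ X)) / n

    return Lambda, Psi
```

On data simulated from the model, the fitted ΛΛ⊤ + Ψ matches the true covariance closely, although Λ itself is only recovered up to a rotation of the factors.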
Factor analysis, PPCA, PCCA
Factor analysis is related to probabilistic PCA (Tipping & Bishop, 1999) and probabilistic CCA (Bach & Jordan, 2005) in interesting ways. Recall that the model for factor analysis is
p(x)=N(0,ΛΛ⊤+Ψ)
where Ψ is a diagonal matrix. The main conceptual difference between factor analysis and probabilistic PCA is that in probabilistic PCA, Ψ is isotropic, meaning all the values on the diagonal are the same. In other words, probabilistic PCA models our data with a covariance matrix that assumes the noise in each component is uncorrelated and has the same variance. This is analogous to standard PCA, in which the principal components are orthogonal unit vectors.
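One way to see this difference concretely: scikit-learn's FactorAnalysis estimates a separate noise variance per dimension (the diagonal of Ψ), while its PCA exposes a single isotropic noise variance in the sense of Tipping & Bishop. A small sketch on data whose per-dimension noise levels differ (the data-generating choices here are mine):

```python
import numpy as np
from sklearn.decomposition import PCA, FactorAnalysis

rng = np.random.default_rng(2)
# One latent factor, but very different noise levels per dimension.
Z = rng.normal(size=(5_000, 1))
noise = rng.normal(size=(5_000, 3)) * np.array([0.1, 0.5, 1.0])
X = Z @ np.array([[2.0, 1.0, 0.5]]) + noise

fa = FactorAnalysis(n_components=1).fit(X)
ppca = PCA(n_components=1).fit(X)

print(np.round(fa.noise_variance_, 2))    # one value per dimension (diagonal Ψ)
print(np.round(ppca.noise_variance_, 3))  # a single shared value (isotropic σ²)
```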
In probabilistic CCA, we have two views of our data, but a shared latent variable z:
xa = Λa z + ua,    xb = Λb z + ub.
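Here is a small sampling sketch of that shared-latent-variable structure; the loadings and per-view noise covariances below are arbitrary illustrative choices (I use simple diagonal noise for convenience, although the probabilistic CCA model allows full per-view noise covariances).

```python
import numpy as np

rng = np.random.default_rng(3)
k, pa, pb, n = 2, 4, 3, 10_000

# Illustrative loadings for each view and per-view noise covariances.
Lambda_a = rng.normal(size=(pa, k))
Lambda_b = rng.normal(size=(pb, k))
Psi_a = 0.2 * np.eye(pa)
Psi_b = 0.3 * np.eye(pb)

# Both views are driven by the same latent z, which couples x_a and x_b
# even though their noise terms are independent.
Z = rng.normal(size=(n, k))
Xa = Z @ Lambda_a.T + rng.multivariate_normal(np.zeros(pa), Psi_a, size=n)
Xb = Z @ Lambda_b.T + rng.multivariate_normal(np.zeros(pb), Psi_b, size=n)

# The cross-covariance between the views should be close to Λa Λb⊤.
Xab = np.hstack([Xa, Xb])
print(np.round(np.cov(Xab, rowvar=False)[:pa, pa:] - Lambda_a @ Lambda_b.T, 2))
```

The shared z is what correlates the two views: the cross-covariance between xa and xb is Λa Λb⊤, while the noise terms only affect each view's own covariance.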
I find these kinds of comparisons helpful because I can better chunk an idea when I see how it is related to other similar methods.
Acknowledgements
I thank Kaiqian Zhang for finding and correcting several typos and a mistake in a derivation.
Appendix
1. Covariance matrix of joint normal distribution
Given the joint distribution
p\left( \begin{bmatrix} x \\ z \end{bmatrix} \right) = \mathcal{N}\left( \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} \Lambda\Lambda^{\top} + \Psi & \Lambda \\ \Lambda^{\top} & I \end{bmatrix} \right),
we can easily see that Var(x) and Var(z) come from their definitions. The covariances Cov(x,z) and Cov(z,x) can be computed from their definitions. I show Cov(x,z), but the derivation for Cov(z,x) is nearly identical:
We use the fact that E[uz⊤]=E[u]E[z⊤]=0⋅0 because z and u are independent random variables, and E[zz⊤]=Ik because:
Ik = Var(z) = E[zz⊤] − E[z]E[z]⊤ = E[zz⊤] − 0.
2. Conditional distributions
Given the joint distribution
p\left( \begin{bmatrix} x \\ z \end{bmatrix} \right) = \mathcal{N}\left( \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} \Lambda\Lambda^{\top} + \Psi & \Lambda \\ \Lambda^{\top} & I \end{bmatrix} \right),
we can use convenient properties of joint Gaussians to compute the conditional distributions. See the Matrix Cookbook (Petersen et al., 2008), Section 8.1.3, especially Equations 353 and 354. Below, I translate p(x∣z) using Equation 353:
where we call (nk/2) ln 2π a constant because its derivative is 0 with respect to all of our parameters.
4. The trace trick
The “trace trick” is the following relationship that is often applied to Gaussian distributions. Given a covariance matrix Σ and observations x, we have
x⊤Σ−1x = tr(Σ−1 x x⊤)
where tr(⋅) is the trace of a matrix, which for an arbitrary matrix A is defined as
tr(A) = ∑_i aii = a11 + a22 + ⋯ + ann.
In words, it is the sum of the diagonal elements. There are two important properties of the trace operation. First for a constant k,
tr(k)=k.
This makes sense because a scalar can be viewed as a 1×1 matrix, so the trace is just that single element. Second, the operation is commutative when the dimensionality allows it, or
tr(AB)=tr(BA).
You can see this by considering that we just sum over the elements in a different order:
Since x⊤Σ−1x ∈ R, i.e. a 1×1 matrix, we have the following proof of the trace trick:
x⊤Σ−1x = tr(x⊤Σ−1x) = tr(Σ−1 x x⊤).
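A quick numerical check of the identity, with an arbitrary positive-definite Σ:

```python
import numpy as np

rng = np.random.default_rng(4)
p = 4
A = rng.normal(size=(p, p))
Sigma = A @ A.T + p * np.eye(p)  # an arbitrary positive-definite covariance
Sigma_inv = np.linalg.inv(Sigma)
x = rng.normal(size=p)

lhs = x @ Sigma_inv @ x
rhs = np.trace(Sigma_inv @ np.outer(x, x))
print(np.isclose(lhs, rhs))  # True
```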
Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society. Series B (Methodological), 1–38.
Petersen, K. B., Pedersen, M. S., & others. (2008). The matrix cookbook. Technical University of Denmark, 7(15), 510.
Ghahramani, Z., & Hinton, G. E. (1996). The EM algorithm for mixtures of factor analyzers. Technical Report CRG-TR-96-1, University of Toronto.
Tipping, M. E., & Bishop, C. M. (1999). Probabilistic principal component analysis. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 61(3), 611–622.
Bach, F. R., & Jordan, M. I. (2005). A probabilistic interpretation of canonical correlation analysis. Technical Report 688, Department of Statistics, University of California, Berkeley.