Factor analysis is a statistical method for modeling high-dimensional data using a smaller number of latent variables. It is deeply related to other probabilistic models such as probabilistic PCA and probabilistic CCA. I define the model and show how to fit it in detail.
Published
08 August 2018
Factor analysis is a statistical method for describing observed variables with a smaller number of unobserved variables called factors. The key idea is that by modeling the correlations in our data, we can represent it with fewer variables or dimensions.
As an example, imagine we modeled a random vector x, normally distributed, in which two variables were strongly correlated:
The correlation between x1 and x2 allows us to model the covariance of x as a low-rank approximation using a matrix called the loadings, plus a noise term. Simulating x with 10,000 samples and running factor analysis, we have (values are rounded):
This is the main intuition behind factor analysis. The loadings matrix allows us to quickly read off which variables are correlated, and the diagonal noise matrix accounts for arbitrary and uncorrelated offsets.
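To make this concrete, here is a small sketch of that kind of experiment: it uses NumPy to simulate a Gaussian in which two of three variables are strongly correlated, then fits a one-factor model with scikit-learn's FactorAnalysis. The specific covariance values are illustrative choices of mine, not the ones from the example above.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

# Simulate a 3-dimensional Gaussian in which x1 and x2 are strongly
# correlated and x3 is independent of both.
rng = np.random.default_rng(0)
cov = np.array([[1.0, 0.9, 0.0],
                [0.9, 1.0, 0.0],
                [0.0, 0.0, 1.0]])
X = rng.multivariate_normal(mean=np.zeros(3), cov=cov, size=10_000)

# Fit a one-factor model. The loadings should be large for x1 and x2
# (they share a factor) and near zero for x3.
fa = FactorAnalysis(n_components=1).fit(X)
print(np.round(fa.components_.T, 2))    # loadings Λ, shape (3, 1)
print(np.round(fa.noise_variance_, 2))  # diagonal of Ψ, shape (3,)
```

The fitted loadings are large for the two correlated variables and near zero for the third, with the diagonal noise term picking up the remaining per-variable variance.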
Notation
Now that we have the big idea, let’s formalize the model. Consider a random vector x∈Rp. We want to represent this variable with an unobserved latent variable z∈Rk where typically k≪p. The generative model is
x = Λz + u    (1)
where Λ∈Rp×k is the loadings matrix and u∈Rp is the noise or uncertainty. The factors z are assumed to be normal with unit variance, i.e. z∼N(0,Ik). This is the probabilistic analog of vectors that are of unit length and orthogonal to each other. The uncertainty u is distributed N(0,Ψ) where Ψ is a diagonal matrix. This diagonality means we assume the noise to be uncorrelated across the observed dimensions.
Hence, p(x)=N(0,ΛΛ⊤+Ψ), and so x is also Gaussian distributed.
It’s important to think about this modeling assumption. Since ΛΛ⊤ is a low-rank approximation, it cannot model all possible covariance matrices for x. But if Ψ were a full matrix, then we could model any covariance matrix for x. The diagonality of Ψ, then, means that factor analysis is modeling our observations as a constrained Gaussian.
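As a sanity check on Equation 1 and the marginal covariance ΛΛ⊤ + Ψ, the following sketch (with arbitrary, made-up Λ and Ψ) samples z ∼ N(0, Ik) and u ∼ N(0, Ψ), forms x = Λz + u, and confirms that the empirical covariance of x approaches ΛΛ⊤ + Ψ.

```python
import numpy as np

rng = np.random.default_rng(1)
p, k, n = 5, 2, 100_000

# Arbitrary loadings and diagonal noise covariance, chosen for illustration.
Lambda = rng.normal(size=(p, k))
Psi = np.diag(rng.uniform(0.1, 0.5, size=p))

# Generate x = Λz + u with z ~ N(0, I_k) and u ~ N(0, Ψ).
Z = rng.normal(size=(n, k))
U = rng.multivariate_normal(np.zeros(p), Psi, size=n)
X = Z @ Lambda.T + U

# The empirical covariance of x should be close to ΛΛ⊤ + Ψ.
print(np.round(np.cov(X, rowvar=False) - (Lambda @ Lambda.T + Psi), 2))
```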
EM parameter updates
A factor analysis model can be fit using the Expectation–Maximization (EM) algorithm (Dempster et al., 1977). To use EM, we need to be able to compute the expected log-likelihood, E[L(Λ,Ψ∣x,z)], with respect to the conditional distribution of our latent variables z given x and our current estimates of the parameters (E-step),
Q(Λ,Ψ ∣ Λt,Ψt) = E_{z∣x,Λt,Ψt}[log p(x,z ∣ Λ,Ψ)].
Then we want to find the parameters that maximize this quantity (M-step),
Λt+1, Ψt+1 = arg max_{Λ,Ψ} Q(Λ,Ψ ∣ Λt,Ψt).
These steps are repeated for a set number of iterations or until convergence. There are plenty of resources covering the EM updates for factor analysis, but I show how to derive these updates in detail, skipping over as few steps as possible.
Log-likelihood of a multivariate normal
To begin, consider the joint normal distribution of x and z:
p\left( \begin{bmatrix} x \\ z \end{bmatrix} \right) = \mathcal{N}\left( \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} \Lambda\Lambda^{\top} + \Psi & \Lambda \\ \Lambda^{\top} & I \end{bmatrix} \right).
To see why the covariance matrix is so defined, see the Appendix, Section 1. Using this joint distribution, we can compute the conditional distributions using convenient properties of joint Gaussians; see Section 8.1.3 of the Matrix Cookbook (Petersen et al., 2008) (henceforth Cookbook). The conditional distribution p(z∣x) gives us the conditional expectation Ez∣x[z] that we will need later, and we also get the conditional distribution p(x∣z):
p(x∣z)=N(Λz,Ψ).
See the Appendix, Section 2 for a derivation. We want this distribution, because we can frame the log-likelihood as follows:
logp(x,z)=logp(x∣z)+logp(z).
Let’s solve for each term separately. See the Appendix, Section 3 for a derivation of this standard formulation of the log of a multivariate normal in general,
Now notice that when we take derivatives with respect to the model parameters, this prior on z contributes nothing because it does not depend on Λ or Ψ. So it’s reasonable to focus on the conditional distribution p(x∣z), which is what the authors do in (Ghahramani & Hinton, 1996). That log-likelihood term is
This is the E-step of the EM algorithm: we’ve computed the conditional expectation of the log-likelihood. The M-step is to find the parameters that maximize this equation. We do that by taking the derivative of this equation with respect to each parameter θ, setting it equal to 0 to find the stationary point, and then solving for the update θ⋆.
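For reference, the expected complete-data log-likelihood being maximized has the following standard form for n observations (following Ghahramani & Hinton, 1996); up to how the constants are grouped, the three summands correspond to the terms A, B, and C referred to below, and the log-determinant term to D:

Q(\Lambda, \Psi \mid \Lambda_t, \Psi_t) = c + \frac{n}{2} \log |\Psi^{-1}| - \frac{1}{2} \sum_{i=1}^{n} \left( x_i^{\top} \Psi^{-1} x_i - 2\, x_i^{\top} \Psi^{-1} \Lambda\, \mathbb{E}_{z \mid x}[z] + \operatorname{tr}\!\left( \Lambda^{\top} \Psi^{-1} \Lambda\, \mathbb{E}_{z \mid x}[z z^{\top}] \right) \right)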
Derivative w.r.t. Λ
Because of the linearity of differentiation, we can solve for the terms A, B, C, and D in Equation 2 separately. Clearly the derivatives of A and D are 0 because they do not depend on Λ. Then we have:
∂Q/∂Λ = ∂B/∂Λ + ∂C/∂Λ
∂B/∂Λ = ∂/∂Λ ( −2 xi⊤Ψ−1Λ Ez∣x[z] ).
Since (Ψ−1)⊤ = Ψ−1, we can use Equation 70 from the Matrix Cookbook (Petersen et al., 2008) to get:
∂B/∂Λ = −2 Ψ−1 xi Ez∣x[z]⊤
For C, we can use Equation 117 from the Matrix Cookbook to get
Step ⋆ above just uses the fact that both Ψ−1 and Ez∣x[zz⊤] are symmetric. Combining these two derivatives and distributing the 1/2 and the summation, we get
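Setting this combined derivative to zero and solving for Λ gives the update in its standard form (as in Ghahramani & Hinton, 1996), where the expectations are taken with respect to z given x_i:

\Lambda^{\star} = \left( \sum_{i=1}^{n} x_i\, \mathbb{E}_{z \mid x}[z]^{\top} \right) \left( \sum_{i=1}^{n} \mathbb{E}_{z \mid x}[z z^{\top}] \right)^{-1}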
Next, we want to compute the derivative of Q with respect to Ψ−1 and solve for Ψ⋆, which works out nicely because one of the derivative terms will result in inverting Ψ−1. For this calculation, we need to compute the derivatives for each of the terms A, B, C, and D,
∂Q/∂Ψ−1 = ∂A/∂Ψ−1 + ∂B/∂Ψ−1 + ∂C/∂Ψ−1 + ∂D/∂Ψ−1.
Let’s tackle each term separately. For A, we can use Equation 72 from the Matrix Cookbook to get:
∂A/∂Ψ−1 = xi xi⊤.
For B, we first use the fact that the term is a scalar, so a=a⊤, to re-arrange the order—we’ll see why in a minute:
Finally, we need to add a diagonal constraint to our update. Let diag(A) denote the operation that sets the off-diagonal elements of a matrix A to 0. Then the update is
Ψ⋆ = (1/n) diag( ∑_{i=1}^{n} xi xi⊤ − Λ⋆ Ez∣x[z] xi⊤ ).
Summary
The derivations above are fairly tedious, but I wanted to give them in full. My derivations follow the notation and structure of (Ghahramani & Hinton, 1996), but with my own reasoning filling in the gaps. In summary, here are the parameter updates that maximize the likelihood of the parameters given the data and previous parameters,
Notice that to compute Λ⋆ and Ψ⋆, we need to be able to compute Ez∣x[z] and Ez∣x[zz⊤]. These are fairly straightforward, but you need to know about some convenient properties of multivariate normal distributions. Both derivations below rely on Equation 354 from the Cookbook. Also, following the notation of (Ghahramani & Hinton, 1996), let β=Λ⊤(ΛΛ⊤+Ψ)−1. Then we have:
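These are the standard expressions, written here for a single observation x; the first follows from Equation 354 of the Cookbook, and the second from E[zz⊤∣x] = Var(z∣x) + E[z∣x]E[z∣x]⊤:

\mathbb{E}_{z \mid x}[z] = \beta x, \qquad \mathbb{E}_{z \mid x}[z z^{\top}] = I_k - \beta\Lambda + \beta x x^{\top} \beta^{\top}.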
Note that (ΛΛ⊤+Ψ)−1 can be computed using the Woodbury identity or matrix inversion lemma (see the Cookbook, section 3.2.2). And we’re done.
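Putting the E-step expectations and the two M-step updates together, here is a minimal NumPy sketch of the whole EM loop. The function and variable names are my own, it assumes the data have already been centered, and it omits convergence checks and log-likelihood monitoring.

```python
import numpy as np

def factor_analysis_em(X, k, n_iters=100, seed=0):
    """Fit Λ (p×k) and the diagonal of Ψ by EM. X is (n, p), assumed zero-mean."""
    n, p = X.shape
    rng = np.random.default_rng(seed)
    Lambda = rng.normal(size=(p, k))
    Psi = np.ones(p)  # diagonal of Ψ

    for _ in range(n_iters):
        # E-step: β = Λ⊤(ΛΛ⊤ + Ψ)⁻¹; E[z|x_i] = βx_i;
        # E[zz⊤|x_i] = I − βΛ + βx_i x_i⊤ β⊤.
        Sigma = Lambda @ Lambda.T + np.diag(Psi)
        beta = Lambda.T @ np.linalg.inv(Sigma)  # (k, p); Woodbury could be used here for large p
        Ez = X @ beta.T                         # (n, k), row i is E[z|x_i]
        Ezz_sum = n * (np.eye(k) - beta @ Lambda) + Ez.T @ Ez  # ∑_i E[zz⊤|x_i]

        # M-step: Λ⋆ = (∑_i x_i E[z|x_i]⊤)(∑_i E[zz⊤|x_i])⁻¹ and
        # Ψ⋆ = (1/n) diag(∑_i x_i x_i⊤ − Λ⋆ E[z|x_i] x_i⊤).
        Lambda = (X.T @ Ez) @ np.linalg.inv(Ezz_sum)
        Psi = np.diag(X.T @ X - Lambda @ (Ez.T @ X)) / n

    return Lambda, Psi
```

On data simulated from the model, the fitted ΛΛ⊤ + Ψ matches the true covariance closely, although Λ itself is only recovered up to a rotation of the factors.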
Factor analysis, PPCA, PCCA
Factor analysis is related to probabilistic PCA (Tipping & Bishop, 1999) and probabilistic CCA (Bach & Jordan, 2005) in interesting ways. Recall that the model for factor analysis is
p(x)=N(0,ΛΛ⊤+Ψ)
where Ψ is a diagonal matrix. The main conceptual difference between factor analysis and probabilistic PCA is that in probabilistic PCA, Ψ is isotropic, meaning all the values on the diagonal are the same. In other words, probabilistic PCA models our data with a covariance matrix that assumes the noise in each component is uncorrelated and has the same variance. This is analogous to standard PCA, in which the principal components are orthogonal unit vectors.
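One way to see this difference concretely: scikit-learn's FactorAnalysis estimates a separate noise variance per dimension (the diagonal of Ψ), while its PCA exposes a single isotropic noise variance in the sense of Tipping & Bishop. A small sketch on data whose per-dimension noise levels differ (the data-generating choices here are mine):

```python
import numpy as np
from sklearn.decomposition import PCA, FactorAnalysis

rng = np.random.default_rng(2)
# One latent factor, but very different noise levels per dimension.
Z = rng.normal(size=(5_000, 1))
noise = rng.normal(size=(5_000, 3)) * np.array([0.1, 0.5, 1.0])
X = Z @ np.array([[2.0, 1.0, 0.5]]) + noise

fa = FactorAnalysis(n_components=1).fit(X)
ppca = PCA(n_components=1).fit(X)

print(np.round(fa.noise_variance_, 2))    # one value per dimension (diagonal Ψ)
print(np.round(ppca.noise_variance_, 3))  # a single shared value (isotropic σ²)
```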
In probabilistic CCA, we have two views of our data, but a shared latent variable z:
xa = Λa z + ua,    xb = Λb z + ub.
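Here is a small sampling sketch of that shared-latent-variable structure; the loadings and per-view noise covariances below are arbitrary illustrative choices (I use simple diagonal noise for convenience, although the probabilistic CCA model allows full per-view noise covariances).

```python
import numpy as np

rng = np.random.default_rng(3)
k, pa, pb, n = 2, 4, 3, 10_000

# Illustrative loadings for each view and per-view noise covariances.
Lambda_a = rng.normal(size=(pa, k))
Lambda_b = rng.normal(size=(pb, k))
Psi_a = 0.2 * np.eye(pa)
Psi_b = 0.3 * np.eye(pb)

# Both views are driven by the same latent z, which couples x_a and x_b
# even though their noise terms are independent.
Z = rng.normal(size=(n, k))
Xa = Z @ Lambda_a.T + rng.multivariate_normal(np.zeros(pa), Psi_a, size=n)
Xb = Z @ Lambda_b.T + rng.multivariate_normal(np.zeros(pb), Psi_b, size=n)

# The cross-covariance between the views should be close to Λa Λb⊤.
Xab = np.hstack([Xa, Xb])
print(np.round(np.cov(Xab, rowvar=False)[:pa, pa:] - Lambda_a @ Lambda_b.T, 2))
```

The shared z is what correlates the two views: the cross-covariance between xa and xb is Λa Λb⊤, while the noise terms only affect each view's own covariance.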
I find these kinds of comparisons helpful because I can better chunk an idea when I see how it is related to other similar methods.
Acknowledgements
I thank Kaiqian Zhang for finding and correcting several typos and a mistake in a derivation.
Appendix
1. Covariance matrix of joint normal distribution
Given the joint distribution
p\left( \begin{bmatrix} x \\ z \end{bmatrix} \right) = \mathcal{N}\left( \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} \Lambda\Lambda^{\top} + \Psi & \Lambda \\ \Lambda^{\top} & I \end{bmatrix} \right),
we can easily see that Var(x) and Var(z) come from their definitions. The covariances Cov(x,z) and Cov(z,x) can be computed from their definitions. I show Cov(x,z), but the derivation for Cov(z,x) is nearly identical:
We use the fact that E[uz⊤]=E[u]E[z⊤]=0⋅0 because z and u are independent random variables, and E[zz⊤]=Ik because:
Ik = Var(z) = E[zz⊤] − E[z]E[z]⊤ = E[zz⊤] − 0.
2. Conditional distributions
Given the joint distribution
p\left( \begin{bmatrix} x \\ z \end{bmatrix} \right) = \mathcal{N}\left( \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} \Lambda\Lambda^{\top} + \Psi & \Lambda \\ \Lambda^{\top} & I \end{bmatrix} \right),
we can use convenient properties of joint Gaussians to compute the conditional distributions. See the Matrix Cookbook (Petersen et al., 2008), Section 8.1.3, especially Equations 353 and 354. Below, I translate p(x∣z) using Equation 353:
where we call (nk/2) ln 2π a constant because its derivative is 0 with respect to all of our parameters.
4. The trace trick
The “trace trick” is the following relationship that is often applied to Gaussian distributions. Given a covariance matrix Σ and observations x, we have
x⊤Σ−1x = tr(Σ−1 x x⊤)
where tr(⋅) is the trace of a matrix, which for an arbitrary matrix A is defined as
tr(A) = ∑_i aii = a11 + a22 + ⋯ + ann.
In words, it is the sum of the diagonal elements. There are two important properties of the trace operation. First for a constant k,
tr(k)=k.
This makes sense because a scalar can be viewed as a 1×1 matrix, so the trace is just that single element. Second, the operation is commutative when the dimensionality allows it, or
tr(AB)=tr(BA).
You can see this by considering that we just sum over the elements in a different order:
Since x⊤Σ−1x ∈ R, i.e. a 1×1 matrix, we have the following proof of the trace trick:
x⊤Σ−1x = tr(x⊤Σ−1x) = tr(Σ−1 x x⊤).
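A quick numerical check of the identity, with an arbitrary positive-definite Σ:

```python
import numpy as np

rng = np.random.default_rng(4)
p = 4
A = rng.normal(size=(p, p))
Sigma = A @ A.T + p * np.eye(p)  # an arbitrary positive-definite covariance
Sigma_inv = np.linalg.inv(Sigma)
x = rng.normal(size=p)

lhs = x @ Sigma_inv @ x
rhs = np.trace(Sigma_inv @ np.outer(x, x))
print(np.isclose(lhs, rhs))  # True
```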
Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society. Series B (Methodological), 1–38.
Petersen, K. B., Pedersen, M. S., & others. (2008). The matrix cookbook. Technical University of Denmark, 7(15), 510.
Ghahramani, Z., & Hinton, G. E. (1996). The EM algorithm for mixtures of factor analyzers. Technical Report CRG-TR-96-1, University of Toronto.
Tipping, M. E., & Bishop, C. M. (1999). Probabilistic principal component analysis. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 61(3), 611–622.
Bach, F. R., & Jordan, M. I. (2005). A probabilistic interpretation of canonical correlation analysis. Technical Report 688, Department of Statistics, University of California, Berkeley.