Woodbury Matrix Identity for Factor Analysis

In factor analysis, the Woodbury matrix identity allows us to invert the covariance matrix of our data $\textbf{x}$ in $O(k^3)$ time rather than $O(p^3)$ time, where $k$ and $p$ are the latent and data dimensions respectively. I explain and implement the technique.

Motivation

Given a $p$-dimensional random vector $\textbf{x}$, its generative model under factor analysis is

$$\textbf{x} = \Lambda \textbf{z} + \textbf{u}$$

where $\Lambda \in \mathbb{R}^{p \times k}$ is the loadings matrix and $\textbf{u} \in \mathbb{R}^{p}$ is the noise or uncertainty. The factors $\textbf{z}$ are assumed to be normal with unit variance, i.e. $\textbf{z} \sim \mathcal{N}(\textbf{0}, I_k)$. The uncertainty $\textbf{u}$ is distributed $\mathcal{N}(\textbf{0}, \Psi)$ where $\Psi$ is a diagonal matrix. This means that the random variable $\textbf{x}$ is modeled by

$$\textbf{x} \sim \mathcal{N}(\textbf{0}, \underbrace{\Lambda \Lambda^{\top} + \Psi}_{\Sigma}).$$

One way to view this is that since $\Lambda \Lambda^{\top}$ is a $p \times k$ matrix times a $k \times p$ matrix, it has rank at most $k$ and is therefore a low-rank approximation of the covariance of $\textbf{x}$, while $\Psi$ is a correction matrix that allows $\Sigma$ to be full rank.
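As a quick sanity check on this view, here is a minimal sketch (the dimensions, random $\Lambda$, and random $\Psi$ are all illustrative and not from this post) that samples from the generative model and confirms that the sample covariance of $\textbf{x}$ approaches $\Lambda \Lambda^{\top} + \Psi$:

import numpy as np

rng = np.random.default_rng(0)
p, k, n = 5, 2, 200000  # illustrative data dim, latent dim, and sample count

Lambda = rng.normal(size=(p, k))             # loadings
Psi    = np.diag(rng.uniform(0.5, 1.5, p))   # diagonal noise covariance

Z = rng.normal(size=(n, k))                          # z ~ N(0, I_k)
U = rng.normal(size=(n, p)) * np.sqrt(np.diag(Psi))  # u ~ N(0, Psi)
X = Z @ Lambda.T + U                                 # x = Lambda z + u

Sigma = Lambda @ Lambda.T + Psi
print(np.allclose(np.cov(X, rowvar=False), Sigma, atol=0.05))  # True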

If you write out the log-likelihood of this model, set its derivative equal to $0$, and solve for the parameters $\Lambda$ and $\Psi$, you’ll eventually run into the fact that you need to invert $\Sigma$, i.e. this shows up in the EM updates:

$$(\Lambda \Lambda^{\top} + \Psi)^{-1}$$

See my previous post on factor analysis for more details. This matrix is $p \times p$, meaning that matrix inversion is $O(p^3)$ using Gauss–Jordan elimination. In Ghahramani and Hinton’s paper on a mixture of factor analyzers (Ghahramani & Hinton, 1996), they mention that $(\Lambda \Lambda^{\top} + \Psi)$ can be “efficiently inverted using the matrix inversion lemma”, or the Woodbury matrix identity. This post explores what that comment means.

Woodbury matrix identity

The Woodbury matrix identity states that the inverse of a rank-$k$ correction of some matrix (in factor analysis, the full-rank matrix $\Psi$ is rank-corrected by $\Lambda \Lambda^{\top}$) can be computed by doing a rank-$k$ correction to the inverse of the original matrix.

If $A$ is a $p \times p$ full-rank matrix that is rank-corrected by $UCV$, where $U \in \mathbb{R}^{p \times k}$, $C \in \mathbb{R}^{k \times k}$, and $V \in \mathbb{R}^{k \times p}$, then the Woodbury identity is:

$$(A + UCV)^{-1} = A^{-1} - A^{-1} U (C^{-1} + V A^{-1} U)^{-1} V A^{-1}$$

Wikipedia has a direct proof of this that is easy to follow. In our case, we have $A = \Psi$, $U = \Lambda$, $C = I$, and $V = \Lambda^{\top}$, giving us:

$$(\Psi + \Lambda \Lambda^{\top})^{-1} = \Psi^{-1} - \Psi^{-1} \Lambda (I + \Lambda^{\top} \Psi^{-1} \Lambda)^{-1} \Lambda^{\top} \Psi^{-1}$$

The critical thing to note is that

$$(\overbrace{I + \Lambda^{\top} \Psi^{-1} \Lambda}^{k \times k})^{-1}$$

is the inverse of a $k \times k$ matrix, meaning that computing it is $O(k^3)$, where we assume that $k \ll p$. Inverting $\Psi$ is easy because it is diagonal; we just replace each element on the diagonal with its reciprocal, i.e.

$$\Psi^{-1}_{ii} \triangleq \frac{1}{\Psi_{ii}}.$$

In summary, the Woodbury matrix identity is perfect for fast matrix inversion in factor analysis. The inversion is $O(k^3)$ rather than $O(p^3)$.
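Before implementing anything, we can also sanity-check the identity numerically. The sketch below (with arbitrary random $\Lambda$ and $\Psi$ and illustrative dimensions) compares the Woodbury expression against a direct inversion of $\Lambda \Lambda^{\top} + \Psi$:

import numpy as np

rng = np.random.default_rng(1)
p, k = 50, 5  # illustrative dimensions

Lambda  = rng.normal(size=(p, k))
Psi     = np.diag(rng.uniform(0.5, 1.5, p))
Psi_inv = np.diag(1. / np.diag(Psi))  # reciprocal of each diagonal element

# Woodbury: Psi^{-1} - Psi^{-1} Lambda (I + Lambda^T Psi^{-1} Lambda)^{-1} Lambda^T Psi^{-1}.
B_inv        = np.linalg.inv(np.eye(k) + Lambda.T @ Psi_inv @ Lambda)
woodbury_inv = Psi_inv - Psi_inv @ Lambda @ B_inv @ Lambda.T @ Psi_inv

direct = np.linalg.inv(Lambda @ Lambda.T + Psi)
print(np.allclose(woodbury_inv, direct))  # True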

Implementation

I’m starting to appreciate that while numerical linear algebra is fun—it’s the perfect combination of math, programming, and efficiency—it’s hard to get right. The Woodbury identity is a good example of what I mean.

Consider this Python function to compute the Woodbury identity given a $p \times p$ diagonal matrix $P$ representing $\Psi$, and matrices $U$ and $V$:

import numpy as np

def woodbury(P, U, V):
    # k is the latent dimension, i.e. the number of columns of U.
    k = U.shape[1]
    # Fast inversion of diagonal Psi: take the reciprocal of its diagonal.
    P_inv = np.diag(1./np.diag(P))
    # B is the k by k matrix to invert.
    B_inv = np.linalg.inv(np.eye(k) + V @ P_inv @ U)
    return P_inv - P_inv @ U @ B_inv @ V @ P_inv

Comparing this method to `np.linalg.inv()`, it is certainly faster, but it is not as fast as I would have expected (Figure 1). What gives? In the experiments in Figure 1, the number of samples $n$ is 100,000, $k = 100$, and $p$, which ranges along the $x$-axis, gets as large as 10,000. Yet NumPy is competitive with our function.

Figure 1: Comparison of `np.linalg.inv()` versus inversion using the Woodbury matrix identity. The number of samples $n$ is 100,000 and $k = 100$. Results were computed on a standard laptop (2.3 GHz Intel Core i5, 16 GB RAM).

My hypothesis was that NumPy is competitive for large $p$ because `woodbury()` requires storing a lot of intermediate matrices and performs several large matrix multiplications. To find a faster way to perform our matrix multiplications, I used `np.einsum_path`. The improved function is:

import numpy as np

def woodbury_einsum(P, U, V):
    # k is the latent dimension, i.e. the number of columns of U.
    k = U.shape[1]
    # Fast inversion of diagonal Psi.
    P_inv = np.diag(1./np.diag(P))
    # Compute V @ P_inv @ U with a precomputed contraction path.
    tmp   = np.einsum('ab,bc,cd->ad',
                      V, P_inv, U,
                      optimize=['einsum_path', (1, 2), (0, 1)])
    B_inv = np.linalg.inv(np.eye(k) + tmp)
    # Compute the correction term P_inv @ U @ B_inv @ V @ P_inv.
    tmp   = np.einsum('ab,bc,cd,de,ef->af',
                      P_inv, U, B_inv, V, P_inv,
                      optimize=['einsum_path', (0, 1), (0, 1), (0, 2), (0, 1)])
    return P_inv - tmp
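For reference, the hard-coded contraction paths above are the kind of output `np.einsum_path` produces: given the subscripts and operands, it reports an optimized contraction order plus a cost summary. A small sketch with illustrative shapes (not the exact setup from the figures):

import numpy as np

p, k = 1000, 100  # illustrative dimensions
P_inv = np.diag(np.random.rand(p))
U     = np.random.rand(p, k)
V     = np.random.rand(k, p)

# Ask NumPy for an optimized contraction order for V @ P_inv @ U.
path, report = np.einsum_path('ab,bc,cd->ad', V, P_inv, U, optimize='optimal')
print(path)    # e.g. ['einsum_path', (1, 2), (0, 1)]
print(report)  # per-step cost estimates and theoretical speedup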

The results with `woodbury_einsum()` are much more satisfying (Figure 2). Now our practical implementation starts to match our expectations based on theory.

Figure 2: Comparison of `np.linalg.inv()` versus inversion using the Woodbury matrix identity optimized with `np.einsum()`. The number of samples $n$ is 100,000 and $k = 100$. Results were computed on a standard laptop (2.3 GHz Intel Core i5, 16 GB RAM).

But there’s an even faster way to tackle this problem. I asked how to make this function faster on StackOverflow, and someone observed that since we know P is a diagonal matrix, we can use NumPy’s broadcasting rather than explicitly performing matrix multiplications. This works because:

>>> import numpy as np
>>> A = np.array([2, 3])
>>> B = np.ones((2, 2))
>>> B * A
array([[2., 3.],
       [2., 3.]])
>>> B @ np.diag(A)
array([[2., 3.],
       [2., 3.]])

In words, when we write `B * A` where `A` is a vector, NumPy broadcasts, performing element-wise multiplication of `A` with every row vector in `B`. This is mathematically equivalent to the matrix product `B @ np.diag(A)`, i.e. multiplying `B` by the diagonal matrix whose diagonal is `A`. The code for this is (credit: Chris Swierczewski):

def woodbury_broadcasting(P, U, V):
    # k is the latent dimension, i.e. the number of columns of U.
    k = U.shape[1]
    # Reciprocal of Psi's diagonal; broadcasting with it replaces multiplication by P_inv.
    P_inv_diag = 1./np.diag(P)
    # (V * P_inv_diag) scales the columns of V, i.e. computes V @ P_inv.
    B_inv = np.linalg.inv(np.eye(k) + (V * P_inv_diag) @ U)
    # Row-scale by P_inv_diag on the left and column-scale on the right.
    return np.diag(P_inv_diag) - (P_inv_diag.reshape(-1, 1) * U @ B_inv @ V * P_inv_diag)

I compared this new method, `woodbury_broadcasting()`, to our method optimized with `np.einsum()` (Figure 3), and the results are impressive. The new method using broadcasting is over twice as fast. This makes a lot of sense: avoiding a couple of big matrix multiplications entirely is obviously faster than merely optimizing the order in which they are performed.

Figure 3: Comparison of `woodbury_einsum()` versus `woodbury_broadcasting()`. The number of samples $n$ is 100,000 and $k = 100$. Results were computed on a standard laptop (2.3 GHz Intel Core i5, 16 GB RAM).
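The exact benchmarking script is not shown here, but a minimal timing harness along the following lines reproduces the comparison for a single value of $p$ (the figures sweep $p$ along the $x$-axis; the dimensions below are illustrative, and it assumes the three functions above are already defined in the session):

import time
import numpy as np

p, k = 5000, 100  # illustrative dimensions
Lambda = np.random.rand(p, k)
P      = np.diag(np.random.rand(p) + 1.)  # diagonal Psi
U, V   = Lambda, Lambda.T

methods = {
    'np.linalg.inv':         lambda: np.linalg.inv(U @ V + P),
    'woodbury':              lambda: woodbury(P, U, V),
    'woodbury_einsum':       lambda: woodbury_einsum(P, U, V),
    'woodbury_broadcasting': lambda: woodbury_broadcasting(P, U, V),
}
for name, fn in methods.items():
    start = time.perf_counter()
    fn()
    print(f'{name}: {time.perf_counter() - start:.3f} s')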

These kinds of problems and solutions are some of my favorites. We started with a hard problem: an algorithm that scaled cubically with the dimension of our data. But with a little math and a little programming, we were able to recover an exact solution that, for large dimensions, is several orders of magnitude faster.

   

Acknowledgements

I want to thank Ryan Adams for telling me to look into this identity when thinking about how to make factor analysis more scalable.

  1. Ghahramani, Z., & Hinton, G. E. (1996). The EM algorithm for mixtures of factor analyzers. Technical Report CRG-TR-96-1, University of Toronto.