Principal Component Analysis

Principal component analysis (PCA) is a simple, fast, and elegant linear method for data analysis. I explore PCA in detail, first with pictures and intuition, then with linear algebra and detailed derivations, and finally with code.

Principal component analysis (PCA) (Pearson, 1901; Hotelling, 1933) is a workhorse in statistics. It is often the first tool we reach for when performing exploratory data analysis or dimension reduction. The method is simple, fast, and theoretically well-understood, and it is the bedrock upon which many more complex dimension reduction methods are built.

The goal of this post is to understand PCA in detail. We’ll start by visualizing the big idea, exploring the key concepts with as little mathematical detail or jargon as needed. In the next two sections, we’ll formalize our intuitions from the first section and then derive some basic properties of PCA. Finally, we’ll end with a complete implementation of PCA to reproduce a famous result, eigenfaces (Turk & Pentland, 1991).

This is a long post. Please feel free to jump around as desired:

  1. Visualizing the big idea
  2. Formalizing the objective function
  3. Deriving basic properties
  4. Exploring a complete example

1. Visualizing the big idea

Imagine we are naturalists visiting a strange new island, and we are tasked with naming, defining, and classifying the birds we find there. We plan to do this by first observing and cataloguing as many birds as possible and then grouping them based on their shared characteristics. But how do we decide when to place a bird into a given group?

For example, compare the ramphastos (Figure 1, bottom left) to the rufous-collared kingfisher (Figure 1, top right). We may think that these two birds are obviously different species, but why do we think that? The answer is that these two birds are each more similar to many other birds than they are to each other. We can easily find two groupings of birds, say toucans and kingfishers, such that the variation between the two groups is larger than the variation within each group.

How can we make this idea of intra- versus inter-group variation more precise? One idea would be to record many features of each bird we see: coloring, size, location, plumage, beak shape, and so forth. Then we could look for features along which all the birds on this island had the most variation. We could use this variance-maximizing feature to help us split birds into meaningful groups.

Figure 1. Different species of birds illustrated by Charles Dessalines d'Orbigny, Dictionnaire Universel D'histoire Naturelle, 1841-1849.

It’s worth asking why features with lots of variation are useful for our task, so consider the opposite: imagine that all the birds on the island had a particular feature with almost no variation. Suppose, for example, that all the birds were white. This feature, coloring, would be almost useless for classifying the island’s birds. However, all the birds on the island being white does not mean that they all belong in the same group. If some white birds have large, hooked beaks while other white birds have short, stout beaks, we might wonder if this feature, beak shape, is important for our task. This is the basic logic behind looking for features with lots of variation.

Let’s try to visualize this idea with some data. Imagine we plot two bird features against each other, say beak size against wing span (Figure 2). After plotting these data¹, we notice something: since these two features are correlated, the axis along which variation between birds is maximized is neither just beak size nor just wing span. Instead, the axis along which variation is maximized is some combination of both features, as represented by the blue dashed line. This diagonal line can be thought of as representing a new, composite feature or dimension. We can’t precisely interpret this composite feature, but we might hypothesize that it represents “bird size”.

In the language of statistics and linear algebra, we call the direction of this blue dashed line the first principal component (PC). More formally, the first PC is a unit vector that defines an axis along which our data’s variance is maximized.

Figure 2. Beak size versus wing span using an "imaginary" data set of island bird features. The blue dashed line is the axis along which our data have maximal variation. The red dashed line is the second axis along which these two features have maximal variation, under the constraint that it is orthogonal to the first axis. The solid lines are the principal components (unit vectors), which are then scaled by the standard deviations of our data along each dimension.

After finding this first PC, what can we do? We might still be interested in finding dimensions along which our data’s variance is maximized, but without any constraints, another principal component would just be an axis that is very similar to the first. Thus, we’ll add a constraint: the next PC must be orthogonal to the first. So the second PC defines an axis along which our data’s variance is maximized but this new axis is orthogonal to the first PC.

What can we do with these two PCs? First, observe that these two PCs are orthogonal to each other and thus form a basis in \mathbb{R}^2. In other words, we can represent any bird in our data set (any black dot in Figure 2) as a linear combination of these two PCs. This allows us to perform a change of basis on our data. If we rotate or reflect our data such that the principal components are our new basis, then we get Figure 3. All the relationships in our data remain the same, but we have “whitened” our data in the sense that our two features are no longer correlated!

Figure 3. Beak size versus wing span, transformed by the principal components into new data with orthogonal features. The axes are labeled "PC1" and "PC2" because they are abstract or composite features constructed from our original features.

Again, we should stop and question things. Why do we care about this change of basis? Figure 3 has the answer. Recall that our task as naturalists on a strange island is to classify all the new birds. Imagine that, being short on time, the only features we were able to collect about the island’s birds were beak size and wing span, so we must classify the birds using only these two features. What Figure 3 suggests is that, in fact, we only need one feature to classify the birds: the first principal component. This fact, that we need only one of the features rather than both, is a deep idea. You may know that PCA is used for dimension reduction, and this is why. PCA allows us to find new composite features based on how much variation they explain, allowing us to keep only the new features that explain a large fraction of the variance in our data. These new composite features are important, so let’s name them; we’ll call them latent features or latent variables.

I think it’s worth reminding ourselves that while all this talk of “principal components” and “change of basis” and “orthogonality” is abstract, its meaning can be interpretable and even intuitive. Imagine that a brilliant ornithologist visited our island and correctly classified all the birds into three groups (Figure 4). Since the first PC captures most of the variation, this would immediately suggest that if we actually looked at the birds and their groupings, we would find large beaks and large wingspans on one end of the spectrum, and small beaks and small wingspans on the other. The first PC is simply capturing the variation in both, which is possible precisely because these two features covary.

Figure 4. Our imaginary data with "ground-truth" labels. Since PC1 captures correlation between beak size and wing span, we expect large birds on one end of the spectrum and small birds on the other end.

2. Formalizing the objective function

Now that we understand the big picture, we can ask: how do we find these principal components? Let’s turn our geometric picture into an optimization problem.

Computing the principal components

Assume our data is mean-centered, and consider an arbitrary unit vector² \mathbf{u}_1 (Figure 5). We want to maximize the variance of our data after it is projected onto a new axis defined by \mathbf{u}_1. In other words, we want to maximize the squared distance³ of our projected data from the origin.

To formalize this, consider an arbitrary datum in our collection, \mathbf{x}_i, and let’s denote the projection of \mathbf{x}_i onto \mathbf{u}_1 as \mathbf{p}_i:

\mathbf{p}_i \triangleq \text{proj}_{\mathbf{u}_1} \mathbf{x}_i = \frac{\mathbf{x}_i \cdot \mathbf{u}_1}{\mathbf{u}_1 \cdot \mathbf{u}_1} \mathbf{u}_1. \tag{1}

We want to find the optimal \mathbf{u}_1^{\star} such that the squared length of \mathbf{p}_i is maximized, where the length is

\begin{aligned} \lVert \mathbf{p}_i \rVert_2 &= \left\lVert \frac{\mathbf{x}_i \cdot \mathbf{u}_1}{\mathbf{u}_1 \cdot \mathbf{u}_1} \mathbf{u}_1 \right\rVert_2 \\ &= \frac{| \mathbf{x}_i \cdot \mathbf{u}_1| \lVert \mathbf{u}_1 \rVert_2 }{|\mathbf{u}_1 \cdot \mathbf{u}_1|} \\ &= |\mathbf{x}_i \cdot \mathbf{u}_1|. \end{aligned} \tag{2}

This suggests the following optimization problem:

\mathbf{u}_1^{\star} = \arg\!\max_{\mathbf{u}_1} \sum_{i=1}^N \left( \mathbf{u}_1 \cdot \mathbf{x}_i \right)^2, \quad \text{subject to $\lVert \mathbf{u}_1 \rVert_2 = 1$}. \tag{3}

This is the linear algebraic formulation of the idea in the previous section: that the first PC represents the axis along which our projected data’s variance is maximized.

Figure 5. Illustration of the PCA optimization problem. We want to find the first PC \mathbf{u}_1 such that the length of the projection of our datum \mathbf{x}_i onto \mathbf{u}_1 is maximized. In other words, we want to maximize the length of the projection \mathbf{p}_i. Alternatively, we want to minimize the length of \mathbf{y}_i, the difference between our datum \mathbf{x}_i and its projection.

A little algebraic manipulation of the summation in Equation 3 gives us

\begin{aligned} \sum_{i=1}^N \left( \mathbf{u}_1 \cdot \mathbf{x}_i \right)^2 &= \sum_{i=1}^N \left( \mathbf{u}_1 \cdot \mathbf{x}_i \right) \left( \mathbf{u}_1 \cdot \mathbf{x}_i \right) \\ &= \mathbf{u}_1^{\top} \left[ \sum_{i=1}^N \mathbf{x}_i \mathbf{x}_i^{\top } \right] \mathbf{u}_1 \\ &\stackrel{\star}{=} \mathbf{u}_1^{\top} \mathbf{X}^{\top} \mathbf{X} \mathbf{u}_1. \end{aligned} \tag{4}

See this post if you’re unconvinced by step \star. Equation 4 allows us to rewrite Equation 3 as

\mathbf{u}_1^{\star} = \arg\!\max_{\mathbf{u}_1} \left\{ \mathbf{u}_1^{\top} \mathbf{X}^{\top} \mathbf{X} \mathbf{u}_1 \right\}, \quad \text{subject to $\lVert \mathbf{u}_1 \rVert_2 = 1$}. \tag{5}

So the optimization problem for finding the first PC requires optimizing with respect to our data’s (mean-centered) covariance matrix. This is because we’re ultimately maximizing the variance of our data with respect to \mathbf{u}_1, and the covariance matrix captures the necessary relationships.

However, what we’re about to show is even better: the first PC is actually just the first eigenvector of the covariance matrix, \mathbf{X}^{\top} \mathbf{X}! This is not, at least in my mind, an obvious result. It is surprising and beautiful. To see it, observe that since \mathbf{X}^{\top} \mathbf{X} is a symmetric and positive semi-definite matrix, it has an eigendecomposition,

\mathbf{X}^{\top} \mathbf{X} = \mathbf{Q} \boldsymbol{\Lambda} \mathbf{Q}^{\top}, \tag{6}

where \mathbf{Q} is a square matrix (P \times P) with eigenvectors for columns and where \boldsymbol{\Lambda} is a diagonal matrix of eigenvalues. If we define a new variable \mathbf{w}_1 as

\mathbf{w}_1 \triangleq \mathbf{Q}^{\top} \mathbf{u}_1, \tag{7}

then we can rewrite the optimization problem in Equation 5 as

\mathbf{w}_1^{\star} = \arg\!\max_{\mathbf{w}_1} \left\{ \mathbf{w}_1^{\top} \boldsymbol{\Lambda} \mathbf{w}_1 \right\}, \quad \text{subject to $\lVert \mathbf{w}_1 \rVert_2 = 1$}. \tag{8}

The norm of \mathbf{w}_1 must also be unity since

\lVert \mathbf{w}_1 \rVert_2 = \lVert \mathbf{Q}^{\top} \mathbf{u}_1 \rVert_2 = \sqrt{\mathbf{u}_1^{\top} \mathbf{Q} \mathbf{Q}^{\top} \mathbf{u}_1} = \sqrt{\mathbf{u}_1^{\top} \mathbf{u}_1} = \lVert \mathbf{u}_1 \rVert_2 = 1. \tag{9}

Equation 9 relies on the fact that \mathbf{Q} is orthogonal and thus \mathbf{Q}^{\top} \mathbf{Q} = \mathbf{Q} \mathbf{Q}^{\top} = \mathbf{I}_P. Since \boldsymbol{\Lambda} is diagonal, the bracketed term in Equation 8 can be unpacked back into a summation, giving us

\mathbf{w}_1^{\star} = \arg\!\max_{\mathbf{w}_1} \left\{ \sum_{p=1}^{P} \lambda_p w_p^2 \right\}, \quad \text{subject to $\lVert \mathbf{w}_1 \rVert_2 = 1$}, \tag{10}

where \lambda_p is the p-th eigenvalue, associated with the p-th column of \mathbf{Q}, and w_p is the p-th element of \mathbf{w}_1. Without loss of generality, assume that the eigenvalues are ordered, \lambda_1 \geq \lambda_2 \geq \dots \geq \lambda_P \geq 0. (Recall that these eigenvalues are all non-negative, since covariance matrices are always symmetric and positive semi-definite.) Since \lambda_1 is the largest eigenvalue and \mathbf{w}_1 must have unit norm, clearly the weights that maximize the problem are

\mathbf{w}_1^{\star} = \begin{bmatrix} 1 \\ 0 \\ \vdots \\ 0 \end{bmatrix}. \tag{11}

In other words, the optimal \mathbf{w}_1^{\star} is just the first standard basis vector \mathbf{e}_1. If we plug this optimal weight back into Equation 7, we see that the optimal \mathbf{u}_1 is simply the first eigenvector:

\mathbf{u}_1^{\star} = \mathbf{Q} \mathbf{e}_1. \tag{12}

To summarize, we wanted to find the first principal component of our data. As we saw in the first section, this first principal component is a vector that defines an axis along which the variance of our projected data is maximized. And the first principal component is, in fact, the first eigenvector of our data’s covariance matrix!
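
As a quick sanity check (not part of the derivation), here is a minimal NumPy sketch on synthetic, correlated two-dimensional data. It takes the top eigenvector of \mathbf{X}^{\top} \mathbf{X} and confirms that no random unit vector achieves a larger projected variance:

import numpy as np

rng = np.random.default_rng(0)

# Synthetic, correlated two-dimensional data (a hypothetical stand-in for the
# beak-size/wing-span example), mean-centered.
X = rng.multivariate_normal([0, 0], [[3.0, 1.5], [1.5, 1.0]], size=500)
X = X - X.mean(axis=0)

# Eigendecomposition of the (unnormalized) covariance matrix X^T X.
evals, evecs = np.linalg.eigh(X.T @ X)
u1 = evecs[:, np.argmax(evals)]  # candidate first principal component

def projected_variance(u):
    """Sum of squared projections onto the unit vector u (Equation 3)."""
    return np.sum((X @ u) ** 2)

# The first eigenvector should beat (up to floating point) any random unit vector.
random_units = rng.normal(size=(1000, 2))
random_units /= np.linalg.norm(random_units, axis=1, keepdims=True)
best_random = max(projected_variance(u) for u in random_units)
assert projected_variance(u1) >= best_random - 1e-8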

Finding the second principal component follows the same reasoning as above, except now we need to add an additional constraint: \mathbf{u}_2 must be orthogonal to \mathbf{u}_1. Intuitively, I think it’s pretty clear that \mathbf{w}_2^{\star} for \mathbf{w}_2 = \mathbf{Q}^{\top} \mathbf{u}_2 will be the second standard basis vector, \mathbf{e}_2, since that will “select” the second largest eigenvalue, \lambda_2, while ensuring that the two principal components are orthogonal, since

\mathbf{u}_1^{\top} \mathbf{u}_2 = \mathbf{w}_1^{\top} \mathbf{Q}^{\top} \mathbf{Q} \mathbf{w}_2 = \mathbf{e}_1^{\top} \mathbf{e}_2 = 0. \tag{13}

Of course, this is not a proof, which would probably use induction, but I think it captures the main ideas nicely.

At this point, we may be wondering: have we just “punted” understanding principal components to understanding eigenvectors? I don’t think so. Of course, a deeper understanding of eigenvectors would be useful here. But I also think it’s fine to just say that this idea—that the eigenvectors of a covariance matrix define orthogonal axes along which our data has maximal variance—is simply one of many reasons why eigenvectors are such well-studied mathematical objects.

Maximizing variation versus minimizing error

So far, we have discussed PCA in terms of finding a linear projection of our data such that our projected data have maximum variance. Visually, this is just finding the blue dashed line in Figure 2 so that we can transform our data into Figure 3. However, there is an alternative view of PCA’s objective function, which is that it finds an axis such that the squared error between our data and our transformed data is minimized.

This connection may sound more complicated than it is. It’s really just the claim that if we want to maximize one leg (not the hypotenuse) of a right triangle (\mathbf{p}_i in Figure 5), we must minimize the other leg (\mathbf{y}_i in Figure 5). This StackExchange answer also has some nice visualizations, but in my opinion, those visualizations are essentially more complex versions of Figure 5.

To see this formally, let’s define the difference between our datum \mathbf{x}_i and its projection \mathbf{p}_i as \mathbf{y}_i. If we minimize the length of \mathbf{y}_i, we minimize the difference between our datum \mathbf{x}_i and our projected datum \mathbf{p}_i. So we want to find the optimal \mathbf{u}_1^{\star} such that the length of \mathbf{y}_i is minimized:

\mathbf{u}_1^{\star} = \arg\!\min_{\mathbf{u}_1} \lVert \mathbf{y}_i \rVert_2, \quad \text{subject to $\lVert \mathbf{u}_1 \rVert_2 = 1$}. \tag{14}

We can write the squared length of \mathbf{y}_i in terms of \mathbf{u}_1 as

\begin{aligned} \lVert \mathbf{y}_i \rVert_2^2 &= \lVert \mathbf{x}_i - \mathbf{p}_i \rVert_2^2 \\ &= \left( \mathbf{x}_i - (\mathbf{u}_1 \cdot \mathbf{x}_i) \mathbf{u}_1 \right)^{\top} \left( \mathbf{x}_i - (\mathbf{u}_1 \cdot \mathbf{x}_i) \mathbf{u}_1 \right) \\ &= \mathbf{x}_i^{\top} \mathbf{x}_i - 2 (\mathbf{u}_1 \cdot \mathbf{x}_i)^2 + (\mathbf{u}_1 \cdot \mathbf{x}_i)^2 \, \mathbf{u}_1^{\top} \mathbf{u}_1 \\ &\stackrel{\star}{=} \mathbf{x}_i^{\top} \mathbf{x}_i - (\mathbf{u}_1 \cdot \mathbf{x}_i)^2. \end{aligned} \tag{15}

Step \star holds because \lVert \mathbf{u}_1 \rVert_2 = 1. Since \mathbf{x}_i^{\top} \mathbf{x}_i is a constant, minimizing Equation 15 is equivalent to maximizing the subtracted term. So we can write this objective as

\mathbf{u}_1^{\star} = \arg\!\max_{\mathbf{u}_1} \left( \mathbf{u}_1 \cdot \mathbf{x}_i \right)^2, \quad \text{subject to $\lVert \mathbf{u}_1 \rVert_2 = 1$}. \tag{16}

But this is just the squared length of \mathbf{p}_i from Equation 2, so minimizing the squared error is equivalent to maximizing the variance of our projected data.
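
Here is a tiny numerical check of that picture, using a single hypothetical datum and a random unit vector; it verifies the Pythagorean relationship between the projection and the residual:

import numpy as np

rng = np.random.default_rng(1)

# A hypothetical (mean-centered) datum and a random unit vector.
x = rng.normal(size=3)
u = rng.normal(size=3)
u /= np.linalg.norm(u)

# Projection p and residual y, as in Figure 5.
p = (x @ u) * u
y = x - p

# Pythagorean check: ||x||^2 = ||p||^2 + ||y||^2, so maximizing ||p||^2
# (the projected variance) is the same as minimizing ||y||^2 (the error).
assert np.isclose(x @ x, p @ p + y @ y)
assert np.isclose(y @ y, x @ x - (x @ u) ** 2)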

I think it’s useful to understand this connection, but the real intuition, in my mind, is Figure 5. PCA finds principal components \{ \mathbf{u}_1, \dots, \mathbf{u}_P \} such that the variance of our projected data is maximized, which is equivalent to the squared error of our projected data being minimized (with respect to the original data).

3. Deriving basic properties

Now that we have visualized what PCA is doing and formalized our intuition, we can transition to a more rigorous treatment. Let’s derive some useful properties of PCA. In particular, this section will rely heavily on an understanding of the singular value decomposition (SVD). Using the SVD simply makes many calculations involving PCA easier to do. Please see my post on the SVD as needed.

Principal components and eigenvectors

Let’s start by computing the principal components. Let \mathbf{X} be an N \times P mean-centered design matrix of our data. The P columns are features, while the N rows are independent samples. We saw in the previous section that the principal components are just the eigenvectors of the covariance matrix of \mathbf{X}. The empirical covariance matrix of our data is

\mathbf{C}_X = \frac{1}{N-1} \mathbf{X}^{\top} \mathbf{X}. \tag{17}

So \mathbf{C}_X is a P \times P symmetric and positive semi-definite matrix. The eigendecomposition of \mathbf{C}_X is

\mathbf{C}_X = \mathbf{Q} \boldsymbol{\Lambda} \mathbf{Q}^{\top}, \tag{18}

where \boldsymbol{\Lambda} is a diagonal matrix of eigenvalues, and where \mathbf{Q} is a P \times P matrix whose columns are eigenvectors, i.e. principal components. In other words, the columns of \mathbf{Q} are unit vectors that point in orthogonal directions of maximal variation (e.g. the red and blue arrows in Figure 2).

Note that we can also compute the principal components via the SVD. The SVD of \mathbf{X} is

\mathbf{X} = \mathbf{U S V}^{\top} \tag{19}

where \mathbf{U} is N \times P with orthonormal columns, \mathbf{S} is diagonal and P \times P, and \mathbf{V} is P \times P and orthogonal. We can write our empirical covariance matrix \mathbf{C}_X via this SVD:

\begin{aligned} \mathbf{C}_X &= \frac{1}{N-1} \mathbf{V S U}^{\top} \mathbf{U S V}^{\top} \\ &= \frac{1}{N-1} \mathbf{V S}^2 \mathbf{V}^{\top} \\ &= \mathbf{V} \left( \frac{1}{N-1} \mathbf{S}^2 \right) \mathbf{V}^{\top}. \end{aligned} \tag{20}

Here, \mathbf{S}^2 denotes a diagonal matrix with the squared singular values of \mathbf{S}. Thus, we can use the SVD to diagonalize our matrix \mathbf{C}_X. This immediately suggests a relationship between singular values and eigenvalues, namely:

\boldsymbol{\Lambda} = \frac{1}{N-1} \mathbf{S}^2. \tag{21}

Note that since \mathbf{C}_X is symmetric and positive semi-definite, its eigenvalues are non-negative. Thus, the singular values and the eigenvalues tell the same story: they capture the scale of the variation along their respective principal components.
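
As a quick numerical check of Equation 21, the following sketch (on an arbitrary synthetic data matrix) compares the eigenvalues of the empirical covariance matrix with the squared singular values divided by N-1:

import numpy as np

rng = np.random.default_rng(2)

# Hypothetical mean-centered data matrix with N samples and P features.
N, P = 200, 4
X = rng.normal(size=(N, P)) @ rng.normal(size=(P, P))
X = X - X.mean(axis=0)

# Eigendecomposition of the empirical covariance matrix (Equations 17-18).
C = X.T @ X / (N - 1)
eigvals = np.linalg.eigvalsh(C)[::-1]  # descending order

# Thin SVD of the data matrix (Equation 19).
U, S, Vt = np.linalg.svd(X, full_matrices=False)

# Equation 21: the eigenvalues of C are the squared singular values over N-1.
assert np.allclose(eigvals, S**2 / (N - 1))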

Transformed data

Now that we know how to compute the principal components \mathbf{V}⁴, let’s use them to transform our data (e.g. from Figure 2 to Figure 3).

The matrix \mathbf{V} performs a change of basis on P-vectors such that after the linear transformation, the standard basis vectors are mapped to the columns (eigenvectors) of \mathbf{V}. Visually, \mathbf{V} un-whitens vectors by rotating them from Figure 3 back into Figure 2. Let \mathbf{z}_i be a vector in this whitened vector space. Then we can represent a single datum \mathbf{x}_i (also a P-dimensional column vector) as

\mathbf{x}_i = \mathbf{V z}_i. \tag{22}

So \mathbf{x}_i is our un-whitened data. Since \mathbf{x}_i is a column vector here but a row vector in \mathbf{X}, we can represent Equation 22 as a matrix operation as

\begin{aligned} \mathbf{X}^{\top} &= \mathbf{V Z}^{\top} \\ &\Downarrow \\ \mathbf{V}^{\top} \mathbf{X}^{\top} &= \mathbf{V}^{\top} \mathbf{V Z}^{\top} \\ &\Downarrow \\ \mathbf{X V} &= \mathbf{Z}. \end{aligned} \tag{23}

We can write this transformation via the SVD as

\mathbf{X V} = \mathbf{USV}^{\top} \mathbf{V} = \mathbf{US}. \tag{24}

Thus, the linear transformation \mathbf{XV} transforms our data \mathbf{X} into latent variables \mathbf{Z}.

When we use PCA for dimension reduction, we keep only the first K columns of \mathbf{V}, so \mathbf{V} has shape P \times K and \mathbf{Z} = \mathbf{XV} has shape N \times K, where K is the number of reduced features and where ideally K \ll P. In this framing, we can view \mathbf{ZV}^{\top} as a low-rank approximation of \mathbf{X}. So an alternative view of PCA is that we’re finding a low-rank approximation of our data matrix, capturing the basic structure of our data with fewer parameters.

Note that an alternative proof of the equivalence between maximizing variation and minimizing error is to work with the squared reconstruction error \lVert \mathbf{X} - \mathbf{Z V}^{\top} \rVert_F^2.
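
The following sketch, again on synthetic data, checks Equation 24 and shows what keeping only the top K components looks like in code (the choice of K is arbitrary):

import numpy as np

rng = np.random.default_rng(3)

# Hypothetical mean-centered data, and a reduced dimension K.
N, P, K = 200, 10, 2
X = rng.normal(size=(N, P)) @ rng.normal(size=(P, P))
X = X - X.mean(axis=0)

U, S, Vt = np.linalg.svd(X, full_matrices=False)
V = Vt.T

# Equation 24: projecting onto the principal components is the same as U S.
Z_full = X @ V
assert np.allclose(Z_full, U * S)

# Keep only the top K components: Z V^T is a rank-K approximation of X.
V_k = V[:, :K]
Z = X @ V_k
X_hat = Z @ V_k.T
print("reconstruction error:", np.linalg.norm(X - X_hat))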

Latent variables with uncorrelated features

We claimed that after transforming our data \mathbf{X} with the principal components \mathbf{V}, our latent data \mathbf{Z} has features with zero correlation. Let’s prove this. The covariance matrix of \mathbf{Z} is:

\begin{aligned} \mathbf{C}_Z &= \frac{1}{N-1} \mathbf{Z}^{\top} \mathbf{Z} \\ &= \frac{1}{N-1} \mathbf{V}^{\top} \mathbf{X}^{\top} \mathbf{X} \mathbf{V} \\ &= \frac{1}{N-1} \mathbf{V}^{\top} \mathbf{V} \mathbf{S}^2 \mathbf{V}^{\top} \mathbf{V} \\ &= \frac{1}{N-1} \mathbf{S}^2 \\ &= \boldsymbol{\Lambda}. \end{aligned} \tag{25}

Clearly, \mathbf{C}_Z is a diagonal matrix, meaning \mathbf{Z} has uncorrelated features. Furthermore, we can see that the variances of our transformed data’s features are simply the eigenvalues of the covariance matrix of \mathbf{X}! So the standard deviations in Figure 3 (the lengths of the principal components) are just the square roots of the eigenvalues.
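
Here is a short numerical check of Equation 25 on synthetic data: the covariance of \mathbf{Z} should be diagonal, with the eigenvalues of \mathbf{C}_X on its diagonal.

import numpy as np

rng = np.random.default_rng(4)

# Hypothetical mean-centered data.
N, P = 300, 5
X = rng.normal(size=(N, P)) @ rng.normal(size=(P, P))
X = X - X.mean(axis=0)

U, S, Vt = np.linalg.svd(X, full_matrices=False)
Z = X @ Vt.T

# Equation 25: the covariance of Z is diagonal, with the eigenvalues of C_X
# on the diagonal.
C_Z = Z.T @ Z / (N - 1)
assert np.allclose(C_Z, np.diag(S**2 / (N - 1)), atol=1e-8)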

Dimension reduction and variance explained

Now that we can compute the variances of the features of \mathbf{Z}, we can answer an important question: how many principal components should we use when performing dimension reduction with PCA? Answer: we take just the principal components with the largest eigenvalues, since those are the components associated with the largest variances!

In particular, we use the ratio of explained variance to total variance. The explained variance of a given principal component is just the variance of its associated feature (column in \mathbf{Z}), while the total variance is the sum of these explained variances. If we divide the former by the latter, we get a set of ratios that sum to unity and represent the fraction of variance explained by each feature.

A common approach to visualizing all this is a scree plot. In a scree plot, we plot the sorted principal components against the ratio of explained variance (e.g. Figure 6). Ideally, we would want our scree plot to look like a hockey stick. The “elbow” of the stick is where the eigenvalues level off. We would only want to take the components with large eigenvalues, since they explain most of the variation in our data. For example, in Figure 3 it is clear that the first principal component has a much larger eigenvalue than the second. So if we could only pick one component, we would pick the first one.

Figure 6. A scree plot after performing PCA on the Iris data set. The four features (sepal length, sepal width, petal length, and petal width, all in centimeters) are transformed into PCs, where the first PC captures over 90% of the variation in the data.
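
For reference, here is roughly how the explained-variance ratios behind a scree plot like Figure 6 can be computed. This sketch assumes Scikit-learn’s load_iris loader and omits the plotting itself:

import numpy as np
from sklearn.datasets import load_iris

X = load_iris().data
X = X - X.mean(axis=0)
n = X.shape[0]

# Eigenvalues of the covariance matrix via the SVD (Equation 21).
_, S, _ = np.linalg.svd(X, full_matrices=False)
explained_variance = S**2 / (n - 1)

# Ratio of explained variance to total variance, one value per PC.
ratios = explained_variance / explained_variance.sum()
print(ratios)  # the first PC should dominate, as in Figure 6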

Unchanged total variance

An interesting and intuitive property of PCA is that the total variance of our raw data is identical to the total variance of our transformed data. Concretely, the sums of the diagonals of their respective covariance matrices are the same. For example, consider the two covariance matrices in Figure 7. The left subplot is the covariance matrix of some clearly correlated data \mathbf{X}, while the right subplot is the covariance matrix of our transformed data \mathbf{Z}. You can check for yourself that the sums of the diagonals are identical (roughly 4.57).

Figure 7. (Left) The empirical covariance matrix of the Iris data set. The four features are sepal length (SL), sepal width (SW), petal length (PL), and petal width (PW), all in centimeters. (Right) The empirical covariance matrix of the transformed variables after performing PCA. Most of the total variance is explained by the first PC. The total variance (sum of the diagonal) is the same for both covariance matrices.

To see this formally, let’s compute the total variance of both our data \mathbf{X} and the transformed data \mathbf{Z}. Clearly, the total variance of \mathbf{Z} is just the sum of the diagonal (the eigenvalues) of \mathbf{C}_Z, since it is a diagonal matrix:

\text{total variance of $\mathbf{Z}$} = \sum_{i=1}^P \lambda_i. \tag{26}

However, the total variance of \mathbf{X} is trickier, since \mathbf{C}_X is not a diagonal matrix. We want to prove that the sum of the diagonals of \mathbf{C}_X = \mathbf{Q} \boldsymbol{\Lambda} \mathbf{Q}^{\top} is equal to the sum of the eigenvalues. Intuitively, we might suspect this is true since \mathbf{Q} is an orthogonal matrix.

First, let’s write out the terms of \mathbf{Q} explicitly:

\mathbf{Q} = \begin{bmatrix} q_{11} & q_{12} & \dots & q_{1P} \\ q_{21} & q_{22} & \dots & q_{2P} \\ \vdots & \vdots & \ddots & \vdots \\ q_{P1} & q_{P2} & \dots & q_{PP} \end{bmatrix}. \tag{27}

Then \boldsymbol{\Lambda} \mathbf{Q}^{\top} is

\begin{bmatrix} \lambda_1 & 0 & \dots & 0 \\ 0 & \lambda_2 & \dots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \dots & \lambda_P \end{bmatrix} \begin{bmatrix} q_{11} & q_{21} & \dots & q_{P1} \\ q_{12} & q_{22} & \dots & q_{P2} \\ \vdots & \vdots & \ddots & \vdots \\ q_{1P} & q_{2P} & \dots & q_{PP} \end{bmatrix} = \begin{bmatrix} \lambda_1 q_{11} & \lambda_1 q_{21} & \dots & \lambda_1 q_{P1} \\ \lambda_2 q_{12} & \lambda_2 q_{22} & \dots & \lambda_2 q_{P2} \\ \vdots & \vdots & \ddots & \vdots \\ \lambda_P q_{1P} & \lambda_P q_{2P} & \dots & \lambda_P q_{PP} \end{bmatrix}, \tag{28}

and \mathbf{Q} \boldsymbol{\Lambda} \mathbf{Q}^{\top} is

\begin{bmatrix} q_{11} & q_{12} & \dots & q_{1P} \\ q_{21} & q_{22} & \dots & q_{2P} \\ \vdots & \vdots & \ddots & \vdots \\ q_{P1} & q_{P2} & \dots & q_{PP} \end{bmatrix} \begin{bmatrix} \lambda_1 q_{11} & \lambda_1 q_{21} & \dots & \lambda_1 q_{P1} \\ \lambda_2 q_{12} & \lambda_2 q_{22} & \dots & \lambda_2 q_{P2} \\ \vdots & \vdots & \ddots & \vdots \\ \lambda_P q_{1P} & \lambda_P q_{2P} & \dots & \lambda_P q_{PP} \end{bmatrix}. \tag{29}

Writing out each term in this product matrix would be tedious, but it is clear that the i-th diagonal element in this matrix is

\lambda_1 q_{i1}^2 + \lambda_2 q_{i2}^2 + \dots + \lambda_P q_{iP}^2, \tag{30}

and so the sum of the diagonals must be

\left[ \lambda_1 q_{11}^2 + \lambda_2 q_{12}^2 + \dots + \lambda_P q_{1P}^2 \right] + \dots + \left[ \lambda_1 q_{P1}^2 + \lambda_2 q_{P2}^2 + \dots + \lambda_P q_{PP}^2 \right]. \tag{31}

We can group all the terms associated with the same \lambda_j. This gives us

\lambda_1 \left[ \sum_{i=1}^P q_{i1}^2 \right] + \lambda_2 \left[ \sum_{i=1}^P q_{i2}^2 \right] + \dots + \lambda_P \left[ \sum_{i=1}^P q_{iP}^2 \right], \tag{32}

and each inner summation is really the squared norm of a column of \mathbf{Q}. Since \mathbf{Q} is orthogonal, these squared norms are unity. So the sum of the diagonal elements of \mathbf{C}_X is the sum of the eigenvalues. To summarize this subsection, we have shown:

\text{total variance of $\mathbf{X}$} = \text{total variance of $\mathbf{Z}$} = \sum_{i=1}^P \lambda_i. \tag{33}

I think this is fairly intuitive. For any given data set, there is some total variation. If we rotate or reflect our data with an orthogonal transformation, we change how the features are correlated, but we don’t change the total variation inherent in the data.
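
Here is a small numerical confirmation of Equation 33 on synthetic data, comparing the traces of the two covariance matrices:

import numpy as np

rng = np.random.default_rng(5)

# Hypothetical mean-centered data.
N, P = 150, 4
X = rng.normal(size=(N, P)) @ rng.normal(size=(P, P))
X = X - X.mean(axis=0)

U, S, Vt = np.linalg.svd(X, full_matrices=False)
Z = X @ Vt.T

C_X = X.T @ X / (N - 1)
C_Z = Z.T @ Z / (N - 1)

# Equation 33: the total variance (trace of the covariance matrix) is the
# same before and after the change of basis.
assert np.isclose(np.trace(C_X), np.trace(C_Z))
assert np.isclose(np.trace(C_X), np.sum(S**2 / (N - 1)))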

Loadings

In PCA, we decompose our covariance matrix into directions or axes (PCs) and scales (eigenvalues). However, we often want to combine these two bits of information into scaled vectors (e.g. the solid axis lines in Figure 2). In the language of PCA and factor analysis, the loadings are the principal components scaled by the square roots of their respective eigenvalues. In other words, loadings are scaled by the standard deviations of the transformed features. Concretely, let \mathbf{L} be a P \times P matrix of factor loadings. Then

\begin{aligned} \mathbf{L} &= \mathbf{V} \sqrt{\boldsymbol{\Lambda}} \\ &= \mathbf{V} \frac{\mathbf{S}}{\sqrt{N-1}}. \end{aligned} \tag{34}

Conversely, eigenvectors are unit-scaled loadings.
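
In code, the two expressions for the loadings in Equation 34 agree, as this sketch on synthetic data illustrates:

import numpy as np

rng = np.random.default_rng(6)

# Hypothetical mean-centered data.
N, P = 100, 3
X = rng.normal(size=(N, P)) @ rng.normal(size=(P, P))
X = X - X.mean(axis=0)

U, S, Vt = np.linalg.svd(X, full_matrices=False)
V = Vt.T
eigvals = S**2 / (N - 1)

# Equation 34: loadings are eigenvectors scaled by the standard deviations
# of the transformed features. Both expressions below agree.
L1 = V * np.sqrt(eigvals)       # V sqrt(Lambda)
L2 = V * (S / np.sqrt(N - 1))   # V S / sqrt(N-1)
assert np.allclose(L1, L2)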

Probabilistic PCA

We can frame PCA as a probabilistic model, called probabilistic PCA (Tipping & Bishop, 1999). The statistical model for probabilistic PCA is

\mathbf{x}_i = \mathbf{L z}_i + \mathbf{e}_i, \qquad \mathbf{z}_i \sim \mathcal{N}(\mathbf{0}, \mathbf{I}), \qquad \mathbf{e}_i \sim \mathcal{N}(\mathbf{0}, \sigma^2 \mathbf{I}). \tag{35}

Why is this the correct probabilistic framing? Tipping & Bishop have a deeper analysis, but here’s how I think about it. One view of PCA is that we decompose (or approximate) our design matrix \mathbf{X} via a linear transformation of latent variables \mathbf{Z}:

\mathbf{X} = \mathbf{Z} \mathbf{V}^{\top}. \tag{36}

And the covariance matrix of \mathbf{X} is proportional to

\begin{aligned} \mathbf{X}^{\top} \mathbf{X} &= \mathbf{V} \mathbf{Z}^{\top} \mathbf{Z} \mathbf{V}^{\top} \\ &\propto \mathbf{V} \boldsymbol{\Lambda} \mathbf{V}^{\top} \\ &= \mathbf{LL}^{\top}. \end{aligned} \tag{37}

We can see that Equation 35 models the same relationships, since it implies that \mathbf{x}_i is normally distributed as

\mathbf{x}_i = \mathbf{L z}_i + \mathbf{e}_i \sim \mathcal{N}(\mathbf{0}, \mathbf{LL}^{\top} + \sigma^2 \mathbf{I}). \tag{38}

So the model in Equation 35 is analogous to the model we have discussed so far.
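
To make the generative view concrete, here is a minimal sketch that samples from the model in Equation 35 with hypothetical dimensions and checks that the empirical covariance is close to \mathbf{LL}^{\top} + \sigma^2 \mathbf{I}:

import numpy as np

rng = np.random.default_rng(7)

# Hypothetical dimensions: P observed features, K latent features.
P, K, N = 5, 2, 1000
L = rng.normal(size=(P, K))   # loadings
sigma = 0.1                   # observation noise scale

# Sample from the generative model in Equation 35.
Z = rng.normal(size=(N, K))          # z_i ~ N(0, I)
E = sigma * rng.normal(size=(N, P))  # e_i ~ N(0, sigma^2 I)
X = Z @ L.T + E                      # x_i = L z_i + e_i

# The empirical covariance should be close to L L^T + sigma^2 I (Equation 38).
C_emp = X.T @ X / (N - 1)
C_model = L @ L.T + sigma**2 * np.eye(P)
print(np.max(np.abs(C_emp - C_model)))  # should be small for large N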

4. Exploring a complete example

As an example of the power of PCA, let’s reproduce a famous result, eigenfaces (Turk & Pentland, 1991). An eigenface is simply a principal component of a data set of images of faces. Since principal components form a basis of the images in our data set, the result of visualizing a single eigenface is an eerie, idealized face. See Figure 8 for eigenfaces based on the Olivetti database of faces.

Figure 8. The first ten eigenfaces (principal components) of the Olivetti data set.

What does this definition of eigenfaces really mean? Recall Equation 22, which I’ll re-write here:

\mathbf{x}_i = \mathbf{Vz}_i. \tag{39}

If \mathbf{x}_i is a P-vector representing an image with P pixels and if an eigenface is a column of \mathbf{V}, then \mathbf{z}_i must be a P-vector representing how much each eigenface (out of P eigenfaces) contributes to \mathbf{x}_i. So we can think of \mathbf{z}_i as representing a P-vector of latent pixels. To grok this, it may be helpful to write out the matrix multiplication in Equation 39 more diagrammatically:

\begin{bmatrix} \text{pixel $1$} \\ \text{pixel $2$} \\ \vdots \\ \text{pixel $P$} \end{bmatrix} = \left[\begin{array}{c|c|c|c} \vphantom{\Bigg|} \text{eigenface $1$} & \text{eigenface $2$} & \dots & \text{eigenface $P$} \end{array} \right] \begin{bmatrix} \text{latent pixel $1$} \\ \text{latent pixel $2$} \\ \vdots \\ \text{latent pixel $P$} \end{bmatrix}. \tag{40}

So each datum (the image on the left-hand side of Equation 40) is a linear combination of eigenfaces, where the coefficients are defined by latent variables. See my post on matrices as functions and data for a bit more discussion on this idea of how to interpret matrices as data.

But what’s the utility of an eigenface? If we can find a small set of eigenfaces, say K of them where K \ll P, then we can find a lower-dimensional representation of our data that can be used for downstream tasks like visualization, data compression, and classification. For example, we could sort our eigenfaces by their associated eigenvalues and only pick the top two eigenfaces. Then we could represent every face in our data set as a linear combination of just two eigenfaces, and we could visualize the two latent pixels for each image (Figure 9).

Figure 9. Images from the Olivetti data set compressed to two-dimensional points after using PCA for dimension reduction. Unique color-marker pairs are individual subjects. To illustrate the clustering, all ten photos of a given subject are shown at the top. Both the photographs and their associated data are numbered. The first three photographs look distinctly different from the last seven, which is also evident in the clustering.

While eigenfaces are a bit abstract, recall that this is precisely what we were trying to do when we imagined ourselves as naturalists classifying birds on a strange island. We have simply abstracted the problem a bit more, where instead of collecting beaks and feathers, we are collecting pixels. On the island, we were able to construct the composite feature “bird size” out of “beak size” and “wing span”, because those two features were correlated. With eigenfaces, we are able to construct composite or projected pixels because many pixels in our data are correlated.

Now that we understand eigenfaces, let’s implement PCA and reproduce Figure 8. Here is the Python code I used to perform PCA. I think it is pretty straightforward if you understand the derivations above:

import numpy as np
from   types import SimpleNamespace

def pca(X, n_components=None):
    """Perform principal component analysis on data X with [samples x features].
    """    
    n, p  = X.shape
    dim   = n_components if n_components else p
    xmean = X.mean(axis=0)

    # Mean center our data.
    X = X - xmean

    # Do eigendecomposition via the SVD.
    U, S, Vt = np.linalg.svd(X, full_matrices=False)

    # Compute eigenvalues from singular values.
    L = S**2 / (n-1)

    # Sort output by eigenvalues in descending order.
    inds = S.argsort()[::-1]
    eigvals = L[inds][:dim]
    pcs = Vt[inds][:dim]

    # For consistency with the blog post text, transpose to make the columns of
    # `pcs` the eigenvectors.
    pcs = pcs.T
    
    # Transform X into latent variables Z.
    Z = (U*S)[:, :dim]

    out = {}
    out['pcs'] = pcs
    out['explained_variance'] = eigvals
    out['total_variance'] = eigvals.sum()
    out['Z'] = Z
    out['n_components'] = dim
    out['mean'] = xmean
    
    return SimpleNamespace(**out)

We can then load in the Olivetti data set from Scikit-learn, perform PCA, and then visualize the first ten components:

import matplotlib.pyplot as plt
from   sklearn.datasets import fetch_olivetti_faces

faces = fetch_olivetti_faces(return_X_y=False)
result = pca(X=faces.data)
fig, axes = plt.subplots(2, 5)
for i, ax in enumerate(axes.flat):
    ax.imshow(result.pcs[:, i].reshape(faces.images[0].shape))
plt.show()
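
To also get something like Figure 9, we could keep just the first two latent dimensions and scatter them, colored by subject. Here is a minimal sketch that reuses result and faces from above (it does not reproduce the exact styling of Figure 9):

# Scatter the first two latent pixels for each image, colored by subject.
Z2 = result.Z[:, :2]
plt.scatter(Z2[:, 0], Z2[:, 1], c=faces.target, cmap='tab10', s=10)
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.show()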

And we’re done!

Conclusion

Principal components are a collection of orthogonal unit vectors such that the i-th principal component captures maximum variation in our data, while being orthogonal to the previous i-1 PCs. Thus, principal components are a basis, which can be used to transform our data into latent variables. These latent variables capture correlations in our data’s features, allowing for a reduced set of latent features that capture roughly the same amount of total variation.

Given the advances in computer vision today, the eigenfaces results may not be particularly impressive. For example, one could easily fit an autoencoder to produce much higher quality latent variables for clustering. But I think this is the wrong perspective. In context, it should be surprising that a simple linear model such as PCA is so effective on high-dimensional and richly structured data such as images. Given that it is also fast, easy to implement, and theoretically well-understood, PCA is a great first choice for preliminary data analysis.

  1. Some readers might notice that this is just the Iris flower data set, but I like birds. 

  2. We constrain \mathbf{u}_1 to be a unit vector because there are an infinite number of unconstrained solutions (any scalar multiple of \mathbf{u}_1).

  3. We could optimize the absolute rather than squared length, but we use a squared quantity because it is nicer for optimization (i.e. it is continuously differentiable). 

  4. \mathbf{V} and \mathbf{Q} are equivalent. For simplicity, I’ll use \mathbf{V} for principal components.

  1. Pearson, K. (1901). On lines and planes of closest fit to systems of points in space. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 2(11), 559–572.
  2. Hotelling, H. (1933). Analysis of a complex of statistical variables into principal components. Journal of Educational Psychology, 24(6), 417.
  3. Turk, M., & Pentland, A. (1991). Eigenfaces for recognition. Journal of Cognitive Neuroscience, 3(1), 71–86.
  4. Tipping, M. E., & Bishop, C. M. (1999). Probabilistic principal component analysis. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 61(3), 611–622.