Principal Component Analysis
Principal component analysis (PCA) is a simple, fast, and elegant linear method for data analysis. I explore PCA in detail, first with pictures and intuition, then with linear algebra and detailed derivations, and finally with code.
Principal component analysis (PCA) (Pearson, 1901; Hotelling, 1933) is a workhorse in statistics. It is often the first tool we reach for when performing exploratory data analysis or dimension reduction. The method is simple, fast, and theoretically well-understood, and it is the bedrock upon which many more complex dimension reduction methods are built.
The goal of this post is to understand PCA in detail. We’ll start by visualizing the big idea, exploring the key concepts with as little mathematical detail or jargon as needed. In the next two sections, we’ll formalize our intuitions from the first section and then derive some basic properties of PCA. Finally, we’ll end with a complete implementation of PCA to reproduce a famous result, eigenfaces (Turk & Pentland, 1991).
This is a long post. Please feel free to jump around as desired:
- Visualizing the big idea
- Formalizing the objective function
- Deriving basic properties
- Exploring a complete example
1. Visualizing the big idea
Imagine we are naturalists visiting a strange new island, and we are tasked with naming, defining, and classifying the birds we find there. We plan to do this by first observing and cataloguing as many birds as possible and then grouping them based on their shared characteristics. But how do we decide when to place a bird into a given group?
For example, compare the ramphastos (Figure , bottom left) to the rufous-collared kingfisher (Figure , top right). We may think that these two birds are obviously different species, but why do we think that? The answer is that these two birds are more similar to many other birds than they are to each other. We can easily find two groupings of birds, say toucans and kingfishers, such that the variation between the two groups is larger than the variation within each group.
How can we make this idea of intra- versus inter-group variation more precise? One idea would be to record many features of each bird we see: coloring, size, location, plumage, beak shape, and so forth. Then we could look for features along which all the birds on this island had the most variation. We could use such a variance-maximizing feature to help us split birds into meaningful groups.
It’s worth asking why features with lots of variation are useful for our task, so consider the opposite: imagine that all the birds on the island had a particular feature with almost no variation. Suppose, for example, that all the birds were white. This feature, coloring, would be almost useless for classifying the island’s birds. However, all the birds on the island being white does not mean that all the birds belong in the same group. If some white birds have large, hooked beaks while other white birds have short, stout beaks, we might wonder if this feature, beak shape, is important for our task. This is the basic logic behind looking for features with lots of variation.
Let’s try to visualize this idea with some data. Imagine we plot two bird features against each other, say beak size against wing span (Figure ). After plotting these data [1], we notice something: since these two features are correlated, the axis along which variation between birds is maximized is neither just beak size nor just wing span. Instead, the axis along which variation is maximized is some combination of both features, as represented by the blue dashed line. This diagonal line can be thought of as representing a new, composite feature or dimension. We can’t precisely interpret this composite feature, but we might hypothesize that it represents “bird size”.
In the language of statistics and linear algebra, we call the direction of this blue dashed line the first principal component (PC). More formally, the first PC is a unit vector that defines an axis along which our data’s variance is maximized.
After finding this first PC, what can we do? We might still be interested in finding dimensions along which our data’s variance is maximized, but without any constraints, another principal component would just be an axis very similar to the first. Thus, we’ll add a constraint: the next PC must be orthogonal to the first. So the second PC defines an axis along which our data’s variance is maximized, subject to this new axis being orthogonal to the first PC.
What can we do with these two PCs? First, observe that these two PCs are orthogonal to each other and thus form a basis in $\mathbb{R}^2$. In other words, we can represent any bird in our data set (any black dot in Figure ) as a linear combination of these two PCs. This allows us to perform a change of basis on our data. If we rotate or reflect our data such that the principal components are our new basis, then we get Figure . All the relationships in our data remain the same, but we have “whitened” our data in the sense that our two features are no longer correlated!
Again, we should stop and question things. Why do we care about this change of basis? Figure has the answer. Recall that our task as naturalists on a strange island is to classify all the new birds. Imagine that, being short on time, the only features we were able to collect about the island’s birds were beak size and wing span, so we must classify the birds using only these two features. What Figure suggests is that, in fact, we only need one feature to classify the birds: the first principal component. This fact, that we need only one of the features rather than both, is a deep idea. You may know that PCA is used for dimension reduction, and this is why. PCA allows us to find new composite features based on how much variation they explain, allowing us to keep only the new features that explain a large fraction of the variance in our data. These new composite features are important, so let’s name them; we’ll call them latent features or latent variables.
I think it’s worth reminding ourselves that while all this talk of “principal components” and “change of basis” and “orthogonality” is abstract, its meaning can be interpretable and even intuitive. Imagine that a brilliant ornithologist visited our island and correctly classified all the birds into three groups (Figure ). Since the first PC captures most of the variation, this would immediately suggest that if we actually looked at the birds and their groupings, we would find large beaks and large wingspans at one end of the spectrum, and small beaks and small wingspans at the other. The first PC is simply capturing the variation in both, which is possible precisely because these two features covary.
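To make this concrete, here is a minimal sketch of my own (not part of the original analysis, and peeking ahead at the machinery of the next sections): it generates synthetic, correlated “beak size” and “wing span” measurements and finds the variance-maximizing direction by eigendecomposing the covariance matrix. All numbers are made up for illustration.

import numpy as np

rng = np.random.default_rng(0)

# Synthetic, correlated "beak size" and "wing span" measurements
# (hypothetical data, standing in for the bird features above).
n = 300
beak = rng.normal(loc=10, scale=2, size=n)
wing = 3 * beak + rng.normal(scale=2, size=n)
X = np.column_stack([beak, wing])

# Mean-center, compute the covariance matrix, and take its top eigenvector.
Xc = X - X.mean(axis=0)
C = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(C)   # eigenvalues in ascending order
first_pc = eigvecs[:, -1]              # direction of maximal variance

# Projecting each bird onto the first PC collapses the two correlated
# features into a single composite "bird size" score.
size_scores = Xc @ first_pc
print(first_pc, size_scores.var(ddof=1), eigvals[-1])

Projecting each bird onto `first_pc` is exactly the “one composite feature” idea described above.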
2. Formalizing the objective function
Now that we understand the big picture, we can ask: how do we find these principal components? Let’s turn our geometric picture into an optimization problem.
Computing the principal components
Assume our data is mean-centered, and consider an arbitrary unit vector [2] $\mathbf{w}$ (Figure ). We want to maximize the variance of our data after it is projected onto the new axis defined by $\mathbf{w}$. In other words, we want to maximize the squared distance [3] of our projected data from the origin.
To formalize this, consider an arbitrary datum in our collection, $\mathbf{x}_i \in \mathbb{R}^p$, and let’s denote the projection of $\mathbf{x}_i$ onto $\mathbf{w}$ as $\tilde{\mathbf{x}}_i$:

$$\tilde{\mathbf{x}}_i \triangleq (\mathbf{x}_i^{\top} \mathbf{w}) \mathbf{w}.$$

We want to find the optimal $\mathbf{w}$ such that the squared length of $\tilde{\mathbf{x}}_i$ is maximized, where the length is

$$\|\tilde{\mathbf{x}}_i\| = \|(\mathbf{x}_i^{\top} \mathbf{w}) \mathbf{w}\| = |\mathbf{x}_i^{\top} \mathbf{w}| \, \|\mathbf{w}\| = |\mathbf{x}_i^{\top} \mathbf{w}|.$$

This suggests the following optimization problem:

$$\mathbf{w}^{\star} = \arg\max_{\mathbf{w}} \frac{1}{n-1} \sum_{i=1}^{n} (\mathbf{x}_i^{\top} \mathbf{w})^2, \qquad \text{subject to} \quad \|\mathbf{w}\| = 1.$$
This is the linear algebraic formulation of the idea in the previous section: that the first PC represents the axis along which our projected data’s variance is maximized.
A little algebraic manipulation of the summation in the equation above gives us

$$\frac{1}{n-1} \sum_{i=1}^{n} (\mathbf{x}_i^{\top} \mathbf{w})^2 = \frac{1}{n-1} \sum_{i=1}^{n} \mathbf{w}^{\top} \mathbf{x}_i \mathbf{x}_i^{\top} \mathbf{w} \stackrel{\star}{=} \mathbf{w}^{\top} \left( \frac{1}{n-1} \sum_{i=1}^{n} \mathbf{x}_i \mathbf{x}_i^{\top} \right) \mathbf{w} = \mathbf{w}^{\top} \mathbf{C} \mathbf{w},$$

where $\mathbf{C}$ is the empirical covariance matrix of our mean-centered data. See this post if you’re unconvinced by step $\star$. This identity allows us to rewrite the optimization problem as

$$\mathbf{w}^{\star} = \arg\max_{\mathbf{w}} \left\{ \mathbf{w}^{\top} \mathbf{C} \mathbf{w} \right\}, \qquad \text{subject to} \quad \|\mathbf{w}\| = 1.$$

So the optimization problem for finding the first PC requires optimizing with respect to our data’s (mean-centered) covariance matrix. This is because we’re ultimately maximizing the variance of our data with respect to $\mathbf{w}$, and the covariance matrix captures the necessary relationships.
However, what we’re about to show is even better: the first PC is actually just the first eigenvector of the covariance matrix, $\mathbf{C}$! This is not, at least in my mind, an obvious result. It is surprising and beautiful. To see it, observe that since $\mathbf{C}$ is a symmetric and positive semi-definite matrix, it has an eigendecomposition,

$$\mathbf{C} = \mathbf{Q} \boldsymbol{\Lambda} \mathbf{Q}^{\top},$$

where $\mathbf{Q}$ is a square ($p \times p$) matrix with eigenvectors for columns and where $\boldsymbol{\Lambda}$ is a diagonal matrix of eigenvalues. If we define a new variable $\mathbf{u}$ as

$$\mathbf{u} \triangleq \mathbf{Q}^{\top} \mathbf{w},$$

then we can rewrite the optimization problem above as

$$\mathbf{u}^{\star} = \arg\max_{\mathbf{u}} \left\{ \mathbf{u}^{\top} \boldsymbol{\Lambda} \mathbf{u} \right\}, \qquad \text{subject to} \quad \|\mathbf{u}\| = 1.$$

The norm of $\mathbf{u}$ must also be unity since

$$\|\mathbf{u}\| = \|\mathbf{Q}^{\top} \mathbf{w}\| = \sqrt{\mathbf{w}^{\top} \mathbf{Q} \mathbf{Q}^{\top} \mathbf{w}} \stackrel{\dagger}{=} \sqrt{\mathbf{w}^{\top} \mathbf{w}} = \|\mathbf{w}\| = 1.$$

Step $\dagger$ relies on the fact that $\mathbf{Q}$ is orthonormal and thus $\mathbf{Q} \mathbf{Q}^{\top} = \mathbf{I}$. Since $\boldsymbol{\Lambda}$ is diagonal, the bracketed term in the optimization problem above can be unpacked back into a summation, giving us

$$\mathbf{u}^{\top} \boldsymbol{\Lambda} \mathbf{u} = \sum_{k=1}^{p} \lambda_k u_k^2,$$

where $\lambda_k$ is the $k$-th eigenvalue associated with the $k$-th column of $\mathbf{Q}$. Without loss of generality, assume that the eigenvalues are ordered, $\lambda_1 \geq \lambda_2 \geq \dots \geq \lambda_p$. (Recall that these eigenvalues are all non-negative, since covariance matrices are always symmetric and positive semi-definite.) Since the eigenvalues are all non-negative and the weights must satisfy $\sum_k u_k^2 = 1$, clearly the weights that maximize the problem are

$$u_1 = 1, \qquad u_2 = u_3 = \dots = u_p = 0.$$

In other words, the optimal $\mathbf{u}^{\star}$ is just the first standard basis vector, $\mathbf{e}_1$. If we plug this optimal weight back into the definition $\mathbf{w} = \mathbf{Q} \mathbf{u}$, we see that the optimal $\mathbf{w}^{\star} = \mathbf{Q} \mathbf{e}_1 = \mathbf{q}_1$ is simply the first eigenvector!
To summarize, we wanted to find the first principal component of our data. As we saw in the first section, this first principal component is a vector that defines an axis along which the variance of our projected data is maximized. And the first principal component is, in fact, the first eigenvector of our data’s covariance matrix!
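If you want to convince yourself of this numerically, here is a small sketch of my own (not from the original derivation; the covariance matrix used to generate data is made up) comparing the variance along the top eigenvector with the variance along many random unit vectors:

import numpy as np

rng = np.random.default_rng(1)

# Random correlated data (synthetic, for illustration only).
X = rng.multivariate_normal(mean=[0, 0, 0],
                            cov=[[3.0, 1.2, 0.4],
                                 [1.2, 2.0, 0.5],
                                 [0.4, 0.5, 1.0]],
                            size=2000)
Xc = X - X.mean(axis=0)
C = np.cov(Xc, rowvar=False)

# First eigenvector (eigh returns eigenvalues in ascending order).
eigvals, eigvecs = np.linalg.eigh(C)
w_star = eigvecs[:, -1]

def projected_variance(w):
    return (Xc @ w).var(ddof=1)

# Compare the variance along w_star against many random unit vectors.
random_dirs = rng.normal(size=(10_000, 3))
random_dirs /= np.linalg.norm(random_dirs, axis=1, keepdims=True)
best_random = max(projected_variance(w) for w in random_dirs)

# The variance along the first eigenvector equals the largest eigenvalue
# and is at least as large as along any random direction.
print(projected_variance(w_star), best_random, eigvals[-1])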
Finding the second principal component follows the same reasoning as above, except now we need to add an additional constraint: $\mathbf{w}_2$ must be orthogonal to $\mathbf{w}_1$. Intuitively, I think it’s pretty clear that the optimal $\mathbf{u}$ for $\mathbf{w}_2$ will be the second standard basis vector, $\mathbf{e}_2$, since that will “select” the second largest eigenvalue, $\lambda_2$, while ensuring that the two principal components are orthogonal, since

$$\mathbf{w}_1^{\top} \mathbf{w}_2 = (\mathbf{Q} \mathbf{e}_1)^{\top} (\mathbf{Q} \mathbf{e}_2) = \mathbf{e}_1^{\top} \mathbf{Q}^{\top} \mathbf{Q} \mathbf{e}_2 = \mathbf{e}_1^{\top} \mathbf{e}_2 = 0.$$
Of course, this is not a proof, which would probably use induction, but I think it captures the main ideas nicely.
At this point, we may be wondering: have we just “punted” understanding principal components to understanding eigenvectors? I don’t think so. Of course, a deeper understanding of eigenvectors would be useful here. But I also think it’s fine to just say that this idea—that the eigenvectors of a covariance matrix define orthogonal axes along which our data has maximal variance—is simply one of many reasons why eigenvectors are such well-studied mathematical objects.
Maximizing variation versus minimizing error
So far, we have discussed PCA in terms of finding a linear projection of our data such that our projected data have maximum variance. Visually, this is just finding the blue dashed line in Figure so that we can transform our data into Figure . However, there is an alternative view of PCA’s objective function, which is that it finds an axis such that the squared error between our data and our projected data is minimized.
This connection may sound more complicated than it is. It’s really just the claim that if we want to maximize one leg (not the hypotenuse) of a right triangle ($\tilde{\mathbf{x}}_i$ in Figure ), we must minimize the other leg ($\boldsymbol{\epsilon}_i$ in Figure ). This StackExchange answer also has some nice visualizations, but in my opinion, those visualizations are essentially more complex versions of Figure .
To see this formally, let’s define the difference between $\mathbf{x}_i$ and $\tilde{\mathbf{x}}_i$ as $\boldsymbol{\epsilon}_i \triangleq \mathbf{x}_i - \tilde{\mathbf{x}}_i$. If we minimize the length of $\boldsymbol{\epsilon}_i$, we minimize the difference between our datum $\mathbf{x}_i$ and our projected datum $\tilde{\mathbf{x}}_i$. So we want to find the optimal $\mathbf{w}$ such that the length of $\boldsymbol{\epsilon}_i$ is minimized:

$$\mathbf{w}^{\star} = \arg\min_{\mathbf{w}} \sum_{i=1}^{n} \|\mathbf{x}_i - (\mathbf{x}_i^{\top} \mathbf{w}) \mathbf{w}\|^2, \qquad \text{subject to} \quad \|\mathbf{w}\| = 1.$$

We can write the unconstrained objective in terms of $\mathbf{x}_i^{\top} \mathbf{w}$ as

$$\sum_{i=1}^{n} \|\boldsymbol{\epsilon}_i\|^2 = \sum_{i=1}^{n} \left[ \|\mathbf{x}_i\|^2 - 2 (\mathbf{x}_i^{\top} \mathbf{w})^2 + (\mathbf{x}_i^{\top} \mathbf{w})^2 \|\mathbf{w}\|^2 \right] \stackrel{\ddagger}{=} \sum_{i=1}^{n} \|\mathbf{x}_i\|^2 - \sum_{i=1}^{n} (\mathbf{x}_i^{\top} \mathbf{w})^2.$$

Step $\ddagger$ holds because $\|\mathbf{w}\|^2 = 1$, and the $\|\mathbf{x}_i\|^2$ terms are constants with respect to $\mathbf{w}$. To minimize a negative quantity, we want to maximize the un-negated quantity. So we can write this objective as

$$\mathbf{w}^{\star} = \arg\max_{\mathbf{w}} \sum_{i=1}^{n} (\mathbf{x}_i^{\top} \mathbf{w})^2, \qquad \text{subject to} \quad \|\mathbf{w}\| = 1.$$

But this is equivalent to optimizing the absolute length of $\tilde{\mathbf{x}}_i$, which in turn is equivalent to optimizing the squared length of $\tilde{\mathbf{x}}_i$.
I think it’s useful to understand this connection, but the real intuition, in my mind, is Figure . PCA finds principal components such that the variance of our projected data is maximized, which is equivalent to the squared error of our projected data being minimized (with respect to the original data).
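Here is a quick numerical check of this equivalence, a sketch of my own with synthetic 2-D data: sweeping unit vectors around the circle, the direction that maximizes the projected variance is the same direction that minimizes the squared reconstruction error.

import numpy as np

rng = np.random.default_rng(2)

# Synthetic 2-D data (illustrative only).
X = rng.multivariate_normal([0, 0], [[4.0, 1.5], [1.5, 1.0]], size=1000)
Xc = X - X.mean(axis=0)

def projected_variance(w):
    return ((Xc @ w) ** 2).sum()

def reconstruction_error(w):
    proj = np.outer(Xc @ w, w)   # project each point onto the line through w
    return ((Xc - proj) ** 2).sum()

# Sweep unit vectors around the half-circle.
angles = np.linspace(0, np.pi, 1000)
dirs = np.column_stack([np.cos(angles), np.sin(angles)])
variances = np.array([projected_variance(w) for w in dirs])
errors = np.array([reconstruction_error(w) for w in dirs])

# Max projected variance and min reconstruction error occur at the same angle.
assert variances.argmax() == errors.argmin()
print(angles[variances.argmax()])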
3. Deriving basic properties
Now that we visualized what PCA is doing and have formalized our intuition, we can transition to more a rigorous treatment. Let’s derive some useful properties about PCA. In particular, this section will rely heavily on an understanding of the singular value decomposition (SVD). Using the SVD simply makes many calculations involving PCA easier to do. Please see my post on the SVD as needed.
Principal components and eigenvectors
Let’s start by computing the principal components. Let $\mathbf{X}$ be an $n \times p$ mean-centered design matrix of our data. The columns are features, while the rows are independent samples. We saw in the previous section that the principal components are just the eigenvectors of the covariance matrix of $\mathbf{X}$. The empirical covariance matrix of our data is

$$\mathbf{C} = \frac{1}{n-1} \mathbf{X}^{\top} \mathbf{X}.$$

So $\mathbf{C}$ is a $p \times p$ symmetric and positive semi-definite matrix. The eigendecomposition of $\mathbf{C}$ is

$$\mathbf{C} = \mathbf{Q} \boldsymbol{\Lambda} \mathbf{Q}^{\top},$$

where $\boldsymbol{\Lambda}$ is a diagonal matrix of eigenvalues, and where $\mathbf{Q}$ is a $p \times p$ matrix whose columns are eigenvectors, i.e. principal components. In other words, the columns of $\mathbf{Q}$ are unit vectors that point in orthogonal directions of maximal variation (e.g. the red and blue arrows in Figure ).

Note that we can also compute the principal components via the SVD. The SVD of $\mathbf{X}$ is

$$\mathbf{X} = \mathbf{U} \mathbf{S} \mathbf{V}^{\top},$$

where $\mathbf{U}$ is unitary and $n \times n$, $\mathbf{S}$ is diagonal and $n \times p$, and $\mathbf{V}$ is unitary and $p \times p$. We can write our empirical covariance matrix via this SVD:

$$\mathbf{C} = \frac{1}{n-1} \mathbf{X}^{\top} \mathbf{X} = \frac{1}{n-1} \mathbf{V} \mathbf{S}^{\top} \mathbf{U}^{\top} \mathbf{U} \mathbf{S} \mathbf{V}^{\top} = \mathbf{V} \left( \frac{\mathbf{S}^{\top} \mathbf{S}}{n-1} \right) \mathbf{V}^{\top}.$$

Here, $\mathbf{S}^{\top} \mathbf{S}$ denotes a $p \times p$ diagonal matrix with the squared singular values of $\mathbf{X}$. Thus, we can use the SVD to diagonalize our matrix $\mathbf{C}$. This immediately suggests a relationship between singular values and eigenvalues, namely:

$$\lambda_k = \frac{s_k^2}{n-1}.$$
Note that since $\mathbf{C}$ is symmetric and positive semi-definite, its eigenvalues are non-negative. Thus, the singular values and the eigenvalues tell the same story: they capture the scale of the variation along their respective principal components.
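This relationship is easy to verify numerically. The sketch below (with made-up data) compares the eigenvalues of the empirical covariance matrix against the squared singular values of the centered data, scaled by $1/(n-1)$:

import numpy as np

rng = np.random.default_rng(3)
n, p = 500, 4
X = rng.normal(size=(n, p)) @ rng.normal(size=(p, p))   # correlated features
Xc = X - X.mean(axis=0)

# Eigenvalues of the covariance matrix, sorted in descending order.
C = Xc.T @ Xc / (n - 1)
eigvals = np.linalg.eigvalsh(C)[::-1]

# Squared singular values of the centered data, scaled by 1/(n-1).
s = np.linalg.svd(Xc, compute_uv=False)
from_svd = s**2 / (n - 1)

print(np.allclose(eigvals, from_svd))   # True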
Transformed data
Now that we know how to compute the principal components [4], let’s use them to transform our data (e.g. from Figure to Figure ).

The matrix $\mathbf{V}$ performs a change of basis on $p$-vectors such that, after the linear transformation, the standard basis vectors are mapped to the columns (eigenvectors) of $\mathbf{V}$. Visually, $\mathbf{V}$ un-whitens vectors by rotating them from Figure back into Figure . Let $\mathbf{z}_i$ be a vector in this whitened vector space. Then we can represent a single datum $\mathbf{x}_i$ (also a $p$-dimensional column vector) as

$$\mathbf{x}_i = \mathbf{V} \mathbf{z}_i.$$

So $\mathbf{x}_i$ is our un-whitened data. Since $\mathbf{x}_i$ is a column vector here but a row vector in $\mathbf{X}$, we can represent the equation above as a matrix operation,

$$\mathbf{X} = \mathbf{Z} \mathbf{V}^{\top} \quad \Longleftrightarrow \quad \mathbf{Z} = \mathbf{X} \mathbf{V}.$$

We can write this transformation via the SVD as

$$\mathbf{Z} = \mathbf{X} \mathbf{V} = \mathbf{U} \mathbf{S} \mathbf{V}^{\top} \mathbf{V} = \mathbf{U} \mathbf{S}.$$

Thus, the linear transformation $\mathbf{V}$ transforms our data $\mathbf{X}$ into latent variables $\mathbf{Z}$.

When we use PCA for dimension reduction, $\mathbf{V}$ has shape $p \times k$ and $\mathbf{Z}$ has shape $n \times k$, where $k$ is the number of reduced features and where ideally $k \ll p$. In this framing, we can view $\mathbf{Z} \mathbf{V}^{\top}$ as a low-rank approximation of $\mathbf{X}$. So an alternative view of PCA is that we’re finding a low-rank approximation of our data matrix, capturing the basic structure of our data with fewer parameters.
Note that an alternative proof of the equivalence between maximizing variation and minimizing error is to work with the squared error $\|\mathbf{X} - \mathbf{Z} \mathbf{V}^{\top}\|_F^2$.
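Here is a small sketch of this low-rank view (again with synthetic data): keeping only the top $k$ columns of $\mathbf{V}$, the squared reconstruction error equals the sum of the discarded squared singular values.

import numpy as np

rng = np.random.default_rng(4)
n, p, k = 300, 10, 3
X = rng.normal(size=(n, k)) @ rng.normal(size=(k, p)) + 0.1 * rng.normal(size=(n, p))
Xc = X - X.mean(axis=0)

# Principal components from the SVD: columns of V.
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
V = Vt.T

# Latent variables and rank-k reconstruction.
Z = Xc @ V[:, :k]          # equivalently (U * S)[:, :k]
X_hat = Z @ V[:, :k].T

# The squared reconstruction error equals the sum of the discarded squared
# singular values, i.e. (n-1) times the discarded eigenvalues.
err = ((Xc - X_hat) ** 2).sum()
print(np.isclose(err, (S[k:] ** 2).sum()))   # True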
Latent variables with uncorrelated features
We claimed that after transforming our data with the principal components $\mathbf{V}$, our latent data $\mathbf{Z}$ has features with zero correlation. Let’s prove this. The covariance matrix of $\mathbf{Z}$ is:

$$\mathrm{cov}(\mathbf{Z}) = \frac{1}{n-1} \mathbf{Z}^{\top} \mathbf{Z} = \frac{1}{n-1} \mathbf{V}^{\top} \mathbf{X}^{\top} \mathbf{X} \mathbf{V} = \mathbf{V}^{\top} \mathbf{C} \mathbf{V} = \mathbf{V}^{\top} \mathbf{V} \boldsymbol{\Lambda} \mathbf{V}^{\top} \mathbf{V} = \boldsymbol{\Lambda}.$$

Clearly, $\boldsymbol{\Lambda}$ is a diagonal matrix, meaning $\mathbf{Z}$ has uncorrelated features. Furthermore, we can see that the variances of our transformed data’s features are simply the eigenvalues of the covariance matrix of $\mathbf{X}$! So the standard deviations in Figure (the lengths of the principal components) are just the square roots of the eigenvalues.
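This is also easy to verify numerically. In the sketch below (synthetic data), the covariance matrix of the latent variables comes out diagonal, with the eigenvalues on the diagonal:

import numpy as np

rng = np.random.default_rng(5)
n, p = 1000, 3
X = rng.multivariate_normal(np.zeros(p),
                            [[2.0, 0.8, 0.3],
                             [0.8, 1.5, 0.6],
                             [0.3, 0.6, 1.0]],
                            size=n)
Xc = X - X.mean(axis=0)

U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
Z = Xc @ Vt.T                 # latent variables

cov_Z = Z.T @ Z / (n - 1)
eigvals = S**2 / (n - 1)

# Off-diagonal entries are (numerically) zero, and the diagonal entries
# are the eigenvalues of the covariance matrix of X.
print(np.allclose(cov_Z, np.diag(eigvals)))   # True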
Dimension reduction and variance explained
Now that we can compute the variances of the features of $\mathbf{Z}$, we can answer an important question: how many principal components should we use when performing dimension reduction with PCA? Answer: we take just the principal components with the largest eigenvalues, since those are the components associated with the largest variances!
In particular, we use the ratio of explained variance to total variance. The explained variance of a given principal component is just the variance of its associated feature (column in $\mathbf{Z}$), while the total variance is the sum of these explained variances. If we divide the former by the latter, we get a set of ratios that sum to unity and represent the fraction of variance explained by each feature.
A common approach to visualizing all this is a scree plot. In a scree plot, we plot the sorted principal components against the ratio of explained variance (e.g. Figure ). Ideally, we would want our scree plot to look like a hockey stick. The “elbow” of the stick is where the eigenvalues level off. We would only want to take the components with large eigenvalues, since they explain most of the variation in our data. For example, in Figure it is clear that the first principal component has a much larger eigenvalue than the second. So if we could only pick one component, we would pick the first one.
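Here is a minimal sketch of computing explained-variance ratios and drawing a scree plot; the data are synthetic, constructed so that most of the variation lives in two directions:

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(6)
n, p = 500, 6
# Most variation is concentrated in two directions, plus a little noise.
X = rng.normal(size=(n, 2)) @ rng.normal(size=(2, p)) + 0.2 * rng.normal(size=(n, p))
Xc = X - X.mean(axis=0)

S = np.linalg.svd(Xc, compute_uv=False)
explained_variance = S**2 / (n - 1)
ratios = explained_variance / explained_variance.sum()

# Scree plot: sorted components against their fraction of explained variance.
plt.plot(np.arange(1, p + 1), ratios, marker='o')
plt.xlabel('principal component')
plt.ylabel('fraction of variance explained')
plt.show()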
Unchanged total variance
An interesting and intuitive property of PCA is that the total variance of our raw data is identical to the total variance of our transformed data. Concretely, the sums of the diagonals of their respective covariance matrices are the same. For example, consider the two covariance matrices in Figure . The left subplot is the covariance matrix of some clearly correlated data $\mathbf{X}$, while the right subplot is the covariance matrix of our transformed data $\mathbf{Z}$. You can check for yourself that the sums of the diagonals are identical.
To see this formally, let’s compute the total variance of both our data $\mathbf{X}$ and the transformed data $\mathbf{Z}$. Clearly, the total variance of $\mathbf{Z}$ is just the sum of the diagonal (the eigenvalues) of $\boldsymbol{\Lambda}$, since it is a diagonal matrix:

$$\mathrm{tr}(\mathrm{cov}(\mathbf{Z})) = \mathrm{tr}(\boldsymbol{\Lambda}) = \sum_{k=1}^{p} \lambda_k.$$

However, the total variance of $\mathbf{X}$ is trickier, since $\mathbf{C}$ is not a diagonal matrix. We want to prove that the sum of the diagonal of $\mathbf{C} = \mathbf{V} \boldsymbol{\Lambda} \mathbf{V}^{\top}$ is equal to the sum of the eigenvalues. Intuitively, we might suspect this is true since $\mathbf{V}$ is a unitary matrix.
First, let’s write out the terms of $\mathbf{V}$ and $\boldsymbol{\Lambda}$ explicitly:

$$\mathbf{V} = \begin{bmatrix} v_{11} & \dots & v_{1p} \\ \vdots & \ddots & \vdots \\ v_{p1} & \dots & v_{pp} \end{bmatrix}, \qquad \boldsymbol{\Lambda} = \begin{bmatrix} \lambda_1 & & \\ & \ddots & \\ & & \lambda_p \end{bmatrix}.$$

Then $\mathbf{V} \boldsymbol{\Lambda}$ is

$$\mathbf{V} \boldsymbol{\Lambda} = \begin{bmatrix} \lambda_1 v_{11} & \dots & \lambda_p v_{1p} \\ \vdots & \ddots & \vdots \\ \lambda_1 v_{p1} & \dots & \lambda_p v_{pp} \end{bmatrix},$$

and $\mathbf{V} \boldsymbol{\Lambda} \mathbf{V}^{\top}$ is

$$\mathbf{V} \boldsymbol{\Lambda} \mathbf{V}^{\top} = \begin{bmatrix} \lambda_1 v_{11} & \dots & \lambda_p v_{1p} \\ \vdots & \ddots & \vdots \\ \lambda_1 v_{p1} & \dots & \lambda_p v_{pp} \end{bmatrix} \begin{bmatrix} v_{11} & \dots & v_{p1} \\ \vdots & \ddots & \vdots \\ v_{1p} & \dots & v_{pp} \end{bmatrix}.$$

Writing out each term in this product matrix would be tedious, but it is clear that the $k$-th diagonal element in this matrix is

$$\left[ \mathbf{V} \boldsymbol{\Lambda} \mathbf{V}^{\top} \right]_{kk} = \sum_{j=1}^{p} \lambda_j v_{kj}^2,$$

and so the sum of the diagonal must be

$$\mathrm{tr}(\mathbf{C}) = \sum_{k=1}^{p} \sum_{j=1}^{p} \lambda_j v_{kj}^2.$$

We can group all the terms associated with the same $\lambda_j$. This gives us

$$\mathrm{tr}(\mathbf{C}) = \sum_{j=1}^{p} \lambda_j \sum_{k=1}^{p} v_{kj}^2,$$

and each inner summation is really the squared norm of a column of $\mathbf{V}$. Since $\mathbf{V}$ is unitary, these squared norms are unity. So the sum of the diagonal elements of $\mathbf{C}$ is the sum of the eigenvalues. To summarize this subsection, we have shown:

$$\mathrm{tr}(\mathrm{cov}(\mathbf{X})) = \mathrm{tr}(\mathrm{cov}(\mathbf{Z})) = \sum_{k=1}^{p} \lambda_k.$$
I think this is fairly intuitive. For any given data set, there is some total variation. If we apply an orthogonal linear transformation (a rotation or reflection) to our data, we change how the features are correlated, but we don’t change the total variation inherent in the data.
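A quick numerical check of this claim, using synthetic data:

import numpy as np

rng = np.random.default_rng(7)
n, p = 400, 5
X = rng.normal(size=(n, p)) @ rng.normal(size=(p, p))
Xc = X - X.mean(axis=0)

C = Xc.T @ Xc / (n - 1)       # covariance of the raw (centered) data
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
Z = Xc @ Vt.T
cov_Z = Z.T @ Z / (n - 1)     # covariance of the transformed data

# The total variance (trace of the covariance matrix) is unchanged.
print(np.isclose(np.trace(C), np.trace(cov_Z)))   # True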
Loadings
In PCA, we decompose our covariance matrix into directions or axes (PCs) and scales (eigenvalues). However, we often want to combine these two bits of information into scaled vectors (e.g. the solid axes lines in Figure ). In the language of PCA and factor analysis, the loadings are the principal components scaled by the square roots of their respective eigenvalues. In other words, loadings are the PCs scaled by the standard deviations of the transformed features. Concretely, let $\mathbf{L}$ be a matrix of factor loadings. Then

$$\mathbf{L} = \mathbf{V} \boldsymbol{\Lambda}^{1/2}.$$
Conversely, eigenvectors are unit-scaled loadings.
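Concretely, a small sketch of my own of computing loadings from the SVD and checking that unit-scaling them recovers the eigenvectors:

import numpy as np

rng = np.random.default_rng(8)
n, p = 300, 4
X = rng.normal(size=(n, p)) @ rng.normal(size=(p, p))
Xc = X - X.mean(axis=0)

U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
V = Vt.T                           # principal components (unit vectors)
eigvals = S**2 / (n - 1)

loadings = V * np.sqrt(eigvals)    # scale each PC by its standard deviation

# Unit-scaling the loadings recovers the eigenvectors.
print(np.allclose(loadings / np.linalg.norm(loadings, axis=0), V))   # True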
Probabilistic PCA
We can frame PCA as a probabilistic model, called probabilistic PCA (Tipping & Bishop, 1999). The statistical model for probabilistic PCA is

$$\mathbf{x}_i = \mathbf{W} \mathbf{z}_i + \boldsymbol{\mu} + \boldsymbol{\epsilon}_i, \qquad \mathbf{z}_i \sim \mathcal{N}(\mathbf{0}, \mathbf{I}), \qquad \boldsymbol{\epsilon}_i \sim \mathcal{N}(\mathbf{0}, \sigma^2 \mathbf{I}).$$

Why is this the correct probabilistic framing? Tipping & Bishop have a deeper analysis, but here’s how I think about it. One view of PCA is that we decompose (or approximate) our design matrix $\mathbf{X}$ via a linear transformation of latent variables $\mathbf{Z}$:

$$\mathbf{X} \approx \mathbf{Z} \mathbf{V}^{\top}.$$
And the covariance matrix of $\mathbf{X}$ is proportional to

$$\mathbf{X}^{\top} \mathbf{X} \approx \mathbf{V} \mathbf{Z}^{\top} \mathbf{Z} \mathbf{V}^{\top}.$$
We can see that the probabilistic model above captures the same relationships, since it implies that $\mathbf{x}_i$ is normally distributed as

$$\mathbf{x}_i \sim \mathcal{N}\!\left(\boldsymbol{\mu}, \; \mathbf{W} \mathbf{W}^{\top} + \sigma^2 \mathbf{I}\right).$$

So the probabilistic model is analogous to the model we have discussed so far.
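To make the generative view concrete, here is a minimal sketch of sampling from the probabilistic PCA model; the parameters `W`, `mu`, and `sigma` are made up for illustration, and the empirical covariance of the samples should approach $\mathbf{W}\mathbf{W}^{\top} + \sigma^2 \mathbf{I}$ as the sample size grows:

import numpy as np

rng = np.random.default_rng(9)

# Made-up parameters for illustration: p observed dims, k latent dims.
p, k, sigma = 5, 2, 0.1
W = rng.normal(size=(p, k))
mu = rng.normal(size=p)

# Generative model: z ~ N(0, I), x = W z + mu + eps, eps ~ N(0, sigma^2 I).
n = 5000
Z = rng.normal(size=(n, k))
eps = sigma * rng.normal(size=(n, p))
X = Z @ W.T + mu + eps

# The implied marginal covariance of x is W W^T + sigma^2 I.
emp_cov = np.cov(X, rowvar=False)
model_cov = W @ W.T + sigma**2 * np.eye(p)
print(np.abs(emp_cov - model_cov).max())   # small; shrinks as n grows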
4. Exploring a complete example
As an example of the power of PCA, let’s reproduce a famous result, eigenfaces (Turk & Pentland, 1991). An eigenface is simply a principal component of a data set of images of faces. Since principal components form a basis of the images in our data set, the result of visualizing a single eigenface is an eerie, idealized face. See Figure for eigenfaces based on the Olivetti database of faces.
What does this definition of eigenfaces really mean? Recall the equation relating a datum to its latent variables, which I’ll re-write here:

$$\mathbf{x}_i = \mathbf{V} \mathbf{z}_i.$$

If $\mathbf{x}_i$ is a $p$-vector representing an image with $p$ pixels and if an eigenface is a column of $\mathbf{V}$, then $\mathbf{z}_i$ must be a $k$-vector representing how much each eigenface (out of $k$ eigenfaces) contributes to $\mathbf{x}_i$. So we can think of $\mathbf{z}_i$ as representing a $k$-vector of latent pixels. To grok this, it may be helpful to write out the matrix multiplication in the equation above more diagrammatically:

$$\underbrace{\mathbf{x}_i}_{\text{image}} = z_{i1} \underbrace{\mathbf{v}_1}_{\text{eigenface 1}} + z_{i2} \underbrace{\mathbf{v}_2}_{\text{eigenface 2}} + \dots + z_{ik} \underbrace{\mathbf{v}_k}_{\text{eigenface } k}.$$

So each datum (the image on the left-hand side of the equation above) is a linear combination of eigenfaces, where the coefficients are defined by the latent variables. See my post on matrices as functions and data for a bit more discussion on this idea of how to interpret matrices as data.
But what’s the utility of an eigenface? If we can find a small set of $k$ eigenfaces, say $k \ll p$ where $p$ is the number of pixels, then we can find a lower-dimensional representation of our data that can be used for downstream tasks like visualization, data compression, and classification. For example, we could sort our eigenfaces by their associated eigenvalues and only pick the top two eigenfaces. Thus, we could represent every face in our data set as a linear combination of just two eigenfaces. And we could visualize the two latent pixels for each image (Figure ).
While eigenfaces are a bit abstract, recall that this is precisely what we were trying to do when we imagined ourselves as naturalists classifying birds on a strange island. We have simply abstracted the problem a bit more: instead of collecting beaks and feathers, we are collecting pixels. On the island, we were able to construct the composite feature “bird size” out of “beak size” and “wing span” because those two features were correlated. With eigenfaces, we are able to construct composite or projected pixels because many pixels in our data are correlated.
Now that we understand eigenfaces, let’s implement PCA and reproduce Figure . Here is the Python code I used to perform PCA. I think it is pretty straightforward if you understand the derivations above:
import numpy as np
from types import SimpleNamespace


def pca(X, n_components=None):
    """Perform principal component analysis on data X with [samples x features]."""
    n, p = X.shape
    dim = n_components if n_components else p
    xmean = X.mean(axis=0)
    # Mean-center our data.
    X = X - xmean
    # Do the eigendecomposition via the SVD.
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    # Compute eigenvalues from singular values.
    L = S**2 / (n - 1)
    # Sort output by eigenvalues in descending order.
    inds = S.argsort()[::-1]
    eigvals = L[inds][:dim]
    pcs = Vt[inds][:dim]
    # For consistency with the blog post text, transpose so that the columns
    # of `pcs` are the eigenvectors (principal components).
    pcs = pcs.T
    # Transform X into latent variables Z.
    Z = (U * S)[:, :dim]
    out = {}
    out['pcs'] = pcs
    out['explained_variance'] = eigvals
    out['total_variance'] = eigvals.sum()
    out['Z'] = Z
    out['n_components'] = dim
    out['mean'] = xmean
    return SimpleNamespace(**out)
We can then load in the Olivetti data set from Scikit-learn, perform PCA, and then visualize the first ten components:
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_olivetti_faces

faces = fetch_olivetti_faces(return_X_y=False)
result = pca(X=faces.data)

fig, axes = plt.subplots(2, 5)
for i, ax in enumerate(axes.flat):
    ax.imshow(result.pcs[:, i].reshape(faces.images[0].shape))
And we’re done!
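As a further sketch (not part of the original result), we can use the fitted `result` from above to reconstruct a face from its top-$k$ latent pixels, i.e. as a linear combination of $k$ eigenfaces plus the mean face; the choice $k = 50$ is arbitrary:

import matplotlib.pyplot as plt

# Assumes `pca`, `faces`, and `result` from the code above.
k = 50
V_k = result.pcs[:, :k]         # top-k eigenfaces (columns)
Z_k = result.Z[:, :k]           # top-k latent pixels per image

# x_hat = V_k z + mean, i.e. a linear combination of k eigenfaces.
i = 0
x_hat = V_k @ Z_k[i] + result.mean

fig, (ax1, ax2) = plt.subplots(1, 2)
ax1.imshow(faces.images[i], cmap='gray')
ax1.set_title('original')
ax2.imshow(x_hat.reshape(faces.images[0].shape), cmap='gray')
ax2.set_title(f'reconstruction (k={k})')
plt.show()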
Conclusion
Principal components are a collection of orthogonal unit vectors such that the $k$-th principal component captures maximal variation in our data while being orthogonal to the previous $k-1$ PCs. Thus, the principal components form a basis, which can be used to transform our data into latent variables. These latent variables capture correlations in our data’s features, allowing for a reduced set of latent features that capture roughly the same amount of total variation.
Given the advances in computer vision today, the eigenfaces results may not be particularly impressive. For example, one could easily fit an autoencoder to produce much higher quality latent variables for clustering. But I think this is the wrong perspective. In context, it should be surprising that a simple linear model such as PCA is so effective on high-dimensional and richly structured data such as images. Given that it is also fast, easy to implement, and theoretically well-understood, PCA is a great first choice for preliminary data analysis.
1. Some readers might notice that this is just the Iris flower data set, but I like birds.
2. We constrain $\mathbf{w}$ to be a unit vector because there are an infinite number of unconstrained solutions (any scalar multiple of $\mathbf{w}$).
3. We could optimize the absolute rather than squared length, but we use a squared quantity because it is nicer for optimization (i.e. it is continuously differentiable).
4. $\mathbf{Q}$ and $\mathbf{V}$ are equivalent. For simplicity, I’ll use $\mathbf{V}$ for principal components.
- Pearson, K. (1901). On lines and planes of closest fit to systems of points in space. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 2(11), 559–572.
- Hotelling, H. (1933). Analysis of a complex of statistical variables into principal components. Journal of Educational Psychology, 24(6), 417.
- Turk, M., & Pentland, A. (1991). Eigenfaces for recognition. Journal of Cognitive Neuroscience, 3(1), 71–86.
- Tipping, M. E., & Bishop, C. M. (1999). Probabilistic principal component analysis. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 61(3), 611–622.