Principal Component Analysis
Principal component analysis (PCA) is a simple, fast, and elegant linear method for data analysis. I explore PCA in detail, first with pictures and intuition, then with linear algebra and detailed derivations, and finally with code.
Principal component analysis (PCA) (Pearson, 1901; Hotelling, 1933) is a workhorse in statistics. It is often the first tool we reach for when performing exploratory data analysis or dimension reduction. The method is simple, fast, and theoretically well-understood, and it is the bedrock upon which many more complex dimension reduction methods are built.
The goal of this post is to understand PCA in detail. We’ll start by visualizing the big idea, exploring the key concepts with as little mathematical detail or jargon as needed. In the next two sections, we’ll formalize our intuitions from the first section and then derive some basic properties of PCA. Finally, we’ll end with a complete implementation of PCA to reproduce a famous result, eigenfaces (Turk & Pentland, 1991).
This is a long post. Please feel free to jump around as desired:
- Visualizing the big idea
- Formalizing the objective function
- Deriving basic properties
- Exploring a complete example
1. Visualizing the big idea
Imagine we are naturalists visiting a strange new island, and we are tasked with naming, defining, and classifying the birds we find there. We plan to do this by first observing and cataloguing as many birds as possible and then grouping them based on their shared characteristics. But how do we decide when to place a bird into a given group?
For example, compare the ramphastos (Figure , bottom left) to the rufous-collared kingfisher (Figure , top right). We may think that these two birds are obviously different species, but why do we think that? The answer is that these two birds are more similar to many other birds than they are to each other. We can easily find two groupings of birds, say toucans and kingfishers, such that the variation between the two groups is larger than the variation within each group.
How can we make this idea of intra- versus inter-group variation more precise? One idea would be to record many features of each bird we see: coloring, size, location, plumage, beak shape, and so forth. Then we could look for features along which all the birds on this island had the most variation. We could use such a variance-maximizing feature to help us split birds into meaningful groups.
It’s worth asking why features with lots of variation are useful for our task, so consider the opposite: imagine that all the birds on the island had a particular feature with almost no variation. Suppose, for example, that all the birds were white. This feature, coloring, would be almost useless for classifying the island’s birds. However, all the birds on the island being white does not mean that all the birds belong in the same group. If some white birds have large, hooked beaks while other white birds have short, stout beaks, we might wonder if this feature, beak shape, is important for our task. This is the basic logic behind looking for features with lots of variation.
Let’s try to visualize this idea with some data. Imagine we plot two bird features against each other, say beak size against wing span (Figure ). After plotting these data [1], we notice something: since these two features are correlated, the axis along which variation between birds is maximized is neither just beak size nor just wing span. Instead, the axis along which variation is maximized is some combination of both features, as represented by the blue dashed line. This diagonal line can be thought of as representing a new, composite feature or dimension. We can’t precisely interpret this composite feature, but we might hypothesize that it represents “bird size”.
In the language of statistics and linear algebra, we call the direction of this blue dashed line the first principal component (PC). More formally, the first PC is a unit vector that defines an axis along which our data’s variance is maximized.
After finding this first PC, what can we do? We might still be interested in finding dimensions along which our data’s variance is maximized, but without any constraints, another principal component would just be an axis very similar to the first. Thus, we’ll add a constraint: the next PC must be orthogonal to the first. So the second PC defines an axis along which our data’s variance is maximized, subject to this new axis being orthogonal to the first PC.
What can we do with these two PCs? First, observe that these two PCs are orthogonal to each other and thus form a basis in $\mathbb{R}^2$. In other words, we can represent any bird in our data set (any black dot in Figure ) as a linear combination of these two PCs. This allows us to perform a change of basis on our data. If we rotate or reflect our data such that the principal components are our new basis, then we get Figure . All the relationships in our data remain the same, but we have “whitened” our data in the sense that our two features are no longer correlated!
Again, we should stop and question things. Why do we care about this change of basis? Figure has the answer. Recall that our task as naturalists on a strange island is to classify all the new birds. Imagine that, being short on time, the only features we were able to collect about the island’s birds were beak size and wing span, so we must classify the birds using only these two features. What Figure suggests is that, in fact, we only need one feature to classify the birds: the first principal component. This fact, that we need only one of the features rather than both, is a deep idea. You may know that PCA is used for dimension reduction, and this is why. PCA allows us to find new composite features based on how much variation they explain, allowing us to keep only the new features that explain a large fraction of the variance in our data. These new composite features are important, so let’s name them; we’ll call them latent features or latent variables.
I think it’s worth reminding ourselves that while all this talk of “principal components” and “change of basis” and “orthogonality” is abstract, its meaning can be interpretable and even intuitive. Imagine that a brilliant ornithologist visited our island and correctly classified all the birds into three groups (Figure ). Since the first PC captures most of the variation, this would immediately suggest that if we actually looked at the birds and their groupings, we would find large beaks and large wingspans at one end of the spectrum, and small beaks and small wingspans at the other. The first PC is simply capturing the variation in both, which is possible precisely because these two features covary.
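To make this concrete, here is a minimal sketch of my own (not part of the original analysis, and peeking ahead at the machinery of the next sections): it generates synthetic, correlated “beak size” and “wing span” measurements and finds the variance-maximizing direction by eigendecomposing the covariance matrix. All numbers are made up for illustration.

import numpy as np

rng = np.random.default_rng(0)

# Synthetic, correlated "beak size" and "wing span" measurements
# (hypothetical data, standing in for the bird features above).
n = 300
beak = rng.normal(loc=10, scale=2, size=n)
wing = 3 * beak + rng.normal(scale=2, size=n)
X = np.column_stack([beak, wing])

# Mean-center, compute the covariance matrix, and take its top eigenvector.
Xc = X - X.mean(axis=0)
C = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(C)   # eigenvalues in ascending order
first_pc = eigvecs[:, -1]              # direction of maximal variance

# Projecting each bird onto the first PC collapses the two correlated
# features into a single composite "bird size" score.
size_scores = Xc @ first_pc
print(first_pc, size_scores.var(ddof=1), eigvals[-1])

Projecting each bird onto `first_pc` is exactly the “one composite feature” idea described above.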
2. Formalizing the objective function
Now that we understand the big picture, we can ask: how do we find these principal components? Let’s turn our geometric picture into an optimization problem.
Computing the principal components
Assume our data is mean-centered, and consider an arbitrary unit vector [2] $\mathbf{w}$ (Figure ). We want to maximize the variance of our data after it is projected onto the new axis defined by $\mathbf{w}$. In other words, we want to maximize the squared distance [3] of our projected data from the origin.
To formalize this, consider an arbitrary datum in our collection, $\mathbf{x}_i \in \mathbb{R}^p$, and let’s denote the projection of $\mathbf{x}_i$ onto $\mathbf{w}$ as $\tilde{\mathbf{x}}_i$:

$$\tilde{\mathbf{x}}_i \triangleq (\mathbf{x}_i^{\top} \mathbf{w}) \mathbf{w}.$$

We want to find the optimal $\mathbf{w}$ such that the squared length of $\tilde{\mathbf{x}}_i$ is maximized, where the length is

$$\|\tilde{\mathbf{x}}_i\| = \|(\mathbf{x}_i^{\top} \mathbf{w}) \mathbf{w}\| = |\mathbf{x}_i^{\top} \mathbf{w}| \, \|\mathbf{w}\| = |\mathbf{x}_i^{\top} \mathbf{w}|.$$

This suggests the following optimization problem:

$$\mathbf{w}^{\star} = \arg\max_{\mathbf{w}} \frac{1}{n-1} \sum_{i=1}^{n} (\mathbf{x}_i^{\top} \mathbf{w})^2, \qquad \text{subject to} \quad \|\mathbf{w}\| = 1.$$
This is the linear algebraic formulation of the idea in the previous section: that the first PC represents the axis along which our projected data’s variance is maximized.
A little algebraic manipulation of the summation in the equation above gives us

$$\frac{1}{n-1} \sum_{i=1}^{n} (\mathbf{x}_i^{\top} \mathbf{w})^2 = \frac{1}{n-1} \sum_{i=1}^{n} \mathbf{w}^{\top} \mathbf{x}_i \mathbf{x}_i^{\top} \mathbf{w} \stackrel{\star}{=} \mathbf{w}^{\top} \left( \frac{1}{n-1} \sum_{i=1}^{n} \mathbf{x}_i \mathbf{x}_i^{\top} \right) \mathbf{w} = \mathbf{w}^{\top} \mathbf{C} \mathbf{w},$$

where $\mathbf{C}$ is the empirical covariance matrix of our mean-centered data. See this post if you’re unconvinced by step $\star$. This identity allows us to rewrite the optimization problem as

$$\mathbf{w}^{\star} = \arg\max_{\mathbf{w}} \left\{ \mathbf{w}^{\top} \mathbf{C} \mathbf{w} \right\}, \qquad \text{subject to} \quad \|\mathbf{w}\| = 1.$$

So the optimization problem for finding the first PC requires optimizing with respect to our data’s (mean-centered) covariance matrix. This is because we’re ultimately maximizing the variance of our data with respect to $\mathbf{w}$, and the covariance matrix captures the necessary relationships.
However, what we’re about to show is even better: the first PC is actually just the first eigenvector of the covariance matrix, $\mathbf{C}$! This is not, at least in my mind, an obvious result. It is surprising and beautiful. To see it, observe that since $\mathbf{C}$ is a symmetric and positive semi-definite matrix, it has an eigendecomposition,

$$\mathbf{C} = \mathbf{Q} \boldsymbol{\Lambda} \mathbf{Q}^{\top},$$

where $\mathbf{Q}$ is a square ($p \times p$) matrix with eigenvectors for columns and where $\boldsymbol{\Lambda}$ is a diagonal matrix of eigenvalues. If we define a new variable $\mathbf{u}$ as

$$\mathbf{u} \triangleq \mathbf{Q}^{\top} \mathbf{w},$$

then we can rewrite the optimization problem above as

$$\mathbf{u}^{\star} = \arg\max_{\mathbf{u}} \left\{ \mathbf{u}^{\top} \boldsymbol{\Lambda} \mathbf{u} \right\}, \qquad \text{subject to} \quad \|\mathbf{u}\| = 1.$$

The norm of $\mathbf{u}$ must also be unity since

$$\|\mathbf{u}\| = \|\mathbf{Q}^{\top} \mathbf{w}\| = \sqrt{\mathbf{w}^{\top} \mathbf{Q} \mathbf{Q}^{\top} \mathbf{w}} \stackrel{\dagger}{=} \sqrt{\mathbf{w}^{\top} \mathbf{w}} = \|\mathbf{w}\| = 1.$$

Step $\dagger$ relies on the fact that $\mathbf{Q}$ is orthonormal and thus $\mathbf{Q} \mathbf{Q}^{\top} = \mathbf{I}$. Since $\boldsymbol{\Lambda}$ is diagonal, the bracketed term in the optimization problem above can be unpacked back into a summation, giving us

$$\mathbf{u}^{\top} \boldsymbol{\Lambda} \mathbf{u} = \sum_{k=1}^{p} \lambda_k u_k^2,$$

where $\lambda_k$ is the $k$-th eigenvalue associated with the $k$-th column of $\mathbf{Q}$. Without loss of generality, assume that the eigenvalues are ordered, $\lambda_1 \geq \lambda_2 \geq \dots \geq \lambda_p$. (Recall that these eigenvalues are all non-negative, since covariance matrices are always symmetric and positive semi-definite.) Since the eigenvalues are all non-negative and the weights must satisfy $\sum_k u_k^2 = 1$, clearly the weights that maximize the problem are

$$u_1 = 1, \qquad u_2 = u_3 = \dots = u_p = 0.$$

In other words, the optimal $\mathbf{u}^{\star}$ is just the first standard basis vector, $\mathbf{e}_1$. If we plug this optimal weight back into the definition $\mathbf{w} = \mathbf{Q} \mathbf{u}$, we see that the optimal $\mathbf{w}^{\star} = \mathbf{Q} \mathbf{e}_1 = \mathbf{q}_1$ is simply the first eigenvector!
To summarize, we wanted to find the first principal component of our data. As we saw in the first section, this first principal component is a vector that defines an axis along which the variance of our projected data is maximized. And the first principal component is, in fact, the first eigenvector of our data’s covariance matrix!
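If you want to convince yourself of this numerically, here is a small sketch of my own (not from the original derivation; the covariance matrix used to generate data is made up) comparing the variance along the top eigenvector with the variance along many random unit vectors:

import numpy as np

rng = np.random.default_rng(1)

# Random correlated data (synthetic, for illustration only).
X = rng.multivariate_normal(mean=[0, 0, 0],
                            cov=[[3.0, 1.2, 0.4],
                                 [1.2, 2.0, 0.5],
                                 [0.4, 0.5, 1.0]],
                            size=2000)
Xc = X - X.mean(axis=0)
C = np.cov(Xc, rowvar=False)

# First eigenvector (eigh returns eigenvalues in ascending order).
eigvals, eigvecs = np.linalg.eigh(C)
w_star = eigvecs[:, -1]

def projected_variance(w):
    return (Xc @ w).var(ddof=1)

# Compare the variance along w_star against many random unit vectors.
random_dirs = rng.normal(size=(10_000, 3))
random_dirs /= np.linalg.norm(random_dirs, axis=1, keepdims=True)
best_random = max(projected_variance(w) for w in random_dirs)

# The variance along the first eigenvector equals the largest eigenvalue
# and is at least as large as along any random direction.
print(projected_variance(w_star), best_random, eigvals[-1])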
Finding the second principal component follows the same reasoning as above, except now we need to add an additional constraint: $\mathbf{w}_2$ must be orthogonal to $\mathbf{w}_1$. Intuitively, I think it’s pretty clear that the optimal $\mathbf{u}$ for $\mathbf{w}_2$ will be the second standard basis vector, $\mathbf{e}_2$, since that will “select” the second largest eigenvalue, $\lambda_2$, while ensuring that the two principal components are orthogonal, since

$$\mathbf{w}_1^{\top} \mathbf{w}_2 = (\mathbf{Q} \mathbf{e}_1)^{\top} (\mathbf{Q} \mathbf{e}_2) = \mathbf{e}_1^{\top} \mathbf{Q}^{\top} \mathbf{Q} \mathbf{e}_2 = \mathbf{e}_1^{\top} \mathbf{e}_2 = 0.$$
Of course, this is not a proof, which would probably use induction, but I think it captures the main ideas nicely.
At this point, we may be wondering: have we just “punted” understanding principal components to understanding eigenvectors? I don’t think so. Of course, a deeper understanding of eigenvectors would be useful here. But I also think it’s fine to just say that this idea—that the eigenvectors of a covariance matrix define orthogonal axes along which our data has maximal variance—is simply one of many reasons why eigenvectors are such well-studied mathematical objects.
Maximizing variation versus minimizing error
So far, we have discussed PCA in terms of finding a linear projection of our data such that our projected data have maximum variance. Visually, this is just finding the blue dashed line in Figure so that we can transform our data into Figure . However, there is an alternative view of PCA’s objective function, which is that it finds an axis such that the squared error between our data and our projected data is minimized.
This connection may sound more complicated than it is. It’s really just the claim that if we want to maximize one leg (not the hypotenuse) of a right triangle ($\tilde{\mathbf{x}}_i$ in Figure ), we must minimize the other leg ($\boldsymbol{\epsilon}_i$ in Figure ). This StackExchange answer also has some nice visualizations, but in my opinion, those visualizations are essentially more complex versions of Figure .
To see this formally, let’s define the difference between $\mathbf{x}_i$ and $\tilde{\mathbf{x}}_i$ as $\boldsymbol{\epsilon}_i \triangleq \mathbf{x}_i - \tilde{\mathbf{x}}_i$. If we minimize the length of $\boldsymbol{\epsilon}_i$, we minimize the difference between our datum $\mathbf{x}_i$ and our projected datum $\tilde{\mathbf{x}}_i$. So we want to find the optimal $\mathbf{w}$ such that the length of $\boldsymbol{\epsilon}_i$ is minimized:

$$\mathbf{w}^{\star} = \arg\min_{\mathbf{w}} \sum_{i=1}^{n} \|\mathbf{x}_i - (\mathbf{x}_i^{\top} \mathbf{w}) \mathbf{w}\|^2, \qquad \text{subject to} \quad \|\mathbf{w}\| = 1.$$

We can write the unconstrained objective in terms of $\mathbf{x}_i^{\top} \mathbf{w}$ as

$$\sum_{i=1}^{n} \|\boldsymbol{\epsilon}_i\|^2 = \sum_{i=1}^{n} \left[ \|\mathbf{x}_i\|^2 - 2 (\mathbf{x}_i^{\top} \mathbf{w})^2 + (\mathbf{x}_i^{\top} \mathbf{w})^2 \|\mathbf{w}\|^2 \right] \stackrel{\ddagger}{=} \sum_{i=1}^{n} \|\mathbf{x}_i\|^2 - \sum_{i=1}^{n} (\mathbf{x}_i^{\top} \mathbf{w})^2.$$

Step $\ddagger$ holds because $\|\mathbf{w}\|^2 = 1$, and the $\|\mathbf{x}_i\|^2$ terms are constants with respect to $\mathbf{w}$. To minimize a negative quantity, we want to maximize the un-negated quantity. So we can write this objective as

$$\mathbf{w}^{\star} = \arg\max_{\mathbf{w}} \sum_{i=1}^{n} (\mathbf{x}_i^{\top} \mathbf{w})^2, \qquad \text{subject to} \quad \|\mathbf{w}\| = 1.$$

But this is equivalent to optimizing the absolute length of $\tilde{\mathbf{x}}_i$, which in turn is equivalent to optimizing the squared length of $\tilde{\mathbf{x}}_i$.
I think it’s useful to understand this connection, but the real intuition, in my mind, is Figure . PCA finds principal components such that the variance of our projected data is maximized, which is equivalent to the squared error of our projected data being minimized (with respect to the original data).
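Here is a quick numerical check of this equivalence, a sketch of my own with synthetic 2-D data: sweeping unit vectors around the circle, the direction that maximizes the projected variance is the same direction that minimizes the squared reconstruction error.

import numpy as np

rng = np.random.default_rng(2)

# Synthetic 2-D data (illustrative only).
X = rng.multivariate_normal([0, 0], [[4.0, 1.5], [1.5, 1.0]], size=1000)
Xc = X - X.mean(axis=0)

def projected_variance(w):
    return ((Xc @ w) ** 2).sum()

def reconstruction_error(w):
    proj = np.outer(Xc @ w, w)   # project each point onto the line through w
    return ((Xc - proj) ** 2).sum()

# Sweep unit vectors around the half-circle.
angles = np.linspace(0, np.pi, 1000)
dirs = np.column_stack([np.cos(angles), np.sin(angles)])
variances = np.array([projected_variance(w) for w in dirs])
errors = np.array([reconstruction_error(w) for w in dirs])

# Max projected variance and min reconstruction error occur at the same angle.
assert variances.argmax() == errors.argmin()
print(angles[variances.argmax()])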
3. Deriving basic properties
Now that we visualized what PCA is doing and have formalized our intuition, we can transition to more a rigorous treatment. Let’s derive some useful properties about PCA. In particular, this section will rely heavily on an understanding of the singular value decomposition (SVD). Using the SVD simply makes many calculations involving PCA easier to do. Please see my post on the SVD as needed.
Principal components and eigenvectors
Let’s start by computing the principal components. Let $\mathbf{X}$ be an $n \times p$ mean-centered design matrix of our data. The columns are features, while the rows are independent samples. We saw in the previous section that the principal components are just the eigenvectors of the covariance matrix of $\mathbf{X}$. The empirical covariance matrix of our data is

$$\mathbf{C} = \frac{1}{n-1} \mathbf{X}^{\top} \mathbf{X}.$$

So $\mathbf{C}$ is a $p \times p$ symmetric and positive semi-definite matrix. The eigendecomposition of $\mathbf{C}$ is

$$\mathbf{C} = \mathbf{Q} \boldsymbol{\Lambda} \mathbf{Q}^{\top},$$

where $\boldsymbol{\Lambda}$ is a diagonal matrix of eigenvalues, and where $\mathbf{Q}$ is a $p \times p$ matrix whose columns are eigenvectors, i.e. principal components. In other words, the columns of $\mathbf{Q}$ are unit vectors that point in orthogonal directions of maximal variation (e.g. the red and blue arrows in Figure ).

Note that we can also compute the principal components via the SVD. The SVD of $\mathbf{X}$ is

$$\mathbf{X} = \mathbf{U} \mathbf{S} \mathbf{V}^{\top},$$

where $\mathbf{U}$ is unitary and $n \times n$, $\mathbf{S}$ is diagonal and $n \times p$, and $\mathbf{V}$ is unitary and $p \times p$. We can write our empirical covariance matrix via this SVD:

$$\mathbf{C} = \frac{1}{n-1} \mathbf{X}^{\top} \mathbf{X} = \frac{1}{n-1} \mathbf{V} \mathbf{S}^{\top} \mathbf{U}^{\top} \mathbf{U} \mathbf{S} \mathbf{V}^{\top} = \mathbf{V} \left( \frac{\mathbf{S}^{\top} \mathbf{S}}{n-1} \right) \mathbf{V}^{\top}.$$

Here, $\mathbf{S}^{\top} \mathbf{S}$ denotes a $p \times p$ diagonal matrix with the squared singular values of $\mathbf{X}$. Thus, we can use the SVD to diagonalize our matrix $\mathbf{C}$. This immediately suggests a relationship between singular values and eigenvalues, namely:

$$\lambda_k = \frac{s_k^2}{n-1}.$$
Note that since $\mathbf{C}$ is symmetric and positive semi-definite, its eigenvalues are non-negative. Thus, the singular values and the eigenvalues tell the same story: they capture the scale of the variation along their respective principal components.
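This relationship is easy to verify numerically. The sketch below (with made-up data) compares the eigenvalues of the empirical covariance matrix against the squared singular values of the centered data, scaled by $1/(n-1)$:

import numpy as np

rng = np.random.default_rng(3)
n, p = 500, 4
X = rng.normal(size=(n, p)) @ rng.normal(size=(p, p))   # correlated features
Xc = X - X.mean(axis=0)

# Eigenvalues of the covariance matrix, sorted in descending order.
C = Xc.T @ Xc / (n - 1)
eigvals = np.linalg.eigvalsh(C)[::-1]

# Squared singular values of the centered data, scaled by 1/(n-1).
s = np.linalg.svd(Xc, compute_uv=False)
from_svd = s**2 / (n - 1)

print(np.allclose(eigvals, from_svd))   # True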
Transformed data
Now that we know how to compute the principal components [4], let’s use them to transform our data (e.g. from Figure to Figure ).

The matrix $\mathbf{V}$ performs a change of basis on $p$-vectors such that, after the linear transformation, the standard basis vectors are mapped to the columns (eigenvectors) of $\mathbf{V}$. Visually, $\mathbf{V}$ un-whitens vectors by rotating them from Figure back into Figure . Let $\mathbf{z}_i$ be a vector in this whitened vector space. Then we can represent a single datum $\mathbf{x}_i$ (also a $p$-dimensional column vector) as

$$\mathbf{x}_i = \mathbf{V} \mathbf{z}_i.$$

So $\mathbf{x}_i$ is our un-whitened data. Since $\mathbf{x}_i$ is a column vector here but a row vector in $\mathbf{X}$, we can represent the equation above as a matrix operation,

$$\mathbf{X} = \mathbf{Z} \mathbf{V}^{\top} \quad \Longleftrightarrow \quad \mathbf{Z} = \mathbf{X} \mathbf{V}.$$

We can write this transformation via the SVD as

$$\mathbf{Z} = \mathbf{X} \mathbf{V} = \mathbf{U} \mathbf{S} \mathbf{V}^{\top} \mathbf{V} = \mathbf{U} \mathbf{S}.$$

Thus, the linear transformation $\mathbf{V}$ transforms our data $\mathbf{X}$ into latent variables $\mathbf{Z}$.

When we use PCA for dimension reduction, $\mathbf{V}$ has shape $p \times k$ and $\mathbf{Z}$ has shape $n \times k$, where $k$ is the number of reduced features and where ideally $k \ll p$. In this framing, we can view $\mathbf{Z} \mathbf{V}^{\top}$ as a low-rank approximation of $\mathbf{X}$. So an alternative view of PCA is that we’re finding a low-rank approximation of our data matrix, capturing the basic structure of our data with fewer parameters.
Note that an alternative proof of the equivalence between maximizing variation and minimizing error is to work with the squared error $\|\mathbf{X} - \mathbf{Z} \mathbf{V}^{\top}\|_F^2$.
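Here is a small sketch of this low-rank view (again with synthetic data): keeping only the top $k$ columns of $\mathbf{V}$, the squared reconstruction error equals the sum of the discarded squared singular values.

import numpy as np

rng = np.random.default_rng(4)
n, p, k = 300, 10, 3
X = rng.normal(size=(n, k)) @ rng.normal(size=(k, p)) + 0.1 * rng.normal(size=(n, p))
Xc = X - X.mean(axis=0)

# Principal components from the SVD: columns of V.
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
V = Vt.T

# Latent variables and rank-k reconstruction.
Z = Xc @ V[:, :k]          # equivalently (U * S)[:, :k]
X_hat = Z @ V[:, :k].T

# The squared reconstruction error equals the sum of the discarded squared
# singular values, i.e. (n-1) times the discarded eigenvalues.
err = ((Xc - X_hat) ** 2).sum()
print(np.isclose(err, (S[k:] ** 2).sum()))   # True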
Latent variables with uncorrelated features
We claimed that after transforming our data with the principal components $\mathbf{V}$, our latent data $\mathbf{Z}$ has features with zero correlation. Let’s prove this. The covariance matrix of $\mathbf{Z}$ is:

$$\mathrm{cov}(\mathbf{Z}) = \frac{1}{n-1} \mathbf{Z}^{\top} \mathbf{Z} = \frac{1}{n-1} \mathbf{V}^{\top} \mathbf{X}^{\top} \mathbf{X} \mathbf{V} = \mathbf{V}^{\top} \mathbf{C} \mathbf{V} = \mathbf{V}^{\top} \mathbf{V} \boldsymbol{\Lambda} \mathbf{V}^{\top} \mathbf{V} = \boldsymbol{\Lambda}.$$

Clearly, $\boldsymbol{\Lambda}$ is a diagonal matrix, meaning $\mathbf{Z}$ has uncorrelated features. Furthermore, we can see that the variances of our transformed data’s features are simply the eigenvalues of the covariance matrix of $\mathbf{X}$! So the standard deviations in Figure (the lengths of the principal components) are just the square roots of the eigenvalues.
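This is also easy to verify numerically. In the sketch below (synthetic data), the covariance matrix of the latent variables comes out diagonal, with the eigenvalues on the diagonal:

import numpy as np

rng = np.random.default_rng(5)
n, p = 1000, 3
X = rng.multivariate_normal(np.zeros(p),
                            [[2.0, 0.8, 0.3],
                             [0.8, 1.5, 0.6],
                             [0.3, 0.6, 1.0]],
                            size=n)
Xc = X - X.mean(axis=0)

U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
Z = Xc @ Vt.T                 # latent variables

cov_Z = Z.T @ Z / (n - 1)
eigvals = S**2 / (n - 1)

# Off-diagonal entries are (numerically) zero, and the diagonal entries
# are the eigenvalues of the covariance matrix of X.
print(np.allclose(cov_Z, np.diag(eigvals)))   # True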
Dimension reduction and variance explained
Now that we can compute the variances of the features of $\mathbf{Z}$, we can answer an important question: how many principal components should we use when performing dimension reduction with PCA? Answer: we take just the principal components with the largest eigenvalues, since those are the components associated with the largest variances!
In particular, we use the ratio of explained variance to total variance. The explained variance of a given principal component is just the variance of its associated feature (column in $\mathbf{Z}$), while the total variance is the sum of these explained variances. If we divide the former by the latter, we get a set of ratios that sum to unity and represent the fraction of variance explained by each feature.
A common approach to visualizing all this is a scree plot. In a scree plot, we plot the sorted principal components against the ratio of explained variance (e.g. Figure ). Ideally, we would want our scree plot to look like a hockey stick. The “elbow” of the stick is where the eigenvalues level off. We would only want to take the components with large eigenvalues, since they explain most of the variation in our data. For example, in Figure it is clear that the first principal component has a much larger eigenvalue than the second. So if we could only pick one component, we would pick the first one.
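Here is a minimal sketch of computing explained-variance ratios and drawing a scree plot; the data are synthetic, constructed so that most of the variation lives in two directions:

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(6)
n, p = 500, 6
# Most variation is concentrated in two directions, plus a little noise.
X = rng.normal(size=(n, 2)) @ rng.normal(size=(2, p)) + 0.2 * rng.normal(size=(n, p))
Xc = X - X.mean(axis=0)

S = np.linalg.svd(Xc, compute_uv=False)
explained_variance = S**2 / (n - 1)
ratios = explained_variance / explained_variance.sum()

# Scree plot: sorted components against their fraction of explained variance.
plt.plot(np.arange(1, p + 1), ratios, marker='o')
plt.xlabel('principal component')
plt.ylabel('fraction of variance explained')
plt.show()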
Unchanged total variance
An interesting and intuitive property of PCA is that the total variance of our raw data is identical to the total variance of our transformed data. Concretely, the sums of the diagonals of their respective covariance matrices are the same. For example, consider the two covariance matrices in Figure . The left subplot is the covariance matrix of some clearly correlated data $\mathbf{X}$, while the right subplot is the covariance matrix of our transformed data $\mathbf{Z}$. You can check for yourself that the sums of the diagonals are identical.
To see this formally, let’s compute the total variance of both our data $\mathbf{X}$ and the transformed data $\mathbf{Z}$. Clearly, the total variance of $\mathbf{Z}$ is just the sum of the diagonal (the eigenvalues) of $\boldsymbol{\Lambda}$, since it is a diagonal matrix:

$$\mathrm{tr}(\mathrm{cov}(\mathbf{Z})) = \mathrm{tr}(\boldsymbol{\Lambda}) = \sum_{k=1}^{p} \lambda_k.$$

However, the total variance of $\mathbf{X}$ is trickier, since $\mathbf{C}$ is not a diagonal matrix. We want to prove that the sum of the diagonal of $\mathbf{C} = \mathbf{V} \boldsymbol{\Lambda} \mathbf{V}^{\top}$ is equal to the sum of the eigenvalues. Intuitively, we might suspect this is true since $\mathbf{V}$ is a unitary matrix.
First, let’s write out the terms of $\mathbf{V}$ and $\boldsymbol{\Lambda}$ explicitly:

$$\mathbf{V} = \begin{bmatrix} v_{11} & \dots & v_{1p} \\ \vdots & \ddots & \vdots \\ v_{p1} & \dots & v_{pp} \end{bmatrix}, \qquad \boldsymbol{\Lambda} = \begin{bmatrix} \lambda_1 & & \\ & \ddots & \\ & & \lambda_p \end{bmatrix}.$$

Then $\mathbf{V} \boldsymbol{\Lambda}$ is

$$\mathbf{V} \boldsymbol{\Lambda} = \begin{bmatrix} \lambda_1 v_{11} & \dots & \lambda_p v_{1p} \\ \vdots & \ddots & \vdots \\ \lambda_1 v_{p1} & \dots & \lambda_p v_{pp} \end{bmatrix},$$

and $\mathbf{V} \boldsymbol{\Lambda} \mathbf{V}^{\top}$ is

$$\mathbf{V} \boldsymbol{\Lambda} \mathbf{V}^{\top} = \begin{bmatrix} \lambda_1 v_{11} & \dots & \lambda_p v_{1p} \\ \vdots & \ddots & \vdots \\ \lambda_1 v_{p1} & \dots & \lambda_p v_{pp} \end{bmatrix} \begin{bmatrix} v_{11} & \dots & v_{p1} \\ \vdots & \ddots & \vdots \\ v_{1p} & \dots & v_{pp} \end{bmatrix}.$$

Writing out each term in this product matrix would be tedious, but it is clear that the $k$-th diagonal element in this matrix is

$$\left[ \mathbf{V} \boldsymbol{\Lambda} \mathbf{V}^{\top} \right]_{kk} = \sum_{j=1}^{p} \lambda_j v_{kj}^2,$$

and so the sum of the diagonal must be

$$\mathrm{tr}(\mathbf{C}) = \sum_{k=1}^{p} \sum_{j=1}^{p} \lambda_j v_{kj}^2.$$

We can group all the terms associated with the same $\lambda_j$. This gives us

$$\mathrm{tr}(\mathbf{C}) = \sum_{j=1}^{p} \lambda_j \sum_{k=1}^{p} v_{kj}^2,$$

and each inner summation is really the squared norm of a column of $\mathbf{V}$. Since $\mathbf{V}$ is unitary, these squared norms are unity. So the sum of the diagonal elements of $\mathbf{C}$ is the sum of the eigenvalues. To summarize this subsection, we have shown:

$$\mathrm{tr}(\mathrm{cov}(\mathbf{X})) = \mathrm{tr}(\mathrm{cov}(\mathbf{Z})) = \sum_{k=1}^{p} \lambda_k.$$
I think this is fairly intuitive. For any given data set, there is some total variation. If we apply an orthogonal linear transformation (a rotation or reflection) to our data, we change how the features are correlated, but we don’t change the total variation inherent in the data.
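A quick numerical check of this claim, using synthetic data:

import numpy as np

rng = np.random.default_rng(7)
n, p = 400, 5
X = rng.normal(size=(n, p)) @ rng.normal(size=(p, p))
Xc = X - X.mean(axis=0)

C = Xc.T @ Xc / (n - 1)       # covariance of the raw (centered) data
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
Z = Xc @ Vt.T
cov_Z = Z.T @ Z / (n - 1)     # covariance of the transformed data

# The total variance (trace of the covariance matrix) is unchanged.
print(np.isclose(np.trace(C), np.trace(cov_Z)))   # True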
Loadings
In PCA, we decompose our covariance matrix into directions or axes (PCs) and scales (eigenvalues). However, we often want to combine these two bits of information into scaled vectors (e.g. the solid axes lines in Figure ). In the language of PCA and factor analysis, the loadings are the principal components scaled by the square roots of their respective eigenvalues. In other words, loadings are the PCs scaled by the standard deviations of the transformed features. Concretely, let $\mathbf{L}$ be a matrix of factor loadings. Then

$$\mathbf{L} = \mathbf{V} \boldsymbol{\Lambda}^{1/2}.$$
Conversely, eigenvectors are unit-scaled loadings.
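Concretely, a small sketch of my own of computing loadings from the SVD and checking that unit-scaling them recovers the eigenvectors:

import numpy as np

rng = np.random.default_rng(8)
n, p = 300, 4
X = rng.normal(size=(n, p)) @ rng.normal(size=(p, p))
Xc = X - X.mean(axis=0)

U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
V = Vt.T                           # principal components (unit vectors)
eigvals = S**2 / (n - 1)

loadings = V * np.sqrt(eigvals)    # scale each PC by its standard deviation

# Unit-scaling the loadings recovers the eigenvectors.
print(np.allclose(loadings / np.linalg.norm(loadings, axis=0), V))   # True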
Probabilistic PCA
We can frame PCA as a probabilistic model, called probabilistic PCA (Tipping & Bishop, 1999). The statistical model for probabilistic PCA is

$$\mathbf{x}_i = \mathbf{W} \mathbf{z}_i + \boldsymbol{\mu} + \boldsymbol{\epsilon}_i, \qquad \mathbf{z}_i \sim \mathcal{N}(\mathbf{0}, \mathbf{I}), \qquad \boldsymbol{\epsilon}_i \sim \mathcal{N}(\mathbf{0}, \sigma^2 \mathbf{I}).$$

Why is this the correct probabilistic framing? Tipping & Bishop have a deeper analysis, but here’s how I think about it. One view of PCA is that we decompose (or approximate) our design matrix $\mathbf{X}$ via a linear transformation of latent variables $\mathbf{Z}$:

$$\mathbf{X} \approx \mathbf{Z} \mathbf{V}^{\top}.$$
And the covariance matrix of $\mathbf{X}$ is proportional to

$$\mathbf{X}^{\top} \mathbf{X} \approx \mathbf{V} \mathbf{Z}^{\top} \mathbf{Z} \mathbf{V}^{\top}.$$
We can see that the probabilistic model above captures the same relationships, since it implies that $\mathbf{x}_i$ is normally distributed as

$$\mathbf{x}_i \sim \mathcal{N}\!\left(\boldsymbol{\mu}, \; \mathbf{W} \mathbf{W}^{\top} + \sigma^2 \mathbf{I}\right).$$

So the probabilistic model is analogous to the model we have discussed so far.
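To make the generative view concrete, here is a minimal sketch of sampling from the probabilistic PCA model; the parameters `W`, `mu`, and `sigma` are made up for illustration, and the empirical covariance of the samples should approach $\mathbf{W}\mathbf{W}^{\top} + \sigma^2 \mathbf{I}$ as the sample size grows:

import numpy as np

rng = np.random.default_rng(9)

# Made-up parameters for illustration: p observed dims, k latent dims.
p, k, sigma = 5, 2, 0.1
W = rng.normal(size=(p, k))
mu = rng.normal(size=p)

# Generative model: z ~ N(0, I), x = W z + mu + eps, eps ~ N(0, sigma^2 I).
n = 5000
Z = rng.normal(size=(n, k))
eps = sigma * rng.normal(size=(n, p))
X = Z @ W.T + mu + eps

# The implied marginal covariance of x is W W^T + sigma^2 I.
emp_cov = np.cov(X, rowvar=False)
model_cov = W @ W.T + sigma**2 * np.eye(p)
print(np.abs(emp_cov - model_cov).max())   # small; shrinks as n grows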
4. Exploring a complete example
As an example of the power of PCA, let’s reproduce a famous result, eigenfaces (Turk & Pentland, 1991). An eigenface is simply a principal component of a data set of images of faces. Since principal components form a basis of the images in our data set, the result of visualizing a single eigenface is an eerie, idealized face. See Figure for eigenfaces based on the Olivetti database of faces.
What does this definition of eigenfaces really mean? Recall the equation relating a datum to its latent variables, which I’ll re-write here:

$$\mathbf{x}_i = \mathbf{V} \mathbf{z}_i.$$

If $\mathbf{x}_i$ is a $p$-vector representing an image with $p$ pixels and if an eigenface is a column of $\mathbf{V}$, then $\mathbf{z}_i$ must be a $k$-vector representing how much each eigenface (out of $k$ eigenfaces) contributes to $\mathbf{x}_i$. So we can think of $\mathbf{z}_i$ as representing a $k$-vector of latent pixels. To grok this, it may be helpful to write out the matrix multiplication in the equation above more diagrammatically:

$$\underbrace{\mathbf{x}_i}_{\text{image}} = z_{i1} \underbrace{\mathbf{v}_1}_{\text{eigenface 1}} + z_{i2} \underbrace{\mathbf{v}_2}_{\text{eigenface 2}} + \dots + z_{ik} \underbrace{\mathbf{v}_k}_{\text{eigenface } k}.$$

So each datum (the image on the left-hand side of the equation above) is a linear combination of eigenfaces, where the coefficients are defined by the latent variables. See my post on matrices as functions and data for a bit more discussion on this idea of how to interpret matrices as data.
But what’s the utility of an eigenface? If we can find a small set of $k$ eigenfaces, say $k \ll p$ where $p$ is the number of pixels, then we can find a lower-dimensional representation of our data that can be used for downstream tasks like visualization, data compression, and classification. For example, we could sort our eigenfaces by their associated eigenvalues and only pick the top two eigenfaces. Thus, we could represent every face in our data set as a linear combination of just two eigenfaces. And we could visualize the two latent pixels for each image (Figure ).
While eigenfaces are a bit abstract, recall that this is precisely what we were trying to do when we imagined ourselves as naturalists classifying birds on a strange island. We have simply abstracted the problem a bit more: instead of collecting beaks and feathers, we are collecting pixels. On the island, we were able to construct the composite feature “bird size” out of “beak size” and “wing span” because those two features were correlated. With eigenfaces, we are able to construct composite or projected pixels because many pixels in our data are correlated.
Now that we understand eigenfaces, let’s implement PCA and reproduce Figure . Here is the Python code I used to perform PCA. I think it is pretty straightforward if you understand the derivations above:
import numpy as np
from types import SimpleNamespace


def pca(X, n_components=None):
    """Perform principal component analysis on data X with [samples x features]."""
    n, p = X.shape
    dim = n_components if n_components else p
    xmean = X.mean(axis=0)
    # Mean-center our data.
    X = X - xmean
    # Do the eigendecomposition via the SVD.
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    # Compute eigenvalues from singular values.
    L = S**2 / (n - 1)
    # Sort output by eigenvalues in descending order.
    inds = S.argsort()[::-1]
    eigvals = L[inds][:dim]
    pcs = Vt[inds][:dim]
    # For consistency with the blog post text, transpose so that the columns
    # of `pcs` are the eigenvectors (principal components).
    pcs = pcs.T
    # Transform X into latent variables Z.
    Z = (U * S)[:, :dim]
    out = {}
    out['pcs'] = pcs
    out['explained_variance'] = eigvals
    out['total_variance'] = eigvals.sum()
    out['Z'] = Z
    out['n_components'] = dim
    out['mean'] = xmean
    return SimpleNamespace(**out)
We can then load in the Olivetti data set from Scikit-learn, perform PCA, and then visualize the first ten components:
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_olivetti_faces

faces = fetch_olivetti_faces(return_X_y=False)
result = pca(X=faces.data)

fig, axes = plt.subplots(2, 5)
for i, ax in enumerate(axes.flat):
    ax.imshow(result.pcs[:, i].reshape(faces.images[0].shape))
And we’re done!
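As a further sketch (not part of the original result), we can use the fitted `result` from above to reconstruct a face from its top-$k$ latent pixels, i.e. as a linear combination of $k$ eigenfaces plus the mean face; the choice $k = 50$ is arbitrary:

import matplotlib.pyplot as plt

# Assumes `pca`, `faces`, and `result` from the code above.
k = 50
V_k = result.pcs[:, :k]         # top-k eigenfaces (columns)
Z_k = result.Z[:, :k]           # top-k latent pixels per image

# x_hat = V_k z + mean, i.e. a linear combination of k eigenfaces.
i = 0
x_hat = V_k @ Z_k[i] + result.mean

fig, (ax1, ax2) = plt.subplots(1, 2)
ax1.imshow(faces.images[i], cmap='gray')
ax1.set_title('original')
ax2.imshow(x_hat.reshape(faces.images[0].shape), cmap='gray')
ax2.set_title(f'reconstruction (k={k})')
plt.show()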
Conclusion
Principal components are a collection of orthogonal unit vectors such that the $k$-th principal component captures maximal variation in our data while being orthogonal to the previous $k-1$ PCs. Thus, the principal components form a basis, which can be used to transform our data into latent variables. These latent variables capture correlations in our data’s features, allowing for a reduced set of latent features that capture roughly the same amount of total variation.
Given the advances in computer vision today, the eigenfaces results may not be particularly impressive. For example, one could easily fit an autoencoder to produce much higher quality latent variables for clustering. But I think this is the wrong perspective. In context, it should be surprising that a simple linear model such as PCA is so effective on high-dimensional and richly structured data such as images. Given that it is also fast, easy to implement, and theoretically well-understood, PCA is a great first choice for preliminary data analysis.
1. Some readers might notice that this is just the Iris flower data set, but I like birds.
2. We constrain $\mathbf{w}$ to be a unit vector because there are an infinite number of unconstrained solutions (any scalar multiple of $\mathbf{w}$).
3. We could optimize the absolute rather than squared length, but we use a squared quantity because it is nicer for optimization (i.e. it is continuously differentiable).
4. $\mathbf{Q}$ and $\mathbf{V}$ are equivalent. For simplicity, I’ll use $\mathbf{V}$ for principal components.
- Pearson, K. (1901). On lines and planes of closest fit to systems of points in space. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 2(11), 559–572.
- Hotelling, H. (1933). Analysis of a complex of statistical variables into principal components. Journal of Educational Psychology, 24(6), 417.
- Turk, M., & Pentland, A. (1991). Eigenfaces for recognition. Journal of Cognitive Neuroscience, 3(1), 71–86.
- Tipping, M. E., & Bishop, C. M. (1999). Probabilistic principal component analysis. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 61(3), 611–622.