Matrices as Functions, Matrices as Data
I discuss two views of matrices: matrices as linear functions and matrices as data. The second view is particularly useful in understanding dimension reduction methods.
To a mathematician, a matrix is a linear function. But to an applied statistician, a matrix is also data. When discussing statistical models, we often switch between these views without being explicit about which one we are taking. For example, we may discuss inference for a parameter that is a matrix, and therefore a linear function, while also discussing the model’s input data, which is also a matrix. What is going on here? The goal of this post is to make these two views explicit. In particular, I think the latter view is important for understanding dimension reduction methods such as principal component analysis.
Let’s discuss each view in turn.
Matrices as functions
Let $\mathbf{A}$ be a matrix with shape $N \times P$ ($N$ rows and $P$ columns). The matrix-as-function view is that $\mathbf{A}$ maps vectors in a $P$-dimensional vector space into an $N$-dimensional vector space. While we might write this linear transformation as

$$
\mathbf{A} \mathbf{x} = \mathbf{b}, \tag{1}
$$
a more diagrammatic way to write this might be

$$
\mathbf{x} \in \mathbb{R}^P \quad \overset{\mathbf{A}}{\longmapsto} \quad \mathbf{b} \in \mathbb{R}^N.
$$
Here, the matrix-as-function view is to think of a $P$-vector $\mathbf{x}$ being “input” into a function $\mathbf{A}$, which “outputs” an $N$-vector $\mathbf{b}$. The important mathematical bit is that since this $P$-vector can be represented as a linear combination of standard basis vectors $\mathbf{e}_1, \dots, \mathbf{e}_P$,

$$
\mathbf{x} = x_1 \mathbf{e}_1 + x_2 \mathbf{e}_2 + \dots + x_P \mathbf{e}_P,
$$
then we can think of the columns of $\mathbf{A}$ as defining where these standard basis vectors in the domain “land” in the range:

$$
\mathbf{A} \mathbf{x} = x_1 \mathbf{A} \mathbf{e}_1 + x_2 \mathbf{A} \mathbf{e}_2 + \dots + x_P \mathbf{A} \mathbf{e}_P. \tag{2}
$$
Since $\mathbf{A} \mathbf{e}_p$ is just the $p$-th column of $\mathbf{A}$, let’s denote it as $\mathbf{a}_p$. Then we could write Equation $2$ as

$$
\mathbf{b} = x_1 \mathbf{a}_1 + x_2 \mathbf{a}_2 + \dots + x_P \mathbf{a}_P. \tag{3}
$$
In my mind, Equation $3$ really underscores the point: the output $\mathbf{b}$ of the linear function is a linear combination of the transformed basis vectors, i.e. of the columns of $\mathbf{A}$. I discuss this view in more detail in a geometrical understanding of matrices, and Grant Sanderson (3Blue1Brown) has an excellent video on this topic.
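To make this concrete, here is a minimal NumPy sketch (my own illustration, with arbitrary random inputs) verifying that $\mathbf{A} \mathbf{x}$ is exactly the column-weighted sum in Equation $3$:

```python
import numpy as np

# Matrix-as-function view: applying A to x gives the same result as
# summing A's columns, weighted by the entries of x (Equation 3).
rng = np.random.default_rng(0)
N, P = 4, 3
A = rng.normal(size=(N, P))  # an arbitrary N x P matrix
x = rng.normal(size=P)       # an arbitrary P-vector

b = A @ x  # the "function" view: A maps the P-vector x to the N-vector b

# The same b, built column by column.
b_from_columns = sum(x[p] * A[:, p] for p in range(P))

assert np.allclose(b, b_from_columns)
```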
But what does any of this have to do with data?
Matrices as data
In statistics and machine learning, we often think of our data matrix $\mathbf{A}$ as representing $N$ independent samples or observations, where each sample has $P$ features. In this view, a row of $\mathbf{A}$ is a $P$-vector representing a single observation, while the $p$-th column of $\mathbf{A}$, $\mathbf{a}_p$, is an $N$-vector representing a single feature.
For example, the $N \times P$ matrix $\mathbf{A}$ might represent $P$ different cities’ populations across $N$ different time points. More concretely, imagine that the columns are cities in the United States and that the rows are years. Each cell $a_{np}$ is the population of a given US city in a given year:

$$
\mathbf{A} =
\begin{bmatrix}
a_{11} & a_{12} & \dots & a_{1P} \\
a_{21} & a_{22} & \dots & a_{2P} \\
\vdots & \vdots & \ddots & \vdots \\
a_{N1} & a_{N2} & \dots & a_{NP}
\end{bmatrix},
$$

where column $p$ is one city’s population over time, and row $n$ is all cities’ populations in one year.
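As a toy illustration, here is what such a matrix might look like in NumPy. The city names and population figures below are invented for the example, not real census data:

```python
import numpy as np

# A toy population matrix: rows are years, columns are cities.
# These numbers are invented for illustration, not real census data.
cities = ["New York", "Los Angeles", "Chicago"]   # P = 3 columns (features)
years = [2000, 2010, 2020]                        # N = 3 rows (observations)
A = np.array([
    [8.0e6, 3.7e6, 2.90e6],  # populations in 2000
    [8.2e6, 3.8e6, 2.70e6],  # populations in 2010
    [8.8e6, 3.9e6, 2.75e6],  # populations in 2020
])

row = A[1, :]   # one observation: every city's population in 2010
col = A[:, 0]   # one feature: New York's population over time
```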
What does it mean to represent our data in this way? Imagine we want to model the population for a new city, call it “Foobar”. If we represent the changing population for Foobar as an $N$-vector $\mathbf{b}$, then we can represent it as a linear combination of the other cities’ changing populations, via Equation $2$:

$$
\mathbf{b} = x_1 \mathbf{A} \mathbf{e}_1 + x_2 \mathbf{A} \mathbf{e}_2 + \dots + x_P \mathbf{A} \mathbf{e}_P. \tag{4}
$$
Here, the scalars $x_1, \dots, x_P$ are just coefficients that make the equality hold. And as we saw above, $\mathbf{A} \mathbf{e}_p$ is just the $p$-th column of $\mathbf{A}$. So we can represent Foobar’s population over time as a linear combination of the other cities’ populations:

$$
\mathbf{b} = x_1 \mathbf{a}_1 + x_2 \mathbf{a}_2 + \dots + x_P \mathbf{a}_P. \tag{5}
$$
In my mind, Equation is the essence of a matrix-as-data view: we can represent a new datum as a linear combination of existing data. Of course, if we wanted to think about a cross-sectional time slice (a row vector representing cities at a single time point ), we could simply transpose the matrix and repeat the reasoning above. If you’d like another perspective on this topic, Jeremy Kun has a nice blog post on matrices as data.
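Note that the equality in Equation $5$ holds exactly only when $\mathbf{b}$ lies in the column space of $\mathbf{A}$; with more years than cities, we would typically fit the coefficients by least squares instead. Here is a minimal sketch of that fit, using random stand-in data:

```python
import numpy as np

# Matrix-as-data view: express a new city's trajectory b as a linear
# combination of existing cities' trajectories (the columns of A).
# With more years than cities (N > P), an exact solution generally
# doesn't exist, so we fit the coefficients by least squares.
rng = np.random.default_rng(1)
N, P = 10, 4
A = rng.normal(size=(N, P))   # stand-in data matrix: P cities over N years
b = rng.normal(size=N)        # stand-in trajectory for "Foobar"

x, *_ = np.linalg.lstsq(A, b, rcond=None)
b_hat = A @ x                 # Foobar's trajectory, approximated as in Equation 5
```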
In my mind, this data-view of matrices really highlights what dimension reduction methods are doing. Imagine, for example, that our data matrix $\mathbf{A}$ representing city populations exhibited lots of multicollinearity. Perhaps San Francisco and Seattle had highly correlated changes in population, due to tech booms and busts since the 1990s. In that context, $\mathbf{A}$ is (approximately) a low-rank matrix. We could use dimension reduction to transform $\mathbf{A}$ into a matrix $\mathbf{Z}$ having $K$ columns, where $K < P$:

$$
\underbrace{\mathbf{A}}_{N \times P}
\quad \longrightarrow \quad
\underbrace{\mathbf{Z}}_{N \times K}.
$$
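To see why correlated columns make $\mathbf{A}$ approximately low rank, here is a quick numerical check, again with synthetic stand-in data: a nearly duplicated column shows up as a near-zero singular value.

```python
import numpy as np

# Multicollinearity in miniature: if one column of A is nearly a scaled
# copy of another, A is approximately low rank.
rng = np.random.default_rng(2)
N = 50
sf = rng.normal(size=N)                           # stand-in for San Francisco
seattle = 1.2 * sf + 0.01 * rng.normal(size=N)    # nearly proportional to sf
others = rng.normal(size=(N, 2))                  # two unrelated cities
A = np.column_stack([sf, seattle, others[:, 0], others[:, 1]])

# One singular value is tiny relative to the rest, so A is well-approximated
# by a rank-3 matrix even though it has 4 columns.
print(np.linalg.svd(A, compute_uv=False))
```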
And Foobar’s population over time could be approximated as a linear combination of the columns of $\mathbf{Z}$:

$$
\mathbf{b} \approx w_1 \mathbf{z}_1 + w_2 \mathbf{z}_2 + \dots + w_K \mathbf{z}_K. \tag{6}
$$
Here, we cannot label each $\mathbf{z}_k$ with a city name, since there is no longer a one-to-one mapping between cities and columns. Instead, a column vector $\mathbf{z}_k$ might be interpreted as representing a subgroup of cities with similar changes in population. Illustratively, we might have

$$
\mathbf{b} \approx w_1 \mathbf{z}_{\text{tech hubs}} + w_2 \mathbf{z}_2 + \dots + w_K \mathbf{z}_K, \tag{7}
$$

where, say, $\mathbf{z}_{\text{tech hubs}}$ captures the shared trajectory of cities like San Francisco and Seattle.
This is the basic idea of dimension reduction techniques such as principal component analysis. There, the goal is to find a low-rank matrix $\mathbf{Z}$ that well-approximates our matrix $\mathbf{A}$, in the sense that it preserves the relationships in our data. This allows us to use Equations $6$ or $7$ rather than Equation $5$.
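As a final sketch, here is one common way to construct such a $\mathbf{Z}$: a truncated singular value decomposition, which is the computation underlying PCA (PCA additionally centers each column first; I skip that here for brevity). The data are again synthetic stand-ins:

```python
import numpy as np

# Dimension reduction via truncated SVD. The K columns of Z are "latent"
# trajectories that approximately span the same space as A's P columns.
rng = np.random.default_rng(3)
N, P, K = 50, 10, 3

# Synthetic data that is approximately rank K: low-rank signal plus noise.
A = rng.normal(size=(N, K)) @ rng.normal(size=(K, P)) + 0.01 * rng.normal(size=(N, P))

U, S, Vt = np.linalg.svd(A, full_matrices=False)
Z = U[:, :K] * S[:K]        # N x K: K latent columns in place of P cities

# A new trajectory can now be fit with K coefficients (Equation 6)
# instead of P (Equation 5).
b = A @ rng.normal(size=P)  # a trajectory built from the original columns
w, *_ = np.linalg.lstsq(Z, b, rcond=None)
print(np.linalg.norm(Z @ w - b) / np.linalg.norm(b))  # small relative error
```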