Matrices as Functions, Matrices as Data

I discuss two views of matrices: matrices as linear functions and matrices as data. The second view is particularly useful in understanding dimension reduction methods.

To a mathematician, a matrix is a linear function. But to an applied statistician, a matrix is also data. When discussing statistical models, we often switch between these views without being explicit about which view we are taking. For example, we may discuss inference for a parameter that is a matrix (a linear function), while also discussing the model’s input data, which is also a matrix. What is going on here? The goal of this post is to make these two views explicit. In particular, I think the latter view is important for understanding dimension reduction methods such as principal component analysis.

Let’s discuss each view in turn.

Matrices as functions

Let $\mathbf{A}$ be a matrix with shape $N \times P$ ($N$ rows and $P$ columns). The matrix-as-function view is that $\mathbf{A}$ maps vectors in a $P$-dimensional vector space into an $N$-dimensional vector space. While we might write this linear transformation as,

$$
\mathbf{y} = \mathbf{A} \mathbf{x}, \tag{1}
$$

a more diagrammatic way to write this might be

$$
\mathbf{x} \rightarrow \mathbf{A} \rightarrow \mathbf{y}. \tag{2}
$$

Here, the matrix-as-function view is to think of a $P$-vector $\mathbf{x}$ being “input” into a function $\mathbf{A}$, which “outputs” an $N$-vector $\mathbf{y}$. The important mathematical bit is that since this $P$-vector $\mathbf{x}$ can be represented as a linear combination of the $P$ standard basis vectors $\{\mathbf{e}_1, \dots, \mathbf{e}_P\}$,

$$
\mathbf{x} = x_1 \mathbf{e}_1 + x_2 \mathbf{e}_2 + \dots + x_P \mathbf{e}_P, \tag{3}
$$

then we can think of the columns of $\mathbf{A}$ as defining where these standard basis vectors in the domain “land” in the range:

$$
\mathbf{y} = \mathbf{A} \mathbf{x} = x_1 \mathbf{A} \mathbf{e}_1 + x_2 \mathbf{A} \mathbf{e}_2 + \dots + x_P \mathbf{A} \mathbf{e}_P. \tag{4}
$$

Since $\mathbf{A}\mathbf{e}_p$ is just the $p$-th column of $\mathbf{A}$, let’s denote it as $\mathbf{a}_p$. Then we could write Equation 4 as

$$
\mathbf{y} = \mathbf{A}\mathbf{x} = x_1 \mathbf{a}_1 + x_2 \mathbf{a}_2 + \dots + x_P \mathbf{a}_P. \tag{5}
$$

In my mind, Equation 4 really underscores the point: the output $\mathbf{y}$ of the linear function $\mathbf{A}$ is a linear combination of the transformed standard basis vectors $\{\mathbf{A}\mathbf{e}_1, \dots, \mathbf{A}\mathbf{e}_P\}$, i.e. of the columns of $\mathbf{A}$. I discuss this view in more detail in a geometrical understanding of matrices, and Grant Sanderson (3Blue1Brown) has an excellent video on this topic.
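If you like to see this numerically, here is a minimal NumPy sketch (the matrix and vector are arbitrary, made-up values) checking that Equation 5 holds: multiplying by $\mathbf{A}$ gives the same vector as weighting and summing its columns.

```python
import numpy as np

# Minimal check of Equation 5: A @ x equals the linear combination of A's
# columns weighted by the entries of x. The matrix and vector are arbitrary.
rng = np.random.default_rng(0)
N, P = 4, 3
A = rng.normal(size=(N, P))
x = rng.normal(size=P)

y_function_view = A @ x                                  # y = Ax
y_column_view = sum(x[p] * A[:, p] for p in range(P))    # y = x_1 a_1 + ... + x_P a_P

assert np.allclose(y_function_view, y_column_view)
```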

But what does any of this have to do with data?

Matrices as data

In statistics and machine learning, we often think of our data matrix $\mathbf{A}$ as representing $N$ independent samples or observations, where each sample has $P$ features. In this view, a row of $\mathbf{A}$, $\mathbf{a}_n$, is a $P$-vector representing a single observation, while a column of $\mathbf{A}$, $\mathbf{a}_p$, is an $N$-vector representing a single feature.

For example, the matrix $\mathbf{A}$ might represent $P$ different cities’ populations across $N$ different time points. More concretely, imagine that the columns are cities in the United States and that the rows are the years 1950 through 2022. Each cell is the population of a given US city in a given year:

$$
\mathbf{A}_{\text{US city populations}} =
\left[\begin{array}{c|c|c|c|c}
& \text{Austin} & \text{New York} & \dots & \text{Seattle} \\
\hline
\text{1950} & p_{11} & p_{12} & \dots & p_{1P} \\
\text{1951} & p_{21} & p_{22} & \dots & p_{2P} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
\text{2022} & p_{N1} & p_{N2} & \dots & p_{NP}
\end{array}\right] \tag{6}
$$
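As a small, concrete sketch (every number below is made up), such a matrix could be stored as a labeled array; I use pandas here purely to attach the year and city labels from Equation 6.

```python
import numpy as np
import pandas as pd

# A tiny, made-up stand-in for Equation 6: rows are years, columns are
# cities, and each cell is a (fake) population count.
years = list(range(1950, 2023))              # N = 73 rows
cities = ["Austin", "New York", "Seattle"]   # P = 3 columns
rng = np.random.default_rng(0)
fake_populations = rng.integers(100_000, 9_000_000, size=(len(years), len(cities)))

A = pd.DataFrame(fake_populations, index=years, columns=cities)
print(A.shape)   # (73, 3), i.e. N x P
```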

What does it mean to represent our data in this way? Imagine we want to model the population of a new city, call it “Foobar”. If we represent Foobar’s changing population as an $N$-vector $\mathbf{y}_{\textsf{Foobar}}$, then we can represent it as a linear combination of the other cities’ changing populations via $\mathbf{A}$:

$$
\mathbf{y}_{\textsf{Foobar}} = x_1 \mathbf{A} \mathbf{e}_1 + x_2 \mathbf{A} \mathbf{e}_2 + \dots + x_P \mathbf{A} \mathbf{e}_P. \tag{7}
$$

Here, the scalars $\{x_1, \dots, x_P\}$ are just coefficients that make the equality hold. (This implicitly assumes $\mathbf{y}_{\textsf{Foobar}}$ lies in the column space of $\mathbf{A}$; in practice, we would find the best-fitting coefficients, e.g. via least squares.) And as we saw above, $\mathbf{A} \mathbf{e}_p$ is just the $p$-th column of $\mathbf{A}$. So we can represent Foobar’s population over time as a linear combination of the other cities’ populations:

$$
\begin{aligned}
\mathbf{y}_{\textsf{Foobar}}
&= x_1 \mathbf{A} \mathbf{e}_1 + x_2 \mathbf{A} \mathbf{e}_2 + \dots + x_P \mathbf{A} \mathbf{e}_P \\
&= x_1 \left( \mathbf{a}_{\textsf{Austin}} \right) + x_2 \left( \mathbf{a}_{\textsf{New York}} \right) + \dots + x_P \left( \mathbf{a}_{\textsf{Seattle}} \right).
\end{aligned} \tag{8}
$$

In my mind, Equation 8 is the essence of a matrix-as-data view: we can represent a new datum as a linear combination of existing data. Of course, if we wanted to think about a cross-sectional time slice (a row vector representing $P$ cities at a single time point $n \in \{1950, 1951, \dots, 2022\}$), we could simply transpose the matrix and repeat the reasoning above. If you’d like another perspective on this topic, Jeremy Kun has a nice blog post on matrices as data.
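To make Equation 8 concrete, here is a small synthetic sketch (all trajectories and coefficients below are made up): we build a fake data matrix $\mathbf{A}$, construct a new city’s series as a known combination of its columns, and then recover the coefficients with least squares.

```python
import numpy as np

# Synthetic sketch of Equation 8. Rows are years, columns are cities.
rng = np.random.default_rng(1)
N, P = 73, 3                                   # 73 years, 3 existing cities
A = rng.normal(size=(N, P)).cumsum(axis=0)     # fake population trajectories

# A fake new city whose trajectory lies in A's column space by construction.
x_true = np.array([0.5, 0.2, 0.3])
y_foobar = A @ x_true

# Recover the coefficients x_1, ..., x_P of Equation 8 via least squares.
x_hat, *_ = np.linalg.lstsq(A, y_foobar, rcond=None)
print(np.allclose(x_hat, x_true))              # True
```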

In my mind, this data-view of matrices really highlights what dimension reduction methods are doing. Imagine, for example, that our data matrix $\mathbf{A}$ of city populations exhibited lots of multicollinearity, say because San Francisco and Seattle had highly correlated changes in population, perhaps due to tech booms and busts since the 1990s. In that context, $\mathbf{A}$ is (approximately) a low-rank matrix. We could use dimension reduction to transform $\mathbf{A}$ into a matrix $\mathbf{B}$ with $K$ columns, where $K < P$:

$$
\mathbf{A} \in \mathbb{R}^{N \times P} \quad \stackrel{\textsf{dim. reduction}}{\rightarrow} \quad \mathbf{B} \in \mathbb{R}^{N \times K}. \tag{9}
$$
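One common way to construct such a $\mathbf{B}$ (a sketch, not the only choice; the function name `reduce_dimension` is my own) is a rank-$K$ truncated SVD of $\mathbf{A}$. If $\mathbf{A}$’s columns are mean-centered first, this is essentially PCA.

```python
import numpy as np

def reduce_dimension(A: np.ndarray, K: int) -> np.ndarray:
    """Map an (N, P) matrix to an (N, K) matrix via a truncated SVD,
    as in Equation 9. Mean-center A's columns first to make this PCA-like."""
    U, S, _ = np.linalg.svd(A, full_matrices=False)
    return U[:, :K] * S[:K]   # scale the top-K left singular vectors
```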

And Foobar’s population over time could be approximated as a linear combination of the columns of $\mathbf{B}$:

$$
\mathbf{y}_{\textsf{Foobar}} = x_1 \mathbf{b}_1 + x_2 \mathbf{b}_2 + \dots + x_K \mathbf{b}_K. \tag{10}
$$

Here, we cannot label each $\mathbf{b}_k$ with a city name, since there is no longer a one-to-one mapping between cities and columns. Instead, a column vector $\mathbf{b}_k$ might be interpreted as representing a subgroup of cities with similar changes in population. Illustratively, we might have

$$
\mathbf{y}_{\textsf{Foobar}} = x_1 \left( \mathbf{b}_{\textsf{tech boom}} \right) + x_2 \left( \mathbf{b}_{\textsf{small town}} \right) + \dots + x_K \left( \mathbf{b}_{\textsf{post-industrial}} \right). \tag{11}
$$

This is the basic idea of dimension reduction techniques such as principal component analysis. There, the goal is to find a low-rank matrix $\mathbf{B}$ that captures, as well as possible, the relationships in our data matrix $\mathbf{A}$. This allows us to use Equation 10 or 11 rather than Equation 8.
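Finally, here is a small end-to-end sketch on synthetic data (everything below, including the variable names, is made up) tying Equations 9 and 10 together: we build an exactly rank-$K$ city matrix from $K$ shared trends, reduce it to $\mathbf{B}$ via truncated SVD, and then express Foobar’s series with only $K$ coefficients.

```python
import numpy as np

rng = np.random.default_rng(2)
N, P, K = 73, 5, 2
trends = rng.normal(size=(N, K)).cumsum(axis=0)   # K shared latent "trends"
A = trends @ rng.normal(size=(K, P))              # rank-K city matrix (idealized Eq. 6)
y_foobar = A @ rng.normal(size=P)                 # new city built from the same trends

# Dimension reduction (Equation 9): keep the top-K singular directions.
U, S, _ = np.linalg.svd(A, full_matrices=False)
B = U[:, :K] * S[:K]                              # shape (N, K)

# Fit only K coefficients (Equation 10) instead of P (Equation 8).
x_hat, *_ = np.linalg.lstsq(B, y_foobar, rcond=None)
print(np.allclose(B @ x_hat, y_foobar))           # True: K columns suffice here
```

Because this toy matrix is exactly rank $K$, the $K$ columns of $\mathbf{B}$ reproduce Foobar’s series perfectly; with real, merely approximately low-rank data, Equation 10 would only approximate Equation 8.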