Matrices as Functions, Matrices as Data

I discuss two views of matrices: matrices as linear functions and matrices as data. The second view is particularly useful in understanding dimension reduction methods.

To a mathematician, a matrix is a linear function. But to an applied statistician, a matrix is also data. When discussing statistical models, we often switch between these views without being explicit about which view we are taking. For example, we may discuss inference for a parameter that is a matrix (a linear function), while also discussing the model’s input data, which is also a matrix. What is going on here? The goal of this post is to make these two views explicit. In particular, I think the latter view is important for understanding dimension reduction methods such as principal component analysis.

Let’s discuss each view in turn.

Matrices as functions

Let $\mathbf{A}$ be a matrix with shape $N \times P$ ($N$ rows and $P$ columns). The matrix-as-function view is that $\mathbf{A}$ maps vectors in a $P$-dimensional vector space into an $N$-dimensional vector space. While we might write this linear transformation as,

$$
\mathbf{y} = \mathbf{A} \mathbf{x}, \tag{1}
$$

a more diagrammatic way to write this might be

$$
\mathbf{x} \rightarrow \mathbf{A} \rightarrow \mathbf{y}. \tag{2}
$$

Here, the matrix-as-function view is to think of a $P$-vector $\mathbf{x}$ being “input” into a function $\mathbf{A}$, which “outputs” an $N$-vector $\mathbf{y}$. The important mathematical bit is that since this $P$-vector $\mathbf{x}$ can be represented as a linear combination of the $P$ standard basis vectors $\{\mathbf{e}_1, \dots, \mathbf{e}_P\}$,

$$
\mathbf{x} = x_1 \mathbf{e}_1 + x_2 \mathbf{e}_2 + \dots + x_P \mathbf{e}_P, \tag{3}
$$

then we can think of the columns of $\mathbf{A}$ as defining where these standard basis vectors in the domain “land” in the range:

$$
\mathbf{y} = \mathbf{A} \mathbf{x} = x_1 \mathbf{A} \mathbf{e}_1 + x_2 \mathbf{A} \mathbf{e}_2 + \dots + x_P \mathbf{A} \mathbf{e}_P. \tag{4}
$$

Since $\mathbf{A}\mathbf{e}_p$ is just the $p$-th column of $\mathbf{A}$, let’s denote it as $\mathbf{a}_p$. Then we could write Equation 4 as

$$
\mathbf{y} = \mathbf{A}\mathbf{x} = x_1 \mathbf{a}_1 + x_2 \mathbf{a}_2 + \dots + x_P \mathbf{a}_P. \tag{5}
$$

In my mind, Equation 4 really underscores the point: the output $\mathbf{y}$ of the linear function $\mathbf{A}$ is a linear combination of the transformed standard basis vectors $\{\mathbf{A}\mathbf{e}_1, \dots, \mathbf{A}\mathbf{e}_P\}$, i.e. of the columns of $\mathbf{A}$. I discuss this view in more detail in a geometrical understanding of matrices, and Grant Sanderson (3Blue1Brown) has an excellent video on this topic.
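If you like to see this numerically, here is a minimal NumPy sketch (the matrix and vector are arbitrary, made-up values) checking that Equation 5 holds: multiplying by $\mathbf{A}$ gives the same vector as weighting and summing its columns.

```python
import numpy as np

# Minimal check of Equation 5: A @ x equals the linear combination of A's
# columns weighted by the entries of x. The matrix and vector are arbitrary.
rng = np.random.default_rng(0)
N, P = 4, 3
A = rng.normal(size=(N, P))
x = rng.normal(size=P)

y_function_view = A @ x                                  # y = Ax
y_column_view = sum(x[p] * A[:, p] for p in range(P))    # y = x_1 a_1 + ... + x_P a_P

assert np.allclose(y_function_view, y_column_view)
```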

But what does any of this have to do with data?

Matrices as data

In statistics and machine learning, we often think of our data matrix $\mathbf{A}$ as representing $N$ independent samples or observations, where each sample has $P$ features. In this view, a row of $\mathbf{A}$, $\mathbf{a}_n$, is a $P$-vector representing a single observation, while a column of $\mathbf{A}$, $\mathbf{a}_p$, is an $N$-vector representing a single feature.

For example, the matrix $\mathbf{A}$ might represent $P$ different cities’ populations across $N$ different time points. More concretely, imagine that the columns are cities in the United States and that the rows are the years 1950 through 2022. Each cell is the population of a given US city in a given year:

$$
\mathbf{A}_{\text{US city populations}} =
\left[\begin{array}{c|c|c|c|c}
& \text{Austin} & \text{New York} & \dots & \text{Seattle} \\
\hline
\text{1950} & p_{11} & p_{12} & \dots & p_{1P} \\
\text{1951} & p_{21} & p_{22} & \dots & p_{2P} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
\text{2022} & p_{N1} & p_{N2} & \dots & p_{NP}
\end{array}\right] \tag{6}
$$
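As a small, concrete sketch (every number below is made up), such a matrix could be stored as a labeled array; I use pandas here purely to attach the year and city labels from Equation 6.

```python
import numpy as np
import pandas as pd

# A tiny, made-up stand-in for Equation 6: rows are years, columns are
# cities, and each cell is a (fake) population count.
years = list(range(1950, 2023))              # N = 73 rows
cities = ["Austin", "New York", "Seattle"]   # P = 3 columns
rng = np.random.default_rng(0)
fake_populations = rng.integers(100_000, 9_000_000, size=(len(years), len(cities)))

A = pd.DataFrame(fake_populations, index=years, columns=cities)
print(A.shape)   # (73, 3), i.e. N x P
```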

What does it mean to represent our data in this way? Imagine we want to model the population of a new city, call it “Foobar”. If we represent Foobar’s changing population as an $N$-vector $\mathbf{y}_{\textsf{Foobar}}$, then we can represent it as a linear combination of the other cities’ changing populations via $\mathbf{A}$:

$$
\mathbf{y}_{\textsf{Foobar}} = x_1 \mathbf{A} \mathbf{e}_1 + x_2 \mathbf{A} \mathbf{e}_2 + \dots + x_P \mathbf{A} \mathbf{e}_P. \tag{7}
$$

Here, the scalars $\{x_1, \dots, x_P\}$ are just coefficients that make the equality hold. (This implicitly assumes $\mathbf{y}_{\textsf{Foobar}}$ lies in the column space of $\mathbf{A}$; in practice, we would find the best-fitting coefficients, e.g. via least squares.) And as we saw above, $\mathbf{A} \mathbf{e}_p$ is just the $p$-th column of $\mathbf{A}$. So we can represent Foobar’s population over time as a linear combination of the other cities’ populations:

$$
\begin{aligned}
\mathbf{y}_{\textsf{Foobar}}
&= x_1 \mathbf{A} \mathbf{e}_1 + x_2 \mathbf{A} \mathbf{e}_2 + \dots + x_P \mathbf{A} \mathbf{e}_P \\
&= x_1 \left( \mathbf{a}_{\textsf{Austin}} \right) + x_2 \left( \mathbf{a}_{\textsf{New York}} \right) + \dots + x_P \left( \mathbf{a}_{\textsf{Seattle}} \right).
\end{aligned} \tag{8}
$$

In my mind, Equation 8 is the essence of a matrix-as-data view: we can represent a new datum as a linear combination of existing data. Of course, if we wanted to think about a cross-sectional time slice (a row vector representing $P$ cities at a single time point $n \in \{1950, 1951, \dots, 2022\}$), we could simply transpose the matrix and repeat the reasoning above. If you’d like another perspective on this topic, Jeremy Kun has a nice blog post on matrices as data.
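To make Equation 8 concrete, here is a small synthetic sketch (all trajectories and coefficients below are made up): we build a fake data matrix $\mathbf{A}$, construct a new city’s series as a known combination of its columns, and then recover the coefficients with least squares.

```python
import numpy as np

# Synthetic sketch of Equation 8. Rows are years, columns are cities.
rng = np.random.default_rng(1)
N, P = 73, 3                                   # 73 years, 3 existing cities
A = rng.normal(size=(N, P)).cumsum(axis=0)     # fake population trajectories

# A fake new city whose trajectory lies in A's column space by construction.
x_true = np.array([0.5, 0.2, 0.3])
y_foobar = A @ x_true

# Recover the coefficients x_1, ..., x_P of Equation 8 via least squares.
x_hat, *_ = np.linalg.lstsq(A, y_foobar, rcond=None)
print(np.allclose(x_hat, x_true))              # True
```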

In my mind, this data-view of matrices really highlights what dimension reduction methods are doing. Imagine, for example, that our data matrix $\mathbf{A}$ of city populations exhibited lots of multicollinearity, say because San Francisco and Seattle had highly correlated changes in population, perhaps due to tech booms and busts since the 1990s. In that context, $\mathbf{A}$ is (approximately) a low-rank matrix. We could use dimension reduction to transform $\mathbf{A}$ into a matrix $\mathbf{B}$ with $K$ columns, where $K < P$:

$$
\mathbf{A} \in \mathbb{R}^{N \times P} \quad \stackrel{\textsf{dim. reduction}}{\rightarrow} \quad \mathbf{B} \in \mathbb{R}^{N \times K}. \tag{9}
$$
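One common way to construct such a $\mathbf{B}$ (a sketch, not the only choice; the function name `reduce_dimension` is my own) is a rank-$K$ truncated SVD of $\mathbf{A}$. If $\mathbf{A}$’s columns are mean-centered first, this is essentially PCA.

```python
import numpy as np

def reduce_dimension(A: np.ndarray, K: int) -> np.ndarray:
    """Map an (N, P) matrix to an (N, K) matrix via a truncated SVD,
    as in Equation 9. Mean-center A's columns first to make this PCA-like."""
    U, S, _ = np.linalg.svd(A, full_matrices=False)
    return U[:, :K] * S[:K]   # scale the top-K left singular vectors
```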

And Foobar’s population over time could be approximated as a linear combination of the columns of $\mathbf{B}$:

$$
\mathbf{y}_{\textsf{Foobar}} = x_1 \mathbf{b}_1 + x_2 \mathbf{b}_2 + \dots + x_K \mathbf{b}_K. \tag{10}
$$

Here, we cannot label each $\mathbf{b}_k$ with a city name, since there is no longer a one-to-one mapping between cities and columns. Instead, a column vector $\mathbf{b}_k$ might be interpreted as representing a subgroup of cities with similar changes in population. Illustratively, we might have

$$
\mathbf{y}_{\textsf{Foobar}} = x_1 \left( \mathbf{b}_{\textsf{tech boom}} \right) + x_2 \left( \mathbf{b}_{\textsf{small town}} \right) + \dots + x_K \left( \mathbf{b}_{\textsf{post-industrial}} \right). \tag{11}
$$

This is the basic idea of dimension reduction techniques such as principal component analysis. There, the goal is to find a low-rank matrix $\mathbf{B}$ that captures, as well as possible, the relationships in our data matrix $\mathbf{A}$. This allows us to use Equation 10 or 11 rather than Equation 8.
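Finally, here is a small end-to-end sketch on synthetic data (everything below, including the variable names, is made up) tying Equations 9 and 10 together: we build an exactly rank-$K$ city matrix from $K$ shared trends, reduce it to $\mathbf{B}$ via truncated SVD, and then express Foobar’s series with only $K$ coefficients.

```python
import numpy as np

rng = np.random.default_rng(2)
N, P, K = 73, 5, 2
trends = rng.normal(size=(N, K)).cumsum(axis=0)   # K shared latent "trends"
A = trends @ rng.normal(size=(K, P))              # rank-K city matrix (idealized Eq. 6)
y_foobar = A @ rng.normal(size=P)                 # new city built from the same trends

# Dimension reduction (Equation 9): keep the top-K singular directions.
U, S, _ = np.linalg.svd(A, full_matrices=False)
B = U[:, :K] * S[:K]                              # shape (N, K)

# Fit only K coefficients (Equation 10) instead of P (Equation 8).
x_hat, *_ = np.linalg.lstsq(B, y_foobar, rcond=None)
print(np.allclose(B @ x_hat, y_foobar))           # True: K columns suffice here
```

Because this toy matrix is exactly rank $K$, the $K$ columns of $\mathbf{B}$ reproduce Foobar’s series perfectly; with real, merely approximately low-rank data, Equation 10 would only approximate Equation 8.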