High-Dimensional Variance
A useful view of a covariance matrix is that it is a natural generalization of variance to higher dimensions. I explore this idea.
When $x$ is a scalar random variable, its variance is defined as

$$
\mathbb{V}[x] = \mathbb{E}\big[(x - \mathbb{E}[x])^2\big]. \tag{1}
$$

The mean $\mathbb{E}[x]$ tells us where the distribution of $x$ is centered, while the variance tells us how spread out it is around that center.

However, when $\mathbf{x} = [x_1, \dots, x_D]^{\top}$ is a $D$-dimensional random vector, then, at least in my experience, people do not talk about the variance of $\mathbf{x}$. Instead, they talk about its covariance matrix,

$$
\boldsymbol{\Sigma} \triangleq \operatorname{cov}[\mathbf{x}] = \mathbb{E}\big[(\mathbf{x} - \mathbb{E}[\mathbf{x}])(\mathbf{x} - \mathbb{E}[\mathbf{x}])^{\top}\big]. \tag{2}
$$

Now Equations 1 and 2 look remarkably similar: the only real change is that the squared deviation from the mean has become an outer product of deviations from the mean.
Of course, this is not my idea, but this was not how I was taught to think about covariance matrices. But I think it is an illuminating connection. So let’s explore this idea that a covariance matrix is just high-dimensional variance.
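Throughout, I'll add a few small NumPy snippets as numerical sanity checks; these are sketches on simulated data, not proofs. As a first check, the outer-product definition in Equation 2 agrees with NumPy's built-in covariance estimator:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(100_000, 3))              # samples of a 3-dimensional random vector

# Covariance as the average outer product of deviations: E[(x - mu)(x - mu)^T].
dev = x - x.mean(axis=0)
cov_outer = dev.T @ dev / len(x)

print(np.allclose(cov_outer, np.cov(x, rowvar=False, bias=True)))  # True
```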
Definitions
To start, it is clear from the definitions that a covariance matrix is related to variance. We can write the outer product in Equation 2 element-wise as

$$
\boldsymbol{\Sigma} =
\begin{bmatrix}
\mathbb{V}[x_1] & \operatorname{cov}(x_1, x_2) & \cdots & \operatorname{cov}(x_1, x_D) \\
\operatorname{cov}(x_2, x_1) & \mathbb{V}[x_2] & \cdots & \operatorname{cov}(x_2, x_D) \\
\vdots & \vdots & \ddots & \vdots \\
\operatorname{cov}(x_D, x_1) & \operatorname{cov}(x_D, x_2) & \cdots & \mathbb{V}[x_D]
\end{bmatrix}. \tag{3}
$$

Clearly the diagonal elements in Equation 3 are just the univariate variances of the individual components of $\mathbf{x}$. But what are the cross-terms? These are the covariances of the pairwise combinations of the components of $\mathbf{x}$,

$$
\Sigma_{ij} = \operatorname{cov}(x_i, x_j) = \mathbb{E}\big[(x_i - \mathbb{E}[x_i])(x_j - \mathbb{E}[x_j])\big]. \tag{4}
$$

An obvious observation is that Equation 4 reduces to Equation 1 when $i = j$: the covariance of a variable with itself is just its variance.
Furthermore, we can see that

$$
\operatorname{cov}(x_i, x_j) = \operatorname{cov}(x_j, x_i),
$$

since the product inside the expectation in Equation 4 does not depend on the order of its two factors. This is the intuitive reason that $\boldsymbol{\Sigma}$ is symmetric, $\boldsymbol{\Sigma} = \boldsymbol{\Sigma}^{\top}$. Again, this generalizes Equation 1: the covariance matrix holds each component's variance on its diagonal and simply adds the pairwise covariances off the diagonal.
Using the definition of the Pearson correlation coefficient, $\rho_{ij} = \operatorname{cov}(x_i, x_j) / (\sigma_i \sigma_j)$, where $\sigma_i$ denotes the standard deviation of $x_i$, we can write each covariance as $\operatorname{cov}(x_i, x_j) = \rho_{ij}\, \sigma_i \sigma_j$. And with a little algebra, we can decompose the matrix in Equation 3 as

$$
\boldsymbol{\Sigma} =
\underbrace{\begin{bmatrix}
\sigma_1 & & \\
& \ddots & \\
& & \sigma_D
\end{bmatrix}}_{\mathbf{S}}
\underbrace{\begin{bmatrix}
1 & \rho_{12} & \cdots & \rho_{1D} \\
\rho_{21} & 1 & \cdots & \rho_{2D} \\
\vdots & \vdots & \ddots & \vdots \\
\rho_{D1} & \rho_{D2} & \cdots & 1
\end{bmatrix}}_{\mathbf{R}}
\underbrace{\begin{bmatrix}
\sigma_1 & & \\
& \ddots & \\
& & \sigma_D
\end{bmatrix}}_{\mathbf{S}}. \tag{5}
$$
The middle matrix $\mathbf{R}$ is the correlation matrix, which captures the Pearson (linear) correlation between all the variables in $\mathbf{x}$. Equation 5 makes explicit that a covariance matrix combines two kinds of information: the scale of each individual variable (the standard deviations in $\mathbf{S}$) and the pairwise relationships between variables (the correlations in $\mathbf{R}$).
Finally, note that Equation 5 is easy to invert if we want to recover the correlation matrix from a covariance matrix, $\mathbf{R} = \mathbf{S}^{-1} \boldsymbol{\Sigma}\, \mathbf{S}^{-1}$.
This is because the inverse of a diagonal matrix is simply the reciprocal of each element along the diagonal.
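Here's a quick NumPy check of the decomposition in Equation 5 and its inversion, using `np.corrcoef` for $\mathbf{R}$; again, this is just simulation, not a derivation:

```python
import numpy as np

rng = np.random.default_rng(1)
Sigma_true = np.array([[2.0, 0.6, 0.2],
                       [0.6, 1.0, 0.3],
                       [0.2, 0.3, 0.5]])
x = rng.multivariate_normal(mean=np.zeros(3), cov=Sigma_true, size=200_000)

Sigma = np.cov(x, rowvar=False)            # empirical covariance matrix
S = np.diag(np.sqrt(np.diag(Sigma)))       # diagonal matrix of standard deviations
R = np.corrcoef(x, rowvar=False)           # correlation matrix

print(np.allclose(Sigma, S @ R @ S))                                 # Sigma = S R S
print(np.allclose(R, np.linalg.inv(S) @ Sigma @ np.linalg.inv(S)))   # R = S^{-1} Sigma S^{-1}
```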
Properties
Some important properties of univariate variance have multidimensional analogs. For example, let $a$ be a constant. Then scaling a random variable scales its variance quadratically,

$$
\mathbb{V}[a x] = a^2\, \mathbb{V}[x].
$$

And in the general case, for a constant matrix $\mathbf{A}$,

$$
\operatorname{cov}[\mathbf{A} \mathbf{x}] = \mathbf{A}\, \boldsymbol{\Sigma}\, \mathbf{A}^{\top}.
$$

So both the scalar $a$ and the matrix $\mathbf{A}$ come out of the (co)variance "squared."
As a second example, recall that the univariate variance of $x$ is unchanged by a constant shift $a$,

$$
\mathbb{V}[x + a] = \mathbb{V}[x].
$$

This is intuitive. A constant shift in the distribution does not change its dispersion. And again, in the general case, a constant shift in a multivariate distribution does not change its dispersion:

$$
\operatorname{cov}[\mathbf{x} + \mathbf{a}] = \operatorname{cov}[\mathbf{x}] = \boldsymbol{\Sigma}.
$$
Finally, a standard decomposition of variance is to write it as

$$
\mathbb{V}[x] = \mathbb{E}[x^2] - \big(\mathbb{E}[x]\big)^2.
$$

And this standard decomposition has a multi-dimensional analog:

$$
\operatorname{cov}[\mathbf{x}] = \mathbb{E}\big[\mathbf{x} \mathbf{x}^{\top}\big] - \mathbb{E}[\mathbf{x}]\, \mathbb{E}[\mathbf{x}]^{\top}.
$$
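Rather than prove these, here is a quick simulation-based sketch of the scaling/shift property and the moment decomposition, using sample averages in place of expectations:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.multivariate_normal(mean=[1.0, -2.0], cov=[[1.0, 0.4], [0.4, 2.0]], size=100_000)
Sigma = np.cov(x, rowvar=False)

# Apply a linear map plus a constant shift to every sample: y = A x + b.
A = np.array([[2.0, 1.0], [0.0, 3.0]])
b = np.array([5.0, -7.0])
y = x @ A.T + b

# cov[Ax + b] = A Sigma A^T -- the shift b has no effect on the dispersion.
print(np.allclose(np.cov(y, rowvar=False), A @ Sigma @ A.T))

# cov[x] = E[x x^T] - E[x] E[x]^T, with sample averages standing in for expectations.
mu = x.mean(axis=0)
second_moment = x.T @ x / len(x)
print(np.allclose(second_moment - np.outer(mu, mu), np.cov(x, rowvar=False, bias=True)))
```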
Anyway, my point is not to provide detailed or comprehensive proofs, but only to underscore that covariance matrices have properties that indicate they are simply high-dimensional (co)variances.
Non-negativity
Another neat connection is that covariance matrices are positive semi-definite (PSD), which I’ll denote with

$$
\boldsymbol{\Sigma} \succeq 0,
$$

while univariate variances are non-negative numbers:

$$
\mathbb{V}[x] \geq 0.
$$
In this view, the Cholesky decomposition of a PSD matrix,

$$
\boldsymbol{\Sigma} = \mathbf{L} \mathbf{L}^{\top}, \tag{6}
$$

is simply a high-dimensional square root! So the Cholesky factor $\mathbf{L}$ plays the role of $\sqrt{\boldsymbol{\Sigma}}$, just as the standard deviation $\sigma = \sqrt{\sigma^2}$ is the square root of the variance. Here, I am wildly abusing notation to use $\sqrt{\cdot}$ on a matrix, but hopefully the analogy is clear.
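A tiny NumPy check of this "square root" view:

```python
import numpy as np

Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])

L = np.linalg.cholesky(Sigma)       # lower-triangular Cholesky factor
print(np.allclose(L @ L.T, Sigma))  # True: L plays the role of sqrt(Sigma)
```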
Precision and whitening
The precision of a scalar random variable is the reciprocal of its variance:

$$
\tau = \frac{1}{\sigma^2}.
$$
Hopefully the name is somewhat intuitive. When a random variable has high
precision, this means it has low variance and thus a smaller range of possible
outcomes. A common place that precision arises is when whitening data, or
standardizing it to have zero mean and unit variance. We can do this by
subtracting the mean of $x$ and then dividing by its standard deviation:

$$
z = \frac{x - \mu}{\sigma}. \tag{7}
$$
This is sometimes referred to as z-scoring.
What’s the multivariate analog to this? We can define the precision matrix as the inverse of the covariance matrix,

$$
\boldsymbol{\Lambda} = \boldsymbol{\Sigma}^{-1}.
$$

Given the Cholesky decomposition in Equation 6, we can write the precision matrix as

$$
\boldsymbol{\Lambda} = \big(\mathbf{L} \mathbf{L}^{\top}\big)^{-1} = \mathbf{L}^{-\top} \mathbf{L}^{-1}.
$$

So the multivariate analog to Equation 7 is

$$
\mathbf{z} = \mathbf{L}^{-1} (\mathbf{x} - \boldsymbol{\mu}).
$$
The geometric or visual effect of this operation is to apply a linear transformation (intuitively, a rotation and rescaling) to our data (samples of the random vector) with covariance matrix $\boldsymbol{\Sigma}$, so that the transformed data have zero mean and identity covariance,

$$
\operatorname{cov}\big[\mathbf{L}^{-1}(\mathbf{x} - \boldsymbol{\mu})\big] = \mathbf{I}.
$$
Ignoring the mean, we can easily verify this transformation works:

$$
\operatorname{cov}\big[\mathbf{L}^{-1} \mathbf{x}\big]
= \mathbf{L}^{-1} \operatorname{cov}[\mathbf{x}]\, \mathbf{L}^{-\top}
= \mathbf{L}^{-1} \big(\mathbf{L} \mathbf{L}^{\top}\big) \mathbf{L}^{-\top}
= \mathbf{I}.
$$
This derivation is a little more tedious with a mean, but hopefully the idea is clear. Why the Cholesky decomposition actually works is a deeper idea, one worth its own post, but I think the essential ideas are also captured in principal components analysis (PCA).
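Here’s a small NumPy sketch of whitening with the Cholesky factor, assuming we estimate the mean and covariance from the data themselves:

```python
import numpy as np

rng = np.random.default_rng(3)
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])
mu = np.array([3.0, -1.0])
x = rng.multivariate_normal(mean=mu, cov=Sigma, size=200_000)

# Whiten: z = L^{-1} (x - mu), using the Cholesky factor of the empirical covariance.
L = np.linalg.cholesky(np.cov(x, rowvar=False))
z = np.linalg.solve(L, (x - x.mean(axis=0)).T).T

print(np.round(z.mean(axis=0), 3))           # approximately [0, 0]
print(np.round(np.cov(z, rowvar=False), 3))  # approximately the identity matrix
```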
Summary statistics
It would be useful to summarize the information in a covariance matrix with a single number. To my knowledge, there are at least two such summary statistics that capture different types of information.
Total variance. The total variance of a random vector $\mathbf{x}$ is the trace of its covariance matrix,

$$
\operatorname{tr}(\boldsymbol{\Sigma}) = \sum_{d=1}^{D} \mathbb{V}[x_d].
$$
We can see that total variance is a scalar that summarizes the variance across the components of $\mathbf{x}$, but since the trace only touches the diagonal of $\boldsymbol{\Sigma}$, it completely ignores the covariances between components.
Generalized variance. The generalized variance (Wilks, 1932) of a random vector $\mathbf{x}$ is the determinant of its covariance matrix, $\det(\boldsymbol{\Sigma})$.
There are nice geometric
interpretations of the determinant, but perhaps
the simplest way to think about it here is that the determinant is equal to the
product of the eigenvalues of $\boldsymbol{\Sigma}$,

$$
\det(\boldsymbol{\Sigma}) = \prod_{d=1}^{D} \lambda_d.
$$
So we can think of generalized variance as capturing the magnitude of the linear
transformation represented by $\boldsymbol{\Sigma}$.
Generalized variance is quite different from total variance. For example, consider the two-dimensional covariance matrix implied by the following values: $\sigma_1^2 = \sigma_2^2 = 1$ and $\rho = 0$, i.e.

$$
\boldsymbol{\Sigma} = \begin{bmatrix} 1 & \rho \\ \rho & 1 \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}.
$$

Clearly, the total variance is two, while the determinant is one. Now imagine that the variables are highly correlated, that $\rho$ approaches one. The total variance is still two, but the generalized variance, $\det(\boldsymbol{\Sigma}) = 1 - \rho^2$, shrinks toward zero, since the distribution is collapsing onto a line.
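A quick numerical illustration of this contrast:

```python
import numpy as np

for rho in [0.0, 0.5, 0.9, 0.99]:
    Sigma = np.array([[1.0, rho],
                      [rho, 1.0]])
    total = np.trace(Sigma)                 # total variance: always 2
    generalized = np.linalg.det(Sigma)      # generalized variance: 1 - rho^2
    print(f"rho={rho:5.2f}  total={total:.2f}  generalized={generalized:.4f}")
```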
Examples
Let’s end with two illuminating examples that use the ideas in this post.
Multivariate normal. First, recall that the probability density function (PDF) of a normal random variable is

$$
f(x) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left\{ -\frac{1}{2} \left( \frac{x - \mu}{\sigma} \right)^{2} \right\}.
$$

We can immediately see that the squared term is just the square of Equation 7, the z-score. Now compare this with the PDF of the multivariate normal,

$$
f(\mathbf{x}) = \frac{1}{\sqrt{(2\pi)^{D} \det(\boldsymbol{\Sigma})}} \exp\left\{ -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^{\top} \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}) \right\},
$$

whose quadratic form is the squared Mahalanobis distance. Using the Cholesky factorization, we can write it as

$$
(\mathbf{x} - \boldsymbol{\mu})^{\top} \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})
= \big(\mathbf{L}^{-1}(\mathbf{x} - \boldsymbol{\mu})\big)^{\top} \big(\mathbf{L}^{-1}(\mathbf{x} - \boldsymbol{\mu})\big)
= \mathbf{z}^{\top} \mathbf{z}.
$$

We can see that the Mahalanobis distance is a multivariate whitening. And the variance in the normalizing term, $\sigma^2$ (through $\sigma$), has a multivariate analog that is the generalized variance above, $\det(\boldsymbol{\Sigma})$.
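To make this concrete, here’s a short sketch (assuming SciPy is available) that evaluates the multivariate normal PDF via the whitened residual and the determinant, and compares it with `scipy.stats.multivariate_normal`:

```python
import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])
x = np.array([0.5, 0.5])

# Squared Mahalanobis distance via whitening: z = L^{-1}(x - mu), so the quadratic form is z^T z.
L = np.linalg.cholesky(Sigma)
z = np.linalg.solve(L, x - mu)
mahalanobis_sq = z @ z

D = len(mu)
pdf = np.exp(-0.5 * mahalanobis_sq) / np.sqrt((2 * np.pi) ** D * np.linalg.det(Sigma))

print(np.isclose(pdf, multivariate_normal(mean=mu, cov=Sigma).pdf(x)))  # True
```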
Correlated random variables. Consider a scalar random variable $x$ with zero mean and unit variance. If we want a random variable with variance $\sigma^2$, we can simply scale $x$, since $\mathbb{V}[\sigma x] = \sigma^2\, \mathbb{V}[x] = \sigma^2$.
What is the multi-dimensional version of this? If we have two random variables $x_1$ and $x_2$ that are uncorrelated, each with zero mean and unit variance, how can we transform them into a random variable (here, a random vector $\mathbf{y}$) with a desired covariance matrix $\boldsymbol{\Sigma}$? Following the scalar logic, we should multiply by a "square root" of $\boldsymbol{\Sigma}$, and the Cholesky factor is exactly that:

$$
\operatorname{cov}[\mathbf{L} \mathbf{x}] = \mathbf{L} \operatorname{cov}[\mathbf{x}]\, \mathbf{L}^{\top} = \mathbf{L}\, \mathbf{I}\, \mathbf{L}^{\top} = \boldsymbol{\Sigma}.
$$

What this suggests, however, is a generic algorithm for generating correlated random variables: we multiply uncorrelated, unit-variance random variables by the Cholesky factor of whatever covariance matrix we want. Concretely: draw two i.i.d. random variables $x_1$ and $x_2$ with zero mean and unit variance, stack them into a vector $\mathbf{x} = [x_1, x_2]^{\top}$, and compute $\mathbf{y} = \mathbf{L} \mathbf{x}$.
Of course, this is nice because it can be vectorized and extends to an arbitrary number of random variables. And we can easily account for non-unit variances if we would like.
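Here is that algorithm as a minimal NumPy sketch, checking the empirical covariance of the generated samples:

```python
import numpy as np

rng = np.random.default_rng(4)
Sigma = np.array([[1.0, 0.8],
                  [0.8, 1.0]])

# Draw i.i.d. standard normal variables, then correlate them with the Cholesky factor.
L = np.linalg.cholesky(Sigma)
x = rng.normal(size=(500_000, 2))       # uncorrelated, zero mean, unit variance
y = x @ L.T                             # y = L x for each sample, so cov[y] = L L^T = Sigma

print(np.round(np.cov(y, rowvar=False), 3))  # approximately Sigma
```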
Conclusion
With the proper framing, it is fairly natural to think of a covariance matrix as simply a high-dimensional variance.
- Wilks, S. S. (1932). Certain generalizations in the analysis of variance. Biometrika, 471–494.