Random Noise and the Central Limit Theorem

Many probabilistic models assume random noise is Gaussian distributed. I explain at least part of the motivation for this, which is grounded in the Central Limit Theorem.

Gaussian noise is ubiquitous in modeling. For example, Bayesian linear regression, probabilistic PCA, Bayesian matrix factorization, and many signal processing models all assume some additive noise term $\boldsymbol{\epsilon} \sim \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ for some model-specific $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$. But why do we assume that random noise is Gaussian? There are two answers to this question. First, the Gaussian distribution has some very nice analytic properties. For example, the sum of two independent Gaussian random variables is also Gaussian, and a linear map of a Gaussian random variable produces another Gaussian random variable. But second, the Central Limit Theorem motivates the idea that random noise will most likely be Gaussian. When I first heard this second justification, it was not immediately clear why. The goal of this post is to describe the Central Limit Theorem in detail and then explain how it relates to assuming random noise is Gaussian distributed.
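These two closure properties are easy to check numerically. Below is a minimal sketch; the specific means, variances, and the linear map are arbitrary choices for illustration, not anything prescribed by the models above:

import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Sum of two independent Gaussians: N(1, 2^2) + N(-3, 1^2) should be N(-2, 5).
x = rng.normal(1, 2, n)
y = rng.normal(-3, 1, n)
print((x + y).mean(), (x + y).var())  # roughly -2 and 5

# Linear map of a Gaussian: 3*X + 0.5 with X ~ N(1, 2^2) should be N(3.5, 36).
z = 3 * x + 0.5
print(z.mean(), z.var())  # roughly 3.5 and 36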

The Central Limit Theorem

The Central Limit Theorem (CLT) is arguably one of the most important ideas in probability and statistics because its implications are widespread. The CLT states that, under certain conditions, the sampling distribution of a normalized sum of independent random variables, themselves not necessarily normally distributed, tends towards a normal distribution. There is a lot in that sentence, and it is worth unpacking slowly. In various research conversations, I have heard people casually invoke the CLT as meaning that “everything is Gaussian”, but this is a little sloppy. The statement is more precise, and that precision carries important implications.

First, let us formalize things. Let $X_i$ denote the $i$th draw of a random variable. Note that $X_i$ is the random variable before sampling, meaning it is still a random variable and not fixed. And let $S_n$ be a random variable representing the average of $n$ such draws:

$$
S_n = \frac{X_1 + X_2 + \dots + X_n}{n}
$$

The CLT’s claim is not about an $X_i$ but rather about $S_n$ after normalization.

There are many different variants of the CLT. For example, the De Moivre–Laplace theorem is a special case of the CLT. For simplicity, I begin with the first one presented on Wikipedia, the self-contained Lindeberg–Lévy CLT (Lindeberg, 1922), which states

Lindeberg–Lévy CLT: Suppose $\{ X_1, X_2, \dots \}$ is a sequence of i.i.d. random variables with $\mathbb{E}[X_i] = \mu$ and $\text{Var}[X_i] = \sigma^2 < \infty$. Then as $n$ approaches infinity, the random variables $\sqrt{n}(S_n - \mu)$ converge in distribution to a normal $\mathcal{N}(0, \sigma^2)$.

But what is this random variable $\sqrt{n}(S_n - \mu)$? Didn’t we just say that the CLT is about $S_n$? To see the connection, let’s do a little manipulation. First, since $\mathbb{E}[X_i] = \mu$, then

$$
\begin{aligned}
\mathbb{E}[S_n] &= \mathbb{E} \left[ \frac{X_1 + X_2 + \dots + X_n}{n} \right]
\\
&= \frac{n \mathbb{E}[X_i]}{n}
\\
&= \mu
\end{aligned}
$$

So the term $(S_n - \mu)$ is simply mean-centering the random variable $S_n$.

And what is the variance of $S_n$? Note that for independent random variables $X$ and $Y$, $\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y)$. This makes sense. Since the random variables are uncorrelated, the total variance—our total uncertainty about what values they might each take—is the sum of variances. And if $\alpha$ is a scalar, then $\alpha^2 \text{Var}(X) = \text{Var}(\alpha X)$. So

$$
\begin{aligned}
\text{Var}(S_n) &= \text{Var} \left( \frac{X_1 + X_2 + \dots + X_n}{n} \right)
\\
&= \sum_{i=1}^{n} \text{Var} \left( \frac{X_i}{n} \right)
\\
&= \sum_{i=1}^{n} \frac{1}{n^2} \text{Var}(X_i)
\\
&= \frac{\text{Var}(X_i)}{n}
\\
&= \frac{\sigma^2}{n}
\end{aligned}
$$
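Both of these facts, $\mathbb{E}[S_n] = \mu$ and $\text{Var}(S_n) = \sigma^2 / n$, are easy to verify by simulation. Here is a minimal sketch using the uniform distribution $\mathcal{U}(0, 5)$ (an arbitrary choice; here $\mu = 2.5$ and $\sigma^2 = 25/12$):

import numpy as np

rng = np.random.default_rng(0)
a, b, n, reps = 0, 5, 50, 100_000

# Draw S_n many times and compare its empirical mean and variance
# to mu and sigma^2 / n for the uniform distribution U(a, b).
S = rng.uniform(a, b, size=(reps, n)).mean(axis=1)
mu, sigma2 = (a + b) / 2, (b - a) ** 2 / 12
print(S.mean(), mu)         # both roughly 2.5
print(S.var(), sigma2 / n)  # both roughly 0.0417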

Now if you want to normalize a Gaussian random variable to be $\mathcal{N}(0, 1)$ distributed, you need to mean-center the random variable and divide by its standard deviation:

$$
\frac{(S_n - \mu)}{\sqrt{\frac{\sigma^2}{n}}} = \frac{\sqrt{n} (S_n - \mu)}{\sigma}
$$

And this random variable is distributed as

$$
\frac{\sqrt{n} (S_n - \mu)}{\sigma} \stackrel{d}{\rightarrow} \mathcal{N}(0, 1) \quad \text{as } n \rightarrow \infty
$$

which is equivalent to the Lindeberg–Lévy CLT.
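We can also check this convergence numerically by standardizing $S_n$ and comparing it to a standard normal, for instance with a Kolmogorov–Smirnov test. A minimal sketch, again using the uniform distribution; the use of SciPy's kstest here is my own choice, not part of the theorem:

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
a, b, n, reps = 0, 5, 1000, 10_000
mu, sigma = (a + b) / 2, (b - a) / np.sqrt(12)

# Draw S_n reps times, standardize it, and compare to N(0, 1).
S = rng.uniform(a, b, size=(reps, n)).mean(axis=1)
Z = np.sqrt(n) * (S - mu) / sigma
print(stats.kstest(Z, 'norm'))  # large p-value: consistent with a standard normal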

I think it’s worth mentioning why the Lindeberg–Lévy CLT takes the form it does, namely that the random variable $\sqrt{n} (S_n - \mu)$ approaches $\mathcal{N}(0, \sigma^2)$. First, while $S_n$ and $n S_n$ are both approximately Gaussian distributed, their means and variances are functions of $n$, and therefore these two random variables do not converge to a particular distribution. Furthermore, my guess is that the Lindeberg–Lévy CLT mentions converging to $\mathcal{N}(0, \sigma^2)$ rather than $\mathcal{N}(0, 1)$ because it makes clear that this normalized random variable $\sqrt{n} (S_n - \mu)$ converges to a Gaussian distribution with the same variance as the original random variable.

Intuition for the CLT

Now that we know what the CLT claims, we can play with it a little bit. (I typically start with intuition, but in this case, I found that defining things really helped me understand which random variable was Gaussian distributed.) The CLT claims that if we draw $n$ i.i.d. random variables from a wide class of distributions with mean $\mu$ and finite variance $\sigma^2 < \infty$, then for large $n$ the random variable $\sqrt{n} (S_n - \mu)$ is approximately $\mathcal{N}(0, \sigma^2)$ distributed. Let’s empirically test this.

For this experiment, let’s use the uniform distribution, $\mathcal{U}(a, b)$. First, let’s look at the distribution of $S_n$ where each $X_i \sim \mathcal{U}(a, b)$. The reason we want to look at $S_n$ rather than $\sqrt{n} (S_n - \mu)$ is that it is easier to see the Gaussian distribution “peaking” if it is un-normalized (Figure 1).

Figure 1. For each $n$, we draw $n$ uniformly distributed random variables $X_i \sim \mathcal{U}(0, 5)$ and compute the average $S_n = \frac{1}{n} \sum_{i=1}^{n} X_i$. We sample a new $S_n$ ten thousand times for each $n$ and then compute the histogram of the variables $S_n$.

In Figure 1, we can see that when $n = 1$, the histogram looks like the uniform distribution. But as $n$ increases, the histogram of the random variable $S_n$ looks more and more Gaussian distributed. The minimal example in Python to generate this figure is in the Appendix.

Of course, Figure 1 does not precisely demonstrate what is stated in the Lindeberg–Lévy CLT. We really should normalize $S_n$ and verify that $\sqrt{n}(S_n - \mu)$ converges to $\mathcal{N}(0, \sigma^2)$, where $\sigma^2 = \frac{1}{12}(b - a)^2$ is the variance of the uniform distribution (Figure 2).

Figure 2. Demonstration of the CLT. For each $n$, we draw $n$ uniformly distributed random variables $X_i \sim \mathcal{U}(0, 5)$ and compute the normalized quantity $\sqrt{n} (S_n - \mu)$. We compute a new $\sqrt{n} (S_n - \mu)$ ten thousand times for each $n$ and then compute the histogram of these normalized variables. The red curve is the density of $\mathcal{N}(0, \sigma^2)$ where $\sigma^2 = \frac{1}{12} (5 - 0)^2$.
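The Appendix only lists code for Figure 1, but a sketch of how Figure 2 might be generated under the same setup looks like this (the variable names and number of repetitions here are my own):

import numpy as np
from scipy.stats import norm
import matplotlib.pyplot as plt

a, b, reps = 0, 5, 10_000
mu, sigma2 = (a + b) / 2, (b - a) ** 2 / 12
fig, axes = plt.subplots(1, 3)

for n, ax in zip([1, 2, 20], axes.flat):
    # Normalized quantity sqrt(n) * (S_n - mu), re-sampled reps times.
    rvs = [np.sqrt(n) * (np.random.uniform(a, b, n).mean() - mu) for _ in range(reps)]
    ax.hist(rvs, density=True)
    # Overlay the N(0, sigma^2) density in red.
    xs = np.linspace(-3 * np.sqrt(sigma2), 3 * np.sqrt(sigma2), 200)
    ax.plot(xs, norm.pdf(xs, 0, np.sqrt(sigma2)), color='red')
plt.show()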

The Lyapunov CLT

The Lindeberg–Lévy CLT has two conditions on our random variables. The random variables must (1) be i.i.d. and (2) have finite variance. If either of these conditions is not met, the CLT is not guaranteed to hold. For example, consider the Cauchy distribution, which is pathological in the sense that both its mean and variance are undefined; in particular, the condition $\sigma^2 < \infty$ fails. If you take the code in the Appendix and re-run it with np.random.standard_cauchy, you will find that $S_n$ is not Gaussian distributed.
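Here is a minimal sketch of that check (the values of $n$ and the number of repetitions are arbitrary). Because the sample mean of standard Cauchy draws is itself standard Cauchy, its spread does not shrink as $n$ grows:

import numpy as np

rng = np.random.default_rng(0)
reps = 2000

# The sample mean of standard Cauchy draws is again standard Cauchy,
# so increasing n does not concentrate it around any value.
for n in [10, 1000, 10_000]:
    S = rng.standard_cauchy(size=(reps, n)).mean(axis=1)
    print(n, np.percentile(S, [5, 50, 95]))  # spread stays roughly the same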

But it is noteworthy that while the samples must be independent, they need not be identically distributed. After reading the Lindeberg–Lévy CLT, I assumed the data must be identically distributed, but I could not convince myself why. To understand my thinking, consider some sequence of random variables,

$$
X_1 + Y_2 + Y_3 + X_4 + \dots + X_n
$$

where $X_i$ and $Y_i$ denote differently distributed random variables. If we just re-order these random variables and re-number them, we get:

$$
X_1 + X_2 + \dots + X_m + Y_1 + Y_2 + \dots + Y_k
$$

where $m$ is the number of $X_i$ samples and $k$ is the number of $Y_i$ samples. By the Lindeberg–Lévy CLT, the sums $X_1 + X_2 + \dots + X_m$ and $Y_1 + Y_2 + \dots + Y_k$ are each approximately Gaussian distributed for large $m$ and $k$, and since these two sums are independent, their sum must also be approximately Gaussian distributed. After some searching, I realized that this intuition has already been formalized as the Lyapunov CLT, which states that the random variables $X_i$ must be independent but not necessarily identically distributed, provided a moment condition (the Lyapunov condition) holds.

In my mind, this is an important generalization for understanding why noise is so often modeled as Gaussian distributed.
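To illustrate, here is a sketch that mixes independent draws from two different distributions (uniform and exponential, chosen arbitrarily), standardizes the total by its exact mean and variance, and checks that the result behaves like a standard normal:

import numpy as np

rng = np.random.default_rng(0)
reps, m, k = 10_000, 500, 500

# m uniform draws plus k exponential draws: independent but not
# identically distributed.
X = rng.uniform(0, 5, size=(reps, m))     # each has mean 2.5, variance 25/12
Y = rng.exponential(2.0, size=(reps, k))  # each has mean 2.0, variance 4.0
total = X.sum(axis=1) + Y.sum(axis=1)

# Standardize by the exact mean and variance of the total.
mean = m * 2.5 + k * 2.0
var = m * (25 / 12) + k * 4.0
Z = (total - mean) / np.sqrt(var)
print(Z.mean(), Z.var())  # roughly 0 and 1; a histogram of Z looks Gaussian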

Additive Gaussian noise

Now that we understand the Lyapunov CLT, the assumption that noise is Gaussian starts to make sense. Noise is not one thing but rather the byproduct of interference from potentially many different sources. Let’s think of an example. Imagine a Bluetooth speaker receiving a signal from your laptop. In this context, noise can be many things: a microwave oven with a similar radio frequency, sensor errors due to overheating, physical interference as you pick up the speaker, and so on. Each of these sources of noise can be thought of as interfering with or being added to the true signal from your laptop. And while these sources of noise are neither Gaussian distributed nor identically distributed in general, their total effect can be plausibly modeled as a single Gaussian random variable—or additive Gaussian noise.
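As a rough illustration, the sketch below invents a few non-Gaussian noise sources (the specific distributions and parameters are hypothetical) and sums them; the combined noise is already approximately Gaussian:

import numpy as np

rng = np.random.default_rng(0)
reps = 10_000

# Hypothetical, non-Gaussian noise sources affecting each sample of a signal.
microwave = rng.uniform(-0.5, 0.5, reps)                        # broadband interference
heat = rng.exponential(0.2, reps) - 0.2                         # sensor error, mean-centered
bumps = rng.binomial(1, 0.3, reps) * rng.normal(0, 0.5, reps)   # occasional physical jolts
residual = rng.uniform(-0.1, 0.1, size=(reps, 20)).sum(axis=1)  # many small leftover effects

noise = microwave + heat + bumps + residual
print(noise.mean(), noise.std())  # a histogram of `noise` looks approximately Gaussian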


Appendix

1. Code to generate Figure 1

import numpy as np
import matplotlib.pyplot as plt

# Parameters of the uniform distribution U(a, b) and the number of
# times we re-sample S_n for each n.
a = 0
b = 5
reps = 2000
fig, axes = plt.subplots(1, 3)

# For each n, repeatedly draw n uniform samples, average them to get S_n,
# and plot the histogram of the resulting S_n values.
for n, ax in zip([1, 2, 20], axes.flat):
    rvs = [np.random.uniform(a, b, n).mean() for _ in range(reps)]
    ax.hist(rvs, density=True)
plt.show()

1. Lindeberg, J. W. (1922). Eine neue Herleitung des Exponentialgesetzes in der Wahrscheinlichkeitsrechnung. Mathematische Zeitschrift, 15(1), 211–225.