De Moivre–Laplace Theorem

I work through a standard proof of the de Moivre–Laplace theorem, which is the earliest version of the central limit theorem.

As I understand it, the de Moivre–Laplace theorem is the earliest version of the central limit theorem (CLT). In his book The Doctrine of Chances (De Moivre, 1738), Abraham de Moivre proved that the probability mass function of the binomial distribution asymptotically approximates the probability density function of a particular normal distribution as the parameter $n$ grows arbitrarily large. Today, we know that the CLT generalizes this result, and we might say this is a special case of the CLT for the binomial distribution.

To introduce notation, we say that $X_n$ is a binomial random variable with parameters $n$ and $p$ if

$$
\mathbb{P}(X_n = k) = {n \choose k} p^k q^{n-k}, \qquad p \in [0, 1], \;\; q := 1-p, \;\; n \in \mathbb{N}. \tag{1}
$$

Typically, we view $X_n$ as the sum of $n$ Bernoulli random variables, each with parameter $p$. Intuitively, if we flip $n$ coins, each with bias $p$, Equation 1 gives the probability of $k$ successes.
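Throughout, I'll sanity-check steps numerically. Here is a minimal check (assuming SciPy is installed; the parameter values are arbitrary) that Equation 1 matches SciPy's built-in binomial PMF:

```python
import math

from scipy.stats import binom

n, p = 20, 0.3
q = 1 - p

for k in [0, 5, 10, 20]:
    # Equation 1, computed directly from the definition.
    manual = math.comb(n, k) * p**k * q ** (n - k)
    # Compare against SciPy's binomial PMF.
    assert math.isclose(manual, binom.pmf(k, n, p))
```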

This is clearly related to the CLT, which loosely states that the properly normalized sum of independent random variables asymptotically approaches the normal distribution. If we let $Y_i$ denote these Bernoulli random variables, we can express this idea as

$$
\overbrace{\vphantom{\Big|} Y_1 + Y_2 + \dots + Y_n}^{X_n} \;\simeq\; \mathcal{N}\!\left(np, npq\right), \tag{2}
$$

where $\simeq$ denotes asymptotic equivalence as $n \rightarrow \infty$.

This is probably the most intuitive form of the CLT, because if we simply plot the probability mass function (PMF) of the binomial distribution for increasing values of $n$, we get a discrete distribution that looks a lot like the normal distribution even for relatively small $n$ (Figure 1). In contrast, the CLT feels much less obvious if I claim (correctly) that the properly normalized sum of skew normal random variables is also normally distributed!

Figure 1. Binomial distributions for increasing $n$ and fixed $p = 1/2$. The red line is the probability density function of the normal distribution with mean $np$ and variance $npq$.
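Here is a rough sketch of how a figure like Figure 1 could be generated, assuming Matplotlib and SciPy (this is my own illustrative code, not necessarily how the figure was made):

```python
import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import binom, norm

p = 0.5
fig, axes = plt.subplots(1, 3, figsize=(12, 3))
for ax, n in zip(axes, [5, 20, 100]):
    ks = np.arange(n + 1)
    ax.bar(ks, binom.pmf(ks, n, p), color="gray")  # Binomial PMF.
    # Normal PDF with matching mean np and variance npq.
    xs = np.linspace(0, n, 500)
    ax.plot(xs, norm.pdf(xs, n * p, np.sqrt(n * p * (1 - p))), color="red")
    ax.set_title(f"n = {n}")
plt.tight_layout()
plt.show()
```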

A modern version of de Moivre’s proof is tedious, but it’s not actually that hard to follow. This post is simply my notes on that proof. To start, let’s rewrite the binomial coefficient without factorials using Stirling’s approximation:

$$
n! \simeq \sqrt{2\pi n} \left(\frac{n}{e}\right)^n. \tag{3}
$$
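Stirling's approximation is easy to check numerically. In this quick sketch, the relative error shrinks as $n$ grows:

```python
import math

# Relative error of Stirling's approximation, which vanishes as n grows.
for n in [1, 5, 10, 50, 100]:
    stirling = math.sqrt(2 * math.pi * n) * (n / math.e) ** n
    exact = math.factorial(n)
    print(n, abs(stirling - exact) / exact)
```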

As a historical aside, note that while Stirling is credited with this approximation, it was actually de Moivre who discovered an early version of it while working on these ideas. So de Moivre has been robbed twice, once for this approximation and once for the normal distribution sometimes being called the “Gaussian” rather than the “de Moivrian”. Anyway, using Stirling’s approximation, we can rewrite the binomial coefficient as

$$
\begin{aligned}
{n \choose k} &\simeq \sqrt{\frac{2\pi n}{(2\pi k)(2\pi (n-k))}} \left(\frac{n}{e}\right)^n \left(\frac{k}{e}\right)^{-k} \left(\frac{n-k}{e}\right)^{k-n} \\
&= \sqrt{\frac{n}{2\pi k (n-k)}} \left(\frac{n^n}{k^k (n-k)^{n-k}}\right).
\end{aligned} \tag{4}
$$
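We can check Equation 4 numerically. In this sketch, I compute the approximation in log space to avoid overflow (the particular $n$ and $k$ are arbitrary):

```python
import math

# Equation 4's approximation to the binomial coefficient, computed in log
# space to avoid overflow, versus the exact value.
n, k = 100, 40
log_term = n * math.log(n) - k * math.log(k) - (n - k) * math.log(n - k)
approx = math.sqrt(n / (2 * math.pi * k * (n - k))) * math.exp(log_term)
print(approx, math.comb(n, k))  # Agree to roughly 1% here.
```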

If we multiply this term by the “raw probabilities” $p^k q^{n-k}$ and group the terms raised to the powers $k$ and $n-k$, we get:

$$
\begin{aligned}
{n \choose k} p^k q^{n-k} &\simeq \sqrt{\frac{n}{2\pi k (n-k)}} \left(\frac{n^n}{k^k (n-k)^{n-k}}\right) p^k q^{n-k} \\
&= \sqrt{\frac{n}{2\pi k (n-k)}} \left( \frac{np}{k} \right)^k \left( \frac{nq}{n-k} \right)^{n-k}.
\end{aligned} \tag{5}
$$

My understanding of the motivation for the next two steps is that we want to “push” $n$ into the denominator, which is often convenient in asymptotics because such terms vanish as $n$ gets larger. Let’s tackle the normalizing term (the square root) and the probabilities separately.

First, the square root. Note that by the law of large numbers, as $n$ gets very large, $k/n$ gets arbitrarily close to the true probability of success $p$. So let’s rewrite the square root in terms of $k/n$ and then replace $k/n$ with $p$:

$$
\sqrt{\frac{n}{2\pi k (n-k)}} = \sqrt{\frac{1}{2\pi \frac{k}{n} n \left(1 - \frac{k}{n}\right)}} \simeq \frac{1}{\sqrt{2\pi npq}}. \tag{6}
$$

If you’re already familiar with the normal distribution, this term should look suspiciously like its normalizing constant!
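Here is a quick numerical check of Equation 6, where I simply set $k$ to the nearest integer to $np$ (an illustrative choice):

```python
import math

p = 0.3
q = 1 - p
for n in [10, 100, 1000, 10_000]:
    k = round(n * p)  # k/n concentrates near p by the law of large numbers.
    lhs = math.sqrt(n / (2 * math.pi * k * (n - k)))
    rhs = 1 / math.sqrt(2 * math.pi * n * p * q)
    print(n, lhs, rhs)  # The two columns converge.
```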

Second, the probabilities. The next step is a fairly standard trick, which is to convert a product into a sum by taking the exp-log of the product. Looking only at the terms raised to $k$ and $n-k$ in Equation 5, we get:

$$
\begin{aligned}
\left( \frac{np}{k} \right)^k \left( \frac{nq}{n-k} \right)^{n-k} &= \exp \left\{ \log \left( \frac{np}{k} \right)^k + \log \left( \frac{nq}{n-k} \right)^{n-k} \right\} \\
&= \exp \left\{ - k \log \left( \frac{k}{np} \right) + (k-n) \log \left( \frac{n-k}{nq} \right) \right\}.
\end{aligned} \tag{7}
$$

The next trick is to express $k$ in terms of a standardized binomial random variable $z$. Notice that $X_n$ is the sum of $n$ independent Bernoulli random variables. By the linearity of expectation and the additivity of variance under independence, we have:

$$
\begin{aligned}
\mathbb{E}[X_n] &= \sum_{i=1}^n \mathbb{E}[Y_i] = np, \\
\mathbb{V}[X_n] &= \sum_{i=1}^n \mathbb{V}[Y_i] = npq.
\end{aligned} \tag{8}
$$

Since the mean of $X_n$ is $np$ and its variance is $npq$, a standardized binomial random variable is

$$
z := \frac{k - \mathbb{E}[X_n]}{\sqrt{\mathbb{V}[X_n]}} = \frac{k - np}{\sqrt{npq}}. \tag{9}
$$

Solving this for $k$, we get

$$
k = np + z \sqrt{npq}. \tag{10}
$$
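As a quick empirical check of this standardization (assuming NumPy; the parameters are arbitrary), standardized binomial draws should have mean near zero and standard deviation near one:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 1000, 0.3
q = 1 - p

# Standardize binomial draws as in Equation 9; the result should have
# mean near 0 and standard deviation near 1.
samples = rng.binomial(n, p, size=100_000)
z = (samples - n * p) / np.sqrt(n * p * q)
print(z.mean(), z.std())
```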

Substituting this definition of $k$ into Equation 7 (the point here is to express $k$ in terms of $n$, the term we want to pay attention to as it increases), we get:

$$
\begin{aligned}
&\exp \left\{ - k \log \left( \frac{k}{np} \right) + (k-n) \log \left( \frac{n-k}{nq} \right) \right\} \\
&= \exp \left\{ - k \log \left( \frac{np + z \sqrt{npq}}{np} \right) + (k-n) \log \left( \frac{n - np - z \sqrt{npq}}{nq} \right) \right\} \\
&= \exp \left\{ - k \log \left( 1 + z \sqrt{\frac{q}{np}} \right) + (k-n) \log \left( 1 - z \sqrt{\frac{p}{nq}} \right) \right\}.
\end{aligned} \tag{11}
$$

In my mind, the final step is the least obvious, but it’s lovely when you see it. Recall that the Maclaurin series of $\log(1+x)$ is

$$
\log(1+x) = x - \frac{x^2}{2} + \frac{x^3}{3} - \dots, \qquad |x| < 1. \tag{12}
$$
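If you want to convince yourself numerically rather than by hand, here is a tiny sketch comparing a truncated series to the true value:

```python
import math

# Compare a truncated Maclaurin series for log(1 + x) to the true value.
x = 0.1
partial = sum((-1) ** (j + 1) * x**j / j for j in range(1, 6))
print(partial, math.log(1 + x))  # Agree to several decimal places.
```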

The series itself is a fairly standard result, and it’s worth deriving yourself if you’ve never done it. Note that both definitions of $x$ below shrink to zero as $n$ grows, so the expansion is valid for large $n$. Anyway, we can plug in these two definitions of $x$,

$$
x := z \sqrt{\frac{q}{np}}, \qquad x := -z \sqrt{\frac{p}{nq}}, \tag{13}
$$

into Equation 12 above, and use that to expand the logs in Equation 11 into infinite sums. Why are we doing this? The key idea is that nearly every term in each sum will be a fraction with $n$ in the denominator. So as $n$ grows larger, these terms become arbitrarily small. In the limit, they vanish. All that will be left is the normal distribution’s kernel, $\exp\{-0.5 z^2\}$. Let’s do this.

First, let’s just look at one of the log terms. We can write the left one as:

$$
\begin{aligned}
&-k \log \left( 1 + z \sqrt{\frac{q}{np}} \right) \\
&= -(np + z\sqrt{npq})\left[z \left(\frac{q}{np}\right)^{1/2} - \frac{1}{2} z^2 \frac{q}{np} + \frac{1}{3} z^3 \left(\frac{q}{np}\right)^{3/2} - \dots \right].
\end{aligned} \tag{14}
$$

The key thing to see is that for most terms in the sum, even after we multiply by $n$ or $\sqrt{n}$, we still have $n$ in the denominator. And these terms vanish, since for any constant $c$, the ratio $c/n$ goes to zero as $n \rightarrow \infty$. So multiplying out the terms in Equation 14, we get

$$
\begin{aligned}
&\left[ - z\sqrt{npq} + \frac{1}{2} z^2 q - \frac{1}{3} z^3 \frac{q^{3/2}}{\sqrt{np}} + \dots \right] + \left[ - z^2 q + \frac{1}{2} z^3 \frac{q^{3/2}}{\sqrt{np}} - \frac{1}{3} z^4 \frac{q^2}{np} + \dots \right] \\
&\simeq -z\sqrt{npq} + \frac{1}{2} z^2 q - z^2 q \\
&= -z\sqrt{npq} - \frac{1}{2} z^2 q.
\end{aligned} \tag{15}
$$

That’s the basic idea. If we do the expansion for the other term in Equation 11, we’ll see that it’s asymptotically

$$
\begin{aligned}
(k-n) \log \left( 1 - z \sqrt{\frac{p}{nq}} \right) &= z\sqrt{npq} + \frac{1}{2} z^2 p - z^2 p + \dots \\
&\simeq z\sqrt{npq} - \frac{1}{2} z^2 p.
\end{aligned} \tag{16}
$$

Putting these two terms together, we see that the exponential term is asymptotically:

$$
\begin{aligned}
&\exp \left\{ - k \log \left( 1 + z \sqrt{\frac{q}{np}} \right) + (k-n) \log \left( 1 - z \sqrt{\frac{p}{nq}} \right) \right\} \\
&\simeq \exp\left\{ -z\sqrt{npq} - \frac{1}{2} z^2 q + z\sqrt{npq} - \frac{1}{2} z^2 p \right\} \\
&= \exp\left\{ - \frac{1}{2} z^2 p - \frac{1}{2} z^2 q \right\} \\
&= \exp\left\{ - \frac{1}{2} z^2 (p + q) \right\} \\
&= \exp\left\{ - \frac{1}{2} z^2 \right\}.
\end{aligned} \tag{17}
$$
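We can verify this limit numerically. In this sketch, I hold $z$ fixed and let $n$ grow; the exponent in Equation 17 should approach $-z^2/2$ (the parameter values are arbitrary):

```python
import math

# For fixed z, the exponent should approach -z^2 / 2 as n grows.
p, z = 0.3, 1.5
q = 1 - p

for n in [10**2, 10**4, 10**6]:
    k = n * p + z * math.sqrt(n * p * q)  # Equation 10; k need not be an integer here.
    exponent = (-k * math.log(1 + z * math.sqrt(q / (n * p)))
                + (k - n) * math.log(1 - z * math.sqrt(p / (n * q))))
    print(n, exponent, -0.5 * z**2)
```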

The surviving factor $\exp\{-0.5 z^2\}$ is exactly the normal distribution’s kernel! Putting this together with the normalizing term in Equation 6 and then using the definition of the standardized variable $z$ in Equation 9, we get:

$$
{n \choose k} p^k q^{n-k} \;\simeq\; \frac{1}{\sqrt{2\pi npq}} \exp\left\{ - \frac{1}{2} \left( \frac{k - np}{\sqrt{npq}} \right)^2 \right\}. \tag{18}
$$

And we’re done! This is quite elegant, because we have expressed this asymptotic distribution in terms of the mean and variance of $X_n$.
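As a final sanity check (assuming SciPy; the values are arbitrary), Equation 18 closely matches the exact binomial PMF for large $n$:

```python
import math

from scipy.stats import binom

n, p = 1000, 0.3
q = 1 - p

k = 310  # A value near the mean, np = 300.
z = (k - n * p) / math.sqrt(n * p * q)
approx = math.exp(-0.5 * z**2) / math.sqrt(2 * math.pi * n * p * q)
print(approx, binom.pmf(k, n, p))  # Close for large n.
```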

This is remarkable! I still remember the first time I saw this derived and realized precisely why the normal distribution was so pervasive. The normal distribution is everywhere because if you take a bunch of random noise and smash it together, the result is most likely normally distributed!

Figure 2. (Left) A histogram of $Y_i$, where each sample is skew normally distributed. (Right) A histogram of $X_n$, where $X_n = Y_1 + \dots + Y_n$ is approximately normally distributed.

Note that the more general CLT does not require that the random variables in the sum be Bernoulli distributed. For example, if $X_n$ is the sum of $n$ independent skew normal random variables, $X_n$ itself is still approximately normally distributed for large $n$! See Figure 2 for a numerical experiment demonstrating this. The de Moivre–Laplace theorem was the first hint that this more general result, the central limit theorem, was actually true.
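Here is a sketch of the kind of numerical experiment behind Figure 2, using SciPy's skew normal distribution (my parameter choices are arbitrary, and the original figure's settings may differ):

```python
import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import skewnorm

rng = np.random.default_rng(0)
a = 5            # Skewness parameter, an arbitrary choice.
n = 100          # Number of terms in each sum.
trials = 10_000  # Number of sums to draw.

# Each row is one realization of (Y_1, ..., Y_n); each X_n is a row sum.
samples = skewnorm.rvs(a, size=(trials, n), random_state=rng)
sums = samples.sum(axis=1)

fig, axes = plt.subplots(1, 2, figsize=(9, 3))
axes[0].hist(samples[:, 0], bins=50)  # Skewed individual draws.
axes[1].hist(sums, bins=50)           # Their sums look approximately normal.
plt.show()
```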

  1. De Moivre, A. (1738). The doctrine of chances: or, A method of calculating the probabilities of events in play. Woodfall.