Asymptotic Normality of Maximum Likelihood Estimators

Under certain regularity conditions, maximum likelihood estimators are "asymptotically efficient", meaning that they achieve the Cramér–Rao lower bound in the limit. I discuss this result.

Given a statistical model $\mathbb{P}_{\theta}$ and a random variable $X \sim \mathbb{P}_{\theta_0}$ where $\theta_0$ are the true generative parameters, maximum likelihood estimation (MLE) finds a point estimate $\hat{\theta}_N$ such that the resulting distribution “most likely” generated the data. MLE is popular for a number of theoretical reasons, one such reason being that MLE is asymptotically efficient: in the limit, a maximum likelihood estimator achieves the minimum possible variance, the Cramér–Rao lower bound. Recall that point estimators, as functions of $X$, are themselves random variables. Therefore, a low-variance estimator $\hat{\theta}_N$ estimates the true parameter $\theta_0$ more precisely.

To state our claim more formally, let $X = \langle X_1, \dots, X_N \rangle$ be a finite sample where $X \sim \mathbb{P}_{\theta_0}$ with $\theta_0 \in \Theta$ being the true but unknown parameter. Let $\rightarrow^p$ denote convergence in probability and $\rightarrow^d$ denote convergence in distribution. Our claim of asymptotic normality is the following:

Asymptotic normality: Assume $\hat{\theta}_N \rightarrow^p \theta_0$ with $\theta_0 \in \Theta$ and that other regularity conditions hold. Then

$$
\sqrt{N}(\hat{\theta}_N - \theta_0) \rightarrow^d \mathcal{N}(0, \mathcal{I}(\theta_0)^{-1}) \tag{1}
$$

where $\mathcal{I}(\theta_0)$ is the Fisher information.

By “other regularity conditions”, I simply mean that I do not want to give a detailed accounting of every assumption in this post. Obviously, one should consult a standard textbook for a more rigorous treatment.

If asymptotic normality holds, then asymptotic efficiency falls out because it immediately implies

$$
\hat{\theta}_N \rightarrow^d \mathcal{N}(\theta_0, \mathcal{I}_N(\theta_0)^{-1}). \tag{2}
$$

I use the notation $\mathcal{I}_N(\theta)$ for the Fisher information of the full sample $X$ and $\mathcal{I}(\theta)$ for the Fisher information of a single observation $X_n \in X$. Provided the data are i.i.d., $\mathcal{I}_N(\theta) = N \mathcal{I}(\theta)$. See my previous post on properties of the Fisher information for details.
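For completeness, here is a quick sketch of that identity for i.i.d. data, using the fact (also covered in that post) that the Fisher information equals the negative expected second derivative of the log likelihood:

$$
\mathcal{I}_N(\theta)
= -\mathbb{E}\left[ \frac{\partial^2}{\partial \theta^2} \log \prod_{n=1}^N f_X(X_n; \theta) \right]
= \sum_{n=1}^N -\mathbb{E}\left[ \frac{\partial^2}{\partial \theta^2} \log f_X(X_n; \theta) \right]
= N \mathcal{I}(\theta).
$$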

The goal of this post is to discuss the asymptotic normality of maximum likelihood estimators. This post relies on understanding the Fisher information and the Cramér–Rao lower bound.

Proof of asymptotic normality

To prove asymptotic normality of MLEs, define the normalized log-likelihood function and its first and second derivatives with respect to $\theta$ as

$$
\begin{aligned}
L_N(\theta) &= \frac{1}{N} \log f_X(x; \theta),
\\
L^{\prime}_N(\theta) &= \frac{\partial}{\partial \theta} \left( \frac{1}{N} \log f_X(x; \theta) \right),
\\
L^{\prime\prime}_N(\theta) &= \frac{\partial^2}{\partial \theta^2} \left( \frac{1}{N} \log f_X(x; \theta) \right).
\end{aligned} \tag{3}
$$

By definition, the MLE is a maximizer of the log likelihood function, and therefore

$$
\hat{\theta}_N = \arg\!\max_{\theta \in \Theta} \log f_X(x; \theta) \quad \implies \quad L^{\prime}_N(\hat{\theta}_N) = 0. \tag{4}
$$

Now let’s apply the mean value theorem,

Mean value theorem: Let $f$ be a continuous function on the closed interval $[a, b]$ and differentiable on the open interval. Then there exists a point $c \in (a, b)$ such that

$$
f^{\prime}(c) = \frac{f(a) - f(b)}{a - b} \tag{5}
$$

We apply this with $f = L_N^{\prime}$, $a = \hat{\theta}_N$, and $b = \theta_0$. Then for some point $c = \tilde{\theta} \in (\hat{\theta}_N, \theta_0)$, we have

$$
L_N^{\prime}(\hat{\theta}_N) = L_N^{\prime}(\theta_0) + L_N^{\prime\prime}(\tilde{\theta})(\hat{\theta}_N - \theta_0). \tag{6}
$$

Above, we have just rearranged terms: Equation 5 with these choices reads $L_N^{\prime\prime}(\tilde{\theta}) = \frac{L_N^{\prime}(\hat{\theta}_N) - L_N^{\prime}(\theta_0)}{\hat{\theta}_N - \theta_0}$; multiplying both sides by $\hat{\theta}_N - \theta_0$ and adding $L_N^{\prime}(\theta_0)$ gives Equation 6. (Note that other proofs might apply the more general Taylor’s theorem and show that the higher-order terms are bounded in probability.) Now by definition $L^{\prime}_N(\hat{\theta}_N) = 0$, and we can write

$$
\hat{\theta}_N - \theta_0 = - \frac{L_N^{\prime}(\theta_0)}{L_N^{\prime\prime}(\tilde{\theta})} \quad \implies \quad \sqrt{N}(\hat{\theta}_N - \theta_0) = - \frac{\sqrt{N}\, L_N^{\prime}(\theta_0)}{L_N^{\prime\prime}(\tilde{\theta})}. \tag{7}
$$

Let’s tackle the numerator and denominator separately. The upshot is that we can show the numerator converges in distribution to a normal distribution using the Central Limit Theorem, and that the denominator converges in probability to a constant value using the Weak Law of Large Numbers. Then we can invoke Slutsky’s theorem.

For the numerator, by the linearity of differentiation and the fact that the log of a product is the sum of the logs, we have

$$
\begin{aligned}
\sqrt{N} L^{\prime}_N(\theta_0)
&= \sqrt{N} \left( \frac{1}{N} \left[ \frac{\partial}{\partial \theta} \log f_X(X; \theta_0) \right] \right)
\\
&= \sqrt{N} \left( \frac{1}{N} \left[ \frac{\partial}{\partial \theta} \log \prod_{n=1}^N f_X(X_n; \theta_0) \right] \right)
\\
&= \sqrt{N} \left( \frac{1}{N} \sum_{n=1}^N \left[ \frac{\partial}{\partial \theta} \log f_X(X_n; \theta_0) \right] \right)
\\
&= \sqrt{N} \left( \frac{1}{N} \sum_{n=1}^N \left[ \frac{\partial}{\partial \theta} \log f_X(X_n; \theta_0) \right] - \mathbb{E}\left[\frac{\partial}{\partial \theta} \log f_X(X_1; \theta_0)\right] \right).
\end{aligned} \tag{8}
$$

In the last line, we use the fact that the expected value of the score function (the derivative of the log likelihood) is zero. Without loss of generality, we take $X_1$,

$$
\mathbb{E}\left[\frac{\partial}{\partial \theta} \log f_X(X_1; \theta_0)\right] = 0. \tag{9}
$$

See my previous post on properties of the Fisher information for a proof. Equation 8 allows us to invoke the Central Limit Theorem to say that

$$
\sqrt{N} L^{\prime}_N(\theta_0) \rightarrow^d \mathcal{N}\left(0, \mathbb{V}\left[\frac{\partial}{\partial \theta} \log f_X(X_1; \theta_0)\right]\right). \tag{10}
$$

This variance is just the Fisher information for a single observation,

$$
\begin{aligned}
\mathbb{V}\left[\frac{\partial}{\partial \theta} \log f_X(X_1; \theta_0)\right]
&= \mathbb{E}\left[\left(\frac{\partial}{\partial \theta} \log f_X(X_1; \theta_0)\right)^2\right] - \left(\underbrace{\mathbb{E}\left[\frac{\partial}{\partial \theta} \log f_X(X_1; \theta_0)\right]}_{=\,0}\right)^2
\\
&= \mathcal{I}(\theta_0).
\end{aligned} \tag{11}
$$

For the denominator, we first invoke the Weak Law of Large Numbers (WLLN) for any $\theta$,

$$
\begin{aligned}
L_N^{\prime\prime}(\theta)
&= \frac{1}{N} \left( \frac{\partial^2}{\partial \theta^2} \log f_X(X; \theta) \right)
\\
&= \frac{1}{N} \left( \frac{\partial^2}{\partial \theta^2} \log \prod_{n=1}^N f_X(X_n; \theta) \right)
\\
&= \frac{1}{N} \sum_{n=1}^N \left( \frac{\partial^2}{\partial \theta^2} \log f_X(X_n; \theta) \right)
\\
&\rightarrow^p \mathbb{E}\left[ \frac{\partial^2}{\partial \theta^2} \log f_X(X_1; \theta) \right].
\end{aligned} \tag{12}
$$

In the last step, we invoke the WLLN and, without loss of generality, write the expectation in terms of $X_1$. Now note that $\tilde{\theta} \in (\hat{\theta}_N, \theta_0)$ by construction and that we assume $\hat{\theta}_N \rightarrow^p \theta_0$. Taken together, these imply $\tilde{\theta} \rightarrow^p \theta_0$, and we have

$$
L_N^{\prime\prime}(\tilde{\theta}) \rightarrow^p \mathbb{E}\left[ \frac{\partial^2}{\partial \theta^2} \log f_X(X_1; \theta_0) \right] = - \mathcal{I}(\theta_0). \tag{13}
$$

If you’re unconvinced that the expected value of the derivative of the score is equal to the negative of the Fisher information, once again see my previous post on properties of the Fisher information for a proof.

To summarize, we have shown that

$$
\sqrt{N} L^{\prime}_N(\theta_0) \rightarrow^d \mathcal{N}(0, \mathcal{I}(\theta_0)) \tag{14}
$$

and

$$
L^{\prime\prime}_N(\tilde{\theta}) \rightarrow^p - \mathcal{I}(\theta_0). \tag{15}
$$

We invoke Slutsky’s theorem, and we’re done:

$$
\sqrt{N}(\hat{\theta}_N - \theta_0) \rightarrow^d \mathcal{N}\left(0, \frac{1}{\mathcal{I}(\theta_0)} \right). \tag{16}
$$

As discussed in the introduction, asymptotic normality immediately implies

$$
\hat{\theta}_N \rightarrow^d \mathcal{N}(\theta_0, \mathcal{I}_N(\theta_0)^{-1}). \tag{17}
$$

As our finite sample size $N$ increases, the MLE becomes more concentrated; that is, its variance becomes smaller and smaller. In the limit, the MLE achieves the lowest possible variance, the Cramér–Rao lower bound.
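As a quick numerical sanity check (not part of the proof), here is a short simulation of the two ingredients above. I use an exponential model with density $f(x; \lambda) = \lambda e^{-\lambda x}$ purely for illustration: its per-observation score is $1/\lambda - x$, its second derivative is $-1/\lambda^2$, and so $\mathcal{I}(\lambda) = 1/\lambda^2$. The specific values of $\lambda_0$, $N$, and the number of replications below are arbitrary choices.

import numpy as np

# A sanity-check simulation (illustrative only). The exponential model
# f(x; lam) = lam * exp(-lam * x) has score 1/lam - x for one observation
# and second derivative -1/lam**2, so I(lam) = 1/lam**2.
rng  = np.random.default_rng(0)
lam0 = 2.0      # true rate, assumed for illustration
N    = 5000     # sample size per replication
reps = 2000     # number of replications

numerators   = []   # sqrt(N) * L_N'(lam0)
denominators = []   # L_N'' evaluated at the MLE (a stand-in for theta-tilde)
for _ in range(reps):
    X       = rng.exponential(scale=1/lam0, size=N)
    lam_hat = 1 / X.mean()          # MLE of the exponential rate
    numerators.append(np.sqrt(N) * (1/lam0 - X.mean()))
    denominators.append(-1 / lam_hat**2)

fisher = 1 / lam0**2
print(np.var(numerators), fisher)      # variance of the numerator ~ I(lam0)
print(np.mean(denominators), -fisher)  # denominator ~ -I(lam0)

Both printed pairs should agree up to Monte Carlo error.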

Example with Bernoulli distribution

Let’s look at a complete example. Let $X_1, \dots, X_N$ be i.i.d. samples from a Bernoulli distribution with true parameter $p$. The log likelihood is

$$
\begin{aligned}
\log f_X(X; p)
&= \sum_{n=1}^N \log \left[p^{X_n} (1-p)^{1 - X_n} \right]
\\
&= \sum_{n=1}^N \left[ X_n \log p + (1 - X_n) \log (1 - p) \right].
\end{aligned} \tag{18}
$$

This works because $X_n$ only has support $\{0, 1\}$. If we compute the derivative of this log likelihood, set it equal to zero, and solve for $p$, we’ll have $\hat{p}_N$, the MLE. First, let’s compute the derivative:

$$
\begin{aligned}
\frac{\partial}{\partial p} \log f_X(X; p)
&= \sum_{n=1}^N \left[ \frac{\partial}{\partial p} X_n \log p + \frac{\partial}{\partial p} (1 - X_n)\log (1 - p) \right]
\\
&= \sum_{n=1}^N \left[ \frac{X_n}{p} - \frac{1 - X_n}{1 - p} \right]
\\
&= \sum_{n=1}^N \left[ \frac{X_n}{p} + \frac{X_n - 1}{1 - p} \right].
\end{aligned} \tag{19}
$$

The negative sign is due to the chain rule when computing the derivative of $\log(1-p)$. Now let’s set it equal to zero and solve for $p$:

$$
\begin{aligned}
0 &= \sum_{n=1}^N \left[ \frac{X_n}{p} + \frac{X_n}{1 - p} \right] - \frac{N}{1 - p}
\\
\frac{N}{1 - p} &= \sum_{n=1}^N X_n \left[ \frac{1}{p} + \frac{1}{1 - p} \right]
\\
\frac{p(1 - p)}{1 - p} &= \frac{1}{N} \sum_{n=1}^N X_n.
\end{aligned} \tag{20}
$$

The terms $(1 - p)$ cancel, leaving us with the MLE:

$$
\hat{p}_N = \frac{1}{N} \sum_{n=1}^N X_n. \tag{21}
$$
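If you want to verify this numerically, here is a minimal sketch that maximizes the Bernoulli log likelihood of Equation 18 directly with scipy and compares the result to the sample mean; the values $p = 0.4$ and $N = 1000$ are arbitrary choices for illustration.

import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
X   = rng.binomial(1, 0.4, size=1000)   # p = 0.4, N = 1000, chosen for illustration

def neg_log_lik(p):
    # Negative of the Bernoulli log likelihood in Equation 18.
    return -np.sum(X * np.log(p) + (1 - X) * np.log(1 - p))

res = minimize_scalar(neg_log_lik, bounds=(1e-6, 1 - 1e-6), method='bounded')
print(res.x, X.mean())   # the two values should agree up to optimizer tolerance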

In other words, the MLE of the Bernoulli bias is just the average of the observations, which makes sense. The second derivative is the derivative of Equation 19, or

$$
\frac{\partial}{\partial p} \sum_{n=1}^N \left[ \frac{X_n}{p} + \frac{X_n - 1}{1 - p} \right] = \sum_{n=1}^N \left[ - \frac{X_n}{p^2} + \frac{X_n - 1}{(1 - p)^2} \right]. \tag{22}
$$

The Fisher information is the negative expected value of this second derivative or

$$
\begin{aligned}
\mathcal{I}_N(p)
&= -\mathbb{E}\left[ \sum_{n=1}^N \left[ - \frac{X_n}{p^2} + \frac{X_n - 1}{(1 - p)^2} \right] \right]
\\
&= \sum_{n=1}^N \left[ \frac{\mathbb{E}[X_n]}{p^2} - \frac{\mathbb{E}[X_n] - 1}{(1 - p)^2} \right]
\\
&= \sum_{n=1}^N \left[ \frac{1}{p} + \frac{1}{1 - p} \right]
\\
&= \frac{N}{p(1-p)}.
\end{aligned} \tag{23}
$$
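We can also check Equation 23 empirically: since the Fisher information is the variance of the score (Equation 11), simulating the score of Equation 19 at the true parameter and computing its variance should recover $N / (p(1-p))$. A minimal sketch, with $p = 0.4$ and $N = 50$ chosen arbitrarily:

import numpy as np

rng  = np.random.default_rng(0)
p, N = 0.4, 50   # chosen arbitrarily for illustration
scores = []
for _ in range(20000):
    X = rng.binomial(1, p, size=N)
    # Score from Equation 19, evaluated at the true parameter p.
    scores.append(np.sum(X / p + (X - 1) / (1 - p)))

print(np.var(scores), N / (p * (1 - p)))   # both should be roughly 208.3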

Thus, by the asymptotic normality of the MLE of the Bernoulli distribution (to be completely rigorous, we should show that the Bernoulli distribution meets the required regularity conditions), we know that

$$
\hat{p}_N \rightarrow^d \mathcal{N}\left(p, \frac{p(1-p)}{N}\right). \tag{24}
$$

We can empirically test this by drawing the probability density function of the above normal distribution, as well as a histogram of $\hat{p}_N$ over many iterations (Figure 1).

Figure 1. The probability density function of $\mathcal{N}(p, p(1-p)/N)$ (red), as well as a histogram of $\hat{p}_N$ (gray) over many experimental iterations. The true value of $p$ is $0.4$.

Here is the minimum code required to generate the above figure:

import numpy as np
from   scipy.stats import norm
import matplotlib.pyplot as plt


# Compute the asymptotically normal density from Equation 24.
N  = 1000
p0 = 0.4
xx = np.arange(0.3, 0.5, 0.001)
yy = norm.pdf(xx, p0, np.sqrt((p0 * (1-p0)) / N))

# Generate many random samples of size N and compute MLE.
mles = []
for _ in range(10000):
    X = np.random.binomial(1, p0, size=N)
    mles.append(X.mean())

# Plot the normal density (red) and a histogram of the MLEs (gray).
plt.plot(xx, yy, color='red')
plt.hist(mles, bins=20, density=True, color='gray')
plt.show()
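Re-running this script with a larger sample size, say N = 10000, produces a visibly tighter histogram around p0, consistent with the variance $p(1-p)/N$ in Equation 24 shrinking as $N$ grows.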

   

Acknowledgements

I relied on a few different excellent resources to write this post: