The goal of this post is to enumerate and derive several key properties of the Fisher information, which quantifies how much information a random variable carries about its unknown generative parameters.
Let $X = (X_1, \dots, X_N)$ be a random sample from $P_{\theta} \in \{P_{\theta} : \theta \in \Theta\}$ with joint density $f_{\theta}(X)$. The log likelihood is

$$\mathcal{L}(\theta) = \log f_{\theta}(X). \tag{1}$$
Since the log likelihood is a function of the random sample $X$ (just as any point estimate $\hat{\theta}$ of $\theta$ is), we can think of it as a “random curve”. The score, the gradient of the log likelihood w.r.t. $\theta$ evaluated at a particular point, tells us how sensitive the log likelihood is to changes in the parameter values. The Fisher information is the variance of the score,
$$I_N(\theta) = \mathbb{E}\left[\left(\frac{\partial}{\partial \theta} \log f_{\theta}(X)\right)^2\right] \stackrel{\star}{=} \mathbb{V}\left[\frac{\partial}{\partial \theta} \log f_{\theta}(X)\right]. \tag{2}$$
Step ⋆ holds because for any random variable $Z$, $\mathbb{V}[Z] = \mathbb{E}[Z^2] - \mathbb{E}[Z]^2$ and, as we will prove in a moment,

$$\mathbb{E}\left[\frac{\partial}{\partial \theta} \log f_{\theta}(X)\right] = 0, \tag{3}$$
under certain regularity conditions.
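To make Equation 2 concrete, consider a simple example (my choice of model, not one used in the original post): a single observation $X \sim \text{Bernoulli}(\theta)$ with density $f_{\theta}(x) = \theta^x (1 - \theta)^{1 - x}$. The score is

$$\frac{\partial}{\partial \theta} \log f_{\theta}(X) = \frac{X}{\theta} - \frac{1 - X}{1 - \theta} = \frac{X - \theta}{\theta(1 - \theta)},$$

which has mean zero (consistent with Equation 3), and its variance, the Fisher information, is

$$\mathbb{V}\left[\frac{X - \theta}{\theta(1 - \theta)}\right] = \frac{\mathbb{V}[X]}{\theta^2 (1 - \theta)^2} = \frac{1}{\theta(1 - \theta)}.$$

The Fisher information is large when $\theta$ is near $0$ or $1$, where a single draw is highly informative about $\theta$.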
To quote this StackExchange answer, “The Fisher information determines how quickly the observed score function converges to the shape of the true score function.” The true score function depends on the unobserved population, while the observed score function is a random variable that depends on the random sample $X$. A bigger Fisher information means the score function is more dispersed, suggesting that the sample $X$ carries more information about $\theta$ than if the Fisher information were smaller.
Properties
Expected score is zero
If we can swap integration and differentiation, then
$$\mathbb{E}\left[\frac{\partial}{\partial \theta} \log p(X; \theta)\right] \stackrel{\star}{=} \int \left[\frac{\frac{\partial}{\partial \theta} p(x; \theta)}{p(x; \theta)}\right] p(x; \theta) \,\mathrm{d}x = \int \frac{\partial}{\partial \theta} p(x; \theta) \,\mathrm{d}x = \frac{\partial}{\partial \theta} \int p(x; \theta) \,\mathrm{d}x = 0. \tag{4}$$
In step ⋆, we use the fact that
$$g(x) = \log f(x) \implies g'(x) = \frac{f'(x)}{f(x)}. \tag{5}$$
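For a quick numerical sanity check of Equations 3 and 4, here is a minimal simulation sketch (it assumes NumPy and reuses the Bernoulli example from above; neither is part of the original post):

```python
import numpy as np

# Empirically check that the expected score is zero (Equation 3), using the
# Bernoulli(theta) example: the score of one observation x is
# x / theta - (1 - x) / (1 - theta).
rng = np.random.default_rng(0)
theta = 0.3
x = rng.binomial(1, theta, size=1_000_000)
score = x / theta - (1 - x) / (1 - theta)

print(score.mean())  # close to 0
print(score.var())   # close to 1 / (theta * (1 - theta)) ~ 4.76
```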
Alternative definition
If $f_{\theta}$ is twice differentiable and if we can swap integration and differentiation, then $I_N(\theta)$ can be equivalently written as

$$I_N(\theta) = -\mathbb{E}\left[\frac{\partial^2}{\partial \theta^2} \log f_{\theta}(X)\right]. \tag{6}$$
To see this, first note that
$$\begin{aligned}
\frac{\partial^2}{\partial \theta^2} \log f_{\theta}(X) &= \frac{\partial}{\partial \theta} \left( \frac{\partial}{\partial \theta} \log f_{\theta}(X) \right)
\\
&= \frac{\partial}{\partial \theta} \left( \frac{\frac{\partial}{\partial \theta} f_{\theta}(X)}{f_{\theta}(X)} \right)
\\
&\stackrel{\star}{=} \frac{f_{\theta}(X) \frac{\partial^2}{\partial \theta^2} f_{\theta}(X) - \frac{\partial}{\partial \theta} f_{\theta}(X) \frac{\partial}{\partial \theta} f_{\theta}(X)}{f_{\theta}(X)^2}
\\
&= \frac{\frac{\partial^2}{\partial \theta^2} f_{\theta}(X)}{f_{\theta}(X)} - \left[ \frac{\frac{\partial}{\partial \theta} f_{\theta}(X)}{f_{\theta}(X)} \right]^2
\end{aligned} \tag{7}$$
We use the quotient rule from calculus in step ⋆. Now notice that
$$\mathbb{E}\left[\frac{\frac{\partial^2}{\partial \theta^2} f_{\theta}(X)}{f_{\theta}(X)}\right] = \int \left[\frac{\frac{\partial^2}{\partial \theta^2} f_{\theta}(x)}{f_{\theta}(x)}\right] f_{\theta}(x) \,\mathrm{d}x = \int \frac{\partial^2}{\partial \theta^2} f_{\theta}(x) \,\mathrm{d}x = \frac{\partial^2}{\partial \theta^2} \int f_{\theta}(x) \,\mathrm{d}x = 0. \tag{8}$$

Taking the expectation of Equation 7, the first term vanishes by Equation 8, while the expectation of the second term is exactly Equation 2. Hence $\mathbb{E}[\frac{\partial^2}{\partial \theta^2} \log f_{\theta}(X)] = -I_N(\theta)$, which is Equation 6.
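For the running Bernoulli example (again, my illustration rather than the post's), $\log f_{\theta}(x) = x \log \theta + (1 - x) \log(1 - \theta)$, so

$$-\mathbb{E}\left[\frac{\partial^2}{\partial \theta^2} \log f_{\theta}(X)\right] = -\mathbb{E}\left[-\frac{X}{\theta^2} - \frac{1 - X}{(1 - \theta)^2}\right] = \frac{1}{\theta} + \frac{1}{1 - \theta} = \frac{1}{\theta(1 - \theta)},$$

matching the variance of the score computed earlier.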
Nonnegativity
Since $\left[\frac{\partial}{\partial \theta} \log f_{\theta}(X)\right]^2 \geq 0$, we have $I_N(\theta) \geq 0$. This should also be obvious since variance is nonnegative.
If the $X_n$ are i.i.d., then

$$\begin{aligned}
I_N(\theta) &= -\mathbb{E}\left[\frac{\partial^2}{\partial \theta^2} \log f_{\theta}(X)\right]
\\
&= -\mathbb{E}\left[\frac{\partial^2}{\partial \theta^2} \sum_{n=1}^{N} \log f_{\theta}(X_n)\right]
\\
&= -\sum_{n=1}^{N} \mathbb{E}\left[\frac{\partial^2}{\partial \theta^2} \log f_{\theta}(X_n)\right]
\\
&= -N \, \mathbb{E}\left[\frac{\partial^2}{\partial \theta^2} \log f_{\theta}(X_n)\right].
\end{aligned} \tag{9}$$
Above, we abuse notation a bit by using the same symbol $f_{\theta}$ for both the joint density of $X$ and the marginal density of a single $X_n$. We can distinguish the Fisher information for a single sample by writing it as $I(\theta)$, where

$$I(\theta) = -\mathbb{E}\left[\frac{\partial^2}{\partial \theta^2} \log f_{\theta}(X_n)\right]. \tag{10}$$
This means
$$I_N(\theta) = N I(\theta). \tag{11}$$
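Here is a small simulation sketch of Equation 11, again with the Bernoulli example (assumptions mine): summing the per-observation scores gives the score of the joint sample, and its variance should be roughly $N / (\theta(1 - \theta))$.

```python
import numpy as np

# Monte Carlo sketch of Equation 11 with N i.i.d. Bernoulli(theta) observations:
# the score of the joint sample is the sum of per-observation scores, and its
# variance across replications should be close to N * I(theta) = N / (theta*(1-theta)).
rng = np.random.default_rng(0)
theta, N, reps = 0.3, 50, 200_000

x = rng.binomial(1, theta, size=(reps, N))
joint_score = (x / theta - (1 - x) / (1 - theta)).sum(axis=1)

print(joint_score.var())          # empirical I_N(theta)
print(N / (theta * (1 - theta)))  # N * I(theta) ~ 238.1
```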
Reparameterization
Let $\eta = \psi(\theta)$ be a reparameterization, where $\psi$ is invertible and differentiable, and let $g(x; \eta) = f(x; \psi^{-1}(\eta))$ denote the reparameterized density. Let $I_g(\cdot)$ and $I_f(\cdot)$ denote the Fisher information under the two respective densities. Then

$$\begin{aligned}
I_g(\eta) &= \mathbb{E}_g\left[\left(\frac{\partial}{\partial \eta} \log g(x; \eta)\right)^2\right]
\\
&= \mathbb{E}_g\left[\left(\frac{\partial}{\partial \eta} \log f(x; \psi^{-1}(\eta))\right)^2\right]
\\
&= \mathbb{E}_g\left[\left(\frac{\partial}{\partial \theta} \log f(x; \psi^{-1}(\eta)) \, \frac{\partial}{\partial \eta} \psi^{-1}(\eta)\right)^2\right]
\\
&= \mathbb{E}_g\left[\left(\frac{\partial}{\partial \theta} \log f(x; \psi^{-1}(\eta))\right)^2\right] \left(\frac{\partial}{\partial \eta} \psi^{-1}(\eta)\right)^2
\\
&= I_f(\psi^{-1}(\eta)) \left(\frac{\partial}{\partial \eta} \psi^{-1}(\eta)\right)^2
\\
&= I_f(\psi^{-1}(\eta)) \left(\frac{1}{\psi'(\psi^{-1}(\eta))}\right)^2
\end{aligned} \tag{12}$$
The main idea is to apply the chain rule. The last step uses the formula for the derivative of an inverse function from calculus. In words, if a density is reparameterized by $\psi$, the new Fisher information is the old Fisher information (evaluated at $\psi^{-1}(\eta)$) times the squared derivative of $\psi^{-1}$ w.r.t. $\eta$.
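As a concrete instance (my example again, not the post's), take the Bernoulli model from earlier and reparameterize by the log-odds, $\eta = \psi(\theta) = \log \frac{\theta}{1 - \theta}$, so that $\theta = \psi^{-1}(\eta) = 1 / (1 + e^{-\eta})$. Since $\psi'(\theta) = \frac{1}{\theta(1 - \theta)}$ and $I_f(\theta) = \frac{1}{\theta(1 - \theta)}$, Equation 12 gives

$$I_g(\eta) = \frac{1}{\theta(1 - \theta)} \left(\theta(1 - \theta)\right)^2 = \theta(1 - \theta), \qquad \theta = \psi^{-1}(\eta),$$

so in the log-odds parameterization the Fisher information is largest at $\eta = 0$ (i.e. $\theta = 1/2$) rather than near the boundary.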