The Fisher Information

I document several properties of the Fisher information, or the variance of the derivative of the log likelihood.

The goal of this post is to enumerate and derive several key properties of the Fisher information, which quantifies how much information a random variable carries about its unknown generative parameters.

Let $X = (X_1, \dots, X_N)$ be a random sample from $\mathbb{P}_{\theta} \in \{\mathbb{P}_{\theta} : \theta \in \Theta\}$ with joint density $f_{\theta}(X)$. The log likelihood is

$$\mathcal{L}(\theta) = \log f_{\theta}(X). \tag{1}$$

Because the log likelihood is a function of the random sample $X$, we can think of it as a “random curve”; for the same reason, any point estimate $\hat{\theta}$ of $\theta$ is itself a random variable. The score, or the gradient of the log likelihood w.r.t. $\theta$ evaluated at a particular point, tells us how sensitive the log likelihood is to changes in the parameter value. The Fisher information is the variance of the score,

$$\mathcal{I}_N(\theta) = \mathbb{E}\left[\left( \frac{\partial}{\partial \theta} \log f_{\theta}(X) \right)^2\right] \stackrel{\star}{=} \mathbb{V}\left[\frac{\partial}{\partial \theta} \log f_{\theta}(X)\right]. \tag{2}$$

Step $\star$ holds because for any random variable $Z$, $\mathbb{V}[Z] = \mathbb{E}[Z^2] - \mathbb{E}[Z]^2$ and, as we will prove in a moment,

$$\mathbb{E}\left[\frac{\partial}{\partial \theta} \log f_{\theta}(X)\right] = 0, \tag{3}$$

under certain regularity conditions.

To quote this StackExchange answer, “The Fisher information determines how quickly the observed score function converges to the shape of the true score function.” The true score function depends on the unobserved population, while the observed score function is a random variable that depends on the random sample $X$. A bigger Fisher information means the score function is more dispersed, suggesting that $X$ carries more information about $\theta$ than it would if the Fisher information were smaller.
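To make Equations 2 and 3 concrete, here is a small simulation, a sketch of my own using NumPy and a $\text{Bernoulli}(\theta)$ model (neither is prescribed by anything above): the empirical mean of the score should be near zero, and its empirical variance should be near the closed-form Fisher information for a single Bernoulli observation, $1/(\theta(1-\theta))$.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 0.3  # true parameter of the assumed Bernoulli(theta) model

# Many independent single-observation samples X ~ Bernoulli(theta).
x = rng.binomial(1, theta, size=1_000_000)

# Score of one observation: d/d(theta) log f_theta(x) = x/theta - (1 - x)/(1 - theta).
score = x / theta - (1 - x) / (1 - theta)

print(score.mean())               # ~ 0, as in Equation 3
print(score.var())                # ~ 4.76, the Fisher information (Equation 2)
print(1 / (theta * (1 - theta)))  # closed-form value, 1 / (0.3 * 0.7) ≈ 4.76
```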

Properties

Expected score is zero

If we can swap integration and differentiation, then

$$\begin{aligned} \mathbb{E}\left[\frac{\partial}{\partial \theta} \log f_{\theta}(X)\right] &\stackrel{\star}{=} \int \left[ \frac{\frac{\partial}{\partial \theta} f_{\theta}(x)}{f_{\theta}(x)} \right] f_{\theta}(x) \,\text{d}x \\ &= \int \frac{\partial}{\partial \theta} f_{\theta}(x) \,\text{d}x \\ &= \frac{\partial}{\partial \theta} \int f_{\theta}(x) \,\text{d}x \\ &= \frac{\partial}{\partial \theta} (1) \\ &= 0. \end{aligned} \tag{4}$$

In step $\star$, we use the fact that

$$g(x) = \log f(x) \implies g^{\prime}(x) = \frac{f^{\prime}(x)}{f(x)}. \tag{5}$$
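As a concrete instance of Equation 4 (a standard example, not from the original text): let $X$ be a single observation from $\mathcal{N}(\theta, \sigma^2)$ with $\sigma^2$ known. Then $\log f_{\theta}(x) = -\tfrac{1}{2}\log(2\pi\sigma^2) - \frac{(x-\theta)^2}{2\sigma^2}$, so the score is $\frac{\partial}{\partial \theta} \log f_{\theta}(x) = \frac{x - \theta}{\sigma^2}$, and its expectation is $\frac{\mathbb{E}[X] - \theta}{\sigma^2} = 0$, as Equation 4 requires.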

Alternative definition

If $f_{\theta}$ is twice differentiable and if we can swap integration and differentiation, then $\mathcal{I}_N(\theta)$ can be equivalently written as

$$\mathcal{I}_N(\theta) = - \mathbb{E}\left[\frac{\partial^2}{\partial \theta^2} \log f_{\theta}(X) \right]. \tag{6}$$

To see this, first note that

$$\begin{aligned} \frac{\partial^2}{\partial \theta^2} \log f_{\theta}(X) &= \frac{\partial}{\partial \theta} \left( \frac{\partial}{\partial \theta} \log f_{\theta}(X) \right) \\ &= \frac{\partial}{\partial \theta} \left( \frac{\frac{\partial}{\partial \theta} f_{\theta}(X)}{f_{\theta}(X)} \right) \\ &\stackrel{\star}{=} \frac{f_{\theta}(X)\, \frac{\partial^2}{\partial \theta^2} f_{\theta}(X) - \frac{\partial}{\partial \theta} f_{\theta}(X)\, \frac{\partial}{\partial \theta} f_{\theta}(X)}{f_{\theta}(X)^2} \\ &= \frac{\frac{\partial^2}{\partial \theta^2} f_{\theta}(X)}{f_{\theta}(X)} - \left[\frac{\frac{\partial}{\partial \theta} f_{\theta}(X)}{f_{\theta}(X)}\right]^2. \end{aligned} \tag{7}$$

We use the quotient rule from calculus in step $\star$. Now notice that

$$\begin{aligned} \mathbb{E}\left[\frac{\frac{\partial^2}{\partial \theta^2} f_{\theta}(X)}{f_{\theta}(X)}\right] &= \int \left[\frac{\frac{\partial^2}{\partial \theta^2} f_{\theta}(x)}{f_{\theta}(x)}\right] f_{\theta}(x) \,\text{d}x \\ &= \int \frac{\partial^2}{\partial \theta^2} f_{\theta}(x) \,\text{d}x \\ &= \frac{\partial^2}{\partial \theta^2} \int f_{\theta}(x) \,\text{d}x \\ &= 0. \end{aligned} \tag{8}$$

Taking the expectation of both sides of Equation 7, the first term vanishes by Equation 8, and the second term is just $\mathbb{E}\left[\left(\frac{\partial}{\partial \theta} \log f_{\theta}(X)\right)^2\right] = \mathcal{I}_N(\theta)$. Hence $\mathbb{E}\left[\frac{\partial^2}{\partial \theta^2} \log f_{\theta}(X)\right] = -\mathcal{I}_N(\theta)$, which is exactly Equation 6.
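Equation 6 is also easy to check numerically. Below is another small sketch (again NumPy, this time with an $\text{Exponential}(\theta)$ rate model of my own choosing, density $f_{\theta}(x) = \theta e^{-\theta x}$): the variance of the score and the negative expected second derivative of the log likelihood should both be close to the closed-form value $1/\theta^2$.

```python
import numpy as np

rng = np.random.default_rng(1)
theta = 2.0  # rate of the assumed Exponential(theta) model

# Single-observation samples X ~ Exponential(rate = theta).
x = rng.exponential(scale=1 / theta, size=1_000_000)

# log f_theta(x) = log(theta) - theta * x
score = 1 / theta - x                          # first derivative w.r.t. theta
second_deriv = np.full_like(x, -1 / theta**2)  # second derivative w.r.t. theta (constant in x)

print(np.var(score))           # ~ 1 / theta^2 = 0.25  (variance of the score, Equation 2)
print(-np.mean(second_deriv))  # ~ 1 / theta^2 = 0.25  (alternative definition, Equation 6)
```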

Nonnegativity

Since $\mathcal{I}_N(\theta)$ is the expectation of the squared score, $\left(\frac{\partial}{\partial \theta} \log f_{\theta}(X)\right)^2 \geq 0$, we have $\mathcal{I}_N(\theta) \geq 0$. This should also be obvious since the Fisher information is a variance, and variance is nonnegative.

Reformulation for i.i.d. settings

If the $X_n$ are i.i.d., then

$$\begin{aligned} \mathcal{I}_N(\theta) &= - \mathbb{E}\left[\frac{\partial^2}{\partial \theta^2} \log f_{\theta}(X)\right] \\ &= - \mathbb{E}\left[\frac{\partial^2}{\partial \theta^2} \sum_{n=1}^{N} \log f_{\theta}(X_n)\right] \\ &= - \sum_{n=1}^{N} \mathbb{E}\left[\frac{\partial^2}{\partial \theta^2} \log f_{\theta}(X_n)\right] \\ &= - N\, \mathbb{E}\left[\frac{\partial^2}{\partial \theta^2} \log f_{\theta}(X_n)\right]. \end{aligned} \tag{9}$$

Above, we abuse notation a bit by writing both the joint density and the marginal density of a single observation with the same symbol $f_{\theta}$. We can distinguish the Fisher information for a single sample as $\mathcal{I}(\theta)$ or

$$\mathcal{I}(\theta) = -\mathbb{E}\left[\frac{\partial^2}{\partial \theta^2} \log f_{\theta}(X_n)\right]. \tag{10}$$

This means

$$\mathcal{I}_N(\theta) = N \mathcal{I}(\theta). \tag{11}$$
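As a worked example (the standard Bernoulli case, not from the text above): for i.i.d. $X_n \sim \text{Bernoulli}(\theta)$, $\log f_{\theta}(X_n) = X_n \log \theta + (1 - X_n) \log(1 - \theta)$, so

$$\mathcal{I}(\theta) = -\mathbb{E}\left[-\frac{X_n}{\theta^2} - \frac{1 - X_n}{(1-\theta)^2}\right] = \frac{1}{\theta} + \frac{1}{1-\theta} = \frac{1}{\theta(1-\theta)},$$

and Equation 11 gives $\mathcal{I}_N(\theta) = N / (\theta(1-\theta))$, which matches the single-observation simulation above when $N = 1$.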

Reparameterization

Let $\eta = \psi(\theta)$ be a reparameterization, where $\psi$ is invertible and differentiable, and let $g(x; \eta) = f(x; \psi^{-1}(\eta))$ denote the reparameterized density. Let $\mathcal{I}_g(\cdot)$ and $\mathcal{I}_f(\cdot)$ denote the Fisher information under the two respective densities. Then

$$\begin{aligned} \mathcal{I}_{g}(\eta) &= \mathbb{E}_g\left[\left(\frac{\partial}{\partial \eta} \log g(x; \eta)\right)^2\right] \\ &= \mathbb{E}_g\left[\left(\frac{\partial}{\partial \eta} \log f(x; \psi^{-1}(\eta))\right)^2\right] \\ &= \mathbb{E}_g\left[\left(\frac{\partial}{\partial \theta} \log f(x; \psi^{-1}(\eta))\, \frac{\partial}{\partial \eta} \psi^{-1}(\eta) \right)^2\right] \\ &= \mathbb{E}_g\left[\left(\frac{\partial}{\partial \theta} \log f(x; \psi^{-1}(\eta))\right)^2 \right] \left(\frac{\partial}{\partial \eta} \psi^{-1}(\eta) \right)^2 \\ &= \mathcal{I}_f(\psi^{-1}(\eta)) \left(\frac{\partial}{\partial \eta} \psi^{-1}(\eta) \right)^2 \\ &= \mathcal{I}_f(\psi^{-1}(\eta)) \left(\frac{1}{\psi^{\prime}(\psi^{-1}(\eta))}\right)^2. \end{aligned} \tag{12}$$

The main idea is to apply the chain rule. The last step uses the inverse function theorem from calculus, $\frac{\partial}{\partial \eta} \psi^{-1}(\eta) = 1 / \psi^{\prime}(\psi^{-1}(\eta))$, where $\psi^{\prime}$ denotes the derivative of $\psi$ w.r.t. $\theta$. In words, if a density is reparameterized by $\psi$, the new Fisher information is the old Fisher information times the squared derivative of $\psi^{-1}$ w.r.t. $\eta$.
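To make this concrete, continue the Bernoulli example with the log odds $\eta = \psi(\theta) = \log\frac{\theta}{1-\theta}$ (my choice of reparameterization, not from the text). Then $\theta = \psi^{-1}(\eta) = 1/(1 + e^{-\eta})$ and $\frac{\partial}{\partial \eta}\psi^{-1}(\eta) = \theta(1-\theta)$, so Equation 12 gives

$$\mathcal{I}_g(\eta) = \frac{1}{\theta(1-\theta)} \big(\theta(1-\theta)\big)^2 = \theta(1-\theta),$$

which is largest when $\theta = 1/2$, i.e. when $\eta = 0$.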