The Fisher Information

I document several properties of the Fisher information, or the variance of the derivative of the log likelihood.

The goal of this post is to enumerate and derive several key properties of the Fisher information, which quantifies how much information a random variable carries about its unknown generative parameters.

Let $X_1, \dots, X_n$ be a random sample from $f(x; \theta)$ with joint density $f(\mathbf{x}; \theta)$, where $\mathbf{x} = (x_1, \dots, x_n)$. The log likelihood is

$$\ell(\theta \mid \mathbf{x}) = \log f(\mathbf{x}; \theta).$$

Since any point estimate $\hat{\theta}$ of $\theta$ is itself a random variable (the point estimate is a function of the random sample $\mathbf{X}$), we can think of the log likelihood as a "random curve". The score, or gradient of the log likelihood w.r.t. $\theta$ evaluated at a particular point,

$$s(\theta) = \nabla_\theta \ell(\theta \mid \mathbf{x}),$$

tells us how sensitive the log likelihood is to changes in parameter values. The Fisher information is the variance of the score,

$$\mathcal{I}(\theta) = \mathbb{V}[s(\theta)] \stackrel{\star}{=} \mathbb{E}\!\left[s(\theta)^2\right].$$

Step $\star$ holds because $\mathbb{V}[Y] = \mathbb{E}[Y^2] - \mathbb{E}[Y]^2$ for any random variable $Y$, and, as we will prove in a moment,

$$\mathbb{E}[s(\theta)] = 0$$

under certain regularity conditions.
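To make the definition concrete, here is a small Monte Carlo sketch. The Bernoulli model and all variable names are my own illustration: for $X \sim \text{Bernoulli}(p)$, the score of a single observation is $s(p) = x/p - (1-x)/(1-p)$, and the Fisher information is known in closed form, $\mathcal{I}(p) = 1/(p(1-p))$.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.3
x = rng.binomial(1, p, size=1_000_000)

# Score of each observation: d/dp log f(x; p) for the Bernoulli likelihood.
score = x / p - (1 - x) / (1 - p)

# The sample variance of the score approximates the Fisher information,
# which here should be close to 1 / (p * (1 - p)) ≈ 4.76.
print(score.mean())
print(score.var())
```

The empirical mean of the score also lands near zero, previewing the first property derived below.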

To quote this StackExchange answer, "The Fisher information determines how quickly the observed score function converges to the shape of the true score function." The true score function depends on the unobserved population, while the observed score function is a random variable that depends on the random sample $\mathbf{X}$. A bigger Fisher information means the score function is more dispersed, suggesting that $\mathbf{X}$ carries more information about $\theta$ than if the Fisher information were smaller.

Properties

Expected score is zero

If we can swap integration and differentiation, then

$$\mathbb{E}[s(\theta)] = \int \nabla_\theta \log f(\mathbf{x}; \theta)\, f(\mathbf{x}; \theta)\, \mathrm{d}\mathbf{x} \stackrel{\star}{=} \int \nabla_\theta f(\mathbf{x}; \theta)\, \mathrm{d}\mathbf{x} = \nabla_\theta \int f(\mathbf{x}; \theta)\, \mathrm{d}\mathbf{x} = \nabla_\theta 1 = 0.$$

In step $\star$, we use the fact that

$$\nabla_\theta \log f(\mathbf{x}; \theta) = \frac{\nabla_\theta f(\mathbf{x}; \theta)}{f(\mathbf{x}; \theta)}.$$
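A quick numeric check of the zero-mean property, using a Gaussian model of my own choosing: for $X \sim \mathcal{N}(\mu, \sigma^2)$, the score w.r.t. $\mu$ is $(x - \mu)/\sigma^2$, and its sample mean should be near zero when evaluated at the true $\mu$.

```python
import numpy as np

rng = np.random.default_rng(3)
mu, sigma = 2.0, 1.5
x = rng.normal(mu, sigma, size=1_000_000)

# Score w.r.t. mu, evaluated at the true parameter value.
score = (x - mu) / sigma**2

# Should be close to 0, as the derivation above predicts.
print(score.mean())
```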

Alternative definition

If $\log f(\mathbf{x}; \theta)$ is twice differentiable and if we can swap integration and differentiation, then $\mathcal{I}(\theta)$ can be equivalently written as

$$\mathcal{I}(\theta) = -\mathbb{E}\!\left[\nabla_\theta^2 \log f(\mathbf{x}; \theta)\right].$$

To see this, first note that

$$\nabla_\theta^2 \log f(\mathbf{x}; \theta) = \nabla_\theta \left( \frac{\nabla_\theta f(\mathbf{x}; \theta)}{f(\mathbf{x}; \theta)} \right) \stackrel{\star}{=} \frac{\nabla_\theta^2 f(\mathbf{x}; \theta)}{f(\mathbf{x}; \theta)} - \left( \frac{\nabla_\theta f(\mathbf{x}; \theta)}{f(\mathbf{x}; \theta)} \right)^2.$$

We use the quotient rule from calculus in step $\star$. Now notice that

$$\mathbb{E}\!\left[\frac{\nabla_\theta^2 f(\mathbf{x}; \theta)}{f(\mathbf{x}; \theta)}\right] = \int \nabla_\theta^2 f(\mathbf{x}; \theta)\, \mathrm{d}\mathbf{x} = \nabla_\theta^2 \int f(\mathbf{x}; \theta)\, \mathrm{d}\mathbf{x} = \nabla_\theta^2 1 = 0.$$

Taking the expectation of both sides of the previous display therefore gives

$$-\mathbb{E}\!\left[\nabla_\theta^2 \log f(\mathbf{x}; \theta)\right] = \mathbb{E}\!\left[\left( \frac{\nabla_\theta f(\mathbf{x}; \theta)}{f(\mathbf{x}; \theta)} \right)^2\right] = \mathbb{E}\!\left[s(\theta)^2\right] = \mathcal{I}(\theta).$$
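We can also verify the identity numerically. Returning to my illustrative Bernoulli example, the variance of the score should match the negative expected second derivative of the log likelihood.

```python
import numpy as np

rng = np.random.default_rng(4)
p = 0.3
x = rng.binomial(1, p, size=1_000_000)

# First and second derivatives of log f(x; p) w.r.t. p.
score = x / p - (1 - x) / (1 - p)
second_deriv = -x / p**2 - (1 - x) / (1 - p)**2

# Both quantities estimate the Fisher information 1 / (p(1-p)) ≈ 4.76.
print(score.var())
print(-second_deriv.mean())
```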

Nonnegativity

Since $\mathcal{I}(\theta) = \mathbb{E}[s(\theta)^2]$ and $s(\theta)^2 \geq 0$, we have $\mathcal{I}(\theta) \geq 0$. This should also be obvious since variance is nonnegative.

Reformulation for i.i.d. settings

If $X_1, \dots, X_n$ are i.i.d., then

$$\mathcal{I}_n(\theta) = \mathbb{V}\!\left[\nabla_\theta \log f(\mathbf{x}; \theta)\right] = \mathbb{V}\!\left[\sum_{i=1}^n \nabla_\theta \log f(x_i; \theta)\right] = \sum_{i=1}^n \mathbb{V}\!\left[\nabla_\theta \log f(x_i; \theta)\right].$$

Above, we abuse notation a bit by writing both the joint density and the marginal densities as $f$. We can distinguish the Fisher information for a single sample as $\mathcal{I}_1(\theta)$, or

$$\mathcal{I}_1(\theta) = \mathbb{V}\!\left[\nabla_\theta \log f(x_1; \theta)\right].$$

This means

$$\mathcal{I}_n(\theta) = n \mathcal{I}_1(\theta).$$
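The additivity is easy to see by simulation. Sticking with my illustrative Bernoulli model, the score of an i.i.d. sample of size $n$ is the sum of the per-observation scores, so its variance across many replications should be close to $n \mathcal{I}_1(p) = n / (p(1-p))$.

```python
import numpy as np

rng = np.random.default_rng(1)
p, n, reps = 0.3, 10, 200_000

# `reps` independent samples, each of size n.
x = rng.binomial(1, p, size=(reps, n))

# Joint score = sum of per-observation scores within each sample.
joint_score = (x / p - (1 - x) / (1 - p)).sum(axis=1)

# Should be close to n / (p(1-p)) ≈ 47.6, i.e. n times I_1(p).
print(joint_score.var())
```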

Reparameterization

Let $\eta = g(\theta)$ be a reparameterization, where $g$ is invertible and differentiable, and let $\tilde{f}(\mathbf{x}; \eta)$ denote the reparameterized density. Let $\mathcal{I}_\theta(\theta)$ and $\mathcal{I}_\eta(\eta)$ denote the Fisher information under the two respective densities. Then

$$\mathcal{I}_\eta(\eta) = \mathcal{I}_\theta\!\left(g^{-1}(\eta)\right) \left( \frac{\mathrm{d}}{\mathrm{d}\eta} g^{-1}(\eta) \right)^2.$$

The main idea is to apply the chain rule:

$$\nabla_\eta \log \tilde{f}(\mathbf{x}; \eta) = \nabla_\theta \log f(\mathbf{x}; \theta)\, \frac{\mathrm{d}\theta}{\mathrm{d}\eta} = \nabla_\theta \log f(\mathbf{x}; \theta)\, \frac{1}{g'(\theta)},$$

and then take the variance of both sides. The last step uses a property of derivatives of inverse functions from calculus, $\frac{\mathrm{d}}{\mathrm{d}\eta} g^{-1}(\eta) = 1 / g'(\theta)$. In words, if a density is reparameterized by $\eta = g(\theta)$, the new Fisher information is the old Fisher information times a function of the gradient of $g^{-1}(\eta)$ w.r.t. $\eta$.
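As a final sketch, consider an illustrative example of my own: the Bernoulli model reparameterized by the log-odds $\eta = g(p) = \log\frac{p}{1-p}$. Since $\frac{\mathrm{d}p}{\mathrm{d}\eta} = p(1-p)$, the formula predicts $\mathcal{I}_\eta(\eta) = \frac{1}{p(1-p)} \cdot \big(p(1-p)\big)^2 = p(1-p)$, which we can check directly.

```python
import numpy as np

rng = np.random.default_rng(2)
p = 0.3
x = rng.binomial(1, p, size=1_000_000)

# In the log-odds parameterization, log f(x; eta) = x*eta - log(1 + e^eta),
# so the score w.r.t. eta simplifies to x - p.
score_eta = x - p

# Should be close to p(1-p) = 0.21, matching the reparameterization formula.
print(score_eta.var())
```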