The goal of this post is to enumerate and derive several key properties of the Fisher information, which quantifies how much information a random variable carries about its unknown generative parameters.
Let $X = (X_1, \dots, X_N)$ be a random sample from $P_{\theta} \in \{P_{\theta} : \theta \in \Theta\}$ with joint density $f_{\theta}(X)$. The log likelihood is

$$\mathcal{L}(\theta) = \log f_{\theta}(X). \tag{1}$$
Since the log likelihood is a function of the random sample $X$ (just as any point estimate $\hat{\theta}$ of $\theta$ is), we can think of it as a “random curve”. The score, the gradient of the log likelihood w.r.t. $\theta$ evaluated at a particular point, tells us how sensitive the log likelihood is to changes in the parameter values. The Fisher information is the variance of the score,
$$I_N(\theta) = \mathbb{E}\left[\left(\frac{\partial}{\partial \theta} \log f_{\theta}(X)\right)^2\right] \stackrel{\star}{=} \mathbb{V}\left[\frac{\partial}{\partial \theta} \log f_{\theta}(X)\right]. \tag{2}$$
Step ⋆ holds because for any random variable $Z$, $\mathbb{V}[Z] = \mathbb{E}[Z^2] - \mathbb{E}[Z]^2$ and, as we will prove in a moment,

$$\mathbb{E}\left[\frac{\partial}{\partial \theta} \log f_{\theta}(X)\right] = 0, \tag{3}$$
under certain regularity conditions.
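To make Equation 2 concrete, consider a simple example (my choice of model, not one used in the original post): a single observation $X \sim \text{Bernoulli}(\theta)$ with density $f_{\theta}(x) = \theta^x (1 - \theta)^{1 - x}$. The score is

$$\frac{\partial}{\partial \theta} \log f_{\theta}(X) = \frac{X}{\theta} - \frac{1 - X}{1 - \theta} = \frac{X - \theta}{\theta(1 - \theta)},$$

which has mean zero (consistent with Equation 3), and its variance, the Fisher information, is

$$\mathbb{V}\left[\frac{X - \theta}{\theta(1 - \theta)}\right] = \frac{\mathbb{V}[X]}{\theta^2 (1 - \theta)^2} = \frac{1}{\theta(1 - \theta)}.$$

The Fisher information is large when $\theta$ is near $0$ or $1$, where a single draw is highly informative about $\theta$.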
To quote this StackExchange answer, “The Fisher information determines how quickly the observed score function converges to the shape of the true score function.” The true score function depends on the unobserved population, while the observed score function is a random variable that depends on the random sample $X$. A bigger Fisher information means the score function is more dispersed, suggesting that the sample $X$ carries more information about $\theta$ than if the Fisher information were smaller.
Properties
Expected score is zero
If we can swap integration and differentiation, then
$$\mathbb{E}\left[\frac{\partial}{\partial \theta} \log p(X; \theta)\right] \stackrel{\star}{=} \int \left[\frac{\frac{\partial}{\partial \theta} p(x; \theta)}{p(x; \theta)}\right] p(x; \theta) \,\mathrm{d}x = \int \frac{\partial}{\partial \theta} p(x; \theta) \,\mathrm{d}x = \frac{\partial}{\partial \theta} \int p(x; \theta) \,\mathrm{d}x = 0. \tag{4}$$
In step ⋆, we use the fact that
$$g(x) = \log f(x) \implies g'(x) = \frac{f'(x)}{f(x)}. \tag{5}$$
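For a quick numerical sanity check of Equations 3 and 4, here is a minimal simulation sketch (it assumes NumPy and reuses the Bernoulli example from above; neither is part of the original post):

```python
import numpy as np

# Empirically check that the expected score is zero (Equation 3), using the
# Bernoulli(theta) example: the score of one observation x is
# x / theta - (1 - x) / (1 - theta).
rng = np.random.default_rng(0)
theta = 0.3
x = rng.binomial(1, theta, size=1_000_000)
score = x / theta - (1 - x) / (1 - theta)

print(score.mean())  # close to 0
print(score.var())   # close to 1 / (theta * (1 - theta)) ~ 4.76
```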
Alternative definition
If $f_{\theta}$ is twice differentiable and if we can swap integration and differentiation, then $I_N(\theta)$ can be equivalently written as

$$I_N(\theta) = -\mathbb{E}\left[\frac{\partial^2}{\partial \theta^2} \log f_{\theta}(X)\right]. \tag{6}$$
To see this, first note that
$$\begin{aligned}
\frac{\partial^2}{\partial \theta^2} \log f_{\theta}(X) &= \frac{\partial}{\partial \theta} \left( \frac{\partial}{\partial \theta} \log f_{\theta}(X) \right)
\\
&= \frac{\partial}{\partial \theta} \left( \frac{\frac{\partial}{\partial \theta} f_{\theta}(X)}{f_{\theta}(X)} \right)
\\
&\stackrel{\star}{=} \frac{f_{\theta}(X) \frac{\partial^2}{\partial \theta^2} f_{\theta}(X) - \frac{\partial}{\partial \theta} f_{\theta}(X) \frac{\partial}{\partial \theta} f_{\theta}(X)}{f_{\theta}(X)^2}
\\
&= \frac{\frac{\partial^2}{\partial \theta^2} f_{\theta}(X)}{f_{\theta}(X)} - \left[ \frac{\frac{\partial}{\partial \theta} f_{\theta}(X)}{f_{\theta}(X)} \right]^2
\end{aligned} \tag{7}$$
We use the quotient rule from calculus in step ⋆. Now notice that
$$\mathbb{E}\left[\frac{\frac{\partial^2}{\partial \theta^2} f_{\theta}(X)}{f_{\theta}(X)}\right] = \int \left[\frac{\frac{\partial^2}{\partial \theta^2} f_{\theta}(x)}{f_{\theta}(x)}\right] f_{\theta}(x) \,\mathrm{d}x = \int \frac{\partial^2}{\partial \theta^2} f_{\theta}(x) \,\mathrm{d}x = \frac{\partial^2}{\partial \theta^2} \int f_{\theta}(x) \,\mathrm{d}x = 0. \tag{8}$$

Taking the expectation of Equation 7, the first term vanishes by Equation 8, while the expectation of the second term is exactly Equation 2. Hence $\mathbb{E}[\frac{\partial^2}{\partial \theta^2} \log f_{\theta}(X)] = -I_N(\theta)$, which is Equation 6.
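For the running Bernoulli example (again, my illustration rather than the post's), $\log f_{\theta}(x) = x \log \theta + (1 - x) \log(1 - \theta)$, so

$$-\mathbb{E}\left[\frac{\partial^2}{\partial \theta^2} \log f_{\theta}(X)\right] = -\mathbb{E}\left[-\frac{X}{\theta^2} - \frac{1 - X}{(1 - \theta)^2}\right] = \frac{1}{\theta} + \frac{1}{1 - \theta} = \frac{1}{\theta(1 - \theta)},$$

matching the variance of the score computed earlier.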
Nonnegativity
Since $\left[\frac{\partial}{\partial \theta} \log f_{\theta}(X)\right]^2 \geq 0$, we have $I_N(\theta) \geq 0$. This should also be obvious since variance is nonnegative.
If the $X_n$ are i.i.d., then

$$\begin{aligned}
I_N(\theta) &= -\mathbb{E}\left[\frac{\partial^2}{\partial \theta^2} \log f_{\theta}(X)\right]
\\
&= -\mathbb{E}\left[\frac{\partial^2}{\partial \theta^2} \sum_{n=1}^{N} \log f_{\theta}(X_n)\right]
\\
&= -\sum_{n=1}^{N} \mathbb{E}\left[\frac{\partial^2}{\partial \theta^2} \log f_{\theta}(X_n)\right]
\\
&= -N \, \mathbb{E}\left[\frac{\partial^2}{\partial \theta^2} \log f_{\theta}(X_n)\right].
\end{aligned} \tag{9}$$
Above, we abuse notation a bit by using the same symbol $f_{\theta}$ for both the joint density of $X$ and the marginal density of a single $X_n$. We can distinguish the Fisher information for a single sample by writing it as $I(\theta)$, where

$$I(\theta) = -\mathbb{E}\left[\frac{\partial^2}{\partial \theta^2} \log f_{\theta}(X_n)\right]. \tag{10}$$
This means
$$I_N(\theta) = N I(\theta). \tag{11}$$
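Here is a small simulation sketch of Equation 11, again with the Bernoulli example (assumptions mine): summing the per-observation scores gives the score of the joint sample, and its variance should be roughly $N / (\theta(1 - \theta))$.

```python
import numpy as np

# Monte Carlo sketch of Equation 11 with N i.i.d. Bernoulli(theta) observations:
# the score of the joint sample is the sum of per-observation scores, and its
# variance across replications should be close to N * I(theta) = N / (theta*(1-theta)).
rng = np.random.default_rng(0)
theta, N, reps = 0.3, 50, 200_000

x = rng.binomial(1, theta, size=(reps, N))
joint_score = (x / theta - (1 - x) / (1 - theta)).sum(axis=1)

print(joint_score.var())          # empirical I_N(theta)
print(N / (theta * (1 - theta)))  # N * I(theta) ~ 238.1
```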
Reparameterization
Let $\eta = \psi(\theta)$ be a reparameterization, where $\psi$ is invertible and differentiable, and let $g(x; \eta) = f(x; \psi^{-1}(\eta))$ denote the reparameterized density. Let $I_g(\cdot)$ and $I_f(\cdot)$ denote the Fisher information under the two respective densities. Then

$$\begin{aligned}
I_g(\eta) &= \mathbb{E}_g\left[\left(\frac{\partial}{\partial \eta} \log g(x; \eta)\right)^2\right]
\\
&= \mathbb{E}_g\left[\left(\frac{\partial}{\partial \eta} \log f(x; \psi^{-1}(\eta))\right)^2\right]
\\
&= \mathbb{E}_g\left[\left(\frac{\partial}{\partial \theta} \log f(x; \psi^{-1}(\eta)) \, \frac{\partial}{\partial \eta} \psi^{-1}(\eta)\right)^2\right]
\\
&= \mathbb{E}_g\left[\left(\frac{\partial}{\partial \theta} \log f(x; \psi^{-1}(\eta))\right)^2\right] \left(\frac{\partial}{\partial \eta} \psi^{-1}(\eta)\right)^2
\\
&= I_f(\psi^{-1}(\eta)) \left(\frac{\partial}{\partial \eta} \psi^{-1}(\eta)\right)^2
\\
&= I_f(\psi^{-1}(\eta)) \left(\frac{1}{\psi'(\psi^{-1}(\eta))}\right)^2
\end{aligned} \tag{12}$$
The main idea is to apply the chain rule. The last step uses the formula for the derivative of an inverse function from calculus. In words, if a density is reparameterized by $\psi$, the new Fisher information is the old Fisher information (evaluated at $\psi^{-1}(\eta)$) times the squared derivative of $\psi^{-1}$ w.r.t. $\eta$.
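As a concrete instance (my example again, not the post's), take the Bernoulli model from earlier and reparameterize by the log-odds, $\eta = \psi(\theta) = \log \frac{\theta}{1 - \theta}$, so that $\theta = \psi^{-1}(\eta) = 1 / (1 + e^{-\eta})$. Since $\psi'(\theta) = \frac{1}{\theta(1 - \theta)}$ and $I_f(\theta) = \frac{1}{\theta(1 - \theta)}$, Equation 12 gives

$$I_g(\eta) = \frac{1}{\theta(1 - \theta)} \left(\theta(1 - \theta)\right)^2 = \theta(1 - \theta), \qquad \theta = \psi^{-1}(\eta),$$

so in the log-odds parameterization the Fisher information is largest at $\eta = 0$ (i.e. $\theta = 1/2$) rather than near the boundary.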