The information entropy, or simply entropy, of a random variable is the average amount of information or “surprise” over the range of values it can take. High-probability events carry little information (they are not surprising), while low-probability events carry a lot (they are surprising). For example, if I tell you the sun will rise tomorrow morning, you will be less surprised than if I tell you it will rain. Entropy quantifies your average surprise.
The equation for entropy is
$$H(x) = -\int p(x) \log p(x) \, dx. \tag{1}$$
Here, the negative sign ensures that high-probability events carry less information. The logarithm ensures that independent events have additive information. If $h(x) = -\log p(x)$ denotes the information content of a single event, then for independent events $x$ and $y$,
$$h(x, y) = -\log p(x, y) = -\log p(x) - \log p(y) = h(x) + h(y). \tag{2}$$
The units of entropy depend on the logarithm’s base. If $\log = \log_2$, then the units are “bits”. If $\log = \log_e = \ln$, then the units are “nats”, or natural units of information. Let’s work through the entropy calculations for the univariate and multivariate Gaussians.
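Before turning to the Gaussians, here is a small NumPy sketch (not part of the original argument) of both facts: the surprise of independent events adds, and changing the logarithm’s base only converts nats into bits. The probabilities are arbitrary choices for illustration.

```python
import numpy as np

p, q = 0.2, 0.5   # arbitrary probabilities of two independent events

h_p = -np.log(p)           # surprise of the first event, in nats
h_q = -np.log(q)           # surprise of the second event, in nats
h_joint = -np.log(p * q)   # surprise of both events occurring together

# Independence means the joint probability factorizes, so surprise adds.
assert np.isclose(h_joint, h_p + h_q)

# Changing the logarithm's base only rescales the units: nats -> bits.
print(h_p / np.log(2), -np.log2(p))   # the same value, expressed in bits
```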
Entropy of the univariate Gaussian. Let $x$ be a Gaussian-distributed random variable:
$$x \sim \mathcal{N}(\mu, \sigma^2). \tag{3}$$
Then its entropy is
$$
\begin{aligned}
H(x) &= -\int p(x) \log p(x) \, dx
\\
&= -\mathbb{E}\left[\log \mathcal{N}(\mu, \sigma^2)\right]
\\
&= -\mathbb{E}\left[\log\left[(2\pi\sigma^2)^{-1/2} \exp\left(-\frac{1}{2\sigma^2}(x-\mu)^2\right)\right]\right]
\\
&= \frac{1}{2}\log(2\pi\sigma^2) + \frac{1}{2\sigma^2}\mathbb{E}\left[(x-\mu)^2\right]
\\
&\stackrel{\star}{=} \frac{1}{2}\log(2\pi\sigma^2) + \frac{1}{2}.
\end{aligned} \tag{4}
$$
Step $\star$ holds because $\mathbb{E}[(x-\mu)^2] = \sigma^2$. In words, the entropy of $x$ is just a function of its variance $\sigma^2$. This makes sense: as $\sigma^2$ gets larger, the range of likely values $x$ can take gets bigger, and the entropy, or average amount of surprise, increases.
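Equation (4) is easy to verify numerically. The sketch below assumes SciPy, which is not used in the post; its `norm.entropy` method returns the differential entropy in nats, so it should match the closed form for any choice of $\mu$ and $\sigma$ (the values below are arbitrary).

```python
import numpy as np
from scipy.stats import norm

mu, sigma = 1.0, 2.5   # arbitrary mean and standard deviation

# Closed form from equation (4), in nats.
H_closed = 0.5 * np.log(2 * np.pi * sigma**2) + 0.5

# SciPy's differential entropy for the same Gaussian.
H_scipy = norm(loc=mu, scale=sigma).entropy()
assert np.isclose(H_closed, H_scipy)

# Entropy depends only on the variance: shifting the mean changes nothing.
assert np.isclose(H_scipy, norm(loc=-10.0, scale=sigma).entropy())
```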
Entropy of the multivariate Gaussian. Now let $\mathbf{x}$ be multivariate Gaussian distributed,
$$\mathbf{x} \sim \mathcal{N}_D(\boldsymbol{\mu}, \boldsymbol{\Sigma}). \tag{5}$$
The derivation is nearly the same:
$$
\begin{aligned}
H(\mathbf{x}) &= -\int p(\mathbf{x}) \log p(\mathbf{x}) \, d\mathbf{x}
\\
&= -\mathbb{E}\left[\log \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})\right]
\\
&= -\mathbb{E}\left[\log\left[(2\pi)^{-D/2} |\boldsymbol{\Sigma}|^{-1/2} \exp\left(-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^{\top} \boldsymbol{\Sigma}^{-1} (\mathbf{x}-\boldsymbol{\mu})\right)\right]\right]
\\
&= \frac{D}{2}\log(2\pi) + \frac{1}{2}\log|\boldsymbol{\Sigma}| + \frac{1}{2}\mathbb{E}\left[(\mathbf{x}-\boldsymbol{\mu})^{\top} \boldsymbol{\Sigma}^{-1} (\mathbf{x}-\boldsymbol{\mu})\right]
\\
&\stackrel{\star}{=} \frac{D}{2}\left(1 + \log(2\pi)\right) + \frac{1}{2}\log|\boldsymbol{\Sigma}|.
\end{aligned} \tag{6}
$$
Step ⋆ is a little trickier. It relies on several properties of the trace operator:
$$
\begin{aligned}
\mathbb{E}\left[(\mathbf{x}-\boldsymbol{\mu})^{\top} \boldsymbol{\Sigma}^{-1} (\mathbf{x}-\boldsymbol{\mu})\right]
&= \mathbb{E}\left[\operatorname{tr}\left((\mathbf{x}-\boldsymbol{\mu})^{\top} \boldsymbol{\Sigma}^{-1} (\mathbf{x}-\boldsymbol{\mu})\right)\right]
\\
&= \mathbb{E}\left[\operatorname{tr}\left(\boldsymbol{\Sigma}^{-1} (\mathbf{x}-\boldsymbol{\mu})(\mathbf{x}-\boldsymbol{\mu})^{\top}\right)\right]
\\
&= \operatorname{tr}\left(\boldsymbol{\Sigma}^{-1} \mathbb{E}\left[(\mathbf{x}-\boldsymbol{\mu})(\mathbf{x}-\boldsymbol{\mu})^{\top}\right]\right)
\\
&= \operatorname{tr}\left(\boldsymbol{\Sigma}^{-1} \boldsymbol{\Sigma}\right)
\\
&= \operatorname{tr}(\mathbf{I}_D)
\\
&= D.
\end{aligned} \tag{7}
$$
Again, we use the fact that $\mathbb{E}[(\mathbf{x}-\boldsymbol{\mu})(\mathbf{x}-\boldsymbol{\mu})^{\top}] = \boldsymbol{\Sigma}$, and again, the average amount of surprise encoded in $\mathbf{x}$ is a function of the covariance.
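The trace identity in equation (7) can also be sanity-checked without any algebra: averaging the quadratic form over many samples should land near $D$. This is just an illustrative Monte Carlo sketch using NumPy (not part of the derivation), with an arbitrary positive-definite covariance.

```python
import numpy as np

rng = np.random.default_rng(0)

D = 4
mu = np.zeros(D)
A = rng.normal(size=(D, D))
Sigma = A @ A.T + D * np.eye(D)   # arbitrary positive-definite covariance
Sigma_inv = np.linalg.inv(Sigma)

# Monte Carlo estimate of E[(x - mu)^T Sigma^{-1} (x - mu)].
X = rng.multivariate_normal(mu, Sigma, size=200_000)
diffs = X - mu
quad = np.einsum('ni,ij,nj->n', diffs, Sigma_inv, diffs)

print(quad.mean())                 # should be close to D = 4
assert abs(quad.mean() - D) < 0.1  # loose tolerance for sampling noise
```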
Notice that the entropy depends on $\boldsymbol{\Sigma}$ only through its determinant $|\boldsymbol{\Sigma}|$ (and the dimension $D$). A geometric intuition for the determinant of a covariance matrix is worth its own blog post, but note that $|\boldsymbol{\Sigma}|$ is called the generalized variance, a scalar which generalizes variance to multivariate distributions (Wilks, 1932). Intuitively, $|\boldsymbol{\Sigma}|$ is analogous to $\sigma^2$: as $|\boldsymbol{\Sigma}|$ gets larger, the entropy increases.
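As a final check, the closed form in equation (6) can be compared against SciPy’s entropy for a frozen multivariate normal; SciPy is an assumption of this sketch, not something used in the post, and the covariance matrix below is an arbitrary positive-definite choice. Scaling the covariance also shows the entropy growing with $|\boldsymbol{\Sigma}|$.

```python
import numpy as np
from scipy.stats import multivariate_normal

D = 3
mu = np.zeros(D)
Sigma = np.array([[2.0, 0.3, 0.0],
                  [0.3, 1.0, 0.2],
                  [0.0, 0.2, 0.5]])  # arbitrary positive-definite covariance

# Closed form from equation (6), in nats: D/2 * (1 + log(2*pi)) + 1/2 * log|Sigma|.
_, logdet = np.linalg.slogdet(Sigma)
H_closed = 0.5 * D * (1 + np.log(2 * np.pi)) + 0.5 * logdet

# SciPy's differential entropy for the same distribution.
H_scipy = multivariate_normal(mean=mu, cov=Sigma).entropy()
assert np.isclose(H_closed, H_scipy)

# Scaling Sigma by c > 1 multiplies |Sigma| by c^D, so the entropy increases.
H_scaled = multivariate_normal(mean=mu, cov=4.0 * Sigma).entropy()
print(H_scipy, H_scaled)  # the second value is larger by (D/2) * log(4)
```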