Entropy of the Gaussian

I derive the entropy for the univariate and multivariate Gaussian distributions.

The information entropy or entropy of a random variable is the average amount of information or “surprise” due to the range of values it can take. High-probability events carry little information (they are not surprising), while low-probability events carry a lot (they are surprising). For example, if I tell you the sun will rise tomorrow morning, you will be less surprised than if I tell you it will rain. Entropy quantifies this surprise, averaged over outcomes.

The equation for entropy is

$$
H(x) = -\int p(x) \log p(x) \text{d}x. \tag{1}
$$

Here, the negative sign means that high-probability events contribute less surprise. The logarithm ensures that independent events have additive information. If $h(x) = -\log p(x)$ denotes the information content of a single event, then for independent $x$ and $y$,

$$
\begin{aligned}
h(x, y) &= -\log p(x, y) \\
&= -\log [p(x)\, p(y)] \\
&= -\log p(x) - \log p(y) \\
&= h(x) + h(y).
\end{aligned} \tag{2}
$$

The units of entropy depend on the logarithm’s base. If $\log = \log_2$, then the units are “bits”. If $\log = \log_e = \ln$, then the units are “nats”, or natural units of information. Let’s work through the entropy calculations for the univariate and multivariate Gaussians.
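For example, a fair coin flip has entropy $H = -\tfrac{1}{2}\log\tfrac{1}{2} - \tfrac{1}{2}\log\tfrac{1}{2} = \log 2$, which is one bit in base $2$ or $\ln 2 \approx 0.693$ nats; and by the additivity in (2), two independent flips carry $2 \log 2$, or two bits.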

Entropy of the univariate Gaussian. Let $x$ be a Gaussian-distributed random variable:

$$
x \sim \mathcal{N}(\mu, \sigma^2). \tag{3}
$$

Then its entropy is

$$
\begin{aligned}
H(x) &= -\int p(x) \log p(x) \text{d}x \\
&= -\mathbb{E}[\log \mathcal{N}(\mu, \sigma^2)] \\
&= -\mathbb{E}\big[\log\big[ (2\pi\sigma^2)^{-1/2} \exp\big( -\frac{1}{2\sigma^2}(x-\mu)^2 \big)\big]\big] \\
&= \frac{1}{2} \log(2\pi\sigma^2) + \frac{1}{2\sigma^2}\mathbb{E}[(x-\mu)^2] \\
&\stackrel{\star}{=} \frac{1}{2} \log(2\pi\sigma^2) + \frac{1}{2}.
\end{aligned} \tag{4}
$$

Step $\star$ holds because $\mathbb{E}[(x - \mu)^2] = \sigma^2$. In words, the entropy of $x$ is just a function of its variance $\sigma^2$. This makes sense. As $\sigma^2$ gets larger, the range of likely values $x$ can take gets bigger, and the entropy or average amount of surprise increases.
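As a quick numerical sanity check, here is a minimal Python sketch (assuming NumPy and SciPy are available; the particular values of $\mu$ and $\sigma$ are arbitrary) that compares the closed form in (4) against a Monte Carlo estimate of the average surprise and against SciPy's built-in entropy:

```python
import numpy as np
from scipy.stats import norm

mu, sigma = 1.5, 2.0

# Closed form from equation (4), in nats.
closed_form = 0.5 * np.log(2 * np.pi * sigma**2) + 0.5

# Monte Carlo estimate: average surprise -log p(x) over samples of x.
rng = np.random.default_rng(0)
samples = rng.normal(mu, sigma, size=1_000_000)
monte_carlo = -norm.logpdf(samples, loc=mu, scale=sigma).mean()

print(closed_form)                # ~2.112 nats for sigma = 2
print(monte_carlo)                # close to the closed form
print(norm(mu, sigma).entropy())  # SciPy's value agrees
```

Note that none of these values depend on $\mu$, matching the claim that the entropy is a function of the variance alone.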

Entropy of the multivariate Gaussian. Now let $\mathbf{x}$ be multivariate Gaussian distributed,

$$
\mathbf{x} \sim \mathcal{N}_D(\boldsymbol{\mu}, \boldsymbol{\Sigma}). \tag{5}
$$

The derivation is nearly the same:

$$
\begin{aligned}
H(\mathbf{x}) &= -\int p(\mathbf{x}) \log p(\mathbf{x}) \text{d}\mathbf{x} \\
&= -\mathbb{E}[\log \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})] \\
&= -\mathbb{E}\big[\log\big[(2\pi)^{-D/2} |\boldsymbol{\Sigma}|^{-1/2} \exp\big(-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^{\top} \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})\big) \big]\big] \\
&= \frac{D}{2} \log(2\pi) + \frac{1}{2} \log |\boldsymbol{\Sigma}| + \frac{1}{2} \mathbb{E}[(\mathbf{x} - \boldsymbol{\mu})^{\top} \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})] \\
&\stackrel{\star}{=} \frac{D}{2} (1 + \log(2\pi)) + \frac{1}{2} \log |\boldsymbol{\Sigma}|.
\end{aligned} \tag{6}
$$

Step $\star$ is a little trickier. It relies on several properties of the trace operator:

$$
\begin{aligned}
\mathbb{E}[(\mathbf{x} - \boldsymbol{\mu})^{\top} \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})]
&= \mathbb{E}[\text{tr}((\mathbf{x} - \boldsymbol{\mu})^{\top} \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}))] \\
&= \mathbb{E}[\text{tr}(\boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})(\mathbf{x} - \boldsymbol{\mu})^{\top})] \\
&= \text{tr}(\boldsymbol{\Sigma}^{-1} \mathbb{E}[(\mathbf{x} - \boldsymbol{\mu})(\mathbf{x} - \boldsymbol{\mu})^{\top}]) \\
&= \text{tr}(\boldsymbol{\Sigma}^{-1} \boldsymbol{\Sigma}) \\
&= \text{tr}(\mathbf{I}_D) \\
&= D.
\end{aligned} \tag{7}
$$

Again, we use the fact that $\mathbb{E}[(\mathbf{x} - \boldsymbol{\mu})(\mathbf{x} - \boldsymbol{\mu})^{\top}] = \boldsymbol{\Sigma}$, and again, the average amount of surprise encoded in $\mathbf{x}$ is a function of the covariance.
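To see step $\star$ numerically, here is a small sketch (assuming NumPy; the mean and covariance below are arbitrary illustrative choices) that estimates the quadratic form's expectation by Monte Carlo:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 3
mu = np.array([1.0, -2.0, 0.5])
A = rng.normal(size=(D, D))
Sigma = A @ A.T + D * np.eye(D)  # an arbitrary positive-definite covariance

# Sample x ~ N(mu, Sigma) and average the quadratic form (x - mu)^T Sigma^{-1} (x - mu).
x = rng.multivariate_normal(mu, Sigma, size=500_000)
diff = x - mu
quad = np.einsum('ni,ij,nj->n', diff, np.linalg.inv(Sigma), diff)

print(quad.mean())  # approximately D = 3
```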

Notice that the entropy depends on $\boldsymbol{\Sigma}$ only through its determinant, $|\boldsymbol{\Sigma}|$. A geometric intuition for the determinant of a covariance matrix is worth its own blog post, but note that $|\boldsymbol{\Sigma}|$ is actually called the generalized variance, a scalar which generalizes variance to multivariate distributions (Wilks, 1932). Intuitively, $|\boldsymbol{\Sigma}|$ is analogous to $\sigma^2$. As $|\boldsymbol{\Sigma}|$ gets larger, the entropy increases.
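As a final sanity check (again a sketch assuming NumPy and SciPy; the covariance below is just an arbitrary positive-definite example), we can compare the closed form in (6) with SciPy's multivariate normal entropy and confirm that inflating $\boldsymbol{\Sigma}$, and hence $|\boldsymbol{\Sigma}|$, increases the entropy:

```python
import numpy as np
from scipy.stats import multivariate_normal

D = 3
mu = np.zeros(D)
Sigma = np.array([[2.0, 0.3, 0.0],
                  [0.3, 1.0, 0.2],
                  [0.0, 0.2, 0.5]])

# Closed form from equation (6): (D/2)(1 + log 2*pi) + (1/2) log|Sigma|.
_, logdet = np.linalg.slogdet(Sigma)
closed_form = 0.5 * D * (1 + np.log(2 * np.pi)) + 0.5 * logdet

print(closed_form)
print(multivariate_normal(mu, Sigma).entropy())  # matches the closed form

# Scaling Sigma by c multiplies |Sigma| by c^D, raising the entropy by (D/2) log c.
c = 4.0
print(multivariate_normal(mu, c * Sigma).entropy()
      - multivariate_normal(mu, Sigma).entropy())  # approximately (3/2) log 4, about 2.079
```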

  1. Wilks, S. S. (1932). Certain generalizations in the analysis of variance. Biometrika, 471–494.