The information entropy, or simply entropy, of a random variable is the average amount of information or “surprise” over the range of values it can take. High-probability events carry little information (they are not surprising), while low-probability events carry a lot (they are surprising). For example, if I tell you the sun will rise tomorrow morning, you will be less surprised than if I tell you it will rain. Entropy quantifies your average surprise.
The equation for entropy is
$$H(x) = -\int p(x) \log p(x) \, dx. \tag{1}$$
Here, the negative sign ensures that high-probability events carry less information. The logarithm ensures that independent events have additive information. If $h(x) = -\log p(x)$ denotes the information content of a single event, then for independent events $x$ and $y$,
$$h(x, y) = -\log p(x, y) = -\log p(x) - \log p(y) = h(x) + h(y). \tag{2}$$
The units of entropy depend on the logarithm’s base. If $\log = \log_2$, then the units are “bits”. If $\log = \log_e = \ln$, then the units are “nats”, or natural units of information. Let’s work through the entropy calculations for the univariate and multivariate Gaussians.
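Before turning to the Gaussians, here is a small NumPy sketch (not part of the original argument) of both facts: the surprise of independent events adds, and changing the logarithm’s base only converts nats into bits. The probabilities are arbitrary choices for illustration.

```python
import numpy as np

p, q = 0.2, 0.5   # arbitrary probabilities of two independent events

h_p = -np.log(p)           # surprise of the first event, in nats
h_q = -np.log(q)           # surprise of the second event, in nats
h_joint = -np.log(p * q)   # surprise of both events occurring together

# Independence means the joint probability factorizes, so surprise adds.
assert np.isclose(h_joint, h_p + h_q)

# Changing the logarithm's base only rescales the units: nats -> bits.
print(h_p / np.log(2), -np.log2(p))   # the same value, expressed in bits
```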
Entropy of the univariate Gaussian. Let $x$ be a Gaussian-distributed random variable:
$$x \sim \mathcal{N}(\mu, \sigma^2). \tag{3}$$
Then its entropy is
$$
\begin{aligned}
H(x) &= -\int p(x) \log p(x) \, dx
\\
&= -\mathbb{E}\left[\log \mathcal{N}(\mu, \sigma^2)\right]
\\
&= -\mathbb{E}\left[\log\left[(2\pi\sigma^2)^{-1/2} \exp\left(-\frac{1}{2\sigma^2}(x-\mu)^2\right)\right]\right]
\\
&= \frac{1}{2}\log(2\pi\sigma^2) + \frac{1}{2\sigma^2}\mathbb{E}\left[(x-\mu)^2\right]
\\
&\stackrel{\star}{=} \frac{1}{2}\log(2\pi\sigma^2) + \frac{1}{2}.
\end{aligned} \tag{4}
$$
Step $\star$ holds because $\mathbb{E}[(x-\mu)^2] = \sigma^2$. In words, the entropy of $x$ is just a function of its variance $\sigma^2$. This makes sense: as $\sigma^2$ gets larger, the range of likely values $x$ can take gets bigger, and the entropy, or average amount of surprise, increases.
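Equation (4) is easy to verify numerically. The sketch below assumes SciPy, which is not used in the post; its `norm.entropy` method returns the differential entropy in nats, so it should match the closed form for any choice of $\mu$ and $\sigma$ (the values below are arbitrary).

```python
import numpy as np
from scipy.stats import norm

mu, sigma = 1.0, 2.5   # arbitrary mean and standard deviation

# Closed form from equation (4), in nats.
H_closed = 0.5 * np.log(2 * np.pi * sigma**2) + 0.5

# SciPy's differential entropy for the same Gaussian.
H_scipy = norm(loc=mu, scale=sigma).entropy()
assert np.isclose(H_closed, H_scipy)

# Entropy depends only on the variance: shifting the mean changes nothing.
assert np.isclose(H_scipy, norm(loc=-10.0, scale=sigma).entropy())
```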
Entropy of the multivariate Gaussian. Now let $\mathbf{x}$ be multivariate Gaussian distributed,
$$\mathbf{x} \sim \mathcal{N}_D(\boldsymbol{\mu}, \boldsymbol{\Sigma}). \tag{5}$$
The derivation is nearly the same:
$$
\begin{aligned}
H(\mathbf{x}) &= -\int p(\mathbf{x}) \log p(\mathbf{x}) \, d\mathbf{x}
\\
&= -\mathbb{E}\left[\log \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})\right]
\\
&= -\mathbb{E}\left[\log\left[(2\pi)^{-D/2} |\boldsymbol{\Sigma}|^{-1/2} \exp\left(-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^{\top} \boldsymbol{\Sigma}^{-1} (\mathbf{x}-\boldsymbol{\mu})\right)\right]\right]
\\
&= \frac{D}{2}\log(2\pi) + \frac{1}{2}\log|\boldsymbol{\Sigma}| + \frac{1}{2}\mathbb{E}\left[(\mathbf{x}-\boldsymbol{\mu})^{\top} \boldsymbol{\Sigma}^{-1} (\mathbf{x}-\boldsymbol{\mu})\right]
\\
&\stackrel{\star}{=} \frac{D}{2}\left(1 + \log(2\pi)\right) + \frac{1}{2}\log|\boldsymbol{\Sigma}|.
\end{aligned} \tag{6}
$$
Step ⋆ is a little trickier. It relies on several properties of the trace operator:
$$
\begin{aligned}
\mathbb{E}\left[(\mathbf{x}-\boldsymbol{\mu})^{\top} \boldsymbol{\Sigma}^{-1} (\mathbf{x}-\boldsymbol{\mu})\right]
&= \mathbb{E}\left[\operatorname{tr}\left((\mathbf{x}-\boldsymbol{\mu})^{\top} \boldsymbol{\Sigma}^{-1} (\mathbf{x}-\boldsymbol{\mu})\right)\right]
\\
&= \mathbb{E}\left[\operatorname{tr}\left(\boldsymbol{\Sigma}^{-1} (\mathbf{x}-\boldsymbol{\mu})(\mathbf{x}-\boldsymbol{\mu})^{\top}\right)\right]
\\
&= \operatorname{tr}\left(\boldsymbol{\Sigma}^{-1} \mathbb{E}\left[(\mathbf{x}-\boldsymbol{\mu})(\mathbf{x}-\boldsymbol{\mu})^{\top}\right]\right)
\\
&= \operatorname{tr}\left(\boldsymbol{\Sigma}^{-1} \boldsymbol{\Sigma}\right)
\\
&= \operatorname{tr}(\mathbf{I}_D)
\\
&= D.
\end{aligned} \tag{7}
$$
Again, we use the fact that $\mathbb{E}[(\mathbf{x}-\boldsymbol{\mu})(\mathbf{x}-\boldsymbol{\mu})^{\top}] = \boldsymbol{\Sigma}$, and again, the average amount of surprise encoded in $\mathbf{x}$ is a function of the covariance.
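The trace identity in equation (7) can also be sanity-checked without any algebra: averaging the quadratic form over many samples should land near $D$. This is just an illustrative Monte Carlo sketch using NumPy (not part of the derivation), with an arbitrary positive-definite covariance.

```python
import numpy as np

rng = np.random.default_rng(0)

D = 4
mu = np.zeros(D)
A = rng.normal(size=(D, D))
Sigma = A @ A.T + D * np.eye(D)   # arbitrary positive-definite covariance
Sigma_inv = np.linalg.inv(Sigma)

# Monte Carlo estimate of E[(x - mu)^T Sigma^{-1} (x - mu)].
X = rng.multivariate_normal(mu, Sigma, size=200_000)
diffs = X - mu
quad = np.einsum('ni,ij,nj->n', diffs, Sigma_inv, diffs)

print(quad.mean())                 # should be close to D = 4
assert abs(quad.mean() - D) < 0.1  # loose tolerance for sampling noise
```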
Notice that the entropy depends on $\boldsymbol{\Sigma}$ only through its determinant $|\boldsymbol{\Sigma}|$ (and the dimension $D$). A geometric intuition for the determinant of a covariance matrix is worth its own blog post, but note that $|\boldsymbol{\Sigma}|$ is called the generalized variance, a scalar which generalizes variance to multivariate distributions (Wilks, 1932). Intuitively, $|\boldsymbol{\Sigma}|$ is analogous to $\sigma^2$: as $|\boldsymbol{\Sigma}|$ gets larger, the entropy increases.
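As a final check, the closed form in equation (6) can be compared against SciPy’s entropy for a frozen multivariate normal; SciPy is an assumption of this sketch, not something used in the post, and the covariance matrix below is an arbitrary positive-definite choice. Scaling the covariance also shows the entropy growing with $|\boldsymbol{\Sigma}|$.

```python
import numpy as np
from scipy.stats import multivariate_normal

D = 3
mu = np.zeros(D)
Sigma = np.array([[2.0, 0.3, 0.0],
                  [0.3, 1.0, 0.2],
                  [0.0, 0.2, 0.5]])  # arbitrary positive-definite covariance

# Closed form from equation (6), in nats: D/2 * (1 + log(2*pi)) + 1/2 * log|Sigma|.
_, logdet = np.linalg.slogdet(Sigma)
H_closed = 0.5 * D * (1 + np.log(2 * np.pi)) + 0.5 * logdet

# SciPy's differential entropy for the same distribution.
H_scipy = multivariate_normal(mean=mu, cov=Sigma).entropy()
assert np.isclose(H_closed, H_scipy)

# Scaling Sigma by c > 1 multiplies |Sigma| by c^D, so the entropy increases.
H_scaled = multivariate_normal(mean=mu, cov=4.0 * Sigma).entropy()
print(H_scipy, H_scaled)  # the second value is larger by (D/2) * log(4)
```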