Laplace's Method

Laplace's method is used to approximate a distribution with a Gaussian. I explain the technique in general and work through an exercise by David MacKay.

The motivation for Laplace’s method (Laplace, 1774) is more general than our goal of density estimation, and it is worth exploring. Suppose we want to approximate an integral of the form

$$\int_a^b e^{k f(x)} \,\text{d}x \tag{1}$$

where $f(x)$ is a twice-differentiable function with a global maximum at $x_0$, $k > 0$ is ideally large, and the endpoints $a$ and $b$ can be infinite. Consider this ratio,

$$\frac{e^{k f(x_0)}}{e^{k f(x)}} = e^{k(f(x_0) - f(x))}$$

which increases exponentially as $k$ increases linearly. So for large $k$, the difference between the maximum point and the rest of the function is accentuated (Figure 1).

Figure 1: Let $f(x) = \sin(x)/x$. Then $x_0 = 0$ and $\lim_{x \rightarrow x_0} f(x) = 1$. As $k$ increases, the ratio $r(x) = \exp\{k(f(x_0) - f(x))\}$ increases exponentially.
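To make this concrete, here is a small numerical sketch (Python with NumPy) of the ratio $r(x)$ from Figure 1 at a single fixed point; the point $x = 2$ and the values of $k$ are arbitrary choices for illustration:

```python
import numpy as np

def f(x):
    # f(x) = sin(x)/x; np.sinc(x/pi) equals sin(x)/x and handles f(0) = 1 correctly.
    return np.sinc(x / np.pi)

x = 2.0  # an arbitrary point away from the maximum x0 = 0
for k in [1, 5, 10, 20]:
    r = np.exp(k * (f(0.0) - f(x)))
    print(f"k = {k:2d}   r({x}) = {r:10.1f}")
```

For this $f$, $f(0) - f(2) \approx 0.545$, so each unit increase in $k$ multiplies the ratio by roughly $e^{0.545} \approx 1.7$.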

This observation about the behavior of $e^{k f(x)}$ is what motivates Laplace’s method, and it suggests that we can approximate $f(x)$ using a Taylor expansion around $x_0$,

$$f(x) = f(x_0) + \frac{f^{\prime}(x_0)}{1!} (x - x_0) + \frac{f^{\prime\prime}(x_0)}{2!} (x - x_0)^2 + \dots$$

Since $x_0$ is at a maximum, $f^{\prime}(x_0) = 0$, and the first-order term disappears. If we drop the higher-order terms, then

$$f(x) \approx f(x_0) + \frac{1}{2} f^{\prime\prime}(x_0) (x - x_0)^2 \tag{2}$$

As an example, consider the function $f(x) = \sin(x)/x$ from Figure 1. Then we have

$$\begin{aligned} f(x) &= \sin(x)/x \\ x_0 &= 0 \\ \lim_{x \rightarrow x_0} f(x) &= 1 \\ \lim_{x \rightarrow x_0} f^{\prime\prime}(x) &= - \frac{1}{3} \\ &\Downarrow \\ \exp\Big\{k \Big(\frac{\sin(x)}{x}\Big)\Big\} &\approx \exp\Big\{ k \Big(1 - \frac{1}{6} x^2\Big)\Big\} \end{aligned}$$

Figure 2 shows the function $e^{k f(x)}$ and the approximation for varying values of $k$. This captures our earlier point, which is that as $k$ increases, the function $e^{k f(x)}$ becomes more peaked around $x_0$, and thus our Taylor-expanded approximation improves.

Figure 2: Let $f(x) = \sin(x)/x$, $g(x) = e^{k f(x)}$, and $h(x) = \exp\Big(k\Big(f(x_0) + \frac{f^{\prime\prime}(x_0)}{2} (x - x_0)^2 \Big)\Big)$. The panels compare $g(x)$ to its approximation $h(x)$ for $k \in \{1, 1.5, 2\}$.
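As a numerical companion to Figure 2, the sketch below compares the integral of $g$ with the integral of its approximation $h$ over a window around the peak. I use larger values of $k$ than the figure does, as an assumption of my own, to show the relative error shrinking as $k$ grows:

```python
import numpy as np

def g(x, k):
    return np.exp(k * np.sinc(x / np.pi))   # e^{k sin(x)/x}

def h(x, k):
    # exp{k (f(x0) + (1/2) f''(x0) (x - x0)^2)} with f(0) = 1, f''(0) = -1/3
    return np.exp(k * (1.0 - x**2 / 6.0))

xs = np.linspace(-2.0, 2.0, 2001)
dx = xs[1] - xs[0]
for k in [1.0, 10.0, 100.0]:
    I_g = np.sum(g(xs, k)) * dx  # simple Riemann sums over the window
    I_h = np.sum(h(xs, k)) * dx
    print(f"k = {k:5.1f}   relative error = {abs(I_g - I_h) / I_g:.4f}")
```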

At this point, you might have already noticed that our approximation (Equation 2), if appropriately normalized, is just the Gaussian PDF centered at $x_0$. Since $x_0$ is a maximum, $f^{\prime\prime}(x_0) < 0$ (recall the Second Derivative Test) and thus $f^{\prime\prime}(x_0) = -|f^{\prime\prime}(x_0)|$. We can approximate Equation 1 as,

$$\begin{aligned} \int_a^b e^{k f(x)} \,\text{d}x &\approx \int_a^b \exp\Big\{ k \Big( f(x_0) - \frac{1}{2}|f^{\prime\prime}(x_0)|(x - x_0)^2 \Big) \Big\} \,\text{d}x \\ &= e^{k f(x_0)} \int_a^b \exp\Big\{ - \frac{1}{2} k |f^{\prime\prime}(x_0)|(x - x_0)^2 \Big\} \,\text{d}x \end{aligned}$$

This is a Gaussian integral when $a = -\infty$ and $b = \infty$, evaluating to $\sqrt{2\pi / (k |f^{\prime\prime}(x_0)|)}$, so that

$$\int_a^b e^{k f(x)} \,\text{d}x \approx e^{k f(x_0)} \sqrt{\frac{2 \pi}{k |f^{\prime\prime}(x_0)|}},$$

and the integrand is a Gaussian density if appropriately normalized. This is precisely what motivates Laplace’s method for density approximation.
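Here is a sketch verifying this numerically for $f(x) = \sin(x)/x$. Note that the integral must be taken over a finite interval for this particular $f$, since $e^{k f(x)} \rightarrow 1$ rather than $0$ as $|x| \rightarrow \infty$; the endpoints and the value of $k$ are my own choices for illustration:

```python
import numpy as np
from scipy.integrate import quad

def f(x):
    return np.sinc(x / np.pi)  # sin(x)/x with f(0) = 1

k = 100.0
a, b = -10.0, 10.0  # finite endpoints; the peak at x0 = 0 dominates for large k

numeric, _ = quad(lambda x: np.exp(k * f(x)), a, b, points=[0.0])
# Closed-form Laplace approximation, using |f''(0)| = 1/3.
laplace = np.exp(k * f(0.0)) * np.sqrt(2.0 * np.pi / (k * (1.0 / 3.0)))
print(f"numeric = {numeric:.6e}")
print(f"laplace = {laplace:.6e}")
print(f"ratio   = {numeric / laplace:.4f}")  # should be close to 1 for large k
```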

Density approximation

Now consider the goal of approximating a distribution, ideally unimodal, with a Gaussian. Our unnormalized density is $f(x)$, and the normalized density is $p(x) = \frac{1}{Z} f(x)$, where $Z$ is unknown. The mode or peak of this distribution occurs at the maximum $x_0$.

Let’s Taylor-expand $\ln f(x)$ around the point $x_0$. If we write the negative of the second derivative in the second-order term as $A$ and drop the higher-order terms, then

$$\ln f(x) \approx \ln f(x_0) - \frac{A}{2} (x - x_0)^2, \qquad A = -\frac{\partial^2}{\partial x^2} \ln f(x) \Big|_{x = x_0} \tag{3}$$

If we are estimating parameters, then $x_0$ is just the maximum likelihood or maximum a posteriori (MAP) estimate, and $A$ is just the second derivative of the negative log likelihood, evaluated at the optimal parameter. Thus, we can approximate $p(x)$ with an unnormalized density $q^{*}(x)$ such that,

$$q^{*}(x) \triangleq f(x_0) \exp \Big\{ - \frac{A}{2} (x - x_0)^2 \Big\}$$

Since we assume $q^{*}(x)$ is Gaussian, $A = 1 / \sigma^2$, and the normalizing constant is

$$Z_q \triangleq \sqrt{2 \pi} \, f(x_0) \frac{1}{\sqrt{A}}$$

Putting it together, the normalized density $q(x)$ is

$$p(x) \approx q(x) = \frac{q^{*}(x)}{Z_q} = \sqrt{\frac{A}{2 \pi}} \exp \Big\{ - \frac{A}{2} (x - x_0)^2 \Big\}.$$

Notice that in density estimation, $k$ is not a free parameter because we want our density to integrate to $1$.
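The one-dimensional recipe is short enough to sketch in code. This is a minimal implementation under two assumptions of mine: the log density is unimodal, and $A$ is estimated with a central finite difference. The function name laplace_approx_1d and the test density are hypothetical choices for illustration:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def laplace_approx_1d(log_f, eps=1e-5):
    """Fit N(x0, 1/A) to the unnormalized density exp(log_f(x))."""
    x0 = minimize_scalar(lambda x: -log_f(x)).x  # mode of the density
    # A = -(d^2/dx^2) log f at x0, via a central finite difference.
    A = -(log_f(x0 + eps) - 2.0 * log_f(x0) + log_f(x0 - eps)) / eps**2
    return x0, 1.0 / A  # mean and variance of the fitted Gaussian

# Sanity check on an exact Gaussian: log f(x) = -(x - 3)^2 / (2 * 4).
mean, var = laplace_approx_1d(lambda x: -0.5 * (x - 3.0)**2 / 4.0)
print(mean, var)  # should be close to (3.0, 4.0)
```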

We can generalize Laplace’s method to a $p$-dimensional random vector $\mathbf{x}$. Let the constant $A$ from the previous section become a $p \times p$ matrix $\mathbf{A}$, the negative Hessian of $\ln f$, such that

$$A_{ij} = - \frac{\partial^2}{\partial x_i \partial x_j} \ln f(\mathbf{x}) \Big|_{\mathbf{x} = \mathbf{x}_0}.$$

Then the approximation in Equation 3 is

$$\ln f(\mathbf{x}) \approx \ln f(\mathbf{x}_0) - \frac{1}{2} (\mathbf{x} - \mathbf{x}_0)^{\top} \mathbf{A} (\mathbf{x} - \mathbf{x}_0),$$

and the normalizing constant is

$$Z_q \triangleq (2 \pi)^{p/2} f(\mathbf{x}_0) \frac{1}{\sqrt{\det(\mathbf{A})}}.$$

This is derived from the standard form of a multivariate Gaussian with $\mathbf{A}^{-1}$ in place of the covariance matrix $\boldsymbol{\Sigma}$. The normalized density is

$$p(\mathbf{x}) \approx q(\mathbf{x}) = \frac{q^{*}(\mathbf{x})}{Z_q} = \sqrt{\frac{\det(\mathbf{A})}{(2 \pi)^p}} \exp \Big\{ - \frac{1}{2} (\mathbf{x} - \mathbf{x}_0)^{\top} \mathbf{A} (\mathbf{x} - \mathbf{x}_0) \Big\}.$$

Thus, we have seen that Laplace’s method places a Gaussian at the mode of the approximated distribution and matches the curvature of the log density around that point.
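A sketch of the multivariate version under the same assumptions, using scipy.optimize.minimize to find the mode and central finite differences for $\mathbf{A}$; the helper laplace_approx and its test case are my own illustrative choices:

```python
import numpy as np
from scipy.optimize import minimize

def laplace_approx(log_f, x_init, eps=1e-4):
    """Fit N(x0, A^{-1}) to the unnormalized density exp(log_f(x))."""
    x0 = minimize(lambda x: -log_f(x), x_init).x  # mode of the density
    p = x0.size
    I = np.eye(p)
    A = np.zeros((p, p))
    for i in range(p):
        for j in range(p):
            # A_ij = -d^2 log f / dx_i dx_j at x0, central finite difference.
            A[i, j] = -(log_f(x0 + eps * (I[i] + I[j]))
                        - log_f(x0 + eps * (I[i] - I[j]))
                        - log_f(x0 - eps * (I[i] - I[j]))
                        + log_f(x0 - eps * (I[i] + I[j]))) / (4.0 * eps**2)
    return x0, np.linalg.inv(A)  # mean and covariance of the fitted Gaussian

# Sanity check on a 2D Gaussian with known mean and covariance.
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
P = np.linalg.inv(Sigma)
mu = np.array([1.0, -2.0])
log_f = lambda x: -0.5 * (x - mu) @ P @ (x - mu)
mean, cov = laplace_approx(log_f, x_init=np.zeros(2))
print(mean)  # ≈ [1, -2]
print(cov)   # ≈ Sigma
```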

MacKay’s exercise

As a complete example, let’s work through David MacKay’s exercise 27.1, part (a):

A photon counter is pointed at a remote star for one minute, in order to infer the rate of photons arriving at the counter per minute, $\lambda$. Assuming the number of photons collected $r$ has a Poisson distribution with mean $\lambda$,

$$P(r \mid \lambda) = \exp(-\lambda) \frac{\lambda^r}{r!},$$

and assuming the improper prior $P(\lambda) = 1 / \lambda$, make Laplace approximations to the posterior distribution over $\lambda$.

First, since we have a prior, note that $\lambda_0 = \lambda_{\text{MAP}}$. The posterior is

$$p(\lambda \mid r) \propto p(r \mid \lambda) \, p(\lambda) = \frac{e^{-\lambda} \lambda^{r-1}}{r!}$$

The negative log of the posterior is

$$\begin{aligned} Q &= - \ln p(\lambda \mid r) \\ &= - [ \ln(e^{-\lambda}) + \ln(\lambda^{r-1}) - \ln(r!)] \\ &\propto \lambda - (r - 1) \ln(\lambda) \end{aligned}$$

We drop the term $\ln(r!)$ because it is just a constant with respect to $\lambda$. Now let’s find $\lambda_{\text{MAP}}$:

$$\begin{aligned} \frac{\partial Q}{\partial \lambda} &= 1 - \frac{r-1}{\lambda} \\ 0 &= 1 - \frac{r-1}{\lambda} \\ \lambda_{\text{MAP}} &= r - 1 \end{aligned}$$

Next, we compute the second derivative, evaluated at the mode of the distribution,

$$A = \frac{\partial^2 Q}{\partial \lambda^2} \Bigg|_{\lambda = \lambda_{\text{MAP}}} = \frac{r-1}{\lambda^2_{\text{MAP}}} = \frac{r - 1}{(r-1)^2} = \frac{1}{r - 1}$$

Since $A = \frac{1}{r - 1}$ and $\lambda_{\text{MAP}} = r - 1$, the normalized density is

$$\begin{aligned} q(\lambda) &= \sqrt{\frac{A}{2 \pi}} \exp \Big\{ - \frac{A}{2} (\lambda - \lambda_{\text{MAP}})^2 \Big\} \\ &= \frac{1}{\sqrt{2 \pi (r - 1)}} \exp \Big\{ - \frac{1}{2 (r - 1)} (\lambda - (r - 1))^2 \Big\} \end{aligned}$$

This is a Gaussian with a mean and variance of $r - 1$,

$$\boxed{p(\lambda \mid r) \approx \mathcal{N}(\lambda; r - 1, r - 1)}$$

and we’re done.
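As a sketch of a numerical check: the posterior $p(\lambda \mid r) \propto e^{-\lambda} \lambda^{r-1}$ is exactly a $\text{Gamma}(r, 1)$ distribution, so we can compare its density to $\mathcal{N}(r - 1, r - 1)$ at a few points. The count $r = 10$ and the evaluation points are arbitrary choices for illustration:

```python
import numpy as np
from scipy.stats import gamma, norm

r = 10  # an arbitrary photon count for illustration
posterior = gamma(a=r, scale=1.0)                 # p(λ|r) ∝ e^{-λ} λ^{r-1}
laplace = norm(loc=r - 1, scale=np.sqrt(r - 1))   # N(λ; r-1, r-1)

for lam in [3.0, 6.0, 9.0, 12.0, 18.0]:
    print(f"λ = {lam:4.1f}   posterior = {posterior.pdf(lam):.4f}"
          f"   Laplace = {laplace.pdf(lam):.4f}")
```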

As a final warning, observe that a Gaussian approximation is often suboptimal. For example, we just derived a Gaussian distribution whose mean and variance are identical. For ease of visualization, let’s recalculate our Laplace approximation of just the Poisson distribution, without the improper prior from MacKay. We can see that the approximation gets worse as $\lambda$ gets bigger (Figure 3).

Figure 3: Laplace approximations of $\text{Pois}(x \mid \lambda)$ for $\lambda \in \{2, 10\}$. Since each Gaussian approximation has both a mean and a variance that are functions of $\lambda$, the variance of the Gaussian approximation quickly blows up when $\lambda$ is large.

This is due to the fact that the Gaussian approximation has both a mean and a variance of $\lambda$, and its spread increases faster than the Poisson distribution’s.
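For reference, here is a plotting sketch along the lines of Figure 3, comparing the Poisson pmf to the Gaussian with matched mean and variance $\lambda$; the figure sizing is an arbitrary choice:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import poisson, norm

fig, axes = plt.subplots(1, 2, figsize=(9, 3))
for ax, lam in zip(axes, [2, 10]):
    xs = np.arange(0, 3 * lam + 1)
    ax.bar(xs, poisson.pmf(xs, lam), alpha=0.4, label=f"Pois(x | {lam})")
    grid = np.linspace(0, 3 * lam, 300)
    ax.plot(grid, norm.pdf(grid, loc=lam, scale=np.sqrt(lam)),
            label=f"N({lam}, {lam})")
    ax.legend()
plt.tight_layout()
plt.show()
```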

  1. Laplace, P. S. (1774). Mémoire sur la probabilité des causes par les événements. Mém. de Math. et Phys. Présentés à l’Acad. Roy. des Sci., 6, 621–656.