Why Metropolis–Hastings Works

Many authors introduce Metropolis–Hastings through its acceptance criterion without explaining why that criterion allows us to sample from our target distribution. I provide a formal justification.

Metropolis–Hastings (MH) is an elegant algorithm that is based on a truly deep idea. Suppose we want to sample from a target distribution $\pi^{*}$. We can evaluate $\pi^{*}$, just not sample from it. MH performs a random walk according to a Markov chain whose stationary distribution is $\pi^{*}$. At each step in the chain, a new state is proposed and either accepted or rejected according to a dynamically calculated probability, called the acceptance criterion. The Markov chain is never explicitly constructed; we cannot save the transition probability matrix to disk or print one of its rows. However, if the MH algorithm is run for long enough—until the Markov chain mixes—then the probability of being in a given state of the chain equals the probability of the associated sample. Thus, walking the Markov chain and recording states is, in the long run, like sampling from $\pi^{*}$.

This should be mind-blowing. It is not obvious. If these ideas are new to you, you may need to read the above paragraph twice. However, my go-to authors for excellent explanations of machine learning ideas—(Bishop, 2006; MacKay, 2003; Murphy, 2012)—as well as most blogs introduce MH by presenting just the algorithm's acceptance criterion with no justification for why the algorithm works. For example, after MacKay introduces notation and the acceptance criterion, he writes

It can be shown that for any positive $Q$ (that is, any $Q$ such that $Q(x, x^{\prime}) > 0$ for all $x, x^{\prime}$) as $t \rightarrow \infty$, the probability distribution of $x(t)$ tends to $P(x)$.

Above, $P$ is the target distribution (what I call $\pi^{*}$), $Q$ is the proposal distribution which proposes new samples, and $x(t)$ is the sample at step $t$. But this explanation completely elides the mind-blowing part: how is walking an implicit Markov chain the same as sampling from a target distribution, and how does the acceptance criterion ensure we're randomly walking according to the desired chain?

The goal of this post is to formally justify the algorithm. The notation and proof are based on (Chib & Greenberg, 1995). I assume the reader understands Markov chains. Please see my previous post for an introduction if needed.

Notation

Consider a Markov chain with transition kernel $P(x, A)$ where $x \in \mathbb{R}^d$ and $A$ is a subset of our sample space (technically, $A \in \mathcal{B}$ where $\mathcal{B}$ is the Borel $\sigma$-field on $\mathbb{R}^d$). In words, $P(x, A)$ is the conditional probability of moving from $x$ to a point in the set $A$. A transition kernel is the generalization of the transition matrix from finite state spaces. Naturally, $P(x, \mathbb{R}^d) = 1$, and self-loops are allowed, meaning that $P(x, \{x\})$ is not necessarily zero.

The stationary distribution of a Markov chain is defined as $\pi^{*}$ where

$$\pi^{*}(\text{d}y) = \pi(y)\,\text{d}y = \int_{\mathbb{R}^d} P(x, \text{d}y)\, \pi(x)\, \text{d}x.$$

This is just the continuous state-space analog of the discrete case. For a discrete Markov chain $\{X_n\}$ taking values in $D$ with transition matrix $\mathbf{P} = (p_{ij})_{i,j \in D}$, the stationary distribution satisfies

$$\boldsymbol{\pi}^{*} = \boldsymbol{\pi}^{*} \mathbf{P}$$

where $\boldsymbol{\pi}^{*}$ is a $|D|$-dimensional row vector. In words, the Markov chain has mixed when the probability of being in a given state no longer changes as we walk the chain.

The $n$-th iterate or $n$-th application of the transition kernel is given by

$$\begin{aligned} P^{(1)}(x, A) &= P(x, A), \\ P^{(n)}(x, A) &= \int_{\mathbb{R}^d} P^{(n-1)}(x, \text{d}y)\, P(y, A). \end{aligned}$$

As $n$ goes to infinity, the $n$-th iterate converges to the stationary distribution, or

$$\pi^{*}(A) = \lim_{n \rightarrow \infty} P^{(n)}(x, A).$$

The above is an alternative definition of $\pi^{*}$.
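To make the discrete analog concrete, here is a small numeric sketch (the two-state chain below is my own toy example, not anything from the references): the stationary distribution solves $\boldsymbol{\pi}^{*} = \boldsymbol{\pi}^{*} \mathbf{P}$, and the $n$-th iterate of the kernel converges to it from any starting state.

import numpy as np

# A toy two-state transition matrix; each row sums to one.
P = np.array([[0.9, 0.1],
              [0.5, 0.5]])

# The stationary distribution is the left eigenvector of P with eigenvalue one.
evals, evecs = np.linalg.eig(P.T)
pi = np.real(evecs[:, np.argmax(np.real(evals))])
pi = pi / pi.sum()

print(pi)                              # [0.8333..., 0.1666...]
print(pi @ P)                          # unchanged by P: pi* = pi* P
print(np.linalg.matrix_power(P, 50))   # every row converges to pi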

Implicit Markov chain construction

Markov chain Monte Carlo (MCMC) approaches the problem of sampling in a beautiful but nonobvious way. We want to sample from a target distribution $\pi^{*}$. Let's imagine that $\pi^{*}$ is the stationary distribution of a particular Markov chain. If we could randomly walk that Markov chain, then we could eventually sample from $\pi^{*}$. Thus, we need to construct a transition kernel $P(x, A)$ which converges to $\pi^{*}$ in the limit.

Suppose we represent the transition kernel as

$$P(x, \text{d}y) = p(x, y)\, \mathbf{1}(x \notin \text{d}y)\, \text{d}y + r(x)\, \mathbf{1}(x \in \text{d}y) \tag{1}$$

where $p(x, y)$ is some function, $\mathbf{1}(c)$ is an indicator random variable taking the value one if condition $c$ is true and zero otherwise, and $r(x)$ is defined as

$$r(x) = 1 - \int_{\mathbb{R}^d} p(x, y)\, \text{d}y.$$

Alternatively, we can write the kernel as

$$P(x, \text{d}y) = \begin{cases} 1 - \int_{\mathbb{R}^d} p(x, y)\, \text{d}y & \text{if } x \in \text{d}y, \\ p(x, y)\, \text{d}y & \text{otherwise}. \end{cases}$$

Thus, $r(x)$ is the probability that the Markov chain remains at $x$, and $\int_{\mathbb{R}^d} p(x, y)\, \text{d}y$ is not necessarily one because $r(x)$ is not necessarily zero.

Now consider the following reversibility constraint,

$$\pi(x)\, p(x, y) = \pi(y)\, p(y, x). \tag{2}$$

If $p(x, y)$ adheres to this constraint, then $\pi(\cdot)$ is the stationary distribution of $P(x, \cdot)$. To see this, consider the following derivation:

$$\begin{aligned} \int_{\mathbb{R}^d} P(x, A)\, \pi(x)\, \text{d}x &= \int_{\mathbb{R}^d} \Big[\int_A p(x, y)\, \mathbf{1}(x \notin \text{d}y)\, \text{d}y + r(x)\, \mathbf{1}(x \in \text{d}y) \Big] \pi(x)\, \text{d}x \\ &= \int_{\mathbb{R}^d} \Big[\int_A p(x, y)\, \mathbf{1}(x \notin \text{d}y)\, \text{d}y \Big] \pi(x)\, \text{d}x + \int_{\mathbb{R}^d} r(x)\, \mathbf{1}(x \in A)\, \pi(x)\, \text{d}x \\ &= \int_{\mathbb{R}^d} \Big[\int_A p(x, y)\, \mathbf{1}(x \notin \text{d}y)\, \text{d}y \Big] \pi(x)\, \text{d}x + \int_A r(x)\, \pi(x)\, \text{d}x \\ &= \int_A \Big[\int_{\mathbb{R}^d} p(x, y)\, \pi(x)\, \text{d}x \Big] \mathbf{1}(x \notin \text{d}y)\, \text{d}y + \int_A r(x)\, \pi(x)\, \text{d}x \\ &\stackrel{\star}{=} \int_A \Big[\int_{\mathbb{R}^d} p(y, x)\, \pi(y)\, \text{d}x \Big] \mathbf{1}(x \notin \text{d}y)\, \text{d}y + \int_A r(x)\, \pi(x)\, \text{d}x \\ &= \int_A (1 - r(y))\, \pi(y)\, \text{d}y + \int_A r(x)\, \pi(x)\, \text{d}x \\ &= \int_A \pi(x)\, \text{d}x. \end{aligned}$$

Step $\star$ is the key step. It only holds because of the reversibility constraint, and it is what allows the $r$ terms to cancel, leaving $\int_A \pi(y)\, \text{d}y$.

Let's review. We want to sample from some target distribution $\pi^{*}$. We imagine that this distribution is the stationary distribution of some Markov chain, but we don't know the chain's transition kernel $P(x, A)$. The above derivation demonstrates that if we define $P(x, A)$ as in Equation 1 and ensure that $p(x, y)$ adheres to the reversibility constraint in Equation 2, then we will have found the transition kernel for a chain whose stationary distribution is our target distribution. This is the essence of Metropolis–Hastings. As we will see, MH's acceptance criterion is constructed to ensure that the reversibility constraint is met.
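The discrete analog of this derivation is easy to check numerically. Below is a minimal sketch (the three-state target and the symmetric function $s$ are arbitrary choices of mine): the off-diagonal transition probabilities satisfy the reversibility constraint, the leftover mass $r$ sits on the diagonal, and the resulting kernel leaves $\pi$ unchanged.

import numpy as np

pi = np.array([0.5, 0.3, 0.2])          # toy target over three states

# For i != j, set p(i, j) = s(i, j) * pi[j] with s symmetric, so that
# pi[i] p(i, j) = s(i, j) pi[i] pi[j] = pi[j] p(j, i), i.e. Equation 2.
s = np.array([[0.0, 0.4, 0.4],
              [0.4, 0.0, 0.4],
              [0.4, 0.4, 0.0]])
P = s * pi[None, :]
P[np.diag_indices(3)] = 1 - P.sum(axis=1)   # r(i): probability of staying put

assert np.allclose(pi[:, None] * P, (pi[:, None] * P).T)   # reversibility holds
assert np.allclose(pi @ P, pi)                              # so pi is stationary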

Metropolis–Hastings

We want to construct the function $p(x, y)$ such that it is reversible. Consider a candidate-generating density, $q(y \mid x)$. This density, as in rejection sampling, generates candidate samples $y$ conditioned on $x$ that will be either rejected or accepted depending on some criterion. Note that if $x$ and $y$ are states, this is a Markov process because no past states are considered; the future only depends on the present. Since $q(\cdot)$ is a density, $\int q(y \mid x)\, \text{d}y = 1$. If the following were true,

$$q(y \mid x)\, \pi(x) = q(x \mid y)\, \pi(y),$$

then we're done: we've satisfied Equation 2. But most likely this is not the case. For example, we might find that

$$q(y \mid x)\, \pi(x) \geq q(x \mid y)\, \pi(y).$$

Our random process moves from $x$ to $y$ more often than from $y$ to $x$. MH ensures equilibrium (reversibility) by restricting some moves (samples) according to an acceptance criterion, $\alpha(x, y) \leq 1$. If a move is not made, the process remains at $x$. Intuitively, this is what allows us to balance $q(y \mid x)\, \pi(x)$ with $q(x \mid y)\, \pi(y)$. Let's assume that moves from $x$ to $y$ happen more often than the reverse. Then our criterion would be

$$\begin{aligned} q(y \mid x)\, \pi(x)\, \alpha(x, y) &= q(x \mid y)\, \pi(y) \\ \alpha(x, y) &= \frac{q(x \mid y)\, \pi(y)}{q(y \mid x)\, \pi(x)}. \end{aligned}$$

Of course, the imbalance could go in the other direction, but we can handle both cases with a single expression:

$$\alpha(x, y) = \min \Bigg\{ 1, \frac{q(x \mid y)\, \pi(y)}{q(y \mid x)\, \pi(x)} \Bigg\}. \tag{3}$$

If the process moves from $x$ to $y$ more often than the reverse, then the denominator is greater than the numerator, and the probability of accepting a move from $x$ to $y$ goes down. If the process moves from $y$ to $x$ more often than the reverse, then a proposed move from $x$ to $y$ is accepted with probability one.

We have found our function $p(x, y)$. It is

$$p_{\texttt{MH}}(x, y) = \alpha(x, y)\, q(y \mid x), \qquad x \neq y.$$

And while it is unnecessary to write down, for completeness the full transition kernel $P_{\texttt{MH}}(x, \text{d}y)$ is

$$P_{\texttt{MH}}(x, \text{d}y) = \overbrace{\alpha(x, y)\, q(y \mid x)\, \text{d}y}^{\text{prob. leaving } x} + \overbrace{\Big[ 1 - \int_{\mathbb{R}^d} \alpha(x, y)\, q(y \mid x)\, \text{d}y \Big] \mathbf{1}(x \in \text{d}y)}^{\text{prob. staying on } x}.$$

In summary, if we conditionally sample according to the density $q(y \mid x)$ and accept the proposed sample according to $\alpha(x, y)$, then we'll be randomly walking according to a Markov chain whose stationary distribution is $\pi^{*}$.
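As a quick sanity check, the sketch below (the unnormalized target and the asymmetric Gaussian proposal are arbitrary choices of mine) verifies numerically that $p_{\texttt{MH}}(x, y) = \alpha(x, y)\, q(y \mid x)$ satisfies the reversibility constraint in Equation 2.

import numpy as np

pi = lambda x: np.exp(-0.5 * x**2) * (1 + 0.5 * np.sin(3 * x))   # unnormalized target
q  = lambda y, x: np.exp(-0.5 * (y - 0.7 * x)**2)                # q(y | x) up to a constant that cancels

def alpha(x, y):
    # Acceptance probability from Equation 3.
    return min(1.0, (q(x, y) * pi(y)) / (q(y, x) * pi(x)))

x, y = -0.3, 1.4
lhs = pi(x) * q(y, x) * alpha(x, y)   # pi(x) p_MH(x, y)
rhs = pi(y) * q(x, y) * alpha(y, x)   # pi(y) p_MH(y, x)
print(np.isclose(lhs, rhs))           # True: detailed balance holds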

Metropolis

Notice that for a symmetric candidate-generating density, i.e. $q(y \mid x) = q(x \mid y)$, Equation 3 simplifies to

$$\alpha(x, y) = \min \Bigg\{ 1, \frac{\pi(y)}{\pi(x)} \Bigg\}. \tag{4}$$

This is the predecessor to Metropolis–Hastings: the Metropolis algorithm, introduced by Nicholas Metropolis and colleagues in 1953 (Metropolis et al., 1953). In 1970, Wilfred Hastings extended Equation 4 to the more general asymmetric case in Equation 3 (Hastings, 1970). For example, if $q(y \mid x)$ is a Gaussian centered at $x$, the proposal is symmetric, and running the Metropolis–Hastings algorithm is equivalent to running the Metropolis algorithm.

Example: Rosenbrock density

Imagine we want to sample from the Rosenbrock density (Figure 1),

$$\pi^{*}(x_1, x_2; a, b) \propto \exp\Big\{-\frac{(a - x_1)^2 + b(x_2 - x_1^2)^2}{20}\Big\}. \tag{5}$$

The Rosenbrock function (Rosenbrock, 1960) is a well-known test function in optimization: while finding a minimum is relatively easy, finding the global minimum at $(1, 1)$ is less trivial. (Goodman & Weare, 2010) adapted the function to serve as a benchmark for MCMC algorithms.

Figure 1. The Rosenbrock density (Equation 5) with $a = 1$ and $b = 100$. Darker colors indicate higher probability. The global minimum is at $(x, y) = (a, a^2) = (1, 1)$ and denoted with a red "X".

We must provide a candidate-generating density $q(y \mid x)$. The only requirements are that we can sample from it conditionally and that $\int q(y \mid x)\, \text{d}y = 1$ for all $x$. Perhaps the simplest choice is to add Gaussian noise to $x$. Thus, our candidate distribution is

$$y \mid x \sim \mathcal{N}(x, \sigma^2), \qquad \text{i.e.} \qquad y = x + \varepsilon, \quad \varepsilon \sim \mathcal{N}(0, \sigma^2),$$

where the variance $\sigma^2$ controls how big each step is. Since this density is symmetric, we can implement the simpler Metropolis algorithm. We can accept a proposed sample with probability $\alpha(x, y)$ by drawing a uniform random variable on $[0, 1]$ and checking whether it is less than the acceptance criterion. For our setup, we have:

import numpy as np
from   numpy.random import multivariate_normal as mvn
import matplotlib.pyplot as plt

n_iters    = 1000
samples    = np.empty((n_iters, 2))
samples[0] = np.random.uniform(low=[-3, -3], high=[3, 10], size=2)

# Unnormalized Rosenbrock density (Equation 5) with a=1, b=100.
rosen      = lambda x, y: np.exp(-((1 - x)**2 + 100*(y - x**2)**2) / 20)

for i in range(1, n_iters):
    curr  = samples[i-1]
    # Propose by adding Gaussian noise to the current state (symmetric proposal).
    prop  = curr + mvn(np.zeros(2), np.eye(2) * 0.1)
    # Metropolis acceptance probability (Equation 4); ratios above one always accept.
    alpha = rosen(*prop) / rosen(*curr)
    if np.random.uniform() < alpha:
        curr = prop
    # If the proposal is rejected, the chain stays put and records the current state again.
    samples[i] = curr

# Trace of the chain through the two-dimensional state space.
plt.plot(samples[:, 0], samples[:, 1])
plt.show()

That's it. Despite its conceptual depth, Metropolis–Hastings is surprisingly simple to implement, and it is not hard to imagine writing a more general implementation that handles Equation 3 for any arbitrary $\pi$ and $q(\cdot)$.
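For instance, a general sampler might look something like the sketch below (my own illustrative version; the argument names are hypothetical): it takes any log target, a proposal sampler, and the proposal's log density, and applies Equation 3 with the asymmetric correction.

import numpy as np

def metropolis_hastings(log_pi, sample_q, log_q, x0, n_iters):
    """Sketch of a general MH sampler.

    log_pi:   log of the (unnormalized) target density.
    sample_q: draws a proposal y given the current state x.
    log_q:    evaluates log q(y | x).
    """
    samples = [np.asarray(x0, dtype=float)]
    for _ in range(n_iters - 1):
        x = samples[-1]
        y = sample_q(x)
        # Equation 3 in log space; the log_q terms cancel for symmetric proposals.
        log_alpha = (log_pi(y) + log_q(x, y)) - (log_pi(x) + log_q(y, x))
        if np.log(np.random.uniform()) < log_alpha:
            x = y
        samples.append(x)
    return np.array(samples)

# Example usage on the Rosenbrock density with a symmetric Gaussian proposal.
log_rosen = lambda v: -((1 - v[0])**2 + 100 * (v[1] - v[0]**2)**2) / 20
sample_q  = lambda x: x + np.random.normal(scale=np.sqrt(0.1), size=2)
log_q     = lambda y, x: 0.0   # symmetric, so any constant works here
samples   = metropolis_hastings(log_rosen, sample_q, log_q, x0=[0.0, 0.0], n_iters=1000)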

Figure 2. Three randomly initialized Markov chains run on the Rosenbrock density (Equation 5) using the Metropolis–Hastings algorithm. After mixing, each chain walks regions where the probability is high. The global minimum is at $(x, y) = (a, a^2) = (1, 1)$ and denoted with a black "X".

The Rosenbrock example code above is the basis for Figure 2, which runs three Markov chains from randomly initialized starting points. This example highlights two important implementation details for Metropolis–Hastings. First, the step size (or $\sigma^2$ for our candidate distribution) matters: if the step size were too large relative to the support of the Rosenbrock density, it would be difficult to sample near the distribution's mode. Second, a so-called burn-in period, in which initial samples are discarded, is typically used, because samples drawn before the chain starts to mix are not representative of the underlying distribution.
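Continuing from the Rosenbrock example code, handling both details might look something like this (the numbers are illustrative, not tuned):

burn_in = 200                 # discard draws made before the chain has mixed
kept    = samples[burn_in:]

# A rough proxy for tuning the step size: the fraction of proposals accepted.
# Very low values suggest shrinking sigma^2; values near one suggest growing it.
accept_rate = np.mean(np.any(np.diff(samples, axis=0) != 0, axis=1))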

Conclusion

Metropolis–Hastings is a beautifully simple algorithm based on a truly original idea. We have these mathematical objects called Markov chains that, when ergodic, converge to their respective stationary distributions. We turn this idea on its head: we imagine that our target distribution is the stationary distribution of some chain, and then implicitly construct that chain on the fly. The acceptance criterion ensures that the transition kernel will, in the long run, induce the stationary distribution. Despite the algorithm being based on many deep ideas—think about how much intellectual machinery is embedded in this single paragraph—implementing it takes fewer than a dozen lines of code.

 

Acknowledgements

I thank Wei Deng for catching a couple of typos around step $\star$.

  1. Bishop, C. M. (2006). Pattern Recognition and Machine Learning.
  2. MacKay, D. J. C. (2003). Information theory, inference and learning algorithms. Cambridge University Press.
  3. Murphy, K. P. (2012). Machine learning: a probabilistic perspective. MIT Press.
  4. Chib, S., & Greenberg, E. (1995). Understanding the Metropolis–Hastings algorithm. The American Statistician, 49(4), 327–335.
  5. Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H., & Teller, E. (1953). Equation of state calculations by fast computing machines. The Journal of Chemical Physics, 21(6), 1087–1092.
  6. Hastings, W. K. (1970). Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57(1), 97–109.
  7. Rosenbrock, H. H. (1960). An automatic method for finding the greatest or least value of a function. The Computer Journal, 3(3), 175–184.
  8. Goodman, J., & Weare, J. (2010). Ensemble samplers with affine invariance. Communications in Applied Mathematics and Computational Science, 5(1), 65–80.