The ELBO in Variational Inference

I derive the evidence lower bound (ELBO) in variational inference and explore its relationship to the objective in expectation–maximization and the variational autoencoder.

Variational inference

In Bayesian inference, we are often interested in the posterior distribution $p(\mathbf{Z} \mid \mathbf{X})$, where $\mathbf{X} = \{\mathbf{x}_1, \dots, \mathbf{x}_N\}$ are our observations and $\mathbf{Z} = \{\mathbf{z}_1, \dots, \mathbf{z}_N\}$ are latent variables. However, in many practical models of interest, this posterior is intractable because we cannot compute the evidence, the denominator of Bayes' theorem, $p(\mathbf{X})$. This evidence is hard to compute because we have introduced latent variables that must now be marginalized out. Such integrals are often intractable in the sense that (1) we do not have an analytic expression for them or (2) they are computationally intractable. See my previous post on HMMs or (Blei et al., 2017) for examples.

The main idea of variational inference (VI) is to use optimization to find a simpler or more tractable distribution $q(\mathbf{Z})$ from a family of distributions $\mathcal{Q}$ such that it is close to the desired posterior distribution $p(\mathbf{Z} \mid \mathbf{X})$ (Figure 1). In VI, we define "close to" using the Kullback–Leibler (KL) divergence. Thus, the desired VI objective is

$$
q^*(\mathbf{Z}) = \arg\!\min_{q(\mathbf{Z}) \in \mathcal{Q}} D_{\text{KL}}[q(\mathbf{Z}) \lVert p(\mathbf{Z} \mid \mathbf{X})]. \tag{1}
$$

Minimizing the KL divergence can be interpreted as minimizing the relative entropy between the two distributions. See my previous post on the KL divergence for a discussion.
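
To make the notion of "close to" concrete, here is a minimal numerical sketch (the function name and values are my own, not from this post) of the kind of quantity Eq. 1 asks us to minimize: the closed-form KL divergence between two univariate Gaussians.

```python
import jax.numpy as jnp

def kl_gauss(m_q, s_q, m_p, s_p):
    """Closed-form KL[N(m_q, s_q^2) || N(m_p, s_p^2)]."""
    return jnp.log(s_p / s_q) + (s_q**2 + (m_q - m_p)**2) / (2 * s_p**2) - 0.5

# A candidate q that matches the target exactly has zero divergence;
# any mismatch in mean or scale is penalized.
print(kl_gauss(0.0, 1.0, 0.0, 1.0))  # 0.0
print(kl_gauss(0.5, 2.0, 0.0, 1.0))  # ~0.93
```

Of course, in VI we typically cannot evaluate this divergence against the true posterior directly, which is exactly the problem the next section addresses.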

Figure 1. Diagram of VI. A family of distributions $\mathcal{Q}$ is visualized as a blob. VI starts with some initial distribution $q^{(0)}(\mathbf{Z}) \in \mathcal{Q}$ and then iteratively minimizes the KL divergence between the approximating distribution at iteration $t$, call this $q^{(t)}(\mathbf{Z})$, and the desired posterior $p(\mathbf{Z} \mid \mathbf{X})$. The goal is to find an optimal distribution $q^{*}(\mathbf{Z})$, where optimality is defined as having the smallest possible KL divergence between $q^{*}(\mathbf{Z})$ and $p(\mathbf{Z} \mid \mathbf{X})$.

Evidence lower bound

The main challenge with the variational inference objective in Eq. 1 is that it implicitly depends on the evidence, $p(\mathbf{X})$. Thus, we have not yet gotten around the intractability discussed above. To see this dependence, let's write out the definition of the KL divergence:

$$
\begin{aligned}
D_{\text{KL}}[q(\mathbf{Z}) \lVert p(\mathbf{Z} \mid \mathbf{X})]
&= \int q(\mathbf{Z}) \log \frac{q(\mathbf{Z})}{p(\mathbf{Z} \mid \mathbf{X})} \, d\mathbf{Z}
\\
&= \mathbb{E}_{q(\mathbf{Z})}\left[ \log \frac{q(\mathbf{Z})}{p(\mathbf{Z} \mid \mathbf{X})} \right]
\\
&= \underbrace{\mathbb{E}_{q(\mathbf{Z})}[\log q(\mathbf{Z})] - \mathbb{E}_{q(\mathbf{Z})}[\log p(\mathbf{Z}, \mathbf{X})]}_{-\text{ELBO}(q)} + \log p(\mathbf{X}).
\end{aligned} \tag{2}
$$

Because we cannot compute the desired KL divergence, we optimize a different objective that is equivalent to this KL divergence up to an additive constant. This new objective is called the evidence lower bound or ELBO:

$$
\text{ELBO}(q) := \mathbb{E}_{q(\mathbf{Z})}[\log p(\mathbf{Z}, \mathbf{X})] - \mathbb{E}_{q(\mathbf{Z})}[\log q(\mathbf{Z})]. \tag{3}
$$
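
As a concrete illustration of Eq. 3, here is a hedged Monte Carlo sketch for a toy model of my own choosing: prior $z \sim \mathcal{N}(0, 1)$, likelihood $x \mid z \sim \mathcal{N}(z, 1)$, and a Gaussian approximation $q(z) = \mathcal{N}(m, s^2)$. Both expectations are taken under $q$, so we can estimate the ELBO by averaging $\log p(z, x) - \log q(z)$ over samples $z \sim q$:

```python
import jax
import jax.numpy as jnp
from jax.scipy.stats import norm

def elbo_mc(key, x, m, s, num_samples=10_000):
    """Monte Carlo estimate of Eq. 3 for the toy model above."""
    z = m + s * jax.random.normal(key, (num_samples,))             # z ~ q(z)
    log_joint = norm.logpdf(z, 0.0, 1.0) + norm.logpdf(x, z, 1.0)  # log p(z, x)
    log_q = norm.logpdf(z, m, s)                                   # log q(z)
    return jnp.mean(log_joint - log_q)

key = jax.random.PRNGKey(0)
# With q set to the exact posterior N(0.5, 0.5) for x = 1, the estimate
# approaches log p(x) = log N(1; 0, 2) ~ -1.52 (see Eq. 4 below).
print(elbo_mc(key, x=1.0, m=0.5, s=jnp.sqrt(0.5)))
```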

This is the negation of the first two terms on the right-hand side of Eq. 2 (the underbraced terms). We can rewrite Eq. 2 as

$$
\log p(\mathbf{X}) = \text{ELBO}(q) + D_{\text{KL}}[q(\mathbf{Z}) \lVert p(\mathbf{Z} \mid \mathbf{X})]. \tag{4}
$$

Why is the ELBO so-named? Since the KL divergence is non-negative, we know

$$
\log p(\mathbf{X}) \geq \text{ELBO}(q). \tag{5}
$$

In other words, the log evidence $\log p(\mathbf{X})$, a fixed quantity for any set of observations $\mathbf{X}$, cannot be less than the ELBO. Because the left-hand side of Eq. 4 does not depend on $q$, maximizing the ELBO is equivalent to minimizing the desired KL divergence. This is VI in a nutshell.
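
To see Eq. 4 and Eq. 5 in action, here is a sketch for the same toy conjugate model as above (again, the model and names are my own choices): prior $z \sim \mathcal{N}(0, 1)$ and likelihood $x \mid z \sim \mathcal{N}(z, 1)$, for which the evidence, the exact posterior, and the ELBO of a Gaussian $q(z) = \mathcal{N}(m, s^2)$ are all available in closed form.

```python
import jax.numpy as jnp

def log_evidence(x):
    # Marginalizing z out of p(x | z) p(z) gives p(x) = N(x; 0, 2).
    return -0.5 * jnp.log(2 * jnp.pi * 2.0) - x**2 / 4.0

def elbo(x, m, s):
    # E_q[log p(z)] + E_q[log p(x | z)] - E_q[log q(z)] for q(z) = N(m, s^2).
    e_log_prior = -0.5 * jnp.log(2 * jnp.pi) - 0.5 * (m**2 + s**2)
    e_log_lik = -0.5 * jnp.log(2 * jnp.pi) - 0.5 * ((x - m)**2 + s**2)
    e_log_q = -0.5 * jnp.log(2 * jnp.pi * s**2) - 0.5   # negative Gaussian entropy
    return e_log_prior + e_log_lik - e_log_q

def kl_to_posterior(x, m, s):
    # The exact posterior is N(x / 2, 1 / 2); Gaussian-to-Gaussian KL in closed form.
    mp, sp = x / 2.0, jnp.sqrt(0.5)
    return jnp.log(sp / s) + (s**2 + (m - mp)**2) / (2 * sp**2) - 0.5

x, m, s = 1.0, 0.2, 0.8
print(log_evidence(x))                               # ~ -1.52, fixed for this x
print(elbo(x, m, s) + kl_to_posterior(x, m, s))      # same value: Eq. 4
print(bool(elbo(x, m, s) <= log_evidence(x)))        # True: Eq. 5
```

Maximizing the ELBO over $(m, s)$ drives the KL term to zero, at which point $q$ is the exact posterior and the bound is tight.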

Relationship to EM

It’s fun to observe the relationship between VI and expectation–maximization (EM). EM maximizes the expected log likelihood when $q(\mathbf{Z}) = p(\mathbf{Z} \mid \mathbf{X})$, i.e.

$$
\log p(\mathbf{X}) = \overbrace{\vphantom{\Big|}\text{ELBO}(q)}^{\text{EM maximizes this}} + \overbrace{\vphantom{\Big|} D_{\text{KL}}[q(\mathbf{Z}) \lVert p(\mathbf{Z} \mid \mathbf{X})]}^{\text{Since this is zero}}. \tag{6}
$$

To be a bit more pedantic, at iteration $t$ with current parameter estimates $\boldsymbol{\theta}^{(t)}$, EM maximizes this expected complete log likelihood inside the ELBO:

$$
\log p(\mathbf{X} \mid \boldsymbol{\theta}) = \overbrace{\mathbb{E}_{p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta}^{(t)})} \left[ \log p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta}) \right]}^{\text{EM maximizes this}} - \mathbb{E}_{p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta}^{(t)})} \left[ \log p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta}) \right]. \tag{7}
$$

See my previous post on EM for why EM ignores the second term on the right-hand side of Eq. 7.
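
As a small worked example of Eqs. 6–7 (a sketch under my own modeling choices, not taken from that post), here is one EM iteration for a two-component, one-dimensional Gaussian mixture with known unit variances and equal weights, where $\boldsymbol{\theta} = (\mu_0, \mu_1)$. The E-step sets $q(\mathbf{Z})$ to the exact posterior $p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta}^{(t)})$ (the responsibilities), and the M-step maximizes the expected complete log likelihood, which for this model is solved by responsibility-weighted means:

```python
import jax.numpy as jnp
from jax.scipy.special import logsumexp
from jax.scipy.stats import norm

x = jnp.array([-2.1, -1.9, -2.3, 1.8, 2.2, 2.0])      # toy observations
mu = jnp.array([-1.0, 1.0])                            # current estimate theta^(t)

# E-step: q(z_n) = p(z_n | x_n, theta^(t)), i.e. the responsibility of each
# component for each point (equal mixture weights cancel).
log_lik = norm.logpdf(x[:, None], mu[None, :], 1.0)    # shape (N, 2)
resp = jnp.exp(log_lik - logsumexp(log_lik, axis=1, keepdims=True))

# M-step: maximize E_q[log p(x, z | theta)] over theta. With unit variances,
# the maximizer is the responsibility-weighted mean of the data.
mu_new = (resp * x[:, None]).sum(axis=0) / resp.sum(axis=0)
print(mu_new)   # moves toward the two cluster means, roughly (-2, 2)
```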

Gradient-based VI

VI is a framework, and there are a variety of different ways to optimize this ELBO. (See (Blei et al., 2017) for a much deeper discussion.) A popular approach in deep generative modeling is to use gradient-based optimization of the ELBO. Describing a low-variance, gradient-based estimator of the ELBO is a main contribution of (Kingma & Welling, 2013). This allows us to approximate the posterior with more flexible density estimators, such as neural networks. The most famous example of gradient-based VI is probably the variational autoencoder. See (Kingma & Welling, 2013) or my previous post on the reparameterization trick for details.
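
As a sketch of what gradient-based VI can look like in practice (my own toy setup, not the VAE architecture of Kingma & Welling), here is a reparameterized Monte Carlo estimate of the ELBO for the conjugate Gaussian model used above, with $q(z) = \mathcal{N}(m, s^2)$ parameterized by $(m, \log s)$. Writing $z = m + s\,\epsilon$ with $\epsilon \sim \mathcal{N}(0, 1)$ lets gradients flow through the samples, so we can follow the gradient of the negative ELBO directly:

```python
import jax
import jax.numpy as jnp
from jax.scipy.stats import norm

def neg_elbo(params, key, x, num_samples=256):
    m, log_s = params
    eps = jax.random.normal(key, (num_samples,))
    z = m + jnp.exp(log_s) * eps                         # reparameterized z ~ q
    log_joint = norm.logpdf(z, 0.0, 1.0) + norm.logpdf(x, z, 1.0)
    log_q = norm.logpdf(z, m, jnp.exp(log_s))
    return -jnp.mean(log_joint - log_q)                  # negative Monte Carlo ELBO

params = jnp.array([0.0, 0.0])                           # initial (m, log s)
key = jax.random.PRNGKey(0)
grad_fn = jax.jit(jax.grad(neg_elbo))
for _ in range(500):                                     # plain gradient descent
    key, subkey = jax.random.split(key)
    params = params - 0.01 * grad_fn(params, subkey, 1.0)
print(params)   # approaches the exact posterior: m = 0.5, log s = 0.5 * log(0.5)
```

In a VAE, the same idea is applied with the parameters of $q$ produced by a neural network and the objective averaged over a dataset of observations.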

  1. Blei, D. M., Kucukelbir, A., & McAuliffe, J. D. (2017). Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518), 859–877.
  2. Kingma, D. P., & Welling, M. (2013). Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.