Expectation–Maximization

For many latent variable models, maximizing the complete log likelihood is easier than maximizing the log likelihood. The expectation–maximization (EM) algorithm leverages this fact to construct and optimize a tight lower bound. I rederive EM.

Consider a probabilistic model with data $\mathbf{x} = \{x_1, \dots, x_N\}$, latent variables $\mathbf{z} = \{z_1, \dots, z_N\}$, and parameters $\boldsymbol{\theta}$. To perform maximum likelihood estimation, we need to compute the log likelihood $\log p(\mathbf{x} \mid \boldsymbol{\theta})$. However, our modeling assumption is that the data depend on latent variables $\mathbf{z}$ that we never observe. We can handle these latent variables by marginalizing them out,

\log p(\mathbf{x} \mid \boldsymbol{\theta}) = \log \sum_{\mathbf{z}} p(\mathbf{x}, \mathbf{z} \mid \boldsymbol{\theta}). \tag{1}

However, this may be intractable. For example, in a hidden Markov model with $N$ observations whose latent variables each take one of $K$ states, the sum in Equation 1 has $K^N$ terms.
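To make this concrete, here is the hidden Markov model case written out, assuming the standard HMM factorization into initial, transition, and emission distributions. The marginal likelihood sums over every joint configuration of the latent states, while the complete log likelihood is just a short sum of tractable terms:

\begin{aligned} \log p(\mathbf{x} \mid \boldsymbol{\theta}) &= \log \sum_{z_1=1}^{K} \cdots \sum_{z_N=1}^{K} p(\mathbf{x}, \mathbf{z} \mid \boldsymbol{\theta}) && \text{($K^N$ terms)} \\ \log p(\mathbf{x}, \mathbf{z} \mid \boldsymbol{\theta}) &= \log p(z_1) + \sum_{n=2}^{N} \log p(z_n \mid z_{n-1}) + \sum_{n=1}^{N} \log p(x_n \mid z_n) && \text{($2N$ terms)} \end{aligned}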

There is a standard solution to this general problem: expectation–maximization, or EM (Dempster et al., 1977). It relies on the fact that in many statistical problems, maximizing the complete log likelihood $\log p(\mathbf{x}, \mathbf{z} \mid \boldsymbol{\theta})$ is actually easier than maximizing the log likelihood. Rather than optimizing Equation 1 directly, EM iteratively optimizes a lower bound. As we will see, this lower bound can be made tight, meaning it equals the log likelihood for a particular choice of the bounding distribution, and increasing the lower bound guarantees that the log likelihood does not decrease.

First, let’s derive the lower bound,

\begin{aligned} \log p(\mathbf{x} \mid \boldsymbol{\theta}) &= \log \sum_{\mathbf{z}} p(\mathbf{x}, \mathbf{z} \mid \boldsymbol{\theta}) \\ &= \log \sum_{\mathbf{z}} q(\mathbf{z}) \frac{p(\mathbf{x}, \mathbf{z} \mid \boldsymbol{\theta})}{q(\mathbf{z})} \\ &= \log \Big( \mathbb{E}_{q(\mathbf{z})}\Big[\frac{p(\mathbf{x}, \mathbf{z} \mid \boldsymbol{\theta})}{q(\mathbf{z})}\Big] \Big) \\ &\geq \mathbb{E}_{q(\mathbf{z})} \Big[ \log \Big( \frac{p(\mathbf{x}, \mathbf{z} \mid \boldsymbol{\theta})}{q(\mathbf{z})} \Big)\Big]. \end{aligned}

To see why the inequality holds, let $f(a) = \log(a)$. Since $f$ is a concave function, we can invoke Jensen’s inequality, $f(\mathbb{E}[a]) \geq \mathbb{E}[f(a)]$. Please see my previous post for a proof, but note that the inequality is reversed if $f(\cdot)$ is convex. We can then use $\log(a/b) = \log(a) - \log(b)$ and the linearity of expectation to write,

\log p(\mathbf{x} \mid \boldsymbol{\theta}) \geq \underbrace{\phantom{\Big|}\mathbb{E}_{q(\mathbf{z})} \Big[ \log p(\mathbf{x}, \mathbf{z} \mid \boldsymbol{\theta})\Big]}_{\text{Expected complete log likelihood}} \underbrace{\phantom{\Big|}- \mathbb{E}_{q(\mathbf{z})} \Big[ \log q(\mathbf{z}) \Big]}_{\text{Entropy of $q$}}. \tag{2}
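As a quick sanity check, the following sketch verifies Equation 2 numerically on a toy model with one discrete latent variable per datum: a two-component Gaussian mixture with made-up parameters. The setup and variable names (`weights`, `means`, `q`) are my own illustrative assumptions, not part of the derivation above.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Toy two-component Gaussian mixture with made-up parameters.
weights = np.array([0.3, 0.7])   # p(z = k)
means   = np.array([-2.0, 3.0])  # means of p(x | z = k), unit variances
x       = rng.normal(3.0, 1.0, size=10)

# Exact log likelihood: log sum_z p(x, z | theta), computed per datum.
joint   = weights * norm.pdf(x[:, None], means, 1.0)  # p(x_n, z_n = k | theta)
log_lik = np.log(joint.sum(axis=1)).sum()

# Lower bound (Equation 2) for an arbitrary density q over each z_n.
q = np.array([0.5, 0.5])
elbo = (q * (np.log(joint) - np.log(q))).sum(axis=1).sum()

print(log_lik, elbo)          # The bound holds: elbo <= log_lik.
assert elbo <= log_lik + 1e-9
```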

In the derivation above, the only requirement on $q(\mathbf{z})$ is that it is a density over the latent variables. However, which density should we choose if we want a tight lower bound? Since $\log(\cdot)$ is strictly concave, Jensen’s inequality holds with equality precisely when the quantity inside the expectation is constant, i.e. when

\frac{p(\mathbf{x}, \mathbf{z} \mid \boldsymbol{\theta})}{q(\mathbf{z})} = \text{non-random $c$}.

For this ratio to be constant in $\mathbf{z}$, it must be the case that $q(\mathbf{z}) \propto p(\mathbf{x}, \mathbf{z} \mid \boldsymbol{\theta})$. And since $q(\mathbf{z})$ is a density and must normalize to one, we have

\begin{aligned} q(\mathbf{z}) &= \frac{p(\mathbf{x}, \mathbf{z} \mid \boldsymbol{\theta})}{\sum_{\mathbf{z}^{\prime}} p(\mathbf{x}, \mathbf{z}^{\prime} \mid \boldsymbol{\theta})} \\ &= \frac{p(\mathbf{x}, \mathbf{z} \mid \boldsymbol{\theta})}{p(\mathbf{x} \mid \boldsymbol{\theta})} \\ &= p(\mathbf{z} \mid \mathbf{x}, \boldsymbol{\theta}). \end{aligned}

Thus, we have found the ideal $q(\mathbf{z})$ for our lower bound on the log likelihood: the posterior over the latent variables. With this choice, the inequality in Equation 2 becomes an equality,

\log p(\mathbf{x} \mid \boldsymbol{\theta}) = \mathbb{E}_{p(\mathbf{z} \mid \mathbf{x}, \boldsymbol{\theta})} \big[ \log p(\mathbf{x}, \mathbf{z} \mid \boldsymbol{\theta})\big] - \mathbb{E}_{p(\mathbf{z} \mid \mathbf{x}, \boldsymbol{\theta})} \big[ \log p(\mathbf{z} \mid \mathbf{x}, \boldsymbol{\theta}) \big].
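For a concrete instance of this optimal $q(\mathbf{z})$, consider a $K$-component Gaussian mixture with mixing weights $\pi_k$, means $\mu_k$, and variances $\sigma_k^2$ (notation mine). Because the data are independent given the latents, the posterior factorizes into per-datum “responsibilities”:

q(\mathbf{z}) = \prod_{n=1}^{N} p(z_n \mid x_n, \boldsymbol{\theta}), \qquad p(z_n = k \mid x_n, \boldsymbol{\theta}) = \frac{\pi_k \, \mathcal{N}(x_n \mid \mu_k, \sigma_k^2)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_n \mid \mu_j, \sigma_j^2)}.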

We are almost ready to state the EM algorithm. However, since EM is iterative, let’s first introduce some notation indexed by the iteration $t$,

\begin{aligned} \log p(\mathbf{x} \mid \boldsymbol{\theta}) &= \mathbb{E}_{p(\mathbf{z} \mid \mathbf{x}, \boldsymbol{\theta}^{(t)})} \big[ \log p(\mathbf{x}, \mathbf{z} \mid \boldsymbol{\theta})\big] - \mathbb{E}_{p(\mathbf{z} \mid \mathbf{x}, \boldsymbol{\theta}^{(t)})} \big[ \log p(\mathbf{z} \mid \mathbf{x}, \boldsymbol{\theta}) \big] \\ &= Q(\boldsymbol{\theta} \mid \boldsymbol{\theta}^{(t)}) + H(\boldsymbol{\theta} \mid \boldsymbol{\theta}^{(t)}) \end{aligned}

where $Q(\cdot)$ is the first expectation and $H(\cdot)$ is the negative of the second. The notation $Q(\boldsymbol{\theta} \mid \boldsymbol{\theta}^{(t)})$ denotes a function of $\boldsymbol{\theta}$ in which the expectation is taken with respect to the latent variables' posterior at a fixed parameter value $\boldsymbol{\theta}^{(t)}$.

Now carefully consider the following reasoning. By Gibbs’ inequality, we know that

H(\boldsymbol{\theta} \mid \boldsymbol{\theta}^{(t)}) \geq H(\boldsymbol{\theta}^{(t)} \mid \boldsymbol{\theta}^{(t)}).

This is intuitive. The cross entropy (left-hand side) is always greater than or equal to the entropy (right-hand side); encoding one distribution with another can only increase our measure of “surprise.” Alternatively (and this is how many other authors explain EM), you can note that $H(\boldsymbol{\theta} \mid \boldsymbol{\theta}^{(t)}) - H(\boldsymbol{\theta}^{(t)} \mid \boldsymbol{\theta}^{(t)})$ is a Kullback–Leibler (KL) divergence, which is nonnegative. Please see my previous post on entropy and the KL divergence if these ideas are not clear.
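Written out, with both expectations taken under the fixed posterior $p(\mathbf{z} \mid \mathbf{x}, \boldsymbol{\theta}^{(t)})$, this difference is exactly a KL divergence:

\begin{aligned} H(\boldsymbol{\theta} \mid \boldsymbol{\theta}^{(t)}) - H(\boldsymbol{\theta}^{(t)} \mid \boldsymbol{\theta}^{(t)}) &= \mathbb{E}_{p(\mathbf{z} \mid \mathbf{x}, \boldsymbol{\theta}^{(t)})} \Big[ \log \frac{p(\mathbf{z} \mid \mathbf{x}, \boldsymbol{\theta}^{(t)})}{p(\mathbf{z} \mid \mathbf{x}, \boldsymbol{\theta})} \Big] \\ &= \mathrm{KL}\big( p(\mathbf{z} \mid \mathbf{x}, \boldsymbol{\theta}^{(t)}) \,\|\, p(\mathbf{z} \mid \mathbf{x}, \boldsymbol{\theta}) \big) \geq 0. \end{aligned}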

The above logic implies,

\begin{aligned} \log p(\mathbf{x} \mid \boldsymbol{\theta}) - \log p(\mathbf{x} \mid \boldsymbol{\theta}^{(t)}) &= Q(\boldsymbol{\theta} \mid \boldsymbol{\theta}^{(t)}) - Q(\boldsymbol{\theta}^{(t)} \mid \boldsymbol{\theta}^{(t)}) + H(\boldsymbol{\theta} \mid \boldsymbol{\theta}^{(t)}) - H(\boldsymbol{\theta}^{(t)} \mid \boldsymbol{\theta}^{(t)}) \\ &\geq Q(\boldsymbol{\theta} \mid \boldsymbol{\theta}^{(t)}) - Q(\boldsymbol{\theta}^{(t)} \mid \boldsymbol{\theta}^{(t)}). \tag{3} \end{aligned}

The inequality in Equation 3 demonstrates that any choice of $\boldsymbol{\theta}$ that increases $Q(\boldsymbol{\theta} \mid \boldsymbol{\theta}^{(t)})$ above $Q(\boldsymbol{\theta}^{(t)} \mid \boldsymbol{\theta}^{(t)})$ increases the log likelihood $\log p(\mathbf{x} \mid \boldsymbol{\theta})$ above $\log p(\mathbf{x} \mid \boldsymbol{\theta}^{(t)})$ by at least as much.
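In particular, setting $\boldsymbol{\theta}^{(t+1)}$ to any maximizer of $Q(\cdot \mid \boldsymbol{\theta}^{(t)})$ gives a monotone ascent property,

\log p(\mathbf{x} \mid \boldsymbol{\theta}^{(t+1)}) \geq \log p(\mathbf{x} \mid \boldsymbol{\theta}^{(t)}) + \Big[ Q(\boldsymbol{\theta}^{(t+1)} \mid \boldsymbol{\theta}^{(t)}) - Q(\boldsymbol{\theta}^{(t)} \mid \boldsymbol{\theta}^{(t)}) \Big] \geq \log p(\mathbf{x} \mid \boldsymbol{\theta}^{(t)}),

since the bracketed term is nonnegative when $\boldsymbol{\theta}^{(t+1)}$ maximizes $Q(\cdot \mid \boldsymbol{\theta}^{(t)})$.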

Thus, EM works by iteratively optimizing the expected complete log likelihood $Q(\cdot)$ rather than $\log p(\mathbf{x} \mid \boldsymbol{\theta})$. It consists of two eponymous steps:

\begin{aligned} \textbf{E-step:} && Q(\boldsymbol{\theta} \mid \boldsymbol{\theta}^{(t)}) &= \mathbb{E}_{p(\mathbf{z} \mid \mathbf{x}, \boldsymbol{\theta}^{(t)})} \big[ \log p(\mathbf{x}, \mathbf{z} \mid \boldsymbol{\theta})\big] \\ \textbf{M-step:} && \boldsymbol{\theta}^{(t+1)} &= \arg\!\max_{\boldsymbol{\theta}} Q(\boldsymbol{\theta} \mid \boldsymbol{\theta}^{(t)}). \end{aligned}

The E-step is so-called because it constructs the expectation of the complete log likelihood. The M-step is so-called because it then maximizes that quantity. Intuitively we can think of the E-step as constructing the desired lower bound, and the M-step as optimizing that bound.
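To make the two steps concrete, here is a minimal sketch of EM for a two-component Gaussian mixture in one dimension, with the variances fixed to one for brevity. Everything here (the variable names, the synthetic data, the fixed variances) is an illustrative assumption rather than code from the references. The E-step computes the posterior responsibilities $p(z_n \mid x_n, \boldsymbol{\theta}^{(t)})$, and the M-step maximizes $Q$ in closed form.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Synthetic data from a two-component mixture (made-up ground truth).
x = np.concatenate([rng.normal(-2.0, 1.0, 150), rng.normal(3.0, 1.0, 350)])

# Initial parameters theta^(0): mixing weights and component means.
weights = np.array([0.5, 0.5])
means   = np.array([-1.0, 1.0])

def log_likelihood(x, weights, means):
    """Exact log p(x | theta); tractable here because the z_n are independent."""
    return np.log((weights * norm.pdf(x[:, None], means, 1.0)).sum(axis=1)).sum()

for t in range(20):
    # E-step: responsibilities r[n, k] = p(z_n = k | x_n, theta^(t)).
    joint = weights * norm.pdf(x[:, None], means, 1.0)
    r = joint / joint.sum(axis=1, keepdims=True)

    # M-step: closed-form maximizer of Q(theta | theta^(t)).
    weights = r.mean(axis=0)
    means   = (r * x[:, None]).sum(axis=0) / r.sum(axis=0)

    print(f"iter {t:2d}  log-lik {log_likelihood(x, weights, means):.3f}")
```

The printed log likelihood is non-decreasing across iterations, which is exactly the monotone ascent property derived above.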

While this post outlines the logic for why optimizing $Q(\cdot)$ increases $\log p(\mathbf{x} \mid \boldsymbol{\theta})$, it does not prove that this iterative algorithm converges. Convergence was actually established by Wu (1983), six years after Dempster et al.’s paper, as the latter had a mistake in its proof due to a misuse of the triangle inequality.

As a final note, it may not be obvious that computing or maximizing the expected complete log likelihood is any easier than maximizing the log likelihood. However, recall that the intractability of the log likelihood arises from marginalizing out $\mathbf{z}$, which may induce an exponential number of terms. The complete log likelihood does not suffer from this issue, as it simply expresses the joint probability of $\mathbf{x}$ and $\mathbf{z}$ conditioned on $\boldsymbol{\theta}$. To see this, it might be useful to work through a complete example of fitting a model with EM. Please see my post on factor analysis for such an example.

  1. Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society. Series B (Methodological), 1–38.
  2. Wu, C. F. J. (1983). On the convergence properties of the EM algorithm. The Annals of Statistics, 11(1), 95–103.