Inference for Hidden Markov Models

Expectation–maximization for hidden Markov models is called the Baum–Welch algorithm, and it relies on the forward–backward algorithm for efficient computation. I review HMMs and then present these algorithms in detail.

The simplest probabilistic model of sequential data is that the data are i.i.d. We might assume the data are Gaussian or binomial or from a nonparametric distribution we cannot write down, but we do not assume the model changes over time, and we cannot make any inference about the next observation given the current one. The graphical model for this assumption would be a sequence of unconnected nodes.

Of course, this assumption prevents modeling any sequential structure. For example, in text, certain words are more likely to follow a given word than others, while an i.i.d. assumption would obliterate this structure by assuming each word is independent of the words around it. One simple model, called a first-order Markov model, is that each observation only depends on the previous one. For example, if we assume that each observation is Gaussian and its mean depends linearly on the previous data point, then we are working with an autoregressive model.

First-order models can be extended to $M$-th order Markov models, which assume each observation depends on the previous $M$ observations. However, the number of parameters of such a model is exponential in $M$. For example, imagine we flipped a coin $M$ times and needed to consider all possible sequences of coin flips. Each time $M$ increases by one, the number of possible sequences doubles. Thus, inference is intractable for even modest values of $M$.
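To make the blow-up concrete, here is a minimal sketch (the binary-observation setting is an assumption for illustration) that counts how many conditional distributions an $M$-th order model over binary data must store:

# Each observation is conditioned on the previous M observations, so we
# need one Bernoulli parameter per possible length-M binary history.
for M in range(1, 11):
    n_histories = 2 ** M
    print(f"M = {M:2d}: {n_histories} conditional distributions")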

A powerful probabilistic model of sequential data that still limits the number of parameters is the hidden Markov model (HMM). The key idea is that a latent variable or state variable, rather than the data, evolves according to a discrete, first-order Markov process. Each observation is conditionally independent of every other observation given the value of its associated latent variable (Fig. 1).

Figure 1: Graphical model for a hidden Markov model. The hidden state variables $\mathbf{z}_1, \dots, \mathbf{z}_N$ follow a Markov process, while each observation $\mathbf{x}_n$ is conditionally independent of the other data given its associated hidden state variable $\mathbf{z}_n$.

However, and this is the clever bit, notice that no observation is $d$-separated from any other observation. There is always an unblocked path between any two observations because all observations are connected via the latent variables. Thus, our predictive distribution is

$$
p(\mathbf{x}_{n+1} \mid \mathbf{x}_1, \dots, \mathbf{x}_n), \tag{1}
$$

meaning that $\mathbf{x}_{n+1}$ depends on all previous observations. Thus, by introducing these latent variables, we can tractably compute a flexible predictive distribution.

The goal of this post is to work through HMMs in detail. In particular, I want to focus on the Baum–Welch and forward–backward algorithms. I assume the reader understands Markov chains and expectation–maximization.

Example: Rainier weather data

Before diving into the model and inference details, let’s look at an example. I fit a hidden Markov model using the code below on Mount Rainier weather data. The data features are: temperature, relative humidity, daily wind speed, wind direction, and battery voltage. Each feature was collected daily between 3 August 2014 and 31 December 2015. I used three state variables, meaning the HMM will assign each day to one of three latent or hidden states. In Fig. 2, I plotted temperature across time (left panel) and temperature vs. wind direction (right panel). I colored the data points by their hidden states. As we can see, the inferred states are fairly interpretable, roughly capturing warm, cool, and cold days. Interestingly, it looks like the coldest days on Rainier have wind coming from the south.

Figure 2: Mount Rainier weather data between 3 August 2014 and 31 December 2015. In both plots, days (data points) are labeled by a three-state hidden Markov model. (Left) Temperature across time. (Right) Temperature vs. wind direction.

Probabilistic model

Now that we have an intuition for what kinds of problems HMMs address, let’s dive into the details. Let $\mathbf{X} = \{\mathbf{x}_1, \dots, \mathbf{x}_N\}$ be $N$ sequential observations, where $\mathbf{x}_n \in \mathbb{R}^{D}$ for some positive integer $D$. HMMs assume that each observation $\mathbf{x}_n$ has a corresponding latent variable $\mathbf{z}_n$. We denote this set of latent variables as $\mathbf{Z} = \{\mathbf{z}_1, \dots, \mathbf{z}_N\}$. Note that each $\mathbf{z}_n$ is a one-hot vector, not a scalar. That is, $\mathbf{z}_n$ is a $K$-vector—meaning the HMM has $K$ states—with $K-1$ zeros and a single one indicating the assignment. We do this so that we can treat $\mathbf{z}_n$ as a multinomial random variable, e.g.:

$$
\mathbf{z}_{n} = \begin{bmatrix} 0 \\ 0 \\ 1 \\ 0 \\ \vdots \\ 0 \end{bmatrix}, \tag{2}
$$

would indicate that the $n$-th latent state is $3$ (using one-based indexing). Why do we use this representation? It lets us think of $\mathbf{z}_n$ as a single draw from a multinomial distribution with parameters $1$ and $[p_1, \dots, p_K]^{\top}$.
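As a small illustration, here is how one might draw such a one-hot vector with NumPy; the state probabilities below are placeholders:

import numpy as np

p = np.array([0.2, 0.5, 0.3])       # assumed state probabilities
z = np.random.multinomial(1, p)     # a single draw, e.g. array([0, 1, 0])
state = np.argmax(z)                # recover the integer state label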

Markov assumption

We assume the latent variables follow a random process, a $K$-state Markov chain. For example, consider a $K = 2$ state Markov chain modeling the weather. The weather can either be sunny ($\mathbf{z}_n = s$) or rainy ($\mathbf{z}_n = r$) at time index $n$ (day, hour, minute, etc.). The transition probabilities are given in Fig. 3.

Figure 3: State diagram for a Markov chain $\{\mathbf{z}_n\}$ with states $\mathcal{S} = \{r, s\}$. The probability of moving from $r$ to $s$ is $p$ and from $s$ to $r$ is $q$. The remaining probabilities can be computed because probabilities over all disjoint events must sum to one.

For an HMM, imagine that we only observe humidity as a percentage. Thus, at each time point $n$, $\mathbf{x}_n$ is a scalar in the range $[0, 1]$, representing the day’s humidity, while the latent variable $\mathbf{z}_n$ evolves according to the chain in Fig. 3. Thus, if we could infer $\mathbf{Z}$, we could cluster our humidity data into $K=2$ states.
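To make this concrete, here is a minimal sketch of sampling from such a two-state weather HMM. The transition probabilities $p$ and $q$ and the per-state Beta emission parameters are assumptions for illustration:

import numpy as np

rng = np.random.default_rng(0)

p, q = 0.3, 0.1                        # assumed transition probabilities
A = np.array([[1 - p, p],              # row r: rainy -> {rainy, sunny}
              [q, 1 - q]])             # row s: sunny -> {rainy, sunny}
emit = [(8, 2), (2, 8)]                # assumed Beta parameters per state

N = 100
z = np.empty(N, dtype=int)
x = np.empty(N)
z[0] = rng.choice(2)                   # uniform initial state
x[0] = rng.beta(*emit[z[0]])
for n in range(1, N):
    z[n] = rng.choice(2, p=A[z[n-1]])  # hidden state follows the chain
    x[n] = rng.beta(*emit[z[n]])       # observed humidity given the state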

The random process $\mathbf{Z}$ is first-order Markovian. This Markov assumption is that the future only depends on the present. Formally, a random process $\mathbf{z}_1, \mathbf{z}_2, \dots$ taking values in $\mathcal{S}$ is called a Markov chain if

$$
p(\mathbf{z}_{n+1} \mid \mathbf{z}_{n}, \dots, \mathbf{z}_1) = p(\mathbf{z}_{n+1} \mid \mathbf{z}_{n}). \tag{3}
$$

Thus, for a chain such as in Fig. 4, we can say that $X \perp\!\!\!\!\perp Z \mid Y$, which is read: “$X$ is independent of $Z$ given $Y$”. For example, height and vocabulary are not independent. In the absence of any other information, we might guess that someone who is six feet tall has a bigger vocabulary than someone who is four feet tall. However, height and vocabulary are conditionally independent given age. If we learn that the six-foot-tall person is a fast-growing fourteen-year-old while the four-foot-tall person is thirty-five, we might switch our guess as to who has the bigger vocabulary. In Fig. 4, “vocabulary” is the $Z$ node, height is the $Y$ node, and age is the $X$ node. We can see that if we don’t know $Y$, we can use $X$ to infer $Z$; but if we do know $Y$, $X$ is irrelevant.

Figure 4: Graphical model in which $Z$ is conditionally independent of $X$ given $Y$.

An important implication of the Markovian assumption is that the joint probability of our HMM factorizes nicely,

$$
\begin{aligned}
p(\mathbf{Z}) &= p(\mathbf{z}_1) p(\mathbf{z}_2 \mid \mathbf{z}_1) p(\mathbf{z}_3 \mid \mathbf{z}_2, \mathbf{z}_1) \dots p(\mathbf{z}_N \mid \mathbf{z}_{N-1}, \dots, \mathbf{z}_1) \\
&= p(\mathbf{z}_1) p(\mathbf{z}_2 \mid \mathbf{z}_1) p(\mathbf{z}_3 \mid \mathbf{z}_2) \dots p(\mathbf{z}_N \mid \mathbf{z}_{N-1}) \\
&= p(\mathbf{z}_1) \prod_{n=2}^{N} p(\mathbf{z}_{n} \mid \mathbf{z}_{n-1}).
\end{aligned} \tag{4}
$$

Because of this factorization, the joint probability $p(\mathbf{Z})$ can be completely represented via a transition probability matrix. Let $A_{ij}$ be the probability of transitioning from state $i$ to state $j$. Then the transition probability matrix is $\mathbf{A} = [A_{ij}]_{i,j \in \mathcal{S}}$. Since there are $K$ states, $\mathbf{A}$ is a $K \times K$ matrix. Furthermore, we assume the Markov process is stationary or time-homogeneous, meaning

$$
p(\mathbf{z}_{n+1} = j \mid \mathbf{z}_n = i) = A_{ij}, \quad \forall n. \tag{5}
$$

In words, the transition probabilities do not change as a function of $n$. This assumption can be relaxed to produce a time-inhomogeneous Markov chain, but we’ll focus on the time-homogeneous model.

Emission probabilities

In addition to the transition probabilities, an HMM has emission probabilities, which model how we assume our data are distributed given the state variable $\mathbf{z}_n$:

$$
p(\mathbf{x}_n \mid \mathbf{z}_n = k, \boldsymbol{\phi}_k), \tag{6}
$$

where $\boldsymbol{\phi}_k$ are the parameters of some distribution which we posit. For example, we might model $\mathbf{x}_n$ as a $D$-variate normal given $\mathbf{z}_n$, or

$$
\mathbf{x}_n \sim \mathcal{N}_D(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k). \tag{7}
$$

Thus, $\mathbf{z}_n$ indexes the parameters of the distribution, meaning it specifies which parameters in the set $\{\boldsymbol{\phi}_k\}_{k=1}^{K}$ to use at time $n$. Clearly, these probabilities are of interest because if we infer $\mathbf{z}_n$, then the emission probabilities tell us something about our data. For example, in the Mount Rainier example, I fit a Gaussian HMM; the Gaussian mean vector $\boldsymbol{\mu}_k$ for each of the three states tells us the mean value of the data vector $\mathbf{x}_n$.
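For Gaussian emissions, evaluating Eq. 6 for every time point and state is one density call per state. A minimal sketch, with placeholder parameter values:

import numpy as np
from scipy.stats import multivariate_normal as mvn

N, D, K = 500, 5, 3
obs = np.random.normal(size=(N, D))       # stand-in observations
means = np.random.normal(size=(K, D))     # assumed emission means
covars = np.stack([np.eye(D)] * K)        # assumed emission covariances

# log_emm_prob[n, k] = log p(x_n | z_n = k, phi_k).
log_emm_prob = np.column_stack(
    [mvn.logpdf(obs, means[k], covars[k]) for k in range(K)]
)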

Initialization

The last thing to note is how to initialize the state distribution of our Markov chain. Let $\boldsymbol{\pi}$ be a $K$-dimensional vector representing the probability of the Markov chain starting in each of the $K$ states. Thus, the components of $\boldsymbol{\pi}$ sum to one, $\sum_{k=1}^{K} \pi_k = 1$, and $\boldsymbol{\pi}$ is called the initial state distribution. Recall that a multinomial random variable with parameters $(M, \pi_1, \dots, \pi_K)$ models the probability of rolling a $K$-sided die $M$ times, where $\pi_k$ is the bias of the die landing on the $k$-th side. With $M = 1$ and $\boldsymbol{\pi} = [\pi_1, \dots, \pi_K]^{\top}$, we can express our modeling assumption about $\mathbf{z}_1$ taking one of $K$ values as a multinomial random variable,

$$
\mathbf{z}_1 \sim \text{Mult}(1, \boldsymbol{\pi}). \tag{8}
$$

HMMs can be either supervised or unsupervised. In the supervised setting, our observations have state labels, $\mathbf{X} = \{(\mathbf{x}_1, \mathbf{z}_1), \dots, (\mathbf{x}_N, \mathbf{z}_N)\}$. In this context, we can estimate $\boldsymbol{\pi}$ empirically by counting,

$$
\hat{\pi}_k = \frac{\sum_{n=1}^{N} \mathbf{1}(\mathbf{z}_n = k)}{N}, \tag{9}
$$

where $\mathbf{1}(\mathbf{z}_n = k)$ is an indicator random variable. The goal of a supervised HMM is to estimate the initial state distribution, the transition probability matrix $\mathbf{A}$, and the emission probabilities $p(\mathbf{x}_n \mid \mathbf{z}_n, \boldsymbol{\phi})$. This post will focus on the unsupervised setting. In this case, we need to estimate both the model parameters $\boldsymbol{\theta} = \{\boldsymbol{\pi}, \mathbf{A}, \boldsymbol{\phi}\}$ (the initial state distribution, the transition probability matrix, and the emission probabilities) and the state variables $\mathbf{Z}$.
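In the supervised case, the same counting idea also gives the transition matrix. A minimal sketch, assuming integer state labels z in {0, ..., K-1}:

import numpy as np

def count_estimates(z, K):
    """Empirical estimates of pi (Eq. 9) and A from labeled states."""
    pi_hat = np.bincount(z, minlength=K) / len(z)
    A_hat = np.zeros((K, K))
    for prev, curr in zip(z[:-1], z[1:]):
        A_hat[prev, curr] += 1               # count observed transitions
    A_hat /= A_hat.sum(axis=1, keepdims=True)
    return pi_hat, A_hat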

Filtering and smoothing

Before discussing inference, it is helpful to discuss two concepts in the HMM literature: filtering and smoothing. Filtering computes a belief state, or the probability of the latent variable $\mathbf{z}_n$ being in a certain state in $\mathcal{S}$, given the history of evidence so far (Fig. 5). Formally, filtering computes

$$
p(\mathbf{z}_n \mid \mathbf{X}_{1:n}), \tag{10}
$$

where the notation $\mathbf{X}_{a:b}$ denotes the set $\{\mathbf{x}_a, \mathbf{x}_{a+1}, \dots, \mathbf{x}_{b-1}, \mathbf{x}_b\}$ for $a \leq b$. Using our running example of weather, filtering answers the question: given weather measurements up until the current time, what state are we in now? Notice that the Markov property forces us to look at all of our observations. This is because of how the model factorizes. Since the dependencies between time points are encoded in the random process, we need to marginalize over all of the previous latent variables to condition on $\mathbf{X}_{1:n}$.

Figure 5: Filtering or posterior inference of $\mathbf{z}_n$ given all the observations up until that time point, $\mathbf{x}_1, \dots, \mathbf{x}_n$.

As we’ll see, the forward–backward algorithm for HMMs performs filtering in a recursive forward pass over the data. At each time point, it computes its belief about the current $\mathbf{z}_n$ and then uses that estimate in its estimate for $\mathbf{z}_{n+1}$.

In smoothing, we compute a belief state given observations up to and including future time points, relative to $\mathbf{z}_n$ (Fig. 6). Formally,

$$
p(\mathbf{z}_n \mid \mathbf{X}). \tag{11}
$$

For example, smoothing answers the question: given all the weather measurement data available to us, what state were we in some time period ago? Once again, since the dependencies between time points are encoded in the random process, conditioning on $\mathbf{x}_{1:N}$ requires marginalizing over all the latent variables. Thus, we cannot look at fewer than all our observations.

Figure 6: Smoothing or posterior inference of $\mathbf{z}_n$ given all the observations $\mathbf{x}_1, \dots, \mathbf{x}_N$.

Smoothing is performed during the backward pass of the forward–backward algorithm. You can think of it as “smoothing” because we’re using all of the data, not just the observations up to time $n$, to update our original, filtering-based estimate of each $\mathbf{z}_n$.

There are a number of other tasks we can perform with HMMs. Prediction computes $p(\mathbf{x}_{n+1} \mid \mathbf{x}_{1:n})$. Viterbi decoding labels the most likely states for a new observed sequence $\mathbf{x}_{1:M}$. Posterior sampling allows us to randomly sample a latent state sequence $\mathbf{z}_{1:N}$ given an observed sequence $\mathbf{x}_{1:N}$. In this post, we’ll focus on just filtering and smoothing, since these two steps are the core components of the forward–backward algorithm for HMM inference.

Baum–Welch algorithm

To perform maximum likelihood estimation on an HMM, we need to compute the likelihood $p(\mathbf{X} \mid \boldsymbol{\theta})$ for $\boldsymbol{\theta} = \{\boldsymbol{\pi}, \mathbf{A}, \boldsymbol{\phi}\}$. We could compute this by marginalizing over the latent variables,

$$
p(\mathbf{X} \mid \boldsymbol{\theta}) = \sum_{\mathbf{Z}} p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta}). \tag{12}
$$

In words, we could marginalize out our uncertainty about the state variables by considering all possible values. However, there are $K^N$ terms in this sum, because each of the $N$ observations can take any of $K$ possible state values. To see this, consider Fig. 7. Here we unroll an HMM with $K=3$ states, representing all possible state values over time. A single configuration of $\mathbf{Z}$—a single term in the sum in Eq. 12—amounts to a single path along the lattice in Fig. 7. There are $K^N$ such paths. Since the number of terms in the likelihood scales exponentially in $N$, computing $p(\mathbf{X} \mid \boldsymbol{\theta})$ directly is intractable.

Figure 7: Unrolling a $K=3$ state HMM into a $K \times N$ lattice. The bolded path is a single set of assignments $\mathbf{Z}$.
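To see the blow-up numerically, here is a minimal sketch (with stand-in emission probabilities) that computes Eq. 12 by brute force; even for this tiny chain, the loop visits $K^N$ paths:

import itertools
import numpy as np

K, N = 3, 8
pi = np.full(K, 1 / K)                 # assumed uniform initial distribution
A = np.full((K, K), 1 / K)             # assumed uniform transitions
emm = np.random.random((N, K))         # stand-in values of p(x_n | z_n = k)

likelihood = 0.0
for path in itertools.product(range(K), repeat=N):   # K^N = 6561 paths
    term = pi[path[0]] * emm[0, path[0]]
    for n in range(1, N):
        term *= A[path[n-1], path[n]] * emm[n, path[n]]
    likelihood += term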

The standard solution to this kind of problem is expectation–maximization (EM). As we’ll see, the inference algorithm is a bit more complicated than “just EM”, and it has a name, the Baum–Welch algorithm (Baum et al., 1970). The M-step of the Baum–Welch algorithm is fairly straightforward, but the E-step is tricky and leverages the forward–backward algorithm.

EM works by iteratively optimizing the expected complete log likelihood rather than $\log p(\mathbf{X} \mid \boldsymbol{\theta})$. It consists of two eponymous steps:

$$
\begin{aligned}
\textbf{E-step:} && Q(\boldsymbol{\theta} \mid \boldsymbol{\theta}^{(t)}) &= \mathbb{E}_{p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta}^{(t)})} \big[ \log p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta}) \big] \\
\textbf{M-step:} && \boldsymbol{\theta}^{(t+1)} &= \arg\!\max_{\boldsymbol{\theta}} Q(\boldsymbol{\theta} \mid \boldsymbol{\theta}^{(t)}).
\end{aligned} \tag{13}
$$

In the E-step, we construct a tight lower bound to the log likelihood. In the M-step, we optimize this lower bound. EM relies on the fact that it is often easier to optimize the complete log likelihood $\log p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta})$ than it is to optimize the log likelihood $\log p(\mathbf{X} \mid \boldsymbol{\theta})$. We also want to estimate the parameters $\boldsymbol{\theta}$ under the modeling assumption that $\mathbf{Z}$ exists, and therefore we need the M-step. In more detail, our two steps are:

$$
\begin{aligned}
\textbf{E-step:} &&& \text{Estimate } \mathbb{E}[\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta}^{(t)}] \text{ given } \mathbf{X}, \boldsymbol{\pi}^{(t)}, \mathbf{A}^{(t)}, \text{ and } \boldsymbol{\phi}^{(t)}. \\
\textbf{M-step:} &&& \text{Estimate } \boldsymbol{\pi}^{(t+1)}, \mathbf{A}^{(t+1)}, \text{ and } \boldsymbol{\phi}^{(t+1)} \text{ given } \mathbb{E}[\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta}^{(t)}] \text{ and } \mathbf{X}.
\end{aligned} \tag{14}
$$

We use the expectations in Eq. 14 to compute the expected complete log likelihood. As we will see, we compute the E-step using the forward–backward algorithm, and we compute the M-step using maximum likelihood estimation and Lagrange multipliers.

E-step

First, let’s construct the complete log likelihood:

$$
\begin{aligned}
\log p(\mathbf{Z}, \mathbf{X} \mid \boldsymbol{\theta}) &= \log \Big\{ p(\mathbf{z}_1) \Big[ \prod_{n=2}^{N} p(\mathbf{z}_{n} \mid \mathbf{z}_{n-1}) \Big] \Big[ \prod_{n=1}^{N} p(\mathbf{x}_n \mid \mathbf{z}_n) \Big] \Big\} \\
&= \log p(\mathbf{z}_1) + \sum_{n=2}^{N} \log p(\mathbf{z}_{n} \mid \mathbf{z}_{n-1}) + \sum_{n=1}^{N} \log p(\mathbf{x}_n \mid \mathbf{z}_n).
\end{aligned} \tag{15}
$$

Let $\boldsymbol{\theta}^{(t)}$ denote the current parameter estimates. We’ll estimate $\boldsymbol{\theta}^{(t+1)}$ in the M-step. The expected complete log likelihood is

$$
\begin{aligned}
&\mathbb{E}_{p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta}^{(t)})} \big[ \log p(\mathbf{Z}, \mathbf{X} \mid \boldsymbol{\theta}) \big] \\
&= \mathbb{E} \big[ \log p(\mathbf{z}_1) \big] + \mathbb{E} \Big[ \sum_{n=2}^{N} \log p(\mathbf{z}_{n} \mid \mathbf{z}_{n-1}) \Big] + \mathbb{E} \Big[ \sum_{n=1}^{N} \log p(\mathbf{x}_n \mid \mathbf{z}_n) \Big].
\end{aligned} \tag{16}
$$

Now recall that we’re conditioning on our parameter estimates $\boldsymbol{\theta}^{(t)} = \{\boldsymbol{\pi}^{(t)}, \mathbf{A}^{(t)}, \boldsymbol{\phi}^{(t)}\}$, and the only randomness is in the latent state variables $\mathbf{Z}$, which is a collection of multinomial random variables. So let’s write each expectation as explicit sums over the state variables:

$$
\sum_{k=1}^{K} \mathbb{E}[z_{1k}] \log \pi_k + \sum_{n=2}^{N} \sum_{j=1}^K \sum_{k=1}^K \mathbb{E}[z_{n-1,j} z_{nk}] \log A_{jk} + \sum_{n=1}^{N} \sum_{k=1}^{K} \mathbb{E}[z_{nk}] \log p(\mathbf{x}_n \mid \boldsymbol{\phi}_k). \tag{17}
$$

Here, $z_{nk}$ is the $k$-th component of the vector $\mathbf{z}_n$. See A1 for a complete derivation of Eq. 17. This might seem like a turgid representation, but it has two benefits. First, we have isolated the random quantities; and second, we have written everything explicitly in terms of our parameters $\boldsymbol{\theta}$. Since the E-step amounts to estimating $\mathbf{Z}$ under the expected complete log likelihood, we just need to compute the terms inside the expectations in Eq. 17 given our parameters $\boldsymbol{\theta}$ and then use those expectations in the M-step. We don’t actually have to compute the value of the expected complete log likelihood.

Now note that each $z_{nk}$ is a component of a multinomial random variable $\mathbf{z}_n$. So the expectation is just the probability that $z_{nk}$ takes the value $1$:

$$
\begin{aligned}
\mathbb{E}[z_{nk}] &= \sum_{k^{\prime}=1}^{K} z_{nk} \, p(\mathbf{z}_n = k^{\prime} \mid \mathbf{X}, \boldsymbol{\theta}^{(t)}), \\
\mathbb{E}[z_{n-1, j} z_{nk}] &= \sum_{k^{\prime}=1}^{K} z_{n-1,j} z_{nk} \, p(\mathbf{z}_n = k^{\prime} \mid \mathbf{X}, \boldsymbol{\theta}^{(t)}).
\end{aligned} \tag{18}
$$

In the sum, $K-1$ terms are $0$ because $\mathbf{z}_n$ is a one-hot vector.

The goal of the E-step is to efficiently compute these posterior moments (Eq. 18), and this is done using the forward–backward algorithm. Following Bishop, I will refer to these posterior moments as gamma ($\gamma$) and xi ($\xi$) respectively:

$$
\begin{aligned}
\gamma(z_{nk}) &= \mathbb{E}[z_{nk}], \\
\xi(z_{n-1,j}, z_{nk}) &= \mathbb{E}[z_{n-1, j} z_{nk}].
\end{aligned} \tag{19}
$$

The forward–backward algorithm estimates $\gamma(z_{nk})$ and $\xi(z_{n-1,j}, z_{nk})$ for every $n \in \{1, \dots, N\}$ and every $k \in \{1, \dots, K\}$. We’ll walk through the algorithm in detail in the next section, but for now let’s look at the M-step.

M-step

In the M-step, we want to maximize $Q$ with respect to the parameters $\boldsymbol{\theta} = \{\boldsymbol{\pi}, \mathbf{A}, \boldsymbol{\phi}\}$, given the posterior moments from the E-step. The first two parameters can be optimized using Lagrange multipliers. Let’s walk through optimizing $\boldsymbol{\pi}$ in detail. The Lagrangian function for $\boldsymbol{\pi}$ is

$$
L(\boldsymbol{\pi}, \lambda_{\pi}) = \underbrace{\sum_{k=1}^{K} \mathbb{E}[z_{1k}] \log \pi_k + \dots}_{Q(\boldsymbol{\theta} \mid \boldsymbol{\theta}^{(t)})} + \lambda_{\pi} \Big( \sum_{k=1}^K \pi_k - 1 \Big), \tag{20}
$$

where $\lambda_{\pi}$ is the Lagrange multiplier for $\boldsymbol{\pi}$ and where “$\dots$” represents the other terms in Eq. 17 that do not depend on $\boldsymbol{\pi}$. The partial derivatives of this function with respect to each $\pi_k$ and $\lambda_{\pi}$ are:

$$
\begin{aligned}
\frac{\partial L(\boldsymbol{\pi}, \lambda_{\pi})}{\partial \pi_k} &= \frac{\mathbb{E}[z_{1k}]}{\pi_k} + \lambda_{\pi}, \\
\frac{\partial L(\boldsymbol{\pi}, \lambda_{\pi})}{\partial \lambda_{\pi}} &= \sum_{k=1}^{K} \pi_k - 1.
\end{aligned} \tag{21}
$$

We set these equal to zero and solve the system of equations. We can write each $\pi_k$ in terms of $\lambda_{\pi}$:

$$
\frac{\mathbb{E}[z_{1k}]}{\pi_k} + \lambda_{\pi} = 0 \iff \pi_k = -\frac{\mathbb{E}[z_{1k}]}{\lambda_{\pi}}. \tag{22}
$$

We can plug this into the second equation in Eq. 21 to solve for $\lambda_{\pi}$:

$$
\sum_{k=1}^{K} \pi_k - 1 = 0 \iff \sum_{k=1}^{K} \Big( -\frac{\mathbb{E}[z_{1k}]}{\lambda_{\pi}} \Big) = 1 \iff -\sum_{k=1}^{K} \mathbb{E}[z_{1k}] = \lambda_{\pi}. \tag{23}
$$

Using Eqs. 22 and 23, we see that the optimal $\pi_k$ is

$$
\pi_k = -\frac{\mathbb{E}[z_{1k}]}{\lambda_{\pi}} = \frac{\mathbb{E}[z_{1k}]}{\sum_{j=1}^{K} \mathbb{E}[z_{1j}]} = \frac{\gamma(z_{1k})}{\sum_{j=1}^{K} \gamma(z_{1j})}. \tag{24}
$$

This is an intuitive result. The probability of starting in state $k$ is the expected value of the $k$-th component of the first latent variable, properly normalized. The logic for $\mathbf{A}$ is the same. The optimal $A_{jk}$ is:

$$
A_{jk} = \frac{\sum_{n=2}^N \xi(z_{n-1,j}, z_{nk})}{\sum_{n=2}^{N} \sum_{k^{\prime}=1}^{K} \xi(z_{n-1,j}, z_{nk^{\prime}})}. \tag{25}
$$

See A2 for a complete derivation.

Finally, we need to solve for the optimal emission parameters $\boldsymbol{\phi}_k$. These values depend on the functional form of the density $p(\mathbf{x}_n \mid \boldsymbol{\phi}_k)$. For a concrete example, let’s assume our emissions are Gaussian; therefore, $\boldsymbol{\phi}_k = \{\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k\}$. First, notice that our expected complete log likelihood in Eq. 17 only depends on $\boldsymbol{\phi}_k$ through the rightmost term:

$$
\sum_{n=1}^{N} \sum_{k=1}^{K} \mathbb{E}[z_{nk}] \log p(\mathbf{x}_n \mid \boldsymbol{\phi}_k). \tag{26}
$$

When taking the derivative of Eq. 17 with respect to the parameters in $\boldsymbol{\phi}_k$, this is clearly the only term that will not necessarily go to zero. Furthermore, note that every term in the sum over $K$ will go to zero except for the one involving the $k$-th parameters. Finally, this term has no constraints, and we can optimize the parameters via maximum likelihood estimation. In the Gaussian case, the optimal values are:

$$
\begin{aligned}
\boldsymbol{\mu}_k &= \frac{\sum_{n=1}^{N} \mathbb{E}[z_{nk}] \mathbf{x}_n}{\sum_{n=1}^{N} \mathbb{E}[z_{nk}]}, \\
\boldsymbol{\Sigma}_k &= \frac{\sum_{n=1}^{N} \mathbb{E}[z_{nk}] (\mathbf{x}_n - \boldsymbol{\mu}_k)(\mathbf{x}_n - \boldsymbol{\mu}_k)^{\top}}{\sum_{n=1}^{N} \mathbb{E}[z_{nk}]}.
\end{aligned} \tag{27}
$$

See A3 for complete derivations.
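These updates are just responsibility-weighted sample means and covariances. A minimal sketch in NumPy, assuming gamma is the $N \times K$ matrix of posterior moments $\mathbb{E}[z_{nk}]$:

import numpy as np

def gaussian_m_step(obs, gamma):
    # Weighted maximum likelihood updates from Eq. 27.
    N, D = obs.shape
    K = gamma.shape[1]
    Nk = gamma.sum(axis=0)                    # effective counts per state
    means = (gamma.T @ obs) / Nk[:, None]     # (K, D)
    covars = np.empty((K, D, D))
    for k in range(K):
        dev = obs - means[k]
        covars[k] = (gamma[:, k, None] * dev).T @ dev / Nk[k]
    return means, covars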

We’ve almost completed everything we need for EM inference for HMMs. If we can efficiently compute the expectations in Eq. 17 in our E-step—that is, $\gamma$ and $\xi$ in Eq. 19—then all the updates in the M-step (Eqs. 24, 25, and 27) immediately follow. As mentioned previously, this requires the forward–backward algorithm.

Forward–backward algorithm

The forward–backward algorithm computes $p(\mathbf{z}_n \mid \mathbf{X}, \boldsymbol{\theta}^{(t)})$ and $p(\mathbf{z}_{n-1}, \mathbf{z}_n \mid \mathbf{X}, \boldsymbol{\theta}^{(t)})$, which are required by the posterior moments (Eq. 19). More specifically, the algorithm computes these two terms,

$$
\begin{aligned}
\alpha(\mathbf{z}_n) &= p(\mathbf{z}_n, \mathbf{X}_{1:n}), && \text{(forward pass)} \\
\beta(\mathbf{z}_n) &= p(\mathbf{X}_{n+1:N} \mid \mathbf{z}_n), && \text{(backward pass)}
\end{aligned} \tag{28}
$$

because the posterior mean and posterior second moment can be written using them. The first moment requires

$$
\begin{aligned}
p(\mathbf{z}_n \mid \mathbf{X}) &= \frac{p(\mathbf{X}, \mathbf{z}_n)}{p(\mathbf{X})} \\
&= \frac{p(\mathbf{X}_{1:n}, \mathbf{X}_{n+1:N}, \mathbf{z}_n)}{p(\mathbf{X})} \\
&\stackrel{\star}{=} \frac{p(\mathbf{X}_{n+1:N} \mid \mathbf{z}_n) \, p(\mathbf{z}_n, \mathbf{X}_{1:n})}{p(\mathbf{X})} \\
&= \frac{\alpha(\mathbf{z}_n) \beta(\mathbf{z}_n)}{p(\mathbf{X})}.
\end{aligned} \tag{29}
$$

And the second moment requires

$$
\begin{aligned}
p(\mathbf{z}_{n-1}, \mathbf{z}_n \mid \mathbf{X}) &= \frac{p(\mathbf{X} \mid \mathbf{z}_{n-1}, \mathbf{z}_n) \, p(\mathbf{z}_{n-1}, \mathbf{z}_n)}{p(\mathbf{X})} \\
&\stackrel{\star}{=} \frac{p(\mathbf{x}_{1:n-1} \mid \mathbf{z}_{n-1}) \, p(\mathbf{x}_n \mid \mathbf{z}_n) \, p(\mathbf{x}_{n+1:N} \mid \mathbf{z}_n) \, p(\mathbf{z}_n \mid \mathbf{z}_{n-1}) \, p(\mathbf{z}_{n-1})}{p(\mathbf{X})} \\
&= \frac{\alpha(\mathbf{z}_{n-1}) \beta(\mathbf{z}_n) \, p(\mathbf{x}_n \mid \mathbf{z}_n) \, p(\mathbf{z}_n \mid \mathbf{z}_{n-1})}{p(\mathbf{X})}.
\end{aligned} \tag{30}
$$

In the steps labeled $\star$, we apply our modeling assumptions: that $\mathbf{Z}$ is Markov and that future observations only depend on the current latent variable. Look at the graphical model in Fig. 1 again if needed.

In the language of HMMs, the forward–backward algorithm does both filtering and smoothing. Computing $\alpha(\mathbf{z}_n)$ is effectively filtering, and computing the first posterior moment is smoothing:

$$
\begin{aligned}
\overbrace{p(\mathbf{z}_n \mid \mathbf{X}_{1:n})}^{\text{Filtering}} &\propto p(\mathbf{z}_n, \mathbf{X}_{1:n}) = \alpha(\mathbf{z}_n), \\
\overbrace{p(\mathbf{z}_n \mid \mathbf{X})}^{\text{Smoothing}} &\propto p(\mathbf{X}_{n+1:N} \mid \mathbf{z}_n) \, p(\mathbf{z}_n, \mathbf{X}_{1:n}) = \alpha(\mathbf{z}_n) \beta(\mathbf{z}_n).
\end{aligned} \tag{31}
$$

I think it’s fair to think of computing the posterior second moment as a kind of smoothing as well. Now let’s look at the two passes.

Forward pass

The main idea of the forward pass is to marginalize over the previous latent variable to develop a recursive message-passing algorithm. This will allow us to use dynamic programming for efficient computation:

$$
\begin{aligned}
\alpha(\mathbf{z}_n) &= p(\mathbf{z}_n, \mathbf{X}_{1:n}) \\
&= \sum_{\mathbf{z}_{n-1}} p(\mathbf{z}_n, \mathbf{z}_{n-1}, \mathbf{X}_{1:n}) \\
&= \sum_{\mathbf{z}_{n-1}} p(\mathbf{z}_n, \mathbf{z}_{n-1}, \mathbf{x}_{n}, \mathbf{X}_{1:n-1}) \\
&= \sum_{\mathbf{z}_{n-1}} p(\mathbf{x}_{n} \mid \mathbf{z}_n, \cancel{\mathbf{z}_{n-1}}, \cancel{\mathbf{X}_{1:n-1}}) \, p(\mathbf{z}_n, \mathbf{z}_{n-1}, \mathbf{X}_{1:n-1}) \\
&= \sum_{\mathbf{z}_{n-1}} p(\mathbf{x}_{n} \mid \mathbf{z}_n) \, p(\mathbf{z}_n \mid \mathbf{z}_{n-1}, \cancel{\mathbf{X}_{1:n-1}}) \, p(\mathbf{z}_{n-1}, \mathbf{X}_{1:n-1}) \\
&= p(\mathbf{x}_{n} \mid \mathbf{z}_n) \sum_{\mathbf{z}_{n-1}} p(\mathbf{z}_n \mid \mathbf{z}_{n-1}) \, \alpha(\mathbf{z}_{n-1}).
\end{aligned} \tag{32}
$$

Terms cancel due to our modeling assumptions. Again, look at the graphical model; if a node separates two other nodes as in Fig. 1, then it induces conditional independence.
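In matrix form, the recursion in Eq. 32 is one matrix–vector product per time step. A minimal sketch in probability space (the underflow issue for long sequences is handled with logarithms in the implementation below), assuming emm[n, k] holds $p(\mathbf{x}_n \mid \mathbf{z}_n = k)$:

import numpy as np

def forward(pi, A, emm):
    # alpha[n, k] = p(z_n = k, x_1, ..., x_n), computed recursively.
    N, K = emm.shape
    alpha = np.empty((N, K))
    alpha[0] = pi * emm[0]                    # initial condition (Eq. 34)
    for n in range(1, N):
        alpha[n] = emm[n] * (alpha[n-1] @ A)  # Eq. 32
    return alpha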

Backward pass

The backward pass is a similar idea, but rather than marginalizing over the previous hidden state, we marginalize over the next hidden state:

$$
\begin{aligned}
\beta(\mathbf{z}_n) &= p(\mathbf{X}_{n+1:N} \mid \mathbf{z}_n) \\
&= \sum_{\mathbf{z}_{n+1}} p(\mathbf{z}_{n+1}, \mathbf{X}_{n+1:N} \mid \mathbf{z}_n) \\
&= \sum_{\mathbf{z}_{n+1}} p(\mathbf{z}_{n+1}, \mathbf{x}_{n+1}, \mathbf{x}_{n+2:N} \mid \mathbf{z}_n) \\
&= \sum_{\mathbf{z}_{n+1}} p(\mathbf{x}_{n+2:N} \mid \mathbf{z}_{n+1}, \cancel{\mathbf{x}_{n+1}}, \cancel{\mathbf{z}_n}) \, p(\mathbf{z}_{n+1}, \mathbf{x}_{n+1} \mid \mathbf{z}_n) \\
&= \sum_{\mathbf{z}_{n+1}} p(\mathbf{x}_{n+2:N} \mid \mathbf{z}_{n+1}) \, p(\mathbf{x}_{n+1} \mid \mathbf{z}_{n+1}, \cancel{\mathbf{z}_n}) \, p(\mathbf{z}_{n+1} \mid \mathbf{z}_n) \\
&= \sum_{\mathbf{z}_{n+1}} \beta(\mathbf{z}_{n+1}) \, p(\mathbf{x}_{n+1} \mid \mathbf{z}_{n+1}) \, p(\mathbf{z}_{n+1} \mid \mathbf{z}_n).
\end{aligned} \tag{33}
$$

Once again, this is a recursive algorithm in which we pass a previously computed message backward.
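Under the same assumptions as the forward sketch above, the backward recursion is:

def backward(A, emm):
    # beta[n, k] = p(x_{n+1}, ..., x_N | z_n = k), computed in reverse.
    N, K = emm.shape
    beta = np.empty((N, K))
    beta[N-1] = 1.0                           # initial condition (Eq. 35)
    for n in reversed(range(N-1)):
        beta[n] = A @ (emm[n+1] * beta[n+1])  # Eq. 33
    return beta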

Initial conditions and evidence

Finally, we need to initialize our recursive algorithms. Since we compute the $\alpha(\mathbf{z})$ terms in a forward filtering pass, we need a value for $\alpha(\mathbf{z}_1)$. This is easy to compute:

$$
\alpha(\mathbf{z}_1) = p(\mathbf{x}_1, \mathbf{z}_1) = p(\mathbf{x}_1 \mid \mathbf{z}_1) \, p(\mathbf{z}_1) = \prod_{k=1}^{K} \left\{ \pi_k \, p(\mathbf{x}_1 \mid \boldsymbol{\phi}_k) \right\}^{z_{1k}}. \tag{34}
$$

In words, this is the probability of $\mathbf{x}_1$ for each state, weighted by the initial probability of that state.

However, what is the initial condition for $\beta(\mathbf{z}_N)$? (Recall that the recursion starts at the last observation since we compute the $\beta(\mathbf{z})$ terms in reverse.) Notice that if we set $n = N$ and apply the definition of $\alpha(\mathbf{z})$ in Eq. 29, we have

$$
p(\mathbf{z}_N \mid \mathbf{X}) = \frac{p(\mathbf{z}_N, \mathbf{X}) \, \beta(\mathbf{z}_N)}{p(\mathbf{X})}. \tag{35}
$$

Thus, it’s clear that $\beta(\mathbf{z}_N) = 1$ for each state $k$; otherwise, we would not have properly normalized distributions.

Finally, we can compute the evidence by summing over both sides of Eq. 29:

$$
\begin{aligned}
p(\mathbf{z}_n \mid \mathbf{X}) &= \frac{\alpha(\mathbf{z}_n) \beta(\mathbf{z}_n)}{p(\mathbf{X})} \\
\sum_{k=1}^{K} p(\mathbf{z}_n = k \mid \mathbf{X}) &= \sum_{k=1}^{K} \frac{\alpha(\mathbf{z}_n = k) \beta(\mathbf{z}_n = k)}{p(\mathbf{X})} \\
1 &= \sum_{k=1}^{K} \frac{\alpha(\mathbf{z}_n = k) \beta(\mathbf{z}_n = k)}{p(\mathbf{X})} \\
p(\mathbf{X}) &= \sum_{k=1}^{K} \alpha(\mathbf{z}_n = k) \beta(\mathbf{z}_n = k).
\end{aligned} \tag{36}
$$
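Note that this identity holds for every $n$, which gives a nice sanity check on the two sketches above (reusing the hypothetical forward and backward functions):

import numpy as np

K, N = 3, 10
pi = np.full(K, 1 / K)
A = np.full((K, K), 1 / K)
emm = np.random.random((N, K))              # stand-in emission probabilities

alpha = forward(pi, A, emm)
beta = backward(A, emm)
evidence = (alpha * beta).sum(axis=1)       # Eq. 36 for each n
assert np.allclose(evidence, evidence[0])   # same p(X) for every n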

EM summary

Inference for HMMs is pretty complex. There are just a lot of tedious derivations. To review, EM for HMMs, or the Baum–Welch algorithm, is:

E-step:

Run the forward–backward algorithm to compute the $\alpha$ and $\beta$ terms:

$$
\begin{aligned}
\alpha(\mathbf{z}_n) &= p(\mathbf{x}_{n} \mid \mathbf{z}_n) \sum_{\mathbf{z}_{n-1}} p(\mathbf{z}_n \mid \mathbf{z}_{n-1}) \, \alpha(\mathbf{z}_{n-1}), \\
\beta(\mathbf{z}_n) &= \sum_{\mathbf{z}_{n+1}} \beta(\mathbf{z}_{n+1}) \, p(\mathbf{x}_{n+1} \mid \mathbf{z}_{n+1}) \, p(\mathbf{z}_{n+1} \mid \mathbf{z}_n).
\end{aligned} \tag{37}
$$

Use the $\alpha$ and $\beta$ terms to compute the posterior moments:

$$
\begin{aligned}
\gamma(z_{nk}) &= \mathbb{E}[z_{nk}] \\
&= \sum_{k^{\prime}=1}^{K} z_{nk} \frac{\alpha(\mathbf{z}_n = k^{\prime}) \, \beta(\mathbf{z}_n = k^{\prime})}{p(\mathbf{X})}, \\
\xi(z_{n-1,j}, z_{nk}) &= \mathbb{E}[z_{n-1,j} z_{nk}] \\
&= \sum_{k^{\prime}=1}^{K} z_{n-1,j} z_{nk} \frac{\alpha(\mathbf{z}_{n-1} = k^{\prime}) \, \beta(\mathbf{z}_n = k^{\prime}) \, p(\mathbf{x}_n \mid \mathbf{z}_n) \, p(\mathbf{z}_n \mid \mathbf{z}_{n-1})}{p(\mathbf{X})}.
\end{aligned} \tag{38}
$$

M-step:

Maximize the parameters $\boldsymbol{\theta}$ using the posterior moments computed in the E-step:

$$
\begin{aligned}
\pi_k &= \frac{\gamma(z_{1k})}{\sum_{j=1}^{K} \gamma(z_{1j})}, \\
A_{jk} &= \frac{\sum_{n=2}^N \xi(z_{n-1,j}, z_{nk})}{\sum_{n=2}^{N} \sum_{k^{\prime}=1}^{K} \xi(z_{n-1,j}, z_{nk^{\prime}})}, \\
\boldsymbol{\mu}_k &= \frac{\sum_{n=1}^{N} \gamma(z_{nk}) \, \mathbf{x}_n}{\sum_{n=1}^{N} \gamma(z_{nk})}, \\
\boldsymbol{\Sigma}_k &= \frac{\sum_{n=1}^{N} \gamma(z_{nk}) (\mathbf{x}_n - \boldsymbol{\mu}_k)(\mathbf{x}_n - \boldsymbol{\mu}_k)^{\top}}{\sum_{n=1}^{N} \gamma(z_{nk})}.
\end{aligned} \tag{39}
$$

Implementation

Now that we understand inference for HMMs, let’s look at a complete implementation of the Baum–Welch algorithm.

A practical limitation is that the forward–backward algorithm can be numerically unstable. This is because we multiply many small probabilities together, which can quickly result in numerical underflow. In (Rabiner & Juang, 1986), the authors propose addressing this instability using scaling factors. However, I have opted to use logarithms, following (Mann, 2006). This approach relies on the log-sum-exp trick to normalize log quantities.
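To see why this matters, here is a tiny demonstration: the product of a few hundred modest probabilities underflows in double precision, while the equivalent sum of logs is stable (the probability value 0.01 is arbitrary):

import numpy as np
from scipy.special import logsumexp

probs = np.full(500, 0.01)
print(np.prod(probs))               # 0.0: underflow
print(np.sum(np.log(probs)))        # -2302.58...: stable in log space

# logsumexp computes log(sum(exp(a))) without leaving log space.
log_a = np.log(np.array([0.5, 0.3, 0.2]))
print(logsumexp(log_a))             # ~0.0, i.e. log(1)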

This code was used to generate the results for Mount Rainier weather in Fig. 2. I annotate each step with equations from (Bishop, 2006).

import numpy as np
from   scipy.special import logsumexp
from   scipy.stats import multivariate_normal as mvn


def baum_welch(obs, states, n_iters):
    """EM for hidden Markov models, i.e. the Baum–Welch algorithm. Numerical 
    instability handled by working in log space.
    """
    N, D = obs.shape
    K = len(states)

    # Initialize parameters \theta:
    #
    # 1. Probabilities \pi (size K).
    # 2. Transition probability matrix A.
    # 3. Emission parameters \phi (Gaussian case: \mu and \Sigma).
    #
    log_pi = np.log(np.ones(K) / K)
    assert np.isclose(np.exp(log_pi).sum(), 1)

    log_A  = np.random.normal(0, 1, size=(K, K))
    log_A -= logsumexp(log_A, axis=1, keepdims=True)
    assert np.allclose(np.sum(np.exp(log_A), axis=1), 1)

    means  = np.random.normal(0, 1, size=(K, D))
    covars = np.empty((K, D, D))
    for k in range(K):
        # Ensure initial covariance matrices are PSD.
        tmp = np.random.random((D, 2*D))
        covars[k] = tmp @ tmp.T

    # Initialize emission probabilities (size N × K) from the initial
    # parameters: log_emm_prob[n, k] = log p(x_n | z_n = k, \phi_k).
    log_emm_prob = np.empty((N, K))
    for k in range(K):
        log_emm_prob[:, k] = mvn.logpdf(obs, means[k], covars[k])

    # The n-th row is log(alpha(z_n)).
    # The k-th column is the value z_n takes.
    # So the (n, k) cell is log(alpha(z_n = k)).
    log_alpha = np.zeros((N, K))
    log_beta  = np.zeros((N, K))

    for _ in range(n_iters):

        # E-step (forward-backward algorithm).
        # ------------------------------------
        for k in range(K):
            log_alpha[0, k] = log_pi[k] + log_emm_prob[0, k]

        for n in range(1, N):
            for k in range(K):
                tmp = np.empty(K)
                for j in range(K):
                    tmp[j] = log_alpha[n-1, j] + log_A[j, k]
                log_alpha[n, k] = logsumexp(tmp) + log_emm_prob[n, k]

        log_beta[N-1] = 0  # log(1)
        for n in reversed(range(N-1)):
            for k in range(K):
                tmp = np.empty(K)
                for j in range(K):
                    tmp[j] = (log_beta[n+1, j] 
                              + log_emm_prob[n+1, j] 
                              + log_A[k, j])
                log_beta[n, k] = logsumexp(tmp)

        # M-step.
        # ------------------------------------

        # Compute first posterior moment, \gamma (size N × K).
        # Eq. 13.33 in Bishop.
        log_gamma = log_alpha + log_beta
        log_evidence = logsumexp(log_alpha[N-1])
        gamma = np.exp(log_gamma - log_evidence)
        assert np.allclose(np.sum(gamma, axis=1), 1)

        # Compute second posterior moment, \xi (size (N-1) × K × K). For the
        # n-th sample, log_xi[n] is the (K × K) matrix whose (k, j) entry
        # corresponds to E[z_{n,k} z_{n+1,j}].
        #
        # Eq. 13.43 in Bishop.
        log_xi = np.empty((N-1, K, K))
        for n in range(N-1):
            tmp = np.empty((K, K))
            for k in range(K):
                for j in range(K):
                    tmp[k, j] = (log_alpha[n, k]
                                 + log_beta[n+1, j] 
                                 + log_emm_prob[n+1, j] 
                                 + log_A[k, j]
                                 - log_evidence)
            log_xi[n] = tmp

        # Eq. 13.18 in Bishop.
        log_pi = log_gamma[0] - logsumexp(log_gamma[0])
        assert log_pi.size == K
        assert np.isclose(np.sum(np.exp(log_pi)), 1)

        # Eq. 13.19 in Bishop.
        for i in range(K):
            for j in range(K):
                log_A[i, j] = (logsumexp(log_xi[:, i, j])
                               - logsumexp(log_xi[:, i, :]))
        assert np.allclose(np.sum(np.exp(log_A), axis=1), 1)

        # Use matrix multiplication to sum over N.
        # Eq. 13.20 in Bishop.
        for k in range(K):
            means[k] = obs.T @ gamma[:, k]
            means[k] /= gamma[:, k].sum()

        # Compute new covariances.
        # Eq. 13.21 in Bishop.
        for k in range(K):
            covars[k] = np.zeros((D, D))
            for n in range(N):
                dev = obs[n] - means[k]
                covars[k] += gamma[n, k] * np.outer(dev, dev.T)
            covars[k] /= gamma[:, k].sum()

        # Recompute emission probabilities using the updated parameters.
        for n in range(N):
            x = obs[n]
            for k in range(K):
                log_emm_prob[n, k] = mvn.logpdf(x, means[k], covars[k])

    Z = np.argmax(gamma, axis=1)
    theta = (np.exp(log_pi), np.exp(log_A), means, covars)
    return Z, theta
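A minimal usage sketch on synthetic data (the observations below are random noise, so the recovered states are only illustrative):

obs = np.random.normal(size=(200, 5))        # stand-in for real data
Z, (pi, A, means, covars) = baum_welch(obs, states=range(3), n_iters=20)
print(Z[:10])                                # inferred state per time step
print(A)                                     # learned transition matrix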

Conclusion

While hidden Markov models are fairly complicated in terms of inference, they are conceptually simple and offer flexibility in the predictive distribution while remaining tractable. The Baum–Welch algorithm, which relies on the recursive forward–backward algorithm, is an efficient, model-specific version of EM for this sequential latent variable model.


Appendix

A1. Expected complete log likelihood

Recall that $\mathbf{z}_n$ is a one-hot vector. Thus, it is a vector of zeros with a single one. Let $z_{nk}$ denote the $k$-th component of the $n$-th state variable. Since $x^0 = 1$ and $x^1 = x$ for any value of $x$, we can index into a $K$-vector $\mathbf{a}$ using the following trick:

$$
a_i = \prod_{k=1}^{K} a_k^{z_{nk}}. \tag{A1.1}
$$

In words, when $z_{ni} = 1$, each component of $\mathbf{a}$ is raised to the zeroth power except for the $i$-th component, which is raised to the power one. This is useful because it represents the value $a_i$ in terms of the state variable. For example, we can write the transition probabilities as

$$
p(\mathbf{z}_n \mid \mathbf{z}_{n-1}) = \prod_{j=1}^K \prod_{k=1}^K A_{jk}^{z_{n-1,j} z_{nk}}. \tag{A1.2}
$$

We are indexing into the transition matrix $\mathbf{A}$, raising each cell to a power given by the values of our one-hot vectors $\mathbf{z}_n$ and $\mathbf{z}_{n-1}$. This means the expected log of this transition probability can be written as

$$
\begin{aligned}
\log p(\mathbf{z}_n \mid \mathbf{z}_{n-1}) &= \sum_{j=1}^K \sum_{k=1}^K z_{n-1,j} z_{nk} \log A_{jk} \\
&\Downarrow \\
\mathbb{E} \big[ \log p(\mathbf{z}_n \mid \mathbf{z}_{n-1}) \big] &= \sum_{j=1}^K \sum_{k=1}^K \mathbb{E}[z_{n-1,j} z_{nk}] \log A_{jk}.
\end{aligned} \tag{A1.3}
$$
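A quick numerical check of this indexing trick; the vector and one-hot assignment are arbitrary:

import numpy as np

a = np.array([0.2, 0.5, 0.3])
z = np.array([0, 1, 0])             # one-hot vector selecting i = 1
assert np.prod(a ** z) == a[1]      # the trick recovers a_i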

The initial probability can be written as:

$$
\begin{aligned}
p(\mathbf{z}_1) &= \prod_{k=1}^K \pi_k^{z_{1k}} \\
&\Downarrow \\
\log p(\mathbf{z}_1) &= \sum_{k=1}^K z_{1k} \log \pi_k \\
&\Downarrow \\
\mathbb{E}[\log p(\mathbf{z}_1)] &= \sum_{k=1}^K \mathbb{E}[z_{1k}] \log \pi_k.
\end{aligned} \tag{A1.4}
$$

Finally, the emission probabilities are:

$$
\begin{aligned}
p(\mathbf{x}_n \mid \mathbf{z}_n, \boldsymbol{\phi}) &= \prod_{k=1}^K p(\mathbf{x}_n \mid \boldsymbol{\phi}_k)^{z_{nk}} \\
&\Downarrow \\
\log p(\mathbf{x}_n \mid \mathbf{z}_n, \boldsymbol{\phi}) &= \sum_{k=1}^K z_{nk} \log p(\mathbf{x}_n \mid \boldsymbol{\phi}_k) \\
&\Downarrow \\
\mathbb{E}[\log p(\mathbf{x}_n \mid \mathbf{z}_n, \boldsymbol{\phi})] &= \sum_{k=1}^K \mathbb{E}[z_{nk}] \log p(\mathbf{x}_n \mid \boldsymbol{\phi}_k).
\end{aligned} \tag{A1.5}
$$

A2. Optimal $A_{jk}$

Since $A_{ij}$ is the transition probability of moving from state $i$ to state $j$, each row of $\mathbf{A}$ must sum to one, and so each row needs its own Lagrange multiplier. The Lagrangian function for row $\mathbf{A}_j$ is

$$
L(\mathbf{A}_j, \lambda_{\mathbf{A}_j}) = \underbrace{\dots + \sum_{n=2}^{N} \sum_{j=1}^K \sum_{k=1}^K \mathbb{E}[z_{n-1,j} z_{nk}] \log A_{jk} + \dots}_{Q(\boldsymbol{\theta} \mid \boldsymbol{\theta}^{(t)})} + \lambda_{\mathbf{A}_{j}} \Big( \sum_{k=1}^K A_{jk} - 1 \Big). \tag{A2.1}
$$

The partial derivatives with respect to $A_{jk}$ and $\lambda_{\mathbf{A}_j}$ are:

$$
\begin{aligned}
\frac{\partial L(\mathbf{A}_j, \lambda_{\mathbf{A}_j})}{\partial A_{jk}} &= \sum_{n=2}^N \frac{\mathbb{E}[z_{n-1,j} z_{nk}]}{A_{jk}} + \lambda_{\mathbf{A}_j}, \\
\frac{\partial L(\mathbf{A}_j, \lambda_{\mathbf{A}_j})}{\partial \lambda_{\mathbf{A}_j}} &= \sum_{k=1}^{K} A_{jk} - 1.
\end{aligned} \tag{A2.2}
$$

Let’s set both partial derivatives equal to zero and solve for $A_{jk}$ in terms of $\lambda_{\mathbf{A}_j}$:

$$
A_{jk} = -\sum_{n=2}^N \frac{\mathbb{E}[z_{n-1,j} z_{nk}]}{\lambda_{\mathbf{A}_j}}, \tag{A2.3}
$$

And

$$
\begin{aligned}
\sum_{k=1}^{K} A_{jk} - 1 = 0 &\iff \sum_{k=1}^{K} \Big( -\sum_{n=2}^N \frac{\mathbb{E}[z_{n-1,j} z_{nk}]}{\lambda_{\mathbf{A}_j}} \Big) = 1 \\
&\iff \lambda_{\mathbf{A}_j} = -\sum_{n=2}^{N} \sum_{k=1}^{K} \mathbb{E}[z_{n-1,j} z_{nk}].
\end{aligned} \tag{A2.4}
$$

So the optimal $A_{jk}$ is:

$$
A_{jk} = \frac{\sum_{n=2}^N \mathbb{E}[z_{n-1,j} z_{nk}]}{\sum_{n=2}^{N} \sum_{k^{\prime}=1}^{K} \mathbb{E}[z_{n-1,j} z_{nk^{\prime}}]}. \tag{A2.5}
$$

A3. Optimal $\boldsymbol{\mu}_k$ and $\boldsymbol{\Sigma}_k$

To solve for each $\boldsymbol{\mu}_k$, let’s write down the log likelihood of the Gaussian and eliminate terms that go to zero:

$$
\begin{aligned}
&\frac{\partial}{\partial \boldsymbol{\mu}_k} \Big[ \sum_{n=1}^{N} \sum_{k^{\prime}=1}^{K} \mathbb{E}[z_{nk^{\prime}}] \log p(\mathbf{x}_n \mid \boldsymbol{\phi}_{k^{\prime}}) \Big] \\
&= \frac{\partial}{\partial \boldsymbol{\mu}_k} \Big[ \sum_{n=1}^{N} \mathbb{E}[z_{nk}] \log p(\mathbf{x}_n \mid \boldsymbol{\phi}_k) \Big] \\
&= \sum_{n=1}^{N} \mathbb{E}[z_{nk}] \Big[ \frac{\partial}{\partial \boldsymbol{\mu}_k} \Big\{ -\frac{1}{2} (\mathbf{x}_n - \boldsymbol{\mu}_k)^{\top} \boldsymbol{\Sigma}_k^{-1} (\mathbf{x}_n - \boldsymbol{\mu}_k) - \frac{1}{2} \log |\boldsymbol{\Sigma}_k| - \frac{D}{2} \log 2\pi \Big\} \Big] \\
&= \sum_{n=1}^{N} \mathbb{E}[z_{nk}] \Big[ \frac{\partial}{\partial \boldsymbol{\mu}_k} \Big\{ -\frac{1}{2} (\mathbf{x}_n - \boldsymbol{\mu}_k)^{\top} \boldsymbol{\Sigma}_k^{-1} (\mathbf{x}_n - \boldsymbol{\mu}_k) \Big\} \Big].
\end{aligned} \tag{A3.1}
$$

Using Eq. 86 from the Matrix Cookbook,

$$
\frac{\partial}{\partial \mathbf{s}} (\mathbf{x} - \mathbf{s})^{\top} \mathbf{W} (\mathbf{x} - \mathbf{s}) = -2 \mathbf{W} (\mathbf{x} - \mathbf{s}), \tag{MC.86}
$$

we can compute the desired derivative:

$$
\sum_{n=1}^{N} \mathbb{E}[z_{nk}] \Big[ \boldsymbol{\Sigma}_k^{-1} (\mathbf{x}_n - \boldsymbol{\mu}_k) \Big]. \tag{A3.2}
$$

We can set this equal to zero and solve for $\boldsymbol{\mu}_k$:

$$
\begin{aligned}
&\sum_{n=1}^{N} \mathbb{E}[z_{nk}] \Big[ \boldsymbol{\Sigma}_k^{-1} (\mathbf{x}_n - \boldsymbol{\mu}_k) \Big] = 0 \\
&\iff \sum_{n=1}^{N} \mathbb{E}[z_{nk}] \boldsymbol{\Sigma}_k^{-1} \mathbf{x}_n = \sum_{n=1}^{N} \mathbb{E}[z_{nk}] \boldsymbol{\Sigma}_k^{-1} \boldsymbol{\mu}_k \\
&\iff \boldsymbol{\mu}_k = \frac{\sum_{n=1}^{N} \mathbb{E}[z_{nk}] \mathbf{x}_n}{\sum_{n=1}^{N} \mathbb{E}[z_{nk}]}.
\end{aligned} \tag{A3.3}
$$

Note that we can cancel $\boldsymbol{\Sigma}_k^{-1}$ from both sides because each expectation is a scalar. Now let’s solve for $\boldsymbol{\Sigma}_k$. Modifying A3.1, we get:

$$
\sum_{n=1}^{N} \mathbb{E}[z_{nk}] \Big[ \frac{\partial}{\partial \boldsymbol{\Sigma}_k} \Big\{ -\frac{1}{2} (\mathbf{x}_n - \boldsymbol{\mu}_k)^{\top} \boldsymbol{\Sigma}_k^{-1} (\mathbf{x}_n - \boldsymbol{\mu}_k) - \frac{1}{2} \log |\boldsymbol{\Sigma}_k| \Big\} \Big]. \tag{A3.4}
$$

For the Mahalanobis term, we use Eq. 61 from the Matrix Cookbook:

$$
\frac{\partial}{\partial \mathbf{X}} \mathbf{a}^{\top} \mathbf{X}^{-1} \mathbf{b} = -(\mathbf{X}^{-1})^{\top} \mathbf{a} \mathbf{b}^{\top} (\mathbf{X}^{-1})^{\top}. \tag{MC.61}
$$

Since our covariance matrices are symmetric, $(\mathbf{X}^{-1})^{\top} = (\mathbf{X}^{\top})^{-1} = \mathbf{X}^{-1}$. This gives us:

$$
\begin{aligned}
&\frac{\partial}{\partial \boldsymbol{\Sigma}_k} (\mathbf{x}_n - \boldsymbol{\mu}_k)^{\top} \boldsymbol{\Sigma}_k^{-1} (\mathbf{x}_n - \boldsymbol{\mu}_k) \\
&= \frac{\partial}{\partial \boldsymbol{\Sigma}_k} \Big( \mathbf{x}_n^{\top} \boldsymbol{\Sigma}_k^{-1} \mathbf{x}_n - \boldsymbol{\mu}_k^{\top} \boldsymbol{\Sigma}_k^{-1} \mathbf{x}_n - \mathbf{x}_n^{\top} \boldsymbol{\Sigma}_k^{-1} \boldsymbol{\mu}_k + \boldsymbol{\mu}_k^{\top} \boldsymbol{\Sigma}_k^{-1} \boldsymbol{\mu}_k \Big) \\
&= -\boldsymbol{\Sigma}_k^{-1} \Big( \mathbf{x}_n \mathbf{x}_n^{\top} - \mathbf{x}_n \boldsymbol{\mu}_k^{\top} - \boldsymbol{\mu}_k \mathbf{x}_n^{\top} + \boldsymbol{\mu}_k \boldsymbol{\mu}_k^{\top} \Big) \boldsymbol{\Sigma}_k^{-1} \\
&= -\boldsymbol{\Sigma}_k^{-1} (\mathbf{x}_n - \boldsymbol{\mu}_k)(\mathbf{x}_n - \boldsymbol{\mu}_k)^{\top} \boldsymbol{\Sigma}_k^{-1}.
\end{aligned} \tag{A3.5}
$$

For the log determinant, we use Eq. 57:

$$
\frac{\partial \log[\text{abs}(|\mathbf{X}|)]}{\partial \mathbf{X}} = (\mathbf{X}^{-1})^{\top}. \tag{MC.57}
$$

We can use this identity because our covariance matrices are positive semidefinite, and therefore $|\mathbf{X}| \geq 0$. Putting these identities together, we get:

$$
\begin{aligned}
&\sum_{n=1}^{N} \mathbb{E}[z_{nk}] \Big[ \frac{\partial}{\partial \boldsymbol{\Sigma}_k} \Big\{ -\frac{1}{2} (\mathbf{x}_n - \boldsymbol{\mu}_k)^{\top} \boldsymbol{\Sigma}_k^{-1} (\mathbf{x}_n - \boldsymbol{\mu}_k) - \frac{1}{2} \log |\boldsymbol{\Sigma}_k| \Big\} \Big] \\
&= -\frac{1}{2} \sum_{n=1}^{N} \mathbb{E}[z_{nk}] \Big\{ -\boldsymbol{\Sigma}_k^{-1} (\mathbf{x}_n - \boldsymbol{\mu}_k)(\mathbf{x}_n - \boldsymbol{\mu}_k)^{\top} \boldsymbol{\Sigma}_k^{-1} + \boldsymbol{\Sigma}_k^{-1} \Big\}.
\end{aligned} \tag{A3.6}
$$

Let’s set this result equal to zero and solve for $\boldsymbol{\Sigma}_k$:

$$
\begin{aligned}
&-\frac{1}{2} \sum_{n=1}^{N} \mathbb{E}[z_{nk}] \Big\{ -\boldsymbol{\Sigma}_k^{-1} (\mathbf{x}_n - \boldsymbol{\mu}_k)(\mathbf{x}_n - \boldsymbol{\mu}_k)^{\top} \boldsymbol{\Sigma}_k^{-1} + \boldsymbol{\Sigma}_k^{-1} \Big\} = 0 \\
&\iff \frac{1}{2} \sum_{n=1}^{N} \mathbb{E}[z_{nk}] \boldsymbol{\Sigma}_k^{-1} (\mathbf{x}_n - \boldsymbol{\mu}_k)(\mathbf{x}_n - \boldsymbol{\mu}_k)^{\top} \boldsymbol{\Sigma}_k^{-1} = \frac{1}{2} \sum_{n=1}^{N} \mathbb{E}[z_{nk}] \boldsymbol{\Sigma}_k^{-1} \\
&\iff \sum_{n=1}^{N} \mathbb{E}[z_{nk}] (\mathbf{x}_n - \boldsymbol{\mu}_k)(\mathbf{x}_n - \boldsymbol{\mu}_k)^{\top} = \sum_{n=1}^{N} \mathbb{E}[z_{nk}] \boldsymbol{\Sigma}_k \\
&\iff \boldsymbol{\Sigma}_k = \frac{\sum_{n=1}^{N} \mathbb{E}[z_{nk}] (\mathbf{x}_n - \boldsymbol{\mu}_k)(\mathbf{x}_n - \boldsymbol{\mu}_k)^{\top}}{\sum_{n=1}^{N} \mathbb{E}[z_{nk}]}.
\end{aligned} \tag{A3.7}
$$

  1. Baum, L. E., Petrie, T., Soules, G., & Weiss, N. (1970). A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. The Annals of Mathematical Statistics, 41(1), 164–171.
  2. Rabiner, L., & Juang, B. (1986). An introduction to hidden Markov models. IEEE ASSP Magazine, 3(1), 4–16.
  3. Mann, T. P. (2006). Numerically stable hidden Markov model implementation. An HMM Scaling Tutorial, 1–8.
  4. Bishop, C. M. (2006). Pattern Recognition and Machine Learning.