Scaling Factors for Hidden Markov Models

Naive inference for hidden Markov models (HMMs) is numerically unstable. A standard approach to resolving this instability is to use scaling factors. I discuss this idea in detail.

Inference for hidden Markov models (HMMs) using the standard recursions is numerically unstable. This is because the forward pass of the main algorithm for this task, the Baum–Welch algorithm, recursively computes a joint distribution over the current latent state and all observations up to the current time step. Since a joint probability shrinks as the number of variables grows, the quantities in the forward recursion quickly decay to numbers smaller than machine precision.
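To see the problem in miniature, consider a toy calculation (the numbers here are made up purely for illustration): multiplying a few hundred per-observation likelihoods of moderate size already underflows double precision, while the equivalent sum of logs is perfectly representable.

```python
import numpy as np

# Hypothetical example: 500 observations, each contributing a likelihood
# of roughly 0.01 to the joint probability.
per_step_likelihood = np.full(500, 0.01)

joint = np.prod(per_step_likelihood)
print(joint)            # 0.0 -- the true value, 1e-1000, underflows
print(np.log(joint))    # -inf

# The same quantity computed in log space is unremarkable.
log_joint = np.sum(np.log(per_step_likelihood))
print(log_joint)        # approximately -2302.6
```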

One approach to resolving this issue is to use log probabilities and the log-sum-exp trick. This was how I originally implemented an HMM in my previous post. However, another approach is to use an idea called scaling factors (Bishop, 2006). The basic idea of scaling factors is to modify the standard recursions in the forward–backward algorithm to track conditional distributions that are numerically stable. My understanding is that this approach is common in state-space modeling more generally. For example, the Kalman filter’s standard filtering recursions use this idea.
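For comparison, here is a minimal sketch of that log-space approach (the function and toy numbers are my own; `scipy.special.logsumexp` offers a production-ready version). The rest of this post focuses on the scaling-factor alternative.

```python
import numpy as np

def logsumexp(log_probs):
    """Compute log(sum(exp(log_probs))) by factoring out the max term."""
    m = np.max(log_probs)
    return m + np.log(np.sum(np.exp(log_probs - m)))

# Summing three probabilities on the order of exp(-1000).
log_p = np.array([-1000.0, -1001.0, -1002.0])
print(logsumexp(log_p))                # approx -999.59
print(np.log(np.sum(np.exp(log_p))))   # -inf, since exp(-1000) underflows
```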

To formalize this, let $\mathbf{x}_n$ denote our observation at time $n$, and let $\mathbf{z}_n$ denote a $K$-dimensional hidden variable at time $n$. Let $\mathbf{X}_{1:n}$ denote a contiguous sequence of observations, $\{\mathbf{x}_1, \dots, \mathbf{x}_n\}$. Recall that the forward pass of the forward–backward algorithm is

$$
\alpha(\mathbf{z}_n) = p(\mathbf{x}_{n} \mid \mathbf{z}_n) \sum_{\mathbf{z}_{n-1}} p(\mathbf{z}_n \mid \mathbf{z}_{n-1}) \alpha(\mathbf{z}_{n-1}), \tag{1}
$$

where $\alpha(\mathbf{z}_n)$ is conventional notation for the joint distribution $p(\mathbf{z}_n, \mathbf{X}_{1:n})$. Now let’s divide both sides of Equation 1 by the evidence so far, $p(\mathbf{X}_{1:n})$:

$$
\frac{\alpha(\mathbf{z}_n)}{p(\mathbf{X}_{1:n})} = \frac{p(\mathbf{x}_{n} \mid \mathbf{z}_n) \sum_{\mathbf{z}_{n-1}} p(\mathbf{z}_n \mid \mathbf{z}_{n-1}) \alpha(\mathbf{z}_{n-1})}{p(\mathbf{X}_{1:n})}, \tag{2}
$$

which, after factoring the evidence as $p(\mathbf{X}_{1:n}) = p(\mathbf{x}_n \mid \mathbf{X}_{1:n-1})\, p(\mathbf{X}_{1:n-1})$ and noting that $\alpha(\mathbf{z}_{n-1}) / p(\mathbf{X}_{1:n-1}) = p(\mathbf{z}_{n-1} \mid \mathbf{X}_{1:n-1})$, simplifies to

$$
\begin{aligned}
p(\mathbf{z}_n \mid \mathbf{X}_{1:n})
&= \frac{p(\mathbf{x}_{n} \mid \mathbf{z}_n) \sum_{\mathbf{z}_{n-1}} p(\mathbf{z}_n \mid \mathbf{z}_{n-1}) \alpha(\mathbf{z}_{n-1})}{p(\mathbf{x}_n \mid \mathbf{X}_{1:n-1})\, p(\mathbf{X}_{1:n-1})}
\\
&= \frac{p(\mathbf{x}_{n} \mid \mathbf{z}_n) \sum_{\mathbf{z}_{n-1}} p(\mathbf{z}_n \mid \mathbf{z}_{n-1}) p(\mathbf{z}_{n-1} \mid \mathbf{X}_{1:n-1})}{p(\mathbf{x}_n \mid \mathbf{X}_{1:n-1})}.
\end{aligned} \tag{3}
$$

We can write this new recursion using $\hat{\alpha}(\mathbf{z}_n)$ to denote the conditional distribution $p(\mathbf{z}_n \mid \mathbf{X}_{1:n})$,

$$
\hat{\alpha}(\mathbf{z}_n) = \frac{p(\mathbf{x}_{n} \mid \mathbf{z}_n) \sum_{\mathbf{z}_{n-1}} p(\mathbf{z}_n \mid \mathbf{z}_{n-1}) \hat{\alpha}(\mathbf{z}_{n-1})}{p(\mathbf{x}_n \mid \mathbf{X}_{1:n-1})}. \tag{4}
$$

Before proceeding, let’s ask the critical question: why is $\hat{\alpha}(\mathbf{z}_n)$ more numerically stable than $\alpha(\mathbf{z}_n)$? It is because $\hat{\alpha}(\mathbf{z}_n)$ is a distribution over $K$ values that must sum to one, where $K$ is the number of latent states in the HMM; its entries therefore cannot collectively decay toward zero as $n$ grows. Not only is $K$ typically much smaller than $n$ and $N$ (the total number of observations), it is fixed across time.
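To make the contrast concrete, here is a minimal NumPy sketch of the unscaled recursion in Equation 1. The parameterization is an assumption on my part (these names do not come from the post): an initial distribution `pi`, a transition matrix `A` with `A[i, j]` equal to the probability of moving from state $i$ to state $j$, and a precomputed matrix of emission likelihoods. On even a moderately long sequence, the rows of `alpha` shrink toward zero.

```python
import numpy as np

def forward_naive(pi, A, likelihoods):
    """Unscaled forward pass (Equation 1).

    pi          : (K,)   initial state distribution p(z_1).
    A           : (K, K) transition matrix, A[i, j] = p(z_n = j | z_{n-1} = i).
    likelihoods : (N, K) emission terms, likelihoods[n, k] = p(x_n | z_n = k).

    Returns an (N, K) array whose row n is alpha(z_n) = p(z_n, x_1, ..., x_n).
    """
    N, K = likelihoods.shape
    alpha = np.zeros((N, K))
    alpha[0] = pi * likelihoods[0]                       # base case: p(z_1, x_1)
    for n in range(1, N):
        alpha[n] = likelihoods[n] * (alpha[n - 1] @ A)   # Equation 1
    return alpha
```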

We just need to figure out how to rescale $\hat{\alpha}(\mathbf{z}_n)$ to get back $\alpha(\mathbf{z}_n)$. Let $s_n$ denote the scaling factor we wish to track at each time point $n$, which is just the conditional probability in the denominator of Equation 4:

$$
s_n = p(\mathbf{x}_n \mid \mathbf{X}_{1:n-1}). \tag{5}
$$

We can rewrite Equation 4 using $s_n$ as

$$
\hat{\alpha}(\mathbf{z}_n) = \frac{p(\mathbf{x}_{n} \mid \mathbf{z}_n) \sum_{\mathbf{z}_{n-1}} p(\mathbf{z}_n \mid \mathbf{z}_{n-1}) \hat{\alpha}(\mathbf{z}_{n-1})}{s_n}. \tag{6}
$$

At each step in the algorithm, we want to compute $s_n$. We can do this easily provided $K$ is not a very large number. This is because, if we think of $\hat{\alpha}(\mathbf{z}_n)$ as a vector of $K$ probabilities, we know it must sum to unity. So $s_n$ is simply the normalizing constant of Equation 6: compute the $K$ values in the numerator, and their sum is $s_n$.
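Concretely, under the same hypothetical parameterization as in the earlier sketch, a scaled forward pass forms the numerator of Equation 6, reads off $s_n$ as its sum, and then normalizes:

```python
import numpy as np

def forward_scaled(pi, A, likelihoods):
    """Scaled forward pass (Equation 6), same conventions as forward_naive.

    Returns
    -------
    alpha_hat : (N, K) array whose row n is p(z_n | x_1, ..., x_n).
    s         : (N,)   scaling factors, s[n] = p(x_n | x_1, ..., x_{n-1}).
    """
    N, K = likelihoods.shape
    alpha_hat = np.zeros((N, K))
    s = np.zeros(N)

    # Base case: p(z_1, x_1) = p(z_1) p(x_1 | z_1); its sum over z_1 is s_1 = p(x_1).
    unnormalized = pi * likelihoods[0]
    s[0] = unnormalized.sum()
    alpha_hat[0] = unnormalized / s[0]

    for n in range(1, N):
        # Numerator of Equation 6: p(x_n, z_n | x_1, ..., x_{n-1}).
        unnormalized = likelihoods[n] * (alpha_hat[n - 1] @ A)
        # s_n is whatever makes the K values sum to one.
        s[n] = unnormalized.sum()
        alpha_hat[n] = unnormalized / s[n]

    return alpha_hat, s
```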

Finally, we can compute $\alpha(\mathbf{z}_n)$ from $\hat{\alpha}(\mathbf{z}_n)$ using the scaling factors up through $s_n$. This is because of the chain rule:

$$
p(\mathbf{X}_{1:n}) = \prod_{i=1}^n p(\mathbf{x}_i \mid \mathbf{X}_{1:i-1}) = \prod_{i=1}^n s_i. \tag{7}
$$

To be explicit, we have:

$$
\alpha(\mathbf{z}_n) = \hat{\alpha}(\mathbf{z}_n) \prod_{i=1}^n s_i. \tag{8}
$$
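Note that the product in Equation 8 would itself underflow for exactly the reason we started with, so in practice one typically accumulates the scaling factors in log space, for example when monitoring the log-likelihood during Baum–Welch:

$$
\log p(\mathbf{X}_{1:n}) = \sum_{i=1}^n \log s_i,
\qquad
\log \alpha(\mathbf{z}_n) = \log \hat{\alpha}(\mathbf{z}_n) + \sum_{i=1}^n \log s_i.
$$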

So that’s it. At each time point, we compute our forward recursions using the numerically stable conditional distribution $\hat{\alpha}(\mathbf{z}_n)$ over $K$ values. We can efficiently compute $s_n$ by normalizing the $K$-dimensional probability vector representing the conditional distribution. Finally, we can compute $\alpha(\mathbf{z}_n)$, which is needed for the full Baum–Welch algorithm, using the product of scaling factors.
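As a sanity check, here is a short end-to-end example under the same assumed setup, reusing the hypothetical `forward_naive` and `forward_scaled` sketches from above. For a sequence short enough that the naive recursion has not yet underflowed, recovering $\log \alpha(\mathbf{z}_n)$ via Equation 8 matches the direct computation.

```python
import numpy as np

rng = np.random.default_rng(0)
K, N = 3, 50

# Random stand-in parameters; each Dirichlet draw is a valid probability vector.
pi = rng.dirichlet(np.ones(K))
A = rng.dirichlet(np.ones(K), size=K)           # row i is p(z_n = . | z_{n-1} = i)
likelihoods = rng.uniform(0.01, 1.0, (N, K))    # stand-in for p(x_n | z_n = k)

alpha = forward_naive(pi, A, likelihoods)
alpha_hat, s = forward_scaled(pi, A, likelihoods)

# Equation 8 in log space: log alpha(z_n) = log alpha_hat(z_n) + sum_{i <= n} log s_i.
log_alpha_recovered = np.log(alpha_hat) + np.cumsum(np.log(s))[:, None]
print(np.allclose(np.log(alpha), log_alpha_recovered))  # True
```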

We also need to apply this same idea to the $\beta(\mathbf{z}_n)$ terms in the backward pass. However, I don’t think a discussion of this topic adds much intuition, and the derivations are not challenging. See Bishop (2006) for a complete discussion, if desired.

1. Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer.