Scaling Factors for Hidden Markov Models
Inference for hidden Markov models (HMMs) is numerically unstable. A standard approach to resolving this instability is to use scaling factors. I discuss this idea in detail.
Inference for hidden Markov models (HMMs) using the standard recursions is numerically unstable. This is because the main algorithm for this task, the Baum–Welch algorithm, relies on forward recursions that compute a joint distribution over the current latent state and all previous observations. Since joint probabilities decrease as the number of variables increases, the probabilities in the forward recursions quickly decay to numbers smaller than machine precision.
One approach to resolving this issue is to use log probabilities and the log-sum-exp trick. This was how I originally implemented an HMM in my previous post. However, another approach is to use an idea called scaling factors (Bishop, 2006). The basic idea of scaling factors is to modify the standard recursions in the forward–backward algorithm to track conditional distributions that are numerically stable. My understanding is that this approach is common in state-space modeling more generally. For example, the Kalman filter’s standard filtering recursions use this idea.
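For comparison, the log-space approach works with log probabilities throughout and uses the log-sum-exp trick to marginalize without underflow. A minimal sketch (the function name is my own; this is the standard trick, not the scaling-factor approach discussed below):

```python
import numpy as np

def log_sum_exp(log_probs):
    """Compute log(sum(exp(log_probs))) stably by factoring out
    the maximum term before exponentiating."""
    m = np.max(log_probs)
    return m + np.log(np.sum(np.exp(log_probs - m)))

# Summing probabilities far below machine precision still works in log space:
log_p = np.array([-1000.0, -1000.0])  # each probability underflows to 0.0
result = log_sum_exp(log_p)           # -1000 + log(2)
```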
To formalize this, let $\mathbf{x}_n$ denote our observation at time $n$, and let $\mathbf{z}_n$ denote a $K$-dimensional hidden variable at time $n$. Let $\mathbf{x}_{i:j}$ denote a contiguous sequence of observations, $\{\mathbf{x}_i, \mathbf{x}_{i+1}, \dots, \mathbf{x}_j\}$. Recall that the forward pass of the forward–backward algorithm is

$$\alpha(\mathbf{z}_n) = p(\mathbf{x}_n \mid \mathbf{z}_n) \sum_{\mathbf{z}_{n-1}} \alpha(\mathbf{z}_{n-1})\, p(\mathbf{z}_n \mid \mathbf{z}_{n-1}),$$
where $\alpha(\mathbf{z}_n)$ is conventional notation for the joint distribution $p(\mathbf{x}_{1:n}, \mathbf{z}_n)$. Now let's divide both sides of this recursion by the evidence so far, $p(\mathbf{x}_{1:n})$:

$$\frac{\alpha(\mathbf{z}_n)}{p(\mathbf{x}_{1:n})} = \frac{p(\mathbf{x}_n \mid \mathbf{z}_n)}{p(\mathbf{x}_{1:n})} \sum_{\mathbf{z}_{n-1}} \alpha(\mathbf{z}_{n-1})\, p(\mathbf{z}_n \mid \mathbf{z}_{n-1}),$$

which, since $\alpha(\mathbf{z}_{n-1}) = p(\mathbf{z}_{n-1} \mid \mathbf{x}_{1:n-1})\, p(\mathbf{x}_{1:n-1})$ and $p(\mathbf{x}_{1:n}) = p(\mathbf{x}_n \mid \mathbf{x}_{1:n-1})\, p(\mathbf{x}_{1:n-1})$, can be simplified as

$$p(\mathbf{z}_n \mid \mathbf{x}_{1:n}) = \frac{p(\mathbf{x}_n \mid \mathbf{z}_n)}{p(\mathbf{x}_n \mid \mathbf{x}_{1:n-1})} \sum_{\mathbf{z}_{n-1}} p(\mathbf{z}_{n-1} \mid \mathbf{x}_{1:n-1})\, p(\mathbf{z}_n \mid \mathbf{z}_{n-1}).$$

We can write this new recursion using $\hat{\alpha}(\mathbf{z}_n)$ to denote the conditional distribution,

$$\hat{\alpha}(\mathbf{z}_n) \triangleq p(\mathbf{z}_n \mid \mathbf{x}_{1:n}) = \frac{p(\mathbf{x}_n \mid \mathbf{z}_n)}{p(\mathbf{x}_n \mid \mathbf{x}_{1:n-1})} \sum_{\mathbf{z}_{n-1}} \hat{\alpha}(\mathbf{z}_{n-1})\, p(\mathbf{z}_n \mid \mathbf{z}_{n-1}).$$
Before proceeding, let’s ask the critical question: why is $\hat{\alpha}(\mathbf{z}_n)$ more numerically stable than $\alpha(\mathbf{z}_n)$? It is because $\hat{\alpha}(\mathbf{z}_n)$ is a distribution over $K$ values, where $K$ is the number of latent states in the HMM. Not only is $K$ typically much smaller than $n$ and $N$ (the total number of observations), it is fixed across time.
We just need to figure out how to rescale $\hat{\alpha}(\mathbf{z}_n)$ to get back $\alpha(\mathbf{z}_n)$. Let $c_n$ denote the scaling factor we wish to track at each time point $n$, which is just the conditional distribution in the denominator of the recursion above:

$$c_n \triangleq p(\mathbf{x}_n \mid \mathbf{x}_{1:n-1}).$$

We can rewrite the recursion using $c_n$ as

$$c_n\, \hat{\alpha}(\mathbf{z}_n) = p(\mathbf{x}_n \mid \mathbf{z}_n) \sum_{\mathbf{z}_{n-1}} \hat{\alpha}(\mathbf{z}_{n-1})\, p(\mathbf{z}_n \mid \mathbf{z}_{n-1}).$$

At each step in the algorithm, we want to compute $\hat{\alpha}(\mathbf{z}_n)$. We can do this easily provided $c_n$ is not a very large number. This is because if we think of $\hat{\alpha}(\mathbf{z}_n)$ as a vector of $K$ probabilities, we know that it must sum to unity. So $c_n$ is simply the value that normalizes the right-hand side of this rewritten recursion.
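Concretely, one step of the scaled recursion computes the unnormalized right-hand side and then reads off $c_n$ as its normalizer. A sketch in NumPy, where the variable names (`A` for the transition matrix, `b` for the emission likelihoods) are my own:

```python
import numpy as np

def scaled_forward_step(alpha_hat_prev, A, b):
    """One step of the scaled forward recursion.

    alpha_hat_prev: (K,) previous filtered distribution p(z_{n-1} | x_{1:n-1})
    A:              (K, K) transition matrix, A[i, j] = p(z_n = j | z_{n-1} = i)
    b:              (K,) emission likelihoods p(x_n | z_n), one per state
    Returns the new filtered distribution and the scaling factor c_n.
    """
    unnormalized = b * (alpha_hat_prev @ A)  # RHS of the scaled recursion
    c_n = unnormalized.sum()                 # c_n = p(x_n | x_{1:n-1})
    return unnormalized / c_n, c_n
```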
Finally, we can compute $\alpha(\mathbf{z}_n)$ from $\hat{\alpha}(\mathbf{z}_n)$ using the scaling factors up through time $n$. This is because of the chain rule of probability:

$$p(\mathbf{x}_{1:n}) = \prod_{m=1}^{n} p(\mathbf{x}_m \mid \mathbf{x}_{1:m-1}) = \prod_{m=1}^{n} c_m.$$

To be explicit, we have:

$$\alpha(\mathbf{z}_n) = p(\mathbf{z}_n \mid \mathbf{x}_{1:n})\, p(\mathbf{x}_{1:n}) = \left( \prod_{m=1}^{n} c_m \right) \hat{\alpha}(\mathbf{z}_n).$$
So that’s it. At each time point, we compute our forward recursions using the numerically stable conditional distribution $\hat{\alpha}(\mathbf{z}_n)$ over $K$ values. We can efficiently compute $c_n$ as the constant that normalizes the $K$-dimensional probability vector representing that conditional distribution. Finally, we can compute the evidence $p(\mathbf{x}_{1:N})$, which is needed for the full Baum–Welch algorithm, as the product of scaling factors.
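Putting the pieces together, a full scaled forward pass tracks $\hat{\alpha}$ and accumulates the scaling factors, recovering the log evidence as $\log p(\mathbf{x}_{1:N}) = \sum_{n} \log c_n$. A sketch for a discrete-emission HMM, with parameter names of my own choosing:

```python
import numpy as np

def scaled_forward(obs, pi, A, B):
    """Scaled forward pass for a discrete-emission HMM.

    obs: length-N sequence of integer observations
    pi:  (K,) initial state distribution p(z_1)
    A:   (K, K) transition matrix, A[i, j] = p(z_n = j | z_{n-1} = i)
    B:   (K, M) emission matrix, B[k, v] = p(x_n = v | z_n = k)
    Returns the (N, K) matrix of filtered distributions p(z_n | x_{1:n})
    and the log evidence log p(x_{1:N}).
    """
    N, K = len(obs), len(pi)
    alpha_hat = np.empty((N, K))
    log_evidence = 0.0
    for n in range(N):
        if n == 0:
            # Base case: alpha(z_1) = p(x_1 | z_1) p(z_1), so c_1 = p(x_1).
            unnormalized = pi * B[:, obs[0]]
        else:
            unnormalized = B[:, obs[n]] * (alpha_hat[n - 1] @ A)
        c_n = unnormalized.sum()         # c_n = p(x_n | x_{1:n-1})
        alpha_hat[n] = unnormalized / c_n
        log_evidence += np.log(c_n)      # log p(x_{1:N}) = sum_n log c_n
    return alpha_hat, log_evidence
```

Because we accumulate $\log c_n$ rather than the product itself, the evidence never underflows, even for very long sequences.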
We also need to apply this same idea to the $\beta(\mathbf{z}_n)$ terms in the backward pass. However, I don’t think a discussion of this topic adds much intuition, and the derivations are not challenging. See Bishop (2006) for a complete discussion, if desired.
- Bishop, C. M. (2006). *Pattern Recognition and Machine Learning*. Springer.