Scaling Factors for Hidden Markov Models

Naive inference for hidden Markov models (HMMs) is numerically unstable. A standard approach to resolving this instability is to use scaling factors. I discuss this idea in detail.

Inference for hidden Markov models (HMMs) using the standard recursions is numerically unstable. This is because the forward pass of the main algorithm for this task, the Baum–Welch algorithm, recursively computes a joint distribution over the current latent state and all observations up to the current time step. Since a joint probability shrinks as the number of variables grows, the quantities in the forward recursion quickly decay to numbers smaller than machine precision.
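To see the problem in miniature, consider a toy calculation (the numbers here are made up purely for illustration): multiplying a few hundred per-observation likelihoods of moderate size already underflows double precision, while the equivalent sum of logs is perfectly representable.

```python
import numpy as np

# Hypothetical example: 500 observations, each contributing a likelihood
# of roughly 0.01 to the joint probability.
per_step_likelihood = np.full(500, 0.01)

joint = np.prod(per_step_likelihood)
print(joint)            # 0.0 -- the true value, 1e-1000, underflows
print(np.log(joint))    # -inf

# The same quantity computed in log space is unremarkable.
log_joint = np.sum(np.log(per_step_likelihood))
print(log_joint)        # approximately -2302.6
```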

One approach to resolving this issue is to use log probabilities and the log-sum-exp trick. This was how I originally implemented an HMM in my previous post. However, another approach is to use an idea called scaling factors (Bishop, 2006). The basic idea of scaling factors is to modify the standard recursions in the forward–backward algorithm to track conditional distributions that are numerically stable. My understanding is that this approach is common in state-space modeling more generally. For example, the Kalman filter’s standard filtering recursions use this idea.
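For comparison, here is a minimal sketch of that log-space approach (the function and toy numbers are my own; `scipy.special.logsumexp` offers a production-ready version). The rest of this post focuses on the scaling-factor alternative.

```python
import numpy as np

def logsumexp(log_probs):
    """Compute log(sum(exp(log_probs))) by factoring out the max term."""
    m = np.max(log_probs)
    return m + np.log(np.sum(np.exp(log_probs - m)))

# Summing three probabilities on the order of exp(-1000).
log_p = np.array([-1000.0, -1001.0, -1002.0])
print(logsumexp(log_p))                # approx -999.59
print(np.log(np.sum(np.exp(log_p))))   # -inf, since exp(-1000) underflows
```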

To formalize this, let $\mathbf{x}_n$ denote our observation at time $n$, and let $\mathbf{z}_n$ denote a $K$-dimensional hidden variable at time $n$. Let $\mathbf{X}_{1:n}$ denote a contiguous sequence of observations, $\{\mathbf{x}_1, \dots, \mathbf{x}_n\}$. Recall that the forward pass of the forward–backward algorithm is

$$
\alpha(\mathbf{z}_n) = p(\mathbf{x}_{n} \mid \mathbf{z}_n) \sum_{\mathbf{z}_{n-1}} p(\mathbf{z}_n \mid \mathbf{z}_{n-1}) \alpha(\mathbf{z}_{n-1}), \tag{1}
$$

where $\alpha(\mathbf{z}_n)$ is conventional notation for the joint distribution $p(\mathbf{z}_n, \mathbf{X}_{1:n})$. Now let’s divide both sides of Equation 1 by the evidence so far, $p(\mathbf{X}_{1:n})$:

$$
\frac{\alpha(\mathbf{z}_n)}{p(\mathbf{X}_{1:n})} = \frac{p(\mathbf{x}_{n} \mid \mathbf{z}_n) \sum_{\mathbf{z}_{n-1}} p(\mathbf{z}_n \mid \mathbf{z}_{n-1}) \alpha(\mathbf{z}_{n-1})}{p(\mathbf{X}_{1:n})}, \tag{2}
$$

which, after factoring the evidence as $p(\mathbf{X}_{1:n}) = p(\mathbf{x}_n \mid \mathbf{X}_{1:n-1})\, p(\mathbf{X}_{1:n-1})$ and noting that $\alpha(\mathbf{z}_{n-1}) / p(\mathbf{X}_{1:n-1}) = p(\mathbf{z}_{n-1} \mid \mathbf{X}_{1:n-1})$, simplifies to

$$
\begin{aligned}
p(\mathbf{z}_n \mid \mathbf{X}_{1:n})
&= \frac{p(\mathbf{x}_{n} \mid \mathbf{z}_n) \sum_{\mathbf{z}_{n-1}} p(\mathbf{z}_n \mid \mathbf{z}_{n-1}) \alpha(\mathbf{z}_{n-1})}{p(\mathbf{x}_n \mid \mathbf{X}_{1:n-1})\, p(\mathbf{X}_{1:n-1})}
\\
&= \frac{p(\mathbf{x}_{n} \mid \mathbf{z}_n) \sum_{\mathbf{z}_{n-1}} p(\mathbf{z}_n \mid \mathbf{z}_{n-1}) p(\mathbf{z}_{n-1} \mid \mathbf{X}_{1:n-1})}{p(\mathbf{x}_n \mid \mathbf{X}_{1:n-1})}.
\end{aligned} \tag{3}
$$

We can write this new recursion using $\hat{\alpha}(\mathbf{z}_n)$ to denote the conditional distribution $p(\mathbf{z}_n \mid \mathbf{X}_{1:n})$,

$$
\hat{\alpha}(\mathbf{z}_n) = \frac{p(\mathbf{x}_{n} \mid \mathbf{z}_n) \sum_{\mathbf{z}_{n-1}} p(\mathbf{z}_n \mid \mathbf{z}_{n-1}) \hat{\alpha}(\mathbf{z}_{n-1})}{p(\mathbf{x}_n \mid \mathbf{X}_{1:n-1})}. \tag{4}
$$

Before proceeding, let’s ask the critical question: why is $\hat{\alpha}(\mathbf{z}_n)$ more numerically stable than $\alpha(\mathbf{z}_n)$? It is because $\hat{\alpha}(\mathbf{z}_n)$ is a distribution over $K$ values that must sum to one, where $K$ is the number of latent states in the HMM; its entries therefore cannot collectively decay toward zero as $n$ grows. Not only is $K$ typically much smaller than $n$ and $N$ (the total number of observations), it is fixed across time.
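To make the contrast concrete, here is a minimal NumPy sketch of the unscaled recursion in Equation 1. The parameterization is an assumption on my part (these names do not come from the post): an initial distribution `pi`, a transition matrix `A` with `A[i, j]` equal to the probability of moving from state $i$ to state $j$, and a precomputed matrix of emission likelihoods. On even a moderately long sequence, the rows of `alpha` shrink toward zero.

```python
import numpy as np

def forward_naive(pi, A, likelihoods):
    """Unscaled forward pass (Equation 1).

    pi          : (K,)   initial state distribution p(z_1).
    A           : (K, K) transition matrix, A[i, j] = p(z_n = j | z_{n-1} = i).
    likelihoods : (N, K) emission terms, likelihoods[n, k] = p(x_n | z_n = k).

    Returns an (N, K) array whose row n is alpha(z_n) = p(z_n, x_1, ..., x_n).
    """
    N, K = likelihoods.shape
    alpha = np.zeros((N, K))
    alpha[0] = pi * likelihoods[0]                       # base case: p(z_1, x_1)
    for n in range(1, N):
        alpha[n] = likelihoods[n] * (alpha[n - 1] @ A)   # Equation 1
    return alpha
```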

We just need to figure out how to rescale $\hat{\alpha}(\mathbf{z}_n)$ to get back $\alpha(\mathbf{z}_n)$. Let $s_n$ denote the scaling factor we wish to track at each time point $n$, which is just the conditional probability in the denominator of Equation 4:

$$
s_n = p(\mathbf{x}_n \mid \mathbf{X}_{1:n-1}). \tag{5}
$$

We can rewrite Equation 4 using $s_n$ as

$$
\hat{\alpha}(\mathbf{z}_n) = \frac{p(\mathbf{x}_{n} \mid \mathbf{z}_n) \sum_{\mathbf{z}_{n-1}} p(\mathbf{z}_n \mid \mathbf{z}_{n-1}) \hat{\alpha}(\mathbf{z}_{n-1})}{s_n}. \tag{6}
$$

At each step in the algorithm, we want to compute $s_n$. We can do this easily provided $K$ is not a very large number. This is because, if we think of $\hat{\alpha}(\mathbf{z}_n)$ as a vector of $K$ probabilities, we know it must sum to unity. So $s_n$ is simply the normalizing constant of Equation 6: compute the $K$ values in the numerator, and their sum is $s_n$.
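Concretely, under the same hypothetical parameterization as in the earlier sketch, a scaled forward pass forms the numerator of Equation 6, reads off $s_n$ as its sum, and then normalizes:

```python
import numpy as np

def forward_scaled(pi, A, likelihoods):
    """Scaled forward pass (Equation 6), same conventions as forward_naive.

    Returns
    -------
    alpha_hat : (N, K) array whose row n is p(z_n | x_1, ..., x_n).
    s         : (N,)   scaling factors, s[n] = p(x_n | x_1, ..., x_{n-1}).
    """
    N, K = likelihoods.shape
    alpha_hat = np.zeros((N, K))
    s = np.zeros(N)

    # Base case: p(z_1, x_1) = p(z_1) p(x_1 | z_1); its sum over z_1 is s_1 = p(x_1).
    unnormalized = pi * likelihoods[0]
    s[0] = unnormalized.sum()
    alpha_hat[0] = unnormalized / s[0]

    for n in range(1, N):
        # Numerator of Equation 6: p(x_n, z_n | x_1, ..., x_{n-1}).
        unnormalized = likelihoods[n] * (alpha_hat[n - 1] @ A)
        # s_n is whatever makes the K values sum to one.
        s[n] = unnormalized.sum()
        alpha_hat[n] = unnormalized / s[n]

    return alpha_hat, s
```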

Finally, we can compute $\alpha(\mathbf{z}_n)$ from $\hat{\alpha}(\mathbf{z}_n)$ using the scaling factors up through $s_n$. This is because of the chain rule:

$$
p(\mathbf{X}_{1:n}) = \prod_{i=1}^n p(\mathbf{x}_i \mid \mathbf{X}_{1:i-1}) = \prod_{i=1}^n s_i. \tag{7}
$$

To be explicit, we have:

$$
\alpha(\mathbf{z}_n) = \hat{\alpha}(\mathbf{z}_n) \prod_{i=1}^n s_i. \tag{8}
$$
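Note that the product in Equation 8 would itself underflow for exactly the reason we started with, so in practice one typically accumulates the scaling factors in log space, for example when monitoring the log-likelihood during Baum–Welch:

$$
\log p(\mathbf{X}_{1:n}) = \sum_{i=1}^n \log s_i,
\qquad
\log \alpha(\mathbf{z}_n) = \log \hat{\alpha}(\mathbf{z}_n) + \sum_{i=1}^n \log s_i.
$$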

So that’s it. At each time point, we compute our forward recursions using the numerically stable conditional distribution $\hat{\alpha}(\mathbf{z}_n)$ over $K$ values. We can efficiently compute $s_n$ by normalizing the $K$-dimensional probability vector representing the conditional distribution. Finally, we can compute $\alpha(\mathbf{z}_n)$, which is needed for the full Baum–Welch algorithm, using the product of scaling factors.
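As a sanity check, here is a short end-to-end example under the same assumed setup, reusing the hypothetical `forward_naive` and `forward_scaled` sketches from above. For a sequence short enough that the naive recursion has not yet underflowed, recovering $\log \alpha(\mathbf{z}_n)$ via Equation 8 matches the direct computation.

```python
import numpy as np

rng = np.random.default_rng(0)
K, N = 3, 50

# Random stand-in parameters; each Dirichlet draw is a valid probability vector.
pi = rng.dirichlet(np.ones(K))
A = rng.dirichlet(np.ones(K), size=K)           # row i is p(z_n = . | z_{n-1} = i)
likelihoods = rng.uniform(0.01, 1.0, (N, K))    # stand-in for p(x_n | z_n = k)

alpha = forward_naive(pi, A, likelihoods)
alpha_hat, s = forward_scaled(pi, A, likelihoods)

# Equation 8 in log space: log alpha(z_n) = log alpha_hat(z_n) + sum_{i <= n} log s_i.
log_alpha_recovered = np.log(alpha_hat) + np.cumsum(np.log(s))[:, None]
print(np.allclose(np.log(alpha), log_alpha_recovered))  # True
```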

We also need to apply this same idea to the $\beta(\mathbf{z}_n)$ terms in the backward pass. However, I don’t think a discussion of this topic adds much intuition, and the derivations are not challenging. See Bishop (2006) for a complete discussion, if desired.

1. Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer.