Implementing Bayesian Online Changepoint Detection
I annotate my Python implementation of the framework in Adams and MacKay's 2007 paper, "Bayesian Online Changepoint Detection".
Since I first wrote about Bayesian online changepoint detection (BOCD), I have received a number of emails asking about implementation details. This is understandable, since the code near the end is relatively brief. The goal of this post is to explain my Python implementation of BOCD. This models the simplest case, Gaussian data with fixed variance and changing mean, but I think it highlights the main points. I assume you are familiar with both BOCD and the notation used in either my post or the original paper (Adams & MacKay, 2007).
Posterior predictive
We want to recursively compute the following joint distribution:

$$
p(r_t, x_{1:t}) = \sum_{r_{t-1}} \underbrace{p(x_t \mid r_{t-1}, x_t^{(r)})}_{\text{predictive}} \; \underbrace{p(r_t \mid r_{t-1})}_{\text{changepoint prior}} \; \underbrace{p(r_{t-1}, x_{1:t-1})}_{\text{message}}.
$$
The code for these recursions is fairly straightforward, and I’ll tackle them in the next section. However, the term labeled “predictive” is tricky because it requires some knowledge of Bayesian inference. The big idea is that the predictive term,

$$
p(x_t \mid r_{t-1}, x_t^{(r)}),
$$

is just the posterior predictive when we hypothesize that the run length is $r_{t-1} = n$, i.e. when the last changepoint occurred $n$ discrete time points ago. Thus, the only data we can condition on is $x_t^{(r)}$, which is just notational sugar for $x^{(r_{t-1} = n)}$, which is in turn notational sugar for all the data between time points $t-n$ and $t-1$. Furthermore, we can leverage conjugacy (when the prior has the same functional form as the posterior) to do computationally efficient sequential updates.
As mentioned, let’s work through the simplest case, a Gaussian with unknown mean. In other words, the model is:

$$
\begin{aligned}
\mu &\sim \mathcal{N}(\mu_0, \sigma_0^2), \\
x_t \mid \mu &\sim \mathcal{N}(\mu, \sigma_x^2),
\end{aligned}
$$

where $\mu$ changes according to a changepoint prior. In this post, I’ll assume the changepoint prior is constant. The variance $\sigma_x^2$ is fixed, while $\mu_0$ and $\sigma_0^2$ are prior parameters. The posterior predictive for this model, after seeing $n$ data points, is

$$
p(x_{n+1} \mid x_{1:n}) = \mathcal{N}(\mu_n, \sigma_n^2 + \sigma_x^2),
\qquad
\frac{1}{\sigma_n^2} = \frac{1}{\sigma_0^2} + \frac{n}{\sigma_x^2},
\qquad
\mu_n = \sigma_n^2 \left( \frac{\mu_0}{\sigma_0^2} + \frac{1}{\sigma_x^2} \sum_{i=1}^{n} x_i \right).
$$
See (Murphy, 2007) for detailed derivations of these equations, in particular the updates for the posterior mean and precision of a Gaussian with known variance. I also have a post on Bayesian inference for the Gaussian, but I just rederived Murphy’s work to understand it better.
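To make these formulas concrete, here is a small sanity check of my own (not from the original post): because the Gaussian prior is conjugate, applying the batch posterior formula once to all the data must give the same parameters as applying it one observation at a time, treating each posterior as the next prior.

```python
import numpy as np

def posterior_params(xs, mean0, var0, varx):
    """Batch posterior mean and variance after observing all of `xs`,
    using the conjugate updates for a Gaussian with known variance."""
    n = len(xs)
    prec_n = 1 / var0 + n / varx                       # 1 / sigma_n^2
    mu_n = (mean0 / var0 + np.sum(xs) / varx) / prec_n
    return mu_n, 1 / prec_n

rng = np.random.default_rng(0)
xs = rng.normal(loc=2.0, scale=1.0, size=10)
mean0, var0, varx = 0.0, 1.0, 1.0

# Sequential updates: each posterior becomes the prior for the next point.
mu, var = mean0, var0
for x in xs:
    mu, var = posterior_params([x], mu, var, varx)

# Batch update over all ten points at once.
mu_batch, var_batch = posterior_params(xs, mean0, var0, varx)
assert np.isclose(mu, mu_batch) and np.isclose(var, var_batch)
```

The posterior predictive for the next point is then $\mathcal{N}(\mu_n, \sigma_n^2 + \sigma_x^2)$; this sequential-equals-batch property is exactly what makes the vectorized updates below possible.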
At each time $t$, BOCD models $t$ hypotheses: the run length is $r_t \in \{0, 1, \dots, t-1\}$. Therefore, at each time $t$, we need to keep track of $2t$ posterior predictive parameters:

$$
\{\mu_0, \mu_1, \dots, \mu_{t-1}\}
\quad \text{and} \quad
\left\{ \frac{1}{\sigma_0^2}, \frac{1}{\sigma_1^2}, \dots, \frac{1}{\sigma_{t-1}^2} \right\}.
$$

To be clear, the subscripts above are the number of data points associated with a given run length hypothesis. For example, $\mu_2$ is the posterior predictive mean when we assume $r_t = 2$; there are two observations associated with this hypothesis: $x_{t-1}$ and $x_t$. Importantly, since the parameter updates are additive, we can easily compute a vector of parameter updates at time $t$ using those computed at time $t-1$. To see this, let’s write down the sequence of precision parameters $1/\sigma_n^2$ for time points $t \in \{1, 2, 3\}$:

$$
\begin{aligned}
t=1&: \quad \left[ \frac{1}{\sigma_0^2} \right] \\
t=2&: \quad \left[ \frac{1}{\sigma_0^2},\; \frac{1}{\sigma_0^2} + \frac{1}{\sigma_x^2} \right] \\
t=3&: \quad \left[ \frac{1}{\sigma_0^2},\; \frac{1}{\sigma_0^2} + \frac{1}{\sigma_x^2},\; \frac{1}{\sigma_0^2} + \frac{2}{\sigma_x^2} \right]
\end{aligned}
$$
So at time $t$, we can generate the parameters using those at time $t-1$. We simply add $1/\sigma_x^2$ to each parameter at time $t-1$, and then set the prior precision $1/\sigma_0^2$ as the first element. In vectorized code, we just initialize our precision parameters as
prec_params = np.array([1/var0])
and then recursively update them as
new_prec_params = prec_params + (1/varx)
prec_params = np.append([1/var0], new_prec_params)
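As a quick sanity check of my own (with arbitrary values for var0 and varx), running this recursion for a few steps reproduces the closed-form precisions $1/\sigma_0^2 + n/\sigma_x^2$:

```python
import numpy as np

var0, varx = 2.0, 0.5
prec_params = np.array([1 / var0])

for t in range(1, 4):
    new_prec_params = prec_params + (1 / varx)
    prec_params = np.append([1 / var0], new_prec_params)

# After three updates there are four hypotheses: n = 0, 1, 2, 3 observations.
expected = 1 / var0 + np.arange(4) / varx
assert np.allclose(prec_params, expected)
```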
The update works by leveraging NumPy’s (Harris et al., 2020) element-wise vector addition. Now let’s look at the sequence of mean parameters $\mu_n$ for time points $t \in \{1, 2, 3\}$:

$$
\begin{aligned}
t=1&: \quad \left[ \mu_0 \right] \\
t=2&: \quad \left[ \mu_0,\; \sigma_1^2 \left( \frac{\mu_0}{\sigma_0^2} + \frac{x_2}{\sigma_x^2} \right) \right] \\
t=3&: \quad \left[ \mu_0,\; \sigma_1^2 \left( \frac{\mu_0}{\sigma_0^2} + \frac{x_3}{\sigma_x^2} \right),\; \sigma_2^2 \left( \frac{\mu_0}{\sigma_0^2} + \frac{x_2 + x_3}{\sigma_x^2} \right) \right]
\end{aligned}
$$
One caveat: observe the data indices. For example, at time $t=2$, the second element in each vector of parameters contains the observation $x_2$ rather than $x_1$. This is not a typo. At time $t=2$, data point $x_2$ is the first and only data point in $x^{(r)}$, meaning under the run-length hypothesis that a changepoint occurred one time point ago. This is a subtle but important point; at each time point, BOCD is tracking all possible run length hypotheses, and so even the current observation is the “first observation” under the hypothesis that a changepoint just occurred.
Anyway, the updates for the mean parameters are a bit trickier, but there’s a nice solution. Notice the relationship between two sequential updates:

$$
\mu_n = \sigma_n^2 \left( \frac{\mu_0}{\sigma_0^2} + \frac{1}{\sigma_x^2} \sum_{i=1}^{n} x_i \right),
\qquad
\frac{\mu_{n-1}}{\sigma_{n-1}^2} = \frac{\mu_0}{\sigma_0^2} + \frac{1}{\sigma_x^2} \sum_{i=1}^{n-1} x_i.
$$

This means we can write $\mu_n$ in terms of $\mu_{n-1}$:

$$
\mu_n = \sigma_n^2 \left( \frac{\mu_{n-1}}{\sigma_{n-1}^2} + \frac{x_n}{\sigma_x^2} \right).
$$
So we can do a clever element-wise vector multiplication to generate the next set of parameters from the previous. First, we initialize our mean parameters as
mean_params = np.array([mean0])
and then recursively update them as
new_mean_params = (mean_params * prec_params[:-1] + (x / varx)) / new_prec_params
mean_params = np.append([mean0], new_mean_params)
where new_prec_params is defined in a snippet above. Notice that at time $t$, the previous parameters array mean_params has length $t-1$. Since we have already updated prec_params, it has length $t$. This is why we perform the element-wise multiplication mean_params * prec_params[:-1]. We then add the new observation, normalized by its assumed variance $\sigma_x^2$, and finally divide each element by the new precision parameters (or equivalently, multiply by the new variances). Notice that new_prec_params is also of length $t-1$, since the prior has not yet been prepended.
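To convince myself the indexing is right, here is a check of my own that combines the two recursions above and verifies, against the batch formula, that after observing all the data the $n$-th mean parameter conditions on exactly the most recent $n$ observations:

```python
import numpy as np

mean0, var0, varx = 0.0, 1.0, 0.5
rng = np.random.default_rng(1)
data = rng.normal(size=5)

mean_params = np.array([mean0])
prec_params = np.array([1 / var0])

# The recursive updates from the snippets above.
for x in data:
    new_prec_params = prec_params + (1 / varx)
    prec_params = np.append([1 / var0], new_prec_params)
    new_mean_params = (mean_params * prec_params[:-1] + (x / varx)) / new_prec_params
    mean_params = np.append([mean0], new_mean_params)

# mean_params[n] should equal the batch posterior mean under the hypothesis
# that only the most recent n observations belong to the current run.
for n in range(1, len(data) + 1):
    xs = data[-n:]
    prec_n = 1 / var0 + n / varx
    mu_n = (mean0 / var0 + xs.sum() / varx) / prec_n
    assert np.isclose(mean_params[n], mu_n)
```

Note that it is the *most recent* $n$ observations, not the first $n$; this is the data-indexing caveat discussed earlier.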
Finally, given our estimated mean and precision parameters, we can compute the posterior predictive probability of a new data point $x$ under each possible run length hypothesis at time $t$. We just use the posterior predictive above, where $n$ is the number of data points in the current run length hypothesis:
preds = np.empty(t)
for n in range(t):
    mean_n = mean_params[n]
    # Predictive std combines parameter and observation uncertainty.
    std_n = np.sqrt(1. / prec_params[n] + varx)
    preds[n] = norm(mean_n, std_n).pdf(x)
The above code is correct, but we can easily vectorize it:
post_means = mean_params[:t]
post_stds = np.sqrt(1. / prec_params[:t] + varx)
preds = norm(post_means, post_stds).pdf(x)
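Here is a small check of my own that the two versions agree, using arbitrary made-up parameter values; scipy’s norm broadcasts over array-valued locations and scales, which is what makes the vectorized form work:

```python
import numpy as np
from scipy.stats import norm

t, x, varx = 3, 0.7, 1.0
mean_params = np.array([0.0, 0.5, 0.6, 0.65])   # arbitrary example values
prec_params = np.array([1.0, 2.0, 3.0, 4.0])

# Loop version.
preds_loop = np.empty(t)
for n in range(t):
    std_n = np.sqrt(1. / prec_params[n] + varx)
    preds_loop[n] = norm(mean_params[n], std_n).pdf(x)

# Vectorized version: one density evaluation per run length hypothesis.
post_stds = np.sqrt(1. / prec_params[:t] + varx)
preds_vec = norm(mean_params[:t], post_stds).pdf(x)

assert np.allclose(preds_loop, preds_vec)
```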
Filtering recursions
Recursive estimation of the run length posterior is, at least in my mind, much easier to grok. We first initialize a lower-triangular matrix R representing the run length posterior for $T$ time points. The plus one is because we need a base case for our recursion, although it is not strictly necessary. Furthermore, we need a base case for the “message,” or the previously computed joint:
R = np.zeros((T + 1, T + 1))
R[0, 0] = 1
message = np.array([1])
Then, for each observation, we need to compute the predictive probabilities and multiply them by our changepoint prior and the message. In this case, I assume a fixed changepoint prior denoted by hazard:
preds = model.pred_prob(t, x)
growth_probs = preds * message * (1 - hazard)
cp_prob = sum(preds * message * hazard)
new_joint = np.append(cp_prob, growth_probs)
Next, we normalize our new joint distribution by the evidence and store it in R. Strictly speaking, we do not need R at all, but it is useful and fun for visualizing the run length posterior:
R[t, :t+1] = new_joint
evidence = sum(new_joint)
R[t, :t+1] /= evidence
Finally, we update our sufficient statistics (the posterior parameters) and pass the message:
model.update_params(t, x)
message = new_joint
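Putting the pieces together, here is a minimal end-to-end sketch of my own that consolidates the snippets above on synthetic data with a single mean shift. The class name and structure are mine, not necessarily those of the GitHub code:

```python
import numpy as np
from scipy.stats import norm

class GaussianUnknownMean:
    """Gaussian data with known variance varx and unknown, changing mean."""
    def __init__(self, mean0, var0, varx):
        self.mean0, self.var0, self.varx = mean0, var0, varx
        self.mean_params = np.array([mean0])
        self.prec_params = np.array([1 / var0])

    def pred_prob(self, t, x):
        # Posterior predictive density of x under each run length hypothesis.
        post_stds = np.sqrt(1. / self.prec_params[:t] + self.varx)
        return norm(self.mean_params[:t], post_stds).pdf(x)

    def update_params(self, t, x):
        # The additive precision and mean recursions derived above.
        new_prec_params = self.prec_params + (1 / self.varx)
        self.prec_params = np.append([1 / self.var0], new_prec_params)
        new_mean_params = (self.mean_params * self.prec_params[:-1]
                           + (x / self.varx)) / new_prec_params
        self.mean_params = np.append([self.mean0], new_mean_params)

T, hazard = 50, 1 / 20
rng = np.random.default_rng(2)
# Synthetic data: the mean shifts from 0 to 5 halfway through.
data = np.concatenate([rng.normal(0, 1, T // 2), rng.normal(5, 1, T // 2)])

model = GaussianUnknownMean(mean0=0.0, var0=1.0, varx=1.0)
R = np.zeros((T + 1, T + 1))
R[0, 0] = 1
message = np.array([1])

for t, x in enumerate(data, start=1):
    preds = model.pred_prob(t, x)
    growth_probs = preds * message * (1 - hazard)
    cp_prob = np.sum(preds * message * hazard)
    new_joint = np.append(cp_prob, growth_probs)
    R[t, :t + 1] = new_joint / np.sum(new_joint)
    model.update_params(t, x)
    message = new_joint

# Each row of R is a valid run length distribution.
assert np.allclose(R.sum(axis=1), 1.0)
```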
Numerical stability
At this point, we have a working version of BOCD. However, the code is numerically unstable for big datasets because we are constantly multiplying small numbers, resulting in numerical underflow. The standard fix is to work in log space. This is straightforward. We just need to use norm.logpdf instead of norm.pdf when computing predictive probabilities, and we need to use the log-sum-exp trick to normalize. Now our code looks like this:
# 3. Evaluate predictive probabilities.
log_pis = model.log_pred_prob(t, x)
# 4. Calculate growth probabilities.
log_growth_probs = log_pis + log_message + log_1mH
# 5. Calculate changepoint probabilities.
log_cp_prob = logsumexp(log_pis + log_message + log_H)
# 6. Calculate evidence
new_log_joint = np.append(log_cp_prob, log_growth_probs)
# 7. Determine run length distribution.
log_R[t, :t+1] = new_log_joint
log_R[t, :t+1] -= logsumexp(new_log_joint)
# 8. Update sufficient statistics.
model.update_params(t, x)
# Pass message.
log_message = new_log_joint
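To see why this fixes the underflow, here is a small check of my own that normalizing in log space with logsumexp recovers the correct posterior even for joint probabilities far too small to represent directly:

```python
import numpy as np
from scipy.special import logsumexp

# Unnormalized log joint with values far below the float64 underflow threshold.
log_joint = np.array([-1000.0, -1001.0, -1003.0])

# Direct normalization underflows: exp(-1000) is exactly 0.0 in float64.
assert np.exp(log_joint).sum() == 0.0

# Log-space normalization is stable and yields a proper distribution.
log_post = log_joint - logsumexp(log_joint)
post = np.exp(log_post)
assert np.isclose(post.sum(), 1.0)
```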
See the code on GitHub for a complete working example with all variables defined.
Conclusion
As we can see, BOCD is not a terribly complicated algorithm conceptually, but actually implementing it requires both knowledge of Bayesian inference and some tedious derivations. In my opinion, the sequential updates are the trickiest part, and I wouldn’t be surprised if there is an off-by-one error in my own logic or code; please email if you find a mistake. Hopefully going through these updates in painful detail underscores what’s happening in the algorithm.
- Adams, R. P., & MacKay, D. J. C. (2007). Bayesian online changepoint detection. ArXiv Preprint ArXiv:0710.3742.
- Murphy, K. P. (2007). Conjugate Bayesian analysis of the Gaussian distribution. Technical note, University of British Columbia.
- Harris, C. R., Millman, K. J., van der Walt, S. J., Gommers, R., Virtanen, P., Cournapeau, D., Wieser, E., Taylor, J., Berg, S., Smith, N. J., & others. (2020). Array programming with NumPy. Nature, 585(7825), 357–362.