Implementing Bayesian Online Changepoint Detection

I annotate my Python implementation of the framework in Adams and MacKay's 2007 paper, "Bayesian Online Changepoint Detection".

Since I first wrote about Bayesian online changepoint detection (BOCD), I have received a number of emails asking about implementation details. This is understandable, since the code near the end is relatively brief. The goal of this post is to explain my Python implementation of BOCD. This models the simplest case, Gaussian data with fixed variance and changing mean, but I think it highlights the main points. I assume you are familiar with both BOCD and the notation used in either my post or the original paper (Adams & MacKay, 2007).

Posterior predictive

We want to recursively compute the following joint distribution:

$$
p(r_t, \mathbf{x}_{1:t}) = \sum_{r_{t-1}} \overbrace{\vphantom{\Big|} p(x_t \mid r_t, \mathbf{x}^{(\ell)})}^{\text{Predictive}} \overbrace{\vphantom{\Big|} p(r_t \mid r_{t-1})}^{\text{Changepoint prior}} \overbrace{\vphantom{\Big|} p(r_{t-1}, \mathbf{x}_{1:t-1})}^{\text{Message}}. \tag{1}
$$

The code for these recursions is fairly straightforward, and I’ll tackle them in the next section. However, the term labeled “predictive” is tricky because it requires some knowledge of Bayesian inference. The big idea is that the predictive term,

$$
p(x_t \mid r_t, \mathbf{x}^{(\ell)}) \tag{2}
$$

is just the posterior predictive when we hypothesize that the run length is $\ell$, i.e. that the last changepoint occurred $\ell$ discrete time points ago. Thus, the only data we can condition on is $\mathbf{x}^{(\ell)}$, which is just notational sugar for $\mathbf{x}_{(t-\ell+1):t}$, which is in turn notational sugar for all the data between time points $(t-\ell+1)$ and $t$. Furthermore, we can leverage conjugacy, meaning the posterior has the same functional form as the prior, to do computationally efficient sequential updates.

As mentioned, let’s work through the simplest case, a Gaussian with unknown mean. In other words, the model is:

$$
\begin{aligned}
\mu_x &\sim \mathcal{N}(\mu_x \mid \mu_0, \sigma_0^2), \\
x &\sim \mathcal{N}(x \mid \mu_x, \sigma_x^2),
\end{aligned} \tag{3}
$$

where $\mu_x$ changes according to a changepoint prior. In this post, I’ll assume the changepoint prior is constant. The variance $\sigma_x^2$ is fixed and known, while $\mu_0$ and $\sigma_0^2$ are prior hyperparameters. The posterior predictive for this model, after seeing $n$ data points, is

$$
\begin{aligned}
\frac{1}{\sigma_n^2} &= \frac{1}{\sigma_0^2} + \frac{n}{\sigma_x^2}, \\
\mu_n &= \sigma_n^2 \Big( \frac{\mu_0}{\sigma_0^2} + \frac{\sum_{i=1}^{n} x_i}{\sigma_x^2} \Big), \\
p(x \mid \mathbf{x}_{1:n}) &= \mathcal{N}(x \mid \mu_n, \sigma_n^2 + \sigma_x^2).
\end{aligned} \tag{4}
$$

See Section 2 in (Murphy, 2007) for detailed derivations of these equations; in particular, see Murphy’s equations 19, 24, and 36. I also have a post on Bayesian inference for the Gaussian, but there I just rederive Murphy’s work to understand it better.
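To make Eq. 4 concrete, here is a minimal sketch of the batch computation, where mean0, var0, varx, and the data values are illustrative names and numbers of my own choosing:

import numpy as np

# Illustrative hyperparameters: prior N(mean0, var0) on the mean, known variance varx.
mean0, var0 = 0.0, 2.0
varx        = 1.0

data = np.array([0.5, 1.2, 0.8])   # n = 3 observations
n    = len(data)

# Eq. 4: posterior precision, variance, and mean after n observations.
prec_n = 1/var0 + n/varx
var_n  = 1/prec_n
mean_n = var_n * (mean0/var0 + data.sum()/varx)

# The posterior predictive for the next observation is N(mean_n, var_n + varx).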

At each time $t$, BOCD models $(t+1)$ hypotheses: the run length is $0, 1, 2, \dots, t$. Therefore, at each time $t$, we need to keep track of $(t+1)$ pairs of posterior predictive parameters:

$$
\{(\mu_0, \sigma_0^2), (\mu_1, \sigma_1^2), (\mu_2, \sigma_2^2), \dots, (\mu_t, \sigma_t^2)\}. \tag{5}
$$

To be clear, the subscripts above are the number of data points associated with a given run length hypothesis. For example, $\mu_2$ is the posterior predictive mean when we assume $\ell = 2$; there are two observations associated with this hypothesis: $\mathbf{x}^{(2)} = \{ x_{t-1}, x_t \}$. Importantly, since the parameter updates are additive, we can easily compute a vector of parameter updates at time $t$ using those computed at time $(t-1)$. To see this, let’s write down the sequence of precision parameters $1/\sigma_n^2$ for time points $t = 0, 1, 2, \dots$:

$$
\begin{aligned}
t=0, & \quad \big[ 1/\sigma_0^2 \big] \\
t=1, & \quad \big[ 1/\sigma_0^2, \quad \overbrace{1/\sigma_0^2 + 1/\sigma_x^2}^{1/\sigma_1^2} \big] \\
t=2, & \quad \big[ 1/\sigma_0^2, \quad 1/\sigma_0^2 + 1/\sigma_x^2, \quad \overbrace{1/\sigma_0^2 + 2/\sigma_x^2}^{1/\sigma_2^2} \big] \\
t=3, & \quad \big[ 1/\sigma_0^2, \quad 1/\sigma_0^2 + 1/\sigma_x^2, \quad 1/\sigma_0^2 + 2/\sigma_x^2, \quad \overbrace{1/\sigma_0^2 + 3/\sigma_x^2}^{1/\sigma_3^2} \big] \\
& \vdots
\end{aligned} \tag{6}
$$

So at time $t = k$, we can generate the parameters using those at time $t = k - 1$. We simply add $1/\sigma_x^2$ to each parameter at time $t = k - 1$, and then set the prior $1/\sigma_0^2$ as the first element. In vectorized code, we just initialize our precision parameters as

prec_params = np.array([1/var0])

and then recursively update them as

new_prec_params = prec_params + (1/varx)
prec_params     = np.append([1/var0], new_prec_params)
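As a quick check, running this update a few times, with illustrative values for var0 and varx, reproduces the pattern in Eq. 6:

var0, varx  = 2.0, 1.0
prec_params = np.array([1/var0])
for _ in range(3):
    new_prec_params = prec_params + (1/varx)
    prec_params     = np.append([1/var0], new_prec_params)

# prec_params is now [1/var0, 1/var0 + 1/varx, 1/var0 + 2/varx, 1/var0 + 3/varx],
# matching the t = 3 row of Eq. 6.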

This works by leveraging NumPy’s (Harris et al., 2020) element-wise vector addition. Now let’s look at the sequence of parameters for $\mu_n$ for time points $t = 0, 1, 2, \dots$:

$$
\begin{aligned}
t=0, & \quad \Big[ \mu_0 \Big] \\
t=1, & \quad \Big[ \mu_0, \quad \overbrace{\sigma_1^2 \Big( \frac{\mu_0}{\sigma_0^2} + \frac{x_1}{\sigma_x^2} \Big)}^{\mu_1} \Big] \\
t=2, & \quad \Big[ \mu_0, \quad \sigma_1^2 \Big( \frac{\mu_0}{\sigma_0^2} + \frac{x_2}{\sigma_x^2} \Big), \quad \overbrace{\sigma_2^2 \Big( \frac{\mu_0}{\sigma_0^2} + \frac{x_1 + x_2}{\sigma_x^2} \Big)}^{\mu_2} \Big] \\
t=3, & \quad \Big[ \mu_0, \quad \sigma_1^2 \Big( \frac{\mu_0}{\sigma_0^2} + \frac{x_3}{\sigma_x^2} \Big), \quad \sigma_2^2 \Big( \frac{\mu_0}{\sigma_0^2} + \frac{x_2 + x_3}{\sigma_x^2} \Big), \quad \overbrace{\sigma_3^2 \Big( \frac{\mu_0}{\sigma_0^2} + \frac{x_1 + x_2 + x_3}{\sigma_x^2} \Big)}^{\mu_3} \Big] \\
& \vdots
\end{aligned} \tag{7}
$$

One caveat: observe the data indices. For example, at time $t=2$, the second element in the vector of parameters contains the observation $x_2$ rather than $x_1$. This is not a typo. At time $t=2$, data point $x_2$ is the first and only data point in $\mathbf{x}^{(1)}$, i.e. under the run-length hypothesis that a changepoint occurred one time point ago. This is a subtle but important point: at each time point, BOCD is tracking all possible run length hypotheses, and so even the current observation is the “first observation” under the hypothesis that a changepoint just occurred.

Anyway, the updates from Eq. 7 are a bit trickier, but there’s a nice solution. Notice the relationship between two sequential updates:

$$
\begin{aligned}
\mu_{k-1} &= \sigma_{k-1}^2 \Big( \frac{\mu_0}{\sigma_0^2} + \frac{x_1 + \dots + x_{k-1}}{\sigma_x^2} \Big), \\
\mu_{k} &= \sigma_k^2 \Big( \frac{\mu_0}{\sigma_0^2} + \frac{x_1 + \dots + x_{k-1} + x_k}{\sigma_x^2} \Big).
\end{aligned} \tag{8}
$$

This means we can write $\mu_k$ in terms of $\mu_{k-1}$:

$$
\mu_k = \sigma_k^2 \Big( \frac{\mu_{k-1}}{\sigma_{k-1}^2} + \frac{x_k}{\sigma_x^2} \Big). \tag{9}
$$

So we can do a clever element-wise vector multiplication to generate the next set of parameters from the previous. First, we initialize our mean parameters as

mean_params = np.array([mean0])

and then recursively update them as

new_mean_params = (mean_params * prec_params[:-1] + (x / varx)) / new_prec_params
mean_params     = np.append([mean0], new_mean_params)

where new_prec_params is defined in the snippet above. Notice that at time $t$, the previous parameter array mean_params has length $t$. Since we have already updated prec_params, it has length $(t+1)$; and because the precision updates do not depend on the data, its first $t$ elements, prec_params[:-1], are exactly the previous precisions $1/\sigma_{k-1}^2$. This is why we perform the element-wise multiplication mean_params * prec_params[:-1]. We then add the new observation, normalized by its assumed variance $\sigma_x^2$, and finally divide each element by the new precision parameters (or equivalently, multiply by the new variances). Notice that new_prec_params also has length $t$, since the prior has not yet been prepended.
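As a sanity check, here is a small sketch, again with made-up values, confirming that the last element of mean_params after processing three observations matches the batch formula in Eq. 4 applied to all of the data:

import numpy as np

mean0, var0, varx = 0.0, 2.0, 1.0
data = np.array([0.5, 1.2, 0.8])

mean_params = np.array([mean0])
prec_params = np.array([1/var0])
for x in data:
    # Precision update (Eq. 6), then mean update (Eqs. 7 and 9).
    new_prec_params = prec_params + (1/varx)
    prec_params     = np.append([1/var0], new_prec_params)
    new_mean_params = (mean_params * prec_params[:-1] + (x / varx)) / new_prec_params
    mean_params     = np.append([mean0], new_mean_params)

# The last entry corresponds to the hypothesis that all three observations
# belong to the current run, so it should equal the batch posterior mean.
n      = len(data)
var_n  = 1 / (1/var0 + n/varx)
mean_n = var_n * (mean0/var0 + data.sum()/varx)
assert np.isclose(mean_params[-1], mean_n)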

Finally, given our estimated mean and precision parameters, we can compute the posterior predictive probability of a new data point under each possible run length hypothesis at time $t$. We just use Eq. 4, where $n$ is the number of data points in the current run length hypothesis:

preds = np.empty(t)
for n in range(t):
    mean_n   = mean_params[n]
    std_n    = np.sqrt(1. / prec_params[n] + varx)
    preds[n] = norm(mean_n, std_n).pdf(x)

The above code is correct, but we can easily vectorize it:

post_means = mean_params[:t]
post_stds  = np.sqrt(1. / prec_params[:t] + varx)
preds      = norm(post_means, post_stds).pdf(x)

Filtering recursions

Recursive estimation of the run length posterior is, at least in my mind, much easier to grok. We first initialize a $(T+1) \times (T+1)$ lower-triangular matrix $R$ representing the run length posterior for $T$ time points. The plus one is because we need a base case for our recursion, although storing $R$ is not strictly necessary. Furthermore, we need a base case for the “message” in Eq. 1, i.e. the previously computed joint:

R = np.zeros((T + 1, T + 1))
R[0, 0] = 1
message = np.array([1])

Then, for each observation, we need to compute the predictive probabilities and multiply them by our changepoint prior and message. In this case, I assume a constant changepoint prior (hazard rate), denoted hazard:

preds        = model.pred_prob(t, x)
growth_probs = preds * message * (1 - hazard)
cp_prob      = sum(preds * message * hazard)
new_joint    = np.append(cp_prob, growth_probs)

Next, we normalize our new joint distribution by the evidence and store it in $R$. Strictly speaking, we do not need to use $R$ at all, but it is useful and fun for visualizing the run length posterior:

R[t, :t+1] = new_joint
evidence = sum(new_joint)
R[t, :t+1] /= evidence

Finally, we update our sufficient statistics (the per-hypothesis parameters) and pass the message:

model.update_params(t, x)
message = new_joint
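Putting these pieces together, the full (non-log) recursion is roughly the sketch below. The loop structure and variable names here are my own; I assume data is an array of length T and that model exposes the pred_prob and update_params methods discussed above:

R = np.zeros((T + 1, T + 1))
R[0, 0] = 1
message = np.array([1])
hazard  = 1/100   # illustrative constant changepoint probability

for t in range(1, T + 1):
    x = data[t-1]

    # Predictive probability of x under each run length hypothesis.
    preds = model.pred_prob(t, x)

    # Growth probabilities: no changepoint at time t.
    growth_probs = preds * message * (1 - hazard)

    # Changepoint probability: all mass collapses onto run length zero.
    cp_prob = sum(preds * message * hazard)

    # New joint p(r_t, x_{1:t}); normalize by the evidence to fill in R.
    new_joint  = np.append(cp_prob, growth_probs)
    R[t, :t+1] = new_joint / sum(new_joint)

    # Update the per-hypothesis parameters and pass the (unnormalized) message.
    model.update_params(t, x)
    message = new_joint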

Numerical stability

At this point, we have a working version of BOCD. However, the code is numerically unstable for big datasets because we are constantly multiplying small numbers, resulting in numerical underflow. The standard fix is to work in log space. This is straightforward: we just need to use norm.logpdf instead of norm.pdf when computing predictive probabilities, and we need the log-sum-exp trick to normalize.
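As a quick illustration, normalizing a vector of log probabilities amounts to subtracting their log-sum-exp, which SciPy provides as scipy.special.logsumexp (the numbers below are made up):

import numpy as np
from scipy.special import logsumexp

log_joint = np.array([-1200.5, -1201.3, -1199.8])   # exponentiating these underflows to zero
log_norm  = log_joint - logsumexp(log_joint)         # normalized in log space
# np.exp(log_norm) sums to one.

With these changes, the recursion looks like this: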

# 3. Evaluate predictive probabilities.
log_pis = model.log_pred_prob(t, x)

# 4. Calculate growth probabilities.
log_growth_probs = log_pis + log_message + log_1mH

# 5. Calculate changepoint probabilities.
log_cp_prob = logsumexp(log_pis + log_message + log_H)

# 6. Calculate evidence
new_log_joint = np.append(log_cp_prob, log_growth_probs)

# 7. Determine run length distribution.
log_R[t, :t+1]  = new_log_joint
log_R[t, :t+1] -= logsumexp(new_log_joint)

# 8. Update sufficient statistics.
model.update_params(t, x)

# Pass message.
log_message = new_log_joint
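For completeness, here is a sketch of what the model object might look like in the Gaussian case, collecting the log predictive density and the parameter recursions from above into a class (I call it GaussianUnknownMean here; see the full code linked below for the author’s actual implementation):

import numpy as np
from scipy.stats import norm

class GaussianUnknownMean:

    def __init__(self, mean0, var0, varx):
        # Prior hyperparameters and fixed observation variance (Eq. 3).
        self.mean0, self.var0, self.varx = mean0, var0, varx
        self.mean_params = np.array([mean0])
        self.prec_params = np.array([1/var0])

    def log_pred_prob(self, t, x):
        # Posterior predictive log density under each run length hypothesis (Eq. 4).
        post_means = self.mean_params[:t]
        post_stds  = np.sqrt(1. / self.prec_params[:t] + self.varx)
        return norm(post_means, post_stds).logpdf(x)

    def update_params(self, t, x):
        # Precision and mean recursions (Eqs. 6, 7, and 9).
        new_prec_params  = self.prec_params + (1/self.varx)
        self.prec_params = np.append([1/self.var0], new_prec_params)
        new_mean_params  = (self.mean_params * self.prec_params[:-1]
                            + (x / self.varx)) / new_prec_params
        self.mean_params = np.append([self.mean0], new_mean_params)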

See the code on GitHub for a complete working example with all variables defined.

Conclusion

As we can see, BOCD is not a terribly complicated algorithm conceptually, but actually implementing it requires both knowledge of Bayesian inference and some tedious derivations. In my opinion, the sequential updates are the trickiest part, and I wouldn’t be surprised if there is an off-by-one error in my own logic or code; please email me if you find a mistake. Hopefully going through these updates in painful detail underscores what’s happening in the algorithm.

  1. Adams, R. P., & MacKay, D. J. C. (2007). Bayesian online changepoint detection. ArXiv Preprint ArXiv:0710.3742.
  2. Murphy, K. P. (2007). Conjugate Bayesian analysis of the Gaussian distribution. Def, 1(2σ2), 16.
  3. Harris, C. R., Millman, K. J., van der Walt, S. J., Gommers, R., Virtanen, P., Cournapeau, D., Wieser, E., Taylor, J., Berg, S., Smith, N. J., & others. (2020). Array programming with NumPy. Nature, 585(7825), 357–362.