Bayesian Online Changepoint Detection
Adams and MacKay's 2007 paper, "Bayesian Online Changepoint Detection", introduces a modular Bayesian framework for online estimation of changes in the generative parameters of sequential data. I discuss this paper in detail.
Introduction
Given sequential data such as stock market prices or streaming stories in a newswire, we might be interested in when these data change in some way, such as a stock price falling or the document topics shifting. If we view our data as observations from a generative process, then we care about when the generative parameters change. An abrupt change in these parameters is called a changepoint, and changepoint detection is the modeling and inferring of these events.
The goal of this post is to explain Ryan P. Adams and David J.C. MacKay’s technical report on Bayesian online changepoint detection (Adams & MacKay, 2007) in my own words, and to work through the framework and code for a particular model. While I focus my discussion on Adams and MacKay’s paper, (Fearnhead & Liu, 2007) is a similar model for the same problem, and it is worth reading as a comparison.
Understanding this post will require understanding conjugacy and the exponential family. The example implementation requires understanding Bayesian inference for the Gaussian.
Model
Let $x_t \in \mathbb{R}^d$ denote the $t$th observation in a data sequence where $d$ is fixed, and let $\mathbf{x}_{s:t}$ denote the sequence $x_s, x_{s+1}, \dots, x_{t-1}, x_{t}$ for $s \leq t$. We assume that our data $\mathbf{x}_{1:T}$ can be partitioned such that the data within each partition are i.i.d. samples from some distribution; this is the product partition model (Barry & Hartigan, 1992). Formally, let $\boldsymbol{\theta}^{(p)}$ denote the generative parameters for partition $p$ and $\boldsymbol{\eta}$ denote the hyperprior. Then $\boldsymbol{\theta}^{(p)} \sim p(\boldsymbol{\eta})$ are i.i.d. for $p = 1, 2, \dots, P$ (Figure $1\text{a}$).
Bayesian online changepoint detection works by modeling the time since the last changepoint, called the run length. The run length at time $t$ is denoted $r_t$. Logically, it can take one of two values,

$$r_t = \begin{cases} 0 & \text{if a changepoint occurs at time } t, \\ r_{t-1} + 1 & \text{otherwise}. \end{cases}$$
In words, the run length only ever increases by one or drops to zero (Figure $1\text{b}$). The transition probabilities are modeled by the changepoint prior $p(r_t \mid r_{t-1})$, with the assumption that $r_t$ is conditionally independent of everything else given $r_{t-1}$.
It is useful to express the generative procedure in a probabilistic graphical model (Figure $2$). The dynamics of $r_t$ follow a Markov process. The data are conditionally independent of everything else given the generative parameters. We only observe our data, $x_1, x_2, \dots, x_T$. We will discuss the hyperparameters $\boldsymbol{\eta}$ and $H$ later.
If we take the generative model in Figure $2$ and marginalize out the parameters $\theta^{(p)}$, then we induce dependencies between the observations (Figure $3$). See this link for a review of this principle. We will see this formally in Equation $1$. Figure $3$ looks a lot like a hidden Markov model with a bit more structure: the latent variables are run length variables with more limited support, and there are dependencies between observations. To compute $p(x_t \mid r_t, \mathbf{x}_{1:t-1})$ efficiently, the algorithm uses the exponential family and message passing to condition on $\boldsymbol{\eta}$, which, since the data are i.i.d., eliminates the dependency on $\mathbf{x}_{1:t-1}$.
Ultimately, we want to infer both the run length (RL) posterior distribution $p(r_t \mid \mathbf{x}_{1:t})$ and the posterior predictive distribution $p(x_{t+1} \mid \mathbf{x}_{1:t})$ so that we can predict the next data point given all the data we have seen so far.
Recursive RL posterior estimation
In Bayesian inference, if $\mathcal{D}$ is a set of $N$ i.i.d. observations $\{\mathbf{x}_n\}_{n=1}^{N}$ and $\boldsymbol{\theta}$ are model parameters, then the posterior predictive distribution over a new observation $\hat{\mathbf{x}}$ is

$$p(\hat{\mathbf{x}} \mid \mathcal{D}) = \int p(\hat{\mathbf{x}} \mid \boldsymbol{\theta})\, p(\boldsymbol{\theta} \mid \mathcal{D})\, \mathrm{d}\boldsymbol{\theta}.$$
In Bayesian online changepoint detection, this posterior takes a slightly different form. The posterior predictive is the probability of the next data point, $x_{t+1}$, given the observations so far, $\mathbf{x}_{1:t}$, and it is computed by marginalizing out the run length $r_t$,

$$p(x_{t+1} \mid \mathbf{x}_{1:t}) = \sum_{r_t} \underbrace{p(x_{t+1} \mid r_t, \mathbf{x}_{1:t})}_{\text{UPM predictive}}\, \underbrace{p(r_t \mid \mathbf{x}_{1:t})}_{\text{RL posterior}}. \tag{1}$$
While Adams and MacKay refer to the term analogous to “Model” as a “marginal predictive”, I will use underlying probabilistic model (UPM) predictive to clearly disambiguate it from other predictives in this discussion. In the next section, we will see how to efficiently compute this term, but for now assume that it is straightforward. (Equation $1$ differs slightly from Adams and MacKay’s formulation; see the appendix.)
If we can compute the UPM predictive, then we just need to compute the RL posterior, which is proportional to the joint,

$$p(r_t \mid \mathbf{x}_{1:t}) = \frac{p(r_t, \mathbf{x}_{1:t})}{\sum_{r_t'} p(r_t', \mathbf{x}_{1:t})} \propto p(r_t, \mathbf{x}_{1:t}). \tag{2}$$
We can write this joint recursively as

$$\begin{aligned}
p(r_t, \mathbf{x}_{1:t}) &= \sum_{r_{t-1}} p(r_t, r_{t-1}, \mathbf{x}_{1:t}) \\
&= \sum_{r_{t-1}} p(r_t, x_t \mid r_{t-1}, \mathbf{x}_{1:t-1})\, p(r_{t-1}, \mathbf{x}_{1:t-1}) \\
&= \sum_{r_{t-1}} p(r_t \mid r_{t-1})\, p(x_t \mid r_t, \mathbf{x}_{1:t-1})\, p(r_{t-1}, \mathbf{x}_{1:t-1}).
\end{aligned}$$
The first two steps are just marginalization and the chain rule. However, the third step is trickier because it only holds under the modeling assumptions in Figure $3$. (If this derivation feels like it came out of thin air, see the appendix.) Both of these equations,

$$p(r_t \mid r_{t-1}, \mathbf{x}_{1:t-1}) = p(r_t \mid r_{t-1}), \qquad p(x_t \mid r_t, r_{t-1}, \mathbf{x}_{1:t-1}) = p(x_t \mid r_t, \mathbf{x}_{1:t-1}),$$

fall out of the assumptions encoded in the graphical model in Figure $3$. (If you expected $x_t$ to depend on $r_{t-1}$ as per the paper, see the appendix.) Summarizing and annotating the derivation, we have

$$p(r_t, \mathbf{x}_{1:t}) = \sum_{r_{t-1}} \underbrace{p(r_t \mid r_{t-1})}_{\text{changepoint prior}}\, \underbrace{p(x_t \mid r_t, \mathbf{x}_{1:t-1})}_{\text{UPM predictive}}\, \underbrace{p(r_{t-1}, \mathbf{x}_{1:t-1})}_{\text{recursive term}}. \tag{3}$$
We now have a recursive algorithm for computing the RL posterior. The algorithm is recursive because after we compute the joint distribution $p(r_{t-1}, \mathbf{x}_{1:t-1})$, we can forward message-pass this distribution’s mass to the calculation for $p(r_{t}, \mathbf{x}_{1:t})$. Of course, a recursive algorithm must define its initial conditions (what is $p(r_0)$?), and we will address this later.
It is useful to think about this recursive algorithm visually. Borrowing language from the paper, we can visualize the message-passing algorithm as living on a trellis (Figure $4$). At each time point, mass is either passed “upward” such that the run length is incremented or “downward” where it is truncated to zero. At each time point $t$, the discrete RL posterior distribution has support over $t+1$ values.
This observation has an important algorithmic consequence. When we compute what Adams and MacKay call the growth probabilities, or the probability that the run length increases, the summation in Equation $3$ disappears because each term in the summation is zero except when $r_{t} = r_{t-1} + 1$. It is only when calculating the changepoint probabilities, when the mass is passed downward, that we sum over previous joint probabilities.
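To make this concrete, here is a minimal sketch of one trellis step, with hypothetical numbers: `prev_joint` stands in for $p(r_{t-1}, \mathbf{x}_{1:t-1})$, `pred` for the UPM predictive probabilities, and `H` for a constant hazard (all names are mine, not from the paper):

```python
import numpy as np

def trellis_step(prev_joint, pred, H):
    """One message-passing step: shift mass up (growth) or collapse it to zero."""
    new_joint = np.empty(len(prev_joint) + 1)
    # Growth: r_t = r_{t-1} + 1. No sum needed -- a one-to-one shift up the trellis.
    new_joint[1:] = prev_joint * pred * (1 - H)
    # Changepoint: r_t = 0. Here we do sum over all previous run lengths.
    new_joint[0] = np.sum(prev_joint * pred * H)
    return new_joint

prev_joint = np.array([0.2, 0.5, 0.3])  # hypothetical joint over r_{t-1} in {0, 1, 2}
pred = np.array([0.1, 0.4, 0.2])        # hypothetical UPM predictive probabilities
new_joint = trellis_step(prev_joint, pred, H=0.25)
assert len(new_joint) == 4              # support grows by one each time step
```

Note that only the changepoint entry involves a summation, exactly as described above.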
In summary, provided we can compute the UPM predictive and the changepoint prior in Equation $3$ and can initialize our recursive algorithm, we will have everything we need for both inference and prediction. Let’s tackle these in that order.
Underlying probabilistic model
Computing the UPM predictive leverages conjugacy and the exponential family. In principle, one could abbreviate this section as most of it is not methodologically novel to the paper. However, when I first read the paper, computing the UPM predictive was not obvious to me, and I will go into detail to explain it.
We assume our model is from an exponential family (EF) member with parameters $\boldsymbol{\eta}$. Since we need to compute $p(x_{t+1} \mid r_t, \mathbf{x}_{1:t})$, we can eliminate the dependency on the previous observations ($\mathbf{x}_{1:t}$) by integrating over the model’s parameters,

$$p(x_{t+1} \mid r_t, \mathbf{x}_{1:t}) = \int p(x_{t+1} \mid \boldsymbol{\eta})\, p(\boldsymbol{\eta} \mid r_t, \mathbf{x}_{1:t})\, \mathrm{d}\boldsymbol{\eta}. \tag{4}$$
This is because $x_{t+1}$ is conditionally independent of $\mathbf{x}_{1:t}$ given $\boldsymbol{\eta}$. We know the EF model in Equation $4$ because we chose it by construction; we can decide that our data are best modeled by any member of the exponential family. But what is our EF posterior, and how do we compute this integral? A useful fact about exponential family models is that we can avoid computing these directly.
Posterior predictive for conjugate models
Before discussing the exponential family, let’s review the prior predictive for any conjugate model. Let $\mathcal{D}$ be our observations, $\hat{\mathbf{x}}$ be a new observation, $\boldsymbol{\theta}$ be our parameters, and $\boldsymbol{\alpha}$ be our hyperparameters. The prior predictive distribution is the distribution on $\hat{\mathbf{x}}$ given $\boldsymbol{\alpha}$ marginalized over $\boldsymbol{\theta}$,

$$p(\hat{\mathbf{x}} \mid \boldsymbol{\alpha}) = \int p(\hat{\mathbf{x}} \mid \boldsymbol{\theta})\, p(\boldsymbol{\theta} \mid \boldsymbol{\alpha})\, \mathrm{d}\boldsymbol{\theta}.$$
Since our prior on $\boldsymbol{\theta}$ is conjugate, we know that

$$p(\boldsymbol{\theta} \mid \mathcal{D}, \boldsymbol{\alpha}) = p(\boldsymbol{\theta} \mid \boldsymbol{\alpha}')$$
for some different hyperparameters $\boldsymbol{\alpha}’$. This is not a deep statement; we are just observing that if we use a conjugate prior, then the posterior has the same functional form as the prior. This implies

$$p(\hat{\mathbf{x}} \mid \mathcal{D}, \boldsymbol{\alpha}) = \int p(\hat{\mathbf{x}} \mid \boldsymbol{\theta})\, p(\boldsymbol{\theta} \mid \mathcal{D}, \boldsymbol{\alpha})\, \mathrm{d}\boldsymbol{\theta} = \int p(\hat{\mathbf{x}} \mid \boldsymbol{\theta})\, p(\boldsymbol{\theta} \mid \boldsymbol{\alpha}')\, \mathrm{d}\boldsymbol{\theta} = p(\hat{\mathbf{x}} \mid \boldsymbol{\alpha}').$$
Thus, the posterior predictive is the same as the prior predictive but with hyperparameters $\boldsymbol{\alpha}’$ instead of $\boldsymbol{\alpha}$. If we can compute $\boldsymbol{\alpha}’$, we can compute the posterior predictive without integration and without computing the posterior $p(\boldsymbol{\theta} \mid \mathcal{D}, \boldsymbol{\alpha})$. And conjugate-exponential models allow us to efficiently compute $\boldsymbol{\alpha}’$.
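As a small sketch of this idea (a hypothetical Beta-Bernoulli example of my own, not from the paper): with a Beta$(\alpha, \beta)$ prior on a coin’s bias, the conjugate update is $\alpha' = \alpha + \#\text{heads}$, $\beta' = \beta + \#\text{tails}$, and the posterior predictive is just the prior predictive evaluated at $(\alpha', \beta')$:

```python
def predictive_prob_heads(alpha, beta):
    """Prior predictive probability of heads under a Beta(alpha, beta) prior."""
    return alpha / (alpha + beta)

def updated_hyperparams(alpha, beta, data):
    """Conjugate update: same functional form, new hyperparameters."""
    heads = sum(data)
    tails = len(data) - heads
    return alpha + heads, beta + tails

data = [1, 0, 1, 1]  # three heads, one tail
alpha_post, beta_post = updated_hyperparams(2.0, 2.0, data)
# Posterior predictive = prior predictive with updated hyperparameters; no integral.
print(predictive_prob_heads(alpha_post, beta_post))  # -> 0.625
```

The same function computes both predictives; only the hyperparameters change.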
Posterior predictive for exponential family models
Now let’s discuss the prior predictive for the exponential family. Recall that the general form for any member of the exponential family is

$$p(\mathbf{x} \mid \boldsymbol{\eta}) = h(\mathbf{x})\, g(\boldsymbol{\eta}) \exp\!\big\{ \boldsymbol{\eta}^{\top} u(\mathbf{x}) \big\},$$
where $\boldsymbol{\eta}$ is the natural parameter, $h(\mathbf{x})$ is the underlying measure, $u(\mathbf{x})$ is the sufficient statistic of the data, and $g(\boldsymbol{\eta})$ is a normalizer. The conjugate prior with hyperparameters $\nu$ and $\boldsymbol{\chi}$ is

$$p(\boldsymbol{\eta} \mid \boldsymbol{\chi}, \nu) = f(\boldsymbol{\chi}, \nu)\, g(\boldsymbol{\eta})^{\nu} \exp\!\big\{ \boldsymbol{\eta}^{\top} \boldsymbol{\chi} \big\},$$
where $f(\boldsymbol{\chi}, \nu)$ depends on the form of the exponential family member and $\boldsymbol{\eta}$ and $g(\boldsymbol{\eta})$ are the same as before. If we multiply the likelihood by the prior to get the posterior, we have

$$p(\boldsymbol{\eta} \mid \mathcal{D}, \boldsymbol{\chi}, \nu) \propto \Big[ \prod_{n=1}^{N} h(\mathbf{x}_n) \Big] f(\boldsymbol{\chi}, \nu)\, g(\boldsymbol{\eta})^{\nu + N} \exp\!\Big\{ \boldsymbol{\eta}^{\top} \Big( \boldsymbol{\chi} + \sum_{n=1}^{N} u(\mathbf{x}_n) \Big) \Big\}.$$
Since the first $N + 1$ terms are constant with respect to $\boldsymbol{\eta}$, we can write

$$p(\boldsymbol{\eta} \mid \mathcal{D}, \boldsymbol{\chi}, \nu) \propto g(\boldsymbol{\eta})^{\nu + N} \exp\!\Big\{ \boldsymbol{\eta}^{\top} \Big( \boldsymbol{\chi} + \sum_{n=1}^{N} u(\mathbf{x}_n) \Big) \Big\}.$$
Note that this is the same exponential family form as the prior, with parameters:

$$\nu' = \nu + N, \qquad \boldsymbol{\chi}' = \boldsymbol{\chi} + \sum_{n=1}^{N} u(\mathbf{x}_n).$$
In other words, we have an efficient way to compute the updated hyperparameters $\nu’$ and $\boldsymbol{\chi}’$, which allow us to compute the posterior predictive (Equation $4$) without integration and without computing the posterior over $\boldsymbol{\eta}$. This is what Adams and MacKay mean when they write,
This [UPM predictive] distribution, while generally not itself an exponential-family distribution, is usually a simple function of the sufficient statistics.
Importantly, these sufficient statistics are additive and can therefore be computed sequentially.
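A tiny sketch of that additivity (a hypothetical Bernoulli example, where the sufficient statistic is $u(x) = x$ and the prior hyperparameters `nu` and `chi` are made-up values): the updates $\nu' = \nu + N$ and $\boldsymbol{\chi}' = \boldsymbol{\chi} + \sum_n u(\mathbf{x}_n)$ can be folded in one datum at a time.

```python
# Sequential update of conjugate-exponential hyperparameters.
nu, chi = 1.0, 0.5            # hypothetical prior hyperparameters
for x in [1, 0, 1, 1]:        # stream of observations
    nu += 1                   # one more observation: nu' = nu + N
    chi += x                  # additive sufficient statistic: chi' = chi + sum u(x_n)
print(nu, chi)                # -> 5.0 3.5
```

The final values are identical to a single batch update, which is what makes the online algorithm possible.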
Message-passing parameters
The upshot of the previous subsection is that rather than computing the UPM predictive by computing the EF posterior and integrating out the parameters, we just need to keep track of the exponential family parameters at time $t-1$ to make a prediction at time $t$. Recall that at each time point $t$, $r_{t-1}$ can take $t$ possible values (Figure $4$). Let $\boldsymbol{\nu}_{t-1}^{(\ell)}$ and $\boldsymbol{\chi}_{t-1}^{(\ell)}$ denote the exponential family parameters at time $t-1$ when $r_{t-1}$ takes value $\ell$; then

$$p(x_t \mid r_{t-1} = \ell, \mathbf{x}_{1:t-1}) = p(x_t \mid \boldsymbol{\nu}_{t-1}^{(\ell)}, \boldsymbol{\chi}_{t-1}^{(\ell)}).$$
While Adams and MacKay do not describe it this way in the paper, I find it useful to imagine the parameter updates as another message-passing algorithm living on a trellis of possible run lengths, similar to Figure $4$ (Figure $5$).
At each time point $t$, we must account for a new possible run length, $r_t = 0$; in this case, each parameter is just its prior value. Otherwise, we compute $\nu_{t}^{(\ell)}$ and $\boldsymbol{\chi}_{t}^{(\ell)}$ for each run length value $r_t = \ell$ by message-passing up the trellis. Formally,

$$\nu_t^{(0)} = \nu_{\text{prior}}, \qquad \boldsymbol{\chi}_t^{(0)} = \boldsymbol{\chi}_{\text{prior}}, \qquad \nu_t^{(\ell)} = \nu_{t-1}^{(\ell-1)} + 1, \qquad \boldsymbol{\chi}_t^{(\ell)} = \boldsymbol{\chi}_{t-1}^{(\ell-1)} + u(x_t). \tag{6}$$
Changepoint prior
The changepoint prior allows us to encode our knowledge about our data’s changepoints into the model. Let $\mathrm{T}$ be a discrete, nonnegative random variable for the current run length. Let $f(\tau)$ denote the probability that the current run length is $\tau$. Let $S(\tau)$ be the survival function at $\tau$, which is the probability that $\mathrm{T}$ takes a value greater than $\tau$ or

$$S(\tau) = P(\mathrm{T} > \tau) = \sum_{t = \tau + 1}^{\infty} f(t).$$
Then the hazard function $H(\tau)$ is

$$H(\tau) = \frac{f(\tau)}{f(\tau) + S(\tau)} = \frac{f(\tau)}{\sum_{t = \tau}^{\infty} f(t)}.$$
The hazard function quantifies the answer to the question: “Provided a changepoint has not occurred by run length $\tau$, what is the probability that it will occur at $\tau$?” The actual form of the hazard function clearly depends on the distribution of $\mathrm{T}$ (Figure $6$).
Our modeling assumption is that our changepoint prior is

$$p(r_t \mid r_{t-1}) = \begin{cases} H(r_{t-1} + 1) & \text{if } r_t = 0, \\ 1 - H(r_{t-1} + 1) & \text{if } r_t = r_{t-1} + 1, \\ 0 & \text{otherwise}. \end{cases} \tag{7}$$
For Equation $7$, we only need to compute two cases. The first case is the probability of a changepoint, which is just the hazard function. The second case is the probability that a changepoint does not occur, which is just one minus the probability of a changepoint. This prior is computationally efficient because we only have two cases with nonzero mass, and we only need to compute the hazard function once to compute both cases.
If $\mathrm{T}$ is a geometric random variable with a probability of success $p$, then $H(\tau) = p$ (Forbes, Evans, Hastings, & Peacock, 2011). This kind of process is called memoryless because the hazard function does not depend on time.
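We can check this memoryless property numerically. In this sketch (my own, not from the paper), the hazard is computed directly from its definition, $H(\tau) = f(\tau) / \sum_{t \geq \tau} f(t)$, with the infinite tail sum truncated at a large hypothetical horizon:

```python
def geometric_pmf(tau, p):
    """P(T = tau) for a geometric random variable on tau = 1, 2, ..."""
    return (1 - p) ** (tau - 1) * p

def hazard(tau, p, horizon=10_000):
    """H(tau) = f(tau) / P(T >= tau), with the tail sum truncated at `horizon`."""
    tail = sum(geometric_pmf(t, p) for t in range(tau, horizon))
    return geometric_pmf(tau, p) / tail

p = 1 / 60
for tau in (1, 5, 50):
    assert abs(hazard(tau, p) - p) < 1e-6  # memoryless: H(tau) = p for every tau
```

The hazard is the constant $p$ regardless of how long the current run has lasted.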
Initial conditions
Earlier, we punted on discussing the initial conditions of our recursive algorithm, namely the prior $p(r_0)$.
We now have the notation to address this question. We have two scenarios. The first scenario is that a changepoint occurred just before the first data point is observed or $p(r_0 = 0) = 1$. This may not be the most realistic modeling assumption, but it is simple to implement.
The second scenario is that the first changepoint is in the future but we have access to some recent data. The example Adams gives is that of climate change. We have access to historic and current climate data and hypothesize that a changepoint will occur in the future. In this case, the prior over the initial run length is the normalized survival function

$$p(r_0 = \tau) = \frac{1}{Z} S(\tau) = \frac{1}{Z} \sum_{t = \tau + 1}^{\infty} f(t),$$
where $Z$ is an appropriate normalizing constant. In words, this is all the mass in the future (note the sum starts at $\tau + 1$, not $\tau$) divided by all the mass of the distribution $f(\cdot)$. The greater the value of $\tau$, the more mass the sum accumulates or the higher the probability that a changepoint will occur by some large $\tau$.
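A short sketch of this second scenario (my own example, assuming a geometric gap distribution and truncating the support of $r_0$ at a hypothetical cutoff of $500$):

```python
import numpy as np

# Initial run-length prior as a normalized survival function, assuming a
# geometric gap distribution f(t) = (1 - p)^(t - 1) * p on t = 1, 2, ...
p = 1 / 60
taus = np.arange(500)            # truncated support for r_0
S = (1 - p) ** taus              # S(tau) = sum_{t > tau} f(t) for the geometric
prior_r0 = S / S.sum()           # divide by the normalizing constant Z
# For a geometric gap distribution, this works out to roughly p * (1 - p)^tau.
assert np.allclose(prior_r0[:5], p * (1 - p) ** taus[:5], atol=1e-3)
```

Longer initial run lengths get geometrically less prior mass, as we would expect if a changepoint becomes ever more likely to have occurred.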
Sequential inference algorithm
We are now prepared to understand Bayesian online changepoint detection in algorithmic detail. This algorithm estimates the RL posterior distribution and performs prediction at each time point $t$. Let’s walk through the paper’s Algorithm $1$ with annotations as needed:

1. Set priors and initial conditions.

2. Observe new datum $x_t$.

3. Compute UPM predictive probabilities. This calculation is for each possible run length value $\ell$,

   $$\pi_{t-1}^{(\ell)} \triangleq p(x_t \mid r_{t-1} = \ell, \mathbf{x}_{1:t-1}).$$

   At time $t-1$, there are $t$ possible run lengths; we forward message-pass the mass for these values up the trellis to compute the UPM predictive over $x_t$; therefore there are $t$ probabilities $\pi_{t-1}^{(\ell)}$.

4. Compute growth probabilities. The growth probabilities are the probabilities $p(r_t = r_{t-1} + 1, \mathbf{x}_{1:t})$ for each possible run length value. To compute the probability for a specific value $r_t = \ell$, Equation $3$ is

   $$p(r_t = \ell, \mathbf{x}_{1:t}) = p(r_{t-1} = \ell - 1, \mathbf{x}_{1:t-1})\, \pi_{t-1}^{(\ell - 1)}\, \big( 1 - H(\ell) \big).$$

   Note that there is no sum over $r_{t-1}$ because for a given run length, the only valid “growth value” is $r_{t-1} = \ell - 1$.

5. Compute the changepoint probability. The changepoint probability is the probability that the run length drops to $0$. Again using Equation $3$, we see

   $$p(r_t = 0, \mathbf{x}_{1:t}) = \sum_{\ell = 0}^{t-1} p(r_{t-1} = \ell, \mathbf{x}_{1:t-1})\, \pi_{t-1}^{(\ell)}\, H(\ell + 1).$$

   In this case we have a sum over $r_{t-1}$ because $r_{t-1}$ can take any value between $0$ and $t-1$.

6. Compute the evidence. This is just the normalizer in Equation $2$,

   $$p(\mathbf{x}_{1:t}) = \sum_{r_t} p(r_t, \mathbf{x}_{1:t}).$$

7. Compute the RL posterior (Equation $2$).

8. Update sufficient statistics (Equation $6$).

9. Perform prediction (Equation $1$).

10. Set $t = t + 1$. Return to Step $2$.
A complete example
Now let’s work through a complete example and implementation of the algorithm. Consider the sequence of $T = 200$ i.i.d. observations, $\mathcal{D} = \{\mathbf{x}_t\}_{t=1}^{T}$, in Figure $7\text{a}$. The ground-truth changepoints are denoted with red dashed lines. We might guess that these data look approximately Gaussian distributed and that while the mean $\mu$ sometimes changes, the variance $\sigma^2$ looks fairly consistent. Thus, to model these data, let’s use a Gaussian model with unknown mean $\mu$ and fixed precision $\lambda = 1 / \sigma^2$. The posterior predictive (the UPM predictive) for this model is

$$p(\hat{x} \mid \mathcal{D}) = \mathcal{N}\!\left( \hat{x} \,\Big|\, \mu_T,\; \frac{1}{\lambda_T} + \frac{1}{\lambda} \right),$$
where $\hat{x} \in \mathbb{R}$ is a new observation and $\mu_T$ and $\lambda_T$ are defined as

$$\lambda_T = \lambda_{\text{prior}} + T \lambda, \qquad \mu_T = \frac{\lambda_{\text{prior}}\, \mu_{\text{prior}} + \lambda \sum_{t=1}^{T} x_t}{\lambda_T}.$$
See my post on Bayesian Inference for the Gaussian or (Murphy, 2007) for a complete derivation. To make the algebra easier, assume that $\lambda = \lambda_{\text{prior}} = 1$; then we have

$$\lambda_T = T + 1, \qquad \mu_T = \frac{\mu_{\text{prior}} + \sum_{t=1}^{T} x_t}{T + 1}.$$
In changepoint detection, the number of samples $T$ is just the current time point $t$. So the parameter updates for $\mu_t^{(\ell)}$ and $\lambda_t^{(\ell)}$ as $t$ increases are

$$\mu_t^{(0)} = \mu_{\text{prior}}, \qquad \mu_t^{(\ell)} = \frac{\lambda_{t-1}^{(\ell-1)}\, \mu_{t-1}^{(\ell-1)} + x_t}{\lambda_{t-1}^{(\ell-1)} + 1},$$

$$\lambda_t^{(0)} = \lambda_{\text{prior}}, \qquad \lambda_t^{(\ell)} = \lambda_{t-1}^{(\ell-1)} + 1.$$
In both instances, the parameter update is a simple function of the previous parameters and the next data point.
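As a quick sanity check (a sketch under the $\lambda = \lambda_{\text{prior}} = 1$ assumption above, with made-up data `xs`), the recursive mean update reproduces the batch posterior mean $(\mu_{\text{prior}} + \sum_t x_t)/(T + 1)$:

```python
mu0 = 0.0                     # hypothetical prior mean
xs = [1.2, -0.4, 0.8, 2.1]    # hypothetical observations

mu, lam = mu0, 1.0
for x in xs:
    # Recursive update: fold in one observation at a time.
    mu = (lam * mu + x) / (lam + 1)
    lam = lam + 1

# Batch posterior mean with lambda = lambda_prior = 1.
batch_mu = (mu0 + sum(xs)) / (len(xs) + 1)
assert abs(mu - batch_mu) < 1e-9
```

The recursion and the batch formula agree, which is why the online algorithm never needs to revisit old data.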
To make things easy, let’s assume a memoryless hazard function, $H(\tau) = 1/60$, because we see three changepoints in $200$ data points. The actual changepoint probability I used to generate the data was $1/80$. A full discussion is beyond the scope of this post, but clearly this hazard function is an important hyperparameter.
Now we are ready for code. While we have leveraged a lot of conceptual machinery, the code is strikingly simple:
```python
import numpy as np
from scipy.stats import norm

# Assumes `data` (length T), `T`, `mu0`, and `lamb0 = 1` are already defined.
R = np.zeros((T + 1, T + 1))   # R[t] holds the RL posterior at time t
R[0, 0] = 1                    # initial condition: p(r_0 = 0) = 1
max_R = np.empty(T + 1)
max_R[0] = 0
H = 1 / 60                     # constant (memoryless) hazard

mu_params = np.array([mu0])
lamb_params = np.array([lamb0])

# UPM predictive: Gaussian with mean mu and variance 1/lamb + 1.
rl_pred = lambda x, mu, lamb: norm.pdf(x, mu, np.sqrt(1 / lamb + 1))

for t in range(1, T + 1):
    x = data[t - 1]
    # UPM predictive probabilities, one per possible run length.
    pis = np.array([rl_pred(x, mu_params[i], lamb_params[i]) for i in range(t)])
    # Growth probabilities: pass mass up the trellis.
    R[t, 1:t + 1] = R[t - 1, :t] * pis * (1 - H)
    # Changepoint probability: collapse mass down to run length zero.
    R[t, 0] = np.sum(R[t - 1, :t] * pis * H)
    # Normalize by the evidence to get the RL posterior.
    R[t, :] /= np.sum(R[t, :])
    max_R[t] = np.argmax(R[t, :])
    # Update sufficient statistics; prepend the prior for run length zero.
    mu_params = np.append([mu0], (lamb_params * mu_params + x) / (lamb_params + 1))
    lamb_params = np.append([lamb0], lamb_params + 1)
```
Note that the above code is a didactic snippet, i.e. you cannot paste this into a file and run it. For a complete example, see my GitHub. We can visualize the results of this algorithm by plotting both the RL posterior and the max posterior value (Figure $7\text{b}$).
Finally, note that Bayesian online changepoint detection is a modular algorithm because both computing the UPM predictive and updating the parameters could be methods on a distribution class that keeps track of the state of its own parameters. Adams notes this in the paper by saying that the framework delineates “between the implementation of the changepoint algorithm and the implementation of the model.”
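To illustrate that modularity, here is a hypothetical interface of my own devising (not from the paper): the UPM is a class that owns its parameter arrays, while the changepoint algorithm only ever calls two methods.

```python
import numpy as np
from scipy.stats import norm

class GaussianUPM:
    """Gaussian with unknown mean and fixed precision 1, tracking one parameter
    set per possible run length. Hypothetical interface for illustration."""

    def __init__(self, mu0):
        self.mu0 = mu0
        self.mu = np.array([mu0])
        self.lam = np.array([1.0])

    def pred_probs(self, x):
        # UPM predictive for each run length: N(x; mu, 1/lam + 1).
        return norm.pdf(x, self.mu, np.sqrt(1 / self.lam + 1))

    def update(self, x):
        # Message-pass parameters up the trellis; prepend the prior for r_t = 0.
        self.mu = np.append([self.mu0], (self.lam * self.mu + x) / (self.lam + 1))
        self.lam = np.append([1.0], self.lam + 1)

upm = GaussianUPM(mu0=0.0)
for x in [0.5, -0.2, 0.3]:
    pis = upm.pred_probs(x)   # consumed by the changepoint algorithm
    upm.update(x)
assert len(upm.mu) == 4       # one parameter set per possible run length
```

Swapping in a different conjugate-exponential model would only mean writing a different class with the same two methods.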
Conclusions
Bayesian online changepoint detection is a modular Bayesian framework for online estimation of changepoints. It models changes in the generative parameters of data by estimating the posterior over the run length, and it leverages conjugateexponential models to achieve modularity and efficient parameter estimation. The method is Bayesian in that the changepoint prior allows us to encode knowledge about the world and it models uncertainty about $\boldsymbol{\theta}$, and it is online because it computes the RL posterior at each time step using only previously seen data.
This paper is from 2007, and I am not up-to-date on the literature extending this work. However, I would guess that key areas for future work include addressing the memory usage, which grows quadratically with $T$; robustly handling a misspecified model or prior; and developing approximate inference routines for non-conjugate models.
Acknowledgements
I thank Diana Cai for giving me an introductory overview of this model and providing working code and Ryan Adams for answering questions about the paper.
Appendix
1. The UPM predictive depends only on associated data
Adams and MacKay assume that $x_{t+1}$ only depends on $r_t$ and the recent data in the same partition, not on all the data so far, $\mathbf{x}_{1:t}$. Therefore, the UPM predictive should actually be written as

$$p(x_{t+1} \mid r_t = \ell, \mathbf{x}_t^{(\ell)}).$$
Note that the authors use $\mathbf{x}_t^{(\ell)}$ as shorthand for $\mathbf{x}_{(t - r_t):t}$ when $r_t = \ell$. Notation aside, this distinction does not matter because we will compute the UPM predictive using properties of conjugate-exponential models. For ease of notation, I will elide this detail, as I do not think doing so harms understanding.
2. Equation $3$ and the forward algorithm
When I first read the paper, Equation $3$ seemed as if it arose ex nihilo. In my mind, the authors dreamed up $r_t$, knew to marginalize out $r_{t-1}$, and voilà! a recursive algorithm. In fact, Equation $3$ is similar to the forward algorithm for hidden Markov models. In this context, $r_t$ is the latent variable, and there are no generative parameters. Wikipedia’s derivation is quite clear; for a stable reference, consult either (Bishop, 2006) or (Murphy, 2012).
3. $x_t$ does not depend on $r_{t-1}$
In Adams and MacKay’s paper, the modeling assumption is

$$p(x_t \mid r_{t-1}, \mathbf{x}_{1:t-1}).$$
I dislike this notation because it means we would need to offset the observation nodes in Figure $3$ by one. Following hidden Markov model graphical models, I associate each $r_t$ with an observation $x_t$, meaning that

$$p(x_t \mid r_t, \mathbf{x}_{1:t-1}).$$
This is just notation that makes drawing the graphical model a bit easier.
 Adams, R. P., & MacKay, D. J. C. (2007). Bayesian online changepoint detection. ArXiv Preprint ArXiv:0710.3742.
 Fearnhead, P., & Liu, Z. (2007). Online inference for multiple changepoint problems. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 69(4), 589–605.
 Barry, D., & Hartigan, J. A. (1992). Product partition models for change point problems. The Annals of Statistics, 260–279.
 Forbes, C., Evans, M., Hastings, N., & Peacock, B. (2011). Statistical distributions. John Wiley & Sons.
 Murphy, K. P. (2007). Conjugate Bayesian analysis of the Gaussian distribution. Def, 1(2\sigma2), 16.
 Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer.
 Murphy, K. P. (2012). Machine learning: a probabilistic perspective. MIT press.