Bayesian Online Changepoint Detection

Adams and MacKay's 2007 paper, "Bayesian Online Changepoint Detection", introduces a modular Bayesian framework for online estimation of changes in the generative parameters of sequential data. I discuss this paper in detail.

Introduction

Given sequential data such as stock market prices or streaming stories in a newswire, we might be interested in when these data change in some way, such as a stock price falling or the document topics shifting. If we view our data as observations from a generative process, then we care about when the generative parameters change. An abrupt change in these parameters is called a changepoint, and changepoint detection is the modeling and inferring of these events.

The goal of this post is to explain Ryan P. Adams and David J.C. MacKay’s technical report on Bayesian online changepoint detection (Adams & MacKay, 2007) in my own words, and to work through the framework and code for a particular model. While I focus my discussion on Adams and MacKay’s paper, (Fearnhead & Liu, 2007) is a similar model for the same problem, and it is worth reading as a comparison.

Understanding this post will require understanding conjugacy and the exponential family. The example implementation requires understanding Bayesian inference for the Gaussian.

Model

Let xtRdx_t \in \mathbb{R}^d denote the tt-th observation in a data sequence, and let xs:t\mathbf{x}_{s:t} denote the sequence xs,xs+1,,xt1,xtx_s, x_{s+1}, \dots, x_{t-1}, x_{t} for sts \leq t. We assume that our TT data points x1:T\mathbf{x}_{1:T} can be partitioned such that the data within each partition are i.i.d. samples from some distribution; this is the product partition model (Barry & Hartigan, 1992). Formally, let θ(p)\boldsymbol{\theta}^{(p)} denote the generative parameters for partition pp and θ0\boldsymbol{\theta}_0 denote the hyperprior. Then θ(p)p(θ0)\boldsymbol{\theta}^{(p)} \sim p(\boldsymbol{\theta}_0) are i.i.d. for partitions p=1,2,,Pp = 1, 2, \dots, P (Figure 1a1\text{a}).

Figure 1. Conceptual diagrams of (a) synthetic data partitioned by two changepoints—the changepoint for the third partition is unknown—and (b) the associated run length rtr_t as a function of time tt. Changepoints occur between partitions, and rtr_t drops to 00 at a changepoint.

Bayesian online changepoint detection works by modeling the time since the last changepoint, called the run length. The run length at time tt is denoted rtr_t. Logically, it can take one of two values,

rt={0if changepoint at time trt1+1else. r_{t} = \begin{cases} 0 & \text{if changepoint at time $t$} \\ r_{t-1} + 1 & \text{else}. \end{cases}

In words, the run length only ever increases by one or drops to zero (Figure 1b1\text{b}). The transition probabilities are modeled by the changepoint prior p(rtrt1)p(r_t \mid r_{t-1}), which is the assumption that rtr_t is conditionally independent of everything else given rt1r_{t-1}.

Ultimately, we want to infer both the run-length posterior distribution p(rtx1:t)p(r_t \mid \mathbf{x}_{1:t}) and the posterior predictive distribution p(xt+1x1:t)p(x_{t+1} \mid \mathbf{x}_{1:t}) so that we can predict the next data point given all the data we have seen so far.

Recursive RL posterior estimation

In Bayesian inference, if D\mathcal{D} is a set of NN i.i.d. observations {xn}n=1N\{\mathbf{x}_n\}_{n=1}^{N} and θ\boldsymbol{\theta} are model parameters, then the posterior predictive distribution over a new observation x^\hat{\mathbf{x}} is

p(x^D)=p(x^θ)  Modelp(θD)Posterior  dθ. p(\hat{\mathbf{x}} \mid \mathcal{D}) = \int \overbrace{\vphantom{\Big|} p(\hat{\mathbf{x}} \mid \boldsymbol{\theta})\;}^{\text{Model}} \overbrace{\vphantom{\Big|} p(\boldsymbol{\theta} \mid \mathcal{D})}^{\text{Posterior}} \; \text{d} \boldsymbol{\theta}.

In words, we weigh our probabilistic model’s predictions by our posterior uncertainty of the model’s parameters. Marginalization means doing this for every possible parameter value. In Bayesian online changepoint detection, this posterior takes a slightly different form. The posterior predictive is the probability of the next data point, xt+1x_{t+1}, given the observations so far, x1:t\mathbf{x}_{1:t}, and it is computed by marginalizing out the run length rtr_t,

p(xt+1x1:t)=rtp(xt+1,rtx1:t)=rtp(xt+1rt,x())UPM predictivep(rtx1:t)RL posterior.(1) \begin{aligned} p(x_{t+1} \mid \mathbf{x}_{1:t}) &= \sum_{r_t} p(x_{t+1}, r_t \mid \mathbf{x}_{1:t}) \\ &= \sum_{r_t} \overbrace{\vphantom{\Big|} p(x_{t+1} \mid r_t, \mathbf{x}^{(\ell)})}^{\text{UPM predictive}} \overbrace{\vphantom{\Big|} p(r_t \mid \mathbf{x}_{1:t})}^{\text{RL posterior}}. \tag{1} \end{aligned}

We use x()\mathbf{x}^{(\ell)} to denote all observations associated with rt=r_t = \ell. If this is confusing, think of rt=r_t = \ell as a hypothesis that a changepoint occurred \ell time steps ago. Thus, only data within the past \ell time points should contribute to our predictive distribution over xt+1\mathbf{x}_{t+1}. This notation is a little sloppy, and we could write equation 11 more explicitly as

p(xt+1x1:t)==0tp(xt+1rt=,x(t):t)p(rt=x1:t). p(x_{t+1} \mid \mathbf{x}_{1:t}) = \sum_{\ell = 0}^{t} p(x_{t+1} \mid r_t = \ell, \mathbf{x}_{(t-\ell):t}) p(r_t = \ell \mid \mathbf{x}_{1:t}).

While Adams and MacKay refer to the term analogous to “model” as a “marginal predictive”, I will use underlying probabilistic model (UPM) predictive to clearly disambiguate it from other predictives in this discussion. In the next section, we will see how to efficiently compute this term. For now assume that it is straightforward.

If we can compute the UPM predictive, then we just need to compute the RL posterior, which is proportional to the joint,

p(rtx1:t)=p(rt,x1:t)rtp(rt,x1:t).(2) p(r_t \mid \mathbf{x}_{1:t}) = \frac{p(r_t, \mathbf{x}_{1:t})}{\sum_{r_{t'}} p(r_{t'}, \mathbf{x}_{1:t})} \tag{2}.

We can write this joint recursively as

p(rt,x1:t)=rt1p(rt,rt1,xt,x1:t1)=rt1p(rt,xtrt1,x1:t1)p(rt1,x1:t1)=rt1p(xtrt,rt1,x())p(rtrt1,x1:t1)p(rt1,x1:t1) \begin{aligned} p(r_t, \mathbf{x}_{1:t}) &= \sum_{r_{t-1}} p(r_t, r_{t-1}, x_t, \mathbf{x}_{1:t-1}) \\ &= \sum_{r_{t-1}} p(r_t, x_t \mid r_{t-1}, \mathbf{x}_{1:t-1}) p(r_{t-1}, \mathbf{x}_{1:t-1}) \\ &= \sum_{r_{t-1}} p(x_t \mid r_t, \cancel{r_{t-1}}, \mathbf{x}^{(\ell)}) p(r_t \mid r_{t-1}, \cancel{\mathbf{x}_{1:t-1}}) p(r_{t-1}, \mathbf{x}_{1:t-1}) \end{aligned}

The first two steps are just marginalization and the chain rule. However, the third step is trickier because it only holds if our modeling assumptions are made. Both of these equations

p(rtrt1,x1:t1)=p(rtrt1),p(xtrt,rt1,x1:t1)=p(xtrt,x()) \begin{aligned} p(r_t \mid r_{t-1}, \mathbf{x}_{1:t-1}) &= p(r_t \mid r_{t-1}), \\ p(x_t \mid r_t, r_{t-1}, \mathbf{x}_{1:t-1}) &= p(x_t \mid r_t, \mathbf{x}^{(\ell)}) \end{aligned}

fall out of our modeling assumptions. Summarizing and annotating the derivation, we have

p(rt,x1:t)=rt1p(xtrt,x())UPM predictivep(rtrt1)Changepoint priorp(rt1,x1:t1)Message.(3) p(r_t, \mathbf{x}_{1:t}) = \sum_{r_{t-1}} \overbrace{\vphantom{\Big|} p(x_t \mid r_t, \mathbf{x}^{(\ell)})}^{\text{UPM predictive}} \overbrace{\vphantom{\Big|} p(r_t \mid r_{t-1})}^{\text{Changepoint prior}} \overbrace{\vphantom{\Big|} p(r_{t-1}, \mathbf{x}_{1:t-1})}^{\text{Message}}. \tag{3}

We now have a recursive algorithm for computing the RL posterior. The algorithm is recursive because after we compute the joint distribution p(rt1,x1:t1)p(r_{t-1}, \mathbf{x}_{1:t-1}), we can forward message-pass this distribution’s mass to the calculation for p(rt,x1:t)p(r_{t}, \mathbf{x}_{1:t}). Of course, a recursive algorithm must define its initial conditions (what is p(r0)p(r_0)?), and we will address this later.

As an aside, when I first read the paper, Equation 33 seemed as if it arose ex nihilo. In my mind, the authors dreamed up rtr_t, knew to marginalize out rt1r_{t-1}, and voilà! a recursive algorithm. In fact, Equation 33 is similar to the forward algorithm for hidden Markov models. In this context, rtr_t is the latent variable, and there are no generative parameters. Wikipedia’s derivation is quite clear; for stable reference, consult either (Bishop, 2006) or (Murphy, 2012).

It is useful to think about this recursive algorithm visually. Borrowing language from the paper, we can visualize the message-passing algorithm as living on a trellis (Figure 22). At each time point, mass is either passed “upward” such that the run-length is incremented or “downward” where it is truncated to zero. At each time point tt, the discrete RL posterior distribution has support over t+1t+1 values.

Figure 2. Diagram of the message-passing algorithm. Each node has associated mass. For example, the probability of p(r4=2,x1:4)p(r_4 = 2, \mathbf{x}_{1:4}) is associated with the node indexed by t=4t=4 and rt=2r_t=2.

This observation has an important algorithmic consequence. When we compute what Adams and MacKay call the growth probabilities or the probability that the run-length increases, the summation in Equation 33 disappears because each term in the summation is zero except when rt=rt1+1r_{t} = r_{t-1} + 1. It is only when calculating the changepoint probabilities, when the mass is passed downwards, that we sum over previous joint probabilities.

In summary, provided we can compute the UPM predictive and the changepoint prior in Equation 33 and can initialize our recursive algorithm, we will have everything we need for both inference and prediction. Let’s tackle these in that order.

Underlying probabilistic model

Computing the UPM predictive leverages conjugacy and the exponential family. In principle, one could abbreviate this section as most of it is not methodologically novel to the paper. However, when I first read the paper, computing the UPM predictive was not obvious to me, and I will go into detail to explain it.

We assume our model is from an exponential family (EF) member with parameters η\boldsymbol{\eta}. To compute p(xt+1rt=,x())p(x_{t+1} \mid r_t = \ell, \mathbf{x}^{(\ell)}), we could integrate over the model’s parameters,

p(xt+1rt,x())UPM predictive=p(xt+1η)EF modelp(ηrt,x1:t)EF posteriordη.(4) \overbrace{\vphantom{\Big|} p(x_{t+1} \mid r_t, \mathbf{x}^{(\ell)})}^{\text{UPM predictive}} = \int \overbrace{\vphantom{\Big|} p(x_{t+1} \mid \boldsymbol{\eta})}^{\text{EF model}} \overbrace{\vphantom{\Big|} p(\boldsymbol{\eta} \mid r_t, \mathbf{x}_{1:t})}^{\text{EF posterior}} \text{d} \boldsymbol{\eta}. \tag4

We know the EF model in Equation 44 because it is by construction; we can decide that our data is best modeled by any member of the exponential family. But what is our EF posterior and how do we compute this integral? A useful fact about exponential family models is that we can avoid computing these integrals directly.

Posterior predictive for conjugate models

Before discussing the exponential family, let’s review the prior predictive for any conjugate model. Let D\mathcal{D} be our observations, x^\hat{\mathbf{x}} be a new observation, θ\boldsymbol{\theta} be our parameters, and α\boldsymbol{\alpha} be our hyperparameters. The prior predictive distribution is the distribution on x^\hat{\mathbf{x}} given α\boldsymbol{\alpha} marginalized over θ\boldsymbol{\theta},

p(x^α)=p(x^θ)  Modelp(θα)  Priordθ. p(\hat{\mathbf{x}} \mid \boldsymbol{\alpha}) = \int \overbrace{\vphantom{\Big|} p(\hat{\mathbf{x}} \mid \boldsymbol{\theta})\;}^{\text{Model}} \overbrace{\vphantom{\Big|} p(\boldsymbol{\theta} \mid \boldsymbol{\alpha}) \;}^{\text{Prior}} \text{d} \boldsymbol{\theta}.

Since our prior on θ\boldsymbol{\theta} is conjugate, we know that

p(θD,α)=p(θα) p(\boldsymbol{\theta} \mid \mathcal{D}, \boldsymbol{\alpha}) = p(\boldsymbol{\theta} \mid \boldsymbol{\alpha}')

for some different hyperparameters α\boldsymbol{\alpha}'. This is not a deep statement; we are just observing that if we are using a conjugate prior, then the prior is same functional form as the posterior. This implies

p(x^D,α)=p(x^θ)p(θD,α)dθ=p(x^θ)p(θα)dθ=p(x^α). \begin{aligned} p(\hat{\mathbf{x}} \mid \mathcal{D}, \boldsymbol{\alpha}) &= \int p(\hat{\mathbf{x}} \mid \boldsymbol{\theta}) p(\boldsymbol{\theta} \mid \mathcal{D}, \boldsymbol{\alpha}) \text{d} \boldsymbol{\theta} \\ &= \int p(\hat{\mathbf{x}} \mid \boldsymbol{\theta}) p(\boldsymbol{\theta} \mid \boldsymbol{\alpha}') \text{d} \boldsymbol{\theta} \\ &= p(\hat{\mathbf{x}} \mid \boldsymbol{\alpha}'). \end{aligned}

Thus, the posterior predictive is the same as the prior predictive but with hyperparameter α\boldsymbol{\alpha}' instead of α\boldsymbol{\alpha}. If we can compute α\boldsymbol{\alpha}', we can compute the posterior predictive without integration and without computing the posterior p(θD,α)p(\boldsymbol{\theta} \mid \mathcal{D}, \boldsymbol{\alpha}). And conjugate-exponential models allow us to efficiently compute α\boldsymbol{\alpha}'.

Posterior predictive for exponential family models

Now let’s discuss the prior predictive for the exponential family. Recall that the general form for any member of the exponential family is

p(xη)=h(x)g(η)exp{ηu(x)} p(\mathbf{x} \mid \boldsymbol{\eta}) = h(\mathbf{x}) g(\boldsymbol{\eta}) \exp \big\{ \boldsymbol{\eta}^{\top} u(\mathbf{x}) \big\}

where η\boldsymbol{\eta} is the natural parameter, h(x)h(\mathbf{x}) is the underlying measure, u(x)u(\mathbf{x}) is the sufficient statistic of the data, and g(η)g(\boldsymbol{\eta}) is a normalizer. The conjugate prior with hyperparameters ν\nu and χ\boldsymbol{\chi} is

p(ηχ,ν)=f(χ,ν)g(η)νexp{ηχ} p(\boldsymbol{\eta} \mid \boldsymbol{\chi}, \nu) = f(\boldsymbol{\chi}, \nu) g(\boldsymbol{\eta})^{\nu} \exp \big\{\boldsymbol{\eta}^{\top} \boldsymbol{\chi} \big\}

where f(χ,ν)f(\boldsymbol{\chi}, \nu) depends on the form of the exponential family member and η\boldsymbol{\eta} and g(η)g(\boldsymbol{\eta}) are the same as before. If we multiply the likelihood times the prior to get the posterior, we have

p(ηX,χ,ν)=((i=1Nh(xn))g(η)nexp{ηn=1Nu(xn)})Likelihood(f(χ,ν)g(η)νexp{ηχ})Prior=(i=1Nh(xn))f(χ,ν)g(η)N+νexp{ηn=1Nu(xn)+ηχ}. \begin{aligned} p(\boldsymbol{\eta} \mid X, \boldsymbol{\chi}, \nu) &= \overbrace{\vphantom{\Bigg|} \Big( \Big( \prod_{i=1}^{N} h(\mathbf{x}_n) \Big) g(\boldsymbol{\eta})^{n} \exp \big\{ \boldsymbol{\eta}^{\top} \sum_{n=1}^N u(\mathbf{x}_n) \big\} \Big)}^{\text{Likelihood}} \overbrace{\vphantom{\Bigg|} \Big( f(\boldsymbol{\chi}, \nu) g(\boldsymbol{\eta})^{\nu} \exp \big\{\boldsymbol{\eta}^{\top} \boldsymbol{\chi} \big\} \Big)}^{\text{Prior}} \\ &= \Big( \prod_{i=1}^{N} h(\mathbf{x}_n) \Big) f(\boldsymbol{\chi}, \nu) g(\boldsymbol{\eta})^{N + \nu} \exp \big\{ \boldsymbol{\eta}^{\top} \sum_{n=1}^N u(\mathbf{x}_n) + \boldsymbol{\eta}^{\top} \boldsymbol{\chi} \big\}. \end{aligned}

Since the first N+1N + 1 terms are constant with respect to η\boldsymbol{\eta}, we can write

p(ηX,χ,ν)g(η)N+νexp{η(n=1Nu(xn)+χ)}. p(\boldsymbol{\eta} \mid X, \boldsymbol{\chi}, \nu) \propto g(\boldsymbol{\eta})^{N + \nu} \exp \big\{ \boldsymbol{\eta}^{\top} \big( \sum_{n=1}^N u(\mathbf{x}_n) + \boldsymbol{\chi} \big) \big\}.

Note that this is the same exponential family form as the prior, with parameters:

ν=νprior+Nχ=χprior+n=1Nu(xn).(5) \begin{aligned} \nu' &= \nu_{\text{prior}} + N \\ \boldsymbol{\chi}' &= \boldsymbol{\chi}_{\text{prior}} + \sum_{n=1}^{N} u(\mathbf{x}_n). \end{aligned} \tag{5}

In other words, we have an efficient way to compute the updated hyperparameters ν\nu' and χ\boldsymbol{\chi}', which allow us to compute the posterior predictive (Equation 33) without integration and without computing the posterior over η\boldsymbol{\eta}. This is what Adams and MacKay mean when they write,

This [UPM predictive] distribution, while generally not itself an exponential-family distribution, is usually a simple function of the sufficient statistics.

Importantly, these sufficient statistics are additive and can therefore be computed sequentially.

Message-passing parameters

The upshot of the previous subsection is that rather than computing the UPM predictive by computing the EF posterior and integrating out the parameters, we just need to keep track of the exponential family parameters by time t1t-1 to make a prediction at time tt. Recall that at each time point tt, rt1r_{t-1} can take tt possible values (Figure 22). Let νt1()\boldsymbol{\nu}_{t-1}^{(\ell)} and χt1()\boldsymbol{\chi}_{t-1}^{(\ell)} denote exponential family parameters at time tt when rt1r_{t-1} takes value \ell, then

p(xtrt1=,x())denoted byp(xtνt1(),χt1()). p(x_t \mid r_{t-1} = \ell, \mathbf{x}^{(\ell)}) \quad \text{denoted by} \quad p(x_t \mid \boldsymbol{\nu}_{t-1}^{(\ell)}, \boldsymbol{\chi}_{t-1}^{(\ell)}).

While Adams and MacKay do not describe it this way in the paper, I find it useful to imagine the parameter updates as another message-passing algorithm living on a trellis of possible run lengths, similar to Figure 22 (Figure 33).

Figure 3. Diagram of the parameter updates living on a trellis of possible values. (Left) The ν\nu parameters are always incremented by 11, while (Right) the χ\boldsymbol{\chi} parameters are always incremented by the sufficient statistic of the next observation.

At each time point tt, we want to model a new possible run-length. In this case, each parameter value is just its prior. Otherwise, we compute νt()\nu_{t}^{(\ell)} and χt()\boldsymbol{\chi}_{t}^{(\ell)} for each run length value rt=r_t = \ell by message-passing up the trellis. Formally,

νt(0)=νpriorχt(0)=χpriorνt()=νt1(1)+1χt()=χt1(1)+u(xt).(6) \begin{aligned} \nu_{t}^{(0)} &= \nu_{\text{prior}} \\ \boldsymbol{\chi}_{t}^{(0)} &= \boldsymbol{\chi}_{\text{prior}} \\ \nu_{t}^{(\ell)} &= \nu_{t-1}^{(\ell-1)} + 1 \\ \boldsymbol{\chi}_{t}^{(\ell)} &= \boldsymbol{\chi}_{t-1}^{(\ell-1)} + u(x_{t}). \tag{6} \end{aligned}

Changepoint prior

The changepoint prior allows us to encode our knowledge about our data’s changepoints into the model. Let T\mathrm{T} be a discrete, nonnegative random variable for the current run length. Let f(τ)f(\tau) denote the probability that the current run length is τ\tau. Let S(τ)S(\tau) be the survival function at τ\tau, which is the probability that T\mathrm{T} takes a value greater than τ\tau or

S(τ)=P{Tτ}=τ=τf(τ) S(\tau) = \mathbb{P}\{\mathrm{T} \geq \tau\} = \sum_{\tau' = \tau}^{\infty} f(\tau')

Then hazard function H(τ)H(\tau) is

H(τ)=f(τ)S(τ). H(\tau) = \frac{f(\tau)}{S(\tau)}.

The hazard function quantifies the answer to the question: “Provided a changepoint has not occurred by run length τ\tau, what is the probability that it will occur at τ\tau?” The actual form of the hazard function clearly depends on the distribution of T\mathrm{T} (Figure 44).

Figure 4. Diagram of the probability H(τ)H(\tau) for a hypothetical prior distribution f(τ)f(\tau).

Our modeling assumption is that our changepoint prior is

p(rtrt1)={H(rt1+1)if rt=01H(rt1+1)if rt=rt1+10otherwise.(7) p(r_t \mid r_{t-1}) = \begin{cases} H(r_{t-1} + 1) & \text{if $r_t = 0$} \\ 1 - H(r_{t-1} + 1) & \text{if $r_t = r_{t-1} + 1$} \\ 0 & \text{otherwise}. \end{cases} \tag{7}

For Equation 77, we only need to compute two cases. The first case is the probability of a changepoint, which is just the hazard function. The second case is the probability that a changepoint does not occur, which is just one minus the probability of a changepoint. This prior is computationally efficient because we only have two cases with nonzero mass, and we only need to compute the hazard function once to compute both cases.

If T\mathrm{T} is a geometric random variable with a probability of success pp, then H(τ)=pH(\tau) = p (Forbes et al., 2011). This kind of process is called memoryless because the hazard function does not depend on time.

Initial conditions

Earlier, we punted on discussing the initial conditions of our recursive algorithm,

p(r0=0,x1:t=)=p(r0=0). p(r_0 = 0, \mathbf{x}_{1:t} = \emptyset) = p(r_0 = 0).

We now have the notation to address this question. We have two scenarios. The first scenario is that a changepoint occurred just before the first data point is observed or p(r0=0)=1p(r_0 = 0) = 1. This may not be the most realistic modeling assumption, but it is simple to implement.

The second scenario is that the first changepoint is in the future but we have access to some recent data. The example Adams gives is that of climate change. We have access to historic and current climate data and hypothesize that a changepoint will occur in the future. In this case, the prior over the initial run length is the normalized survival function

p(r0=τ)=1Zτ=τ+1f(τ) p(r_0 = \tau) = \frac{1}{Z} \sum_{\tau' = \tau+1}^{\infty} f(\tau')

where ZZ is an appropriate normalizing constant. In words, this is all the mass in the future (note the sum starts at τ+1\tau + 1, not τ\tau) divided by all the mass of the distribution f()f(\cdot). The greater the value of τ\tau, the more mass the sum accumulates or the higher the probability that a changepoint will occur by some large τ\tau.

Sequential inference algorithm

We are now prepared to understand Bayesian online changepoint detection in algorithmic detail. This algorithm estimates the RL posterior distribution and performs prediction at each time point tt. Let’s walk through the paper’s Algorithm 11 with annotations as needed:

  1. Set priors and initial conditions.

    p(r0)={1if changepoint at time t=0p(r0=τ)elseν1(0)=νpriorχ1(0)=χpriort=1. \begin{aligned} p(r_0) &= \begin{cases} 1 & \text{if changepoint at time $t=0$} \\ p(r_0 = \tau) & \text{else} \end{cases} \\ \nu_{1}^{(0)} &= \nu_{\text{prior}} \\ \boldsymbol{\chi}_{1}^{(0)} &= \boldsymbol{\chi}_{\text{prior}} \\ t &= 1. \end{aligned}

  2. Observe new datum xtx_t.

  3. Compute UPM predictive probabilities. This calculation is for each possible run length value ,\ell,

    πt1()=p(xtνt1(),χt1()). \pi_{t-1}^{(\ell)} = p(x_t \mid \nu_{t-1}^{(\ell)}, \boldsymbol{\chi}_{t-1}^{(\ell)}).

    At time t1t-1, there are tt possible run lengths; we forward message-pass the mass for these values up the trellis to compute the UPM predictive over xtx_t; therefore there are tt probabilities πt1()\pi_{t-1}^{(\ell)}.

  4. Compute growth probabilities. The growth probabilities are the probabilities p(rt=rt1+1,x1:t)p(r_t = r_{t-1} + 1, \mathbf{x}_{1:t}) for each possible run length value. To compute the probability for a specific value rt=r_t = \ell, Equation 33 is

    p(rt=,x1:t)=p(rt1,x1:t1)πt1()(1H(rt1)). p(r_t = \ell, \mathbf{x}_{1:t}) = p(r_{t-1}, \mathbf{x}_{1:t-1}) \pi_{t-1}^{(\ell)} (1 - H(r_{t-1})).

    Note that there is no sum over rt1r_{t-1} because for a given run length, the only valid “growth value” is rt1=1r_{t-1} = \ell - 1.

  5. Compute changepoint probability. The changepoint probability is the probability that the run length drops to 00. Again using Equation 33, we see

    p(rt=0,x1:t)=rt1p(rt1,x1:t1)πt1()H(rt1). p(r_t = 0, \mathbf{x}_{1:t}) = \sum_{r_{t-1}} p(r_{t-1}, \mathbf{x}_{1:t-1}) \pi_{t-1}^{(\ell)} H(r_{t-1}).

    In this case we have a sum over rt1r_{t-1} because rt1r_{t-1} can take any value between 00 and tt.

  6. Compute the evidence. This is just the normalizer in Equation 22.

  7. Compute the RL posterior. Equation 22.

  8. Update sufficient statistics. Equation 66.

  9. Perform Prediction. Equation 11.

  10. Set t=t+1t = t+1. Return to Step 2.

A complete example

Now let’s work through a complete example and implementation of the algorithm. Consider the sequence of T=200T = 200 i.i.d. observations, D={xt}t=1T\mathcal{D} = \{\mathbf{x}_t\}_{t=1}^{T}, in Figure 5a5\text{a}. The groundtruth changepoints are denoted with red dashed lines. We might guess that these data look approximately Gaussian distributed and that while the mean μ\mu sometimes changes, the variance σ2\sigma^2 looks fairly consistent. Thus, to model these data, let’s use a Gaussian model with unknown μ\mu and fixed precision λ=1/σ2\lambda = 1 / \sigma^2. The posterior predictive (the UPM predictive) for this model is

p(x^D)=N(x^μT,1λ+1λT) p(\hat{x} \mid \mathcal{D}) = \mathcal{N}(\hat{x} \mid \mu_T, \frac{1}{\lambda} + \frac{1}{\lambda_T})

where x^R\hat{x} \in \mathbb{R} is a new observation and μT\mu_T and λT\lambda_T are defined as

μT=μpriorλ+t=1TxtλpriorTλprior+1λλT=λprior+Tλ \begin{aligned} \mu_T &= \frac{\frac{\mu_{\text{prior}}}{\lambda} + \frac{\sum_{t=1}^{T} x_t}{\lambda_{\text{prior}}}}{ \frac{T}{\lambda_{\text{prior}}} + \frac{1}{\lambda}} \\\\ \lambda_T &= \lambda_{\text{prior}} + T \lambda \end{aligned}

See my post on Bayesian Inference for the Gaussian or (Murphy, 2007) for a complete derivation. To make the algebra easier, assume that λ=λprior=1\lambda = \lambda_{\text{prior}} = 1; then we have

μT=μprior+t=1TxtT+1λT=λprior+T \begin{aligned} \mu_T &= \frac{\mu_{\text{prior}} + \sum_{t=1}^{T} x_t}{T + 1} \\\\ \lambda_T &= \lambda_{\text{prior}} + T \end{aligned}

In changepoint detection, the number of samples TT is just the current time point tt. So the parameter updates for μt()\mu_t^{(\ell)} and λt()\lambda_t^{(\ell)} as tt increases are

μ0()=μprior()μ1()=μ0+x12μ2()=2μ1+x23μ3()=3μ2+x34λ0()=λprior()λ1()=1+λ1()λ2()=1+λ2()λ3()=1+λ3() \begin{aligned} \mu_0^{(\ell)} &= \mu_{\text{prior}}^{(\ell)} \\ \mu_1^{(\ell)} &= \frac{\mu_0 + x_1}{2} \\ \mu_2^{(\ell)} &= \frac{2 \mu_1 + x_2}{3} \\ \mu_3^{(\ell)} &= \frac{3 \mu_2 + x_3}{4} \\ &\dots \\ \\ \\ \lambda_{0}^{(\ell)} &= \lambda_{\text{prior}}^{(\ell)} \\ \lambda_{1}^{(\ell)} &= 1 + \lambda_1^{(\ell)} \\ \lambda_{2}^{(\ell)} &= 1 + \lambda_2^{(\ell)} \\ \lambda_{3}^{(\ell)} &= 1 + \lambda_3^{(\ell)} \\ &\dots \end{aligned}

In both instances, the parameter update is a simple function of the previous parameters and the next data point.

Figure 5. (Top) Synthetic data generated from a Gaussian distribution with fixed variance and a changing mean. Groundtruth changepoints are denoted with red dashed lines. (Bottom) The RL posterior at each time step using a logarithmic color scale. Darker indicates higher probability. The saw-toothed red line is the max probability at each time point.

To make things easy, let’s assume a memoryless hazard function, H(τ)=1/60H(\tau) = 1/60 because we see three changepoints in 200200 data points. The actual changepoint probability I used to generate the data was 1/801/80. It is beyond the scope of this post, but clearly this hazard function is an important hyperparameter.

Now we are ready for code. While we have leveraged a lot of conceptual machinery, the code is strikingly simple:

import numpy as np
from   scipy.stats import norm


R        = np.zeros((T+1, T+1))
R[0, 0]  = 1
max_R    = np.empty(T+1)
max_R[0] = 1

H           = 1/60
mu_params   = np.array([mu0])
lamb_params = np.array([lamb0])

rl_pred = lambda x, mu, lamb: norm.pdf(x, mu, 1/lamb + 1)

for t in range(1, T+1):
    x = data[t-1]

    pis = np.array([rl_pred(x, mu_params[i], lamb_params[i]) for i in range(t)])

    R[t, 1:t+1] = R[t-1, :t] * pis * (1-H)
    R[t, 0]     = sum(R[t-1, :t] * pis * H)
    R[t, :]    /= sum(R[t, :])

    max_R[t] = np.argmax(R[t, :])

    offsets     = np.arange(1, t+1)
    mu_params   = np.append([mu0], (mu_params * offsets + x) / (offsets + 1))
    lamb_params = np.append([lamb0], lamb_params + 1)

Note that the above code is a didactic snippet, i.e. you cannot paste this into a file and run it. For a complete example, see my GitHub. We can visualize the results of this algorithm by plotting both the RL posterior and the max posterior value (Figure 5b5\text{b}).

Finally, note that Bayesian online changepoint detection is a modular algorithm because both computing the UPM predictive and updating the parameters could be methods on a distribution class that keeps track of the state of its own parameters. Adams notes this in the paper by saying that the framework delineates “between the implementation of the changepoint algorithm and the implementation of the model.”

Conclusions

Bayesian online changepoint detection is a modular Bayesian framework for online estimation of changepoints. It models changes in the generative parameters of data by estimating the posterior over the run length, and it leverages conjugate-exponential models to achieve modularity and efficient parameter estimation. The method is Bayesian in that the changepoint prior allows us to encode knowledge about the world and it models uncertainty about θ\boldsymbol{\theta}, and it is online because it computes the RL posterior at each time step using only previously seen data.

This paper is from 2007, and I am not up-to-date on the literature on extending this work. However, I would guess that key areas for future work would be addressing the memory usage, which grows quadratically with TT; robustly handling a misspecified model or prior; and developing approximate inference routines for non-conjugate models.

   

Acknowledgements

I thank Diana Cai for giving me an introductory overview of this model and providing working code and Ryan Adams for answering questions about the paper.

  1. Adams, R. P., & MacKay, D. J. C. (2007). Bayesian online changepoint detection. ArXiv Preprint ArXiv:0710.3742.
  2. Fearnhead, P., & Liu, Z. (2007). On-line inference for multiple changepoint problems. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 69(4), 589–605.
  3. Barry, D., & Hartigan, J. A. (1992). Product partition models for change point problems. The Annals of Statistics, 260–279.
  4. Bishop, C. M. (2006). Pattern Recognition and Machine Learning.
  5. Murphy, K. P. (2012). Machine learning: a probabilistic perspective. MIT press.
  6. Forbes, C., Evans, M., Hastings, N., & Peacock, B. (2011). Statistical distributions. John Wiley & Sons.
  7. Murphy, K. P. (2007). Conjugate Bayesian analysis of the Gaussian distribution. Def, 1(2\sigma2), 16.