Bayesian Online Changepoint Detection

Adams and MacKay's 2007 paper, "Bayesian Online Changepoint Detection", introduces a modular Bayesian framework for online estimation of changes in the generative parameters of sequential data. I discuss this paper in detail.

Introduction

Given sequential data such as stock market prices or streaming stories in a newswire, we might be interested in when these data change in some way, such as a stock price falling or the document topics shifting. If we view our data as observations from a generative process, then we care about when the generative parameters change. An abrupt change in these parameters is called a changepoint, and changepoint detection is the modeling and inferring of these events.

The goal of this post is to explain Ryan P. Adams and David J.C. MacKay’s technical report on Bayesian online changepoint detection (Adams & MacKay, 2007) in my own words, and to work through the framework and code for a particular model. While I focus my discussion on Adams and MacKay’s paper, (Fearnhead & Liu, 2007) is a similar model for the same problem, and it is worth reading as a comparison.

Understanding this post will require understanding conjugacy and the exponential family. The example implementation requires understanding Bayesian inference for the Gaussian.

Model

Let $x_t \in \mathbb{R}^d$ denote the $t$-th observation in a data sequence where $d$ is fixed, and let $\mathbf{x}_{s:t}$ denote the sequence $x_s, x_{s+1}, \dots, x_{t-1}, x_{t}$ for $s \leq t$. We assume that our data $\mathbf{x}_{1:T}$ can be partitioned such that the data within each partition are i.i.d. samples from some distribution; this is the product partition model (Barry & Hartigan, 1992). Formally, let $\boldsymbol{\theta}^{(p)}$ denote the generative parameters for partition $p$ and $\boldsymbol{\eta}$ denote the hyperprior. Then $\boldsymbol{\theta}^{(p)} \sim p(\boldsymbol{\eta})$ are i.i.d. for $p = 1, 2, \dots, P$ (Figure $1\text{a}$).

Figure 1. Conceptual diagrams of (a) synthetic data partitioned by two changepoints—the changepoint for the third partition is unknown—and (b) the associated run length $r_t$ as a function of time $t$. Changepoints occur between partitions, and $r_t$ drops to $0$ at a changepoint.

Bayesian online changepoint detection works by modeling the time since the last changepoint, called the run length. The run length at time $t$ is denoted $r_t$. Logically, it can take one of two values,

In words, the run length only ever increases by one or drops to zero (Figure $1\text{b}$). The transition probabilities are modeled by the changepoint prior $p(r_t \mid r_{t-1})$, which the assumption that $r_t$ is conditionally independent of everything else given $r_{t-1}$.

It is useful to express the generative procedure in a probabilistic graphical model (Figure $2$). The dynamics of $r_t$ follow a Markov process. The data are conditionally independent of everything else given the generative parameters. We only observe our data, $x_1, x_2, \dots, x_T$. We will discuss the hyperparameters $\boldsymbol{\eta}$ and $H$ later.

Figure 2. Graphical model for Bayesian online changepoint detection. The run length $r_t$ follows a Markov process. The data is conditionally independent of everything else given the generative parameters. There are $P$ partitions, and thus $T = \sum_{p=1}^P T_p$. The hyperparameters $\boldsymbol{\theta}_0$ and $H$ will discuss later.

If we take the generative model in Figure $2$ and marginalize out the parameters $\theta^{(p)}$, then we induce dependencies between the observations (Figure $3$). See this link for a review of this principle. We will see this formally in Equation $1$. Figure $3$ looks a lot like a hidden Markov model with a bit more structure: the latent variables are run length variables with more limited support and there are dependencies between observations. To compute $p(x_t \mid r_t, \mathbf{x}_{1:t-1})$ efficiently, the algorithm uses the exponential family and message passing to condition on $\boldsymbol{\eta}$, which since the data are i.i.d., eliminates the dependency on $\mathbf{x}_{1:t-1}$.

Figure 3. Graphical model for Bayesian online changepoint detection with the generative parameters integrated out. This marginalization induces conditional dependencies between the data points.

Ultimately, we want to infer both the run-length posterior distribution $p(r_t \mid \mathbf{x}_{1:t})$ and the posterior predictive distribution $p(x_{t+1} \mid \mathbf{x}_{1:t})$ so that we can predict the next data point given all the data we have seen so far.

Recursive RL posterior estimation

In Bayesian inference, if $\mathcal{D}$ is a set of $N$ i.i.d. observations $\{\mathbf{x}_n\}_{n=1}^{N}$ and $\boldsymbol{\theta}$ are model parameters, then the posterior predictive distribution over a new observation $\hat{\mathbf{x}}$ is

In Bayesian online changepoint detection, this posterior takes a slightly different form. The posterior predictive is the probability of the next data point, $x_{t+1}$, given the observations so far, $\mathbf{x}_{1:t}$, and it is computed by marginalizing out the run length $r_t$,

While Adams and MacKay refer to term analogous to “Model” as a “marginal predictive”, I will use underlying probabilistic model (UPM) predictive to clearly disambiguate it from other predictives in this discussion. In the next section, we will see how to efficiently compute this term, but for now assume that it is straightforward. (Equation $1$ differs slightly from Adams and MacKay’s formulation; see the appendix.)

If we can compute the UPM predictive, then we just need to compute the RL posterior, which is proportional to the joint,

We can write this joint recursively as

The first two steps are just marginalization and the chain rule. However, the third step is trickier because it only holds if our modeling assumptions (Figure $3$) are made. (If this derivation feels like it came out of thin air, see the appendix.) Both of these equations

fall out of the assumptions encoded in the graphical model in Figure $3$. If you expected $x_t$ to depend on $r_{t-1}$ as per the paper, see the appendix.) Summarizing and annotating the derivation, we have

We now have a recursive algorithm for computing the RL posterior. The algorithm is recursive because after we compute the joint distribution $p(r_{t-1}, \mathbf{x}_{1:t-1})$, we can forward message-pass this distribution’s mass to the calculation for $p(r_{t}, \mathbf{x}_{1:t})$. Of course, a recursive algorithm must define its initial conditions (what is $p(r_0)$), and we will address this later.

It is useful to think about this recursive algorithm visually. Borrowing language from the paper, we can visualize the message-passing algorithm as living on a trellis (Figure $4$). At each time point, mass is either passed “upward” such that the run-length is incremented or “downward” where it is truncated to zero. At each time point $t$, the discrete RL posterior distribution has support over $t+1$ values.

Figure 4. Diagram of the message-passing algorithm. Each node has associated mass. For example, the probability of $p(r_4 = 2, \mathbf{x}_{1:4})$ is associated with the node indexed by $t=4$ and $r_t=2$.

This observation has an important algorithmic consequence. When we compute what Adams and MacKay call the growth probabilities or the probability that the run-length increases, the summation in Equation $3$ disappears because each term in the summation is zero except when $r_{t} = r_{t-1} + 1$. It is only when calculating the changepoint probabilities, when the mass is passed downwards, that we sum over previous joint probabilities.

In summary, provided we can compute the UPM predictive and the changepoint prior in Equation $3$ and can initialize our recursive algorithm, we will have everything we need for both inference and prediction. Let’s tackle these in that order.

Underlying probabilistic model

Computing the UPM predictive leverages conjugacy and the exponential family. In principle, one could abbreviate this section as most of it is not methodologically novel to the paper. However, when I first read the paper, computing the UPM predictive was not obvious to me, and I will go into detail to explain it.

We assume our model is from an exponential family (EF) member with parameters $\boldsymbol{\eta}$. Since we need to compute $p(x_{t+1} \mid r_t, \mathbf{x}_{1:t})$, we can eliminate the dependency on the previous observations ($\mathbf{x}_{1:t}$) by integrating over the model’s parameters,

This is because $x_{t+1}$ is conditionally independent of $\mathbf{x}_{1:t}$ given $\boldsymbol{\eta}$. We know the EF model in Equation $4$ because it is by construction; we can decide that our data is best modeled by any member of the exponential family. But what is our EF posterior and how do we compute this integral? A useful fact about exponential family models is that we can avoid computing these directly.

Posterior predictive for conjugate models

Before discussing the exponential family, let’s review the prior predictive for any conjugate model. Let $\mathcal{D}$ be our observations, $\hat{\mathbf{x}}$ be a new observation, $\boldsymbol{\theta}$ be our parameters, and $\boldsymbol{\alpha}$ be our hyperparameters. The prior predictive distribution is the distribution on $\hat{\mathbf{x}}$ given $\boldsymbol{\alpha}$ marginalized over $\boldsymbol{\theta}$,

Since our prior on $\boldsymbol{\theta}$ is conjugate, we know that

for some different hyperparameters $\boldsymbol{\alpha}’$. This is not a deep statement; we are just observing that if we are using a conjugate prior, then the prior is same functional form as the posterior. This implies

Thus, the posterior predictive is the same as the prior predictive but with hyperparameter $\boldsymbol{\alpha}’$ instead of $\boldsymbol{\alpha}$. If we can compute $\boldsymbol{\alpha}’$, we can compute the posterior predictive without integration and without computing the posterior $p(\boldsymbol{\theta} \mid \mathcal{D}, \boldsymbol{\alpha})$. And conjugate-exponential models allow us to efficiently compute $\boldsymbol{\alpha}’$.

Posterior predictive for exponential family models

Now let’s discuss the prior predictive for the exponential family. Recall that the general form for any member of the exponential family is

where $\boldsymbol{\eta}$ is the natural parameter, $h(\mathbf{x})$ is the underlying measure, $u(\mathbf{x})$ is the sufficient statistic of the data, and $g(\boldsymbol{\eta})$ is a normalizer. The conjugate prior with hyperparameters $\nu$ and $\boldsymbol{\chi}$ is

where $f(\boldsymbol{\chi}, \nu)$ depends on the form of the exponential family member and $\boldsymbol{\eta}$ and $g(\boldsymbol{\eta})$ are the same as before. If we multiply the likelihood times the prior to get the posterior, we have

Since the first $N + 1$ terms are constant with respect to $\boldsymbol{\eta}$, we can write

Note that this is the same exponential family form as the prior, with parameters:

In other words, we have an efficient way to compute the updated hyperparameters $\nu’$ and $\boldsymbol{\chi}’$, which allow us to compute the posterior predictive (Equation $3$) without integration and without computing the posterior over $\boldsymbol{\eta}$. This is what Adams and MacKay mean when they write,

This [UPM predictive] distribution, while generally not itself an exponential-family distribution, is usually a simple function of the sufficient statistics.

Importantly, these sufficient statistics are additive and can therefore be computed sequentially.

Message-passing parameters

The upshot of the previous subsection is that rather than computing the UPM predictive by computing the EF posterior and integrating out the parameters, we just need to keep track of the exponential family parameters by time $t-1$ to make a prediction at time $t$. Recall that at each time point $t$, $r_{t-1}$ can take $t$ possible values (Figure $4$). Let $\boldsymbol{\nu}_{t-1}^{(\ell)}$ and $\boldsymbol{\chi}_{t-1}^{(\ell)}$ denote exponential family parameters at time $t$ when $r_{t-1}$ takes value $\ell$, then

While Adams and MacKay do not describe it this way in the paper, I find it useful to imagine the parameter updates as another message-passing algorithm living on a trellis of possible run lengths, similar to Figure $4$ (Figure $5$).

Figure 5. Diagram of the parameter updates living on a trellis of possible values. (Left) The $\nu$ parameters are always incremented by $1$, while (Right) the $\boldsymbol{\chi}$ parameters are always incremented by the sufficient statistic of the next observation.

At each time point $t$, we want to model a new possible run-length. In this case, each parameter value is just its prior. Otherwise, we compute $\nu_{t}^{(\ell)}$ and $\boldsymbol{\chi}_{t}^{(\ell)}$ for each run length value $r_t = \ell$ by message-passing up the trellis. Formally,

Changepoint prior

The changepoint prior allows us to encode our knowledge about our data’s changepoints into the model. Let $\mathrm{T}$ be a discrete, nonnegative random variable for the current run length. Let $f(\tau)$ denote the probability that the current run length is $\tau$. Let $S(\tau)$ be the survival function at $\tau$, which is the probability that $\mathrm{T}$ takes a value greater than $\tau$ or

Then hazard function $H(\tau)$ is

The hazard function quantifies the answer to the question: “Provided a changepoint has not occurred by run length $\tau$, what is the probability that it will occur at $\tau$?” The actual form of the hazard function clearly depends on the distribution of $\mathrm{T}$ (Figure $6$).

Figure 6. Diagram of the probability $H(\tau)$ for a hypothetical prior distribution $f(\tau)$.

Our modeling assumption is that our changepoint prior is

For Equation $7$, we only need to compute two cases. The first case is the probability of a changepoint, which is just the hazard function. The second case is the probability that a changepoint does not occur, which is just one minus the probability of a changepoint. This prior is computationally efficient because we only have two cases with nonzero mass, and we only need to compute the hazard function once to compute both cases.

If $\mathrm{T}$ is a geometric random variable with a probability of success $p$, then $H(\tau) = p$ (Forbes, Evans, Hastings, & Peacock, 2011). This kind of process is called memoryless because the hazard function does not depend on time.

Initial conditions

Earlier, we punted on discussing the initial conditions of our recursive algorithm,

We now have the notation to address this question. We have two scenarios. The first scenario is that a changepoint occurred just before the first data point is observed or $p(r_0 = 0) = 1$. This may not be the most realistic modeling assumption, but it is simple to implement.

The second scenario is that the first changepoint is in the future but we have access to some recent data. The example Adams gives is that of climate change. We have access to historic and current climate data and hypothesize that a changepoint will occur in the future. In this case, the prior over the initial run length is the normalized survival function

where $Z$ is an appropriate normalizing constant. In words, this is all the mass in the future (note the sum starts at $\tau + 1$, not $\tau$) divided by all the mass of the distribution $f(\cdot)$. The greater the value of $\tau$, the more mass the sum accumulates or the higher the probability that a changepoint will occur by some large $\tau$.

Sequential inference algorithm

We are now prepared to understand Bayesian online changepoint detection in algorithmic detail. This algorithm estimates the RL posterior distribution and performs prediction at each time point $t$. Let’s walk through the paper’s Algorithm $1$ with annotations as needed:

  1. Set priors and initial conditions.

  2. Observe new datum $x_t$.

  3. Compute UPM predictive probabilities. This calculation is for each possible run length value $\ell$,

    At time $t-1$, there are $t$ possible run lengths; we forward message-pass the mass for these values up the trellis to compute the UPM predictive over $x_t$; therefore there are $t$ probabilities $\pi_{t-1}^{(\ell)}$.

  4. Compute growth probabilities. The growth probabilities are the probabilities $p(r_t = r_{t-1} + 1, \mathbf{x}_{1:t})$ for each possible run length value. To compute the probability for a specific value $r_t = \ell$, Equation $3$ is

    Note that there is no sum over $r_{t-1}$ because for a given run length, the only valid “growth value” is $r_{t-1} = \ell - 1$.

  5. Compute changepoint probability. The changepoint probability is the probability that the run length drops to $0$. Again using Equation $3$, we see

    In this case we have a sum over $r_{t-1}$ because $r_{t-1}$ can take any value between $0$ and $t$.

  6. Compute the evidence. This is just the normalizer in Equation $2$.

  7. Compute the RL posterior. Equation $2$.

  8. Update sufficient statistics. Equation $6$.

  9. Perform Prediction. Equation $1$.

  10. Set $t = t+1$. Return to Step 2.

A complete example

Now let’s work through a complete example and implementation of the algorithm. Consider the sequence of $T = 200$ i.i.d. observations, $\mathcal{D} = \{\mathbf{x}_t\}_{t=1}^{T}$, in Figure $7\text{a}$. The groundtruth changepoints are denoted with red dashed lines. We might guess that these data look approximately Gaussian distributed and that while the mean $\mu$ sometimes changes, the variance $\sigma^2$ looks fairly consistent. Thus, to model these data, let’s use a Gaussian model with unknown $\mu$ and fixed precision $\lambda = 1 / \sigma^2$. The posterior predictive (the UPM predictive) for this model is

where $\hat{x} \in \mathbb{R}$ is a new observation and $\mu_T$ and $\lambda_T$ are defined as

See my post on Bayesian Inference for the Gaussian or (Murphy, 2007) for a complete derivation. To make the algebra easier, assume that $\lambda = \lambda_{\text{prior}} = 1$; then we have

In changepoint detection, the number of samples $T$ is just the current time point $t$. So the parameter updates for $\mu_t^{(\ell)}$ and $\lambda_t^{(\ell)}$ as $t$ increases are

In both instances, the parameter update is a simple function of the previous parameters and the next data point.

Figure 7. (Top) Synthetic data generated from a Gaussian distribution with fixed variance and a changing mean. Groundtruth changepoints are denoted with red dashed lines. (Bottom) The RL posterior at each time step using a logarithmic color scale. Darker indicates higher probability. The saw-toothed red line is the max probability at each time point.

To make things easy, let’s assume a memoryless hazard function, $H(\tau) = 1/60$ because we see three changepoints in $200$ data points. The actual changepoint probability I used to generate the data was $1/80$. It is beyond the scope of this post, but clearly this hazard function is an important hyperparameter.

Now we are ready for code. While we have leveraged a lot of conceptual machinery, the code is strikingly simple:

import numpy as np
from   scipy.stats import norm


R        = np.zeros((T+1, T+1))
R[0, 0]  = 1
max_R    = np.empty(T+1)
max_R[0] = 1

H           = 1/60
mu_params   = np.array([mu0])
lamb_params = np.array([lamb0])

rl_pred = lambda x, mu, lamb: norm.pdf(x, mu, 1/lamb + 1)

for t in range(1, T+1):
    x = data[t-1]

    pis = np.array([rl_pred(x, mu_params[i], lamb_params[i]) for i in range(t)])

    R[t, 1:t+1] = R[t-1, :t] * pis * (1-H)
    R[t, 0]     = sum(R[t-1, :t] * pis * H)
    R[t, :]    /= sum(R[t, :])

    max_R[t] = np.argmax(R[t, :])

    offsets     = np.arange(1, t+1)
    mu_params   = np.append([mu0], (mu_params * offsets + x) / (offsets + 1))
    lamb_params = np.append([lamb0], lamb_params + 1)

Note that the above code is a didactic snippet, i.e. you cannot paste this into a file and run it. For a complete example, see my GitHub. We can visualize the results of this algorithm by plotting both the RL posterior and the max posterior value (Figure $7\text{b}$).

Finally, note that Bayesian online changepoint detection is a modular algorithm because both computing the UPM predictive and updating the parameters could be methods on a distribution class that keeps track of the state of its own parameters. Adams notes this in the paper by saying that the framework delineates “between the implementation of the changepoint algorithm and the implementation of the model.”

Conclusions

Bayesian online changepoint detection is a modular Bayesian framework for online estimation of changepoints. It models changes in the generative parameters of data by estimating the posterior over the run length, and it leverages conjugate-exponential models to achieve modularity and efficient parameter estimation. The method is Bayesian in that the changepoint prior allows us to encode knowledge about the world and it models uncertainty about $\boldsymbol{\theta}$, and it is online because it computes the RL posterior at each time step using only previously seen data.

This paper is from 2007, and I am not up-to-date on the literature on extending this work. However, I would guess that key areas for future work would be addressing the memory usage, which grows quadratically with $T$; robustly handling a misspecified model or prior; and developing approximate inference routines for non-conjugate models.

   

Acknowledgements

I thank Diana Cai for giving me an introductory overview of this model and providing working code and Ryan Adams for answering questions about the paper.

   

Appendix

1. The UPM predictive depends only on associated data

Adams and MacKay assume that $x_{t+1}$ only depends on $r_t$ and the recent data in the same partition, not on all the data so far, $\mathbf{x}_{1:t}$. Therefore, the UPM predictive should actually be written as

Note that the authors use $\mathbf{x}_t^{(\ell)}$ as shorthand for $\textbf{x}_{(t - r_t):t}$ when $r_t = \ell$. Notation aside, this distinction does not matter because we will compute the UPM predictive using properties of conjugate-exponential models. For ease of notation, I will elide this detail as I do not think doing so harms understanding.

2. Equation $1$ and the forward algorithm

When I first read the paper, Equation $3$ seemed as if it arose ex nihilo. In my mind, the authors dreamed up $r_t$, knew to marginalize out $r_{t-1}$, and voilà! a recursive algorithm. In fact, Equation $3$ is similar to the forward algorithm for hidden Markov models. In this context, $r_t$ is the latent variable, and there are no generative parameters. Wikipedia’s derivation is quite clear; for stable reference, consult either (Bishop, 2006) or (Murphy, 2012).

3. $x_t$ does not depend on $r_{t-1}$

In Adams and MacKay’s paper, the modeling assumption is

I dislike this notation because it means we would need to offset the observation nodes in Figure $3$ by one. Following hidden Markov model graphical models, I associate each $r_t$ with an observation $x_t$, meaning that

This is just notation that makes drawing the graphical model a bit easier.

  1. Adams, R. P., & MacKay, D. J. C. (2007). Bayesian online changepoint detection. ArXiv Preprint ArXiv:0710.3742.
  2. Fearnhead, P., & Liu, Z. (2007). On-line inference for multiple changepoint problems. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 69(4), 589–605.
  3. Barry, D., & Hartigan, J. A. (1992). Product partition models for change point problems. The Annals of Statistics, 260–279.
  4. Forbes, C., Evans, M., Hastings, N., & Peacock, B. (2011). Statistical distributions. John Wiley & Sons.
  5. Murphy, K. P. (2007). Conjugate Bayesian analysis of the Gaussian distribution. Def, 1(2\sigma2), 16.
  6. Bishop, C. M. (2006). Pattern Recognition and Machine Learning.
  7. Murphy, K. P. (2012). Machine learning: a probabilistic perspective. MIT press.