A Practical Implementation of Gaussian Process Regression

I discuss Rasmussen and Williams's Algorithm 2.1 for an efficient implementation of Gaussian process regression.

In a previous post, I introduced Gaussian process (GP) regression with small didactic code examples. By design, my implementation was naive: I focused on code that computed each term in the equations as explicitly as possible. However, (Rasmussen & Williams, 2006) provide an efficient algorithm (Algorithm 2.1 in their textbook) for fitting and predicting with a Gaussian process regressor. The goal of this post is to explain this practical algorithm in detail.

Recall from the previous post that to fit a GP to noisy observations, we need to compute

$$
\begin{aligned}
\mathbb{E}[\mathbf{f}_{*}] &= K(X_{*}, X) [K(X, X) + \sigma^2_n I]^{-1} \mathbf{y}
\\
\text{Cov}(\mathbf{f}_{*}) &= K(X_{*}, X_{*}) - K(X_{*}, X) [K(X, X) + \sigma^2_n I]^{-1} K(X, X_{*})
\end{aligned}
$$

where $\mathbf{f}_{*}$ is a test output (a Gaussian random vector), $X$ is training data, $X_{*}$ is test data, $K$ is the kernel function, and $\sigma^2_n I$ is noise. Since writing terms such as $K(X, X_{*})$ is tedious, let's use more compact notation following Rasmussen and Williams. Let $K = K(X, X)$ and $K_{*} = K(X, X_{*})$. Thus, $K_{*}^{\top} = K(X_{*}, X)$, and there is no shorthand for $K(X_{*}, X_{*})$. For a single test data point $\mathbf{x}_{*}$, we write $\mathbf{k}_{*}$ to denote the vector of covariances between $\mathbf{x}_{*}$ and the $n$ training data points. Thus, for a single test data point $\mathbf{x}_{*}$, we want to compute

$$
\begin{aligned}
\bar{f_{*}} &= \mathbf{k}_{*}^{\top} [K + \sigma^2_n I]^{-1} \mathbf{y}
\\
\mathbb{V}[f_{*}] &= k(\mathbf{x}_{*}, \mathbf{x}_{*}) - \mathbf{k}_{*}^{\top} [K + \sigma^2_n I]^{-1} \mathbf{k}_{*}.
\end{aligned} \tag{1}
$$

Note that $\bar{f_{*}}$ and $\mathbb{V}[f_{*}]$ are scalar quantities that parameterize a univariate Gaussian for the Gaussian random variable $f_{*}$. If $\mathbf{k}_{*}$ were instead an $n \times m$ matrix in which each of the $m$ column vectors was a vector of covariances between one test point and the $n$ training points, then $\bar{f_{*}}$ would be an $m$-vector of means and $\mathbb{V}[f_{*}]$ would be an $m \times m$ covariance matrix; these two quantities would parameterize a multivariate Gaussian distribution.
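
To make the notation concrete before introducing the efficient algorithm, here is a naive sketch (my own, not from the textbook) that computes Equation 1 for a single test point by explicitly inverting the matrix; the kernel argument is a placeholder, and the rbf_kernel function defined later in this post would work:

import numpy as np

def naive_predict(kernel, X, y, x_star, noise_var):
    # A = K + sigma_n^2 I, inverted directly (the O(n^3) step).
    A     = kernel(X, X) + noise_var * np.eye(len(X))
    A_inv = np.linalg.inv(A)
    # k_*: covariances between the test point and the n training points.
    k_star = kernel(X, x_star.reshape(1, -1)).ravel()
    mean   = k_star @ A_inv @ y
    var    = (kernel(x_star.reshape(1, -1), x_star.reshape(1, -1)).item()
              - k_star @ A_inv @ k_star)
    return mean, var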

The central computational challenge in fitting a Gaussian process is that standard techniques for matrix inversion require $O(n^3)$ time for an $n \times n$ matrix, so fitting a GP scales cubically with the number of training points.

Stable matrix inversion with Cholesky factorization

If $A \in \mathbb{R}^{n \times n}$ is symmetric positive definite, a (relatively) fast and numerically stable way to solve a system of equations $A\mathbf{x} = \mathbf{b}$ is to use Cholesky factorization. The upshot is that $A$ can be decomposed into the product of a lower triangular matrix $L$ and its transpose,

$$
A = LL^{\top}.
$$

Please see Chapter 23 of (Trefethen & Bau III, 1997) for a detailed explanation of how to compute $L$ and why doing so is numerically stable. This decomposition can be used to solve $A\mathbf{x} = \mathbf{b}$ using the following logic:

$$
\begin{aligned}
A \mathbf{x} &= \mathbf{b}
\\
LL^{\top} \mathbf{x} &= \mathbf{b}
\\
LL^{\top} \mathbf{x} &= L \mathbf{y} \qquad \text{for some } \mathbf{y}
\\
L^{\top} \mathbf{x} &= \mathbf{y}.
\end{aligned}
$$

Thus, if we can solve for $\mathbf{y}$ in $L \mathbf{y} = \mathbf{b}$ and then solve for $\mathbf{x}$ in $L^{\top} \mathbf{x} = \mathbf{y}$, we will have solved for the same $\mathbf{x}$ that solves $A \mathbf{x} = \mathbf{b}$. Let the notation $A \setminus \mathbf{b}$ denote the vector $\mathbf{x}$ that solves $A \mathbf{x} = \mathbf{b}$. Then we have

$$
\mathbf{x} = L^{\top} \setminus (L \setminus \mathbf{b}).
$$

Since $L$ and $L^{\top}$ are lower and upper triangular matrices, we can use forward and back substitution, respectively, to efficiently solve these equations. If $L$ is a lower triangular matrix, then $L \mathbf{y} = \mathbf{b}$ can be written as

$$
\begin{array}{ccccccccc}
\ell_{1,1} y_1 & & & & & & & = & b_1 \\
\ell_{2,1} y_1 & + & \ell_{2,2} y_2 & & & & & = & b_2 \\
\vdots & & \vdots & & \ddots & & & & \vdots \\
\ell_{n,1} y_1 & + & \ell_{n,2} y_2 & + & \dots & + & \ell_{n,n} y_n & = & b_n
\end{array}
$$

Clearly, we can solve for $y_1$ because it is just $b_1 / \ell_{1,1}$. We can then use this value to solve for $y_2$:

$$
y_2 = \frac{b_2 - \ell_{2,1} y_1}{\ell_{2,2}},
$$

and so forth. Solving for $\mathbf{y}$ in this way is called forward substitution. Back substitution is an analogous algorithm for upper triangular matrices: rather than working from the top down, you work from the bottom up.
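
To make forward substitution concrete, here is a minimal NumPy sketch (the function name is my own) that solves $L \mathbf{y} = \mathbf{b}$ row by row:

import numpy as np

def forward_substitution(L, b):
    # Solve L y = b for lower triangular L, working from the top down.
    n = len(b)
    y = np.zeros(n)
    for i in range(n):
        # Subtract the already-solved terms, then divide by the diagonal.
        y[i] = (b[i] - L[i, :i] @ y[:i]) / L[i, i]
    return y

Back substitution would loop from the last row to the first, using the entries to the right of the diagonal instead.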

If we can compute the Cholesky factorization $LL^{\top} = A$, then we have a fast way of inverting the matrix (solving for $A \setminus I$).
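
Putting the pieces together, here is a small sketch (on arbitrary example data) of solving $A\mathbf{x} = \mathbf{b}$ as $\mathbf{x} = L^{\top} \setminus (L \setminus \mathbf{b})$ using SciPy's cholesky and solve_triangular:

import numpy as np
from scipy.linalg import cholesky, solve_triangular

rng = np.random.default_rng(0)
B   = rng.normal(size=(5, 5))
A   = B @ B.T + 5 * np.eye(5)   # a symmetric positive definite matrix
b   = rng.normal(size=5)

L = cholesky(A, lower=True)                  # A = L L^T
y = solve_triangular(L, b, lower=True)       # forward substitution: L y = b
x = solve_triangular(L.T, y, lower=False)    # back substitution: L^T x = y

assert np.allclose(A @ x, b)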

A practical algorithm

Now that we understand how to use Cholesky decomposition for matrix inversion, we are ready to compute both quantities in Equation 1 using Algorithm 2.1. The algorithm is,

$$
\begin{aligned}
L &= \text{cholesky}(K + \sigma^2_n I)
\\
\boldsymbol{\alpha} &= L^{\top} \setminus (L \setminus \mathbf{y})
\\
\bar{f_{*}} &= \mathbf{k}_{*}^{\top} \boldsymbol{\alpha}
\\
\mathbf{v} &= L \setminus \mathbf{k}_{*}
\\
\mathbb{V}[f_{*}] &= k(\mathbf{x}_{*}, \mathbf{x}_{*}) - \mathbf{v}^{\top} \mathbf{v}.
\end{aligned}
$$

This looks complicated, but it is just dense notation. Let's unpack it. First, let $A = K + \sigma^2_n I$. Then to invert $A$, we know from the previous section that we need to solve for $A^{-1}$ in the equation $A A^{-1} = I$. Following the logic in the previous section,

$$
A^{-1} = L^{\top} \setminus (L \setminus I).
$$

If we then multiply both sides of $A A^{-1} = I$ by $\mathbf{y}$, the derivation becomes,

$$
\begin{aligned}
AA^{-1} \mathbf{y} &= I \mathbf{y}
\\
LL^{\top} A^{-1} \mathbf{y} &= \mathbf{y}
\\
LL^{\top} A^{-1} \mathbf{y} &= L T \mathbf{y} \qquad \text{for some } T
\\
L^{\top} A^{-1} \mathbf{y} &= T \mathbf{y},
\end{aligned}
$$

and thus

$$
\begin{aligned}
A^{-1} \mathbf{y} &= L^{\top} \setminus T \mathbf{y}
\\
&= L^{\top} \setminus (L \setminus \mathbf{y}).
\end{aligned}
$$

Of course, $L^{\top} \setminus (L \setminus \mathbf{y})$ is just what Rasmussen and Williams call $\boldsymbol{\alpha}$, and clearly

$$
\boldsymbol{\alpha} = A^{-1} \mathbf{y}
\quad\implies\quad
\mathbf{k}_{*}^{\top} \boldsymbol{\alpha} = \mathbf{k}_{*}^{\top} A^{-1} \mathbf{y} = \bar{f_{*}},
$$

as desired.

The second quantity is also notationally dense, but it can be unpacked with a little effort. First, let's solve for $\mathbf{v}^{\top}$:

$$
\begin{aligned}
\mathbf{v} &= L \setminus \mathbf{k}_{*}
\\
L \mathbf{v} &= \mathbf{k}_{*}
\\
\mathbf{v} &= L^{-1} \mathbf{k}_{*}
\\
\mathbf{v}^{\top} &= (L^{-1} \mathbf{k}_{*})^{\top}
\\
&= \mathbf{k}_{*}^{\top} (L^{\top})^{-1}.
\end{aligned}
$$

Above, we use the fact that $(B^{-1})^{\top} = (B^{\top})^{-1}$. Then $\mathbf{v}^{\top} \mathbf{v}$ is

$$
\begin{aligned}
\mathbf{v}^{\top} \mathbf{v} &= \mathbf{k}_{*}^{\top} (L^{\top})^{-1} L^{-1} \mathbf{k}_{*}
\\
&= \mathbf{k}_{*}^{\top} (L L^{\top})^{-1} \mathbf{k}_{*}
\\
&= \mathbf{k}_{*}^{\top} A^{-1} \mathbf{k}_{*},
\end{aligned}
$$

where we use the fact that $B^{-1} C^{-1} = (CB)^{-1}$. Putting this together, clearly

$$
\begin{aligned}
k(\mathbf{x}_{*}, \mathbf{x}_{*}) - \mathbf{v}^{\top} \mathbf{v} &= k(\mathbf{x}_{*}, \mathbf{x}_{*}) - \mathbf{k}_{*}^{\top} A^{-1} \mathbf{k}_{*}
\\
&= \mathbb{V}[f_{*}].
\end{aligned}
$$

In summary, since $A = K + \sigma_n^2 I$ is a symmetric positive definite matrix, we can use Cholesky factorization to compute $L$, which we can then use to compute both the mean and variance as desired.
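
As a quick numerical sanity check of this derivation, the following sketch (my own, on toy data with an inline RBF kernel) computes the predictive mean and variance for a single test point both ways, via Algorithm 2.1 and via direct inversion, and confirms that they agree:

import numpy as np
from scipy.linalg import cholesky, solve_triangular
from scipy.spatial.distance import cdist

rng       = np.random.default_rng(0)
X         = rng.normal(size=(20, 1))              # training inputs
y         = np.sin(X).ravel()                     # training targets
x_star    = np.array([[0.5]])                     # a single test input
noise_var = 0.1                                   # sigma_n^2

rbf    = lambda U, V: np.exp(-0.5 * cdist(U, V, metric='sqeuclidean'))
A      = rbf(X, X) + noise_var * np.eye(len(X))   # K + sigma_n^2 I
k_star = rbf(X, x_star).ravel()                   # k_*
k_ss   = rbf(x_star, x_star).item()               # k(x_*, x_*)

# Algorithm 2.1.
L     = cholesky(A, lower=True)
alpha = solve_triangular(L.T, solve_triangular(L, y, lower=True), lower=False)
v     = solve_triangular(L, k_star, lower=True)
mean, var = k_star @ alpha, k_ss - v @ v

# Direct computation of Equation 1 agrees.
A_inv = np.linalg.inv(A)
assert np.isclose(mean, k_star @ A_inv @ y)
assert np.isclose(var,  k_ss - k_star @ A_inv @ k_star)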

Implementation

Implementing a GP regressor using Algorithm 2.1 is straightforward with SciPy's cho_solve. cho_solve solves the linear equation $A \mathbf{x} = \mathbf{b}$ given the Cholesky factorization of $A$. You simply pass it $L$ and a boolean indicating whether $L$ is lower (True) or upper (False) triangular. Here is an efficient implementation using a radial basis function (RBF) kernel:

import numpy as np
from   scipy.spatial.distance import pdist, cdist, squareform
from   scipy.linalg import cholesky, cho_solve


class GPRegressor:

    def __init__(self, length_scale=1):
        self.length_scale = length_scale
        # In principle, this could be configurable.
        self.kernel = rbf_kernel

    def fit(self, X, y):
        # K = K(X, X). (No noise term sigma_n^2 I is added here.)
        self.kernel_ = self.kernel(X, length_scale=self.length_scale)
        lower = True
        # L = cholesky(K).
        L = cholesky(self.kernel_, lower=lower)
        # alpha = L^T \ (L \ y).
        self.alpha_ = cho_solve((L, lower), y)
        self.X_train_ = X
        self.L_ = L

    def predict(self, X):
        # K(X_*, X): covariances between test points and training points.
        K_star = self.kernel(X, self.X_train_, length_scale=self.length_scale)
        # Predictive mean: K(X_*, X) alpha.
        y_mean = K_star.dot(self.alpha_)
        lower = True
        # Solve K v = K(X, X_*) using the stored Cholesky factor.
        v = cho_solve((self.L_, lower), K_star.T)
        # Predictive covariance: K(X_*, X_*) - K(X_*, X) K^{-1} K(X, X_*).
        y_cov = self.kernel(X, length_scale=self.length_scale) - K_star.dot(v)
        return y_mean, y_cov


def rbf_kernel(X, Y=None, length_scale=1):
    if Y is None:
        # Pairwise squared Euclidean distances among the rows of X.
        dists = pdist(X / length_scale, metric='sqeuclidean')
        K = np.exp(-.5 * dists)
        K = squareform(K)
        np.fill_diagonal(K, 1)
    else:
        # Squared Euclidean distances between the rows of X and the rows of Y.
        dists = cdist(X / length_scale, Y / length_scale, metric='sqeuclidean')
        K = np.exp(-.5 * dists)
    return K
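
A toy usage sketch (my own, on noise-free observations of a sine function) might look like this:

# Fit on a few observations of sin(x) and predict on a dense grid.
X_train = np.linspace(-5, 5, 10)[:, None]
y_train = np.sin(X_train).ravel()
X_test  = np.linspace(-5, 5, 100)[:, None]

gpr = GPRegressor(length_scale=1)
gpr.fit(X_train, y_train)
y_mean, y_cov = gpr.predict(X_test)   # posterior mean and covariance
# Pointwise predictive std; the clip guards against tiny negative values from roundoff.
y_std = np.sqrt(np.clip(np.diag(y_cov), 0, None))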

And that’s it. My understanding is that this is state-of-the-art for standard Gaussian process regression. Of course, exact inference with GPs is still an issue because the algorithm still has time complexity $O(n^3)$, where $n$ is the number of training data points.

Log marginal likelihood

In Chapter 5 of Rasmussen and Williams, the authors discuss the importance of the marginal likelihood,

$$
p(\mathbf{y} \mid X) = \int p(\mathbf{y} \mid \mathbf{f}, X) \, p(\mathbf{f} \mid X) \, \text{d}\mathbf{f},
$$

in Bayesian model selection. That chapter is rich enough to warrant its own reading (or post), but intuitively, the marginal likelihood captures how likely the targets $\mathbf{y}$ are given the data $X$, after marginalizing out all possible functions $\mathbf{f}$. Since the Gaussian process regression modeling assumption with noisy observations is that $\mathbf{y} \sim \mathcal{N}(\mathbf{0}, K + \sigma^2_n I)$, the log marginal likelihood can be written as

$$
\log p(\mathbf{y} \mid X) = -\frac{1}{2} \mathbf{y}^{\top} (K + \sigma_n^2 I)^{-1} \mathbf{y} - \frac{1}{2} \log \det(K + \sigma_n^2 I) - \frac{n}{2} \log 2 \pi. \tag{2}
$$

Please see Chapter 5 for a more detailed discussion. I include this here because Algorithm 2.1 also efficiently computes Equation 2, and it is illuminating to work through the derivation in detail. First, note that

$$
-\frac{1}{2} \mathbf{y}^{\top} (K + \sigma_n^2 I)^{-1} \mathbf{y} = -\frac{1}{2} \mathbf{y}^{\top} \boldsymbol{\alpha},
$$

using $\boldsymbol{\alpha}$ as defined in the previous section. And $\det(K + \sigma_n^2 I)$ can be efficiently computed using the Cholesky factor $L$ of $K + \sigma_n^2 I$:

$$
\det(K + \sigma_n^2 I) = \det(LL^{\top}) = \det(L) \det(L^{\top}).
$$

The reason $\det(L)$ is efficient to compute is that $L$ looks like the following,

$$
\begin{bmatrix}
\ell_{1,1} & 0 & \dots & 0 \\
\ell_{2,1} & \ell_{2,2} & \dots & 0 \\
\vdots & & \ddots & \vdots \\
\ell_{n,1} & \dots & \dots & \ell_{n,n}
\end{bmatrix}.
$$

Since the determinant can be written as the sum of the products of the elements in the top row with their respective minors (see Wikipedia for an example), and since every element in the top row except $\ell_{1,1}$ is zero, the first step in the computation can be written as

$$
\ell_{1,1} \times \det
\begin{bmatrix}
\ell_{2,2} & 0 & \dots & 0 \\
\ell_{3,2} & \ell_{3,3} & \dots & 0 \\
\vdots & & \ddots & \vdots \\
\ell_{n,2} & \dots & \dots & \ell_{n,n}
\end{bmatrix}.
$$

Continuing this logic, we get

$$
\det(L) = \prod_{i=1}^{n} \ell_{i,i}
$$

and

$$
\begin{aligned}
\log \det(K + \sigma_n^2 I) &= \log \det(L)^2
\\
&= \log \Big[ \Big( \prod_{i=1}^{n} \ell_{i,i} \Big)^2 \Big]
\\
&= 2 \sum_{i=1}^{n} \log \ell_{i,i}.
\end{aligned}
$$

In other words, once again using the Cholesky factorization, we have an efficient way to compute the log marginal likelihood in Equation 2:

$$
\log p(\mathbf{y} \mid X) = -\frac{1}{2} \mathbf{y}^{\top} \boldsymbol{\alpha} - \sum_{i} \log \ell_{i,i} - \frac{n}{2} \log 2 \pi.
$$
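
For completeness, here is a sketch of how one might add this computation to the GPRegressor class above (my own addition, not part of the original class); it assumes fit has already been called, so that self.L_ and self.alpha_ are available:

# A possible method for GPRegressor; call after fit(X, y).
def log_marginal_likelihood(self, y):
    # log p(y | X) = -1/2 y^T alpha - sum_i log(L_ii) - (n/2) log(2 pi).
    n = len(y)
    return (-0.5 * y.dot(self.alpha_)
            - np.sum(np.log(np.diag(self.L_)))
            - 0.5 * n * np.log(2 * np.pi))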

   

Acknowledgements

I thank Abhishek G. for pointing out an error when discussing the Cholesky decomposition’s computational complexity.

  1. Rasmussen, C. E., & Williams, C. K. I. (2006). Gaussian processes for machine learning. MIT Press.
  2. Trefethen, L. N., & Bau III, D. (1997). Numerical linear algebra (Vol. 50). SIAM.