A Practical Implementation of Gaussian Process Regression

I discuss Rasmussen and Williams's Algorithm 2.1 for an efficient implementation of Gaussian process regression.

In a previous post, I introduced Gaussian process (GP) regression with small didactic code examples. By design, my implementation was naive: I focused on code that computed each term in the equations as explicitly as possible. However, (Rasmussen & Williams, 2006) provide an efficient algorithm (Algorithm 2.1 in their textbook) for fitting and predicting with a Gaussian process regressor. The goal of this post is to explain this practical algorithm in detail.

Recall from the previous post that to fit a GP to noisy observations, we need to compute

$$
\begin{aligned}
\mathbb{E}[\mathbf{f}_{*}] &= K(X_{*}, X) [K(X, X) + \sigma^2_n I]^{-1} \mathbf{y}
\\
\text{Cov}(\mathbf{f}_{*}) &= K(X_{*}, X_{*}) - K(X_{*}, X) [K(X, X) + \sigma^2_n I]^{-1} K(X, X_{*})
\end{aligned}
$$

where $\mathbf{f}_{*}$ is a test output (a Gaussian random vector), $X$ is training data, $X_{*}$ is test data, $K$ is the kernel function, and $\sigma^2_n I$ is noise. Since writing terms such as $K(X, X_{*})$ is tedious, let's use more compact notation following Rasmussen and Williams. Let $K = K(X, X)$ and $K_{*} = K(X, X_{*})$. Thus, $K_{*}^{\top} = K(X_{*}, X)$, and there is no shorthand for $K(X_{*}, X_{*})$. For a single test data point $\mathbf{x}_{*}$, we write $\mathbf{k}_{*}$ to denote the vector of covariances between $\mathbf{x}_{*}$ and the $n$ training data points. Thus, for a single test data point $\mathbf{x}_{*}$, we want to compute

$$
\begin{aligned}
\bar{f_{*}} &= \mathbf{k}_{*}^{\top} [K + \sigma^2_n I]^{-1} \mathbf{y}
\\
\mathbb{V}[f_{*}] &= k(\mathbf{x}_{*}, \mathbf{x}_{*}) - \mathbf{k}_{*}^{\top} [K + \sigma^2_n I]^{-1} \mathbf{k}_{*}.
\end{aligned} \tag{1}
$$

Note that $\bar{f_{*}}$ and $\mathbb{V}[f_{*}]$ are scalar quantities that parameterize a univariate Gaussian for the Gaussian random variable $f_{*}$. If $\mathbf{k}_{*}$ were instead an $n \times m$ matrix in which each of the $m$ column vectors was a vector of covariances between one test point and the $n$ training points, then $\bar{f_{*}}$ would be an $m$-vector of means and $\mathbb{V}[f_{*}]$ would be an $m \times m$ covariance matrix; these two quantities would parameterize a multivariate Gaussian distribution.
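
To make the notation concrete before introducing the efficient algorithm, here is a naive sketch (my own, not from the textbook) that computes Equation 1 for a single test point by explicitly inverting the matrix; the kernel argument is a placeholder, and the rbf_kernel function defined later in this post would work:

import numpy as np

def naive_predict(kernel, X, y, x_star, noise_var):
    # A = K + sigma_n^2 I, inverted directly (the O(n^3) step).
    A     = kernel(X, X) + noise_var * np.eye(len(X))
    A_inv = np.linalg.inv(A)
    # k_*: covariances between the test point and the n training points.
    k_star = kernel(X, x_star.reshape(1, -1)).ravel()
    mean   = k_star @ A_inv @ y
    var    = (kernel(x_star.reshape(1, -1), x_star.reshape(1, -1)).item()
              - k_star @ A_inv @ k_star)
    return mean, var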

The central computational challenge in fitting a Gaussian process is that standard techniques for matrix inversion require $O(n^3)$ time for an $n \times n$ matrix, so fitting a GP scales cubically with the number of training points.

Stable matrix inversion with Cholesky factorization

If $A \in \mathbb{R}^{n \times n}$ is symmetric positive definite, a (relatively) fast and numerically stable way to solve a system of equations $A\mathbf{x} = \mathbf{b}$ is to use Cholesky factorization. The upshot is that $A$ can be decomposed into the product of a lower triangular matrix $L$ and its transpose,

$$
A = LL^{\top}.
$$

Please see Chapter 23 of (Trefethen & Bau III, 1997) for a detailed explanation of how to compute $L$ and why doing so is numerically stable. This decomposition can be used to solve $A\mathbf{x} = \mathbf{b}$ using the following logic:

$$
\begin{aligned}
A \mathbf{x} &= \mathbf{b}
\\
LL^{\top} \mathbf{x} &= \mathbf{b}
\\
LL^{\top} \mathbf{x} &= L \mathbf{y} \qquad \text{for some } \mathbf{y}
\\
L^{\top} \mathbf{x} &= \mathbf{y}.
\end{aligned}
$$

Thus, if we can solve for $\mathbf{y}$ in $L \mathbf{y} = \mathbf{b}$ and then solve for $\mathbf{x}$ in $L^{\top} \mathbf{x} = \mathbf{y}$, we will have solved for the same $\mathbf{x}$ that solves $A \mathbf{x} = \mathbf{b}$. Let the notation $A \setminus \mathbf{b}$ denote the vector $\mathbf{x}$ that solves $A \mathbf{x} = \mathbf{b}$. Then we have

$$
\mathbf{x} = L^{\top} \setminus (L \setminus \mathbf{b}).
$$

Since $L$ and $L^{\top}$ are lower and upper triangular matrices, we can use forward and back substitution, respectively, to efficiently solve these equations. If $L$ is a lower triangular matrix, then $L \mathbf{y} = \mathbf{b}$ can be written as

$$
\begin{array}{ccccccccc}
\ell_{1,1} y_1 & & & & & & & = & b_1 \\
\ell_{2,1} y_1 & + & \ell_{2,2} y_2 & & & & & = & b_2 \\
\vdots & & \vdots & & \ddots & & & & \vdots \\
\ell_{n,1} y_1 & + & \ell_{n,2} y_2 & + & \dots & + & \ell_{n,n} y_n & = & b_n
\end{array}
$$

Clearly, we can solve for $y_1$ because it is just $b_1 / \ell_{1,1}$. We can then use this value to solve for $y_2$:

$$
y_2 = \frac{b_2 - \ell_{2,1} y_1}{\ell_{2,2}},
$$

and so forth. Solving for $\mathbf{y}$ in this way is called forward substitution. Back substitution is an analogous algorithm for upper triangular matrices: rather than working from the top down, you work from the bottom up.
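
To make forward substitution concrete, here is a minimal NumPy sketch (the function name is my own) that solves $L \mathbf{y} = \mathbf{b}$ row by row:

import numpy as np

def forward_substitution(L, b):
    # Solve L y = b for lower triangular L, working from the top down.
    n = len(b)
    y = np.zeros(n)
    for i in range(n):
        # Subtract the already-solved terms, then divide by the diagonal.
        y[i] = (b[i] - L[i, :i] @ y[:i]) / L[i, i]
    return y

Back substitution would loop from the last row to the first, using the entries to the right of the diagonal instead.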

If we can compute the Cholesky factorization $LL^{\top} = A$, then we have a fast way of inverting the matrix (solving for $A \setminus I$).
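
Putting the pieces together, here is a small sketch (on arbitrary example data) of solving $A\mathbf{x} = \mathbf{b}$ as $\mathbf{x} = L^{\top} \setminus (L \setminus \mathbf{b})$ using SciPy's cholesky and solve_triangular:

import numpy as np
from scipy.linalg import cholesky, solve_triangular

rng = np.random.default_rng(0)
B   = rng.normal(size=(5, 5))
A   = B @ B.T + 5 * np.eye(5)   # a symmetric positive definite matrix
b   = rng.normal(size=5)

L = cholesky(A, lower=True)                  # A = L L^T
y = solve_triangular(L, b, lower=True)       # forward substitution: L y = b
x = solve_triangular(L.T, y, lower=False)    # back substitution: L^T x = y

assert np.allclose(A @ x, b)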

A practical algorithm

Now that we understand how to use Cholesky decomposition for matrix inversion, we are ready to compute both quantities in Equation 1 using Algorithm 2.1. The algorithm is,

$$
\begin{aligned}
L &= \text{cholesky}(K + \sigma^2_n I)
\\
\boldsymbol{\alpha} &= L^{\top} \setminus (L \setminus \mathbf{y})
\\
\bar{f_{*}} &= \mathbf{k}_{*}^{\top} \boldsymbol{\alpha}
\\
\mathbf{v} &= L \setminus \mathbf{k}_{*}
\\
\mathbb{V}[f_{*}] &= k(\mathbf{x}_{*}, \mathbf{x}_{*}) - \mathbf{v}^{\top} \mathbf{v}.
\end{aligned}
$$

This looks complicated, but it is just dense notation. Let's unpack it. First, let $A = K + \sigma^2_n I$. Then to invert $A$, we know from the previous section that we need to solve for $A^{-1}$ in the equation $A A^{-1} = I$. Following the logic in the previous section,

$$
A^{-1} = L^{\top} \setminus (L \setminus I).
$$

If we then multiply both sides of $A A^{-1} = I$ by $\mathbf{y}$, the derivation becomes,

$$
\begin{aligned}
AA^{-1} \mathbf{y} &= I \mathbf{y}
\\
LL^{\top} A^{-1} \mathbf{y} &= \mathbf{y}
\\
LL^{\top} A^{-1} \mathbf{y} &= L T \mathbf{y} \qquad \text{for some } T
\\
L^{\top} A^{-1} \mathbf{y} &= T \mathbf{y},
\end{aligned}
$$

and thus

$$
\begin{aligned}
A^{-1} \mathbf{y} &= L^{\top} \setminus T \mathbf{y}
\\
&= L^{\top} \setminus (L \setminus \mathbf{y}).
\end{aligned}
$$

Of course, $L^{\top} \setminus (L \setminus \mathbf{y})$ is just what Rasmussen and Williams call $\boldsymbol{\alpha}$, and clearly

$$
\boldsymbol{\alpha} = A^{-1} \mathbf{y}
\quad\implies\quad
\mathbf{k}_{*}^{\top} \boldsymbol{\alpha} = \mathbf{k}_{*}^{\top} A^{-1} \mathbf{y} = \bar{f_{*}},
$$

as desired.

The second quantity is also notationally dense, but it can be unpacked with a little effort. First, let's solve for $\mathbf{v}^{\top}$:

$$
\begin{aligned}
\mathbf{v} &= L \setminus \mathbf{k}_{*}
\\
L \mathbf{v} &= \mathbf{k}_{*}
\\
\mathbf{v} &= L^{-1} \mathbf{k}_{*}
\\
\mathbf{v}^{\top} &= (L^{-1} \mathbf{k}_{*})^{\top}
\\
&= \mathbf{k}_{*}^{\top} (L^{\top})^{-1}.
\end{aligned}
$$

Above, we use the fact that $(B^{-1})^{\top} = (B^{\top})^{-1}$. Then $\mathbf{v}^{\top} \mathbf{v}$ is

$$
\begin{aligned}
\mathbf{v}^{\top} \mathbf{v} &= \mathbf{k}_{*}^{\top} (L^{\top})^{-1} L^{-1} \mathbf{k}_{*}
\\
&= \mathbf{k}_{*}^{\top} (L L^{\top})^{-1} \mathbf{k}_{*}
\\
&= \mathbf{k}_{*}^{\top} A^{-1} \mathbf{k}_{*},
\end{aligned}
$$

where we use the fact that $B^{-1} C^{-1} = (CB)^{-1}$. Putting this together, clearly

$$
\begin{aligned}
k(\mathbf{x}_{*}, \mathbf{x}_{*}) - \mathbf{v}^{\top} \mathbf{v} &= k(\mathbf{x}_{*}, \mathbf{x}_{*}) - \mathbf{k}_{*}^{\top} A^{-1} \mathbf{k}_{*}
\\
&= \mathbb{V}[f_{*}].
\end{aligned}
$$

In summary, since $A = K + \sigma_n^2 I$ is a symmetric positive definite matrix, we can use Cholesky factorization to compute $L$, which we can then use to compute both the mean and variance as desired.
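
As a quick numerical sanity check of this derivation, the following sketch (my own, on toy data with an inline RBF kernel) computes the predictive mean and variance for a single test point both ways, via Algorithm 2.1 and via direct inversion, and confirms that they agree:

import numpy as np
from scipy.linalg import cholesky, solve_triangular
from scipy.spatial.distance import cdist

rng       = np.random.default_rng(0)
X         = rng.normal(size=(20, 1))              # training inputs
y         = np.sin(X).ravel()                     # training targets
x_star    = np.array([[0.5]])                     # a single test input
noise_var = 0.1                                   # sigma_n^2

rbf    = lambda U, V: np.exp(-0.5 * cdist(U, V, metric='sqeuclidean'))
A      = rbf(X, X) + noise_var * np.eye(len(X))   # K + sigma_n^2 I
k_star = rbf(X, x_star).ravel()                   # k_*
k_ss   = rbf(x_star, x_star).item()               # k(x_*, x_*)

# Algorithm 2.1.
L     = cholesky(A, lower=True)
alpha = solve_triangular(L.T, solve_triangular(L, y, lower=True), lower=False)
v     = solve_triangular(L, k_star, lower=True)
mean, var = k_star @ alpha, k_ss - v @ v

# Direct computation of Equation 1 agrees.
A_inv = np.linalg.inv(A)
assert np.isclose(mean, k_star @ A_inv @ y)
assert np.isclose(var,  k_ss - k_star @ A_inv @ k_star)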

Implementation

Implementing a GP regressor using Algorithm 2.1 is straightforward with SciPy's cho_solve. cho_solve solves the linear equation $A \mathbf{x} = \mathbf{b}$ given the Cholesky factorization of $A$. You simply pass it $L$ and a boolean indicating whether $L$ is lower (True) or upper (False) triangular. Here is an efficient implementation using a radial basis function (RBF) kernel:

import numpy as np
from   scipy.spatial.distance import pdist, cdist, squareform
from   scipy.linalg import cholesky, cho_solve


class GPRegressor:

    def __init__(self, length_scale=1):
        self.length_scale = length_scale
        # In principle, this could be configurable.
        self.kernel = rbf_kernel

    def fit(self, X, y):
        # K = K(X, X). (No noise term sigma_n^2 I is added here.)
        self.kernel_ = self.kernel(X, length_scale=self.length_scale)
        lower = True
        # L = cholesky(K).
        L = cholesky(self.kernel_, lower=lower)
        # alpha = L^T \ (L \ y).
        self.alpha_ = cho_solve((L, lower), y)
        self.X_train_ = X
        self.L_ = L

    def predict(self, X):
        # K(X_*, X): covariances between test points and training points.
        K_star = self.kernel(X, self.X_train_, length_scale=self.length_scale)
        # Predictive mean: K(X_*, X) alpha.
        y_mean = K_star.dot(self.alpha_)
        lower = True
        # Solve K v = K(X, X_*) using the stored Cholesky factor.
        v = cho_solve((self.L_, lower), K_star.T)
        # Predictive covariance: K(X_*, X_*) - K(X_*, X) K^{-1} K(X, X_*).
        y_cov = self.kernel(X, length_scale=self.length_scale) - K_star.dot(v)
        return y_mean, y_cov


def rbf_kernel(X, Y=None, length_scale=1):
    if Y is None:
        # Pairwise squared Euclidean distances among the rows of X.
        dists = pdist(X / length_scale, metric='sqeuclidean')
        K = np.exp(-.5 * dists)
        K = squareform(K)
        np.fill_diagonal(K, 1)
    else:
        # Squared Euclidean distances between the rows of X and the rows of Y.
        dists = cdist(X / length_scale, Y / length_scale, metric='sqeuclidean')
        K = np.exp(-.5 * dists)
    return K
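
A toy usage sketch (my own, on noise-free observations of a sine function) might look like this:

# Fit on a few observations of sin(x) and predict on a dense grid.
X_train = np.linspace(-5, 5, 10)[:, None]
y_train = np.sin(X_train).ravel()
X_test  = np.linspace(-5, 5, 100)[:, None]

gpr = GPRegressor(length_scale=1)
gpr.fit(X_train, y_train)
y_mean, y_cov = gpr.predict(X_test)   # posterior mean and covariance
# Pointwise predictive std; the clip guards against tiny negative values from roundoff.
y_std = np.sqrt(np.clip(np.diag(y_cov), 0, None))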

And that’s it. My understanding is that this is state-of-the-art for standard Gaussian process regression. Of course, exact inference with GPs is still an issue because the algorithm still has time complexity $O(n^3)$, where $n$ is the number of training data points.

Log marginal likelihood

In Chapter 5 of Rasmussen and Williams, the authors discuss the importance of the marginal likelihood,

$$
p(\mathbf{y} \mid X) = \int p(\mathbf{y} \mid \mathbf{f}, X) \, p(\mathbf{f} \mid X) \, \text{d}\mathbf{f},
$$

in Bayesian model selection. That chapter is rich enough to warrant its own reading (or post), but intuitively, the marginal likelihood captures how likely the targets $\mathbf{y}$ are given the data $X$, after marginalizing out all possible functions $\mathbf{f}$. Since the Gaussian process regression modeling assumption with noisy observations is that $\mathbf{y} \sim \mathcal{N}(\mathbf{0}, K + \sigma^2_n I)$, the log marginal likelihood can be written as

$$
\log p(\mathbf{y} \mid X) = -\frac{1}{2} \mathbf{y}^{\top} (K + \sigma_n^2 I)^{-1} \mathbf{y} - \frac{1}{2} \log \det(K + \sigma_n^2 I) - \frac{n}{2} \log 2 \pi. \tag{2}
$$

Please see Chapter 5 for a more detailed discussion. I include this here because Algorithm 2.1 also efficiently computes Equation 2, and it is illuminating to work through the derivation in detail. First, note that

$$
-\frac{1}{2} \mathbf{y}^{\top} (K + \sigma_n^2 I)^{-1} \mathbf{y} = -\frac{1}{2} \mathbf{y}^{\top} \boldsymbol{\alpha},
$$

using $\boldsymbol{\alpha}$ as defined in the previous section. And $\det(K + \sigma_n^2 I)$ can be efficiently computed using the Cholesky factor $L$ of $K + \sigma_n^2 I$:

$$
\det(K + \sigma_n^2 I) = \det(LL^{\top}) = \det(L) \det(L^{\top}).
$$

The reason $\det(L)$ is efficient to compute is that $L$ looks like the following,

$$
\begin{bmatrix}
\ell_{1,1} & 0 & \dots & 0 \\
\ell_{2,1} & \ell_{2,2} & \dots & 0 \\
\vdots & & \ddots & \vdots \\
\ell_{n,1} & \dots & \dots & \ell_{n,n}
\end{bmatrix}.
$$

Since the determinant can be written as the sum of the products of the elements in the top row with their respective minors (see Wikipedia for an example), and since every element in the top row except $\ell_{1,1}$ is zero, the first step in the computation can be written as

$$
\ell_{1,1} \times \det
\begin{bmatrix}
\ell_{2,2} & 0 & \dots & 0 \\
\ell_{3,2} & \ell_{3,3} & \dots & 0 \\
\vdots & & \ddots & \vdots \\
\ell_{n,2} & \dots & \dots & \ell_{n,n}
\end{bmatrix}.
$$

Continuing this logic, we get

$$
\det(L) = \prod_{i=1}^{n} \ell_{i,i}
$$

and

$$
\begin{aligned}
\log \det(K + \sigma_n^2 I) &= \log \det(L)^2
\\
&= \log \Big[ \Big( \prod_{i=1}^{n} \ell_{i,i} \Big)^2 \Big]
\\
&= 2 \sum_{i=1}^{n} \log \ell_{i,i}.
\end{aligned}
$$

In other words, once again using the Cholesky factorization, we have an efficient way to compute the log marginal likelihood in Equation 2:

$$
\log p(\mathbf{y} \mid X) = -\frac{1}{2} \mathbf{y}^{\top} \boldsymbol{\alpha} - \sum_{i} \log \ell_{i,i} - \frac{n}{2} \log 2 \pi.
$$
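
For completeness, here is a sketch of how one might add this computation to the GPRegressor class above (my own addition, not part of the original class); it assumes fit has already been called, so that self.L_ and self.alpha_ are available:

# A possible method for GPRegressor; call after fit(X, y).
def log_marginal_likelihood(self, y):
    # log p(y | X) = -1/2 y^T alpha - sum_i log(L_ii) - (n/2) log(2 pi).
    n = len(y)
    return (-0.5 * y.dot(self.alpha_)
            - np.sum(np.log(np.diag(self.L_)))
            - 0.5 * n * np.log(2 * np.pi))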

   

Acknowledgements

I thank Abhishek G. for pointing out an error when discussing the Cholesky decomposition’s computational complexity.

  1. Rasmussen, C. E., & Williams, C. K. I. (2006). Gaussian processes for machine learning. MIT Press.
  2. Trefethen, L. N., & Bau III, D. (1997). Numerical linear algebra (Vol. 50). SIAM.