A Practical Implementation of Gaussian Process Regression
I discuss Rasmussen and Williams's Algorithm 2.1 for an efficient implementation of Gaussian process regression.
In a previous post, I introduced Gaussian process (GP) regression with small didactic code examples. By design, my implementation was naive: I focused on code that computed each term in the equations as explicitly as possible. However, Rasmussen and Williams (2006) provide an efficient algorithm (Algorithm 2.1 in their textbook) for fitting and predicting with a Gaussian process regressor. The goal of this post is to explain this practical algorithm in detail.
Recall from the previous post that to fit a GP to noisy observations, we need to compute

$$
\begin{aligned}
\mathbb{E}[\mathbf{f}_{*}] &= K(X_{*}, X)\,[K(X, X) + \sigma_n^2 I]^{-1} \mathbf{y},
\\
\mathbb{V}[\mathbf{f}_{*}] &= K(X_{*}, X_{*}) - K(X_{*}, X)\,[K(X, X) + \sigma_n^2 I]^{-1} K(X, X_{*}),
\end{aligned}
$$
where $\mathbf{f}_{*}$ is a test output (a Gaussian random vector), $X$ is training data, $X_{*}$ is test data, $K(\cdot, \cdot)$ is the kernel function, and $\sigma_n^2$ is the observation noise variance. Since writing terms such as $K(X, X)$ is tedious, let's use more compact notation following Rasmussen and Williams. Let $K = K(X, X)$ and $K_{*} = K(X, X_{*})$. Thus, $K(X_{*}, X) = K_{*}^{\top}$, and there is no shorthand for $K(X_{*}, X_{*})$. For a single test data point $\mathbf{x}_{*}$, we write $\mathbf{k}_{*}$ to denote the vector of covariances between $\mathbf{x}_{*}$ and the $n$ training data points. Thus, for a single test data point $\mathbf{x}_{*}$, we want to compute

$$
\begin{aligned}
\bar{f}_{*} &= \mathbf{k}_{*}^{\top} (K + \sigma_n^2 I)^{-1} \mathbf{y},
\\
\mathbb{V}[f_{*}] &= k(\mathbf{x}_{*}, \mathbf{x}_{*}) - \mathbf{k}_{*}^{\top} (K + \sigma_n^2 I)^{-1} \mathbf{k}_{*}.
\end{aligned}
$$
Note that $\bar{f}_{*}$ and $\mathbb{V}[f_{*}]$ are scalar quantities that parameterize a univariate Gaussian for the Gaussian random variable $f_{*}$. If $\mathbf{k}_{*}$ were instead an $n \times m$ matrix $K_{*}$ in which each of the $m$ column vectors of $K_{*}$ was a vector of covariances between that test point and the $n$ training points, then $\bar{\mathbf{f}}_{*}$ would be an $m$-vector of means and $\mathbb{V}[\mathbf{f}_{*}]$ would be an $m \times m$ covariance matrix; and these two quantities would parameterize a multivariate Gaussian distribution.
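To make the notation concrete, here is a minimal NumPy sketch of the naive computation above. The function name, the noise value, and the explicit use of np.linalg.inv are my own illustrative choices (any kernel with the calling convention of the rbf_kernel defined later in this post would work); the rest of this post replaces the explicit inverse with a Cholesky-based solve.

```python
import numpy as np

def naive_gp_predict(X_train, y_train, X_test, kernel, noise_var=0.1):
    """Naive GP predictive mean and covariance via an explicit inverse.

    A didactic sketch of the equations above; it is O(n^3) and less
    numerically stable than the Cholesky-based version discussed below.
    """
    K = kernel(X_train)                  # K(X, X), shape (n, n)
    K_star = kernel(X_test, X_train)     # K(X_*, X), shape (m, n)
    K_star_star = kernel(X_test)         # K(X_*, X_*), shape (m, m)
    A_inv = np.linalg.inv(K + noise_var * np.eye(len(X_train)))
    mean = K_star @ A_inv @ y_train
    cov = K_star_star - K_star @ A_inv @ K_star.T
    return mean, cov
```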
The central computational challenge for fitting a Gaussian process is that standard techniques for matrix inversion require $\mathcal{O}(n^3)$ time for an $n \times n$ matrix, so fitting a GP scales cubically with the number of training points.
Stable matrix inversion with Cholesky factorization
If $A$ is symmetric positive definite, a (relatively) fast and numerically stable way to solve a system of equations $A\mathbf{x} = \mathbf{b}$ is using Cholesky factorization. The upshot is that $A$ can be decomposed into the product of a lower triangular matrix $L$ and its transpose,

$$A = L L^{\top}.$$
Please see Chapter 23 of (Trefethen & Bau III, 1997) for a detailed explanation of how to compute $L$ and why doing so is numerically stable. This decomposition can be used to solve $A\mathbf{x} = \mathbf{b}$ using the following logic:

$$
A\mathbf{x} = \mathbf{b}
\;\Longleftrightarrow\;
L L^{\top} \mathbf{x} = \mathbf{b}
\;\Longleftrightarrow\;
L^{\top} \mathbf{x} = L^{-1} \mathbf{b}
\;\Longleftrightarrow\;
\mathbf{x} = (L^{\top})^{-1} L^{-1} \mathbf{b}.
$$
Thus, if we can solve for $\mathbf{z}$ in $L\mathbf{z} = \mathbf{b}$, and then solve for $\mathbf{x}$ in $L^{\top}\mathbf{x} = \mathbf{z}$, we will have solved for the same $\mathbf{x}$ that solves $A\mathbf{x} = \mathbf{b}$. Let the notation $A \backslash \mathbf{b}$ denote the vector $\mathbf{x}$ that solves $A\mathbf{x} = \mathbf{b}$. Then we have

$$\mathbf{x} = L^{\top} \backslash (L \backslash \mathbf{b}).$$
Since $L$ and $L^{\top}$ are lower and upper triangular matrices, respectively, we can use forward and back substitution to efficiently solve these equations. If $L$ is a lower triangular matrix, then $L\mathbf{z} = \mathbf{b}$ can be written as

$$
\begin{bmatrix}
\ell_{11} & 0 & \cdots & 0 \\
\ell_{21} & \ell_{22} & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
\ell_{n1} & \ell_{n2} & \cdots & \ell_{nn}
\end{bmatrix}
\begin{bmatrix} z_1 \\ z_2 \\ \vdots \\ z_n \end{bmatrix}
=
\begin{bmatrix} b_1 \\ b_2 \\ \vdots \\ b_n \end{bmatrix}.
$$
Clearly, we can solve for $z_1$ because it is just $z_1 = b_1 / \ell_{11}$. We can then use this value to solve for $z_2$:

$$z_2 = \frac{b_2 - \ell_{21} z_1}{\ell_{22}},$$
and so forth. Solving for $\mathbf{z}$ in this way is called forward substitution. Back substitution is an analogous algorithm for upper triangular matrices: rather than working from the top down, you work from the bottom up.
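As a concrete illustration of the forward-substitution procedure just described, here is a minimal sketch (the function name is my own, not from Rasmussen and Williams); in practice you would call an optimized routine such as scipy.linalg.solve_triangular instead of a Python loop.

```python
import numpy as np

def forward_substitution(L, b):
    """Solve L z = b for z, where L is lower triangular (didactic sketch)."""
    n = len(b)
    z = np.zeros(n)
    for i in range(n):
        # z_i = (b_i - sum_{j < i} L_ij z_j) / L_ii, exactly as in the text.
        z[i] = (b[i] - L[i, :i] @ z[:i]) / L[i, i]
    return z
```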
If we can solve for $\mathbf{x} = L^{\top} \backslash (L \backslash \mathbf{b})$ in this way, then we have a fast way of inverting the matrix $A$ (solving for $\mathbf{x} = A^{-1}\mathbf{b}$).
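Putting the pieces together, the following sketch (my own illustration, with a made-up positive definite matrix) solves $A\mathbf{x} = \mathbf{b}$ via a Cholesky factor and two triangular solves, and checks the result against np.linalg.solve:

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

rng = np.random.default_rng(0)
n = 5
B = rng.normal(size=(n, n))
A = B @ B.T + n * np.eye(n)              # a symmetric positive definite matrix
b = rng.normal(size=n)

L = cholesky(A, lower=True)              # A = L L^T
z = solve_triangular(L, b, lower=True)   # forward substitution: L z = b
x = solve_triangular(L.T, z, lower=False)  # back substitution: L^T x = z

assert np.allclose(x, np.linalg.solve(A, b))
```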
A practical algorithm
Now that we understand how to use Cholesky decomposition for matrix inversion, we are ready to compute both quantities in the predictive equations above using Algorithm 2.1. The algorithm is,

$$
\begin{aligned}
L &:= \operatorname{cholesky}(K + \sigma_n^2 I)
\\
\boldsymbol{\alpha} &:= L^{\top} \backslash (L \backslash \mathbf{y})
\\
\bar{f}_{*} &:= \mathbf{k}_{*}^{\top} \boldsymbol{\alpha}
\\
\mathbf{v} &:= L \backslash \mathbf{k}_{*}
\\
\mathbb{V}[f_{*}] &:= k(\mathbf{x}_{*}, \mathbf{x}_{*}) - \mathbf{v}^{\top} \mathbf{v}
\\
\log p(\mathbf{y} \mid X) &:= -\tfrac{1}{2} \mathbf{y}^{\top} \boldsymbol{\alpha} - \textstyle\sum_{i} \log L_{ii} - \tfrac{n}{2} \log 2\pi
\end{aligned}
$$
This looks complicated, but it is just dense notation. Let's unpack it. First, let $A = K + \sigma_n^2 I$. Then to invert $A$, we know from the previous section that we need to solve for $\mathbf{x}$ in the equation $A\mathbf{x} = \mathbf{y}$. Following the logic in the previous section,

$$
A\mathbf{x} = \mathbf{y}
\;\Longleftrightarrow\;
L L^{\top} \mathbf{x} = \mathbf{y}
\;\Longleftrightarrow\;
\mathbf{x} = L^{\top} \backslash (L \backslash \mathbf{y}) = A^{-1} \mathbf{y}.
$$
If we then multiply both sides by $\mathbf{k}_{*}^{\top}$, the derivation becomes,

$$\mathbf{k}_{*}^{\top} \mathbf{x} = \mathbf{k}_{*}^{\top} A^{-1} \mathbf{y} = \mathbf{k}_{*}^{\top} (K + \sigma_n^2 I)^{-1} \mathbf{y},$$
and thus

$$\mathbf{k}_{*}^{\top} \mathbf{x} = \bar{f}_{*}.$$
Of course, $\mathbf{x} = L^{\top} \backslash (L \backslash \mathbf{y})$ is just what Rasmussen and Williams call $\boldsymbol{\alpha}$, and clearly

$$\bar{f}_{*} = \mathbf{k}_{*}^{\top} \boldsymbol{\alpha} = \mathbf{k}_{*}^{\top} (K + \sigma_n^2 I)^{-1} \mathbf{y},$$
as desired.
The second quantity, $\mathbb{V}[f_{*}]$, is also notationally dense, but can be unpacked with a little effort. First, let's solve for $\mathbf{v}$:

$$\mathbf{v} = L \backslash \mathbf{k}_{*} = L^{-1} \mathbf{k}_{*}.$$
Above, we use the fact that $L \backslash \mathbf{k}_{*}$ is by definition the solution to $L\mathbf{v} = \mathbf{k}_{*}$, namely $L^{-1}\mathbf{k}_{*}$. Then $\mathbf{v}^{\top}\mathbf{v}$ is

$$
\mathbf{v}^{\top}\mathbf{v}
= (L^{-1}\mathbf{k}_{*})^{\top} (L^{-1}\mathbf{k}_{*})
= \mathbf{k}_{*}^{\top} (L^{-1})^{\top} L^{-1} \mathbf{k}_{*}
= \mathbf{k}_{*}^{\top} (K + \sigma_n^2 I)^{-1} \mathbf{k}_{*},
$$
where we use the fact that $(L^{-1})^{\top} L^{-1} = (L L^{\top})^{-1} = (K + \sigma_n^2 I)^{-1}$. Putting this together, clearly

$$\mathbb{V}[f_{*}] = k(\mathbf{x}_{*}, \mathbf{x}_{*}) - \mathbf{v}^{\top}\mathbf{v} = k(\mathbf{x}_{*}, \mathbf{x}_{*}) - \mathbf{k}_{*}^{\top} (K + \sigma_n^2 I)^{-1} \mathbf{k}_{*}.$$
In summary, since $K + \sigma_n^2 I$ is a symmetric positive definite matrix, we can use Cholesky factorization to compute $L$, which we can then use to compute both the mean $\bar{f}_{*}$ and variance $\mathbb{V}[f_{*}]$ as desired.
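To make this unpacking concrete, here is a line-by-line sketch of Algorithm 2.1 in NumPy/SciPy. It is my own transcription with an illustrative toy dataset, and it checks that the Cholesky-based mean and variance agree with the direct formulas that use an explicit inverse.

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

# Illustrative toy covariances (an RBF kernel on 1-D inputs).
rng = np.random.default_rng(0)
X = np.linspace(-3, 3, 15).reshape(-1, 1)
y = np.sin(X).ravel() + 0.1 * rng.normal(size=15)
x_star = np.array([[0.25]])
K = np.exp(-0.5 * (X - X.T) ** 2)
k_star = np.exp(-0.5 * (X - x_star.T) ** 2).ravel()
k_ss = 1.0
noise_var = 0.1

# Algorithm 2.1, line by line.
L = cholesky(K + noise_var * np.eye(len(y)), lower=True)
alpha = solve_triangular(L.T, solve_triangular(L, y, lower=True), lower=False)
mean = k_star @ alpha
v = solve_triangular(L, k_star, lower=True)
var = k_ss - v @ v

# Check against the direct formulas with an explicit inverse.
A_inv = np.linalg.inv(K + noise_var * np.eye(len(y)))
assert np.isclose(mean, k_star @ A_inv @ y)
assert np.isclose(var, k_ss - k_star @ A_inv @ k_star)
```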
Implementation
Implementing a GP regressor using Algorithm 2.1 is straightforward with SciPy's cho_solve. cho_solve solves the linear system $A\mathbf{x} = \mathbf{b}$ given the Cholesky factorization of $A$. You simply pass it $L$ and a boolean variable indicating whether $L$ is lower (True) or upper (False) triangular. Here is an efficient implementation using a radial basis function (RBF) kernel:
import numpy as np
from scipy.spatial.distance import pdist, cdist, squareform
from scipy.linalg import cholesky, cho_solve


class GPRegressor:

    def __init__(self, length_scale=1):
        self.length_scale = length_scale
        # In principle, this could be configurable.
        self.kernel = rbf_kernel

    def fit(self, X, y):
        # Cache K = k(X, X), its lower Cholesky factor L, and alpha = K^{-1} y.
        self.kernel_ = self.kernel(X, length_scale=self.length_scale)
        lower = True
        L = cholesky(self.kernel_, lower=lower)
        self.alpha_ = cho_solve((L, lower), y)
        self.X_train_ = X
        self.L_ = L

    def predict(self, X):
        # Predictive mean k_*^T alpha and predictive covariance
        # k(X_*, X_*) - k_*^T K^{-1} k_*, with K^{-1} k_* computed by cho_solve.
        K_star = self.kernel(X, self.X_train_, length_scale=self.length_scale)
        y_mean = K_star.dot(self.alpha_)
        lower = True
        v = cho_solve((self.L_, lower), K_star.T)
        y_cov = self.kernel(X, length_scale=self.length_scale) - K_star.dot(v)
        return y_mean, y_cov


def rbf_kernel(X, Y=None, length_scale=1):
    if Y is None:
        # Pairwise kernel values of X with itself; squareform leaves zeros on
        # the diagonal, so fill it with k(x, x) = 1.
        dists = pdist(X / length_scale, metric='sqeuclidean')
        K = np.exp(-.5 * dists)
        K = squareform(K)
        np.fill_diagonal(K, 1)
    else:
        dists = cdist(X / length_scale, Y / length_scale, metric='sqeuclidean')
        K = np.exp(-.5 * dists)
    return K
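Here is a quick usage sketch with made-up toy data of my own, assuming the GPRegressor and rbf_kernel above are in scope:

```python
import numpy as np

# Toy 1-D regression data (illustrative values, not from the post).
X_train = np.linspace(-3, 3, 10).reshape(-1, 1)
y_train = np.sin(X_train).ravel()
X_test = np.linspace(-3, 3, 50).reshape(-1, 1)

gp = GPRegressor(length_scale=0.5)
gp.fit(X_train, y_train)
y_mean, y_cov = gp.predict(X_test)

# Pointwise predictive standard deviations come from the covariance diagonal.
y_std = np.sqrt(np.clip(np.diag(y_cov), 0.0, None))
print(y_mean.shape, y_cov.shape)  # (50,), (50, 50)
```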
And that's it. My understanding is that this is state-of-the-art for standard Gaussian process regression. Obviously, exact inference with GPs is still an issue because this still has $\mathcal{O}(n^3)$ time complexity, where $n$ is the number of training data points.
Log marginal likelihood
In Chapter 5 of Rasmussen and Williams, the authors discuss the importance of the marginal likelihood,

$$p(\mathbf{y} \mid X) = \int p(\mathbf{y} \mid \mathbf{f}, X)\, p(\mathbf{f} \mid X)\, d\mathbf{f},$$
in Bayesian model selection. That chapter is rich enough to warrant its own reading (or post), but intuitively, the marginal likelihood captures how likely the targets $\mathbf{y}$ are given the data $X$, after marginalizing out all possible functions $\mathbf{f}$. Since the Gaussian process regression modeling assumption with noisy observations is that $\mathbf{y} \sim \mathcal{N}(\mathbf{0}, K + \sigma_n^2 I)$, the log marginal likelihood can be written as

$$\log p(\mathbf{y} \mid X) = -\frac{1}{2} \mathbf{y}^{\top} (K + \sigma_n^2 I)^{-1} \mathbf{y} - \frac{1}{2} \log |K + \sigma_n^2 I| - \frac{n}{2} \log 2\pi.$$
Please see Chapter 5 for a more detailed discussion. I include this here because Algorithm 2.1 also efficiently computes the log marginal likelihood above, and it is illuminating to work through the derivation in detail. First, note that

$$\mathbf{y}^{\top} (K + \sigma_n^2 I)^{-1} \mathbf{y} = \mathbf{y}^{\top} \boldsymbol{\alpha},$$
using $\boldsymbol{\alpha} = (K + \sigma_n^2 I)^{-1} \mathbf{y}$ as defined in the previous section. And $\log |K + \sigma_n^2 I|$ can be efficiently computed using the Cholesky factor $L$ of $K + \sigma_n^2 I$:

$$\log |K + \sigma_n^2 I| = \log |L L^{\top}| = \log\big(|L|\,|L^{\top}|\big) = 2 \log |L|.$$
The reason $|L|$ is efficient to compute is because $L$ looks like the following,

$$
L = \begin{bmatrix}
\ell_{11} & 0 & \cdots & 0 \\
\ell_{21} & \ell_{22} & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
\ell_{n1} & \ell_{n2} & \cdots & \ell_{nn}
\end{bmatrix}.
$$
Since the determinant can be written as the sum of the products of the elements in the top row with their respective minors (see Wikipedia for an example), and since every element in the top row except $\ell_{11}$ is zero, the first step in the computation can be written as

$$
|L| = \ell_{11}
\begin{vmatrix}
\ell_{22} & 0 & \cdots & 0 \\
\ell_{32} & \ell_{33} & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
\ell_{n2} & \ell_{n3} & \cdots & \ell_{nn}
\end{vmatrix}.
$$
Continuing this logic, we get

$$|L| = \prod_{i=1}^{n} \ell_{ii},$$
and

$$\log |L| = \sum_{i=1}^{n} \log \ell_{ii}.$$
In other words, once again using the Cholesky factorization, we have an efficient way to compute the log marginal likelihood:

$$\log p(\mathbf{y} \mid X) = -\frac{1}{2} \mathbf{y}^{\top} \boldsymbol{\alpha} - \sum_{i=1}^{n} \log L_{ii} - \frac{n}{2} \log 2\pi.$$
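For completeness, here is a hedged sketch of how the quantities cached by the GPRegressor above could be reused to compute this log marginal likelihood. The function name is mine, and it assumes the class stores L_ and alpha_ as in the implementation section.

```python
import numpy as np

def log_marginal_likelihood(gp, y):
    """Log marginal likelihood from a fitted GPRegressor's cached L_ and alpha_.

    Assumes gp.fit(X, y) has already been called, so gp.L_ is the lower
    Cholesky factor of the training covariance and gp.alpha_ solves the
    corresponding linear system. Note that the implementation above does not
    add an explicit noise term, so K plays the role of K + sigma_n^2 I here.
    """
    n = len(y)
    data_fit = -0.5 * y @ gp.alpha_
    complexity = -np.sum(np.log(np.diag(gp.L_)))
    normalizer = -0.5 * n * np.log(2 * np.pi)
    return data_fit + complexity + normalizer
```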
Acknowledgements
I thank Abhishek G. for pointing out an error when discussing the Cholesky decomposition’s computational complexity.
- Rasmussen, C. E., & Williams, C. K. I. (2006). Gaussian Processes for Machine Learning. MIT Press, Cambridge, MA.
- Trefethen, L. N., & Bau III, D. (1997). Numerical Linear Algebra. SIAM.