Comparing Kernel Ridge with Gaussian Process Regression

The posterior mean from a Gaussian process regressor is related to the prediction of a kernel ridge regressor. I explore this connection in detail.

With appropriate hyperparameters, the posterior mean of a Gaussian process (GP) regressor with observation noise matches the prediction of a kernel ridge regressor. I learned of this relationship from this StackExchange comment by Andreas Mueller. I was surprised, and the goal of this post is to better understand that comment. Following (Welling, 2013), I’ll first introduce kernel ridge regression by re-deriving it from ridge regression. We can then compare this to Gaussian process regression, which I have discussed in detail here and here. Finally, this post relies on understanding the kernel trick. Please see my previous post on the topic if needed.

Ridge regression

Suppose we have a regression problem with data \{\mathbf{x}_n, y_n\}_{n=1}^{N}. Classical linear regression is a simple linear model with additive Gaussian noise, \varepsilon_n \sim \mathcal{N}(0, \sigma^2),

y_n = \mathbf{x}_n^{\top} \boldsymbol{\beta} + \varepsilon_n. \tag{1}

Linear regression is fit by minimizing the sum of squared residuals. If \mathbf{X} is an N \times P matrix of N data points and P predictors, the normal equations for linear regression are

\hat{\boldsymbol{\beta}} = (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^{\top} \mathbf{y}. \tag{2}
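As a quick sanity check on (2), here is a minimal NumPy sketch (the data and names are my own toy choices) that solves the normal equations directly and confirms the result against np.linalg.lstsq:

```python
import numpy as np

rng = np.random.default_rng(0)
N, P = 100, 3
X = rng.normal(size=(N, P))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + 0.1 * rng.normal(size=N)

# Normal equations, equation (2): solve (X^T X) beta = X^T y.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check against NumPy's least-squares solver.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
assert np.allclose(beta_hat, beta_lstsq)
```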

See my previous post on ordinary least squares regression for details. To deal with multicollinearity (linear relationships between predictors), ridge regression introduces Tikhonov regularization, an \ell_2-norm penalty on the weights \boldsymbol{\beta}. Ridge regression is so-named because the new optimization amounts to adding values along the diagonal (or “ridge”) of the covariance matrix,

\hat{\boldsymbol{\beta}} = (\lambda \mathbf{I}_P + \mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^{\top} \mathbf{y}. \tag{3}

Here \lambda is a fixed hyperparameter. If our data matrix \mathbf{X} exhibits multicollinearity, i.e. if there are linear dependencies between its columns, then the covariance matrix \mathbf{X}^{\top} \mathbf{X} will be singular, since \text{rank}(\mathbf{A}) = \text{rank}(\mathbf{A}^{\top} \mathbf{A}) in general. See this appendix from a previous post for a proof. In other words, \mathbf{X}^{\top} \mathbf{X} is non-invertible. Thus, from the perspective of numerical linear algebra, ridge regression stabilizes the matrix inversion in (2) by adding small values to the diagonal in (3).
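To see the stabilization concretely, here is a small NumPy sketch (again, a toy example of my own) with an exactly collinear column: \mathbf{X}^{\top} \mathbf{X} is singular, so (2) cannot be solved directly, while the ridge system in (3) is well defined:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 50
x1 = rng.normal(size=N)
X = np.column_stack([x1, 2.0 * x1, rng.normal(size=N)])  # first two columns are collinear
y = X @ np.array([1.0, 0.5, -1.0]) + 0.1 * rng.normal(size=N)

XtX = X.T @ X
print(np.linalg.matrix_rank(XtX))  # 2 < 3, so X^T X is singular

# Ridge, equation (3): the "ridge" lambda * I makes the system invertible.
lam = 1e-2
beta_ridge = np.linalg.solve(lam * np.eye(X.shape[1]) + XtX, X.T @ y)
print(beta_ridge)
```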

Kernel ridge regression

Now let’s kernelize ridge regression. First, let’s replace each feature vector or sample \mathbf{x}_n (a row of \mathbf{X}) with \varphi(\mathbf{x}_n) where \varphi: \mathbb{R}^P \mapsto \mathbb{R}^K. Then we can write (1) as

y_n = [\varphi(\mathbf{x}_n)]^{\top} \boldsymbol{\beta} + \varepsilon_n \tag{4}

where \boldsymbol{\beta} is a K- rather than P-vector. Let \boldsymbol{\Phi} = [\varphi(\mathbf{x}_1), \dots, \varphi(\mathbf{x}_N)]^{\top} be an N \times K matrix. Then the matrix inverse in the kernelized analogue of (3), rather than being P \times P, is K \times K,

\hat{\boldsymbol{\beta}} = (\lambda \mathbf{I}_K + \boldsymbol{\Phi}^{\top} \boldsymbol{\Phi})^{-1} \boldsymbol{\Phi}^{\top} \mathbf{y}. \tag{5}

If K \gg P, as it is for many choices of \varphi(\cdot), then (5) is problematic. Recall that K can even be infinite. Kernel ridge regression solves this problem using the Woodbury matrix identity, rewriting (5) as

\hat{\boldsymbol{\beta}} = \boldsymbol{\Phi}^{\top} (\lambda \mathbf{I}_N + \boldsymbol{\Phi} \boldsymbol{\Phi}^{\top})^{-1} \mathbf{y}. \tag{6}

See the appendix for a complete derivation of (6). Note that we can rewrite (6) as

\hat{\boldsymbol{\beta}} = \sum_{n=1}^{N} \alpha_n \varphi(\mathbf{x}_n) \quad\text{where}\quad \boldsymbol{\alpha} = (\lambda \mathbf{I}_N + \boldsymbol{\Phi} \boldsymbol{\Phi}^{\top})^{-1} \mathbf{y}. \tag{6}
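Before unpacking what this means, here is a quick numerical check (with a toy quadratic feature map of my own choosing) that the primal form (5) and the dual form (6) produce the same coefficients:

```python
import numpy as np

def phi(X):
    """Toy feature map: original features plus their squares, so K = 2P."""
    return np.hstack([X, X ** 2])

rng = np.random.default_rng(2)
N, P = 40, 3
X = rng.normal(size=(N, P))
y = rng.normal(size=N)
lam = 0.5

Phi = phi(X)            # N x K
K_dim = Phi.shape[1]

# Primal form, equation (5): a K x K solve.
beta_primal = np.linalg.solve(lam * np.eye(K_dim) + Phi.T @ Phi, Phi.T @ y)

# Dual form, equation (6): an N x N solve, beta = Phi^T alpha.
alpha = np.linalg.solve(lam * np.eye(N) + Phi @ Phi.T, y)
beta_dual = Phi.T @ alpha

assert np.allclose(beta_primal, beta_dual)
```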

Mathematically, the rewrite in (6) is nothing deep; we’re just applying the definition of matrix multiplication. However, it has deep implications. Welling (2013) summarizes the idea nicely:

The solution \mathbf{w} [our \boldsymbol{\beta}] must lie in the span of the data-cases, even if the dimensionality of the feature space is much larger than the number of data-cases. This seems intuitively clear, since the algorithm is linear in feature space.

Thus, if we are given a new test point \mathbf{x}, the predicted y is

y \equiv [\varphi(\mathbf{x})]^{\top} \hat{\boldsymbol{\beta}} = [\varphi(\mathbf{x})]^{\top} \boldsymbol{\Phi}^{\top} (\lambda \mathbf{I}_N + \boldsymbol{\Phi} \boldsymbol{\Phi}^{\top})^{-1} \mathbf{y}. \tag{7}

We’re now ready for the kernel trick, k(\mathbf{x}, \mathbf{y}) = \langle \varphi(\mathbf{x}), \varphi(\mathbf{y}) \rangle, for some positive definite kernel function k. Let \mathbf{K} be a matrix and \mathbf{k}(\mathbf{x}) be a vector such that

\begin{aligned} \mathbf{K}_{ij} &= \varphi(\mathbf{x}_i)^{\top} \varphi(\mathbf{x}_j), \\ \mathbf{k}(\mathbf{x}) &= [k(\mathbf{x}, \mathbf{x}_1), \dots, k(\mathbf{x}, \mathbf{x}_N)]^{\top}. \end{aligned} \tag{8}

Then we can write (7) as

y = [\mathbf{k}(\mathbf{x})]^{\top} (\lambda \mathbf{I}_N + \mathbf{K})^{-1} \mathbf{y} \tag{9}

using the kernel trick. While \varphi(\cdot) might lift the features into a possibly infinite-dimensional space, we only need to evaluate k(\cdot, \cdot) on our data.
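Putting (8) and (9) together, here is a short NumPy sketch of kernel ridge prediction with an RBF kernel (the kernel choice, bandwidth, and data are my own toy illustration, not a reference implementation):

```python
import numpy as np

def rbf_kernel(A, B, length_scale=1.0):
    """k(a, b) = exp(-||a - b||^2 / (2 * length_scale^2)), evaluated pairwise."""
    sq_dists = np.sum(A ** 2, axis=1)[:, None] + np.sum(B ** 2, axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-sq_dists / (2 * length_scale ** 2))

rng = np.random.default_rng(3)
N = 50
X = rng.uniform(-3, 3, size=(N, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=N)
X_test = np.linspace(-3, 3, 100)[:, None]

lam = 0.1
K = rbf_kernel(X, X)                          # equation (8): N x N Gram matrix
alpha = np.linalg.solve(lam * np.eye(N) + K, y)
y_pred = rbf_kernel(X_test, X) @ alpha        # equation (9): k(x)^T (lam*I + K)^{-1} y
```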

As with linear regression, ridge regression can be interpreted from a probabilistic perspective (see (Bishop, 2006), section 3.1.1, or these lecture notes from Ryan P. Adams). Under that interpretation, (9) is simply the mean of the predictive distribution (see (Bishop, 2006), equation 3.53). However, our current framing is sufficient for this post.

Gaussian process regression

Now let’s see how GP regression is related to kernel ridge regression. In GP regression with noisy observations, the model is

y_n = f(\mathbf{x}_n) + \varepsilon_n, \tag{10}

where \varepsilon_n is i.i.d. Gaussian noise, \varepsilon_n \sim \mathcal{N}(0, \sigma^2), and the function f is itself a random variable,

f \sim \mathcal{N}(m(\mathbf{x}), k(\mathbf{x}, \mathbf{x})). \tag{11}

In (11), m(\cdot) is the mean function and k(\cdot, \cdot) is a positive definite kernel. If this looks unfamiliar, please see my previous post on GP regression. If we use \mathbf{f} to denote [f(\mathbf{x}_1), \dots, f(\mathbf{x}_N)]^{\top} and use the subscript * to denote testing rather than training variables, then, since \mathbf{y} = \mathbf{f} + \boldsymbol{\varepsilon}, the joint distribution of the test function values \mathbf{f}_* and the noisy training observations \mathbf{y} is

\begin{bmatrix} \mathbf{f}_* \\ \mathbf{y} \end{bmatrix} \sim \mathcal{N} \Bigg( \begin{bmatrix} m(\mathbf{X}_*) \\ m(\mathbf{X}) \end{bmatrix}, \begin{bmatrix} K(\mathbf{X}_*, \mathbf{X}_*) & K(\mathbf{X}_*, \mathbf{X}) \\ K(\mathbf{X}, \mathbf{X}_*) & \sigma^2 \mathbf{I} + K(\mathbf{X}, \mathbf{X}) \end{bmatrix} \Bigg). \tag{12}

Hopefully it is clear from context that K(\cdot, \cdot) denotes applying the kernel function to every pair of rows in the two matrices (sets of vectors). Since \mathbb{E}[y_n] = \mathbb{E}[f(\mathbf{x}_n)], we are interested in the mean (and covariance) of \mathbf{f}_{*} given \mathbf{y}, i.e. the model’s posterior. Thankfully, we can use properties of the conditional Gaussian distribution to easily compute these quantities,

\begin{aligned} \mathbb{E}[\mathbf{f}_{*}] &= K(\mathbf{X}_*, \mathbf{X}) [\sigma^2 \mathbf{I} + K(\mathbf{X}, \mathbf{X})]^{-1} \mathbf{y}, \\ \text{Cov}(\mathbf{f}_{*}) &= K(\mathbf{X}_*, \mathbf{X}_*) - K(\mathbf{X}_*, \mathbf{X}) [\sigma^2 \mathbf{I} + K(\mathbf{X}, \mathbf{X})]^{-1} K(\mathbf{X}, \mathbf{X}_*). \end{aligned} \tag{13}

The expectation in (13) is just a vectorized version of kernel ridge regression in (9) if we assume the same kernel function k(\cdot, \cdot) and if \sigma^2, the variance of the GP noise, is equal to the ridge regression regularization hyperparameter \lambda.
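To check this claim numerically, the self-contained sketch below (same toy RBF kernel and data as the kernel ridge example above; all names are mine) computes the posterior mean and covariance in (13) with a zero mean function and confirms the mean matches the kernel ridge prediction (9) when \lambda = \sigma^2:

```python
import numpy as np

def rbf_kernel(A, B, length_scale=1.0):
    sq = np.sum(A ** 2, axis=1)[:, None] + np.sum(B ** 2, axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * length_scale ** 2))

rng = np.random.default_rng(4)
N = 50
X = rng.uniform(-3, 3, size=(N, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=N)
X_test = np.linspace(-3, 3, 100)[:, None]
sigma2 = 0.1                                  # GP noise variance; also used as lambda

K = rbf_kernel(X, X)                          # K(X, X)
K_star = rbf_kernel(X_test, X)                # K(X_*, X)
K_ss = rbf_kernel(X_test, X_test)             # K(X_*, X_*)

# GP posterior, equation (13), with zero mean function.
A = sigma2 * np.eye(N) + K
gp_mean = K_star @ np.linalg.solve(A, y)
gp_cov = K_ss - K_star @ np.linalg.solve(A, K_star.T)

# Kernel ridge prediction, equation (9), with lambda = sigma2.
krr_pred = K_star @ np.linalg.solve(sigma2 * np.eye(N) + K, y)
assert np.allclose(gp_mean, krr_pred)
```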

Analysis

Kernel ridge and GP regression are quite similar, but the major difference is that a GP regressor is a generative model of the response. This has a few major consequences. While kernel ridge regression can only predict y, a GP can also quantify its uncertainty about y. Furthermore, a GP can generate posterior samples through the generative process. Finally, a GP has a marginal likelihood, which can be used to find optimal hyperparameters via optimization (see (Rasmussen & Williams, 2006), chapter 5); in kernel ridge regression, one must instead select \lambda by grid search, typically with cross-validation.
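Since the comment that motivated this post was about scikit-learn, here is a hedged sketch of how I would compare the two estimators there. I believe the parameterizations line up as follows: KernelRidge’s alpha plays the role of \lambda, GaussianProcessRegressor’s alpha is the noise variance \sigma^2, and gamma = 1 / (2 \ell^2) matches RBF(length_scale=\ell). Treat it as an illustration to verify rather than a reference:

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(5)
X = rng.uniform(-3, 3, size=(50, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=50)
X_test = np.linspace(-3, 3, 100)[:, None]

length_scale = 1.0
noise = 0.1  # plays the role of both sigma^2 and lambda

krr = KernelRidge(alpha=noise, kernel="rbf", gamma=1.0 / (2 * length_scale ** 2))
krr.fit(X, y)

# optimizer=None keeps the kernel hyperparameters fixed rather than maximizing
# the marginal likelihood, so the comparison is apples to apples.
gpr = GaussianProcessRegressor(kernel=RBF(length_scale=length_scale), alpha=noise, optimizer=None)
gpr.fit(X, y)

print(np.allclose(krr.predict(X_test), gpr.predict(X_test)))  # expected: True
```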


Appendix

A1. Kernel ridge from the Woodbury identity

If \mathbf{A} is a P \times P full-rank matrix that is rank-corrected by \mathbf{UCV}, where \mathbf{U} \in \mathbb{R}^{P \times K}, \mathbf{C} \in \mathbb{R}^{K \times K}, and \mathbf{V} \in \mathbb{R}^{K \times P}, then the Woodbury identity is

(\mathbf{A} + \mathbf{UCV})^{-1} = \mathbf{A}^{-1} - \mathbf{A}^{-1} \mathbf{U} (\mathbf{C}^{-1} + \mathbf{V} \mathbf{A}^{-1} \mathbf{U})^{-1} \mathbf{V} \mathbf{A}^{-1}.

Apply the Woodbury matrix identity with \mathbf{A} = \mathbf{C} = \mathbf{I}. Then

\begin{aligned} (\mathbf{I} + \mathbf{U}\mathbf{V})^{-1} &= \mathbf{I} - \mathbf{U}(\mathbf{I} + \mathbf{V}\mathbf{U})^{-1} \mathbf{V}, \\ (\mathbf{I} + \mathbf{U}\mathbf{V})^{-1} \mathbf{U} &= [\mathbf{I} - \mathbf{U}(\mathbf{I} + \mathbf{V}\mathbf{U})^{-1} \mathbf{V}] \mathbf{U} \\ &= \mathbf{U} - \mathbf{U}(\mathbf{I} + \mathbf{V}\mathbf{U})^{-1} \mathbf{V}\mathbf{U} \\ &= \mathbf{U}(\mathbf{I} + \mathbf{V}\mathbf{U})^{-1} [(\mathbf{I} + \mathbf{V}\mathbf{U}) - \mathbf{V}\mathbf{U}] \\ &= \mathbf{U}(\mathbf{I} + \mathbf{V}\mathbf{U})^{-1}. \end{aligned}

This is the push-through identity. Now multiply both sides by \lambda / \lambda = 1, and recall that (k \mathbf{A})^{-1} = k^{-1} \mathbf{A}^{-1} for any nonzero scalar k, which lets us move the factor of 1/\lambda inside each inverse:

\begin{aligned} (\mathbf{I} + \mathbf{U}\mathbf{V})^{-1} \mathbf{U} &= \mathbf{U}(\mathbf{I} + \mathbf{V}\mathbf{U})^{-1} \\ \frac{\lambda}{\lambda}\left[(\mathbf{I} + \mathbf{U}\mathbf{V})^{-1} \mathbf{U}\right] &= \frac{\lambda}{\lambda}\left[\mathbf{U}(\mathbf{I} + \mathbf{V}\mathbf{U})^{-1}\right] \\ (\lambda \mathbf{I} + \lambda \mathbf{U}\mathbf{V})^{-1} \lambda \mathbf{U} &= \lambda \mathbf{U}(\lambda \mathbf{I} + \mathbf{V} \lambda \mathbf{U})^{-1} \\ (\lambda \mathbf{I} + \mathbf{W} \mathbf{V})^{-1} \mathbf{W} &= \mathbf{W}(\lambda \mathbf{I} + \mathbf{V} \mathbf{W})^{-1}. \end{aligned}

In the last step, we set \mathbf{W} = \lambda \mathbf{U}. For kernel ridge regression, we apply the last line with \mathbf{W} = \boldsymbol{\Phi}^{\top} and \mathbf{V} = \boldsymbol{\Phi}, giving (\lambda \mathbf{I}_K + \boldsymbol{\Phi}^{\top} \boldsymbol{\Phi})^{-1} \boldsymbol{\Phi}^{\top} = \boldsymbol{\Phi}^{\top} (\lambda \mathbf{I}_N + \boldsymbol{\Phi} \boldsymbol{\Phi}^{\top})^{-1}, which converts (5) into (6).
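As a final sanity check (a toy sketch with arbitrary shapes of my choosing), the snippet below verifies the last line of the derivation with \mathbf{W} = \boldsymbol{\Phi}^{\top} and \mathbf{V} = \boldsymbol{\Phi}, and hence the equivalence of (5) and (6):

```python
import numpy as np

rng = np.random.default_rng(6)
N, K = 8, 5
Phi = rng.normal(size=(N, K))
y = rng.normal(size=N)
lam = 0.3

# Last line of the derivation with W = Phi^T and V = Phi.
lhs = np.linalg.solve(lam * np.eye(K) + Phi.T @ Phi, Phi.T)
rhs = Phi.T @ np.linalg.inv(lam * np.eye(N) + Phi @ Phi.T)
assert np.allclose(lhs, rhs)

# Hence equations (5) and (6) give the same coefficients.
assert np.allclose(lhs @ y, rhs @ y)
```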

  1. Welling, M. (2013). Kernel ridge regression. Max Welling’s Classnotes in Machine Learning, 1–3.
  2. Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer.
  3. Rasmussen, C. E., & Williams, C. K. I. (2006). Gaussian Processes for Machine Learning. MIT Press, Cambridge, MA.