Comparing Kernel Ridge with Gaussian Process Regression

The posterior mean from a Gaussian process regressor is related to the prediction of a kernel ridge regressor. I explore this connection in detail.

With appropriate hyperparameters, the posterior mean of a Gaussian process (GP) regressor with observation noise matches the prediction of a kernel ridge regressor. I learned of this relationship from this StackExchange comment by Andreas Mueller. I was surprised, and the goal of this post is to better understand that comment. Following (Welling, 2013), I’ll first introduce kernel ridge regression by re-deriving it from ridge regression. We can then compare this to Gaussian process regression, which I have discussed in detail here and here. Finally, this post relies on understanding the kernel trick. Please see my previous post on the topic if needed.

Ridge regression

Suppose we have a regression problem with data \{\mathbf{x}_n, y_n\}_{n=1}^{N}. Classical linear regression is a simple linear model with additive Gaussian noise, \varepsilon_n \sim \mathcal{N}(0, \sigma^2),

y_n = \mathbf{x}_n^{\top} \boldsymbol{\beta} + \varepsilon_n. \tag{1}

Linear regression is fit by minimizing the sum of squared residuals. If \mathbf{X} is an N \times P matrix of N data points and P predictors, the normal equations for linear regression are

\hat{\boldsymbol{\beta}} = (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^{\top} \mathbf{y}. \tag{2}
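As a quick sanity check on (2), here is a minimal NumPy sketch (the data and names are my own toy choices) that solves the normal equations directly and confirms the result against np.linalg.lstsq:

```python
import numpy as np

rng = np.random.default_rng(0)
N, P = 100, 3
X = rng.normal(size=(N, P))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + 0.1 * rng.normal(size=N)

# Normal equations, equation (2): solve (X^T X) beta = X^T y.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check against NumPy's least-squares solver.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
assert np.allclose(beta_hat, beta_lstsq)
```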

See my previous post on ordinary least squares regression for details. To deal with multicollinearity (linear relationships between predictors), ridge regression introduces Tikhonov regularization, an \ell_2-norm penalty on the weights \boldsymbol{\beta}. Ridge regression is so-named because the new optimization amounts to adding values along the diagonal (or “ridge”) of the covariance matrix,

\hat{\boldsymbol{\beta}} = (\lambda \mathbf{I}_P + \mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^{\top} \mathbf{y}. \tag{3}

Here \lambda is a fixed hyperparameter. If our data matrix \mathbf{X} exhibits multicollinearity, i.e. if there are linear dependencies between its columns, then the covariance matrix \mathbf{X}^{\top} \mathbf{X} will be singular, since \text{rank}(\mathbf{A}) = \text{rank}(\mathbf{A}^{\top} \mathbf{A}) in general. See this appendix from a previous post for a proof. In other words, \mathbf{X}^{\top} \mathbf{X} is non-invertible. Thus, from the perspective of numerical linear algebra, ridge regression stabilizes the matrix inversion in (2) by adding small values to the diagonal in (3).
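To see the stabilization concretely, here is a small NumPy sketch (again, a toy example of my own) with an exactly collinear column: \mathbf{X}^{\top} \mathbf{X} is singular, so (2) cannot be solved directly, while the ridge system in (3) is well defined:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 50
x1 = rng.normal(size=N)
X = np.column_stack([x1, 2.0 * x1, rng.normal(size=N)])  # first two columns are collinear
y = X @ np.array([1.0, 0.5, -1.0]) + 0.1 * rng.normal(size=N)

XtX = X.T @ X
print(np.linalg.matrix_rank(XtX))  # 2 < 3, so X^T X is singular

# Ridge, equation (3): the "ridge" lambda * I makes the system invertible.
lam = 1e-2
beta_ridge = np.linalg.solve(lam * np.eye(X.shape[1]) + XtX, X.T @ y)
print(beta_ridge)
```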

Kernel ridge regression

Now let’s kernelize ridge regression. First, let’s replace each feature vector or sample \mathbf{x}_n (a row of \mathbf{X}) with \varphi(\mathbf{x}_n) where \varphi: \mathbb{R}^P \mapsto \mathbb{R}^K. Then we can write (1) as

y_n = [\varphi(\mathbf{x}_n)]^{\top} \boldsymbol{\beta} + \varepsilon_n \tag{4}

where \boldsymbol{\beta} is a K- rather than P-vector. Let \boldsymbol{\Phi} = [\varphi(\mathbf{x}_1), \dots, \varphi(\mathbf{x}_N)]^{\top} be an N \times K matrix. Then the matrix inverse in the kernelized analogue of (3), rather than being P \times P, is K \times K,

\hat{\boldsymbol{\beta}} = (\lambda \mathbf{I}_K + \boldsymbol{\Phi}^{\top} \boldsymbol{\Phi})^{-1} \boldsymbol{\Phi}^{\top} \mathbf{y}. \tag{5}

If K \gg P, as it is for many choices of \varphi(\cdot), then (5) is problematic. Recall that K can even be infinite. Kernel ridge regression solves this problem using the Woodbury matrix identity, rewriting (5) as

\hat{\boldsymbol{\beta}} = \boldsymbol{\Phi}^{\top} (\lambda \mathbf{I}_N + \boldsymbol{\Phi} \boldsymbol{\Phi}^{\top})^{-1} \mathbf{y}. \tag{6}

See the appendix for a complete derivation of (6). Note that we can rewrite (6) as

\hat{\boldsymbol{\beta}} = \sum_{n=1}^{N} \alpha_n \varphi(\mathbf{x}_n) \quad\text{where}\quad \boldsymbol{\alpha} = (\lambda \mathbf{I}_N + \boldsymbol{\Phi} \boldsymbol{\Phi}^{\top})^{-1} \mathbf{y}. \tag{6}
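Before unpacking what this means, here is a quick numerical check (with a toy quadratic feature map of my own choosing) that the primal form (5) and the dual form (6) produce the same coefficients:

```python
import numpy as np

def phi(X):
    """Toy feature map: original features plus their squares, so K = 2P."""
    return np.hstack([X, X ** 2])

rng = np.random.default_rng(2)
N, P = 40, 3
X = rng.normal(size=(N, P))
y = rng.normal(size=N)
lam = 0.5

Phi = phi(X)            # N x K
K_dim = Phi.shape[1]

# Primal form, equation (5): a K x K solve.
beta_primal = np.linalg.solve(lam * np.eye(K_dim) + Phi.T @ Phi, Phi.T @ y)

# Dual form, equation (6): an N x N solve, beta = Phi^T alpha.
alpha = np.linalg.solve(lam * np.eye(N) + Phi @ Phi.T, y)
beta_dual = Phi.T @ alpha

assert np.allclose(beta_primal, beta_dual)
```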

Mathematically, the rewrite in (6) is nothing deep; we’re just applying the definition of matrix multiplication. However, it has deep implications. Welling (2013) summarizes the idea nicely:

The solution \mathbf{w} [our \boldsymbol{\beta}] must lie in the span of the data-cases, even if the dimensionality of the feature space is much larger than the number of data-cases. This seems intuitively clear, since the algorithm is linear in feature space.

Thus, if we are given a new test point \mathbf{x}, the predicted y is

y \equiv [\varphi(\mathbf{x})]^{\top} \hat{\boldsymbol{\beta}} = [\varphi(\mathbf{x})]^{\top} \boldsymbol{\Phi}^{\top} (\lambda \mathbf{I}_N + \boldsymbol{\Phi} \boldsymbol{\Phi}^{\top})^{-1} \mathbf{y}. \tag{7}

We’re now ready for the kernel trick, k(\mathbf{x}, \mathbf{y}) = \langle \varphi(\mathbf{x}), \varphi(\mathbf{y}) \rangle, for some positive definite kernel function k. Let \mathbf{K} be a matrix and \mathbf{k}(\mathbf{x}) be a vector such that

\begin{aligned} \mathbf{K}_{ij} &= \varphi(\mathbf{x}_i)^{\top} \varphi(\mathbf{x}_j), \\ \mathbf{k}(\mathbf{x}) &= [k(\mathbf{x}, \mathbf{x}_1), \dots, k(\mathbf{x}, \mathbf{x}_N)]^{\top}. \end{aligned} \tag{8}

Then we can write (7) as

y = [\mathbf{k}(\mathbf{x})]^{\top} (\lambda \mathbf{I}_N + \mathbf{K})^{-1} \mathbf{y} \tag{9}

using the kernel trick. While \varphi(\cdot) might lift the features into a possibly infinite-dimensional space, we only need to evaluate k(\cdot, \cdot) on our data.
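Putting (8) and (9) together, here is a short NumPy sketch of kernel ridge prediction with an RBF kernel (the kernel choice, bandwidth, and data are my own toy illustration, not a reference implementation):

```python
import numpy as np

def rbf_kernel(A, B, length_scale=1.0):
    """k(a, b) = exp(-||a - b||^2 / (2 * length_scale^2)), evaluated pairwise."""
    sq_dists = np.sum(A ** 2, axis=1)[:, None] + np.sum(B ** 2, axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-sq_dists / (2 * length_scale ** 2))

rng = np.random.default_rng(3)
N = 50
X = rng.uniform(-3, 3, size=(N, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=N)
X_test = np.linspace(-3, 3, 100)[:, None]

lam = 0.1
K = rbf_kernel(X, X)                          # equation (8): N x N Gram matrix
alpha = np.linalg.solve(lam * np.eye(N) + K, y)
y_pred = rbf_kernel(X_test, X) @ alpha        # equation (9): k(x)^T (lam*I + K)^{-1} y
```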

As with linear regression, ridge regression can be interpreted from a probabilistic perspective (see (Bishop, 2006), section 3.1.1, or these lecture notes from Ryan P. Adams). Under that interpretation, (9) is simply the mean of the predictive distribution (see (Bishop, 2006), equation 3.53). However, our current framing is sufficient for this post.

Gaussian process regression

Now let’s see how GP regression is related to kernel ridge regression. In GP regression with noisy observations, the model is

y_n = f(\mathbf{x}_n) + \varepsilon_n, \tag{10}

where \varepsilon_n is i.i.d. Gaussian noise, \varepsilon_n \sim \mathcal{N}(0, \sigma^2), and the function f is itself a random variable,

f \sim \mathcal{N}(m(\mathbf{x}), k(\mathbf{x}, \mathbf{x})). \tag{11}

In (11), m(\cdot) is the mean function and k(\cdot, \cdot) is a positive definite kernel. If this looks unfamiliar, please see my previous post on GP regression. If we use \mathbf{f} to denote [f(\mathbf{x}_1), \dots, f(\mathbf{x}_N)]^{\top} and use the subscript * to denote testing rather than training variables, then, since \mathbf{y} = \mathbf{f} + \boldsymbol{\varepsilon}, the joint distribution of the test function values \mathbf{f}_* and the noisy training observations \mathbf{y} is

\begin{bmatrix} \mathbf{f}_* \\ \mathbf{y} \end{bmatrix} \sim \mathcal{N} \Bigg( \begin{bmatrix} m(\mathbf{X}_*) \\ m(\mathbf{X}) \end{bmatrix}, \begin{bmatrix} K(\mathbf{X}_*, \mathbf{X}_*) & K(\mathbf{X}_*, \mathbf{X}) \\ K(\mathbf{X}, \mathbf{X}_*) & \sigma^2 \mathbf{I} + K(\mathbf{X}, \mathbf{X}) \end{bmatrix} \Bigg). \tag{12}

Hopefully it is clear from context that K(\cdot, \cdot) denotes applying the kernel function to every pair of rows in the two matrices (sets of vectors). Since \mathbb{E}[y_n] = \mathbb{E}[f(\mathbf{x}_n)], we are interested in the mean (and covariance) of \mathbf{f}_{*} given \mathbf{y}, i.e. the model’s posterior. Thankfully, we can use properties of the conditional Gaussian distribution to easily compute these quantities,

\begin{aligned} \mathbb{E}[\mathbf{f}_{*}] &= K(\mathbf{X}_*, \mathbf{X}) [\sigma^2 \mathbf{I} + K(\mathbf{X}, \mathbf{X})]^{-1} \mathbf{y}, \\ \text{Cov}(\mathbf{f}_{*}) &= K(\mathbf{X}_*, \mathbf{X}_*) - K(\mathbf{X}_*, \mathbf{X}) [\sigma^2 \mathbf{I} + K(\mathbf{X}, \mathbf{X})]^{-1} K(\mathbf{X}, \mathbf{X}_*). \end{aligned} \tag{13}

The expectation in (13) is just a vectorized version of kernel ridge regression in (9) if we assume the same kernel function k(\cdot, \cdot) and if \sigma^2, the variance of the GP noise, is equal to the ridge regression regularization hyperparameter \lambda.
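To check this claim numerically, the self-contained sketch below (same toy RBF kernel and data as the kernel ridge example above; all names are mine) computes the posterior mean and covariance in (13) with a zero mean function and confirms the mean matches the kernel ridge prediction (9) when \lambda = \sigma^2:

```python
import numpy as np

def rbf_kernel(A, B, length_scale=1.0):
    sq = np.sum(A ** 2, axis=1)[:, None] + np.sum(B ** 2, axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * length_scale ** 2))

rng = np.random.default_rng(4)
N = 50
X = rng.uniform(-3, 3, size=(N, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=N)
X_test = np.linspace(-3, 3, 100)[:, None]
sigma2 = 0.1                                  # GP noise variance; also used as lambda

K = rbf_kernel(X, X)                          # K(X, X)
K_star = rbf_kernel(X_test, X)                # K(X_*, X)
K_ss = rbf_kernel(X_test, X_test)             # K(X_*, X_*)

# GP posterior, equation (13), with zero mean function.
A = sigma2 * np.eye(N) + K
gp_mean = K_star @ np.linalg.solve(A, y)
gp_cov = K_ss - K_star @ np.linalg.solve(A, K_star.T)

# Kernel ridge prediction, equation (9), with lambda = sigma2.
krr_pred = K_star @ np.linalg.solve(sigma2 * np.eye(N) + K, y)
assert np.allclose(gp_mean, krr_pred)
```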

Analysis

Kernel ridge and GP regression are quite similar, but the major difference is that a GP regressor is a generative model of the response. This has a few major consequences. While kernel ridge regression can only predict y, a GP can also quantify its uncertainty about y. Furthermore, a GP can generate posterior samples through the generative process. Finally, a GP has a marginal likelihood, which can be used to find optimal hyperparameters via optimization (see (Rasmussen & Williams, 2006), chapter 5); in kernel ridge regression, one must instead select \lambda by grid search, typically with cross-validation.
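Since the comment that motivated this post was about scikit-learn, here is a hedged sketch of how I would compare the two estimators there. I believe the parameterizations line up as follows: KernelRidge’s alpha plays the role of \lambda, GaussianProcessRegressor’s alpha is the noise variance \sigma^2, and gamma = 1 / (2 \ell^2) matches RBF(length_scale=\ell). Treat it as an illustration to verify rather than a reference:

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(5)
X = rng.uniform(-3, 3, size=(50, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=50)
X_test = np.linspace(-3, 3, 100)[:, None]

length_scale = 1.0
noise = 0.1  # plays the role of both sigma^2 and lambda

krr = KernelRidge(alpha=noise, kernel="rbf", gamma=1.0 / (2 * length_scale ** 2))
krr.fit(X, y)

# optimizer=None keeps the kernel hyperparameters fixed rather than maximizing
# the marginal likelihood, so the comparison is apples to apples.
gpr = GaussianProcessRegressor(kernel=RBF(length_scale=length_scale), alpha=noise, optimizer=None)
gpr.fit(X, y)

print(np.allclose(krr.predict(X_test), gpr.predict(X_test)))  # expected: True
```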


Appendix

A1. Kernel ridge from the Woodbury identity

If \mathbf{A} is a P \times P full-rank matrix that is rank-corrected by \mathbf{UCV}, where \mathbf{U} \in \mathbb{R}^{P \times K}, \mathbf{C} \in \mathbb{R}^{K \times K}, and \mathbf{V} \in \mathbb{R}^{K \times P}, then the Woodbury identity is

(\mathbf{A} + \mathbf{UCV})^{-1} = \mathbf{A}^{-1} - \mathbf{A}^{-1} \mathbf{U} (\mathbf{C}^{-1} + \mathbf{V} \mathbf{A}^{-1} \mathbf{U})^{-1} \mathbf{V} \mathbf{A}^{-1}.

Apply the Woodbury matrix identity with \mathbf{A} = \mathbf{C} = \mathbf{I}. Then

\begin{aligned} (\mathbf{I} + \mathbf{U}\mathbf{V})^{-1} &= \mathbf{I} - \mathbf{U}(\mathbf{I} + \mathbf{V}\mathbf{U})^{-1} \mathbf{V}, \\ (\mathbf{I} + \mathbf{U}\mathbf{V})^{-1} \mathbf{U} &= [\mathbf{I} - \mathbf{U}(\mathbf{I} + \mathbf{V}\mathbf{U})^{-1} \mathbf{V}] \mathbf{U} \\ &= \mathbf{U} - \mathbf{U}(\mathbf{I} + \mathbf{V}\mathbf{U})^{-1} \mathbf{V}\mathbf{U} \\ &= \mathbf{U}(\mathbf{I} + \mathbf{V}\mathbf{U})^{-1} [(\mathbf{I} + \mathbf{V}\mathbf{U}) - \mathbf{V}\mathbf{U}] \\ &= \mathbf{U}(\mathbf{I} + \mathbf{V}\mathbf{U})^{-1}. \end{aligned}

This is the push-through identity. Now multiply both sides by \lambda / \lambda = 1, and recall that (k \mathbf{A})^{-1} = k^{-1} \mathbf{A}^{-1} for any nonzero scalar k, which lets us move the factor of 1/\lambda inside each inverse:

\begin{aligned} (\mathbf{I} + \mathbf{U}\mathbf{V})^{-1} \mathbf{U} &= \mathbf{U}(\mathbf{I} + \mathbf{V}\mathbf{U})^{-1} \\ \frac{\lambda}{\lambda}\left[(\mathbf{I} + \mathbf{U}\mathbf{V})^{-1} \mathbf{U}\right] &= \frac{\lambda}{\lambda}\left[\mathbf{U}(\mathbf{I} + \mathbf{V}\mathbf{U})^{-1}\right] \\ (\lambda \mathbf{I} + \lambda \mathbf{U}\mathbf{V})^{-1} \lambda \mathbf{U} &= \lambda \mathbf{U}(\lambda \mathbf{I} + \mathbf{V} \lambda \mathbf{U})^{-1} \\ (\lambda \mathbf{I} + \mathbf{W} \mathbf{V})^{-1} \mathbf{W} &= \mathbf{W}(\lambda \mathbf{I} + \mathbf{V} \mathbf{W})^{-1}. \end{aligned}

In the last step, we set \mathbf{W} = \lambda \mathbf{U}. For kernel ridge regression, we apply the last line with \mathbf{W} = \boldsymbol{\Phi}^{\top} and \mathbf{V} = \boldsymbol{\Phi}, giving (\lambda \mathbf{I}_K + \boldsymbol{\Phi}^{\top} \boldsymbol{\Phi})^{-1} \boldsymbol{\Phi}^{\top} = \boldsymbol{\Phi}^{\top} (\lambda \mathbf{I}_N + \boldsymbol{\Phi} \boldsymbol{\Phi}^{\top})^{-1}, which converts (5) into (6).
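As a final sanity check (a toy sketch with arbitrary shapes of my choosing), the snippet below verifies the last line of the derivation with \mathbf{W} = \boldsymbol{\Phi}^{\top} and \mathbf{V} = \boldsymbol{\Phi}, and hence the equivalence of (5) and (6):

```python
import numpy as np

rng = np.random.default_rng(6)
N, K = 8, 5
Phi = rng.normal(size=(N, K))
y = rng.normal(size=N)
lam = 0.3

# Last line of the derivation with W = Phi^T and V = Phi.
lhs = np.linalg.solve(lam * np.eye(K) + Phi.T @ Phi, Phi.T)
rhs = Phi.T @ np.linalg.inv(lam * np.eye(N) + Phi @ Phi.T)
assert np.allclose(lhs, rhs)

# Hence equations (5) and (6) give the same coefficients.
assert np.allclose(lhs @ y, rhs @ y)
```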

  1. Welling, M. (2013). Kernel ridge regression. Max Welling’s Classnotes in Machine Learning, 1–3.
  2. Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer.
  3. Rasmussen, C. E., & Williams, C. K. I. (2006). Gaussian Processes for Machine Learning. MIT Press, Cambridge, MA.