Random Fourier Features

Rahimi and Recht's 2007 paper, "Random Features for Large-Scale Kernel Machines", introduces a framework for randomized, low-dimensional approximations of kernel functions. I discuss this paper in detail with a focus on random Fourier features.

Kernel machines

Consider a learning problem with data and targets $\{(\mathbf{x}_n, y_n)\}_{n=1}^{N}$ where $\mathbf{x}_n \in \mathcal{X}$ and $y_n \in \mathcal{Y}$. Ignoring the bias, a linear model finds a hyperplane $\mathbf{w}$ such that the decision function

$$ f^{*}(\mathbf{x}) = \mathbf{w}^{\top} \mathbf{x} \tag{1} $$

is optimal for some loss function. For example, in logistic regression, we compute the logistic function of $f(\mathbf{x})$ and then threshold the output probability to produce a binary classifier with $\mathcal{Y} = \{0, 1\}$. Obviously, linear models break down when our data are not linearly separable for classification (Figure 1, left) or do not have a linear relationship between the features and targets for regression.

In a kernel machine or a kernel method, the input domain $\mathcal{X}$ is mapped into another space $\mathcal{V}$ in which the targets may be a linear function of the data. The dimension of $\mathcal{V}$ may be high or even infinite, but kernel methods avoid operating explicitly in this space using the kernel trick: if $k: \mathcal{X} \times \mathcal{X} \mapsto \mathbb{R}$ is a positive definite kernel, then by Mercer’s theorem there exists a basis function or feature map $\varphi: \mathcal{X} \mapsto \mathcal{V}$ such that

$$ k(\mathbf{x}, \mathbf{y}) = \langle \varphi(\mathbf{x}), \varphi(\mathbf{y}) \rangle_{\mathcal{V}}. \tag{2} $$

Here, $\langle \cdot, \cdot \rangle_{\mathcal{V}}$ is an inner product in $\mathcal{V}$. If the kernel trick is new to you, please see my previous post. Using the kernel trick and a representer theorem (Kimeldorf & Wahba, 1971), kernel methods construct nonlinear models of $\mathcal{X}$ that are linear in $k(\cdot, \cdot)$,

$$ f^{*}(\mathbf{x}) = \sum_{n=1}^{N} \alpha_n k(\mathbf{x}, \mathbf{x}_n) = \langle \mathbf{w}, \varphi(\mathbf{x}) \rangle_{\mathcal{V}}. \tag{3} $$

In $(3)$, $f^{*}(\cdot)$ denotes the optimal $f(\cdot)$. Taken together, $(2)$ and $(3)$ say that provided we have a positive definite kernel function $k(\cdot, \cdot)$, we can avoid operating in the possibly infinite-dimensional space $\mathcal{V}$ and instead only compute over $N$ data points. This works because the optimal decision rule can be expressed as an expansion in terms of the training samples. See (Schölkopf et al., 2001) for a detailed treatment of this topic.

If the representer theorem is new to you, note that you’ve already seen it in action. For example, the optimal coefficients in linear regression are

$$ \hat{\mathbf{w}} = (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^{\top} \mathbf{y}, \tag{4} $$

where $\mathbf{X}$ is an $N \times D$ matrix of training data. We make predictions on held-out data $\mathbf{X}_{*}$ as $\mathbf{y}_{*} = \mathbf{X}_{*} \hat{\mathbf{w}}$. In other words, the optimal prediction for a linear model can be viewed as an expansion in terms of the training samples. For a second example, recall that the posterior mean of a Gaussian process (GP) regressor is

$$ \mathbb{E}[\mathbf{f}_{*}] = \mathbf{K}_{X_*, X} \overbrace{[\sigma^2 \mathbf{I} + \mathbf{K}_{X,X}]^{-1} \mathbf{y}}^{\boldsymbol{\beta}}, \tag{5} $$

where the notation $\mathbf{K}_{X_{*}, X}$ denotes the kernel function evaluated at all pairs of samples in $\mathbf{X}_{*}$ and $\mathbf{X}$. See my previous post on GP regression if needed. For each component of the $M$-vector $\mathbb{E}[\mathbf{f}_{*}]$ (each prediction $y_m$), the GP prediction is a linear-in-$\boldsymbol{\beta}$ model of the kernel evaluated at the test point $\mathbf{x}_m$ against all the training points.
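
To see this concretely, here is a minimal NumPy sketch of $(5)$ on toy data of my own invention, with an assumed Gaussian-kernel helper `rbf` (none of this is from the paper or my GitHub code). Each prediction is just $\boldsymbol{\beta}$ dotted with the kernel evaluated between a test point and the training data.

```python
import numpy as np

def rbf(A, B):
    # Gaussian kernel k(x, y) = exp(-0.5 ||x - y||^2), evaluated for all row pairs.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2)

rng    = np.random.default_rng(0)
X      = rng.normal(size=(50, 2))                      # training inputs
y      = np.sin(X[:, 0]) + 0.1 * rng.normal(size=50)   # noisy targets
X_new  = rng.normal(size=(5, 2))                       # test inputs
sigma2 = 0.1                                           # noise variance

beta   = np.linalg.solve(sigma2 * np.eye(50) + rbf(X, X), y)  # beta in (5)
f_mean = rbf(X_new, X) @ beta   # each entry is beta dotted with k(x_m, X)
```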

Perhaps the most famous kernel machine is the nonlinear support vector machine (SVM) (Figure 1, right). However, any algorithm that can be represented as a dot product between pairs of samples can be converted into a kernel method using $(2)$. Other methods that can be considered kernel methods are GPs, kernel PCA, and kernel regression.

Figure 1. Logistic regression (left) with decision boundary denoted with a solid line and SVM with radial basis function (RBF) kernel (right) on the Scikit-learn half-circles dataset. Support vectors are denoted with circles, and the margins are denoted with dashed lines.

While the kernel trick is a beautiful idea and the conceptual backbone of kernel machines, the problem is that for large datasets (for huge $N$), the machine must operate on a covariance matrix $\mathbf{K}_X$, induced by a particular kernel function, that is $N \times N$,

$$ \mathbf{K}_X = \begin{bmatrix} k(\mathbf{x}_1, \mathbf{x}_1) & k(\mathbf{x}_1, \mathbf{x}_2) & \dots & k(\mathbf{x}_1, \mathbf{x}_N) \\ k(\mathbf{x}_2, \mathbf{x}_1) & k(\mathbf{x}_2, \mathbf{x}_2) & \dots & k(\mathbf{x}_2, \mathbf{x}_N) \\ \vdots & \vdots & \ddots & \vdots \\ k(\mathbf{x}_N, \mathbf{x}_1) & k(\mathbf{x}_N, \mathbf{x}_2) & \dots & k(\mathbf{x}_N, \mathbf{x}_N) \end{bmatrix}. \tag{6} $$

For example, in the Gaussian process regressor in $(5)$, the right-most covariance kernel $\mathbf{K}_{X, X}$ is an $N \times N$ matrix. To compute $\boldsymbol{\beta}$, we must invert this matrix. Furthermore, to evaluate a test data point, the model must evaluate the sum in $(3)$. While we’re avoiding computing in $\mathcal{V}$, we are stuck operating in $\mathbb{R}^N$. In the era of big data, kernel methods do not necessarily scale.

Random features

In their 2007 paper, Random Features for Large-Scale Kernel Machines (Rahimi & Recht, 2007), Ali Rahimi and Ben Recht propose a different tack: approximate the above inner product in $(2)$ with a randomized map $\mathbf{z}: \mathbb{R}^{D} \mapsto \mathbb{R}^{R}$ where ideally $R \ll N$,

$$ k(\mathbf{x}, \mathbf{y}) = \langle \varphi(\mathbf{x}), \varphi(\mathbf{y}) \rangle_{\mathcal{V}} \approx \mathbf{z}(\mathbf{x})^{\top} \mathbf{z}(\mathbf{y}). \tag{7} $$

See this talk by Rahimi for details on the theoretical guarantees of this approximation. Why does this work, and why is it a good idea? The representer theorem tells us that the optimal solution is a weighted sum of the kernel evaluated at our observations. If we have a good approximation of $\varphi(\cdot)$, then

$$ \begin{aligned} f^{*}(\mathbf{x}) &= \sum_{n=1}^{N} \alpha_n k(\mathbf{x}_n, \mathbf{x}) \\ &= \sum_{n=1}^{N} \alpha_n \langle \varphi(\mathbf{x}_n), \varphi(\mathbf{x}) \rangle_{\mathcal{V}} \\ &\approx \sum_{n=1}^{N} \alpha_n \mathbf{z}(\mathbf{x}_n)^{\top} \mathbf{z}(\mathbf{x}) \\ &= \boldsymbol{\beta}^{\top} \mathbf{z}(\mathbf{x}). \end{aligned} \tag{8} $$

In other words, provided $\mathbf{z}(\cdot)$ is a good approximation of $\varphi(\cdot)$, we can simply project our data using $\mathbf{z}(\cdot)$ and then use fast linear models in $\mathbb{R}^{R}$ rather than $\mathbb{R}^N$, because both $\boldsymbol{\beta}$ and $\mathbf{z}(\cdot)$ are $R$-vectors. So the task at hand is to find a random projection $\mathbf{z}(\cdot)$ that well approximates the corresponding nonlinear kernel machine.

According to this blog post by Rahimi, this idea was inspired by the following observation. Let $\boldsymbol{\omega}$ be a random $D$-dimensional vector such that

$$ \boldsymbol{\omega} \sim \mathcal{N}_D(\mathbf{0}, \mathbf{I}). \tag{9} $$

Now define $h$ as

$$ h: \mathbf{x} \mapsto \exp(i \boldsymbol{\omega}^{\top} \mathbf{x}). \tag{10} $$

Above, $i$ is the imaginary unit. Let the superscript $*$ denote the complex conjugate. Importantly, recall that the complex conjugate of $e^{ix}$ is $e^{-ix}$. Then note

$$ \begin{aligned} \mathbb{E}_{\boldsymbol{\omega}}[h(\mathbf{x}) h(\mathbf{y})^{*}] &= \mathbb{E}_{\boldsymbol{\omega}}[\exp(i \boldsymbol{\omega}^{\top} (\mathbf{x} - \mathbf{y}))] \\ &= \int_{\mathbb{R}^D} p(\boldsymbol{\omega}) \exp(i \boldsymbol{\omega}^{\top} (\mathbf{x} - \mathbf{y})) \text{d}\boldsymbol{\omega} \\ &= \exp\!\big(-\frac{1}{2} (\mathbf{x} - \mathbf{y})^{\top} (\mathbf{x} - \mathbf{y}) \big). \end{aligned} \tag{11} $$

In other words, the expected value of $h(\mathbf{x}) h(\mathbf{y})^{*}$ is the Gaussian kernel. See A1 for a complete derivation.
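
If you want a numerical sanity check of $(11)$ before reading the derivation, here is a quick Monte Carlo experiment on a toy pair of points (my own example, not from the paper): average $h(\mathbf{x}) h(\mathbf{y})^{*}$ over many draws of $\boldsymbol{\omega}$ and compare against the Gaussian kernel.

```python
import numpy as np

rng  = np.random.default_rng(0)
D    = 3
x, y = rng.normal(size=D), rng.normal(size=D)

# Monte Carlo estimate of E[h(x) h(y)*] with omega ~ N(0, I).
omega = rng.normal(size=(1_000_000, D))          # one omega per row
mc    = np.exp(1j * omega @ (x - y)).mean()

# Exact Gaussian kernel exp(-0.5 ||x - y||^2).
exact = np.exp(-0.5 * np.sum((x - y) ** 2))

print(mc.real, exact)   # close; the imaginary part is ~0
```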

This is quite cool, and it is actually a specific instance of a more general result, Bochner’s theorem (Rudin, 1962). Quoting Rahimi and Recht’s version with small modifications for consistent notation, the theorem is:

Bochner’s theorem: A continuous kernel $k(\mathbf{x}, \mathbf{y}) = k(\mathbf{x} - \mathbf{y})$ on $\mathbb{R}^D$ is positive definite if and only if $k(\Delta)$ is the Fourier transform of a non-negative measure.

The Fourier transform of a non-negative measure, call it $p(\boldsymbol{\omega})$, is

$$ k(\boldsymbol{\Delta}) = \int p(\boldsymbol{\omega}) \exp(i \boldsymbol{\omega}^{\top} \boldsymbol{\Delta}) \text{d}\boldsymbol{\omega}. \tag{12} $$

Rahimi and Recht observe that many popular kernels such as the Gaussian (or radial basis function), Laplace, and Cauchy kernels are shift-invariant. The upshot is that $h(\mathbf{x}) h(\mathbf{y})^{*}$, with $\boldsymbol{\omega}$ drawn from $p(\boldsymbol{\omega})$, is an unbiased estimate of $k(\mathbf{x} - \mathbf{y})$. Which kernel is $k(\mathbf{x} - \mathbf{y})$? That depends on the non-negative measure $p(\boldsymbol{\omega})$.

This gives us a general framework to approximate any shift-invariant kernel by re-defining $h(\cdot)$ in $(10)$ to depend on $\boldsymbol{\omega}$ from any non-negative measure $p(\boldsymbol{\omega})$, not just the spherical Gaussian in $(9)$. Furthermore, if we sample $R$ i.i.d. realizations $\{\boldsymbol{\omega}_r\}_{r=1}^{R}$, we can lower the variance of this approximation:

$$ \begin{aligned} k(\mathbf{x}, \mathbf{y}) &= k(\mathbf{x} - \mathbf{y}) \\ &= \int p(\boldsymbol{\omega}) \exp(i \boldsymbol{\omega}^{\top}(\mathbf{x} - \mathbf{y})) \text{d}\boldsymbol{\omega} \\ &= \mathbb{E}_{\boldsymbol{\omega}}\big[\exp(i \boldsymbol{\omega}^{\top}(\mathbf{x} - \mathbf{y}))\big] \\ &\stackrel{1}{\approx} \frac{1}{R} \sum_{r=1}^{R} \exp(i \boldsymbol{\omega}_r^{\top}(\mathbf{x} - \mathbf{y})) \\ &= \begin{bmatrix} \frac{1}{\sqrt{R}} \exp(i \boldsymbol{\omega}_1^{\top} \mathbf{x}) \\ \frac{1}{\sqrt{R}} \exp(i \boldsymbol{\omega}_2^{\top} \mathbf{x}) \\ \vdots \\ \frac{1}{\sqrt{R}} \exp(i \boldsymbol{\omega}_R^{\top} \mathbf{x}) \end{bmatrix}^{\top} \begin{bmatrix} \frac{1}{\sqrt{R}} \exp(-i \boldsymbol{\omega}_1^{\top} \mathbf{y}) \\ \frac{1}{\sqrt{R}} \exp(-i \boldsymbol{\omega}_2^{\top} \mathbf{y}) \\ \vdots \\ \frac{1}{\sqrt{R}} \exp(-i \boldsymbol{\omega}_R^{\top} \mathbf{y}) \end{bmatrix} \\ &\stackrel{2}{\equiv} \mathbf{h}(\mathbf{x}) \mathbf{h}(\mathbf{y})^{*}. \end{aligned} \tag{13} $$

Step $1$ is a Monte Carlo approximation of the expectation. Step $2$ is the definition of a random map $\mathbf{h}: \mathbb{R}^{D} \mapsto \mathbb{C}^R$, i.e. an $R$-vector of normalized $h(\cdot)$ transformations.

Note that we’ve talked about the dot product $\mathbf{z}(\mathbf{x})^{\top} \mathbf{z}(\mathbf{y})$, but above we have $\mathbf{h}(\mathbf{x}) \mathbf{h}(\mathbf{y})^{*}$. As we will see in the next section, the imaginary part of our random map will disappear, and the new transform is what Rahimi and Recht call $\mathbf{z}$.

Fine tuning

Now that we understand the big idea of a low-dimensional, randomized map and why it might work, let’s get into the weeds. First, note that since both our distribution $\mathcal{N}_D(\mathbf{0}, \mathbf{I})$ and the kernel $k(\Delta)$ are real-valued, we can write

$$ \begin{aligned} \exp(i \boldsymbol{\omega}^{\top} (\mathbf{x} - \mathbf{y})) &\stackrel{\dagger}{=} \cos(\boldsymbol{\omega}^{\top} (\mathbf{x} - \mathbf{y})) + \cancel{i \sin(\boldsymbol{\omega}^{\top} (\mathbf{x} - \mathbf{y}))} \\ &= \cos(\boldsymbol{\omega}^{\top} (\mathbf{x} - \mathbf{y})). \end{aligned} \tag{14} $$

Step $\dagger$ is Euler’s formula. We can then define $z_{\boldsymbol{\omega}}(\mathbf{x})$ (note that this is still not yet the bolded $\mathbf{z}$) without the imaginary unit as

$$ \begin{aligned} \boldsymbol{\omega} &\sim p(\boldsymbol{\omega}) \\ b &\sim \text{Uniform}(0, 2\pi) \\ z_{\boldsymbol{\omega}}(\mathbf{x}) &= \sqrt{2} \cos(\boldsymbol{\omega}^{\top}\mathbf{x} + b). \end{aligned} \tag{15} $$

This works because

$$ \begin{aligned} \mathbb{E}_{\boldsymbol{\omega}}[z_{\boldsymbol{\omega}}(\mathbf{x}) z_{\boldsymbol{\omega}}(\mathbf{y})] &= \mathbb{E}_{\boldsymbol{\omega}}[\sqrt{2} \cos(\boldsymbol{\omega}^{\top}\mathbf{x} + b) \sqrt{2} \cos(\boldsymbol{\omega}^{\top}\mathbf{y} + b)] \\ &\stackrel{\star}{=} \mathbb{E}_{\boldsymbol{\omega}}[\cos(\boldsymbol{\omega}^{\top}(\mathbf{x} + \mathbf{y}) + 2b)] + \mathbb{E}_{\boldsymbol{\omega}}[\cos(\boldsymbol{\omega}^{\top}(\mathbf{x} - \mathbf{y}))] \\ &\stackrel{\dagger}{=} \mathbb{E}_{\boldsymbol{\omega}}[\cos(\boldsymbol{\omega}^{\top}(\mathbf{x} - \mathbf{y}))]. \end{aligned} \tag{16} $$

Step $\star$ is just trigonometry; see A2 for a derivation. Step $\dagger$ uses the fact that since $b \sim \text{Uniform}(0, 2\pi)$, the expectation with respect to $b$ is zero:

$$ \mathbb{E}_{\boldsymbol{\omega}}[\cos(\boldsymbol{\omega}^{\top}(\mathbf{x} + \mathbf{y}) + 2b)] = \mathbb{E}_{\boldsymbol{\omega}}[\mathbb{E}_b[\cos(\boldsymbol{\omega}^{\top}(\mathbf{x} + \mathbf{y}) + 2b) \mid \boldsymbol{\omega}]] = 0. \tag{17} $$

If you are unconvinced, see A3. We are now ready to define the random map $\mathbf{z}: \mathbb{R}^D \mapsto \mathbb{R}^R$ such that $(7)$ holds. Let

$$ \mathbf{z}(\mathbf{x}) = \begin{bmatrix} \frac{1}{\sqrt{R}} z_{\boldsymbol{\omega}_1}(\mathbf{x}) \\ \frac{1}{\sqrt{R}} z_{\boldsymbol{\omega}_2}(\mathbf{x}) \\ \vdots \\ \frac{1}{\sqrt{R}} z_{\boldsymbol{\omega}_R}(\mathbf{x}) \end{bmatrix} \tag{18} $$

and therefore

$$ \begin{aligned} \mathbf{z}(\mathbf{x})^{\top} \mathbf{z}(\mathbf{y}) &= \frac{1}{R} \sum_{r=1}^{R} z_{\boldsymbol{\omega}_r}(\mathbf{x}) z_{\boldsymbol{\omega}_r}(\mathbf{y}) \\ &= \frac{1}{R} \sum_{r=1}^{R} 2 \cos(\boldsymbol{\omega}_r^{\top} \mathbf{x} + b_r) \cos(\boldsymbol{\omega}_r^{\top} \mathbf{y} + b_r) \\ &= \frac{1}{R} \sum_{r=1}^{R} \big[ \cos(\boldsymbol{\omega}_r^{\top} (\mathbf{x} - \mathbf{y})) + \cos(\boldsymbol{\omega}_r^{\top} (\mathbf{x} + \mathbf{y}) + 2 b_r) \big] \\ &\approx \mathbb{E}_{\boldsymbol{\omega}}[\cos(\boldsymbol{\omega}^{\top} (\mathbf{x} - \mathbf{y}))] \\ &= k(\mathbf{x}, \mathbf{y}). \end{aligned} \tag{19} $$

The second cosine term inside the sum has zero expectation by $(17)$, and the Monte Carlo average of the first converges to the desired expectation. We now have a simple algorithm to estimate a shift-invariant, positive definite kernel: draw $R$ samples $\boldsymbol{\omega}_r \sim p(\boldsymbol{\omega})$ and $b_r \sim \text{Uniform}(0, 2\pi)$, and then compute $\mathbf{z}(\mathbf{x})^{\top} \mathbf{z}(\mathbf{y})$.
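
Here is a minimal NumPy sketch of that recipe for the Gaussian kernel. The function names (`sample_rff`, `z`) are mine, and the exact kernel value is computed only as a sanity check.

```python
import numpy as np

def sample_rff(D, R, rng):
    # Frequencies from p(omega) = N(0, I) (Gaussian kernel) and uniform phases.
    omega = rng.normal(size=(R, D))
    b     = rng.uniform(0, 2 * np.pi, size=R)
    return omega, b

def z(x, omega, b):
    # (15) and (18) combined: the sqrt(2) and 1/sqrt(R) factors fold together.
    R = omega.shape[0]
    return np.sqrt(2.0 / R) * np.cos(omega @ x + b)

rng      = np.random.default_rng(0)
x, y     = rng.normal(size=3), rng.normal(size=3)
omega, b = sample_rff(D=3, R=10_000, rng=rng)

approx = z(x, omega, b) @ z(y, omega, b)        # z(x)^T z(y), as in (19)
exact  = np.exp(-0.5 * np.sum((x - y) ** 2))    # Gaussian kernel for reference
```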

An alternative version of random Fourier features that you might see is

$$ z_{\boldsymbol{\omega}_r}(\mathbf{x}) = \begin{bmatrix} \cos(\boldsymbol{\omega}_r^{\top} \mathbf{x}) \\ \sin(\boldsymbol{\omega}_r^{\top} \mathbf{x}) \end{bmatrix}. \tag{20} $$

See A4 for why this works, and (Sutherland & Schneider, 2015) for a comparative analysis of the two approaches.
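
As a rough sketch of this variant, again with my own helper name and toy numbers, we stack cosine and sine features for $R^{\prime}$ frequencies so that no random phase $b$ is needed:

```python
import numpy as np

def z_alt(x, omega):
    # (20) for all R' frequencies at once, scaled so the dot product averages.
    R_prime = omega.shape[0]
    return np.concatenate([np.cos(omega @ x), np.sin(omega @ x)]) / np.sqrt(R_prime)

rng   = np.random.default_rng(0)
x, y  = rng.normal(size=3), rng.normal(size=3)
omega = rng.normal(size=(5_000, 3))             # R' = 5,000 frequencies

approx = z_alt(x, omega) @ z_alt(y, omega)      # approximates exp(-0.5 ||x - y||^2)
```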

Examples

Gaussian kernel approximation

Before fitting a more complex model, let’s first approximate a Gaussian kernel using random Fourier features. Sample $R$ i.i.d. pairs $(\boldsymbol{\omega}_r, b_r)$, with $\boldsymbol{\omega}_r$ drawn from a spherical Gaussian and $b_r \sim \text{Uniform}(0, 2\pi)$, and then compute

$$ \mathbf{z}(\mathbf{x})^{\top} \mathbf{z}(\mathbf{y}) = \frac{1}{R} \sum_{r=1}^{R} z_{\boldsymbol{\omega}_r}(\mathbf{x}) z_{\boldsymbol{\omega}_r}(\mathbf{y}) \approx \frac{1}{R} \sum_{r=1}^{R} \cos(\boldsymbol{\omega}_r^{\top}(\mathbf{x} - \mathbf{y})) \tag{21} $$

for each $(\mathbf{x}, \mathbf{y})$ pair in the data. The resultant $N \times N$ object is the approximate covariance matrix induced by the Gaussian kernel function. Concretely, let $\mathbf{Z}_X$ denote $\mathbf{z}(\cdot)$ applied to all $N$ samples $\mathbf{x}_n$. Thus, $\mathbf{Z}_X$ is $N \times R$, and therefore

$$ \mathbf{K}_X \approx \begin{bmatrix} \mathbf{z}(\mathbf{x}_1)^{\top} \\ \vdots \\ \mathbf{z}(\mathbf{x}_N)^{\top} \end{bmatrix} \begin{bmatrix} \mathbf{z}(\mathbf{x}_1) & \dots & \mathbf{z}(\mathbf{x}_N) \end{bmatrix} = \mathbf{Z}_X \mathbf{Z}_X^{\top} \tag{22} $$

because

$$ \begin{bmatrix} k(\mathbf{x}_1, \mathbf{x}_1) & \dots & k(\mathbf{x}_1, \mathbf{x}_N) \\ \vdots & \ddots & \vdots \\ k(\mathbf{x}_N, \mathbf{x}_1) & \dots & k(\mathbf{x}_N, \mathbf{x}_N) \end{bmatrix} \approx \begin{bmatrix} \mathbf{z}(\mathbf{x}_1)^{\top} \mathbf{z}(\mathbf{x}_1) & \dots & \mathbf{z}(\mathbf{x}_1)^{\top} \mathbf{z}(\mathbf{x}_N) \\ \vdots & \ddots & \vdots \\ \mathbf{z}(\mathbf{x}_N)^{\top} \mathbf{z}(\mathbf{x}_1) & \dots & \mathbf{z}(\mathbf{x}_N)^{\top} \mathbf{z}(\mathbf{x}_N) \end{bmatrix}. \tag{23} $$

We see in Figure 2 that as $R$ increases, the covariance matrix approximation improves, because each cell uses more Monte Carlo samples to approximate the inner product $\langle \varphi(\cdot), \varphi(\cdot) \rangle_{\mathcal{V}}$ associated with $k(\cdot, \cdot)$ for that pair of samples. See my GitHub for the code to generate this figure.

Figure 2. A comparison of a Gaussian kernel on the Scikit-learn S-curve with random Fourier features (left) and an approximated covariance matrix with an increasing number of Monte Carlo samples (right four frames).
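
The figure above comes from the code on my GitHub, but a stripped-down version of the comparison looks roughly like this, with random data standing in for the S-curve:

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, R = 200, 3, 1_000
X = rng.normal(size=(N, D))                     # stand-in for the S-curve data

# Random features for the Gaussian kernel: Z_X is N x R.
omega = rng.normal(size=(R, D))
b     = rng.uniform(0, 2 * np.pi, size=R)
Z_X   = np.sqrt(2.0 / R) * np.cos(X @ omega.T + b)

K_approx = Z_X @ Z_X.T                          # (22)

# Exact Gaussian kernel covariance matrix for comparison.
d2      = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K_exact = np.exp(-0.5 * d2)

print(np.abs(K_approx - K_exact).max())         # shrinks as R grows
```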

Kernel ridge regression

As a more complex example, and to see concretely why random Fourier features are efficient, let’s look at kernel ridge regression. If needed, see (Welling, 2013) for an introduction to the model. Also, the Scikit-learn docs have a nice tutorial comparing kernel ridge and Gaussian process regression. $(8)$ tells us that $f^{*}(\mathbf{x})$ is linear in $\mathbf{z}(\mathbf{x})$. Therefore, we just need to convert our input $\mathbf{x}$ into random features and apply linear methods. Concretely, we want to solve for the coefficients $\boldsymbol{\beta}$ in

$$ \hat{\boldsymbol{\beta}} = (\underbrace{\mathbf{Z}_X^{\top} \mathbf{Z}_X + \lambda \mathbf{I}_R}_{\mathbf{A}})^{-1} \mathbf{Z}_X^{\top} \mathbf{y}. \tag{24} $$

Above, $\lambda$ is the ridge regression regularization parameter. See Figure 3 for the results of comparing Gaussian kernel ridge regression with random Fourier feature regression.

Figure 3. Comparison of Gaussian kernel ridge regression (top left) with ridge regression using random Fourier features (RFF regression) on $N = 100$ data points with $R \in \{1, 5, 100\}$.

With $(24)$ in mind, it is clear why random Fourier features are efficient: inverting $\mathbf{A}$ has time complexity $\mathcal{O}(R^3)$ rather than $\mathcal{O}(N^3)$. If $R \ll N$, the savings can be big. What is not shown is that even on this small dataset, random Fourier feature regression is over an order of magnitude faster than kernel regression with a Gaussian kernel. Since kernel machines scale poorly in $N$, it is easy to make this multiplier larger by increasing $N$ while keeping $R$ fixed.

Again, see my GitHub for a complete implementation. Note that we need to cache the random variables $\boldsymbol{\omega}_1, \boldsymbol{\omega}_2, \dots, \boldsymbol{\omega}_R$ and $b_1, b_2, \dots, b_R$ used to generate $\mathbf{Z}_X$, so that we apply the same transformation when we predict on test data.
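
To give a sense of the structure, here is a hedged sketch of an RFF ridge regressor (not the implementation on my GitHub), written as a small class that samples and caches $\boldsymbol{\omega}$ and $b$ at fit time:

```python
import numpy as np

class RFFRidge:
    def __init__(self, R=100, lam=1.0, seed=0):
        self.R, self.lam = R, lam
        self.rng = np.random.default_rng(seed)

    def _z(self, X):
        # Random features (15) and (18) for every row of X; shape (N, R).
        return np.sqrt(2.0 / self.R) * np.cos(X @ self.omega.T + self.b)

    def fit(self, X, y):
        D = X.shape[1]
        # Cache omega and b so train and test data get the same transformation.
        self.omega = self.rng.normal(size=(self.R, D))
        self.b     = self.rng.uniform(0, 2 * np.pi, size=self.R)
        Z = self._z(X)
        A = Z.T @ Z + self.lam * np.eye(self.R)   # R x R, cheap to invert
        self.beta = np.linalg.solve(A, Z.T @ y)   # (24)
        return self

    def predict(self, X):
        return self._z(X) @ self.beta

# Toy usage on 1-D data.
rng   = np.random.default_rng(1)
X     = rng.uniform(-3, 3, size=(100, 1))
y     = np.sin(X[:, 0]) + 0.1 * rng.normal(size=100)
y_hat = RFFRidge(R=100, lam=1e-3).fit(X, y).predict(X)
```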

Conclusion

Random features for kernel methods is a beautiful, simple, and practical idea; and I see why Rahimi and Recht’s paper won the Test of Time Award at NeurIPS 2017. Many theoretical results are impractical or complicated to implement. Many empirical results are not well-understood or brittle. Random features is neither: relatively speaking, it is a simple idea using established mathematics, yet it comes equipped with good theoretical guarantees and good results with practical, easy-to-implement models.

   

Acknowledgements

I thank Era Choshen for pointing out an error in appendix A3.

   

Appendix

A1. Gaussian kernel derivation

Let $\boldsymbol{\delta} = \mathbf{x} - \mathbf{y}$:

$$ \begin{aligned} &\mathbb{E}_{\boldsymbol{\omega}}[h(\mathbf{x}) h(\mathbf{y})^{*}] \\ &= \mathbb{E}_{\boldsymbol{\omega}}[\exp(i \boldsymbol{\omega}^{\top} \mathbf{x}) \exp(-i \boldsymbol{\omega}^{\top} \mathbf{y})] \\ &= \mathbb{E}_{\boldsymbol{\omega}}[\exp(i \boldsymbol{\omega}^{\top} \boldsymbol{\delta})] \\ &= \int_{\mathbb{R}^D} p(\boldsymbol{\omega}) \exp(i \boldsymbol{\omega}^{\top} \boldsymbol{\delta}) \text{d}\boldsymbol{\omega} \\ &= (2 \pi)^{-D/2} \int_{\mathbb{R}^D} \exp\!\big(-\frac{1}{2} \boldsymbol{\omega}^{\top} \boldsymbol{\omega} \big) \exp(i \boldsymbol{\omega}^{\top} \boldsymbol{\delta}) \text{d}\boldsymbol{\omega} \\ &= (2 \pi)^{-D/2} \int_{\mathbb{R}^D} \exp\!\big(-\frac{1}{2} \boldsymbol{\omega}^{\top} \boldsymbol{\omega} + i \boldsymbol{\omega}^{\top} \boldsymbol{\delta} \big) \text{d}\boldsymbol{\omega} \\ &= (2 \pi)^{-D/2} \int_{\mathbb{R}^D} \exp\!\big(-\frac{1}{2} \big( \boldsymbol{\omega}^{\top} \boldsymbol{\omega} - 2i \boldsymbol{\omega}^{\top} \boldsymbol{\delta} - \boldsymbol{\delta}^{\top} \boldsymbol{\delta}\big) - \frac{1}{2} \boldsymbol{\delta}^{\top} \boldsymbol{\delta} \big) \text{d}\boldsymbol{\omega} \\ &= (2 \pi)^{-D/2} \exp\!\big(-\frac{1}{2} \boldsymbol{\delta}^{\top} \boldsymbol{\delta} \big) \underbrace{\int_{\mathbb{R}^D} \exp\!\big(-\frac{1}{2} \big( \boldsymbol{\omega} - i \boldsymbol{\delta}\big)^{\top} \big( \boldsymbol{\omega} - i \boldsymbol{\delta}\big) \big) \text{d}\boldsymbol{\omega}}_{(2\pi)^{D/2}} \\ &= \exp\!\big(-\frac{1}{2} \boldsymbol{\delta}^{\top} \boldsymbol{\delta} \big) \\ &= k(\boldsymbol{\delta}). \end{aligned} \tag{A1.1} $$

In other words, $k(\cdot)$ is the Gaussian kernel when $p(\boldsymbol{\omega})$ is a spherical Gaussian.

A2. Trigonometric identity

Recall from trigonometry that

$$ \cos(x + y) = \cos(x) \cos(y) - \sin(x) \sin(y). \tag{A2.1} $$

See this Khan Academy video for a proof. Furthermore, note that $\cos(-x) = \cos(x)$ since the cosine function is symmetric about $x = 0$. This is not true of the sine function. Instead, it has odd symmetry: $\sin(-x) = -\sin(x)$. Thus, with a little clever manipulation, we can write

$$ \begin{aligned} &\cos(x + y) + \cos(x - y) \\ &= \cos(x + y) + \cos(x + (-y)) \\ &= [\cos(x) \cos(y) - \sin(x) \sin(y)] + [\cos(x) \cos(-y) - \sin(x) \sin(-y)] \\ &= [\cos(x) \cos(y) - \cancel{\sin(x) \sin(y)}] + [\cos(x) \cos(y) + \cancel{\sin(x) \sin(y)}] \\ &= 2 \cos(x) \cos(y). \end{aligned} \tag{A2.2} $$

A3. Expectation of $\cos(t + 2b)$ is zero

Note that

$$ \mathbb{E}_{\boldsymbol{\omega}}[\cos(\boldsymbol{\omega}^{\top}(\mathbf{x} + \mathbf{y}) + 2b)] = \mathbb{E}_{\boldsymbol{\omega}}[\mathbb{E}_b[\cos(\boldsymbol{\omega}^{\top}(\mathbf{x} + \mathbf{y}) + 2b) \mid \boldsymbol{\omega}]] \tag{A3.1} $$

holds by the law of total expectation. We claim the inner conditional expectation is zero. To ease notation, let $t = \boldsymbol{\omega}^{\top}(\mathbf{x} + \mathbf{y})$. Then

$$ \begin{aligned} \mathbb{E}_{b}[\cos(t + 2b) \mid \boldsymbol{\omega}] &= \int_{0}^{2\pi} \frac{\cos(t + 2b)}{2\pi} \text{d}b \\ &= \frac{1}{2\pi} \int_{0}^{2\pi} \cos(t + 2b) \text{d}b \\ &= \frac{1}{2\pi} \Big[ \frac{1}{2} \sin(t + 2b) \Big]_{0}^{2\pi} \\ &= \frac{1}{4\pi} \big[ \sin(t + 4\pi) - \sin(t) \big] \\ &= 0. \end{aligned} \tag{A3.2} $$

The last step holds because $\sin(t) = \sin(t \pm 2\pi k)$ for any integer $k$.
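
If you would rather check this numerically than trust the calculus, a short Monte Carlo experiment with an arbitrary value of $t$ (my own choice) does the job:

```python
import numpy as np

rng = np.random.default_rng(0)
t = 1.234                                        # any fixed t works
b = rng.uniform(0, 2 * np.pi, size=1_000_000)    # b ~ Uniform(0, 2*pi)
print(np.cos(t + 2 * b).mean())                  # approximately 0
```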

A4. Alternative random Fourier features

Consider this alternative definition of the random map:

$$ z_{\boldsymbol{\omega}_r}(\mathbf{x}) = \begin{bmatrix} \cos(\boldsymbol{\omega}_r^{\top} \mathbf{x}) \\ \sin(\boldsymbol{\omega}_r^{\top} \mathbf{x}) \end{bmatrix}. \tag{A4.1} $$

Draw $R^{\prime} = R / 2$ samples

$$ \boldsymbol{\omega}_r \sim p(\boldsymbol{\omega}). \tag{A4.2} $$

Then

$$ \begin{aligned} \frac{1}{R^{\prime}} \sum_{r=1}^{R^{\prime}} z_{\boldsymbol{\omega}_r}(\mathbf{x})^{\top} z_{\boldsymbol{\omega}_r}(\mathbf{y}) &\equiv \frac{2}{R} \sum_{r=1}^{R/2} \Bigg( \begin{bmatrix} \cos(\boldsymbol{\omega}_r^{\top} \mathbf{x}) \\ \sin(\boldsymbol{\omega}_r^{\top} \mathbf{x}) \end{bmatrix}^{\top} \begin{bmatrix} \cos(\boldsymbol{\omega}_r^{\top} \mathbf{y}) \\ \sin(\boldsymbol{\omega}_r^{\top} \mathbf{y}) \end{bmatrix} \Bigg) \\ &= \frac{2}{R} \sum_{r=1}^{R/2} \cos(\boldsymbol{\omega}_r^{\top} \mathbf{x}) \cos(\boldsymbol{\omega}_r^{\top} \mathbf{y}) + \sin(\boldsymbol{\omega}_r^{\top} \mathbf{x}) \sin(\boldsymbol{\omega}_r^{\top} \mathbf{y}) \\ &\stackrel{\star}{=} \frac{2}{R} \sum_{r=1}^{R/2} \cos(\boldsymbol{\omega}_r^{\top} \mathbf{x} - \boldsymbol{\omega}_r^{\top} \mathbf{y}) \\ &\approx \mathbb{E}_{\boldsymbol{\omega}}[\cos(\boldsymbol{\omega}^{\top} (\mathbf{x} - \mathbf{y}))] \\ &= k(\mathbf{x}, \mathbf{y}). \end{aligned} \tag{A4.3} $$

Step $\star$ just applies the product identities from trigonometry:

$$ \begin{aligned} 2 \sin(x) \sin(y) &= \cos(x - y) - \cancel{\cos(x + y)} \\ 2 \cos(x) \cos(y) &= \cos(x - y) + \cancel{\cos(x + y)}. \end{aligned} \tag{A4.4} $$

When the two identities are added, the right-most terms cancel and we get $\cos(x) \cos(y) + \sin(x) \sin(y) = \cos(x - y)$, which is the identity used in $(\text{A}4.3)$.

  1. Kimeldorf, G., & Wahba, G. (1971). Some results on Tchebycheffian spline functions. Journal of Mathematical Analysis and Applications, 33(1), 82–95.
  2. Schölkopf, B., Herbrich, R., & Smola, A. J. (2001). A generalized representer theorem. International Conference on Computational Learning Theory, 416–426.
  3. Rahimi, A., & Recht, B. (2007). Random features for large-scale kernel machines. Advances in Neural Information Processing Systems, 1177–1184.
  4. Rudin, W. (1962). Fourier analysis on groups (Vol. 121967). Wiley Online Library.
  5. Sutherland, D. J., & Schneider, J. (2015). On the error of random Fourier features. arXiv preprint arXiv:1506.02785.
  6. Welling, M. (2013). Kernel ridge regression. Max Welling’s Classnotes in Machine Learning, 1–3.