From Probabilistic PCA to the GPLVM

A Gaussian process latent variable model (GPLVM) can be viewed as a generalization of probabilistic principal component analysis (PCA) in which the maps from latent to observed space are Gaussian-process distributed. I discuss this relationship.

Probabilistic PCA

In probabilistic principal component analysis (PCA), we assume our $D$-dimensional observations $\mathbf{Y} = [\mathbf{y}_1 \dots \mathbf{y}_N]^{\top}$ are a linear function of $K$-dimensional latent variables $\mathbf{X} = [\mathbf{x}_1 \dots \mathbf{x}_N]^{\top}$,

$$\mathbf{y}_n = \mathbf{W} \mathbf{x}_n + \varepsilon_n, \tag{1}$$

where $K \ll D$ and $\varepsilon_n \sim \mathcal{N}_D(\mathbf{0}, \sigma^2 \mathbf{I})$ is isotropic Gaussian noise. The linear map $\mathbf{W}$ is a $D \times K$ matrix relating the two sets of variables. We further assume

$$\mathbf{x}_n \sim \mathcal{N}_K(\mathbf{0}, \mathbf{I}), \tag{2}$$

where $\mathcal{N}_K(\cdot, \cdot)$ denotes a $K$-variate Gaussian. In words, we assume no structure on $\mathbf{x}_n$. This model is similar to factor analysis but with isotropic rather than non-isotropic Gaussian noise. In the language of factor analysis, the $\mathbf{x}_n$ are latent variables, the columns of $\mathbf{X}$ are factors (latent features), and the entries of $\mathbf{W}$ are loadings.
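To make these generative assumptions concrete, here is a minimal NumPy sketch of Eqs. 1 and 2; the sizes $N$, $D$, $K$ and the noise level are arbitrary choices for illustration, not values from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, K = 100, 10, 2  # number of observations, observed dim, latent dim (arbitrary)
sigma = 0.1           # noise standard deviation (arbitrary)

X = rng.standard_normal((N, K))          # latent variables, x_n ~ N(0, I)   (Eq. 2)
W = rng.standard_normal((D, K))          # linear map (drawn randomly here)
E = sigma * rng.standard_normal((N, D))  # isotropic Gaussian noise
Y = X @ W.T + E                          # observations, y_n = W x_n + eps_n (Eq. 1)
```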

If we assume the observations are i.i.d., then the marginal distribution of $\mathbf{y}_n$ is also Gaussian,

$$p(\mathbf{y}_n \mid \mathbf{W}) = \mathcal{N}_D(\mathbf{0}, \underbrace{\mathbf{W}\mathbf{W}^{\top} + \sigma^2 \mathbf{I}}_{\mathbf{C}}). \tag{3}$$

The proof of this claim relies on the fact that the joint distribution

$$p(\mathbf{y}_n, \mathbf{x}_n \mid \mathbf{W}) = p(\mathbf{y}_n \mid \mathbf{x}_n, \mathbf{W}) \, p(\mathbf{x}_n \mid \mathbf{W}) \tag{4}$$

is Gaussian, so the marginal $p(\mathbf{y}_n \mid \mathbf{W})$ is also Gaussian. I really should write a blog post proving this type of result, but see Sec. 2.3.3 in (Bishop, 2006) for a discussion. In effect, we have integrated out the latent variables $\mathbf{X}$. The log likelihood $\mathcal{L}(\mathbf{W})$ of Eq. 3 is:

$$\begin{aligned} \mathcal{L}(\mathbf{W}) &= \log p(\mathbf{Y} \mid \mathbf{W}) \\ &= \sum_{n=1}^{N} \log p(\mathbf{y}_n \mid \mathbf{W}) \\ &= \sum_{n=1}^{N} \Big[ \log \Big( \frac{1}{\sqrt{(2\pi)^D |\mathbf{C}|}} \Big) - \frac{1}{2} \mathbf{y}_n^{\top} \mathbf{C}^{-1} \mathbf{y}_n \Big] \\ &= \sum_{n=1}^{N} \Big[ -\frac{D}{2} \log 2\pi - \frac{1}{2} \log |\mathbf{C}| - \frac{1}{2} \mathbf{y}_n^{\top} \mathbf{C}^{-1} \mathbf{y}_n \Big] \\ &= -\frac{ND}{2} \log 2\pi - \frac{N}{2} \log |\mathbf{C}| - \frac{1}{2} \sum_{n=1}^{N} \mathbf{y}_n^{\top} \mathbf{C}^{-1} \mathbf{y}_n. \end{aligned} \tag{5}$$
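As a sanity check, Eq. 5 can be evaluated directly. A small sketch, reusing `Y`, `W`, and `sigma` from the snippet above:

```python
import numpy as np
from scipy.stats import multivariate_normal

def ppca_log_likelihood(Y, W, sigma):
    """Marginal log likelihood of Eq. 5: sum_n log N(y_n | 0, C) with C = W W^T + sigma^2 I."""
    D = Y.shape[1]
    C = W @ W.T + sigma**2 * np.eye(D)
    return multivariate_normal(mean=np.zeros(D), cov=C).logpdf(Y).sum()

print(ppca_log_likelihood(Y, W, sigma))
```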

Often, the term $\sum_{n=1}^{N} \mathbf{y}_n^{\top} \mathbf{C}^{-1} \mathbf{y}_n$ is written as,

$$\begin{aligned} \sum_{n=1}^{N} \mathbf{y}_n^{\top} \mathbf{C}^{-1} \mathbf{y}_n &= \sum_{n=1}^{N} \text{tr}\big( \mathbf{C}^{-1} \mathbf{y}_n \mathbf{y}_n^{\top} \big) \\ &= \text{tr}\big( \mathbf{C}^{-1} \sum_{n=1}^{N} \mathbf{y}_n \mathbf{y}_n^{\top} \big) \\ &= \text{tr}(\mathbf{C}^{-1} \mathbf{Y}^{\top} \mathbf{Y}). \end{aligned} \tag{6}$$

This just uses a trace trick,

$$\mathbf{a}^{\top} \mathbf{M}^{-1} \mathbf{a} = \text{tr}(\mathbf{M}^{-1} \mathbf{a} \mathbf{a}^{\top}), \tag{7}$$

and the linearity of the trace operation. To fit probabilistic PCA, we optimize the linear map $\mathbf{W}$ with respect to this marginal log likelihood. (Tipping & Bishop, 1999) proved that the maximum likelihood estimate of $\mathbf{W}$ is equivalent, up to scale and rotation, to the eigenvalue-based solution of standard PCA.
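The trace trick in Eqs. 6 and 7 is easy to verify numerically. Continuing the sketch above:

```python
C = W @ W.T + sigma**2 * np.eye(D)
C_inv = np.linalg.inv(C)

quad_sum = sum(y @ C_inv @ y for y in Y)  # sum_n y_n^T C^{-1} y_n
trace_form = np.trace(C_inv @ Y.T @ Y)    # tr(C^{-1} Y^T Y)
assert np.isclose(quad_sum, trace_form)
```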

GPLVM

This marginalize-then-optimize process is reversed for a Gaussian process latent variable model (GPLVM) (Lawrence, 2004). Rather than integrating out the latent variable $\mathbf{X}$ and then optimizing $\mathbf{W}$, we integrate out $\mathbf{W}$ and optimize $\mathbf{X}$. This requires that we rewrite Eq. 1 as

$$\mathbf{y}_d = \mathbf{X} \mathbf{w}_d + \varepsilon_d. \tag{8}$$

Recall that $\mathbf{W}$ is a $D \times K$ matrix, so here $\mathbf{y}_d$ denotes the $d$-th column of $\mathbf{Y}$ (an $N$-vector) and $\mathbf{w}_d$ denotes the $d$-th row of $\mathbf{W}$ (a $K$-vector). If we assume that the rows of $\mathbf{W}$ are i.i.d., meaning

$$p(\mathbf{W}) = \prod_{d=1}^{D} \mathcal{N}_K(\mathbf{w}_d \mid \mathbf{0}, \alpha^{-1} \mathbf{I}), \tag{9}$$

then we can copy the logic from the previous section exactly, just switching a couple of variables. Now the marginal distribution of $\mathbf{y}_d$ is

$$p(\mathbf{y}_d \mid \mathbf{X}) = \mathcal{N}_N(\mathbf{0}, \underbrace{\alpha^{-1} \mathbf{X}\mathbf{X}^{\top} + \sigma^2 \mathbf{I}}_{\mathbf{K}}). \tag{10}$$

And the log likelihood is

$$\begin{aligned} \mathcal{L}(\mathbf{X}) &= \log p(\mathbf{Y} \mid \mathbf{X}) \\ &= \sum_{d=1}^{D} \log p(\mathbf{y}_d \mid \mathbf{X}) \\ &= \sum_{d=1}^{D} \Big[ \log \Big( \frac{1}{\sqrt{(2\pi)^N |\mathbf{K}|}} \Big) - \frac{1}{2} \mathbf{y}_d^{\top} \mathbf{K}^{-1} \mathbf{y}_d \Big] \\ &= \sum_{d=1}^{D} \Big[ -\frac{N}{2} \log 2\pi - \frac{1}{2} \log |\mathbf{K}| - \frac{1}{2} \mathbf{y}_d^{\top} \mathbf{K}^{-1} \mathbf{y}_d \Big] \\ &= -\frac{DN}{2} \log 2\pi - \frac{D}{2} \log |\mathbf{K}| - \frac{1}{2} \sum_{d=1}^{D} \mathbf{y}_d^{\top} \mathbf{K}^{-1} \mathbf{y}_d. \end{aligned} \tag{11}$$

If we rewrite the final sum in Eq. 11 using the trace trick in Eq. 7, we get the same equation as Lawrence's Eq. 2:

$$\mathcal{L}(\mathbf{X}) = -\frac{DN}{2} \log 2\pi - \frac{D}{2} \log |\mathbf{K}| - \frac{1}{2} \text{tr}(\mathbf{K}^{-1} \mathbf{Y}\mathbf{Y}^{\top}). \tag{12}$$

That’s it. In a GPLVM, we optimize Eq. 12 with respect to $\mathbf{X}$ rather than optimizing Eq. 5 with respect to $\mathbf{W}$ as in probabilistic PCA. One can prove that the solution obtained by optimizing Eq. 12 is equivalent to the PCA solution (Tipping, 2001).
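Here is a rough sketch of the marginalize-$\mathbf{W}$, optimize-$\mathbf{X}$ procedure: it minimizes the negative of Eq. 12 with a generic optimizer and finite-difference gradients, reusing `Y` from the earlier sketches. In practice one would use analytic or automatic gradients; `alpha`, `sigma`, and the initialization here are arbitrary assumptions.

```python
import numpy as np
from scipy.optimize import minimize

def gplvm_nll(x_flat, Y, latent_dim, alpha=1.0, sigma=0.1):
    """Negative of Eq. 12 with the linear-kernel covariance K = alpha^{-1} X X^T + sigma^2 I."""
    N, D = Y.shape
    X = x_flat.reshape(N, latent_dim)
    K = X @ X.T / alpha + sigma**2 * np.eye(N)
    _, logdet = np.linalg.slogdet(K)
    quad = np.trace(np.linalg.solve(K, Y @ Y.T))  # tr(K^{-1} Y Y^T)
    return 0.5 * (D * N * np.log(2 * np.pi) + D * logdet + quad)

latent_dim = 2
X0 = 0.01 * np.random.default_rng(1).standard_normal((Y.shape[0], latent_dim))
res = minimize(gplvm_nll, X0.ravel(), args=(Y, latent_dim), method="L-BFGS-B")
X_hat = res.x.reshape(-1, latent_dim)  # optimized latent variables
```

Because the likelihood is invariant to rotations of $\mathbf{X}$, the recovered latent coordinates match the PCA solution only up to such transformations.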

Nonlinear kernel functions

Optimizing Eq. 12 can be viewed as the simplest form of a GPLVM. This might be surprising because there is no kernel function. Where is the Gaussian process (GP)? However, $\mathbf{K} = \alpha^{-1} \mathbf{X}\mathbf{X}^{\top} + \sigma^2 \mathbf{I}$ can be viewed as the covariance matrix induced by a (scaled) linear kernel plus a noise term,

$$k_{\texttt{linear}}(\mathbf{x}, \mathbf{x}^{\prime}) = \mathbf{x}^{\top} \mathbf{x}^{\prime}. \tag{13}$$
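Concretely, evaluating this kernel on all pairs of latent points, scaling by $\alpha^{-1}$, and adding the noise term recovers the covariance in Eq. 10. A tiny sketch, reusing `X` and `sigma` from the first snippet:

```python
def linear_kernel_gram(X):
    """Gram matrix of the linear kernel: entry (i, j) is x_i^T x_j."""
    return X @ X.T

alpha = 1.0  # arbitrary precision for the prior on W
K_cov = linear_kernel_gram(X) / alpha + sigma**2 * np.eye(X.shape[0])  # the covariance of Eq. 10
```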

A natural extension of this line of thinking is: what if we used a nonlinear kernel? Recall that a GP is an infinite collection of random variables, defined by a mean function $m(\cdot)$ and a kernel function $k(\cdot, \cdot)$, such that any finite collection of them is jointly Gaussian distributed:

$$f(\mathbf{x}) \sim \mathcal{GP}(m(\mathbf{x}), k(\mathbf{x}, \mathbf{x}^{\prime})) \implies \mathbf{f} \sim \mathcal{N}_N(\mathbf{m}_X, \mathbf{K}_X). \tag{14}$$

$\mathbf{m}_X$ denotes the $N$-vector obtained by evaluating the mean function $m(\cdot)$ at all points in $\mathbf{X}$, and $\mathbf{K}_X$ denotes the $N \times N$ matrix obtained by evaluating the kernel function $k(\cdot, \cdot)$ on all pairs of points in $\mathbf{X}$. The $N$-vector $\mathbf{f}$ is thus Gaussian-distributed, and we can view it as the evaluation of a function $f_d$, a realization from a GP prior, at all $N$ latent inputs in $\mathbf{X}$. See my previous post on GP regression or (Rasmussen & Williams, 2006) if this doesn’t make sense. This is why GPLVMs are often written in the following generative form:

$$\mathbf{y}_d \sim \mathcal{N}_N(\mathbf{f}_d, \sigma^2 \mathbf{I}), \quad \mathbf{f}_d \sim \mathcal{N}_N(\mathbf{0}, \mathbf{K}_X), \quad \mathbf{x}_n \sim \mathcal{N}_K(\mathbf{0}, \mathbf{I}). \tag{15}$$

In words, by choosing a specific nonlinear kernel function, we induce a nonlinear relationship between the latent variable $\mathbf{X}$ and our observations $\mathbf{Y}$. This nonlinear relationship can be viewed as placing a GP prior on each column of the $N \times D$ matrix $\mathbf{Y}$.
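To illustrate Eq. 15 with a nonlinear kernel, here is a self-contained sketch that samples data from the GPLVM generative process using a squared-exponential kernel; the lengthscale and sizes are arbitrary choices for illustration.

```python
import numpy as np

def rbf_kernel(X, lengthscale=1.0):
    """Squared-exponential kernel evaluated on all pairs of rows of X."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-0.5 * sq_dists / lengthscale**2)

rng = np.random.default_rng(2)
N, D, K_latent, sigma = 100, 10, 2, 0.1  # arbitrary sizes

X = rng.standard_normal((N, K_latent))                    # x_n ~ N(0, I)
K_X = rbf_kernel(X) + 1e-6 * np.eye(N)                    # kernel matrix plus jitter for stability
F = rng.multivariate_normal(np.zeros(N), K_X, size=D).T   # each column f_d ~ N(0, K_X)
Y = F + sigma * rng.standard_normal((N, D))               # y_d ~ N(f_d, sigma^2 I)
```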

Conclusion

The GPLVM is a generalization of probabilistic PCA. In probabilistic PCA, we integrate out the latent variables and optimize the linear map, so we are constrained to linear relationships. In a GPLVM, we integrate out the maps and optimize the latent variables, which allows us to induce nonlinearity by choosing a nonlinear kernel function. If the kernel is the linear kernel, the GPLVM reduces to probabilistic PCA.

  1. Bishop, C. M. (2006). Pattern Recognition and Machine Learning.
  2. Tipping, M. E., & Bishop, C. M. (1999). Probabilistic principal component analysis. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 61(3), 611–622.
  3. Lawrence, N. D. (2004). Gaussian process latent variable models for visualisation of high dimensional data. Advances in Neural Information Processing Systems, 329–336.
  4. Tipping, M. E. (2001). Sparse kernel principal component analysis. Advances in Neural Information Processing Systems, 633–639.
  5. Rasmussen, C. E., & Williams, C. K. I. (2006). Gaussian processes for machine learning (Vol. 2, Number 3). MIT Press Cambridge, MA.