From Probabilistic PCA to the GPLVM

A Gaussian process latent variable model (GPLVM) can be viewed as a generalization of probabilistic principal component analysis (PCA) in which the maps from latent to observed space are Gaussian-process distributed. I discuss this relationship.

Probabilistic PCA

In probabilistic principal component analysis (PCA), we assume our $D$-dimensional observations $\mathbf{Y} = [\mathbf{y}_1 \dots \mathbf{y}_N]^{\top}$ are a linear function of $K$-dimensional latent variables $\mathbf{X} = [\mathbf{x}_1 \dots \mathbf{x}_N]^{\top}$,

$$\mathbf{y}_n = \mathbf{W} \mathbf{x}_n + \varepsilon_n, \tag{1}$$

where $K \ll D$ and $\varepsilon_n \sim \mathcal{N}_D(\mathbf{0}, \sigma^2 \mathbf{I})$ is isotropic Gaussian noise. The linear map $\mathbf{W}$ is a $D \times K$ matrix relating the two sets of variables. We further assume

$$\mathbf{x}_n \sim \mathcal{N}_K(\mathbf{0}, \mathbf{I}), \tag{2}$$

where $\mathcal{N}_K(\cdot, \cdot)$ denotes a $K$-variate Gaussian. In words, we assume no structure on $\mathbf{x}_n$. This model is similar to factor analysis but with isotropic rather than non-isotropic Gaussian noise. In the language of factor analysis, the $\mathbf{x}_n$ are latent variables, the columns of $\mathbf{X}$ are factors (latent features), and the entries of $\mathbf{W}$ are loadings.
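To make these generative assumptions concrete, here is a minimal NumPy sketch of Eqs. 1 and 2; the sizes $N$, $D$, $K$ and the noise level are arbitrary choices for illustration, not values from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, K = 100, 10, 2  # number of observations, observed dim, latent dim (arbitrary)
sigma = 0.1           # noise standard deviation (arbitrary)

X = rng.standard_normal((N, K))          # latent variables, x_n ~ N(0, I)   (Eq. 2)
W = rng.standard_normal((D, K))          # linear map (drawn randomly here)
E = sigma * rng.standard_normal((N, D))  # isotropic Gaussian noise
Y = X @ W.T + E                          # observations, y_n = W x_n + eps_n (Eq. 1)
```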

If we assume the observations are i.i.d., then the marginal distribution of $\mathbf{y}_n$ is also Gaussian,

$$p(\mathbf{y}_n \mid \mathbf{W}) = \mathcal{N}_D(\mathbf{0}, \underbrace{\mathbf{W}\mathbf{W}^{\top} + \sigma^2 \mathbf{I}}_{\mathbf{C}}). \tag{3}$$

The proof of this claim relies on the fact that the joint distribution

$$p(\mathbf{y}_n, \mathbf{x}_n \mid \mathbf{W}) = p(\mathbf{y}_n \mid \mathbf{x}_n, \mathbf{W}) \, p(\mathbf{x}_n \mid \mathbf{W}) \tag{4}$$

is Gaussian, so the marginal $p(\mathbf{y}_n \mid \mathbf{W})$ is also Gaussian. I really should write a blog post proving this type of result, but see Sec. 2.3.3 in (Bishop, 2006) for a discussion. In effect, we have integrated out the latent variables $\mathbf{X}$. The log likelihood $\mathcal{L}(\mathbf{W})$ of Eq. 3 is:

$$\begin{aligned} \mathcal{L}(\mathbf{W}) &= \log p(\mathbf{Y} \mid \mathbf{W}) \\ &= \sum_{n=1}^{N} \log p(\mathbf{y}_n \mid \mathbf{W}) \\ &= \sum_{n=1}^{N} \Big[ \log \Big( \frac{1}{\sqrt{(2\pi)^D |\mathbf{C}|}} \Big) - \frac{1}{2} \mathbf{y}_n^{\top} \mathbf{C}^{-1} \mathbf{y}_n \Big] \\ &= \sum_{n=1}^{N} \Big[ -\frac{D}{2} \log 2\pi - \frac{1}{2} \log |\mathbf{C}| - \frac{1}{2} \mathbf{y}_n^{\top} \mathbf{C}^{-1} \mathbf{y}_n \Big] \\ &= -\frac{ND}{2} \log 2\pi - \frac{N}{2} \log |\mathbf{C}| - \frac{1}{2} \sum_{n=1}^{N} \mathbf{y}_n^{\top} \mathbf{C}^{-1} \mathbf{y}_n. \end{aligned} \tag{5}$$
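As a sanity check, Eq. 5 can be evaluated directly. A small sketch, reusing `Y`, `W`, and `sigma` from the snippet above:

```python
import numpy as np
from scipy.stats import multivariate_normal

def ppca_log_likelihood(Y, W, sigma):
    """Marginal log likelihood of Eq. 5: sum_n log N(y_n | 0, C) with C = W W^T + sigma^2 I."""
    D = Y.shape[1]
    C = W @ W.T + sigma**2 * np.eye(D)
    return multivariate_normal(mean=np.zeros(D), cov=C).logpdf(Y).sum()

print(ppca_log_likelihood(Y, W, sigma))
```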

Often, the term $\sum_{n=1}^{N} \mathbf{y}_n^{\top} \mathbf{C}^{-1} \mathbf{y}_n$ is written as,

$$\begin{aligned} \sum_{n=1}^{N} \mathbf{y}_n^{\top} \mathbf{C}^{-1} \mathbf{y}_n &= \sum_{n=1}^{N} \text{tr}\big( \mathbf{C}^{-1} \mathbf{y}_n \mathbf{y}_n^{\top} \big) \\ &= \text{tr}\big( \mathbf{C}^{-1} \sum_{n=1}^{N} \mathbf{y}_n \mathbf{y}_n^{\top} \big) \\ &= \text{tr}(\mathbf{C}^{-1} \mathbf{Y}^{\top} \mathbf{Y}). \end{aligned} \tag{6}$$

This just uses a trace trick,

$$\mathbf{a}^{\top} \mathbf{M}^{-1} \mathbf{a} = \text{tr}(\mathbf{M}^{-1} \mathbf{a} \mathbf{a}^{\top}), \tag{7}$$

and the linearity of the trace operation. To fit probabilistic PCA, we optimize the linear map $\mathbf{W}$ with respect to this marginal log likelihood. (Tipping & Bishop, 1999) proved that the maximum likelihood estimate of $\mathbf{W}$ is equivalent, up to scale and rotation, to the eigenvalue-based solution of standard PCA.
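The trace trick in Eqs. 6 and 7 is easy to verify numerically. Continuing the sketch above:

```python
C = W @ W.T + sigma**2 * np.eye(D)
C_inv = np.linalg.inv(C)

quad_sum = sum(y @ C_inv @ y for y in Y)  # sum_n y_n^T C^{-1} y_n
trace_form = np.trace(C_inv @ Y.T @ Y)    # tr(C^{-1} Y^T Y)
assert np.isclose(quad_sum, trace_form)
```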

GPLVM

This marginalize-then-optimize process is reversed for a Gaussian process latent variable model (GPLVM) (Lawrence, 2004). Rather than integrating out the latent variable $\mathbf{X}$ and then optimizing $\mathbf{W}$, we integrate out $\mathbf{W}$ and optimize $\mathbf{X}$. This requires that we rewrite Eq. 1 as

$$\mathbf{y}_d = \mathbf{X} \mathbf{w}_d + \varepsilon_d. \tag{8}$$

Recall that $\mathbf{W}$ is a $D \times K$ matrix, so here $\mathbf{y}_d$ denotes the $d$-th column of $\mathbf{Y}$ (an $N$-vector) and $\mathbf{w}_d$ denotes the $d$-th row of $\mathbf{W}$ (a $K$-vector). If we assume that the rows of $\mathbf{W}$ are i.i.d., meaning

$$p(\mathbf{W}) = \prod_{d=1}^{D} \mathcal{N}_K(\mathbf{w}_d \mid \mathbf{0}, \alpha^{-1} \mathbf{I}), \tag{9}$$

then we can copy the logic from the previous section exactly, just switching a couple of variables. Now the marginal distribution of $\mathbf{y}_d$ is

$$p(\mathbf{y}_d \mid \mathbf{X}) = \mathcal{N}_N(\mathbf{0}, \underbrace{\alpha^{-1} \mathbf{X}\mathbf{X}^{\top} + \sigma^2 \mathbf{I}}_{\mathbf{K}}). \tag{10}$$

And the log likelihood is

$$\begin{aligned} \mathcal{L}(\mathbf{X}) &= \log p(\mathbf{Y} \mid \mathbf{X}) \\ &= \sum_{d=1}^{D} \log p(\mathbf{y}_d \mid \mathbf{X}) \\ &= \sum_{d=1}^{D} \Big[ \log \Big( \frac{1}{\sqrt{(2\pi)^N |\mathbf{K}|}} \Big) - \frac{1}{2} \mathbf{y}_d^{\top} \mathbf{K}^{-1} \mathbf{y}_d \Big] \\ &= \sum_{d=1}^{D} \Big[ -\frac{N}{2} \log 2\pi - \frac{1}{2} \log |\mathbf{K}| - \frac{1}{2} \mathbf{y}_d^{\top} \mathbf{K}^{-1} \mathbf{y}_d \Big] \\ &= -\frac{DN}{2} \log 2\pi - \frac{D}{2} \log |\mathbf{K}| - \frac{1}{2} \sum_{d=1}^{D} \mathbf{y}_d^{\top} \mathbf{K}^{-1} \mathbf{y}_d. \end{aligned} \tag{11}$$

If we rewrite the final sum in Eq. 11 using the trace trick in Eq. 7, we get the same equation as Lawrence's Eq. 2:

$$\mathcal{L}(\mathbf{X}) = -\frac{DN}{2} \log 2\pi - \frac{D}{2} \log |\mathbf{K}| - \frac{1}{2} \text{tr}(\mathbf{K}^{-1} \mathbf{Y}\mathbf{Y}^{\top}). \tag{12}$$

That’s it. In a GPLVM, we optimize Eq. 12 with respect to $\mathbf{X}$ rather than optimizing Eq. 5 with respect to $\mathbf{W}$ as in probabilistic PCA. One can prove that the solution obtained by optimizing Eq. 12 is equivalent to the PCA solution (Tipping, 2001).
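Here is a rough sketch of the marginalize-$\mathbf{W}$, optimize-$\mathbf{X}$ procedure: it minimizes the negative of Eq. 12 with a generic optimizer and finite-difference gradients, reusing `Y` from the earlier sketches. In practice one would use analytic or automatic gradients; `alpha`, `sigma`, and the initialization here are arbitrary assumptions.

```python
import numpy as np
from scipy.optimize import minimize

def gplvm_nll(x_flat, Y, latent_dim, alpha=1.0, sigma=0.1):
    """Negative of Eq. 12 with the linear-kernel covariance K = alpha^{-1} X X^T + sigma^2 I."""
    N, D = Y.shape
    X = x_flat.reshape(N, latent_dim)
    K = X @ X.T / alpha + sigma**2 * np.eye(N)
    _, logdet = np.linalg.slogdet(K)
    quad = np.trace(np.linalg.solve(K, Y @ Y.T))  # tr(K^{-1} Y Y^T)
    return 0.5 * (D * N * np.log(2 * np.pi) + D * logdet + quad)

latent_dim = 2
X0 = 0.01 * np.random.default_rng(1).standard_normal((Y.shape[0], latent_dim))
res = minimize(gplvm_nll, X0.ravel(), args=(Y, latent_dim), method="L-BFGS-B")
X_hat = res.x.reshape(-1, latent_dim)  # optimized latent variables
```

Because the likelihood is invariant to rotations of $\mathbf{X}$, the recovered latent coordinates match the PCA solution only up to such transformations.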

Nonlinear kernel functions

Optimizing Eq. 12 can be viewed as the simplest form of a GPLVM. This might be surprising because there is no kernel function. Where is the Gaussian process (GP)? However, $\mathbf{K} = \alpha^{-1} \mathbf{X}\mathbf{X}^{\top} + \sigma^2 \mathbf{I}$ can be viewed as the covariance matrix induced by a (scaled) linear kernel plus a noise term,

$$k_{\texttt{linear}}(\mathbf{x}, \mathbf{x}^{\prime}) = \mathbf{x}^{\top} \mathbf{x}^{\prime}. \tag{13}$$
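Concretely, evaluating this kernel on all pairs of latent points, scaling by $\alpha^{-1}$, and adding the noise term recovers the covariance in Eq. 10. A tiny sketch, reusing `X` and `sigma` from the first snippet:

```python
def linear_kernel_gram(X):
    """Gram matrix of the linear kernel: entry (i, j) is x_i^T x_j."""
    return X @ X.T

alpha = 1.0  # arbitrary precision for the prior on W
K_cov = linear_kernel_gram(X) / alpha + sigma**2 * np.eye(X.shape[0])  # the covariance of Eq. 10
```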

A natural extension of this line of thinking is: what if we used a nonlinear kernel? Recall that a GP is an infinite collection of random variables, defined by a mean function $m(\cdot)$ and a kernel function $k(\cdot, \cdot)$, such that any finite collection of them is jointly Gaussian distributed:

$$f(\mathbf{x}) \sim \mathcal{GP}(m(\mathbf{x}), k(\mathbf{x}, \mathbf{x}^{\prime})) \implies \mathbf{f} \sim \mathcal{N}_N(\mathbf{m}_X, \mathbf{K}_X). \tag{14}$$

$\mathbf{m}_X$ denotes the $N$-vector obtained by evaluating the mean function $m(\cdot)$ at all points in $\mathbf{X}$, and $\mathbf{K}_X$ denotes the $N \times N$ matrix obtained by evaluating the kernel function $k(\cdot, \cdot)$ on all pairs of points in $\mathbf{X}$. The $N$-vector $\mathbf{f}$ is thus Gaussian-distributed, and we can view it as the evaluation of a function $f_d$, a realization from a GP prior, at all $N$ latent inputs in $\mathbf{X}$. See my previous post on GP regression or (Rasmussen & Williams, 2006) if this doesn’t make sense. This is why GPLVMs are often written in the following generative form:

$$\mathbf{y}_d \sim \mathcal{N}_N(\mathbf{f}_d, \sigma^2 \mathbf{I}), \quad \mathbf{f}_d \sim \mathcal{N}_N(\mathbf{0}, \mathbf{K}_X), \quad \mathbf{x}_n \sim \mathcal{N}_K(\mathbf{0}, \mathbf{I}). \tag{15}$$

In words, by choosing a specific nonlinear kernel function, we induce a nonlinear relationship between the latent variable $\mathbf{X}$ and our observations $\mathbf{Y}$. This nonlinear relationship can be viewed as placing a GP prior on each column of the $N \times D$ matrix $\mathbf{Y}$.
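To illustrate Eq. 15 with a nonlinear kernel, here is a self-contained sketch that samples data from the GPLVM generative process using a squared-exponential kernel; the lengthscale and sizes are arbitrary choices for illustration.

```python
import numpy as np

def rbf_kernel(X, lengthscale=1.0):
    """Squared-exponential kernel evaluated on all pairs of rows of X."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-0.5 * sq_dists / lengthscale**2)

rng = np.random.default_rng(2)
N, D, K_latent, sigma = 100, 10, 2, 0.1  # arbitrary sizes

X = rng.standard_normal((N, K_latent))                    # x_n ~ N(0, I)
K_X = rbf_kernel(X) + 1e-6 * np.eye(N)                    # kernel matrix plus jitter for stability
F = rng.multivariate_normal(np.zeros(N), K_X, size=D).T   # each column f_d ~ N(0, K_X)
Y = F + sigma * rng.standard_normal((N, D))               # y_d ~ N(f_d, sigma^2 I)
```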

Conclusion

The GPLVM is a generalization of probabilistic PCA. In probabilistic PCA, we integrate out the latent variables and optimize the linear map, so we are constrained to linear relationships. In a GPLVM, we integrate out the maps and optimize the latent variables, which allows us to induce nonlinearity by choosing a nonlinear kernel function. If the kernel is the linear kernel, the GPLVM reduces to probabilistic PCA.

  1. Bishop, C. M. (2006). Pattern Recognition and Machine Learning.
  2. Tipping, M. E., & Bishop, C. M. (1999). Probabilistic principal component analysis. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 61(3), 611–622.
  3. Lawrence, N. D. (2004). Gaussian process latent variable models for visualisation of high dimensional data. Advances in Neural Information Processing Systems, 329–336.
  4. Tipping, M. E. (2001). Sparse kernel principal component analysis. Advances in Neural Information Processing Systems, 633–639.
  5. Rasmussen, C. E., & Williams, C. K. I. (2006). Gaussian processes for machine learning (Vol. 2, Number 3). MIT Press Cambridge, MA.