Gaussian Process Dynamical Models

Wang, Fleet, and Hertzmann's 2008 paper, "Gaussian Process Dynamical Models for Human Motion", introduces a Gaussian process latent variable model with Gaussian process latent dynamics. I discuss this paper in detail.

A Gaussian process dynamical model (GPDM) (Wang et al., 2006) can be viewed as a Gaussian process latent variable model (GPLVM) with the latent variable evolving according to its own Gaussian process. Put more simply, imagine a latent variable evolves according to some smooth, nonlinear dynamics, and that the mapping from latent- to observation-space is also a smooth, nonlinear map. For example, imagine a mouse is moving in a maze, but we only record the firing rates of its hippocampal place cells. The latent variable is the mouse position, which we know is a relatively smooth trajectory in 3D space, and the observations, neuron firing rates, are a nonlinear function of this latent variable. Can we infer the position of the mouse given just its firing rates? This is the type of question the GPDM attempts to answer. The goal of this post is to work through this model in detail.
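To make this picture concrete, here is a toy version of that generative story in code. It is only an illustration: the 2D trajectory standing in for the animal's position, the Gaussian tuning curves, and all constants are my own assumptions, not anything from the paper.

import numpy as np

rng  = np.random.default_rng(0)
T, J = 100, 30                                   # time steps, neurons

# Latent variable: a smooth 2D trajectory (the "mouse position").
time = np.linspace(0, 2 * np.pi, T)
X    = np.column_stack([np.cos(time), np.sin(2 * time)])

# Observations: firing rates of J place cells with Gaussian tuning curves,
# i.e. a smooth, nonlinear map from latent position to observation space.
centers = rng.uniform(-1, 1, size=(J, 2))        # preferred locations
rates   = np.exp(-2.0 * np.sum((X[:, None, :] - centers[None, :, :])**2, axis=2))
Y       = rates + 0.05 * rng.standard_normal((T, J))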

Gaussian process dynamics

Let $\mathbf{Y} = [\mathbf{y}_1 \dots \mathbf{y}_T]^{\top}$ be $T$ observations, each $J$-dimensional and indexed by a discrete-time index $t$, and let $\mathbf{X} = [\mathbf{x}_1 \dots \mathbf{x}_T]^{\top}$ be $D$-dimensional latent variables. Now consider the following Markov dynamics:

$$
\begin{aligned}
\mathbf{x}_t &= f(\mathbf{x}_{t-1}; \mathbf{A}) + \mathbf{n}_{x,t}, \\
\mathbf{y}_t &= g(\mathbf{x}_{t}; \mathbf{B}) + \mathbf{n}_{y,t}.
\end{aligned} \tag{1}
$$

In Eq. $1$, $f$ and $g$ are mappings with parameters $\mathbf{A}$ and $\mathbf{B}$ respectively, and $\mathbf{n}_{x,t}$ and $\mathbf{n}_{y,t}$ are zero-mean Gaussian noise. The latent dynamics are Markov because the latent variable at time $t$, $\mathbf{x}_t$, only depends on the previous latent variable $\mathbf{x}_{t-1}$ and the dynamics defined by $f$. Each observation $\mathbf{y}_t$ depends on its latent variable $\mathbf{x}_t$ through the function $g$. For example, in a linear dynamical system, $f$ and $g$ are linear functions:

$$
\begin{aligned}
\mathbf{x}_t &= \mathbf{A} \mathbf{x}_{t-1} + \mathbf{n}_{x,t}, \\
\mathbf{y}_t &= \mathbf{B} \mathbf{x}_{t} + \mathbf{n}_{y,t}.
\end{aligned} \tag{2}
$$
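To make Eq. $2$ concrete, here is a minimal simulation of a linear dynamical system. The dimensions, matrices, and noise scales are arbitrary choices for illustration, not values from the paper.

import numpy as np

def simulate_lds(T=200, D=2, J=10, seed=0):
    rng = np.random.default_rng(seed)
    # Arbitrary (roughly stable) dynamics matrix A and observation matrix B.
    A = 0.99 * np.eye(D) + 0.01 * rng.standard_normal((D, D))
    B = rng.standard_normal((J, D))
    X = np.zeros((T, D))
    for t in range(1, T):
        # x_t = A x_{t-1} + n_{x,t}
        X[t] = A @ X[t-1] + 0.1 * rng.standard_normal(D)
    # y_t = B x_t + n_{y,t}
    Y = X @ B.T + 0.1 * rng.standard_normal((T, J))
    return X, Y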

In the original GPDM paper, the authors propose a particular nonlinear case in which $f$ and $g$ are linear combinations of basis functions:

$$
\begin{aligned}
f(\mathbf{x}; \mathbf{A}) &= \sum_{k=1}^{K} \mathbf{a}_k \phi_k(\mathbf{x}), \\
g(\mathbf{x}; \mathbf{B}) &= \sum_{m=1}^{M} \mathbf{b}_m \psi_m(\mathbf{x}),
\end{aligned} \tag{3}
$$

for weights $\mathbf{A} = [\mathbf{a}_1 \dots \mathbf{a}_K]$ and $\mathbf{B} = [\mathbf{b}_1 \dots \mathbf{b}_M]$. To be clear, this is just a matrix-vector multiplication. We can rewrite Eq. $3$ as

$$
\begin{aligned}
f(\mathbf{x}; \mathbf{A}) &= \underbrace{\left[\begin{array}{c|c|c} \mathbf{a}_1 & \dots & \mathbf{a}_K \end{array}\right]}_{\mathbf{A}} \underbrace{\begin{bmatrix} \phi_1(\mathbf{x}) \\ \vdots \\ \phi_K(\mathbf{x}) \end{bmatrix}}_{\boldsymbol{\phi}(\mathbf{x})}, \\
g(\mathbf{x}; \mathbf{B}) &= \underbrace{\left[\begin{array}{c|c|c} \mathbf{b}_1 & \dots & \mathbf{b}_M \end{array}\right]}_{\mathbf{B}} \underbrace{\begin{bmatrix} \psi_1(\mathbf{x}) \\ \vdots \\ \psi_M(\mathbf{x}) \end{bmatrix}}_{\boldsymbol{\psi}(\mathbf{x})},
\end{aligned} \tag{4}
$$

where $\mathbf{A}$ and $\mathbf{B}$ are $D \times K$ and $J \times M$ matrices respectively, and where $\boldsymbol{\phi}(\mathbf{x}) = [\phi_1(\mathbf{x}) \dots \phi_K(\mathbf{x})]^{\top}$ and $\boldsymbol{\psi}(\mathbf{x}) = [\psi_1(\mathbf{x}) \dots \psi_M(\mathbf{x})]^{\top}$, so $K$- and $M$-vectors respectively.
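As a quick sanity check on this notation, the snippet below evaluates $f(\mathbf{x}; \mathbf{A}) = \mathbf{A} \boldsymbol{\phi}(\mathbf{x})$ at a single input, using $K$ RBF basis functions with arbitrary centers. The particular basis functions and all constants are illustrative assumptions, not part of the model specification.

import numpy as np

def phi(x, centers, scale=1.0):
    # K-vector of RBF basis functions evaluated at a single D-vector x.
    return np.exp(-0.5 * np.sum((x - centers)**2, axis=1) / scale**2)

rng     = np.random.default_rng(0)
D, K    = 2, 5
centers = rng.standard_normal((K, D))   # arbitrary basis-function centers
A       = rng.standard_normal((D, K))   # weights: one column a_k per basis function
x       = rng.standard_normal(D)

f_x = A @ phi(x, centers)               # Eq. 4: a D-vector, sum_k a_k * phi_k(x)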

It may not be obvious why this is a clever idea, but if we place Gaussian priors on the rows of $\mathbf{A}$ and $\mathbf{B}$, we can integrate them out. In both cases, the derivations are the same, minus a few details, as the derivation for a GPLVM. We will be left with multivariate Gaussian distributions whose covariance matrices are Gram matrices, where each cell is a dot product between the vectors of basis functions evaluated at a pair of inputs, e.g.

$$
\mathbf{K}_{ij} = \boldsymbol{\phi}(\mathbf{x}_i)^{\top} \boldsymbol{\phi}(\mathbf{x}_j). \tag{5}
$$

Thus, an appropriate choice of basis function implies Gaussian-process dynamics for this model. This is a common theme in Bayesian inference: placing a prior on weights induces a prior on functions.
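Here is a quick Monte Carlo check of that claim, using arbitrary RBF basis functions of my own choosing: if $\mathbf{b} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$, then the function values $\boldsymbol{\psi}(\mathbf{x}_t)^{\top} \mathbf{b}$ at $T$ inputs have covariance $\boldsymbol{\Psi}\boldsymbol{\Psi}^{\top}$, the Gram matrix of Eq. $5$.

import numpy as np

rng     = np.random.default_rng(0)
T, D, M = 50, 2, 20
X       = rng.standard_normal((T, D))          # T latent points
centers = rng.standard_normal((M, D))          # arbitrary RBF basis-function centers

# Psi is the T x M design matrix whose rows are psi(x_t).
Psi = np.exp(-0.5 * np.sum((X[:, None, :] - centers[None, :, :])**2, axis=2))

# Sample many weight vectors b ~ N(0, I_M); each row of G is one sampled
# function evaluated at the T inputs.
B_samples = rng.standard_normal((100000, M))
G         = B_samples @ Psi.T
emp_cov   = np.cov(G, rowvar=False)            # empirical T x T covariance

# Close to the Gram matrix, up to Monte Carlo error.
print(np.max(np.abs(emp_cov - Psi @ Psi.T)))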

Now let’s walk through this integration process in detail.

Integrating out the weights

First, let's integrate out $\mathbf{B}$. Wang's modeling assumption is that each row of $\mathbf{B}$, each $\mathbf{b}_j$, has an isotropic Gaussian prior, meaning

$$
p(\mathbf{B}) = \prod_{j=1}^{J} \mathcal{N}_M( \mathbf{b}_j \mid \mathbf{0}, w_j^{-2} \mathbf{I}), \tag{6}
$$

for some variance $w_j^{-2}$. Now let's rewrite the model in terms of the features (columns) of $\mathbf{Y}$:

$$
\mathbf{y}_j = \boldsymbol{\Psi} \mathbf{b}_j + \mathbf{n}_{y,j}, \tag{7}
$$

where $\boldsymbol{\Psi} = [\boldsymbol{\psi}(\mathbf{x}_1) \dots \boldsymbol{\psi}(\mathbf{x}_T)]^{\top}$, a $T \times M$ matrix. This implies that the distribution of $\mathbf{y}_j$, conditioned on $\mathbf{b}_j$, is a $T$-variate normal,

$$
\mathbf{y}_j \mid \mathbf{X}, \mathbf{b}_j \sim \mathcal{N}_T(\mathbf{y}_j \mid \boldsymbol{\Psi} \mathbf{b}_j, w_j^{-2} \sigma_Y^2 \mathbf{I}). \tag{8}
$$

Note that the variance of $\mathbf{n}_{y,j}$ is $w_j^{-2} \sigma_Y^2$. Wang never writes this out explicitly, but he leaves many terms implicit, and this assumption makes the derivation work. Since both $p(\mathbf{b}_j)$ and $p(\mathbf{y}_j \mid \mathbf{X}, \mathbf{b}_j)$ are Gaussian, we can marginalize out $\mathbf{b}_j$ to get:

$$
p(\mathbf{y}_j \mid \mathbf{X}) = \mathcal{N}_T(\mathbf{y}_j \mid \mathbf{0}, \boldsymbol{\Psi} w_j^{-2} \boldsymbol{\Psi}^{\top} + w_j^{-2} \sigma_Y^2 \mathbf{I}). \tag{9}
$$

See Sec. $2.3.3$ in (Bishop, 2006) if this marginalization does not make sense. We can rewrite the covariance matrix in Eq. $9$ as

$$
\boldsymbol{\Psi} w_j^{-2} \boldsymbol{\Psi}^{\top} + w_j^{-2} \sigma_Y^2 \mathbf{I} = w_j^{-2} (\boldsymbol{\Psi} \boldsymbol{\Psi}^{\top} + \sigma_Y^2 \mathbf{I}) = w_j^{-2} \mathbf{K}_Y, \tag{10}
$$

where we've defined $\mathbf{K}_Y = \boldsymbol{\Psi} \boldsymbol{\Psi}^{\top} + \sigma_Y^2 \mathbf{I}$. Therefore, the joint density over all $J$ features is

$$
\begin{aligned}
p(\mathbf{Y} \mid \mathbf{X}) &= \prod_{j=1}^{J} p(\mathbf{y}_j \mid \mathbf{X}) \\
&= \prod_{j=1}^{J} \frac{1}{\sqrt{(2\pi)^T |w_j^{-2} \mathbf{K}_Y|}} \exp\Big\{ -\frac{1}{2} \mathbf{y}_j^{\top} w_j^{2} \mathbf{K}_Y^{-1} \mathbf{y}_j \Big\} \\
&= \frac{\prod_{j=1}^{J} w_j^{T}}{\sqrt{(2\pi)^{TJ} |\mathbf{K}_Y|^J}} \exp\Big\{-\frac{1}{2} \sum_{j=1}^{J} \mathbf{y}_j^{\top} w_j^{2} \mathbf{K}_Y^{-1} \mathbf{y}_j \Big\} \\
&= \frac{\prod_{j=1}^{J} w_j^{T}}{\sqrt{(2\pi)^{TJ} |\mathbf{K}_Y|^J}} \exp\Big\{-\frac{1}{2} \text{tr} \Big( \mathbf{K}_Y^{-1} \sum_{j=1}^{J} w_j^{2}\, \mathbf{y}_j \mathbf{y}_j^{\top} \Big) \Big\} \\
&= \frac{|\mathbf{W}|^{T}}{\sqrt{(2\pi)^{TJ} |\mathbf{K}_Y|^J}} \exp\Big\{-\frac{1}{2} \text{tr} \Big( \mathbf{K}_Y^{-1} \mathbf{Y} \mathbf{W}^2 \mathbf{Y}^{\top} \Big) \Big\},
\end{aligned} \tag{11}
$$

where $\mathbf{W} = \text{diag}([w_1 \dots w_J])$, so $\prod_{j=1}^{J} w_j^T = |\mathbf{W}|^T$. We have used a trace trick, the fact that a sum of outer products is a matrix multiplication, and some properties of the determinant. Thus, we have rederived Wang's Eq. $5$.
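The trace manipulation in the last two lines of Eq. $11$ is easy to check numerically. The sketch below verifies that $\sum_j w_j^2 \mathbf{y}_j^{\top} \mathbf{K}^{-1} \mathbf{y}_j = \text{tr}(\mathbf{K}^{-1} \mathbf{Y} \mathbf{W}^2 \mathbf{Y}^{\top})$ for arbitrary random values; nothing here is specific to the GPDM.

import numpy as np

rng  = np.random.default_rng(0)
T, J = 30, 5
Y    = rng.standard_normal((T, J))
w    = rng.uniform(0.5, 2.0, size=J)
W    = np.diag(w)
L    = rng.standard_normal((T, T))
K    = L @ L.T + T * np.eye(T)          # an arbitrary positive definite matrix
Kinv = np.linalg.inv(K)

lhs = sum(w[j]**2 * Y[:, j] @ Kinv @ Y[:, j] for j in range(J))
rhs = np.trace(Kinv @ Y @ W**2 @ Y.T)

print(np.allclose(lhs, rhs))            # True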

Now let's integrate out $\mathbf{A}$. The assumption is that the rows of $\mathbf{A}$ are i.i.d. Gaussian,

$$
p(\mathbf{A}) = \prod_{d=1}^D \mathcal{N}_K(\mathbf{a}_d \mid \mathbf{0}, \mathbf{I}). \tag{12}
$$

We can rewrite Eq. $4$ in terms of the columns of $\mathbf{X}$ and the rows of $\mathbf{A}$:

$$
\begin{aligned}
\mathbf{x}_d^{(2:T)} &= \boldsymbol{\Phi}^{1:(T-1)} \mathbf{a}_d + \mathbf{n}_{x,d}, \\
&\Downarrow \\
\mathbf{x}_d^{(2:T)} \mid \mathbf{X}, \mathbf{a}_d &\sim \mathcal{N}_{T-1}(\mathbf{x}_d^{(2:T)} \mid \boldsymbol{\Phi}^{1:(T-1)} \mathbf{a}_d, \sigma^2_X \mathbf{I}),
\end{aligned} \tag{13}
$$

where $\mathbf{x}_d^{(2:T)}$ denotes the $d$-th column of $\mathbf{X}$ after removing the first row of $\mathbf{X}$, i.e. $\mathbf{x}_d^{(2:T)} = [x_d^{(2)} \dots x_d^{(T)}]^{\top}$, and where $\boldsymbol{\Phi}^{1:(T-1)} = [\boldsymbol{\phi}(\mathbf{x}_1) \dots \boldsymbol{\phi}(\mathbf{x}_{T-1})]^{\top}$. To be clear, the superscripts denote components in a vector, and $\sigma_X^2$ is the variance of $\mathbf{n}_{x,d}$. This funky indexing is just the result of the Markov assumption, that $\mathbf{x}_{t}$ depends on $\mathbf{x}_{t-1}$. Since this superscript notation is quite unwieldy, we'll drop it in favor of just specifying matrix shapes as needed. Again, we can marginalize out $\mathbf{A}$ since both $p(\mathbf{x}_d \mid \mathbf{a}_d)$ and $p(\mathbf{a}_d)$ are Gaussian:

$$
p(\mathbf{x}_d) = \mathcal{N}_{T-1}(\mathbf{x}_d \mid \mathbf{0}, \underbrace{\boldsymbol{\Phi}\boldsymbol{\Phi}^{\top} + \sigma_X^2 \mathbf{I}}_{\mathbf{K}_X}). \tag{14}
$$

Again, see Sec. $2.3.3$ in (Bishop, 2006) if this marginalization does not make sense. Thus, we can marginalize out $\mathbf{A}$ from the model entirely:

$$
\begin{aligned}
p(\mathbf{X}) &= \int p(\mathbf{X} \mid \mathbf{A})\, p(\mathbf{A})\, \text{d}\mathbf{A} \\
&= \int p(\mathbf{x}_1) \prod_{d=1}^D p(\mathbf{x}_d \mid \mathbf{A})\, p(\mathbf{A})\, \text{d}\mathbf{A} \\
&= p(\mathbf{x}_1) \prod_{d=1}^D \frac{1}{\sqrt{(2\pi)^{(T-1)}|\mathbf{K}_X|}} \exp\Big\{-\frac{1}{2} \mathbf{x}_d^{\top} \mathbf{K}_X^{-1} \mathbf{x}_d \Big\} \\
&= p(\mathbf{x}_1) \frac{1}{\sqrt{(2\pi)^{D(T-1)}|\mathbf{K}_X|^D}} \exp\Big\{-\frac{1}{2} \sum_{d=1}^{D} \mathbf{x}_d^{\top} \mathbf{K}_X^{-1} \mathbf{x}_d \Big\} \\
&= p(\mathbf{x}_1) \frac{1}{\sqrt{(2\pi)^{D(T-1)}|\mathbf{K}_X|^D}} \exp\Big\{-\frac{1}{2} \text{tr} \big( \mathbf{K}_X^{-1} \bar{\mathbf{X}} \bar{\mathbf{X}}^{\top} \big) \Big\},
\end{aligned} \tag{15}
$$

where $\bar{\mathbf{X}} = \mathbf{X}^{(2:T)}$, i.e. $\mathbf{X}$ with its first row removed.
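The indexing in Eqs. $13$ through $15$ maps directly onto array slicing: the Gram matrix is built from $\mathbf{x}_1, \dots, \mathbf{x}_{T-1}$, while the density is over $\mathbf{x}_2, \dots, \mathbf{x}_T$. Below is a minimal sketch of the log of Eq. $15$ (dropping the $\log p(\mathbf{x}_1)$ term), where I assume the `rbf_kernel` helper from Appendix A1 and use an RBF Gram matrix in place of the explicit $\boldsymbol{\Phi}\boldsymbol{\Phi}^{\top}$.

import numpy as np

def log_p_X(X, sigma_X=1.0):
    # Gram matrix over the first T-1 latent points (Phi^{1:(T-1)} in Eq. 13).
    K_X   = rbf_kernel(X[:-1], var=1.0, length_scale=1.0, diag=sigma_X**2)
    # The density is over the last T-1 latent points, X_bar = X^{(2:T)}.
    X_bar = X[1:]
    T1, D = X_bar.shape
    _, logdet = np.linalg.slogdet(K_X)
    return (-0.5 * D * T1 * np.log(2 * np.pi)
            - 0.5 * D * logdet
            - 0.5 * np.trace(np.linalg.inv(K_X) @ X_bar @ X_bar.T))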

That's it. We have derived the GPDM. As with the derivation for a GPLVM, it may not be immediately obvious where the Gaussian processes are, but notice that $p(\mathbf{Y} \mid \mathbf{X})$ and $p(\mathbf{X})$ are both multivariate Gaussian with covariance matrices that are functions of Gram matrices, $\mathbf{K}_Y$ and $\mathbf{K}_X$. The other matrices in the last lines of Eq. $11$ and $15$ are the column-specific variances, $\mathbf{W}$, or the result of the i.i.d. assumption for the observations and latent variables, $\mathbf{Y}$ and $\bar{\mathbf{X}}$. Thus, this model implies a Gaussian process prior on both the dynamics and the latent-to-observation map if we define $\mathbf{K}_Y$ and $\mathbf{K}_X$ using positive definite kernel functions.

Inference

We want to infer the posterior $p(\mathbf{X} \mid \mathbf{Y})$, where

$$
p(\mathbf{X} \mid \mathbf{Y}) \propto p(\mathbf{Y} \mid \mathbf{X})\, p(\mathbf{X}). \tag{16}
$$

Therefore, the log posterior in Eq. $16$ decomposes into the sum of the logs of Eq. $11$ and $15$:

$$
\begin{aligned}
\mathcal{L}(\mathbf{X}) &= \log p(\mathbf{Y} \mid \mathbf{X}) + \log p(\mathbf{X}) \\
&= T \log |\mathbf{W}| - \cancel{\frac{TJ}{2} \log(2\pi)} - \frac{J}{2} \log |\mathbf{K}_Y| - \frac{1}{2} \text{tr}(\mathbf{K}_Y^{-1} \mathbf{Y} \mathbf{W}^2 \mathbf{Y}^{\top}) \\
&\quad - \cancel{\frac{D(T-1)}{2} \log(2\pi)} - \frac{D}{2} \log |\mathbf{K}_X| - \frac{1}{2} \text{tr}(\mathbf{K}_X^{-1} \bar{\mathbf{X}} \bar{\mathbf{X}}^{\top}) + \cancel{\log p(\mathbf{x}_1)} \\
&= T \log |\mathbf{W}| - \frac{J}{2} \log |\mathbf{K}_Y| - \frac{1}{2} \text{tr}(\mathbf{K}_Y^{-1} \mathbf{Y} \mathbf{W}^2 \mathbf{Y}^{\top}) - \frac{D}{2} \log |\mathbf{K}_X| - \frac{1}{2} \text{tr}(\mathbf{K}_X^{-1} \bar{\mathbf{X}} \bar{\mathbf{X}}^{\top}).
\end{aligned} \tag{17}
$$

We have dropped the constant terms, along with the $\log p(\mathbf{x}_1)$ term. If we optimize the latent variable $\mathbf{X}$ with respect to this log posterior, Eq. $17$, then we will infer a latent variable that adheres to our modeling assumptions in Eq. $3$. However, Eq. $17$ is slightly different from Wang's Eq. $16$. That's because Wang also proposes optimizing the kernel functions' hyperparameters with respect to the log posterior. Wang assumes a radial basis function (RBF) kernel and a linear-plus-RBF kernel for $\mathbf{K}_Y$ and $\mathbf{K}_X$ respectively,

$$
\begin{aligned}
k_Y(\mathbf{x}, \mathbf{x}^{\prime}; \boldsymbol{\beta}) &= \beta_1 \exp\Big(-\frac{\beta_2}{2} \lVert \mathbf{x} - \mathbf{x}^{\prime} \rVert_2^2 \Big) + \beta_3^{-1} \delta_{\mathbf{x}, \mathbf{x}^{\prime}}, \\
k_X(\mathbf{x}, \mathbf{x}^{\prime}; \boldsymbol{\alpha}) &= \alpha_1 \exp\Big(-\frac{\alpha_2}{2}\lVert \mathbf{x} - \mathbf{x}^{\prime}\rVert_2^2\Big) + \alpha_3 \mathbf{x}^{\top} \mathbf{x}^{\prime} + \alpha_4^{-1} \delta_{\mathbf{x}, \mathbf{x}^{\prime}},
\end{aligned} \tag{18}
$$

where $\delta_{\mathbf{x}, \mathbf{x}^{\prime}}$ is the Kronecker delta, which is one when $\mathbf{x} = \mathbf{x}^{\prime}$ and zero otherwise; these terms add noise to the diagonal of each Gram matrix. Thus, $\boldsymbol{\beta} = \{\beta_1, \beta_2, \beta_3 \}$ and $\boldsymbol{\alpha} = \{\alpha_1, \alpha_2, \alpha_3, \alpha_4 \}$, where $\beta_3^{-1} = \sigma_Y^2$ and $\alpha_4^{-1} = \sigma_X^2$. Regardless of the particular kernels, meaning regardless of the specific hyperparameters $\boldsymbol{\beta}$ and $\boldsymbol{\alpha}$, we can place priors on both kernels' hyperparameters. Following Wang, we use the uninformative priors

$$
\begin{aligned}
p(\boldsymbol{\beta}) &\propto \prod_i \beta_i^{-1}, \\
p(\boldsymbol{\alpha}) &\propto \prod_i \alpha_i^{-1},
\end{aligned} \tag{19}
$$

and then optimize the posterior

$$
p(\mathbf{X}, \boldsymbol{\beta}, \boldsymbol{\alpha} \mid \mathbf{Y}) \propto p(\mathbf{Y} \mid \mathbf{X}, \boldsymbol{\beta})\, p(\mathbf{X} \mid \boldsymbol{\alpha})\, p(\boldsymbol{\beta})\, p(\boldsymbol{\alpha}). \tag{20}
$$

This reduces to adding two additional terms to Eq. $17$:

$$
\begin{aligned}
\mathcal{L}(\mathbf{X}, \boldsymbol{\beta}, \boldsymbol{\alpha}) &= T \log |\mathbf{W}| - \frac{J}{2} \log |\mathbf{K}_Y| - \frac{1}{2} \text{tr}(\mathbf{K}_Y^{-1} \mathbf{Y} \mathbf{W}^2 \mathbf{Y}^{\top}) \\
&\quad - \frac{D}{2} \log |\mathbf{K}_X| - \frac{1}{2} \text{tr}(\mathbf{K}_X^{-1} \bar{\mathbf{X}} \bar{\mathbf{X}}^{\top}) \\
&\quad - \sum_i \log \alpha_i - \sum_i \log \beta_i.
\end{aligned} \tag{21}
$$

In conclusion, inference for a GPDM amounts to optimizing $\mathbf{X}$, $\boldsymbol{\beta}$, and $\boldsymbol{\alpha}$ by minimizing the negative log posterior, the negative of Eq. $21$.
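For concreteness, here is a sketch that translates Eq. $18$ and the full objective in Eq. $21$ directly into code, including the $T \log |\mathbf{W}|$ and hyperparameter-prior terms that the simpler implementation in Appendix A2 omits. The helper names (`sq_dists`, `k_Y`, `k_X`, `full_log_posterior`) and the choice to hold $\mathbf{W}$ fixed rather than learn it are my own assumptions for illustration.

import numpy as np

def sq_dists(X):
    # Pairwise squared Euclidean distances between the rows of X.
    return np.sum((X[:, None, :] - X[None, :, :])**2, axis=2)

def k_Y(X, beta):
    # Eq. 18: beta1 * exp(-beta2/2 ||x - x'||^2) + beta3^{-1} * delta_{x,x'}.
    b1, b2, b3 = beta
    return b1 * np.exp(-0.5 * b2 * sq_dists(X)) + (1.0 / b3) * np.eye(len(X))

def k_X(X, alpha):
    # Eq. 18: RBF term + linear term + alpha4^{-1} * delta_{x,x'}.
    a1, a2, a3, a4 = alpha
    return (a1 * np.exp(-0.5 * a2 * sq_dists(X))
            + a3 * X @ X.T
            + (1.0 / a4) * np.eye(len(X)))

def full_log_posterior(Y, X, W, beta, alpha):
    # Eq. 21, with the diagonal matrix W held fixed rather than learned.
    T, J  = Y.shape
    _, D  = X.shape
    K_Y   = k_Y(X, beta)
    K_X   = k_X(X[:-1], alpha)     # built from x_1, ..., x_{T-1}
    X_bar = X[1:]                  # density over x_2, ..., x_T

    obj  = T * np.sum(np.log(np.diag(W)))                       # T log|W|
    obj -= 0.5 * J * np.linalg.slogdet(K_Y)[1]                  # -J/2 log|K_Y|
    obj -= 0.5 * np.trace(np.linalg.inv(K_Y) @ Y @ W**2 @ Y.T)
    obj -= 0.5 * D * np.linalg.slogdet(K_X)[1]                  # -D/2 log|K_X|
    obj -= 0.5 * np.trace(np.linalg.inv(K_X) @ X_bar @ X_bar.T)
    obj -= np.sum(np.log(alpha)) + np.sum(np.log(beta))         # prior terms
    return obj

For example, with $\mathbf{W} = \mathbf{I}$ and unit hyperparameters, a call would look like full_log_posterior(Y, X0, np.eye(Y.shape[1]), np.ones(3), np.ones(4)).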

Example

Consider the task of inferring a smooth S-shaped latent variable from high-dimensional observations (Figure $1$, top left). See A1 for Python code to generate this S-shaped variable. Next, let's fit three dimension reduction techniques to these data: principal component analysis (PCA), a GPLVM, and a GPDM (Figure $1$, top right three subplots). We find that PCA can uncover the structure at a high level but has difficulty disambiguating points in the middle of the S-curve. This makes sense, since PCA is a linear model. Both the GPLVM and GPDM do better at inferring the actual S-shape.

Figure 1. (Top row, left) True S-shaped latent variable generated from Scikit-learn's `make_s_curve` function. (Top row, right three plots) The inferred latent variable from three dimension reduction techniques: PCA, GPLVM, and GPDM. (Bottom row) Each subplot is the same as the one above it, but using a line plot rather than a scatter plot.

However, consider the bottom row of plots in Figure $1$. Here, I've plotted the latent variables, each $\mathbf{x}_t$, in order using a line plot. This visualization makes it clear that the GPDM infers a smoother latent space than the GPLVM. See A2 for Python code to fit a GPDM by optimizing the log posterior.
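Putting the appendix code together, a minimal end-to-end run might look like the following. It assumes the functions defined in A1 and A2 below and initializes the latent variables with PCA, which is presumably why A2 imports `PCA`.

from sklearn.decomposition import PCA

X_true, Y, t = gen_data()                     # A1: 2D S-curve latents and observations
X0           = PCA(n_components=2).fit_transform(Y)
X_map        = optimize_gpdm(Y, X0)           # A2: MAP estimate of the latents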

Appendix

A1. Generating the data

import numpy as np
from   sklearn.datasets import make_s_curve


def gen_data():
    T    = 200
    J    = 40
    # 2D S-shaped latent variable: drop the y-axis of the 3D S-curve and
    # standardize each dimension.
    X, t = make_s_curve(T)
    X    = np.delete(X, obj=1, axis=1)
    X    = X / np.std(X, axis=0)
    D    = X.shape[1]
    # Order the points along the curve.
    inds = t.argsort()
    X    = X[inds]
    t    = t[inds]
    # High-dimensional observations: J noisy GP draws indexed by the latents.
    K    = rbf_kernel(X, 1, 1, 0)
    F    = np.random.multivariate_normal(np.zeros(T), K, size=J).T
    Y    = F + np.random.normal(0, scale=1, size=F.shape)
    return X, Y, t


def rbf_kernel(X, var, length_scale, diag):
    # Gram matrix of the RBF kernel over the rows of X, plus `diag` on the
    # diagonal (e.g. noise variance).
    T = len(X)
    diffs = np.expand_dims(X / length_scale, 1) \
          - np.expand_dims(X / length_scale, 0)
    return var * np.exp(-0.5 * np.sum(diffs ** 2, axis=2)) + diag * np.eye(T)

A2. Fitting a GPDM

import autograd.numpy as np
from   autograd import grad
from   scipy.optimize import fmin_l_bfgs_b
from   sklearn.decomposition import PCA


def log_posterior(Y, X, beta, alpha):
    _, J = Y.shape
    _, D = X.shape

    # Log-likelihood term, Eq. 11, with W taken to be the identity.
    # `rbf_kernel` is the function from A1; here it must use autograd.numpy.
    K_Y      = rbf_kernel(X, *beta)
    det_term = -J/2 * np.prod(np.linalg.slogdet(K_Y))
    tr_term  = -1/2 * np.trace(np.linalg.inv(K_Y) @ Y @ Y.T)
    LL       = det_term + tr_term

    # Log-prior term, Eq. 15: K_X is built from x_1, ..., x_{T-1}, and the
    # density is over x_2, ..., x_T. The hyperparameter priors in Eq. 21 are
    # omitted here for simplicity.
    K_X      = rbf_linear_kernel(X[:-1], *alpha)
    X_bar    = X[1:]
    det_term = -D/2 * np.prod(np.linalg.slogdet(K_X))
    tr_term  = -1/2 * np.trace(np.linalg.inv(K_X) @ X_bar @ X_bar.T)
    LP       = det_term + tr_term

    return LL + LP


def rbf_linear_kernel(X, var, length_scale, diag, lin_coef):
    # Linear-plus-RBF kernel for the latent dynamics (Eq. 18).
    rbf    = rbf_kernel(X, var, length_scale, diag)
    linear = lin_coef * X @ X.T
    return rbf + linear


def optimize_gpdm(Y, X0):
    T, D = X0.shape

    # Initial kernel hyperparameters (beta for K_Y, alpha for K_X).
    beta0  = np.array([1, 1, 1e-6])
    alpha0 = np.array([1, 1, 1e-6, 1e-6])

    def _neg_f(params):
        # Unpack the flattened latent variables and hyperparameters.
        X     = params[:T*D].reshape(X0.shape)
        beta  = params[T*D:T*D+3]
        alpha = params[T*D+3:]
        return -1 * log_posterior(Y, X, beta, alpha)

    # Autograd provides the gradient of the negative log posterior.
    _neg_fp = grad(_neg_f)

    def f_fp(params):
        return _neg_f(params), _neg_fp(params)

    x0  = np.concatenate([X0.flatten(), beta0, alpha0])
    res = fmin_l_bfgs_b(f_fp, x0)

    # MAP estimate of the latent variables.
    X_map = res[0][:T*D].reshape(X0.shape)
    return X_map

References

1. Wang, J., Hertzmann, A., & Fleet, D. J. (2006). Gaussian process dynamical models. Advances in Neural Information Processing Systems, 1441–1448.
2. Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer.