Gaussian Processes with Multinomial Observations

Linderman, Johnson, and Adams's 2015 paper, "Dependent multinomial models made easy: Stick-breaking with the Pólya-gamma augmentation", introduces a Gibbs sampler for Gaussian processes with multinomial observations. I discuss this model in detail.

In "Dependent multinomial models made easy: Stick-breaking with the Pólya-gamma augmentation" (Linderman et al., 2015), Linderman et al. leverage a representation of the $K$-dimensional multinomial distribution as the product of $K-1$ binomial distributions to construct a model that is amenable to Pólya-gamma augmentation. I discussed this representation and its correctness in a previous post. In this post, I want to construct the Gibbs sampler for Gaussian process regression with multinomial observations discussed in the paper.

Pólya-gamma augmentation

Let’s review Pólya-gamma augmentation to set up the notation used by Linderman et al. Conceptually, Pólya-gamma augmentation is fairly straightforward, but the mathematical details are a bit tedious. See my previous post if you want a more detailed presentation.

In (Polson et al., 2013), Polson proved two useful properties of Pólya-gamma random variables. First, he proved the following identity:

$$
\begin{aligned}
\frac{(e^{\psi})^a}{(1 + e^{\psi})^b}
&= 2^{-b} e^{\kappa \psi} \int_{0}^{\infty} e^{- \omega \psi^2 / 2}\, p(\omega \mid b, 0)\, \text{d}\omega
\\
&= 2^{-b} e^{\kappa \psi}\, \mathbb{E}_{p(\omega \mid b, 0)}\big[e^{- \omega \psi^2 / 2}\big],
\end{aligned} \tag{1}
$$

where $\kappa = a - b/2$ and $p(\omega \mid b, 0) = \text{PG}(\omega \mid b, 0)$, with $\text{PG}(\alpha, \beta)$ denoting the Pólya-gamma distribution with parameters $\alpha$ and $\beta$. Next, consider probabilistic models whose data likelihood has a functional form containing the logistic function:

$$
p(x \mid \psi) = c(x) \frac{(e^{\psi})^{a(x)}}{(1 + e^{\psi})^{b(x)}}, \tag{2}
$$

for some functions of the data $a(\cdot)$, $b(\cdot)$, and $c(\cdot)$. For example, in Bernoulli regression, $a(x) = x$ and $b(x) = c(x) = 1$. If we condition on the Pólya-gamma random variable $\omega$, the expectation in Eq. 1 is replaced by its integrand evaluated at that $\omega$, and the identity allows us to write the conditional distribution $p(\psi \mid x, \omega)$ as conditionally Gaussian,

$$
p(\psi \mid x, \omega) \propto p(x \mid \psi, \omega)\, p(\psi) = e^{\kappa \psi} e^{- \omega \psi^2 / 2}\, p(\psi), \tag{3}
$$

provided $p(\psi)$ is Gaussian. Furthermore, Polson proved that, conditioned on $\psi$, the Pólya-gamma random variable has distribution

$$
\omega \mid \psi \sim \text{PG}(b, \psi). \tag{4}
$$

This means we can construct a Gibbs sampler that iteratively samples from Eq. 3 and then Eq. 4, making Bayesian inference for logistic models tractable.
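To make this two-step recipe concrete, here is a minimal sketch of such a Gibbs sampler for a toy intercept-only Bernoulli model with a single logit $\psi$ and a Gaussian prior $\psi \sim \mathcal{N}(\mu_0, \sigma_0^2)$. The helper `sample_pg` is my own rough approximation to Pólya-gamma draws via a truncated version of the sum-of-gammas representation in Polson et al.; in practice you would use a dedicated sampler (e.g., the `pypolyagamma` package).

```python
import numpy as np

def sample_pg(b, c, trunc=200, size=1, rng=None):
    """Approximate draws from PG(b, c) by truncating the infinite
    sum-of-gammas representation of Polson et al. (2013)."""
    rng = np.random.default_rng() if rng is None else rng
    k = np.arange(1, trunc + 1)
    denom = (k - 0.5) ** 2 + c ** 2 / (4 * np.pi ** 2)
    g = rng.gamma(shape=b, scale=1.0, size=(size, trunc))
    return (g / denom).sum(axis=1) / (2 * np.pi ** 2)

rng = np.random.default_rng(0)

# Toy data: n Bernoulli observations sharing a single logit psi.
x = rng.binomial(1, 0.8, size=50)
mu0, s0_sq = 0.0, 10.0
kappa = x - 0.5                      # kappa_i = a(x_i) - b(x_i)/2 = x_i - 1/2

psi, samples = 0.0, []
for _ in range(2000):
    # Analog of Eq. 4: omega_i | psi ~ PG(1, psi), one per observation.
    omega = sample_pg(1.0, psi, size=len(x), rng=rng)
    # Analog of Eq. 3: psi | x, omega is Gaussian.
    v = 1.0 / (1.0 / s0_sq + omega.sum())
    m = v * (mu0 / s0_sq + kappa.sum())
    psi = rng.normal(m, np.sqrt(v))
    samples.append(psi)

# The posterior mean of sigma(psi) should sit near the empirical success rate.
print(np.mean(1.0 / (1.0 + np.exp(-np.array(samples[500:])))))
```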

Pólya-gamma augmentation of multinomial distributions

Linderman et al. extend Polson's idea to multinomial distributions by rewriting the multinomial density as a product of binomial densities:

$$
\begin{aligned}
\text{mult}(\mathbf{x} \mid N, \boldsymbol{\pi}) &= \prod_{k=1}^{K-1} \text{binom}(x_k \mid N_k, \tilde{\pi}_k),
\\
N_k &= N - \sum_{j < k} x_j, \quad \tilde{\pi}_k = \frac{\pi_k}{1 - \sum_{j < k} \pi_j}, \quad k = 2, 3, \dots, K,
\\
N_1 &= N, \quad \tilde{\pi}_1 = \pi_1.
\end{aligned} \tag{5}
$$
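As a quick numerical sanity check of Eq. 5 (assuming SciPy is available; the snippet and variable names are mine, not the paper's):

```python
import numpy as np
from scipy.stats import binom, multinomial

# Compare the multinomial pmf with the stick-breaking product of binomials.
x = np.array([3, 1, 4, 2])
pi = np.array([0.2, 0.3, 0.4, 0.1])
N = x.sum()

lhs = multinomial.pmf(x, n=N, p=pi)

rhs, N_k, mass = 1.0, N, 1.0
for k in range(len(x) - 1):
    pi_tilde = pi[k] / mass          # pi~_k = pi_k / (1 - sum_{j<k} pi_j)
    rhs *= binom.pmf(x[k], N_k, pi_tilde)
    N_k -= x[k]                      # N_{k+1} = N - sum_{j<=k} x_j
    mass -= pi[k]

print(lhs, rhs)  # The two values should agree.
```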

See my previous post for a proof of Eq. 5. Now define $\tilde{\pi}_k \triangleq \sigma(\psi_k)$, where $\sigma(\cdot)$ is the logistic function. Then we can use this stick-breaking representation to write our $K$-dimensional multinomial as

$$
\begin{aligned}
\text{mult}(\mathbf{x} \mid N, \boldsymbol{\pi}) &= \prod_{k=1}^{K-1} {N_k \choose x_k} \sigma(\psi_k)^{x_k} \big(1 - \sigma(\psi_k)\big)^{N_k - x_k}
\\
&= \prod_{k=1}^{K-1} {N_k \choose x_k} \frac{(e^{\psi_k})^{x_k}}{(1 + e^{\psi_k})^{N_k}}.
\end{aligned} \tag{6}
$$

Note that each term inside the product in Eq. 6 can be re-written using Pólya-gamma augmentation. Let's introduce a Pólya-gamma random variable $\omega_k$ for each of the $K-1$ components in the stick-breaking representation of $\mathbf{x}$. Thus, let $\boldsymbol{\omega} = [\omega_1 \dots \omega_{K-1}]^{\top}$ and $\boldsymbol{\psi} = [\psi_1 \dots \psi_{K-1}]^{\top}$. Since the term $\kappa$ in the previous section depends on $a(\cdot)$ and $b(\cdot)$, which are now indexed by $k$, let $\boldsymbol{\kappa} = [\kappa_1 \dots \kappa_{K-1}]^{\top}$, where $\kappa_k = x_k - N_k/2$. Now we can write the likelihood in Eq. 6, conditioned on $\boldsymbol{\omega}$, as:

$$
\begin{aligned}
\text{mult}(\mathbf{x} \mid N, \boldsymbol{\pi}, \boldsymbol{\omega})
&= \prod_{k=1}^{K-1} {N_k \choose x_k} \frac{(e^{\psi_k})^{x_k}}{(1 + e^{\psi_k})^{N_k}}
\\
&= \prod_{k=1}^{K-1} {N_k \choose x_k} 2^{-N_k} \exp\Big\{\kappa_k \psi_k\Big\} \exp\Big\{-\frac{\omega_k \psi_k^2}{2}\Big\}
\\
&\propto \prod_{k=1}^{K-1} \exp\Big\{ \kappa_k \psi_k - \frac{\omega_k \psi_k^2}{2} \Big\}.
\end{aligned} \tag{7}
$$
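The per-component counts $N_k$ and the terms $\kappa_k = x_k - N_k/2$ above are simple functions of the raw count vector. A small helper to compute them (my own code, just for illustration):

```python
import numpy as np

def stick_breaking_counts(x):
    """Return (N_k, kappa_k) for the K-1 stick-breaking components of a
    length-K count vector x, where N_k = N - sum_{j<k} x_j and
    kappa_k = x_k - N_k / 2."""
    x = np.asarray(x, dtype=float)
    N = x.sum()
    cum_before = np.concatenate(([0.0], np.cumsum(x[:-2])))  # sum_{j<k} x_j
    N_k = N - cum_before
    kappa_k = x[:-1] - N_k / 2
    return N_k, kappa_k

N_k, kappa_k = stick_breaking_counts([3, 1, 4, 2])
print(N_k)       # [10.  7.  6.]
print(kappa_k)   # [-2.  -2.5  1. ]
```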

Now let's complete the square in the exponent, dropping terms that do not depend on $\psi_k$:

$$
\begin{aligned}
\exp\Big\{ \kappa_k \psi_k - \frac{\omega_k \psi_k^2}{2} \Big\}
&= \exp \Big\{- \frac{\omega_k}{2} \Big[ \psi_k^2 - 2 \frac{\kappa_k}{\omega_k} \psi_k \Big] \Big\}
\\
&= \exp \Big\{- \frac{\omega_k}{2} \Big[ \psi_k^2 - 2 \frac{\kappa_k}{\omega_k} \psi_k + \Big(\frac{\kappa_k}{\omega_k}\Big)^2 - \Big(\frac{\kappa_k}{\omega_k}\Big)^2 \Big] \Big\}
\\
&\propto \exp \Big\{- \frac{\omega_k}{2} \Big[ \psi_k^2 - 2 \frac{\kappa_k}{\omega_k} \psi_k + \Big(\frac{\kappa_k}{\omega_k}\Big)^2 \Big] \Big\}
\\
&= \exp \Big\{- \frac{\omega_k}{2} \Big( \psi_k - \frac{\kappa_k}{\omega_k}\Big)^2 \Big\}.
\end{aligned} \tag{8}
$$

Note that this term, which is quadratic in $\psi_k$, is a Gaussian kernel. If we assume each of the $K-1$ components is independent, we can write the last line of Eq. 7 as a multivariate Gaussian:

$$
\text{mult}(\mathbf{x} \mid N, \boldsymbol{\pi}, \boldsymbol{\omega}) \propto \prod_{k=1}^{K-1} \exp\Big\{ \kappa_k \psi_k - \frac{\omega_k \psi_k^2}{2} \Big\} \propto \mathcal{N}\big( \boldsymbol{\psi} \mid \boldsymbol{\Omega}^{-1} \boldsymbol{\kappa}, \boldsymbol{\Omega}^{-1} \big), \tag{9}
$$

where $\boldsymbol{\Omega} = \text{diag}(\boldsymbol{\omega})$ is a $(K-1) \times (K-1)$ matrix.

In words, this means that we can construct a Gibbs sampler similar to Polson's because we can sample from this conditionally Gaussian distribution, which is analogous to Eq. 3. However, in this case, each observation $\mathbf{x}_m$, where $m$ indexes the $M$ observations, conditions not on a single Pólya-gamma random variable but on a $(K-1)$-vector of them. Thus, the analog to Eq. 4 is

$$
\omega_{m,k} \mid \mathbf{x}_m, \psi_{m,k} \sim \text{PG}(N_{m,k}, \psi_{m,k}), \tag{10}
$$

where $\mathbf{X} = [\mathbf{x}_1, \dots, \mathbf{x}_M]^{\top}$ is the matrix of observations.
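Conditioned on the current values of $\boldsymbol{\Psi}$, Eq. 10 says that every $\omega_{m,k}$ is an independent Pólya-gamma draw. A minimal sketch of that update, reusing the `sample_pg` helper from the earlier sketch (again my own code, not the authors'):

```python
import numpy as np

def sample_omega(Psi, N_counts, rng=None):
    """Draw the M x (K-1) matrix of Polya-gamma variables in Eq. 10, one per
    observation m and stick-breaking component k."""
    rng = np.random.default_rng() if rng is None else rng
    M, Km1 = Psi.shape
    Omega = np.zeros((M, Km1))
    for m in range(M):
        for k in range(Km1):
            if N_counts[m, k] > 0:   # PG(0, c) is a point mass at zero
                Omega[m, k] = sample_pg(N_counts[m, k], Psi[m, k], rng=rng)[0]
    return Omega
```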

Finally, we need a way to map between the normalized probability vector $\boldsymbol{\pi}$ and the $(K-1)$-vector $\boldsymbol{\psi}$. Let $\pi_{\texttt{SB}}(\cdot)$ be a function, $\mathbb{R}^{K-1} \mapsto [0, 1]^K$, transforming the stick-breaking representation $\boldsymbol{\psi}$ into a $K$-vector of normalized probabilities $\boldsymbol{\pi}$.
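Here is one possible implementation of $\pi_{\texttt{SB}}(\cdot)$ as a logistic stick-breaking map (the function name and details are my own sketch):

```python
import numpy as np

def pi_sb(psi):
    """Map a length K-1 vector psi to a length-K probability vector pi via
    logistic stick-breaking: pi_k = sigma(psi_k) * (1 - sum_{j<k} pi_j)."""
    pi_tilde = 1.0 / (1.0 + np.exp(-np.asarray(psi)))   # sigma(psi_k)
    pi = np.empty(len(pi_tilde) + 1)
    remaining = 1.0
    for k, p in enumerate(pi_tilde):
        pi[k] = p * remaining
        remaining -= pi[k]
    pi[-1] = remaining           # the last component takes the leftover mass
    return pi

print(pi_sb([0.0, 0.0, 0.0]))    # [0.5   0.25  0.125 0.125]
```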

Gaussian processes with multinomial observations

Let $\mathbf{X}$ be an $M \times K$ matrix of count data, and assume that each observation (a row vector) is multinomially distributed. Let $\mathbf{Z}$ be an $M \times D$ matrix of GP inputs or covariates, and let $\boldsymbol{\Psi}$ be an $M \times (K-1)$ matrix whose $m$-th row, $k$-th column entry is $\psi_{m,k}$. Furthermore, assume that each column of $\boldsymbol{\Psi}$ evolves according to a Gaussian process (GP). Then the GP assumption used by Linderman et al. is

$$
\begin{aligned}
\boldsymbol{\Psi}_{:,k} &\sim \mathcal{GP}(\boldsymbol{\mu}_k, \mathbf{C}),
\\
\mathbf{x}_m &\sim \text{mult}\big(N_m, \pi_{\texttt{SB}}(\boldsymbol{\psi}_{m,:})\big),
\end{aligned} \tag{11}
$$

where $\mathbf{C}$ is a covariance matrix linking the inputs, so that entry $C_{i,j}$ is the covariance between $\mathbf{z}_i$ and $\mathbf{z}_j$. In words, each row of our data is a multinomial random variable, but the columns (features) share structure through the GP prior.
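For concreteness, here is one way to build $\mathbf{C}$ from the inputs $\mathbf{Z}$, using a squared-exponential kernel; the kernel choice and hyperparameters are illustrative assumptions on my part, not prescribed by the paper.

```python
import numpy as np

def rbf_covariance(Z, length_scale=1.0, variance=1.0, jitter=1e-6):
    """Squared-exponential covariance over the GP inputs Z (M x D), so that
    C[i, j] is the covariance between z_i and z_j."""
    sq_dists = np.sum((Z[:, None, :] - Z[None, :, :]) ** 2, axis=-1)
    C = variance * np.exp(-0.5 * sq_dists / length_scale ** 2)
    return C + jitter * np.eye(len(Z))   # jitter for numerical stability

Z = np.linspace(0, 1, 5).reshape(-1, 1)  # M = 5 one-dimensional inputs
print(rbf_covariance(Z).shape)           # (5, 5)
```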

Since a GP prior induces a multivariate normal distribution on any finite collection of data points, we can combine the two Gaussian factors by summing their quadratic terms:

$$
\begin{aligned}
\boldsymbol{\psi}_{:,k} \mid \mathbf{X}, \mathbf{Z}, \boldsymbol{\omega}_k, \boldsymbol{\mu}_k, \mathbf{C}
&\propto \mathcal{N}\big( \boldsymbol{\psi}_{:,k} \mid \boldsymbol{\Omega}_k^{-1} \boldsymbol{\kappa}_k, \boldsymbol{\Omega}_k^{-1} \big)\, \mathcal{N}\big( \boldsymbol{\psi}_{:,k} \mid \boldsymbol{\mu}_k, \mathbf{C} \big)
\\
&\propto \mathcal{N}\big( \boldsymbol{\psi}_{:,k} \mid \mathbf{m}, \mathbf{V} \big),
\end{aligned} \tag{12}
$$

where

$$
\begin{aligned}
\mathbf{V} &= (\mathbf{C}^{-1} + \boldsymbol{\Omega}_k)^{-1},
\\
\mathbf{m} &= \mathbf{V}(\mathbf{C}^{-1} \boldsymbol{\mu}_k + \boldsymbol{\Omega}_k \boldsymbol{\Omega}_k^{-1} \boldsymbol{\kappa}_k)
\\
&= \mathbf{V}(\mathbf{C}^{-1} \boldsymbol{\mu}_k + \boldsymbol{\kappa}_k).
\end{aligned} \tag{13}
$$
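A sketch of the resulting update for a single column $\boldsymbol{\psi}_{:,k}$, transcribing Eqs. 12 and 13 directly (naive matrix inverses for clarity; a Cholesky-based solve would be preferable for larger $M$):

```python
import numpy as np

def sample_psi_column(kappa_k, omega_k, mu_k, C, rng=None):
    """Draw psi_{:,k} from the conditional Gaussian in Eqs. 12-13, given the
    length-M vectors kappa_k and omega_k for component k."""
    rng = np.random.default_rng() if rng is None else rng
    C_inv = np.linalg.inv(C)
    V = np.linalg.inv(C_inv + np.diag(omega_k))   # V = (C^{-1} + Omega_k)^{-1}
    V = 0.5 * (V + V.T)                           # enforce symmetry numerically
    m = V @ (C_inv @ mu_k + kappa_k)              # m = V (C^{-1} mu_k + kappa_k)
    return rng.multivariate_normal(m, V)
```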

That's it. If you understand Pólya-gamma augmentation and the stick-breaking representation in Eq. 5, then this kind of model (Eqs. 12 and 13) is a natural extension.

Implementation

Often, I like to see a blog post through to implementation. However, Linderman et al. already released very readable Python code for this model. For example, here is the sampling function for the Gibbs-sampling version of a GP with multinomial observations. As you can see, they iteratively sample $\boldsymbol{\Psi}$ and $\boldsymbol{\Omega}$ using the conditional distributions we derived here (Eqs. 10 and 12).
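For completeness, here is a compressed sketch of the full Gibbs sweep on synthetic data, stitching together the helpers defined above (`sample_pg`, `stick_breaking_counts`, `sample_omega`, `pi_sb`, `rbf_covariance`, `sample_psi_column`). The structure mirrors the updates derived here, but the code and names are mine, not the authors'.

```python
import numpy as np

rng = np.random.default_rng(1)
M, K = 20, 4
Z = np.linspace(0, 1, M).reshape(-1, 1)
C = rbf_covariance(Z, length_scale=0.2)
mu = np.zeros((M, K - 1))                # GP means, one column per component

# Toy multinomial counts and their stick-breaking summaries.
X = rng.multinomial(30, [0.4, 0.3, 0.2, 0.1], size=M)
N_counts = np.stack([stick_breaking_counts(x)[0] for x in X])
kappa = np.stack([stick_breaking_counts(x)[1] for x in X])

Psi = np.zeros((M, K - 1))
for it in range(100):
    Omega = sample_omega(Psi, N_counts, rng=rng)                  # Eq. 10
    for k in range(K - 1):
        Psi[:, k] = sample_psi_column(kappa[:, k], Omega[:, k],   # Eqs. 12-13
                                      mu[:, k], C, rng=rng)

# Rough per-category probabilities from the final state; these should land
# near the generating probabilities [0.4, 0.3, 0.2, 0.1].
print(np.mean([pi_sb(row) for row in Psi], axis=0))
```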

  1. Linderman, S., Johnson, M. J., & Adams, R. P. (2015). Dependent multinomial models made easy: Stick-breaking with the Pólya-gamma augmentation. Advances in Neural Information Processing Systems, 3456–3464.
  2. Polson, N. G., Scott, J. G., & Windle, J. (2013). Bayesian inference for logistic models using Pólya–Gamma latent variables. Journal of the American Statistical Association, 108(504), 1339–1349.