Gaussian Processes with Multinomial Observations

Linderman, Johnson, and Adams's 2015 paper, "Dependent multinomial models made easy: Stick-breaking with the Pólya-gamma augmentation", introduces a Gibbs sampler for Gaussian processes with multinomial observations. I discuss this model in detail.

In "Dependent multinomial models made easy: Stick-breaking with the Pólya-gamma augmentation" (Linderman et al., 2015), Linderman et al. leverage a representation of the $K$-dimensional multinomial distribution as the product of $K-1$ binomial distributions to construct a model that is amenable to Pólya-gamma augmentation. I discussed this representation and its correctness in a previous post. In this post, I want to construct the Gibbs sampler for Gaussian process regression with multinomial observations discussed in the paper.

Pólya-gamma augmentation

Let’s review Pólya-gamma augmentation to set up the notation used by Linderman et al. Conceptually, Pólya-gamma augmentation is fairly straightforward, but the mathematical details are a bit tedious. See my previous post if you want a more detailed presentation.

In (Polson et al., 2013), Polson proved two useful properties of Pólya-gamma random variables. First, he proved the following identity:

$$
\begin{aligned}
\frac{(e^{\psi})^a}{(1 + e^{\psi})^b}
&= 2^{-b} e^{\kappa \psi} \int_{0}^{\infty} e^{- \omega \psi^2 / 2}\, p(\omega \mid b, 0)\, \text{d}\omega
\\
&= 2^{-b} e^{\kappa \psi}\, \mathbb{E}_{p(\omega \mid b, 0)}\big[e^{- \omega \psi^2 / 2}\big],
\end{aligned} \tag{1}
$$

where $\kappa = a - b/2$ and $p(\omega \mid b, 0) = \text{PG}(\omega \mid b, 0)$, with $\text{PG}(\alpha, \beta)$ denoting the Pólya-gamma distribution with parameters $\alpha$ and $\beta$. Next, consider probabilistic models whose data likelihood has a functional form containing the logistic function:

$$
p(x \mid \psi) = c(x) \frac{(e^{\psi})^{a(x)}}{(1 + e^{\psi})^{b(x)}}, \tag{2}
$$

for some functions of the data $a(\cdot)$, $b(\cdot)$, and $c(\cdot)$. For example, in Bernoulli regression, $a(x) = x$ and $b(x) = c(x) = 1$. If we condition on the Pólya-gamma random variable $\omega$, the expectation in Eq. 1 is replaced by its integrand evaluated at that $\omega$, and the identity allows us to write the conditional distribution $p(\psi \mid x, \omega)$ as conditionally Gaussian,

$$
p(\psi \mid x, \omega) \propto p(x \mid \psi, \omega)\, p(\psi) = e^{\kappa \psi} e^{- \omega \psi^2 / 2}\, p(\psi), \tag{3}
$$

provided $p(\psi)$ is Gaussian. Furthermore, Polson proved that, conditioned on $\psi$, the Pólya-gamma random variable has distribution

$$
\omega \mid \psi \sim \text{PG}(b, \psi). \tag{4}
$$

This means we can construct a Gibbs sampler that iteratively samples from Eq. 3 and then Eq. 4, making Bayesian inference for logistic models tractable.
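To make this two-step recipe concrete, here is a minimal sketch of such a Gibbs sampler for a toy intercept-only Bernoulli model with a single logit $\psi$ and a Gaussian prior $\psi \sim \mathcal{N}(\mu_0, \sigma_0^2)$. The helper `sample_pg` is my own rough approximation to Pólya-gamma draws via a truncated version of the sum-of-gammas representation in Polson et al.; in practice you would use a dedicated sampler (e.g., the `pypolyagamma` package).

```python
import numpy as np

def sample_pg(b, c, trunc=200, size=1, rng=None):
    """Approximate draws from PG(b, c) by truncating the infinite
    sum-of-gammas representation of Polson et al. (2013)."""
    rng = np.random.default_rng() if rng is None else rng
    k = np.arange(1, trunc + 1)
    denom = (k - 0.5) ** 2 + c ** 2 / (4 * np.pi ** 2)
    g = rng.gamma(shape=b, scale=1.0, size=(size, trunc))
    return (g / denom).sum(axis=1) / (2 * np.pi ** 2)

rng = np.random.default_rng(0)

# Toy data: n Bernoulli observations sharing a single logit psi.
x = rng.binomial(1, 0.8, size=50)
mu0, s0_sq = 0.0, 10.0
kappa = x - 0.5                      # kappa_i = a(x_i) - b(x_i)/2 = x_i - 1/2

psi, samples = 0.0, []
for _ in range(2000):
    # Analog of Eq. 4: omega_i | psi ~ PG(1, psi), one per observation.
    omega = sample_pg(1.0, psi, size=len(x), rng=rng)
    # Analog of Eq. 3: psi | x, omega is Gaussian.
    v = 1.0 / (1.0 / s0_sq + omega.sum())
    m = v * (mu0 / s0_sq + kappa.sum())
    psi = rng.normal(m, np.sqrt(v))
    samples.append(psi)

# The posterior mean of sigma(psi) should sit near the empirical success rate.
print(np.mean(1.0 / (1.0 + np.exp(-np.array(samples[500:])))))
```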

Pólya-gamma augmentation of multinomial distributions

Linderman et al. extend Polson's idea to multinomial distributions by rewriting the multinomial density as a product of binomial densities:

$$
\begin{aligned}
\text{mult}(\mathbf{x} \mid N, \boldsymbol{\pi}) &= \prod_{k=1}^{K-1} \text{binom}(x_k \mid N_k, \tilde{\pi}_k),
\\
N_k &= N - \sum_{j < k} x_j, \quad \tilde{\pi}_k = \frac{\pi_k}{1 - \sum_{j < k} \pi_j}, \quad k = 2, 3, \dots, K,
\\
N_1 &= N, \quad \tilde{\pi}_1 = \pi_1.
\end{aligned} \tag{5}
$$
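As a quick numerical sanity check of Eq. 5 (assuming SciPy is available; the snippet and variable names are mine, not the paper's):

```python
import numpy as np
from scipy.stats import binom, multinomial

# Compare the multinomial pmf with the stick-breaking product of binomials.
x = np.array([3, 1, 4, 2])
pi = np.array([0.2, 0.3, 0.4, 0.1])
N = x.sum()

lhs = multinomial.pmf(x, n=N, p=pi)

rhs, N_k, mass = 1.0, N, 1.0
for k in range(len(x) - 1):
    pi_tilde = pi[k] / mass          # pi~_k = pi_k / (1 - sum_{j<k} pi_j)
    rhs *= binom.pmf(x[k], N_k, pi_tilde)
    N_k -= x[k]                      # N_{k+1} = N - sum_{j<=k} x_j
    mass -= pi[k]

print(lhs, rhs)  # The two values should agree.
```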

See my previous post for a proof of Eq. 5. Now define $\tilde{\pi}_k \triangleq \sigma(\psi_k)$, where $\sigma(\cdot)$ is the logistic function. Then we can use this stick-breaking representation to write our $K$-dimensional multinomial as

$$
\begin{aligned}
\text{mult}(\mathbf{x} \mid N, \boldsymbol{\pi}) &= \prod_{k=1}^{K-1} {N_k \choose x_k} \sigma(\psi_k)^{x_k} \big(1 - \sigma(\psi_k)\big)^{N_k - x_k}
\\
&= \prod_{k=1}^{K-1} {N_k \choose x_k} \frac{(e^{\psi_k})^{x_k}}{(1 + e^{\psi_k})^{N_k}}.
\end{aligned} \tag{6}
$$

Note that each term inside the product in Eq. 6 can be re-written using Pólya-gamma augmentation. Let's introduce a Pólya-gamma random variable $\omega_k$ for each of the $K-1$ components in the stick-breaking representation of $\mathbf{x}$. Thus, let $\boldsymbol{\omega} = [\omega_1 \dots \omega_{K-1}]^{\top}$ and $\boldsymbol{\psi} = [\psi_1 \dots \psi_{K-1}]^{\top}$. Since the term $\kappa$ in the previous section depends on $a(\cdot)$ and $b(\cdot)$, which are now indexed by $k$, let $\boldsymbol{\kappa} = [\kappa_1 \dots \kappa_{K-1}]^{\top}$, where $\kappa_k = x_k - N_k/2$. Now we can write the likelihood in Eq. 6, conditioned on $\boldsymbol{\omega}$, as:

$$
\begin{aligned}
\text{mult}(\mathbf{x} \mid N, \boldsymbol{\pi}, \boldsymbol{\omega})
&= \prod_{k=1}^{K-1} {N_k \choose x_k} \frac{(e^{\psi_k})^{x_k}}{(1 + e^{\psi_k})^{N_k}}
\\
&= \prod_{k=1}^{K-1} {N_k \choose x_k} 2^{-N_k} \exp\Big\{\kappa_k \psi_k\Big\} \exp\Big\{-\frac{\omega_k \psi_k^2}{2}\Big\}
\\
&\propto \prod_{k=1}^{K-1} \exp\Big\{ \kappa_k \psi_k - \frac{\omega_k \psi_k^2}{2} \Big\}.
\end{aligned} \tag{7}
$$
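The per-component counts $N_k$ and the terms $\kappa_k = x_k - N_k/2$ above are simple functions of the raw count vector. A small helper to compute them (my own code, just for illustration):

```python
import numpy as np

def stick_breaking_counts(x):
    """Return (N_k, kappa_k) for the K-1 stick-breaking components of a
    length-K count vector x, where N_k = N - sum_{j<k} x_j and
    kappa_k = x_k - N_k / 2."""
    x = np.asarray(x, dtype=float)
    N = x.sum()
    cum_before = np.concatenate(([0.0], np.cumsum(x[:-2])))  # sum_{j<k} x_j
    N_k = N - cum_before
    kappa_k = x[:-1] - N_k / 2
    return N_k, kappa_k

N_k, kappa_k = stick_breaking_counts([3, 1, 4, 2])
print(N_k)       # [10.  7.  6.]
print(kappa_k)   # [-2.  -2.5  1. ]
```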

Now let's complete the square in the exponent, dropping terms that do not depend on $\psi_k$:

$$
\begin{aligned}
\exp\Big\{ \kappa_k \psi_k - \frac{\omega_k \psi_k^2}{2} \Big\}
&= \exp \Big\{- \frac{\omega_k}{2} \Big[ \psi_k^2 - 2 \frac{\kappa_k}{\omega_k} \psi_k \Big] \Big\}
\\
&= \exp \Big\{- \frac{\omega_k}{2} \Big[ \psi_k^2 - 2 \frac{\kappa_k}{\omega_k} \psi_k + \Big(\frac{\kappa_k}{\omega_k}\Big)^2 - \Big(\frac{\kappa_k}{\omega_k}\Big)^2 \Big] \Big\}
\\
&\propto \exp \Big\{- \frac{\omega_k}{2} \Big[ \psi_k^2 - 2 \frac{\kappa_k}{\omega_k} \psi_k + \Big(\frac{\kappa_k}{\omega_k}\Big)^2 \Big] \Big\}
\\
&= \exp \Big\{- \frac{\omega_k}{2} \Big( \psi_k - \frac{\kappa_k}{\omega_k}\Big)^2 \Big\}.
\end{aligned} \tag{8}
$$

Note that this term, which is quadratic in $\psi_k$, is a Gaussian kernel. If we assume each of the $K-1$ components is independent, we can write the last line of Eq. 7 as a multivariate Gaussian:

$$
\text{mult}(\mathbf{x} \mid N, \boldsymbol{\pi}, \boldsymbol{\omega}) \propto \prod_{k=1}^{K-1} \exp\Big\{ \kappa_k \psi_k - \frac{\omega_k \psi_k^2}{2} \Big\} \propto \mathcal{N}\big( \boldsymbol{\psi} \mid \boldsymbol{\Omega}^{-1} \boldsymbol{\kappa}, \boldsymbol{\Omega}^{-1} \big), \tag{9}
$$

where $\boldsymbol{\Omega} = \text{diag}(\boldsymbol{\omega})$ is a $(K-1) \times (K-1)$ matrix.

In words, this means that we can construct a Gibbs sampler similar to Polson's because we can sample from this conditionally Gaussian distribution, which is analogous to Eq. 3. However, in this case, each observation $\mathbf{x}_m$, where $m$ indexes the $M$ observations, conditions not on a single Pólya-gamma random variable but on a $(K-1)$-vector of them. Thus, the analog to Eq. 4 is

$$
\omega_{m,k} \mid \mathbf{x}_m, \psi_{m,k} \sim \text{PG}(N_{m,k}, \psi_{m,k}), \tag{10}
$$

where $\mathbf{X} = [\mathbf{x}_1, \dots, \mathbf{x}_M]^{\top}$ is the matrix of observations.
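Conditioned on the current values of $\boldsymbol{\Psi}$, Eq. 10 says that every $\omega_{m,k}$ is an independent Pólya-gamma draw. A minimal sketch of that update, reusing the `sample_pg` helper from the earlier sketch (again my own code, not the authors'):

```python
import numpy as np

def sample_omega(Psi, N_counts, rng=None):
    """Draw the M x (K-1) matrix of Polya-gamma variables in Eq. 10, one per
    observation m and stick-breaking component k."""
    rng = np.random.default_rng() if rng is None else rng
    M, Km1 = Psi.shape
    Omega = np.zeros((M, Km1))
    for m in range(M):
        for k in range(Km1):
            if N_counts[m, k] > 0:   # PG(0, c) is a point mass at zero
                Omega[m, k] = sample_pg(N_counts[m, k], Psi[m, k], rng=rng)[0]
    return Omega
```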

Finally, we need a way to map between the normalized probability vector $\boldsymbol{\pi}$ and the $(K-1)$-vector $\boldsymbol{\psi}$. Let $\pi_{\texttt{SB}}(\cdot)$ be a function, $\mathbb{R}^{K-1} \mapsto [0, 1]^K$, transforming the stick-breaking representation $\boldsymbol{\psi}$ into a $K$-vector of normalized probabilities $\boldsymbol{\pi}$.
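Here is one possible implementation of $\pi_{\texttt{SB}}(\cdot)$ as a logistic stick-breaking map (the function name and details are my own sketch):

```python
import numpy as np

def pi_sb(psi):
    """Map a length K-1 vector psi to a length-K probability vector pi via
    logistic stick-breaking: pi_k = sigma(psi_k) * (1 - sum_{j<k} pi_j)."""
    pi_tilde = 1.0 / (1.0 + np.exp(-np.asarray(psi)))   # sigma(psi_k)
    pi = np.empty(len(pi_tilde) + 1)
    remaining = 1.0
    for k, p in enumerate(pi_tilde):
        pi[k] = p * remaining
        remaining -= pi[k]
    pi[-1] = remaining           # the last component takes the leftover mass
    return pi

print(pi_sb([0.0, 0.0, 0.0]))    # [0.5   0.25  0.125 0.125]
```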

Gaussian processes with multinomial observations

Let $\mathbf{X}$ be an $M \times K$ matrix of count data, and assume that each observation (a row vector) is multinomially distributed. Let $\mathbf{Z}$ be an $M \times D$ matrix of GP inputs or covariates, and let $\boldsymbol{\Psi}$ be an $M \times (K-1)$ matrix whose $m$-th row, $k$-th column entry is $\psi_{m,k}$. Furthermore, assume that each column of $\boldsymbol{\Psi}$ evolves according to a Gaussian process (GP). Then the GP assumption used by Linderman et al. is

$$
\begin{aligned}
\boldsymbol{\Psi}_{:,k} &\sim \mathcal{GP}(\boldsymbol{\mu}_k, \mathbf{C}),
\\
\mathbf{x}_m &\sim \text{mult}\big(N_m, \pi_{\texttt{SB}}(\boldsymbol{\psi}_{m,:})\big),
\end{aligned} \tag{11}
$$

where $\mathbf{C}$ is a covariance matrix linking the inputs, so that entry $C_{i,j}$ is the covariance between $\mathbf{z}_i$ and $\mathbf{z}_j$. In words, each row of our data is a multinomial random variable, but the columns (features) share structure through the GP prior.
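For concreteness, here is one way to build $\mathbf{C}$ from the inputs $\mathbf{Z}$, using a squared-exponential kernel; the kernel choice and hyperparameters are illustrative assumptions on my part, not prescribed by the paper.

```python
import numpy as np

def rbf_covariance(Z, length_scale=1.0, variance=1.0, jitter=1e-6):
    """Squared-exponential covariance over the GP inputs Z (M x D), so that
    C[i, j] is the covariance between z_i and z_j."""
    sq_dists = np.sum((Z[:, None, :] - Z[None, :, :]) ** 2, axis=-1)
    C = variance * np.exp(-0.5 * sq_dists / length_scale ** 2)
    return C + jitter * np.eye(len(Z))   # jitter for numerical stability

Z = np.linspace(0, 1, 5).reshape(-1, 1)  # M = 5 one-dimensional inputs
print(rbf_covariance(Z).shape)           # (5, 5)
```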

Since a GP prior induces a multivariate normal distribution on any finite collection of data points, we can combine the two Gaussian factors by summing their quadratic terms:

$$
\begin{aligned}
\boldsymbol{\psi}_{:,k} \mid \mathbf{X}, \mathbf{Z}, \boldsymbol{\omega}_k, \boldsymbol{\mu}_k, \mathbf{C}
&\propto \mathcal{N}\big( \boldsymbol{\psi}_{:,k} \mid \boldsymbol{\Omega}_k^{-1} \boldsymbol{\kappa}_k, \boldsymbol{\Omega}_k^{-1} \big)\, \mathcal{N}\big( \boldsymbol{\psi}_{:,k} \mid \boldsymbol{\mu}_k, \mathbf{C} \big)
\\
&\propto \mathcal{N}\big( \boldsymbol{\psi}_{:,k} \mid \mathbf{m}, \mathbf{V} \big),
\end{aligned} \tag{12}
$$

where

$$
\begin{aligned}
\mathbf{V} &= (\mathbf{C}^{-1} + \boldsymbol{\Omega}_k)^{-1},
\\
\mathbf{m} &= \mathbf{V}(\mathbf{C}^{-1} \boldsymbol{\mu}_k + \boldsymbol{\Omega}_k \boldsymbol{\Omega}_k^{-1} \boldsymbol{\kappa}_k)
\\
&= \mathbf{V}(\mathbf{C}^{-1} \boldsymbol{\mu}_k + \boldsymbol{\kappa}_k).
\end{aligned} \tag{13}
$$
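A sketch of the resulting update for a single column $\boldsymbol{\psi}_{:,k}$, transcribing Eqs. 12 and 13 directly (naive matrix inverses for clarity; a Cholesky-based solve would be preferable for larger $M$):

```python
import numpy as np

def sample_psi_column(kappa_k, omega_k, mu_k, C, rng=None):
    """Draw psi_{:,k} from the conditional Gaussian in Eqs. 12-13, given the
    length-M vectors kappa_k and omega_k for component k."""
    rng = np.random.default_rng() if rng is None else rng
    C_inv = np.linalg.inv(C)
    V = np.linalg.inv(C_inv + np.diag(omega_k))   # V = (C^{-1} + Omega_k)^{-1}
    V = 0.5 * (V + V.T)                           # enforce symmetry numerically
    m = V @ (C_inv @ mu_k + kappa_k)              # m = V (C^{-1} mu_k + kappa_k)
    return rng.multivariate_normal(m, V)
```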

That's it. If you understand Pólya-gamma augmentation and the stick-breaking representation in Eq. 5, then this kind of model (Eqs. 12 and 13) is a natural extension.

Implementation

Often, I like to see a blog post through to implementation. However, Linderman et al. already released very readable Python code for this model. For example, here is the sampling function for the Gibbs-sampling version of a GP with multinomial observations. As you can see, they iteratively sample $\boldsymbol{\Psi}$ and $\boldsymbol{\Omega}$ using the conditional distributions we derived here (Eqs. 10 and 12).
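For completeness, here is a compressed sketch of the full Gibbs sweep on synthetic data, stitching together the helpers defined above (`sample_pg`, `stick_breaking_counts`, `sample_omega`, `pi_sb`, `rbf_covariance`, `sample_psi_column`). The structure mirrors the updates derived here, but the code and names are mine, not the authors'.

```python
import numpy as np

rng = np.random.default_rng(1)
M, K = 20, 4
Z = np.linspace(0, 1, M).reshape(-1, 1)
C = rbf_covariance(Z, length_scale=0.2)
mu = np.zeros((M, K - 1))                # GP means, one column per component

# Toy multinomial counts and their stick-breaking summaries.
X = rng.multinomial(30, [0.4, 0.3, 0.2, 0.1], size=M)
N_counts = np.stack([stick_breaking_counts(x)[0] for x in X])
kappa = np.stack([stick_breaking_counts(x)[1] for x in X])

Psi = np.zeros((M, K - 1))
for it in range(100):
    Omega = sample_omega(Psi, N_counts, rng=rng)                  # Eq. 10
    for k in range(K - 1):
        Psi[:, k] = sample_psi_column(kappa[:, k], Omega[:, k],   # Eqs. 12-13
                                      mu[:, k], C, rng=rng)

# Rough per-category probabilities from the final state; these should land
# near the generating probabilities [0.4, 0.3, 0.2, 0.1].
print(np.mean([pi_sb(row) for row in Psi], axis=0))
```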

  1. Linderman, S., Johnson, M. J., & Adams, R. P. (2015). Dependent multinomial models made easy: Stick-breaking with the Pólya-gamma augmentation. Advances in Neural Information Processing Systems, 3456–3464.
  2. Polson, N. G., Scott, J. G., & Windle, J. (2013). Bayesian inference for logistic models using Pólya–Gamma latent variables. Journal of the American Statistical Association, 108(504), 1339–1349.