A Stick-Breaking Representation of the Multinomial Distribution

Following Linderman, Johnson, and Adam's 2015 paper, "Dependent multinomial models made easy: Stick-breaking with the Pólya-gamma augmentation", I show that a multinomial density can be represented as a product of binomial densities.

Published

01 July 2020

In (Linderman et al., 2015), Linderman et al use a stick-breaking representation of the multinomial distribution to decompose it into a product of binomials. This is nice because they can then introduce Pólya-gamma augmentation to make Bayesian inference for multinomial models tractable. The key step is to write a $K$ -dimensional multinomial density as the product of $K-1$ binomial densities:

$\begin{aligned} \text{mult}(\mathbf{x} \mid N, \boldsymbol{\pi}) &= \prod_{k=1}^{K-1} \text{binom}(x_k \mid N_k, \tilde{\pi}_k) \\ N_k &= N - \sum_{j < k} x_j, \quad \tilde{\pi}_k = \frac{\pi_k}{1 - \sum_{j < k} \pi_k}, \quad k = 2, 3, \dots, K, \\ N_1 &= N, \quad \tilde{\pi}_1 = \pi_1. \end{aligned} \tag{1}$

This is called a “stick-breaking” representation because you can imagine the multinomial’s probability vector $\boldsymbol{\pi} = [\pi_1 \dots \pi_K]^{\top}$ is a stick that is recursively broken in two pieces to generate binomial random variables.

While this is the key identity used in the paper, Linderman et al do not provide a derivation, and it wasn’t obvious to me that this holds. Why is it the product of $K-1$ binomials and not $K$ binomials? Why does $N_k$ decrease in quantity? It feels right, but I wanted proof. So here it is.

The densities for the multinomial and binomial distributions are:

$\begin{aligned} \text{mult}(\mathbf{x} \mid N, \boldsymbol{\pi}) &= \frac{N!}{x_1! x_2! \dots x_K!} \pi_1^{x_1} \pi_2^{x_2} \dots \pi_{K}^{x_K}, \\\\ \text{binom}(x_k \mid N_k, \tilde{\pi}_k) &= {N_k \choose x_k} \tilde{\pi}_k^{x_k} (1 - \tilde{\pi}_k)^{N_k - x_k}. \end{aligned} \tag{2}$

First, let’s only compare the normalizers between the two representations. We’ll see that they are equivalent. The normalizers for the stick-breaking representation are

$\prod_{k=1}^{K-1} {N_k \choose x_k} = {N \choose x_1} {N - x_1 \choose x_2} {N - (x_1 + x_2) \choose x_3} \dots {N - (x_1 + \dots + x_{k-1}) \choose x_{k-1}}. \tag{3}$

Let’s write a few terms out to see that most sub-terms cancel:

$\begin{aligned} {N \choose x_1} &= \frac{N!}{x_1! \color{#11accd}{(N - x_1)!}}, \\ {N - x_1 \choose x_2} &= \frac{\color{#11accd}{(N - x_1)!}}{x_2! \color{#bc2612}{(N - (x_1 + x_2))!}}, \\ {N - (x_1 + x_2) \choose x_3} &= \frac{\color{#bc2612}{(N - (x_1 + x_2))!}}{x_3! \color{#807504}{(N - (x_1 + x_2 + x_3))!}}, \\ &\;\;\vdots \\ {N - (x_1 + \dots + x_{k-2}) \choose x_{k-1}} &= \frac{\color{#8D55DB}{(N - (x_1 + \dots + x_{k-2}))!}}{x_{k-1}! (N - (x_1 + \dots + x_{k-1}))!}. \end{aligned} \tag{4}$

We can see that the $(b-a)!$ terms all cancel in ${a \choose b}$ , which I have colored, except for the last term. The last term is,

$(N - (x_1 + \dots + x_{k-1}))! = x_K!, \tag{5}$

by the definition in Eq. $1$ . This immediately gives the multinomial normalizer in Eq. $2$ :

$\prod_{k=1}^{K-1} {N_k \choose x_k} = \frac{N!}{x_1! x_2! \dots x_K!}. \tag{6}$

The equivalence of the remaining terms, those not associated with the normalizer, follows the same logic. Let’s first write out the $k$ -th term explicitly in terms of $\pi_1, \dots, \pi_k$ rather $\tilde{\pi}_k$ :

$\begin{aligned} \tilde{\pi}_k^{x_k} (1 - \tilde{\pi}_k)^{N_k - x_k} &= \Big( \frac{\pi_k}{1 - (\pi_1 + \dots + \pi_{k-1})} \Big)^{x_k} \Big( 1 - \frac{\pi_k}{1 - (\pi_1 + \dots + \pi_{k-1})} \Big)^{N_k - x_k} \\ &= \Big( \frac{\pi_k}{1 - (\pi_1 + \dots + \pi_{k-1})} \Big)^{x_k} \Big( \frac{1 - (\pi_1 + \dots + \pi_k)}{1 - (\pi_1 + \dots + \pi_{k-1})} \Big)^{N_k - x_k} \\ &= \pi_k^{x_k} \Big( \frac{1}{1 - (\pi_1 + \dots + \pi_{k-1})} \Big)^{N_k} (1 - (\pi_1 + \dots + \pi_k))^{N_k - x_k} \end{aligned} \tag{7}$

Using this simplification, we can write out a few terms to see again that most sub-terms cancel:

$\begin{aligned} \tilde{\pi}_1^{x_1} (1 - \tilde{\pi}_1)^{N_1 - x_1} &= \pi_1^{x_1} \color{#11accd}{(1 - \pi_1)^{N_1 - x_1}}, \\ \tilde{\pi}_2^{x_2} (1 - \tilde{\pi}_2)^{N_2 - x_2} &= \pi_2^{x_2} \color{#11accd}{\Big( \frac{1}{1 - \pi_1} \Big)^{N_2}} \color{#bc2612}{(1 - (\pi_1 + \pi_2))^{N_2 - x_2}}, \\ \tilde{\pi}_3^{x_3} (1 - \tilde{\pi}_3)^{N_3 - x_3} &= \pi_3^{x_3} \color{#bc2612}{\Big( \frac{1}{1 - (\pi_1 + \pi_2)} \Big)^{N_3}} \color{#807504}{(1 - (\pi_1 + \pi_2 + \pi_3))^{N_3 - x_3}}. \\ &\;\;\vdots \\ \tilde{\pi}_{K-1}^{x_{K-1}} (1 - \tilde{\pi}_{K-1})^{N_{K-1} - x_{K-1}} &= \pi_{K-1}^{x_{K-1}} \color{#8D55DB}{\Big( \frac{1}{1 - (\pi_1 + \dots + \pi_{K-2})} \Big)^{N_{K-1}}} \\ &\;\;\;\;(1 - (\pi_1 + \dots + \pi_{K-1}))^{N_{K-1} - x_{K-1}}. \end{aligned} \tag{8}$

Again, I’ve used colors to denote pairs of terms that cancel. If it’s not immediately obvious why this works, note that $N_k = N_{k-1} - x_{k-1}$ . Finally, note that

$(1 - (\pi_1 + \dots + \pi_{K-1}))^{N_{K-1} - x_{K-1}} = \pi_K^{x_K}, \tag{9}$

because the $\pi_k$ terms sum to one. So we have shown,

$\prod_{k=1}^{K-1} \tilde{\pi}_k^{x_k} (1 - \tilde{\pi}_k)^{N_k - x_k} = \pi_1^{x_1} \pi_2^{x_2} \dots \pi_K^{x_K}, \tag{10}$

and we’re done. As I mentioned, this is cool because now we can use Pólya-gamma augmentation. Just define $\tilde{\pi}_k \triangleq \sigma(\psi_k)$ where $\sigma(\cdot)$ is the logistic function. Then we can use this stick-breaking representation to write our $K$ -dimensional multinomial as

$\begin{aligned} \text{mult}(\mathbf{x} \mid N, \boldsymbol{\pi}) &= \prod_{k=1}^{K-1} {N_k \choose x_k} \sigma(\psi_k)^{x_k} (1 - \sigma(\psi_k))^{N_k - x_k} \\ &= \prod_{k=1}^{K-1} {N_k \choose x_k} \frac{(e^{\psi_k})^{x_k}}{(1 + e^{\psi_k})^{N_k}}. \end{aligned} \tag{11}$

This product of logistic functions is amendable to the identity in Theorem $1$ in (Polson et al., 2013).

Linderman, S., Johnson, M. J., & Adams, R. P. (2015). Dependent multinomial models made easy: Stick-breaking with the polya-gamma augmentation. Advances in Neural Information Processing Systems, 3456–3464.
Polson, N. G., Scott, J. G., & Windle, J. (2013). Bayesian inference for logistic models using Pólya–Gamma latent variables. Journal of the American Statistical Association, 108(504), 1339–1349.