A Stick-Breaking Representation of the Multinomial Distribution

Following Linderman, Johnson, and Adams's 2015 paper, "Dependent multinomial models made easy: Stick-breaking with the Pólya-gamma augmentation", I show that a multinomial density can be represented as a product of binomial densities.

In (Linderman et al., 2015), the authors use a stick-breaking representation of the multinomial distribution to decompose it into a product of binomials. This is nice because they can then introduce Pólya-gamma augmentation to make Bayesian inference for multinomial models tractable. The key step is to write a $K$-dimensional multinomial density as the product of $K-1$ binomial densities:

$$
\begin{aligned}
\text{mult}(\mathbf{x} \mid N, \boldsymbol{\pi}) &= \prod_{k=1}^{K-1} \text{binom}(x_k \mid N_k, \tilde{\pi}_k), \\
N_k &= N - \sum_{j < k} x_j, \quad \tilde{\pi}_k = \frac{\pi_k}{1 - \sum_{j < k} \pi_j}, \quad k = 2, 3, \dots, K-1, \\
N_1 &= N, \quad \tilde{\pi}_1 = \pi_1.
\end{aligned} \tag{1}
$$

This is called a “stick-breaking” representation because you can imagine the multinomial’s probability vector $\boldsymbol{\pi} = [\pi_1 \dots \pi_K]^{\top}$ as a stick that is recursively broken into two pieces to generate binomial random variables.
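The identity is also easy to sanity-check numerically. Here is a minimal pure-Python sketch (the function names and example values are my own, not from the paper) comparing the multinomial density to the product of binomials in Eq. 1:

```python
import math

def multinomial_pmf(x, N, pi):
    """Multinomial density: N! / (x_1! ... x_K!) * prod_k pi_k^{x_k}."""
    coef = math.factorial(N)
    for xk in x:
        coef //= math.factorial(xk)
    return coef * math.prod(p**xk for p, xk in zip(pi, x))

def binom_pmf(x, n, p):
    """Binomial density: (n choose x) p^x (1-p)^{n-x}."""
    return math.comb(n, x) * p**x * (1 - p)**(n - x)

def stick_breaking_pmf(x, N, pi):
    """Product of K-1 binomials per Eq. 1."""
    prob, N_k, remaining = 1.0, N, 1.0
    for k in range(len(pi) - 1):         # k = 1, ..., K-1 (0-indexed here)
        pi_tilde = pi[k] / remaining     # pi_k / (1 - sum_{j<k} pi_j)
        prob *= binom_pmf(x[k], N_k, pi_tilde)
        N_k -= x[k]                      # N_{k+1} = N_k - x_k
        remaining -= pi[k]
    return prob

# Arbitrary example: K = 4 categories, N = 10 trials.
x, N, pi = [3, 1, 2, 4], 10, [0.1, 0.2, 0.3, 0.4]
assert abs(multinomial_pmf(x, N, pi) - stick_breaking_pmf(x, N, pi)) < 1e-12
```

The two densities agree to floating-point precision for any valid counts and probability vector.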

While this is the key identity used in the paper, Linderman et al. do not provide a derivation, and it wasn’t obvious to me that it holds. Why is it a product of $K-1$ binomials rather than $K$? Why does $N_k$ decrease with $k$? It feels right, but I wanted proof. So here it is.

The densities for the multinomial and binomial distributions are:

$$
\begin{aligned}
\text{mult}(\mathbf{x} \mid N, \boldsymbol{\pi}) &= \frac{N!}{x_1! x_2! \dots x_K!} \pi_1^{x_1} \pi_2^{x_2} \dots \pi_{K}^{x_K}, \\
\text{binom}(x_k \mid N_k, \tilde{\pi}_k) &= {N_k \choose x_k} \tilde{\pi}_k^{x_k} (1 - \tilde{\pi}_k)^{N_k - x_k}.
\end{aligned} \tag{2}
$$

First, let’s compare only the normalizers of the two representations and see that they are equivalent. The normalizer for the stick-breaking representation is

$$
\prod_{k=1}^{K-1} {N_k \choose x_k} = {N \choose x_1} {N - x_1 \choose x_2} {N - (x_1 + x_2) \choose x_3} \dots {N - (x_1 + \dots + x_{K-2}) \choose x_{K-1}}. \tag{3}
$$

Let’s write a few terms out to see that most sub-terms cancel:

$$
\begin{aligned}
{N \choose x_1} &= \frac{N!}{x_1! \color{#11accd}{(N - x_1)!}}, \\
{N - x_1 \choose x_2} &= \frac{\color{#11accd}{(N - x_1)!}}{x_2! \color{#bc2612}{(N - (x_1 + x_2))!}}, \\
{N - (x_1 + x_2) \choose x_3} &= \frac{\color{#bc2612}{(N - (x_1 + x_2))!}}{x_3! \color{#807504}{(N - (x_1 + x_2 + x_3))!}}, \\
&\;\;\vdots \\
{N - (x_1 + \dots + x_{K-2}) \choose x_{K-1}} &= \frac{\color{#8D55DB}{(N - (x_1 + \dots + x_{K-2}))!}}{x_{K-1}! (N - (x_1 + \dots + x_{K-1}))!}.
\end{aligned} \tag{4}
$$

We can see that the $(a-b)!$ terms in each ${a \choose b}$ all cancel, which I have colored, except in the final factor. That last remaining term is

$$
(N - (x_1 + \dots + x_{K-1}))! = x_K!, \tag{5}
$$

which holds because the counts sum to $N$ (see the definition of $N_k$ in Eq. 1). This immediately gives the multinomial normalizer in Eq. 2:

$$
\prod_{k=1}^{K-1} {N_k \choose x_k} = \frac{N!}{x_1! x_2! \dots x_K!}. \tag{6}
$$
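The telescoping in Eqs. 3 through 6 can be checked directly. This short Python snippet (example counts are my own) multiplies out the binomial coefficients and compares against the multinomial coefficient:

```python
import math

# Example counts for K = 4 categories.
x = [3, 1, 2, 4]
N = sum(x)

# Telescoping product of binomial coefficients (Eq. 3).
prod, N_k = 1, N
for x_k in x[:-1]:          # k = 1, ..., K-1
    prod *= math.comb(N_k, x_k)
    N_k -= x_k              # N_{k+1} = N_k - x_k

# Multinomial coefficient N! / (x_1! ... x_K!) (Eq. 6).
multi_coef = math.factorial(N)
for x_k in x:
    multi_coef //= math.factorial(x_k)

assert prod == multi_coef   # both equal 12600 for these counts
```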

The equivalence of the remaining terms, those not associated with the normalizer, follows the same logic. Let’s first write out the $k$-th term explicitly in terms of $\pi_1, \dots, \pi_k$ rather than $\tilde{\pi}_k$:

$$
\begin{aligned}
\tilde{\pi}_k^{x_k} (1 - \tilde{\pi}_k)^{N_k - x_k} &= \Big( \frac{\pi_k}{1 - (\pi_1 + \dots + \pi_{k-1})} \Big)^{x_k} \Big( 1 - \frac{\pi_k}{1 - (\pi_1 + \dots + \pi_{k-1})} \Big)^{N_k - x_k} \\
&= \Big( \frac{\pi_k}{1 - (\pi_1 + \dots + \pi_{k-1})} \Big)^{x_k} \Big( \frac{1 - (\pi_1 + \dots + \pi_k)}{1 - (\pi_1 + \dots + \pi_{k-1})} \Big)^{N_k - x_k} \\
&= \pi_k^{x_k} \Big( \frac{1}{1 - (\pi_1 + \dots + \pi_{k-1})} \Big)^{N_k} (1 - (\pi_1 + \dots + \pi_k))^{N_k - x_k}.
\end{aligned} \tag{7}
$$
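As a quick numerical check of the simplification in Eq. 7, the snippet below (with arbitrary example values for $\boldsymbol{\pi}$, $N_k$, and $x_k$ of my own choosing) compares the $\tilde{\pi}_k$ form against the simplified form for a single $k$:

```python
# Check Eq. 7 for k = 3 (0-indexed as pi[2]) with example values.
pi = [0.1, 0.2, 0.3, 0.4]
N_k, x_k = 6, 2
head = pi[0] + pi[1]              # pi_1 + ... + pi_{k-1}
pi_tilde = pi[2] / (1 - head)     # tilde{pi}_k = pi_k / (1 - head)

# Left-hand side: the binomial probability term in tilde{pi}_k.
lhs = pi_tilde**x_k * (1 - pi_tilde)**(N_k - x_k)

# Right-hand side: the simplified last line of Eq. 7.
rhs = pi[2]**x_k * (1 / (1 - head))**N_k * (1 - (head + pi[2]))**(N_k - x_k)

assert abs(lhs - rhs) < 1e-12
```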

Using this simplification, we can write out a few terms to see again that most sub-terms cancel:

$$
\begin{aligned}
\tilde{\pi}_1^{x_1} (1 - \tilde{\pi}_1)^{N_1 - x_1} &= \pi_1^{x_1} \color{#11accd}{(1 - \pi_1)^{N_1 - x_1}}, \\
\tilde{\pi}_2^{x_2} (1 - \tilde{\pi}_2)^{N_2 - x_2} &= \pi_2^{x_2} \color{#11accd}{\Big( \frac{1}{1 - \pi_1} \Big)^{N_2}} \color{#bc2612}{(1 - (\pi_1 + \pi_2))^{N_2 - x_2}}, \\
\tilde{\pi}_3^{x_3} (1 - \tilde{\pi}_3)^{N_3 - x_3} &= \pi_3^{x_3} \color{#bc2612}{\Big( \frac{1}{1 - (\pi_1 + \pi_2)} \Big)^{N_3}} \color{#807504}{(1 - (\pi_1 + \pi_2 + \pi_3))^{N_3 - x_3}}, \\
&\;\;\vdots \\
\tilde{\pi}_{K-1}^{x_{K-1}} (1 - \tilde{\pi}_{K-1})^{N_{K-1} - x_{K-1}} &= \pi_{K-1}^{x_{K-1}} \color{#8D55DB}{\Big( \frac{1}{1 - (\pi_1 + \dots + \pi_{K-2})} \Big)^{N_{K-1}}} \\
&\;\;\;\; (1 - (\pi_1 + \dots + \pi_{K-1}))^{N_{K-1} - x_{K-1}}.
\end{aligned} \tag{8}
$$

Again, I’ve used colors to denote pairs of terms that cancel. If it’s not immediately obvious why this works, note that $N_k = N_{k-1} - x_{k-1}$. Finally, note that

$$
(1 - (\pi_1 + \dots + \pi_{K-1}))^{N_{K-1} - x_{K-1}} = \pi_K^{x_K}, \tag{9}
$$

because the $\pi_k$ terms sum to one, making the base $\pi_K$, while the counts sum to $N$, making the exponent $N_{K-1} - x_{K-1} = x_K$. So we have shown

$$
\prod_{k=1}^{K-1} \tilde{\pi}_k^{x_k} (1 - \tilde{\pi}_k)^{N_k - x_k} = \pi_1^{x_1} \pi_2^{x_2} \dots \pi_K^{x_K}, \tag{10}
$$

and we’re done. As I mentioned, this is cool because now we can use Pólya-gamma augmentation. Just define $\tilde{\pi}_k \triangleq \sigma(\psi_k)$, where $\sigma(\cdot)$ is the logistic function. Then we can use this stick-breaking representation to write our $K$-dimensional multinomial as

$$
\begin{aligned}
\text{mult}(\mathbf{x} \mid N, \boldsymbol{\pi}) &= \prod_{k=1}^{K-1} {N_k \choose x_k} \sigma(\psi_k)^{x_k} (1 - \sigma(\psi_k))^{N_k - x_k} \\
&= \prod_{k=1}^{K-1} {N_k \choose x_k} \frac{(e^{\psi_k})^{x_k}}{(1 + e^{\psi_k})^{N_k}}.
\end{aligned} \tag{11}
$$

This product of logistic functions is amenable to the identity in Theorem 1 of (Polson et al., 2013).
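To close the loop, the reparametrization in Eq. 11 can also be checked numerically. The sketch below (example values for $\tilde{\pi}_k$, $N_k$, and $x_k$ are my own) confirms that setting $\psi_k$ to the logit of $\tilde{\pi}_k$ makes the logistic form agree with the binomial probability term:

```python
import math

def sigma(psi):
    """Logistic function: sigma(psi) = 1 / (1 + e^{-psi})."""
    return 1.0 / (1.0 + math.exp(-psi))

# Example values; psi_k is the logit of pi_tilde, so sigma(psi) = pi_tilde.
pi_tilde, N_k, x_k = 0.3, 7, 2
psi = math.log(pi_tilde / (1 - pi_tilde))

# First line of Eq. 11: sigma(psi)^{x_k} (1 - sigma(psi))^{N_k - x_k}.
term_sigma = sigma(psi)**x_k * (1 - sigma(psi))**(N_k - x_k)

# Second line of Eq. 11: (e^psi)^{x_k} / (1 + e^psi)^{N_k}.
term_exp = math.exp(psi)**x_k / (1 + math.exp(psi))**N_k

assert abs(term_sigma - term_exp) < 1e-12
assert abs(term_sigma - pi_tilde**x_k * (1 - pi_tilde)**(N_k - x_k)) < 1e-12
```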

  1. Linderman, S., Johnson, M. J., & Adams, R. P. (2015). Dependent multinomial models made easy: Stick-breaking with the Pólya-gamma augmentation. Advances in Neural Information Processing Systems, 3456–3464.
  2. Polson, N. G., Scott, J. G., & Windle, J. (2013). Bayesian inference for logistic models using Pólya–Gamma latent variables. Journal of the American Statistical Association, 108(504), 1339–1349.