In (Linderman et al., 2015), Linderman et al use a stick-breaking representation of the multinomial distribution to decompose it into a product of binomials. This is nice because they can then introduce Pólya-gamma augmentation to make Bayesian inference for multinomial models tractable. The key step is to write a K-dimensional multinomial density as the product of K−1 binomial densities:
mult(x∣N,π)NkN1=k=1∏K−1binom(xk∣Nk,π~k)=N−j<k∑xj,π~k=1−∑j<kπkπk,k=2,3,…,K,=N,π~1=π1.(1)
This is called a “stick-breaking” representation because you can imagine the multinomial’s probability vector π=[π1…πK]⊤ is a stick that is recursively broken in two pieces to generate binomial random variables.
While this is the key identity used in the paper, Linderman et al do not provide a derivation, and it wasn’t obvious to me that this holds. Why is it the product of K−1 binomials and not K binomials? Why does Nk decrease in quantity? It feels right, but I wanted proof. So here it is.
The densities for the multinomial and binomial distributions are:
mult(x∣N,π)binom(xk∣Nk,π~k)=x1!x2!…xK!N!π1x1π2x2…πKxK,=(xkNk)π~kxk(1−π~k)Nk−xk.(2)
First, let’s only compare the normalizers between the two representations. We’ll see that they are equivalent. The normalizers for the stick-breaking representation are
k=1∏K−1(xkNk)=(x1N)(x2N−x1)(x3N−(x1+x2))…(xk−1N−(x1+⋯+xk−1)).(3)
Let’s write a few terms out to see that most sub-terms cancel:
(x1N)(x2N−x1)(x3N−(x1+x2))(xk−1N−(x1+⋯+xk−2))=x1!(N−x1)!N!,=x2!(N−(x1+x2))!(N−x1)!,=x3!(N−(x1+x2+x3))!(N−(x1+x2))!,⋮=xk−1!(N−(x1+⋯+xk−1))!(N−(x1+⋯+xk−2))!.(4)
We can see that the (b−a)! terms all cancel in (ba), which I have colored, except for the last term. The last term is,
(N−(x1+⋯+xk−1))!=xK!,(5)
by the definition in Eq. 1. This immediately gives the multinomial normalizer in Eq. 2:
k=1∏K−1(xkNk)=x1!x2!…xK!N!.(6)
The equivalence of the remaining terms, those not associated with the normalizer, follows the same logic. Let’s first write out the k-th term explicitly in terms of π1,…,πk rather π~k:
π~kxk(1−π~k)Nk−xk=(1−(π1+⋯+πk−1)πk)xk(1−1−(π1+⋯+πk−1)πk)Nk−xk=(1−(π1+⋯+πk−1)πk)xk(1−(π1+⋯+πk−1)1−(π1+⋯+πk))Nk−xk=πkxk(1−(π1+⋯+πk−1)1)Nk(1−(π1+⋯+πk))Nk−xk(7)
Using this simplification, we can write out a few terms to see again that most sub-terms cancel:
π~1x1(1−π~1)N1−x1π~2x2(1−π~2)N2−x2π~3x3(1−π~3)N3−x3π~K−1xK−1(1−π~K−1)NK−1−xK−1=π1x1(1−π1)N1−x1,=π2x2(1−π11)N2(1−(π1+π2))N2−x2,=π3x3(1−(π1+π2)1)N3(1−(π1+π2+π3))N3−x3.⋮=πK−1xK−1(1−(π1+⋯+πK−2)1)NK−1(1−(π1+⋯+πK−1))NK−1−xK−1.(8)
Again, I’ve used colors to denote pairs of terms that cancel. If it’s not immediately obvious why this works, note that Nk=Nk−1−xk−1. Finally, note that
(1−(π1+⋯+πK−1))NK−1−xK−1=πKxK,(9)
because the πk terms sum to one. So we have shown,
k=1∏K−1π~kxk(1−π~k)Nk−xk=π1x1π2x2…πKxK,(10)
and we’re done. As I mentioned, this is cool because now we can use Pólya-gamma augmentation. Just define π~k≜σ(ψk) where σ(⋅) is the logistic function. Then we can use this stick-breaking representation to write our K-dimensional multinomial as
mult(x∣N,π)=k=1∏K−1(xkNk)σ(ψk)xk(1−σ(ψk))Nk−xk=k=1∏K−1(xkNk)(1+eψk)Nk(eψk)xk.(11)
This product of logistic functions is amendable to the identity in Theorem 1 in (Polson et al., 2013).