In "Dependent multinomial models made easy: Stick-breaking with the Pólya-gamma augmentation" (Linderman et al., 2015), the authors leverage a representation of the $K$-dimensional multinomial distribution as a product of $K-1$ binomial distributions to construct a model that admits Pólya-gamma augmentation. I discussed this representation and its correctness in a previous post. In this post, I want to construct the Gibbs sampler for Gaussian process regression with multinomial observations discussed in the paper.
Pólya-gamma augmentation
Let’s review Pólya-gamma augmentation to set up the notation used by Linderman et al. Conceptually, Pólya-gamma augmentation is fairly straightforward, but the mathematical details are a bit tedious. See my previous post if you want a more detailed presentation.
In (Polson et al., 2013), the authors proved two useful properties of Pólya-gamma random variables. First, they established the following identity:
$$
\frac{(e^{\psi})^{a}}{(1 + e^{\psi})^{b}} = 2^{-b} e^{\kappa \psi} \int_{0}^{\infty} e^{-\omega \psi^2 / 2} \, p(\omega \mid b, 0) \, d\omega = 2^{-b} e^{\kappa \psi} \, \mathbb{E}_{p(\omega \mid b, 0)}\big[ e^{-\omega \psi^2 / 2} \big], \tag{1}
$$
where $\kappa = a - b/2$ and $p(\omega \mid b, 0) = \text{PG}(\omega \mid b, 0)$, with $\text{PG}(\alpha, \beta)$ denoting the Pólya-gamma distribution with parameters $\alpha$ and $\beta$. Next, consider probabilistic models whose data likelihood has a functional form containing the logistic function:
$$
p(x \mid \psi) = c(x) \frac{(e^{\psi})^{a(x)}}{(1 + e^{\psi})^{b(x)}}, \tag{2}
$$
for some functions of the data $a(\cdot)$, $b(\cdot)$, and $c(\cdot)$. For example, in Bernoulli regression, $a(x) = x$ and $b(x) = c(x) = 1$. If we condition on the Pólya-gamma random variables $\omega$, the expectation in Eq. 1 becomes a constant, and the identity allows us to write the conditional distribution $p(\psi \mid x, \omega)$ as conditionally Gaussian,
$$
p(\psi \mid x, \omega) \propto p(x \mid \psi, \omega) \, p(\psi) \propto e^{\kappa \psi} e^{-\omega \psi^2 / 2} \, p(\psi), \tag{3}
$$
provided $p(\psi)$ is Gaussian. Furthermore, Polson et al proved that, conditioned on $\psi$, the Pólya-gamma random variables have distribution
$$
\omega \mid \psi \sim \text{PG}(b, \psi). \tag{4}
$$
This means we can construct a Gibbs sampler that iteratively samples from Eq. 3 and then Eq. 4, making Bayesian inference for logistic models tractable.
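To make this concrete, here is a minimal sketch of such a Gibbs sampler for the simplest case: $n$ Bernoulli observations sharing a single parameter $\psi$ with a Gaussian prior. It assumes the pypolyagamma package for the PG draws; the setup and variable names here are mine, a toy example rather than a full implementation.

```python
import numpy as np
from pypolyagamma import PyPolyaGamma

pg = PyPolyaGamma(seed=0)
rng = np.random.default_rng(0)

# n Bernoulli observations with a shared logit psi; here a(x) = x, b(x) = 1.
x = rng.binomial(1, 0.8, size=100)
kappa = x - 0.5                # kappa_i = a(x_i) - b(x_i)/2 = x_i - 1/2.
mu0, var0 = 0.0, 1.0           # Gaussian prior psi ~ N(mu0, var0).

psi = 0.0
for _ in range(1000):
    # Eq. 4: omega_i | psi ~ PG(1, psi), one per observation.
    omega = np.array([pg.pgdraw(1.0, psi) for _ in x])
    # Eq. 3: psi | x, omega is Gaussian; combine prior and augmented likelihood.
    v = 1.0 / (1.0 / var0 + omega.sum())
    m = v * (mu0 / var0 + kappa.sum())
    psi = rng.normal(m, np.sqrt(v))

print(1.0 / (1.0 + np.exp(-psi)))  # A posterior draw of the success probability.
```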
Pólya-gamma augmentation of multinomial distributions
Linderman et al extend Polson’s idea to multinomial distributions by rewriting the multinomial density as a product of binomial densities:
$$
\begin{aligned}
\text{mult}(x \mid N, \pi) &= \prod_{k=1}^{K-1} \text{binom}(x_k \mid N_k, \tilde{\pi}_k), \\
N_k &= N - \sum_{j < k} x_j, \qquad \tilde{\pi}_k = \frac{\pi_k}{1 - \sum_{j < k} \pi_j}, \qquad k = 2, 3, \dots, K-1, \\
N_1 &= N, \qquad \tilde{\pi}_1 = \pi_1.
\end{aligned} \tag{5}
$$
See my previous post for a proof of this identity. Now define $\tilde{\pi}_k \triangleq \sigma(\psi_k)$, where $\sigma(\cdot)$ is the logistic function. Then we can use this stick-breaking representation to write our $K$-dimensional multinomial as
$$
\text{mult}(x \mid N, \pi) = \prod_{k=1}^{K-1} \binom{N_k}{x_k} \sigma(\psi_k)^{x_k} \big(1 - \sigma(\psi_k)\big)^{N_k - x_k} = \prod_{k=1}^{K-1} \binom{N_k}{x_k} \frac{(e^{\psi_k})^{x_k}}{(1 + e^{\psi_k})^{N_k}}. \tag{6}
$$
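As a quick numerical sanity check of the stick-breaking identity in Eq. 5, here is a snippet of mine, assuming scipy:

```python
import numpy as np
from scipy.stats import binom, multinomial

rng = np.random.default_rng(0)
K, N = 4, 10
pi = rng.dirichlet(np.ones(K))
x = rng.multinomial(N, pi)

lhs = multinomial.pmf(x, N, pi)  # K-dimensional multinomial density.

rhs = 1.0
for k in range(K - 1):
    N_k = N - x[:k].sum()              # N_k = N - sum_{j<k} x_j.
    pi_k = pi[k] / (1 - pi[:k].sum())  # pi~_k = pi_k / (1 - sum_{j<k} pi_j).
    rhs *= binom.pmf(x[k], N_k, pi_k)  # Product of K-1 binomial densities.

assert np.isclose(lhs, rhs)
```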
Note that each term inside the product in Eq. 6 can be rewritten using Pólya-gamma augmentation. Let’s introduce a Pólya-gamma random variable $\omega_k$ for each of the $K-1$ components in the stick-breaking representation of $x$. Thus, let $\omega = [\omega_1 \dots \omega_{K-1}]^{\top}$ and $\psi = [\psi_1 \dots \psi_{K-1}]^{\top}$. Since the term $\kappa$ in the previous section depends on $a(\cdot)$ and $b(\cdot)$, which are now indexed by $k$, let $\kappa = [\kappa_1 \dots \kappa_{K-1}]^{\top}$, where $\kappa_k = x_k - N_k/2$. Now we can write the likelihood in Eq. 6, conditioned on $\omega$, as:
$$
\begin{aligned}
\text{mult}(x \mid N, \pi, \omega) &= \prod_{k=1}^{K-1} \binom{N_k}{x_k} \frac{(e^{\psi_k})^{x_k}}{(1 + e^{\psi_k})^{N_k}} \\
&= \prod_{k=1}^{K-1} \binom{N_k}{x_k} \, 2^{-N_k} \exp\{\kappa_k \psi_k\} \exp\{-\omega_k \psi_k^2 / 2\} \\
&\propto \prod_{k=1}^{K-1} \exp\Big\{\kappa_k \psi_k - \frac{\omega_k \psi_k^2}{2}\Big\}.
\end{aligned} \tag{7}
$$
Now let’s complete the square in the exponent, dropping terms that do not depend on $\psi_k$:
$$
\begin{aligned}
\exp\Big\{\kappa_k \psi_k - \frac{\omega_k \psi_k^2}{2}\Big\}
&= \exp\Big\{ -\frac{\omega_k}{2} \Big[ \psi_k^2 - \frac{2\kappa_k}{\omega_k} \psi_k \Big] \Big\} \\
&= \exp\Big\{ -\frac{\omega_k}{2} \Big[ \psi_k^2 - \frac{2\kappa_k}{\omega_k} \psi_k + \Big(\frac{\kappa_k}{\omega_k}\Big)^2 - \Big(\frac{\kappa_k}{\omega_k}\Big)^2 \Big] \Big\} \\
&\propto \exp\Big\{ -\frac{\omega_k}{2} \Big[ \psi_k^2 - \frac{2\kappa_k}{\omega_k} \psi_k + \Big(\frac{\kappa_k}{\omega_k}\Big)^2 \Big] \Big\} \\
&= \exp\Big\{ -\frac{\omega_k}{2} \Big( \psi_k - \frac{\kappa_k}{\omega_k} \Big)^2 \Big\}.
\end{aligned} \tag{8}
$$
Note that this term, which is quadratic in $\psi_k$, is a Gaussian kernel. If we assume the $K-1$ components are independent, we can write the last line of Eq. 7 as a multivariate Gaussian:
$$
\text{mult}(x \mid N, \pi, \omega) \propto \prod_{k=1}^{K-1} \exp\Big\{\kappa_k \psi_k - \frac{\omega_k \psi_k^2}{2}\Big\} \propto \mathcal{N}(\psi \mid \Omega^{-1} \kappa, \Omega^{-1}), \tag{9}
$$
where $\Omega = \text{diag}(\omega)$ is a $(K-1) \times (K-1)$ matrix.
In words, this means we can construct a Gibbs sampler similar to Polson’s, because we can sample from this conditionally Gaussian distribution, which is analogous to Eq. 3. However, in this case, each observation $x_m$, where $m$ indexes $M$ observations, conditions not on a single Pólya-gamma random variable but on a $(K-1)$-vector of them. Thus, the analog to Eq. 4 is
$$
\omega_{m,k} \mid x_m, \psi_{m,k} \sim \text{PG}(N_{m,k}, \psi_{m,k}), \tag{10}
$$
where $X = [x_1, \dots, x_M]^{\top}$ stacks the $M$ observations.
Finally, we need a way to map between the normalized probability vector $\pi$ and the $(K-1)$-vector $\psi$. Let $\pi_{\text{SB}}(\cdot): \mathbb{R}^{K-1} \mapsto [0,1]^K$ be the function transforming the stick-breaking representation $\psi$ into a $K$-vector of normalized probabilities $\pi$.
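Here is one minimal numpy sketch of this map (the name pi_sb and the implementation are mine):

```python
import numpy as np

def pi_sb(psi):
    """Map psi in R^{K-1} to a probability K-vector via stick-breaking:
    pi_k = sigma(psi_k) * prod_{j<k} (1 - sigma(psi_j))."""
    sigma = 1.0 / (1.0 + np.exp(-psi))
    stick = np.concatenate(([1.0], np.cumprod(1.0 - sigma)))  # Stick remaining before each break.
    return np.append(stick[:-1] * sigma, stick[-1])           # Final entry is the leftover stick.
```

For example, pi_sb(np.zeros(3)) returns [0.5, 0.25, 0.125, 0.125]: each $\psi_k = 0$ breaks off half of the remaining stick, and the last entry is whatever is left over.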
Gaussian processes with multinomial observations
Let $X$ be an $M \times K$ matrix of count data, and assume that each observation (a row vector) is multinomially distributed. Let $Z$ be an $M \times D$ matrix of GP inputs or covariates, and let $\Psi$ be an $M \times (K-1)$ matrix whose $m$-th row and $k$-th column entry is $\psi_{m,k}$. Furthermore, assume that each column of $\Psi$ evolves according to a Gaussian process (GP) over the inputs. Then the GP assumption used by Linderman et al is
$$
\begin{aligned}
\Psi_{:,k} &\sim \mathcal{GP}(\mu_k, C), \\
x_m &\sim \text{mult}\big(N_m, \pi_{\text{SB}}(\psi_{m,:})\big),
\end{aligned} \tag{11}
$$
where $C$ is a covariance matrix linking the inputs, so entry $C_{i,j}$ is the covariance between $z_i$ and $z_j$. In words, each row of our data is a multinomial random variable, but the columns (features) share structure through the GP prior.
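To make the generative model in Eq. 11 concrete, here is a sketch of sampling synthetic data, assuming one-dimensional inputs, zero GP means, a squared-exponential kernel, and the pi_sb function from the previous section (all of these choices and names are mine):

```python
import numpy as np

rng = np.random.default_rng(0)
M, K, N = 50, 4, 20
Z = np.linspace(0, 5, M)[:, None]  # M x D GP inputs, with D = 1 here.

# Squared-exponential covariance: C_ij = exp(-|z_i - z_j|^2 / 2).
C = np.exp(-0.5 * (Z - Z.T) ** 2) + 1e-6 * np.eye(M)  # Jitter for stability.

# Eq. 11, line 1: each column of Psi is a draw from GP(0, C).
Psi = rng.multivariate_normal(np.zeros(M), C, size=K - 1).T  # M x (K-1).

# Eq. 11, line 2: each row of X is multinomial with stick-breaking probabilities.
X = np.stack([rng.multinomial(N, pi_sb(Psi[m])) for m in range(M)])
```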
Because the augmented likelihood factorizes over the $K-1$ components, we can collect the terms for the $k$-th component across all $M$ observations. Since a GP prior induces a multivariate normal distribution on any finite collection of data points, the conditional for each column of $\Psi$ is a product of two Gaussians, and we can sum the quadratics:
$$
p(\psi_{:,k} \mid X, Z, \omega_k, \mu_k, C) \propto \mathcal{N}(\psi_{:,k} \mid \Omega_k^{-1} \kappa_k, \Omega_k^{-1}) \, \mathcal{N}(\psi_{:,k} \mid \mu_k, C) \propto \mathcal{N}(\psi_{:,k} \mid m, V), \tag{12}
$$
where
$$
\begin{aligned}
V &= (C^{-1} + \Omega_k)^{-1}, \\
m &= V\big(C^{-1} \mu_k + \Omega_k \Omega_k^{-1} \kappa_k\big) = V\big(C^{-1} \mu_k + \kappa_k\big),
\end{aligned} \tag{13}
$$

where $\Omega_k = \text{diag}(\omega_k)$ and $\kappa_k$ is the $M$-vector with entries $\kappa_{m,k} = x_{m,k} - N_{m,k}/2$.
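To make Eq. 13 concrete, here is a minimal numpy sketch of the update for a single column of $\Psi$ (the function name is mine, and a practical implementation would use Cholesky factorizations rather than explicit inverses):

```python
import numpy as np

def sample_psi_column(kappa_k, omega_k, mu_k, C, rng):
    """Draw psi_{:,k} ~ N(m, V) per Eqs. 12 and 13.

    kappa_k, omega_k, and mu_k are M-vectors; C is the M x M GP covariance.
    """
    C_inv = np.linalg.inv(C)
    V = np.linalg.inv(C_inv + np.diag(omega_k))  # V = (C^{-1} + Omega_k)^{-1}.
    m = V @ (C_inv @ mu_k + kappa_k)             # m = V(C^{-1} mu_k + kappa_k).
    return rng.multivariate_normal(m, V)
```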
That’s it. If you understand Pólya-gamma augmentation and the stick-breaking representation in Eq. 5, then this kind of model (Eq. 12 and 13) is a natural extension.
Implementation
Often, I like to see a blog post through to implementation. However, Linderman et al have already released very readable Python code for this model. For example, here is the sampling function for the Gibbs-sampling version of a GP with multinomial observations. As you can see, they iteratively sample $\Psi$ and $\Omega$ using the conditional distributions we derived here (Eq. 10 and 12).
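That said, here is a condensed sketch of one full Gibbs sweep in the notation of this post, combining Eq. 10 with the column update sketched above. It assumes the pypolyagamma package, the sample_psi_column helper from the previous section, and an $M \times (K-1)$ matrix mu of GP means; it is meant to mirror the derivation here, not to reproduce Linderman et al's released code.

```python
import numpy as np
from pypolyagamma import PyPolyaGamma

pg = PyPolyaGamma(seed=0)
rng = np.random.default_rng(0)

def gibbs_sweep(X, Psi, mu, C):
    """One Gibbs sweep: draw Omega via Eq. 10, then each Psi column via Eq. 12."""
    M, K = X.shape
    N = X.sum(axis=1)
    for k in range(K - 1):
        # Stick-breaking counts: N_{m,k} = N_m - sum_{j<k} x_{m,j} (Eq. 5).
        N_k = N - X[:, :k].sum(axis=1)
        kappa_k = X[:, k] - N_k / 2.0  # kappa_{m,k} = x_{m,k} - N_{m,k}/2.
        # Eq. 10: omega_{m,k} | x_m, psi_{m,k} ~ PG(N_{m,k}, psi_{m,k}).
        # PG(0, psi) is a point mass at zero, so skip exhausted rows.
        omega_k = np.array([pg.pgdraw(float(N_k[m]), Psi[m, k]) if N_k[m] > 0
                            else 0.0 for m in range(M)])
        # Eqs. 12-13: conditionally Gaussian update for the k-th column of Psi.
        Psi[:, k] = sample_psi_column(kappa_k, omega_k, mu[:, k], C, rng)
    return Psi
```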