Pólya-Gamma Augmentation

Bayesian inference for models with binomial likelihoods is hard, but in a 2013 paper, Nicholas Polson and his coauthors introduced a new method for fast Bayesian inference using Gibbs sampling. I discuss their main results in detail.

Consider the task of Bayesian inference for models with binomial likelihoods parameterized by log-odds. Two well-known examples of such models are logistic regression and negative binomial regression. For example, in logistic regression, the dependent variables are assumed to be i.i.d. from a Bernoulli distribution with parameter p, and therefore the likelihood function is

\mathcal{L}(p) \propto \prod_{n=1}^{N} p^{y_n} (1 - p)^{1 - y_n} = p^{\sum y_n} (1 - p)^{N - \sum y_n}. \tag{1}

The covariates are related to the response through a linear model of the log-odds,

\log\Big(\frac{p}{1-p}\Big) = \beta_0 + x_1 \beta_1 + x_2 \beta_2 + \dots + x_D \beta_D = \boldsymbol{\beta}^{\top} \mathbf{x}. \tag{2}

If we solve for p in (2), we get

p = \frac{\exp(\boldsymbol{\beta}^{\top} \mathbf{x})}{1 + \exp(\boldsymbol{\beta}^{\top} \mathbf{x})} \tag{3}

and, substituting (3) evaluated at each x_n into (1), a likelihood of

\mathcal{L}(\boldsymbol{\beta}) \propto \prod_{n=1}^{N} \frac{[\exp(\boldsymbol{\beta}^{\top} \mathbf{x}_n)]^{y_n}}{1 + \exp(\boldsymbol{\beta}^{\top} \mathbf{x}_n)}. \tag{4}

Due to this functional form, Bayesian inference for logistic regression is intractable (Bishop, 2006). This is because computing the evidence requires integrating the product of a prior distribution (e.g. a Gaussian prior on β) and the likelihood function in (4). A similar problem arises for other models with binomial likelihoods parameterized by log-odds. See A1 for a derivation of the likelihood function for negative binomial regression.
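
To make the intractability concrete, here is the evidence for the logistic model written out, assuming a Gaussian prior N(b, B) on β (the prior used later in this post):

p(\mathbf{y} \mid X) = \int \mathcal{N}(\boldsymbol{\beta} \mid \mathbf{b}, B) \prod_{n=1}^{N} \frac{[\exp(\boldsymbol{\beta}^{\top} \mathbf{x}_n)]^{y_n}}{1 + \exp(\boldsymbol{\beta}^{\top} \mathbf{x}_n)} \text{d}\boldsymbol{\beta}.

There is no closed-form expression for this integral because the logistic factors are not conjugate to the Gaussian prior.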

However, Polson et al. (2013) introduced a new method called Pólya-gamma augmentation that allows us to construct simple Gibbs samplers for these models. The goal of this post is to discuss their main results in detail, understand the derivations, and implement this Gibbs sampler.

Pólya-gamma random variables

If ω is a Pólya-gamma-distributed random variable with parameters b > 0 and c ∈ ℝ, denoted ω ∼ PG(b, c), then it is equal in distribution to an infinite weighted sum of gamma random variables:

\omega \stackrel{d}{=} \frac{1}{2 \pi^2} \sum_{k=1}^{\infty} \frac{g_{k}}{(k-1/2)^2 + c^2/(4\pi^2)}. \tag{5}

Here, the “d” above the equals sign denotes equality in distribution, and g_k ∼ gamma(b, 1) are independent gamma random variables. Note that (5) is not the density function. Instead, equality in distribution means that the Pólya-gamma random variable on the left-hand side has the same cumulative distribution function as the random variable on the right-hand side. The density function itself is more complicated (see Polson et al's first equation in Section 2.3).

While a Pólya-gamma random variable's density function is complicated, Polson et al show that all the finite moments of ω can be written in closed form. For example, the expectation can be calculated immediately,

\mathbb{E}[\omega] = \frac{b}{2c} \tanh(c/2). \tag{6}
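
As a quick sanity check of (5) and (6), here is a snippet of my own (not from the paper) that approximately samples ω by truncating the infinite sum in (5) at K terms and compares the empirical mean to the closed form:

import numpy as np
import numpy.random as npr

def pg_sample_truncated(b, c, K=1000, size=1):
    """Approximate PG(b, c) draws by truncating the sum in (5) at K terms."""
    k     = np.arange(1, K+1)
    denom = (k - 0.5)**2 + c**2 / (4 * np.pi**2)
    g     = npr.gamma(b, 1., size=(size, K))  # g_k ~ gamma(b, 1).
    return (g / denom).sum(axis=1) / (2 * np.pi**2)

b, c    = 3., 2.5
samples = pg_sample_truncated(b, c, size=100000)
print(samples.mean())              # Empirical mean of the approximate draws.
print(b / (2*c) * np.tanh(c / 2))  # Closed-form mean (6); the two should agree closely.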

In particular, Polson et al proved two useful properties of Pólya-gamma variables. First,

\frac{(e^{\psi})^a}{(1 + e^{\psi})^b} = 2^{-b} e^{\kappa \psi} \int_{0}^{\infty} e^{- \omega \psi^2 / 2} p(\omega) \text{d}\omega \tag{7}

where κ = a - b/2 and p(ω) = PG(ω ∣ b, 0). And second,

\omega \mid \psi \sim \text{PG}(b, \psi). \tag{8}

While the proof of (7) is a few lines in the paper, it is dense. See A2 for the proof with details. See A3 for a derivation of (8).
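
Before using (7), we can also convince ourselves numerically that it holds. The check below is my own; it uses the pypolyagamma package (introduced in the demo below) to draw from PG(b, 0) and approximates the integral in (7) by a Monte Carlo average:

import numpy as np
from pypolyagamma import PyPolyaGamma

pg = PyPolyaGamma()
a, b, psi = 1., 2., 0.7
kappa     = a - b / 2.

# Left-hand side of (7).
lhs = np.exp(psi)**a / (1 + np.exp(psi))**b

# Right-hand side of (7), with the integral over PG(b, 0) replaced by a Monte Carlo average.
omegas = np.array([pg.pgdraw(b, 0.) for _ in range(100000)])
rhs    = 2.**(-b) * np.exp(kappa * psi) * np.mean(np.exp(-omegas * psi**2 / 2.))

print(lhs, rhs)  # The two numbers should agree to a couple of decimal places.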

Logistic regression with PG augmentation

It may not be immediately obvious why the RHS of (7) is useful. Its utility is that it lets us construct simple Gibbs samplers for logistic models, or more generally for models whose likelihoods are products of terms of the form on the left-hand side of (7). To be concrete, consider Bayesian inference for logistic regression. Recall that the n-th observation's contribution to the likelihood (4) is

\mathcal{L}_n(\boldsymbol{\beta}) = \frac{\big( \exp(\boldsymbol{\beta}^{\top} \mathbf{x}_n) \big)^{y_n}}{1 + \exp(\boldsymbol{\beta}^{\top} \mathbf{x}_n)}. \tag{9}

Using (7), we can express this likelihood contribution as

\begin{aligned} \frac{\big( \exp(\boldsymbol{\beta}^{\top} \mathbf{x}_n) \big)^{y_n}}{1 + \exp(\boldsymbol{\beta}^{\top} \mathbf{x}_n)} &= 2^{-1} \exp\big\{ \kappa_n \boldsymbol{\beta}^{\top} \mathbf{x}_n \big\} \int_{0}^{\infty} \exp\big\{-\omega_n (\boldsymbol{\beta}^{\top} \mathbf{x}_n)^2 / 2 \big\} p(\omega_n \mid 1, 0) \text{d}\omega_n \\ &= 2^{-1} \exp\big\{ \kappa_n \boldsymbol{\beta}^{\top} \mathbf{x}_n \big\} \mathbb{E}_{p(\omega_n \mid 1, 0)}[\exp(- \omega_n(\boldsymbol{\beta}^{\top} \mathbf{x}_n)^2 / 2)], \end{aligned} \tag{10}

where κ_n = y_n - 1/2. Note that if we condition on ω_n, the likelihood contribution in (10) is Gaussian in β:

\begin{aligned} p(\boldsymbol{\beta} \mid \Omega, \mathbf{y}) &\propto p(\boldsymbol{\beta}) \prod_{n=1}^{N} \mathcal{L}_n(\boldsymbol{\beta} \mid \omega_n) \\ &\stackrel{\ddagger}{\propto} p(\boldsymbol{\beta}) \prod_{n=1}^{N} \exp\big\{ \kappa_n \boldsymbol{\beta}^{\top} \mathbf{x}_n \big\} \exp\big\{-\omega_n (\boldsymbol{\beta}^{\top} \mathbf{x}_n)^2 / 2 \big\} \\ &\stackrel{\star}{\propto} p(\boldsymbol{\beta}) \prod_{n=1}^{N} \exp\Big\{-\frac{\omega_n}{2} \Big( \boldsymbol{\beta}^{\top} \mathbf{x}_n - \frac{\kappa_n}{\omega_n} \Big)^2 \Big\} \\ &= p(\boldsymbol{\beta}) \exp \Big\{ -\frac{1}{2} \big(\mathbf{z} - X \boldsymbol{\beta} \big)^{\top} \Omega \big(\mathbf{z} - X \boldsymbol{\beta} \big) \Big\} \\ &\stackrel{\dagger}{=} p(\boldsymbol{\beta}) \exp \Big\{ -\frac{1}{2} \big( \boldsymbol{\beta} - X^{-1} \mathbf{z}\big)^{\top} X^{\top} \Omega X \big(\boldsymbol{\beta} - X^{-1} \mathbf{z} \big) \Big\} \end{aligned} \tag{11}

where z = ⟨κ_1/ω_1, …, κ_N/ω_N⟩ and Ω = diag(ω_1, …, ω_N). Step ‡ holds because the expectation in (10) is constant if we condition on ω_n. Step ⋆ works by completing the square (see A4), while step † is just a little algebra (see A5).

In summary, if our prior on β is Gaussian (its log density is quadratic in β), then (11) is tractable because the conditional posterior is also Gaussian in β. This suggests that we can construct a Gibbs sampler, where we repeatedly sample Ω given β and then β given Ω.

PG augmented Gibbs sampler

To perform Gibbs sampling with two parameters, we repeatedly fix one parameter while conditionally sampling the other. Concretely for us, we first initialize β. Then, for t = 1, …, T, we sample

\begin{aligned} \Omega^{(t+1)} &\sim p(\Omega \mid \boldsymbol{\beta}^{(t)}) \\ \boldsymbol{\beta}^{(t+1)} &\sim p(\boldsymbol{\beta} \mid \Omega^{(t+1)}). \end{aligned} \tag{12}

Provided we can sample from each conditional distribution above, we're done. The first conditional comes from (8). We know that

\omega_n \mid \boldsymbol{\beta} \sim \text{PG}(1, \boldsymbol{\beta}^{\top} \mathbf{x}_n). \tag{13}

In other words, we sample each element along the diagonal of Ω using (13). The second equation is a bit trickier. If the prior on β is N(b, B), then p(β ∣ Ω, y) is

\boldsymbol{\beta} \mid \Omega, \mathbf{y} \sim \mathcal{N}(\mathbf{m}_{\omega}, V_{\omega}) \tag{14}

where

\begin{aligned} V_{\omega} &= (X^{\top} \Omega X + B^{-1})^{-1} \\ \mathbf{m}_{\omega} &= V_{\omega} (X^{\top} \boldsymbol{\kappa} + B^{-1} b) \end{aligned} \tag{15}

where κ = ⟨κ_1, …, κ_N⟩. The derivation just requires the matrix formula for completing the square and a bit of algebra (see A6). It is worth skimming that derivation to confirm that the reason it works is that β is Gaussian.
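
To make (14) and (15) concrete, here is a minimal sketch of the β update as a hypothetical helper function of my own, written with X as an N×D design matrix whose rows are the x_n (note that the demo below instead stores X as D×N):

import numpy as np
import numpy.random as npr
from numpy.linalg import inv

def sample_beta(X, kappa, Omega_diag, b, B):
    """Draw β | Ω, y ~ N(m_ω, V_ω) as in (14) and (15).
    X is N×D, kappa = y - 1/2, Omega_diag holds ω_1, ..., ω_N, and (b, B) is the Gaussian prior."""
    V = inv(X.T @ np.diag(Omega_diag) @ X + inv(B))
    m = V @ (X.T @ kappa + inv(B) @ b)
    return npr.multivariate_normal(m, V)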

Thinking algorithmically, if we can sample ω_n, we can use this reparameterization to get a conditionally Gaussian likelihood in β, centered at X^{-1}z with precision X^⊤ΩX.

Demo

Section 4 of Polson et al discusses simulating PG random variables. The details of this are beyond the scope of this post, and thankfully Scott Linderman has already created a Cython port of Jesse Windle’s code for sampling PG random variables. Using this library, we can easily construct a Gibbs sampler for logistic regression using PG augmentation:

import matplotlib.pyplot as plt
import numpy as np
from   numpy.linalg import inv
import numpy.random as npr
from   pypolyagamma import PyPolyaGamma


def sigmoid(x):
    """Numerically stable sigmoid function.
    """
    return np.where(x >= 0, 
                    1 / (1 + np.exp(-x)),
                    np.exp(x) / (1 + np.exp(x)))

def multi_pgdraw(pg, B, C):
    """Utility function for calling `pgdraw` on every pair in vectors B, C.
    """
    return np.array([pg.pgdraw(b, c) for b, c in zip(B, C)])

def gen_bimodal_data(N, p):
    """Generate bimodal data for easy sanity checking.
    """
    y     = npr.random(N) < p
    X     = np.empty(N)
    X[y]  = npr.normal(0, 1, size=y.sum())
    X[~y] = npr.normal(4, 1.4, size=(~y).sum())
    return X, y.astype(int)


# Set priors and create data.
N_train = 1000
N_test  = 1000
b       = np.zeros(2)
B       = np.diag(np.ones(2))
X_train, y_train = gen_bimodal_data(N_train, p=0.3)
X_test, y_test   = gen_bimodal_data(N_test, p=0.3)
# Prepend 1 for the bias β_0.
X_train = np.vstack([np.ones(N_train), X_train])
X_test  = np.vstack([np.ones(N_test), X_test])

# Perform Gibbs sampling for T iterations.
pg         = PyPolyaGamma()
T          = 100
Omega_diag = np.ones(N_train)
beta_hat   = npr.multivariate_normal(b, B)
k          = y_train - 1/2.

for _ in range(T):
    # ω ~ PG(1, x*β).
    Omega_diag = multi_pgdraw(pg, np.ones(N_train), X_train.T @ beta_hat)
    # β ~ N(m, V).
    V         = inv(X_train @ np.diag(Omega_diag) @ X_train.T + inv(B))
    m         = np.dot(V, X_train @ k + inv(B) @ b)
    beta_hat  = npr.multivariate_normal(m, V)

y_pred = npr.binomial(1, sigmoid(X_test.T @ beta_hat))
bins = np.linspace(X_test.min()-3., X_test.max()+3, 100)
plt.hist(X_test.T[y_pred == 0][:, 1],    color='r', bins=bins)
plt.hist(X_test.T[~(y_pred == 0)][:, 1], color='b', bins=bins)
plt.show()

We can see in Figure 1 that the method works nicely. The only data points that are misclassified are where the two Gaussian distributions overlap.

Figure 1. (Left) Test data from a bimodal distribution, colored based on ground-truth binary labels. (Right) Test data colored based on predictions from a Bayesian logistic regression model using PG-augmented Gibbs sampling.
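
Note that the loop above keeps only the final draw of β. A common refinement, sketched here as a hypothetical modification using the same variable names as the demo, is to discard a burn-in period, store the remaining draws, and predict with the posterior mean:

# Collect post-burn-in draws of β instead of keeping only the last one.
burn_in = 50
betas   = []
for t in range(T):
    Omega_diag = multi_pgdraw(pg, np.ones(N_train), X_train.T @ beta_hat)
    V          = inv(X_train @ np.diag(Omega_diag) @ X_train.T + inv(B))
    m          = np.dot(V, X_train @ k + inv(B) @ b)
    beta_hat   = npr.multivariate_normal(m, V)
    if t >= burn_in:
        betas.append(beta_hat)

# Predict using the posterior mean of β.
beta_mean = np.mean(betas, axis=0)
y_pred    = npr.binomial(1, sigmoid(X_test.T @ beta_mean))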


Acknowledgements

Thanks to Michael Minyi Zhang for helping with a derivation and finding a bug in my code. Thanks to Yamada Kumpei and Nikos Gianniotis for correcting typos and mistakes. Finally, I asked for a detailed derivation of Polson’s proof on math.stackexchange.com, and Grada Gukovic provided a fantastic answer.


Appendix

A1. Negative binomial likelihood

With p(x_n) denoting the logistic function in (3) evaluated at x_n, the negative binomial likelihood is

\begin{aligned} p(\mathbf{Y} \mid \mathbf{X}, r) &= \prod_{n=1}^{N} {y_{n} + r - 1 \choose y_{n}} p(\mathbf{x}_n)^{r} (1 - p(\mathbf{x}_n))^{y_{n}} \\ &\propto \prod_{n=1}^{N} p(\mathbf{x}_n)^{r} (1 - p(\mathbf{x}_n))^{y_{n}} \\ &= \prod_{n=1}^{N} \Bigg[ \frac{\exp\big(\boldsymbol{\beta}^{\top} \mathbf{x}_n\big)}{1 + \exp\big(\boldsymbol{\beta}^{\top} \mathbf{x}_n\big)} \Bigg]^{r} \Bigg[ 1 - \frac{\exp\big(\boldsymbol{\beta}^{\top} \mathbf{x}_n\big)}{1 + \exp\big(\boldsymbol{\beta}^{\top} \mathbf{x}_n\big)} \Bigg]^{y_{n}} \\ &= \prod_{n=1}^{N} \Bigg[ \frac{\exp\big(\boldsymbol{\beta}^{\top} \mathbf{x}_n\big)}{1 + \exp\big(\boldsymbol{\beta}^{\top} \mathbf{x}_n\big)} \Bigg]^{r} \Bigg[ \frac{1}{1 + \exp\big(\boldsymbol{\beta}^{\top} \mathbf{x}_n\big)} \Bigg]^{y_{n}} \\ &= \prod_{n=1}^{N} \frac{\big[\exp\big(\boldsymbol{\beta}^{\top} \mathbf{x}_n\big)\big]^{r}}{\big[1 + \exp\big(\boldsymbol{\beta}^{\top} \mathbf{x}_n\big)\big]^{y_{n} + r}}. \end{aligned} \tag{16}
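
Although this post does not implement it, the same augmentation should go through here with r treated as fixed. Matching (16) to the left-hand side of (7) with a = r and b = y_n + r gives κ_n = (r - y_n)/2, and following the same steps as in (10), (11), and A6, the conditionals would be

\begin{aligned} \omega_n \mid \boldsymbol{\beta} &\sim \text{PG}(y_n + r, \boldsymbol{\beta}^{\top} \mathbf{x}_n) \\ \boldsymbol{\beta} \mid \Omega, \mathbf{y} &\sim \mathcal{N}\big(V_{\omega} (X^{\top} \boldsymbol{\kappa} + B^{-1} \mathbf{b}), V_{\omega}\big), \qquad V_{\omega} = (X^{\top} \Omega X + B^{-1})^{-1}, \end{aligned}

with κ = ⟨κ_1, …, κ_N⟩. Other common parameterizations of the negative binomial swap the roles of p and 1 - p, which flips the sign to κ_n = (y_n - r)/2.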

A2. Proof of main result

We want to prove (7), i.e.

\frac{(e^{\psi})^a}{(1 + e^{\psi})^b} = 2^{-b} e^{\kappa \psi} \int_{0}^{\infty} e^{- \omega \psi^2 / 2} p(\omega) \text{d}\omega

where p(ω) is the density of ω ∼ PG(b, 0), the random variable defined in (5). First, plug a = κ + b/2 into the LHS of (7),

\begin{aligned} \frac{(e^{\psi})^a}{(1 + e^{\psi})^b} &= \frac{(e^{\psi})^{\kappa + b/2}}{(1 + e^{\psi})^b} \\ &= \frac{(e^{\psi})^{\kappa} (e^{\psi})^{b/2}}{(1 + e^{\psi})^b} \\ &= \frac{(e^{\psi})^{\kappa}}{(1 + e^{\psi})^b (e^{-\psi/2})^{b}} \\ &= \frac{(e^{\psi})^{\kappa}}{\Big(\frac{1 + e^{\psi}}{e^{\psi/2}}\Big)^{b}}. \end{aligned} \tag{17}

We can introduce the hyperbolic cosine through the identity

\cosh(x) = \frac{e^{x} + e^{-x}}{2}. \tag{18}

Plugging ψ/2 in for x and working backwards on the denominator in the last line of (17), we have

\begin{aligned} \Big(\frac{1 + e^{\psi}}{e^{\psi/2}}\Big)^b &= \Big(\frac{e^{\psi}}{e^{\psi/2}} + \frac{1}{e^{\psi/2}}\Big)^b \\ &= \big( e^{\psi/2} + e^{-\psi/2} \big)^b \\ &= [2 \cosh(\psi/2)]^b. \end{aligned} \tag{19}

This recapitulates the first step in Polson's proof, the line starting after the phrase “Appealing to (3), we may write the lefthand side of (7) as…”:

\frac{(e^{\psi})^a}{(1 + e^{\psi})^b} = \frac{(e^{\psi})^{\kappa}}{\Big(\frac{1 + e^{\psi}}{e^{\psi/2}}\Big)^{b}} = \frac{(e^{\psi})^{\kappa}}{[2 \cosh(\psi/2)]^b} = \frac{2^{-b} (e^{\psi})^{\kappa}}{\cosh^b(\psi/2)}. \tag{20}

The next step uses equation (3) from Polson et al, the Laplace transform of ω ∼ PG(b, 0):

\mathbb{E}[\exp(-\omega t)] = \frac{1}{\cosh^b(\sqrt{t/2})} \tag{21}

where we set t = ψ^2/2:

\mathbb{E}\Big[\exp\Big(-\omega \frac{\psi^2}{2} \Big)\Big] = \frac{1}{\cosh^b(\psi / 2)}. \tag{22}

Finally, we plug (22) into (20), and we're done:

\frac{(e^{\psi})^a}{(1 + e^{\psi})^b} = \frac{2^{-b} (e^{\psi})^{\kappa}}{\cosh^b(\psi/2)} = 2^{-b} e^{\psi \kappa} \mathbb{E}\Big[\exp\Big(-\omega \frac{\psi^2}{2} \Big)\Big]. \tag{23}

The expectation is with respect to ω ∼ PG(b, 0); writing it as an integral against the density p(ω) = PG(ω ∣ b, 0) recovers the right-hand side of (7).

A3. Proof of secondary result

By the definitions of a PG random variable and expectation,

\mathbb{E}[\exp(-\omega \psi^2/2)] = \int_{0}^{\infty} \exp(-\omega \psi^2/2) p(\omega) \text{d}\omega. \tag{24}

The conditional density of ω given ψ is proportional to the integrand above: p(ω ∣ ψ) ∝ exp(-ωψ^2/2) p(ω ∣ b, 0). This matches Polson et al's equation (5), which defines the PG(b, c) density as an exponential tilting of the PG(b, 0) density, with c = ψ, giving (8).

A4. Completing the square

This derivation relies on the univariate case of completing the square. If we drop the subscripts and change vectors to scalars to ease notation, we have

\begin{aligned} \exp\Big\{ \kappa \beta x \Big\} \exp\Big\{ -\omega \frac{(\beta x)^2}{2} \Big\} &= \exp\Big\{ \kappa \beta x - \omega \frac{(\beta x)^2}{2} \Big\} \\ &= \exp\Big\{ -\frac{\omega}{2} \Big( (\beta x)^2 - \frac{2\kappa}{\omega} (\beta x) \Big) \Big\} \\ &= \exp\Big\{ -\frac{\omega}{2} \Big( (\beta x)^2 - \frac{2\kappa}{\omega} (\beta x) + \frac{\kappa^2}{\omega^2} - \frac{\kappa^2}{\omega^2} \Big) \Big\} \\ &\propto \exp \Big\{ -\frac{\omega}{2} \Big( \beta x - \frac{\kappa}{\omega} \Big)^2 \Big\}. \end{aligned} \tag{25}

A5. Making the likelihood quadratic in β

We can show

\big(\mathbf{z} - X \boldsymbol{\beta} \big)^{\top} \Omega \big(\mathbf{z} - X \boldsymbol{\beta} \big) = \big( \boldsymbol{\beta} - X^{-1} \mathbf{z}\big)^{\top} X^{\top} \Omega X \big(\boldsymbol{\beta} - X^{-1} \mathbf{z} \big) \tag{26}

with a little algebra:

\begin{aligned} (\mathbf{z} - X \boldsymbol{\beta})^{\top} \Omega (\mathbf{z} - X \boldsymbol{\beta}) &= [X(\boldsymbol{\beta} - X^{-1} \mathbf{z})]^{\top} \Omega [X(\boldsymbol{\beta} - X^{-1} \mathbf{z})] \\ &= (\boldsymbol{\beta} - X^{-1} \mathbf{z})^{\top} X^{\top} \Omega X (\boldsymbol{\beta} - X^{-1} \mathbf{z}), \end{aligned} \tag{27}

where the sign flip from z - Xβ = -X(β - X^{-1}z) cancels inside the quadratic form.

A6. Sum of two quadratic forms in x

Note that the sum of two quadratic forms in x can be written as a single quadratic form plus a constant term that is independent of x. Consider the expression

(\mathbf{x} - \boldsymbol{\mu})^{\top} \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu}) + (\mathbf{x} - \boldsymbol{\nu})^{\top} \Psi^{-1} (\mathbf{x} - \boldsymbol{\nu}). \tag{28}

First, expand each quadratic term out:

\begin{aligned} (\mathbf{x} - \boldsymbol{\mu})^{\top} \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu}) &= \mathbf{x}^{\top} \Sigma^{-1} \mathbf{x} - 2 \boldsymbol{\mu}^{\top} \Sigma^{-1} \mathbf{x} + \boldsymbol{\mu}^{\top} \Sigma^{-1} \boldsymbol{\mu} \\ (\mathbf{x} - \boldsymbol{\nu})^{\top} \Psi^{-1} (\mathbf{x} - \boldsymbol{\nu}) &= \mathbf{x}^{\top} \Psi^{-1} \mathbf{x} - 2 \boldsymbol{\nu}^{\top} \Psi^{-1} \mathbf{x} + \boldsymbol{\nu}^{\top} \Psi^{-1} \boldsymbol{\nu}. \end{aligned} \tag{29}

If we combine “similar” terms and distribute, we get

\mathbf{x}^{\top}(\Sigma^{-1} + \Psi^{-1}) \mathbf{x} - 2(\boldsymbol{\mu}^{\top} \Sigma^{-1} + \boldsymbol{\nu}^{\top} \Psi^{-1}) \mathbf{x} + (\boldsymbol{\mu}^{\top} \Sigma^{-1} \boldsymbol{\mu} + \boldsymbol{\nu}^{\top} \Psi^{-1} \boldsymbol{\nu}), \tag{30}

which is again quadratic in x. If we set

\begin{aligned} V &= \Sigma^{-1} + \Psi^{-1} \\ \mathbf{m} &= \Sigma^{-1} \boldsymbol{\mu} + \Psi^{-1} \boldsymbol{\nu} \\ R &= \boldsymbol{\mu}^{\top} \Sigma^{-1} \boldsymbol{\mu} + \boldsymbol{\nu}^{\top} \Psi^{-1} \boldsymbol{\nu} \end{aligned} \tag{31}

and complete the square, then we can write the above as

\begin{aligned} &\mathbf{x}^{\top} V \mathbf{x} - 2 \mathbf{m}^{\top} \mathbf{x} + R \\ &= (\mathbf{x} - V^{-1} \mathbf{m})^{\top} V (\mathbf{x} - V^{-1} \mathbf{m}) - \mathbf{m}^{\top} V^{-1} \mathbf{m} + R. \end{aligned} \tag{32}

This is proportional to a Gaussian kernel with mean V^{-1}m and covariance V^{-1}. We can ignore the remainder terms m^⊤V^{-1}m and R since they do not depend on x.
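
As a quick numerical check of (32), again my own aside with arbitrary random inputs, we can compare both sides for random symmetric positive definite Σ and Ψ:

import numpy as np
import numpy.random as npr
from numpy.linalg import inv

def random_spd(d):
    """Generate a random symmetric positive definite matrix."""
    A = npr.randn(d, d)
    return A @ A.T + d * np.eye(d)

d          = 4
Sigma, Psi = random_spd(d), random_spd(d)
mu, nu, x  = npr.randn(d), npr.randn(d), npr.randn(d)

# Left: the sum of two quadratic forms in (28).
lhs = (x - mu) @ inv(Sigma) @ (x - mu) + (x - nu) @ inv(Psi) @ (x - nu)

# Right: a single quadratic form plus constants, as in (31) and (32).
V   = inv(Sigma) + inv(Psi)
m   = inv(Sigma) @ mu + inv(Psi) @ nu
R   = mu @ inv(Sigma) @ mu + nu @ inv(Psi) @ nu
rhs = (x - inv(V) @ m) @ V @ (x - inv(V) @ m) - m @ inv(V) @ m + R

print(np.allclose(lhs, rhs))  # True.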

This is the trick used in the paper. Using the notation in the paper, both Gaussians are quadratic in β; one has mean b and covariance B, and the other has mean X^{-1}z and covariance (X^⊤ΩX)^{-1}. Doing a little pattern matching, we get

\begin{aligned} V &= (B^{-1} + X^{\top} \Omega X)^{-1} \\ \mathbf{m} &= B^{-1} \mathbf{b} + (X^{\top} \Omega X) X^{-1} \mathbf{z} \\ &= B^{-1} \mathbf{b} + X^{\top} \Omega \mathbf{z} \\ &\stackrel{\star}{=} B^{-1} \mathbf{b} + X^{\top} \boldsymbol{\kappa}. \end{aligned} \tag{33}

Step ⋆ holds because Ωz = κ: multiplying each diagonal element ω_n of Ω by z_n = κ_n/ω_n gives back κ_n. Thus, we have shown that β ∣ Ω, y is Gaussian with mean Vm = m_ω and covariance V = V_ω, matching (14) and (15).

  1. Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer.
  2. Polson, N. G., Scott, J. G., & Windle, J. (2013). Bayesian inference for logistic models using Pólya–Gamma latent variables. Journal of the American Statistical Association, 108(504), 1339–1349.