Bayesian Inference for Beta–Bernoulli Models

I derive the posterior, marginal likelihood, and posterior predictive distributions for beta–Bernoulli models.

The goal of this post is to derive the posterior, marginal likelihood, and posterior predictive distributions for a Bernoulli model with a beta prior. This will be similar to my post on Bayesian inference for the Gaussian and assumes the reader is familiar with conjugacy.

We assume our data $X = \{x_1, \dots, x_N\}$ are Bernoulli distributed with a beta prior:

$$
x \sim \text{Bern}(\theta), \quad \theta \sim \text{beta}(\alpha, \beta), \quad \theta \in [0, 1]. \tag{1}
$$
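To make the derivations concrete, here is a minimal sketch in Python that simulates data from this model. The hyperparameter values and variable names are assumptions for illustration; the later snippets reuse `X`, `N`, `alpha`, and `beta`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed prior hyperparameters for illustration.
alpha, beta = 2.0, 2.0

# Draw a "true" theta from the prior, then N Bernoulli observations.
theta_true = rng.beta(alpha, beta)
N = 50
X = rng.binomial(n=1, p=theta_true, size=N)
```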

Posterior. Showing conjugacy by deriving the posterior is relatively easy. The likelihood times prior is

$$
\begin{aligned}
p(\theta \mid X) &\propto \left[ \prod_{n=1}^{N} p(x_n \mid \theta) \right] p(\theta) \\
&= \left[ \prod_{n=1}^{N} \theta^{x_n} (1 - \theta)^{1 - x_n} \right] \frac{1}{\text{B}(\alpha, \beta)} \theta^{\alpha - 1} (1 - \theta)^{\beta - 1} \\
&\propto \theta^{\sum_n x_n + \alpha - 1} (1 - \theta)^{N - \sum_n x_n + \beta - 1}.
\end{aligned} \tag{2}
$$

In Eq. 2, the beta function $\text{B}$ is defined as

$$
\text{B}(\alpha, \beta) = \frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha + \beta)} = \int_{0}^{1} \mu^{\alpha - 1} (1 - \mu)^{\beta - 1} \, \text{d}\mu. \tag{3}
$$
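As a quick numerical sanity check, both forms of Eq. 3 can be evaluated directly; here is a sketch using SciPy's gamma function and quadrature:

```python
import numpy as np
from scipy.special import gamma
from scipy.integrate import quad

a, b = 2.0, 3.0

# Gamma-function form of B(a, b).
B_gamma = gamma(a) * gamma(b) / gamma(a + b)

# Integral form of B(a, b).
B_integral, _ = quad(lambda mu: mu**(a - 1) * (1 - mu)**(b - 1), 0.0, 1.0)

assert np.isclose(B_gamma, B_integral)  # both equal 1/12 here
```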

So Eq. 2 is proportional to a beta density, and since the posterior must integrate to one, the posterior is exactly a beta distribution:

$$
\begin{aligned}
p(\theta \mid X) &= \text{beta}(\alpha_N, \beta_N), \\
\alpha_N &= \sum_{n=1}^N x_n + \alpha, \\
\beta_N &= N - \sum_{n=1}^N x_n + \beta.
\end{aligned} \tag{4}
$$
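The update in Eq. 4 is just counting: add the successes to $\alpha$ and the failures to $\beta$. Here is a sketch, continuing the simulation above, that also checks the closed form against a brute-force grid normalization of likelihood times prior:

```python
from scipy.stats import beta as beta_dist

# Conjugate update (Eq. 4): successes update alpha, failures update beta.
alpha_N = X.sum() + alpha
beta_N = N - X.sum() + beta

# Check: normalizing likelihood-times-prior on a grid recovers beta(alpha_N, beta_N).
grid = np.linspace(1e-6, 1 - 1e-6, 10_000)
unnorm = grid**X.sum() * (1 - grid)**(N - X.sum()) * beta_dist.pdf(grid, alpha, beta)
posterior_grid = unnorm / (unnorm.sum() * (grid[1] - grid[0]))
assert np.allclose(posterior_grid, beta_dist.pdf(grid, alpha_N, beta_N), atol=1e-2)
```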

Marginal likelihood. To compute the marginal likelihood, we leverage the integral definition of the beta function in Eq. 3:

$$
\begin{aligned}
p(X) &= \int_{0}^{1} p(X \mid \theta) p(\theta) \, \text{d}\theta \\
&= \int_{0}^{1} \frac{1}{\text{B}(\alpha, \beta)} \theta^{\sum_n x_n + \alpha - 1} (1 - \theta)^{N - \sum_n x_n + \beta - 1} \, \text{d}\theta \\
&= \frac{1}{\text{B}(\alpha, \beta)} \int_{0}^{1} \theta^{\alpha_N - 1} (1 - \theta)^{\beta_N - 1} \, \text{d}\theta \\
&= \frac{\text{B}(\alpha_N, \beta_N)}{\text{B}(\alpha, \beta)}.
\end{aligned} \tag{5}
$$
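In code, Eq. 5 is best computed on the log scale with `scipy.special.betaln`, since the marginal likelihood underflows quickly as $N$ grows. A sketch, continuing the variables above and checked against brute-force integration:

```python
from scipy.special import betaln
from scipy.integrate import quad

s = X.sum()  # number of successes

# Log marginal likelihood (Eq. 5).
log_p_X = betaln(s + alpha, N - s + beta) - betaln(alpha, beta)

# Check against direct integration of likelihood times prior.
p_X_quad, _ = quad(lambda t: t**s * (1 - t)**(N - s) * beta_dist.pdf(t, alpha, beta), 0.0, 1.0)
assert np.isclose(np.exp(log_p_X), p_X_quad, rtol=1e-6, atol=0.0)
```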

Posterior predictive. To compute the posterior predictive distribution over an unseen observation $\hat{x}$, we integrate out our uncertainty about the parameter $\theta$. I know of two ways to compute this. The easier way is to reuse our derivation of the marginal likelihood: since the prior and posterior are both beta distributed, the same beta-function integral appears:

$$
\begin{aligned}
p(\hat{x} \mid X) &= \int_{0}^{1} p(\hat{x} \mid \theta) p(\theta \mid X) \, \text{d}\theta \\
&= \frac{1}{\text{B}(\alpha_N, \beta_N)} \int_{0}^{1} \theta^{\hat{x} + \alpha_N - 1} (1 - \theta)^{1 - \hat{x} + \beta_N - 1} \, \text{d}\theta \\
&= \frac{\text{B}(\hat{x} + \alpha_N, 1 - \hat{x} + \beta_N)}{\text{B}(\alpha_N, \beta_N)}.
\end{aligned} \tag{6}
$$
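Since $\hat{x} \in \{0, 1\}$, Eq. 6 is easy to evaluate for both values; here is a sketch, again on the log scale, confirming the two probabilities form a valid distribution:

```python
def log_post_pred(x_hat, a_N, b_N):
    """Log posterior predictive (Eq. 6) for x_hat in {0, 1}."""
    return betaln(x_hat + a_N, 1 - x_hat + b_N) - betaln(a_N, b_N)

p1 = np.exp(log_post_pred(1, alpha_N, beta_N))
p0 = np.exp(log_post_pred(0, alpha_N, beta_N))
assert np.isclose(p0 + p1, 1.0)  # the two cases sum to one
```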

Alternatively, since $\hat{x}$ has support on only two values, we can compute each case separately:

$$
\begin{aligned}
p(\hat{x} = 1 \mid X) &= \int_{0}^{1} p(\hat{x} = 1 \mid \theta) p(\theta \mid X) \, \text{d}\theta \\
&= \int_{0}^{1} \theta \, \text{beta}(\theta \mid \alpha_N, \beta_N) \, \text{d}\theta \\
&= \mathbb{E}_{\theta \sim \text{beta}(\alpha_N, \beta_N)}[\theta] \\
&= \frac{\alpha_N}{\alpha_N + \beta_N},
\\ \\
p(\hat{x} = 0 \mid X) &= \int_{0}^{1} p(\hat{x} = 0 \mid \theta) p(\theta \mid X) \, \text{d}\theta \\
&= \int_{0}^{1} (1 - \theta) \, \text{beta}(\theta \mid \alpha_N, \beta_N) \, \text{d}\theta \\
&= \int_{0}^{1} \text{beta}(\theta \mid \alpha_N, \beta_N) \, \text{d}\theta - \int_{0}^{1} \theta \, \text{beta}(\theta \mid \alpha_N, \beta_N) \, \text{d}\theta \\
&= 1 - \frac{\alpha_N}{\alpha_N + \beta_N} \\
&= \frac{\beta_N}{\alpha_N + \beta_N}.
\end{aligned} \tag{7}
$$

Taken together, we can write the posterior predictive as

$$
p(\hat{x} \mid X) = \frac{(\alpha_N)^{\hat{x}} (\beta_N)^{1 - \hat{x}}}{\alpha_N + \beta_N}. \tag{8}
$$
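A final sketch implements Eq. 8 and checks it against a Monte Carlo estimate, averaging the Bernoulli likelihood over posterior samples of $\theta$:

```python
def post_pred(x_hat, a_N, b_N):
    """Posterior predictive (Eq. 8)."""
    return (a_N**x_hat * b_N**(1 - x_hat)) / (a_N + b_N)

# Monte Carlo: p(x_hat = 1 | X) is the posterior mean of theta.
thetas = rng.beta(alpha_N, beta_N, size=200_000)
assert np.isclose(post_pred(1, alpha_N, beta_N), thetas.mean(), atol=1e-2)
```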