Bayesian Inference for Beta–Bernoulli Models

I derive the posterior, marginal likelihood, and posterior predictive distributions for beta–Bernoulli models.

The goal of this post is to derive the posterior, marginal likelihood, and posterior predictive distributions for a Bernoulli model with a beta prior. This will be similar to my post on Bayesian inference for the Gaussian and assumes the reader is familiar with conjugacy.

We assume our data $X = \{x_1, \dots, x_N\}$ are Bernoulli distributed with a beta prior:

$$
x \sim \text{Bern}(\theta), \quad \theta \sim \text{beta}(\alpha, \beta), \quad \theta \in [0, 1]. \tag{1}
$$
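To make the derivations concrete, here is a minimal sketch in Python that simulates data from this model. The hyperparameter values and variable names are assumptions for illustration; the later snippets reuse `X`, `N`, `alpha`, and `beta`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed prior hyperparameters for illustration.
alpha, beta = 2.0, 2.0

# Draw a "true" theta from the prior, then N Bernoulli observations.
theta_true = rng.beta(alpha, beta)
N = 50
X = rng.binomial(n=1, p=theta_true, size=N)
```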

Posterior. Showing conjugacy by deriving the posterior is relatively easy. The likelihood times prior is

$$
\begin{aligned}
p(\theta \mid X) &\propto \left[ \prod_{n=1}^{N} p(x_n \mid \theta) \right] p(\theta) \\
&= \left[ \prod_{n=1}^{N} \theta^{x_n} (1 - \theta)^{1 - x_n} \right] \frac{1}{\text{B}(\alpha, \beta)} \theta^{\alpha - 1} (1 - \theta)^{\beta - 1} \\
&\propto \theta^{\sum_n x_n + \alpha - 1} (1 - \theta)^{N - \sum_n x_n + \beta - 1}.
\end{aligned} \tag{2}
$$

In Eq. 2, the beta function $\text{B}$ is defined as

$$
\text{B}(\alpha, \beta) = \frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha + \beta)} = \int_{0}^{1} \mu^{\alpha - 1} (1 - \mu)^{\beta - 1} \, \text{d}\mu. \tag{3}
$$
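As a quick numerical sanity check, both forms of Eq. 3 can be evaluated directly; here is a sketch using SciPy's gamma function and quadrature:

```python
import numpy as np
from scipy.special import gamma
from scipy.integrate import quad

a, b = 2.0, 3.0

# Gamma-function form of B(a, b).
B_gamma = gamma(a) * gamma(b) / gamma(a + b)

# Integral form of B(a, b).
B_integral, _ = quad(lambda mu: mu**(a - 1) * (1 - mu)**(b - 1), 0.0, 1.0)

assert np.isclose(B_gamma, B_integral)  # both equal 1/12 here
```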

So Eq. 2 is proportional to a beta density, and since the posterior must integrate to one, the posterior is exactly a beta distribution:

$$
\begin{aligned}
p(\theta \mid X) &= \text{beta}(\alpha_N, \beta_N), \\
\alpha_N &= \sum_{n=1}^N x_n + \alpha, \\
\beta_N &= N - \sum_{n=1}^N x_n + \beta.
\end{aligned} \tag{4}
$$
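The update in Eq. 4 is just counting: add the successes to $\alpha$ and the failures to $\beta$. Here is a sketch, continuing the simulation above, that also checks the closed form against a brute-force grid normalization of likelihood times prior:

```python
from scipy.stats import beta as beta_dist

# Conjugate update (Eq. 4): successes update alpha, failures update beta.
alpha_N = X.sum() + alpha
beta_N = N - X.sum() + beta

# Check: normalizing likelihood-times-prior on a grid recovers beta(alpha_N, beta_N).
grid = np.linspace(1e-6, 1 - 1e-6, 10_000)
unnorm = grid**X.sum() * (1 - grid)**(N - X.sum()) * beta_dist.pdf(grid, alpha, beta)
posterior_grid = unnorm / (unnorm.sum() * (grid[1] - grid[0]))
assert np.allclose(posterior_grid, beta_dist.pdf(grid, alpha_N, beta_N), atol=1e-2)
```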

Marginal likelihood. To compute the marginal likelihood, we leverage the integral definition of the beta function in Eq. 3:

$$
\begin{aligned}
p(X) &= \int_{0}^{1} p(X \mid \theta) p(\theta) \, \text{d}\theta \\
&= \int_{0}^{1} \frac{1}{\text{B}(\alpha, \beta)} \theta^{\sum_n x_n + \alpha - 1} (1 - \theta)^{N - \sum_n x_n + \beta - 1} \, \text{d}\theta \\
&= \frac{1}{\text{B}(\alpha, \beta)} \int_{0}^{1} \theta^{\alpha_N - 1} (1 - \theta)^{\beta_N - 1} \, \text{d}\theta \\
&= \frac{\text{B}(\alpha_N, \beta_N)}{\text{B}(\alpha, \beta)}.
\end{aligned} \tag{5}
$$
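In code, Eq. 5 is best computed on the log scale with `scipy.special.betaln`, since the marginal likelihood underflows quickly as $N$ grows. A sketch, continuing the variables above and checked against brute-force integration:

```python
from scipy.special import betaln
from scipy.integrate import quad

s = X.sum()  # number of successes

# Log marginal likelihood (Eq. 5).
log_p_X = betaln(s + alpha, N - s + beta) - betaln(alpha, beta)

# Check against direct integration of likelihood times prior.
p_X_quad, _ = quad(lambda t: t**s * (1 - t)**(N - s) * beta_dist.pdf(t, alpha, beta), 0.0, 1.0)
assert np.isclose(np.exp(log_p_X), p_X_quad, rtol=1e-6, atol=0.0)
```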

Posterior predictive. To compute the posterior predictive distribution over an unseen observation $\hat{x}$, we integrate out our uncertainty about the parameter $\theta$. I know of two ways to compute this. The easier way is to reuse our derivation of the marginal likelihood: since the prior and posterior are both beta distributed, the same beta-function integral appears:

$$
\begin{aligned}
p(\hat{x} \mid X) &= \int_{0}^{1} p(\hat{x} \mid \theta) p(\theta \mid X) \, \text{d}\theta \\
&= \frac{1}{\text{B}(\alpha_N, \beta_N)} \int_{0}^{1} \theta^{\hat{x} + \alpha_N - 1} (1 - \theta)^{1 - \hat{x} + \beta_N - 1} \, \text{d}\theta \\
&= \frac{\text{B}(\hat{x} + \alpha_N, 1 - \hat{x} + \beta_N)}{\text{B}(\alpha_N, \beta_N)}.
\end{aligned} \tag{6}
$$
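Since $\hat{x} \in \{0, 1\}$, Eq. 6 is easy to evaluate for both values; here is a sketch, again on the log scale, confirming the two probabilities form a valid distribution:

```python
def log_post_pred(x_hat, a_N, b_N):
    """Log posterior predictive (Eq. 6) for x_hat in {0, 1}."""
    return betaln(x_hat + a_N, 1 - x_hat + b_N) - betaln(a_N, b_N)

p1 = np.exp(log_post_pred(1, alpha_N, beta_N))
p0 = np.exp(log_post_pred(0, alpha_N, beta_N))
assert np.isclose(p0 + p1, 1.0)  # the two cases sum to one
```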

Alternatively, since $\hat{x}$ has support on only two values, we can compute each case separately:

$$
\begin{aligned}
p(\hat{x} = 1 \mid X) &= \int_{0}^{1} p(\hat{x} = 1 \mid \theta) p(\theta \mid X) \, \text{d}\theta \\
&= \int_{0}^{1} \theta \, \text{beta}(\theta \mid \alpha_N, \beta_N) \, \text{d}\theta \\
&= \mathbb{E}_{\theta \sim \text{beta}(\alpha_N, \beta_N)}[\theta] \\
&= \frac{\alpha_N}{\alpha_N + \beta_N},
\\ \\
p(\hat{x} = 0 \mid X) &= \int_{0}^{1} p(\hat{x} = 0 \mid \theta) p(\theta \mid X) \, \text{d}\theta \\
&= \int_{0}^{1} (1 - \theta) \, \text{beta}(\theta \mid \alpha_N, \beta_N) \, \text{d}\theta \\
&= \int_{0}^{1} \text{beta}(\theta \mid \alpha_N, \beta_N) \, \text{d}\theta - \int_{0}^{1} \theta \, \text{beta}(\theta \mid \alpha_N, \beta_N) \, \text{d}\theta \\
&= 1 - \frac{\alpha_N}{\alpha_N + \beta_N} \\
&= \frac{\beta_N}{\alpha_N + \beta_N}.
\end{aligned} \tag{7}
$$

Taken together, we can write the posterior predictive as

$$
p(\hat{x} \mid X) = \frac{(\alpha_N)^{\hat{x}} (\beta_N)^{1 - \hat{x}}}{\alpha_N + \beta_N}. \tag{8}
$$
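A final sketch implements Eq. 8 and checks it against a Monte Carlo estimate, averaging the Bernoulli likelihood over posterior samples of $\theta$:

```python
def post_pred(x_hat, a_N, b_N):
    """Posterior predictive (Eq. 8)."""
    return (a_N**x_hat * b_N**(1 - x_hat)) / (a_N + b_N)

# Monte Carlo: p(x_hat = 1 | X) is the posterior mean of theta.
thetas = rng.beta(alpha_N, beta_N, size=200_000)
assert np.isclose(post_pred(1, alpha_N, beta_N), thetas.mean(), atol=1e-2)
```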