The goal of this post is to derive the posterior, marginal likelihood, and posterior predictive distributions for a Bernoulli model with a beta prior. It parallels my post on Bayesian inference for the Gaussian and assumes the reader is familiar with conjugacy.
We assume our data $\mathbf{X} = \{x_1, \dots, x_N\}$ are Bernoulli distributed, with a beta prior on the parameter $\theta$:
$$
x \sim \text{Bern}(\theta),
\qquad
\theta \sim \text{beta}(\alpha, \beta),
\qquad
\theta \in [0, 1]. \tag{1}
$$
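To make the generative model concrete, here is a minimal sketch of sampling from it, assuming NumPy; the hyperparameters $\alpha = 2$, $\beta = 5$ and the sample size are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, beta_, N = 2.0, 5.0, 100     # hypothetical hyperparameters and sample size

theta = rng.beta(alpha, beta_)      # draw the parameter from the beta prior
X = rng.binomial(1, theta, size=N)  # draw N Bernoulli observations given theta
print(theta, X.mean())              # the sample mean should be close to theta
```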
Posterior. Showing conjugacy by deriving the posterior is relatively easy. The likelihood times prior is
$$
\begin{aligned}
p(\theta \mid \mathbf{X})
&\propto \prod_{n=1}^{N} p(x_n \mid \theta)\, p(\theta)
\\
&= \left[ \prod_{n=1}^{N} \theta^{x_n} (1 - \theta)^{1 - x_n} \right] \frac{1}{B(\alpha, \beta)}\, \theta^{\alpha - 1} (1 - \theta)^{\beta - 1}
\\
&\propto \theta^{\sum_n x_n + \alpha - 1} (1 - \theta)^{N - \sum_n x_n + \beta - 1}.
\end{aligned} \tag{2}
$$
In Eq. 2, the beta function $B$ is defined as
$$
B(\alpha, \beta) = \frac{\Gamma(\alpha) \Gamma(\beta)}{\Gamma(\alpha + \beta)} = \int_0^1 \mu^{\alpha - 1} (1 - \mu)^{\beta - 1}\, d\mu. \tag{3}
$$
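If the second equality in Eq. 3 is unfamiliar, it is easy to verify numerically. A minimal sketch, assuming SciPy; the values of $\alpha$ and $\beta$ are arbitrary test values:

```python
from scipy.integrate import quad
from scipy.special import beta as B, gamma

a, b = 2.0, 5.0  # arbitrary test values

# Compare the gamma-function ratio and the integral against scipy's beta.
ratio = gamma(a) * gamma(b) / gamma(a + b)
integral, _ = quad(lambda mu: mu**(a - 1) * (1 - mu)**(b - 1), 0, 1)
print(B(a, b), ratio, integral)  # all three agree
```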
So we see the posterior is proportional to a beta distribution:
$$
\begin{aligned}
p(\theta \mid \mathbf{X}) &= \text{beta}(\alpha_N, \beta_N),
\\
\alpha_N &= \sum_{n=1}^{N} x_n + \alpha,
\\
\beta_N &= N - \sum_{n=1}^{N} x_n + \beta.
\end{aligned} \tag{4}
$$
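Eq. 4 is easy to sanity-check: the closed-form beta posterior should match a brute-force normalization of likelihood times prior on a grid. A minimal sketch, assuming NumPy and SciPy, with made-up data (true $\theta = 0.3$):

```python
import numpy as np
from scipy.stats import beta as beta_dist

rng = np.random.default_rng(0)
alpha, beta_, N = 2.0, 5.0, 100   # hypothetical prior hyperparameters, sample size
X = rng.binomial(1, 0.3, size=N)  # made-up Bernoulli data with true theta = 0.3
s = X.sum()

alpha_N = s + alpha               # posterior parameters from Eq. 4
beta_N = N - s + beta_

# Brute force: evaluate likelihood * prior on a grid and normalize numerically.
grid = np.linspace(1e-6, 1 - 1e-6, 100_000)
unnorm = grid**s * (1 - grid)**(N - s) * beta_dist.pdf(grid, alpha, beta_)
numeric = unnorm / (unnorm.sum() * (grid[1] - grid[0]))

closed_form = beta_dist.pdf(grid, alpha_N, beta_N)
print(np.abs(numeric - closed_form).max())  # should be small
```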
Marginal likelihood. To compute the marginal likelihood, we just leverage the integral definition of the beta function:
$$
\begin{aligned}
p(\mathbf{X})
&= \int_0^1 p(\mathbf{X} \mid \theta)\, p(\theta)\, d\theta
\\
&= \int_0^1 \frac{1}{B(\alpha, \beta)}\, \theta^{\sum_n x_n + \alpha - 1} (1 - \theta)^{N - \sum_n x_n + \beta - 1}\, d\theta
\\
&= \frac{1}{B(\alpha, \beta)} \int_0^1 \theta^{\alpha_N - 1} (1 - \theta)^{\beta_N - 1}\, d\theta
\\
&= \frac{B(\alpha_N, \beta_N)}{B(\alpha, \beta)}.
\end{aligned} \tag{5}
$$
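We can check Eq. 5 the same way, comparing the ratio of beta functions to direct numerical integration of likelihood times prior. A minimal sketch, assuming SciPy and made-up data as before (a small $N$ keeps the integrand well scaled):

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import beta as B

rng = np.random.default_rng(0)
alpha, beta_, N = 2.0, 5.0, 10    # hypothetical hyperparameters; small N
X = rng.binomial(1, 0.3, size=N)  # made-up Bernoulli data
s = X.sum()
alpha_N, beta_N = s + alpha, N - s + beta_

closed_form = B(alpha_N, beta_N) / B(alpha, beta_)

# Directly integrate p(X | theta) p(theta) over theta.
def integrand(t):
    return t**(s + alpha - 1) * (1 - t)**(N - s + beta_ - 1) / B(alpha, beta_)

numeric, _ = quad(integrand, 0, 1)
print(closed_form, numeric)  # agree
```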
Posterior predictive. To compute the posterior predictive distribution over an unseen observation $\hat{x}$, we integrate out our uncertainty about the parameter $\theta$. I know of two ways to compute this. The easiest way is to reuse our derivation of the marginal likelihood; since the prior and posterior are both beta distributed, we know the same trick will work:
$$
\begin{aligned}
p(\hat{x} \mid \mathbf{X})
&= \int_0^1 p(\hat{x} \mid \theta)\, p(\theta \mid \mathbf{X})\, d\theta
\\
&= \frac{1}{B(\alpha_N, \beta_N)} \int_0^1 \theta^{\hat{x} + \alpha_N - 1} (1 - \theta)^{1 - \hat{x} + \beta_N - 1}\, d\theta
\\
&= \frac{B(\hat{x} + \alpha_N, 1 - \hat{x} + \beta_N)}{B(\alpha_N, \beta_N)}.
\end{aligned} \tag{6}
$$
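Again, a quick numerical check: the beta-function ratio in Eq. 6 should match integrating the Bernoulli likelihood against the beta posterior density. A minimal sketch, assuming SciPy, with hypothetical posterior parameters:

```python
from scipy.integrate import quad
from scipy.special import beta as B
from scipy.stats import beta as beta_dist

alpha_N, beta_N = 5.0, 7.0  # hypothetical posterior parameters

for x_hat in (0, 1):
    closed_form = B(x_hat + alpha_N, 1 - x_hat + beta_N) / B(alpha_N, beta_N)
    # Integrate the Bernoulli pmf against the beta posterior density.
    numeric, _ = quad(
        lambda t: t**x_hat * (1 - t)**(1 - x_hat) * beta_dist.pdf(t, alpha_N, beta_N),
        0, 1,
    )
    print(x_hat, closed_form, numeric)  # agree for both values of x_hat
```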
Alternatively, since $\hat{x}$ has support on only two values, we can compute each case separately:
$$
\begin{aligned}
p(\hat{x} = 1 \mid \mathbf{X})
&= \int_0^1 p(\hat{x} = 1 \mid \theta)\, p(\theta \mid \mathbf{X})\, d\theta
= \int_0^1 \theta\, \text{beta}(\alpha_N, \beta_N)\, d\theta
\\
&= \mathbb{E}_{\theta \sim \text{beta}(\alpha_N, \beta_N)}[\theta]
= \frac{\alpha_N}{\alpha_N + \beta_N},
\\
p(\hat{x} = 0 \mid \mathbf{X})
&= \int_0^1 p(\hat{x} = 0 \mid \theta)\, p(\theta \mid \mathbf{X})\, d\theta
= \int_0^1 (1 - \theta)\, \text{beta}(\alpha_N, \beta_N)\, d\theta
\\
&= \int_0^1 \text{beta}(\alpha_N, \beta_N)\, d\theta - \int_0^1 \theta\, \text{beta}(\alpha_N, \beta_N)\, d\theta
\\
&= 1 - \frac{\alpha_N}{\alpha_N + \beta_N}
= \frac{\beta_N}{\alpha_N + \beta_N}.
\end{aligned} \tag{7}
$$
Taken together, we can write the posterior predictive as
$$
p(\hat{x} \mid \mathbf{X}) = \frac{(\alpha_N)^{\hat{x}}\, (\beta_N)^{1 - \hat{x}}}{\alpha_N + \beta_N}. \tag{8}
$$
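As a final sanity check, Eq. 8 should reproduce both cases of Eq. 7 and the beta-function ratio of Eq. 6, and the two probabilities should sum to one. A minimal sketch, assuming SciPy and the same hypothetical posterior parameters:

```python
from scipy.special import beta as B

alpha_N, beta_N = 5.0, 7.0  # hypothetical posterior parameters

def predictive(x_hat):
    """Posterior predictive from Eq. 8."""
    return alpha_N**x_hat * beta_N**(1 - x_hat) / (alpha_N + beta_N)

for x_hat in (0, 1):
    via_beta = B(x_hat + alpha_N, 1 - x_hat + beta_N) / B(alpha_N, beta_N)
    print(x_hat, predictive(x_hat), via_beta)  # the two forms agree

print(predictive(0) + predictive(1))           # sums to 1
```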