I work through several cases of Bayesian parameter estimation of Gaussian models.
Published
04 April 2019
Estimating the parameters of a Gaussian distribution and its conjugate prior is a common task in Bayesian inference. In this blog post, I want to derive the likelihood, conjugate prior, posterior, and posterior predictive for a few important cases: when we estimate just $\mu$ with known $\sigma^2$, when we estimate just $\sigma^2$ with known $\mu$, and when we jointly estimate both parameters. For simplicity, I’ll stick to univariate models. My goal is to provide detailed yet fluid derivations. Once again, I rely on (Bishop, 2006; Murphy, 2007) for their excellent explanations and notation.
Estimating $\mu$ with known $\sigma^2$
Let’s begin with the simplest case: a Gaussian distribution with an unknown mean $\mu$ but with a known or fixed variance $\sigma^2$. In this case, the likelihood of the data $\mathcal{D} = \{x_1, \dots, x_N\}$ is

$$p(\mathcal{D} \mid \mu) = \prod_{n=1}^{N} \mathcal{N}(x_n \mid \mu, \sigma^2).$$
We can see that if we multiply this likelihood by another Gaussian, we will get another Gaussian. This is because each Gaussian can be written as the exponential of a quadratic function of $\mu$, and the product of two such exponentials is again the exponential of a quadratic. Therefore a conjugate prior on $\mu$ has the form

$$p(\mu) = \mathcal{N}(\mu \mid \mu_0, \sigma_0^2).$$
Let’s work through this calculation in detail to see the form of the posterior distribution. To be clear, the goal is to write the posterior in a form like this,
$$p(\mu \mid \mathcal{D}) \propto \alpha \exp\{\beta(\mu - \gamma)^2\}$$
for some values $\alpha$, $\beta$, and $\gamma$. This means we need to massage the likelihood times the prior until we get a functional form as above. First, we just apply the definitions and then drop anything that clearly does not depend on $\mu$. (As a convention, any time we drop a term because it does not depend on $\mu$, we highlight it in red before it is dropped.) After these steps, we are left with an exponent that contains terms in both $\mu$ and $\mu^2$.
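Written out, the exponent of the likelihood times the prior is, up to additive constants that do not depend on $\mu$,

$$-\frac{1}{2\sigma^2}\sum_{n=1}^{N}(x_n - \mu)^2 - \frac{1}{2\sigma_0^2}(\mu - \mu_0)^2 = -\frac{1}{2}\left[\left(\frac{1}{\sigma_0^2} + \frac{N}{\sigma^2}\right)\mu^2 - 2\left(\frac{\mu_0}{\sigma_0^2} + \frac{1}{\sigma^2}\sum_{n=1}^{N} x_n\right)\mu\right] + \text{const}.$$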
At this point, we might be stuck. We’ve isolated μ and μ2 but it’s not clear how to solve for just μ. But Bishop (p. 98) gives a hint, writing:
Simple manipulation involving completing the square in the exponent shows that the posterior distribution is given by…
But what does this mean? This deserves a small digression.
Completing the square
Recall that completing the square is a trick we learned in algebra when factorizing polynomials. Consider the problem of solving for x in this equation:
$$x^2 - 4x = 5$$
We did this by completing the square, meaning adding something to $x^2 - 4x$ to make it a perfect square. In this case, note that
$$\begin{aligned}
x^2 - 4x &= 5 \\
x^2 - 4x + 4 &= 5 + 4 \\
(x - 2)^2 &= 9 \\
x - 2 &= \pm 3 \\
x &= 5 \;\text{ or }\; -1
\end{aligned}$$
More generally, for any quadratic polynomial we can write

$$ax^2 + bx + c = d \quad\Longrightarrow\quad a\left(x + \frac{b}{2a}\right)^2 = d - c + \frac{b^2}{4a}.$$

Dividing both sides by $a$, and since $a$, $b$, $c$, and $d$ are all constants, this reduces to the form:
$$(x + \alpha)^2 = \beta$$
for some values $\alpha$ and $\beta$, and we know that $x = \pm\sqrt{\beta} - \alpha$. This is a general technique; it is actually how one derives the quadratic formula. For us, it is precisely the trick we want to use here, except now with $\mu$ in place of $x$.
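As a tiny check of the worked example above, here is a short SymPy sketch (just to double-check the algebra; the derivation does not depend on it):

```python
import sympy as sp

# Check that x^2 - 4x = 5 and the completed square (x - 2)^2 = 9 agree.
x = sp.symbols('x')
print(sp.solve(sp.Eq(x**2 - 4*x, 5), x))                # [-1, 5]
print(sp.expand((x - 2)**2 - 9) == x**2 - 4*x - 5)      # True: same polynomial
```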
Continuing the derivation
To ease notation, let’s ignore the exponential and the $-1/2$ term for now, and let $a$ and $b$ be defined as

$$a \triangleq \frac{1}{\sigma_0^2} + \frac{N}{\sigma^2}, \qquad b \triangleq \frac{\mu_0}{\sigma_0^2} + \frac{1}{\sigma^2}\sum_{n=1}^{N} x_n.$$

Completing the square then gives

$$a\mu^2 - 2b\mu = a\left(\mu^2 - 2\frac{b}{a}\mu + \left(\frac{b}{a}\right)^2\right) - \frac{b^2}{a} = a\left(\mu - \frac{b}{a}\right)^2 - \frac{b^2}{a},$$

where we use the cute trick that we can add $(b/a)^2 - (b/a)^2 = 0$ inside the parentheses, but then ignore one of the newly added terms because it does not depend on $\mu$. Adding the exponential and the $-1/2$ term back, we get:
$$p(\mu \mid \mathcal{D}) \propto \exp\left\{-\frac{a}{2}\left(\mu - \frac{b}{a}\right)^2\right\} \propto \mathcal{N}(\mu \mid \mu_N, \sigma_N^2).$$
We can then solve for the posterior parameters, $\mu_N$ and $\sigma_N^2$, in terms of $a$ and $b$:

$$\mu_N = \frac{b}{a} = \frac{\sigma^2}{N\sigma_0^2 + \sigma^2}\,\mu_0 + \frac{N\sigma_0^2}{N\sigma_0^2 + \sigma^2}\,\mu_{\text{ML}}, \qquad \frac{1}{\sigma_N^2} = a = \frac{1}{\sigma_0^2} + \frac{N}{\sigma^2},$$
where we use the fact that $\mu_{\text{ML}}$ is $\bar{x}$, the sample mean:

$$\mu_{\text{ML}} = \frac{1}{N}\sum_{n=1}^{N} x_n.$$
Note two things. First, if $N = 0$, then $\mu_N = \mu_0$, which is expected: the prior is our modeling assumption in the absence of data. Second, if $N \to \infty$, then $\mu_N \to \mu_{\text{ML}}$, which is ideal: with enough data, we disregard our prior in favor of the optimal parameter given our data.
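As a quick numerical sanity check, here is a minimal Python sketch of these update equations (NumPy only; the helper `posterior_mu` and the chosen numbers are my own, for illustration):

```python
import numpy as np

def posterior_mu(x, sigma2, mu0, sigma02):
    """Posterior over mu for a Gaussian with known variance sigma2,
    given the conjugate prior N(mu | mu0, sigma02).
    Returns the posterior mean mu_N and variance sigma_N^2."""
    N = len(x)
    mu_ml = np.mean(x) if N > 0 else 0.0                     # sample mean (unused when N = 0)
    sigma_N2 = 1.0 / (1.0 / sigma02 + N / sigma2)            # 1/sigma_N^2 = 1/sigma_0^2 + N/sigma^2
    mu_N = sigma_N2 * (mu0 / sigma02 + N * mu_ml / sigma2)   # mu_N = b / a
    return mu_N, sigma_N2

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.0, size=1000)                # true mu = 2, known sigma^2 = 1
print(posterior_mu(x, sigma2=1.0, mu0=0.0, sigma02=10.0))    # mu_N close to 2, small sigma_N^2
print(posterior_mu(np.array([]), sigma2=1.0, mu0=0.0, sigma02=10.0))  # N = 0 recovers the prior
```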
Posterior predictive
In Bayesian inference, the posterior predictive is
p(D′∣D)
where D′ is unseen data. In words, it is the distribution of unobserved data given observed data. Once we have our posterior, we can marginalize over the parameter to get the posterior predictive:

$$p(\mathcal{D}' \mid \mathcal{D}) = \int p(\mathcal{D}' \mid \mu, \mathcal{D})\, p(\mu \mid \mathcal{D})\, d\mu \;\stackrel{\star}{=}\; \int p(\mathcal{D}' \mid \mu)\, p(\mu \mid \mathcal{D})\, d\mu,$$
where step ⋆ holds because the modeling assumption is that D′ is conditionally independent from D given μ or that p(D′∣D,μ)=p(D′∣μ). This is a reasonable assumption in that it claims that our training data and unseen data are both generated independently from the same distribution.
Since both the posterior $p(\mu \mid \mathcal{D})$ and the likelihood of new data $p(\mathcal{D}' \mid \mu)$ are Gaussian, we can use the following fact: if

$$p(x) = \mathcal{N}(x \mid \mu, \Psi) \qquad \text{and} \qquad p(y \mid x) = \mathcal{N}(y \mid Ax + b, P),$$

then the marginal distribution of $y$ is

$$p(y) = \mathcal{N}(y \mid A\mu + b, P + A\Psi A^{\top}).$$
See (Bishop, 2006), page 93 for details. In our case, we have
$$x = \mu, \quad \mu = \mu_N, \quad \Psi = \sigma_N^2, \quad y = \mathcal{D}', \quad A = 1, \quad b = 0, \quad P = \sigma^2,$$
which gives us
$$p(\mathcal{D}' \mid \mathcal{D}) = \mathcal{N}(\mathcal{D}' \mid \mu_N, \sigma^2 + \sigma_N^2).$$
And we’re done.
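Before moving on, here is a small Monte Carlo sketch of this result (the values of $\mu_N$ and $\sigma_N^2$ below are arbitrary, chosen only for illustration):

```python
import numpy as np

# Monte Carlo check that p(D' | D) = N(D' | mu_N, sigma^2 + sigma_N^2):
# sample mu from the posterior, then a new point x' from N(mu, sigma^2),
# and compare the empirical moments of x' to the closed form.
rng = np.random.default_rng(1)
sigma2, mu_N, sigma_N2 = 1.0, 1.8, 0.05
mu_samples = rng.normal(mu_N, np.sqrt(sigma_N2), size=200_000)
x_new = rng.normal(mu_samples, np.sqrt(sigma2))
print(x_new.mean(), mu_N)                  # should be close
print(x_new.var(), sigma2 + sigma_N2)      # should be close
```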
Estimating $\sigma^2$ with known $\mu$
Now let’s examine the scenario in which μ is fixed but σ2 is unknown. It is tempting to jump directly into the scenario in which both parameters are unknown, but I think it is worth it to go carefully through this example first. The remaining two cases are quicker to go through.
It is common to work with the precision of a Gaussian, $\lambda \triangleq 1/\sigma^2$, rather than the variance $\sigma^2$. The reason is that many terms in the Gaussian have $\sigma^2$ in a denominator, and it is easier to multiply by $\lambda$ than to keep dividing by $\sigma^2$.
Likelihood, prior, and posterior
In this case, our likelihood in terms of the precision $\lambda$ is

$$p(\mathcal{D} \mid \lambda) = \prod_{n=1}^{N} \mathcal{N}(x_n \mid \mu, \lambda^{-1}) \propto \lambda^{N/2} \exp\left\{-\frac{\lambda}{2}\sum_{n=1}^{N}(x_n - \mu)^2\right\}.$$
Now we want a prior whose functional form is $\lambda$ raised to some power times the exponential of a linear function of $\lambda$. As we will see, this functional form will ensure conjugacy, meaning:
$$\underbrace{\text{posterior}}_{\text{gamma}} \;\propto\; \underbrace{\text{likelihood}}_{\text{normal}} \times \underbrace{\text{prior}}_{\text{gamma}}$$
Consider the gamma distribution with hyperparameters $a_0$ and $b_0$:

$$\text{Gamma}(\lambda \mid a_0, b_0) = \frac{1}{\Gamma(a_0)}\, b_0^{a_0}\, \lambda^{a_0 - 1} \exp(-b_0 \lambda),$$
where the gamma function $\Gamma(a_0)$ is just a normalizing constant that does not depend on $\lambda$. It is easy to verify conjugacy by just computing our posterior:

$$p(\lambda \mid \mathcal{D}) \propto \lambda^{a_0 - 1} \exp(-b_0 \lambda)\; \lambda^{N/2} \exp\left\{-\frac{\lambda}{2}\sum_{n=1}^{N}(x_n - \mu)^2\right\} = \lambda^{a_N - 1} \exp(-b_N \lambda),$$

where

$$a_N = a_0 + \frac{N}{2}, \qquad b_N = b_0 + \frac{1}{2}\sum_{n=1}^{N}(x_n - \mu)^2 = b_0 + \frac{N}{2}\sigma^2_{\text{ML}},$$

and we use the fact that $\sigma^2_{\text{ML}} = \frac{1}{N}\sum_{n=1}^{N}(x_n - \mu)^2$. Once again, note that when $N = 0$, $a_N$ and $b_N$ reduce to $a_0$ and $b_0$. As we observe more data (as $N$ increases), the hyperparameters $a_0$ and $b_0$ are overwhelmed by the other additive terms.
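Here is a small Python sketch of this update (the helper `posterior_lambda` is mine; note that SciPy parameterizes the gamma by a shape and a scale, i.e. one over the rate $b_N$):

```python
import numpy as np
from scipy import stats

def posterior_lambda(x, mu, a0, b0):
    """Posterior over the precision lambda with known mean mu,
    given the conjugate prior Gamma(lambda | a0, b0) with rate b0.
    Returns the posterior hyperparameters a_N and b_N."""
    N = len(x)
    a_N = a0 + N / 2.0
    b_N = b0 + 0.5 * np.sum((x - mu) ** 2)      # = b0 + (N/2) * sigma^2_ML
    return a_N, b_N

rng = np.random.default_rng(2)
x = rng.normal(loc=0.0, scale=2.0, size=500)            # known mu = 0, true lambda = 1/4
a_N, b_N = posterior_lambda(x, mu=0.0, a0=1.0, b0=1.0)
posterior = stats.gamma(a=a_N, scale=1.0 / b_N)         # shape/scale form of Gamma(a_N, b_N)
print(posterior.mean())                                 # posterior mean of lambda, close to 0.25
```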
Posterior predictive
The posterior predictive for m=1 new observations is a T-distribution, but I will skip this derivation because it is extremely detailed. See (Murphy, 2007) for a derivation.
Estimating both $\mu$ and $\sigma^2$
Finally, let’s explore the scenario in which both $\mu$ and $\sigma^2$ are unknown. Once again, we’ll work with the precision $\lambda = 1/\sigma^2$ for mathematical convenience.
But at this point, we need to think carefully about what form the prior should take. Note that we can decompose our prior as
$$p(\mu, \lambda) = p(\mu \mid \lambda)\, p(\lambda).$$
This means we can use the results from the previous two sections: p(μ∣λ) will be a Gaussian distribution and p(λ) will be a gamma distribution. This is known as a normal-gamma or Gaussian-gamma:
$$p(\mu, \lambda) = \mathcal{N}(\mu \mid a, b)\, \text{Gamma}(\lambda \mid c, d) \qquad (\star)$$
for some values $a$, $b$, $c$, and $d$. If our prior takes the functional form $\star$, then we want our likelihood to be amenable: it should be a Gaussian in terms of $\mu$ times a gamma in terms of $\lambda$. Let’s start by expanding the exponent and moving every term containing $\mu$ into one exponential:

$$p(\mathcal{D} \mid \mu, \lambda) \propto \lambda^{N/2} \exp\left\{-\frac{\lambda}{2}\sum_{n=1}^{N}(x_n - \mu)^2\right\} = \lambda^{N/2} \exp\left\{-\frac{\lambda N}{2}(\mu - \bar{x})^2\right\} \exp\left\{-\frac{\lambda}{2}\left(\sum_{n=1}^{N} x_n^2 - N\bar{x}^2\right)\right\}.$$

We’re almost done. The rightmost exponential already looks like the exponential term in the gamma distribution, since $-\lambda$ multiplies a constant; we just need to move $\lambda^{N/2}$ between the two exponentials so that

$$p(\mathcal{D} \mid \mu, \lambda) \propto \underbrace{\exp\left\{-\frac{\lambda N}{2}(\mu - \bar{x})^2\right\}}_{\text{Gaussian in } \mu} \times \underbrace{\lambda^{N/2} \exp\left\{-\frac{\lambda}{2}\left(\sum_{n=1}^{N} x_n^2 - N\bar{x}^2\right)\right\}}_{\text{gamma-like in } \lambda},$$
and we’re done. In this case, I won’t work through the posterior because it should be obvious (in the sense that it is just algebra) what happens when we multiply a Gaussian-gamma prior times this likelihood.
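For completeness, here is a sketch of the standard normal-gamma update equations in the parameterization used by Murphy (2007), $p(\mu, \lambda) = \mathcal{N}(\mu \mid \mu_0, (\kappa_0\lambda)^{-1})\,\text{Gamma}(\lambda \mid a_0, b_0)$, rather than the generic $a$, $b$, $c$, $d$ above (the helper name and numbers are mine):

```python
import numpy as np

def posterior_normal_gamma(x, mu0, kappa0, a0, b0):
    """Conjugate update for unknown mean and precision under the
    normal-gamma prior N(mu | mu0, (kappa0 * lambda)^{-1}) Gamma(lambda | a0, b0).
    Returns the posterior hyperparameters (mu_N, kappa_N, a_N, b_N)."""
    N = len(x)
    xbar = np.mean(x)
    mu_N = (kappa0 * mu0 + N * xbar) / (kappa0 + N)
    kappa_N = kappa0 + N
    a_N = a0 + N / 2.0
    b_N = (b0 + 0.5 * np.sum((x - xbar) ** 2)
           + kappa0 * N * (xbar - mu0) ** 2 / (2.0 * (kappa0 + N)))
    return mu_N, kappa_N, a_N, b_N

rng = np.random.default_rng(3)
x = rng.normal(loc=2.0, scale=2.0, size=1000)    # true mu = 2, true lambda = 1/4
mu_N, kappa_N, a_N, b_N = posterior_normal_gamma(x, mu0=0.0, kappa0=1.0, a0=1.0, b0=1.0)
print(mu_N, a_N / b_N)    # posterior means of mu and lambda: roughly 2 and 0.25
```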
Posterior predictive
Once again, I will skip this derivation, but it should be intuitive that if the previous case reduced to a T-distribution, this case would as well. See (Murphy, 2007) for a derivation.
Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer.
Murphy, K. P. (2007). Conjugate Bayesian analysis of the Gaussian distribution. Technical note.