Bayesian Inference for the Gaussian

I work through several cases of Bayesian parameter estimation of Gaussian models.

Estimating the parameters of a Gaussian distribution using its conjugate prior is a common task in Bayesian inference. In this blog post, I want to derive the likelihood, conjugate prior, posterior, and posterior predictive for a few important cases: when we estimate just $\mu$ with known $\sigma^2$, when we estimate just $\sigma^2$ with known $\mu$, and when we jointly estimate both parameters. For simplicity, I’ll stick to univariate models. My goal is to provide detailed yet fluid derivations. Once again, I rely on (Bishop, 2006; Murphy, 2007) for their excellent explanations and notation.

Estimating $\mu$ with known $\sigma^2$

Let’s begin with the simplest case: a Gaussian distribution with an unknown mean $\mu$ but with a known or fixed variance $\sigma^2$.

Likelihood, prior, and posterior

The likelihood is

$$
\begin{aligned}
p(D \mid \mu, \sigma^2) &= \prod_{n=1}^{N} p(x_n \mid \mu, \sigma^2)
\\
&\triangleq \prod_{n=1}^{N} \Bigg( \frac{1}{(2 \pi \sigma^2)^{1/2}} \exp \Big\{ -\frac{1}{2 \sigma^2} (x_n - \mu)^2 \Big\} \Bigg)
\\
&= \frac{1}{(2 \pi \sigma^2)^{N/2}} \exp \Big\{ -\frac{1}{2 \sigma^2} \sum_{n=1}^{N} (x_n - \mu)^2 \Big\}
\end{aligned}
$$
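Before moving on, here is a minimal numerical sketch of this likelihood. The function name, data, and parameter values are my own for illustration; it just checks the closed form above against summing per-point log densities from scipy:

```python
import numpy as np
from scipy.stats import norm

def gaussian_log_likelihood(x, mu, sigma2):
    """Log of p(D | mu, sigma^2) for i.i.d. univariate Gaussian data x."""
    n = len(x)
    return -0.5 * n * np.log(2 * np.pi * sigma2) - np.sum((x - mu) ** 2) / (2 * sigma2)

# Hypothetical data and parameters, purely for illustration.
rng = np.random.default_rng(0)
x = rng.normal(loc=1.0, scale=2.0, size=100)

closed_form = gaussian_log_likelihood(x, mu=1.0, sigma2=4.0)
per_point = norm.logpdf(x, loc=1.0, scale=2.0).sum()
print(np.isclose(closed_form, per_point))  # True
```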

We can see that if we multiply this likelihood by another Gaussian, we will get another Gaussian. This is because each Gaussian can be written as the exponential of a quadratic function of $\mu$, and the product of two such terms is again an exponential of a quadratic. Therefore a conjugate prior on $\mu$ has the form

$$
p(\mu) = \mathcal{N}(\mu \mid \mu_0, \sigma_0^2) = \frac{1}{(2 \pi \sigma_0^2)^{1/2}} \exp \Big\{ -\frac{1}{2 \sigma_0^2} (\mu - \mu_0)^2 \Big\}
$$

and the posterior is

$$
p(\mu \mid D) \propto p(D \mid \mu, \sigma^2) p(\mu)
$$

Let’s work through this calculation in detail to see the form of the posterior distribution. To be clear, the goal is to write the posterior in a form like this,

$$
p(\mu \mid D) \propto \alpha \exp \Big\{ \beta (\mu - \gamma)^2 \Big\}
$$

for some values $\alpha$, $\beta$, and $\gamma$. This means we need to massage the likelihood times the prior until we get a functional form as above. First, we just apply the definitions and then drop anything that clearly does not depend on $\mu$. As a convention, any time we drop a term because it does not depend on $\mu$, we highlight it in red before it is dropped. Finally, let

$$
\bar{x} = \frac{1}{N} \sum_{n=1}^{N} x_n \implies \sum_{n=1}^{N} x_n = N \bar{x}
$$

Then we have:

$$
\begin{aligned}
p(\mu \mid D) &\propto p(D \mid \mu, \sigma^2) p(\mu)
\\
&\triangleq \Bigg( \frac{1}{(2 \pi \sigma^2)^{N/2}} \exp \Big\{ -\frac{1}{2 \sigma^2} \sum_{n=1}^{N} (x_n - \mu)^2 \Big\} \Bigg) \Bigg( \frac{1}{(2 \pi \sigma_0^2)^{1/2}} \exp \Big\{ -\frac{1}{2 \sigma_0^2} (\mu - \mu_0)^2 \Big\} \Bigg)
\\
&= \textcolor{#CC0000}{\frac{1}{(2 \pi \sigma^2)^{N/2} (2 \pi \sigma_0^2)^{1/2}}} \exp \Big\{ -\frac{1}{2 \sigma^2} \sum_{n=1}^{N} (x_n - \mu)^2 - \frac{1}{2 \sigma_0^2} (\mu - \mu_0)^2 \Big\}
\\
&\propto \exp \Big\{ -\frac{1}{2 \sigma^2} \sum_{n=1}^{N} (x_n - \mu)^2 - \frac{1}{2 \sigma_0^2} (\mu - \mu_0)^2 \Big\}
\\
&= \exp \Big\{ -\frac{1}{2 \sigma^2} \sum_{n=1}^{N} \big( x_n^2 + \mu^2 - 2 x_n \mu \big) - \frac{1}{2 \sigma_0^2} \Big( \mu^2 + \mu_0^2 - 2 \mu \mu_0 \Big) \Big\}
\\
&= \exp \Big\{ -\frac{1}{2 \sigma^2} \Big( \textcolor{#CC0000}{\sum_{n=1}^{N} x_n^2} + N \mu^2 - 2 \mu \sum_{n=1}^{N} x_n \Big) - \frac{1}{2 \sigma_0^2} \Big( \mu^2 + \textcolor{#CC0000}{\mu_0^2} - 2 \mu \mu_0 \Big) \Big\}
\\
&\propto \exp \Big\{ -\frac{1}{2 \sigma^2} \Big( N \mu^2 - 2 \mu \sum_{n=1}^{N} x_n \Big) - \frac{1}{2 \sigma_0^2} \Big( \mu^2 - 2 \mu \mu_0 \Big) \Big\}
\\
&= \exp \Big\{ - \frac{1}{2} \Big( \frac{N \mu^2}{\sigma^2} - \frac{2 \mu N \bar{x}}{\sigma^2} + \frac{\mu^2}{\sigma_0^2} - \frac{2 \mu \mu_0}{\sigma_0^2} \Big) \Big\}
\\
&= \exp \Big\{ - \frac{1}{2} \Big( \mu^2 \Big( \frac{N}{\sigma^2} + \frac{1}{\sigma_0^2} \Big) - 2 \mu \Big( \frac{N \bar{x}}{\sigma^2} + \frac{\mu_0}{\sigma_0^2} \Big) \Big) \Big\}
\end{aligned}
$$

At this point, we might be stuck. We’ve isolated $\mu$ and $\mu^2$, but it’s not clear how to solve for just $\mu$. But Bishop (p. 98) gives a hint, writing:

Simple manipulation involving completing the square in the exponent shows that the posterior distribution is given by…

But what does this mean? This deserves a small digression.

Completing the square

Recall that completing the square is a trick we learned in algebra when factorizing polynomials. Consider the problem of solving for $x$ in this equation:

$$
x^2 - 4x = 5
$$

We did this by completing the square, meaning adding something to $x^2 - 4x$ so that it becomes a perfect square. In this case, note that

$$
\begin{aligned}
x^2 - 4x &= 5 \\
x^2 - 4x + 4 &= 5 + 4 \\
(x - 2)^2 &= 9 \\
x - 2 &= \pm 3 \\
x &= \text{$5$ or $-1$}
\end{aligned}
$$

More generally, since for any quadratic polynomial,

$$
(n - m)^2 = n^2 - 2nm + m^2
$$

we can complete the square in the following way:

$$
\begin{aligned}
ax^2 + bx + c &= d
\\
x^2 + \frac{b}{a} x + \frac{c}{a} &= \frac{d}{a}
\\
x^2 + \frac{b}{a} x + \Big( \frac{b}{2a} \Big)^2 &= \frac{d}{a} - \frac{c}{a} + \Big( \frac{b}{2a} \Big)^2
\\
\Big( x + \frac{b}{2a} \Big)^2 &= \frac{d}{a} - \frac{c}{a} + \Big( \frac{b}{2a} \Big)^2
\end{aligned}
$$

Since $a$, $b$, $c$, and $d$ are all constants, this reduces to the form:

$$
(x + \alpha)^2 = \beta
$$

for some values $\alpha$ and $\beta$, and we know that $x = \pm \sqrt{\beta} - \alpha$. This is a general technique and is actually how one would derive the quadratic formula. But for us, this is precisely the trick we want to use here, except that now $x = \mu$.
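As a quick sanity check on this digression, here is a small symbolic sketch (using sympy; the variable names are mine) showing that completing the square changes $a\mu^2 - 2b\mu$ only by a constant that does not depend on $\mu$:

```python
import sympy as sp

# Symbols standing in for the a, b, and mu used below; a, b > 0 for simplicity.
mu = sp.symbols('mu')
a, b = sp.symbols('a b', positive=True)

original = a * mu**2 - 2 * b * mu
completed = a * (mu - b / a)**2

# The difference is constant in mu, so after exponentiation it is absorbed
# into the normalizing constant of the posterior.
print(sp.simplify(original - completed))  # -b**2/a
```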

Continuing the derivation

To ease notation, let’s ignore the exponential and the $-1/2$ factor for now, and let $a$ and $b$ be defined as

$$
\overbrace{\Big( \frac{N}{\sigma^2} + \frac{1}{\sigma_0^2} \Big)}^{a} \mu^2 - 2 \overbrace{\Big( \frac{N \bar{x}}{\sigma^2} + \frac{\mu_0}{\sigma_0^2} \Big)}^{b} \mu
$$

Then our expression is

$$
a \mu^2 - 2 b \mu
$$

We can then complete the square as above:

$$
\begin{aligned}
a \mu^2 - 2 b \mu &= a \Big( \mu^2 - 2 \frac{b}{a} \mu \Big)
\\
&= a \Big( \mu^2 - 2 \frac{b}{a} \mu + \Big( \frac{b}{a} \Big)^2 \textcolor{#CC0000}{- \Big( \frac{b}{a} \Big)^2} \Big)
\\
&\propto a \Big( \mu^2 - 2 \frac{b}{a} \mu + \Big( \frac{b}{a} \Big)^2 \Big)
\\
&= a \Big( \mu - \frac{b}{a} \Big)^2
\end{aligned}
$$

where we use the cute trick that we can add $(b/a)^2 - (b/a)^2 = 0$ inside the parentheses, but then ignore one of the newly added terms because it does not depend on $\mu$. Adding the exponential and the $-1/2$ factor back, we get:

$$
p(\mu \mid D) \propto \mathcal{N}(\mu \mid \mu_N, \sigma_N^2) \propto \exp \Big\{ - \frac{a}{2} \Big( \mu - \frac{b}{a} \Big)^2 \Big\}
$$

We can then solve for the posterior parameters, $\mu_N$ and $\sigma_N^2$, in terms of $a$ and $b$:

$$
\begin{aligned}
-\frac{a}{2} &= - \frac{1}{2 \sigma_N^2}
\\
&\Downarrow
\\
\sigma_N^2 &= \frac{1}{a}
\\
&= \Big( \frac{N}{\sigma^2} + \frac{1}{\sigma_0^2} \Big)^{-1}
\end{aligned}
$$

and

$$
\begin{aligned}
\mu_N &= \frac{b}{a}
\\
&= \frac{\frac{N \bar{x}}{\sigma^2} + \frac{\mu_0}{\sigma_0^2}}{\frac{N}{\sigma^2} + \frac{1}{\sigma_0^2}}
\\
&= \frac{\frac{N \bar{x} \sigma_0^2 + \mu_0 \sigma^2}{\sigma^2 \sigma_0^2}}{\frac{N \sigma_0^2 + \sigma^2}{\sigma^2 \sigma_0^2}}
\\
&= \frac{N \bar{x} \sigma_0^2 + \mu_0 \sigma^2}{N \sigma_0^2 + \sigma^2}
\\
&= \frac{N \bar{x} \sigma_0^2}{N \sigma_0^2 + \sigma^2} + \frac{\mu_0 \sigma^2}{N \sigma_0^2 + \sigma^2}
\\
&= \Big( \frac{N \sigma_0^2}{N \sigma_0^2 + \sigma^2} \Big) \mu_{\text{ML}} + \Big( \frac{\sigma^2}{N \sigma_0^2 + \sigma^2} \Big) \mu_0
\end{aligned}
$$

where we use the fact that $\mu_{\text{ML}}$ is $\bar{x}$, the sample mean:

$$
\mu_{\text{ML}} = \frac{1}{N} \sum_{n=1}^N x_n
$$

Note two things. First, if $N = 0$, then $\mu_N = \mu_0$, which is expected. The prior is our modeling assumption in the absence of data. Second, as $N \rightarrow \infty$, $\mu_N \rightarrow \mu_{\text{ML}}$, which is ideal. With enough data, we disregard our prior in favor of the optimal parameter given our data.
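To make these updates concrete, here is a minimal sketch in Python. The function name and the toy data are my own; it just implements the two formulas above and prints the two limiting behaviors:

```python
import numpy as np

def posterior_mu(x, sigma2, mu0, sigma02):
    """Posterior N(mu | mu_N, sigma_N^2) for the mean, with known variance sigma2."""
    n = len(x)
    sigma_n2 = 1.0 / (n / sigma2 + 1.0 / sigma02)
    mu_n = sigma_n2 * (n * np.mean(x) / sigma2 + mu0 / sigma02) if n > 0 else mu0
    return mu_n, sigma_n2

# Hypothetical data: true mean 3, known variance 1.
rng = np.random.default_rng(1)
x = rng.normal(loc=3.0, scale=1.0, size=1000)

# No data: the posterior is just the prior.
print(posterior_mu(x[:0], sigma2=1.0, mu0=0.0, sigma02=1.0))  # (0.0, 1.0)
# Lots of data: mu_N is close to the sample mean and sigma_N^2 is small.
print(posterior_mu(x, sigma2=1.0, mu0=0.0, sigma02=1.0))
```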

Posterior predictive

In Bayesian inference, the posterior predictive is

$$
p(D' \mid D)
$$

where $D'$ is unseen data. In words, it is the distribution of unobserved data given observed data. Once we have our posterior, we can marginalize over the parameter to get the posterior predictive:

$$
\begin{aligned}
p(D' \mid D) &= \int p(D' \mid D, \mu) p(\mu \mid D) \, d\mu
\\
&\stackrel{\star}{=} \int p(D' \mid \mu) p(\mu \mid D) \, d\mu
\\
&\triangleq \int \mathcal{N}(D' \mid \mu, \sigma^2) \, \mathcal{N}(\mu \mid \mu_N, \sigma_N^2) \, d\mu
\end{aligned}
$$

where step $\star$ holds because the modeling assumption is that $D'$ is conditionally independent of $D$ given $\mu$, i.e. that $p(D' \mid D, \mu) = p(D' \mid \mu)$. This is a reasonable assumption in that it claims that our training data and unseen data are both generated independently from the same distribution.

Since both the likelihood of the new data and the posterior are Gaussian, we can use the following fact:

$$
\begin{aligned}
p(\textbf{x}) &= \mathcal{N}(\textbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Psi})
\\
p(\textbf{y} \mid \textbf{x}) &= \mathcal{N}(\textbf{y} \mid \textbf{A} \textbf{x} + \textbf{b}, \textbf{P})
\\
&\Downarrow
\\
p(\textbf{y}) &= \mathcal{N}(\textbf{y} \mid \textbf{A} \boldsymbol{\mu} + \textbf{b}, \textbf{P} + \textbf{A} \boldsymbol{\Psi} \textbf{A}^{\top})
\end{aligned}
$$

See (Bishop, 2006), page 93 for details. In our case, we have

$$
\begin{aligned}
\textbf{x} &= \mu \\
\boldsymbol{\mu} &= \mu_N \\
\boldsymbol{\Psi} &= \sigma_N^2 \\
\textbf{y} &= D' \\
\textbf{A} &= 1 \\
\textbf{b} &= 0 \\
\textbf{P} &= \sigma^2
\end{aligned}
$$

which gives us

$$
p(D' \mid D) = \mathcal{N}(D' \mid \mu_N, \sigma^2 + \sigma_N^2)
$$

And we’re done.
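As a sanity check on this result, we can also sample from the posterior predictive by ancestral sampling: draw $\mu$ from the posterior, then draw a new observation given that $\mu$. A minimal sketch with made-up data (all names and values are mine):

```python
import numpy as np

rng = np.random.default_rng(2)
sigma2, mu0, sigma02 = 1.0, 0.0, 2.0
x = rng.normal(loc=3.0, scale=np.sqrt(sigma2), size=20)

# Posterior parameters from the derivation above.
n, xbar = len(x), x.mean()
sigma_n2 = 1.0 / (n / sigma2 + 1.0 / sigma02)
mu_n = sigma_n2 * (n * xbar / sigma2 + mu0 / sigma02)

# Ancestral sampling: mu ~ posterior, then x' ~ N(mu, sigma^2).
mu_samples = rng.normal(mu_n, np.sqrt(sigma_n2), size=200_000)
x_new = rng.normal(mu_samples, np.sqrt(sigma2))

print(x_new.mean(), mu_n)              # both roughly mu_N
print(x_new.var(), sigma2 + sigma_n2)  # both roughly sigma^2 + sigma_N^2
```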

Estimating $\sigma^2$ with known $\mu$

Now let’s examine the scenario in which $\mu$ is fixed but $\sigma^2$ is unknown. It is tempting to jump directly into the scenario in which both parameters are unknown, but I think it is worth going carefully through this example first. The remaining two cases will go more quickly.

It is common to work with the precision of a Gaussian, $\lambda \triangleq \frac{1}{\sigma^2}$, rather than the variance. The reason is that many terms in the Gaussian have $\sigma^2$ in a denominator, and it is easier to work with $\lambda$ than with $\sigma^{-2}$.

Likelihood, prior, and posterior

In this case, our likelihood in terms of the precision $\lambda$ is

$$
\begin{aligned}
p(D \mid \mu, \lambda) &= \prod_{n=1}^{N} p(x_n \mid \mu, \lambda)
\\
&\triangleq \prod_{n=1}^{N} \Bigg( \frac{\lambda^{1/2}}{(2 \pi)^{1/2}} \exp \Big\{ -\frac{\lambda}{2} (x_n - \mu)^2 \Big\} \Bigg)
\\
&= \frac{\lambda^{N/2}}{(2 \pi)^{N/2}} \exp \Big\{ -\frac{\lambda}{2} \sum_{n=1}^{N} (x_n - \mu)^2 \Big\}
\end{aligned}
$$

Now we want a prior with a functional form that is $\lambda$ raised to some power times the exponential of a linear function of $\lambda$. As we will see, this functional form will ensure conjugacy, meaning:

$$
\overbrace{\text{gamma}}^{\text{posterior}} \propto \overbrace{\text{normal}}^{\text{likelihood}} \times \overbrace{\text{gamma}}^{\text{prior}}
$$

Consider the gamma distribution with hyperparameters $a_0$ and $b_0$:

$$
\text{Gamma}(\lambda \mid a_0, b_0) = \frac{1}{\Gamma(a_0)} b_0^{a_0} \lambda^{a_0-1} \exp(-b_0 \lambda)
$$

where the gamma function $\Gamma(a_0)$ is just a normalizing constant that does not depend on $\lambda$. It is easy to verify conjugacy by just computing our posterior:

$$
\begin{aligned}
p(\lambda \mid D) &\propto p(D \mid \mu, \lambda) p(\lambda)
\\
&\triangleq \Bigg( \frac{\lambda^{N/2}}{\textcolor{#CC0000}{(2 \pi)^{N/2}}} \exp \Big\{ -\frac{\lambda}{2} \sum_{n=1}^{N} (x_n - \mu)^2 \Big\} \Bigg) \Bigg( \textcolor{#CC0000}{\frac{1}{\Gamma(a_0)}} \textcolor{#CC0000}{b_0^{a_0}} \lambda^{a_0-1} \exp(-b_0 \lambda) \Bigg)
\\
&\propto \Bigg( \lambda^{N/2} \exp \Big\{ -\frac{\lambda}{2} \sum_{n=1}^{N} (x_n - \mu)^2 \Big\} \Bigg) \Bigg( \lambda^{a_0-1} \exp(-b_0 \lambda) \Bigg)
\\
&= \lambda^{N/2 + a_0 - 1} \exp \Big\{ -\lambda \Big( b_0 + \frac{1}{2} \sum_{n=1}^{N} (x_n - \mu)^2 \Big) \Big\}
\end{aligned}
$$

where we can see that this is another gamma distribution. We can compute the parameters $a_N$ and $b_N$ with a little algebra:

$$
\begin{aligned}
a_N - 1 &= \frac{N}{2} + a_0 - 1
\\
a_N &= \frac{N}{2} + a_0
\\ \\
b_N &= b_0 + \frac{1}{2} \sum_{n=1}^{N} (x_n - \mu)^2
\\
&= b_0 + \frac{N}{2} \sigma^2_{\text{ML}}
\end{aligned}
$$

where we use the fact that $\sigma^2_{\text{ML}} = \frac{1}{N} \sum_{n=1}^{N} (x_n - \mu)^2$. Once again, note that when $N = 0$, $a_N$ and $b_N$ reduce to $a_0$ and $b_0$. As we observe more data (as $N$ increases), the hyperparameters $a_0$ and $b_0$ are overwhelmed by the other additive terms.
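Here is a minimal sketch of this update in Python. The data and hyperparameter values are my own; the one practical wrinkle is that scipy parameterizes the gamma distribution by a scale, so the rate $b_N$ enters as `scale=1/b_N`:

```python
import numpy as np
from scipy.stats import gamma

rng = np.random.default_rng(3)
mu = 0.0                                     # known mean
x = rng.normal(loc=mu, scale=2.0, size=50)   # true precision is 1/4

a0, b0 = 2.0, 1.0                            # Gamma(a_0, b_0) prior on lambda
a_n = a0 + len(x) / 2.0
b_n = b0 + 0.5 * np.sum((x - mu) ** 2)

# scipy's gamma uses shape and scale, so the rate b_n becomes scale = 1/b_n.
posterior = gamma(a=a_n, scale=1.0 / b_n)
print(posterior.mean())                      # roughly 0.25, the true precision
```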

Posterior predictive

The posterior predictive for $m = 1$ new observations is a Student's t-distribution, but I will skip this derivation because it is quite involved. See (Murphy, 2007) for a derivation.

Estimating both $\mu$ and $\sigma^2$

Finally, let’s explore the scenario in which both $\mu$ and $\sigma^2$ are unknown. Once again, we’ll work with the precision $\lambda = \frac{1}{\sigma^2}$ for mathematical convenience.

Likelihood, prior, and posterior

The likelihood is

$$
\begin{aligned}
p(D \mid \mu, \lambda) &= \prod_{n=1}^{N} p(x_n \mid \mu, \lambda)
\\
&= \prod_{n=1}^{N} \Big[ \Big( \frac{\lambda}{2 \pi} \Big)^{1/2} \exp \Big\{-\frac{\lambda}{2} (x_n - \mu)^2 \Big\} \Big]
\\
&= \Big( \frac{\lambda}{\textcolor{#CC0000}{2 \pi}} \Big)^{N/2} \exp \Big\{-\frac{\lambda}{2} \sum_{n=1}^{N} (x_n - \mu)^2 \Big\}
\\
&\propto \lambda^{N/2} \exp \Big\{-\frac{\lambda}{2} \sum_{n=1}^{N} (x_n - \mu)^2 \Big\}
\end{aligned}
$$

But at this point, we need to think carefully about what form this should take. Note that we can decompose our prior as

$$
p(\mu, \lambda) = p(\mu \mid \lambda) \, p(\lambda)
$$

This means we can use the results from the previous two sections: $p(\mu \mid \lambda)$ will be a Gaussian distribution and $p(\lambda)$ will be a gamma distribution. This is known as a normal-gamma or Gaussian-gamma distribution:

$$
p(\mu, \lambda) = \mathcal{N}(\mu \mid a, b) \, \text{Gamma}(\lambda \mid c, d) \tag{$\star$}
$$

for some values $a$, $b$, $c$, and $d$. If our prior takes the functional form $\star$, then we want our likelihood to be amenable: it should be a Gaussian in terms of $\mu$ times a gamma in terms of $\lambda$. Let’s start by expanding the exponent and moving any term containing $\mu$ into one exponential:

$$
\begin{aligned}
&= \lambda^{N/2} \exp \Big\{-\frac{\lambda}{2} \sum_{n=1}^{N} (x_n^2 + \mu^2 - 2 x_n \mu) \Big\}
\\
&= \lambda^{N/2} \exp \Big\{-\frac{\lambda}{2} \sum_{n=1}^{N} x_n^2 - \frac{N \lambda \mu^2}{2} + \lambda \mu \sum_{n=1}^{N} x_n \Big\}
\\
&= \lambda^{N/2} \exp \Big\{ - \frac{N \lambda \mu^2}{2} \Big\} \exp \Big\{-\frac{\lambda}{2} \sum_{n=1}^{N} x_n^2 + \lambda \mu \sum_{n=1}^{N} x_n \Big\}
\\
&= \lambda^{N/2} \exp \Big\{ - \frac{N \lambda \mu^2}{2} + \lambda \mu \sum_{n=1}^{N} x_n \Big\} \exp \Big\{-\frac{\lambda}{2} \sum_{n=1}^{N} x_n^2 \Big\}
\end{aligned}
$$

This looks okay. Now we need to complete the square as before, again using the notation $\bar{x} = \frac{1}{N} \sum_{n=1}^{N} x_n$:

$$
\begin{aligned}
&= \lambda^{N/2} \exp \Big\{ - \frac{N \lambda}{2} \Big( \mu^2 - 2 \mu \frac{\sum_{n=1}^{N} x_n}{N} \Big) \Big\} \exp \Big\{-\frac{\lambda}{2} \sum_{n=1}^{N} x_n^2 \Big\}
\\
&= \lambda^{N/2} \exp \Big\{ - \frac{N \lambda}{2} \Big( \mu^2 - 2 \mu \bar{x} + \bar{x}^2 - \bar{x}^2 \Big) \Big\} \exp \Big\{-\frac{\lambda}{2} \sum_{n=1}^{N} x_n^2 \Big\}
\\
&= \lambda^{N/2} \exp \Big\{ - \frac{N \lambda}{2} \Big( \mu^2 - 2 \mu \bar{x} + \bar{x}^2 \Big) \Big\} \exp \Big\{-\frac{\lambda}{2} \sum_{n=1}^{N} x_n^2 + \frac{N \lambda \bar{x}^2}{2} \Big\}
\end{aligned}
$$

We’re almost done. The rightmost exponential almost looks like the exponential term in the gamma distribution, except we need to pull out $-\lambda$. And we need to move $\lambda^{N/2}$ between the two exponentials so that:

$$
\begin{aligned}
&= \overbrace{\exp \Big\{ - \frac{N \lambda}{2} \Big( \mu - \bar{x} \Big)^2 \Big\}}^{\text{Gaussian}} \quad \overbrace{\lambda^{N/2} \exp \Big\{- \lambda \Big( \frac{\sum_{n=1}^{N} x_n^2}{2} - \frac{\Big( \sum_{n=1}^{N} x_n \Big)^2}{2 N} \Big) \Big\}}^{\text{gamma}}
\end{aligned}
$$

and we’re done. In this case, I won’t work through the posterior because it should be obvious (in the sense that it is just algebra) what happens when we multiply a Gaussian-gamma prior times this likelihood.
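The post does not derive that posterior, but for reference, here is a sketch of the standard Normal-Gamma update as given in Murphy (2007). The hyperparameters $(\mu_0, \kappa_0, a_0, b_0)$ follow that note's parametrization, where $p(\mu \mid \lambda) = \mathcal{N}(\mu \mid \mu_0, (\kappa_0 \lambda)^{-1})$, rather than anything derived above:

```python
import numpy as np

def normal_gamma_update(x, mu0, kappa0, a0, b0):
    """Conjugate update for a Normal-Gamma(mu0, kappa0, a0, b0) prior on (mu, lambda).
    These are the standard formulas from Murphy (2007), not derived in this post."""
    n, xbar = len(x), np.mean(x)
    kappa_n = kappa0 + n
    mu_n = (kappa0 * mu0 + n * xbar) / kappa_n
    a_n = a0 + n / 2.0
    b_n = (b0 + 0.5 * np.sum((x - xbar) ** 2)
           + kappa0 * n * (xbar - mu0) ** 2 / (2.0 * kappa_n))
    return mu_n, kappa_n, a_n, b_n

# Hypothetical data with true mean 1 and true precision 1/4.
rng = np.random.default_rng(4)
x = rng.normal(loc=1.0, scale=2.0, size=500)

mu_n, kappa_n, a_n, b_n = normal_gamma_update(x, mu0=0.0, kappa0=1.0, a0=1.0, b0=1.0)
print(mu_n)       # roughly 1
print(a_n / b_n)  # posterior mean of lambda, roughly 1/4
```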

Posterior predictive

Once again, I will skip this derivation, but it should be intuitive that if the previous case reduced to a T-distribution, this case would as well. See (Murphy, 2007) for a derivation.

  1. Bishop, C. M. (2006). Pattern Recognition and Machine Learning.
  2. Murphy, K. P. (2007). Conjugate Bayesian analysis of the Gaussian distribution. Technical note.