I work through several cases of Bayesian parameter estimation of Gaussian models.
Published
04 April 2019
Estimating the parameters of a Gaussian distribution and its conjugate prior is a common task in Bayesian inference. In this blog post, I want to derive the likelihood, conjugate prior, posterior, and posterior predictive for a few important cases: when we estimate just $\mu$ with known $\sigma^2$, when we estimate just $\sigma^2$ with known $\mu$, and when we jointly estimate both parameters. For simplicity, I’ll stick to univariate models. My goal is to provide detailed yet fluid derivations. Once again, I rely on (Bishop, 2006; Murphy, 2007) for their excellent explanations and notation.
Estimating $\mu$ with known $\sigma^2$
Let’s begin with the simplest case: a Gaussian distribution with an unknown mean $\mu$ but with a known or fixed variance $\sigma^2$. In this case, the likelihood of the data $\mathcal{D} = \{x_1, \dots, x_N\}$ is

$$p(\mathcal{D} \mid \mu) = \prod_{n=1}^{N} \mathcal{N}(x_n \mid \mu, \sigma^2).$$
We can see that if we multiply this likelihood by another Gaussian, we will get another Gaussian. This is because each Gaussian can be written as the exponential of a quadratic function of $\mu$, and the product of two such exponentials is again the exponential of a quadratic. Therefore a conjugate prior on $\mu$ has the form

$$p(\mu) = \mathcal{N}(\mu \mid \mu_0, \sigma_0^2).$$
Let’s work through this calculation in detail to see the form of the posterior distribution. To be clear, the goal is to write the posterior in a form like this,
$$p(\mu \mid \mathcal{D}) \propto \alpha \exp\{\beta(\mu - \gamma)^2\}$$
for some values $\alpha$, $\beta$, and $\gamma$. This means we need to massage the likelihood times the prior until we get a functional form as above. First, we just apply the definitions and then drop anything that clearly does not depend on $\mu$. (As a convention, any time we drop a term because it does not depend on $\mu$, we highlight it in red before it is dropped.) After these steps, we are left with an exponent that contains terms in both $\mu$ and $\mu^2$.
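Written out, the exponent of the likelihood times the prior is, up to additive constants that do not depend on $\mu$,

$$-\frac{1}{2\sigma^2}\sum_{n=1}^{N}(x_n - \mu)^2 - \frac{1}{2\sigma_0^2}(\mu - \mu_0)^2 = -\frac{1}{2}\left[\left(\frac{1}{\sigma_0^2} + \frac{N}{\sigma^2}\right)\mu^2 - 2\left(\frac{\mu_0}{\sigma_0^2} + \frac{1}{\sigma^2}\sum_{n=1}^{N} x_n\right)\mu\right] + \text{const}.$$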
At this point, we might be stuck. We’ve isolated μ and μ2 but it’s not clear how to solve for just μ. But Bishop (p. 98) gives a hint, writing:
Simple manipulation involving completing the square in the exponent shows that the posterior distribution is given by…
But what does this mean? This deserves a small digression.
Completing the square
Recall that completing the square is a trick we learned in algebra when factorizing polynomials. Consider the problem of solving for x in this equation:
$$x^2 - 4x = 5$$
We did this by completing the square, meaning adding something to $x^2 - 4x$ to make it a perfect square. In this case, note that
$$\begin{aligned}
x^2 - 4x &= 5 \\
x^2 - 4x + 4 &= 5 + 4 \\
(x - 2)^2 &= 9 \\
x - 2 &= \pm 3 \\
x &= 5 \;\text{ or }\; -1
\end{aligned}$$
More generally, for any quadratic polynomial we can write

$$ax^2 + bx + c = d \quad\Longrightarrow\quad a\left(x + \frac{b}{2a}\right)^2 = d - c + \frac{b^2}{4a}.$$

Dividing both sides by $a$, and since $a$, $b$, $c$, and $d$ are all constants, this reduces to the form:
$$(x + \alpha)^2 = \beta$$
for some values $\alpha$ and $\beta$, and we know that $x = \pm\sqrt{\beta} - \alpha$. This is a general technique; it is actually how one derives the quadratic formula. For us, it is precisely the trick we want to use here, except now with $\mu$ in place of $x$.
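As a tiny check of the worked example above, here is a short SymPy sketch (just to double-check the algebra; the derivation does not depend on it):

```python
import sympy as sp

# Check that x^2 - 4x = 5 and the completed square (x - 2)^2 = 9 agree.
x = sp.symbols('x')
print(sp.solve(sp.Eq(x**2 - 4*x, 5), x))                # [-1, 5]
print(sp.expand((x - 2)**2 - 9) == x**2 - 4*x - 5)      # True: same polynomial
```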
Continuing the derivation
To ease notation, let’s ignore the exponential and the $-1/2$ term for now, and let $a$ and $b$ be defined as

$$a \triangleq \frac{1}{\sigma_0^2} + \frac{N}{\sigma^2}, \qquad b \triangleq \frac{\mu_0}{\sigma_0^2} + \frac{1}{\sigma^2}\sum_{n=1}^{N} x_n.$$

Completing the square then gives

$$a\mu^2 - 2b\mu = a\left(\mu^2 - 2\frac{b}{a}\mu + \left(\frac{b}{a}\right)^2\right) - \frac{b^2}{a} = a\left(\mu - \frac{b}{a}\right)^2 - \frac{b^2}{a},$$

where we use the cute trick that we can add $(b/a)^2 - (b/a)^2 = 0$ inside the parentheses, but then ignore one of the newly added terms because it does not depend on $\mu$. Adding the exponential and the $-1/2$ term back, we get:
$$p(\mu \mid \mathcal{D}) \propto \exp\left\{-\frac{a}{2}\left(\mu - \frac{b}{a}\right)^2\right\} \propto \mathcal{N}(\mu \mid \mu_N, \sigma_N^2).$$
We can then solve for the posterior parameters, $\mu_N$ and $\sigma_N^2$, in terms of $a$ and $b$:

$$\mu_N = \frac{b}{a} = \frac{\sigma^2}{N\sigma_0^2 + \sigma^2}\,\mu_0 + \frac{N\sigma_0^2}{N\sigma_0^2 + \sigma^2}\,\mu_{\text{ML}}, \qquad \frac{1}{\sigma_N^2} = a = \frac{1}{\sigma_0^2} + \frac{N}{\sigma^2},$$
where we use the fact that $\mu_{\text{ML}}$ is $\bar{x}$, the sample mean:

$$\mu_{\text{ML}} = \frac{1}{N}\sum_{n=1}^{N} x_n.$$
Note two things. First, if $N = 0$, then $\mu_N = \mu_0$, which is expected: the prior is our modeling assumption in the absence of data. Second, if $N \to \infty$, then $\mu_N \to \mu_{\text{ML}}$, which is ideal: with enough data, we disregard our prior in favor of the optimal parameter given our data.
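As a quick numerical sanity check, here is a minimal Python sketch of these update equations (NumPy only; the helper `posterior_mu` and the chosen numbers are my own, for illustration):

```python
import numpy as np

def posterior_mu(x, sigma2, mu0, sigma02):
    """Posterior over mu for a Gaussian with known variance sigma2,
    given the conjugate prior N(mu | mu0, sigma02).
    Returns the posterior mean mu_N and variance sigma_N^2."""
    N = len(x)
    mu_ml = np.mean(x) if N > 0 else 0.0                     # sample mean (unused when N = 0)
    sigma_N2 = 1.0 / (1.0 / sigma02 + N / sigma2)            # 1/sigma_N^2 = 1/sigma_0^2 + N/sigma^2
    mu_N = sigma_N2 * (mu0 / sigma02 + N * mu_ml / sigma2)   # mu_N = b / a
    return mu_N, sigma_N2

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.0, size=1000)                # true mu = 2, known sigma^2 = 1
print(posterior_mu(x, sigma2=1.0, mu0=0.0, sigma02=10.0))    # mu_N close to 2, small sigma_N^2
print(posterior_mu(np.array([]), sigma2=1.0, mu0=0.0, sigma02=10.0))  # N = 0 recovers the prior
```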
Posterior predictive
In Bayesian inference, the posterior predictive is
p(D′∣D)
where D′ is unseen data. In words, it is the distribution of unobserved data given observed data. Once we have our posterior, we can marginalize over the parameter to get the posterior predictive:

$$p(\mathcal{D}' \mid \mathcal{D}) = \int p(\mathcal{D}' \mid \mu, \mathcal{D})\, p(\mu \mid \mathcal{D})\, d\mu \;\stackrel{\star}{=}\; \int p(\mathcal{D}' \mid \mu)\, p(\mu \mid \mathcal{D})\, d\mu,$$
where step ⋆ holds because the modeling assumption is that D′ is conditionally independent from D given μ or that p(D′∣D,μ)=p(D′∣μ). This is a reasonable assumption in that it claims that our training data and unseen data are both generated independently from the same distribution.
Since both the posterior $p(\mu \mid \mathcal{D})$ and the likelihood of new data $p(\mathcal{D}' \mid \mu)$ are Gaussian, we can use the following fact: if

$$p(x) = \mathcal{N}(x \mid \mu, \Psi) \qquad \text{and} \qquad p(y \mid x) = \mathcal{N}(y \mid Ax + b, P),$$

then the marginal distribution of $y$ is

$$p(y) = \mathcal{N}(y \mid A\mu + b, P + A\Psi A^{\top}).$$
See (Bishop, 2006), page 93 for details. In our case, we have
$$x = \mu, \quad \mu = \mu_N, \quad \Psi = \sigma_N^2, \quad y = \mathcal{D}', \quad A = 1, \quad b = 0, \quad P = \sigma^2,$$
which gives us
$$p(\mathcal{D}' \mid \mathcal{D}) = \mathcal{N}(\mathcal{D}' \mid \mu_N, \sigma^2 + \sigma_N^2).$$
And we’re done.
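Before moving on, here is a small Monte Carlo sketch of this result (the values of $\mu_N$ and $\sigma_N^2$ below are arbitrary, chosen only for illustration):

```python
import numpy as np

# Monte Carlo check that p(D' | D) = N(D' | mu_N, sigma^2 + sigma_N^2):
# sample mu from the posterior, then a new point x' from N(mu, sigma^2),
# and compare the empirical moments of x' to the closed form.
rng = np.random.default_rng(1)
sigma2, mu_N, sigma_N2 = 1.0, 1.8, 0.05
mu_samples = rng.normal(mu_N, np.sqrt(sigma_N2), size=200_000)
x_new = rng.normal(mu_samples, np.sqrt(sigma2))
print(x_new.mean(), mu_N)                  # should be close
print(x_new.var(), sigma2 + sigma_N2)      # should be close
```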
Estimating $\sigma^2$ with known $\mu$
Now let’s examine the scenario in which μ is fixed but σ2 is unknown. It is tempting to jump directly into the scenario in which both parameters are unknown, but I think it is worth it to go carefully through this example first. The remaining two cases are quicker to go through.
It is common to work with the precision of a Gaussian, $\lambda \triangleq 1/\sigma^2$, rather than the variance $\sigma^2$. The reason is that many terms in the Gaussian have $\sigma^2$ in a denominator, and it is easier to multiply by $\lambda$ than to keep dividing by $\sigma^2$.
Likelihood, prior, and posterior
In this case, our likelihood in terms of the precision $\lambda$ is

$$p(\mathcal{D} \mid \lambda) = \prod_{n=1}^{N} \mathcal{N}(x_n \mid \mu, \lambda^{-1}) \propto \lambda^{N/2} \exp\left\{-\frac{\lambda}{2}\sum_{n=1}^{N}(x_n - \mu)^2\right\}.$$
Now we want a prior whose functional form is $\lambda$ raised to some power times the exponential of a linear function of $\lambda$. As we will see, this functional form will ensure conjugacy, meaning:
$$\underbrace{\text{posterior}}_{\text{gamma}} \;\propto\; \underbrace{\text{likelihood}}_{\text{normal}} \times \underbrace{\text{prior}}_{\text{gamma}}$$
Consider the gamma distribution with hyperparameters $a_0$ and $b_0$:

$$\text{Gamma}(\lambda \mid a_0, b_0) = \frac{1}{\Gamma(a_0)}\, b_0^{a_0}\, \lambda^{a_0 - 1} \exp(-b_0 \lambda),$$
where the gamma function $\Gamma(a_0)$ is just a normalizing constant that does not depend on $\lambda$. It is easy to verify conjugacy by just computing our posterior:

$$p(\lambda \mid \mathcal{D}) \propto \lambda^{a_0 - 1} \exp(-b_0 \lambda)\; \lambda^{N/2} \exp\left\{-\frac{\lambda}{2}\sum_{n=1}^{N}(x_n - \mu)^2\right\} = \lambda^{a_N - 1} \exp(-b_N \lambda),$$

where

$$a_N = a_0 + \frac{N}{2}, \qquad b_N = b_0 + \frac{1}{2}\sum_{n=1}^{N}(x_n - \mu)^2 = b_0 + \frac{N}{2}\sigma^2_{\text{ML}},$$

and we use the fact that $\sigma^2_{\text{ML}} = \frac{1}{N}\sum_{n=1}^{N}(x_n - \mu)^2$. Once again, note that when $N = 0$, $a_N$ and $b_N$ reduce to $a_0$ and $b_0$. As we observe more data (as $N$ increases), the hyperparameters $a_0$ and $b_0$ are overwhelmed by the other additive terms.
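Here is a small Python sketch of this update (the helper `posterior_lambda` is mine; note that SciPy parameterizes the gamma by a shape and a scale, i.e. one over the rate $b_N$):

```python
import numpy as np
from scipy import stats

def posterior_lambda(x, mu, a0, b0):
    """Posterior over the precision lambda with known mean mu,
    given the conjugate prior Gamma(lambda | a0, b0) with rate b0.
    Returns the posterior hyperparameters a_N and b_N."""
    N = len(x)
    a_N = a0 + N / 2.0
    b_N = b0 + 0.5 * np.sum((x - mu) ** 2)      # = b0 + (N/2) * sigma^2_ML
    return a_N, b_N

rng = np.random.default_rng(2)
x = rng.normal(loc=0.0, scale=2.0, size=500)            # known mu = 0, true lambda = 1/4
a_N, b_N = posterior_lambda(x, mu=0.0, a0=1.0, b0=1.0)
posterior = stats.gamma(a=a_N, scale=1.0 / b_N)         # shape/scale form of Gamma(a_N, b_N)
print(posterior.mean())                                 # posterior mean of lambda, close to 0.25
```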
Posterior predictive
The posterior predictive for m=1 new observations is a T-distribution, but I will skip this derivation because it is extremely detailed. See (Murphy, 2007) for a derivation.
Estimating both $\mu$ and $\sigma^2$
Finally, let’s explore the scenario in which both $\mu$ and $\sigma^2$ are unknown. Once again, we’ll work with the precision $\lambda = 1/\sigma^2$ for mathematical convenience.
But at this point, we need to think carefully about what form the prior should take. Note that we can decompose our prior as
$$p(\mu, \lambda) = p(\mu \mid \lambda)\, p(\lambda).$$
This means we can use the results from the previous two sections: p(μ∣λ) will be a Gaussian distribution and p(λ) will be a gamma distribution. This is known as a normal-gamma or Gaussian-gamma:
$$p(\mu, \lambda) = \mathcal{N}(\mu \mid a, b)\, \text{Gamma}(\lambda \mid c, d) \qquad (\star)$$
for some values $a$, $b$, $c$, and $d$. If our prior takes the functional form $\star$, then we want our likelihood to be amenable: it should be a Gaussian in terms of $\mu$ times a gamma in terms of $\lambda$. Let’s start by expanding the exponent and moving every term containing $\mu$ into one exponential:

$$p(\mathcal{D} \mid \mu, \lambda) \propto \lambda^{N/2} \exp\left\{-\frac{\lambda}{2}\sum_{n=1}^{N}(x_n - \mu)^2\right\} = \lambda^{N/2} \exp\left\{-\frac{\lambda N}{2}(\mu - \bar{x})^2\right\} \exp\left\{-\frac{\lambda}{2}\left(\sum_{n=1}^{N} x_n^2 - N\bar{x}^2\right)\right\}.$$

We’re almost done. The rightmost exponential already looks like the exponential term in the gamma distribution, since $-\lambda$ multiplies a constant; we just need to move $\lambda^{N/2}$ between the two exponentials so that

$$p(\mathcal{D} \mid \mu, \lambda) \propto \underbrace{\exp\left\{-\frac{\lambda N}{2}(\mu - \bar{x})^2\right\}}_{\text{Gaussian in } \mu} \times \underbrace{\lambda^{N/2} \exp\left\{-\frac{\lambda}{2}\left(\sum_{n=1}^{N} x_n^2 - N\bar{x}^2\right)\right\}}_{\text{gamma-like in } \lambda},$$
and we’re done. In this case, I won’t work through the posterior because it should be obvious (in the sense that it is just algebra) what happens when we multiply a Gaussian-gamma prior times this likelihood.
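For completeness, here is a sketch of the standard normal-gamma update equations in the parameterization used by Murphy (2007), $p(\mu, \lambda) = \mathcal{N}(\mu \mid \mu_0, (\kappa_0\lambda)^{-1})\,\text{Gamma}(\lambda \mid a_0, b_0)$, rather than the generic $a$, $b$, $c$, $d$ above (the helper name and numbers are mine):

```python
import numpy as np

def posterior_normal_gamma(x, mu0, kappa0, a0, b0):
    """Conjugate update for unknown mean and precision under the
    normal-gamma prior N(mu | mu0, (kappa0 * lambda)^{-1}) Gamma(lambda | a0, b0).
    Returns the posterior hyperparameters (mu_N, kappa_N, a_N, b_N)."""
    N = len(x)
    xbar = np.mean(x)
    mu_N = (kappa0 * mu0 + N * xbar) / (kappa0 + N)
    kappa_N = kappa0 + N
    a_N = a0 + N / 2.0
    b_N = (b0 + 0.5 * np.sum((x - xbar) ** 2)
           + kappa0 * N * (xbar - mu0) ** 2 / (2.0 * (kappa0 + N)))
    return mu_N, kappa_N, a_N, b_N

rng = np.random.default_rng(3)
x = rng.normal(loc=2.0, scale=2.0, size=1000)    # true mu = 2, true lambda = 1/4
mu_N, kappa_N, a_N, b_N = posterior_normal_gamma(x, mu0=0.0, kappa0=1.0, a0=1.0, b0=1.0)
print(mu_N, a_N / b_N)    # posterior means of mu and lambda: roughly 2 and 0.25
```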
Posterior predictive
Once again, I will skip this derivation, but it should be intuitive that if the previous case reduced to a T-distribution, this case would as well. See (Murphy, 2007) for a derivation.
Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer.
Murphy, K. P. (2007). Conjugate Bayesian analysis of the Gaussian distribution. Technical note.