The goal of this post is to derive the likelihood, posterior, and posterior predictive for a multivariate Gaussian model with an unknown mean parameter. Consider the D-variate Gaussian,
$$\mathbf{x} \sim \mathcal{N}_D(\boldsymbol{\mu}, \boldsymbol{\Sigma}), \tag{1}$$
with density function f(⋅):
$$f(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) = (2\pi)^{-D/2} \det(\boldsymbol{\Sigma})^{-1/2} \exp\left(-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^{\top} \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})\right). \tag{2}$$
The likelihood over N i.i.d. observations, denoted with the design matrix X, is

$$\mathcal{L}(\mathbf{X} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \prod_{n=1}^{N} f(\mathbf{x}_n \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) = (2\pi)^{-ND/2} \det(\boldsymbol{\Sigma})^{-N/2} \exp\left(-\frac{1}{2} \sum_{n=1}^{N} (\mathbf{x}_n - \boldsymbol{\mu})^{\top} \boldsymbol{\Sigma}^{-1} (\mathbf{x}_n - \boldsymbol{\mu})\right). \tag{3}$$
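As a sanity check, the log of Eq. 3 can be computed term by term and compared against SciPy's multivariate normal log-density. This is just a numerical sketch; the dimensions, data, and variable names below are arbitrary choices of mine:

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
D, N = 3, 50

# Made-up parameters and data for the check.
mu = rng.normal(size=D)
A = rng.normal(size=(D, D))
Sigma = A @ A.T + D * np.eye(D)  # random symmetric positive-definite covariance
X = rng.multivariate_normal(mu, Sigma, size=N)

# Log of Eq. 3, term by term.
Sigma_inv = np.linalg.inv(Sigma)
diffs = X - mu
quad = np.einsum('nd,de,ne->', diffs, Sigma_inv, diffs)
log_lik = (-0.5 * N * D * np.log(2 * np.pi)
           - 0.5 * N * np.linalg.slogdet(Sigma)[1]
           - 0.5 * quad)

# Should agree with summing SciPy's log-density over the N observations.
log_lik_scipy = multivariate_normal(mu, Sigma).logpdf(X).sum()
print(np.isclose(log_lik, log_lik_scipy))  # True
```

Working in log space avoids the underflow that the raw product in Eq. 3 would cause for even moderate N.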
Assume Σ is known. A common prior for the mean parameter μ is another Gaussian,
$$\pi(\boldsymbol{\mu}) = \mathcal{N}_D(\boldsymbol{\mu}_0, \boldsymbol{\Sigma}_0). \tag{4}$$
Our goal is to show that π(μ) is a conjugate prior, meaning the posterior p(μ∣X) is also Gaussian. The posterior is
$$p(\boldsymbol{\mu} \mid \mathbf{X}) \propto \mathcal{L}(\mathbf{X} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) \, \pi(\boldsymbol{\mu}), \tag{5}$$
which we can write out explicitly as
$$\mathcal{L}(\mathbf{X} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) \, \pi(\boldsymbol{\mu}) \propto \exp\left(-\frac{1}{2} \sum_{n=1}^{N} (\mathbf{x}_n - \boldsymbol{\mu})^{\top} \boldsymbol{\Sigma}^{-1} (\mathbf{x}_n - \boldsymbol{\mu})\right) \exp\left(-\frac{1}{2} (\boldsymbol{\mu} - \boldsymbol{\mu}_0)^{\top} \boldsymbol{\Sigma}_0^{-1} (\boldsymbol{\mu} - \boldsymbol{\mu}_0)\right). \tag{6}$$
We can write the terms inside the exponents as
$$\sum_{n=1}^{N} \mathbf{x}_n^{\top} \boldsymbol{\Sigma}^{-1} \mathbf{x}_n + N \boldsymbol{\mu}^{\top} \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu} - 2 N \bar{\mathbf{x}}^{\top} \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu} + \boldsymbol{\mu}^{\top} \boldsymbol{\Sigma}_0^{-1} \boldsymbol{\mu} + \boldsymbol{\mu}_0^{\top} \boldsymbol{\Sigma}_0^{-1} \boldsymbol{\mu}_0 - 2 \boldsymbol{\mu}^{\top} \boldsymbol{\Sigma}_0^{-1} \boldsymbol{\mu}_0, \tag{7}$$
where $\bar{\mathbf{x}} = \frac{1}{N} \sum_{n=1}^{N} \mathbf{x}_n$. Since the posterior is w.r.t. μ, we can drop terms that do not depend on μ (we still retain the Gaussian kernel and can properly normalize it afterward) and combine like terms:
$$p(\boldsymbol{\mu} \mid \mathbf{X}) \propto \exp\left(-\frac{1}{2}\left[\boldsymbol{\mu}^{\top} (N \boldsymbol{\Sigma}^{-1} + \boldsymbol{\Sigma}_0^{-1}) \boldsymbol{\mu} - 2 \boldsymbol{\mu}^{\top} (N \boldsymbol{\Sigma}^{-1} \bar{\mathbf{x}} + \boldsymbol{\Sigma}_0^{-1} \boldsymbol{\mu}_0)\right]\right). \tag{8}$$
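The expansion from Eq. 6 to Eq. 7 is easy to get wrong by a factor or a sign, so here is a small numerical check that the two quadratic forms inside the exponents of Eq. 6 equal the expanded terms of Eq. 7. All names, dimensions, and values are arbitrary choices of mine:

```python
import numpy as np

rng = np.random.default_rng(1)
D, N = 3, 10

# Arbitrary SPD covariances, means, and data.
A, B = rng.normal(size=(D, D)), rng.normal(size=(D, D))
Sigma, Sigma0 = A @ A.T + D * np.eye(D), B @ B.T + D * np.eye(D)
mu, mu0 = rng.normal(size=D), rng.normal(size=D)
X = rng.normal(size=(N, D))
xbar = X.mean(axis=0)

Si, S0i = np.linalg.inv(Sigma), np.linalg.inv(Sigma0)

# The two quadratic forms inside the exponents of Eq. 6.
lhs = sum((x - mu) @ Si @ (x - mu) for x in X) + (mu - mu0) @ S0i @ (mu - mu0)

# The expanded terms of Eq. 7.
rhs = (sum(x @ Si @ x for x in X)
       + N * mu @ Si @ mu - 2 * N * xbar @ Si @ mu
       + mu @ S0i @ mu + mu0 @ S0i @ mu0 - 2 * mu @ S0i @ mu0)

print(np.isclose(lhs, rhs))  # True
```

Note that the cross terms collapse to $-2N\bar{\mathbf{x}}^{\top}\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}$ only because $\boldsymbol{\Sigma}^{-1}$ is symmetric.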
Now we just complete the square. First, let
$$\begin{aligned}
\mathbf{M} &= N \boldsymbol{\Sigma}^{-1} + \boldsymbol{\Sigma}_0^{-1}, \\
\mathbf{b} &= N \boldsymbol{\Sigma}^{-1} \bar{\mathbf{x}} + \boldsymbol{\Sigma}_0^{-1} \boldsymbol{\mu}_0.
\end{aligned} \tag{9}$$
Then Eq. 8 is equivalent to
$$\begin{aligned}
p(\boldsymbol{\mu} \mid \mathbf{X}) &\propto \exp\left(-\frac{1}{2}\left[(\boldsymbol{\mu} - \mathbf{M}^{-1} \mathbf{b})^{\top} \mathbf{M} (\boldsymbol{\mu} - \mathbf{M}^{-1} \mathbf{b}) - \mathbf{b}^{\top} \mathbf{M}^{-1} \mathbf{b}\right]\right) \\
&\Downarrow \\
p(\boldsymbol{\mu} \mid \mathbf{X}) &= \mathcal{N}_D(\boldsymbol{\mu} \mid \mathbf{M}^{-1} \mathbf{b}, \mathbf{M}^{-1}).
\end{aligned} \tag{10}$$
This is the posterior for a multivariate Gaussian with unknown mean. We can write it in a standard form, e.g. see (Murphy, 2007), as:
$$\begin{aligned}
p(\boldsymbol{\mu} \mid \mathbf{X}) &= \mathcal{N}_D(\boldsymbol{\mu} \mid \boldsymbol{\mu}_N, \boldsymbol{\Sigma}_N), \\
\boldsymbol{\Sigma}_N &= (N \boldsymbol{\Sigma}^{-1} + \boldsymbol{\Sigma}_0^{-1})^{-1}, \\
\boldsymbol{\mu}_N &= \boldsymbol{\Sigma}_N (N \boldsymbol{\Sigma}^{-1} \bar{\mathbf{x}} + \boldsymbol{\Sigma}_0^{-1} \boldsymbol{\mu}_0).
\end{aligned} \tag{11}$$
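One way to check Eq. 11 numerically: if the posterior really is $\mathcal{N}_D(\boldsymbol{\mu}_N, \boldsymbol{\Sigma}_N)$, then the log of the unnormalized posterior (likelihood times prior) minus that Gaussian's log-density must be the same μ-independent constant at every point. A sketch of this check, with made-up data and parameters:

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(2)
D, N = 2, 40

# Assumed known likelihood covariance and prior; values are made up.
Sigma = np.array([[2.0, 0.3], [0.3, 1.0]])
mu0, Sigma0 = np.zeros(D), np.eye(D)
X = rng.multivariate_normal([1.0, -1.0], Sigma, size=N)
xbar = X.mean(axis=0)

# Eq. 11: posterior covariance and mean.
Si, S0i = np.linalg.inv(Sigma), np.linalg.inv(Sigma0)
Sigma_N = np.linalg.inv(N * Si + S0i)
mu_N = Sigma_N @ (N * Si @ xbar + S0i @ mu0)

# Log[likelihood * prior] evaluated at a candidate mean m.
def log_unnorm_post(m):
    return (multivariate_normal(m, Sigma).logpdf(X).sum()
            + multivariate_normal(mu0, Sigma0).logpdf(m))

# The difference from the claimed posterior's log-density should be
# the same constant (the log normalizer) at every test point.
post = multivariate_normal(mu_N, Sigma_N)
points = rng.normal(size=(5, D))
consts = [log_unnorm_post(m) - post.logpdf(m) for m in points]
print(np.allclose(consts, consts[0]))  # True
```

This "constant difference" test verifies the kernel without having to compute the normalizing constant by hand.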
To compute the posterior predictive,
$$p(\mathbf{x}_{\text{new}} \mid \mathbf{X}) = \int f(\mathbf{x}_{\text{new}} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) \, p(\boldsymbol{\mu} \mid \mathbf{X}) \, d\boldsymbol{\mu}, \tag{12}$$
we observe the following:
$$\begin{aligned}
(\mathbf{x}_{\text{new}} - \boldsymbol{\mu}) &\sim \mathcal{N}(\mathbf{x}_{\text{new}} - \boldsymbol{\mu} \mid \mathbf{0}, \boldsymbol{\Sigma}), \\
\boldsymbol{\mu} &\sim \mathcal{N}(\boldsymbol{\mu} \mid \boldsymbol{\mu}_N, \boldsymbol{\Sigma}_N).
\end{aligned} \tag{13}$$
Notice that (xnew−μ) and μ are independent. Intuitively, if I tell you the value of μ, that tells you nothing about the value of (xnew−μ) because the distribution on (xnew−μ) does not contain the parameter μ. Formally,
$$p(\mathbf{x}_{\text{new}} - \boldsymbol{\mu} \mid \boldsymbol{\mu}) = p(\mathbf{x}_{\text{new}} - \boldsymbol{\mu}). \tag{14}$$
Since (xnew−μ) and μ are independent Gaussian random variables, their sum is also Gaussian, with mean the sum of the means and covariance the sum of the covariances, implying:
$$\begin{aligned}
\mathbf{x}_{\text{new}} &= (\mathbf{x}_{\text{new}} - \boldsymbol{\mu}) + \boldsymbol{\mu} \\
&\Downarrow \\
p(\mathbf{x}_{\text{new}} \mid \mathbf{X}) &= \mathcal{N}(\mathbf{x}_{\text{new}} \mid \boldsymbol{\mu}_N, \boldsymbol{\Sigma} + \boldsymbol{\Sigma}_N).
\end{aligned} \tag{15}$$
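The posterior predictive in Eq. 15 can also be checked by Monte Carlo: sample μ from the posterior, then sample xnew given μ, and compare the empirical mean and covariance against $\boldsymbol{\mu}_N$ and $\boldsymbol{\Sigma} + \boldsymbol{\Sigma}_N$. A sketch with made-up values (tolerances reflect Monte Carlo error at this sample size):

```python
import numpy as np

rng = np.random.default_rng(3)
D, N, S = 2, 30, 200_000

# Made-up known covariance, prior, and data.
Sigma = np.array([[1.5, 0.2], [0.2, 0.8]])
mu0, Sigma0 = np.zeros(D), np.eye(D)
X = rng.multivariate_normal([0.5, 2.0], Sigma, size=N)

# Posterior parameters from Eq. 11.
Si, S0i = np.linalg.inv(Sigma), np.linalg.inv(Sigma0)
Sigma_N = np.linalg.inv(N * Si + S0i)
mu_N = Sigma_N @ (N * Si @ X.mean(axis=0) + S0i @ mu0)

# Monte Carlo version of Eq. 12: draw mu from the posterior,
# then add independent N(0, Sigma) noise to get x_new.
mus = rng.multivariate_normal(mu_N, Sigma_N, size=S)
x_new = mus + rng.multivariate_normal(np.zeros(D), Sigma, size=S)

# Eq. 15 predicts mean mu_N and covariance Sigma + Sigma_N.
print(np.allclose(x_new.mean(axis=0), mu_N, atol=0.03))
print(np.allclose(np.cov(x_new.T), Sigma + Sigma_N, atol=0.03))
```

Note the simulation mirrors the decomposition in Eq. 15 exactly: each draw of xnew is an independent (xnew−μ) term plus a posterior draw of μ.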