Conjugate Analysis for the Multivariate Gaussian

I work through Bayesian parameter estimation of the mean for the multivariate Gaussian.

The goal of this post is to derive the likelihood, posterior, and posterior predictive for a multivariate Gaussian model with an unknown mean parameter. Consider the $D$-variate Gaussian,

$$
\mathbf{x} \sim \mathcal{N}_D(\boldsymbol{\mu}, \boldsymbol{\Sigma}), \tag{1}
$$

with density function $f(\cdot)$:

$$
f(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) = (2\pi)^{-D/2} \det(\boldsymbol{\Sigma})^{-1/2} \exp\left(-\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^{\top} \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})\right). \tag{2}
$$

The likelihood over $N$ i.i.d. observations, collected into the design matrix $\mathbf{X}$, is

$$
\begin{aligned}
\mathcal{L}(\mathbf{X} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma})
&= \prod_{n=1}^{N} f(\mathbf{x}_n \mid \boldsymbol{\mu}, \boldsymbol{\Sigma})
\\
&= (2\pi)^{-ND/2} \det(\boldsymbol{\Sigma})^{-N/2} \exp\left(-\frac{1}{2} \sum_{n=1}^{N} (\mathbf{x}_n - \boldsymbol{\mu})^{\top} \boldsymbol{\Sigma}^{-1} (\mathbf{x}_n - \boldsymbol{\mu})\right).
\end{aligned} \tag{3}
$$
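As a quick numerical companion to Eq. 3, here is a minimal sketch in Python (the helper name, the toy data, and the parameter values are my own, chosen only for illustration) that evaluates the log of the likelihood for $N$ i.i.d. observations:

```python
import numpy as np
from scipy.stats import multivariate_normal

def gaussian_log_likelihood(X, mu, Sigma):
    """Log of Eq. 3: sum of i.i.d. multivariate Gaussian log-densities."""
    return multivariate_normal(mean=mu, cov=Sigma).logpdf(X).sum()

# Toy example: N = 5 observations of a D = 2 Gaussian.
rng = np.random.default_rng(0)
mu_true = np.array([1.0, -1.0])
Sigma = np.array([[2.0, 0.3],
                  [0.3, 1.0]])
X = rng.multivariate_normal(mu_true, Sigma, size=5)

print(gaussian_log_likelihood(X, mu_true, Sigma))
```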

Assume $\boldsymbol{\Sigma}$ is known. A common prior for the mean parameter $\boldsymbol{\mu}$ is another Gaussian,

$$
\pi(\boldsymbol{\mu}) = \mathcal{N}_D(\boldsymbol{\mu}_0, \boldsymbol{\Sigma}_0). \tag{4}
$$

Our goal is to show that $\pi(\boldsymbol{\mu})$ is a conjugate prior, meaning the posterior $p(\boldsymbol{\mu} \mid \mathbf{X})$ is also Gaussian. The posterior is

$$
p(\boldsymbol{\mu} \mid \mathbf{X}) \propto \mathcal{L}(\mathbf{X} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) \, \pi(\boldsymbol{\mu}), \tag{5}
$$

which we can write out explicitly as

$$
\begin{aligned}
&\mathcal{L}(\mathbf{X} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) \, \pi(\boldsymbol{\mu})
\\
&\propto \exp\left(-\frac{1}{2} \sum_{n=1}^{N} (\mathbf{x}_n - \boldsymbol{\mu})^{\top} \boldsymbol{\Sigma}^{-1} (\mathbf{x}_n - \boldsymbol{\mu})\right) \exp\left(-\frac{1}{2} (\boldsymbol{\mu} - \boldsymbol{\mu}_0)^{\top} \boldsymbol{\Sigma}_0^{-1} (\boldsymbol{\mu} - \boldsymbol{\mu}_0)\right).
\end{aligned} \tag{6}
$$

Expanding the quadratic forms inside the exponents (and ignoring the shared factor of $-\frac{1}{2}$), we get

$$
\sum_{n=1}^{N} \mathbf{x}_n^{\top} \boldsymbol{\Sigma}^{-1} \mathbf{x}_n + N \boldsymbol{\mu}^{\top} \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu} - 2 N \bar{\mathbf{x}}^{\top} \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu} + \boldsymbol{\mu}^{\top} \boldsymbol{\Sigma}_0^{-1} \boldsymbol{\mu} + \boldsymbol{\mu}_0^{\top} \boldsymbol{\Sigma}_0^{-1} \boldsymbol{\mu}_0 - 2 \boldsymbol{\mu}^{\top} \boldsymbol{\Sigma}_0^{-1} \boldsymbol{\mu}_0, \tag{7}
$$

where $\bar{\mathbf{x}} = \frac{1}{N} \sum_{n=1}^{N} \mathbf{x}_n$. Since the posterior is a distribution over $\boldsymbol{\mu}$, we can drop terms that do not depend on $\boldsymbol{\mu}$ (we still retain the Gaussian kernel, which we can normalize at the end) and combine like terms:

$$
p(\boldsymbol{\mu} \mid \mathbf{X}) \propto \exp\left( -\frac{1}{2} \left[\boldsymbol{\mu}^{\top} (N\boldsymbol{\Sigma}^{-1} + \boldsymbol{\Sigma}_0^{-1}) \boldsymbol{\mu} - 2 \boldsymbol{\mu}^{\top} (N\boldsymbol{\Sigma}^{-1} \bar{\mathbf{x}} + \boldsymbol{\Sigma}_0^{-1} \boldsymbol{\mu}_0 ) \right] \right). \tag{8}
$$

Now we just complete the square. First, let

$$
\begin{aligned}
\mathbf{M} &= N\boldsymbol{\Sigma}^{-1} + \boldsymbol{\Sigma}_0^{-1},
\\
\mathbf{b} &= N\boldsymbol{\Sigma}^{-1} \bar{\mathbf{x}} + \boldsymbol{\Sigma}_0^{-1} \boldsymbol{\mu}_0.
\end{aligned} \tag{9}
$$

Then Eq. 8 is equivalent to

$$
\begin{aligned}
p(\boldsymbol{\mu} \mid \mathbf{X}) &\propto \exp\left( -\frac{1}{2} \left[ (\boldsymbol{\mu} - \mathbf{M}^{-1} \mathbf{b})^{\top} \mathbf{M} (\boldsymbol{\mu} - \mathbf{M}^{-1} \mathbf{b}) - \mathbf{b}^{\top} \mathbf{M}^{-1} \mathbf{b} \right]\right),
\\
&\Downarrow
\\
p(\boldsymbol{\mu} \mid \mathbf{X}) &= \mathcal{N}_D(\boldsymbol{\mu} \mid \mathbf{M}^{-1} \mathbf{b}, \mathbf{M}^{-1}).
\end{aligned} \tag{10}
$$
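As a sanity check on the completed square, expanding the quadratic in the first line of Eq. 10 (using the fact that $\mathbf{M}$ is symmetric) recovers exactly the bracketed expression in Eq. 8, since the constant $\mathbf{b}^{\top} \mathbf{M}^{-1} \mathbf{b}$ cancels:

$$
(\boldsymbol{\mu} - \mathbf{M}^{-1} \mathbf{b})^{\top} \mathbf{M} (\boldsymbol{\mu} - \mathbf{M}^{-1} \mathbf{b}) - \mathbf{b}^{\top} \mathbf{M}^{-1} \mathbf{b}
= \boldsymbol{\mu}^{\top} \mathbf{M} \boldsymbol{\mu} - 2 \boldsymbol{\mu}^{\top} \mathbf{b}.
$$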

This is the posterior for a multivariate Gaussian with unknown mean. We can write it in a standard form (see, e.g., Murphy, 2007) as:

$$
\begin{aligned}
p(\boldsymbol{\mu} \mid \mathbf{X}) &= \mathcal{N}_D(\boldsymbol{\mu} \mid \boldsymbol{\mu}_N, \boldsymbol{\Sigma}_N),
\\
\boldsymbol{\Sigma}_N &= (N\boldsymbol{\Sigma}^{-1} + \boldsymbol{\Sigma}_0^{-1})^{-1},
\\
\boldsymbol{\mu}_N &= \boldsymbol{\Sigma}_N (N\boldsymbol{\Sigma}^{-1} \bar{\mathbf{x}} + \boldsymbol{\Sigma}_0^{-1} \boldsymbol{\mu}_0).
\end{aligned} \tag{11}
$$
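The update in Eq. 11 translates directly into code. Below is a minimal sketch (the function name and the prior values are my own choices, continuing the toy example above) of the posterior computation:

```python
def posterior_mean_cov(X, Sigma, mu0, Sigma0):
    """Posterior over the mean (Eq. 11), assuming Sigma is known."""
    N = X.shape[0]
    xbar = X.mean(axis=0)
    Sigma_inv = np.linalg.inv(Sigma)
    Sigma0_inv = np.linalg.inv(Sigma0)
    # Sigma_N = (N Sigma^-1 + Sigma_0^-1)^-1
    Sigma_N = np.linalg.inv(N * Sigma_inv + Sigma0_inv)
    # mu_N = Sigma_N (N Sigma^-1 xbar + Sigma_0^-1 mu_0)
    mu_N = Sigma_N @ (N * Sigma_inv @ xbar + Sigma0_inv @ mu0)
    return mu_N, Sigma_N

# Vague prior centered at the origin.
mu0 = np.zeros(2)
Sigma0 = 10.0 * np.eye(2)
mu_N, Sigma_N = posterior_mean_cov(X, Sigma, mu0, Sigma0)
print(mu_N)  # pulled from mu0 toward the sample mean xbar
```

Note that as $N$ grows or the prior becomes vague ($\boldsymbol{\Sigma}_0^{-1} \rightarrow \mathbf{0}$), $\boldsymbol{\mu}_N$ approaches the sample mean $\bar{\mathbf{x}}$ and $\boldsymbol{\Sigma}_N$ approaches $\boldsymbol{\Sigma}/N$.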

To compute the posterior predictive,

$$
p(\mathbf{x}_{\texttt{new}} \mid \mathbf{X}) = \int f(\mathbf{x}_{\texttt{new}} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) \, p(\boldsymbol{\mu} \mid \mathbf{X}) \, d\boldsymbol{\mu}, \tag{12}
$$

we observe the following:

$$
\begin{aligned}
(\mathbf{x}_{\texttt{new}} - \boldsymbol{\mu}) &\sim \mathcal{N}(\mathbf{x}_{\texttt{new}} - \boldsymbol{\mu} \mid \mathbf{0}, \boldsymbol{\Sigma}),
\\
\boldsymbol{\mu} &\sim \mathcal{N}(\boldsymbol{\mu} \mid \boldsymbol{\mu}_{N}, \boldsymbol{\Sigma}_N).
\end{aligned} \tag{13}
$$

Notice that $(\mathbf{x}_{\texttt{new}} - \boldsymbol{\mu})$ and $\boldsymbol{\mu}$ are independent. Intuitively, if I tell you the value of $\boldsymbol{\mu}$, that tells you nothing about the value of $(\mathbf{x}_{\texttt{new}} - \boldsymbol{\mu})$ because the distribution of $(\mathbf{x}_{\texttt{new}} - \boldsymbol{\mu})$ does not depend on $\boldsymbol{\mu}$. Formally,

$$
p(\mathbf{x}_{\texttt{new}} - \boldsymbol{\mu} \mid \boldsymbol{\mu}) = p(\mathbf{x}_{\texttt{new}} - \boldsymbol{\mu}). \tag{14}
$$

Since $(\mathbf{x}_{\texttt{new}} - \boldsymbol{\mu})$ and $\boldsymbol{\mu}$ are independent Gaussians, their sum is Gaussian with mean and covariance given by the sums of the individual means and covariances, implying:

$$
\begin{aligned}
\mathbf{x}_{\texttt{new}} &= (\mathbf{x}_{\texttt{new}} - \boldsymbol{\mu}) + \boldsymbol{\mu}
\\
&\Downarrow
\\
p(\mathbf{x}_{\texttt{new}} \mid \mathbf{X}) &= \mathcal{N}(\mathbf{x}_{\texttt{new}} \mid \boldsymbol{\mu}_N, \boldsymbol{\Sigma} + \boldsymbol{\Sigma}_N).
\end{aligned} \tag{15}
$$
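Eq. 15 is easy to check by simulation. In the sketch below (continuing the snippets above; the sample size is an arbitrary choice), sampling $\mathbf{x}_{\texttt{new}} \mid \boldsymbol{\mu} \sim \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ is implemented as adding independent zero-mean noise, i.e. the $(\mathbf{x}_{\texttt{new}} - \boldsymbol{\mu})$ term, to a posterior draw of $\boldsymbol{\mu}$; the empirical mean and covariance of $\mathbf{x}_{\texttt{new}}$ then match the analytic predictive:

```python
S = 100_000

# Ancestral sampling: mu ~ posterior, then x_new = mu + (x_new - mu),
# where (x_new - mu) ~ N(0, Sigma) independently of mu.
mu_samples = rng.multivariate_normal(mu_N, Sigma_N, size=S)
noise = rng.multivariate_normal(np.zeros(2), Sigma, size=S)
x_new = mu_samples + noise

# Empirical moments of x_new vs. the posterior predictive of Eq. 15.
print(x_new.mean(axis=0), mu_N)          # both close to mu_N
print(np.cov(x_new.T), Sigma + Sigma_N)  # both close to Sigma + Sigma_N
```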

  1. Murphy, K. P. (2007). Conjugate Bayesian analysis of the Gaussian distribution. Technical report, University of British Columbia.