The goal of this post is to derive the likelihood, posterior, and posterior predictive for a multivariate Gaussian model with an unknown mean parameter. Consider the D-variate Gaussian,
$$\mathbf{x} \sim \mathcal{N}_D(\boldsymbol{\mu}, \boldsymbol{\Sigma}), \tag{1}$$
with density function f(⋅):
$$f(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) = (2\pi)^{-D/2} \det(\boldsymbol{\Sigma})^{-1/2} \exp\left(-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^{\top} \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})\right). \tag{2}$$
The likelihood over N i.i.d. observations, denoted with the design matrix X, is

$$\mathcal{L}(\mathbf{X} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \prod_{n=1}^{N} f(\mathbf{x}_n \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) = (2\pi)^{-ND/2} \det(\boldsymbol{\Sigma})^{-N/2} \exp\left(-\frac{1}{2} \sum_{n=1}^{N} (\mathbf{x}_n - \boldsymbol{\mu})^{\top} \boldsymbol{\Sigma}^{-1} (\mathbf{x}_n - \boldsymbol{\mu})\right). \tag{3}$$
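As a sanity check, the log of Eq. 3 can be computed term by term and compared against SciPy's multivariate normal log-density. This is just a numerical sketch; the dimensions, data, and variable names below are arbitrary choices of mine:

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
D, N = 3, 50

# Made-up parameters and data for the check.
mu = rng.normal(size=D)
A = rng.normal(size=(D, D))
Sigma = A @ A.T + D * np.eye(D)  # random symmetric positive-definite covariance
X = rng.multivariate_normal(mu, Sigma, size=N)

# Log of Eq. 3, term by term.
Sigma_inv = np.linalg.inv(Sigma)
diffs = X - mu
quad = np.einsum('nd,de,ne->', diffs, Sigma_inv, diffs)
log_lik = (-0.5 * N * D * np.log(2 * np.pi)
           - 0.5 * N * np.linalg.slogdet(Sigma)[1]
           - 0.5 * quad)

# Should agree with summing SciPy's log-density over the N observations.
log_lik_scipy = multivariate_normal(mu, Sigma).logpdf(X).sum()
print(np.isclose(log_lik, log_lik_scipy))  # True
```

Working in log space avoids the underflow that the raw product in Eq. 3 would cause for even moderate N.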
Assume Σ is known. A common prior for the mean parameter μ is another Gaussian,
$$\pi(\boldsymbol{\mu}) = \mathcal{N}_D(\boldsymbol{\mu}_0, \boldsymbol{\Sigma}_0). \tag{4}$$
Our goal is to show that π(μ) is a conjugate prior, meaning the posterior p(μ∣X) is also Gaussian. The posterior is
$$p(\boldsymbol{\mu} \mid \mathbf{X}) \propto \mathcal{L}(\mathbf{X} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) \, \pi(\boldsymbol{\mu}), \tag{5}$$
which we can write out explicitly as
$$\mathcal{L}(\mathbf{X} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) \, \pi(\boldsymbol{\mu}) \propto \exp\left(-\frac{1}{2} \sum_{n=1}^{N} (\mathbf{x}_n - \boldsymbol{\mu})^{\top} \boldsymbol{\Sigma}^{-1} (\mathbf{x}_n - \boldsymbol{\mu})\right) \exp\left(-\frac{1}{2} (\boldsymbol{\mu} - \boldsymbol{\mu}_0)^{\top} \boldsymbol{\Sigma}_0^{-1} (\boldsymbol{\mu} - \boldsymbol{\mu}_0)\right). \tag{6}$$
We can write the terms inside the exponents as
$$\sum_{n=1}^{N} \mathbf{x}_n^{\top} \boldsymbol{\Sigma}^{-1} \mathbf{x}_n + N \boldsymbol{\mu}^{\top} \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu} - 2 N \bar{\mathbf{x}}^{\top} \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu} + \boldsymbol{\mu}^{\top} \boldsymbol{\Sigma}_0^{-1} \boldsymbol{\mu} + \boldsymbol{\mu}_0^{\top} \boldsymbol{\Sigma}_0^{-1} \boldsymbol{\mu}_0 - 2 \boldsymbol{\mu}^{\top} \boldsymbol{\Sigma}_0^{-1} \boldsymbol{\mu}_0, \tag{7}$$
where $\bar{\mathbf{x}} = \frac{1}{N} \sum_{n=1}^{N} \mathbf{x}_n$. Since the posterior is w.r.t. μ, we can drop terms that do not depend on μ (we still retain the Gaussian kernel and can properly normalize it afterward) and combine like terms:
$$p(\boldsymbol{\mu} \mid \mathbf{X}) \propto \exp\left(-\frac{1}{2}\left[\boldsymbol{\mu}^{\top} (N \boldsymbol{\Sigma}^{-1} + \boldsymbol{\Sigma}_0^{-1}) \boldsymbol{\mu} - 2 \boldsymbol{\mu}^{\top} (N \boldsymbol{\Sigma}^{-1} \bar{\mathbf{x}} + \boldsymbol{\Sigma}_0^{-1} \boldsymbol{\mu}_0)\right]\right). \tag{8}$$
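The expansion from Eq. 6 to Eq. 7 is easy to get wrong by a factor or a sign, so here is a small numerical check that the two quadratic forms inside the exponents of Eq. 6 equal the expanded terms of Eq. 7. All names, dimensions, and values are arbitrary choices of mine:

```python
import numpy as np

rng = np.random.default_rng(1)
D, N = 3, 10

# Arbitrary SPD covariances, means, and data.
A, B = rng.normal(size=(D, D)), rng.normal(size=(D, D))
Sigma, Sigma0 = A @ A.T + D * np.eye(D), B @ B.T + D * np.eye(D)
mu, mu0 = rng.normal(size=D), rng.normal(size=D)
X = rng.normal(size=(N, D))
xbar = X.mean(axis=0)

Si, S0i = np.linalg.inv(Sigma), np.linalg.inv(Sigma0)

# The two quadratic forms inside the exponents of Eq. 6.
lhs = sum((x - mu) @ Si @ (x - mu) for x in X) + (mu - mu0) @ S0i @ (mu - mu0)

# The expanded terms of Eq. 7.
rhs = (sum(x @ Si @ x for x in X)
       + N * mu @ Si @ mu - 2 * N * xbar @ Si @ mu
       + mu @ S0i @ mu + mu0 @ S0i @ mu0 - 2 * mu @ S0i @ mu0)

print(np.isclose(lhs, rhs))  # True
```

Note that the cross terms collapse to $-2N\bar{\mathbf{x}}^{\top}\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}$ only because $\boldsymbol{\Sigma}^{-1}$ is symmetric.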
Now we just complete the square. First, let
$$\begin{aligned}
\mathbf{M} &= N \boldsymbol{\Sigma}^{-1} + \boldsymbol{\Sigma}_0^{-1}, \\
\mathbf{b} &= N \boldsymbol{\Sigma}^{-1} \bar{\mathbf{x}} + \boldsymbol{\Sigma}_0^{-1} \boldsymbol{\mu}_0.
\end{aligned} \tag{9}$$
Then Eq. 8 is equivalent to
$$\begin{aligned}
p(\boldsymbol{\mu} \mid \mathbf{X}) &\propto \exp\left(-\frac{1}{2}\left[(\boldsymbol{\mu} - \mathbf{M}^{-1} \mathbf{b})^{\top} \mathbf{M} (\boldsymbol{\mu} - \mathbf{M}^{-1} \mathbf{b}) - \mathbf{b}^{\top} \mathbf{M}^{-1} \mathbf{b}\right]\right) \\
&\Downarrow \\
p(\boldsymbol{\mu} \mid \mathbf{X}) &= \mathcal{N}_D(\boldsymbol{\mu} \mid \mathbf{M}^{-1} \mathbf{b}, \mathbf{M}^{-1}).
\end{aligned} \tag{10}$$
This is the posterior for a multivariate Gaussian with unknown mean. We can write it in a standard form, e.g. see (Murphy, 2007), as:
$$\begin{aligned}
p(\boldsymbol{\mu} \mid \mathbf{X}) &= \mathcal{N}_D(\boldsymbol{\mu} \mid \boldsymbol{\mu}_N, \boldsymbol{\Sigma}_N), \\
\boldsymbol{\Sigma}_N &= (N \boldsymbol{\Sigma}^{-1} + \boldsymbol{\Sigma}_0^{-1})^{-1}, \\
\boldsymbol{\mu}_N &= \boldsymbol{\Sigma}_N (N \boldsymbol{\Sigma}^{-1} \bar{\mathbf{x}} + \boldsymbol{\Sigma}_0^{-1} \boldsymbol{\mu}_0).
\end{aligned} \tag{11}$$
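One way to check Eq. 11 numerically: if the posterior really is $\mathcal{N}_D(\boldsymbol{\mu}_N, \boldsymbol{\Sigma}_N)$, then the log of the unnormalized posterior (likelihood times prior) minus that Gaussian's log-density must be the same μ-independent constant at every point. A sketch of this check, with made-up data and parameters:

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(2)
D, N = 2, 40

# Assumed known likelihood covariance and prior; values are made up.
Sigma = np.array([[2.0, 0.3], [0.3, 1.0]])
mu0, Sigma0 = np.zeros(D), np.eye(D)
X = rng.multivariate_normal([1.0, -1.0], Sigma, size=N)
xbar = X.mean(axis=0)

# Eq. 11: posterior covariance and mean.
Si, S0i = np.linalg.inv(Sigma), np.linalg.inv(Sigma0)
Sigma_N = np.linalg.inv(N * Si + S0i)
mu_N = Sigma_N @ (N * Si @ xbar + S0i @ mu0)

# Log[likelihood * prior] evaluated at a candidate mean m.
def log_unnorm_post(m):
    return (multivariate_normal(m, Sigma).logpdf(X).sum()
            + multivariate_normal(mu0, Sigma0).logpdf(m))

# The difference from the claimed posterior's log-density should be
# the same constant (the log normalizer) at every test point.
post = multivariate_normal(mu_N, Sigma_N)
points = rng.normal(size=(5, D))
consts = [log_unnorm_post(m) - post.logpdf(m) for m in points]
print(np.allclose(consts, consts[0]))  # True
```

This "constant difference" test verifies the kernel without having to compute the normalizing constant by hand.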
To compute the posterior predictive,
$$p(\mathbf{x}_{\text{new}} \mid \mathbf{X}) = \int f(\mathbf{x}_{\text{new}} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) \, p(\boldsymbol{\mu} \mid \mathbf{X}) \, d\boldsymbol{\mu}, \tag{12}$$
we observe the following:
$$\begin{aligned}
(\mathbf{x}_{\text{new}} - \boldsymbol{\mu}) &\sim \mathcal{N}(\mathbf{x}_{\text{new}} - \boldsymbol{\mu} \mid \mathbf{0}, \boldsymbol{\Sigma}), \\
\boldsymbol{\mu} &\sim \mathcal{N}(\boldsymbol{\mu} \mid \boldsymbol{\mu}_N, \boldsymbol{\Sigma}_N).
\end{aligned} \tag{13}$$
Notice that (xnew−μ) and μ are independent. Intuitively, if I tell you the value of μ, that tells you nothing about the value of (xnew−μ) because the distribution on (xnew−μ) does not contain the parameter μ. Formally,
$$p(\mathbf{x}_{\text{new}} - \boldsymbol{\mu} \mid \boldsymbol{\mu}) = p(\mathbf{x}_{\text{new}} - \boldsymbol{\mu}). \tag{14}$$
Since (xnew−μ) and μ are independent Gaussian random variables, their sum is also Gaussian, with mean the sum of the means and covariance the sum of the covariances, implying:
$$\begin{aligned}
\mathbf{x}_{\text{new}} &= (\mathbf{x}_{\text{new}} - \boldsymbol{\mu}) + \boldsymbol{\mu} \\
&\Downarrow \\
p(\mathbf{x}_{\text{new}} \mid \mathbf{X}) &= \mathcal{N}(\mathbf{x}_{\text{new}} \mid \boldsymbol{\mu}_N, \boldsymbol{\Sigma} + \boldsymbol{\Sigma}_N).
\end{aligned} \tag{15}$$
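The posterior predictive in Eq. 15 can also be checked by Monte Carlo: sample μ from the posterior, then sample xnew given μ, and compare the empirical mean and covariance against $\boldsymbol{\mu}_N$ and $\boldsymbol{\Sigma} + \boldsymbol{\Sigma}_N$. A sketch with made-up values (tolerances reflect Monte Carlo error at this sample size):

```python
import numpy as np

rng = np.random.default_rng(3)
D, N, S = 2, 30, 200_000

# Made-up known covariance, prior, and data.
Sigma = np.array([[1.5, 0.2], [0.2, 0.8]])
mu0, Sigma0 = np.zeros(D), np.eye(D)
X = rng.multivariate_normal([0.5, 2.0], Sigma, size=N)

# Posterior parameters from Eq. 11.
Si, S0i = np.linalg.inv(Sigma), np.linalg.inv(Sigma0)
Sigma_N = np.linalg.inv(N * Si + S0i)
mu_N = Sigma_N @ (N * Si @ X.mean(axis=0) + S0i @ mu0)

# Monte Carlo version of Eq. 12: draw mu from the posterior,
# then add independent N(0, Sigma) noise to get x_new.
mus = rng.multivariate_normal(mu_N, Sigma_N, size=S)
x_new = mus + rng.multivariate_normal(np.zeros(D), Sigma, size=S)

# Eq. 15 predicts mean mu_N and covariance Sigma + Sigma_N.
print(np.allclose(x_new.mean(axis=0), mu_N, atol=0.03))
print(np.allclose(np.cov(x_new.T), Sigma + Sigma_N, atol=0.03))
```

Note the simulation mirrors the decomposition in Eq. 15 exactly: each draw of xnew is an independent (xnew−μ) term plus a posterior draw of μ.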