A Fast and Numerically Stable Implementation of the Multivariate Normal PDF

Naively computing the probability density function for the multivariate normal can be slow and numerically unstable. I work through SciPy's implementation.

Consider the multivariate normal probability density function (PDF) for $\mathbf{x} \in \mathbb{R}^D$ with parameters $\boldsymbol{\mu} \in \mathbb{R}^D$ (mean) and $\boldsymbol{\Sigma} \in \mathbb{R}^{D \times D}$ (covariance):

$$ f(\mathbf{x}; \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \frac{1}{\sqrt{(2 \pi)^D \det(\boldsymbol{\Sigma})}} \exp\Big\{ -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^{\top} \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}) \Big\}. $$

Computing this is tricky because we need to compute both the inverse and determinant of the $D \times D$ covariance matrix $\boldsymbol{\Sigma}$. Each of these operations has runtime complexity $\mathcal{O}(D^3)$. Furthermore, computing matrix inverses and determinants can be numerically unstable if done naively. SciPy has a fast and numerically stable implementation that is worth understanding. The big idea is to do one intensive operation, an eigendecomposition, and then use that decomposition to compute the matrix inverse and determinant cheaply.

Matrix inverse

Since $\boldsymbol{\Sigma}$ is Hermitian, it has an eigendecomposition

$$ \boldsymbol{\Sigma} = \mathbf{Q} \boldsymbol{\Lambda} \mathbf{Q}^{\top} $$

where $\mathbf{Q}$ is an orthogonal matrix whose columns are the eigenvectors of $\boldsymbol{\Sigma}$ and where $\boldsymbol{\Lambda}$ is a diagonal matrix of the associated eigenvalues. Since $(\mathbf{A}\mathbf{B})^{-1} = \mathbf{B}^{-1} \mathbf{A}^{-1}$ for invertible matrices and since $\mathbf{Q}^{\top} = \mathbf{Q}^{-1}$ due to orthogonality, we can easily compute the inverse of $\boldsymbol{\Sigma}$:

$$ \begin{aligned} \boldsymbol{\Sigma}^{-1} &= (\mathbf{Q} \boldsymbol{\Lambda} \mathbf{Q}^{\top})^{-1} \\ &= (\mathbf{Q}^{\top})^{-1} \boldsymbol{\Lambda}^{-1} \mathbf{Q}^{-1} \\ &= \mathbf{Q} \boldsymbol{\Lambda}^{-1} \mathbf{Q}^{\top}. \end{aligned} $$

Since $\boldsymbol{\Lambda}$ is diagonal, $\boldsymbol{\Lambda}^{-1}$ is just the reciprocal of each value along the diagonal.
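To see this concretely, here is a minimal NumPy sketch (the $2 \times 2$ covariance matrix is an arbitrary made-up example) that recovers the inverse from the eigendecomposition and checks it against np.linalg.inv:

import numpy as np

# An arbitrary symmetric positive definite covariance matrix.
cov = np.array([[2.0, 0.3],
                [0.3, 1.0]])

# One O(D^3) operation: eigendecomposition of a symmetric matrix.
vals, vecs = np.linalg.eigh(cov)

# Sigma^{-1} = Q diag(1 / lambda) Q^T.
cov_inv = vecs @ np.diag(1.0 / vals) @ vecs.T

assert np.allclose(cov_inv, np.linalg.inv(cov))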

Determinant

To compute the determinant of $\boldsymbol{\Sigma}$, SciPy uses the following mathematical fact: for any $D \times D$ matrix $\mathbf{M}$ with eigenvalues $\lambda_1, \lambda_2, \dots, \lambda_D$, the determinant is the product of the eigenvalues, or

$$ \det(\mathbf{M}) = \prod_{d=1}^{D} \lambda_d. $$

However, computing the determinant this way is numerically unstable, as I’ve written about before. The upshot is that if the eigenvalues $\lambda_d$ are small, the computed determinant might underflow to zero due to machine precision. So SciPy computes the log of the PDF, where computing the determinant amounts to

$$ \log(\det(\mathbf{M})) = \sum_{d=1}^{D} \log(\lambda_d). $$
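A toy example shows why working in the log domain matters. With many small eigenvalues (made up here for illustration), the naive product underflows to zero in double precision, while the sum of logs is perfectly well-behaved:

import numpy as np

vals = np.full(500, 1e-4)     # 500 eigenvalues, each 1e-4
print(np.prod(vals))          # 0.0 -- underflows double precision
print(np.sum(np.log(vals)))   # roughly -4605.2, numerically stable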

To compute the PDF, SciPy first computes the log PDF and then exponentiates that quantity. For completeness, the log PDF for the multivariate normal is

$$ \begin{aligned} \log f(\mathbf{x}; \boldsymbol{\mu}, \boldsymbol{\Sigma}) &= -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^{\top} \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}) - \log\big(\sqrt{(2 \pi)^D \det(\boldsymbol{\Sigma})}\big) \\ &= -\frac{1}{2} \big[ (\mathbf{x} - \boldsymbol{\mu})^{\top} \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}) + \log\big((2 \pi)^D \det(\boldsymbol{\Sigma})\big) \big] \\ &= -\frac{1}{2} \big[ (\mathbf{x} - \boldsymbol{\mu})^{\top} \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}) + D\log(2 \pi) + \log(\det(\boldsymbol{\Sigma})) \big]. \end{aligned} $$

Implementation

Below is an abbreviated version of SciPy’s implementation of multivariate_normal.pdf. For clarity, I’ve removed any code for modularity or standardization. SciPy’s implementation makes one additional optimization worth mentioning. Rather than computing

$$ (\mathbf{x} - \boldsymbol{\mu})^{\top} \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}) $$

naively, it computes $(\mathbf{x} - \boldsymbol{\mu})^{\top} \mathbf{Q} \sqrt{\boldsymbol{\Lambda}^{-1}}$ and then sums the squares of its entries.
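This works because $\sqrt{\boldsymbol{\Lambda}^{-1}}$ is diagonal, so the inverse factors as

$$ \boldsymbol{\Sigma}^{-1} = \mathbf{Q} \boldsymbol{\Lambda}^{-1} \mathbf{Q}^{\top} = \big(\mathbf{Q} \sqrt{\boldsymbol{\Lambda}^{-1}}\big) \big(\mathbf{Q} \sqrt{\boldsymbol{\Lambda}^{-1}}\big)^{\top}, $$

and writing $\mathbf{z}^{\top} = (\mathbf{x} - \boldsymbol{\mu})^{\top} \mathbf{Q} \sqrt{\boldsymbol{\Lambda}^{-1}}$, the quadratic form is just $\mathbf{z}^{\top} \mathbf{z} = \sum_{d=1}^{D} z_d^2$.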

import numpy as np


def pdf(x, mean, cov):
    return np.exp(logpdf(x, mean, cov))


def logpdf(x, mean, cov):
    # `eigh` assumes the matrix is Hermitian.
    vals, vecs = np.linalg.eigh(cov)
    logdet     = np.sum(np.log(vals))
    valsinv    = 1. / vals
    # `vecs` is D times R while `vals` is an R-vector, where R is the
    # matrix rank. Here R equals D, but SciPy's eigenvalue thresholding
    # may discard near-zero eigenvalues. The asterisk performs
    # element-wise multiplication.
    U          = vecs * np.sqrt(valsinv)
    rank       = len(vals)
    dev        = x - mean
    # "maha" for "Mahalanobis distance".
    maha       = np.square(np.dot(dev, U)).sum()
    log2pi     = np.log(2 * np.pi)
    return -0.5 * (rank * log2pi + maha + logdet)
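As a quick sanity check, we can compare the abbreviated version above against SciPy’s own multivariate_normal.pdf on an arbitrary made-up example (any positive definite covariance will do):

import numpy as np
from scipy.stats import multivariate_normal

x    = np.array([0.5, -0.2, 0.1])
mean = np.zeros(3)
cov  = np.array([[2.0, 0.3, 0.0],
                 [0.3, 1.0, 0.1],
                 [0.0, 0.1, 0.5]])

# Both values should agree to machine precision.
print(pdf(x, mean, cov))
print(multivariate_normal.pdf(x, mean=mean, cov=cov))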

SciPy performs a number of other checks, such as thresholding near-zero eigenvalues, optionally raising an exception for singular matrices, and so forth.