Probability distributions that are members of the exponential family have mathematically convenient properties for Bayesian inference. I provide the general form, work through several examples, and discuss several important properties.
Published
19 March 2019
The exponential family is a class of probability distributions with convenient mathematical properties (Pitman, 1936; Koopman, 1936; Darmois, 1935). Many commonly used distributions are part of the exponential family, such as the Gaussian, exponential, gamma, chi-squared, beta, Dirichlet, Bernoulli, categorical, Poisson, Wishart, inverse Wishart, and geometric distributions. I want to start by just providing the general form and then demonstrating that a few example distributions are members of the family. Once we are familiar with the form, we will discuss several important properties for Bayesian inference.
The general form for any member of the exponential family is
$$p(\mathbf{x} \mid \boldsymbol{\eta}) = h(\mathbf{x})\, g(\boldsymbol{\eta}) \exp\{\boldsymbol{\eta}^{\top} \mathbf{u}(\mathbf{x})\} \tag{1}$$
where
η is the natural parameter.
h(x) is the underlying measure.
u(x) is the sufficient statistic of the data.
g(η) is the normalizer, ensuring that
$$g(\boldsymbol{\eta}) \int h(\mathbf{x}) \exp\{\boldsymbol{\eta}^{\top} \mathbf{u}(\mathbf{x})\} \,\text{d}\mathbf{x} = 1 \tag{2}$$
In simpler notation, we ensure that $f(\mathbf{x})$ is a valid probability density by computing $Z = 1 / \int f(\mathbf{x})\,\text{d}\mathbf{x}$ and then normalizing so that $\int Z f(\mathbf{x})\,\text{d}\mathbf{x} = 1$. Equation 1 is sometimes written in terms of the log normalizer $A(\boldsymbol{\eta}) = -\log g(\boldsymbol{\eta})$ as

$$p(\mathbf{x} \mid \boldsymbol{\eta}) = h(\mathbf{x}) \exp\{\boldsymbol{\eta}^{\top} \mathbf{u}(\mathbf{x}) - A(\boldsymbol{\eta})\}$$
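To make the general form concrete, here is a minimal Python sketch that evaluates Equation 1 for any distribution specified by its $h$, $\mathbf{u}$, and $g$. The function name and the choice of the exponential distribution as the worked example are illustrative assumptions, not anything canonical.

```python
import numpy as np

def expfam_pdf(h, u, g, eta, x):
    """Evaluate Equation 1: p(x | eta) = h(x) g(eta) exp{eta^T u(x)}."""
    return h(x) * g(eta) * np.exp(np.dot(eta, u(x)))

# Illustrative check with the exponential distribution p(x | lam) = lam * exp(-lam * x),
# which has h(x) = 1, u(x) = [x], eta = [-lam], and g(eta) = -eta_1.
lam = 2.0
value = expfam_pdf(h=lambda x: 1.0,
                   u=lambda x: np.array([x]),
                   g=lambda eta: -eta[0],
                   eta=np.array([-lam]),
                   x=0.5)
print(value)  # lam * exp(-lam * 0.5), approximately 0.7358
```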
One way to think about this form is that it is akin to a superclass in programming. Any distribution that can be written as Equation 1 can be shown to have certain useful properties. Before discussing these properties, let’s look at a few examples.
Examples
Bernoulli distribution in exponential family form
Let x be a Bernoulli random variable with parameter 0≤μ≤1. Since these are scalar values, we do not denote them with bold symbols. The functional form of the Bernoulli is
$$p(x \mid \mu) = \text{Bern}(x \mid \mu) = \mu^{x}(1 - \mu)^{1 - x}$$
Let’s try to get this into standard exponential family form (Equation 1) with a little algebraic manipulation. First, we can introduce an exponent using the identity $a = \exp(\log a)$:

$$p(x \mid \mu) = \exp\left\{\log\left[\mu^{x}(1 - \mu)^{1 - x}\right]\right\} = \exp\{x \log \mu + (1 - x)\log(1 - \mu)\}$$
Now we want to express everything inside the exponent as a function of x times something else. To do this, we move any terms that do not depend on x out of the exponent. Using the fact that $e^{a+b} = e^{a}e^{b}$, we have

$$p(x \mid \mu) = \exp\left\{x \log\left(\frac{\mu}{1 - \mu}\right) + \log(1 - \mu)\right\} = (1 - \mu)\exp\left\{x \log\left(\frac{\mu}{1 - \mu}\right)\right\}$$
This looks like it is in exponential family form. Now we want to express the term in front of the exponent, $1 - \mu$, in terms of $\eta$. Matching against Equation 1, we can note that $\eta = \log\left(\frac{\mu}{1 - \mu}\right)$ and then use this fact to solve for $\mu$ in terms of $\eta$:

$$e^{\eta} = \frac{\mu}{1 - \mu} \quad\Longrightarrow\quad \mu = \frac{1}{1 + e^{-\eta}} \tag{3}$$
where Equation 3 is the sigmoid function, denoted σ(⋅). We can write 1−μ in terms of σ(⋅) and η as
$$\sigma(-\eta) = \frac{1}{1 + e^{\eta}} = 1 - \frac{1}{1 + e^{-\eta}} = 1 - \mu$$
We could have defined this function so that its argument was just $\eta$, but I believe the $\sigma(-\eta)$ formulation is used so that $\sigma$ is the standard sigmoid function. Putting everything together, we can express the Bernoulli distribution in the exponential family’s standard form as
$$p(x \mid \eta) = \sigma(-\eta)\exp(\eta x)$$
where

$$\begin{aligned}
\eta &= \log\left(\frac{\mu}{1 - \mu}\right) \\
h(x) &= 1 \\
u(x) &= x \\
g(\eta) &= \sigma(-\eta) = \frac{1}{1 + e^{\eta}}
\end{aligned}$$
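As a sanity check, here is a small Python snippet, just a sketch using nothing beyond Equation 1 and the quantities above, confirming that $\sigma(-\eta)\exp(\eta x)$ reproduces $\mu^{x}(1-\mu)^{1-x}$ for both values of $x$.

```python
import numpy as np

def sigma(t):
    """The sigmoid function."""
    return 1.0 / (1.0 + np.exp(-t))

mu = 0.3
eta = np.log(mu / (1 - mu))                       # natural parameter

for x in (0, 1):
    standard   = mu**x * (1 - mu)**(1 - x)        # Bern(x | mu)
    exp_family = sigma(-eta) * np.exp(eta * x)    # g(eta) exp(eta * u(x)), with h(x) = 1
    print(x, standard, exp_family)                # the two values agree for each x
```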
Poisson distribution in exponential family form
Let x be a Poisson random variable with parameter μ>0. The functional form of the Poisson is
$$p(x \mid \mu) = \frac{e^{-\mu}\mu^{x}}{x!}$$
and can be written in exponential family form as
$$p(x \mid \mu) = \frac{1}{x!}\exp(-\mu)\exp(x \log \mu) = h(x)\, g(\eta)\exp(\eta x)$$
where

$$\begin{aligned}
\eta &= \log \mu \\
h(x) &= \frac{1}{x!} \\
u(x) &= x \\
g(\eta) &= \exp(-\exp(\eta))
\end{aligned}$$
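Again, a quick numerical check in Python (a sketch that leans on SciPy's Poisson pmf only for comparison) shows that this exponential family form matches the standard Poisson pmf.

```python
import numpy as np
from scipy.special import factorial
from scipy.stats import poisson

mu = 4.2
eta = np.log(mu)                      # natural parameter

x = np.arange(10)
exp_family = (1.0 / factorial(x)) * np.exp(-np.exp(eta)) * np.exp(eta * x)
print(np.allclose(exp_family, poisson.pmf(x, mu)))   # True
```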
Gaussian distribution in exponential family form
As a final example, let’s consider a distribution with two parameters. The functional form of the univariate Gaussian distribution is
$$p(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left\{-\frac{1}{2\sigma^2}(x - \mu)^2\right\}$$
We already have a term with an exponent, but let’s once again move any terms in the exponent that do not contain x out in front. Expanding the square $(x - \mu)^2 = x^2 - 2\mu x + \mu^2$ and collecting terms, we have

$$p(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi}}\,\frac{1}{\sigma}\exp\left\{-\frac{\mu^2}{2\sigma^2}\right\}\exp\left\{\frac{\mu}{\sigma^2}x - \frac{1}{2\sigma^2}x^2\right\} \tag{4}$$
This suggests the following values for u(x) and η (we re-introduce the bold symbols to denote vectors):
$$\mathbf{u}(x) = \begin{bmatrix} x \\ x^2 \end{bmatrix} \qquad \boldsymbol{\eta}(\mu, \sigma^2) = \begin{bmatrix} \dfrac{\mu}{\sigma^2} \\[6pt] -\dfrac{1}{2\sigma^2} \end{bmatrix}$$
Here, we can see that the natural parameters $\boldsymbol{\eta}$ contain both $\mu$ and $\sigma^2$. Once again, we can use these values to derive $h(x)$ and $g(\boldsymbol{\eta})$. We can write $h(x)$ by collecting terms that do not depend on $\mu$ or $\sigma^2$, since those variables are in the definition of $\boldsymbol{\eta}$. So
$$h(x) = \frac{1}{\sqrt{2\pi}} = (2\pi)^{-1/2}$$
This means that g(η) must account for everything else outside the rightmost exponent in Equation 4 or
$$g(\boldsymbol{\eta}) = \frac{1}{\sigma}\exp\left\{-\frac{\mu^2}{2\sigma^2}\right\}$$
We can write $\frac{1}{\sigma}$ in terms of $\eta_2$ by observing that

$$(-2\eta_2)^{1/2} = \left(-2\left(-\frac{1}{2\sigma^2}\right)\right)^{1/2} = \frac{1}{\sigma}$$
We can write the exponent in terms of both $\eta_1$ and $\eta_2$ by observing that
$$\frac{\eta_1^2}{4\eta_2} = \frac{\left(\frac{\mu}{\sigma^2}\right)^2}{4\left(-\frac{1}{2\sigma^2}\right)} = \frac{\frac{\mu^2}{\sigma^4}}{-\frac{2}{\sigma^2}} = -\frac{\mu^2}{2\sigma^2}$$
Putting this all together, we get
$$p(x \mid \boldsymbol{\eta}) = h(x)\, g(\boldsymbol{\eta})\exp\{\boldsymbol{\eta}^{\top}\mathbf{u}(x)\}$$
where

$$\begin{aligned}
\boldsymbol{\eta} &= \begin{bmatrix} \dfrac{\mu}{\sigma^2} & -\dfrac{1}{2\sigma^2} \end{bmatrix}^{\top} \\
h(x) &= (2\pi)^{-1/2} \\
\mathbf{u}(x) &= \begin{bmatrix} x & x^2 \end{bmatrix}^{\top} \\
g(\boldsymbol{\eta}) &= (-2\eta_2)^{1/2}\exp\left\{\frac{\eta_1^2}{4\eta_2}\right\}
\end{aligned}$$
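Here is a short Python sketch that plugs these quantities into Equation 1 and compares the result with SciPy's Gaussian density; the helper name is just for illustration.

```python
import numpy as np
from scipy.stats import norm

mu, sigma = 1.5, 0.8
eta = np.array([mu / sigma**2, -1.0 / (2 * sigma**2)])    # natural parameters

def gaussian_expfam_pdf(x):
    u = np.array([x, x**2])                               # sufficient statistics
    h = (2 * np.pi) ** -0.5
    g = (-2 * eta[1]) ** 0.5 * np.exp(eta[0]**2 / (4 * eta[1]))
    return h * g * np.exp(eta @ u)

xs = np.linspace(-2.0, 5.0, 7)
print(np.allclose([gaussian_expfam_pdf(x) for x in xs], norm.pdf(xs, mu, sigma)))   # True
```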
Sufficient statistics
We will now show why $\mathbf{u}(\mathbf{x})$ is called the sufficient statistic. The upshot is that $\sum_i \mathbf{u}(\mathbf{x}_i)$ is all we need to compute the maximum likelihood estimates of the natural parameters. To show this, let’s maximize the log likelihood. For any distribution in exponential family form, the likelihood of $n$ i.i.d. observations $\mathbf{X} = \{\mathbf{x}_1, \dots, \mathbf{x}_n\}$ is

$$p(\mathbf{X} \mid \boldsymbol{\eta}) = \left(\prod_{i=1}^{n} h(\mathbf{x}_i)\right) g(\boldsymbol{\eta})^{n}\exp\left\{\boldsymbol{\eta}^{\top}\sum_{i=1}^{n}\mathbf{u}(\mathbf{x}_i)\right\}$$

and the log likelihood is

$$\log p(\mathbf{X} \mid \boldsymbol{\eta}) = \sum_{i=1}^{n}\log h(\mathbf{x}_i) + n\log g(\boldsymbol{\eta}) + \boldsymbol{\eta}^{\top}\sum_{i=1}^{n}\mathbf{u}(\mathbf{x}_i) \tag{5}$$
Now if we want to find the maximizer $\boldsymbol{\eta}_{\text{ML}}$, we can compute the derivative of Equation 5, set it equal to 0, and solve for $\boldsymbol{\eta}$. But first, note that $h(\mathbf{x})$ has no dependence upon $\boldsymbol{\eta}$ and will disappear in the derivative, so we want to solve
$$\nabla\, n\log g(\boldsymbol{\eta}) + \nabla\, \boldsymbol{\eta}^{\top}\sum_{i=1}^{n}\mathbf{u}(\mathbf{x}_i) = 0$$
Dividing both sides by n and moving one term to the other side of the equality, we get
$$-\nabla \log g(\boldsymbol{\eta}) = \nabla\, \boldsymbol{\eta}^{\top}\,\frac{1}{n}\sum_{i=1}^{n}\mathbf{u}(\mathbf{x}_i)$$
Finally, note that $\nabla\, \boldsymbol{\eta}^{\top}\mathbf{x} = \mathbf{x}$. This is because for the $i$-th component of the $p$-dimensional gradient,
$$\frac{\partial}{\partial \eta_i}\left(\eta_1 x_1 + \eta_2 x_2 + \dots + \eta_i x_i + \dots + \eta_p x_p\right) = x_i$$
This gives us
$$-\nabla \log g(\boldsymbol{\eta}_{\text{ML}}) = \nabla A(\boldsymbol{\eta}_{\text{ML}}) = \frac{1}{n}\sum_{i=1}^{n}\mathbf{u}(\mathbf{x}_i) \tag{6}$$
The point of this derivation is to demonstrate that the optimal natural parameters ηML only depend upon ∑u(xi). In other words, we do not need to store all the data, only ∑u(xi), which we can think of as a compact representation or summarization of our data. Importantly, because this sufficient statistic is a sum, we can compute it incrementally as the data arrives.
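To illustrate the incremental computation, here is a Python sketch for the univariate Gaussian: we stream through the data keeping only the running sums of $\mathbf{u}(x) = [x, x^2]^{\top}$, then recover the maximum likelihood estimates of $\mu$ and $\sigma$ at the end. The variable names are just for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
stream = rng.normal(loc=2.0, scale=3.0, size=100_000)   # data arriving one point at a time

n = 0
suff = np.zeros(2)                       # running sum of u(x) = [x, x^2]
for x in stream:
    n += 1
    suff += np.array([x, x * x])         # incremental update; the raw data can be discarded

mean_u  = suff / n                       # (1/n) sum_i u(x_i), the right side of Equation 6
mu_hat  = mean_u[0]                      # E[x]
var_hat = mean_u[1] - mean_u[0] ** 2     # E[x^2] - E[x]^2
print(mu_hat, np.sqrt(var_hat))          # approximately 2.0 and 3.0
```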
Moments through differentiation
The mean and variance of a probability distribution are defined using integration:
$$\mathbb{E}[X] = \int_{x} x f(x)\,\text{d}x \qquad \text{Var}(X) = \mathbb{E}\left[(X - \mathbb{E}[X])^2\right]$$
But in the exponential family, we can compute moments through differentiation (of $g(\boldsymbol{\eta})$), which is typically easier than integration. To show this, let’s compute the derivative of Equation 2. First, for ease of notation, let

$$f(\mathbf{x}, \boldsymbol{\eta}) = h(\mathbf{x})\exp\{\boldsymbol{\eta}^{\top}\mathbf{u}(\mathbf{x})\}$$

Differentiating Equation 2 with respect to $\boldsymbol{\eta}$ and applying the product rule gives

$$-\nabla g(\boldsymbol{\eta})\int f(\mathbf{x}, \boldsymbol{\eta})\,\text{d}\mathbf{x} = g(\boldsymbol{\eta})\int f(\mathbf{x}, \boldsymbol{\eta})\,\mathbf{u}(\mathbf{x})\,\text{d}\mathbf{x} \tag{7}$$

Note that, using Equation 1, the right side of Equation 7 is just $\int p(\mathbf{x} \mid \boldsymbol{\eta})\,\mathbf{u}(\mathbf{x})\,\text{d}\mathbf{x} = \mathbb{E}[\mathbf{u}(\mathbf{x})]$, and that, using Equation 2, the left side of Equation 7 can be rewritten as
$$-\nabla g(\boldsymbol{\eta})\int f(\mathbf{x}, \boldsymbol{\eta})\,\text{d}\mathbf{x} = -\nabla g(\boldsymbol{\eta})\,\frac{1}{g(\boldsymbol{\eta})} = -\nabla \log g(\boldsymbol{\eta})$$
Putting this together, we get
$$-\nabla \log g(\boldsymbol{\eta}) = \nabla A(\boldsymbol{\eta}) = \mathbb{E}[\mathbf{u}(\mathbf{x})]$$
In words, we can differentiate the log normalizer to compute the mean of the sufficient statistic. Furthermore, note that as $n \rightarrow \infty$, the right side of Equation 6 becomes $\mathbb{E}[\mathbf{u}(\mathbf{x})]$. This demonstrates that $\boldsymbol{\eta}_{\text{ML}}$ converges to the true parameter value of $\boldsymbol{\eta}$ in the limit, i.e. that our estimator is consistent.
Furthermore, if you take the second derivative (not shown here), it is easy to see that

$$-\nabla\nabla \log g(\boldsymbol{\eta}) = \nabla\nabla A(\boldsymbol{\eta}) = \text{cov}[\mathbf{u}(\mathbf{x})]$$

so the second derivative of the log normalizer gives the covariance of the sufficient statistic.
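Here is a small Python check of this idea for the Poisson, whose log normalizer is $A(\eta) = -\log g(\eta) = e^{\eta}$. Finite differences of $A$ should match the sample mean and variance of $u(x) = x$; the step size is an arbitrary choice for the sketch.

```python
import numpy as np

mu = 4.2
eta = np.log(mu)
A = lambda e: np.exp(e)       # Poisson log normalizer: A(eta) = -log g(eta) = exp(eta)

eps = 1e-5
dA  = (A(eta + eps) - A(eta - eps)) / (2 * eps)               # first derivative of A
d2A = (A(eta + eps) - 2 * A(eta) + A(eta - eps)) / eps ** 2   # second derivative of A

samples = np.random.default_rng(2).poisson(mu, size=1_000_000)
print(dA, samples.mean())     # both approximately 4.2, the mean of u(x) = x
print(d2A, samples.var())     # both approximately 4.2, the variance of u(x) = x
```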
Conjugate priors
Conjugacy is an important property in Bayesian inference. If you are unfamiliar with the term, please read my previous post first. Every exponential family member has a conjugate prior of the form
$$p(\boldsymbol{\eta} \mid \boldsymbol{\chi}, \nu) = f(\boldsymbol{\chi}, \nu)\, g(\boldsymbol{\eta})^{\nu}\exp\{\boldsymbol{\eta}^{\top}\boldsymbol{\chi}\}$$
where $\nu$ and $\boldsymbol{\chi}$ are hyperparameters and $f(\boldsymbol{\chi}, \nu)$ depends on the form of the exponential family member. To verify conjugacy, recall that we must show that the posterior has the same functional form as the prior. Let’s confirm that:

$$p(\boldsymbol{\eta} \mid \mathbf{X}, \boldsymbol{\chi}, \nu) \propto p(\mathbf{X} \mid \boldsymbol{\eta})\, p(\boldsymbol{\eta} \mid \boldsymbol{\chi}, \nu) = \left(\prod_{i=1}^{n} h(\mathbf{x}_i)\right) f(\boldsymbol{\chi}, \nu)\, g(\boldsymbol{\eta})^{n}\, g(\boldsymbol{\eta})^{\nu}\exp\left\{\boldsymbol{\eta}^{\top}\sum_{i=1}^{n}\mathbf{u}(\mathbf{x}_i)\right\}\exp\{\boldsymbol{\eta}^{\top}\boldsymbol{\chi}\}$$
Since the first n+1 terms are constant w.r.t. η, we can write
$$p(\boldsymbol{\eta} \mid \mathbf{X}, \boldsymbol{\chi}, \nu) \propto g(\boldsymbol{\eta})^{n + \nu}\exp\left\{\boldsymbol{\eta}^{\top}\left(\sum_{i=1}^{n}\mathbf{u}(\mathbf{x}_i) + \boldsymbol{\chi}\right)\right\}$$
And we’re done. Note that this is the same exponential family form as the prior, with parameters:
$$\begin{aligned}
\nu &= \nu_{\text{prior}} + n \\
\boldsymbol{\chi} &= \boldsymbol{\chi}_{\text{prior}} + \sum_{i=1}^{n}\mathbf{u}(\mathbf{x}_i)
\end{aligned}$$
A major benefit of the exponential family representation is that both the prior and posterior take the same form with respect to $\boldsymbol{\eta}$, and the difference is modeled entirely by the hyperparameters $\nu$ and $\boldsymbol{\chi}$, the latter of which is a function of the sufficient statistics of our data. This framework lends itself nicely to sequential learning. In this case, after observing each new data point $\mathbf{x}_{\text{new}}$, the updates are

$$\begin{aligned}
\nu &\leftarrow \nu + 1 \\
\boldsymbol{\chi} &\leftarrow \boldsymbol{\chi} + \mathbf{u}(\mathbf{x}_{\text{new}})
\end{aligned}$$
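As a concrete sketch of sequential learning, here is a Python example with Bernoulli data. With $u(x) = x$, the prior $p(\eta \mid \chi, \nu)$ corresponds (after a change of variables) to a $\text{Beta}(\chi, \nu - \chi)$ prior on $\mu$, so $\chi = 1, \nu = 2$ plays the role of a uniform prior and $\chi / \nu$ is the posterior mean of $\mu$; treat that correspondence as an assumption adopted for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.binomial(1, 0.7, size=1000)   # Bernoulli(0.7) observations arriving sequentially

nu, chi = 2.0, 1.0                       # prior hyperparameters (uniform Beta(1, 1) on mu)

for x in data:                           # one observation at a time
    nu  += 1                             # nu  <- nu + 1
    chi += x                             # chi <- chi + u(x), with u(x) = x

print(chi / nu)                          # posterior mean of mu, approximately 0.7
```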
Conclusion
Many common distributions are members of the exponential family, which has many convenient mathematical properties. Exponential family likelihoods allow for inference with incrementally computable sufficient statistics. These models are conjugate such that both the prior and posterior have the same functional form over hyperparameters ν and χ. The moments of the exponential family can be obtained by differentiating the normalizer g(η). And any efficient inference algorithms that work for the exponential family can be abstracted to work for a variety of common distributions. Furthermore, while we did not prove it in this post, we note that the natural parameter space is convex (Wainwright & Jordan, 2008).
References
Pitman, E. J. G. (1936). Sufficient statistics and intrinsic accuracy. Mathematical Proceedings of the Cambridge Philosophical Society, 32(4), 567–579.
Koopman, B. O. (1936). On distributions admitting a sufficient statistic. Transactions of the American Mathematical Society, 39(3), 399–409.
Darmois, G. (1935). Sur les lois de probabilité à estimation exhaustive. C. R. Acad. Sci. Paris, 260(1265), 85.
Wainwright, M. J., & Jordan, M. I. (2008). Graphical models, exponential families, and variational inference. Foundations and Trends® in Machine Learning, 1(1–2), 1–305.