The Exponential Family

Probability distributions that are members of the exponential family have mathematically convenient properties for Bayesian inference. I provide the general form, work through several examples, and discuss several important properties.

The exponential family is a class of probability distributions with convenient mathematical properties (Pitman, 1936; Koopman, 1936; Darmois, 1935). Many commonly used distributions are part of the exponential family, such as the Gaussian, exponential, gamma, chi-squared, beta, Dirichlet, Bernoulli, categorical, Poisson, Wishart, inverse Wishart, and geometric distributions. I want to start by just providing the general form and then demonstrating that a few example distributions are members of the family. Once we are familiar with the form, we will discuss several important properties for Bayesian inference.

The general form for any member of the exponential family is

$$ p(\textbf{x} \mid \boldsymbol{\eta}) = h(\textbf{x}) g(\boldsymbol{\eta}) \exp \big\{ \boldsymbol{\eta}^{\top} u(\textbf{x}) \big\} \tag{1} $$

where

$$ g(\boldsymbol{\eta}) \int h(\textbf{x}) \exp\big\{ \boldsymbol{\eta}^{\top} u(\textbf{x}) \big\} \, \text{d}\textbf{x} = 1 \tag{2} $$

In simpler notation, we want to ensure that $f(x)$ is a valid probability density by computing $Z = 1 / \int f(x) \, \text{d}x$ and then normalizing so that $\int Z f(x) \, \text{d}x = 1$; here $g(\boldsymbol{\eta})$ plays the role of $Z$. Equation 1 is sometimes written as

$$ p(\textbf{x} \mid \boldsymbol{\eta}) = h(\textbf{x}) \exp \big\{ \boldsymbol{\eta}^{\top} u(\textbf{x}) - A(\boldsymbol{\eta}) \big\} $$

where

$$ \begin{aligned} A(\boldsymbol{\eta}) &= - \log g(\boldsymbol{\eta}) \\ &= - \log \Big( \frac{1}{\int h(\textbf{x}) \exp\big\{ \boldsymbol{\eta}^{\top} u(\textbf{x}) \big\} \, \text{d}\textbf{x}} \Big) \\ &= \log \int h(\textbf{x}) \exp\big\{ \boldsymbol{\eta}^{\top} u(\textbf{x}) \big\} \, \text{d}\textbf{x} \end{aligned} $$

One way to think about this form is that it is akin to a superclass in programming. Any distribution that can be written as Equation 1 can be shown to have certain useful properties. Before discussing these properties, let’s look at a few examples.
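
To make the superclass analogy concrete, here is a minimal Python sketch (the class and method names are my own, not a standard library API): each member only needs to supply $h(\textbf{x})$, the sufficient statistic $u(\textbf{x})$, and the log normalizer $A(\boldsymbol{\eta})$, and the log density then follows directly from Equation 1.

```python
import numpy as np

class ExponentialFamily:
    """Sketch of the exponential family as a 'superclass': subclasses
    provide h(x), the sufficient statistic u(x), and the log normalizer
    A(eta); the log density follows from Equation 1."""

    def h(self, x):
        raise NotImplementedError

    def u(self, x):
        raise NotImplementedError

    def A(self, eta):
        raise NotImplementedError

    def log_prob(self, x, eta):
        # log p(x | eta) = log h(x) + eta^T u(x) - A(eta).
        eta = np.atleast_1d(eta)
        return np.log(self.h(x)) + eta @ np.atleast_1d(self.u(x)) - self.A(eta)
```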

Examples

Bernoulli distribution in exponential family form

Let $x$ be a Bernoulli random variable with parameter $0 \leq \mu \leq 1$. Since these are scalar values, we do not denote them with bold symbols. The functional form of the Bernoulli is

$$ p(x \mid \mu) = \text{Bern}(x \mid \mu) = \mu^{x} (1 - \mu)^{1 - x} $$

Let’s try to get this into standard exponential form (Equation 1) with a little algebraic manipulation. First, we can introduce an exponent using the identity $a = \exp \log a$ for $a > 0$:

$$ \begin{aligned} p(x \mid \mu) &= \mu^{x} (1 - \mu)^{1 - x} \\ &= \exp\big\{ \log \big( \mu^{x} (1 - \mu)^{1 - x} \big) \big\} \\ &= \exp\big\{ x \log \mu + \log (1 - \mu) - x \log(1 - \mu) \big\} \end{aligned} $$

Now we want to express everything inside the exponent as a function of $x$ times something else. To do this, we move any terms that do not depend on $x$ out of the exponent. Using the fact that $e^{a + b} = e^a e^b$, we have

$$ \begin{aligned} p(x \mid \mu) &= \exp\big\{ x \log \mu + \log (1 - \mu) - x \log(1 - \mu) \big\} \\ &= \exp\big\{\log (1 - \mu)\big\} \exp\big\{ x \log \mu - x \log(1 - \mu) \big\} \\ &= (1 - \mu) \exp\big\{ x \log \mu - x \log(1 - \mu) \big\} \\ &= (1 - \mu) \exp\big\{ x \log \big( \frac{\mu}{1 - \mu} \big) \big\} \end{aligned} $$

This looks like it is in exponential form. Now we want to express the term in front of the exponent, $1 - \mu$, in terms of $\eta$. We can note that $\eta = \log \frac{\mu}{1 - \mu}$ according to Equation 1 and then use this fact to solve for $\mu$ in terms of $\eta$:

$$ \begin{aligned} \eta &= \log \big( \frac{\mu}{1-\mu} \big) \\ e^{\eta} &= \frac{\mu}{1-\mu} \\ e^{\eta} - \mu e^{\eta} &= \mu \\ \mu + \mu e^{\eta} &= e^{\eta} \\ \mu &= \frac{e^{\eta}}{1 + e^{\eta}} \\ \mu &= \frac{1}{1 + e^{-\eta}} \tag{3} \end{aligned} $$

where Equation 3 is the sigmoid function, denoted $\sigma(\cdot)$. We can write $1 - \mu$ in terms of $\sigma(\cdot)$ and $\eta$ as

$$ \sigma(-\eta) = \frac{1}{1 + e^{\eta}} = 1 - \frac{1}{1 + e^{-\eta}} = 1 - \mu $$

We could have defined a new function whose argument was just $\eta$, but I believe this formulation is used so that $g(\eta)$ can be written in terms of the standard sigmoid function $\sigma$. Putting everything together, we can express the Bernoulli distribution in the exponential family’s standard form as

$$ p(x \mid \eta) = \sigma(-\eta) \exp(\eta x) $$

where

$$ u(x) = x \qquad h(x) = 1 \qquad \eta = \log \Big( \frac{\mu}{1 - \mu} \Big) \qquad g(\eta) = \sigma(-\eta) $$
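
As a quick sanity check (a sketch, not part of the original derivation), we can verify numerically that this form reproduces the usual Bernoulli pmf:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

mu = 0.3
eta = np.log(mu / (1.0 - mu))    # natural parameter: the log odds

for x in (0, 1):
    standard = mu ** x * (1.0 - mu) ** (1 - x)     # mu^x (1 - mu)^(1 - x)
    exp_family = sigmoid(-eta) * np.exp(eta * x)   # sigma(-eta) exp(eta x)
    assert np.isclose(standard, exp_family)
```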

Poisson distribution in exponential family form

Let $x$ be a Poisson random variable with parameter $\mu > 0$. The functional form of the Poisson is

$$ p(x \mid \mu) = \frac{e^{-\mu} \mu^x}{x!} $$

and can be written in exponential family form as

$$ \begin{aligned} p(x \mid \mu) &= \frac{1}{x!} \exp(-\mu) \exp(x \log \mu) \\ &= h(x) g(\eta) \exp(\eta x) \end{aligned} $$

where

$$ u(x) = x \qquad h(x) = \frac{1}{x!} \qquad \eta = \log \mu \qquad g(\eta) = \exp(-e^{\eta}) $$
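
Again, a quick numerical check of this factorization (illustrative only; the parameter value is arbitrary):

```python
from math import exp, factorial, log

mu = 2.5
eta = log(mu)    # natural parameter

for x in range(6):
    standard = exp(-mu) * mu ** x / factorial(x)    # e^{-mu} mu^x / x!
    exp_family = (1.0 / factorial(x)) * exp(-exp(eta)) * exp(eta * x)
    assert abs(standard - exp_family) < 1e-12
```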

Gaussian distribution in exponential family form

As a final example of a distribution with two parameters, the functional form of the univariate Gaussian distribution is

$$ p(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp \big\{ - \frac{1}{2 \sigma^2} (x - \mu)^2 \big\} $$

We already have a term with an exponent, but let’s once again move any terms that do not contain $x$ out of the exponent:

$$ \begin{aligned} p(x \mid \mu, \sigma^2) &= \frac{1}{\sqrt{2 \pi \sigma^2}} \exp \big\{ - \frac{1}{2 \sigma^2} (x - \mu)^2 \big\} \\ &= \frac{1}{\sqrt{2 \pi \sigma^2}} \exp \big\{ - \frac{1}{2 \sigma^2} x^2 + \frac{\mu}{\sigma^2} x - \frac{\mu^2}{2 \sigma^2} \big\} \\ &= \frac{1}{\sqrt{2 \pi \sigma^2}} \exp \big\{ - \frac{\mu^2}{2 \sigma^2} \big\} \exp \big\{ \frac{\mu}{\sigma^2} x - \frac{1}{2 \sigma^2} x^2 \big\} \tag{4} \end{aligned} $$

This suggests the following values for $u(\textbf{x})$ and $\boldsymbol{\eta}$ (we re-introduce the bold symbols to denote vectors):

$$ \boldsymbol{u}(x) = \begin{bmatrix} x \\ x^2 \end{bmatrix} \qquad \boldsymbol{\eta}(\mu, \sigma^2) = \begin{bmatrix} \frac{\mu}{\sigma^2} \\ - \frac{1}{2 \sigma^2} \end{bmatrix} $$

Here, we can see that the natural parameters $\boldsymbol{\eta}$ contain both $\mu$ and $\sigma^2$. Once again, we can use these values to derive $h(\textbf{x})$ and $g(\boldsymbol{\eta})$. We can write $h(\textbf{x})$ by collecting terms that do not depend on $\mu$ or $\sigma^2$, since those variables are in the definition of $\boldsymbol{\eta}$. So

$$ h(\textbf{x}) = \frac{1}{\sqrt{2 \pi}} = (2 \pi)^{-1/2} $$

This means that $g(\boldsymbol{\eta})$ must account for everything else outside the rightmost exponent in Equation 4, or

$$ g(\boldsymbol{\eta}) = \frac{1}{\sigma} \exp \big\{ - \frac{\mu^2}{2 \sigma^2} \big\} $$

We can write $\frac{1}{\sigma}$ in terms of $\eta_2$ by observing that

$$ (- 2 \eta_2)^{1/2} = \sqrt{-2 \Big(- \frac{1}{2 \sigma^2} \Big)} = \frac{1}{\sigma} $$

We can write the exponent in terms of both $\eta_1$ and $\eta_2$ by observing that

$$ \frac{\eta_1^2}{4 \eta_2} = \frac{\big( \frac{\mu}{\sigma^2} \big)^2}{4 \big( -\frac{1}{2 \sigma^2} \big)} = \frac{\frac{\mu^2}{\sigma^4}}{\frac{-2}{\sigma^2}} = - \frac{\mu^2}{2 \sigma^2} $$

Putting this all together, we get

$$ p(\textbf{x} \mid \boldsymbol{\eta}) = h(\textbf{x}) g(\boldsymbol{\eta}) \exp \big\{ \boldsymbol{\eta}^{\top} \boldsymbol{u}(x) \big\} $$

where

$$ \boldsymbol{u}(x) = \begin{bmatrix} x \\ x^2 \end{bmatrix} \qquad h(\textbf{x}) = (2 \pi)^{-1/2} \qquad g(\boldsymbol{\eta}) = (- 2 \eta_2)^{1/2} \exp \Big\{ \frac{\eta_1^2}{4 \eta_2} \Big\} $$
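
As with the previous examples, we can check this factorization numerically (a sketch; the parameter values are arbitrary):

```python
import numpy as np

mu, sigma2 = 1.5, 0.7
eta = np.array([mu / sigma2, -1.0 / (2.0 * sigma2)])   # natural parameters

def gaussian_exp_family(x):
    h = (2.0 * np.pi) ** -0.5
    g = np.sqrt(-2.0 * eta[1]) * np.exp(eta[0] ** 2 / (4.0 * eta[1]))
    return h * g * np.exp(eta @ np.array([x, x ** 2]))   # h(x) g(eta) exp(eta^T u(x))

for x in np.linspace(-2.0, 4.0, 7):
    standard = np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)
    assert np.isclose(standard, gaussian_exp_family(x))
```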

Sufficient statistics

We will now show why $u(\textbf{x})$ is called the sufficient statistic. The upshot is that $\sum u(\textbf{x}_i)$ is all we need to compute the maximum likelihood estimates of the natural parameters. To show this, let’s maximize the log likelihood. For any distribution in exponential family form, the likelihood is

$$ p(X \mid \boldsymbol{\eta}) = \prod_{i=1}^{n} p(\textbf{x}_i \mid \boldsymbol{\eta}) = \prod_{i=1}^{n} h(\textbf{x}_i) g(\boldsymbol{\eta}) \exp \big\{ \boldsymbol{\eta}^{\top} u(\textbf{x}_i) \big\} $$

where $X = \{\textbf{x}_1, \textbf{x}_2, \dots, \textbf{x}_n\}$ is a set of $n$ vectors. Since, by construction, $g(\boldsymbol{\eta})$ has no dependence on $X$, we can write

$$ p(X \mid \boldsymbol{\eta}) = \Big( \prod_{i=1}^{n} h(\textbf{x}_i) \Big) g(\boldsymbol{\eta})^{n} \exp \big\{ \boldsymbol{\eta}^{\top} \sum_{i=1}^n u(\textbf{x}_i) \big\} $$

And the log likelihood is

$$ \log p(X \mid \boldsymbol{\eta}) = \log \Big( \prod_{i=1}^{n} h(\textbf{x}_i) \Big) + n \log g(\boldsymbol{\eta}) + \boldsymbol{\eta}^{\top} \sum_{i=1}^n u(\textbf{x}_i) \tag{5} $$

Now if we want to find the maximum likelihood estimate $\boldsymbol{\eta}_{\text{ML}}$, we can compute the derivative of Equation 5, set it equal to $0$, and solve for $\boldsymbol{\eta}$. But first, note that $h(\textbf{x})$ has no dependence on $\boldsymbol{\eta}$ and will disappear in the derivative, so we want to solve

$$ \nabla n \log g(\boldsymbol{\eta}) + \nabla \boldsymbol{\eta}^{\top} \sum_{i=1}^n u(\textbf{x}_i) = 0 $$

Dividing both sides by $n$ and moving one term to the other side of the equality, we get

$$ -\nabla \log g(\boldsymbol{\eta}) = \nabla \boldsymbol{\eta}^{\top} \frac{1}{n} \sum_{i=1}^n u(\textbf{x}_i) $$

Finally, note that $\nabla \boldsymbol{\eta}^{\top} \textbf{x} = \textbf{x}$. This is because, for the $i$-th component of the $p$-dimensional gradient,

$$ \frac{\partial}{\partial \eta_i} (\eta_1 x_1 + \eta_2 x_2 + \dots + \eta_i x_i + \dots + \eta_p x_p) = x_i $$

This gives us

$$ -\nabla \log g(\boldsymbol{\eta}_{\text{ML}}) = \nabla A(\boldsymbol{\eta}_{\text{ML}}) = \frac{1}{n} \sum_{i=1}^n u(\textbf{x}_i) \tag{6} $$

The point of this derivation is to demonstrate that the optimal natural parameters $\boldsymbol{\eta}_{\text{ML}}$ only depend upon $\sum u(\textbf{x}_i)$. In other words, we do not need to store all the data, only $\sum u(\textbf{x}_i)$, which we can think of as a compact representation or summarization of our data. Importantly, because this sufficient statistic is a sum, we can compute it incrementally as the data arrives.
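
Here is a small sketch of that idea for the Gaussian: the running sum of $u(x) = [x, x^2]^{\top}$ is all we keep, and the maximum likelihood mean and variance can be read off at any time. The specific numbers below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.5, size=10_000)

suff = np.zeros(2)   # running sum of u(x) = [x, x^2]
n = 0
for x in data:       # data can arrive one observation at a time
    suff += np.array([x, x ** 2])
    n += 1

mean_ml = suff[0] / n
var_ml = suff[1] / n - mean_ml ** 2   # E[x^2] - E[x]^2
print(mean_ml, var_ml)                # roughly 2.0 and 1.5**2 = 2.25
```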

Moments through differentiation

The mean and variance of a probability distribution are defined using integration:

$$ \mathbb{E}[X] = \int_x x f(x) \, dx \qquad \text{Var}(X) = \mathbb{E}[(X - \mathbb{E}[X])^2] $$

But in the exponential family, we can compute moments through differentiation (of $g(\boldsymbol{\eta})$), which is typically easier than integration. To show this, let’s compute the derivative of Equation 2. First, for ease of notation, let

$$ f(\textbf{x}, \boldsymbol{\eta}) = h(\textbf{x}) \exp\big\{ \boldsymbol{\eta}^{\top} u(\textbf{x}) \big\} $$

Then, differentiating both sides of Equation 2 with respect to $\boldsymbol{\eta}$ using the product rule, and noting that $\nabla_{\boldsymbol{\eta}} f(\textbf{x}, \boldsymbol{\eta}) = f(\textbf{x}, \boldsymbol{\eta}) \, u(\textbf{x})$, we have

$$ \begin{aligned} \nabla \Big( g(\boldsymbol{\eta}) \int f(\textbf{x}, \boldsymbol{\eta}) \, \text{d}\textbf{x} \Big) &= \nabla 1 \\ \nabla g(\boldsymbol{\eta}) \int f(\textbf{x}, \boldsymbol{\eta}) \, \text{d}\textbf{x} + g(\boldsymbol{\eta}) \int f(\textbf{x}, \boldsymbol{\eta}) u(\textbf{x}) \, \text{d}\textbf{x} &= 0 \\ -\nabla g(\boldsymbol{\eta}) \int f(\textbf{x}, \boldsymbol{\eta}) \, \text{d}\textbf{x} &= g(\boldsymbol{\eta}) \int f(\textbf{x}, \boldsymbol{\eta}) u(\textbf{x}) \, \text{d}\textbf{x} \tag{7} \end{aligned} $$

Now note that $\mathbb{E}[u(\textbf{x})]$ is the right-hand side of Equation 7,

$$ \begin{aligned} \mathbb{E}[u(\textbf{x})] &= \int u(\textbf{x}) p(\textbf{x} \mid \boldsymbol{\eta}) \, \text{d}\textbf{x} \\ &= \int u(\textbf{x}) h(\textbf{x}) g(\boldsymbol{\eta}) \exp\big\{ \boldsymbol{\eta}^{\top} u(\textbf{x}) \big\} \, \text{d}\textbf{x} \\ &= g(\boldsymbol{\eta}) \int f(\textbf{x}, \boldsymbol{\eta}) u(\textbf{x}) \, \text{d}\textbf{x} \end{aligned} $$

and that, using the normalization constraint in Equation 2, the left-hand side of Equation 7 can be rewritten as

$$ -\nabla g(\boldsymbol{\eta}) \int f(\textbf{x}, \boldsymbol{\eta}) \, \text{d}\textbf{x} = - \nabla g(\boldsymbol{\eta}) \frac{1}{g(\boldsymbol{\eta})} = - \nabla \log g(\boldsymbol{\eta}) $$

Putting this together, we get

$$ -\nabla \log g(\boldsymbol{\eta}) = \nabla A(\boldsymbol{\eta}) = \mathbb{E}[u(\textbf{x})] $$

In words, we can differentiate the log normalizer to compute the mean of the sufficient statistic. Furthermore, note that as $n \rightarrow \infty$, the right-hand side of Equation 6 becomes $\mathbb{E}[u(\textbf{x})]$. This demonstrates that $\boldsymbol{\eta}_{\text{ML}}$ converges to the true parameter value of $\boldsymbol{\eta}$ in the limit, i.e., that our estimator is consistent.

Furthermore, if you take the second derivative (not shown here), it is easy to see that

$$ \nabla^2 A(\boldsymbol{\eta}) = \text{Var}(u(\textbf{x})) $$

See (Wainwright & Jordan, 2008) for a derivation.
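
To see this concretely, consider the Poisson example above, where $A(\eta) = -\log g(\eta) = e^{\eta}$. Differentiating $A$ (here with finite differences, purely as an illustration) recovers the mean and the variance, which for the Poisson are both $\mu$:

```python
import numpy as np

def A(eta):
    return np.exp(eta)    # Poisson log normalizer: A(eta) = exp(eta)

mu = 3.2
eta = np.log(mu)
eps = 1e-4

dA = (A(eta + eps) - A(eta - eps)) / (2 * eps)               # ~ E[u(x)] = mu
d2A = (A(eta + eps) - 2 * A(eta) + A(eta - eps)) / eps ** 2  # ~ Var(u(x)) = mu
print(dA, d2A)   # both approximately 3.2
```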

Conjugate priors

Conjugacy is an important property in Bayesian inference. If you are unfamiliar with the term, please read my previous post first. Every exponential family member has a conjugate prior of the form

$$ p(\boldsymbol{\eta} \mid \boldsymbol{\chi}, \nu) = f(\boldsymbol{\chi}, \nu) g(\boldsymbol{\eta})^{\nu} \exp \big\{\boldsymbol{\eta}^{\top} \boldsymbol{\chi} \big\} $$

where $\nu$ and $\boldsymbol{\chi}$ are hyperparameters and $f(\boldsymbol{\chi}, \nu)$ depends on the form of the exponential family member. To verify conjugacy, recall that we must show that the posterior has the same functional form as the prior. Let’s confirm that:

$$ \begin{aligned} p(\boldsymbol{\eta} \mid X, \boldsymbol{\chi}, \nu) &\propto p(X \mid \boldsymbol{\eta}) \, p(\boldsymbol{\eta} \mid \boldsymbol{\chi}, \nu) \\ &= \Big( \Big( \prod_{i=1}^{n} h(\textbf{x}_i) \Big) g(\boldsymbol{\eta})^{n} \exp \big\{ \boldsymbol{\eta}^{\top} \sum_{i=1}^n u(\textbf{x}_i) \big\} \Big) \Big( f(\boldsymbol{\chi}, \nu) g(\boldsymbol{\eta})^{\nu} \exp \big\{\boldsymbol{\eta}^{\top} \boldsymbol{\chi} \big\} \Big) \\ &= \Big( \prod_{i=1}^{n} h(\textbf{x}_i) \Big) f(\boldsymbol{\chi}, \nu) \, g(\boldsymbol{\eta})^{n + \nu} \exp \big\{ \boldsymbol{\eta}^{\top} \sum_{i=1}^n u(\textbf{x}_i) + \boldsymbol{\eta}^{\top} \boldsymbol{\chi} \big\} \end{aligned} $$

Since the first $n + 1$ terms are constant w.r.t. $\boldsymbol{\eta}$, we can write

$$ p(\boldsymbol{\eta} \mid X, \boldsymbol{\chi}, \nu) \propto g(\boldsymbol{\eta})^{n + \nu} \exp \big\{ \boldsymbol{\eta}^{\top} \big( \sum_{i=1}^n u(\textbf{x}_i) + \boldsymbol{\chi} \big) \big\} $$

And we’re done. Note that this is the same exponential family form as the prior, with parameters:

$$ \begin{aligned} \nu &= \nu_{\text{prior}} + n \\ \boldsymbol{\chi} &= \boldsymbol{\chi}_{\text{prior}} + \sum_{i=1}^{n} u(\textbf{x}_i) \end{aligned} $$

A major benefit of the exponential family representation is that both the prior and posterior take the same form with respect to $\boldsymbol{\eta}$, and the difference is captured entirely by the hyperparameters $\nu$ and $\boldsymbol{\chi}$, the latter of which is a function of the sufficient statistics of our data. This framework lends itself nicely to sequential learning. In this case, the updates are

$$ \begin{aligned} \nu_{t} &= \begin{cases} \nu_{\text{prior}} & \text{if } t = 0 \\ \nu_{t-1} + 1 & \text{if } t > 0 \end{cases} \\ \boldsymbol{\chi}_{t} &= \begin{cases} \boldsymbol{\chi}_{\text{prior}} & \text{if } t = 0 \\ \boldsymbol{\chi}_{t-1} + u(\textbf{x}_{t-1}) & \text{if } t > 0 \end{cases} \end{aligned} $$

which are straightforward to compute.
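
As a sketch of these sequential updates, take the Bernoulli likelihood ($u(x) = x$): each new observation increments $\nu$ and adds $u(x)$ to $\boldsymbol{\chi}$, and the ratio $\chi_t / \nu_t$ acts as a running point estimate of $\mu$. The prior hyperparameter values below are illustrative pseudo-counts.

```python
import numpy as np

rng = np.random.default_rng(1)
true_mu = 0.7
nu, chi = 2.0, 1.0    # illustrative prior hyperparameters (pseudo-counts)

for t in range(1_000):
    x = rng.binomial(1, true_mu)   # observations arrive one at a time
    nu += 1.0                      # nu gains one pseudo-count per observation
    chi += x                       # chi accumulates u(x) = x

print(chi / nu)   # running estimate of mu; approaches 0.7 as data accumulate
```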

Conclusion

Many common distributions are members of the exponential family, which has a number of convenient mathematical properties. Exponential family likelihoods allow for inference with incrementally computable sufficient statistics. These likelihoods have conjugate priors such that both the prior and posterior have the same functional form over the hyperparameters $\nu$ and $\boldsymbol{\chi}$. The moments of the exponential family can be obtained by differentiating the normalizer $g(\boldsymbol{\eta})$. And any efficient inference algorithm that works for the exponential family can be abstracted to work for a variety of common distributions. Furthermore, while we did not prove it in this post, we note that the natural parameter space is convex (Wainwright & Jordan, 2008).

  1. Pitman, E. J. G. (1936). Sufficient statistics and intrinsic accuracy. Mathematical Proceedings of the Cambridge Philosophical Society, 32(4), 567–579.
  2. Koopman, B. O. (1936). On distributions admitting a sufficient statistic. Transactions of the American Mathematical Society, 39(3), 399–409.
  3. Darmois, G. (1935). Sur les lois de probabilité à estimation exhaustive. CR Acad. Sci. Paris, 260(1265), 85.
  4. Wainwright, M. J., & Jordan, M. I. (2008). Graphical models, exponential families, and variational inference. Foundations and Trends® in Machine Learning, 1(1–2), 1–305.