Probability distributions that are members of the exponential family have mathematically convenient properties for Bayesian inference. I provide the general form, work through several examples, and discuss several important properties.
Published
19 March 2019
The exponential family is a class of probability distributions with convenient mathematical properties (Pitman, 1936; Koopman, 1936; Darmois, 1935). Many commonly used distributions are part of the exponential family, such as the Gaussian, exponential, gamma, chi-squared, beta, Dirichlet, Bernoulli, categorical, Poisson, Wishart, inverse Wishart, and geometric distributions. I want to start by just providing the general form and then demonstrating that a few example distributions are members of the family. Once we are familiar with the form, we will discuss several important properties for Bayesian inference.
The general form for any member of the exponential family is
$$p(\mathbf{x} \mid \boldsymbol{\eta}) = h(\mathbf{x})\, g(\boldsymbol{\eta}) \exp\{\boldsymbol{\eta}^{\top} \mathbf{u}(\mathbf{x})\} \tag{1}$$
where
η is the natural parameter.
h(x) is the underlying measure.
u(x) is the sufficient statistic of the data.
g(η) is the normalizer, ensuring that
$$g(\boldsymbol{\eta}) \int h(\mathbf{x}) \exp\{\boldsymbol{\eta}^{\top} \mathbf{u}(\mathbf{x})\} \,\text{d}\mathbf{x} = 1 \tag{2}$$
In simpler notation, we ensure that $f(\mathbf{x})$ is a valid probability density by computing $Z = 1 / \int f(\mathbf{x})\,\text{d}\mathbf{x}$ and then normalizing so that $\int Z f(\mathbf{x})\,\text{d}\mathbf{x} = 1$. Equation 1 is sometimes written in terms of the log normalizer $A(\boldsymbol{\eta}) = -\log g(\boldsymbol{\eta})$ as

$$p(\mathbf{x} \mid \boldsymbol{\eta}) = h(\mathbf{x}) \exp\{\boldsymbol{\eta}^{\top} \mathbf{u}(\mathbf{x}) - A(\boldsymbol{\eta})\}$$
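To make the general form concrete, here is a minimal Python sketch that evaluates Equation 1 for any distribution specified by its $h$, $\mathbf{u}$, and $g$. The function name and the choice of the exponential distribution as the worked example are illustrative assumptions, not anything canonical.

```python
import numpy as np

def expfam_pdf(h, u, g, eta, x):
    """Evaluate Equation 1: p(x | eta) = h(x) g(eta) exp{eta^T u(x)}."""
    return h(x) * g(eta) * np.exp(np.dot(eta, u(x)))

# Illustrative check with the exponential distribution p(x | lam) = lam * exp(-lam * x),
# which has h(x) = 1, u(x) = [x], eta = [-lam], and g(eta) = -eta_1.
lam = 2.0
value = expfam_pdf(h=lambda x: 1.0,
                   u=lambda x: np.array([x]),
                   g=lambda eta: -eta[0],
                   eta=np.array([-lam]),
                   x=0.5)
print(value)  # lam * exp(-lam * 0.5), approximately 0.7358
```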
One way to think about this form is that it is akin to a superclass in programming. Any distribution that can be written as Equation 1 can be shown to have certain useful properties. Before discussing these properties, let’s look at a few examples.
Examples
Bernoulli distribution in exponential family form
Let x be a Bernoulli random variable with parameter 0≤μ≤1. Since these are scalar values, we do not denote them with bold symbols. The functional form of the Bernoulli is
$$p(x \mid \mu) = \text{Bern}(x \mid \mu) = \mu^{x}(1 - \mu)^{1 - x}$$
Let’s try to get this into standard exponential family form (Equation 1) with a little algebraic manipulation. First, we can introduce an exponent using the identity $a = \exp(\log a)$:

$$p(x \mid \mu) = \exp\left\{\log\left[\mu^{x}(1 - \mu)^{1 - x}\right]\right\} = \exp\{x \log \mu + (1 - x)\log(1 - \mu)\}$$
Now we want to express everything inside the exponent as a function of x times something else. To do this, we move any terms that do not depend on x out of the exponent. Using the fact that $e^{a+b} = e^{a}e^{b}$, we have

$$p(x \mid \mu) = \exp\left\{x \log\left(\frac{\mu}{1 - \mu}\right) + \log(1 - \mu)\right\} = (1 - \mu)\exp\left\{x \log\left(\frac{\mu}{1 - \mu}\right)\right\}$$
This looks like it is in exponential family form. Now we want to express the term in front of the exponent, $1 - \mu$, in terms of $\eta$. Matching against Equation 1, we can note that $\eta = \log\left(\frac{\mu}{1 - \mu}\right)$ and then use this fact to solve for $\mu$ in terms of $\eta$:

$$e^{\eta} = \frac{\mu}{1 - \mu} \quad\Longrightarrow\quad \mu = \frac{1}{1 + e^{-\eta}} \tag{3}$$
where Equation 3 is the sigmoid function, denoted σ(⋅). We can write 1−μ in terms of σ(⋅) and η as
$$\sigma(-\eta) = \frac{1}{1 + e^{\eta}} = 1 - \frac{1}{1 + e^{-\eta}} = 1 - \mu$$
We could have defined this function so that its argument was just $\eta$, but I believe the $\sigma(-\eta)$ formulation is used so that $\sigma$ is the standard sigmoid function. Putting everything together, we can express the Bernoulli distribution in the exponential family’s standard form as
$$p(x \mid \eta) = \sigma(-\eta)\exp(\eta x)$$
where

$$\begin{aligned}
\eta &= \log\left(\frac{\mu}{1 - \mu}\right) \\
h(x) &= 1 \\
u(x) &= x \\
g(\eta) &= \sigma(-\eta) = \frac{1}{1 + e^{\eta}}
\end{aligned}$$
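As a sanity check, here is a small Python snippet, just a sketch using nothing beyond Equation 1 and the quantities above, confirming that $\sigma(-\eta)\exp(\eta x)$ reproduces $\mu^{x}(1-\mu)^{1-x}$ for both values of $x$.

```python
import numpy as np

def sigma(t):
    """The sigmoid function."""
    return 1.0 / (1.0 + np.exp(-t))

mu = 0.3
eta = np.log(mu / (1 - mu))                       # natural parameter

for x in (0, 1):
    standard   = mu**x * (1 - mu)**(1 - x)        # Bern(x | mu)
    exp_family = sigma(-eta) * np.exp(eta * x)    # g(eta) exp(eta * u(x)), with h(x) = 1
    print(x, standard, exp_family)                # the two values agree for each x
```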
Poisson distribution in exponential family form
Let x be a Poisson random variable with parameter μ>0. The functional form of the Poisson is
$$p(x \mid \mu) = \frac{e^{-\mu}\mu^{x}}{x!}$$
and can be written in exponential family form as
$$p(x \mid \mu) = \frac{1}{x!}\exp(-\mu)\exp(x \log \mu) = h(x)\, g(\eta)\exp(\eta x)$$
where

$$\begin{aligned}
\eta &= \log \mu \\
h(x) &= \frac{1}{x!} \\
u(x) &= x \\
g(\eta) &= \exp(-\exp(\eta))
\end{aligned}$$
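Again, a quick numerical check in Python (a sketch that leans on SciPy's Poisson pmf only for comparison) shows that this exponential family form matches the standard Poisson pmf.

```python
import numpy as np
from scipy.special import factorial
from scipy.stats import poisson

mu = 4.2
eta = np.log(mu)                      # natural parameter

x = np.arange(10)
exp_family = (1.0 / factorial(x)) * np.exp(-np.exp(eta)) * np.exp(eta * x)
print(np.allclose(exp_family, poisson.pmf(x, mu)))   # True
```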
Gaussian distribution in exponential family form
As a final example, let’s consider a distribution with two parameters. The functional form of the univariate Gaussian distribution is
$$p(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left\{-\frac{1}{2\sigma^2}(x - \mu)^2\right\}$$
We already have a term with an exponent, but let’s once again move any terms in the exponent that do not contain x out in front. Expanding the square $(x - \mu)^2 = x^2 - 2\mu x + \mu^2$ and collecting terms, we have

$$p(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi}}\,\frac{1}{\sigma}\exp\left\{-\frac{\mu^2}{2\sigma^2}\right\}\exp\left\{\frac{\mu}{\sigma^2}x - \frac{1}{2\sigma^2}x^2\right\} \tag{4}$$
This suggests the following values for u(x) and η (we re-introduce the bold symbols to denote vectors):
$$\mathbf{u}(x) = \begin{bmatrix} x \\ x^2 \end{bmatrix} \qquad \boldsymbol{\eta}(\mu, \sigma^2) = \begin{bmatrix} \dfrac{\mu}{\sigma^2} \\[6pt] -\dfrac{1}{2\sigma^2} \end{bmatrix}$$
Here, we can see that the natural parameters $\boldsymbol{\eta}$ contain both $\mu$ and $\sigma^2$. Once again, we can use these values to derive $h(x)$ and $g(\boldsymbol{\eta})$. We can write $h(x)$ by collecting terms that do not depend on $\mu$ or $\sigma^2$, since those variables are in the definition of $\boldsymbol{\eta}$. So
$$h(x) = \frac{1}{\sqrt{2\pi}} = (2\pi)^{-1/2}$$
This means that g(η) must account for everything else outside the rightmost exponent in Equation 4 or
$$g(\boldsymbol{\eta}) = \frac{1}{\sigma}\exp\left\{-\frac{\mu^2}{2\sigma^2}\right\}$$
We can write $\frac{1}{\sigma}$ in terms of $\eta_2$ by observing that

$$(-2\eta_2)^{1/2} = \left(-2\left(-\frac{1}{2\sigma^2}\right)\right)^{1/2} = \frac{1}{\sigma}$$
We can write the exponent in terms of both $\eta_1$ and $\eta_2$ by observing that
$$\frac{\eta_1^2}{4\eta_2} = \frac{\left(\frac{\mu}{\sigma^2}\right)^2}{4\left(-\frac{1}{2\sigma^2}\right)} = \frac{\frac{\mu^2}{\sigma^4}}{-\frac{2}{\sigma^2}} = -\frac{\mu^2}{2\sigma^2}$$
Putting this all together, we get
$$p(x \mid \boldsymbol{\eta}) = h(x)\, g(\boldsymbol{\eta})\exp\{\boldsymbol{\eta}^{\top}\mathbf{u}(x)\}$$
where

$$\begin{aligned}
\boldsymbol{\eta} &= \begin{bmatrix} \dfrac{\mu}{\sigma^2} & -\dfrac{1}{2\sigma^2} \end{bmatrix}^{\top} \\
h(x) &= (2\pi)^{-1/2} \\
\mathbf{u}(x) &= \begin{bmatrix} x & x^2 \end{bmatrix}^{\top} \\
g(\boldsymbol{\eta}) &= (-2\eta_2)^{1/2}\exp\left\{\frac{\eta_1^2}{4\eta_2}\right\}
\end{aligned}$$
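Here is a short Python sketch that plugs these quantities into Equation 1 and compares the result with SciPy's Gaussian density; the helper name is just for illustration.

```python
import numpy as np
from scipy.stats import norm

mu, sigma = 1.5, 0.8
eta = np.array([mu / sigma**2, -1.0 / (2 * sigma**2)])    # natural parameters

def gaussian_expfam_pdf(x):
    u = np.array([x, x**2])                               # sufficient statistics
    h = (2 * np.pi) ** -0.5
    g = (-2 * eta[1]) ** 0.5 * np.exp(eta[0]**2 / (4 * eta[1]))
    return h * g * np.exp(eta @ u)

xs = np.linspace(-2.0, 5.0, 7)
print(np.allclose([gaussian_expfam_pdf(x) for x in xs], norm.pdf(xs, mu, sigma)))   # True
```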
Sufficient statistics
We will now show why $\mathbf{u}(\mathbf{x})$ is called the sufficient statistic. The upshot is that $\sum_i \mathbf{u}(\mathbf{x}_i)$ is all we need to compute the maximum likelihood estimates of the natural parameters. To show this, let’s maximize the log likelihood. For any distribution in exponential family form, the likelihood of $n$ i.i.d. observations $\mathbf{X} = \{\mathbf{x}_1, \dots, \mathbf{x}_n\}$ is

$$p(\mathbf{X} \mid \boldsymbol{\eta}) = \left(\prod_{i=1}^{n} h(\mathbf{x}_i)\right) g(\boldsymbol{\eta})^{n}\exp\left\{\boldsymbol{\eta}^{\top}\sum_{i=1}^{n}\mathbf{u}(\mathbf{x}_i)\right\}$$

and the log likelihood is

$$\log p(\mathbf{X} \mid \boldsymbol{\eta}) = \sum_{i=1}^{n}\log h(\mathbf{x}_i) + n\log g(\boldsymbol{\eta}) + \boldsymbol{\eta}^{\top}\sum_{i=1}^{n}\mathbf{u}(\mathbf{x}_i) \tag{5}$$
Now if we want to find the maximizer $\boldsymbol{\eta}_{\text{ML}}$, we can compute the derivative of Equation 5, set it equal to 0, and solve for $\boldsymbol{\eta}$. But first, note that $h(\mathbf{x})$ has no dependence upon $\boldsymbol{\eta}$ and will disappear in the derivative, so we want to solve
$$\nabla\, n\log g(\boldsymbol{\eta}) + \nabla\, \boldsymbol{\eta}^{\top}\sum_{i=1}^{n}\mathbf{u}(\mathbf{x}_i) = 0$$
Dividing both sides by n and moving one term to the other side of the equality, we get
$$-\nabla \log g(\boldsymbol{\eta}) = \nabla\, \boldsymbol{\eta}^{\top}\,\frac{1}{n}\sum_{i=1}^{n}\mathbf{u}(\mathbf{x}_i)$$
Finally, note that $\nabla\, \boldsymbol{\eta}^{\top}\mathbf{x} = \mathbf{x}$. This is because for the $i$-th component of the $p$-dimensional gradient,
$$\frac{\partial}{\partial \eta_i}\left(\eta_1 x_1 + \eta_2 x_2 + \dots + \eta_i x_i + \dots + \eta_p x_p\right) = x_i$$
This gives us
$$-\nabla \log g(\boldsymbol{\eta}_{\text{ML}}) = \nabla A(\boldsymbol{\eta}_{\text{ML}}) = \frac{1}{n}\sum_{i=1}^{n}\mathbf{u}(\mathbf{x}_i) \tag{6}$$
The point of this derivation is to demonstrate that the optimal natural parameters ηML only depend upon ∑u(xi). In other words, we do not need to store all the data, only ∑u(xi), which we can think of as a compact representation or summarization of our data. Importantly, because this sufficient statistic is a sum, we can compute it incrementally as the data arrives.
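To illustrate the incremental computation, here is a Python sketch for the univariate Gaussian: we stream through the data keeping only the running sums of $\mathbf{u}(x) = [x, x^2]^{\top}$, then recover the maximum likelihood estimates of $\mu$ and $\sigma$ at the end. The variable names are just for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
stream = rng.normal(loc=2.0, scale=3.0, size=100_000)   # data arriving one point at a time

n = 0
suff = np.zeros(2)                       # running sum of u(x) = [x, x^2]
for x in stream:
    n += 1
    suff += np.array([x, x * x])         # incremental update; the raw data can be discarded

mean_u  = suff / n                       # (1/n) sum_i u(x_i), the right side of Equation 6
mu_hat  = mean_u[0]                      # E[x]
var_hat = mean_u[1] - mean_u[0] ** 2     # E[x^2] - E[x]^2
print(mu_hat, np.sqrt(var_hat))          # approximately 2.0 and 3.0
```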
Moments through differentiation
The mean and variance of a probability distribution are defined using integration:
$$\mathbb{E}[X] = \int_{x} x f(x)\,\text{d}x \qquad \text{Var}(X) = \mathbb{E}\left[(X - \mathbb{E}[X])^2\right]$$
But in the exponential family, we can compute moments through differentiation (of $g(\boldsymbol{\eta})$), which is typically easier than integration. To show this, let’s compute the derivative of Equation 2. First, for ease of notation, let

$$f(\mathbf{x}, \boldsymbol{\eta}) = h(\mathbf{x})\exp\{\boldsymbol{\eta}^{\top}\mathbf{u}(\mathbf{x})\}$$

Differentiating Equation 2 with respect to $\boldsymbol{\eta}$ and applying the product rule gives

$$-\nabla g(\boldsymbol{\eta})\int f(\mathbf{x}, \boldsymbol{\eta})\,\text{d}\mathbf{x} = g(\boldsymbol{\eta})\int f(\mathbf{x}, \boldsymbol{\eta})\,\mathbf{u}(\mathbf{x})\,\text{d}\mathbf{x} \tag{7}$$

Note that, using Equation 1, the right side of Equation 7 is just $\int p(\mathbf{x} \mid \boldsymbol{\eta})\,\mathbf{u}(\mathbf{x})\,\text{d}\mathbf{x} = \mathbb{E}[\mathbf{u}(\mathbf{x})]$, and that, using Equation 2, the left side of Equation 7 can be rewritten as
$$-\nabla g(\boldsymbol{\eta})\int f(\mathbf{x}, \boldsymbol{\eta})\,\text{d}\mathbf{x} = -\nabla g(\boldsymbol{\eta})\,\frac{1}{g(\boldsymbol{\eta})} = -\nabla \log g(\boldsymbol{\eta})$$
Putting this together, we get
$$-\nabla \log g(\boldsymbol{\eta}) = \nabla A(\boldsymbol{\eta}) = \mathbb{E}[\mathbf{u}(\mathbf{x})]$$
In words, we can differentiate the log normalizer to compute the mean of the sufficient statistic. Furthermore, note that as $n \rightarrow \infty$, the right side of Equation 6 becomes $\mathbb{E}[\mathbf{u}(\mathbf{x})]$. This demonstrates that $\boldsymbol{\eta}_{\text{ML}}$ converges to the true parameter value of $\boldsymbol{\eta}$ in the limit, i.e. that our estimator is consistent.
Furthermore, if you take the second derivative (not shown here), it is easy to see that

$$-\nabla\nabla \log g(\boldsymbol{\eta}) = \nabla\nabla A(\boldsymbol{\eta}) = \text{cov}[\mathbf{u}(\mathbf{x})]$$

so the second derivative of the log normalizer gives the covariance of the sufficient statistic.
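Here is a small Python check of this idea for the Poisson, whose log normalizer is $A(\eta) = -\log g(\eta) = e^{\eta}$. Finite differences of $A$ should match the sample mean and variance of $u(x) = x$; the step size is an arbitrary choice for the sketch.

```python
import numpy as np

mu = 4.2
eta = np.log(mu)
A = lambda e: np.exp(e)       # Poisson log normalizer: A(eta) = -log g(eta) = exp(eta)

eps = 1e-5
dA  = (A(eta + eps) - A(eta - eps)) / (2 * eps)               # first derivative of A
d2A = (A(eta + eps) - 2 * A(eta) + A(eta - eps)) / eps ** 2   # second derivative of A

samples = np.random.default_rng(2).poisson(mu, size=1_000_000)
print(dA, samples.mean())     # both approximately 4.2, the mean of u(x) = x
print(d2A, samples.var())     # both approximately 4.2, the variance of u(x) = x
```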
Conjugate priors
Conjugacy is an important property in Bayesian inference. If you are unfamiliar with the term, please read my previous post first. Every exponential family member has a conjugate prior of the form
$$p(\boldsymbol{\eta} \mid \boldsymbol{\chi}, \nu) = f(\boldsymbol{\chi}, \nu)\, g(\boldsymbol{\eta})^{\nu}\exp\{\boldsymbol{\eta}^{\top}\boldsymbol{\chi}\}$$
where $\nu$ and $\boldsymbol{\chi}$ are hyperparameters and $f(\boldsymbol{\chi}, \nu)$ depends on the form of the exponential family member. To verify conjugacy, recall that we must show that the posterior has the same functional form as the prior. Let’s confirm that:

$$p(\boldsymbol{\eta} \mid \mathbf{X}, \boldsymbol{\chi}, \nu) \propto p(\mathbf{X} \mid \boldsymbol{\eta})\, p(\boldsymbol{\eta} \mid \boldsymbol{\chi}, \nu) = \left(\prod_{i=1}^{n} h(\mathbf{x}_i)\right) f(\boldsymbol{\chi}, \nu)\, g(\boldsymbol{\eta})^{n}\, g(\boldsymbol{\eta})^{\nu}\exp\left\{\boldsymbol{\eta}^{\top}\sum_{i=1}^{n}\mathbf{u}(\mathbf{x}_i)\right\}\exp\{\boldsymbol{\eta}^{\top}\boldsymbol{\chi}\}$$
Since the first n+1 terms are constant w.r.t. η, we can write
$$p(\boldsymbol{\eta} \mid \mathbf{X}, \boldsymbol{\chi}, \nu) \propto g(\boldsymbol{\eta})^{n + \nu}\exp\left\{\boldsymbol{\eta}^{\top}\left(\sum_{i=1}^{n}\mathbf{u}(\mathbf{x}_i) + \boldsymbol{\chi}\right)\right\}$$
And we’re done. Note that this is the same exponential family form as the prior, with parameters:
$$\begin{aligned}
\nu &= \nu_{\text{prior}} + n \\
\boldsymbol{\chi} &= \boldsymbol{\chi}_{\text{prior}} + \sum_{i=1}^{n}\mathbf{u}(\mathbf{x}_i)
\end{aligned}$$
A major benefit of the exponential family representation is that both the prior and posterior take the same form with respect to $\boldsymbol{\eta}$, and the difference is modeled entirely by the hyperparameters $\nu$ and $\boldsymbol{\chi}$, the latter of which is a function of the sufficient statistics of our data. This framework lends itself nicely to sequential learning. In this case, after observing each new data point $\mathbf{x}_{\text{new}}$, the updates are

$$\begin{aligned}
\nu &\leftarrow \nu + 1 \\
\boldsymbol{\chi} &\leftarrow \boldsymbol{\chi} + \mathbf{u}(\mathbf{x}_{\text{new}})
\end{aligned}$$
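As a concrete sketch of sequential learning, here is a Python example with Bernoulli data. With $u(x) = x$, the prior $p(\eta \mid \chi, \nu)$ corresponds (after a change of variables) to a $\text{Beta}(\chi, \nu - \chi)$ prior on $\mu$, so $\chi = 1, \nu = 2$ plays the role of a uniform prior and $\chi / \nu$ is the posterior mean of $\mu$; treat that correspondence as an assumption adopted for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.binomial(1, 0.7, size=1000)   # Bernoulli(0.7) observations arriving sequentially

nu, chi = 2.0, 1.0                       # prior hyperparameters (uniform Beta(1, 1) on mu)

for x in data:                           # one observation at a time
    nu  += 1                             # nu  <- nu + 1
    chi += x                             # chi <- chi + u(x), with u(x) = x

print(chi / nu)                          # posterior mean of mu, approximately 0.7
```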
Conclusion
Many common distributions are members of the exponential family, which has many convenient mathematical properties. Exponential family likelihoods allow for inference with incrementally computable sufficient statistics. These models are conjugate such that both the prior and posterior have the same functional form over hyperparameters ν and χ. The moments of the exponential family can be obtained by differentiating the normalizer g(η). And any efficient inference algorithms that work for the exponential family can be abstracted to work for a variety of common distributions. Furthermore, while we did not prove it in this post, we note that the natural parameter space is convex (Wainwright & Jordan, 2008).
References
Pitman, E. J. G. (1936). Sufficient statistics and intrinsic accuracy. Mathematical Proceedings of the Cambridge Philosophical Society, 32(4), 567–579.
Koopman, B. O. (1936). On distributions admitting a sufficient statistic. Transactions of the American Mathematical Society, 39(3), 399–409.
Darmois, G. (1935). Sur les lois de probabilité à estimation exhaustive. C. R. Acad. Sci. Paris, 260(1265), 85.
Wainwright, M. J., & Jordan, M. I. (2008). Graphical models, exponential families, and variational inference. Foundations and Trends® in Machine Learning, 1(1–2), 1–305.