The KL Divergence: From Information to Density Estimation
The KL divergence, also known as "relative entropy", is a commonly used measure of dissimilarity between probability distributions and a standard objective in density estimation. I re-derive the relationships between probabilities, entropy, and relative entropy for quantifying similarity between distributions.
In statistics, the Kullback–Leibler (KL) divergence is a measure of how similar two probability distributions are. A standard formulation—and the one I encountered first—is the following. Given two probability distributions $p(x)$ and $q(x)$, the KL divergence is the integral

$$\text{KL}(p \,\|\, q) = \int p(x) \log \frac{p(x)}{q(x)} \, dx. \tag{1}$$
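To make this concrete, here is a minimal numerical sketch (Python with NumPy). The helper `kl_divergence` and the two discrete distributions are arbitrary toy examples of mine, not anything from the derivation; the sketch just evaluates Eq. 1 with the integral replaced by a sum.

```python
import numpy as np

def kl_divergence(p, q):
    """Discrete KL divergence: sum_x p(x) * log(p(x) / q(x))."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sum(p * np.log(p / q))

# Two arbitrary distributions over the same three outcomes.
p = np.array([0.4, 0.4, 0.2])
q = np.array([0.3, 0.3, 0.4])

print(kl_divergence(p, q))  # ~0.0915, a small positive number
print(kl_divergence(p, p))  # 0.0: identical distributions carry no relative information
```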
In this post, I want to show how Eq. 1 is both a measure of relative information entropy and a reasonable way to compare densities.
Relative information
Let us reconstruct the notion of information entropy (abbr. information) from first principles. The information received from a random variable taking a particular value can be viewed as a measure of our “surprise” about that value. A value with low information is not surprising; a value with high information is. Now imagine you didn’t know or couldn’t remember the equation for information. What would be a sensible formulation? First, it makes sense that information is a monotonic function of probability. Higher probability means strictly lower information. For example, rolling a die and getting an even number should be less surprising than rolling a die and getting a $6$.
It would also make sense that information should be additive for independent events. If I roll a die and flip a coin, my total information should be the sum of the information from the two events. This thought exercise is useful because, at least for me, it makes clear why the information about a random variable taking on a value $x$, denoted $I(x)$, is defined as it is:

$$I(x) = -\log p(x).$$
The negative sign is because higher probabilities result in less information. And the logarithm is a monotonically increasing function of probabilities that has the useful property that two independent random events have additive information:

$$I(x, y) = -\log p(x, y) = -\log \big( p(x)\, p(y) \big) = -\log p(x) - \log p(y) = I(x) + I(y).$$
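As a quick numerical sanity check (a sketch using the die-and-coin example above, with an `information` helper of my own naming), the additivity of information for independent events is easy to verify:

```python
import numpy as np

def information(p):
    """Information (surprise) of an event with probability p: I = -log(p)."""
    return -np.log(p)

p_even = 1 / 2   # rolling an even number on a fair die
p_heads = 1 / 2  # flipping heads on a fair coin

# Independence means the joint probability factorizes, so the information
# of the joint event is the sum of the individual informations.
print(information(p_even * p_heads))               # ~1.386
print(information(p_even) + information(p_heads))  # ~1.386, the same value
```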
What does that mean for the KL divergence? Let’s consider Eq. 1 again, but now write it in terms of information:

$$\text{KL}(p \,\|\, q) = \int p(x) \big[ \log p(x) - \log q(x) \big]\, dx = \int p(x) \big[ I_q(x) - I_p(x) \big]\, dx = \mathbb{E}_{p(x)} \big[ I_q(x) - I_p(x) \big],$$

where $I_p(x) = -\log p(x)$ and $I_q(x) = -\log q(x)$.
In other words, one interpretation of the KL divergence is that it captures the relative information or relative entropy between two distributions $p(x)$ and $q(x)$. Also note that the KL divergence is not symmetric, i.e. $\text{KL}(p \,\|\, q) \neq \text{KL}(q \,\|\, p)$ in general.
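A short, self-contained sketch makes the asymmetry concrete (it re-uses the same toy distributions and hypothetical `kl_divergence` helper as the earlier example):

```python
import numpy as np

def kl_divergence(p, q):
    return np.sum(p * np.log(p / q))

p = np.array([0.4, 0.4, 0.2])
q = np.array([0.3, 0.3, 0.4])

print(kl_divergence(p, q))  # ~0.0915
print(kl_divergence(q, p))  # ~0.1046, a different value, so KL is not symmetric
```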
At this point, it makes sense that the KL divergence might be a good way to understand how similar two distributions are. But why does minimizing the KL divergence between two densities—as in variational inference—guarantee that our optimization objective is performing density estimation? The answer relies on the convexity of the negative logarithm and Jensen’s inequality.
Nonnegativity of the KL divergence
First, the big picture. We want to use the notion of convexity to prove Jensen’s inequality. Jensen’s inequality will allow us to move the logarithm in Eq. 1 outside the integral. Since the integral of a density is $1$, the log of the integral is $0$. This will provide a lower bound on the KL divergence, or formally:

$$\text{KL}(p \,\|\, q) \geq 0,$$

with equality when $p = q$. With that in mind, let’s move forward.
A function $f$ is convex if the following holds:

$$f\big( \lambda x_1 + (1 - \lambda) x_2 \big) \leq \lambda f(x_1) + (1 - \lambda) f(x_2)$$
for all $\lambda \in [0, 1]$. This is a common formulation, and the reader can find numerous explanations and visualizations of it. Intuitively, it says that $f$ evaluated at any point between $x_1$ and $x_2$ inclusive lies on or below the chord connecting $f(x_1)$ and $f(x_2)$. Draw a few functions on a piece of paper and see which ones are convex.
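To see the definition in action, here is a small sketch that checks the convexity inequality for the negative logarithm (the convex function we will eventually need) at randomly chosen points and mixing weights:

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: -np.log(x)  # the negative logarithm, convex on (0, inf)

for _ in range(5):
    x1, x2 = rng.uniform(0.1, 10.0, size=2)  # two points in the domain
    lam = rng.uniform(0.0, 1.0)              # a mixing weight in [0, 1]
    lhs = f(lam * x1 + (1 - lam) * x2)       # f of the convex combination
    rhs = lam * f(x1) + (1 - lam) * f(x2)    # convex combination of f values
    print(lhs <= rhs + 1e-12)                # True for every draw
```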
An aside: proof of Jensen’s inequality
But at this point, I think many explanations of the KL divergence skip a step. They say something like, “And by Jensen’s inequality…” without proving Jensen’s inequality. Let’s actually do that. Jensen’s inequality is

$$f\left( \sum_{i=1}^{n} \lambda_i x_i \right) \leq \sum_{i=1}^{n} \lambda_i f(x_i),$$
where $\lambda_i \geq 0$ and $\sum_{i=1}^{n} \lambda_i = 1$. The proof is by induction. Let $f$ be a convex function. Now consider the base case ($n = 2$):

$$f(\lambda_1 x_1 + \lambda_2 x_2) \leq \lambda_1 f(x_1) + \lambda_2 f(x_2).$$
This clearly holds because $f$ is convex and $\lambda_1 + \lambda_2 = 1$. Now for the inductive case, we want to show that

$$f\left( \sum_{i=1}^{n+1} \lambda_i x_i \right) \leq \sum_{i=1}^{n+1} \lambda_i f(x_i).$$
First, let’s start with our inductive hypothesis and add $\lambda_{n+1} f(x_{n+1})$ to both sides:

$$f\left( \sum_{i=1}^{n} \lambda_i x_i \right) + \lambda_{n+1} f(x_{n+1}) \leq \sum_{i=1}^{n} \lambda_i f(x_i) + \lambda_{n+1} f(x_{n+1}).$$
Now the $\lambda_i$’s on the right-hand side no longer sum to $1$. Let’s normalize both sides of the equation by multiplying by $\frac{1}{1 + \lambda_{n+1}}$:

$$\underbrace{\frac{1}{1 + \lambda_{n+1}}}_{a} f\left( \sum_{i=1}^{n} \lambda_i x_i \right) + \underbrace{\frac{\lambda_{n+1}}{1 + \lambda_{n+1}}}_{b} f(x_{n+1}) \leq \frac{1}{1 + \lambda_{n+1}} \sum_{i=1}^{n} \lambda_i f(x_i) + \frac{\lambda_{n+1}}{1 + \lambda_{n+1}} f(x_{n+1}).$$
This normalization constant makes sense because $\sum_{i=1}^{n+1} \lambda_i = \sum_{i=1}^{n} \lambda_i + \lambda_{n+1} = 1 + \lambda_{n+1}$. Now note that the terms labeled $a$ and $b$ above sum to $1$. And since $f$ is convex, we can say

$$f\left( \frac{1}{1 + \lambda_{n+1}} \sum_{i=1}^{n} \lambda_i x_i + \frac{\lambda_{n+1}}{1 + \lambda_{n+1}} x_{n+1} \right) \leq \underbrace{\frac{1}{1 + \lambda_{n+1}}}_{a} f\left( \sum_{i=1}^{n} \lambda_i x_i \right) + \underbrace{\frac{\lambda_{n+1}}{1 + \lambda_{n+1}}}_{b} f(x_{n+1}).$$
At this point, we’re basically done. The left-hand side of the above inequality can be simplified to

$$f\left( \frac{1}{1 + \lambda_{n+1}} \sum_{i=1}^{n+1} \lambda_i x_i \right),$$
which, by chaining the two inequalities above, we have already shown is less than or equal to $\frac{1}{1 + \lambda_{n+1}} \sum_{i=1}^{n+1} \lambda_i f(x_i)$. Since the rescaled weights $\lambda_i / (1 + \lambda_{n+1})$ are nonnegative and sum to $1$, this is exactly Jensen’s inequality for $n + 1$ points, as desired.
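Before moving on, here is a quick numerical check of the finite form of Jensen’s inequality we just proved; it is only a sketch, again using the negative logarithm as the convex function, with random weights drawn from a Dirichlet distribution so that they are nonnegative and sum to one:

```python
import numpy as np

rng = np.random.default_rng(1)
f = lambda x: -np.log(x)  # convex on (0, inf)

n = 10
lam = rng.dirichlet(np.ones(n))     # weights: nonnegative, sum to 1
x = rng.uniform(0.1, 10.0, size=n)  # arbitrary points in the domain

lhs = f(np.sum(lam * x))            # f of the weighted average
rhs = np.sum(lam * f(x))            # weighted average of f
print(lhs <= rhs)                   # True
```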
Jensen’s inequality for distributions
Now consider this: since $\lambda_i \geq 0$ and $\sum_{i=1}^{n} \lambda_i = 1$, we can interpret $\lambda_i$ as the probability of our random variable taking on a specific value $x_i$, giving us

$$f\left( \sum_{i=1}^{n} p(x_i)\, x_i \right) \leq \sum_{i=1}^{n} p(x_i)\, f(x_i) \quad \Longleftrightarrow \quad f\big( \mathbb{E}[X] \big) \leq \mathbb{E}\big[ f(X) \big],$$
which for continuous densities is equivalent to

$$f\left( \int x\, p(x)\, dx \right) \leq \int f(x)\, p(x)\, dx, \quad \text{i.e.} \quad f\big( \mathbb{E}_{p}[X] \big) \leq \mathbb{E}_{p}\big[ f(X) \big].$$
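In this expectation form, the inequality is just as easy to check by Monte Carlo; the sketch below uses samples from an arbitrary exponential distribution (my choice, purely for illustration) and the convex function $-\log$:

```python
import numpy as np

rng = np.random.default_rng(2)
f = lambda x: -np.log(x)

# Samples from an arbitrary positive-valued distribution.
samples = rng.exponential(scale=2.0, size=100_000)

lhs = f(np.mean(samples))  # f(E[X]), estimated from the sample mean
rhs = np.mean(f(samples))  # E[f(X)], estimated by averaging f over samples
print(lhs <= rhs)          # True: f(E[X]) <= E[f(X)] for convex f
```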
Since the negative logarithm is a convex function, we can apply Jensen’s inequality to the KL divergence to prove a lower bound:

$$\begin{aligned}
\text{KL}(p \,\|\, q) &= \int p(x) \log \frac{p(x)}{q(x)}\, dx \\
&= -\int p(x) \log \frac{q(x)}{p(x)}\, dx \\
&\geq -\log \int p(x)\, \frac{q(x)}{p(x)}\, dx \\
&= -\log \int q(x)\, dx \\
&= -\log 1 \\
&= 0.
\end{aligned}$$
We first flip the fraction so that the $p(x)$ terms cancel, then apply Jensen’s inequality to the convex function $-\log$, and finally use the fact that $\int q(x)\, dx = 1$.
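Finally, a sketch that checks the nonnegativity result numerically: for many random pairs of discrete distributions, the KL divergence stays nonnegative, and it is zero when the two distributions are identical. (The helper and random draws are again my own illustrative choices.)

```python
import numpy as np

rng = np.random.default_rng(3)

def kl_divergence(p, q):
    return np.sum(p * np.log(p / q))

k = 5  # number of outcomes
divergences = []
for _ in range(10_000):
    p = rng.dirichlet(np.ones(k))  # a random distribution over k outcomes
    q = rng.dirichlet(np.ones(k))  # another random distribution
    divergences.append(kl_divergence(p, q))

print(min(divergences) >= 0.0)  # True: the KL divergence is never negative
print(kl_divergence(p, p))      # 0.0: equality when the distributions match
```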
Conclusion
In summary, we used convexity to prove Jensen’s inequality, and Jensen’s inequality to prove that the KL divergence is always nonnegative. If we minimize the KL divergence between two densities, we are minimizing the relative information between the two distributions.