The mutual information (MI) between two random variables captures how much information, in the sense of reduced entropy, we obtain about one random variable by observing the other. Since this definition does not specify which variable is the one observed, we might suspect that MI is symmetric. In fact, it is, and I claimed as much without proof in my previous post. The goal of this post is to show why the definition is indeed symmetric; the proof will also highlight a useful interpretation of MI.
Let $X$ and $Y$ be continuous random variables with densities $p_X(x)$ and $p_Y(y)$ respectively. The MI of $X$ and $Y$ is
$$
\begin{aligned}
\text{MI}(X, Y)
&= H[X] - \mathbb{E}_{Y}\big[ H[X \mid Y = y] \big]
\\
&= -\int_x p_X(x) \ln p_X(x) \,\text{d}x + \int_y p_Y(y) \int_x p_{X \mid Y}(x \mid y) \ln p_{X \mid Y}(x \mid y) \,\text{d}x \,\text{d}y
\\
&= -\int_y \int_x p_{X,Y}(x, y) \ln p_X(x) \,\text{d}x \,\text{d}y + \int_y p_Y(y) \int_x p_{X \mid Y}(x \mid y) \ln p_{X \mid Y}(x \mid y) \,\text{d}x \,\text{d}y
\\
&= -\int_y \int_x p_{X,Y}(x, y) \ln p_X(x) \,\text{d}x \,\text{d}y + \int_y \int_x p_{X,Y}(x, y) \ln p_{X \mid Y}(x \mid y) \,\text{d}x \,\text{d}y
\\
&= \int_y \int_x p_{X,Y}(x, y) \big( \ln p_{X \mid Y}(x \mid y) - \ln p_X(x) \big) \,\text{d}x \,\text{d}y
\\
&= \int_y \int_x p_{X,Y}(x, y) \ln \left[ \frac{p_{X \mid Y}(x \mid y)}{p_X(x)} \right] \text{d}x \,\text{d}y
\\
&= \int_y \int_x p_{X,Y}(x, y) \ln \left[ \frac{p_{X \mid Y}(x \mid y)}{p_X(x)} \times \frac{p_Y(y)}{p_Y(y)} \right] \text{d}x \,\text{d}y
\\
&= \int_y \int_x p_{X,Y}(x, y) \ln \left[ \frac{p_{X,Y}(x, y)}{p_X(x)\, p_Y(y)} \right] \text{d}x \,\text{d}y
\\
&= \text{KL}\big[ p_{X,Y} \,\big\|\, p_X \otimes p_Y \big].
\end{aligned} \tag{1}
$$
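As a sanity check on Eq. 1, here is a minimal Monte Carlo sketch in Python (the correlation, sample size, and seed are arbitrary choices of mine): for a bivariate Gaussian, we can average the log-ratio $\ln[p_{X,Y}(x,y) / (p_X(x)\, p_Y(y))]$ over joint samples and compare against the known closed form $\text{MI} = -\frac{1}{2}\ln(1 - \rho^2)$.

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

# Bivariate Gaussian with unit variances and correlation rho; its marginals
# are both standard normal.
rho = 0.7
joint = multivariate_normal(mean=[0.0, 0.0], cov=[[1.0, rho], [rho, 1.0]])

rng = np.random.default_rng(0)
samples = joint.rvs(size=100_000, random_state=rng)
x, y = samples[:, 0], samples[:, 1]

# Monte Carlo estimate of E[ln p_{X,Y}(x, y) - ln(p_X(x) p_Y(y))], i.e. the
# KL divergence in the last line of Eq. 1.
log_ratio = joint.logpdf(samples) - norm.logpdf(x) - norm.logpdf(y)
mi_mc = log_ratio.mean()

# Closed form for a bivariate Gaussian.
mi_exact = -0.5 * np.log(1 - rho**2)
print(mi_mc, mi_exact)  # Both approximately 0.337 nats.
```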
In the last line of Eq. 1, “KL” refers to the KL divergence. Since the KL divergence is non-negative, mutual information is non-negative as well. Furthermore, the KL divergence is zero if and only if its two arguments are the same distribution, so the MI is zero if and only if $p_{X,Y} = p_X \otimes p_Y$, i.e. if and only if $X$ and $Y$ are independent. This makes intuitive sense: if two random variables are independent, observing one tells you nothing about the other.
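To make the independence claim concrete, here is a small sketch for discrete variables (the joint tables and the helper’s name are my own illustration): the MI of a joint probability table is positive when the table does not factorize, and exactly zero when it is the outer product of its marginals.

```python
import numpy as np

def mutual_information(p_xy):
    """MI of a discrete joint table, computed as KL[p_XY || p_X (x) p_Y]."""
    p_x = p_xy.sum(axis=1, keepdims=True)  # Marginal over rows (x).
    p_y = p_xy.sum(axis=0, keepdims=True)  # Marginal over columns (y).
    mask = p_xy > 0                        # Convention: 0 ln 0 = 0.
    return np.sum(p_xy[mask] * np.log(p_xy[mask] / (p_x * p_y)[mask]))

dependent   = np.array([[0.4, 0.1],
                        [0.1, 0.4]])
independent = np.outer([0.5, 0.5], [0.5, 0.5])

print(mutual_information(dependent))    # ~0.193 nats, strictly positive.
print(mutual_information(independent))  # Exactly 0.0.
```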
Finally, it’s easy to see that we can repeat the calculation with the roles of $X$ and $Y$ swapped to get the mutual information of $Y$ and $X$:
$$
\begin{aligned}
\text{MI}(Y, X)
&= H[Y] - \mathbb{E}_{X}\big[ H[Y \mid X = x] \big]
\\
&= -\int_y p_Y(y) \ln p_Y(y) \,\text{d}y + \int_x p_X(x) \int_y p_{Y \mid X}(y \mid x) \ln p_{Y \mid X}(y \mid x) \,\text{d}y \,\text{d}x
\\
&= -\int_x \int_y p_{X,Y}(x, y) \ln p_Y(y) \,\text{d}y \,\text{d}x + \int_x p_X(x) \int_y p_{Y \mid X}(y \mid x) \ln p_{Y \mid X}(y \mid x) \,\text{d}y \,\text{d}x
\\
&= -\int_x \int_y p_{X,Y}(x, y) \ln p_Y(y) \,\text{d}y \,\text{d}x + \int_x \int_y p_{X,Y}(x, y) \ln p_{Y \mid X}(y \mid x) \,\text{d}y \,\text{d}x
\\
&= \int_x \int_y p_{X,Y}(x, y) \big( \ln p_{Y \mid X}(y \mid x) - \ln p_Y(y) \big) \,\text{d}y \,\text{d}x
\\
&= \int_x \int_y p_{X,Y}(x, y) \ln \left[ \frac{p_{Y \mid X}(y \mid x)}{p_Y(y)} \right] \text{d}y \,\text{d}x
\\
&= \int_x \int_y p_{X,Y}(x, y) \ln \left[ \frac{p_{Y \mid X}(y \mid x)}{p_Y(y)} \times \frac{p_X(x)}{p_X(x)} \right] \text{d}y \,\text{d}x
\\
&= \int_x \int_y p_{X,Y}(x, y) \ln \left[ \frac{p_{X,Y}(x, y)}{p_X(x)\, p_Y(y)} \right] \text{d}y \,\text{d}x
\\
&= \text{KL}\big[ p_{X,Y} \,\big\|\, p_X \otimes p_Y \big].
\end{aligned} \tag{2}
$$
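We can also verify this symmetry numerically. The sketch below (a random joint table and helper names of my own choosing) computes $H[X] - \mathbb{E}_Y[H[X \mid Y = y]]$ directly from a discrete joint table, then repeats the computation on the transposed table; the two results agree up to floating-point error.

```python
import numpy as np

def entropy(p):
    p = p[p > 0]  # Convention: 0 ln 0 = 0.
    return -np.sum(p * np.log(p))

def mi_via_conditioning(p_xy):
    """H[X] - E_Y[H[X | Y = y]] for a joint table with rows indexed by x."""
    p_x = p_xy.sum(axis=1)
    p_y = p_xy.sum(axis=0)
    exp_cond_entropy = sum(
        p_y[j] * entropy(p_xy[:, j] / p_y[j]) for j in range(p_xy.shape[1])
    )
    return entropy(p_x) - exp_cond_entropy

rng = np.random.default_rng(0)
p_xy = rng.random((3, 4))
p_xy /= p_xy.sum()  # Normalize into a valid joint distribution.

print(mi_via_conditioning(p_xy))    # MI(X, Y).
print(mi_via_conditioning(p_xy.T))  # MI(Y, X): the same number.
```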
It’s easy to see that these derivations still hold if $X$ and $Y$ are both discrete: just replace the integrals with sums. The tricky case is when $X$ is discrete and $Y$ is continuous. Certainly, the derivations still work if we can interchange the integrals and sums, which is true for finite sums. However, when the sums are infinite, we are effectively interchanging limits and integration. I don’t know enough measure theory to say when this is possible, but my very loose understanding is that a main value of Lebesgue integration is that it is easier to know when such regularity conditions hold.
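That said, when the discrete variable takes finitely many values, the interchange is unproblematic, and estimating MI in the mixed case is straightforward. In the sketch below (all distributional choices are arbitrary illustrations of mine), $X$ is Bernoulli and $Y \mid X = k$ is Gaussian; since $p_{X,Y}/(p_X\, p_Y) = p_{Y \mid X}/p_Y$, averaging that log-ratio over joint samples estimates the MI.

```python
import numpy as np
from scipy.stats import norm

# X ~ Bernoulli(1/2) (discrete), Y | X = k ~ N(mu_k, 1) (continuous).
mus = np.array([-1.0, 1.0])
rng = np.random.default_rng(0)
x = rng.integers(0, 2, size=100_000)
y = rng.normal(loc=mus[x], scale=1.0)

# MI = E[ln p_{Y|X}(y | x) - ln p_Y(y)], where p_Y is a two-component
# Gaussian mixture (the marginal of Y).
log_cond = norm.logpdf(y, loc=mus[x])
log_marg = np.log(0.5 * norm.pdf(y, loc=mus[0]) + 0.5 * norm.pdf(y, loc=mus[1]))
print((log_cond - log_marg).mean())  # MI(X, Y) in nats.
```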