A Python Demonstration that Mutual Information Is Symmetric

I provide a numerical demonstration that the mutual information of two random variables, the observations and latent variables in a Gaussian mixture model, is symmetric.

In a previous post, I showed mathematically that mutual information (MI) is a symmetric quantity. In this post, I want to work through a complete example of computing MI: a Gaussian mixture model with observations $\mathbf{x}$ and latent variables $\mathbf{z}$. I’ll show that $\text{MI}(\mathbf{x}, \mathbf{z}) \approx \text{MI}(\mathbf{z}, \mathbf{x})$ using quadrature.

Gaussian mixture model

Recall that the probabilistic model for a mixture of Gaussians is

$$ p(\mathbf{x}) = \sum_{k=1}^{K} \pi_k \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k), \tag{1} $$

where $\sum_{k=1}^K \pi_k = 1$ are mixture weights, and $\{\boldsymbol{\mu}_k\}_{k=1}^{K}$ and $\{\boldsymbol{\Sigma}_k\}_{k=1}^{K}$ are the means and covariances respectively. In words, our data are modeled as a linear superposition of Gaussian distributions. See (Bishop, 2006) for details. To index mixture components, we introduce a one-hot $K$-vector $\mathbf{z}$ such that

$$ \begin{aligned} p(\mathbf{z}) &= \text{categorical}([\pi_1, \dots, \pi_K]), \\ p(\mathbf{x} \mid \mathbf{z}) &= \prod_{k=1}^{K} \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_{k}, \boldsymbol{\Sigma}_{k})^{z_{k}}. \end{aligned} \tag{2} $$

The goal of this post is to empirically demonstrate that the mutual information between $\mathbf{x}$ and $\mathbf{z}$ is symmetric, i.e. that Eqs. 3 and 4 are equal,

$$ \text{MI}(\mathbf{x}, \mathbf{z}) = \mathbb{H}[p(\mathbf{x} \mid \boldsymbol{\theta})] - \textcolor{#bc2612}{\mathbb{E}_{\mathbf{z}}\left[\mathbb{H}[p(\mathbf{x} \mid \mathbf{z}, \boldsymbol{\theta})]\right]}, \tag{3} $$

$$ \text{MI}(\mathbf{z}, \mathbf{x}) = \textcolor{#bc2612}{\mathbb{H}[p(\mathbf{z} \mid \boldsymbol{\theta})]} - \mathbb{E}_{\mathbf{x}}\left[\mathbb{H}[p(\mathbf{z} \mid \mathbf{x}, \boldsymbol{\theta})]\right], \tag{4} $$

where $\boldsymbol{\theta} = \{\pi_1, \dots, \pi_K, \boldsymbol{\mu}_1, \dots, \boldsymbol{\mu}_K, \boldsymbol{\Sigma}_1, \dots, \boldsymbol{\Sigma}_K\}$. Notice that we can compute the red terms exactly since $\mathbf{z}$ is a discrete random variable (so expectations are sums) and because we have a closed-form solution to the entropy of the Gaussian $p(\mathbf{x} \mid \mathbf{z}, \boldsymbol{\theta})$. However, we have to approximate the other terms, which we will do using numerical quadrature.
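
For reference, the red term in Eq. 3 relies on the closed-form entropy of a Gaussian,

$$ \mathbb{H}[\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma})] = \tfrac{1}{2} \log \det\left(2 \pi e \boldsymbol{\Sigma}\right), $$

which in the univariate case reduces to $\tfrac{1}{2} \log(2 \pi e \sigma^2)$.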

Let’s approximate Eq. 3 and then Eq. 4 and show that we get the same result.

Approximating $\text{MI}(\mathbf{x}, \mathbf{z})$

Before we get started, let’s load some data and fit our mixture model. We have to fit a model because we’re conditioning on $\boldsymbol{\theta}$:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.mixture import GaussianMixture

X, _ = load_iris(return_X_y=True)
X    = X[:, 0][:, None]  # Let's stick to 1-dimensional data.
K    = 3
gmm  = GaussianMixture(n_components=K)
gmm  = gmm.fit(X)

# Flatten the fitted parameters into 1-D arrays of per-component values.
mus  = gmm.means_.flatten()
sigs = np.sqrt(gmm.covariances_).flatten()
pis  = gmm.weights_
Z    = gmm.predict(X)

To compute the right-hand (red) term of Eq. 3,

$$ \mathbb{E}_{\mathbf{z}}\left[\mathbb{H}[p(\mathbf{x} \mid \mathbf{z}, \boldsymbol{\theta})]\right] = \sum_{k=1}^{K} p(\mathbf{z} = k \mid \boldsymbol{\theta}) \, \mathbb{H}[p(\mathbf{x} \mid \boldsymbol{\theta}, \mathbf{z} = k)], \tag{5} $$

we simply sum over the entropies of each Gaussian likelihood, weighted by the mixture weights. Notice that $p(\mathbf{z} = k \mid \boldsymbol{\theta})$ is just the inferred $\pi_k$, and the entropy of $p(\mathbf{x} \mid \boldsymbol{\theta}, \mathbf{z} = k)$ is just the entropy of the Gaussian $\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$:

from scipy.stats import norm

def rhs3_exact():
    # Eq. 5: expected entropy of p(x | z, theta), computed exactly.
    return np.sum([
        pis[k] * norm(mus[k], sigs[k]).entropy()
        for k in range(K)
    ])

However, we need to approximate the left-hand term of Eq. 3, the marginal entropy $\mathbb{H}[p(\mathbf{x} \mid \boldsymbol{\theta})]$, which we’ll do using quadrature. Notice that the marginal distribution $p(\mathbf{x} \mid \boldsymbol{\theta})$ is just Eq. 1. We then plug this value into the formula for differential entropy. In code, this is:

from scipy.integrate import quad

def lhs3_quadrature():
    # Entropy of the marginal p(x | theta), approximated via quadrature.

    def marginal(x):
        # Eq. 1: p(x | theta) = sum_k pi_k * N(x | mu_k, sig_k).
        return np.sum([
            pis[k] * norm(mus[k], sigs[k]).pdf(x)
            for k in range(K)
        ])

    def marginal_entropy(x):
        px = marginal(x)
        return -1 * px * np.log(px + 1e-9)

    # Integrate over every possible value of x.
    lhs, _ = quad(marginal_entropy, -np.inf, np.inf)

    return lhs

The added 1e-9 in the log calculation is for numerical stability when px is very small; it prevents taking the log of zero far out in the tails of the marginal.
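
As a rough cross-check on the quadrature, we can also estimate this marginal entropy by Monte Carlo: sample from the fitted mixture and average the negative log-density. Below is a minimal sketch using sklearn’s sample and score_samples methods (the helper name lhs3_monte_carlo is mine, just for illustration):

def lhs3_monte_carlo(n_samples=100_000):
    # H[p(x | theta)] ~= -(1/N) * sum_i log p(x_i | theta), with x_i ~ p(x | theta).
    samples, _ = gmm.sample(n_samples)
    return -np.mean(gmm.score_samples(samples))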

Approximating $\text{MI}(\mathbf{z}, \mathbf{x})$

For Eq. 4, the left-hand (red) term is easy; it is the entropy of a categorical random variable:

from scipy.stats import multinomial

def lhs4_exact():
    # Entropy of p(z | theta); a multinomial with n=1 is a categorical.
    return multinomial(1, pis).entropy()
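
Equivalently, the entropy of a categorical distribution with weights $\pi_1, \dots, \pi_K$ is $-\sum_k \pi_k \log \pi_k$, so a one-line check (my own, not needed for the main computation) that scipy agrees is:

assert np.isclose(lhs4_exact(), -np.sum(pis * np.log(pis)))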

Finally, we just have to compute the right-hand term of Eq. 4, the expected entropy of the posterior over $\mathbf{z}$. Notice that we can easily compute the posterior of $\mathbf{z}$ for a single observation $\mathbf{x}$:

$$ p(\mathbf{z} \mid \mathbf{x}) = \frac{p(\mathbf{x}, \mathbf{z})}{\sum_{k=1}^{K} p(\mathbf{x}, \mathbf{z} = k)}, \qquad p(\mathbf{x}, \mathbf{z}) = \prod_{k=1}^{K} \left[ \pi_k \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \right]^{z_k}. \tag{6} $$

The only tricky part is that, when performing quadrature, values of $\mathbf{x}$ far out in the tails will result in very small probabilities that do not normalize properly. Instead, let’s work in log space and then use the log-sum-exp trick to normalize the posterior:

from scipy.special import logsumexp

def rhs4_quadrature():
    # Expected entropy of p(z | x, theta), approximated via quadrature.

    def marginal(x):
        # Eq. 1: p(x | theta).
        return np.sum([
            pis[k] * norm(mus[k], sigs[k]).pdf(x)
            for k in range(K)
        ])

    def z_posterior(x):
        # Eq. 6: p(z | x, theta), computed in log space.
        log_p = np.empty(K)
        for k in range(K):
            log_p[k] = np.log(pis[k]) + norm(mus[k], sigs[k]).logpdf(x)
        # Log-sum-exp trick.
        z_post = np.exp(log_p - logsumexp(log_p))
        assert np.isclose(z_post.sum(), 1)
        return z_post

    def exp_entropy_z(x):
        # Integrand: p(x | theta) * H[p(z | x, theta)].
        mx = marginal(x)
        zp = z_posterior(x)
        return -1 * mx * np.sum(zp * np.log(zp + 1e-9))

    # Integrate over every possible value of x.
    rhs, _ = quad(exp_entropy_z, -np.inf, np.inf)
    return rhs
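
As a sanity check on the posterior computation, note that sklearn computes these responsibilities as well, via GaussianMixture.predict_proba. A minimal comparison (the names x0 and manual are mine) might look like:

x0     = X[0, 0]  # An arbitrary observation (a scalar).
log_p  = np.log(pis) + norm(mus, sigs).logpdf(x0)
manual = np.exp(log_p - logsumexp(log_p))
assert np.allclose(manual, gmm.predict_proba([[x0]])[0])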

Putting everything together, we see that the two approximations produce very similar mutual information estimates. On a single run on my machine, I get:

print(lhs3_quadrature() - rhs3_exact())  # 0.969247379948003
print(lhs4_exact() - rhs4_quadrature())  # 0.9692473862194966
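
For a third, quadrature-free estimate, we can also approximate the expectation in Eq. 4 by Monte Carlo: sample from the fitted mixture and average the posterior entropies. A rough sketch using sklearn’s sample and predict_proba:

samples, _ = gmm.sample(100_000)
resp       = gmm.predict_proba(samples)
post_ent   = -np.sum(resp * np.log(resp + 1e-9), axis=1)
print(lhs4_exact() - post_ent.mean())  # Should land near the estimates above.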

Conclusion

The symmetry of mutual information is nice because it may be easier to compute one formulation than the other. In this example, we might prefer to integrate over the predictive distribution in Eq. 3 or the posterior in Eq. 4. This is the main idea behind predictive entropy search (Hernández-Lobato et al., 2014), which I’ve discussed previously.

  1. Bishop, C. M. (2006). Pattern Recognition and Machine Learning.
  2. Hernández-Lobato, J. M., Hoffman, M. W., & Ghahramani, Z. (2014). Predictive entropy search for efficient global optimization of black-box functions. Advances in Neural Information Processing Systems, 918–926.