From Entropy Search to Predictive Entropy Search

In Bayesian optimization, a popular acquisition function is predictive entropy search, which is a clever reframing of another acquisition function, entropy search. I rederive the connection and explain why this reframing is useful.

In Bayesian optimization, we want to maximize a black-box function $f$,

$$
\mathbf{x}_{\star} = \arg\!\max_{\mathbf{x}} f(\mathbf{x}), \tag{1}
$$

where $f$ is nonlinear in general and typically expensive to evaluate. The main idea is to place a Gaussian process (GP) prior on this function and then use the inferred GP posterior to compute an acquisition function, $\alpha(\mathbf{x})$, which decides where to evaluate $f$ next. See (Snoek et al., 2012) for a more thorough explanation. A popular acquisition function is predictive entropy search (Hernández-Lobato et al., 2014), and the goal of this post is to carefully derive the key trick used in that paper.
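To fix ideas, here is a minimal sketch of this loop in NumPy. Everything in it is illustrative rather than taken from the papers below: the squared-exponential kernel, the toy objective, the grid of candidates, and the upper-confidence-bound acquisition function standing in for PES are all assumptions of mine.

```python
import numpy as np

def kernel(A, B, length=0.5):
    # Squared-exponential kernel on 1-D inputs.
    d = A[:, None] - B[None, :]
    return np.exp(-0.5 * (d / length) ** 2)

def gp_posterior(X, y, Xs, noise=1e-4):
    # Standard GP posterior mean and marginal variance at test points Xs.
    K = kernel(X, X) + noise * np.eye(len(X))
    Ks = kernel(X, Xs)
    K_inv = np.linalg.inv(K)
    mean = Ks.T @ K_inv @ y
    cov = kernel(Xs, Xs) - Ks.T @ K_inv @ Ks
    return mean, np.maximum(np.diag(cov), 1e-12)

def f(x):
    # The black-box objective we pretend is expensive.
    return np.sin(3 * x) - x ** 2

rng = np.random.default_rng(0)
grid = np.linspace(-2, 2, 200)
X = rng.uniform(-2, 2, size=3)  # initial design
y = f(X)

for _ in range(10):
    mean, var = gp_posterior(X, y, grid)
    alpha = mean + 2.0 * np.sqrt(var)  # UCB as a stand-in acquisition
    x_next = grid[np.argmax(alpha)]    # evaluate f where alpha is largest
    X = np.append(X, x_next)
    y = np.append(y, f(x_next))

print("best x found:", X[np.argmax(y)])
```

The only piece that changes across acquisition functions is the line computing `alpha`; the rest of the loop is the same whether we use UCB, ES, or PES.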

In a classic paper, (MacKay, 1992) proposed an information theoretic approach to active data selection. His idea was to use an approximation of the model's posterior entropy to decide which data to select next. In words, select new data points that minimize our uncertainty about the model's parameters. Building on this idea, (Hennig & Schuler, 2012) proposed a method called entropy search (ES), an acquisition function that attempts to maximize the information gained about the location of the function's maximum from new data:

$$
\alpha_{\texttt{ES}}(\mathbf{x}) = \mathbb{H}[p(\mathbf{x}_{\star} \mid \mathcal{D})] - \mathbb{E}_{p(y \mid \mathcal{D}, \mathbf{x})}\left[\mathbb{H}\left[p(\mathbf{x}_{\star} \mid \mathcal{D} \cup \{\mathbf{x}, y\})\right]\right]. \tag{2}
$$

In words, the left term is the entropy of the posterior over the maximizer given our current data $\mathcal{D}$. For any new data point $\mathbf{x}$, this term is constant. The right term is the same posterior entropy, except under an expectation. This expectation is necessary because we must consider all possible values $y = f(\mathbf{x})$. So maximizing $\alpha_{\texttt{ES}}(\mathbf{x})$ means minimizing the expected posterior entropy after a new observation, $\{\mathbf{x}, y\}$.
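Neither entropy in Eq. 2 has a closed form under a GP. To make the first term concrete, a common workaround is to discretize the domain, jointly sample the function on the grid, and record where each sample attains its maximum; the histogram of argmaxes is a discretized approximation of $p(\mathbf{x}_{\star} \mid \mathcal{D})$. Here is a rough sketch of that Monte Carlo estimate, with toy data, grid, and sample sizes of my own choosing:

```python
import numpy as np

def kernel(A, B, length=0.5):
    # Squared-exponential kernel on 1-D inputs.
    d = A[:, None] - B[None, :]
    return np.exp(-0.5 * (d / length) ** 2)

# Observed data, standing in for D.
X = np.array([-1.5, 0.0, 1.0])
y = np.sin(3 * X) - X ** 2

grid = np.linspace(-2, 2, 200)
noise = 1e-4

# Full GP posterior over the grid: mean vector and covariance matrix.
K = kernel(X, X) + noise * np.eye(len(X))
Ks = kernel(X, grid)
K_inv = np.linalg.inv(K)
mean = Ks.T @ K_inv @ y
cov = kernel(grid, grid) - Ks.T @ K_inv @ Ks

# Sample functions jointly on the grid; each sample's argmax is one draw
# from a discretized p(x_star | D).
rng = np.random.default_rng(0)
jitter = 1e-8 * np.eye(len(grid))
samples = rng.multivariate_normal(mean, cov + jitter, size=2000)
argmax_idx = samples.argmax(axis=1)

# Empirical entropy of the maximizer's distribution over grid cells.
counts = np.bincount(argmax_idx, minlength=len(grid))
p = counts / counts.sum()
entropy = -np.sum(p[p > 0] * np.log(p[p > 0]))
print("estimated H[p(x_star | D)] (nats):", entropy)
```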

However, Eq. 2 is hard to compute. First, we must evaluate $p(\mathbf{x}_{\star} \mid \mathcal{D} \cup \{\mathbf{x}, y\})$ for all possible values of $y$ and for many values of $\mathbf{x}$; every candidate pair would require repeating an approximation like the one sketched above. Furthermore, the entropy terms themselves have no closed-form solutions, since we have no guarantee that the posterior over the maximizer has an easy analytical expression. To address these computational difficulties, (Hernández-Lobato et al., 2014) make the following observation:

$$
\begin{aligned}
\alpha_{\texttt{ES}}(\mathbf{x}) &= \mathbb{H}[p(\mathbf{x}_{\star} \mid \mathcal{D})] - \mathbb{E}_{p(y \mid \mathcal{D}, \mathbf{x})}\left[\mathbb{H}\left[p(\mathbf{x}_{\star} \mid \mathcal{D} \cup \{\mathbf{x}, y\})\right]\right] \\
&= \mathbb{H}[p(y \mid \mathcal{D}, \mathbf{x})] - \mathbb{E}_{p(\mathbf{x}_{\star} \mid \mathcal{D})}\left[\mathbb{H}\left[p(y \mid \mathcal{D}, \mathbf{x}, \mathbf{x}_{\star})\right]\right] \\
&= \alpha_{\texttt{PES}}(\mathbf{x}).
\end{aligned} \tag{3}
$$

They rewrite the ES acquisition function, $\alpha_{\texttt{ES}}(\mathbf{x})$, using the symmetry of mutual information, into the form of their so-called predictive entropy search (PES) acquisition function, $\alpha_{\texttt{PES}}(\mathbf{x})$. Before understanding why, let's rederive the trick. First, consider the mutual information between $\mathbf{x}_{\star}$ and a random variable $Y$. We can write this as

$$
\begin{aligned}
\text{MI}(\mathbf{x}_{\star}, Y) &= \mathbb{H}[\mathbf{x}_{\star}] - \mathbb{H}[\mathbf{x}_{\star} \mid Y] \\
&= \mathbb{H}[\mathbf{x}_{\star}] - \int p(y) \, \mathbb{H}[\mathbf{x}_{\star} \mid Y = y] \, \text{d}y \\
&= \mathbb{H}[\mathbf{x}_{\star}] - \mathbb{E}_{y}\left[ \mathbb{H}[\mathbf{x}_{\star} \mid Y = y] \right].
\end{aligned} \tag{4}
$$

The first step is just the definition of mutual information, and the second step is the definition of conditional entropy. The last line of Eq. 4 is essentially the first line of Eq. 3. If the distribution on $y$ is our posterior predictive, $p(y \mid \mathcal{D}, \mathbf{x})$, and the distribution on $\mathbf{x}_{\star}$ is $p(\mathbf{x}_{\star} \mid \mathcal{D})$, then we can rewrite the last line as

$$
\mathbb{H}[p(\mathbf{x}_{\star} \mid \mathcal{D})] - \mathbb{E}_{p(y \mid \mathcal{D}, \mathbf{x})}\left[ \mathbb{H}[p(\mathbf{x}_{\star} \mid y, \mathcal{D}, \mathbf{x})] \right]. \tag{5}
$$

However, mutual information is symmetric, and we can rewrite Eq. 4 as

$$
\begin{aligned}
\text{MI}(\mathbf{x}_{\star}, Y) &= \text{MI}(Y, \mathbf{x}_{\star}) \\
&= \mathbb{H}[Y] - \mathbb{H}[Y \mid \mathbf{x}_{\star}] \\
&= \mathbb{H}[Y] - \int p(x) \, \mathbb{H}[Y \mid \mathbf{x}_{\star} = x] \, \text{d}x \\
&= \mathbb{H}[Y] - \mathbb{E}_{\mathbf{x}_{\star}} \left[ \mathbb{H}[Y \mid \mathbf{x}_{\star}] \right].
\end{aligned} \tag{6}
$$

And this is essentially the middle line of Eq. 3. Again, the distribution on $\mathbf{x}_{\star}$ is $p(\mathbf{x}_{\star} \mid \mathcal{D})$, and the distribution on $y$ is the predictive, $p(y \mid \mathcal{D}, \mathbf{x})$. This gives us

$$
\mathbb{H}[p(y \mid \mathcal{D}, \mathbf{x})] - \mathbb{E}_{p(\mathbf{x}_{\star} \mid \mathcal{D})} \left[ \mathbb{H}[p(y \mid \mathcal{D}, \mathbf{x}, \mathbf{x}_{\star})] \right]. \tag{7}
$$
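Eqs. 5 and 7 are two decompositions of the same mutual information, so they are equal by construction. As a quick numerical sanity check, here is the identity verified on an arbitrary discrete joint distribution; the $4 \times 5$ table is a toy stand-in, nothing GP-specific:

```python
import numpy as np

rng = np.random.default_rng(1)

# A random joint distribution over two discrete variables, toy stand-ins
# for x_star (rows) and y (columns).
joint = rng.random((4, 5))
joint /= joint.sum()
p_a = joint.sum(axis=1)  # marginal of A
p_b = joint.sum(axis=0)  # marginal of B

def H(p):
    # Shannon entropy in nats, ignoring zero-probability entries.
    p = p[p > 0]
    return -np.sum(p * np.log(p))

# H[A] - E_b[H[A | B = b]], the decomposition in Eq. 4.
mi_ab = H(p_a) - sum(p_b[j] * H(joint[:, j] / p_b[j]) for j in range(joint.shape[1]))
# H[B] - E_a[H[B | A = a]], the decomposition in Eq. 6.
mi_ba = H(p_b) - sum(p_a[i] * H(joint[i, :] / p_a[i]) for i in range(joint.shape[0]))

print(np.isclose(mi_ab, mi_ba))  # True: mutual information is symmetric.
```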

So why is this a nice idea? Notice that the left entropy term is now with respect to the posterior predictive. By placing a Gaussian process prior on $f$, we induce a Gaussian assumption on $y$. This means the posterior predictive $p(y \mid \mathcal{D}, \mathbf{x})$ is Gaussian, and its entropy has a closed-form solution. The right term must still be approximated, but approximating one term is easier than approximating both.
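Concretely, the differential entropy of a Gaussian with variance $\sigma^2$ is $\frac{1}{2} \log(2 \pi e \sigma^2)$, so the first term of Eq. 7 reduces to a one-liner given the GP's predictive variance at $\mathbf{x}$. The noise variance below is an assumed hyperparameter:

```python
import numpy as np

def predictive_entropy(var, noise_var=1e-4):
    # Entropy of the Gaussian predictive p(y | D, x):
    # H = 0.5 * log(2 * pi * e * (sigma^2(x) + sigma_n^2)).
    return 0.5 * np.log(2 * np.pi * np.e * (var + noise_var))

# Example: predictive variances at three candidate points.
print(predictive_entropy(np.array([0.1, 0.5, 1.0])))
```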

Intuitively, we've flipped the problem. Rather than thinking about information gain with respect to a challenging distribution over the maximizer, where we must consider all possible values of $y$, we think about the information gain with respect to our data, considering all possible values of $\mathbf{x}_{\star}$. The latter may be easier because the predictive distributions are often nicer to work with.

Conclusion

Predictive entropy search is a simple and elegant idea: rewrite quantities that are hard to compute as quantities that are easier to compute. I suspect the method is so named because it frames the problem in terms of predictive distributions. The trick used in this paper was initially proposed in (Houlsby et al., 2012) and occurs elsewhere, such as in (Hernández-Lobato et al., 2015).

  1. Snoek, J., Larochelle, H., & Adams, R. P. (2012). Practical Bayesian optimization of machine learning algorithms. Advances in Neural Information Processing Systems, 2951–2959.
  2. Hernández-Lobato, J. M., Hoffman, M. W., & Ghahramani, Z. (2014). Predictive entropy search for efficient global optimization of black-box functions. Advances in Neural Information Processing Systems, 918–926.
  3. MacKay, D. J. C. (1992). Information-based objective functions for active data selection. Neural Computation, 4(4), 590–604.
  4. Hennig, P., & Schuler, C. J. (2012). Entropy search for information-efficient global optimization. The Journal of Machine Learning Research, 13(1), 1809–1837.
  5. Houlsby, N., Huszár, F., Ghahramani, Z., & Hernández-Lobato, J. M. (2012). Collaborative Gaussian processes for preference learning. Advances in Neural Information Processing Systems, 2096–2104.
  6. Hernández-Lobato, J. M., Gelbart, M., Hoffman, M., Adams, R., & Ghahramani, Z. (2015). Predictive entropy search for Bayesian optimization with unknown constraints. International Conference on Machine Learning, 1699–1707.