Modeling Repulsion with Determinantal Point Processes

Determinantal point processes are point processes characterized by the determinant of a positive semi-definite matrix, but what this means is not necessarily obvious. I explain how such a process can model repulsive systems.

A point process $\mathcal{P}$ over a set $\mathcal{X}$ is a probability measure over random subsets $X$ of $\mathcal{X}$. For example, in a Bernoulli point process with parameter $0 \leq p \leq 1$, each element $x \in \mathcal{X}$ is included in $X$ independently with probability $p$. A determinantal point process (DPP) is a point process which is characterized by the determinant of a positive semi-definite matrix or kernel. The consequence of this is that DPPs are good for modeling repulsion, but why this follows from their definition may not be obvious. I want to explain my understanding of how they can be used as probabilistic models of repulsion. For simplicity, I only consider discrete sets.
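
To make the Bernoulli example concrete, here is a minimal sketch in Python/NumPy (the helper name `sample_bernoulli_pp` is just for illustration) that draws one realization: each element of the ground set is kept independently with probability $p$.

```python
import numpy as np

def sample_bernoulli_pp(ground_set, p, seed=None):
    """Sample one realization X of a Bernoulli point process over `ground_set`:
    each element is included independently with probability p."""
    rng = np.random.default_rng(seed)
    mask = rng.random(len(ground_set)) < p
    return [x for x, keep in zip(ground_set, mask) if keep]

# Example: a ground set of five items.
print(sample_bernoulli_pp(['a', 'b', 'c', 'd', 'e'], p=0.5, seed=0))
```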

It’s worth mentioning that while many point processes are determinantal, it is not well-understood why they are common (Tao, 2009), and they have recently received renewed attention in the mathematics and machine learning communities (Kulesza et al., 2012).

Determinantal point processes

A determinantal point process is a point process such that, if $X$ is a random subset drawn according to $\mathcal{P}$, then for every subset $A \subseteq \mathcal{X}$ we have:

$$
\mathcal{P}(A \subseteq X) = \det(K_A) \tag{1}
$$

where $K$ is a positive semi-definite (real, symmetric) matrix, called a kernel, indexed by the elements of $\mathcal{X}$, and

$$
K_A \equiv [K_{ij}]_{i,j \in A}
$$

which means that $K_A$ is the submatrix of $K$ whose rows and columns are indexed by the elements of $A$. (We haven’t specified the actual values in $K$.) There is a lot of notation to unpack here. First, both $X$ and $A$ are subsets of $\mathcal{X}$. But we say that $X$ is a random subset sampled according to $\mathcal{P}$. Thus, we call $X$ a realization of our point process, meaning it was sampled according to the distribution of $\mathcal{P}$, while $A$ is just any subset of $\mathcal{X}$. And for $\mathcal{P}$ to be a determinantal point process, it must hold that for every possible subset of points $A$, the probability of that set being contained in $X$ is equal to the determinant of the corresponding submatrix of some real, symmetric matrix. If you’re like me, what this means is not obvious, but it becomes clearer once we work through the math.
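
Before working through the math by hand, it may help to see Equation 1 in code. Here is a minimal sketch (Python/NumPy, with an arbitrary toy kernel of my own choosing) that computes $\mathcal{P}(A \subseteq X)$ for a few subsets $A$ by taking determinants of submatrices of $K$:

```python
import numpy as np

# A toy symmetric kernel over a ground set of three items {0, 1, 2}.
# Its eigenvalues lie in [0, 1], so it is a valid DPP marginal kernel.
K = np.array([[0.50, 0.25, 0.10],
              [0.25, 0.50, 0.25],
              [0.10, 0.25, 0.50]])

def inclusion_prob(K, A):
    """P(A subset of X) = det(K_A), the determinant of the submatrix indexed by A."""
    A = list(A)
    return np.linalg.det(K[np.ix_(A, A)])

print(inclusion_prob(K, [0]))        # marginal of item 0: K[0, 0] = 0.5
print(inclusion_prob(K, [0, 1]))     # 0.5 * 0.5 - 0.25**2 = 0.1875
print(inclusion_prob(K, [0, 1, 2]))  # probability all three items co-occur (0.07)
```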

Let’s start with the simplest case: what happens when $A = \{x_i\}$? Then we have:

$$
\mathcal{P}(\{x_i\} \subseteq X) = K_{ii}
$$

In words, the diagonal of the matrix $K$ represents the marginal probability that $x_i$ is included in any realization $X$ of our determinantal point process $\mathcal{P}$. For example, if $K_{ii}$ is close to $1$, that means that any realization from the DPP will almost certainly include $x_i$. Now let’s look at a two-element set, $A = \{x_i, x_j\}$:

$$
\begin{aligned}
\mathcal{P}(\{x_i, x_j\} \subseteq X)
&= \begin{vmatrix} K_{ii} & K_{ij} \\ K_{ji} & K_{jj} \end{vmatrix} \\
&= K_{ii} K_{jj} - K_{ij} K_{ji} \\
&= \mathcal{P}(\{x_i\} \subseteq X)\, \mathcal{P}(\{x_j\} \subseteq X) - K_{ij}^2
\end{aligned} \tag{2}
$$

where we can write $K_{ij} K_{ji} = K_{ij}^2$ because $K$ is symmetric. So the off-diagonal elements loosely represent negative correlation, in the sense that the larger the value of $K_{ij}$, the smaller the joint probability $\mathcal{P}(\{x_i, x_j\} \subseteq X)$.

The above example, from (Kulesza et al., 2012), is useful in two dimensions, but if you’re like me, you may find it hard to reason about how the probabilities change with higher-dimensional kernels. For example, this is the probability for three elements (dropping the set notation for legibility):

$$
\mathcal{P}(x_i, x_j, x_k) = \mathcal{P}(x_i) \mathcal{P}(x_j) \mathcal{P}(x_k) - \mathcal{P}(x_i) K_{jk}^2 - \mathcal{P}(x_j) K_{ik}^2 - \mathcal{P}(x_k) K_{ij}^2 + 2 K_{ij} K_{jk} K_{ki}
$$

This is harder to reason about. I might note that, with all the other values fixed, increasing $K_{ij}$ grows the subtracted term $\mathcal{P}(x_k) K_{ij}^2$ quadratically while the added term $2 K_{ij} K_{jk} K_{ki}$ grows only linearly, so for large enough $K_{ij}$ the probability decreases. But I find this kind of thinking about interactions slippery. I think an easier way to understand the determinantal point process is to use a geometrical understanding of the determinant while remembering that $K$ is symmetric.
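
Before turning to geometry, here is a quick numerical check (Python/NumPy, with arbitrary toy values of my own choosing) that the three-element expansion above matches `np.linalg.det`, and that the determinant does shrink as one off-diagonal entry grows with everything else held fixed:

```python
import numpy as np

def det3_expansion(K):
    """Three-element expansion of det(K) for a symmetric 3x3 kernel."""
    (Kii, Kij, Kik), (_, Kjj, Kjk), (_, _, Kkk) = K
    return (Kii * Kjj * Kkk
            - Kii * Kjk**2 - Kjj * Kik**2 - Kkk * Kij**2
            + 2 * Kij * Kjk * Kik)

for k_ij in [0.0, 0.2, 0.4]:
    K = np.array([[0.5, k_ij, 0.1],
                  [k_ij, 0.5, 0.1],
                  [0.1,  0.1, 0.5]])
    # The expansion agrees with the direct determinant,
    # and the determinant shrinks as K_ij increases.
    print(k_ij, np.linalg.det(K), det3_expansion(K))
```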

Geometry of the determinant

Consider the geometrical intuition for the determinant of an $n \times n$ matrix: it is the signed scale factor representing how much the matrix changes volume, mapping the unit $n$-cube to an $n$-dimensional parallelepiped. If that statement did not make sense, please read my previous post on a geometrical understanding of matrices. With this idea in mind, let’s consider the determinant for a simple matrix $M$:

$$
M = \begin{bmatrix} 1 & \beta \\ 0 & 1 \end{bmatrix}
$$

If we think of $M$ as a geometric transformation, then the columns of $M$ indicate where the standard basis vectors in $\mathbb{R}^2$ land in the transformed space. We can visualize this transformation by imagining how a unit square is transformed by $M$ (Figure 1).

Figure 1: (Left) The unit square in $\mathbb{R}^2$ defined by the standard basis vectors $\mathbf{e}_1 = [1, 0]^{\top}$ and $\mathbf{e}_2 = [0, 1]^{\top}$. (Right) The unit square after being sheared by the matrix $M = \begin{bmatrix}1 & \beta \\ 0 & 1\end{bmatrix}$. The determinant of both matrices is $1$.

In this case, the determinant of $M$ is still $1$ because the area of the parallelogram (Figure 1b) is the same as the area of the unit square (Figure 1a). But what happens if $M$ were symmetric (Figure 2)? Recall that symmetry is a property of DPP kernels.

Figure 2: (Left) The unit square in $\mathbb{R}^2$ defined by the standard basis vectors $\mathbf{e}_1 = [1, 0]^{\top}$ and $\mathbf{e}_2 = [0, 1]^{\top}$. (Right) The unit square after being sheared in both dimensions by the matrix $M = \begin{bmatrix}1 & \beta \\ \beta & 1\end{bmatrix}$. The area of the unit square is $1$, while the determinant of the shearing matrix is less than $1$.

In this case, the off-diagonal elements of $M$ shear the unit square in both dimensions. The larger the value of $\beta$, provided $0 < \beta < 1$, the smaller the determinant of $M$, which is $1 - \beta^2$. If $\beta > 1$, then $M$ is no longer positive semi-definite because its determinant is negative; at $\beta = 1$ the determinant is $0$ and $M$ is singular.
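
Here is a tiny sketch (Python/NumPy) of this claim: as $\beta$ grows toward $1$, the determinant of the symmetric shear matrix, which equals $1 - \beta^2$, shrinks toward $0$.

```python
import numpy as np

# Determinant of the symmetric shear matrix [[1, b], [b, 1]] as beta grows:
# the sheared parallelogram's area, 1 - b**2, shrinks toward zero.
for b in [0.0, 0.25, 0.5, 0.75, 0.9]:
    M = np.array([[1.0, b], [b, 1.0]])
    print(b, np.linalg.det(M), 1 - b**2)
```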

I like this geometrical intuition for the kernels of DPPs because it generalizes better in my mind. I find generalizing Equation 2 difficult, but I can imagine a determinant of a kernel matrix in three dimensions. As the off-diagonal elements of this kernel matrix increase, the unit cube in 3D space is sheared more and more, just like in the 2D case. For example, consider again the probability of three elements from a DPP, $\mathcal{P}(x_i, x_j, x_k)$. What if $K_{jk} = K_{kj}$ were high? Then the matrix transformation represented by $K_A$ would be sheared along the $kj$-plane (Figure 3). Since the probability of the triplet is equal to the determinant of this matrix, the probability of the triplet decreases.

Figure 3: (Left) The unit cube in $\mathbb{R}^3$. (Right) The unit cube sheared in two dimensions. The determinant of a matrix $M$ that models this transformation is less than $1$.

Gram matrices

There’s another way to think of the geometry of the determinant in relation to a DPP. Consider a realization of $\mathcal{P}$, $X = \{x_1, x_2, \dots, x_r\}$, where each $x_i$ is a real-valued vector, and (abusing notation slightly) let $X$ also denote the matrix whose columns are these vectors. We can construct an $r \times r$ positive semi-definite matrix in the following way:

$$
K = X^{\top} X
$$

In this case, the determinant of $K$ is equal to the squared volume of the parallelepiped spanned by the vectors in $X$. To see this, recall two facts. First, $\det(M^{\top}) = \det(M)$ for any square matrix $M$. And second, the absolute value of the determinant of a square matrix of real vectors is equal to the volume of the parallelepiped spanned by those vectors. Then we can write:

$$
\det(K) = \det(X^{\top}) \det(X) = \det(X)^2 = \text{vol}^2(\{x_i\}_{i=1}^{r})
$$

This is a useful way to think about DPPs because it means that if we construct our kernel $K$ in the manner above, we have a very nice intuition for our model: if $X$ is a realization from our point process $\mathcal{P}$ over $\mathcal{X}$, then as the points in $X$ move farther apart, the probability of sampling $X$ increases. This is because $K$ is now just a Gram matrix, i.e.:

K=[x1,x1x1,x2x1,xrx2,x1x2,x2x2,xrxr,x1xr,x2xr,xr] K = \begin{bmatrix} \langle x_1, x_1 \rangle & \langle x_1, x_2 \rangle & \dots & \langle x_1, x_r \rangle \\ \langle x_2, x_1 \rangle & \langle x_2, x_2 \rangle & \dots & \langle x_2, x_r \rangle \\ \vdots & \vdots & \ddots & \vdots \\ \langle x_r, x_1 \rangle & \langle x_r, x_2 \rangle & \dots & \langle x_r, x_r \rangle \end{bmatrix}

and if two elements $x_i$ and $x_j$ have a larger inner product, then the probability of them co-occurring in a realization of our point process decreases. For real matrices, this inner product is just the dot product, which has a simple geometrical intuition.
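
To see this repulsion effect numerically, here is a minimal sketch (Python/NumPy, with made-up points) comparing two configurations: a pair of nearly parallel unit vectors versus a pair of nearly orthogonal ones. The Gram determinant, and hence the inclusion probability under this construction, is much larger for the spread-out pair.

```python
import numpy as np

def gram_det(points):
    """det(X^T X), where the columns of X are the points: the squared volume
    of the parallelepiped they span."""
    X = np.column_stack(points)
    return np.linalg.det(X.T @ X)

# Two unit vectors that point in nearly the same direction...
close_pair = [np.array([1.0, 0.0]), np.array([0.95, 0.31])]
# ...versus two unit vectors that are nearly orthogonal.
spread_pair = [np.array([1.0, 0.0]), np.array([0.10, 0.99])]

print(gram_det(close_pair))   # small: the pair is unlikely to co-occur
print(gram_det(spread_pair))  # much larger: the pair is far more likely to co-occur
```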

Conclusion

DPPs are an interesting way to model repulsive systems. In probabilistic machine learning, they have been used as a prior for latent variable models (Zou & Adams, 2012): think of a DPP over a Gram matrix of latent variable parameters. In optimization, they have been used for diversifying mini-batches (Zhang et al., 2017), which may lead to stochastic gradients with lower variance, especially in contexts such as imbalanced data.

  1. Tao, T. (2009). Determinantal processes. https://terrytao.wordpress.com/2009/08/23/determinantal-processes/
  2. Kulesza, A., Taskar, B., & others. (2012). Determinantal point processes for machine learning. Foundations and Trends® in Machine Learning, 5(2–3), 123–286.
  3. Zou, J. Y., & Adams, R. P. (2012). Priors for diversity in generative latent variable models. Advances in Neural Information Processing Systems, 2996–3004.
  4. Zhang, C., Kjellstrom, H., & Mandt, S. (2017). Determinantal point processes for mini-batch diversification. ArXiv Preprint ArXiv:1705.00607.