The Unscented Transform

The unscented transform, most commonly associated with the unscented Kalman filter, was proposed by Jeffrey Uhlmann to estimate the density of a Gaussian random variable after a nonlinear transformation. I illustrate the main idea.

The unscented transform (UT) was originally proposed by Jeffrey Uhlmann as part of his PhD thesis (Uhlmann, 1995), although it is best known as a component of the unscented Kalman filter (Julier & Uhlmann, 1997; Wan & Van Der Merwe, 2000). The basic premise of the UT is that it is easier to approximate a Gaussian distribution than it is to approximate an arbitrary density after a nonlinear transformation. This is a surprisingly simple and useful idea.

Imagine you have some $D$-dimensional Gaussian distributed variables,

$$
\mathbf{x}_n \stackrel{\texttt{iid}}{\sim} \mathcal{N}_D(\boldsymbol{\mu}_X, \boldsymbol{\Sigma}_X), \tag{1}
$$

and you want to estimate their density after applying a nonlinear transformation:

$$
\mathbf{y}_n = f(\mathbf{x}_n). \tag{2}
$$

How would we do this? If the transformation were linear, such as

$$
\mathbf{y}_n = \mathbf{A} \mathbf{x}_n + \mathbf{b}, \tag{3}
$$

this would be easy, since an affine transformation of a Gaussian is also a Gaussian; in particular, $\mathbf{y}_n \sim \mathcal{N}(\mathbf{A} \boldsymbol{\mu}_X + \mathbf{b}, \mathbf{A} \boldsymbol{\Sigma}_X \mathbf{A}^{\top})$, so we could estimate the density of $\mathbf{y}_n$ in closed form. This is one reason why so many probabilistic models rely on linear-Gaussian assumptions (Roweis & Ghahramani, 1999). For example, in a Kalman filter, we might have some state estimate that we posit is Gaussian distributed. Linear dynamics mean we can propagate that estimate, along with our model uncertainty, forward in time using a linear map. However, it’s less obvious what to do in Eq. 2, when our transformation is nonlinear.
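To make this concrete, here is a minimal NumPy sketch of the closed-form pushforward for the linear case, checked against Monte Carlo samples (the function name and the matrices are illustrative choices, not from any particular library):

```python
import numpy as np

def linear_gaussian_pushforward(A, b, mu_x, Sigma_x):
    """Exact density of y = A x + b when x ~ N(mu_x, Sigma_x)."""
    mu_y = A @ mu_x + b
    Sigma_y = A @ Sigma_x @ A.T
    return mu_y, Sigma_y

# Sanity check against brute-force Monte Carlo samples.
rng = np.random.default_rng(0)
A = np.array([[2.0, 0.5], [0.0, 1.0]])
b = np.array([1.0, -1.0])
mu_x, Sigma_x = np.zeros(2), np.eye(2)

mu_y, Sigma_y = linear_gaussian_pushforward(A, b, mu_x, Sigma_x)
Y = rng.multivariate_normal(mu_x, Sigma_x, size=100_000) @ A.T + b
print(mu_y, Y.mean(axis=0))                        # closed form vs. Monte Carlo
print(Sigma_y, np.cov(Y, rowvar=False), sep="\n")  # closed form vs. Monte Carlo
```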

Figure 1. (Left) One thousand data points $\mathbf{X}$ (gray), as well as their sigma points (black "X" marks) and a contour plot of the Gaussian density estimated from $\mathbf{X}$. (Right) The transformed data points $\mathbf{Y} = f(\mathbf{X})$, as well as the transformed sigma points. The contour plot of the estimated Gaussian is computed only from the sigma points. The nonlinear function $f(\cdot)$ is the logistic sigmoid function.

The UT is the following: compute a small set of points, called sigma points, $\mathbf{S} = \{\mathbf{s}_i\}_{i=0}^{2D}$, and propagate them through the nonlinear map $f(\cdot)$. Then estimate a Gaussian distribution as

$$
\begin{aligned}
\mathbf{y}_i &= f(\mathbf{s}_i), \quad \forall i \in \{0, \dots, 2D\}, \\
\mathbf{y}_i &\sim \mathcal{N}(\tilde{\boldsymbol{\mu}}_Y, \tilde{\boldsymbol{\Sigma}}_Y),
\end{aligned} \tag{4}
$$

where $\tilde{\boldsymbol{\mu}}_Y$ and $\tilde{\boldsymbol{\Sigma}}_Y$ are the sample mean and covariance, respectively. If this works, it is attractive because it is relatively cheap: there are only $2D+1$ sigma points, which scales well with the dimension of our data compared to other methods such as Monte Carlo sampling. While there are different proposals for how to choose sigma points, let’s use the simplest one, proposed by Uhlmann in his PhD thesis.
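Eq. 4 is simple enough to sketch in code before we even specify the sigma points. A minimal NumPy sketch, assuming `sigma_points` is an array with one sigma point per row (its construction is Eq. 6 below):

```python
import numpy as np

def unscented_estimate(sigma_points, f):
    """Push each sigma point through f, then fit a Gaussian to the images."""
    Y = np.array([f(s) for s in sigma_points])  # shape (2D+1, D_out)
    mu_y = Y.mean(axis=0)                       # sample mean
    Sigma_y = np.cov(Y, rowvar=False)           # sample covariance
    return mu_y, Sigma_y
```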

First, let $\mathbf{L}$ denote the Cholesky factor of the scaled covariance matrix $D \boldsymbol{\Sigma}_X$, or

$$
\mathbf{L} = \text{cholesky}\left(D \boldsymbol{\Sigma}_X\right). \tag{5}
$$

Then the sigma points are

$$
\begin{aligned}
\mathbf{s}_0 &= \boldsymbol{\mu}_X, \\
\mathbf{s}_i &= \boldsymbol{\mu}_X + \mathbf{L}_{:,i}, \quad \forall i \in \{1, \dots, D\}, \\
\mathbf{s}_j &= \boldsymbol{\mu}_X - \mathbf{L}_{:,j-D}, \quad \forall j \in \{D+1, \dots, 2D\}.
\end{aligned} \tag{6}
$$

What is this doing? Speaking loosely, a positive definite matrix can be thought of as a multidimensional generalization of a positive number, and the Cholesky decomposition is then a multidimensional generalization of the square root. So we’re taking the square root of our scaled covariance matrix, i.e. computing a multidimensional standard deviation. Eq. 6 says that our sigma points are the sample mean of $\mathbf{X} = \{\mathbf{x}_n\}_{n=1}^{N}$, as well as points that are one standard deviation away from that mean in both directions along each of the $D$ dimensions. This is why there are $2D+1$ sigma points.
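In code, Eqs. 5 and 6 amount to one Cholesky factorization and a loop over its columns. A minimal NumPy sketch (`np.linalg.cholesky` returns the lower-triangular factor):

```python
import numpy as np

def sigma_points(mu, Sigma):
    """Uhlmann's 2D+1 sigma points for a Gaussian N(mu, Sigma), per Eqs. 5-6."""
    D = len(mu)
    L = np.linalg.cholesky(D * Sigma)  # "square root" of the scaled covariance
    S = [mu]                           # s_0 is the mean itself
    for i in range(D):
        S.append(mu + L[:, i])         # s_1, ..., s_D
    for i in range(D):
        S.append(mu - L[:, i])         # s_{D+1}, ..., s_{2D}
    return np.array(S)                 # shape (2D+1, D)
```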

While that’s a mouthful, it’s easy to visualize (Fig. 1, left). And to estimate a Gaussian density, we simply take the sample mean and sample covariance of $f(\mathbf{S})$ (Fig. 1, right). Here, I’ve used the logistic sigmoid function for $f(\cdot)$.
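Putting the two sketches above together reproduces the computation behind Fig. 1; the mean and covariance below are illustrative stand-ins, not the exact values used for the figure:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))  # elementwise logistic sigmoid

mu_x = np.array([0.0, 0.5])
Sigma_x = np.array([[1.0, 0.6],
                    [0.6, 1.0]])

S = sigma_points(mu_x, Sigma_x)                 # defined in the sketch above
mu_y, Sigma_y = unscented_estimate(S, sigmoid)  # ditto

# Compare the five-point UT estimate to a brute-force Monte Carlo estimate.
X = np.random.default_rng(0).multivariate_normal(mu_x, Sigma_x, size=100_000)
Y = sigmoid(X)
print(mu_y, Y.mean(axis=0))                     # UT mean vs. Monte Carlo mean
print(Sigma_y, np.cov(Y, rowvar=False), sep="\n")
```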

That’s it. That’s the essence of the UT. There are natural extensions to this idea, such as different ways of computing the sigma points; weighting the sigma points after transforming them; and using this method to estimate non-Gaussian densities. But I think Fig. 1 nicely captures the main idea.

  1. Uhlmann, J. K. (1995). Dynamic map building and localization: New theoretical foundations [PhD thesis]. University of Oxford.
  2. Julier, S. J., & Uhlmann, J. K. (1997). New extension of the Kalman filter to nonlinear systems. Signal Processing, Sensor Fusion, and Target Recognition VI, 3068, 182–193.
  3. Wan, E. A., & Van Der Merwe, R. (2000). The unscented Kalman filter for nonlinear estimation. Proceedings of the IEEE 2000 Adaptive Systems for Signal Processing, Communications, and Control Symposium (Cat. No. 00EX373), 153–158.
  4. Roweis, S., & Ghahramani, Z. (1999). A unifying review of linear Gaussian models. Neural Computation, 11(2), 305–345.