The Unscented Transform
Introduction
The unscented transform is a way to approximate the distribution of a normally distributed random variable after it has been passed through a nonlinear function. Concretely, let’s suppose that \(x \sim N(\mu, \Sigma)\), and let \(f : \mathbb{R}^n \to \mathbb{R}^m\) be some nonlinear function. Then we might ask: how can we approximate the distribution of \(f(x)\) by another normal distribution?
The ‘traditional’ approach is through linearisation. We know that if \(f\) is a linear or affine mapping, \(f(x) = A x + b\), then the distribution of \(f(x)\) is given by
\[Ax+b \sim N(A\mu + b, A\Sigma A^\top)\]So one way to approximate the distribution of \(f(x)\) for a nonlinear \(f\) is to simply say
\[f(x) \approx f(\mu) + D f(\mu)[x-\mu] \sim N(f(\mu), D f(\mu) \Sigma D f(\mu)^\top),\]where \(Df(\mu)\) is the differential (Jacobian) of \(f\) at \(\mu\). The problem is that this may not be a very good approximation, so what else can we do?
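As a concrete sketch, here is how the linearisation approach might look in Python with NumPy. The arguments `f` and `jacobian_f` are hypothetical stand-ins for whatever nonlinear function and Jacobian you have at hand.

```python
import numpy as np

def linearised_transform(mu, Sigma, f, jacobian_f):
    """Propagate N(mu, Sigma) through f by linearising f at the mean.

    f          : R^n -> R^m, the nonlinear function
    jacobian_f : R^n -> R^{m x n}, its Jacobian
    Returns the mean and covariance of the approximating Gaussian.
    """
    J = jacobian_f(mu)           # Df(mu), shape (m, n)
    eta = f(mu)                  # first-order approximation of the mean
    Sigma_out = J @ Sigma @ J.T  # Df(mu) Sigma Df(mu)^T
    return eta, Sigma_out
```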
Sigma Points
What if, instead of trying to approximate \(f\) and then applying it to the distribution, we approximate the distribution and then apply \(f\)? This is the idea of the unscented transform. But how can we approximate the distribution? The answer is ‘sigma points’. A set of sigma points \(S\) for the distribution \(N(\mu, \Sigma)\) consists of points \(x^{(i)}\) and weights \(w^{(i)}\) so that
- \(\sum_i w^{(i)} = 1\),
- \(\sum_i w^{(i)} x^{(i)} = \mu\),
- \(\sum_i w^{(i)} (x^{(i)} - \mu)(x^{(i)} - \mu)^\top = \Sigma\).
This approximates the original distribution if you interpret it as a probability mass function. Let \(\Omega = \{ x^{(i)} \mid i=0,1,...,p \}\), define a probability mass function \(p: \Omega \to \mathbb{R}\) by \(p(x^{(i)}) = w^{(i)}\), and let \(x\) be a random variable distributed according to \(p\). Then,
- \(\sum_{x \in \Omega} p(x) = \sum_i w^{(i)} = 1\),
- \(\mathbb{E}[x] = \sum_i w^{(i)} x^{(i)} = \mu\),
- \(\mathrm{cov}(x,x) = \mathbb{E}\left[ (x - \mathbb{E}[x])(x - \mathbb{E}[x])^\top \right] = \sum_i w^{(i)} (x^{(i)} - \mu)(x^{(i)} - \mu)^\top = \Sigma\).
Now, this is only possible if there are at least \(n+1\) points, and even then there is no unique choice of sigma points. However, the original construction due to Uhlmann provides a ‘canonical’ choice; for \(i = 1,..., 2n\) define
\[\begin{aligned} w^{(i)} &= \frac{1}{2n}, & x^{(i)} &= \mu + \begin{cases} (\sqrt{n \Sigma})_i & \text{if } i \leq n \\ -(\sqrt{n \Sigma})_{i-n} & \text{if } i > n \end{cases} \end{aligned}\]Here \((\sqrt{n\Sigma})_i\) denotes the \(i\)th column of a matrix square root of \(n \Sigma\). Specifically, if \(A = \sqrt{n \Sigma}\), then \(A A^\top = n \Sigma\).
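Here is a minimal sketch of generating these canonical sigma points, using a Cholesky factor as the matrix square root (any \(A\) with \(A A^\top = n\Sigma\) would do equally well).

```python
import numpy as np

def canonical_sigma_points(mu, Sigma):
    """Generate the 2n canonical sigma points for N(mu, Sigma).

    Returns (points, weights), where points has shape (2n, n) and
    every weight is 1/(2n).
    """
    n = mu.shape[0]
    A = np.linalg.cholesky(n * Sigma)          # A A^T = n * Sigma
    points = np.vstack([mu + A.T, mu - A.T])   # rows are mu +/- column i of A
    weights = np.full(2 * n, 1.0 / (2 * n))
    return points, weights
```

It is easy to verify that these points satisfy the three defining properties: the symmetric pairs \(\mu \pm (\sqrt{n\Sigma})_i\) average to \(\mu\), and the weighted outer products sum to \(\frac{1}{n} A A^\top = \Sigma\).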
Unscented Transform
Now that we have a set of sigma points that approximate the original distribution, we need to apply the nonlinear function \(f\). This time, instead of approximating \(f\), we simply apply it to our sigma points and gather statistics at the end. Let \(x\) be a random variable distributed according to the sigma points. Then we compute the mean and covariance of \(f(x)\) as follows.
\[\begin{aligned} \hat{\eta} &:= \mathbb{E} \left[ f(x) \right], \\ &= \sum_i w^{(i)} f(x^{(i)}), \\ \hat{\Sigma} &:= \mathbb{E} \left[ (f(x) - \mathbb{E} \left[ f(x) \right])(f(x) - \mathbb{E} \left[ f(x) \right])^\top \right], \\ &= \mathbb{E} \left[ (f(x) - \hat{\eta})(f(x) - \hat{\eta})^\top \right], \\ &= \sum_i w^{(i)} (f(x^{(i)}) - \hat{\eta})(f(x^{(i)}) - \hat{\eta})^\top. \end{aligned}\]This is how we now approximate \(f(x)\): we say that, approximately, \(f(x) \sim N(\hat{\eta}, \hat{\Sigma})\). This is quite different from the linearisation approach, but is it any better? That depends on the function \(f\), the choice of sigma points, and also on what you mean by “better”.
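Putting the pieces together, the full transform might look like the sketch below, reusing the hypothetical `canonical_sigma_points` helper from above.

```python
def unscented_transform(mu, Sigma, f):
    """Approximate the distribution of f(x) for x ~ N(mu, Sigma).

    Pushes the canonical sigma points through f, then recovers the
    weighted mean and covariance of the transformed points.
    """
    points, weights = canonical_sigma_points(mu, Sigma)
    fx = np.array([f(p) for p in points])           # f applied to each sigma point
    eta_hat = weights @ fx                          # weighted mean of f(x^(i))
    diff = fx - eta_hat                             # deviations from the mean
    Sigma_hat = (weights[:, None] * diff).T @ diff  # weighted covariance
    return eta_hat, Sigma_hat
```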
Analysis
How does the unscented transform compare to the true distribution of \(f(x)\)? Let \(x \sim N(\mu, \Sigma)\), and let \(s = x - \mu \sim N(0, \Sigma)\). Then we can calculate the expected value of \(f(x)\) by using a Taylor expansion,
\[\begin{aligned} \mathbb{E} \left[ f(x) \right] &= \mathbb{E} \left[ f(\mu + s) \right], \\ &= \mathbb{E} \left[ f(\mu) + Df(\mu)[s] + \frac{1}{2} D^2 f(\mu)[s,s] + \cdots \right], \\ &= f(\mu) + Df(\mu)\big[\mathbb{E} [s]\big] + \frac{1}{2} D^2 f(\mu)\big[\mathbb{E}[s s^\top]\big] + \cdots, \\ &= f(\mu) + Df(\mu) [0] + \frac{1}{2} D^2 f(\mu) [\mathrm{cov}(s,s)] + \cdots, \\ &= f(\mu) + \frac{1}{2} D^2 f(\mu) [\Sigma] + O(\mathbb{E}[\vert s \vert^3]). \end{aligned}\]In other words, the expected value of \(f(x)\) is determined by the mean and covariance of \(x\) up to third-order deviations from the mean. Since the sigma points reproduce the mean and covariance of \(x\) exactly, the unscented transform gives the correct transformed mean for all functions \(f\) where \(D^k f = 0\) for \(k \geq 3\); that is, functions \(f\) that are degree 2 polynomials. In fact, using the canonical choice of sigma points, this is true for polynomials of degree 3 as well: both the sigma points and the Gaussian are symmetric about the mean, so their third central moments vanish.
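As a quick sanity check of the degree-2 claim, consider \(f(x) = x^\top x\), whose exact mean under \(N(\mu, \Sigma)\) is \(\mu^\top \mu + \operatorname{tr}(\Sigma)\); the unscented transform should reproduce it exactly. The numbers below are arbitrary illustrative values, and `unscented_transform` is the sketch from the previous section.

```python
# For f(x) = x^T x and x ~ N(mu, Sigma), the exact mean is
# mu^T mu + trace(Sigma). Here that is 5 + 3 = 8.
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
f = lambda x: np.array([x @ x])   # a degree-2 polynomial

eta_hat, _ = unscented_transform(mu, Sigma, f)
exact_mean = mu @ mu + np.trace(Sigma)
print(eta_hat[0], exact_mean)     # both print 8.0
```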
What about the covariance obtained from the unscented transform? It turns out that this matches the true covariance of \(f(x)\) only when \(f\) is first order (linear or affine). However, practical experience has shown that it can still outperform linearisation in a range of applications.
Conclusion
The unscented transform provides a different way to propagate uncertainty through nonlinear functions than the “standard” approach of linearisation. Rather than approximating the function by a linear one, it approximates the probability distribution by a discrete one. This has the advantages that no derivatives of \(f\) need to be computed, and that the mean is propagated more accurately than under linearisation. In practice, a number of applications show that the unscented transform can outperform the linearisation approach. Overall, in my view it is a very interesting perspective on approximating probability distributions that is not widely enough understood.