Why neural networks may seem better than the KF even when they are not — and how to both fix this and improve your KF itself
This post introduces our recent paper from NeurIPS 2023. Code is available on PyPI.
Background
The Kalman Filter (KF) has been a celebrated method for sequential forecasting and control since 1960. While many new methods have been introduced in recent decades, the KF's simple design keeps it a practical, robust and competitive method to this day. The original 1960 paper has 12K citations in the last 5 years alone. Its broad applications include navigation, medical treatment, marketing analysis, deep learning and even getting to the moon.
Technically, the KF predicts the state x of a system (e.g. a spaceship's location) from a sequence of noisy observations (e.g. radar/camera). It estimates a distribution over the state (e.g. a location estimate + uncertainty). At every time step, it predicts the next state according to the dynamics model F and increases the uncertainty according to the dynamics noise Q. At every observation, it updates the state and its uncertainty according to the new observation z and its noise R.
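As a quick illustration, here is a minimal sketch of one such predict/update cycle in NumPy. The observation matrix H, which maps the state to the observation space, is an assumption of this sketch; the post itself only names F, Q, z and R.

```python
import numpy as np

def kf_step(x, P, z, F, Q, H, R):
    """One Kalman Filter cycle: predict with the dynamics, then update with an observation.

    x: state estimate (mean), P: state uncertainty (covariance),
    z: new observation, F/Q: dynamics model and noise,
    H/R: observation model and noise (H is assumed here for illustration).
    """
    # Predict: propagate the state through the dynamics model F,
    # and grow the uncertainty by the dynamics noise Q.
    x = F @ x
    P = F @ P @ F.T + Q

    # Update: weigh the new observation z against the prediction,
    # using the Kalman gain K derived from P and the observation noise R.
    y = z - H @ x                     # innovation (observation residual)
    S = H @ P @ H.T + R               # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)    # Kalman gain
    x = x + K @ y
    P = (np.eye(len(x)) - K @ H) @ P
    return x, P
```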
Kalman Filter or a Neural Network?
The KF prediction model F is linear, which is somewhat restrictive. So we built a fancy neural network on top of the KF. This gave us better prediction accuracy than the standard KF! Good for us!
Finalizing our experiments for a paper, we conducted some ablation tests. In one of them, we removed the neural network completely and just optimized the internal KF parameters Q and R. Imagine the look on my face when this optimized KF outperformed not only the standard KF, but also my fancy network! The exact same KF model, with the same 60-year-old linear architecture, becomes superior just by changing the values of its noise parameters.
The KF beating a neural network is interesting, but anecdotal to our problem. More important is the methodological insight: before the extended tests, we were about to declare the network superior to the KF, just because we hadn't compared the two properly.
Message 1: To make sure that your neural network is actually better than the KF, optimize the KF just as nicely as you do for the network.
Remark: does this mean that the KF is better than neural networks? We certainly make no such general claim. Our claim is about the methodology: both models should be optimized similarly if you'd like to compare them. Having said that, we do demonstrate *anecdotally* that the KF can be better in the Doppler radar problem, despite the non-linearity of the problem. In fact, this was so hard for me to accept that I lost a bet about my neural KF, along with many weeks of hyperparameter optimization and other tricks.
Optimizing the Kalman Filter
When comparing two architectures, optimize them similarly. Sounds somewhat trivial, doesn't it? As it happens, this flaw was not unique to our research: in the literature on non-linear filtering, the linear KF (or its extension, the EKF) is usually used as a baseline for comparison, but is rarely optimized. And there is actually a reason for that: the standard KF parameters are "known" to already yield optimal predictions, so why bother optimizing further?
Unfortunately, the optimality of the closed-form equations does not hold in practice, as it relies on a set of quite strong assumptions, which rarely hold in the real world. In fact, in the simple, classic, low-dimensional problem of a Doppler radar, we found no fewer than 4 violations of these assumptions. In some cases, the violation is tricky to even notice: for example, we simulated iid observation noise, but in spherical coordinates. Once transformed to Cartesian coordinates, the noise is no longer iid!
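To see how easy this is to miss, here is a small sketch (in 2D polar rather than full spherical coordinates, with made-up noise levels): it adds iid Gaussian noise to range/angle measurements and shows that the resulting Cartesian noise covariance depends on the target's position, i.e. the noise is no longer identically distributed.

```python
import numpy as np

rng = np.random.default_rng(0)

def cartesian_noise_cov(r, theta, sigma_r=1.0, sigma_theta=0.01, n=100_000):
    """Empirical covariance of the Cartesian error when iid Gaussian noise
    is added to the polar measurements (r, theta)."""
    r_noisy = r + rng.normal(0, sigma_r, n)
    theta_noisy = theta + rng.normal(0, sigma_theta, n)
    # True position vs. noisy measurement, both in Cartesian coordinates
    x_true, y_true = r * np.cos(theta), r * np.sin(theta)
    err = np.stack([r_noisy * np.cos(theta_noisy) - x_true,
                    r_noisy * np.sin(theta_noisy) - y_true])
    return np.cov(err)

# The same iid polar noise, two different target positions:
# the Cartesian noise covariances differ, so the Cartesian noise is not iid.
print(cartesian_noise_cov(r=100.0, theta=0.0))
print(cartesian_noise_cov(r=1000.0, theta=np.pi / 3))
```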
Message 2: Do not trust the KF assumptions, and thus avoid the closed-form covariance estimation. Instead, optimize the parameters wrt your loss — just as with any other prediction model.
In other words, in the real world, noise covariance estimation is no longer a proxy for optimizing the prediction errors. This discrepancy between the objectives creates surprising anomalies. In one experiment, we replace noise estimation with an *oracle* KF that knows the exact noise in the system. This oracle is still inferior to the Optimized KF, since accurate noise estimation is not the desired objective; accurate state prediction is. In another experiment, the KF *deteriorates* when it is fed more data, since it effectively pursues a different objective than the MSE!
So how do we optimize the KF?
Behind the standard noise-estimation method for KF tuning stands the view of the KF parameters as representatives of the noise. This view is beneficial in some contexts. However, as discussed above, for the sake of optimization we should "forget" about this role of the KF parameters and just treat them as model parameters, whose objective is loss minimization. This alternative view also tells us how to optimize: just like any sequential prediction model, such as an RNN! Given the data, we simply make predictions, calculate the loss, backpropagate gradients, update the model, and repeat.
The main difference from an RNN is that the parameters Q and R come in the form of covariance matrices, so they must remain symmetric and positive definite. To handle this, we use the Cholesky decomposition to write Q = LLᵀ and optimize the entries of L. This guarantees that Q remains positive definite regardless of the optimization updates. The same trick is used for both Q and R. A sketch of the resulting training loop is shown below.
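Here is a minimal sketch of such a training loop in PyTorch. It is not the authors' PyPI package; the function name train_okf, the known linear observation matrix H, and the per-step MSE loss on the predicted state are simplifying assumptions for illustration.

```python
import torch

def train_okf(trajectories, observations, F, H, dim_x, dim_z,
              n_epochs=100, lr=1e-2):
    """Optimize the KF noise parameters Q, R w.r.t. the prediction MSE.

    trajectories: list of (T, dim_x) true-state tensors (for the loss),
    observations: list of (T, dim_z) observation tensors,
    F, H: dynamics and observation matrices (assumed known and linear here).
    """
    # Trainable Cholesky factors: Q = L Lᵀ stays a valid covariance
    # no matter how the optimizer updates the raw entries of L.
    L_q = torch.nn.Parameter(torch.eye(dim_x))
    L_r = torch.nn.Parameter(torch.eye(dim_z))
    opt = torch.optim.Adam([L_q, L_r], lr=lr)

    for _ in range(n_epochs):
        opt.zero_grad()
        Lq, Lr = torch.tril(L_q), torch.tril(L_r)   # keep the factors lower-triangular
        Q, R = Lq @ Lq.T, Lr @ Lr.T
        loss = torch.tensor(0.0)
        for x_true, zs in zip(trajectories, observations):
            x = torch.zeros(dim_x)
            P = torch.eye(dim_x)
            for t in range(len(zs)):
                # Predict step
                x = F @ x
                P = F @ P @ F.T + Q
                # The loss is the prediction error itself, not a noise-likelihood fit
                loss = loss + torch.sum((x - x_true[t]) ** 2)
                # Update step with observation z_t
                S = H @ P @ H.T + R
                K = P @ H.T @ torch.linalg.inv(S)
                x = x + K @ (zs[t] - H @ x)
                P = (torch.eye(dim_x) - K @ H) @ P
        loss.backward()   # backpropagate through the whole filtering recursion
        opt.step()

    with torch.no_grad():
        Lq, Lr = torch.tril(L_q), torch.tril(L_r)
        return Lq @ Lq.T, Lr @ Lr.T
```

Note that only the predicted state mean enters the loss in this sketch; Q and R affect it indirectly, through the Kalman gain.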
This optimization procedure proved fast and stable in all of our experiments, as the number of parameters is several orders of magnitude smaller than in typical neural networks. And while the training is easy to implement yourself, you may also use our PyPI package, as demonstrated in this example 🙂
Summary
As summarized in the diagram below, our main message is that the KF assumptions cannot be trusted, and thus we should optimize the KF directly — whether we use it as our primary prediction model, or as a reference for comparison with a new method.
Our simple training procedure is available on PyPI. More importantly, since our architecture remains identical to the original KF, any system using the KF (or the Extended KF) can be easily upgraded to OKF just by re-learning the parameters, without adding a single line of code at inference time.