Mapping and localization, preferably from a small number of observations, are two fundamental tasks in robotics. We address these tasks by combining spatial structure and end-to-end learning in a novel neural network architecture: the Differentiable Mapping Network (DMN). The DMN constructs a spatially structured view-embedding map and uses it for subsequent visual localization with a particle filter. Importantly, the DMN architecture is end-to-end differentiable, so we can jointly learn mapping and localization using gradient descent. We apply the DMN to sparse visual localization, where a robot needs to localize in a new environment with respect to a small number of images from known viewpoints. We evaluate the DMN using simulated environments and a challenging real-world Street View dataset. We find that the DMN learns effective map representations for visual localization, and that the benefit of structure increases with less training data, larger environments, and more observations for mapping.
Peter Karkus, Anelia Angelova, and Rico Jonschkowski, "Differentiable Mapping Networks: Learning Structured Map Representations for Sparse Visual Localization", International Conference on Robotics and Automation (ICRA), 2020.
In the sparse visual localization task, a robot has access to a handful of visual observations from known viewpoints to build a map, and must then localize with respect to this map given a new sequence of visual observations.
The task is challenging because only a handful of observations are available, and the spatial information relevant for localization must be extracted from their rich visual features.
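To make the task setup concrete, the following is a minimal sketch of the task interface. The class and field names (ContextObservation, SparseLocalizationTask) are illustrative assumptions for this page, not part of the DMN code.

```python
from dataclasses import dataclass
from typing import List
import numpy as np

@dataclass
class ContextObservation:
    """One context input C_i: an image taken from a known viewpoint."""
    image: np.ndarray  # H x W x 3 RGB image
    pose: np.ndarray   # (x, y, yaw) viewpoint in world coordinates

@dataclass
class SparseLocalizationTask:
    """Inputs for sparse visual localization."""
    context: List[ContextObservation]  # handful of image-pose pairs for mapping
    query_images: List[np.ndarray]     # query image sequence Q_1 .. Q_T
    egomotion: List[np.ndarray]        # relative motion Δ_t between query steps
    # Target output: the pose s_t of each query image, unknown at test time.
```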
Sparse visual localization in the Street View domain. The robot receives context image-pose pairs C1, C2, C3, C4 and a query image Q1, and must estimate the pose of the query image.
Differentiable Mapping Network (DMN) schematic.
The Differentiable Mapping Network (DMN) is a novel neural network architecture that learns a spatial view-embedding map for sparse localization.
For each context input Ci the DMN computes an embedding Vi (top left of the figure). The embeddings together with their viewpoint coordinates form the map representation m (top right), which is then used for localization (bottom) with a differentiable particle filter in which particles correspond to query pose candidates.
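A minimal sketch of how such a view-embedding map could be assembled, written in PyTorch. The encoder layers and the ViewEmbeddingMap name are hypothetical stand-ins for the learned components described above, not the exact DMN implementation.

```python
import torch
import torch.nn as nn

class ViewEmbeddingMap(nn.Module):
    """Sketch of a DMN-style map: one learned embedding per context view,
    stored together with its known viewpoint coordinates."""

    def __init__(self, embed_dim=64):
        super().__init__()
        # Hypothetical CNN encoder; the actual DMN encoder is not specified here.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, embed_dim),
        )

    def forward(self, context_images, context_poses):
        # context_images: (N, 3, H, W); context_poses: (N, 3) as (x, y, yaw)
        embeddings = self.encoder(context_images)  # V_1 .. V_N
        # The map m keeps the spatial structure explicit: each embedding
        # stays attached to its viewpoint instead of being pooled away.
        return {"embeddings": embeddings, "viewpoints": context_poses}
```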
Starting with a set of particles sampled from the initial belief b0, the observation model updates the particle weights by comparing the query image Qt to the map using egocentric spatial attention, i.e., attention based on the relative poses of the context viewpoints with respect to the particle. The transition model updates the particles with the egomotion input ∆t. At each time step the query pose output st is estimated as the weighted mean of the particles.
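The sketch below illustrates one filter update along these lines, assuming PyTorch and hypothetical learned modules attention_net and obs_net. It applies the egomotion in the world frame and averages yaw directly, simplifications relative to the actual DMN filter.

```python
import torch

def relative_pose(viewpoints, particles):
    """Pose of each context viewpoint expressed in each particle's frame."""
    # viewpoints: (N, 3), particles: (K, 3); both as (x, y, yaw)
    dxy = viewpoints[None, :, :2] - particles[:, None, :2]                # (K, N, 2)
    cos, sin = torch.cos(-particles[:, None, 2]), torch.sin(-particles[:, None, 2])
    local_x = cos * dxy[..., 0] - sin * dxy[..., 1]
    local_y = sin * dxy[..., 0] + cos * dxy[..., 1]
    dyaw = viewpoints[None, :, 2] - particles[:, None, 2]
    return torch.stack([local_x, local_y, dyaw], dim=-1)                  # (K, N, 3)

def particle_filter_step(particles, log_weights, delta, query_feat, map_m,
                         attention_net, obs_net, motion_noise=0.05):
    """One DMN-style filter update (a sketch, not the exact DMN components)."""
    # Transition model: move particles by the egomotion input Δ_t plus noise
    # (applied in the world frame here for simplicity).
    particles = particles + delta + motion_noise * torch.randn_like(particles)

    # Observation model with egocentric spatial attention: weight each context
    # embedding by the relative pose of its viewpoint w.r.t. the particle.
    rel = relative_pose(map_m["viewpoints"], particles)                   # (K, N, 3)
    attn = torch.softmax(attention_net(rel).squeeze(-1), dim=1)           # (K, N)
    map_feat = attn @ map_m["embeddings"]                                 # (K, D)
    # query_feat: (D,) features of the query image Q_t, broadcast to all particles.
    log_lik = obs_net(torch.cat([map_feat,
                                 query_feat.expand(len(particles), -1)], dim=-1))
    log_weights = log_weights + log_lik.squeeze(-1)
    log_weights = log_weights - torch.logsumexp(log_weights, dim=0)

    # Pose estimate s_t: weighted mean of the particles.
    estimate = (log_weights.exp()[:, None] * particles).sum(dim=0)
    return particles, log_weights, estimate
```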
Examples of one-step sparse localization in the Rooms and Mazes domains from the GQN dataset. The last column shows DMN particles (black), context viewpoints (green), and the unknown query pose (blue). Particle weights are visualized as a heat map aggregated over all yaw values.
Sequential localization example in the Street View domain. The context images for this example are shown in the first figure. The bottom row shows DMN particles (black), context viewpoints (green), and the unknown query pose (blue). Particle weights are visualized as a heat map aggregated over all yaw values.