Course Description

The tutorial covers the task of visual localization at large scale, where the goal is to localize a single image based solely on visual information. The tutorial includes localization approaches for different granularity levels, ranging from simple recognition of named locations and GPS estimation to the precise estimation of the six degree-of-freedom (6DoF) camera pose. The tutorial’s scope covers cases of different spatial/geographical extent, e.g., a small indoor/outdoor scene, city level, and world level, as well as localization under changing conditions.

In the coarse localization regime, the task is typically handled via retrieval approaches, which are covered in the first part of the tutorial. A typical use case is the following: given a database of geo-tagged images, the goal is to determine the place depicted in a new query image. Traditionally, this problem is solved by transferring the geo-tag of the most similar database image to the query image. The major focus of this part is on the visual representation models used for retrieval, including both classical feature-based approaches and recent deep learning ones. The second and third parts of the tutorial cover methods for the precise localization regime, with feature-based and deep learning approaches, respectively. A typical use case for these algorithms is to estimate the full 6DoF pose of a query image, i.e., the position and orientation from which the image was taken, for applications such as robotics, autonomous vehicles (self-driving cars), Augmented/Mixed/Virtual Reality, loop closure detection in SLAM, and Structure-from-Motion.
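
As a concrete illustration of the retrieval-based approach, the following minimal sketch transfers the geo-tag of the nearest database image to a query; the random descriptors and geo-tags are stand-ins for real image embeddings and annotations.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical database: one global descriptor and one geo-tag per image.
db_descriptors = rng.random((10_000, 128), dtype=np.float32)           # stand-ins for real embeddings
db_geotags = rng.uniform([-90.0, -180.0], [90.0, 180.0], (10_000, 2))  # (latitude, longitude)

def localize_by_retrieval(query_descriptor):
    """Transfer the geo-tag of the most similar database image to the query."""
    # L2-normalize so the dot product equals cosine similarity.
    db = db_descriptors / np.linalg.norm(db_descriptors, axis=1, keepdims=True)
    q = query_descriptor / np.linalg.norm(query_descriptor)
    best_match = int(np.argmax(db @ q))
    return db_geotags[best_match]

query = rng.random(128, dtype=np.float32)
print("Estimated (lat, lon):", localize_by_retrieval(query))
```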

This tutorial covers the state of the art in visual localization, with three goals: 1) Provide a comprehensive overview of the current state of the art. This is aimed at first- and second-year PhD students and engineers from industry who are getting started with or are interested in this topic. 2) Have experts teach the tricks of the trade to more experienced PhD students and engineers who want to refine their knowledge of visual localization. 3) Highlight current open challenges, outlining what current algorithms can and cannot do. Throughout the tutorial, we provide links to publicly available source code for the discussed approaches. We will also highlight the different properties of the datasets commonly used for experimental evaluation.

Part I: Retrieval for coarse localization - Giorgos, Yannis

The first part of the tutorial covers coarse localization, also called visual place recognition. It considers an application scenario in which the scene is represented by a set of geo-tagged images or images of named places, e.g., landmarks, known buildings, and locations. The aim is to determine which place is visible in an image taken by a user. Traditionally, this has been modeled as an image retrieval problem; this part of the tutorial focuses on the visual representation aspect. We cover both classical feature-based visual representations and recent ones that rely on deep learning. A large range of approaches is cast under the same generic framework based on match kernels, and we highlight the relation between classical and deep approaches. Although CNN representations transfer well to the localization task, fine-tuning is essential; we review different approaches for obtaining training data in ways that dispense with the need for human annotations. We additionally include recent results at the intersection of classical and deep learning approaches. Local feature indexing and geometric matching appear to be directly applicable on top of local features learned via deep learning. This gives rise to interesting open problems related to better adapting the features to the indexing scheme or vice versa, which will be discussed in the tutorial. At the end of this part, we will present recent approaches for asymmetric setups that use lightweight networks, e.g. on mobile devices, to efficiently process the query image.
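
To make the match-kernel framework concrete, the sketch below scores two sets of L2-normalized local descriptors with a selective kernel (thresholding and power-scaling of pairwise similarities) and a self-similarity normalization, in the spirit of selective match kernels; the threshold and exponent are illustrative values, not the tutorial's.

```python
import numpy as np

def selective_kernel(S, tau=0.0, alpha=3):
    """Element-wise selectivity: suppress weak similarities, amplify strong ones."""
    return np.where(S > tau, S ** alpha, 0.0)

def set_similarity(X, Y, tau=0.0, alpha=3):
    """Match-kernel similarity K(X, Y) between two sets of local descriptors.

    K(X, Y) = gamma(X) * gamma(Y) * sum_{x in X} sum_{y in Y} k(x, y),
    where gamma(.) normalizes by self-similarity so that K(X, X) = 1.
    """
    cross = selective_kernel(X @ Y.T, tau, alpha).sum()
    gamma_x = selective_kernel(X @ X.T, tau, alpha).sum() ** -0.5
    gamma_y = selective_kernel(Y @ Y.T, tau, alpha).sum() ** -0.5
    return gamma_x * gamma_y * cross

rng = np.random.default_rng(0)

def random_unit_descriptors(n, d=64):
    Z = rng.standard_normal((n, d))
    return Z / np.linalg.norm(Z, axis=1, keepdims=True)

X, Y = random_unit_descriptors(50), random_unit_descriptors(80)
print(set_similarity(X, Y), set_similarity(X, X))  # the latter is 1.0 by construction
```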

Part II: Feature-based Visual Localization - Zuzana, Torsten, Marc

The second part of this tutorial consists of three sub-parts. In the first (Camera Pose Estimation), we discuss the problem of camera pose estimation from a set of 2D-3D correspondences. We show how the pose can be estimated from a minimal set of matches, considering calibrated and uncalibrated, as well as global and rolling shutter cameras. This sub-part details the underlying mathematical models, based on solving systems of multivariate polynomial equations, and strategies to efficiently solve such systems. The techniques discussed here are also relevant for Part III.
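
As a minimal illustration of pose estimation from 2D-3D correspondences, the sketch below wraps OpenCV's P3P minimal solver in a RANSAC loop on synthetic data. The intrinsics, ground-truth pose, and noise level are assumed values chosen for the example.

```python
import numpy as np
import cv2

# Assumed pinhole intrinsics (fx, fy, cx, cy) for a calibrated camera.
K = np.array([[1000.0, 0.0, 640.0],
              [0.0, 1000.0, 360.0],
              [0.0, 0.0, 1.0]])

# Synthetic ground-truth pose, used only to generate example correspondences.
rvec_gt = np.array([[0.1], [-0.2], [0.05]])
tvec_gt = np.array([[0.3], [-0.1], [4.0]])

points_3d = np.random.uniform(-1.0, 1.0, (100, 3))
points_2d, _ = cv2.projectPoints(points_3d, rvec_gt, tvec_gt, K, None)
points_2d = points_2d.reshape(-1, 2) + np.random.normal(0.0, 0.5, (100, 2))  # pixel noise

# RANSAC around a minimal P3P solver: sample few matches, solve, count inliers.
ok, rvec, tvec, inliers = cv2.solvePnPRansac(
    points_3d, points_2d, K, distCoeffs=None,
    flags=cv2.SOLVEPNP_P3P, reprojectionError=3.0)
print("success:", ok, "#inliers:", len(inliers), "t:", tvec.ravel())
```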

The second sub-part (Feature-based Visual Localization) discusses how to establish the 2D-3D matches needed for camera pose estimation based on local features and a given 3D model of the scene. Its topics are: (1) using image retrieval techniques and coarse localization in the context of large-scale (i.e., city-scale) localization and long-term localization; (2) the challenges of feature matching under changing conditions, e.g., day-night or seasonal changes; and (3) using machine learning to learn robust and reliable local features as well as strategies for feature matching and outlier filtering. In addition, we briefly explain how feature-based localization can be implemented efficiently on mobile devices with limited compute and memory capabilities, e.g., smartphones.
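
One standard strategy for outlier filtering in descriptor matching, Lowe's ratio test, can be sketched in a few lines; the random descriptors are stand-ins, and the 0.8 threshold is a common but assumed choice.

```python
import numpy as np

def match_with_ratio_test(query_desc, model_desc, ratio=0.8):
    """Match query image descriptors to 3D-model point descriptors.

    Keeps a match only if the nearest model descriptor is clearly closer than
    the second nearest (Lowe's ratio test), rejecting ambiguous matches.
    Returns (query_index, model_index) pairs for pose estimation.
    """
    # Pairwise squared Euclidean distances, shape (num_query, num_model).
    d2 = (np.sum(query_desc ** 2, axis=1)[:, None]
          + np.sum(model_desc ** 2, axis=1)[None, :]
          - 2.0 * query_desc @ model_desc.T)
    order = np.argsort(d2, axis=1)
    best, second = order[:, 0], order[:, 1]
    rows = np.arange(len(query_desc))
    keep = (np.sqrt(np.maximum(d2[rows, best], 0))
            < ratio * np.sqrt(np.maximum(d2[rows, second], 0)))
    return np.stack([rows[keep], best[keep]], axis=1)

rng = np.random.default_rng(0)
query = rng.standard_normal((500, 128))   # descriptors from the query image
model = rng.standard_normal((5000, 128))  # descriptors attached to 3D model points
print(match_with_ratio_test(query, model).shape)  # (num_matches, 2)
```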

The third sub-part (Privacy-Preserving Localization) focuses on the topic of privacy in the context of visual localization. Given the emergence of localization services such as the AR cloud, this sub-part discusses privacy-preserving representations for the 3D scene models required for feature-based localization. In addition, it introduces approaches for privacy-preserving queries to an external localization service as well as privacy-preserving 3D reconstruction.
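
The sketch below illustrates the core idea behind one family of privacy-preserving scene representations: lifting each 3D map point to a random 3D line through it, so that the scene geometry cannot be read off the stored map while geometric constraints for pose estimation are preserved. This is a simplified sketch of the idea only; the offset range is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(0)

def lift_points_to_lines(points_3d):
    """Replace each 3D map point with a random 3D line passing through it.

    Each line is stored as (a point on the line, a unit direction). The stored
    point is offset along the line, so the original point position cannot be
    recovered from the map alone.
    """
    directions = rng.standard_normal(points_3d.shape)
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)
    offsets = rng.uniform(-1.0, 1.0, (len(points_3d), 1))  # arbitrary obfuscation range
    return points_3d + offsets * directions, directions

map_points = rng.uniform(-5.0, 5.0, (1000, 3))
line_points, line_directions = lift_points_to_lines(map_points)
```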

Part III: Learning-based Visual Localization - Eric

This part covers two popular learning-based approaches to localization: scene coordinate regression and direct pose regression.

Scene coordinate regression methods predict image-to-scene correspondences densely or semi-densely for a query image as a replacement for discrete feature matching. Standard pose estimation strategies (covered in the previous part of the tutorial) yield the final pose estimate, and differentiable approximations to said strategies, e.g. RANSAC or PnP solvers, allow for end-to-end training. We cover random forest-based and neural network-based approaches and discuss their advantages and disadvantages. We also discuss methods that adapt pre-trained representations to new environments, or dispense with learning scene-specific representations altogether.
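
The following PyTorch sketch shows the basic shape of a neural scene coordinate regressor: a small fully convolutional network predicting one 3D scene coordinate per output cell, trained here with a plain L1 loss against ground-truth coordinates. The architecture and loss are illustrative; real systems use deeper backbones and a differentiable RANSAC/PnP stage on top for end-to-end training.

```python
import torch
import torch.nn as nn

class SceneCoordinateNet(nn.Module):
    """Minimal fully convolutional scene coordinate regressor (illustrative)."""

    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(128, 3, 1),  # one 3D scene coordinate per output cell
        )

    def forward(self, image):
        return self.net(image)

model = SceneCoordinateNet()
image = torch.rand(1, 3, 480, 640)
coords = model(image)              # (1, 3, 60, 80): semi-dense 2D-3D correspondences
targets = torch.rand_like(coords)  # ground truth, e.g. rendered from an SfM model
loss = (coords - targets).abs().mean()  # simple L1 regression loss for initialization
loss.backward()
```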

Direct pose regression methods come in two flavours, absolute pose regression and relative pose regression. Absolute pose regression methods learn a scene-specific mapping from visual appearance to a 6DoF pose. Relative pose regression methods learn a mapping from pairs of images to the relative pose between them, where the mapping is not scene-specific. We discuss important design choices for direct pose regression methods, such as the network architecture, the choice of pose parametrization, and the design of the loss function.
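
As an example of the loss design discussed here, the sketch below implements a PoseNet-style absolute pose regression loss combining position and quaternion errors; the weighting factor beta is scene-dependent, and the value used is purely illustrative.

```python
import torch

def pose_regression_loss(t_pred, q_pred, t_gt, q_gt, beta=250.0):
    """PoseNet-style loss: position error plus beta-weighted rotation error.

    beta balances meters against quaternion units and must be tuned per scene;
    later variants learn this weighting or use a geometric reprojection loss.
    """
    q_pred = q_pred / q_pred.norm(dim=-1, keepdim=True)  # map output to a unit quaternion
    t_loss = (t_pred - t_gt).norm(dim=-1)
    q_loss = (q_pred - q_gt).norm(dim=-1)
    return (t_loss + beta * q_loss).mean()

t_pred = torch.randn(8, 3, requires_grad=True)  # network position output
q_pred = torch.randn(8, 4, requires_grad=True)  # network quaternion output
t_gt = torch.randn(8, 3)
q_gt = torch.randn(8, 4)
q_gt = q_gt / q_gt.norm(dim=-1, keepdim=True)
pose_regression_loss(t_pred, q_pred, t_gt, q_gt).backward()
```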

Finally, we cover open problems of current state-of-the-art learning-based approaches and compare them to classical feature-based approaches. We discuss the limited accuracy of direct pose regression, the limited scalability of scene coordinate regression, and the considerable training times of both. We compare the scene compression properties of various approaches and discuss what ground-truth data is required to train the various methods.