Course Description

Visual localization is the problem of estimating the position and orientation from which an image was taken, i.e., its corresponding camera pose, with respect to some scene representation. Solving the visual localization problem is a fundamental step in many Computer Vision applications, including robotics, autonomous vehicles (self-driving cars), Augmented / Mixed / Virtual Reality, loop closure detection in SLAM, and Structure-from-Motion.

This tutorial covers the state of the art in visual localization algorithms as well as their current shortcomings and open research problems. The course is divided into three parts: The first part covers classical feature-based approaches. Using hand-crafted or learned local feature descriptors, these methods first establish correspondences between features found in a query image and 3D points in a 3D scene model. The resulting 2D-3D matches are then used to accurately estimate the camera pose of the query image. The second part then discusses learning-based approaches that replace parts of the localization pipeline, or the pipeline as a whole, with learned alternatives. The final part of the tutorial compares traditional feature-based and learning-based methods. It analyzes their shortcomings and discusses open problems and current trends in visual localization. As such, its intention is to help young researchers identify promising research questions.

One focus of the tutorial is to discuss the relationship between the different approaches, thus serving as an introduction to this research field.

Part I: Current State of Feature-based Localization

Assuming that the scene is represented by a 3D Structure-from-Motion model, the full 6 degree-of-freedom pose of a query image can be estimated very precisely [42]. State-of-the-art approaches for feature-based visual localization [11,25,26,35,40,50,56] compute the pose from 2D-3D correspondences between 2D features in the query image and 3D points in the model. These correspondences are determined by matching the descriptors of the query features against descriptors associated with the 3D points. Such feature-based approaches are covered in the first part of the tutorial, which consists of three subparts.
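
To make this pipeline concrete, the following Python sketch shows one possible minimal implementation using OpenCV: query descriptors are matched against the descriptors of the 3D points, and the pose is then estimated with RANSAC-based PnP. It illustrates the general idea rather than any specific method cited above; the inputs points_3d, point_descriptors, and the intrinsics matrix K are assumed to come from an existing Structure-from-Motion model.

import numpy as np
import cv2

def localize(query_img, points_3d, point_descriptors, K):
    # points_3d: (N, 3) float32 array of model points; point_descriptors:
    # (N, 128) float32 array of descriptors associated with them (assumed).

    # 1. Detect and describe local features in the query image.
    sift = cv2.SIFT_create()
    keypoints, query_desc = sift.detectAndCompute(query_img, None)

    # 2. Match query descriptors against the 3D point descriptors
    #    (a brute-force matcher stands in for the prioritized schemes of Part I).
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    matches = matcher.knnMatch(query_desc, point_descriptors, k=2)

    # 3. Lowe's ratio test keeps only distinctive 2D-3D matches.
    pts_2d, pts_3d = [], []
    for m, n in matches:
        if m.distance < 0.8 * n.distance:
            pts_2d.append(keypoints[m.queryIdx].pt)
            pts_3d.append(points_3d[m.trainIdx])

    # 4. Robust camera pose estimation from the 2D-3D correspondences (PnP + RANSAC).
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        np.array(pts_3d, dtype=np.float32),
        np.array(pts_2d, dtype=np.float32),
        K, None, reprojectionError=8.0)
    return ok, rvec, tvec, inliers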

The first subpart gives a brief introduction to the basic building blocks of feature-based methods: Local features (both hand-crafted and learned), data structures for descriptor matching [34], and algorithms for camera pose estimation from 2D-3D correspondences [6,16–18,42].

The second subpart focuses on efficient feature-based localization. We discuss prioritized matching schemes that enable state-of-the-art localization systems to efficiently handle 3D models consisting of millions of 3D points by only considering features and 3D points that are likely to yield a match [11,26,40]. This includes detailing how to exploit existing visibility information between 3D points in the model and the database images used to reconstruct the scene for both the matching [11,26,28,40] and pose estimation [25,35,40,56] stages of localization. We also describe real-time localization algorithms for mobile devices such as drones or tablets [27,29,33].
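
The sketch below illustrates the core idea behind prioritized matching with a simple priority queue and early termination. It is a schematic illustration in the spirit of these methods, not a re-implementation of any of them; the helpers match_cost, candidates_for, and try_match are hypothetical stand-ins, e.g., for a vocabulary-based candidate lookup and a ratio test.

import heapq

def prioritized_matching(query_features, match_cost, candidates_for,
                         try_match, enough=100):
    # Process features in order of expected matching cost (cheapest first);
    # the index i only breaks ties between equal costs.
    queue = [(match_cost(f), i, f) for i, f in enumerate(query_features)]
    heapq.heapify(queue)

    matches = []
    while queue and len(matches) < enough:
        _, _, feature = heapq.heappop(queue)
        m = try_match(feature, candidates_for(feature))  # e.g., ratio test
        if m is not None:
            matches.append(m)

    # Early termination: once enough 2D-3D matches are found, the remaining
    # features are never considered, which keeps matching efficient for
    # models with millions of points.
    return matches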

The final subpart covers scalable feature-based methods [25,28,43,50,56]. We will analyze why the efficient algorithms covered in the previous subpart fail to scale to larger or more complex scenes. We will then introduce approaches able to scale to city-scale scenes through advanced camera pose estimation techniques [25,50,56] and intermediate image retrieval steps [35,43]. In addition, we will discuss approaches aimed at reducing memory requirements by 3D model compression [9,26] or by re-computing the model on the fly [43].
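
The following sketch illustrates how an intermediate retrieval step can restrict matching to a relevant part of a large scene: database images are ranked by global descriptor similarity, and only the 3D points they observe are kept as matching candidates. The inputs db_global and points_seen_by are assumed to be precomputed from the Structure-from-Motion model; the methods cited above differ in their retrieval and pose estimation details.

import numpy as np

def candidate_points(query_global, db_global, points_seen_by, top_k=20):
    # db_global: (N, D) array with one global descriptor per database image;
    # points_seen_by[i]: ids of the 3D points observed by database image i
    # in the SfM model (both assumed to exist).

    # Rank database images by cosine similarity to the query descriptor.
    sims = db_global @ query_global / (
        np.linalg.norm(db_global, axis=1) * np.linalg.norm(query_global) + 1e-8)
    nearest = np.argsort(-sims)[:top_k]

    # Only the 3D points visible in the retrieved images are considered
    # during 2D-3D matching, which keeps matching tractable at city scale.
    ids = set()
    for i in nearest:
        ids.update(points_seen_by[i])
    return ids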

Part II: Current State of Learning-based Localization

The second part of the tutorial covers learning-based algorithms for visual localization. In particular, we discuss two popular approaches: Methods based on scene coordinate regression train random forests or neural networks to predict a 3D point coordinate for each pixel in an image [3–5,10,15,30,48,54]. Standard pose estimation strategies (covered in the first part of the tutorial) are then used to compute the camera pose from the resulting set of 2D-3D matches. While these approaches replace the matching part of visual localization algorithms with a learned alternative, camera pose regression techniques [12,19–21,55] replace the full localization pipeline with machine learning techniques. This part of the tutorial covers both approaches and consists of three subparts.

The first subpart serves as an introduction to the machine learning concepts used in this part of the tutorial: Random forests and convolutional neural networks.

The second subpart focuses on approaches for direct camera pose regression. These methods, popularized by PoseNet [21], train a convolutional neural network that takes an image as an input and outputs an estimate of its camera pose. Compared to feature-based methods, which use an explicit representation of the scene in the form of a 3D model, these approaches implicitly represent the scene via the weights of their neural networks. We will discuss the various architectures [12, 21, 55] and loss functions [20, 21] used by state-of-the-art methods and compare them against feature-based approaches on standard benchmark datasets.
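
As an illustration, the following PyTorch sketch shows a small PoseNet-style regressor and a simple fixed-weight loss between position and orientation errors. It is a simplified stand-in under assumed choices (a randomly initialized ResNet-18 backbone, an arbitrary weight beta), not the architecture or loss of any particular paper; the papers above study more sophisticated architectures and loss formulations.

import torch
import torch.nn as nn
import torchvision.models as models

class PoseRegressor(nn.Module):
    # A CNN backbone regresses a 3D translation and a 4D quaternion,
    # so the scene is represented implicitly by the network weights.
    def __init__(self):
        super().__init__()
        backbone = models.resnet18()         # randomly initialized backbone
        backbone.fc = nn.Identity()          # reuse features, drop classifier
        self.backbone = backbone
        self.fc_t = nn.Linear(512, 3)        # camera position
        self.fc_q = nn.Linear(512, 4)        # orientation as a quaternion

    def forward(self, img):
        f = self.backbone(img)
        t = self.fc_t(f)
        q = self.fc_q(f)
        q = q / q.norm(dim=-1, keepdim=True)  # normalize to a unit quaternion
        return t, q

def pose_loss(t, q, t_gt, q_gt, beta=250.0):
    # Simple fixed weighting between position and orientation errors;
    # beta is an assumed hyperparameter for this sketch.
    return (t - t_gt).norm(dim=-1).mean() + beta * (q - q_gt).norm(dim=-1).mean()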

The final subpart focuses on localization approaches based on scene coordinate regression. Given dense depth maps for each training image, these methods train random forests or neural networks to directly predict the 3D point position corresponding to each pixel in an image. We will cover both random forest-based [4,15,48,54] and neural network-based [3,5] approaches and discuss their advantages and disadvantages [30]. In addition, we will show how the learned representations can be adapted to new environments on the fly [10] and how they can be trained without dense depth maps [5]. Finally, we will compare these methods against feature-based and pose regression-based methods on standard datasets.
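
The sketch below outlines the two stages in simplified form: a small fully convolutional network (far smaller than the networks used in practice) predicts a 3-channel map of scene coordinates, and the resulting dense 2D-3D matches are passed to the same RANSAC-based PnP solver used by feature-based methods. The intrinsics K, the output stride, and the conversion of the network output to a NumPy array are assumptions of this sketch.

import numpy as np
import torch.nn as nn
import cv2

class SceneCoordNet(nn.Module):
    # Maps an RGB image to a 3-channel map whose channels are the predicted
    # X, Y, Z scene coordinates of each output cell (a toy architecture).
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(128, 3, 1))

    def forward(self, img):
        return self.net(img)

def pose_from_scene_coords(coord_map, K, stride=4):
    # coord_map: (3, h, w) NumPy array of predicted scene coordinates, where
    # each cell corresponds to a stride x stride patch of the input image.
    _, h, w = coord_map.shape
    ys, xs = np.mgrid[0:h, 0:w]
    pts_2d = np.stack([xs, ys], axis=-1).reshape(-1, 2) * stride + stride / 2.0
    pts_3d = coord_map.reshape(3, -1).T

    # Robust pose estimation from the dense 2D-3D matches, as in Part I.
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        pts_3d.astype(np.float32), pts_2d.astype(np.float32), K, None)
    return ok, rvec, tvec, inliers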

Part III: Current Topics & Open Problems

The last part of this tutorial focuses on shortcomings and limitations of the previously presented approaches, as well as on currently open problems in the area of visual localization. As such, its intention is to help young researchers identify promising research questions. The part is divided into four subparts.

The first subpart discusses failure cases of feature-based and learning-based approaches. It is intended to give the audience a clear picture of what current state-of-the-art algorithms can and cannot do.

The second subpart focuses on the problem of long-term localization, i.e., the problem of localizing query images taken under widely different conditions against a reference model built from images taken under a single condition. We will introduce a recent benchmark dataset for long-term localization [41] and discuss approaches that try to solve long-term localization by integrating higher-level scene understanding (in the form of semantic segmentation) into the localization pipeline [46,49,52].
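
As a toy illustration of the last point, the sketch below filters 2D-3D matches by semantic consistency: each 3D point carries a label inherited from the mapping images, and matches whose query-pixel label disagrees are discarded. This is a schematic illustration of the general idea, not the actual mechanism of the cited methods; matches, query_labels, and point_labels are assumed to be available.

def semantic_filter(matches, query_labels, point_labels):
    # matches: list of ((x, y) pixel, 3D point id) pairs from descriptor
    # matching; query_labels: (H, W) label map from a semantic segmentation
    # of the query image; point_labels[pid]: label of 3D point pid (assumed).
    kept = []
    for (x, y), pid in matches:
        if query_labels[int(y), int(x)] == point_labels[pid]:
            kept.append(((x, y), pid))
    return kept  # semantically consistent matches, passed on to PnP + RANSAC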

As shown in the second part of the tutorial, pose regression-based methods perform significantly worse than algorithms that use (explicit or implicit) knowledge of the 3D scene structure. The third subpart is dedicated to analyzing the performance of pose regression methods in more detail. The final subpart discusses current learning-based methods and details why they struggle to handle larger or more complex scenes.