Given a database of geo-tagged images or images of known places, the goal of visual place recognition algorithms is to determine the place depicted in a new query image. Traditionally, this problem is solved by transferring the geo-tags or place identities of the most similar database images to the query image. Highly related to the visual place recognition problem is the task of visual localization: Given a scene representation computed from a database of geo-tagged images, e.g., a 3D model recovered via Structure-from-Motion, visual localization approaches aim to estimate the full 6 Degree-of-Freedom (6DOF) pose of a query image, i.e., the position and orientation from which the image was taken. Both place recognition and visual localization are fundamental steps in many Computer Vision applications, including robotics, autonomous vehicles (self-driving cars), Augmented / Mixed / Virtual Reality, loop closure detection in SLAM, and Structure-from-Motion.
This tutorial covers the state-of-the-art in place recognition and visual localization, with three goals: 1) Provide a comprehensive overview of the current state-of-the-art, aimed at first- and second-year PhD students and engineers from industry who are getting started with or are interested in this topic. 2) Have experts teach the tricks of the trade to more experienced PhD students and engineers who want to refine their knowledge of place recognition and localization. 3) Highlight current open challenges in place recognition and localization, outlining what current algorithms can and cannot do.
The tutorial consists of four parts, with the first two parts dedicated to place recognition and the last two focused on localization: The first part discusses classical visual place recognition techniques based on local features. The second part explains recent learning-based place recognition techniques and highlights their advantages and disadvantages compared to classical approaches. The third part discusses visual localization approaches based on local features that use an explicit 3D representation of the scene. The fourth part focuses on learning-based localization algorithms that represent the scene implicitly through the weights of a neural network. It also includes a discussion of the advantages and disadvantages of learned over classical approaches. In all parts, we will discuss open problems and challenges. Throughout the tutorial, we provide links to publicly available source code for the discussed approaches. We will also highlight the different properties of the datasets commonly used for experimental evaluation.
The first part of the tutorial, covering visual place recognition, considers an application scenario in which the scene is represented by a set of geo-tagged images. The aim of visual place recognition approaches is to determine which place is visible in an image taken by a user. Traditionally, this has been modeled as an image retrieval problem [5,22,25,29,44,64,77,94,95,99,101], enabling the use of efficient and scalable retrieval approaches [60,63,81].
This part introduces classical image retrieval methods derived from the Bag-of-Words (BoW) model [81]. These typically employ local feature detectors and descriptors, and perform indexing and matching either at the local-feature level with inverted files [4,36,62,88] or by constructing a compact global image representation such as VLAD [26,37,91] or Fisher vectors [38]. A range of such techniques can be cast under the same generic framework of feature embedding, pooling, and matching [88]. Classical retrieval approaches are adapted to the place recognition problem, where confusing features [44] and repetitive scene structures [35,95] are rather common and need to be handled explicitly. Other methods aim to better distinguish between different places, e.g., by learning the appearance of different parts of the scene [19,25,29] or by identifying structures unique to certain areas [71,77].
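To make the aggregation step concrete, the following sketch illustrates VLAD-style pooling of local descriptors into a single global vector. The vocabulary, array shapes, and normalization choices are assumptions for illustration, not a reproduction of any specific published implementation.

    import numpy as np

    def vlad(descriptors, centroids):
        """descriptors: (n, d) local descriptors; centroids: (k, d) visual vocabulary."""
        k, d = centroids.shape
        # Assign each descriptor to its nearest visual word.
        dists = np.linalg.norm(descriptors[:, None, :] - centroids[None, :, :], axis=2)
        assignments = dists.argmin(axis=1)
        # Accumulate residuals (descriptor minus centroid) per visual word.
        v = np.zeros((k, d))
        for word in range(k):
            members = descriptors[assignments == word]
            if len(members) > 0:
                v[word] = (members - centroids[word]).sum(axis=0)
        # Signed square-root and L2 normalization, as commonly used in practice.
        v = v.reshape(-1)
        v = np.sign(v) * np.sqrt(np.abs(v))
        return v / (np.linalg.norm(v) + 1e-12)

Two images can then be compared by the dot product of their VLAD vectors, which makes large-scale retrieval a nearest-neighbor search problem.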
The high performance of methods based on local features comes at the cost of high computational complexity and/or large memory requirements. Recent advances in image retrieval show that a CNN-based representation constitutes a compact and effective global image descriptor. We will first present approaches that employ Convolutional Neural Networks (CNNs) pre-trained for classification and directly apply them to retrieval and location recognition [6,27,40,69,93]. The achieved performance indicates good generalization properties; however, fine-tuning on a landmark dataset can help significantly [7]. This improvement comes at the cost of additional human annotation of datasets better suited to the target task. We will then discuss ways to dispense with the need for such annotation. Weak supervision is achieved through geo-tagged databases and allows training a VLAD layer on top of CNN activations in an end-to-end manner [3], while unsupervised fine-tuning is achieved by exploiting BoW and classical geometric matching to automatically mine hard training data [28,66]. We will discuss how CNN-based representations are actually cast under the same framework as classical approaches: instead of a set of sparse local features, a set of dense CNN-based features is now matched in different ways. This inspires the transfer of classical methods to CNN-based ones, such as VLAD to NetVLAD [3].
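As a concrete illustration of an off-the-shelf CNN-based global descriptor, the sketch below max-pools the convolutional activations of a classification backbone. The specific backbone, pooling choice, and normalization are illustrative assumptions; fine-tuned descriptors and NetVLAD-style pooling follow the same pattern with a different aggregation layer.

    import torch
    import torchvision

    # Backbone pre-trained for ImageNet classification, used here without fine-tuning.
    backbone = torchvision.models.resnet18(weights="IMAGENET1K_V1")
    features = torch.nn.Sequential(*list(backbone.children())[:-2])  # drop avgpool and fc
    features.eval()

    @torch.no_grad()
    def global_descriptor(image):
        """image: (3, H, W) tensor, already resized and normalized for the backbone."""
        fmap = features(image.unsqueeze(0))       # (1, C, h, w) convolutional activations
        desc = fmap.amax(dim=(2, 3)).squeeze(0)   # max-pool over the spatial dimensions
        return desc / desc.norm()                 # L2-normalize for cosine-similarity search

Place recognition then reduces to a nearest-neighbor search over the database descriptors.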
Finally, we will present recent results at the intersection of classical and deep learning approaches [61,65]. Local feature indexing and geometric matching appear to be directly applicable on top of local features learned via deep learning [61]. This gives rise to interesting open problems related to better adapting the features to the indexing scheme or vice versa, which will be discussed in the tutorial.
Assuming that the scene is represented by a 3D Structure-from-Motion model, the full 6DOF pose of a query image can be estimated very precisely [74]. Approaches for feature-based visual localization [23,48,49,70,72,83,100] compute the pose from 2D-3D matches between 2D features in the query image and 3D points in the model. These correspondences are determined by matching the descriptors of the query features against descriptors associated with the 3D points. Such feature-based approaches are covered in the third part of the tutorial, consisting of four subparts. The first subpart gives a brief introduction to the basic building blocks of feature-based methods: Local features (both hand-crafted and learned), data structures for descriptor matching [58], and algorithms for camera pose estimation from 2D-3D correspondences [15,31,32,39,74].
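To make the pose estimation building block concrete, the following sketch assumes that 2D-3D matches have already been established by descriptor matching and recovers the 6DOF pose with a minimal solver inside a RANSAC loop. The OpenCV calls are standard, but the thresholds and the assumption of undistorted observations are illustrative.

    import numpy as np
    import cv2

    def estimate_pose(points_2d, points_3d, K):
        """points_2d: (n, 2) query image points; points_3d: (n, 3) model points; K: 3x3 intrinsics."""
        ok, rvec, tvec, inliers = cv2.solvePnPRansac(
            np.asarray(points_3d, dtype=np.float64),
            np.asarray(points_2d, dtype=np.float64),
            K, None,                      # distortion coefficients (assume undistorted observations)
            reprojectionError=8.0,        # inlier threshold in pixels
            iterationsCount=1000,
            flags=cv2.SOLVEPNP_P3P,       # minimal solver used inside the RANSAC loop
        )
        if not ok:
            return None
        R, _ = cv2.Rodrigues(rvec)        # rotation vector -> 3x3 rotation matrix
        return R, tvec, inliers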
The second subpart focuses on efficient feature-based localization. We discuss prioritized matching schemes that enable state-of-the-art localization systems to efficiently handle 3D models consisting of millions of 3D points by only considering features and 3D points that are likely to yield a match [23,49,72]. This includes detailing how to exploit existing visibility information between 3D points in the model and the database images used to reconstruct the scene for both the matching [23,49,52,72] and pose estimation [48,70,72,100] stages of localization. We also describe real-time localization algorithms for mobile devices such as drones or tablets [51,53,57].
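The sketch below illustrates the prioritization idea in a simplified form: query features assigned to visual words that index few 3D points are matched first, and matching terminates early once enough correspondences are found. The data layout, ratio-test threshold, and termination criterion are assumptions for illustration and do not reproduce any specific published system.

    import numpy as np

    def prioritized_matching(query_descs, query_words, points_per_word,
                             max_matches=100, ratio=0.8):
        """query_descs: (n, d) descriptors; query_words: (n,) visual-word ids;
        points_per_word: dict mapping word id -> (list of 3D point ids, (m, d) point descriptors)."""
        # Process cheap features first: cost = number of 3D point descriptors under the feature's word.
        order = sorted(range(len(query_words)),
                       key=lambda i: len(points_per_word.get(query_words[i], ([], None))[0]))
        matches = []
        for i in order:
            point_ids, descs = points_per_word.get(query_words[i], ([], None))
            if descs is None or len(point_ids) < 2:
                continue
            dists = np.linalg.norm(descs - query_descs[i], axis=1)
            best, second = np.partition(dists, 1)[:2]
            if best < ratio * second:                      # Lowe's ratio test
                matches.append((i, point_ids[int(dists.argmin())]))
            if len(matches) >= max_matches:
                break                                      # early termination bounds the runtime
        return matches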
The third subpart covers scalable feature-based methods [48,52,75,83,100]. We will analyze why the efficient algorithms covered in the previous subpart fail to scale to larger or more complex scenes. We will then introduce approaches able to scale to city-scale scenes through advanced camera pose estimation techniques [48,83,100] and intermediate image retrieval steps [70,75]. In addition, we will discuss approaches aimed at reducing memory requirements by 3D model compression [17,20,49] or by re-computing the model on the fly [75].
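One of the retrieval-based strategies for scaling up can be sketched as follows: a global descriptor shortlists the most similar database images, and 2D-3D matching is restricted to the 3D points visible in that shortlist. The data structures and the choice of k are assumptions for illustration; the subsequent matching and pose estimation steps are as sketched above.

    import numpy as np

    def shortlist_candidate_points(query_desc, db_descs, points_seen_by, k=10):
        """query_desc: (D,) global descriptor; db_descs: (N, D) L2-normalized database descriptors;
        points_seen_by[i]: set of 3D point ids visible in database image i."""
        sims = db_descs @ query_desc                 # cosine similarity to all database images
        shortlist = np.argsort(-sims)[:k]            # top-k most similar database images
        # Restrict subsequent 2D-3D matching to the points observed in the shortlisted images.
        return set().union(*(points_seen_by[i] for i in shortlist))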
The last subpart focuses on the largely open problem of long-term localization, i.e., the problem of localizing query images taken under widely different conditions against a reference model built from images taken under a single condition. We will introduce a recent benchmark dataset for long-term localization [73] and discuss approaches that try to solve long-term localization by integrating higher-level scene understanding (in the form of semantic segmentation) into the localization pipeline [79,82,85].
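One simple way in which semantics can enter the pipeline, shown as a hedged sketch below, is to discard 2D-3D matches whose semantic labels disagree before pose estimation. The label sources (a segmentation network for the query, class labels attached to 3D points) and the hard consistency check are illustrative assumptions rather than a specific published method.

    def semantically_consistent(matches, pixel_labels, point_labels):
        """matches: list of (feature_index, point_id); pixel_labels[i] / point_labels[p]:
        semantic class ids for query features and 3D points, respectively."""
        # Keep only correspondences whose query-side and model-side labels agree.
        return [(i, p) for i, p in matches if pixel_labels[i] == point_labels[p]]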
The last part of the tutorial covers learning-based algorithms for visual localization. Compared to feature-based methods, which use an explicit representation of the scene in the form of a 3D model, these approaches implicitly represent the scene via the weights of their neural networks. We cover two popular approaches:
The first line of research asks whether the full, classical localization pipeline can be replaced by machine learning techniques. Such approaches regress the 6DOF camera pose directly from an input image using a neural network [8,14,24,41–43,98]. In this part of the tutorial, we discuss how prior knowledge about the task can be incorporated, to a certain extent, into the network architecture [8,14,24,43,68,96,98], the choice of pose parametrization [14,43], and the design of the loss function [8,14,42,43].
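As an illustration of the loss design question, the sketch below shows a pose regression loss over a predicted 3D position and unit quaternion, with learned log-scale weights balancing the two terms. The parametrization and the specific weighting scheme are assumptions chosen to illustrate the idea, not a faithful reproduction of any single cited method.

    import torch
    import torch.nn as nn

    class PoseRegressionLoss(nn.Module):
        """Balances position and orientation errors with learnable log-scale weights."""
        def __init__(self):
            super().__init__()
            self.s_xyz = nn.Parameter(torch.tensor(0.0))    # log-scale for the position error
            self.s_quat = nn.Parameter(torch.tensor(-3.0))  # log-scale for the orientation error

        def forward(self, pred_xyz, pred_quat, gt_xyz, gt_quat):
            pred_quat = pred_quat / pred_quat.norm(dim=-1, keepdim=True)  # enforce unit quaternion
            loss_xyz = (pred_xyz - gt_xyz).norm(dim=-1).mean()
            loss_quat = (pred_quat - gt_quat).norm(dim=-1).mean()
            return (loss_xyz * torch.exp(-self.s_xyz) + self.s_xyz
                    + loss_quat * torch.exp(-self.s_quat) + self.s_quat)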
The second line of research keeps the overall strategy of classical visual localization, and replaces only parts with learned alternatives. In particular, methods based on scene coordinate regression learn to predict a 3D point coordinate densely, for each pixel in an image [11–13,21,30,54,80,97]. Standard pose estimation strategies (covered in the previous parts of the tutorial) are then used to compute the camera pose from the resulting set of 2D-3D matches. We will cover random forest-based [12,30,80,97] and, to a larger extent, neural network-based [11,13] approaches and discuss their advantages and disadvantages [54]. We explain how to train scene coordinate regression from training images with dense depth maps or a 3D model of the environment, or just from a collection of RGB images with annotated poses [13]. In addition, we will show how the learned representations can be adapted to new environments on the fly [21], and how robust RANSAC-based pose optimization can be made differentiable for end-to-end training [11,13].
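The following sketch illustrates the scene coordinate regression idea: a small fully convolutional network predicts one 3D scene coordinate per output cell, and training minimizes the distance to ground-truth coordinates where they are available. The architecture, output resolution, and loss are illustrative assumptions; at test time, the predicted 2D-3D matches would feed a PnP-RANSAC pose solver as covered earlier.

    import torch
    import torch.nn as nn

    class SceneCoordNet(nn.Module):
        """Predicts one (x, y, z) scene coordinate per 8x8 pixel cell of the input image."""
        def __init__(self):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(128, 256, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(256, 3, 1),              # 3 output channels: one scene coordinate per cell
            )

        def forward(self, image):                  # image: (B, 3, H, W)
            return self.net(image)                 # coords: (B, 3, H/8, W/8)

    def coord_loss(pred_coords, gt_coords, valid_mask):
        """Euclidean error to ground-truth coordinates, averaged over valid cells only."""
        err = (pred_coords - gt_coords).norm(dim=1)            # (B, h, w) per-cell error
        return (err * valid_mask).sum() / valid_mask.sum().clamp(min=1)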
Finally, we will compare these approaches against feature-based and pose regression-based methods on standard datasets. When discussing the results, we cover open problems of current state-of-the-art learning-based approaches. These include the limited accuracy of direct pose regression, the scalability of scene coordinate regression, and considerable training times in general. We discuss the importance of modeling ambiguities in the scene representation, and how to exploit global image context while maintaining good generalization.