Tutorial Syllabus

The tutorial consists of two parts covering the general problem of visual place recognition and the more specific task of image-based localization. Each part first discusses classical approaches based on local image features and then introduces recent advances made by employing deep learning. Throughout the tutorial, we provide links to publicly available source code for the discussed approaches. We will also highlight the different properties of the datasets commonly used for experimental evaluation and provide pointers to the publicly available datasets.

Part 1: Visual place recognition

The first part of the tutorial, covering visual place recognition, looks at an application scenario in which the scene is represented by a set of geo-tagged images. The aim of visual place recognition approaches is to determine which place is visible in an image taken by a user. Traditionally, this has been modeled as an image retrieval problem [2,9,13,17,31,41,50,64,65,67,69], enabling the use of efficient and scalable retrieval approaches [38,40,51]. Consequently, we first introduce standard image retrieval methods based on local features, such as Bag-of-Words (BoW) [51], VLAD [14,23], and Fisher vectors [24].
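
To make the aggregation idea concrete, the following is a minimal NumPy sketch of VLAD [14,23]: each local descriptor is assigned to its nearest visual word, and the residuals to the word centroids are accumulated and normalized. This is a simplified illustration, not the exact pipeline of any cited system; the codebook is assumed to have been learned beforehand (e.g. with k-means).

```python
import numpy as np

def vlad(descriptors, centroids):
    """Aggregate local descriptors into a VLAD vector.

    descriptors: (n, d) array of local features (e.g. SIFT)
    centroids:   (k, d) visual-word codebook, assumed learned with k-means
    """
    k, d = centroids.shape
    # hard-assign each descriptor to its nearest visual word
    dists = np.linalg.norm(descriptors[:, None, :] - centroids[None, :, :], axis=2)
    assign = np.argmin(dists, axis=1)
    v = np.zeros((k, d))
    for i in range(k):
        mask = assign == i
        if mask.any():
            # accumulate residuals to the assigned centroid
            v[i] = (descriptors[mask] - centroids[i]).sum(axis=0)
    # signed square-root (power-law) and global L2 normalization
    v = np.sign(v) * np.sqrt(np.abs(v))
    v = v.flatten()
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v
```

The resulting k*d-dimensional vector can be compared across images with a simple dot product, which is what makes such global descriptors attractive for scalable retrieval.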

Next, we discuss their adaptation to the place recognition problem, where confusing features [31] and repetitive scene structures [22, 65] are rather common and need to be handled explicitly. In addition, we explain how known spatial relations between the database images [64, 67] and priors on the user’s position [9] can be incorporated into place recognition pipelines as well as how to employ synthetic views to increase the robustness against strong viewpoint and illumination changes [59].
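
As a generic illustration of the weighting ideas involved (not the specific methods of [22,31,65]), a BoW pipeline can down-weight confusing visual words via inverse document frequency and damp repeated words, a simple remedy for repetitive structures ("burstiness"). The function names below are hypothetical:

```python
import numpy as np

def bow_histogram(word_ids, vocab_size):
    # term-frequency histogram over quantized local features
    return np.bincount(word_ids, minlength=vocab_size).astype(float)

def idf_weights(db_histograms):
    # down-weight visual words that occur in many database images
    # (words seen everywhere carry little place-discriminative information)
    n = len(db_histograms)
    df = np.sum([h > 0 for h in db_histograms], axis=0)
    return np.log(n / np.maximum(df, 1.0))

def burst_normalize(h):
    # square-root damping so that a word repeated many times in one image
    # (e.g. on a repetitive facade) does not dominate the similarity score
    return np.sqrt(h)
```

Actual place recognition systems use considerably more sophisticated schemes, but the principle of re-weighting feature contributions is common to many of them.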

The high performance of methods based on local features comes at the cost of high computational complexity and large memory requirements. Recent advances in image retrieval show that a CNN-based representation constitutes a compact and effective global image descriptor. We first present approaches that employ Convolutional Neural Networks (CNNs) pre-trained for classification and directly apply them to retrieval and location recognition [3, 15, 26, 43, 56]. The achieved performance indicates good generalization properties; however, fine-tuning on a landmark dataset can significantly improve results [4]. Such improvement comes at the cost of additional human annotation of datasets that are more appropriate for the target task. We then discuss works that dispense with the need for such annotation. Weak supervision is achieved through geo-tagged databases and allows training VLAD on top of CNN activations in an end-to-end manner [1], while unsupervised fine-tuning is achieved by exploiting BoW and Structure-from-Motion (SfM) to automatically mine hard training data [42]. The first part of the tutorial concludes by showing that methods based on local features and methods based on CNNs are not two distinct research directions: geometric matching with local features is able to improve CNN training [16, 42].
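
A simple way to obtain such a global descriptor from a pre-trained CNN is channel-wise max-pooling over the last convolutional feature map (in the spirit of [43]); the sketch below assumes the feature map has already been extracted and illustrates only the pooling and ranking steps:

```python
import numpy as np

def mac_descriptor(fmap):
    """Global descriptor by max-pooling a conv feature map.

    fmap: (C, H, W) activations of a CNN's last convolutional layer
          (assumed precomputed by a pre-trained network).
    Returns an L2-normalized C-dimensional vector.
    """
    v = fmap.reshape(fmap.shape[0], -1).max(axis=1)
    return v / np.linalg.norm(v)

def retrieve(query_vec, db_vecs):
    # rank database images by cosine similarity (vectors are unit-length,
    # so a dot product suffices)
    sims = db_vecs @ query_vec
    return np.argsort(-sims)
```

Compared to BoW or VLAD over thousands of local features, a single C-dimensional vector per image drastically reduces memory and matching cost, which is the compactness argument made above.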

Part 2: Image-based / structure-based localization

While place recognition techniques aim at determining which place is visible in a given image, image-based or structure-based localization approaches try to determine the exact position and orientation from which a photo was taken. Assuming that the scene is represented by a 3D Structure-from-Motion model, the full 6 degree-of-freedom pose of a query image can be estimated very precisely [49]. State-of-the-art approaches for image-based localization [10, 32, 33, 47, 48, 52] compute the pose from 2D-3D correspondences between 2D features in the query image and 3D points in the model, which are determined through descriptor matching.
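
To illustrate the underlying geometry, the following sketch estimates a camera projection matrix from 2D-3D correspondences with the Direct Linear Transform. This is a didactic simplification: state-of-the-art localization systems instead use calibrated minimal solvers (e.g. P3P) inside RANSAC to be robust to wrong matches [18, 19, 25].

```python
import numpy as np

def estimate_projection_dlt(pts3d, pts2d):
    """Direct Linear Transform: recover the 3x4 projection matrix P
    from n >= 6 exact 2D-3D correspondences via the null space of a
    linear system (solved with SVD). Sensitive to outliers by design.
    """
    rows = []
    for (X, Y, Z), (u, v) in zip(pts3d, pts2d):
        rows.append([X, Y, Z, 1, 0, 0, 0, 0, -u * X, -u * Y, -u * Z, -u])
        rows.append([0, 0, 0, 0, X, Y, Z, 1, -v * X, -v * Y, -v * Z, -v])
    _, _, vt = np.linalg.svd(np.asarray(rows))
    # the singular vector of the smallest singular value, reshaped to 3x4
    return vt[-1].reshape(3, 4)

def project(P, pts3d):
    # project 3D points with P and dehomogenize to pixel coordinates
    homog = np.hstack([pts3d, np.ones((pts3d.shape[0], 1))])
    proj = homog @ P.T
    return proj[:, :2] / proj[:, 2:3]
```

Given a calibrated camera, the rotation and translation (the full 6 degree-of-freedom pose) can then be factored out of P.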

In the second part of this tutorial, we first introduce the standard data structures for descriptor matching [37] as well as different approaches to estimate the camera pose from the 2D-3D correspondences [6, 18, 19, 25, 49]. We then discuss prioritized matching schemes that enable state-of-the-art localization systems to efficiently handle 3D models consisting of millions of 3D points by only considering features and 3D points that are likely to yield a match [10, 33, 47, 48]. This includes detailing how to exploit existing visibility information between 3D points in the model and the database images used to reconstruct the scene for both the matching [10, 33, 48] and pose estimation [32, 48] stages of localization. Prioritized matching approaches rely on the discriminative power of individual local features. Unfortunately, individual features become less discriminative in larger scenes, where it becomes more likely that multiple scene points have a similar appearance. In order to scale to larger scenes, scalable approaches relax the feature matching criteria, thus accepting more wrong matches. They handle the resulting higher outlier ratios (sometimes up to 99% or more) by adapting RANSAC’s sampling strategy [32] or by employing deterministic geometric outlier filtering steps whose run-time does not depend on the percentage of wrong matches [52, 68]. We will discuss these advanced localization methods and will highlight their advantages and disadvantages.
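
The matching criterion at the heart of this discussion is typically Lowe's ratio test, which accepts a correspondence only if the nearest 3D-point descriptor is clearly closer than the second nearest. The brute-force sketch below illustrates the test itself; real systems replace the linear scan with kd-trees [37] or vocabulary-based prioritized search [33, 47] to handle millions of points:

```python
import numpy as np

def ratio_test_matches(query_desc, point_desc, ratio=0.8):
    """Match 2D query features to 3D-point descriptors with the ratio test.

    query_desc: (m, d) descriptors of features in the query image
    point_desc: (n, d) descriptors associated with 3D model points
    Returns a list of (query_idx, point_idx) accepted matches.
    """
    matches = []
    for i, q in enumerate(query_desc):
        d = np.linalg.norm(point_desc - q, axis=1)
        nearest = np.argsort(d)[:2]
        # accept only if the best match is clearly better than the runner-up;
        # ambiguous features (e.g. repeated structures) are rejected
        if d[nearest[0]] < ratio * d[nearest[1]]:
            matches.append((i, int(nearest[0])))
    return matches
```

Relaxing `ratio` toward 1.0 is one concrete instance of the "relaxed matching criteria" mentioned above: it yields more correct matches in large scenes, but also many more outliers, which the robust pose estimation stage then has to absorb.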

All methods discussed up to this point rely on a powerful desktop PC, both in terms of processing power and memory consumption, and are thus not applicable to mobile devices. We therefore introduce next approaches that enable large-scale localization on mobile devices such as smartphones and tablets by combining non-real-time localization methods with real-time camera pose tracking [34–36].

The second part of the tutorial then concludes by introducing deep learning-based localization approaches that completely forgo local feature matching and instead directly regress the camera pose [28, 29]. We thereby focus on the advantages and disadvantages of these recently proposed approaches with respect to traditional methods for image-based localization. We will explore the various representations and methods for learning to localize with deep learning, and we will also discuss the use of Bayesian neural networks for probabilistic localization. The tutorial closes by summarizing the advantages and disadvantages of the two approaches, visual place recognition and image-based localization, and by discussing their potential integration and fusion. We will also discuss example applications suited to either or both approaches.
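
To give a flavor of pose regression, the sketch below shows a PoseNet-style training loss [28]: a translation error plus a weighted quaternion orientation error, where the weight beta is a tuned hyper-parameter balancing the two terms. The network itself is omitted; only the loss on a single predicted pose is illustrated.

```python
import numpy as np

def pose_loss(t_pred, q_pred, t_true, q_true, beta=500.0):
    """PoseNet-style regression loss for a single pose.

    t_pred, t_true: 3D camera positions
    q_pred, q_true: rotations as quaternions (q_true assumed unit-length)
    beta:           weight trading off position vs. orientation error
    """
    # normalize the prediction so it represents a valid rotation
    q_pred = q_pred / np.linalg.norm(q_pred)
    return np.linalg.norm(t_true - t_pred) + beta * np.linalg.norm(q_true - q_pred)
```

One disadvantage discussed in this part follows directly from this formulation: unlike correspondence-based methods, the regressed pose comes with no geometric consistency check, which motivates, among other things, the Bayesian uncertainty estimates mentioned above.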