CVPR 2015 Tutorial on Large-Scale Visual Place Recognition and Image-Based Localization

Sunday, June 7th - Half Day (2pm - 6pm)
Room 204

Torsten Sattler, Akihiko Torii

Tutorial Description

The tutorial consists of two parts covering the general problem of visual place recognition and the more specific task of image-based localization. Throughout the tutorial, we provide links to publicly available source code for the discussed approaches. We will also discuss the different properties of the datasets commonly used for experimental evaluation and provide pointers to the publicly available datasets.

The first part of the tutorial, covering visual place recognition, looks at an application scenario in which the scene is represented by a set of geotagged images. The aim of visual place recognition approaches is to approximate the position of the viewer by identifying the place visible in the query image [4, 8, 10, 13, 22, 32, 39, 48, 49, 52, 53] using (image) retrieval methods [29, 31, 41]. We first introduce popular image representations such as the standard bag-of-visual-words model [41], spatial pyramids [23], FLAIR [51], VLAD [19, 11], and Fisher vectors [20]. We next discuss several improvements to the standard retrieval pipeline that detect and remove confusing features [22], exploit the known spatial relations between the images [48, 52], incorporate priors on the viewer’s position [8], and enable place recognition systems to handle the repetitive structures prevalent in urban environments [18, 49]. We present techniques aiming to better distinguish between different places, e.g., by learning the appearance of different parts of the scene [6, 10, 13] or by identifying structures unique to certain areas [39]. Finally, we discuss recent techniques that handle large appearance changes between query and database images by explicitly synthesizing images and features using 3D models [40, 3].
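
As a rough illustration of the retrieval pipeline described above, the following sketch ranks database images against a query using bag-of-visual-words histograms with TF-IDF weighting and cosine similarity. It is a minimal sketch under stated assumptions, not the implementation of any of the cited methods: the function names (`bow_histogram`, `idf_weights`, `tfidf_vector`, `rank_database`) are hypothetical, the visual vocabulary is assumed to be given (e.g., from k-means clustering of local descriptors), and descriptors are assigned to their nearest word by brute force rather than with an approximate search structure.

```python
import numpy as np


def bow_histogram(descriptors, vocabulary):
    """Count how many descriptors fall on each visual word (nearest-word assignment)."""
    # Squared Euclidean distances between descriptors (n, d) and words (k, d).
    diff = descriptors[:, None, :] - vocabulary[None, :, :]
    nearest_word = (diff ** 2).sum(axis=2).argmin(axis=1)
    return np.bincount(nearest_word, minlength=len(vocabulary)).astype(float)


def idf_weights(db_histograms):
    """Inverse document frequency of each visual word over the database images."""
    db_histograms = np.asarray(db_histograms, dtype=float)
    doc_freq = (db_histograms > 0).sum(axis=0)  # images containing each word
    return np.log(len(db_histograms) / np.maximum(doc_freq, 1.0))


def tfidf_vector(histogram, idf):
    """Apply IDF weights and L2-normalize (zero vectors are returned unchanged)."""
    vec = np.asarray(histogram, dtype=float) * idf
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec


def rank_database(query_histogram, db_histograms, idf):
    """Return database indices sorted by decreasing cosine similarity, plus the scores."""
    query_vec = tfidf_vector(query_histogram, idf)
    db_vecs = np.array([tfidf_vector(h, idf) for h in db_histograms])
    scores = db_vecs @ query_vec
    return np.argsort(-scores), scores
```

In this sketch, the geotag of the top-ranked database image would serve as the approximate position of the query.
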

Assuming that the scene is represented by a 3D Structure-from-Motion model, the full pose of the query image, i.e., its position and orientation, can be estimated very precisely [37]. State-of-the-art approaches for image-based localization [9, 25, 24, 34, 35, 43] compute the pose from 2D-3D correspondences between 2D features in the query image and 3D points in the model, which are determined through descriptor matching. In the second part of the tutorial, we first introduce the standard data structures for descriptor matching [28] as well as different approaches to estimate the camera pose from the 2D-3D correspondences [5, 14, 15, 21, 37]. We then detail the prioritized matching schemes that enable state-of-the-art localization systems to efficiently handle 3D models consisting of millions of 3D points by only considering features and 3D points that are likely to yield a match [9, 25, 34, 35]. We thereby focus on the details required for an efficient implementation of such systems. This includes discussing how to exploit existing visibility information between 3D points in the model and the database images used to reconstruct the scene for both the matching [9, 25, 35] and pose estimation [24, 35] stages of localization. Next, we explain that there is a clear limit to the scalability of methods based on prioritized matching [24, 36] and then discuss the two dominant strategies for scalable image-based localization: The first strategy employs camera pose estimation techniques that are able to handle extreme outlier ratios of 99% or more [24, 43], while the second strategy relies on place recognition techniques to first determine which part of the model is seen in the query image before performing descriptor matching [6, 17, 38]. In addition, we detail the trade-off between localization performance and the level of detail with which 3D models depict the scene [7, 12, 25, 36]. 
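
To give a feeling for why outlier ratios of 99% call for specialized pose estimation techniques, the standard RANSAC iteration bound can be evaluated directly. The sketch below assumes a minimal pose solver that needs 3 correspondences per sample (as in P3P-style solvers) and a target confidence of 99%; both values, as well as the function name `ransac_iterations`, are illustrative assumptions and not taken from the tutorial.

```python
import math


def ransac_iterations(inlier_ratio, sample_size=3, confidence=0.99):
    """Number of random samples needed so that, with probability `confidence`,
    at least one sample consists of inliers only (standard RANSAC bound)."""
    if not 0.0 < inlier_ratio <= 1.0:
        raise ValueError("inlier_ratio must be in (0, 1]")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be in (0, 1)")
    # Probability that a single random sample is outlier-free.
    p_all_inliers = inlier_ratio ** sample_size
    if p_all_inliers >= 1.0:
        return 1
    # Solve (1 - p_all_inliers)^N <= 1 - confidence for N.
    return math.ceil(math.log(1.0 - confidence) / math.log1p(-p_all_inliers))


if __name__ == "__main__":
    for outlier_ratio in (0.5, 0.9, 0.99):
        n = ransac_iterations(1.0 - outlier_ratio)
        print(f"outlier ratio {outlier_ratio:.2f}: {n} samples")
```

Under these assumptions, 50% outliers require 35 samples, whereas 99% outliers require roughly 4.6 million, which is why plain RANSAC on all matches becomes impractical at such outlier ratios.
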
All methods discussed up to this point require the processing power and memory of a powerful desktop PC and are thus not applicable to mobile devices. In the last part of the tutorial, we therefore introduce approaches that enable large-scale localization on mobile devices such as smartphones and tablets [2, 26, 27].


The course is mostly self-contained and discusses all relevant techniques and methods. Only the second part of the tutorial, image-based localization against 3D models, requires some basic knowledge of Structure-from-Motion systems. These basics (feature matching between images, creation of feature tracks, triangulation, and model creation) are not covered in this tutorial but in the first half of the tutorial "Open Source Structure-from-Motion", which will be held in the morning of the same day.

Previous Tutorial

The slides of the previous iteration of our tutorial can be found on its website.

Tentative Schedule
  • Introduction [2:00pm - 2:05pm, 5 min, Torsten]
  • Visual Place Recognition [2:05pm - 3:40pm, 95 min, Akihiko]
  • Questions [3:40pm - 3:45pm, 5 min, Akihiko]
  • Coffee Break [3:45pm - 4:15pm, 30 min]
  • Image-Based Localization [4:15pm - 5:45pm, 90 min, Torsten]
  • Questions & Closing Remarks [5:45pm - 6:00pm, 15 min, Akihiko & Torsten]