Course Description

The tutorial covers the task of visual localization at large scale, where the goal is to localize a single image based solely on visual information. The tutorial includes localization approaches for different granularity levels, ranging from simple recognition of named locations and GPS estimation to the precise estimation of the 6D camera pose. The tutorial’s scope covers cases with different spatial/geographical extent, e.g., a small indoor/outdoor scene, city level, or world level, as well as localization under changing conditions.

In the coarse localization regime, the task is typically handled via retrieval approaches, which are covered in the first part of the tutorial. A typical use case is the following: given a database of geo-tagged images, the goal is to determine the place depicted in a new query image. Traditionally, this problem is solved by transferring the geo-tag of the most similar database image to the query image. The major focus of this part is on the visual representation models used for retrieval, where we include both classical feature-based approaches and recent deep learning ones. The second and third parts of the tutorial cover methods for the precise localization regime with feature-based and deep learning approaches, respectively. A typical use case for these algorithms is to estimate the full 6 degree-of-freedom (6DoF) pose of a query image, i.e., the position and orientation from which the image was taken, for applications such as robotics, autonomous vehicles (self-driving cars), Augmented / Mixed / Virtual Reality, loop closure detection in SLAM, and Structure-from-Motion.
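
As a minimal illustration of this geo-tag transfer idea, the following sketch retrieves the most similar database image by descriptor similarity and returns its geo-tag; the descriptor dimensionality, the random data, and the function name are purely illustrative placeholders.

    # Minimal sketch of geo-tag transfer via nearest-neighbour retrieval (illustrative).
    import numpy as np

    def transfer_geotag(query_desc, db_descs, db_geotags):
        # L2-normalise descriptors so the inner product equals cosine similarity.
        q = query_desc / np.linalg.norm(query_desc)
        db = db_descs / np.linalg.norm(db_descs, axis=1, keepdims=True)
        scores = db @ q                # similarity to every database image
        best = int(np.argmax(scores))  # most similar database image
        return db_geotags[best]        # its geo-tag becomes the location estimate

    # Toy usage with random descriptors and (latitude, longitude) tags.
    db_descs = np.random.randn(1000, 128)
    db_geotags = np.random.uniform(-90, 90, size=(1000, 2))
    print(transfer_geotag(np.random.randn(128), db_descs, db_geotags))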

This tutorial covers the state of the art in visual localization, with three goals: 1) Provide a comprehensive overview of the current state of the art. This is aimed at first- and second-year PhD students and engineers from industry who are getting started with or are interested in this topic. 2) Have experts teach the tricks of the trade to more experienced PhD students and engineers who want to refine their knowledge of visual localization. 3) Highlight current open challenges. This outlines what current algorithms can and cannot do. Throughout the tutorial, we provide links to publicly available source code for the discussed approaches. We will also highlight the different properties of the datasets commonly used for experimental evaluation.

Part I: Retrieval for coarse localization - Giorgos, Yannis

The first part of the tutorial covers coarse localization, also called visual place recognition. It considers an application scenario in which the scene is represented by a set of geo-tagged images or images of named places, e.g., landmarks, known buildings, and locations. The aim is to determine which place is visible in an image taken by a user. Traditionally, this has been modeled as an image retrieval problem; the tutorial focuses on the visual representation aspect. We cover both classical feature-based visual representations and recent ones that rely on deep learning. A large range of approaches is cast under the same generic framework based on match kernels, and the relation between classical and deep approaches is made explicit. Although CNN representations transfer well to the localization task, fine-tuning is essential; we review different approaches to obtaining training data in ways that dispense with the need for human annotations. We additionally include recent results at the intersection of classical and deep learning approaches. Local feature indexing and geometric matching appear to be directly applicable on top of local features learned via deep learning. This gives rise to interesting open problems related to better adapting the features to the indexing scheme or vice versa, which will be discussed in the tutorial. A dedicated part will focus on transformers, including both ViT-like architectures and transformer-based processing of CNN features. At the end of this part, we will present recent approaches for asymmetric setups that use lightweight networks or lightweight input to efficiently process the query image on low-resource devices.
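
To make the deep representation side concrete, the sketch below extracts a global image descriptor from a CNN backbone with generalized-mean (GeM) pooling, a pooling scheme commonly used in deep retrieval pipelines; the backbone choice, the pooling exponent p, and the input size are illustrative assumptions rather than a specific published configuration.

    # Sketch: CNN global descriptor with GeM pooling (illustrative configuration).
    import torch
    import torch.nn.functional as F
    import torchvision

    def gem(feature_map, p=3.0, eps=1e-6):
        # Generalized-mean pooling over the spatial dimensions of a B x C x H x W map.
        return feature_map.clamp(min=eps).pow(p).mean(dim=(-2, -1)).pow(1.0 / p)

    # Keep only the convolutional part of a ResNet-50 (drop average pooling and classifier).
    backbone = torch.nn.Sequential(*list(torchvision.models.resnet50().children())[:-2])
    image = torch.randn(1, 3, 224, 224)            # placeholder input image
    with torch.no_grad():
        fmap = backbone(image)                     # 1 x 2048 x 7 x 7 feature map
        desc = F.normalize(gem(fmap), dim=-1)      # L2-normalized global descriptor
    print(desc.shape)                              # torch.Size([1, 2048])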

Part II: Feature-based Visual Localization - Sudipta, Torsten, Zuzana

The second part of this tutorial consists of three sub-parts. In the first (Camera Pose Estimation), we discuss the problem of camera pose estimation from a set of 2D-3D matches. We discuss how the pose can be estimated from a minimal set of matches, considering calibrated and uncalibrated cameras as well as global- and rolling-shutter cameras. This sub-part details the underlying mathematical models, based on solving systems of multivariate polynomial equations, and strategies to solve such systems efficiently. The techniques discussed here are also relevant for Part III.
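
For a flavor of how such pose solvers are used in practice, the sketch below estimates a calibrated camera pose from 2D-3D matches with RANSAC around a P3P minimal solver, using OpenCV; the intrinsics, the threshold, and the random correspondences are illustrative placeholders.

    # Sketch: 6DoF pose from 2D-3D matches with RANSAC + a P3P minimal solver (OpenCV).
    import cv2
    import numpy as np

    points_3d = np.random.rand(100, 3)                            # scene points
    points_2d = np.random.rand(100, 2) * 640                      # matched image keypoints
    K = np.array([[600.0, 0, 320], [0, 600.0, 240], [0, 0, 1]])   # assumed calibration

    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        points_3d, points_2d, K, distCoeffs=None,
        reprojectionError=8.0, flags=cv2.SOLVEPNP_P3P)
    if ok:
        R, _ = cv2.Rodrigues(rvec)   # rotation matrix of the estimated camera pose
        print(R, tvec.ravel(), 0 if inliers is None else len(inliers))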

The second sub-part (Feature-based Visual Localization) discusses how to establish the 2D-3D matches needed for camera pose estimation based on local features and a given 3D model of the scene. The topics of this sub-part are: (1) using image retrieval techniques and coarse localization in the context of large-scale (i.e., city-scale) localization and long-term localization; (2) the challenges of feature matching under changing conditions, e.g., day-night or seasonal changes; (3) using machine learning to learn robust and reliable local features and strategies for feature matching and outlier filtering. In addition, we briefly explain how feature-based localization can be efficiently implemented on mobile devices with limited compute and memory capabilities, e.g., smartphones.
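
A bare-bones version of this matching step is sketched below: query descriptors are matched against the descriptors attached to the points of a 3D model, and Lowe's ratio test filters ambiguous matches to produce tentative 2D-3D correspondences; the synthetic descriptors, the ratio threshold, and the variable names are illustrative.

    # Sketch: establishing tentative 2D-3D matches via descriptor matching + ratio test.
    import cv2
    import numpy as np

    query_desc = np.random.rand(500, 128).astype(np.float32)    # query local features
    query_xy = np.random.rand(500, 2).astype(np.float32)        # their 2D keypoints
    model_desc = np.random.rand(2000, 128).astype(np.float32)   # one descriptor per 3D point
    model_xyz = np.random.rand(2000, 3).astype(np.float32)      # the 3D model points

    matcher = cv2.BFMatcher(cv2.NORM_L2)
    matches_2d3d = []
    for i, (m, n) in enumerate(matcher.knnMatch(query_desc, model_desc, k=2)):
        if m.distance < 0.8 * n.distance:                       # Lowe's ratio test
            matches_2d3d.append((query_xy[i], model_xyz[m.trainIdx]))
    print(len(matches_2d3d), "tentative 2D-3D correspondences")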

The third sub-part (Privacy-Preserving Localization) focuses on the topic of privacy in the context of visual localization. While the previously discussed approaches tend to be quite accurate, they also often have high storage requirements. Although cloud processing and storage can alleviate the storage issue, uploading visual features to persistent storage can raise privacy concerns because the features can potentially be inverted to recover sensitive information about the scene or subjects. We describe recent localization approaches that address the issues of privacy and storage. The concrete topics of this sub-part are: (1) privacy-preserving localization approaches enabled by new geometric representations for 2D image keypoints and 3D mapped points; (2) privacy-preserving representations for visual features; (3) learned approaches that, by design, address both storage and privacy issues. In particular, we review a recent technique that works by designating a few salient 3D points in the scene as landmarks and then learning scene-specific predictors to detect these landmarks from query images.
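
One geometric idea from this line of work replaces each 2D keypoint with a random line passing through it, so that exact keypoint positions are concealed while constraints usable for pose estimation remain; the sketch below is a simplified, illustrative version of that lifting step and not a complete privacy-preserving pipeline.

    # Sketch: lifting 2D keypoints to random homogeneous lines l with l . (x, y, 1) = 0.
    import numpy as np

    def lift_keypoints_to_lines(keypoints_xy, seed=0):
        rng = np.random.default_rng(seed)
        lines = []
        for x, y in keypoints_xy:
            theta = rng.uniform(0, np.pi)           # random line direction
            a, b = np.sin(theta), -np.cos(theta)    # line normal
            c = -(a * x + b * y)                    # chosen so the line passes through (x, y)
            lines.append([a, b, c])
        return np.array(lines)

    print(lift_keypoints_to_lines([[100.0, 200.0], [320.0, 240.0]]))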

Part III: Learning-based Visual Localization - Eric

This part covers two popular learning-based approaches to localization: scene coordinate regression and pose regression. Scene coordinate regression methods predict image-to-scene correspondences densely for a query image as a replacement for discrete feature matching. We discuss differentiable pose estimation strategies (e.g., differentiable RANSAC or PnP solvers) that allow for end-to-end training. We focus on deep scene coordinate regression with neural networks and review their advantages and disadvantages concerning accuracy, mapping speed, map size, and scalability. We present strategies to address some of the major limitations, such as ensemble methods to improve scalability or on-the-fly adaptation to reduce long mapping times.
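
The sketch below shows the basic shape of a scene coordinate regression network: a small fully convolutional model that predicts a 3D scene coordinate for every output cell, yielding dense 2D-3D correspondences that a (possibly differentiable) RANSAC/PnP stage can turn into a 6DoF pose; the architecture and sizes are illustrative assumptions, not a specific published model.

    # Sketch: a toy scene coordinate regression network (illustrative architecture).
    import torch
    import torch.nn as nn

    class SceneCoordNet(nn.Module):
        def __init__(self):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(128, 128, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(128, 3, 1))   # 3 output channels = (x, y, z) in scene space

        def forward(self, image):
            return self.net(image)      # B x 3 x H/8 x W/8 scene coordinates

    coords = SceneCoordNet()(torch.randn(1, 3, 480, 640))
    print(coords.shape)                 # every output cell is one 2D-3D correspondence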

Pose regression uses a neural network to predict the camera pose relative to the scene (absolute pose regression) or relative to a reference image (relative pose regression). We present important design choices for pose regression methods, such as popular network architectures, the choice and impact of the pose parametrization, and the design of the loss function. We discuss challenges of pose regression, such as limited accuracy, as well as opportunities, such as instant relocalisation without the need to build a map beforehand.
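
To illustrate one common loss design, the sketch below implements a PoseNet-style absolute pose regression loss that balances a position term and a quaternion orientation term with a fixed weight beta; the weighting scheme, the value of beta, and the variable names are illustrative choices rather than a recommended setting.

    # Sketch: PoseNet-style absolute pose regression loss (illustrative weighting).
    import torch
    import torch.nn.functional as F

    def pose_regression_loss(pred_t, pred_q, gt_t, gt_q, beta=250.0):
        pred_q = F.normalize(pred_q, dim=-1)          # keep predicted quaternion on the unit sphere
        loss_t = torch.norm(pred_t - gt_t, dim=-1)    # position error
        loss_q = torch.norm(pred_q - gt_q, dim=-1)    # orientation error (quaternion difference)
        return (loss_t + beta * loss_q).mean()

    pred_t, pred_q = torch.randn(8, 3), torch.randn(8, 4)
    gt_t, gt_q = torch.randn(8, 3), F.normalize(torch.randn(8, 4), dim=-1)
    print(pose_regression_loss(pred_t, pred_q, gt_t, gt_q))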

Part IV: Datasets & Benchmarks - Marc 

The last part reviews popular existing datasets and benchmarks for visual localization, describing their advantages and limitations. A special focus of this part is the LaMAR benchmark, recently proposed at ECCV’22.