Multi‑modal CrossViT using 3D spatial information for visual localization
Visual localization entails estimating the position and orientation of a camera from input images. Since it is pivotal to robotics, autonomous vehicles, and augmented reality, improvements in its accuracy and computational efficiency are vital. Although several hierarchical visual localization approaches have been proposed, the convolutional operations in their global localization stage inflate their computational requirements. This study proposes a hierarchical framework comprising a multi-modal CrossViT (Vision Transformer) that leverages both image features and 3D spatial information to generate more robust global descriptors. In the contrastive learning approach employed, the positive and negative image sets for each anchor image are designated based on the presence of shared 3D points. The intersection-over-union between 3D bounding boxes generated from a pair of images is used as a quantitative measure of similarity for positive sets in the loss computation. The embedding capacity of the proposed multi-modal CrossViT is transferred to an architecture that takes a single image as input via knowledge distillation. Local matching models establish correspondences between the anchor image and each retrieved reference image. The final camera pose is determined using the random sample consensus (RANSAC) and perspective-n-point (PnP) algorithms. The large-scale Aachen Day-Night dataset was used to evaluate the efficiency and accuracy of the proposed approach. Experimental results show that it achieves performance comparable to that of previous state-of-the-art approaches with significantly lower processing and memory requirements (58.9 times fewer per-second floating-point operations and 21.6 times fewer parameters than the NetVLAD model).
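The similarity weight described above can be illustrated with a minimal sketch: compute the intersection-over-union of the axis-aligned 3D bounding boxes enclosing the 3D points visible in two images. This is only an assumed formulation for illustration (the paper does not specify whether its boxes are axis-aligned), and `bbox3d_iou` is a hypothetical helper name.

```python
import numpy as np

def bbox3d_iou(pts_a: np.ndarray, pts_b: np.ndarray) -> float:
    """IoU of the axis-aligned 3D bounding boxes enclosing two (N, 3)
    point sets. Hypothetical helper; illustrates the similarity measure
    used to weight positive pairs in the contrastive loss."""
    # Axis-aligned box extents for each point set.
    min_a, max_a = pts_a.min(axis=0), pts_a.max(axis=0)
    min_b, max_b = pts_b.min(axis=0), pts_b.max(axis=0)
    # Overlap region; clip negative extents (no overlap) to zero.
    inter_dims = np.clip(np.minimum(max_a, max_b) - np.maximum(min_a, min_b),
                         0.0, None)
    inter_vol = inter_dims.prod()
    vol_a = (max_a - min_a).prod()
    vol_b = (max_b - min_b).prod()
    union = vol_a + vol_b - inter_vol
    return float(inter_vol / union) if union > 0 else 0.0
```

A pair sharing all 3D points yields an IoU of 1, disjoint point clouds yield 0, and partial overlap falls in between, giving a graded target for the loss.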
Figure: Architectures of the proposed method in training for global localization (left) and inference for visual localization (right)
J. Kang, M. Mpabulungi, H. Hong, Multi‑modal CrossViT using 3D spatial information for visual localization, Multimedia Tools and Applications, Published Online: 18 October 2024 (SCIE).