Learning Monocular Depth in Dynamic Scenes via Instance-Aware Projection Consistency
Related Publications
Learning Monocular Depth in Dynamic Scenes via Instance-Aware Projection Consistency
AAAI Conference on Artificial Intelligence (AAAI 2021)
Seokju Lee, Sunghoon Im, Stephen Lin, In So Kweon
Silver Prize, 16th Samsung Electro-Mechanics Best Paper Award
Winner, Qualcomm Innovation Fellowship Korea 2020
Instance-wise Depth and Motion Learning from Monocular Videos
NeurIPS Workshop on Machine Learning for Autonomous Driving (NeurIPSW 2020)
NeurIPS Workshop on Differentiable Computer Vision, Graphics, and Physics in Machine Learning (NeurIPSW 2020)
Seokju Lee, Sunghoon Im, Stephen Lin, In So Kweon
Honorable Mention, 12th Electronic Times ICT Paper Contest
Abstract
We present an end-to-end joint training framework that explicitly models the 6-DoF motion of multiple dynamic objects, the ego-motion, and the depth in a monocular camera setup without supervision. Our technical contributions are three-fold. First, we highlight the fundamental difference between inverse and forward projection when modeling the individual motion of each rigid object, and propose a geometrically correct projection pipeline using a differentiable forward projection module. Second, we design a unified instance-aware photometric and geometric consistency loss that holistically imposes self-supervisory signals on every background and object region. Lastly, we introduce a general-purpose auto-annotation scheme that uses any off-the-shelf instance segmentation and optical flow models to produce video instance segmentation maps, which serve as input to our training pipeline. The proposed elements are validated in a detailed ablation study. Through extensive experiments on the KITTI and Cityscapes datasets, our framework is shown to outperform state-of-the-art depth and motion estimation methods.
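To make the instance-aware consistency objective concrete, here is a minimal sketch of an instance-wise photometric term, assuming one synthesized image per region (the background warped by the ego-motion, each object by its own 6-DoF motion) and per-instance masks. All function and variable names are illustrative; the paper's full objective also imposes geometric consistency per region.

```python
# Minimal sketch of an instance-wise photometric consistency term
# (illustrative; the paper's actual loss also includes geometric terms).
import torch

def instance_photometric_loss(img_tgt, warped_regions, masks):
    """Sum of masked L1 photometric errors over background and object regions.

    img_tgt:        (3, H, W) target image
    warped_regions: list of (3, H, W) images, each synthesized using the
                    motion of one region (index 0: background via ego-motion)
    masks:          list of (H, W) boolean masks, one per region
    """
    loss = img_tgt.new_zeros(())
    for warped, mask in zip(warped_regions, masks):
        diff = (img_tgt - warped).abs().mean(dim=0)  # per-pixel L1 error
        # Average the error only over this region's own pixels.
        loss = loss + (diff * mask).sum() / mask.sum().clamp(min=1)
    return loss
```

Each region is compared only against the image synthesized with its own motion, so a moving object no longer corrupts the self-supervisory signal of the static background.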
Contributions
We propose a neural forward projection module that maps the source image to the target viewpoint based on the source depth and the relative pose.
We propose unified instance-wise photometric and geometric consistency losses for self-supervised learning of depth and camera/object motions.
We introduce an auto-annotation scheme to generate a video instance segmentation dataset from the existing KITTI autonomous driving dataset.
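As a concrete illustration of the auto-annotation idea, the sketch below propagates instance IDs across frames by warping each per-frame mask with optical flow and matching masks by IoU. The masks and flow are assumed to come from off-the-shelf models; all names and the IoU threshold are illustrative rather than the paper's actual pipeline.

```python
# Minimal sketch of flow-based instance ID matching (illustrative; the paper's
# auto-annotation pipeline may differ in its matching and filtering details).
import numpy as np

def propagate_ids(masks_t, masks_t1, flow, iou_thresh=0.5):
    """Match instance masks of frame t+1 to instances of frame t.

    masks_t, masks_t1: lists of boolean (H, W) masks from any off-the-shelf
                       instance segmentation model
    flow:              (H, W, 2) forward optical flow (dx, dy) from frame t
                       to t+1, from any off-the-shelf flow model
    Returns, for each mask in masks_t1, the matched index in masks_t or -1.
    """
    H, W = flow.shape[:2]
    warped = []
    for m in masks_t:
        # Push each foreground pixel of the mask along the flow into frame t+1.
        ys, xs = np.nonzero(m)
        xs2 = np.clip((xs + flow[ys, xs, 0]).round().astype(int), 0, W - 1)
        ys2 = np.clip((ys + flow[ys, xs, 1]).round().astype(int), 0, H - 1)
        w = np.zeros((H, W), dtype=bool)
        w[ys2, xs2] = True
        warped.append(w)
    matches = []
    for m1 in masks_t1:
        # Assign the warped frame-t mask with the highest IoU, if above threshold.
        ious = [(w & m1).sum() / max((w | m1).sum(), 1) for w in warped]
        best = int(np.argmax(ious)) if ious else -1
        matches.append(best if best >= 0 and ious[best] >= iou_thresh else -1)
    return matches
```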
Proposed Framework
Neural Forward Projection
Inverse warping causes appearance distortion and ghosting effects.
Our forward warping results.
Hole filling: pre-upsampling the reference depth by different scale factors.
Inverse warping distorts the appearance of the moving object, while forward warping preserves geometric characteristics.
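The following is a minimal sketch of the forward-warping idea: each source pixel is back-projected with its depth, transformed by the relative pose, and splatted into the target view with a simple z-buffer. All names are illustrative; the paper's differentiable module handles splatting and hole filling (cf. the pre-upsampled reference depth above) more carefully.

```python
# Minimal sketch of forward warping with a simple z-buffer (illustrative; the
# paper's differentiable module and its hole filling are more involved).
import torch

def forward_warp(img_src, depth_src, T, K):
    """Splat img_src into the target view using depth_src and relative pose T.

    img_src:   (3, H, W) source image
    depth_src: (H, W)    source depth map
    T:         (4, 4)    source-to-target camera pose
    K:         (3, 3)    camera intrinsics
    """
    _, H, W = img_src.shape
    # Back-project every source pixel to a 3D point in the source camera frame.
    v, u = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    pix = torch.stack([u, v, torch.ones_like(u)], 0).float().reshape(3, -1)
    cam = torch.linalg.inv(K) @ pix * depth_src.reshape(1, -1)
    # Transform the points into the target frame and project them.
    cam_h = torch.cat([cam, torch.ones(1, cam.shape[1])], 0)
    cam_tgt = (T @ cam_h)[:3]
    z = cam_tgt[2].clamp(min=1e-6)
    uv = (K @ (cam_tgt / z))[:2].round().long()
    # Keep only points that land inside the target image bounds.
    valid = (uv[0] >= 0) & (uv[0] < W) & (uv[1] >= 0) & (uv[1] < H)
    uv, z = uv[:, valid], z[valid]
    colors = img_src.reshape(3, -1)[:, valid]
    # Write far points first so nearer ones overwrite them
    # (last write wins on CPU, giving a simple z-buffer).
    order = torch.argsort(z, descending=True)
    flat = uv[1, order] * W + uv[0, order]
    warped = img_src.new_zeros(3, H * W)
    warped[:, flat] = colors[:, order]
    # Pixels no source point maps to remain zero (holes); the paper fills
    # these, e.g., by pre-upsampling the reference depth before warping.
    return warped.reshape(3, H, W)
```

Because each source pixel carries its own depth into the target view, a moving object keeps its shape instead of being distorted or ghosted, which is the inverse-warping failure mode illustrated above.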
Qualitative Results (Cityscapes)
Better depth results on objects moving at a velocity similar to the camera's.
Better depth results on objects approaching from the opposite direction.
Better instance-wise depth results.
Proposed Dataset: KITTI-VIS
Visualization of our KITTI-VIS dataset.