Learning Monocular Depth in Dynamic Scenes via Instance-Aware Projection Consistency

Seokju Lee¹ Sunghoon Im² Stephen Lin³ In So Kweon¹

¹KAIST ²DGIST ³Microsoft Research

[GitHub]

Related Publications

  1. Learning Monocular Depth in Dynamic Scenes via Instance-Aware Projection Consistency
    AAAI Conference on Artificial Intelligence (AAAI 2021)
    Seokju Lee, Sunghoon Im, Stephen Lin, In So Kweon
    Silver Prize, 16th Samsung Electro-Mechanics Best Paper Award
    Winner, Qualcomm Innovation Fellowship Korea 2020

  2. Instance-wise Depth and Motion Learning from Monocular Videos
    NeurIPS Workshop on Machine Learning for Autonomous Driving (NeurIPSW 2020)
    NeurIPS Workshop on Differentiable Computer Vision, Graphics, and Physics in Machine Learning (NeurIPSW 2020)
    Seokju Lee, Sunghoon Im, Stephen Lin, In So Kweon
    Honorable Mention, 12th Electronic Times ICT Paper Contest

Abstract

We present an end-to-end joint training framework that explicitly models the 6-DoF motion of multiple dynamic objects, ego-motion, and depth in a monocular camera setup without supervision. Our technical contributions are three-fold. First, we highlight the fundamental difference between inverse and forward projection when modeling the individual motion of each rigid object, and propose a geometrically correct projection pipeline using a differentiable forward projection module. Second, we design a unified instance-aware photometric and geometric consistency loss that holistically imposes self-supervisory signals on every background and object region. Lastly, we introduce a general-purpose auto-annotation scheme that uses any off-the-shelf instance segmentation and optical flow models to produce video instance segmentation maps, which serve as input to our training pipeline. These proposed elements are validated in a detailed ablation study. Through extensive experiments on the KITTI and Cityscapes datasets, our framework is shown to outperform state-of-the-art depth and motion estimation methods.

Contributions

  • We propose a neural forward projection module that maps the source image to the target viewpoint based on the source depth and the relative pose.

  • We propose unified instance-wise photometric and geometric consistency losses for self-supervised learning of depth and camera/object motions (see the sketch after this list).

  • We introduce an auto-annotation scheme to generate a video instance segmentation dataset from the existing KITTI autonomous driving dataset.
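
To make the instance-wise consistency idea concrete, below is a minimal sketch of an instance-wise photometric term. It assumes a standard L1 + SSIM photometric error averaged separately over each instance (or background) mask; the exact weighting and the geometric consistency counterpart used in the paper may differ, and all function and argument names here are illustrative.

```python
# Minimal sketch (assumed formulation, not the released implementation):
# per-instance L1 + SSIM photometric error between a target frame and a
# view synthesized from the source frame.
import torch
import torch.nn.functional as F

def ssim(x, y):
    # Simplified single-scale SSIM over 3x3 windows, returned as a dissimilarity map.
    c1, c2 = 0.01 ** 2, 0.03 ** 2
    mu_x, mu_y = F.avg_pool2d(x, 3, 1, 1), F.avg_pool2d(y, 3, 1, 1)
    sx = F.avg_pool2d(x * x, 3, 1, 1) - mu_x ** 2
    sy = F.avg_pool2d(y * y, 3, 1, 1) - mu_y ** 2
    sxy = F.avg_pool2d(x * y, 3, 1, 1) - mu_x * mu_y
    s = ((2 * mu_x * mu_y + c1) * (2 * sxy + c2)) / \
        ((mu_x ** 2 + mu_y ** 2 + c1) * (sx + sy + c2))
    return ((1 - s) / 2).clamp(0, 1)

def instance_photometric_loss(tgt, synth, masks, alpha=0.85):
    """tgt, synth: (B, 3, H, W) images; masks: (B, N, H, W) binary instance masks
    (the background region can be included as one of the N masks)."""
    err = alpha * ssim(synth, tgt).mean(1, keepdim=True) + \
          (1 - alpha) * (synth - tgt).abs().mean(1, keepdim=True)      # (B, 1, H, W)
    loss = 0.0
    for i in range(masks.shape[1]):
        m = masks[:, i:i + 1]
        loss = loss + (err * m).sum() / m.sum().clamp(min=1.0)          # per-instance mean
    return loss / masks.shape[1]
```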

Proposed Frameworks

Neural Forward Projection

Inverse warping causes appearance distortion and ghosting effects.

Our forward warping results.

Hole filling: pre-upsampling the reference depth by different scale factors.

Inverse warping distorts the appearance of moving objects, while forward warping preserves their geometric characteristics.
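
For reference, below is a minimal sketch of the forward projection idea: source pixels are lifted to 3-D with the source depth, transformed by the relative pose, and splatted into the target view, which naturally leaves holes where no source pixel lands (hence the pre-upsampled reference depth shown above). This sketch uses simple nearest-pixel scattering; the differentiable module in the paper handles splatting collisions more carefully, and all names here are assumptions.

```python
# Minimal sketch of forward projection (splatting), assuming pinhole intrinsics K,
# a source depth map, and a relative pose T (source -> target). Illustrative only.
import torch

def pixel_grid(h, w, device):
    # Homogeneous pixel coordinates, shape (3, H*W).
    ys, xs = torch.meshgrid(torch.arange(h, device=device),
                            torch.arange(w, device=device), indexing="ij")
    return torch.stack([xs, ys, torch.ones_like(xs)], dim=0).float().reshape(3, -1)

def forward_warp(img_src, depth_src, K, T):
    """Splat the source image into the target view using the SOURCE depth.

    img_src:   (3, H, W) source RGB
    depth_src: (1, H, W) source depth
    K:         (3, 3)    camera intrinsics
    T:         (4, 4)    relative pose, source -> target
    Returns the warped image and a validity mask; unfilled pixels stay zero (holes).
    """
    _, h, w = img_src.shape
    rays = torch.linalg.inv(K) @ pixel_grid(h, w, img_src.device)   # (3, HW)
    cam_src = rays * depth_src.reshape(1, -1)                        # 3-D points, source frame
    cam_tgt = T[:3, :3] @ cam_src + T[:3, 3:4]                       # transform to target frame
    uvz = K @ cam_tgt
    u = (uvz[0] / uvz[2].clamp(min=1e-6)).round().long()
    v = (uvz[1] / uvz[2].clamp(min=1e-6)).round().long()
    valid = (u >= 0) & (u < w) & (v >= 0) & (v < h) & (uvz[2] > 0)

    warped = torch.zeros_like(img_src)
    mask = torch.zeros(1, h, w, device=img_src.device)
    idx = v[valid] * w + u[valid]
    # Nearest scattering: later writes overwrite earlier ones; a full module would
    # z-buffer or softly splat to resolve collisions in a differentiable way.
    warped.reshape(3, -1)[:, idx] = img_src.reshape(3, -1)[:, valid]
    mask.reshape(-1)[idx] = 1.0
    return warped, mask
```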

Qualitative Results (Cityscapes)

Better depth results on objects moving at a velocity similar to the camera's.

Better depth results on objects approaching from the opposite direction.

Better instance-wise depth results.

Proposed Dataset: KITTI-VIS

Visualization of our KITTI-VIS dataset.
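
As a rough illustration of how such video instance segmentation labels can be produced automatically, the sketch below links per-frame masks from an off-the-shelf instance segmentation model across time by warping them with optical flow and matching by IoU. The threshold and function names are illustrative placeholders, not the released KITTI-VIS pipeline.

```python
# Minimal sketch (assumed pipeline): propagate frame-t instance masks to frame t+1
# with optical flow, then assign track IDs by best IoU match.
import numpy as np

def warp_mask(mask, flow):
    """Propagate a binary mask from frame t to t+1 using a forward flow field (H, W, 2)."""
    h, w = mask.shape
    ys, xs = np.nonzero(mask)
    xs2 = np.clip((xs + flow[ys, xs, 0]).round().astype(int), 0, w - 1)
    ys2 = np.clip((ys + flow[ys, xs, 1]).round().astype(int), 0, h - 1)
    warped = np.zeros_like(mask)
    warped[ys2, xs2] = 1
    return warped

def iou(a, b):
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union > 0 else 0.0

def link_instances(masks_t, masks_t1, flow, thresh=0.5):
    """Match each instance in frame t+1 to its best flow-warped instance in frame t;
    instances with no match above thresh start a new track (marked -1 here)."""
    matches = {}
    warped = [warp_mask(m0, flow) for m0 in masks_t]
    for j, m1 in enumerate(masks_t1):
        scores = [iou(w0, m1) for w0 in warped]
        best = int(np.argmax(scores)) if scores else -1
        matches[j] = best if scores and scores[best] >= thresh else -1
    return matches
```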

Demo: Unified Visual Odometry (YouTube)

Code/Dataset/Models [GitHub]