Learning Monocular Depth in Dynamic Scenes via Instance-Aware Projection Consistency

Seokju Lee¹ Sunghoon Im² Stephen Lin³ In So Kweon¹

¹KAIST ²DGIST ³Microsoft Research

[GitHub]

Related Publications

  1. Learning Monocular Depth in Dynamic Scenes via Instance-Aware Projection Consistency
    AAAI Conference on Artificial Intelligence (AAAI 2021)
    Seokju Lee, Sunghoon Im, Stephen Lin, In So Kweon
    Silver Prize, 16th Samsung Electro-Mechanics Best Paper Award
    Winner, Qualcomm Innovation Fellowship Korea 2020

  2. Instance-wise Depth and Motion Learning from Monocular Videos
    NeurIPS Workshop on Machine Learning for Autonomous Driving (NeurIPSW 2020)
    NeurIPS Workshop on Differentiable Computer Vision, Graphics, and Physics in Machine Learning (NeurIPSW 2020)
    Seokju Lee, Sunghoon Im, Stephen Lin, In So Kweon
    Honorable Mention, 12th Electronic Times ICT Paper Contest

Abstract

We present an end-to-end joint training framework that explicitly models the 6-DoF motion of multiple dynamic objects, ego-motion, and depth in a monocular camera setup without supervision. Our technical contributions are three-fold. First, we highlight the fundamental difference between inverse and forward projection when modeling the individual motion of each rigid object, and propose a geometrically correct projection pipeline using a differentiable forward projection module. Second, we design a unified instance-aware photometric and geometric consistency loss that holistically imposes self-supervisory signals on every background and object region. Lastly, we introduce a general-purpose auto-annotation scheme that uses any off-the-shelf instance segmentation and optical flow models to produce video instance segmentation maps, which serve as input to our training pipeline. These proposed elements are validated in a detailed ablation study. Through extensive experiments on the KITTI and Cityscapes datasets, our framework is shown to outperform state-of-the-art depth and motion estimation methods.

Contributions

  • We propose a neural forward projection module that maps the source image to the target viewpoint based on the source depth and the relative pose.

  • We propose unified instance-wise photometric and geometric consistency losses for self-supervised learning of depth and camera/object motions (see the sketch after this list).

  • We introduce an auto-annotation scheme to generate a video instance segmentation dataset from the existing KITTI autonomous driving dataset.
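
To make the instance-wise consistency idea concrete, below is a minimal sketch of an instance-wise photometric term. It assumes a standard L1 + SSIM photometric error averaged separately over each instance (or background) mask; the exact weighting and the geometric consistency counterpart used in the paper may differ, and all function and argument names here are illustrative.

```python
# Minimal sketch (assumed formulation, not the released implementation):
# per-instance L1 + SSIM photometric error between a target frame and a
# view synthesized from the source frame.
import torch
import torch.nn.functional as F

def ssim(x, y):
    # Simplified single-scale SSIM over 3x3 windows, returned as a dissimilarity map.
    c1, c2 = 0.01 ** 2, 0.03 ** 2
    mu_x, mu_y = F.avg_pool2d(x, 3, 1, 1), F.avg_pool2d(y, 3, 1, 1)
    sx = F.avg_pool2d(x * x, 3, 1, 1) - mu_x ** 2
    sy = F.avg_pool2d(y * y, 3, 1, 1) - mu_y ** 2
    sxy = F.avg_pool2d(x * y, 3, 1, 1) - mu_x * mu_y
    s = ((2 * mu_x * mu_y + c1) * (2 * sxy + c2)) / \
        ((mu_x ** 2 + mu_y ** 2 + c1) * (sx + sy + c2))
    return ((1 - s) / 2).clamp(0, 1)

def instance_photometric_loss(tgt, synth, masks, alpha=0.85):
    """tgt, synth: (B, 3, H, W) images; masks: (B, N, H, W) binary instance masks
    (the background region can be included as one of the N masks)."""
    err = alpha * ssim(synth, tgt).mean(1, keepdim=True) + \
          (1 - alpha) * (synth - tgt).abs().mean(1, keepdim=True)      # (B, 1, H, W)
    loss = 0.0
    for i in range(masks.shape[1]):
        m = masks[:, i:i + 1]
        loss = loss + (err * m).sum() / m.sum().clamp(min=1.0)          # per-instance mean
    return loss / masks.shape[1]
```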

Proposed Frameworks

Neural Forward Projection

Inverse warping causes appearance distortion and ghosting effects.

Our forward warping results.

Hole filling: pre-upsampling the reference depth by different scale factors.

Inverse warping distorts the appearance of moving objects, while forward warping preserves their geometric characteristics.
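
For reference, below is a minimal sketch of the forward projection idea: source pixels are lifted to 3-D with the source depth, transformed by the relative pose, and splatted into the target view, which naturally leaves holes where no source pixel lands (hence the pre-upsampled reference depth shown above). This sketch uses simple nearest-pixel scattering; the differentiable module in the paper handles splatting collisions more carefully, and all names here are assumptions.

```python
# Minimal sketch of forward projection (splatting), assuming pinhole intrinsics K,
# a source depth map, and a relative pose T (source -> target). Illustrative only.
import torch

def pixel_grid(h, w, device):
    # Homogeneous pixel coordinates, shape (3, H*W).
    ys, xs = torch.meshgrid(torch.arange(h, device=device),
                            torch.arange(w, device=device), indexing="ij")
    return torch.stack([xs, ys, torch.ones_like(xs)], dim=0).float().reshape(3, -1)

def forward_warp(img_src, depth_src, K, T):
    """Splat the source image into the target view using the SOURCE depth.

    img_src:   (3, H, W) source RGB
    depth_src: (1, H, W) source depth
    K:         (3, 3)    camera intrinsics
    T:         (4, 4)    relative pose, source -> target
    Returns the warped image and a validity mask; unfilled pixels stay zero (holes).
    """
    _, h, w = img_src.shape
    rays = torch.linalg.inv(K) @ pixel_grid(h, w, img_src.device)   # (3, HW)
    cam_src = rays * depth_src.reshape(1, -1)                        # 3-D points, source frame
    cam_tgt = T[:3, :3] @ cam_src + T[:3, 3:4]                       # transform to target frame
    uvz = K @ cam_tgt
    u = (uvz[0] / uvz[2].clamp(min=1e-6)).round().long()
    v = (uvz[1] / uvz[2].clamp(min=1e-6)).round().long()
    valid = (u >= 0) & (u < w) & (v >= 0) & (v < h) & (uvz[2] > 0)

    warped = torch.zeros_like(img_src)
    mask = torch.zeros(1, h, w, device=img_src.device)
    idx = v[valid] * w + u[valid]
    # Nearest scattering: later writes overwrite earlier ones; a full module would
    # z-buffer or softly splat to resolve collisions in a differentiable way.
    warped.reshape(3, -1)[:, idx] = img_src.reshape(3, -1)[:, valid]
    mask.reshape(-1)[idx] = 1.0
    return warped, mask
```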

Qualitative Results (Cityscapes)

Better depth results on objects moving at a velocity similar to the camera's.

Better depth results on objects approaching from the opposite direction.

Better instance-wise depth results.

Proposed Dataset: KITTI-VIS

Visualization of our KITTI-VIS dataset.
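
As a rough illustration of how such video instance segmentation labels can be produced automatically, the sketch below links per-frame masks from an off-the-shelf instance segmentation model across time by warping them with optical flow and matching by IoU. The threshold and function names are illustrative placeholders, not the released KITTI-VIS pipeline.

```python
# Minimal sketch (assumed pipeline): propagate frame-t instance masks to frame t+1
# with optical flow, then assign track IDs by best IoU match.
import numpy as np

def warp_mask(mask, flow):
    """Propagate a binary mask from frame t to t+1 using a forward flow field (H, W, 2)."""
    h, w = mask.shape
    ys, xs = np.nonzero(mask)
    xs2 = np.clip((xs + flow[ys, xs, 0]).round().astype(int), 0, w - 1)
    ys2 = np.clip((ys + flow[ys, xs, 1]).round().astype(int), 0, h - 1)
    warped = np.zeros_like(mask)
    warped[ys2, xs2] = 1
    return warped

def iou(a, b):
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union > 0 else 0.0

def link_instances(masks_t, masks_t1, flow, thresh=0.5):
    """Match each instance in frame t+1 to its best flow-warped instance in frame t;
    instances with no match above thresh start a new track (marked -1 here)."""
    matches = {}
    warped = [warp_mask(m0, flow) for m0 in masks_t]
    for j, m1 in enumerate(masks_t1):
        scores = [iou(w0, m1) for w0 in warped]
        best = int(np.argmax(scores)) if scores else -1
        matches[j] = best if scores and scores[best] >= thresh else -1
    return matches
```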

Demo: Unified Visual Odometry (YouTube)

Code/Dataset/Models [GitHub]