Unsupervised Learning of Depth and Ego-Motion: A Structured Approach

struct2depth

An unsupervised learning method for depth and ego-motion from monocular video, modeling the 3D scene and individual object motion.

Approach

The method learns motion transforms for individual objects and enforces photometric consistency, among other losses; an illustrative sketch follows below.
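
As a minimal sketch of this idea (not the repository's code): reconstruct the target frame by warping a source frame with the estimated ego-motion, composite in each moving object using its own motion estimate, and penalize the photometric error between reconstruction and target. The warp function, the mask compositing, and the SSIM/L1 weighting below are assumptions for illustration:

    import tensorflow as tf

    def reconstruct_target(src, depth, ego_motion, obj_motions, obj_masks, warp):
        # `warp(src, depth, motion)` is assumed to inversely warp `src` into
        # the target view given predicted depth and a 6-DoF motion estimate;
        # the repository realizes this with a projective spatial transformer.
        recon = warp(src, depth, ego_motion)  # static background
        for motion, mask in zip(obj_motions, obj_masks):
            # Paste each moving object, warped by its own motion estimate.
            recon = mask * warp(src, depth, motion) + (1.0 - mask) * recon
        return recon

    def photometric_loss(recon, target, alpha=0.85):
        # SSIM + L1 reconstruction error; alpha=0.85 is a common weighting
        # in this literature, not necessarily the paper's setting.
        ssim = tf.reduce_mean((1.0 - tf.image.ssim(recon, target, max_val=1.0)) / 2.0)
        l1 = tf.reduce_mean(tf.abs(recon - target))
        return alpha * ssim + (1.0 - alpha) * l1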

Paper

Learning to predict scene depth from RGB inputs is a challenging task for both indoor and outdoor robot navigation. In this work we address unsupervised learning of scene depth and robot ego-motion, where supervision is provided by monocular videos, as cameras are the cheapest, least restrictive, and most ubiquitous sensor for robotics.

Previous work in unsupervised image-to-depth learning has established strong baselines in the domain. We propose a novel approach which produces higher-quality results, is able to model moving objects, and is shown to transfer across data domains, e.g. from outdoor to indoor scenes. The main idea is to introduce geometric structure into the learning process by modeling the scene and the individual objects; camera ego-motion and object motions are learned from monocular videos as input. Furthermore, an online refinement method is introduced to adapt learning on the fly to unknown domains.
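
As a minimal sketch of how such online refinement could look (TF2-style code with placeholder names; the released code is TF1-based and differs in detail): for each incoming test clip, take a few gradient steps on the same unsupervised objective, predict depth, then restore the pretrained weights so refinement stays per-clip.

    import tensorflow as tf

    def refine_and_predict(model, loss_fn, clip, steps=20, lr=1e-4):
        # `model` maps an image to depth; `loss_fn` evaluates the unsupervised
        # (photometric + regularization) objective on a short clip. Both are
        # placeholders for the corresponding pieces of the training pipeline.
        optimizer = tf.keras.optimizers.SGD(learning_rate=lr)
        backup = [tf.identity(v) for v in model.trainable_variables]
        for _ in range(steps):
            with tf.GradientTape() as tape:
                loss = loss_fn(model, clip)
            grads = tape.gradient(loss, model.trainable_variables)
            optimizer.apply_gradients(zip(grads, model.trainable_variables))
        depth = model(clip[len(clip) // 2])  # predict on the middle frame
        for v, b in zip(model.trainable_variables, backup):
            v.assign(b)  # reset so each clip starts from the pretrained weights
        return depth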

The proposed approach outperforms all state-of-the-art approaches, including those that handle motion, e.g. through learned flow. Our results are comparable in quality to those that use stereo as supervision, and they significantly improve depth prediction on scenes and datasets with substantial object motion. The approach is of practical relevance, as it allows transfer across environments: models trained on data collected for robot navigation in urban scenes transfer to indoor navigation settings.

Main Paper: Vincent Casser, Soeren Pirk, Reza Mahjourian, Anelia Angelova: Depth Prediction Without the Sensors: Leveraging Structure for Unsupervised Learning from Monocular Videos. Thirty-Third AAAI Conference on Artificial Intelligence (AAAI'19).

Preprint: https://arxiv.org/abs/1811.06152

Extended Version: Vincent Casser, Soeren Pirk, Reza Mahjourian, Anelia Angelova: Unsupervised Monocular Depth and Ego-motion Learning with Structure and Semantics. CVPR Workshop on Visual Odometry & Computer Vision Applications Based on Location Clues (VOCVALC), 2019.

Preprint: https://arxiv.org/abs/1906.05717

  @inproceedings{casser2019struct2depth,
      title={Depth Prediction without the Sensors: Leveraging Structure for Unsupervised Learning from Monocular Videos},
      author={Casser, Vincent and Pirk, Soeren and Mahjourian, Reza and Angelova, Anelia},
      booktitle={Thirty-Third AAAI Conference on Artificial Intelligence (AAAI-19)},
      year={2019}
  }
  @inproceedings{casser2019unsupervised,
      title={Unsupervised Monocular Depth and Ego-Motion Learning with Structure and Semantics},
      author={Casser, Vincent and Pirk, Soeren and Mahjourian, Reza and Angelova, Anelia},
      booktitle={CVPR Workshop on Visual Odometry and Computer Vision Applications Based on Location Clues (VOCVALC)},
      year={2019}
  }

Poster

Code

Code is released as part of the TensorFlow models repository:

Models

The TensorFlow model trained on the KITTI dataset is available here.

The TensorFlow model trained on the Cityscapes dataset is available here.
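
As a quick, hedged sanity check of a downloaded checkpoint (the path is a placeholder; the repository's inference script is what actually builds the matching graph and produces depth maps):

    import tensorflow as tf

    # Inspect a released checkpoint without building the model graph.
    checkpoint_path = '/path/to/downloaded/model-checkpoint'  # placeholder
    reader = tf.train.load_checkpoint(checkpoint_path)
    shape_map = reader.get_variable_to_shape_map()
    for name in sorted(shape_map)[:10]:
        print(name, shape_map[name])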

Depth prediction results on dynamic scenes from the Cityscapes dataset

Depth prediction: the baseline maps moving objects to infinity (center column); struct2depth correctly estimates depth (right column).

Depth prediction from a single image compared to Lidar ground truth on the KITTI dataset.

Depth prediction (top); Lidar ground truth (bottom).

Transfer learning to indoor navigation data on the Fetch robot

Training on Cityscapes and testing on Fetch data: baseline algorithm (middle), our method with online refinement (bottom).