Web Stereo Video Supervision for Depth Prediction from Dynamic Scenes

Chaoyang Wang Simon Lucey Federico Perazzi Oliver Wang

Carnegie Mellon University Adobe Inc.

Samples from WSVD dataset

We present a fully data-driven method to compute depth from diverse monocular video sequences that contain many non-rigid objects, e.g., people. To learn reconstruction cues for non-rigid scenes, we introduce a new dataset (WSVD) consisting of stereo videos scraped from YouTube. This dataset covers a wide variety of scene types and features many non-rigid objects.


@misc{wang2019web,
  title={Web Stereo Video Supervision for Depth Prediction from Dynamic Scenes},
  author={Chaoyang Wang and Simon Lucey and Federico Perazzi and Oliver Wang},
  year={2019},
  eprint={1904.11112},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}



We infer depth from pairs of frames, which allows our network to take advantage of multiview information. Because multiview information is ambiguous for moving objects, we learn a prior on scenes with non-rigid objects from a new large-scale stereo dataset. We also introduce a novel loss function for training that yields high-quality results on web-sourced videos with unknown intrinsics. Please see the paper for full details.

More samples from WSVD

Results of model trained on WSVD

Download WSVD dataset

The Web Stereo Video Dataset (WSVD) consists of 553 stereoscopic videos from YouTube.

To download the videos, first download wsvd_list.txt, then run the following command (this assumes 'youtube-dl' is already installed).

youtube-dl --download-archive downloaded.txt -f 'bestvideo[ext=mp4]' -a wsvd_list.txt -o 'wsvd/%(id)s.%(ext)s'
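For scripted downloads, the same command can be assembled from Python. This is only a minimal sketch: the function names `build_download_command` and `download` are hypothetical, and the default paths simply mirror the command above. It uses no youtube-dl options beyond those already shown.

```python
import shutil
import subprocess


def build_download_command(list_path="wsvd_list.txt",
                           out_dir="wsvd",
                           archive="downloaded.txt"):
    """Return the youtube-dl argument list equivalent to the command above."""
    return [
        "youtube-dl",
        "--download-archive", archive,      # skip videos already downloaded
        "-f", "bestvideo[ext=mp4]",         # video-only mp4 stream
        "-a", list_path,                    # batch file with one URL per line
        "-o", f"{out_dir}/%(id)s.%(ext)s",  # output template: <out_dir>/<video id>.<ext>
    ]


def download(list_path="wsvd_list.txt", out_dir="wsvd"):
    """Run youtube-dl, failing early if it is not on PATH."""
    if shutil.which("youtube-dl") is None:
        raise RuntimeError("youtube-dl not found on PATH")
    subprocess.run(build_download_command(list_path, out_dir), check=True)
```

Calling `download()` from the directory that contains wsvd_list.txt should behave like the shell command; passing the arguments as a list avoids shell quoting issues with the format and output templates.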

Frame ID list

We provide a list of video frames selected by the following procedure:

  • Frames are grouped into individual clips which share the same context.
  • Static frames, black screens, and frames dominated by animated regions (e.g. titles) are removed.
  • Frames for which stereo matching (by FlowNet 2.0) produces poor results are manually removed.
  • Train/test splits are provided. We randomly choose ~500 videos for training and leave the rest as the test set.

We noticed that some of the stereoscopic videos place their left and right views on opposite sides, resulting in inverted disparity. We manually labeled those cases. The labels are provided per clip:

  • '1': correct left/right view layout.
  • '-1': opposite left/right view layout. In this case, you should multiply the horizontal flow by -1 when translating it to disparity.
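The label can be applied as a simple sign flip on the horizontal flow. The sketch below is illustrative only: the function name `flow_to_disparity` is hypothetical, the flow is represented as nested lists standing in for an image-sized array, and it assumes that for label 1 the horizontal flow is used as the disparity unchanged. If your disparity convention differs, adjust the sign accordingly.

```python
def flow_to_disparity(flow_x, layout_label):
    """Convert horizontal flow to disparity using the per-clip layout label.

    flow_x:        rows of horizontal flow values (nested lists here; an
                   assumption made to keep the sketch dependency-free).
    layout_label:  1 for the correct left/right layout, -1 for the
                   opposite layout.
    """
    if layout_label not in (1, -1):
        raise ValueError("layout_label must be 1 or -1")
    # Multiplying by the label leaves the flow unchanged for label 1 and
    # negates it for label -1, as described above.
    return [[layout_label * value for value in row] for row in flow_x]


# Example: a clip labeled -1 has its horizontal flow negated.
swapped = flow_to_disparity([[1.5, -2.0], [0.5, 3.0]], -1)
```

With an array library, the same operation reduces to a single multiplication of the horizontal flow channel by the label.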

The frame ID lists are provided as pickle files. Frame indices start from 1.
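A frame ID list can be read with Python's standard `pickle` module. The sketch below is hedged: the internal structure of the pickle files is not described here, so inspect the loaded object before relying on it. The helper `to_zero_based` assumes a flat list of integer frame IDs, which is an assumption for illustration, and both function names are hypothetical.

```python
import pickle


def load_frame_list(path):
    """Load a pickled frame ID list. Only unpickle files from a trusted source."""
    with open(path, "rb") as f:
        return pickle.load(f)


def to_zero_based(frame_ids):
    """Convert 1-based frame IDs to 0-based indices, e.g. for indexing decoded frames.

    Assumes frame_ids is a flat iterable of integers (hypothetical layout).
    """
    return [frame_id - 1 for frame_id in frame_ids]
```

If a file was written with Python 2, loading it under Python 3 may require passing `encoding="latin1"` to `pickle.load`.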

Code for processing and annotation

Coming soon...

Inference code and model

Coming soon...

  • Single-view depth network trained on WSVD
  • Multi-view depth network trained on WSVD

Our multi-view depth network