Multiscopic Vision

Weihao Yuan, Michael Yu Wang, Qifeng Chen
Hong Kong University of Science and Technology, Hong Kong SAR, China

We extend the idea of stereo matching to multiscopic vision, in which more constraints can be enforced on the estimated depth maps, so as to obtain high-quality depth. We refer to the problem of depth estimation from multiple images captured at aligned camera locations as multiscopic vision, in analogy to stereo vision with two horizontally aligned images. Inspired by the principle of stereo vision that depth estimation from two perfectly aligned images is easier than from two images with arbitrary unknown camera poses, we believe that capturing multiple images at aligned camera locations helps obtain more accurate and comprehensive depth estimation.
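As a concrete example of these extra constraints, consider rectified cameras placed at equally spaced positions along a horizontal line: the disparity of the i-th view relative to the reference grows linearly with i, so every additional view gives another check on the same depth. A minimal Python sketch, with assumed focal length, camera spacing, and depth values, illustrates this:

    # Sketch of the multiscopic disparity constraint for rectified cameras at
    # equally spaced horizontal positions (all values below are assumptions).
    f = 320.0  # focal length in pixels
    b = 0.05   # spacing between adjacent camera positions, in meters
    Z = 2.0    # depth of a scene point, in meters

    def disparity(i):
        # Disparity of the i-th view away from the reference view.
        return i * f * b / Z

    # Each extra view adds one more equation constraining the same depth Z:
    # a correct depth must make d_2 exactly twice d_1, d_3 three times d_1, ...
    d1, d2 = disparity(1), disparity(2)
    assert abs(d2 - 2 * d1) < 1e-9
    print(d1, d2)  # 8.0 16.0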

1. Multiscopic Dataset

We build a dataset composed of 1200 scenes of synthetic multiscopic data rendered by 3D engines and 100 scenes of real images taken with real-world cameras.

The synthetic images are rendered with 3D engines [1-3] from 3D models [4-6]. The real images are captured with multi-eye cameras that we designed ourselves.

This dataset can also be used for other interesting topics. For instance, multiscopic images can be used for image enhancement, since more constraints can be enforced, such as multi-image super-resolution or brightening images taken in the dark. View synthesis can also be performed in the multiscopic setting, for example predicting the right view given only the left and center views.

Download Dataset: 1400 scenes of synthetic data and real data

2. MFuseNet: Robust Depth Estimation With Learned Multiscopic Fusion

We design a multiscopic vision system that uses a low-cost monocular RGB camera to acquire accurate depth estimates. Unlike multi-view stereo with images captured at unconstrained camera poses, the proposed system controls the motion of the camera to capture a sequence of images at horizontally or vertically aligned positions with the same parallax. For this system, we propose a new heuristic method and a robust learning-based method to fuse the multiple cost volumes between the reference image and its surrounding images. Experiments on the real-world Middlebury dataset and a real-robot demonstration show that our multiscopic vision system outperforms traditional two-frame stereo matching methods in depth estimation.
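To make the fusion idea concrete, the following minimal NumPy sketch builds one absolute-difference cost volume per neighboring view of the center reference image and fuses them with an element-wise minimum before a winner-take-all disparity readout. The matching cost, fusion rule, and function names are illustrative assumptions, not the heuristic or learned fusion proposed in the paper.

    import numpy as np

    def cost_volume(ref, src, max_disp, sign):
        # Absolute-difference cost volume between a grayscale reference image
        # (H x W float array) and a horizontally aligned neighbor. Use sign=-1
        # for a neighbor to the left of the reference and sign=+1 for one to
        # the right, so candidate matches are shifted toward the reference.
        h, w = ref.shape
        vol = np.zeros((max_disp, h, w), dtype=np.float32)
        for d in range(max_disp):
            shifted = np.roll(src, sign * d, axis=1)  # crude shift; border wrap-around ignored
            vol[d] = np.abs(ref - shifted)
        return vol

    def fuse_and_estimate(center, left, right, max_disp=64):
        # One cost volume per neighbor of the center (reference) view.
        vol_l = cost_volume(center, left, max_disp, sign=-1)
        vol_r = cost_volume(center, right, max_disp, sign=+1)
        # Illustrative fusion rule (an assumption): element-wise minimum, so a
        # pixel occluded in one neighbor can still be matched in the other.
        fused = np.minimum(vol_l, vol_r)
        # Winner-take-all disparity from the fused volume.
        return fused.argmin(axis=0)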

3. Stereo Matching by Self-supervision of Multiscopic Vision

Self-supervised learning for depth estimation has several advantages over supervised learning: it requires no ground-truth depth, supports online fine-tuning, and generalizes better with effectively unlimited training data, which attracts researchers to seek self-supervised solutions. In this work, we propose a new self-supervised framework for stereo matching that utilizes multiple images captured at aligned camera positions. A cross photometric loss, an uncertainty-aware mutual-supervision loss, and a new smoothness loss are introduced to train the network to learn disparity maps end-to-end without ground-truth depth. After being trained only on synthetic images, our network performs well in unseen outdoor scenes. Our experiments show that our model obtains better disparity maps than previous unsupervised methods on the KITTI dataset and is comparable to supervised methods when generalized to unseen data.
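For context, the generic ingredients of such photometric self-supervision can be sketched in PyTorch as follows: warp a neighboring view into the reference view with the predicted disparity, penalize the reconstruction error, and add an edge-aware smoothness term. This is a simplified generic sketch, not the exact cross photometric, mutual-supervision, or smoothness losses used in our method.

    import torch
    import torch.nn.functional as F

    def warp_right_to_left(right, disp):
        # Warp the right image into the left view using a left-view disparity
        # map. right: (B, C, H, W); disp: (B, 1, H, W), in pixels.
        _, _, h, w = right.shape
        ys, xs = torch.meshgrid(torch.arange(h, device=right.device),
                                torch.arange(w, device=right.device), indexing="ij")
        xs = xs.unsqueeze(0).float() - disp.squeeze(1)  # corresponding pixel lies at x - d
        ys = ys.unsqueeze(0).float().expand_as(xs)
        # Normalize coordinates to [-1, 1] for grid_sample.
        grid = torch.stack((2 * xs / (w - 1) - 1, 2 * ys / (h - 1) - 1), dim=-1)
        return F.grid_sample(right, grid, align_corners=True, padding_mode="border")

    def photometric_loss(left, right, disp):
        # L1 reconstruction error between the left image and the warped right image.
        return (left - warp_right_to_left(right, disp)).abs().mean()

    def smoothness_loss(disp, img):
        # Edge-aware smoothness: penalize disparity gradients where the image is flat.
        dx_d = (disp[..., :, 1:] - disp[..., :, :-1]).abs()
        dy_d = (disp[..., 1:, :] - disp[..., :-1, :]).abs()
        dx_i = (img[..., :, 1:] - img[..., :, :-1]).abs().mean(1, keepdim=True)
        dy_i = (img[..., 1:, :] - img[..., :-1, :]).abs().mean(1, keepdim=True)
        return (dx_d * torch.exp(-dx_i)).mean() + (dy_d * torch.exp(-dy_i)).mean()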

References:

[1] M. Savva, A. Kadian, O. Maksymets, Y. Zhao, E. Wijmans, B. Jain, J. Straub, J. Liu, V. Koltun, J. Malik, et al., "Habitat: A platform for embodied AI research," arXiv preprint arXiv:1904.01201, 2019.

[2] J. McCormac, A. Handa, S. Leutenegger, and A. J. Davison, "SceneNet RGB-D: Can 5M synthetic images beat generic ImageNet pre-training on indoor segmentation?" in IEEE International Conference on Computer Vision (ICCV), 2017.

[3] Y. Wu, Y. Wu, G. Gkioxari, and Y. Tian, "Building generalizable agents with a realistic and rich 3D environment," arXiv preprint arXiv:1801.02209, 2018.

[4] A. Chang, A. Dai, T. Funkhouser, M. Halber, M. Niessner, M. Savva, S. Song, A. Zeng, and Y. Zhang, "Matterport3D: Learning from RGB-D data in indoor environments," in International Conference on 3D Vision (3DV), 2017.

[5] A. X. Chang, et al., "ShapeNet: An information-rich 3D model repository," arXiv preprint arXiv:1512.03012, 2015.

[6] F. Xia, A. R. Zamir, Z.-Y. He, A. Sax, J. Malik, and S. Savarese, "Gibson Env: Real-world perception for embodied agents," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.