Depth Field Networks for Generalizable
Multi-view Scene Representation

Vitor Guizilini* Igor Vasiljevic* Jiading Fang* Rares Ambrus

Greg Shakhnarovich Matthew Walter Adrien Gaidon

Abstract. Modern 3D computer vision leverages learning to boost geometric reasoning, mapping image data to classical structures such as cost volumes or epipolar constraints to improve matching. These architectures are specialized according to the particular problem, and thus require significant task-specific tuning, often leading to poor domain generalization performance. Recently, generalist Transformer architectures have achieved impressive results in tasks such as optical flow and depth estimation by encoding geometric priors as inputs rather than as enforced constraints. In this paper, we extend this idea and propose to learn an implicit, multi-view consistent scene representation, introducing a series of 3D data augmentation techniques as a geometric inductive prior to increase view diversity. We also show that introducing view synthesis as an auxiliary task further improves depth estimation. Our Depth Field Networks (DeFiNe) achieve state-of-the-art results in stereo and video depth estimation without explicit geometric constraints, and improve on zero-shot domain generalization by a wide margin.
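The abstract mentions that view synthesis is learned jointly with depth as an auxiliary task. As a hypothetical sketch (the specific loss forms and the `weight_rgb` balancing term below are illustrative assumptions, not the paper's exact formulation), such a joint objective can be written as a weighted sum of a depth term and a photometric term on the synthesized view:

```python
import numpy as np

def joint_loss(pred_depth, gt_depth, pred_rgb, gt_rgb, weight_rgb=1.0):
    """Combined objective: depth supervision plus a view-synthesis auxiliary term.

    Loss choices (L1 on depth, MSE on RGB) are illustrative assumptions.
    """
    depth_term = np.abs(pred_depth - gt_depth).mean()   # L1 depth loss
    rgb_term = ((pred_rgb - gt_rgb) ** 2).mean()        # photometric MSE on synthesized view
    return depth_term + weight_rgb * rgb_term
```

The auxiliary photometric term gives the network a training signal even at pixels without valid ground-truth depth, which is one plausible reason view synthesis helps depth estimation.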

Contributions:

  • We use a generalist Transformer-based architecture to learn a depth estimator from an arbitrary number of posed images. In this setting, we (i) propose a series of 3D augmentations that improve the geometric consistency of our learned latent representation; and (ii) show that jointly learning view synthesis as an auxiliary task improves depth estimation.

  • Our Depth Field Networks (DeFiNe) not only achieve state-of-the-art stereo depth estimation results on the widely used ScanNet dataset, but also exhibit superior generalization properties with state-of-the-art results on zero-shot transfer to 7-Scenes.

  • DeFiNe also enables depth estimation from arbitrary viewpoints. We evaluate this novel generalization capability in the context of both interpolation (between timesteps) and extrapolation (future timesteps).
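The 3D augmentations above are described only at a high level. A minimal sketch of the general idea, assuming one standard recipe (back-projecting depth to a point cloud and perturbing camera poses to create virtual views; all function names and jitter magnitudes here are hypothetical, not the paper's implementation):

```python
import numpy as np

def backproject(depth, K):
    """Back-project a depth map (H, W) to a 3D point cloud in camera coordinates."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3)  # (HW, 3) homogeneous pixels
    rays = pix @ np.linalg.inv(K).T                                  # (HW, 3) camera rays
    return rays * depth.reshape(-1, 1)                               # scale rays by depth

def jitter_pose(T, max_trans=0.1, max_rot=0.05, rng=None):
    """Perturb a 4x4 camera-to-world pose with a small random rigid transform,
    yielding a virtual viewpoint for augmentation."""
    rng = rng or np.random.default_rng()
    w = rng.uniform(-max_rot, max_rot, 3)          # small axis-angle rotation
    theta = np.linalg.norm(w)
    wx = np.array([[0, -w[2], w[1]],
                   [w[2], 0, -w[0]],
                   [-w[1], w[0], 0]])
    if theta > 1e-12:  # Rodrigues' formula for the rotation matrix
        R = (np.eye(3) + np.sin(theta) / theta * wx
             + (1 - np.cos(theta)) / theta**2 * (wx @ wx))
    else:
        R = np.eye(3)
    delta = np.eye(4)
    delta[:3, :3] = R
    delta[:3, 3] = rng.uniform(-max_trans, max_trans, 3)  # small translation jitter
    return T @ delta
```

Under this reading, increasing the diversity of (image, pose) pairs seen during training is what encourages the learned latent representation to stay geometrically consistent across viewpoints.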

Citation

@inproceedings{tri_define,
  author    = {Vitor Guizilini and Igor Vasiljevic and Jiading Fang and Rares Ambrus and Greg Shakhnarovich and Matthew Walter and Adrien Gaidon},
  title     = {Depth Field Networks for Generalizable Multi-view Scene Representation},
  booktitle = {European Conference on Computer Vision (ECCV)},
  year      = {2022}
}