Multi-Frame Self-Supervised
Depth with Transformers

Vitor Guizilini, Rares Ambrus, Dian Chen, Sergey Zakharov, Adrien Gaidon


Abstract. Multi-frame depth estimation improves over single-frame approaches by also leveraging geometric relationships between images via feature matching, in addition to learning appearance-based features. In this paper we revisit feature matching for self-supervised monocular depth estimation, and propose a novel transformer architecture for cost volume generation. We use depth-discretized epipolar sampling to select matching candidates, and refine predictions through a series of self- and cross-attention layers. These layers sharpen the matching probability between pixel features, improving over standard similarity metrics prone to ambiguities and local minima. The refined cost volume is decoded into depth estimates, and the whole pipeline is trained end-to-end from videos using only a photometric objective. Experiments on the KITTI and DDAD datasets show that our DepthFormer architecture establishes a new state of the art in self-supervised monocular depth estimation, and is even competitive with highly specialized supervised single-frame architectures. We also show that our learned cross-attention network yields representations transferable across datasets, increasing the effectiveness of pre-training strategies.
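The depth-discretized epipolar sampling described above can be sketched as a plane-sweep reprojection: each target pixel is back-projected at a set of depth hypotheses and reprojected into an adjacent frame to locate its matching candidates. The sketch below is illustrative only (the function name and per-pixel loop are assumptions; the paper's sampler operates on dense feature maps inside the network):

```python
import numpy as np

def epipolar_sample_coords(K, R, t, u, v, depths):
    """For a target pixel (u, v), compute its reprojected coordinates in a
    source frame for each depth hypothesis (plane-sweep style sampling).
    K: 3x3 intrinsics; (R, t): relative pose from target to source camera.
    Illustrative sketch only, not the paper's implementation."""
    K_inv = np.linalg.inv(K)
    ray = K_inv @ np.array([u, v, 1.0])      # back-projected viewing ray
    coords = []
    for d in depths:
        p_cam = d * ray                      # 3D point at depth hypothesis d
        p_src = R @ p_cam + t                # transform into source camera frame
        uvw = K @ p_src                      # project with source intrinsics
        coords.append(uvw[:2] / uvw[2])      # perspective divide -> pixel coords
    return np.stack(coords)                  # (K, 2) candidate locations
```

Features bilinearly sampled at these candidate locations form the raw cost volume that the attention layers then refine.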

Contributions:

  • We introduce a novel architecture, the DepthFormer, that improves multi-view feature matching via cross- and self-attention combined with depth-discretized epipolar sampling.

  • Our architecture leads to state-of-the-art depth estimation results. It outperforms other self-supervised multi-frame methods by a large margin, and even surpasses supervised single-frame architectures.

  • Our learned attention-based matching function is transferable across datasets, which can significantly improve convergence speed while decreasing memory requirements.
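The attention-based matching in the contributions above can be illustrated with a single cross-attention step: learned projections of a target-pixel feature and its epipolar candidates produce a sharpened matching distribution over depth hypotheses, rather than a raw dot-product similarity. This is a minimal sketch, assuming a single layer with hypothetical weights `W_q` and `W_k` (the paper stacks several self- and cross-attention layers over full feature maps):

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_refined_matching(target_feat, cand_feats, W_q, W_k):
    """Cross-attention between one target-pixel feature (C,) and its K
    epipolar candidate features (K, C). W_q and W_k are hypothetical
    learned projection matrices; returns a matching probability over the
    K depth hypotheses. Single-layer sketch, not the full architecture."""
    q = W_q @ target_feat                         # query from target pixel, (C,)
    keys = cand_feats @ W_k.T                     # keys from candidates, (K, C)
    scores = keys @ q / np.sqrt(q.shape[0])       # scaled dot products, (K,)
    return softmax(scores)                        # matching distribution over depths
```

The argmax (or expectation) of this distribution over the depth hypotheses yields the per-pixel depth estimate decoded from the refined cost volume.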

Citation

@inproceedings{tri_depthformer_cvpr22,
  author    = {Vitor Guizilini and Rares Ambrus and Dian Chen and Sergey Zakharov and Adrien Gaidon},
  title     = {Multi-Frame Self-Supervised Depth with Transformers},
  booktitle = {IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2022}
}