ZeroDepth: Towards Zero-Shot Scale-Aware
Monocular Depth Estimation

Vitor Guizilini     Igor Vasiljevic     Dian Chen     Rares Ambrus     Adrien Gaidon

The point clouds above were generated by a single model that has never seen any of these datasets, without ground-truth scale alignment.

Abstract. Monocular depth estimation is scale-ambiguous, and thus requires scale supervision to produce metric predictions. Even so, the resulting models are geometry-specific, with learned scales that cannot be directly transferred across domains. Because of this, recent works focus instead on relative depth, eschewing scale in favor of improved up-to-scale zero-shot transfer. In this work we introduce ZeroDepth, a novel monocular depth estimation framework capable of predicting metric scale for arbitrary test images from different domains and camera parameters. This is achieved by (i) the use of input-level geometric embeddings that enable the network to learn a scale prior over objects; and (ii) decoupling the encoder and decoder stages via a variational latent representation that is conditioned on single-frame information. We evaluated ZeroDepth on both outdoor (KITTI, DDAD, nuScenes) and indoor (NYUv2) benchmarks, and achieved a new state-of-the-art in both settings using the same pre-trained model, outperforming methods that train on in-domain data and require test-time scaling to produce metrically accurate estimates.
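
To make contribution (i) concrete, the sketch below shows one common way to construct input-level geometric embeddings: per-pixel viewing rays unprojected through the camera intrinsics and expanded with Fourier features, to be concatenated with image features at the input. This is a minimal sketch under our own assumptions (the pinhole camera model, the ray_embeddings name, the example intrinsics, and the frequency count are ours), not the authors' implementation.

```python
import torch

def ray_embeddings(K, H, W, num_freqs=8):
    """Hypothetical input-level geometric embedding: per-pixel viewing
    rays from pinhole intrinsics K (3x3), expanded with Fourier features."""
    # Pixel grid sampled at pixel centers
    v, u = torch.meshgrid(
        torch.arange(H, dtype=torch.float32) + 0.5,
        torch.arange(W, dtype=torch.float32) + 0.5,
        indexing="ij",
    )
    pix = torch.stack([u, v, torch.ones_like(u)], dim=-1)  # (H, W, 3)
    rays = pix @ torch.inverse(K).T                         # unproject to camera frame
    rays = rays / rays.norm(dim=-1, keepdim=True)           # unit viewing directions
    # Fourier features: sin/cos at geometrically spaced frequencies
    freqs = 2.0 ** torch.arange(num_freqs)                  # (F,)
    ang = rays[..., None] * freqs                           # (H, W, 3, F)
    emb = torch.cat([torch.sin(ang), torch.cos(ang)], dim=-1)
    return emb.reshape(H, W, -1)                            # (H, W, 3 * 2 * F)

# Example with KITTI-like intrinsics (illustrative values only)
K = torch.tensor([[721.5, 0.0, 609.6],
                  [0.0, 721.5, 172.9],
                  [0.0, 0.0, 1.0]])
emb = ray_embeddings(K, H=192, W=640)
print(emb.shape)  # torch.Size([192, 640, 48])
```

Because the embedding is a function of the camera intrinsics, pixels covering an object of a given physical size map to the same bundle of rays regardless of image resolution, which is what allows the network to associate appearance with metric scale across cameras.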

Contributions: