Varun Ravi Kumar*'', Marvin Klingner**, Senthil Yogamani', Markus Bach*, Stefan Milz'', Tim Fingscheidt**, and Patrick Mäder''
*Valeo DAR Kronach, Germany, **Technische Universität Braunschweig, Germany,
'Valeo Vision Systems, Ireland and ''Technische Universität Ilmenau, Germany
A 360° perception of scene geometry is essential for automated driving, notably for parking and urban driving scenarios. Typically, it is achieved using surround-view fisheye cameras focusing on the near field around the vehicle. Most current depth estimation approaches employ only a single camera and cannot be straightforwardly generalized to multiple cameras. Moreover, the depth estimation model must be deployed across car lines of different sizes with varying camera geometries. Even within a single car line, the intrinsics vary due to manufacturing tolerances. Deep learning models are sensitive to these changes, and it is practically infeasible to train and test on each camera variant. We therefore introduce novel camera-geometry adaptive multi-scale convolutions, which use the camera parameters as a conditional input and enable the model to generalize to unseen fisheye cameras. We incorporate this into our previous work on self-supervised, semantically guided distance estimation on fisheye images.
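To illustrate the idea of conditioning convolutions on camera parameters, the following is a minimal PyTorch-style sketch, not the exact SVDistNet implementation: a camera tensor (e.g. normalized pixel coordinates and angle-of-incidence maps derived from the fisheye calibration) is resized to each feature-map scale and concatenated to the features before a standard convolution. All module and parameter names are illustrative.

```python
import torch
import torch.nn as nn


class CameraGeometryConv(nn.Module):
    """Convolution conditioned on per-pixel camera-geometry channels.

    A camera tensor (e.g. normalized pixel coordinates and angle-of-incidence
    maps derived from the fisheye intrinsics) is resized to the feature-map
    resolution and concatenated to the input features before the convolution,
    so the filters can adapt to the specific camera geometry.
    """

    def __init__(self, in_channels, out_channels, cam_channels=4, kernel_size=3):
        super().__init__()
        self.conv = nn.Conv2d(in_channels + cam_channels, out_channels,
                              kernel_size, padding=kernel_size // 2)

    def forward(self, features, cam_tensor):
        # Bring the camera tensor to the spatial size of the current scale.
        cam = nn.functional.interpolate(cam_tensor, size=features.shape[-2:],
                                        mode="bilinear", align_corners=False)
        return self.conv(torch.cat([features, cam], dim=1))


# Hypothetical usage: a 4-channel camera tensor conditioning a 64-channel
# feature map at a coarser decoder scale.
feats = torch.randn(1, 64, 48, 64)
cam_t = torch.randn(1, 4, 192, 256)
out = CameraGeometryConv(64, 64)(feats, cam_t)
```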
Additionally, we improve the distance estimation with pairwise and patchwise vector-based self-attention encoder networks, further increasing the distance estimation performance. We evaluate our approach on the fisheye WoodScape surround-view dataset, improving significantly over previous approaches. We also show that our approach generalizes across different camera viewing angles and perform extensive experiments to support our contributions. To enable comparison with other approaches, we evaluate front-camera data on the KITTI dataset (pinhole camera images) and achieve state-of-the-art performance among self-supervised monocular methods. A short video at https://youtu.be/bmX0UcU9wtA showcases quantitative and qualitative results, highlights our novel contributions, and shows the post-processed estimates deployed in our car. The code and dataset will be made public as part of the WoodScape project: https://github.com/valeoai/WoodScape
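As a rough illustration of pairwise vector-based self-attention, the sketch below computes, for every pixel, per-channel attention weights over a small k×k footprint from the difference of query and key embeddings and uses them to aggregate value embeddings; the patchwise variant differs in how the weights are produced. This is a simplified, assumed formulation, not the exact encoder block used in the paper, and all names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class PairwiseVectorAttention(nn.Module):
    """Minimal pairwise vector-based self-attention over a k x k footprint.

    For each pixel i and neighbour j, a small MLP maps the difference of the
    query and key embeddings to a per-channel weight vector (vector attention
    instead of a scalar score), which modulates the value embedding of j.
    """

    def __init__(self, channels, mid_channels=None, kernel_size=3):
        super().__init__()
        mid = mid_channels or channels // 2
        self.k = kernel_size
        self.query = nn.Conv2d(channels, mid, 1)
        self.key = nn.Conv2d(channels, mid, 1)
        self.value = nn.Conv2d(channels, channels, 1)
        self.weight_mlp = nn.Sequential(
            nn.Conv2d(mid, mid, 1), nn.ReLU(inplace=True),
            nn.Conv2d(mid, channels, 1))

    def forward(self, x):
        b, c, h, w = x.shape
        q, k, v = self.query(x), self.key(x), self.value(x)
        pad = self.k // 2
        # Gather the k*k neighbourhood of keys and values for every pixel.
        k_unf = F.unfold(k, self.k, padding=pad).view(b, q.shape[1], self.k ** 2, h, w)
        v_unf = F.unfold(v, self.k, padding=pad).view(b, c, self.k ** 2, h, w)
        # Per-channel attention logits for each neighbour j (vector attention).
        logits = torch.stack(
            [self.weight_mlp(q - k_unf[:, :, j]) for j in range(self.k ** 2)], dim=2)
        attn = torch.softmax(logits, dim=2)  # normalise over the footprint
        return (attn * v_unf).sum(dim=2)


# Illustrative usage on a 64-channel feature map.
y = PairwiseVectorAttention(64)(torch.randn(1, 64, 24, 32))
```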
Our surround-view distance estimation framework employs a single network on images from multiple cameras. Surround-view coverage of geometric information for an autonomous vehicle is obtained by post-processing the distance maps from all cameras.
Overview of our surround-view self-supervised distance estimation framework making use of semantic guidance and camera-geometry adaptive convolutions (orange blocks). Our framework consists of processing units to train self-supervised distance estimation (blue blocks) and semantic segmentation (green blocks). The camera tensor C_t (orange block) enables our SVDistNet to yield distance maps for multiple camera viewpoints and makes the network camera-independent; C_t can also be adapted to standard camera models. Both tasks are weighted and optimized simultaneously by a multi-task loss. With the proposed framework, surround-view geometric information is obtained by post-processing the predicted distance maps in 3D space.
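A hedged sketch of the 3D post-processing step mentioned above: assuming per-pixel unit ray directions are precomputed from the fisheye calibration and the camera-to-vehicle extrinsics are known, each distance map can be unprojected into a vehicle-frame point cloud and the four cameras fused. Function and variable names are hypothetical.

```python
import torch


def distance_map_to_vehicle_points(distance, ray_dirs, cam_to_vehicle):
    """Unproject a per-pixel distance map into a 3D point cloud.

    distance:       (H, W) Euclidean distance along the viewing ray per pixel.
    ray_dirs:       (H, W, 3) unit ray directions from the fisheye calibration.
    cam_to_vehicle: (4, 4) rigid transform from camera to vehicle frame.
    """
    pts_cam = distance.unsqueeze(-1) * ray_dirs            # (H, W, 3) camera frame
    pts_cam = pts_cam.reshape(-1, 3)
    pts_hom = torch.cat([pts_cam, torch.ones(len(pts_cam), 1)], dim=1)
    pts_veh = (cam_to_vehicle @ pts_hom.T).T[:, :3]         # vehicle frame
    return pts_veh


# Fusing the four surround-view cameras into one vehicle-frame cloud:
# clouds = [distance_map_to_vehicle_points(d, r, T)
#           for d, r, T in zip(distances, rays, extrinsics)]
# surround_cloud = torch.cat(clouds, dim=0)
```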
Application of our semantic masking methods to handle potentially dynamic objects. The dynamic objects inside the segmentation masks from consecutive frames in (b) and (d) are accumulated into a dynamic object mask, which is used to mask the photometric error (e), as shown in (h).
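The masking itself can be summarized by the following hedged sketch: dynamic-class pixels from the current segmentation and from the neighbouring frame's segmentation (brought into the current view) are accumulated into one mask, and the photometric error is averaged only over the remaining pixels. The class IDs and function names are illustrative, not taken from the released code.

```python
import torch

# Hypothetical set of semantic IDs treated as potentially dynamic
# (e.g. vehicles, pedestrians, riders).
DYNAMIC_CLASSES = (11, 12, 13)


def dynamic_object_mask(seg_t, seg_t_warped):
    """Union of dynamic-class pixels from the current frame and the
    segmentation of the neighbouring frame warped into the current view."""
    mask = torch.zeros_like(seg_t, dtype=torch.bool)
    for cls in DYNAMIC_CLASSES:
        mask |= (seg_t == cls) | (seg_t_warped == cls)
    return mask


def masked_photometric_loss(photometric_error, seg_t, seg_t_warped):
    """Average the photometric error only over pixels not covered by the
    accumulated dynamic object mask."""
    keep = ~dynamic_object_mask(seg_t, seg_t_warped)
    return (photometric_error * keep).sum() / keep.sum().clamp(min=1)
```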
Overview of our proposed network architecture for semantically guided self-supervised distance estimation. It consists of a shared vector-based self-attention encoder and task-specific decoders. The encoder is a self-attention network with pairwise and patchwise variants, while the decoder uses pixel-adaptive convolutions; both are complemented by our novel camera-geometry adaptive convolutions.
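As a minimal sketch of pixel-adaptive convolutions, a shared convolution kernel can be modulated per pixel by a Gaussian adaptation kernel over guidance-feature differences (here, the semantic guidance features), which is one common formulation; the exact variant used in our decoder may differ, and all names below are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class PixelAdaptiveConv(nn.Module):
    """Minimal pixel-adaptive convolution: a shared 3x3 kernel is modulated
    per pixel by a Gaussian kernel on guidance-feature differences, giving
    content-dependent filtering without per-pixel learned weights."""

    def __init__(self, in_channels, out_channels, kernel_size=3):
        super().__init__()
        self.k = kernel_size
        self.weight = nn.Parameter(
            torch.randn(out_channels, in_channels, kernel_size, kernel_size) * 0.02)
        self.bias = nn.Parameter(torch.zeros(out_channels))

    def forward(self, x, guidance):
        b, c, h, w = x.shape
        pad = self.k // 2
        # Neighbourhoods of the input and of the guidance features.
        x_unf = F.unfold(x, self.k, padding=pad).view(b, c, self.k ** 2, h, w)
        g_unf = F.unfold(guidance, self.k, padding=pad).view(
            b, guidance.shape[1], self.k ** 2, h, w)
        g_center = guidance.unsqueeze(2)
        # Gaussian adaptation kernel on guidance-feature differences.
        adapt = torch.exp(-0.5 * ((g_unf - g_center) ** 2).sum(dim=1, keepdim=True))
        x_adapted = x_unf * adapt                                  # (b, c, k*k, h, w)
        w_flat = self.weight.view(self.weight.shape[0], -1)        # (out, c*k*k)
        out = torch.einsum("oi,bihw->bohw",
                           w_flat, x_adapted.reshape(b, c * self.k ** 2, h, w))
        return out + self.bias.view(1, -1, 1, 1)
```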
TO BE DONE