FSNet: Redesign Self-Supervised MonoDepth for Full-Scale Depth Prediction for Autonomous Driving
Full-Scale Depth Prediction with Poses
The proposed setup is well-suited for well-calibrated, localized robots with some onboard neural computing power. We study in detail how to obtain real-world-scale depth predictions from a network trained with robot poses.
User-side features:
Lightweight: Same simple single-frame network structure as the lightweight MonoDepth2. No PoseNet is needed during training.
Adaptivity: Directly trained on image-pose sequences collected by the robot in the target environment. Can also be co-trained with multiple datasets.
Deployability: Real-world-scale depth prediction right from the lightweight neural network.
Tech contributions:
Tackles the initial training corruption that arises when poses are used directly in training. No PoseNet (see the sketch after this list).
Multichannel output setup for stable training.
Optical-flow mask for dynamic-object masking.
Self-distillation.
Post-processing with sparse VO.
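To make the first point concrete, below is a minimal PyTorch sketch of the core idea: the photometric reprojection loss is driven by the robot's relative pose `T_ts` (from the localization module) instead of a PoseNet output, which is what ties the predicted depth to the real-world scale. All function and tensor names here are illustrative, not the repository's API.

```python
import torch
import torch.nn.functional as F

def backproject(depth, K_inv):
    """Lift every pixel to a 3D point using the predicted depth.
    depth: (B, 1, H, W); K_inv: (B, 3, 3) inverse camera intrinsics."""
    B, _, H, W = depth.shape
    ys, xs = torch.meshgrid(
        torch.arange(H, dtype=depth.dtype, device=depth.device),
        torch.arange(W, dtype=depth.dtype, device=depth.device),
        indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=0)   # (3, H, W)
    pix = pix.view(1, 3, -1).expand(B, -1, -1)                # (B, 3, H*W)
    rays = K_inv @ pix
    return rays * depth.view(B, 1, -1)                        # (B, 3, H*W)

def reproject(points, K, T_ts):
    """Project 3D points into the source view using the relative pose
    T_ts (B, 4, 4) supplied by the robot, not predicted by a network."""
    pts_h = torch.cat([points, torch.ones_like(points[:, :1])], dim=1)
    cam = (T_ts @ pts_h)[:, :3]                                # (B, 3, H*W)
    pix = K @ cam
    return pix[:, :2] / pix[:, 2:3].clamp(min=1e-6)            # (B, 2, H*W)

def photometric_loss(img_t, img_s, depth_t, K, K_inv, T_ts):
    """Warp the source image into the target frame and compare (L1)."""
    B, _, H, W = img_t.shape
    pix = reproject(backproject(depth_t, K_inv), K, T_ts).view(B, 2, H, W)
    # Normalize pixel coordinates to [-1, 1] for grid_sample.
    grid = torch.stack([pix[:, 0] / (W - 1), pix[:, 1] / (H - 1)], dim=-1)
    grid = grid * 2 - 1
    warped = F.grid_sample(img_s, grid, align_corners=True,
                           padding_mode="border")
    return (img_t - warped).abs().mean()
```

Because `T_ts` carries metric translation, any depth prediction at the wrong scale produces a large reprojection error, so the network is pushed toward the correct absolute scale.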
Paper Summary Video:
Additional Words on Motivation:
We believe we should not expect PoseNet, a ResNet applied to a concatenation of two images, to produce more reliable poses than a robot's localization module. We therefore avoid PoseNet entirely. Doing so destabilizes early training, but FSNet resolves this.
We believe the network should directly predict accurate, correctly scaled depth from the very beginning. Our method therefore produces meaningful results on static frames or scenes with few or no VO points, falling back to the network's direct prediction. Some images in our multi-frame experiment contain no VO points at all, and our method remains robust to them (a minimal sketch of this fallback is given below).
These ideas motivate the "Redesign" claimed in the title.
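As a rough illustration of the VO fallback mentioned above, here is a hypothetical post-processing sketch. FSNet's actual refinement step may differ; this only shows the intended behavior, using a median-ratio scale alignment when sparse VO depths are available and the direct prediction otherwise.

```python
import numpy as np

def refine_with_vo(pred_depth, vo_uv, vo_depth):
    """pred_depth: (H, W) network prediction in meters.
    vo_uv: (N, 2) integer pixel coordinates of sparse VO points (may be empty).
    vo_depth: (N,) VO depths at those pixels."""
    if len(vo_depth) == 0:
        # No VO points: the direct prediction already carries metric scale.
        return pred_depth
    sampled = pred_depth[vo_uv[:, 1], vo_uv[:, 0]]
    scale = np.median(vo_depth / np.maximum(sampled, 1e-6))
    return pred_depth * scale
```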
Visualized on a full validation sequence of KITTI-360.
The network reads only one image per frame. We modularize the data publisher and network inference into independent ROS nodes to render the video in real time; a minimal node sketch follows.
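Below is a minimal rospy sketch of this split. Topic names, the exported model path, and the assumed (B, 1, H, W) output shape are all illustrative, not the repository's actual launch configuration: the inference node subscribes to frames from a separate data-publisher node, runs the single-frame depth network, and republishes the depth map.

```python
import rospy
import torch
from sensor_msgs.msg import Image
from cv_bridge import CvBridge

class DepthNode:
    def __init__(self, model):
        self.model = model.eval()
        self.bridge = CvBridge()
        self.pub = rospy.Publisher("/fsnet/depth", Image, queue_size=1)
        rospy.Subscriber("/camera/image_raw", Image, self.callback,
                         queue_size=1, buff_size=2**24)

    @torch.no_grad()
    def callback(self, msg):
        # One image per frame: no multi-frame buffering is needed.
        img = self.bridge.imgmsg_to_cv2(msg, desired_encoding="rgb8")
        x = torch.from_numpy(img).float().permute(2, 0, 1)[None] / 255.0
        depth = self.model(x)[0, 0].cpu().numpy()  # assumes (B, 1, H, W) output
        out = self.bridge.cv2_to_imgmsg(depth, encoding="32FC1")
        out.header = msg.header
        self.pub.publish(out)

if __name__ == "__main__":
    rospy.init_node("fsnet_depth_node")
    model = torch.jit.load("fsnet.pt")  # hypothetical exported model path
    DepthNode(model)
    rospy.spin()
```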
Visualized on the nuScenes dataset.
We run additional experiments on this dataset to demonstrate the potential of FSNet.