Abstract
Dashboard cameras capture a tremendous amount of driving scene video each day. These videos come naturally coupled with vehicle sensing data, such as speedometer and inertial sensor readings, providing an additional sensing modality for free. In this work, we leverage this large-scale unlabeled yet naturally paired data for visual representation learning in the driving scenario. A representation is learned in an end-to-end self-supervised framework for predicting dense optical flow from a single frame with paired sensing data. We postulate that success on this task requires the network to learn semantic and geometric knowledge of the ego-centric view. For example, forecasting the future view seen from a moving vehicle requires an understanding of scene depth, scale, and the movement of objects. We demonstrate that our learned representation benefits other tasks that require detailed scene understanding and outperforms competing unsupervised representations on semantic segmentation.
Motivation
Visuomotor:
“The ability to synchronize visual information with physical movement”
“How do I move?” ⟺ “How do my visual surroundings change?”
→ Forecasting future sight requires a comprehensive understanding of the scene.
Learning visuomotor ability by exploiting optical flow.
→ Traditional optical flow estimation: flow = F(I₁, I₂)
→ Our optical flow prediction with visuomotor guidance: flow = G(I₁, S₁)
S₁: vehicle sensor measurements (vehicle speed and angular velocity) at frame I₁.
→ We train G for visual representation learning.
Multi-modal (motion-guided) self-supervised learning.
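To illustrate why a single frame plus motion sensing can constrain optical flow at all, here is a toy ego-motion flow model: under a pinhole camera, small motion, and a known per-pixel depth map (all illustrative assumptions, not part of the method, which learns this mapping), speed and yaw rate largely determine the flow field.

```python
import numpy as np

def egomotion_flow(speed, yaw_rate, depth, f=500.0, cx=160.0, cy=120.0, dt=0.1):
    """Toy instantaneous optical flow for a forward-moving, yawing camera.

    Assumes a pinhole camera, a small time step, and a known per-pixel
    depth map (illustrative simplifications only).
    """
    h, w = depth.shape
    ys, xs = np.mgrid[0:h, 0:w].astype(float)
    x, y = xs - cx, ys - cy            # pixel coords relative to principal point
    tz = speed * dt                    # forward translation
    wy = yaw_rate * dt                 # rotation about the vertical axis
    # Translational flow: radial expansion from the focus of expansion, ~ 1/depth.
    u_t, v_t = x * tz / depth, y * tz / depth
    # Rotational flow (first order): yaw shifts all pixels roughly horizontally.
    u_r = f * wy + (x * x / f) * wy
    v_r = (x * y / f) * wy
    return np.stack([u_t + u_r, v_t + v_r], axis=-1)

depth = np.full((240, 320), 20.0)      # flat 20 m scene
flow = egomotion_flow(speed=10.0, yaw_rate=0.0, depth=depth)
```

For pure forward translation the flow vanishes at the principal point and grows toward the image border; the learned network G must recover the depth-dependent part from image content alone.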
Contributions
A generic sensor fusion architecture that predicts dense optical flow.
→ Generalizes across diverse backbone networks (AlexNet, VGG, ResNet).
→ Directly fuses motion sensor data and visual information in a mutually interacting way.
Sensor fusion meta-architecture for various applications in driving scenarios.
→ End-to-end trainable networks for visual representation learning: semantic segmentation.
→ Single image view synthesis.
A large-scale driving dataset.
Proposed Sensor Fusion Architecture
The overall architecture for optical flow prediction (proxy task) guided by vehicle motion.
→ Bidirectional, generic encoder-decoder structure + sensor modulator.
→ The encoder can be any general-purpose feature extractor (AlexNet, VGG, ResNet).
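A minimal numpy sketch of the fusion pattern: encoder features are modulated per channel by a scale and shift computed from the sensor vector (a FiLM-style modulator, used here as a stand-in; the paper's actual modulator design may differ), then decoded to a 2-channel flow map. Weights are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
C, S = 3, 2                                # feature channels; sensor dims (speed, yaw rate)
w_gamma = rng.standard_normal((S, C)) * 0.1
w_beta = rng.standard_normal((S, C)) * 0.1
w_out = rng.standard_normal((C, 2)) * 0.1

def encoder(img):
    """Toy encoder: 4x4 average pooling stands in for a conv backbone."""
    h, w, c = img.shape
    return img.reshape(h // 4, 4, w // 4, 4, c).mean(axis=(1, 3))

def modulate(feat, sensor):
    """Sensor modulator: per-channel scale and shift from the sensor vector."""
    gamma, beta = sensor @ w_gamma, sensor @ w_beta
    return feat * (1.0 + gamma) + beta

def decoder(feat):
    """Toy decoder: nearest-neighbor upsample + 1x1 projection to 2-ch flow."""
    return feat.repeat(4, axis=0).repeat(4, axis=1) @ w_out

def predict_flow(img, sensor):
    return decoder(modulate(encoder(img), sensor))

img = rng.random((64, 96, 3))
flow_a = predict_flow(img, np.array([10.0, 0.0]))   # fast, straight
flow_b = predict_flow(img, np.array([0.0, 0.5]))    # slow, turning
```

The same image yields different predicted flow under different sensor inputs, which is exactly the conditioning the proxy task relies on.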
Sensor modality:
→ We use only motion sensor quantities that are negated by *time-reversal (velocity, angular velocity, etc.).
→ *Time-reversal symmetry (T-symmetry): the symmetry of physical laws under the time-inversion transformation 𝑇: 𝑡 → −𝑡. Under time-reversal, physical variables fall into two classes: even variables, which are unchanged (e.g., the position, acceleration, and energy of a particle), and odd variables, which are negated (e.g., time, velocity, and angular momentum of a particle).
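Because every sensor channel used is odd under time-reversal, reversing a clip and negating its sensor readings yields a second valid training pair for free. A minimal sketch of this augmentation (my own illustration of the symmetry, not a confirmed detail of the training pipeline):

```python
import numpy as np

def time_reverse(frames, sensors):
    """Time-reversal augmentation: reverse frame order and negate the sensor
    channels, all of which (speed, angular velocity) are odd under t -> -t."""
    return frames[::-1].copy(), -sensors[::-1].copy()

frames = np.arange(5)[:, None, None] * np.ones((5, 4, 4))   # dummy 5-frame clip
sensors = np.array([[10.0, 0.1]] * 5)                       # (speed, yaw rate) per frame
rev_frames, rev_sensors = time_reverse(frames, sensors)
```

Applying the transform twice recovers the original pair, as T-symmetry requires.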
Learned Representation Test: Semantic Segmentation
Fine-tuned with FCN-8s (for ResNet-18/-32 and VGG) and FCN-32s (for AlexNet).
Dataset: CamVid & Cityscapes
Mean IoU comparisons for semantic segmentation
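For reference, the mean IoU metric used in the comparison averages per-class intersection-over-union; a minimal sketch (classes with an empty union are skipped, one common convention among several):

```python
import numpy as np

def mean_iou(pred, gt, num_classes):
    """Mean intersection-over-union over classes with a non-empty union."""
    ious = []
    for c in range(num_classes):
        inter = np.sum((pred == c) & (gt == c))
        union = np.sum((pred == c) | (gt == c))
        if union > 0:
            ious.append(inter / union)
    return float(np.mean(ious))

gt   = np.array([[0, 0, 1, 1]])
pred = np.array([[0, 1, 1, 1]])
score = mean_iou(pred, gt, num_classes=2)   # class 0: 1/2, class 1: 2/3
```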
Application: Single Image View Synthesis
6-DoF dynamic control input with virtual path generation (dataset: EuRoC-MAV).
Comparison with previous unsupervised flow-based view synthesis methods (dataset: KITTI).
Controllable by continuous sensor embedding (supplementary video).
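At its core, flow-based view synthesis warps the source image by the predicted flow. A nearest-neighbor backward-warping sketch (real pipelines use differentiable bilinear sampling; this is only the underlying idea):

```python
import numpy as np

def warp(img, flow):
    """Backward warp: output pixel p samples the source image at p - flow(p).
    Nearest-neighbor sampling with edge clamping, for illustration only."""
    h, w = img.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    src_x = np.clip(np.rint(xs - flow[..., 0]).astype(int), 0, w - 1)
    src_y = np.clip(np.rint(ys - flow[..., 1]).astype(int), 0, h - 1)
    return img[src_y, src_x]

img = np.arange(16.0).reshape(4, 4)
zero = np.zeros((4, 4, 2))
shift = np.zeros((4, 4, 2))
shift[..., 0] = 1.0          # flow of 1 px to the right everywhere
```

Zero flow reproduces the input; a uniform rightward flow shifts the image by one pixel, with the border clamped.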