Visuomotor Understanding for Representation Learning of Driving Scenes

Seokju Lee¹, Junsik Kim¹, Tae-Hyun Oh², Yongseop Jeong¹, Donggeun Yoo³, Stephen Lin⁴, In So Kweon¹

¹KAIST, ²MIT CSAIL, ³Lunit, ⁴Microsoft Research

BMVC 2019


Dashboard cameras capture a tremendous amount of driving scene video each day. These videos are purposefully coupled with vehicle sensing data, such as from the speedometer and inertial sensors, providing an additional sensing modality for free. In this work, we leverage the large-scale unlabeled yet naturally paired data for visual representation learning in the driving scenario. A representation is learned in an end-to-end self-supervised framework for predicting dense optical flow from a single frame with paired sensing data. We postulate that success on this task requires the network to learn semantic and geometric knowledge in the ego-centric view. For example, forecasting a future view to be seen from a moving vehicle requires an understanding of scene depth, scale, and movement of objects. We demonstrate that our learned representation can benefit other tasks that require detailed scene understanding and outperforms competing unsupervised representations on semantic segmentation.


  • Visuomotor:

“The ability to synchronize visual information with physical movement”

“How do I move?” ⟺ “How do my visual surroundings change?”

→ Forecasting a future view requires a comprehensive understanding of the scene.

  • Learning visuomotor understanding by exploiting optical flow.

→ Traditional optical flow estimation: flow = F(I₁, I₂)

→ Our optical flow prediction with visuomotor guidance: flow = G(I₁, S₁)

S₁: vehicle sensor measurement (vehicle speed and angular velocity) at frame I₁.

→ We train G for visual representation learning.

  • Multi-modal (motion-guided) self-supervised learning.
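
To illustrate why predicting flow from a single frame plus motion data forces a network to reason about geometry, here is a minimal sketch of the classical instantaneous motion field for a static scene. It is not the learned network G. It assumes a pinhole camera with focal length 1, forward motion along the optical axis, rotation only about the vertical axis, and one particular sign convention for yaw. The function name `ego_motion_flow` and its parameters are hypothetical.

```python
def ego_motion_flow(x, y, depth, speed, yaw_rate):
    """Image motion (u, v) of a static point under camera ego-motion.

    Assumptions (not from the source): pinhole camera with focal length 1,
    (x, y) are normalized image coordinates, `depth` is the distance along
    the optical axis, `speed` is forward velocity along the optical axis,
    and `yaw_rate` is the rotation rate about the camera's vertical axis.
    """
    u = x * speed / depth - yaw_rate * (1.0 + x * x)
    v = y * speed / depth - yaw_rate * x * y
    return u, v


# Same pixel, same vehicle speed, different depths: the nearby point
# moves faster in the image than the distant one.
near = ego_motion_flow(0.2, 0.1, depth=5.0, speed=10.0, yaw_rate=0.0)
far = ego_motion_flow(0.2, 0.1, depth=50.0, speed=10.0, yaw_rate=0.0)

# Pure forward motion leaves the image center fixed (focus of expansion).
center = ego_motion_flow(0.0, 0.0, depth=5.0, speed=10.0, yaw_rate=0.0)
```

Under these assumptions the flow depends on depth, which a single image does not provide directly, so a model that maps (I₁, S₁) to flow has to infer scene structure from appearance. Moving objects, which this sketch ignores, add a further semantic component.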


  • A generic sensor fusion architecture that predicts dense optical flow.

Generalizes to diverse base networks (AlexNet, VGG, ResNet).

→ Directly fusing motorial sensor data and visual information in a mutually interacting way.

  • Sensor fusion meta-architecture for various applications in driving scenarios.

→ End-to-end trainable networks for visual representation learning: semantic segmentation.

→ Single image view synthesis.

  • A large-scale driving dataset.

Proposed Sensor Fusion Architecture

  • The overall architecture for optical flow prediction (proxy task) guided by vehicle motion.

→ Bidirectional, generic encoder-decoder structure + sensor modulator.

→ The encoder can be any general-purpose feature extractor (AlexNet, VGG, ResNet).

  • Sensor modality:

→ We use only motorial sensor quantities that are negated under *time reversal (velocity, angular momentum, etc.).

→ *Time-reversal symmetry (T-symmetry): the symmetry of physical laws under time inversion, 𝑇: 𝑡→−𝑡. Physical variables fall into two types with respect to time reversal: those left unchanged by it (even), e.g., the position, acceleration, and energy of a particle, and those negated by it (odd), e.g., the time, velocity, and angular momentum of a particle.
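
The even/odd distinction can be checked numerically. The following is a minimal sketch using constant-acceleration kinematics; the `PARITY` table, `time_reverse`, and `step` are hypothetical names introduced for illustration and are not part of the released code.

```python
# Parity of each quantity under time reversal T: t -> -t.
# +1 means even (unchanged), -1 means odd (negated).
PARITY = {"position": +1, "acceleration": +1, "velocity": -1}


def time_reverse(state):
    """Apply T to a state by negating its odd quantities."""
    return {name: PARITY[name] * value for name, value in state.items()}


def step(state, dt):
    """Advance constant-acceleration motion by dt."""
    x, v, a = state["position"], state["velocity"], state["acceleration"]
    return {
        "position": x + v * dt + 0.5 * a * dt * dt,
        "velocity": v + a * dt,
        "acceleration": a,
    }


start = {"position": 2.0, "velocity": 3.0, "acceleration": -1.0}
later = step(start, 0.5)

# Reverse time at the later state and run the same dynamics forward:
# the particle retraces its path and arrives at the time-reversed start.
back = step(time_reverse(later), 0.5)
```

Here `back` has the starting position and the negated starting velocity. Vehicle speed and angular velocity are odd in the same sense, so a time-reversed sensor reading is simply the negated reading.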

Learned Representation Test: Semantic Segmentation

  • Fine-tuned with FCN-8s (for ResNet-18/-32/VGG) and FCN-32s (for AlexNet)

  • Dataset: CamVid & Cityscapes

  • Mean IoU comparisons for semantic segmentation
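
For reference, mean IoU averages the per-class intersection-over-union between predicted and ground-truth label maps. The sketch below is a minimal pure-Python version on flattened label lists. The function name, the `ignore_label` parameter, and the choice to skip classes absent from both prediction and ground truth are assumptions; benchmark toolkits typically accumulate counts over a whole dataset before averaging.

```python
def mean_iou(pred, target, num_classes, ignore_label=None):
    """Mean intersection-over-union over classes present in pred or target.

    `pred` and `target` are equal-length sequences of integer class labels
    in the range [0, num_classes).
    """
    inter = [0] * num_classes
    union = [0] * num_classes
    for p, t in zip(pred, target):
        if t == ignore_label:
            continue  # unlabeled pixel: excluded from the metric
        if p == t:
            inter[t] += 1
            union[t] += 1
        else:
            # A mismatch enlarges the union of both classes involved.
            union[p] += 1
            union[t] += 1
    ious = [i / u for i, u in zip(inter, union) if u > 0]
    return sum(ious) / len(ious) if ious else float("nan")


pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
score = mean_iou(pred, target, num_classes=3)  # IoUs: 1/3, 2/3, 1/2
```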

Application: Single Image View Synthesis

  • 6-DoF dynamic control input with virtual path generation (dataset: EuRoC-MAV).

  • Comparison with previous unsupervised flow-based view synthesis methods (dataset: KITTI).

  • Controllable by continuous sensor embedding (supplementary video).
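
The page does not spell out how a predicted flow field is turned into a new view. One common approach in flow-based view synthesis, assumed here purely for illustration, is to warp the source image along the flow. The sketch below is a toy nearest-neighbor forward warp on nested lists; `forward_warp` is a hypothetical name, and practical systems use differentiable, interpolated warping and handle holes and occlusions.

```python
def forward_warp(image, flow):
    """Move each source pixel (x, y) to (x + u, y + v), nearest neighbor.

    `image` is a list of rows; `flow[y][x]` is a (u, v) displacement in
    pixels. Target pixels that receive no source pixel stay None (holes);
    when several source pixels land on one target, the last one wins.
    """
    h, w = len(image), len(image[0])
    out = [[None] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            u, v = flow[y][x]
            tx, ty = x + round(u), y + round(v)
            if 0 <= tx < w and 0 <= ty < h:
                out[ty][tx] = image[y][x]
    return out


image = [[1, 2, 3],
         [4, 5, 6]]
shift_right = [[(1, 0)] * 3 for _ in range(2)]
warped = forward_warp(image, shift_right)
```

In this example every pixel shifts one column to the right, so the first column of `warped` is left as holes and the last source column falls outside the frame.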

Code [GitHub]

Dataset [GitHub]