Multi-View Masked World Models for Visual Robotic Manipulation

Younggyo Seo*, Junsu Kim*, Stephen James, Kimin Lee, Jinwoo Shin, Pieter Abbeel

[Paper] [Code]

Zero-Shot Sim-To-Real Transfer with Fixed Camera

MV-MWM can solve visual robotic manipulation tasks with random viewpoints even in the absence of camera calibration.

Viewpoint 1

Viewpoint 2

Viewpoint 3

Zero-Shot Sim-To-Real Transfer with Hand-Held Camera

Surprisingly, MV-MWM can solve visual robotic manipulation tasks with hand-held cameras.

Visual robotic manipulation research and applications often use multiple cameras, or views, to better perceive the world. How else can we utilize the richness of multi-view data? In this paper, we investigate how to learn good representations from multi-view data and use them for visual robotic manipulation. Specifically, we train a multi-view masked autoencoder that reconstructs pixels of randomly masked viewpoints, and then learn a world model that operates on the representations from the autoencoder. We demonstrate the effectiveness of our method in a range of scenarios, including multi-view control and single-view control with auxiliary cameras for representation learning. We also show that a multi-view masked autoencoder trained with multiple randomized viewpoints enables training a policy under strong viewpoint randomization and transferring that policy to real-robot tasks without camera calibration or an adaptation procedure.

Multi-View Masked World Models

Given multi-view data from multiple cameras or multiple randomized viewpoints, we mask viewpoints from video frames at random and train a multi-view masked autoencoder to reconstruct pixels of both masked and unmasked viewpoints. We then learn a world model on top of the frozen autoencoder representations to solve tasks in various robotic manipulation setups, including multi-view control, single-view control, and viewpoint-robust control, in both simulation and the real world.
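The view-masking step described above can be illustrated with a minimal NumPy sketch. It is not the released implementation: the function names (`patchify`, `sample_view_mask`, `reconstruction_loss`), the tensor layout `(T, V, H, W, C)`, the patch size, and the choice of hiding exactly one viewpoint per frame are all assumptions made for illustration. The only parts taken from the text are that whole viewpoints are masked at random and that the reconstruction target covers both masked and unmasked viewpoints.

```python
import numpy as np

def patchify(frames, patch):
    """Split (T, V, H, W, C) frames into (T, V, N, patch*patch*C) flat patches."""
    T, V, H, W, C = frames.shape
    assert H % patch == 0 and W % patch == 0, "image size must be divisible by patch"
    gh, gw = H // patch, W // patch
    x = frames.reshape(T, V, gh, patch, gw, patch, C)
    x = x.transpose(0, 1, 2, 4, 3, 5, 6)  # -> (T, V, gh, gw, patch, patch, C)
    return x.reshape(T, V, gh * gw, patch * patch * C)

def sample_view_mask(T, V, rng, num_masked_views=1):
    """Boolean (T, V) mask; True marks a viewpoint hidden from the encoder.

    Hiding exactly `num_masked_views` views per frame is an assumption.
    """
    mask = np.zeros((T, V), dtype=bool)
    for t in range(T):
        hidden = rng.choice(V, size=num_masked_views, replace=False)
        mask[t, hidden] = True
    return mask

def reconstruction_loss(pred, target):
    """Mean squared pixel error over ALL viewpoints, masked and unmasked."""
    return float(np.mean((pred - target) ** 2))

rng = np.random.default_rng(0)
frames = rng.random((4, 3, 32, 32, 3)).astype(np.float32)  # 4 frames, 3 views
patches = patchify(frames, patch=8)                        # (4, 3, 16, 192)
view_mask = sample_view_mask(T=4, V=3, rng=rng)            # one hidden view per frame

# Only patches from visible viewpoints would be fed to the encoder.
encoder_input = patches[~view_mask]                        # (8, 16, 192)

# A real decoder would predict every viewpoint; zeros stand in for its output here.
dummy_prediction = np.zeros_like(patches)
loss = reconstruction_loss(dummy_prediction, patches)
```

In this sketch the encoder sees only the visible viewpoints, while the loss is computed against every viewpoint, so the model has to infer the hidden view from the remaining ones.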

Main Experiments: Aggregate Performance

Main Experiments: Viewpoint Randomization

Ablation Studies