Masked Imitation Learning: Discovering Environment-Invariant Modalities in Multimodal Demonstrations

Yilun Hao, Ruinan Wang, Zhangjie Cao, Zihan Wang, Yuchen Cui, Dorsa Sadigh

Abstract

Multimodal demonstrations provide robots with an abundance of information to make sense of the world. However, such abundance may not always lead to good performance when it comes to learning sensorimotor control policies from human demonstrations. Extraneous data modalities can lead to state over-specification, where the state contains extraneous modalities that are not only useless for decision-making but also changes data distribution across environments. State over-specification causes the learned policy not to generalize well outside of the training data distribution. In this work, we propose Masked Imitation Learning (MIL) to address state over-specification by selectively using different modalities. Specifically, we design a masked policy network with a binary mask to block certain modalities and develop a bi-level optimization algorithm to learn this mask to accurately filter over-specified modalities. We demonstrate empirically that MIL outperforms baseline algorithms in simulated domains including MuJoCo and a robot arm environment using the Robomimic dataset, and effectively recovers the environment-invariant modalities on a multimodal dataset collected on a real robot.

Consider the example in above image, a robot is trained to perform a pick-and-place task in simulation from three different modalities (RGB image, depth image, and proprioception). To locate and pick up the object, depth and proprioception information are sufficient. Therefore, the remaining modality of RGB image that changes from the simulation to the real setting would over-specify the state for this particular task.

When such extraneous information is used by policy models to predict action, they are less likely to generalize at test time, especially when the testing environment changes, e.g., the common paradigms of training in simulation and testing the policy on the real robot as shown above.

In this paper, we focus on addressing the state over-specification problem introduced by extraneous modalities to avoid overfitting training data when learning from multimodal data.

Robomimic-Can Result