Play it by Ear: Learning Skills amidst Occlusion
through Audio-Visual Imitation Learning

Maximilian Du*, Olivia Y. Lee*, Suraj Nair, and Chelsea Finn

Stanford University

Paper | Code

Robotics: Science and Systems, 2022

Our goal is to enable robots to tackle challenging, visually occluded manipulation tasks (such as extracting keys from a bag) via end-to-end interactive imitation learning from vision and sound.

Learning from Vision and Sound

Interactive Imitation with a Human Supervisor

Our method contains three key components: (1) an end-to-end learned approach that takes both vision and sound as input and fuses the two modalities; (2) a memory-augmented neural network that captures the multi-modal inputs over an extended history; (3) interactive imitation learning, which efficiently fine-tunes the policy given online interaction.
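The first two components can be illustrated with a minimal sketch: separate vision and audio embeddings are fused, and a recurrent state carries the fused features over time. This is not the authors' architecture; the dimensions, the random linear "encoders", and the simple RNN update are all stand-in assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (not from the paper): visual embedding, audio
# embedding, fused feature size, memory (hidden) size, and action dimension.
D_VIS, D_AUD, D_FUSE, D_MEM, D_ACT = 32, 16, 24, 24, 7

# Random linear maps standing in for learned image / audio encoders.
W_vis = rng.normal(size=(D_VIS, D_FUSE)) * 0.1
W_aud = rng.normal(size=(D_AUD, D_FUSE)) * 0.1
W_in  = rng.normal(size=(D_FUSE, D_MEM)) * 0.1
W_rec = rng.normal(size=(D_MEM, D_MEM)) * 0.1
W_act = rng.normal(size=(D_MEM, D_ACT)) * 0.1

def policy_step(vis_feat, aud_feat, memory):
    """Fuse vision and audio, update the recurrent memory, output an action."""
    fused = np.tanh(vis_feat @ W_vis + aud_feat @ W_aud)  # late fusion
    memory = np.tanh(fused @ W_in + memory @ W_rec)       # simple RNN update
    action = memory @ W_act
    return action, memory

# Roll the policy over a short episode, carrying memory across timesteps,
# so occluded visual information can be compensated by audio history.
memory = np.zeros(D_MEM)
for t in range(5):
    vis = rng.normal(size=D_VIS)  # stand-in for an image embedding
    aud = rng.normal(size=D_AUD)  # stand-in for an audio embedding
    action, memory = policy_step(vis, aud, memory)
```

The key design choice this mirrors is that the memory sits after fusion, so the history the network retains is already multi-modal.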

In our real-robot experiments, we consider the task of extracting keys from a bag under easy and hard initializations. Our full approach extracts the keys with a 70% success rate, more than double that of a model that does not use audio.

In simulation experiments, we further verify that all components of our method are critical, and that online interventions significantly improve performance.
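The role of online interventions can be sketched as a toy DAgger-style loop: the policy acts, a supervisor corrects it when it errs, and the corrections are folded back into training. Everything here (the integer states, the fixed expert mapping, and "fine-tuning" by memorizing corrections) is a hypothetical stand-in for the data flow, not the authors' code.

```python
# Toy interactive imitation loop. States and actions are small integers;
# the "expert" is a fixed mapping that plays the role of the human supervisor.
EXPERT = {s: (s * 2) % 5 for s in range(5)}  # hypothetical expert policy

def intervene_if_wrong(policy, state):
    """Supervisor watches the policy and takes over when it errs."""
    proposed = policy.get(state, 0)
    if proposed != EXPERT[state]:
        return EXPERT[state], True   # intervention: expert action recorded
    return proposed, False

def interactive_round(policy, states):
    """One round: roll out, collect corrections, fine-tune on them."""
    corrections = []
    for s in states:
        action, intervened = intervene_if_wrong(policy, s)
        if intervened:
            corrections.append((s, action))
    policy.update(corrections)       # "fine-tune" = memorize corrections
    return len(corrections)

policy = {}                               # starts with no knowledge
n1 = interactive_round(policy, range(5))  # many interventions at first
n2 = interactive_round(policy, range(5))  # few after fine-tuning
print(n1, n2)  # 4 0
```

The point the toy makes is that supervisor effort concentrates early and decays as the policy absorbs the corrections, which is what makes interactive fine-tuning sample-efficient.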

5-Minute Video Summary: