Our goal is to enable robots to tackle challenging, visually occluded manipulation tasks (such as extracting keys from a bag) via end-to-end interactive imitation learning from vision and sound.
Learning from Vision and Sound
Interactive Imitation with a Human Supervisor
Our method has three key components: (1) an end-to-end learned policy that takes both vision and sound as input and fuses the two modalities; (2) a memory-augmented neural network that captures the multi-modal inputs over an extended history; and (3) interactive imitation learning, which efficiently fine-tunes the policy from online interaction.
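To make components (1) and (2) concrete, the sketch below shows one way such a policy could be structured in PyTorch. The module names, feature dimensions, early fusion by concatenation, and the choice of a GRU for memory are our own illustrative assumptions, not the exact architecture used in the paper.

```python
# Minimal sketch of a vision + audio policy with a recurrent memory.
# Assumptions: CNN encoders, mel-spectrogram audio input, GRU memory, continuous actions.
import torch
import torch.nn as nn

class VisionAudioPolicy(nn.Module):
    def __init__(self, action_dim=7, hidden_dim=256):
        super().__init__()
        # Vision encoder: RGB image -> feature vector (architecture is illustrative).
        self.vision_enc = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, 128),
        )
        # Audio encoder: spectrogram treated as a single-channel image.
        self.audio_enc = nn.Sequential(
            nn.Conv2d(1, 32, 5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, 128),
        )
        # Memory over the fused multi-modal features (component 2).
        self.memory = nn.GRU(input_size=256, hidden_size=hidden_dim, batch_first=True)
        self.action_head = nn.Linear(hidden_dim, action_dim)

    def forward(self, images, spectrograms, hidden=None):
        # images: (B, T, 3, H, W); spectrograms: (B, T, 1, F, L)
        B, T = images.shape[:2]
        v = self.vision_enc(images.flatten(0, 1)).view(B, T, -1)
        a = self.audio_enc(spectrograms.flatten(0, 1)).view(B, T, -1)
        fused = torch.cat([v, a], dim=-1)          # early fusion of the two modalities
        out, hidden = self.memory(fused, hidden)   # extended history via recurrence
        return self.action_head(out), hidden
```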
In our real-robot experiment we consider the task of extracting keys from a bag, with both easy and hard initializations. We find that our full approach succeeds at extracting the keys with a 70% success rate, more than double the success rate of the model that does not use audio.
In simulation experiments we further verify that every component of our method is critical, and that online interventions yield significantly improved performance.
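For component (3), the intervention-based fine-tuning can be pictured as the loop below: the policy runs, a human supervisor takes over when it drifts, and only the corrected segments are added to the training set before the policy is fine-tuned. This is a minimal sketch in the spirit of intervention-based imitation methods such as HG-DAgger, under our own assumptions; `collect_interventions` and the behavior-cloning loss are placeholders for the actual intervention interface and training objective, not the paper's exact algorithm.

```python
# Minimal sketch of intervention-based fine-tuning.
# Assumption: collect_interventions(policy) is a hypothetical stand-in for the
# robot/teleoperation interface; it rolls out the current policy and returns only
# the ((images, audio), expert_actions) pairs logged while the supervisor intervened.
import torch
import torch.nn.functional as F

def finetune_with_interventions(policy, optimizer, collect_interventions,
                                num_rounds=10, epochs_per_round=5):
    dataset = []  # aggregated intervention data across rounds
    for _ in range(num_rounds):
        # Roll out the current policy; keep only supervisor-corrected segments.
        dataset.extend(collect_interventions(policy))
        # Fine-tune the policy on everything gathered so far.
        for _ in range(epochs_per_round):
            for (images, audio), expert_actions in dataset:
                pred_actions, _ = policy(images, audio)
                loss = F.mse_loss(pred_actions, expert_actions)
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
    return policy
```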
5 Minute Video Summary:
