Making Sense of Vision and Touch:
Self-Supervised Learning of Multimodal Representations for Contact-Rich Tasks
Michelle A. Lee*, Yuke Zhu*, Krishnan Srinivasan, Parth Shah,
Silvio Savarese, Li Fei-Fei, Animesh Garg, Jeannette Bohg
Abstract: Contact-rich manipulation tasks in unstructured environments often require both haptic and visual feedback. However, it is non-trivial to manually design a robot controller that combines modalities with very different characteristics. While deep reinforcement learning has shown success in learning control policies for high-dimensional inputs, these algorithms are generally intractable to deploy on real robots due to their sample complexity. We use self-supervision to learn a compact, multimodal representation of our sensory inputs, which can then be used to improve the sample efficiency of our policy learning. We evaluate our method on a peg insertion task, generalizing over different geometries and clearances, while being robust to external perturbations. Results for simulated and real robot experiments are presented.
ICRA 2019 Conference Paper: available on arXiv (https://arxiv.org/abs/1810.10191)
Extended Version: available on arXiv
Blog post: http://ai.stanford.edu/blog/selfsupervised-multimodal/
Code and dataset: https://github.com/stanford-iprl-lab/multimodal_representation
Awards: Best Paper Award at ICRA 2019; Finalist for the Best Paper Award in Cognitive Robotics at ICRA 2019
Contact: michellelee {at} cs {dot} stanford {dot} edu for more information
* These authors contributed equally to the paper
Bibtex:
@inproceedings{lee2019icra,
title={Making sense of vision and touch: Self-supervised learning of multimodal representations for contact-rich tasks},
author={Lee, Michelle A and Zhu, Yuke and Srinivasan, Krishnan and Shah, Parth and Savarese, Silvio and Fei-Fei, Li and Garg, Animesh and Bohg, Jeannette},
booktitle={2019 IEEE International Conference on Robotics and Automation (ICRA)},
year={2019},
url={https://arxiv.org/abs/1810.10191}
}