MViTac:
Self-Supervised Visual-Tactile Representation Learning via Multimodal Contrastive Training
Vedant Dave*
Fotios Lygerakis*
Elmar Rueckert
Abstract
This paper addresses the challenge of integrating visual and tactile sensory data within robotic systems. We propose MViTac, a novel self-supervised framework that leverages both intra-modal and inter-modal contrastive learning to obtain joint visual-tactile representations. On material property identification and grasp success prediction, MViTac outperforms existing self-supervised methods and, in some cases, supervised baselines.
Introduction
Problem: Robots need to combine vision and touch to understand their environment for manipulation tasks.
Motivation: Self-supervised learning reduces the need for large, hand-labeled datasets, which are time-consuming and expensive to create for tactile data.
Research Gap: Existing methods focus on only one modality or don't fully explore relationships within and between modalities.
Methodology
Contrastive Learning: Learning representations by pulling similar (positive) pairs of examples together in embedding space and pushing dissimilar (negative) pairs apart.
Intra-modal learning: Strengthening representations within a single modality, e.g., contrasting two augmented views of the same visual (or tactile) sample.
Inter-modal learning: Aligning representations across modalities, e.g., matching the visual and tactile observations of the same interaction (see the sketch after this list).
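As a rough illustration (not the authors' exact implementation), the PyTorch-style sketch below shows how intra- and inter-modal InfoNCE terms could be computed for a batch of paired visual-tactile samples; the encoder names, augmented views, and temperature value are assumptions.

    import torch
    import torch.nn.functional as F

    def info_nce(z_a, z_b, temperature=0.07):
        # Batch InfoNCE: sample i in z_a should match sample i in z_b
        # (positive pair) and be pushed away from all other samples (negatives).
        z_a = F.normalize(z_a, dim=1)
        z_b = F.normalize(z_b, dim=1)
        logits = z_a @ z_b.t() / temperature               # (N, N) cosine similarities
        targets = torch.arange(z_a.size(0), device=z_a.device)
        return F.cross_entropy(logits, targets)

    def mvitac_loss(vis_enc, tac_enc, vis_view1, vis_view2, tac_view1, tac_view2):
        # Encode two augmented views per modality (hypothetical encoder modules).
        zv1, zv2 = vis_enc(vis_view1), vis_enc(vis_view2)
        zt1, zt2 = tac_enc(tac_view1), tac_enc(tac_view2)
        intra = info_nce(zv1, zv2) + info_nce(zt1, zt2)    # within-modality terms
        inter = info_nce(zv1, zt1) + info_nce(zt1, zv1)    # cross-modality terms
        return intra + inter

Treating the same-index pair in a batch as the positive and every other sample as a negative is the standard InfoNCE setup used by most multimodal contrastive methods.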
MViTac Loss
Intra-modal loss: contrasts two augmented views of the same sample within each modality (visual-visual and tactile-tactile).
Inter-modal loss: pulls together the visual and tactile embeddings of the same interaction while pushing mismatched pairs apart (equation sketch below).
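A minimal equation sketch, assuming a symmetric InfoNCE formulation with visual embeddings v_i, v'_i (two augmented views), tactile embeddings t_i, t'_i, a similarity function sim(·,·), temperature τ, and batch size N; the exact terms and weighting used in the paper may differ:

    \mathcal{L}_{\mathrm{intra}}^{v} = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{\exp(\mathrm{sim}(v_i, v'_i)/\tau)}{\sum_{j=1}^{N} \exp(\mathrm{sim}(v_i, v'_j)/\tau)}
    (and analogously \mathcal{L}_{\mathrm{intra}}^{t} on the tactile views)

    \mathcal{L}_{\mathrm{inter}}^{v \to t} = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{\exp(\mathrm{sim}(v_i, t_i)/\tau)}{\sum_{j=1}^{N} \exp(\mathrm{sim}(v_i, t_j)/\tau)}
    (and analogously \mathcal{L}_{\mathrm{inter}}^{t \to v})

    \mathcal{L}_{\mathrm{MViTac}} = \mathcal{L}_{\mathrm{intra}}^{v} + \mathcal{L}_{\mathrm{intra}}^{t} + \mathcal{L}_{\mathrm{inter}}^{v \to t} + \mathcal{L}_{\mathrm{inter}}^{t \to v}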
Findings
Outperforms Other Self-Supervised Methods: MViTac consistently beats the other self-supervised methods (TAG, SSVTP) on both material identification and grasp prediction tasks.
Multimodal Advantage: Combining visual and tactile data almost always led to better results than using tactile data alone, regardless of the method.
Competitive with Supervised Learning: MViTac even outperformed supervised methods on the material property identification task.
Limitations on Small Datasets: The supervised method still excelled on the grasping prediction task, likely due to the smaller dataset size. This is a common limitation of self-supervised approaches.
[Result tables: material property identification; robot grasping success prediction]