MViTac:
Self-Supervised Visual-Tactile Representation Learning via Multimodal Contrastive Training
Vedant Dave*
Fotios Lygerakis*
Elmar Rueckert
Abstract
This paper addresses the challenge of integrating visual and tactile sensory data within robotic systems. We propose MViTac, a novel self-supervised framework that leverages both intra-modal and inter-modal contrastive learning to obtain joint visual-tactile representations. On material property identification and grasp success prediction, MViTac outperforms existing self-supervised methods and, in some cases, supervised baselines.
Introduction
Problem: Robots need to combine vision and touch to understand their environment for manipulation tasks.
Motivation: Self-supervised learning reduces the need for large, hand-labeled datasets, which are time-consuming and expensive to create for tactile data.
Research Gap: Existing methods focus on only one modality or don't fully explore relationships within and between modalities.
Methodology
Contrastive Learning: Learning representations by pulling similar (positive) pairs of examples together in embedding space and pushing dissimilar (negative) pairs apart.
Intra-modal learning: Strengthening representations within a single modality, e.g., contrasting two augmented views of the same visual (or tactile) sample.
Inter-modal learning: Aligning representations across modalities, e.g., matching the visual and tactile observations of the same interaction (see the sketch after this list).
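As a rough illustration (not the authors' exact implementation), the PyTorch-style sketch below shows how intra- and inter-modal InfoNCE terms could be computed for a batch of paired visual-tactile samples; the encoder names, augmented views, and temperature value are assumptions.

    import torch
    import torch.nn.functional as F

    def info_nce(z_a, z_b, temperature=0.07):
        # Batch InfoNCE: sample i in z_a should match sample i in z_b
        # (positive pair) and be pushed away from all other samples (negatives).
        z_a = F.normalize(z_a, dim=1)
        z_b = F.normalize(z_b, dim=1)
        logits = z_a @ z_b.t() / temperature               # (N, N) cosine similarities
        targets = torch.arange(z_a.size(0), device=z_a.device)
        return F.cross_entropy(logits, targets)

    def mvitac_loss(vis_enc, tac_enc, vis_view1, vis_view2, tac_view1, tac_view2):
        # Encode two augmented views per modality (hypothetical encoder modules).
        zv1, zv2 = vis_enc(vis_view1), vis_enc(vis_view2)
        zt1, zt2 = tac_enc(tac_view1), tac_enc(tac_view2)
        intra = info_nce(zv1, zv2) + info_nce(zt1, zt2)    # within-modality terms
        inter = info_nce(zv1, zt1) + info_nce(zt1, zv1)    # cross-modality terms
        return intra + inter

Treating the same-index pair in a batch as the positive and every other sample as a negative is the standard InfoNCE setup used by most multimodal contrastive methods.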
MViTac Loss
Intra-modal loss: contrasts two augmented views of the same sample within each modality (visual-visual and tactile-tactile).
Inter-modal loss: pulls together the visual and tactile embeddings of the same interaction while pushing mismatched pairs apart (equation sketch below).
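A minimal equation sketch, assuming a symmetric InfoNCE formulation with visual embeddings v_i, v'_i (two augmented views), tactile embeddings t_i, t'_i, a similarity function sim(·,·), temperature τ, and batch size N; the exact terms and weighting used in the paper may differ:

    \mathcal{L}_{\mathrm{intra}}^{v} = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{\exp(\mathrm{sim}(v_i, v'_i)/\tau)}{\sum_{j=1}^{N} \exp(\mathrm{sim}(v_i, v'_j)/\tau)}
    (and analogously \mathcal{L}_{\mathrm{intra}}^{t} on the tactile views)

    \mathcal{L}_{\mathrm{inter}}^{v \to t} = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{\exp(\mathrm{sim}(v_i, t_i)/\tau)}{\sum_{j=1}^{N} \exp(\mathrm{sim}(v_i, t_j)/\tau)}
    (and analogously \mathcal{L}_{\mathrm{inter}}^{t \to v})

    \mathcal{L}_{\mathrm{MViTac}} = \mathcal{L}_{\mathrm{intra}}^{v} + \mathcal{L}_{\mathrm{intra}}^{t} + \mathcal{L}_{\mathrm{inter}}^{v \to t} + \mathcal{L}_{\mathrm{inter}}^{t \to v}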
Findings
Outperforms Other Self-Supervised Methods: MViTac consistently beats the other self-supervised methods (TAG, SSVTP) on both material identification and grasp prediction tasks.
Multimodal Advantage: Combining visual and tactile data almost always led to better results than using tactile data alone, regardless of the method.
Competitive with Supervised Learning: MViTac even outperformed supervised methods on the material property identification task.
Limitations on Small Datasets: The supervised method still excelled on the grasping prediction task, likely due to the smaller dataset size. This is a common limitation of self-supervised approaches.
[Result tables: material property identification; robot grasping success prediction]