VITaL Pretraining: Visuo-Tactile Pretraining for Tactile and Non-Tactile Manipulation Policies

Abraham George¹ Selam Gano¹ Pranav Katragadda¹ Amir Barati Farimani¹

¹Carnegie Mellon University Mechanical Engineering

[Paper] [Code] [Hardware CAD]

Overview

In this project, we explored using visuo-tactile information in imitation learning frameworks for tackling complex manipulation problems, leveraging our multimodal dataset to pre-train our model via a contrastive loss. We show that our pretraining strategy, which gives a visuo-tactile agent a moderate performance improvement, can be used to significantly improve the performance of a vision-only agent. By pretraining with tactile information, vision-only agents were able to achieve a success rate on par with their visuo-tactile counterparts, without requiring tactile information during deployment.

We evaluated our method on USB cable plugging, a dexterous manipulation task that relies on fine-grained visuo-tactile servoing, along with two block-stacking tasks.

Pretraining Approach

We implemented a CLIP-inspired contrastive-loss pretraining strategy. We trained two encoders, each using a separate modality, to produce latent representations of the scene. By optimizing the encoders to maximize the cross-modality similarity of latent representations from the same scene while minimizing the similarity of latent representations from different scenes, we were able to learn the relationships between tactile and visual features.
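The objective described above can be illustrated with a minimal sketch of a symmetric, CLIP-style contrastive loss. This is not the project's training code: the function names, the plain-list representation of embeddings, and the temperature of 0.07 are assumptions made for illustration. Row i of each input is assumed to come from the same scene in both modalities.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)
        log_sum_exp = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def clip_style_loss(vision_emb, tactile_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings.

    vision_emb[i] and tactile_emb[i] are assumed to describe the same scene;
    every other pairing in the batch is treated as a negative.
    The temperature value is an assumption, not taken from the project.
    """
    v = [l2_normalize(e) for e in vision_emb]
    t = [l2_normalize(e) for e in tactile_emb]
    # Cosine similarity between every vision/tactile pair, scaled by temperature.
    logits = [[sum(a * b for a, b in zip(vi, tj)) / temperature for tj in t] for vi in v]
    logits_t = [list(col) for col in zip(*logits)]
    # Average the vision-to-tactile and tactile-to-vision directions.
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits_t))


if __name__ == "__main__":
    aligned = [[1.0, 0.0], [0.0, 1.0]]
    swapped = [[0.0, 1.0], [1.0, 0.0]]
    print("matched pairs:   ", clip_style_loss(aligned, aligned))
    print("mismatched pairs:", clip_style_loss(aligned, swapped))
```

In this toy example, the loss is small when matching rows agree across modalities and large when they are swapped, which is the behavior the pretraining objective rewards.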

To ensure that both encoders have a complete representation of the scene, we combined the tactile observations (which contain only local information) with the position observations (which encode only global information). Each encoder consists of a ResNet-18 backbone topped with an MLP head that projects the ResNet encoding into a 512-dimensional embedding space. In the tactile encoder, the positional information is also passed into the MLP projection head.

To form the contrastive pairs for pretraining, we sampled observations randomly from a single demonstration, with the requirement that the samples be at least 1 second apart. Because all observation pairs come from a single trajectory and are sufficiently separated in time, this approach instilled previously unused temporal information into the encoders: the model had to learn how the scene changed over time within a run, rather than how it differed between runs.
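One way to draw timesteps from a single demonstration while enforcing the minimum separation is sketched below. The function name, the frame-index representation, and the 30 fps rate in the usage example are hypothetical; the project's actual sampler may work differently. The sketch picks indices in a shortened range and then spreads them out, so every valid set of timesteps is equally likely and no retries are needed.

```python
import math
import random


def sample_separated_frames(num_frames, fps, batch_size, min_gap_s=1.0, rng=random):
    """Pick `batch_size` frame indices from one demonstration such that any
    two picks are at least `min_gap_s` seconds apart.

    All parameter names are illustrative assumptions.
    """
    gap = max(1, math.ceil(min_gap_s * fps))  # minimum separation in frames
    # Sample distinct indices from a shortened range, then stretch them so
    # that consecutive picks end up at least `gap` frames apart.
    reduced = num_frames - (batch_size - 1) * (gap - 1)
    if reduced < batch_size:
        raise ValueError("demonstration too short for this batch size and gap")
    picks = sorted(rng.sample(range(reduced), batch_size))
    return [p + i * (gap - 1) for i, p in enumerate(picks)]


if __name__ == "__main__":
    # A hypothetical 10-second demonstration recorded at 30 frames per second.
    frames = sample_separated_frames(num_frames=300, fps=30, batch_size=8,
                                     rng=random.Random(0))
    print(frames)
```

Each returned index would then supply one vision observation and one tactile observation; observations from the same index form a positive pair, and all other combinations in the batch act as negatives.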

Learning Pipelines

We used two imitation learning frameworks to evaluate our pretraining method: ACT and Diffusion Policy. ACT (left) is trained as an autoencoder, predicting a sequence of actions at each timestep. At inference, the latent variable is set to 0. The network is queried at each timestep, and all action predictions for that timestep are ensembled using a weighted average. Diffusion Policy (right) learns to predict the noise applied to an action sequence. During inference, the action sequence is initialized with Gaussian noise and is iteratively denoised to produce the output actions.
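The weighted-average ensembling step used by ACT can be sketched as follows. The exponential weighting, with the oldest prediction weighted most heavily, follows the scheme described in the original ACT paper; the function name and the decay constant m = 0.01 are assumptions for illustration, and this project's settings may differ.

```python
import math


def temporal_ensemble(predictions, m=0.01):
    """Combine several predictions of the action for one timestep.

    predictions: list of action vectors for the same timestep, ordered from
    the oldest query to the newest. Weights are exp(-m * i), so index 0
    (the oldest prediction) gets the largest weight. `m` is an assumed value.
    """
    weights = [math.exp(-m * i) for i in range(len(predictions))]
    total = sum(weights)
    dim = len(predictions[0])
    return [
        sum(w * p[d] for w, p in zip(weights, predictions)) / total
        for d in range(dim)
    ]


if __name__ == "__main__":
    # Three overlapping action chunks each predicted an action for this timestep.
    preds = [[0.10, 0.50], [0.12, 0.48], [0.11, 0.52]]
    print(temporal_ensemble(preds))
```

A smaller m makes the result closer to a plain average, while a larger m leans more on the earliest prediction.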

Experimental Setup

We evaluated our method using a Franka Emika Panda robot. To complete the USB cable-plugging task, the robot had to navigate to a USB cable, unplug it from its holder, and plug it into the last port of a USB hub. During operation, a GelSight sensor captures tactile observations, while 6 RealSense D415 cameras observe the scene.

Results

Vision + Tactile Pre-trained

Vision Only, Not Pre-trained

Vision Only, Pre-trained

BibTeX

@article{george2024visuo,
  title={Visuo-Tactile Pretraining for Cable Plugging},
  author={George, Abraham and Gano, Selam and Katragadda, Pranav and Farimani, Amir Barati},
  journal={arXiv preprint arXiv:2403.11898},
  year={2024}
}