Self-Supervised Visuo-Tactile Pretraining
to Locate and Follow Garment Features
Bibtex
@inproceedings{kerr2022ssvtp,
title={Self-Supervised Visuo-Tactile Pretraining to Locate and Follow Garment Features},
author={Kerr, Justin and Huang, Huang and Wilcox, Albert and Hoque, Ryan and Ichnowski, Jeffrey and Calandra, Roberto and Goldberg, Ken},
booktitle={Robotics: Science and Systems},
year={2023}
}
Abstract
Humans make extensive use of vision and touch as complementary senses, with vision providing global information about the scene and touch measuring local information during manipulation without suffering from occlusions. While prior work demonstrates the efficacy of tactile sensing for precise manipulation of deformables, it typically relies on supervised, human-labeled datasets. We propose Self-Supervised Visuo-Tactile Pretraining (SSVTP), a framework for learning multi-task visuo-tactile representations in a self-supervised manner through cross-modal supervision. We design a mechanism that enables a robot to autonomously collect precisely spatially aligned visual and tactile image pairs, then train visual and tactile encoders to embed these pairs into a shared latent space using a cross-modal contrastive loss. We apply this latent space to downstream perception and control of deformable garments on flat surfaces, and evaluate the flexibility of the learned representations without fine-tuning on 5 tasks: feature classification, contact localization, anomaly detection, feature search from a visual query (e.g., garment feature localization under occlusion), and edge following along cloth edges. The pretrained representations achieve a 73-100% success rate on these 5 tasks.
Overview
We design a self-supervised framework to collect 4,500 spatially aligned visual and tactile images, and use this dataset to learn a shared visuo-tactile latent space Z. We apply this latent space without fine-tuning to 4 locating tasks (contact localization, tactile localization, vision-query search, and anomaly detection) and 2 following tasks (following a cable and following a dress seam).
Method
Data Collection
To permit cross-modal representation learning, we wish to collect spatially aligned pairs of visual and tactile images. To accomplish this, we design a self-supervised data collection pipeline in which a robot automatically collects paired images on deformable surfaces, as defined in Section III of the paper. We design a custom end-effector (a, b) that holds both sensors parallel to each other with a fixed offset d between the camera and the tactile sensor. At the beginning of each round of data collection, we set up the workspace by laying out 5-15 different objects at stable poses on a flat surface to form a deformable surface (c). An example of data collection is shown in the video below.
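A minimal sketch of one collection step is shown below, assuming hypothetical robot and sensor interfaces (move_to, capture_rgb, capture_tactile) and an illustrative value for the camera-to-sensor offset d; the actual hardware API and geometry are not specified here.

import random

CAMERA_TO_SENSOR_OFFSET = 0.02  # fixed offset d in meters; illustrative value

def collect_pair(robot, camera, tactile, workspace_bounds):
    """Collect one spatially aligned (visual, tactile) image pair."""
    # Sample a random target pose above the deformable surface.
    x = random.uniform(*workspace_bounds["x"])
    y = random.uniform(*workspace_bounds["y"])
    theta = random.uniform(0.0, 360.0)  # end-effector rotation about the vertical axis

    # Image the surface patch with the camera, then translate by the fixed
    # offset d so the tactile sensor presses on the same patch.
    robot.move_to(x, y, z=0.05, yaw=theta)                     # hypothetical interface
    visual_img = camera.capture_rgb()                          # hypothetical interface
    robot.move_to(x + CAMERA_TO_SENSOR_OFFSET, y, z=0.0, yaw=theta)
    tactile_img = tactile.capture_tactile()                    # hypothetical interface

    return visual_img, tactile_img, theta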
Architecture
We decouple the learning of the visuo-tactile content representation (top) and the orientation representation (bottom). We train a vision encoder and a tactile encoder, using the collected data and contrastive learning to learn a visuo-tactile content association in a shared latent space. We train a separate rotation network on the same dataset to predict the relative rotation between a visual image and a tactile image. We apply color augmentation to prevent the networks from exploiting spurious lighting statistics in the data. We discretize rotations into N_b buckets uniformly spaced between 0° and 360° and formulate the problem as classification, where the input is a concatenated visuo-tactile image pair and the output is a distribution over bucket indices.
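A minimal PyTorch-style sketch of the two training objectives is given below. It assumes the vision and tactile encoders output embeddings of equal dimension and that rotation_net takes a channel-concatenated image pair; the number of buckets and the temperature are illustrative values, not those reported in the paper.

import torch
import torch.nn.functional as F

N_BUCKETS = 36      # rotation buckets over [0, 360) degrees; illustrative
TEMPERATURE = 0.07  # contrastive temperature; illustrative

def contrastive_loss(z_vis, z_tac):
    """Symmetric cross-modal InfoNCE loss over a batch of aligned pairs."""
    z_vis = F.normalize(z_vis, dim=-1)
    z_tac = F.normalize(z_tac, dim=-1)
    logits = z_vis @ z_tac.t() / TEMPERATURE          # (B, B) similarity matrix
    targets = torch.arange(z_vis.shape[0], device=z_vis.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def rotation_loss(rotation_net, vis_img, tac_img, delta_deg):
    """Classify the relative rotation of a visual/tactile pair into buckets."""
    pair = torch.cat([vis_img, tac_img], dim=1)       # concatenate along channels
    logits = rotation_net(pair)                       # (B, N_BUCKETS)
    bucket = (delta_deg % 360.0 / (360.0 / N_BUCKETS)).long()
    return F.cross_entropy(logits, bucket)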
Physical Experiments
Anomaly Detection
To inspect garments, humans often slide their hands across the surface to detect textures which stand out from the background. We mimic this behavior by sliding the tactile sensor across a deformable surface and using the trained tactile encoder to localize tactile anomalies in materials. An example is shown in the figure below.
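A minimal sketch of the anomaly test at one sliding step is shown below, assuming the trained tactile encoder produces embeddings that are compared against a bank of recent "nominal" background embeddings; the similarity threshold is illustrative.

import torch
import torch.nn.functional as F

def is_anomaly(tactile_enc, tac_img, background_embeddings, threshold=0.6):
    """Flag a tactile reading whose embedding is dissimilar to the background."""
    with torch.no_grad():
        z = F.normalize(tactile_enc(tac_img.unsqueeze(0)), dim=-1)      # (1, D)
    bank = F.normalize(torch.stack(background_embeddings), dim=-1)      # (K, D)
    # Highest cosine similarity to any background embedding; a low value
    # means the current texture stands out from the surrounding material.
    best_match = (bank @ z.t()).max().item()
    return best_match < threshold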
Visual Feature Search
When searching for a specific tactile feature in a garment (e.g., a button on a shirt), anomaly detection is insufficient due to the presence of distractors (e.g., seams). To remedy this, we search for a tactile reading that matches a visual query image of the target. The visual query image is embedded into the shared latent space. While sliding, the system computes the cosine similarity (dot product) between the current tactile embedding and the query image embedding. An example is shown in the figure below.
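A minimal sketch of the matching score is given below, assuming both encoders embed into the shared latent space; the stopping threshold in the usage comment is illustrative.

import torch
import torch.nn.functional as F

def query_similarity(vision_enc, tactile_enc, query_img, tac_img):
    """Cosine similarity between a visual query and the current tactile reading."""
    with torch.no_grad():
        z_q = F.normalize(vision_enc(query_img.unsqueeze(0)), dim=-1)
        z_t = F.normalize(tactile_enc(tac_img.unsqueeze(0)), dim=-1)
    # Dot product of unit vectors equals cosine similarity.
    return (z_q * z_t).sum().item()

# While sliding, stop when the similarity exceeds a chosen threshold, e.g.:
#   if query_similarity(vision_enc, tactile_enc, query, tac) > 0.8: stop_sliding()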
Edge Following
The robot servos along material features specified purely from visual images, leveraging the visuo-tactile pretraining to avoid the need for a servoing dataset. Given an input visual image specifying the feature to follow as well as the direction to follow it in, the system uses the rotation network to predict the relative rotation between the query visual image and the current tactile image. An example is shown in the figure below.
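A minimal sketch of one servo step is shown below, assuming a hypothetical robot interface and that the predicted rotation bucket is converted to a heading correction before advancing; step size and bucket count are illustrative.

import torch

N_BUCKETS = 36  # must match the rotation network's output size; illustrative

def edge_follow_step(rotation_net, robot, tactile, query_img, step_size=0.01):
    """One servo step: align the heading with the queried feature, then advance."""
    tac_img = tactile.capture_tactile()                   # hypothetical interface, (1, C, H, W)
    pair = torch.cat([query_img, tac_img], dim=1)         # concatenated visuo-tactile pair
    with torch.no_grad():
        bucket = rotation_net(pair).argmax(dim=-1).item()
    # Center of the predicted bucket, interpreted as the heading correction.
    delta_deg = (bucket + 0.5) * 360.0 / N_BUCKETS
    robot.rotate_end_effector(delta_deg)                  # hypothetical interface
    robot.translate_along_heading(step_size)              # hypothetical interface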
Tactile Localization
Given a top-down visual image of the entire workspace (a sort of “visual map”) and a tactile image, this module outputs a probability distribution over possible contact locations in the visual image. We divide the input visual image into a grid of uniformly sized patches and compute the cosine similarity between each patch embedding and the tactile embedding, as shown in the figure below.
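A minimal sketch of the patch-grid heatmap is given below, assuming the vision encoder accepts patch_size x patch_size inputs (otherwise the patches would need to be resized); patch size and stride are illustrative.

import torch
import torch.nn.functional as F

def localization_heatmap(vision_enc, tactile_enc, workspace_img, tac_img,
                         patch_size=64, stride=64):
    """Cosine-similarity heatmap of a tactile reading over a top-down image."""
    with torch.no_grad():
        z_t = F.normalize(tactile_enc(tac_img.unsqueeze(0)), dim=-1)    # (1, D)
        _, _, H, W = workspace_img.unsqueeze(0).shape                   # workspace_img: (C, H, W)
        rows = (H - patch_size) // stride + 1
        cols = (W - patch_size) // stride + 1
        heat = torch.zeros(rows, cols)
        for i in range(rows):
            for j in range(cols):
                patch = workspace_img[:, i * stride:i * stride + patch_size,
                                         j * stride:j * stride + patch_size]
                z_v = F.normalize(vision_enc(patch.unsqueeze(0)), dim=-1)
                heat[i, j] = (z_v * z_t).sum()
    # Softmax turns similarities into a distribution over contact locations.
    return F.softmax(heat.flatten(), dim=0).view(rows, cols)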
Tactile Classification
We construct a classifier from the trained encoders by providing canonical visual images of the classes we consider, which can be captured rapidly. We augment these visual images with the same augmentations used during training and embed them into the latent space. We then classify the tactile input with a weighted k-nearest-neighbors algorithm, as shown below.
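A minimal sketch of the weighted k-nearest-neighbors step is given below, assuming class_embeddings is a tensor of canonical visual embeddings already produced by the vision encoder and class_labels gives the label of each row; k and the similarity-weighted voting rule are illustrative choices.

import torch
import torch.nn.functional as F

def knn_classify(tactile_enc, tac_img, class_embeddings, class_labels, k=5):
    """Weighted k-NN over canonical visual class embeddings in the shared space."""
    with torch.no_grad():
        z = F.normalize(tactile_enc(tac_img.unsqueeze(0)), dim=-1)      # (1, D)
    bank = F.normalize(class_embeddings, dim=-1)                        # (N, D)
    sims = (bank @ z.t()).squeeze(1)                                    # (N,)
    top_sims, top_idx = sims.topk(k)
    # Accumulate similarity-weighted votes per class label.
    votes = {}
    for s, i in zip(top_sims.tolist(), top_idx.tolist()):
        label = class_labels[i]
        votes[label] = votes.get(label, 0.0) + s
    return max(votes, key=votes.get)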