Tactile MNIST
A Benchmark for Active Tactile Perception
Tim Schneider, Cristiana de Farias, Roberto Calandra, Liming Chen, and Jan Peters
Tactile perception has the potential to significantly enhance dexterous robotic manipulation by providing rich local information that can complement or substitute for other sensory modalities such as vision. However, because tactile sensing is inherently local, it is not well suited on its own for tasks that require broad spatial awareness or global scene understanding. A human-inspired strategy to address this issue is active perception: actively guiding sensors toward regions with more informative or significant features and integrating this information over time to understand a scene or complete a task. Both active perception and tactile sensing methods have received significant attention recently. Yet, despite advancements, both fields lack standardized benchmarks. To bridge this gap, we introduce the Tactile MNIST Benchmark Suite, an open-source, Gymnasium-compatible benchmark specifically designed for active tactile perception tasks, including localization, classification, and volume estimation. Our benchmark suite offers diverse simulation scenarios, from simple toy environments all the way to complex tactile perception tasks using vision-based tactile sensors. Furthermore, we offer a comprehensive dataset comprising 13,500 synthetic 3D MNIST digit models and 153,600 real-world tactile samples collected from 600 3D-printed digits. Using this dataset, we train a CycleGAN for realistic tactile simulation rendering. By providing standardized protocols and reproducible evaluation frameworks, our benchmark suite facilitates systematic progress in the fields of tactile sensing and active perception.
The Tactile MNIST benchmark provides four simulated active tactile perception tasks, ranging from classification and counting to pose and volume estimation. Each task comes with a unique set of challenges and, thus, Tactile MNIST requires adaptive algorithms and clever exploration strategies. The aim of these benchmark tasks is to provide an extensible framework for a fair comparison of active tactile perception methods.
All tasks are implemented as environments in ap_gym, a Gymnasium-compatible framework for active perception tasks.
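As a rough illustration, the sketch below shows how such an environment could be created and stepped, assuming standard Gymnasium semantics. The environment ID "TactileMNIST-v0" and the `tactile_mnist` import are illustrative placeholders; consult the repository for the exact names.

```python
# A minimal interaction sketch, assuming the environments register with
# Gymnasium. The environment ID "TactileMNIST-v0" and the `tactile_mnist`
# import are illustrative placeholders.
import gymnasium as gym
import tactile_mnist  # hypothetical import that registers the environments

env = gym.make("TactileMNIST-v0")
obs, info = env.reset(seed=0)
terminated = truncated = False
while not (terminated or truncated):
    action = env.action_space.sample()  # random exploration policy
    obs, reward, terminated, truncated, info = env.step(action)
env.close()
```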
TactileMNIST
In the TactileMNIST environment, the agent's objective is to classify 3D models of handwritten digits by touch alone. Aside from finding the object, the main challenge is to learn contour-following strategies to classify it efficiently once found. Object pose perturbation is enabled, meaning that the object shifts slightly while being touched, so the agent must use robust strategies that are invariant to small shifts in the object's pose.
Starstruck
In the Starstruck environment, the agent must count the number of stars in a scene cluttered with other objects. Since all stars look identical, distinguishing them from the other objects is straightforward; the main challenge is instead to learn an effective search strategy that systematically covers as much of the workspace as possible.
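To make the notion of a systematic search strategy concrete, here is a minimal, hypothetical sketch of a boustrophedon (snake) sweep over a normalized workspace. The coordinate convention is an assumption; a real agent would need to map these waypoints to environment actions.

```python
# A hypothetical boustrophedon (snake) sweep over a workspace normalized to
# [-1, 1]^2. Visiting rows in alternating direction minimizes travel between
# consecutive waypoints.
import numpy as np

def grid_waypoints(n: int = 4):
    xs = np.linspace(-1.0, 1.0, n)
    ys = np.linspace(-1.0, 1.0, n)
    return [
        (x, y)
        for i, y in enumerate(ys)
        for x in (xs if i % 2 == 0 else xs[::-1])
    ]

waypoints = grid_waypoints()  # e.g., feed these to a waypoint-following policy
```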
Toolbox
In the Toolbox environment, the agent's objective is to locate a wrench positioned randomly on a platform and estimate its precise 2D position and 1D orientation. Unlike the previous classification tasks, Toolbox poses a regression problem that requires combining multiple touch observations to resolve ambiguities inherent in the wrench's shape. For example, touching the handle may reveal lateral placement but not longitudinal position or orientation, making it critical for the agent to explore strategically and seek out one of the wrench's ends to accurately determine its pose. Overall, Toolbox tests the agent's ability to both find and precisely localize an object through sequential tactile exploration.
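One common way to handle the angular component of such a regression target is to predict the orientation as a (sin, cos) pair, which sidesteps the wrap-around discontinuity at ±π. The sketch below illustrates this idea in PyTorch; it is a generic construction, not necessarily the loss used by the benchmark's reference agents.

```python
# A generic pose-regression loss for (x, y, theta) targets. Predicting the
# angle as a (sin, cos) pair avoids the discontinuity at +/- pi; this is a
# common construction, not necessarily the benchmark's reference loss.
import torch

def pose_loss(pred_xy, pred_sincos, target_xy, target_theta):
    # Position error: plain squared error on the 2D location.
    pos_loss = torch.mean((pred_xy - target_xy) ** 2)
    # Orientation error: squared error in (sin, cos) space.
    target_sincos = torch.stack(
        [torch.sin(target_theta), torch.cos(target_theta)], dim=-1
    )
    ang_loss = torch.mean((pred_sincos - target_sincos) ** 2)
    return pos_loss + ang_loss
```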
TactileMNISTVolume
In the TactileMNISTVolume environment, the agent's objective is to estimate the volume of 3D models of handwritten digits by touch alone. Aside from finding the object, the main challenge is to learn contour-following strategies to explore it efficiently once found. Object pose perturbation is enabled, meaning that the object shifts slightly while being touched, so the agent must use robust strategies that are invariant to small shifts in the object's pose.
In addition to the benchmark tasks, Tactile MNIST provides two large datasets: MNIST 3D and the Real Tactile MNIST Dataset. The former is a dataset of 3D models of handwritten MNIST digits, which are used in the TactileMNIST and TactileMNISTVolume environments. The latter is a dataset of real-world tactile interactions with 3D-printed MNIST 3D objects, which we use to train a CycleGAN for realistic tactile simulation. Both datasets are available on Hugging Face; to access them, check out our GitHub repository.
MNIST 3D is a collection of 13,580 auto-generated, 3D-printable meshes derived from a 500 × 500 pixel high-resolution MNIST variant and scaled to fit within a 10 × 10 cm square. The dataset poses an exciting tactile classification challenge: it exhibits significant variability in shape and size within each class, while being large enough to facilitate learning from data. A single touch is rarely enough to classify objects from this dataset, as segments of handwritten digits are usually ambiguous. Hence, even after finding the object, the agent has to apply some strategy (e.g., contour following) to gather enough information for a successful classification. Beyond tactile sensing, this dataset could also serve as a benchmark for 3D mesh classification methods.
More details on the generation of the MNIST 3D dataset can be found in our paper.
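As a hedged sketch, the meshes could be fetched from the Hugging Face Hub and inspected with trimesh as follows. The repository ID and file layout are assumptions; refer to the GitHub repository for the exact dataset names.

```python
# Hypothetical download-and-inspect sketch; the repository ID and the mesh
# file extension are assumptions.
from huggingface_hub import snapshot_download
import glob
import trimesh

local_dir = snapshot_download(
    repo_id="TimSchneider42/tactile-mnist-mnist3d",  # hypothetical repo ID
    repo_type="dataset",
)
mesh_path = sorted(glob.glob(f"{local_dir}/**/*.stl", recursive=True))[0]
mesh = trimesh.load(mesh_path)
print(mesh.bounds)  # should fit within a 10 x 10 cm footprint
print(mesh.volume)  # ground-truth volume, e.g., for TactileMNISTVolume
```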
The Tactile MNIST benchmark includes a real-world, static tactile dataset of 3D-printed MNIST 3D digits captured with a GelSight Mini tactile sensor mounted on a Franka Research 3 robot: the Real Tactile MNIST Dataset. The dataset contains video sequences of 153,600 touches across 600 digits, which amounts to 256 touches per object collected in sequence. For data acquisition, we placed each 3D-printed MNIST digit in a 12 × 12 cm cell on a rubber mat and used a Franka Research 3 robot arm, equipped with a GelSight Mini tactile sensor, to press the sensor down at random locations within the cell. Once we measured a normal force exceeding 5 N, we stopped pressing and registered the time stamp. To prevent degradation of the elastomer gel, we replaced the GelSight sensor's gel pad after every 76,800 touches (i.e., halfway through data collection). Finally, we partitioned the dataset into training (90%) and test (10%) splits, ensuring uniform class distributions across each split. Note that we also provide two processed versions of this dataset, where we replaced the videos with still images at the time of contact: one in full resolution at 320 × 240 px and one scaled to 64 × 64 px for faster loading and training.
More details on the generation of this dataset can be found in our paper.
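A minimal, hypothetical loading sketch using the Hugging Face `datasets` library for the downscaled still-image version might look as follows; the repository ID and field names are illustrative assumptions.

```python
# Hypothetical loading sketch; repository ID, split, and field names are
# illustrative assumptions.
from datasets import load_dataset

ds = load_dataset("TimSchneider42/tactile-mnist-real-64", split="train")
sample = ds[0]
image = sample["image"]  # assumed field: tactile image at contact time
label = sample["label"]  # assumed field: digit class (0-9)
print(image.size, label)
```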
Using the Real Tactile MNIST Dataset, we train a CycleGAN to produce realistic tactile images in simulation.
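For intuition, the sketch below illustrates the cycle-consistency objective at the heart of CycleGAN training, using tiny placeholder generators; the actual architecture and losses used for the benchmark are described in the paper.

```python
# Placeholder generators illustrating CycleGAN's cycle-consistency loss:
# one generator maps simulated renders to realistic images, the other maps
# back, and the loss keeps each round trip close to its input. Adversarial
# terms are omitted for brevity; this is not the paper's architecture.
import torch
import torch.nn as nn

def tiny_generator() -> nn.Module:
    return nn.Sequential(
        nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
        nn.Conv2d(16, 3, 3, padding=1), nn.Tanh(),
    )

g_sim2real = tiny_generator()  # simulated render -> realistic image
g_real2sim = tiny_generator()  # realistic image -> simulated render

sim = torch.rand(1, 3, 64, 64)   # placeholder simulated tactile render
real = torch.rand(1, 3, 64, 64)  # placeholder real GelSight image

cycle_loss = (
    nn.functional.l1_loss(g_real2sim(g_sim2real(sim)), sim)
    + nn.functional.l1_loss(g_sim2real(g_real2sim(real)), real)
)
```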
TactileMNISTVolume-CycleGAN
The TactileMNISTVolume-CycleGAN environment is a variant of TactileMNISTVolume in which tactile observations are rendered through the trained CycleGAN, yielding more realistic sensor images.