Recent advancements in integrating tactile sensing with vision-language models (VLMs) have demonstrated remarkable potential for robotic multimodal perception. However, existing tactile descriptions remain limited to superficial attributes like texture, neglecting critical contact states essential for robotic manipulation. To bridge this gap, we propose CLTP, an intuitive and effective language tactile pretraining framework that aligns tactile 3D point clouds with natural language in various contact scenarios, thus enabling contact-state-aware tactile language understanding for contact-rich manipulation tasks. We first collect a novel dataset of 50k+ tactile 3D point cloud-language pairs, where descriptions explicitly capture multidimensional contact states (e.g., contact location, shape, and force) from the tactile sensor’s perspective. CLTP leverages a pre-aligned and frozen vision-language feature space to bridge holistic textual and tactile modalities. Experiments validate its superiority on three downstream tasks: zero-shot 3D classification, contact state classification, and tactile 3D large language model (LLM) interaction. To the best of our knowledge, this is the first study to align tactile and language representations from the contact state perspective for manipulation tasks, offering strong potential for tactile-language-action model learning.
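As a rough sketch of the alignment idea above, the tactile 3D point cloud encoder can be trained with a CLIP-style contrastive loss against frozen text features from the pre-aligned vision-language space. The encoder, dimensions, and loss below are illustrative placeholders, not the exact CLTP implementation.

```python
# Minimal sketch of contrastive alignment between a tactile 3D point-cloud
# encoder and a frozen text feature space (hypothetical names; the real
# CLTP encoders, projection heads, and training details follow the paper).
import torch
import torch.nn as nn
import torch.nn.functional as F


class TactilePointEncoder(nn.Module):
    """Tiny PointNet-style encoder: per-point MLP followed by max pooling."""

    def __init__(self, embed_dim: int = 512):
        super().__init__()
        self.point_mlp = nn.Sequential(
            nn.Linear(3, 64), nn.ReLU(),
            nn.Linear(64, 256), nn.ReLU(),
            nn.Linear(256, embed_dim),
        )

    def forward(self, points: torch.Tensor) -> torch.Tensor:
        # points: (B, N, 3) tactile point cloud -> (B, embed_dim) embedding
        feats = self.point_mlp(points)       # (B, N, embed_dim)
        return feats.max(dim=1).values       # order-invariant pooling


def clip_style_loss(tactile_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss pulling paired tactile/text embeddings together."""
    tactile_emb = F.normalize(tactile_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = tactile_emb @ text_emb.t() / temperature      # (B, B) similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


if __name__ == "__main__":
    encoder = TactilePointEncoder()
    points = torch.randn(8, 1024, 3)     # batch of tactile point clouds
    # Frozen text features would come from the pre-aligned VLM's text
    # encoder (kept fixed); random tensors stand in here.
    frozen_text_emb = torch.randn(8, 512)
    loss = clip_style_loss(encoder(points), frozen_text_emb)
    loss.backward()                      # only the tactile encoder receives gradients
    print(f"contrastive loss: {loss.item():.3f}")
```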
We construct the TCL3D dataset, a comprehensive contact-state-aware tactile dataset built from YCB objects, self-made pegs, and McMaster-Carr parts, covering both daily and industrial scenarios. The dataset comprises 117 objects in total, which we split by size into a large-object set (62 objects, most drawn from the YCB dataset) and a small-object set (55 objects). Different contact strategies were formulated to obtain the desired tactile 3D point clouds, the corresponding tactile rendering images, and the contact-state-oriented tactile language descriptions.
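For illustration, one TCL3D pair can be pictured as a tactile 3D point cloud together with its tactile rendering image and a contact-state-oriented caption; the field names and example values below are assumptions for clarity, not the released dataset schema.

```python
# Illustrative structure of one TCL3D pair (field names and the example
# object/caption are hypothetical, not the released schema).
from dataclasses import dataclass
import numpy as np


@dataclass
class TCL3DSample:
    object_name: str          # e.g., a YCB object, self-made peg, or McMaster-Carr part
    object_set: str           # "large" (62 objects) or "small" (55 objects)
    point_cloud: np.ndarray   # (N, 3) tactile 3D point cloud from the sensor
    rendering: np.ndarray     # (H, W, 3) corresponding tactile rendering image
    caption: str              # contact-state-oriented language description


sample = TCL3DSample(
    object_name="peg_hex_small",   # hypothetical identifier
    object_set="small",
    point_cloud=np.zeros((1024, 3), dtype=np.float32),
    rendering=np.zeros((224, 224, 3), dtype=np.uint8),
    caption="The contact is on the upper-left region of the sensor, "
            "with a hexagonal edge shape and a large contact force.",
)
```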
Standard Contact State Classification
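A minimal sketch of how contact state classification can be run zero-shot with the aligned feature space: encode a set of candidate contact-state descriptions with the frozen text encoder and pick the one most similar to the tactile embedding. The prompt wording and embeddings here are placeholders, not the exact evaluation protocol from the paper.

```python
# Sketch of zero-shot contact-state classification: score candidate
# contact-state prompts against the tactile embedding by cosine similarity.
import torch
import torch.nn.functional as F


def classify_contact_state(tactile_emb: torch.Tensor,
                           prompt_embs: torch.Tensor,
                           prompts: list[str]) -> str:
    """Return the prompt whose (frozen) text embedding best matches the tactile one."""
    sims = F.cosine_similarity(tactile_emb.unsqueeze(0), prompt_embs, dim=-1)
    return prompts[int(sims.argmax())]


if __name__ == "__main__":
    prompts = [
        "the contact is at the center of the sensor with a light force",
        "the contact is at the edge of the sensor with a heavy force",
        "there is no contact on the sensor",
    ]
    # In practice the embeddings come from the tactile encoder and the
    # frozen text encoder; random tensors stand in here.
    tactile_emb = torch.randn(512)
    prompt_embs = torch.randn(len(prompts), 512)
    print(classify_contact_state(tactile_emb, prompt_embs, prompts))
```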
Tac3D-LLM Applications
For more experimental results, please refer to our paper and supplementary material.