TacVLA: Contact-Aware Tactile Fusion
for Robust Vision-Language-Action Manipulation
Under Review
Overview of TacVLA
(a) Input modalities including visual observations, language instructions, and tactile measurements. (b) TacVLA architecture, consisting of modality-specific encoders and tokenizer, a pretrained VLM backbone, an action expert, and the contact-aware gating module. (c) The proposed contact-aware gating module that selectively activates tactile tokens based on the contact state, enabling adaptive multimodal fusion during contact-rich manipulation. (d) Experimental evaluation on contact-rich constraint-locked disassembly and in-box picking tasks, together with robustness tests under camera occlusion and human disturbance.
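The contact-aware gating described in panel (c) can be sketched as a simple token mask: tactile token embeddings are passed to the backbone only when a contact signal is active, and suppressed otherwise. The function name, shapes, and the hard-threshold gate below are illustrative assumptions for exposition, not the paper's exact implementation.

```python
import numpy as np

def contact_aware_gating(tactile_tokens, contact_prob, threshold=0.5):
    """Illustrative sketch (not the official TacVLA code) of gating
    tactile tokens by contact state.

    tactile_tokens: (num_tokens, dim) array of tactile embeddings
    contact_prob:   scalar in [0, 1] estimating contact likelihood
    Returns the tokens unchanged when contact is detected, and zeroed
    out otherwise, so the backbone effectively ignores tactile input
    during free-space motion.
    """
    gate = 1.0 if contact_prob >= threshold else 0.0  # hard gate (assumption)
    return tactile_tokens * gate

# Example: tokens pass through on contact, are suppressed otherwise.
tokens = np.ones((4, 8))
gated_on = contact_aware_gating(tokens, contact_prob=0.9)
gated_off = contact_aware_gating(tokens, contact_prob=0.1)
```

A soft (e.g. sigmoid-weighted) gate is an equally plausible design; the key idea is that tactile tokens contribute to fusion only during contact-rich phases.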
The first row: overview of the task execution
The second row: two views from the front camera and the wrist camera, and visualizations of the tactile sensor.
Task 1
Task 2
Task 3
Task 4
Tasks:
The real-world experimental setup and procedures:
We conduct four tasks spanning constraint-locked disassembly and in-box picking. The experiments demonstrate TacVLA's capability in contact-rich, fine-grained manipulation, as well as its robustness to visual occlusion in the in-box picking scenario.
The robot must retrieve objects from a confined space with limited visual access: (1) the front camera has no view of the box interior, and (2) the wrist camera suffers from poor illumination and occlusion inside the box. This task highlights the importance of tactile sensing under severe visual occlusion and complex contact interactions, as the robot must rely heavily on tactile feedback to retrieve the objects successfully.
In-box picking
In-box picking - screw driver
In-box picking - strawberry
In-box picking - lemon 3rd view
Performance
Robustness Evaluation
We evaluate the performance of our TacVLA model under visual occlusion and runtime disturbance to demonstrate its ability to adapt and maintain performance in challenging scenarios.
Success rates
block camera
We compare fine-tuned Pi0.5, fine-tuned Pi0.5 with tactile input (w/o gating), and TacVLA (ours). Our method consistently improves performance under visual occlusion.
Failure Case: