TacVLA: Contact-Aware Tactile Fusion
for Robust Vision-Language-Action Manipulation
Under Review
Overview of TacVLA
(a) Input modalities including visual observations, language instructions, and tactile measurements. (b) TacVLA architecture, consisting of modality-specific encoders and tokenizer, a pretrained VLM backbone, an action expert, and the contact-aware gating module. (c) The proposed contact-aware gating module that selectively activates tactile tokens based on the contact state, enabling adaptive multimodal fusion during contact-rich manipulation. (d) Experimental evaluation on contact-rich constraint-locked disassembly and in-box picking tasks, together with robustness tests under camera occlusion and human disturbance.
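The contact-aware gating described in panel (c) can be sketched as a simple token mask: tactile token embeddings are passed to the backbone only when a contact signal is active, and suppressed otherwise. The function name, shapes, and the hard-threshold gate below are illustrative assumptions for exposition, not the paper's exact implementation.

```python
import numpy as np

def contact_aware_gating(tactile_tokens, contact_prob, threshold=0.5):
    """Illustrative sketch (not the official TacVLA code) of gating
    tactile tokens by contact state.

    tactile_tokens: (num_tokens, dim) array of tactile embeddings
    contact_prob:   scalar in [0, 1] estimating contact likelihood
    Returns the tokens unchanged when contact is detected, and zeroed
    out otherwise, so the backbone effectively ignores tactile input
    during free-space motion.
    """
    gate = 1.0 if contact_prob >= threshold else 0.0  # hard gate (assumption)
    return tactile_tokens * gate

# Example: tokens pass through on contact, are suppressed otherwise.
tokens = np.ones((4, 8))
gated_on = contact_aware_gating(tokens, contact_prob=0.9)
gated_off = contact_aware_gating(tokens, contact_prob=0.1)
```

A soft (e.g. sigmoid-weighted) gate is an equally plausible design; the key idea is that tactile tokens contribute to fusion only during contact-rich phases.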
The first row: overview of the task execution
The second row: two views from the front camera and the wrist camera, and visualizations of the tactile sensor.
Task 1
Task 2
Task 3
Task 4
Tasks:
The real-world experimental setup and procedures:
We conduct four tasks spanning constraint-locked disassembly and in-box picking. The experiments demonstrate TacVLA's capability in contact-rich, fine-grained manipulation, as well as its robustness to visual occlusion in the in-box picking scenario.
The robot must retrieve objects from a confined space with limited visual access: (1) the front camera has no view of the box interior, and (2) the wrist camera suffers from poor illumination and occlusion inside the box. This task highlights the importance of tactile sensing under severe visual occlusion and complex contact interactions, as the robot must rely heavily on tactile feedback to retrieve the objects successfully.
In-box picking
In-box picking - screw driver
In-box picking - strawberry
In-box picking - lemon 3rd view
Performance
Robustness Evaluation
We evaluate the performance of our TacVLA model under visual occlusion and runtime disturbance to demonstrate its ability to adapt and maintain performance in challenging scenarios.
Success rates
block camera
We compare fine-tuned Pi0.5, fine-tuned Pi0.5 with tactile input (w/o gating), and TacVLA (ours). Our method consistently improves performance under visual occlusion.
Failure Case: