Anonymous Author(s)
Video
We propose Flow Before Imitation (FBI), an imitation learning framework integrating tactile and visual information through motion dynamics.
Abstract
Dexterous in-hand manipulation is a long-standing challenge in robotics due to complex contact dynamics and partial observability. While humans synergize vision and touch for such tasks, robotic approaches often prioritize one modality, thereby limiting adaptability. This paper introduces Flow Before Imitation (FBI), a visuotactile imitation learning framework that dynamically fuses tactile interactions with visual observations through motion dynamics. Unlike prior static fusion methods, FBI establishes a causal link between tactile signals and object motion via a dynamics-aware latent model. FBI employs a transformer-based interaction module to fuse flow-derived tactile features with visual inputs, and trains a one-step diffusion policy for real-time execution. Extensive experiments demonstrate that the proposed method outperforms the baseline methods in both simulation and the real world on two customized in-hand manipulation tasks and three standard dexterous manipulation tasks.
Overview
Dexterous in-hand manipulation requires rich hand-object interactions, where vision and touch complement each other. To this end, we propose Flow Before Imitation (FBI), a visuotactile imitation learning algorithm in which 3D visual information is fused with dense contact states and used as the condition for the Shortcut Model to solve contact-rich dexterous tasks.
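The fusion step can be pictured as a small cross-modal transformer that turns visual and contact features into a single conditioning vector for the policy. The sketch below is a hypothetical illustration of this idea; the module names, feature dimensions, and pooling scheme are our assumptions, not the released implementation.

```python
# Illustrative sketch (assumed sizes) of fusing flow-derived tactile features
# with point-cloud visual features via a transformer, producing the condition
# vector that the one-step diffusion / Shortcut policy consumes.
import torch
import torch.nn as nn


class VisuotactileFusion(nn.Module):
    def __init__(self, dim=256, num_heads=4, num_layers=2):
        super().__init__()
        self.vis_proj = nn.Linear(1024, dim)   # assumed global point-cloud feature size
        self.tac_proj = nn.Linear(64, dim)     # assumed dense contact feature size
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, vis_feat, tac_feat):
        # vis_feat: (B, 1024) visual feature, tac_feat: (B, 64) contact feature
        tokens = torch.stack([self.vis_proj(vis_feat),
                              self.tac_proj(tac_feat)], dim=1)  # (B, 2, dim)
        fused = self.encoder(tokens)           # joint attention across the two modalities
        return fused.mean(dim=1)               # (B, dim) condition for the policy


# Usage: the fused embedding conditions the action-generation head.
fusion = VisuotactileFusion()
cond = fusion(torch.randn(8, 1024), torch.randn(8, 64))
print(cond.shape)  # torch.Size([8, 256])
```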
While visuotactile systems excel under ideal conditions, real-world deployments often face hardware limitations: distributed tactile sensors can be unavailable or cost-prohibitive. To improve accessibility and lower the barrier for labs and industries without tactile hardware, our method maintains dexterous manipulation capabilities using only visual input. Thanks to this dynamics-based perspective, tactile readings can be predicted from the point cloud flow by the Flow2Tactile Module (see the sketch below). Our algorithm can therefore operate in two modes: Vision-Only and Visuotactile.
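To make the Flow2Tactile idea concrete, the following is a minimal sketch of regressing contact readings from point cloud flow (per-point positions plus frame-to-frame displacements). The PointNet-style encoder, the number of predicted taxels, and all dimensions are assumptions for illustration, not the paper's exact module.

```python
# Hypothetical Flow2Tactile sketch: predict tactile/contact readings from
# point cloud flow, so the Vision-Only mode can run without tactile hardware.
import torch
import torch.nn as nn


class Flow2Tactile(nn.Module):
    def __init__(self, num_taxels=64):
        super().__init__()
        # per-point input: xyz position (3) + flow vector (3)
        self.point_mlp = nn.Sequential(
            nn.Linear(6, 64), nn.ReLU(),
            nn.Linear(64, 128), nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, num_taxels),        # predicted per-taxel contact signal
        )

    def forward(self, points, flow):
        # points, flow: (B, N, 3) point cloud and its estimated motion (flow)
        per_point = self.point_mlp(torch.cat([points, flow], dim=-1))  # (B, N, 128)
        global_feat = per_point.max(dim=1).values   # order-invariant pooling over points
        return self.head(global_feat)               # (B, num_taxels)


# Usage: predicted readings stand in for real tactile input in Vision-Only mode.
model = Flow2Tactile()
tactile_pred = model(torch.randn(2, 1024, 3), torch.randn(2, 1024, 3))
print(tactile_pred.shape)  # torch.Size([2, 64])
```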
Results
In real-world experiments, FBI supports two operational modes, with or without physical tactile sensors, greatly extending its range of application scenarios.
In simulated tasks, FBI achieves a 16.6% (Vision-Only) to 18.4% (Visuotactile) improvement over DP3, with smaller variance. In the In-hand Reorientation task, FBI surpasses DP3 by 19.3% (Vision-Only) to 21.4% (Visuotactile), averaged over 9 different objects. This indicates that fusing dense contact states with visual information yields superior performance, particularly on more complex dexterous tasks.
In real-world tasks, FBI outperforms all baselines on every task, achieving a 15.0% (Vision-Only) to 16.5% (Visuotactile) improvement over the previous SOTA baseline, ManiCM.
For more information, please refer to our paper and website.