ManipForce
Force-Guided Policy Learning with Frequency-Aware Representation
for Contact-Rich Manipulation
Abstract
Contact-rich manipulation tasks such as precision assembly require precise control of interaction forces, yet existing imitation learning methods rely mainly on vision-only demonstrations. We propose ManipForce, a handheld system designed to capture high-frequency force–torque (F/T) and RGB data during natural human demonstrations for contact-rich manipulation. Building on these demonstrations, we introduce the Frequency-Aware Multimodal Transformer (FMT). FMT encodes asynchronous RGB and F/T signals using frequency- and modality-aware embeddings and fuses them via bi-directional cross-attention within a transformer diffusion policy. Through extensive experiments on six real-world contact-rich manipulation tasks—such as gear assembly, box flipping, and battery insertion—FMT trained on ManipForce demonstrations achieves robust performance with an average success rate of 83% across all tasks, substantially outperforming RGB-only baselines. Ablation and sampling-frequency analyses further confirm that incorporating high-frequency F/T data and cross-modal integration improves policy performance, especially in tasks demanding high precision and stable contact.
ManipForce Hardware Design
Our gripper mechanism is purpose-built to make human-guided data collection intuitive, precise, and stable. The design integrates several complementary features that improve usability and data quality. At its core, a rack-and-pinion mechanism with a familiar trigger interface drives parallel jaw motion, allowing users to perform natural and accurate grasping actions. A deformable pin-ray structure at the fingertips increases compliance and contact sensitivity, while integrated linear guides at the gripper attachment points suppress mechanical vibration and promote smoother, more precise manipulation. A trigger-lock mechanism holds the jaws closed after grasping, reducing user fatigue and ensuring that recorded signals primarily reflect task-relevant interaction forces. Finally, all gripper components are mounted downstream of the F/T sensor to capture the complete interaction forces during operation.
Human Demonstration
Gear Assembly
Battery Disassembly
Frequency-aware Multimodal Transformer (FMT)
We present FMT, a model that learns frequency-aware multimodal representations to fuse low-rate RGB with high-frequency F/T data for contact-rich manipulation. Extending Diffusion Policy, FMT applies frequency-aware multimodal embeddings and bi-directional cross-attention to align visual cues with fine-grained force information. We outline its four components—multimodal tokenization, frequency–modality embeddings, cross-attention fusion, and the diffusion-based policy head—that together enable robust learning from asynchronous multimodal inputs.
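To make the fusion pipeline concrete, here is a minimal NumPy sketch of the first three components: timestamp-based (frequency-aware) embeddings over asynchronous streams, modality embeddings, and single-head bi-directional cross-attention. This is an illustrative approximation, not the authors' implementation; the token dimension, sampling rates, and random stand-in features are all assumptions, and the real model would use learned projections, multi-head attention, and a diffusion head on top of the fused tokens.

```python
import numpy as np

def sinusoidal_embed(t, dim):
    """Map continuous timestamps (seconds) to sinusoidal features so tokens
    from streams sampled at different rates share one common time axis."""
    freqs = np.exp(np.linspace(0.0, np.log(1000.0), dim // 2))
    angles = t[:, None] * freqs[None, :]
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)

def cross_attend(q, kv):
    """Single-head scaled dot-product attention: queries from one modality
    attend to key/value tokens of the other modality."""
    scores = q @ kv.T / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ kv

rng = np.random.default_rng(0)
D = 32                              # token dimension (assumed)
# Asynchronous streams over a one-second observation window:
t_rgb = np.linspace(0.0, 1.0, 10)   # RGB at ~10 Hz (assumed rate)
t_ft = np.linspace(0.0, 1.0, 200)   # F/T at ~200 Hz (assumed rate)
rgb_tok = rng.standard_normal((10, D))   # stand-ins for visual features
ft_tok = rng.standard_normal((200, D))   # stand-ins for F/T features

# Frequency-aware (timestamp) embeddings plus per-modality embeddings
mod_rgb = rng.standard_normal(D)
mod_ft = rng.standard_normal(D)
rgb = rgb_tok + sinusoidal_embed(t_rgb, D) + mod_rgb
ft = ft_tok + sinusoidal_embed(t_ft, D) + mod_ft

# Bi-directional cross-attention: each modality queries the other,
# with residual connections preserving the original tokens
rgb_fused = rgb + cross_attend(rgb, ft)   # vision enriched with force cues
ft_fused = ft + cross_attend(ft, rgb)     # force enriched with visual context
fused = np.concatenate([rgb_fused, ft_fused], axis=0)
print(fused.shape)  # conditioning tokens for the diffusion policy head
```

In a full model, `fused` would condition the denoising transformer of the diffusion policy; here it simply shows how low-rate and high-rate tokens end up in one aligned sequence.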
Policy Rollout on Contact-rich Tasks
We evaluate our approach on six contact-rich manipulation tasks that demand precise force control and multimodal feedback.
(1) Gear Assembly (1×): Requires precise rotational alignment combined with controlled axial force to achieve successful insertion.
(2) LAN Plug Insertion (1×): Demands sub-millimeter positional accuracy and carefully regulated insertion forces to avoid damaging the connector.
(3) Box Flipping (2×): A non-prehensile manipulation task that involves coordinated push-and-roll motions guided by contact sensing.
(4) Open Lid (1×): The gripper must detect and apply force to a small handle to lift and open the lid.
(5) Battery Disassembly (2×): A tool is inserted into narrow gaps to sense contact forces and safely lift the batteries.
(6) Battery Insertion (1×): Involves inserting a spring-loaded battery while maintaining the correct force and direction to prevent slipping or ejection.
Comparison with RGB-only Method
Gear stuck on the shaft
Cannot press steadily – contact remains unstable
Unable to maintain contact
Unable to hook into the battery slot