Abstract
Embodied intelligence for contact-rich manipulation has predominantly relied on position control, while explicit awareness and regulation of interaction forces remain under-explored, limiting stability, precision, and robustness in real-world tasks. We propose ForceVLA2, an end-to-end vision-language-action framework that equips robots with hybrid force-position control and explicit force awareness. ForceVLA2 introduces force-based prompts into the VLM expert to construct force-aware task concepts across stages, and employs a Cross-Scale Mixture-of-Experts (MoE) in the action expert to adaptively fuse these concepts with real-time interaction forces for closed-loop hybrid force-position regulation. To support learning and evaluation, we construct ForceVLA2-Dataset, containing 1,000 trajectories over 5 contact-rich tasks, including wiping, pressing, and assembling, with multi-view images, task prompts, proprioceptive states, and force signals. Extensive experiments show that ForceVLA2 substantially improves success rates and reliability in contact-rich manipulation, outperforming 𝝅0 and 𝝅0.5 by 48.0% and 35.0%, respectively, across the 5 tasks, and mitigating common failure modes such as arm overload and unstable contact, thereby actively advancing force-aware interactive physical intelligence in VLAs.
Motivation
ForceVLA2 concept
ForceVLA2 bridges the gap between high-level semantic reasoning and low-level physical interaction by transforming force from a passive perceptual input into an active hybrid force-position control policy.
Overcoming Position-Only Limits: Current VLAs predominantly rely on position control, lacking the physical common sense and force awareness required for stable, contact-rich manipulation.
Dual-Horizon Control Architecture: Inspired by human sensorimotor systems, ForceVLA2 addresses the need for a dual-level architecture that combines long-horizon strategic reasoning with short-horizon reactive force regulation.
Closing the Perception-Action Loop: By integrating force as a fundamental signal, ForceVLA2 mitigates failure modes like arm overload and contact instability, advancing the robustness of robots in real-world physical environments.
Method
Force feedback teleoperation system
The overview of dataset collection system
Multimodal Sensory Input
A real-time closed loop is established between the leader and follower arms, achieving a high-transparency teleoperation experience through symmetric position-force mapping.
Visual perception: Visual feedback from the base-mounted and hand-mounted (eye-in-hand) cameras.
Proprioception: Real-time state estimation including the tool center point (TCP) pose and gripper width.
Force sensing: Estimated external forces and torques exerted at the end-effector.
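The three input streams above can be collected into a single observation record. A minimal sketch, assuming a hypothetical schema (field names and shapes are illustrative, not the authors' actual data format):

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class Observation:
    """Hypothetical multimodal observation record (names illustrative)."""
    base_rgb: np.ndarray     # base-mounted camera image, H x W x 3
    wrist_rgb: np.ndarray    # eye-in-hand camera image, H x W x 3
    tcp_pose: np.ndarray     # (7,) TCP position (m) + orientation quaternion
    gripper_width: float     # gripper opening, meters
    wrench: np.ndarray       # (6,) estimated external force (N) + torque (Nm)


obs = Observation(
    base_rgb=np.zeros((224, 224, 3), dtype=np.uint8),
    wrist_rgb=np.zeros((224, 224, 3), dtype=np.uint8),
    tcp_pose=np.array([0.4, 0.0, 0.3, 0.0, 0.0, 0.0, 1.0]),
    gripper_width=0.06,
    wrench=np.array([0.0, 0.0, -5.0, 0.0, 0.0, 0.0]),  # 5 N pressing down
)
```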
Hybrid Force-Position Command
The policy output integrates spatial trajectories with haptic objectives, providing action sequences, force targets, and control mode, all aligned with the estimated task progress.
Action chunks: A sequence of predicted future states to ensure smooth, continuous motion.
Force objectives: The desired magnitude and direction of the interaction force.
Compliance Control: The specific control mode for the robot controller (e.g., transitioning between position control and force control).
Task progress: An estimation of sub-task progress to facilitate hierarchical execution and switching.
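The four output components listed above can likewise be sketched as one structured command. This is a hypothetical container, not the paper's actual interface; names and the 16-step horizon are assumptions:

```python
from dataclasses import dataclass
from enum import Enum

import numpy as np


class ControlMode(Enum):
    """Which controller interprets the command (illustrative)."""
    POSITION = 0
    FORCE = 1


@dataclass
class HybridCommand:
    """Hypothetical hybrid force-position policy output."""
    action_chunk: np.ndarray   # (H, 7) predicted future TCP poses, horizon H
    force_target: np.ndarray   # (3,) desired contact force vector (N)
    mode: ControlMode          # compliance / control-mode switch
    task_progress: float       # estimated sub-task progress in [0, 1]


cmd = HybridCommand(
    action_chunk=np.zeros((16, 7)),
    force_target=np.array([0.0, 0.0, -8.0]),  # press down with 8 N
    mode=ControlMode.FORCE,
    task_progress=0.35,
)
```

Bundling the mode switch with the force target lets a downstream controller decide, per chunk, whether to track poses or regulate contact force.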
Model design
Framework of ForceVLA2
Force prompt: internalizes multi-scale force cues for long-range task planning.
Force encoding: jointly encodes multi-view vision, language, and force tokens within a VLM to guide logical task decomposition and spatial reasoning.
Reactive gradient pathway: force observations bypass high-level fusion via a direct pathway to the Action Expert, enabling near-instantaneous reactive responses during physical contact.
Cross-Scale MoE: dynamically modulates modalities via a Mixture-of-Experts, determining whether vision or force should dominate control at each step of the task.
ForceVLA2 - Dataset
The illustration of ForceVLA2-Dataset
The first dataset with force prompts for task decomposition.
The only dataset providing force-control supervision.
Contains 1,000 demonstrations across five contact-rich tasks.
Experiment
Real world experiment
Outperforming 𝝅0 and 𝝅0.5 by 48.0% and 35.0%, respectively, across the 5 tasks.
On force-sensitive tasks, ForceVLA2 surpasses the second-best model by up to 50%.
Results visualization
Adversarial disturbance
Failure cases