Abstract
Embodied intelligence for contact-rich manipulation has predominantly relied on position control, while explicit awareness and regulation of interaction forces remain under-explored, limiting stability, precision, and robustness in real-world tasks. We propose ForceVLA2, an end-to-end vision-language-action framework that equips robots with hybrid force-position control and explicit force awareness. ForceVLA2 introduces force-based prompts into the VLM expert to construct force-aware task concepts across stages, and employs a Cross-Scale Mixture-of-Experts (MoE) in the action expert to adaptively fuse these concepts with real-time interaction forces for closed-loop hybrid force-position regulation. To support learning and evaluation, we construct ForceVLA2-Dataset, containing 1,000 trajectories over 5 contact-rich tasks, including wiping, pressing, and assembling, with multi-view images, task prompts, proprioceptive states, and force signals. Extensive experiments show that ForceVLA2 substantially improves success rates and reliability in contact-rich manipulation, outperforming 𝝅0 and 𝝅0.5 by 48.0% and 35.0%, respectively, across the 5 tasks, and mitigating common failure modes such as arm overload and unstable contact, thereby actively advancing force-aware interactive physical intelligence in VLAs.
Motivation
ForceVLA2 concept
ForceVLA2 bridges the gap between high-level semantic reasoning and low-level physical interaction by transforming force from a passive perceptual input into an active hybrid force-position control policy.
Overcoming Position-Only Limits: Current VLAs predominantly rely on position control, lacking the physical common sense and force awareness required for stable, contact-rich manipulation.
Dual-Horizon Control Architecture: Inspired by human sensorimotor systems, ForceVLA2 addresses the need for a dual-level architecture that combines long-horizon strategic reasoning with short-horizon reactive force regulation.
Closing the Perception-Action Loop: By integrating force as a fundamental signal, ForceVLA2 mitigates failure modes like arm overload and contact instability, advancing the robustness of robots in real-world physical environments.
Method
Force feedback teleoperation system
The overview of dataset collection system
Multimodal Sensory Input
A real-time closed loop is established between the leader and follower arms, achieving a high-transparency teleoperation experience through symmetric position-force mapping.
Visual perception: Visual feedback from the base-mounted and hand-mounted (eye-in-hand) cameras.
Proprioception: Real-time state estimation including the tool center point (TCP) pose and gripper width.
Force sensing: Estimated external forces and torques exerted at the end-effector.
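The three input streams above can be collected into a single observation record. A minimal sketch, assuming a hypothetical schema (field names and shapes are illustrative, not the authors' actual data format):

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class Observation:
    """Hypothetical multimodal observation record (names illustrative)."""
    base_rgb: np.ndarray     # base-mounted camera image, H x W x 3
    wrist_rgb: np.ndarray    # eye-in-hand camera image, H x W x 3
    tcp_pose: np.ndarray     # (7,) TCP position (m) + orientation quaternion
    gripper_width: float     # gripper opening, meters
    wrench: np.ndarray       # (6,) estimated external force (N) + torque (Nm)


obs = Observation(
    base_rgb=np.zeros((224, 224, 3), dtype=np.uint8),
    wrist_rgb=np.zeros((224, 224, 3), dtype=np.uint8),
    tcp_pose=np.array([0.4, 0.0, 0.3, 0.0, 0.0, 0.0, 1.0]),
    gripper_width=0.06,
    wrench=np.array([0.0, 0.0, -5.0, 0.0, 0.0, 0.0]),  # 5 N pressing down
)
```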
Hybrid Force-Position Command
The policy output integrates spatial trajectories with haptic objectives, providing action sequences, force targets, and control mode, all aligned with the estimated task progress.
Action chunks: A sequence of predicted future states to ensure smooth, continuous motion.
Force objectives: The desired magnitude and direction of the interaction force.
Compliance Control: The specific control mode for the robot controller (e.g., transitioning between position control and force control).
Task progress: An estimation of sub-task progress to facilitate hierarchical execution and switching.
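The four output components listed above can likewise be sketched as one structured command. This is a hypothetical container, not the paper's actual interface; names and the 16-step horizon are assumptions:

```python
from dataclasses import dataclass
from enum import Enum

import numpy as np


class ControlMode(Enum):
    """Which controller interprets the command (illustrative)."""
    POSITION = 0
    FORCE = 1


@dataclass
class HybridCommand:
    """Hypothetical hybrid force-position policy output."""
    action_chunk: np.ndarray   # (H, 7) predicted future TCP poses, horizon H
    force_target: np.ndarray   # (3,) desired contact force vector (N)
    mode: ControlMode          # compliance / control-mode switch
    task_progress: float       # estimated sub-task progress in [0, 1]


cmd = HybridCommand(
    action_chunk=np.zeros((16, 7)),
    force_target=np.array([0.0, 0.0, -8.0]),  # press down with 8 N
    mode=ControlMode.FORCE,
    task_progress=0.35,
)
```

Bundling the mode switch with the force target lets a downstream controller decide, per chunk, whether to track poses or regulate contact force.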
Model design
Framework of ForceVLA2
Force prompt: internalizes multi-scale force cues for long-range task planning.
Force encoding: jointly encodes multi-view vision, language, and force tokens within a VLM to guide logical task decomposition and spatial reasoning.
Reactive gradient pathway: force observations bypass high-level fusion via a direct pathway to the Action Expert, enabling near-instantaneous reactive responses during physical contact.
Cross-Scale MoE: dynamically modulates modalities via a Mixture-of-Experts, determining whether vision or force should dominate control at each step of the task.
ForceVLA2 - Dataset
The illustration of ForceVLA2-Dataset
The first dataset with force prompts for task decomposition.
The only dataset providing force-control supervision.
Contains 1,000 demonstrations across five contact-rich tasks.
Experiment
Real world experiment
Outperforming 𝝅0 and 𝝅0.5 by 48.0% and 35.0%, respectively, across the 5 tasks.
On force-sensitive tasks, ForceVLA2 surpasses the second-best model by up to 50%.
Results visualization
Adversarial disturbance
Failure cases