Abstract
Vision-Language-Action (VLA) models have advanced general-purpose robotic manipulation by leveraging pretrained visual and linguistic representations. However, they struggle with contact-rich tasks that require fine-grained control involving force, especially under visual occlusion or dynamic uncertainty. To address these limitations, we propose ForceVLA, a novel end-to-end manipulation framework that treats external force sensing as a first-class modality within VLA systems. ForceVLA introduces FVLMoE, a force-aware Mixture-of-Experts fusion module that dynamically integrates pretrained visual-language embeddings with real-time 6-axis force feedback during action decoding. This enables context-aware routing across modality-specific experts, enhancing the robot’s ability to adapt to subtle contact dynamics. We also introduce ForceVLA-Data, a new dataset comprising synchronized vision, proprioception, and force-torque signals across five contact-rich manipulation tasks. ForceVLA improves average task success by 23.2% over strong π0-based baselines, achieving up to 80% success in tasks such as plug insertion. Our approach highlights the importance of multimodal integration for dexterous manipulation and sets a new benchmark for physically intelligent robotic control.
Motivation
Figure 1: Comparison between ForceVLA and baselines without force input. Without force feedback, the policy fails to correct pose errors and completes insertion incorrectly. In contrast, ForceVLA leverages external force signals to adjust insertion strategies dynamically, leading to successful execution despite initial misalignment.
Robotic learning has rapidly advanced with Vision-Language-Action (VLA) models like OpenVLA and π0, leveraging large-scale pre-training on manipulation datasets and strong Vision-Language Model (VLM) backbones for semantic understanding and efficient fine-tuning. However, existing VLAs primarily focus on semantic grounding and spatial planning, largely overlooking crucial force sensing for contact-rich manipulation. This deficiency, coupled with the fact that force requirements dynamically evolve across task phases, limits their ability to handle tasks like insertion or tool use, especially under visual ambiguity.
To address these limitations, we introduce ForceVLA, a novel framework that augments VLAs with a force-aware Mixture-of-Experts (MoE) module called FVLMoE. Grounded in treating 6D external force as a first-class modality, ForceVLA integrates force into the action expert module. FVLMoE performs modality- and phase-aware fusion of VLM-derived visual-linguistic representations with real-time force feedback via a dynamic gating mechanism over specialized expert subnetworks. By adaptively activating experts based on task instructions and interaction feedback, ForceVLA captures subtle, phase-dependent physical variations to generate precise, force-aware actions.
Method
Figure 2: Robot manipulation task setup.
The robot's observation includes visual inputs from its base and hand, its proprioceptive state detailing tool pose and gripper width, and estimated external forces and torques at its tool. Conditioned on these multimodal inputs and a language instruction, an end-to-end policy outputs action chunks to successfully complete contact-rich tasks.
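As a concrete reference, the sketch below shows one plausible way to organize this observation-and-action interface in PyTorch. All field names, shapes, and the action-chunk dimensions are illustrative assumptions rather than the exact specification used by ForceVLA.

```python
# Minimal sketch (not the released interface) of the multimodal observation and
# the action-chunk output described above; shapes and names are assumptions.
from dataclasses import dataclass
import torch

@dataclass
class Observation:
    base_rgb: torch.Tensor      # (3, H, W) image from the base camera
    hand_rgb: torch.Tensor      # (3, H, W) image from the hand camera
    proprio: torch.Tensor       # (8,) tool pose (position + quaternion) and gripper width
    force_torque: torch.Tensor  # (6,) estimated external force (Fx, Fy, Fz) and torque (Tx, Ty, Tz)
    instruction: str            # natural-language task command

# The policy maps one Observation to an action chunk: a short horizon of
# end-effector commands executed before the next inference call.
ACTION_HORIZON, ACTION_DIM = 50, 7   # assumed values for illustration
action_chunk = torch.zeros(ACTION_HORIZON, ACTION_DIM)  # e.g. delta pose + gripper command
```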
Figure 3: Overview of our ForceVLA model. Visual and language inputs are processed by a pretrained VLM to form contextual embeddings. External force signals are projected and fused with VLM outputs via the FVLMoE module. The resulting multimodal features guide a flow-based action head to generate contact-aware robot actions.
ForceVLA is an end-to-end multimodal robotic policy designed for contact-rich manipulation. Building upon the π0 framework, it integrates vision, language, proprioception, and 6-axis force feedback to generate actions through a conditional flow matching model. Visual inputs from multiple RGB cameras and task instructions are encoded by a SigLIP-based vision-language model (based on PaliGemma) into contextual embeddings. These embeddings, combined with proprioceptive and force cues, condition an iterative denoising process that predicts the action trajectory.
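To make the conditional flow-matching step concrete, here is a minimal sketch of how an action chunk could be sampled by integrating a learned velocity field conditioned on the fused multimodal embeddings; `velocity_net`, the number of Euler steps, and the action dimensions are assumptions, not π0's or ForceVLA's exact settings.

```python
# Hedged sketch of conditional flow-matching action sampling. `velocity_net`
# stands in for the learned action expert; steps and dimensions are assumptions.
import torch

@torch.no_grad()
def sample_action_chunk(velocity_net, cond_tokens, horizon=50, act_dim=7, steps=10):
    """Integrate a learned velocity field from Gaussian noise to an action chunk."""
    a = torch.randn(1, horizon, act_dim)      # a_0 ~ N(0, I): pure noise trajectory
    dt = 1.0 / steps
    for k in range(steps):
        t = torch.full((1,), k * dt)          # flow time in [0, 1)
        # Velocity prediction conditioned on vision-language, proprioceptive,
        # and force-derived features packed into cond_tokens.
        v = velocity_net(a, t, cond_tokens)
        a = a + dt * v                        # forward Euler step along the flow
    return a                                  # denoised action trajectory
```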
FVLMoE is the core module enabling effective force integration. Force readings are linearly projected into dedicated tokens and fused with vision-language embeddings via a Mixture-of-Experts (MoE) module. Inspired by MoE’s strength in multi-task and modality-specific learning, FVLMoE adaptively routes and processes multimodal inputs. Its output serves as a rich guidance signal for the flow model, allowing ForceVLA to handle subtle contact dynamics and visually ambiguous scenarios with greater precision and robustness.
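The block below is an illustrative PyTorch sketch of a force-aware MoE fusion layer in the spirit of FVLMoE: the 6-axis force reading is projected into a token, concatenated with the vision-language embeddings, and routed through a small set of experts by a learned gate. Layer sizes, the number of experts, top-k routing, and dense (masked) expert evaluation are all assumptions made for readability, not the released architecture.

```python
# Illustrative force-aware MoE fusion block; hyperparameters and routing details
# are assumptions, and experts are evaluated densely (masked) for clarity.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ForceMoEFusion(nn.Module):
    def __init__(self, d_model=512, force_dim=6, num_experts=4, top_k=2):
        super().__init__()
        self.force_proj = nn.Linear(force_dim, d_model)   # lift the 6-axis F/T reading to a token
        self.router = nn.Linear(d_model, num_experts)     # gating network over experts
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(num_experts)
        ])
        self.top_k = top_k

    def forward(self, vl_tokens, force):
        # vl_tokens: (B, N, d) vision-language embeddings; force: (B, 6)
        force_token = self.force_proj(force).unsqueeze(1)       # (B, 1, d)
        tokens = torch.cat([vl_tokens, force_token], dim=1)     # fuse modalities token-wise
        gate_logits = self.router(tokens)                       # (B, N+1, num_experts)
        weights, idx = gate_logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)                    # normalize over selected experts
        out = torch.zeros_like(tokens)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = (idx[..., slot] == e).unsqueeze(-1)      # tokens routed to expert e
                out = out + mask * weights[..., slot:slot + 1] * expert(tokens)
        return out                                              # guidance features for the action head
```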
Experiments
Experimental Setups
Figure 4: Overview of task setups used in evaluation. (a) Insert USB, (b) pump bottle, (c) insert plug, (d) peel cucumber, and (e) wipe board. These tasks span diverse contact dynamics and manipulation skills, from precise insertions to tool-mediated surface interactions.
To evaluate the effectiveness of ForceVLA, we conducted experiments on five diverse contact-rich manipulation tasks: Bottle Pumping, Plug Insertion, USB Drive Insertion, Whiteboard Wiping, and Cucumber Peeling, as shown in Figure 4. These tasks were chosen to assess fine-grained control, adaptability to varied initial conditions, and the utility of multimodal feedback, particularly force sensing. Each task poses distinct physical challenges: Bottle Pumping requires precise vertical pressing; Plug and USB Drive Insertion involve accurate alignment and force-controlled insertion; Whiteboard Wiping demands smooth trajectory control and sustained surface contact; and Cucumber Peeling tests the ability to apply and maintain controlled force during continuous surface interaction.
Main Results
Figure 5: Main task success rates across different methods. ForceVLA significantly outperforms all baselines on five contact-rich tasks. Incorporating external force feedback improves performance for the π0-base model, while our method achieves the highest average success rate, demonstrating robust performance under complex interaction dynamics. “Wipe Board-1” indicates the success rate of performing the wiping motion, while “Wipe Board-2” refers to the success rate of completely erasing the markings.
ForceVLA achieves an average success rate of 60.5% across all five tasks, significantly outperforming all baseline configurations. Compared to the standard π0-base model without force feedback (π0-base w/o F), which achieved an average of 37.3%, ForceVLA yields an absolute improvement of 23.2%. This highlights the substantial benefit of incorporating and effectively processing multimodal information, including force.
Table 1 further highlights ForceVLA's superior performance on the intricate cucumber peeling task. Our model excelled on both key metrics. It achieved the longest average peel length per stroke (14.12 cm), compared with 13.17 cm for π0-base w/ F and 10.27 cm for π0-base w/o F, indicating high-fidelity surface manipulation through stable tool orientation, adaptive contouring, and sustained surface contact. At the same time, ForceVLA was the most efficient, requiring only 7 strokes to substantially peel the cucumber, versus 10 and 14 strokes for π0-base w/ F and π0-base w/o F, respectively. Together, these results underscore ForceVLA's ability to maintain consistent, effective tool-surface interaction and to execute efficient, goal-directed motions in tasks demanding continuous and precise force modulation.
Model Generalization
Figure 6: Variants of generalization settings used in our experiments. (a–b) Different object geometries; (c) variation in socket height; (d) partial visual occlusion; (e) unstable socket conditions. These scenarios evaluate robustness under diverse physical and perceptual perturbations.
To evaluate ForceVLA’s generalization, we designed five challenging settings involving object and height variations, visual occlusion, and environmental instability. ForceVLA consistently demonstrated superior generalization, achieving high success rates (e.g., 80.00% in object variation and 90.00% under visual occlusion) by effectively scaling interaction forces and avoiding torque limit violations seen in other models. This robust performance across varied conditions underscores how ForceVLA’s FVLMoE architecture intelligently integrates force feedback—not just for sensing contact, but for modulating actions in response to dynamic physical conditions—enabling versatile and adaptive robotic manipulation.
Ablation Studies
To validate ForceVLA’s force integration strategy, ablation studies compared different fusion approaches. Early fusion methods, injecting force prior to the Vision-Language Model (VLM), severely degraded performance—notably, an MoE-based early fusion failed entirely (0% success)—likely by disrupting pre-trained VLM representations. While simple late fusion (concatenating force at the decoding stage) improved success to 60% over a no-force baseline, our proposed ForceVLA, which introduces force features post-VLM and uses the FVLMoE module for adaptive fusion, achieved a markedly superior 80% success rate. These results confirm that both introducing force after VLM encoding and employing sophisticated fusion via FVLMoE are critical for effective contact-rich robotic behavior.
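To make the three fusion points concrete, the sketch below contrasts them at the level of a policy forward pass; `policy.vlm`, `policy.force_proj`, `policy.fvlmoe`, `policy.action_head`, and the observation fields are hypothetical placeholders used only to show where force enters the pipeline.

```python
# Simplified contrast of the fusion strategies compared in the ablation; all
# component names and signatures are placeholders, not the actual codebase.
import torch

def early_fusion(policy, obs):
    # Force injected before the VLM: force tokens join the image/text tokens,
    # which the ablation found disrupts pretrained VLM representations.
    tokens = policy.vlm(obs.images, obs.instruction,
                        extra_tokens=policy.force_proj(obs.force))
    return policy.action_head(tokens, obs.proprio)

def late_fusion_concat(policy, obs):
    # Simple late fusion: force features concatenated at the decoding stage.
    tokens = policy.vlm(obs.images, obs.instruction)
    cond = torch.cat([tokens.mean(dim=1), policy.force_proj(obs.force)], dim=-1)
    return policy.action_head(cond, obs.proprio)

def forcevla_fusion(policy, obs):
    # ForceVLA: force introduced after VLM encoding and fused adaptively by FVLMoE.
    tokens = policy.vlm(obs.images, obs.instruction)
    fused = policy.fvlmoe(tokens, obs.force)
    return policy.action_head(fused, obs.proprio)
```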
Visualization and Case Studies
Figure 7: Trajectory visualizations across tasks and conditions. (a) USB insertion, (b) bottle pumping, and (c) plug insertion under stable and unstable socket conditions. Each sequence illustrates how ForceVLA adapts its actions in response to contact dynamics, retrying or adjusting pose when failures occur, ultimately achieving successful task completion.
Figure 8: Open-loop evaluation of expert load across different task completion percentages for various tasks: (a) Pump Bottle, (b) Insert Plug, (c) Wipe Board, (d) Peel Cucumber, and (e) Insert USB. Each subplot represents the average expert load (vertical axis) as a function of the task completion percentage (horizontal axis) over the episodes in the test dataset.
To analyze MoE routing dynamics, we examined expert selection probabilities, normalized temporally across episodes. We observed distinct expert utilization patterns per task: some, like insert plug and peel cucumber, showed clear temporal specialization with experts dominating specific phases, while tasks such as wipe board favored one expert more consistently. Notably, one expert (Expert 0) activated frequently across multiple tasks, suggesting a general-purpose role for common multimodal fusion or control primitives, contrasting with other more phase-specific experts. This reveals learned dynamic computation allocation and functional specialization, reflecting both task semantics and potential training-induced architectural biases.
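For reference, a minimal sketch of this analysis is given below: router probabilities from each test episode are binned by normalized task progress and averaged to produce the per-expert load curves. Array shapes and the number of bins are assumptions.

```python
# Hedged sketch of the expert-load analysis: average routing probability per
# expert as a function of normalized task progress. Shapes/bins are assumptions.
import numpy as np

def expert_load_vs_progress(episode_gate_probs, num_bins=20):
    """episode_gate_probs: list of (T_i, num_experts) arrays of routing probabilities."""
    num_experts = episode_gate_probs[0].shape[1]
    load = np.zeros((num_bins, num_experts))
    counts = np.zeros(num_bins)
    for probs in episode_gate_probs:
        T = probs.shape[0]
        progress = np.arange(T) / max(T - 1, 1)                      # normalize time to [0, 1]
        bins = np.minimum((progress * num_bins).astype(int), num_bins - 1)
        for b, p in zip(bins, probs):                                # accumulate per progress bin
            load[b] += p
            counts[b] += 1
    return load / np.maximum(counts[:, None], 1)                     # average expert load per bin
```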
Videos
Seen bottle
Unseen bottle 1
Unseen bottle 2
Unseen bottle 3
Unseen bottle 4
Visual occlusion on seen bottle
Visual occlusion on unseen bottle
Visual occlusion on seen bottle
Height generalization 1
Height generalization 2
Height generalization 3
Small, square white plug (5x speed)
Flat, wide black plug (2x speed)
High, thin black plug (2x speed)
(3x speed)
(2x speed)