Vision-language-action (VLA) models demonstrate strong generalization in robotic manipulation but face challenges in complex, real-world tasks. While supervised fine-tuning with demonstrations is constrained by data quality, reinforcement learning (RL) offers a promising alternative. We propose a human-in-the-loop dual-actor fine-tuning framework grounded in RL. The framework integrates a primary actor for robust multi-task performance with a refinement actor for latent-space adaptation. Beyond standard physical interventions, we introduce a lightweight talk-and-tweak scheme that converts human corrections into semantically grounded language commands, thereby generating a new dataset for policy learning. In real-world multi-task experiments, our approach achieves 100% success across three tasks within 101 minutes of online fine-tuning. For long-horizon tasks, it sustains a 50% success rate over 12 consecutive operations. Furthermore, the framework scales effectively to multi-robot training, achieving up to a 2× improvement in efficiency when using dual robots.
The primary actor generates robust multi-task actions via diffusion, while the refinement actor operates in the latent noise space to provide fine-grained adjustments. Human interventions are integrated through the talk-and-tweak scheme, which translates physical corrections into semantically grounded refinement commands.
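As a rough illustration of this dual-actor design, the sketch below samples the initial diffusion noise, lets a small refinement network shift it in latent space, and hands the steered noise to the primary actor for denoising; the module names, network sizes, and `denoise` interface are assumptions for illustration rather than the exact implementation.

```python
import torch
import torch.nn as nn

class RefinementActor(nn.Module):
    """Maps an observation embedding to a small adjustment of the latent noise."""
    def __init__(self, obs_dim: int, latent_dim: int, scale: float = 0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 256),
            nn.ReLU(),
            nn.Linear(256, latent_dim),
        )
        self.scale = scale  # keeps adjustments small relative to the unit-variance prior

    def forward(self, obs_emb: torch.Tensor) -> torch.Tensor:
        return self.scale * torch.tanh(self.net(obs_emb))

class DummyPrimaryActor:
    """Stand-in for the pretrained diffusion policy (illustrative only)."""
    def denoise(self, noise: torch.Tensor, obs_emb: torch.Tensor) -> torch.Tensor:
        # A real diffusion policy would iteratively denoise `noise` conditioned
        # on `obs_emb`; the placeholder simply returns the steered noise.
        return noise

def dual_actor_action(primary_actor, refinement_actor, obs_emb, latent_dim):
    # Sample the initial diffusion noise, steer it in latent space,
    # then let the primary actor denoise it into an action.
    noise = torch.randn(obs_emb.shape[0], latent_dim)
    steered_noise = noise + refinement_actor(obs_emb)
    return primary_actor.denoise(steered_noise, obs_emb)

obs_emb = torch.zeros(1, 64)
refiner = RefinementActor(obs_dim=64, latent_dim=16)
action = dual_actor_action(DummyPrimaryActor(), refiner, obs_emb, latent_dim=16)
```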
These refinement commands are expressed as short, semantically grounded language instructions and can take several forms.
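As a hedged illustration of one possible form, the sketch below maps a Cartesian correction recorded during a physical intervention to a short directional command; the axis-to-phrase mapping, threshold, and vocabulary are illustrative assumptions, not the scheme's actual command set.

```python
import numpy as np

def correction_to_command(delta_pos: np.ndarray, threshold: float = 0.01) -> str:
    """Translate a Cartesian correction (meters) into a short language command."""
    # Assumed axis conventions and vocabulary; a real scheme would define these
    # to match the robot frame and the VLA model's instruction distribution.
    axes = [
        ("move right", "move left"),        # x
        ("move forward", "move backward"),  # y
        ("move up", "move down"),           # z
    ]
    i = int(np.argmax(np.abs(delta_pos)))   # dominant correction axis
    if abs(delta_pos[i]) < threshold:
        return "hold position"
    positive_cmd, negative_cmd = axes[i]
    return positive_cmd if delta_pos[i] > 0 else negative_cmd

# Each intervention can then be logged as (observation, corrected action, command)
# to build the new dataset used for policy learning.
print(correction_to_command(np.array([0.0, -0.03, 0.005])))  # "move backward"
```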
Experiments are conducted on a 7 degree-of-freedom (DoF) robotic arm developed in-house. The observation consists of two RGB images, captured by a wrist-mounted camera and a head-mounted camera, together with the robot's proprioceptive state.
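A minimal sketch of this observation structure is shown below; the image resolution and the proprioceptive dimensions (7 joints plus a gripper value) are assumptions for illustration.

```python
from dataclasses import dataclass

import numpy as np

@dataclass
class Observation:
    wrist_rgb: np.ndarray  # (H, W, 3) image from the wrist-mounted camera
    head_rgb: np.ndarray   # (H, W, 3) image from the head-mounted camera
    proprio: np.ndarray    # proprioceptive state of the 7-DoF arm

obs = Observation(
    wrist_rgb=np.zeros((224, 224, 3), dtype=np.uint8),  # resolution assumed
    head_rgb=np.zeros((224, 224, 3), dtype=np.uint8),
    proprio=np.zeros(8, dtype=np.float32),  # 7 joint angles + gripper (assumed)
)
```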
We evaluate the proposed fine-tuning method on multi-task learning against three baselines: HG-DAgger, which fine-tunes policies via supervised learning from human corrections; HIL-ConRFT, which applies human-in-the-loop RL with a flat optimization scheme; and DSRL, which refines policies by steering latent noise in diffusion models. Our method consistently achieves the highest success rates across all tasks while also reducing episode length, indicating not only more reliable task completion but also more efficient execution.
The dual-actor framework strikes a favorable balance between sample efficiency and training stability, demonstrating that latent-space refinement provides an effective policy adaptation mechanism across multiple tasks.
Our method consistently achieves strong performance across different VLA backbones, demonstrating its general applicability to diverse pretrained models without backbone-specific modifications.
Our method reliably executes complex, multi-step tasks. Although each bolt requires three sequential actions, the framework maintains high performance on single-bolt assembly and generalizes reasonably well to longer sequences.
These results underscore the potential of our framework for large-scale real-world deployments, where parallel learning substantially reduces training time and improves sample efficiency. Moreover, the ability to sustain high performance across diverse robotic platforms further demonstrates the robustness of our approach.
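As a rough sketch of how dual-robot training could be parallelized at the data-collection level, the snippet below runs one worker per robot and writes transitions into a shared buffer; the threading layout and dummy rollout are assumptions for illustration, not the actual training infrastructure.

```python
import queue
import random
import threading
import time

replay_buffer: "queue.Queue[dict]" = queue.Queue()  # shared across robots

def collect_episode(robot_id: int) -> list:
    # Stand-in for a real environment rollout; returns dummy transitions.
    time.sleep(0.01)
    return [{"robot": robot_id, "reward": random.random()} for _ in range(5)]

def robot_worker(robot_id: int, num_episodes: int) -> None:
    # Each robot collects episodes independently, so two robots roughly halve
    # the wall-clock time needed to gather the same amount of experience.
    for _ in range(num_episodes):
        for transition in collect_episode(robot_id):
            replay_buffer.put(transition)

threads = [threading.Thread(target=robot_worker, args=(i, 10)) for i in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(f"collected {replay_buffer.qsize()} transitions from 2 robots")
```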
[Figure: long-horizon bolt-assembly evaluation with one, two, three, and four bolts; sub-task commands include "place the bolt upright on the table" and "assemble the bolt on the stud".]