Vision-Language-Action (VLA) models have shown that large-scale multimodal pretraining enables open-vocabulary perception and instruction following in robotics. Yet most current VLA systems rely almost exclusively on imitation learning (IL) from successful demonstrations, inheriting several fundamental limitations: IL policies suffer from compounding errors under covariate shift; success-heavy demonstration datasets yield policies with little recovery behavior when execution goes wrong; and supervised objectives optimize action likelihood rather than task success, efficiency, safety, or user preferences. Even at massive scale, demonstration coverage cannot match open-world variation, leaving persistent generalization gaps.
Reinforcement learning (RL) offers a principled path beyond these limitations, enabling policies to learn from failures, adapt to deployment distributions, and directly optimize downstream objectives. The recent success of RL in improving reasoning and alignment in large language models suggests similar gains are possible for embodied agents, grounding decision-making in reward signals rather than demonstration mimicry.
However, RL for VLAs is non-trivial. Real-world sample efficiency and reset constraints make naive online RL impractical. Rewards for language-conditioned manipulation are sparse and semantic, making them hard to specify. Credit assignment across perception, language grounding, and control is deeply entangled. RL fine-tuning of large VLA backbones introduces instability and catastrophic forgetting. Tokenized and chunked action representations clash with standard RL algorithms. Sim-to-real gaps widen under multimodal grounding. And safety constraints are not optional in embodied settings.
This workshop brings together researchers to tackle these challenges head-on. We will examine algorithmic foundations, when and how RL fine-tuning meaningfully improves over behavior cloning, and how to design rewards for language-conditioned tasks. We will explore long-horizon optimization, hierarchical methods, and lessons from RL for LLMs. We will address human feedback and preference alignment for embodied agents. And we will confront the practical realities of scaling, sim-to-real transfer, and evaluation. We welcome participants from robot learning, offline and online RL, vision-language models, embodied foundation models, and human-robot interaction.
Topics of Interest:
RL fine-tuning of multimodal foundation models for robotics
Offline and batch RL for language-conditioned policies
Reward modeling and preference-based RL for embodied agents
Human-in-the-loop reinforcement learning for VLA models
Hierarchical and model-based RL for multimodal planning
Sim-to-real transfer in RL-based VLA systems
Safety and robustness in RL-trained language-conditioned agents
Benchmarks for evaluating RL in Vision-Language-Action models
Submission Open: May 12th, 2026
Submission Deadline: June 8th, 2026 (AOE)
Acceptance Notification: June 17th, 2026 (AOE)
Camera-Ready Submission Deadline: June 30th, 2026 (AOE)
Workshop Day: Friday, July 17th, 2026 (Afternoon)
Associate Professor, UT Austin, NVIDIA Research
Assistant Professor, Nanyang Technological University
Assistant Professor, Tsinghua University
PhD Student, UC Berkeley
Professor, University of Würzburg
Professor, TU Darmstadt
Professor, TU Darmstadt, RIG, SAIROL
Professor, Karlsruhe Institute of Technology