Vision-Language-Action (VLA) models have shown that large-scale multimodal pretraining enables open-vocabulary perception and instruction following in robotics. Yet most current VLA systems rely almost exclusively on imitation learning (IL) from successful demonstrations, inheriting several fundamental limitations: IL policies suffer from compounding errors under covariate shift; success-heavy demonstration datasets yield policies with little recovery behavior when execution goes wrong; and supervised objectives optimize action likelihood rather than task success, efficiency, safety, or user preferences. Even at massive scale, demonstration coverage cannot match open-world variation, leaving persistent generalization gaps.
Reinforcement learning (RL) offers a principled path beyond these limitations, enabling policies to learn from failures, adapt to deployment distributions, and directly optimize downstream objectives. The recent success of RL in improving reasoning and alignment in large language models suggests similar gains are possible for embodied agents, grounding decision-making in reward signals rather than demonstration mimicry.
However, RL for VLAs is non-trivial. Real-world sample efficiency and reset constraints make naive online RL impractical. Rewards for language-conditioned manipulation are sparse and semantic, making them hard to specify. Credit assignment across perception, language grounding, and control is deeply entangled. RL fine-tuning of large VLA backbones introduces instability and catastrophic forgetting. Tokenized and chunked action representations clash with standard RL algorithms. Sim-to-real gaps widen under multimodal grounding. And safety constraints are not optional in embodied settings.
This workshop brings together researchers to tackle these challenges head-on. We will examine algorithmic foundations, when and how RL fine-tuning meaningfully improves over behavior cloning, and how to design rewards for language-conditioned tasks. We will explore long-horizon optimization, hierarchical methods, and lessons from RL for LLMs. We will address human feedback and preference alignment for embodied agents. And we will confront the practical realities of scaling, sim-to-real transfer, and evaluation. We welcome participants from robot learning, offline and online RL, vision-language models, embodied foundation models, and human-robot interaction.
Topics of Interest:
RL fine-tuning of multimodal foundation models for robotics
Offline and batch RL for language-conditioned policies
Reward modeling and preference-based RL for embodied agents
Human-in-the-loop reinforcement learning for VLA models
Hierarchical and model-based RL for multimodal planning
Sim-to-real transfer in RL-based VLA systems
Safety and robustness in RL-trained language-conditioned agents
Benchmarks for evaluating RL in Vision-Language-Action models
Submission Open: May 12th, 2026
Submission Deadline: June 8th, 2026 (AOE)
Acceptance Notification: June 17th, 2026 (AOE)
Camera-Ready Submission Deadline: June 30th, 2026 (AOE)
Workshop Day: Friday, July 17th, 2026 (Afternoon)
Associate Professor, UT Austin, NVIDIA Research
Assistant Professor, Nanyang Technological University
Assistant Professor, Tsinghua University
PhD Student, UC Berkeley
Professor, University of Würzburg
Professor, TU Darmstadt
Professor, TU Darmstadt, RIG, SAIROL
Professor, Karlsruhe Institute of Technology