What Can RL Bring to VLA Generalization?
An Empirical Study
Anonymous authors
Overview of our study
We conduct an empirical study to evaluate the generalization benefits of reinforcement learning (RL) fine-tuning versus supervised fine-tuning (SFT) for Vision-Language-Action (VLA) models.
In out-of-distribution tests, RL improves VLA generalization substantially in Execution and moderately in Semantics, and performs on par with SFT in Vision.
We base our study on OpenVLA (Kim et al., 2024), an open-source model that achieves state-of-the-art performance on various robot tasks.
At each time step the policy receives a single RGB image and an instruction, i.e., the history length H=1, and outputs a sequence of discretized action tokens representing the predicted control commands.
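To make "discretized action tokens" concrete, the following is a minimal sketch of uniform binning, not the study's actual tokenizer. The bin count (256), the normalized action range [-1, 1], and the function names are all assumptions chosen for illustration.

```python
# Minimal sketch of uniform action discretization (illustrative only).
# Assumptions: actions are normalized to [low, high] and each dimension
# is mapped to one of n_bins integer tokens. These values are hypothetical.

def discretize_action(action, low=-1.0, high=1.0, n_bins=256):
    """Map each continuous action dimension to a bin index in [0, n_bins - 1]."""
    tokens = []
    for a in action:
        a = min(max(a, low), high)            # clip to the supported range
        frac = (a - low) / (high - low)       # position in [0, 1]
        tokens.append(min(int(frac * n_bins), n_bins - 1))
    return tokens


def undiscretize_action(tokens, low=-1.0, high=1.0, n_bins=256):
    """Map bin indices back to continuous values at the bin centers."""
    width = (high - low) / n_bins
    return [low + (t + 0.5) * width for t in tokens]


if __name__ == "__main__":
    command = [-1.0, 0.0, 0.25, 1.0]          # a made-up control command
    tokens = discretize_action(command)
    print(tokens)
    print(undiscretize_action(tokens))
```

Under these assumptions the round-trip error per dimension is at most half a bin width, which is the resolution the policy can express in a single token.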
RL algorithms: PPO, GRPO, DPO
We consider three representative RL algorithms: PPO, GRPO and DPO, fine-tuning the OpenVLA model with LoRA.
Our findings indicate that PPO consistently outperforms GRPO and DPO, likely due to non-stationary dynamics destabilizing GRPO, and sparse rewards together with distribution shifts limiting DPO.
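One structural difference between PPO and GRPO is how advantages are estimated: PPO typically uses a learned critic with generalized advantage estimation (GAE), while GRPO normalizes returns within a group of rollouts for the same task. The sketch below contrasts the two estimators in plain Python. It is a generic illustration, not the implementation used in this study, and the hyperparameters and function names are assumptions.

```python
import statistics

# Generic sketch of the two advantage estimators (illustrative only;
# gamma, lam, and eps are assumed defaults, not values from the study).

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """PPO-style GAE for one episode.

    `values` holds the critic's estimates and has len(rewards) + 1 entries;
    the last entry is the bootstrap value (0.0 for a terminal state).
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD residual
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


def grpo_advantages(group_returns, eps=1e-8):
    """GRPO-style advantages: standardize returns within a rollout group."""
    mean = statistics.fmean(group_returns)
    std = statistics.pstdev(group_returns)
    return [(r - mean) / (std + eps) for r in group_returns]


if __name__ == "__main__":
    # One episode with a sparse success reward at the final step.
    print(gae_advantages([0.0, 0.0, 1.0], [0.0, 0.0, 0.0, 0.0]))
    # Four rollouts of the same task, two of which succeed.
    print(grpo_advantages([1.0, 0.0, 0.0, 1.0]))
```

GAE assigns a per-step advantage using the critic, whereas the group-relative estimator assigns one scalar per rollout and needs no critic.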
Design factors of PPO
Shared actor-critic backbone: saves 45% VRAM and trains 53% faster.
VLA warm-up: converges with about 50% fewer environment steps.
Minimal PPO epoch: reduces wall-clock time with similar sample efficiency.
Comparison between RL and SFT
Inspired by prior works (Fan et al., 2025; Stone et al., 2023) and the concept of Vision-Language-Action models, we define three dimensions of generalization:
Vision: We include both foreground and background changes, as well as image-level dynamic noise.
Semantics: We consider unseen variations in objects, receptacles, and instruction phrasings, as well as several new tasks.
Execution: We investigate changes in the initial positions of the object and receptacle, as well as the robot's initial pose.
In the training setting, we randomise along three axes: 16 tables (Vision), 16 objects (Semantics), and perturbations of object and receptacle poses (Execution).
At test time we hold at least one of these factors out of distribution, introducing 9 novel objects, 16 unseen receptacles, 5 new table surroundings, and 16 distractor textures.
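The train/test split described above can be sketched as a small episode sampler. Only the pool sizes (16 training tables, 16 training objects, 9 novel objects, 16 unseen receptacles, 5 new tables) come from the text. The asset names, the fixed training receptacle, and the pose jitter range are hypothetical placeholders, and the sampler is not the study's environment code.

```python
import random

# Pool sizes follow the text; every asset name is a hypothetical placeholder.
TRAIN_TABLES = [f"train_table_{i}" for i in range(16)]
TRAIN_OBJECTS = [f"train_object_{i}" for i in range(16)]
UNSEEN_OBJECTS = [f"unseen_object_{i}" for i in range(9)]
UNSEEN_RECEPTACLES = [f"unseen_receptacle_{i}" for i in range(16)]
UNSEEN_TABLES = [f"unseen_table_{i}" for i in range(5)]

# Assumptions: one training receptacle and a +/-0.1 planar pose jitter.
TRAIN_RECEPTACLE = "plate"
POSE_JITTER = 0.1

OOD_POOLS = {
    "table": UNSEEN_TABLES,            # Vision
    "object": UNSEEN_OBJECTS,          # Semantics
    "receptacle": UNSEEN_RECEPTACLES,  # Semantics
}


def sample_offset(rng):
    """Sample a planar (x, y) pose perturbation."""
    return (rng.uniform(-POSE_JITTER, POSE_JITTER),
            rng.uniform(-POSE_JITTER, POSE_JITTER))


def sample_episode(rng, held_out=None):
    """Sample an in-distribution episode, optionally swapping one factor
    ("table", "object", or "receptacle") for an unseen asset."""
    episode = {
        "table": rng.choice(TRAIN_TABLES),
        "object": rng.choice(TRAIN_OBJECTS),
        "receptacle": TRAIN_RECEPTACLE,
        "object_offset": sample_offset(rng),
        "receptacle_offset": sample_offset(rng),
    }
    if held_out is not None:
        episode[held_out] = rng.choice(OOD_POOLS[held_out])
    return episode


if __name__ == "__main__":
    rng = random.Random(0)
    print(sample_episode(rng))                      # training-style episode
    print(sample_episode(rng, held_out="object"))   # unseen-object test episode
```

Holding out exactly one factor per test episode keeps the other two in distribution, which makes it easier to attribute a failure to a single generalization dimension.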
Comparison Results
Vision: SFT and RL perform comparably
Semantics: RL improves moderately
Execution: RL enhances substantially
Vision - Unseen Table
Case 1: both SFT and RL fail to grasp
Case 2: SFT grasps and sticks, RL succeeds
Vision - Dynamic Texture (weak)
Case 1: SFT fails to put on plate, RL fails to grasp
Case 2: SFT fails to grasp, RL succeeds
Vision - Dynamic Texture (strong)
Case 1: SFT sticks after grasping, RL fails to grasp
Case 2: SFT moves arm without holding the object, RL succeeds
Vision - Dynamic Noise (weak)
Case 1: both SFT and RL fail to grasp
Case 2: SFT moves arm without holding the object, RL succeeds
Vision - Dynamic Noise (strong)
Case 1: SFT fails to grasp, RL grasps and drops the object
Case 2: SFT fails to grasp, RL succeeds
Semantics - Unseen Objects
Case 1: SFT fails to grasp, RL grasps and drops the object off the table
Case 2: SFT sticks, RL succeeds
Semantics - Unseen Receptacles
Case 1: SFT grasps and idles, RL doesn't grasp
Case 2: SFT fails to place the object, RL succeeds
Semantics - Unseen Instruction Phrasings
Instruct 1: pick up kitchen shovel and set it down on plate
Instruct 2: Put banana onto plate.
Case 1: both SFT and RL fail to grasp on the first attempt
Case 2: SFT sticks after grasping, RL succeeds
Semantics - Multi-Object (both seen)
Instruct 1: put watering can on plate
Instruct 2: put BBQ sauce on plate
Case 1: both SFT and RL fail to grasp
Case 2: SFT sticks after grasping, RL succeeds
Semantics - Multi-Object (both unseen)
Instruct 1: put champagne glass on plate
Instruct 2: put travel cup on plate
Case 1: SFT tries to grasp the wrong object, RL puts the wrong object on plate
Case 2: SFT sticks after grasping, RL succeeds
Semantics - Distractive Receptacles
Case 1: SFT puts the object on the wrong receptacle, RL drops the object
Case 2: SFT hovers after grasping, RL succeeds
Semantics - Multi-Recep. (both unseen)
Instruct 1: put banana on sheet metal
Instruct 2: put plastic bottle on tomato slice
Case 1: SFT puts the object on the correct receptacle, then moves it to the wrong one; RL directly puts the object on the wrong receptacle
Case 2: SFT hovers after grasping, RL succeeds
Execution - Unseen Position (obj. & recep.)
Case 1: SFT fails to grasp, RL sticks
Case 2: SFT moves arm without holding the object, RL succeeds
Execution - Unseen Robot Init Pose
Case 1: both SFT and RL fail to grasp
Case 2: SFT fails to grasp, RL succeeds
Execution - Mid-Episode Obj. Reposition
Case 1: both SFT and RL fail to grasp
Case 2: SFT moves arm without holding the object, RL succeeds