What Can RL Bring to VLA Generalization?

An Empirical Study

Anonymous authors

Overview of our study

We conduct an empirical study to evaluate the generalization benefits of reinforcement learning (RL) fine-tuning versus supervised fine-tuning (SFT) for Vision-Language-Action (VLA) models.

In out-of-distribution tests, RL substantially improves VLA generalization in Execution, moderately improves it in Semantics, and performs on par with SFT in Vision.

1. Preliminary: Vision-Language-Action model

2. Effective RL fine-tuning of VLA models

3. Evaluating fine-tuning methods on VLA generalization

Appendix: More demonstration videos

Vision tasks

Semantics tasks

Execution tasks

1. Preliminary: Vision-Language-Action model

We base our study on OpenVLA (Kim et al., 2024), an open-source model that achieves state-of-the-art performance on various robot tasks.

At each time step the policy receives a single RGB image and an instruction, i.e., the history length H=1, and outputs a sequence of discretized action tokens representing the predicted control commands.
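To make "discretized action tokens" concrete, here is a minimal sketch of uniform binning for a continuous action vector. The bin count (256), the per-dimension action range, and the function names `discretize` and `undiscretize` are illustrative assumptions, not details taken from this page; the actual tokenizer may differ.

```python
from typing import List, Sequence

N_BINS = 256  # assumed number of bins per action dimension (hypothetical)


def discretize(action: Sequence[float], low: Sequence[float],
               high: Sequence[float], n_bins: int = N_BINS) -> List[int]:
    """Map each continuous action dimension to a bin index in [0, n_bins - 1]."""
    tokens = []
    for a, lo, hi in zip(action, low, high):
        a = min(max(a, lo), hi)          # clip to the assumed action range
        frac = (a - lo) / (hi - lo)      # normalise to [0, 1]
        tokens.append(min(int(frac * n_bins), n_bins - 1))
    return tokens


def undiscretize(tokens: Sequence[int], low: Sequence[float],
                 high: Sequence[float], n_bins: int = N_BINS) -> List[float]:
    """Map bin indices back to the centre value of each bin."""
    return [lo + (t + 0.5) / n_bins * (hi - lo)
            for t, lo, hi in zip(tokens, low, high)]


# Hypothetical 3-dimensional action in [-1, 1] per dimension.
low, high = [-1.0] * 3, [1.0] * 3
tokens = discretize([0.0, 1.0, -1.0], low, high)
recovered = undiscretize(tokens, low, high)
```

With this scheme the round-trip error per dimension is at most half a bin width, which is the resolution limit the policy works under when it emits one token per action dimension.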

2. Effective RL fine-tuning of VLA models

RL algorithms: PPO, GRPO, DPO

We consider three representative RL algorithms: PPO, GRPO and DPO, fine-tuning the OpenVLA model with LoRA.

Our findings indicate that PPO consistently outperforms GRPO and DPO, likely due to non-stationary dynamics destabilizing GRPO, and sparse rewards together with distribution shifts limiting DPO.
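For readers unfamiliar with how PPO and GRPO differ, the sketch below shows the standard PPO clipped surrogate loss and a GRPO-style group-normalised advantage in plain Python. It is a textbook illustration rather than the training code used in this study; the clip range of 0.2 and the function names are assumptions, and DPO is not shown.

```python
import math
from statistics import mean, pstdev
from typing import List, Sequence


def ppo_clip_loss(logp_new: Sequence[float], logp_old: Sequence[float],
                  advantages: Sequence[float], clip_eps: float = 0.2) -> float:
    """Standard PPO clipped surrogate loss, averaged over samples."""
    losses = []
    for ln, lo, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(ln - lo)                           # pi_new / pi_old
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        losses.append(-min(ratio * adv, clipped * adv))     # pessimistic bound
    return mean(losses)


def grpo_advantages(group_returns: Sequence[float], eps: float = 1e-8) -> List[float]:
    """GRPO-style advantage: normalise each return within its rollout group.

    No learned critic is involved; the group mean serves as the baseline.
    """
    mu = mean(group_returns)
    sigma = pstdev(group_returns)
    return [(r - mu) / (sigma + eps) for r in group_returns]


# A ratio of 2 with positive advantage is clipped to 1 + clip_eps.
loss = ppo_clip_loss([math.log(2.0)], [0.0], [1.0])
# If every rollout in a group gets the same return, all advantages are zero.
flat = grpo_advantages([1.0, 1.0, 1.0, 1.0])
```

PPO obtains its advantages from a learned value function, whereas the group-relative estimate depends only on the returns of the rollouts sampled together.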

Design factors of PPO

Shared actor-critic backbone: saves 45% VRAM and trains 53% faster.

VLA warm-up: converges with about 50% fewer environment steps.

Minimal PPO epoch: reduces wall-clock time while keeping sample efficiency similar.
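The shared actor-critic idea in the first design factor can be illustrated with a toy model: the backbone is evaluated once per observation and its features feed both a policy head and a value head. This is a minimal sketch with tiny linear layers in pure Python. The class name `SharedActorCritic`, the layer shapes, and the way the value head attaches are assumptions for illustration, not the architecture used in this study.

```python
from typing import List, Tuple


class SharedActorCritic:
    """Toy actor-critic whose policy and value heads share one backbone."""

    def __init__(self, backbone_w: List[List[float]],
                 policy_w: List[List[float]], value_w: List[float]) -> None:
        self.backbone_w = backbone_w
        self.policy_w = policy_w
        self.value_w = value_w
        self.backbone_calls = 0  # counts backbone evaluations

    def _features(self, obs: List[float]) -> List[float]:
        """One linear layer with ReLU, standing in for the large VLA backbone."""
        self.backbone_calls += 1
        return [max(0.0, sum(w * x for w, x in zip(row, obs)))
                for row in self.backbone_w]

    def forward(self, obs: List[float]) -> Tuple[List[float], float]:
        h = self._features(obs)  # computed once, reused by both heads
        logits = [sum(w * f for w, f in zip(row, h)) for row in self.policy_w]
        value = sum(w * f for w, f in zip(self.value_w, h))
        return logits, value


model = SharedActorCritic(
    backbone_w=[[1.0, 0.0], [0.0, 1.0]],
    policy_w=[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    value_w=[0.5, 0.5],
)
logits, value = model.forward([2.0, 3.0])
```

A separate critic would need its own backbone parameters and its own forward pass, which is the memory and time cost that sharing avoids.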

3. Evaluating fine-tuning methods on VLA generalization

Comparison between RL and SFT

Inspired by prior works (Fan et al., 2025; Stone et al., 2023) and the concept of Vision-Language-Action models, we define three dimensions of generalization:

Vision: We include both foreground and background changes, as well as image-level dynamic noise.

Semantics: We consider unseen variations in objects, receptacles, and instruction phrasings, as well as several new tasks.

Execution: We investigate changes in the initial positions of the object and receptacle, as well as the robot's initial pose.

In the training setting, we randomise along three axes: 16 tables (Vision), 16 objects (Semantics), and perturbations of object and receptacle poses (Execution).

At test time we hold at least one of these factors out of distribution, introducing 9 novel objects, 16 unseen receptacles, 5 new table surroundings, and 16 distractor textures.
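The train/test protocol above can be sketched as a sampler that draws in-distribution factors and then swaps one factor for an out-of-distribution value. The pool sizes come from the text; the asset names, dictionary keys, and function names are hypothetical placeholders, and pose perturbations are omitted for brevity.

```python
import random
from typing import Dict

# In-distribution pools (sizes from the text; names are placeholders).
TRAIN_TABLES = [f"train_table_{i}" for i in range(16)]
TRAIN_OBJECTS = [f"train_object_{i}" for i in range(16)]

# Out-of-distribution pools used only at test time.
OOD_POOLS = {
    "object": [f"ood_object_{i}" for i in range(9)],
    "receptacle": [f"ood_receptacle_{i}" for i in range(16)],
    "table": [f"ood_table_{i}" for i in range(5)],
    "texture": [f"ood_texture_{i}" for i in range(16)],
}


def sample_train_episode(rng: random.Random) -> Dict[str, str]:
    """Draw a training configuration from the in-distribution pools."""
    return {"table": rng.choice(TRAIN_TABLES), "object": rng.choice(TRAIN_OBJECTS)}


def sample_test_episode(rng: random.Random, held_out: str) -> Dict[str, str]:
    """Draw a test configuration with one factor replaced by an OOD value."""
    episode = sample_train_episode(rng)
    episode[held_out] = rng.choice(OOD_POOLS[held_out])
    return episode


rng = random.Random(0)
episode = sample_test_episode(rng, held_out="table")
```

Holding exactly one factor out at a time makes it possible to attribute a drop in success rate to that factor alone.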

Comparison Results

Vision: SFT and RL perform comparably

Semantics: RL improves moderately

Execution: RL enhances substantially

Appendix: More demonstration videos

Vision tasks

Vision - Unseen Table

Case 1: both SFT and RL fail to grasp

Case 2: SFT grasps and sticks, RL succeeds

Vision - Dynamic Texture (weak)

Case 1: SFT fails to put on plate, RL fails to grasp

Case 2: SFT fails to grasp, RL succeeds

Vision - Dynamic Texture (strong)

Case 1: SFT sticks after grasping, RL fails to grasp

Case 2: SFT moves arm without holding the object, RL succeeds

Vision - Dynamic Noise (weak)

Case 1: both SFT and RL fail to grasp

Case 2: SFT moves arm without holding the object, RL succeeds

Vision - Dynamic Noise (strong)

Case 1: SFT fails to grasp, RL grasps and drops the object

Case 2: SFT fails to grasp, RL succeeds

Semantics tasks

Semantics - Unseen Objects

Case 1: SFT fails to grasp, RL grasps and drops the object off the table

Case 2: SFT sticks, RL succeeds

Semantics - Unseen Receptacles

Case 1: SFT grasps and idles, RL doesn't grasp

Case 2: SFT fails to place the object, RL succeeds

Semantics - Unseen Instruction Phrasings

Instruct 1: pick up kitchen shovel and set it down on plate

Instruct 2: Put banana onto plate.

Case 1: both SFT and RL fail to grasp on the first attempt

Case 2: SFT sticks after grasping, RL succeeds

Semantics - Multi-Object (both seen)

Instruct 1: put watering can on plate

Instruct 2: put BBQ sauce on plate

Case 1: both SFT and RL fail to grasp

Case 2: SFT sticks after grasping, RL succeeds

Semantics - Multi-Object (both unseen)

Instruct 1: put champagne glass on plate

Instruct 2: put travel cup on plate

Case 1: SFT tries to grasp the wrong object, RL puts the wrong object on plate

Case 2: SFT sticks after grasping, RL succeeds

Semantics - Distractive Receptacles

Case 1: SFT puts the object on the wrong receptacle, RL drops the object

Case 2: SFT hovers after grasping, RL succeeds

Semantics - Multi-Recep. (both unseen)

Instruct 1: put banana on sheet metal

Instruct 2: put plastic bottle on tomato slice

Case 1: SFT puts the object on the correct receptacle, then moves it to the wrong one; RL directly puts the object on the wrong receptacle

Case 2: SFT hovers after grasping, RL succeeds

Execution tasks

Execution - Unseen Position (obj. & recep.)

Case 1: SFT fails to grasp, RL sticks

Case 2: SFT moves arm without holding the object, RL succeeds

Execution - Unseen Robot Init Pose

Case 1: both SFT and RL fail to grasp

Case 2: SFT fails to grasp, RL succeeds

Execution - Mid-Episode Obj. Reposition

Case 1: both SFT and RL fail to grasp

Case 2: SFT moves arm without holding the object, RL succeeds
