Lagrangian Perturbation Diffusion Steering: Latent Reinforcement Learning for Generative Policies

Hikmet Simsir, Ozgur S. Oguz
LiRA Lab, Department of Computer Engineering

Bilkent University, Ankara, Türkiye

Abstract

Behavior cloning with high-capacity generative policies achieves strong imitation performance, but is often limited by demonstration coverage and distribution shift. Direct reinforcement learning fine-tuning can improve performance, but updating large action decoders is frequently unstable and sample inefficient. We propose Lagrangian Perturbation Diffusion Steering (LP-DS), a lightweight adaptation method that improves a frozen generative policy by learning a compact noise-space perturbation before decoding. LP-DS optimizes this perturbation with a Lagrangian trust-region objective, improving downstream value while constraining deviation from the latent prior. Across RoboMimic manipulation, OpenAI Gym locomotion, and Adroit dexterous manipulation benchmarks, LP-DS improves sample efficiency, success, and return while maintaining higher action-space entropy than unconstrained noise-space steering, with return improvements of up to 25% over prior baselines. Additional evaluations with flow-matching backbones, a large vision-language-action model, and physical Franka deployment show that LP-DS is not limited to compact diffusion policies or simulated benchmarks.

Method

LP-DS improves a frozen generative policy by learning a small state-dependent perturbation in its latent noise space, rather than fine-tuning the full decoder. At each step, Gaussian noise is shifted by a learned residual before being decoded into an action, allowing the policy to steer toward higher-value behaviors while preserving the pretrained generative prior. A Lagrangian trust-region constraint adaptively limits the perturbation magnitude, preventing off-prior latent queries and reducing mode collapse. This makes LP-DS a lightweight and stable online adaptation method for diffusion, flow-matching, and other latent generative policies.
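The two ingredients described above, a residual added to the Gaussian noise before decoding and an adaptively weighted trust-region penalty, can be sketched as follows. This is a minimal illustration and not the authors' implementation. The page does not give the exact constraint, so the deviation measure (mean squared perturbation), the dual-ascent update, the learning rate, and every name here (`frozen_decoder`, `perturbed_noise`, `deviation`, `lagrangian_loss`, `update_multiplier`) are assumptions made for illustration.

```python
import math
import random


def frozen_decoder(state, z):
    """Hypothetical stand-in for the frozen pretrained decoder (never updated)."""
    return [math.tanh(s + zi) for s, zi in zip(state, z)]


def perturbed_noise(eps, delta_z):
    """Shift the sampled Gaussian noise by the learned residual, elementwise."""
    return [e + d for e, d in zip(eps, delta_z)]


def deviation(delta_z):
    """Assumed deviation measure: mean squared magnitude of the perturbation."""
    return sum(d * d for d in delta_z) / len(delta_z)


def lagrangian_loss(value, delta_z, lam, target):
    """Loss to minimize: negative value plus lam times the constraint violation."""
    return -value + lam * (deviation(delta_z) - target)


def update_multiplier(lam, delta_z, target, lr=0.1):
    """Dual ascent: lam grows while the deviation exceeds the target, and stays >= 0."""
    return max(0.0, lam + lr * (deviation(delta_z) - target))


rng = random.Random(0)
state = [0.1, -0.2, 0.3, 0.0]
eps = [rng.gauss(0.0, 1.0) for _ in state]   # sample from the latent prior
delta_z = [0.5, -0.5, 0.5, -0.5]             # stand-in for the learned residual
action = frozen_decoder(state, perturbed_noise(eps, delta_z))
lam = update_multiplier(1.0, delta_z, target=0.1)  # deviation 0.25 > 0.1, so lam increases
```

In this sketch the multiplier rises only while the perturbation exceeds the target, which is one common way to keep a constraint adaptive without hand-tuning a fixed penalty weight.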

LP-DS Algorithm

Experiments

Baseline comparisons across domains. Top row: RoboMimic manipulation success rates. Second row: OpenAI Gym locomotion episodic returns. Third row: Adroit dexterous manipulation success rates. Bottom row: Adroit dexterous manipulation episodic returns.

π0 Experiment

LP-DS is run with the π0 model on the “pick up the cream cheese and put it in the tray” task from Libero-90. The success-rate graph shows that LP-DS can work with heavy generative backbones.

Success rate vs. environment steps on the Libero-90 “pick up the cream cheese and put it in the tray” task.

10 Sample Trajectories Before LP-DS

10 sample trajectories with backbone model

10 Sample Trajectories After LP-DS (50k env. steps)

10 sample trajectories with LP-DS at step 50000

Real World Experiments

Pick and Place Task

We deployed our method on a physical Franka Panda robot for a real-world pick-and-place task. We collected 29 teleoperation trajectories and trained a base flow-matching policy on this data. The base policy achieved only a 2/10 zero-shot success rate due to real-world execution errors. We then trained our perturbation module in a simulation built with the RAI [4] framework, using the following dense reward: reward = -0.1 * ||obj_xy - target_xy|| - 0.5 * collision + 1.0 * success. Deploying the steered policy improved the physical success rate to 8/10, indicating that the method can compensate for sensor noise and the simulation-to-reality gap.
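The dense reward above can be written as a small function. This is a sketch under stated assumptions: the page does not say which norm is used or how `collision` and `success` are encoded, so the Euclidean norm and binary indicators are assumed here, and the function name `dense_reward` is hypothetical.

```python
import math


def dense_reward(obj_xy, target_xy, collision, success):
    """Reward from the formula in the text; norm and flag encoding are assumptions."""
    distance = math.dist(obj_xy, target_xy)  # assumed Euclidean distance in the xy-plane
    return -0.1 * distance - 0.5 * float(collision) + 1.0 * float(success)


# Object 5 units from the target, no collision, not yet successful.
far_from_target = dense_reward((0.0, 0.0), (3.0, 4.0), collision=False, success=False)

# Object at the target and the episode flagged as a success.
at_target = dense_reward((1.0, 1.0), (1.0, 1.0), collision=False, success=True)
```

With these coefficients the distance term only shapes the approach to the target, while a collision costs half as much as a success earns.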

We simulate real-world teleoperation data in RAI [4]

Reward curve during LP-DS training in RAI simulation

Real-world trajectories sampled from the flow-matching backbone trained with human data. This is before LP-DS training.

Real-world trajectories sampled with flow matching model steered with LP-DS. We observe an increase in success rate from 2/10 to 8/10.

Pick and Place Task with Vision

To evaluate our method under realistic sensor conditions, we upgraded the pick-and-place task to condition directly on image observations from an Intel RealSense Depth Camera D435i. Using this RGB-D camera introduces real-world visual and sensor noise into the pipeline. The base flow-matching policy was trained on a fresh dataset of human teleoperation data collected with this setup.

To ensure a comprehensive and rigorous evaluation, we implemented a systematic spatial testing protocol:

Setup: We defined a 2x4 grid across the workspace for the cube's initial positions during inference.
Evaluation: We conducted 5 independent trials at each of the 8 grid locations, resulting in 40 systematic trials.
Results: LP-DS achieved a 33/40 success rate. This demonstrates that our latent perturbation approach effectively compensates for both visual noise and spatial variations in real-world environments.
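The protocol above can be enumerated in a few lines to check the trial count and the reported rate. This is an illustrative sketch: grid cells are identified by row and column indices only, since the page gives no physical coordinates, and the names `trial_schedule` and `success_rate` are hypothetical.

```python
from itertools import product

ROWS, COLS, TRIALS_PER_CELL = 2, 4, 5  # 2x4 grid, 5 trials per location


def trial_schedule():
    """List every (row, col, trial) triple in the evaluation protocol."""
    return [
        (row, col, trial)
        for row, col in product(range(ROWS), range(COLS))
        for trial in range(TRIALS_PER_CELL)
    ]


def success_rate(successes, trials):
    """Fraction of successful trials."""
    return successes / trials


schedule = trial_schedule()
print(len(schedule))         # 40
print(success_rate(33, 40))  # 0.825
```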

Simulation of Real-World Teleoperation Data in RAI

Real-world trajectories sampled from the flow-matching backbone trained with human data. This is before LP-DS training.

Real-world trajectories sampled with flow matching model steered with LP-DS. We observe an increase in success rate from 18/40 to 33/40.

High-Precision Mug Hanging Task

To test whether our method generalizes beyond simple geometries, we deployed LP-DS on a distinct and more challenging physical task: hanging a mug by its handle onto a stationary mug holder.

Challenge: This fine manipulation task requires high-precision alignment and narrow-tolerance insertion. Such requirements present a significant challenge for frozen generative policies when faced with real-world execution errors.
Results: Evaluated across 20 independent trials, LP-DS successfully steered the base policy to complete the insertion 17 out of 20 times (17/20 success rate).

Simulation of Real-World Teleoperation Data in RAI for Mug Hanging Task

Real-world trajectories sampled from the flow-matching backbone trained with human data. This is before LP-DS training.

Real-world trajectories sampled with flow matching model steered with LP-DS. We observe an increase in success rate from 11/20 to 17/20.

Avoiding Environment

The Avoiding environment [1] is used to show the effect of the trust-region parameter on action diversity and success rate.

Eval Trajectories visualized at step 50000 for LP-DS and DSRL

Evaluation Metrics of LP-DS and DSRL on Avoiding Environment

Ablation on Trust Region Target

We ablate the trust-region size by running LP-DS in the following four environments with different trust-region sizes. We observe that the trust-region size is not a hyperparameter that requires careful tuning: values > 0.1 lead to the best performance independent of the environment, and there is nothing special about choosing 0.35 vs. 0.5 vs. 0.66 as the trust region.

Reward/Success Rate graphs of LP-DS for different choices of trust-region size on the environments Hopper, Square, Relocate, and Walker

Action vs. Noise Perturbation

We ablate our choice of learning the perturbation in noise space by instead applying the perturbation in action space.

Reward Curve of LP-DS vs. LP-DS-A in walker environment

Using Different Backbones on the Same Environment

LP-DS can work with diffusion and flow matching backbones.

Reward curve of two different generative models (matching in parameter count and initial reward) on the Hopper environment.

Toy Experiment

Toy multi-goal navigation with symmetric rewards. Four equally optimal Gaussian reward peaks (red markers) define four target modes. We visualize evaluation rollouts from the frozen backbone and after adaptation using LP-DS with different trust-region bounds δ ∈ {0.01, 0.05, 0.1}, alongside DSRL [2] and DPPO [3].

Trajectory-level mode coverage in the symmetric multi-goal toy task. Each panel corresponds to one goal (bottom-left, top-right, top-left, bottom-right) and shows, as training proceeds, the fraction of evaluation trajectories (out of 1000 rollouts per evaluation) that reach that goal. Concentration of mass into a single panel indicates mode collapse, while sustained non-trivial mass across multiple panels indicates preserved multimodality.
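The per-goal fractions described in this caption could be computed as in the sketch below. It assumes each evaluation rollout has already been labeled with the goal it reached, or `None` if it reached no goal. That labeling step, the function name `mode_coverage`, and the example data are hypothetical and not taken from the authors' code.

```python
from collections import Counter

GOALS = ("bottom-left", "top-right", "top-left", "bottom-right")


def mode_coverage(reached_goals):
    """Fraction of rollouts reaching each goal; None entries count toward the total only."""
    if not reached_goals:
        raise ValueError("need at least one rollout")
    counts = Counter(goal for goal in reached_goals if goal is not None)
    total = len(reached_goals)
    return {goal: counts[goal] / total for goal in GOALS}


# Made-up labels for illustration: one collapsed policy, one that keeps all four modes.
collapsed = ["top-left"] * 1000
diverse = [GOALS[i % 4] for i in range(1000)]

print(mode_coverage(collapsed)["top-left"])  # 1.0
print(mode_coverage(diverse)["top-left"])    # 0.25
```

All mass in one entry corresponds to the mode-collapse pattern the caption describes, and roughly equal entries correspond to preserved multimodality.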

References
[1] Xiaogang Jia, Denis Blessing, Xinkai Jiang, Moritz Reuss, Atalay Donat, Rudolf Lioutikov, & Gerhard Neumann. (2024). Towards Diverse Behaviors: A Benchmark for Imitation Learning with Human Demonstrations.

[2] Wagenmaker, A., Nakamoto, M., Zhang, Y., Park, S., Yagoub, W., Nagabandi, A., Gupta, A., and Levine, S. Steering your diffusion policy with latent space reinforcement learning, 2025. URL https://arxiv.org/abs/2506.15799

[3] Ren, A. Z., Lidard, J., Ankile, L. L., Simeonov, A., Agrawal, P., Majumdar, A., Burchfiel, B., Dai, H., and Simchowitz, M. Diffusion policy policy optimization, 2024. URL https://arxiv.org/abs/2409.00588

[4] M. Toussaint, “Robotic: A Minimalistic C++ Framework for Robot Control.” 2026. [Online]. Available: https://github.com/MarcToussaint/robotic
