Behavior cloning with high-capacity generative policies achieves strong imitation performance, but it is often constrained by limited demonstration coverage and sensitivity to distribution shift. While reinforcement learning can improve task performance, directly fine-tuning large action decoders is often unstable and sample-inefficient. We propose Lagrangian Perturbation Diffusion Steering (LP-DS), a lightweight adaptation method that improves a frozen generative policy while preserving its multimodal structure. LP-DS learns a compact noise-space perturbation module that shifts Gaussian noise inputs before decoding, enabling policy improvement without modifying the action decoder. To prevent off-manifold latent queries and unstable denoising dynamics, we optimize this module with a Lagrangian trust-region objective that maximizes downstream value while constraining perturbation magnitude, yielding stable and sample-efficient learning. Across RoboMimic manipulation, OpenAI Gym locomotion, and Adroit dexterous manipulation benchmarks, LP-DS improves sample efficiency, success rate, and return, with return improvements of up to 25% over prior baselines. It also maintains diverse behavior, as quantified by higher action-space entropy under the Kozachenko–Leonenko k-nearest-neighbor estimator.
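To make the trust-region idea concrete, the sketch below runs primal-dual gradient updates on a one-dimensional toy problem. It is only an illustration under strong assumptions and not the LP-DS implementation: the frozen decoder `decode`, the quadratic critic inside `q_grad_wrt_noise`, the scalar `shift` standing in for the perturbation module, and all step sizes are hypothetical. The Lagrangian used is L = E[Q(decode(z + shift))] - lam * (shift^2 - delta_max^2), with gradient ascent on `shift` and projected ascent on the multiplier `lam`.

```python
import random


def decode(z):
    """Hypothetical frozen decoder mapping latent noise to a scalar action."""
    return 0.5 * z


def q_grad_wrt_noise(z):
    """Gradient of the toy critic Q(a) = -(a - 2)^2 with respect to the noise z."""
    a = decode(z)
    return -2.0 * (a - 2.0) * 0.5  # chain rule: dQ/da * da/dz


def train_shift(delta_max=1.0, steps=2000, lr=0.05, lr_dual=0.05, batch=256, seed=0):
    """Primal-dual updates for: max E[Q(decode(z + shift))] s.t. shift^2 <= delta_max^2."""
    rng = random.Random(seed)
    shift, lam = 0.0, 0.0
    for _ in range(steps):
        noise = [rng.gauss(0.0, 1.0) for _ in range(batch)]
        value_grad = sum(q_grad_wrt_noise(z + shift) for z in noise) / batch
        # Primal ascent on the Lagrangian: value term minus the constraint penalty.
        shift += lr * (value_grad - 2.0 * lam * shift)
        # Dual ascent: the multiplier grows while the constraint is violated, clipped at 0.
        lam = max(0.0, lam + lr_dual * (shift * shift - delta_max * delta_max))
    return shift, lam


if __name__ == "__main__":
    shift, lam = train_shift()
    print(f"learned shift = {shift:.3f}, multiplier = {lam:.3f}")
```

In this toy, the unconstrained optimum would push the noise shift to 4, far from the region the decoder was trained on. The multiplier instead holds the shift near the trust-region boundary `delta_max`.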
Baseline comparisons across domains. Top row: RoboMimic manipulation success rates. Second row: OpenAI Gym locomotion episodic returns. Third row: Adroit dexterous manipulation success rates. Bottom row: Adroit dexterous manipulation episodic returns.
We run LP-DS with the π0 model on the “pick up the cream cheese and put it in the tray” task from Libero-90. The success-rate curve shows that LP-DS can work with heavy generative backbones.
Success rate vs. environment steps on Libero-90 “pick up the cream cheese and put it in the tray” task.
10 sample trajectories with backbone model
10 sample trajectories with LP-DS at step 50000
We deployed our method on a physical Franka Panda robot for a real-world pick-and-place task. We collected 29 teleoperation trajectories and trained a base Flow Matching policy on this data. The base policy achieved only a 2/10 zero-shot success rate due to real-world execution errors. We then trained our perturbation module in simulation using the RAI [4] framework, with the following dense reward: reward = -0.1 * ||obj_xy - target_xy|| - 0.5 * collision + 1.0 * success. Deploying the steered policy improved the physical success rate to 8/10, indicating that the method can compensate for sensor noise and the simulation-to-reality gap.
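The dense reward above can be written as a small function. This is a minimal sketch: it assumes `collision` and `success` are binary indicators and that positions are 2-D (x, y) pairs. The function name and argument layout are ours, not taken from the training code.

```python
import math


def dense_reward(obj_xy, target_xy, collision, success):
    """reward = -0.1 * ||obj_xy - target_xy|| - 0.5 * collision + 1.0 * success."""
    # Euclidean distance between the object and the target in the xy-plane.
    distance = math.hypot(obj_xy[0] - target_xy[0], obj_xy[1] - target_xy[1])
    # Assumption: collision and success are treated as 0/1 indicators.
    return -0.1 * distance - 0.5 * float(collision) + 1.0 * float(success)


if __name__ == "__main__":
    print(dense_reward((0.0, 0.0), (3.0, 4.0), collision=False, success=False))
```

The distance term gives a learning signal before the first success, while the collision and success terms dominate once the object is close to the target.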
Real-world trajectories sampled from the flow matching backbone trained with human data, before LP-DS training.
Real-world trajectories sampled from the flow matching model steered with LP-DS. We observe an increase in success rate from 2/10 to 8/10.
To evaluate our method under realistic sensor conditions, we extended the pick-and-place task to condition directly on image observations from an Intel RealSense Depth Camera D435i. Using this RGB-D camera introduces real-world visual and sensor noise into the pipeline. The base Flow Matching policy was trained on a fresh dataset of human teleoperation data collected with this setup.
To ensure a comprehensive and rigorous evaluation, we implemented a systematic spatial testing protocol:
Setup: We defined a 2x4 grid across the workspace for the cube's initial positions during inference.
Evaluation: We conducted 5 independent trials at each of the 8 grid locations, resulting in 40 systematic trials.
Results: LP-DS achieved a 33/40 success rate, compared with 18/40 for the base policy. This indicates that our latent perturbation approach can compensate for both visual noise and spatial variations in real-world environments.
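The protocol above can be sketched as a trial schedule plus a per-cell tally. The workspace bounds in `grid_positions` are placeholders (the real workspace dimensions are not stated here), and all function names are hypothetical.

```python
def grid_positions(rows=2, cols=4, x_range=(0.0, 1.0), y_range=(0.0, 1.0)):
    """Cell centres of a rows x cols grid; the bounds are placeholder values."""
    x_lo, x_hi = x_range
    y_lo, y_hi = y_range
    positions = []
    for r in range(rows):
        for c in range(cols):
            x = x_lo + (c + 0.5) * (x_hi - x_lo) / cols
            y = y_lo + (r + 0.5) * (y_hi - y_lo) / rows
            positions.append((x, y))
    return positions


def trial_schedule(positions, trials_per_cell=5):
    """One (cell_index, position) entry per trial."""
    return [(cell, pos) for cell, pos in enumerate(positions)
            for _ in range(trials_per_cell)]


def summarize(outcomes):
    """outcomes is a list of (cell_index, succeeded) pairs from the robot runs."""
    per_cell = {}
    for cell, succeeded in outcomes:
        wins, total = per_cell.get(cell, (0, 0))
        per_cell[cell] = (wins + int(succeeded), total + 1)
    successes = sum(wins for wins, _ in per_cell.values())
    trials = sum(total for _, total in per_cell.values())
    return successes, trials, per_cell


if __name__ == "__main__":
    schedule = trial_schedule(grid_positions())
    print(len(schedule), "trials scheduled")
```

Keeping the per-cell tally alongside the aggregate makes it easy to see whether failures cluster in particular regions of the workspace.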
Simulation of Real-World Teleoperation Data in RAI
Real-world trajectories sampled from the flow matching backbone trained with human data, before LP-DS training.
Real-world trajectories sampled from the flow matching model steered with LP-DS. We observe an increase in success rate from 18/40 to 33/40.
To test whether our method generalizes beyond simple geometries, we deployed LP-DS on a distinct and more challenging physical task: hanging a mug by its handle onto a stationary mug holder.
Challenge: This fine manipulation task requires high-precision alignment and narrow-tolerance insertion. Such requirements present a significant challenge for frozen generative policies when faced with real-world execution errors.
Results: Across 20 independent trials, LP-DS steered the base policy to complete the insertion 17 times (17/20 success rate).
Simulation of Real-World Teleoperation Data in RAI for Mug Hanging Task
Real-world trajectories sampled from the flow matching backbone trained with human data, before LP-DS training.
Real-world trajectories sampled from the flow matching model steered with LP-DS. We observe an increase in success rate from 11/20 to 17/20.
We use the Avoiding environment [1] to show the effect of the trust-region parameter on action diversity and success rate.
Evaluation trajectories visualized at step 50000 for LP-DS and DSRL.
Evaluation metrics of LP-DS and DSRL on the Avoiding environment.
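We quantify action diversity with the Kozachenko–Leonenko k-nearest-neighbor entropy estimator. For readers unfamiliar with it, the sketch below implements the textbook form H ≈ ψ(N) - ψ(k) + log c_d + (d/N) Σ log ρ_i, where ρ_i is the distance from sample i to its k-th nearest neighbor and c_d is the volume of the d-dimensional unit ball. It uses a brute-force neighbor search and is an illustration rather than our evaluation code. The choice k = 3 and the small distance floor are assumptions.

```python
import math

EULER_GAMMA = 0.5772156649015329


def digamma_int(n):
    """Digamma function at a positive integer: psi(n) = -gamma + sum_{i<n} 1/i."""
    return -EULER_GAMMA + sum(1.0 / i for i in range(1, n))


def kl_entropy(samples, k=3):
    """Kozachenko-Leonenko differential entropy estimate in nats (brute force, O(N^2))."""
    n = len(samples)
    d = len(samples[0])
    # Log-volume of the d-dimensional unit ball: pi^(d/2) / Gamma(d/2 + 1).
    log_unit_ball = 0.5 * d * math.log(math.pi) - math.lgamma(0.5 * d + 1.0)
    log_dist_sum = 0.0
    for i, x in enumerate(samples):
        dists = sorted(math.dist(x, y) for j, y in enumerate(samples) if j != i)
        # Assumed floor so that duplicate samples do not produce log(0).
        log_dist_sum += math.log(max(dists[k - 1], 1e-12))
    return digamma_int(n) - digamma_int(k) + log_unit_ball + d * log_dist_sum / n


if __name__ == "__main__":
    import random
    rng = random.Random(0)
    actions = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(500)]
    print(f"estimated entropy: {kl_entropy(actions):.3f} nats")
```

A policy that collapses onto a single mode yields tightly clustered actions and therefore small neighbor distances, which lowers the estimate. A policy that keeps several modes spreads the samples and raises it.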
We ablate the trust-region size by running LP-DS in four environments with different trust-region sizes. We observe that the trust-region size does not require careful tuning: values > 0.1 lead to the best performance independent of the environment, and there is nothing special about choosing 0.35 vs. 0.5 vs. 0.66.
Reward/success-rate curves of LP-DS for different trust-region sizes on the Hopper, Square, Relocate, and Walker environments.
We ablate our choice of perturbing in noise space by instead applying the perturbation in action space (LP-DS-A).
Reward curves of LP-DS vs. LP-DS-A in the Walker environment.
LP-DS can work with both diffusion and flow matching backbones.
Reward curves of two different generative models (matched in parameter count and initial reward) on the Hopper environment.
Toy multi-goal navigation with symmetric rewards. Four equally optimal Gaussian reward peaks (red markers) define four target modes. We visualize evaluation rollouts from the frozen backbone and after adaptation using LP-DS with different trust-region bounds δ ∈ {0.01, 0.05, 0.1}, alongside DSRL [2] and DPPO [3].
Trajectory-level mode coverage in the symmetric multi-goal toy task. Each panel corresponds to one goal (bottom-left, top-right, top-left, bottom-right) and shows, as training proceeds, the fraction of evaluation trajectories (out of 1000 rollouts per evaluation) that reach that goal. Concentration of mass into a single panel indicates mode collapse, while sustained non-trivial mass across multiple panels indicates preserved multimodality.
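The per-goal fractions can be computed as in the sketch below. It assumes a rollout "reaches" a goal when its final position lies within `reach_radius` of that goal. The goal coordinates, the radius, and the function name are illustrative placeholders rather than the values used in the toy task.

```python
import math


def mode_coverage(final_positions, goals, reach_radius):
    """Fraction of rollouts whose final position lies within reach_radius of each goal."""
    if not final_positions:
        raise ValueError("need at least one rollout")
    counts = [0] * len(goals)
    for position in final_positions:
        dists = [math.dist(position, goal) for goal in goals]
        nearest = min(range(len(goals)), key=dists.__getitem__)
        # Rollouts that end far from every goal are not assigned to any mode.
        if dists[nearest] <= reach_radius:
            counts[nearest] += 1
    return [count / len(final_positions) for count in counts]


if __name__ == "__main__":
    # Placeholder goals, ordered bottom-left, top-right, top-left, bottom-right.
    goals = [(-1.0, -1.0), (1.0, 1.0), (-1.0, 1.0), (1.0, -1.0)]
    collapsed = [(-1.0, -1.0)] * 4
    print(mode_coverage(collapsed, goals, reach_radius=0.2))
```

A result such as all mass on one goal corresponds to mode collapse, while roughly equal fractions across the four goals correspond to preserved multimodality.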
References
[1] Jia, X., Blessing, D., Jiang, X., Reuss, M., Donat, A., Lioutikov, R., and Neumann, G. Towards diverse behaviors: A benchmark for imitation learning with human demonstrations, 2024.
[2] Wagenmaker, A., Nakamoto, M., Zhang, Y., Park, S., Yagoub, W., Nagabandi, A., Gupta, A., and Levine, S. Steering your diffusion policy with latent space reinforcement learning, 2025. URL https://arxiv.org/abs/2506.15799
[3] Ren, A. Z., Lidard, J., Ankile, L. L., Simeonov, A., Agrawal, P., Majumdar, A., Burchfiel, B., Dai, H., and Simchowitz, M. Diffusion policy policy optimization, 2024. URL https://arxiv.org/abs/2409.00588
[4] Toussaint, M. Robotic: A minimalistic C++ framework for robot control, 2026. URL https://github.com/MarcToussaint/robotic