From Noise to Control: Parameterized Diffusion Policies

ICML 2026

Authors

Renhao Zhang¹ Haotian Fu² Mingxi Jia² George Konidaris² Yilun Du³ Bruno Castro Da Silva¹

¹UMass Amherst ²Brown University ³Harvard University

Correspondence to: renhaozhang@cs.umass.edu

Abstract

We propose Parameterized Diffusion Policy (PDP), a framework that learns a diffusion policy parameterized in a smooth continuous space. By structuring a latent manifold such that distances between latent values reflect the semantic similarity of physical trajectories, we transform diffusion from a mechanism of stochastic diversity into a precise tool for behavior steering. Our approach also enables smooth interpolation between known strategies and efficient generalization to novel constraints without updating policy weights. We demonstrate that PDP significantly improves adaptation performance over a standard diffusion policy on complex multimodal benchmarks, both in simulation and on real-robot hardware, particularly in scenarios requiring the discovery of novel behaviors.

[Paper] [Code]

Methods

Parameterized Diffusion Policy (PDP) turns diffusion policy from a purely noise-driven generator into a controllable behavior model. The key idea is to condition the diffusion policy on a continuous behavior latent z, where each latent represents a different way of completing the same task.

We first train a trajectory encoder that maps demonstrations into a low-dimensional latent space. This space is shaped so that trajectories that look physically similar are placed close together, while different execution strategies are separated. We use a soft-DTW-based geometry loss to align distances in latent space with similarities between end-effector trajectories.
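To make the geometry idea concrete, the sketch below computes a soft-DTW discrepancy between two end-effector trajectories and a pairwise loss that pulls latent distances toward it. This is only an illustration under stated assumptions: the function names, the use of the soft-DTW divergence as the distance target, the squared-error form of the loss, and the `scale` and `gamma` parameters are all hypothetical choices, not the exact loss used to train PDP.

```python
import math

def softmin(values, gamma):
    """Smooth minimum: -gamma * log(sum(exp(-v / gamma))), computed stably."""
    m = min(values)
    return m - gamma * math.log(sum(math.exp(-(v - m) / gamma) for v in values))

def soft_dtw(traj_a, traj_b, gamma=1.0):
    """Soft-DTW between two trajectories given as sequences of points."""
    n, m = len(traj_a), len(traj_b)
    inf = float("inf")
    # R[i][j] holds the soft alignment cost of the first i and j points.
    R = [[inf] * (m + 1) for _ in range(n + 1)]
    R[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = sum((p - q) ** 2 for p, q in zip(traj_a[i - 1], traj_b[j - 1]))
            R[i][j] = cost + softmin([R[i - 1][j], R[i][j - 1], R[i - 1][j - 1]], gamma)
    return R[n][m]

def soft_dtw_divergence(traj_a, traj_b, gamma=1.0):
    """Debiased soft-DTW, which is zero when both trajectories are identical."""
    return soft_dtw(traj_a, traj_b, gamma) - 0.5 * (
        soft_dtw(traj_a, traj_a, gamma) + soft_dtw(traj_b, traj_b, gamma)
    )

def geometry_loss(latents, trajectories, scale=1.0, gamma=1.0):
    """Assumed loss form: mean squared gap between latent distance and scaled divergence."""
    total, pairs = 0.0, 0
    for i in range(len(latents)):
        for j in range(i + 1, len(latents)):
            latent_dist = math.dist(latents[i], latents[j])
            target = scale * soft_dtw_divergence(trajectories[i], trajectories[j], gamma)
            total += (latent_dist - target) ** 2
            pairs += 1
    return total / max(pairs, 1)
```

In this sketch, two demonstrations of the same mode would yield a small divergence and therefore be pushed toward nearby latents, while demonstrations of different modes would be pushed apart.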

The policy is then trained as a latent-conditioned diffusion policy. Given the current robot observation and a selected latent z, the denoiser generates an action trajectory corresponding to that behavior style. In practice, z acts as a control knob: changing z changes how the robot executes the task, for example how it approaches and pulls a drawer.

At deployment time, when new obstacles block previously demonstrated strategies, PDP adapts by optimizing only the latent z while keeping the policy weights fixed. This allows the robot to quickly select a feasible demonstrated behavior or interpolate toward a new one from a single demonstration. In contrast, standard diffusion policy relies on random noise or high-dimensional noise-space optimization, which makes behavior steering less stable under constraint shifts.

Real Robot Experimental Results

1. Experimental Setup

We evaluate Parameterized Diffusion Policies (PDP) on a real-world manipulation task using a Franka Emika Panda robot arm. The task is OpenDrawer, which requires the robot to reach a drawer handle, establish contact, and pull the drawer open to a target configuration. The physical setup mirrors the simulation task at a high level but introduces substantially greater execution variability due to sensing noise, unmodeled dynamics, and human teleoperation.

1.1 Multimodal Trajectory Collection

To induce structured multimodality, we explicitly design six distinct execution modes for reaching and pulling the drawer handle (left image below). These modes differ in how the end-effector (EE) approaches the handle prior to contact. Four are lateral modes: the robot approaches from the left or right side of the handle, with two variants per side whose trajectories stay closer to or farther from the straight line between the start and end positions. The other two are vertical modes, where the end-effector reaches the handle from a higher or lower elevation relative to the handle center. Each mode therefore corresponds to a distinct geometric strategy for handle acquisition and pulling. For each mode, we collect 15 demonstration trajectories via human teleoperation using a SpaceMouse. Due to teleoperation imprecision and physical interaction effects, the end-effector trajectories are noticeably noisy even within the same mode (right image below).

1.2 Evaluation Methods

We evaluate PDP and standard Diffusion Policy (DP) under a set of controlled environment variants designed to isolate behavior-side adaptation under constraint-induced shifts. Both policies are tested under four conditions: the Original Scene (first image below from left), which matches the training environment and serves as a baseline for multimodal imitation fidelity; Scenes 1 and 2 (second and third images below from left), which introduce constraints that invalidate a subset of the demonstrated strategies while leaving exactly one training mode feasible; and Scene 3 (fourth image below from left), a zero-mode-feasible setting where all training modes fail and success requires discovering a new trajectory guided by a single new demonstration.

2. Evaluation Rollout Results

We evaluate policies through repeated closed-loop rollouts on the real robot, focusing on how behaviors are instantiated and adapted at execution time rather than on training-time differences. In the original scene, PDP is executed by conditioning the policy on a representative behavior latent z computed as the cluster center of demonstrations belonging to the same execution mode. This produces deterministic, mode-consistent rollouts once z is fixed. In contrast, standard DP generates rollouts by sampling from the diffusion noise space, relying on stochastic denoising to resolve multimodality at inference time.
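The representative latent for each mode can be pictured as a per-mode average of encoded demonstrations. The sketch below assumes the encoder outputs are available as plain lists of floats and that each demonstration carries a mode label; the function name `mode_centroids` and the example labels are hypothetical.

```python
def mode_centroids(latents, mode_labels):
    """Average the latent vectors of all demonstrations that share a mode label."""
    sums, counts = {}, {}
    for z, mode in zip(latents, mode_labels):
        if mode not in sums:
            sums[mode] = [0.0] * len(z)
            counts[mode] = 0
        sums[mode] = [s + v for s, v in zip(sums[mode], z)]
        counts[mode] += 1
    return {mode: [s / counts[mode] for s in sums[mode]] for mode in sums}

# Example with made-up 2-D latents for two modes.
centers = mode_centroids(
    [[0.0, 0.0], [2.0, 2.0], [10.0, 0.0]],
    ["left_near", "left_near", "top"],
)
```

Conditioning the policy on one of these centers then fixes the mode for the whole rollout.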

In constraint-shifted scenes, where previously demonstrated strategies become infeasible, both policies are evaluated under test-time adaptation with frozen model parameters. PDP adapts by optimizing only the low-dimensional behavior latent z via gradient backpropagation through the denoising process, using a single new demonstration as guidance. DP instead performs gradient-based optimization in its high-dimensional noise space. The resulting rollouts reveal qualitative differences in stability and consistency, highlighting how explicit behavior parameterization enables reliable adaptation under real-world constraints.
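The latent-only adaptation loop can be sketched on a toy problem. Everything here is a simplified stand-in: `frozen_policy` is a hypothetical fixed function replacing the trained denoiser, the loss is a plain mean squared error to one demonstration, and central finite differences replace backpropagation through the denoising process so that the example runs with the standard library alone.

```python
import math

def frozen_policy(z, steps=20):
    """Hypothetical frozen policy: maps a 2-D latent to a 3-D path.
    z[0] bends the path sideways and z[1] bends it vertically."""
    path = []
    for k in range(steps):
        t = k / (steps - 1)
        bump = math.sin(math.pi * t)
        path.append((t, z[0] * bump, z[1] * bump))
    return path

def imitation_loss(z, demo):
    """Mean squared distance between the generated path and the demonstration."""
    path = frozen_policy(z, steps=len(demo))
    total = sum(
        sum((p - d) ** 2 for p, d in zip(point, demo_point))
        for point, demo_point in zip(path, demo)
    )
    return total / len(demo)

def adapt_latent(z_init, demo, lr=0.5, iters=200, eps=1e-4):
    """Gradient descent on z only; the policy itself is never modified."""
    z = list(z_init)
    for _ in range(iters):
        grad = []
        for i in range(len(z)):
            z_plus, z_minus = z[:], z[:]
            z_plus[i] += eps
            z_minus[i] -= eps
            # Finite-difference estimate standing in for a backpropagated gradient.
            grad.append(
                (imitation_loss(z_plus, demo) - imitation_loss(z_minus, demo)) / (2 * eps)
            )
        z = [zi - lr * gi for zi, gi in zip(z, grad)]
    return z

# A single new demonstration, generated here from a latent the optimizer does not see.
new_demo = frozen_policy([0.3, -0.2])
z_adapted = adapt_latent([0.0, 0.0], new_demo)
```

The search space in this sketch has only two dimensions, which mirrors the contrast drawn above: optimizing a low-dimensional z is a much smaller problem than optimizing the full noise tensor of a diffusion policy.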

2.1 Performance in Original Scene

PDP's Performance (5/5)

DP's Performance (5/5)

2.2 Adaptation to Constraint-Induced Shifts

Scene 1

PDP's Performance (5/5)

DP's Performance (3/5)

Scene 2

PDP's Performance (5/5)

DP's Performance (1/5)

Scene 3

PDP's Performance (5/5)

DP's Performance (2/5)

2.3 Generalization as Navigating in Behavior Space

Beyond adapting to specific constraint shifts, we evaluate whether the learned behavior representation supports generalization beyond the discrete strategies present in the training data. PDP enables this by operating in a continuous behavior latent space, where nearby latent values correspond to semantically similar physical trajectories. By interpolating between behavior latents derived from different demonstration clusters, PDP generates smooth, physically plausible rollouts that were never explicitly demonstrated.
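Interpolating between two behavior latents can be as simple as the sketch below. It assumes straight-line interpolation in latent space and list-of-float latents; the function name and the example values are hypothetical, and the actual interpolation scheme may differ.

```python
def interpolate_latents(z_a, z_b, num=5):
    """Return `num` evenly spaced latents on the segment from z_a to z_b (num >= 2)."""
    latents = []
    for k in range(num):
        alpha = k / (num - 1)
        latents.append([(1 - alpha) * a + alpha * b for a, b in zip(z_a, z_b)])
    return latents

# Example: three latents between two made-up cluster centers.
blend = interpolate_latents([0.0, 0.0], [1.0, 2.0], num=3)
```

Each intermediate latent would be passed to the frozen policy as its conditioning input, producing one rollout per blend value.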

On real hardware, these interpolated behaviors remain stable despite execution noise and contact uncertainty, indicating that the policy does not merely memorize discrete modes but instead learns a navigable behavior manifold. This allows PDP to synthesize new approaches that respect task constraints while remaining consistent with the underlying structure of the demonstrated behaviors.
