Hybrid Consistency Policy: Decoupling Multi-Modal Diversity and Real-Time Efficiency in Robotic Manipulation
Anonymous
In visuomotor policy learning, diffusion-based imitation learning has become widely adopted for its ability to capture diverse behaviors. However, approaches built on ordinary differential equation (ODE) and stochastic differential equation (SDE) denoising processes struggle to jointly achieve fast sampling and strong multi-modality. To address these challenges, we propose the Hybrid Consistency Policy (HCP). HCP runs a short stochastic prefix up to an adaptive switch time, then applies a one-step consistency jump to produce the final action. To align this one-jump generation, HCP performs time-varying consistency distillation that combines a trajectory-consistency objective, which keeps neighboring predictions coherent, with a denoising-matching objective, which improves local fidelity. In both simulation and on a real robot, HCP with 25 SDE steps plus one jump approaches the 80-step DDPM teacher in accuracy and mode coverage while significantly reducing latency. These results show that multi-modality does not require slow inference: an adaptive switch time decouples mode retention from speed, yielding a practical accuracy–efficiency trade-off for robot policies.
Hybrid Consistency Policy.
SDE models capture multi-modal behaviors but sample slowly, while ODE models are fast yet prone to mode collapse. HCP runs a short stochastic SDE prefix in the high-noise region to form branches, then at an adaptive switch time performs a one-step consistency jump along the probability-flow ODE to the final action.
Diffusion-based policies have recently emerged as a powerful paradigm for visuomotor control and sequential decision making, offering strong stability and an expressive way to model complex, multi-modal action distributions. While SDE-based methods are effective at preserving diverse behaviors, they incur substantial inference latency due to many iterative denoising steps. In contrast, ODE-based methods reduce sampling cost but can bias toward a dominant mode in multi-modal control settings, compromising diversity and robustness. This speed–diversity tension is particularly pronounced in real-world robotics, where ambiguous goals, stochastic environments, and human preferences routinely induce multi-modal solution manifolds.
To address these issues, we propose the Hybrid Consistency Policy (HCP), which preserves and promotes multiple modes during distillation. Our method introduces a principled approach for selecting distillation time steps and incorporates explicit mechanisms to encourage mode bifurcation throughout the distillation process.
Overview of HCP.
We first define a hybrid score matching model that parameterizes both dynamics under a shared time and noise schedule, providing a continuous bridge from the SDE regime to the ODE regime.
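As a minimal sketch of this shared parameterization (assuming a VP-style schedule and an Euler discretization; beta_min, beta_max, and score_net are illustrative names, not the paper's exact choices), both regimes can be driven by a single score network:

import torch

def beta(t, beta_min=0.1, beta_max=20.0):
    """VP-style noise schedule; beta_min/beta_max are assumed values."""
    return beta_min + t * (beta_max - beta_min)

def sde_step(x, t, dt, score_net, obs):
    """One Euler-Maruyama step of the reverse SDE (stochastic regime)."""
    b = beta(t)
    drift = -0.5 * b * x - b * score_net(x, t, obs)   # reverse-time drift
    noise = torch.randn_like(x) * (b * dt) ** 0.5     # injected stochasticity
    return x - drift * dt + noise                     # integrate backward by dt

def ode_step(x, t, dt, score_net, obs):
    """One Euler step of the probability-flow ODE (deterministic regime),
    sharing the same schedule and score network as the SDE step."""
    b = beta(t)
    drift = -0.5 * b * x - 0.5 * b * score_net(x, t, obs)
    return x - drift * dt

Because both steps read the same schedule and the same score network, the sampler can switch from the SDE regime to the ODE regime at any intermediate time without retraining.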
We then carry out time-varying consistency distillation, which maps an intermediate noisy state at the switch time to the clean action.
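A minimal sketch of the two distillation terms follows, assuming a student that predicts the clean action from a noisy state, an EMA copy of the student as the consistency target, and a frozen teacher that takes one probability-flow ODE step; all names and the fixed weight lam are illustrative (HCP's weighting varies with the timestep):

import torch
import torch.nn.functional as F

def distill_loss(student, ema_student, teacher_step, x_t, t, t_next,
                 obs, action_gt, lam=1.0):
    """Illustrative consistency-distillation objective.

    Trajectory consistency: the student's clean-action prediction at t
    should agree with the EMA student's prediction at the adjacent time
    t_next, reached by one teacher PF-ODE step, keeping neighboring
    predictions coherent. Denoising matching: the student's prediction
    should also match the ground-truth action, improving local fidelity.
    """
    x_next = teacher_step(x_t, t, t_next, obs)      # one teacher ODE step
    pred_t = student(x_t, t, obs)
    with torch.no_grad():
        target = ema_student(x_next, t_next, obs)   # stop-gradient target
    l_traj = F.mse_loss(pred_t, target)             # trajectory consistency
    l_denoise = F.mse_loss(pred_t, action_gt)       # denoising matching
    return l_traj + lam * l_denoise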
Finally, we identify the switch time that marks the handoff from the stochastic prefix to the deterministic jump, so that mode selection occurs before the jump.
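Putting the pieces together, a hybrid sampler might look like the sketch below (reusing sde_step from the schedule sketch above; the linear time grid and the default t_switch are assumptions, while n_sde_steps=25 mirrors the 25-step-plus-one-jump setting in the text):

import torch

@torch.no_grad()
def hcp_sample(score_net, consistency_net, obs, action_shape,
               n_sde_steps=25, t_switch=0.3):
    """Stochastic SDE prefix for mode selection, then one consistency jump."""
    x = torch.randn(action_shape)                  # start from pure noise at t = 1
    ts = torch.linspace(1.0, t_switch, n_sde_steps + 1)
    for i in range(n_sde_steps):
        t, dt = ts[i], ts[i] - ts[i + 1]
        x = sde_step(x, t, dt, score_net, obs)     # stochastic prefix (branching)
    return consistency_net(x, ts[-1], obs)         # one-step jump to the clean action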
A 7-DoF collaborative arm (Flexiv Rizon 4) operates in a fixed workspace with calibrated TCP pose.
A wrist camera (RealSense D415) and a third-person camera (RealSense D435) provide multi-view observations.
All methods take 2 consecutive observations as input and output an 8-step end-effector (EE) pose sequence.
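For concreteness, the sketch below spells out this input/output contract; the image resolution and the 7-D pose layout (position plus quaternion) are assumptions, not values given in the text:

import torch

# Illustrative I/O contract shared by all compared methods: two consecutive
# multi-view observations in, an 8-step end-effector (EE) pose chunk out.
batch, obs_horizon, act_horizon, pose_dim = 1, 2, 8, 7
obs = {
    "wrist_rgb": torch.zeros(batch, obs_horizon, 3, 240, 320),  # RealSense D415
    "third_rgb": torch.zeros(batch, obs_horizon, 3, 240, 320),  # RealSense D435
    "tcp_pose":  torch.zeros(batch, obs_horizon, pose_dim),     # calibrated TCP pose
}
# A policy consuming `obs` would emit an action chunk of shape:
expected_action_shape = (batch, act_horizon, pose_dim)          # 8 future EE poses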
Figure panels: Avoiding (Path-1, Path-2, Path-3); Pick and Place (Left, Right); Push-T (Left).
Ablation on the switch time.
Switch timing is critical: jumping too early cuts inference to 6 steps (~0.04 s) but markedly hurts accuracy and triggers mode collapse.
With a near-optimal or later switch, HCP matches teacher-level success and preserves stable branching with obstacle-aware paths.
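A hypothetical harness for this sweep might look as follows (reusing hcp_sample from the sampler sketch above, with toy stand-ins for the networks and the success metric; none of this is the paper's actual evaluation code):

import time
import torch

# Toy stand-ins so the sweep runs end to end; real networks and a
# task-specific success metric would replace these.
score_net = lambda x, t, obs: -x
consistency_net = lambda x, t, obs: torch.zeros_like(x)
evaluate_episode = lambda action: float(action.abs().mean() < 1.0)

for t_switch in (0.1, 0.3, 0.5, 0.7):
    start = time.perf_counter()
    action = hcp_sample(score_net, consistency_net, obs={},
                        action_shape=(1, 8, 7),
                        n_sde_steps=25, t_switch=t_switch)
    latency_ms = (time.perf_counter() - start) * 1e3
    print(f"t_switch={t_switch:.1f}  latency={latency_ms:.1f} ms  "
          f"success={evaluate_episode(action):.2f}")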