Flow-matching-based policies have recently emerged as a promising approach for learning-based robot manipulation, offering significant acceleration in action sampling compared to diffusion-based policies. However, conventional flow-matching methods struggle with multi-modality, often collapsing to averaged or ambiguous behaviors in complex manipulation tasks. To address this, we propose the Variational Flow-Matching Policy (VFP), which introduces a variational latent prior for mode-aware action generation and effectively captures both task-level and trajectory-level multi-modality. VFP further incorporates Kantorovich Optimal Transport (K-OT) for distribution-level alignment and utilizes a Mixture-of-Experts (MoE) decoder for mode specialization and efficient inference. We comprehensively evaluate VFP on 41 simulated tasks and 3 real-robot tasks, demonstrating its effectiveness and sampling efficiency in both simulated and real-world settings. Results show that VFP achieves a 49% relative improvement in task success rate over standard flow-based baselines in simulation, and further outperforms them on real-robot tasks, while still maintaining fast inference and a compact model size.
Overview of Variational Flow Policy (VFP).
(A) Model Pipeline: The model consists of a latent-conditioned Mixture-of-Experts (MoE) flow-matching network. A prior network generates latent variables from input states, which guide the MoE decoder to predict actions efficiently. During training, a posterior encoder is used for variational learning; a minimal sketch of this pipeline follows the caption.
(B) Latent-Instructed Mode Identification: Visualization of how VFP enables mode-specific behavior. Compared to the collapsed predictions of vanilla flow matching, our model captures distinct modes conditioned on latent variables.
(C) Latent Shaping and Mode Decoupling via OT: Optimal Transport regularization improves mode separation in the latent space, aligning latent modes with distinct action modes.
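To make panels (A) and (C) concrete, below is a minimal PyTorch sketch of one training step under our own assumptions. All module names (PriorNet, PosteriorNet, MoEVelocityField, sinkhorn_cost), layer sizes, and loss weights (beta, gamma) are illustrative, and the exact pairing used by the K-OT term is our guess, not the paper's reference implementation.

```python
# Hypothetical sketch of a VFP training step. Names, sizes, and loss
# weights are illustrative assumptions, not the released implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PriorNet(nn.Module):
    """p(z | s): predicts a latent mode distribution from the state alone."""
    def __init__(self, state_dim, latent_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 2 * latent_dim))

    def forward(self, s):
        return self.net(s).chunk(2, dim=-1)  # (mu, logvar)

class PosteriorNet(nn.Module):
    """q(z | s, a): training-time encoder that also sees the action."""
    def __init__(self, state_dim, action_dim, latent_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim + action_dim, hidden),
                                 nn.ReLU(), nn.Linear(hidden, 2 * latent_dim))

    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1)).chunk(2, dim=-1)

class MoEVelocityField(nn.Module):
    """Latent-conditioned MoE velocity field v(x_t, t | s, z); the latent
    gates the experts, so each expert can specialize to one action mode."""
    def __init__(self, state_dim, action_dim, latent_dim, n_experts=4, hidden=256):
        super().__init__()
        in_dim = action_dim + 1 + state_dim + latent_dim
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                          nn.Linear(hidden, action_dim))
            for _ in range(n_experts))
        self.gate = nn.Linear(latent_dim, n_experts)

    def forward(self, x_t, t, s, z):
        h = torch.cat([x_t, t, s, z], dim=-1)
        w = F.softmax(self.gate(z), dim=-1)                    # (B, E)
        outs = torch.stack([e(h) for e in self.experts], -1)   # (B, A, E)
        return (outs * w.unsqueeze(1)).sum(-1)

def sinkhorn_cost(x, y, eps=0.1, iters=50):
    """Entropic Kantorovich OT cost between two equal-size point clouds.
    (A log-domain Sinkhorn is more numerically stable in practice.)"""
    C = torch.cdist(x, y) ** 2
    K = torch.exp(-C / eps)
    u = torch.full((x.size(0),), 1.0 / x.size(0), device=x.device)
    v = torch.full((y.size(0),), 1.0 / y.size(0), device=y.device)
    a, b = u.clone(), v.clone()
    for _ in range(iters):
        a = u / (K @ b)
        b = v / (K.t() @ a)
    P = a.unsqueeze(1) * K * b.unsqueeze(0)  # transport plan
    return (P * C).sum()

def training_step(prior, posterior, field, s, a, beta=1e-3, gamma=1e-2):
    mu_q, lv_q = posterior(s, a)
    mu_p, lv_p = prior(s)
    z = mu_q + torch.randn_like(mu_q) * (0.5 * lv_q).exp()  # reparameterize

    # Conditional flow matching on the straight path x_t = (1 - t) x0 + t a.
    x0 = torch.randn_like(a)
    t = torch.rand(a.size(0), 1, device=a.device)
    x_t = (1 - t) * x0 + t * a
    fm = F.mse_loss(field(x_t, t, s, z), a - x0)

    # KL(q || p) between diagonal Gaussians, plus an OT term aligning prior
    # samples with posterior samples at the distribution level.
    kl = 0.5 * (lv_p - lv_q + (lv_q.exp() + (mu_q - mu_p) ** 2) / lv_p.exp()
                - 1).sum(-1).mean()
    z_p = mu_p + torch.randn_like(mu_p) * (0.5 * lv_p).exp()
    return fm + beta * kl + gamma * sinkhorn_cost(z_p, z)
```

At test time only the prior and the MoE field are needed: sample z from the prior once, then integrate the velocity field with a few Euler steps, which is what keeps inference fast.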
Variational Flow Policy outperforms prior state-of-the-art flow-based models on 41 tasks from 4 benchmarks, with an average relative improvement of 49%.
Franka Kitchen¹
Avoiding²
Pen³
Assembly⁴
These are tasks from standard simulation benchmarks used in our work.
We are grateful to the authors of these projects for open-sourcing their simulation environments:
¹ Relay Policy Learning ² D3IL ³ Adroit ⁴ Meta-World
VFP in Avoiding
Our model is capable of capturing multi-modality in tasks. Here are some videos of VFP on the Avoiding task from D3IL.
Failure Case
Note: We observe some instability in this task, which causes the failure case shown here. It is primarily due to the default action horizon of 1 in D3IL, which forces the policy to re-plan at every step. In the Franka Kitchen environment shown below, we observe that increasing the action horizon results in much more stable behavior for VFP.
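For clarity, the sketch below shows the receding-horizon execution we are describing; `policy.predict` and the gym-style `env` API are hypothetical stand-ins, not our actual evaluation code. With action_horizon=1 (the D3IL default) the policy re-samples, and may re-choose a mode, at every step; a larger horizon commits to one mode for several steps.

```python
# Hypothetical rollout loop: the policy predicts a chunk of H actions and the
# agent executes several of them before re-planning. Horizon 1 re-plans every
# step; a larger horizon commits to a mode and yields smoother motion.
def rollout(policy, env, action_horizon=8, max_steps=400):
    obs = env.reset()
    done, steps = False, 0
    while not done and steps < max_steps:
        action_chunk = policy.predict(obs)   # shape (H, action_dim)
        for action in action_chunk[:action_horizon]:
            obs, reward, done, info = env.step(action)
            steps += 1
            if done or steps >= max_steps:
                break
```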
Vanilla FM in Avoiding
In contrast to VFP, vanilla flow-based models converge to an averaged path, as the toy example below illustrates. Here, we use FlowPolicy¹ as our baseline.
FlowPolicy
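The averaging effect has a simple intuition. An MSE-trained velocity field regresses to the conditional mean of the target velocity, and early in the flow x_t carries almost no information about which mode will be chosen, so the field predicts the average velocity over the modes. The toy calculation below (our own example, not from the paper) assumes a 1-D action with two equally likely modes at +1 and -1 for the same state.

```python
# Toy illustration of mode averaging in vanilla flow matching (our example):
# two equally likely action modes at +1 and -1 for the same state.
import torch

n = 10_000
x0 = torch.randn(n, 1)                           # flow source samples
a = torch.randint(0, 2, (n, 1)).float() * 2 - 1  # actions: +1 or -1
v_target = a - x0                                # straight-path velocity

# Averaged over both modes the target velocity is ~0, so an MSE-optimal
# field that cannot tell the modes apart steers toward a ~ 0, an
# "averaged" action between the two true modes.
print(v_target.mean())  # close to 0
```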
In Franka Kitchen, we found that our model handles multi-modality well, resulting in very stable and smooth movements compared to the vanilla flow-based model FlowPolicy¹ (FP).
VFP
FP
VFP
FP - Failure - 1
VFP
FP - Failure - 2
In the videos, we observe that FlowPolicy exhibits shaky movements. This instability likely arises from the model "wandering" among multiple modes or tasks (an effect of mode mixing), which leads to the following failure cases:
FP - Failure - 1: Indecisive switching among multiple targets leads to inaction.
FP - Failure - 2: Excessive shaking during kettle grasping results in dropping the object.
To show the robustness of our method in real-world environments, we conduct experiments on 3 real-world tasks: Avoiding, Tubes Placement, and Cups Nesting.
On all three tasks, our method substantially outperforms FlowPolicy and Diffusion Policy. This underscores the importance of combining multi-modal capability (for accurate control) with fast inference in real-world environments.
In the real world, VFP maintains its multi-modal ability.
Our method achieves accurate control and smooth motion in these two tasks, owing to its multi-modal capability and fast inference.