Flow-matching-based policies have recently emerged as a promising approach for learning-based robot manipulation, offering significant acceleration in action sampling compared to diffusion-based policies. However, conventional flow-matching methods struggle with multi-modality, often collapsing to averaged or ambiguous behaviors in complex manipulation tasks. To address this, we propose the Variational Flow-Matching Policy (VFP), which introduces a variational latent prior for mode-aware action generation and effectively captures both task-level and trajectory-level multi-modality. VFP further incorporates Kantorovich Optimal Transport (K-OT) for distribution-level alignment and utilizes a Mixture-of-Experts (MoE) decoder for mode specialization and efficient inference. We comprehensively evaluate VFP on 41 tasks across four benchmark environments, demonstrating its effectiveness and sampling efficiency in both task and path multi-modality settings. Results show that VFP achieves a 49% relative improvement in task success rate over standard flow-based baselines, while maintaining fast inference and compact model size.
Overview of Variational Flow Policy (VFP).
(A) Model Pipeline: The model consists of a latent-conditioned Mixture-of-Experts (MoE) flow matching network. A prior network generates latent variables from input states, which guide the MoE decoder to predict actions efficiently. During training, a posterior encoder is used for variational learning.
(B) Latent-Instructed Mode Identification: Visualization of how VFP enables mode-specific behavior. Compared to the collapsed predictions of vanilla flow matching, our model captures distinct modes conditioned on latent variables.
(C) Latent Shaping and Mode Decoupling via OT: Optimal Transport regularization improves mode separation in the latent space, aligning latent modes with distinct action modes.
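To make the pipeline in panel (A) concrete, here is a minimal PyTorch sketch of a latent-conditioned Mixture-of-Experts flow-matching step. It is an illustrative reconstruction under stated assumptions, not the released implementation: the module names (PriorNet, MoEVelocityNet), layer sizes, gating scheme, and the use of a prior sample during training (the paper uses a posterior encoder for variational learning) are all simplifications.

```python
# Illustrative sketch only; names, sizes, and routing are assumptions, not the authors' code.
import torch
import torch.nn as nn

class PriorNet(nn.Module):
    """Maps an observation to the mean and log-variance of a Gaussian latent prior."""
    def __init__(self, obs_dim, latent_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(),
                                 nn.Linear(128, 2 * latent_dim))

    def forward(self, obs):
        mu, log_var = self.net(obs).chunk(2, dim=-1)
        return mu, log_var

class MoEVelocityNet(nn.Module):
    """Mixture-of-Experts velocity field v(x_t, t | obs, z); the latent z routes the experts."""
    def __init__(self, act_dim, obs_dim, latent_dim, n_experts=4):
        super().__init__()
        in_dim = act_dim + obs_dim + latent_dim + 1  # +1 for the flow time t
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(), nn.Linear(256, act_dim))
            for _ in range(n_experts)])
        self.gate = nn.Linear(latent_dim, n_experts)

    def forward(self, x_t, t, obs, z):
        h = torch.cat([x_t, t, obs, z], dim=-1)
        weights = torch.softmax(self.gate(z), dim=-1)              # (B, E) expert weights
        outs = torch.stack([e(h) for e in self.experts], dim=-1)   # (B, act_dim, E)
        return (outs * weights.unsqueeze(1)).sum(dim=-1)

def flow_matching_loss(velocity_net, prior_net, obs, actions):
    """Conditional flow-matching loss on the straight path x_t = (1 - t) * x0 + t * x1.
    For brevity, z is sampled from the prior; the paper trains with a posterior encoder."""
    mu, log_var = prior_net(obs)
    z = mu + torch.randn_like(mu) * (0.5 * log_var).exp()    # reparameterized latent sample
    x0 = torch.randn_like(actions)                            # source (noise) sample
    t = torch.rand(actions.shape[0], 1)
    x_t = (1 - t) * x0 + t * actions
    target_v = actions - x0                                    # velocity of the straight path
    pred_v = velocity_net(x_t, t, obs, z)
    return ((pred_v - target_v) ** 2).mean()

@torch.no_grad()
def sample_actions(velocity_net, prior_net, obs, act_dim, n_steps=10):
    """Generates actions by Euler integration of the learned velocity field from noise."""
    mu, log_var = prior_net(obs)
    z = mu + torch.randn_like(mu) * (0.5 * log_var).exp()     # one latent draw selects one mode
    x = torch.randn(obs.shape[0], act_dim)
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = torch.full((obs.shape[0], 1), i * dt)
        x = x + dt * velocity_net(x, t, obs, z)
    return x
```

Panel (C) describes the OT regularization only at a high level. As one plausible instantiation (an assumption; the paper's exact K-OT objective may differ), a Sinkhorn approximation of the Kantorovich OT cost between two sample batches could be added to the training loss to align latent modes with action modes:

```python
import math
import torch

def sinkhorn_ot(x, y, eps=0.05, n_iters=100):
    """Entropic (Sinkhorn) approximation of the Kantorovich OT cost between
    uniform empirical measures on x (N, d) and y (M, d). Illustrative only."""
    C = torch.cdist(x, y, p=2) ** 2              # (N, M) squared-Euclidean cost matrix
    N, M = C.shape
    log_mu = torch.full((N, 1), -math.log(N), device=C.device)
    log_nu = torch.full((1, M), -math.log(M), device=C.device)
    f = torch.zeros(N, 1, device=C.device)       # dual potentials
    g = torch.zeros(1, M, device=C.device)
    for _ in range(n_iters):                     # log-domain Sinkhorn iterations
        f = -eps * torch.logsumexp((g - C) / eps + log_nu, dim=1, keepdim=True)
        g = -eps * torch.logsumexp((f - C) / eps + log_mu, dim=0, keepdim=True)
    plan = torch.exp((f + g - C) / eps + log_mu + log_nu)
    return (plan * C).sum()                      # expected transport cost under the plan
```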
Variational Flow Policy outperforms the prior state-of-the-art flow-based model on 41 tasks from 4 benchmarks, with an average improvement of 49%.
Franka Kitchen¹
Avoiding²
Pen³
Assembly⁴
These are tasks from standard simulation benchmarks used in our work.
We are grateful to the authors of these projects for open-sourcing their simulation environments:
¹ Relay Policy Learning  ² D3IL  ³ Adroit  ⁴ Meta-World
VFP in Avoiding
Our model is capable of capturing task-level multi-modality. Here are some videos of VFP on the Avoiding task from D3IL.
Failure Case
Note: We observe some instability in the model's behavior, which causes this failure case. It stems primarily from the default action horizon of 1 in D3IL, which leads to frequent switching between randomly selected modes. In the Franka Kitchen environment shown below, increasing the action horizon results in much more stable behavior for VFP.
Vanilla FM in Avoiding
In contrast to VFP, vanilla flow-based models converge to an averaged path. Here, we use FlowPolicy¹ as our baseline.
FlowPolicy
In Franka Kitchen, we find that our model handles multi-modality well, producing very stable and smooth movements compared to the vanilla flow-based model FlowPolicy¹.
VFP
FP
VFP
FP
VFP
FP - Failure - 1
VFP - Failure
FP - Failure - 2
In the videos, we observe that FlowPolicy exhibits shaky movements. This instability likely arises from the model "wandering" among multiple modes/tasks, an effect of mode mixing, which leads to the following failure cases:
FP - Failure - 1: Indecisive switching among multiple targets leads to inaction.
FP - Failure - 2: Excessive shaking during kettle grasping results in dropping the object.