Flow-matching-based policies have recently emerged as a promising approach for learning-based robot manipulation, offering significant acceleration in action sampling compared to diffusion-based policies. However, conventional flow-matching methods struggle with multi-modality, often collapsing to averaged or ambiguous behaviors in complex manipulation tasks. To address this, we propose the Variational Flow-Matching Policy (VFP), which introduces a variational latent prior for mode-aware action generation and effectively captures both task-level and trajectory-level multi-modality. VFP further incorporates Kantorovich Optimal Transport (K-OT) for distribution-level alignment and utilizes a Mixture-of-Experts (MoE) decoder for mode specialization and efficient inference. We comprehensively evaluate VFP on 41 simulated tasks and 3 real-robot tasks, demonstrating its effectiveness and sampling efficiency in both simulated and real-world settings. Results show that VFP achieves a 49% relative improvement in task success rate over standard flow-based baselines in simulation, and further outperforms them on real-robot tasks, while still maintaining fast inference and a compact model size.
Overview of Variational Flow Policy (VFP).
(A) Model Pipeline: The model consists of a latent-conditioned Mixture-of-Experts (MoE) flow-matching network. A prior network generates latent variables from input states, which guide the MoE decoder to predict actions efficiently. During training, a posterior encoder is used for variational learning; a minimal sketch of this pipeline follows the caption.
(B) Latent-Instructed Mode Identification: Visualization of how VFP enables mode-specific behavior. Compared to the collapsed predictions of vanilla flow matching, our model captures distinct modes conditioned on latent variables.
(C) Latent Shaping and Mode Decoupling via OT: Optimal Transport regularization improves mode separation in the latent space, aligning latent modes with distinct action modes.
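To make panels (A) and (C) concrete, below is a minimal PyTorch sketch of one training step under our own assumptions. All module names (PriorNet, PosteriorNet, MoEVelocityField, sinkhorn_cost), layer sizes, and loss weights (beta, gamma) are illustrative, and the exact pairing used by the K-OT term is our guess, not the paper's reference implementation.

```python
# Hypothetical sketch of a VFP training step. Names, sizes, and loss
# weights are illustrative assumptions, not the released implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PriorNet(nn.Module):
    """p(z | s): predicts a latent mode distribution from the state alone."""
    def __init__(self, state_dim, latent_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 2 * latent_dim))

    def forward(self, s):
        return self.net(s).chunk(2, dim=-1)  # (mu, logvar)

class PosteriorNet(nn.Module):
    """q(z | s, a): training-time encoder that also sees the action."""
    def __init__(self, state_dim, action_dim, latent_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim + action_dim, hidden),
                                 nn.ReLU(), nn.Linear(hidden, 2 * latent_dim))

    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1)).chunk(2, dim=-1)

class MoEVelocityField(nn.Module):
    """Latent-conditioned MoE velocity field v(x_t, t | s, z); the latent
    gates the experts, so each expert can specialize to one action mode."""
    def __init__(self, state_dim, action_dim, latent_dim, n_experts=4, hidden=256):
        super().__init__()
        in_dim = action_dim + 1 + state_dim + latent_dim
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                          nn.Linear(hidden, action_dim))
            for _ in range(n_experts))
        self.gate = nn.Linear(latent_dim, n_experts)

    def forward(self, x_t, t, s, z):
        h = torch.cat([x_t, t, s, z], dim=-1)
        w = F.softmax(self.gate(z), dim=-1)                    # (B, E)
        outs = torch.stack([e(h) for e in self.experts], -1)   # (B, A, E)
        return (outs * w.unsqueeze(1)).sum(-1)

def sinkhorn_cost(x, y, eps=0.1, iters=50):
    """Entropic Kantorovich OT cost between two equal-size point clouds.
    (A log-domain Sinkhorn is more numerically stable in practice.)"""
    C = torch.cdist(x, y) ** 2
    K = torch.exp(-C / eps)
    u = torch.full((x.size(0),), 1.0 / x.size(0), device=x.device)
    v = torch.full((y.size(0),), 1.0 / y.size(0), device=y.device)
    a, b = u.clone(), v.clone()
    for _ in range(iters):
        a = u / (K @ b)
        b = v / (K.t() @ a)
    P = a.unsqueeze(1) * K * b.unsqueeze(0)  # transport plan
    return (P * C).sum()

def training_step(prior, posterior, field, s, a, beta=1e-3, gamma=1e-2):
    mu_q, lv_q = posterior(s, a)
    mu_p, lv_p = prior(s)
    z = mu_q + torch.randn_like(mu_q) * (0.5 * lv_q).exp()  # reparameterize

    # Conditional flow matching on the straight path x_t = (1 - t) x0 + t a.
    x0 = torch.randn_like(a)
    t = torch.rand(a.size(0), 1, device=a.device)
    x_t = (1 - t) * x0 + t * a
    fm = F.mse_loss(field(x_t, t, s, z), a - x0)

    # KL(q || p) between diagonal Gaussians, plus an OT term aligning prior
    # samples with posterior samples at the distribution level.
    kl = 0.5 * (lv_p - lv_q + (lv_q.exp() + (mu_q - mu_p) ** 2) / lv_p.exp()
                - 1).sum(-1).mean()
    z_p = mu_p + torch.randn_like(mu_p) * (0.5 * lv_p).exp()
    return fm + beta * kl + gamma * sinkhorn_cost(z_p, z)
```

At test time only the prior and the MoE field are needed: sample z from the prior once, then integrate the velocity field with a few Euler steps, which is what keeps inference fast.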
Variational Flow Policy outperforms prior state-of-the-art flow-based models on 41 tasks from 4 benchmarks, with an average relative improvement of 49%.
Franka Kitchen¹
Avoiding²
Pen³
Assembly⁴
These are tasks from standard simulation benchmarks used in our work.
We are grateful to the authors of these projects for open-sourcing their simulation environments:
¹ Relay Policy Learning ² D3IL ³ Adroit ⁴ Meta-World
VFP in Avoiding
Our model is capable of capturing multi-modality in tasks. Here are some videos of VFP on the Avoiding task from D3IL.
Failure Case
Note: We observe some instability in this task, which causes the failure case shown here. It is primarily due to the default action horizon of 1 in D3IL, which forces the policy to re-plan at every step. In the Franka Kitchen environment shown below, we observe that increasing the action horizon results in much more stable behavior for VFP.
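For clarity, the sketch below shows the receding-horizon execution we are describing; `policy.predict` and the gym-style `env` API are hypothetical stand-ins, not our actual evaluation code. With action_horizon=1 (the D3IL default) the policy re-samples, and may re-choose a mode, at every step; a larger horizon commits to one mode for several steps.

```python
# Hypothetical rollout loop: the policy predicts a chunk of H actions and the
# agent executes several of them before re-planning. Horizon 1 re-plans every
# step; a larger horizon commits to a mode and yields smoother motion.
def rollout(policy, env, action_horizon=8, max_steps=400):
    obs = env.reset()
    done, steps = False, 0
    while not done and steps < max_steps:
        action_chunk = policy.predict(obs)   # shape (H, action_dim)
        for action in action_chunk[:action_horizon]:
            obs, reward, done, info = env.step(action)
            steps += 1
            if done or steps >= max_steps:
                break
```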
Vanilla FM in Avoiding
In contrast to VFP, vanilla flow-based models converge to an averaged path, as the toy example below illustrates. Here, we use FlowPolicy¹ as our baseline.
FlowPolicy
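The averaging effect has a simple intuition. An MSE-trained velocity field regresses to the conditional mean of the target velocity, and early in the flow x_t carries almost no information about which mode will be chosen, so the field predicts the average velocity over the modes. The toy calculation below (our own example, not from the paper) assumes a 1-D action with two equally likely modes at +1 and -1 for the same state.

```python
# Toy illustration of mode averaging in vanilla flow matching (our example):
# two equally likely action modes at +1 and -1 for the same state.
import torch

n = 10_000
x0 = torch.randn(n, 1)                           # flow source samples
a = torch.randint(0, 2, (n, 1)).float() * 2 - 1  # actions: +1 or -1
v_target = a - x0                                # straight-path velocity

# Averaged over both modes the target velocity is ~0, so an MSE-optimal
# field that cannot tell the modes apart steers toward a ~ 0, an
# "averaged" action between the two true modes.
print(v_target.mean())  # close to 0
```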
In Franka Kitchen, we found that our model handles multi-modality well, resulting in very stable and smooth movements compared to the vanilla flow-based model FlowPolicy¹ (FP).
VFP
FP
VFP
FP - Failure - 1
VFP
FP - Failure - 2
In the videos, we observe that FlowPolicy exhibits shaky movements. This instability likely arises from the model "wandering" among multiple modes or tasks (an effect of mode mixing), which leads to the following failure cases:
FP - Failure - 1: Indecisive switching among multiple targets leads to inaction.
FP - Failure - 2: Excessive shaking during kettle grasping results in dropping the object.
To show the robustness of our method in real-world environments, we conduct experiments on 3 real-world tasks: Avoiding, Tubes Placement, and Cups Nesting.
On all three tasks, our method substantially outperforms FlowPolicy and Diffusion Policy. This underscores the importance of combining multi-modal capability (for accurate control) with fast inference in real-world environments.
In the real world, VFP maintains its multi-modal ability.
Our method achieves accurate control and smooth motion in these two tasks, owing to its multi-modal capability and fast inference.