VLAW:
Iterative Co-Improvement of Vision-Language-Action Policy and World Model
Existing world models lack the physical fidelity necessary for policy improvement: they are predominantly trained on demonstration datasets that offer little coverage of diverse physical interactions (particularly failure cases), and they struggle to accurately model small yet critical physical details in contact-rich object manipulation.
We propose a simple iterative improvement algorithm that uses real-world rollout data to improve the fidelity of the world model, which can then, in turn, be used to generate supplemental synthetic data for improving the VLA model.
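To make the loop concrete, below is a minimal Python sketch of one possible instantiation of this iterative scheme. The data-collection and fine-tuning routines are passed in as callables because their exact form is not specified here, and none of the names below correspond to an actual released API.

```python
from typing import Any, Callable, List

Rollout = Any  # placeholder for one trajectory (observations, actions, outcome)

def co_improve(
    policy: Any,
    world_model: Any,
    collect_real_rollouts: Callable[[Any], List[Rollout]],
    finetune_world_model: Callable[[Any, List[Rollout]], Any],
    generate_synthetic_rollouts: Callable[[Any, Any], List[Rollout]],
    finetune_policy: Callable[[Any, List[Rollout]], Any],
    num_iterations: int = 3,
):
    """Alternate between improving the world model on real rollout data and
    improving the VLA policy on synthetic data from the world model."""
    for _ in range(num_iterations):
        # 1. Roll out the current VLA policy in the real world,
        #    keeping successes and failures alike.
        real_rollouts = collect_real_rollouts(policy)

        # 2. Fine-tune the world model on the real rollouts so it covers
        #    the interactions (and failure modes) the policy actually hits.
        world_model = finetune_world_model(world_model, real_rollouts)

        # 3. Generate supplemental synthetic trajectories inside the
        #    improved world model.
        synthetic_rollouts = generate_synthetic_rollouts(world_model, policy)

        # 4. Fine-tune the VLA policy on the synthetic data.
        policy = finetune_policy(policy, synthetic_rollouts)

    return policy, world_model
```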
Figure 5: Long-horizon policy-in-the-loop rollout inside the world model.
(The world model maintains high fidelity over long-horizon rollouts.)
Panels: 0. initial image, 1. real-world rollout, 2. world-model rollout, for 20-second Pi_0.5 rollouts on the scooping, wiping, drawing, and book-opening tasks.
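As background for the figure above, a policy-in-the-loop rollout inside the world model can be sketched as a simple closed loop: the policy reads the world model's latest predicted observation, and the world model advances the scene by the chosen action. The `policy.act` and `world_model.step` interfaces below are assumptions made purely for illustration.

```python
def policy_in_the_loop_rollout(policy, world_model, initial_obs, instruction,
                               horizon_steps=200):
    """Closed-loop rollout carried out entirely inside the world model.

    `policy.act(obs, instruction)` and `world_model.step(obs, action)` are
    assumed interfaces for illustration, not an actual API.
    """
    obs = initial_obs
    trajectory = []
    for _ in range(horizon_steps):
        action = policy.act(obs, instruction)   # VLA chooses the next action
        obs = world_model.step(obs, action)     # world model predicts the next observation
        trajectory.append((action, obs))
    return trajectory
```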
Figure 6: Action replay inside the world model.
(Action replay aligns with the real world.)
Panels: 1. real world, 2. pre-trained WM, 3. WM + expert data, 4. WM + online rollout data, modeling the failure cases in the stacking and scooping tasks.
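For reference, the action-replay evaluation shown above can be sketched as an open-loop replay: a logged real-world action sequence is fed through the world model, and the predicted frames are compared against the recorded ones. The interfaces and the error metric below are assumptions for illustration only.

```python
def action_replay(world_model, initial_obs, logged_actions, real_frames,
                  frame_error=None):
    """Open-loop replay of a logged real-world action sequence.

    `world_model.step(obs, action)` is an assumed interface; `frame_error`
    is any per-frame discrepancy measure (e.g. pixel-wise MSE).
    """
    obs = initial_obs
    predicted_frames, errors = [], []
    for action, real_frame in zip(logged_actions, real_frames):
        obs = world_model.step(obs, action)  # predict the next frame from the logged action
        predicted_frames.append(obs)
        if frame_error is not None:
            errors.append(frame_error(obs, real_frame))
    return predicted_frames, errors
```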
Figure 7: Large-scale rollout inside the world model.
(We can generate diverse rollouts inside the world model and search for successful trajectories.)
Green indicates successful trajectories and red indicates failed trajectories. GT is the real-world rollout and 0-14 are world-model rollouts.
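One plausible way to realize the search illustrated above, building on the policy_in_the_loop_rollout sketch under Figure 5, is to sample many imagined rollouts and keep only those a task-specific success checker accepts; `is_success` is an assumed component (e.g. a scripted criterion or a learned judge), not something specified in this section.

```python
def search_success_trajectories(policy, world_model, initial_obs, instruction,
                                is_success, num_rollouts=15):
    """Sample diverse rollouts inside the world model and keep successes.

    Reuses the policy_in_the_loop_rollout sketch above; `is_success` is an
    assumed task-specific checker used only for illustration.
    """
    successes, failures = [], []
    for _ in range(num_rollouts):
        traj = policy_in_the_loop_rollout(policy, world_model,
                                          initial_obs, instruction)
        # Keep successful imagined trajectories as candidate synthetic data.
        (successes if is_success(traj) else failures).append(traj)
    return successes, failures
```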