Chenyang Gu*¹, Jiaming Liu*†¹, Hao Chen*†², Runzhong Huang*¹, Qingpo Wuwu¹, Zhuoyang Liu¹,
Xiaoqi Li¹, Ying Li¹, Renrui Zhang², Peng Jia³, Pheng-Ann Heng², Shanghang Zhang✉¹
¹State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University
²The Chinese University of Hong Kong  ³Simplexity Robotics
*Equal contribution †Project leader ✉Corresponding author
🍿 Selected Demo
Vision-Language-Action (VLA) models have recently emerged, demonstrating strong generalization in robotic scene understanding and manipulation. However, when confronted with long-horizon tasks with defined goal states, such as LEGO assembly or object rearrangement, existing VLA models still face challenges in coordinating high-level planning with precise manipulation. Therefore, we aim to endow a VLA model with the capability to infer the "how" process from the "what" outcomes, transforming goal states into executable procedures. In this paper, we introduce ManualVLA, a unified VLA framework built upon a Mixture-of-Transformers (MoT) architecture, enabling coherent collaboration between multimodal manual generation and action execution. Unlike prior VLA models that directly map sensory inputs to actions, we first equip ManualVLA with a planning expert that generates intermediate manuals consisting of images, position prompts, and textual instructions. Building upon these multimodal manuals, we design a Manual Chain-of-Thought (ManualCoT) reasoning process that feeds them into the action expert, where each manual step provides explicit control conditions, while its latent representation offers implicit guidance for accurate manipulation. To alleviate the burden of data collection, we develop a high-fidelity digital-twin toolkit based on 3D Gaussian Splatting, which automatically generates manual data for training the planning expert. ManualVLA demonstrates strong real-world performance, achieving an average success rate 32% higher than the previous hierarchical SOTA baseline on LEGO assembly and object rearrangement tasks.
Overview. (a) Long-horizon tasks with predefined goal states, such as LEGO assembly or object rearrangement, pose a significant challenge for intelligent robots, as they require not only imagining procedural manuals but also executing precise manipulations based on them. (b) We address such tasks by introducing ManualVLA, a unified VLA model built upon a MoT architecture, which enables coherent collaboration between multimodal manual generation and action execution via a dedicated Manual Chain-of-Thought (ManualCoT) reasoning process.
⚒️ ManualVLA Architecture
To accomplish long-horizon tasks with defined goal states, we propose ManualVLA, a unified VLA model built upon a MoT architecture. The framework consists of two experts: a planning expert responsible for generating multimodal manuals, and an action expert responsible for predicting precise actions. The planning expert processes human instructions, the current image, and the final goal image to generate intermediate manuals that combine a next-step image, position prompts, and sub-task instructions. We introduce an explicit CoT reasoning process in which each positional indicator serves as a visual prompt embedded into the action expert's observation. Through the cross-task shared attention mechanism and the designed attention mask, the generated manual tokens are also used as conditioning signals for action generation, enabling an implicit CoT reasoning process that effectively guides the action expert. For training, ManualVLA adopts a three-stage strategy that aligns the planning and action experts for effective collaboration.
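To make the shared-attention design concrete, below is a minimal PyTorch sketch of a single MoT block with a planning expert and an action expert. It is an illustrative assumption rather than the released implementation: the module names, dimensions, and the specific mask (action tokens may attend to manual tokens, but manual tokens cannot attend back) reflect our reading of the implicit ManualCoT conditioning described above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MoTBlock(nn.Module):
    """Sketch of one Mixture-of-Transformers block: two experts with private
    weights that share a single attention operation over the joint sequence."""

    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.dim, self.heads = dim, heads
        # Index 0 = planning expert (manual tokens), 1 = action expert (action tokens).
        self.qkv = nn.ModuleList([nn.Linear(dim, 3 * dim) for _ in range(2)])
        self.proj = nn.ModuleList([nn.Linear(dim, dim) for _ in range(2)])
        self.ffn = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(2)])
        self.norm1 = nn.ModuleList([nn.LayerNorm(dim) for _ in range(2)])
        self.norm2 = nn.ModuleList([nn.LayerNorm(dim) for _ in range(2)])

    def forward(self, manual_tokens: torch.Tensor, action_tokens: torch.Tensor):
        b, n_m, _ = manual_tokens.shape
        n = n_m + action_tokens.shape[1]

        # Each expert computes Q/K/V with its own weights.
        q0, k0, v0 = self.qkv[0](self.norm1[0](manual_tokens)).chunk(3, dim=-1)
        q1, k1, v1 = self.qkv[1](self.norm1[1](action_tokens)).chunk(3, dim=-1)
        q = torch.cat([q0, q1], dim=1)
        k = torch.cat([k0, k1], dim=1)
        v = torch.cat([v0, v1], dim=1)

        # Shared attention with an asymmetric mask: manual tokens only see
        # manual tokens, while action tokens see both groups, so the manual
        # acts as an implicit chain-of-thought condition for the actions.
        mask = torch.ones(n, n, dtype=torch.bool, device=q.device)
        mask[:n_m, n_m:] = False

        def split_heads(t):  # (b, n, dim) -> (b, heads, n, head_dim)
            return t.view(b, n, self.heads, self.dim // self.heads).transpose(1, 2)

        out = F.scaled_dot_product_attention(
            split_heads(q), split_heads(k), split_heads(v), attn_mask=mask)
        out = out.transpose(1, 2).reshape(b, n, self.dim)

        # Route the attention outputs back through each expert's projection and FFN.
        m = manual_tokens + self.proj[0](out[:, :n_m])
        a = action_tokens + self.proj[1](out[:, n_m:])
        m = m + self.ffn[0](self.norm2[0](m))
        a = a + self.ffn[1](self.norm2[1](a))
        return m, a


# Example: 16 manual tokens conditioning 8 action tokens.
block = MoTBlock()
manual, action = block(torch.randn(2, 16, 512), torch.randn(2, 8, 512))
```

In the full model, each expert would additionally carry its own input embeddings and output heads (image, position, and text decoding for the planning expert; action decoding for the action expert), which this sketch omits.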
🔥 Real-World Experiments
Action Generation. We design three long-horizon tasks with defined goal states, challenging the model’s procedural reasoning and manipulation capabilities. (1) 2D LEGO Assembly: The task begins with several LEGO bricks of different colors placed on a planar board. Given the final 2D assembled structure as the goal, the model must infer a sequence of intermediate manipulation actions and execute them through coordinated bimanual control. (2) 3D LEGO Assembly: The task extends the 2D LEGO Assembly task to a more challenging 3D setting, where the final configuration transitions from a planar layout to a 3D structure. This upgraded task imposes greater demands on the model’s spatial reasoning abilities. (3) Object Rearrangement: The task begins with several objects of diverse shapes, sizes, and semantics scattered around a box. Given a goal state in which all objects are placed at their designated positions inside the box, the model must progressively generate manipulation actions, alternating control of the left and right arms to prevent collisions.
Compared with the strongest hierarchical baseline, ManualVLA improves the final task completion rate by 15%-30%. While baseline models often succeed in the early stages of a long-horizon pipeline, they typically fail to sustain this performance through the whole sequence. In contrast, ManualVLA mitigates this degradation by decomposing complex tasks into structured subgoal manuals and grounding them into precise actions through a combination of explicit and implicit reasoning, enabling consistent performance throughout the entire task.
For each task, we visualize three components: (1) manual ground truth (GT), (2) manual predictions (Pred.) generated by ManualVLA, and (3) the final goal image.
Manual Generation. We first evaluate the capability of the planning expert in ManualVLA to generate high-fidelity manuals on 300 unseen test samples. As shown in the table, our model produces satisfactory intermediate images across all three tasks, achieving high PSNR scores, indicating strong structural and pixel-level consistency with the ground truth. Furthermore, the low FID scores, particularly in the Object Rearrangement task, demonstrate that the generated image distribution closely matches that of real images, confirming their realism and fidelity.
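For reference, such image metrics can be computed with standard tooling. The sketch below uses torchmetrics (with its image extras installed); this is an assumption on our part, as the exact evaluation code and preprocessing are not specified here.

```python
import torch
from torchmetrics.image import PeakSignalNoiseRatio
from torchmetrics.image.fid import FrechetInceptionDistance

psnr = PeakSignalNoiseRatio(data_range=1.0)      # inputs assumed in [0, 1]
fid = FrechetInceptionDistance(feature=2048)     # InceptionV3 pooling features


def evaluate_manual_images(pred: torch.Tensor, gt: torch.Tensor) -> dict:
    """pred, gt: (N, 3, H, W) float tensors in [0, 1] for N test samples."""
    psnr_score = psnr(pred, gt)
    # FID compares the distributions of generated and real images; the default
    # torchmetrics implementation expects uint8 images in [0, 255].
    fid.update((gt * 255).to(torch.uint8), real=True)
    fid.update((pred * 255).to(torch.uint8), real=False)
    return {"PSNR": psnr_score.item(), "FID": fid.compute().item()}
```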
💿 More Demonstrations
🖼️ 2D LEGO Assembly: 💯 Target State · 📷 Front View · 🎥 Third View
🏔️ 3D Synchronous Assembly: 💯 Target State · 📷 Front View · 🎥 Third View
🌄 3D Asynchronous Assembly: 💯 Target State · 📷 Front View · 🎥 Third View
🥣 Object Rearrangement: 💯 Target State · 📷 Front View · 🎥 Third View
🏅 Simulation Experiments
To systematically evaluate whether ManualVLA retains the general manipulation capabilities of a standard VLA model, we conduct experiments on 10 tasks from the RLBench benchmark, which is built on the CoppeliaSim simulator. For each task, we collect 100 demonstration trajectories using the Open Motion Planning Library.
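A minimal sketch of this demonstration collection with the public RLBench Python API is shown below; the concrete 10 tasks and the action mode are not listed in this section, so the ReachTarget task and JointVelocity control in the snippet are only illustrative placeholders.

```python
from rlbench.action_modes.action_mode import MoveArmThenGripper
from rlbench.action_modes.arm_action_modes import JointVelocity
from rlbench.action_modes.gripper_action_modes import Discrete
from rlbench.environment import Environment
from rlbench.observation_config import ObservationConfig
from rlbench.tasks import ReachTarget

obs_config = ObservationConfig()
obs_config.set_all(True)  # record all camera views and low-dim state

env = Environment(
    action_mode=MoveArmThenGripper(
        arm_action_mode=JointVelocity(),
        gripper_action_mode=Discrete()),
    obs_config=obs_config,
    headless=True)
env.launch()

task = env.get_task(ReachTarget)
# live_demos=True asks RLBench to plan each trajectory on the fly inside
# CoppeliaSim, whose path planning is backed by OMPL.
demos = task.get_demos(100, live_demos=True)
env.shutdown()
```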
ManualVLA obtains an average success rate of 70% across 10 diverse tasks, surpassing the previous SOTA methods π0 and CoT-VLA by margins of 7% and 11%, respectively.