Grasp-Then-Plan with Failure Attribution: A Closed Two-Stage Framework for Precise and Generalizable Robotic Manipulation
In robotic manipulation, the tight coupling between grasping and motion planning often obscures the true source of failure, leading to inefficient trial-and-error. To enable efficient long-horizon manipulation, we propose GTP-FA (Grasp-Then-Plan with Failure Attribution), a task-oriented two-stage ‘grasp-then-plan’ framework that generates grasp candidates and performs downstream motion planning conditioned on the selected grasp. From failed manipulation trajectories, we learn a failure attribution model that generalizes to unseen grasps and produces a stable distribution over failure modes. Based on these attribution results, we optimize both modules in a diagnosis-driven manner: on the grasping side, we inject task-level priors and risk penalties into grasp candidate scoring and optimization to suppress unstable or task-incompatible grasps; on the planning side, we target high-risk initial states through data collection and fine-tuning to address genuine planning bottlenecks. We evaluate the framework in both simulation and real-robot experiments and show that GTP-FA improves the corresponding base learners across RL, IL, diffusion-policy, and VLA-based settings, achieving substantially higher overall task success rates.
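To make the grasp-side idea concrete, the following is a minimal sketch of scoring grasp candidates with a task-level prior and a risk penalty. The abstract does not give the scoring formula, so the additive form, the field names, the weights, and the example values are all illustrative assumptions rather than the method used in the paper.

```python
from dataclasses import dataclass


@dataclass
class GraspCandidate:
    # All fields are hypothetical, assumed to lie in [0, 1].
    name: str
    base_quality: float  # geometric grasp quality
    task_prior: float    # compatibility with the downstream task
    risk: float          # failure risk attributed to this grasp


def score(c, prior_weight=0.5, risk_weight=1.0):
    """Assumed additive score: quality plus weighted prior minus weighted risk."""
    return c.base_quality + prior_weight * c.task_prior - risk_weight * c.risk


def select_grasp(candidates, **weights):
    """Return the candidate with the highest score."""
    return max(candidates, key=lambda c: score(c, **weights))


# Made-up candidates: a geometrically strong but risky grasp
# versus a slightly weaker grasp that suits the task.
candidates = [
    GraspCandidate("top_pinch", base_quality=0.90, task_prior=0.20, risk=0.60),
    GraspCandidate("handle_side", base_quality=0.70, task_prior=0.90, risk=0.10),
]
best = select_grasp(candidates)
```

With these made-up numbers the risk penalty outweighs the higher geometric quality of the first candidate, so the task-compatible grasp is selected.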
Stage 1: Task Goal; Task: place the orange into the pink tray.
Stage 2: Direct VLA Control; Original π0.5 directly predicts actions from visual observations and language.
Stage 3: Grasp Instability; Failure source: unstable or task-incompatible grasping.
Stage 4: Unreliable Post-Grasp; The downstream policy starts from an unreliable post-grasp condition.
Stage 5: Task Failure; Result: grasp or transport failure prevents task completion.
Stage 1: Task Goal; Task: place the orange into the pink tray.
Stage 2: Task-Aware Grasp; GTP-FA first selects a task-compatible grasp.
Stage 3: Stable Post-Grasp; The selected grasp provides a stable post-grasp state.
Stage 4: Downstream Execution; The VLA policy executes from this reliable grasp-conditioned state.
Stage 5: Successful Completion; Result: the orange is stably placed into the tray.
Stage 1: Task Goal; Task: stack the blue cube on the orange cube.
Stage 2: Small-Object Localization; The small cubes occupy only a small region in the base-camera view.
Stage 3: Missing Grasp Grounding; Original π0.5 lacks explicit grasp-conditioned grounding.
Stage 4: Task Failure; Result: inaccurate approach, grasping, or placement causes failure.
Stage 1: Task Goal; Task: stack the blue cube on the orange cube.
Stage 2: Grasp Grounding; GTP-FA grounds execution on a selected grasp.
Stage 3: Stable Post-Grasp; The selected grasp provides a stable post-grasp state.
Stage 4: Precise Placement; The downstream VLA policy performs precise placement.
Stage 5: Post-Grasp Alignment; Accurate post-grasp alignment is critical for stacking.
Stage 6: Stable Stack; Result: the blue cube is stably stacked on the orange cube.
Stage 1: Task Goal; Task: pick up the red end of the stick and push the yellow cube into the red target area.
Stage 2: Direct VLA Control; Original π0.5 predicts the full action sequence directly.
Stage 3: Failed Grasping; The robot fails to grasp the stick.
Stage 4: Task Failure; Result: the yellow cube does not reach the red target area.
Stage 1: Task Goal; Task: pick up the red end of the stick and push the yellow cube into the red target area.
Stage 2: Task-Aware Contact; GTP-FA selects an interaction pose suitable for stable pushing.
Stage 3: Stable Grasp; The selected grasp reduces slipping and off-direction motion.
Stage 4: Directional Execution; The VLA policy pushes the yellow cube toward the red target area.
Stage 5: Successful Completion; Result: the yellow cube reaches the red target area.
Stage 1: Task Goal; Task: pick up the red part of the hook and pull the yellow cube into the red target area.
Stage 2: Direct VLA Control; Original π0.5 directly predicts the full action sequence.
Stage 3: Stable Grasp; Original π0.5 can grasp the red part of the hook.
Stage 4: Pulling Attempt; The robot attempts to pull the cube but fails to move it into the target area.
Stage 5: Task Failure; Result: the yellow cube does not reach the red target area.
Stage 1: Task Goal; Task: pick up the red part of the hook and pull the yellow cube into the red target area.
Stage 2: Task-Aware Tool Grasp; GTP-FA selects a grasp that keeps the hook functional for pulling.
Stage 3: Tool Engagement; The tool is aligned to engage the cube from a suitable contact region.
Stage 4: Pulling Execution; The VLA policy pulls the cube toward the target while maintaining tool contact.
Stage 5: Successful Completion; Result: the yellow cube is pulled into the red target area.
Stage 1: Task Goal; Task: pick up the red handle of the gray cup and pour the contents into the blue-gray cup.
Stage 2: Failed Grasping; Original π0.5 fails to grasp the cup handle reliably.
Stage 3: Task Failure; Result: the task fails because the cup cannot be lifted for pouring.
Stage 1: Task Goal; Task: pick up the red handle of the gray cup and pour the contents into the blue-gray cup.
Stage 2: Task-Aware Grasp; GTP-FA selects a grasp that supports stable lifting and controlled pouring.
Stage 3: Stable Grasp; The selected grasp provides a reliable cup pose for downstream execution.
Stage 4: Controlled Pouring; The robot tilts the cup and pours the contents into the target cup.
Stage 5: Successful Completion; Result: the contents are successfully transferred into the blue-gray cup.
Success Conditions:
the red cube is on top of the green cube (to within half of the cube size)
the red cube is static
the red cube is not being grasped by the robot (robot must let go of the cube)
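The three stacking conditions above can be sketched as a single check. This is an illustrative reading only: the cube half size, the tolerances, and the interpretation of "within half of the cube size" as a horizontal offset limit are assumptions, not values taken from the benchmark code.

```python
import math


def is_stacked(red_pos, green_pos, red_speed, is_grasped,
               cube_half_size=0.02, z_tol=0.005, speed_tol=0.01):
    """Check the three stacking conditions (all thresholds are assumed)."""
    dx = red_pos[0] - green_pos[0]
    dy = red_pos[1] - green_pos[1]
    dz = red_pos[2] - green_pos[2]
    # Horizontal offset within half of the cube size.
    xy_ok = math.hypot(dx, dy) <= cube_half_size
    # Red cube center sits one full cube height above the green cube center.
    z_ok = abs(dz - 2 * cube_half_size) <= z_tol
    # Red cube is (nearly) motionless.
    static = red_speed <= speed_tol
    # The gripper must have released the cube.
    return xy_ok and z_ok and static and not is_grasped
```

For example, `is_stacked((0.0, 0.005, 0.06), (0.0, 0.0, 0.02), 0.0, False)` satisfies all three conditions under these assumed thresholds, while the same pose with `is_grasped=True` does not.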
Success Conditions:
the absolute value of the peg’s y euler angle is within 0.08 of π/2 and the z position of the peg is within 0.005 of its half-length (0.12).
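As a worked illustration, the upright condition compares the magnitude of the peg's y euler angle with π/2 and the peg's height with its half-length. The function and argument names below are hypothetical, and angles are assumed to be in radians and lengths in meters.

```python
import math


def peg_is_upright(y_euler, peg_z, half_length=0.12,
                   angle_tol=0.08, z_tol=0.005):
    """Return True when the peg stands upright at the expected height."""
    # |y euler angle| must be within angle_tol of pi/2.
    angle_ok = abs(abs(y_euler) - math.pi / 2) < angle_tol
    # Peg center height must be within z_tol of its half-length.
    height_ok = abs(peg_z - half_length) < z_tol
    return angle_ok and height_ok
```

Because the absolute value of the angle is used, a peg rotated to roughly -π/2 also counts as upright.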
Success Conditions:
the cube’s xy position is within goal_radius (default 0.05) of the target’s xy position by euclidean distance
the robot is static
Success Conditions:
The sphere is placed on the top of the bin. The robot remains static and the gripper is not closed at the end state.
Success Conditions:
the cube position is within goal_thresh (default 0.025m) euclidean distance of the goal position
the robot is static (q velocity < 0.2)
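A minimal sketch of this check, assuming 3D positions in meters and a list of joint velocities. Reading "q velocity < 0.2" as a per-joint bound on velocity magnitude is an assumption, and the function name is hypothetical.

```python
import math


def pick_success(cube_pos, goal_pos, joint_velocities,
                 goal_thresh=0.025, qvel_thresh=0.2):
    """Cube is within goal_thresh of the goal and the robot is static."""
    # Euclidean distance between cube and goal positions.
    close = math.dist(cube_pos, goal_pos) <= goal_thresh
    # Assumed static test: every joint velocity magnitude is below the threshold.
    static = all(abs(v) < qvel_thresh for v in joint_velocities)
    return close and static
```

Both conditions must hold at the same time, so an episode in which the cube passes through the goal while the arm is still moving would not count as a success under this reading.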
Success Conditions:
the cube’s xy position is within goal_radius (default 0.1) of the target’s xy position by euclidean distance.
Success Conditions:
the cube’s xy position is within goal_radius (default 0.1) of the target’s xy position by euclidean distance and the cube is still on the table.
Success Conditions:
The cube’s xy position is within the goal region of the arm’s base (marked by reachability)
This table corresponds to Table 1 in the paper.
This figure summarizes final success_at_end across eight ManiSkill3 tasks and five downstream learners. GTP-FA achieves stronger and more consistent terminal success across PPO, SAC, BC, DP, and π0.5.
These curves show the full success_at_end training dynamics for PPO, SAC, BC, and DP across all eight tasks. GTP-FA generally improves terminal success, convergence behavior, or training stability.
These curves report whether the policy reaches a successful state at least once during an episode. Together with terminal-success curves, they show that GTP-FA improves both reaching success and maintaining success until termination.
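The difference between the two metrics can be sketched from a per-step success flag. The key name `success_once` and the function name are assumptions for illustration; `success_at_end` follows the name used in the captions above.

```python
def episode_metrics(step_success):
    """Summarize one episode from its per-step success flags."""
    flags = [bool(s) for s in step_success]
    return {
        # True if any step of the episode was in a success state.
        "success_once": any(flags),
        # True only if the final step is in a success state.
        "success_at_end": bool(flags) and flags[-1],
    }
```

An episode with flags `[False, True, False]` reaches success once but does not hold it until termination, which is exactly the gap the two sets of curves are meant to expose.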
These curves visualize the execute–diagnose–update process of GTP-FA. Later iterations or final models often improve or stabilize performance, showing the effect of failure-attribution-guided refinement.
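One way to picture the diagnose step is as routing: a distribution over failure modes is aggregated per module, and the module carrying the most probability mass is refined next. The failure-mode names, the mode-to-module mapping, and the argmax rule below are hypothetical and only illustrate the idea; the page does not specify how attribution results are turned into updates.

```python
from collections import Counter

# Hypothetical mapping from failure mode to the module it implicates.
MODE_TO_MODULE = {
    "unstable_grasp": "grasp",
    "task_incompatible_grasp": "grasp",
    "planning_bottleneck": "planner",
}


def module_to_update(failure_mode_probs):
    """Return the module with the largest attributed failure probability."""
    mass = Counter()
    for mode, prob in failure_mode_probs.items():
        mass[MODE_TO_MODULE[mode]] += prob
    return max(mass, key=mass.get)
```

For instance, a distribution of 0.5 / 0.2 / 0.3 over the three modes above puts 0.7 of the mass on the grasp module, so the grasp side would be refined in that iteration.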
These curves show the π0.5 fine-tuning loss under different ablation settings. GTP-FA often reaches a lower training loss or converges faster, but final conclusions are based on task-level success rather than loss alone.