We use VLMs to generate high-level hand-object plans for dexterous manipulation, and train residual RL policies to execute them.
Dexterous robotic hands are essential for performing complex manipulation tasks, yet remain difficult to train due to the challenges of demonstration collection and high-dimensional control. While reinforcement learning (RL) can alleviate the data bottleneck by generating experience in simulation, it typically relies on carefully designed, task-specific reward functions, which hinder scalability and generalization. As a result, contemporary work on dexterous manipulation often bootstraps learning from reference trajectories. These trajectories specify target hand poses that guide the exploration of RL policies and object poses that enable dense, task-agnostic rewards. However, sourcing suitable trajectories, particularly for dexterous hands, remains a significant challenge. At the same time, the precise details of explicit reference trajectories are often unnecessary, since RL ultimately refines the motion. Our key insight is that modern vision-language models (VLMs) already encode the commonsense spatial and semantic knowledge needed to specify tasks and guide exploration effectively. Given a task description (e.g., "open the cabinet") and a visual scene, our method uses an off-the-shelf VLM to first identify task-relevant keypoints (e.g., handles, buttons) and then synthesize 3D trajectories for hand motion and object motion. Subsequently, we train a low-level residual RL policy in simulation to track these coarse trajectories or "scaffolds" with high fidelity. Across eight simulated tasks involving articulated objects and semantic understanding, we demonstrate that our method learns robust dexterous manipulation policies. Moreover, we showcase that our method transfers to real-world robotic hands without any human demonstrations or handcrafted rewards.
First, we use a VLM to generate a high-level plan for dexterous manipulation:
Capture an RGB-D image from a calibrated camera.
Query a VLM to detect useful 2D keypoints for solving a task.
Convert these to 3D keypoints using the depth image and camera calibration (see the sketch after this list).
Query a VLM to generate a high-level plan consisting of keypoint trajectories and a wrist trajectory.
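For concreteness, the 2D-to-3D lifting in the third step can be implemented as standard pinhole back-projection. The sketch below is a minimal illustration, assuming an aligned depth image, known intrinsics, and a camera-to-world extrinsic from calibration; all function and variable names are illustrative rather than our actual code.

```python
import numpy as np

def deproject_keypoints(keypoints_2d, depth, K, T_cam_to_world):
    """Lift 2D pixel keypoints to 3D world coordinates with a pinhole camera model.

    keypoints_2d:   (N, 2) array of (u, v) pixel coordinates returned by the VLM
    depth:          (H, W) depth image in meters, aligned with the RGB image
    K:              (3, 3) camera intrinsics matrix
    T_cam_to_world: (4, 4) transform from the calibrated camera to the robot/world frame
    """
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]

    points_world = []
    for u, v in keypoints_2d:
        z = depth[int(round(v)), int(round(u))]       # depth at the detected pixel
        p_cam = np.array([(u - cx) * z / fx,          # back-project through the pinhole model
                          (v - cy) * z / fy,
                          z, 1.0])                    # homogeneous point in the camera frame
        points_world.append((T_cam_to_world @ p_cam)[:3])  # express the point in the world frame
    return np.stack(points_world)
```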
Given the high-level plan from the VLM, we deploy a low-level, closed-loop policy to execute the task. We use a residual policy that outputs residual wrist poses and finger actions. The policy is trained entirely in simulation and deployed zero-shot in the real world.
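A minimal sketch of how the residual action could be composed at each control step is shown below. The policy interface, plan format, and residual bound are illustrative assumptions, not the exact training setup.

```python
import numpy as np

def compose_action(policy, obs, plan, t, residual_scale=0.05):
    """One closed-loop control step: track the VLM plan with a learned residual.

    policy:         learned low-level policy trained with RL in simulation (hypothetical callable)
    obs:            current proprioceptive + object observation vector
    plan:           dict holding the precomputed wrist trajectory from the VLM
    t:              current timestep index into the plan
    residual_scale: bound on how far the residual may deviate from the plan
    """
    wrist_ref = plan["wrist_traj"][t]  # nominal wrist pose from the high-level plan

    # The policy conditions on the observation and the reference pose, and outputs
    # a wrist-pose residual plus finger joint targets.
    wrist_residual, finger_targets = policy(np.concatenate([obs, wrist_ref]))

    # Apply a bounded correction around the planned pose
    # (orientation composition is simplified here for illustration).
    wrist_cmd = wrist_ref + residual_scale * np.tanh(wrist_residual)
    return wrist_cmd, finger_targets
```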
We evaluate our method across eight diverse manipulation tasks in simulation that test semantic understanding, unstructured motion, articulated object control, and precise finger movements. All videos below show successful rollouts with novel object and hand configurations. Despite using a single unified method across all tasks, our system achieves strong generalization and robust performance.
Move the apple onto the cutting board
Move the water bottle to the right side of the sink
Pick up the hammer and make a hammering motion
Wipe the kitchen counter with a sponge
Open the top drawer of the cupboard
Open the refrigerator door
Close the pliers
Close the pair of scissors
Zero-shot performance is strong across all tasks, showing that VLMs provide effective plans for dexterous manipulation. Few-shot prompting with a handful of in-context examples significantly boosts success, especially on harder tasks. Our baseline using pre-recorded trajectories performs poorly. An oracle that replaces the VLM with ground-truth keypoints and trajectories sets an upper bound, indicating room for improvement as VLM accuracy increases.
Our analysis reveals that most failures stem from incomplete tracking of planned trajectories by the low-level controller, followed by errors in keypoint detection from the VLM. Some rollouts track the planned trajectory fully yet still fail, highlighting occasional planning limitations. Due to the complex interaction between high-level planning and low-level execution, further automatic decomposition of errors remains challenging.
Detected keypoints for hammering task
Generated trajectories for fridge opening task (zero-shot)