We use VLMs to generate high-level hand-object plans for dexterous manipulation, and train residual RL policies to execute them.
Dexterous robotic hands are essential for performing complex manipulation tasks, yet remain difficult to train due to the challenges of demonstration collection and high-dimensional control. While reinforcement learning (RL) can alleviate the data bottleneck by generating experience in simulation, it typically relies on carefully designed, task-specific reward functions, which hinder scalability and generalization. As a result, contemporary work on dexterous manipulation often bootstraps learning from reference trajectories. These trajectories specify target hand poses that guide the exploration of RL policies and object poses that enable dense, task-agnostic rewards. However, sourcing suitable trajectories, particularly for dexterous hands, remains a significant challenge. At the same time, the precise details of explicit reference trajectories are often unnecessary, since RL ultimately refines the motion. Our key insight is that modern vision-language models (VLMs) already encode the commonsense spatial and semantic knowledge needed to specify tasks and guide exploration effectively. Given a task description (e.g., "open the cabinet") and a visual scene, our method uses an off-the-shelf VLM to first identify task-relevant keypoints (e.g., handles, buttons) and then synthesize 3D trajectories for hand motion and object motion. Subsequently, we train a low-level residual RL policy in simulation to track these coarse trajectories or "scaffolds" with high fidelity. Across eight simulated tasks involving articulated objects and semantic understanding, we demonstrate that our method learns robust dexterous manipulation policies. Moreover, we showcase that our method transfers to real-world robotic hands without any human demonstrations or handcrafted rewards.
First, we use a VLM to generate a high-level plan for dexterous manipulation:
Capture an RGB-D image from a calibrated camera.
Query a VLM to detect useful 2D keypoints for solving a task.
Convert these to 3D keypoints using the depth image and camera calibration (see the sketch after this list).
Query a VLM to generate a high-level plan consisting of keypoint trajectories and a wrist trajectory.
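For concreteness, the 2D-to-3D lifting in the third step can be implemented as standard pinhole back-projection. The sketch below is a minimal illustration, assuming an aligned depth image, known intrinsics, and a camera-to-world extrinsic from calibration; all function and variable names are illustrative rather than our actual code.

```python
import numpy as np

def deproject_keypoints(keypoints_2d, depth, K, T_cam_to_world):
    """Lift 2D pixel keypoints to 3D world coordinates with a pinhole camera model.

    keypoints_2d:   (N, 2) array of (u, v) pixel coordinates returned by the VLM
    depth:          (H, W) depth image in meters, aligned with the RGB image
    K:              (3, 3) camera intrinsics matrix
    T_cam_to_world: (4, 4) transform from the calibrated camera to the robot/world frame
    """
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]

    points_world = []
    for u, v in keypoints_2d:
        z = depth[int(round(v)), int(round(u))]       # depth at the detected pixel
        p_cam = np.array([(u - cx) * z / fx,          # back-project through the pinhole model
                          (v - cy) * z / fy,
                          z, 1.0])                    # homogeneous point in the camera frame
        points_world.append((T_cam_to_world @ p_cam)[:3])  # express the point in the world frame
    return np.stack(points_world)
```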
Given the high-level plan from the VLM, we deploy a low-level, closed-loop policy to execute the task. We use a residual policy that outputs residual wrist poses and finger actions. The policy is trained entirely in simulation and deployed zero-shot in the real world.
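A minimal sketch of how the residual action could be composed at each control step is shown below. The policy interface, plan format, and residual bound are illustrative assumptions, not the exact training setup.

```python
import numpy as np

def compose_action(policy, obs, plan, t, residual_scale=0.05):
    """One closed-loop control step: track the VLM plan with a learned residual.

    policy:         learned low-level policy trained with RL in simulation (hypothetical callable)
    obs:            current proprioceptive + object observation vector
    plan:           dict holding the precomputed wrist trajectory from the VLM
    t:              current timestep index into the plan
    residual_scale: bound on how far the residual may deviate from the plan
    """
    wrist_ref = plan["wrist_traj"][t]  # nominal wrist pose from the high-level plan

    # The policy conditions on the observation and the reference pose, and outputs
    # a wrist-pose residual plus finger joint targets.
    wrist_residual, finger_targets = policy(np.concatenate([obs, wrist_ref]))

    # Apply a bounded correction around the planned pose
    # (orientation composition is simplified here for illustration).
    wrist_cmd = wrist_ref + residual_scale * np.tanh(wrist_residual)
    return wrist_cmd, finger_targets
```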
We evaluate our method across eight diverse manipulation tasks in simulation that test semantic understanding, unstructured motion, articulated object control, and precise finger movements. All videos below show successful rollouts with novel object and hand configurations. Despite using a single unified method across all tasks, our system achieves strong generalization and robust performance.
Move the apple onto the cutting board
Move the water bottle to the right side of the sink
Pick up the hammer and make a hammering motion
Wipe the kitchen counter with a sponge
Open the top drawer of the cupboard
Open the refrigerator door
Close the pliers
Close the pair of scissors
Zero-shot performance is strong across all tasks, showing that VLMs provide effective plans for dexterous manipulation. Few-shot prompting with a handful of in-context examples significantly boosts success, especially on harder tasks. Our baseline using pre-recorded trajectories performs poorly. An oracle that replaces the VLM with ground-truth keypoints and trajectories sets an upper bound, indicating room for improvement as VLM accuracy increases.
Our analysis reveals that most failures stem from incomplete tracking of planned trajectories by the low-level controller, followed by errors in keypoint detection from the VLM. Some rollouts track the planned trajectory fully yet still fail, highlighting occasional planning limitations. Due to the complex interaction between high-level planning and low-level execution, further automatic decomposition of errors remains challenging.
Detected keypoints for hammering task
Generated trajectories for fridge opening task (zero-shot)