KITE: Keypoints + Instructions To Execution
Keypoint-Conditioned Policies for Semantic Manipulation
Priya Sundaresan, Suneel Belkhale, Dorsa Sadigh, Jeannette Bohg
While natural language offers a convenient shared interface for humans and robots, enabling robots to interpret and follow language commands remains a longstanding challenge in manipulation. A crucial step toward a performant instruction-following robot is achieving semantic manipulation, where a robot interprets language at different specificities, from high-level instructions like "Pick up the stuffed animal" to more detailed inputs like "Grab the left ear of the elephant." To tackle this, we propose Keypoints + Instructions to Execution (KITE), a two-step framework for semantic manipulation that attends to both scene semantics (distinguishing between different objects in a visual scene) and object semantics (precisely localizing different parts within an object instance). KITE first grounds an input instruction in a visual scene through 2D image keypoints, providing a highly accurate object-centric bias for downstream action inference. Provided an RGB-D scene observation, KITE then executes a learned keypoint-conditioned skill to carry out the instruction. The combined precision of keypoints and parameterized skills enables fine-grained manipulation with generalization to scene and object variations. Empirically, we demonstrate KITE in 3 real-world environments: long-horizon 6-DoF tabletop manipulation, semantic grasping, and a high-precision coffee-making task. In these settings, KITE achieves overall instruction-following success rates of 75%, 70%, and 71%, respectively. KITE outperforms frameworks that opt for pre-trained visual language models over keypoint-based grounding, or that omit skills in favor of end-to-end visuomotor control, all while being trained from fewer or comparable amounts of demonstrations.
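To make the hand-off between the two steps concrete, here is a minimal sketch of one way a 2D keypoint could be lifted into 3D using the depth channel of an RGB-D observation. This is an illustration under stated assumptions, not KITE's actual implementation: the heatmap representation, the pinhole camera model, the intrinsics `fx`, `fy`, `cx`, `cy`, and the function names `argmax_keypoint` and `deproject` are all hypothetical.

```python
from typing import List, Tuple


def argmax_keypoint(heatmap: List[List[float]]) -> Tuple[int, int]:
    """Return (u, v) = (column, row) of the highest-scoring pixel.

    The heatmap stands in for the output of a grounding model
    (an assumption; the source only states that keypoints are 2D).
    """
    best_u, best_v = 0, 0
    for v, row in enumerate(heatmap):
        for u, score in enumerate(row):
            if score > heatmap[best_v][best_u]:
                best_u, best_v = u, v
    return best_u, best_v


def deproject(u: int, v: int, depth_m: float,
              fx: float, fy: float, cx: float, cy: float
              ) -> Tuple[float, float, float]:
    """Lift pixel (u, v) with depth in meters to a camera-frame 3D point.

    Uses the standard pinhole model; the intrinsics are hypothetical.
    """
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return (x, y, depth_m)


# Toy 3x3 scene: the keypoint scores highest in the top-right pixel.
heatmap = [[0.0, 0.1, 0.9],
           [0.2, 0.3, 0.1],
           [0.0, 0.3, 0.0]]
depth = [[1.0, 1.0, 0.5],
         [1.0, 1.0, 1.0],
         [1.0, 1.0, 1.0]]

u, v = argmax_keypoint(heatmap)
point = deproject(u, v, depth[v][u], fx=2.0, fy=2.0, cx=1.0, cy=1.0)
print((u, v), point)
```

A keypoint-conditioned skill could then take a 3D point like this as its object-centric reference when choosing where to act. How KITE's learned skills actually consume the keypoint is not detailed in the abstract above.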
KITE Experimental Results
Tabletop Long-Horizon Instruction Following
"Open the top drawer"
"Put the lemon into the top drawer"
"Close the top drawer"
"Close the bottom drawer"
"Pick up the green bowl"
"Put the expo marker into the green bowl"
"Give the 3rd drawer a tug"
"Let's put the ketchup away"
"Plop the carrot into the green bowl"
pour_cup / refill_keurig
KITE Failure Modes
PerAct Waypoint Predictions
"Open the 2nd drawer"
❌ Imprecise handle prediction
"Open the top drawer"
"Open the bottom drawer"
❌ Wrong handle prediction