KITE: Keypoints + Instructions To Execution

Keypoint-Conditioned Policies for Semantic Manipulation
Priya Sundaresan, Suneel Belkhale, Dorsa Sadigh, Jeannette Bohg


While natural language offers a convenient shared interface for humans and robots, enabling robots to interpret and follow language commands remains a longstanding challenge in manipulation. A crucial step to realizing a performant instruction-following robot is achieving semantic manipulation – where a robot interprets language at different specificities, from high-level instructions like "Pick up the stuffed animal" to more detailed inputs like "Grab the left ear of the elephant." To tackle this, we propose Keypoints + Instructions to Execution (KITE), a two-step framework for semantic manipulation which attends to both scene semantics (distinguishing between different objects in a visual scene) and object semantics (precisely localizing different parts within an object instance). KITE first grounds an input instruction in a visual scene through 2D image keypoints, providing a highly accurate object-centric bias for downstream action inference. Given an RGB-D scene observation, KITE then executes a learned keypoint-conditioned skill to carry out the instruction. The combined precision of keypoints and parameterized skills enables fine-grained manipulation with generalization to scene and object variations. Empirically, we demonstrate KITE in three real-world environments: long-horizon 6-DoF tabletop manipulation, semantic grasping, and a high-precision coffee-making task. In these settings, KITE achieves 75%, 70%, and 71% overall instruction-following success rates, respectively. KITE outperforms frameworks that opt for pre-trained visual language models over keypoint-based grounding, or omit skills in favor of end-to-end visuomotor control, all while being trained on fewer or comparable amounts of demonstrations.
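The two-step pipeline described above can be sketched as follows. This is a minimal illustration only: the function names (`ground_instruction`, `keypoint_to_3d`, `execute_skill`), the random-heatmap grounding stub, and the intrinsics values are placeholders, not KITE's actual models or API.

```python
import numpy as np

def ground_instruction(rgb: np.ndarray, instruction: str) -> tuple:
    """Step 1 (hypothetical stand-in): score each pixel for relevance to the
    instruction and return the argmax as a 2D keypoint (u, v). A random
    heatmap substitutes for KITE's learned grounding network."""
    h, w, _ = rgb.shape
    heatmap = np.random.default_rng(0).random((h, w))  # placeholder scores
    v, u = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    return int(u), int(v)

def keypoint_to_3d(depth: np.ndarray, u: int, v: int,
                   fx: float, fy: float, cx: float, cy: float) -> np.ndarray:
    """Back-project the 2D keypoint into the camera frame using the depth
    image and standard pinhole intrinsics."""
    z = depth[v, u]
    return np.array([(u - cx) * z / fx, (v - cy) * z / fy, z])

def execute_skill(skill: str, target_xyz: np.ndarray) -> dict:
    """Step 2 (hypothetical stand-in): a keypoint-conditioned skill would
    consume the 3D target; here we just report the commanded waypoint."""
    return {"skill": skill, "waypoint": target_xyz}

# Toy RGB-D observation and intrinsics.
rgb = np.zeros((480, 640, 3), dtype=np.uint8)
depth = np.full((480, 640), 0.5)  # meters
u, v = ground_instruction(rgb, "grab the left ear of the elephant")
target = keypoint_to_3d(depth, u, v, fx=600.0, fy=600.0, cx=320.0, cy=240.0)
cmd = execute_skill("grasp", target)
```

The key design point this sketch captures is the decoupling: grounding produces only a 2D keypoint, and the skill consumes its 3D back-projection, so either component can be retrained or swapped independently.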

Overview Video


KITE Experimental Results

Tabletop Long-Horizon Instruction Following

Tier 1

Simple visual scene + simple instructions

"Open the top drawer"

"Put the lemon into the top drawer"

"Close the top drawer"

Tier 2

Cluttered visual scene + simple instructions

"Close the bottom drawer"

"Pick up the green bowl"

"Put the expo marker into the green bowl"

Tier 3

Cluttered visual scene + free-form instructions

"Give the 3rd drawer a tug"

"Let's put the ketchup away"

"Plop the carrot into the green bowl"

Semantic Grasping

Rigid Items


Articulated Items

Coffee Making


pour_cup / refill_keurig




KITE Failure Modes

The video above illustrates the main failure modes we observe with KITE.

Baseline Results

We benchmark against PerAct [1], an end-to-end voxels-to-actions approach to instruction following, and RobotMoo [2], a framework similar in spirit to KITE that replaces keypoint-based grounding with VLMs (Grounding DINO [3] + Segment Anything [4]). On the left, we visualize the resulting behaviors of both on tabletop manipulation.



RobotMoo is limited by the capacity of VLMs to accurately localize semantic features. Grounding DINO and Segment Anything prove viable for identifying scene-level semantic features like object instances, leading to decent pick-and-place behaviors. However, they often produce false negatives or inaccuracies on object-level semantic features like drawer handles, which is most apparent in the opening/closing segments of the above video.



While PerAct demonstrates competent instruction following on maneuvers like closing, it struggles with high-precision tasks like grasping objects and drawer handles. Because the method is end-to-end, its action space includes noisy predicted gripper open/close commands, which further complicate contact-rich tasks.

PerAct Waypoint Predictions

Ground Truth Waypoint / Predicted Waypoint

"Open the 2nd drawer"

❌ Imprecise handle localization

"Open the top drawer"

✔️ Good alignment

"Open the bottom drawer"

❌ Wrong handle prediction

KITE Grounding Predictions (1-Step Lookahead)