KITE: Keypoints + Instructions To Execution
Keypoint-Conditioned Policies for Semantic Manipulation
Priya Sundaresan, Suneel Belkhale, Dorsa Sadigh, Jeannette Bohg
Abstract
While natural language offers a convenient shared interface for humans and robots, enabling robots to interpret and follow language commands remains a longstanding challenge in manipulation. A crucial step to realizing a performant instruction-following robot is achieving semantic manipulation – where a robot interprets language at different specificities, from high-level instructions like "Pick up the stuffed animal" to more detailed inputs like "Grab the left ear of the elephant." To tackle this, we propose Keypoints + Instructions to Execution, a two-step framework for semantic manipulation which attends to both scene semantics (distinguishing between different objects in a visual scene) and object semantics (precisely localizing different parts within an object instance). KITE first grounds an input instruction in a visual scene through 2D image keypoints, providing a highly accurate object-centric bias for downstream action inference. Provided an RGB-D scene observation, KITE then executes a learned keypoint-conditioned skill to carry out the instruction. The combined precision of keypoints and parameterized skills enables fine-grained manipulation with generalization to scene and object variations. Empirically, we demonstrate KITE in 3 real-world environments: long-horizon 6-DoF tabletop manipulation, semantic grasping, and a high-precision coffee-making task. In these settings, KITE achieves a 75%, 70%, and 71% overall success rate for instruction-following, respectively. KITE outperforms frameworks that opt for pre-trained visual language models over keypoint-based grounding, or omit skills in favor of end-to-end visuomotor control, all while being trained from fewer or comparable amounts of demonstrations.
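As a rough illustration of the two-step structure described above, the minimal Python sketch below shows how a single KITE step could be organized: grounding an instruction to a 2D image keypoint, then running a keypoint-conditioned skill on the RGB-D observation. The interfaces (GroundingModel, Skill, kite_step) and their signatures are illustrative assumptions, not KITE's actual implementation.

```python
from typing import Protocol
import numpy as np


class GroundingModel(Protocol):
    """Maps an RGB image + language instruction to a 2D keypoint and a skill label."""
    def predict(self, rgb: np.ndarray, instruction: str) -> tuple[tuple[int, int], str]: ...


class Skill(Protocol):
    """A keypoint-conditioned skill backed by a learned waypoint policy."""
    def predict_waypoints(self, point_cloud: np.ndarray,
                          keypoint_uv: tuple[int, int]) -> np.ndarray: ...
    def execute(self, waypoints: np.ndarray) -> None: ...


def kite_step(rgb: np.ndarray, point_cloud: np.ndarray, instruction: str,
              grounding_model: GroundingModel, skill_library: dict[str, Skill]) -> None:
    # Step 1 (grounding): attend to the instruction-relevant pixel in the scene
    # and decide which skill to run.
    keypoint_uv, skill_name = grounding_model.predict(rgb, instruction)

    # Step 2 (acting): the chosen skill's waypoint policy consumes the scene
    # point cloud plus the grounded keypoint and outputs waypoints to execute.
    skill = skill_library[skill_name]
    waypoints = skill.predict_waypoints(point_cloud, keypoint_uv)
    skill.execute(waypoints)
```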
Overview Video

KITE Experimental Results
Tabletop Long-Horizon Instruction Following
Tier 1
Simple visual scene + simple instructions
"Open the top drawer"
"Put the lemon into the top drawer"
"Close the top drawer"
Tier 2
Cluttered visual scene + simple instructions
"Close the bottom drawer"
"Pick up the green bowl"
"Put the expo marker into the green bowl"
Tier 3
Cluttered visual scene + free-form instructions
"Give the 3rd drawer a tug"
"Let's put the ketchup away"
"Plop the carrot into the green bowl"
Semantic Grasping
Rigid Items
Deformables
Articulated Items
Coffee Making

pour_cup / refill_keurig
reorient_mug

load_pod
KITE Failure Modes

We observe the following main failure modes with KITE:
Skill Imprecision: For each skill in its library, KITE learns a waypoint policy mapping an input scene point cloud to a set of K waypoints that parameterize the skill. Even slight offsets in waypoint prediction can lead to manipulation errors. This failure mode is most apparent in fine-grained tasks such as pouring, and less of an issue when grasping larger objects with a higher tolerance for error.
Keypoint Mispredictions: Since KITE learns keypoint-conditioned skills, any imprecision or error in 2D keypoint prediction can compound downstream (see the sketch after this list for how keypoint error propagates to 3D). We mostly observe that KITE's grounding model attends to the wrong object as visual clutter increases, and KITE occasionally struggles to resolve symmetry in semantic grounding (e.g., attending to the "left handle" instead of the "right handle" and vice versa).
Grasping Inaccuracies: KITE's grasping skill is trained on 50 demonstrations per task; with this small number of demonstrations, KITE sometimes mispredicts the grasp orientation for a given object, leading to slippage, drops, or an unstable grasp.
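To make the compounding effect concrete, the sketch below shows a standard pinhole deprojection that lifts a predicted 2D keypoint to a 3D point using depth and camera intrinsics. This is a generic illustration, not necessarily the exact lifting KITE uses; the point is that a few pixels of keypoint error become a proportional Cartesian offset that the downstream waypoint policy inherits.

```python
import numpy as np

def deproject_keypoint(keypoint_uv: tuple[int, int],
                       depth: np.ndarray,
                       K: np.ndarray) -> np.ndarray:
    """Lift a 2D keypoint (u, v) to a 3D point in the camera frame using the
    depth image and a 3x3 pinhole intrinsics matrix K (illustrative helper)."""
    u, v = keypoint_uv
    z = depth[v, u]                      # depth at the predicted pixel, in meters
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.array([x, y, z])

# A keypoint error of a few pixels shifts the 3D conditioning point by roughly
# (pixel error) * z / fx meters, which the waypoint policy cannot correct for.
```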
Baseline Results
We benchmark against PerAct [1], an end-to-end voxels-to-actions approach to instruction following, and RobotMoo [2], a framework similar in spirit to KITE that replaces keypoint-based grounding with pre-trained VLMs (Grounding DINO [3] + Segment Anything [4]). On the left, we visualize the resulting behaviors of both methods on tabletop manipulation.
RobotMoo

RobotMoo is limited by the capacity of VLMs to accurately localize semantic features. Grounding DINO and Segment Anything prove viable for identifying scene semantic features like object instances, leading to decent pick-and-place behavior. However, they often produce false negatives or inaccuracies for object semantic features like drawer handles. This is most apparent with drawer opening/closing in the above video.
PerAct

While PerAct demonstrates some performant instruction following on maneuvers like drawer closing, it struggles with high-precision tasks like grasping objects and drawer handles. Because the method is end-to-end, its action space includes noisy predicted gripper open/close commands, which further complicate contact-rich tasks.
PerAct Waypoint Predictions
Ground Truth Waypoint / Predicted Waypoint
"Open the 2nd drawer"
❌ Imprecise handle localization
"Open the top drawer"
✔️ Good alignment
"Open the bottom drawer"
❌ Wrong handle prediction