Additional Results

Emergent Multimodal Heatmaps

Since KITE's keypoint grounding model is trained on (image, instruction, keypoint) pairs, any multimodality in the demonstration data is reflected in the distribution of the learned heatmaps. In our demonstrations, instructions like “Open any drawer” are paired with images where the annotated keypoint lies on the top, middle, or bottom drawer. An emergent property of the learned heatmap distributions is thus that they attend to salient regions rather than individual points (middle column); we ultimately obtain a localized keypoint by taking the argmax over this heatmap (right column).
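Concretely, the keypoint extraction step is just an argmax over the predicted heatmap. Below is a minimal sketch assuming the heatmap is an (H, W) array of per-pixel scores; the function name is illustrative, not KITE's actual implementation.

import numpy as np

def heatmap_to_keypoint(heatmap):
    """Collapse an (H, W) grounding heatmap into a single pixel keypoint.
    The heatmap may spread mass over several salient regions (e.g., all
    three drawer handles); the argmax commits to the most likely pixel."""
    v, u = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    return int(u), int(v)  # (column, row) pixel coordinates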


Simulated Semantic Grasping Experiments

To further illustrate KITE's ability to generalize to diverse objects and object parts, we provide additional results on semantic grasping of synthetic objects in a custom PyBullet simulation environment. In particular, we consider 56 objects sampled from the YCB object dataset [2] (assets obtained from [1]). 
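For reference, a minimal sketch of this scene setup is shown below; the asset paths, object names, and placements are illustrative assumptions about a local copy of the URDF models from [1], not the exact experiment configuration.

import pybullet as p
import pybullet_data

p.connect(p.DIRECT)                                    # headless physics server
p.setAdditionalSearchPath(pybullet_data.getDataPath())
p.setGravity(0, 0, -9.8)
p.loadURDF("plane.urdf")                               # tabletop stand-in

ASSET_ROOT = "urdf_models/models"                      # hypothetical local path to the assets from [1]
object_ids = []
for name, xy in [("banana", (0.0, 0.0)), ("mug", (0.15, 0.05))]:
    urdf_path = f"{ASSET_ROOT}/{name}/model.urdf"      # assumed directory layout of [1]
    object_ids.append(p.loadURDF(urdf_path, basePosition=[xy[0], xy[1], 0.05]))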

We train the grounding module on 30 objects and 20+ unique manually labeled object parts (e.g., end/middle of banana, dispenser/side of cleanser, center/cap of marker, lip/handle of mug, body parts of animals, end/middle/non-button area/button area of remote, right/left side/handle of tools).
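To make the annotation format concrete, each training example pairs an RGB observation and a part-referencing instruction with a hand-labeled pixel keypoint. The sketch below is purely illustrative: the file name, pixel values, and dictionary layout are hypothetical, not the actual data format.

example_annotation = {
    "image": "obs/banana_0012.png",            # hypothetical file name
    "instruction": "Pick up the banana by the end.",
    "keypoint_uv": (241, 187),                 # hypothetical labeled pixel (u, v)
}

part_vocabulary = {                            # a few of the labeled parts listed above
    "banana":   ["end", "middle"],
    "cleanser": ["dispenser", "side"],
    "marker":   ["center", "cap"],
    "mug":      ["lip", "handle"],
}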

"by the rim"

"by the end"

To implement the pick skill in simulation, we use ContactGraspNet [3], a grasp candidate generator pre-trained on millions of synthetic objects. ContactGraspNet takes a scene point cloud as input and outputs candidate parallel-jaw grasps with affordance scores (e.g., on the right, we visualize a cleanser bottle from the YCB dataset along with ContactGraspNet grasps and their scores).

To implement a keypoint-conditioned grasp, we simply select the highest-scoring candidate grasp in close proximity to the deprojected predicted keypoint and execute it.
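As a concrete sketch of this selection step, assume ContactGraspNet's candidates are available as 4x4 camera-frame poses with scalar scores; the deprojection helper, 5 cm radius, and function names below are illustrative assumptions, not the exact implementation.

import numpy as np

def deproject(u, v, depth, fx, fy, cx, cy):
    """Back-project pixel (u, v) into the camera frame using its depth."""
    z = depth[v, u]
    return np.array([(u - cx) * z / fx, (v - cy) * z / fy, z])

def select_keypoint_conditioned_grasp(grasps, scores, keypoint_xyz, radius=0.05):
    """Return the highest-scoring grasp whose position lies within `radius`
    meters of the deprojected keypoint; `grasps` is (N, 4, 4), `scores` is (N,)."""
    positions = grasps[:, :3, 3]
    near = np.linalg.norm(positions - keypoint_xyz, axis=1) < radius
    if not near.any():
        return None                            # no candidate near the keypoint
    best = np.argmax(np.where(near, scores, -np.inf))
    return grasps[best]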

Results

We evaluate on both seen object/language pairings and the 26 object instances held out from training. KITE achieves 16/20 (80%) successful semantic grasps over 20 trials of seen object/language pairings and 14/20 (70%) success on unseen object instances. We provide visualizations of the grounding heatmaps and grasping results for various instructions below.

"Pick up the banana by the end."

"Pick up the cleanser by the dispenser."

"Pick up the black marker by the cap."

"Pick up the mug by the lip."

"Pick up the bowl by the handle."

"Pick up the teddy by the butt."

"Pick up the remote by the non-button area."

"Pick up the clamp by the right handle."

Towards Generalization

Qualitatively, we also find that ContactGraspNet generates compelling grasps on real point clouds. Because KITE is agnostic to the exact implementation of each skill (so long as it is keypoint-conditioned), we are excited to integrate KITE with VLMs as they improve, and with pre-trained grasping models like ContactGraspNet, for much richer generalization across objects (category, geometry, etc.).

Tabletop manipulation environment with grasps generated by ContactGraspNet [3].

References

[1] https://github.com/ChenEating716/pybullet-URDF-models/tree/main/urdf_models/models

[2] Calli, Berk, et al. "Yale-CMU-Berkeley dataset for robotic manipulation research." The International Journal of Robotics Research 36.3 (2017): 261-268.

[3] Sundermeyer, Martin, et al. "Contact-graspnet: Efficient 6-dof grasp generation in cluttered scenes." 2021 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2021.