Learning Object Relations with Graph Neural Networks for Target-Driven Grasping in Dense Clutter

Abstract

Robots in the real world frequently encounter identical objects in dense clutter. When evaluating grasp poses in these scenarios, a target-driven grasping system requires knowledge of the spatial relations between scene objects (e.g., proximity, adjacency, and occlusion). To address this challenge efficiently, we propose a target-driven grasping system that simultaneously considers object relations and predicts 6-DoF grasp poses. A densely cluttered scene is first formulated as a grasp graph, with nodes representing object geometries in the grasp coordinate frame and edges indicating spatial relations between objects. We design a Grasp Graph Neural Network (G2N2) that evaluates the grasp graph and finds the most feasible 6-DoF grasp pose for a target object. Additionally, we develop a shape-completion-assisted method for sampling diverse grasp poses, which improves sample quality and, consequently, training and grasping efficiency. We compare our method against several baselines in both simulated and real settings. In real-world experiments with novel objects, our approach achieves 77.78% grasping accuracy in densely cluttered scenarios, surpassing the best-performing baseline by more than 15%.
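To make the grasp-graph formulation concrete, the sketch below shows one plausible construction using PyTorch and PyTorch Geometric. It is illustrative only: the voxel encoder, the proximity-based edge rule (with an assumed radius), and the two-layer network are stand-ins, not the paper's actual G2N2 architecture.

```python
# Illustrative sketch (not the authors' released code): objects become graph
# nodes via a learned voxel encoder; edges connect objects whose centroids
# fall within an assumed proximity threshold.
import torch
import torch.nn as nn
from torch_geometric.data import Data
from torch_geometric.nn import GCNConv, global_mean_pool


def build_grasp_graph(voxel_grids, centroids, encoder, radius=0.15):
    """voxel_grids: (num_objects, 1, D, D, D) grids in the grasp frame;
    centroids: (num_objects, 3) object centers; radius: hypothetical
    proximity threshold (meters) defining which objects are related."""
    x = encoder(voxel_grids)                        # latent node embeddings
    dists = torch.cdist(centroids, centroids)       # pairwise distances
    src, dst = torch.nonzero((dists < radius) & (dists > 0), as_tuple=True)
    edge_index = torch.stack([src, dst], dim=0)     # spatial-relation edges
    return Data(x=x, edge_index=edge_index)


class G2N2Sketch(nn.Module):
    """Stand-in for the Grasp Graph Neural Network: two rounds of message
    passing over object relations, then a pooled per-graph grasping score."""

    def __init__(self, dim=128):
        super().__init__()
        self.conv1 = GCNConv(dim, dim)
        self.conv2 = GCNConv(dim, dim)
        self.score = nn.Linear(dim, 1)

    def forward(self, data):
        h = self.conv1(data.x, data.edge_index).relu()
        h = self.conv2(h, data.edge_index).relu()
        batch = torch.zeros(h.size(0), dtype=torch.long)  # single graph
        g = global_mean_pool(h, batch)
        return torch.sigmoid(self.score(g))         # overall grasping score
```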

Grasping one of several identical targets in dense clutter. Among the three meat cans, the one in the green bounding box is the least surrounded by other objects. Grasping this more accessible target with a flexible pose is more likely to succeed. By reasoning about object relations, the proposed target-driven grasping system predicts more feasible grasp poses in dense clutter.

Grasping pipeline. Our approach first segments the RGB-D image and localizes the target objects, then samples diverse 6-DoF grasp candidates for each target. The object point clouds are transformed into each candidate's grasp coordinate frame and voxelized. Next, we apply a graph transformation to construct grasp graphs, in which the voxel grids are encoded into latent features that serve as node embeddings. For a set of N grasp candidates, the G2N2 evaluates the grasp graph of each candidate and predicts an overall grasping score that simultaneously accounts for grasping stability and the spatial relations between objects.
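Read as pseudocode, the pipeline scores each of the N candidates in its own grasp frame and executes the highest-scoring one. The sketch below reuses build_grasp_graph from the earlier snippet; segment_scene, sample_grasps, transform_to_grasp_frame, and voxelize are hypothetical placeholders for the segmentation, sampling, and preprocessing stages that the caption describes but does not name.

```python
# Hypothetical candidate-ranking loop matching the pipeline (reuses
# build_grasp_graph and the imports from the sketch above).
def select_best_grasp(rgbd, target_id, encoder, g2n2, num_candidates=64):
    objects = segment_scene(rgbd)                   # per-object point clouds
    candidates = sample_grasps(objects[target_id], n=num_candidates)
    best_pose, best_score = None, float("-inf")
    for pose in candidates:                         # one graph per candidate
        # Express every object in this candidate's grasp coordinate frame so
        # node features are relative to the grasp being evaluated.
        clouds = [transform_to_grasp_frame(obj, pose) for obj in objects]
        grids = torch.stack([voxelize(c) for c in clouds])
        centroids = torch.stack([c.mean(dim=0) for c in clouds])
        graph = build_grasp_graph(grids, centroids, encoder)
        score = g2n2(graph).item()                  # stability + relations
        if score > best_score:
            best_pose, best_score = pose, score
    return best_pose, best_score
```

Re-encoding the scene relative to each candidate is the key design choice this loop illustrates: the same relational structure is evaluated from the perspective of every grasp hypothesis, so one network can rank all N candidates.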

Experiments in the real world. Block Clutter comprises training objects arranged at increasing densities; as the scene grows more complex, knowledge of object relations significantly facilitates grasping. Novel scenes contain unseen objects distinct from the training objects, which significantly increases the difficulty of relational reasoning.