Joao Carvalho, An T. Le, Philipp Jahr, Qiao Sun, Julen Urain, Dorothea Koert, Jan Peters
Abstract
Grasping objects successfully from a single-view camera is crucial in many robot manipulation tasks. One approach to this problem is to leverage simulation to create large datasets of object and grasp-pose pairs, and then to learn a conditional generative model that can be queried quickly during deployment. However, grasp pose data is highly multimodal, since there are typically many ways to grasp an object. Hence, in this work, we learn a grasp generative model based on diffusion models to sample candidate grasp poses given a partial point cloud of an object. A novel aspect of our method is performing diffusion on the manifold of rotations and proposing a collision-avoidance cost guidance that improves the grasp success rate during inference. To accelerate grasp sampling, we employ recent techniques from the diffusion literature that reduce inference time. We show in simulation and real-world experiments that our approach can grasp several objects from raw depth images with a 90% success rate, and we benchmark it against multiple baselines.
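As a rough illustration of the cost-guidance idea mentioned in the abstract, the sketch below shows a generic guided denoising loop over grasp translations: a noise-prediction network proposes samples, and the gradient of a simple collision cost nudges each denoising step away from the point cloud. This is not the authors' implementation; the `denoiser` signature, the collision cost, the noise schedule, and `guide_scale` are all assumptions for illustration, and rotations are handled separately on SO(3) (see the sketch after the architecture description below).

```python
import torch

def collision_cost(trans, points, margin=0.02):
    """Penalize grasp translations that come closer than `margin` to any cloud point."""
    d = torch.cdist(trans, points)                      # (n_grasps, n_points)
    return torch.relu(margin - d.min(dim=1).values).sum()

def sample_grasp_translations(denoiser, points, n_grasps=32, n_steps=50,
                              guide_scale=1.0):
    """Reverse diffusion over grasp translations with collision-cost guidance.
    `denoiser(x, points, k)` is a hypothetical noise-prediction network."""
    betas = torch.linspace(1e-4, 0.02, n_steps)
    alphas_bar = torch.cumprod(1.0 - betas, dim=0)
    x = torch.randn(n_grasps, 3)                        # start from pure noise
    for k in reversed(range(n_steps)):
        x = x.detach().requires_grad_(True)
        eps = denoiser(x, points, k)                    # predicted noise
        grad = torch.autograd.grad(collision_cost(x, points), x)[0]
        # DDPM-style posterior mean, shifted by the collision-cost gradient
        a, a_bar = 1.0 - betas[k], alphas_bar[k]
        mean = (x - betas[k] / torch.sqrt(1.0 - a_bar) * eps) / torch.sqrt(a)
        mean = mean - guide_scale * grad
        noise = torch.randn_like(x) if k > 0 else torch.zeros_like(x)
        x = (mean + torch.sqrt(betas[k]) * noise).detach()
    return x
```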
The input to GDN is a partial point cloud view of the object to grasp (blue dots), and the output is a distribution of gripper poses obtained by denoising for N steps on the SO(3) × R^3 manifold. The denoising network is a conditional ResNet that predicts a translation vector and a rotation vector in the Lie algebra. These vectors update the means of the posterior distribution, from which new samples are drawn and fed back into the network.
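To make this update rule concrete, here is a hedged sketch of a single denoising step on SO(3) × R^3: a (hypothetical) conditional network predicts a rotation vector in the Lie algebra so(3) and a translation vector, and the rotation is applied through the exponential map. The `network` signature and `step_size` are illustrative assumptions, and the step is shown as a deterministic perturbation rather than the full posterior update used by GDN.

```python
import torch

def hat(w):
    """Map axis-angle vectors (..., 3) to skew-symmetric matrices (..., 3, 3)."""
    zero = torch.zeros_like(w[..., 0])
    return torch.stack([
        torch.stack([zero, -w[..., 2], w[..., 1]], dim=-1),
        torch.stack([w[..., 2], zero, -w[..., 0]], dim=-1),
        torch.stack([-w[..., 1], w[..., 0], zero], dim=-1),
    ], dim=-2)

def so3_exp(w):
    """Rodrigues formula: exponential map from the Lie algebra so(3) to SO(3)."""
    theta = w.norm(dim=-1, keepdim=True).clamp(min=1e-8)[..., None]  # (..., 1, 1)
    K = hat(w / theta[..., 0])
    I = torch.eye(3).expand_as(K)
    return I + torch.sin(theta) * K + (1.0 - torch.cos(theta)) * (K @ K)

def denoise_step(network, R, t, points, k, step_size=0.1):
    """One denoising update on SO(3) x R^3. `network` is a hypothetical
    conditional ResNet returning a rotation tangent vector and a translation
    vector, each of shape (batch, 3)."""
    w, v = network(R, t, points, k)
    R_new = R @ so3_exp(step_size * w)   # rotation update via the exponential map
    t_new = t + step_size * v            # translation update in R^3
    return R_new, t_new
```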
Real-world grasping experiments
Videos show grasps of each object (Book, Cup, Cap) and failure cases, comparing GDN - DDIM, CVAE, and SE(3)-DiffusionFields.