RGBGrasp: Image-based Object Grasping by Capturing Multiple Views during Robot Arm Movement with Neural Radiance Fields


Chang Liu*, Kejian Shi*, Kaichen Zhou*, Haoxiao Wang, Jiyao Zhang and Hao Dong 

* Equal contribution. † Corresponding author.

[Paper] [Video(YouTube)] [Video(bilibili)]

Grasping objects of diverse shapes, materials, and textures remains a significant challenge in robotics. Unlike many prior works that rely on specialized point-cloud cameras or dense RGB views to acquire 3D information for grasping, this paper introduces RGBGrasp, an approach that perceives 3D scenes containing transparent and specular objects from a limited set of RGB views and achieves accurate grasping. Our method uses pre-trained depth prediction models to impose geometry constraints, enabling precise 3D structure estimation even under limited-view conditions. We further integrate hash encoding and a proposal-sampler strategy to significantly accelerate 3D reconstruction. Together, these design choices improve the adaptability and effectiveness of our algorithm in real-world scenarios. Through comprehensive experimental validation, we show that RGBGrasp succeeds across a wide spectrum of object-grasping scenarios, establishing it as a promising solution for real-world robotic manipulation tasks.

Demonstrations of our method and the baselines can be found here.

Overview

Fig. 1. Overview of RGBGrasp. We introduce a novel approach that reconstructs the 3D geometric information of a target scene from views acquired during standard grasping procedures. Our method is not limited to fixed viewpoints and can flexibly work with partial observations along different trajectories, depending on the requirements of the environment.

Fig. 2. RGBGrasp Pipeline. The robot captures multiple views while approaching the objects and uses them to build a multi-scale hash table. A proposal sampler is then trained to refine the sampling positions for a subsequent fine predictor, which outputs color and density for individual points; the density is used to construct the final point cloud. This point cloud serves as input to a pre-trained grasping module that predicts a 6-DoF grasp pose. Throughout optimization, the monocular depth network and the grasping module remain frozen and are not trained, whereas the hash table, proposal sampler, and NeRF MLP are actively updated.
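As a rough illustration of the point-cloud construction step of this pipeline, the minimal sketch below (not the authors' implementation; `density_fn`, the workspace bounds, and `sigma_threshold` are hypothetical stand-ins for the trained fine predictor and its settings) queries the density field on a regular 3D grid and keeps the points whose density exceeds a threshold, yielding the cloud that the pre-trained grasping module would consume.

```python
import torch

@torch.no_grad()
def density_field_to_point_cloud(density_fn, workspace_min, workspace_max,
                                 resolution=128, sigma_threshold=15.0):
    """Query the density field on a regular grid and keep points above a threshold."""
    axes = [torch.linspace(lo, hi, resolution)
            for lo, hi in zip(workspace_min, workspace_max)]
    grid = torch.stack(torch.meshgrid(*axes, indexing="ij"), dim=-1).reshape(-1, 3)
    sigma = torch.cat([density_fn(chunk) for chunk in grid.split(65536)])
    return grid[sigma.squeeze(-1) > sigma_threshold]  # (N, 3) occupied points

if __name__ == "__main__":
    # Toy density field standing in for the trained fine predictor: a solid sphere.
    sphere = lambda x: 50.0 * (x.norm(dim=-1, keepdim=True) < 0.1).float()
    cloud = density_field_to_point_cloud(sphere, (-0.2, -0.2, -0.2), (0.2, 0.2, 0.2))
    print(cloud.shape)  # this cloud would then be passed to the 6-DoF grasp module
```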

Results

Simulation

Fig. 3. Results of GraspNeRF and RGBGrasp with fixed viewpoints and various view ranges. RGBGrasp shows little performance degradation as the view range decreases. (SR: success rate; DR: declutter rate)

Fig. 4. Results of RGBGrasp and AnyGrasp with single-view or multi-view point clouds as input, using viewpoints along the approaching trajectory.

Fig. 5. Ablation Studies. We compare RGBGrasp with and without the depth rank loss in the simulator. Without depth supervision, the reconstructed point cloud contains more artifacts, especially on its boundary. Because most grasp modules rely on local geometry features to generate grasp poses, these artifacts can mislead them into detecting grasps around the artifacts. This is more likely to occur in pile scenes, where objects have smaller dimensions and are therefore more susceptible to artifacts.
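For context, a depth rank loss typically supervises rendered depth only through pairwise orderings given by the monocular depth network, which side-steps the scale and shift ambiguity of monocular predictions. The sketch below shows one common form such a loss can take; it is an illustrative assumption, not necessarily the exact loss used in RGBGrasp, and `num_pairs` and `margin` are hypothetical hyperparameters.

```python
import torch

def depth_rank_loss(rendered_depth, mono_depth, num_pairs=4096, margin=1e-4):
    """Hinge loss on random pixel pairs whose ordering should match the monocular cue."""
    flat_r, flat_m = rendered_depth.reshape(-1), mono_depth.reshape(-1)
    i = torch.randint(0, flat_r.numel(), (num_pairs,))
    j = torch.randint(0, flat_r.numel(), (num_pairs,))
    # +1 if the monocular network says pixel i is farther than pixel j, -1 otherwise
    order = torch.sign(flat_m[i] - flat_m[j])
    # penalize pairs whose rendered depths violate that ordering
    return torch.relu(margin - order * (flat_r[i] - flat_r[j])).mean()

# Example: random 64x64 rendered and monocular depth maps
rendered = torch.rand(64, 64, requires_grad=True)
loss = depth_rank_loss(rendered, torch.rand(64, 64))
loss.backward()
```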

Real World

Fig. 6. Results of RGBGrasp and two baselines (GraspNeRF, and AnyGrasp with a RealSense D415 as input) in the real world. We conduct 15 rounds of clutter-removal grasping. Each scene consists of 5 randomly picked everyday objects with diffuse, transparent, and specular materials.