Dex-NeRF: Using a Neural Radiance Field to
Grasp Transparent Objects

Jeffrey Ichnowski*, Yahav Avigal*, Justin Kerr, Ken Goldberg
* equal contribution

NeRF's learned density model provides high-quality depth maps that robots can use to perceive and grasp transparent objects.

Conference on Robot Learning (CoRL) 2021

Dex-NeRF was formerly known as NeRF-GTO.

Abstract

The ability to grasp and manipulate transparent objects is a major challenge for robots. Existing depth cameras have difficulty detecting, localizing, and inferring the geometry of such objects. We propose using neural radiance fields (NeRF) to detect, localize, and infer the geometry of transparent objects with sufficient accuracy to find and grasp them securely. We leverage NeRF's view-independent learned density, place lights to increase specular reflections, and perform a transparency-aware depth-rendering that we feed into the Dex-Net grasp planner. We show how additional lights create specular reflections that improve the quality of the depth map, and test a setup for a robot workcell equipped with an array of cameras to perform transparent object manipulation. We also create synthetic and real datasets of transparent objects in real-world settings, including singulated objects, cluttered tables, and the top rack of a dishwasher. In each setting we show that NeRF and Dex-Net are able to reliably compute robust grasps on transparent objects, achieving 90% and 100% grasp success rates in physical experiments on an ABB YuMi, on objects where baseline methods fail.
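The core of the pipeline is how depth is read out of the trained NeRF: rather than alpha-compositing an expected depth, Dex-NeRF takes the distance to the first sample along each ray whose learned density exceeds a threshold, so that partially transparent surfaces still register as geometry. Below is a minimal PyTorch sketch of that readout; the function name, tensor layout, and threshold value are illustrative assumptions, not the released implementation.

```python
import torch

def dex_nerf_depth(sigmas, t_vals, m=15.0):
    """Transparency-aware depth readout (sketch).

    sigmas: (num_rays, num_samples) densities queried from a trained NeRF.
    t_vals: (num_rays, num_samples) distances of the samples along each ray.
    m:      density threshold (illustrative value).
    """
    num_samples = sigmas.shape[-1]
    hits = sigmas > m                                    # (rays, samples) bool
    sample_idx = torch.arange(num_samples, device=sigmas.device).expand_as(hits)
    no_hit = torch.full_like(sample_idx, num_samples)
    # Index of the first above-threshold sample on each ray (num_samples if none).
    first_idx = torch.where(hits, sample_idx, no_hit).min(dim=-1).values
    valid = first_idx < num_samples
    depth = torch.gather(
        t_vals, -1, first_idx.clamp(max=num_samples - 1).unsqueeze(-1)
    ).squeeze(-1)
    depth[~valid] = float("nan")                         # ray never crossed a surface
    return depth
```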

Comparison of Depth Map Recovery

Stereo and structured-light depth sensors struggle to recover depth information from scenes containing transparent objects. Using the proposed solution, on the other hand, we're able to recover higher-quality depth maps suitable for computing robot grasps.

Real-world Scene

Real-world scenes in labs, homes, workplaces, and more contain transparent objects that existing depth sensors have difficulty sensing.

RealSense D410 Depth Image

Here, the depth output from the Intel RealSense D410 camera is missing objects and pixels, and object outlines are smudged and hard to distinguish.

Depth Map (Ours)

Using NeRF on the same scene, we're able to recover a depth map with all objects and pixels present and with crisp object outlines.

Dishwasher Real-world Scene

Dishwasher top racks are often loaded with transparent objects that are difficult to identify, localize, and infer the geometry of, yet unloading them will be one of many tasks that household robots of the future will tackle.

RealSense D410 Depth Image

Here, the Intel RealSense D410 camera is unable to recover depth from a large portion of the scene.

Depth Map with Grasps (Ours)

Using NeRF and Dex-Net, the geometry of the objects can be computed with sufficient accuracy to plan three candidate grasps, colored by their expected grasp quality (green indicates a higher quality).
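As a rough illustration of how such a depth map might be handed off to the grasp planner, here is a hedged sketch; `plan_grasps` is a hypothetical stand-in for the Dex-Net/GQ-CNN policy call (not its real API), and the hole-filling and ranking steps are assumptions about the pipeline rather than the exact procedure used in the paper.

```python
import numpy as np

def rank_grasps(depth_map, plan_grasps, num_candidates=3):
    """Hypothetical glue between the NeRF depth render and a grasp planner.

    depth_map:   (H, W) depth image rendered from the trained NeRF, in meters,
                 with NaN where no surface was found along the ray.
    plan_grasps: stand-in for a Dex-Net/GQ-CNN-style planner that maps a depth
                 image to a list of (grasp_pose, quality) pairs -- NOT a real API.
    """
    # Fill rays that never crossed a surface with the farthest observed depth
    # (roughly the table plane) so the planner sees a hole-free image.
    table_depth = np.nanmax(depth_map)
    depth_clean = np.where(np.isnan(depth_map), table_depth, depth_map)

    candidates = plan_grasps(depth_clean)                # [(pose, quality), ...]
    # Keep the top candidates by predicted quality; higher quality renders greener.
    return sorted(candidates, key=lambda g: g[1], reverse=True)[:num_candidates]
```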

Grasping Results (Physical)

For each object here, Dex-NeRF trains a model from real images, then computes a grasp on each. See the physical dataset below for more detail on the objects used in this experiment.

Tape Dispenser

Wineglass

Flask

Safety Glasses

Bottle

Lion Figurine

Grasping Results (Simulation)

In physics-based wrench-resistance simulation of predicted grasps, grasp robustness improves with additional training iterations and plateaus at around 50k to 60k iterations. The objects in this graph are included in the simulated datasets below.

Multiple vs. Single Light Sources

Using multiple light sources increases specular reflections and allows for better recovery of transparent objects.

Single Light Image

The scene was rendered using Blender with a single bright light source directly above the work surface.

Single Light Depth

We observe that the closer surfaces of the glasses are missing (e.g., the tall water glass, the wineglass).


Multiple Lights Image

The scene was rendered using Blender with an array of 5x5 (25) lights above the work surface.


Multiple Light Depth

With the additional light sources, the glasses are nearly fully recovered.
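For reference, a 5x5 light grid like the one used in these renders can be scripted with Blender's Python API; the spacing, height, and power below are assumed values, not the exact settings used for the figures.

```python
import bpy

# Add a 5x5 grid of point lights above the work surface to increase the number
# of specular highlights visible on transparent objects (values are assumed).
GRID = 5          # lights per side
SPACING = 0.25    # meters between neighboring lights (assumed)
HEIGHT = 1.0      # meters above the work surface (assumed)
POWER = 100.0     # light power in watts (assumed)

offset = (GRID - 1) * SPACING / 2.0
for i in range(GRID):
    for j in range(GRID):
        location = (i * SPACING - offset, j * SPACING - offset, HEIGHT)
        bpy.ops.object.light_add(type='POINT', location=location)
        bpy.context.object.data.energy = POWER
```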


Datasets

Datasets including transparent objects, both synthetic and real, are available on GitHub and linked below.

Singulated Objects Datasets (Physical)

Using Dex-NeRF, an ABB YuMi grasps and lifts transparent objects. On the left is the transparent object and the computed grasp being executed. In the middle is a visualization of the recovered depth map and computed grasp. On the right, we include a comparison with the point cloud generated by a PhoXi camera. Normally, a PhoXi camera generates a high-quality depth map from which Dex-Net can reliably compute grasps. However, transparent objects do not show up in its point cloud, and at best create "shadows" in the scene beneath them.
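For the side-by-side comparison, the depth map rendered from NeRF can be back-projected into a point cloud with a standard pinhole camera model, as in the sketch below; the intrinsics are placeholders, not the calibrated values from the experiments.

```python
import numpy as np

def depth_to_point_cloud(depth, fx, fy, cx, cy):
    """Back-project an (H, W) depth image into an (N, 3) camera-frame point
    cloud using a pinhole model; fx, fy, cx, cy are placeholder intrinsics."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    valid = np.isfinite(depth) & (depth > 0)
    z = depth[valid]
    x = (u[valid] - cx) * z / fx
    y = (v[valid] - cy) * z / fy
    return np.stack([x, y, z], axis=-1)
```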

Six paired panels follow, one per object: the Dex-NeRF depth map with the computed grasp, and the corresponding PhoXi point cloud.

Singulated Objects Datasets (Simulation)

Each singulated object shares the same camera and light positions. Along with each object, we show the depth map generated by Dex-NeRF from a top-down view and 3 grasp predictions color-coded according to their quality (green = robust, red = not robust).

Six panel groups follow, one per object: an overhead depth map and grasps 1 through 3.

Citing

@inproceedings{IchnowskiAvigal2021DexNeRF,
  title={{Dex-NeRF}: Using a Neural Radiance Field to Grasp Transparent Objects},
  author={Ichnowski*, Jeffrey and Avigal*, Yahav and Kerr, Justin and Goldberg, Ken},
  booktitle={Conference on Robot Learning (CoRL)},
  year={2021}
}