Learning Residual NeRFs for Transparent Object Manipulation
Bardienus P. Duisterhof, Yuemin Mao, Si Heng Teng, Jeffrey Ichnowski
ICRA Submission #3173
Abstract
Transparent objects are ubiquitous in industry, pharmaceuticals, and households. Grasping and manipulating these objects is a major challenge for robots. Existing methods have difficulty reconstructing complete depth maps for challenging transparent objects, leaving holes in the depth reconstruction. Recent work has shown that neural radiance fields (NeRFs) work well for depth perception in scenes with transparent objects, and that these depth maps can be used to grasp transparent objects with high accuracy. However, NeRF-based depth reconstruction still struggles with especially challenging transparent objects and lighting conditions. In this work, we introduce Residual-NeRF, a NeRF algorithm that leverages images of the scene without transparent objects. This fits a common use case in which a robot returns to the same area, such as a kitchen. By learning a background NeRF of the scene without the transparent object, we reduce the ambiguity faced by the residual NeRF. The residual NeRF learns to infer residual RGB values and densities, and the Mixnet learns how to combine the background and residual NeRFs. We contribute synthetic and real experiments that suggest the background NeRF improves depth perception of transparent objects. We also show examples where the improved depth perception led to better grasp robustness.
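To make the roles of the background NeRF, residual NeRF, and Mixnet concrete, the sketch below blends per-sample densities along a single ray and renders an expected depth from them. The blending rule is an assumption made for illustration (a convex blend with a weight `m` in [0, 1]); the abstract does not state the combination rule, and RGB values are omitted for brevity. The function names `mix_samples` and `expected_depth` are hypothetical. The depth computation is the standard NeRF volume-rendering quadrature.

```python
import math


def mix_samples(sigma_bg, sigma_res, m):
    """Blend background and residual densities for the samples on one ray.

    ASSUMPTION: a convex blend with mixing weight m in [0, 1] per sample.
    This is an illustrative rule, not necessarily the one used by Residual-NeRF.
    """
    return [(1.0 - w) * sb + w * sr for sb, sr, w in zip(sigma_bg, sigma_res, m)]


def expected_depth(sigma, t):
    """Expected termination depth along a ray (standard NeRF quadrature).

    sigma: density at each sample; t: distance of each sample from the camera.
    """
    depth, transmittance = 0.0, 1.0
    for i, (s, ti) in enumerate(zip(sigma, t)):
        # Distance to the next sample; the last interval is treated as very long.
        delta = t[i + 1] - ti if i + 1 < len(t) else 1e10
        alpha = 1.0 - math.exp(-s * delta)   # opacity of this interval
        depth += transmittance * alpha * ti  # weight = transmittance * alpha
        transmittance *= 1.0 - alpha         # light remaining after this sample
    return depth


# Toy ray: an opaque background surface at t=5, a residual object at t=3.
t = [1.0, 2.0, 3.0, 4.0, 5.0]
sigma_bg = [0.0, 0.0, 0.0, 0.0, 1e6]
sigma_res = [0.0, 0.0, 1e6, 0.0, 0.0]

background_only = expected_depth(mix_samples(sigma_bg, sigma_res, [0.0] * 5), t)
with_object = expected_depth(mix_samples(sigma_bg, sigma_res, [0, 0, 1, 0, 0]), t)
```

In this toy ray, a mixing weight of zero everywhere leaves the ray terminating on the background surface, while selecting the residual density at the third sample moves the rendered depth onto the added object.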
Real-World Depth Mapping
We collect three real-world cluttered datasets of increasing difficulty. Scene A has four opaque objects in the background scene; the evaluation scene adds a transparent coffee container containing coffee. Scene B has the same four opaque objects in the background scene; the evaluation scene adds a wine glass. Scene C has six transparent objects in the background scene; the evaluation scene adds three transparent objects: a wine glass, kitchen wrap, and a glass bottle with a blue cap.
The results show that Residual-NeRF outputs depth maps with less noise and fewer gaps, improving the quality of depth reconstruction on cluttered real-world scenes.
Figure: Scene A, Scene B, and Scene C, each with panels Dex-NeRF Depth, Residual-NeRF Depth, and Residual-NeRF RGB.
Synthetic Scene Depth Mapping
We use Blender to render singulated transparent objects, including tableware and containers found in homes and plastic-wrapped medical supplies found in hospitals. For each fully transparent object, we render two scenes: lying flat on the table ("Flat") and sitting propped on a small block ("Up").
Here we show the results on 4 scenes as examples. As displayed in the images below, Residual-NeRF reduces holes and noise in the depth maps of singulated synthetic objects.
Figure: Bowl, Drink Flat, Drink Up, and Wine Flat, each with panels GT RGB, NeRF, Dex-NeRF, and Residual-NeRF.
Quantitatively, Residual-NeRF reconstructs depth maps with lower root mean squared error (RMSE) and mean absolute error (MAE) than the baselines, indicating improved depth perception.
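For reference, the two metrics can be computed from a predicted and a ground-truth depth map as in the minimal sketch below. The function name `depth_errors` and the default masking rule (ignoring pixels whose ground-truth depth is non-finite or non-positive) are assumptions for illustration and may differ from the evaluation protocol used for the reported numbers.

```python
import numpy as np


def depth_errors(pred, gt, valid=None):
    """Return (rmse, mae) between a predicted and a ground-truth depth map."""
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    if valid is None:
        # ASSUMPTION: skip pixels with no usable ground-truth depth.
        valid = np.isfinite(gt) & (gt > 0)
    err = pred[valid] - gt[valid]
    rmse = float(np.sqrt(np.mean(err ** 2)))  # penalizes large errors more
    mae = float(np.mean(np.abs(err)))         # average absolute deviation
    return rmse, mae


# Toy 2x2 depth maps (arbitrary units): one pixel is off by 2.
rmse, mae = depth_errors([[1, 2], [3, 4]], [[1, 2], [3, 6]])
```

RMSE weights large errors such as holes more heavily than MAE does, so reporting both gives a view of outliers as well as typical error.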
Training Speed
In addition to depth map quality, the time cost of depth reconstruction is also critical to making NeRF applicable to real-world tasks. Therefore, we also compare the training speed of Residual-NeRF to the baselines.
The GIFs below show the depth maps generated at the same time steps during training with NeRF, Dex-NeRF, and Residual-NeRF. Residual-NeRF reconstructs a high-quality depth map much faster than NeRF and Dex-NeRF.
Figure: training-progress GIFs for Bowl, Drink Flat, Drink Up, and Wine Flat, each comparing NeRF, Dex-NeRF, and Residual-NeRF.
The quantitative results, RMSE over training time, also suggest that Residual-NeRF greatly improves training speed with respect to depth reconstruction.
Figure: RMSE over training time for Bowl, Drink Flat, Drink Up, and Wine Flat.
Grasping
We run physical experiments to demonstrate that Residual-NeRF can be used to grasp challenging objects. For each object shown here, Residual-NeRF reconstructs a depth map from real-world images. The depth map is then fed into Dex-Net 2.0 to predict an optimal grasp.
Figure: Shot Glass, Tape, and Wine, each with panels Dex-NeRF, Residual-NeRF, and Residual-NeRF Grasp.
Views Ablation
To evaluate the performance of Residual-NeRF with a decreasing number of views, we contribute an ablation that tracks the mean absolute error (MAE) and root mean squared error (RMSE) of the generated depth maps. The results below suggest Residual-NeRF is more robust to a reduced number of views than the baselines.
Figure: MAE and RMSE versus number of views for Drink Up, Drink Flat, Bowl, and Wine Flat.
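A views ablation needs a rule for choosing which training images to keep at each view count. The page does not say how the subsets were selected, so the sketch below shows one simple option as an assumption: keeping evenly spaced view indices so that coverage of the scene stays roughly uniform. The function name `subsample_views` is hypothetical.

```python
def subsample_views(num_views, keep):
    """Pick `keep` evenly spaced view indices out of `num_views`.

    ASSUMPTION: evenly spaced selection; the actual ablation may have
    used a different rule (for example, random subsets).
    """
    if not 1 <= keep <= num_views:
        raise ValueError("keep must be between 1 and num_views")
    if keep == 1:
        return [0]
    # Integer arithmetic keeps the first and last views and avoids duplicates.
    return [i * (num_views - 1) // (keep - 1) for i in range(keep)]


# Example: reduce a 10-view capture to 4 views.
subset = subsample_views(10, 4)
```

Each subset would then be used to train every method from scratch, with MAE and RMSE computed on the resulting depth maps.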