SimNet: Enabling Robust Unknown Object Manipulation from Pure Synthetic Data via Stereo

Authors: Michael Laskey, Brijen Thananjeyan, Kevin Stone, Thomas Kollar, Mark Tjersland

Affiliation: Toyota Research Institute and University of California, Berkeley

SimNet Architecture

Our key insight is that robust sim2real transfer can be achieved by learning to extract geometry, rather than relying on an active depth sensor. In SimNet, the left and right stereo RGB images are each passed through a feature extractor, and the resulting features are fed into an approximate stereo matching module. The output of this stereo cost volume network (SCVN) is a low-resolution disparity image, which is fed, together with features from the left image, into a ResNet-FPN backbone and output prediction heads. The output heads predict high-level vision information such as room-level segmentation, oriented bounding boxes (OBBs), keypoints, and full-resolution disparity images.
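To make the stereo matching step concrete, here is a minimal NumPy sketch of a correlation-based cost volume followed by a winner-take-all disparity estimate. It is a generic illustration of the technique, not the SCVN used in SimNet: the function name `cost_volume`, the cosine-similarity score, and the `argmax` readout are all assumptions made for this example.

```python
import numpy as np

def cost_volume(left_feat, right_feat, max_disp):
    """Build a (max_disp, H, W) matching-score volume from (C, H, W) features.

    Hypothetical helper for illustration only. Scores are cosine similarities
    between a left pixel at column x and the right pixel at column x - d.
    """
    eps = 1e-8
    # Normalize each pixel's feature vector so the dot product is a cosine score.
    left = left_feat / (np.linalg.norm(left_feat, axis=0, keepdims=True) + eps)
    right = right_feat / (np.linalg.norm(right_feat, axis=0, keepdims=True) + eps)

    _, h, w = left.shape
    # Columns with no valid match at disparity d keep a score of -inf.
    volume = np.full((max_disp, h, w), -np.inf, dtype=np.float32)
    for d in range(max_disp):
        volume[d, :, d:] = (left[:, :, d:] * right[:, :, : w - d]).sum(axis=0)
    return volume

# Synthetic check: the left features are the right features shifted by 3 pixels.
rng = np.random.default_rng(0)
right_features = rng.standard_normal((16, 4, 32))
left_features = np.roll(right_features, 3, axis=2)

# Winner-take-all: pick the disparity with the highest score at each pixel.
disparity = cost_volume(left_features, right_features, max_disp=8).argmax(axis=0)
```

Running the matching on low-resolution feature maps, as the text describes, keeps the volume small; a learned network would typically replace the `argmax` with a differentiable readout.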

By learning to focus on geometry, sim2real transfer can be performed using only very low-quality scenes. We created three domains (cars, small objects, and t-shirts) using a non-photorealistic simulator with domain randomization. Dataset generation is parallelized across machines, and a dataset can be generated in an hour for $60 (USD) in cloud compute cost.

Qualitative Results

SimNet On Indoor Scenes

SimNet generalizes robustly across a large diversity of scenes. Shown on the left are the OBB predictions and room-level segmentation from SimNet. Because we predict OBBs rather than absolute pose, the predicted box can rotate freely along principal axes of similar size.
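The orientation ambiguity can be illustrated with a small sketch, assuming a box parameterized by a center, three extents, and a rotation matrix (this parameterization and the helper names are hypothetical, not SimNet's output format). When two extents are equal, a 90-degree rotation about the remaining axis yields the same set of corners, so the box alone cannot pin down the object's heading.

```python
import numpy as np

def box_corners(center, extents, rotation):
    """Return the 8 corners of an oriented box as an (8, 3) array."""
    signs = np.array([[sx, sy, sz]
                      for sx in (-1, 1) for sy in (-1, 1) for sz in (-1, 1)])
    local = signs * (np.asarray(extents, dtype=float) / 2.0)
    # Rotate the box-frame corners into the world frame, then translate.
    return np.asarray(center, dtype=float) + local @ np.asarray(rotation).T

def rot_z(angle_rad):
    """Rotation matrix for a rotation of angle_rad about the z axis."""
    c, s = np.cos(angle_rad), np.sin(angle_rad)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def same_box(a, b, tol=1e-9):
    """True if two corner sets coincide, regardless of corner ordering."""
    dist = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    return bool((dist.min(axis=1) < tol).all() and (dist.min(axis=0) < tol).all())

center = [0.0, 0.0, 0.0]
quarter_turn = rot_z(np.pi / 2)

# Equal x and y extents: the rotated box is indistinguishable from the original.
square = (0.1, 0.1, 0.2)
ambiguous = same_box(box_corners(center, square, np.eye(3)),
                     box_corners(center, square, quarter_turn))

# Distinct extents: the same rotation produces a different box.
oblong = (0.1, 0.3, 0.2)
distinct = same_box(box_corners(center, oblong, np.eye(3)),
                    box_corners(center, oblong, quarter_turn))
```

Here `ambiguous` is `True` and `distinct` is `False`, which is why a box with similar-sized axes may appear rotated between frames without that being a prediction error.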

Grasping Experiments Across Homes and Objects

The predictions from SimNet can be used to enable robust manipulation across our fleet of robots in diverse home scenarios. In our grasping experiments, across 4 homes and 40 objects, SimNet achieves 92.5% grasp success across all objects. By relying exclusively on stereo information for sim-to-real transfer, we can manipulate optically challenging objects such as glassware.

T-Shirt Folding From Predicted Keypoints

When trained on the synthetic t-shirt dataset, SimNet can predict task-relevant keypoints for t-shirt folding. Shown above are the keypoints being used for manipulation on a Toyota HSR. Although the network is trained only on low-quality synthetic data, its predictions generalize across t-shirts and work in real home environments. We report the prediction accuracy of the keypoints in the paper.

Car Detection with SimNet

To evaluate how well SimNet works on existing benchmarks, we also evaluated it on the KITTI 2D car detection task. Despite being trained only on synthetic data, SimNet is able to reliably detect cars in the real world. However, since it relies on stereo information, it struggles with cars at a distance. Thus, SimNet is best suited for applications with a limited range from the camera, such as robot manipulation.
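The range limitation follows from standard rectified stereo geometry, where depth is z = f * B / d for focal length f (in pixels), baseline B, and disparity d. A fixed disparity error therefore translates into a depth error that grows roughly with z squared. The sketch below illustrates this; the focal length, baseline, and one-pixel error are made-up values for illustration, not the parameters of the cameras used in the paper.

```python
def depth_from_disparity(disparity_px, focal_px, baseline_m):
    """Depth in meters for a rectified stereo pair: z = f * B / d."""
    return focal_px * baseline_m / disparity_px

def depth_error(depth_m, focal_px, baseline_m, disparity_error_px=1.0):
    """First-order depth uncertainty: |dz| ~= z^2 / (f * B) * |dd|."""
    return depth_m ** 2 / (focal_px * baseline_m) * disparity_error_px

# Hypothetical camera: 500 px focal length, 10 cm baseline.
FOCAL_PX = 500.0
BASELINE_M = 0.1

near_depth = depth_from_disparity(50.0, FOCAL_PX, BASELINE_M)  # about 1 m
near_error = depth_error(1.0, FOCAL_PX, BASELINE_M)    # about 0.02 m per pixel of error
far_error = depth_error(10.0, FOCAL_PX, BASELINE_M)    # about 2 m per pixel of error
```

Under these assumed values, a tenfold increase in distance makes the same one-pixel disparity error about a hundred times more costly in depth, which is consistent with the weaker performance on distant cars.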

SimNet for Home Tidying

We have since extended our approach to perform end-to-end tidying of unseen dining room tables.