Technical Summary Video
Deep object pose estimators are often unreliable and overconfident, especially when the input image lies outside the training domain, for instance in sim-to-real transfer.
The following two images show pose estimates of the Ketchup object from a state-of-the-art pose estimator. Both results are highly confident, but the right one is wrong!
For many robotics tasks, we need to efficiently and robustly quantify the uncertainty of pose estimates produced by deep learning-based object pose estimators.
Prior uncertainty quantification methods for object pose estimation require heavy modifications to the training process or to the model inputs.
We develop a simple, efficient, and plug-and-play uncertainty quantification method for the 6-DoF object pose estimation task, using an ensemble of K pre-trained estimators with different architectures and/or training data sources.
We first train K deep object pose estimators with different architectures and training data sources. For example, here we present three models with two different architectures, trained on two different synthetic data sources.
Then, given an input image, we obtain K pose predictions and compute their average disagreement under a metric function f as the uncertainty estimate.
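As a concrete illustration, here is a minimal sketch of the disagreement computation. It assumes the disagreement is averaged over all pairs of predictions and that each estimator exposes a hypothetical predict_pose(image) method; both are illustrative choices, not the exact implementation.

```python
from itertools import combinations

import numpy as np


def ensemble_uncertainty(estimators, image, f):
    """Uncertainty as the average pairwise disagreement among the K pose
    predictions, measured by the metric function f."""
    # One pose hypothesis per estimator (predict_pose is a hypothetical API).
    poses = [est.predict_pose(image) for est in estimators]
    # Average disagreement over all K*(K-1)/2 unordered pairs of predictions.
    return np.mean([f(p_i, p_j) for p_i, p_j in combinations(poses, 2)])
```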
We study four types of disagreement metric f, including:
Average distance (ADD), a learning-free metric (see the sketch after this list)
A learned metric (note that this metric requires labeled data from the target domain)
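For reference, here is a minimal sketch of ADD as a disagreement metric between two pose hypotheses. It assumes poses are given as 4x4 homogeneous transforms and the object's 3D model points as an (N, 3) array; this data layout is an assumption for illustration.

```python
import numpy as np


def add_disagreement(pose_a, pose_b, model_points):
    """ADD between two pose hypotheses: the mean distance between the
    object's 3D model points transformed by each of the two poses."""
    # Apply each 4x4 pose (rotation + translation) to the (N, 3) model points.
    pts_a = model_points @ pose_a[:3, :3].T + pose_a[:3, 3]
    pts_b = model_points @ pose_b[:3, :3].T + pose_b[:3, 3]
    return np.linalg.norm(pts_a - pts_b, axis=1).mean()
```

This can be passed directly as the metric f in the sketch above, e.g. f = lambda a, b: add_disagreement(a, b, model_points).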
Two examples (bigger disagreement -> more uncertainty, smaller disagreement -> less uncertainty):
We first study the correlation between the proposed uncertainty estimates and the true pose estimation errors, using Spearman's rank correlation coefficient. This analysis is conducted on the real-world HOPE dataset.
We find that our method yields much stronger correlations than the baselines, and that ADD is the best learning-free disagreement metric, only slightly worse than the learned metric.
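A minimal sketch of this evaluation, assuming we already have per-image uncertainty values and the corresponding true pose errors (e.g., ADD against ground truth) as arrays:

```python
import numpy as np
from scipy.stats import spearmanr


def correlation_with_true_error(uncertainties, true_errors):
    """Spearman's rank correlation between per-image uncertainty estimates
    and true pose estimation errors; a stronger correlation is better."""
    rho, p_value = spearmanr(np.asarray(uncertainties), np.asarray(true_errors))
    return rho, p_value
```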
(Application I) We apply the proposed uncertainty quantification method to a camera perspective selection task. The data is generated with ViSII, a ray-tracing-based renderer. We find that our method significantly reduces the pose estimation errors of the selected frames.
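A minimal sketch of the selection step, assuming a set of candidate images of the same scene from different camera perspectives and the ensemble_uncertainty helper sketched above; the function name and loop structure are illustrative only.

```python
def select_best_view(estimators, candidate_images, f):
    """Choose the camera perspective whose ensemble prediction has the
    lowest disagreement, i.e., the lowest estimated uncertainty."""
    scores = [ensemble_uncertainty(estimators, image, f) for image in candidate_images]
    best_idx = min(range(len(scores)), key=lambda i: scores[i])
    return best_idx, scores[best_idx]
```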
(Application II) We then examine the utility of our uncertainty quantification method in an uncertainty-guided robotic grasping task, where we use it to choose the best point of view for a real robotic arm. Our method increases the grasping success rate from 35% to 90%. See the two videos below for examples.