CoRL 2023
Ning Gao, Ngo Anh Vien, Hanna Ziesche, Gerhard Neumann
Estimating the 6D pose of objects is critical for meaningful robotic manipulation in the real world. Most existing approaches struggle to extend their predictions to scenarios where novel object instances are continuously introduced, especially under heavy occlusion. In this work, we propose a few-shot 6D pose estimation (FSPE) approach called SA6D, which uses a self-adaptive segmentation module to identify the novel target object and constructs a point cloud model of the target using only a small number of cluttered reference images. Unlike existing methods, SA6D does not require object-centric reference images or any additional object information, making it a more generalizable and scalable solution across categories. We evaluate SA6D on real-world tabletop object datasets and demonstrate that it outperforms existing FSPE methods, particularly in cluttered scenes with occlusions, while requiring fewer reference images.
SA6D consists of three modules: i) The online self-adaptation module discovers and segments the target object (e.g., the milk cow) in a cluttered scene, given a few posed RGB-D images as reference. Based on these segments, a canonical object point cloud model is constructed from the reference images and a local model from the test image. ii) The region proposal module outputs a region of interest (ROI) for the target object that is robust to occlusion by incorporating both visual and geometric features. A coarse 6D pose is then estimated by comparing the cropped test and reference images using Gen6D. iii) The refinement module further refines this pose using ICP. A sketch of the overall flow is given below.
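To make the three-stage flow concrete, here is a minimal Python sketch of the inference pipeline. The learned components are stood in by hypothetical callables (`adaptive_segmentor`, `propose_roi`, `gen6d_coarse_pose` are our placeholder names, not the released code); only the ICP refinement step uses a real library (Open3D). Thresholds are illustrative.

```python
# Minimal sketch of the SA6D inference pipeline (placeholder callables for
# the learned modules; Open3D for the ICP refinement).
import numpy as np
import open3d as o3d

def estimate_pose(test_rgb, test_depth, canonical_model, local_model,
                  adaptive_segmentor, propose_roi, gen6d_coarse_pose):
    # i) Segment the target object in the test image with the adapted segmentor.
    mask = adaptive_segmentor(test_rgb)

    # ii) Predict an occlusion-robust ROI from visual and geometric features,
    #     then estimate a coarse pose by comparing the cropped test image
    #     against the reference images (Gen6D).
    roi = propose_roi(test_rgb, test_depth, mask)
    coarse_pose = gen6d_coarse_pose(test_rgb, roi)  # 4x4 homogeneous transform

    # iii) Refine the coarse pose with point-to-point ICP between the
    #      canonical object model and the local model from the test view.
    src = o3d.geometry.PointCloud()
    src.points = o3d.utility.Vector3dVector(np.asarray(canonical_model))
    tgt = o3d.geometry.PointCloud()
    tgt.points = o3d.utility.Vector3dVector(np.asarray(local_model))
    result = o3d.pipelines.registration.registration_icp(
        src, tgt,
        max_correspondence_distance=0.01,  # 1 cm threshold; dataset-dependent
        init=coarse_pose,
        estimation_method=o3d.pipelines.registration.TransformationEstimationPointToPoint(),
    )
    return result.transformation  # refined 6D pose (4x4)
```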
A pretrained segmentor φ is first applied to the reference images to predict segmentations, and an adaptive segmentor φ* is initialized from φ. Using the ground-truth translation T_ref of the target object in the reference images, the object center can be reprojected into each image. For each reference image, the segment that contains the reprojected object center is chosen as the positive sample, while the remaining segments are treated as negative samples. An object-level representation of each segment is then computed by averaging the pixel-wise dense features from φ*. A contrastive loss over the positive and negative object representations updates φ* iteratively. After adaptation, φ* generates the target object representation r* by averaging all positive representations from the reference images. Given a test image, we obtain the representation of each candidate segment in the same way and compute the cosine similarity between each candidate and r*; the most similar candidate is chosen as the target object's segment. Meanwhile, the canonical and local object models are computed from the segments and depth images.
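The adaptation and matching steps can be summarized in a short PyTorch sketch. The dense feature extractor `phi_star`, the segment masks, and the positive/negative split are assumed given; the InfoNCE-style loss below is one plausible instantiation of the contrastive objective, not necessarily the paper's exact form.

```python
# PyTorch sketch of segment representations, the contrastive adaptation
# loss, and the test-time matching against r*.
import torch
import torch.nn.functional as F

def segment_representation(features, mask):
    # Object-level representation: average of the pixel-wise dense features
    # inside one segment. features: (C, H, W), mask: (H, W) bool.
    return features[:, mask].mean(dim=1)  # (C,)

def contrastive_adaptation_loss(pos_reps, neg_reps, temperature=0.1):
    # pos_reps: (P, C) positive representations, one per reference image
    #           (assumes P >= 2 so anchor/positive pairs exist).
    # neg_reps: (N, C) negative representations pooled over reference images.
    pos = F.normalize(pos_reps, dim=1)
    neg = F.normalize(neg_reps, dim=1)
    anchor, positives = pos[0], pos[1:]
    pos_sim = positives @ anchor / temperature  # (P-1,)
    neg_sim = neg @ anchor / temperature        # (N,)
    # InfoNCE-style objective: each positive should outscore all negatives.
    logits = torch.cat([pos_sim, neg_sim])
    return -(pos_sim - torch.logsumexp(logits, dim=0)).mean()

def match_target(candidate_reps, r_star):
    # r_star: (C,) mean of all positive representations after adaptation.
    # Pick the test-image segment most similar to r* in cosine similarity.
    sims = F.cosine_similarity(candidate_reps, r_star.unsqueeze(0), dim=1)
    return int(sims.argmax())
```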
Qualitative comparison between RePoNet and SA6D. The green bounding box denotes the ground-truth pose and blue denotes the prediction; for SA6D, blue denotes the prediction before refinement and red the final prediction.
If you want to cite our work, please use:
@inproceedings{gao2023sad,
title={{SA}6D: Self-Adaptive Few-Shot 6D Pose Estimator for Novel and Occluded Objects},
author={Ning Gao and Vien Anh Ngo and Hanna Ziesche and Gerhard Neumann},
booktitle={7th Annual Conference on Robot Learning},
year={2023},
}