Abstract

Learning to estimate object pose often requires ground-truth (GT) labels, such as CAD models and absolute-scale object poses, which are expensive and laborious to obtain in the real world. To tackle this problem, we propose an unsupervised domain adaptation (UDA) method for category-level object pose estimation, called UDA-COPE. Inspired by recent multi-modal UDA techniques, the proposed method exploits a teacher-student self-supervised learning scheme to train a pose estimation network without using target-domain pose labels. We also introduce a bidirectional filtering method between the predicted normalized object coordinate space (NOCS) map and the observed point cloud, which not only makes our teacher network more robust to the target domain but also provides more reliable pseudo labels for training the student network. Extensive experimental results demonstrate the effectiveness of our proposed method both quantitatively and qualitatively. Notably, without leveraging target-domain GT labels, our method achieves performance comparable to, and sometimes better than, existing methods that depend on GT labels.
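For concreteness, the pseudo-label training idea can be sketched as follows. This is a minimal illustration, not the released implementation: the teacher/student network interfaces, the input format, and the smooth-L1 loss choice are all assumptions.

import torch
import torch.nn.functional as F

def train_step(teacher, student, optimizer, target_rgb, target_depth):
    # The teacher predicts pseudo NOCS labels on unlabeled target data;
    # no gradients flow through it.
    with torch.no_grad():
        pseudo_nocs = teacher(target_rgb, target_depth)
    # The student is supervised with the teacher's pseudo labels.
    pred_nocs = student(target_rgb, target_depth)
    loss = F.smooth_l1_loss(pred_nocs, pseudo_nocs)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()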

Method

Overview of unsupervised domain adaptation for category-level object pose estimation (UDA-COPE). UDA-COPE adopts a pseudo-label-based teacher/student training scheme. Our proposed bidirectional point filtering method removes noisy pseudo labels and provides reliable guidance to the student network. At the same time, the filtered depth points give additional self-supervision to the teacher network, making it robust to the domain gap between the synthetic and real datasets.
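Below is a rough sketch of what a bidirectional filtering step between predicted NOCS points and observed depth points could look like. The function name, the mutual-nearest-neighbour criterion, and the distance threshold are illustrative assumptions, not the paper's exact algorithm.

import numpy as np

def bidirectional_filter(nocs_pts, depth_pts, thresh=0.05):
    # nocs_pts:  (N, 3) predicted NOCS coordinates mapped into the camera
    #            frame with a coarsely estimated similarity transform.
    # depth_pts: (M, 3) observed (back-projected) depth points.
    # Pairwise distances between the two point sets.
    d = np.linalg.norm(nocs_pts[:, None, :] - depth_pts[None, :, :], axis=-1)
    nn_n2d = d.argmin(axis=1)  # nearest depth point for each NOCS point
    nn_d2n = d.argmin(axis=0)  # nearest NOCS point for each depth point
    idx_n = np.arange(len(nocs_pts))
    # Keep only pairs that are mutual nearest neighbours in both
    # directions and closer than the threshold.
    mutual = nn_d2n[nn_n2d] == idx_n
    close = d[idx_n, nn_n2d] < thresh
    keep = mutual & close
    return nocs_pts[keep], depth_pts[nn_n2d[keep]]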

Experiments

Quantitative comparison with state-of-the-art methods on the REAL275 dataset.


Qualitative comparison with state-of-the-art methods on the REAL275 dataset.


Example of noisy GT labels from the real training dataset. Human-annotated GT pose labels on the real dataset (top row) are sometimes less accurate than our predicted pseudo labels (bottom row).

Ablation studies on UDA components.
Lower Bound: trained with the labeled source domain only; Upper Bound: trained with both the labeled source and target domains; PL: Pseudo Label; MU: Momentum Update; AM: All Modality loss; PL-F: Pseudo Label Filtering; TSL: Teacher Self-supervised Learning. Performance margins are reported relative to the Lower Bound.
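As a minimal illustration of the Momentum Update (MU) component, the teacher can be maintained as an exponential moving average of the student weights. The momentum coefficient below is an assumed value, not necessarily the one used in the paper.

import torch

@torch.no_grad()
def momentum_update(teacher, student, m=0.999):
    # EMA-style update: teacher <- m * teacher + (1 - m) * student.
    for t_param, s_param in zip(teacher.parameters(), student.parameters()):
        t_param.data.mul_(m).add_(s_param.data, alpha=1.0 - m)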

BibTeX

@inproceedings{lee2022uda,
  title={{UDA-COPE}: Unsupervised Domain Adaptation for Category-level Object Pose Estimation},
  author={Lee, Taeyeop and Lee, Byeong-Uk and Shin, Inkyu and Choe, Jaesung and Shin, Ukcheol and Kweon, In So and Yoon, Kuk-Jin},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  pages={14891--14900},
  year={2022}
}

Contact

If you have any questions, please feel free to contact Taeyeop Lee.