Single-Stage Keypoint-Based Category-Level Object Pose Estimation from an RGB Image

Yunzhi Lin Jonathan Tremblay Stephen Tyree Patricio A. Vela Stan Birchfield

NVIDIA, Georgia Tech

Abstract: Prior work on 6-DoF object pose estimation has largely focused on instance-level processing, in which a textured CAD model is available for each object being detected. Category-level 6-DoF pose estimation represents an important step toward developing robotic vision systems that operate in unstructured, real-world scenarios. In this work, we propose a single-stage, keypoint-based approach for category-level object pose estimation that operates on unknown object instances within a known category using a single RGB image as input. The proposed network performs 2D object detection, detects 2D keypoints, estimates 6-DoF pose, and regresses relative bounding cuboid dimensions. These quantities are estimated in a sequential fashion, leveraging the recent idea of convGRU for propagating information from easier tasks to those that are more difficult. We favor simplicity in our design choices: generic cuboid vertex coordinates, a single-stage network, and monocular RGB input. We conduct extensive experiments on the challenging Objectron benchmark, outperforming state-of-the-art methods on the 3D IoU metric (27.6% higher than the MobilePose single-stage approach and 7.1% higher than the related two-stage approach).

Paper: arXiv (published at ICRA 2022)

Code: GitHub

Overview:

ICRA22_0658.mp4

Pipeline:

We focus on pose estimation (6-DoF translation and rotation, up to an unknown scale) for category-level objects, training a single network per category.

Given a monocular RGB image, the network detects each object as a center point, regresses the remaining quantities (2D keypoint locations and relative cuboid dimensions) from that point, and recovers the 6-DoF pose via perspective-n-point (PnP).
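To make the cuboid-keypoint representation concrete, here is a minimal numpy sketch (not the paper's implementation) of the geometry involved: the 8 generic cuboid vertices are defined by relative dimensions, and a pinhole camera projects them to the 2D keypoints that a PnP solver (e.g. `cv2.solvePnP`) would consume to recover the pose. The function names, example intrinsics, and pose are illustrative assumptions.

```python
import numpy as np

def cuboid_vertices(rel_dims):
    """8 vertices of an axis-aligned cuboid centered at the origin.
    rel_dims: (w, h, d) relative to the largest dimension (scale-free)."""
    w, h, d = rel_dims
    signs = np.array([[x, y, z] for x in (-1, 1) for y in (-1, 1) for z in (-1, 1)])
    return 0.5 * signs * np.array([w, h, d])

def project(points_3d, R, t, K):
    """Pinhole projection of 3D points under pose (R, t) with intrinsics K."""
    cam = points_3d @ R.T + t          # object frame -> camera frame
    uv = cam @ K.T                     # camera frame -> homogeneous image coords
    return uv[:, :2] / uv[:, 2:3]      # perspective divide

# Illustrative example: identity rotation, object 2 m in front of the camera
K = np.array([[600.0, 0.0, 320.0],
              [0.0, 600.0, 240.0],
              [0.0, 0.0, 1.0]])
R, t = np.eye(3), np.array([0.0, 0.0, 2.0])
kps_2d = project(cuboid_vertices((1.0, 0.8, 0.5)), R, t, K)
# kps_2d holds 8 image-plane keypoints; given such 2D-3D correspondences,
# a PnP solver recovers (R, t) up to the unknown absolute scale.
```

In the actual pipeline the 2D keypoints come from the network's predictions rather than a known pose, and the regressed relative dimensions supply the 3D cuboid coordinates for PnP.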

Qualitative Comparison:

MobilePose on Objectron dataset

CenterPose on Objectron dataset

MobilePose on self-collected video

CenterPose on self-collected video

Robot Manipulation Demos:

object_arrangement.MP4

Object arrangement

object_grasping.mp4

Object grasping

More demos:

From the robot's view

With multiple objects (including transparent objects)

Citation:

@inproceedings{lin2022icra:centerpose,
  title     = {Single-Stage Keypoint-based Category-level Object Pose Estimation from an {RGB} Image},
  author    = {Lin, Yunzhi and Tremblay, Jonathan and Tyree, Stephen and Vela, Patricio A. and Birchfield, Stan},
  booktitle = {IEEE International Conference on Robotics and Automation (ICRA)},
  month     = may,
  year      = {2022}
}