SGTAPose: Robot Structure Prior Guided Temporal Attention for Camera-to-Robot Pose Estimation from Image Sequence

 Video 

 Code 

Abstract

In this work, we tackle the problem of online camera-to-robot pose estimation from single-view successive frames of an image sequence, a crucial task for robots to interact with the world. The primary obstacles to this task are the robot's self-occlusions and the ambiguity of single-view images. This work demonstrates, for the first time, the effectiveness of temporal information and the robot structure prior in addressing these challenges. Given the successive frames and the robot joint configuration, our method learns to accurately regress the 2D coordinates of the robot's predefined keypoints (e.g. joints). With the camera intrinsics and robot joint states known, we obtain the camera-to-robot pose using a Perspective-n-Point (PnP) solver. We further refine the camera-to-robot pose iteratively using the robot structure prior. To train the whole pipeline, we build a large-scale synthetic dataset generated with domain randomisation to bridge the sim-to-real gap. Extensive experiments on synthetic and real-world datasets and a downstream robotic grasping task demonstrate that our method achieves new state-of-the-art performance and outperforms traditional hand-eye calibration algorithms in real time (36 FPS). 
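As a rough illustration of the PnP step described above, the sketch below recovers the camera-to-robot pose from predicted 2D keypoints and the 3D keypoint positions given by the robot's forward kinematics at the known joint configuration. All numeric values (keypoints, intrinsics) are hypothetical placeholders, and this is not the released implementation.

```python
import numpy as np
import cv2

# Hypothetical 3D keypoint positions in the robot base frame (N x 3), e.g.
# obtained from forward kinematics at the current joint configuration.
keypoints_3d = np.array([
    [0.00, 0.00, 0.10],
    [0.05, 0.00, 0.35],
    [0.30, 0.00, 0.40],
    [0.55, 0.05, 0.42],
    [0.60, 0.10, 0.30],
    [0.62, 0.12, 0.20],
], dtype=np.float64)

# 2D keypoint locations predicted by the network for the current frame (N x 2).
keypoints_2d = np.array([
    [320.5, 410.2],
    [331.0, 352.7],
    [402.4, 338.1],
    [466.9, 330.5],
    [479.3, 361.0],
    [482.8, 385.6],
], dtype=np.float64)

# Known camera intrinsics (placeholder values) and zero lens distortion.
K = np.array([[615.0,   0.0, 320.0],
              [  0.0, 615.0, 240.0],
              [  0.0,   0.0,   1.0]], dtype=np.float64)
dist = np.zeros(5)

# Solve Perspective-n-Point: rvec/tvec map robot-base coordinates to the camera frame.
ok, rvec, tvec = cv2.solvePnP(keypoints_3d, keypoints_2d, K, dist,
                              flags=cv2.SOLVEPNP_EPNP)
R, _ = cv2.Rodrigues(rvec)          # 3x3 rotation matrix
T_cam_robot = np.eye(4)
T_cam_robot[:3, :3] = R
T_cam_robot[:3, 3] = tvec.ravel()   # 4x4 camera-to-robot transform
print(T_cam_robot)
```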

Teaser

(a) Given an RGB image sequence and known robot structure priors, SGTAPose estimates the 2D keypoint locations and solves for a refined camera-to-robot pose (left). (b) The real-time estimated camera-to-robot pose can serve downstream grasping tasks with high success rates.

Method

Belief Map Generator: Given the 2D/3D keypoint locations from the previous frame, an initial camera-to-robot pose is computed and used to project the current 3D keypoint positions onto a belief map for the current frame (a projection sketch follows this section).

Feature Alignment: The shared encoder yields multi-scale features. For the first three feature levels, we apply temporal cross-attention guided by the structure priors, while for the last three levels, we concatenate the corresponding features.

3D Refiner: We read the detected 2D keypoint locations from the output heads and solve for an initial camera-to-robot pose, then refine it with the structure priors via a Levenberg-Marquardt (LM) solver (see the refinement sketch below).
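To make the belief map step concrete, here is a minimal sketch of projecting 3D keypoints into the current frame with an initial pose and rendering Gaussian belief maps. The helper names, image size, and sigma value are assumptions for illustration, not the paper's code.

```python
import numpy as np

def project_keypoints(keypoints_3d, T_cam_robot, K):
    """Project 3D keypoints (robot base frame) into pixel coordinates
    using an initial camera-to-robot pose and camera intrinsics."""
    pts_h = np.hstack([keypoints_3d, np.ones((len(keypoints_3d), 1))])  # N x 4
    pts_cam = (T_cam_robot @ pts_h.T).T[:, :3]                          # N x 3
    uv = (K @ pts_cam.T).T
    return uv[:, :2] / uv[:, 2:3]                                        # N x 2

def render_belief_maps(keypoints_2d, height, width, sigma=2.0):
    """Render one Gaussian heatmap per keypoint, peaked at the projected location."""
    ys, xs = np.mgrid[0:height, 0:width]
    maps = np.zeros((len(keypoints_2d), height, width), dtype=np.float32)
    for i, (u, v) in enumerate(keypoints_2d):
        maps[i] = np.exp(-((xs - u) ** 2 + (ys - v) ** 2) / (2.0 * sigma ** 2))
    return maps

# Usage: reproject the current 3D keypoints with the pose from the previous frame,
# then render the belief maps that are fed to the current frame's encoder.
# belief = render_belief_maps(project_keypoints(kps_3d_now, T_prev, K), 480, 640)
```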
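The 3D refinement step can likewise be sketched as a small reprojection-error minimisation. This version uses SciPy's Levenberg-Marquardt solver over an axis-angle plus translation parameterisation; it is an assumed stand-in for the paper's LM solver, not its actual implementation.

```python
import numpy as np
import cv2
from scipy.optimize import least_squares

def reprojection_residuals(pose6, keypoints_3d, keypoints_2d, K):
    """Residuals between observed 2D keypoints and 3D keypoints reprojected
    with the pose encoded as (rvec[3], tvec[3])."""
    rvec, tvec = pose6[:3], pose6[3:]
    proj, _ = cv2.projectPoints(keypoints_3d, rvec, tvec, K, None)
    return (proj.reshape(-1, 2) - keypoints_2d).ravel()

def refine_pose(rvec0, tvec0, keypoints_3d, keypoints_2d, K):
    """Refine an initial PnP pose with a Levenberg-Marquardt least-squares solve."""
    x0 = np.concatenate([rvec0.ravel(), tvec0.ravel()])
    result = least_squares(reprojection_residuals, x0, method="lm",
                           args=(keypoints_3d, keypoints_2d, K))
    return result.x[:3], result.x[3:]  # refined rvec, tvec
```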

Main Results

Baseline comparison. PCK measures 2D keypoint accuracy, ADD measures 3D reconstruction error, and In Frame Found is the ratio of detected keypoints to all keypoints present in the frame. Our method takes the lead in all metrics on all datasets.
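For reference, a minimal sketch of how the PCK and ADD metrics are commonly computed; the pixel threshold and array shapes are assumptions, not the benchmark's exact protocol.

```python
import numpy as np

def pck(pred_2d, gt_2d, threshold_px=2.5):
    """Percentage of Correct Keypoints: fraction of 2D predictions
    within a pixel threshold of the ground truth."""
    dists = np.linalg.norm(pred_2d - gt_2d, axis=-1)
    return float(np.mean(dists < threshold_px))

def add_metric(keypoints_3d, T_pred, T_gt):
    """Average Distance (ADD): mean Euclidean error of 3D keypoints
    transformed by the predicted vs. ground-truth camera-to-robot pose."""
    pts_h = np.hstack([keypoints_3d, np.ones((len(keypoints_3d), 1))])
    pred = (T_pred @ pts_h.T).T[:, :3]
    gt = (T_gt @ pts_h.T).T[:, :3]
    return float(np.mean(np.linalg.norm(pred - gt, axis=-1)))
```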

Qualitative Results

Visualisation of Predictions

Long-horizon Downstream Grasping Tasks

Video