CtRNet-X

CtRNet-X: Camera-to-Robot Pose Estimation in Real-world Conditions Using a Single Camera

Jingpei Lu*, Zekai Liang*, Tristin Xie, Florian Ritcher, Shan Lin, Sainan Liu, Michael C. Yip

ICRA 2025

Paper

Code

Practical robot pose estimation

CtRNet-X is a novel framework capable of estimating the robot pose with partially visible robot manipulators. Our approach leverages the Vision-Language Models for fine-grained robot components detection, and integrates it into a keypoint-based pose estimation network, which enables more robust performance in varied operational conditions.

Abstract

Camera-to-robot calibration is crucial for vision-based robot control and requires effort to make it accurate. Recent advancements in markerless pose estimation methods have eliminated the need for time-consuming physical setups for camera-to-robot calibration. While the existing markerless pose estimation methods have demonstrated impressive accuracy without the need for cumbersome setups, they rely on the assumption that all the robot joints are visible within the camera's field of view. However, in practice, robots usually move in and out of view, and some portion of the robot may stay out-of-frame during the whole manipulation task due to real-world constraints, leading to a lack of sufficient visual features and subsequent failure of these approaches. To address this challenge and enhance the applicability to vision-based robot control, we propose a novel framework capable of estimating the robot pose with partially visible robot manipulators. Our approach leverages the Vision-Language Models for fine-grained robot components detection, and integrates it into a keypoint-based pose estimation network, which enables more robust performance in varied operational conditions. The framework is evaluated on both public robot datasets and self-collected partial-view datasets to demonstrate our robustness and generalizability. As a result, this method is effective for robot pose estimation in a wider range of real-world manipulation scenarios.

Method

The integration of Vision-Language Model (VLM) with our self-supervised training makes common real-world robot pose estimation possible.

DEMO

Compatible with both RGB and RGBD input with differentiable depth rendering

depth_DROID_3.mp4

depth_video_3.mp4

A variation of CtRNet-X can integrate depth maps from an RGB-D camera during inference by comparing measured depth to rendered depth from the differentiable renderer.

BibTeX

@article{lu2024ctrnet,

title={CtRNet-X: Camera-to-Robot Pose Estimation in Real-world Conditions Using a Single Camera},

author={Lu, Jingpei and Liang, Zekai and Xie, Tristin and Ritcher, Florian and Lin, Shan and Liu, Sainan and Yip, Michael C},

journal={arXiv preprint arXiv:2409.10441},

year={2024}

}

Page updated

Report abuse