
Any6D: Model-free 6D Pose Estimation of Novel Objects

CVPR 2025

Taeyeop Lee¹, Bowen Wen², Minjun Kang¹, Gyuree Kang¹, In So Kweon¹, Kuk-Jin Yoon¹

¹KAIST  ²NVIDIA

[Paper]    [Code]

Abstract

We introduce Any6D, a model-free framework for 6D object pose estimation that requires only a single RGB-D anchor image to estimate both the 6D pose and size of unknown objects in novel scenes. Unlike existing methods that rely on textured 3D models or multiple viewpoints, Any6D leverages a joint object alignment process to enhance 2D-3D alignment and metric scale estimation for improved pose accuracy. Our approach integrates a render-and-compare strategy to generate and refine pose hypotheses, enabling robust performance in scenarios with occlusions, non-overlapping views, diverse lighting conditions, and large cross-environment variations. We evaluate our method on five challenging datasets: REAL275, Toyota-Light, HO3D, YCBInEOAT, and LM-O, where it significantly outperforms state-of-the-art methods for novel object pose estimation.
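As a rough illustration of the render-and-compare idea, the Python sketch below scores each pose hypothesis by projecting the object's point cloud into the camera and checking agreement with the observed depth map. This is not the Any6D implementation (which generates and refines hypotheses with full renderings); the function names, the projection-based score, and the 2 cm tolerance are illustrative assumptions.

import numpy as np

def score_hypothesis(obj_pts, pose, K, depth_obs, tol=0.02):
    """Fraction of object points whose projected depth matches the observed
    depth map within `tol` metres (higher is better). Purely illustrative."""
    pts_cam = (pose[:3, :3] @ obj_pts.T + pose[:3, 3:4]).T   # object -> camera frame
    uvw = (K @ pts_cam.T).T                                   # homogeneous pixel coordinates
    zc = np.maximum(uvw[:, 2], 1e-9)                          # guard against division by zero
    u = np.round(uvw[:, 0] / zc).astype(int)
    v = np.round(uvw[:, 1] / zc).astype(int)
    z = pts_cam[:, 2]
    h, w = depth_obs.shape
    valid = (u >= 0) & (u < w) & (v >= 0) & (v < h) & (z > 0)
    if not valid.any():
        return 0.0
    agree = np.abs(depth_obs[v[valid], u[valid]] - z[valid]) < tol
    return float(agree.mean())

def select_best_pose(obj_pts, hypotheses, K, depth_obs):
    """Pick the hypothesis (4x4 transform) with the highest depth-agreement score."""
    scores = [score_hypothesis(obj_pts, T, K, depth_obs) for T in hypotheses]
    best = int(np.argmax(scores))
    return hypotheses[best], scores[best]

The sketch only shows the comparison step; in the paper this scoring is coupled with hypothesis generation and refinement.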


Method

Overview of the Any6D framework for model-free object pose estimation. First, we reconstruct the normalized object shape O_N using an image-to-3D model. Then, we estimate an accurate object pose and size from the anchor image I_A via the proposed object alignment, yielding the metric-scale object shape O_M. Finally, we estimate the pose in the query image I_Q using O_M.
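As a loose illustration of the metric-scale part of the object alignment, the sketch below fits a single scale factor relating the normalized shape O_N to the observed anchor point cloud. Any6D's actual alignment is a joint 2D-3D optimization; the quantile-based radius matching and the function name here are assumptions made for the sketch.

import numpy as np

def estimate_metric_scale(o_n_pts, anchor_pts):
    """Fit an isotropic scale mapping the normalized shape points (o_n_pts, Nx3)
    onto the metric anchor point cloud (anchor_pts, Mx3) by least-squares on
    centroid-distance quantiles, so no point correspondences are required."""
    r_n = np.linalg.norm(o_n_pts - o_n_pts.mean(axis=0), axis=1)        # radii of normalized shape
    r_a = np.linalg.norm(anchor_pts - anchor_pts.mean(axis=0), axis=1)  # radii of observed cloud
    q = np.linspace(0.05, 0.95, 19)                                     # robust quantile levels
    rn_q, ra_q = np.quantile(r_n, q), np.quantile(r_a, q)
    return float((rn_q @ ra_q) / (rn_q @ rn_q))                         # closed-form 1-D least squares

# Example: rescale the normalized shape to the metric-scale shape O_M.
# o_m_pts = estimate_metric_scale(o_n_pts, anchor_pts) * o_n_pts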

Experiments

Experiments on real-world robotic manipulation scenarios. 

Qualitative comparison of state-of-the-art methods on the YCBInEOAT dataset. In this challenging scenario, the anchor image on the left shows only partially visible objects, while the objects in the query images are barely visible due to occlusion or different viewing angles, making this one of the hardest cases for matching. GeDi, being a depth-based method, shows ambiguity on objects whose asymmetry appears only in RGB.

Qualitative comparison of state-of-the-art methods on the HO3D dataset. As in the YCBInEOAT example above, the anchor image shows only partially visible objects and the objects in the query images are barely visible due to occlusion or different viewing angles; GeDi, being a depth-based method, again shows ambiguity on objects whose asymmetry appears only in RGB.


BibTeX

@inproceedings{lee2025any6d,
    title     = {{Any6D}: Model-free 6D Pose Estimation of Novel Objects},
    author    = {Lee, Taeyeop and Wen, Bowen and Kang, Minjun and Kang, Gyuree and Kweon, In So and Yoon, Kuk-Jin},
    booktitle = {Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR)},
    year      = {2025},
}

Contact

If you have any questions, please feel free to contact Taeyeop Lee.
