IST-Net: Prior-free Category-level Pose Estimation
with Implicit Space Transformation
ICCV 2023 Paris
Jianhui Liu1 Yukang Chen2 Xiaoqing Ye3 Xiaojuan Qi1
1The University of Hong Kong 2The Chinese University of Hong Kong 3Baidu
IST-Net is a clean, simple, and prior-free category-level 6D pose estimator.
Category-level 6D pose estimation aims to predict the poses and sizes of unseen objects from a specific category. Thanks to prior deformation, which explicitly adapts a category-specific 3D prior (i.e., a 3D template) to a given object instance, prior-based methods have attained great success and become a major research stream. However, obtaining category-specific priors requires collecting a large number of 3D models, which is labor-intensive and often impractical. This motivates us to investigate whether priors are necessary to make prior-based methods effective. Our empirical study shows that the 3D prior itself is not what yields the high performance. The key is actually the explicit deformation process, which aligns camera and world coordinates under supervision from world-space 3D models (i.e., models in the canonical space). Inspired by these observations, we introduce a simple, prior-free implicit space transformation network, IST-Net, which transforms camera-space features into their world-space counterparts and builds correspondence between them implicitly, without relying on 3D priors. In addition, we design camera- and world-space enhancers that enrich the features with pose-sensitive information and geometric constraints, respectively. Albeit simple, IST-Net achieves state-of-the-art performance with a prior-free design and the fastest inference speed on the REAL275 benchmark.
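To make the core idea concrete, below is a minimal PyTorch sketch of an implicit space transformation: a shared MLP maps per-point camera-space features to world-space features without any category-specific 3D prior. The module and parameter names (ImplicitSpaceTransform, feat_dim) are illustrative assumptions, not the paper's exact architecture.

import torch
import torch.nn as nn

class ImplicitSpaceTransform(nn.Module):
    """Maps per-point camera-space features to world-space features
    with a shared MLP; no category-specific 3D prior is involved."""
    def __init__(self, feat_dim: int = 128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, feat_dim), nn.ReLU(inplace=True),
            nn.Linear(feat_dim, feat_dim),
        )

    def forward(self, cam_feat: torch.Tensor) -> torch.Tensor:
        # cam_feat: (B, N, C) features of N observed points in camera space
        return self.mlp(cam_feat)  # (B, N, C) world-space counterparts

cam_feat = torch.randn(2, 1024, 128)             # features from a point-cloud backbone
world_feat = ImplicitSpaceTransform()(cam_feat)  # no 3D template required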
Is Prior Really Necessary?
We experiment with the shape deformation module, which deforms a given shape prior toward the desired instance, by replacing the shape prior with random noise or with a fixed prior from another category (see Fig. above). We observe that the deformation module can adapt any input (noise or a fixed prior) into the target object (Fig. (b)). Moreover, model performance remains high regardless of the 3D prior used (Fig. (a)). This suggests that the shape prior itself is not responsible for the high performance of prior-based methods; the key is the deformation module, which learns to synthesize world-space target objects and explicitly builds the correspondence between camera and world space, since performance degrades dramatically without prior deformation. This prompts us to investigate new ways to build camera-to-world correspondence without requiring 3D priors or models.
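A sketch of this ablation follows, assuming a generic prior-deformation network with an (observed points, prior points) interface; ablate_prior and dummy_deformer are hypothetical names used for illustration, not code from the paper.

import torch

def ablate_prior(deform_net, observed_points, prior_points, mode="noise"):
    """Run a prior-deformation module with a corrupted shape prior."""
    if mode == "noise":
        # Random point cloud with the same shape as the category prior.
        prior_points = torch.randn_like(prior_points)
    elif mode == "fixed":
        # Reuse a single prior for every instance (stand-in for a fixed
        # prior taken from another category).
        prior_points = prior_points[:1].expand_as(prior_points).contiguous()
    # If the module still reconstructs the target object, the prior's
    # content carries little information; the deformation is what matters.
    return deform_net(observed_points, prior_points)

# Example with a dummy deformer that simply returns the corrupted prior:
dummy_deformer = lambda obs, prior: prior
out = ablate_prior(dummy_deformer, torch.randn(4, 1024, 3), torch.randn(4, 1024, 3))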
Implicit Space Transformation: Implicitly sets up feature correspondence between camera space and world space without requiring 3D priors or ground-truth 3D models of target objects.
World-space Enhancer: Distills standard world-space features to supervise the transformed features during training (see the sketch after this list).
Camera-space Enhancer: Boosts the backbone network’s feature extraction capabilities with pose-sensitive information.
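As referenced above, here is a minimal sketch of the world-space enhancer viewed as feature distillation: during training, features extracted from the ground-truth world-space (canonical) model supervise the implicitly transformed features, and the enhancer branch is dropped at inference. The loss choice (smooth L1) and function name are assumptions for illustration.

import torch
import torch.nn.functional as F

def world_space_enhancer_loss(transformed_feat: torch.Tensor,
                              canonical_feat: torch.Tensor) -> torch.Tensor:
    """Align implicitly transformed features with features extracted
    from the canonical (world-space) model. Both tensors: (B, N, C)."""
    return F.smooth_l1_loss(transformed_feat, canonical_feat)

# Training-time usage (canonical_feat comes from a world-space 3D model):
# loss = world_space_enhancer_loss(ist(cam_feat), canonical_feat)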
@article{liu2023prior,
  title={Prior-free Category-level Pose Estimation with Implicit Space Transformation},
  author={Liu, Jianhui and Chen, Yukang and Ye, Xiaoqing and Qi, Xiaojuan},
  journal={arXiv preprint arXiv:2303.13479},
  year={2023}
}