A Unified Framework for Pose Estimation and Pose Tracking in SE(3) Manifolds
Li Zhang, Xianhui Meng, Liu Liu✉️, Haonan Jiang, Jianan Wang,
Rujing Wang, Cewu Lu, Jun Liu, Hong Zhang
✉️ Liu Liu is the corresponding author: liuliu@hfut.edu.cn
Category-level articulated object pose perception, encompassing both static 6D pose estimation and dynamic pose tracking, is critical for embodied AI systems interacting with complex environments. Due to the inherent complexity and diverse motion structures of articulated objects, existing methods often fall short in modeling kinematic constraints, handling self-occlusions, and maintaining stable pose optimization. Building upon EfficientCAPER, this work introduces CAPER++, a unified framework addressing these limitations through three key innovations. First, a joint-centric hierarchical model decomposes objects into a free part and constrained parts linked by joints, explicitly embedding kinematic constraints for geometrically consistent pose recovery. Second, an SE(3) manifold formulation leverages the Lie algebra se(3) in the tangent space for singularity-free rotation representation and stable optimization, replacing error-prone direct regression. Third, for tracking, a proxy-canonicalization strategy reformulates pose updates as SE(3) increment predictions relative to keyframes, enhanced by a dynamic keyframe mechanism to suppress drift. Extensive experiments on synthetic (ArtImage, PM-Videos), semi-synthetic (ReArtMix, ReArt-Videos), and real-world (RobotArm) benchmarks demonstrate state-of-the-art accuracy and robustness. CAPER++ achieves real-time inference (50 FPS) without post-processing, significantly advancing category-level articulated perception for real-world applications.
Novelty of Our Method
(a) Articulation Modeling. Traditional approaches often approximate articulated objects as a set of independent rigid parts, neglecting their inherent kinematic structure and thus frequently producing physically inconsistent predictions. In contrast, our method explicitly integrates kinematic constraints and employs a hierarchical modeling strategy to achieve physically plausible articulated pose reasoning (a minimal sketch follows this list).
(b) Pose Estimation. Existing methods either rely on direct regression, which tends to produce unstable predictions, or adopt dense prediction schemes, which incur substantial post-processing overhead. To address this, we introduce a decoupled rotation prediction strategy on the SE(3) manifold, significantly improving both estimation accuracy and computational efficiency (see the exponential-map sketch after this list).
(c) Pose Tracking. A straightforward approach to pose tracking is independent frame-by-frame prediction, which fails to exploit temporal dependencies across sequential frames. In this work, we instead incorporate a pose-increment learning mechanism that enables online, real-time, high-precision tracking of articulated objects (see the increment-composition sketch after this list).
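To make (a) concrete, below is a minimal NumPy sketch of the joint-centric constraint for a single revolute joint. All names (`constrained_part_pose`, `rotation_about_axis`) are illustrative, not the actual CAPER++ API; the point is that the constrained part's pose is obtained by composing the free part's pose with a joint transform, so the kinematic constraint is satisfied by construction rather than learned implicitly.

```python
import numpy as np

def skew(v):
    """Skew-symmetric matrix such that skew(v) @ u == np.cross(v, u)."""
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

def rotation_about_axis(axis, angle):
    """Rodrigues' formula: rotation of `angle` radians about a unit axis."""
    axis = axis / np.linalg.norm(axis)
    K = skew(axis)
    return np.eye(3) + np.sin(angle) * K + (1.0 - np.cos(angle)) * (K @ K)

def constrained_part_pose(T_free, joint_axis, joint_pivot, joint_angle):
    """Pose of a constrained part given the free part's pose (revolute joint).

    `joint_axis` and `joint_pivot` are expressed in the free part's frame,
    so the constrained part can only rotate about the joint axis: the
    kinematic constraint holds by construction instead of being learned.
    """
    R = rotation_about_axis(joint_axis, joint_angle)
    T_joint = np.eye(4)
    T_joint[:3, :3] = R
    T_joint[:3, 3] = joint_pivot - R @ joint_pivot  # rotate about the pivot
    return T_free @ T_joint
```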
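For (b), the key manifold operation is the se(3) exponential map: the network regresses an unconstrained 6-D tangent vector, and the exp map always returns a valid rigid transform, avoiding the singularities of direct Euler-angle or quaternion regression. A hedged sketch, reusing `skew` from the previous block; this is illustrative, not the paper's exact parameterization:

```python
def se3_exp(xi):
    """Exponential map from xi = (omega, v) in R^6 to a 4x4 SE(3) transform."""
    omega, v = xi[:3], xi[3:]
    theta = np.linalg.norm(omega)
    if theta < 1e-8:                    # near identity: first-order expansion
        R, V = np.eye(3) + skew(omega), np.eye(3)
    else:
        K = skew(omega / theta)         # normalized rotation axis, as a matrix
        R = np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)
        # Left Jacobian V maps the translational tangent component into SE(3)
        V = (np.eye(3)
             + ((1.0 - np.cos(theta)) / theta) * K
             + ((theta - np.sin(theta)) / theta) * (K @ K))
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = V @ v
    return T
```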
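For (c), a pose update becomes the composition of a keyframe pose with a predicted se(3) increment, and the keyframe is re-anchored once drift grows. A minimal sketch reusing `se3_exp` from the previous block; the drift thresholds are hypothetical values for illustration, not the paper's settings:

```python
def update_pose(T_keyframe, xi_increment):
    """Compose a predicted se(3) increment with the keyframe pose."""
    return T_keyframe @ se3_exp(xi_increment)

def maybe_update_keyframe(T_keyframe, T_current,
                          rot_thresh=0.35, trans_thresh=0.05):
    """Dynamic keyframe rule: re-anchor when the current pose drifts too far
    from the keyframe (thresholds in radians / scene units are illustrative)."""
    dT = np.linalg.inv(T_keyframe) @ T_current
    rot = np.arccos(np.clip((np.trace(dT[:3, :3]) - 1.0) / 2.0, -1.0, 1.0))
    trans = np.linalg.norm(dT[:3, 3])
    return T_current if (rot > rot_thresh or trans > trans_thresh) else T_keyframe
```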
The Pipeline of Our CAPER++ Framework. The proposed CAPER++ framework supports both pose estimation and pose tracking for articulated objects. For pose estimation, the input is a single-frame point cloud, from which the network directly predicts the pose of the target articulated object. For pose tracking, the framework takes a sequence of point clouds as input and iteratively updates the pose from the previous frame's estimate, producing a continuous stream of poses over time for robust tracking. Our method is built upon a two-stage framework: Stage I estimates the pose of the free part, and Stage II estimates the poses of the constrained parts. This articulated modeling is further extended to the pose tracking task, as sketched below.
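The following sketch shows how the two stages and the tracking loop could fit together. Here `stage1`, `stage2`, and `increment_net` are hypothetical placeholders for the trained networks (the repo's actual interfaces may differ), and the helper functions come from the sketches above:

```python
def estimate_articulated_pose(points, stage1, stage2):
    """Two-stage inference sketch: free part first, then constrained parts."""
    T_free, feats = stage1(points)                 # Stage I: free-part pose
    joint_states = stage2(points, feats, T_free)   # Stage II: per-joint states
    poses = {"free": T_free}
    for name, (axis, pivot, angle) in joint_states.items():
        poses[name] = constrained_part_pose(T_free, axis, pivot, angle)
    return poses

def track_sequence(frames, stage1, stage2, increment_net):
    """Tracking sketch: full estimation on frame 0, then per-frame SE(3)
    increments predicted relative to a dynamically updated keyframe."""
    T_key = T_cur = estimate_articulated_pose(frames[0], stage1, stage2)["free"]
    trajectory = [T_cur]
    for points in frames[1:]:
        xi = increment_net(points, T_key)   # increment w.r.t. the keyframe
        T_cur = update_pose(T_key, xi)      # compose on the SE(3) manifold
        T_key = maybe_update_keyframe(T_key, T_cur)
        trajectory.append(T_cur)
    return trajectory
```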
Download our generated ArtImage dataset from BaiduYun (code: o2ou) or OneDrive, and save it in /data.