Li Zhang, Mingyu Mei, Ailing Zeng, Xianhui Meng, Yan Zhong, Xinyuan Song, Liu Liu, Rujing Wang, Zaixin He, Cewu Lu
Articulated object pose estimation is a core task in embodied AI and computer vision. Existing methods typically regress poses in a continuous space, but they often (1) struggle to navigate a large, complex search space and (2) fail to incorporate intrinsic kinematic constraints. In this paper, we introduce DICArt (Discrete Diffusion for Articulated Object Pose Estimation), a novel framework that formulates pose estimation as a conditional discrete diffusion process. Instead of operating in a continuous domain, DICArt progressively denoises a noisy pose representation through a learned reverse diffusion procedure to recover the ground-truth pose. To improve modeling fidelity, we propose a flexible flow decider that dynamically determines whether each token should be denoised or reset, effectively balancing the real and noise distributions during diffusion. Additionally, we incorporate a joint-oriented modeling strategy that estimates the pose of each rigid part hierarchically to respect the object's kinematic structure. We validate DICArt on both synthetic and real-world datasets with multi-hinged articulated objects. Experimental results demonstrate superior performance and robustness over state-of-the-art baselines. By integrating discrete generative modeling with structural priors, DICArt offers a new paradigm for reliable category-level 6D pose estimation in complex environments.
Comparison of Different Denoising Processes. We denote the rotation-related Euler angles as l, m, n and predict them as discretized bin indices. (a) illustrates the vanilla denoising process of conventional discrete diffusion models, where inconsistent convergence rates across tokens often introduce uncertainty and ambiguity in pose prediction; this can be viewed as an aggressive denoising strategy. (b) presents the reformulated denoising process proposed in this work, which is centered on a customized Flowing Mechanism. This mechanism introduces adaptive directional guidance that determines an appropriate update path for each token. It is designed to enforce consistent convergence trajectories among semantically correlated token groups, yielding a more stable, smoother, and gentler denoising process.
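To make the idea of predicting Euler angles as discretized bin indices concrete, here is a minimal sketch of one possible quantization. The bin count (`NUM_BINS`), the use of degrees, and the function names `angles_to_bins` and `bins_to_angles` are illustrative assumptions; the text above does not specify DICArt's actual discretization.

```python
import numpy as np

NUM_BINS = 360  # assumed bin count; not taken from the DICArt description


def angles_to_bins(angles_deg, num_bins=NUM_BINS):
    """Map angles in degrees to integer bin indices in [0, num_bins)."""
    # Wrap every angle into [0, 360) so negative inputs are handled.
    angles = np.mod(np.asarray(angles_deg, dtype=float), 360.0)
    width = 360.0 / num_bins
    # The final modulo guards against a wrapped value landing exactly on 360.
    return np.floor(angles / width).astype(int) % num_bins


def bins_to_angles(bins, num_bins=NUM_BINS):
    """Map bin indices back to the centre angle of each bin, in degrees."""
    width = 360.0 / num_bins
    return (np.asarray(bins, dtype=float) + 0.5) * width


if __name__ == "__main__":
    lmn = np.array([12.3, -45.0, 179.9])  # hypothetical Euler angles (l, m, n)
    tokens = angles_to_bins(lmn)
    print(tokens, bins_to_angles(tokens))
```

With this scheme the round-trip error is at most half a bin width, so a finer grid trades a larger token vocabulary for lower quantization error.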
Overview of Our Framework.
The proposed DICArt comprises three interrelated modules. Under the joint-oriented modeling strategy, the articulated object is decomposed into one Parent Part and multiple Child Parts. The overall pipeline proceeds as follows. First, the 6D pose $\mathbf{x}_0$ of the parent part is subjected to a Forward Corruption process, yielding a fully masked pose representation $\mathbf{x}_T$. Then, during the Reformulated Denoising Process, geometric conditions are injected to guide the network in recovering the parent part's 6D pose from the noisy input. Next, we incorporate explicit kinematic constraints based on joint motion and derive the 6D poses of all child parts using the Rodrigues formulation. This yields a complete per-part 6D pose estimate for the articulated object.
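The step from the parent pose to a child pose can be sketched with Rodrigues' rotation formula for a revolute joint. This is an illustration under stated assumptions, not DICArt's implementation: the joint axis and pivot are assumed to be expressed in the parent part's frame, the child is assumed to coincide with the parent at joint angle zero, and the names `rodrigues` and `child_pose_revolute` are hypothetical.

```python
import numpy as np


def rodrigues(axis, theta):
    """Rotation matrix for angle theta (radians) about a 3D axis."""
    axis = np.asarray(axis, dtype=float)
    axis = axis / np.linalg.norm(axis)  # the formula requires a unit axis
    # Skew-symmetric cross-product matrix of the axis.
    K = np.array([[0.0, -axis[2], axis[1]],
                  [axis[2], 0.0, -axis[0]],
                  [-axis[1], axis[0], 0.0]])
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)


def child_pose_revolute(R_parent, t_parent, axis, pivot, theta):
    """Compose the parent pose with a rotation about a joint axis.

    Assumption: axis and pivot are given in the parent part's frame.
    """
    R_joint = rodrigues(axis, theta)
    pivot = np.asarray(pivot, dtype=float)
    # Rotating about a line through `pivot` leaves the pivot itself fixed.
    t_joint = pivot - R_joint @ pivot
    R_child = R_parent @ R_joint
    t_child = R_parent @ t_joint + t_parent
    return R_child, t_child


if __name__ == "__main__":
    R_c, t_c = child_pose_revolute(np.eye(3), np.zeros(3),
                                   axis=[0, 0, 1], pivot=[1, 0, 0],
                                   theta=np.pi / 2)
    print(R_c @ np.array([1.0, 0.0, 0.0]) + t_c)  # pivot point stays in place
```

Under these assumptions only the joint angle needs to be estimated per child part, since the child's rotation and translation follow from the parent pose and the joint parameters. A prismatic joint would instead add a translation along the axis.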
Download our generated dataset from ArtImage at BaiduYun (code: o2ou) and OneDrive, and save it in /data.