Qiaojun Yu†, Xibin Yuan†, Yu Jiang†, Junting Chen,
Dongzhe Zheng, Ce Hao, Yang You, Yixing Chen, Yao Mu, Liu Liu, Cewu Lu*
* Corresponding author (email: lucewu@sjtu.edu.cn), † These authors contributed equally
Articulated object manipulation remains a critical challenge in robotics due to complex kinematic constraints and the limited physical reasoning of existing methods. In this work, we introduce ArtGS, a novel framework that extends 3D Gaussian Splatting (3DGS) by integrating visual-physical modeling for articulated object understanding and interaction. ArtGS begins with multi-view RGB-D reconstruction, followed by reasoning with a vision-language model (VLM) to extract semantic and structural information, in particular the articulated bones. Through dynamic, differentiable 3DGS-based rendering, ArtGS optimizes the parameters of the articulated bones, enforcing physically consistent motion constraints and enhancing the manipulation policy. By leveraging dynamic Gaussian splatting, cross-embodiment adaptability, and closed-loop optimization, ArtGS offers an efficient, scalable, and generalizable approach to articulated object modeling and manipulation. Experiments in both simulation and real-world environments demonstrate that ArtGS significantly outperforms previous methods in joint estimation accuracy and manipulation success rate across a variety of articulated objects.
Pipeline
Pipeline of ArtGS. Starting from multi-view RGB-D inputs and object masks, the framework performs Static Gaussian Reconstruction and synthesizes robot poses. A VLM-based bone initialization module then estimates joint parameters via visual-language reasoning. Finally, the Bone Refinement module dynamically optimizes revolute and prismatic joint sequences, producing an articulated object model with precise kinematic parameters.
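To make the refinement step concrete, below is a minimal PyTorch sketch of 1-DoF joint optimization under stated assumptions: Gaussian centers stand in for the full 3DGS scene, a Chamfer distance on part points substitutes for the photometric rendering loss used in the actual pipeline, and all names (`rodrigues`, `articulate`, and so on) are illustrative rather than the authors' API.

```python
import torch

def rodrigues(axis: torch.Tensor, angle: torch.Tensor) -> torch.Tensor:
    """Rotation matrix about a (normalized) axis via Rodrigues' formula."""
    a = axis / axis.norm()
    zero = torch.zeros((), dtype=a.dtype)
    K = torch.stack([
        torch.stack([zero, -a[2], a[1]]),
        torch.stack([a[2], zero, -a[0]]),
        torch.stack([-a[1], a[0], zero]),
    ])
    return torch.eye(3, dtype=a.dtype) + torch.sin(angle) * K \
        + (1.0 - torch.cos(angle)) * (K @ K)

def articulate(means, axis, origin, q, joint_type="revolute"):
    """Apply a 1-DoF joint transform to the movable part's Gaussian centers."""
    if joint_type == "revolute":           # rotate by angle q about the axis line
        return (means - origin) @ rodrigues(axis, q).T + origin
    return means + q * axis / axis.norm()  # prismatic: slide by q along the axis

def chamfer(a, b):
    """Symmetric Chamfer distance; a stand-in for the photometric 3DGS loss."""
    d = torch.cdist(a, b)
    return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()

# --- synthetic stand-in data: a part rotating about a hidden hinge ---
torch.manual_seed(0)
means0 = torch.rand(200, 3)                      # rest-state Gaussian centers
gt_axis, gt_origin = torch.tensor([0., 1., 0.]), torch.tensor([0.5, 0., 0.5])
gt_q = torch.linspace(0.1, 0.6, 5)               # joint state per observed frame
obs = [articulate(means0, gt_axis, gt_origin, q) for q in gt_q]

# --- jointly optimize axis direction, axis origin, and per-frame joint states ---
axis = torch.tensor([1., 0., 0.], requires_grad=True)
origin = torch.zeros(3, requires_grad=True)
q = torch.zeros(len(obs), requires_grad=True)
opt = torch.optim.Adam([axis, origin, q], lr=0.02)
for step in range(400):
    opt.zero_grad()
    loss = sum(chamfer(articulate(means0, axis, origin, q[t]), obs[t])
               for t in range(len(obs)))
    loss.backward()
    opt.step()
print(axis.detach() / axis.detach().norm(), origin.detach(), q.detach())
```

In the actual pipeline, the Chamfer term would be replaced by the differentiable 3DGS rendering loss against the observed RGB-D frames, so the joint parameters are supervised directly by photometric evidence across the motion sequence.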
Experiment
Articulated Object Modeling
Compared with the single-frame point-cloud methods for axis parameter prediction employed by ANCSH and GAMMA, Ditto achieves superior stability by using a two-frame point-cloud approach. By effectively capturing the spatial motion of hinge components, Ditto significantly improves the accuracy of joint parameter estimation, with notable gains across all seven object categories. However, because Ditto relies solely on the spatial features of two point-cloud frames, and point clouds are inherently sparse and unordered, the method requires substantial spatial change in the hinge component for effective modeling. As the visualized results in the figure show, Ditto exhibits significant modeling errors when the hinge component moves by less than 10 degrees or 10 centimeters, whereas at 30 degrees or 30 centimeters the modeling quality improves markedly. Ditto thus performs well under substantial spatial changes, but its modeling capability remains limited for subtle or small-range motions.
In contrast, ArtGS fully exploits dynamic temporal information gathered during interaction and integrates a visual-physical model of the joint skeleton structure. By leveraging the differentiable rendering of 3DGS, ArtGS enforces temporal and spatial continuity and consistency, optimizing the physical model from visual motion information. Experimental results show that ArtGS achieves the lowest errors in both axis direction and axis origin across multiple categories, demonstrating consistently superior performance.
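For reference, axis direction error is typically reported as the angle between the predicted and ground-truth joint axes, and axis origin error as the distance from the predicted origin to the ground-truth axis line (an origin is only meaningful for revolute joints). The sketch below shows one common convention for these metrics; exact definitions vary slightly between papers, and the function names are illustrative.

```python
import numpy as np

def axis_direction_error(pred_dir, gt_dir):
    """Angle (degrees) between predicted and ground-truth joint axes.
    The absolute value makes the metric invariant to axis sign flips."""
    cos = abs(np.dot(pred_dir, gt_dir)) / (
        np.linalg.norm(pred_dir) * np.linalg.norm(gt_dir))
    return np.degrees(np.arccos(np.clip(cos, 0.0, 1.0)))

def axis_origin_error(pred_origin, gt_origin, gt_dir):
    """Distance from the predicted origin to the ground-truth axis line.
    Any point on the axis is a valid origin, so point-to-line distance is used."""
    d = np.asarray(gt_dir) / np.linalg.norm(gt_dir)
    v = np.asarray(pred_origin) - np.asarray(gt_origin)
    return np.linalg.norm(v - np.dot(v, d) * d)

# example: a prediction about 5 degrees and 2 cm off
print(axis_direction_error([0.087, 0.996, 0.0], [0.0, 1.0, 0.0]))  # ~5.0
print(axis_origin_error([0.52, 0.3, 0.5], [0.5, 0.0, 0.5],
                        [0.0, 1.0, 0.0]))                          # 0.02
```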
Cross Embodiment Experiment
Cross-Embodiment Experiment. This figure demonstrates the cross-embodiment capability of ArtGS. The first and second rows show qualitative results for the Franka and xArm7 robotic arms, respectively. The first column displays the robotic arm reconstruction results from Dr. Robot, the second column presents our higher-quality digital assets, and the third column illustrates the manipulation results of ArtGS in a simulated environment across different robotic arms.
Real-world Manipulation Experiment
Real-world Experiment. We implement ArtGS in the real-world experiments, with only the VLM fine-tuned. Manipulation tasks include opening the door of a cabinet (two parts), a drawer, a storage unit, and a microwave.