Qiaojun Yu*, Siyuan Huang*, Xibin Yuan, Zhengkai Jiang, Ce Hao,
Xin Li, Haonan Chang, Junbo Wang, Liu Liu, Hongsheng Li, Peng Gao✉️, Cewu Lu✉️
* indicates equal contribution
✉️ Peng Gao and Cewu Lu are the corresponding authors: gaopeng@pjlab.org.cn, lucewu@sjtu.edu.cn
Previous studies on robotic manipulation are based on a limited understanding of the underlying 3D motion constraints and affordances. To address these challenges, we propose a comprehensive paradigm, termed UniAff, that integrates 3D object-centric manipulation and task understanding in a unified formulation. Specifically, we construct a dataset labeled with manipulation-related key attributes, comprising 900 articulated objects from 19 categories and 600 tools from 12 categories. Furthermore, we leverage MLLMs to infer object-centric representations for manipulation tasks, including affordance recognition and reasoning about 3D motion constraints. Comprehensive experiments in both simulation and real-world settings indicate that UniAff significantly improves the generalization of robotic manipulation for tools and articulated objects. We hope that UniAff will serve as a general baseline for unified robotic manipulation tasks in the future.
UniAff unifies tool usage and articulation understanding in a VQA format, predicting part bounding boxes, 6D poses, grasp affordances, functional affordances, and manipulation types for effective robotic manipulation tasks.
Video
Framework
The architecture of UniAff. Image features are first extracted by a mixed visual encoder (e.g., DINOv2, CLIP, or Q-Former) and passed through an MLP projector. Language instructions are then tokenized and embedded with the Llama tokenizer. Finally, the structured outputs of the manipulation tasks, such as part bounding boxes, affordances, and revolute parts, are used to execute robotic instructions.
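As a rough sketch of this architecture (a minimal illustration with assumed feature dimensions and placeholder modules, not the released implementation), the fusion of visual and language tokens can be pictured as follows:

```python
# Minimal sketch of a UniAff-style forward pass (hypothetical module names
# and feature dimensions): visual tokens from a mixed encoder are projected
# into the LLM embedding space and fused with the tokenized instruction.
import torch
import torch.nn as nn

class MixedVisualEncoder(nn.Module):
    """Stand-in for DINOv2 / CLIP / Q-Former features (assumed 1024-dim)."""
    def __init__(self, out_dim=1024):
        super().__init__()
        self.backbone = nn.Conv2d(3, out_dim, kernel_size=14, stride=14)

    def forward(self, images):                     # (B, 3, H, W)
        feats = self.backbone(images)              # (B, C, H/14, W/14)
        return feats.flatten(2).transpose(1, 2)    # (B, N_tokens, C)

class UniAffSketch(nn.Module):
    def __init__(self, vis_dim=1024, llm_dim=4096, vocab_size=32000):
        super().__init__()
        self.visual_encoder = MixedVisualEncoder(vis_dim)
        self.projector = nn.Sequential(            # MLP projector
            nn.Linear(vis_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim))
        self.token_embed = nn.Embedding(vocab_size, llm_dim)  # Llama-style embeddings

    def forward(self, images, instruction_ids):
        vis_tokens = self.projector(self.visual_encoder(images))
        txt_tokens = self.token_embed(instruction_ids)
        # The fused sequence would be decoded by the LLM into structured text
        # (part bbox, grasp/functional affordances, joint type and parameters).
        return torch.cat([vis_tokens, txt_tokens], dim=1)

fused = UniAffSketch()(torch.randn(1, 3, 224, 224), torch.randint(0, 32000, (1, 16)))
print(fused.shape)  # torch.Size([1, 272, 4096]) = 256 visual + 16 text tokens
```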
Qualitative results
Tools’ affordance evaluation results on real-world HANDAL dataset
The evaluation results on the real-world HANDAL dataset, presented in the figure and table above, show that UniAff achieves competitive performance even in a zero-shot setting when combined with SAM. Trained solely on the simulation dataset, UniAff outperforms LISA by a significant margin of 11.5% and achieves performance comparable to ManipVQA, with only a 2.2% difference in IoU.
For more details about HANDAL, LISA, and ManipVQA, please visit the corresponding websites.
HANDAL: https://nvlabs.github.io/HANDAL/
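For reference, the IoU numbers above compare predicted and ground-truth 2D affordance boxes; a minimal sketch of the standard box-IoU metric (not the exact evaluation script) is:

```python
# Standard intersection-over-union for axis-aligned 2D boxes (x1, y1, x2, y2).
def box_iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

# Example: predicted vs. ground-truth grasp-affordance box.
print(box_iou((10, 10, 60, 60), (20, 20, 70, 70)))  # ~0.47
```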
Tools’ affordance evaluation results on our synthetic dataset
The evaluation results on our synthetic dataset, presented in the figure and table above, show that UniAff excels at detecting both grasp and functional affordances, improving IoU by 32.5% on grasp affordances and by 56.9% on functional affordances compared to ManipVQA, which struggles with functional affordance reasoning.
Simulation visual question answering (VQA) results for each object category
Nine seen categories of tools (Unseen instances)
Brush
Razor
Screwdriver
Hair dryer
Hammer
Knife
Spoon
Spatula
Power drill
Three unseen categories of tools
Flower shovel
Fork
Ladle
For tools, the dataset comprises 12 categories in total, with 9 used for training and 3 held out as unseen categories. The training categories are brush, razor, screwdriver, hair dryer, hammer, knife, spoon, spatula, and power drill, while the unseen categories are flower shovel, fork, and ladle. In the visualizations, yellow boxes represent 2D object bounding boxes, blue boxes indicate 2D grasp affordance bounding boxes, red boxes highlight 2D functional affordance bounding boxes, and the coordinate axes represent the tool's 6D pose. We evaluate UniAff on both unseen instances and unseen categories and visualize the results. The visualized results demonstrate that our model performs well on unseen instances and shows notable generalization to unseen categories.
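A minimal sketch of how such color-coded predictions can be rendered with OpenCV (the box coordinates below are hypothetical, purely for illustration):

```python
# Draw UniAff-style tool predictions: yellow = object box, blue = grasp
# affordance, red = functional affordance (colors in BGR order).
import cv2
import numpy as np

image = np.zeros((480, 640, 3), dtype=np.uint8)     # stand-in for the RGB frame
pred = {                                            # hypothetical VQA output
    "object":     (100, 120, 500, 360),
    "grasp":      (100, 200, 260, 300),
    "functional": (400, 150, 500, 260),
}
colors = {"object": (0, 255, 255), "grasp": (255, 0, 0), "functional": (0, 0, 255)}

for name, (x1, y1, x2, y2) in pred.items():
    cv2.rectangle(image, (x1, y1), (x2, y2), colors[name], 2)
    cv2.putText(image, name, (x1, y1 - 5), cv2.FONT_HERSHEY_SIMPLEX, 0.5,
                colors[name], 1)

cv2.imwrite("tool_affordances.png", image)
```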
Thirteen seen categories of articulated objects (Unseen instances)
Bottle
Box
Bucket
Dispenser
Door
Folding chair
Kitchen pot
Laptop
Microwave
Refrigerator
Safe
Storage furniture
Trash can
Six unseen categories of articulated objects
Faucet
Oven
Table
Toilet
Kettle
Washing machine
For articulated objects, the dataset includes 19 categories in total, with 13 used for training and 6 held out as unseen categories. The training categories are bottle, box, bucket, dispenser, door, folding chair, kitchen pot, laptop, microwave, refrigerator, safe, storage furniture, and trash can, while the unseen categories are faucet, oven, table, toilet, kettle, and washing machine. We apply UniAff to our synthetic dataset in a Visual Question Answering (VQA) format and visualize the results on both unseen instances and unseen categories. For the articulated objects, yellow boxes represent the 2D part bounding boxes, blue boxes indicate the 2D grasp affordance bounding boxes, and red arrows highlight the 3D spatial positions of the joint axes. The visualized results demonstrate that our model performs well on unseen instances and shows strong generalization to unseen categories.
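As an illustration of the red joint-axis arrows, a joint axis predicted in the camera frame can be projected into the image and drawn as follows (the camera intrinsics and axis values here are assumed, not taken from the dataset):

```python
# Project a predicted 3D joint axis (origin + direction in the camera frame)
# into the image and draw it as a red arrow.
import cv2
import numpy as np

K = np.array([[600.0,   0.0, 320.0],      # assumed pinhole intrinsics: fx, 0, cx
              [  0.0, 600.0, 240.0],      #                             0, fy, cy
              [  0.0,   0.0,   1.0]])

axis_origin = np.array([0.05, 0.00, 0.60])   # hypothetical joint position (meters)
axis_dir    = np.array([0.00, 1.00, 0.00])   # hypothetical joint direction (unit vector)

def project(p):
    uvw = K @ p
    return int(uvw[0] / uvw[2]), int(uvw[1] / uvw[2])

p0 = project(axis_origin)
p1 = project(axis_origin + 0.15 * axis_dir)  # 15 cm along the axis

canvas = np.zeros((480, 640, 3), dtype=np.uint8)
cv2.arrowedLine(canvas, p0, p1, (0, 0, 255), 2, tipLength=0.2)  # red arrow
cv2.imwrite("joint_axis.png", canvas)
```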
Bottle
Kitchen pot
Microwave
Storage furniture
Faucet
Toilet
Utilizing the 2D grasp affordances, we project the depth map into a 3D point cloud and use methods such as GraspNet to extract the corresponding 6D grasp poses. Based on these grasp poses, we integrate either the 6D object pose or the 3D joint axis, along with the relevant manipulation type, enabling the manipulation of tools or articulated objects while respecting the associated 3D motion constraints.
For more details about GraspNet, please visit the homepage: https://github.com/graspnet/anygrasp_sdk
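A minimal sketch of this step, with assumed camera intrinsics and the grasp detector call left as a placeholder (see the GraspNet/AnyGrasp SDK linked above for the real API):

```python
# Back-project the depth map to a point cloud, keep only points inside the
# predicted 2D grasp-affordance box, then hand them to a grasp detector.
import numpy as np

fx, fy, cx, cy = 600.0, 600.0, 320.0, 240.0        # assumed camera intrinsics

def depth_to_points(depth, bbox):
    """depth: (H, W) in meters; bbox: (x1, y1, x2, y2) grasp-affordance box."""
    x1, y1, x2, y2 = bbox
    v, u = np.mgrid[y1:y2, x1:x2]                   # pixel rows / columns
    z = depth[y1:y2, x1:x2]
    valid = z > 0                                   # drop missing depth (e.g. glass)
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x[valid], y[valid], z[valid]], axis=-1)   # (N, 3)

depth = np.full((480, 640), 0.6, dtype=np.float32)  # dummy depth frame
points = depth_to_points(depth, (100, 200, 260, 300))

# grasp_poses = grasp_detector.detect(points)       # placeholder for GraspNet / AnyGrasp
print(points.shape)                                 # (N, 3) points in the affordance region
```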
Simulation articulated object manipulation experiments
Open bottle cap
Open pot lid
Open revolute part
Open prismatic part
We present the qualitative manipulation performance of UniAff on four tasks: opening a bottle cap, a pot lid, a revolute part, and a prismatic part. Across these tasks, UniAff achieves a 7.07% improvement in success rate on unseen instances and a 9.60% improvement on unseen categories compared to A3VLM.
For more details about Where2Act, UMPNet, and A3VLM, please visit the corresponding websites.
Where2Act: https://github.com/daerduoCarey/where2act
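For reference, a hedged sketch of how a grasp pose can be propagated along the predicted motion constraint, rotating about a revolute axis or translating along a prismatic axis (the function names, step counts, and magnitudes below are illustrative, not the exact policy used in the experiments):

```python
# Generate end-effector waypoints that respect the predicted joint constraint:
# revolute parts rotate the grasp pose about the joint axis, prismatic parts
# translate it along the axis.
import numpy as np

def rotation_about_axis(axis, angle):
    """Rodrigues' formula: 3x3 rotation about a unit axis."""
    axis = axis / np.linalg.norm(axis)
    K = np.array([[0, -axis[2], axis[1]],
                  [axis[2], 0, -axis[0]],
                  [-axis[1], axis[0], 0]])
    return np.eye(3) + np.sin(angle) * K + (1 - np.cos(angle)) * (K @ K)

def waypoints(grasp_T, joint_type, axis_origin, axis_dir, steps=10, amount=0.6):
    """grasp_T: 4x4 grasp pose; amount: radians (revolute) or meters (prismatic)."""
    poses = []
    for s in np.linspace(0, amount, steps + 1)[1:]:
        T = np.eye(4)
        if joint_type == "revolute":
            R = rotation_about_axis(axis_dir, s)
            T[:3, :3] = R
            T[:3, 3] = axis_origin - R @ axis_origin   # rotate about the axis point
        else:                                          # "prismatic"
            T[:3, 3] = s * axis_dir / np.linalg.norm(axis_dir)
        poses.append(T @ grasp_T)
    return poses

path = waypoints(np.eye(4), "revolute", np.array([0.4, 0.0, 0.6]),
                 np.array([0.0, 0.0, 1.0]))
print(len(path), path[-1][:3, 3])   # 10 waypoints; final gripper position
```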
Real-world experiments
Hammer
Power drill
Screwdriver
Spatula
Spoon
Flower shovel
We apply UniAff to understand the usage of six previously unseen real-world tools in a Visual Question Answering (VQA) format and visualize the results. UniAff predicts 2D object bounding boxes, 2D grasp affordance bounding boxes, 2D functional affordance bounding boxes, and manipulation types (free object). Yellow boxes represent 2D object bounding boxes, blue boxes indicate 2D grasp affordance bounding boxes, and red boxes highlight 2D functional affordance bounding boxes. The results demonstrate UniAff's effectiveness in understanding tool usage.
Storage furniture
Refrigerator
Microwave
Kitchen pot
Bucket
Laptop
We utilize UniAff to understand the articulation of six previously unseen real-world articulated objects in a Visual Question Answering (VQA) format and visualize the results. UniAff predicts 2D part bounding boxes, 2D grasp affordance bounding boxes, the 3D spatial positions of the joint axes, and the corresponding manipulation types. The yellow boxes represent the 2D part bounding boxes, the blue boxes indicate the 2D grasp affordance bounding boxes, and the red arrows highlight the 3D spatial positions of the joint axes. The results demonstrate that UniAff effectively understands articulation.
Hammer
Power drill
Flower shovel
Storage furniture
Microwave
Kitchen pot
We visualize the 6D grasp poses within the identified grasp affordance regions. In real-world settings, despite depth-sensor noise and the difficulty RGB-D cameras have with transparent materials, such as the glass door of a microwave or a glass lid, UniAff effectively identifies the grasp affordance regions. Leveraging these 2D affordances, we project the depth map into a 3D point cloud and utilize GraspNet to extract the corresponding 6D grasp poses. Building on these grasp poses, we incorporate either the 6D object pose or the 3D joint axis, along with the designed manipulation policies, enabling the effective manipulation of tools or articulated objects.
Real-world manipulation experiments
Strike target
Open drawer
Open refrigerator
Open microwave
Open pot lid
Lift bucket handle
We present the qualitative manipulation performance on six unseen real-world objects. The videos are played at 6x speed. The experiments encompass six tasks: striking a target with a hammer; opening a drawer, a refrigerator, a microwave, and a pot lid; and lifting a bucket handle. These results collectively highlight UniAff's capability in real-world manipulation tasks.