Qiaojun Yu*, Siyuan Huang*, Xibin Yuan, Zhengkai Jiang, Ce Hao,
Xin Li, Haonan Chang, Junbo Wang, Liu Liu, Hongsheng Li, Peng Gao✉️, Cewu Lu✉️
* indicates equal contribution
✉️ Peng Gao and Cewu Lu are the corresponding authors: gaopeng@pjlab.org.cn, lucewu@sjtu.edu.cn
Previous studies on robotic manipulation are based on a limited understanding of the underlying 3D motion constraints and affordances. To address these challenges, we propose a comprehensive paradigm, termed UniAff, that integrates 3D object-centric manipulation and task understanding in a unified formulation. Specifically, we construct a dataset labeled with manipulation-related key attributes, comprising 900 articulated objects from 19 categories and 600 tools from 12 categories. Furthermore, we leverage MLLMs to infer object-centric representations for manipulation tasks, including affordance recognition and reasoning about 3D motion constraints. Comprehensive experiments in both simulation and real-world settings indicate that UniAff significantly improves the generalization of robotic manipulation for tools and articulated objects. We hope that UniAff will serve as a general baseline for unified robotic manipulation tasks in the future.
UniAff unifies tool usage and articulation understanding in a VQA format, predicting part bounding boxes, 6D poses, grasp affordances, functional affordances, and manipulation types for effective robotic manipulation tasks.
Video
Framework
The architecture of UniAff. Image features are first extracted by a mixed visual encoder (e.g., DINOv2, CLIP, or Q-Former) and passed through an MLP projector. Language instructions are then tokenized and embedded with the Llama tokenizer. Finally, the structured outputs of the manipulation tasks, such as part bounding boxes, affordances, and revolute parts, are used to execute robotic instructions.
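As a rough sketch of this architecture (a minimal illustration with assumed feature dimensions and placeholder modules, not the released implementation), the fusion of visual and language tokens can be pictured as follows:

```python
# Minimal sketch of a UniAff-style forward pass (hypothetical module names
# and feature dimensions): visual tokens from a mixed encoder are projected
# into the LLM embedding space and fused with the tokenized instruction.
import torch
import torch.nn as nn

class MixedVisualEncoder(nn.Module):
    """Stand-in for DINOv2 / CLIP / Q-Former features (assumed 1024-dim)."""
    def __init__(self, out_dim=1024):
        super().__init__()
        self.backbone = nn.Conv2d(3, out_dim, kernel_size=14, stride=14)

    def forward(self, images):                     # (B, 3, H, W)
        feats = self.backbone(images)              # (B, C, H/14, W/14)
        return feats.flatten(2).transpose(1, 2)    # (B, N_tokens, C)

class UniAffSketch(nn.Module):
    def __init__(self, vis_dim=1024, llm_dim=4096, vocab_size=32000):
        super().__init__()
        self.visual_encoder = MixedVisualEncoder(vis_dim)
        self.projector = nn.Sequential(            # MLP projector
            nn.Linear(vis_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim))
        self.token_embed = nn.Embedding(vocab_size, llm_dim)  # Llama-style embeddings

    def forward(self, images, instruction_ids):
        vis_tokens = self.projector(self.visual_encoder(images))
        txt_tokens = self.token_embed(instruction_ids)
        # The fused sequence would be decoded by the LLM into structured text
        # (part bbox, grasp/functional affordances, joint type and parameters).
        return torch.cat([vis_tokens, txt_tokens], dim=1)

fused = UniAffSketch()(torch.randn(1, 3, 224, 224), torch.randint(0, 32000, (1, 16)))
print(fused.shape)  # torch.Size([1, 272, 4096]) = 256 visual + 16 text tokens
```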
Qualitative results
Tools’ affordance evaluation results on real-world HANDAL dataset
The evaluation results on the real-world HANDAL dataset, presented in the figure and table above, show that UniAff achieves competitive performance even in a zero-shot setting when combined with SAM. Trained solely on the simulation dataset, UniAff outperforms LISA by a significant margin of 11.5% and achieves performance comparable to ManipVQA, with only a 2.2% difference in IoU.
For more details about HANDAL, LISA, and ManipVQA, please visit the corresponding websites.
HANDAL: https://nvlabs.github.io/HANDAL/
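For reference, the IoU numbers above compare predicted and ground-truth 2D affordance boxes; a minimal sketch of the standard box-IoU metric (not the exact evaluation script) is:

```python
# Standard intersection-over-union for axis-aligned 2D boxes (x1, y1, x2, y2).
def box_iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

# Example: predicted vs. ground-truth grasp-affordance box.
print(box_iou((10, 10, 60, 60), (20, 20, 70, 70)))  # ~0.47
```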
Tools’ affordance evaluation results on our synthetic dataset
The evaluation results on our synthetic dataset, presented in the figure and table above, show that UniAff excels at detecting both grasp and functional affordances, improving IoU by 32.5% on grasp affordances and by 56.9% on functional affordances compared to ManipVQA, which struggles with functional affordance reasoning.
Simulation visual question answering (VQA) results for each object category
Nine seen categories of tools (Unseen instances)
Brush
Razor
Screwdriver
Hair dryer
Hammer
Knife
Spoon
Spatula
Power drill
Three unseen categories of tools
Flower shovel
Fork
Ladle
For tools, the dataset comprises 12 categories in total, with 9 used for training and 3 held out as unseen categories. The training categories are brush, razor, screwdriver, hair dryer, hammer, knife, spoon, spatula, and power drill, while the unseen categories are flower shovel, fork, and ladle. In the visualizations, yellow boxes represent 2D object bounding boxes, blue boxes indicate 2D grasp affordance bounding boxes, red boxes highlight 2D functional affordance bounding boxes, and the coordinate axes represent the tool's 6D pose. We evaluate UniAff on both unseen instances and unseen categories and visualize the results. The visualized results demonstrate that our model performs well on unseen instances and shows notable generalization to unseen categories.
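A minimal sketch of how such color-coded predictions can be rendered with OpenCV (the box coordinates below are hypothetical, purely for illustration):

```python
# Draw UniAff-style tool predictions: yellow = object box, blue = grasp
# affordance, red = functional affordance (colors in BGR order).
import cv2
import numpy as np

image = np.zeros((480, 640, 3), dtype=np.uint8)     # stand-in for the RGB frame
pred = {                                            # hypothetical VQA output
    "object":     (100, 120, 500, 360),
    "grasp":      (100, 200, 260, 300),
    "functional": (400, 150, 500, 260),
}
colors = {"object": (0, 255, 255), "grasp": (255, 0, 0), "functional": (0, 0, 255)}

for name, (x1, y1, x2, y2) in pred.items():
    cv2.rectangle(image, (x1, y1), (x2, y2), colors[name], 2)
    cv2.putText(image, name, (x1, y1 - 5), cv2.FONT_HERSHEY_SIMPLEX, 0.5,
                colors[name], 1)

cv2.imwrite("tool_affordances.png", image)
```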
Thirteen seen categories of articulated objects (Unseen instances)
Bottle
Box
Bucket
Dispenser
Door
Folding chair
Kitchen pot
Laptop
Microwave
Refrigerator
Safe
Storage furniture
Trash can
Six unseen categories of articulated objects
Faucet
Oven
Table
Toilet
Kettle
Washing machine
For articulated objects, the dataset includes 19 categories in total, with 13 used for training and 6 held out as unseen categories. The training categories are bottle, box, bucket, dispenser, door, folding chair, kitchen pot, laptop, microwave, refrigerator, safe, storage furniture, and trash can, while the unseen categories are faucet, oven, table, toilet, kettle, and washing machine. We apply UniAff to our synthetic dataset in a Visual Question Answering (VQA) format and visualize the results on both unseen instances and unseen categories. For the articulated objects, yellow boxes represent the 2D part bounding boxes, blue boxes indicate the 2D grasp affordance bounding boxes, and red arrows highlight the 3D spatial positions of the joint axes. The visualized results demonstrate that our model performs well on unseen instances and shows strong generalization to unseen categories.
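As an illustration of the red joint-axis arrows, a joint axis predicted in the camera frame can be projected into the image and drawn as follows (the camera intrinsics and axis values here are assumed, not taken from the dataset):

```python
# Project a predicted 3D joint axis (origin + direction in the camera frame)
# into the image and draw it as a red arrow.
import cv2
import numpy as np

K = np.array([[600.0,   0.0, 320.0],      # assumed pinhole intrinsics: fx, 0, cx
              [  0.0, 600.0, 240.0],      #                             0, fy, cy
              [  0.0,   0.0,   1.0]])

axis_origin = np.array([0.05, 0.00, 0.60])   # hypothetical joint position (meters)
axis_dir    = np.array([0.00, 1.00, 0.00])   # hypothetical joint direction (unit vector)

def project(p):
    uvw = K @ p
    return int(uvw[0] / uvw[2]), int(uvw[1] / uvw[2])

p0 = project(axis_origin)
p1 = project(axis_origin + 0.15 * axis_dir)  # 15 cm along the axis

canvas = np.zeros((480, 640, 3), dtype=np.uint8)
cv2.arrowedLine(canvas, p0, p1, (0, 0, 255), 2, tipLength=0.2)  # red arrow
cv2.imwrite("joint_axis.png", canvas)
```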
Bottle
Kitchen pot
Microwave
Storage furniture
Faucet
Toilet
Utilizing the 2D grasp affordances, we project the depth map into a 3D point cloud and use methods such as GraspNet to extract the corresponding 6D grasp poses. Based on these grasp poses, we integrate either the 6D object pose or the 3D joint axis, along with the relevant manipulation type, enabling the manipulation of tools or articulated objects while respecting the associated 3D motion constraints.
For more details about GraspNet, please visit the homepage: https://github.com/graspnet/anygrasp_sdk
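A minimal sketch of this step, with assumed camera intrinsics and the grasp detector call left as a placeholder (see the GraspNet/AnyGrasp SDK linked above for the real API):

```python
# Back-project the depth map to a point cloud, keep only points inside the
# predicted 2D grasp-affordance box, then hand them to a grasp detector.
import numpy as np

fx, fy, cx, cy = 600.0, 600.0, 320.0, 240.0        # assumed camera intrinsics

def depth_to_points(depth, bbox):
    """depth: (H, W) in meters; bbox: (x1, y1, x2, y2) grasp-affordance box."""
    x1, y1, x2, y2 = bbox
    v, u = np.mgrid[y1:y2, x1:x2]                   # pixel rows / columns
    z = depth[y1:y2, x1:x2]
    valid = z > 0                                   # drop missing depth (e.g. glass)
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x[valid], y[valid], z[valid]], axis=-1)   # (N, 3)

depth = np.full((480, 640), 0.6, dtype=np.float32)  # dummy depth frame
points = depth_to_points(depth, (100, 200, 260, 300))

# grasp_poses = grasp_detector.detect(points)       # placeholder for GraspNet / AnyGrasp
print(points.shape)                                 # (N, 3) points in the affordance region
```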
Simulation articulated object manipulation experiments
Open bottle cap
Open pot lid
Open revolute part
Open prismatic part
We present the qualitative manipulation performance of UniAff on four tasks: opening a bottle cap, a pot lid, a revolute part, and a prismatic part. Across these tasks, UniAff achieves a 7.07% improvement in success rate on unseen instances and a 9.60% improvement on unseen categories compared to A3VLM.
For more details about Where2Act, UMPNet, and A3VLM, please visit the corresponding websites.
Where2Act: https://github.com/daerduoCarey/where2act
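For reference, a hedged sketch of how a grasp pose can be propagated along the predicted motion constraint, rotating about a revolute axis or translating along a prismatic axis (the function names, step counts, and magnitudes below are illustrative, not the exact policy used in the experiments):

```python
# Generate end-effector waypoints that respect the predicted joint constraint:
# revolute parts rotate the grasp pose about the joint axis, prismatic parts
# translate it along the axis.
import numpy as np

def rotation_about_axis(axis, angle):
    """Rodrigues' formula: 3x3 rotation about a unit axis."""
    axis = axis / np.linalg.norm(axis)
    K = np.array([[0, -axis[2], axis[1]],
                  [axis[2], 0, -axis[0]],
                  [-axis[1], axis[0], 0]])
    return np.eye(3) + np.sin(angle) * K + (1 - np.cos(angle)) * (K @ K)

def waypoints(grasp_T, joint_type, axis_origin, axis_dir, steps=10, amount=0.6):
    """grasp_T: 4x4 grasp pose; amount: radians (revolute) or meters (prismatic)."""
    poses = []
    for s in np.linspace(0, amount, steps + 1)[1:]:
        T = np.eye(4)
        if joint_type == "revolute":
            R = rotation_about_axis(axis_dir, s)
            T[:3, :3] = R
            T[:3, 3] = axis_origin - R @ axis_origin   # rotate about the axis point
        else:                                          # "prismatic"
            T[:3, 3] = s * axis_dir / np.linalg.norm(axis_dir)
        poses.append(T @ grasp_T)
    return poses

path = waypoints(np.eye(4), "revolute", np.array([0.4, 0.0, 0.6]),
                 np.array([0.0, 0.0, 1.0]))
print(len(path), path[-1][:3, 3])   # 10 waypoints; final gripper position
```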
Real-world experiments
Hammer
Power drill
Screwdriver
Spatula
Spoon
Flower shovel
We apply UniAff to understand the usage of six previously unseen real-world tools in a Visual Question Answering (VQA) format and visualize the results. UniAff predicts 2D object bounding boxes, 2D grasp affordance bounding boxes, 2D functional affordance bounding boxes, and manipulation types (free object). Yellow boxes represent 2D object bounding boxes, blue boxes indicate 2D grasp affordance bounding boxes, and red boxes highlight 2D functional affordance bounding boxes. The results demonstrate UniAff's effectiveness in understanding tool usage.
Storage furniture
Refrigerator
Microwave
Kitchen pot
Bucket
Laptop
We utilize UniAff to understand the articulation of six previously unseen real-world articulated objects in a Visual Question Answering (VQA) format and visualize the results. UniAff predicts 2D part bounding boxes, 2D grasp affordance bounding boxes, the 3D spatial positions of the joint axes, and the corresponding manipulation types. The yellow boxes represent the 2D part bounding boxes, the blue boxes indicate the 2D grasp affordance bounding boxes, and the red arrows highlight the 3D spatial positions of the joint axes. The results demonstrate that UniAff effectively understands articulation.
Hammer
Power drill
Flower shovel
Storage furniture
Microwave
Kitchen pot
We visualize the 6D grasp poses within the identified grasp affordance regions. In real-world settings, despite depth-sensor noise and the difficulty RGB-D cameras have with transparent materials, such as the glass door of a microwave or a glass lid, UniAff effectively identifies the grasp affordance regions. Leveraging these 2D affordances, we project the depth map into a 3D point cloud and utilize GraspNet to extract the corresponding 6D grasp poses. Building on these grasp poses, we incorporate either the 6D object pose or the 3D joint axis, along with the designed manipulation policies, enabling the effective manipulation of tools or articulated objects.
Real-world manipulation experiments
Strike target
Open drawer
Open refrigerator
Open microwave
Open pot lid
Lift bucket handle
We present the qualitative manipulation performance on six unseen real-world objects. The videos are played at 6x speed. The experiments encompass six tasks: striking a target with a hammer; opening a drawer, a refrigerator, a microwave, and a pot lid; and lifting a bucket handle. These results collectively highlight UniAff's capability in real-world manipulation tasks.