Grounding 3D Object Affordance with Language Instructions, Visual Observations and Interactions
CVPR 2025
Grounding 3D object affordance is a task that locates the regions of objects in 3D space where they can be manipulated, linking perception and action for embodied intelligence. For example, an intelligent robot must accurately ground the affordance of an object and grasp it according to human instructions. In this paper, we introduce a novel task that grounds 3D object affordance based on language instructions, visual observations, and interactions, and we collect an Affordance Grounding dataset with Point, Image and Language instructions (AGPIL) to support it. In the 3D physical world, observation orientation, object rotation, or spatial occlusion often yield only a partial observation of an object, so the dataset includes affordance estimations of objects under full-view, partial-view, and partial-view-with-rotation settings. To solve this task, we propose LMAffordance3D, the first multi-modal, language-guided 3D affordance grounding network, which applies a vision-language model to fuse 2D and 3D spatial features with semantic features. Comprehensive experiments on AGPIL demonstrate the effectiveness and superiority of our method on this task, even in unseen experimental settings.
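To summarize the task interface described above, the sketch below shows what a model for this task consumes and produces. The helper name `ground_affordance`, the tensor shapes, and the model's call signature are illustrative assumptions, not the paper's actual API.

```python
# A minimal sketch of the task's inputs and output, assuming a trained model
# with the hypothetical signature model(image, points, instruction) -> scores.
import torch

def ground_affordance(model, image, points, instruction):
    """Ground 3D object affordance from multi-modal inputs.

    image:       (3, H, W) RGB observation of the human-object interaction.
    points:      (N, 3) object point cloud (full or partial view).
    instruction: free-form language, e.g. "Grasp the mug by its handle."
    Returns per-point affordance scores in [0, 1] with shape (N,).
    """
    with torch.no_grad():
        scores = model(image.unsqueeze(0), points.unsqueeze(0), [instruction])
    return scores.squeeze(0)
```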
Figure 1. Illustration of the affordance grounding task. Inspired by cognitive science, when humans encounter a new object, they learn its use through language instructions, human-machine interactions, and visual observations, thereby obtaining its affordance. Accordingly, we introduce a new task: grounding 3D object affordance with language instructions, visual observations, and interactions.
Table 1. Comparison of affordance grounding datasets. We mainly compare the input, output, view, and dataset scale across datasets. F, P, and R refer to full-view, partial-view, and rotation-view, respectively.
(a) Example of image data.
(b) Example of point cloud and affordance data.
(c) Distribution of instruction data.
(d) Distribution of image data.
(e) Distribution of point cloud data.
(f) Distribution of affordance data.
Figure 2. Examples and statistics of the AGPIL dataset. Figures (a) and (b) show examples of images, point clouds, and a randomly selected affordance from the dataset. Figure (c) shows a word cloud generated from the frequency of each word in the language instructions. Figures (d) and (e) show the distribution of affordances across different objects in the image and point cloud data, respectively; the horizontal axis lists the object categories, the vertical axis shows the counts, and different colors indicate different affordances. Figure (f) illustrates the distribution of image and point cloud data for each affordance; images and point clouds are not matched one-to-one, since a single image may correspond to multiple objects.
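To make the many-to-one relation in Figure (f) concrete, the following is a hypothetical layout of a single AGPIL-style training sample; all field names and file paths are illustrative assumptions, not the dataset's actual schema.

```python
# Hypothetical sample layout (field names and paths are assumptions): one
# interaction image can be paired with several object point clouds, each
# carrying its own view type and per-point affordance annotation.
sample = {
    "instruction": "Open the storage furniture by pulling its handle.",
    "image": "images/kitchen_0001.png",                  # 2D interaction observation
    "objects": [                                          # one image -> many objects
        {
            "point_cloud": "points/storage_furniture_0032.npy",  # (N, 3) coordinates
            "view": "partial_rotation",                   # full / partial / partial+rotation
            "affordance": "open",
            "label": "labels/storage_furniture_0032_open.npy",   # (N,) scores in [0, 1]
        },
    ],
}
```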
Figure 3. Method. The structure of the proposed LMAffordance3D model, which consists of four major components: 1) a vision encoder that processes multi-modal data, including images and point clouds, to encode and fuse 2D and 3D features; 2) a vision-language model and its associated components (tokenizer and adapter) that take the instruction tokens and the 2D and 3D vision tokens to predict the affordance feature; 3) a decoder that uses the 2D and 3D spatial features as query, the instruction features as key, and the semantic features as value for fusion; 4) a head for segmenting and grounding 3D object affordance.
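As a reading aid for Figure 3, the sketch below mirrors the four-component layout in a few dozen lines of PyTorch. Every module choice, dimension, and name is a placeholder assumption (e.g., a toy patch embedding stands in for the image backbone and a two-layer transformer stands in for the vision-language model); this is not the actual LMAffordance3D implementation.

```python
# A minimal, self-contained sketch of the four-component layout in Figure 3.
# All modules below are toy stand-ins chosen for brevity, not the paper's code.
import torch
import torch.nn as nn


class AffordanceDecoder(nn.Module):
    """Cross-attention fusion: 2D/3D spatial features as query,
    instruction features as key, semantic features as value."""

    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, spatial, instruction, semantic):
        fused, _ = self.attn(query=spatial, key=instruction, value=semantic)
        return self.norm(spatial + fused)


class ToyLMAffordance3D(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        # 1) Vision encoder: toy stand-ins for the image and point-cloud
        #    backbones (ResNet18 / PointNet++ in the paper's baseline).
        self.img_enc = nn.Sequential(nn.Conv2d(3, dim, 16, 16), nn.Flatten(2))
        self.pc_enc = nn.Linear(3, dim)
        # 2) Adapter feeding a (stubbed) vision-language model.
        self.adapter = nn.Linear(dim, dim)
        self.vlm = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, 8, batch_first=True), num_layers=2)
        # 3) Decoder fusing spatial, instruction, and semantic features.
        self.decoder = AffordanceDecoder(dim)
        # 4) Head producing per-point affordance scores.
        self.head = nn.Sequential(nn.Linear(dim, 1), nn.Sigmoid())

    def forward(self, image, points, instruction_tokens):
        img_feat = self.img_enc(image).transpose(1, 2)         # (B, T_img, D)
        pc_feat = self.pc_enc(points)                          # (B, N, D)
        spatial = torch.cat([img_feat, pc_feat], dim=1)        # 2D + 3D tokens
        t_ins = instruction_tokens.size(1)
        # The VLM consumes instruction and vision tokens and emits semantic features.
        semantic = self.vlm(torch.cat([self.adapter(instruction_tokens),
                                       self.adapter(spatial)], dim=1))
        fused = self.decoder(spatial, instruction_tokens, semantic[:, :t_ins])
        # Score only the point-cloud part of the fused tokens.
        return self.head(fused[:, -points.size(1):]).squeeze(-1)  # (B, N)


if __name__ == "__main__":
    model = ToyLMAffordance3D()
    scores = model(torch.rand(2, 3, 224, 224),   # images
                   torch.rand(2, 2048, 3),       # point clouds
                   torch.rand(2, 16, 256))       # pre-embedded instruction tokens
    print(scores.shape)                          # torch.Size([2, 2048])
```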
Figure 4. Visualization. We selected several examples from the test set under different views and experimental settings, showcasing the model’s inputs and outputs, and comparing them with the ground truth (GT).
Figure 5. Affordance multiplicity. We keep the image and point cloud inputs unchanged while modifying the language instructions to demonstrate the model’s instruction-following capability. In the figure, we have omitted the instructions, retaining only the object and different affordance names.
Figure 6. Different shapes. The objects in the instructions, images and point clouds belong to the same category but have different geometries. We choose two examples and compare them with the ground truth (GT).
Figure 7. Different categories. The objects in the instructions, images and point clouds have different categories and geometries. (Row 1) The category of the object in the instruction and image is “refrigerator”, while the object in the point cloud is “storage furniture”. (Row 2) The category of the object in the instruction and image is “bag”, while the object in the point cloud is “hat”.
Figure 8. Failure cases. In the rotation-view and unseen setting, (Row 1) the model fails to ground the dishwasher door handle and only grounds the door itself; (Row 2) the model is less effective at grounding small affordance regions, such as the buttons on the microwave.
Table 2. Description of tensors. We present the dimensions and meanings of the input and output tensors in each component of the model. Here we do not specify the batch size.
Table 3. Different backbones. Here we present the results of models using different backbones under various views and settings. Among them, “Baseline” refers to using ResNet18 and PointNet++ as the backbone, “ViT” refers to using CLIP-ViT and PointNet++ as the backbone, and “PCF” refers to using ResNet18 and PointConvFormer as the backbone.
Table 4. Different pairings. We show detailed results for models trained with different numbers of pairings under various views and settings. One image can be paired with multiple point clouds during training, and the number of pairings influences model performance.
Please refer to the paper for more details.