2HandedAfforder: Learning Precise Actionable Bimanual Affordances from Human Videos
Marvin Heidinger*, Snehal Jauhri*, Vignesh Prasad, and Georgia Chalvatzaki
PEARL Lab, TU Darmstadt, Germany
* Equal contribution
International Conference on Computer Vision (ICCV) 2025
Summary:
When interacting with objects, humans effectively reason about which regions of objects are viable for an intended action, i.e., the affordance regions of the object. They can also account for subtle differences in object regions based on the task to be performed and whether one or two hands need to be used. However, current vision-based affordance prediction methods often reduce the problem to naive object part segmentation. In this work, we propose a framework for extracting affordance data from human activity video datasets. Our extracted 2HANDS dataset contains precise object affordance region segmentations and affordance class labels given by narrations of the performed activity. The data also accounts for bimanual actions, i.e., two hands coordinating and interacting with one or more objects. We present a VLM-based affordance prediction model, 2HandedAfforder, trained on the dataset, and demonstrate superior performance over baselines in affordance region segmentation for various activities. Finally, we show that our predicted affordance regions are actionable, i.e., they can be used by an agent performing a task, by demonstrating their use in robotic manipulation scenarios.
Examples of different manipulation tasks executed on a bimanual Tiago++ robot. Red and green masks denote left and right hand affordance mask predictions, respectively. We use our method, 2HandedAfforder, to segment the task-specific object affordance regions, propose grasps for these regions, and execute manipulation tasks.
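As a concrete illustration of this last step, below is a minimal Python/NumPy sketch of how a predicted affordance mask could seed grasp proposals: pixels inside the mask are sampled and deprojected to 3D candidate points using a depth image and pinhole camera intrinsics. The function name, the intrinsics (fx, fy, cx, cy), and the sampling strategy are illustrative assumptions; the actual grasp proposal and execution pipeline on the Tiago++ robot is not shown here.

import numpy as np

def affordance_to_grasp_points(mask, depth, fx, fy, cx, cy, num_samples=32):
    """Return (N, 3) candidate grasp points sampled inside the affordance region.

    mask:  boolean HxW affordance mask (one hand)
    depth: HxW depth image in meters, aligned with the mask
    fx, fy, cx, cy: pinhole camera intrinsics (assumed known)
    """
    ys, xs = np.nonzero(mask)
    if len(ys) == 0:
        return np.empty((0, 3))
    idx = np.random.choice(len(ys), size=min(num_samples, len(ys)), replace=False)
    u, v = xs[idx], ys[idx]
    z = depth[v, u]
    # Deproject sampled pixels to 3D points in the camera frame.
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=1)

These candidate points would then be handed to a grasp planner and the robot's motion stack.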
Affordance extraction pipeline: Given a human activity video sequence and single-frame object and hand masks, we first use a video mask-propagation network to obtain dense, full-sequence object and hand masks. We then inpaint out the hands in the RGB images using a video-based hand inpainting model, which yields images in which the objects are reconstructed and no longer occluded by the hands. With the inpainted image and the original object masks, we use SAM2 to "complete" the object masks by again propagating the object masks onto the inpainted image. Finally, we extract the affordance region masks for the given task as the intersection between the completed object masks and the hand masks, and label the affordance class using the narration of the task.
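The final intersection step is simple enough to sketch directly. Below is a minimal Python/NumPy illustration, assuming the completed (un-occluded) object mask and the per-hand masks are already available as boolean arrays; the function and variable names are illustrative and not taken from the released code.

import numpy as np

def extract_affordance_masks(completed_object_mask, left_hand_mask, right_hand_mask):
    """Affordance regions = completed (un-occluded) object mask AND per-hand masks."""
    left_affordance = np.logical_and(completed_object_mask, left_hand_mask)
    right_affordance = np.logical_and(completed_object_mask, right_hand_mask)
    return left_affordance, right_affordance

# Usage with boolean HxW masks; the affordance class label itself is taken from
# the activity narration, e.g. "cut vegetable".
# left_aff, right_aff = extract_affordance_masks(obj_mask_completed, l_hand_mask, r_hand_mask)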
Affordance prediction network: Given an input image and a task, we prompt a Vision-Language Model (VLM) with a question asking where the objects should be interacted with for the desired task. The VLM produces language tokens and a [SEG] token, which is passed to the affordance decoders. We also use a SAM vision backbone to encode the image and pass the image features to the affordance decoders. The decoders predict the left-hand and right-hand affordance region masks as well as a taxonomy classification indicating whether the interaction should be performed with the left hand, the right hand, or both hands. The vision encoder is frozen, while the VLM is fine-tuned using LoRA.
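For clarity, below is a rough PyTorch sketch of the prediction-time dataflow described above. The vlm and sam_encoder modules, the decoder design, and all dimensions are placeholders for the actual networks (in the paper, the VLM is LoRA-fine-tuned and the SAM vision encoder is frozen); this is an assumption-laden illustration, not the released architecture.

import torch
import torch.nn as nn

class AffordanceDecoderHead(nn.Module):
    """Placeholder decoder: conditions image features on the [SEG] token and predicts a mask."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.fuse = nn.Linear(2 * dim, dim)
        self.mask_head = nn.Conv2d(dim, 1, kernel_size=1)

    def forward(self, seg_token, img_feat):
        # seg_token: (B, dim); img_feat: (B, dim, H, W)
        b, d, h, w = img_feat.shape
        cond = seg_token[:, :, None, None].expand(b, d, h, w)
        fused = torch.cat([img_feat, cond], dim=1)        # (B, 2*dim, H, W)
        fused = self.fuse(fused.permute(0, 2, 3, 1))      # (B, H, W, dim)
        return self.mask_head(fused.permute(0, 3, 1, 2))  # (B, 1, H, W) mask logits

class TwoHandedAfforderSketch(nn.Module):
    """Rough dataflow: VLM -> [SEG] token; SAM backbone -> image features; two decoders + taxonomy head."""
    def __init__(self, vlm: nn.Module, sam_encoder: nn.Module, dim: int = 256):
        super().__init__()
        self.vlm = vlm                    # LoRA-fine-tuned VLM (placeholder module)
        self.sam_encoder = sam_encoder    # frozen SAM vision backbone (placeholder module)
        for p in self.sam_encoder.parameters():
            p.requires_grad = False
        self.left_decoder = AffordanceDecoderHead(dim)
        self.right_decoder = AffordanceDecoderHead(dim)
        self.taxonomy_head = nn.Linear(dim, 3)  # left hand / right hand / bimanual

    def forward(self, image, prompt_tokens):
        seg_token = self.vlm(prompt_tokens)   # (B, dim) embedding of the [SEG] token
        img_feat = self.sam_encoder(image)    # (B, dim, H, W) image features
        left_mask = self.left_decoder(seg_token, img_feat)
        right_mask = self.right_decoder(seg_token, img_feat)
        taxonomy_logits = self.taxonomy_head(seg_token)
        return left_mask, right_mask, taxonomy_logits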
Video demonstration:
BibTeX:
@misc{heidinger20252handedafforderlearningpreciseactionable,
title={2HandedAfforder: Learning Precise Actionable Bimanual Affordances from Human Videos},
author={Marvin Heidinger and Snehal Jauhri and Vignesh Prasad and Georgia Chalvatzaki},
year={2025},
eprint={2503.09320},
archivePrefix={arXiv},
primaryClass={cs.CV},
}