BiPreManip:
Learning Affordance-Based Bimanual Preparatory Manipulation through Anticipatory Collaboration
Yan Shen, Feng Jiang, Zichen He, Xiaoqi Li, Yuchen Liu, Zhiyu Li, Ruihai Wu, Hao Dong
CVPR 2026
[Paper] [Code] [Dataset]
Video Presentation
Abstract
Many everyday objects are difficult to directly grasp (e.g., a flat iPad) or manipulate functionally (e.g., opening the cap of a pen lying on a desk). Such tasks require sequential, asymmetric coordination between two arms, where one arm performs preparatory manipulation that enables the other’s goal-directed action—for instance, pushing the iPad to the table’s edge before picking it up, or lifting the pen body to allow the other hand to remove its cap. In this work, we introduce Collaborative Preparatory Manipulation, a class of bimanual manipulation tasks that demand understanding object semantics and geometry, anticipating spatial relationships, and planning long-horizon coordinated actions between the two arms. To tackle this challenge, we propose a visual affordance-based framework that first envisions the final goal-directed action and then guides one arm to perform a sequence of preparatory manipulations that facilitate the other arm’s subsequent operation. This affordance-centric representation enables anticipatory inter-arm reasoning and coordination, generalizing effectively across various objects spanning diverse categories. Extensive experiments in both simulation and the real world demonstrate that our approach substantially improves task success rates and generalization compared to competitive baselines.
Overview
Figure 1. Illustration of Collaborative Preparatory Manipulation tasks. (1) The top row shows objects (e.g., a capped bottle or an inverted bowl) that cannot be directly grasped or operated on by a single arm, highlighting the necessity of bimanual coordination and preparatory manipulation. (2) In the bottom rows, one arm first performs preparatory actions—such as lifting, reorienting, or repositioning an object—to enable the other arm’s subsequent goal-directed manipulation.
Figure 2. Overview of the BiPreManip framework for bimanual preparatory manipulation tasks. (a) The system predicts an anticipatory affordance map to infer the goal-directed interaction of the primary arm. (b) Guided by this prediction, the assistant arm performs preparatory actions, establishing favorable conditions for manipulation. (c) The primary arm then executes the goal-directed manipulation. This anticipatory reasoning enables effective and coordinated dual-arm manipulation.
Pipeline
Figure 3. The BiPreManip pipeline. Given the point cloud observation and a language instruction, the Goal Affordance Network predicts an anticipatory affordance for the primary arm. Conditioned on this, the Pre-Affordance Network infers how the assistant arm should act to establish favorable object conditions. The Anticipatory Object Pose Predictor and Reorient Actor estimate and execute the object reconfiguration required for collision-free access. Finally, the Goal Affordance Network is re-invoked on the updated scene to execute the goal-directed manipulation. This framework design enables anticipatory, collaborative, and geometrically consistent bimanual reasoning.
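The four stages above can be sketched in code. This is a minimal illustrative mock-up, not the released implementation: every function name is hypothetical, and the learned networks (Goal Affordance Network, Pre-Affordance Network, Reorient Actor) are replaced by simple geometric heuristics purely to show the data flow between stages.

```python
import numpy as np

# Hypothetical sketch of the BiPreManip pipeline; all names and
# heuristics here are illustrative stand-ins for the learned modules.

def goal_affordance(points):
    """Stand-in for the Goal Affordance Network: score each point for
    the primary arm's goal-directed interaction (toy height heuristic)."""
    z = points[:, 2]
    return (z - z.min()) / (np.ptp(z) + 1e-8)

def pre_affordance(points, goal_aff):
    """Stand-in for the Pre-Affordance Network: score contact points for
    the assistant arm, favoring points away from the primary arm's target
    so the two grippers do not collide."""
    target = points[np.argmax(goal_aff)]
    dist = np.linalg.norm(points - target, axis=1)
    return dist / (dist.max() + 1e-8)

def reorient(points, contact):
    """Stand-in for the Pose Predictor + Reorient Actor: lift the object
    by a fixed offset at the assistant arm's contact (toy rigid motion)."""
    return points + np.array([0.0, 0.0, 0.05])

def bipremanip_pipeline(points):
    # (1) Anticipatory affordance for the primary arm's final action.
    goal_aff = goal_affordance(points)
    # (2) Pre-affordance: where the assistant arm should act first.
    pre_aff = pre_affordance(points, goal_aff)
    contact = points[np.argmax(pre_aff)]
    # (3) Reorient the object into a favorable configuration.
    updated = reorient(points, contact)
    # (4) Re-invoke the goal affordance on the updated scene and act.
    final_aff = goal_affordance(updated)
    grasp = updated[np.argmax(final_aff)]
    return contact, grasp

points = np.random.default_rng(0).uniform(size=(256, 3))
contact, grasp = bipremanip_pipeline(points)
```

The key structural point the sketch preserves is stage (4): the goal affordance is evaluated twice, once on the initial scene to anticipate the final action, and once on the reconfigured scene to execute it.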
Results
Edge-Pushing
Articulated Manipulation
Handover
Plate-Lifting
(See the top video for predicted affordances.)