BiGraspFormer
Official Code will be released soon
Abstract
Bimanual grasping is essential for robots to handle large and complex objects. However, existing methods either focus solely on single-arm grasping or employ separate grasp generation and bimanual evaluation stages, leading to coordination problems including collision risks and unbalanced force distribution. To address these limitations, we propose BiGraspFormer, a unified end-to-end transformer framework that directly generates coordinated bimanual grasps from object point clouds. Our key idea is the Single-Guided Bimanual (SGB) strategy, which first generates diverse single grasp candidates using a transformer decoder, then leverages their learned features through specialized attention mechanisms to jointly predict bimanual poses and quality scores. This conditioning strategy reduces the complexity of the 12-DoF search space while ensuring coordinated bimanual manipulation. Comprehensive simulation experiments and real-world validation demonstrate that BiGraspFormer consistently outperforms existing methods while maintaining efficient inference speed (<0.05s), confirming the effectiveness of our framework.
Method
Our goal is to predict bimanual grasps from an object point cloud. This task is challenging because it involves a 12-DoF action space, twice the dimensionality of the 6-DoF single-arm case. Each grasp needs to achieve force-closure stability while both arms coordinate to avoid collisions, maintain torque balance, and ensure overall stability. To tackle this challenge, we introduce the Single-Guided Bimanual (SGB) grasp generation scheme, which decomposes the prediction of bimanual grasps into three structured stages. First, we generate diverse single grasp candidates under basic stability constraints. Second, we select feasible grasp pairs by discarding colliding pairs and ensuring balanced force distribution. Finally, we refine these pairs into stable bimanual grasps using learned features from both the object and the single grasps. This formulation explicitly enforces both individual grasp quality and dual-arm coordination, turning the complex 12-DoF search into a sequence of more tractable subproblems.
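To make the second stage more concrete, the following is a minimal sketch of pair selection over single grasp candidates. It is not the BiGraspFormer implementation: the function name `select_grasp_pairs`, the thresholds `min_separation` and `max_imbalance`, and the two geometric proxies (gripper-to-gripper distance standing in for a collision check, and distance from the pair midpoint to the object centroid standing in for force balance) are all assumptions chosen for illustration.

```python
import numpy as np


def select_grasp_pairs(positions, centroid, min_separation=0.10, max_imbalance=0.05):
    """Return index pairs (i, j), i < j, that pass two simple geometric checks.

    Hypothetical stand-in for the pair-selection stage: both checks and both
    thresholds (in meters) are illustrative assumptions, not the paper's criteria.
    """
    positions = np.asarray(positions, dtype=float)  # (N, 3) grasp centers
    centroid = np.asarray(centroid, dtype=float)    # (3,) object centroid

    # Enumerate all unordered candidate pairs once.
    i_idx, j_idx = np.triu_indices(len(positions), k=1)

    # Collision proxy: the two grippers must be far enough apart.
    separation = np.linalg.norm(positions[i_idx] - positions[j_idx], axis=1)

    # Balance proxy: the midpoint of the pair should lie near the centroid.
    midpoint = 0.5 * (positions[i_idx] + positions[j_idx])
    imbalance = np.linalg.norm(midpoint - centroid, axis=1)

    keep = (separation >= min_separation) & (imbalance <= max_imbalance)
    return np.stack([i_idx[keep], j_idx[keep]], axis=1)


# Four made-up single grasp centers around an object centered at the origin.
candidates = [[-0.20, 0.0, 0.0], [0.20, 0.0, 0.0], [0.21, 0.0, 0.0], [0.0, 0.30, 0.0]]
pairs = select_grasp_pairs(candidates, centroid=[0.0, 0.0, 0.0])
print(pairs)
```

In this toy setup only the pairs that straddle the centroid survive, while the two nearly coincident grasps on the right side are rejected by the separation check. A learned model would replace these hand-written rules with predicted scores, but the filtering step shows why pairing shrinks the search: the refinement stage only has to consider pairs that are already plausible.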
Real-world Results
Blue Chair
Toy Box
White Shelf
Yellow Stair
Wood Stool
White Frame
Green Bin
Yellow Chair