Selective Object Rearrangement in Clutter

Bingjie Tang, Gaurav S. Sukhatme

University of Southern California

Abstract: We propose an image-based, learned method for selective tabletop object rearrangement in clutter using a parallel-jaw gripper. Our method consists of three stages: graph-based object sequencing (which object to move), feature-based action selection (whether to push or grasp, and at what position and orientation), and a visual correspondence-based placement policy (where to place a grasped object). Experiments show that this decomposition works well in challenging settings that require the robot to begin with an initially cluttered scene, select only the objects that need to be rearranged while discarding others, and deal with cases where the goal location for an object is already occupied, making it the first system to address all of these concurrently in a purely image-based setting. We also achieve an ~8% improvement in task success rate over the previously best reported result that handles both translation and orientation in less restrictive (uncluttered, non-selective) settings. We demonstrate zero-shot transfer of our system, trained solely in simulation, to a real robot that selectively rearranges everyday objects, many unseen during learning, on a crowded tabletop.

System overview

We decompose the rearrangement problem into three parts: object sequencing (which object to relocate next), action selection (how to manipulate it), and object placement (where to place a grasped object). We rely on three primitives: pushing objects (push), picking them up (grasp), and placing them at target locations (place). Push and grasp can be initiated by the robot at any time; place, however, can only be performed if the robot is already holding an object. This suggests a natural decomposition into our three-part strategy, sketched below. When the robot is not holding an object, it must decide which object to manipulate next (object sequencing). After choosing an object, it must decide whether (and how) to push it or whether (and how) to pick it up (action selection). When holding an object, it must decide where to place it (object placement).
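Since place is admissible only while holding an object, the holding state alone determines which sub-policy to query at each step. The following minimal Python sketch illustrates this control flow; all names here (sequence_object, select_action, place_policy, and the execute_* primitives) are hypothetical placeholders for the learned components and motion primitives, not the paper's actual API.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

Pose = Tuple[float, float, float]  # x, y, gripper orientation on the table plane

@dataclass
class Action:
    kind: str   # "push" or "grasp"
    pose: Pose

# Placeholder stubs standing in for the learned modules and robot primitives.
def sequence_object(obs) -> int: ...             # object sequencing
def select_action(obs, obj: int) -> Action: ...  # action selection
def place_policy(obs, obj: int) -> Pose: ...     # object placement
def execute_push(pose: Pose) -> None: ...
def execute_grasp(pose: Pose) -> bool: ...       # True if the grasp succeeded
def execute_place(pose: Pose) -> None: ...

def rearrangement_step(obs, holding: Optional[int]) -> Optional[int]:
    """One decision step; returns the id of the object held afterwards."""
    if holding is not None:
        # Holding an object: the only admissible primitive is place.
        execute_place(place_policy(obs, holding))
        return None
    obj = sequence_object(obs)        # which object to move next
    action = select_action(obs, obj)  # push or grasp, and at what pose
    if action.kind == "grasp" and execute_grasp(action.pose):
        return obj                    # next step must place this object
    if action.kind == "push":
        execute_push(action.pose)
    return None
```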

We model object sequencing as a supervised learning problem on graph transformations, action selection as a Partially Observable Markov Decision Process (POMDP), and object placement as a supervised learning problem. Our system takes RGB-D images as input and builds a scene graph from the object segmentation produced by UOIS-Net-3D. Graph-based object sequencing (SimGNN) selects the object to rearrange next, and we mask the grasp Q-value map with that object's segmentation mask. The system then picks the highest-Q-value action candidate from the push and grasp Q-value maps and executes it (see the sketch below). If grasp is chosen and successfully executed, the system determines where to place the grasped object.
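The masking and argmax step above can be made concrete with a short sketch. It assumes dense per-pixel Q-value maps with one channel per discrete gripper orientation; this representation, and all shapes and names, are illustrative assumptions rather than the paper's exact implementation.

```python
import numpy as np

def choose_action(q_push: np.ndarray, q_grasp: np.ndarray, obj_mask: np.ndarray):
    """q_push, q_grasp: (R, H, W) Q-maps over R gripper orientations;
    obj_mask: (H, W) binary segmentation of the selected object."""
    # Restrict grasp candidates to pixels on the selected object by
    # masking the grasp Q-map with the object's segmentation mask.
    q_grasp_masked = np.where(obj_mask[None].astype(bool), q_grasp, -np.inf)

    # Highest-Q candidate for each primitive.
    best_push = np.unravel_index(np.argmax(q_push), q_push.shape)
    best_grasp = np.unravel_index(np.argmax(q_grasp_masked), q_grasp_masked.shape)

    # Execute whichever primitive offers the higher Q-value.
    if q_grasp_masked[best_grasp] >= q_push[best_push]:
        return "grasp", best_grasp  # (orientation_idx, row, col)
    return "push", best_push
```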

Example: 15-object selective rearrangement from a cluttered initial state.

Video: robot demonstration

robot_demo.mp4