We use the full version of our decentralized SA-AC policy for the real-world deployment. To avoid unsafe actions, we manually standardize each pick-and-place primitive using linear motion planning. Each primitive consists of pre-grasping, grasping, and post-grasping stages, which guarantees collision-free picking and placing. To overcome the heavy occlusion in our single-camera experimental setting, we run the policy only after each interaction is over, i.e., at the end of each pick-and-place primitive.
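As a rough illustration of how such a standardized primitive can be structured, the sketch below decomposes a pick-and-place into pre-grasp, grasp, and post-grasp stages connected by linearly interpolated waypoints. The `arm.move_to`, `arm.ee_position`, and gripper calls are hypothetical placeholders for the robot controller interface, not the actual API used in our system.

```python
import numpy as np

def linear_waypoints(start, goal, n=20):
    """Linearly interpolate end-effector positions between two points."""
    return [start + (goal - start) * t for t in np.linspace(0.0, 1.0, n)]

def pick_and_place_primitive(arm, obj_pos, place_pos, hover=0.10):
    """One standardized pick-and-place primitive (sketch).

    The motion always passes through a pre-grasp pose above the object,
    descends vertically to grasp, retreats to a post-grasp pose, and then
    repeats the same pattern over the placement target, so every segment
    is a short, predictable linear path.
    """
    pre_grasp  = obj_pos + np.array([0.0, 0.0, hover])    # pre-grasping stage
    post_grasp = pre_grasp.copy()                         # retreat above the object
    pre_place  = place_pos + np.array([0.0, 0.0, hover])

    for wp in linear_waypoints(arm.ee_position(), pre_grasp):
        arm.move_to(wp)
    for wp in linear_waypoints(pre_grasp, obj_pos):        # grasping stage
        arm.move_to(wp)
    arm.close_gripper()
    for wp in linear_waypoints(obj_pos, post_grasp):       # post-grasping stage
        arm.move_to(wp)
    for wp in linear_waypoints(post_grasp, pre_place):
        arm.move_to(wp)
    for wp in linear_waypoints(pre_place, place_pos):
        arm.move_to(wp)
    arm.open_gripper()
```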
To distill a real-world policy from simulation, we first collect 1e6 steps of data in simulation. We then relabel the data with the next object to be manipulated and run a supervised learning procedure to obtain the new policy. The resulting policy takes the object states, goal positions, and robot end-effector positions as input and outputs the next object to be manipulated. Our controller then executes the corresponding action and queries the higher-level neural planner again.
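A minimal sketch of this distillation step is shown below, assuming the relabeled simulation data is already available as (observation, next-object index) pairs. The `NextObjectPlanner` architecture and hyperparameters here are illustrative assumptions, not the exact network used in our experiments.

```python
import torch
import torch.nn as nn

class NextObjectPlanner(nn.Module):
    """Supervised planner that predicts the index of the next object to manipulate."""
    def __init__(self, obs_dim, num_objects, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_objects),   # logits over candidate objects
        )

    def forward(self, obs):
        return self.net(obs)

def distill(dataset, obs_dim, num_objects, epochs=10, lr=1e-3):
    """Fit the planner on relabeled simulation data.

    `dataset` is assumed to yield (obs, next_obj) pairs, where obs concatenates
    object states, goal positions, and end-effector positions, and next_obj is
    the index of the object the simulation policy manipulated next.
    """
    planner = NextObjectPlanner(obs_dim, num_objects)
    opt = torch.optim.Adam(planner.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    loader = torch.utils.data.DataLoader(dataset, batch_size=512, shuffle=True)
    for _ in range(epochs):
        for obs, next_obj in loader:
            loss = loss_fn(planner(obs), next_obj)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return planner
```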
We use two Franka Panda arms as our manipulators. Observations are provided by a local localization system based on ArUco tag detection; the relative positions of the arms and objects are shown in Figure 11. In each planning step, the system averages the computed positions over a 1 s window to tolerate occasional object-detection failures. In each execution step, the robot runs a predefined pick-and-place primitive on the object selected by the neural planner.
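The temporal averaging can be implemented with a simple sliding-window filter over the per-frame tag detections, as sketched below. This is only an assumed implementation of the 1 s averaging described above; the `update` method would be fed the tag position from each camera frame (or `None` when detection fails).

```python
import time
from collections import deque
import numpy as np

class PoseFilter:
    """Average detected object positions over a fixed time window.

    Frames where the ArUco tag is not detected contribute nothing, so a few
    missed detections do not corrupt the planning observation.
    """
    def __init__(self, window=1.0):
        self.window = window
        self.buffer = deque()   # (timestamp, position) pairs

    def update(self, position):
        """Call once per camera frame; `position` is None when detection fails."""
        now = time.time()
        if position is not None:
            self.buffer.append((now, np.asarray(position, dtype=float)))
        # drop samples older than the averaging window
        while self.buffer and now - self.buffer[0][0] > self.window:
            self.buffer.popleft()

    def estimate(self):
        """Return the averaged position, or None if nothing was seen recently."""
        if not self.buffer:
            return None
        return np.mean([p for _, p in self.buffer], axis=0)
```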
Real-world setting: we mounted two Franka Panda arms on two tables separated by a wooden wall, so the agents must hand over objects in the air.