We tune the hyperparameters to improve the sample efficiency and final performance of SAC specifically with massively parallel data collection, which is an unusual setting for off-policy RL algorithms. The hyperparameters used in the experiments are summarized in the table below.
We highlight the following critical implementation details:
Varying number of updates: Because off-policy training can be unstable, excessive updates at the beginning of training harm subsequent learning, so we set the number of gradient updates proportional to the number of transitions in the replay buffer.
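A minimal sketch of this schedule is given below; the coefficient `updates_per_transition` and the other names are illustrative placeholders, not the exact variables used in our code.

```python
# Sketch of the update schedule: the total number of gradient updates allowed
# so far grows linearly with the amount of data in the replay buffer, so only
# a few updates are performed while the buffer is still small.
def num_updates_this_iteration(replay_buffer_size: int,
                               updates_per_transition: float,
                               updates_done_so_far: int) -> int:
    total_allowed = int(updates_per_transition * replay_buffer_size)
    return max(0, total_allowed - updates_done_so_far)
```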
Data augmentation: We perform goal relabeling to augment every batch of data sampled from the replay buffer. Specifically, 80% of the sampled transitions are relabeled with goals taken from future transitions of the same episode.
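The relabeling step can be sketched as follows; the batch fields and the sparse reward function are illustrative placeholders, not the exact data layout used in the experiments.

```python
import numpy as np

def sparse_reward(achieved, goal, threshold=0.05):
    # Illustrative sparse goal-reaching reward: 0 on success, -1 otherwise.
    return -(np.linalg.norm(achieved - goal, axis=-1) > threshold).astype(np.float32)

def relabel_batch(batch, relabel_ratio=0.8, rng=np.random.default_rng(0)):
    """Relabel ~80% of the sampled transitions with goals achieved at future
    steps of the same episode, then recompute rewards under the new goals."""
    n = batch["goal"].shape[0]
    mask = rng.random(n) < relabel_ratio
    batch["goal"][mask] = batch["future_achieved_goal"][mask]
    batch["reward"] = sparse_reward(batch["achieved_goal"], batch["goal"])
    return batch
```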
High data reuse: "Data reuse" is the ratio of the number of transitions consumed by network updates to the number of steps collected from the environment. In our experiments, a large amount of data is created offline via hindsight relabeling, object-centric relabeling, and symmetry-aware augmentation. Therefore, we need to perform more network updates to fit the diverse training data.
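For concreteness, the reuse ratio we refer to can be computed as in the following sketch (names are illustrative):

```python
def data_reuse_ratio(num_updates: int, batch_size: int, env_steps: int) -> float:
    # Transitions consumed by gradient updates per environment step collected;
    # values well above 1 indicate high data reuse.
    return num_updates * batch_size / env_steps
```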
Object-centric architecture: The details of the self-attention structure used in the experiments are as follows. The input to the policy network is a set of m object-centric tokens, while for the critic network we additionally concatenate the robot actions to the input.
Both networks process their inputs into feature vectors x_i, 1 ≤ i ≤ m, with a stack of 4 attention blocks with 4 heads. In the critic network, we aggregate the processed features with mean pooling and feed the pooled feature into a two-layer MLP to obtain the Q prediction. We adopt mean pooling to keep the output Q in the same range for different numbers of input tokens. For the actor network, we instead take the maximum of the x_i to fuse the features; intuitively, the actor relies more on local information from a small subset of objects to make decisions. The pooled embedding of the actor is then fed into a two-layer MLP and split into two heads to predict the statistics of the two robots' action distributions. Since both action distributions are modeled as Gaussians, each policy head predicts the mean and the logarithm of the standard deviation of the action for robot i.
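A minimal PyTorch sketch of this architecture is given below, assuming placeholder token and hidden dimensions; the use of nn.TransformerEncoderLayer and the exact layer sizes are illustrative rather than our actual implementation.

```python
import torch
import torch.nn as nn

def attention_stack(hidden_dim: int) -> nn.Module:
    # 4 self-attention blocks with 4 heads each (sizes are placeholders).
    layer = nn.TransformerEncoderLayer(d_model=hidden_dim, nhead=4,
                                       dim_feedforward=hidden_dim,
                                       batch_first=True)
    return nn.TransformerEncoder(layer, num_layers=4)

class ObjectCentricCritic(nn.Module):
    """Critic sketch: per-object tokens (actions concatenated upstream) are
    processed by the attention stack, mean-pooled, then mapped to a Q value."""
    def __init__(self, token_dim: int, hidden_dim: int = 256):
        super().__init__()
        self.embed = nn.Linear(token_dim, hidden_dim)
        self.attn = attention_stack(hidden_dim)
        self.head = nn.Sequential(nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
                                  nn.Linear(hidden_dim, 1))

    def forward(self, tokens):                  # tokens: (batch, m, token_dim)
        x = self.attn(self.embed(tokens))       # (batch, m, hidden_dim)
        pooled = x.mean(dim=1)                  # mean pooling keeps Q in range
        return self.head(pooled)

class ObjectCentricActor(nn.Module):
    """Actor sketch: same attention stack, max pooling, and two Gaussian heads
    (one per robot), each predicting the mean and log-std of that robot's action."""
    def __init__(self, token_dim: int, action_dim: int, hidden_dim: int = 256):
        super().__init__()
        self.embed = nn.Linear(token_dim, hidden_dim)
        self.attn = attention_stack(hidden_dim)
        self.trunk = nn.Sequential(nn.Linear(hidden_dim, hidden_dim), nn.ReLU())
        self.heads = nn.ModuleList([nn.Linear(hidden_dim, 2 * action_dim)
                                    for _ in range(2)])  # one head per robot

    def forward(self, tokens):                  # tokens: (batch, m, token_dim)
        x = self.attn(self.embed(tokens))
        pooled = x.max(dim=1).values            # max pooling over object features
        h = self.trunk(pooled)
        # Each head outputs (mean, log_std) for one robot's action distribution.
        return [head(h).chunk(2, dim=-1) for head in self.heads]
```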
Curriculum learning:
Curriculum on table gap: Initially, the gap between the tables is 10cm wide (half the cuboid length), so the agent can discover cooperative handover more easily. We increase the table gap by 5cm whenever the success rate on tasks that require handover reaches 0.7, until the gap reaches 30cm.
Curriculum on the probability of sampling goals on the opposite side: The probability of sampling goals on the other table is initially set to 0.2. Once the success rate of local rearrangement tasks reaches 0.7, we increase this probability by 0.2 at each iteration until it reaches 0.8.
Curriculum on object number: The agent starts learning from single-object tasks. Each time the average success rate over all tasks reaches 0.7, we increase the ratio of tasks with two objects by 10%, gradually switching to a scenario with two-object tasks only. Afterwards, we expand the total number of objects in the scene in the same way (a sketch of these curriculum updates is given after this list).
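A minimal sketch of the success-rate-triggered curriculum updates, using the thresholds and step sizes stated above; the class and attribute names are illustrative, and the later expansion to more than two objects is omitted.

```python
class Curriculum:
    """Success-rate-triggered curriculum sketch (values follow the text above)."""
    def __init__(self):
        self.table_gap = 0.10        # meters, grows to 0.30
        self.other_side_prob = 0.2   # grows to 0.8
        self.two_object_ratio = 0.0  # grows to 1.0

    def update(self, handover_success, local_success, overall_success):
        # Widen the table gap once handover tasks are reliably solved.
        if handover_success >= 0.7 and self.table_gap < 0.30:
            self.table_gap = min(self.table_gap + 0.05, 0.30)
        # Sample more goals on the opposite table once local tasks are solved.
        if local_success >= 0.7 and self.other_side_prob < 0.8:
            self.other_side_prob = min(self.other_side_prob + 0.2, 0.8)
        # Gradually shift from single-object to two-object tasks.
        if overall_success >= 0.7 and self.two_object_ratio < 1.0:
            self.two_object_ratio = min(self.two_object_ratio + 0.1, 1.0)
```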
In our experiment setup, two robots are mounted on two tables separated by a gap of 0.3m, at a height of 0.4m. We establish a Cartesian coordinate system originating at the center of the gap, with the x-axis perpendicular to the gap, the y-axis parallel to the gap, and the z-axis pointing upward. The bases of the two robots are randomly sampled from regions whose distance to the gap center lies in [0.5m, 0.7m] and whose angle to the world x-axis is drawn from [-π/3, π/3]. We also initialize each robot base orientation randomly from [-π, π]. The bijection M mirrors task instances between the original coordinate system defined above and a second coordinate system obtained by rotating the former 180 degrees around its z-axis.
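As an illustration, the mirroring underlying M can be sketched as follows; the dictionary layout is a placeholder, and object orientations (omitted here) would additionally be rotated by the same 180-degree rotation about the z-axis.

```python
import numpy as np

def mirror_position(p: np.ndarray) -> np.ndarray:
    # Rotating 180 degrees around the world z-axis maps (x, y, z) to (-x, -y, z).
    return np.array([-p[0], -p[1], p[2]])

def mirror_task_instance(task: dict) -> dict:
    """Sketch of the bijection M: express the task in the frame rotated 180
    degrees about z, which also exchanges the roles of the two robots."""
    return {
        # Reversing the robot order swaps which arm is on which side of the gap.
        "robot_bases": [mirror_position(b) for b in reversed(task["robot_bases"])],
        "object_positions": [mirror_position(p) for p in task["object_positions"]],
        "goals": [mirror_position(g) for g in task["goals"]],
    }
```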
State: Each robot state contains the position and velocity of the robot's end effector and the gripper width. Each object state contains the position and orientation of the corresponding object.
Initial distribution ρ: We randomly sample objects and goals from two rectangular regions on the tables, symmetric about the gap. For objects and goals located on the right side of the gap, we sample uniformly in the range 0.15m < x < 0.55m and −0.15m < y < 0.15m, i.e., a 0.4m × 0.3m workspace. Symmetrically, for objects located to the left of the gap, we sample in the range −0.55m < x < −0.15m and −0.15m < y < 0.15m. We first sample goals uniformly from the union of the two regions, then assign objects to the two workspaces according to the current other-side ratio. Finally, the initial object positions are sampled from their assigned workspace.
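A minimal sketch of this sampling procedure, with illustrative function names and the workspace bounds given above:

```python
import numpy as np

def sample_initial_state(m: int, other_side_ratio: float,
                         rng=np.random.default_rng()):
    """Sketch of the initial distribution rho for m objects."""
    def sample_in(side):
        # side = +1 samples on the right of the gap, side = -1 on the left.
        x = side * rng.uniform(0.15, 0.55)
        y = rng.uniform(-0.15, 0.15)
        return np.array([x, y])

    goals, objects = [], []
    for _ in range(m):
        goal_side = rng.choice([-1, 1])          # goals uniform over both regions
        goals.append(sample_in(goal_side))
        # With probability `other_side_ratio`, place the object on the other table.
        obj_side = -goal_side if rng.random() < other_side_ratio else goal_side
        objects.append(sample_in(obj_side))
    return objects, goals
```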
Horizon: Each episode lasts 50 × m steps.
We use inverse kinematics to compute the target joint angles from the desired end effector pose, then compute the torques with a PD controller. The end effector control runs at 10Hz. We tweak the simulated environment to prohibit the agent from learning agile behaviors that are difficult to transfer to real robots. The major changes include reducing the rotational inertia of the cuboid blocks and reducing the speed of the fingers. To reduce the negative impact of useless experiences on training, we allow the environment to terminate early in the following cases: (1) a robot arm collides with the table, (2) an object falls off the table, (3) no object moves for 15 consecutive steps. This pruning yields more diverse experiences during exploration and improves the quality of the data in the buffer.
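A minimal sketch of the early-termination rule, with illustrative flag names standing in for quantities queried from the simulator:

```python
def should_terminate_early(arm_table_collision: bool,
                           object_fell_off_table: bool,
                           steps_since_any_object_moved: int,
                           stall_limit: int = 15) -> bool:
    # Terminate the episode if the arm hits the table, an object falls off,
    # or no object has moved for `stall_limit` consecutive steps.
    return (arm_table_collision
            or object_fell_off_table
            or steps_since_any_object_moved >= stall_limit)
```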