Redundancy Resolution as Action Bias in Policy Search for Robotic Manipulation

Presented at the Conference on Robot Learning (CoRL), London, UK, 2021

Abstract

We propose a novel approach that biases actions during policy search by lifting the concept of redundancy resolution from multi-DoF robot kinematics to the level of the reward in deep reinforcement learning and evolution strategies. The key idea is to bias the distribution of executed actions such that the immediate reward remains unchanged. The resulting biased actions favor secondary objectives, yielding policies that are safer to apply on the real robot. We demonstrate the feasibility of our method, referred to as policy search with redundant action bias (PSRAB), in a reaching and a pick-and-lift task with a 7-DoF Franka robot arm trained in RLBench - a recently introduced benchmark for robotic manipulation - using state-of-the-art TD3 deep reinforcement learning and OpenAI's evolution strategy (OpenAI-ES). We show that it is a flexible approach that needs no significant fine-tuning and does not interfere with the main objective, even across different policy search methods and tasks of different complexity. We evaluate our approach in simulation and on the real robot.
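To make the key idea more concrete, the following minimal sketch shows how a joint-velocity action proposed by a policy could be biased in the null space of the task Jacobian so that the end-effector motion, and hence the immediate reward, remains unchanged while the distance to a reference configuration is reduced. The sketch is illustrative only; the function signature and the gain value are assumptions, not the exact implementation described in the paper.

import numpy as np

def null_space_bias(q, q_dot_policy, jacobian, q_ref, gain=1.0):
    """Bias a joint-velocity action in the null space of the task Jacobian.

    q            -- current joint configuration, shape (n,)
    q_dot_policy -- joint velocities proposed by the policy, shape (n,)
    jacobian     -- task Jacobian J(q) of the end effector, shape (m, n)
    q_ref        -- reference configuration of the secondary objective, shape (n,)
    gain         -- step size towards the reference configuration (assumed value)
    """
    # Gradient step of the secondary objective 0.5 * ||q - q_ref||^2
    q_dot_secondary = -gain * (q - q_ref)

    # Null-space projector N = I - J^+ J (J^+ = Moore-Penrose pseudoinverse)
    projector = np.eye(q.shape[0]) - np.linalg.pinv(jacobian) @ jacobian

    # Executed (biased) action
    return q_dot_policy + projector @ q_dot_secondary

Because the added component lies in the null space of the Jacobian, the biased action produces the same end-effector velocity as the policy action, which is why the immediate reward is unaffected while the secondary objective still improves.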

Supplementary Video

This video briefly introduces our method and shows some simulated and real-world results. More results are presented below.

Results

This section presents the results of our work. It complements the results section of our paper with additional plots, images, and videos, so if you have not read the paper yet, we recommend doing so first. Nonetheless, we start with the result plots that were already shown in the paper. Figure 1 presents the validation results of TD3 and OpenAI-ES agents trained on the reaching and the pick-and-lift task with and without redundancy resolution using a reference configuration. Note that the reward is shown per step for TD3 and per episode for OpenAI-ES, as the latter demands fixed episode lengths.

Figure 1: Validation results of TD3 and OpenAI-ES with and without redundancy resolution on different tasks during training (the abscissa shows training episodes). Every 2500 training episodes, the training was paused and the agents were evaluated on 1000 random validation episodes (i.e., no exploration). The means over these validation episodes were then plotted as single points for the reward and the loss.
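For reference, the validation protocol described in the caption corresponds to a simple evaluation loop, sketched below. The env/agent interface (reset, step, act) and the info key holding the secondary-objective loss are assumptions made for illustration, not the actual code used in our experiments.

def validate(env, agent, n_episodes=1000):
    """Run validation episodes without exploration and return the mean reward
    and the mean secondary-objective loss per episode."""
    rewards, losses = [], []
    for _ in range(n_episodes):
        state, done = env.reset(), False
        episode_reward, episode_loss = 0.0, 0.0
        while not done:
            action = agent.act(state, explore=False)    # no exploration noise
            state, reward, done, info = env.step(action)
            episode_reward += reward
            episode_loss += info["secondary_loss"]      # assumed info key
        rewards.append(episode_reward)
        losses.append(episode_loss)
    return sum(rewards) / n_episodes, sum(losses) / n_episodes

During training, such a routine would be called every 2500 episodes, and each returned mean would correspond to one point in Figure 1.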

For each agent and task, the results show the performance with respect to the main objective, i.e., the reward, as well as the performance with respect to the secondary objective - here, the minimization of the distance to the reference configuration. As can be seen, our approach substantially reduces the loss of the secondary objective without significantly influencing the main objective. A more detailed discussion of each plot can be found in the paper. The following sections present images and videos of exemplary episodes showing the final performance of some of the agents presented above, as well as of further agents.

The Reaching Task

Figure 2 and Figure 3 present some exemplary episodes of a TD3 and an OpenAI-ES agent on the reaching task after training for around 36000 and 115000 episodes, respectively. Here, redundancy resolution is used to keep the robot configuration close to a reference configuration (cf. Figure 1). As can be seen, our approach significantly reduces self-motion for both agents. Moreover, the vanilla OpenAI-ES agent has learned a backflip motion, which was not observed for the OpenAI-ES agent with redundancy resolution. We consider this the major reason for the increased reward of the OpenAI-ES agent with redundancy resolution in the upper right corner of Figure 1.

Figure 2: Some exemplary episodes of a TD3 agent with and without redundancy resolution on a reach-target task after training for around 36000 episodes.

Figure 3: Some exemplary episodes of an OpenAI-ES agent with and without redundancy resolution on a reach-target task after training for around 115000 episodes.

In simulation, the vanilla TD3 agent exhibits significantly less self-motion. However, while an episode is reset once the target has been successfully reached in simulation, there is no auto-reset in the real world. This is why a lot of self-motion can be observed in the vicinity of the target on the real robot, as shown in Figure 4.

Moreover, when starting from a new initial position, self-motion increases drastically, as shown in Figure 5 and Figure 6. Interestingly, the vanilla agent solves the task significantly more slowly than the agent with redundancy resolution. We hypothesize that this is because the redundancy bias keeps the current configuration close to the reference configuration and, thus, the state close to the states visited during training.

Figure 4: One exemplary episode of a TD3 agent on a reach-target task on the real robot. The episode starts in the same initial position as during training. Here, the final performance after training is shown.

Figure 5: One exemplary episode of a TD3 agent on a reach-target task on the real robot. The episode starts in a different initial position (from the right). Here, the final performance after training is shown.

Figure 6: One exemplary episode of a TD3 agent on a reach-target task on the real robot. The episode starts in a different initial position (from the left). Here, the final performance after training is shown.

The Pick-and-Lift Task

The necessity of secondary objectives becomes clearer when considering manipulation tasks. For instance, there are different ways to grasp an object, some of which are undesirable or even infeasible on the real robot. Figure 7 gives an example of what might be learned when secondary objectives are not considered (lower row). As can be seen, the vanilla agent has learned, in some cases, to put the gripper on the table while grasping. Obviously, this is undesirable behavior on the real robot. Our approach biases the robot configuration towards a more upright configuration to avoid contact with the table, allowing the learned policy to be transferred directly to the real robot. Note that in Figure 7 (and in all following real-robot picking examples) a foam plate was added under the object to compensate for the height of the iron plate under the real robot (which was not present in simulation) and to allow the vanilla agent to be applied on the real robot without damaging the gripper.

Figure 7: Impact of redundancy resolution on a TD3 agent in a simulated and a real pick-and-lift task when using a reference configuration. Here, the final performance after training is shown.

Figure 8 shows the full episode from the right side of Figure 7. Figure 9 shows more examples from simulation. Figure 12 presents further examples of the picking task on the real robot.

Figure 8: Impact of redundancy resolution on a TD3 agent in a real pick-and-lift task when using a reference configuration (cf. the right side of Figure 7). Here, the final performance after training is shown.

Figure 9: Impact of redundancy resolution on a TD3 agent in a simulated pick-and-lift task when using a reference configuration. Here, the final performance after training is shown.

In fact, any kind of reference configuration can be used in our approach. Figure 10 provides an example with an inclined reference configuration, resulting in an agent that learns to grasp from the right. Figure 11 presents the validation results during training of an agent trained with redundancy resolution and an inclined reference configuration, compared to a vanilla TD3 agent. As shown in the main video (cf. the top section of this page), the agent has learned to grasp with a straight gripper orientation for some object positions. This is because the main and secondary objectives are handled in the same action space (joint velocities), allowing the agent to learn to overcome the added bias if needed. This stands in contrast to other approaches that perform policy search in task space yet apply redundancy resolution in a different space (e.g., joint-velocity space).
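The following toy example (with made-up numbers, not taken from our experiments) illustrates this point: since the null-space bias is itself a joint-velocity command, the policy can always output a component that cancels it when the task requires it.

import numpy as np

# Toy illustration that the bias and the policy action share the same
# joint-velocity space, so the policy could learn to cancel the bias exactly.
rng = np.random.default_rng(0)
n_joints, task_dim = 7, 6

J = rng.normal(size=(task_dim, n_joints))        # some task Jacobian
N = np.eye(n_joints) - np.linalg.pinv(J) @ J     # null-space projector
bias = N @ rng.normal(size=n_joints)             # secondary-objective bias

a_policy = rng.normal(size=n_joints)             # action proposed by the policy
a_executed = a_policy + bias                     # biased action that is executed

# If the bias were ever harmful, the policy could simply learn to subtract it,
# because "-bias" is a valid joint-velocity command as well:
a_compensating = a_policy - bias
assert np.allclose(a_compensating + bias, a_policy)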

Figure 10: Providing an inclined reference configuration (instead of an upright one). Here, the final performance after training is shown.

Figure 11: Validation results of TD3 on a pick-and-lift task with an inclined reference configuration during redundancy resolution in order to learn grasping from the right (cf. Figures 10 and 12). The abscissa shows training episodes. Every 2500 training episodes, the training was paused and the agents were evaluated on 1000 random validation episodes (i.e., no exploration). The means over these validation episodes were then plotted as single points for the reward and the loss.

Figure 12 compares the TD3 agent using redundancy resolution with an inclined and a straight reference configuration on the real robot.

Figure 12: Comparison of a TD3 agent with redundancy resolution using an inclined and a straight reference configuration on the real robot. Here, the final performance after training is shown.

The Reaching Task with Collision-Avoidance

Up until now, all agents with redundancy resolution considered the minimization of the distance to a reference configuration as the secondary objective. However, other secondary objectives can be used as well. In fact, any secondary objective can be considered as long as it can be expressed as a function of the current robot configuration (cf. the paper). For example, a different secondary objective could be the maximization of the distance between the robot links and an obstacle, which results in a bias that pushes the robot links away from the obstacle. Figure 13 provides an exemplary episode of a TD3 agent with redundancy resolution for collision avoidance in a reaching task.
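To make this more concrete, such a collision-avoidance secondary objective could be sketched as follows. The forward-kinematics helper link_positions, the representation of the obstacle by a single center point, and the gain are placeholder assumptions made for illustration, not our actual implementation; the resulting joint velocity would then be projected into the null space in the same way as for the reference-configuration objective.

import numpy as np

def obstacle_clearance(q, link_positions, obstacle_center):
    """Secondary objective: distance between the closest robot link and the
    obstacle (to be maximized). link_positions(q) is a placeholder for the
    robot's forward kinematics and returns an array of link positions (k, 3)."""
    distances = np.linalg.norm(link_positions(q) - obstacle_center, axis=1)
    return distances.min()

def secondary_velocity(q, link_positions, obstacle_center, gain=1.0, eps=1e-4):
    """Numerical gradient-ascent step on the clearance objective, used as the
    secondary joint velocity before null-space projection. The gain and the
    finite-difference step size are illustrative assumptions."""
    q = np.asarray(q, dtype=float)
    gradient = np.zeros_like(q)
    for i in range(q.shape[0]):
        dq = np.zeros_like(q)
        dq[i] = eps
        gradient[i] = (obstacle_clearance(q + dq, link_positions, obstacle_center)
                       - obstacle_clearance(q - dq, link_positions, obstacle_center)) / (2 * eps)
    return gain * gradient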

Figure 13: One exemplary episode of a TD3 agent with redundancy resolution for collision avoidance on a simulated reaching task. The violet ellipsoid constitutes the obstacle. Here, the final performance after training is shown.

Figure 14 presents the validation results during training of a TD3 agent with and without redundancy resolution for collision avoidance. As can be seen, our approach has little impact on the main objective while allowing the consideration of a secondary objective.

Figure 14: Validation results of TD3 on a reaching task with and without redundancy resolution for collision avoidance. The abscissa shows training episodes. Every 2500 training episodes, the training was paused and the agents were evaluated on 1000 random validation episodes (i.e., no exploration). The means over these validation episodes were then plotted as single points for the reward and the loss.

Figure 15 shows a TD3 agent with redundancy resolution for collision avoidance on a reaching task with the real robot. The blue cube has a size similar to that of the violet ellipsoid used as the obstacle in simulation. Figure 16 presents the result of a TD3 agent without redundancy resolution for collision avoidance.

Figure 15: One exemplary episode of a TD3 agent with redundancy resolution for collision avoidance on a real reaching task. The cube constitutes the obstacle. Here, the final performance after training is shown.

Figure 16: One exemplary episode of a TD3 agent without redundancy resolution for collision avoidance on a real reaching task. The cube constitutes the obstacle. Here, the final performance after training is shown. The video is sped up by a factor of 2.

Simulation Setup

RLBench is based on PyRep and CoppeliaSim and provides the reaching and pick-and-lift tasks. However, some adaptations had to be made to the tasks and to RLBench itself. At the time of writing, RLBench uses different time horizons for arm-only actions and arm actions with gripper usage: while an arm-only action is executed in a single time step, once a gripper action is triggered, the simulation runs in a loop until the gripper has changed its state, continuously repeating the arm action in the meantime. This idiosyncrasy was learned by the agent and led to inferior performance. We therefore separated arm and gripper actions: once a gripper action is triggered, arm actions are stalled. This is also beneficial for real-world application, as this relationship between arm and gripper actions allows scaling the arm actions almost independently of the gripper actuation.
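As an illustration, this separation could be realized by a thin wrapper around the environment, as sketched below. The action layout (seven joint velocities followed by one gripper command) and the step/reset interface are assumptions made for this sketch and do not correspond to RLBench's actual API.

import numpy as np

class SeparateArmGripperWrapper:
    """Illustrative wrapper: whenever the gripper state is toggled, the arm is
    commanded to hold still for that step instead of repeating its last action."""

    def __init__(self, env, n_arm_joints=7):
        self.env = env                      # wrapped environment (assumed interface)
        self.n_arm_joints = n_arm_joints
        self._gripper_open = True

    def reset(self):
        self._gripper_open = True
        return self.env.reset()

    def step(self, action):
        arm = np.asarray(action[:self.n_arm_joints], dtype=float)
        gripper_open = bool(action[self.n_arm_joints] > 0.5)

        if gripper_open != self._gripper_open:
            # A gripper action was triggered: stall the arm for this step.
            arm = np.zeros_like(arm)
            self._gripper_open = gripper_open

        return self.env.step(np.concatenate([arm, [float(gripper_open)]]))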

Distribution Strategy

While OpenAI-ES offers straightforward parallelization across a huge number of workers by default, TD3 was introduced as a serial approach. Nonetheless, since TD3 is an off-policy method, it can be trained on data from parallel workers as well. Barth-Maron et al. introduced a distributed version of DDPG that runs multiple actors in parallel to generate experience, which is added to a joint replay buffer; a single critic is then trained on data from this buffer. We also distribute training in TD3. However, unlike Barth-Maron et al., we only distribute the simulation environment to different workers, running it with different seeds, while using a single actor. This distribution strategy allows more efficient training on consumer PCs, as it reduces the communication bandwidth - only states and actions need to be communicated instead of actor parameters - and runs the single actor on the GPU to predict the actions of all workers at once. In contrast, we implemented OpenAI-ES similarly to the original authors, as it inherently relies on using different actors.
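For illustration, a single synchronous step of this strategy could look as follows. The actor, environment, and replay-buffer interfaces (a Gym-like step/reset and an add method on the buffer) are assumptions made for this sketch rather than our actual implementation; in practice, the actor is a neural network evaluated on the GPU.

import numpy as np

def distributed_rollout_step(actor, envs, states, replay_buffer, noise_std=0.1):
    """One synchronous rollout step: a single actor predicts the actions of all
    environment workers at once, while the workers only simulate."""
    # Batch the states of all workers and query the single actor once.
    actions = actor(np.stack(states))

    # Per-worker exploration noise, as in standard TD3 training.
    actions = actions + noise_std * np.random.randn(*actions.shape)

    next_states = []
    for env, state, action in zip(envs, states, actions):
        next_state, reward, done, _ = env.step(action)
        replay_buffer.add(state, action, reward, next_state, done)  # joint buffer
        next_states.append(env.reset() if done else next_state)
    return next_states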