High Acceleration Reinforcement Learning for Real-World Juggling with Binary Rewards

Abstract:

Robots that can learn in the physical world will be important to enable robots to escape their stiff and pre-programmed movements. For dynamic high-acceleration tasks, such as juggling, learning in the real world is particularly challenging, as one must push the limits of the robot and its actuation without harming the system. Therefore, learning these tasks on the physical robot amplifies the necessity of sample efficiency and safety for robot learning algorithms, making a high-speed task an ideal benchmark to highlight robot learning systems. To achieve learning on the physical system, we propose a learning system that directly incorporates the safety and sample-efficiency requirements into the design of the policy representation, initialization, and optimization. This approach is in contrast to prior work, which mainly focuses on the details of the learning algorithm but neglects the engineering details. We demonstrate that this system enables the high-speed Barrett WAM to learn juggling of two balls from 56 minutes of experience. The robot learns to juggle consistently based solely on a binary reward signal. The optimal policy is able to juggle for up to 33 minutes, or about 4500 repeated catches.

Supplementary Material

Learning of the Juggling Task

For learning on the physical Barrett WAM, 20 episodes were performed. During each episode, 25 randomly sampled parameter sets were executed and the episodic reward was evaluated. If the robot successfully juggles for 10 s, the roll-out is stopped. Roll-outs that were corrupted by obvious environment errors were repeated with the same parameters, whereas roll-outs with minor variations caused by the environment initialization were not repeated. After collecting the samples, the policy was updated using eREPS with a KL constraint of 2. The video shows **all** trials executed on the physical system to learn the optimal policy.
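
As a rough illustration of this episodic update, the sketch below implements an eREPS-style step for a Gaussian search distribution over the stroke parameters: the temperature is found by minimizing the REPS dual under the KL bound, and the Gaussian is refit with exponentiated-return weights. All function and variable names are illustrative assumptions, not the authors' implementation.

```python
# Minimal eREPS-style update sketch (illustrative, not the authors' code).
import numpy as np
from scipy.optimize import minimize_scalar


def ereps_update(theta, returns, epsilon=2.0):
    """One episodic REPS update of a Gaussian search distribution.

    theta:   (N, d) parameter vectors sampled in the episode (here N = 25)
    returns: (N,)   episodic rewards of the corresponding roll-outs
    epsilon:        KL bound between consecutive search distributions
    """
    R = returns - returns.max()  # shift returns for numerical stability

    # REPS dual: g(eta) = eta * epsilon + eta * log( mean( exp(R / eta) ) )
    def dual(eta):
        return eta * epsilon + eta * np.log(np.mean(np.exp(R / eta)))

    eta = minimize_scalar(dual, bounds=(1e-6, 1e3), method="bounded").x

    # Exponentiated-return weights and weighted maximum-likelihood refit
    w = np.exp(R / eta)
    w /= w.sum()
    mean = w @ theta
    diff = theta - mean
    cov = diff.T @ (diff * w[:, None])
    return mean, cov
```

Sampling 25 parameter sets from this Gaussian, executing them on the robot, and applying the update above once per episode for 20 episodes corresponds to the sample budget described here.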

Repeatability of the Learned Juggling

To test the repeatability and stability of the learned policy, the deterministic policy mean obtained after 20 episodes of training is executed for 30 repeated roll-outs with a maximum duration of 2 minutes each. To the left, 11 different roll-outs of the learned deterministic policy are shown.
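
For reference, a minimal sketch of this evaluation loop is given below; the robot interface is a placeholder assumption, not the authors' code.

```python
# Sketch of the repeatability evaluation (illustrative only).
import numpy as np

MAX_DURATION = 120.0  # cap each roll-out at 2 minutes


def execute_rollout(params, max_duration):
    """Placeholder: run the deterministic mean parameters on the robot and
    return the achieved juggling duration in seconds."""
    raise NotImplementedError("connect to the robot / simulator here")


def evaluate_repeatability(policy_mean, n_rollouts=30):
    durations = np.array([execute_rollout(policy_mean, MAX_DURATION)
                          for _ in range(n_rollouts)])
    # Mean juggling duration and fraction of roll-outs reaching the 2-minute cap
    return durations.mean(), np.mean(durations >= MAX_DURATION)
```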

Stability Evaluation of the Learned Juggling

To test the stability of the learned policy, the juggling was repeatedly executed and the maximum juggling duration was recorded. The learned policy achieves juggling for 33.13 minutes, which corresponds to roughly 4500 repeated catches. The complete video of this trial is shown to the right.