GPU-Accelerated Robotics Simulation for Distributed Reinforcement Learning
Jacky Liang*, Viktor Makoviychuk*, Ankur Handa*, Nuttapong Chentanez, Miles Macklin, Dieter Fox
* equal contribution
Accepted to the Conference on Robot Learning (CoRL), 2018
[arXiv]
We benchmarked our system on the following 4 tasks: Ant, Humanoid, Humanoid Flagrun Harder, and Humanoid Flagrun Harder on Complex Terrain.
We used Proximal Policy Optimization (PPO) to train all of our policies. See videos of trained policies below.
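For reference, below is a minimal sketch of the clipped PPO surrogate objective; the function name, tensor names, and the clip_eps value are illustrative and not the exact hyperparameters used in our experiments.

```python
import torch

def ppo_surrogate_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """Clipped PPO surrogate loss (to be minimized).

    new_log_probs: log pi_theta(a|s) under the current policy
    old_log_probs: log pi_theta_old(a|s) recorded at rollout time
    advantages:    estimated advantages (e.g., from GAE)
    """
    # Probability ratio r_t(theta) = pi_theta / pi_theta_old
    ratio = torch.exp(new_log_probs - old_log_probs)
    # Unclipped and clipped surrogate terms
    surr1 = ratio * advantages
    surr2 = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # PPO maximizes the minimum of the two; negate to obtain a loss
    return -torch.min(surr1, surr2).mean()
```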
We use the 28-DoF Humanoid in all of our benchmarked experiments; it has 4 additional ankle joints compared to the 24-DoF Humanoid used in MuJoCo.
The more complex humanoid provides more natural and balanced running behaviors, especially for the Flagrun tasks.
However, we can also train the 24-DoF MuJoCo Humanoid; see the left video below for a trained policy.
With our default rewards, the Ant agents learn to run with only two legs. We can also modify the reward to enforce four-legged running, as seen in the right video below.
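The exact reward modification is not spelled out here; purely as a hypothetical illustration of this kind of reward shaping, the sketch below adds a bonus for keeping feet in contact with the ground and a penalty for rearing up. The function name, inputs, and thresholds are invented for the example.

```python
import numpy as np

def shaped_ant_reward(base_reward, foot_contacts, torso_height,
                      contact_bonus=0.1, max_torso_height=0.7):
    """Hypothetical reward shaping to discourage two-legged Ant gaits.

    foot_contacts: boolean array with one entry per foot (True if touching ground)
    torso_height:  current torso height in meters
    """
    # Bonus for keeping more feet on the ground
    reward = base_reward + contact_bonus * np.sum(foot_contacts)
    # Penalty for rearing up, which accompanies two-legged running
    if torso_height > max_torso_height:
        reward -= torso_height - max_torso_height
    return reward
```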
The Humanoid running task can be learned in 16 minutes (reaching a reward of 5000) with 1024 agents on a single GPU.
The Humanoid Flagrun Harder task can be learned in 2 hours with 1024 agents on a single GPU.
The Humanoid Flagrun Harder on Complex Terrain task is significantly harder than the other tasks, and we show how our system can scale simulation and training to multiple GPUs (512 agents per GPU) with distributed PPO.
This task can be learned in 2.5 hours with 16K agents on 32 GPUs.
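As a rough sketch of how synchronous, data-parallel PPO across GPUs can work (one worker process per GPU, gradients averaged with an all-reduce), consider the snippet below. This is a common pattern rather than our exact implementation; it assumes torch.distributed has already been initialized (e.g., with the NCCL backend), and the function names are illustrative.

```python
import torch
import torch.distributed as dist

def average_gradients(model):
    """Average gradients across all workers (one worker per GPU)."""
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad.data, op=dist.ReduceOp.SUM)
            param.grad.data /= world_size

def distributed_ppo_step(model, optimizer, loss):
    """One synchronous update: local backward pass, then global gradient averaging."""
    optimizer.zero_grad()
    loss.backward()
    average_gradients(model)  # each GPU's batch of agents contributes equally
    optimizer.step()
```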
Left: collisions with dynamic obstacles. Right: varying the gravity setting from -50% to +50% of the value used during training.
Note that the humanoids in the videos are able to collide and interact with each other.