GPU-Accelerated Robotics Simulation for Distributed Reinforcement Learning

Jacky Liang*, Viktor Makoviychuk*, Ankur Handa*, Nuttapong Chentanez, Miles Macklin, Dieter Fox

* equal contribution


Most Deep Reinforcement Learning (Deep RL) algorithms require a prohibitively large number of training samples for learning complex tasks. Many recent works on speeding up Deep RL have focused on distributed training and simulation. While distributed training is often done on the GPU, simulation is not. In this work, we propose using GPU-accelerated RL simulations as an alternative to CPU ones. Using NVIDIA Flex, a GPU-based physics engine, we show promising speed-ups in learning various continuous-control locomotion tasks. With one GPU and one CPU core, we are able to train the Humanoid running task in less than 20 minutes, using 10-1000x fewer CPU cores than previous works. We also demonstrate the scalability of our simulator to multi-GPU settings to train more challenging locomotion tasks.


- We can simulate hundreds to thousands of robots at the same time on a single GPU.

- Instead of simulating individual robots on separate CPU cores, we load all simulated agents into the same scene on one GPU, so they can interact and collide with each other.

- The peak GPU simulation frame time per agent for the humanoid environment is less than 0.02 ms.
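
To put the per-agent frame time in perspective, the short calculation below converts it into throughput. It is only a back-of-the-envelope sketch: it reads the 0.02 ms figure as an upper bound on per-agent simulation cost, takes 1024 agents as the scene size (the agent count used in the single-GPU experiments), and the function names are illustrative rather than part of the simulator.

```python
def agent_frames_per_second(frame_time_ms_per_agent: float) -> float:
    """Convert a per-agent frame time in milliseconds into agent-frames per second."""
    return 1000.0 / frame_time_ms_per_agent


def scene_frame_time_ms(num_agents: int, frame_time_ms_per_agent: float) -> float:
    """Time to advance a whole scene by one frame, assuming cost scales linearly with agents."""
    return num_agents * frame_time_ms_per_agent


PER_AGENT_MS = 0.02  # upper bound quoted for the humanoid environment
NUM_AGENTS = 1024    # assumed scene size (single-GPU experiments)

throughput = agent_frames_per_second(PER_AGENT_MS)
scene_ms = scene_frame_time_ms(NUM_AGENTS, PER_AGENT_MS)

print(f"at least {throughput:,.0f} agent-frames per second")
print(f"at most {scene_ms:.2f} ms per frame for {NUM_AGENTS} agents")
```

Under these assumptions a single GPU advances at least 50,000 agent-frames per second. These are simulation frames, which need not coincide one-to-one with policy control steps.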


We benchmarked our system on the following 4 tasks:

  • Ant
  • Humanoid running
  • Humanoid Flagrun Harder (in addition to running, agent must learn to change directions and recover from falls)
  • Humanoid Flagrun Harder on Complex Terrain (terrain is not flat and has static obstacles)

We used Proximal Policy Optimization (PPO) to train all of our policies. See videos of trained policies below.
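
For readers unfamiliar with PPO, the sketch below shows its clipped surrogate objective in NumPy. It illustrates the standard published formulation, not our training code; the function name and the clipping value of 0.2 are assumptions, since hyperparameters are not listed on this page.

```python
import numpy as np


def ppo_clipped_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """Negative PPO clipped surrogate objective, averaged over samples.

    clip_eps=0.2 is an assumed, commonly used value.
    """
    new_log_probs = np.asarray(new_log_probs, dtype=float)
    old_log_probs = np.asarray(old_log_probs, dtype=float)
    advantages = np.asarray(advantages, dtype=float)

    # Probability ratio between the updated policy and the data-collecting policy.
    ratio = np.exp(new_log_probs - old_log_probs)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages

    # Taking the element-wise minimum removes the incentive to move the
    # ratio outside [1 - clip_eps, 1 + clip_eps].
    return -np.mean(np.minimum(unclipped, clipped))


# A ratio of 2 with a positive advantage is clipped to 1.2.
loss = ppo_clipped_loss([np.log(2.0)], [0.0], [1.0])
print(loss)
```

With many agents simulated in parallel, each simulation step contributes one sample per agent to the batch over which this loss is averaged.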

We use the 28-DoF Humanoid in all of our benchmarked experiments; it has 4 more ankle joints than the 24-DoF Humanoid used in MuJoCo.

The more complex humanoid provides more natural and balanced running behaviors, especially for the Flagrun tasks.

We can also train the 24-DoF MuJoCo Humanoid; see the left video below for a trained policy.

With our default rewards, Ant agents learn to run using only two legs. We can also modify the reward to enforce four-legged running, as seen in the right video below.

Reward Curves

The Humanoid running task can be learned in 16 minutes (reaching a reward of 5000) with 1024 agents on a single GPU.

The Humanoid Flagrun Harder task can be learned in 2 hours with 1024 agents on a single GPU.

The Humanoid Flagrun Harder on Complex Terrain task is significantly harder than the other tasks, and we use it to show how our system scales to simulating and training on multiple GPUs (512 agents per GPU) with distributed PPO.

The Humanoid Flagrun Harder on Complex Terrain task can be learned in 2.5 hours with 16K agents on 32 GPUs.
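
The agent count follows from the per-GPU figure: 512 agents on each of 32 GPUs gives 16,384 agents. The sketch below checks that arithmetic and illustrates one common way to combine workers, synchronous gradient averaging. This page does not describe how distributed PPO is implemented, so the averaging scheme, the function names, and the learning rate are all assumptions made for illustration.

```python
import numpy as np

AGENTS_PER_GPU = 512
NUM_GPUS = 32
TOTAL_AGENTS = AGENTS_PER_GPU * NUM_GPUS  # 16384, i.e. "16K"


def average_gradients(worker_grads):
    """Element-wise mean of per-worker gradients (stand-in for an all-reduce)."""
    return np.mean(np.stack(worker_grads, axis=0), axis=0)


def apply_update(params, grad, lr=1e-3):
    """Plain gradient-descent step; lr is an assumed value."""
    return params - lr * grad


# Hypothetical example: each GPU worker reports a gradient computed from
# the experience of its own 512 agents (random numbers stand in for them).
rng = np.random.default_rng(0)
params = np.zeros(4)
worker_grads = [rng.normal(size=params.shape) for _ in range(NUM_GPUS)]

mean_grad = average_gradients(worker_grads)
params = apply_update(params, mean_grad)
print(TOTAL_AGENTS, params.shape)
```

When every worker holds the same number of samples, the averaged gradient equals the gradient of the mean loss over the combined batch, so all workers can apply an identical update.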

Humanoid Flagrun Harder policies are robust to external perturbations unseen during training.

Left: collisions with dynamic obstacles. Right: varying the gravity setting from -50% to +50% of the value used in training.
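
The gravity range in the right video corresponds to scaling the training gravity by factors between 0.5 and 1.5. A minimal sketch of such a sweep follows; the training value of -9.81 m/s^2, the number of sweep points, and the function name are assumptions, as this page does not state them.

```python
import numpy as np

TRAINED_GRAVITY = -9.81  # m/s^2 along the vertical axis; assumed training value


def gravity_sweep(trained_gravity, num_settings=5, max_change=0.5):
    """Evenly spaced gravity values from -max_change to +max_change around the trained value."""
    scales = np.linspace(1.0 - max_change, 1.0 + max_change, num_settings)
    return scales * trained_gravity


for g in gravity_sweep(TRAINED_GRAVITY):
    print(f"test gravity: {g:.3f} m/s^2")
```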

Note that the humanoids in the videos are able to collide and interact with each other.

We augment the state space with a height map for the Humanoid Flagrun Harder on Complex Terrain task.
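
One way to picture this augmentation is a small grid of terrain heights sampled around the agent and appended to its usual observation vector. The sketch below is illustrative only: the grid size, the spacing, the use of heights relative to the agent, and every function name are assumptions, not the layout used in our experiments.

```python
import numpy as np


def sample_height_map(terrain, agent_xy, grid_size=5, spacing=1):
    """Sample a grid_size x grid_size patch of terrain heights centred on the agent.

    `terrain` is a 2D array of heights indexed in cell units; samples that
    would fall outside the terrain are clamped to its border.
    """
    half = grid_size // 2
    offsets = np.arange(-half, half + 1) * spacing
    cx, cy = int(round(agent_xy[0])), int(round(agent_xy[1]))
    xs = np.clip(cx + offsets, 0, terrain.shape[0] - 1)
    ys = np.clip(cy + offsets, 0, terrain.shape[1] - 1)
    return terrain[np.ix_(xs, ys)]


def augment_observation(base_obs, terrain, agent_xy, agent_height):
    """Append terrain heights, expressed relative to the agent, to the base observation."""
    heights = sample_height_map(terrain, agent_xy) - agent_height
    return np.concatenate([base_obs, heights.ravel()])


# Hypothetical usage: a 10-dimensional base observation on flat terrain.
terrain = np.zeros((20, 20))
obs = augment_observation(np.zeros(10), terrain, agent_xy=(10.0, 10.0), agent_height=1.0)
print(obs.shape)
```

Expressing heights relative to the agent keeps the added features independent of absolute terrain elevation, which is one plausible design choice for terrain the policy has not seen before.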

Our agents trained on complex terrains are able to adapt to environments unseen during training, including climbing up and down stairs and recovering from falls from heights.


This is an ongoing project at NVIDIA. For questions and other inquiries, please contact corresponding author Viktor Makoviychuk (vmakoviychuk [at]