How to Make Deep RL Work in Practice

NeurIPS 2020, Workshop on Deep RL

Nirnai Rao*, Elie Aljalbout*, Axel Sauer* and Sami Haddadin

* Shared First Authorship

Abstract: In recent years, challenging control problems have become solvable with deep reinforcement learning (RL). To use RL for large-scale real-world applications, a certain degree of reliability in its performance is necessary. Reported results of state-of-the-art algorithms are often difficult to reproduce. One reason for this is that certain implementation details influence performance significantly. These details are commonly not highlighted as important techniques for achieving state-of-the-art performance. Additionally, techniques from supervised learning are often used by default, yet they influence algorithms in a reinforcement learning setting in different and not well-understood ways. In this paper, we investigate the influence of certain initialization, input normalization, and adaptive learning techniques on the performance of state-of-the-art RL algorithms. We suggest which of these techniques to use by default and highlight areas that could benefit from solutions specifically tailored to RL.

Investigated Implementation Details

After surveying popular repositories implementing state-of-the-art deep RL algorithms, we found the following implementation details to be the most common and the most influential on RL performance:

  • Initialization

  • Input Normalization

  • Learning Rate Schedules

  • Advantage Normalization

  • Gradient Clipping

  • KL-Stopping

  • KL-Cutoff

Effect of Initialization:

In an RL context, initialization techniques affect the initial action distribution, as illustrated in the figure below:

Consequently, different initialization techniques lead to different performance even for the same algorithm, as shown below and in the code sketch that follows:

[Figure: learning curves for TD3, SAC, and TRPO under different initialization schemes]
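The mechanism is easy to reproduce. The following sketch (PyTorch; the network sizes, gain values, and observation batch are illustrative assumptions, not taken from our experiments) shows how the initialization scale of the policy's output layer changes how concentrated the initial action distribution is:

```python
# Minimal sketch: the scale used to initialize the final policy layer determines
# how spread out the initial (untrained) action distribution is.
import torch
import torch.nn as nn


def make_policy(final_gain: float) -> nn.Sequential:
    """Two-layer Gaussian-mean policy; only the output layer's init gain differs."""
    net = nn.Sequential(nn.Linear(8, 64), nn.Tanh(), nn.Linear(64, 2))
    for layer in net:
        if isinstance(layer, nn.Linear):
            nn.init.orthogonal_(layer.weight, gain=1.0)
            nn.init.zeros_(layer.bias)
    # Re-initialize only the output layer with the chosen gain.
    nn.init.orthogonal_(net[-1].weight, gain=final_gain)
    return net


obs = torch.randn(1000, 8)              # hypothetical batch of observations
with torch.no_grad():
    for gain in (1.0, 0.01):            # standard vs. small output-layer init
        mean_actions = make_policy(gain)(obs)
        print(f"gain={gain}: mean |action| = {mean_actions.abs().mean().item():.3f}")
# With gain=0.01 the initial action means stay near zero, so early exploration is
# driven almost entirely by the policy's (state-independent) log-std parameter.
```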

Effect of Normalization:

Similarly, input normalization plays an important role in the performance of RL algorithms:

[Figure: learning curves for TD3 and TRPO with and without input normalization]
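A common way to implement input normalization in RL codebases is to standardize observations with running statistics collected during training. The sketch below shows one such normalizer (NumPy; the class name, clipping range, and epsilon are illustrative choices, not the exact implementation from any particular repository):

```python
import numpy as np


class RunningObsNormalizer:
    """Online observation normalizer using a Welford-style running mean/variance."""

    def __init__(self, shape, eps: float = 1e-8, clip: float = 10.0):
        self.mean = np.zeros(shape, dtype=np.float64)
        self.var = np.ones(shape, dtype=np.float64)
        self.count = eps
        self.clip = clip

    def update(self, batch: np.ndarray) -> None:
        # Merge the batch statistics into the running mean and variance.
        batch_mean = batch.mean(axis=0)
        batch_var = batch.var(axis=0)
        batch_count = batch.shape[0]

        delta = batch_mean - self.mean
        total = self.count + batch_count
        new_mean = self.mean + delta * batch_count / total
        m2 = (self.var * self.count + batch_var * batch_count
              + delta ** 2 * self.count * batch_count / total)
        self.mean, self.var, self.count = new_mean, m2 / total, total

    def normalize(self, obs: np.ndarray) -> np.ndarray:
        # Standardize and clip to keep network inputs in a bounded range.
        return np.clip((obs - self.mean) / np.sqrt(self.var + 1e-8),
                       -self.clip, self.clip)
```

At evaluation time the statistics are typically frozen and reused, so that the policy sees inputs on the same scale as during training.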

Effect of Adaptive Learning Techniques:

Learning rate schedules, advantage normalization, gradient clipping, KL-stopping, and KL-cutoff affect the optimization and gradient computation of RL algorithms. We show the effect of these techniques on PPO:
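To make the role of each technique concrete, the sketch below places all of them inside a single PPO update step (PyTorch; the hyperparameter values, approximate-KL estimate, and helper signature are illustrative assumptions rather than the exact implementation evaluated here):

```python
import torch
from torch.nn.utils import clip_grad_norm_


def ppo_update(policy, optimizer, obs, actions, old_logp, advantages,
               clip_ratio=0.2, target_kl=0.015, kl_cutoff_coef=1000.0,
               max_grad_norm=0.5, epochs=10):
    for _ in range(epochs):
        dist = policy(obs)                 # assumes policy returns a torch distribution
        logp = dist.log_prob(actions)
        ratio = torch.exp(logp - old_logp)

        # Advantage normalization: standardize advantages per batch.
        adv = (advantages - advantages.mean()) / (advantages.std() + 1e-8)

        # Clipped surrogate objective.
        surr = torch.min(ratio * adv,
                         torch.clamp(ratio, 1 - clip_ratio, 1 + clip_ratio) * adv)
        loss = -surr.mean()

        approx_kl = (old_logp - logp).mean()

        # KL-cutoff: add a large quadratic penalty once the KL exceeds the target.
        if approx_kl > target_kl:
            loss = loss + kl_cutoff_coef * (approx_kl - target_kl) ** 2

        # KL-stopping: abandon further epochs on this batch if the KL grows too large.
        if approx_kl > 1.5 * target_kl:
            break

        optimizer.zero_grad()
        loss.backward()
        # Gradient clipping: bound the global gradient norm before the step.
        clip_grad_norm_(policy.parameters(), max_grad_norm)
        optimizer.step()

# Learning rate schedule (illustrative usage): anneal the optimizer's learning rate
# once per iteration, e.g. with torch.optim.lr_scheduler.LambdaLR(optimizer,
# lambda it: 1.0 - it / total_iterations) and scheduler.step() after each update.
```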