How to Make Deep RL Work in Practice
NeurIPS 2020, Workshop on Deep RL
Nirnai Rao*, Elie Aljalbout*, Axel Sauer* and Sami Haddadin
* Shared First Authorship
Abstract: In recent years, challenging control problems have become solvable with deep reinforcement learning (RL). To use RL in large-scale real-world applications, a certain degree of reliability in performance is necessary. Reported results of state-of-the-art algorithms are often difficult to reproduce. One reason is that certain implementation details influence performance significantly. Commonly, these details are not highlighted as important techniques for achieving state-of-the-art performance. Additionally, techniques from supervised learning are often used by default but influence the algorithms in a reinforcement learning setting in different and not well-understood ways. In this paper, we investigate the influence of certain initialization, input normalization, and adaptive learning techniques on the performance of state-of-the-art RL algorithms. We make suggestions on which of these techniques to use by default and highlight areas that could benefit from a solution specifically tailored to RL.
Investigated Implementation Details
After surveying popular repositories that implement state-of-the-art deep RL algorithms, we found the following implementation details to be the most common and the most influential on RL performance:
Initialization
Input Normalization
Learning Rate Schedules
Advantage Normalization
Gradient Clipping
KL-Stopping
KL-Cutoff
Effect of Initialization:
In an RL context, initialization techniques affect the initial action distribution, as illustrated in the figure below:
Consequently, different initialization techniques lead to different performance even for the same algorithm:
[Figure: learning curves under different initialization schemes for TD3, SAC, and TRPO]
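The mechanism behind the effect on the initial action distribution can be illustrated with a small sketch. A trick common in public policy-gradient implementations is to scale down the final layer of the policy network so that the untrained policy outputs actions near zero; all network shapes and the 0.01 gain below are illustrative choices, not this paper's exact setup:

```python
import numpy as np

def mlp_policy(obs, w1, w2):
    """Tiny deterministic policy: one hidden tanh layer, linear output."""
    return np.tanh(obs @ w1) @ w2

rng = np.random.default_rng(0)
obs = rng.standard_normal((1000, 8))  # batch of random observations

# Default init: unit-scale Gaussian weights in both layers.
w1 = rng.standard_normal((8, 64))
w2_default = rng.standard_normal((64, 2))

# Small final-layer init (gain 0.01): the initial policy then produces
# actions tightly concentrated around the center of the action space.
w2_small = 0.01 * rng.standard_normal((64, 2))

spread_default = mlp_policy(obs, w1, w2_default).std()
spread_small = mlp_policy(obs, w1, w2_small).std()
print(spread_default, spread_small)  # small init concentrates the initial actions
```

The initial spread of actions determines how aggressively the untrained agent explores, which is one way initialization propagates into final performance.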
Effect of Normalization:
Similarly, input normalization plays an important role in the performance of RL algorithms:
[Figure: learning curves with and without input normalization for TD3 and TRPO]
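Unlike supervised learning, where the dataset statistics are known up front, an RL agent must normalize observations with statistics estimated online. A minimal sketch of the running mean/variance normalizer found in many RL codebases (the class name and batch sizes here are illustrative):

```python
import numpy as np

class RunningNormalizer:
    """Online observation normalizer using a Welford-style running
    mean/variance update, applied to inputs before the network."""
    def __init__(self, shape, eps=1e-8):
        self.mean = np.zeros(shape)
        self.var = np.ones(shape)
        self.count = eps

    def update(self, x):
        batch_mean, batch_var, n = x.mean(axis=0), x.var(axis=0), x.shape[0]
        delta = batch_mean - self.mean
        total = self.count + n
        # Parallel combination of (mean, var) from the old stream and the batch.
        self.mean = self.mean + delta * n / total
        m_old = self.var * self.count
        m_new = batch_var * n
        self.var = (m_old + m_new + delta**2 * self.count * n / total) / total
        self.count = total

    def __call__(self, x):
        return (x - self.mean) / np.sqrt(self.var + 1e-8)

rng = np.random.default_rng(0)
norm = RunningNormalizer(shape=(3,))
for _ in range(100):
    batch = rng.normal(loc=5.0, scale=2.0, size=(64, 3))  # raw observations
    norm.update(batch)

normed = norm(rng.normal(loc=5.0, scale=2.0, size=(1000, 3)))
print(normed.mean(), normed.std())  # approximately 0 and 1
```

Because the statistics shift as new states are visited, the same observation is normalized differently over the course of training, which is one reason this technique interacts with RL in ways it does not in supervised learning.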
Effect of Adaptive Learning Techniques:
Learning rate schedules, advantage normalization, gradient clipping, KL-stopping, and KL-cutoff affect the optimization and gradient computation of RL algorithms. We show the effect of these techniques on PPO:
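Three of these techniques can be sketched compactly. The thresholds below (a max gradient norm of 0.5 and a target KL of 0.015) are typical defaults from public PPO implementations, not values taken from this paper:

```python
import numpy as np

def normalize_advantages(adv):
    """Advantage normalization: standardize the advantage estimates
    per batch before computing the policy loss."""
    return (adv - adv.mean()) / (adv.std() + 1e-8)

def clip_gradient(grad, max_norm=0.5):
    """Global-norm gradient clipping: rescale the gradient so its
    Euclidean norm never exceeds max_norm."""
    norm = np.sqrt(np.sum(grad**2))
    return grad * min(1.0, max_norm / (norm + 1e-8))

def kl_stop(old_logp, new_logp, target_kl=0.015):
    """KL-stopping: stop further update epochs on a batch once the
    approximate KL between old and new policy exceeds a threshold."""
    approx_kl = np.mean(old_logp - new_logp)
    return approx_kl > target_kl

adv = np.array([1.0, 3.0, -2.0, 0.5])
print(normalize_advantages(adv).mean())  # ~0: zero-mean, unit-scale advantages

grad = np.array([3.0, 4.0])  # norm 5, above max_norm
print(np.linalg.norm(clip_gradient(grad)))  # 0.5
```

KL-cutoff differs from KL-stopping in that, instead of halting updates, it adds a penalty to the loss for samples whose KL divergence exceeds the threshold; both aim to keep the new policy close to the one that collected the data.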