On Pathologies in KL-Regularized Reinforcement Learning from Expert Demonstrations
Tim G. J. Rudner*, Cong Lu*, Michael A. Osborne, Yarin Gal, Yee Whye Teh
in Advances in Neural Information Processing Systems 34 (NeurIPS), 2021.
Abstract
KL-regularized reinforcement learning from expert demonstrations has proved successful in improving the sample efficiency of deep reinforcement learning algorithms, allowing them to be applied to challenging physical real-world tasks. However, we show that KL-regularized reinforcement learning with behavioral policies derived from expert demonstrations suffers from hitherto unrecognized pathological behavior that can lead to slow, unstable, and suboptimal online training. We show empirically that the pathology occurs for commonly chosen behavioral policy classes and demonstrate its impact on sample efficiency and online policy performance. Finally, we show that the pathology can be remedied by specifying non-parametric behavioral policies and that doing so allows KL-regularized RL to significantly outperform state-of-the-art approaches on a variety of challenging locomotion and dexterous hand manipulation tasks.
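For orientation, KL-regularized RL objectives are commonly written in the following form. This is a sketch in standard notation rather than a quotation from the paper: the symbols $\pi$ (online policy), $\pi_0$ (behavioral policy derived from expert demonstrations), $r$ (reward), $\gamma$ (discount factor), and $\alpha$ (regularization weight) are assumptions introduced here, since this page does not define them.

```latex
J(\pi) = \mathbb{E}_{\pi}\left[ \sum_{t=0}^{\infty} \gamma^{t}
  \Big( r(s_t, a_t)
  - \alpha \, D_{\mathrm{KL}}\big( \pi(\cdot \mid s_t) \,\|\, \pi_0(\cdot \mid s_t) \big) \Big) \right]
```

Under this form, the penalty term keeps the online policy close to the behavioral policy, so the shape of $\pi_0$, including its predictive variance, directly affects the training signal.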
Behaviors Learned with N-PPAC
door-binary-v0
An ablation study shows that the predictive variance of the non-parametric behavioral prior is crucial to the success of KL-regularized RL. In the ablation, the behavioral prior's predictive mean is fixed to a GP mean, and its predictive variance is set either to that of a parametric behavioral policy (a neural network trained via MLE) or to that of a non-parametric behavioral policy (a GP posterior predictive).
Parametric (NN)
Non-Parametric (GP)
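To illustrate why the prior's predictive variance alone can matter, here is a minimal sketch, assuming both the online policy and the behavioral prior are diagonal Gaussians (an assumption made for illustration, as are the helper name `gaussian_kl` and all numeric values). It uses the standard closed-form KL divergence between Gaussians and holds the means fixed while shrinking the prior's standard deviation.

```python
import numpy as np


def gaussian_kl(mu_q, sigma_q, mu_p, sigma_p):
    """Closed-form KL(q || p) between diagonal Gaussians, summed over dimensions."""
    mu_q, sigma_q, mu_p, sigma_p = (
        np.asarray(x, dtype=float) for x in (mu_q, sigma_q, mu_p, sigma_p)
    )
    var_ratio = (sigma_q / sigma_p) ** 2
    mean_term = ((mu_q - mu_p) / sigma_p) ** 2
    return 0.5 * float(np.sum(var_ratio + mean_term - 1.0 - np.log(var_ratio)))


# Hypothetical one-dimensional online policy at some state (illustrative values).
mu_online, sigma_online = [0.1], [0.5]
# Hypothetical prior mean, held fixed as in the ablation's fixed-mean setup.
mu_prior = [0.0]

# Shrink only the prior's standard deviation and record the KL penalty.
kl_values = []
for sigma_prior in (1.0, 0.1, 0.01):
    kl = gaussian_kl(mu_online, sigma_online, mu_prior, [sigma_prior])
    kl_values.append(kl)
    print(f"prior std = {sigma_prior:<5} KL = {kl:.3f}")
```

With the means unchanged, the penalty grows by orders of magnitude as the prior's standard deviation shrinks, which is consistent with the ablation's focus on predictive variance rather than predictive mean.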
pen-binary-v0
N-PPAC @ 0 TIMESTEPS
Pre-training from the offline prior gives a 70% success rate; the policy is then fine-tuned online by N-PPAC.
N-PPAC @ 100k TIMESTEPS
Online fine-tuning rapidly improves the success rate, shown here after 100k timesteps.
For the following three MuJoCo environments, we show the online performance of KL-regularized RL with a non-parametric behavioral policy after a small number of timesteps, showing rapid adaptation from, and improvement over, the expert demonstrations.
HalfCheetah-v2
N-PPAC @ 5k steps
N-PPAC @ 50k steps
N-PPAC @ 200k steps
Walker2d-v2
N-PPAC @ 5k steps
N-PPAC @ 50k steps
N-PPAC @ 200k steps
Ant-v2
N-PPAC @ 5k steps
N-PPAC @ 50k steps
N-PPAC @ 200k steps
BibTeX
@InProceedings{rudner2021pathologies,
  title = {On Pathologies in {KL}-Regularized Reinforcement Learning from Expert Demonstrations},
  author = {Tim G. J. Rudner and Cong Lu and Michael A. Osborne and Yarin Gal and Yee Whye Teh},
  booktitle = {Advances in Neural Information Processing Systems 34},
  year = {2021},
}