Recurrent Model-free RL
Can Be a Strong Baseline
for Many POMDPs
International Conference on Machine Learning (ICML), 2022
Tianwei Ni (Université de Montréal & Mila - Quebec AI Institute) Benjamin Eysenbach (Carnegie Mellon University) Ruslan Salakhutdinov (Carnegie Mellon University)
Paper: arXiv Code: GitHub Numerical Results: Google Drive
Poster, Slides, Talk
Abstract
Many problems in RL, such as meta-RL, robust RL, generalization in RL, and temporal credit assignment, can be cast as POMDPs. In theory, simply augmenting model-free RL with memory-based architectures, such as recurrent neural networks, provides a general approach to solving all types of POMDPs. However, prior work has found that such recurrent model-free RL methods tend to perform worse than more specialized algorithms that are designed for specific types of POMDPs. This paper revisits this claim. We find that careful architecture and hyperparameter decisions can often yield a recurrent model-free implementation that performs on par with (and occasionally substantially better than) more sophisticated recent techniques. We compare to 21 environments from 6 prior specialized methods and find that our implementation achieves greater sample efficiency and asymptotic performance than these methods on 18/21 environments. We also release a simple and efficient implementation of recurrent model-free RL for future work to use as a baseline for POMDPs.
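To make the idea in the abstract concrete, the sketch below shows a recurrent model-free policy in the simplest possible form: a GRU cell summarizes the observation history into a memory vector, and a linear head maps that memory to action logits. All names, shapes, and weight initializations here are illustrative assumptions, not the paper's actual implementation (which builds on off-policy actor-critic methods).

```python
import numpy as np

class RecurrentPolicy:
    """Minimal sketch of a recurrent (GRU-based) policy for POMDPs.

    The agent never sees the environment's hidden state; it conditions on
    the full observation history through the GRU memory vector h. This is
    an illustrative toy, not the paper's implementation.
    """

    def __init__(self, obs_dim, act_dim, hidden_dim=32, seed=0):
        rng = np.random.default_rng(seed)
        in_dim = obs_dim + hidden_dim
        # GRU gate weights: update (z), reset (r), candidate (n).
        self.Wz = rng.normal(0.0, 0.1, (in_dim, hidden_dim))
        self.Wr = rng.normal(0.0, 0.1, (in_dim, hidden_dim))
        self.Wn = rng.normal(0.0, 0.1, (in_dim, hidden_dim))
        # Linear policy head mapping memory state -> action logits.
        self.Wa = rng.normal(0.0, 0.1, (hidden_dim, act_dim))
        self.hidden_dim = hidden_dim

    def initial_state(self):
        return np.zeros(self.hidden_dim)

    def step(self, obs, h):
        """One time step: fold obs into memory, output action logits."""
        sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
        xh = np.concatenate([obs, h])
        z = sigmoid(xh @ self.Wz)                            # update gate
        r = sigmoid(xh @ self.Wr)                            # reset gate
        n = np.tanh(np.concatenate([obs, r * h]) @ self.Wn)  # candidate
        h_new = (1.0 - z) * n + z * h                        # new memory
        logits = h_new @ self.Wa
        return logits, h_new
```

At rollout time, the memory vector is reset at the start of each episode and threaded through `step` call by call; training any standard model-free algorithm on top of this history summary is what makes the approach a general-purpose POMDP baseline.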
Learning Curves in Our Paper
Methods for POMDPs
Ours: our implementation of recurrent model-free RL, using the same variant in each subarea
With some hyperparameter tuning of the decision factors, it can perform even better (see the ablation study in our paper)
Lower bounds:
Random: random policy
Markovian: Markovian policy
Upper bound:
Oracle: Markovian policy with access to the hidden states of POMDPs
Other recurrent model-free RL:
Specialized / model-based methods we compare against:
"Standard" POMDPs
Observations include only positions and angles (-P) or only their velocities (-V).
Ant-P
Cheetah-P
Hopper-P
Walker-P
Ant-V
Cheetah-V
Hopper-V
Walker-V
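The -P / -V occlusion can be sketched as slicing the full state vector, assuming (as in common MuJoCo setups) that it concatenates positions/angles followed by velocities; the function name and layout here are illustrative.

```python
import numpy as np

def occlude(state, num_pos, mode):
    """Sketch of the -P / -V observation occlusion.

    Assumes the full state is [positions/angles, velocities]
    (illustrative layout, not the exact benchmark code).
    """
    if mode == "P":   # keep positions and angles only
        return state[:num_pos]
    if mode == "V":   # keep velocities only
        return state[num_pos:]
    raise ValueError(f"unknown mode: {mode}")
```

Because either slice alone is an incomplete description of the dynamics, the resulting task is a POMDP even though the underlying simulator is Markovian.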
Meta-RL
The hidden state is the task variable, which normally appears only in the reward function. We compare against both off-policy and on-policy variBAD on their respective environments.
Semi-Circle
Wind
Cheetah-Vel
Cheetah-Dir
Ant-Dir
Humanoid-Dir
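A toy example of how the hidden task variable enters only the reward: in a direction-following task in the spirit of Ant-Dir, the agent is rewarded for velocity along a goal direction it never observes. The function below is an illustrative sketch, not the benchmark's exact reward.

```python
import numpy as np

def direction_reward(velocity_xy, goal_direction):
    """Meta-RL reward sketch (Ant-Dir-style): the hidden task variable
    goal_direction enters only the reward, never the observation, so the
    agent must infer the task from the reward signal in its history.
    Illustrative form, not the exact benchmark reward.
    """
    # Reward is the velocity component along the hidden goal direction.
    return float(np.dot(velocity_xy, goal_direction))
```

Since the observation omits `goal_direction`, an agent that remembers past rewards (e.g., via a recurrent policy) can identify the task within an episode, whereas a Markovian policy cannot.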
Robust RL
The objective is to maximize the worst-case return over all tasks (right figures), instead of the average return (left figures).
Cheetah-Robust
Hopper-Robust
Walker-Robust
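The distinction between the two evaluation criteria can be sketched in a few lines; the function name is illustrative.

```python
def robust_objective(returns_per_task):
    """Robust RL evaluation sketch: score a policy by its worst-case
    return across tasks, alongside the average return used in standard
    evaluation. Illustrative helper, not the benchmark's code.
    """
    worst = min(returns_per_task)
    average = sum(returns_per_task) / len(returns_per_task)
    return worst, average
```

A policy that excels on most tasks but fails on one scores well on the average but poorly on the worst-case criterion, which is why the two panels can rank methods differently.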
Generalization in RL
The objective is to maximize the return on the testing tasks, which may lie within the training task distribution (interpolation; left figures) or out of distribution (extrapolation; right figures).
Cheetah-Generalize
Hopper-Generalize
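For a scalar task parameter, the interpolation/extrapolation split amounts to checking whether each test task falls inside the training range; the range-based split below is an illustrative sketch of that evaluation protocol.

```python
def split_test_tasks(test_tasks, train_low, train_high):
    """Generalization-eval sketch: classify scalar test tasks as
    interpolation (inside the training range) or extrapolation
    (outside it). Illustrative helper, not the benchmark's code.
    """
    interp = [t for t in test_tasks if train_low <= t <= train_high]
    extrap = [t for t in test_tasks if not (train_low <= t <= train_high)]
    return interp, extrap
```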
Temporal Credit Assignment
Rewards are often delayed, and the reward function can be history-dependent.
Delayed-Catch
Key-to-Door
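A history-dependent reward in the Key-to-Door spirit can be sketched as follows: opening the door pays off only if the key was picked up earlier in the episode, so no function of the current step alone can express the reward. The event names are illustrative, not the environment's actual API.

```python
def key_to_door_reward(history, action):
    """Sketch of a history-dependent reward (Key-to-Door spirit):
    reaching the door is rewarded only if the key was collected earlier
    in the episode. Event names are illustrative.
    """
    if action == "open_door" and "pickup_key" in history:
        return 1.0
    return 0.0
```

Crediting the early key pickup for the late door reward is exactly the long-horizon credit assignment that a memory-based policy must handle.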
Acknowledgement
We thank Pierre-Luc Bacon, Murtaza Dalal, Paul Pu Liang, Sergey Levine, Evgenii Nikishin, Hao Sun, and Maxime Wabartha for their constructive feedback on the draft of this paper. TN thanks Pierre-Luc Bacon for suggesting experiments on temporal credit assignment and Michel Ma and Pierluca D'Oro for their help on the environments. We thank Luisa Zintgraf for sharing the learning curves of on-policy variBAD.
TN thanks the CMU and Mila clusters for compute resources.
This work is supported by the Facebook CIFAR AI Chair, the Fannie and John Hertz Foundation and NSF GRFP (DGE1745016).