Recurrent Model-Free RL Can Be a Strong Baseline for Many POMDPs

International Conference on Machine Learning (ICML), 2022

Tianwei Ni (Université de Montréal & Mila - Quebec AI Institute), Benjamin Eysenbach (Carnegie Mellon University), Ruslan Salakhutdinov (Carnegie Mellon University)

Paper: arXiv · Code: GitHub · Numerical Results: Google Drive

Poster, Slides, Talk


Abstract

Many problems in RL, such as meta-RL, robust RL, generalization in RL, and temporal credit assignment, can be cast as POMDPs. In theory, simply augmenting model-free RL with memory-based architectures, such as recurrent neural networks, provides a general approach to solving all types of POMDPs. However, prior work has found that such recurrent model-free RL methods tend to perform worse than more specialized algorithms that are designed for specific types of POMDPs. This paper revisits this claim. We find that careful architecture and hyperparameter decisions can often yield a recurrent model-free implementation that performs on par with (and occasionally substantially better than) more sophisticated recent techniques. We compare against 6 prior specialized methods on their 21 environments and find that our implementation achieves greater sample efficiency and asymptotic performance than these methods on 18 of the 21 environments. We also release a simple and efficient implementation of recurrent model-free RL for future work to use as a baseline for POMDPs.
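
The recipe the abstract refers to is conceptually simple: summarize the history of observations, actions, and rewards with a recurrent network, and let a standard model-free actor (and critic) condition on that summary. Below is a minimal PyTorch sketch of such a recurrent actor; it is not the released implementation, and all class names and dimensions are illustrative.

```python
import torch
import torch.nn as nn

class RecurrentActor(nn.Module):
    """Minimal sketch: a policy conditioned on an RNN summary of the history."""

    def __init__(self, obs_dim, act_dim, hidden_dim=128):
        super().__init__()
        # Embed each (observation, previous action, previous reward) tuple.
        self.embed = nn.Linear(obs_dim + act_dim + 1, hidden_dim)
        # The GRU hidden state acts as a learned summary of the history.
        self.rnn = nn.GRU(hidden_dim, hidden_dim, batch_first=True)
        # A standard model-free policy head on top of the history summary.
        self.policy = nn.Sequential(nn.ReLU(), nn.Linear(hidden_dim, act_dim), nn.Tanh())

    def forward(self, obs, prev_act, prev_rew, h=None):
        # obs: (batch, T, obs_dim); prev_act: (batch, T, act_dim); prev_rew: (batch, T, 1)
        x = torch.relu(self.embed(torch.cat([obs, prev_act, prev_rew], dim=-1)))
        summary, h = self.rnn(x, h)        # summary: (batch, T, hidden_dim)
        return self.policy(summary), h     # actions for every step, plus the carried state

# At evaluation time, feed one step at a time and carry `h` across steps.
actor = RecurrentActor(obs_dim=17, act_dim=6)
obs = torch.zeros(1, 1, 17); prev_act = torch.zeros(1, 1, 6); prev_rew = torch.zeros(1, 1, 1)
action, h = actor(obs, prev_act, prev_rew)
```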

Learning Curves in Our Paper

Methods for POMDPs

  • Ours: our implementation of recurrent model-free RL, using the same variant in each subarea

    • With some hyperparameter tuning of the decision factors, it can perform even better (see the ablation study in our paper)

  • Lower bounds:

    • Random: random policy

    • Markovian: Markovian policy that conditions only on the current observation

  • Upper bound:

    • Oracle: Markovian policy with access to the hidden states of the POMDP (see the input sketch after this list)

  • Other recurrent model-free RL (varies by subarea; see each figure's legend)

  • Compared specialized / model-based methods (varies by subarea; see each figure's legend)
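
These baselines differ only in what the policy may condition on. As a rough illustration (not the benchmark code; `hidden_state` stands in for whatever the environment hides, such as the task variable):

```python
import numpy as np

def markovian_input(obs):
    # Random and Markovian baselines condition on the current observation only.
    return obs

def oracle_input(obs, hidden_state):
    # The Oracle additionally observes the hidden state, which removes the
    # partial observability and upper-bounds achievable performance.
    return np.concatenate([obs, hidden_state])

def recurrent_input(history):
    # Recurrent model-free RL conditions on the full history of
    # (observation, action, reward) tuples, summarized by an RNN
    # (see the sketch after the abstract).
    return history
```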

"Standard" POMDPs

Observations only include positions and angles (-P) or their velocities (-V).
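
A sketch of how such occlusion can be implemented as a Gymnasium observation wrapper; it assumes the observation vector stores positions/angles before velocities, and the split index below is illustrative rather than the exact one used in the benchmark.

```python
import gymnasium as gym  # or `import gym` for older Gym versions
import numpy as np

class OccludeObservation(gym.ObservationWrapper):
    """Keep only part of a MuJoCo observation: positions/angles (-P) or velocities (-V)."""

    def __init__(self, env, mode="P", split_index=None):
        super().__init__(env)
        dim = env.observation_space.shape[0]
        # Illustrative assumption: the first part of the vector holds
        # positions/angles and the rest holds velocities.
        split = split_index if split_index is not None else dim // 2
        self.keep = np.arange(0, split) if mode == "P" else np.arange(split, dim)
        low, high = env.observation_space.low[self.keep], env.observation_space.high[self.keep]
        self.observation_space = gym.spaces.Box(low=low, high=high, dtype=np.float64)

    def observation(self, obs):
        return obs[self.keep]

# Example: a position-only ("-P") version of HalfCheetah.
env = OccludeObservation(gym.make("HalfCheetah-v4"), mode="P")
```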

Learning curves: Ant-P, Cheetah-P, Hopper-P, Walker-P, Ant-V, Cheetah-V, Hopper-V, Walker-V

Meta-RL

The hidden state is the task variable, which normally appears only in the reward function. We compare against both off-policy and on-policy variBAD, each on its respective environments.
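
For intuition, here is a sketch of a Cheetah-Dir-style task distribution in which the hidden task variable is a target direction that enters only the reward. It assumes a Gymnasium MuJoCo environment that reports `x_velocity` in `info`; this is illustrative, not the benchmark's environment code.

```python
import numpy as np

class DirectionTaskWrapper:
    """Illustrative meta-RL task: the hidden variable is a target direction (+1 or -1)
    that appears only in the reward, never in the observation."""

    def __init__(self, env, rng=None):
        self.env = env
        self.rng = rng or np.random.default_rng()
        self.direction = 1.0

    def reset(self, **kwargs):
        # Sample a new hidden task at the start of each (meta-)episode.
        self.direction = self.rng.choice([-1.0, 1.0])
        return self.env.reset(**kwargs)

    def step(self, action):
        obs, _, terminated, truncated, info = self.env.step(action)
        # The reward depends on the hidden direction; the agent must infer it
        # from the history of rewards it receives.
        reward = self.direction * info.get("x_velocity", 0.0)
        return obs, reward, terminated, truncated, info
```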

Learning curves: Semi-Circle, Wind, Cheetah-Vel, Cheetah-Dir, Ant-Dir, Humanoid-Dir

Robust RL

The objective is to maximize the worst-case return over all tasks (right figures), rather than the average return (left figures).
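
Concretely, the two objectives aggregate per-task returns differently. The sketch below assumes a Gymnasium-style environment API and a hypothetical `make_env(task_param)` factory; it is not the benchmark's evaluation code.

```python
import numpy as np

def rollout_return(policy, env, max_steps=1000):
    # Run one episode and sum the rewards (Gymnasium-style API assumed).
    obs, _ = env.reset()
    total = 0.0
    for _ in range(max_steps):
        obs, reward, terminated, truncated, _ = env.step(policy(obs))
        total += reward
        if terminated or truncated:
            break
    return total

def evaluate_across_tasks(policy, make_env, task_params, episodes_per_task=5):
    # Average return per task, then aggregate in the two ways discussed above.
    per_task = [np.mean([rollout_return(policy, make_env(p))
                         for _ in range(episodes_per_task)]) for p in task_params]
    return {"average_return": float(np.mean(per_task)),  # standard objective
            "worst_return": float(np.min(per_task))}     # robust-RL objective
```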

Learning curves: Cheetah-Robust, Hopper-Robust, Walker-Robust

Generalization in RL

The objective is to maximize the return on test tasks, which may lie within the training task distribution (interpolation; left figures) or outside it (extrapolation; right figures).
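
For illustration, interpolation and extrapolation can be thought of as sampling a scalar task parameter (e.g., a mass or friction scale) from inside versus outside the training range; the ranges below are made up, not the benchmark's.

```python
import numpy as np

rng = np.random.default_rng(0)
TRAIN_RANGE = (0.5, 1.5)                   # task parameters seen during training
EXTRAP_RANGES = [(0.2, 0.5), (1.5, 2.0)]   # outside the training range

def sample_train_task():
    return rng.uniform(*TRAIN_RANGE)

def sample_interpolation_task():
    # Held-out test tasks drawn from inside the training range.
    return rng.uniform(*TRAIN_RANGE)

def sample_extrapolation_task():
    # Held-out test tasks drawn from outside the training range.
    low, high = EXTRAP_RANGES[rng.integers(len(EXTRAP_RANGES))]
    return rng.uniform(low, high)
```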

Learning curves: Cheetah-Generalize, Hopper-Generalize

Temporal Credit Assignment

Rewards are often delayed, and the reward function can be history-dependent.
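
As a simple illustration of delayed, history-dependent reward (a sketch, not the benchmark's Catch or Key-to-Door environments), a wrapper can withhold every per-step reward and pay their sum only at the end of the episode:

```python
import gymnasium as gym

class DelayAllRewards(gym.Wrapper):
    """Withhold per-step rewards and pay the accumulated sum at episode end,
    so credit must be assigned across the whole episode."""

    def __init__(self, env):
        super().__init__(env)
        self._accumulated = 0.0

    def reset(self, **kwargs):
        self._accumulated = 0.0
        return self.env.reset(**kwargs)

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        self._accumulated += reward
        done = terminated or truncated
        # The reward signal is zero until the final step, which makes the
        # reward effectively history-dependent from the agent's point of view.
        delayed_reward = self._accumulated if done else 0.0
        return obs, delayed_reward, terminated, truncated, info
```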

Learning curves: Delayed-Catch, Key-to-Door

Acknowledgement

We thank Pierre-Luc Bacon, Murtaza Dalal, Paul Pu Liang, Sergey Levine, Evgenii Nikishin, Hao Sun, and Maxime Wabartha for their constructive feedback on the draft of this paper. TN thanks Pierre-Luc Bacon for suggesting experiments on temporal credit assignment and Michel Ma and Pierluca D'Oro for their help on the environments. We thank Luisa Zintgraf for sharing the learning curves of on-policy variBAD.

TN thanks the CMU and Mila clusters for compute resources.

This work is supported by the Facebook CIFAR AI Chair, the Fannie and John Hertz Foundation and NSF GRFP (DGE1745016).