Eric Chen* 4, Zhang-Wei Hong* 1 4, Joni Pajarinen 3, Pulkit Agrawal 1 2 4
1 NSF AI Institute for Artificial Intelligence and Fundamental Interactions
2 MIT-IBM Watson AI Lab
3 Aalto University
4 MIT Improbable AI Lab
Abstract: State-of-the-art reinforcement learning (RL) algorithms typically use random sampling (e.g., epsilon-greedy) for exploration, but this method fails in hard exploration tasks like Montezuma's Revenge. To address the challenge of exploration, prior works incentivize the agent to visit novel states using an exploration bonus (also called an intrinsic reward or curiosity). Such methods can lead to excellent results on hard exploration tasks but can suffer from intrinsic reward bias and underperform when compared to an agent trained using only task rewards. This performance decrease occurs when an agent seeks out intrinsic rewards and performs unnecessary exploration even when sufficient task reward is available. This inconsistency in performance across tasks prevents the widespread use of intrinsic rewards with RL algorithms. We propose a principled constrained policy optimization procedure that automatically tunes the importance of the intrinsic reward: it suppresses the intrinsic reward when exploration is unnecessary and increases it when exploration is required. This results in superior exploration that does not require manual tuning to balance the intrinsic reward against the task reward. Consistent performance gains across sixty-one ATARI games validate our claim.
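To make the constrained formulation concrete, one way to write the objective described above (the notation J_{E+I}, J_E, and pi*_E is ours, introduced only for illustration) is to maximize the combined extrinsic-plus-intrinsic return while requiring the policy to earn at least as much task reward as a policy trained on the extrinsic reward alone:

```latex
\[
\max_{\pi}\; J_{E+I}(\pi)
\qquad \text{subject to} \qquad
J_{E}(\pi) \;\ge\; J_{E}\!\left(\pi_{E}^{*}\right)
\]
```

In such a formulation, the multiplier attached to the constraint effectively plays the role of the exploration knob: when exploration starts to cost task reward, the constraint pushes the intrinsic term down, and when the constraint is comfortably satisfied, exploration is allowed more influence.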
Reinforcement learning algorithms work well on tasks where dense extrinsic rewards provide frequent feedback. This is the case for ATARI games like James Bond. Other games like Montezuma's Revenge have sparse extrinsic rewards that are difficult to find using random exploration schemes like epsilon-greedy, which prevents standard RL algorithms from learning a successful policy. Supplementing a sparse extrinsic reward with a dense intrinsic reward that incentivizes the agent to explore novel states can greatly improve performance. In any game where extrinsic rewards are not completely sparse, however, intrinsic rewards can distract the agent from the task at hand. This unpredictable effect of intrinsic rewards on performance makes them difficult to apply broadly in practice. Our method, Extrinsic-Intrinsic Policy Optimization (EIPO), dynamically tunes the balance between extrinsic and intrinsic rewards during training, and closes the performance gap between intrinsic and extrinsic reward optimization across environments.
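For concreteness, below is a minimal PyTorch sketch of the standard setup EIPO builds on: an RND-style novelty bonus added to the task reward with a fixed coefficient lambda. The class and function names (`RNDBonus`, `mixed_reward`) and the network sizes are ours, not the paper's code; it is the fixed coefficient in `mixed_reward` that EIPO replaces with an automatically tuned trade-off.

```python
# Illustrative sketch (not the paper's implementation) of an RND-style
# novelty bonus combined with the task reward via a fixed coefficient.
import torch
import torch.nn as nn


class RNDBonus(nn.Module):
    """Random Network Distillation-style bonus: the prediction error of a
    trained predictor against a fixed, randomly initialized target network.
    Rarely visited observations yield large errors, i.e., large bonuses."""

    def __init__(self, obs_dim: int, feat_dim: int = 64):
        super().__init__()
        self.target = nn.Sequential(
            nn.Linear(obs_dim, 128), nn.ReLU(), nn.Linear(128, feat_dim))
        self.predictor = nn.Sequential(
            nn.Linear(obs_dim, 128), nn.ReLU(), nn.Linear(128, feat_dim))
        for p in self.target.parameters():  # the target network stays frozen
            p.requires_grad_(False)

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        # Per-sample squared prediction error serves as the intrinsic reward.
        return (self.predictor(obs) - self.target(obs)).pow(2).mean(dim=-1)


def mixed_reward(r_ext: torch.Tensor, r_int: torch.Tensor, lam: float) -> torch.Tensor:
    """Fixed-coefficient combination r = r_ext + lam * r_int.
    EIPO's goal is to tune this trade-off automatically during training."""
    return r_ext + lam * r_int
```

In practice the predictor is trained online by regressing toward the frozen target on visited observations; the sketch omits that training loop for brevity.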
EIPO eliminates the need to manually tune the balance between intrinsic and extrinsic rewards. To demonstrate this, we manually tune the intrinsic-extrinsic scaling coefficient lambda in several games where RND substantially underperforms the extrinsic-only (EO) baseline, and find that lambda can have a large effect on performance. EIPO matches the best baseline in every environment, suggesting that our method automatically tunes the balance between intrinsic and extrinsic rewards as training progresses. In some environments (e.g., Star Gunner, Yars' Revenge), EIPO significantly outperforms all baselines. These results show that dynamic tuning not only avoids the performance drops caused by a poorly chosen fixed coefficient, but can also yield gains that no fixed setting of lambda achieves. We hope EIPO will encourage practitioners to use intrinsic rewards to improve exploration on new benchmark tasks without worrying about hurting performance or manually tuning sensitive hyperparameters.
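For intuition only, here is a minimal sketch of what dynamically tuning the coefficient could look like. This is our own illustrative rule under the constrained view sketched earlier, not the EIPO update: shrink the intrinsic weight when the mixed-reward policy loses extrinsic return relative to an extrinsic-only reference, and grow it when exploration is not costing task reward. The function name, step size, and clipping range are assumptions made for the example.

```python
def update_intrinsic_coef(lam: float,
                          ext_return_mixed: float,
                          ext_return_reference: float,
                          step_size: float = 0.01,
                          lam_max: float = 1.0) -> float:
    """Hypothetical dual-style update of the intrinsic-reward weight `lam`.

    If the policy trained on mixed rewards earns less extrinsic return than an
    extrinsic-only reference, shrink `lam` (suppress exploration); if it keeps
    up, grow `lam` (exploration is not costing task reward). A sketch of the
    general idea, not the EIPO update rule.
    """
    constraint_slack = ext_return_mixed - ext_return_reference
    lam = lam + step_size * constraint_slack
    return float(min(max(lam, 0.0), lam_max))


# Example: exploration is hurting task reward, so the weight decreases.
lam = update_intrinsic_coef(lam=0.5, ext_return_mixed=80.0, ext_return_reference=100.0)
print(lam)  # 0.3
```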
Acknowledgments
We thank members of the Improbable AI Lab for helpful discussions and feedback. We are grateful to MIT Supercloud and the Lincoln Laboratory Supercomputing Center for providing HPC resources. Google Cloud credits provided as part of Google-MIT support were used in this work. The research in this paper was supported in part by the MIT-IBM Watson AI Lab, an AWS MLRA research grant and compute resources, the DARPA Machine Common Sense Program, the Army Research Office MURI under Grant Number W911NF-21-1-0328, ONR MURI under Grant Number N00014-22-1-2740, and by the United States Air Force Research Laboratory and the United States Air Force Artificial Intelligence Accelerator under Cooperative Agreement Number FA8750-19-2-1000. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Army Research Office or the United States Air Force or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation herein.