Evolving Reinforcement Learning Algorithms
John D. Co-Reyes, Yingjie Miao, Daiyi Peng, Esteban Real, Sergey Levine, Quoc V. Le, Honglak Lee, Aleksandra Faust
Oral @ ICLR 2021
arXiv | ICLR Oral Talk | github | algorithms | openreview | blog post
Abstract: We propose a method for meta-learning reinforcement learning algorithms by searching over the space of computational graphs which compute the loss function for a value-based model-free RL agent to optimize. The learned algorithms are domain-agnostic and can generalize to new environments not seen during training. Our method can both learn from scratch and bootstrap off known existing algorithms, like DQN, enabling interpretable modifications which improve performance. Learning from scratch on simple classical control and gridworld tasks, our method rediscovers the temporal-difference (TD) algorithm. Bootstrapped from DQN, we highlight two learned algorithms which obtain good generalization performance over other classical control tasks, gridworld-type tasks, and Atari games. Analysis of the learned algorithms' behavior shows resemblance to recently proposed RL algorithms that address overestimation in value-based methods.
Overview
Insight: Represent an RL algorithm as a computational graph
Method: Evolve a population of graphs by mutating, training, and evaluating RL agents
Result: Learn new algorithms that generalize to unseen environments
RL Algorithm as a Computational Graph
We represent an RL algorithm as a computational graph that computes the loss function for an agent to optimize over its experience. Nodes in the graph are typed and include neural network modules as basic primitives. This representation is expressive, generalizable, and interpretable.
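To make this concrete, here is a minimal sketch of a loss expressed as a typed computational graph. The Node class, operation set, and input names are illustrative stand-ins for the paper's actual search language, not its exact vocabulary; as an example, the graph below composes basic primitives into the standard DQN TD loss.

```python
# Minimal sketch: an RL loss as a computational graph of typed nodes.
# The node set and input names here are illustrative, not the paper's
# exact operation vocabulary; inputs are NumPy arrays over a batch.
import numpy as np

class Node:
    """A graph node: an operation applied to the outputs of its parents."""
    def __init__(self, op, parents=()):
        self.op = op            # callable, or a string naming a leaf input
        self.parents = parents  # input nodes

    def evaluate(self, inputs):
        if self.op in inputs:   # leaf node: look up an input by name
            return inputs[self.op]
        return self.op(*(p.evaluate(inputs) for p in self.parents))

# Leaves hold the agent's experience and Q-network outputs.
q_sa     = Node("q_sa")        # Q(s_t, a_t)
max_q    = Node("max_q_next")  # max_a Q_target(s_{t+1}, a)
reward   = Node("reward")
discount = Node("discount")

# Interior nodes compose primitives into a loss: here, the DQN TD loss
# delta^2 with delta = Q(s_t, a_t) - (r + gamma * max_a Q_target(s', a)).
target = Node(lambda r, g, q: r + g * q, (reward, discount, max_q))
delta  = Node(lambda q, t: q - t, (q_sa, target))
loss   = Node(lambda d: np.mean(d ** 2), (delta,))

batch = {"q_sa": np.array([1.0, 0.5]), "max_q_next": np.array([0.8, 0.2]),
         "reward": np.array([0.0, 1.0]), "discount": np.array([0.99, 0.99])}
print(loss.evaluate(batch))  # scalar loss the agent would minimize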
Evolving RL Algorithms
We use an evolution-based approach: we first initialize a population of agents, each with a randomized computation graph. The agents are trained in parallel, and each agent's evaluated performance is used to update the population, so that more promising algorithms are mutated further. Because every candidate is trained and evaluated independently, the search scales naturally with available compute. At the end of training, the best algorithm is selected and evaluated on a set of unseen test environments.
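Below is a schematic sketch of such a search loop, in the spirit of regularized evolution (tournament selection, mutate the winner, age out the oldest individual). random_graph, mutate, and train_and_eval are hypothetical placeholders for the paper's actual components, and the loop is written sequentially where the real system trains agents in parallel.

```python
# Schematic evolutionary search over loss graphs. random_graph, mutate,
# and train_and_eval are placeholders, not the paper's implementation.
import random
from collections import deque

def evolve(random_graph, mutate, train_and_eval,
           population_size=100, tournament_size=10, num_cycles=1000):
    # Initialize a population of randomized loss graphs, each scored by
    # training an RL agent with it and measuring the resulting return.
    population = deque()
    for _ in range(population_size):
        graph = random_graph()
        population.append((graph, train_and_eval(graph)))

    for _ in range(num_cycles):
        # Tournament selection: mutate the best of a random sample.
        sample = random.sample(list(population), tournament_size)
        parent = max(sample, key=lambda item: item[1])[0]
        child = mutate(parent)
        population.append((child, train_and_eval(child)))
        population.popleft()  # age out the oldest individual

    # Return the highest-scoring algorithm found during the search.
    return max(population, key=lambda item: item[1])[0]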
Beyond rediscovering known algorithms from scratch, the search can bootstrap off existing algorithms such as DQN, yielding learned algorithms that are both generalizable and interpretable.
Learned Algorithms Generalize to Unseen Environments
Bootstrapped from DQN, the search discovers DQNReg, a learned loss function compared here against the DDQN baseline on training and unseen test environments.
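For reference, here is a minimal NumPy sketch contrasting the standard squared TD loss with the learned DQNReg loss, which adds a weighted penalty on the chosen action's Q-value to the squared TD error. The regularization weight k = 0.1 and the DQN-style bootstrapped target reflect our reading of the paper; treat them as assumptions rather than the exact published form.

```python
# Sketch: standard TD loss vs. the learned DQNReg loss, which penalizes
# large Q(s_t, a_t) and thereby discourages value overestimation.
# k = 0.1 and the max-based target are assumptions from our reading.
import numpy as np

def td_loss(q_sa, reward, discount, max_q_next):
    """Standard DQN loss: squared TD error."""
    delta = q_sa - (reward + discount * max_q_next)
    return np.mean(delta ** 2)

def dqn_reg_loss(q_sa, reward, discount, max_q_next, k=0.1):
    """Learned DQNReg loss: squared TD error plus k * Q(s_t, a_t)."""
    delta = q_sa - (reward + discount * max_q_next)
    return np.mean(k * q_sa + delta ** 2)
```

The extra k * Q(s_t, a_t) term acts as a conservative regularizer on the predicted values, consistent with the abstract's observation that the learned algorithms resemble recent methods addressing overestimation in value-based RL.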
Atari Performance
Even though training used only non-image-based environments, the learned algorithms improve performance on more complex, image-based Atari environments.