Evolving Reinforcement Learning Algorithms
John D. Co-Reyes, Yingjie Miao, Daiyi Peng, Esteban Real, Sergey Levine, Quoc V. Le, Honglak Lee, Aleksandra Faust
Oral @ ICLR 2021
arXiv | ICLR Oral Talk | github | algorithms | openreview | blog post
Abstract: We propose a method for meta-learning reinforcement learning algorithms by searching over the space of computational graphs which compute the loss function for a value-based model-free RL agent to optimize. The learned algorithms are domain-agnostic and can generalize to new environments not seen during training. Our method can both learn from scratch and bootstrap off known existing algorithms, like DQN, enabling interpretable modifications which improve performance. Learning from scratch on simple classical control and gridworld tasks, our method rediscovers the temporal-difference (TD) algorithm. Bootstrapped from DQN, we highlight two learned algorithms which obtain good generalization performance over other classical control tasks, gridworld-type tasks, and Atari games. Analysis of the learned algorithms' behavior shows resemblance to recently proposed RL algorithms that address overestimation in value-based methods.
Overview
Insight: Represent an RL algorithm as a computational graph
Method: Evolve a population of graphs by mutating, training, and evaluating RL agents
Result: Learn new algorithms that generalize to unseen environments
RL Algorithm as a Computational Graph
We represent an RL algorithm as a computational graph that computes the loss function for an agent to optimize over its experience. Nodes in the graph are typed and include neural network modules as basic primitives. This representation is expressive, generalizable, and interpretable.
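To make this concrete, here is a minimal sketch of a loss expressed as a typed computational graph. The Node class, operation set, and input names are illustrative stand-ins for the paper's actual search language, not its exact vocabulary; as an example, the graph below composes basic primitives into the standard DQN TD loss.

```python
# Minimal sketch: an RL loss as a computational graph of typed nodes.
# The node set and input names here are illustrative, not the paper's
# exact operation vocabulary; inputs are NumPy arrays over a batch.
import numpy as np

class Node:
    """A graph node: an operation applied to the outputs of its parents."""
    def __init__(self, op, parents=()):
        self.op = op            # callable, or a string naming a leaf input
        self.parents = parents  # input nodes

    def evaluate(self, inputs):
        if self.op in inputs:   # leaf node: look up an input by name
            return inputs[self.op]
        return self.op(*(p.evaluate(inputs) for p in self.parents))

# Leaves hold the agent's experience and Q-network outputs.
q_sa     = Node("q_sa")        # Q(s_t, a_t)
max_q    = Node("max_q_next")  # max_a Q_target(s_{t+1}, a)
reward   = Node("reward")
discount = Node("discount")

# Interior nodes compose primitives into a loss: here, the DQN TD loss
# delta^2 with delta = Q(s_t, a_t) - (r + gamma * max_a Q_target(s', a)).
target = Node(lambda r, g, q: r + g * q, (reward, discount, max_q))
delta  = Node(lambda q, t: q - t, (q_sa, target))
loss   = Node(lambda d: np.mean(d ** 2), (delta,))

batch = {"q_sa": np.array([1.0, 0.5]), "max_q_next": np.array([0.8, 0.2]),
         "reward": np.array([0.0, 1.0]), "discount": np.array([0.99, 0.99])}
print(loss.evaluate(batch))  # scalar loss the agent would minimize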
Evolving RL Algorithms
We use an evolution-based approach: we first initialize a population of agents, each with a randomized computation graph. The agents are trained in parallel, and each agent's evaluated performance is used to update the population, so that more promising algorithms are mutated further. Because every candidate is trained and evaluated independently, the search scales naturally with available compute. At the end of training, the best algorithm is selected and evaluated on a set of unseen test environments.
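Below is a schematic sketch of such a search loop, in the spirit of regularized evolution (tournament selection, mutate the winner, age out the oldest individual). random_graph, mutate, and train_and_eval are hypothetical placeholders for the paper's actual components, and the loop is written sequentially where the real system trains agents in parallel.

```python
# Schematic evolutionary search over loss graphs. random_graph, mutate,
# and train_and_eval are placeholders, not the paper's implementation.
import random
from collections import deque

def evolve(random_graph, mutate, train_and_eval,
           population_size=100, tournament_size=10, num_cycles=1000):
    # Initialize a population of randomized loss graphs, each scored by
    # training an RL agent with it and measuring the resulting return.
    population = deque()
    for _ in range(population_size):
        graph = random_graph()
        population.append((graph, train_and_eval(graph)))

    for _ in range(num_cycles):
        # Tournament selection: mutate the best of a random sample.
        sample = random.sample(list(population), tournament_size)
        parent = max(sample, key=lambda item: item[1])[0]
        child = mutate(parent)
        population.append((child, train_and_eval(child)))
        population.popleft()  # age out the oldest individual

    # Return the highest-scoring algorithm found during the search.
    return max(population, key=lambda item: item[1])[0]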
Beyond rediscovering known algorithms from scratch, the search can bootstrap off existing algorithms such as DQN, yielding learned algorithms that are both generalizable and interpretable.
Learned Algorithms Generalize to Unseen Environments
Bootstrapped from DQN, the search discovers DQNReg, a learned loss function compared here against the DDQN baseline on training and unseen test environments.
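For reference, here is a minimal NumPy sketch contrasting the standard squared TD loss with the learned DQNReg loss, which adds a weighted penalty on the chosen action's Q-value to the squared TD error. The regularization weight k = 0.1 and the DQN-style bootstrapped target reflect our reading of the paper; treat them as assumptions rather than the exact published form.

```python
# Sketch: standard TD loss vs. the learned DQNReg loss, which penalizes
# large Q(s_t, a_t) and thereby discourages value overestimation.
# k = 0.1 and the max-based target are assumptions from our reading.
import numpy as np

def td_loss(q_sa, reward, discount, max_q_next):
    """Standard DQN loss: squared TD error."""
    delta = q_sa - (reward + discount * max_q_next)
    return np.mean(delta ** 2)

def dqn_reg_loss(q_sa, reward, discount, max_q_next, k=0.1):
    """Learned DQNReg loss: squared TD error plus k * Q(s_t, a_t)."""
    delta = q_sa - (reward + discount * max_q_next)
    return np.mean(k * q_sa + delta ** 2)
```

The extra k * Q(s_t, a_t) term acts as a conservative regularizer on the predicted values, consistent with the abstract's observation that the learned algorithms resemble recent methods addressing overestimation in value-based RL.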
Atari Performance
Even though training used only non-image-based environments, the learned algorithms improve performance on more complex, image-based Atari environments.