RL^2: Fast Reinforcement Learning via Slow Reinforcement Learning

Authors: Yan Duan, John Schulman, Xi Chen, Peter L. Bartlett, Ilya Sutskever, Pieter Abbeel

Abstract:

Deep reinforcement learning (deep RL) has been successful in learning sophisticated behaviors automatically; however, the learning process requires a huge number of trials. In contrast, animals can learn new tasks in just a few trials, benefiting from their prior knowledge about the world. This paper seeks to bridge this gap. Rather than designing a "fast" reinforcement learning algorithm, we propose to represent it as a recurrent neural network (RNN) and learn it from data. In our proposed method, RL^2, the algorithm is encoded in the weights of the RNN, which are learned slowly through a general-purpose ("slow") RL algorithm. The RNN receives all information a typical RL algorithm would receive, including observations, actions, rewards, and termination flags; and it retains its state across episodes in a given Markov Decision Process (MDP). The activations of the RNN store the state of the "fast" RL algorithm on the current (previously unseen) MDP. We evaluate RL^2 experimentally on both small-scale and large-scale problems. On the small-scale side, we train it to solve randomly generated multi-armed bandit problems and finite MDPs. After RL^2 is trained, its performance on new MDPs is close to human-designed algorithms with optimality guarantees. On the large-scale side, we test RL^2 on a vision-based navigation task and show that it scales up to high-dimensional problems.
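The interaction pattern described in the abstract can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the names BernoulliBandit, TinyRNNPolicy, and run_trial are hypothetical, the RNN is a plain untrained tanh cell with random weights, and the placeholder values fed in at the very first step are an assumption. It shows the two points the abstract emphasizes: the input at each step combines the observation with the previous action, reward, and termination flag, and the hidden state is reset only when a new MDP is drawn, not at episode boundaries.

```python
import math
import random


def one_hot(index, size):
    """Return a list of length `size` with 1.0 at position `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]


class BernoulliBandit:
    """Hypothetical toy MDP: each arm pays 1 with a fixed, hidden probability."""

    def __init__(self, n_arms, rng):
        self.probs = [rng.random() for _ in range(n_arms)]
        self.rng = rng

    def step(self, action):
        reward = 1.0 if self.rng.random() < self.probs[action] else 0.0
        done = True  # every pull is treated as a one-step episode
        return [1.0], reward, done  # dummy observation: bandits have no state


class TinyRNNPolicy:
    """Untrained tanh RNN. In RL^2 the weights would be learned by the slow
    outer RL algorithm; here they are random, purely for illustration."""

    def __init__(self, input_size, hidden_size, n_actions, rng):
        def mat(rows, cols):
            return [[rng.uniform(-0.1, 0.1) for _ in range(cols)]
                    for _ in range(rows)]
        self.w_in = mat(hidden_size, input_size)
        self.w_rec = mat(hidden_size, hidden_size)
        self.w_out = mat(n_actions, hidden_size)
        self.hidden_size = hidden_size
        self.rng = rng

    def initial_state(self):
        return [0.0] * self.hidden_size

    def act(self, x, h):
        # One recurrent step: new hidden state from input and old hidden state.
        new_h = [
            math.tanh(sum(w * xi for w, xi in zip(self.w_in[j], x))
                      + sum(w * hi for w, hi in zip(self.w_rec[j], h)))
            for j in range(self.hidden_size)
        ]
        logits = [sum(w * hi for w, hi in zip(row, new_h))
                  for row in self.w_out]
        top = max(logits)
        weights = [math.exp(v - top) for v in logits]  # unnormalized softmax
        action = self.rng.choices(range(len(logits)), weights=weights)[0]
        return action, new_h


def run_trial(env, policy, n_actions, n_episodes):
    """Run several episodes in ONE MDP without resetting the hidden state."""
    h = policy.initial_state()  # reset only at the start of a trial
    # Assumed placeholders for "previous" action, reward, and flag at step 0.
    obs, prev_action, prev_reward, prev_done = [1.0], 0, 0.0, 0.0
    inputs, rewards = [], []
    for _ in range(n_episodes):
        done = False
        while not done:
            x = obs + one_hot(prev_action, n_actions) + [prev_reward, prev_done]
            action, h = policy.act(x, h)  # h persists across episode boundaries
            obs, reward, done = env.step(action)
            prev_action, prev_reward, prev_done = action, reward, float(done)
            inputs.append(x)
            rewards.append(reward)
    return inputs, rewards, h


rng = random.Random(0)
n_arms = 3
env = BernoulliBandit(n_arms, rng)
policy = TinyRNNPolicy(input_size=1 + n_arms + 2, hidden_size=8,
                       n_actions=n_arms, rng=rng)
inputs, rewards, final_h = run_trial(env, policy, n_arms, n_episodes=5)
```

Because the hidden state carries over between episodes, whatever the network has seen in earlier episodes of the same MDP is available to it later in the trial; training the weights with an outer RL algorithm to exploit that memory is the part this sketch leaves out.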

Paper on arXiv

Visual navigation - good behavior

Overall performance

We found that the learned RL^2 agent exhibits a very efficient navigation strategy: during the first episode, it applies consistent motions to move forward or make turns. It navigates around the maze to find the target. During the second episode, it already knows where the goal is, and can directly head towards it without requiring further exploration.

Although we trained only on small mazes, the agent also performs well when tested on larger mazes.

Visual navigation - (very occasional) bad behavior

Shows a learned RL^2 agent navigating a maze for 2 episodes in which its behavior is not as intended:

- It glimpses the goal position but does not pick up the signal initially;

- It wanders back and forth a couple of times.

Visual navigation - before training (random policy)

Shows an agent's behavior before learning (essentially taking random actions).

Visual navigation - good behavior (large maze)

Shows a learned RL^2 agent navigating a maze for 2 episodes. In the first episode it moves around and explores, and in the second episode it remembers where the goal was and navigates directly to it. This is a large maze that was not encountered during training (we only train the agent on smaller mazes). Hence the agent has learned to generalize both its exploratory behavior and its ability to improve upon earlier experience (by remembering where the goal was and taking the shortcut) to larger mazes.

Comparison with MAML

We compared our approach to MAML, a recently published meta-learning algorithm. We found that RL^2 significantly outperforms MAML on all RL tasks considered in the MAML paper.