A Simple Neural Attentive Learner

Authors: Anonymous Authors

Abstract: Deep neural networks excel in regimes with large amounts of data, but tend to struggle when data is scarce or when they need to adapt quickly to changes in the task. Recent work in meta-learning seeks to overcome this shortcoming by training a meta-learner on a distribution of similar tasks, in the hope that it generalizes to novel but related tasks by learning a high-level strategy that captures the essence of the problems it is asked to solve. However, many recent approaches to meta-learning are extensively hand-designed, either using architectures specialized to a particular application, or hard-coding algorithmic components that constrain how the meta-learner solves the task. We propose a class of simple and generic meta-learner architectures that use a novel combination of temporal convolutions and soft attention: the former to aggregate information from past experience, the latter to pinpoint specific pieces of information. We validate the resulting Simple Neural AttentIve Learner (or SNAIL) by conducting the most extensive experimental evaluation to date on heavily benchmarked meta-learning tasks, attaining state-of-the-art performance on all of them by a significant margin.
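
To make the combination concrete, here is a minimal PyTorch sketch of how a causal temporal-convolution block and a causal soft-attention block might be interleaved. It is only an illustration of the general idea, not the paper's exact architecture: the kernel size, activation, single attention head, and layer widths below are all assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConvBlock(nn.Module):
    """Dilated causal 1-D convolution over the time axis; the output is
    concatenated to the input so later layers keep access to earlier features."""
    def __init__(self, in_channels, filters, dilation):
        super().__init__()
        self.dilation = dilation
        self.conv = nn.Conv1d(in_channels, filters, kernel_size=2, dilation=dilation)

    def forward(self, x):                      # x: (batch, channels, time)
        padded = F.pad(x, (self.dilation, 0))  # left-pad so step t sees only steps <= t
        h = torch.tanh(self.conv(padded))
        return torch.cat([x, h], dim=1)

class CausalAttentionBlock(nn.Module):
    """Single-head soft attention with a causal mask, letting each timestep
    pinpoint information from any earlier timestep."""
    def __init__(self, in_channels, key_dim, value_dim):
        super().__init__()
        self.query = nn.Linear(in_channels, key_dim)
        self.key = nn.Linear(in_channels, key_dim)
        self.value = nn.Linear(in_channels, value_dim)
        self.scale = key_dim ** 0.5

    def forward(self, x):                      # x: (batch, channels, time)
        h = x.transpose(1, 2)                  # (batch, time, channels)
        q, k, v = self.query(h), self.key(h), self.value(h)
        logits = torch.matmul(q, k.transpose(1, 2)) / self.scale
        t = h.size(1)
        mask = torch.tril(torch.ones(t, t, device=x.device)).bool()
        logits = logits.masked_fill(~mask, float("-inf"))
        attended = torch.matmul(F.softmax(logits, dim=-1), v)
        return torch.cat([x, attended.transpose(1, 2)], dim=1)

# Illustrative stack: convolution aggregates, attention pinpoints.
x = torch.randn(4, 16, 32)                     # (batch, input features, timesteps)
block = nn.Sequential(
    CausalConvBlock(16, filters=16, dilation=1),
    CausalAttentionBlock(32, key_dim=16, value_dim=16),
)
print(block(x).shape)                          # torch.Size([4, 48, 32])
```

Both blocks concatenate their outputs to their inputs, so each later block can draw on the raw features as well as everything computed by earlier blocks across all previous timesteps.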

Few-Shot Classification

In the few-shot classification setting, we wish to classify data points into N classes when we only have a small number (K) of labeled examples per class. A meta-learner is readily applicable because it learns how to compare input points, rather than memorizing a specific mapping from points to classes.
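
As an illustration of this episodic setup (not code from the paper), the sketch below samples one N-way, K-shot episode from a generic labeled dataset; sample_episode and the toy data are hypothetical names.

```python
import random

def sample_episode(dataset, n_way=5, k_shot=1, query_per_class=1, rng=None):
    """Sample one N-way, K-shot episode from a {class_label: [examples]} dataset.

    Returns a support set of N*K labeled examples and a query set whose labels
    the learner must predict. Classes are re-indexed 0..N-1 within the episode,
    so the learner has to compare against the support set rather than memorize
    a fixed label mapping."""
    rng = rng or random.Random()
    classes = rng.sample(sorted(dataset), n_way)
    support, query = [], []
    for episode_label, cls in enumerate(classes):
        examples = rng.sample(dataset[cls], k_shot + query_per_class)
        support += [(x, episode_label) for x in examples[:k_shot]]
        query += [(x, episode_label) for x in examples[k_shot:]]
    rng.shuffle(support)
    rng.shuffle(query)
    return support, query

# Tiny toy dataset: 10 "classes", each with 20 dummy examples.
toy_data = {c: [f"img_{c}_{i}" for i in range(20)] for c in range(10)}
support, query = sample_episode(toy_data, n_way=5, k_shot=1)
print(len(support), len(query))   # 5 5
```

Because the class-to-label assignment is reshuffled every episode, the only way to do well on the query set is to compare its points against the labeled support set.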

Omniglot Results

Mini-ImageNet Results

Meta-Reinforcement Learning

Multi-Armed Bandits Results

Each of the K arms gives rewards according to a Bernoulli distribution whose parameter p∈[0,1] is chosen randomly at the start of each episode of length N. At each timestep, the meta-learner takes as input the reward received at the previous timestep along with a one-hot encoding of the corresponding arm selected. It outputs a discrete probability distribution over the K arms; the arm to select is determined by sampling from this distribution.
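
The sketch below is a minimal, hypothetical version of this interaction loop; the uniform policy at the end is just a placeholder where a trained meta-learner (such as SNAIL or an LSTM) would plug in.

```python
import numpy as np

class BernoulliBanditEpisode:
    """One K-armed Bernoulli bandit episode: each arm's success probability
    is drawn uniformly from [0, 1] at the start of the episode."""
    def __init__(self, k_arms, rng):
        self.probs = rng.uniform(0.0, 1.0, size=k_arms)
        self.rng = rng

    def pull(self, arm):
        return float(self.rng.random() < self.probs[arm])

def run_episode(policy, k_arms=10, horizon=100, seed=0):
    """Roll out one episode. At each timestep the policy sees the previous
    reward and a one-hot encoding of the previously selected arm, and returns
    a probability distribution over arms from which the next arm is sampled."""
    rng = np.random.default_rng(seed)
    env = BernoulliBanditEpisode(k_arms, rng)
    prev_reward, prev_arm = 0.0, np.zeros(k_arms)
    total = 0.0
    for _ in range(horizon):
        dist = policy(prev_reward, prev_arm)      # shape (k_arms,), sums to 1
        arm = rng.choice(k_arms, p=dist)
        reward = env.pull(arm)
        prev_reward, prev_arm = reward, np.eye(k_arms)[arm]
        total += reward
    return total

# Placeholder policy: uniform over arms (a trained meta-learner would go here).
uniform_policy = lambda prev_reward, prev_arm: np.full(len(prev_arm), 1.0 / len(prev_arm))
print(run_episode(uniform_policy, k_arms=10, horizon=100))
```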

Tabular Markov Decision Processes (MDPs)

Each MDP had 10 states and 5 actions (both discrete); the reward for each (state, action) pair followed a normal distribution with unit variance, whose mean was sampled from N(1,1), and the transitions were sampled from a flat Dirichlet distribution. We allowed each meta-learner to interact with an MDP for N episodes of length 10. As input, they received one-hot encodings of the current state and previous action, the previous reward received, and a binary flag indicating termination of the current episode.
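
For concreteness, here is a small NumPy sketch of how such an MDP distribution can be sampled and stepped through; the function names and the random-action placeholder policy are illustrative, not the paper's code.

```python
import numpy as np

def sample_tabular_mdp(n_states=10, n_actions=5, rng=None):
    """Sample one tabular MDP as described above: mean rewards drawn from
    N(1, 1), unit-variance Gaussian reward noise, and transition
    distributions drawn from a flat Dirichlet over next states."""
    rng = rng or np.random.default_rng()
    reward_means = rng.normal(loc=1.0, scale=1.0, size=(n_states, n_actions))
    transitions = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
    return reward_means, transitions

def step(state, action, reward_means, transitions, rng):
    """One environment step: noisy reward plus a next state sampled from the
    transition distribution for this (state, action) pair."""
    reward = rng.normal(reward_means[state, action], 1.0)
    next_state = rng.choice(transitions.shape[0], p=transitions[state, action])
    return next_state, reward

rng = np.random.default_rng(0)
reward_means, transitions = sample_tabular_mdp(rng=rng)
state, total_reward = 0, 0.0
for t in range(10):                      # one episode of length 10
    action = rng.integers(5)             # random-action placeholder policy
    state, reward = step(state, action, reward_means, transitions, rng)
    total_reward += reward
print(total_reward)
```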

Below are learning curves for LSTM and SNAIL.

Continuous Control

Two simulated robots (a planar cheetah and a 3D quadruped ant) have to run in a particular direction or at a specified velocity. In the goal-direction experiments, the reward is the magnitude of the robot’s velocity in either the forward or backward direction; in the goal-velocity experiments, the reward is the negative absolute difference between the robot’s current forward velocity and the goal velocity. The observations are the robot’s joint angles and velocities, and the actions are its joint torques. This yields four task distributions: {ant, cheetah}×{goal velocity, goal direction}. As an oracle, we sampled tasks from each distribution and trained a separate policy for each task; we plot the average performance of the oracle policies for each task distribution as an upper bound on a meta-learner’s performance.
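
The two reward functions can be written down compactly; the sketch below is illustrative (the function names and the ±1 encoding of the goal direction are assumptions).

```python
def goal_direction_reward(forward_velocity, direction):
    """Goal-direction tasks: reward is the magnitude of the velocity in the
    target direction (direction = +1 for forward, -1 for backward)."""
    return direction * forward_velocity

def goal_velocity_reward(forward_velocity, goal_velocity):
    """Goal-velocity tasks: reward is the negative absolute difference
    between the current forward velocity and the goal velocity."""
    return -abs(forward_velocity - goal_velocity)

# Example: a robot moving forward at 1.5 m/s under two hypothetical tasks.
print(goal_direction_reward(1.5, direction=-1))      # -1.5 (it should run backward)
print(goal_velocity_reward(1.5, goal_velocity=2.0))  # -0.5
```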

Ant Movie.mp4
Cheetah Movie.mp4

Visual Navigation

Each episode involves a randomly generated maze and target position. The observations the agent receives are 30×40 first-person images, and the actions it can take are {step forward, turn slightly left, turn slightly right}. We constructed a training dataset of 1000 mazes and two test datasets of the same size (one with different mazes of the same dimensions, and one with larger mazes). The agents were allowed to interact with each maze for 2 episodes, with episode length 250 (1000 in the larger mazes). The starting and goal locations were chosen randomly for each trial but remained fixed within each pair of episodes. The agents received a reward of +1 for reaching the target (which terminated the episode), -0.01 at each timestep to encourage them to reach the goal faster, and -0.001 for hitting a wall. The results below report the time taken to complete each trial.
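
As a small illustration of this reward structure (assumed, not taken from released code; in particular, whether the per-timestep penalty also applies on the terminal step is an assumption here):

```python
def navigation_reward(reached_target, hit_wall):
    """Per-timestep reward for the maze task described above: +1 for reaching
    the target (which ends the episode), -0.01 per timestep, and an extra
    -0.001 penalty for bumping into a wall."""
    reward = -0.01                      # time penalty, encourages faster solutions
    if hit_wall:
        reward -= 0.001
    if reached_target:
        reward += 1.0
    return reward

# Example: an ordinary step, a wall bump, and the step that reaches the goal.
print(navigation_reward(False, False))  # -0.01
print(navigation_reward(False, True))   # -0.011
print(navigation_reward(True, False))   # 0.99
```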

Test Maze Sample Runs:

maze_small_02.mp4
maze_small_03.mp4

Generalization to Larger Mazes:

maze_big_09.mp4
maze_big_08.mp4