EX2: Exploration with Exemplar Models for Deep Reinforcement Learning

Abstract: Efficient exploration in high-dimensional environments remains a key challenge in reinforcement learning (RL). Deep RL methods have demonstrated the ability to learn with highly general policy classes on complex tasks with high-dimensional inputs, such as raw images. However, many of the most effective exploration techniques rely on tabular representations, or on the ability to construct a generative model over states and actions; both are exceptionally difficult when the inputs are complex and high-dimensional. On the other hand, it is comparatively easy to build discriminative models on top of complex states such as images using standard deep neural networks. This paper introduces a novel approach, EX2, which approximates state visitation densities by training an ensemble of discriminators and assigns reward bonuses to rarely visited states. We demonstrate that EX2 achieves performance comparable to state-of-the-art methods on low-dimensional tasks, and that its effectiveness scales to high-dimensional state spaces such as visual domains without hand-designed features or density models.

Coming Soon: GitHub - arXiv

A brief overview of our algorithm:

  1. Collect Experience: The agent collects a batch of experience and appends it to the replay buffer.
  2. Train Exemplar Model: We train discriminators to classify the newly observed states against previously experienced states in the replay buffer.
  3. Compute Reward Bonuses: States that the discriminator classifies easily are ones that look unlike the replay buffer, i.e. novel states; these are assigned a large reward bonus, and the augmented reward is used to update the policy.
  4. Repeat steps 1-3 (a minimal code sketch of this loop is given below).
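
The sketch below is an illustration of one iteration of this loop, not our released implementation: collect_rollouts, the discriminator interface, and the bonus coefficient are hypothetical placeholders. The bonus assumes the exemplar-model relation p(s) is approximately (1 - D(s)) / D(s) for a near-optimal discriminator, so easily classified states receive a large -log p(s) bonus.

    # Illustrative sketch of one EX2 iteration (helper names are hypothetical).
    import numpy as np

    def ex2_iteration(policy, env, replay_buffer, discriminator, bonus_coeff=1e-3):
        # 1. Collect experience with the current policy.
        batch = collect_rollouts(policy, env)                # hypothetical helper
        replay_buffer.extend(batch.states)

        # 2. Train the exemplar model: new states are positives,
        #    states sampled from the replay buffer are negatives.
        negatives = replay_buffer.sample(len(batch.states))
        discriminator.train(positives=batch.states, negatives=negatives)

        # 3. Compute reward bonuses. For a near-optimal discriminator,
        #    p(s) ~ (1 - D(s)) / D(s), so rarely visited states
        #    (D close to 1) receive a large novelty bonus -log p(s).
        d = np.clip(discriminator.predict(batch.states), 1e-6, 1 - 1e-6)
        novelty_bonus = -np.log((1.0 - d) / d)
        augmented_rewards = batch.rewards + bonus_coeff * novelty_bonus

        # 4. Update the policy (e.g. with TRPO) on the augmented reward.
        policy.update(batch, augmented_rewards)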

Learned Policies

This video shows the training progression on the DoomMyWayHome task. The maze map is shown on the right. The spawn location is fixed at the blue dot, and the green dot is the goal.

The following videos show the training progression of our method (TRPO + EX2) against TRPO with naive Gaussian exploration on the 2D maze task. The task is to navigate the blue ball to the green goal location.

[Video grid: policy rollouts at episodes 1, 200, 1000, 1500, and 2000, with one row for TRPO and one for TRPO + EX2.]

Density Modeling with Exemplar Models

Here, we show a neural network exemplar model fitting two toy distributions (a bimodal Gaussian and a tri-modal piecewise uniform distribution).
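
To make the estimator concrete, here is a small, self-contained sketch (not the model used to produce these figures) that fits per-exemplar discriminators to the tri-modal piecewise-uniform distribution. For each query point x*, a tiny logistic discriminator separates noise-smoothed copies of x* from data samples, and the implied (unnormalized, kernel-smoothed) density is (1 - D(x*)) / D(x*). The feature choice, noise level, and sample sizes are illustrative assumptions.

    # Toy exemplar-model density estimation on a 1D tri-modal dataset
    # (illustrative sketch; features and hyperparameters are assumptions).
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)

    # Tri-modal piecewise-uniform target distribution.
    data = np.concatenate([
        rng.uniform(-3.0, -2.0, 400),
        rng.uniform(-0.5,  0.5, 400),
        rng.uniform( 2.0,  3.0, 400),
    ])

    def exemplar_density(x_star, data, noise_std=0.2, n=400):
        """Return the implied (unnormalized) density (1 - D(x*)) / D(x*)."""
        positives = x_star + noise_std * rng.standard_normal(n)   # smoothed exemplar
        negatives = rng.choice(data, size=n, replace=False)       # "replay buffer" samples
        x = np.concatenate([positives, negatives])
        y = np.concatenate([np.ones(n), np.zeros(n)])
        feats = (x - x_star)[:, None] ** 2      # feature that peaks at the exemplar
        clf = LogisticRegression().fit(feats, y)
        d = clf.predict_proba([[0.0]])[0, 1]    # D evaluated at x*
        return (1.0 - d) / d

    xs = np.linspace(-4.0, 4.0, 81)
    density = np.array([exemplar_density(x, data) for x in xs])
    density /= np.trapz(density, xs)            # normalize for plotting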

Here we train a linear exemplar model and a neural network-based exemplar model on the same tri-modal dataset. The linear model is unable to fit the data and clusters density around the mean.

[Figure: density estimates on the tri-modal dataset from the linear exemplar and the shared neural network exemplar.]

We can also visualize the densities estimated on the 2D maze task. A neural network exemplar is shown on the left, and a slightly cleaner density can be obtained using an RBF feature projection with a linear exemplar (right). Within each image, the left sub-image is the estimated density and the right sub-image is the empirical target distribution sampled from the replay buffer.

[Figure: 2D maze density estimates from the shared neural network exemplar and the RBF + linear exemplar.]
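
For reference, the RBF + linear variant can be sketched as follows: 2D maze states are projected onto a fixed set of Gaussian RBF features, a linear (logistic) exemplar discriminator is trained on those features, and the implied density is again (1 - D(x*)) / D(x*). The centers, bandwidth, noise level, and sample sizes below are illustrative assumptions, not the exact settings used for the figures.

    # Sketch of the RBF feature projection + linear exemplar on 2D states
    # (centers, bandwidth, and sizes are illustrative assumptions).
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)

    def rbf_features(states, centers, bandwidth=0.5):
        """Project 2D states onto fixed Gaussian RBF features."""
        d2 = ((states[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2.0 * bandwidth ** 2))

    # Fixed random RBF centers spanning the maze (here the unit square).
    centers = rng.uniform(0.0, 1.0, size=(64, 2))

    def linear_exemplar_density(x_star, buffer_states, noise_std=0.05, n=256):
        """Implied density at one state: p(x*) ~ (1 - D(x*)) / D(x*)."""
        positives = x_star + noise_std * rng.standard_normal((n, 2))
        negatives = buffer_states[rng.integers(len(buffer_states), size=n)]
        feats = rbf_features(np.vstack([positives, negatives]), centers)
        labels = np.concatenate([np.ones(n), np.zeros(n)])
        clf = LogisticRegression(max_iter=500).fit(feats, labels)
        d = clf.predict_proba(rbf_features(x_star[None, :], centers))[0, 1]
        return (1.0 - d) / d

    # Example: density at the maze center given replay-buffer positions.
    buffer_states = rng.uniform(0.0, 1.0, size=(5000, 2))
    print(linear_exemplar_density(np.array([0.5, 0.5]), buffer_states))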