Justin Fu*, J.D. Co-Reyes*, Sergey Levine
Abstract: Efficient exploration in high-dimensional environments remains a key challenge in reinforcement learning (RL). Deep reinforcement learning methods have demonstrated the ability to learn with highly general policy classes for complex tasks with high-dimensional inputs, such as raw images. However, many of the most effective exploration techniques rely on tabular representations, or on the ability to construct a generative model over states and actions; both are exceptionally difficult when the inputs are complex and high-dimensional. On the other hand, it is comparatively easy to build discriminative models on top of complex states such as images using standard deep neural networks. This paper introduces a novel approach, EX2, which approximates state visitation densities by training an ensemble of discriminators and assigns reward bonuses to rarely visited states. We demonstrate that EX2 achieves performance comparable to state-of-the-art methods on low-dimensional tasks, and that its effectiveness scales to high-dimensional state spaces such as visual domains, without hand-designed features or density models.
Github (Density Estimation) - Github (Reinforcement Learning) - Arxiv
A brief overview of our algorithm:
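The core of the bonus computation can be sketched as follows (the function names and the scaling factor `beta` are illustrative, not taken from the released code): a discriminator D_x is trained to tell an exemplar state x apart from states in the replay buffer, and at the optimum the visitation density satisfies p(x) ∝ (1 − D_x(x)) / D_x(x), so rare states, which are easy to distinguish, earn large bonuses.

```python
import numpy as np

def implied_density(d):
    """Read a density out of a trained discriminator: if D_x separates
    exemplar x from replay-buffer states, then at the optimum
    p(x) is proportional to (1 - D_x(x)) / D_x(x)."""
    d = np.clip(d, 1e-6, 1 - 1e-6)   # guard against saturated outputs
    return (1.0 - d) / d

def exploration_bonus(d, beta=1.0):
    """Reward bonus -beta * log p(x): a state the discriminator
    separates easily (D_x(x) near 1) must be rarely visited."""
    return -beta * np.log(implied_density(d))

# a rarely visited state (D = 0.9) earns a larger bonus than a common
# one the discriminator cannot separate from the buffer (D = 0.5)
print(exploration_bonus(0.9) > exploration_bonus(0.5))
```

At D_x(x) = 0.5 the implied density is 1 and the bonus vanishes; as D_x(x) approaches 1 the bonus grows without bound, which is why the discriminator outputs are clipped.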
This is a video showing training progression on the DoomMyWayHome task. The maze map is shown on the right. The spawn location is fixed at the blue dot, and the green dot marks the goal.
The following videos demonstrate the training progression of our method (TRPO + EX2) against TRPO with naive Gaussian exploration on the 2D maze task. The task is to navigate the blue ball to the green goal location.
The videos are arranged in a grid: one row for TRPO and one for TRPO + EX2, with columns showing episodes 1, 200, 1000, 1500, and 2000.
Here, we show a neural network exemplar model fitting two toy distributions (a bimodal Gaussian and a trimodal piecewise uniform).
Here we train a linear exemplar model and a neural-network exemplar model on the same trimodal dataset. The linear model is unable to fit the data and instead clusters density around the mean.
Linear Exemplar
Shared Neural Network
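The failure of the linear model can be reproduced in a toy sketch (the dataset, features, and training loop below are illustrative, not the paper's architecture): a logistic discriminator that is linear in the raw 1-D input is monotone in x, so it cannot score an exemplar in the middle mode above both outer modes, while a nonlinear feature representation (here, fixed RBF features) can.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(phi_pos, phi_neg, steps=2000, lr=0.1):
    """Tiny logistic regression: exemplar (label 1) vs. background (label 0),
    with equal weight on the positive and the negative class."""
    w = np.zeros(phi_neg.shape[1])
    for _ in range(steps):
        p_pos = sigmoid(phi_pos @ w)                    # shape (1,)
        p_neg = sigmoid(phi_neg @ w)                    # shape (n,)
        grad = (p_pos - 1.0) * phi_pos[0] + (p_neg @ phi_neg) / len(phi_neg)
        w -= lr * grad
    return w

def linear_features(x):
    return np.stack([x, np.ones_like(x)], axis=-1)      # [x, bias]

def rbf_features(x, centers=np.linspace(-4.0, 4.0, 17), bw=0.5):
    return np.exp(-(x[..., None] - centers) ** 2 / (2.0 * bw ** 2))

rng = np.random.default_rng(0)
# trimodal background; the exemplar sits in the middle mode
background = np.concatenate([rng.uniform(-3.0, -2.0, 200),
                             rng.uniform(-0.5, 0.5, 200),
                             rng.uniform(2.0, 3.0, 200)])
exemplar = np.array([0.0])

w_lin = fit_logistic(linear_features(exemplar), linear_features(background))
w_rbf = fit_logistic(rbf_features(exemplar), rbf_features(background))
d_lin = sigmoid(linear_features(exemplar) @ w_lin)[0]
d_rbf = sigmoid(rbf_features(exemplar) @ w_rbf)[0]

# the linear model cannot tell the middle-mode exemplar from the data
# (D stays near 0.5, i.e. implied density near the overall average),
# while the RBF model separates it with noticeably higher confidence
print(d_rbf > d_lin)
```

Since the implied density is (1 − D)/D, a D stuck near 0.5 everywhere corresponds to the flat, mean-centered density the linear exemplar produces in the figure above.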
We can also visualize the densities estimated on the 2D maze task. A neural network exemplar is shown on the left, and a slightly cleaner density can be obtained using an RBF feature projection with a linear exemplar. Within each image, the left sub-image is the estimated density and the right sub-image is the empirical target distribution sampled from the replay buffer.
Shared Neural Network
RBF+Linear
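The RBF feature projection used for the linear exemplar above can be sketched as follows (the grid of centers and the bandwidth are illustrative choices, not the exact settings used in our experiments):

```python
import numpy as np

def rbf_project(states, centers, bandwidth=0.25):
    """Map 2-D maze states onto fixed Gaussian RBF features; a linear
    exemplar model on these features can represent multimodal densities."""
    sq_dists = np.sum((states[:, None, :] - centers[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2.0 * bandwidth ** 2))

# a grid of RBF centers covering the (unit-normalized) maze
xs, ys = np.meshgrid(np.linspace(0.0, 1.0, 8), np.linspace(0.0, 1.0, 8))
centers = np.stack([xs.ravel(), ys.ravel()], axis=-1)

features = rbf_project(np.array([[0.5, 0.5]]), centers)
print(features.shape)  # one feature per center: (1, 64)
```

Because the features are fixed, only the linear exemplar on top of them is trained, which is what yields the cleaner density estimates shown on the right.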