# Randomized Prior Functions for Deep Reinforcement Learning
## Overview
This site collates accompanying material for the NeurIPS 2018 spotlight paper.
- Authors: Ian Osband, John Aslanides, Albin Cassirer
- Paper: https://arxiv.org/abs/1806.03335
- Poster: http://iosband.github.io/docs/rpf_nips_poster.pdf
- Short link: http://bit.ly/rpf_nips
## Abstract
Dealing with uncertainty is essential for efficient reinforcement learning. There is a growing literature on uncertainty estimation for deep learning from fixed datasets, but many of the most popular approaches are poorly suited to sequential decision problems. Other methods, such as bootstrap sampling, have no mechanism for uncertainty that does not come from the observed data. We highlight why this can be a crucial shortcoming and propose a simple remedy through the addition of a randomized untrainable "prior" network to each ensemble member. We prove that this approach is efficient with linear representations, provide simple illustrations of its efficacy with nonlinear representations, and show that this approach scales to large-scale problems far better than previous attempts.
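The core mechanism can be sketched in a few lines. This is a minimal NumPy illustration, not the paper's implementation: the network sizes and the prior scale `beta` are illustrative assumptions. Each ensemble member combines a trainable network with a randomly initialized network that is never updated, so the member's predictions stay diverse even far from the data.

```python
import numpy as np

def init_mlp(rng, sizes):
    """Random weights for a small MLP; returns a list of (W, b) layers."""
    return [(rng.standard_normal((m, n)) / np.sqrt(m), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def mlp(params, x):
    """Forward pass with tanh hidden layers and a linear output."""
    for W, b in params[:-1]:
        x = np.tanh(x @ W + b)
    W, b = params[-1]
    return x @ W + b

rng = np.random.default_rng(0)
trainable = init_mlp(rng, [1, 16, 16, 1])  # theta: updated during training
prior = init_mlp(rng, [1, 16, 16, 1])      # p: random and never trained
beta = 3.0                                 # prior scale (illustrative choice)

def predict(x):
    # Gradients (in a real setup) would flow only through `trainable`;
    # the random prior network stays fixed for the member's lifetime.
    return mlp(trainable, x) + beta * mlp(prior, x)
```

Training fits only the trainable part, so in regions with data the sum matches the targets, while away from the data the fixed prior keeps its influence.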
## Demo code / blog post
Releasing the exact code for all of our experiments is not entirely practical right now: too many parts are tightly tied up with Google's internal infrastructure. However, to help build intuition for our method and to give simple examples of how and why this approach works, we're releasing a Colab notebook.
This notebook aims to:
- Build intuition for how and why prior networks affect generalization
- Demonstrate the importance of prior + bootstrap in 1D regression
- Provide simple, readable code that makes the algorithms intuitive
- Let you play the "Deep Sea" exploration challenge interactively
- Implement both DQN + epsilon-greedy and BootDQN + prior
- Reproduce the "chain experiment" results in the Colab
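As a rough illustration of the 1D-regression demo described above (a sketch under assumed details, not the notebook's code): each ensemble member gets its own bootstrap resample of the data and its own fixed random prior, and the spread of member predictions provides uncertainty that does not vanish away from the data. The feature count, ensemble size, and `beta` below are illustrative assumptions; random tanh features with a least-squares head stand in for a trained network.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(20, 1))                 # training inputs
y = np.sin(3 * X[:, 0]) + 0.1 * rng.standard_normal(20)

def features(x, W, b):
    """Fixed random tanh features, one set per ensemble member."""
    return np.tanh(x @ W + b)

n_members, n_feat, beta = 10, 50, 1.0
members = []
for _ in range(n_members):
    W = rng.standard_normal((1, n_feat))
    b = rng.standard_normal(n_feat)
    w_prior = rng.standard_normal(n_feat)            # fixed random prior head
    idx = rng.integers(0, len(X), len(X))            # bootstrap resample
    Phi = features(X[idx], W, b)
    # Train only the residual: fit y minus the (untrainable) prior's output.
    target = y[idx] - beta * Phi @ w_prior
    w_fit, *_ = np.linalg.lstsq(Phi, target, rcond=None)
    members.append((W, b, w_prior, w_fit))

def ensemble_predict(x):
    """Mean and spread of member predictions (trainable + scaled prior)."""
    preds = [features(x, W, b) @ (w_fit + beta * w_prior)
             for W, b, w_prior, w_fit in members]
    return np.mean(preds, axis=0), np.std(preds, axis=0)
```

The standard deviation across members acts as the uncertainty signal: near the training data the members agree, while outside it the fixed priors keep them apart.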
## Videos
### Chain
Visualizing how and why the algorithm solves the difficult "Deep Sea" exploration problem (a chain MDP).
### Cartpole
Contrasting performance on the cartpole swing-up task.
### Montezuma
Watch a trained agent score 3000 on Montezuma's Revenge.