Autonomous Reinforcement Learning via Subgoal Curricula

Archit Sharma, Abhishek Gupta, Sergey Levine, Karol Hausman, Chelsea Finn

Neural Information Processing Systems (NeurIPS), 2021

arXiv

Abstract

Reinforcement learning (RL) promises to enable autonomous acquisition of complex behaviors for diverse agents. However, the success of current reinforcement learning algorithms is predicated on an often under-emphasized requirement -- each trial needs to start from a fixed initial state distribution. Unfortunately, resetting the environment to its initial state after each trial requires a substantial amount of human supervision and extensive instrumentation of the environment, which defeats the purpose of autonomous reinforcement learning. In this work, we propose Value-accelerated Persistent Reinforcement Learning (VaPRL), which generates a curriculum of initial states such that the agent can bootstrap on the success of easier tasks to efficiently learn harder tasks. The agent also learns to reach the initial states proposed by the curriculum, minimizing the reliance on human interventions during learning. We observe that VaPRL reduces the interventions required by three orders of magnitude compared to episodic RL, while outperforming prior state-of-the-art methods for reset-free RL in terms of both sample efficiency and asymptotic performance on a variety of simulated robotics problems.
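To make the curriculum idea concrete, below is a minimal, purely illustrative sketch of choosing a practice start state by trading off how promising a state looks for solving the task against how close it is to the true initial state. The names `value_fn`, `candidate_states`, `eps`, and the Euclidean distance are assumptions made for illustration only; the paper defines the exact objective that VaPRL optimizes.

```python
import numpy as np

def propose_subgoal(candidate_states, value_fn, task_goal, initial_state, eps=1.0):
    """Pick a practice start state: prefer states from which the task goal
    looks reachable (high value) while penalizing distance to the true
    initial state, so the curriculum gradually collapses onto the original
    episodic task as the agent improves.

    Illustrative sketch only -- value_fn, eps, and the distance metric are
    placeholders, not the paper's exact formulation.
    """
    scores = []
    for s in candidate_states:
        task_value = value_fn(s, task_goal)                    # estimated success from s
        proximity_penalty = eps * np.linalg.norm(np.asarray(s) - np.asarray(initial_state))
        scores.append(task_value - proximity_penalty)
    return candidate_states[int(np.argmax(scores))]
```

In practice, the candidate states would come from previously visited states (e.g., the replay buffer), and the agent's goal-conditioned policy is used both to reach the proposed start state and to practice the task from it.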

Learned Behaviors

We present some of the final behaviors learned in the persistent RL setting through VaPRL. For all tasks, the agent received an environment reset only after every few hundred thousand steps (corresponding to several hours of training), reducing the human effort by ~1000x compared to the conventional episodic RL setting. VaPRL generates a curriculum that lets the agent repeatedly practice the task while efficiently improving evaluation performance; a sketch of the resulting training loop is given below. More details on the problem setting, algorithm, and task setup can be found in the paper.
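As a rough sketch of what such a persistent (reset-minimized) training loop might look like, the snippet below assumes a goal-conditioned agent with hypothetical `agent.act` / `agent.update` methods, a `propose_subgoal` routine like the one sketched above, and a placeholder success check; the reset frequency and interfaces are illustrative, not the paper's exact setup.

```python
import numpy as np

def reached(state, goal, tol=0.05):
    """Simple proximity check (placeholder for the task's success metric)."""
    return np.linalg.norm(np.asarray(state) - np.asarray(goal)) < tol

def persistent_training_loop(env, agent, propose_subgoal, task_goal,
                             total_steps=1_000_000, reset_every=200_000):
    """Reset-minimized training sketch: the environment is reset only every
    reset_every steps; in between, the agent alternates between reaching the
    curriculum-proposed start state and practicing the task from there,
    without human intervention.
    """
    obs = env.reset()
    for step in range(total_steps):
        if step > 0 and step % reset_every == 0:
            obs = env.reset()                        # infrequent environment reset
        subgoal = propose_subgoal(obs, task_goal)    # where should the agent practice from?
        goal = task_goal if reached(obs, subgoal) else subgoal
        action = agent.act(obs, goal)                # goal-conditioned policy
        next_obs, reward, done, info = env.step(action)
        agent.update(obs, action, reward, next_obs, goal)
        obs = next_obs
```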

Tabletop Rearrangement

The agent is tasked with repositioning a mug from its initial state to one of four goal locations (specified at the beginning of the trial). The learned behaviors are shown below (the current goal is indicated by a red sphere):

Door Closing

The agent is tasked with closing an open door. The learned behaviors are shown below (current goal is indicated by a green sphere):

Dexterous Hand Object Pickup

A high-dimensional dexterous hand attached to a Sawyer robot arm is tasked with picking up a three-pronged valve from arbitrary positions on the table. The learned behaviors are shown below: