DisCo RL: Distribution-Conditioned Reinforcement Learning for
General Purpose Policies
Soroush Nasiriany*, Vitchyr H. Pong*, Ashvin Nair*, Alexander Khazatsky, Glen Berseth, Sergey Levine (*Equal Contribution)
University of California, Berkeley
2021 International Conference on Robotics and Automation (ICRA 2021)
Can we use reinforcement learning to learn general-purpose policies that can perform a wide range of different tasks, resulting in flexible and reusable skills? Contextual policies provide this capability in principle, but the representation of the context determines the degree of generalization and expressivity. Categorical contexts preclude generalization to entirely new tasks. Goal-conditioned policies may enable some generalization, but cannot capture all tasks that might be desired. In this paper, we propose goal distributions as a general and broadly applicable task representation suitable for contextual policies. Goal distributions are general in the sense that they can represent any state-based reward function when equipped with an appropriate distribution class, while the particular choice of distribution class allows us to trade off expressivity and learnability. We develop an off-policy algorithm called distribution-conditioned reinforcement learning (DisCo RL) to efficiently learn these policies. We evaluate DisCo RL on a variety of robot manipulation tasks and find that it significantly outperforms prior methods on tasks that require generalization to new goal distributions.
Can we create a framework that (1) infers reward functions for a broad set of tasks, and (2) trains a general-purpose policy to solve all of these tasks?
These questions have been studied in the context of goal-conditioned reinforcement learning. However, many tasks cannot be expressed as reaching a single goal state: some tasks have many possible successful states, and more generally the reward for a task may be an arbitrary function of the state.
Our Method: Distribution-Conditioned Reinforcement Learning (DisCo RL)
Rather than conditioning a policy on a single goal state, we propose to condition a policy on an entire goal distribution. We show that any reward function can be represented with the parameters of a goal distribution and that this goal distribution can be learned from examples of successful states.
Specify example sets. Take as input a set of example states that describe the desired task.
Infer goal distribution. Infer the parameters of the goal distribution from the example states using maximum likelihood estimation.
Run DisCo RL. Train a policy on the parameters of the inferred goal distributions. The policy is rewarded for going to states that have high likelihood under the distribution.
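The inference and reward steps above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes a Gaussian distribution class, so the distribution parameters are a mean and covariance fit by maximum likelihood, and the reward is the log-likelihood of the visited state under that distribution.

```python
import numpy as np

def infer_goal_distribution(example_states):
    """MLE fit of a Gaussian goal distribution to example states.
    example_states: array of shape (num_examples, state_dim).
    (The Gaussian class is an illustrative assumption; other
    distribution classes trade off expressivity and learnability.)"""
    mu = example_states.mean(axis=0)
    sigma = np.cov(example_states, rowvar=False)
    return mu, sigma

def disco_reward(state, mu, sigma):
    """Reward the policy for reaching states with high likelihood
    under the goal distribution: here, the Gaussian log-density."""
    d = state - mu
    _, logdet = np.linalg.slogdet(sigma)
    k = len(mu)
    return -0.5 * (d @ np.linalg.solve(sigma, d) + logdet + k * np.log(2 * np.pi))
```

The policy is then conditioned on the inferred parameters (here, `mu` and `sigma`) and trained off-policy with `disco_reward` as its reward signal.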
Conditional DisCo RL
To remove the need to provide an example set for every new task, we also introduce a variant called Conditional DisCo RL. Conditional DisCo RL automatically generates new goal distributions using a conditional model.
The conditional model is conditioned on task-specific context, such as the final location of a specific block. We train this conditional model by collecting pairs of contexts and example states.
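A minimal sketch of such a conditional model, under the simplifying (and purely illustrative) assumption that the goal-distribution mean is a linear function of the context: the model is fit by least squares on collected (context, example-state) pairs, and at test time it maps a new context directly to distribution parameters.

```python
import numpy as np

def fit_conditional_model(contexts, example_means):
    """Least-squares fit of a linear map from task context to the mean
    of the goal distribution. contexts: (n, context_dim);
    example_means: (n, state_dim), e.g. the mean of each task's example set.
    (A linear map is an illustrative stand-in for a learned model.)"""
    X = np.hstack([contexts, np.ones((len(contexts), 1))])  # append bias term
    W, *_ = np.linalg.lstsq(X, example_means, rcond=None)
    return W

def predict_goal_mean(context, W):
    """Generate the goal-distribution mean for a new context."""
    return np.append(context, 1.0) @ W
```

With such a model, a new goal distribution can be produced for each sub-task directly from its context, without collecting a fresh example set.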
We compare DisCo RL against prior methods that also use successful states to compute rewards. We consider a variety of robotic manipulation domains:
The agent must use the blue cursor to move objects to different locations.
The agent must control a Sawyer robot to move cubes into and out of a box.
The agent must attach shelves to a pole using a cursor.
We visualize trajectories from our learned policy alongside those of prior methods. We find that DisCo RL is able to represent and solve tasks from example states, while prior methods cannot.
Each column displays a trajectory and each row displays a different method.
Left: The agent must place the red block into a sliding tray, irrespective of the tray position and the positions of the other blocks.
Right: The agent must assemble a bookshelf by reasoning about each individual shelf as a sub-task. Conditional DisCo RL generates the goal distribution for each sub-task by conditioning on the final configuration of the task.