Policy Architectures for
Compositional Generalization in Control
Many tasks in control, robotics, and planning can be specified using desired goal configurations for various entities in the environment. Learning goal-conditioned policies is a natural paradigm to solve such tasks. However, current approaches struggle to learn and generalize as task complexity increases, such as variations in the number of environment entities or compositions of goals. In this work, we introduce a framework for modeling entity-based compositional structure in tasks, and create suitable policy designs that can leverage this structure. Our policies, which utilize architectures like Deep Sets and Self Attention, are flexible and can be trained end-to-end without requiring any action primitives. When trained using standard reinforcement and imitation learning methods on a suite of simulated robot manipulation tasks, we find that these architectures achieve significantly higher success rates with less data. We also find that these architectures enable broader and compositional generalization, producing policies that extrapolate to different numbers of entities than seen in training and stitch together (i.e., compose) learned skills in novel ways.
Consider the task of arranging pieces on a chess board using a robot arm. A naive specification would provide goal locations for all 32 pieces simultaneously. However, we can immediately recognize that the task is a composition of 32 sub-goals, each involving the rearrangement of an individual piece. This understanding of compositional structure allows us to focus on one object at a time, dramatically reducing the effective size of the state and goal spaces. Moreover, such a compositional understanding would help an agent easily generalize to other rearrangement tasks involving fewer or more pieces.
To formalize these intuitions and develop practical algorithms, we introduce and study the Entity Factored Markov Decision Process (EFMDP) as a framework for modeling tasks that can be decomposed in terms of entities and their corresponding subgoals. Many real-world tasks can be modeled as EFMDPs, including most robotic manipulation tasks involving multiple objects. A guiding principle of our work is that the optimal policy and value function in an EFMDP are always invariant to the ordering of entity-subgoal pairs. We use this principle to design permutation invariant policy and critic architectures for reinforcement and imitation learning. These invariant architectures learn more efficiently and enable zero-shot generalization to more complex tasks involving more entities or different subgoals.
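To make the factorization concrete, here is a minimal sketch of what an EFMDP observation and the ordering-invariance property could look like in code. All names (`make_observation`, `check_permutation_invariance`, the dictionary fields) are illustrative assumptions, not the paper's actual interface.

```python
import numpy as np

# Hypothetical factored observation for an EFMDP with N entities; the field
# names ("robot", "pairs") are illustrative, not the paper's exact interface.
def make_observation(robot_state, entity_states, subgoals):
    """Bundle the shared robot state with per-entity states and subgoals."""
    assert len(entity_states) == len(subgoals)
    return {
        "robot": robot_state,                         # e.g. joint angles, gripper pose
        "pairs": list(zip(entity_states, subgoals)),  # one (entity, subgoal) pair per object
    }

# The structural property we exploit: the policy's output should satisfy
#   pi(robot, [(e_1, g_1), ..., (e_N, g_N)]) == pi(robot, any permutation of those pairs)
def check_permutation_invariance(policy, obs, seed=0):
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(obs["pairs"]))
    shuffled = {"robot": obs["robot"], "pairs": [obs["pairs"][i] for i in perm]}
    return np.allclose(policy(obs), policy(shuffled))
```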
In our framework, agents solve complex tasks by interacting with entities that have corresponding subgoals. In this Push-and-Stack example, the agent must move the green cube to the green sphere, then stack the yellow cube on top of the green cube.
We develop two policy and critic architecture types that are invariant to the order of the entity-subgoal pairs (minimal sketches of both follow the list below):
The Deep Sets architecture treats each entity-subgoal pair as an element in a set, which has no inherent ordering.
The Self Attention architecture treats each entity-subgoal pair as an element in a sequence, and uses Transformer style attention to model relations between these elements.
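Below is a minimal PyTorch sketch of both architecture types. Layer sizes, pooling choices, and class names are assumptions for illustration rather than the paper's exact hyperparameters; the key point is that sum pooling (Deep Sets) or attention without positional encodings followed by mean pooling (Self Attention) makes the output independent of how the entity-subgoal pairs are ordered.

```python
import torch
import torch.nn as nn

class DeepSetPolicy(nn.Module):
    """Deep Sets sketch: a shared encoder phi per (entity, subgoal) pair,
    sum pooling (order invariant), then a decoder rho."""
    def __init__(self, robot_dim, pair_dim, action_dim, hidden=256):
        super().__init__()
        self.phi = nn.Sequential(
            nn.Linear(robot_dim + pair_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.rho = nn.Sequential(
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, robot_state, pairs):
        # robot_state: (B, robot_dim), pairs: (B, N, pair_dim)
        B, N, _ = pairs.shape
        robot = robot_state.unsqueeze(1).expand(B, N, -1)
        per_pair = self.phi(torch.cat([robot, pairs], dim=-1))  # (B, N, hidden)
        pooled = per_pair.sum(dim=1)                             # invariant to pair order
        return self.rho(pooled)                                  # (B, action_dim)


class SelfAttentionPolicy(nn.Module):
    """Self Attention sketch: Transformer encoder layers with no positional
    encoding (permutation equivariant), then mean pooling (invariant)."""
    def __init__(self, robot_dim, pair_dim, action_dim, d_model=128, nhead=4, layers=2):
        super().__init__()
        self.embed = nn.Linear(robot_dim + pair_dim, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead,
                                               batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=layers)
        self.head = nn.Linear(d_model, action_dim)

    def forward(self, robot_state, pairs):
        # robot_state: (B, robot_dim), pairs: (B, N, pair_dim)
        B, N, _ = pairs.shape
        robot = robot_state.unsqueeze(1).expand(B, N, -1)
        tokens = self.embed(torch.cat([robot, pairs], dim=-1))  # (B, N, d_model)
        attended = self.encoder(tokens)                          # (B, N, d_model)
        return self.head(attended.mean(dim=1))                   # (B, action_dim)
```

Because neither sketch uses positional information or fixes the number of entities, changing N at test time changes only the input shape, not the parameters, which is what allows zero-shot evaluation on tasks with more or fewer entities.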
We show the results of our Deep Set and Self Attention policies under two different generalization settings. In extrapolation, a policy must handle test tasks with more or fewer entities than observed in training. In stitching, the test tasks require the policy to combine skills learned in training in novel ways. In both cases, the policies are trained only on the settings labeled "Training task" and evaluated on the "Test task" zero-shot.
In this family of tasks, the robot must rearrange N cubes into the positions indicated by the spherical targets.
Deep Set
Self Attention
In this family of tasks, the robot must flip each switch to its specified goal setting (left or right, depending on the goal).
Deep Set
Self Attention
In this setting, 50% of training episodes involve pushing cubes to targets and 50% involve stacking one cube on top of another. The test task combines the two training tasks: the robot must push the bottom cube into position and then stack the other cube on top.
Deep Set
Self Attention