Policy Architectures for
Compositional Generalization in Control
Many tasks in control, robotics, and planning can be specified using desired goal configurations for various entities in the environment. Learning goal-conditioned policies is a natural paradigm to solve such tasks. However, current approaches struggle to learn and generalize as task complexity increases, such as variations in the number of environment entities or compositions of goals. In this work, we introduce a framework for modeling entity-based compositional structure in tasks, and create suitable policy designs that can leverage this structure. Our policies, which utilize architectures like Deep Sets and Self Attention, are flexible and can be trained end-to-end without requiring any action primitives. When trained using standard reinforcement and imitation learning methods on a suite of simulated robot manipulation tasks, we find that these architectures achieve significantly higher success rates with less data. We also find that these architectures enable broader and compositional generalization, producing policies that extrapolate to different numbers of entities than seen in training and stitch together (i.e., compose) learned skills in novel ways.
Consider the task of arranging pieces on a chess board using a robot arm. A naive specification would provide goal locations for all 32 pieces simultaneously. However, we can immediately recognize that the task is a composition of 32 sub-goals, each involving the rearrangement of an individual piece. This understanding of compositional structure allows us to focus on one object at a time, dramatically reducing the effective size of the state and goal spaces. Moreover, such a compositional understanding would help an agent easily generalize to other rearrangement tasks involving fewer or more pieces.
To formalize these intuitions and develop practical algorithms, we introduce and study the Entity Factored Markov Decision Process (EFMDP) as a framework for modeling tasks that can be decomposed in terms of entities and their corresponding subgoals. Many real-world tasks can be modeled as EFMDPs, including most robotic manipulation tasks involving multiple objects. A guiding principle of our work is that the optimal policy and value function in an EFMDP are always invariant to the ordering of entity-subgoal pairs. We use this principle to design permutation invariant policy and critic architectures for reinforcement and imitation learning. These invariant architectures learn more efficiently and enable zero-shot generalization to more complex tasks involving more entities or different subgoals.
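To make the factorization concrete, here is a minimal sketch of what an EFMDP observation and the ordering-invariance property could look like in code. All names (`make_observation`, `check_permutation_invariance`, the dictionary fields) are illustrative assumptions, not the paper's actual interface.

```python
import numpy as np

# Hypothetical factored observation for an EFMDP with N entities; the field
# names ("robot", "pairs") are illustrative, not the paper's exact interface.
def make_observation(robot_state, entity_states, subgoals):
    """Bundle the shared robot state with per-entity states and subgoals."""
    assert len(entity_states) == len(subgoals)
    return {
        "robot": robot_state,                         # e.g. joint angles, gripper pose
        "pairs": list(zip(entity_states, subgoals)),  # one (entity, subgoal) pair per object
    }

# The structural property we exploit: the policy's output should satisfy
#   pi(robot, [(e_1, g_1), ..., (e_N, g_N)]) == pi(robot, any permutation of those pairs)
def check_permutation_invariance(policy, obs, seed=0):
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(obs["pairs"]))
    shuffled = {"robot": obs["robot"], "pairs": [obs["pairs"][i] for i in perm]}
    return np.allclose(policy(obs), policy(shuffled))
```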
In our framework, agents solve complex tasks by interacting with entities that have corresponding subgoals. In this Push-and-Stack example, the agent must move the green cube to the green sphere, then stack the yellow cube on top of the green cube.
We develop two policy and critic architecture types that are invariant to the order of the entity-subgoal pairs (minimal sketches of both follow the list below):
The Deep Sets architecture treats each entity-subgoal pair as an element in a set, which has no inherent ordering.
The Self Attention architecture treats each entity-subgoal pair as an element in a sequence, and uses Transformer style attention to model relations between these elements.
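Below is a minimal PyTorch sketch of both architecture types. Layer sizes, pooling choices, and class names are assumptions for illustration rather than the paper's exact hyperparameters; the key point is that sum pooling (Deep Sets) or attention without positional encodings followed by mean pooling (Self Attention) makes the output independent of how the entity-subgoal pairs are ordered.

```python
import torch
import torch.nn as nn

class DeepSetPolicy(nn.Module):
    """Deep Sets sketch: a shared encoder phi per (entity, subgoal) pair,
    sum pooling (order invariant), then a decoder rho."""
    def __init__(self, robot_dim, pair_dim, action_dim, hidden=256):
        super().__init__()
        self.phi = nn.Sequential(
            nn.Linear(robot_dim + pair_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.rho = nn.Sequential(
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, robot_state, pairs):
        # robot_state: (B, robot_dim), pairs: (B, N, pair_dim)
        B, N, _ = pairs.shape
        robot = robot_state.unsqueeze(1).expand(B, N, -1)
        per_pair = self.phi(torch.cat([robot, pairs], dim=-1))  # (B, N, hidden)
        pooled = per_pair.sum(dim=1)                             # invariant to pair order
        return self.rho(pooled)                                  # (B, action_dim)


class SelfAttentionPolicy(nn.Module):
    """Self Attention sketch: Transformer encoder layers with no positional
    encoding (permutation equivariant), then mean pooling (invariant)."""
    def __init__(self, robot_dim, pair_dim, action_dim, d_model=128, nhead=4, layers=2):
        super().__init__()
        self.embed = nn.Linear(robot_dim + pair_dim, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead,
                                               batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=layers)
        self.head = nn.Linear(d_model, action_dim)

    def forward(self, robot_state, pairs):
        # robot_state: (B, robot_dim), pairs: (B, N, pair_dim)
        B, N, _ = pairs.shape
        robot = robot_state.unsqueeze(1).expand(B, N, -1)
        tokens = self.embed(torch.cat([robot, pairs], dim=-1))  # (B, N, d_model)
        attended = self.encoder(tokens)                          # (B, N, d_model)
        return self.head(attended.mean(dim=1))                   # (B, action_dim)
```

Because neither sketch uses positional information or fixes the number of entities, changing N at test time changes only the input shape, not the parameters, which is what allows zero-shot evaluation on tasks with more or fewer entities.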
We show the results of our Deep Set and Self Attention policies under two different generalization settings. In extrapolation, a policy must handle test tasks with more or fewer entities than observed in training. In stitching, the test tasks require the policy to combine skills learned in training in novel ways. In both cases, the policies are trained only on the settings labeled "Training task" and evaluated on the "Test task" zero-shot.
In this family of tasks, the robot must rearrange N cubes into the positions indicated by the spherical targets.
Deep Set
Self Attention
In this family of tasks, the robot must flip each switch to its specified goal setting (left or right, depending on the goal).
Deep Set
Self Attention
In this setting, 50% of training episodes involve pushing cubes to targets and 50% involve stacking one cube on top of another. The test task combines the two training tasks: the robot must push the bottom cube into position and then stack the other cube on top.
Deep Set
Self Attention