Bisimulation Makes Analogies in Goal-Conditioned Reinforcement Learning

Philippe Hansen-Estruch, Amy Zhang, Ashvin Nair, Patrick Yin, Sergey Levine

ICML 2022. Paper | Code

Building generalizable goal-conditioned agents from rich observations is key to using reinforcement learning (RL) to solve real-world problems. Traditionally in goal-conditioned RL, an agent is provided with the exact goal it intends to reach. However, it is often not realistic to know the configuration of the goal before performing a task. A more scalable framework would allow us to provide the agent with an example of an analogous task, and have the agent infer what the goal should be for its current state. We propose a new form of state abstraction called goal-conditioned bisimulation that captures functional equivariance, allowing for the reuse of skills to achieve new goals. We learn this representation using a metric form of this abstraction, and show its ability to generalize to new goals in simulated manipulation tasks. Further, we prove that this learned representation is sufficient not only for goal-conditioned tasks, but also for any downstream task described by a state-only reward function.

Motivation


Analogous tasks of dicing carrots and radishes. Although the target objects are different, the required skill and the functional difference between the initial-state and goal images are similar. Given an agent that has learned to cut carrots, rather than needing to present it with an image of chopped-up radishes, we can instead say something along the lines of, “Do the same thing to the radish that you did to the carrot.”


An example of abstractions with a compositional form. Our goal is to define a policy-dependent distance-metric form of our goal-conditioned bisimulation (GCB) and use it to define a paired-state embedding space. We can then construct an objective for learning a state representation that can compose this task abstraction with states to depict new goals, effectively “filling in the blank.”

GCB: Constructing Embedding Spaces



GCB learns two representations: a state-goal representation, which generalizes across state-goal pairs, and a single-state representation, which is shaped in latent space by the state-goal encoder but generalizes across single states. The figure shows a flow diagram of the representation learning component of GCB. The dashed line represents stopped gradients, and the state encoder is a Siamese network with shared weights.
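For concreteness, a minimal sketch of the two encoders is given below. The layer sizes, module names, and use of flattened observation vectors are assumptions for readability, not the exact architecture used in the paper.

    import torch
    import torch.nn as nn

    class StateGoalEncoder(nn.Module):
        """phi(s, g): embeds a (state, goal) pair into the task-abstraction space."""
        def __init__(self, obs_dim, embed_dim):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(2 * obs_dim, 256), nn.ReLU(),
                nn.Linear(256, embed_dim),
            )

        def forward(self, state, goal):
            return self.net(torch.cat([state, goal], dim=-1))

    class StateEncoder(nn.Module):
        """psi(s): single-state encoder; the same weights embed both states and
        goals (the Siamese structure described above)."""
        def __init__(self, obs_dim, embed_dim):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(obs_dim, 256), nn.ReLU(),
                nn.Linear(256, embed_dim),
            )

        def forward(self, state):
            return self.net(state)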

Objective

State-Goal Objective

The paired encoder's loss uses a bisimulation objective that compares different state-goal transitions. To minimize this loss, the encoder must collapse together state-goal pairs that correspond to similar tasks, and therefore similar dynamics.
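One common way to instantiate such an objective, in the style of deep bisimulation for control, is to regress the embedding distance between two sampled state-goal transitions onto a reward-difference term plus a discounted distance between predicted next-latent distributions. The sketch below is illustrative: the function and argument names are assumptions, and the exact form used in the paper may differ.

    import torch
    import torch.nn.functional as F

    def state_goal_bisim_loss(z_i, z_j, reward_i, reward_j,
                              next_mu_i, next_sigma_i, next_mu_j, next_sigma_j,
                              discount=0.99):
        # z_i, z_j: phi(s, g) embeddings of two sampled state-goal transitions.
        # next_mu_*, next_sigma_*: Gaussian next-latent predictions from a learned
        # dynamics model, treated as fixed targets here.
        z_dist = torch.sum(torch.abs(z_i - z_j), dim=-1)   # L1 distance in embedding space
        r_dist = torch.abs(reward_i - reward_j)             # reward-difference term
        w2_dist = torch.sqrt(                               # 2-Wasserstein between diagonal Gaussians
            torch.sum((next_mu_i - next_mu_j) ** 2, dim=-1)
            + torch.sum((next_sigma_i - next_sigma_j) ** 2, dim=-1)
        )
        target = r_dist + discount * w2_dist
        return F.mse_loss(z_dist, target.detach())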

Single State Objective

The single-state encoder uses the paired encoder to build its arithmetic space, which relates changes in the phi (state-goal) space to changes in the psi (single-state) space.
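A plausible way to write this objective, assuming it ties the difference of psi embeddings to the phi embedding with a stop-gradient (the dashed line in the diagram above), is sketched below; the exact loss used in the paper may differ.

    import torch.nn.functional as F

    def single_state_loss(psi, phi, state, goal):
        z_pair = phi(state, goal).detach()   # stop gradients into the paired encoder
        z_state = psi(state)
        z_goal = psi(goal)
        # Changes in psi space should match the task abstraction in phi space.
        return F.mse_loss(z_goal - z_state, z_pair)

Trained this way, psi(s_new) + phi(s_old, g_old) can serve as the representation of the goal that is analogous for a new start state.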

Manipulation Experiments

Two examples of analogy arithmetic. The rightmost image is the nearest neighbor, in a test set, of the composed representation in our learned state space. Top: the right-hand side has the finger pushing the button, just like in the goal argument to the state-goal encoder, but in a different position and orientation, indicating that the representation is equivariant to pose but captures the functional relevance of pushing the button. Bottom: the right-hand side has the drawer being closed, again matching function with the goal argument to the state-goal encoder while exhibiting invariance to color and equivariance to pose.
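A minimal sketch of the nearest-neighbor query behind these visualizations, assuming the composed representation is psi(s_new) + phi(s_analogy, g_analogy) and the test set is embedded with psi (all names are illustrative):

    import torch

    def analogy_nearest_neighbor(psi, phi, new_state, analogy_state, analogy_goal, test_states):
        with torch.no_grad():
            composed = psi(new_state) + phi(analogy_state, analogy_goal)  # "fill in the blank"
            test_embeds = psi(test_states)                                # (N, embed_dim)
            dists = torch.cdist(composed.unsqueeze(0), test_embeds)      # (1, N) pairwise distances
            return torch.argmin(dists).item()                            # index of the nearest neighbor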

Visualization of nearest neighbors in state-goal space. We sample state-goal pairs and find their nearest neighbors. Task-irrelevant features have changed, such as drawer color and distractor objects, while task-specific components are preserved, such as the relative positioning of the robot to the target object and the semantics of the task.

GCB Makes Analogies: An Evaluation

Five experiments in the Button and Drawer environment. The analogous state-goal pairs provided as the task description to the policy are shown on top, and the actions the policy takes in a new environment are shown on the bottom. Whatever task the agent performs in the analogous state-goal pair is also what the policy attempts in the new environment.

Evaluating the Standard Goal-Conditioned Setting 

Videos of policy evaluation for the standard goal-conditioned setting for 5 different episodes. Goals shown on top and video of evaluation episodes on bottom for the Drawer environment.

Videos of policy evaluation for the standard goal-conditioned setting for 5 different episodes. Goals shown on top and video of evaluation episodes on bottom for the Button and Drawer environment.

Additional Nearest Neighbor Visualizations

More examples of analogy arithmetic. In each box, the leftmost image is a sampled start state. The next two images are a sampled analogy start and goal. The rightmost image is the nearest neighbor, in a test set, of the composed representation in psi space. Each row is a new sample; the box in the bottom row, left column is an example of a failure.