Goal-conditioned Imitation Learning

Please find our accompanying code here.

Cite as: Yiming Ding*, Carlos Florensa*, Mariano Phielipp, Pieter Abbeel. Goal-conditioned Imitation Learning. Advances in Neural Information Processing Systems (NeurIPS), 2019.

@article{ding2019goalconditioned,
  author    = {Yiming Ding and
               Carlos Florensa and
               Mariano Phielipp and
               Pieter Abbeel},
  title     = {Goal-conditioned Imitation Learning},
  journal   = {Advances in Neural Information Processing Systems},
  year      = {2019},
  url       = {http://arxiv.org/abs/1906.05838},
}

Abstract

Designing rewards for Reinforcement Learning (RL) is challenging because the reward needs to convey the desired task, be efficient to optimize, and be easy to compute. The latter is particularly problematic when applying RL to robotics, where detecting whether the desired configuration is reached might require considerable supervision and instrumentation. Furthermore, we are often interested in being able to reach a wide range of configurations, hence setting up a different reward every time might be impractical. Methods like Hindsight Experience Replay (HER) have recently shown promise to learn policies able to reach many goals, without the need for a reward. Unfortunately, without tricks like resetting to points along the trajectory, HER might take a very long time to discover how to reach certain areas of the state-space. In this work we investigate different approaches to incorporating demonstrations to drastically speed up convergence to a policy able to reach any goal, also surpassing the performance of agents trained with other Imitation Learning algorithms. Furthermore, our method can be used when only trajectories without expert actions are available, allowing it to leverage kinesthetic or third-person demonstrations.

Experimental Results

Experiments are conducted in four environments: Four rooms environment, Fetch Pick & Place, Point-mass block pusher and Fetch Stack2.

Four rooms environment

Fetch robot pick and place

Point-mass block pusher

Fetch Stack2

Effect of Expert Relabeling

Here we show that the Expert Relabeling technique we propose is beneficial when using demonstrations in the goal-conditioned imitation learning framework. As shown in the figures below, our expert relabeling technique brings a considerable performance boost to both Behavioral Cloning and goal-GAIL in all four environments. A minimal sketch of the relabeling step is given after the figures.

Four rooms environment

Fetch pick and place

Point-mass block pusher

Fetch Stack2
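
For concreteness, here is a minimal sketch of the relabeling step on a buffer of demonstration trajectories: every expert transition is paired with goals extracted from states visited later in the same demonstration. The state_to_goal mapping and the number of goals sampled per transition are illustrative assumptions; the exact implementation in our code release may differ.

import random

def expert_relabel(demo_trajectories, state_to_goal, goals_per_step=4):
    """Relabel expert transitions with goals achieved later in the same
    demonstration (Expert Relabeling).

    demo_trajectories: list of trajectories, each a list of (state, action) pairs.
    state_to_goal:     function mapping a state to the goal it achieves (assumed given).
    """
    relabeled = []
    for traj in demo_trajectories:
        T = len(traj)
        for t, (state, action) in enumerate(traj):
            for _ in range(goals_per_step):
                # Sample a future time step and treat its achieved state as the goal.
                future = random.randint(t, T - 1)
                goal = state_to_goal(traj[future][0])
                relabeled.append((state, action, goal))
    return relabeled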

Visualization of the Expert Relabeling Effect


We also perform a further analysis of the benefit of expert relabeling in the four-rooms environment, since it is easy to visualize in 2D which goals the agent can reach. As shown in the figure on the right, without expert relabeling (top row) the agent fails to learn how to reach many intermediate states visited in the middle of a demonstration. With expert relabeling (bottom row), the agent learns to reach all states visited by the expert (left column), and the coverage of uniformly sampled goals is also boosted (right column).

Goal-GAIL

Here we demonstrate how the proposed goal-GAIL algorithm (Goal-conditioned GAIL with Hindsight) improves upon pure HER and pure GAIL.

In all environments, as shown in the figures below, we observe that running GAIL with hindsight relabeling (GAIL + HER) considerably outperforms running either method in isolation. HER alone converges very slowly, although, as expected, it eventually reaches the same final performance if run long enough. GAIL by itself learns fast at the beginning, but its final performance is capped: despite collecting more samples in the environment, those samples come with no reward of any kind indicating what the task is (reaching the given goals). Therefore, once GAIL has extracted all the information it can from the demonstrations, it cannot keep learning or generalize to goals farther from the demonstrations. This is no longer an issue when it is combined with HER, as our results show. A rough sketch of how the discriminator reward and hindsight relabeling can be combined is given after the figures.

Four rooms environment

Fetch pick and place

Point-mass block pusher

Fetch Stack2
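
As a rough sketch of the combination, the snippet below mixes the sparse goal-reaching reward with a GAIL-style discriminator reward on relabeled transitions. The function names, the mixing weight delta, and the assumption that the discriminator returns a probability for (state, goal) pairs are illustrative; the exact reward shaping and annealing schedule in our code release may differ.

import numpy as np

def her_relabel(desired_goal, future_achieved_goals, p_future=0.8):
    """HER-style relabeling: with probability p_future, replace the desired
    goal with a goal achieved later in the same rollout."""
    if len(future_achieved_goals) > 0 and np.random.rand() < p_future:
        return future_achieved_goals[np.random.randint(len(future_achieved_goals))]
    return desired_goal

def goal_gail_reward(state, achieved_goal, desired_goal, discriminator,
                     delta=0.5, eps=0.05):
    """Combine the sparse goal-reaching reward with a discriminator reward.
    discriminator(state, goal) is assumed to return the probability that the
    pair comes from the expert; delta trades off the two terms."""
    sparse = float(np.linalg.norm(achieved_goal - desired_goal) < eps)
    d = np.clip(discriminator(state, desired_goal), 1e-6, 1.0 - 1e-6)
    adversarial = np.log(d) - np.log(1.0 - d)   # standard GAIL-style reward
    return sparse + delta * adversarial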

Use of State-Only Demonstrations

Behavioral Cloning and standard GAIL rely on state-action (s, a) tuples coming from the expert. Nevertheless, there are many cases in robotics where we have access to demonstrations of a task but not to the actions. None of the results obtained with our goal-GAIL method and reported in the figures above require any access to the actions the expert took.

We also run an experiment in which the GAIL discriminator is conditioned only on the current state, and not on the action, and we observe that it learns a well-shaped reward that encourages the agent to move towards the goal, as pictured on the left.
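
As an illustration of this setup, the sketch below defines a discriminator over (state, goal) pairs only, trained with the standard GAIL binary cross-entropy between expert and agent samples. The PyTorch framing and network sizes are assumptions for illustration, not the exact architecture from our code release.

import torch
import torch.nn as nn

class StateOnlyDiscriminator(nn.Module):
    """GAIL discriminator over (state, goal) pairs only -- no expert actions."""
    def __init__(self, state_dim, goal_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + goal_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state, goal):
        # Returns an unnormalized logit; apply a sigmoid to get D(s, g).
        return self.net(torch.cat([state, goal], dim=-1))

def discriminator_loss(disc, expert_s, expert_g, agent_s, agent_g):
    """Standard GAIL objective: expert pairs labeled 1, agent pairs labeled 0."""
    bce = nn.BCEWithLogitsLoss()
    expert_logits = disc(expert_s, expert_g)
    agent_logits = disc(agent_s, agent_g)
    return (bce(expert_logits, torch.ones_like(expert_logits))
            + bce(agent_logits, torch.zeros_like(agent_logits)))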

Robustness to Sub-optimal Expert

In the sections above we assumed access to perfectly optimal experts. Nevertheless, in practical applications experts might behave erratically. In this section, we study how the different methods perform when a sub-optimal expert is used.

As shown in the figures below, we observe that approaches that directly try to copy the actions of the expert, like Behavioral Cloning, suffer greatly under a sub-optimal expert, to the point that they barely provide any improvement over plain Hindsight Experience Replay. On the other hand, methods based on training a discriminator between expert and current agent behavior are able to leverage much noisier experts. A possible explanation of this phenomenon is that a discriminator can give a positive signal as long as a transition goes "in the right direction", without trying to exactly enforce a single action. Under this lens, some noise in the expert might actually improve the performance of these adversarial approaches, as has been observed in the generative modeling literature. A sketch of one way such a sub-optimal expert can be simulated is given after the figures.

Four rooms environment

Fetch pick and place

Point-mass block pusher

Fetch Stack2
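
For reference, here is one simple way a sub-optimal expert can be simulated from an optimal one: corrupt its actions with Gaussian noise and occasionally replace them with random actions. This particular noise model is an illustrative assumption and may differ from the corruption used in our experiments.

import numpy as np

def make_suboptimal_expert(expert_policy, action_low, action_high,
                           noise_scale=0.2, p_random=0.1):
    """Wrap an expert policy so that its actions are perturbed with Gaussian
    noise and occasionally replaced by uniformly random actions."""
    def noisy_policy(obs, goal):
        if np.random.rand() < p_random:
            # Occasionally act completely at random.
            return np.random.uniform(action_low, action_high)
        action = np.asarray(expert_policy(obs, goal), dtype=np.float64)
        action = action + noise_scale * np.random.randn(*action.shape)
        return np.clip(action, action_low, action_high)
    return noisy_policy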