[Mohit Sharma *] [Arjun Sharma *] [Nick Rhinehart] [Kris M. Kitani]

Robotics Institute, CMU


Using imitation learning to learn a single policy for a complex task that has multiple modes or hierarchical structure can be challenging. In fact, previous work has shown that when the modes are known, learning separate policies for each mode or sub-task can greatly improve the performance of imitation learning. In this work, we discover the interaction between sub-tasks from their resulting state-action trajectory sequences using a directed graphical model. We propose a new algorithm, based on the generative adversarial imitation learning framework, which automatically learns sub-task policies from unsegmented demonstrations. Our approach maximizes the directed information flow in the graphical model between sub-task latent variables and their generated trajectories. We also show how our approach connects with the existing "Options" framework commonly used to learn hierarchical policies.
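For reference, directed information in its standard (Massey) form can be written as below. The notation is an assumption made here for illustration: \tau_{1:t} denotes the state-action trajectory up to step t, and c_t denotes the sub-task latent variable at step t.

```latex
I(\tau \rightarrow c) \;=\; \sum_{t=1}^{T} I\left(\tau_{1:t};\, c_t \,\middle|\, c_{1:t-1}\right)
```

Unlike the mutual information I(\tau; c), each term depends only on the trajectory observed up to step t, so a latent variable is never tied to states and actions that occur after it.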


title={Directed-Info {GAIL}: Learning Hierarchical Policies from Unsegmented Demonstrations using Directed Information},
author={Mohit Sharma and Arjun Sharma and Nicholas Rhinehart and Kris M. Kitani},
booktitle={International Conference on Learning Representations},



We use the 2D Grid-World environment to show that Directed-Info GAIL is able to infer sub-tasks that correspond to meaningful navigation strategies and to combine them to plan paths to different goal states. The following figures show that each sub-task policy implements a different navigation strategy.

Sub-Task Policy 1

Sub-Task Policy 2

Sub-Task Policy 3

Sub-Task Policy 4


We use the Circle-World environment to show that our approach handles multi-modal demonstrations: the expert trajectories contain different actions (for the clockwise and anti-clockwise directions) for every state (x, y) in the trajectory. In the following figures we also show how our proposed approach can be used to compose new behavior that was not observed in the training set. The figure on the left shows the trajectory generated using the learned macro-policy, i.e., the first circle is drawn clockwise and the second circle anti-clockwise. The figure on the right replaces the learned macro-policy with a different, desired macro-policy, which generates the first circle in the anti-clockwise direction and the second circle in the clockwise direction, thus composing new, unobserved behavior.
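The macro-policy swap described above can be illustrated with a toy sketch. Everything here is a hypothetical stand-in rather than the learned networks: `sub_task_policy` is a hand-written rule in place of the learned sub-task policies, the two macro-policies pick a context from the time step alone, and the agent's state is reduced to a single angle.

```python
import math

STEPS_PER_CIRCLE = 50
ANGLE_STEP = 2 * math.pi / STEPS_PER_CIRCLE


def sub_task_policy(context):
    """Hypothetical stand-in for the learned sub-task policies.

    Context 0 turns clockwise (negative angle step),
    context 1 turns anti-clockwise (positive angle step).
    """
    return -ANGLE_STEP if context == 0 else ANGLE_STEP


def learned_macro_policy(t):
    """Mimics the demonstrated order: clockwise first, then anti-clockwise."""
    return 0 if t < STEPS_PER_CIRCLE else 1


def desired_macro_policy(t):
    """Hand-specified order not seen in training: anti-clockwise first."""
    return 1 if t < STEPS_PER_CIRCLE else 0


def rollout(macro_policy):
    """Roll out two circles, letting the macro-policy pick the context each step."""
    angle, contexts, angles = 0.0, [], []
    for t in range(2 * STEPS_PER_CIRCLE):
        context = macro_policy(t)
        angle += sub_task_policy(context)
        contexts.append(context)
        angles.append(angle)
    return contexts, angles


learned_contexts, learned_angles = rollout(learned_macro_policy)
composed_contexts, composed_angles = rollout(desired_macro_policy)
```

The sub-task policies are identical in both rollouts; only the sequence of contexts changes, which is what allows the unobserved ordering to be composed.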

Trajectory generated by Directed-Info GAIL using the learned sub-task and learned macro policy.

Trajectory generated by Directed-Info GAIL using the learned sub-task and desired macro policy.


We use the MuJoCo simulator to show results on two high-dimensional continuous control tasks: (1) Hopper and (2) Walker2D. Below are the video results for both environments. The video on the left shows the result for the Hopper environment, while the video on the right shows the result for the Walker environment. We annotate each frame of every video with the sub-task latent variable predicted by the posterior network.
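As a rough sketch of how per-frame annotations of this kind could be produced, the loop below labels each step with the most probable latent variable. The `posterior` callable and its signature (current state and previous latent in, list of probabilities out) are assumptions made for illustration; the actual posterior network may condition on more of the trajectory history.

```python
def infer_latents(posterior, states, num_contexts, initial_context=0):
    """Label each step with the most probable sub-task latent variable.

    `posterior(state, prev_context)` is a hypothetical callable that returns
    one probability per context.
    """
    labels, prev_context = [], initial_context
    for state in states:
        probs = posterior(state, prev_context)
        # Pick the context with the highest posterior probability.
        prev_context = max(range(num_contexts), key=lambda k: probs[k])
        labels.append(prev_context)
    return labels


def toy_posterior(state, prev_context):
    """Toy rule used only to exercise the loop; ignores the previous context."""
    return [0.9, 0.1] if state < 0.5 else [0.2, 0.8]


frame_labels = infer_latents(toy_posterior, [0.1, 0.2, 0.7, 0.9], num_contexts=2)
```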

Fetch Robot Pick and Place Segmentation Results

We use the Fetch robotics environment provided in OpenAI Gym to show that our proposed method is able to correctly segment an expert pick and place trajectory into two different sub-tasks.
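Given one predicted latent variable per time step, a segmentation like this can be read off by grouping consecutive equal labels. The helper below is a minimal sketch; the function name and the example label sequence are illustrative and not taken from the experiments.

```python
from itertools import groupby


def segment_by_latent(latents):
    """Group consecutive equal labels into (label, start, end) segments.

    `end` is exclusive, so each segment covers latents[start:end].
    """
    segments, start = [], 0
    for label, run in groupby(latents):
        length = len(list(run))
        segments.append((label, start, start + length))
        start += length
    return segments


# Illustrative labels: three steps of one sub-task followed by two of another.
example_segments = segment_by_latent([0, 0, 0, 1, 1])
```

A trajectory that is cleanly split into two sub-tasks yields exactly two segments; spurious switches show up as extra short segments.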

Walker Context

For the Walker environment in OpenAI Gym, we show that our proposed approach is able to learn different behaviors corresponding to different sub-task latent variables. For example, in the videos below we provide the walker agent with a single fixed latent variable: in video 1 we always provide context 2, which corresponds to control of the pink leg, while in video 2 we always provide context 1, which corresponds to control of the brown leg.

Fetch Robot - Pick and Place Imitation Learning Results

We also show results for our approach on a more complex pick and place task using the Fetch robot. We use the FetchPickAndPlace-v1 environment for our experiments. As in the segmentation results above, we use the VAE pre-training step to infer the sub-tasks for the expert trajectories. In the results below, in addition to the adversarial loss for the policy network, we add a behavior cloning loss that minimizes the L2 distance between the policy action and the expert action on states in the expert demonstrations.
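A minimal sketch of such a combined objective follows. The weighting term `bc_weight` and the simple additive form are assumptions, since the text does not state how the two losses are balanced, and the adversarial loss is treated as an already computed scalar.

```python
import math


def behavior_cloning_loss(policy_actions, expert_actions):
    """Mean L2 (Euclidean) distance between policy and expert actions.

    Both arguments are equal-length lists of action vectors, evaluated on
    the same expert-demonstration states.
    """
    total = 0.0
    for predicted, expert in zip(policy_actions, expert_actions):
        total += math.sqrt(sum((p - e) ** 2 for p, e in zip(predicted, expert)))
    return total / len(expert_actions)


def total_policy_loss(adversarial_loss, bc_loss, bc_weight=1.0):
    """Add the behavior cloning term to the adversarial term.

    `bc_weight` is a hypothetical hyperparameter introduced for this sketch.
    """
    return adversarial_loss + bc_weight * bc_loss


bc = behavior_cloning_loss([[0.0, 0.0], [1.0, 1.0]], [[3.0, 4.0], [1.0, 1.0]])
combined = total_policy_loss(adversarial_loss=1.0, bc_loss=bc, bc_weight=0.5)
```

Implementations often use the squared distance instead, which changes the scale of the term but not the expert action it pulls the policy toward.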

As seen in the results below, our method is able to learn both the picking and placing sub-tasks from unsegmented demonstrations. As the video on the left shows, the GAIL baseline only learns to reach the object and fails to grasp it, either by not closing the gripper or by closing it prematurely. In contrast, as can be seen in the middle video, our proposed approach successfully learns to reach and grasp the object and then move it to the target location. We believe that our approach helps the agent avoid these grasp failures by providing it with the sub-task code, which helps it disambiguate between the very similar states it observes just before and just after grasping. However, as seen in the video on the right, we observed some failure cases with our proposed approach as well.