Meta-Inverse Reinforcement Learning with Probabilistic Context Variables

Lantao Yu*, Tianhe Yu*, Chelsea Finn, Stefano Ermon

Stanford University

(* denotes equal contribution)

Abstract: Reinforcement learning demands a reward function, which is often difficult to provide or design in real-world applications. While inverse reinforcement learning (IRL) holds promise for automatically learning reward functions from demonstrations, several major challenges remain. First, existing IRL methods learn reward functions from scratch, requiring large numbers of demonstrations to correctly infer the reward for each task the agent may need to perform. Second, and more subtly, existing methods typically assume demonstrations of a single, isolated behavior or task, while in practice it is significantly more natural and scalable to provide datasets of heterogeneous behaviors. To this end, we propose a deep latent variable model that is capable of learning rewards from unstructured, multi-task demonstration data and, critically, of using this experience to infer robust rewards for new, structurally similar tasks from a single demonstration. Our experiments on multiple continuous control tasks demonstrate the effectiveness of our approach compared to state-of-the-art imitation and inverse reinforcement learning methods.
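To make the high-level idea concrete, below is a minimal PyTorch-style sketch of the two components the abstract refers to: an inference network that maps a single demonstration to a latent context variable, and a reward function conditioned on that context. The class names, architectures, and hyperparameters are illustrative assumptions for exposition, not the paper's released implementation.

```python
import torch
import torch.nn as nn


class ContextEncoder(nn.Module):
    """Inference network: maps one demonstration trajectory to a latent context m.
    (Hypothetical architecture; a sketch of the idea, not the released code.)"""

    def __init__(self, obs_dim, act_dim, context_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, context_dim),
        )

    def forward(self, demo_obs, demo_acts):
        # demo_obs: (T, obs_dim), demo_acts: (T, act_dim) for a single trajectory.
        per_step = self.net(torch.cat([demo_obs, demo_acts], dim=-1))
        # Aggregate over time steps so the context summarizes the whole demo.
        return per_step.mean(dim=0)


class ContextConditionedReward(nn.Module):
    """Learned reward r(s, a, m): one network serves all tasks, with the task
    identity carried by the inferred context m."""

    def __init__(self, obs_dim, act_dim, context_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim + context_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs, act, context):
        # Broadcast the single context vector across a batch of (s, a) pairs.
        ctx = context.unsqueeze(0).expand(obs.shape[0], -1)
        return self.net(torch.cat([obs, act, ctx], dim=-1)).squeeze(-1)
```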

Results of Reward Adaptation to Challenging Situations

In this setting, after providing a single demonstration of an unseen task to the agent, we change the underlying environment dynamics while keeping the same task goal. We present video results of our method as well as the baselines.
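Concretely, "reward adaptation" here means: infer the latent context once from the provided demonstration, keep the learned context-conditioned reward fixed, and train a fresh policy with standard RL in the environment whose dynamics have changed. The sketch below is our illustration of that protocol, not the authors' exact procedure; it reuses the hypothetical encoder and reward network from the previous snippet, and `rl_trainer` stands in for any off-the-shelf policy optimizer.

```python
import torch


def adapt_reward_to_new_dynamics(demo, modified_env, encoder, reward_fn, rl_trainer):
    """Reward adaptation under changed dynamics (illustrative sketch).

    demo: a single demonstration of the unseen task ({"observations", "actions"}).
    modified_env: same task goal, but with altered dynamics
                  (e.g., a shifted barrier or disabled legs).
    encoder / reward_fn: trained ContextEncoder / ContextConditionedReward.
    rl_trainer: any standard RL algorithm taking (env, reward_fn) -> policy.
    """
    # 1. Infer the latent context from the single demonstration.
    with torch.no_grad():
        context = encoder(demo["observations"], demo["actions"])

    # 2. Keep the learned reward fixed, conditioned on that context.
    def adapted_reward(obs, act):
        return reward_fn(obs, act, context)

    # 3. Train a new policy against the adapted reward in the modified
    #    environment; the reward transfers even though a policy learned
    #    under the old dynamics would not.
    return rl_trainer(modified_env, adapted_reward)
```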

Point-Maze Navigation with a Shifted Barrier

We can visualize the learned reward functions for point-maze navigation as shown below. The red star marks the target position and the white circle marks the initial position of the agent (both vary across iterations). The black horizontal line is the barrier that cannot be crossed. To show generalization, the expert demonstrations used to infer the target position are sampled from new target positions that were not seen in the meta-training set.

In this domain, after showing a single demonstration of the pointmass moving to an unseen target with the wall located on the left, we change the environment dynamics by moving the wall to the right. Results of PEMIRL reward adaptation and our best baseline are shown below.

Demonstration

Meta-IL Policy Generalization (Best Baseline)

PEMIRL Reward Adaptation

PEMIRL correctly infers the reward for reaching the new goal, and a policy trained with this reward navigates the pointmass to the goal while avoiding the relocated wall on the right, whereas the Meta-IL policy gets stuck at the wall.

Disabled Ant Walking

In this domain, after showing a single demonstration of the ant moving forward or backward, we disable and shorten two of the ant's front legs so that it cannot walk without substantially changing its gait. We present video results for the disabled ant below.

Demonstration (Moving Backward)

Meta-IL Policy Generalization (Best Baseline)

PEMIRL Reward Adaptation

Demonstration (Moving Forward)

Meta-IL Policy Generalization (Best Baseline)

PEMIRL Reward Adaptation

As shown above, the meta-imitation learning policy, our strongest baseline, fails to maneuver the disabled ant in the demonstrated direction. The reward functions learned by PEMIRL encourage the RL policy to orient the ant toward the demonstrated direction and to move along it using its two healthy legs. This suggests that PEMIRL efficiently infers the true goal of the task, namely the ant's direction of motion.