f-IRL: Inverse Reinforcement Learning via State Marginal Matching

Tianwei Ni*, Harshit Sikchi*, Yufei Wang*, Tejus Gupta*, Lisa Lee°, Ben Eysenbach°

(*equal contribution, order determined by dice rolling; °equal advising)

Appearing at the Conference on Robot Learning (CoRL) 2020

Paper, Video, and Slides

Video: fIRL_CoRL_final.mov
Slides: fIRL_CoRL_final.pptx

Motivation

Traditionally, IL/IRL methods assume access to expert demonstrations and minimize some divergence between the policy's and the expert's trajectory distributions. However, in many cases it may be easier to specify the desired state distribution directly (explicitly or via samples) than to provide fully-specified demonstrations (with actions) of the desired behavior. For example, in a safety-critical application it may be easier to specify that the expert never visits certain unsafe states than to tweak a reward function to penalize safety violations. Similarly, we can specify a uniform density over the whole state space for exploration tasks, or a Gaussian centered at the goal for goal-reaching tasks.

Contribution

In this paper, we present a new method, f-IRL, based on an analytic gradient of an arbitrary f-divergence between the agent's and the expert's state distributions with respect to the reward parameters. f-IRL recovers both a reward function and a policy that matches the expert state distribution. The resulting algorithm is more sample-efficient in the number of environment interactions and expert trajectories, as we demonstrate on the MuJoCo imitation learning benchmarks; this makes f-IRL desirable in the limited-expert-data regime. We also demonstrate the utility of the rewards recovered by f-IRL on hard-to-explore tasks with sparse rewards and for transferring behaviors across changes in dynamics.
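As a rough sketch of the idea (notation paraphrased; see the paper for the exact statement, the normalization constants, and the definition of h_f for each divergence), f-IRL treats the reward parameters θ as inducing a MaxEnt RL policy π_θ with state marginal ρ_θ, and differentiates the f-divergence to the expert state marginal ρ_E with respect to θ:

```latex
% Objective: an f-divergence between the expert state marginal \rho_E and the
% state marginal \rho_\theta induced by the current reward r_\theta.
L_f(\theta) \;=\; D_f(\rho_E \,\|\, \rho_\theta)
            \;=\; \mathbb{E}_{s \sim \rho_\theta}\!\left[ f\!\left(\tfrac{\rho_E(s)}{\rho_\theta(s)}\right) \right]

% The analytic gradient takes the form of a covariance along agent trajectories
% between a divergence-dependent function h_f of the state density ratio and the
% gradient of the cumulative reward (normalization constants omitted here).
\nabla_\theta L_f(\theta) \;\propto\;
\mathbb{E}_{\tau \sim \pi_\theta}\!\left[
  \operatorname{cov}_{t \in [T]}\!\left(
    h_f\!\left(\tfrac{\rho_E(s_t)}{\rho_\theta(s_t)}\right),\;
    \sum_{t'=1}^{T} \nabla_\theta r_\theta(s_{t'})
  \right)
\right]
```

Estimating this gradient only requires rollouts from the current policy together with an estimate of the density ratio ρ_E/ρ_θ (which in practice can be estimated, for instance, with density models or a discriminator); the reward is then updated by gradient descent and the MaxEnt RL policy is updated on the new reward.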


Experiments

An Illustration of how f-IRL works on Reacher-v2

The example below shows the training progress of f-IRL on the Reacher-v2 task in the MuJoCo simulator. It illustrates the gradients obtained by f-IRL and how the reward function evolves over training iterations.

Target density (the state distribution we want the agent to achieve)

Imitation Learning Benchmark - Qualitative Results

We compare f-IRL to previous state-of-the-art IRL methods on the MuJoCo imitation learning benchmarks. As shown on four high-dimensional continuous-control environments (Hopper-v2, Walker2d-v2, HalfCheetah-v2, and Ant-v2), f-IRL obtains a policy that is closer to the expert than the baselines. Only one expert trajectory is used for training.

HalfCheetah-v2

Expert Policy

f-MAX (Ghasemipour et al. 2019)

f-IRL (learned policy)

f-IRL (policy trained from scratch on learned reward)

Hopper-v2

Expert Policy

f-MAX (Ghasemipour et al. 2019)

f-IRL (learned policy)

f-IRL (policy trained from scratch on learned reward)

Walker2d-v2

Expert Policy

f-MAX (Ghasemipour et al. 2019)

f-IRL (learned policy)

f-IRL (policy trained from scratch on learned reward)

Ant-v2

Expert Policy

f-MAX (Ghasemipour et al. 2019)

f-IRL (learned policy)

f-IRL (policy trained from scratch on learned reward)

Transferring the learned reward across changes in dynamics

f-IRL learns a reward function in addition to a policy for imitating the expert. This allows us to transfer the reward function across changes in dynamics, where directly transferring the policy would simply fail. We provide an example to this effect: an expert Ant agent tries to walk as fast as possible, f-IRL is used to extract a reward from its behavior, and a SAC agent is then trained from scratch on a modified Ant with two of its four legs disabled. The modified Ant has to learn a different policy with a different gait, using the disabled legs as support and the other two legs to crawl forward. A minimal sketch of this reward-transfer setup is given below.
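The following is a minimal, hypothetical sketch (not the authors' released code) of how such a reward transfer can be set up: the target environment's native reward is replaced by the state-only reward network recovered by f-IRL, and an off-the-shelf SAC implementation is then trained from scratch on the wrapped environment. The environment class `DisabledAntEnv` and the network `reward_net` are placeholders.

```python
# Hedged sketch of reward transfer across dynamics, assuming Gymnasium and PyTorch.
import gymnasium as gym
import torch


class LearnedRewardWrapper(gym.Wrapper):
    """Replace the environment's native reward with a learned state-only reward."""

    def __init__(self, env, reward_net):
        super().__init__(env)
        self.reward_net = reward_net  # state-only reward recovered by f-IRL (placeholder)

    def step(self, action):
        obs, _, terminated, truncated, info = self.env.step(action)  # drop native reward
        with torch.no_grad():
            reward = self.reward_net(torch.as_tensor(obs, dtype=torch.float32)).item()
        return obs, reward, terminated, truncated, info


# Usage (placeholders): wrap the modified Ant with the learned reward, then train
# any standard SAC agent from scratch on the wrapped environment.
# env = LearnedRewardWrapper(DisabledAntEnv(), reward_net)
```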

Healthy Ant -> Disabled Ant Transfer

Healthy Ant Expert

Disabled Ant Agent (policy trained from scratch on learned reward obtained via f-IRL)

Conclusion

In summary, we have presented f-IRL, a practical IRL algorithm that distills an expert's state distribution into a stationary reward function. f-IRL can learn from either (a) provided expert samples (as in traditional IRL) or (b) a specified expert density, which opens the door to supervising IRL with other useful types of data. These types of supervision can assist agents in solving tasks faster, encode preferences for how tasks are performed, and indicate which states are unsafe and should be avoided. Our experiments on the MuJoCo benchmarks demonstrate that f-IRL is more sample-efficient in both the number of expert trajectories and the number of environment timesteps.
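To make the "specified expert density" mode concrete, here is a small illustrative example (not from the paper's codebase) of the kind of target one might hand-specify for a goal-reaching task: a Gaussian state density centered at the goal. The goal location, the standard deviation, and the assumption that the first two state dimensions are the relevant coordinates are all illustrative choices, not settings from the paper.

```python
# Illustrative hand-specified target state density for a goal-reaching task.
import numpy as np


def expert_log_density(state, goal=np.array([0.1, 0.1]), sigma=0.05):
    """Log-density of an isotropic Gaussian centered at `goal` (illustrative values)."""
    d = np.asarray(state)[:2] - goal  # assume the first two dims are the goal coordinates
    k = d.shape[-1]
    return -0.5 * np.dot(d, d) / sigma**2 - 0.5 * k * np.log(2 * np.pi * sigma**2)
```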