Learning Robust Rewards with Adversarial Inverse Reinforcement Learning
Justin Fu, Katie Luo, Sergey Levine
Abstract: Reinforcement learning provides a powerful and general framework for decision making and control, but its application in practice is often hindered by the need for extensive feature and reward engineering. Deep reinforcement learning methods can remove the need for explicit engineering of policy or value features, but still require a manually specified reward function. Inverse reinforcement learning holds the promise of automatic reward acquisition, but has proven exceptionally difficult to apply to large, high-dimensional problems with unknown dynamics. In this work, we propose AIRL, a practical and scalable inverse reinforcement learning algorithm based on an adversarial reward learning formulation. We demonstrate that AIRL is able to recover reward functions that are robust to changes in dynamics, enabling us to learn policies even under significant variation in the environment seen during training. Our experiments show that AIRL greatly outperforms prior methods in these transfer settings.
Dynamics Shift in Reinforcement Learning
Traditionally, inverse reinforcement learning (IRL) is evaluated by the performance of an agent that optimizes the learned reward in the training environment. However, in a real-world scenario we may wish to deploy the learned reward in an environment different from the one used for training (e.g., moving from a laboratory to a home). The traditional evaluation is akin to "testing on your training set". Unfortunately, IRL algorithms are ill-equipped to handle this issue because they cannot distinguish the true reward from shaped rewards of the form:
$$\hat{r}(s, a, s') = r(s, a, s') + \gamma \Phi(s') - \Phi(s)$$
where $\Phi$ is an arbitrary function of the state and $\gamma$ is the discount factor.
Shaped rewards are highly sensitive to the dynamics and are not guaranteed to preserve policy optimality when the dynamics change. We combat this issue by learning rewards that are a function of state only (see the paper for justification). Combined with a new IRL algorithm, which we call adversarial inverse reinforcement learning (AIRL), this yields results superior to direct imitation learning in transfer learning scenarios.
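The dynamics sensitivity of shaped rewards can be illustrated on a toy problem. The sketch below is not taken from the paper's experiments: the three-state chain, the swapped-action dynamics shift, and the choice of potential Φ = V* are all assumptions made for illustration. It builds a shaped state-action reward r(s) + γ E[Φ(s')] − Φ(s), with the expectation taken under the training dynamics, and re-optimizes both the true and the shaped reward after the dynamics change.

```python
import numpy as np

GAMMA = 0.9
N_STATES, N_ACTIONS = 3, 2  # hypothetical chain: states 0, 1, 2 (goal = 2)


def chain_dynamics(step_for_action):
    """Deterministic transitions T[s, a, s'] on a chain with clamped ends."""
    T = np.zeros((N_STATES, N_ACTIONS, N_STATES))
    for s in range(N_STATES):
        for a in range(N_ACTIONS):
            s_next = min(max(s + step_for_action[a], 0), N_STATES - 1)
            T[s, a, s_next] = 1.0
    return T


def value_iteration(reward_sa, T, n_iters=1000):
    """Return (V, greedy policy) for a reward table reward_sa[s, a]."""
    V = np.zeros(N_STATES)
    for _ in range(n_iters):
        Q = reward_sa + GAMMA * (T @ V)
        V = Q.max(axis=1)
    return V, Q.argmax(axis=1)


T_train = chain_dynamics([-1, +1])  # action 0 moves left, action 1 moves right
T_test = chain_dynamics([+1, -1])   # assumed dynamics shift: actions are swapped

# State-only ground-truth reward: 1 in the goal state, 0 elsewhere.
r_state = np.array([0.0, 0.0, 1.0])
r_true = np.tile(r_state[:, None], (1, N_ACTIONS))

# Shaped (s, a) reward with Phi = V* computed under the TRAINING dynamics.
phi, pi_true_train = value_iteration(r_true, T_train)
r_shaped = r_true + GAMMA * (T_train @ phi) - phi[:, None]

_, pi_shaped_train = value_iteration(r_shaped, T_train)
_, pi_true_test = value_iteration(r_true, T_test)
_, pi_shaped_test = value_iteration(r_shaped, T_test)

print("train dynamics, true reward:  ", pi_true_train)
print("train dynamics, shaped reward:", pi_shaped_train)
print("test dynamics,  true reward:  ", pi_true_test)
print("test dynamics,  shaped reward:", pi_shaped_test)
```

Under the training dynamics both rewards produce the same policy (action 1 everywhere, i.e. move right toward the goal), so they cannot be told apart from optimal behavior alone. Once the actions are swapped, re-optimizing the state-only reward switches to action 0 (now "right"), while the shaped reward still prefers action 1 and walks away from the goal, because the training dynamics are baked into its state-action table.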
Transfer Learning under Dynamics Shift
We consider transfer learning in scenarios where the reward function remains the same, but the dynamics are changed. In the "Shifting Wall" task, a 2D pointmass (blue) must navigate to the goal (green) around a wall whose position is changed at test time. In the "Disabled Ant" task, a quadrupedal robot must walk to the right after its two front legs (highlighted in red) are disabled and shortened.
[Figures: Shifting Wall Task; Disabled Ant Task]
On the 2D pointmass task, we can plot the learned (state-only) reward function. The agent starts at the white circle, and the goal is represented by the green star. Note that there is little reward shaping, enabling the reward function to work no matter where the wall is placed.
We can also visualize the policies learned on the ant task via direct policy transfer (middle) and re-optimizing a state-only reward function (right). The original policy is shown on the left. Note that re-optimizing the state-only reward acquires a significantly different gait that requires the ant to turn around and crawl backwards.
[Figure panels: Direct Policy Transfer; AIRL (re-optimize state-only reward)]
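The comparison between direct policy transfer and re-optimizing a state-only reward can be mimicked on a small tabular problem. The sketch below is a loose analogue of the disabled-ant setting, not the paper's environment: the five-state ring, the disabled "right" action, and the use of the ground-truth goal reward in place of a learned state-only reward are all assumptions made for illustration.

```python
import numpy as np

GAMMA = 0.9
N, GOAL, START = 5, 2, 0  # hypothetical ring of 5 states
LEFT, RIGHT = 0, 1


def ring_dynamics(right_enabled=True):
    """Deterministic T[s, a, s'] on a ring; the goal state is absorbing."""
    T = np.zeros((N, 2, N))
    for s in range(N):
        if s == GOAL:
            T[s, :, s] = 1.0
            continue
        T[s, LEFT, (s - 1) % N] = 1.0
        # When RIGHT is disabled, the action leaves the agent in place.
        T[s, RIGHT, (s + 1) % N if right_enabled else s] = 1.0
    return T


def optimize(reward_s, T, n_iters=1000):
    """Value iteration for a state-only reward; returns the greedy policy."""
    V = np.zeros(N)
    for _ in range(n_iters):
        Q = reward_s[:, None] + GAMMA * (T @ V)
        V = Q.max(axis=1)
    return Q.argmax(axis=1)


def evaluate(policy, reward_s, T):
    """Exact policy evaluation: solve (I - gamma * P_pi) V = r."""
    P_pi = T[np.arange(N), policy]
    return np.linalg.solve(np.eye(N) - GAMMA * P_pi, reward_s)


T_train = ring_dynamics(right_enabled=True)
T_test = ring_dynamics(right_enabled=False)  # assumed shift: RIGHT is disabled

# Stand-in for a learned state-only reward: 1 at the goal, 0 elsewhere.
reward_s = np.zeros(N)
reward_s[GOAL] = 1.0

pi_train = optimize(reward_s, T_train)  # policy found in the training environment
pi_reopt = optimize(reward_s, T_test)   # same reward, re-optimized at test time

v_direct = evaluate(pi_train, reward_s, T_test)  # direct policy transfer
v_reopt = evaluate(pi_reopt, reward_s, T_test)

print("direct transfer, return from start:", v_direct[START])
print("re-optimized,    return from start:", v_reopt[START])
```

In this toy setting the training policy takes the short route to the right, so after the shift it keeps issuing the disabled action and never leaves the start state (return 0). Re-optimizing the same state-only reward finds the longer route to the left and reaches the goal (return of about 7.29 from the start), which loosely mirrors the ant adopting a different gait.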