Basis for Intentions: Efficient Inverse Reinforcement Learning using Past Experience


Effective inverse reinforcement learning (IRL) is challenging, as many reward functions can be compatible with an observed behavior. This paper focuses on how prior reinforcement learning (RL) experience can be leveraged to make learning expert preferences faster and more efficient. We address several key problems in IRL:

  • Underspecification: Learning from high-dimensional observations is extremely challenging, because there are many possible reward functions consistent with a set of demonstrations.

  • Requiring many demonstrations: When learning rewards from scratch, modern deep IRL algorithms often require a large number of demonstrations and trials.

We propose the IRL algorithm BASIS (Behavior Acquisition through Successor Feature Intention inference from Samples), which leverages multi-task RL pre-training and successor features to allow an agent to build a strong basis for intentions that spans the space of possible goals in a given domain. When exposed to just a few expert demonstrations optimizing a novel goal, the agent uses its basis to quickly and effectively infer the reward function. Our experiments reveal that our method is highly effective at inferring and optimizing demonstrated reward functions, accurately inferring reward functions from fewer than 100 trajectories.


BASIS uses multi-task RL pre-training to learn a basis for intentions. This basis encodes information about both the environment dynamics and, through modeling the rewards of multiple pre-training tasks, the space of possible goals that can be pursued in the environment. This information is captured in the cumulants, successor features, and preference vectors of all previous tasks. The agent then leverages these parameters to rapidly infer the demonstrator's goal from expert demonstrations, updating the parameters as needed.
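The core idea can be sketched with the standard successor-feature decomposition: rewards factor as r(s, a) = φ(s, a)·w, so with pre-trained successor features ψ (the discounted sums of cumulants φ), inferring a new goal reduces to fitting a low-dimensional preference vector w. The sketch below is a minimal, hypothetical illustration of that reduction, not the paper's actual update rule: it fixes random successor features for one state, then recovers w by gradient ascent on the log-likelihood of the demonstrated action under a softmax policy over Q(s, a) = ψ(s, a)·w.

```python
import numpy as np

# Hypothetical setup: a single state with n_actions actions and
# d-dimensional cumulants. psi[a] stands in for pre-trained successor
# features, i.e. the expected discounted sum of cumulants phi after
# taking action a. Rewards decompose linearly: r(s, a) = phi(s, a) @ w.
rng = np.random.default_rng(0)
n_actions, d = 4, 8
psi = rng.normal(size=(n_actions, d))   # successor features per action
true_w = rng.normal(size=d)             # demonstrator's preference vector

# The "demonstration": the action maximizing Q(s, a) = psi(s, a) @ w.
demo_action = int(np.argmax(psi @ true_w))

# Infer w by gradient ascent on the log-likelihood of the demonstrated
# action under a softmax policy over Q-values (a simplified stand-in
# for the paper's inference procedure).
w = np.zeros(d)
for _ in range(500):
    q = psi @ w
    p = np.exp(q - q.max())
    p /= p.sum()
    # Gradient of log softmax(q)[demo_action] with respect to w.
    w += 0.1 * (psi[demo_action] - p @ psi)
```

Because ψ is reused across goals, only the d-dimensional vector w must be estimated per demonstrator, which is why a handful of demonstrations can suffice.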


We evaluate BASIS in three multi-task environment domains: a fruit-picking environment, a highway driving scenario, and a roundabout driving scenario. On these tasks, our approach is up to 10x more accurate at recovering the demonstrator's reward than state-of-the-art IRL methods involving pre-training with IRL, and achieves up to 15x more ground-truth reward than state-of-the-art imitation learning methods. In summary, the contributions of this paper are to show the effectiveness of multi-task RL pre-training for IRL, to propose a new technique that uses successor features to learn a basis for behavior from which rewards can be inferred, and to present empirical results demonstrating the effectiveness of our method over prior work.


These videos show BASIS compared with our ablations on demonstrations of expert preferences, including maintaining a wide following distance, driving at high speed, and preferring the left lane.


Multi-task IRL pre-training



Further Ablations

Through a series of ablations, we found that both multi-task RL pre-training and successor features contribute to the success of the approach.