Zaynah Javed*, Daniel S. Brown*, Satvik Sharma, Jerry Zhu, Ashwin Balakrishna,
Marek Petrik, Anca D. Dragan, Ken Goldberg
International Conference on Machine Learning (ICML) 2021
TLDR: There are often many different reward functions that explain human demonstrations, leaving imitation learning agents with uncertainty over the demonstrator's true intent. To address this uncertainty, we derive a novel policy-gradient-style robust imitation learning algorithm, PG-BROIL, that easily scales to continuous MDPs. PG-BROIL outperforms existing state-of-the-art imitation learning methods by hedging against reward uncertainty, rather than seeking to uniquely identify the demonstrator's reward function.
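To illustrate the idea of hedging against reward uncertainty, here is a minimal sketch of a soft-robust objective that blends expected return with conditional value at risk (CVaR) over a set of sampled reward hypotheses. The function name, uniform posterior weights, and parameter values are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def soft_robust_objective(returns, alpha=0.95, lam=0.5):
    """Blend expected return with CVaR over posterior reward hypotheses.

    returns: the policy's expected return under each sampled reward
             hypothesis (uniform posterior weights assumed for simplicity).
    alpha:   risk level; CVaR averages the worst (1 - alpha) tail.
    lam:     trade-off between expected performance and robustness.
    """
    returns = np.asarray(returns, dtype=float)
    expected = returns.mean()
    # Value at Risk: the (1 - alpha) quantile of returns across hypotheses.
    var = np.quantile(returns, 1 - alpha)
    # CVaR: mean of the returns at or below the VaR (the worst-case tail).
    cvar = returns[returns <= var].mean()
    # Soft-robust blend: lam = 1 recovers the standard expected return,
    # lam = 0 optimizes purely for the worst-case tail.
    return lam * expected + (1 - lam) * cvar
```

A policy-gradient method can then ascend this objective instead of a single point-estimate reward, so the agent performs well across the reward functions consistent with the demonstrations rather than betting on one of them.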