Safe Imitation Learning via Fast Bayesian Reward Inference from Preferences
Daniel S. Brown, Russell Coleman, Ravi Srinivasan, Scott Niekum
University of Texas at Austin
In Proceedings of the Thirty-seventh International Conference on Machine Learning (ICML) 2020.
Motivation
Previous approaches to imitation learning typically fall into one of two categories:
Behavioral cloning, which learns a mapping from states to actions in a supervised fashion
Rarely outperforms the demonstrator and suffers from compounding errors once the agent drifts outside the demonstrated state distribution
Inverse reinforcement learning (IRL), which seeks to explain the demonstrator by learning a reward function
Standard reinforcement learning (RL) techniques can then be used to extract a policy from the learned reward
However, most approaches give only a single point estimate of the reward function, and obtaining many posterior samples is typically computationally intractable
We propose Bayesian Reward Extrapolation (Bayesian REX), a novel inverse reinforcement learning approach that efficiently generates a posterior distribution over reward functions
This Bayesian approach allows for:
Safety guarantees via high confidence performance bounds
Increasingly important for real-world deployment, e.g. self-driving cars, consumer-facing robotics, or risk-sensitive professional applications
Reasoning about uncertainty
Improved policy evaluation
Detection of undesirable behavior or reward gaming
Method
Bayesian REX consists of two distinct phases:
Pre-training
Sampling
Pre-training
Converts raw pixel data into a lower-dimensional embedding
Trained with five self-supervised losses, illustrated in the diagram to the right and listed below; a minimal code sketch follows the list
Variational autoencoder (VAE): Reconstruct original frames from lower-dimensional embedding
Temporal distance: Estimate how much time elapsed between two frames
Inverse dynamics: Predict action taken by agent between two frames
Forward dynamics: Predict future frames given some actions
T-REX: Predict which of two demonstrations is preferred with respect to the provided ranking
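As a concrete illustration, here is a minimal PyTorch-style sketch of how a single shared encoder can feed all five loss heads. The layer sizes, head shapes, and names (EncoderWithHeads, latent_dim, etc.) are assumptions for illustration, not the paper's exact architecture.

# Illustrative sketch only: layer sizes, head shapes, and names are assumptions,
# not the exact architecture used in the paper.
import torch
import torch.nn as nn

class EncoderWithHeads(nn.Module):
    def __init__(self, latent_dim=64, num_actions=18):
        super().__init__()
        # Shared convolutional encoder: stacked frames -> embedding phi(s)
        self.conv = nn.Sequential(
            nn.Conv2d(4, 16, kernel_size=7, stride=3), nn.LeakyReLU(),
            nn.Conv2d(16, 32, kernel_size=5, stride=2), nn.LeakyReLU(),
            nn.Flatten(),
        )
        self.mu = nn.LazyLinear(latent_dim)       # VAE mean of phi(s)
        self.logvar = nn.LazyLinear(latent_dim)   # VAE log-variance
        # One head per self-supervised loss
        self.decoder = nn.Linear(latent_dim, 4 * 84 * 84)                   # VAE: reconstruct frames
        self.temporal = nn.Linear(2 * latent_dim, 1)                        # time elapsed between two frames
        self.inverse_dyn = nn.Linear(2 * latent_dim, num_actions)           # action taken between two frames
        self.forward_dyn = nn.Linear(latent_dim + num_actions, latent_dim)  # predict the next embedding
        # T-REX head: linear reward on the embedding; a trajectory's return is
        # the sum of per-frame rewards, trained with the pairwise preference loss
        self.reward = nn.Linear(latent_dim, 1, bias=False)

    def embed(self, frames):
        """Map a batch of stacked frames to the latent mean used as phi(s)."""
        return self.mu(self.conv(frames))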
Sampling
Uses the learned embedding to sample reward functions
Reward function is assumed to be a linear combination of embedding features
Metropolis–Hastings MCMC sampling
At each step, a candidate reward function is proposed, represented as a single vector of floating-point weights
The return of a demonstration is the dot product of this weight vector with the demonstration's summed embedding features
The candidate is probabilistically accepted or rejected based on how well it explains the provided ranking
The likelihood treats every ranked pair of demonstrations as a classification problem: predict which of the two has the higher return under the candidate reward
This process is highly efficient and easily hardware-accelerated, allowing for hundreds of thousands of samples
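Below is a minimal NumPy sketch of this sampling loop under the assumptions above: traj_features[i] holds the summed embedding of demonstration i (computed once by the frozen encoder), prefs is a list of (worse, better) index pairs from the ranking, and the proposal width, inverse temperature beta, and chain length are illustrative choices rather than the paper's settings.

# Hypothetical sketch of the MCMC phase; variable names and hyperparameters
# are assumptions for illustration.
import numpy as np

def log_likelihood(w, traj_features, prefs, beta=1.0):
    """Pairwise preference likelihood: for each ranked pair, 'which trajectory
    has the higher return?' is treated as a softmax classification problem."""
    returns = traj_features @ w                          # return = dot(summed embedding, w)
    ll = 0.0
    for worse, better in prefs:
        r_worse, r_better = beta * returns[worse], beta * returns[better]
        ll += r_better - np.logaddexp(r_worse, r_better)  # log P(better preferred)
    return ll

def mcmc_sample(traj_features, prefs, num_steps=200_000, step_size=0.05, seed=0):
    rng = np.random.default_rng(seed)
    dim = traj_features.shape[1]
    w = rng.normal(size=dim)
    w /= np.linalg.norm(w)                               # keep weights on the unit sphere
    ll = log_likelihood(w, traj_features, prefs)
    chain = []
    for _ in range(num_steps):
        proposal = w + step_size * rng.normal(size=dim)
        proposal /= np.linalg.norm(proposal)
        ll_prop = log_likelihood(proposal, traj_features, prefs)
        # Metropolis-Hastings accept/reject with a symmetric Gaussian proposal
        if np.log(rng.uniform()) < ll_prop - ll:
            w, ll = proposal, ll_prop
        chain.append(w.copy())
    return np.array(chain)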
High-Confidence Performance Bounds
Having a distribution over reward functions allows for more in-depth policy evaluation
Enables the development of safety guarantees, such as high-confidence bounds on policy performance
Policies whose predicted return distribution has a heavier left tail may be riskier or exhibit undesirable behavior
The example to the right shows evaluation metrics for four standard Beamrider policies as well as a no-op policy
In this particular reward distribution, it was observed that long-lived policies were strongly favored
A no-op policy was added to test this, since it can survive for tens of thousands of frames. This is a quirk of Beamrider: after the player is shot, the game does not send new enemy ships until the player moves. However, this policy is not capable of scoring ground-truth points
While this policy was assigned a high mean due to its long lifespan, Bayesian REX is also able to recognize it as high risk
The 0.05-VaR bound (the 5th-percentile predicted return) is deeply negative, indicating that this policy is off-distribution and may be undesirable; a sketch of this computation appears after this list
It is then straightforward to re-run the sampling process with the no-op behavior added as a poorly-ranked demonstration, yielding a new posterior that no longer has this flaw
Other adversarial demonstration experiments have followed a similar pattern
A high risk estimate is a good indicator of reward gaming, out-of-distribution behavior, or otherwise undesirable behavior
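Continuing the hypothetical sketch above, evaluating a policy against the posterior reduces to one dot product per MCMC sample; the 0.05-VaR is then just the 5th percentile of the resulting return distribution. The name policy_features (the summed embedding of the evaluated policy's rollouts) and the function below are illustrative assumptions.

# Hypothetical policy evaluation against the posterior from mcmc_sample above.
import numpy as np

def evaluate_policy(chain, policy_features, alpha=0.05):
    """chain: (num_samples, dim) posterior weight samples;
    policy_features: (dim,) summed embedding of the evaluated policy's rollouts."""
    returns = chain @ policy_features          # one predicted return per posterior sample
    return {
        "mean": float(returns.mean()),
        # alpha-VaR: a lower bound on the return that holds with probability
        # 1 - alpha under the posterior; a very low value flags risky or
        # out-of-distribution behavior even when the mean looks good,
        # as with the no-op Beamrider policy.
        "VaR": float(np.quantile(returns, alpha)),
    }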
Performance of Bayesian REX
We measured the performance of Bayesian REX across five Atari games, compared with T-REX [1] and GAIL [2], two state-of-the-art imitation learning approaches
We tested optimizing policies against both the mean and the MAP reward functions from the learned posterior
Performance is measured by ground-truth score
The dotted line indicates the performance of the best demonstration
Performance is normalized by the best demonstration, so a score of two indicates double the demonstrator's performance, and so on
B-REX is highly competitive for best overall performance, achieving state-of-the-art on three out of five games
Note that T-REX is closely related to B-REX, differing mainly in that it produces only a single point estimate of the reward function; as expected, its performance is the closest, and it wins on the remaining two of the five games
Since B-REX is essentially an adaptation of T-REX to a Bayesian setting, this is an ideal outcome: Bayesian analysis can be layered on top of T-REX without hurting game performance, and the B-REX MAP reward often outperforms T-REX
It is also important to note that, unlike B-REX, the other approaches do not produce a full reward distribution; this comparison covers only ground-truth game performance, and only B-REX supports the risk-based analysis described previously
[1] Brown, Daniel, et al. "Extrapolating Beyond Suboptimal Demonstrations via Inverse Reinforcement Learning from Observations." ICML, 2019.
[2] Ho, Jonathan, and Stefano Ermon. "Generative Adversarial Imitation Learning." NeurIPS, 2016.