Safe Imitation Learning via Fast Bayesian Reward Inference from Preferences

Daniel S. Brown, Russell Coleman, Ravi Srinivasan, Scott Niekum

University of Texas at Austin

In Proceedings of the Thirty-seventh International Conference on Machine Learning (ICML) 2020.


  • Previous approaches to imitation learning typically fall into one of two categories

    • Behavioral cloning, which learns a mapping from states to actions in a supervised fashion

      • Rarely outperforms the demonstrator and suffers from compounding errors when the agent drifts outside the states covered by the demonstrations

    • Inverse reinforcement learning (IRL), which seeks to explain the demonstrator's behavior by learning a reward function

      • Standard reinforcement learning (RL) techniques can then be used to extract a policy from the learned reward

      • However, most approaches yield only a single point estimate of the reward function, and obtaining many posterior samples is computationally intractable

  • We propose Bayesian Reward Extrapolation (Bayesian REX), a novel inverse reinforcement learning approach which generates a distribution over reward functions in an efficient way

  • This Bayesian approach allows for:

    • Safety guarantees via high confidence performance bounds

      • Increasingly important for real-world deployment, e.g. self-driving cars, consumer-facing robotics, or risk-sensitive professional applications

    • Reasoning about uncertainty

    • Improved policy evaluation

    • Detection of undesirable behavior or reward gaming


  • Bayesian REX consists of two distinct phases:

  1. Pre-training

  2. Sampling


Pre-training

  • Converts raw pixel data to a lower-dimensional embedding

  • Trained with five losses (four self-supervised plus the T-REX ranking loss), illustrated in the diagram to the right

    • Variational autoencoder (VAE): Reconstruct original frames from lower-dimensional embedding

    • Temporal distance: Estimate how much time passed between two frames

    • Inverse dynamics: Predict action taken by agent between two frames

    • Forward dynamics: Predict future frames given some actions

    • T-REX: Predict which of two demonstrations is preferred with respect to the provided ranking
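The T-REX ranking loss above is a pairwise cross-entropy over trajectory returns. A minimal numpy sketch (not the authors' code), assuming a trajectory's return is the dot product of a weight vector with its summed embedding features:

```python
import numpy as np

def trex_preference_loss(return_i, return_j):
    """Cross-entropy loss for the preference "trajectory j is ranked above i".

    P(tau_j preferred) = exp(R_j) / (exp(R_i) + exp(R_j)),
    computed in log-space for numerical stability.
    """
    # log P(j preferred) = R_j - logsumexp([R_i, R_j])
    m = max(return_i, return_j)
    log_z = m + np.log(np.exp(return_i - m) + np.exp(return_j - m))
    return -(return_j - log_z)

# Hypothetical example: returns from a weight vector and summed features
w = np.array([0.5, -0.2, 0.1])
phi_i = np.array([1.0, 2.0, 0.5])   # summed features of the worse trajectory
phi_j = np.array([3.0, 1.0, 0.2])   # summed features of the preferred one
loss = trex_preference_loss(w @ phi_i, w @ phi_j)
```

Minimizing this loss over ranked pairs pushes the embedding to assign higher returns to better-ranked demonstrations.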


Sampling

  • Uses the learned embedding to sample reward functions

    • Reward function is assumed to be a linear combination of embedding features

  • Metropolis–Hastings MCMC sampling

    • Each step proposes a candidate reward function, represented as a single weight vector over the embedding features

    • The return of a trajectory is the dot product of this vector with the encoded features of a given demonstration

    • A candidate reward function is probabilistically accepted or rejected based on how well it explains the provided ranking

      • The likelihood treats every pair of ranked demonstrations as a binary classification problem: predict which of the two has higher return

  • This process is highly efficient and easily hardware-accelerated, allowing for hundreds of thousands of samples
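The sampling loop can be sketched in a few lines of numpy. This is a simplified illustration rather than the authors' implementation; `features` (one summed embedding per demonstration) and the ranked pair list `prefs` are hypothetical stand-ins, and the unit-norm projection is one common way to fix the scale of the weights:

```python
import numpy as np

def log_likelihood(w, features, prefs, beta=1.0):
    """Pairwise Bradley-Terry likelihood over ranked demonstrations:
    for each pair (i, j) with j preferred, add log P(R(tau_j) > R(tau_i))."""
    returns = features @ w                       # return of each demo: w . phi
    total = 0.0
    for i, j in prefs:                           # j is the preferred demonstration
        diff = beta * (returns[j] - returns[i])
        total += -np.log1p(np.exp(-diff))        # log sigmoid(diff)
    return total

def mcmc_reward_samples(features, prefs, n_steps=10_000, step_size=0.05, seed=0):
    """Metropolis-Hastings random walk over reward weight vectors."""
    rng = np.random.default_rng(seed)
    w = np.zeros(features.shape[1])
    ll = log_likelihood(w, features, prefs)
    samples = []
    for _ in range(n_steps):
        w_prop = w + step_size * rng.standard_normal(w.shape)
        w_prop /= np.linalg.norm(w_prop)         # keep weights on the unit sphere
        ll_prop = log_likelihood(w_prop, features, prefs)
        if np.log(rng.random()) < ll_prop - ll:  # accept with prob min(1, ratio)
            w, ll = w_prop, ll_prop
        samples.append(w)
    return np.array(samples)
```

Because each likelihood evaluation is just a matrix-vector product followed by a handful of pairwise comparisons, the chain is cheap per step and easy to vectorize or hardware-accelerate.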

High-Confidence Performance Bounds

  • Having a distribution over reward functions allows for more in-depth policy evaluation

    • Development of safety guarantees possible

  • Policies whose return distribution has a heavier left tail are riskier and may exhibit undesirable behavior

    • The example to the right shows evaluation metrics for four standard Beamrider policies as well as a no-op policy

      • In this particular reward distribution, long-lived policies were strongly favored

      • A no-op policy was added to test this, since it can survive for tens of thousands of frames: a quirk of Beamrider is that enemy ships are not sent after the player is shot until the player moves again. However, this policy cannot score any ground-truth points

      • While this policy was assigned a high mean due to its long lifespan, Bayesian REX is also able to recognize it as high risk

        • The 0.05-VaR bound is strongly negative, indicating that this policy is out of distribution and may be undesirable

      • It is then straightforward to re-run the sampling process with the no-op behavior included as a poorly-ranked additional demonstration, yielding a new distribution without this flaw

    • Other adversarial demonstration experiments have followed a similar pattern

      • High risk under the posterior is a good indicator of reward gaming, out-of-distribution behavior, or otherwise undesirable behavior
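Given posterior samples of the reward weights, the 0.05-VaR used above is simply the 5th percentile of a policy's return distribution under the posterior. A minimal sketch, where the sampled weight matrix `W` and the policy's summed feature vector `phi_pi` are hypothetical placeholders:

```python
import numpy as np

def value_at_risk(weight_samples, phi_policy, alpha=0.05):
    """alpha-VaR of a policy's return under the reward posterior:
    the return the policy exceeds with probability 1 - alpha."""
    returns = weight_samples @ phi_policy   # one return per posterior sample
    return np.quantile(returns, alpha)

# Hypothetical posterior samples (rows) over 3 reward features
rng = np.random.default_rng(0)
W = rng.normal(size=(100_000, 3))
phi_pi = np.array([1.0, 0.5, -0.2])
var_05 = value_at_risk(W, phi_pi)  # 5th-percentile return under the posterior
```

A policy like the no-op above can have a high posterior mean return while its 0.05-VaR is strongly negative, which is exactly the risk signal this analysis exposes.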

Performance of Bayesian REX

  • We measured the performance of Bayesian REX across five Atari games, compared with T-REX [1] and GAIL [2], two state-of-the-art imitation learning approaches

  • We tested optimizing a policy based on the mean and MAP reward function from the learned posterior

  • Performance is measured by ground-truth score

  • The dotted line indicates the performance of the best demonstration

  • Performance is normalized by the best demonstration, so a score of two indicates double the demonstrator's performance

  • B-REX is highly competitive for best overall performance, achieving state-of-the-art on three out of five games

    • Note that T-REX is closely related to B-REX, differing only in that it produces a single point estimate of the reward function. Its performance is therefore expected to be the closest; it wins on the other two of the five games

    • Since B-REX is essentially an adaptation of T-REX to a Bayesian setting, this is encouraging: it demonstrates that Bayesian analysis can be layered on top of T-REX without hurting game performance; on the contrary, the B-REX MAP reward often outperforms T-REX

  • It is also important to note that, unlike B-REX, the other approaches do not produce a full reward distribution. This comparison covers only ground-truth game performance; only B-REX can perform the risk-based analysis described above

[1] Brown, Daniel, et al. "Extrapolating Beyond Suboptimal Demonstrations via Inverse Reinforcement Learning from Observations." ICML, 2019.
[2] Ho, Jonathan, and Stefano Ermon. "Generative Adversarial Imitation Learning." NeurIPS, 2016.