Safe Imitation Learning via Fast Bayesian Reward Inference from Preferences

Daniel S. Brown, Russell Coleman, Ravi Srinivasan, Scott Niekum

University of Texas at Austin

In Proceedings of the Thirty-seventh International Conference on Machine Learning (ICML) 2020.


  • Previous approaches to imitation learning typically fall into one of two categories

    • Behavioral cloning, which learns a mapping from states to actions in a supervised fashion

      • Rarely outperforms the demonstrator and suffers from compounding errors when the agent drifts outside the states covered by the demonstrations

    • Inverse reinforcement learning (IRL), which seeks to explain the demonstrator's behavior by learning a reward function

      • Standard reinforcement learning (RL) techniques can then be used to extract a policy from the learned reward

      • However, most approaches yield only a single point estimate of the reward function, and obtaining many posterior samples is computationally intractable

  • We propose Bayesian Reward Extrapolation (Bayesian REX), a novel inverse reinforcement learning approach which generates a distribution over reward functions in an efficient way

  • This Bayesian approach allows for:

    • Safety guarantees via high confidence performance bounds

      • Increasingly important for real-world deployment, e.g. self-driving cars, consumer-facing robotics, or risk-sensitive professional applications

    • Reasoning about uncertainty

    • Improved policy evaluation

    • Detection of undesirable behavior or reward gaming


  • Bayesian REX consists of two distinct phases:

  1. Pre-training

  2. Sampling


Pre-training

  • Converts raw pixel data to a lower-dimensional embedding

  • Trained with five losses (four self-supervised plus the T-REX ranking loss), illustrated in the diagram to the right

    • Variational autoencoder (VAE): Reconstruct original frames from lower-dimensional embedding

    • Temporal distance: Estimate how much time passed between two frames

    • Inverse dynamics: Predict action taken by agent between two frames

    • Forward dynamics: Predict future frames given some actions

    • T-REX: Predict which of two demonstrations is preferred with respect to the provided ranking
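The T-REX ranking loss above is a pairwise cross-entropy over trajectory returns. A minimal numpy sketch (not the authors' code), assuming a trajectory's return is the dot product of a weight vector with its summed embedding features:

```python
import numpy as np

def trex_preference_loss(return_i, return_j):
    """Cross-entropy loss for the preference "trajectory j is ranked above i".

    P(tau_j preferred) = exp(R_j) / (exp(R_i) + exp(R_j)),
    computed in log-space for numerical stability.
    """
    # log P(j preferred) = R_j - logsumexp([R_i, R_j])
    m = max(return_i, return_j)
    log_z = m + np.log(np.exp(return_i - m) + np.exp(return_j - m))
    return -(return_j - log_z)

# Hypothetical example: returns from a weight vector and summed features
w = np.array([0.5, -0.2, 0.1])
phi_i = np.array([1.0, 2.0, 0.5])   # summed features of the worse trajectory
phi_j = np.array([3.0, 1.0, 0.2])   # summed features of the preferred one
loss = trex_preference_loss(w @ phi_i, w @ phi_j)
```

Minimizing this loss over ranked pairs pushes the embedding to assign higher returns to better-ranked demonstrations.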


Sampling

  • Uses the learned embedding to sample reward functions

    • Reward function is assumed to be a linear combination of embedding features

  • Metropolis–Hastings MCMC sampling

    • Each step proposes a candidate reward function, represented as a single weight vector over the embedding features

    • The return of a trajectory is the dot product of this vector with the encoded features of a given demonstration

    • A candidate reward function is probabilistically accepted or rejected based on how well it explains the provided ranking

      • The likelihood treats every pair of ranked demonstrations as a binary classification problem: predict which of the two has higher return

  • This process is highly efficient and easily hardware-accelerated, allowing for hundreds of thousands of samples
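The sampling loop can be sketched in a few lines of numpy. This is a simplified illustration rather than the authors' implementation; `features` (one summed embedding per demonstration) and the ranked pair list `prefs` are hypothetical stand-ins, and the unit-norm projection is one common way to fix the scale of the weights:

```python
import numpy as np

def log_likelihood(w, features, prefs, beta=1.0):
    """Pairwise Bradley-Terry likelihood over ranked demonstrations:
    for each pair (i, j) with j preferred, add log P(R(tau_j) > R(tau_i))."""
    returns = features @ w                       # return of each demo: w . phi
    total = 0.0
    for i, j in prefs:                           # j is the preferred demonstration
        diff = beta * (returns[j] - returns[i])
        total += -np.log1p(np.exp(-diff))        # log sigmoid(diff)
    return total

def mcmc_reward_samples(features, prefs, n_steps=10_000, step_size=0.05, seed=0):
    """Metropolis-Hastings random walk over reward weight vectors."""
    rng = np.random.default_rng(seed)
    w = np.zeros(features.shape[1])
    ll = log_likelihood(w, features, prefs)
    samples = []
    for _ in range(n_steps):
        w_prop = w + step_size * rng.standard_normal(w.shape)
        w_prop /= np.linalg.norm(w_prop)         # keep weights on the unit sphere
        ll_prop = log_likelihood(w_prop, features, prefs)
        if np.log(rng.random()) < ll_prop - ll:  # accept with prob min(1, ratio)
            w, ll = w_prop, ll_prop
        samples.append(w)
    return np.array(samples)
```

Because each likelihood evaluation is just a matrix-vector product followed by a handful of pairwise comparisons, the chain is cheap per step and easy to vectorize or hardware-accelerate.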

High-Confidence Performance Bounds

  • Having a distribution over reward functions allows for more in-depth policy evaluation

    • Development of safety guarantees possible

  • Policies whose return distribution has a heavier left tail are riskier and may exhibit undesirable behavior

    • The example to the right shows evaluation metrics for four standard Beamrider policies as well as a no-op policy

      • In this particular reward distribution, long-lived policies were strongly favored

      • A no-op policy was added to test this, since it can survive for tens of thousands of frames: a quirk of Beamrider is that enemy ships are not sent after the player is shot until the player moves again. However, this policy cannot score any ground-truth points

      • While this policy was assigned a high mean due to its long lifespan, Bayesian REX is also able to recognize it as high risk

        • The 0.05-VaR bound is strongly negative, indicating that this policy is out of distribution and may be undesirable

      • It is then straightforward to re-run the sampling process with the no-op behavior included as a poorly-ranked additional demonstration, yielding a new distribution without this flaw

    • Other adversarial demonstration experiments have followed a similar pattern

      • High risk under the posterior is a good indicator of reward gaming, out-of-distribution behavior, or otherwise undesirable behavior
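Given posterior samples of the reward weights, the 0.05-VaR used above is simply the 5th percentile of a policy's return distribution under the posterior. A minimal sketch, where the sampled weight matrix `W` and the policy's summed feature vector `phi_pi` are hypothetical placeholders:

```python
import numpy as np

def value_at_risk(weight_samples, phi_policy, alpha=0.05):
    """alpha-VaR of a policy's return under the reward posterior:
    the return the policy exceeds with probability 1 - alpha."""
    returns = weight_samples @ phi_policy   # one return per posterior sample
    return np.quantile(returns, alpha)

# Hypothetical posterior samples (rows) over 3 reward features
rng = np.random.default_rng(0)
W = rng.normal(size=(100_000, 3))
phi_pi = np.array([1.0, 0.5, -0.2])
var_05 = value_at_risk(W, phi_pi)  # 5th-percentile return under the posterior
```

A policy like the no-op above can have a high posterior mean return while its 0.05-VaR is strongly negative, which is exactly the risk signal this analysis exposes.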

Performance of Bayesian REX

  • We measured the performance of Bayesian REX across five Atari games, compared with T-REX [1] and GAIL [2], two state-of-the-art imitation learning approaches

  • We tested optimizing a policy based on the mean and MAP reward function from the learned posterior

  • Performance is measured by ground-truth score

  • The dotted line indicates the performance of the best demonstration

  • Performance is normalized by the best demonstration, so a score of two indicates double the demonstrator's performance

  • B-REX is highly competitive for best overall performance, achieving state-of-the-art on three out of five games

    • Note that T-REX is closely related to B-REX, differing only in that it produces a single point estimate of the reward function. Its performance is therefore expected to be the closest; it wins on the other two of the five games

    • Since B-REX is essentially an adaptation of T-REX to a Bayesian setting, this is encouraging: it demonstrates that Bayesian analysis can be layered on top of T-REX without hurting game performance; on the contrary, the B-REX MAP reward often outperforms T-REX

  • It is also important to note that, unlike B-REX, the other approaches do not produce a full reward distribution. This comparison covers only ground-truth game performance; only B-REX can perform the risk-based analysis described above

[1] Brown, Daniel, et al. "Extrapolating Beyond Suboptimal Demonstrations via Inverse Reinforcement Learning from Observations." ICML, 2019.
[2] Ho, Jonathan, and Stefano Ermon. "Generative Adversarial Imitation Learning." NeurIPS, 2016.