# Safe Imitation Learning via Fast Bayesian Reward Inference from Preferences

**Daniel S. Brown, Russell Coleman, Ravi Srinivasan, Scott Niekum**

## University of Texas at Austin

**In Proceedings of the Thirty-seventh International Conference on Machine Learning (ICML), 2020.**

## Motivation

Previous approaches to imitation learning typically fall into one of two categories:

- **Behavioral cloning**, which learns a mapping from states to actions in a supervised fashion
  - Rarely outperforms the demonstrator and suffers from compounding errors when it drifts out of the demonstrated distribution
- **Inverse reinforcement learning (IRL)**, which seeks to **explain** the demonstrator by learning a reward function
  - Standard reinforcement learning (RL) techniques can then be used to extract a policy
  - However, most approaches give only a single **point estimate** of this function, and it can be computationally intractable to obtain many samples

We propose Bayesian Reward Extrapolation (Bayesian REX), a novel inverse reinforcement learning approach that efficiently generates a **distribution over reward functions**. This Bayesian approach allows for:

- Safety guarantees via **high-confidence performance bounds**, an increasingly important requirement for real-world deployment, e.g. self-driving cars, consumer-facing robotics, or risk-sensitive professional applications
- Reasoning about uncertainty
- Improved policy evaluation
- Detection of undesirable behavior or reward gaming

## Method

Bayesian REX consists of two distinct phases:

1. Pre-training
2. Sampling

### Pre-training

- Converts raw pixel data to a lower-dimensional embedding
- Trained with five self-supervised losses:
  - **Variational autoencoder (VAE):** Reconstruct the original frames from the lower-dimensional embedding
  - **Temporal difference:** Estimate how much time passed between two frames
  - **Inverse dynamics:** Predict the action taken by the agent between two frames
  - **Forward dynamics:** Predict future frames given a sequence of actions
  - **T-REX:** Predict which of two demonstrations is preferred with respect to the provided ranking (a sketch of this loss is given below)
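As a concrete illustration, below is a minimal PyTorch sketch of the T-REX preference loss from the list above. The encoder architecture, tensor shapes, and helper names (e.g. `trex_preference_loss`) are assumptions for illustration, not the exact network used in the paper.

```python
# Minimal sketch (assumed shapes and architecture) of the T-REX preference
# loss used during pre-training: for a ranked pair of trajectories, the
# predicted return of the preferred trajectory should be higher, trained
# as a binary classification (Bradley-Terry) problem.
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Maps stacked frames to a low-dimensional embedding phi(s)."""
    def __init__(self, embed_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 16, kernel_size=7, stride=3), nn.LeakyReLU(),
            nn.Conv2d(16, 16, kernel_size=5, stride=2), nn.LeakyReLU(),
            nn.Flatten(),
            nn.LazyLinear(embed_dim),
        )

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        return self.net(frames)

def trex_preference_loss(encoder, reward_head, traj_i, traj_j, j_preferred=True):
    """Classify which of two trajectories has the higher predicted return."""
    ret_i = reward_head(encoder(traj_i)).sum()  # return = sum of per-state rewards
    ret_j = reward_head(encoder(traj_j)).sum()
    logits = torch.stack([ret_i, ret_j]).unsqueeze(0)   # shape (1, 2)
    label = torch.tensor([1 if j_preferred else 0])     # index of the preferred trajectory
    return nn.functional.cross_entropy(logits, label)

# Usage with random stand-in data (stacks of four 84x84 frames):
encoder, reward_head = Encoder(), nn.Linear(64, 1)
traj_i = torch.randn(50, 4, 84, 84)
traj_j = torch.randn(50, 4, 84, 84)  # assume traj_j is ranked above traj_i
loss = trex_preference_loss(encoder, reward_head, traj_i, traj_j, j_preferred=True)
loss.backward()
```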

### Sampling

- Uses the learned embedding to sample reward functions
- The reward function is assumed to be a **linear combination** of the embedding features
- Metropolis–Hastings MCMC sampling (a sketch is given after this list):
  - At every step, a candidate reward function is sampled, represented as a single vector of floating-point weights
  - The return of a trajectory is the dot product of this vector with the encoded features of the demonstration
  - The candidate is probabilistically accepted or rejected based on how well it matches the provided ranking
  - The likelihood is computed by treating each pair of demonstrations as a classification problem: predict which of the two has the higher return
- This process is **highly efficient** and easily hardware-accelerated, allowing for hundreds of thousands of samples
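Below is a minimal NumPy sketch of this sampling phase under assumed inputs (random demonstration features and a fully-ordered ranking); function names and hyperparameters are illustrative, not the paper's exact implementation.

```python
# Minimal sketch of Metropolis-Hastings sampling over linear reward weights w.
# A trajectory's return is the dot product of w with its pre-computed embedding
# features, and the likelihood scores each ranked pair with a pairwise softmax
# (Bradley-Terry) classification of which trajectory has the higher return.
import numpy as np

rng = np.random.default_rng(0)
EMBED_DIM = 64

# Summed embedding features of each ranked demonstration (in practice these
# come from the frozen encoder of the pre-training phase; random stand-ins here).
demo_features = rng.normal(size=(12, EMBED_DIM))
# Pairs (i, j) meaning demonstration j is preferred over demonstration i.
pref_pairs = [(i, j) for i in range(12) for j in range(i + 1, 12)]

def log_likelihood(w: np.ndarray) -> float:
    """Sum of pairwise log-probabilities that the preferred demo has higher return."""
    returns = demo_features @ w                     # one predicted return per demo
    total = 0.0
    for i, j in pref_pairs:
        # log softmax over the pair {i, j}, with j preferred
        total += returns[j] - np.logaddexp(returns[i], returns[j])
    return total

def mcmc(num_samples: int = 100_000, step_size: float = 0.05) -> np.ndarray:
    w = rng.normal(size=EMBED_DIM)
    w /= np.linalg.norm(w)                          # fix the reward scale (one common choice)
    ll = log_likelihood(w)
    samples = []
    for _ in range(num_samples):
        proposal = w + step_size * rng.normal(size=EMBED_DIM)
        proposal /= np.linalg.norm(proposal)
        ll_prop = log_likelihood(proposal)
        # Probabilistically accept or reject relative to the preference likelihood.
        if np.log(rng.uniform()) < ll_prop - ll:
            w, ll = proposal, ll_prop
        samples.append(w.copy())
    return np.array(samples)

posterior_samples = mcmc(num_samples=5_000)
```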

## High-Confidence Performance Bounds

- Having a distribution over reward functions allows for more in-depth policy evaluation
- It also makes the development of **safety guarantees** possible (a sketch of the evaluation is given below)
- Policies whose return distribution has a larger left tail may be riskier or contain undesirable behavior
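As an illustration, the following NumPy sketch computes a policy's return distribution from posterior reward samples and the policy's expected feature counts, and reports the posterior mean and the 0.05-VaR (5th-percentile) lower bound; all inputs and names here are hypothetical stand-ins.

```python
# Minimal sketch of high-confidence policy evaluation under the reward posterior:
# each posterior weight vector induces one predicted return for the policy, and
# the 0.05-VaR is the 5th percentile of that return distribution.
import numpy as np

def evaluate_policy(posterior_samples: np.ndarray,
                    policy_features: np.ndarray,
                    alpha: float = 0.05):
    returns = posterior_samples @ policy_features   # one return per posterior sample
    mean_return = returns.mean()
    var_alpha = np.quantile(returns, alpha)         # alpha-VaR: (1 - alpha) of samples exceed this
    return mean_return, var_alpha

# Usage with random stand-ins for the real quantities:
rng = np.random.default_rng(1)
posterior_samples = rng.normal(size=(100_000, 64))  # e.g. output of the MCMC sketch above
policy_features = rng.normal(size=64)               # expected embedding feature counts of the policy
mean_return, var_05 = evaluate_policy(posterior_samples, policy_features)
```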

As an example, we report evaluation metrics for four standard Beamrider policies as well as a no-op policy

In this particular reward distribution, we observed that long-lived policies were strongly favored

A no-op policy was added to test this, since it can survive for tens of thousands of frames: a quirk of Beamrider is that after the player is shot, the game waits for the player to move before sending more enemy ships. However, a no-op policy cannot score any ground-truth points

While this policy was assigned a high mean return due to its long lifespan, Bayesian REX is also able to recognize it as high risk

Its 0.05-VaR bound is strongly negative, indicating that this policy is off-distribution and may be undesirable

It is then straightforward to re-run the sampling process with the no-op behavior added as a poorly-ranked demonstration, which yields a new posterior that no longer has this flaw

Other adversarial demonstration experiments have followed a similar pattern

High risk is a good indicator of reward gaming, out-of-distribution behavior, or otherwise undesirable behavior

## Performance of Bayesian REX

We measured the performance of Bayesian REX across five Atari games, compared with T-REX [1] and GAIL [2], two state-of-the-art imitation learning approaches

We tested optimizing policies with respect to both the mean and the MAP reward functions from the learned posterior

Performance is measured by ground-truth score

In the results plot, the dotted line indicates the performance of the best demonstration

Performance is normalized by the best demonstration, so a score of two indicates doubling the demonstrator's performance

B-REX is highly competitive, achieving state-of-the-art performance on three of the five games

Note that T-REX is very similar to B-REX, differing only in that it provides a single point estimate of the reward function. As such, its performance is expected to be the closest; it wins on the other two of the five games

Since B-REX is essentially an adaptation of T-REX to a Bayesian setting, this is ideal: it demonstrates that Bayesian analysis can be added on top of T-REX without hurting game performance; on the contrary, the B-REX MAP reward often outperforms T-REX

It is also important to note that the other approaches do not produce an entire reward distribution as B-REX does. This comparison is strictly on ground-truth game performance; only B-REX can perform the risk-based analysis described above

[1] Brown, Daniel, et al. "Extrapolating Beyond Suboptimal Demonstrations via Inverse Reinforcement Learning from Observations." *ICML*, 2019.

[2] Ho, Jonathan, and Stefano Ermon. "Generative Adversarial Imitation Learning." *NeurIPS*, 2016.