Learning Human Objectives by Evaluating Hypothetical Behavior

Abstract: We seek to align agent behavior with a user's objectives in a reinforcement learning setting with unknown dynamics, an unknown reward function, and unknown unsafe states. The user knows the rewards and unsafe states, but querying the user is expensive. To address this challenge, we propose an algorithm that safely and interactively learns a model of the user's reward function. We start with a generative model of initial states and a forward dynamics model trained on off-policy data. Our method uses these models to synthesize hypothetical behaviors, asks the user to label the behaviors with rewards, and trains a neural network to predict the rewards. The key idea is to actively synthesize the hypothetical behaviors from scratch by maximizing tractable proxies for the value of information, without interacting with the environment. We call this method reward query synthesis via trajectory optimization (ReQueST). We evaluate ReQueST with simulated users on a state-based 2D navigation task and the image-based Car Racing video game. The results show that ReQueST significantly outperforms prior methods in learning reward models that transfer to new environments with different initial state distributions. Moreover, ReQueST safely trains the reward model to detect unsafe states, and corrects reward hacking before deploying the agent.

Our method for safely aligning agent behavior with a user's objectives: interactively learning a reward model from user feedback on hypothetical behaviors, then deploying a model-based RL agent that optimizes the learned rewards.
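To make the loop concrete, here is a minimal sketch of the interactive reward-learning phase on a toy 2D navigation task with a simulated user. The linear reward features, the bootstrap ridge-regression ensemble, and the selection of queries from sampled rollouts are illustrative assumptions for this sketch; the actual method trains neural network reward models and synthesizes queries with gradient-based trajectory optimization, as sketched further down this page.

```python
# Minimal sketch of a ReQueST-style interactive loop on a toy 2D navigation task,
# with a simulated user whose true reward is (negative) distance to a goal.
# All modeling choices here are illustrative assumptions, not the paper's implementation.
import numpy as np

rng = np.random.default_rng(0)
GOAL = np.array([1.0, 1.0])
HORIZON, N_CANDIDATES, N_ROUNDS, ENSEMBLE = 10, 256, 5, 4

def true_reward(traj):
    # Simulated user: labels a trajectory with its total (negative) distance to the goal.
    return -np.linalg.norm(traj - GOAL, axis=1).sum()

def features(traj):
    # Hand-picked trajectory features (an assumption for this sketch).
    d = np.linalg.norm(traj - GOAL, axis=1)
    return np.array([d.sum(), d.min(), d[-1]])

def sample_trajectories(n):
    # Stand-in for rollouts of a learned generative/dynamics model:
    # random walks from random initial states.
    starts = rng.uniform(-1, 1, size=(n, 1, 2))
    steps = rng.normal(scale=0.2, size=(n, HORIZON, 2)).cumsum(axis=1)
    return starts + steps

def fit_ensemble(X, y):
    # Bootstrap ensemble of ridge-regression reward models.
    models = []
    for _ in range(ENSEMBLE):
        idx = rng.integers(0, len(X), len(X))
        A, b = X[idx], y[idx]
        w = np.linalg.solve(A.T @ A + 1e-3 * np.eye(A.shape[1]), A.T @ b)
        models.append(w)
    return np.stack(models)

# Interactive loop: synthesize queries, ask the (simulated) user for labels, retrain.
X, y = [], []
for traj in sample_trajectories(4):                      # small seed set
    X.append(features(traj)); y.append(true_reward(traj))
for _ in range(N_ROUNDS):
    W = fit_ensemble(np.array(X), np.array(y))
    cands = sample_trajectories(N_CANDIDATES)
    phi = np.array([features(t) for t in cands])
    preds = phi @ W.T                                    # (candidates, ensemble)
    # One query per proxy: max reward, min reward, max disagreement, max novelty.
    picks = {preds.mean(1).argmax(), preds.mean(1).argmin(), preds.std(1).argmax()}
    novelty = np.linalg.norm(phi[:, None] - np.array(X)[None], axis=2).min(1)
    picks.add(novelty.argmax())
    for i in picks:
        X.append(phi[i]); y.append(true_reward(cands[i]))  # user labels the query
print(f"reward model trained on {len(y)} labeled hypotheticals")
```

Each round picks one query per acquisition proxy (maximize reward, minimize reward, maximize ensemble disagreement, maximize novelty) and adds the user's labels to the training set before refitting the reward model.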

The Car Racing environment

Unlike the real Car Racing environment shown above, the examples below are hypothetical behaviors synthesized from scratch using a VAE image decoder and an LSTM dynamics model.
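As a rough illustration of how such a behavior can be synthesized without interacting with the environment, the sketch below runs gradient ascent on a sequence of latent perturbations through a frozen decoder, LSTM dynamics model, and reward model. The tiny untrained networks, the latent-perturbation parameterization, and the quadratic realism penalty are stand-ins for illustration, not the paper's exact models or objective.

```python
# Sketch: synthesize a hypothetical trajectory by trajectory optimization in latent space.
import torch
import torch.nn as nn

LATENT, HIDDEN, HORIZON = 8, 16, 20

# Untrained stand-ins for the learned models (in practice, pre-trained on off-policy data).
decoder = nn.Sequential(nn.Linear(LATENT, 64), nn.ReLU(), nn.Linear(64, 3 * 8 * 8))
dynamics = nn.LSTM(input_size=LATENT, hidden_size=HIDDEN, batch_first=True)
to_latent = nn.Linear(HIDDEN, LATENT)      # maps the LSTM output to the next latent state
reward_model = nn.Sequential(nn.Linear(LATENT, 32), nn.ReLU(), nn.Linear(32, 1))
for m in (decoder, dynamics, to_latent, reward_model):
    for p in m.parameters():
        p.requires_grad_(False)            # the learned models stay frozen

# Decision variables: an initial latent state and a sequence of latent perturbations.
z0 = torch.zeros(1, 1, LATENT, requires_grad=True)
deltas = torch.zeros(1, HORIZON, LATENT, requires_grad=True)
opt = torch.optim.Adam([z0, deltas], lr=0.05)

for step in range(200):
    opt.zero_grad()
    z, h, latents = z0, None, []
    for t in range(HORIZON):               # roll out the learned dynamics in latent space
        out, h = dynamics(z, h)
        z = to_latent(out) + deltas[:, t:t + 1, :]
        latents.append(z)
    traj = torch.cat(latents, dim=1)       # (1, HORIZON, LATENT)
    acq = reward_model(traj).sum()         # acquisition proxy: maximize predicted reward
    realism = -(traj ** 2).mean()          # crude stand-in for likelihood under the prior
    loss = -(acq + 1.0 * realism)          # the weight trades off informativeness vs. realism
    loss.backward()
    opt.step()

# Decode the optimized latent trajectory into a hypothetical video clip.
frames = decoder(traj.detach()).reshape(HORIZON, 3, 8, 8)
print(frames.shape)
```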


Realistic hypotheticals and informative hypotheticals, each synthesized under four query objectives: maximize reward, minimize reward, maximize uncertainty, and maximize novelty.

ReQueST enables trading off between synthesizing realistic and informative hypothetical behaviors.

When ReQueST is tuned to produce realistic queries, the reward-maximizing query shows the car driving down the road and making a turn. The reward-minimizing query shows the car going off-road as quickly as possible. The uncertainty-maximizing query shows the car driving to the edge of the road and slowing down. The novelty-maximizing query shows the car staying still.

When ReQueST is tuned to produce informative queries, most of the behaviors are qualitatively similar to their realistic counterparts, but they look less realistic and optimize the acquisition function more aggressively. Only the novelty-maximizing query is qualitatively different: it seeks the boundaries of the map (the white void) instead of staying still.
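One way to read this trade-off is as a single scalarized query objective: an informativeness term (one of the four proxies above) plus a realism term given by the trajectory's log-likelihood under the generative model, weighted by a coefficient. In the sketch below, the uncertainty proxy is ensemble disagreement and the novelty proxy is negative log-likelihood, which is consistent with the boundary-seeking behavior described above but is only one plausible choice; the weight `lam` is an assumed knob, not a parameter from the paper.

```python
# Sketch of a scalarized query objective: informativeness + lam * realism.
# `reward_preds` is an (ensemble, T) array of per-step reward predictions for a
# candidate trajectory; `log_prob` is its log-likelihood under the learned
# generative model. The novelty proxy and the knob `lam` are assumptions.
import numpy as np

def query_objective(reward_preds, log_prob, kind, lam=1.0):
    mean_return = reward_preds.mean(axis=0).sum()
    if kind == "max_reward":
        informativeness = mean_return
    elif kind == "min_reward":
        informativeness = -mean_return
    elif kind == "max_uncertainty":
        informativeness = reward_preds.std(axis=0).sum()   # ensemble disagreement
    elif kind == "max_novelty":
        informativeness = -log_prob                        # prefer unlikely trajectories
    else:
        raise ValueError(kind)
    # Larger lam favors realistic queries; smaller lam favors informative ones.
    return informativeness + lam * log_prob

# Example: score one random candidate under each objective, at two settings of lam.
rng = np.random.default_rng(0)
preds, lp = rng.normal(size=(4, 20)), -35.0
for lam in (1.0, 0.1):
    scores = {k: round(query_objective(preds, lp, k, lam), 2)
              for k in ("max_reward", "min_reward", "max_uncertainty", "max_novelty")}
    print(f"lam={lam}: {scores}")
```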