Efficient Preference-Based RL Using Learned Dynamics Models
Yi Liu, Gaurav Datta, Ellen Novoseller, Daniel S. Brown
Preference-based reinforcement learning (PbRL) can enable robots to learn to perform tasks based on an individual's preferences without requiring a hand-crafted reward function. However, existing approaches either assume access to a high-fidelity simulator or analytic model, or take a model-free approach that requires extensive, and possibly unsafe, online environment interaction. In this paper, we study the benefits and challenges of using a learned dynamics model when performing PbRL. In particular, we provide evidence that a learned dynamics model offers the following benefits when performing PbRL: (1) preference elicitation and policy optimization require significantly fewer environment interactions than model-free PbRL, (2) diverse preference queries can be synthesized safely and efficiently as a byproduct of standard model-based RL, and (3) reward pre-training based on suboptimal demonstrations can be performed without any environment interaction. Our paper provides empirical evidence that learned dynamics models enable robots to learn customized policies based on user preferences in ways that are safer and more sample efficient than prior preference-learning approaches.
Combining model-based RL with preference-based learning allows robots to use a learned dynamics model to learn customizable behaviors
A learned dynamics model can also be used to generate informative, diverse preference queries without environment interactions (a sketch of this query-generation step follows these highlights)
Learned dynamics models additionally enable reward pre-training without any need to collect samples from the environment
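To make the query-generation claim above concrete, here is a minimal sketch of one plausible way to synthesize queries from a learned dynamics model: roll out candidate action sequences entirely inside the model and ask the user about the pair of predicted trajectories on which a reward ensemble disagrees the most. The function arguments (`dynamics_model`, `reward_ensemble`) and the disagreement heuristic are illustrative assumptions, not the authors' exact procedure.

```python
import numpy as np

def synthesize_query(dynamics_model, reward_ensemble, init_state, candidate_action_seqs):
    """Roll out candidate action sequences in the learned dynamics model and
    return the two predicted trajectories whose returns the reward ensemble
    disagrees on the most (a common informativeness heuristic)."""
    trajectories, return_stds = [], []
    for actions in candidate_action_seqs:
        state, states = init_state, [init_state]
        for a in actions:
            state = dynamics_model(state, a)          # predicted next state
            states.append(state)
        trajectories.append((states, actions))
        # Per-member predicted return; spread across members measures uncertainty.
        member_returns = [sum(r(s, a) for s, a in zip(states[:-1], actions))
                          for r in reward_ensemble]
        return_stds.append(np.std(member_returns))

    # Query the two most uncertain candidates; no real-robot rollout is needed.
    top_two = np.argsort(return_stds)[-2:]
    return trajectories[top_two[0]], trajectories[top_two[1]]

# Toy usage with placeholder linear dynamics and a small random reward ensemble.
rng = np.random.default_rng(0)
A, B = 0.1 * rng.normal(size=(4, 4)), 0.1 * rng.normal(size=(4, 2))
dynamics = lambda s, a: s + A @ s + B @ a
rewards = [lambda s, a, w=rng.normal(size=4): float(w @ s) for _ in range(3)]
candidates = [rng.normal(size=(10, 2)) for _ in range(8)]
query = synthesize_query(dynamics, rewards, np.zeros(4), candidates)
```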
We hypothesize that MoP-RL will require fewer environment interactions than model-free RL while achieving similar or better performance. We compare MoP-RL and PEBBLE [31] on the Maze-LowDim, Maze-Image, and Assistive-Gym Itch Scratching tasks.
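Both PEBBLE and MoP-RL fit a reward network to pairwise preference labels; the standard objective in the PbRL literature is the Bradley-Terry cross-entropy loss over predicted segment returns. Below is a minimal PyTorch sketch; the network architecture and segment shapes are illustrative, not the authors' code.

```python
import torch
import torch.nn as nn

def preference_loss(reward_net, seg_a, seg_b, label):
    """Bradley-Terry loss over a pair of trajectory segments.

    seg_a, seg_b: tensors of shape (T, feature_dim) holding the two segments.
    label: 0.0 if the user prefers seg_a, 1.0 if the user prefers seg_b.
    """
    ret_a = reward_net(seg_a).sum()            # predicted return of segment A
    ret_b = reward_net(seg_b).sum()            # predicted return of segment B
    logits = torch.stack([ret_a, ret_b]).unsqueeze(0)
    # Cross-entropy of the softmax over predicted returns against the label.
    return nn.functional.cross_entropy(logits, torch.tensor([int(label)]))

# Toy usage with an illustrative two-layer reward network over 6-D features.
reward_net = nn.Sequential(nn.Linear(6, 64), nn.ReLU(), nn.Linear(64, 1))
seg_a, seg_b = torch.randn(50, 6), torch.randn(50, 6)
loss = preference_loss(reward_net, seg_a, seg_b, label=1.0)
loss.backward()
```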
Maze-LowDim. We see that PEBBLE requires a large number of rollouts to learn the reward function accurately enough to successfully navigate the maze. By contrast, MoP-RL is much more sample efficient. We also observe that pre-training the reward network using the demonstrations significantly speeds up online preference learning, requiring fewer trajectories to successfully complete the task.
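The pre-training objective is not spelled out in this section; one plausible scheme, assumed here purely for illustration, labels each demonstration segment as preferred over a segment rolled out in the learned dynamics model and reuses the pairwise loss sketched above, so no environment interaction is required.

```python
import torch

def pretrain_reward(reward_net, demo_segments, model_rollout_segments,
                    preference_loss, n_steps=2000, lr=3e-4):
    """Hypothetical pre-training loop: every demonstration segment is labeled
    as preferred over a segment rolled out in the learned dynamics model, so
    the reward network gets a rough shape before any live preference queries."""
    opt = torch.optim.Adam(reward_net.parameters(), lr=lr)
    for step in range(n_steps):
        demo = demo_segments[step % len(demo_segments)]
        rollout = model_rollout_segments[step % len(model_rollout_segments)]
        loss = preference_loss(reward_net, demo, rollout, label=0.0)  # prefer demo
        opt.zero_grad()
        loss.backward()
        opt.step()
    return reward_net
```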
Maze-Image. Table 3 shows results for the Maze-Image environment. We see that MoP-RL outperforms PEBBLE, even when PEBBLE is given significantly more unsupervised access to the environment. Prior work [31] only evaluates PEBBLE on low-dimensional reward learning tasks. We adapted the authors’ implementation of PEBBLE to allow an image-based reward function and used the same reward function architecture for both PEBBLE and MoP-RL. Our results demonstrate that PEBBLE struggles in high-dimensional visual domains. Furthermore, we see that pre-training with a small number of demonstrations (ten in our experiments) can significantly improve reward learning, leading to improved task success: the demonstrations provide a rough initial estimate of the reward function, which is then fine-tuned via model-based preference queries.
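The shared reward architecture is not described here; the small convolutional reward network below is an illustrative example of the kind of model one might use for image observations, not the authors' exact network.

```python
import torch
import torch.nn as nn

class ImageRewardNet(nn.Module):
    """Illustrative convolutional reward model for image observations."""
    def __init__(self, in_channels=3):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(in_channels, 16, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, obs):
        # obs: (batch, channels, height, width) images scaled to [0, 1].
        return self.head(self.encoder(obs))

# One predicted per-step reward per 64x64 RGB observation.
per_step_rewards = ImageRewardNet()(torch.rand(8, 3, 64, 64))   # shape: (8, 1)
```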
Assistive Gym. Table 2 shows the results for the itch scratching task from Assistive Gym. MoP-RL reaches a higher reward than PEBBLE on this task. These experiments demonstrate that MoP-RL requires significantly fewer environment interaction steps than PEBBLE to learn a reward function. MoP-RL also enables dynamics model pre-training to be performed separately from preference learning; in particular, unlike a learned reward function, the learned dynamics can be re-used to safely and efficiently learn the preferences of multiple users.
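Because the learned dynamics model, unlike the reward, can be re-used across users, it is worth sketching what such a model typically looks like. The ensemble of MLPs below, which predicts state deltas and exposes ensemble disagreement as an uncertainty signal, is a common choice in model-based RL and is given as an assumption, not as the authors' implementation.

```python
import torch
import torch.nn as nn

class DynamicsEnsemble(nn.Module):
    """Illustrative ensemble of MLPs predicting the change in state given
    (state, action); disagreement across members can flag unreliable rollouts."""
    def __init__(self, obs_dim, act_dim, n_models=5, hidden=200):
        super().__init__()
        self.members = nn.ModuleList([
            nn.Sequential(nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
                          nn.Linear(hidden, hidden), nn.ReLU(),
                          nn.Linear(hidden, obs_dim))
            for _ in range(n_models)
        ])

    def forward(self, state, action):
        x = torch.cat([state, action], dim=-1)
        next_states = state + torch.stack([m(x) for m in self.members])
        # Mean prediction plus ensemble disagreement (epistemic uncertainty).
        return next_states.mean(dim=0), next_states.std(dim=0)

# Once fit to logged transitions, the same model can serve many users'
# preference-learning sessions without new robot rollouts.
model = DynamicsEnsemble(obs_dim=4, act_dim=2)
next_mean, next_std = model(torch.randn(8, 4), torch.randn(8, 2))
```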
Hopper Backflip. Inspired by previous model-free PbRL results [13], we demonstrate that MoP-RL can train the OpenAI Gym Hopper to perform a backflip via preference queries over a learned dynamics model. An example learned backflip is displayed in Figure 4, suggesting that MoP-RL can learn novel behaviors for which designing a hand-crafted reward function is difficult.
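The policy-optimization step is not detailed here; a standard way to exploit a learned dynamics model together with a learned reward is cross-entropy-method (CEM) planning over action sequences, sketched below with illustrative names and hyperparameters rather than the authors' settings.

```python
import numpy as np

def cem_plan(dynamics_model, reward_net, state, act_dim, horizon=15,
             pop_size=200, n_elite=20, n_iters=5, rng=None):
    """Cross-entropy-method planner: sample action sequences, score them by
    rolling them out in the learned dynamics model under the learned reward,
    then refit the sampling distribution to the highest-scoring sequences."""
    rng = rng or np.random.default_rng(0)
    mean, std = np.zeros((horizon, act_dim)), np.ones((horizon, act_dim))
    for _ in range(n_iters):
        seqs = rng.normal(mean, std, size=(pop_size, horizon, act_dim))
        returns = np.empty(pop_size)
        for i, actions in enumerate(seqs):
            s, ret = state, 0.0
            for a in actions:
                ret += reward_net(s, a)        # learned (preference-based) reward
                s = dynamics_model(s, a)       # learned dynamics step
            returns[i] = ret
        elites = seqs[np.argsort(returns)[-n_elite:]]
        mean, std = elites.mean(axis=0), elites.std(axis=0) + 1e-6
    return mean[0]   # execute the first action, MPC-style

# Toy usage with placeholder dynamics and reward callables.
first_action = cem_plan(lambda s, a: s + 0.1 * np.concatenate([a, a]),
                        lambda s, a: -float(np.sum(s ** 2)),
                        state=np.ones(4), act_dim=2)
```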
Table 3: Performance on the Maze-Image environment. We report the number of unsupervised rollouts, the number of training rollouts, and the success rate (over 16 rollouts) for each method. Both methods were trained with 1000 preference queries.