Active Reward Learning from Online Preferences

Vivek Myers, Erdem Bıyık, Dorsa Sadigh

ICRA 2023

[Video: online-active-video.mp4]

Abstract

Robot policies need to adapt to human preferences and/or new environments. Human experts often have the domain knowledge required to help robots achieve this adaptation. However, existing works often require costly offline re-training on human feedback, and the required feedback is typically frequent and too complex for humans to provide reliably. To avoid placing undue burden on human experts and to allow quick adaptation in critical real-world situations, we propose designing and sparingly presenting easy-to-answer pairwise action preference queries in an online fashion. Our approach designs queries and determines when to present them so as to maximize the expected value derived from the queries' information. We demonstrate our approach with experiments in simulation, human user studies, and real robot experiments. In these settings, our approach outperforms baseline techniques while presenting fewer queries to human experts.

Approach

We model the robot's uncertainty about its environment's reward function. This uncertainty allows the robot to determine when to ask the human the most informative pairwise comparison queries. Specifically, we take a multitask learning perspective on this problem and pretrain a library of robot reward functions on a number of tasks. At test time, we maintain a posterior over the different tasks by directly modeling the effect of presenting queries to a human on the robot's belief state. We can thus compute the expected value of information (EVOI) of any potential pairwise query, corresponding to the expected value the robot gains by asking that question. Using the EVOI metric, we propose an approach for selecting when to query a human expert and what queries to make.
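
The snippet below is a minimal sketch of how the EVOI of a single pairwise action query could be computed, assuming a discrete library of K pretrained reward functions, precomputed per-task Q-values at the current state, and a Bradley-Terry style model of the human's answer. The function names (`preference_likelihood`, `evoi`) and the `q_table` representation are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def preference_likelihood(q_values_1, q_values_2, beta=1.0):
    """Bradley-Terry style probability that the human prefers action 1 over
    action 2, given each task's Q-values for the two actions (a modeling
    assumption for this sketch)."""
    return 1.0 / (1.0 + np.exp(-beta * (q_values_1 - q_values_2)))

def evoi(belief, q_table, a1, a2, beta=1.0):
    """Expected value of information of asking "do you prefer a1 or a2?".

    belief:  (K,) posterior over the K pretrained tasks / reward functions
    q_table: (K, A) Q-value of each action under each task's reward at the
             current state (assumed precomputed from the pretrained library)
    """
    # Value of acting greedily under the current belief.
    current_value = np.max(belief @ q_table)

    # Probability the human answers "a1" under each task, and marginally.
    p_a1_given_task = preference_likelihood(q_table[:, a1], q_table[:, a2], beta)
    p_a1 = belief @ p_a1_given_task

    expected_posterior_value = 0.0
    for answer_prob, likelihood in ((p_a1, p_a1_given_task),
                                    (1.0 - p_a1, 1.0 - p_a1_given_task)):
        if answer_prob <= 0:
            continue
        posterior = belief * likelihood / answer_prob   # Bayes update on the answer
        expected_posterior_value += answer_prob * np.max(posterior @ q_table)

    # Expected gain in value from asking the query.
    return expected_posterior_value - current_value
```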

Our approach compares the EVOI of candidate queries against each other and against a threshold to determine both when to ask a pairwise comparison query over the robot's actions and which query to ask. Leveraging the EVOI metric allows us to ask the most informative, easy-to-answer questions at the critical time steps and thus update our policy online.
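
Building on the EVOI sketch above, query selection can be sketched as follows: score every pair of candidate actions at the current state and present the best pair only if its EVOI exceeds a threshold. The threshold value is a hypothetical tuning parameter here, not a number from the paper.

```python
def select_query(belief, q_table, threshold, beta=1.0):
    """Pick the pairwise action query with the highest EVOI at the current
    state; ask it only if its EVOI clears the threshold (otherwise the robot
    acts without querying). Uses numpy (np) and evoi() from the sketch above."""
    num_actions = q_table.shape[1]
    best_query, best_gain = None, -np.inf
    for a1 in range(num_actions):
        for a2 in range(a1 + 1, num_actions):
            gain = evoi(belief, q_table, a1, a2, beta)
            if gain > best_gain:
                best_query, best_gain = (a1, a2), gain
    if best_gain > threshold:
        return best_query      # present this comparison to the human
    return None                # EVOI too low: act greedily without querying
```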

GridWorld

We first evaluate our algorithm in a GridWorld environment, in which an agent navigates a grid to reach a goal destination. In this setting, the task distribution is uniform over the possible valid goal locations. Our method outperforms baselines across different numbers of queries.
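
As a hypothetical usage example of the sketches above in this setting, the belief can be initialized uniformly over candidate goal cells and the human queried only when the EVOI criterion fires. The grid size, action set, stand-in Q-values, and threshold below are illustrative placeholders.

```python
# Hypothetical GridWorld setup: one candidate task per valid goal cell on a 5x5 grid.
goals = [(x, y) for x in range(5) for y in range(5) if (x, y) != (0, 0)]
belief = np.ones(len(goals)) / len(goals)       # uniform prior over goal locations

# Stand-in for the pretrained library's Q-values at the current state
# (one row per goal hypothesis, one column per action: up/down/left/right).
q_table = np.random.rand(len(goals), 4)

query = select_query(belief, q_table, threshold=0.05)
if query is not None:
    a1, a2 = query   # present "do you prefer a1 or a2?" to the human,
                     # then Bayes-update `belief` on the answer
```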

Driving

We test our approach in a driving environment, where users may have different preferences over the desired lane and lane changes, speed range, acceleration, and following distance. Our approach outperforms baselines while using fewer queries in simulation, and real human users rank it above the baselines.

Block Pushing

We apply our approach to the task of using a robot to push a block to a goal location which is unknown to the robot but known to a simulated human expert. Our method outperforms baselines in simulation and transfers to a real Fetch robot.