The EMPATHIC Framework for Task Learning from Implicit Human Feedback

Yuchen Cui*, Qiping Zhang*, Alessandro Allievi, Peter Stone, Scott Niekum, W. Bradley Knox

Paper presented at the Conference on Robot Learning (CoRL), 2020

[Paper] [Code: EMPATHIC Robotaxi] [Dataset]

Abstract

Reactions such as gestures, facial expressions, and vocalizations are an abundant, naturally occurring channel of information that humans provide during interactions. A robot or other agent could leverage an understanding of such implicit human feedback to improve its task performance at no cost to the human. This approach contrasts with common agent teaching methods based on demonstrations, critiques, or other guidance that need to be attentively and intentionally provided. 

In this paper, we first define the general problem of learning from implicit human feedback and then propose to address this problem through a novel data-driven framework, EMPATHIC. This two-stage method consists of (1) mapping implicit human feedback to relevant task statistics such as rewards, optimality, and advantage; and (2) using such a mapping to learn a task. We instantiate the first stage and three second-stage evaluations of the learned mapping. To do so, we collect a dataset of human facial reactions while participants observe an agent execute a sub-optimal policy for a prescribed training task. We train a deep neural network on this data and demonstrate its ability to (1) infer relative reward ranking of events in the training task from prerecorded human facial reactions; (2) improve the policy of an agent in the training task using live human facial reactions; and (3) transfer to a novel domain in which it evaluates robot manipulation trajectories.

Short Summary Video (5 min)

In-depth Talk (47 min)

Motivation

People often react when observing an agent—whether human or artificial—if they are interested in the outcome of the agent’s behavior.  We have scowled at robot vacuums, raised eyebrows at cruise control, and rebuked automatic doors.  Such reactions are often not intended to communicate to the agent and yet nonetheless contain information about the perceived quality of the agent’s performance.  A robot or other software agent that can sense and correctly interpret these reactions could use the information they contain to improve its learning of the task. Importantly, learning from such implicit human feedback does not burden the human, who naturally provides such reactions even when learning does not occur.

Overview of the EMPATHIC Framework


(Overview video: firstview_demo.mp4)
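At a high level, the two stages of EMPATHIC can be sketched in code. The snippet below is a minimal, self-contained illustration: a nearest-centroid classifier stands in for the learned deep model, and the reaction features are synthetic. All names and data here are hypothetical, not the released implementation.

```python
import numpy as np

# Reward categories used in the Robotaxi study (in dollars).
REWARD_CATEGORIES = [6, -1, -5]

def train_mapping(reaction_features, reward_labels):
    """Stage 1: learn a mapping from reaction features to reward categories.
    A nearest-centroid classifier stands in for the deep model in the paper."""
    return {r: reaction_features[reward_labels == r].mean(axis=0)
            for r in REWARD_CATEGORIES}

def predict_reward(centroids, features):
    """Predict the reward category whose centroid is closest to the features."""
    return min(centroids, key=lambda r: np.linalg.norm(features - centroids[r]))

def score_trajectory(centroids, reaction_features_per_event):
    """Stage 2: evaluate behavior by summing the rewards predicted from the
    observer's reaction to each event along a trajectory."""
    return sum(predict_reward(centroids, f) for f in reaction_features_per_event)

# Tiny synthetic demonstration; real inputs would be facial features
# extracted from reaction video clips.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 8)) + np.repeat([[2.0], [0.0], [-2.0]], 10, axis=0)
y = np.repeat(REWARD_CATEGORIES, 10)
mapping = train_mapping(X, y)
print(score_trajectory(mapping, X[:5]))
```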

First Stage Instantiation: Robotaxi

Participants are recruited to watch an agent act in the Robotaxi task, and their payout for participating in the study is determined by the agent's performance.

An example trajectory in the Robotaxi task environment is displayed on the left.

Human Proxy Test

To better understand the problem, the authors served as proxies for an algorithm, attempting to predict the reward associated with each colored block solely by viewing the human observer's reactions.

An example video clip is displayed on the left.

Try it yourself! Infer which color corresponds to gaining $6, losing $1, and losing $5. (Answer at bottom of page.)

Facial Gesture Annotations

The authors also annotated the recorded videos with commonly observed facial gestures. Annotating and analyzing the dataset revealed which features are important for detecting facial gestures and how long facial gestures last on average.

The annotation tool UI is shown on the left.
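As a rough illustration of the kind of analysis these annotations support, the sketch below computes average gesture durations from a small hypothetical annotation table. The column names and gesture labels are assumptions, not the dataset's actual schema.

```python
import pandas as pd

# Hypothetical annotation table: one row per annotated facial gesture, with
# start/end timestamps in seconds. Columns and labels are illustrative only.
annotations = pd.DataFrame({
    "subject": ["s01", "s01", "s02", "s02", "s03"],
    "gesture": ["smile", "eyebrow_raise", "smile", "frown", "eyebrow_raise"],
    "start_s": [12.4, 30.1, 8.0, 45.2, 19.7],
    "end_s":   [14.1, 31.0, 9.2, 47.0, 20.3],
})

annotations["duration_s"] = annotations["end_s"] - annotations["start_s"]

# Average duration per gesture type.
print(annotations.groupby("gesture")["duration_s"].mean())
```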

Model Architecture for Learning the Mapping from Reaction Videos to Reward Categories
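For illustration, below is a minimal PyTorch sketch of one plausible network of this kind: per-frame facial features are encoded, aggregated over the clip, and classified into reward categories. The feature dimensionality, layer sizes, and mean-pooling over time are assumptions rather than the architecture reported in the paper.

```python
import torch
import torch.nn as nn

class ReactionToRewardNet(nn.Module):
    """Maps a sequence of per-frame facial features extracted from a reaction
    video to logits over reward categories (e.g., +$6 / -$1 / -$5).
    Layer sizes and the mean-pooling over time are illustrative assumptions."""

    def __init__(self, feature_dim=35, hidden_dim=64, num_reward_categories=3):
        super().__init__()
        self.frame_encoder = nn.Sequential(
            nn.Linear(feature_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
        )
        self.classifier = nn.Linear(hidden_dim, num_reward_categories)

    def forward(self, frame_features):
        # frame_features: (batch, time, feature_dim)
        encoded = self.frame_encoder(frame_features)   # (batch, time, hidden_dim)
        pooled = encoded.mean(dim=1)                    # aggregate over the clip
        return self.classifier(pooled)                  # logits over reward categories

# Example: a batch of 4 reaction clips, 90 frames each, 35 facial features per frame.
logits = ReactionToRewardNet()(torch.randn(4, 90, 35))
print(logits.shape)  # torch.Size([4, 3])
```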


Evaluating Reward Ranking in Robotaxi

The learned mappings are evaluated on held-out data by predicting the correct reward ranking after processing a full episode.

The corresponding per-subject ranking similarity with ground truth is displayed in the figure on the left.
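One standard way to quantify ranking similarity with the ground-truth reward ordering is a rank correlation such as Kendall's tau. The snippet below illustrates the computation on a toy predicted ranking; the paper's exact similarity measure may differ.

```python
from scipy.stats import kendalltau

# Ground-truth ranking of the three Robotaxi pickups (best to worst) and a
# per-subject predicted ranking inferred from facial reactions. Objects are
# identified by index; values are rank positions, not reward amounts.
ground_truth = [0, 1, 2]   # object 0 best (+$6), object 2 worst (-$5)
predicted    = [0, 2, 1]   # the model swapped the two negative objects

tau, _ = kendalltau(ground_truth, predicted)
print(f"ranking similarity (Kendall's tau): {tau:.2f}")  # 0.33 for one swap
```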

(Video: S6Zghgggo4_red_bottle_then_can_detections.avi)

Second Stage Instantiation: Robotic Sorting Task

The learned mapping is also evaluated on ranking trajectories in a robotic sorting task, demonstrating that it generalizes to a different task domain.
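A hedged sketch of this kind of evaluation: trajectories are ranked by the total reward predicted from the observer's reactions to the events in each trajectory. The trajectory names and per-event predictions below are invented for illustration.

```python
import numpy as np

def rank_trajectories(predicted_event_rewards):
    """Rank trajectories by the total reward predicted from the observer's
    reactions to the events in each trajectory (higher is better).

    predicted_event_rewards: dict mapping trajectory name -> list of
    per-event reward estimates produced by the learned mapping."""
    scores = {name: float(np.sum(r)) for name, r in predicted_event_rewards.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical per-event predictions for three sorting trajectories.
predictions = {
    "sorts_all_correctly": [6, 6, 6],
    "one_wrong_bin":       [6, -1, 6],
    "drops_an_object":     [-5, 6, -1],
}
print(rank_trajectories(predictions))
```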

Online Learning

The learned mapping is also used to infer rewards from the reactions of unseen subjects in an online setting, where the agent's behavior changes as it learns more about the true reward function from the observer's reactions. By integrating individual predictions over time, the agent converges to the correct reward ranking. Video here.
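As a rough sketch of how such predictions could be integrated over time, the estimator below accumulates per-event reward-category probabilities for each object type and re-derives the expected rewards (and hence the ranking). The log-probability accumulation is an illustrative choice, not necessarily the aggregation rule used in the paper.

```python
import numpy as np

REWARD_VALUES = np.array([6.0, -1.0, -5.0])   # Robotaxi reward categories ($)

class OnlineRewardEstimator:
    """Integrates per-event reward-category predictions over time into a
    running reward estimate, and hence a ranking, for each object type."""

    def __init__(self, object_names):
        # Accumulated log-probabilities over reward categories per object.
        self.log_probs = {name: np.zeros(len(REWARD_VALUES)) for name in object_names}

    def update(self, object_name, category_probs):
        # category_probs: predicted probabilities over reward categories
        # for one observed reaction to this object.
        self.log_probs[object_name] += np.log(np.asarray(category_probs) + 1e-8)

    def estimated_rewards(self):
        rewards = {}
        for name, lp in self.log_probs.items():
            probs = np.exp(lp - lp.max())
            probs /= probs.sum()
            rewards[name] = float(probs @ REWARD_VALUES)
        return rewards

est = OnlineRewardEstimator(["yellow", "red", "blue"])
est.update("yellow", [0.7, 0.2, 0.1])   # reaction after a yellow pickup
est.update("blue",   [0.1, 0.3, 0.6])
est.update("red",    [0.2, 0.6, 0.2])
print(sorted(est.estimated_rewards().items(), key=lambda kv: kv[1], reverse=True))
```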

(Human proxy test answer: Yellow +$6, Red -$1, Blue -$5)