Few-shot Preference Learning for Human-in-the-Loop RL

The above graphic shows the general procedure for our method. First, we collect an offline dataset of experience from prior tasks. We use this prior data to train a reward model with the MAML algorithm (Finn et al., 2017). We then adapt the reward model using newly collected preference data and use it to train a policy on a new, unseen task.
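To make the pre-training stage concrete, below is a minimal PyTorch sketch: a reward model is trained on preference-labelled segment pairs from prior tasks with a MAML-style inner/outer loop. The class and function names (RewardModel, preference_loss, maml_outer_step) and all hyperparameters are illustrative assumptions, not our exact implementation.

# Minimal sketch of MAML-style pre-training of a preference-based reward model.
# All names and hyperparameters here are illustrative, not our exact code.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.func import functional_call  # PyTorch 2.x


class RewardModel(nn.Module):
    """Scores a (state, action) pair with a scalar reward estimate."""

    def __init__(self, obs_dim, act_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs, act):
        return self.net(torch.cat([obs, act], dim=-1)).squeeze(-1)


def preference_loss(model, params, batch):
    """Bradley-Terry cross-entropy over pairs of trajectory segments.

    batch = (obs_a, act_a, obs_b, act_b, label); label = 1 means segment A
    was preferred.  Segment returns are predicted rewards summed over time.
    """
    obs_a, act_a, obs_b, act_b, label = batch
    ret_a = functional_call(model, params, (obs_a, act_a)).sum(-1)
    ret_b = functional_call(model, params, (obs_b, act_b)).sum(-1)
    return F.binary_cross_entropy_with_logits(ret_a - ret_b, label.float())


def maml_outer_step(model, meta_opt, tasks, inner_lr=1e-2, inner_steps=1):
    """One meta-update: adapt on each task's support queries, evaluate the
    adapted parameters on held-out queries, and update the initialization."""
    meta_opt.zero_grad()
    meta_loss = 0.0
    for support_batch, query_batch in tasks:
        params = dict(model.named_parameters())
        for _ in range(inner_steps):
            loss = preference_loss(model, params, support_batch)
            grads = torch.autograd.grad(loss, tuple(params.values()),
                                        create_graph=True)
            params = {name: p - inner_lr * g
                      for (name, p), g in zip(params.items(), grads)}
        meta_loss = meta_loss + preference_loss(model, params, query_batch)
    (meta_loss / len(tasks)).backward()
    meta_opt.step()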

Abstract

While reinforcement learning (RL) has become a more popular approach for robotics, designing sufficiently informative reward functions for complex tasks has proven to be extremely difficult due to their inability to capture human intent and their susceptibility to policy exploitation. Preference-based RL algorithms seek to overcome these challenges by directly learning reward functions from human feedback. Unfortunately, prior work either requires an unreasonable number of queries implausible for any human to answer or overly restricts the class of reward functions to guarantee the elicitation of the most informative queries, resulting in models that are insufficiently expressive for realistic robotics tasks. Contrary to most works that focus on query selection to minimize the amount of data required for learning reward functions, we take the opposite approach: expanding the pool of available data by viewing human-in-the-loop RL through the more flexible lens of multi-task learning. Motivated by the success of meta-learning, we pre-train preference models on prior task data and quickly adapt them for new tasks using only a handful of queries. Empirically, we reduce the amount of online feedback needed to train manipulation policies in MetaWorld by 20×, and demonstrate the effectiveness of our method on a real Franka Panda robot. Moreover, this reduction in query complexity allows us to train robot policies from actual human users.
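For a new task, the meta-learned reward model is adapted with only a handful of human preference queries and then used to supply rewards to a standard off-policy RL learner. The sketch below reuses the RewardModel and preference_loss helpers from the sketch above; the learning rate, step count, and helper names are illustrative assumptions, not our exact settings.

# Minimal sketch of few-shot adaptation and reward relabelling for a new task.
import copy
import torch


def adapt_reward_model(meta_model, human_queries, lr=1e-2, steps=10):
    """Fine-tune a copy of the meta-learned reward model on a handful of
    human preference queries gathered for the new task."""
    adapted = copy.deepcopy(meta_model)
    opt = torch.optim.SGD(adapted.parameters(), lr=lr)
    for _ in range(steps):
        loss = preference_loss(adapted, dict(adapted.named_parameters()),
                               human_queries)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return adapted


def relabel_rewards(adapted, obs, act):
    """Label replay-buffer transitions with the adapted reward model so a
    standard off-policy learner (e.g. SAC, as in PEBBLE) can train on them."""
    with torch.no_grad():
        return adapted(obs, act)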

Policy Visualizations

Note that visualizations are sped up 4x for MetaWorld and 8x for DM Control.

Window Close - 200 Artificial Feedback

Few-Shot (Ours)

PEBBLE

Door Close - 200 Artificial Feedback

Few-Shot (Ours)

PEBBLE

Door Unlock - 500 Artificial Feedback

Few-Shot (Ours)

PEBBLE

Button Press - 500 Artificial Feedback

Few-Shot (Ours)

PEBBLE

Drawer Open - 1000 Artificial Feedback

Few-Shot (Ours)

PEBBLE

Sweep Into - 2500 Artificial Feedback

Few-Shot (Ours)

PEBBLE

Point Mass - 36 Human Feedback

Few-Shot (Ours)

PEBBLE

Reacher - 48 Human Feedback

Few-Shot (Ours)

PEBBLE

Window Close - 64 Human Feedback

Few-Shot (Ours)

PEBBLE

Door Close - 100 Human Feedback

Notice that in the Door Close environment in particular, very different strategies are learned. Our method (Few-Shot) learns to move to the door handle and then push the door closed from it. PEBBLE, on the other hand, tends to learn to slam into the corner of the door, causing it to close. Our method recovers a policy closer to the behavior desired by the human user, as seen below.

Few-Shot (Ours)

PEBBLE

Panda Reach - 200 Artificial Feedback

Few-Shot (Ours)
Goal 1

PEBBLE
Goal 1

Few-Shot (Ours)
Goal 2

PEBBLE
Goal 2

Panda Block Push - 2000 Artificial Feedback

Few-Shot (Ours)
Goal 1

PEBBLE
Goal 1

Few-Shot (Ours)
Goal 2

PEBBLE
Goal 2

User Queries

Here we show selected queries for both our method and PEBBLE on each of the environments used for human training. The top behavior was preferred by the human user. In the downloadable content we include the full images shown to users during policy training.

A depiction of the 28th query asked to users when training the Point Mass agent from human feedback. The preferred behavior was chosen based on the proximity of the agent (yellow) to the goal position (red). At this point in training, our Few-Shot method sampled queries closer to the goal position than PEBBLE.

A depiction of the 35th query asked to users when training the Reacher agent from human feedback. Our method's query (top) was easier to answer because the arm in the top trajectory was clearly closer to the target position.

This shows one of the last queries asked for the Window Close environment. Here, our method's query asks the user to choose between a closed and an open window (top), while PEBBLE's query asks the user to choose between two different, hard-to-distinguish arm positions.

This shows a query from the middle of training for the Door Close environment. At this point, the Few-Shot method asks the user to compare a completely closed door (preferred) with an open one, while PEBBLE's query only includes a partially closed door.