Learning a Universal Human Prior for Dexterous Manipulation from Human Preference
Introduction
Reinforcement Learning from Human Feedback (RLHF) has been demonstrated to be a powerful approach for achieving human-preferred behaviors with learning agents. We built this website to collect data from humans, and your contributions matter!
Thousands of videos were collected with the Proximal Policy Optimization (PPO) algorithm on 20 simulated bi-dexterous hand manipulation environments.
We use the human preference data to train a Reward Model (RM) for fine-tuning the RL policies toward human-like behaviors, with the following procedure:
Step ① is to generate diverse policies across 20 dexterous hand manipulation tasks.
Step ② is to let human labelers provide preferences over trajectories collected from the generated policies.
Step ③ is to train the task-agnostic reward model for human-like behavior using the labeled samples.
The policies are fine-tuned in Step ① of the next iteration with the reward model. Iterate Steps ①-③.
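The reward model in Step ③ is commonly trained with a Bradley-Terry style preference loss over pairs of trajectories. The following is a minimal NumPy sketch of that loss, not the paper's actual implementation; the function name, batch shapes, and toy numbers are all illustrative assumptions:

```python
import numpy as np

def preference_loss(r_a, r_b, prefer_a):
    """Bradley-Terry preference loss over trajectory reward sums.

    r_a, r_b : predicted per-trajectory reward sums, shape (batch,)
    prefer_a : 1.0 if the labeler preferred trajectory A, else 0.0
    """
    # Model P(A preferred over B) = sigmoid(r_a - r_b)
    p_a = 1.0 / (1.0 + np.exp(-(r_a - r_b)))
    # Binary cross-entropy against the human label
    eps = 1e-8
    return -np.mean(prefer_a * np.log(p_a + eps)
                    + (1.0 - prefer_a) * np.log(1.0 - p_a + eps))

# Toy check: a model that scores the preferred trajectories higher
# (here A in both pairs) should incur a low loss.
r_a = np.array([2.0, 1.5])
r_b = np.array([0.0, 0.5])
labels = np.array([1.0, 1.0])  # A preferred in both pairs
loss = preference_loss(r_a, r_b, labels)
```

Minimizing this loss pushes the reward model to assign higher cumulative reward to the trajectories human labelers preferred, which is what makes it usable as a fine-tuning signal in the next iteration.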
Human Preference Collection
Anyone can provide preference data on the collected dexterous-hands videos. Click here:
Method and Results
Iterative RLHF and diverse policy fine-tuning with human-preference reward model.
Comparison of original policies and fine-tuned policies with the trained reward model.
Across dozens of tasks, including seen and unseen, simulation and reality.
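One common way to realize RM-based fine-tuning is to mix the reward model's score into the environment reward during PPO updates. A hedged sketch of such a wrapper is below; the interface, the `lam` weighting, and the toy environment are assumptions for illustration, not the paper's implementation:

```python
class RMShapedEnv:
    """Wrap an environment so the policy is fine-tuned against the
    task reward plus a human-preference reward-model bonus."""

    def __init__(self, env, reward_model, lam=0.1):
        self.env = env
        self.rm = reward_model   # callable: (obs, action) -> scalar score
        self.lam = lam           # trades off task success vs. human likeness

    def reset(self):
        return self.env.reset()

    def step(self, action):
        obs, task_r, done, info = self.env.step(action)
        # Add the RM's human-likeness score to the task reward
        shaped_r = task_r + self.lam * self.rm(obs, action)
        return obs, shaped_r, done, info


class _ToyEnv:
    """Trivial stand-in environment for demonstration only."""
    def reset(self):
        return 0.0
    def step(self, action):
        return 0.0, 1.0, False, {}

env = RMShapedEnv(_ToyEnv(), reward_model=lambda obs, act: 2.0, lam=0.5)
obs, r, done, info = env.step(0)
```

Because the RM term only shapes the reward, it regularizes the hand's style of motion without replacing the task objective, which is consistent with the observation below that human likeness does not always improve task completion.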
Real Robot Experiments
The setup of the real-robot experiments includes a Shadow Hand mounted at the end of a UR10e robotic arm, both controlled simultaneously at a frequency of 10 Hz. A comparison of real-robot trajectories for the original policies and the policies fine-tuned with the trained reward model is shown below.
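A fixed-rate control loop like the 10 Hz one used here can be sketched as follows. The `read_state` and `send_command` callables are placeholders for the robot driver interface, not an actual Shadow Hand or UR10e API:

```python
import time

CONTROL_HZ = 10
DT = 1.0 / CONTROL_HZ  # 100 ms control period

def run_control_loop(policy, read_state, send_command, steps=100):
    """Step the arm and hand together at a fixed rate.

    policy       : callable mapping an observation to one action vector
    read_state   : callable returning the current observation (placeholder)
    send_command : callable sending the action to the robot (placeholder)
    """
    for _ in range(steps):
        t0 = time.monotonic()
        obs = read_state()        # e.g. joint angles, object pose
        action = policy(obs)      # arm + hand targets in one vector
        send_command(action)
        # Sleep off the remainder of the control period so that
        # compute time does not drift the loop below 10 Hz
        elapsed = time.monotonic() - t0
        time.sleep(max(0.0, DT - elapsed))

# Dry-run with trivial placeholders: collect the commands that would be sent
commands = []
run_control_loop(policy=lambda obs: obs,
                 read_state=lambda: 0.0,
                 send_command=commands.append,
                 steps=5)
```

Using `time.monotonic` rather than wall-clock time keeps the period stable even if the system clock is adjusted mid-run.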
                                           Original Policy                                         RM Finetune
Failure Cases
Here we demonstrate some failure cases for policies with and without RM fine-tuning and provide a brief analysis.
ShadowHandGraspAndPlace
without RM
with RM
ShadowHandTwoCatchUnderarm
without RM
with RM
Discussion: In the two tasks above, ShadowHandGraspAndPlace and ShadowHandTwoCatchUnderarm, we find that some tasks are intrinsically challenging. For example, ShadowHandTwoCatchUnderarm requires each hand to throw a ball and catch the ball thrown by the other hand. The success conditions for both tasks are strict, leading to a low success rate when RL is applied directly, even with RM fine-tuning. In our method, the RM provides an additional regularization on hand behaviors; this can lead to more human-like behaviors but does not necessarily help with task completion. The relationship between human likeness and task completion can be complicated given imperfectly designed task rewards.
Bi-dexhands Tasks
ShadowHand
ShadowHandBlockStack
ShadowHandBottleCap
ShadowHandCatchAbreast
ShadowHandCatchOver2Underarm
ShadowHandCatchUnderarm
ShadowHandDoorCloseInward
ShadowHandDoorCloseOutward
ShadowHandDoorOpenInward
ShadowHandDoorOpenOutward
ShadowHandGraspAndPlace
ShadowHandKettle
ShadowHandLiftUnderarm
ShadowHandOver
ShadowHandPen
ShadowHandPushBlock
ShadowHandScissors
ShadowHandSwingCup
ShadowHandSwitch
ShadowHandTwoCatchUnderarm