Learning a Universal Human Prior for Dexterous Manipulation from Human Preference
Introduction
Reinforcement Learning from Human Feedback (RLHF) has been demonstrated to be a powerful approach for achieving human-preferred behaviors with learning agents. We built this website to collect data from humans, and your contributions matter!
Thousands of videos were collected with the Proximal Policy Optimization (PPO) algorithm on 20 simulated bi-dexterous hand manipulation environments.
We use the human preference data to train a Reward Model (RM) for fine-tuning the RL policies toward human-like behaviors, with the following procedure:
Step ① is to generate diverse policies across 20 dexterous hand manipulation tasks.
Step ② is to let human labelers provide preferences over trajectories collected from the generated policies.
Step ③ is to train the task-agnostic reward model for human-like behavior using the labeled samples.
The policies are fine-tuned in Step ① of the next iteration with the reward model. Iterate Steps ①-③.
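The reward model in Step ③ is commonly trained with a Bradley-Terry style preference loss over pairs of trajectories. The following is a minimal NumPy sketch of that loss, not the paper's actual implementation; the function name, batch shapes, and toy numbers are all illustrative assumptions:

```python
import numpy as np

def preference_loss(r_a, r_b, prefer_a):
    """Bradley-Terry preference loss over trajectory reward sums.

    r_a, r_b : predicted per-trajectory reward sums, shape (batch,)
    prefer_a : 1.0 if the labeler preferred trajectory A, else 0.0
    """
    # Model P(A preferred over B) = sigmoid(r_a - r_b)
    p_a = 1.0 / (1.0 + np.exp(-(r_a - r_b)))
    # Binary cross-entropy against the human label
    eps = 1e-8
    return -np.mean(prefer_a * np.log(p_a + eps)
                    + (1.0 - prefer_a) * np.log(1.0 - p_a + eps))

# Toy check: a model that scores the preferred trajectories higher
# (here A in both pairs) should incur a low loss.
r_a = np.array([2.0, 1.5])
r_b = np.array([0.0, 0.5])
labels = np.array([1.0, 1.0])  # A preferred in both pairs
loss = preference_loss(r_a, r_b, labels)
```

Minimizing this loss pushes the reward model to assign higher cumulative reward to the trajectories human labelers preferred, which is what makes it usable as a fine-tuning signal in the next iteration.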
Human Preference Collection
Anyone can provide preference data on the collected dexterous-hands videos. Click here:
Method and Results
Iterative RLHF and diverse policy fine-tuning with human-preference reward model.
Comparison of original policies and fine-tuned policies with the trained reward model.
Across dozens of tasks, including seen and unseen, simulation and reality.
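One common way to realize RM-based fine-tuning is to mix the reward model's score into the environment reward during PPO updates. A hedged sketch of such a wrapper is below; the interface, the `lam` weighting, and the toy environment are assumptions for illustration, not the paper's implementation:

```python
class RMShapedEnv:
    """Wrap an environment so the policy is fine-tuned against the
    task reward plus a human-preference reward-model bonus."""

    def __init__(self, env, reward_model, lam=0.1):
        self.env = env
        self.rm = reward_model   # callable: (obs, action) -> scalar score
        self.lam = lam           # trades off task success vs. human likeness

    def reset(self):
        return self.env.reset()

    def step(self, action):
        obs, task_r, done, info = self.env.step(action)
        # Add the RM's human-likeness score to the task reward
        shaped_r = task_r + self.lam * self.rm(obs, action)
        return obs, shaped_r, done, info


class _ToyEnv:
    """Trivial stand-in environment for demonstration only."""
    def reset(self):
        return 0.0
    def step(self, action):
        return 0.0, 1.0, False, {}

env = RMShapedEnv(_ToyEnv(), reward_model=lambda obs, act: 2.0, lam=0.5)
obs, r, done, info = env.step(0)
```

Because the RM term only shapes the reward, it regularizes the hand's style of motion without replacing the task objective, which is consistent with the observation below that human likeness does not always improve task completion.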
Real Robot Experiments
The setup of the real-robot experiments includes a Shadow Hand mounted at the end of a UR10e robotic arm, both controlled simultaneously at a frequency of 10 Hz. A comparison of real-robot trajectories for the original policies and the policies fine-tuned with the trained reward model is shown below.
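A fixed-rate control loop like the 10 Hz one used here can be sketched as follows. The `read_state` and `send_command` callables are placeholders for the robot driver interface, not an actual Shadow Hand or UR10e API:

```python
import time

CONTROL_HZ = 10
DT = 1.0 / CONTROL_HZ  # 100 ms control period

def run_control_loop(policy, read_state, send_command, steps=100):
    """Step the arm and hand together at a fixed rate.

    policy       : callable mapping an observation to one action vector
    read_state   : callable returning the current observation (placeholder)
    send_command : callable sending the action to the robot (placeholder)
    """
    for _ in range(steps):
        t0 = time.monotonic()
        obs = read_state()        # e.g. joint angles, object pose
        action = policy(obs)      # arm + hand targets in one vector
        send_command(action)
        # Sleep off the remainder of the control period so that
        # compute time does not drift the loop below 10 Hz
        elapsed = time.monotonic() - t0
        time.sleep(max(0.0, DT - elapsed))

# Dry-run with trivial placeholders: collect the commands that would be sent
commands = []
run_control_loop(policy=lambda obs: obs,
                 read_state=lambda: 0.0,
                 send_command=commands.append,
                 steps=5)
```

Using `time.monotonic` rather than wall-clock time keeps the period stable even if the system clock is adjusted mid-run.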
                                           Original Policy                                         RM Finetune
Failure Cases
Here we demonstrate some failure cases for policies with and without RM fine-tuning and provide a brief analysis.
ShadowHandGraspAndPlace
without RM
with RM
ShadowHandTwoCatchUnderarm
without RM
with RM
Discussion: In the two tasks above, ShadowHandGraspAndPlace and ShadowHandTwoCatchUnderarm, we find that some tasks are intrinsically challenging. For example, ShadowHandTwoCatchUnderarm requires each hand to throw a ball and catch the ball thrown by the other hand. The success conditions for both tasks are strict, leading to a low success rate when RL is applied directly, even with RM fine-tuning. In our method, the RM provides an additional regularization on hand behaviors; this can lead to more human-like behaviors but does not necessarily help with task completion. The relationship between human likeness and task completion can be complicated given imperfectly designed task rewards.
Bi-dexhands Tasks
ShadowHand
ShadowHandBlockStack
ShadowHandBottleCap
ShadowHandCatchAbreast
ShadowHandCatchOver2Underarm
ShadowHandCatchUnderarm
ShadowHandDoorCloseInward
ShadowHandDoorCloseOutward
ShadowHandDoorOpenInward
ShadowHandDoorOpenOutward
ShadowHandGraspAndPlace
ShadowHandKettle
ShadowHandLiftUnderarm
ShadowHandOver
ShadowHandPen
ShadowHandPushBlock
ShadowHandScissors
ShadowHandSwingCup
ShadowHandSwitch
ShadowHandTwoCatchUnderarm