Skill Preferences: Learning to Extract and Execute Skills from Human Feedback

[Paper] [Code]


A promising approach to solving challenging long-horizon tasks has been to extract behavior priors (skills) by fitting generative models to large offline datasets of demonstrations. However, such generative models inherit the biases of the underlying data and produce poor, unusable skills when trained on imperfect demonstrations. To better align skill extraction with human intent, we present Skill Preferences (SkiP), an algorithm that learns a model of human preferences and uses it to extract human-aligned skills from offline data. After extracting human-preferred skills, SkiP also utilizes human feedback to solve downstream tasks with RL. We show that SkiP enables a simulated kitchen robot to solve complex multi-step manipulation tasks, substantially outperforming both leading RL algorithms that use human preferences and leading skill extraction algorithms that do not.

Method

Skill Extraction with Human Feedback

In this offline phase, SkiP learns a VAE that encodes action sequences into a skill latent, conditioned on the starting state. To incorporate human feedback, we weight each trajectory's ELBO objective by the likelihood that the trajectory is useful. This likelihood comes from a classifier P trained on human labels.
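As a rough illustration, here is a minimal PyTorch sketch of a weighted ELBO of this kind. The network sizes and the `usefulness_clf` interface are illustrative assumptions, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

class SkillVAE(nn.Module):
    """Encodes an action sequence into a skill latent z, conditioned on the
    start state s0. Hidden sizes and the MLP encoder are assumptions."""

    def __init__(self, state_dim, action_dim, horizon, latent_dim=10, hidden=128):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(state_dim + horizon * action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * latent_dim),  # mean and log-variance of q(z | s0, actions)
        )
        self.decoder = nn.Sequential(
            nn.Linear(state_dim + latent_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, horizon * action_dim),  # reconstructed action sequence
        )

    def forward(self, s0, actions):
        flat = actions.flatten(1)
        mu, logvar = self.encoder(torch.cat([s0, flat], -1)).chunk(2, -1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterization trick
        recon = self.decoder(torch.cat([s0, z], -1))
        recon_loss = ((recon - flat) ** 2).sum(-1)                    # per-example reconstruction
        kl = 0.5 * (mu.pow(2) + logvar.exp() - 1 - logvar).sum(-1)    # per-example KL to N(0, I)
        return recon_loss, kl

def weighted_elbo_loss(vae, usefulness_clf, s0, actions, beta=0.1):
    """Weight each trajectory's ELBO by P(useful | trajectory) from the
    human-label classifier, so unusable demonstrations contribute little."""
    recon_loss, kl = vae(s0, actions)
    with torch.no_grad():
        w = usefulness_clf(s0, actions)  # in [0, 1], trained on human labels
    return (w * (recon_loss + beta * kl)).mean()
```

The effect is that trajectories the classifier deems unlikely to be useful are down-weighted, so the skill prior is fit mostly to human-preferred behavior.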

Skill Execution with Human Preferences

The second, online phase consists of two loops. The orange loop is the RL loop over skill latents, and the pink loop is the "human loop": SkiP learns a reward model from human preferences and relabels the entire replay buffer with the learned reward before training the RL agent.
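The reward-learning step in the pink loop can be sketched as follows, assuming a Bradley-Terry-style preference model as is standard in preference-based RL. The RewardModel architecture and the replay-buffer layout are hypothetical placeholders:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """Maps a (state, skill latent) pair to a scalar reward; sizes are assumptions."""
    def __init__(self, state_dim, skill_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + skill_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, states, skills):
        return self.net(torch.cat([states, skills], -1)).squeeze(-1)

def preference_loss(reward_model, seg_a, seg_b, prefs):
    """Bradley-Terry-style loss on human preference labels.

    seg_a / seg_b: (states, skills) tensors of shape (batch, T, dim) for the
    two segments shown to the human; prefs: 1.0 if segment A was preferred,
    0.0 if segment B was.
    """
    ret_a = reward_model(*seg_a).sum(-1)  # summed predicted reward per segment
    ret_b = reward_model(*seg_b).sum(-1)
    logits = ret_a - ret_b
    return F.binary_cross_entropy_with_logits(logits, prefs)

@torch.no_grad()
def relabel_replay_buffer(reward_model, buffer):
    """Overwrite stored rewards with the current reward model's predictions
    before the next round of RL training (buffer fields are hypothetical)."""
    buffer.rewards = reward_model(buffer.states, buffer.skills)
```

Relabeling the entire buffer, rather than only new transitions, keeps old experience consistent with the latest reward model as it improves with more human labels.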

Environments

We evaluate in the robot kitchen environment from D4RL, which requires a 7-DOF robotic arm to operate a kitchen. Within this environment, we consider a variety of manipulation tasks of varying difficulty. The simplest tasks involve a single subtask, such as opening the microwave or moving the kettle, while more challenging tasks require the agent to compose multiple subtasks. Overall, we consider 6 evaluation tasks that require chaining one, two, or three subtasks.
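For reference, a minimal sketch of loading the D4RL kitchen environment and its offline demonstration data; 'kitchen-mixed-v0' is one of the standard D4RL kitchen tasks and may differ from the exact task configurations used here:

```python
import gym
import d4rl  # registers the kitchen environments with gym

env = gym.make('kitchen-mixed-v0')  # 7-DOF Franka arm in a simulated kitchen
dataset = env.get_dataset()         # offline demonstrations used for skill extraction
print(dataset['observations'].shape, dataset['actions'].shape)

obs = env.reset()
obs, reward, done, info = env.step(env.action_space.sample())
```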

Long-Horizon Skill Execution from Human Labels

[Video: skip_mkb.mp4] SkiP (our method)

[Video: flat_prior_mkb.mp4] Flat Prior

SkiP matches the oracle baseline asymptotically and outperforms the other baselines.

SkiP is also label-efficient: it uses only 120 human labels during skill extraction and 300 to 1,000 human labels during skill execution, depending on the task's complexity.


Video Summary

[Video: zoom_0.mp4]