A promising approach to solving challenging long-horizon tasks has been to extract behavior priors (skills) by fitting generative models to large offline datasets of demonstrations. However, such generative models inherit the biases of the underlying data and result in poor and unusable skills when trained on imperfect demonstration data. To better align skill extraction with human intent we present Skill Preferences (SkiP), an algorithm that learns a model over human preferences and uses it to extract human-aligned skills from offline data. After extracting human-preferred skills, SkiP also utilizes human feedback to solve downstream tasks with RL. We show that SkiP enables a simulated kitchen robot to solve complex multi-step manipulation tasks and substantially outperforms prior leading RL algorithms with human preferences as well as leading skill extraction algorithms without human preferences.
Method
Skill Extraction with Human Feedback
In the first, offline phase, SkiP learns a VAE that encodes action sequences into a skill latent conditioned on the starting state. To incorporate human feedback, the ELBO objective is weighted by the likelihood that each trajectory is useful. This likelihood comes from a classifier P trained on human labels.
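The weighted objective can be sketched as follows. This is a minimal illustration, not the paper's implementation: the network sizes, the `usefulness` weights (standing in for the classifier P's output), and all variable names are assumptions.

```python
import torch
import torch.nn as nn

class SkillVAE(nn.Module):
    """Encodes a flattened action sequence into a skill latent z,
    conditioned on the starting state s0; decodes z back to actions."""
    def __init__(self, state_dim, action_seq_dim, latent_dim=8):
        super().__init__()
        self.enc = nn.Linear(state_dim + action_seq_dim, 2 * latent_dim)
        self.dec = nn.Linear(state_dim + latent_dim, action_seq_dim)

    def forward(self, s0, actions):
        h = self.enc(torch.cat([s0, actions], dim=-1))
        mu, log_std = h.chunk(2, dim=-1)
        z = mu + log_std.exp() * torch.randn_like(mu)  # reparameterization trick
        recon = self.dec(torch.cat([s0, z], dim=-1))
        return recon, mu, log_std

def weighted_elbo_loss(model, s0, actions, usefulness, beta=1e-2):
    """Negative ELBO per trajectory, reweighted by the classifier's
    P(useful | trajectory) so that low-quality demonstrations
    contribute little to skill extraction."""
    recon, mu, log_std = model(s0, actions)
    recon_nll = ((recon - actions) ** 2).sum(-1)  # Gaussian reconstruction term
    kl = 0.5 * (mu ** 2 + (2 * log_std).exp() - 2 * log_std - 1).sum(-1)
    return (usefulness * (recon_nll + beta * kl)).mean()
```

A trajectory with `usefulness` near zero is effectively dropped from the objective, so the learned skill space is shaped only by demonstrations humans would endorse.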
Skill Execution with Human Preference
In the second phase, there are two loops. The orange loop is the RL learning loop over skill latents, and the pink loop is the "human loop". In the pink loop, a reward model is learned from human preferences, and the entire replay buffer is relabeled with the learned reward before training the RL agent.
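The human loop can be sketched with a standard Bradley-Terry preference model: a human compares two behavior segments, and the reward model is trained so that the preferred segment has higher total predicted reward. This is a hedged sketch under assumed shapes and names; the actual architecture and training details in SkiP may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """Maps a (state, skill-latent) pair to a scalar reward estimate."""
    def __init__(self, state_dim, skill_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + skill_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))

    def forward(self, s, z):
        return self.net(torch.cat([s, z], dim=-1)).squeeze(-1)

def preference_loss(rm, sa, za, sb, zb, prefs):
    """Bradley-Terry loss: the probability that segment A is preferred
    is a softmax over each segment's summed predicted reward.
    sa, za / sb, zb: (batch, T, dim) tensors for segments A and B;
    prefs: (batch,) long tensor, 0 if A preferred, 1 if B preferred."""
    ra = rm(sa, za).sum(-1)  # total predicted reward of segment A
    rb = rm(sb, zb).sum(-1)
    return F.cross_entropy(torch.stack([ra, rb], dim=-1), prefs)

def relabel(rm, states, skills):
    """Overwrite stored rewards with the current reward model's predictions,
    so the RL agent always trains on up-to-date reward estimates."""
    with torch.no_grad():
        return rm(states, skills)
```

After each round of human labeling, `preference_loss` is minimized on the labeled pairs and `relabel` is applied to every transition in the replay buffer before the next stretch of RL training.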
Environments
We evaluate in the robot kitchen environment from D4RL, which requires a 7-DOF robotic arm to operate a kitchen. Within this environment, we consider a variety of manipulation tasks of varying difficulty. The simplest tasks involve a single subtask, such as opening the microwave or moving the kettle, while more challenging tasks require the agent to compose multiple subtasks. Overall, we consider 6 evaluation tasks that require chaining one, two, or three subtasks.
Long-Horizon Skill Execution from Human Labels
[Figure: learning curves comparing SkiP (our method) against Flat Prior and other baselines.]
SkiP is able to match the oracle baseline asymptotically and outperforms other baselines.
SkiP is also label-efficient: it uses 120 human labels during skill extraction and 300-1k human labels during skill execution, depending on the task's complexity.
Video Summary