PEBBLE: Feedback-Efficient Interactive Reinforcement Learning via Relabeling Experience and Unsupervised Pre-training

Kimin Lee*, Laura Smith*, Pieter Abbeel

UC Berkeley

*Equal contribution

[Code] [Paper]

Abstract

Conveying complex objectives to reinforcement learning (RL) agents can often be difficult, involving meticulous design of reward functions that are sufficiently informative yet easy enough to provide. Human-in-the-loop RL methods allow practitioners to instead interactively teach agents through tailored feedback; however, such approaches have been challenging to scale since human feedback is very expensive. In this work, we aim to make this process more sample- and feedback-efficient. We present an off-policy, interactive RL algorithm that capitalizes on the strengths of both feedback and off-policy learning. Specifically, we learn a reward model by actively querying a teacher's preferences between two clips of behavior and use it to train an agent. To enable off-policy learning, we relabel all the agent's past experience when its reward model changes. We additionally show that pre-training our agents with unsupervised exploration substantially increases the mileage of its queries. We demonstrate that our approach is capable of learning tasks of higher complexity than previously considered by human-in-the-loop methods, including a variety of locomotion and robotic manipulation skills. We also show that our method is able to utilize real-time human feedback to effectively prevent reward exploitation and learn new behaviors that are difficult to specify with standard reward functions.

Illustration of PEBBLE

First, the agent engages in unsupervised pre-training, during which it is encouraged to visit a diverse set of states so that its queries provide more meaningful signal than queries on randomly collected experience (left). Then, a teacher provides preferences between two clips of behavior, and we learn a reward model from them. The agent is updated to maximize the expected return under this model, and we also relabel all of its past experience with the model so that it can be fully reused to update the policy (right).

Unsupervised pre-training via state entropy maximization

We pre-train the policy using only an intrinsic reward that encourages the agent to explore and collect diverse experiences.
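As a rough illustration, here is a minimal sketch (not the released implementation) of a particle-based state-entropy bonus: each state is rewarded according to the log-distance to its k-th nearest neighbor among recently visited states, so the agent earns more intrinsic reward for reaching regions of state space it has rarely visited.

# Minimal sketch of a k-nearest-neighbor state-entropy intrinsic reward
# (illustrative only; not the released PEBBLE implementation).
import numpy as np

def knn_intrinsic_reward(states: np.ndarray, k: int = 5) -> np.ndarray:
    """states: (N, state_dim) array of recently visited states.
    Returns an (N,) array of intrinsic rewards."""
    # Pairwise Euclidean distances between all states in the batch.
    dists = np.linalg.norm(states[:, None, :] - states[None, :, :], axis=-1)
    # Distance to the k-th nearest neighbor (index 0 is the zero
    # self-distance after sorting, so index k skips it).
    kth_dists = np.sort(dists, axis=-1)[:, k]
    # log(1 + d) keeps the bonus finite when neighbors coincide.
    return np.log(kth_dists + 1.0)

# Usage: score a batch of states collected by the exploring policy.
batch = np.random.randn(256, 17)  # e.g. 17-dim proprioceptive observations
r_int = knn_intrinsic_reward(batch, k=5)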

Reward learning from preferences

We learn a reward function that leads to the desired behavior from a teacher's preference feedback.
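The sketch below (a simplified PyTorch illustration, not the released code) shows the core idea: a learned per-step reward is summed over each clip, the two segment returns define a Bradley-Terry probability that one clip is preferred over the other, and the reward model is trained with a cross-entropy loss against the teacher's labels.

# Simplified sketch of preference-based reward learning
# (illustrative only; not the released PEBBLE implementation).
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs: torch.Tensor, act: torch.Tensor) -> torch.Tensor:
        # Per-step reward estimate r_hat(s, a); output shape (..., 1).
        return self.net(torch.cat([obs, act], dim=-1))

def preference_loss(model, seg0, seg1, labels):
    """seg0, seg1: (obs, act) tensors of shape (batch, segment_len, dim);
    labels: (batch,) with 1 if the teacher prefers seg1, else 0."""
    ret0 = model(*seg0).sum(dim=1).squeeze(-1)  # summed reward over clip 0
    ret1 = model(*seg1).sum(dim=1).squeeze(-1)  # summed reward over clip 1
    # Bradley-Terry model: preference probability is a softmax over returns.
    logits = torch.stack([ret0, ret1], dim=-1)
    return nn.functional.cross_entropy(logits, labels.long())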

Off-policy RL with non-stationary reward

We update the agent using an off-policy RL algorithm with relabeling to mitigate the effects of a non-stationary reward function.
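The sketch below (assuming a hypothetical replay-buffer layout with obs, act, and rew tensors) illustrates the relabeling step: every stored reward is recomputed with the current reward model whenever the model changes, so old transitions remain consistent with the latest reward estimate.

# Sketch of replay-buffer relabeling under a hypothetical buffer layout
# (illustrative only; not the released PEBBLE implementation).
import torch

@torch.no_grad()
def relabel_replay_buffer(buffer, reward_model, batch_size: int = 4096):
    """buffer.obs and buffer.act have shape (capacity, dim),
    buffer.rew has shape (capacity,), and buffer.size counts valid entries."""
    for start in range(0, buffer.size, batch_size):
        end = min(start + batch_size, buffer.size)
        obs = buffer.obs[start:end]
        act = buffer.act[start:end]
        # Overwrite stored rewards with the current model's prediction.
        buffer.rew[start:end] = reward_model(obs, act).squeeze(-1)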

Experiments with a Human Teacher

Novel Behaviors

We demonstrate that PEBBLE can efficiently learn behaviors for which a typical reward function is difficult to engineer.

Clockwise windmill

Counter-clockwise windmill

Quadruped waving its left front leg

Quadruped waving its right front leg

Hopper backflip

Mitigating Reward Exploitation

We also show that PEBBLE can avoid reward exploitation, leading to more desirable behaviors compared to an agent trained with respect to an engineered reward function.

Agent trained with human preferences


We can train the Walker to walk in a more natural, human-like manner (using both legs).

Agent trained with hand-engineered reward


Even though it achieves the maximum score, the Walker agent learns to walk using only one leg.

Experiments with a Scripted Teacher

Comparison to Prior Work

We compare our approach to prior methods on a variety of complex continuous control tasks, where the agent cannot directly observe the ground truth reward function. Specifically, we consider learning locomotion skills as well as robotic manipulation.
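Here the preferences come from a scripted teacher rather than a human. The sketch below assumes the simplest, noise-free variant, which prefers whichever clip has the larger ground-truth return.

# Sketch of a noise-free scripted teacher (an assumed, simplified variant)
# that labels a query by comparing ground-truth returns of two clips.
import numpy as np

def scripted_preference(true_rewards_0: np.ndarray,
                        true_rewards_1: np.ndarray) -> int:
    """Each argument holds the ground-truth per-step rewards of one clip.
    Returns 1 if clip 1 is preferred, else 0."""
    return int(true_rewards_1.sum() > true_rewards_0.sum())

# Usage: label a query built from two 50-step clips.
label = scripted_preference(np.random.rand(50), np.random.rand(50))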

Learning curves on locomotion tasks as measured by the ground truth reward. The solid lines and shaded regions represent the mean and standard deviation, respectively, across ten runs. Asymptotic performance of PPO and Preference PPO is indicated by dotted lines of the corresponding color.

Learning curves on robotic manipulation tasks as measured by success rate. The solid lines and shaded regions represent the mean and standard deviation, respectively, across ten runs. Asymptotic performance of PPO and Preference PPO is indicated by dotted lines of the corresponding color.

Ablations

To evaluate the individual effect of each technique in PEBBLE, we conduct an ablation study on Quadruped-walk.

Effects of relabeling and pre-training

Contribution of relabeling the replay buffer (relabel) and unsupervised pre-training (pre-train)

Sampling schemes

Effects of the sampling scheme used to select queries
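As context, the sketch below illustrates one such scheme, disagreement-based sampling: an ensemble of reward models scores every candidate pair, and the pairs on which the ensemble disagrees most are sent to the teacher (a simplified illustration, not the released code).

# Sketch of disagreement-based query selection with a reward-model ensemble
# (illustrative only; not the released PEBBLE implementation).
import torch

def select_queries_by_disagreement(ensemble, seg0, seg1, num_queries: int):
    """ensemble: list of reward models mapping (obs, act) -> per-step reward;
    seg0, seg1: (obs, act) tensors of shape (num_candidates, seg_len, dim)."""
    probs = []
    with torch.no_grad():
        for model in ensemble:
            ret0 = model(*seg0).sum(dim=1).squeeze(-1)
            ret1 = model(*seg1).sum(dim=1).squeeze(-1)
            # Bradley-Terry probability that clip 1 is preferred.
            probs.append(torch.sigmoid(ret1 - ret0))
    disagreement = torch.stack(probs, dim=0).std(dim=0)
    # Query the candidate pairs the ensemble is least certain about.
    return torch.topk(disagreement, num_queries).indices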

Length of the segment

PEBBLE with varying segment lengths

Bibtex

@inproceedings{2021pebble,
  title={PEBBLE: Feedback-Efficient Interactive Reinforcement Learning via Relabeling Experience and Unsupervised Pre-training},
  author={Lee, Kimin and Smith, Laura and Abbeel, Pieter},
  booktitle={International Conference on Machine Learning},
  year={2021}
}