Ellen Novoseller, Vinicius G. Goecks, David Watkins, Josh Miller, and Nicholas Waytowich
In machine learning for sequential decision-making, an algorithmic agent learns to interact with an environment while receiving feedback in the form of a reward signal. However, in many unstructured real-world settings, such a reward signal is unavailable, and humans cannot reliably handcraft one that correctly captures the desired behavior. To solve tasks in such unstructured and open-ended environments, we present Demonstration-Inferred Preference Reinforcement Learning (DIP-RL), an algorithm that leverages human demonstrations in three distinct ways: training an autoencoder, seeding reinforcement learning (RL) training batches with demonstration data, and inferring preferences over behaviors to learn a reward function that guides RL. We evaluate DIP-RL in a tree-chopping task in Minecraft. Results suggest that DIP-RL can learn a reward function that reflects human preferences and use it to guide an RL agent, and that DIP-RL performs competitively relative to baselines. DIP-RL is inspired by our previous work on combining demonstrations and pairwise preferences in Minecraft, which was awarded a research prize at the 2022 NeurIPS MineRL BASALT competition, Learning from Human Feedback in Minecraft.
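To make the preference-inference step concrete, the following is a minimal sketch of fitting a reward model to pairwise preferences over trajectory segments with a Bradley-Terry style loss, as is common in preference-based RL. The class and tensor names (RewardNet, seg_a, seg_b) and all dimensions are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch (PyTorch) of learning a reward model from pairwise preferences
# with a Bradley-Terry style loss. All names are hypothetical illustrations,
# not the authors' implementation.
import torch
import torch.nn as nn

class RewardNet(nn.Module):
    """Maps a (compact) per-step state encoding to a scalar reward."""
    def __init__(self, input_dim: int, hidden_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, T, input_dim) -> per-step rewards (batch, T)
        return self.net(x).squeeze(-1)

def preference_loss(reward_net, seg_a, seg_b, prefs):
    """seg_a, seg_b: (batch, T, input_dim) trajectory segments.
    prefs: (batch,) with 1.0 if segment A is preferred, 0.0 if B is preferred."""
    ret_a = reward_net(seg_a).sum(dim=1)  # predicted return of each segment
    ret_b = reward_net(seg_b).sum(dim=1)
    # Bradley-Terry: P(A preferred over B) = sigmoid(ret_a - ret_b)
    logits = ret_a - ret_b
    return nn.functional.binary_cross_entropy_with_logits(logits, prefs)

# Example training step on a batch of labeled preference queries.
reward_net = RewardNet(input_dim=32)
optimizer = torch.optim.Adam(reward_net.parameters(), lr=3e-4)
seg_a = torch.randn(8, 50, 32)            # 8 query pairs, 50-step segments
seg_b = torch.randn(8, 50, 32)
prefs = torch.randint(0, 2, (8,)).float()  # human preference labels
loss = preference_loss(reward_net, seg_a, seg_b, prefs)
optimizer.zero_grad(); loss.backward(); optimizer.step()
```

The learned reward model then supplies the reward signal that the RL algorithm optimizes in place of an environment-defined reward.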
System diagram of the Demonstration-Inferred Preference Reinforcement Learning (DIP-RL) algorithm. DIP-RL leverages two modalities of human feedback: demonstrations and pairwise preferences. The demonstrations are used to 1) train an autoencoder that learns a compact state representation (the autoencoder training data can include non-task-specific demonstrations as well as task-specific trajectories), 2) provide trajectory segments for pairwise preference queries, and 3) provide experience to seed the RL replay buffer. Pairwise preferences are used to learn a reward function that informs the reinforcement learning algorithm.
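As a rough illustration of the replay-buffer seeding step, the sketch below assumes a simple scheme in which each RL update batch mixes demonstration transitions with the agent's own experience. The buffer class, the 50/50 mixing ratio, and all names are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch of seeding RL training batches with demonstration data.
# The buffers, the 50/50 mixing ratio, and all names are illustrative
# assumptions rather than the authors' implementation.
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity: int):
        self.storage = deque(maxlen=capacity)

    def add(self, transition):
        # transition: (state, action, next_state, done); rewards are supplied
        # at update time by the learned preference-based reward model.
        self.storage.append(transition)

    def sample(self, n: int):
        n = min(n, len(self.storage))
        return random.sample(list(self.storage), n)

demo_buffer = ReplayBuffer(capacity=100_000)   # filled once from human demos
agent_buffer = ReplayBuffer(capacity=100_000)  # filled online by the RL agent

def sample_mixed_batch(batch_size: int):
    """Draw half of each RL update batch from demonstrations and half
    from the agent's own experience."""
    half = batch_size // 2
    return demo_buffer.sample(half) + agent_buffer.sample(batch_size - half)
```

In this scheme, the demonstration buffer is populated once from recorded human play, while the agent buffer grows online during RL training; each update then sees both sources of experience.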
Figure legend: Human Demonstration; Reinforcement Learning with the Soft Actor-Critic (SAC) Algorithm [1]; Soft-Q Imitation Learning (SQIL) [2]; Demonstration-Inferred Preference Reinforcement Learning (DIP-RL).
[1] Haarnoja, Zhou, Abbeel, and Levine, "Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor", ICML 2018.
[2] Reddy, Dragan, and Levine, "SQIL: Imitation Learning via Reinforcement Learning with Sparse Rewards", ICLR 2020.
Contact ellen.novoseller.civ@army.mil for more information on the project.