Deep reinforcement learning (RL) currently shows impressive results in complex gaming and robotic environments. These results are often achieved at the expense of huge computational costs and require an enormous number of episodes of interaction between the agent and the environment. There are two main approaches to improving the sample efficiency of reinforcement learning methods: using hierarchical methods and using expert demonstrations. In this paper, we propose a combination of these approaches that allows the agent to use demonstrations that are of low quality after processing in complex vision-based environments with multiple related goals. Our Forgetful Experience Replay (ForgER) algorithm effectively handles errors in expert data and reduces quality losses when adapting the action space and state representation to the agent's capabilities. Our proposed goal-oriented structuring of the replay buffer allows the agent to automatically identify sub-goals for solving complex hierarchical tasks in demonstrations. Our method is universal and can be integrated into various off-policy methods. It surpasses the existing state-of-the-art RL methods that use expert demonstrations on a variety of model environments. The solution based on our algorithm outperforms all solutions submitted to the well-known MineRL competition and allows the agent to mine a diamond in the Minecraft environment.
The idea of hierarchical augmentation is to use data from other subtasks as extra data during the imitation phase for each policy. Both the supervised loss function and the pseudo-rewards are turned off for this extra data. Using this type of augmentation in ForgER, we can solve two problems. First, the margin loss function causes the agent to learn to act like the expert at the cost of generalization; the additional data prevents such overfitting. Second, the division into subtasks means that only part of the data is used to learn each option policy; training on the additional data with the TD loss alone lets the agent exploit information originating in other subtasks. For example, in Minecraft, behaviors such as avoiding obstacles or floating out of the water can be reused across different subtasks.
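As an illustration, the following Python sketch shows how such augmented batches could be formed and how the supervised margin term could be restricted to the subtask's own expert data. The buffer structure (`demo_buffers`, a dict mapping subtask names to transition lists), the sampling fraction, and the margin value are hypothetical assumptions, not quantities taken from the paper; the margin loss follows the standard DQfD form. Any TD loss would apply to the whole batch, with pseudo-rewards likewise disabled for the augmentation transitions.

```python
import random
import torch

def sample_augmented_batch(subtask, demo_buffers, batch_size, aug_frac=0.25):
    """Mix the subtask's own demonstrations with transitions drawn from the
    other subtasks' buffers; the boolean flag marks the subtask's own data,
    for which the supervised terms and pseudo-rewards remain enabled."""
    n_aug = int(batch_size * aug_frac)
    own = random.sample(demo_buffers[subtask], batch_size - n_aug)
    others = [t for name, buf in demo_buffers.items()
              if name != subtask for t in buf]
    aug = random.sample(others, min(n_aug, len(others)))
    return [(t, True) for t in own] + [(t, False) for t in aug]

def margin_loss(q_values, expert_actions, margin=0.8):
    """Large-margin classification loss in the DQfD style:
    max_a [Q(s, a) + l(a_E, a)] - Q(s, a_E), with l = margin for a != a_E.
    Applied only to transitions flagged as the subtask's own expert data."""
    l = torch.full_like(q_values, margin)
    l.scatter_(1, expert_actions.unsqueeze(1), 0.0)  # zero margin at a_E
    q_expert = q_values.gather(1, expert_actions.unsqueeze(1)).squeeze(1)
    return ((q_values + l).max(dim=1).values - q_expert).mean()
```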
The forgetting approach is part of our architecture designed for hierarchical tasks, but it can also be used on its own in learning-from-demonstrations tasks, where it showed better results than the standard approach. In this paper, we address several problems with expert demonstrations that can be solved using ForgER.
Forgetting is the process of dynamically changing the rate at which expert data and agent data are sampled from the replay buffer.
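A minimal Python sketch of this idea is given below. The two-buffer layout, the buffer and parameter names, and the linear annealing schedule are illustrative assumptions rather than the paper's exact scheme; the point is only that the share of expert transitions in each batch decays as the agent accumulates its own experience.

```python
import random

class ForgetfulReplay:
    """Sketch of forgetful sampling: the expert share of each batch is
    annealed from start_expert_frac down to end_expert_frac."""

    def __init__(self, expert_buffer, agent_buffer,
                 start_expert_frac=1.0, end_expert_frac=0.05,
                 decay_steps=100_000):
        self.expert_buffer = expert_buffer
        self.agent_buffer = agent_buffer
        self.start = start_expert_frac
        self.end = end_expert_frac
        self.decay_steps = decay_steps
        self.step = 0

    def expert_fraction(self):
        """Current share of expert transitions (linear decay, clipped)."""
        progress = self.step / self.decay_steps
        return max(self.end, self.start + (self.end - self.start) * progress)

    def sample(self, batch_size):
        """Draw a batch whose expert/agent mix gradually 'forgets'
        expert data in favor of the agent's own experience."""
        self.step += 1
        n_expert = min(int(batch_size * self.expert_fraction()),
                       len(self.expert_buffer))
        batch = random.sample(self.expert_buffer, n_expert)
        batch += random.sample(self.agent_buffer,
                               min(batch_size - n_expert,
                                   len(self.agent_buffer)))
        return batch
```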