Fine-tuning Reinforcement Learning Models is Secretly a Forgetting Mitigation Problem

Maciej Wołczyk*¹ Bartłomiej Cupiał*¹² Mateusz Ostaszewski³ Michał Bortkiewicz³ Michał Zając⁴
Razvan Pascanu⁵ Łukasz Kuciński¹²⁶ Piotr Miłoś¹²⁶⁷

*Equal contribution ¹IDEAS NCBR ²University of Warsaw ³Warsaw University of Technology ⁴Jagiellonian University
⁵Google DeepMind ⁶Institute of Mathematics, Polish Academy of Sciences ⁷deepsense.ai

Contact: {maciej.wolczyk, bartlomiej.cupial}@gmail.com

[Paper] [Poster] [Code]

Online fine-tuning with knowledge retention unlocks state-of-the-art results
in the challenging NetHack Learning Environment.

News

2024.07: Check out our guest lecture at UCL Dark here!
2024.05: We will be at ICML 2024! Come by our spotlight poster #1410 on Tue 23 Jul 11:30 a.m. CEST - 1 p.m. CEST!

Abstract

Fine-tuning is a widespread technique that allows practitioners to transfer pre-trained capabilities, as recently showcased by the successful applications of foundation models. However, fine-tuning reinforcement learning (RL) models remains a challenge. This work conceptualizes one specific cause of poor transfer, accentuated in the RL setting by the interplay between actions and observations: forgetting of pre-trained capabilities. Namely, a model deteriorates on the state subspace of the downstream task not visited in the initial phase of fine-tuning, on which the model behaved well due to pre-training. This way, we lose the anticipated transfer benefits. We identify conditions when this problem occurs, showing that it is common and, in many cases, catastrophic. Through a detailed empirical analysis of the challenging NetHack and Montezuma’s Revenge environments, we show that standard knowledge retention techniques mitigate the problem and thus allow us to take full advantage of the pre-trained capabilities. In particular, in NetHack, we achieve a new state-of-the-art for neural models, improving the previous best score from 5K to over 10K points in the Human Monk scenario.

Motivation

While fine-tuning has been key to unlocking the potential of LLMs and other foundation models, it has not yet achieved similar success in reinforcement learning. This gap matters because offline pre-trained models in decision-making scenarios are often limited by the expertise of the data-generating expert and by a lack of understanding of action-environment interactions. Online fine-tuning should, in theory, allow us to go beyond the performance of the data-generating expert and adapt existing models to new situations. In this paper, we study a crucial issue blocking the progress of fine-tuning methods in RL and propose a fix for it.

Consider a model proficient in picking up objects and a task where the target object is hidden inside a drawer. If we fine-tune on such a task, the model will learn how to open the drawer to access the object, but during the learning process, it will lose its original object-manipulation skills. This is a classic example of catastrophic forgetting studied in the continual learning literature.

Why do we see forgetting?

State Coverage Gap

Let's divide the environment states into two distinct sets:

Close states: These are immediately accessible to the agent (e.g., the first level of a game, opening the drawer).
Far states: These require effort to reach (e.g., the second level of a game, picking up objects).

During the initial stages of fine-tuning, an agent that does not know how to behave in Close states will not see Far states at all. Because of this limited exposure, the agent forgets how to behave in Far states. The cause is catastrophic forgetting, a well-documented tendency of neural networks to rapidly lose previously acquired knowledge once the corresponding data is no longer present in the optimization process.

Our solution to this problem is to protect the knowledge we have about the Far states during fine-tuning. Fortunately, we can use knowledge retention techniques from the continual learning literature to achieve this.
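
As a minimal sketch of what such a retention term could look like, assume a distillation-style penalty that keeps the fine-tuned policy close to a frozen copy of the pre-trained policy. The function names (retention_penalty, total_loss), the coefficient retention_coef, and the choice of a KL penalty are illustrative assumptions, not necessarily the exact setup used in the paper. The batch of states is assumed to include Far states, for example drawn from stored pre-training data.

```python
import math


def softmax(logits):
    """Convert a list of action logits into action probabilities."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def retention_penalty(pretrained_logits, current_logits):
    """Mean KL(pretrained || current) over a batch of states.

    Each argument is a list of per-state action logits. The penalty is zero
    when the fine-tuned policy matches the pre-trained one on these states.
    """
    kls = [
        kl_divergence(softmax(old), softmax(new))
        for old, new in zip(pretrained_logits, current_logits)
    ]
    return sum(kls) / len(kls)


def total_loss(rl_loss, pretrained_logits, current_logits, retention_coef=1.0):
    """Hypothetical combined objective: RL loss plus a weighted retention term."""
    return rl_loss + retention_coef * retention_penalty(
        pretrained_logits, current_logits
    )


# Toy usage: one state where the fine-tuned policy has drifted away.
old = [[2.0, 0.0]]
new = [[0.0, 2.0]]
print(total_loss(rl_loss=0.5, pretrained_logits=old, current_logits=new))
```

The retention coefficient trades off plasticity against stability: a value of zero recovers vanilla fine-tuning, while a very large value keeps the policy pinned to its pre-trained behavior on the penalized states.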

Imperfect Cloning Gap & NetHack

We show that even in scenarios without an apparent environment shift, forgetting remains a significant challenge. In particular, we consider the problem of fine-tuning behaviorally-cloned models in the challenging NetHack domain.

What is NetHack?

NetHack is a classic roguelike game known for its incredibly deep and complex gameplay featuring permadeath, procedurally generated levels, and an astounding number of item and monster interactions. The recently proposed NetHack Learning Environment, built on the game, serves as a demanding testbed for RL agents.

Imitation Gap

We use a model trained on vast amounts of NetHack data using behavioral cloning (BC); see Tuyls et al. for details. Despite the extensive training, imitation learning often does not perfectly replicate the expert's behavior. This imperfection leads to a significant distribution shift between the trajectories collected by the pre-trained model and those of the expert. A density plot of the dungeon levels reached shows that our expert (the AutoAscend bot) gets deeper into the dungeon than the cloned policy. Even though our pre-trained model has seen data on how to behave on level 3 and beyond, these levels will rarely appear in its own rollouts.

As the model makes decisions that deviate slightly from the expert's choices, it fails to reach Far states. The reduced exposure to states it was pre-trained on leads to a rapid decline in the model's performance in these areas, which is another instance of catastrophic forgetting. We can again use continual learning methods to fix it.
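
A toy calculation illustrates why even slight deviations matter. It rests on simplifying assumptions that are not from the paper: the clone agrees with the expert independently at each step with a fixed probability, and a single disagreement is enough to keep it from reaching the Far states. The function name reach_probability and the example numbers are hypothetical.

```python
def reach_probability(per_step_agreement, steps):
    """Probability of matching the expert for `steps` consecutive decisions.

    Toy model (assumption): each step agrees with the expert independently
    with probability `per_step_agreement`, and any disagreement means the
    Far states are never reached.
    """
    if not 0.0 <= per_step_agreement <= 1.0:
        raise ValueError("per_step_agreement must be in [0, 1]")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return per_step_agreement ** steps


# Compare a perfect clone with two slightly imperfect ones.
for agreement in (1.0, 0.999, 0.99):
    for steps in (100, 1000):
        prob = reach_probability(agreement, steps)
        print(f"agreement={agreement}, steps={steps}: {prob:.5f}")
```

Under these assumptions, the chance of reaching states that lie many decisions away shrinks geometrically with the distance, so a clone that is almost right at every step can still almost never visit the Far states during fine-tuning.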

Results

Our experiments reveal a stark contrast between vanilla fine-tuning and approaches incorporating knowledge retention techniques. Surprisingly, vanilla fine-tuning fails to outperform training from scratch, underscoring the severity of the forgetting problem in online RL fine-tuning. However, when we apply knowledge retention methods, we observe a significant improvement in performance across all tested environments. In NetHack, we achieve a new state of the art for neural models, more than doubling the previous best score from 5,000 to 10,500 points in the Human Monk scenario.

Conclusion

We demonstrate that forgetting is a crucial problem in online RL fine-tuning, but we can fix it by applying knowledge retention methods. Thus, we can fully harness the power of pre-trained models in RL.
