Forget Forgetting: Continual Learning in a World of Abundant Memory
Dongkyu Cho, Taesup Moon, Rumi Chunara, Kyunghyun Cho, Sungmin Cha
Abstract
Continual learning (CL) has traditionally focused on minimizing exemplar memory, a constraint often misaligned with modern systems where GPU time, not storage, is the primary bottleneck. This paper challenges this paradigm by investigating a more realistic regime: one where memory is abundant enough to mitigate forgetting, but full retraining from scratch remains prohibitively expensive. In this practical "middle ground", we find that the core challenge shifts from stability to plasticity, as models become biased toward prior tasks and struggle to learn new ones. Conversely, improved stability allows simple replay baselines to outperform state-of-the-art methods at a fraction of the GPU cost. To address this newly surfaced trade-off, we propose Weight Space Consolidation, a lightweight method that combines (1) rank-based parameter resets to restore plasticity with (2) weight averaging to enhance stability. Validated on both class-incremental learning with image classifiers and continual instruction tuning with large language models, our approach outperforms strong baselines while matching the low computational cost of replay, offering a scalable alternative to expensive full retraining. These findings challenge long-standing CL assumptions and establish a new, cost-efficient baseline for real-world CL systems where exemplar memory is no longer the limiting factor.
Background
Continual learning studies how a model can learn tasks sequentially without catastrophically forgetting earlier ones. The core issue is the stability–plasticity tradeoff: models must stay stable enough to retain past knowledge, while remaining plastic enough to adapt to new tasks. Many prior methods address this with replay or regularization, often under very small memory budgets.
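For reference, the replay baseline discussed throughout can be as simple as the following reservoir-sampling exemplar buffer. This is a generic sketch, not the paper's implementation; the class and method names are ours.

```python
import random

class ReplayBuffer:
    """Exemplar memory with reservoir sampling: every example seen so far
    has an equal chance of remaining in the buffer."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.data = []
        self.seen = 0

    def add(self, example):
        self.seen += 1
        if len(self.data) < self.capacity:
            self.data.append(example)
        else:
            j = random.randrange(self.seen)   # uniform over the whole stream
            if j < self.capacity:
                self.data[j] = example

    def sample(self, k):
        """Draw a replay minibatch to mix with the current task's batch."""
        return random.sample(self.data, min(k, len(self.data)))
```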
Motivation
Conventional continual learning assumes that memory is the bottleneck, so most methods focus on making tiny replay buffers work. In modern settings, however, storage is relatively cheap while GPU retraining is expensive, and once memory is large enough, forgetting becomes far less severe. The real problem then becomes recovering plasticity, the ability to learn new tasks efficiently. This is why the paper proposes a low-cost weight-space method that restores plasticity without sacrificing the stability that abundant memory already provides.
Proposed Method
Weight Space Consolidation (WSC) combines two simple weight-space operations. First, after a short warmup, it identifies low-importance or dormant parameters using a gradient-based ranking score and softly resets them toward the previous stable weights to recover plasticity. Second, it performs running weight averaging during training so the model converges to a flatter, more stable solution, improving retention with little extra compute.
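A minimal PyTorch-style sketch of how the two operations could fit together is below. The hyperparameter names and values (reset_fraction, alpha, warmup_steps) and the exact importance formula are illustrative assumptions, not the paper's verbatim implementation; an Adam-style optimizer is assumed, since its running moments supply the gradient-based ranking score.

```python
import torch

def importance_scores(model, optimizer):
    """Rank parameters by recent gradient activity using Adam's moments;
    entries with low scores are treated as dormant."""
    scores = {}
    for name, p in model.named_parameters():
        state = optimizer.state.get(p, {})
        if "exp_avg" in state:
            scores[name] = state["exp_avg"].abs() * state["exp_avg_sq"].sqrt()
    return scores

def soft_reset(model, prev_weights, scores, reset_fraction=0.8, alpha=0.5):
    """Softly pull the lowest-scoring weights back toward the previous task's
    stable solution: p <- (1 - alpha) * p + alpha * p_prev on the reset mask."""
    with torch.no_grad():
        for name, p in model.named_parameters():
            if name not in scores:
                continue
            threshold = torch.quantile(scores[name].flatten().float(), reset_fraction)
            mask = scores[name] <= threshold          # low-importance entries
            p[mask] = (1 - alpha) * p[mask] + alpha * prev_weights[name][mask]

def train_task(model, optimizer, loader, prev_weights, warmup_steps=200):
    """Per-task loop: short warmup, one soft reset, then running weight averaging."""
    avg, n_avg = None, 0
    for step, (x, y) in enumerate(loader):
        loss = torch.nn.functional.cross_entropy(model(x), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if step == warmup_steps:                      # single reset after warmup
            soft_reset(model, prev_weights, importance_scores(model, optimizer))
        if step >= warmup_steps:
            snapshot = {k: v.detach().clone() for k, v in model.state_dict().items()}
            if avg is None:
                avg, n_avg = snapshot, 1
            else:                                     # running mean of the weights
                n_avg += 1
                for k in avg:
                    if avg[k].is_floating_point():
                        avg[k] += (snapshot[k] - avg[k]) / n_avg
    if avg is not None:
        model.load_state_dict(avg)                    # deploy the averaged weights
    return {k: v.detach().clone() for k, v in model.state_dict().items()}
```

Averaging only after the reset keeps the two operations from fighting each other: the reset restores plasticity early in the task, and the running average then smooths the remaining trajectory toward a flatter, more stable solution.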
Main Results
We show that when replay memory is sufficiently large, simple replay alone becomes much stronger, and the key bottleneck shifts from forgetting to plasticity: the ability to adapt to new tasks.
In that setting, Weight Space Consolidation (WSC) consistently matches or beats strong baselines on both class-incremental vision benchmarks and continual instruction tuning for LLMs, while keeping training cost close to that of naive replay, roughly 3–4× cheaper than state-of-the-art methods.
The ablations also show that both parts of WSC matter: reset alone is not enough, averaging alone is not enough, and the combination gives the best results.
Ablations and Takeaways
Both ingredients matter, and they work best together. Across all memory sizes on CIFAR-100, the full method consistently outperforms Replay and both single-component variants, showing that parameter reset and weight averaging are complementary rather than interchangeable. Reset alone gives only modest gains, while averaging alone helps more but still falls short of the combined method.
Our reset criterion is both effective and cheap. The proposed moment-based importance score performs on par with much more expensive Hessian-based alternatives, while clearly outperforming variants that use only the first or only the second moment. This suggests that the benefit comes not just from resetting, but from resetting the right parameters.
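Concretely, the comparison contrasts scores like the following, read off an Adam-style optimizer's state. The combined formula is an assumption carried over from the sketch above, not the paper's exact expression.

```python
def candidate_scores(optimizer, p):
    """Three candidate importance scores for parameter tensor p, built from
    Adam's running moments; all far cheaper than any Hessian-based score."""
    state = optimizer.state[p]
    m, v = state["exp_avg"], state["exp_avg_sq"]
    return {
        "first_only": m.abs(),              # first moment alone
        "second_only": v.sqrt(),            # second moment alone
        "combined": m.abs() * v.sqrt(),     # both moments, as in the sketch above
    }
```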
How and when we reset also matters. Our soft reset strategy outperforms random reset, hard reversion, Shrink-and-Perturb, and Continual Backprop, with the advantage growing at larger memory sizes. In addition, a single reset after warmup works best in most settings, while overly frequent resets can hurt performance.
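To make the contrast concrete, the reset variants compared here roughly correspond to the following operations on a masked subset of weights. Signatures and constants are illustrative, Shrink-and-Perturb is adapted to the masked setting for comparability, and all functions assume a torch.no_grad() context.

```python
import torch

def hard_revert(p, prev, mask):
    """Hard reversion: overwrite selected weights with the previous solution."""
    p[mask] = prev[mask]

def shrink_and_perturb(p, mask, shrink=0.8, noise_std=0.01):
    """Shrink-and-Perturb: scale selected weights down and add fresh noise."""
    p[mask] = shrink * p[mask] + noise_std * torch.randn_like(p[mask])

def soft_reset(p, prev, mask, alpha=0.5):
    """Soft reset (per the sketch above): interpolate toward the previous
    stable weights instead of discarding the current ones."""
    p[mask] = (1 - alpha) * p[mask] + alpha * prev[mask]
```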
Only a small fraction of parameters appear necessary for adapting to new tasks. In the retain-rate analysis, resetting up to 80% of parameters causes only minor degradation, suggesting that continual adaptation relies on a relatively small active subset of weights. Weight averaging further improves robustness under very aggressive resets.
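In terms of the sketches above, the retain rate simply sets the quantile threshold for the reset mask; this is an illustrative mapping, and the paper's exact retain-rate protocol may differ.

```python
import torch

retain_rate = 0.2                        # keep the top 20% most important weights
reset_fraction = 1.0 - retain_rate       # softly reset the remaining 80%
scores = torch.rand(1_000_000)           # stand-in for a real importance tensor
threshold = torch.quantile(scores, reset_fraction)
reset_mask = scores <= threshold         # ~80% of entries selected for reset
```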