Introduction
Recent advances in large language models (LLMs) have significantly extended their ability to handle long input contexts and generate lengthy chains of reasoning, with some models supporting context windows of up to 100K tokens. However, as context length and model size grow, the memory required to store the key-value (KV) cache increases dramatically, posing major challenges for efficient inference.
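For concreteness, the short sketch below estimates this cost. The formula (2 tensors × layers × KV heads × head dimension × sequence length × bytes per element) is standard; the LLaMA-2-7B-style configuration (32 layers, 32 KV heads, head dimension 128, fp16) is used purely as an illustrative example.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int = 2) -> int:
    """Bytes needed to cache keys and values for one sequence (K and V -> factor 2)."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem

# A LLaMA-2-7B-like configuration in fp16 at a 100K-token context:
# roughly 52 GB of KV cache for a single sequence.
print(kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128,
                     seq_len=100_000) / 1e9)  # ~52.4 GB
```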
One promising direction is hybrid models, in which some standard transformer layers are replaced with more memory-efficient mechanisms (e.g., RNNs, Mamba, sliding-window attention). This strategy often yields substantial memory savings but typically requires large-scale training from scratch. To address this limitation, we propose LIGHTTRANSFER, a method that directly transforms pretrained transformers (e.g., LLaMA, Mistral) into hybrid models with minimal overhead.
Observing that certain “lazy” layers in long-context LLMs concentrate their attention on a small set of largely uninformative tokens, such as the initial and most recent ones, we selectively convert these layers to streaming attention while retaining full attention in the remaining layers. For tasks whose inputs are sufficiently long (i.e., long-context understanding), we identify lazy layers on the fly during the prefilling stage; we call this variant LIGHTTRANSFER-Test. For o1-like long reasoning generation tasks, where the questions can be relatively short (only a few dozen tokens) yet demand higher model capacity, we surprisingly find that minimal training is enough to retain robust performance; we call this variant LIGHTTRANSFER-Train. In practice, this transition requires only around 5K samples (originally used for long-reasoning ability distillation), underscoring the lightweight nature of our approach.
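As an illustration, the sketch below computes a per-layer lazy ratio in the spirit described above: the fraction of attention mass that the last query positions assign to the sink (initial) tokens plus the recent window. The specific window sizes and the choice to average over the last queries are illustrative assumptions, not the exact recipe used by LIGHTTRANSFER.

```python
import torch

def lazy_ratio(attn_weights: torch.Tensor, num_sink: int = 4,
               num_recent: int = 64, num_queries: int = 32) -> float:
    """Fraction of attention mass that the last `num_queries` query positions
    place on the sink tokens (prefix) and the recent window of one layer.

    attn_weights: (num_heads, seq_len, seq_len) post-softmax attention weights;
    seq_len is assumed to exceed num_sink + num_recent.
    """
    q = attn_weights[:, -num_queries:, :]           # keep only the last queries
    sink_mass = q[..., :num_sink].sum(dim=-1)       # mass on the first tokens
    recent_mass = q[..., -num_recent:].sum(dim=-1)  # mass on the trailing window
    return (sink_mass + recent_mass).mean().item()  # average over heads and queries
```

A layer whose ratio is close to 1 loses little by switching to streaming attention, since streaming attention retains exactly the sink and recent tokens.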
Experiment Results
Results on Long Reasoning Generation
Figure 1. Lazy ratio scores across layers in QwQ-32B-STILL.
Results on Long-Context Understanding (LongBench)
Methodology
LightTransfer-Test
Figure 2. The framework of our LightTransfer-Test. A priority queue is maintained during the prefilling stage to store the lazy ratio and corresponding layer index after processing each layer. Once the queue reaches its capacity, the layer with the highest lazy ratio is identified as a lazy layer, and its KV cache is reduced, freeing memory for storing the KV cache of the current layer.
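A minimal sketch of this bookkeeping is given below. The heap-based selector, the `kv_cache` layout (a dict mapping layer index to K/V tensors with the sequence on the second-to-last axis), and the sink/recent window sizes are illustrative assumptions rather than the exact implementation; the intent is only to show how the laziest layer is evicted once the queue is full.

```python
import heapq
import torch

class LazyLayerSelector:
    """Track layers during prefilling; once more than `capacity` layers hold a
    full KV cache, the layer with the highest lazy ratio is treated as lazy and
    its KV cache is trimmed to the sink and recent tokens."""

    def __init__(self, capacity: int, kv_cache: dict,
                 num_sink: int = 4, num_recent: int = 64):
        self.capacity = capacity          # number of layers allowed to keep full KV
        self.kv_cache = kv_cache          # layer_idx -> (K, V), shape (..., seq_len, head_dim)
        self.num_sink, self.num_recent = num_sink, num_recent
        self.heap = []                    # max-heap over lazy ratios via negation

    def after_layer(self, layer_idx: int, ratio: float) -> None:
        """Call once per layer during prefilling with its lazy ratio."""
        heapq.heappush(self.heap, (-ratio, layer_idx))
        if len(self.heap) > self.capacity:
            _, lazy_idx = heapq.heappop(self.heap)   # highest lazy ratio seen so far
            self._trim(lazy_idx)                     # free most of its KV cache

    def _trim(self, layer_idx: int) -> None:
        k, v = self.kv_cache[layer_idx]
        keep = lambda t: torch.cat([t[..., :self.num_sink, :],
                                    t[..., -self.num_recent:, :]], dim=-2)
        self.kv_cache[layer_idx] = (keep(k), keep(v))
```

Layers still in the queue when prefilling ends keep full attention, while the evicted layers continue generation with streaming attention over their trimmed caches.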
LightTransfer-Train