EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty

by Yuhui Li, Fangyun Wei, Chao Zhang, and Hongyang Zhang, December 8, 2023

Vector Institute, University of Waterloo, Peking University, Microsoft Research

[Code with Apache-2.0] [Paper]

Figure 1: Medusa was tested on Vicuna's benchmark and Lookahead on LLaMA2-chat's benchmark by their original authors. To make a fair comparison, we run EAGLE on both benchmarks; Medusa's and Lookahead's numbers are copied from their original technical reports.

TL;DR: We introduce EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency), a new baseline for fast decoding of Large Language Models (LLMs) that provably preserves the distribution of the generated text. The approach extrapolates the second-top-layer contextual feature vectors of the LLM, yielding a significant boost in generation efficiency.

In summary, EAGLE is:

- faster than vanilla auto-regressive decoding, as well as Lookahead and Medusa, on the benchmarks in Figure 1;
- provably consistent with the Original LLM in the distribution of generated text;
- cheap to train and test, running on commodity GPUs such as the RTX 3090.

Figure 2:  Generation speed of Vicuna 33B using different methods, with inference conducted on RTX 3090 GPUs at fp16 precision. For an enhanced viewing experience, the animation has been sped up fourfold. 

Introduction

Large Language Models (LLMs) like ChatGPT demonstrate remarkable capabilities and are increasingly applied in various domains. However, their text generation process is costly and slow. This inefficiency stems from the nature of auto-regressive decoding: generating each token requires a forward pass that accesses the entire parameter set of the LLM, which can amount to several tens or even hundreds of billions of parameters. As a result, auto-regressive decoding is memory-bound.


One approach to accelerating auto-regressive decoding is speculative decoding. This technique employs a smaller draft model to guess the next γ tokens through standard auto-regressive generation. Subsequently, the Original LLM validates these guessed tokens, requiring only a single forward pass for verification. If the draft model accurately predicts α tokens, a single forward pass of the Original LLM can generate α+1 tokens.
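For intuition, here is a minimal sketch of the standard chain-style acceptance rule used by speculative sampling (not EAGLE-specific); the function name and tensor shapes are illustrative assumptions.

```python
import torch

def accept_draft_tokens(draft_tokens, draft_probs, target_probs):
    """Standard chain-style speculative sampling acceptance (sketch, names are illustrative).

    draft_tokens: (gamma,) token ids proposed by the draft model.
    draft_probs:  (gamma, vocab) distributions the draft model sampled from.
    target_probs: (gamma + 1, vocab) distributions from the Original LLM's single
                  verification pass (one extra position for the "bonus" token).
    Returns the accepted tokens plus one token drawn from the (adjusted) target.
    """
    accepted = []
    for i, tok in enumerate(draft_tokens):
        p, q = target_probs[i, tok], draft_probs[i, tok]
        if torch.rand(()) < torch.clamp(p / q, max=1.0):      # accept with prob min(1, p/q)
            accepted.append(int(tok))
        else:
            # rejected: resample from the residual distribution norm(max(p - q, 0)) and stop
            residual = torch.clamp(target_probs[i] - draft_probs[i], min=0.0)
            accepted.append(int(torch.multinomial(residual / residual.sum(), 1)))
            return accepted
    # all gamma drafts accepted: take the extra "free" token from the target model
    accepted.append(int(torch.multinomial(target_probs[len(draft_tokens)], 1)))
    return accepted
```

This accept-or-resample rule is what guarantees that the accepted tokens follow the Original LLM's distribution, which is why the overall method is lossless.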


In speculative decoding, the draft model and the Original LLM share the same task: predicting the next token given the current sequence of tokens. Accomplishing this task with a model that has far fewer parameters is extremely challenging and often yields sub-optimal results. Furthermore, the draft model's independent predictions do not leverage the rich semantic information already extracted by the Original LLM, leading to potential inefficiencies.


This limitation inspired the development of EAGLE. Our approach utilizes the contextual features extracted by the Original LLM, i.e., the second-top-layer outputs gathered while it predicts the next token, which require no additional computation. EAGLE builds upon the following first principle:


The sequence of features is compressible, making the prediction of subsequent feature vectors from previous ones easy.


We train a lightweight plugin, called the Auto-regression Head, in conjunction with the Original LLM's frozen embedding layer, to predict the next feature based on the current sequence of second-top-layer features of the Original model. The token is then derived using the frozen classification head of the Original LLM, which maps features to tokens. Because features are more abstract and exhibit more regularity than token sequences, regressing features is considerably simpler than regressing tokens. In summary, EAGLE extrapolates at the feature level using a small auto-regressive head and then employs the frozen classification head to generate the predicted sequence of tokens.
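As a rough illustration of what such a head could look like, here is a minimal sketch, assuming the previous feature and the corresponding token embedding are concatenated, projected back to the hidden size, and passed through one causal self-attention layer; the class name, layer choice, and sizes are assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class AutoRegressionHead(nn.Module):
    """Illustrative single-layer feature extrapolator; sizes and wiring are assumptions."""

    def __init__(self, hidden_size=4096, num_heads=32):
        super().__init__()
        # fuse the previous feature with the embedding of the token sampled at that position
        self.fuse = nn.Linear(2 * hidden_size, hidden_size)
        # one causal self-attention layer standing in for the "single-layer transformer decoder"
        self.layer = nn.TransformerEncoderLayer(
            d_model=hidden_size, nhead=num_heads, batch_first=True
        )

    def forward(self, prev_features, prev_token_embeds):
        """Both inputs are (batch, seq, hidden); returns predicted next features."""
        x = self.fuse(torch.cat([prev_features, prev_token_embeds], dim=-1))
        causal_mask = nn.Transformer.generate_square_subsequent_mask(x.size(1)).to(x.device)
        return self.layer(x, src_mask=causal_mask)

# The Original LLM's frozen classification head then maps the predicted features to
# token logits, e.g. logits = original_llm.lm_head(predicted_features).
```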

Consistent with similar works such as speculative sampling, Medusa, and Lookahead, our focus is on per-prompt inference latency rather than overall system throughput.

EAGLE – Enhancing LLM Generation Efficiency

Figure 3: A comparison of how different methods "guess" the fourth and fifth tokens, t4 and t5, under the guess-verify framework, where t1 t2 is the prompt. 't' (blue blocks) represents tokens, and 'f' (orange blocks) represents the features from the second-top layer, with subscripts indicating their positions in the sequence. For simplicity, the "n" in Lookahead's n-gram shown in the figure has been set to 2.

The figure below illustrates the workflow of EAGLE. During the forward pass of the Original LLM, we collect the features from its second-top layer. Starting from these features and a token generated in this forward pass by the Original LLM, the FeatExtrapolator (i.e., the Auto-regression Head) begins its "guessing" process. The FeatExtrapolator integrates the token embeddings and generates the next feature in an auto-regressive manner. The distribution over tokens is then obtained with the frozen LM head, and we sample from this distribution. By repeating this sampling multiple times, we conduct a tree-like generation process, as depicted on the right side of the figure. In this example, three forward passes of the FeatExtrapolator "guess" a tree of 10 tokens.
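A minimal sketch of this drafting loop, reduced to a single chain for brevity (the actual method samples several candidates per step to expand a tree); the module handles auto_regression_head, lm_head, and embed_tokens are assumptions.

```python
import torch

@torch.no_grad()
def draft_chain(last_feature, last_token, auto_regression_head, lm_head, embed_tokens, steps=4):
    """Sketch of the drafting loop for a single chain; the real method samples several
    candidates per step to expand a tree. All module handles are assumptions.

    last_feature: (1, 1, hidden) second-top-layer feature from the Original LLM's last pass.
    last_token:   (1, 1) the token the Original LLM just generated.
    """
    features, tokens, probs = [last_feature], [last_token], []
    for _ in range(steps):
        feat_seq = torch.cat(features, dim=1)
        emb_seq = embed_tokens(torch.cat(tokens, dim=1))
        next_feature = auto_regression_head(feat_seq, emb_seq)[:, -1:, :]  # extrapolate one feature
        dist = torch.softmax(lm_head(next_feature), dim=-1)                # frozen LM head -> tokens
        next_token = torch.multinomial(dist[0, 0], 1).view(1, 1)
        features.append(next_feature)
        tokens.append(next_token)
        probs.append(dist[0, 0, next_token.item()])                        # recorded for verification
    return tokens[1:], probs                                               # drafted tokens + draft probabilities
```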

Figure 4: Schematic diagram of EAGLE. The left side illustrates the computational process, while the right side displays the corresponding generation results for each step. Green blocks represent token embeddings, orange blocks represent features from the second-top layer of the LLM, red boxes indicate features predicted by the Auto-regression Head, and blue modules with snowflake icons represent frozen parameters of the Original LLM, which are not trained.

We employ the lightweight FeatExtrapolator to predict the features of the Original LLM, which may not always be precise (as indicated by the features within red boxes in the figure). To ensure the consistency of the generated text distribution, we subsequently verify the predicted tree. Owing to the properties of causal LMs, this verification process can be completed in a single forward pass, which also simultaneously generates a token. Using this generated token and the collected features, the FeatExtrapolator can make further "guesses". Through this cycle of prediction and verification, the LLM is enabled to generate tokens rapidly.
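For a chain-shaped draft, a hedged sketch of this single-pass verification might look as follows; `original_llm` is assumed to be a Hugging-Face-style causal LM whose output exposes `.logits`.

```python
import torch

@torch.no_grad()
def verify_once(original_llm, context_ids, drafted_ids):
    """One forward pass of the Original LLM over context + drafted tokens yields its
    distribution at every drafted position, thanks to the causal-LM property.
    Sketch only; `original_llm` is assumed to be a Hugging-Face-style model exposing .logits.
    """
    input_ids = torch.cat([context_ids, drafted_ids], dim=1)        # (1, ctx_len + draft_len)
    logits = original_llm(input_ids).logits                         # (1, ctx_len + draft_len, vocab)
    # positions ctx_len-1, ..., end predict the drafted tokens plus one "bonus" position
    return torch.softmax(logits[:, context_ids.size(1) - 1:, :], dim=-1)
```

For a tree-shaped draft, the same idea applies, with an attention mask that restricts each drafted node to attend only to its ancestors (see the sketch in the Tree Generation section below).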


Training the Auto-regression Head is straightforward. For all models, we use the ShareGPT dataset, which comprises fewer than 70,000 conversational rounds. The Auto-regression Head (FeatExtrapolator) also has a minimal number of trainable parameters. As indicated by the blue sections in the figure above, the majority of components are frozen; the only trainable component is the Auto-regression Head itself, a single-layer transformer decoder with 0.24B-0.99B parameters. Even with GPU-poor setups, training the Auto-regression Head is feasible. For example, the Auto-regression Head for Vicuna 33B can be trained on RTX 3090 GPUs within 24 hours.
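As a rough sketch of how such a training objective could be written, assuming the head is supervised with a Smooth L1 feature-regression term plus a lightly weighted token-level cross-entropy through the frozen LM head (the exact losses and weighting are assumptions):

```python
import torch
import torch.nn.functional as F

def eagle_training_loss(pred_features, target_features, lm_head, target_tokens, w_cls=0.1):
    """Sketch of a combined objective for the Auto-regression Head (losses and weight assumed).

    pred_features / target_features: (batch, seq, hidden) predicted vs. actual second-top-layer
    features of the Original LLM; target_tokens: (batch, seq) ground-truth next tokens.
    """
    reg_loss = F.smooth_l1_loss(pred_features, target_features)       # feature-level regression
    logits = lm_head(pred_features)                                    # frozen LM head: features -> tokens
    cls_loss = F.cross_entropy(logits.flatten(0, 1), target_tokens.flatten())
    return reg_loss + w_cls * cls_loss
```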

Why Do We Use the Token Embedding?

Unlike Medusa, which predicts tokens at various offsets (e.g., the next token, the one following it, etc.) using only the second-top layer's features, the FeatExtrapolator additionally incorporates the embedding of the next token into its input for prediction. This additional information helps the FeatExtrapolator handle the randomness inherent in the sampling process.


Consider the example in the figure below where the query is "I". The LLM outputs probabilities for "I" being followed by "am" or "always". Medusa, not factoring in whether "am" or "always" is sampled, directly predicts the probability of the token following the next one. Therefore, Medusa's target is the probability of the token following "I am" or "I always". Due to the randomness of the sampling process, Medusa's identical inputs can have different targets, leading to a lack of a consistent mapping between inputs and outputs.


In contrast, EAGLE's input includes the embedding of the sampled result, ensuring a consistent mapping between input and output. This distinction allows FeatExtrapolator to more accurately predict subsequent tokens, accounting for the specific context established by the sampling process.
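The difference can be summed up in how the inputs are constructed; the snippet below is purely illustrative and the function names are hypothetical.

```python
import torch

def medusa_style_input(feature_i):
    """Medusa-style heads see only the current feature, regardless of which next token
    was actually sampled, so "am" and "always" lead to the same input (sketch)."""
    return feature_i

def eagle_style_input(feature_i, sampled_next_token, embed_tokens):
    """EAGLE also feeds the embedding of the sampled token, so the same feature paired
    with "am" vs. "always" yields different inputs and hence a unique target (sketch)."""
    return torch.cat([feature_i, embed_tokens(sampled_next_token)], dim=-1)
```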

Figure 5: Demonstrating the Importance of Using Token Embedding. The figure depicts the LLM's generation process using "I" as the query, where the LLM predicts the next token to be either "am" or "always." The left side assumes the sampling of "am," and the right side assumes "always." For both EAGLE and Medusa, the target is to predict the probability of the token that follows the next token, which means predicting the token after "I am" (on the left side) or after "I always" (on the right side), depending on the random sampling outcome. Medusa's input is the feature of "I," which does not account for the sampled result, leading to different targets for the same input. In contrast, EAGLE's input includes the embedding of the sampled result - either "am" (left side) or "always" (right side), ensuring a unique target for each input.

Tree Generation

Unlike other guess-verify frameworks such as speculative sampling, Lookahead, and Medusa, our approach employs tree-like generation during the "guessing" phase, which achieves greater efficiency. As illustrated in the diagram, the generation processes of speculative sampling and Lookahead are linear or chain-like. Medusa, unable to construct context during the guessing phase, builds a tree via the Cartesian product, leading to a fully connected tree across adjacent layers. This frequently results in nonsensical combinations, such as "I am begin." EAGLE, on the other hand, creates a sparser tree structure. The sparse tree is more selective and context-aware, preventing the formation of nonsensical sequences and focusing computational resources on more plausible token combinations.
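To make the tree concrete: a draft tree can be described by each node's parent index, and the whole tree can be verified in one forward pass using an attention mask that lets each node attend only to its ancestors. The sketch below illustrates that bookkeeping; it is not EAGLE's actual tree configuration.

```python
import torch

def tree_attention_mask(parents):
    """Build a boolean attention mask for a draft tree given each node's parent index
    (-1 for the root). Node i may attend to node j iff j is i itself or an ancestor of i,
    which is what allows the Original LLM to verify the whole tree in one forward pass.
    Sketch only; EAGLE's actual tree shape is a tuned configuration, not shown here.
    """
    n = len(parents)
    mask = torch.zeros(n, n, dtype=torch.bool)
    for i in range(n):
        j = i
        while j != -1:
            mask[i, j] = True
            j = parents[j]
    return mask

# Example: root 0 with children 1 and 2, and node 3 a child of node 1.
print(tree_attention_mask([-1, 0, 0, 1]))
```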

Figure 6: Schematic diagram illustrating the structure of text generation under different methods within the guess-verify framework.

Multi-Round Speculative Sampling

Speculative sampling, traditionally used for chain-like guessing processes, maintains distribution consistency. To adapt it for tree-like guessing scenarios, we have extended this approach into a multi-round recursive form. The pseudocode of Multi-Round Speculative Sampling is presented below.
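A minimal Python sketch of one way to implement this recursion is given below; the DraftNode bookkeeping (storing the Original LLM's distribution p, the draft distribution q, and the drafted children at each position) is an assumption about how the recorded probabilities could be organized, not the paper's pseudocode verbatim.

```python
import torch
from dataclasses import dataclass, field
from typing import List

@dataclass
class DraftNode:
    """Hypothetical record of one drafted position (the exact bookkeeping is an assumption)."""
    token: int                                  # token id drafted at this node (ignored for the root)
    p: torch.Tensor                             # Original LLM's distribution here, shape (vocab,)
    q: torch.Tensor                             # draft distribution its children were sampled from, (vocab,)
    children: List["DraftNode"] = field(default_factory=list)

def multi_round_speculative_sampling(node: DraftNode) -> List[int]:
    """Recursively accept/reject the drafted candidates under `node`, returning accepted
    token ids followed by one token drawn from the (possibly adjusted) target distribution."""
    p = node.p.clone()
    for child in node.children:
        # accept this candidate with probability min(1, p / q), as in chain speculative sampling
        if torch.rand(()) < torch.clamp(p[child.token] / node.q[child.token], max=1.0):
            return [child.token] + multi_round_speculative_sampling(child)
        # rejected: remove the draft's mass, renormalize, and try the next sibling
        p = torch.clamp(p - node.q, min=0.0)
        p = p / p.sum()
    # all candidates rejected (or leaf node): sample the final token from the adjusted target
    return [int(torch.multinomial(p, 1))]
```

Each candidate is accepted with probability min(1, p/q); on rejection, the draft's probability mass is removed and the remaining distribution is renormalized before the next sibling is considered, which is what preserves the Original LLM's output distribution.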

During Tree Generation, we record the probability corresponding to each sampled token. Through Multi-Round Speculative Sampling, we ensure that the distribution of every token generated ultimately aligns with that of the Original LLM. 

Experimental Results

The following figure illustrates the acceleration achieved by EAGLE with Vicuna 33B across different types of tasks. The "coding" task, which involves a substantial number of fixed templates, shows the best acceleration.

Acknowledgement

This project has been influenced by many excellent projects in the LLM community, such as Medusa, Lookahead, and others. The logo is designed by GPT-4. We also appreciate many valuable discussions with Tianle Cai, Hao Zhang, Ziteng Sun, and others.