Accelerated LLM Decoding via Monte Carlo Tree Search and Self-evolving Speculative Decoding
News and Updates:
🚀 Dec 2024: Our paper Adaptive Draft-Verification for Efficient Large Language Model Decoding has been accepted at AAAI 2025. See you in Philadelphia!🌎
Large language model (LLM) decoding involves generating a sequence of tokens based on a given context, where each token is predicted one at a time using the model's learned probabilities.
The typical autoregressive decoding method requires a separate forward pass through the model for each token generated, which is computationally inefficient and poses challenges for deploying LLMs in latency-sensitive scenarios.
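For concreteness, the sketch below shows this token-by-token loop, assuming a generic PyTorch-style model callable that returns per-position logits; the function and parameter names are illustrative rather than tied to any particular library API.

```python
import torch

def autoregressive_decode(model, input_ids, max_new_tokens=32):
    """Minimal sketch of standard autoregressive decoding: one full forward
    pass of the LLM is required for every single generated token."""
    tokens = list(input_ids)
    for _ in range(max_new_tokens):
        logits = model(torch.tensor([tokens]))     # shape [1, seq_len, vocab]
        next_token = int(logits[0, -1].argmax())   # greedy pick of the next token
        tokens.append(next_token)                  # the sequence grows by one token per pass
    return tokens
```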
The main limitations of current decoding methods stem from their inefficiencies and resource demands. Existing approaches either necessitate fine-tuning smaller models, which is resource-intensive, or rely on fixed retrieval schemes to construct drafts for the next tokens, which lack adaptability and fail to generalize across different models and contexts.
To address these issues, we introduce a novel methodology called Adaptix, a retrieval-based speculative decoding method that accelerates LLM decoding without requiring fine-tuning. Our approach involves an adaptive draft-verification process that evolves over time to improve efficiency. We utilize a tri-gram matrix-based LLM representation to dynamically approximate the output distribution of the LLM, allowing the model to adjust to changing token probabilities during the decoding process. Additionally, we implement a draft construction mechanism that effectively balances exploration and exploitation, ensuring that the drafts generated are both diverse and close to the true output distribution of the LLM.
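As a rough illustration of the tri-gram matrix idea, the sketch below keeps a count table over (previous two tokens → next token) transitions and normalizes it into an approximate next-token distribution. The class name and layout are our own simplification, not the paper's exact implementation.

```python
from collections import defaultdict, Counter

class TriGramMatrix:
    """Sketch of a tri-gram count table used to approximate the LLM's
    next-token distribution from previously observed token sequences."""

    def __init__(self):
        # (token_{t-2}, token_{t-1}) -> Counter over the observed next tokens
        self.counts = defaultdict(Counter)

    def update(self, tokens):
        # Count every (prev2, prev1) -> next transition in the observed sequence.
        for a, b, c in zip(tokens, tokens[1:], tokens[2:]):
            self.counts[(a, b)][c] += 1

    def next_token_probs(self, prev2, prev1):
        # Normalize counts into an approximate conditional distribution.
        ctr = self.counts[(prev2, prev1)]
        total = sum(ctr.values())
        return {tok: n / total for tok, n in ctr.items()} if total else {}
```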
The importance of this design lies in its ability to optimize the draft distribution adaptively, leading to faster and more accurate decoding. Through extensive experiments on various benchmark datasets and LLM architectures, we demonstrate that Adaptix accelerates the decoding process while maintaining high accuracy, making it suitable for deployment in a wide range of practical applications.
Speedup on Vicuna-7B
Achieves up to 2.5x speed improvements in large language model decoding without the need for additional model training.
Efficiently accelerates decoding with a minimal memory footprint, requiring only 253MB of corpus data for a 1.9x speedup.
Operates without the need for GPU resources, making it highly suitable for deployment on devices with limited hardware capabilities.
Features an adaptive mechanism that improves prediction accuracy over time, ensuring better performance the more it is used.
This figure illustrates the data processing workflow of Adaptix. Initially, the input tokens undergo preprocessing to compute their tri-grams, which are used to update the tri-gram matrix. Subsequently, the updated matrix, in conjunction with the last two tokens of the input, is used to retrieve potential token sequences. These sequences are ranked, the top-k are selected and appended to the original input, and the extended sequences are then fed into the Large Language Model for prediction.
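The retrieval-and-ranking step could look roughly like the following sketch, which expands candidate continuations with a simple beam-style search over the TriGramMatrix sketch above; the actual construction in Adaptix is MCTS-based, so the scoring, parameter names, and top-k ranking here are illustrative assumptions.

```python
import heapq

def build_drafts(matrix, input_ids, k=4, draft_len=4):
    """Beam-style sketch of draft construction: retrieve candidate continuations
    starting from the last two input tokens, score them by the product of
    estimated tri-gram probabilities, and keep the top-k."""
    beams = [(1.0, list(input_ids[-2:]), [])]   # (score, 2-token context window, draft so far)
    for _ in range(draft_len):
        expanded = []
        for score, ctx, draft in beams:
            probs = matrix.next_token_probs(ctx[-2], ctx[-1])
            if not probs:
                expanded.append((score, ctx, draft))   # dead end: keep the partial draft
                continue
            for tok, p in probs.items():
                expanded.append((score * p, ctx[1:] + [tok], draft + [tok]))
        beams = heapq.nlargest(k, expanded, key=lambda b: b[0])   # rank and keep the top-k
    # Each selected draft is appended to the original input and verified by the
    # LLM in a single forward pass.
    return [list(input_ids) + draft for _, _, draft in beams]
```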
In the experiments, we compare the efficacy of different baselines applied to various models, utilizing three datasets: MT-Bench, Human-Eval, and Alpaca. We focus on three metrics: accept length, latency, and speedup ratio.
Across all three datasets, Adaptix consistently demonstrates lower latency.
For instance, on MT-Bench, Adaptix achieves a latency of 12.95 ms for Vicuna-7B, which is lower than REST (16.31 ms), REST Single Thread (17.36 ms), and Lookahead (18.93 ms).
Notably, the memory required for Adaptix (574MB) is only 5.6% of that required for REST (12GB). Even when Adaptix uses a smaller corpus (253MB, just 2.5% of REST's requirement), it still achieves lower latency than REST. This trend is also observed on Alpaca, where Adaptix achieves a latency of 12.81 ms for Vicuna-7B, compared to 14.24 ms for REST, 14.58 ms for REST Single Thread, and 18.73 ms for Lookahead.
The accept length results in this table reflect how closely the constructed drafts match the LLM's output distribution, with longer accept lengths meaning more draft tokens are verified per forward pass. Our method, Adaptix, outperforms other methods across different models on all three datasets, showcasing the quality of its draft construction.
Throughput and Speedup on MT-Bench
Throughput and Speedup on Alpaca
Throughput and Speedup on Human-Eval
The speedup ratio is used to evaluate efficiency.
Adaptix consistently shows a significant speedup across all datasets.
This efficiency is evident on MT-Bench, Alpaca, and Human-Eval, where Adaptix not only reduces latency but also improves overall throughput.
For instance, Adaptix achieves a speedup of 1.92x on MT-Bench with the Vicuna-13B model, outperforming REST, REST Single Thread, and Lookahead. On the Human-Eval dataset, the Vicuna-33B model demonstrates a speedup of nearly 2.5x when using Adaptix.
We test Adaptix across different categories of tasks in MT-Bench, including writing, roleplay, reasoning, math, coding, extraction, STEM, and humanities. The experimental results indicate that Adaptix maintains consistent performance across categories. The average accept length remains stable, demonstrating that Adaptix can effectively handle a diverse range of tasks without significant variations in performance.
This figure illustrates the performance of our adaptive strategy on Vicuna-7B. The adaptive strategy maintains a higher average accept length over the entire range of inputs compared to the non-adaptive strategy.
The adaptive strategy's success is attributed to its dynamic adjustment of the model's probability distributions based on the tri-gram frequencies from prior outputs. This allows the model to better manage longer contexts and maintain relevance, enhancing stability and coherence in longer interactions.
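Building on the sketches above, the self-evolving behaviour can be pictured as the loop below: each verification pass folds the newly accepted tokens back into the tri-gram matrix, so the draft distribution keeps tracking the model's recent outputs. The verify_draft helper and the fallback step are hedged illustrations that assume the same logits-returning model as before, not the paper's exact procedure.

```python
import torch

def verify_draft(model, tokens, draft):
    """Accept the longest prefix of the draft that matches the model's own
    greedy predictions, using a single forward pass over the extended input."""
    preds = model(torch.tensor([tokens + draft])).argmax(dim=-1)[0]
    accepted = []
    for i, tok in enumerate(draft):
        # The prediction for draft position i comes from position len(tokens)+i-1.
        if int(preds[len(tokens) - 1 + i]) != tok:
            break
        accepted.append(tok)
    return accepted

def decode_with_adaptation(model, matrix, input_ids, steps=16):
    """Sketch of the adaptive loop: drafts are retrieved, verified, and the
    verified tokens immediately update the tri-gram matrix."""
    tokens = list(input_ids)
    matrix.update(tokens)
    for _ in range(steps):
        draft = build_drafts(matrix, tokens, k=1)[0][len(tokens):]   # best draft tail only
        accepted = verify_draft(model, tokens, draft) if draft else []
        if accepted:
            tokens += accepted
        else:
            # Fall back to one ordinary autoregressive step so decoding progresses.
            tokens.append(int(model(torch.tensor([tokens]))[0, -1].argmax()))
        matrix.update(tokens[-6:])   # refresh counts with the most recent context (a simplification)
    return tokens
```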
Vicuna-7B on MT-Bench
Vicuna-13B on MT-Bench
The results of the Vicuna-7B and Vicuna-13B models on the MT-Bench dataset show the impact of different MCTS search counts on performance. For both models, increasing the number of searches improves performance, while the optimal number varies by model size. The average accept length and latency are plotted against the number of searches, illustrating the trade-off between performance and computational cost.
As shown in the figure, the average accept length improves notably as the number of searches increases, along with an increase in latency, indicating a balance between search depth and generation time.
We further compare greedy search with MCTS (full traversal is not feasible due to the huge search space) while keeping the number of search iterations fixed at 150.
The results show that the average accept length of greedy search is only 1.493, significantly lower than that obtained by MCTS, demonstrating the advantage of MCTS in efficiently navigating the vast decision space.
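For intuition, the sketch below contrasts a greedy selector with a small UCT-style MCTS over the tri-gram table from the sketches above; the UCT constant, random rollout policy, and reward definition (the product of estimated tri-gram probabilities) are illustrative assumptions rather than the paper's exact algorithm.

```python
import math
import random
from collections import defaultdict

def greedy_draft(matrix, prev2, prev1, draft_len=4):
    """Greedy baseline: always follow the single most frequent continuation."""
    draft = []
    for _ in range(draft_len):
        probs = matrix.next_token_probs(prev2, prev1)
        if not probs:
            break
        tok = max(probs, key=probs.get)
        draft.append(tok)
        prev2, prev1 = prev1, tok
    return draft

def mcts_draft(matrix, prev2, prev1, n_iters=150, draft_len=4, c=1.4):
    """UCT-style search: first-level tokens are chosen with an exploration
    bonus, deeper tokens by random rollouts weighted by tri-gram probabilities."""
    visits = defaultdict(int)    # (state, token) -> visit count
    value = defaultdict(float)   # (state, token) -> accumulated reward
    root = (prev2, prev1)

    def uct(tok, total):
        n = visits[(root, tok)]
        if n == 0:
            return float("inf")  # make sure every child is tried at least once
        return value[(root, tok)] / n + c * math.sqrt(math.log(total) / n)

    def rollout(state):
        # Sample a continuation; reward is the product of probabilities on the path.
        reward, tail = 1.0, []
        for _ in range(draft_len - 1):
            probs = matrix.next_token_probs(*state)
            if not probs:
                break
            tok = random.choices(list(probs), weights=list(probs.values()))[0]
            reward *= probs[tok]
            tail.append(tok)
            state = (state[1], tok)
        return reward, tail

    best_reward, best_draft = -1.0, []
    for _ in range(n_iters):
        probs = matrix.next_token_probs(*root)
        if not probs:
            break
        total = sum(visits[(root, t)] for t in probs) + 1
        tok = max(probs, key=lambda t: uct(t, total))   # selection via UCT
        reward, tail = rollout((root[1], tok))
        reward *= probs[tok]
        visits[(root, tok)] += 1                        # backpropagate statistics
        value[(root, tok)] += reward
        if reward > best_reward:
            best_reward, best_draft = reward, [tok] + tail
    return best_draft
```

The exploration bonus in the UCT score is what lets the search try less frequent continuations that a greedy selector would never consider, which is consistent with the gap in average accept length reported above.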