Spec-Bench: A Comprehensive Benchmark and Unified Evaluation Platform for Speculative Decoding

Heming Xia, Zhe Yang, Qingxiu Dong, Peiyi Wang, Yongqi Li, Tao Ge, Tianyu Liu, Wenjie Li, Zhifang Sui

The Hong Kong Polytechnic University, Peking University, Microsoft Research Asia, Alibaba Group

[Leaderboard] [Paper] [Code] [Twitter]

Speedup comparison of various Speculative Decoding methods on Spec-Bench with greedy settings (T=0). Evaluations were conducted on Vicuna-7B-v1.3 with a batch size of 1. We present the mean speedup over 3 different runs.

To mitigate the high inference latency stemming from autoregressive decoding in Large Language Models (LLMs), Speculative Decoding has emerged as a novel decoding paradigm for LLM inference: at each decoding step, it first efficiently drafts several future tokens and then verifies them in parallel with the target LLM, so that multiple tokens can be accepted per target forward pass. Recent advancements in Speculative Decoding, such as Speculative Sampling, Medusa, and EAGLE, have demonstrated impressive acceleration performance. However, existing methods are evaluated on disparate benchmarks, devices, and testing environments, making fair comparisons impractical.
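For readers new to the paradigm, the following sketch illustrates a single draft-then-verify step with greedy verification. It assumes hypothetical `draft_model` and `target_model` callables that return a greedy next token given a token sequence, plus a draft length `gamma`; it is a conceptual simplification, not the implementation of any particular method.

```python
def speculative_decode_step(prefix, draft_model, target_model, gamma=4):
    """Draft `gamma` tokens cheaply, then verify them with the target model."""
    # 1) Drafting: the small model proposes gamma future tokens autoregressively.
    draft, ctx = [], list(prefix)
    for _ in range(gamma):
        tok = draft_model(ctx)  # cheap forward pass of the drafter
        draft.append(tok)
        ctx.append(tok)

    # 2) Verification: for clarity we query the target greedily position by
    #    position; a real implementation scores prefix + draft with a single
    #    parallel forward pass of the target LLM.
    accepted = []
    for tok in draft:
        target_tok = target_model(list(prefix) + accepted)
        if target_tok == tok:
            accepted.append(tok)         # draft token matches: accept and continue
        else:
            accepted.append(target_tok)  # mismatch: take the target token and stop
            break
    else:
        # All draft tokens accepted: the target pass yields one extra "bonus" token.
        accepted.append(target_model(list(prefix) + accepted))
    return accepted
```

Because several accepted tokens share the cost of one target forward pass, the wall-clock latency per generated token drops whenever the drafter's acceptance rate is high.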

To bridge this gap, we introduce Spec-Bench📝 – a comprehensive benchmark designed for assessing Speculative Decoding methods across diverse application scenarios. Based on Spec-Bench, we aim to establish and maintain a unified evaluation platform for open-source Speculative Decoding approaches. This platform facilitates the systematic assessment🧐 of existing methods under uniform device and testing conditions, thereby ensuring fair comparisons. 

Through Spec-Bench, we hope to provide the research community with more realistic speedup expectations of leading Speculative Decoding methods, fostering further advancements in this promising area.

Introducing Spec-Bench

To assess Speculative Decoding methods across various application scenarios, we developed Spec-Bench, a comprehensive evaluation benchmark encompassing six distinct subtasks.

Spec-Bench integrates MT-bench, a multi-turn conversation benchmark adopted in prior research, to provide a basis for comparison with earlier studies. It also includes two input-guided tasks: summarization and retrieval-augmented generation (RAG), both of which exhibit significant overlap between the input prompts and the target outputs. We selected CNN/Daily Mail and Natural Questions as the datasets for these two tasks, respectively. Specifically, in the RAG subtask, the top-5 documents retrieved by DPR were concatenated with each question to construct the input prompt.
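The snippet below sketches how such a RAG prompt could be assembled, assuming the question and the retrieved passages are available as plain strings; the function name, field layout, and template are illustrative placeholders, not the exact format used in Spec-Bench.

```python
def build_rag_prompt(question: str, passages: list[str], top_k: int = 5) -> str:
    """Concatenate the top-k retrieved documents with the question."""
    context = "\n\n".join(passages[:top_k])
    return f"{context}\n\nQuestion: {question}\nAnswer:"


if __name__ == "__main__":
    demo = build_rag_prompt(
        question="Who wrote the novel Dune?",
        passages=["Dune is a 1965 science fiction novel by Frank Herbert.", "..."],
    )
    print(demo)
```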

Moreover, Spec-Bench incorporates three further subtasks – translation, question answering, and mathematical reasoning – to provide a thorough evaluation of Speculative Decoding’s speedup capabilities in diverse contexts. We utilized WMT14 DE-EN, Natural Questions, and GSM8K as the primary datasets for these tasks, respectively. 

We randomly selected 80 instances from each subtask’s test set for evaluation.
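A minimal sketch of this subsampling step is shown below, assuming each subtask's test set is loaded as a Python list; the fixed seed and the dictionary layout are illustrative assumptions rather than the exact Spec-Bench preprocessing code.

```python
import random


def subsample(test_sets: dict[str, list], n: int = 80, seed: int = 0) -> dict[str, list]:
    """Draw a fixed-size random subset from each subtask's test set."""
    rng = random.Random(seed)  # fixed seed keeps the evaluation set reproducible
    return {task: rng.sample(instances, n) for task, instances in test_sets.items()}
```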

Detailed Composition of Spec-Bench.

Experimental Details

We have integrated implementations of six representative Speculative Decoding methods into Spec-Bench, all of which are open-source and verified to run without bugs. Specifically, Speculative Sampling (SpS) is the pioneering work in this field, using a smaller LM from the same model series as the drafter to accelerate LLM inference. Medusa and EAGLE attach additional lightweight heads to the target LLM to enable efficient drafting. Lookahead Decoding appends multiple special tokens to the end of the input prompt for parallel drafting and organizes the drafts into n-gram candidates. PLD is a code implementation of LLMA, which selects text spans from the input as drafts (see the sketch below). REST retrieves relevant drafts from a text corpus based on the input prompt.
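As a concrete illustration of input-guided drafting in the spirit of PLD/LLMA, the sketch below searches the existing context for an earlier occurrence of the current suffix n-gram and copies the tokens that followed it as a draft. The function name and parameters are illustrative assumptions, not the actual PLD code.

```python
def prompt_lookup_draft(tokens: list[int], ngram_size: int = 3, draft_len: int = 8) -> list[int]:
    """Return a draft copied from an earlier occurrence of the current suffix.

    `tokens` is the full context (prompt plus generated tokens). If its last
    `ngram_size` tokens also appear earlier in the context, the tokens that
    followed that earlier occurrence are proposed as the draft.
    """
    if len(tokens) < ngram_size:
        return []
    suffix = tokens[-ngram_size:]
    # Scan from the most recent position backwards, excluding the suffix itself.
    for start in range(len(tokens) - ngram_size - 1, -1, -1):
        if tokens[start:start + ngram_size] == suffix:
            follow = tokens[start + ngram_size:start + ngram_size + draft_len]
            if follow:
                return follow
    return []  # no match: fall back to ordinary autoregressive decoding
```

This copy-from-context strategy explains why such methods shine on summarization and RAG, where the output repeats long spans of the input, and why their gains shrink on translation or free-form QA.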

Currently, Spec-Bench supports evaluations on the Vicuna-v1.3 model series. Below, we present our comparative analysis on Spec-Bench using a single consumer NVIDIA GeForce RTX 3090 GPU and a more powerful NVIDIA A100 GPU.

Comparative Analysis on 3090

Greedy Decoding

We first present our comparative evaluation results on a single NVIDIA 3090 GPU at fp16 precision with greedy settings (T=0). Evaluations were conducted on Vicuna-7B-v1.3 with a batch size of 1.
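All speedup ratios in this post are measured against vanilla autoregressive decoding of the same target model under identical settings. The sketch below shows one way such a ratio could be computed and averaged over runs; the `generate_fn` callables and timing scheme are hypothetical placeholders rather than Spec-Bench's actual evaluation harness.

```python
import statistics
import time


def tokens_per_second(generate_fn, prompts) -> float:
    """Measure throughput of a generation callable over a list of prompts."""
    total_tokens, start = 0, time.perf_counter()
    for prompt in prompts:
        total_tokens += len(generate_fn(prompt))  # generate_fn returns generated token ids
    return total_tokens / (time.perf_counter() - start)


def mean_speedup(spec_fn, baseline_fn, prompts, runs: int = 3) -> float:
    """Average the speedup ratio (speculative vs. autoregressive) over several runs."""
    ratios = [
        tokens_per_second(spec_fn, prompts) / tokens_per_second(baseline_fn, prompts)
        for _ in range(runs)
    ]
    return statistics.mean(ratios)
```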

Under greedy settings, EAGLE achieves the largest speedup over autoregressive decoding (1.8x~2.4x) on most subtasks, with its largest gain on the mathematical reasoning subtask (~2.4x speedup). PLD excels in subtasks with high similarity between input and output, such as summarization (~2.4x speedup). However, its acceleration diminishes in other subtasks such as translation and question answering, where its speedup falls between 1.1x and 1.3x.

Nucleus Sampling

EAGLE consistently outperforms the other methods across different sampling temperatures, achieving speedup ratios ranging from 1.7x to 2.1x. Moreover, we observe that the acceleration of all methods decreases as the sampling temperature increases. This is attributed to the increased computational complexity of the speculative sampling criterion at higher temperatures, as revealed in prior research.
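For context, the speculative sampling criterion accepts a draft token x with probability min(1, p(x)/q(x)), where p and q are the target and draft distributions after temperature is applied; on rejection, a replacement token is drawn from the normalized residual max(0, p - q). The NumPy sketch below illustrates this acceptance step for a single draft token; it is a simplification for illustration, not any specific method's implementation.

```python
import numpy as np


def accept_or_resample(p: np.ndarray, q: np.ndarray, draft_token: int,
                       rng: np.random.Generator) -> tuple[int, bool]:
    """Speculative-sampling acceptance test for one draft token.

    p: target-model distribution over the vocabulary (sums to 1).
    q: draft-model distribution over the vocabulary (sums to 1).
    Returns (token, accepted).
    """
    accept_prob = min(1.0, p[draft_token] / max(q[draft_token], 1e-10))
    if rng.random() < accept_prob:
        return draft_token, True
    # Rejected: sample from the normalized residual distribution max(0, p - q),
    # which keeps the overall output distribution identical to the target model's.
    residual = np.maximum(p - q, 0.0)
    residual /= residual.sum()
    return int(rng.choice(len(p), p=residual)), False
```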

Speedup comparison of various methods on Spec-Bench at different temperatures.

Comparative Analysis on A100

Model Scale

We present the speedup comparison of Speculative Decoding methods on Spec-Bench across various model scales. Among all evaluated methods, EAGLE maintains a high speedup ratio over autoregressive decoding at every model scale, ranging from 2.4x to 2.5x. In particular, it achieves a maximum speedup of 3.0x on the mathematical reasoning subtask with Vicuna-33B. While Medusa demonstrates superior acceleration on Vicuna-7B, its speedup ratio degrades from 2.4x to 2.0x as the model scale increases.

Computational Devices

The speedup of most Speculative Decoding methods is notably higher on high-performance GPUs such as the NVIDIA A100. This is primarily because more advanced devices offer more idle computational resources, which Speculative Decoding can exploit to accelerate the inference process. This finding suggests that Speculative Decoding methods stand to benefit even more from evolving computational hardware, such as H100 GPUs.

Speedup comparison of various methods on Spec-Bench with different computational devices.

Acknowledgments

Spec-Bench is built upon the excellent work of many open-source projects in the LLM community, including Medusa, Lookahead Decoding, EAGLE, and more. We express our sincere gratitude for their groundbreaking efforts in advancing research in this field.