Skeleton-of-Thought

Xuefei Ning*, Zinan Lin*, Zixuan Zhou*, Zifu Wang, Huazhong Yang, Yu Wang

Let us accelerate the end-to-end generation of LLMs by 2x without any change to the model, system, or hardware!

  [ArXiv]   [BibTeX]   [Code]   [OpenReview]

Abstract

This work aims at decreasing the end-to-end generation latency of large language models (LLMs). One of the major causes of the high generation latency is the sequential decoding approach adopted by almost all state-of-the-art LLMs. In this work, motivated by the thinking and writing process of humans, we propose Skeleton-of-Thought (SoT), which first guides LLMs to generate the skeleton of the answer, and then conducts parallel API calls or batched decoding to complete the contents of each skeleton point in parallel. Not only does SoT provide considerable speed-ups across 12 LLMs, but it can also potentially improve the answer quality on several question categories. SoT is an initial attempt at data-centric optimization for inference efficiency, and further underscores the potential of pushing LLMs to think more like a human for answer quality.

Background

The generation process of large language models (LLMs) is slow. For example, it takes 22 seconds for Claude (accessed through Slack in mid-July 2023) and 43 seconds for Vicuna-33B V1.3 (a 33B LLaMA-based model, running locally on one NVIDIA A100 GPU) to answer the question "What are the most effective strategies for conflict resolution in the workplace?".

There are three major causes of LLMs' slow inference: (1) the large memory footprint, memory access, and computation induced by the large model size; (2) the quadratic memory and computation complexity of attention in the prevailing Transformer architecture; and (3) the sequential decoding approach that generates tokens one by one. A large body of literature addresses the first two causes by compressing or redesigning the model, or by redesigning the serving system and hardware.

Can we accelerate off-the-shelf LLMs without any changes to their model, system, or hardware? In this work, we show the feasibility of parallel decoding of off-the-shelf LLMs!

Method

The idea stems from reflecting on how humans answer questions. Humans do not always think about questions and write answers sequentially. Instead, for many types of questions, we first derive the skeleton of the answer according to some protocols and strategies, and then add evidence and details to refine and explicate each point. This is especially the case on formal occasions such as offering consultancy, taking tests, and writing papers. Can we make LLMs think in the same way? To this end, we propose Skeleton-of-Thought (SoT). Specifically, as shown below, we first guide the LLM to derive a skeleton of the answer by itself. Based on the skeleton, the LLM then completes each point in parallel, which yields a speed-up. Note that SoT can be used to accelerate both open-source models (with batched decoding) and closed-source models (with parallel API calls).

(1) Skeleton stage

SoT first assembles a skeleton request using the skeleton prompt template with the original question. The skeleton prompt template is written to guide the LLM to output a concise skeleton of the answer. Then, we extract the B points from the skeleton response of the LLM.

Skeleton prompt template: To keep the output skeleton short and in a consistent format, which benefits both efficiency and ease of point extraction, the skeleton prompt template (1) describes the task precisely, (2) uses two simple demonstrations, and (3) provides a partial answer “1.” for the LLM to continue writing. We find that, in most cases, the skeleton responses are in the desired format, so we can simply use a regular expression to extract point indexes and point skeletons from the skeleton response.
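For illustration, here is a minimal Python sketch of the extraction step. The regular expression below is only an assumed stand-in for numbered skeletons of the form "1. ..."; the exact prompt template and extraction pattern we use are in the technical report.

```python
import re

def extract_points(skeleton_response: str) -> list[tuple[int, str]]:
    """Extract (index, point skeleton) pairs from a numbered skeleton response.

    Assumed format for illustration: one point per line, e.g.
        1. Set goals
        2. Prioritize tasks
    """
    pattern = re.compile(r"^\s*(\d+)\.\s*(.+?)\s*$", re.MULTILINE)
    return [(int(idx), text) for idx, text in pattern.findall(skeleton_response)]

print(extract_points("1. Set goals\n2. Prioritize tasks\n3. Limit distractions"))
# [(1, 'Set goals'), (2, 'Prioritize tasks'), (3, 'Limit distractions')]
```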

(2) Point-expanding stage

Based on the skeleton, we assemble B point-expanding requests using the point-expanding prompt template, and let the LLM expand on all the points in parallel. For proprietary models with only API access, we issue multiple parallel API calls. For open-source models, we let the model process the point-expanding requests as a batch (padding is added to the left of the point-expanding requests). Finally, after all points are completed, we concatenate the point-expanding responses to get the final answer (a minimal sketch is given below).

Point-expanding prompt template: The point-expanding prompt template describes the point-expanding task and provides a partial answer. We also include the instruction “Write it **very shortly** in 1∼2 sentence” so that the LLM keeps the answer concise. Unlike for the skeleton prompt template, we find that demonstrations are not necessary to get reasonable results.
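To make the parallel expansion concrete, below is a minimal sketch of the API-call variant using a thread pool. The template string and the `call_llm` wrapper are placeholders for illustration, not our exact implementation; the real point-expanding prompt template is given in the technical report.

```python
from concurrent.futures import ThreadPoolExecutor

# Placeholder standing in for the actual point-expanding prompt template.
POINT_EXPANDING_TEMPLATE = (
    "Question: {question}\nSkeleton:\n{skeleton}\n"
    "Continue and only continue the writing of point {index}: {point}. "
    "Write it **very shortly** in 1~2 sentence."
)

def expand_points(question, skeleton, points, call_llm, max_workers=8):
    """Expand all skeleton points in parallel via API calls.

    `points` is a list of (index, point skeleton) pairs, e.g. from extract_points();
    `call_llm(prompt) -> str` is a hypothetical wrapper around the chat API in use.
    """
    prompts = [
        POINT_EXPANDING_TEMPLATE.format(
            question=question, skeleton=skeleton, index=i, point=p
        )
        for i, p in points
    ]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        expansions = list(pool.map(call_llm, prompts))
    # Concatenate the point-expanding responses to form the final answer.
    return "\n".join(f"{i}. {text.strip()}" for (i, _), text in zip(points, expansions))
```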


Please refer to our technical report for the prompt template we used.

SoT with Router (SoT-R): Adaptively Triggering SoT

SoT expands the points independently and in parallel. Consequently, it is not suitable for questions that (1) require step-by-step reasoning, (2) only need a very short answer, or (3) have answers whose parts cannot be decoded independently without losing coherence. Therefore, toward pushing the practical adoption of SoT, we explore adaptively triggering SoT only when it is suitable. To achieve that, we propose a router module that decides whether SoT should be applied to a user request, and then calls either SoT or normal decoding accordingly. To implement the router, we explore two options: prompting an LLM as the router (no model training needed) and a trained RoBERTa router (trained on LIMA, tested on Vicuna-80 and WizardLM). We name the overall solution SoT with Router (SoT-R).

Please refer to our technical report for the details of router prompting or router annotation and training.
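As a rough illustration of the trained-router option, the sketch below loads a binary RoBERTa classifier with HuggingFace Transformers and uses it to choose between SoT and normal generation. The checkpoint path, the label convention, and the `sot_generate`/`normal_generate` callables are assumptions made for this sketch; the actual annotation and training setup is described in the technical report.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Hypothetical checkpoint of a RoBERTa router fine-tuned to predict whether
# a question is suitable for SoT (label 1, assumed) or not (label 0).
ROUTER_CKPT = "path/to/roberta-sot-router"

tokenizer = AutoTokenizer.from_pretrained(ROUTER_CKPT)
router = AutoModelForSequenceClassification.from_pretrained(ROUTER_CKPT)
router.eval()

def should_use_sot(question: str) -> bool:
    inputs = tokenizer(question, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = router(**inputs).logits
    return int(logits.argmax(dim=-1)) == 1  # assumed label convention

def answer(question, sot_generate, normal_generate):
    # Route the request: SoT for suitable questions, normal decoding otherwise.
    return sot_generate(question) if should_use_sot(question) else normal_generate(question)
```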

Results

Datasets. We use (1) the Vicuna-80 dataset, which contains 80 questions spanning nine categories such as coding, math, writing, and roleplay, and (2) the WizardLM dataset, which contains 218 questions spanning more categories and diverse difficulties.

Models. We test SoT and SoT-R on 12 recently released models, including 9 open-source models and 3 API-based models as shown below. We obtain the weights of all the open-source models from HuggingFace.

Efficiency evaluation. We record the latency of API calls with time.time and that of local model runs with torch.cuda.Event. All local models are run in FP16 precision with the FastChat - HuggingFace Transformers - PyTorch library stack.
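For reference, a minimal sketch of the two measurement paths is shown below; `call_api`, `request`, `model`, and `inputs` are placeholders for the actual API request and the locally loaded FP16 model.

```python
import time
import torch

# API-based models: wall-clock latency with time.time().
start_t = time.time()
response = call_api(request)                 # placeholder API call
api_latency_s = time.time() - start_t

# Local models: CUDA events time the GPU work of a generation call.
start_evt = torch.cuda.Event(enable_timing=True)
end_evt = torch.cuda.Event(enable_timing=True)
start_evt.record()
outputs = model.generate(**inputs)           # placeholder HuggingFace generate() call
end_evt.record()
torch.cuda.synchronize()                     # wait for the recorded events to complete
local_latency_ms = start_evt.elapsed_time(end_evt)
```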

Answer quality evaluation. We follow the recent practice of letting an LLM judge compare answers. We use FastChat's and LLMZoo's evaluation prompts and ask GPT-4 which answer it prefers.
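Schematically, each pairwise comparison looks like the sketch below. The template shown is a simplified stand-in for FastChat's and LLMZoo's evaluation prompts, and `query_gpt4(prompt) -> str` is a hypothetical wrapper around the GPT-4 API.

```python
# Simplified stand-in for the FastChat / LLMZoo evaluation prompts.
JUDGE_TEMPLATE = (
    "Question: {question}\n\n"
    "Answer A:\n{answer_a}\n\n"
    "Answer B:\n{answer_b}\n\n"
    "Which answer is better? Reply with 'A', 'B', or 'Tie', then explain briefly."
)

def judge_pair(question, sot_answer, normal_answer, query_gpt4):
    """Ask the GPT-4 judge for its preference in both answer orders,
    a common way to mitigate position bias in LLM-as-judge evaluation."""
    verdict_ab = query_gpt4(JUDGE_TEMPLATE.format(
        question=question, answer_a=sot_answer, answer_b=normal_answer))
    verdict_ba = query_gpt4(JUDGE_TEMPLATE.format(
        question=question, answer_a=normal_answer, answer_b=sot_answer))
    return verdict_ab, verdict_ba
```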

The left figure below shows the answer quality and speed-ups achieved by our overall solution, SoT-R. SoT-R accelerates the generation of both the API-based and open-source models. In addition, SoT-R can also improve the answer quality for many models. This is because the skeleton stage in SoT encourages LLMs to think about the question from multiple perspectives, improving the diversity and relevance of the answers.

To see why SoT-R helps, the right figure below shows that SoT-R successfully triggers SoT for the suitable question categories and falls back to normal generation for the others, thus maintaining the answer quality for those categories.

More detailed evaluation and analyses on the efficiency and answer quality can be found in our technical report.

The net win rates and speed-ups of SoT with router (SoT-R) compared to normal generation on Vicuna-80. The net win rate is the fraction of questions on which SoT-R's answer is better than normal generation's minus the fraction on which it is worse. The speed-up is the ratio between the latency of normal generation and that of SoT-R. (1.0, 0.0) represents normal generation. Higher is better on both axes. For most models, SoT-R not only accelerates the generation but also improves the quality of the answers.

(Upper) The net win rates of SoT and SoT-R on different question categories on Vicuna-80. For the question categories not suitable for SoT, SoT-R learns to fall back to the normal generation mode. Consequently, SoT-R can maintain good answer quality for all question categories.

(Lower) The speed-ups of SoT and SoT-R on different models on Vicuna-80. SoT-R keeps a >1× speed-up for most models.

Simple Demos

The SoT-R demo code is released in our repository, made with Gradio and based on the FastChat team's web demo implementation.

Questions suitable for SoT

Question: How can I improve my time management techniques?

Question: What are some of the most impactful projects that Lionel Messi's charity has undertaken?

SoT-R triggers the SoT generation mode for these two questions.

Questions unsuitable for SoT

Question: Hi, what's your name?

Question: Solve for x in the equation 3x + 10 = 13

SoT-R falls back to the normal generation mode for these two questions: (1) "Hi, what's your name?", which requires only a short answer instead of a long multi-point answer; (2) "Solve for x in the equation 3x + 10 = 13", which is a math question that requires step-by-step reasoning.

Limitations and Outlooks

Eliciting or improving LLMs’ ability

SoT has showcased some potential in enhancing answer quality. This is part of a broader trend in recent research, exemplified by work such as CoT, ToT, and ReAct, which collectively affirm the notion that explicitly articulating the thought process in language can elicit high-quality answers. These findings mirror human thinking: rather than relying solely on first intuition or purely sequential thinking, we often document step-by-step reasoning or organize our thoughts to attain high-quality answers. This intriguing parallel prompts us to explore further how we can draw on the human thinking process to make AI more effective and efficient.

For instance, SoT currently ignores the dependencies between points. A conceptually better way is to organize the points as a Graph-of-Thoughts, where the edges represent dependencies and each point is decoded conditioned on the contents of its ancestor points. In addition, instead of adhering to a static graph, we expect the need for a dynamic Graph-of-Thoughts, in which the high-level thought structure is adjusted dynamically by the LLMs themselves. This could potentially combine the efficiency and global-thinking advantages of SoT with the logical-reasoning and impromptu-thinking strengths of methods such as CoT.

Furthermore, there exist self-improving training pipelines that use rationales generated by CoT to fine-tune LLMs, thereby enhancing their reasoning abilities. Likewise, it is interesting to investigate how the more structured answers from SoT can be used to fine-tune LLMs to enhance their ability to generate well-organized and comprehensive answers.

Efficiency and overhead of SoT in different scenarios

Serving systems commonly adopt batch processing to handle concurrent queries. This raises the concern that SoT's parallel requests may hurt serving throughput. (1) When the number of concurrent queries is unsaturated, SoT can effectively reduce latency and improve GPU utilization. Example scenarios include (a) edge-side applications with a single user, and (b) centralized services during periods with unsaturated user requests and underutilized computing capacity. It is interesting to study the appropriate SoT triggering conditions based on system workloads. (2) When the number of concurrent queries is saturated, SoT is still useful for improving answer quality, but in this case it is important to consider the computation overhead introduced by SoT.

For API-based models, a notable concern is the increased number of prefilling tokens. Given that many APIs charge by token usage, SoT may lead to higher costs. To address this, one can tune the number of parallel API requests (by expanding multiple points in a single API call), or use prompt tuning to design shorter SoT prompts.

See Appendix H in the technical report for more information on these concerns.

Data-centric efficiency optimization

While data-centric engineering for improving answer quality is gaining popularity, its potential for improving inference efficiency remains largely unexplored; SoT is an initial attempt in this direction. As LLM capabilities and the amount of LLM-generated data grow rapidly, data-centric techniques could become even more useful in the future. We look forward to more explorations that unlock the full potential of data-centric efficiency optimization.

Acknowledgements

We thank Sergey Yekhanin (Microsoft Research), and Tianji Wu (Infinigence AI) for their support and suggestions on the work. We thank Tianyu Fu for many initial discussions on the idea. We thank Ke Hong and Genghan Zhang for their discussions about profiling. We thank Yue Wu for the help on the Claude scripts. We thank Da Yu, Chulin Xie, and Saiqian Zhang for their suggestions on revising the first version of the paper. We thank Rui Hu, Cheng Cheng, Jack Jin, Zhoutong Ye, Mingze Sun, Jun Yan, Zhi Zhang, Yuxuan Tong, and Nianhui Guo for their suggestions on revising the second version of the paper.

BibTeX

@misc{ning2023sot,
      title={Skeleton-of-Thought: Large Language Models Can Do Parallel Decoding},
      author={Xuefei Ning and Zinan Lin and Zixuan Zhou and Zifu Wang and Huazhong Yang and Yu Wang},
      year={2023},
      eprint={2307.15337},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}