Eliciting or improving LLMs’ ability
SoT has shown potential for enhancing answer quality. This is part of a broader trend in recent research, exemplified by work such as CoT, ToT, and ReAct, which collectively affirm that explicitly articulating the thought process in language can elicit high-quality answers. These findings mirror how humans think: rather than relying solely on first intuition or purely sequential thinking, we often document step-by-step reasoning or organize our thoughts to arrive at high-quality answers. This intriguing parallel prompts us to explore further how the human thinking process can inform more effective and efficient AI.
For instance, SoT currently ignores the dependencies between points. A conceptually better approach is to organize the points as a Graph-of-Thoughts, where edges represent dependencies and each point is decoded conditioned on the contents of its ancestor points. Moreover, rather than adhering to a static graph, we anticipate the need for a dynamic Graph-of-Thoughts, in which the high-level thought structure is adjusted on the fly by the LLMs themselves. This could combine the efficiency and global-thinking advantages of SoT with the logical-reasoning and impromptu-thinking strengths of methods like CoT.
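To make the static Graph-of-Thoughts idea concrete, below is a minimal Python sketch that decodes points in dependency order, conditioning each point on its ancestors' expanded content. The `call_llm` helper, the prompt wording, and the data layout are illustrative assumptions, not part of SoT itself.

```python
# Minimal Graph-of-Thoughts decoding sketch. `call_llm`, the prompt wording,
# and the data layout below are hypothetical, for illustration only.
from graphlib import TopologicalSorter


def call_llm(prompt: str) -> str:
    """Placeholder for any LLM backend (API call or local inference)."""
    raise NotImplementedError


def expand_graph_of_thoughts(question, points, dependencies):
    """Expand each skeleton point conditioned on its ancestor points.

    points: dict {point_id: short point description (the skeleton)}.
    dependencies: dict {point_id: set of point_ids it depends on}.
    """
    order = TopologicalSorter(
        {pid: dependencies.get(pid, set()) for pid in points}
    ).static_order()

    contents = {}
    for pid in order:
        # Points whose dependencies are already satisfied could be expanded in
        # parallel; a sequential topological order is used here for clarity.
        ancestor_context = "\n".join(
            f"- {points[a]}: {contents[a]}" for a in dependencies.get(pid, ())
        )
        prompt = (
            f"Question: {question}\n"
            f"Previously expanded points:\n{ancestor_context}\n"
            f"Now expand the point '{points[pid]}' in 1-2 sentences."
        )
        contents[pid] = call_llm(prompt)
    return contents
```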
Furthermore, there exist self-improving training pipelines that fine-tune LLMs on rationales generated by CoT, thereby enhancing their reasoning abilities. Likewise, it is worth investigating how the more structured answers from SoT can be used to fine-tune LLMs to enhance their ability to generate well-organized and comprehensive answers.
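As a sketch of what such a pipeline might look like, the snippet below packages SoT skeletons and point expansions into generic instruction-response pairs in JSON-Lines form. The field names and file layout are assumptions for illustration, not a prescribed format.

```python
# Hypothetical sketch: packaging SoT outputs as supervised fine-tuning data.
# Field names and the JSONL layout are illustrative assumptions.
import json


def sot_to_sft_example(question, skeleton_points, expansions):
    """Merge a skeleton and its point expansions into one (prompt, response) pair."""
    answer = "\n".join(
        f"{i}. {point.strip()} {expansion.strip()}"
        for i, (point, expansion) in enumerate(zip(skeleton_points, expansions), 1)
    )
    return {"prompt": question, "response": answer}


def write_sft_dataset(records, path="sot_sft.jsonl"):
    """records: iterable of (question, skeleton_points, expansions) tuples."""
    with open(path, "w") as f:
        for question, points, expansions in records:
            f.write(json.dumps(sot_to_sft_example(question, points, expansions)) + "\n")
```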
Efficiency and overhead of SoT in different scenarios
Serving systems commonly adopt batch processing to handle concurrent queries, which raises the concern that SoT's parallel point-expansion requests might hurt serving throughput. (1) When the number of concurrent queries is unsaturated, SoT can effectively reduce latency and enhance GPU utilization. Example scenarios include (a) edge-side applications serving a single user and (b) centralized services during periods of light traffic with underutilized computing capacity. It would be interesting to study appropriate SoT triggering conditions based on the system workload. (2) When the number of concurrent queries is saturated, SoT remains useful for improving answer quality, but the computational overhead it introduces must be taken into account.
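For instance, a serving frontend could gate SoT on simple workload statistics, as in the hypothetical sketch below. The `ServerStats` fields and the threshold values are illustrative assumptions rather than validated settings.

```python
# Hypothetical workload-aware SoT trigger. The statistics fields and threshold
# values below are illustrative assumptions, not tuned recommendations.
from dataclasses import dataclass


@dataclass
class ServerStats:
    active_requests: int    # requests currently being decoded in the batch
    max_batch_size: int     # batch capacity of the serving engine
    gpu_utilization: float  # 0.0-1.0, averaged over a recent window


def should_use_sot(stats: ServerStats,
                   occupancy_threshold: float = 0.5,
                   utilization_threshold: float = 0.8) -> bool:
    """Enable SoT only when there are spare batch slots and idle compute, so the
    extra parallel point-expansion requests reduce latency instead of competing
    with other users' queries for throughput."""
    occupancy = stats.active_requests / stats.max_batch_size
    return occupancy < occupancy_threshold and stats.gpu_utilization < utilization_threshold
```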
For API-based models, a notable concern is the increased number of prefilling tokens. Given that many APIs charge by token usage, SoT may lead to higher costs. To address this, one can reduce the number of parallel API requests (e.g., by expanding multiple points in a single API call) or use prompt tuning to design shorter SoT prompts.
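The sketch below illustrates the first mitigation: grouping several skeleton points into each API request so that the shared question and instructions are prefilled fewer times. The `call_llm` helper, the chunk size, and the prompt wording are illustrative assumptions.

```python
# Hypothetical sketch: expand several skeleton points per API call to reduce
# the number of requests that re-prefill the shared prompt.
def expand_points_in_batches(question, points, call_llm, points_per_call=3):
    """Expand `points_per_call` skeleton points per request instead of one."""
    responses = []
    for start in range(0, len(points), points_per_call):
        chunk = points[start:start + points_per_call]
        numbered = "\n".join(f"{start + i + 1}. {p}" for i, p in enumerate(chunk))
        prompt = (
            f"Question: {question}\n"
            "Expand each of the following points in 1-2 sentences, "
            f"keeping the numbering:\n{numbered}"
        )
        responses.append(call_llm(prompt))
    return responses
```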
See Appendix H in the technical report for more information on these concerns.
Data-centric efficiency optimization
While data-centric engineering for improving answer quality is gaining popularity, its potential for improving inference efficiency remains largely unexplored; SoT is the first attempt in this direction. As LLM capabilities and the volume of LLM-generated data continue to grow rapidly, data-centric techniques could become increasingly useful. We look forward to more explorations that unlock the full potential of data-centric efficiency optimization.