We first construct the BigTable dataset. Subsequently, different types of LLMs, including general-purpose and coding-specific models, are employed to evaluate various subtasks. These subtasks include end-to-end Text-to-SQL, SQL-to-Text, and Schema Linking, which aligns entity references in the question with the corresponding schema tables or columns. For the questions that the LLMs fail to process correctly, a SQL Debugging task is designed, with the aim of having the LLMs identify and correct the errors to generate the correct results. For the questions that the LLMs handle correctly, a SQL Optimization task is designed, with the goal of improving the execution efficiency of the SQL statements.
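To make this decomposition concrete, the sketch below shows the shape one evaluation instance might take across the subtasks; all field names and values are our illustration, not the dataset's actual format.

```python
# Illustrative shape of one evaluation instance across the subtasks.
# All field names and values are hypothetical, not the dataset's actual format.
example = {
    "question": "List the names of customers who placed orders in 2023.",
    # End-to-end Text-to-SQL: natural-language question -> SQL
    "gold_sql": (
        "SELECT c.name FROM customers c "
        "JOIN orders o ON c.id = o.customer_id "
        "WHERE strftime('%Y', o.order_date) = '2023'"
    ),
    # Schema Linking: entity mentions -> referenced tables/columns
    "schema_links": {
        "customers": ["id", "name"],
        "orders": ["customer_id", "order_date"],
    },
    # SQL-to-Text: the reverse direction, SQL -> natural-language description
    "sql_description": "Names of customers with at least one order in 2023.",
}
# SQL Debugging is applied to instances the LLM gets wrong;
# SQL Optimization is applied to instances it gets right.
```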
Table 1: EX (%) of different LLMs on open-source datasets.
Table 2: The data distribution of “BigTable”, including the average number of ground-truth (GT) tables and columns involved in the instances.
Large language models (LLMs) exhibit varying performance across different open-source datasets. This can be attributed to the fact that some LLMs, particularly those specialized in coding tasks, are trained or fine-tuned on these open-source datasets. To mitigate this potential overfitting and to ensure a reliable and accurate assessment of their capabilities, we construct a novel dataset, termed “BigTable”, as an extension and augmentation of the BIRD dataset.
We formally evaluate the different sub-tasks within the Text-to-SQL process to determine the performance differences among various LLMs and to provide recommendations for addressing specific task requirements.
Proposed prompt template “SimpleDDL-MD-Chat”.
Table 3: EX (%) of Text-to-SQL with various zero-shot prompt templates on Spider dev set.
Core Conclusion 1. The prompt template “SimpleDDL-MD-Chat” achieves optimal performance in the Text-to-SQL task.
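To unpack the template's name: we read “SimpleDDL” as the compact table(column, …) schema rendering, “MD” as Markdown-style framing of the instruction, and “Chat” as delivery through the chat interface rather than raw completion. The sketch below is a minimal reconstruction under that reading; the exact wording is our paraphrase, not the paper's verbatim template.

```python
# Minimal sketch of a "SimpleDDL-MD-Chat" style prompt.
# SimpleDDL: schema given as "table(col1, col2, ...)" lines;
# MD: Markdown-style headers in the instruction;
# Chat: sent as chat messages rather than a raw completion prompt.
# The exact wording below is an assumption, not the paper's verbatim template.

def simple_ddl(schema: dict[str, list[str]]) -> str:
    """Render {table: [columns]} as SimpleDDL lines."""
    return "\n".join(f"# {t}({', '.join(cols)})" for t, cols in schema.items())

def build_messages(schema: dict[str, list[str]], question: str) -> list[dict]:
    user = (
        "### Answer the question with a SQLite SQL query only, no explanation.\n"
        "### SQLite SQL tables, with their properties:\n"
        f"{simple_ddl(schema)}\n"
        f"### Question: {question}\n"
    )
    return [{"role": "user", "content": user}]

messages = build_messages(
    {"customers": ["id", "name"], "orders": ["customer_id", "order_date"]},
    "List the names of customers who placed orders in 2023.",
)
```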
Table 4: EX (%) of different LLMs on “BigTable”. Prompt template is “SimpleDDL-MD-Chat”.
Core Conclusion 2. As the number of tables and columns involved in user queries increases, the Text-to-SQL challenge for LLMs significantly escalates.
Word cloud representation of error information for incorrect SQL queries generated by LLMs. Top: System Error and Result Error. Bottom: Detailed classification of Result Error.
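The System/Result split in the figure can be reproduced mechanically: a predicted query that raises an execution exception is a System Error, while one that executes but returns rows different from the gold query's is a Result Error. A minimal sketch with sqlite3, assuming set-based (order-insensitive) result comparison as in the EX metric:

```python
import sqlite3

def classify(conn: sqlite3.Connection, pred_sql: str, gold_sql: str) -> str:
    """Classify a predicted query as correct, System Error, or Result Error."""
    try:
        pred_rows = conn.execute(pred_sql).fetchall()
    except sqlite3.Error:
        return "system_error"  # the query fails to execute at all
    gold_rows = conn.execute(gold_sql).fetchall()
    # order-insensitive comparison, as in the EX metric
    return "correct" if set(pred_rows) == set(gold_rows) else "result_error"
```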
What is the most effective information organization for self-debugging?
Core Conclusion 3. Detailed error information and corresponding annotations greatly enhance the capabilities of LLMs, enabling them to effectively correct errors.
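A self-debug prompt that follows Conclusion 3 would carry the buggy SQL, the raw error message, and a short annotation of the error class. The wording and the `note` field below are our own illustration, not the paper's template:

```python
def build_debug_prompt(schema_ddl: str, question: str,
                       wrong_sql: str, error_msg: str, note: str) -> str:
    # `note` is a short annotation of the error class, e.g. a rule-written
    # hint such as "the column belongs to another table" (hypothetical).
    return (
        "### SQLite SQL tables, with their properties:\n"
        f"{schema_ddl}\n"
        f"### Question: {question}\n"
        f"### Buggy SQL:\n{wrong_sql}\n"
        f"### Error message: {error_msg}\n"
        f"### Annotation: {note}\n"
        "### Fix the SQL and return the corrected query only.\n"
    )
```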
Can LLMs benefit from multi-round self-debugging?
Core Conclusion 4. Multi-round self-debugging aids in error correction for LLMs, but there exists a performance boundary, with 1-2 rounds of debugging being the optimal choice.
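Conclusion 4 translates directly into a capped repair loop. In this sketch, `generate_fix` stands in for a model call, `classify` is the checker from the error-classification sketch above, and the cap of 2 follows the reported sweet spot:

```python
import sqlite3  # `classify` is the checker from the error-classification sketch

MAX_ROUNDS = 2  # per Conclusion 4, returns diminish beyond 1-2 rounds

def self_debug(conn, schema_ddl, question, sql, gold_sql, generate_fix):
    for _ in range(MAX_ROUNDS):
        if classify(conn, sql, gold_sql) == "correct":
            break
        try:
            conn.execute(sql)
            error_msg = "query executed but returned a wrong result"
        except sqlite3.Error as e:
            error_msg = str(e)
        # feed the concrete error back to the model (see prompt sketch above)
        sql = generate_fix(schema_ddl, question, sql, error_msg)
    return sql
```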
Can an LLM debug the errors incurred by other LLMs (general debugging)?
Core Conclusion 5. The performance of cross-LLM SQL debugging is inferior to direct regeneration. A multi-agent approach that integrates outputs from different models shows great potential.
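One simple instantiation of the multi-agent idea is execution-based voting: collect one candidate SQL per model, run them all, and keep the query whose result set is most common. This scheme is our illustration of “integrating outputs from different models”; the paper does not prescribe it:

```python
import sqlite3
from collections import Counter

def vote(conn: sqlite3.Connection, candidates: list[str]) -> str | None:
    """Keep the candidate SQL whose execution result is most common."""
    results: dict[str, frozenset] = {}
    for sql in candidates:
        try:
            # frozenset makes the result hashable and order-insensitive
            results[sql] = frozenset(conn.execute(sql).fetchall())
        except sqlite3.Error:
            continue  # drop candidates that fail to execute
    if not results:
        return None
    winner = Counter(results.values()).most_common(1)[0][0]
    return next(sql for sql, res in results.items() if res == winner)
```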
Do LLMs have the capability for SQL self-optimization?
Core Conclusion 6. In-context learning methods present challenges in achieving effective SQL optimization with LLMs.
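Given that in-context rewrites are unreliable, any model-proposed “optimization” should be gated by a check that it is result-equivalent and actually faster. A minimal acceptance test, assuming a sqlite3 connection:

```python
import sqlite3
import time

def accept_rewrite(conn: sqlite3.Connection, original: str, rewrite: str) -> bool:
    """Accept a rewrite only if it is result-equivalent and not slower."""
    def run(sql: str):
        t0 = time.perf_counter()
        rows = frozenset(conn.execute(sql).fetchall())
        return rows, time.perf_counter() - t0
    try:
        rows_o, t_o = run(original)
        rows_r, t_r = run(rewrite)
    except sqlite3.Error:
        return False  # a rewrite that fails to execute is never accepted
    return rows_r == rows_o and t_r <= t_o
```

A single timing run is noisy; a real harness would average several executions per query before deciding.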
Core Conclusion 7. Utilizing a general-purpose model for the semantic description of SQL statements is a better choice.
Core Conclusion 8. Foreign key information can advance the performance of schema linking. PreSQL yields the highest performance on coding-specific models, and integrating the results from Few-Shot can further enhance performance on general-purpose models.
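As we understand PreSQL, the model first drafts a preliminary SQL query, and the schema links are then read off from the tables and columns that the draft references. A regex-based extraction sketch (a production version would use a real SQL parser; all names here are illustrative):

```python
import re

def presql_links(pre_sql: str, schema: dict[str, list[str]]) -> dict[str, list[str]]:
    """Derive schema links from the tables/columns mentioned in a draft SQL."""
    tokens = set(re.findall(r"[A-Za-z_][A-Za-z0-9_]*", pre_sql.lower()))
    links = {}
    for table, cols in schema.items():
        if table.lower() in tokens:
            links[table] = [c for c in cols if c.lower() in tokens]
    return links

links = presql_links(
    "SELECT name FROM customers JOIN orders ON customers.id = orders.customer_id",
    {"customers": ["id", "name"], "orders": ["customer_id", "order_date"]},
)
# -> {'customers': ['id', 'name'], 'orders': ['customer_id']}
```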