Poster; 4 reviewers

Lowest rating 1, highest rating 4, standard deviation 1.1

ICML 2025

Structure-Guided Large Language Models for Text-to-SQL Generation

Qinggang Zhang, Hao Chen, Junnan Dong, Shengyuan Chen, Feiran Huang, Xiao Huang

Submitted: 2025-01-24; Updated: 2025-07-24

Abstract

Keywords

Text-to-SQL, large language model, structure learning

Reviews and Discussion

Review

Rating: 3 (2025-03-14)

This paper introduces SGU-SQL, a structure-guided framework for text-to-SQL generation using large language models (LLMs). By leveraging syntax trees and database schema graphs, SGU-SQL recursively decomposes queries into subtasks guided by SQL syntax, enabling incremental and accurate SQL generation. Experiments on Spider and BIRD benchmarks demonstrate its superiority over state-of-the-art baselines, particularly in handling complex queries. The framework addresses key challenges such as schema linking, syntax errors, and structural ambiguity through graph-based representations and syntax-aware decomposition.

Questions for Authors

The framework uses GPT-4 as the backbone LLM. Would performance degrade significantly with smaller, open-source models (e.g., CodeLlama-7B)? Are there strategies to mitigate this?

Claims and Evidence

The paper claims that SGU-SQL significantly reduces errors in complex queries (such as schema-linking and JOIN errors) and demonstrates performance improvements on the Spider and BIRD datasets through experimental results (such as Table 1). These results support the main argument, but one problem remains. Insufficient error classification details: although the paper mentions that errors are reduced by 33.5% (Appendix case analysis), the distribution of specific error types (such as syntax errors and logical errors) and the criteria used to quantify them are not clearly stated, which may raise doubts about the credibility of the conclusion.

Methods and Evaluation Criteria

Graph structure construction and dual graph encoding: Using RGAT to handle the graph structure of queries and databases is reasonable, but the paper does not elaborate on how ambiguity in graph alignment is resolved (for example, how the optimal match is selected when multiple candidate nodes match).

Syntax tree decomposition strategy: Decomposing SQL generation into subtasks based on syntax trees is innovative, but the specific handling of nested subqueries and complex aggregate functions is not discussed.

Evaluation metrics: The selection of EM Acc, Exec Acc, and VES is comprehensive, but the calculation of VES is not clearly defined (for example, how it balances efficiency and accuracy).

Theoretical Claims

The paper does not provide a rigorous theoretical proof and mainly relies on experimental verification. For example, does structural decomposition necessarily improve the generation effect? Are there any theoretical boundaries (e.g., too fine a decomposition granularity may lead to context loss)? These questions are not explored at the theoretical level.

Experimental Design and Analysis

Limited baseline comparisons: SGU-SQL outperforms the listed baselines, but if possible, please also compare with newer LLM-based text-to-SQL methods (e.g., those built on GPT-4o or DeepSeek-R1).

Supplementary Material

Syntax tree example: The syntax tree in Figure 5 only shows part of the structure and does not fully present the decomposition process of complex queries (such as nested subqueries).

Relation to Existing Literature

The paper makes a good connection between traditional methods (such as RAT-SQL), PLM-based methods (such as T5), and LLM paradigms (such as GPT-4).

Essential References Not Discussed

No.

Other Strengths and Weaknesses

Strengths:

1. Innovative methodology: The integration of syntax trees and schema graphs to guide LLM-based SQL generation is novel and addresses critical limitations of existing methods, such as schema linking errors and structural ambiguity.
2. Comprehensive evaluation: Extensive experiments on two benchmark datasets (Spider and BIRD) validate SGU-SQL’s effectiveness, with significant improvements in execution accuracy, especially for complex queries.
3. Practical insights: The ablation studies and error analysis provide valuable insights into the contributions of each component and highlight the framework’s robustness across query difficulty levels.

Weaknesses:

1. Personalization gaps: The framework does not explore personalized decomposition strategies for different user intents or database structures, which could further enhance performance in real-world scenarios.
2. Efficiency trade-offs: The graph construction and syntax decomposition steps introduce computational overhead. While efficiency analysis is included, the trade-off between accuracy and latency in large-scale applications is underexplored.

Other Comments or Suggestions

None.

Author Response

2025-04-01

Dear Reviewer jn4B, Thank you for your recognition of our work and for providing such thorough and insightful feedback. Your comments and suggestions are invaluable in helping us improve the quality and clarity of our work.

Insufficient details for error analysis. Thank you for your thorough review. In this paper, we perform an error analysis to evaluate our model’s performance. Specifically, we classify errors into two primary categories:

Schema-linking errors: Incorrect matching of tables or columns in the database schema.
Syntactic errors: Invalid SQL syntax, including misuse or omission of key SQL clauses (e.g., JOIN, GROUP BY, nested queries and others).

As illustrated in Figure 2 of our submission, we provided a detailed analysis of the error distribution. Compared to the baseline model, our approach significantly improves schema-linking accuracy and syntactic correctness.

The definition of VES metric is not clear. Sorry for the confusion caused. Valid Efficiency Score (VES), first defined in the BIRD benchmark, measures the efficiency of valid SQL queries.

A valid SQL query is a predicted SQL whose executed results exactly match the ground truth results. Specifically, VES evaluates both the efficiency and accuracy of predicted SQL queries. For a dataset with $N$ examples, VES is computed by: $\text{VES} = \frac{1}{N}\sum_{n=1}^{N}\mathbf{I}(V_n, \hat{V}_n) \cdot \mathbf{R}(Y_n, \hat{Y}_n),$ where $\hat{Y}_n$ and $\hat{V}_n$ are the predicted SQL query and its executed results, and $Y_n$ and $V_n$ are the ground truth SQL query and its corresponding executed results, respectively. $\mathbf{I}(V_n, \hat{V}_n)$ is an indicator function: $\mathbf{I}(V_n, \hat{V}_n) = 1$ if $V_n = \hat{V}_n$, and $0$ otherwise. $\mathbf{R}(Y_n, \hat{Y}_n) = \sqrt{E(Y_n)/E(\hat{Y}_n)}$ denotes the relative execution efficiency of the predicted SQL query in comparison to the ground-truth query, where $E(\cdot)$ is the execution time of each SQL query in the database.
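To make the VES definition concrete, here is a minimal Python sketch of the formula. It is an illustration only, not the official BIRD evaluation script: the `Example` container and the `ves` function are hypothetical names, execution times are assumed to be positive and measured elsewhere, and comparing result rows as unordered sets is an assumption about what "exactly match" means.

```python
import math
from dataclasses import dataclass


@dataclass
class Example:
    """One evaluation example (hypothetical container, not from the paper)."""
    gold_rows: list    # V_n: rows returned by the ground-truth SQL
    pred_rows: list    # \hat{V}_n: rows returned by the predicted SQL
    gold_time: float   # E(Y_n): execution time of the ground-truth SQL
    pred_time: float   # E(\hat{Y}_n): execution time of the predicted SQL


def ves(examples):
    """VES = (1/N) * sum_n I(V_n, V_hat_n) * sqrt(E(Y_n) / E(Y_hat_n))."""
    total = 0.0
    for ex in examples:
        # Indicator I: 1 if the result sets match, 0 otherwise.
        # Assumption: rows are compared as unordered sets of tuples.
        valid = set(map(tuple, ex.gold_rows)) == set(map(tuple, ex.pred_rows))
        if valid:
            # Relative efficiency R of the predicted query.
            total += math.sqrt(ex.gold_time / ex.pred_time)
    return total / len(examples)


examples = [
    Example([(1,), (2,)], [(2,), (1,)], gold_time=1.0, pred_time=4.0),  # valid, 2x slower in sqrt terms
    Example([(1,)], [(3,)], gold_time=1.0, pred_time=1.0),              # invalid, contributes 0
    Example([("a", 1)], [("a", 1)], gold_time=2.0, pred_time=2.0),      # valid, same speed
]
print(ves(examples))
```

Under these assumptions an incorrect prediction contributes nothing regardless of speed, and a correct prediction that runs faster than the ground truth contributes more than 1, which is how the metric couples accuracy with efficiency.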

The examples of the syntax tree are not clear. Thanks a lot for your careful review. Following your suggestion, we have updated Figure 5 to include more complex examples, particularly those involving nested queries. We will add the revised figure in our new manuscript since we are not allowed to include external links in this version.

Personalization gaps. We sincerely appreciate your insightful suggestion regarding personalized decomposition strategies tailored to different user intents and database structures. This is indeed a promising research direction that could significantly enhance real-world applicability. Moving forward, we plan to extend our framework by incorporating adaptive decomposition mechanisms to further improve system performance.

Efficiency trade-offs. Thank you for making this valuable suggestion. To assess our approach thoroughly, we conducted the efficiency analysis on the BIRD dataset (33.4 GB total). Given that the queries in this dataset are categorized into 3 difficulty levels: simple, moderate, and challenging, we specifically tested our model on the challenging set of the BIRD dataset and compared its performance with DIN-SQL and MAC-SQL.

Table 1: Efficiency analysis on the "Challenging" set of BIRD.

| Model | Training Time | Inference Time | Performance |
| --- | --- | --- | --- |
| DIN-SQL | 4.69 h | 0.39 h | 36.7% |
| MAC-SQL | 4.98 h | 0.36 h | 39.3% |
| SGU-SQL | 3.47 h | 0.22 h | 42.1% |

As shown in Table 1, our model demonstrates superior performance while maintaining competitive computational efficiency. This superior efficiency can be attributed to our graph-based architecture. While baseline methods avoid the overhead of graph construction, they heavily rely on prompt-based modules that require multiple calls to LLMs like GPT-4. These API calls introduce substantial latency that accumulates during both the training and inference phases. In contrast, our graph-based approach, despite its initial graph construction overhead, achieves faster end-to-end processing by minimizing dependence on time-consuming API calls.

The performance of lightweight LLMs. Thanks for the valuable comments. Following your suggestion, we added experiments with the Qwen2.5-Coder series as the backbone LLM.

Table 2: Performance on BIRD with Qwen2.5-Coder as the backbone LLM.

| Model | +Qwen2.5-Coder-7B | +Qwen2.5-Coder-14B | +Qwen2.5-Coder-32B |
| --- | --- | --- | --- |
| XiYan-SQL(DDL) | 56.58 | 60.37 | 63.04 |
| XiYan-SQL(M-Schema) | 59.78 | 63.10 | 67.01 |
| SGU-SQL | 60.24 | 64.75 | 68.12 |

The XiYanSQL-QwenCoder series is the SOTA method that uses the lightweight Qwen2.5-Coder as its backbone. As shown in Table 2, our SGU-SQL outperforms this competitor across all model sizes, suggesting the effectiveness and robustness of our framework.

Review

Rating: 4 (2025-03-14)

This paper addresses the challenge of generating precise SQL queries from natural language, particularly when handling ambiguous user intents, complex database schemas, and SQL’s rigid syntax. The authors propose SGU-SQL, a framework that enhances Text-to-SQL generation by modeling structural relationships between entities in user questions and database tables. Key innovations include a graph-based representation to align ambiguous natural language entities with database components and a syntax-guided decomposition strategy that breaks complex questions into sub-questions to guide LLMs in incrementally constructing target SQLs. Experiments on two benchmarks verify that SGU-SQL outperforms state-of-the-art baselines, including 11 fine-tuning models, 7 structure learning models, and 14 in-context learning models.

Questions for Authors

How does the framework perform across different backbone LLMs, and are there specific LLMs for which it is particularly well-suited?
Can the authors provide a more detailed justification for choosing RGAT over other GNN variants?
Could critical findings from the appendix, such as ablation study and error analysis, be integrated into the main text to enhance clarity?

Claims and Evidence

The paper makes two key claims: (1) graph-based schema linking improves SQL accuracy by resolving ambiguities, and (2) syntax-guided prompting outperforms traditional methods like Few-Shot and Chain-of-Thought through syntax-aware decomposition. These claims are supported by rigorous benchmark comparisons and ablation studies.

Methods and Evaluation Criteria

The proposed method is innovative and well-designed, combining graph-based schema linking and syntax-aware decomposition to handle ambiguous user queries, complex database schemas, and SQL’s rigid syntax. Evaluation is thorough, using Spider and BIRD benchmarks with metrics like Execution Accuracy (EX), Exact Match Accuracy (EM), and Valid Efficiency Score (VES). Comparisons against 32 baselines across fine-tuning, structure-aware, and in-context learning paradigms highlight the framework’s robustness.

Theoretical Claims

The decomposition strategy and structure linking are motivated and verified empirically with few theoretical claims introduced in this paper.

Experimental Design and Analysis

The experiments are comprehensive, comparing SGU-SQL with 32 state-of-the-art models and including ablation studies and error analysis. However, critical findings, such as error-type distributions, are relegated to the appendix, which slightly weakens the narrative flow. Integrating these results into the main text would enhance clarity.

Supplementary Material

The appendix includes detailed ablation studies, error analysis, case studies, grammar rules, syntax tree examples, and source code. These materials effectively supplement the main claims and improve reproducibility.

Relation to Existing Literature

SGU-SQL builds on prior LLM-based Text-to-SQL methods (e.g., DIN-SQL) by introducing graph-based schema linking and syntax-guided decomposition. This combination addresses limitations in structural alignment and complex query handling.

Essential References Not Discussed

n/a

Other Strengths and Weaknesses

Strengths:

The paper identifies the critical challenges in leveraging LLMs for SQL generation. It highlights that LLMs often face significant difficulties in comprehending complex database schemas, particularly when dealing with intricate relationships between tables, columns, and constraints. Additionally, the paper emphasizes that LLMs frequently struggle to accurately interpret user queries, especially when the queries involve nuanced semantics or require precise SQL syntax.
The overall idea is clear and novel. The graph-based schema linking enhances SQL accuracy by effectively resolving ambiguities, and syntax-guided prompting surpasses traditional prompting strategies by leveraging syntax-aware query decomposition. The combination of these two techniques demonstrates a novel and well-thought-out solution that ensures the generated SQL queries are not only semantically correct but also syntactically precise.
The methodology section is well-organized. The authors provide clear mathematical formulations for key components of their approach, such as the graph-based schema linking mechanism and the syntax-guided prompting strategy.
Extensive empirical results on widely recognized benchmarks like Spider and BIRD verify that SGU-SQL outperforms state-of-the-art baselines, including 11 fine-tuning models, 7 structure learning models, and 14 in-context learning models.
The evaluation is comprehensive, encompassing ablation study, error analysis and experiments on both open-source and proprietary LLMs, ensuring an objective assessment of the method's effectiveness and robustness across different backbone LLMs.

Weaknesses:

While the authors evaluate their framework across multiple backbone LLMs, a more systematic and detailed comparison with baseline methods using different backbone LLMs would further clarify the framework’s generalizability.
The paper utilizes RGAT as the backbone model for graph-based structure linking. A more thorough discussion comparing RGAT with other graph neural network architectures would make the method clearer and easier to follow.
Key findings, such as ablation and error analysis, are placed in the appendix. Prioritizing these results in the main body would enhance clarity and strengthen the narrative coherence.

Other Comments or Suggestions

See above.

Author Response

2025-04-01

Dear Reviewer wUD6,

We are deeply grateful for your recognition of our work and also appreciate your time and effort in providing insightful suggestions that can help further polish our paper. Below are detailed responses to your comments and suggestions:

The effect of the base LLMs. Our model, like most text-to-SQL methods, is model-agnostic, meaning that it can be integrated with any LLM as the backbone model. To verify the effect of the base LLMs, we added additional experiments using GPT-4, GPT-4o, and Gemini-1.5 Pro as backbones.

Table 1: Performance comparison on BIRD dev with different LLMs as backbones.

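| Execution Accuracy | MAC-SQL | PURPLE | E-SQL | CHESS | Distillery | CHASE-SQL | SGU-SQL (Ours) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| GPT-4 | 59.59 | 60.71 | 58.95 | 61.37 | - | - | 61.80 |
| GPT-4o | 65.05 | 68.12 | 65.58 | 68.31 | 67.21 | - | 69.28 |
| Gemini-1.5 Pro | - | - | - | - | - | 73.14 | 72.93 |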

Note that PURPLE, Distillery, and CHASE-SQL are closed-source models. We will update their results on GPT-4 and GPT-4o once their implementations become publicly available.

As shown in Table 1, our SGU-SQL achieves competitive performance across different LLM backbones. Specifically, we have the following observations:

Using GPT-4 as the backbone, SGU-SQL achieves the best performance compared to other models using the same backbone.
With GPT-4o, SGU-SQL achieves 69.28% in terms of execution accuracy, outperforming several strong baselines: PURPLE (68.12%), CHESS (68.31%), E-SQL (65.58%) and Distillery (67.21%).
The only model showing higher performance is CHASE-SQL, which uses Gemini 1.5 Pro as its backbone. Notably, CHASE-SQL incorporates a query fixer module that leverages database execution feedback to guide LLMs to iteratively refine generated queries. In contrast, our model generates SQL queries in a single pass without utilizing any execution feedback.

Justification on the backbone GNN model. Thanks for your insightful comments. To verify the effectiveness of the backbone model, i.e., RGAT, we replace it with other alternatives, including RGCN [1] and CompGCN [2].

Table 2: Ablation study on backbone GNN models.

| Execution Accuracy (EX) | SPIDER | BIRD |
| --- | --- | --- |
| Full Model (RGAT) | 87.95 | 61.80 |
| w/o structure-aware linking | 82.62 | 55.31 |
| with RGCN | 86.37 | 60.92 |
| with CompGCN | 86.09 | 60.25 |

As shown in Table 2, our RGAT-based approach outperforms alternative architectures across all evaluations. In addition, removing structure-aware linking causes a dramatic performance drop: accuracy decreases by 5.33 percentage points on SPIDER-dev and 6.49 percentage points on BIRD-dev. These substantial reductions highlight the critical role of our structure-aware linking strategy.
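The drops quoted in this paragraph are absolute differences in execution accuracy between the full model and each variant. The short Python sketch below recomputes them from the ablation table. The numbers are copied from that table; the `drops_vs_full` helper is a hypothetical name introduced only for this check.

```python
# Execution accuracy copied from the GNN ablation table in this response.
results = {
    "Full Model (RGAT)": {"SPIDER": 87.95, "BIRD": 61.80},
    "w/o structure-aware linking": {"SPIDER": 82.62, "BIRD": 55.31},
    "with RGCN": {"SPIDER": 86.37, "BIRD": 60.92},
    "with CompGCN": {"SPIDER": 86.09, "BIRD": 60.25},
}


def drops_vs_full(results, full_key="Full Model (RGAT)"):
    """Absolute accuracy drop (in percentage points) of each variant vs. the full model."""
    full = results[full_key]
    return {
        name: {bench: round(full[bench] - score, 2) for bench, score in scores.items()}
        for name, scores in results.items()
        if name != full_key
    }


for name, drop in drops_vs_full(results).items():
    print(name, drop)
```

Read this way, swapping RGAT for RGCN or CompGCN costs under 2 points on either benchmark, while removing structure-aware linking altogether costs more than 5.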

[1] Modeling Relational Data with Graph Convolutional Networks.

[2] Composition-based Multi-Relational Graph Convolutional Networks.

Paper structure. Thanks a lot for your valuable suggestion. We will reorganize the paper and move the key experiments into the main content of the paper to enhance clarity.

Review

Rating: 2 (2025-03-14)

The authors propose a structure-guided text-to-SQL framework. At a high level, it (i) represents the user query as a graph whose vertices are keywords and whose edges are relationships, (ii) uses a schema graph to represent the database schema, (iii) links the two with dual graph encoding (using a Relational Graph Attention Network), and (iv) applies syntax-tree-based guidance to decompose the generation task.
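For readers unfamiliar with schema graphs, the following Python sketch shows one simple way a database schema could be turned into a typed graph. It is an assumption-laden illustration, not the construction used in the paper: the `build_schema_graph` function, the node naming scheme, and the relation labels (`has_column`, `belongs_to`, `foreign_key`, `referenced_by`) are all hypothetical.

```python
from collections import defaultdict


def build_schema_graph(tables, foreign_keys):
    """Build a typed adjacency map for a database schema.

    tables: {table_name: [column_name, ...]}
    foreign_keys: [(src_table, src_column, dst_table, dst_column), ...]
    Returns {node: {(relation, neighbor), ...}} where nodes are table names
    or "table.column" strings. Relation labels are illustrative only.
    """
    edges = defaultdict(set)
    for table, columns in tables.items():
        for column in columns:
            col_node = f"{table}.{column}"
            edges[table].add(("has_column", col_node))
            edges[col_node].add(("belongs_to", table))
    for src_table, src_col, dst_table, dst_col in foreign_keys:
        src, dst = f"{src_table}.{src_col}", f"{dst_table}.{dst_col}"
        edges[src].add(("foreign_key", dst))
        edges[dst].add(("referenced_by", src))
    return edges


tables = {"singer": ["id", "name"], "concert": ["id", "singer_id"]}
foreign_keys = [("concert", "singer_id", "singer", "id")]
graph = build_schema_graph(tables, foreign_keys)
for node in sorted(graph):
    print(node, sorted(graph[node]))
```

A relational graph encoder such as RGAT would consume typed edges of this kind, with each relation label receiving its own learned parameters.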

Questions for Authors

Please see W1-W3.

Claims and Evidence

Some of the claims are not well supported.

For example, please add more discussion of [1], which achieves high execution accuracy on BIRD and argues that schema linking is not important "if the schema fits within the context length".

Is the proposed decomposition strategy better than CHASE-SQL [2]? I didn't find results of CHASE-SQL in table 1.

[1] Maamari, Karime, et al. "The death of schema linking? text-to-sql in the age of well-reasoned language models." arXiv preprint arXiv:2408.07702 (2024).

[2] Pourreza, Mohammadreza, et al. "Chase-sql: Multi-path reasoning and preference optimized candidate selection in text-to-sql." arXiv preprint arXiv:2410.01943 (2024).

Methods and Evaluation Criteria

Spider and Bird are important benchmarks for text-to-SQL solutions.

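Theoretical Claims

N/A

Experimental Design and Analysis

The experimental design is sound.

Supplementary Material

No.

Relation to Existing Literature

Text-to-SQL is an important problem with significant practical importance.

Essential References Not Discussed

No.

Other Strengths and Weaknesses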

S1. The proposed method for schema linking is sound.

S2. The evaluation section compared with many different models and techniques.

W1. The performance of the model on the BIRD benchmark is 61.8% (Table 2), which places it at roughly ranks 25-32 on the BIRD leaderboard.

W2. More discussion and support are needed on why schema linking is important, given that Distillery-SQL argues otherwise. Distillery-SQL ranks 7th on the BIRD benchmark.

W3. A better demonstration is needed of why the proposed task decomposition is better than that of CHASE-SQL.

Other Comments or Suggestions

Reduce the number of models in Table 1. Maybe only include the best model in the fine-tuned and structure learning categories.

Author Response

2025-04-01

Dear Reviewer 5QmZ,

Thank you for your expertise and insightful comments. Below are detailed responses to your comments and suggestions:

Performance on BIRD. Thanks for your insightful comments. For a thorough evaluation of SGU-SQL's performance, we added top-performing models from the BIRD leaderboard as baselines. From the top 10 methods in the BIRD leaderboard, we include CHASE-SQL (4th), OpenSearch-SQL (6th), Distillery (7th), CHESS (8th), and PURPLE (10th) in our comparisons. We exclude the remaining methods (AskData, Contextual-SQL, ExSL, Insights AI) since they are all industrial solutions without any released instructions (papers and technical reports) or accessible code.

Table 1: Performance comparison on BIRD dev with different LLMs as backbones.

| Backbone LLM | MAC-SQL | PURPLE | E-SQL | CHESS | Distillery | CHASE-SQL | SGU-SQL (Ours) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| +GPT-4 | 59.59 | 60.71 | 58.95 | 61.37 | - | - | 61.80 |
| +GPT-4o | 65.05 | 68.12 | 65.58 | 68.31 | 67.21 | - | 69.28 |
| +Gemini-1.5 Pro | - | - | - | - | - | 73.14 | 72.93 |
| +Claude 3.5 Sonnet | - | - | - | - | - | 69.53 | 70.36 |

Due to time and API budget limits, we have currently only evaluated our model's performance with Gemini 1.5 Pro and Claude 3.5. We plan to conduct more comprehensive experiments with other baselines using these advanced LLMs in future work.

As shown in Table 1, we have the following observations:

SGU-SQL+GPT-4 achieves the best performance compared to the other baselines using GPT-4 as the backbone.
SGU-SQL+GPT-4o achieves 69.28%, outperforming the strong baselines: E-SQL+GPT-4o (65.58%), Distillery+GPT-4o (67.21%), PURPLE+GPT-4o (68.12%) and CHESS+GPT-4o (68.31%).
When using Gemini 1.5 Pro as the backbone, SGU-SQL achieves highly competitive results (72.76% with gemini-1.5-pro and 72.93% with gemini-1.5-pro-exp-0827) compared to CHASE-SQL (73.01%).
With Claude 3.5 Sonnet as the backbone, SGU-SQL (70.36%) slightly outperforms CHASE-SQL (69.53%).

To summarize, our approach demonstrates robust and competitive performance across different base LLMs.

The importance of schema linking. We thank the reviewer for raising this important point. While we commend Distillery-SQL’s novel schema-free paradigm leveraging iterative refinement and execution feedback, we respectfully contend that explicit schema linking remains indispensable for real-world text-to-SQL systems, particularly for the following reasons:

While Distillery-SQL achieves strong results without dedicated schema linking, its reliance on iterative query refinement (via augmentation/selection/correction) introduces substantial computational overhead. For instance, their pipeline requires multiple LLM calls with database execution feedback, which incurs significant latency and infrastructure costs. In contrast, schema linking modules enable single-pass query generation while maintaining desirable performance.
Inaccurate schema linking degrades LLM-based SQL generation, while accurate schema linking still improves model performance. Current top-performing models (XiYan-SQL, CHASE-SQL, etc.) universally incorporate schema linking, achieving superior performance on benchmarks like BIRD and Spider (+5-8% over schema-free baselines). This aligns with our findings: explicit schema linking improves robustness, particularly for long-tail schemas and compositional queries.

Compared to CHASE-SQL. Thanks for your insightful comments. Following your suggestion, we compare our model with CHASE-SQL by integrating different backbone LLMs.

| Model | +GPT-4o | +Gemini-1.5 Pro | +Claude 3.5 Sonnet |
| --- | --- | --- | --- |
| CHASE-SQL | - | 73.14 | 69.53 |
| SGU-SQL | 69.28 | 72.93 | 70.36 |

Notably, CHASE-SQL incorporates a query fixer module that leverages database execution feedback to guide LLMs to refine generated queries iteratively. In contrast, our model generates SQL queries in a single pass without utilizing any execution feedback. As shown in the table, our model shows more desirable performance than CHASE-SQL. This is because, whereas traditional methods attempt to generate entire SQL queries in one step or rely on simple decomposition strategies, SGU-SQL breaks down the complex generation task in a syntax-aware manner. This ensures that the generated queries maintain both semantic accuracy (correctly capturing user intentions) and syntactic correctness (following proper SQL structure).

The structure of Table 1. Thanks a lot for your insightful comments. Following your suggestion, we will make Table 1 more concise by removing some less important baselines.

Review

Rating: 1 (2025-03-17)

This paper proposes a novel methodology to enhance the schema linking and complex SQL generation of LLMs for the text-to-SQL domain. Current LLM-based text-to-SQL methods face several challenges, such as ambiguous user intent, sophisticated database schemas that often lack proper documentation, and the complex syntactic structure of SQL queries. To address these challenges, this work proposes SGU-SQL, which represents the user query and the database structure as a unified graph and uses a structure-learning model to find the links between the user question and the database schema, effectively improving schema linking. Finally, the linked schema is divided into sub-syntax trees that are used to generate the final SQL query incrementally, breaking the complex SQL generation into multiple steps.

Questions for Authors

N/A

Claims and Evidence

The claim of surpassing SOTA performance on both Spider and BIRD is not quite accurate, as methods such as XiYan-SQL and CHASE-SQL achieve much higher performance on both of these benchmarks.
The decomposition approach proposed in this work is only compared with few-shot ICL, CoT, DIN-SQL, ACT-SQL, and MAC-SQL, which are all relatively older decomposition approaches. Methods like the divide-and-conquer prompting suggested in CHASE-SQL are more advanced and should be considered as well.

Methods and Evaluation Criteria

The benchmarks and evaluation criteria used in this paper are fair, but the problem is the use of older LLMs such as GPT-4, PaLM, and the text-bison model. Some of these models are no longer used for text-to-SQL pipelines, and considering more advanced models like Gemini-2.0-flash or GPT-4o is essential for a fair comparison.

Theoretical Claims

In the problem formulation section, Definition 1 is wrong. It states "given a natural language query D"; I think the authors should use Q instead of D here?

Experimental Design and Analysis

I checked all experiments and mentioned some of the issues in the above comments. Additionally, I think it would be beneficial to compare the schema linking method proposed in this work with some of the previous works such as the ones used in CODES and CHESS paper in terms of recall and precision.

Supplementary Material

Yes, the BIRD results, future work, and related work.

Relation to Existing Literature

The proposed approach for schema linking seems promising to mitigate some of the challenges for complex queries and database schemas.

Essential References Not Discussed

The XiYan-SQL paper is not mentioned.

Other Strengths and Weaknesses

N/A

Other Comments or Suggestions

N/A

Author Response

2025-04-01

Dear Reviewer hLFK,

Thanks a lot for your detailed feedback. We really appreciate your time and effort in pointing out the potential concerns related to our paper, and also, thanks a lot for the opportunity to clarify the technical details and contribution of our framework.

To avoid any potential confusion, we first offer the following clarification:

Our model, like most text-to-SQL methods, is model-agnostic, meaning that it can be integrated with any LLM as the backbone model.
In Tables 1 and 2, we report the main results using GPT-3.5 and GPT-4 for cost-effectiveness considerations.
We compare with the top-performing methods (CHASE-SQL, CHESS, Distillery) using Gemini-1.5 Pro and GPT-4o as the backbone LLM in Table 5 of the Appendix.

Below are our responses in detail.

Detailed comparison with XiYan-SQL. Thanks for your insightful comments. XiYan-SQL is the top research-based method (3rd on the BIRD leaderboard, behind two industrial solutions), but its source code and backbone LLM remain undisclosed.

However, the authors released the pre-trained version (XiYanSQL-QwenCoder) on Hugging Face, enabling a direct comparison by using Qwen2.5-Coder as the backbone LLM.

Table 2: Comparison with XiYan-SQL using Qwen2.5-Coder as the backbone LLM.

| Model | +Qwen2.5-Coder-7B | +Qwen2.5-Coder-14B | +Qwen2.5-Coder-32B |
| --- | --- | --- | --- |
| XiYan-SQL(DDL) | 56.58 | 60.37 | 63.04 |
| XiYan-SQL(M-Schema) | 59.78 | 63.10 | 67.01 |
| SGU-SQL | 60.24 | 63.75 | 68.12 |

As shown in Table 2, our SGU-SQL outperforms this competitor across all model sizes, suggesting the effectiveness and robustness of our framework.

Compared with other SOTA methods on BIRD. Thanks for your insightful suggestions. For a thorough evaluation of SGU-SQL's performance, we added top-performing models from the BIRD leaderboard as baselines. From the top 10 methods in the BIRD leaderboard, we include XiYan-SQL (3rd), CHASE-SQL (4th), OpenSearch-SQL (6th), Distillery (7th), CHESS (8th), and PURPLE (10th) in our comparisons. We exclude the remaining 4 methods (AskData, Contextual-SQL, ExSL, Insights AI) since they are all industrial solutions without any released instructions (papers and technical reports) or accessible code.

Table 1: Performance comparison on BIRD dev with different LLMs as backbones.

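| Execution Accuracy | MAC-SQL | PURPLE | E-SQL | CHESS | Distillery | CHASE-SQL | SGU-SQL (Ours) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| GPT-4 | 59.59 | 60.71 | 58.95 | 61.37 | - | - | 61.80 |
| GPT-4o | 65.05 | 68.12 | 65.58 | 68.31 | 67.21 | - | 69.28 |
| Gemini-1.5 Pro | - | - | - | - | - | 73.14 | - |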

Note that PURPLE, Distillery, and CHASE-SQL are closed-source models. We will update their results on GPT-4 and GPT-4o once their implementations become publicly available.

As shown in Table 1, we have the following observations:

SGU-SQL+GPT-4 achieves the best performance compared to the other baselines using GPT-4 as the backbone.
SGU-SQL+GPT-4o achieves 69.28%, outperforming the strong baselines: E-SQL+GPT-4o (65.58%), Distillery+GPT-4o (67.21%), PURPLE+GPT-4o (68.12%) and CHESS+GPT-4o (68.31%). (We didn’t compare our model with CHASE-SQL+GPT-4o since CHASE-SQL is still closed-source and cannot be integrated with other LLMs. XiYan-SQL is also closed-source; only its QwenCoder series model has been released.)
When using Gemini 1.5 Pro as the backbone, SGU-SQL achieves highly competitive results (72.76% with gemini-1.5-pro and 72.93% with gemini-1.5-pro-exp-0827) compared to CHASE-SQL (73.01%).
With Claude 3.5 Sonnet as the backbone, SGU-SQL (70.36%) slightly outperforms CHASE-SQL (69.53%). This improvement suggests that our method may better leverage Claude's capabilities through its structured decomposition approach.

To summarize, our approach demonstrates robust and competitive performance across different base LLMs.

The effect of schema linking. Thank you for the constructive comments. Following your suggestion, we compare our graph-based schema linking with previous models and report the results in the following table.

Table 3. Schema linking on BIRD.

| Metrics | CodeS | CHESS | SGU-SQL |
| --- | --- | --- | --- |
| Precision | 92.40 | 93.12 | 95.19 |
| Recall | 79.69 | 81.33 | 85.60 |

As shown in Table 3, our model achieves the best performance across different linking strategies, which further verifies the effectiveness of our proposed structure-aware linking mechanism.
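For reference, schema-linking precision and recall for a single question can be computed as in the Python sketch below. This is a generic illustration, not the evaluation code behind the table above: the `linking_precision_recall` function and the example column names are hypothetical, and how per-question scores are aggregated over the dataset (micro or macro averaging) is not specified in this response.

```python
def linking_precision_recall(predicted, gold):
    """Precision and recall of predicted schema elements against gold ones.

    predicted, gold: iterables of schema element names such as "table.column".
    Returns (precision, recall); each is 0.0 when its denominator is empty.
    """
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical example: 3 columns predicted, 4 columns actually needed.
predicted = ["singer.name", "singer.age", "concert.year"]
gold = ["singer.name", "concert.year", "concert.id", "stadium.id"]
precision, recall = linking_precision_recall(predicted, gold)
print(f"precision={precision:.3f} recall={recall:.3f}")
```

Recall matters most for downstream generation, since a gold column that the linker misses cannot be recovered by the LLM, whereas an extra column only adds noise to the prompt.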

Typos in Definition 1. Thanks for your careful review. We will fix the typos in lines 62-63 by changing the statement “Given a natural language query D and a database schema Q” into “Given a natural language query Q and a database schema D.”

Final Decision: Accept (poster)

2025-05-01

This paper proposes SGU-SQL, a structure-guided framework designed to improve the performance of Large Language Models (LLMs) on complex text-to-SQL tasks. The core idea is to address challenges like ambiguous user intent and complex database schemas through two main components: 1) a structure-aware schema linking module that represents the user query and database schema as graphs and uses a Relational Graph Attention Network (RGAT) to identify relevant schema elements, and 2) a syntax-guided decomposition strategy that breaks down the SQL generation process into smaller sub-tasks based on SQL syntax trees, guiding the LLM to construct the query incrementally. Initial reviews were mixed, acknowledging the novelty of combining structure-aware linking and syntax-guided decomposition, but raising significant concerns. Key issues included the accuracy of claims regarding state-of-the-art (SOTA) performance (missing comparisons to top leaderboard methods like XiYan-SQL and CHASE-SQL), the justification for the importance of schema linking (given schema-free methods like Distillery-SQL), the limited comparison of the decomposition strategy to recent methods, and the use of older LLM backbones in primary experiments.

In response, the authors provided a substantial rebuttal with extensive new experimental results. They added direct comparisons against top BIRD leaderboard competitors (including XiYan-SQL on Qwen models, CHASE-SQL, Distillery, CHESS, etc.) using various modern LLM backbones (GPT-4, GPT-4o, Gemini 1.5 Pro, Claude 3.5 Sonnet), demonstrating that SGU-SQL achieves competitive or state-of-the-art performance across different models, notably outperforming CHASE-SQL on Claude 3.5 Sonnet and XiYan-SQL on Qwen models without relying on execution feedback like CHASE-SQL. They defended the importance of schema linking by arguing for its efficiency benefits over iterative refinement methods like Distillery and provided new experimental results showing their schema linking outperforms prior methods (CODES, CHESS). They also added ablations on the GNN component (justifying RGAT) and addressed clarity issues regarding evaluation metrics and error analysis. This rebuttal effort successfully convinced one reviewer (wUD6) to raise their score from Weak Accept to Accept. However, the two most critical reviewers (hLFK and 5QmZ), who had raised the concerns about SOTA comparisons and schema linking, did not acknowledge or respond to the rebuttal, despite the authors providing direct evidence addressing their points.