ROUTE: Robust Multitask Tuning and Collaboration for Text-to-SQL
A novel multitask collaborative solution for Text-to-SQL via LLMs.
Abstract
Reviews and Discussion
The paper proposes ROUTE, a method that involves (1) multi-task training and (2) multi-stage prompting to improve LLMs’ performance on text-to-SQL parsing. Compared to an extensive list of baselines, the proposed ROUTE method demonstrates better parsing accuracy on two established datasets, Spider and BIRD, and three other perturbed variants of Spider.
Strengths
- The performance of ROUTE is strong. On two well-established text-to-SQL datasets, Spider and BIRD, ROUTE effectively improves the performance of two of the latest open-weight LLMs, Llama 3.1 and Qwen 2.5, and achieves performance comparable to GPT-4. The performance improvement also holds on three perturbed Spider variants, indicating the robustness of ROUTE.
- The experiments to evaluate ROUTE are comprehensive. The authors gathered an extensive list of baselines and compared them with ROUTE (or the multi-stage prompting step in ROUTE). Additional experiments and ablation studies further support some design choices of ROUTE and demonstrate its stable performance improvement across different models and datasets.
- The paper is easy to follow, and the writing is mostly clear.
Weaknesses
- It is not very clear what kind of novel contribution this paper is making. The tasks themselves for multi-task training and multi-stage prompting have all been studied in related work, and some of them are rebranded under new terms. To name some examples, "noise correction" is essentially training and prompting LLMs to self-debug [1][2], and the "continuation writing" is simply a subset of text-to-SQL generation by the autoregressive nature of LLMs and has been one of the prompting paradigms for text-to-SQL parsing with LLMs [3][4]. At the framework level, there are also existing papers compiling different tasks to improve LMs' text-to-SQL performance via multi-task training [5]. Thus, it is opaque how the proposed method combines these existing ideas in a novel way.
- The noisy correspondence filtering step to pre-process the training data is not fully elaborated, and the contribution of this step to ROUTE is minimal according to the ablation study (#1 vs #3 and #6 vs #8 in Table 3). Training details and quality of the noise filtering model are not discussed, e.g., through an intrinsic evaluation of how accurately it can discriminate noisy examples. The difference between ROUTE's data synthesis procedure and that of SENSE is not clear. The method to "artificially and randomly introduce errors" (lines 225-230) is also not documented. Overall, this part of the method is not clearly explained, and its contribution in ROUTE is not obvious.
[1] https://arxiv.org/abs/2304.05128
[2] https://arxiv.org/abs/2312.11242
[3] https://arxiv.org/abs/2204.00498
[4] A comprehensive evaluation of ChatGPT's zero-shot Text-to-SQL capability, arXiv preprint, 2023.
[5] MIGA: A Unified Multi-task Generation Framework for Conversational Text-to-SQL, AAAI, 2023.
Questions
- What is the "pseudo-SQL" used to perform schema linking? How is it implemented? This term only appears twice in the paper without any further elaboration.
- The use of "hallucination" may not be appropriate here in the context of text-to-SQL parsing. Are the authors simply trying to say incorrect column matching and entity linking?
- The manuscript would benefit from another round of proofreading to correct typos and standardize term usage, including those mentioned above and some other examples as follows:
- “in-contextual learning” -> “in-context learning” (line 39)
- “promoting-based methods” -> “prompting-based methods” (lines 200, 373)
- “shema linking” -> “schema linking” (line 228)
- “SQLer” -> “SQLer” (line 283)
We thank Reviewer Ej7x for the positive recognition of our ROUTE performance and the insightful comments that contribute to improving our paper. Below, we address your concerns one by one.
W1: It is not very clear what kind of novel contribution this paper is making. The tasks themselves for multi-task training and multi-stage prompting have all been studied in related work, and some of them are rebranded under new terms. To name some examples, "noise correction" is essentially training and prompting LLMs to self-debug [1][2], and the "continuation writing" is simply a subset of text-to-SQL generation by the autoregressive nature of LLMs and has been one of the prompting paradigms for text-to-SQL parsing with LLMs [3][4]. At the framework level, there are also existing papers compiling different tasks to improve LMs' text-to-SQL performance via multi-task training [5]. Thus, it is opaque how the proposed method combines these existing ideas in a novel way.
Ans: Thank you for your detailed review. We acknowledge that the proposed ROUTE shares similarities with several recent works [1,2,3,4,5]. However, it offers valuable insights and makes significant contributions to the field of Text2SQL, which can be summarized as follows:
- ROUTE is among the pioneering frameworks in the context of LLMs that explore multi-task tuning and collaborative prompting to improve Text2SQL performance.
- We have thoroughly introduced and defined three important tasks in SQL generation, demonstrating that multi-task tuning and collaborative prompting across Schema Linking, Noise Correction, and Continuation Writing significantly improve SQL generation accuracy. The additionally introduced SQL-related tasks are well integrated during both the training and inference phases.
- We have achieved state-of-the-art performance with 7B/14B-sized LLMs on both the widely recognized SPIDER and BIRD benchmarks, with verified generalization and transferability across various cross-domain benchmarks and LLMs.
We summarize the differences and relationships between ROUTE and the works you mentioned as follows:
[1] proposed an effective self-debugging strategy that teaches existing LLMs to debug predicted programs (SQL, Python, C++, etc.) via few-shot demonstrations. In contrast, our method applies self-debugging specifically to SQL generation and does so in the context of supervised fine-tuning. (lines 101~102)
[2] proposed a multi-agent collaboration framework built on the closed-source LLM GPT-4 to enhance text-to-SQL parsing. However, we found that migrating it to small-sized models (Table 1) is of limited effectiveness. We propose MCP, which achieves promising performance by combining multi-task collaboration with simple instructions. (lines 104~105)
[3] performed an empirical evaluation based on Codex. However, it does not explicitly enhance the relevant capabilities via SFT; it only exploits the autoregressive nature of LLMs to improve performance with few-shot examples. Instead, we explore a solution that explicitly enhances multi-task capabilities to improve the accuracy of SQL generation. (line 93)
[4] is a pioneering work evaluating the performance of ChatGPT on Text2SQL, showing that it has great potential for SQL generation. However, our paper focuses on open-source LLMs in the context of supervised fine-tuning. (line 98)
[5] proposed multiple subtasks and combined them with generative pretrained language models to improve Text2SQL. However, the subtasks it focuses on do not completely overlap with ours. Moreover, the effectiveness of this paradigm in the context of LLMs is unknown. (line 91)
W2-I The noisy correspondence filtering step to pre-process the training data is not fully elaborated, and the contribution of this step to ROUTE is minimal according to the ablation study (#1 vs #3 and #6 vs #8 in Table 3).
Ans: Thank you for your valuable comment. For SFT, the impact of noisy correspondence filtering (#6 vs #8) is not obvious, which indicates that noisy data accounts for a relatively low proportion of the Spider and BIRD datasets. After introducing multi-task data synthesis, we observed that the noisy correspondence filtering step significantly boosts performance (#1 vs #3) on SPIDER. This indicates that the noise is substantial, and that the accumulation of noise from multi-task data synthesis can adversely impact the model's comprehension of basic (simple) patterns.
W2-II (1) Training details and quality of the noise filtering model are not discussed, e.g., through an intrinsic evaluation of how accurately it can discriminate noisy examples. (2) The difference between ROUTE's data synthesis procedure and that of SENSE is not clear. (3) The method to "artificially and randomly introduce errors" (lines 225-230) is also not documented. (4) Overall, this part of the method is not clearly explained, and its contribution in ROUTE is not obvious.
Ans: Thank you for your valuable comments, we will reply to this weakness point-by-point.
(1) Following the suggestion, we have added an experiment to assess the accuracy of noise sample identification. We construct an evaluation set from the SPIDER and BIRD development sets to evaluate the LLM's ability to identify positive and negative examples, where the negative examples come from challenging, artificially synthesized SQL queries and from incorrect SQL queries produced by Llama3-8B.
As shown in the following table, the LLM's ability to identify negative (noisy) and positive examples is significantly improved after SFT. Further, for a more challenging dataset like BIRD, there remains room for improvement in distinguishing noisy samples.
| | w/o SFT (Spider) | w/o SFT (Bird) | w/o SFT (All) | with SFT (Spider) | with SFT (Bird) | with SFT (All) |
|---|---|---|---|---|---|---|
| Positive (Ground-truth, 100) | 0.51 | 0.23 | 0.37 | 0.96 | 0.83 | 0.90 |
| Negative (Artificial, 100) | 0.64 | 0.50 | 0.57 | 0.68 | 0.77 | 0.73 |
| Negative (Llama3, 100) | 0.63 | 0.83 | 0.73 | 0.96 | 0.94 | 0.95 |
(2) Compared to SENSE, the data synthesis pipeline of ROUTE encompasses not only Text2SQL but also multiple other SQL-related tasks. Our approach focuses on utilizing existing data to synthesize multi-task SFT data, thereby enhancing the capabilities of open-source language models to handle various SQL-related tasks. In contrast, SENSE mainly focuses on the SQL generation task, leveraging powerful LLMs to increase the diversity of the SQL generation training set and to synthesize preference data. We have clarified the relationship and difference between ROUTE and SENSE in our revised submission (Appendix A.8).
(3) Thank you for your reminder; we have clarified it in Appendix A.7 of our revised manuscript (a rough illustrative sketch is given after this list).
(4) Please see our Answer to W1.
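Regarding point (3), random error injection into gold SQL might look like the sketch below. The perturbation types shown are our assumptions for illustration only; the actual procedure is the one documented in Appendix A.7 of the revision.

```python
import random

def corrupt_sql(sql, rng=random):
    """Illustrative error injection for synthesizing negative (noisy)
    <Question, SQL> pairs; the perturbation types are assumptions.
    Some perturbations are no-ops if the pattern is absent."""
    perturbations = [
        lambda s: s.replace("=", "<>", 1),            # flip a comparison
        lambda s: s.replace("JOIN", "LEFT JOIN", 1),  # change a join type
        lambda s: " ".join(s.split()[:-1]),           # drop the last token
        lambda s: s.replace("DESC", "ASC", 1),        # invert an ordering
    ]
    return rng.choice(perturbations)(sql)

print(corrupt_sql("SELECT name FROM singer WHERE age = 30"))
```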
Q1: What is the “pseudo-SQL” used to perform schema linking? How is it implemented? This term only appears twice in the paper without any further elaboration.
Ans: Thanks for your question. The pseudo-SQL refers to an intermediate SQL query generated using the defined template and the complete schema; we have clarified this in the revised submission (please see lines 277~284 for details).
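For intuition, below is a minimal sketch of how a pseudo-SQL step can drive schema linking, assuming a simple identifier-matching heuristic; `generate_sql` stands in for any LLM call, and the prompt wording and parsing are illustrative, not the paper's exact implementation.

```python
import re

def pseudo_sql_schema_linking(question, schema, generate_sql):
    """Sketch of pseudo-SQL-based schema linking. `schema` maps table
    names to column lists; `generate_sql` is a hypothetical LLM call."""
    # Step 1: generate an intermediate (pseudo) SQL query conditioned
    # on the complete schema.
    schema_text = "\n".join(f"{t}({', '.join(c)})" for t, c in schema.items())
    pseudo_sql = generate_sql(
        f"Database schema:\n{schema_text}\n\nQuestion: {question}\nSQL:"
    )
    # Step 2: keep only the tables/columns the pseudo SQL mentions; this
    # simplified schema is what the final SQL generation step sees.
    tokens = set(re.findall(r"[A-Za-z_][A-Za-z0-9_]*", pseudo_sql.lower()))
    return {
        t: [c for c in cols if c.lower() in tokens]
        for t, cols in schema.items()
        if t.lower() in tokens
    }
```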
Q2: The use of "hallucination" may not be appropriate here in the context of text-to-SQL parsing. Are the authors simply trying to say incorrect column matching and entity linking?
Ans: Thanks for your feedback. We follow several recent works in using the term hallucination for the SQL generation task [6,7]. Hallucinations in LLMs refer to cases where LLMs generate plausible but factually incorrect or nonsensical information [8]. Hallucination in the Text2SQL task refers to incorrect SQL generations. Schema hallucinations and logic hallucinations are widely observed in LLM-based SQL generation [7].
Considering that the definition of hallucinations might be unclear in the context of Text2SQL, we have revised the usage of SQL hallucinations in the revised submission. Please see our revised submission for details.
Q3: The manuscript would benefit from another round of proof-read to correct typos and standardize term usage.
Ans: Thanks for your suggestion. The typos and non-standard term usages have been double-checked. We will re-polish our paper carefully.
Reference
[1] Teaching Large Language Models to Self-Debug, ICLR, 2024.
[2] MAC-SQL: A Multi-Agent Collaborative Framework for Text-to-SQL, arXiv preprint, 2024.
[3] Evaluating the Text-to-SQL Capabilities of Large Language Models, arXiv preprint, 2022.
[4] A Comprehensive Evaluation of ChatGPT's Zero-shot Text-to-SQL Capability, arXiv preprint, 2023.
[5] MIGA: A Unified Multi-task Generation Framework for Conversational Text-to-SQL, AAAI, 2023.
[6] Before Generation, Align it! A Novel and Effective Strategy for Mitigating Hallucinations in Text-to-SQL Generation, ACL, 2024.
[7] PURPLE: Making a Large Language Model a Better SQL Writer, ICDE, 2024.
[8] A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions, ACM TOIS.
Thanks to the authors for their detailed response and efforts in updating the paper. While I am not fully convinced, this paper may be sufficiently distinct from related work as a novel combination of existing methods. Given the strong empirical performance and newly added experiments, I have increased my score accordingly.
Dear Reviewer Ej7x,
We greatly appreciate your timely feedback and positive support. We will further refine our manuscript to improve its quality. Thanks again for your feedback and for taking the time to consider our response.
Best, Authors
Dear Reviewer Ej7x,
Thank you for raising the score! We have provided further clarification and discussion on the innovations, contributions, and differences between our ROUTE and existing methods in Appendix A.8 of our revised submission.
Best, Authors
The paper introduces ROUTE, a method for enhancing text-to-SQL capabilities in open-source language models. The approach addresses limitations in current methods that rely heavily on closed-source large language models (LLMs), such as GPT-4, for text-to-SQL tasks. ROUTE leverages Multitask Supervised Fine-Tuning (MSFT) and Multitask Collaboration Prompting (MCP) to improve SQL generation performance by incorporating tasks like schema linking, noise correction, and continuation writing. These tasks enable a collaborative prompting approach that reduces hallucinations in SQL generation. Extensive experimentation on multiple benchmarks with open-source LLMs shows that ROUTE significantly improves SQL generation accuracy and outperforms recent methods using fine-tuning and prompting approaches.
Strengths
- The paper incorporates multiple tasks to enhance text-to-SQL capabilities, making the LLM more versatile and capable of handling complex SQL generation scenarios.
- The paper evaluates ROUTE on several well-known benchmarks and compares its performance with other prompting and fine-tuning methods, demonstrating its effectiveness in real-world applications.
Weaknesses
- The authors mention that "Most training-based methods only incorporate the ⟨Question, SQL⟩ pairs for SFT, resulting in degraded performance in other tasks, such as schema linking." However, existing approaches usually incorporate a ⟨Question, Schema, SQL⟩ tuple for SFT. Additionally, a reduction in schema linking performance cannot be seen as a limitation of existing methods. If a specific task is not included in training, optimal results for that task are not expected. Therefore, this should not be considered a limitation; instead, one could state that training with schema linking can achieve better outcomes.
- The authors state that "Training LLMs on a single SFT task poses a significant risk of overfitting, which may diminish the model's capability to understand instructions." However, overfitting is not further addressed in the subsequent sections. Could the authors clarify what overfitting entails in the context of SQL tasks, and explain how multi-task training specifically mitigates this risk? Additionally, training on more data and achieving good results may also suggest a potential overfitting scenario.
- The authors mention that "This strategy leverages collaboration across several SQL-related tasks to reduce hallucinations during SQL generation." However, the term "SQL hallucinations" is not defined, nor is there any discussion in the experimental section explaining how hallucinations are reduced. This claimed advantage, therefore, remains unclear.
- If Schema Linking, Noise Correction, and Continuation Writing are considered important, could the authors provide the relative improvement metrics for these tasks?
- There are inconsistencies in writing style, such as using both "text-to-SQL" and "Text-to-SQL" interchangeably. Ensuring uniform terminology would improve the clarity and professionalism of the writing.
Questions
- The noise correction process assumes access to well-curated data and high-quality schema information, which might not be available for all databases or domains. Without rigorous data preparation, the model may struggle with hallucinations, as noise correction and schema linking effectiveness are diminished when data quality is compromised.
- In low-resource settings where high-quality SQL annotations or database schema information might be scarce, could ROUTE be enhanced by incorporating weak supervision, unsupervised learning, or semi-supervised data to fill gaps?
- Given that database schemas often change over time in production, can ROUTE adapt to new tables or columns without needing extensive retraining, or would these require ongoing fine-tuning?
W3: The authors mention that "This strategy leverages collaboration across several SQL-related tasks to reduce hallucinations during SQL generation." However, the term "SQL hallucinations" is not defined, nor is there any discussion in the experimental section explaining how hallucinations are reduced. This claimed advantage, therefore, remains unclear.
Ans: We follow several recent works in using the term hallucination for the SQL generation task [4,5]. Hallucinations in LLMs refer to cases where LLMs generate plausible but factually incorrect or nonsensical information [6]. Schema hallucinations and logic hallucinations are widely observed in LLM-based SQL generation [5].
In this work, we present a multi-task training and collaborative prompting framework that significantly improves the accuracy of Text2SQL; in other words, it reduces hallucinations during the SQL generation process.
Considering that the definition of hallucinations in SQL-related tasks might be unclear, we have revised the usage of SQL hallucinations in the revised submission. Please see our revised submission for details.
W4: If Schema Linking, Noise Correction, and Continuation Writing are considered important, could the authors provide the relative improvement metrics for these tasks?
Ans: Thank you for your constructive suggestions. As suggested, we provide relative improvement metrics for the three tasks. For Schema Linking (SL), we report the Recall/Precision scores of the predicted relevant tables and columns. For Noise Correction (NC), we report the EX scores of SQL queries refined by Noise Correction applied to the output SQL of Llama3. For Continuation Writing (CW), we report the EX scores of SQL queries obtained by continuation writing from half of each ground-truth SQL query. The results are shown in the following table:
| MSFT | SPIDER-SL Table-R/P | SPIDER-SL Column-R/P | BIRD-SL Table-R/P | BIRD-SL Column-R/P | SPIDER-NC Dev-EX | BIRD-NC Dev-EX | SPIDER-CW Dev-EX | BIRD-CW Dev-EX |
|---|---|---|---|---|---|---|---|---|
| ✓ | 97.38/95.71 | 98.59/96.98 | 90.87/90.22 | 96.13/90.89 | 83.4 | 53.4 | 91.1 | 73.9 |
| | 88.35/76.37 | 94.83/91.46 | 83.77/75.38 | 89.55/86.39 | 72.1 | 38.1 | 80.3 | 57.6 |
As observed, MSFT improves all three tasks; notably, continuation writing on half of each ground-truth query achieves an impressive EX score of 91.1. In addition, to further explore the mutual influence between the tasks, we provide more results in Appendix A.5 of the revised manuscript.
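For reference, Recall/Precision of schema linking can be computed with simple set matching over predicted and gold schema items; the sketch below is our assumption of a standard protocol, not necessarily the exact evaluation script used for the table above.

```python
def linking_recall_precision(predicted, gold):
    """Recall/precision of predicted schema items (tables or columns)
    against the gold items of one example, matched case-insensitively."""
    pred, ref = {x.lower() for x in predicted}, {x.lower() for x in gold}
    hit = len(pred & ref)
    recall = hit / len(ref) if ref else 1.0
    precision = hit / len(pred) if pred else 0.0
    return recall, precision

# Over-predicting columns keeps recall high but lowers precision,
# which is the trade-off the tables report as R/P.
r, p = linking_recall_precision(
    predicted=["singer.name", "singer.age", "concert.year"],
    gold=["singer.name", "concert.year"],
)
print(f"recall={r:.2f}, precision={p:.2f}")  # recall=1.00, precision=0.67
```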
W5: There are inconsistencies in writing style, such as using both "text-to-SQL" and "Text-to-SQL" interchangeably. Ensuring uniform terminology would improve the clarity and professionalism of the writing.
Ans: Thank you for your feedback. The typos and inconsistent statements have been revised. We will re-polish our paper carefully.
Q1: The noise correction process assumes access to well-curated data and high-quality schema information, which might not be available for all databases or domains. Without rigorous data preparation, the model may struggle with hallucinations, as noise correction and schema linking effectiveness are diminished when data quality is compromised.
Ans: In our ROUTE, the basic information used to synthesize schema linking and noise correction data is extracted from the databases of two high-quality datasets, SPIDER and BIRD, which are widely used for NL2SQL training/evaluation and provide complete schema information. In addition, we filter the noisy data before synthesis to ensure data quality.
It is undeniable that if the data is noisy and incomplete, the final performance of multi-task collaboration will be affected, as shown in the SPIDER performance of #1 and #3 in Table 4. This encourages the construction of more reliable and complete multi-task datasets in the future to further improve ROUTE.
We thank Reviewer TcaB for the insightful comments and constructive suggestions that contribute to improving our paper. Below, we address your concerns one by one.
W1-I The authors mention that "Most training-based methods only incorporate the ⟨Question, SQL⟩ pairs for SFT, resulting in degraded performance in other tasks, such as schema linking." However, existing approaches usually incorporate a ⟨Question, Schema, SQL⟩ tuple for SFT.
Ans: Thank you for your kind reminder. We apologize for any confusion caused by our misleading expression ⟨Question, SQL⟩. What we meant here is the ⟨Question, Schema, SQL⟩ tuple, as illustrated by our method's notations (Section 3.1) and the prompt template (Appendix A.11). We have revised this sentence in the revised submission. Thank you again for your feedback.
W1-II: Additionally, a reduction in schema linking performance cannot be seen as a limitation of existing methods. If a specific task is not included in training, optimal results for that task are not expected. Therefore, this should not be considered a limitation; instead, one could state that training with schema linking can achieve better outcomes.
Ans: Thank you for your thoughtful comments.
Schema linking is a critical step in the Text2SQL task [1,2] and yields significant performance gains in SQL generation. Training exclusively on the SQL generation task with SFT can severely impair the model's other capabilities (Appendix A.6), such as schema linking.
Therefore, to maintain proficiency across various tasks for effective multi-task collaboration, it is essential to consider multi-task supervised fine-tuning.
Considering that SFT on a single SQL generation task can significantly reduce a model's understanding of other instructions, making it unable to perform tasks like schema linking, we believe this is a potential drawback of traditional SFT-based methods.
To avoid potential misunderstanding, we have revised this sentence as: "However, most training-based methods only incorporate the SQL generation task in the SFT stage, resulting in degraded performance on other tasks that are important for Text2SQL capability, such as schema linking."
W2: The authors state that "Training LLMs on a single SFT task poses a significant risk of overfitting, which may diminish the model's capability to understand instructions." However, overfitting is not further addressed in the subsequent sections. Could the authors clarify what overfitting entails in the context of SQL tasks, and explain how multi-task training specifically mitigates this risk? Additionally, training on more data and achieving good results may also suggest a potential overfitting scenario.
Ans: Thanks for your feedback. It has been widely observed that SFT on a single task significantly reduces the model's instruction-following ability on other tasks [3]. Overfitting on the SQL generation task means that a fine-tuned LLM weakens its ability to understand other instructions and performs poorly on other important SQL-related tasks (such as schema linking and noise correction).
Therefore, in this paper, we explore multitask tuning across various SQL-related tasks to preserve the other essential capabilities of LLMs (see lines 63~69 of our submission for a detailed description). We report the performance of single-task SFT on each task in the following table, where '--' means that the LLM cannot produce output in the expected format due to overfitting. The experimental results show that, compared to multi-task training, models fine-tuned solely on a single task perform poorly on the other tasks, which is caused by overfitting to that single task.
Table 15 in Appendix A.6: The SFT impact of all tasks on each other.
| No. | Settings | SPIDER-TS Dev-EX | BIRD-TS Dev-EX | SPIDER-SL Table-R/P | SPIDER-SL Column-R/P | BIRD-SL Table-R/P | BIRD-SL Column-R/P | SPIDER-NC Dev-EX | BIRD-NC Dev-EX | SPIDER-CW Dev-EX | BIRD-CW Dev-EX |
|---|---|---|---|---|---|---|---|---|---|---|---|
| #1 | Llama3 with MSFT | 83.6 | 53.6 | 97.38/95.71 | 98.59/96.98 | 90.87/90.22 | 96.13/90.89 | 83.4 | 53.4 | 91.1 | 73.9 |
| #2 | SFT with TS | 83.1 | 52.9 | -- | -- | -- | -- | -- | -- | 85.6 | 69.2 |
| #3 | SFT with SL | -- | -- | 95.55/92.69 | 98.91/95.29 | 87.84/85.11 | 94.93/89.51 | -- | -- | -- | -- |
| #4 | SFT with NC | 0.1 | 8.7 | -- | -- | -- | -- | 78.9 | 49.3 | 48.6 | 38.6 |
| #5 | SFT with CW | 68.1 | 39.0 | -- | -- | -- | -- | -- | -- | 89.8 | 70.1 |
| #6 | Llama3 w/o SFT | 69.3 | 32.1 | 88.35/76.37 | 94.83/91.46 | 83.77/75.38 | 89.55/86.39 | 72.1 | 38.1 | 80.3 | 57.6 |
According to your feedback, we have revised this sentence to minimize any confusion caused by the term "overfitting", as follows:
Training LLMs on a single SQL generation task poses a substantial risk of diminishing performance in understanding instructions, potentially reducing the model's effectiveness in other important SQL-related tasks beyond SQL generation.
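To make the multitask setup concrete, here is a minimal sketch of mixing SFT examples from the four tasks into one shuffled training stream; the instruction wording and field names are illustrative assumptions, not the paper's exact templates.

```python
import random

# Hypothetical instruction prefixes for the four SQL-related tasks:
# TS = Text-to-SQL, SL = schema linking, NC = noise correction,
# CW = continuation writing.
TASK_TEMPLATES = {
    "TS": "Write a SQL query for the question.\n{schema}\nQ: {question}",
    "SL": "List the tables/columns relevant to the question.\n{schema}\nQ: {question}",
    "NC": "Check this SQL against the question and fix it if wrong.\n{schema}\nQ: {question}\nSQL: {sql}",
    "CW": "Complete this partial SQL query.\n{schema}\nQ: {question}\nSQL: {sql}",
}

def build_msft_stream(examples, seed=0):
    """Render per-task examples into prompt/response pairs and interleave
    them, so no single task dominates the fine-tuning batches."""
    stream = [
        {
            "prompt": TASK_TEMPLATES[ex["task"]].format(
                schema=ex["schema"],
                question=ex["question"],
                sql=ex.get("sql", ""),
            ),
            "response": ex["target"],
        }
        for ex in examples
    ]
    random.Random(seed).shuffle(stream)
    return stream
```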
Q2: In low-resource settings where high-quality SQL annotations or database schema information might be scarce, could ROUTE be enhanced by incorporating weak supervision, unsupervised learning, or semi-supervised data to fill gaps?
Ans: Thank you for your valuable question. We agree that this direction is valuable and feasible.
There are now a large number of high-quality, public Text-to-SQL datasets, either LLM-synthesized or human-annotated [7,8]. We have collected an SFT dataset comprising approximately 1 million entries for the Text2SQL task. Despite the large scale, low-quality data remains prevalent, and scenarios involving complex queries are relatively underrepresented. Therefore, it is valuable to explore weak supervision, unsupervised learning, or semi-supervised paradigms to enrich and refine the training data. For example, many approaches have been proposed to synthesize Text2SQL training data [9,10].
However, in this paper, we primarily follow the mainstream approaches [2,9], focusing on the supervised fine-tuning paradigm and using high-quality, well-established Text2SQL training data [11,12].
In summary, exploring weak supervision, unsupervised learning, or semi-supervised learning in low-resource scenarios to achieve reliable Text2SQL is promising but beyond the scope of this work. We may explore this direction in future work.
Q3: Given that database schemas often change over time in production, can ROUTE adapt to new tables or columns without needing extensive retraining, or would these require ongoing fine-tuning?
Ans: Good question. Our fine-tuned LLM can be directly applied to new tables without the need for additional fine-tuning in novel scenarios.
During inference, we supply the input prompt with necessary information from the test database. Consequently, our ROUTE can be seamlessly applied to cross-domain datasets beyond the original training domains (SPIDER and BIRD). This is demonstrated by our results on the SPIDER variants, as illustrated in Section 4.2-Table 2.
Additionally, we present results on the Dr.Spider benchmark [13], as requested by Reviewer 2daL. This benchmark includes 17 perturbation test sets designed to simulate dynamic scenarios in real-world applications. Our findings, displayed in the following table, clearly demonstrate that our ROUTE method maintains a significant advantage. The results indicate that our approach can be effectively applied to dynamic environments without compromising its performance.
Table 13 in Appendix A.5: The performance on Dr.Spider benchmark.
| Methods | Avg.DB Pre~Post | Avg.NLQ Pre~Post | Avg.SQL Pre~Post | Avg.all Pre~Post |
|---|---|---|---|---|
| Llama3 | 70.1~54.3 | 70.6~56.8 | 69.1~65.5 | 69.9~58.8 |
| Llama3 + MCP | 75.6~59.1 | 77.0~61.7 | 74.8~72.6 | 75.8~64.4 |
| Llama3 + SFT | 83.4~66.0 | 83.0~72.8 | 79.9~77.6 | 82.1~72.2 |
| Llama3 + SFT + MCP | 85.4~68.0 | 85.3~75.1 | 84.3~81.7 | 85.0~74.9 |
| Llama3 + MSFT | 83.8~66.3 | 82.9~72.5 | 80.0~77.5 | 82.2~72.1 |
| Llama3 + ROUTE | 86.7~69.6 | 85.4~75.8 | 84.5~81.9 | 85.5~75.8 |
Summary
We have noted that your main concerns (see Weaknesses) about our work arise from the unclear definitions or descriptions of several key terms, such as <Question, SQL>, "SQL hallucinations," and "overfitting in Text2SQL". We acknowledge and appreciate your feedback, and have revised the expressions that may cause confusion.
Considering that you have not raised significant negative feedback on our work in terms of motivation, technical contribution, or experimental performance, we kindly request that you re-evaluate our work based on our detailed explanations.
Reference
[1] DIN-SQL: Decomposed In-Context Learning of Text-to-SQL with Self-Correction, NeurIPS, 2023.
[2] DTS-SQL: Decomposed Text-to-SQL with Small Large Language Models, arXiv preprint, 2024.
[3] Benchmarking the Text-to-SQL Capability of Large Language Models: A Comprehensive Evaluation, arXiv preprint, 2024.
[4] Before Generation, Align it! A Novel and Effective Strategy for Mitigating Hallucinations in Text-to-SQL Generation, ACL, 2024.
[5] PURPLE: Making a Large Language Model a Better SQL Writer, ICDE, 2024.
[6] A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions, ACM TOIS.
[7] https://huggingface.co/datasets/philikai/200k-Text2SQL
[8] https://huggingface.co/datasets/gretelai/synthetic-text-to-sql
[9] CodeS: Towards Building Open-source Language Models for Text-to-SQL, ACM SIGMOD, 2024.
[10] Synthesizing Text-to-SQL Data from Weak and Strong LLMs, ACL, 2024.
[11] Spider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task, EMNLP, 2018.
[12] Can LLM Already Serve as a Database Interface? A Big Bench for Large-Scale Database Grounded Text-to-SQLs, NeurIPS, 2023.
[13] Dr.Spider: A Diagnostic Evaluation Benchmark towards Text-to-SQL Robustness, ICLR, 2023.
Thanks to the authors for their detailed responses, I will increase my score accordingly.
Dear Reviewer TcaB,
We sincerely thank you for your timely feedback and increased score.
Since your current score is still below the acceptance threshold, we would appreciate hearing any additional concerns you may have about our submission. We will do our utmost to address any remaining concerns during the rebuttal stage (about 48 hours left). Therefore, we sincerely hope you can offer further suggestions or feedback to help us improve the manuscript to meet the acceptance threshold.
Thank you again for your feedback and for taking the time to consider our response. We would appreciate it if you could consider raising your score to support the acceptance of this submission, if you have no further concerns.
Best, Authors
Thank you to the authors for their responses and efforts. However, my remaining concern is the novelty of applying multi-task learning in the text-to-SQL field (Although the authors propose many interesting points, none of them is particularly new in this area.). Therefore, I will temporarily maintain my score.
We sincerely appreciate your timely and valuable feedback. To address your concerns, we would like to further elaborate on the innovations of our ROUTE and highlight how it differs from existing methods.
The valuable insights and significant contributions of our work can be summarized as follows:
- ROUTE is among the pioneering frameworks in the context of LLMs that explore multi-task tuning and collaborative prompting to improve Text2SQL performance.
- We have thoroughly introduced and defined three important tasks in SQL generation, demonstrating that multi-task tuning and collaborative prompting across Schema Linking, Noise Correction, and Continuation Writing significantly improve SQL generation accuracy. The additionally introduced SQL-related tasks are well integrated during both the training and inference phases.
- We have achieved state-of-the-art performance with 7B/14B-sized LLMs on both the widely recognized SPIDER and BIRD benchmarks, with verified generalization and transferability across various cross-domain benchmarks and LLMs.
We highlight the key similarities and differences between our ROUTE and other related works as follows:
Multi-task Supervised Fine-tuning (MSFT): The method most comparable to our ROUTE approach is MAC-SQL [2], which introduces multiple task agents and demonstrates the effectiveness of fine-tuning with multi-agent instructions on CodeLlama-7B.
- First, the tasks defined in MSFT for ROUTE differ from those in MAC-SQL. We have introduced a new continuation writing (CW) task to further refine challenging SQL queries. As demonstrated in Ans-W4, CW holds significant potential for SQL generation: on the SPIDER development set, exploiting the CW task achieves an impressive EX score of 91.1.
- Second, in MAC-SQL, generating instruction data for SFT involves decomposing complex questions into multiple sub-questions and constructing corresponding answers. In contrast, our approach, beyond noise correction, allows for the synthesis of SFT data for various tasks using programming functions. This makes our method more practical for large-scale multi-task data synthesis for MSFT.
- Third, in terms of performance, our ROUTE significantly outperforms MAC-SQL based on the open-source CodeLlama-7B. The detailed results are presented in the table below.
| Methods (fine-tuning) | SPIDER-Dev-EX | BIRD-Dev-EX |
|---|---|---|
| CodeLlama-7B (SQL-Llama) + MAC-SQL | 76.3 | 43.9 |
| CodeLlama-7B + ROUTE | 83.2 | 52.2 |

SQL-Data Synthesis: Our ROUTE involves the synthesis of SQL-related instruction-following data, which shares similarities with the recent work SENSE [3].
- First, compared to SENSE, the data synthesis pipeline of ROUTE encompasses not only Text2SQL but also multiple other SQL-related tasks. Our approach focuses on utilizing existing data to synthesize multi-task SFT data, thereby enhancing the capabilities of open-source LLMs to handle various SQL-related tasks. In contrast, SENSE mainly focuses on the SQL generation task, leveraging strong LLMs to increase the diversity of the SQL generation training set and to synthesize preference data.
- Besides, our ROUTE achieves performance comparable to SENSE on the SPIDER development set and better performance on the BIRD development set, as shown in the following table.
Table 10 (Partially) in Appendix A.2: The performance (EX) of different open-source LLMs.
| Methods (fine-tuning) | SPIDER-Dev-EX | BIRD-Dev-EX |
|---|---|---|
| CodeLlama-7B + SENSE | 83.2 | 51.8 |
| CodeLlama-7B + ROUTE | 83.2 | 52.2 |
Multi-tasking Collaboration: To exploit the potential of multi-task capabilities, we propose a multi-task collaborative prompting strategy (MCP) to improve the final SQL generation. The most similar works are DIN-SQL [1] and MAC-SQL [2], which both aim to reduce the complexity of Text2SQL and improve the final performance via self-correction. (A compact sketch of such a loop is given after the table below.)
- First, compared to them, our MCP efficiently integrates multiple tasks using concise prompts across all tasks, which makes it more effective in small-sized LLMs that struggle with comprehending complex instructions. As shown in the results of Section 4.1-Table 1, the effectiveness of MAC-SQL and DIN-SQL is constrained by the limited capacity of small-sized LLMs to comprehend complex instructions, while our MCP can achieve better and impressive performance.
- Besides, while all three methods employ a self-correction strategy to enhance the quality of generated SQL queries, our MCP introduces a novel continuation writing task specifically designed to refine challenging SQL queries and improve the performance significantly.
Table 1 (Partially) in Section 4.2: Performance comparison on SPIDER and BIRD benchmarks.
| Methods/LLMs | SPIDER Dev-EX | SPIDER Dev-TS | SPIDER Test-EX | BIRD Dev-EX | BIRD Dev-VES |
|---|---|---|---|---|---|
| Llama3-8B | 69.3 | 58.4 | 69.1 | 32.1 | 31.6 |
| DIN-SQL + Llama3-8B | 48.7 | 39.3 | 47.4 | 20.4 | 24.6 |
| MAC-SQL + Llama3-8B | 64.3 | 52.8 | 65.2 | 40.7 | 40.8 |
| MCP + Llama3-8B | 75.0 | 63.4 | 72.0 | 42.7 | 44.8 |
| Qwen2.5-7B | 72.5 | 64.0 | 75.9 | 41.1 | 42.0 |
| DIN-SQL + Qwen2.5-7B | 72.1 | 61.2 | 71.1 | 30.1 | 32.4 |
| MAC-SQL + Qwen2.5-7B | 71.7 | 61.9 | 72.9 | 46.7 | 49.8 |
| MCP + Qwen2.5-7B | 78.3 | 67.2 | 78.7 | 49.7 | 52.8 |
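For concreteness, below is a compact sketch of how an MCP-style inference loop can chain the tasks (schema linking, generation, noise correction, continuation writing); the control flow and helper names are our assumptions based on the description above, not the exact pipeline.

```python
def mcp_generate(question, schema, llm, executor, max_rounds=2):
    """Sketch of multitask collaboration prompting. `llm(task, **ctx)` and
    `executor(sql)` are hypothetical stand-ins; `executor` returns None on
    success or an error message on failure."""
    # 1) Schema linking: shrink the schema to question-relevant items.
    linked = llm("SL", question=question, schema=schema)
    # 2) Initial SQL generation over the simplified schema.
    sql = llm("TS", question=question, schema=linked)
    for _ in range(max_rounds):
        status = executor(sql)
        if status is None:
            return sql  # executes cleanly; accept
        # 3) Noise correction: revise using only the execution status.
        sql = llm("NC", question=question, schema=linked, sql=sql, error=status)
    # 4) Continuation writing as a last resort: keep a prefix of the
    #    failing query and let the model rewrite the remainder.
    tokens = sql.split()
    prefix = " ".join(tokens[: max(1, len(tokens) // 2)])
    return llm("CW", question=question, schema=linked, sql=prefix)
```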
Considering the comprehensive nature of our work, which encompasses data synthesis, supervised fine-tuning, and multi-task collaborative prompting, it is inevitable that there are some similarities with existing work. Nevertheless, we have offered numerous insights into the Text2SQL task and achieved promising results, which we believe are significant contributions to the Text2SQL community.
We thank you again for your timely feedback and valuable comments. We have added the above discussion in Appendix A.8 to further clarify our innovations and contributions.
Reference
[1] DIN-SQL: Decomposed In-Context Learning of Text-to-SQL with Self-Correction, NeurIPS, 2023.
[2] MAC-SQL: A Multi-Agent Collaborative Framework for Text-to-SQL, arXiv preprint, 2024.
[3] Synthesizing Text-to-SQL Data from Weak and Strong LLMs, ACL, 2024.
The paper introduces “ROUTE,” a novel approach to (1) finetune open-source large language models (LLMs) for Text-to-SQL through multi-task supervised fine-tuning (MSFT) and (2) leverage multitask collaboration prompting (MCP) for SQL generation during inference. The MSFT tasks include Text-to-SQL, Schema Linking, Noise Correction, and Continuation Writing. The proposed method aims to reduce hallucinations and enhance Text-to-SQL robustness, demonstrated by improved performance on two benchmarks: Spider and BIRD.
Strengths
- The paper introduces a multitask learning approach that leverages several text-to-SQL related tasks. Noise Correction is designed to assess whether the execution result of a SQL query correctly answers the question, reducing hallucinations when paired with multi-turn generation.
- ROUTE demonstrates competitive accuracy, outperforming some closed-source methods on benchmarks, thus showcasing the effectiveness of multitask training over single-task fine-tuning.
- The authors provide comprehensive experiments using multiple LLMs as base models, demonstrating that ROUTE is generalizable across various LLMs.
Weaknesses
- The paper lacks an ablation study on the contribution of each task in MSFT. For instance, the loss from continuation writing is likely already included in text-to-SQL learning after the first token of the SQL prediction. It is unclear how each task directly benefits SQL generation and other inference components.
- Although Noise Correction helps improve performance, it relies on execution results within the model, which may be difficult to apply to queries with large outputs, such as selecting an entire column.
- While ROUTE demonstrates strong performance on Spider variants compared to baselines, it remains unclear whether these gains are due to improved robustness or general text-to-SQL performance. It would also be valuable to understand how each component contributes to robustness specifically. Dr. Spider [1] is a more comprehensive perturbation dataset with relative robustness evaluation, which could be useful for evaluating ROUTE’s improvement more clearly. [1] https://arxiv.org/pdf/2301.08881
Questions
For Noise Correction, is it able to handle a large table as the execution result?
We thank Reviewer cCGH for the positive feedback and constructive suggestions that contribute to improving this paper. Below, we address your comments point by point.
W1: The paper lacks an ablation study on the contribution of each task in MSFT. For instance, the loss from continuation writing is likely already included in text-to-SQL learning after the first token of the SQL prediction. It is unclear how each task directly benefits SQL generation and other inference components.
Ans: Thank you for your constructive suggestions. We conducted additional experiments to explore the impact of single-task SFT and MSFT on each task. For Text-to-SQL (TS), we report zero-shot EX results on the SPIDER and BIRD development sets. For Schema Linking (SL), we report the Recall/Precision scores of the predicted relevant tables and columns. For Noise Correction (NC), we report the EX scores of SQL queries refined by Noise Correction applied to the output SQL of Llama3. For Continuation Writing (CW), we report the EX scores of SQL queries obtained by continuation writing from half of each ground-truth SQL query. The specific experimental results are shown in the following table, where '--' means that the LLM cannot produce output in the expected format due to overfitting.
Table 15 in Appendix A.6: The SFT impact of all tasks on each other.
| No. | Settings | SPIDER-TS Dev-EX | BIRD-TS Dev-EX | SPIDER-SL Table-R/P | SPIDER-SL Column-R/P | BIRD-SL Table-R/P | BIRD-SL Column-R/P | SPIDER-NC Dev-EX | BIRD-NC Dev-EX | SPIDER-CW Dev-EX | BIRD-CW Dev-EX |
|---|---|---|---|---|---|---|---|---|---|---|---|
| #1 | Llama3 with MSFT | 83.6 | 53.6 | 97.38/95.71 | 98.59/96.98 | 90.87/90.22 | 96.13/90.89 | 83.4 | 53.4 | 91.1 | 73.9 |
| #2 | SFT with TS | 83.1 | 52.9 | -- | -- | -- | -- | -- | -- | 85.6 | 69.2 |
| #3 | SFT with SL | -- | -- | 95.55/92.69 | 98.91/95.29 | 87.84/85.11 | 94.93/89.51 | -- | -- | -- | -- |
| #4 | SFT with NC | 0.1 | 8.7 | -- | -- | -- | -- | 78.9 | 49.3 | 48.6 | 38.6 |
| #5 | SFT with CW | 68.1 | 39.0 | -- | -- | -- | -- | -- | -- | 89.8 | 70.1 |
| #6 | Llama3 w/o SFT | 69.3 | 32.1 | 88.35/76.37 | 94.83/91.46 | 83.77/75.38 | 89.55/86.39 | 72.1 | 38.1 | 80.3 | 57.6 |
From the results, we can see that although single-task SFT improves the ability to handle the corresponding task, it easily causes overfitting on instructions and degrades the ability to handle other tasks (e.g., No. #2, #3, #4, #5), which is not conducive to multi-task collaboration. On the contrary, our MSFT (No. #1) boosts the ability on each task while reducing the risk of overfitting, thus improving the feasibility of multi-task collaboration.
W2: Although Noise Correction helps improve performance, it relies on execution results within the model, which may be difficult to apply to queries with large outputs, such as selecting an entire column.
Ans: Thank you for your careful review. We would like to clarify that our noise correction does not put large outputs into the prompt; it only uses the execution status of the SQL query, such as error/exception information. Consequently, our noise correction is able to handle SQL queries with large outputs. To prevent readers from being confused, we have clarified this again in lines 286~288 of the revised version. Thank you again for your careful comments.
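As a minimal sketch of this design, assuming a SQLite backend: only the executor's status (an exception message, or nothing on success) is fed back for correction, never the result rows, so a query selecting an entire column adds nothing to the prompt. The prompt wording below is an illustrative assumption.

```python
import sqlite3

def execution_status(db_path, sql):
    """Return only the execution status of a query: None on success, or
    the error/exception text on failure. Result rows are never kept."""
    try:
        with sqlite3.connect(db_path) as conn:
            conn.execute(sql).fetchone()  # surface runtime errors cheaply
        return None
    except sqlite3.Error as exc:
        return str(exc)

def correction_prompt(schema, question, sql, status):
    # The <Schema, Question, SQL, Execution Information> input described
    # above; the exact wording is hypothetical.
    info = status if status else "no errors"
    return (f"{schema}\nQuestion: {question}\nPredicted SQL: {sql}\n"
            f"Execution information: {info}\nFix the SQL if it is incorrect.")
```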
W3: While ROUTE demonstrates strong performance on Spider variants compared to baselines, it remains unclear whether these gains are due to improved robustness or general text-to-SQL performance. It would also be valuable to understand how each component contributes to robustness specifically. Dr.Spider [1] is a more comprehensive perturbation dataset with relative robustness evaluation, which could be useful for evaluating ROUTE’s improvement more clearly. https://arxiv.org/pdf/2301.08881
Ans: Thank you for your constructive suggestions. As suggested, we evaluated on Dr.Spider to gain a clearer and more comprehensive understanding of the advantages of our ROUTE. Dr.Spider includes 17 perturbation variants that comprehensively measure effectiveness and robustness. The specific experimental results are shown in the following tables, including the study on MSFT and on each component. The conclusions are consistent with those on SPIDER and BIRD, and each component brings a performance improvement, which shows that each task makes an indispensable contribution to the overall performance. This further verifies the advantages and robustness of ROUTE. We have added the corresponding experimental results in Appendix A.5 of the revised manuscript.
Table 13 in Appendix A.5: The performance on Dr.Spider benchmark.
| Methods | Avg.DB Pre~Post | Avg.NLQ Pre~Post | Avg.SQL Pre~Post | Avg.all Pre~Post |
|---|---|---|---|---|
| Llama3 | 70.1~54.3 | 70.6~56.8 | 69.1~65.5 | 69.9~58.8 |
| Llama3 + MCP | 75.6~59.1 | 77.0~61.7 | 74.8~72.6 | 75.8~64.4 |
| Llama3 + SFT | 83.4~66.0 | 83.0~72.8 | 79.9~77.6 | 82.1~72.2 |
| Llama3 + SFT + MCP | 85.4~68.0 | 85.3~75.1 | 84.3~81.7 | 85.0~74.9 |
| Llama3 + MSFT | 83.8~66.3 | 82.9~72.5 | 80.0~77.5 | 82.2~72.1 |
| Llama3 + ROUTE | 86.7~69.6 | 85.4~75.8 | 84.5~81.9 | 85.5~75.8 |
Table 14 in Appendix A.5: The ablation results (EX) on Dr.Spider.
| No. | SL | NC | CW | Avg.DB Pre~Post | Avg.NLQ Pre~Post | Avg.SQL Pre~Post | Avg.all Pre~Post |
|---|---|---|---|---|---|---|---|
| #1 | ✓ | ✓ | ✓ | 86.7~69.6 | 85.4~75.8 | 84.5~81.9 | 85.5~75.8 |
| #2 | ✓ | | | 86.3~67.9 | 84.6~75.4 | 84.4~82.1 | 85.1~75.1 |
| #3 | | ✓ | | 84.4~67.7 | 83.7~73.7 | 80.3~78.7 | 82.8~73.3 |
| #4 | | | ✓ | 84.0~66.6 | 83.0~73.0 | 80.1~78.0 | 82.4~72.6 |
| #5 | | | | 83.8~66.6 | 82.9~72.5 | 80.0~77.5 | 82.2~72.1 |
Q: For Noise Correction, is it able to handle a large table as the execution result?
Ans: Thank you for your valuable question. We would like to clarify that the input of our noise correction is <Schema, Question, SQL, Execution Information>, where the execution information refers only to the exception information from the SQL executor; if execution succeeds, there is no such information. Therefore, our noise correction is able to handle SQL involving large tables without using the contents of an entire table as a prompt. To prevent readers from being confused, we have clarified this again in lines 286~288 of the revised version. Thank you again for your valuable comments.
Reference
[1] Dr.Spider: A Diagnostic Evaluation Benchmark towards Text-to-SQL Robustness, ICLR, 2023.
Thank you to the authors for providing additional experiments. My concerns have been mostly addressed, and I have increased my scores to reflect this. If time allows, I would suggest conducting an ablation study by removing one task from the MSFT instead of using only one task in SFT for rows #2, #3, #4, and #5 in Table 15. I believe it is useful to have this experiment added in either the rebuttal or a later revision of the paper.
Dear Reviewer cCGH,
Thank you for raising the score and for your valuable suggestions.
We have supplemented the ablation experimental results of removing each task from MSFT. The results are presented in the following table.
| No. | Settings | SPIDER-TS Dev-EX | BIRD-TS Dev-EX | SPIDER-SL Table-R/P | SPIDER-SL Column-R/P | BIRD-SL Table-R/P | BIRD-SL Column-R/P | SPIDER-NC Dev-EX | BIRD-NC Dev-EX | SPIDER-CW Dev-EX | BIRD-CW Dev-EX |
|---|---|---|---|---|---|---|---|---|---|---|---|
| #1 | Llama3 with MSFT | 83.6 | 53.6 | 97.38/95.71 | 98.59/96.98 | 90.87/90.22 | 96.13/90.89 | 83.4 | 53.4 | 91.1 | 73.9 |
| #2 | MSFT w/o TS | 0.1 | 16.2 | 96.58/93.94 | 98.40/96.32 | 90.79/88.26 | 95.95/90.34 | 77.4 | 45.5 | 86.5 | 69.6 |
| #3 | MSFT w/o SL | 81.8 | 50.9 | -- | -- | -- | -- | 76.3 | 47.4 | 91.3 | 73.5 |
| #4 | MSFT w/o NC | 82.8 | 51.0 | 96.52/94.25 | 99.00/96.59 | 90.41/88.85 | 96.09/90.75 | -- | -- | 91.7 | 73.4 |
| #5 | MSFT w/o CW | 81.2 | 50.3 | 96.51/93.97 | 98.65/96.39 | 90.59/88.12 | 96.05/90.64 | 79.4 | 49.0 | 81.2 | 56.7 |
| #6 | SFT with TS | 83.1 | 52.9 | -- | -- | -- | -- | -- | -- | 85.6 | 69.2 |
| #7 | SFT with SL | -- | -- | 95.55/92.69 | 98.91/95.29 | 87.84/85.11 | 94.93/89.51 | -- | -- | -- | -- |
| #8 | SFT with NC | 0.1 | 8.7 | -- | -- | -- | -- | 78.9 | 49.3 | 48.6 | 38.6 |
| #9 | SFT with CW | 68.1 | 39.0 | -- | -- | -- | -- | -- | -- | 89.8 | 70.1 |
| #10 | Llama3 w/o SFT | 69.3 | 32.1 | 88.35/76.37 | 94.83/91.46 | 83.77/75.38 | 89.55/86.39 | 72.1 | 38.1 | 80.3 | 57.6 |
The results suggest that tasks not included in MSFT exhibit lower performance, owing to the LLM overfitting to the other tasks. This highlights the importance of applying SFT across multiple tasks to prevent performance degradation when an LLM handles additional tasks.
Furthermore, the results indicate that although full MSFT is somewhat inferior to single- or triple-task SFT on certain tasks, such as SL and CW, it demonstrates significant performance improvements across every task and exhibits better stability.
While we cannot upload the revised manuscript at this time, we will add the ablation results and the corresponding discussion in the next version following the rebuttal stage.
Thank you again for your valuable feedback and the positive score.
Best, Authors
Thank you for further addressing my questions.
This paper addresses the limitations of current Text-to-SQL approaches that rely heavily on in-context learning using closed-source LLMs, such as GPT-4, which can cause privacy issues. To overcome these issues, the authors propose ROUTE, a comprehensive solution to enhance open-source LLMs' Text2SQL capabilities. ROUTE utilizes a multitask supervised fine-tuning approach incorporating tasks like text-to-SQL, schema linking, noise correction, and continuation writing to broaden the model's SQL generation skills and reduce the risk of overfitting. Additionally, a Multitask Collaboration Prompting (MCP) strategy is employed during inference to decompose the SQL generation process into simpler sub-tasks, reducing hallucinations and improving performance.
Strengths
- The proposed method significantly improves the performance of open-source LLMs and outperforms all existing methods trained on open-source LLMs.
- The proposed MCP approach not only enhances the performance of models trained with MSFT but also improves other models.
- The novel MSFT method substantially boosts model performance compared to standard SFT.
Weaknesses
- Although this paper focuses more on open-source LLMs, some recent approaches, such as CHASE-SQL, Distillery, and CHESS, are not included as benchmarks in their experiments.
- The proposed approach is a multi-step pipeline that can be prone to error propagation. To better understand the performance of the schema linking module and ensure it is not introducing errors into the pipeline, it would be beneficial to report the precision and recall of the schema linking module, as done in CHESS and DTS-SQL.
- The performance gap with closed-source LLMs is still large, roughly 13% on the BIRD development set, which limits the applicability of this approach to scenarios where privacy and local LLMs are essential.
Questions
1) For open-source LLMs and very large databases, such as some of the databases in the BIRD benchmark, how are these large schemas fitted into the prompts of the open-source models?
Details of Ethics Concerns
NA
W3: The performance gap with closed-source LLMs is still large, roughly 13% on the BIRD development set, which limits the applicability of this approach to scenarios where privacy and local LLMs are essential.
Ans: Thank you for your valuable comments. It is undeniable that our solution still has a certain gap compared with solutions using closed-source LLMs, but solutions based on open-source models have unique advantages in privacy-sensitive and local-LLM settings: they can be customized for business datasets, and smaller models offer relatively fast inference, which better suits multi-step reasoning. In addition, our ROUTE is a general solution that outperforms existing SFT methods (e.g., DTS-SQL and SENSE). Our experiments show that it can be applied to almost all kinds of open-source LLMs, obtaining considerable performance improvements at little extra cost. Moreover, our method achieves an EX score of 87.3 on the SPIDER development set, which is close to the results of most closed-source methods. This shows that our ROUTE can serve as a substitute in some closed-source scenarios. We believe ROUTE can contribute to the Text2SQL community in the future.
Q1: For open-source LLMs and very large databases, such as some of the databases in the BIRD benchmark, how are these large schemas fitted into the prompts of the open-source models?
Ans: Thank you for your valuable review. As shown in the prompt template in Appendix A.11, we extract key information about the database to incorporate into the prompt, e.g., the column names and some example rows. In addition, it is worth noting that the context lengths of popular LLMs such as Llama3 and Qwen2.5, ranging from 8K to 32K, can effectively meet the requirements of BIRD.
When dealing with a very large database and complex prompts with extensive sequence lengths, we can utilize the 128K-context versions of open-source LLMs supported by YaRN [4]. Besides, we can also introduce special techniques to simplify the input prompts, as demonstrated by CodeS [5].
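As an illustration of the compact-schema idea mentioned above (column names plus a few example rows), a sketch for a SQLite database follows; the output format is an assumption, not the exact template of Appendix A.11.

```python
import sqlite3

def compact_schema_prompt(db_path, rows_per_table=2):
    """Build a compact schema description: each table's column names plus
    a couple of example rows, instead of dumping the whole database."""
    conn = sqlite3.connect(db_path)
    lines = []
    tables = [r[0] for r in conn.execute(
        "SELECT name FROM sqlite_master WHERE type='table'")]
    for table in tables:
        cols = [r[1] for r in conn.execute(f'PRAGMA table_info("{table}")')]
        lines.append(f"Table {table}({', '.join(cols)})")
        for row in conn.execute(f'SELECT * FROM "{table}" LIMIT {rows_per_table}'):
            lines.append(f"  example row: {row}")
    conn.close()
    return "\n".join(lines)
```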
However, it is undeniable that the current solutions still have limitations when dealing with extremely large databases. This opens up a new avenue for future research on ROUTE. Thank you again for your valuable comments.
Reference
[1] CHASE-SQL: Multi-Path Reasoning and Preference Optimized Candidate Selection in Text-to-SQL, arXiv preprint, 2024.
[2] The Death of Schema Linking? Text-to-SQL in the Age of Well-Reasoned Language Models, arXiv preprint, 2024.
[3] CHESS: Contextual Harnessing for Efficient SQL Synthesis, arXiv preprint, 2024.
[4] YaRN: Efficient Context Window Extension of Large Language Models, ICLR, 2024.
[5] CodeS: Towards Building Open-source Language Models for Text-to-SQL, ACM SIGMOD, 2024.
We thank Reviewer 2daL for the positive feedback and constructive suggestions that contribute to improving this paper. Below, we provide one-on-one responses to your comments.
W1: Although this paper focuses more on open-source LLMs, some recent approaches, such as CHASE-SQL, Distillery, and CHESS, are not included as benchmarks in their experiments.
Ans: Thank you for your valuable suggestion, we have included several recent methods in the revised version to ensure a comprehensive comparison and review of related work. Below is a brief summary, please refer to Appendix A.3 in our revised submission for the details.
Table 11 in Appendix A.3: The comparisons with recent methods on SPIDER and BIRD.
| Methods | SPIDER-Dev-EX | BIRD-Dev-EX | BIRD-Dev-VES |
|---|---|---|---|
| CHASE-SQL + Gemini 1.5 | 87.6 | 73.1 | 73.0 |
| Distillery + GPT-4o | - | 67.2 | 72.9 |
| CHESS + proprietary (GPT-4) | 87.2 | 65.0 | 65.4 |
| ROUTE + Qwen2.5-14B | 87.3 | 60.8 | 65.2 |
From the results, we observed that:
- On SPIDER, our ROUTE-14B achieves performance similar to CHESS based on Gemini or GPT-4/4o. This suggests that our ROUTE is an exceptional choice in both conventional and privatized Text2SQL scenarios.
- On BIRD, our ROUTE falls behind CHESS + proprietary (GPT-4) by about 5 points and CHASE-SQL + Gemini 1.5 by about 12 points. We believe this is due to the complexity of the BIRD databases, which contain numerous tables and columns in a single database, resulting in an extensive input context. It is widely recognized that smaller-sized LLMs (7B/14B) have relative limitations in reasoning capabilities and in managing lengthy texts.
W2: The proposed approach is a multi-step pipeline that can be prone to error propagation. To better understand the performance of the schema linking module and ensure it is not introducing errors into the pipeline, it would be beneficial to report the precision and recall of the schema linking module, as done in CHESS and DTS-SQL.
Ans: Thank you for your constructive feedback. We have reported the Recall and Precision results in Appendix A.4 of our revised submission. We also present the results below for convenience, which demonstrate several important observations:
- After MSFT, the schema linking capability is significantly improved, especially in terms of precision scores.
- A higher Recall score generally leads to improved EX performance due to minor information loss.
- While simplifying the database schema is necessary, ensuring its completeness is more crucial for achieving enhanced performance.
Table 12 in Appendix A.4: The performance of schema linking (recall and precision in %).
| | | SPIDER Dev-EX | Table Recall | Table Precision | Column Recall | Column Precision | BIRD Dev-EX | Table Recall | Table Precision | Column Recall | Column Precision |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ✓ | ✓ | 85.80 | 98.75 | 94.67 | 99.24 | 96.36 | 56.00 | 95.21 | 88.50 | 96.95 | 89.93 |
| ✓ | | 84.10 | 97.38 | 95.71 | 98.59 | 96.98 | 52.70 | 90.87 | 90.22 | 96.13 | 90.89 |
| | ✓ | 85.00 | 97.01 | 97.26 | 98.21 | 97.99 | 54.50 | 91.60 | 93.14 | 94.15 | 94.11 |
| | | 83.30 | 100.00 | 18.27 | 100.00 | 40.05 | 53.10 | 100.00 | 12.23 | 100.00 | 31.64 |
| ✓ | ✓ | 73.30 | 97.43 | 74.99 | 98.61 | 90.33 | 36.80 | 93.65 | 73.57 | 95.78 | 84.36 |
| ✓ | | 64.50 | 88.35 | 76.37 | 94.83 | 91.46 | 30.40 | 83.77 | 75.38 | 89.55 | 86.39 |
| | ✓ | 73.10 | 94.24 | 91.60 | 97.12 | 95.30 | 35.40 | 82.15 | 89.01 | 88.32 | 91.63 |
| | | 69.30 | 100.00 | 18.27 | 100.00 | 40.05 | 32.10 | 100.00 | 12.23 | 100.00 | 31.64 |
Note that the first four rows are the results of the post-MSFT LLMs, and the last four rows are those of the original LLMs.
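For clarity, the recall and precision above treat schema linking as set retrieval over the gold tables/columns. A minimal sketch of the computation (the identifier format and example values are illustrative, not taken from our data):

```python
def linking_scores(predicted: set, gold: set):
    """Recall and precision of a predicted schema subset against the gold schema."""
    hit = len(predicted & gold)
    recall = hit / len(gold) if gold else 1.0
    precision = hit / len(predicted) if predicted else 0.0
    return recall, precision

# Without schema linking, the entire schema is "predicted": recall is 100%,
# but precision collapses, matching the last row of each block in Table 12.
gold = {"singer.name", "singer.country"}
linked = {"singer.name", "singer.country", "concert.year"}
full = linked | {"stadium.capacity", "stadium.location"}
print(linking_scores(linked, gold))  # (1.0, 0.666...)
print(linking_scores(full, gold))    # (1.0, 0.4)
```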
Thank you so much for your efforts and for addressing some of my concerns.
> On SPIDER, ROUTE + Qwen2.5-14B achieves performance similar to CHESS and CHASE-SQL, which are built on GPT-4/4o or Gemini.
Although the performance is on par with CHESS and CHASE-SQL on the Spider dataset, it is important to note that both of these approaches evaluated their pipelines on Spider without any fine-tuning or prompt optimization, in order to demonstrate the generalizability of their methods. In contrast, ROUTE is a fine-tuned model, making this comparison somewhat less equitable.
> We have reported the recall and precision results
Thank you so much for sharing these insightful results. As a suggestion, I believe it would be valuable to include this in the main paper.
Dear Reviewer 2daL,
Thank you for your timely feedback and constructive suggestions.
Our ROUTE focuses on the supervised fine-tuning paradigm for SQL generation with small-sized LLMs. This inevitably introduces bias relative to recent prompting-based methods built on GPT-4/4o (e.g., CHESS and CHASE-SQL), making a direct comparison unfair. However, our baselines already include recent advanced SFT-based methods such as CODES, SENSE, and DTS-SQL, which verify the effectiveness and superiority of ROUTE. We have also verified the generalization of our approach on multiple open-source LLMs and benchmarks.
In addition, due to the space limitation, we have added pointers to the recall and precision results in the revised manuscript at lines 443~444 (Section 4.3, Study on Enhanced Schema Linking) to draw readers’ attention to these insightful results.
We thank you again for your timely feedback and valuable comments. Your suggestions have been of great help in improving the quality of our manuscript. If you have any further questions or suggestions, feel free to ask.
Best, Authors
We sincerely thank all reviewers for taking the time to carefully review our paper and provide constructive suggestions, which helped further improve our manuscript. Here, we provide a comprehensive overview of the reviewers' feedback and outline the corresponding modifications we have made.
The merits of our submission
- The significant performance improvements and superior performance on open-source LLMs. (Reviewer 2daL, Ej7x)
- The experiments are comprehensive and the effectiveness of the proposed method is verified. (Reviewer cCGH, TcaB, Ej7x)
- The paper is easy-to-follow. (Reviewer Ej7x)
Important suggestions and concerns
- A more comprehensive evaluation on the Dr.Spider benchmark.
- Accuracy evaluation on the task of schema linking.
- The impact of each SQL-related task and the relative improvement.
- The clarification on ‘hallucinations’.
- The clarification on differences with existing methods.
Our Revisions
According to the reviewers' constructive and insightful feedback, we have carefully revised our paper and provided comprehensive experiments and analyses in the appendix.
- In the main text, we double-checked and modified some descriptions based on the suggestions, with changes highlighted in blue.
- In Appendix A.3, we provide more comparison results with recent works.
- In Appendix A.4, we evaluate the performance of our schema linking module.
- In Appendix A.5, we provide the ablation results on Dr.Spider benchmark to further verify the effectiveness of our method.
- In Appendix A.6, we explore the impact of single-task SFT to further understand the interplay of multiple tasks.
- In Appendix A.7, we further clarified some details in the step of noisy correspondence filtering.
- In Appendix A.8, we further discuss the innovations of our ROUTE and how it differs from recent related methods.
Finally, we sincerely thank all reviewers and ACs for their efforts.
Strengths:
- The proposed method “significantly improves the performance of open-source LLMs and outperforms all existing methods trained on open-source LLMs” (2daL, also noted by cCGH), and “achieves comparable performance to GPT-4” (Ej7x).
- Results are strong across multiple well-known benchmarks, including Spider, Bird, perturbed variants of Spider, as well as Dr.Spider, “indicating the robustness of ROUTE” (Ej7x), while “demonstrating its effectiveness in real-world applications” (TcaB).
- Evaluation is comprehensive, “using multiple LLMs as base models” (cCGH) and “an extensive list of baselines” (Ej7x). The proposed method works in both the fine-tuning and prompting regimes (2daL), and makes “LLM(s) more versatile and capable of handling complex SQL generation scenarios” (TcaB).
- “The paper is easy-to-follow, and the writing is mostly clear” (Ej7x), with some clarity issues regarding differences from existing methods and the definitions of certain terms (see weaknesses).
Weaknesses:
The following weaknesses were addressed during the rebuttal phase:
- Comparison with more recent prompting-based approaches (2daL; e.g., CHASE-SQL, Distillery, and CHESS).
- More careful ablation of the different components of the pipeline to better understand their individual contributions to final performance (2daL, cCGH, Ej7x), along with quantitative evaluation of individual components (e.g., schema linking; 2daL, TcaB).
- As noted by TcaB, there are a few places in the submission where the technical presentation is unclear, such as how the proposed method addresses overfitting and hallucinations in generated SQL queries.
The single remaining issue yet to be addressed after the rebuttal phase is the novelty of the method: as noted by TcaB and Ej7x, the proposed method is a combination of many existing approaches in text-to-SQL. Therefore, “it is not very clear what kind of novel contribution this paper is making” (Ej7x). “Although the authors propose many interesting points, none of them is particularly new in this area” (TcaB).
While this submission may not present a truly novel idea, I believe this submission makes valuable contributions through strong empirical results. The fact that the paper focuses on improving the performance of open-source models via both prompting and fine-tuning is also a plus for industry practitioners to deploy similar approaches in practical scenarios. Therefore, given the rating and the strong empirical results, the recommendation is Accept.
Further suggestions on revision:
Besides the existing feedback from reviewers, please focus more carefully on explaining the relations to the many existing works flagged by TcaB and Ej7x. I also agree with Ej7x that some components of the framework are “rebranded under new terms”. Please avoid inventing new terms, since this would cause confusion; instead, use well-known technical terms to refer to the relevant parts of your model. For example, please consider renaming “noise correction” to “self-debugging” or “program repair”.
Additional Comments from the Reviewer Discussion
Please see the above meta review.
Accept (Poster)