PaperHub
7.3/10 · Poster · 4 reviewers
Ratings: 5, 4, 4, 5 (min 4, max 5, std 0.5)
Confidence: 3.8
Novelty: 2.0 · Quality: 3.3 · Clarity: 3.3 · Significance: 2.3
NeurIPS 2025

Cypher-RI: Reinforcement Learning for Integrating Schema Selection into Cypher Generation

OpenReview · PDF
Submitted: 2025-05-10 · Updated: 2025-10-29

Abstract

Keywords
Text-to-Cypher, Reinforcement Learning, Graph Databases, Large Language Models

Reviews and Discussion

Review (Rating: 5)

This paper proposes Cypher-RI, a framework that improves schema selection in the Cypher generation pipeline. Cypher-RI treats schema selection as an integral element of the reasoning process. More specifically, without supervision on intermediate steps, the authors train a 7B-parameter LLM (Qwen2.5-Coder-7B) via GRPO to both select relevant schema elements and generate executable Cypher queries. Experimental results on CypherBench and Neo4j-Text2Cypher demonstrate the effectiveness of the framework, including a 9.41% accuracy improvement over GPT-4o on CypherBench.

Strengths and Weaknesses

Strengths

  • First framework to leverage RL to optimize the reasoning process for Text-to-Cypher. The method does not rely on annotations of intermediate reasoning steps.
  • The proposed method consistently outperforms the baselines and even outperforms GPT-4o.

Weaknesses

  • I believe that schema selection/linking is an important component for many query languages besides Cypher, such as SQL. The paper's limited impact would be strengthened if more diverse query languages were evaluated.
  • I observe that the performance on Neo4J-Text2Cypher is quite low (30.61%). This may be a generalization problem, since the model is trained on CypherBench. It is important to have an ablation study on Neo4J-Text2Cypher to see how SFT and RL (w/o schema selection) perform; however, this critical ablation study is missing.

Questions

  • How is the reward designed? Is there any ablation on those values?
  • What sample sizes are used for SFT, RL (w/o schema selection), and Cypher-RI in the ablation study? Are the sample sizes significantly different across these methods?

Limitations

Yes

Final Justification

  1. The experiments are expanded to Text-to-SQL, demonstrating broader impact than the previous version.
  2. The sample size question is answered.

Formatting Concerns

N/A

Author Response

We sincerely thank the reviewer for their thorough evaluation and constructive feedback. The suggestions have been invaluable in helping us strengthen our work. In the following sections, we address each of the reviewer's points in detail, providing clarifications and new experimental results to support our claims and demonstrate the effectiveness of our method.

  • Q1: Reward Function Design and Ablation.

    Our reward function is central to the reinforcement learning process, providing essential supervision signals where explicit intermediate annotations are unavailable. The function is designed to guide the model through the key stages of the Text-to-Cypher task by decomposing the total reward into three distinct components.

    The total reward is the sum of three parts: $r_{format}$, $r_{selection}$, and $r_{execution}$.

    Format Reward: This reward encourages the model to generate its reasoning (thought process) before providing the final Cypher query. This "think first, answer later" structure improves the model's logical flow and interpretability.

    $$r_{format} = \begin{cases} a, & \text{if the format is correct} \\ -a, & \text{if the format is incorrect} \end{cases}$$

    Schema Selection Reward: This reward is crucial for the core task of identifying the question-related schema. It guides the model to select the right nodes, relationships, and properties needed to construct the query.

    $$r_{selection} = \begin{cases} 2.0, & \text{if the selected schema matches the gold answer} \\ b, & \text{if the selected schema does not match the gold answer} \\ -2.0, & \text{if the selected schema is unparseable} \end{cases}$$

    Execution Reward: This is the ultimate measure of success. The reward is based on whether the final generated Cypher query executes and produces the same result as the ground truth query.

    $$r_{execution} = \begin{cases} 2.0, & \text{if the generated Cypher executes consistently with the gold answer} \\ b, & \text{if the generated Cypher executes inconsistently with the gold answer} \\ -2.0, & \text{if the generated Cypher fails to execute} \end{cases}$$
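    For concreteness, the combination of these three terms can be sketched as follows. This is a minimal illustration rather than our actual implementation: the boolean inputs stand in for the format check, the gold-schema comparison, and the execution-result comparison described above, and the example values of $a$ and $b$ anticipate the ablations below.

    ```python
    # Minimal illustration of the composite reward; the boolean inputs stand in
    # for the rule-based checks described above (rollout format check, comparison
    # of the selected schema against the gold schema, and comparison of execution
    # results between the generated and gold Cypher queries).

    def total_reward(format_ok: bool,
                     schema_parseable: bool, schema_correct: bool,
                     query_executes: bool, result_correct: bool,
                     a: float, b: float) -> float:
        r_format = a if format_ok else -a

        if not schema_parseable:
            r_selection = -2.0
        else:
            r_selection = 2.0 if schema_correct else b

        if not query_executes:
            r_execution = -2.0
        else:
            r_execution = 2.0 if result_correct else b

        return r_format + r_selection + r_execution


    # Correct format and schema selection, but the query returns a wrong result:
    # 1.0 + 2.0 + (-1.5) = 1.5 (with the values of a and b chosen below).
    print(total_reward(True, True, True, True, False, a=1.0, b=-1.5))
    ```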

    To determine the optimal values for the hyperparameters $a$ and $b$, we conducted the following ablation studies.

    1. Determining the Value of $a$

    Our hypothesis was that the format reward ($a$) should be substantial enough to enforce the desired output structure but smaller than the primary task rewards for selection and execution (which have a magnitude of 2.0). A value that is too high might cause the model to prioritize formatting over correctness. We tested values of $a \in \{0.5, 1.0, 2.0\}$.

    Methods                  EX (%)    PSJS (%)    Exec. (%)
    Cypher_RI (a=0.5)        69.04     74.62       99.04
    Cypher_RI (a=2.0)        68.36     74.35       98.74
    Cypher_RI (a=1.0)        69.59     75.21       99.28

    The results confirm our hypothesis. Setting $a=1.0$ achieves the best performance across all metrics, striking the right balance between enforcing structure and prioritizing query correctness.

    2. Determining the Value of $b$

    Next, we investigated the appropriate penalty ($b$) for outputs that are syntactically valid but semantically incorrect (i.e., a parseable schema that is wrong, or an executable query that returns the wrong result). We theorized that a moderate penalty would be more effective than a penalty that is too lenient or too harsh. A moderate penalty discourages incorrect solutions while still recognizing that a syntactically valid attempt is better than a complete failure. We tested values of $b \in \{-1.0, -1.5, -2.0\}$.

    Methods                  EX (%)    PSJS (%)    Exec. (%)
    Cypher_RI (b=-1.0)       68.02     74.10       98.43
    Cypher_RI (b=-2.0)       65.72     72.83       98.05
    Cypher_RI (b=-1.5)       69.59     75.21       99.28

    The experiment demonstrates that setting $b=-1.5$ yields the best performance. A lenient penalty of $-1.0$ is insufficient, while a harsh penalty of $-2.0$ (the same as for a complete failure) is detrimental. This result supports our strategy of providing a nuanced, moderate penalty for partially correct attempts.

  • Q2: Discussion of Training Sample Sizes.

    To ensure a fair comparison in our main experiments, the sample size is identical for all methods: SFT, RL and our Cypher-RI. We used a dataset of 8,295 samples.

    This dataset was curated through a careful process to enhance training efficiency for RL: for each question in the source dataset, we used Qwen-2.5-Coder-7B to generate 8 candidate Cypher queries with schema selection. We then filtered out those prompts where the 8 attempts were either all correct (mean accuracy = 1.0) or all incorrect (mean accuracy = 0.0).

    Inspired by the reviewer's question, we further investigated how performance scales with data size. We trained all three models on subsets of 2k, 4k, and our full ~8k dataset.

    Methods                          EX (%)    PSJS (%)    Exec. (%)
    SFT-2k                           36.43     39.27       87.22
    SFT-4k                           40.84     44.53       91.78
    SFT-8k                           44.34     49.80       94.34
    RL (w/o schema selection)-2k     49.19     53.92       89.82
    RL (w/o schema selection)-4k     55.20     60.03       93.38
    RL (w/o schema selection)-8k     58.35     64.55       96.12
    Cypher-RI-2k                     58.73     65.21       93.78
    Cypher-RI-4k                     65.20     70.29       97.74
    Cypher-RI-8k                     69.59     75.21       99.28

    The results show that all methods improve as the number of training samples increases, though the gains are not substantial. We think this phenomenon might be related to the limited diversity and complexity of the training data, and improving these aspects could be a valuable direction for future work.

  • W1: Integrating schema selection into SQL Generation.

    To prove that our core idea of explicitly integrating schema selection within an RL framework is an important component for many query languages, we applied it to the Text-to-SQL domain. We adapted the existing SQL-R1 [1] framework by incorporating a schema selection step and a corresponding reward component ($S_c$), while keeping all other experimental conditions identical.

    The reward function in SQL-R1 is:

    $$S = S_f + S_e + S_r + S_l$$

    where the components represent rewards for format ($S_f$), execution ($S_e$), result correctness ($S_r$), and query length ($S_l$).

    The modified reward function is:

    $$S = S_f + S_e + S_r + S_l + S_c$$

    where the new schema selection reward $S_c$ is:

    $$S_c = \begin{cases} 3, & \text{if the schema selection result is correct} \\ 0, & \text{if the JSON format of the selected schema is incorrect} \\ -3, & \text{if the schema selection result is incorrect} \end{cases}$$

    We evaluated this enhanced framework on the Spider and BIRD benchmarks.

    Methods      Base Model             Spider (Dev)    Spider (Test)    BIRD (Dev)
    OmniSQL      Qwen2.5-Coder-7B       85.5            88.9             66.1
    DAIL-SQL     GPT-4                  83.6            86.2             54.8
    MAC-SQL      GPT-4                  86.8            82.8             59.4
    SQL-R1       Qwen2.5-Coder-7B       84.5            86.1             63.1
    Ours         Qwen2.5-Coder-7B       87.1            86.8             65.4

    Our approach yields consistent improvements, with a 2.6-point gain on Spider (Dev) and a 2.3-point gain on BIRD (Dev). These results confirm that explicitly rewarding schema selection within an RL loop is a robust and portable methodology that enhances performance in other major Text-to-Query systems.

    In summary, we introduce the first RL-based framework for the important Text-to-Cypher task, achieving SOTA performance with a practical, efficient model. Furthermore, our new experiments demonstrate that the core principle of our work is not a narrow, domain-specific trick, but a generalizable methodology that improves performance on other query languages and database types.

  • W2: Ablation study on Neo4J-Text2Cypher dataset.

    For this new experiment, we trained SFT, RL(w/o schema selection), and Cypher-RI on a 4k-sample subset of the Neo4J-Text2Cypher training set.

    Methods                      EX (%)    Google-BLEU (%)
    SFT                          26.32     49.58
    RL (w/o schema selection)    30.81     63.29
    Cypher-RI                    38.58     65.73

    The results clearly confirm the robustness of our method. Cypher-RI achieves an EX score of 38.58%, which is a significant improvement of ~7.8 points over RL(w/o schema selection) and ~12.3 points over SFT. This demonstrates that the framework of Cypher-RI provides a generalization advantage.

References:

[1] SQL-R1: Training Natural Language to SQL Reasoning Model By Reinforcement Learning

Comment

Dear Reviewer,

Thank you again for your thoughtful feedback on our submission. As the discussion period is concluding soon, we were hoping to confirm if our response successfully addressed your concerns. Any further feedback you might have would be greatly appreciated. We are ready to provide any additional clarifications needed.

Review (Rating: 4)

This paper introduces Cypher-RI, a reinforcement learning framework that integrates schema selection into the Text-to-Cypher generation process. The method trains a 7B-parameter model from scratch using a custom rollout format and GRPO optimization, enabling explicit schema selection steps without annotated reasoning traces. Experiments on CypherBench and Neo4j-Text2Cypher show strong improvements over multiple baselines, including a 9.4% Execution Accuracy gain over GPT-4o. The paper is well-written, technically sound, and offers a practical contribution for robust natural language to graph query translation.

Strengths and Weaknesses

Strengths:

  1. Novel integration of schema selection as an explicit reasoning step within RL-based text-to-Cypher generation.
  2. Strong empirical gains over strong baselines (e.g., GPT-4o) on multiple benchmarks.
  3. Clear training design with open-sourced code and detailed rollout/reward structure for reproducibility.

Weaknesses:

  1. Reward design is fairly simple and rule-based, which may limit flexibility or generalization.
  2. Relies on existing RL techniques without introducing major new algorithmic components.
  3. Scope limited to Cypher, with no evaluation on other graph query languages like Gremlin or PGQL.

Questions

  1. Why did you choose Gemini-1.5-Pro instead of Gemini-2.5 series?
  2. Can you report schema selection accuracy separately?
  3. How fast is inference compared to GPT-4o?

Limitations

See above

Final Justification

The author has partially addressed my concerns, especially regarding the improvements in the experiments. I will raise my score.

Formatting Concerns

No formatting concerns.

Author Response

We sincerely thank the reviewer for their insightful feedback and positive evaluation of our work. Below, we address the questions and weaknesses raised.

  • Q1: Experiments on Reasoning Models

    Our primary experiments were completed prior to the official release of the Gemini-2.5 series. We have now benchmarked our model against a suite of more powerful reasoning models on CypherBench.

    Methods           EX (%)    PSJS (%)    Exec. (%)
    Qwen3-8B          48.72     66.49       81.26
    Qwen3-32B         58.56     81.23       80.75
    DeepSeek-R1       60.01     73.52       96.68
    o3-mini           62.39     74.73       98.00
    Gemini 2.5-pro    74.14     81.70       99.45
    Cypher-RI         69.59     75.21       99.28

    Our 7B Cypher-RI model significantly outperforms much larger models, including the Qwen3-32B and DeepSeek-R1.

  • Q2: Accuracy of Schema Selection.

    The accuracy of the initial schema selection step is a critical component of our model's success. The following table reports this accuracy on both benchmarks.

    Methods             CypherBench (%)    Neo4j-Text2Cypher (%)
    Qwen2.5-Coder-7B    76.45              73.98
    Cypher-RI           95.87              93.87

    Our Cypher-RI framework boosts schema selection accuracy by a remarkable ~20 percentage points over the base model. This confirms that the model effectively learns to identify the correct schema elements.
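    For reference, a schema-selection match of this kind can be checked with a simple set comparison between the predicted and gold schema elements. The sketch below is illustrative only; the field names of the selection JSON are placeholders rather than the exact format used in our rollout template.

    ```python
    import json

    def schema_selection_correct(pred_json: str, gold: dict) -> bool:
        """True if the predicted schema selection matches the gold selection.

        Assumes the selection is a JSON object listing node labels, relationship
        types, and properties; the field names here are placeholders.
        """
        try:
            pred = json.loads(pred_json)
        except json.JSONDecodeError:
            return False
        keys = ("nodes", "relationships", "properties")
        return all(set(pred.get(k, [])) == set(gold.get(k, [])) for k in keys)


    # Accuracy over a benchmark is the fraction of questions passing the check.
    preds = ['{"nodes": ["Movie"], "relationships": ["ACTED_IN"], "properties": ["title"]}']
    golds = [{"nodes": ["Movie"], "relationships": ["ACTED_IN"], "properties": ["title"]}]
    accuracy = sum(schema_selection_correct(p, g) for p, g in zip(preds, golds)) / len(golds)
    print(f"schema selection accuracy: {accuracy:.2%}")
    ```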

  • Q3: Comparison of inference time.

    To assess practical usability, we benchmarked the inference latency of Cypher-RI against the GPT-4o API. Cypher-RI was deployed on a single NVIDIA A800 GPU using the vLLM inference server. The results below show the average time to process one question.

    Methods      Inference Time (s)
    GPT-4o       2.79
    Cypher-RI    0.61

    Our locally-deployed 7B model is approximately 4.5 times faster than the GPT-4o API, offering state-of-the-art accuracy with significantly lower latency and computational cost, making it a highly practical solution.
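    As a rough illustration of the measurement setup, per-question latency for a locally served model can be timed along the following lines with vLLM's offline API. The model path, prompts, and decoding settings below are placeholders, not our exact deployment configuration.

    ```python
    import time
    from vllm import LLM, SamplingParams

    # Placeholder model path and decoding settings (not the exact deployment config).
    llm = LLM(model="Qwen/Qwen2.5-Coder-7B-Instruct", tensor_parallel_size=1)
    params = SamplingParams(temperature=0.0, max_tokens=1024)

    # Placeholder benchmark questions; in practice these are the full test-set prompts.
    prompts = ["Translate to Cypher: How many movies did Tom Hanks act in?"]

    latencies = []
    for prompt in prompts:
        start = time.perf_counter()
        llm.generate([prompt], params)   # one question at a time, matching the latency comparison
        latencies.append(time.perf_counter() - start)

    print(f"average latency per question: {sum(latencies) / len(latencies):.2f} s")
    ```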

  • W1: Discussions about Reward design

    Our choice of a simple, rule-based reward function was a deliberate one, aimed at ensuring stable learning and preventing reward hacking. In complex generation tasks, learned or overly intricate reward functions can sometimes be exploited by the model to maximize the reward signal without actually improving performance on the intended task. A clear, rule-based function directly incentivizes the desired behaviors—correct format, valid schema selection, and accurate query execution—which is crucial for effective and targeted learning in an RL framework.

    Furthermore, our framework is not limited by this initial design; it is inherently flexible and can be generalized to incorporate other objectives. We have conducted a new experiment introducing a length-based penalty into our reward function to explore the trade-off between performance and inference cost. The modified reward is:

    $$r = r_{format} + (r_{selection} + r_{execution}) \cdot r_{length}$$

    $$r_{length} = \begin{cases} 1, & r_{selection} + r_{execution} \le 0 \\ \text{clip}(\alpha (n_{max} - n_y), 0, 1), & r_{selection} + r_{execution} > 0 \end{cases}$$

    Here, $n_y$ is the number of tokens generated, and $n_{max}$ is a hyperparameter setting the desired token limit. This allows us to explicitly guide the model to produce reasoning chains of varying lengths.
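    A minimal sketch of this length-shaped reward, with the clip written out explicitly (the values of $\alpha$ and $n_{max}$ in the example are illustrative):

    ```python
    def length_shaped_reward(r_format: float, r_selection: float, r_execution: float,
                             n_y: int, n_max: int, alpha: float = 0.002) -> float:
        """Composite reward with a soft length penalty on successful rollouts.

        The task reward (selection + execution) is scaled by a factor in [0, 1]
        that shrinks as the generated length n_y approaches and exceeds n_max;
        failed rollouts (task reward <= 0) are left unscaled. alpha is illustrative.
        """
        task = r_selection + r_execution
        if task <= 0:
            r_length = 1.0
        else:
            r_length = min(max(alpha * (n_max - n_y), 0.0), 1.0)   # clip(alpha*(n_max - n_y), 0, 1)
        return r_format + task * r_length


    # n_max = 1024, n_y = 900: scale factor = clip(0.002 * 124, 0, 1) = 0.248.
    print(length_shaped_reward(1.0, 2.0, 2.0, n_y=900, n_max=1024))   # 1.0 + 4.0 * 0.248 = 1.992
    ```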

    The results on CypherBench for models trained with different $n_{max}$ values are below:

    Methods                      Inference Tokens    EX (%)    PSJS (%)    Exec. (%)
    Qwen2.5-Coder-7B             53                  13.12     22.65       62.95
    Cypher_RI (n_max=256)        223                 45.78     49.65       91.93
    Cypher_RI (n_max=512)        436                 60.78     64.29       96.93
    Cypher_RI (n_max=1024)       901                 66.99     71.73       98.93
    Cypher_RI (n_max=1280)       1045                67.46     74.39       99.11
    Cypher_RI (w/o n_max)        1482                69.59     75.21       99.28

    By incorporating the length-related reward term $r_{length}$ and adjusting the parameter $n_{max}$, we can effectively control the model's output length. This demonstrates that, although simple, our reward design is not rigid; rather, it is modular and extensible, enabling the integration of additional objectives such as computational efficiency or domain-specific constraints.

  • W2: Relies on Existing RL Technique

    We agree with the reviewer that our work employs an established RL algorithm (GRPO) as its optimization engine. We believe this is a strength, as it grounds our method in a proven, stable technique.
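    For reference, the group-relative advantage at the heart of GRPO can be illustrated as follows; this is a generic sketch of the algorithm's normalization step, not our training code.

    ```python
    from statistics import mean, stdev

    def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
        """GRPO-style advantages for a group of rollouts sampled from one prompt.

        Each rollout's reward is normalized against the group mean and standard
        deviation, so the learning signal comes from within-group comparisons
        rather than a separately learned value function.
        """
        mu = mean(rewards)
        sigma = stdev(rewards) if len(rewards) > 1 else 0.0
        return [(r - mu) / (sigma + eps) for r in rewards]


    # Four rollouts for one prompt, scored by the composite rule-based reward.
    print(group_relative_advantages([5.0, 1.5, -2.0, 5.0]))
    ```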

    Our core contribution is not a new RL algorithm, but rather a novel framework for the Text-to-Cypher task. The originality lies in formulating schema selection as an explicit, learnable step within the model's reasoning process. We designed a custom training template, rollout mechanism, and reward function specifically to enable the model to learn this complex, multi-stage task without direct supervision on reasoning traces.

    The effectiveness of our approach is demonstrated by strong empirical performance, including outperforming significantly larger models. This validates the originality and impact of our formulation. Additionally, to highlight the generality of our methodology beyond graph databases, we successfully apply our framework to the Text-to-SQL task, as elaborated in our response to W3 below.

  • W3: Expand scope of our method

    To demonstrate the broader applicability of our work, we have conducted two new sets of experiments.

    1. Generalization to a New Graph Query Language (Gremlin)

    We evaluated our Cypher-trained model on a public Gremlin dataset [1] to test its zero-shot, cross-language generalization. The model was not fine-tuned on Gremlin.

    Methods                       EX (%)
    Qwen2.5-Coder-7B              60.73
    Text2Cypher-Gemma-3-27B       70.09
    Qwen3-32B                     73.41
    GPT-4o                        80.06
    Gemini 2.5-pro                88.82
    Cypher-RI                     77.79

    Remarkably, our model, despite being trained only on Cypher data, improves performance on Gremlin by 17% over its base model and 7.7% over a 27B model fine-tuned for Text-to-Cypher. This indicates that our training method helps the model learn underlying graph logic that partially transfers to other query languages.

    2. Expanding Schema Selection to SQL Generation

    To validate the generalizability of our proposed methodology, we extended its application from graph databases to the Text-to-SQL domain. This involved integrating our core concept, an explicit schema selection stage within an RL framework, into an existing Text-to-SQL pipeline.

    Our experiment builds upon the SQL-R1 [2] framework. In its original form, SQL-R1 employs an RL agent optimized with the following composite reward function:

    $$S = S_f + S_e + S_r + S_l$$

    $$S_r = \begin{cases} 3, & \text{if the query result is correct} \\ 0, & \text{if the format is incorrect or the SQL candidate is not executable} \\ -3, & \text{if the query result is incorrect} \end{cases}$$

    where the components represent rewards for format ($S_f$), execution ($S_e$), result correctness ($S_r$), and query length ($S_l$).

    To integrate schema selection into the training process, we modified the prompt to require the model to perform schema selection before generating the SQL query, similar to our Cypher-RI approach. Consequently, we introduced a new reward function:

    $$S = S_f + S_e + S_r + S_l + S_c$$

    $$S_c = \begin{cases} 3, & \text{if the schema selection result is correct} \\ 0, & \text{if the JSON format of the selected schema is incorrect} \\ -3, & \text{if the schema selection result is incorrect} \end{cases}$$

    where $S_c$ is the new schema selection reward.
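    A minimal sketch of the modified composite reward is given below; the component scores are assumed to come from SQL-R1's existing format, execution, result, and length checks, and only the schema-selection term $S_c$ is new.

    ```python
    def composite_reward(s_f: float, s_e: float, s_r: float, s_l: float,
                         schema_json_valid: bool, schema_correct: bool) -> float:
        """SQL-R1 reward (format + execution + result + length) plus the added
        schema-selection term S_c: +3 if correct, 0 if the selection JSON is
        invalid, -3 if the selection is wrong."""
        if not schema_json_valid:
            s_c = 0.0
        else:
            s_c = 3.0 if schema_correct else -3.0
        return s_f + s_e + s_r + s_l + s_c


    # Example with illustrative component scores: all SQL-R1 terms positive and
    # the schema selected correctly.
    print(composite_reward(1.0, 1.0, 3.0, 0.5, schema_json_valid=True, schema_correct=True))
    ```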

    For a controlled and fair comparison, we omitted the SFT step to isolate the performance gains attributable to our RL-based approach and strictly adhered to the experimental conditions of SQL-R1.

    We evaluated our enhanced framework on the Spider and BIRD benchmarks. The evaluation results are detailed in the table below.

    Methods      Base Model             Spider (Dev)    Spider (Test)    BIRD (Dev)
    OmniSQL      Qwen2.5-Coder-7B       85.5            88.9             66.1
    DAIL-SQL     GPT-4                  83.6            86.2             54.8
    MAC-SQL      GPT-4                  86.8            82.8             59.4
    SQL-R1       Qwen2.5-Coder-7B       84.5            86.1             63.1
    Ours         Qwen2.5-Coder-7B       87.1            86.8             65.4

    Our approach improves performance across the board, achieving a 2.6-point gain on Spider (Dev) and a 2.3-point gain on BIRD (Dev).

    These experiments confirm that our central idea—explicitly integrating schema selection within an RL framework—is a portable principle that enhances performance beyond just Cypher. It is a robust methodology applicable to a wider range of Text-to-Query systems.

References:

[1] numb3r33/know_gremlin dataset on Hugging Face

[2] SQL-R1: Training Natural Language to SQL Reasoning Model By Reinforcement Learning

Comment

Dear Reviewer,

We hope this message finds you well. We wanted to follow up and kindly ask whether our response has addressed your concerns. Your feedback means a great deal to us, and we truly appreciate your time and insights.

Comment

Dear Reviewer,

Thank you again for your thoughtful feedback on our submission.

As the discussion period is concluding soon, and the updated ratings are not visible this year, we were hoping to confirm if our response successfully addressed your concerns.

Any further feedback you might have would be greatly appreciated. We are ready to provide any additional clarifications needed.

Review (Rating: 4)

The paper targets the Text-to-Cypher task over property-graph databases. It proposes Cypher-RI, a framework that (i) interleaves schema selection steps with reasoning tokens inside the generation trace, and (ii) trains a 7B-parameter code LLM (Qwen-2.5-Coder) end-to-end with Group-Relative Policy Optimisation (GRPO) plus a simple rule-based reward composed of format, schema-selection, and execution components. Empirical results show superior performance compared to GPT-4o on CypherBench and comparable performance on Neo4j-Text2Cypher.

Strengths and Weaknesses

Pros: Integrating schema selection as learned intermediate actions, rather than as separate pre- or post-processing, is a neat idea. Beating GPT-4o with a 7B open model shows the training recipe is competitive. The paper provides data splits and implementation hyperparameters, and claims to open-source the code. The roll-out template (Table 1) and Figure 1 make the pipeline easy to follow.

Cons: The work seems to lack technical contribution. The GRPO algorithm and the RL paradigm are already popular, and the paper does not contribute much to the algorithm itself. Though the results are good, the paper reads more or less like a direct application of GRPO to a domain-specific problem.

Questions

N/A

Limitations

N/A

Final Justification

I will increase my score because of the additional experiments on other datasets to show the effectiveness.

Formatting Concerns

N/A

Author Response

We sincerely thank the reviewer for their valuable feedback and for acknowledging the strengths of our work. We wish to address the main concern regarding the technical contribution and originality, and provide new experimental results that further underscore the significance and generality of our method.

The reviewer suggests our work is "more or less like a direct application of GRPO to the domain specific problem." We respectfully argue that this perspective undervalues two key aspects of our contribution: (1) the significance of the problem itself and our novelty in addressing it, and (2) the generalizability of our proposed framework, which we demonstrate with new experiments.

1. The Importance and Novelty of the Text-to-Cypher Challenge

While the RL algorithm (GRPO) is not our invention, our primary contribution lies in designing a novel framework, Cypher-RI, that is the first to successfully leverage reinforcement learning to holistically solve the Text-to-Cypher task by unifying reasoning and schema selection.

Graph databases are increasingly central to modern data infrastructure, yet their potential is capped by the complexity of query languages like Cypher. Automating the translation from natural language to Cypher is a critical bottleneck that, if solved effectively, unlocks vast capabilities for a broader, non-technical audience. The value and difficulty of this domain should not be understated.

Prior to our work, the prevailing methods relied on supervised fine-tuning or few-shot prompting. These approaches often struggle with complex queries and lack robustness.

Cypher-RI is the first framework to demonstrate that reinforcement learning can effectively benefit Cypher generation without requiring supervised reasoning traces. Moreover, it integrates the crucial component of schema selection as an explicit reasoning step within the RL process. This is a significant methodological shift for the Text-to-Cypher domain. The "application" itself, i.e., designing the specific reasoning trace, reward function, and training pipeline that makes this possible, is the core technical innovation.

2. New Experiments to Demonstrate Practicality and Broader Applicability

Practical Usability and Inference Efficiency

We benchmarked Cypher-RI's inference latency against the GPT-4o API.

Methods      Inference Time (s)
GPT-4o       2.79
Cypher-RI    0.61

Our locally-deployed 7B model is approximately 4.5 times faster than the GPT-4o API. This result highlights that Cypher-RI offers a state-of-the-art solution that is highly practical for real-world deployment due to its lower latency and computational cost.

Zero-Shot Generalization to a New Graph Query Language

We tested our Cypher-trained model on a public Gremlin dataset to assess its ability to generalize to an unseen graph query language without any fine-tuning.

Methods                       EX (%)
Qwen2.5-Coder-7B              60.73
Text2Cypher-Gemma-3-27B       70.09
Qwen3-32B                     73.41
GPT-4o                        80.06
Gemini 2.5-pro                88.82
Cypher-RI                     77.79

Remarkably, Cypher-RI improves performance on Gremlin by 17% over its base model and 7.7% over a 27B model fine-tuned for Text-to-Cypher. This indicates that our training method helps the model learn underlying graph logic that partially transfers across languages, proving our approach teaches more than just syntactic memorization.

Expanding the Methodology to Text-to-SQL Generation

To prove that our core idea of explicitly integrating schema selection within an RL framework is a portable principle, we applied it to the Text-to-SQL domain. We adapted the existing SQL-R1 framework by incorporating a schema selection step and a corresponding reward component ($S_c$), while keeping all other experimental conditions identical.

The reward function in SQL-R1 is:

$$S = S_f + S_e + S_r + S_l$$

where the components represent rewards for format ($S_f$), execution ($S_e$), result correctness ($S_r$), and query length ($S_l$).

The modified reward function is:

$$S = S_f + S_e + S_r + S_l + S_c$$

where the new schema selection reward $S_c$ is:

$$S_c = \begin{cases} 3, & \text{if the schema selection result is correct} \\ 0, & \text{if the JSON format of the selected schema is incorrect} \\ -3, & \text{if the schema selection result is incorrect} \end{cases}$$

We evaluated this enhanced framework on the Spider and BIRD benchmarks.

Methods      Base Model             Spider (Dev)    Spider (Test)    BIRD (Dev)
OmniSQL      Qwen2.5-Coder-7B       85.5            88.9             66.1
DAIL-SQL     GPT-4                  83.6            86.2             54.8
MAC-SQL      GPT-4                  86.8            82.8             59.4
SQL-R1       Qwen2.5-Coder-7B       84.5            86.1             63.1
Ours         Qwen2.5-Coder-7B       87.1            86.8             65.4

Our approach yields consistent improvements, with a 2.6-point gain on Spider (Dev) and a 2.3-point gain on BIRD (Dev). These results confirm that explicitly rewarding schema selection within an RL loop is a robust and portable methodology that enhances performance in other major Text-to-Query systems.

In summary, we introduce the first RL-based framework for the important Text-to-Cypher task, achieving SOTA performance with a practical, efficient model. Furthermore, our new experiments demonstrate that the core principle of our work is not a narrow, domain-specific trick, but a generalizable methodology that improves performance on other query languages and database types.

Comment

Dear Reviewer,

Thank you again for your thoughtful feedback. We noticed that, following our response, the final rating is no longer visible due to this year's policy.

If possible, we would appreciate it if you could kindly let us know whether our rebuttal has sufficiently addressed your concerns. Your feedback is extremely valuable to us, and we’d be happy to clarify further if needed.

Comment

After adding the additional experiment, I would like to increase my score a bit.

Comment

Thank you very much for raising your score, which is highly valuable to us. If you have any further concerns or suggestions, please do not hesitate to share them with us.

Review (Rating: 5)

The paper develops a reinforcement learning method for improving the ability of an LLM to translate natural language questions into Cypher, a programming language akin to SQL but for graph databases. The method involves prompting the model with a minimized schema for the graph database along with the question, after which the model should generate reasoning tokens in <think> tags and then output a JSON representation of the parts of the schema that it's interested in. The prompt is extended with the schema details, after which the model should generate more thinking and then output the final query.

The authors use reinforcement learning with GRPO to train the Qwen2.5-Coder-7B model on the CypherBench training set. They compare against 6 general-purpose LLMs and 2 other Cypher-specific methods, and report better or similar performance compared to all other models, including proprietary models like GPT-4o, which are presumably much larger than the 7B model. They also compare against a rejection sampling-based SFT method and RL without the schema selection part, and show that the proposed method performs better than the ablations.

Strengths and Weaknesses

Strengths

  • The authors perform extensive experiments using many comparison methods, two distinct test sets, an ablation study, and comparing the performance on different subsets of the test sets.
  • The method shows clearly convincing improvements over previous methods, especially considering that some of the other models compared are much larger.
  • The paper studies a relatively narrow but impactful problem with potential for real-world impact.
  • The authors provide code, which makes it much easier to reproduce the details of the paper.

Weaknesses

  • There is no study of the tradeoff between inference costs and the increase in accuracy obtained. By allowing the model to generate shorter or longer amounts of reasoning tokens, it might be that close-enough performance can be achieved with fewer tokens or that performance can be improved even further with more tokens.
  • There is no comparison against other models that also use thinking tokens.

Questions

  • How often does the model fail to generate valid JSON for the schema?

Limitations

Yes

Final Justification

The authors have sufficiently addressed my questions with additional experimental results. The other reviews and the discussion did not raise any concerns on my end for lowering the score from before. My score was already positive and I believe it is appropriate to maintain it as-is.

Formatting Concerns

No concerns

Author Response

We sincerely thank the reviewer for their constructive feedback, which has helped us improve our work. We address the specific questions and weaknesses below.

  • Q1: Accuracy of JSON Schema Generation

    We have evaluated the accuracy of generating valid JSON format. The results on both benchmark datasets are presented below.

    Methods             CypherBench (%)    Neo4j-Text2Cypher (%)
    Qwen2.5-Coder-7B    78.32              62.65
    Cypher-RI           100.00             99.62

    As shown in the table above, our method achieves near-perfect format accuracy on both datasets: 100.00% on CypherBench and 99.62% on Neo4j-Text2Cypher.
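    The valid-JSON rate reported above can be measured by simply attempting to parse the schema-selection span of each model output; a minimal sketch:

    ```python
    import json

    def valid_json_rate(schema_outputs: list[str]) -> float:
        """Fraction of outputs whose schema-selection block parses as valid JSON."""
        ok = 0
        for text in schema_outputs:
            try:
                json.loads(text)
                ok += 1
            except json.JSONDecodeError:
                pass
        return ok / len(schema_outputs)


    print(valid_json_rate(['{"nodes": ["Movie"]}', 'not json']))   # 0.5
    ```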

    In addition to evaluating accuracy in JSON format, we also assessed the accuracy of schema selection on both benchmarks.

    Methods             CypherBench (%)    Neo4j-Text2Cypher (%)
    Qwen2.5-Coder-7B    76.45              73.98
    Cypher-RI           95.87              93.87

    As shown, our model improves the accuracy of schema selection by a significant margin of about 20 percentage points over the base model on both datasets. This demonstrates that the RL process effectively teaches the model to select the needed schema, which is a critical first step in the generation pipeline.

  • W1: Trade-off Between Inference Cost and Accuracy.

    We agree with the reviewer that the trade-off between inference cost and performance is a critical aspect of reasoning models. While longer reasoning chains can improve accuracy, the lack of control over their length makes it difficult to manage test-time compute budgets.

    To investigate this trade-off, we designed an additional experiment, inspired by prior work [1,2,3], aimed at controlling the length of the generated reasoning tokens. We introduce a length-based penalty, rlengthr_{length}, into our reward function. The modified reward is:

    $$r = r_{format} + (r_{selection} + r_{execution}) \cdot r_{length}$$

    $$r_{length} = \begin{cases} 1, & r_{selection} + r_{execution} \le 0 \\ \text{clip}(\alpha (n_{max} - n_y), 0, 1), & r_{selection} + r_{execution} > 0 \end{cases}$$

    where $n_y$ is the number of tokens generated during rollout, and $n_{max}$ is a hyperparameter that sets the desired maximum token limit. By adjusting $n_{max}$ during training, we can explicitly guide the model to produce reasoning chains of varying lengths.

    The results on CypherBench for models trained with different $n_{max}$ values are below:

    Methods                      Inference Tokens    EX (%)    PSJS (%)    Exec. (%)
    Qwen2.5-Coder-7B             53                  13.12     22.65       62.95
    Cypher_RI (n_max=256)        223                 45.78     49.65       91.93
    Cypher_RI (n_max=512)        436                 60.78     64.29       96.93
    Cypher_RI (n_max=1024)       901                 66.99     71.73       98.93
    Cypher_RI (n_max=1280)       1045                67.46     74.39       99.11
    Cypher_RI (w/o n_max)        1482                69.59     75.21       99.28

    As we increase the allowed inference tokens, the model's performance improves across all metrics. However, the gains exhibit diminishing returns. For example, increasing the token count from 223 to 436 yields a 15% absolute gain in Execution Accuracy (EX), while increasing from 1045 to 1482 tokens yields only a 2.13% gain. This analysis shows that our model can be adapted to control inference costs. By incorporating a length-based reward, we can choose a balance between performance and computational budget.

  • W2: Comparison with Additional Reasoning Models.

    Per the reviewer's suggestion, we have benchmarked our model against a wider array of reasoning language models. The results on CypherBench are as follows:

    Methods           EX (%)    PSJS (%)    Exec. (%)
    Qwen3-8B          48.72     66.49       81.26
    Qwen3-32B         58.56     81.23       80.75
    DeepSeek-R1       60.01     73.52       96.68
    o3-mini           62.39     74.73       98.00
    Gemini 2.5-pro    74.14     81.70       99.45
    Cypher-RI         69.59     75.21       99.28

    These experiments further strengthen our paper's claims. Our model significantly outperforms much larger models, including the Qwen3-32B and DeepSeek-R1.

    References:

    [1] L1: Controlling How Long A Reasoning Model Thinks With Reinforcement Learning

    [2] AdaptThink: Reasoning Models Can Learn When to Think

    [3] Learn to Reason Efficiently with Adaptive Length-based Reward Shaping

Comment

Thank you for the information from the additional experiments. They are helpful to gain further confidence that the work has impact beyond the mere addition of test-time reasoning, and that the model can do better with additional time/tokens allocated for reasoning. I will maintain my already positive score.

Comment

Thank you once again for your recognition and affirmation of our work. If you have any further questions or concerns, please feel free to let us know at any time.

Final Decision

This paper proposes Cypher-RI, a reinforcement learning framework that integrates schema selection directly into the query generation pipeline as part of the reasoning process. It uses reasoning tokens and recursive schema grounding during generation. The authors trained a 7B parameter model with GRPO and rule-based rewards (format, schema, execution) to achieve strong performance. The paper is technically sound with good empirical results, and the method seems easy to reproduce. Weakness-wise, the paper has limited algorithmic novelty, a relatively narrow scope (only Cypher), and lacks some ablations. Overall, the reviewers generally lean toward acceptance.