PaperHub
Score: 5.5/10
Poster · 4 reviewers
Reviewer ratings: 4, 3, 3, 2 (min 2, max 4, std 0.7)
ICML 2025

CodeSteer: Symbolic-Augmented Language Models via Code/Text Guidance

OpenReview | PDF
Submitted: 2025-01-23 · Updated: 2025-07-24
TL;DR

Method CodeSteer to augment LLM capabilities by guiding LLM code/text generation, and SymBench for evaluation of symbolic tasks.

Abstract

Keywords
Large Language Models · Code Interpreter · Code/text generation · Symbolic computing · Model fine-tuning · Model reasoning and planning

Reviews and Discussion

Review (Rating: 4)

The paper introduces CodeSteer, a method to guide large language models (LLMs) in making optimal choices between textual reasoning and code generation. The authors propose a multi-round supervised fine-tuning (SFT) and direct preference optimization (DPO) approach using a newly created dataset. A benchmark, SymBench, is introduced, containing 37 reasoning tasks. The model CodeSteerLLM, trained on Llama-3-8B, significantly enhances GPT-4o’s symbolic reasoning capabilities. Experiments demonstrate that CodeSteer improves the performance of GPT-4o from 53.3% to 86.4%, outperforming OpenAI’s o1 and DeepSeek R1. The framework generalizes across different LLMs, improving performance on Claude-3, Mistral, and GPT-3.5.

update after rebuttal

The rebuttal addresses my questions. I remain positive on the paper.

Questions for Authors

Overall, the paper is a pleasant read. The paper is well-structured and easy to follow. The empirical study supports the claims made by the paper. The methodology seems sound and the empirical results demonstrate the effectiveness of CodeSteer in assisting LLMs in solving reasoning tasks. Below are a few questions:

  • What are the failure cases where CodeSteer struggles?
  • In Figure 1, why not address the Case 1 problem with a coding approach as well?
  • Could an alternative approach work, where coding is preferred at the beginning and, if coding fails, the model switches to text-based reasoning? From Figure 1's examples, it seems this alternative could achieve similar results to the proposed method. I ask mainly because it is hard to come up with examples where there would be frequent switching between code and text-based reasoning. Based on the examples shown in the work, the solution seems to be having the LLM first generate code-based solutions, and if the code-based solutions are hard-coded (detected by rules), then asking the model to do text-based reasoning instead.

Claims and Evidence

Yes, the claims are well-supported.

  • Performance improvement demonstrated in Table 1
  • Ablation studies show the importance of the design choices

Methods and Evaluation Criteria

Yes, the methodology is well-founded and appropriate.

  • Eval is performed on both seen and unseen tasks

Minor suggestion:

  • The evaluation could point out failure cases where CodeSteer struggles. This will help provide insights on how to further improve on the work.

Theoretical Claims

The paper does not rely on theoretical claims, but its methodology is sound. Both DPO and SFT are well-studied in the literature; the work uses these approaches to build CodeSteer.

Experimental Design and Analysis

Yes, the experimental setup is well-structured.

Supplementary Material

Partially; I reviewed the datasets part. The appendix contains useful details.

Relation to Prior Work

  • The development of SymBench, which is open-sourced, unlike prior works.
  • New methods for dataset construction and fine-tuning
  • Strong empirical results

Missing Important References

The paper is well-positioned within the broader literature.

Other Strengths and Weaknesses

  • The paper presents a multi-round guidance mechanism to help LLMs select between producing code and text-based reasoning
  • Design elements such as DPO, self-answer checking, and symbolic evaluation seem to be helpful

Other Comments or Suggestions

  • see questions for authors.
Author Response

Thank you for your appreciation of our work and helpful suggestions. Here are our responses and newly added experiments based on the reviewers' questions. We kindly ask the reviewer to reconsider our work in light of the responses below. We are happy to further communicate with the reviewer.

Question 1: The evaluation could point out failure cases where CodeSteer struggles. This will help provide insights on how to further improve on the work.

Response 1: CodeSteer encounters failures under several conditions:

  1. Insufficient Capability of TaskLLM: In some cases, the capabilities of the TaskLLM—whether through coding or textual reasoning—are not sufficient to solve the given problem.

  2. Suboptimal Code Generation: The generated code may not use the most efficient method, which can lead to timeouts. For example, as shown in Figure 4c, CodeSteer’s success rate decreases when the target values increase, due to the exponential growth in search complexity.

  3. Lack of Robustness to Task Complexity: CodeSteer is not yet robust enough across tasks with varying complexity. As shown in Figure 4a, performance drops in medium-complexity samples. In these cases, CodeSteer sometimes selects textual reasoning over coding and ends up producing incorrect answers.

We will elaborate on these issues, along with relevant examples, in the Appendix to guide future improvements to the system.

Question 2: In Figure 1, why not address the Case 1 problem with a coding approach as well?

Response 2:

Thank you for the great question. In our study, solving Case 1 problems through coding is acceptable. During training data synthesis, we do not explicitly account for differences in execution or time cost between code generation and textual reasoning. Our primary criterion for data selection is whether the answer is correct. If both coding and textual reasoning successfully solve a problem, both are included in the dataset. Since code execution costs and runtimes depend on user-specific hardware, it is difficult to define a consistent comparison metric between the two approaches. Instead, we incorporate the number of guidance/generation rounds as part of the DPO scoring to favor more efficient solutions. We will include this discussion in the revised version of the paper.
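
As a concrete illustration of this selection logic, the following is a minimal sketch of how preference pairs could be formed, assuming a simple linear round penalty; the paper's exact scoring may differ, and `Trajectory`, `score`, and `build_preference_pairs` are illustrative names rather than the actual implementation.

```python
from dataclasses import dataclass
from itertools import combinations

@dataclass
class Trajectory:
    guidance: str    # the CodeSteerLLM guidance used in this trajectory
    correct: bool    # whether the final TaskLLM answer was correct
    rounds: int      # number of guidance/generation rounds used

def score(t: Trajectory, round_penalty: float = 0.1) -> float:
    # Correctness is the primary criterion; fewer rounds is preferred among correct runs.
    return (1.0 if t.correct else 0.0) - round_penalty * t.rounds

def build_preference_pairs(trajs: list[Trajectory], margin: float = 0.2):
    """Return (chosen_guidance, rejected_guidance) pairs for DPO training."""
    pairs = []
    for a, b in combinations(trajs, 2):
        if abs(score(a) - score(b)) >= margin:
            chosen, rejected = (a, b) if score(a) > score(b) else (b, a)
            pairs.append((chosen.guidance, rejected.guidance))
    return pairs
```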

Question 3: Could an alternative approach work, where coding is preferred at the beginning and, if coding fails, the model switches to text-based reasoning? From Figure 1's examples, it seems this alternative could achieve similar results to the proposed method. I ask mainly because it is hard to come up with examples where there would be frequent switching between code and text-based reasoning. Based on the examples shown in the work, the solution seems to be having the LLM first generate code-based solutions, and if the code-based solutions are hard-coded (detected by rules), then asking the model to do text-based reasoning instead.

Response 3:

Thank you for the insightful suggestions. Based on the suggestions from you and Reviewer AJhL, and to better evaluate CodeSteer and provide deeper comparisons, we include three prompt-based baselines:

  1. Few-Shot: Uses five example-based prompts to guide the TaskLLM in mimicking the 'code/text' switching reasoning process.
  2. Code-First-Rule: A rule-based approach where the TaskLLM is prompted to use code for the first three rounds (with increasing complexity) and then switch to text-based reasoning.
  3. Code-First-Agent: Employs GPT-4o as the CodeSteerLLM to guide the TaskLLM using the same code-first-then-text strategy as in Code-First-Rule.

As shown in the table, the three prompt-based methods perform significantly worse than CodeSteer, underscoring the effectiveness of training with our synthesized data. Upon analyzing failure cases, we identify two main reasons:

  • CodeSteerLLM’s guidance often includes problem-specific coding knowledge (e.g., suggesting A* or DFS) and how to formalize the problem, which purely prompt-based methods struggle to capture.
  • Switching between code and text can be advantageous, as later code generations can build on insights from prior textual reasoning. For instance, in Path Plan, a text-generated trajectory may be partially correct; subsequent code can refine it directly, reducing the search space.
Task Success Rate (%)     | CodeSteer | Few-Shot | Code-First-Rule | Code-First-Agent
Game 24                   | 93        | 28       | 68              | 76
Path Plan                 | 75        | 54       | 59              | 57
Eight Queen               | 78        | 47       | 62              | 73
Combinatorial Calculation | 86        | 58       | 47              | 59
2048                      | 56        | 49       | 40              | 48
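
For concreteness, here is a minimal sketch of the Code-First-Rule control loop described above; the prompt wording and the `query_llm`, `run_code`, and `check_answer` helpers are hypothetical placeholders rather than the actual implementation.

```python
def code_first_rule(task: str, query_llm, run_code, check_answer, max_code_rounds: int = 3):
    """Rule-based controller: try code for a few rounds, then fall back to text.

    query_llm(prompt) -> str, run_code(code) -> str, and check_answer(answer) -> bool
    are hypothetical helpers standing in for the TaskLLM call, a sandboxed code
    executor, and the self-answer check.
    """
    for round_idx in range(1, max_code_rounds + 1):
        prompt = (
            f"{task}\n\nSolve this by writing a Python program "
            f"(attempt {round_idx}; use a more thorough method than before)."
        )
        code = query_llm(prompt)
        answer = run_code(code)
        if check_answer(answer):
            return answer
    # Fall back to text-based reasoning if all code rounds fail.
    return query_llm(f"{task}\n\nSolve this with step-by-step textual reasoning only.")
```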

We will include the above added experiments and discussion in the final paper.

Review (Rating: 3)

This work introduces SymBench, a comprehensive benchmark comprising 37 symbolic tasks with adjustable complexity, along with datasets of 12k multi-round guidance/generation trajectories and 5.5k guidance comparison pairs, and fine-tunes a CodeSteerLLM on the introduced datasets, achieving improved reasoning performance.

Questions for Authors

see above

Claims and Evidence

The claims are decent and supported by the comprehensive experimental results.

Methods and Evaluation Criteria

The authors train the CodeSteerLLM to verify and guide the reasoning process of LLMs through supervised fine-tuning on the collected dataset. This is different from the recent surge of RL to incentivize such capabilities. I am curious if RL can further improve the potential of CodeSteerLLM.

Theoretical Claims

There are no theoretical claims to be evaluated.

Experimental Design and Analysis

Experimental designs are proper.

Supplementary Material

no review

Relation to Prior Work

The proposed method can improve the reasoning capabilities of LLMs and benefit the realm of AI4Science.

Missing Important References

no issue to be discussed

Other Strengths and Weaknesses

My biggest concern is if RL can further improve the potential of CodeSteerLLM, and I expect more experimental results that explore this direction. Moreover, why not combine CodeSteerLLM with reasoning LLMs like o1 or r1?

Other Comments or Suggestions

no

Author Response

Thank you for your appreciation of our work and helpful suggestions. We've incorporated new experiments and analyses based on the reviewers' questions. We kindly ask the reviewer to reconsider our work in light of the responses below.

Question 1: The authors train the CodeSteerLLM to verify and guide the reasoning process of LLMs through supervised fine-tuning on the collected dataset. This is different from the recent surge of RL to incentivize such capabilities. I am curious if RL can further improve the potential of CodeSteerLLM. I expect more experimental results that explore this direction.

Response 1:

  1. Both CodeSteer and recent reasoning LLMs, such as R1, begin with supervised fine-tuning (SFT) to initialize the model with basic answer formats and reasoning capabilities. In the second stage, we apply DPO to further optimize CodeSteerLLM, while other methods typically use PPO or GRPO. Prior work has shown that DPO, PPO, and GRPO aim to learn the same optimal policy.

  2. Why do we choose DPO rather than PPO/GRPO?

The key distinction between CodeSteer and recent reasoning models is its two-LLM design: a fixed, non-tunable TaskLLM (GPT-4o) and a learnable CodeSteerLLM. This setup introduces three main constraints on directly applying PPO/GRPO:

  • GPT-4o is closed-source and non-deterministic, producing varied outputs even with temperature set to 0, making it difficult to evaluate the effectiveness of a single CodeSteerLLM guidance using one answer trajectory.
  • Sampling from GPT-4o incurs significant API costs, unlike the virtually unlimited sampling budget available when training open-source models.
  • Existing PPO/GRPO approaches rely on final reward signals for optimization, but in CodeSteer, intermediate guidance steps are hard to evaluate directly.

To address these challenges, we propose a tree-based multi-round guidance sampling strategy (Section 4.2). Each intermediate guidance is scored based on its average downstream returns. Additionally, since DPO relies on preference comparisons rather than precise reward values, it may require fewer TaskLLM samples than PPO/GRPO.
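
A minimal sketch of the node-scoring idea, under the simplifying assumption that each leaf of the guidance tree stores a binary final return (the actual tree construction and return definition in Section 4.2 may differ):

```python
from dataclasses import dataclass, field

@dataclass
class GuidanceNode:
    guidance: str                      # guidance issued at this step
    children: list["GuidanceNode"] = field(default_factory=list)
    final_return: float | None = None  # set on leaves: 1.0 if the trajectory succeeded

def average_downstream_return(node: GuidanceNode) -> float:
    """Score a guidance step by the mean return of all leaves below it."""
    leaves, stack = [], [node]
    while stack:
        n = stack.pop()
        if not n.children:
            leaves.append(n.final_return or 0.0)
        else:
            stack.extend(n.children)
    return sum(leaves) / len(leaves)
```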

In the following table, we compare the performance of DPO and PPO using the same number of answer samples (ranging from 0 to 5000) and starting from the same post-SFT model. After convergence, DPO consistently outperforms PPO, particularly in low-sample regimes. This supports our decision to adopt DPO in this study. We will add the above discussion and results into the final paper version.

CodeSteer Ave. Norm. Score | Num. 0 | Num. 1000 | Num. 3000 | Num. 5000
DPO                        | 79.1   | 79.5      | 80.3      | 81.4
PPO                        | 79.1   | 79.0      | 79.6      | 81.1

  3. A promising future direction is to apply reinforcement learning to enhance reasoning abilities in unified code/text LLMs. While PPO and GRPO may not offer clear advantages over DPO in our current setup, they could be more suitable when applied to a single unified model that both guides and generates code/text. In such cases, PPO/GRPO can directly optimize the policy, which we plan to explore in future work.

Question 2: Moreover, why not combine CodeSteerLLM with reasoning LLMs like o1 or r1?

Response 2:

We have added o1 + CodeSteer to further verify CodeSteer's effectiveness. At the time this work was done, the o1 and r1 APIs consistently reported errors when multi-round answer generation was required, which prevented us from testing CodeSteer with these reasoning LLMs. The OpenAI platform has since fixed this issue, so we add experiments here to test whether CodeSteer can improve the performance of o1. As shown in the following table, o1's performance improves notably when augmented with CodeSteer on 5 randomly chosen unseen tasks, further verifying the effectiveness of CodeSteer. We will include these results in the final version.

Task success rate (%) | Cryptanalysis | Synthesis Decomposition | 2048 | Eight Queens | Combinatorial Calculation
o1                    | 60            | 57                      | 52   | 84           | 57
o1 + CodeSteer        | 73            | 94                      | 72   | 97           | 95

We are glad to answer further questions from the reviewer.

Review (Rating: 3)

This paper targets improving the performance of LLMs on symbolic reasoning tasks such as Game-24 via multi-round generation. The method contains many components, such as a small model fine-tuned to guide the generation, a self-answer checker, and a hardcoded symbolic checker.

The guidance model (CodeSteer, based on Llama-8B) is fine-tuned to guide large models (e.g., GPT-4o) during multi-round generation, e.g., deciding whether to generate code to solve the problem. The guidance model is trained on a synthesized dataset of GPT-4o solving 28 symbolic tasks. The method is evaluated on another 9 unseen tasks.

The self-answer checker always prompts GPT-4o to generate code to verify the answer's correctness after each round of generation. The symbolic checker evaluates the complexity of the generated code to reject code that is not sufficiently sophisticated for the task at hand.

With all the modules, the proposed method outperforms GPT-o1 and DeepSeek-R1 on the 9 unseen tasks. Without the symbolic checkers, the proposed method performs worse while still outperforming the baseline that prompts GPT-4o for guidance.

update after rebuttal

I still have concerns regarding the generalizability of such a method, e.g., whether it can be used in more general domains/tasks. The authors' replies are not that helpful.

Questions for Authors

  • Are the symbolic checkers task/domain-specific? Can they generalize to more general reasoning tasks?
  • Are there results in more domains and tasks?

Claims and Evidence

The authors claim that the proposed method outperforms baselines including o1. However, evidence is only shown on 9 tasks. These are not sufficient to evaluate the models' performance in general, especially considering the symbolic checkers that are specifically designed for the tasks at hand.

Methods and Evaluation Criteria

I like the proposed data generation and fine-tuning process for the guidance module in general. The model is fine-tuned to prefer actions that lead to higher returns, whose expectations are estimated from the MCTS trees. The name, multi-round data generation, was confusing though, as it could also refer to generating the datasets multiple times (e.g., obtaining new MCTS trees using the current fine-tuned guidance model).

The evaluation metric, which normalizes by the maximum pass rate across all methods, is less intuitive and stable. I would recommend the raw average of pass rates. This may not be too problematic, as I did check the other metric using the numbers reported in Table 1, and the ranking of methods looks similar to the reported one.

Theoretical Claims

No theoretical claim in this paper.

Experimental Design and Analysis

The test sets are small and may not be diverse enough.

There is no validation set, and I haven't seen links to the hyperparameter tuning process.

Supplementary Material

I checked the prompts and analysis in the Appendix. They look good and informative.

Relation to Prior Work

Guided search is important for test-time-compute scaling. There are multiple works investigating it for various domains, such as multi-round code refinement. Recent reasoning models such as o1 and R1 seem to have similar abilities as well.

Missing Important References

N/A

Other Strengths and Weaknesses

N/A

Other Comments or Suggestions

It could be interesting to finetune a bigger model, such as GPT-4o, for guidance, but I am not sure if the performance will improve a lot and surpass o1 without the symbolic checkers.

Author Response

Thank you for your appreciation of our work. The following responses clarify several misunderstandings raised by the reviewer. We've also incorporated new experiments and analyses. We kindly ask the reviewer to reconsider our work in light of the responses below.

Question 1: Testing on 9 unseen tasks is not sufficient to evaluate the models' performance, especially when considering the symbolic checkers that are specifically designed for the tasks at hand. Are the symbolic checkers task/domain-specific?

Response 1:

  1. The 28 seen and 9 unseen tasks are randomly divided without human bias. The test sets are diverse enough, as they cover all 6 types of reasoning capabilities we evaluate, as shown in Table 4. Furthermore, the seen tasks are also valid benchmarks for evaluation, since the tested samples all differ from the training samples and the complexity range is large enough.

  2. The relative improvements of CodeSteer over other methods are almost the same for the seen and unseen groups, showing that no overfitting is happening. We include a table below comparing CodeSteer's performance improvement over several main baselines for both seen and unseen tasks.

Avg. Norm. Improvement | o1  | DeepSeek R1 | Symbolic Agent | Code/Text Choice | Code Interpreter
Seen                   | 4.3 | 8.8         | 11.1           | 8.4              | 14.8
Unseen                 | 1.9 | 12.2        | 13.4           | 9.2              | 19.4

From the table we can observe that GPT-4o + CodeSteer achieves comparable and even better performance improvements on unseen tasks. This result indicates that no overfitting happens during our fine-tuning process and that our method is generalizable.

  3. Symbolic checkers are not domain-specific. Please refer to Appendix F for the full code, which is universal across all tasks we evaluate. Symbolic checkers evaluate code complexity by checking symbolic-computing characteristics, with no task-specific components; thus, they do not weaken the generalization of our method. (A hypothetical sketch of such a check is given after this list.)

  4. SymBench is not small, since it gathers and covers nearly all types of symbolic tasks appearing in the current research domain (see Appendix Section C, which describes SymBench).
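
Regarding point 3 above, a hypothetical illustration of what a task-agnostic complexity check can look like; the paper's actual checker is in Appendix F and may differ substantially:

```python
import ast

def symbolic_complexity_score(code: str) -> int:
    """Count generic constructs that indicate real symbolic computation."""
    tree = ast.parse(code)
    counted = (ast.For, ast.While, ast.If, ast.FunctionDef, ast.ListComp, ast.Call)
    return sum(isinstance(node, counted) for node in ast.walk(tree))

def passes_symbolic_check(code: str, min_score: int = 5) -> bool:
    # Reject code that merely prints a hard-coded answer.
    try:
        return symbolic_complexity_score(code) >= min_score
    except SyntaxError:
        return False
```

Under this sketch, a hard-coded `print(42)` answer scores 1 (a single call) and is rejected, while a search loop with helper functions passes.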

Question 2: There is no validation set, and I haven't seen links to the hyperparameter tuning process.

Response 2: As explained above, we directly tune hyperparameters on 28 seen tasks without requiring a validation set. We will add the explanation of the hyperparameter tuning process in the Appendix of the revised version.

Question 3: The name, multi-round data generation, was confusing though, as it could also refer to generating the datasets multiple times.

Response 3: Thank you for your helpful advice. Admittedly, this will cause some confusion. We will change it to multi-turn data generation in the final version.

Question 4: The evaluation metric is less intuitive and stable with the normalization term as the maximum pass rates of all methods. I would recommend the raw average of pass rates.

Response 4:

  1. We use the normalized metric to better compare the relative performance among methods and prevent any single task from disproportionately influencing the overall evaluation.

  2. To ensure the robustness of our conclusions against changes in the evaluation metric, we recalculate the average score without normalization. From the results in the table, we can observe that for both seen and unseen tasks, our method still clearly outperforms all main baselines, with an even larger margin on unseen tasks.

We will add this metric in the final paper.

Avg. Score | o1   | DeepSeek R1 | Symbolic Agent | Code/Text Choice | Code Interpreter | GPT-4o + CodeSteer
Seen       | 73.7 | 70.4        | 67.5           | 70.0             | 64.9             | 76.1
Unseen     | 67.8 | 64.8        | 60.9           | 64.9             | 56.0             | 72.2
Total      | 72.3 | 69.0        | 65.9           | 68.8             | 62.8             | 75.2
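
For reference, a minimal sketch of how the normalized metric discussed in point 1 above could be computed: each method's per-task score is divided by the best score any method achieves on that task, then averaged over tasks. The paper's exact implementation is not shown here, so this is an assumption.

```python
def avg_norm_score(results: dict[str, dict[str, float]]) -> dict[str, float]:
    """results[method][task] = raw success rate (%); returns each method's normalized average."""
    tasks = list(next(iter(results.values())).keys())
    norm = {m: 0.0 for m in results}
    for task in tasks:
        best = max(results[m][task] for m in results)
        for m in results:
            norm[m] += (results[m][task] / best) if best > 0 else 0.0
    return {m: 100.0 * total / len(tasks) for m, total in norm.items()}
```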

Question 5: It could be interesting to finetune a bigger model, such as GPT-4o.

Response 5:

Since we do not have access to fine-tune GPT-4o, we instead add experiments comparing fine-tuning CodeSteerLLM on Llama-3-8B, Llama-2-13B, and Llama-3-70B (LoRA) on the same SFT and DPO datasets, as shown in the following table. We find that performance does not improve appreciably with larger models. The main bottleneck is still the lack of a high-quality dataset.

Ave. Norm. Score             | Seen | Unseen
Llama-3-8B (full-parameter)  | 88.1 | 81.3
Llama-2-13B (full-parameter) | 87.3 | 82.2
Llama-3-70B (LoRA)           | 88.6 | 80.0
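
For the Llama-3-70B (LoRA) row, a minimal sketch of how a LoRA setup with the `peft` library could look; the rank, alpha, dropout, and target modules here are illustrative assumptions rather than the hyperparameters used in the paper.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Load the frozen base model (sharded across available devices).
base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-70B-Instruct", device_map="auto"
)
lora_cfg = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,          # illustrative values
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_cfg)  # only the LoRA adapters are trainable
model.print_trainable_parameters()
```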

We will add the above new experimental results and discussion in the final paper.

Review (Rating: 2)

This paper introduces CodeSteer, a model fine-tuned to enhance reasoning abilities between text and code. The authors propose a synthetic dataset and a new evaluation suite of 37 symbolic tasks to demonstrate the model’s performance on complex reasoning tasks. They claim that by leveraging both code-generation capabilities and natural language generation, CodeSteer shows significant improvements on reasoning benchmarks.

Questions for Authors

see above

Claims and Evidence

  • Claim: The fine-tuned model (CodeSteer) performs well on symbolic reasoning tasks by switching contexts between text and code.
  • Evidence: The experimental results on the proposed benchmark demonstrate CodeSteer’s effectiveness, supporting the claim that it leverages code generation to assist with reasoning.

While this claim is promising, more details on the following would strengthen the evidence:

  1. Comparison with standard baselines: A direct comparison with models that use text or code alone would highlight the advantages of the interleaved method.

Methods and Evaluation Criteria

Dataset Synthesis

One of the central elements of CodeSteer is the synthesized training dataset. However, the steps to create this dataset are not fully detailed in the main body. Questions that arise include:

  1. What is the precise process to generate synthetic data?
    • How are prompts, code fragments, and textual explanations created or derived?
    • Are there automated or semi-automated processes for generating these examples?
  2. Data Evaluation
    • How is the quality of the synthesized data validated?
    • Are there checks for correctness, diversity, and relevance to the targeted symbolic tasks?
  3. Data Quality Considerations
    • What filtering steps are taken to remove low-quality or repetitive samples?
    • What metrics (e.g., coverage of different symbolic reasoning types) ensure the dataset’s comprehensiveness?

Proposed Benchmark

The paper introduces a benchmark consisting of 37 symbolic tasks:

  • Data Collection: How exactly were these tasks sourced or created?
    • Were they adapted from existing symbolic reasoning benchmarks or newly constructed?
    • What cleaning and filtering procedures were applied?
  • Benchmark Size: With 37 tasks, the benchmark appears relatively small.
    • Is there a plan to scale it further in future work or to combine it with existing reasoning datasets for broader coverage?
  • Placement in the Main Text: Since the benchmark is a significant contribution, it would be beneficial to include more details and examples in the main paper (instead of primarily in the appendix).

Theoretical Claims

There are no explicit theoretical claims beyond empirical demonstrations of CodeSteer’s effectiveness.

Experimental Design and Analysis

In the current evaluation:

  • Reported Metric: Table 1 uses a single metric to show CodeSteer’s performance.
    • Recommendation: Consider adding additional metrics that assess different aspects of generation, such as:
      • Exact Match (for correctness of final answers).
      • CodeBLEU (for measuring code-generation quality and similarity).
  • Baselines: While CodeSteer’s performance is the focus, it would be illuminating to compare with:
    • GPT-4 (Few-shot): Prompt GPT-4 directly with a few-shot approach, guiding it to replicate CodeSteer’s “code-then-text” reasoning procedure, but without the fine-tuning. This would reveal whether the gains come primarily from the interleaved prompting or from specific CodeSteer training data.

By incorporating multiple metrics and stronger baselines, the paper’s findings could be even more compelling.

Supplementary Material

Yes

Relation to Prior Work

Code Generation Reasoning

Missing Important References

N/A

Other Strengths and Weaknesses

see above

Other Comments or Suggestions

see above

Author Response

Thank you for the helpful feedback. We've clarified several misunderstandings raised by the reviewer and also incorporated new experiments and analyses. We hope the reviewer will reconsider our work based on the responses below.

Question 1: Direct comparison with models that use text or code alone.

Response 1: We have already compared methods that use only text or only code in the original paper. As shown in Table 1, the results for All Text + CoT and All Code + CoT represent cases where GPT-4o is prompted to generate solely text or code to solve the task in a single round (see Lines 209–213 for details). The superior performance of GPT-4o+CodeSteer further demonstrates the effectiveness of CodeSteer.

Question 2: Process to generate synthetic data: How are prompts, code fragments, and textual explanations created? Are there automated processes?

Response 2:

  1. Apart from SymBench and the dataset, we also propose novel training techniques such as multi-round DPO, SFT data augmentation, the Symbolic Checker, and the Self-answer Checker.

  2. All training data are synthesized and validated automatically using predefined rules, without costly human annotation. Details of dataset synthesis for SFT and DPO are provided in Section 4.1 (Lines 157–170) and Section 4.2. The data generation process, including textual reasoning components, is handled by GPT-4o guided by detailed prompts with preset knowledge or hints. The implemented prompts are shown on Page 3 (Lines 160–163, first column) and in Appendix Section D. Generated code is extracted using predefined scripts.
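
As an illustration of that last extraction step (the actual predefined scripts are not shown here, so this regex-based version is our own assumption):

```python
import re

def extract_code_blocks(llm_output: str) -> list[str]:
    """Pull fenced code blocks (```python ... ``` or ``` ... ```) out of an LLM answer."""
    pattern = r"```(?:python)?\s*\n(.*?)```"
    return [block.strip() for block in re.findall(pattern, llm_output, flags=re.DOTALL)]
```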

Question 3: Checks for correctness, diversity, and relevance to the targeted symbolic tasks. Filtering steps to remove low-quality samples. Metrics for the dataset’s comprehensiveness.

Response 3: In both SFT and DPO, we include only answer trajectories that lead to correct solutions. To encourage diverse responses, we use varied prompts for SFT (Section 4.1, Lines 163–168) and different model checkpoints for DPO. The selected training tasks span a wide range of symbolic reasoning types, as detailed in Appendix Section C and Table 4. Each task includes samples of varying complexity to promote diverse reasoning. Training data is dynamically updated based on task performance—tasks with lower performance receive more data in subsequent rounds.
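
A minimal sketch of the dynamic-allocation idea in that last sentence, under the assumption of a simple inverse-performance weighting (the actual scheme is not specified here):

```python
def allocate_samples(success_rates: dict[str, float], budget: int) -> dict[str, int]:
    """Give more of the next round's data budget to tasks the model currently fails."""
    deficits = {task: 1.0 - rate for task, rate in success_rates.items()}
    total = sum(deficits.values()) or 1.0
    return {task: round(budget * d / total) for task, d in deficits.items()}

# Example: a task solved 40% of the time receives more new samples than one solved 90%.
print(allocate_samples({"Game24": 0.9, "PathPlan": 0.4, "EightQueens": 0.7}, budget=1000))
```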

Question 4: How exactly were these tasks sourced or created? What cleaning and filtering procedures?

Response 4: As noted in the original paper (Lines 90–100 and 128–140, first column), we collect and redevelop all tasks from prior works, since their datasets and code are not open-sourced. The source work for each task is given in Appendix Section C. Among 45 tasks, we select 37 since the remaining 8 are not challenging for GPT-4o.

Question 5: With 37 tasks, the benchmark appears relatively small. Plan to scale it further in future work or to combine it with existing reasoning datasets?

Response 5: SymBench is not small, since it covers nearly all types of symbolic tasks appearing in the current research domain. SymBench also comprises more tasks (37) than other reasoning and planning benchmarks such as BIG-Bench-Hard (23 tasks), PlanBench (6 tasks), and LogicGame (31 tasks). We will definitely add more tasks and combine SymBench with other existing reasoning datasets for a more comprehensive benchmark in the future.

Question 6: Additional metrics: Exact Match, CodeBLEU

Response 6: Since many SymBench problems have multiple correct solutions, Exact Match is not ideal. Similarly, CodeBLEU relies on reference code and assumes a single ground-truth solution, making it less suitable since TaskLLM outputs may be in text or code and correct solutions are not unique. The reviewer's suggestion inspired us to explore alternative metrics that capture different aspects of generation. For example, the following table shows that average code complexity increases with more rounds, supporting the idea that the TaskLLM refines code progressively under guidance.

Round                 | 1    | 2     | 3     | 4     | 5
Code complexity score | 9.32 | 11.44 | 12.94 | 13.31 | 13.54

Question 7: Prompt GPT-4 directly with a few-shot approach, guiding it to replicate CodeSteer’s “code-then-text” reasoning procedure.

Response 7: We include three prompt-based baselines: Few-Shot, Code-First-Rule, and Code-First-Agent. Due to the word limit, please refer to our response to Reviewer RBSg for the full experimental results and detailed discussion. In summary, these three prompt-based methods significantly underperform compared to CodeSteer, highlighting the value of training with our synthesized data.

We will include all of the above contents in the revised paper. Happy to answer any further questions.

Final Decision

This paper makes several interesting contributions to the space of coding/reasoning, specifically in text- vs. code-based reasoning. Reviewers generally find the following contributions to be valid:

  • the benchmark
  • the data synthesis technique and process
  • the framework that can leverage both the tuned LLM and a powerful model like GPT-4o
  • the empirical gains on benchmark datasets.

Reviewer AJhL found some aspects that can be further improved or clarified, and the authors promised to include these in the final revision. This decision to accept is based on the expectation that the authors will carefully incorporate the improvements and clarifications in their revision.