PaperHub
Overall rating: 6.8 / 10 (Poster; 4 reviewers; min 4, max 5, std. dev. 0.4)
Individual ratings: 4, 5, 4, 4
Confidence: 3.8
Novelty: 2.5 | Quality: 2.8 | Clarity: 2.3 | Significance: 2.8
NeurIPS 2025

Scaling Code-Assisted Chain-of-Thoughts and Instructions for Model Reasoning

OpenReview | PDF
Submitted: 2025-05-12 | Updated: 2025-10-29

Abstract

Keywords
Data Synthesis, Reasoning

Reviews and Discussion

Official Review
Rating: 4

The paper introduces Caco (Code-Assisted Chain of Thought), a framework for generating large-scale, high-quality, and executable reasoning data (in the form of Python code) to train large language models (LLMs). It addresses key limitations of traditional Chain of Thought (CoT) prompting, which is typically done in natural language and suffers from issues such as unverifiability, reliance on manual annotation, and logical inconsistencies. The authors demonstrate that fine-tuning both specialized and general-purpose open-source models on the proposed dataset significantly improves their performance on mathematical reasoning benchmarks.

Strengths and Weaknesses

Strengths

  • A dataset where reasoning traces are executable (i.e., Python programs) can significantly improve the verifiability and reliability of reasoning chains, unlike natural language-based reasoning, which is difficult to parse and not directly executable.
  • Fine-tuning results show consistent improvements across benchmarks.

Weaknesses

  • Although the dataset is useful, the core research contributions are somewhat unclear. Representing reasoning in executable code and using fine-tuned models like CodeGen for data generation have both been explored in prior work. For example, [1] uses (a similar idea) Formalizer models to generate formal proof data from informal sources. It would be great if the authors could clarify their technical contributions and how they compare with existing, well-known ideas.
  • While the results reported in Table 1 are promising, existing methods such as rStar-Math appear to achieve substantially better performance with similarly sized models (see Table 5 in [2]). The authors should clarify why this baseline was not included and, if possible, provide a fair comparison with such existing methods.

[1] Goedel-Prover: A Frontier Model for Open-Source Automated Theorem Proving
[2] rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking

Questions

See the weakness section

Limitations

yes

Justification for Final Rating

I am largely satisfied with the submission and will maintain my score at 4. However, I would not give a higher score due to the lack of technical novelty in the paper, although it does present strong empirical results.

Formatting Concerns

None

Author Response

Thank you for your time and thoughtful review. We have carefully considered your comments and provide detailed responses below, with the hope that they clarify our intentions and strengthen the understanding of our work.

W-1: Although the dataset is useful, the core research contributions are somewhat unclear. Representing reasoning in executable code and using fine-tuned models like CodeGen for data generation have both been explored in prior work. For example, [1] uses (a similar idea) Formalizer models to generate formal proof data from informal sources. It would be great if the authors could clarify their technical contributions and how they compare with existing, well-known ideas.

In [1], the approach first converts math problems into formal theorem statements and then uses a prover model to generate formal proofs. These proofs are filtered based on verifiability, and the correct ones are added for iterative self-improvement. The primary goal of [1] is to sample the highest-quality answers possible under a fixed set of problems.

In contrast, our work focuses on generating large-scale, diverse, and verifiable problems across domains such as mathematics, algorithms, and logic—all unified under a Python-based code representation. By leveraging a code generation model, we synthesize new problems by first generating code-based reasoning traces and then translating them into natural language, enabling both diversity and scalability.

To clarify these distinctions, we will revise the manuscript to explicitly highlight the differences from [1], and we will include proper citations to both works as suggested.

For more precise definitions and contribution statements, we refer the reviewer to our responses to Reviewer oWXi-W1 and oWXi-W2.

W-2: While the results reported in Table 1 are promising, existing methods such as rStar-Math appear to achieve substantially better performance with similarly sized models (see Table 5 in [2]). The authors should clarify why this baseline was not included and, if possible, provide a fair comparison with such existing methods.

Thank you for raising this important point. Our method focuses on scalable reasoning data generation with basic SFT, while rStar-Math is a System 2-style test-time reasoning framework that uses multi-stage evolution, Monte Carlo Tree Search (MCTS), and a PPM-based reward model to select final answers.

To enable a fair comparison, we consider both greedy and sampled decoding modes:

Greedy Inference

The main results reported in rStar-Math (Table 5 in [2]) rely on sampling 64 responses per question and selecting the best one using PPM scoring (denoted rStar-Math-best@64). This is not directly comparable to our default greedy decoding setting.

To enable a fair comparison, we refer to rStar-Math’s greedy baseline (Table 10 in [2]) and compare it to our Caco-1.3M model under similar decoding.

System2 Sampling

To compare with the main results in rStar-Math, we initially considered replicating rStar’s full MCTS+PPM pipeline but found it incompatible:

  • The rStar-Math model is trained to decode step-by-step outputs (### Step n) to enable MCTS, a format our model does not generate natively.
  • Instead, we adopt long-form CoT with majority voting—a simpler and widely used System 2-style approach.

We also evaluate Caco-Long, a System 2-style model trained on long CoTs, using both greedy decoding and 8× sampling with majority voting (temperature=0.8, top_p=0.9).
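For illustration only (not the authors' actual evaluation code), a minimal majority-voting sketch over sampled completions under the stated decoding settings could look like the following; sample_answer is a hypothetical helper that draws one completion and extracts its final answer:

from collections import Counter

def majority_vote(question, sample_answer, k=8, temperature=0.8, top_p=0.9):
    # Draw k sampled answers and return the most frequent (majority-voted) one.
    answers = [sample_answer(question, temperature=temperature, top_p=top_p) for _ in range(k)]
    counts = Counter(a for a in answers if a is not None)
    return counts.most_common(1)[0][0] if counts else None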

Due to the large evaluation cost on MATH and Olympiad, we only performed sampling on AMC and AIME. All models are based on Qwen2.5-Math-7B.

Model | Inference Mode | Training Strategy | AIME2024 | MATH | AMC23 | Olympiad | Average
System 1: rStar-Math (greedy) | Greedy | 4-stage evolve with PPM | 26.7 | 78.4 | 47.5 | 47.1 | 49.9
System 1: Caco-1.3M | Greedy | SFT, 3 epochs | 23.3 | 82.4 | 67.5 | 46.5 | 54.9
System 2: Caco-Long-300K | Greedy | SFT, 3 epochs | 40.0 | 89.6 | 75.0 | 58.9 | 65.9
System 2: rStar-Math-best@64 | MCTS+PPM (64 samples) | 4-stage evolve with PPM | 53.3 | 90.0 | 87.5 | 65.6 | 74.1
System 2: Caco-Long-major@8 | Majority (8 samples) | SFT, 3 epochs | 56.6 | - | 90.0 | - | -

Note: Unlike the results reported in the original paper, we found after submission that the model requires 3 epochs of training to fully converge. The results shown here are based on this updated 3-epoch training.

Key observations:

  • Under greedy decoding, Caco-Long outperforms rStar-Math’s greedy policy by a large margin.
  • With just 8 sampled outputs, Caco-Long surpasses rStar-Math-best@64, which uses 64 samples and a reward model.

While rStar-Math introduces a sophisticated multi-round framework, our work focuses on improving cold-start performance via better data generation. These two directions are complementary: Caco can provide stronger initial models and data, which rStar-style pipelines can further evolve.

We appreciate the reviewer’s suggestion and will include this comparison in the revised version to better contextualize our contributions.

Comment

Dear Reviewer,

We would like to kindly remind you to review our rebuttal for our submission. As the rebuttal phase is concluding, we hope to address any remaining concerns and clarify our points. Your feedback is highly appreciated.

Thank you very much for your time and consideration.

Best regards, Authors

Comment

Dear Reviewer NsmM,

We hope you are doing well. As we approach the conclusion of the rebuttal period, we are writing to kindly request your review of our submitted rebuttal materials when time permits.

Should any questions or concerns arise during your evaluation, please know we are readily available to provide further clarification. Your insights have been invaluable in enhancing the quality of our work, and we sincerely appreciate your continued engagement with our submission.

Thank you once more for your time and thoughtful review.

Best regards, authors

Comment

Thanks for your response. I am mostly happy with the submission and will retain my score.

Official Review
Rating: 5

This paper proposes a technique called “Caco” for synthetic data generation which works by translating math reasoning problems to and from code. Specifically, Caco has the following stages: (1) prompt an LLM (Qwen-2.5-72B-Instruct) to convert existing math datasets to a unified code format (which the paper calls a “Code CoT”) which are filtered to ensure they produce the same answer as in the original dataset; (2) finetune a new LLM (unclear which one) on the Code CoTs to produce a model called CodeGen; (3) sample from the CodeGen model to create novel Code CoTs; (4) prompt an LLM (unclear which one) with the Code CoTs to produce corresponding natural language prompts, and filter the resulting prompts (using Qwen3-32B) to ensure that they are solvable. This procedure results in a dataset of 1.3M math problems which, when used to finetune various models (DeepSeekMath-7B, Qwen-2.5-Math-7B, LLaMA3-8B), substantially improves their reasoning performance across six benchmarks, outperforming the performance of those same models when trained on a variety of other math-specific datasets.

Strengths and Weaknesses

Pros:

  • Originality: It is an interesting idea to leverage a formal language to constrain the synthetic data generation process, which enables more reliable verification. This is not a completely original idea on its own (e.g. methods like AlphaGeometry also leverage this idea) but to my knowledge the particular instantiation of this idea in this paper is new.
  • Significance: Synthetic data generation is an important area of LLM research especially in reasoning domains, as high-quality, expert-level human-generated data is hard to come by.
  • Clarity: I appreciate that all the prompts used in the pipeline are included in the appendix, thank you for including this level of detail.

Cons:

  • Clarity: I found the paper difficult to read at times. (1) The method is relatively straightforward to explain yet I found the explanation—especially in the abstract and introduction—to be unclear and it was not until I had fully finished reading the methods that I was able to understand it. Similarly, I found Figure 2 to not be particularly easy to understand—I needed to read the text first to understand it, whereas it should really be the other way around. (2) Many details are also scattered throughout the paper & appendix making it difficult to completely understand the full picture.
  • Clarity: I find the term “Code CoT” to be misleading. It’s not a chain-of-thought at all because it’s not step-by-step reasoning—it’s a programmatic solution to the given problem.
  • Quality: Important details are missing from the paper, and there are some inconsistencies. I detail these further below in the “Questions” section.
  • Quality: Given that the paper uses a larger model (Qwen-2.5-72B-Instruct) to generate the seed dataset of Code CoTs, I have a concern about whether the gains come from the knowledge from the more advanced model, or from the Caco procedure. I discuss this concern in more detail below in the “Questions” section.
  • Quality: The proposed procedure shares some similarities with STaR/ReST, which are much simpler methods. Ideally one iteration of STaR would be included as a baseline to confirm that the gains obtained from Caco are indeed due to leveraging the code representation. I discuss this concern in more detail below in the “Questions” section.

Questions

Missing, Unclear, and Inconsistent Details

There are some missing details and inconsistencies. I would need to see all of these addressed to raise my score:

  • Which model is finetuned to produce the CodeGen model?
  • Which model is used to produce the natural language instructions and solutions (detailed in Section 3.3)?
  • More generally (e.g. lines 114, 147, 176, 187, 192): rather than saying “a LLM” the paper should always be explicit about which particular model is used.
  • Caco-Seed-109K needs more explanation. The paper says it is the “seed data used to train CodeGen”, but that would be data consisting of Code CoTs, not natural language prompts and CoTs. Moreover, Line 162 says that the seed dataset is 162K examples, not 109K. Why are these numbers different? Where do the natural language CoTs come from? (Is it just that you’ve applied the rest of the Caco procedure to the 162K to generate the CoTs, and applied filtering, thus reducing the size of the dataset?)
  • Plots and tables should include confidence intervals/error bars. These do not need to be derived from multiple experiments (I agree with the authors this would be too computationally expensive) but they can still be computed over the dataset. For example, for GSM8K (with N=1319 questions) and Caco-1.3M-LLaMA3-8B (with p = .878) the upper 95% confidence interval is .878 + 1.96 * sqrt(p * (1 - p)) / sqrt(n) = 0.896 (see https://en.wikipedia.org/wiki/Binomial_proportion_confidence_interval).
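As a concrete sketch of the interval computation described in the last bullet (a normal-approximation/Wald interval; illustrative code, not from the paper):

import math

def wald_interval(p, n, z=1.96):
    # 95% normal-approximation (Wald) confidence interval for a binomial proportion.
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width

print(wald_interval(0.878, 1319))  # GSM8K example above: roughly (0.860, 0.896)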

Possible Distillation Effect

To generate the seed Code CoTs, the paper uses a much larger model (Qwen-2.5-72B-Instruct) than the models ultimately finetuned on Caco & benchmarked in Table 1. I am worried that because of this, the Caco data pipeline is essentially just implementing a process of distillation from this larger model.

An important, simple baseline would be to instead sample natural language CoTs from Qwen-2.5-72B-Instruct and use them for finetuning the smaller models. If this results in similar performance as training on Caco (when matched for dataset size), this would suggest that the ability to synthesize new prompts is not providing much benefit—rather, the gains are coming from solutions generated by a larger model.

I would need to see a comparison to this baseline (or something analogous to convince me that the results are not coming from distillation) to consider raising my score.

Ideally, the paper would also include benchmark scores for Qwen-2.5-72B-Instruct for reference.

Comparison to STaR / ReST-EM

Related to the above point, but slightly different, is that the Caco procedure can be seen in some sense as a (relatively complicated) iteration of STaR [1] or ReST-EM [2], which are much simpler methods. An important baseline would be to include a comparison to a single iteration of self-improvement: (1) generate CoTs from a model (e.g. one of the ones benchmarked), (2) filter the CoTs by correctness, (3) and then finetune the model on the resulting traces.

This would be an important control because it would again show that the gains are coming from having a higher number of unique prompts to train on, rather than a more generic form of self-improvement. (Note it should be possible to match for dataset size as well by generating multiple samples per prompt.)

I would need to see a comparison to this baseline (or something analogous to convince me that the results are not coming from vanilla self-improvement) to consider raising my score.

Refs:

  1. Zelikman, E., Wu, Y., Mu, J., & Goodman, N. (2022). STaR: Bootstrapping reasoning with reasoning. Advances in Neural Information Processing Systems, 35, 15476-15488.
  2. Singh, A., Co-Reyes, J. D., Agarwal, R., Anand, A., Patil, P., Garcia, X., ... & Fiedel, N. (2023). Beyond human data: Scaling self-training for problem-solving with language models. arXiv preprint arXiv:2312.06585.

Minor Points

These are some suggestions for improving the paper (but have not strongly influenced my score):

  • Line 66 - please make it clear what the “65.4% accuracy” is over (i.e. the six benchmarks).
  • As far as I can tell, the “problem-level augmentation” described in Figure 3 isn’t mentioned anywhere else in the paper. It should be removed if it’s not part of the paper, or made clear what part of the text this is referring to.
  • Section 3.4 seems like it could be in the appendix.
  • Section C.1: What is Caco-100K? It is not described anywhere.
  • Section C.2: I’m not sure I understand this experiment. Is the idea that when generating the problem solutions on the Caco prompts, you enable thinking, and then train on these longer traces instead? If so, I think this is not a very informative experiment without the inclusion of some further baselines (e.g. AIME performance after training on Caco w/o long prompts).

Limitations

The main unaddressed limitation is the fact that the paper uses a larger model as part of the data generation procedure, but then only evaluates on smaller models. This could result in the improved results being due to a distillation effect, rather than due to the proposed data generation method. This limitation should be addressed.

EDIT: I believe the authors' rebuttal has sufficiently addressed this limitation.

Justification for Final Rating

The authors have included extensive results regarding the possible distillation and self-improvement effects, which fully address my concerns. They have also promised to improve the clarity and sufficiently answered the questions I had about various details. Of course, I can't evaluate whether the clarity is improved without reading the revised paper but my main concerns were about distillation/self-improvement and these have been resolved.

Formatting Concerns

n/a

Author Response

We’re grateful for your valuable feedback and the careful attention you’ve given to our submission. Below we provide responses to your questions, and we hope this helps clarify our methodology and resolve your concerns.

W-1: Clarity: I found the paper difficult to read at times. W-2: Clarity: I find the term “Code CoT” to be misleading.

Response to W-1: Clarity of Presentation

Thank you for the feedback. As suggested, we have revised the abstract and introduction to provide a clearer and more concise overview of the core methodology, ensuring readers can grasp the main idea without needing to read the full methods section.

In addition, we will redesign Figure 2 to make the workflow more intuitive and self-explanatory. We also plan to include a dedicated “Problem Formulation” section, as recommended by reviewer oWXi-W2, to help organize key concepts.

Furthermore, in response to your point in W-3, we have consolidated scattered technical details from the appendix into the main text to improve clarity and cohesion.

Response to W-2: On the Term “Code CoT”

We appreciate your observation. Our original intent behind the term “Code CoT” was to describe reasoning traces expressed in code (i.e., programmatic) form, as a structured and executable variant of chain-of-thought.

We used the term “CoT” because such programs encode sequential reasoning steps implicitly—for example, variable initialization reflects problem setup, and each computation (e.g., via SymPy) corresponds to an intermediate logical operation.

This framing is consistent with prior terminology:

  • PAL [1] refers to this as “programmatic steps,”
  • PoT [2] describes it as “Programs of Thought,” suggesting that “LMs can express reasoning steps as Python programs.”

We acknowledge that the term “Code CoT” may be misleading, as such traces do not contain explicit step-by-step natural language explanations. In the revised version, we are considering replacing it with a more precise term such as “Program of Thoughts” for improved clarity.

[1] Gao, Luyu, et al. "Pal: Program-aided language models." International Conference on Machine Learning. PMLR, 2023

[2] Chen, Wenhu, et al. "Program of thoughts prompting: Disentangling computation from reasoning for numerical reasoning tasks." arXiv preprint arXiv:2211.12588.

W-3: Quality: Important details are missing from the paper, and there are some inconsistencies. I detail these further below in the “Questions” section.

Sorry for missing details. We give detailed information here, and we will revise as you suggested.

  • After generating the seed Code CoT dataset, we fine-tuned Qwen2.5-Coder-7B to obtain the CodeGen model.
  • The natural language instructions and answers mentioned in Section 3.3 were generated using Qwen3-8B (non-thinking mode).
  • Thank you for the suggestion. We will replace vague mentions like "an LLM" with the specific model names throughout the paper, e.g., Qwen2.5-Coder-7B, Qwen3-8B.
  • Exactly — the 162K seed dataset includes the original source problems, which correspond to the “Original Math Problem” branch in Figure 2. Like the synthetic problems, we used Qwen3-8B to generate Language CoTs for them, and after filtering, 109K samples remained. The current description is unclear, and we will revise it in the next version.
  • Thank you for the reminder. We will add confidence intervals to tables.

W-4 & Limitation: Quality: Given that the paper uses a larger model (Qwen-2.5-72B-Instruct) to generate the seed dataset of Code CoTs, I have a concern about whether the gains come from the knowledge from the more advanced model, or from the Caco procedure.

Thank you for raising this important point. To evaluate whether the performance gains are primarily due to knowledge transfer from the large teacher model (Qwen-2.5-72B-Instruct) rather than the Caco procedure itself, we conducted the following control experiment:

  • We used Qwen-2.5-72B-Instruct to generate natural language CoT answers for the same seed questions, using multiple sampling rounds to create a dataset of 300K QA pairs, which we refer to as Qwen72B-Seed-distilled.
  • We then fine-tuned the same student model on this dataset and compared it to a 300K subset of Caco, and the full Caco-1.3M dataset.

The results are shown below:

Dataset | #Samples | MATH | GSM8K | College | DM | Olympiad | TheoremQA | AVG
Qwen-7B-Qwen72B-Seed-distilled | 300K | 79.0 | 91.2 | 52.1 | 84.4 | 41.3 | 45.0 | 65.5
Qwen-7B-Caco-300K | 300K | 81.6 | 92.4 | 51.2 | 84.8 | 42.5 | 44.9 | 66.2
Qwen-7B-Caco-1.3M | 1.3M | 82.4 | 92.6 | 51.4 | 87.1 | 46.5 | 46.0 | 67.7
DS-7B-Qwen72B-Seed-distilled | 300K | 57.4 | 83.0 | 42.4 | 69.0 | 23.4 | 31.4 | 51.1
DS-7B-Caco-300K | 300K | 61.8 | 83.2 | 43.3 | 76.0 | 23.9 | 31.1 | 53.2
DS-7B-Caco-1.3M | 1.3M | 68.2 | 85.1 | 46.0 | 80.2 | 29.5 | 33.8 | 57.1
Qwen2.5-72B-Instruct | - | 83.1 | 95.8 | 53.0 | 83.7 | 46.7 | 46.0 | 68.1

Note: Unlike the results reported in the original paper, we found after submission that the model requires 3 epochs of training to fully converge. The results shown here are based on this updated 3-epoch training.

We draw two key conclusions from this:

  1. Caco outperforms the distilled data (66.2 vs. 65.5 AVG) even at the same dataset size, suggesting that prompt and reasoning diversity—enabled by Caco’s code-based augmentation—is beneficial beyond simply copying teacher traces.
  2. Caco scales more effectively: When expanding to 1.3M samples, the model gains further improvement, which would be difficult to achieve using the seed-only natural language questions, demonstrating Caco’s scalability advantage in generating diverse and verifiable new problems.

These results confirm that the improvements come not just from the use of a strong teacher, but from the Caco generation procedure itself, which promotes data diversity, verifiability, and scalability.

W-5: Quality: The proposed procedure shares some similarities with STaR/ReST, which are much simpler methods. Ideally one iteration of STaR would be included as a baseline to confirm that the gains obtained from Caco are indeed due to leveraging the code representation. I discuss this concern in more detail below on the “Questions” section.

Thank you for the valuable suggestion. We would like to clarify that Caco’s primary focus is not on improving answers for a fixed set of questions (as in STaR or ReST), but rather on generating a much larger and more diverse set of new problems through code-based reasoning traces. Unlike STaR, Caco does not follow a traditional self-improvement paradigm. Rather than using model-generated answers to improve the same model, we train a dedicated code generator to produce executable CoTs from scratch—shifting the focus from answer refinement to scalable, verifiable data creation.

That said, we agree that the two directions are not mutually exclusive. In fact, we believe that applying STaR-style self-improvement on top of the more diverse prompts generated by Caco could yield further gains.

To directly address your suggestion of a single iteration of self-improvement as a control, we conducted the following experiment:

  • Using DeepSeek-Math-7B, we generated 4 CoTs for each of the 109K seed examples.
  • After filtering incorrect answers, we retained 363K verified solutions, from which we randomly sampled 300K (DS-7B-Seed-self-improve) for comparison.
  • We compared this to 300K randomly sampled data from the full Caco dataset (DS-7B-Caco-300K).

The results are as follows:

Dataset | #Samples | MATH | GSM8K | College | DM | Olympiad | TheoremQA | AVG
DS-7B-Seed-self-improve | 300K | 53.1 | 86.7 | 41.6 | 62.5 | 19.3 | 30.2 | 48.9
DS-7B-Caco-300K | 300K | 61.8 | 83.2 | 43.3 | 76.0 | 23.9 | 31.1 | 53.2
Qwen-7B-Seed-self-improve | 300K | 70.7 | 83.0 | 47.1 | 47.6 | 39.0 | 28.2 | 52.6
Qwen-7B-Caco-300K | 300K | 81.6 | 92.4 | 51.2 | 84.8 | 42.5 | 44.9 | 66.2

We observe that the Caco-based dataset consistently outperforms the STaR-style self-improvement baseline across all evaluated benchmarks. We hope this additional baseline helps clarify the distinction and effectiveness of our approach.

Minor Points

These are some suggestions for improving the paper (but have not strongly influenced my score):

  • The 65.4% refers to the average accuracy of Caco-1.3M-Qwen2.5-Math-7B across the six benchmarks in the main experiment. We will revise it to: “65.4% average accuracy on six benchmarks in Section 4.2.”
  • Thank you for the reminder. We will discuss both forms of augmentation in Section 3.3 to better intuitively show the advantages of our method.
  • Noted, we will move Section 3.4 to the appendix.
  • Caco-100K is a subset of more difficult problems sampled from the full Caco dataset. We assess difficulty based on the average pass rate over three generations using Qwen2.5-Math-7B-Instruct.
  • Yes, the experiment enables chain-of-thought reasoning when generating answers, and then trains on these longer traces. Our motivation is that test-time scaling methods like system-2 prompting can further benefit from our data, especially on extremely hard problems, e.g. AIME. We have added a comparison between the short-CoT and long-CoT fine-tuned models in the NsmM-W2 table.
Comment

Thank you very much for the updated set of results—these are very impressive and address my concerns about distillation and self-improvement. I am happy to increase my score.

Comment

Dear Reviewer,

Thank you for your thoughtful review and constructive feedback. We're delighted that our updated results addressing distillation and self-improvement concerns met your expectations, and we truly appreciate your decision to increase the score.

Your insights have been invaluable.

Best regards, Authors

Comment

Dear Reviewer,

We would like to kindly remind you to review our rebuttal for our submission. As the rebuttal phase is concluding, we hope to address any remaining concerns and clarify our points. Your feedback is highly appreciated.

Thank you very much for your time and consideration.

Best regards, Authors

Official Review
Rating: 4

This paper proposes Caco (Code-Assisted Chain-of-ThOught), a novel framework designed to automate the synthesis of high-quality, verifiable, and diverse instruction-based Chain-of-Thought (CoT) reasoning data through code-driven augmentation. This framework can improve the scalability and generalizability of the code-assisted reasoning method, thereby enhancing the reasoning capabilities of large language models.

The key contributions are:

  1. The authors proposed Caco, a scalable code-assisted CoT and instruction generation framework designed to automate the production of high-quality reasoning training data through code-anchored refinement.
  2. The authors created Caco-1.3M, a large-scale synthetic dataset containing millions of high-quality reasoning traces
  3. Through extensive experiments on multiple mathematical reasoning benchmarks, the authors demonstrate the effectiveness and advantages of Caco over existing approaches.

Strengths and Weaknesses

Strengths:

  1. High-quality framework proposal: The paper proposes an original and novel framework called Caco, which effectively addresses the limitations of traditional CoT methods on the code-assisted reasoning problem, such as uncontrolled generation, insufficient quality, and limited diversity in reasoning paths.
  2. Clear framework description: The paper provides a detailed introduction to the overall Caco framework. Each component of Caco is elaborated with specific methods and processes.
  3. Providing high-quality training data: The Caco framework generates a large-scale dataset, Caco-1.3M, which offers high-quality LLM reasoning training data.

Weaknesses

  1. Lack of clarity in the Introduction: The main contributions of the paper are not clearly explained in the introduction.
  2. Inadequate problem formulation: The paper lacks a clear problem formulation. It is essential to explicitly define and describe the code-assisted reasoning problem, using appropriate notations to help readers understand the whole framework.

Questions

  1. Why is your method not evaluated on the AIME dataset?
  2. Code-assisted methods primarily address incorrect calculations in reasoning trajectories. Does the Caco framework specifically aim to reduce calculation errors? It would be beneficial to analyze whether improvements arise from other factors, such as Caco's influence on the reasoning patterns and behaviors of LLMs.
  3. The introduction mentions the length inefficiency problem. Have you compared the efficiency metrics, such as response length and generation time, of Caco with those of other baseline methods?
  4. Suggestion: Clearly articulate the main contributions of your work, including the dataset generation and any novel mechanisms introduced in Caco.
  5. Suggestion: Complete the problem formulation section to enhance clarity.

Limitations

Yes

Justification for Final Rating

Thanks for the response.

Formatting Concerns

No paper formatting concerns

Author Response

We truly appreciate the time and effort you dedicated to reviewing our work. Below, we provide detailed responses to your concerns, and we hope this clarifies our design decisions and highlights the key contributions of our approach more clearly.

W-1: Lack of clarity in the Introduction: The main contributions of the paper are not clearly explained in the introduction. Q-4: Clearly articulate the main contributions of your work, including the dataset generation and any novel mechanisms introduced in Caco.

Thank you for pointing out the need for clearer articulation of our contributions. In the revised version, we will restructure the introduction to explicitly present the main contributions and novel mechanisms introduced by Caco. Below, we summarize these points clearly:

We propose Caco, a unified and scalable framework for generating large-scale, high-quality, and verifiable chain-of-thought (CoT) reasoning data by treating code as the central abstraction of logical reasoning. Unlike prior methods that begin with natural language questions and derive CoTs, Caco reverses this pipeline: it first generates executable code-based reasoning traces (Code CoTs) and then reverse-translates them into diverse, verifiable question-answer pairs. This inversion unlocks key benefits in scalability, verifiability, and cross-domain diversity.

To support this, we train a code-based CoT generator, which learns from a high-quality seed dataset and is capable of sampling millions of reasoning chains. This approach significantly improves the efficiency and diversity of data generation compared to prior methods that rely heavily on manual templates or language-only augmentation.

The key contributions of Caco are:

  • Closed-Loop Generation with Built-in Verifiability. Caco introduces an end-to-end pipeline combining generation, validation, and translation. The process guarantees correctness by design through code execution checks, AST-based structural filtering, and answer consistency verification. This makes every QA pair self-consistent, verifiable, and ready for downstream use, including reinforcement learning.

  • Scalable Data Generation via a Trained Code Generator: We train a code-based CoT generator on a small set of verified seed examples. This model can efficiently sample millions of high-quality reasoning traces, enabling scalable and low-cost dataset creation without manual templates or annotations.

  • Code-Centered Augmentation for Diversity and Generalization

    Caco uses code as a modular reasoning template, supporting two types of augmentation:

    • Pattern-level augmentation: sampling novel reasoning strategies beyond seen templates.
    • Problem-level augmentation: translating a single code trace into multiple linguistically distinct questions.

    This improves data diversity and model generalization across tasks and formats.

  • Cross-Domain Reasoning Unification

    By representing both mathematical and algorithmic reasoning in Python code, Caco provides a shared reasoning format across domains. This promotes data source diversity, facilitates cross-domain learning, and broadens the applicability of the framework.

W-2: Inadequate problem formulation: The paper lacks a clear problem formulation. It is essential to explicitly define and describe the code-assisted reasoning problem, using appropriate notations to help readers understand the whole framework.

Q-5: Suggestion: Complete the problem formulation section to enhance clarity.

1. Problem Definition

  • Let the problem space be $\mathcal{P}$, with each problem denoted as $p \in \mathcal{P}$ and its correct answer as $a^*$.
  • Define the space of Code-based Chain-of-Thoughts (Code CoTs) as $\mathcal{C}$, where each executable reasoning program $c \in \mathcal{C}$ produces the correct answer via Python execution: $F(c) = a^*$, where $F(\cdot)$ denotes running $c$ in a Python environment.
  • Our goal is to generate diverse, verifiable new problems $p'$ and answers $s'$ from a small set of seed problems, with each $p'$ linked to a validating program $c'$.

2. Core Functions

  • We use a general language model $G$ for conditional generation:

    $G_{p \to c}: \mathcal{P} \to \mathcal{C}, \quad G_{c \to p}: \mathcal{C} \to \mathcal{P}, \quad G_{p \to s}: \mathcal{P} \to \mathcal{S}$

    where $G$ can be any sequence-to-sequence model.

  • An unconditional generator $U$ samples candidate programs:

    $U: \varnothing \to \mathcal{C}$

  • $F$ is a code executor that filters valid programs through execution, AST checks, and output consistency:

    $F(c) = a^*$

3. Conventional Method and Its Limitations

The common data synthesis approach:

  • Rewrite $p$ into $p'$: $\{p'\} = \{G_{p \to p'}(p)\}$
  • Generate answer: $\{p', s'\} = \{G_{p \to s}(p')\}$

Limitations:

  1. $p'$ depends on $p$, limiting diversity.
  2. No guarantee of solvability of $p'$ or correctness of $s'$.

4. Caco: A Code-Driven Framework

Caco generates new problems by sampling new code instead of rewriting existing problems:

  1. Seed Collection and CodeGen Fine-Tuning

    $\mathcal{C}_{\rm seed} = \{G_{p \to c}(p) \mid F(c) = a^*\}$

    Fine-tune $U$ on $\mathcal{C}_{\rm seed}$ to model valid code distributions.

  2. Sampling New Programs and Reversing to Questions

    $\{(p', c')\} = \{G_{c \to p}(c') \mid c' \sim U\}$

    Each $c'$ is translated into a question $p'$, with its correctness validated via $F(c')$.

  3. Answer Generation and Filtering

    $\{(p', s', c')\}_{\rm valid} = \{G_{p \to s}(p') \mid G_{p \to s}(p') = F(c')\}$

    Answers are only accepted if consistent with code execution.

Benefits

  • Diversity: Code serves as an abstract, language-independent reasoning template, enabling pattern-level and problem-level augmentation.
  • Verifiability: Each problem is paired with executable logic, allowing strict automated validation.
  • Scalability: The framework supports large-scale generation via unconditional sampling and efficient filtering.
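To make the filtering step concrete, a minimal sketch of how a sampled program $c'$ might be validated (execution, a basic AST check, and answer consistency against the generated answer) is given below; the helper names are illustrative and not from the paper, and sandboxing/timeouts are omitted in this sketch:

import ast
import contextlib
import io

def passes_ast_check(code_str):
    # Structural filter: the candidate must at least parse into a valid Python AST.
    try:
        ast.parse(code_str)
        return True
    except SyntaxError:
        return False

def execute_code(code_str):
    # Run the candidate Code CoT and capture whatever it prints as its answer.
    buf = io.StringIO()
    try:
        with contextlib.redirect_stdout(buf):
            exec(code_str, {"__name__": "__main__"})  # no sandboxing in this sketch
    except Exception:
        return None
    return buf.getvalue().strip() or None

def keep_sample(code_str, generated_answer):
    # Retain (p', s', c') only if c' parses, executes, and agrees with the generated answer s'.
    if not passes_ast_check(code_str):
        return False
    executed = execute_code(code_str)
    return executed is not None and executed == str(generated_answer).strip()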

Q-1: Why is your method not evaluated on the AIME dataset?

Thank you for the question. We initially did not evaluate on AIME because our models were trained only with short-CoT supervised fine-tuning, which is insufficient to yield strong improvements on AIME without more advanced System-2 reasoning strategies.

In early experiments using Qwen-Math-base, we observed only marginal gains on AIME (20% to 23.3%)—likely due to the fact that Qwen already exhibits strong performance on math reasoning tasks. However, when we tested LLaMA-based models (as shown in the ZeGf-W2 table), we observed a more substantial gain from 0% to 10.83% (Average@8 samples), suggesting Caco’s benefits are more pronounced on weaker base models.

Inspired by the System-2 reasoning paradigm, we later conducted experiments using Caco-Long, described in Appendix C.1. This involved generating 100K long-form CoTs using Qwen3 in “thinking mode”, and yielded some improvement.

After the submission, we further realized that long-CoT learning requires higher learning rates and longer training schedules due to its distinct reasoning pattern. We therefore re-trained the model using lr=5e-5 (10× the original), with 3 epochs and an expanded dataset of 300K long CoTs. This led to further performance gains, as shown in the updated results in NsmM-W2 Table.

Q-2: Code-assisted methods primarily address incorrect calculations in reasoning trajectories. Does the Caco framework specifically aim to reduce calculation errors? It would be beneficial to analyze whether improvements arise from other factors, such as Caco's influence on the reasoning patterns and behaviors of LLMs.

We appreciate the reviewer’s interest in the underlying factors behind Caco’s performance gains. Unlike many code-assisted methods that focus specifically on reducing calculation errors, Caco does not target a specific error type.

Instead, Caco emphasizes scaling diverse, verifiable reasoning traces. During code verification, all incorrect samples—whether due to calculation mistakes, logical flaws, or incomplete reasoning—are filtered out, ensuring that only correct and executable reasoning paths are retained.

Furthermore, to better diversify the reasoning behaviors and patterns represented in the training data, we deliberately included algorithmic problems in the seed dataset. These problems require reasoning patterns that are structurally different from typical math problems, such as recursion, dynamic programming, and search algorithms. By doing so, Caco enables the model to internalize a broader set of reasoning strategies, which we believe contributes meaningfully to its improved generalization and robustness across tasks.

Q-3: The introduction mentions the length inefficiency problem. Have you compared the efficiency metrics, such as response length and generation time, of Caco with those of other baseline methods?

When we mention “length inefficiency” in the introduction, we are referring to how long chain-of-thoughts (CoTs) often introduce redundancy and increase the risk of logical inconsistencies. In many long-CoT setups, the reasoning trace can span tens of thousands of tokens, even for relatively simple problems—resulting in unnecessarily verbose answers.

Regarding your question on response length, we analyzed the length of our generated code-based CoTs, which are much more compact. On average, each CoT contains 415 tokens, with a standard deviation of 284. Compared to long-CoT baselines, this represents a significant reduction in output length, suggesting improved efficiency in both reasoning compactness and model usage.

Comment

Dear Reviewer,

We would like to kindly remind you to review our rebuttal for our submission. As the rebuttal phase is concluding, we hope to address any remaining concerns and clarify our points. Your feedback is highly appreciated.

Thank you very much for your time and consideration.

Best regards, Authors

Comment

Thank you for your response. The revised version of your paper is well-structured and demonstrates improved clarity. I will consider updating my score.

Comment

Dear Reviewer oWXi,

Thank you for your time and thoughtful review of our manuscript. We sincerely appreciate your positive feedback regarding the improved clarity and structure of our revised paper. Your constructive comments have been invaluable in strengthening our work.

We are delighted to hear that you are considering updating your score, and we would be most grateful if you could finalize this adjustment at your earliest convenience. Please don't hesitate to let us know if you require any additional information or clarifications.

Once again, thank you for your support throughout this review process.

Best regards, Authors

Official Review
Rating: 4

This paper introduces Caco (Code-Assisted Chain-of-ThOught), a framework for the automated generation of verifiable, code-based CoT data. Unlike traditional CoT methods that suffer from unverifiability issues, Caco leverages code execution to validate reasoning chains and automated back-translation to produce natural language instructions and CoTs.

Strengths and Weaknesses

Strengths

  • This paper contributes a fully automated pipeline that generates 1.3M high-quality instruction-CoT pairs without human labels.
  • The proposed method outperforms prior datasets across multiple math benchmarks.

Weaknesses

  • This paper is difficult to distinguish from some accepted works [1,2]. Both adopt the same approach of converting mathematical CoTs into code, using Python execution for filtering. This has become a common practice for CoT verification. Moreover, this paper also uses AST-based code validation, resulting in significant overlap in methodology.
  • Evaluation limited to math. Despite the claimed generality, experiments are only conducted on math datasets. There is no evidence the framework generalizes to other reasoning domains (e.g., logic puzzles, science QA).
  • Scalability claim lacks resource metrics. While Caco claims to be scalable, the paper does not report data generation time, compute cost, or comparison to other data-efficient methods.

[1] Gao L, Madaan A, Zhou S, et al. Pal: Program-aided language models[C]//International Conference on Machine Learning. PMLR, 2023: 10764-10799.
[2] PaD: Program-aided Distillation Can Teach Small Models Reasoning Better than Chain-of-thought Fine-tuning (Zhu et al., NAACL 2024)

Questions

  1. The proposed method is only evaluated on mathematical and algorithmic tasks. It remains unclear whether Caco's code-assisted CoT generation and filtering pipeline generalizes to other domains (e.g., logic puzzles, scientific reasoning, or commonsense QA). Have the authors considered extending their framework beyond math-centric problems?
  2. The paper does not report the proportion of generated code-CoTs that fail execution or are filtered out due to semantic or structural issues. What is the rejection rate during data generation? Understanding how much noisy or invalid data is discarded would help assess the efficiency and robustness of the pipeline.

Limitations

See the Questions and Weaknesses sections

Justification for Final Rating

The authors' clarifications effectively addressed my concerns regarding generalizability and evaluation scope, and the additional results are very promising. I encourage the authors to include these promising findings and discussions in the revised version.

Most of my concerns have been resolved, and I am inclined to raise my score.

Formatting Concerns

I did not find any major formatting issues.

Author Response

Thank you very much for your valuable feedback and thoughtful questions. We greatly appreciate your efforts in reviewing our paper. Below, we provide our detailed response to your concerns.

W-1: This paper is difficult to distinguish from some accepted works [1,2].

We appreciate the reviewer’s comment and acknowledge the growing popularity of code-assisted reasoning in recent literature. As noted, we reference relevant prior works [1,2] in Section 2.2 “Code Integration for Enhanced Reasoning”, which primarily use Python code for answer verification—not for data generation.

In contrast, our work reframes code as an abstract representation of reasoning patterns and logical templates for entire problem families. Specifically:

  • Caco treats problem generation as code generation: Instead of verifying generated problems with code (as in [1,2]), we first generate executable Code CoTs, then reverse-translate them into diverse and verifiable QA pairs. This reflects a fundamentally different motivation and methodology.
  • This shift enables several key benefits:
    • Scalability: By fine-tuning on a small seed set, we sample millions of structurally valid and semantically rich Code CoTs, automatically filtered without human annotation.
    • Diversity: Code abstracts reasoning beyond natural language, enabling generalizable problem-solving patterns
      1. Through pattern-level augmentation (e.g., as illustrated in Figure 3), the model generates novel reasoning strategies beyond template memorization.
      2. Each code trace can yield multiple natural language questions via reverse translation, increasing linguistic and topical diversity.
    • Verifiability: Executable code enables automatic verification through execution, AST analysis, and answer consistency checks, ensuring high-quality reasoning data.
    • Cross-Domain Unification via Code: Unlike prior math-focused work, we unify multiple domains (math, algorithms, etc.) into a shared code-based format, greatly expanding data diversity and applicability.

We will clarify these distinctions in the revised version and ensure both works are properly cited. For detailed definitions and contributions, please also see our W1 and W2 responses to Reviewer oWXi.

W-2 & Q-1: Evaluation limited to math. Despite the claimed generality, experiments are only conducted on math datasets. There is no evidence the framework generalizes to other reasoning domains (e.g., logic puzzles, science QA).

Thank you for the insightful question. While our primary experiments focused on mathematical reasoning, Caco is fundamentally a general-purpose framework for structured, code-based reasoning, and is applicable to domains that exhibit logical, symbolic, or programmatic structure—such as logic puzzles, scientific reasoning, and procedural tasks.

Logic Puzzles

Caco naturally extends to logic-based tasks like puzzles and rule-based games. In many of these problems (e.g., arithmetic expression puzzles, countdown problems), the underlying logic is shared across instances and can be encoded as parameterized templates. For example, in the countdown puzzle [1], we can reuse the same program structure and vary only the inputs:

import random

def generate_expression_sample(min_len=3, max_len=6):
    # Randomly draw 3-6 operands in [1, 100] and one operator per adjacent pair.
    ops = ['+', '-', '*', '/']
    count = random.randint(min_len, max_len)
    nums = [random.randint(1, 100) for _ in range(count)]
    ops_list = [random.choice(ops) for _ in range(count - 1)]
    # Interleave operands and operators into a single arithmetic expression.
    expr = ' '.join(str(num) + ' ' + op for num, op in zip(nums[:-1], ops_list)) + f' {nums[-1]}'
    # The evaluated expression serves as the verifiable ground-truth answer.
    return {'numbers': nums, 'expression': expr, 'result': eval(expr)}

if __name__ == "__main__":
    for i in range(3):
        s = generate_expression_sample()
        print(f"Sample {i+1}:\n  Numbers: {s['numbers']}\n  Expression: {s['expression']}\n  Result: {s['result']}\n")

This aligns with Caco's central idea: code abstracts problem logic more compactly than natural language, enabling batch generation of diverse problem instances from shared reasoning templates. As such, logic puzzles are a natural fit for Caco’s generation and filtering pipeline.

Scientific Reasoning

To further evaluate Caco pipeline's generalization, we applied the framework to scientific QA tasks using the MegaScience dataset. After filtering out math-related instances, we retained 5.2K science reasoning samples as seeds. Using Caco’s pipeline:

  • We sampled 50K Code CoTs from the CodeGen model fine-tuned on the 5.2K seed CoTs.
  • After filtering, we obtained 37K valid science QA samples.

When finetuned on these, the model showed strong performance improvements across several scientific benchmarks:

Model | AGIEval | ARC-c | MMLU-STEM | Average
LLaMA-MegaScience-5.2K | 42.8 | 78.6 | 55.4 | 59.0
LLaMA-MegaScience-37K | 45.0 | 84.8 | 60.5 | 63.4

This demonstrates that Caco's pipeline generalizes effectively to scientific reasoning, provided that the reasoning logic can be encoded programmatically.

Broad Generalization Evaluation

We also benchmarked models trained with Caco-1.3M data across a wide range of tasks using the OpenCompass framework:

  • Math (hard): AIME2024
  • Code generation: HumanEval-Plus
  • Science QA: ARC-c
  • Logic puzzles: BBH, KorBench
  • General knowledge & science: AGIEval

Except for AIME (8-sample majority voting), all evaluations used greedy decoding.

Model | AGIEval | AIME2024 | HumanEval+ | ARC-c | BBH | KorBench | Average
Qwen-math-base | 42.5 | 20.0 | 12.8 | 72.2 | 19.9 | 39.7 | 34.5
Qwen-Caco-1.3M | 53.3 | 23.3 | 53.1 | 81.4 | 65.1 | 47.1 | 53.9
LLaMA-base | 28.5 | 0.0 | 32.3 | 79.0 | 19.8 | 23.8 | 30.6
LLaMA-Caco-1.3M | 46.5 | 10.8 | 34.2 | 83.1 | 33.8 | 44.1 | 42.1

These results confirm that Caco-trained models generalize well beyond math, showing strong improvements in logic puzzles, scientific reasoning, and code tasks, across three different model backbones.

In summary, while Caco was designed with mathematical reasoning in mind, its core mechanisms are broadly applicable to any domain where logic can be codified, and our experiments show strong generalization across diverse reasoning domains. We will clarify this in the revised version and include additional non-math evaluations to highlight its broader applicability.

W-3: Scalability claim lacks resource metrics.

We appreciate the reviewer’s comment regarding the need to quantify the efficiency of our proposed method. To address this, we report the full computational cost for generating the Caco-1.3M dataset. All experiments were conducted on a machine with 8× NVIDIA A100 GPUs.

Stage | # Samples | Time (Hours)
Unifying Code CoTs | 339K | 2
Scaling Code CoTs | 5.3M | 8
Question Reversal | 4.6M | 6
Answer Generation | 4.6M | 38.5
Total (for 1.3M valid data) | - | ~55
  • Low Cost and Accessibility: First, the pipeline uses only open-source models, avoiding reliance on costly proprietary APIs.

  • Efficiency and Scalability: Generating 1.3M verified samples in under 55 hours demonstrates strong scalability, with most steps parallelizable.

  • Minimal Overhead over Standard Pipelines: the majority of the cost comes from answer generation—which is unavoidable in any instruction tuning or self-improvement setup. Prior works such as STaR [1] also incur similar or higher costs in the solution generation phase, especially when multiple samples per prompt are required.

    The additional steps specific to Caco (e.g., Code CoT generation and question reversal) are relatively lightweight (combined ~16h). This is because natural language solutions are significantly longer than questions or Code CoTs.

  • Clarification on Focus: While we acknowledge the value of data-efficient methods, Caco’s focus is on scalable, diverse, and verifiable reasoning data, not minimal data volume. Our aim is to enable generalization across domains by generating rich reasoning trajectories at scale.

[1] Zelikman, E., Wu, Y., Mu, J., & Goodman, N. (2022). STaR: Bootstrapping reasoning with reasoning. Advances in Neural Information Processing Systems.

Q-2: The paper does not report the proportion of generated code-CoTs that fail execution or are filtered out due to semantic or structural issues.

As mentioned in Line 171 and Line 193:

“5.3M Code CoT samples. After filtering, we retained a high-quality subset of around 4.6M executable and structurally valid programs.”
“After the processes, it yields approximately 1.3M validated instruction-answer pairs...”

Therefore, the rejection rate during code generation is approximately 13%, and the rejection rate from code to question–answer pair generation is about 71%. This may seem like a high failure rate, but it reflects our deliberately strict filtering: any failure during the code-to-question or question-to-answer generation steps is detected and filtered out.
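For reference, these rates follow directly from the retained counts above: 1 − 4.6M/5.3M ≈ 13% of sampled Code CoTs are rejected during code filtering, and 1 − 1.3M/4.6M ≈ 72% of the surviving programs are rejected during question–answer construction.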

This also suggests that other data generation methods that do not include verification mechanisms may produce a large amount of incorrect or noisy data.

To present this more clearly, we will add a figure showing the retention rate at each stage in the final version of the paper.

Comment

Dear Reviewer,

We would like to kindly remind you to review our rebuttal for our submission. As the rebuttal phase is concluding, we hope to address any remaining concerns and clarify our points. Your feedback is highly appreciated.

Thank you very much for your time and consideration.

Best regards, Authors

Comment

Dear Reviewer ZeGf,

We hope this message finds you well. As the rebuttal period is drawing to a close, we wanted to kindly follow up regarding our submitted rebuttal. We would greatly appreciate your review of our responses at your earliest convenience.

Please rest assured that we remain fully available to address any additional questions or concerns you might have. Your valuable feedback has been instrumental in improving our work, and we're committed to ensuring all outstanding issues are properly clarified.

Thank you again for your time and thoughtful consideration throughout this process.

Best regards, Authors

Comment

Thank you very much for the detailed and thoughtful response. Your clarifications effectively addressed my concerns regarding generalizability and evaluation scope, and the additional results are very promising. I encourage the authors to include these promising findings and discussions in the revised version.

Most of my concerns have been resolved, and I am inclined to raise my score.

Comment

Thank you for your recognition. In the revised version, we will further elaborate on the differences between our Caco and existing approaches. Additionally, we will include results of Caco on other reasoning domains and report the computational cost associated with generating the Caco dataset.

Final Decision

This paper proposes a novel framework, Caco (Code-Assisted Chain-of-ThOught), which automates the generation of high-quality, verifiable, and diverse reasoning data. Caco first fine-tunes a code-based CoT generator on existing math and programming solutions in code format, and then scales data generation to a large volume. Experiments demonstrate that models trained with Caco achieve competitive performance.

For the revision, the authors should provide a clearer justification of the contributions compared to existing code-aided reasoning approaches and include further comparisons with similar methods. Additional details are also needed, such as the problem formulation, dataset construction process and cost, as well as clarifications on experimental settings.