Demystifying Long Chain-of-Thought Reasoning
Abstract
Reviews and Discussion
This paper systematically investigates the underlying mechanics of long chain-of-thought (CoT) reasoning. It conducts experiments on Qwen-7B and Llama3-8B and evaluates on diverse math benchmarks. Based on the experimental results, the paper draws several conclusions.
Update after rebuttal: Thank you to the authors for their efforts in addressing the scalability concern raised in my review.
During the rebuttal phase, the authors mentioned that they were conducting additional experiments, but did not provide concrete results at the time to support whether their findings generalize beyond the 7B/8B model scale. While I understand the limitations of academic projects, I wasn’t fully convinced by the explanation that they couldn’t run smaller models—especially since testing on models like 1.5B or 3B would have been more accessible. I’m glad to see that these experiments have now been/will be included.
I appreciate the authors' efforts and responsiveness. That said, I still feel that many of the findings are not particularly surprising, and some of the experimental analysis could be more thorough. Therefore, I will slightly improve my score to a weak accept.
Questions for Authors
NA
Claims and Evidence
The claims in this submission are supported by the experiments.
Methods and Evaluation Criteria
This work primarily aims to understand the underlying mechanisms of long-CoT reasoning through various experimental designs. Most of the experiments seem reasonable to me, but the evaluation is currently limited to a math benchmark and two models (7B and 8B). It remains unclear whether the derived findings are generalizable.
Theoretical Claims
This submission does not include any theoretical claims.
Experimental Design and Analysis
I appreciate the timeliness of this paper. However, it does not include experiments on larger (>8B) or smaller (e.g., 3B, 1.5B) LLMs or on non-math benchmarks. Therefore, I find the experiments not sufficiently sound.
Supplementary Material
I read the appendix.
Relation to Prior Work
This paper is timely and contributes to enhancing the community's understanding of long-CoT reasoning.
Missing Important References
Most related works have been referenced.
Other Strengths and Weaknesses
The paper is timely and designs a large number of experiments to explore how diverse aspects of long CoT affect reasoning. Many of the experimental observations could help improve reasoning. However, my main concern is that the paper only conducts experiments on 7B and 8B LLMs on math benchmarks. Therefore, these experimental observations may hold only for the 7B and 8B models.
Other Comments or Suggestions
Add experiments on differently sized LLMs (e.g., 1.5B, 32B, 70B) and add evaluations on more benchmarks (e.g., code).
We thank the reviewer for their thoughtful comments and for recognizing the timeliness and scope of our work.
Add experiments on differently sized LLMs (e.g., 1.5B, 32B, 70B)
We fully agree that evaluating across a wider range of model sizes is important. Experiments with both smaller (1.5B) and larger (32B) models are currently underway. While we couldn’t complete these runs in time for the rebuttal, we will do our best to report results during the discussion period or include them in the final revision.
In the meantime, we want to emphasize that:
- We already include diverse model families: our experiments use both Qwen and LLaMA, two distinct model families with different pretraining setups. This increases confidence that our findings are not limited to a single architecture.
- We cover a wide range of experimental settings: As the reviewer noted, we explore many axes—including SFT vs. RL, different reward functions, various data sources (MATH, WebInstruct), and multiple evaluation benchmarks. This breadth helps us surface consistent and robust patterns.
- Some findings are not scale-dependent: Several of our observations, such as the emergence of branching, backtracking, and self-validation, stem more from pre-training than from model scale.
We kindly note that this is an academic project, and running large-scale models like 32B presents significant infrastructure and compute challenges. While such experiments may be routine for large industry labs, they require substantial effort in academia. We deeply appreciate the reviewer’s perspective and are working hard to extend our coverage despite these constraints.
This paper investigates the mechanics of long chain-of-thought (CoT) reasoning in large language models (LLMs), focusing on how supervised fine-tuning (SFT) and reinforcement learning (RL) can enhance reasoning capabilities. Key findings include: 1) SFT with long CoTs significantly improves model performance and facilitates further RL improvements; 2) RL-driven CoT scaling can be unstable, but reward shaping, such as a cosine length-scaling reward with a repetition penalty, stabilizes and improves reasoning; 3) leveraging noisy, web-extracted data with filtering mechanisms shows promise. The study provides practical insights for optimizing training strategies to enhance long CoT reasoning in LLMs.
Questions for Authors
Can the proposed reward shaping techniques be adapted for other types of reasoning tasks beyond mathematics and STEM?
Claims and Evidence
The paper claims that supervised fine-tuning (SFT) with long chain-of-thought (CoT) data significantly improves model performance and facilitates further reinforcement learning (RL) enhancements. Evidence includes experiments showing that models fine-tuned with long CoTs achieve higher accuracy and are more amenable to RL improvements compared to short CoTs.
Methods and Evaluation Criteria
The study employs SFT and RL to train models on long CoT reasoning tasks, using datasets like MATH and WebInstruct. Evaluation is conducted on benchmarks such as MATH-500, AIME 2024, TheoremQA, and MMLU-Pro-1k.
Theoretical Claims
No Theoretical Claims.
Experimental Design and Analysis
Experiments involve fine-tuning models on long CoT data distilled from QwQ-32B-Preview and comparing them with short CoT models. The authors test different reward functions, including a cosine length-scaling reward, and evaluate their impact on CoT length and model accuracy.
Supplementary Material
The supplementary material includes detailed hyperparameters for the SFT and RL experiments, as well as additional results and analyses. The authors provide appendices with further details on the distillation process, reward function design, and evaluation setup. They also provide their source code via a link.
Relation to Prior Work
The study builds on previous work in CoT prompting and RL for LLMs, extending these techniques to long CoT reasoning.
Missing Important References
N/A
Other Strengths and Weaknesses
Weaknesses include the limited exploration of model scalability beyond 8B parameters and the potential for reward hacking in RL training.
Other Comments or Suggestions
N/A
We thank the reviewers for their positive and thoughtful feedback! Below, we address the key points raised:
Q1: “Can the proposed reward shaping techniques be adapted for other types of reasoning tasks beyond mathematics and STEM?”
Yes, the reward shaping techniques we propose, particularly the cosine length-scaling reward and repetition penalty, are designed to be general-purpose. While our experiments focus on mathematics and STEM due to the availability of verifiable signals (e.g., MATH, AIME), the core principles—encouraging correct long reasoning without over-optimization—are task-agnostic.
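To make this concrete, here is a minimal Python sketch of a cosine length-scaling reward of the kind we describe; the function name and endpoint values below are illustrative placeholders rather than the exact constants used in the paper.

```python
import math

def cosine_length_reward(is_correct: bool, gen_len: int, max_len: int,
                         r_correct=(2.0, 1.0), r_wrong=(-10.0, 0.0)):
    """Sketch of a cosine length-scaling reward (endpoint values are illustrative).

    Correct answers: the reward decays from r_correct[0] at length 0 to
    r_correct[1] at max_len, so shorter correct CoTs score slightly higher.
    Wrong answers: the penalty eases from r_wrong[0] to r_wrong[1], which
    encourages the model to keep reasoning when it has not yet found the answer.
    """
    r_start, r_end = r_correct if is_correct else r_wrong
    t = min(gen_len, max_len) / max_len  # normalized CoT length in [0, 1]
    # Cosine interpolation from r_start (t = 0) to r_end (t = 1).
    return r_end + 0.5 * (r_start - r_end) * (1.0 + math.cos(math.pi * t))
```

Because the shaping only requires a binary correctness signal and a generation length, the same interpolation could in principle be applied to any task with a verifiable final answer.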
Q2: Limited exploration of scalability beyond 8B parameters
We fully agree that evaluating across a wider range of model sizes is important. Experiments with both smaller (1.5B) and larger (32B) models are currently running. While we couldn’t complete these runs in time for the rebuttal, we will do our best to report results during the discussion period or include them in the final revision.
We kindly note that this is an academic project, and running large-scale models like 32B presents significant infrastructure and compute challenges. While such experiments may be routine for large industry labs, they require substantial effort in academia. We deeply appreciate the reviewer’s perspective and are working hard to extend our coverage despite these constraints.
Q3: Potential for reward hacking in RL
We appreciate this concern and directly addressed it in the paper. As noted in Section 4.5, we observed length-based reward hacking (e.g., via repetition) with increased compute. To mitigate this, we introduced an n-gram repetition penalty and tuned reward discounting to improve learning stability. Figure 5 shows that applying this penalty improved downstream accuracy while reducing degenerate behaviors. We also found that token-level penalties and distinct discount factors per reward signal (Section 4.6) helped control optimization dynamics and reduced unwanted artifacts.
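For reference, the snippet below is a minimal sequence-level sketch of an n-gram repetition penalty of the kind described above; the n-gram size and penalty scale are illustrative assumptions, and our actual implementation applies the penalty at the token level with its own discount factor (Section 4.6).

```python
def ngram_repetition_penalty(token_ids, n=4, scale=-0.05):
    """Sketch of an n-gram repetition penalty (n and scale are assumptions).

    Computes the fraction of n-grams in the generated sequence that repeat an
    earlier n-gram from the same sequence and returns a negative reward
    proportional to that fraction. It is added on top of the correctness /
    length-shaped reward to discourage degenerate, repetitive CoTs.
    """
    total = len(token_ids) - n + 1
    if total <= 0:
        return 0.0
    seen, repeats = set(), 0
    for i in range(total):
        ngram = tuple(token_ids[i:i + n])
        if ngram in seen:
            repeats += 1
        seen.add(ngram)
    return scale * repeats / total
```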
This paper conducts comprehensive experiments on long chain-of-thought (CoT) reasoning. It examines a variety of factors related to long CoTs, including SFT, RL, the source of the SFT data, the impact of reward design, and noisy rewards.
Questions for Authors
NO.
Claims and Evidence
The paper provides many orthogonal directions for long CoTs. The style is more like a review paper, but the messages are valuable for the field. I find some claims to be particularly useful and convincing:
- Takeaways 3.1 & 3.3: long, high-quality CoT data is much better than short CoT data.
- Takeaway 3.2: SFT initialization is beneficial for RL.
- Takeaways 4.1 & 4.4: directly constraining the context window size cannot easily stabilize RL training.
- Takeaways 4.2 & 4.5: reward shaping can be used to stabilize and control CoT length. Reward hacking may occur but can be mitigated with a repetition penalty.
However, certain results seem unsurprising and could be relegated to the appendix to leave room for more discussion of the important messages.
- Takeaway 5.1 and Table 2: the inclusion of WebIT only improves performance on MMLU, which is not surprising because MMLU bears more resemblance to the WebIT dataset. Moreover, including WebIT also decreases the models' reasoning ability. This message does not seem new given the well-known forgetting problem in fine-tuning.
Methods and Evaluation Criteria
The methods and evaluations mostly make sense.
Theoretical Claims
NO.
Experimental Design and Analysis
NO.
Supplementary Material
NO.
Relation to Prior Work
NO.
Missing Important References
NO.
Other Strengths and Weaknesses
NO.
Other Comments or Suggestions
NO.
We sincerely thank the reviewer for the very positive assessment and thoughtful comments. We're glad the key takeaways around long CoT data, reward shaping, and SFT+RL initialization were found to be useful and convincing.
Q1: On Takeaway 5.1 and the impact of WebInstruct (WebIT) data
We appreciate the reviewer’s point and would like to clarify that we view this as an important and encouraging finding, especially in the context of studying generalizability in reasoning. While existing reasoning models are often trained exclusively on math datasets, our results show that incorporating more diverse, real-world data such as WebInstruct can significantly enhance out-of-domain general reasoning performance, notably improving MMLU-Pro-1k by nearly 10 absolute points.
Although we observe a slight drop in MATH performance, the overall trend shows improved average accuracy across tasks. This suggests that query diversity in training data plays a key role in building more broadly capable reasoning models, rather than weakening them.
We agree that this topic deserves deeper investigation, and we plan to explore how to balance domain-specific precision with general reasoning strength in future work.
Q2: On presentation
We appreciate the suggestion to move some expected results to the appendix and will revise the paper to better emphasize the most important insights.
Summary
This paper studies long chain-of-thought (CoT) reasoning in large language models (LLMs), with a focus on understanding how supervised fine-tuning (SFT) and reinforcement learning (RL) influence reasoning length and effectiveness. The authors report three main findings: 1) SFT greatly improves training ease and efficiency, though it is not strictly required; 2) increasing computational resources during RL can encourage better reasoning, but targeted reward design is necessary to maintain stable growth; and 3) incorporating noisy solutions from web sources, with appropriate filtering, effectively extends RL performance, especially for reasoning tasks beyond the training distribution, such as STEM problems. These results offer clear guidelines for improving the training of models capable of extended CoT reasoning.
Decision
The paper is well-written, and the proposed idea is interesting. It is a timely and important topic. The reviewers raised several important concerns and questions, and I think the authors have overall done a good job of answering and addressing them. In response to the reviewers' feedback, the authors have significantly improved the paper and their results. As requested by the reviewers, the authors conducted additional training and evaluations on larger (Qwen2.5-32B) and smaller (Qwen2.5-1.5B) models to validate the scalability of their proposed cosine reward shaping technique. The authors could not share these results in their rebuttal but privately shared them with the AC. They show significant improvements with the cosine reward plus repetition penalty, the reward design they proposed. I recommend that the authors include those results in the final version of the paper, which will make the paper a good contribution to the ICML community. Thus, I recommend this paper for acceptance.