Style over Substance: Distilled Language Models Reason Via Stylistic Replication
We show that language models distilled from reasoning models primarily mimic stylistic patterns rather than internalize deeper reasoning capabilities.
Abstract
Reviews and Discussion
This paper studies the role of reasoning traces in training language models for reasoning tasks. Building on prior work that highlights the effectiveness of detailed reasoning steps, the authors introduce two datasets: emergent reasoning traces and synthetic reasoning traces, both curated to produce correct final answers. Experiments show that language models trained on these datasets perform similarly. Moreover, models trained with synthetic reasoning traces that include incorrect final answers still show improved performance, suggesting that the structure of reasoning, rather than correctness alone, may be beneficial.
Reasons to Accept
The paper tackles an important and timely question in LLM reasoning—what elements of reasoning traces contribute most to model performance. The proposed dataset construction methodology is clear, and the finding that structured reasoning (even with incorrect answers) can improve model performance is intriguing and potentially useful for future prompting and training strategies.
Reasons to Reject
The central claims are not sufficiently substantiated. The similarity in performance between the two datasets does not conclusively demonstrate that the synthetic structure is the key contributing factor. Moreover, emergent traces still slightly outperform synthetic ones, which may point to the importance of diversity and naturalness in reasoning rather than structure alone. The paper would benefit from deeper analysis and stronger theoretical or empirical justification for its conclusions.
Questions for the Authors
How do you justify that style replication is the key contributing factor, given that the synthetic traces (ST-HC) still underperform compared to the emergent traces (ST)?
Additionally, have you experimented with using other teacher models besides GPT-4o to generate synthetic reasoning traces?
The authors have addressed my concerns by providing additional experiments.
Thank you for your detailed feedback!
How do you justify that style replication is the key contributing factor, given that the synthetic traces (ST-HC) still underperform compared to the emergent traces (ST)?
The reviewer is correct in pointing out that the results are not exactly the same between ST and ST-HC.
If we were to think of the ST reasoning traces as an ideal distribution, then the replication of ST in ST-HC is inherently an approximation. This approximation is not perfect, as can be seen in the lower performance of ST-HC. Still, while the synthetic traces do indeed consistently underperform the emergent ones, the gap between them is quite narrow, indicating that ST-HC is a good approximation of emergent reasoning.
To lend additional credibility to our argument, we add another baseline to our paper, with the goal of showing the impact of the particular style replication we propose versus distilling from regular CoT. To do this, we instruct the same LLM used for our synthetic data generation to think step-by-step (SBS), following [1], without utilizing anything proposed by us (e.g., pivots). We then use the resulting CoT for distillation the same way we use the emergent (ST) and synthetic (ST-HC) reasoning traces.
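For concreteness, below is a minimal, illustrative sketch of this data-construction step (not our exact pipeline). `teacher_generate` is a placeholder for any callable wrapping the teacher model's chat API, and the prompt/completion record format is a generic stand-in for our actual training format.

```python
# Illustrative sketch of building SBS distillation data (not the exact pipeline).
# `teacher_generate` is a placeholder: any callable mapping a chat message list
# to the teacher model's text response.
import json

SBS_SUFFIX = "Let's think step by step."  # zero-shot CoT trigger from Kojima et al. (2022)

def build_sbs_example(question, teacher_generate):
    """Query the teacher with a step-by-step prompt and package the result
    as a generic prompt/completion pair for supervised fine-tuning."""
    messages = [{"role": "user", "content": f"{question}\n\n{SBS_SUFFIX}"}]
    cot = teacher_generate(messages)  # teacher's chain-of-thought plus final answer
    return {"prompt": question, "completion": cot}

def build_sbs_dataset(questions, teacher_generate, out_path="sbs_distill.jsonl"):
    """Write one JSONL record per question; an analogous loop can produce
    ST/ST-HC records by swapping in the corresponding trace source."""
    with open(out_path, "w") as f:
        for q in questions:
            f.write(json.dumps(build_sbs_example(q, teacher_generate)) + "\n")
```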
Here is the updated results table with SBS for Llama 3.2, Ministral, and Qwen2.5:
| Model | Variant | Params | MATH500 | AIME2024 | GPQA (D) |
|---|---|---|---|---|---|
| Llama 3.2 | Base | 3B | 36.4 | 6.7 | 26.3 |
| | SBS | 3B | 45.8 | 10.0 | 28.3 |
| | ST | 3B | 68.4 | 23.3 | 31.3 |
| | ST-HC | 3B | 64.2 | 16.7 | 29.3 |
| Ministral | Base | 8B | 52.8 | 10.0 | 28.8 |
| | SBS | 8B | 60.6 | 16.7 | 31.3 |
| | ST | 8B | 78.2 | 33.3 | 38.9 |
| | ST-HC | 8B | 77.0 | 33.3 | 34.8 |
| Qwen2.5 | Base | 32B | 76.8 | 16.7 | 49.0 |
| | SBS | 32B | 78.2 | 20.0 | 49.5 |
| | ST | 32B | 89.0 | 53.3 | 56.1 |
| | ST-HC | 32B | 83.4 | 46.7 | 53.0 |
SBS, while beneficial compared to the base model, underperforms ST/ST-HC by a large margin. This demonstrates that the 'structure' adopted for ST-HC is key to achieving improved performance. We hope this addresses the reviewer’s concerns and will update our paper accordingly.
Additionally, have you experimented with using other teacher models besides GPT-4o to generate synthetic reasoning traces?
For teacher models with instruction-following capabilities similar to GPT-4o, we did not find the choice of model to have an impact on performance.
We again thank the reviewer for taking the time to thoroughly review our paper.
References
[1] Kojima, T., et al. (2022). Large language models are zero-shot reasoners. Advances in neural information processing systems, 35, 22199-22213.
Thank you for your response. The additional table shows that having a good style is important for improving performance. It would be great if the authors could include a wider range of style patterns tested across various reasoning models.
With the end of the discussion period approaching, we wanted to reach out and ask the reviewer whether they believe our response and new experiments sufficiently address their concerns. Please let us know if there is anything else we may clarify.
In this study, the authors investigate distillation through finetuning on data generated by strong LLMs. The authors start with reasoning traces generated by RLMs and find that successful reasoning traces consistently align with the cognitive stages of human reasoning while employing multiple pivot types (realization, verification, and exploration). The authors call these "stylistic patterns". Based on this analysis, they craft a prompt to guide a strong non-reasoning LM (GPT-4o) to generate solutions that adhere to the style of RLMs' reasoning traces. The authors find that small LMs finetuned on these synthetic data significantly improve performance across multiple tasks, even when the training traces lead to incorrect answers.
Reasons to Accept
- The paper is clear and well-written.
- The authors perform comprehensive experiments, comparing models of different sizes and families.
- The authors test performance on multiple datasets requiring different types of reasoning.
- The authors are the first to clearly define the stylistic patterns of reasoning traces, demonstrating the importance of form over substance.
- The paper deepens our understanding of the reasoning abilities of LLMs and offers a way to transfer them to smaller models.
Reasons to Reject
The paper could benefit from additional ablation studies (see below).
My main concerns were addressed in the rebuttal.
Questions for the Authors
In Table 1, you compare models finetuned on traces generated by an RLM and on synthetic traces generated by GPT. In Table 2, you additionally ablate on wrong answers and on data without traces. However, it was previously shown that finetuning on CoT data generated by strong models (such as GPT) can significantly boost small LMs. What I'm missing is a comparison with distillation on GPT's CoT data generated by a "Let's think step-by-step" prompt. How close would it be to the ST-HC results? Is it possible to replicate the process described in Section 3.2, but with a "Let's think step-by-step" prompt, to answer the following: is the performance improvement brought by stylistic learning, or by distilling GPT's reasoning abilities? In other words, how much improvement does the prompt in Fig. 3 bring over a "Let's think step-by-step" prompt for synthetic data generation?
We thank the reviewer for their positive comments, as well as their acknowledgment that our paper deepens the community's understanding of LLM reasoning.
What I'm missing is comparing with distillation on GPT's CoT data, generated by "Let's think step-by-step" prompt. How close will it be to ST-HC results?
We agree that this additional baseline could be useful information for readers and would better substantiate the advantages of our approach. We have since conducted additional experiments on this. Here we instruct the LLM to think step-by-step (SBS), following [1]. We then use the resulting basic CoT for distillation — similarly to how we use the emergent (ST) and synthetic (ST-HC) reasoning traces.
Here is the updated results table with SBS for Llama 3.2, Ministral, and Qwen2.5:
| Model | Variant | Params | MATH500 | AIME2024 | GPQA (D) |
|---|---|---|---|---|---|
| Llama 3.2 | Base | 3B | 36.4 | 6.7 | 26.3 |
| | SBS | 3B | 45.8 | 10.0 | 28.3 |
| | ST | 3B | 68.4 | 23.3 | 31.3 |
| | ST-HC | 3B | 64.2 | 16.7 | 29.3 |
| Ministral | Base | 8B | 52.8 | 10.0 | 28.8 |
| | SBS | 8B | 60.6 | 16.7 | 31.3 |
| | ST | 8B | 78.2 | 33.3 | 38.9 |
| | ST-HC | 8B | 77.0 | 33.3 | 34.8 |
| Qwen2.5 | Base | 32B | 76.8 | 16.7 | 49.0 |
| | SBS | 32B | 78.2 | 20.0 | 49.5 |
| | ST | 32B | 89.0 | 53.3 | 56.1 |
| | ST-HC | 32B | 83.4 | 46.7 | 53.0 |
From these results, it is clear that while generating synthetic data with a standard SBS prompt is beneficial compared to the base model, there is a significant performance gap between SBS and our ST/ST-HC methods. This demonstrates that the performance improvements are indeed substantially driven by the specific stylistic patterns we identified, rather than solely by distilling GPT's general reasoning abilities. We hope this addresses the reviewer’s concerns and will update our paper accordingly.
We again thank the reviewer for taking the time to thoroughly review our paper.
References
[1] Kojima, T., et al. (2022). Large language models are zero-shot reasoners. Advances in neural information processing systems, 35, 22199-22213.
Thank you for addressing my concern. These new results make sense and further support your findings. I'm willing to further increase my score to "clear accept".
This paper argues that distilling reasoning abilities in LLMs relies more on mimicking "stylistic patterns" (e.g., structured steps, backtracking, phrases like "Let me check") than learning deep reasoning. Using two datasets—SMOLTRACES (real RLM traces) and SMOLTRACES-HARDCODED (synthetic traces mimicking style)—the authors show that models trained on synthetic style-focused traces (even with wrong answers) perform nearly as well as those trained on real RLM traces. Key experiments suggest "style replication" drives reasoning gains, not just correct answers.
Reasons to Accept
- This paper challenges the assumption that distillation requires deep reasoning. Style imitation alone boosts performance.
- The methodology of analyzing 17K RLM traces to identify cognitive stages (problem framing, exploration, verification, synthesis) and lexical pivots is well-grounded in cognitive science (Newell & Simon, 1972) and provides a structured way to define "style." The quantitative analysis in Appendix B (pivot diversity, frequencies) adds rigor.
- The experiments are smart. 1) Synthetic traces (ST-HC) with explicit style rules match real RLM performance. 2) ST-HC-W (stylistic traces with wrong answers) still improve models over baselines, proving style matters.
- The SMOLTRACES (ST) and SMOLTRACES-HARDCODED (ST-HC, ST-HC-W) datasets, along with the analysis of RLM trace styles, are valuable resources for the community.
Reasons to Reject
- Some "style" elements (e.g., verification steps) are part of reasoning itself. The paper defines style by "structural attributes such as trace length, lexical coherence, and backtracking frequency, rather than comprehension itself" (lines 38-40). However, many of these "stylistic" elements (e.g., explicit verification steps like "Let me double-check," exploration of alternatives like "What if") are arguably integral parts of a substantive reasoning process. Can you truly separate the style of verification from the act of verification?
- If models improve significantly even with ST-HC-W, what exactly are they learning? Are they learning to generate longer, more structured outputs that, by chance or by some implicit bias in the evaluation benchmarks, lead to better scores? Or are they learning a meta-cognitive scaffold that, even if filled with a final error in training, helps them structure their "thoughts" better at test time on new problems? The paper leans towards the latter but could explore this more.
- While GPT-4o is a "standard LM" compared to a specialized RLM like R1, it's still an extremely powerful model. The prompt in Figure 3 is quite detailed in instructing how to reason. How much of the "style" benefit comes from the explicit prompting versus the inherent (and significant) reasoning capabilities of GPT-4o itself, even if not "specialized" in the RLM sense? Could a much weaker model generate equally effective "stylistic" traces with the same prompt?
- Real traces (ST) come from R1, a model already optimized for reasoning. Its "style" is refined, not raw.
- The paper shows improved performance on standard benchmarks. How well would these "stylistically trained" models perform on problems requiring genuinely novel insights or reasoning patterns not captured by the identified styles? Is there a risk of overfitting to a particular way of expressing reasoning, which might be brittle?
Thank you for your insightful and positive review.
Can you truly separate the style of verification from the act of verification?
This is a great question without an easy answer. When dealing with something as complex as language, it is very difficult to definitively say that specific words or phrases relate entirely to style or entirely to substance. This mirrors long-standing debates in linguistics, from syntax versus semantics [1] to form versus meaning [2]. We believe it is beneficial to draw such a distinction in order to clarify how LLMs reason, which is what our work aims to do. We’d be more than happy to continue this discussion at the conference.
If models improve significantly even with ST-HC-W, what exactly are they learning?
Our ST-HC-W experiment (Sec 4.2) suggests a deeper learning mechanism. Models trained on ST/ST-HC/ST-HC-W internalize a structure (or “meta-cognitive scaffold”) from the preserved stylistic and structural reasoning patterns – even with incorrect final training answers – as the trace body often contains valid intermediate reasoning. We provide a deeper exploration of this in Appendix B and in our new experiments (see below).
How much of the "style" benefit comes from the explicit prompting versus the inherent (and significant) reasoning capabilities of GPT-4o itself?
To gauge how much of an impact GPT-4o’s reasoning has, we conducted an additional experiment, where we instruct the LLM (GPT-4o) to think step-by-step (SBS), following [3]. We then use the resulting basic CoT for distillation – similarly to how we use the emergent (ST) and synthetic (ST-HC) reasoning traces.
Here is the updated results table with SBS for Llama 3.2, Ministral, and Qwen2.5:
| Model | Variant | Params | MATH500 | AIME2024 | GPQA (D) |
|---|---|---|---|---|---|
| Llama 3.2 | Base | 3B | 36.4 | 6.7 | 26.3 |
| | SBS | 3B | 45.8 | 10.0 | 28.3 |
| | ST | 3B | 68.4 | 23.3 | 31.3 |
| | ST-HC | 3B | 64.2 | 16.7 | 29.3 |
| Ministral | Base | 8B | 52.8 | 10.0 | 28.8 |
| | SBS | 8B | 60.6 | 16.7 | 31.3 |
| | ST | 8B | 78.2 | 33.3 | 38.9 |
| | ST-HC | 8B | 77.0 | 33.3 | 34.8 |
| Qwen2.5 | Base | 32B | 76.8 | 16.7 | 49.0 |
| | SBS | 32B | 78.2 | 20.0 | 49.5 |
| | ST | 32B | 89.0 | 53.3 | 56.1 |
| | ST-HC | 32B | 83.4 | 46.7 | 53.0 |
As we can see, using GPT-4o as a teacher to generate SBS CoT traces provides a benefit over the base student model. There is a further, notable gap in performance between ST/ST-HC traces compared to SBS, confirming that the performance improvements are indeed substantially driven by the specific stylistic patterns we identified. We will update our paper accordingly.
Real traces (ST) come from R1, a model already optimized for reasoning.
We agree that existing reasoning models have a distinct style that emerges during reinforcement learning. As this has been found to be effective, we aim to emulate it (ST-HC).
How well would these "stylistically trained" models perform on problems requiring genuinely novel insights or reasoning patterns not captured by the identified styles? Is there a risk of overfitting to a particular way of expressing reasoning, which might be brittle?
Regarding generalization and potential brittleness, our work argues that the distilled 'style' is not arbitrary but rather embodies effective and generalizable reasoning processes. The cognitive stages we identify (problem framing, exploration, verification, synthesis) and their associated lexical pivots are fundamental to successful problem-solving, as evidenced by their emergence in high-performing RLMs and their grounding in cognitive science. Our ST-HC-W experiment, showing improvement even with incorrect final answers, strongly supports that the learned structure itself is beneficial and transferable.
We again thank the reviewer for taking the time to thoroughly review our paper.
References
[1] Chomsky, N. (1955). Logical syntax and semantics; their linguistic relevance. Language, 31, 36–45.
[2] Bender, E.M. and Koller, A. (2020). Climbing towards NLU: On meaning, form, and understanding in the age of data. In ACL, pp. 5185-5198.
[3] Kojima, T., et al. (2022). Large language models are zero-shot reasoners. Advances in neural information processing systems, 35, 22199-22213.
Dear Reviewer nWWk,
If possible, please respond to this author comment. Hearing your thoughts will be very helpful in forming my final recommendation.
Thank you, Area Chair
This paper investigates why reasoning-ability distillation works. The results show that distillation mainly benefits from stylistic patterns rather than from true reasoning abilities. To show this, the authors first analyze CoT traces with correct answers and identify a set of lexical pivots. They then use the generated synthetic data to finetune models and achieve roughly the same performance as with real CoT traces. This implies that style is key to the improvements in reasoning.
Reasons to Accept
- The idea is quite novel, the research question is important, and the insights that the authors provide are quite inspiring.
- The experimental results are strong. The performance in the synthetic training setting convincingly supports the proposed claim, and the design of the synthetic data is well motivated by the lexical pivots.
- The overall writing is clear.
Reasons to Reject
I did not find any severe issues with this work. But if possible, more qualitative study (e.g., case studies) could help further support its findings. After all, the claim in this work is quite new and might be controversial.
Besides, it would be great if the authors could provide more insights about the lexical pivots. This claim might also be controversial, and providing more insights or descriptions would help readers understand and accept this point.
We thank the reviewer for their encouraging feedback. We are glad to hear that they believe the paper is inspiring.
More qualitative study (e.g., case study) could be helpful to further support the findings of this work. Besides, it would be great if the authors could provide more insights about the lexical pivots.
While we make an initial effort to give an overview of the pivots found in emergent traces (Appendix B), we agree that additional insights into pivots could be useful to readers. We will add additional qualitative examples to the appendix, breaking down pivots by type.
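As a rough illustration of the kind of per-type breakdown we have in mind, the sketch below tallies candidate pivot phrases by type over a list of trace strings. The phrase lists are illustrative placeholders only, not our actual pivot inventory.

```python
# Illustrative sketch: tally candidate lexical pivots by type over reasoning
# traces. The phrase lists below are placeholders, not the paper's inventory.
import re
from collections import Counter

PIVOTS = {
    "realization": ["wait", "actually", "oh, I see"],
    "verification": ["let me check", "let me double-check", "to verify"],
    "exploration": ["alternatively", "what if", "another approach"],
}

def count_pivots(trace):
    """Count case-insensitive occurrences of each pivot type in one trace."""
    counts = Counter()
    lowered = trace.lower()
    for pivot_type, phrases in PIVOTS.items():
        counts[pivot_type] += sum(len(re.findall(re.escape(p), lowered)) for p in phrases)
    return counts

def pivot_profile(traces):
    """Aggregate pivot-type frequencies across a corpus of traces."""
    total = Counter()
    for trace in traces:
        total += count_pivots(trace)
    return total
```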
We again thank the reviewer for taking the time to thoroughly review our paper.
Thanks for the reply. I'll keep my positive score.
Dear Reviewer x9kt,
If possible, please respond to this author comment. Hearing your thoughts will be very helpful in forming my final recommendation. Please respond by June 10 (the end of the discussion period). Even a quick response is helpful.
Thank you, Area Chair
This paper argues that when large reasoning models are distilled into smaller models, the smaller models mainly are learning stylistic patterns rather than deeper reasoning.
Pros:
- The paper provides important insights about the nature of reasoning distillation.
- The paper provides some careful analyses.
- The paper involves the release of datasets that will be useful artifacts.
- The paper provides interesting grounding in cognitive science.
Cons:
- Reviewer nWWk raised some big-picture questions about the framing and implications (e.g., can style and reasoning truly be separated? How would these results extend to benchmarks that require more novel reasoning?). The authors gave thoughtful responses to these questions; to the extent that space allows, incorporating these thoughts into the paper would be helpful.
- Reviewer CCPR suggested some additional ablations, which the authors have run; I encourage the authors to add these to the paper.
[Automatically added comment] At least one review was discounted during the decision process due to quality.