Beyond Accuracy: Dissecting Mathematical Reasoning for LLMs Under Reinforcement Learning
Abstract
Reviews and Discussion
This paper employs LLMs to synthesize three complementary components for mathematical questions, namely the planning skeleton, relevant knowledge, and sub-problems. By leveraging these complementary elements, along with the original questions, as input, the paper attempts to analyze diverse aspects of LLM performance in mathematical reasoning, extending beyond mere accuracy. The paper also adopts a multi-stage training approach to explore the influence of question difficulty on training efficacy. Through the augmentation of partial-solution guidance, challenging problems can be made usable for reinforcement learning (RL) training.
Strengths and Weaknesses
Strengths: The paper is well written. Analysis of mathematical reasoning beyond accuracy is a welcome topic.
Weaknesses: Overreliance on LLM-generated annotations for metrics analysis: relying on a single LLM (GPT-4.1) for annotation may embed biases and artifacts. The claim that "RL improves plan flexibility" is conflated with "SFT struggles with LLM-generated plans," leaving open whether the deficit is model- or input-specific. The absence of a direct comparison between LLM-generated and human-generated plans leaves ambiguity about the role of input quality.
Questions
- How do the multi-staged training experiments integrate within your Sparkle framework? They appear to involve distinct objectives being presented within a single research paper. It seems as if two experiments are combined to amass the contributions of this paper.
- Given that, when complementary information is used as input, a decline in performance is observed in SFT models. Does this not suggest that the instructions are of poor quality? The complementary information is generated by LLMs, which may contain toxic and distracting elements. Does the absence of human verification not undermine the effectiveness of your framework?
- The analysis framework is contingent upon the quality of the data synthesized by this LLM. This data serves as the gold standard for each aspect of reasoning, which can be brittle and biased. How do you ensure fairness and prevent bias?
- How do you justify the conduction of providing partial solutions to the RL process? Partial solutions may train models to rely on step-by-step hints, reducing autonomy—contradicting RL’s goal of fostering self-generated strategies. Why not SFT instead if you have the "gold standard" solution? Could SFT models perform better with human-authored (non-LLM) plans? The paper assumes LLM-generated plans are "valid," but human planners might structure steps differently.
Limitations
data dependency assumptions (LLM-generated gold standards) and under-explored tradeoffs in training design (e.g., partial solutions)
Formatting Issues
No
We appreciate your comments and suggestions and we address each question and provide further clarifications in detail below. Please let us know if you have any additional questions or comments and we are more than happy to discuss further.
Q1: How do the multi-staged training experiments integrate within Sparkle framework? They appear to involve distinct objectives being presented within a single research paper.
Thank you for the thoughtful question. The dual focus is intentional and essential to our study. Our work aims to understand mathematical reasoning under RL. When applying RL to a set of problems, the key knobs are (i) the components necessary to solve a problem (planning, knowledge, etc.)—captured through the SPARKLE analysis framework—and (ii) the raw difficulty of the problems, which motivates our use of multi-stage RL training. A full characterization of RL for reasoning is not possible without addressing both of these elements.
To unify these aspects, we adopt a curriculum-style two-stage RL training setup, inspired by prior works (Sec. 2), and analyze the effects of RL using the SPARKLE framework. Specifically, Stage 1 establishes a strong RL-tuned model using diverse math problems, while Stage 2 fine-tunes on a subset of the most challenging problems. For this second stage we explore three data strategies (L184-188), ranging from naive RL training with no augmentation to partial-solution augmentation. Our results show that without augmentation, outcome-based RL struggles to learn from difficult problems due to sparse or absent reward signals (Table 1). In contrast, partial-solution augmentation emerges as a simple yet effective method for enabling RL to extract training signal from these challenging cases.
The SPARKLE framework is applied consistently across models trained at each stage, allowing us to systematically analyze how different training strategies influence specific aspects of reasoning. Rather than representing distinct objectives, the RL experiments and SPARKLE framework are tightly coupled to provide a unified and comprehensive analysis of RL’s impact on LLM reasoning.
Q2.1: Given that, when complementary information is used as input, a decline in performance is observed in SFT models. Does this not suggest that the instructions are of poor quality?
We would like to clarify that our paper does not evaluate or include SFT models. To avoid confounding effects from supervised fine-tuning, we focus exclusively on comparing base models with RL-tuned models. This design complements prior studies that analyze the interplay between SFT and RL [1,2] (l.194–195).
All results involving planning and knowledge interventions (e.g., Sec. 6.3 and 6.4) are based on this Base vs. RL-tuned comparison. (summarized in l.220–237).
In Sec. 6.3, we find that providing external plans can degrade performance—particularly on the most challenging benchmark (AIME24). However, this is not due to poor instruction quality, but rather reflects a misalignment between externally provided plans and the model's preferred internal plans.
Moreover, the base model itself exhibits some instruction-following ability: for example, its performance improves with planning on GSM8K (Fig. 3, l.279–284), and with sufficient sampling (Fig. 2), it can match RL-tuned model performance.
Q2.2&Q3: The complementary information is generated by LLMs, which may contain toxic and distracting elements...Does the absence of human verification not undermine the effectiveness of your framework?
Human verification is an integral part in the benchmark construction process and has been included in this work.
We would like to clarify that we utilize a web agent to "extract a planning skeleton and summarizing key reasoning steps" (l.145-146) from existing step-by-step solutions in the test sets, instead of prompting LLM to generate a plan without reference. Moreover, we address this risk through a rigorous multi-stage annotation and verification pipeline, as described in Sec. 3.2 (l.149–153). Specifically:
- A second verification agent checks each generated annotation for correctness, coherence, completeness, and pedagogical soundness. If any aspect fails, the annotation is regenerated.
- Final validation is conducted by graduate students with advanced mathematics backgrounds to ensure that each annotation faithfully captures the required reasoning and knowledge.
In addition, to curate the training set for the second-stage RL tuning, we also involved "human verification to further filter out items with flawed solutions" (l.207-208). We believe this step is important for mitigating the potential brittleness of the GPT-4.1-based web agent.
We hope this clarifies that human verification is not only present but plays a critical role in ensuring the reliability and effectiveness of our framework. We would appreciate if you could re-evaluate our paper in light of the human verification measures we have in place.
Q4: How do you justify the inclusion of partial solutions in RL? Partial solutions may train models to rely on step-by-step hints, reducing autonomy—contradicting RL’s goal of fostering self-generated strategies.
Thank you for raising this point. Our motivation is grounded in prior work showing that curriculum-based approaches can significantly enhance the effectiveness of RL training for LLMs (see Sec. 2). However, outcome-based RL methods such as GRPO often struggle to learn from the most challenging problems due to the sparsity of positive reward signals (i.e., most sampled trajectories in a group result in incorrect answers).
This raises the question: can we still leverage these hard problems in a way that improves performance without compromising the capabilities already learned by a well-tuned RL model?
We agree that overusing partial solutions could reduce autonomy if applied indiscriminately. However, our approach is targeted and carefully scoped: partial-solution augmentation is not used as a general strategy for RL training on all problems (Stage 1), which could encourage dependency on hints. Instead, we introduce partial-solution scaffolding only in Stage 2, and only on the hardest problems that the Stage 1 RL-tuned model consistently fails to solve, to further boost reasoning performance.
Specifically, given an RL-trained model, we have explored three data strategies for the second-stage RL training (l.184-188):
- (1) Hard-Only: Training on difficult problems without any augmentation.
- (2) Mixed Difficulty: Combining easy and hard problems for exposure diversity.
- (3) Hard-Augmented: Introducing partial solutions to help the model navigate complex reasoning paths.
Our results (Table 1) show that the Hard-Augmented strategy significantly outperforms the other two, demonstrating that partial-solution scaffolding can effectively improve reasoning when the model is otherwise unable to learn from sparse feedback. In this context, partial augmentation serves as a practical curriculum tool to extend the learning frontier—rather than undermining reasoning autonomy.
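For concreteness, the three data strategies can be assembled roughly as sketched below (the record fields, the zero-pass-rate selection rule, and the `make_variants` helper are illustrative placeholders, not our exact implementation):

```python
# Rough sketch of the three Stage-2 data strategies; field names, the selection
# rule, and make_variants are placeholders standing in for the actual pipeline.
def build_stage2_data(problems, strategy, make_variants=None):
    # problems: list of dicts with "question", "solution", and "stage1_pass_rate"
    hard = [p for p in problems if p["stage1_pass_rate"] == 0.0]  # unsolved by Stage 1
    easy = [p for p in problems if p["stage1_pass_rate"] > 0.0]

    if strategy == "hard_only":        # (1) difficult problems, no augmentation
        return [p["question"] for p in hard]
    if strategy == "mixed":            # (2) easy and hard problems for exposure diversity
        return [p["question"] for p in easy + hard]
    if strategy == "hard_augmented":   # (3) hard problems with partial-solution prompts
        data = []
        for p in hard:
            data.extend(make_variants(p["question"], p["solution"]))
        return data
    raise ValueError(f"unknown strategy: {strategy}")
```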
Q5: Why not SFT instead if you have the "gold standard" solution?
We agree that a natural baseline would be to apply SFT using available solution traces. However, there are both practical limitations and empirical drawbacks that make SFT sub-optimal in our setting.
- The training set used in Stage 1 contains over 40K math problems, for which "gold standard" solutions are largely unreliable: many solution traces are noisy or incomplete. These traces vary in quality---from simplistic chain-of-thought outputs to human-written final answers. While obtaining new traces by distilling from competitive LLMs is possible, it would require substantial compute and additional human validation to ensure correctness and clarity.
- Empirically, we observe that applying SFT on noisy traces does not lead to improved performance. To evaluate this, we conducted a controlled comparison between three approaches:
| Model | AIME24 | AMC23 | MATH500 | GSM8K | OlympiadBench |
|---|---|---|---|---|---|
| SparkleRL-Stage 1 | 46.67% | 67.50% | 80.00% | 91.77% | 39.11% |
| SparkleRL-Stage 2-aug | 50.42% | 71.25% | 81.00% | 92.38% | 40.11% |
| SFT on hard problems with solutions | 15.00% | 53.44% | 70.03% | 88.30% | 30.70% |
As shown above, SFT on hard problems from the SparkleRL-Stage 1 model leads to significant performance degradation across all benchmarks. In contrast, our Stage 2-aug method yields consistent improvement.
The key distinction in Stage 2-aug (details in l.202–216) is that instead of fine-tuning on full traces that are noisy, we augment the hardest unsolved examples from Stage 1 with partial solutions in the input prompt. These partial solutions do not disrupt the model's on-policy learning, which is important for effective RL training. This preserves the benefits of RL, which encourages models to develop robust, self-directed reasoning strategies while receiving hints (partial solutions) on the challenging problems.
In contrast, SFT fine-tuning on noisy or overly prescriptive solutions can misalign with the model's learned heuristics, disrupting the internal reasoning behaviors acquired during RL training.
[1] Chu et al., SFT memorizes, RL generalizes: A comparative study of foundation model post-training, 2025.
[2] Yeo et al., Demystifying long chain-of-thought reasoning in LLMs, 2025.
This paper analyzes various interventions in RLVF posttraining on top of Qwen-2.5-Math-7B in order to better understand what limits reasoning performance: planning, knowledge, problem decomposition, or problem difficulty. The paper proposes a new dataset of 2,564 questions called SPARKLE which augments popular math benchmarks (such as GSM8K, MATH, AIME, etc.) with (1) a suggested plan for solving the problem; (2) relevant knowledge/concepts needed to solve the problem; and (3) a set of subproblems and corresponding answers that need to be solved in the course of solving the whole problem. Ablations that either include or exclude the plans, knowledge, and subproblems demonstrate that: plans harm the base model’s performance while leaving the RL-trained model performance unchanged; knowledge harms base model performance but slightly improves RL-trained model performance; and all models seem unable to reliably solve subproblems. Additionally, the paper demonstrates that continued RL training on the most difficult problems augmented with hints can lead to the best performance.
Strengths and Weaknesses
Strengths:
- Clarity: I think the paper is structured well and it very clearly lays out the dimensions that are being studied.
- Significance: understanding RL reasoning performance at a more fine-grained level—beyond just accuracy on various benchmarks—is an important and burgeoning area of research.
- Originality: The various findings with Qwen 7B are very interesting and I believe will be of interest to the RLVF community: (1) RL performance can be boosted on hard problems if partial solutions are available; (2) performance of RL-trained models seems to be limited more by knowledge retrieval ability rather than planning ability; and (3) all models struggle with solving decomposed problems.
Weaknesses:
- Quality: there are a number of unsupported or overly broad claims. I detail these in the “Questions” section below, but will call out the two most important ones:
- The paper is missing an important baseline, which is to compare to Qwen-2.5-7B-Math-Instruct. Some of the claims made in the paper about the base model may just be due to poor instruction-following capability, and it’s not clear whether the RL-trained models do better because they are better at instruction-following or because they’ve been trained with RLVF. Including an instruction-tuned baseline would help to clarify this.
- There are no error bars or statistical tests reported in the paper, and some of the claims are made based on very small differences that could be due just to statistical error.
- Clarity: there are a number of missing details that make it difficult to fully evaluate the paper. I detail these in the “Questions” section below, but most importantly, there needs to be more detail about (1) how the SPARKLE dataset was constructed, and (2) how the “partial solution guidance” works.
- Quality: only a single model (Qwen-2.5-7B-Math) is studied in the paper, which limits the generality of the claims.
Questions
Unsupported or overly broad claims
- I think the drops in performance seen by the base model might just be due to poor instruction following. I therefore think that (1) the paper should state this is the reason for the drop in base model performance, and (2) Qwen-2.5-Math-7B-Instruct should really be included as a baseline in order to disentangle whether it is the lack of instruction-tuning that leads to these drops or the lack of reasoning. My prediction is actually that including things like the plan will help Instruct quite a lot, but maybe not as much as doing RLVF (i.e. base < instruct < instruct+plan < rl rl+plan). I would need to see Instruct added as a baseline to consider increasing my score.
- Some of the claims need to be backed up by statistical tests—e.g. In the discussion in paragraph starting line 285, it’s not clear to me that the increases/decreases in performance are statistically significant. If you would like to make claims like “RL-tuned models achieve superior performance when permitted to develop autonomous planning strategies” then it is important to establish whether that superior performance is statistically significant. (Also, in that particular case it would also be helpful to clarify that you are specifically talking about AIME24). I would need to see statistical tests included throughout the paper in order to consider increasing my score.
- “Improved sampling efficiency alone cannot fully explain the observed gains” (line 267) → Why not? I didn’t understand this argument, or how the subsequent experiments disprove it.
- “RL-tuned models show greater flexibility in following different plans compared to the base model” (line 285) → how do you know they are actually following the plan rather than just ignoring it and doing their own thing?
- “Knowledge integration is a key driver of performance on harder tasks” (line 334) → I’m not sure this conclusion is correct. It is hard to know for sure without seeing the distribution of difficulty, but based on the pass@1 rates it looks like there might be very few problems at higher difficulty levels (e.g. maybe just one at difficulty 10, and three at difficulty 9?). If so, the fact that the solve rate looks high might just be the result of high variance. It would be much better to (1) report the avg@8 score (as in the other plots), (2) to include error bars, and (3) to report how many questions there are at each difficulty level, in order to make sure the interpretations are correct.
- Some of the language used in the paper is overly broad given that only a single model (Qwen-2.5-Math-7B) was used. For example, the abstract says “RL-tuned models exhibit greater robustness… than their base counterparts” (line 11), but there is only one base model examined! The language in the paper needs to be softened to clarify that Qwen 7B is the only model studied: I think this is ok because the benchmark is still interesting, but the claims in the paper need to be made more precise because without testing other models it is not possible to generalize the results in the paper to them.
Missing details
- Significantly more detail is required to explain how SPARKLE is constructed. For example, which prompts are used to produce the plans/knowledge/decomposition? What prompts are used for the verification agent? How do you ensure that the subproblems all have correct answers and are solvable? How do you determine difficulty level? This doesn’t need to be in the main text, but at least should be in the appendix. I would need to understand these details before considering raising my score.
- I don’t understand how the “partial solution guidance” works. Could you please provide an example of what the prompt looks like? How do you choose the boundaries for the “chunks”? How is it different from methods for reverse curriculum training, e.g. https://arxiv.org/pdf/2402.05808 ? (Some of the details are scattered throughout the paper, e.g. where the reasoning traces come from—it would be really helpful to include a dedicated section in the appendix restating some of those things and explaining the whole procedure in more detail). I would need to understand this better to consider raising my score.
- What is the distribution of difficulty level (i.e. how many questions at each level)? It would be helpful to include a histogram in the appendix.
- Are the results reported in Figure 6 over the entire SPARKLE benchmark? It would be helpful to be more clear about this in the captions and text.
Other questions/comments
- Please include error bars in Table 1 and all of the figures.
- In Figure 3, 4, & 5 (and in the appendix too) it would be helpful to include an “all” group which shows the average across all tasks.
- For the subproblem decomposition experiments, what is the average solve rate for the subproblems individually (not altogether)? And how many subproblems are there per problem (maybe you could include a histogram in the appendix)? It might be interesting to discuss your results in relation to https://arxiv.org/pdf/2410.01748v1 which I think has some similar findings.
- “RL tuning methodologies should prioritize high level reasoning strategies rather than detailed subproblem processing” (line 325) → couldn’t you draw the opposite conclusion, i.e. that we should include more compositional training data to improve the models’ performance on solving compositional tasks like these?
Limitations
The paper should acknowledge that it only studies a single model (Qwen) and therefore it is unclear how well the results generalize.
Final Justification
My original concerns were: (1) lack of comparison to an SFT baseline, (2) no statistical tests, (3) missing details about how SPARKLE was created, and (4) missing details about partial solution guidance. In the rebuttal, the authors reported an analysis addressing (1), have promised to add (2), and have clarified (3) and (4). Thus all my primary concerns have been addressed and I recommend acceptance.
Formatting Issues
n/a
We sincerely appreciate your thoughtful comments, detailed feedback, and support of our work! Due to word limit, we provide concise responses below. We are more than happy to elaborate further in the discussion phase.
Q1: Could the drop in base model performance stem from poor instruction-following? Please include Qwen-2.5-Math-7B-Instruct to isolate instruction tuning from reasoning.
We did not include Qwen-2.5-Math-7B-Instruct in this work because, despite its name, it is not a pure instruction-tuned model. As noted in [1], it was trained by SFT and GRPO, incorporating additional CoT and tool-use supervision. This makes it difficult to disentangle instruction tuning from more general improvements in reasoning capability.
Our focus is on comparing base pretrained vs. RL-tuned models to isolate the effect of RL on reasoning behavior. Notably, even the base model shows substantial instruction-following ability. In Fig. 2, scaling up the number of samples allows the base model to match RL-tuned model performance. On GSM8K, performance improves when planning information is provided—demonstrating plan-following capability even without SFT or RL (l.279–284, Fig.3).
We will clarify this distinction in the final version and are happy to include results for Qwen-2.5-Math-7B-Instruct if the reviewer would find it helpful.
Q2&O1: Claims like “RL-tuned models achieve superior performance...” need statistical backing. Include error bars in Table 1 and figures.
Thank you for this important suggestion. We have conducted detailed statistical analyses and updated the paper accordingly.
To support the specific claim in l.289—that “RL-tuned models achieve superior performance when permitted to develop autonomous planning strategies”—we conducted a Welch’s t-test comparing SparkleRL Stage 2-aug with vs. without planning on AIME24. Results show: t = 3.38, p = 0.0052 (< 0.05), Cohen’s d = −1.69 (large effect size). This confirms the performance difference is statistically significant, supporting our conclusion.
In addition, we now include mean ± standard deviation in Table 1 and have added error bars throughout the revised figures:
| Model | AIME24 | AMC23 | MATH500 | GSM8K | OlympiadBench |
|---|---|---|---|---|---|
| Qwen-2.5-Math-7B-SparkleRL-Stage 1 | 46.67% ± 2.11% | 67.50% ± 1.58% | 80.00% ± 0.13% | 91.77% ± 0.12% | 39.11% ± 0.21% |
| Qwen-2.5-Math-7B-SparkleRL-Stage 2-aug | 50.42% ± 0.99% | 71.25% ± 1.12% | 81.00% ± 0.27% | 92.38% ± 0.17% | 40.11% ± 0.32% |
We also ran Welch’s t-tests comparing Stage 1 vs. Stage 2-aug across all datasets:
| Task | t | p-value | Cohen’s d | Sig. |
|---|---|---|---|---|
| AIME24 | -3.81 | 0.0030 | 1.91 | Yes |
| AMC23 | -4.58 | 0.0006 | 2.29 | Yes |
| MATH500 | -7.98 | <0.0001 | 3.99 | Yes |
| GSM8K | -6.96 | <0.0001 | 3.48 | Yes |
| OlympiadBench | -6.15 | <0.0001 | 3.07 | Yes |
These results confirm performance improvements from Stage 1 to Stage 2-aug are statistically significant and robust across tasks. We will incorporate these findings and the requested error bars in the final version.
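For reference, the test above can be computed along the following lines; the run-level accuracy arrays in this sketch are hypothetical placeholders, not our actual per-run avg@8 scores:

```python
# Sketch of the significance test reported above (Welch's t-test plus Cohen's d).
# The run-level accuracies are placeholders, not the actual experimental numbers.
import numpy as np
from scipy.stats import ttest_ind

def welch_with_effect_size(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    res = ttest_ind(a, b, equal_var=False)                 # Welch's t-test (unequal variances)
    pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2.0)
    d = (b.mean() - a.mean()) / pooled_sd                   # Cohen's d with pooled SD
    return res.statistic, res.pvalue, d

stage1_runs = [0.45, 0.47, 0.48, 0.46]   # placeholder per-run AIME24 accuracies (Stage 1)
stage2_runs = [0.50, 0.51, 0.49, 0.51]   # placeholder per-run AIME24 accuracies (Stage 2-aug)
print(welch_with_effect_size(stage1_runs, stage2_runs))
```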
Q3: Why not attribute the gains to improved sampling efficiency alone?
We agree sampling efficiency can contribute to performance gains (as seen in Fig. 2). However, our goal is to explore what additional capabilities RL may confer. Key evidence:
- Plan handling: RL-tuned models exhibit greater robustness and lower variance across planning variants (Sec.6.3, Fig.3).
- Knowledge use: RL-tuned models can leverage external knowledge more effectively than base models (Sec.6.4).
These differences suggest that RL affects model behavior, not just sampling variance. This motivates our SPARKLE analysis, which decomposes reasoning across plan following, knowledge use, and subproblem solving.
Q4: How do you know the model is following the plan rather than ignoring it?
We define plan-following flexibility as the model’s ability to interpret, adapt, or selectively use planning information. All models receive identical prompts (see Fig.9), including: “Consider the following planning skeleton... you may adapt or extend this outline.”
Under this setup, RL-tuned models are more robust and consistent (Fig. 3), implying they can integrate or disregard plans as appropriate. Base models show higher variance, often underperforming when given plans—suggesting lower flexibility or misalignment.
Q5: How are difficulty levels assigned? Are there enough high-difficulty samples to draw conclusions?
We assign difficulty levels using AoPS Competition Ratings [2], which align with U.S. math competitions (e.g., AMC, AIME, USAMO, IMO). To avoid inductive bias, we deliberately do not use pass@k to assign difficulty--ensuring objective and interpretable levels.
Since SPARKLE enriches existing benchmarks without creating any new problems, our dataset inherits their natural skew--only 4 problems are level 9–10. We avoid drawing conclusions from the tail and focus on levels 1–8, where sample sizes are more reliable (e.g., 40–41 examples at levels 7–8).
At these levels:
- Knowledge augmentation improves RL-tuned performance by +4.9% (level 7) and +42.5% (level 8).
- Planning causes a −4.9% drop (level 7) but a +15% gain (level 8).
- Weighted across levels, knowledge yields an average gain of +4.53% vs. +2.50% from planning.
These trends support our claim in l.334 that knowledge integration is key to solving harder problems. We will include difficulty histograms and clarify this in the revision.
Q6: Some claims are too broad given only one model was tested.
We will revise the language to clarify this. We also include results for Qwen-2.5-32B in App. D.2, which exhibit consistent trends. We experimented with other open models (Mistral-7B, LLaMA-3.1-8B), but they fail to achieve non-trivial performance even after RL tuning (e.g., avg@32 0.2% on AIME24), making meaningful reasoning analysis infeasible. This further motivates our focus on models with strong mathematical priors.
Q7: What prompts are used for plan/knowledge/decomposition? How do you ensure correctness?
SPARKLE is constructed using a pipeline (Sec. 3.2) where GPT-4.1 (with Internet access) is given: the original question, ground-truth answer, and expert step-by-step solution.
We prompt it to generate:
- Planning skeleton: 2–3 sentence outline of the reasoning path
- Subproblem decomposition: step-by-step breakdown into well-posed, answerable subquestions
- Knowledge extraction: key definitions, facts, or theorems (retrieved or inferred)
Each annotation is then checked by a second GPT-4.1 verifier agent (prompted to assess correctness, completeness, coherence, pedagogical value). Human validators (math grad students) perform final verification.
We will incorporate more of these implementation details in the revised appendix.
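To make the pipeline more concrete, the annotate-then-verify loop works roughly as in the sketch below; `call_llm` and the prompt strings are simplified placeholders rather than the exact GPT-4.1 prompts (those will appear in the revised appendix):

```python
# Simplified sketch of the SPARKLE annotation pipeline; call_llm and the prompts
# are placeholders, not the exact agent setup or prompt text used in the paper.
def call_llm(prompt: str) -> str:
    """Placeholder for a call to the web-enabled GPT-4.1 annotator/verifier agent."""
    raise NotImplementedError

def annotate(question, answer, solution, aspect):
    # aspect is one of "planning skeleton", "subproblem decomposition", "knowledge"
    return call_llm(
        f"Given the question, gold answer, and expert step-by-step solution, "
        f"produce the {aspect}.\nQuestion: {question}\nAnswer: {answer}\nSolution: {solution}"
    )

def verify(question, annotation, aspect) -> bool:
    verdict = call_llm(
        f"Assess this {aspect} for correctness, completeness, coherence, and "
        f"pedagogical soundness. Reply PASS or FAIL.\n"
        f"Question: {question}\nAnnotation: {annotation}"
    )
    return verdict.strip().upper().startswith("PASS")

def build_annotations(question, answer, solution, max_retries=3):
    out = {}
    for aspect in ("planning skeleton", "subproblem decomposition", "knowledge"):
        for _ in range(max_retries):
            candidate = annotate(question, answer, solution, aspect)
            if verify(question, candidate, aspect):   # regenerate on failure
                out[aspect] = candidate
                break
    return out  # annotations then go to human validators for final checking
```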
Q8: Please clarify how partial solution guidance works. How's it different from [3]?
Our partial solution guidance augments selected training problems with intermediate solution chunks.
Given a solution (excluding the final answer), we heuristically split it into 4 segments (e.g., by sentence) and create variants:
- V1: question only
- V2: question + chunk 1
- V3: question + chunks 1–2
- V4: question + chunks 1–3
- V5: question + chunks 1-4
This differs from [3] in two key ways:
- We do not enforce an easy-to-hard progression or assume the model can solve subgoals in isolation.
- We train in a single stage (Stage 2) with mixed supervision levels, rather than gradually increasing difficulty.
Our method provides a clear and practical way for incorporating hard problems during RL. Even when the traces are human-written and misaligned with the model’s internal reasoning, RL-tuned models still learn useful strategies and show improved performance.
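A minimal sketch of the chunking and prompt-variant construction described above (the sentence-splitting heuristic and the prompt wording are simplified stand-ins for the actual implementation):

```python
# Sketch of partial-solution augmentation: split the reference solution (final answer
# removed) into 4 roughly equal sentence-based chunks and build variants V1-V5.
# The splitting heuristic and prompt template here are illustrative only.
import re

def make_variants(question: str, solution_wo_answer: str, n_chunks: int = 4):
    sentences = re.split(r"(?<=[.!?])\s+", solution_wo_answer.strip())
    per_chunk = max(1, -(-len(sentences) // n_chunks))   # ceiling division
    chunks = [" ".join(sentences[i:i + per_chunk])
              for i in range(0, len(sentences), per_chunk)][:n_chunks]

    variants = [question]                                 # V1: question only
    for k in range(1, len(chunks) + 1):                   # V2..V5: growing prefix of hints
        variants.append(f"{question}\n\nPartial solution:\n{' '.join(chunks[:k])}")
    return variants
```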
Q9&Q10: Is Fig 6 averaged over all tasks? Can you add an “All” group to Fig 3–5?
Yes, Figure 6 reports results over the entire SPARKLE benchmark—we’ll clarify this in the caption and text. We will also add an “All Tasks” average to Figures 3–5 and include summary plots in the appendix.
Q11: What’s the subproblem-level solve rate? How does it relate to [4]?
Thank you for this suggestion. We present the subproblem solve rate and average difficulty level for each task below:
| Task | AIME24 | AMC23 | MATH500 | GSM8K | OlympiadBench |
|---|---|---|---|---|---|
| Avg. Difficulty Level | 3.6 | 2.1 | 1.5 | 1.0 | 3.1 |
| Subproblem Solve Rate | 53.94% | 72.96% | 84.92% | 95.60% | 66.96% |
We find that subproblem solve rate drops as difficulty increases, showing that decomposition alone is not sufficient to solve complex tasks—models still struggle with intermediate steps, especially under higher cognitive load. Most problems decompose into 4–6 subproblems; we’ll include histograms in the appendix.
Compared to [4], our decompositions:
- Are semantically aligned, derived from expert step-by-step solutions.
- Span harder benchmarks beyond GSM8K (e.g., AIME, Olympiads).
- Are studied under RL, showing that solving full problems does not imply solving their subproblems (a gap unseen in prior work). This suggests RL-tuned models may solve holistically, but are brittle under explicit decomposition.
Q13: Couldn’t your results suggest we need more compositional training data?
Great question—we believe both directions are valid. However, our evidence suggests that not all decompositions are helpful. Rigid, human-written breakdowns can hurt model performance (e.g., −4.9% on Level 7), likely due to mismatch with internal solution paths.
Thus, we argue for RL methods that emphasize strategic flexibility—the ability to adapt to guidance, retrieve knowledge, and reason through difficult tasks. Compositional training remains promising, but should be aligned with the model's reasoning rather than imposing rigid scaffolds. We'll clarify this in the revision.
[1] https://arxiv.org/abs/2409.12122
[2] https://artofproblemsolving.com/wiki/index.php/AoPS_Wiki:Competition_ratings
Thank you for the extensive rebuttal! I have a few further comments/questions.
Q1: Instruction-following
I think perhaps I didn't articulate my concern about instruction-following clearly enough. In the paper, you are concerned with "[isolating] the effect of RL on reasoning behavior" (quote from the rebuttal). However, reasoning isn't the only thing that can be improved via RL—it's also possible (likely) that RL improves instruction-following capability, especially if there is no prior SFT stage. Because the paper doesn't compare to any SFT models, it's not possible to know whether the gains from RL are due to improvements in reasoning or improvements in instruction-following. Without including an SFT/instruct baseline and analyzing the extent to which RL is improving reasoning vs. just instruction following, I don't think it's valid to claim anything about reasoning per se. Perhaps Qwen-2.5-Math-7B-Instruct is not exactly the right baseline for the reasons you mentioned, but I still think including some form of SFT/instruction-tuned baseline is important in distinguishing which of your results are due to better instruction following and which are due to better reasoning. (And just to be clear I read the response to reviewer wRv5 as well and your points about SFT there, which I think are fair—my suggestion here is not to show that RL is better than SFT but to use SFT as a way to identify what improvements are due to instruction following vs. reasoning. It's not the cleanest analysis in the world but essentially what I'm thinking is: SFT - baseline = improvements due to better instruction following, and RL - SFT = improvements due to better reasoning.)
Alternatively, another way to address my concern would be to tone down the claims about reasoning in the text: e.g. if rather than saying "insights into how general-purpose (single-stage) RL affects reasoning across a wide range of problem difficulties" (line 174), you replaced "reasoning" with "performance" (and similarly for many other places in the text where the word "reasoning" is used). And similarly, if you discussed in Section 6.3 that the robustness of the RL-trained models to incorporating the plan might be due to better instruction-following ability than the base model. I think it would be a lot more interesting to include an analysis trying to disentangle instruction-following and reasoning rather than changing the writing in the paper, but of course it's more work.
(Also FWIW, I'm somewhat skeptical of drawing any strong conclusions based on GSM8K as it, and questions of similar style, have likely been included many times over in pretraining).
Q2: Statistical tests
Thanks for adding these! I guess because you're doing avg@8 the confidence are smaller than I was expecting—and it makes me much more confident about the results. I appreciate it.
Q3: Sampling efficiency
Thanks for the explanation, that's a nice insight. I think it would be helpful to readers to make this come through more strongly in the text (especially in section 6.2), as I clearly missed it.
Q4: Plan following
Thanks, that makes sense. Maybe a better way to phrase what you're looking at here is "plan integration" or "plan conditioning" rather than "plan following" per se, as "following" really implies to me something about the faithfulness of reasoning (as also brought up by reviewer wRv5).
Q6: Other models
Thanks for agreeing to revise the language, and for pointing to the 32B results (you should highlight these in the main text, I don't think they're mentioned at all!). These results are interesting especially in light of the instruction-following discussion above: all the 32B models seem to benefit from the plan information, which strengthens my belief that the robustness you see with the 7B RL-trained model is due to instruction-following improvements.
Q11-Q13: Subproblems
That's very interesting, thanks for providing that analysis! Are those numbers for SparkleRL-Stage 2?
Re: citation [4], I mentioned it not as a suggestion to show how your results are different (it's clearly a different setup along multiple dimensions), but more that their findings could possibly help explain/augment your results too. They found that questions are harder to solve in composition rather than independently, which might also be an explanation for why asking the model to solve subproblems might make the performance worse in your case—even if the model could solve each subproblem independently, it might struggle to compose them. I suppose we can't fully identify the "independent" solve rate in your case but we can roughly compute it as p^n, where p is the average subproblem solve rate (what you've reported above) and n is the number of subproblems, and compare that to the actual compositional solve rate (what was already in the paper). So based on the numbers you reported above:
| | AIME24 | AMC23 | MATH500 | GSM8K | OlympiadBench |
|---|---|---|---|---|---|
| Expected compositional solve rate | 10.8% | 51.5% | 78.2% | 95.6% | 28.8% |
| Actual compositional solve rate | 17.5% | 37.5% | 65.8% | 87.6% | 34.4% |
So from this rough analysis we can see that on AMC23, MATH500, and GSM8K you get the same results as [4]—the compositional solve rate is lower than what you'd expect than if you were solving each problem independently, perhaps suggesting something about the difficulty of solving the subproblems in composition. But interestingly on AIME24 and OlympiadBench the results go the other way! These are the hardest problems, and have probably been overfit to less in pretraining, so perhaps in this case the decomposition could help, if the model were better at solving the individual subproblems (which it seems quite poor at).
Q5, Q7-Q10
Thanks for the clarifications, these address my concerns.
Thank you very much for your thoughtful and detailed feedback. We are glad that Q2–Q10 addressed your concerns. We agree with the subtleties you described around reasoning vs. instruction-following and we will qualify/tone-down the language in the text of the draft around this. We provide additional information for Q1 and Q11–Q13 below.
Q1: instruction following
Great point! We ran the suggested SFT experiment. We performed SFT on the base model (Qwen-2.5-Math-7B) using CoT traces on hard questions, followed by RL finetuning. The results are shown below:
| Model | AIME24 | AMC23 | MATH500 | GSM8K | Olympiad Bench | Avg. |
|---|---|---|---|---|---|---|
| Qwen-2.5-Math-7B | 16.67% | 42.50% | 44.03% | 44.30% | 28.65% | 35.23% |
| Qwen-2.5-Math-7B (add. plan) | 9.58% | 30.94% | 41.45% | 46.92% | 18.85% | 29.55% |
| Qwen-2.5-Math-7B-SFT | 15.42% | 50.63% | 68.55% | 85.03% | 31.44% | 50.21% |
| Qwen-2.5-Math-7B-SFT (add. plan) | 14.17% | 50.31% | 69.08% | 88.84% | 31.78% | 50.83% |
| Qwen-2.5-Math-7B-SFT+RL | 33.33% | 64.06% | 77.33% | 91.58% | 36.74% | 60.61% |
| Qwen-2.5-Math-7B-SFT+RL (add. plan) | 36.25% | 67.19% | 79.45% | 94.47% | 39.80% | 63.43% |
What we observe is that
- Overall, as expected, SFT performs better than base, but worse than RL
- In terms of plan-following flexibility, SFT is not necessarily able to use a provided plan---just like the base model, and unlike the RL-tuned model
This finding suggests that RL improvements (at least in terms of plan use flexibility) cannot be solely attributed to improved instruction following.
We agree that this is a very interesting direction! We will add the discussion above to our draft.
Q11-13: Subproblems
Yes, these numbers are reported for SparkleRL-Stage 2. Very interesting findings, and we appreciate your detailed explanation. Thank you for sharing these observations—they are helpful for further understanding the role of problem decomposition and problem difficulty in mathematical reasoning under RL.
Thank you for running these experiments with SFT! These are very interesting, and I agree it shows that RL is bringing something to the table above and beyond simple instruction following. (I am rather surprised though that SFT+RL is so much worse than just RL, though... it is probably worth trying to understand why this is for the final version of the paper.)
I think with this all of my concerns are addressed. I am happy to increase my score.
The paper presents a framework for analysing the reasoning behaviours of LLMs along three dimensions: planning and execution, knowledge utilisation, and chain of subproblems. The authors also explore the use of curricula and solution guidance to improve the learning potential of hard problems when conducting RL for reasoning. Results demonstrate that these approaches improve the utility of hard problems during training and reveal that RL training reliably enhances two out of the three reasoning behaviours considered.
Strengths and Weaknesses
Strengths:
By not focusing on benchmark accuracy alone, the paper reveals some qualitative behaviours, which can be quantitatively measured, that are enhanced by RL finetuning. These insights provide intuitive insights into the general capabilities that translate to improved performance on reasoning tasks. The results also demonstrate that the proposed problem difficulty curriculum and solution augmentation do indeed improve learning from hard problems.
Weaknesses:
The dual focus of analysing reasoning behaviours and improving learning from hard problems is disjointed. The proposed framework is also not comprehensive, as it fails to account for aspects such as the faithfulness of reasoning. While the paper reveals that problem decomposition is not useful, this unintuitive result warrants further investigation, or could even motivate an approach to making better use of problem decomposition. The impact of additional training is not accounted for; comparisons to SFT and other offline methods such as DPO would be necessary to robustly show that it is RL training specifically that reinforces these behaviours. Though not necessarily a weakness, the findings of the paper are not particularly surprising and so their impact may be minimal.
Questions
Why does performance by difficulty level have a U-shape with knowledge augmentation?
Limitations
Limitations are discussed in an Appendix, but should feature in the main text.
Final Justification
The paper makes progress towards quantitatively evaluating aspects of reasoning models besides their final performance, but the categorisation of three particular types of behaviours seems limited and is not likely to be of significant impact in its current state.
Formatting Issues
No formatting concerns.
We appreciate your comments and suggestions and we address each question and provide further clarifications in detail below. Please let us know if you have any additional questions or comments and we are more than happy to discuss further.
W1: The dual focus of analyzing reasoning behaviors and improving learning from hard problems is disjointed.
The dual aspect here is intentional; in fact, we believe it is crucial for our overall findings. Our work is centered on analyzing mathematical reasoning under reinforcement learning (RL). When applying RL to a set of problems, the key knobs are (i) the components necessary to solve a problem (planning, knowledge, etc.) and (ii) the raw difficulty of the problems. A full characterization of RL for reasoning is not possible without addressing both of these elements.
Our Hard-Augmented Stage 2 setting is not proposed as a general solution to the reasoning challenges observed, but rather as a complementary contribution: it demonstrates a practical method for leveraging hard problems in RL training—an area prior work has found difficult due to absent reward signals. Together, our dual contributions—(1) analysis framework and benchmark of reasoning behaviors across planning and execution, knowledge utilization, and chain of subproblems; and (2) a method to incorporate hard examples—provide stronger empirical evidence for understanding the effect of RL. The availability of two distinct RL-tuned models (SparkleRL-Stage 1 and SparkleRL-Stage 2-aug) allows us to conduct more fine-grained analysis than is typically possible in existing work.
W2: The proposed framework is also not comprehensive, as it fails to account for aspects such as the faithfulness of reasoning.
We appreciate this observation. Our work focuses on three core aspects of mathematical reasoning captured by the SPARKLE framework: plan-following, knowledge utilization, and subproblem decomposition. While faithfulness is indeed an important dimension, it is orthogonal to the goals of our current study. We believe that a fine-grained and dedicated analysis of these three components is equally valuable, as they represent foundational building blocks of mathematical reasoning.
Our goal is not to exhaustively cover all aspects of reasoning—necessary trade-offs must be made to support a focused investigation with a clear central theme. We hope our work can serve as a cornerstone for future research to extend the framework, including incorporating dimensions such as faithfulness.
W3: While the paper reveals that problem decomposition is not useful, this unintuitive result warrants further investigation, or could even motivate an approach to making better use of problem decomposition.
We would like to clarify that our paper does not claim that problem decomposition is not useful. Instead, our findings reveal a more nuanced perspective on how decomposition interacts with RL-tuned models.
In our analysis of plan-following (Sec. 6.3), we observe that RL-tuned models exhibit greater flexibility in generating and executing their own plans. Interestingly, even valid human-written plans do not always yield improved performance—particularly on harder benchmarks like AIME24—because high-level plans may overlook critical edge cases. However, for simpler datasets such as GSM8K (difficulty level 1 out of 10), explicit plans consistently improve performance for both base and RL-tuned models. In these cases, planning helps reduce cognitive load by structuring the reasoning into manageable steps.
In our subproblem decomposition analysis (Sec. 6.5), we consider a problem unsolved if any subproblem is answered incorrectly. Under this strict criterion, RL-tuned models often struggle, especially when the decomposition is overly granular. Combined with our plan-following results, this suggests that fine-grained decompositions may disrupt the model’s internal reasoning flow, rather than supporting it.
We view this as a valuable direction for future research: rather than relying on human-authored decompositions, it could be more effective to design model-aligned decomposition strategies that are better matched to the model's inductive biases. However, proposing a new algorithm that leverages this phenomenon would require dedicated study and is beyond the current scope of our work.
W4: The impact of additional training is not accounted for; comparisons to SFT and other offline methods such as DPO would be necessary to robustly show that it is RL training specifically that reinforces these behaviors.
Thanks for bringing this up! We would like to clarify that our paper does not evaluate or include SFT or DPO methods by design. To avoid introducing confounding effects, we deliberately focus on comparing base pretrained models and RL-tuned models, isolating the impact of RL. This design choice complements prior studies that specifically analyze the interplay between SFT and RL [1,2] (see l.194–195).
Therefore, it is not our objective to claim that RL, rather than SFT or DPO, uniquely reinforces these behaviors. Instead, our goal is to provide a focused, fine-grained analysis of reasoning behaviors that emerge specifically from RL training.
To ensure a clean comparison, we initialize both the base and RL models from the same pretrained checkpoint (Qwen-2.5-Math-7B) and apply RL training only, without any intermediate supervised fine-tuning (SFT) or preference optimization (e.g., DPO). This setup allows us to attribute the behavioral differences directly to RL, without contamination from other training paradigms.
While comparisons to SFT or DPO are indeed important in the broader landscape, they are outside the scope of this work. Our goal is not to establish whether RL is superior to SFT or DPO, but rather to provide a fine-grained analysis of how RL shapes reasoning behaviors—an area that remains underexplored despite RL's widespread use in modern LLM pipelines.
We hope this clarifies our motivation and experimental design. We believe that narrowing the focus to RL is not a limitation but a necessary step toward understanding its unique contributions.
[1] Chu et al., SFT memorizes, RL generalizes: A comparative study of foundation model post-training, 2025.
[2] Yeo et al., Demystifying long chain-of-thought reasoning in LLMs, 2025.
Q: Why does performance by difficulty level have a U-shape with knowledge augmentation?
Thank you for this thoughtful question. The observed U-shaped trend in Fig. 11 primarily reflects the natural imbalance in difficulty levels within existing math benchmarks. Our framework is designed to augment existing problems—not to construct new ones or rebalance the dataset—so it inherits a skewed distribution, with only 4 total problems at levels 9–10. This sparsity at the top makes accuracy volatile and can visually exaggerate trends like the U-shape. We will clarify this in the revised manuscript.
However, this pattern does not contradict our central finding: knowledge augmentation is more beneficial than planning, especially for harder problems. For example:
- On level 8 (40 problems), knowledge augmentation improves performance by +42.5%, compared to a +15% gain from planning.
- On level 7 (41 problems), knowledge gives a +4.9% gain, while planning causes a −4.9% drop.
- When weighted averaged across all levels, knowledge yields a +4.53% gain, compared to +2.50% from planning. These trends remain robust outside the sparsely populated level 9–10 range.
These trends are robust across sufficiently sampled levels (1–8) and support our conclusion that external knowledge access is a key driver of performance on complex tasks (l.334).
We also emphasize that difficulty levels are assigned based on AoPS Competition Ratings [AoPS], which map to the structure of major U.S. math competitions: Levels 1–2 correspond to AMC 8 and early AMC 10/AMC 12 problems (e.g., standard middle school math), Levels 4–6 span later AMC 12, AIME, and introductory Olympiad-style problems, and Levels 7–10 include USAMO, IMO, and other hard Olympiad-level problems.
We deliberately avoid using model-performance metrics such as pass@k to define difficulty levels, as they introduce model-induced inductive bias; although pass@k filtering would make balancing the levels easier, competition-based ratings provide an objective and interpretable standard.
Thank you for your rebuttal. As it addresses several of my concerns, I have increased my score. However, given that this paper's core contribution is an analysis of reasoning models, I still find the three chosen factors to be a somewhat crude categorisation of the possible behavioural variances.
Thank you for your updated rating and follow-up comment! We would like to take this opportunity to further clarify the rationale behind our design choices and the scope of our work:
1. On the rationale behind the three dimensions
Our goal is not to comprehensively categorize all possible LLM reasoning behaviors—a task far beyond the scope of a single paper—but to focus on a core set of foundational dimensions that are widely recognized as important to mathematical reasoning. Specifically, we center our analysis on knowledge utilization [1,2,3], subproblem decomposition [4,5,6,7], and planning [8]. While we acknowledge that this selection is non-exhaustive, we chose to prioritize these crucial dimensions.
Producing a fine-grained categorization of reasoning and the factors behind it is an interesting idea--and something we have been thinking about as well! We are inspired by research in cognitive science [9–12], which identifies knowledge retrieval, planning, and decomposition as some of the most important components of human problem-solving. We will add this motivation for extra clarity in our updated draft.
2. Contributions beyond analysis framework
In addition to the analysis framework, we would also like to highlight two further contributions:
- SPARKLE benchmark, a fine-grained benchmark with 2,564 open-ended questions spanning 10 difficulty levels and 9 domains, with rigorous human annotation and validation pipeline.
- Partial-solution scaffolding, a new strategy for multi-stage curriculum RL that provides an effective way to utilize the hardest problems—often accompanied by noisy or mixed-quality solution traces—in second-stage RL training, leading to further improvements in LLM reasoning.
We appreciate your engagement and feedback! We are happy to go into more detail or answer any additional questions you may have.
References:
[1] Ko et al., Hierarchical Deconstruction of LLM Reasoning: A Graph-Based Framework for Analyzing Knowledge Utilization, EMNLP 2024
[2] Wang et al., Boosting Language Models Reasoning with Chain-of-Knowledge Prompting, ACL 2024
[3] Gao et al., Retrieval-Augmented Generation for Large Language Models: A Survey, 2024
[4] Tomer et al., Break it down: A question understanding benchmark, ACL 2020
[5] Zhou et al., Least-to-most prompting enables complex reasoning in large language models, ICLR 2022
[6] Dua et al., Successive prompting for decomposing complex questions, EMNLP 2022
[7] Khot et al., Decomposed prompting: A modular approach for solving complex tasks, ICLR 2023
[8] Wei et al., PlanGenLLMs: A Modern Survey of LLM Planning Capabilities, 2025
[9] Newell & Simon, Human Problem Solving, 1972
[10] Wason & Johnson-Laird, Psychology of Reasoning: Structure and Content, 1972
[11] Ken Manktelow, Reasoning and Thinking, 2012
[12] Jiří Dostál, Theory of Problem Solving, 2015
This paper proposes a number of methods for analyzing strengths and weaknesses of language model reasoning, and uses these methods to explore the effect of RL training (through GRPO) on reasoning in math domains. They also propose and evaluate a method for training through RL on difficult problems based on providing partial solution scaffolds.
- The first angle they explore deals with planning versus executing – they test models with and without plans being provided, and surprisingly find that providing plans can often actually hurt the performance of models, especially base (non-RL) models, presumably because models are better at following the plans that they come up with instead of provided plans.
- The second angle they study is on the ability to recall relevant background knowledge – they test models with and without relevant theorems/definitions/etc being provided in the prompt, and find RL based methods are much more strongly able to make use of provided knowledge (in fact it often hurts the base models!).
- The third angle looks into the problems with providing plans (the first angle) in more detail by exploring how well the model can solve a subproblem given the ground truth solution prefix up to that subproblem. They find that models are far better at solving overall problems than solving the subproblem sequence.
Strengths and Weaknesses
Overall I'm very positive on this paper, I think that it's an excellent fit for NeurIPS. It is beautifully written and presents a series of interesting results pertaining to a very hot area in the field right now (effects of RL on language models). The paper reads like an excellent example of good science – clear questions, well-designed experiments, beautifully presented. I'm not an expert in RL for LLMs and may not know all the related work, but given my understanding this is a solid, original contribution to the field.
Strengths
- I find the overall framework compelling. It's true that we need to move beyond accuracy based metrics, and I think that this is a nice breakdown into different aspects of the problem (planning, background knowledge, ability to execute plans).
- The dataset construction pipeline including a final check by graduate students with a math background is an excellent step of verification and one that many papers don't do. And likewise during the hard-solution curation there's a graduate student verification step to find flawed solutions.
- Section 6 is a nice presentation of a number of interesting results. The ordering and flow of the section made it quite easy to read.
- The results in 6.1 on training on scaffolded problems shows that the proposed method provides a reasonable boost. However I feel the main contribution of the paper is primarily in the other results.
- The result in 6.3 that externally provided plans decrease performance for many base/RL models over many domains (but that RL models are more robust to this) is particularly interesting – the idea that actually models are particularly good at following their own plans, and being forced into an externally provided plan can decrease performance. I appreciated the granular explanation of why that trend doesn't hold for GSM8K.
- The section 6.4 result that external knowledge harms the base model but helps the RL models was also really interesting, as was the 6.5 results on models having more trouble solving all subproblems than overall tasks.
- 6.6 was an important analysis to include for shedding a bit more light on the 6.3/6.4 result trends, showing how they vary with problem difficulty.
- The writing of Section 3.1 is excellent, each of the axes is presented quite clearly. The style works well – leading with a sentence about the more general reasoning phenomenon the axis is trying to capture then getting very concrete and specific. And the high quality writing carries through the rest of the paper. The figures are also beautiful, the paper overall has a very high level of polish.
Weaknesses
- I don't have any strong weaknesses.
- The Hard-Augmented model results (6.1) are interesting but don't feel too significant, in part because the approach requires having solutions (or partial solutions) to these problems, at which point one could just train on those solutions in the first place.
Minor edits and suggestions
- line 208: difficulty -> difficult
- Figure 6 isn't labelled as "Figure 6" anywhere
- Figure 6a caption: "stable" -> "stability"
- I wonder if 6.6 could be moved up to be right after 6.4, since it's about how 6.3/6.4 vary with difficulty? Possibly 6.5 could fit well right after 6.3, since it's a more detailed look at how things break down when solving human-provided problems. Just a thought / suggestion though, may or may not be better.
Questions
- Relating to both 6.3 and 6.5, I'm curious if you have any qualitative observations from looking at what sorts of human-provided plans (6.3) or subproblems (6.5) are difficult for the model to work with. Do you have any observations on how these plans and subproblems differ from those that the model comes up with and prefers to work with?
- Regarding my minor weakness about the Hard-Augmented results, it'd be helpful to hear a bit more about what kinds of settings motivate this, or where you see this result being useful.
Limitations
Yes.
Final Justification
Maintaining my Accept rating – my minor concerns were resolved in the rebuttal.
Formatting Issues
None
Thank you for your valuable feedback, thoughtful comments, and your support of our work! We address each question in detail below. Please let us know if you have any additional questions or comments and we are more than happy to discuss further.
Q1: Any qualitative observations on what sorts of human-provided plans (6.3) or subproblems (6.5) are difficult for the model to work with? Do you have any observations on how these plans and subproblems differ from those that the model comes up with and prefers to work with?
Yes, we made several qualitative observations while comparing model-generated reasoning with human-provided plans and decompositions.
Planning:
A common pattern is that high-level human plans often lack sufficient granularity, particularly for more difficult problems. This can lead models to overlook important edge cases, especially in mathematically nuanced settings.
For example, in Table 3 of the supplementary materials (AIME24), the provided planning instruction states: "Identify the pattern that arises for losing positions." Following this plan, the model correctly identifies only one of the two correct losing scenarios (n ≡ 2 (mod 5)), but misses the other (n ≡ 0 (mod 5)), leading to an incorrect final answer.
In contrast, when no plan is provided, the same model generates Python code to perform dynamic programming and simulates the game for all n ≤ 2024. This strategy leads to the correct result (809), successfully capturing both losing scenarios.
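To make this concrete, below is a minimal sketch (not the model's exact code) of the dynamic-programming strategy, under the assumed rules of the referenced AIME24 game: players alternately remove 1 or 4 tokens, whoever takes the last token wins, and we count the n ≤ 2024 for which the second player can force a win.

```python
# Minimal sketch of the dynamic-programming strategy described above (not the
# model's exact code; the game rules are assumed as stated in the lead-in).
N = 2024
win = [False] * (N + 1)  # win[n]: the player to move with n tokens can force a win
for n in range(1, N + 1):
    win[n] = any(m <= n and not win[n - m] for m in (1, 4))

losing = [n for n in range(1, N + 1) if not win[n]]  # second player wins at these n
print(len(losing))                      # 809
print(sorted({n % 5 for n in losing}))  # [0, 2], i.e., n ≡ 0 or 2 (mod 5)
```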
This example illustrates a broader trend: RL-tuned models often perform better when allowed to explore the problem space using their internal reasoning mechanisms (e.g., generating code or applying strategies they are familiar with), rather than following externally imposed plans. While human-written plans can be helpful for simpler tasks (e.g., GSM8K), they may restrict model flexibility or bias the model toward unfamiliar or suboptimal strategies on more challenging tasks.
Subproblem decomposition:
We observe a similar pattern when studying subproblem decomposition (Sec. 6.5). While human-provided subproblems are often semantically meaningful, breaking a problem into isolated steps can disrupt the model’s ability to reason holistically. In the same AIME24 example, a corresponding subproblem is phrased as:
Q: What is the pattern of losing positions for n tokens?
Despite being logically sound, the model still fails to identify one of the two key scenarios. This suggests that isolated subproblems can inherit the same brittleness as high-level plans if they are not grounded in the model’s natural reasoning strategies.
Q2: What motivates the Hard-Augmented setting, and could the model simply be trained on these solutions instead?
Thank you for the thoughtful question! We are motivated by prior work that suggests RL methods with outcome rewards such as GRPO cannot benefit from the hardest problems due to the lack of positive reward signals (i.e., all trajectories in a group yield wrong answers). We explore whether these hard problems could still be leveraged in a way that improves performance—without compromising the capabilities learned by a well-tuned RL model from Stage 1.
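For context, this follows from the standard GRPO group-normalized advantage (a reminder of the usual formulation, not a formula restated from the paper): with outcome rewards $r_1, \dots, r_G$ for a group of $G$ sampled responses,

$$\hat{A}_i = \frac{r_i - \mathrm{mean}(r_1, \dots, r_G)}{\mathrm{std}(r_1, \dots, r_G) + \epsilon},$$

so when every trajectory in the group receives the same (zero) reward, all advantages are zero and the prompt contributes no gradient signal.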
We agree that a natural approach would be to train on available solution traces with supervised fine-tuning (SFT). However, SFT does not yield promising results in our setting for the following reasons:
- The training set used in Stage 1 contains over 40K math problems, and many of the associated solution traces are noisy (e.g., incomplete solutions). These traces vary widely in quality, from simplistic chain-of-thought outputs to human-written final answers. While obtaining new traces by distilling from competitive LLMs is possible, it would require substantial compute and additional human validation to ensure correctness and clarity.
- Empirically, we conducted an additional experiment by applying SFT on these noisy traces. The results are shown below:
| Model | AIME24 | AMC23 | MATH500 | GSM8K | OlympiadBench |
|---|---|---|---|---|---|
| SparkleRL-Stage 1 | 46.67% | 67.50% | 80.00% | 91.77% | 39.11% |
| SparkleRL-Stage 2-aug | 50.42% | 71.25% | 81.00% | 92.38% | 40.11% |
| SFT on hard problems with solutions | 15.00% | 53.44% | 70.03% | 88.30% | 30.70% |
We observe that applying SFT on these hard problems to the SparkleRL-Stage 1 model leads to significant performance degradation across benchmarks. In contrast, our Stage 2-aug method yields consistent improvements.
The key distinction in Stage 2-aug (see L202–216) is that we augment the hardest unsolved examples from Stage 1 with partial solutions provided in the input prompt, rather than using full solutions for fine-tuning. This design preserves on-policy learning, which is important for effective RL training, and thereby further boosts reasoning performance.
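As a purely illustrative sketch of this augmentation (the function name and prompt formatting below are hypothetical, not the paper's exact template):

```python
# Hypothetical illustration of partial-solution augmentation (not the paper's
# exact prompt template): prepend the first k steps of a reference solution so
# that the policy must still generate the remaining reasoning on-policy.

def augment_with_partial_solution(question: str, solution_steps: list[str], k: int) -> str:
    """Build a prompt containing the question plus the first k reference steps."""
    hint = "\n".join(f"Step {i + 1}: {s}" for i, s in enumerate(solution_steps[:k]))
    return f"{question}\n\nPartial solution (continue from here):\n{hint}\n"

# Example usage with made-up content.
prompt = augment_with_partial_solution(
    question="For how many n <= 2024 can the second player guarantee a win?",
    solution_steps=[
        "Characterize the losing positions for the player to move.",
        "Show the losing positions are exactly n ≡ 0 or 2 (mod 5).",
        "Count such n up to 2024.",
    ],
    k=1,
)
print(prompt)
```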
Typos
Thank you! We will carefully revise and correct all typos in the final revision.
Thank you for the rebuttal; that resolves my questions and my minor concern.
This paper investigates the benefits of RL training. In particular, the paper explores these benefits in the context of (1) plan-following and execution, (2) problem decomposition, and (3) improved reasoning and knowledge utilization. Experimental evidence is provided for each of these aspects. Interestingly, the paper shows that plans derived from human solutions are not necessarily good for the model to execute, which is not conceptually surprising. The paper also considers the effect of training exclusively on hard problems and of using hard problems augmented with partial solutions, and provides interesting insights. The reviews for the paper are mixed. While the problem setting is interesting, the reviewers were concerned about insufficient ablations, the novelty of the findings, and the overall presentation of the paper.
After going through the paper, I find it somewhat surprising that the main body almost omits the key details about the experiments (e.g., details about the plans and how they are used). The experiments involve many design choices, and a careful review of them is required to ensure the robustness of these findings across settings. I'm glad that this discussion came up during the rebuttal process. It is important to address these concerns before publication of the final version, because in its current form the paper seems somewhat misleading. It is also important to clearly highlight the limitations of the work raised during the discussion with Reviewer wRv5 and Reviewer 1rRb.