PaperHub
7.0/10 · Poster · 3 reviewers (ratings 4, 5, 4; min 4, max 5, std 0.5)
Confidence: 4.0 · Novelty: 2.7 · Quality: 2.7 · Clarity: 3.0 · Significance: 2.7
NeurIPS 2025

The Overthinker's DIET: Cutting Token Calories with DIfficulty-AwarE Training

OpenReview · PDF
Submitted: 2025-05-10 · Updated: 2025-10-29
TL;DR

DIET makes LLMs more token-efficient by using problem difficulty to dynamically guide compression during reinforcement learning, boosting reasoning performance and enabling superior inference scaling under fixed budgets.

Abstract

Keywords
token compression, efficient reasoning, reinforcement learning, inference scaling

Reviews and Discussion

Official Review
Rating: 4

This paper studies the token overconsumption of reasoning models. In particular, the authors point out that as problem difficulty increases, these models may greatly increase the number of tokens generated without meaningfully increasing performance. To solve this problem, they propose difficulty-aware reward estimation -- a correctness reward plus an adaptive trade-off parameter times a token length penalty. First, the difficulty of each problem is estimated as the average incorrectness of responses sampled for a given input prompt. Based on this definition, the adaptive trade-off parameter is estimated. Combined with a penalty term computed from either min/max normalization or target length normalization, the reward is obtained. Naively using this new reward function with GRPO distorts the length penalty term due to GRPO's normalization. The authors propose independently normalizing both terms and additively combining them with an adaptive weighting term. Finally, the authors introduce a scheduling function for the length penalty term as a heuristic to escape local minima. Experimental results on reasoning benchmarks such as MATH and AIME show that the proposed approach, DIET, outperforms competitors while significantly reducing token usage. They also show that the relative increase in performance is higher than that of some baselines. Finally, in line with the motivation at the beginning, response length and difficulty are much more strongly correlated after training with DIET than after RL or SFT fine-tuning.
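To make the summarized reward concrete, here is a minimal Python sketch of the construction described above (difficulty as average incorrectness, adaptive trade-off times a length penalty); the function name `diet_reward` and the exact normalization details are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def diet_reward(correct, lengths, alpha_base=0.5, target_len=None):
    """Illustrative difficulty-aware reward for one group of N sampled responses.

    correct: 0/1 correctness of each response to the same prompt
    lengths: token length of each response
    """
    correct = np.asarray(correct, dtype=float)
    lengths = np.asarray(lengths, dtype=float)

    # Difficulty = average incorrectness over the group; C_hat is its complement.
    c_hat = correct.mean()

    # Adaptive trade-off: easy prompts (high C_hat) get a stronger length penalty.
    alpha_ada = alpha_base * c_hat

    if target_len is None:
        # Min/max normalization of lengths within the group.
        span = max(lengths.max() - lengths.min(), 1.0)
        penalty = (lengths - lengths.min()) / span
    else:
        # Target-length normalization: penalize only the excess over the target.
        penalty = np.maximum(lengths - target_len, 0.0) / target_len

    return correct - alpha_ada * penalty
```

For example, `diet_reward([1, 1, 0, 1], [300, 800, 1200, 500])` penalizes the 1200-token response most strongly, since three of the four samples already solve the prompt.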

Strengths and Weaknesses

Strengths

  1. The paper is overall well-written with a clear motivation and well-explained methods and experiments sections.

  2. Combining difficulty with length penalty is intuitive and practically useful.

  3. Empirical results on the improved correlation point to a promising direction for improving inference-time scaling with better token usage.

Weaknesses

  1. The pitfalls of using the combined reward in GRPO are based on unjustified assumptions. I have two problems with the explanation in Appendix B, both stemming from the fact that $\alpha_{ada} = \alpha_{base}\hat{C}$ is defined as a weighted average correctness.
  • The authors make an independence assumption in Eq. (13) but ignore the fact that $\alpha_{ada}$ is defined as the weighted average correctness, which certainly correlates with the outcome reward function. For example, if $N=1$ and the reward is the indicator function of whether a response is correct or not, it is identical to the estimated correctness. This suggests that, if we add an additional non-zero term to Eq. (13), $\alpha_{ada}$ will not cancel when $\hat{C} \approx 1$ or $0$.
  • The authors claim that when $\hat{C} = 0.5$, the effective penalty scaling is minimized. This is not true in general. If you plot the penalty scaling term with respect to $\hat{C}$ for various $\sigma_p$ and $\alpha_{base}$ values, it shows monotonic behavior with extreme points at $\hat{C} = 0$ or $1$.

While empirical results show that naively using the combined reward in GRPO leads to worse performance, the theoretical explanation is insufficient and requires further clarification.

  2. Ablation on using the penalty term as is. In GRPO, can you use the penalty term (Eq. (5) or Eq. (7)) as is in Eq. (9) rather than using GRPO normalization? Is normalization of the penalty term important?

  3. How does scaling N change the performance? What happens if you use a different N, which impacts the estimated correctness as well as the GRPO normalization?

  4. Functional form of the difficulty vs. token budget relationship. Can you discuss what functional form best captures the relationship between difficulty and token budget? In Figure 2, you show a somewhat logarithmic relationship, but the penalty terms you use are different.

  5. Clarification on inference scaling. Can you quantify the scaling efficiency of the approaches in Figure 3? You claim that your approach shows faster convergence, but TokenSkip seems to have steeper curves. Please also use log-scaling to align the curves.

Questions

  1. Please explain the pitfalls of using the combined reward in GRPO in more detail.

  2. What is the benefit of GRPO normalization on the penalty term?

  3. How does scaling N change the performance?

  4. What is the correct functional form of the difficulty vs. token budget relationship?

  5. Can you quantify the scaling efficiency of the approaches in Figure 3?

Limitations

Yes

Justification for Final Rating

Clarifications and additional results. But one weakness is still not very clear.

Formatting Issues

N/A

Author Response

We sincerely thank the reviewer for the thorough and insightful review. The deep dive into our theoretical analysis is valuable and demonstrates a careful reading of our work.

We have addressed the identified weaknesses below and have conducted new experiments and revised our manuscript to incorporate this valuable feedback.


1. On the Pitfalls of Naive Reward Weighting in GRPO

We thank the reviewer for raising these two important points regarding the analysis in Appendix B.

  • Correlation between $\alpha_{ada}$ and $r_{outcome}$: You are correct that $\alpha_{ada} = \alpha_{base}\hat{C}$ correlates with the outcome reward, especially for small N. However, our independence assumption in Eq. (13) remains valid within the context of the variance calculation. When computing the variance for a group of N responses to a prompt x, the value of $\hat{C}$ (and thus $\alpha_{ada}$) is calculated once for the entire group. It becomes a fixed scalar constant for that specific calculation. The variance is then computed over the random variables $r_{outcome,i}$ and $p_i$ within the group. Since $\alpha_{ada}$ is not a random variable in this context, the identity $\mathrm{Var}(r_{outcome,i} - \alpha_{ada} p_i) \approx \mathrm{Var}(r_{outcome,i}) + \alpha_{ada}^2 \mathrm{Var}(p_i)$ still holds.

  • Relationship between $\hat{C}$ and Effective Penalty Scaling: You have correctly identified a mistake in our analysis regarding the case where $\alpha$ is adaptive ($\alpha_{ada}$). We sincerely thank you for this careful catch. Our original proof held for a constant trade-off, $\alpha_{base}$, but we directly generalized it to $\alpha_{ada}$ in our manuscript, which is incorrect.

    We have now revised our manuscript to first present the proof with the constant $\alpha_{base}$ (where the conclusion that the penalty is minimized at $\hat{C} \approx 0.5$ holds) and then add a detailed discussion for the more complex $\alpha_{ada}$ case, which aligns with your observation of monotonic behavior. We have added the following discussion to our manuscript and will include it in the final version:

    ...[content from Line 523 to Line 552, with $\alpha_{ada}$ replaced with $\alpha_{base}$]

    The situation becomes even more complex when using the adaptive weight $\alpha_{\text{ada}}(x, \pi_\theta) = \alpha_{\text{base}} \cdot w(\hat{C})$ from the main paper. If we substitute $\alpha_{\text{ada}}$ into Eq. 14 and assume a linear weighting function $w(\hat{C}) = \hat{C}$, the effective penalty scaling factor $\hat{\tau}_p$ becomes a non-linear function of the correctness $\hat{C}$.

    To visualize this, we can analyze its squared form, $\hat{\tau}_p^2$:

    $$\hat{\tau}_p^2 = \frac{\alpha_{\text{base}}^2 \hat{C}^2}{\hat{C}(1-\hat{C}) + \alpha_{\text{base}}^2 \hat{C}^2 \sigma_p^2}.$$

    Assuming $\hat{C} \neq 0$ for now, this simplifies to:

    $$\hat{\tau}_p^2 = \frac{\alpha_{\text{base}}^2}{\frac{1-\hat{C}}{\hat{C}} + \alpha_{\text{base}}^2 \sigma_p^2} = \frac{\alpha_{\text{base}}^2}{\frac{1}{\hat{C}} - 1 + \alpha_{\text{base}}^2 \sigma_p^2}.$$

    This relationship is highly non-linear. To illustrate, consider the case where $\alpha_{\text{base}} = 0.1$ and $\sigma_p^2 = 1$ (a plausible assumption if the penalty term is also normalized, as we have done in Eq. 7). The resulting function for $\hat{\tau}_p^2$ is plotted in the figure above [a function plot is presented]; a numerical sketch is also given at the end of this response item. The curve is far from the linear relationship we desire: the penalty's influence is negligible for most of the difficulty range ($\hat{C} < 0.8$) and then increases sharply only as the problem becomes extremely easy. This interaction obscures the difficulty-aware trade-off we intend to impose.

    The Advantage Weighting approach presented in the implementation section avoids all these issues by normalizing the outcome and penalty advantages separately before applying the weight, thus preserving the intended adaptive scaling and ensuring stable, effective training.

    Additionally, in our response to Reviewer zKPN, we have formally demonstrated that the sole source of bias in our Advantage Weighting technique originates from the same fundamental limitation present in the GRPO algorithm itself. By incorporating Dr.GRPO's modification to GRPO, we can obtain an unbiased gradient estimator. We kindly refer you to our response to Reviewer zKPN if you are interested. We are deeply grateful to both you and Reviewer zKPN for your insightful contributions, which have significantly strengthened the theoretical foundation of our work.

    We particularly appreciate your meticulous review in identifying the error in our proof. Your expert feedback has been invaluable in enhancing the rigor of our theoretical analysis through these corrections.
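As referenced above, here is a short numerical sketch of the effective penalty scaling under the stated assumptions ($\alpha_{\text{base}} = 0.1$, $\sigma_p^2 = 1$, and a linear weighting $w(\hat{C}) = \hat{C}$); it is illustrative only.

```python
import numpy as np

alpha_base, sigma_p2 = 0.1, 1.0
c_hat = np.linspace(0.05, 1.0, 20)

# Effective penalty scaling (squared) with the adaptive weight alpha_ada = alpha_base * C_hat,
# following the expression derived above.
tau_p2 = (alpha_base**2 * c_hat**2) / (c_hat * (1 - c_hat) + alpha_base**2 * c_hat**2 * sigma_p2)

for c, tau in zip(c_hat, np.sqrt(tau_p2)):
    print(f"C_hat = {c:.2f}  ->  tau_p = {tau:.3f}")
# tau_p stays below roughly 0.2 for C_hat <= 0.8 and only approaches 1 as C_hat -> 1,
# i.e. the penalty is effectively silenced on all but the easiest prompts.
```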


2. Ablation on Normalizing the Penalty Term

This is a good suggestion. To test the importance of normalizing the penalty term within our Advantage Weighting framework, we ran an ablation where we removed the GRPO-style normalization from the penalty advantage. The results are as follows:

| Setting | MATH500 P@1 | MATH500 Tok. | AIME2024 P@1 | AIME2024 Tok. | AMC2023 P@1 | AMC2023 Tok. | Olympiad P@1 | Olympiad Tok. | Minerva P@1 | Minerva Tok. | Avg P@1 | Avg Tok. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DIET | 83.0% | 3061 | 31.8% | 10578 | 65.4% | 6425 | 43.7% | 6917 | 26.9% | 3505 | 50.2% | 6097.2 |
| Eq. 5 w/o Norm. | 82.5% | 2856 | 29.0% | 9994 | 66.5% | 5724 | 43.8% | 6384 | 26.9% | 3070 | 49.7% | 5605.6 |
| Eq. 7 w/o Norm. | 72.7% | 780 | 17.4% | 2646 | 48.3% | 1957 | 34.6% | 1444 | 24.1% | 679 | 39.4% | 1501.0 |

As the results show, removing normalization leads to a substantial token reduction but also causes a significant degradation in performance. Some careful tuning of the weighting parameter might mitigate the issue, but normalization is an effective and simple approach. Therefore, we adopt the normalization. We hypothesize that the un-normalized penalty can have extreme values, which overly biases the policy gradient towards generating shorter sequences at the expense of correctness. We will add it to our ablation study.
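For clarity, a minimal sketch of what this ablation varies within the Advantage Weighting framework; the function name `weighted_advantage` and the epsilon handling are illustrative assumptions rather than the exact implementation.

```python
import numpy as np

def weighted_advantage(r_outcome, penalty, alpha_ada, normalize_penalty=True, eps=1e-8):
    """Combine outcome and penalty advantages for one group of N responses.

    normalize_penalty=True standardizes each component separately before the
    weighted combination (the setting used in the paper); False keeps the raw
    penalty deviation, whose scale can dominate the update.
    """
    r = np.asarray(r_outcome, dtype=float)
    p = np.asarray(penalty, dtype=float)

    adv_outcome = (r - r.mean()) / (r.std() + eps)

    adv_penalty = p - p.mean()
    if normalize_penalty:
        adv_penalty /= p.std() + eps

    return adv_outcome - alpha_ada * adv_penalty
```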


3. Impact of Scaling N

This is another important hyperparameter. We selected N=8 based on preliminary experiments. As requested, we have presented experiments for N=4 and 16.

| N | MATH500 P@1 | MATH500 Tok. | AIME24 P@1 | AIME24 Tok. | AMC23 P@1 | AMC23 Tok. | Olympiad P@1 | Olympiad Tok. | Minerva P@1 | Minerva Tok. | Avg P@1 | Avg Tok. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4 | 78.1% | 2104 | 29.1% | 8709 | 63.8% | 4993 | 42.7% | 5298 | 24.8% | 2270 | 47.7% | 4674.8 |
| 8 (in our paper) | 83.0% | 3061 | 31.8% | 10578 | 65.4% | 6425 | 43.7% | 6917 | 26.9% | 3505 | 50.2% | 6097.2 |
| 16 | 82.2% | 2613 | 28.3% | 9899 | 66.6% | 5328 | 44.0% | 5970 | 26.3% | 2834 | 49.5% | 5328.8 |

The results show a clear trend. Decreasing the sample count to N=4 results in a noticeable drop in performance. We attribute this to the less stable estimation of problem difficulty and reward baselines, which can introduce noise into the policy gradient updates. Increasing the sample count from N=8 to N=16 does not yield further performance improvements. However, it lowers the number of tokens, potentially because the difficulty estimation is more accurate, allowing for more token reduction for problems with suitable difficulty.

Considering the trade-off between performance and efficiency, we select N=8 in our main experiments. We will add it to the ablation study in our paper.


4. Functional Form of Difficulty vs. Token Budget

This is an insightful observation. Indeed, as illustrated in Figure 2, the relationship between difficulty and token count follows a more logarithmic trend rather than a linear one. To explore this, we conducted a new experiment. We first fitted a logarithmic curve to the data in Figure 2, then scaled and shifted the curve so that its output ranges over $[0, 8192]$ for $x \in [0, 1]$, resulting in $y = 3116.9\ln(21.33x + 1.66) - 1579.7$. We then used this function to sample target lengths in our Dynamic Target method, instead of sampling from a linear range. The experimental results are as follows:

| Setting | MATH500 P@1 | MATH500 Tok. | AIME2024 P@1 | AIME2024 Tok. | AMC2023 P@1 | AMC2023 Tok. | Olympiad P@1 | Olympiad Tok. | Minerva P@1 | Minerva Tok. | Avg P@1 | Avg Tok. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Linear (in our paper) | 83.0% | 3061 | 31.8% | 10578 | 65.4% | 6425 | 43.7% | 6917 | 26.9% | 3505 | 50.2% | 6097.2 |
| Logarithmic | 83.5% | 2971 | 29.8% | 10890 | 61.5% | 6526 | 43.7% | 6693 | 26.5% | 3319 | 49.0% | 6079.8 |

The results show that the method is robust to the choice of target function: the two options do not differ significantly, and the simple linear function already works well. We will include it in our ablation study.


5. Clarification on Inference Scaling

Thank you for the suggestion. We have revised Figure 3 to use a log scale, which provides a clearer comparison. Regarding the "faster convergence" claim, our intention was to compare methods that achieve comparable final performance. While TokenSkip's curve is initially steep, its accuracy quickly plateaus at a significantly lower level than the Base Model or our methods. We argue that rapid convergence to suboptimal performance does not constitute meaningful efficiency, as these token savings come at the expense of significantly degraded reasoning capability.

Our key claim is that DIET achieves high accuracy more efficiently – that is, with a lower total token budget – compared to other high-performance methods. The log-scaled plot included in the revised manuscript provides clearer visualization of this advantage.


We hope our rebuttal has adequately addressed your concerns. Incorporating your valuable suggestions has significantly strengthened and refined our paper. We would be deeply grateful if you could consider revising your assessment of our work accordingly. Thank you again for your efforts.

Comment

Dear Reviewer vKd6,

We would greatly appreciate it if you could take some time to review our rebuttal, as the discussion period is nearing its end. We hope our response adequately addresses your concerns and look forward to your feedback.

Thank you for your time and consideration.

Comment

Dear Reviewer vKd6,

As it is the final day of the rebuttal period, we wanted to kindly check if you’ve had a chance to review our response. We’d greatly appreciate any final feedback you may have.

Thank you again for your time and valuable input!

Authors

Comment

Thank you for the clarifications and additional results. The case for the adaptive $\alpha$ makes sense to me. While I understand that you are treating $\alpha_{ada}$ as a constant, you are still using an approximation to write down the variance equality. I believe a more grounded explanation would be better, such as defining $w$ in $\alpha_{ada}$ so that it becomes an expectation over outcome rewards (or $\hat{C}$), so that the independence assumption holds. But, given the substantial additional results and other clarifications, I will increase my score.

Comment

Thank you for your valuable feedback. The suggestion is insightful, and we will incorporate it into the final version of our paper. We sincerely appreciate the time and care you took in reviewing our work and identifying areas for improvement. Your input has been instrumental in enhancing the quality of our paper, and we are glad that you found our response meaningful as well.

Thank you for being part of the review process for our paper.

Official Review
Rating: 5

This paper proposes a suite of methods that reduce the token length of chains of thought while maintaining quality. This is done by considering the overall difficulty of the problem, as measured by the average accuracy, which is calculated in the process of optimizing group-level objectives such as GRPO.

Strengths and Weaknesses

Strengths:

  • Clarity: The paper is written quite clearly, and I believe I was able to understand all of the methodology.
  • Originality: The idea of using the average reward over the example is creative. And while it is intuitive, I don't think that I would have immediately come up with it.

Weaknesses:

  • Significance: One major potential issue with the experimental results is that they are all done on very small models that have already been trained for reinforcement learning through distillation. It is unclear whether this sort of approach would scale up to the larger models that people actually use.

Questions

  • Do you think this approach will scale to larger models?

Limitations

Yes, it was fine.

Justification for Final Rating

With the additional experiments on larger models I think that the paper deserves publication.

Formatting Issues

None

Author Response

We sincerely thank the reviewer for your positive feedback on the clarity and originality of our work. We agree that the question of scalability is one of the most important aspects for the practical significance of any new method in this field.

Our answer to this concern is: Yes, our approach scales effectively to larger models. To demonstrate this, we have conducted a new set of experiments on the R1-distilled Qwen 2.5 7B model. Due to the rebuttal time limit, we present the results from competitive RL-based baselines. DIET not only maintains its effectiveness but excels at this larger scale, achieving state-of-the-art performance while improving token efficiency.

| Model | MATH500 P@1 | MATH500 Tok. | AIME24 P@1 | AIME24 Tok. | AMC23 P@1 | AMC23 Tok. | Olympiad P@1 | Olympiad Tok. | Minerva P@1 | Minerva Tok. | Macro Avg P@1 | Macro Avg Tok. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Base Model (7B) | 92.1% | 3921 | 53.5% | 13389 | 82.5% | 7730 | 56.4% | 8890 | 37.4% | 4930 | 64.4% | 7772 |
| CosFn | 87.1% | 1397 | 50.5% | 8659 | 78.3% | 3578 | 54.0% | 4593 | 36.4% | 1345 | 61.3% | 3914 |
| O1-pruner | 71.1% | 4958 | 39.2% | 12054 | 79.1% | 6014 | 50.9% | 6852 | 30.5% | 5277 | 54.2% | 7031 |
| Kimi 1.5 RL | 64.0% | 1124 | 44.3% | 7903 | 70.6% | 2783 | 46.1% | 3225 | 32.5% | 677 | 51.5% | 3142 |
| Adaptive W. | 90.6% | 1782 | 52.1% | 8313 | 80.2% | 4050 | 54.9% | 4838 | 36.9% | 1638 | 62.9% | 4124 |
| Dynamic T. | 90.4% | 2121 | 49.6% | 8186 | 79.8% | 4416 | 54.3% | 4677 | 34.5% | 1819 | 61.7% | 4244 |
| DIET | 92.1% | 3187 | 57.9% | 10124 | 82.6% | 6075 | 56.5% | 7026 | 37.9% | 3695 | 65.4% | 6021 |

From these new results, we draw two key conclusions:

  1. DIET Improves Performance while Reducing Tokens: Our DIET framework achieves the highest average Pass@1 score (65.4%), improving upon the already strong 7B base model (64.4%). Crucially, it simultaneously reduces the average token count. This demonstrates that our method effectively enhances both performance and efficiency at a larger scale.
  2. Superior Performance-Efficiency Trade-off: Most other compression methods exhibit a clear trade-off, where significant token reduction leads to a noticeable drop in performance. DIET breaks this trend.

These results confirm that DIET is not an artifact of small-scale models but a robust and scalable framework. We will add this new experimental section to the final version of the manuscript. Thank you again for encouraging us to strengthen this crucial aspect of our paper. We hope you find our responses satisfactory, and we would be grateful for your consideration in raising the score.

Comment

Thank you for running additional experiments with a 7B model. This further strengthens the potential significance of the results.

Comment

Thank you for your prompt and encouraging feedback! We're delighted that you find the additional 7B model experiments helped reinforce the significance of our work. We truly appreciate your support.

Official Review
Rating: 4

Training LLMs to reason more effectively using RL has shown impressive results in some challenging domains. However, while enhanced reasoning has enabled these models to tackle complex problems, they often tend to overthink, generating lengthy responses that are inefficient. This paper attempts to address this issue by proposing a training framework called Difficulty-Aware Training, or DIET. The core idea is to train LLMs to dynamically determine the number of tokens required to solve a task based on its difficulty.

The paper introduces two dynamic objectives: an adaptive trade-off parameter and a dynamic length target. It also discusses an implementation detail regarding how to compute the advantage when incorporating the dynamic objective (penalty) into the RL objective. The authors conducted experiments using the R1-Distilled model with the DeepScaleR dataset, comparing their approach against several competitive baselines. The results demonstrate that the proposed framework not only reduces token length based on problem difficulty but also maintains strong performance.

[1] L1: Controlling How Long A Reasoning Model Thinks With Reinforcement Learning by Aggarwal 2025

Strengths and Weaknesses

The ability to reduce models' overthinking while maintaining performance on downstream tasks is a timely and important issue. The paper effectively outlines why this problem is important, and the proposed solutions are clearly articulated. Several ablation studies are conducted to demonstrate how the suggested objective improved the models' behavior in the expected manner.

Although the paper has some strengths, it also has several weaknesses. It lacks important baselines that would clarify how the proposed idea improves upon existing ones in the literature. For instance, the authors cite [1] but do not compare their results to it. Some other relevant papers are mentioned, but no comparison is provided. The ablation study in section 4.4 challenges the paper's proposal, suggesting that an alternative indicator bias could potentially be used to compress tokens. Notably, Kimi shows good performance while reducing token generation length; although this criterion is not directly related to the tested metric, it still demonstrates an interesting property. Finally, the paper does not provide insights into how different dynamic variables interact with one another. Specifically, there are dynamic weights, losses, and learning rates, but no ablation studies showing why these properties matter or how the model performs if some are disabled.

[1] L1: Controlling How Long A Reasoning Model Thinks With Reinforcement Learning by Aggarwal 2025

Questions

  • How does the proposed framework compare to prompt-based approaches such as [1]?
  • Why do you sample $t(x, \pi_\theta)$ from a uniform distribution? How sensitive is the framework's performance to the choice of $L_{max}$ and $\delta$?
  • What is the difference between the proposed advantage and GRPO? The GRPO Advantage is known to have bias issues, as pointed out in [2] and [3].
  • Do you know if your gradient estimate is actually unbiased or not?
  • Do you have any ablation studies demonstrating how the dynamic weight, function, and learning rate relate to each other? I'm unsure about the importance of each component.

[1] L1: Controlling How Long A Reasoning Model Thinks With Reinforcement Learning by Aggarwal 2025

[2] Understanding R1-Zero-Like Training: A Critical Perspective Liu 2025

[3] https://x.com/leloykun/status/1903382502158500119?s=46

Limitations

Yes

Formatting Issues

No

Author Response

We sincerely thank the reviewer for the valuable feedback and for acknowledging the importance of our work, the clarity of our proposed solutions, and the effectiveness of our ablation studies.

Below, we respond to each of the questions and weaknesses identified.


1. On the Comparison with L1

We view our work and L1 as addressing different and complementary aspects of the token compression problem.

  • Different Focus: L1's primary contribution is a novel RL reward designed to control response length. In contrast, our work, DIET, focuses on a different axis: proposing a framework for integrating on-the-fly problem difficulty into the length-controlled training process. Our main contributions are methodologies for how to apply any length-related penalty, theoretically including L1’s penalty, in a more nuanced, difficulty-aware manner.
  • Potential for Synergy: This distinction makes our approaches highly compatible. Our DIET framework is designed to be modular: actually, the current implementation of Adaptive Weighting can partly be viewed as applying our difficulty-aware techniques to the penalty function from Kimi RL to demonstrate DIET's effectiveness. One could just as easily replace the penalty term in DIET with the L1 reward. DIET would then modulate this L1 penalty based on task difficulty, and our Advantage Weighting technique would ensure its stable incorporation into the GRPO update step. This would effectively create a "Difficulty-Aware L1" with minimal modification.

As L1 is a concurrent and complementary work to DIET, it was not included in our original set of baselines. We chose to demonstrate DIET's effectiveness on the Kimi penalty as a representative example of modern length-control rewards, and compare with methods that do not rely on designing a length-related RL reward. We agree that formally exploring the synergy between DIET and L1 is a compelling direction for future work and will add this discussion to our paper. Also, we will make the above claim more explicit in our revision.


2. Difference from GRPO and Unbiased Gradient Estimation

Thank you for these critical questions on the nature of our advantage calculation.

  • Clarification of our Contribution: To be clear, we do not propose a new advantage function to replace GRPO's. The Advantage Weighting technique is a method for correctly structuring a multi-component reward (i.e., outcome + penalty) before it is fed into an RL algorithm like GRPO. Our analysis in Appendix B shows that naively combining rewards leads to signal distortion during GRPO's normalization step. Advantage Weighting prevents this by normalizing each component's advantage separately, ensuring the final weighted combination is faithful to the intended objective.
  • Unbiased Gradient Estimation: This is an excellent point. While GRPO's advantage normalization biases gradient magnitude, we add a proof that our Advantage Weighting (without std normalization) yields unbiased policy gradient estimates. The remaining bias matches GRPO's inherent limitation and can be eliminated by removing its std normalization.

Here's the major part of the added proof in our manuscript (refresh the page if it is not correctly rendered):

[...]

First, we define the optimization objective $J(\theta)$, which combines the task performance reward with the difficulty-aware length penalty:

$$J(\theta) = \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_{\theta}(\cdot|x)} \left[ r_{\text{total}}(x, y) \right], \quad \text{where} \quad r_{\text{total}}(x, y) = r_{\text{outcome}}(x, y) - \alpha'(x, \pi_{\theta}) \cdot p(y).$$

Here, $r_{\text{outcome}}$ is the task reward, $p(y)$ is the penalty magnitude, and $\alpha'(x, \pi_{\theta})$ is the adaptive trade-off parameter, which is treated as a constant for each group of samples generated for a prompt $x$.

According to the Policy Gradient Theorem, the true gradient of this objective is:

$$\nabla_{\theta} J(\theta) = \mathbb{E}_{x, y} \left[ \nabla_{\theta} \log \pi_{\theta}(y|x) \cdot r_{\text{total}}(x, y) \right].$$

To reduce variance, one can introduce a baseline $B(x)$ that depends on the prompt $x$ but not on the specific response $y$. The gradient estimate remains unbiased because the expectation of the baseline term is zero:

$$\mathbb{E}_{y \sim \pi_{\theta}(\cdot|x)} \left[ \nabla_{\theta} \log \pi_{\theta}(y|x) \cdot B(x) \right] = B(x) \int_{y} \pi_{\theta}(y|x)\, \nabla_{\theta} \log \pi_{\theta}(y|x)\, dy = B(x) \int_{y} \nabla_{\theta} \pi_{\theta}(y|x)\, dy = B(x)\, \nabla_{\theta} \int_{y} \pi_{\theta}(y|x)\, dy = B(x) \cdot \nabla_{\theta}(1) = 0.$$

Thus, the true gradient can be equivalently written as:

$$\nabla_{\theta} J(\theta) = \mathbb{E}_{x, y} \left[ \nabla_{\theta} \log \pi_{\theta}(y|x) \cdot \left( r_{\text{total}}(x, y) - B(x) \right) \right].$$

The term $A(x, y) = r_{\text{total}}(x, y) - B(x)$ is the advantage function. Our Advantage Weighting method constructs a specific baseline for this multi-component reward setting. For a group of $N$ responses $\{y_i\}_{i=1}^N$ to a prompt $x$, we define the empirical means $\mu_{\text{outcome}}(x)$ and $\mu_p(x)$. We then construct the baseline as:

$$B(x) = \mu_{\text{outcome}}(x) - \alpha'(x, \pi_{\theta}) \cdot \mu_p(x).$$

Since this baseline only depends on group-level statistics for a given prompt $x$, it is a valid baseline. Substituting this into the gradient equation, the advantage for a sample $y_i$ becomes:

$$A(x, y_i) = \left( r_{\text{outcome}, i} - \alpha' p_i \right) - \left( \mu_{\text{outcome}} - \alpha' \mu_p \right) = (r_{\text{outcome}, i} - \mu_{\text{outcome}}) - \alpha'(x, \pi_{\theta}) \cdot (p_i - \mu_p).$$

This is precisely the structure of the advantage used in our proposed Advantage Weighting method. The only difference is that in practice we adopt GRPO-style normalization and divide the two terms in the advantage by their respective standard deviations (Eq. 8), which follows common practice but introduces bias. Theoretically, the stochastic gradient estimator $\hat{g}_i(\theta) = \nabla_{\theta} \log \pi_{\theta}(y_i|x) \cdot A(x, y_i)$ is an unbiased estimator of the true policy gradient $\nabla_{\theta} J(\theta)$ if we remove the standard deviation denominator in Eq. 8.

We additionally conducted an experiment using the unbiased version of our method, which removes the std and sequence length normalization, as suggested in Dr.GRPO (we only report the averaged performance due to length limit in the rebuttal):

| Setting | Avg P@1 | Avg Tok. |
|---|---|---|
| DIET (in our paper) | 50.2% | 6097.2 |
| Unbiased DIET | 49.6% | 5033.6 |

While the unbiased variant exhibits a modest performance decrease, it achieves a reduced average token count. We will incorporate these experimental results in our discussion and propose this modification as a valuable enhancement to our method. Thank you for your suggestion!


3. Ablation Studies on Dynamic Variables

We appreciate the request for more insight into the ablation study, as you have proposed in your questions 2 and 5.

Ablation on Dynamic Weight ($\alpha_{base}$): We tested several values for the base trade-off parameter under the Adaptive Weighting setting.

| $\alpha_{base}$ | Avg P@1 | Avg Tok. |
|---|---|---|
| 0.1 | 48.0% | 5988.6 |
| 0.5 | 49.9% | 5723.2 |
| 1.0 | 39.2% | 2194.4 |

We chose $\alpha_{base} = 0.5$ for our main experiments as it provided the best balance between significant token reduction and high performance.

Ablation on Dynamic Function: Indeed, as shown in Figure 2, the relationship follows a logarithmic pattern rather than a linear one. Building on this observation, we conducted additional experiments where we sampled target token lengths from a fitted logarithmic function instead of the original linear function.

To derive an appropriate logarithmic sampling function, we first fitted the model $y = a\ln(bx + c)$ to the data in Figure 2, then scaled and added a bias term to normalize the output range to $[0, 8192]$ for $x \in [0, 1]$, resulting in $y = 3116.9\ln(21.33x + 1.66) - 1579.7$. During training, we randomly sample $x$ from $[\hat{D} - \delta, \hat{D}]$ and use the corresponding $y$ value as the target token length. The results are as follows:

| Setting | Avg P@1 | Avg Tok. |
|---|---|---|
| Linear (in our paper) | 50.2% | 6097.2 |
| Logarithmic | 49.0% | 6079.8 |

The results show that the method is robust to the choice of target function: the two options do not differ significantly. This is a valuable ablation that highlights the impact of the target function's form, and we will include it in our appendix.
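For concreteness, a small sketch of the dynamic-target sampling described above, assuming the fitted logarithmic curve replaces the linear range; the helper name `sample_target_length` is hypothetical, not the exact implementation.

```python
import numpy as np

def sample_target_length(d_hat, delta=0.1, rng=None):
    """Sample a target token length from the fitted logarithmic curve.

    d_hat: estimated difficulty of the prompt, in [0, 1]
    delta: width of the sampling interval [d_hat - delta, d_hat]
    """
    rng = rng or np.random.default_rng()

    # Draw x uniformly from [d_hat - delta, d_hat], clipped to stay in [0, 1].
    lo = max(d_hat - delta, 0.0)
    x = rng.uniform(lo, d_hat)

    # Fitted curve mapping difficulty to a token budget in roughly [0, 8192].
    y = 3116.9 * np.log(21.33 * x + 1.66) - 1579.7
    return int(np.clip(y, 0, 8192))
```

For example, a prompt with estimated difficulty $\hat{D} = 0.6$ and $\delta = 0.1$ yields targets of roughly 6,200–6,750 tokens under this curve.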

Ablation on $\delta$ and $L_{max}$ in $t(x, \pi_\theta)$

We added an ablation study for $\delta$ and $L_{max}$; the results are as follows:

| Max Tokens | Avg P@1 | Avg Tok. |
|---|---|---|
| 4096 | 49.7% | 5385.0 |
| 8192 (in our paper) | 50.2% | 6097.2 |

| $\delta$ | Avg P@1 | Avg Tok. |
|---|---|---|
| 0.05 | 49.1% | 5290.6 |
| 0.1 (in our paper) | 50.2% | 6097.2 |

Reducing $L_{max}$ decreases the average response length with modest performance degradation, similar to decreasing $\delta$. We chose the current configuration as the optimal balance between performance and token efficiency.

Learning Rate Selection: Following standard practices from recent RL work at comparable scales, we fixed the learning rate to isolate our method's contributions. This ensures fair comparison by eliminating optimizer configuration as a confounding factor.

We'll include these results in the final version. Your feedback has been invaluable in strengthening our paper's completeness.

Comment

Dear Reviewer zKPN,

We would greatly appreciate it if you could take some time to review our rebuttal, as the discussion period is nearing its end. We hope our response adequately addresses your concerns and look forward to your feedback.

Thank you for your time and consideration.

Comment

Dear Reviewer zKPN,

As it is the final day of the rebuttal period, we wanted to kindly check if you’ve had a chance to review our response. We’d greatly appreciate any final feedback you may have.

Thank you again for your time and valuable input!

Authors

Comment

Thank you for addressing the questions I raised, and your responses make sense. I will maintain my current score.

Comment

Thank you for the feedback. If possible, we would sincerely appreciate it if you could kindly share any remaining concerns that might be preventing you from granting a higher score. We have carefully addressed all the issues you raised, and we would be more than happy to provide further clarifications if needed. If there are no additional concerns, we would be truly grateful if you could consider revising the score. Thank you very much for your time and valuable feedback.

Comment

Dear reviewers,

This is a reminder that the end of author-reviewer discussion period is Aug. 8. Please do carefully read all other reviews and the author responses; and discuss openly with the authors, especially on your own questions that the authors addressed.

Best, AC

Final Decision

This paper proposes DIET (DIfficulty-AwarE Training), an RL framework that reduces excessive token usage in reasoning models by adapting token compression to task difficulty. The method introduces dynamic trade-offs for length control and an Advantage Weighting scheme to stabilize optimization under GRPO. Experiments show DIET improves performance–efficiency trade-offs and enhances inference-time scaling, enabling more effective majority voting under fixed compute budgets.

Strengths:

  • Tackles a timely and practically important problem (LLM overthinking).

  • Difficulty-aware length control and Advantage Weighting are creative and well-motivated.

  • Clear and well-written presentation.

  • Strong empirical results, including new large-model (7B) experiments confirming scalability.

  • Highlights inference scaling benefits overlooked by prior work.

Weaknesses:

  • Missing direct comparisons to some concurrent baselines (e.g., L1).

  • Theoretical justification of GRPO pitfalls relies on debatable assumptions, though partially clarified in rebuttal.

  • Initial experiments limited to small models; later addressed.

  • Some ablations were missing initially but added later.

Review discussion and rebuttal:

Reviewer zKPN: Concerned about missing baselines and theoretical bias; authors argued complementarity with L1, clarified Advantage Weighting, and added ablations. Reviewer acknowledged responses but stayed borderline (4).

Reviewer gh6m: Concerned about scalability; authors added 7B experiments showing improved efficiency and accuracy, leading to a score increase to 5.

Reviewer vKd6: Questioned GRPO analysis and scaling of hyperparameters; authors corrected proofs, added ablations (N, target functions), and clarified inference scaling. Reviewer found responses satisfactory and raised to 5.

Overall, the rebuttal strengthened the paper significantly, with two reviewers raising their scores and one remaining borderline.

--

Despite limitations in baseline coverage and theoretical depth, the paper makes a meaningful and timely contribution with a principled framework that improves both efficiency and reasoning quality, validated at scale. Overall, the strengths outweigh the weaknesses.