PaperHub
Overall: 8.2 / 10 · Poster · 4 reviewers
Ratings: 5, 5, 5, 5 (min 5, max 5, std 0.0)
Confidence: 3.3 · Novelty: 2.3 · Quality: 2.8 · Clarity: 2.8 · Significance: 2.8
NeurIPS 2025

DisCO: Reinforcing Large Reasoning Models with Discriminative Constrained Optimization

Submitted: 2025-05-10 · Updated: 2025-10-29
TL;DR

We propose a framework for reinforcing large reasoning models with discriminative constrained optimization, grounded in the principle of increasing the scores of positive answers while decreasing those of negative ones.

Abstract

Keywords
Large Language Models · Large Reasoning Models · Discriminative Learning

Reviews and Discussion

Official Review
Rating: 5

The authors propose a new reinforcement learning (RL) optimization framework for large reasoning models based on discriminative learning. Specifically, it employs a score-based discriminative objective instead of a group-relative policy and uses a scoring function in place of clipping. Training is stabilized through a simple constrained optimization method. The proposed framework demonstrates strong potential across multiple reasoning benchmarks, achieving better performance than GRPO while producing shorter responses.

Strengths and Weaknesses

[Strengths]

  • The paper is well-written and easy to follow.
  • The proposed framework to alleviate difficulty bias is interesting and achieves meaningful results across multiple benchmarks.
  • The ablation study is detailed and comprehensive.

[Weaknesses]

  • The framework only considers binary rewards (though the authors acknowledge this). It would be more compelling if it were shown to work with non-binary reward settings as well.
  • The paper misses an ablation study on response length. It would be helpful to understand how DisCO's performance varies with different response lengths.

Questions

  • What happens if we increase the response length (e.g., 24k or 32k)? Does it improve DisCO's performance notably?

Limitations

yes

Final Justification

The authors addressed my concerns. I will increase my score (4-->5).

Formatting Concerns

N/A

Author Response

We thank the reviewer for dedicating the time to providing helpful feedback on our paper.

Q1: The framework only considers binary rewards (though the authors acknowledge this). It would be more compelling if it were shown to work with non-binary reward settings as well.

A: We agree that it will strengthen the work. However, the focus on the binary reward setting helps us to better illustrate the limitation of GRPO through theoretical analysis. We believe the improvement brought by DisCO is non-trivial and expect to extend its strength to non-binary rewards by considering other discriminative objectives, e.g., ranking losses.
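
As an illustration of that direction, below is a minimal hedged sketch of one possible pairwise ranking loss for non-binary rewards (our own example; the function name and the pair-averaging scheme are assumptions, not the paper's design):

```python
import torch
import torch.nn.functional as F

def pairwise_ranking_loss(scores: torch.Tensor, rewards: torch.Tensor) -> torch.Tensor:
    """scores, rewards: (num_rollouts,) for one question. For every pair where
    one rollout's reward exceeds another's, penalize a score gap in the wrong
    direction via a logistic (ranking) loss."""
    diff_s = scores.unsqueeze(1) - scores.unsqueeze(0)    # [i, j] = s_i - s_j
    diff_r = rewards.unsqueeze(1) - rewards.unsqueeze(0)  # [i, j] = r_i - r_j
    mask = (diff_r > 0).float()                           # pairs with r_i > r_j
    losses = F.softplus(-diff_s)                          # log(1 + exp(-(s_i - s_j)))
    return (mask * losses).sum() / mask.sum().clamp(min=1.0)
```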

Additionally, to make our work more compelling, we further validated our method on a model from another LLM family, i.e., DeepSeek-R1-Distill-Llama-8B, training it for 1000 steps. The results are summarized below. We observed that our proposed DisCO methods still outperform other baselines by a large margin, demonstrating the generalizability of the proposed method to other models.

| Model | MRL (Train/Test) | AIME 2024 | AIME 2025 | MATH 500 | AMC 2023 | Minerva | O-Bench | Avg. |
|---|---|---|---|---|---|---|---|---|
| DS-Distill-Llama-8B | 32k+ / 32k | 0.506 | 0.346 | 0.896 | 0.815 | 0.295 | 0.541 | 0.566 |
| DS-Distill-Llama-8B | 32k+ / 8k | 0.348 | 0.238 | 0.825 | 0.652 | 0.267 | 0.440 | 0.462 |
| GRPO | 8k / 8k | 0.410 | 0.240 | 0.873 | 0.759 | 0.307 | 0.506 | 0.516 |
| GRPO-ER | 8k / 8k | 0.408 | 0.277 | 0.882 | 0.785 | 0.311 | 0.511 | 0.529 |
| Dr. GRPO | 8k / 8k | 0.423 | 0.285 | 0.867 | 0.786 | 0.300 | 0.497 | 0.526 |
| DAPO | 8k / 8k | 0.333 | 0.308 | 0.879 | 0.794 | 0.325 | 0.522 | 0.527 |
| TRPA | 8k / 8k | 0.454 | 0.279 | 0.864 | 0.756 | 0.289 | 0.518 | 0.527 |
| DisCO (L-ratio) | 8k / 8k | 0.506 | 0.356 | 0.900 | 0.831 | 0.326 | 0.553 | 0.579 |
| DisCO (log-L) | 8k / 8k | 0.523 | 0.354 | 0.896 | 0.843 | 0.331 | 0.560 | 0.584 |

Q2: It would be helpful to understand how DisCO's performance varies with different response lengths. For example, what happens if we increase the response length (e.g., 24k or 32k)?

A: We thank the reviewer for the constructive comment. We have conducted an experiment using 16k response length for training and inference on 1.5B models. The results shown below demonstrate that increasing response length brings additional improvements, compared with 8k length for training and inference.

| Model | MRL (Train/Test) | AIME 2024 | AIME 2025 | MATH 500 | AMC 2023 | Minerva | O-Bench | Avg. |
|---|---|---|---|---|---|---|---|---|
| DS-Distill-Qwen-1.5B | 32k+ / 32k | 0.288 | 0.263 | 0.828 | 0.629 | 0.265 | 0.433 | 0.451 |
| DSR-1.5B-Preview | 24k / 32k | 0.431 | 0.304 | 0.878 | 0.736 | 0.302 | 0.500 | 0.525 |
| DisCO (L-ratio) | 8k / 8k | 0.381 | 0.306 | 0.878 | 0.746 | 0.319 | 0.512 | 0.524 |
| DisCO (log-L) | 8k / 8k | 0.404 | 0.317 | 0.876 | 0.758 | 0.333 | 0.509 | 0.533 |
| DisCO (L-ratio) | 16k / 16k | 0.410 | 0.340 | 0.885 | 0.748 | 0.324 | 0.515 | 0.537 |
| DisCO (log-L) | 16k / 16k | 0.404 | 0.333 | 0.891 | 0.761 | 0.321 | 0.531 | 0.540 |

Comment

Dear Reviewer ewTG,

Thank you for your thoughtful review and encouraging rating!

As the author-reviewer discussion period is nearing its end, we would like to follow up to see if our rebuttal has addressed your concerns. Please let us know if any further clarification would be helpful.

Thank you again!

Authors

Comment

Thanks for your response. The authors have addressed all my concerns.

Official Review
Rating: 5

This paper addresses the problem-difficulty bias and training instability that existing Large Reasoning Models (LRMs) face during reinforcement learning by proposing a novel framework. The study optimizes scoring functions to raise the scores of correct answers while suppressing those of incorrect answers, employing a constrained optimization approach to ensure training stability. Additionally, it introduces Distributionally Robust Optimization (DRO) techniques to mitigate data imbalance issues during training. Experiments demonstrate that the proposed architecture outperforms existing GRPO and its variants on reasoning tasks, significantly enhancing model performance.

Strengths and Weaknesses

Strengths

  1. The paper offers detailed theoretical analysis and experimental validation, exposes GRPO's limitations, and presents targeted solutions. The well-designed experiments cover multiple benchmark tasks and various model sizes. The results show that the new architecture is better than existing methods at improving model performance. The paper also discloses enough experimental details for other researchers to reproduce the experiments.
  2. This study is highly significant for enhancing the reasoning abilities of large-scale reasoning models on complex tasks. By addressing key issues in current methods, it offers a more effective and stable model optimization approach, which will advance research in related fields.

Weaknesses

  1. Although the experimental part provides a large number of results and analyses, the descriptions of certain experimental details may not be in-depth enough, such as the selection and tuning process of hyperparameters.
  2. When presenting related work and introducing the DisCO method, the paper emphasizes the limitations of GRPO and its variants but rarely mentions their merits and applicability in specific scenarios. This may bias readers' understanding of different methods, preventing them from objectively evaluating the relative advantages and disadvantages of each approach in various contexts.

Questions

  1. Could you quantify the individual contribution rates of each improvement in DisCO through ablation studies in detail?
  2. Will the DisCO algorithm pose computational efficiency challenges as models scale further?
  3. In hyperparameter selection, have you accounted for the impact of different hyperparameter combinations on model generalization capabilities?

Limitations

Yes, the authors have explicitly stated several limitations of their work in the paper, noting that their method primarily optimizes for binary rewards and that experiments were constrained by computational resources, covering only 1.5B and 7B models.

Final Justification

Thanks for the authors' thorough and clear responses. All my questions have been fully addressed, and I have no further comments. Therefore, I recommend acceptance.

Formatting Concerns

None.

Author Response

We thank the reviewer for providing helpful comments on improving our paper. Below we would like to answer raised questions.

Q1: In hyperparameter selection, have you accounted for the impact of different hyperparameter combinations on model generalization capabilities?

A: We tuned two hyperparameters for DisCO, i.e., the learning rate and $\tau$. The process of hyperparameter selection is the same as model selection, which accounts for generalization performance. We report the performance of the best hyperparameters in the tuning range, i.e., those that yield the best average test performance. We did the same model selection for all baselines. We will add more details in the revision.

Q2: Mentioning the merits and applicability of GRPO in specific scenarios.

A: We thank the reviewer for the constructive comment. In the opening paragraph of the Introduction, we highlighted the importance of GRPO in achieving performance comparable to advanced proprietary models across many reasoning benchmarks. Additionally, in the second paragraph of the Related Work section, we discussed key contributions of GRPO and its variants to provide readers with a foundational understanding of each approach.

We agree with the reviewer that each method has its own strengths. In Section 7, we acknowledged the limitation of our work in focusing on binary rewards, whereas GRPO and its variants can naturally handle non-binary reward signals without modification. We will make this clearer in the revision.

Q3: Could you quantify the individual contribution rates of each improvement in DisCO through ablation studies in detail?

A: Thank you for the helpful suggestion. The proposed method incorporates three key design choices: (1) removing question-level weight bias, (2) using a non-clipping scoring function, and (3) applying a KL-divergence constraint.

The effectiveness of (1) has been demonstrated in the comparison between GRPO and GRPO$_\text{rw}$, as shown in Fig. 1 (c, d). Additionally, we conducted new experiments on training a 1.5B model, comparing DisCO with its variant that includes question-level weight bias while keeping all other components unchanged. The results are presented in the table below.

The benefit of (2) is validated in Figure 3 (second from the left), which shows that using the non-clipping scoring functions yields significant improvements over its clipping-based counterpart, when all other components are the same as DisCO.

To evaluate the impact of (3), we conducted another experiment on the 1.5B model, comparing DisCO with a variant that replaces the KL constraint with KL-divergence regularization. The results are also summarized in the table below.

We can see that each component contributes to the improvements, where the use of non-clipping scoring function contributes the most. We will incorporate the discussion into our ablation studies.

| Method | pass@1 (Avg.) | Method | pass@1 (Avg.) |
|---|---|---|---|
| DisCO (log-L) | 0.533 | DisCO (L-ratio) | 0.524 |
| DisCO → w/ weight bias | 0.522 (↓ 0.011) | DisCO → w/ weight bias | 0.511 (↓ 0.013) |
| DisCO → clipping scoring func. | 0.430 (↓ 0.103) | DisCO → clipping scoring func. | 0.430 (↓ 0.094) |
| DisCO → KL regularization | 0.519 (↓ 0.014) | DisCO → KL regularization | 0.523 (↓ 0.001) |
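
To make the decomposition above more concrete, below is a minimal sketch (our own illustration, not the authors' code) of how the three components could enter a per-question loss; the mean-gap form of the discriminative term and the exact squared-hinge constants (beta, delta) are assumptions made for illustration only.

```python
import torch

def disco_style_loss(logp_new, logp_old, is_positive, kl_estimate,
                     beta=10.0, delta=0.05, use_log_likelihood_score=True):
    """Illustrative per-question loss. logp_new / logp_old: (num_rollouts,)
    sequence log-probs under the current and behavior policies; is_positive:
    (num_rollouts,) bool reward mask (assumes the question has both correct and
    incorrect rollouts); kl_estimate: scalar tensor estimate of the KL divergence."""
    # (2) Non-clipping scoring function: log-likelihood ("log-L") or
    #     likelihood ratio ("L-ratio"), instead of a clipped ratio.
    scores = logp_new if use_log_likelihood_score else torch.exp(logp_new - logp_old)
    pos, neg = scores[is_positive], scores[~is_positive]
    # (1) No question-level weight bias: the positive/negative score gap is
    #     used directly, without rescaling by how many rollouts were correct.
    discriminative = -(pos.mean() - neg.mean())
    # (3) KL constraint enforced via a squared-hinge penalty rather than a
    #     plain KL regularizer (this exact penalty form is our assumption).
    penalty = beta * torch.relu(kl_estimate - delta) ** 2
    return discriminative + penalty
```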

Q4: Will the DisCO algorithm pose computational efficiency challenges as models scale further?

A: The DisCO algorithm does not introduce any extra computational overhead, compared with GRPO and its variants. They have similar computational costs.

Comment

Dear Reviewer CL8D,

Thank you for your thoughtful review and encouraging rating!

As the author-reviewer discussion period is nearing its end, we would like to follow up to see if our rebuttal has addressed your concerns. Please let us know if any further clarification would be helpful.

Thank you again!

Authors

Comment

Thank you for your thorough and clear responses. All my questions have been fully addressed, and I have no further comments. Therefore, I recommend acceptance.

Official Review
Rating: 5

The paper proposes a new Discriminative Constrained Optimization (DisCO) framework for reinforcing large reasoning models (LRMs). Starting from group relative policy optimization (GRPO), DisCO first replaces the group-relative objective with a discriminative objective defined by a scoring function, and then employs a constrained optimization approach to enforce the KL divergence constraint. Experimental results show that DisCO outperforms GRPO and DAPO on 1.5B and 7B models with the same training queries.

Strengths and Weaknesses

Strengths:

  1. From Tables 2 & 3, the proposed method outperforms previous works on both 1.5B and 7B models and exhibits better training stability than previous works.

Weaknesses:

  1. There are too many equations, which makes the paper hard to follow; in particular, many symbols do not have explanations and require the reader to guess. For example, the function $g(o,q)$ is a one-time symbol that can be essentially replaced by $f(x,y)$, and $f^{+}$ and $f^{-}$ can be omitted. There are also many abbreviations that show up without explanation, such as VPG and GRPO_RW in the equations.
  2. The claimed motivation that previous methods have difficulty bias seems a bit empirical and does not have theoretical support.
  3. In Table 3, when the initial policy is strong, the improvements of DisCO are small, and baseline methods such as GRPO and DAPO even decrease the performance of the initial policy. Is there any explanation?

Questions

  1. Equation (9) seems to have a very similar formulation to Direct Preference Optimization (DPO). Is there any essential connection and difference between the proposed DisCO and online DPO?
  2. The reward distribution of the training queries is highly correlated with the initial policy and the training data. If the initial policy is strong and the training queries are hard (for example, using the queries provided by DAPO [1] or OREAL [2]), is DisCO still effective and better than previous methods? [1] DAPO: An Open-Source LLM Reinforcement Learning System at Scale. https://huggingface.co/datasets/BytedTsinghua-SIA/DAPO-Math-17k [2] OREAL: Exploring the Limit of Outcome Reward for Learning Mathematical Reasoning. https://huggingface.co/datasets/internlm/OREAL-RL-Prompts

Limitations

See weaknesses.

Final Justification

The authors' response and clarification have addressed my concerns, thus I raised my rating from 4 to 5.

Formatting Concerns

None

Author Response

We thank the reviewer for providing constructive comments. Below we would like to answer the raised questions.

Q1: On the math symbols that do not have explanations. For example, the function $g(o,q)$ is a one-time symbol that can be essentially replaced by $f(x,y)$; $f^+$ and $f^-$ can be omitted. There are also many abbreviations that show up without explanation, such as VPG and GRPO_RW in the equations.

A: We apologize for the confusion!

  • $g(o,q)$ in line 134 is any function, introduced for a statement of a general fact in equation (1), which is used in the proof of Proposition 1. We will move this to the appendix in the revision.

  • $f^+$ and $f^-$ in Proposition 1 are assumed to be non-decreasing functions for decomposing $f(x,y)$. Their specific forms for GRPO are exactly those appearing in $s_{\theta}^+(o, q)$ and $s_{\theta}^-(o, q)$ in Proposition 1, i.e., $f^+(x, 1)=\min(x, 1+\epsilon)$ and $f^-(x, 1)=\min(x, 1-\epsilon)$ (see the short numeric sketch after this list). For other methods, their specific forms can be found in Table 4. We will make it clearer in the revision.

  • VPG stands for Vanilla Policy Gradient method, and GRPO_RW stands for a GRPO version with the question-level weight bias removed (line 183). We will revise the paper to make the term clearer.
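
For concreteness, here is a tiny numeric illustration (ours, not from the paper) of the clipped forms quoted above, with x standing in for the likelihood ratio:

```python
def f_plus(x, eps=0.2):
    # Positive rollouts: the score saturates once the ratio exceeds 1 + eps.
    return min(x, 1.0 + eps)

def f_minus(x, eps=0.2):
    # Negative rollouts, in the form stated above (eps = 0.2 is illustrative).
    return min(x, 1.0 - eps)

print(f_plus(1.5), f_minus(1.5))  # -> 1.2 0.8
```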

Q2: The claimed motivation that previous methods have difficulty bias seems to be a bit empirical and does not have theoretical support.

A: We politely disagree. Our finding regarding the difficulty bias of GRPO comes from a theoretical analysis of GRPO's objective, connecting it with a discriminative objective in Proposition 1. The empirical studies in Figure 1 further support our claim.

Q3: In Table 3, when the initial policy is strong, the improvements of DisCO is small, and the baseline methods such as GRPO and DAPO even decrease the performance of initial policy, is there any explanation?

A: There is a misunderstanding in the interpretation of comparison with the initial policy in Table 3. It is important to compare their performance against the second entry in Table 3: DS-Distill-Qwen-7B (32k+/8k), which represents the initial policy evaluated with the same 8k response length used for GRPO, DAPO and DisCO. The first entry, DS-Distill-Qwen-7B (32k+/32k), uses a much longer 32k response length at inference and is included as a reference to highlight DisCO’s efficiency under smaller generation budgets (8k during both training and inference).

In fact, both GRPO and DAPO do improve upon the initial policy when evaluated under the same response length constraint—raising performance from 0.513 to 0.592 (GRPO) and 0.570 (DAPO), respectively. Also, the improvements of DisCO are still significant, from 0.513 to 0.627 or 0.625.

Please also note that while the performance of GRPO and DAPO has already saturated, DisCO continues to exhibit an upward trend as the number of training steps increases (Figure 2c).

Q4: The equation 9 seems to have very similar formulation as Direct Policy Optimization. Is there any essential connection and difference between the proposed DisCO and online DPO?

A: Indeed, DPO's objective can be regarded as a discriminative objective and can be recovered from (9) by choosing $s_{\theta}(o,q) = \beta\log\frac{\pi_{\theta}(o|q)}{\pi_{\text{ref}}(o|q)}$ and $\ell(s) = \log(1+\exp(-s))$, and removing the squared-hinge penalty function. Nevertheless, the objective in (10) for DisCO is different from DPO, as it puts more weight on hard negative rollouts.
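
Written out schematically (under our assumption that (9) compares a positive and a negative rollout of the same question), these choices recover the familiar pairwise online-DPO form:

$$
\mathcal{L}(\theta) \;=\; \mathbb{E}_{q,\,o^{+},\,o^{-}}\!\left[\log\!\left(1+\exp\!\big(-\big(s_\theta(o^{+},q)-s_\theta(o^{-},q)\big)\big)\right)\right],
\qquad
s_\theta(o,q) \;=\; \beta \log \frac{\pi_\theta(o\mid q)}{\pi_{\mathrm{ref}}(o\mid q)}.
$$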

Q5: Is DisCO still effective and better than previous methods on other datasets like those provided in DAPO or OREAL papers?

A: We thank the reviewer for the constructive comment. We conducted additional experiments on DAPO-Math-17k dataset with 1.5B models, training them for 1400 steps. The results are summarized below. We can observe that DisCO methods achieve slightly better performance than those trained on DeepScaleR-Preview-Dataset, consistently outperforming GRPO and other baselines.

| Model | MRL (Train/Test) | AIME 2024 | AIME 2025 | MATH 500 | AMC 2023 | Minerva | O-Bench | Avg. |
|---|---|---|---|---|---|---|---|---|
| DS-Distill-Qwen-1.5B | 32k+ / 8k | 0.181 | 0.215 | 0.758 | 0.515 | 0.237 | 0.353 | 0.376 |
| GRPO | 8k / 8k | 0.342 | 0.256 | 0.842 | 0.672 | 0.267 | 0.458 | 0.473 |
| DAPO | 8k / 8k | 0.275 | 0.229 | 0.812 | 0.653 | 0.256 | 0.441 | 0.444 |
| TRPA | 8k / 8k | 0.346 | 0.279 | 0.836 | 0.683 | 0.281 | 0.450 | 0.479 |
| DisCO (L-ratio) | 8k / 8k | 0.413 | 0.310 | 0.874 | 0.775 | 0.307 | 0.495 | 0.529 |
| DisCO (log-L) | 8k / 8k | 0.460 | 0.317 | 0.873 | 0.775 | 0.320 | 0.502 | 0.541 |

Comment

Dear Reviewer Hi9A,

Thank you for your thoughtful review and encouraging rating!

As the author-reviewer discussion period is nearing its end, we would like to follow up to see if our rebuttal has addressed your concerns. Please let us know if any further clarification would be helpful.

Thank you again!

Authors

Official Review
Rating: 5

This paper proposes several modifications to GRPO:

  • Does not clip advantage
  • Removes down-weighting of questions with very high or low reward
  • More stable KL divergence term using a squared-hinge penalty function

They train and compare their method (DisCO) with GRPO, Dr. GRPO, DAPO, and TRPA. They also compare against DeepSeek-R1-1.5B-Preview and OpenAI-o1-Preview. The DisCO-trained model outperforms these variants on math tasks.

Strengths and Weaknesses

  • [Good performance] Based on numbers reported, it looks like the algorithm works well.
  • [Organization needs some work. Contributions are weakly communicated --- multiple things happening with the algorithm; it might be good to decompose the paper into separate sections] There seem to be two things the paper is trying to communicate: that GRPO downweights examples with extreme rewards, and that the standard clipping/KL divergence term causes instability. However, in the experiments, it's very unclear where the gains are actually coming from. It might be good to more closely decompose the effect of each design choice on performance in the main body and provide more extensive ablation studies that demonstrate the problem tackled.
  • [Concurrent work] The idea of upsampling examples with the highest and lowest rewards is also explored in concurrent work Not All Rollouts are Useful [https://arxiv.org/abs/2504.13818].
  • [No hyperparameter tuning for baselines] Based on the experimental section, it looks like hyperparameters are manually tuned for their algorithm, but baseline hyperparameters are reported numbers adapted from previous papers. It's not clear whether these hyperparameters are optimal for their setup.

Questions

  • I'm curious what the interplay is between the expected reward and entropy collapse in your algorithm. These concepts seem to be closely related, since entropy collapse would also lead to a bimodal expected reward distribution (e.g., you could be oversampling one particular incorrect or correct answer and get extremely low/high rewards, respectively). If you reduce entropy collapse, I imagine fewer datapoints would have extremely low or high rewards. Yet DisCO would also precisely focus more on these outlier examples. I wonder if some effects of these two design choices are somewhat cancelling each other out?
  • If you remove the clipping for GRPO + use the same squared-hinged KL term, does the performance improve?

Limitations

  • Based on the insights provided, are there any particular datasets on which you would foresee DisCO being particularly effective or suboptimal in comparison to GRPO? E.g., perhaps you could try comparing these algorithms on datasets with a small number of very easy and hard examples.

Final Justification

From what I understand, clipping stabilizes training but degrades performance, while using a non-clipping objective and replacing the KL regularization term to stabilize training achieves better performance.

I think the experiments address my concerns. I increase my score to 5. I think the organization of the paper could be improved to better communicate why changing each component of GRPO was necessary.

Formatting Concerns

None

Author Response

We thank the reviewer for providing a comprehensive review. Below we would like to address raised concerns.

Q1: In the experiments, it might be good to more closely decompose the effect of each design choice on the performance.

A: We thank the reviewer for the constructive suggestion! The proposed method incorporates three key design choices: (1) removing question-level weight bias, (2) using a non-clipping scoring function, and (3) applying a KL-divergence constraint.

The effectiveness of (1) has been demonstrated in the comparison between GRPO and GRPO$_\text{rw}$, as shown in Fig. 1 (c, d). Additionally, we conducted new experiments on training a 1.5B model, comparing DisCO with its variant that includes question-level weight bias while keeping all other components unchanged. The results are presented in the table below.

The benefit of (2) is validated in Figure 3 (second from the left), which shows that using the non-clipping scoring functions yields significant improvements over its clipping-based counterpart, when all other components are the same.

To evaluate the impact of (3), we conducted another experiment on the 1.5B model, comparing DisCO with a variant that replaces the KL constraint with KL-divergence regularization. The results are also summarized in the table below.

We can see that each component contributes to the improvements, where the use of non-clipping scoring function contributes the most. We will incorporate the discussion into our revision.

| Method | pass@1 (Avg.) | Method | pass@1 (Avg.) |
|---|---|---|---|
| DisCO (log-L) | 0.533 | DisCO (L-ratio) | 0.524 |
| DisCO → w/ weight bias | 0.522 (↓ 0.011) | DisCO → w/ weight bias | 0.511 (↓ 0.013) |
| DisCO → clipping scoring func. | 0.430 (↓ 0.103) | DisCO → clipping scoring func. | 0.430 (↓ 0.094) |
| DisCO → KL regularization | 0.519 (↓ 0.014) | DisCO → KL regularization | 0.523 (↓ 0.001) |

Q2: The idea of upsampling examples with the highest and lowest rewards is also explored in concurrent work [r1].

A: Thank you for pointing out the interesting work. While [r1] introduces max variance down-sampling to train on only an informative subset, we'd like to clarify that our method does not apply up-sampling or down-sampling to rollouts. Instead, we proposed a discriminative objective using all generated rollouts for more effective learning than GRPO. It would be interesting to integrate [r1]'s sampling approach into our framework.

Q3: Based on the experimental section, it looks like hyperparameters are manually tuned for their algorithm, but baseline hyperparameters are reported numbers adapted from previous papers. It's not clear whether these hyperparameters are optimal for their setup.

A: Thank you for the thoughtful comment! We would like to provide further clarification on our hyperparameter tuning process.

First, we tune the learning rate for all methods from the set {1e-6, 2e-6} (see lines 310–311).

Second, for our proposed DisCO methods, we tune only one additional parameter, $\tau$, over a small range of three values per variant. The other hyperparameters, $\beta$ and $\delta$, are selected based on heuristics and are not tuned (lines 318–319). Our ablation study (Figure 3, right) shows that the performance is not sensitive to the choice of $\tau$.

Third, for the baseline methods GRPO, GRPO-ER, and TRPA, we follow the hyperparameter settings used in prior works [r2, r3, r4], because they used exactly the same model and dataset as our experiments. Dr. GRPO does not introduce any additional hyperparameters beyond the learning rate. For DAPO, we used $\epsilon_{\text{high}} = 0.28$, as in the original paper. While we acknowledge that some studies may use a different value, we tested an alternative setting $\epsilon_{\text{high}} = 0.24$ to address this concern. The result, shown in the table below, indicates that $\epsilon_{\text{high}} = 0.24$ performs even worse than $\epsilon_{\text{high}} = 0.28$.

In summary, the improved performance of DisCO does not stem from extensive or unfair hyperparameter tuning.

| Model | AIME 2024 | AIME 2025 | MATH 500 | AMC 2023 | Minerva | O-Bench | Avg. |
|---|---|---|---|---|---|---|---|
| GRPO | 0.277 | 0.242 | 0.838 | 0.647 | 0.276 | 0.462 | 0.457 |
| DAPO ($\epsilon_{\text{high}}$ = 0.24) | 0.292 | 0.246 | 0.838 | 0.665 | 0.288 | 0.464 | 0.465 |
| DAPO ($\epsilon_{\text{high}}$ = 0.28) | 0.310 | 0.252 | 0.848 | 0.675 | 0.296 | 0.456 | 0.473 |
| DisCO (log-L) | 0.404 | 0.317 | 0.876 | 0.758 | 0.333 | 0.509 | 0.533 |

Q4: On the interplay between the expected reward and entropy collapse in your algorithm. Would fewer datapoints have extremely low or high rewards in DisCO? Would DisCO focus more on these outlier examples?

A: We would like to clarify a misunderstanding regarding DisCO. We agree that entropy collapse did contribute to saturation of expected rewards for GRPO, GRPO-ER and Dr. GRPO as observed in Fig. 2. Regarding DisCO,

  • (1) the generation entropy remains stable while the expected reward keeps increasing (see Fig. 2);
  • (2) We verified the reward distribution for questions in our method averaged over the first 200 steps. As shown in the following table, it does not exhibit fewer datapoints with extremely low or high rewards. It is similar to that in GRPO (Fig. 1, second from the left).
  • (3) It is not true that DisCO focuses more on the examples with extremely low or high rewards. Indeed, DisCO is motivated by giving the same weight to all questions regardless of their expected rewards. Our objective in (10) has the effect of weighting the negative rollouts of each question, focusing more on hard negative rollouts (a hedged sketch of this weighting effect follows the table below).
| Question Accuracy | 0% | (0, 25%] | (25%, 50%] | (50%, 75%] | (75%, 100%) | 100% |
|---|---|---|---|---|---|---|
| Ratio | 0.24 | 0.13 | 0.11 | 0.13 | 0.10 | 0.29 |
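
As a hedged illustration of point (3) above (our own sketch; it does not reproduce the paper's objective (10)), a DRO-style aggregation over a question's negative rollouts places softmax weights on their scores, so harder negatives (those with higher current scores) receive more weight, with $\tau$ controlling how sharply the weighting concentrates:

```python
import torch

def weighted_negative_score(neg_scores: torch.Tensor, tau: float = 1.0) -> torch.Tensor:
    """Softmax-weighted aggregate of negative-rollout scores: higher-scoring
    ("harder") negatives dominate as tau shrinks. Illustrative only."""
    weights = torch.softmax(neg_scores / tau, dim=0)
    return (weights * neg_scores).sum()

neg = torch.tensor([0.1, 0.5, 2.0])
print(weighted_negative_score(neg, tau=0.5))  # dominated by the hardest negative (2.0)
```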

Q5: If you remove the clipping for GRPO + use the same squared-hinged KL term, does the performance improve?

A: We have experimented with the suggested variant of GRPO, keeping its advantage function but replacing its clipping-based scoring function with the non-clipping L-ratio, and using the squared-hinge KL penalty instead of its original KL regularization. This is indeed the variant in the second row (right) of the first table in this rebuttal. We summarize the results below, from which we can see that its average performance of 0.511 is better than GRPO (0.457) but still worse than DisCO (0.524).

| Model | pass@1 (Avg.) |
|---|---|
| GRPO | 0.457 |
| Non-clipped GRPO + squared-hinge KL | 0.511 |
| DisCO (L-ratio) | 0.524 |

Q6: Based on the insights provided, are there any particular datasets on which you would foresee DisCO being particularly effective or suboptimal in comparison to GRPO? E.g., comparing these algorithms on datasets with a small number of very easy and hard examples.

A: The DeepScaleR-Preview-Dataset used in our experiments includes a wide range of questions, spanning from easy to challenging for the base models. We believe that the improvements achieved by DisCO are fundamental, rather than relying on specific properties of the dataset. To further support this claim, we conducted additional experiments on the DAPO-Math-17K dataset [r5] using 1.5B models, training them for 1400 steps. As shown below, DisCO still consistently outperforms GRPO and other baselines.

| Model | MRL (Train/Test) | AIME 2024 | AIME 2025 | MATH 500 | AMC 2023 | Minerva | O-Bench | Avg. |
|---|---|---|---|---|---|---|---|---|
| DS-Distill-Qwen-1.5B | 32k+ / 8k | 0.181 | 0.215 | 0.758 | 0.515 | 0.237 | 0.353 | 0.376 |
| GRPO | 8k / 8k | 0.342 | 0.256 | 0.842 | 0.672 | 0.267 | 0.458 | 0.473 |
| DAPO | 8k / 8k | 0.275 | 0.229 | 0.812 | 0.653 | 0.256 | 0.441 | 0.444 |
| TRPA | 8k / 8k | 0.346 | 0.279 | 0.836 | 0.683 | 0.281 | 0.450 | 0.479 |
| DisCO (L-ratio) | 8k / 8k | 0.413 | 0.310 | 0.874 | 0.775 | 0.307 | 0.495 | 0.529 |
| DisCO (log-L) | 8k / 8k | 0.460 | 0.317 | 0.873 | 0.775 | 0.320 | 0.502 | 0.541 |

[r1] Not All Rollouts are Useful: Down-Sampling Rollouts in LLM Reinforcement Learning

[r2] An empirical study on eliciting and improving r1-like reasoning models.

[r3] Deepscaler: Surpassing o1-preview with a 1.5b model by scaling rl.

[r4] Trust region preference approximation: A simple and stable reinforcement learning algorithm for llm reasoning.

[r5] DAPO: An Open-Source LLM Reinforcement Learning System at Scale.

Comment

Dear Reviewer 2CBw,

Thank you for your thoughtful review and encouraging rating!

As the author-reviewer discussion period is nearing its end, we would like to follow up to see if our rebuttal has addressed your concerns. Please let us know if any further clarification would be helpful.

Thank you again!

Authors

Comment

Thanks for the thorough response!

From what I understand, clipping stabilizes training but degrades performance, while using a non-clipping objective and replacing the KL regularization term to stabilize training achieves better performance.

I think the experiments address my concerns. I'll increase my score to 5, thanks.

Comment

Dear AC and all reviewers,

We are grateful to reviewers CL8D and ewTG for acknowledging our rebuttal and recommending acceptance.

To reviewers 2CBw and Hi9A: if you have any further questions or concerns regarding our rebuttal, we would be happy to engage in additional discussion before the discussion period ends.

To AC: we kindly ask for your assistance in inviting reviewers 2CBw and Hi9A to revisit our rebuttal. We hope that their evaluations will reflect the most up-to-date information.

Thank you!

Authors

Final Decision

This paper proposes DisCO (Discriminative Constrained Optimization), a reinforcement learning method for large reasoning models that alleviates difficulty bias and training instability in existing methods like GRPO.

During the discussion period, the authors addressed all major concerns through comprehensive experiments and clarifications:

  1. Ablation studies and component contributions (Reviewers 2CBw, CL8D). The authors provided detailed ablation results showing each component's contribution: removing question-level weight bias, using non-clipping scoring functions (the largest contributor), and applying KL constraints. These experiments clearly decompose the effectiveness of each design choice.
  2. Hyperparameter tuning fairness (Reviewer 2CBw). The authors clarified that they tuned the learning rate for all methods and, for their own method DisCO, only one additional parameter without an extensive grid search. Furthermore, the baselines had been run under similar experimental settings in prior works, so it was fair for the authors to follow those training settings.
  3. Theoretical clarity and notation (Reviewer Hi9A). The authors clarified mathematical notation, explained the theoretical basis for difficulty bias through Proposition 1, and corrected the interpretation of Table 3 results.

All four reviewers converged on acceptance after the rebuttal. The recommendation is to accept this paper.