Enhancing LLM Reasoning with Iterative DPO: A Comprehensive Empirical Investigation
DPO enables iterative self-improvement for LLMs, achieving RL-level reasoning performance with lower computational cost through preference-based learning and verifiable rewards.
Abstract
Reviews and Discussion
This work presents an empirical study of iterative DPO as a lightweight alternative to RL-based post-training for improving mathematical reasoning. The authors show that (1) a single DPO round with coarse filtering already boosts a strong base model, and (2) a multi-round generator/RM co-training scheme, DPO-VP, reaches RL-level accuracy while requiring less compute. The manuscript is well-structured, the experiments are extensive, and the resulting insights will be of interest to researchers seeking alternatives to RL-based post-training. In particular, showing the mutual evolution of the policy and reward model through iterative DPO is interesting.
The findings presented in this paper, however, are less surprising. Regarding the three key findings that the authors listed:
- Earlier studies, some cited, have already reported gains from single-round DPO with answer verification or from iterative self-training (e.g., RPO [1], ReST-EM [2], SPIN [3]). Without answer verification, there have also been efforts [4] in leveraging DPO for self-improvement.
- To show that iterative DPO achieves online-RL-level performance, I would expect a direct comparison with the GRPO algorithm on the same training data, even if done on smaller models such as 1.5B or 3B. The current baselines are trained on different datasets and schedules, making the comparison inconclusive.
[1] Iterative Reasoning Preference Optimization
[2] Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models
[3] Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models
[4] Bootstrapping Language Models with DPO Implicit Rewards
Reasons to Accept
- The manuscript is well-structured, presenting a thorough empirical evaluation of iterative DPO as an efficient alternative.
- The experiments are extensive and provide interesting insights.
- The paper's focus on resource-constrained scenarios is important for the research community.
Reasons to Reject
- Improvements from DPO and iterative DPO have been reported by many previous works, so key findings 1 and 2 are less surprising.
- The paper lacks an experiment directly comparing iterative DPO with online RL algorithms such as GRPO on the same data (even in a toy setting with smaller models) to substantiate its key finding 3.
- The analysis of self-reflective behaviors focuses on occurrences of "let's" in Figure 4 and Table 7, which seems less relevant. It would be more relevant to study occurrences of "wait" and "double-check".
We sincerely appreciate your thoughtful critiques. Below, we respond to each of your concerns.
Comment 1: Novelty and Contributions
We are grateful for the opportunity to elaborate on the innovations and empirical findings introduced in our study. While we acknowledge that DPO and its iterative variants have been previously explored, our work introduces several novel insights and empirical contributions that meaningfully extend the current literature:
- Iterative Co-Evolution of Generator and Verifier: We propose a multi-round training framework where the generator and verifier (PRM) are jointly optimized. Unlike prior works that fix the reward model, we explicitly demonstrate mutual enhancement across iterations, revealing the benefits of feedback-driven co-evolution, which has not been systematically explored in earlier DPO studies (a minimal illustrative sketch of this loop is given after this list).
- Resource-Efficient Verifiable Preference Optimization: Our work focuses on multi-round DPO with verifiable rewards, showing that it can rapidly enhance reasoning ability in Qwen2.5-series models using limited computational resources. While the benefits of RL in LLMs have been recognized since the release of ChatGPT in 2022, it was not until the emergence of DeepSeek-R1 in 2025 that its effectiveness on verifiable reasoning tasks was demonstrated at scale. We empirically show that DPO can achieve comparable improvements in verifiable reasoning with significantly lower cost, offering a practical alternative to RL-based pipelines.
- Comprehensive Empirical Study Across Strong Baselines: We benchmark against strong baselines including Simple-RL-Zero, PURE-VR, LIMO, and S1. During rebuttal, we further expanded the experiments to include:
  - Generalization to code and reading-comprehension tasks,
  - Robustness under label noise,
  - Ablations on sampling temperature and training dynamics.

These analyses provide a thorough and well-rounded evaluation of the DPO-VP framework.
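To make the co-evolution loop above concrete, here is a minimal, illustrative Python sketch. It is not the paper's implementation: the objects (`policy`, `prm`, `verifier`) and their methods (`sample`, `step_scores`, `dpo_update`, `update`, `label`) are assumed interfaces, and the hyperparameters are placeholders.

```python
from typing import List, Sequence, Tuple

def iterative_dpo_vp(policy, prm, ref_policy, verifier, prompts: Sequence[str],
                     num_rounds: int = 6, k: int = 8, temperature: float = 0.7):
    """Sketch of generator/verifier co-evolution via iterative DPO (assumed interfaces)."""
    for _ in range(num_rounds):
        pairs: List[Tuple[str, str, str]] = []
        prm_records = []
        for prompt in prompts:
            # 1) Sample k candidate solutions from the current generator.
            candidates: List[str] = policy.sample(prompt, k=k, temperature=temperature)

            # 2) Score each trajectory with the current PRM; use the minimum step
            #    score as a coarse quality signal for the whole trajectory.
            scored = [(c, min(prm.step_scores(prompt, c))) for c in candidates]
            chosen = max(scored, key=lambda s: s[1])[0]
            rejected = min(scored, key=lambda s: s[1])[0]
            pairs.append((prompt, chosen, rejected))

            # 3) Label a few fresh rollouts with a strong verifier (outcome labels
            #    or a large judge model) to serve as new PRM training data.
            prm_records.extend(verifier.label(prompt, candidates))

        # 4) One DPO round on the verifier-preferred pairs, then refresh the PRM,
        #    so generator and verifier improve together across rounds.
        policy = policy.dpo_update(ref_policy, pairs)
        prm = prm.update(prm_records)
    return policy, prm
```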
Comment 2: Comparison with Online RL on the Same Data
Thank you for raising this important point. We clarify that all our main experiments are conducted on the same 8K MATH training dataset, following the setup of SimpleRL-Zero [1].
In Table 2, models such as Qwen2.5-7B-Simple-RL-Zero and PURE-VR are all trained on the same data as our DPO variants.
To ensure completeness and up-to-date comparison, we downloaded and evaluated the latest official checkpoints of Simple-RL-Zero and Simple-RL-Zoo, both trained with PPO-style RL from reasoning trajectories. We also implemented and evaluated a GRPO-Zero variant using the same training data and base model. These are compared to our DPO-VP model under identical evaluation settings. The results are:
| Model | MATH500 | Minerva | Olympiad | AMC23 | AIME24 | Avg. |
|---|---|---|---|---|---|---|
| Qwen2.5-7B-GRPO-Zero | 76.2 | 32.7 | 38.1 | 55.0 | 16.7 | 43.7 |
| Qwen2.5-7B-Simple-RL-Zero | 78.0 | 33.1 | 36.6 | 60.0 | 26.7 | 46.9 |
| Qwen2.5-7B-Simple-RL-Zoo | 80.4 | 39.7 | 38.8 | 57.5 | 26.7 | 48.7 |
| Qwen2.5-7B-PURE-VR | 79.8 | 36.8 | 41.9 | 60.0 | 20.0 | 47.7 |
| Qwen2.5-7B-DPO-VP | 74.8 | 35.3 | 36.9 | 67.5 | 26.7 | 48.2 |
These results show that DPO-VP achieves competitive or superior performance compared to RL-based methods, while incurring significantly lower training cost.
[1] https://github.com/hkust-nlp/simpleRL-reason
Comment 3: Self-Reflection Analysis
We agree that “Let’s” alone may not fully capture self-reflective behavior. We initially used it as a proxy for explicit reasoning, as it's prevalent in mathematical CoT prompting.
To address your concern, we extended the analysis to include “wait”, “double-check”, and “re-evaluate” (a simple counting sketch is given after the list below):
- “Wait” and “double-check” were rarely observed, likely because such introspective cues are not prominent in the Qwen2.5 base model, which lacks long-chain reflective training. These may be more common in R1-style models distilled from long-chain traces.
- We did observe “re-evaluate”, whose usage closely mirrors “Let’s” — appearing more in harder examples and correlating with correctness. However, its overall count is limited, reducing its statistical utility.
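For transparency, the marker counting referred to above can be reproduced with a simple script like the one below; the marker list and function name are illustrative, not the exact analysis code.

```python
import re
from collections import Counter
from typing import Iterable

# Example marker list; the actual analysis may track a different set.
REFLECTION_MARKERS = ["let's", "wait", "double-check", "re-evaluate"]

def count_reflection_markers(responses: Iterable[str]) -> Counter:
    """Count case-insensitive occurrences of self-reflection cues in model outputs."""
    counts = Counter()
    for text in responses:
        lowered = text.lower()
        for marker in REFLECTION_MARKERS:
            counts[marker] += len(re.findall(re.escape(marker), lowered))
    return counts
```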
General Response Summary
Thank you again for your constructive feedback. Your comments helped clarify our key contributions in terms of:
- Methodological novelty via generator–verifier co-evolution and comprehensive empirical investigation,
- Efficiency over online RL methods, validated through consistent-scale comparisons,
- Broadened self-reflection analysis using multiple linguistic markers.
We will revise the camera-ready version to include these findings and provide clearer comparisons to related works you cited, highlighting both conceptual and empirical distinctions.
We hope this response addresses your concerns and would be grateful if you would consider a more favorable score in light of the clarified contributions and new evidence provided.
Thank the authors for the detailed rebuttal. I appreciated the additional experiments and will increase my score from 5 to 6, with a remaining concern about the novelty of iterative DPO with verifiable rewards.
Thank you very much for your positive feedback and for taking the time to review our work. We’re glad to hear that your concerns have been resolved. Should you have any further questions or feedback, we would be pleased to discuss them in more detail.
Thank you again for your thoughtful feedback on our submission. We have carefully addressed your comments during the rebuttal phase, including clarifications and additional experiments tailored to your suggestions.
We understand this is a busy review period, but we would greatly appreciate it if you could take a moment to review our responses and, if possible, share any further thoughts or updates to your evaluation.
Your input is extremely valuable to us. Thank you again for your time and consideration!
To show the novelty and value of the iterative co-evolution of generator and verifier, we include a comprehensive 3-epoch comparison involving three verifier configurations (including the 671B DeepSeek-V3 as a fixed strong verifier and a 7B PRM as a learned one):
| Method | GSM8K | MATH500 | Gaokao | Minerva | 4-Math Avg | 5-OOD Avg |
|---|---|---|---|---|---|---|
| Ours(with original PRM, 3 epochs) | 90.9 | 74.0 | 64.4 | 32.7 | 65.5 | 50.6 |
| Ours(co-evolution, 3 epochs) | 91.5 | 75.0 | 65.8 | 34.2 | 66.6 | 51.4 |
| + DeepSeek-V3 verifier (3 epochs) | 91.0 | 75.9 | 66.0 | 32.4 | 66.3 | 51.1 |
As shown, both our co-evolution approach and using a fixed strong verifier (DeepSeek-V3) yield similar levels of performance, outperforming the original PRM baseline. Importantly, co-evolution achieves comparable results at lower computational cost, making it a more practical alternative under resource constraints.
We understand this is a busy review period, but we would greatly appreciate it if you could take a moment to review our responses and, if possible, share any further thoughts or updates to your evaluation.
Thanks again.
Dear Reviewer AnMH,
We sincerely appreciate the time and effort you've already dedicated to reviewing our work. As the rebuttal phase is nearing its end, we would be truly grateful if you could take a moment to review our responses and the additional experiments we conducted based on your comments.
If you have any further suggestions or feedback, we would be very happy to discuss them.
Thank you again for your time and consideration! We kindly ask for your feedback at your earliest convenience, as the rebuttal deadline is approaching.
This paper explores how DPO can enhance the reasoning abilities of large language models, particularly for mathematical problems. It shows that even a single round of DPO with simple filtering can boost performance for strong base models. The study introduces an iterative framework where both the language model (generator) and a reward model (RM) are improved together over multiple DPO rounds. This iterative process allows them to achieve performance comparable to much more computationally expensive RL methods, but with the advantage of being runnable on a single 80GB GPU, making it a more practical and cost-effective alternative to RL for enhancing reasoning.
Reasons to Accept
- This paper is well-written, presents a clear hypothesis, provides strong supporting experiments, and draws a sound conclusion.
- They show that multi-round DPO can achieve mathematical reasoning capabilities comparable to state-of-the-art RL-based methods. Even a single round of DPO using self-generated data can significantly enhance mathematical reasoning performance.
Reasons to Reject
- While the paper is well-executed, its novelty may be limited, as prior work has already demonstrated the effectiveness of iterative DPO.
- The paper is heavily focused on mathematical reasoning datasets, making its claims about enhancing LLM reasoning primarily applicable to the mathematical domain. More clarification on this in the writing would be helpful and less misleading.
Questions to the Authors
- Relying on "coarse" methods (such as filtering based on simple outcome labels or minimum step scores from an RM) might potentially limit the quality or complexity of the preference data used for training. Have you observed this?
- In Section 3.2, the authors note that applying SFT for Qwen resulted in a performance reduction, attributing this decline to the "inferior quality of the GSM8K and MATH datasets" used for SFT. Do you think the quality of the base datasets from which the self-improvement data is derived has an impact here?
Missing citations:
- Improving Reasoning Ability of Large Language Models via Iterative Uncertainty-based Preference Optimization
- Flow-DPO: Improving LLM Mathematical Reasoning through Online Multi-Agent Learning
- ISR-DPO: Aligning Large Multimodal Models for Videos by Iterative Self-Retrospective DPO
- Aya Expanse: Combining Research Breakthroughs for a New Multilingual Frontier
- AIPO: Agreement-Aware Iterative Preference Optimization for Length Exploitation Mitigation
- Iterative Length-Regularized Direct Preference Optimization: A Case Study on Improving 7B Language Models to GPT-4 Level
We sincerely thank you for your valuable feedback. Below, we address each of your main concerns.
Comment 1: Novelty and Contributions
We appreciate the opportunity to clarify the key innovations and empirical findings of our study. Our work offers several new insights that meaningfully extend existing literature:
- Iterative Co-Evolution of Generator and Verifier: We introduce a multi-round framework where the generator and verifier (PRM) are jointly optimized. Unlike prior works with fixed reward models, we explicitly demonstrate their mutual enhancement across iterations, highlighting the value of feedback-driven co-evolution, a direction not systematically studied in earlier DPO research.
- Resource-Efficient Preference Optimization: We show that multi-round DPO with verifiable rewards can effectively enhance reasoning in Qwen2.5 models with low computational cost (the standard DPO objective we build on is recalled after this list). While RL's value for LLMs has been known since ChatGPT (2022), its efficacy in verifiable reasoning only became clear with DeepSeek-R1 (2025). Our findings show that DPO can match such gains at a fraction of the cost, offering a practical RL alternative.
- Extensive Empirical Evaluation: We compare against strong baselines such as Simple-RL-Zero, PURE-VR, LIMO, and S1. During rebuttal, we further extended our study with:
  - Generalization to code and RC tasks,
  - Robustness tests under label noise,
  - Ablations on temperature and training dynamics.

These results provide a comprehensive and up-to-date evaluation of the DPO-VP framework.
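For reference, the standard DPO objective that our verifiable-reward variant builds on (Rafailov et al., 2023) is recalled below; here $y^{+}$ and $y^{-}$ denote the verifier-preferred and rejected responses, $\pi_{\mathrm{ref}}$ is the frozen reference policy, and $\beta$ is the usual temperature hyperparameter (the notation follows the original DPO paper, not our specific settings):

$$
\mathcal{L}_{\mathrm{DPO}}(\theta) = -\,\mathbb{E}_{(x,\,y^{+},\,y^{-})\sim\mathcal{D}}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y^{+}\mid x)}{\pi_{\mathrm{ref}}(y^{+}\mid x)} - \beta \log \frac{\pi_\theta(y^{-}\mid x)}{\pi_{\mathrm{ref}}(y^{-}\mid x)}\right)\right]
$$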
Comment 2: Generality
We acknowledge that our primary experiments center on mathematical reasoning. To address concerns about generality, we conducted additional evaluations on non-math tasks, including code, scientific reasoning, and reading comprehension.
The results are summarized below:
| Model | CMATH | MMLU-STEM | HumanEval | LiveCodeBench | RACE | Average |
|---|---|---|---|---|---|---|
| Qwen2.5-Math-7B | 64.0 | 58.3 | 40.9 | 5.1 | 61.6 | 46.0 |
| Eurus-2-7B-PRIME | 72.5 | 42.8 | 41.5 | 21.2 | 62.6 | 48.1 |
| Qwen2.5-7B-Simple-RL-Zero | 70.3 | 59.1 | 51.8 | 21.4 | 61.1 | 52.7 |
| Qwen2.5-7B-PURE-VR | 69.5 | 58.8 | 43.3 | 6.4 | 61.3 | 47.9 |
| Qwen2.5-7B-DPO-VP | 68.5 | 59.3 | 50.6 | 17.3 | 61.2 | 51.4 |
These results demonstrate that DPO-VP, though trained only on math data, exhibits strong generalization to non-math reasoning domains, often outperforming strong baselines.
We will clarify this in the camera-ready version to ensure our reasoning claims are properly scoped and not misleading.
Comment 3: Concerns about "Coarse" Preference Filtering
Thank you for raising this important point. We clarify that our so-called “coarse” methods are primarily used for trajectory selection, not for collecting external data. As such, our pipeline retains control over data consistency and distribution, even if the filtering logic is simple.
To assess whether such filtering might degrade preference quality, we conducted a controlled label noise ablation under the PRM with offset setting in Table 1. Specifically, for each positive-negative response pair, we applied a random flipping probability, reversing the preference direction with probability p ∈ [0, 1]. The results are shown below:
| Noisy Ratio | GSM8K | MATH500 |
|---|---|---|
| 0.0 | 90.3 | 75.8 |
| 0.2 | 78.8 | 67.0 |
| 0.5 | 67.1 | 61.6 |
| 0.8 | 59.8 | 59.6 |
| 1.0 | 45.6 | 44.6 |
These results suggest that while DPO is robust to mild noise, its performance degrades under high label corruption, validating the importance of having a reasonably accurate verifier signal. In our case, even coarse filtering based on verifier scores appears sufficient to maintain high-quality training supervision.
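As an illustration of how coarse the "outcome label" filtering is, the sketch below shows one way such positive-negative pairs could be formed; `extract_answer` is an assumed answer parser and the pairing rule is a simplification, not the exact selection logic used in the paper.

```python
from typing import Callable, List, Optional, Tuple

def outcome_label_pair(prompt: str, candidates: List[str], gold_answer: str,
                       extract_answer: Callable[[str], str]) -> Optional[Tuple[str, str, str]]:
    """Coarse outcome-label filtering: pick any correct self-generated response as
    'chosen' and any incorrect one as 'rejected'; skip prompts without both."""
    correct = [c for c in candidates if extract_answer(c) == gold_answer]
    wrong = [c for c in candidates if extract_answer(c) != gold_answer]
    if not correct or not wrong:
        return None  # no valid positive-negative pair for this prompt
    return (prompt, correct[0], wrong[0])
```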
Comment 4: Impact of Dataset Quality on SFT and Self-Improvement
Thank you for the question. In our self-improvement pipeline, we only retain the questions from those datasets and let the model generate new responses itself. These self-generated responses are then verified and filtered before use.
Therefore, even if the original SFT data quality is low, it does not affect our self-improvement pipeline, which is built on newly generated and verified content.
General Response Summary
Thank you again for your constructive feedback. Your comments helped clarify our key contributions in terms of:
- Methodological novelty via generator–verifier co-evolution and comprehensive empirical investigation,
- Extension of our evaluation to non-math tasks, supporting the claim of general reasoning enhancement,
- Clarification of the role and limitations of coarse filtering, and of the decoupling between SFT data quality and self-improvement quality.
We will revise the writing in the camera-ready version to reflect these points, cite all the missing related works, and differentiate our framework accordingly.
We hope our clarifications address your concerns and would appreciate your consideration for a more favorable assessment.
Thanks for the response. I maintain my score.
Thank you very much for your positive feedback and for taking the time to review our work. Should you have any further questions or feedback, we would be pleased to discuss them in more detail.
Thank you again for your thoughtful feedback on our submission. We have carefully addressed your comments during the rebuttal phase, including clarifications and additional experiments tailored to your suggestions.
We understand this is a busy review period, but we would greatly appreciate it if you could take a moment to review our responses and, if possible, share any further thoughts or updates to your evaluation.
Your input is extremely valuable to us. Thank you again for your time and consideration!
This paper makes a compelling case that lightweight, preference-based tuning can match the performance of heavyweight RL approaches in math reasoning tasks—at a fraction of the cost. Starting from a Qwen 2.5-7B baseline, a single application of Direct Preference Optimization (DPO) boosts MATH500 accuracy by 11.6 points over vanilla SFT. More interestingly, repeating DPO while jointly training a reward model leads to further gains, eventually approaching state-of-the-art RL pipelines across five math benchmarks. Impressively, all of this is achieved in under a day on just a few A800 GPUs, suggesting iterative DPO as a highly practical alternative to RL-style training.
Reasons to Accept
Strong empirical results. The paper shows that even relatively coarse preference signals (positive–negative pairs) are enough to close much of the gap to RL baselines on GSM8K, MATH500, Gaokao-EN, AMC23, and AIME24. In particular, 11.6 pts on MATH500 is achieved in a single round.
Joint training of generator and verifier. The reward model (PRM) is updated with the base model, leading to measurable improvements (e.g., PRM weighted-F1 from 65 to 72) and better inference-time scaling (e.g., PRM@64 outperforms majority vote).
Efficient pipeline. The full six-epoch training run takes only about 12 hours on 4 A800 GPUs (or about 3 days on a single A800), which is dramatically cheaper than RL pipelines.
Reasons to Reject
Limited domain. All evaluations focus on math and exam-style tasks. Although Gaokao-EN includes some open-domain passages, the paper does not demonstrate generalization to other domains such as code generation.
Incomplete baselines. While the paper compares DPO against RL and MCTS systems, it does not evaluate against more recent strong SFT or hybrid pipelines (e.g., LIMO, S1). This omission makes it difficult to precisely identify where DPO provides the most value.
Limited filtering ablations. The paper explores several filtering methods (Outcome, ORM, PRM ± offset) but does not deeply investigate the effects of temperature scaling, top-k/top-p sampling, or PRM training efficiency vs. accuracy trade-offs.
Lack of explicit exploration mechanisms. Although the authors introduce mild annealed sampling, DPO does not incorporate exploration strategies such as entropy regularization or diversity sampling. The diminishing returns after a few rounds suggest further gains may be possible with lightweight exploration incentives.
Questions to the Authors
- Have you tried iterative DPO in domains beyond sparse-reward math tasks? For example, how does it perform on tasks like NLI or multilingual QA, where you have dense ground-truth labels instead of binary pass/fail signals?
- How stable is the method when the reward model is noisy or mis-ranks samples? It would be helpful to see an experiment that injects controlled noise into PRM scores to test how much the performance depends on accurate preference learning.
- Since DPO doesn’t explicitly encourage exploration, do you observe diminishing returns after a few rounds? Could lightweight strategies like entropy bonuses or diversity sampling help push performance further, without using full RL?
Response to Comment 1: Domain Generalization
We thank you for the insightful question regarding generalization beyond mathematical reasoning.
In Appendix E, we evaluate our DPO-VP model on two non-math domains:
- CMATH (Chinese mathematical QA)
- MMLU-STEM (covering physics, chemistry, biology)
The results show that the model trained solely on math can still generalize to other reasoning-heavy domains.
Furthermore, we conducted additional experiments on the following benchmarks:
- HumanEval and LiveCodeBench for code generation
- RACE for reading comprehension
The results are summarized below:
| Model | CMATH | MMLU-STEM | HumanEval | LiveCodeBench | RACE | Average |
|---|---|---|---|---|---|---|
| Qwen2.5-Math-7B | 64.0 | 58.3 | 40.9 | 5.1 | 61.6 | 46.0 |
| Eurus-2-7B-PRIME | 72.5 | 42.8 | 41.5 | 21.2 | 62.6 | 48.1 |
| Qwen2.5-7B-Simple-RL-Zero | 70.3 | 59.1 | 51.8 | 21.4 | 61.1 | 52.7 |
| Qwen2.5-7B-PURE-VR | 69.5 | 58.8 | 43.3 | 6.4 | 61.3 | 47.9 |
| Qwen2.5-7B-DPO-VP | 68.5 | 59.3 | 50.6 | 17.3 | 61.2 | 51.4 |
DPO-VP shows consistent performance across multiple out-of-domain tasks.
In addition, we note that the DPO framework is not inherently math-specific. In fact, its original paper [1] demonstrated strong results on dialogue, summarization, and sentiment alignment tasks. We believe that as long as dense labels can be converted into binary preferences, our iterative DPO method can be directly applied to NLI, QA, and other tasks. Moreover, step-wise preference construction for multi-step reasoning is also feasible, as [2] shows, though it is beyond the scope of the current study.
[1] Direct Preference Optimization: Your Language Model is Secretly a Reward Model. NeurIPS 2023.
[2] Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs. arXiv:2406.18629.
Response to Comment 2: Comparison to Strong SFT Baselines
We thank you for pointing out the relevance of recent SFT-based approaches like LIMO and S1. As these methods were originally evaluated only on 32B models, we reproduced them on our 7B base model to provide a fair and direct comparison.
Specifically, we fine-tuned Qwen2.5-Math-7B on the released datasets from LIMO and S1 for 15 epochs, and evaluated using a 32K context length, consistent with their original setup. In contrast, our DPO-VP model uses a maximum context length of only 2K tokens during both training and evaluation.
The results are shown below:
| Model | MATH500 | Minerva | Olympiad | AMC23 | AIME24 | Average |
|---|---|---|---|---|---|---|
| Qwen2.5-Math-7B | 64.8 | 15.4 | 25.6 | 37.5 | 16.7 | 32.0 |
| Qwen2.5-Math-7B-LIMO | 70.2 | 21.0 | 37.0 | 45.0 | 16.7 | 38.0 (+6.0) |
| Qwen2.5-Math-7B-S1 | 72.0 | 32.4 | 37.5 | 55.0 | 13.3 | 42.0 (+10.0) |
| Qwen2.5-7B-DPO-VP | 74.8 | 35.3 | 36.9 | 67.5 | 26.7 | 48.2 (+16.2) |
DPO-VP significantly outperforms S1 and LIMO across nearly all benchmarks.
This result suggests that while both LIMO and S1 yield moderate gains over the base model, their improvements are limited when applied to 7B models. One possible explanation is that SFT with small-scale, long-chain data is less effective at smaller scales, or that such SFT data must be meticulously prepared for each model, whereas our iterative DPO framework achieves better alignment and reasoning performance via self-generated preference pairs across different models (Qwen2.5-3B/7B), even under resource-constrained settings.
Response to Comment 3: Label Noise Ablation & PRM Robustness
We appreciate your concern regarding the robustness of our DPO framework under noisy preference labels.
To investigate this, we conducted a controlled label noise ablation under the "PRM with offset" setting from Table 1. Specifically, for each positive-negative response pair, we applied a random flipping probability — referred to as the noisy ratio — that reverses the preference direction with probability p ∈ [0, 1]. For example, a noisy ratio of 0.5 means fully random pairwise supervision, while 1.0 implies completely reversed preferences.
The results below show pass@1 accuracy after a single DPO round on GSM8K and MATH500 using Qwen2.5-7B:
| Noisy Ratio | GSM8K | MATH500 |
|---|---|---|
| 0.0 | 90.3 | 75.8 |
| 0.2 | 78.8 | 67.0 |
| 0.5 | 67.1 | 61.6 |
| 0.8 | 59.8 | 59.6 |
| 1.0 | 45.6 | 44.6 |
These results confirm that model performance degrades as preference supervision becomes less accurate. While DPO is somewhat tolerant to mild noise (e.g., ≤0.2), higher noise levels lead to noticeable drops. This empirically validates the importance of maintaining reasonably accurate verifier signals for effective preference optimization.
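For clarity, the noise-injection protocol can be summarized by the following sketch; the function name and data layout are illustrative, not the exact ablation code.

```python
import random
from typing import List, Tuple

def flip_preferences(pairs: List[Tuple[str, str, str]], noisy_ratio: float,
                     seed: int = 0) -> List[Tuple[str, str, str]]:
    """Reverse the chosen/rejected direction of each (prompt, chosen, rejected)
    pair with probability `noisy_ratio`: 0.5 yields fully random supervision,
    1.0 reverses every pair."""
    rng = random.Random(seed)
    noisy = []
    for prompt, chosen, rejected in pairs:
        if rng.random() < noisy_ratio:
            chosen, rejected = rejected, chosen  # flip the preference label
        noisy.append((prompt, chosen, rejected))
    return noisy
```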
Response to Comment 4: Lack of Explicit Exploration Mechanism
We agree with you that exploration may play an important role in iterative self-improvement.
In our early attempts, we applied a fixed sampling temperature (t = 0.7) across all DPO rounds. However, we observed performance saturation after the third epoch. To introduce greater sampling diversity, we increased the temperature in later epochs: t = 1.0 for epochs 4–5 and t = 1.2 for epochs 6–7.
This approach increased response diversity in later rounds and yielded substantial gains. The results below compare fixed vs. increased temperature on the average performance across five benchmarks:
| Epoch | t = 0.7 | Ours (temperature ↑) |
|---|---|---|
| 3 | 44.5 | 44.5 |
| 4 | 44.6 | 44.7 |
| 5 | 44.3 | 46.5 |
| 6 | 46.2 | 48.2 |
| 7 | 45.5 | 47.6 |
These results demonstrate that even simple temperature annealing can enhance DPO by encouraging more diverse preference pairs, leading to stronger downstream performance.
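The annealed sampling schedule described above amounts to a simple epoch-to-temperature mapping, sketched below (the function name is illustrative):

```python
def sampling_temperature(epoch: int) -> float:
    """Annealed sampling schedule used in later DPO rounds: fixed t=0.7 for the
    first three epochs, then higher temperatures to increase response diversity."""
    if epoch <= 3:
        return 0.7
    if epoch <= 5:
        return 1.0
    return 1.2  # epochs 6-7
```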
While our method does not include formal entropy-based or diversity-penalized objectives, we believe that lightweight exploration strategies can meaningfully complement preference-based training. We acknowledge this as a valuable direction for future work, particularly inspired by recent and cutting-edge advances in exploration mechanisms within reinforcement learning [2–4].
[2] Entropy-based Exploration Conduction for Multi-step Reasoning. arXiv:2503.15848
[3] DRA-GRPO: Exploring Diversity-Aware Reward Adjustment for R1-Zero-Like Training of Large Language Models. arXiv:2505.09655
[4] DGRO: Enhancing LLM Reasoning via Exploration-Exploitation Control and Reward Variance Management. arXiv:2505.12951
General Response Summary
Thank you for the insightful and valuable comments. In response, we conducted a series of new experiments and ablation studies to thoroughly address each concern:
- We demonstrate that our DPO framework generalizes well beyond math, with strong results on code generation (HumanEval, LiveCodeBench) and reading comprehension (RACE).
- We include direct comparisons with recent strong SFT baselines (LIMO, S1) under the same 7B setting, showing that DPO-VP outperforms them across multiple benchmarks. More critically, while strong performance can be achieved with limited SFT data, obtaining such effective data demands large-scale experimentation with meticulously selected samples for each model. In contrast, DPO-VP sidesteps this requirement via self-generated preference pairs.
- We provide a detailed label noise ablation to verify the robustness of our framework under noisy preference signals.
- We explore lightweight sampling-based exploration strategies, and show that performance can be further improved via simple temperature scaling.
We appreciate your suggestions, which have helped us further validate the scalability, robustness, and extensibility of our approach. In the camera-ready version, we will integrate the additional experimental results and expand our discussions accordingly.
We hope our response has satisfactorily clarified the key points, and we kindly ask you to consider a more favorable score based on the additional evidence provided.
I really appreciate the authors for the detailed responses with the additional experiments. My concerns are well resolved. Therefore, I increased my score to 7 from 6.
Thank you very much for your positive feedback and for taking the time to review our work. We’re glad to hear that your concerns have been resolved. Should you have any further questions or feedback, we would be pleased to discuss them in more detail.
Thank you again for your thoughtful feedback on our submission. We have carefully addressed your comments during the rebuttal phase, including clarifications and additional experiments tailored to your suggestions.
We understand this is a busy review period, but we would greatly appreciate it if you could take a moment to review our responses and, if possible, share any further thoughts or updates to your evaluation.
Your input is extremely valuable to us. Thank you again for your time and consideration!
This paper studies using iterative DPO to fine-tune LLMs on mathematical reasoning tasks.
Reasons to Accept
- The paper is easy to follow.
- The experiments are conducted on various models and the results support the main claim of the paper.
Reasons to Reject
- The novelty of the paper is limited. The authors show that iterative DPO can be applied to enhance the reasoning abilities of LLMs, which does not introduce new methods or provide new insights.
- Applying iterative DPO on reasoning tasks has been explored before, e.g., in [1], it was observed that iterative DPO works for reasoning as long as the prompt data is well chosen.
- To make the claims more convincing, the performance in Table 1 should be reported for multiple training runs, given that the difference in accuracies, such as the pass@8 accuracies, is within the range of 1.
- When training the PRM, the authors use a stronger model, such as DeepSeek-V3, to provide the supervision signal. The resulting PRM is thus approximately fixed, as a proxy of the stronger model, and does not get stronger with more training iterations. A reasonable training strategy is to use on-policy Monte-Carlo unrolls, such as the ones proposed in [2, 3]. Could the authors comment on why not use this training strategy, which is more aligned with the outcome reward model training?
[1] Dong et al., "RLHF Workflow: From Reward Modeling to Online RLHF." [2] Wang et al., "Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations." [3] Luo et al., "Improve Mathematical Reasoning in Language Models by Automated Process Supervision."
Questions to the Authors
See weaknesses above.
We thank the reviewer for the thoughtful critiques. We address each concern below.
Comment 1: Novelty and Contributions
We are grateful for the opportunity to elaborate on the innovations and empirical findings introduced in our study. While we acknowledge that DPO and its iterative variants have been previously explored, our work introduces several novel insights and empirical contributions that meaningfully extend the current literature:
- Iterative Co-Evolution of Generator and Verifier: We propose a multi-round training framework where the generator and verifier (PRM) are jointly optimized. Unlike prior works that fix the reward model, we explicitly demonstrate mutual enhancement across iterations, revealing the benefits of feedback-driven co-evolution, which has not been systematically explored in earlier DPO studies.
- Resource-Efficient Verifiable Preference Optimization: Our work focuses on multi-round DPO with verifiable rewards, showing that it can rapidly enhance reasoning ability in Qwen2.5-series models using limited computational resources. While the benefits of RL in LLMs have been recognized since the release of ChatGPT in 2022, it was not until the emergence of DeepSeek-R1 in 2025 that its effectiveness on verifiable reasoning tasks was demonstrated at scale. We empirically show that DPO can achieve comparable improvements in verifiable reasoning with significantly lower cost, offering a practical alternative to RL-based pipelines.
- Comprehensive Empirical Study Across Strong Baselines: We benchmark against strong baselines including Simple-RL-Zero, PURE-VR, LIMO, and S1. During rebuttal, we further expanded experiments to include:
  - Generalization to code and RC tasks,
  - Robustness under label noise,
  - Ablations on sampling temperature and training dynamics.

These analyses provide a thorough and well-rounded evaluation of the DPO-VP framework.
Comment 2: Variance in Table 1 Results
We appreciate this suggestion. The performance difference may indeed fall within variance margins. In the camera-ready version, we will re-run key experiments and report averaged results over multiple seeds to improve statistical robustness.
That said, we emphasize that the main purpose of Table 1 is to demonstrate that even a single round of coarse filtering can substantially enhance the base Qwen model’s reasoning performance. We believe that a single-run result is sufficient to validate this clear performance gain and justify the effectiveness of our filtering strategy.
Comment 3: Justification for PRM Construction without Monte Carlo Unrolling
Our choice to train the PRM using outputs labeled by a stronger verifier (e.g., DeepSeek-V3) was primarily driven by efficiency and quality:
- It allows us to bootstrap a high-quality PRM without costly Monte Carlo unrolling or reinforcement learning.
- During DPO iterations, although the PRM architecture is fixed, it is continuously trained with new actor outputs, so its supervision signal gradually aligns better with the evolving generator—forming a form of implicit on-policy adaptation.
While Monte Carlo-based methods such as [2,3] are appealing, they often incur significantly higher computational costs and rely on fine-grained outcome annotations. In our setting, such annotations are not readily available and do not align well with the co-evolutionary training scheme of our generator and PRM, which emphasizes efficient and scalable online supervision.
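To make this design choice concrete, the sketch below shows how PRM training data could be produced by labeling fresh actor outputs with a strong verifier, rather than estimating step values with Monte Carlo unrolls; the step splitter and verifier interface are assumptions, not the paper's actual pipeline.

```python
from typing import Dict, List

def split_into_steps(response: str) -> List[str]:
    # Naive step splitter for illustration: one reasoning step per non-empty line.
    return [line for line in response.splitlines() if line.strip()]

def build_prm_training_data(prompts: List[str], policy, strong_verifier,
                            k: int = 8) -> List[Dict]:
    """Label sampled responses step-by-step with a strong verifier (e.g., a large
    judge model) to obtain online PRM training data without MC rollouts."""
    records = []
    for prompt in prompts:
        for response in policy.sample(prompt, k=k):
            steps = split_into_steps(response)
            labels = strong_verifier.judge_steps(prompt, steps)  # e.g., 0/1 per step
            records.append({"prompt": prompt, "steps": steps, "labels": labels})
    return records
```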
General Response Summary
We sincerely thank the reviewer for the thoughtful and constructive feedback. Your comments have helped us further clarify the novelty and practical implications of our work. Specifically, we have:
- Clarified our core contribution as a systematic study of generator–verifier co-evolution, an aspect underexplored in prior DPO research;
- Reaffirmed the robustness and efficiency of our framework through new results on generalization, label noise tolerance, and ablations;
- Explained our choice of PRM training strategy and its advantages in terms of practical scalability and alignment with the evolving generator;
- Committed to re-running key experiments in Table 1 with multiple seeds in the final version to ensure statistical confidence.
We hope our response has sufficiently addressed your concerns and clarified both the novelty and empirical rigor of our work. If these clarifications and additional results help resolve earlier doubts, we would greatly appreciate your consideration of a more favorable evaluation.
Thank you for the detailed response. Regarding the novelty of feedback-driven co-evolution of both policy and reward model, I am still confused why we would need a stronger verifier, e.g., DeepSeek-V3, to provide supervision to train the PRM, instead of directly using this fixed verifier. Besides, since the DPO+PRM variant achieves higher overall performance, I feel it necessary to compare with process-reward RL, instead of GRPO, which is based on outcome rewards. The additional gain of DPO could come from the stronger verifier. For this reason, I'm also not fully convinced why not use Monte-Carlo rollouts for PRM training, which could better isolate the true benefits of DPO+PRM.
We sincerely thank you for the thoughtful and constructive comments. Your feedback helped clarify the scope and strengthen the presentation of our work. In this rebuttal, we address the main concerns regarding:
1. Why not use a fixed verifier like DeepSeek-V3
We appreciate your insightful observation. DeepSeek-V3 is a 671B MoE model and expensive to query, while our PRM is initialized from Qwen2.5-7B. Using DeepSeek-V3 for every DPO iteration is prohibitively expensive—even a single epoch already requires 7.5K queries. We instead distill its supervision into a lightweight PRM to support scalable multi-round training.
While one might consider directly using DeepSeek-V3 as a fixed verifier, we found this to be computationally excessive and unnecessarily strong for the lightweight filtering task. For comparison, we evaluated a one-epoch filtering baseline using DeepSeek-V3 on top of Table 1:
| Method | GSM8K | MATH500 | GaoKao | Minerva | Avg |
|---|---|---|---|---|---|
| +Outcome Label | 89.8 | 74.2 | 62.9 | 25.0 | 63.0 |
| +Qwen2.5-PRM | 90.5 | 74.6 | 61.0 | 25.0 | 62.8 |
| +DeepSeek-V3 Verifier | 90.3 | 75.1 | 61.7 | 25.6 | 63.2 |
The differences between these variants are minor, as the verifier only guides data filtering, and DPO iterations are still driven by the generator itself. This shows that training a PRM provides nearly equivalent performance at a fraction of the cost.
Lastly, we emphasize that our framework is general: if a strong verifier is available and cost is not a concern, it can be directly used as a fixed verifier - this is a special case of our formulation. In fact, as discussed in Section 3.5, we also experimented with outcome labels (i.e., ground-truth correctness) as a fixed verifier, and showed that iteratively optimizing the generator alone can already yield strong performance.
2. Compare with process-level RL methods
Thank you for the suggestion. We conducted an additional comparison against Qwen2.5-7B-PURE-PRM [1], a strong PPO-based baseline whose PRM is trained on the public PRM800K dataset.
We used our best 3-epoch variant from Figure 3(a), and present results below:
| Method | GSM8K | MATH500 | GaoKao | Minerva | Avg |
|---|---|---|---|---|---|
| Ours(3epoch) | 91.5 | 75.0 | 65.8 | 34.2 | 66.6 |
| PURE-PRM | 88.3 | 79.8 | 63.9 | 34.1 | 66.5 |
While the two models achieve comparable average performance, our approach remains more computationally efficient, as evidenced by the detailed comparison in Table 3.
Additionally, in Table 2 (§3.5), we include comparisons to other process-level RL baselines such as rStar-Math-7B and Eurus-2-7B-PRIME, which also rely on fully trained reward models. Their final performance does not consistently surpass our verifiable-reward variants, further supporting the effectiveness of our approach.
[1] https://huggingface.co/jinachris/Qwen2.5-7B-PURE-PRM
3. Why not use Monte Carlo rollouts for training the PRM
We apologize for the lack of clarity in our earlier response. Our design choice is motivated by both efficiency and data availability: Instead of using Monte Carlo rollouts, we directly label sampled responses using a strong verifier. This offers a practical and online way to train the PRM during each DPO iteration, leveraging fresh actor outputs without incurring the high cost of repeated MC unrolling.
We agree that the performance gain may stem from external supervision. In Appendix B (Table 4), a PRM trained on the public MathShepherd dataset is outperformed by our online-labeled PRM on ProcessBench (67.2 vs. 73.6 average F1). This suggests that the extra performance gain likely comes from the strong supervision signals provided by the verifier, which are absent in standard MC setups. It also indirectly reflects the efficiency of training the PRM directly on a small amount of online data labeled by a strong verifier.
This setting also follows the data labeling strategy of a recent ACL 2025 paper [2], which uses a similar method to label trajectories with bi-directional verifier feedback.
We agree that MC can provide a more theoretically grounded way to isolate reward model effects. However, our primary goal is to study co-evolution between generator and verifier, and the key performance gains in this work are grounded in outcome-level rewards, not in standalone PRM accuracy. Therefore, we believe that this simplified approach still sufficiently supports our core findings.
[2] Better Process Supervision with Bi-Directional Rewarding Signals, ACL 2025.
Summary
We hope these clarifications and new results demonstrate the robustness and generality of our framework. We would greatly appreciate your reconsideration based on the provided evidence. Thank you again for your insightful feedback and your time in reviewing our work.
Thank you again for your thoughtful feedback on our submission. We have carefully addressed your comments during the rebuttal phase, including clarifications and additional experiments tailored to your suggestions.
We understand this is a busy review period, but we would greatly appreciate it if you could take a moment to review our responses and, if possible, share any further thoughts or updates to your evaluation.
Your input is extremely valuable to us. Thank you again for your time and consideration!
Thank the authors for the further clarification. The comparison with process-reward RL baselines addressed one of my concerns, so I raised my score to 5. However, my concerns regarding the novelty and method designs still remain. Since the primary goal of this paper is to study co-evolution between the generator and verifier, and a fixed strong verifier gives better performance, it makes the co-evolution less necessary. Higher computation cost versus higher performance + additional training is a trade-off in my opinion. Additionally, the outcome-reward-only DPO already exhibits strong performance, which makes the novelty of co-evolution less pronounced, as this variant uses a fixed verifier.
Thank you again for your thoughtful and constructive feedback. We are glad that our additional experiments on process-reward RL baselines helped address part of your concerns.
Regarding your remaining concern about the necessity of co-evolution and the trade-off between using a fixed strong verifier versus a learned one: Due to time constraints in the original submission, we previously only compared single-round results. In this rebuttal, we now include a more comprehensive 3-epoch comparison involving three variants:
| Method | GSM8K | MATH500 | Gaokao | Minerva | 4-Math Avg | 5-OOD Avg |
|---|---|---|---|---|---|---|
| Ours(with original PRM, 3 epochs) | 90.9 | 74.0 | 64.4 | 32.7 | 65.5 | 50.6 |
| Ours(co-evolution, 3 epochs) | 91.5 | 75.0 | 65.8 | 34.2 | 66.6 | 51.4 |
| + DeepSeek-V3 verifier (3 epochs) | 91.0 | 75.9 | 66.0 | 32.4 | 66.3 | 51.1 |
As shown, both our co-evolution approach and using a fixed strong verifier (DeepSeek-V3) yield similar levels of performance, outperforming the original PRM baseline. Importantly, co-evolution achieves comparable results at lower computational cost, making it a more practical alternative under resource constraints.
We also agree that the strong outcome-reward-only DPO variant achieves competitive performance. However, it presumes the availability of clear-cut outcome labels, which are not always accessible in open-ended domains such as mathematical proof, scientific reasoning, or biomedical QA. In such cases, distilling knowledge from a strong verifier into a compact, general-purpose critic remains a promising and efficient direction, which our co-evolution framework enables. This is an important direction we aim to pursue in future work.
We sincerely appreciate your objective evaluation and helpful suggestion, which are valuable and will help us further improve and refine our future research.
Dear reviewers,
Thank you for your valuable and constructive feedback. During the rebuttal phase, we carefully addressed your main concerns and provided additional experiments to support our claims. Our responses are summarized as follows:
- On Methodological Novelty: We emphasized the core contribution of our work—the co-evolutionary training of generator and verifier—which enables scalable and efficient preference optimization. This was clarified primarily in response to Reviewers 1K3W, znvZ, and AnMH.
- On PRM Construction Using a Strong Verifier: We explained our design choice to train the PRM with outputs labeled by a stronger verifier (e.g., DeepSeek-V3), balancing efficiency and alignment without relying on costly Monte Carlo rollouts. (Reviewer 1K3W)
- On Filtering Strategy and Dataset Quality: We addressed concerns regarding the coarse nature of preference filtering and its impact on SFT by providing noise ablation results and clarifying our design goals. (Reviewer znvZ)
- On Self-Reflection Signals: We conducted additional analysis on linguistic markers of self-reflection and discussed their relation to model behavior. (Reviewer AnMH)
- On Additional Empirical Studies: We added several ablations and generalization tests to further validate our approach:
- Domain generalization: Tested on code and reading comprehension tasks (Reviewers V4JU, znvZ)
| Model | CMATH | MMLU-STEM | HumanEval | LiveCodeBench | RACE | Average |
|---|---|---|---|---|---|---|
| Qwen2.5-Math-7B | 64.0 | 58.3 | 40.9 | 5.1 | 61.6 | 46.0 |
| Eurus-2-7B-PRIME | 72.5 | 42.8 | 41.5 | 21.2 | 62.6 | 48.1 |
| Qwen2.5-7B-Simple-RL-Zero | 70.3 | 59.1 | 51.8 | 21.4 | 61.1 | 52.7 |
| Qwen2.5-7B-PURE-VR | 69.5 | 58.8 | 43.3 | 6.4 | 61.3 | 47.9 |
| Qwen2.5-7B-DPO-VP | 68.5 | 59.3 | 50.6 | 17.3 | 61.2 | 51.4 |
- Comparison to recent SFT baselines: Including LIMO and S1 at the 7B scale (Reviewer V4JU)
| Model | MATH500 | Minerva | Olympiad | AMC23 | AIME24 | Average |
|---|---|---|---|---|---|---|
| Qwen2.5-Math-7B | 64.8 | 15.4 | 25.6 | 37.5 | 16.7 | 32.0 |
| Qwen2.5-Math-7B-LIMO | 70.2 | 21.0 | 37.0 | 45.0 | 16.7 | 38.0 (+6.0) |
| Qwen2.5-Math-7B-S1 | 72.0 | 32.4 | 37.5 | 55.0 | 13.3 | 42.0 (+10.0) |
| Qwen2.5-7B-DPO-VP | 74.8 | 35.3 | 36.9 | 67.5 | 26.7 | 48.2 (+16.2) |
- Label noise ablation: Demonstrating the robustness of DPO-VP (Reviewers V4JU, znvZ)
| Noisy Ratio | GSM8K | MATH500 |
|---|---|---|
| 0.0 | 90.3 | 75.8 |
| 0.2 | 78.8 | 67.0 |
| 0.5 | 67.1 | 61.6 |
| 0.8 | 59.8 | 59.6 |
| 1.0 | 45.6 | 44.6 |
- Exploration analysis via temperature scaling: Showing performance gain from higher-temperature rounds (Reviewer V4JU)
| Epoch | t = 0.7 | Ours (temperature ↑) |
|---|---|---|
| 3 | 44.5 | 44.5 |
| 4 | 44.6 | 44.7 |
| 5 | 44.3 | 46.5 |
| 6 | 46.2 | 48.2 |
| 7 | 45.5 | 47.6 |
- Baseline comparisons under the same data: Including GRPO-Zero, Simple-RL-Zero/Zoo, etc. (Reviewer AnMH)
| Model | MATH500 | Minerva | Olympiad | AMC23 | AIME24 | Avg. |
|---|---|---|---|---|---|---|
| Qwen2.5-7B-GRPO-Zero | 76.2 | 32.7 | 38.1 | 55.0 | 16.7 | 43.7 |
| Qwen2.5-7B-Simple-RL-Zero | 78.0 | 33.1 | 36.6 | 60.0 | 26.7 | 46.9 |
| Qwen2.5-7B-Simple-RL-Zoo | 80.4 | 39.7 | 38.8 | 57.5 | 26.7 | 48.7 |
| Qwen2.5-7B-PURE-VR | 79.8 | 36.8 | 41.9 | 60.0 | 20.0 | 47.7 |
| Qwen2.5-7B-DPO-VP | 74.8 | 35.3 | 36.9 | 67.5 | 26.7 | 48.2 |
In the camera-ready version, we will further improve the writing, clarify our methodology and motivation, and ensure that missing references are properly cited.
We sincerely appreciate your time and thoughtful feedback. Should you have further questions, we would be happy to engage in deeper discussion, and we respectfully invite you to consider updating your evaluation based on the new materials provided.
This paper proposes to enhance LLM reasoning ability with DPO-based algorithms. The central contribution lies in showing that iterative DPO, combined with co-evolutionary training of the policy and verifier, can match the performance of reinforcement learning approaches while being more computationally efficient.
The major concerns from the reviewers are about novelty: many iterative DPO methods have been discussed in the existing literature. I also agree with this concern and would like to point out one missing citation [1], where iterative DPO has been carefully discussed and shown to be a strong enhancement over classical DPO.
However, I would also admit that the insight that the reward model can be co-evolved is new to me (and the reviewers). This places the paper at an interesting middle ground between DPO and PPO. In addition, during the rebuttal, the authors have included a wide collection of evaluation benchmarks, which consolidates the position of this work.
The final concern is from Reviewer 1K3W regarding the use of external supervision. I agree that this can be potentially improved or replaced with a real human annotation process, which can be left for future work.
To sum up, I would give an acceptance recommendation for this paper. But I would encourage the authors to include the missing citation and more discussions on the final concerns from Reviewer 1K3W.
[1] Xu et al., Is DPO Superior to PPO for LLM Alignment? A Comprehensive Study, https://arxiv.org/abs/2404.10719