PaperHub
Overall: 6.4/10 · Poster · 4 reviewers
Ratings: 4, 4, 4, 4 (min 4, max 4, std 0.0, mean 4.0)
Confidence
Novelty: 2.8 · Quality: 2.8 · Clarity: 3.0 · Significance: 2.8
NeurIPS 2025

Incentivizing LLMs to Self-Verify Their Answers

OpenReview · PDF
Submitted: 2025-05-12 · Updated: 2025-10-29
TL;DR

We introduce an RL framework that unifies the training of answer generation and verification in a single model.

Abstract

Keywords
large language model · math reasoning · reinforcement learning

Reviews and Discussion

Review
Rating: 4

This paper presents a unified RL framework that trains an LLM by synergizing answer generation and verification via Group Relative Policy Optimization (GRPO), eliminating the need for external reward models at inference time. The authors integrate a generative-verifier reward into GRPO, using a policy-aligned buffer and a dynamic verification reward to focus the model on challenging cases. This method is then evaluated on mathematical reasoning benchmarks (MATH500, AIME24/25, AMC23, OlympiadBench), showing that "Self-Verification" improves post-training performance over standard GRPO and enables more effective test-time scaling compared to self-consistency, best-of-N, beam search, etc.
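For context, GRPO replaces a learned value function with group-relative advantage normalization over the G sampled responses per prompt; a minimal sketch of that step is below (this reflects the standard GRPO formulation, not the paper's specific verification reward design):

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages for one prompt's G sampled responses.

    Each response's reward is normalized by the mean and std of its own
    group, so no separate value network is needed (the core idea of GRPO).
    """
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Example: 8 rollouts for one math problem, reward 1 if the final answer
# matches the reference answer, 0 otherwise.
print(grpo_advantages([1, 0, 0, 1, 1, 0, 0, 0]))
```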

Strengths and Weaknesses

Strengths:

  1. The single‐model approach bridges post-training and test-time scaling, avoiding distributional mismatch with external verifiers.

  2. Evaluation: On multiple benchmarks, Self-Verification models outperform base models and prior RL baselines (e.g., SimpleRL, DeepScaleR) in both problem-solving and verification tasks. In terms of efficiency, verification consumes significantly fewer tokens (about 24% of those used in problem-solving) and incurs lower inference overhead than external-verifier methods. Ablation studies are also provided.

Weaknesses:

  1. (Data Contamination): The paper does not discuss whether evaluation sets (e.g., AIME24, MATH500) might overlap with the training data or pretraining corpora, especially for the self-verification verifier (Tables 2–3). Without contamination analysis, reported verification accuracies may be overly optimistic.

  2. The authors focus exclusively on LLM-based generative verification (which may contain hallucinations and black-box mechanisms) and do not compare to symbolic or formal engines (e.g., SAT/SMT solvers) that could provide deterministic verification paths for many math sub-tasks.

Questions

  1. Verification without Ground Truth: During test-time scaling (Section 3.2), how does the model reliably assess correctness in the absence of explicit ground-truth labels? What confidence calibration mechanisms prevent propagation of false positives/negatives?

Limitations

Yes, the authors have explained the limitations, including the scope and potential constraints of their approach.

Final Justification

The authors have provided a detailed response that clarifies my confusion about formal verification (W2, W3) and addresses my concerns about the data contamination issue in the experiments. Based on their response, I will keep my rating and increase my confidence.

Formatting Issues

N/A

Author Response

Weakness 1 The paper does not discuss whether evaluation sets (e.g. AIME24, MATH500) might overlap with the training data or pretraining corpora, especially for the self-verification verifier (Tables 2–3). Without contamination analysis, reported verification accuracies may be overly optimistic.

All our training data are inherited exactly from prior works (please refer to SimpleRL [27] and DeepScaleR [28] in our paper). The dataset used in SimpleRL is the training portion of the original MATH dataset, with the test data removed. The dataset of DeepScaleR contains AIME and AMC problems, all prior to our test benchmarks according to their description. Therefore, the training data used in our method has no overlap with the benchmarks. Our verifier training is based on the same query data as the problem-solving data, which does not overlap with our evaluation benchmarks.

Regarding possible leakage from the underlying pre-training corpora of Qwen and DeepSeek, those corpora are proprietary and unavailable for direct inspection. However, every baseline we compare against is built on the same pretrained models, so any residual contamination, if existing, affects all methods equally and therefore cannot explain the performance gap we observe.

Weakness 2 The authors focus exclusively on LLM-based generative verification (which may contain hallucinations and black-box mechanisms) and do not compare to symbolic or formal engines (e.g. SAT/SMT solvers) that could provide deterministic verification paths for many math sub‐tasks.

Our benchmarks involve full natural-language contest problems. We note that it is infeasible to use SAT/SMT solvers for our benchmarks, since they are only for problems that fit their formal theories -- pure propositional logic for SAT and a few decidable first-order fragments for SMT. Encoding these challenging math questions into such formal forms is labor-intensive and leaves many sub-goals outside the solvers' scope. Below we list the differences between our method and SAT/SMT solvers in terms of problem categories and encoding methods.

| Dimension | SAT Solvers | SMT Solvers | Self-Verification |
| --- | --- | --- | --- |
| Problem category | Boolean satisfiability | Algebra/logic within supported theory | Open math problems |
| Problem encoding | Manual CNF translation | Manual, theory-specific constraints | Natural-language prompt |

Our LLM pipeline operates directly on the raw natural-language statement, generates a complete derivation, and self-verifies it probabilistically, covering the full breadth of contest-level mathematics while requiring zero manual encoding.
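As an illustration of the "natural-language prompt" row above, here is a hypothetical verification prompt and verdict parser; the template wording and function names are ours, and the paper's actual prompt and judgment format may differ:

```python
VERIFY_TEMPLATE = (
    "Problem:\n{problem}\n\n"
    "Candidate solution:\n{solution}\n\n"
    "Carefully check the reasoning above step by step, then answer with "
    "'Yes' if the final answer is correct or 'No' if it is not."
)

def parse_judgment(generated_text: str) -> bool:
    """Map the model's free-form verification output to a binary verdict."""
    tail = generated_text.strip().lower()
    return tail.endswith("yes") or "answer: yes" in tail
```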

Question 1 Verification without Ground Truth: During test-time scaling (Section 3.2), how does the model reliably assess correctness in the absence of explicit ground-truth labels? What confidence calibration mechanisms prevent propagation of false positives/negatives?

As shown in Equation (6), we adopt a weighted score to aggregate different generated responses and select the answer with the highest score. This ensemble has two safeguards:

  1. The verifier is trained jointly with the solver and receives a reward that penalises wrong judgments, so its Yes/No probabilities are already aligned with empirical correctness.
  2. Even if an individual judgment is noisy, combining N independent solutions dilutes single-case errors. In practice, we see superior math performance when aggregating 16 samples in Table 4.

Therefore, the weighted aggregation can mitigate both false positives and negatives without requiring external ground-truth labels.
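Since Equation (6) is not reproduced in this thread, the following is only a plausible sketch of verifier-weighted answer selection, assuming each candidate answer accumulates a vote plus the verifier's "Yes" confidence scaled by a weight α; the paper's exact formula may differ:

```python
from collections import defaultdict

def select_answer(samples, alpha=1.0):
    """samples: list of (final_answer, verifier_yes_prob) pairs for N rollouts.

    Each distinct answer accumulates a vote term plus an alpha-weighted
    verifier-confidence term; the answer with the highest score wins.
    """
    scores = defaultdict(float)
    for answer, yes_prob in samples:
        scores[answer] += 1.0 + alpha * yes_prob
    return max(scores, key=scores.get)

# Example with N = 4 sampled solutions to the same problem.
samples = [("42", 0.9), ("42", 0.7), ("41", 0.95), ("42", 0.2)]
print(select_answer(samples, alpha=1.0))  # -> "42"
```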

Comment

Thanks for addressing my concerns! I will maintain the rating and raise the confidence.

Review
Rating: 4

This paper introduces a novel "self-verification" framework to address the distributional shift between specialized language models and external verifiers, a key problem that limits performance scaling at test-time. The authors propose unifying answer generation and verification within a single RL process, training the model to simultaneously solve complex reasoning tasks and assess the correctness of its own solutions. By leveraging a policy-aligned buffer to train on its own recent outputs and a dynamic reward function that incentivizes learning on difficult cases, the model effectively becomes its own highly capable and efficient verifier. Experiments on mathematical reasoning benchmarks demonstrate that this method not only improves the model's intrinsic problem-solving ability but also enables superior test-time scaling by self-evaluating multiple candidate responses, outperforming traditional methods like self-consistency and those relying on external reward models.

Strengths and Weaknesses

The proposed unified framework for problem-solving and self-verification is quite intriguing, as it appears, at least on the surface, to lead to stronger reasoning capabilities. Furthermore, the authors' point regarding verifier distribution shift is compelling, and I concur with this perspective.

However, a conceptual challenge arises from this design. While using a single model for both problem-solving and self-verification elegantly addresses the issue of distribution shift, it seems that, theoretically, the responses generated by the model itself do not introduce new, external information. This raises the crucial question of why this unified training approach should enhance the model's fundamental reasoning abilities. The paper seems to lack a more thorough or intuitive explanation for this synergistic effect.

Questions

1. Regarding Table 1, the comparison seems potentially imbalanced. If I understand correctly, the "Self-Verification (Ours)" method involves generating multiple responses which are then verified, whereas the baseline methods only generate a single response. Could you clarify if this is an apples-to-apples comparison, or am I misinterpreting the experimental setup for this table?

2. The performance improvement shown in Table 1 appears to be somewhat marginal. Specifically, when comparing the "Self-Verification" method to the "GRPO (Problem-solving Only)" baseline, the gain seems modest. Could you elaborate on the significance of this improvement?

3. Regarding the training process, my understanding is that for each problem-solving sample, G responses are generated, which then create G verification samples. If G=8, this seems to imply a significant increase in computational overhead for training the verification task. Could you comment on the cost-benefit analysis here, as the performance improvement from this additional training seems modest?

4. I acknowledge that the dynamic reward function for self-verification is quite sophisticated. However, looking at Figure 6 in the appendix, the performance of the simple 0-1 reward function doesn't seem substantially worse, especially for the smaller R1-1.5B model?

5. I'm interested in the claim that problem-solving and self-verification are mutually beneficial. From one perspective, the verification task could be seen as a simple re-framing of the original task's output, potentially not adding substantial new information. Could you provide a more detailed or intuitive explanation for how training a model to verify its own solutions leads to a genuine improvement in its core problem-solving abilities?

Limitations

Yes

Final Justification

Thanks for the author's reply, which has resolved some of my confusion. I will keep my score, and it is already a positive score.

Formatting Issues

No

Author Response

For the weakness stated in the review, please refer to the response to Question 5.

Question 1 Regarding Table 1, the comparison seems potentially imbalanced. If I understand correctly, the "Self-Verification (Ours)" method involves generating multiple responses which are then verified, whereas the baseline methods only generate a single response. Could you clarify if this is an apples-to-apples comparison, or am I misinterpreting the experimental setup for this table?

The results in Table 1 are comparisons of greedy-decoding scores for all models, without generating and selecting from multiple responses. This is evidence that introducing verification tasks indeed helps problem-solving by enhancing the underlying reasoning capabilities (refer to Question 5 for a more in-depth discussion), as our self-verification model has higher greedy scores than standard GRPO. To further utilize the verification abilities, our main results on test-time scaling are shown in Figure 1 and Table 4. We will emphasise this distinction more clearly in the revised version.

Question 2 The performance improvement shown in Table 1 appears to be somewhat marginal. Specifically, when comparing the "Self-Verification" method to the "GRPO (Problem-solving Only)" baseline, the gain seems modest. Could you elaborate on the significance of this improvement?

As stated in our response to Question 1, the improvement is based on greedy-decoding scores. Compared to the GRPO baseline, our method only adds verification training to the RL process. This training has no direct supervision signal for problem-solving performance, yet we still show that it improves greedy-decoding scores due to better reasoning abilities.

Question 3 Regarding the training process, my understanding is that for each problem-solving sample, G responses are generated, which then create G verification samples. If G=8, this seems to imply a significant increase in computational overhead for training the verification task. Could you comment on the cost-benefit analysis here, as the performance improvement from this additional training seems modest?

We keep the training cost of our self-verification model similar to the GRPO baseline by using the same training batch size and number of training steps. As shown in the listed algorithm, we sample data batches from a mixed dataset containing both problem-solving and verification problems, so we use the same total amount of data as the baseline GRPO algorithm. Our policy-aligned buffer design maintains a fixed amount of verification data rather than keeping all samples generated during training.
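As a rough illustration of this mixed sampling, here is a sketch under stated assumptions: the class and function names, buffer size, and the 50/50 mixing ratio are ours, not the paper's exact algorithm.

```python
import random
from collections import deque

class PolicyAlignedBuffer:
    """Fixed-size FIFO of verification items built from recent rollouts,
    so verification training data stays close to the current policy."""

    def __init__(self, max_size=4096):
        self.items = deque(maxlen=max_size)

    def add_from_rollouts(self, question, rollouts):
        # rollouts: list of (response_text, is_correct) produced by the solver.
        for response, is_correct in rollouts:
            self.items.append({"question": question,
                               "response": response,
                               "label": is_correct})

    def sample(self, k):
        return random.sample(list(self.items), min(k, len(self.items)))

def mixed_batch(solver_dataset, buffer, batch_size, verify_ratio=0.5):
    """Draw one training batch containing both task types while keeping the
    total batch size equal to the plain-GRPO baseline."""
    n_verify = int(batch_size * verify_ratio)
    verify_part = buffer.sample(n_verify)
    solve_part = random.sample(solver_dataset, batch_size - len(verify_part))
    return solve_part + verify_part
```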

Question 4 I acknowledge that the dynamic reward function for self-verification is quite sophisticated. However, looking at Figure 6 in the appendix, the performance of the simple 0-1 reward function doesn't seem substantially worse, especially for the smaller R1-1.5B model?

Your finding is correct. We speculate that model size is the bottleneck limiting verification performance for the 1.5B model, so adopting the dynamic reward cannot improve its performance by a large margin. On the other hand, the dynamic reward function leads to a significantly larger improvement for our 7B model. Besides, the dynamic reward is not complex to implement: we only need to record the difficulty of the original question as additional data, which is naturally acquired through the sampling process of the problem-solving task. Compared to other training operations like sampling and weight updates, its cost is very small.
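To make the last point concrete, a minimal sketch of how that recorded difficulty could feed a dynamic verification reward; the exact functional form used in the paper is not given in this thread, so the weighting below is an illustrative assumption.

```python
def question_difficulty(rollout_correct_flags):
    """Difficulty estimated for free from the solver's G rollouts:
    the fraction of sampled solutions that were wrong."""
    flags = list(rollout_correct_flags)
    return 1.0 - sum(flags) / max(len(flags), 1)

def dynamic_verification_reward(judgment_correct, difficulty):
    """Reward correct judgments more on harder questions (illustrative
    weighting; the paper's actual rule may differ)."""
    if not judgment_correct:
        return 0.0
    return 0.5 + 0.5 * difficulty  # in [0.5, 1.0]

# Example: 8 rollouts, 2 correct -> difficulty 0.75.
d = question_difficulty([1, 0, 0, 1, 0, 0, 0, 0])
print(dynamic_verification_reward(True, d))
```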

Question 5 I'm interested in the claim that problem-solving and self-verification are mutually beneficial. From one perspective, the verification task could be seen as a simple re-framing of the original task's output, potentially not adding substantial new information. Could you provide a more detailed or intuitive explanation for how training a model to verify its own solutions leads to a genuine improvement in its core problem-solving abilities?

The verification task can enhance the model's reasoning capability on the given problem, which further helps solve the problem in subsequent training epochs. This means that the knowledge required by the verification task can eventually generalize to problem-solving. We find similar insights in models like OpenAI Codex [1], where pretraining on code data yields stronger logical abstractions that then transfer to natural-language reasoning tasks. A recent work [2] also finds that training meta-abilities on some tasks can improve general reasoning capabilities on many benchmarks. For our task, although no new input is introduced, the verification training elicits responses that are also useful for problem-solving, resulting in the improvement in Table 1 and indicating the effectiveness of our self-verification training.

[1] Chen, Mark, et al. "Evaluating large language models trained on code." arXiv preprint arXiv:2107.03374 (2021).

[2] Hu, Zhiyuan, et al. "Beyond 'Aha!': Toward Systematic Meta-Abilities Alignment in Large Reasoning Models." arXiv preprint arXiv:2505.10554 (2025).

Comment

Thanks for addressing my concerns! I will maintain the rating.

Review
Rating: 4

Inspired by the line of generative-verifier works [14-16], the paper applies the GRPO algorithm to train the model to both answer and verify the question at the same time. During inference-time scaling, the same model is used to answer the question and verify the generated answer. A weighted aggregation is used to determine the final answer.

Strengths and Weaknesses

Strengths:

  1. The clarity of the paper's writing and presentation is good.

Weaknesses:

Method-wise:

  1. The novelty of the proposed method, both technically and method-wise, is quite limited. The method builds on the success of previous generative-verifier works [14–16]. The major difference lies in: (1) instead of using SFT, the model uses an RL method; and (2) following naturally from point 1, the method relies on RL–online sampled data instead of the pre-constructed dataset used in previous generative-verifier works.
  2. One proposed benefit of the method is that no external verifier is needed for test-time scaling. However, the same amount—or even more—compute is required, since the model must be prompted to verify the generated answer. Thus, compute-wise, the proposed method offers no advantage beyond saving a copy of the model weights in storage. Moreover, when scaling to larger models, this approach is debatably limited by the need for both generation and verification to rely on the same large model, thereby losing the flexibility of using a large model for generation and a smaller model for more efficient verification.

Experiment-wise:

  1. The performance improvement seems to be marginal compared to the GRPO baselines shown in Table 1.
  2. Despite being closely related to previous Gen-Verifier works, there seems to be no comparison against previous work [14]. A comparison of verifier accuracy is expected, as well as the performance difference of using different verifiers for test-time scaling.
  3. Ablation studies against some of the important method designs are missing, e.g., the Policy-Aligned Buffer and α from Test-time Scaling.

Questions

  • In Table 1, is the verification used in avg@10? If yes, how?
  • In Table 1, GRPO-Qwen-7B seems to be duplicated; what's the GRPO-R1-1.5B performance?
  • What's the verifier model for the best-of-N baselines? How good is the trained model's ability to serve as a verifier for answers other than those it generated itself? (e.g., we can check this by using the trained model as another model's verifier)

Limitations

yes

Final Justification

My major concern has been resolved. See my discussion with the author.

Formatting Issues

N/A

Author Response

Weakness 1 The technical novelty is limited; prior works [14–16] already train a generative verifier. Your differences are merely (1) using RL instead of SFT and (2) sampling data online.

The novelty of our work does NOT come from learning another generative verifier but from the integration of problem-solving and self-verification. The works [14-16] you list only consider learning verifier models, not problem-solving ones. The purpose of training a generative verifier in our work is to realize better test-time scaling on the model itself, which differs from these works. Besides, replacing SFT with RL is not a cosmetic swap: we design an RL training scheme that utilizes online samples and discuss its specific challenges in our methodology, which are technically distinct from previous works.

Weakness 2 The 'no-external-verifier' claim confers little benefit: the same—or even greater—compute is needed because the model still has to verify its own answer. When scaling up, tying generation and verification to a single large model removes the flexibility of pairing a large solver with a smaller, cheaper verifier.

We would like to answer this comment from the following points:

  1. “More compute is required in self-verification”: External verifiers also need extra compute. In Figure 4, we show that our method is indeed more efficient than other test-time methods that use separate reward models. The extra compute is always required unless we do not rely on any form of verifier, e.g., using simple voting. In this setting, the cost of our method is typically less than that of the other baselines.
  2. “The method offers no advantage beyond saving a copy of the model weights in storage”: From the infrastructure perspective, our method not only saves memory but also reduces deployment complexity. For example, a method with external verifiers needs to deploy both the generator and the verifier with careful workload balancing; otherwise, imbalanced hardware utilization between the two models results in inefficiency. Our method only needs one simple deployment, since problem-solving and verification go through the same generation interface.
  3. “The method loses flexibility due to relying on the same large model”: We acknowledge that using a single checkpoint removes the option to combine different solvers and verifiers. We believe this is an acceptable trade-off for our efficient unified structure. We would also like to question the need to choose a smaller model for verification, as recent works find that verification quality is also important. For instance, [1] shows that using the large Gemini 2.5 Pro model for both problem-solving and verification is capable of winning IMO gold, highlighting the significance of the verifier. We will describe this limit on flexibility as a trade-off in the next version of our paper.

[1] Huang, Yichen, and Lin F. Yang. "Gemini 2.5 Pro Capable of Winning Gold at IMO 2025." arXiv preprint arXiv:2507.15855 (2025).

Weakness 3 The performance improvement seems to be marginal compared to the GRPO baselines shown in Table 1.

We clarify that the results in Table 1 are comparisons of greedy-decoding scores for all models, without introducing any verification process to select answers. Because solving and verifying draw on overlapping reasoning skills but are trained with different losses, a modest transfer gain is expected and, we argue, already noteworthy. Besides, the full benefit of our approach appears once the verifier is used at test time, as shown in Figure 1 and Table 4. We will revise related descriptions to make the results clearer.

Weakness 4 Despite being closely related to previous generative verifier works, there seems to be no comparison against previous work [14]. A comparison of verifier accuracy is expected, as well as the performance difference of using different verifiers for test-time scaling.

We treat the generative verifier [14] as a methodology in our work rather than a baseline. The core idea of [14]—train and ask the LLM to produce the judgment—is the prompting mechanism we adopt for every model in Tables 2 and 3. In other words, all systems are evaluated under the identical generative-verifier interface of [14]. In our work, the principal question is not which verifier alone is better, but which complete system (the post-trained model + test-time methods) yields the best math scores. In this domain, our self-verification method outperforms all other baselines. We would like to elaborate on this idea better in our next version.

Weakness 5 Ablation studies against some of the important method designs are missing, e.g., the Policy-Aligned Buffer and α from Test-time Scaling.

Due to the page limit, we put our ablation studies in Appendix D. We find that the choice of T_b, which decides the buffer size, does not result in a significant performance difference, while the use of dynamic verification rewards leads to better performance than simple 0-1 rewards. We also supplement experiments showing the effect of the test-time hyperparameter α in the following table. In general, different α values result in performance differences within a small range. The choice of α is more sensitive for the small 1.5B-size model, since its model size limits its verification performance. As a result, we adopt a smaller value of α (0.1) for the 1.5B model and a larger α (1.0) for the 7B model, as stated in Appendix C.

| α | 0.3 | 1.0 | 3.0 | 5.0 |
| --- | --- | --- | --- | --- |
| Self-Verification-Qwen-7B@16 (MATH500) | 86.60 | 87.20 | 87.10 | 87.10 |

| α | 0.03 | 0.1 | 0.3 | 1.0 |
| --- | --- | --- | --- | --- |
| Self-Verification-R1-1.5B@16 (MATH500) | 92.6 | 93.6 | 92.4 | 91.0 |

In general, we ablate both training and test-time hyperparameters, and the results show that self-verification remains stable across a broad range of hyperparameter choices. We will add a short pointer to these appendix sections in our main paper.

Question 1 In Table 1, is the verification used in avg@10? If yes, how?

As described in Weakness 3, the benchmark results are all greedy-decoding scores of different models without using verification. avg@10 means that we repeat the questions 10 times to ensure the stability of results for benchmarks with few questions. We will update the descriptions in the next version.

Question 2 In Table 1, GRPO-Qwen-7B seems to be duplicated; what's the GRPO-R1-1.5B performance?

We apologize for the duplicate data, where we forgot to fill in the results of GRPO-R1-1.5B. Here we supplement the results of GRPO-R1-1.5B in the following table:

| Model | MATH500 | AIME24 (avg@10) | AIME25 (avg@10) | AMC23 (avg@10) | OlympiadBench |
| --- | --- | --- | --- | --- | --- |
| Self-Verification-R1-1.5B (Ours) | 87.00 | 43.00 | 31.33 | 77.50 | 44.30 |
| R1-Distill-Qwen-1.5B (Base model) | 80.00 | 24.33 | 25.00 | 64.25 | 32.89 |
| GRPO-R1-1.5B (Problem-solving only) | 87.00 | 37.67 | 26.67 | 72.50 | 40.74 |
| DeepScaleR-1.5B-Preview | 83.00 | 37.00 | 31.00 | 77.25 | 43.56 |

We can find that our self-verification model also outperforms the standard GRPO-R1-1.5B RL baseline on most benchmarks, indicating that training with verification problems can synergize problem-solving tasks, with similar findings in the 7B model. We will revise this table to include the results in the next version.

Question 3 What's the verifier model for the best-of-N baselines? How good is the trained model's ability to serve as a verifier for answers other than those it generated itself? (e.g., we can check this by using the trained model as another model's verifier)

We use the RLHFlow/Llama3.1-8B-PRM-Deepseek-Data model as the general reward model for the best-of-N baseline, following the settings from the previous work. Different from our generative verifier, it is a discriminative model that outputs the reward score directly through an additional network head.

We also conduct a new experiment to perform our test-time scaling on the outputs of the Qwen2.5-Math-7B-Instruct model. We choose three different models as the verifier: the RLHFlow/Llama3.1-8B-PRM-Deepseek-Data reward model, the Qwen2.5-Math-7B-Instruct model itself as a generative verifier, and our Self-Verification-Qwen-7B post-trained model. The test-time scaling results with 16 samples are shown in the following table. As a generative model, our verifier can be expected to generalize better to different distributions than scalar reward models, since it generates text-based rationales to support its judgments. Meanwhile, our verifier model also outperforms the instruct model itself, indicating the advantage of verification training through RL.

| Verifier Model | MATH500 | AIME24 | AIME25 | AMC23 | OlympiadBench |
| --- | --- | --- | --- | --- | --- |
| RLHFlow/Llama3.1-8B-PRM-Deepseek-Data (Discriminative) | 85.20 | 15.67 | 20.33 | 71.75 | 36.89 |
| Qwen2.5-Math-7B-Instruct (Generative) | 85.40 | 15.67 | 9.67 | 63.00 | 34.67 |
| Self-Verification-Qwen-7B (Generative) | 88.80 | 17.33 | 20.00 | 73.00 | 38.67 |
Comment

I want to thank the author for their response; my major concern is resolved. Please improve the presentation of Table 1 to avoid misinterpretation of the result.

  • For avg@10, if it is greedy decoding (temp=0) without a verifier, why would you need 10 trials? The common practice is either greedy decoding, or temp>0 but with multiple trials.
Comment

Thanks for your comment! Sure we will update the caption and related descriptions on Table 1 according to your feedback.

For avg@10, if it is greedy decoding (temp=0) without a verifier, why would you need 10 trials? The common practice is either greedy decoding, or temp>0 but with multiple trials.

Your observation is correct. Here we simply adopt the same evaluation framework as our test-time scaling experiments, in which temp>0 is used and multiple samples are needed.

It's true that there is no strict need to repeat 10 times in common practice. However, factors such as numerical issues and hardware infrastructure may still introduce small nondeterminism even when temp=0. For example, LLM service providers like OpenAI require additional arguments like seed and fingerprints to ensure reproducible outputs. By averaging over 10 trials, we mitigate this minor stochasticity and ensure that our reported metrics are more stable and robust.


We hope our response addresses your concerns about our work, and we are grateful for your feedback.

Comment

Thank you for your response. I suggest just using the term "greedy decoding" to avoid confusion. The paper's writing can be improved in the revision, e.g., Algorithm 1 and Equations 1-2 are not very informative and could be excluded to make room for the additional results provided in the rebuttal.

I've adjusted my evaluation.

Comment

Thanks for your feedback and re-evaluation! We will revise the paper according to your advice.

In our initial submission, we decided to use "post-training" and "test-time" results to distinguish Table 1 and Table 4. Based on your and other reviewers' feedback, we will highlight "greedy decoding scores" to avoid confusion. Besides, we will also consider shortening the background and algorithm sections, and adding our ablation studies in the next version. We hope our response addresses your comments.

Review
Rating: 4

The authors propose a unified reinforcement learning framework that trains large language models to both generate and verify their own solutions within a single policy. They introduce a dynamic verification reward and a policy-aligned buffer to address distribution shifts and data imbalance during joint optimization. The approach is tested on mathematical reasoning benchmarks using Qwen2.5-Math-7B and DeepSeek-R1-Distill-Qwen-1.5B; their self-verification models improve post-training performance and enable effective test-time scaling without external verifiers, achieving good results.

Strengths and Weaknesses

Strengths

  1. Strong performance advantages.
  2. Clear motivation.

Weaknesses

  1. The paper’s primary impetus—that a “distribution discrepancy” between a post-trained generator and a general reward model constrains test-time scaling—is asserted but never substantiated with distributional distance metrics or error-mode analyses.
  2. Although joint RL training is said to eliminate external verifiers, there is no measurement of inference latency, sampling overhead, or energy consumption to demonstrate a net practical benefit.
  3. Complexity - deploying a unified RL framework requires managing complex components—policy and value networks, replay buffers, scheduler tuning, distributed rollout servers—versus the far simpler pipelines of prompt-based self-consistency or supervised fine-tuning.

Questions

  • How well does self-verification generalize to non-mathematical domains such as code generation or open-ended commonsense reasoning?

  • How sensitive are results to hyperparameters of the dynamic verification reward and buffer size, and have the authors performed ablations on these components?

  • Recent discussions have revealed that even entropy-minimization alone can be a strong baseline. Can the authors provide comparison against those?

Limitations

yes

Final Justification

Thanks for the author's response. I will maintain my positive score of 4.

Formatting Issues

na

Author Response

Weakness 1 The paper’s primary impetus—that a ‘distribution discrepancy’ between a post-trained generator and a general reward model constrains test-time scaling—is asserted but never substantiated with distributional distance metrics or error-mode analyses.

We not only raise the issue of distribution discrepancy but also show its influence through test-time performance. In Figure 1, starting from a post-trained base model, all test-time scaling strategies that rely on an off-the-shelf reward model (RM-guided beam search, best-of-N with RM reranking, DVTS, etc.) fail to beat simple majority voting. These results diverge from previous reports, e.g., the results from DVTS, implying that the external RM is misaligned with the new generator's output distribution.

Directly measuring distributional distance in the high-dimensional space of LLM outputs is non-trivial, so we further conduct experiments to evaluate calibration: the Pearson correlation between RM scores and ground-truth correctness. A well-aligned RM should assign higher scores to correct answers regardless of the generator. As shown in the following table, we find that the correlation coefficient drops significantly on responses generated by our post-trained model compared to open-source Llama and Qwen models.

| Generator model | Llama-3.1-8B-Instruct | Qwen2.5-Math-7B-Instruct | Self-Verification-Qwen-7B |
| --- | --- | --- | --- |
| Pearson correlation coefficient | 44.9 | 42.3 | 37.5 |

These two pieces of evidence substantiate our claim: without re-alignment, a general reward model generalizes poorly to a newly post-trained generator, limiting the effectiveness of test-time scaling.
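For readers who want to reproduce this calibration check, it amounts to the Pearson correlation between a continuous RM score and a binary correctness label (a point-biserial correlation); a minimal sketch with assumed variable names:

```python
import numpy as np

def rm_calibration(rm_scores, is_correct):
    """Pearson correlation between reward-model scores and 0/1 correctness.
    A lower value indicates the RM is less aligned with the generator."""
    scores = np.asarray(rm_scores, dtype=float)
    labels = np.asarray(is_correct, dtype=float)
    return np.corrcoef(scores, labels)[0, 1]

# Toy example: RM scores for 6 responses and whether each was actually correct.
print(rm_calibration([0.9, 0.2, 0.7, 0.4, 0.8, 0.1], [1, 0, 1, 0, 0, 0]))
```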

Weakness 2 Although joint RL training is said to eliminate external verifiers, there is no measurement of inference latency, sampling overhead, or energy consumption to demonstrate a net practical benefit.

Our joint RL procedure is deliberately matched to the standard GRPO baseline, so its resource profile is comparable.

  1. Inference latency: Training uses the same batch size and number of gradient steps as GRPO. At inference time, our policy produces fewer tokens per problem because verification prompts are shorter than full solutions (see Figure 3).
  2. Sampling overhead: the number of samples required in each training step is the same as standard GRPO. For the test-time scaling, Figure 4 shows that our method is faster than other baselines requiring reward models.
  3. Energy consumption: Our RL training and test-time strategies use the same hardware settings as our RL baselines. Because token count and external-model calls are lower, our method inherits the same upper bound on energy use and in practice consumes less GPU time.

Weakness 3 Deploying a unified RL framework requires managing complex components—policy and value networks, replay buffers, scheduler tuning, distributed rollout servers—versus the far simpler pipelines of prompt-based self-consistency or supervised fine-tuning.

Our method adds only marginal complexity over GRPO, since it only extends the data buffer to include verification problems. The table below compares greedy decoding from our single model with 16-sample self-consistency on the Qwen2.5-Math-7B base model, arguably the simplest prompt-only baseline.

| Model | MATH500 | AIME24 | AIME25 | AMC23 |
| --- | --- | --- | --- | --- |
| Self-Verification-Qwen-7B (greedy decoding) | 83.60 | 20.00 | 16.67 | 63.75 |
| Qwen2.5-Math-7B (self-consistency@16) | 79.00 | 24.00 | 15.67 | 62.00 |

Despite using zero extra samples at test-time, our method surpasses the prompt-based approach on three of four benchmarks and narrows the gap on the fourth, validating that the modest training complexity translates into material accuracy gains. On the other hand, our dataset provides pass/fail verdicts but not gold-standard rationales; hence pure SFT cannot learn to both solve problems and judge its own answers. RL with a scalar correctness signal is therefore more appropriate and effective.

Question 1 How well does self-verification generalize to non-mathematical domains such as code generation or open-ended commonsense reasoning?

This is an interesting point. Our framework is task-agnostic in the sense that it only presumes access to a correctness signal for each sampled answer. In mathematics, this signal is provided by a given math verifier. In other domains, more effort may be needed to design the verification process. For example, the verification process for code tasks could be to generate reliable unit tests that examine the code, for which a recent work provides a related idea [1]. Besides, we can also design more complex verification methods using tools, which makes our methodology extendable to more scenarios.

Question 2 How sensitive are results to hyperparameters of the dynamic verification reward and buffer size, and have the authors performed ablations on these components?

Yes, we ablate the effects of our hyperparameters, including T_b, which decides the buffer size, and the use of the dynamic verification reward, in Appendix D. We find that the choice of T_b does not result in a significant performance difference, while the dynamic verification reward leads to better performance than simple 0-1 rewards, especially for the 7B-size model.

Following the feedback from Reviewer nBSh, we also conduct experiments on the effect of the test-time hyperparameter α. The results are shown in the following table. In general, different α values result in performance differences within a small range. The choice of α is more sensitive for the small 1.5B-size model, since its model size limits its verification performance. As a result, we adopt a smaller value of α (0.1) for the 1.5B model and a larger α (1.0) for the 7B model, as stated in Appendix C.

| α | 0.3 | 1.0 | 3.0 | 5.0 |
| --- | --- | --- | --- | --- |
| Self-Verification-Qwen-7B@16 (MATH500) | 86.60 | 87.20 | 87.10 | 87.10 |

| α | 0.03 | 0.1 | 0.3 | 1.0 |
| --- | --- | --- | --- | --- |
| Self-Verification-R1-1.5B@16 (MATH500) | 92.6 | 93.6 | 92.4 | 91.0 |

In sum, both training and test-time hyperparameters are ablated, and the results show that self-verification remains stable across a broad range of settings.

Question 3 Recent discussions have revealed that even entropy-minimization alone can be a strong baseline. Can the authors provide comparison against those?

Yes. We compare our 7B model with the one-shot entropy minimization model (EM 1-shot) [2]. We note that they are generally comparable since they are post-trained from the same base model, despite different training configurations. The results are shown in the following table. We can see our trained model still outperforms the entropy-minimization model.

| Model | MATH500 | AIME24 | AIME25 | AMC23 | OlympiadBench |
| --- | --- | --- | --- | --- | --- |
| Self-Verification-Qwen-7B | 83.6 | 20.00 | 16.67 | 63.75 | 34.81 |
| EM 1-shot | 68.8 | 19.33 | 8.67 | 54.50 | 27.26 |

[1] Wang, Yinjie, et al. "Co-evolving LLM Coder and Unit Tester via Reinforcement Learning." arXiv preprint arXiv:2506.03136 (2025).

[2] Gao, Zitian, et al. "One-shot Entropy Minimization." arXiv preprint arXiv:2505.20282 (2025).

Comment

Dear Authors and Reviewers,

I would like to thank the authors for providing detailed rebuttal messages. I would also like to thank reviewer nBSh for already engaging in further discussion.

For the other reviewers, I would like to encourage you to carefully read all other reviews and the author responses and engage in an open exchange with the authors. Please post your first response as soon as possible within the discussion time window, so there is time for back and forth discussion with the authors. Ideally, all reviewers will respond to the authors, so that the authors know their rebuttal has been read.

Best regards,
AC

Final Decision

This paper proposes to add a self-verification reward in addition to the correctness reward during RL, to incentivize an LLM's reasoning and verification capabilities simultaneously. The improvements in both generation and verification were demonstrated in separate experiments. Initially, the reviewers raised some concerns, but the authors successfully resolved them during the discussion. Consequently, all the reviewers reached a unanimous agreement on the acceptance of this paper, and hence the AC also recommends accepting this paper.