Generative Verifiers: Reward Modeling as Next-Token Prediction
We frame reward modeling as a next-token prediction task, incorporating Chain-of-Thought reasoning and Majority Voting within verifiers.
Abstract
Reviews and Discussion
The paper proposes a way to automatically verify whether a solution generated by an LLM is correct. Instead of relying on a third-party verifier that ranks multiple solutions (as in the best-of-N method), it suggests the verifier should use a CoT step, followed by majority voting, to determine the correctness probability of the solution.
Strengths
An easy-to-understand method that works in the few experiments presented in the paper.
Weaknesses
I have several concerns about this paper. First, I don't think such a process can be called a "verifier", as there is no rigor in the entire process. In particular, we do not have any guarantee on the final probability value. It relies entirely on the quality of the other LLM used to evaluate the solution, which, as mentioned at the beginning of the paper, "... often confidently make[s] logical and factual mistakes". I understand this is what the community is doing, but on the other hand, this paper does not make any progress on this aspect.
Second, the novelty, i.e., utilising the autoregressive ability rather than ranking, is not significant. The contribution is very incremental.
Third, I suppose the utilisation of another LLM will not always lead to positive impact. The study of its potential negative impact (e.g., the propagation of vulnerabilities across multiple LLMs) may be more interesting than what is presented in the paper.
Finally, there is no comparison (and no discussion) with methods for uncertainty estimation, which generate a confidence estimate when producing a solution. I don't see that this "verifier" methodology offers significantly more than an uncertainty estimator.
In summary, I find the method proposed in the paper incremental, and it does not tackle the major issues (guarantees, negative impacts, etc.) of this problem.
Questions
See the weaknesses.
We believe that there might be some misunderstandings regarding the contributions of our paper and its novelty and significance in relation to prior work, hence we wanted to kick-start the discussion early.
At the outset, we are unaware of prior work that utilizes inference-time computation via the chain-of-thought (CoT) abilities of LLMs to improve verification (though, of course, CoT has been used to improve generation). Our work makes it possible to improve verification by posing it as a generative modeling task. This contribution has been acknowledged by other reviewers as “well-motivated” and “innovative”. We would like to clarify the following:
First, I don't think such a process can be called a "verifier", as there is no rigor in the entire process.
To our knowledge, the term “verifier” is widely used in the LLM community, established by the seminal GSM8K paper [1] and notable follow-ups [2, 3]. It is unfair to be penalized for using terminology widely adopted by the community.
Second, the novelty, i.e., utilising the autoregressive ability rather than ranking, is not significant. The contribution is very incremental.
We believe that there is perhaps a misunderstanding in this statement: we do not simply utilize autoregressive abilities rather than ranking, but unlock the ability to utilize inference-time computation for improving verification accuracy, by running multiple parallel chains of thought and majority voting. We are unaware of any prior work that uses chain of thought or majority voting for improving verification accuracy for learned verifiers. We are happy to revise novelty claims if there are suggestions regarding prior work demonstrating similar capabilities.
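To make this concrete, here is a minimal sketch of Best-of-N selection with a CoT verifier (using hypothetical `verifier.sample` and `verifier.token_prob` helpers and illustrative prompts, not the paper's actual implementation): several verification rationales are sampled per solution and the probability of a "Yes" decision token is averaged across them.

```python
# Minimal sketch of GenRM-CoT scoring with majority voting; `verifier` is a
# hypothetical object exposing `sample(prompt)` for free-form generation and
# `token_prob(prompt, token)` for a next-token probability. Prompt templates
# are illustrative, not the paper's actual prompts.

def genrm_cot_score(verifier, question: str, solution: str, num_votes: int = 32) -> float:
    """Average P("Yes") over several independently sampled verification rationales."""
    prompt = (
        f"Question: {question}\n"
        f"Proposed solution: {solution}\n"
        "Let's verify the solution step by step.\n"
    )
    yes_probs = []
    for _ in range(num_votes):
        rationale = verifier.sample(prompt)  # one verification chain of thought
        decision_prompt = prompt + rationale + "\nIs the solution correct? "
        yes_probs.append(verifier.token_prob(decision_prompt, "Yes"))
    return sum(yes_probs) / len(yes_probs)  # "majority voting" as averaging


def best_of_n(verifier, question: str, candidate_solutions: list[str]) -> str:
    """Best-of-N selection: return the candidate with the highest verifier score."""
    return max(candidate_solutions, key=lambda s: genrm_cot_score(verifier, question, s))
```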
Third, I suppose the utilisation of another LLM will not always lead to positive impact. The study of its potential negative impact (e.g., the propagation of the vulnerabilities of multiple LLMs) may be more interesting than what presents in the paper.
The above concern is applicable to most papers involving LLMs and to most work on LLM reasoning. We are happy to discuss this as a broad societal implication of this entire line of work, but we do not think it should be a reason to reject this paper.
No comparison (and no discussion) with the methods on uncertainty estimation … I don't see this "verifier" methodology offers significantly more than an uncertainty estimator.
This concern also seems applicable to any work on LLM verifiers, and we believe that this should not be a reason to dismiss the contributions of this paper. While we are not aware of work using uncertainty estimation approaches for verifiers, we do compare to prevalent verification approaches, including discriminative verifiers (classifiers), LLM-as-a-Judge, DPO, and self-consistency.
Moreover, while “verifiers” can be viewed as uncertainty estimators, current LLMs are poor at judging the “correctness” of their own responses on reasoning tasks, dubbed the “Generative AI paradox” [4, 5]. As such, using uncertainty estimation approaches for verification seems like an interesting direction for future work.
References:
[1] “Training Verifiers to Solve Math Word Problems”, Cobbe et al, 2021.
[2] “Solving math word problems with process- and outcome-based feedback”, Uesato et al, 2022.
[3] “Let's Verify Step by Step”, Lightman et al, 2023.
[4] “The Generative AI Paradox: What It Can Create, It May Not Understand”, ICLR 2024.
[5] “The Generative AI Paradox on Evaluation: What It Can Solve, It May Not Evaluate.” EACL 2024.
Thanks for the response. I think the response does not address the concerns. Arguing that this or that concern is "applicable to any work on LLM verifiers" is not a valid argument.
I will keep my score.
Dear Reviewer odxm and the AC:
Thanks for your replies! We are more than happy to provide clarifications to help address your concerns, especially given that we have a few more days to respond.
Reviewer odxm -- Could you kindly help us understand what precisely would help address your concerns? We are happy to run experiments and modify the text to this end, but we find it hard to address the concerns in the review above or in your follow-up response because we are not sure what you are looking for. The other two reviewers did provide us with very actionable feedback, which is the point of the review and discussion process at ICLR. Would you kindly help us by doing the same? We would appreciate that a lot!
Alternatively, could you highlight what in our author response is problematic and why it does not address your concerns?
Thanks so much and looking forward to engaging with you in a discussion!
Best, Authors
Dear reviewer odxm,
Could you please respond to the authors' rebuttal and see if you would like to update your review? Thanks very much!
AC
The authors investigate training LLMs to act as verifiers using a generative objective (training the LLM to verify if a solution is correct by directly predicting the yes/no token). Notably, they investigate the implications of this modification for scaling inference compute and for jointly optimizing solution generation and verification.
Strengths
The authors examine a very relevant and important problem of learning good verifiers for LLM generations.
They present a new method that is well-motivated (increasing verification compute and framing verification to match the original LLM objective).
The authors conduct many key experiments exploring these dimensions; the paper is explained clearly and is very easy to understand.
- In particular, it’s great to see experiments measuring the generation performance change as well as experiments investigating the scaling properties of more verification inference compute.
Weaknesses
In many of the plots (Figures 1, 4, 5, 6), the y-axis scaling changes from plot to plot and is often very restricted (i.e., sometimes spanning only 4%). This is misleading when comparing results, and it would be great to standardize it more.
GenRM does improve over the baselines (seemingly more so on harder tasks, which is worth highlighting more!), but a lot of the time the improvement is relatively small (e.g., 1% over the discriminative RM on GSM8K).
"In Figure 8, we show that generative verifiers, especially GenRM-CoT, exhibit better scaling behavior than discriminative RMs,"
- It is unclear from Figure 8 that generative verifiers scale better; the boosts are very similar for the discriminative RM.
Although the authors explore a lot of baselines, which is great, some key verification methods are missing. Specifically, process reward models (PRMs) are generally better than the ORMs studied, which is important given that the performance of ORMs in some settings is close to GenRM.
Verification is only done on a maximum of 32 generated solutions. Although this is done in many other past works, given that best-of-N performance scales to thousands of samples for some datasets, it would be great to see the scaling properties along this dimension.
I am willing to raise my score if some of these concerns are addressed! In particular, the presentation of results and/or adding stronger baselines.
Nits:
- Figure 4, the color is wrong for GenRM
Questions
In all evaluations, are the actual generations the same, with the only difference being the verifier in each method? I want to confirm that you aren’t using the fine-tuned generator for GenRM (it would be great to make this clear in the text).
Did you try any PRMs? How do they compare?
Did you try verifying sample collections larger than 32?
Couldn’t reference-guided rationale training introduce a train/test mismatch? I.e., at training time the verifier objective is conditioned on a correct answer, but it isn’t at test time?
Are the CoTs faithful? I.e., is the reasoning for yes/no accurate to the actual problem?
We thank the reviewer for the feedback. We are glad that the reviewer finds the problem we tackle to be relevant and important, the method to be well-motivated, and the paper to be clear and easy-to-understand. To address their concerns, we have improved our presentation and addressed many questions and weaknesses.
GenRM does improve over the baselines more on harder tasks (which is worth highlighting more!) but a lot of times the improvement is relatively small (ex. 1% for gsm8k over discriminative).
While the gains on GSM8K (where the max possible accuracy is 97%) look small, going from an absolute accuracy of 92.3% with the discriminative RM to 93.4% with GenRM-CoT requires verifying solutions that are tricky and contain subtle errors (see Figure 2, Figure 11, Figure 12, as well as Appendix D). This is akin to how improving the SOTA on ImageNet by 1% from 80%+ is very challenging.
In addition, we have indeed observed that GenRM works especially well on harder tasks and in easy-to-hard generalization settings. On mathematical reasoning tasks, when trained only on grade-school math, it can generalize to high-school competition-level math, and it performs much better than the baselines (especially the discriminative RM). This setup is much more difficult than the original setup in the easy-to-hard generalization paper [2], which trains on easy levels of MATH (rather than just grade-school math). We have updated the manuscript to highlight this.
We have also run additional experiments evaluating easy-to-hard generalization of GSM-trained verifiers on MMLU’s college_mathematics (100 problems in the test split): pass@1 is 47.6%; Self-Consistency based on 32 solutions gives a 52% solve rate; Best-of-32 based on the discriminative RM gives 53.0%; and Best-of-32 with GenRM-CoT (using 32 majority votes) gives 56.1%. See Figure C.4 in the Appendix. This shows that GenRM-CoT’s verification skills can achieve superior generalization even on college-level mathematics!
a lot of baselines which is great, some key verification methods missing … process reward models are generally better than the ORMs studied .. performance of ORMs in some settings are close to GenRM.
We have not considered PRM for two reasons:
- Currently, GenRM only uses outcome supervision signals, so we only compare it with ORM baselines. If process-level supervision signals are available, GenRM can use the PRM signals as well. For instance, PRM data can be used to generate and filter higher-quality verification CoTs at the step level. As such, PRMs should be compared with a process-level GenRM (which we have discussed in the future work).
- PRMs often require additional human labeling, which can be costly. For instance, the original PRM paper [1] asked labelers to judge the correctness of each step in the solution. While there have been recent attempts to automate PRM labeling without human-in-the-loop [4, 5], those techniques are relatively new and have their own pros-and-cons, so we leave the investigation of combining GenRM with automated-PRM labels to future work.
In all evaluations, are the actual generations the same, with the only difference being the verifier in each method?
Yes, generations are the same, and the only difference is the verifier. We have revised the paper to clarify this.
Unclear from figure 8 that generative verifiers scale better -> the boosts are very similar for the discriminative RM
Indeed, we have revised the wording in the paper to say that generative verifiers perform better than discriminative RMs across model sizes. Thank you for pointing this out.
Verification is only done on a max of 32 generated solutions .. it would be great to see the scaling properties along this dimension.
In this work, we focused on a new axis of inference compute scaling, which is scaling the compute used to verify each generated solution. As shown in Figure 7, this new axis of scaling with respect to the number of verification rationales per solution is highly effective.
Scaling N and scaling the amount of compute used to verify each of the N solutions are two orthogonal axes; we focus on the latter because there often is a large gap between Pass@N and Best-of-N (see Figure 9), which shows that the bottleneck of Best-of-N performance is often not that N is not large enough, but that the verifier does not rank accurately enough.
Some prior works, such as [1] and [2], have indeed explored generating more than a thousand solutions for running Best-of-N. In those cases, a clear gain in performance only appeared when the number of samples exceeded a hundred (see Figure 4 in [2] and Figure 3 in [1]). By contrast, our method already shows a notable gain within 32 samples.
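As an illustration of why this gap points at ranking quality rather than sample count, here is a small sketch with placeholder `is_correct` and `verifier_score` functions (not code from the paper):

```python
# Sketch contrasting Pass@N and Best-of-N on a single problem; `is_correct` is a
# placeholder answer checker and `verifier_score` a (possibly noisy) verifier.

def pass_at_n(solutions, is_correct) -> bool:
    # Solved if at least one of the N sampled solutions is correct (oracle selection).
    return any(is_correct(s) for s in solutions)

def best_of_n_solved(solutions, is_correct, verifier_score) -> bool:
    # Solved only if the verifier's top-ranked solution happens to be correct.
    return is_correct(max(solutions, key=verifier_score))

# Averaged over a dataset, the gap between these two numbers measures accuracy lost
# to imperfect ranking rather than to an insufficient sample budget N.
```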
Couldn’t reference-guided rationale training introduce a train/test mismatch? I.e., at training time the verifier objective is conditioned on a correct answer, but it isn’t at test time?
We condition on an expected answer only during data generation of verification rationales. When we finetune the model, we use those verification rationales but do not include the expected answer in the prompt, so that there is no train/test mismatch. We have updated the paper to clarify this.
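For concreteness, a minimal sketch of this asymmetry with illustrative (not the paper's exact) prompt templates: the reference answer appears only when synthesizing rationales, while the finetuning and test-time prompts are identical.

```python
# Illustrative prompt construction for reference-guided rationale generation.
# Templates are hypothetical; only the structure (reference answer present when
# synthesizing rationales, absent at finetuning and test time) reflects the
# setup described above.

def rationale_generation_prompt(question: str, solution: str, reference_answer: str) -> str:
    # Used only to synthesize training rationales with the help of the known answer.
    return (
        f"Question: {question}\nProposed solution: {solution}\n"
        f"Reference answer: {reference_answer}\n"
        "Explain step by step whether the proposed solution is correct."
    )

def verifier_prompt(question: str, solution: str) -> str:
    # Used both as the finetuning input and at test time; no reference answer,
    # so the train and test prompt distributions match.
    return (
        f"Question: {question}\nProposed solution: {solution}\n"
        "Let's verify the solution step by step."
    )
```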
Are the CoTs faithful? I.e., is the reasoning for yes/no accurate to the actual problem?
When the CoT verifier correctly verifies a correct solution, the verification CoT mostly just says that there is no mistake in each step of the solution (see Table D.5 and Table D.7 in the Appendix), so the CoT is almost always faithful in this case.
When the CoT verifier correctly verifies a wrong solution, a faithful CoT needs to point out the actual mistake in the solution. Our GenRM-CoT verifier can do this reasonably well (as shown in the examples in Appendix D). Sometimes the verifier points out mistakes that are not actually there, but this behavior is expected because we use model-generated synthetic data for training CoT verifiers, which can be noisy and contain some errors. We expect that this can be further improved by utilizing more human data (similar to CriticGPT [6]) or inference-time compute for self-correction, akin to o1.
In many of the plots, the y-axis scaling changes from plot-to-plot and is often very restricted … it would be great to standardize it more.
We have updated the manuscript to ensure that in Figures 4, 5, and 6, the y-axis starts from the pass@1 of each task. Note that the y-axis often changes because the tasks we consider have different levels of difficulty for the base generator; we hope that starting the y-axis from the pass@1 of the base generator improves the clarity of the plots across the paper.
References
[1] “Let’s verify step by step”, Lightman et al, 2023.
[2] “Easy-to-Hard Generalization: Scalable Alignment Beyond Human Supervision”, Sun et al, 2024.
[3] “Scaling llm test-time compute optimally can be more effective than scaling model parameters”, Snell et al, 2024.
[4] “Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations”, Wang et al, ACL 2024.
[5] “Improve Mathematical Reasoning in Language Models by Automated Process Supervision”, Luo et al, 2024.
[6] “LLM Critics Help Catch LLM Bugs”, McAleese et al, 2024.
Dear reviewer 24YJ,
We want to check in on whether our rebuttal and updated paper have addressed your concerns, and whether you had a chance to see the MMLU college_mathematics experiment we added. We would be happy to discuss further.
Thank you!
This paper proposes Generative Verifiers (GenRM), a novel framework for verification in large language models (LLMs), which reframes reward modeling as a generative task. Specifically, the authors introduce GenRM and GenRM-CoT (Chain-of-Thought), where GenRM-CoT incorporates additional reasoning steps. They aim to improve verification by using the same model (Gemini 1.0 Pro) both to generate solutions and to generate synthetic verification rationales, and then train open-source Gemma 2B, 7B, and 9B models. Experiments are conducted on GSM8K, MATH, and algorithmic tasks to demonstrate the effectiveness of this approach over discriminative reward models.
Strengths
- Innovative Approach: Reframing verification as a generative task, specifically with GenRM-CoT, is novel and shows promise for complex reasoning tasks.
- Synthetic Rationale Generation: The use of the same model to generate both solutions and synthetic rationales offers a more streamlined and potentially scalable verification process.
- Improved Performance: Results indicate that GenRM-CoT improves upon discriminative reward models, especially when using chain-of-thought (CoT)/ScratchPad [1] reasoning and majority voting.
Weaknesses
Scientific Reservations
- Limited Mathematical Task Scope: The reliance on GSM8K and limited algorithmic tasks raises concerns about generalizability. These datasets represent only basic levels of math reasoning (grade school and high school). Including results from more rigorous benchmarks, such as the IMO portion of OlympiadBench [2] or math subsets of MMLU, would strengthen the claims.
- Over-Reliance on Proprietary Model (Gemini 1.0 Pro): By using Gemini 1.0 Pro to generate solutions and rationales for training smaller Gemma models, the paper introduces a dependency on proprietary resources, which might limit reproducibility. Showing results on more accessible, open-source models would be essential to add credibility.
- Toy Nature of Algorithmic Tasks: The algorithmic tasks feel limited and not representative of real-world complexity. Including a more robust task, or additional toy tasks for variety, would better support the general claims.
Writing Reservations
- Inconsistency in Figures and Text: Figure 4 uses inconsistent colors (e.g., GenRM in blue but lines are cyan/green), which makes interpretation challenging. Additionally, the reported improvements (e.g., 73%-93.4% in the introduction vs. 16%-40% in the abstract) should be unified to avoid reader confusion.
- Notation and Explanation Gaps: Section 3.1’s notation (e.g., inconsistent usage of x, y) creates confusion and requires more clarity. Specific variables, like I, need explicit definitions or cross-references to earlier sections to ensure readability.
- Incomplete Background: Key concepts, such as "reference-guided grading" and "LLM-as-a-Judge," are insufficiently explained, causing unnecessary interpretive burden. Adding a background section for these terms, or moving some non-essential related work to the appendix, could improve clarity.
- Confusing Terminology: The paper should clarify that “CoT Verifiers” refers to CoT reasoning in the verification process, not the solutions themselves, which also contain CoT. Renaming these methods would reduce ambiguity.
- Inference and Training Separation: The distinction between training and inference (lines 211-241) is blurred. Separating these sections would make the methodology clearer.
- Inconsistent Use of Majority Voting: The term "majority voting" implies selecting the most frequent result, yet the paper uses an averaging approach. Clarifying this terminology would prevent misunderstanding.
Questions
Suggested Improvements
- Broader Mathematical Validation: To strengthen the scientific claims, I suggest including results from more advanced math reasoning benchmarks, such as IMO tasks from OlympiadBench or relevant subsets of MMLU. Results from these additional benchmarks could significantly boost the paper’s credibility. BIG-bench also has relevant tasks, e.g., induction and Identify Math Theorems: https://github.com/google/BIG-bench/tree/main/bigbench/benchmark_tasks/identify_math_theorems
- Justification for Larger Model Use: The reliance on a larger model (Gemini 1.0 Pro) to generate training data for smaller models needs a sound explanation, as it impacts reproducibility. Without this, the setup may seem biased.
- CoT-Only Baseline: To isolate the effect of CoT reasoning in verification, a baseline experiment using CoT reasoning alone without verification reasoning steps could help confirm the added value of GenRM-CoT.
- Consider Extending to Other Models: Testing GenRM-CoT on open-source models would help show that the approach generalizes beyond the proprietary Gemini/Gemma series.
- Length Generalization: Generalizing to shorter problem lengths is not particularly noteworthy, as longer problem lengths often include shorter steps. Showing robustness across various task lengths would be more convincing.
I'm willing to revise if points addressed well.
Citations:
[1] Scratchpad: https://arxiv.org/abs/2112.00114 (please cite as well; they developed the CoT idea concurrently)
[2] OlympiadBench: A Challenging Benchmark for Promoting AGI with Olympiad-Level Bilingual Multimodal Scientific Problems, https://arxiv.org/abs/2402.14008
Toy Nature of Algorithmic Tasks
The algorithmic tasks are indeed designed to be illustrative: even on relatively simple tasks such as last-letter concatenation, using verification CoT yields notable performance gains compared to the discriminative RM baseline. These tasks were introduced by previous works: last-letter concatenation is from the CoT paper [7], and the word-sorting task is from Big-Bench-Hard [8].
Length Generalization: Showing robustness across various task lengths would be more convincing.
On algorithmic tasks, the setup was already based on length generalization: we train verifiers on word lists of length {2,3,4}, and evaluate their generalization on length {5,6}. This was mentioned in the Tasks section at the beginning of Section 4 Experiments; we have also updated the manuscript to clarify this (Line 256).
In addition, our method excels at easy-to-hard generalization: when trained only on grade-school math, it can generalize to high-school competition-level math (from AMC 10, AMC 12, AIME) as well as MMLU college_mathematics, and performs much better than baselines, including the discriminative RM. This setup is in fact more difficult than the original setup in the easy-to-hard paper [9].
Writing reservations
We have also improved our writing in the updated draft based on the reviewer’s suggestions, including
- fixing Figure 4’s color issues,
- having a unified format to report improvements (in Abstract and Figure 1),
- highlighting variable I in the method section,
- adding citation to the scratchpad prompting paper,
- adding Background section for LLM-as-a-Judge,
- clarifying the meaning of CoT Verifiers, etc.
References:
[1] “Training Verifiers to Solve Math Word Problems”, Cobbe et al, 2021.
[2] “Measuring Mathematical Problem Solving With the MATH Dataset”, Hendrycks et al, NeurIPS 2021.
[3] “Solving quantitative reasoning problems with language models”, Lewkowycz et al, NeurIPS 2022.
[4] “Let's Verify Step by Step”, Lightman et al, 2023.
[5] “The Claude 3 Model Family: Opus, Sonnet, Haiku”, Anthropic, 2023.
[6] “Gemini: A Family of Highly Capable Multimodal Models”, Gemini Team Google, 2023.
[7] “Chain-of-thought prompting elicits reasoning in large language models”, Wei et al, 2022.
[8] “Challenging BIG-Bench tasks and whether chain-of-thought can solve them”, Suzgun et al, 2022.
[9] “Easy-to-Hard Generalization: Scalable Alignment Beyond Human Supervision”, Sun et al, 2024.
We thank the reviewer for their feedback. We are glad to hear that the reviewer finds our work (a) innovative and shows promise for complex reasoning tasks, (b) streamlined and scalable, and (c) useful in terms of the downstream performance.
To address your concerns, we have run additional suggested experiments on MMLU college mathematics, clarified that we only used publicly available models with open weights and open-sourced our training dataset for reproducibility, improved our writing, and addressed the weaknesses. We answer your questions below.
To strengthen the scientific claims, I suggest including results from more advanced math reasoning benchmarks, such as IMO tasks from OlympiadBench or relevant subsets of MMLU.
We chose GSM8K [1] and Hendrycks’ MATH [2, 3, 4] because they are widely accepted by the community. Moreover, the MATH dataset is a gold standard for evaluating mathematical reasoning abilities; for instance, it is included in OpenAI’s simple-evals (https://github.com/openai/simple-evals) and widely reported in Gemini and Claude model cards [5, 6].
Following the reviewer's suggestion, we evaluated easy-to-hard generalization of GSM-trained verifiers on MMLU’s college_mathematics (100 problems in the test split), showing that GenRM-CoT verifiers trained on grade-school math generalize well to college-level mathematics:
- pass@1 is 47.6%; Self-Consistency based on 32 solutions gives a 52% solve rate; Best-of-32 based on the Discriminative RM gives 53.0%; and Best-of-32 with GenRM-CoT (using 32 majority votes) gives 56.1%. See Figure C.4 in the Appendix.
- In addition, using just a single verification rationale with GenRM-CoT can already outperform Discriminative RM (Figure C.4 on the right).
OlympiadBench is a new multimodal benchmark that only came out earlier this year. That said, a very recent follow-up work applied GenRM to OlympiadBench with Llama 3.1 and Qwen 2.5 models and found that it outperforms the Discriminative RM. Specifically, Best-of-100 with Llama 8B obtains a score of 30.2%, improving over the pass@1 accuracy of 19%. These results independently confirm the effectiveness of GenRM on hard tasks.
Over-Reliance on Proprietary Model (Gemini 1.0 Pro) .. which might limit reproducibility.
In our experiments, we do not fine-tune Gemini 1.0 Pro but only run inference via its public API (the Gemini Developer API). We use the API to generate synthetic rationales for training, which we have anonymously open-sourced at https://github.com/gen-agent/genrm-data/ to ensure reproducibility.
Testing GenRM-CoT on open-source models would help show that the approach generalizes beyond the proprietary Gemini/Gemma series.
The Gemma 2B, 7B and Gemma2 9B models we finetuned are open-weights models. To ensure no proprietary resources are required to reproduce our results, we have also open-sourced our training data for generative CoT verifiers.
Moreover, as mentioned earlier, recent follow-up work also finds that GenRM outperforms discriminative RMs on Llama 70B and Qwen 2.5 7B models.
CoT-Only Baseline: To isolate the effect of CoT reasoning in verification, a baseline experiment using CoT reasoning alone without verification reasoning steps could help confirm the added value of GenRM-CoT.
All the generated solutions (that a verifier needs to grade) already use CoT before outputting the final answers. All verifiers and baselines we considered have solutions’ CoT as a part of the inputs. Therefore, when using Best-of-N, “CoT reasoning alone without verification reasoning steps” is the discriminative RM baseline.
As for “isolating the effect of CoT reasoning in verification”, we have provided the results of GenRM without CoT in Figure 1 and also Figure C.1 in the Appendix. GenRM without CoT performs more or less the same as discriminative RM, showing that the gain mostly comes from verification CoT.
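For contrast with the CoT verifier sketched earlier in this discussion, direct GenRM (without verification CoT) reduces to a single next-token prediction; a minimal sketch under the same hypothetical interface:

```python
# Direct GenRM scoring without a verification chain of thought: the score is just
# the next-token probability of "Yes" given the question and candidate solution.
# Same hypothetical `verifier.token_prob` interface as in the earlier sketch.

def genrm_score(verifier, question: str, solution: str) -> float:
    prompt = (
        f"Question: {question}\n"
        f"Proposed solution: {solution}\n"
        "Is the solution correct? "
    )
    return verifier.token_prob(prompt, "Yes")
```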
Dear reviewer 8GSi,
We want to check in on whether our rebuttal and updated paper have addressed your concerns, and whether you had a chance to see the MMLU college_mathematics experiment we added. We would be happy to discuss further.
Thank you!
Dear reviewer 8GSi,
As the discussion period will close tomorrow, we would like to send another friendly reminder to check out our response. We believe we have responded to most of your queries and concerns. Your further feedback will be greatly appreciated.
Thank you!
We chose GSM8K [1] and Hendrycks’ MATH [2, 3, 4] because they are widely accepted by the community. Moreover, the MATH dataset is a gold standard for evaluating mathematical reasoning abilities; for instance, it is included in OpenAI’s simple-evals (https://github.com/openai/simple-evals) and widely reported in Gemini and Claude model cards [5, 6].
I understand that GSM8K and MATH are widely used benchmarks. However, my role as a reviewer is to ensure that the community progresses beyond over-reliance on, and overfitting to, a limited set of datasets. These benchmarks, while popular, represent a narrow slice of mathematical reasoning and focus heavily on competition-level problems, which are rare in real-world applications. Thus, it is imperative to include evaluations on a broader range of mathematical areas and levels. (To be direct, I recommend not engaging in further justification of GSM8K and MATH as sufficient benchmarks in your response to this concern; that just won't convince me.)
To address this, I request more extensive evaluations. I appreciate the additional MMLU college mathematics results, which provide some confidence in the reliability of the reported results. However, running more comprehensive evaluations is crucial and feasible. Specifically, MMLU mathematics with lm-harness is trivial to execute (no VLLM required). I suggest including evaluations on the following MMLU tasks:
math_tasks = [
"abstract_algebra",
"college_mathematics",
"elementary_mathematics",
"high_school_mathematics"
]
PS: there are more MMLU benchmarks that would be interesting to see, e.g., formal_logic, machine_learning, etc., but they aren't a substitute for the above or below requests.
If results are provided for these four tasks and remain positive before the December deadline, I am prepared to raise my score by +1 (e.g., from 5 → 6). This should be straightforward to implement.
Additionally, if you include experiments on OlympiadBench with positive results, I would encourage avoiding reliance on images as inputs. Instead, consider leveraging Gemini 1.0 (to which you have access) to generate Asymptote representations for a subset of the problems (or simply omit the images). A reasonable subset of 250 problems should be achievable and would provide strong evidence for the paper's generalizability and applicability. Alternatively, extend the evaluations to include harder tasks beyond the GSM8K/MATH level (e.g., reasonable subsets of OlympicArena: https://gair-nlp.github.io/OlympicArena/). If so, I would be willing to raise my score by another +1 (6 → 8).
Finally, I recommend expanding the scope to include more advanced benchmarks like Omni-MATH, FrontierMath, Putnam-AXIOM, or comparable challenging datasets. Alternatively, evaluating on a diverse set of models (at least 2 more), including Qwen, InternLM, LLaMA, or other defensible open-source models, would demonstrate generalizability beyond Google models. Ideally, synthetic data should be generated by models of the same size as those being tested (e.g., using few-shot prompting or retrieval-augmented generation, but not larger models). Achieving this would mean I'd be willing to further improve the paper's score by +1 (8 → 10).
I hope these are clear and actionable steps to improve the paper. While I apologize for the delay in responding, I believe this feedback provides a concrete path to significantly strengthen your work. I also appreciate the effort required to address these points and hope the suggested changes will help elevate the paper. Please feel free to add anything discussed to the camera-ready version.
Note: December 3rd: Last day that authors may post a message on the forum (six day extension).
Some references:
https://arxiv.org/abs/2410.07985
https://gair-nlp.github.io/OlympicArena/
https://huggingface.co/datasets/cais/mmlu
We thank the reviewer for their actionable feedback. While testing our approach for training verifiers on standard benchmarks does not necessarily imply overfitting to those benchmarks, we have run easy-to-hard generalization evaluation on MMLU tasks with our GSM-trained Gemma 9B verifiers. Our positive results with GenRM-CoT indeed provide strong evidence for the paper's generalizability and applicability.
I appreciate the additional MMLU college mathematics results .. including more evaluations is crucial and feasible on MMLU mathematics tasks .. prepared to raise my score by +1.
Best-of-32 evaluation results on 4 MMLU mathematics tasks requested by the reviewer:
| MMLU Dataset | Base Model (Pass@1) | Disc-RM | GenRM-CoT |
|---|---|---|---|
| elementary_mathematics | 80.1% | 90.6% | 91.1% |
| high_school_mathematics | 52.2% | 74.8% | 76.1% |
| college_mathematics | 47.6% | 53.0% | 56.1% |
| abstract_algebra | 37.9% | 50.0% | 53.5% |
Akin to our prior results, the Best-of-N performance of GenRM-CoT also scales with inference-time compute on MMLU (see the abstract_algebra results below) as we increase the number of verification CoT samples, a desired capability enabled by generative verifiers. This is very promising given that the GenRM-CoT verifier is trained using noisy, possibly error-prone synthetic verification rationales on GSM8K.
| Num. Verification CoT samples | GenRM-CoT (Best-of-32) |
|---|---|
| 8 votes | 49.5% |
| 16 votes | 51.5% |
| 32 votes | 53.5% |
This paper proposes training verifiers using the ubiquitous next-token prediction objective, jointly on verification and solution generation. The method appears to be novel, and the experimental results outperform DPO verifiers and LLM-as-a-Judge.
Strengths: A novel approach to building LLM-based verifiers; the experimental results show good improvements over existing methods.
Weaknesses: 1. Some of the experiments could be more comprehensive and include more tasks. 2. The models are based on private models and could be difficult to replicate.
Additional Comments on Reviewer Discussion
- Reviewer 8GSi's proposed additional tasks were adequately addressed by the authors and showed some good improvements. Based on the discussion, I believe reviewer 8GSi supported acceptance.
- Reviewer 24YJ generally supported the acceptance and the additional concerns were addressed through the rebuttal process.
- Reviewer odxm didn't support acceptance. However, reviewer odxm did not clearly point out why the rebuttal failed to address their concerns. There was a rebuttal discussion in which reviewer odxm raised several concerns, though these appeared more subjective than based on clear evidence. Thus I chose to downweight this score.
Accept (Poster)