Multi-Agent Verification: Scaling Test-Time Compute with Multiple Verifiers
We explore scaling the number of verifier models as a novel test-time scaling dimension for improving language model performance and introduce an algorithm that enables simple scaling along this dimension.
Reviews and Discussion
This paper proposes to improve LLM outputs at test time by generating n answers with the LLM and selecting one answer (best-of-n) using the binary approval rate of several smaller models that judge the answers with various task-specific prompts.
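For readers new to the setup, the selection rule described here can be sketched in a few lines: average each candidate's binary approvals across verifiers and keep the highest-scoring candidate. The function name and data layout below are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of best-of-n selection with multiple binary aspect verifiers.
# The data layout (a list of 0/1 approvals per candidate) and the function name
# are illustrative assumptions, not the paper's actual implementation.

def select_best_of_n(candidates, approvals):
    """Return the candidate with the highest average binary approval rate.

    candidates: list of n generated answers.
    approvals:  approvals[i][j] is 1 if verifier j approved candidate i, else 0.
    """
    def approval_rate(i):
        votes = approvals[i]
        return sum(votes) / len(votes)

    best_index = max(range(len(candidates)), key=approval_rate)
    return candidates[best_index]


if __name__ == "__main__":
    candidates = ["answer A", "answer B", "answer C"]
    approvals = [
        [1, 0, 1],  # verifier approvals for "answer A"
        [1, 1, 1],  # verifier approvals for "answer B"
        [0, 0, 1],  # verifier approvals for "answer C"
    ]
    print(select_best_of_n(candidates, approvals))  # -> answer B
```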
Reasons to Accept
The paper is well presented and well structured. The experimental results are interesting and show an improvement over other BoN selection methods (reward model, self consistency) as well as better scaling with the number of candidate outputs.
Reasons to Reject
The method is similar to the one used in Constitutional AI (https://arxiv.org/pdf/2212.08073), which also uses various prompts for verifier LLMs to select the best answers. There are some differences (the direct usage here vs. RL training there, the averaging of the verifiers' answers here instead of random selection of verifiers there), but such approaches should be discussed as similar to yours.
The results lack the comparison with an oracle baseline that would select the real best of n (pass@n). It would give a better idea of the ceiling performance of such BoN approaches. There could be a simpler performance evaluation of the approach: comparing the rates at which one of the correct answers (when it exists) among the n generated was selected by the different methods. This would help decorrelate the base model performance and the performance of the proposed approach.
There is no comparison of the efficiency of compute usage with other test-time compute approaches. This method is probably more compute-intensive compared to reflection models, but it can be parallelized, which could lead to better latency. Is that the case?
The method seems to rely heavily on hand-crafted, domain-specific verifier prompts, and Figure 5 seems to suggest that there is large variance with respect to the verifier selection. This limits the generality and scalability of the approach.
The practical usage of the approach seems to be as a selection mechanism for RL training, but this is left as future work.
Questions for Authors
l.254: it is said that "all possible combinations of m verifiers drawn from Md" are used. Does this mean that you produced 16 generations overall, ran all verifiers, and mixed the verifier results in all possible ways on the same generations and the same verifier outputs? This saves compute (or API calls) but hides some of the variability of the results; clarifying this would help me understand the percentile ranges in Figure 5.
Thank you for your thoughtful review and questions. We appreciate your recognition that our experimental results are interesting and show improvements over other best-of-n selection methods. We're excited about the potential for scaling verification compute by using more diverse verifiers, in addition to just scaling generation compute. Below we address each of your points and provide additional experiments and clarifications that we believe strengthen our contributions and address your questions.
The method is similar to the one used in Constitutional AI (https://arxiv.org/pdf/2212.08073), which also uses various prompts for verifier LLMs to select the best answers. There are some differences (the direct usage here vs. RL training there, the averaging of the verifiers' answers here instead of random selection of verifiers there), but such approaches should be discussed as similar to yours.
Thank you for pointing out this connection. We agree that Constitutional AI (CAI) uses LLM-based evaluation to improve performance; it was one of the first major works demonstrating reinforcement learning from AI feedback (RLAIF) for LLMs. CAI focused on improving alignment by having the model respond to a prompt, critique the response, revise it, and then train on the improved version (along with other similar variants of this technique, such as generating preferences). That is, the focus was on improving alignment through train-time RL using the model's own feedback (and a set of constitutional principles).
Our paper has a different contribution: we propose a new approach to test-time scaling that is orthogonal to the traditional scaling method of just sampling more solutions from the generator LLM. That is, we propose scaling verification compute by increasing the number and type of verifiers, without any training. Our results show that scaling both the number and diversity of verifiers can improve test-time performance, and we show that this approach has other important properties like enabling self-improvement and weak-to-strong generalization (Table 2), as well as raising the ceiling on test-time performance while existing methods plateau (Figure 6).
While both CAI and our work involve LLM-based evaluation, they address fundamentally different problems using different methodologies. CAI is an RLAIF train-time approach for alignment, while our work is focused on test-time scaling and introduces an approach where verification compute can be scaled using more verifiers. We will definitely add a discussion of Constitutional AI and these distinctions to the paper to better position our work within the broader landscape of AI feedback methods.
The results lack the comparison with an oracle baseline that would select the real best of n (pass@n). It would give a better idea of the ceiling performance of such BoN approaches.
Thank you for this suggestion. We agree that an oracle baseline (pass@n) would provide valuable insight into the upper bound on performance. Below are the pass@n oracle results for our two most challenging benchmarks, MMLU-Pro and GPQA (diamond). To save space, we have not included the other two benchmarks, but will include pass@n results for all benchmarks in the final version of the paper.
MMLU-Pro:
| Model | pass@n | MAV | Cons |
|---|---|---|---|
| Gemini-1.5-Flash | 81.0 | 66.7 | 63.3 |
| Gemini-1.5-Pro | 85.7 | 72.3 | 71.7 |
| GPT-4o-mini | 80.3 | 67.0 | 63.7 |
| GPT-4o | 88.3 | 75.7 | 76.3 |
| Mistral-7B | 43.3 | 36.7 | 25.7 |
| Llama-3.1-8B | 87.0 | 59.3 | 55.3 |
| Gemma-2-9B | 74.0 | 57.7 | 54.3 |
| Gemma-2-27B | 81.0 | 62.0 | 58.3 |
GPQA (diamond):
| Model | pass@n | MAV | Cons |
|---|---|---|---|
| Gemini-1.5-Flash | 80.0 | 42.0 | 40.0 |
| Gemini-1.5-Pro | 86.0 | 49.0 | 45.0 |
| GPT-4o-mini | 79.0 | 50.0 | 48.0 |
| GPT-4o | 87.0 | 59.0 | 59.0 |
| Mistral-7B | 54.0 | 36.0 | 32.0 |
| Llama-3.1-8B | 90.0 | 43.0 | 36.0 |
| Gemma-2-9B | 69.0 | 34.0 | 36.0 |
| Gemma-2-27B | 81.0 | 41.0 | 40.0 |
Note: Numbers represent the accuracy (%) of the applied method for selecting between 16 sampled solutions per question (n = 16).
There could be a simpler performance evaluation of the approach: comparing the rates at which one of the correct answers (when it exists) among the n generated was selected by the different methods. This would help decorrelate the base model performance and the performance of the proposed approach.
Thanks for making this suggestion. We agree that this would help decorrelate the base model performance. We have computed the accuracies only counting questions where at least one of the sampled solutions is correct (i.e., where pass@n is correct) and found that MAV performs similarly to the original results (see table below on MATH). We're happy to add these additional results to the final version of the paper.
MATH:
| Model | MAV | Cons | RM |
|---|---|---|---|
| gemini-1.5-flash-001 | 87.3 | 77.6 | 81.6 |
| gemini-1.5-pro-001 | 87.2 | 82.1 | 82.9 |
| gpt-4o-mini-2024-07-18 | 86.3 | 87.5 | 84.8 |
| gpt-4o-2024-08-06 | 88.2 | 88.2 | 92.0 |
| kscope_Mistral-7B-Instruct-v0.3 | 67.0 | 57.4 | 56.5 |
| kscope_Meta-Llama-3.1-8B-Instruct | 80.8 | 78.2 | 70.1 |
| kscope_gemma-2-9b-it | 82.4 | 73.8 | 78.6 |
| kscope_gemma-2-27b-it | 83.3 | 73.6 | 78.4 |
There is no comparison of the efficiency of compute usage with other test-time compute approaches. This method is probably more compute-intensive compared to reflection models, but it can be parallelized, which could lead to better latency. Is that the case?
You're correct that our method is more compute intensive, which is precisely because our goal is to scale verification compute and demonstrate that spending more compute on verification improves performance. In Figure 6 (right), we compare accuracy against total compute budget and find that MAV uses more compute but significantly raises the ceiling on performance—achieving nearly double the improvement while other methods plateau early on. Regarding parallelization, yes, the verifiers run in parallel and so the latency is similar to adding a single extra model call.
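To illustrate the parallelization point in the reply above, here is a rough sketch of fanning the verifier calls out concurrently so wall-clock latency stays close to a single verifier call; `query_verifier` is a hypothetical placeholder, not an API the paper provides.

```python
# Sketch of running aspect verifiers concurrently; latency is then dominated by
# the slowest single verifier call rather than the sum of all calls.
# `query_verifier` is a hypothetical placeholder for a real model/API client.
from concurrent.futures import ThreadPoolExecutor


def query_verifier(verifier_prompt: str, candidate: str) -> bool:
    raise NotImplementedError("replace with a real model/API call")


def parallel_approval_rate(verifier_prompts, candidate):
    """Query all verifiers in parallel and return the average binary approval."""
    with ThreadPoolExecutor(max_workers=len(verifier_prompts)) as pool:
        votes = list(pool.map(lambda p: query_verifier(p, candidate), verifier_prompts))
    return sum(votes) / len(votes)
```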
The method seems to rely heavily on hand-crafted, domain-specific verifier prompts, and Figure 5 seems to suggest that there is large variance with respect to the verifier selection. This limits the generality and scalability of the approach.
Thanks for raising this important point. We agree that hand-engineering every verifier would limit the generality and scalability of our approach. To address these concerns, we have conducted an additional experiment using fully AI-generated aspects. We find that using the AI-generated verifiers performs even better than our hand-engineered ones (see full experiment details below). We ran these experiments for Gemini-1.5-Flash as the generator LLM on MMLU-Pro and GPQA diamond (our two most challenging benchmarks):
MMLU-Pro:
| Model | MAV-automated-aspects | MAV-manual-aspects (original) | self-consistency | pass@1 |
|---|---|---|---|---|
| Gemini-1.5-Flash | 68.0 | 65.7 | 63.3 | 59.3 |
GPQA (diamond):
| Model | MAV-automated-aspects | MAV-manual-aspects (original) | self-consistency | pass@1 |
|---|---|---|---|---|
| Gemini-1.5-Flash | 47.0 | 41.0 | 40.0 | 42.0 |
Note: Numbers represent the accuracy (%) of the method used for selecting between 16 sampled solutions per question (n = 16).
These results suggest that aspects do not need to be hand-engineered and that we could fully automate aspect verifier generation and selection, a promising result for future work. We will include these new results in the final version of the paper, as well as additional discussion on automating the generation and selection of aspect verifiers.
Experiment Details: We provided GPT-4o with just 3 example aspect prompts from our original manual set and asked it to generate 10 new prompts with different aspect-strategy combinations. Using these 10 purely AI-generated verifiers (without any manual engineering or any of our original manual aspects), we evaluated Gemini-1.5-Flash on MMLU-Pro and GPQA (diamond). For simplicity, we do not perform any “aspect engineering” for this experiment. That is, we use all 10 AI-generated aspects (”MAV-automated-aspects” in the table above) and compare to the “MAV-All” results in Table 5 of the Appendix which uses all 10 original manual aspects (”MAV-manual-aspects (original)” in the table above). Below is one example of the AI-generated aspects created by GPT-4o:
"redundant_steps": (
"INSTRUCTIONS: \\n"
f"Look for any unnecessary steps that don’t contribute to the final result. Think out loud. "
f"If the solution includes redundant or misleading steps, stop and reply '{VERA_ANSWER_SYMBOL}False'. "
f"If every step is necessary and useful, reply '{VERA_ANSWER_SYMBOL}True'."
),
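To show how a reply produced by such a prompt could be scored, here is a small parsing sketch; the concrete value of VERA_ANSWER_SYMBOL below is a made-up example, since the response only shows the placeholder name.

```python
# Sketch of converting an aspect verifier's free-form reply into a binary vote.
# The marker value is a made-up example; the response above only shows that
# replies are expected to end with '{VERA_ANSWER_SYMBOL}True' or '...False'.
VERA_ANSWER_SYMBOL = "FINAL ANSWER: "  # hypothetical marker string


def parse_approval(reply: str) -> bool:
    """Return True if the text after the final marker starts with 'True'."""
    tail = reply.rsplit(VERA_ANSWER_SYMBOL, 1)[-1].strip()
    return tail.lower().startswith("true")


print(parse_approval("Every step is used later on. FINAL ANSWER: True"))    # True
print(parse_approval("Step 3 never gets used, stop. FINAL ANSWER: False"))  # False
```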
Regarding the variance in Figure 5, we note that the variance is primarily observed when using fewer verifiers than our engineered/selected set. Moreover, we view this as evidence of room for improvement rather than a fundamental limitation—better verifier engineering and selection methods would likely reduce this variance while maintaining the benefits of multi-agent verification.
The practical usage of the approach seems to be as a selection mechanism for RL training, but this is left as future work.
We agree that using MAV as a selection mechanism for RL training is an interesting direction for future work. However, we want to clarify that the primary contribution of our paper is demonstrating that scaling verification compute at test-time can improve the performance of existing models without any additional training. Our focus is on showing that scaling verification compute is a helpful orthogonal dimension to scaling generation compute, and that it can raise the ceiling on performance while other test-time methods plateau.
l.254: it is said that "all possible combinations of m verifiers drawn from Md" are used. Does this mean that you produced 16 generations overall, ran all verifiers, and mixed the verifier results in all possible ways on the same generations and the same verifier outputs? This saves compute (or API calls) but hides some of the variability of the results; clarifying this would help me understand the percentile ranges in Figure 5.
Yes, that's correct. Initially, we considered just sampling random subsets of 1 verifier, 2 verifiers, etc., but realized that any increasing trend might be an artifact of specific sampling instances rather than a fundamental property of scaling verifiers. By computing all possible combinations for each value of m, we can show that the scaling trend is independent of which specific verifiers are chosen for that value of m and holds across the full distribution of verifier subsets. Specifically, we generated the 16 candidate outputs and ran all verifiers on these same outputs once, then we computed performance for all possible combinations of m = 1 verifiers, m = 2 verifiers, etc. As you mentioned, this also saves significant API costs by capturing the core combinatorial effects without re-generating all the data for each combination of verifiers (which was not possible on an academic budget).
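For concreteness, the procedure described above could look roughly like the sketch below: cache every verifier's votes once, then score every size-m subset of verifiers on those cached votes. The data shapes and accuracy bookkeeping are illustrative assumptions.

```python
# Sketch of evaluating all subsets of m verifiers on cached verifier votes,
# as described above. Data shapes are illustrative assumptions:
#   approvals[q][i][j] = 1 if verifier j approved candidate i of question q
#   correct[q][i]      = True if candidate i of question q is actually correct
from itertools import combinations


def subset_accuracies(approvals, correct, m):
    """Return the BoN selection accuracy for every subset of m verifiers."""
    num_verifiers = len(approvals[0][0])
    accuracies = []
    for subset in combinations(range(num_verifiers), m):
        hits = 0
        for question_votes, question_correct in zip(approvals, correct):
            scores = [sum(candidate[j] for j in subset) for candidate in question_votes]
            best = max(range(len(scores)), key=scores.__getitem__)
            hits += question_correct[best]
        accuracies.append(hits / len(approvals))
    return accuracies
```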
Thanks again for your constructive comments and suggestions. We hope that our additional experiments and clarifications have addressed the issues you raised. We will include these updates in the final version of the paper. If you have outstanding concerns, we would be grateful if you raised them during this feedback period so that we can address them. Thanks again for your time and your review.
Thank you for the new experiments and results. I think these greatly strengthen the paper, especially the new results with generated verifier prompts. I still have a few questions:
- On the pass@n table, the GPQA results in particular show a significant gap between any selection method and the oracle. This is actually a good sign that sampling methods without retraining can go a long way. However, this is surprising when I compare it with the selection accuracies that you give for MATH. Could you give the accuracies for GPQA too? They should be much lower. Why do you think the domain (math or code) affects the accuracy of your answer selection method so much?
- About the test-time compute scaling, I think you misunderstood my question. I'll try to be clearer. I saw your comparison in Figure 6 with other answer selection methods. I was asking you to compare with other types of test-time compute methods, i.e., thinking models. As the compute budget is scaled (number of samples or number of thinking tokens), how does the accuracy scale for each type of test-time compute method?
Thank you for the detailed answer and the new results. Now that you can scale the number of verifier models, I wonder how performance would scale with even more verifiers.
Thank you for the additional questions. Below, we address your questions with additional clarifications and experiments.
On the pass@n table, the GPQA results in particular show a significant gap between any selection method and the oracle. This is actually a good sign that sampling methods without retraining can go a long way. However, this is surprising when I compare it with the selection accuracies that you give for MATH. Could you give the accuracies for GPQA too? They should be much lower. Why do you think the domain (math or code) affects the accuracy of your answer selection method so much?
Thanks for the question. Indeed, the GPQA accuracies when we consider only questions with at least 1 correct answer are lower all-around (see table below). This occurs because GPQA has only 4 multiple-choice answers (A, B, C, or D) while MATH answers are free-form numbers (the space of possible answers is much larger). Since we sample 16 solutions and there are only 4 possible answers for GPQA, it's very likely that at least one of the sampled GPQA solutions hits the correct answer by chance. It is much less likely to produce the exact answer for MATH by chance.
Given this, when filtering to questions where at least one sample is correct (pass@n = 1), GPQA includes most questions regardless of difficulty, but MATH only includes the subset where the model could solve the problem at least once. This makes the MATH task on questions where pass@n = 1 fundamentally easier, and that is why the accuracies for MATH are all-around higher than the original results in the paper.
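To make this concrete with a back-of-the-envelope calculation (under the crude assumption that each of the 16 sampled GPQA answers is an independent uniform guess over the four options):

```latex
% Probability that at least one of n = 16 independent uniform guesses over
% 4 options is correct (a crude independence assumption, for intuition only):
\[
  P(\text{at least one correct}) \;=\; 1 - \left(\tfrac{3}{4}\right)^{16} \;\approx\; 0.99,
\]
% whereas a free-form MATH answer is very unlikely to be hit by chance.
```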
GPQA (diamond) selection accuracies:
Note that this matches the results in the paper, where we find that MAV outperforms the baselines in all domains except for GPQA (diamond), where MAV performs similarly to RM and outperforms self-consistency.
| Model | MAV | Cons | RM |
|---|---|---|---|
| gemini-1.5-flash-001 | 51.3 | 50.0 | 57.5 |
| gemini-1.5-pro-001 | 57.0 | 52.3 | 57.0 |
| gpt-4o-mini-2024-07-18 | 62.0 | 60.8 | 55.7 |
| gpt-4o-2024-08-06 | 63.2 | 67.8 | 66.7 |
| kscope_Mistral-7B-Instruct-v0.3 | 68.5 | 59.3 | 68.5 |
| kscope_Meta-Llama-3.1-8B-Instruct | 42.2 | 40.0 | 45.6 |
| kscope_gemma-2-9b-it | 47.8 | 52.2 | 55.1 |
| kscope_gemma-2-27b-it | 50.6 | 49.4 | 49.4 |
About the test-time compute scaling, I think you misunderstood my question. I'll try to be clearer. I saw your comparison in Figure 6 with other answer selection methods. I was asking you to compare with other types of test-time compute methods, i.e., thinking models. As the compute budget is scaled (number of samples or number of thinking tokens), how does the accuracy scale for each type of test-time compute method?
Thanks for the clarification and the interesting question. We have conducted additional experiments comparing our approach to a thinking method.
We first evaluated DeepSeek-R1, which achieves 87.7% accuracy on MATH using an average of 1,902 tokens per question. However, this is not directly comparable to our approach since R1 has been specifically fine-tuned for mathematical reasoning, whereas our approach works zero-shot with off-the-shelf non-RLVR models (no extra training).
For a more apples-to-apples comparison, we implemented a simple reasoning baseline where we iteratively prompt the same base models to continuously extend a chain of reasoning until we use a similar number of thinking tokens as our MAV approach on the MATH dataset (about 80k tokens per question on average). This comparison is more reasonable since both methods scale test-time compute using zero-shot prompting without fine-tuning.
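For readers who want a picture of this baseline, a rough sketch is below; `query_model`, the token counting, and the prompt wording are hypothetical placeholders, not the exact protocol used in the experiments.

```python
# Rough sketch of a budget-matched reasoning baseline: keep asking the same
# model to extend its chain of reasoning until a token budget is exhausted.
# `query_model` and `count_tokens` are hypothetical placeholders; the exact
# prompts and budget accounting used in the experiments may differ.
def query_model(prompt: str) -> str:
    raise NotImplementedError("replace with a real model/API call")


def count_tokens(text: str) -> int:
    return len(text.split())  # crude proxy; a real tokenizer would be used


def budget_matched_reasoning(question: str, token_budget: int = 80_000) -> str:
    transcript = f"Question: {question}\nLet's think step by step."
    used = 0
    while used < token_budget:
        continuation = query_model(
            transcript + "\n\nContinue the reasoning and refine your final answer."
        )
        transcript += "\n" + continuation
        used += count_tokens(continuation)
    return transcript
```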
MATH:
| Model | B-MAV (ours) | Scaling Reasoning Tokens | pass@1 |
|---|---|---|---|
| Gemini-1.5-Flash | 66.0% | 53.0% | 52.7% |
| GPT-4o-mini | 73.0% | 72.3% | 69.0% |
We find that when controlling for token count, our method performs better than simply scaling reasoning tokens within a single model call.
We hope this resolves your concern. We will definitely perform a more comprehensive set of such experiments for the final version of the paper and welcome any particular suggestions for other test-time compute methods to include.
I think these new results and interpretations make the paper stronger.
This paper proposes a new test time computation method that exploits multiple verifiers. The proposed approach uses off-the-shelf LLMs as aspect verifiers to evaluate an LLM's output from multiple different aspects. Multiple verifiers evaluate each output of an LLM with a binary score, and the proposed method uses the average score to select the best one. The proposed method is easier to use since it does not need additional verifier training. The paper compares the proposed method with the self-consistency method and a best-of-n method using a reward score. The experimental results show that the proposed method outperforms baselines on the MATH and MMLU-Pro benchmarks, and that the multi-agent verification can achieve weak-to-strong generalization.
Reasons to Accept
- The paper is clearly written and easy to read.
- The idea of using multiple verifiers is simple but interesting.
- The paper evaluates the proposed method with multiple benchmarks and using multiple LLMs for the generator and verifier.
Reasons to Reject
I agree that the proposed method is convenient since we can use off-the-shelf LLMs as verifiers. However, it is still unclear whether multi-agent verification is an effective approach because off-the-shelf verifiers (GPT-4o-mini and Gemini-1.5-Flash) are generally more powerful than the 8B reward model used in RM. Therefore, it is possible that the proposed method's good performance comes from the power of large LLMs, not from the use of multiple agents.
Questions for Authors
- Table 1 shows that the proposed method's gain over baseline methods tends to be small when generator LLMs are powerful. For example, GPT-4o shows the best performance among others in all the tasks, but B-MAV does not outperform Cons and RM for GPT-4o. Can you explain why the proposed method does not help much for strong generator LLMs?
- A simple baseline method is to ask a single agent LLM to evaluate a sentence from multiple aspects rather than using multiple agents. Have you tried this?
Thank you for your thorough review. We're glad you found our method interesting, and that you see it as a simple approach to effectively scale test-time compute without training verifiers. We address each of your comments and questions below. Please feel free to engage further if you have any additional feedback or questions.
I agree that the proposed method is convenient since we can use off-the-shelf LLMs as verifiers. However, it is still unclear whether multi-agent verification is an effective approach because off-the-shelf verifiers (GPT-4o-mini and Gemini-1.5-Flash) are generally more powerful than the 8B reward model used in RM. Therefore, it is possible that the proposed method's good performance comes from the power of large LLMs, not from the use of multiple agents.
Thanks for raising this. We used the strongest reward model we could run on academic compute—the Skywork 8B model, which ranked among the top 10 on RewardBench at the time of writing (outperforming many larger reward models). However, we acknowledge this as a limitation and will add it to the limitations/discussion section of the final paper. Moreover, we show weak-to-strong generalization, where by combining multiple weaker verifier agents we can improve the performance of stronger generator LLMs (i.e., Gemini-1.5-Pro). Additionally, we have shown below that using many separate verifiers to verify different aspects outperforms using a single LLM verifier (see experiment below).
Table 1 shows that the proposed method's gain over baseline methods tends to be small when generator LLMs are powerful. For example, GPT-4o shows the best performance among others in all the tasks, but B-MAV does not outperform Cons and RM for GPT-4o. Can you explain why the proposed method does not help much for strong generator LLMs?
Thanks for your question. Our approach is indeed weakest on GPT-4o. While we don't know exactly why this is the case, we have a few hypotheses. One possibility is that because GPT-4o is quite strong, it saturates performance on our benchmarks, leaving little room for improvement from verification. It might also be that current verifiers (GPT-4o-mini and Gemini-1.5-Flash) may not be sufficiently capable to consistently identify errors in such a strong model (GPT-4o was the SOTA model at the time of writing), suggesting that multi-agent verification will be more effective when we use stronger models as verifiers.
A simple baseline method is to ask a single agent LLM to evaluate a sentence from multiple aspects rather than using multiple agents. Have you tried this?
Thanks for making this suggestion. We have conducted an additional experiment where we ask the LLM verifier to generate a “rubric” containing the approval scores for each aspect in a single response. We find that generating a single rubric for all aspects performs worse than using a separate model call per aspect (see table below). This is likely because separate model calls allow the LLM verifier agents to put all of their “focus” on one aspect, rather than having to think about many different aspects in context. Having multiple LLM verifier agents may have more expressive power, and this could be an interesting direction for future work to explore further. Thanks again for your suggestion to try this. We will include these results and additional discussion in the final version of the paper.
MMLU-Pro:
| Model | MAV-rubric (new) | MAV-separate (original) | self-consistency | pass@1 |
|---|---|---|---|---|
| Gemini-1.5-Flash | 62.0 | 65.7 | 63.3 | 59.3 |
| GPT-4o-mini | 64.0 | 65.3 | 63.7 | 62.3 |
GPQA (diamond):
| Model | MAV-rubric (new) | MAV-separate (original) | self-consistency | pass@1 |
|---|---|---|---|---|
| Gemini-1.5-Flash | 39.0 | 41.0 | 40.0 | 42.0 |
| GPT-4o-mini | 42.0 | 49.0 | 48.0 | 38.0 |
Note: Numbers represent the accuracy (%) of the method used for selecting between 16 sampled solutions per question (n = 16).
Experiment Details: We instruct the LLM “rubric” verifier to think about each aspect in a single response and generate a python dictionary mapping aspect names to True/False values (after it has finished thinking about each aspect). We use the exact same 10 aspects as the original paper, but obtain the approvals for all aspects simultaneously in a single response rather than splitting across model calls. As in the original paper, we use Gemini-1.5-Flash and GPT-4o-mini as the base LLMs for our verifiers, so there are actually 2 model calls: Gemini-1.5-Flash generates one rubric and GPT-4o-mini generates another rubric. For simplicity, we do not perform any “aspect engineering” for this experiment. We compare to the “MAV-All” results in Table 5 of the Appendix (which uses 20 separate verifier calls: 10 aspects for 2 base LLMs).
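As a sketch of how such a single-call rubric might be consumed (the parsing details and the example reply are our own assumptions, not the exact implementation):

```python
# Sketch of scoring a single-call "rubric" verifier reply that ends with a
# Python-style dict mapping aspect names to True/False. Parsing details and the
# example reply are illustrative assumptions.
import ast
import re


def parse_rubric(reply: str) -> dict:
    """Extract the last {...} block from the reply and parse it as a dict."""
    blocks = re.findall(r"\{[^{}]*\}", reply, flags=re.DOTALL)
    return ast.literal_eval(blocks[-1]) if blocks else {}


def rubric_score(reply: str) -> float:
    """Average the True/False aspect judgments into a single approval score."""
    rubric = parse_rubric(reply)
    return sum(bool(v) for v in rubric.values()) / max(len(rubric), 1)


example = ("...reasoning about each aspect... "
           "{'logical_soundness': True, 'redundant_steps': False}")
print(rubric_score(example))  # 0.5
```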
Thanks again for your positive review. Please let us know if you have any other questions or feedback.
Thank you for your response. I will raise my score. The additional experimental results are interesting and provide further support for the effectiveness of the proposed method. While some concerns remain regarding the fairness of the experimental setup, I believe this is nevertheless a valuable paper worthy of acceptance.
The paper presents a novel approach to reasoning with language models by introducing BoN-MAV (Best-of-N Multi-Agent Verification), which integrates best-of-n sampling with multiple verifiers. A verifier is an LLM prompted to verify some aspect of the generator's response. This work describes a new way of scaling test-time compute by scaling the number of verifiers (as compared to scaling the number of candidate outputs). The paper evaluates several different generator LLMs and several sets of verifier LLMs, and proposes a method for selecting which verifiers to use for a particular domain. The paper evaluates math, coding, general knowledge & reasoning, and graduate-level reasoning.
Reasons to Accept
The authors present strong experimental evidence that BoN-MAV works: they evaluate several generator models, as well as several sets of different verifiers, and show that it outperforms self-consistency. The authors also show that their technique scales as the number of verifiers increases.
The paper is also well-written and presented.
Reasons to Reject
The paper as a whole is quite strong, but it is missing a few more interesting baselines that weaken the authors' argument slightly.
- While the authors test many different LLMs for the generations and many different LLMs for the verifiers, the authors only test one 8B RM model. There appears to be a stronger reward model released around the same time as the model they chose (they used Skywork/Skywork-Reward-Llama-3.1-8B-v0.2 while Skywork/Skywork-Reward-Gemma-2-27B-v0.2 was available). Given that many of their verifiers were state-of-the-art closed-source LLMs, choosing not to use as powerful an open-weight reward model feels like it's not a fair comparison of methods.
- Another strong area for comparison would have been reward models designed for reasoning models, like process reward models (PRMs) or fine-grained reward models. PRMs are cited and mentioned in the paper, but none were chosen for comparison. Many PRMs are made specifically for verifying different components of a model's outputs and, unlike general-purpose reward models that grade the overarching output of a model, are more likely to grade the correctness of a model's reasoning.
- Lastly, no effort to evaluate heterogeneous reward models (the authors provide reasons why it would be difficult) was made. Even something as simple as finding an average threshold to use to convert all reward values to 0 or 1 and average would have been helpful.
Again, these weaknesses do not take away much from the paper, but it would be substantially better if the authors had shown that BoN-MAV was much stronger than these baselines.
Thank you for the positive review of our work and for your constructive questions and suggestions, which we address below. Please feel free to engage further during this feedback period if you have any additional feedback or questions.
Given that many of their verifiers were state-of-the-art closed-source LLMs, choosing not to use as powerful an open-weight reward model feels like it's not a fair comparison of methods.
Thanks for raising this point. We used the strongest reward model we could run on academic compute—the Skywork 8B model, which ranked among the top 10 on RewardBench at the time of writing and actually outperformed many larger reward models. However, we acknowledge this as a limitation and will add it to the limitations/discussion section of the final paper.
Another strong area for comparison would have been reward models designed for reasoning models, like process reward models (PRMs) or fine-grained reward models. PRMs are cited and mentioned in the paper, but none were chosen for comparison.
We agree this would be an interesting comparison. We focused on outcome-level reward models and self-consistency since they are most similar to our approach in terms of selecting between complete candidate outputs (as opposed to intermediate steps). Due to time limitations and the additional complexity that PRMs add, we weren't able to include PRMs in this rebuttal period, but we will add a comparison to PRMs in the final version of the paper.
Lastly, no effort to evaluate heterogeneous reward models (the authors provide reasons why it would be difficult) was made. Even something as simple as finding an average threshold to use to convert all reward values to 0 or 1 and average would have been helpful.
We agree this would be a useful comparison to add. It would require evaluating each generated solution in our dataset with multiple reward models, which would take too long on our academic compute for this rebuttal period. However, we'll definitely include such a comparison (like your suggested average threshold method) in the final version of the paper. Thank you for the suggestion.
Thanks again for your thoughtful review and valuable suggestions. We believe these additions will significantly strengthen the paper, and we appreciate your recognition of the core contributions of our work. Please feel free to reach out if you have any additional questions or need further clarification.
Hi! Thanks for your response. I will keep my rating at 7, I still believe this is an accurate score for your work.
They propose applying additional test-time compute to verification by implementing multiple "aspect verifiers", which verify different aspects of an output (e.g., logical soundness, general correctness, etc.). These aspect verifiers are produced via different prompts to an LLM judge model. They then aggregate the different judge outputs by averaging their scores. They then run best-of-n against this multi-aspect verifier (MAV). To obtain the set of aspects for a given task, they perform "aspect engineering", which involves selecting different random subsets of aspect verifiers from a large pool and then choosing the set which performs the best for a given task. They show that their approach enables effective test-time scaling not only in terms of the number of sampled outputs for BoN, but also by scaling the number of aspect verifiers. They conduct experiments on MATH, MMLU-Pro, GPQA, and HumanEval and find that 1) MAV outperforms baselines like a standard verifier and majority voting with BoN scaling on all tasks except for GPQA; 2) in most cases, increasing the number of aspects improves performance; 3) using multiple aspect verifiers with small models can effectively verify larger model outputs; and 4) finally, they conduct some analysis of how the specific set of aspects used impacts performance, finding that the diversity of the different aspects is important for maintaining high performance.
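A rough sketch of that "aspect engineering" step, as summarized here, might look like the following; the evaluation callback and hyperparameters are illustrative assumptions, not the authors' code.

```python
# Sketch of "aspect engineering" as summarized above: try random subsets of
# aspect verifiers from a pool and keep the subset with the best validation
# accuracy. The `evaluate` callback and hyperparameters are assumptions.
import random


def select_aspect_subset(pool, evaluate, subset_size=5, num_trials=50, seed=0):
    """pool: list of aspect-verifier identifiers.
    evaluate: callable mapping a list of verifiers to validation accuracy."""
    rng = random.Random(seed)
    best_subset, best_score = None, float("-inf")
    for _ in range(num_trials):
        subset = rng.sample(pool, subset_size)
        score = evaluate(subset)
        if score > best_score:
            best_subset, best_score = subset, score
    return best_subset, best_score
```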
Reasons to Accept
- They conduct fairly thorough experiments with multiple LLMs and multiple evaluations to understand the benefits of applying multiple aspect verifiers to improve verification capabilities
- Their method is very simple and would be straightforward to reproduce
- The paper is clearly written, and everything is easy to follow and understand
- The problem of improving verifiers with additional test-time compute is a very interesting research direction and they show strong results using a simple approach
Reasons to Reject
- The method requires hand-engineering multiple aspects for a given task, limiting its generality and the ability to scale the number of verifiers.
- Moreover, having the different aspects be done in separate LLM calls seems unnecessary. Why can't you just provide the LLM the full rubric and have it output a score for each item in one generation?
- It is unclear whether you perform the "aspect engineering" on the test set or on some validation set. If it is on the test-set, I would be concerned about the validity of their findings. If it is on the validation set, then they should make this clear in the paper.
- Are the MMLU and GPQA evaluations done with a chain of thought? Without a chain of thought, it would be pretty hard to verify, I feel.
Questions for Authors
See the reasons to reject above.
Thank you for your review and feedback. We’re glad to hear that you found our method interesting and results strong. Please see below for responses to your comments and questions.
The method requires hand-engineering multiple aspects for a given task, limiting its generality and the ability to scale the number of verifiers.
Thanks for raising this important point. We agree that hand-engineering every verifier would limit the generality and scalability of our approach. To address these concerns, we have conducted an additional experiment using fully AI-generated aspects. We find that using the AI-generated verifiers performs even better than our hand-engineered ones (see full experiment details below). We ran these experiments for Gemini-1.5-Flash as the generator LLM on MMLU-Pro and GPQA diamond (our two most challenging benchmarks):
MMLU-Pro:
| Model | MAV-automated-aspects | MAV-manual-aspects (original) | self-consistency | pass@1 |
|---|---|---|---|---|
| Gemini-1.5-Flash | 68.0 | 65.7 | 63.3 | 59.3 |
GPQA (diamond):
| Model | MAV-automated-aspects | MAV-manual-aspects (original) | self-consistency | pass@1 |
|---|---|---|---|---|
| Gemini-1.5-Flash | 47.0 | 41.0 | 40.0 | 42.0 |
Note: Numbers represent the accuracy (%) of the method used for selecting between 16 sampled solutions per question (n = 16).
These results suggest that aspects do not need to be hand-engineered and that we could fully automate aspect verifier generation and selection, a promising result for future work. We will include these new results in the final version of the paper, as well as additional discussion on automating the generation and selection of aspect verifiers.
Experiment Details: We provided GPT-4o with just 3 example aspect prompts from our original manual set and asked it to generate 10 new prompts with different aspect-strategy combinations. Using these 10 purely AI-generated verifiers (without any manual engineering or any of our original manual aspects), we evaluated Gemini-1.5-Flash on MMLU-Pro and GPQA (diamond). For simplicity, we do not perform any “aspect engineering” for this experiment. That is, we use all 10 AI-generated aspects (”MAV-automated-aspects” in the table above) and compare to the “MAV-All” results in Table 5 of the Appendix which uses all 10 original manual aspects (”MAV-manual-aspects (original)” in the table above). Below is one example of the AI-generated aspects created by GPT-4o:
"redundant_steps": (
"INSTRUCTIONS: \n"
f"Look for any unnecessary steps that don’t contribute to the final result. Think out loud. "
f"If the solution includes redundant or misleading steps, stop and reply '{VERA_ANSWER_SYMBOL}False'. "
f"If every step is necessary and useful, reply '{VERA_ANSWER_SYMBOL}True'."
),
Moreover, having the different aspects be done in separate LLM calls seems unnecessary. Why can't you just provide the LLM the full rubric and have it output a score for each item in one generation?
Thanks for asking this. To address your question, we have conducted an additional experiment where we ask the LLM verifier to generate a “rubric” containing the approval scores for each aspect in a single response. We find that generating a single rubric for all aspects performs worse than using a separate model call per aspect (see table below). This is likely because separate model calls allow the LLM verifier to put all of its “focus” on one aspect, rather than having to think about many different aspects in context. Having multiple LLM verifier calls may have more expressive power, and this could be an interesting direction for future work to explore further. Thanks again for your suggestion to try this. We will include these results and additional discussion in the final version of the paper.
MMLU-Pro:
| Model | MAV-rubric (new) | MAV-separate (original) | self-consistency | pass@1 |
|---|---|---|---|---|
| Gemini-1.5-Flash | 62.0 | 65.7 | 63.3 | 59.3 |
| GPT-4o-mini | 64.0 | 65.3 | 63.7 | 62.3 |
GPQA (diamond):
| Model | MAV-rubric (new) | MAV-separate (original) | self-consistency | pass@1 |
|---|---|---|---|---|
| Gemini-1.5-Flash | 39.0 | 41.0 | 40.0 | 42.0 |
| GPT-4o-mini | 42.0 | 49.0 | 48.0 | 38.0 |
Experiment Details: We instruct the LLM “rubric” verifier to think about each aspect in a single response and generate a python dictionary mapping aspect names to True/False values (after it has finished thinking about each aspect). We use the exact same 10 aspects as the original paper, but obtain the approvals for all aspects simultaneously in a single response rather than splitting across model calls. As in the original paper, we use Gemini-1.5-Flash and GPT-4o-mini as the base LLMs for our verifiers, so there are actually 2 model calls: Gemini-1.5-Flash generates one rubric and GPT-4o-mini generates another rubric. For simplicity, we do not perform any “aspect engineering” for this experiment. We compare to the “MAV-All” results in Table 5 of the Appendix (which uses 20 separate verifier calls: 10 aspects for 2 base LLMs).
It is unclear whether you perform the "aspect engineering" on the test set or on some validation set. If it is on the test-set, I would be concerned about the validity of their findings. If it is on the validation set, then they should make this clear in the paper.
Yes, we perform the aspect engineering on a validation set and then evaluate the selected verifiers on the test set. We briefly mention this in Section 3.4 (“we select the subset which maximizes the average performance across all generator LLMs evaluated on a validation set.”) and also in Appendices B.1 and B.2, but we see how this may be unclear in the main body of the paper. We will update the paper to make this more clear. Thanks for the suggestion!
Is the MMLU and GPQA done with a chain of thought? Without a chain of thought, it would be pretty hard to verify I feel.
Yes, all solutions generated by the generator LLMs are done with Chain-of-Thought. Figure 2 in the main body of the paper and Figures 7, 8, and 9 in the Appendix show illustrations of MAV, including the full Chain-of-Thought of the generated solutions. Notice that the verifiers also do chain-of-thought (depending on the aspect-strategy used). Please let us know if we can provide any more information about this.
Thanks for your thorough review. We hope that the additional experiments and clarifications address your concerns with the paper. Please don’t hesitate to ask if you have any additional questions.
Thank you for your detailed response. I really appreciate the additional experiments to test some of the concerns I mentioned, and also the clarifications are very helpful. With these new results, I am willing to increase my score to a 6.
This paper studies, in a generate-verify style inference method, the relationship between the number and type of automatic verifiers and overall performance. Among other results, the authors find that this method allows for smaller verifiers. Overall, I think this is a nice contribution to COLM that can be broadly summarized as pushing in a direction of research on scaling behavior influenced by verifier design/quality. The authors include additional experimental results in the rebuttal following reviewer suggestions that should be included in any camera-ready version of the work.