ReFeR: Improving Evaluation and Reasoning through Hierarchy of Models
A framework to use Multiple AI agents for better evaluation of generative text or images and improved reasoning
Abstract
Reviews and Discussion
The authors propose an LLM-based automatic evaluation framework that draws inspiration from the academic peer review process: independent LLMs score and comment on a generation, and then these scores, comments, and the original generation are fed to an "area chair" LLM that produces a final score. They evaluate this method by assessing correlation with human annotations on existing labeled datasets, and compare against existing LLM-based automatic evaluation methods.
Strengths
- Automatic evaluation is an impactful problem, and I think the design space remains wide open. The grounding of evaluation in human annotations is great.
- The inclusion of an analysis section is great; I have some questions about Sections 5.1 and 5.3 (see Weaknesses / Questions), but Section 5.2 was very clear.
- I appreciate that the appendices are filled out, and that the authors included their code at submission time.
Weaknesses
My main concern with this method is that in evaluations, the methods are not test-time compute matched. One explanation for ReFeR's stronger performance in Table 2 is that the system simply spends more inference compute than the other methods; this could mean that the peer review structure is not actually integral to performance. Since the paper's contribution is in the peer review structure, I think this is an important question to clarify.
- I appreciate that the authors compute a Cost column in Table 2, but this is only the AC cost. The authors justify this column as the monetary cost of querying an OpenAI API. However, in practice, this is not the only cost we have in evaluation: test-time compute (number of FLOPs processed) would be a more valid metric. (After all, in reality the authors are paying Together AI for inference of the open weight models as well.) I suspect that if we looked at test-time compute, ReFeR becomes significantly more heavyweight than other methods: after all, we need to inference 4 large language models!
- To make the performance - inference compute tradeoff more apparent, could the authors make a scatterplot with the corrected inference compute metric as the x-axis? An even stronger visualization would modulate n (or some other test-time compute parameter) for each method and plot how evaluation performance changes with increased test-time compute; we can then compare ReFeR to G-Eval, for example, by checking whether the ReFeR curve consistently sits above the G-Eval one.
  - I'd suggest drawing this up in a setting where the AC model is also an open-weight model; this way, we know all parameter counts.
I also would like to ask the authors to provide results for a few additional baselines: (1) just the AC, and (2) a simple average of the peer agent scores. Could the authors add GPT-4o-mini / GPT-4o (the AC model) and a simple average of scores as baseline rows in the "Peer Agents" section of Tables 2 and 3? I appreciate that Section 5.3 offers some insight into these questions, but I would like to directly see the human correlation scores of these baselines.
Questions
I would love to see the plots asked for in the Weaknesses box, and I will increase my score if appropriate. Additionally:
- I had some questions about how the G-Eval and Analyze-Rate baselines were configured; it would be good to move this information into the main text.
  - Is the backbone model matched to the AC model used in ReFeR? (GPT-4o-mini for NLG/Reasoning, GPT-4o for multimodal)?
  - Was n=20 for these?
- Similarly, was the single-agent CoT model matched to the AC model? Which agents were used in the multi-agent baseline?
- Figure 2 was a bit challenging to parse: which model is the AC model for the "Llama + Gemma + Nemo" point, and which models represent the "3 Peers + Mixtral" point?
We sincerely thank the reviewer for their time, the detailed review, and the interesting suggestion to measure test-time compute.
§ Weakness 1
I appreciate that the authors compute a Cost column in Table 2, but this is only the AC cost. The authors justify this column as the monetary cost of querying an OpenAI API. However, in practice, this is not the only cost we have in evaluation: test-time compute (number of FLOPs processed) would be a more valid metric.
We thank the reviewer for pointing out the need for an additional cost metric such as inference-time compute (number of FLOPs processed). The reviewer's suggestion to visualize the performance-compute tradeoff through a scatter plot with performance and compute as axes is a very good way to compare the effectiveness of the methods. We follow an approximate formula for the FLOPs computation of all the models, based on the parameters summarized below.
To know all parameter counts, we use the open-source Qwen 2.5-72B as the AC with the same peer models as in our framework. We also run the G-Eval and Analyze-Rate experiments using Qwen 2.5-72B as the backbone. We modulate 'n' for each method to check whether our method lies above the baselines in the performance-vs-compute visualization. Although the 'n' parameter is only available in the OpenAI model family, we simulate it for the other models by making n separate calls. We use Spearman correlation as the performance metric for this visualization. Below is a table summarizing the models' parameters used to calculate their FLOPs.
| Model | d_model | d_ff | Layers | Total Input Tokens | FLOPs |
|---|---|---|---|---|---|
| Llama-3.1-8B | 4096 | 14336 | 32 | 970720 | 2.11 |
| Mistral-Nemo-12B | 5120 | 14336 | 40 | 970720 | 4.11 |
| Gemma-2-9B | 3584 | 28672 | 42 | 970720 | 2.14 |
| AC (Qwen-2.5-72B) (n=1) | 8192 | 29568 | 80 | 1016279 | 22.05 |
| Analyze-Rate (n=20) | 8192 | 29568 | 80 | 856100 | 372.16 |
| G-Eval (n=20) | 8192 | 29568 | 80 | 888500 | 386.07 |
| ReFeR-Turbo | 8192 | 29568 | 80 | 1016279 | 449.31 |
| ReFeR-Lite | 8192 | 29568 | 80 | 1016279 | 30.40 |
ReFeR's FLOPs are computed as the AC's FLOPs (n=20 for Turbo, n=1 for Lite) plus the sum of the FLOPs of Llama, Mistral-Nemo, and Gemma.
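For concreteness, here is a minimal sketch of this accounting in Python. The per-call formula below uses the common approximation of forward-pass FLOPs ≈ 2 × (parameter count) × (tokens processed); this is an assumption for illustration and may differ from the exact formula behind the table, and the parameter counts are nominal values we fill in here.

```python
# Minimal sketch of the FLOPs accounting described above.
# Assumption: forward-pass FLOPs ~= 2 * (parameter count) * (tokens processed);
# the exact formula used for the table may differ, so the numbers are illustrative.

def forward_flops(num_params: float, num_tokens: int) -> float:
    """Approximate forward-pass FLOPs for one call over num_tokens tokens."""
    return 2.0 * num_params * num_tokens

# Nominal parameter counts (illustrative) and total input tokens from the table.
peers = {
    "Llama-3.1-8B":     (8.0e9,  970_720),
    "Mistral-Nemo-12B": (12.2e9, 970_720),
    "Gemma-2-9B":       (9.2e9,  970_720),
}
ac_params, ac_tokens = 72.7e9, 1_016_279   # Qwen-2.5-72B as the AC

peer_total = sum(forward_flops(p, t) for p, t in peers.values())
ac_single  = forward_flops(ac_params, ac_tokens)

# ReFeR-Lite: one AC call (n=1) plus all peer calls.
refer_lite_flops  = ac_single + peer_total
# ReFeR-Turbo: n=20 AC responses plus all peer calls.
refer_turbo_flops = 20 * ac_single + peer_total

print(f"ReFeR-Lite  ~ {refer_lite_flops:.3e} FLOPs")
print(f"ReFeR-Turbo ~ {refer_turbo_flops:.3e} FLOPs")
```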
(After all, in reality the authors are paying Together AI for inference of the open weight models as well.)
We used Together AI for the open-weight models for ease of use and to save time while running different experiments. However, the compute is the same whether we deploy the models on our own GPUs or use a third-party API.
I suspect that if we looked at test-time compute, ReFeR becomes significantly more heavyweight than other methods: after all, we need to inference 4 large language models!
We agree that the overall test-time compute of ReFeR-Turbo is higher than that of the other baselines, but it also delivers superior performance that justifies the compute. ReFeR-Lite, on the other hand, requires significantly less test-time compute than G-Eval or Analyze-Rate.
To make the performance - inference compute tradeoff more apparent, could the authors make a scatterplot with the corrected inference compute metric as the x-axis? An even stronger visualization would modulate n (or some other test-time compute parameter) for each method and plot how evaluation performance changes with increased test-time compute; we can then compare ReFeR to G-Eval, for example, by checking whether the ReFeR curve consistently sits above the G-Eval one.
Below are the Spearman correlations and FLOPs processed for each method. Keeping our time and computation budget in mind, we varied n from 1 to 10 (as n is not available for the other models, we simulated it by making n calls per sample):
| n | Method | ρ | FLOPs |
|---|---|---|---|
| 1 | ReFeR | 0.620 | 30.41 |
| 1 | Analyze Rate | 0.5452 | 18.61 |
| 1 | G-Eval | 0.608 | 19.3 |
| 3 | ReFeR | 0.639 | 74.51 |
| 3 | Analyze Rate | 0.5465 | 55.83 |
| 3 | G-Eval | 0.626 | 57.9 |
| 5 | ReFeR | 0.646 | 118.61 |
| 5 | Analyze Rate | 0.5535 | 93.05 |
| 5 | G-Eval | 0.633 | 96.5 |
| 8 | ReFeR | 0.648 | 184.76 |
| 8 | Analyze Rate | 0.5423 | 148.88 |
| 8 | G-Eval | 0.636 | 154.4 |
| 10 | ReFeR | 0.649 | 230.5 |
| 10 | Analyze Rate | 0.5413 | 186.1 |
| 10 | G-Eval | 0.637 | 193.0 |
We have also added these results, the parameters, and the scatter plot in Appendix L (Figure 8) of the updated draft.
We can clearly see that ReFeR's correlation stays above G-Eval and Analyze-Rate across the entire range of inference compute. ReFeR at n=8 also gives better performance than G-Eval at n=10 while using fewer FLOPs than G-Eval's n=10. Overall, ReFeR-Turbo (n=10 here) uses more FLOPs than G-Eval but achieves higher correlation, whereas ReFeR-Lite (n=1) uses significantly fewer FLOPs and still achieves high correlation, although it did not outperform G-Eval with n=20 in this experiment with the Qwen-2.5-72B model. Considering the overall FLOPs-to-performance ratio, we believe ReFeR-Lite is significantly better than the other methods.
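For reference, a minimal matplotlib sketch (a rough illustration, not the script used for Figure 8) that draws the performance-vs-compute curves directly from the ρ and FLOPs values in the table above:

```python
# Sketch: performance-vs-compute curves from the table above (values copied verbatim).
import matplotlib.pyplot as plt

flops = {  # FLOPs (same scale as the table) for n = 1, 3, 5, 8, 10
    "ReFeR":        [30.41, 74.51, 118.61, 184.76, 230.5],
    "Analyze Rate": [18.61, 55.83, 93.05, 148.88, 186.1],
    "G-Eval":       [19.3,  57.9,  96.5,  154.4,  193.0],
}
rho = {
    "ReFeR":        [0.620, 0.639, 0.646, 0.648, 0.649],
    "Analyze Rate": [0.5452, 0.5465, 0.5535, 0.5423, 0.5413],
    "G-Eval":       [0.608, 0.626, 0.633, 0.636, 0.637],
}

for method in flops:
    plt.plot(flops[method], rho[method], marker="o", label=method)
plt.xlabel("Inference compute (FLOPs, table scale)")
plt.ylabel("Spearman correlation (rho)")
plt.legend()
plt.tight_layout()
plt.savefig("performance_vs_compute.png")
```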
§ Weakness 2
I also would like to ask the authors to provide results for a few additional baselines: (1) just the AC, and (2) a simple average of the peer agent scores. Could the authors add GPT-4o-mini / GPT-4o (the AC model) and a simple average of scores as baseline rows in the "Peer Agents" section of Tables 2 and 3?
| | Method | Coherence (ρ) | Engagingness (ρ) | Groundedness (ρ) | Naturalness (ρ) | Average (ρ) |
|---|---|---|---|---|---|---|
| Peer Agents | Llama-3.1-8B | 0.380 | 0.400 | 0.444 | 0.320 | 0.386 |
| | Mistral Nemo-12B | 0.409 | 0.594 | 0.442 | 0.411 | 0.464 |
| | Gemma-2-9B | 0.536 | 0.615 | 0.582 | 0.519 | 0.563 |
| | GPT-4o-mini | 0.518 | 0.618 | 0.589 | 0.540 | 0.566 |
| | Peer Average | 0.547 | 0.648 | 0.577 | 0.512 | 0.539 |
| Baselines | Analyze-Rate | 0.505 | 0.647 | 0.463 | 0.572 | 0.547 |
| | G-Eval | 0.587 | 0.444 | 0.526 | 0.599 | 0.539 |
| Ours | ReFeR Turbo | 0.585 | 0.673 | 0.628 | 0.625 | 0.628 |
| | ReFeR Lite | 0.535 | 0.624 | 0.583 | 0.575 | 0.579 |
We have added the simple peer average and the AC model (GPT-4o-mini) run with the peer setup as additional baselines in the Peer Agents section of the table above.
We observe that the simple average of the peers' scores performs just below G-Eval and Analyze-Rate, while ReFeR comfortably outperforms it. When we measure the base performance of GPT-4o-mini using the peers' setup, we get an average Spearman correlation of 0.566, which is slightly better than G-Eval and Analyze-Rate on average. This demonstrates the strength of our optimized peer setup, which uses the eval_guidelines_prompt structure.
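For clarity, a minimal sketch of how the "Peer Average" row can be computed, assuming per-sample peer scores and human ratings are available as arrays (the arrays below are illustrative placeholders, not the actual data):

```python
# Sketch: the "Peer Average" baseline -- average the peer scores per sample,
# then correlate with human ratings. The arrays are illustrative placeholders.
import numpy as np
from scipy.stats import spearmanr

peer_scores = np.array([          # shape: (num_samples, num_peers)
    [3, 4, 4],
    [2, 2, 3],
    [5, 4, 5],
    [1, 2, 1],
])
human_scores = np.array([4, 2, 5, 1])   # one human rating per sample

avg_scores = peer_scores.mean(axis=1)   # simple average over peers
rho, _ = spearmanr(avg_scores, human_scores)
print(f"Peer-average Spearman correlation: {rho:.3f}")
```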
§ Question 1
I had some questions about how the G-Eval and Analyze-Rate baselines were configured; it would be good to move this information into the main text. Is the backbone model matched to the AC model used in ReFeR? (GPT-4o-mini for NLG/Reasoning, GPT-4o for multimodal)? Was n=20 for these?
Yes, the backbone of all the baselines is always the same as the AC in ReFeR, and 'n' was always set to 20 for both baselines. We used the same hyperparameters given in their respective papers, and set the temperature to 1 to keep the hyperparameters uniform across all baselines and ReFeR. We did mention in our discussion that we used n=20; we will make it more explicit in the revised version.
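As noted above for the open-weight models, the 'n' parameter can be emulated by issuing n independent calls at temperature 1 and averaging the scores. A minimal sketch of this, where `query_model` is a hypothetical placeholder for whichever inference call is actually used:

```python
# Sketch: emulating the OpenAI-style "n" parameter by issuing n independent calls
# and averaging the returned scores. `query_model` is a hypothetical placeholder,
# not a real API.
from statistics import mean

def query_model(prompt: str, temperature: float = 1.0) -> float:
    """Hypothetical single call returning a numeric score parsed from the output."""
    raise NotImplementedError("replace with the actual inference call")

def evaluate_with_n(prompt: str, n: int = 20, temperature: float = 1.0) -> float:
    scores = [query_model(prompt, temperature) for _ in range(n)]
    return mean(scores)
```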
§ Question 2
Similarly, was the single-agent CoT model matched to the AC model? Which agents were used in the multi-agent baseline?
Yes, the single-agent CoT model was matched to the AC model. All multi-agent baselines use the same model as our AC model, with their respective hyperparameters as mentioned in the codebase. The multi-agent baselines are essentially a single model acting as multiple agents, so they run multiple instances of the AC model.
§ Question 3
Figure 2 was a bit challenging to parse: which model is the AC model for the "Llama + Gemma + Nemo" point, and which models represent the "3 Peers + Mixtral" point?
These details are given in the Figure 2 caption; all experiments in this ablation use GPT-4o-mini as the AC.
- GPT-4o-mini is the AC model for the "Llama + Gemma + Nemo" point.
- "3 Peers" in the "3 Peers + Mixtral" point refers to the previous point in the graph, i.e., Llama + Gemma + Nemo.
We hope that our responses have answered your questions and satisfied your concerns. We are happy to provide any additional details or clarifications that you need. We look forward to your response.
Thank you to the authors for the thorough response --- this is very helpful, and the new experiments give me confidence to increase my score by 2 points. I would highly recommend moving this analysis to the main text; being able to claim that your evaluation is more test-time compute-efficient is a much cleaner story.
A few remaining clarification questions:
- Appendix L is missing some expository details: in Table 13 & Figure 8 in Appendix L, to clarify, the ReFeR method with n=1 is ReFeR-Lite, correct? I'm parsing this based on matching with Table 12. And to confirm, this is evaluating the Spearman correlation on TopicalChat averaged across all metrics? And to confirm, the peer models are the same as in the main text?
- It would be good to include error bars in Figure 8, especially over randomness from resampling.
- I wonder if the authors could again comment on the choice of AC, which seems very important now. In Section 5.2, swapping in Qwen-1.5-72B for gpt-4o-mini resulted in an overall degradation in performance, and the takeaways were simply that ReFeR could improve over the peers alone if the AC is slightly stronger than the peers. However, based on the most recent results in the rebuttal, Qwen-2.5-72B seems to be a much stronger AC than gpt-4o-mini and Qwen-1.5-72B, based on comparing Table 13 to Tables 2 and 5. For example, in Table 13, with , while in Table 2, ReFeR-Lite has with the gpt-4o-mini AC, and in Table 5, ReFeR-Lite has . On the other hand, Analyze Rate doesn't seem to benefit at all from switching to Qwen-2.5-72B. Are there specific benchmark scores that help predict this change? (For example, what is the score of just using Qwen-2.5-72B by itself? Does it outperform gpt-4o-mini and predict this result?)
- One last thought: for memory reasons, it's easier to inference the same model 4 times locally than inference 4 models separately. Have the authors tried using the same model for both the 3 peer reviews and the AC, sampling with temperature to get diversity?
Thank you very much for your comments and additional feedback. We have incorporated your suggestion and rephrased our claims to emphasize that our evaluation is more test-time compute-efficient.
Appendix L is missing some expository details: in Table 13 & Figure 8 in Appendix L, to clarify, the ReFeR method with n=1 is ReFeR-Lite, correct?
Yes, ReFeR with n=1 corresponds to ReFeR-Lite. We have updated the caption of Table 13 to explicitly clarify this detail.
I'm parsing this based on matching with Table 12. And to confirm, this is evaluating the Spearman correlation on TopicChat average across all metrics?
Yes, this is the Spearman correlation on the TopicalChat dataset, averaged over all metrics, with the total FLOPs for the full evaluation.
And to confirm, the peer models are the same as in the main text?
Yes, the peer models are the same as those described in the main text. We also used the same responses that were sent to the area chair with GPT-4o-mini for this new experiment with Qwen. To ensure fairness, we have redone the experiments from Table 1 using the same GPT-4o-mini model, as OpenAI recently updated their model internally, resulting in different correlation values.
It would be good to include error bars in Figure 8, especially over randomness from resampling.
Thank you for the suggestion. We have updated Figure 8 to include error bars, accounting for the randomness introduced by resampling.
I wonder if the authors could again comment on the choice of AC, which seems very important now. In Section 5.2, swapping in Qwen-1.5-72B for gpt-4o-mini resulted in an overall degradation in performance, and the takeaways were simply that ReFeR could improve over the peers alone if the AC is slightly stronger than the peers.
The choice of AC is critical; it depends on the AC's overall ability to perform evaluations and on its capability to interpret the assistants' (peers') evaluations to improve the final result. Although Qwen-1.5-72B is weaker than the best peer (Gemma), it improved the overall evaluation because it was able to leverage the peers' evaluations to some extent, but it could not maximize the benefit of the assistants since it was weaker than the best peer. In contrast, Qwen-2.5-72B achieves a much higher overall correlation than GPT-4o-mini at n=10 and can potentially improve further at n=20. This is likely because Qwen-2.5 performs better as a peer and was therefore able to derive greater benefit from the same peer evaluations than GPT-4o-mini, resulting in better overall correlation even at n=1.
On the other hand, Analyze Rate doesn't seem to benefit at all from switching to Qwen-2.5-72B. Are there specific benchmark scores that help predict this change?
We do not yet have a clear explanation for this observation, despite conducting multiple runs. We hypothesize that this behavior may be due to the prompting structure of Analyze-Rate, which caused poor performance with Qwen, even though the hyperparameters were kept consistent. During our initial experimentation with prompt optimizations, we observed that an optimal prompt for one model might not necessarily work as effectively for another. This could explain the performance drop of Qwen with Analyze-Rate. After all, Analyze-Rate is primarily a prompting structure, not a framework like ReFeR.
what is the score of just using Qwen-2.5-72B by itself? Does it outperform gpt-4o-mini and predict this result?
Yes, Qwen-2.5-72B outperforms GPT-4o-mini in the peer setup. The table below illustrates the comparison:
| Method | Coherence (ρ) | Engagingness (ρ) | Groundedness (ρ) | Naturalness (ρ) | Avg (ρ) |
|---|---|---|---|---|---|
| GPT-4o-mini | 0.518 | 0.618 | 0.589 | 0.540 | 0.566 |
| Qwen 2.5 72B | 0.550 | 0.656 | 0.604 | 0.566 | 0.594 |
One last thought: for memory reasons, it's easier to inference the same model 4 times locally than inference 4 models separately. Have the authors tried using the same model for both the 3 peer reviews and the AC, sampling with temperature to get diversity?
Thank you for this suggestion. We conducted an experiment using the same model (Gemma) for all peers and the AC, with varying temperatures to simulate diversity. The peer models were run with temperatures of 0.25, 0.5, and 0.75, while the AC model was set at temp=1, consistent with our original setup. The results are shown below:
| Method | Coherence (ρ) | Engagingness (ρ) | Groundedness (ρ) | Naturalness (ρ) | Average (ρ) |
|---|---|---|---|---|---|
| Gemma (temp=0.25) | 0.559 | 0.614 | 0.565 | 0.536 | 0.568 |
| Gemma (temp=0.5) | 0.548 | 0.611 | 0.571 | 0.540 | 0.568 |
| Gemma (temp=0.75) | 0.547 | 0.626 | 0.582 | 0.509 | 0.566 |
| ReFeR-Turbo | 0.587 | 0.681 | 0.628 | 0.597 | 0.623 |
| ReFeR-Lite | 0.563 | 0.648 | 0.614 | 0.574 | 0.600 |
These results demonstrate that the ReFeR framework performs effectively even in a homogeneous setup, achieving significant improvements in performance.
We sincerely appreciate your thoughtful feedback and efforts to enhance our paper. We are looking forward to discussing more to clarify any further concerns or questions.
Dear Reviewer U6dR,
We hope this message finds you well. We kindly want to check whether you have had a chance to review our latest rebuttal (we have added newer baselines and shown that the same model can be used with varying temperatures for GPU-constrained environments, etc.), and whether you have any further questions or comments we can address to help with your evaluation. We sincerely thank you for your efforts and suggestions in improving our manuscript.
Sincerely,
Authors
Thank you for the additional experiments; these are helpful. I'm happy to raise my score to a 6 and suggest that the authors discuss the choice of AC more clearly in the main text, perhaps by referencing these additional experiments.
For the temperature experiments, could the authors summarize briefly how these results compare to using Gemma as the AC with other models as the peer models? My overall question is whether having actually different peer models from the AC is more beneficial than just using temperature scaling to simulate having peer models (while using the AC model). I find this difficult to extract from our previous discussions, since the AC model has changed in the most recent set of experiments.
Thank you very much for your review and additional questions.
For the temperature experiments, could the authors summarise briefly how these results compare to using Gemma as the AC with other models as the peer models? My overall question is whether having actually different peer models from the AC is more beneficial than just using temperature scaling to simulate having peer models (while using the AC model).
We appreciate your query and would like to address it with additional clarity by summarizing the results in the following table, which was also presented in response to Reviewer 2CBy:
| Method | Coherence (ρ) | Engagingness (ρ) | Groundedness (ρ) | Naturalness (ρ) | Average (ρ) |
|---|---|---|---|---|---|
| Gemma-2-9B (Peer Setup) | 0.536 | 0.615 | 0.582 | 0.519 | 0.563 |
| Gemma2-9B (sampling 20 times) | 0.556 | 0.617 | 0.577 | 0.530 | 0.570 |
| ReFeR-Turbo (Gemma as AC) | 0.569 | 0.684 | 0.643 | 0.590 | 0.621 |
| ReFeR-Lite (Gemma as AC) | 0.552 | 0.624 | 0.607 | 0.574 | 0.589 |
Analysis: As shown above, both ReFeR variants in the heterogeneous setup (Table 2 of the manuscript) yield higher average correlations (0.621 for Turbo and 0.589 for Lite) compared to setups where Gemma's temperature scaling is employed to simulate peer models. The temperature-scaled setup achieves correlations of 0.623 and 0.600, slightly higher than the homogeneous peer setup using Gemma alone (0.563 and 0.570).
This suggests that heterogeneous setups can leverage complementary perspectives from diverse peer models. However, to generalize the benefits of varied temperatures over heterogeneous peer models, more experiments with additional models are required. Initial experiments (not included in the current draft) with homogeneous setups using GPT-3.5-turbo indicate that purely homogeneous configurations consistently underperform heterogeneous setups.
Future Work: To draw conclusive insights, further experiments are needed to evaluate if temperature scaling consistently outperforms heterogeneous setups across varying configurations. We will add these results as an additional appendix in the camera-ready version to provide a comprehensive answer to your question.
Thank you again for raising this valuable point. We hope our explanation clarifies the nuances of the setup and results.
The paper introduces ReFeR, a hierarchical framework for evaluating generative models using a multi-agent system inspired by peer review. It leverages large language and vision-language models to assess outputs, providing more accurate, explainable evaluations than existing benchmarks. ReFeR is also good at reasoning tasks and offers two variants: ReFeR-Turbo (higher performance) and ReFeR-Lite (cost-effective). The framework demonstrates strong results across text, image, and reasoning evaluations, surpassing prior methods in both accuracy and efficiency.
Strengths
- This work proposes an interesting framework for generative task evaluation inspired by the hierarchical peer review process with peer review agents from different LLMs and area chair evaluation. This provides a more systematic evaluation framework. It also gives more explainability for evaluation.
- ReFeR is flexible - it demonstrates effectiveness on various generative tasks in different modalities including text, image, and multimodal tasks.
- There are analysis and ablation studies of the framework, which provides more insights on how to select models, how many models to use, etc.
- The paper is generally well written and easy to understand.
Weaknesses
- Even with the ReFeR-Lite variant, the framework still demands quite significant computational resources, making it still expensive as an evaluation tool.
- Another concern is that this framework can be considered as a hierarchical version of LLM-as-a-Judge with more carefully designed prompting, which may limit the novelty of this work.
Questions
Maybe I missed something, but how are the costs in Table 2-4 calculated?
We thank the reviewer for their time and thoughtful review.
§ Weakness 1
Even with the ReFeR-Lite variant, the framework still demands quite significant computational resources, making it still expensive as an evaluation tool.
We agree that ReFeR-Lite is costlier than G-Eval. This is because G-Eval only produces scores and does not give any meaningful reasoning or comments explaining why it gave the score, which makes G-Eval cheaper but also less correlated with human judgments. Although ReFeR-Lite is costlier than G-Eval, it is slightly cheaper than Analyze-Rate, which also provides feedback like ReFeR but generates 20 responses (like G-Eval), whereas ReFeR-Lite generates only a single response while achieving better correlation than Analyze-Rate. The purpose of ReFeR is to enhance the evaluation of NLG outputs using LLM agents, and we agree that this is more computationally expensive than non-LLM-based methods.
§ Weakness 2
Another concern is that this framework can be considered as a hierarchical version of LLM-as-a-Judge with more carefully designed prompting, which may limit the novelty of this work.
Although LLM-as-a-Judge and ReFeR may appear similar, their purposes are completely different. LLM-as-a-Judge is primarily designed to evaluate chatbots using another LLM such as GPT-4, similar to Chatbot Arena (which is crowd-sourced); it performs pair-wise evaluation of chatbot responses. ReFeR, in contrast, is designed to evaluate any generative output and score it on various metrics using multiple LLMs or VLMs, while also improving the overall reasoning of the framework. Another important difference is that LLM-as-a-Judge uses the LLM judge to evaluate the assistant by assessing its output, whereas ReFeR utilizes the assistants' evaluations to evaluate the given NLG output.
§ Question 1
Maybe I missed something, but how are the costs in Table 2-4 calculated?
We apologize for any confusion in the cost calculation. The costs in Tables 2-4 are shown relative to the most expensive method in each table. For example, in Table 2, ReFeR-Turbo costs around $1.16 and the most expensive method costs around $1.50, so ReFeR-Turbo's relative cost becomes 1.16/1.50 ≈ 0.77. We compute the relative cost of every row this way so that the cost comparison generalizes beyond the specific dataset being tested.
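In code form, the normalization is simply the following (the $1.50 value for the most expensive method is taken from the example above; the row label is illustrative):

```python
# Sketch: relative cost = absolute cost / cost of the most expensive row.
costs = {"ReFeR-Turbo": 1.16, "Most expensive method": 1.50}  # example values
relative = {k: round(v / max(costs.values()), 2) for k, v in costs.items()}
print(relative)  # {'ReFeR-Turbo': 0.77, 'Most expensive method': 1.0}
```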
We hope this helps clear any confusion and brings more clarity. We are open to discussing any other concerns or comments you have.
Dear Reviewer So4X,
We hope this message finds you well. We kindly want to check if you had a chance to review our rebuttal as we have addressed all of your concerns, and if you have any further questions or comments we can address to help with your evaluation. Thanks again for your review and questions.
Sincerely,
Authors
Thank you for your response. I have read the rebuttal and other reviews. I agree that the proposed framework has difference from previous LLM-as-a-judge but still not that significant. Therefore I will maintain my original score (6).
Thank you for your response. I have read the rebuttal and other reviews. I agree that the proposed framework has difference from previous LLM-as-a-judge but still not that significant. Therefore I will maintain my original score (6).
Thank you very much for your time and for replying to us. We beg to differ here: ReFeR is significantly different from LLM-as-a-Judge in structure, prompting, and the goal of the work. LLM-as-a-Judge is a single-LLM approach designed to evaluate other LLMs by their responses to certain questions; its purpose is not to evaluate the responses alone, as the authors designed it to match Chatbot Arena's human correlations. ReFeR's purpose, in contrast, is not to evaluate an LLM but to evaluate any NLG, generative, or human outputs using the multiple agents defined in our architecture. Nevertheless, to make the performance difference between ReFeR and LLM-as-a-Judge clear, we have included a few additional baselines in Table 2 of the updated draft. Below is the corresponding table:
| Method | Coherence (ρ) | Engagingness (ρ) | Groundedness (ρ) | Naturalness (ρ) | Avg (ρ) |
|---|---|---|---|---|---|
| Llama-3.1-8B-Instruct-Turbo | 0.380 | 0.400 | 0.444 | 0.320 | 0.386 |
| Mistral Nemo | 0.409 | 0.594 | 0.442 | 0.411 | 0.464 |
| Gemma-2-9B-it | 0.536 | 0.615 | 0.582 | 0.519 | 0.563 |
| GPT-4o-mini Peer Setup | 0.518 | 0.618 | 0.589 | 0.540 | 0.566 |
| LLM-as-Judge | 0.510 | 0.593 | 0.556 | 0.534 | 0.548 |
| ChatEval | 0.551 | 0.624 | 0.522 | 0.557 | 0.564 |
| Analyze-Rate | 0.551 | 0.638 | 0.615 | 0.562 | 0.591 |
| G-Eval | 0.581 | 0.636 | 0.593 | 0.558 | 0.592 |
| ReFeR | 0.592 | 0.677 | 0.645 | 0.616 | 0.632 |
| ReFeR Lite | 0.561 | 0.636 | 0.618 | 0.591 | 0.602 |
We can clearly see that LLM-as-a-Judge is far behind all the other baselines because it was never designed for this purpose. We have also included ChatEval, a multi-agent debate-based framework, and we clearly outperform both of these new baselines as well. Since GPT-4o-mini was recently updated internally by OpenAI, we have re-run all experiments for both the baselines and ReFeR to ensure a fair and consistent comparison with the new baselines.
We have conducted additional experiments and thoroughly addressed the feedback and questions raised by all the reviewers. Therefore, we kindly urge you to clarify any outstanding concerns and share any doubts that, when resolved, might increase your overall assessment of the work. Thank you very much.
Dear Reviewer So4X,
We kindly want to check if you had a chance to review our latest rebuttal as we have addressed all of your remaining concerns, and included LLM-as-Judge and ChatEval as baselines to practically show the differences and improvements with our framework. Thank you for your time and review.
Sincerely,
Authors
The paper introduces ReFeR, a hierarchical framework for evaluating AI-generated outputs that lack predefined correct answers, inspired by academic peer review processes. The framework consists of two main components: a Peer Review Body where multiple language models act as peer reviewers providing independent evaluations, and an Area Chair module where a more capable language model synthesizes these reviews into a final assessment.
The authors present two variants of the framework: ReFeR-Turbo, which generates multiple responses for higher accuracy but at greater computational cost, and ReFeR-Lite, which uses single responses for better efficiency. The framework also introduces enhanced evaluation guidelines and an auto-prompt generation system to improve the quality of assessments, making it a comprehensive solution for evaluating open-ended AI outputs.
Strengths
The originality is in adapting academic peer review processes to AI evaluation, offering a different approach to assessing LLM outputs. Quality is evident through comprehensive testing across reasoning tasks and multimodal outputs, with the framework showing consistent performance across different types of evaluation challenges. The paper is nicely structured, presenting both theoretical foundations and practical implementations through its two variants (ReFeR-Turbo and ReFeR-Lite). The framework's ability to handle both text and multimodal outputs while providing reasoned evaluations makes it particularly valuable for many real-world applications.
Weaknesses
The paper's main weakness is its unclear primary contribution, as it's difficult to distinguish how the ReFeR framework meaningfully differs from other multi-LLM evaluation systems beyond borrowing concepts from academic peer review (MoA). Additionally, the paper's organization is scattered across too many topics, with important elements like the instruction tuning dataset (shown in Figure 1) being buried in the appendix rather than properly discussed in the main text.
Questions
Isn't the concept of using multiple LLMs to achieve better results in this context just derived from Mixture of Agents? Seems like an extension of the work; yet no reference. How is the coverage of the instruction tuning dataset, does it scale? Is ReFeR's main purpose to be a framework to create instruction tuning sets? Or an evaluation system?
We thank the reviewer for their time and valuable review.
§ Weakness 1
The paper's main weakness is its unclear primary contribution, as it's difficult to distinguish how the ReFeR framework meaningfully differs from other multi-LLM evaluation systems beyond borrowing concepts from academic peer review (MoA).
We only borrow the idea of academic peer review from the real world, given its effectiveness in assessing the originality and relevance of a research paper. Beyond the idea itself, we show how various LLMs or VLMs can be used as agents in a peer-review setup to evaluate NLG outputs. We show how to effectively use multiple, different LLMs as a consensus, while answering important questions about how to select models, how they should communicate, etc. We also propose a prompting structure that is different from and better than G-Eval's. Finally, we show how this improved framework leads to better reasoning compared to other multi-LLM agent systems. Most existing multi-agent systems for reasoning primarily use a single model as multiple agents, whereas our system effectively utilizes different models.
§ Weakness 2
Additionally, the paper's organization is scattered across too many topics, with important elements like the instruction tuning dataset (shown in Figure 1) being buried in the appendix rather than properly discussed in the main text.
We agree that the instruction tuning dataset contribution could’ve been highlighted in the main paper. However, to make space for other important aspects (like testing multimodal ability of the framework, Reasoning ability of the framework, detailed error analysis and ablation etc) we had to keep the instruction tuning part in the appendix.
§ Question 1
Isn't the concept of using multiple LLMs to achieve better results in this context just derived from Mixture of Agents? Seems like an extension of the work; yet no reference.
Although Mixture of Agents (MoA) also uses multiple LLMs at each layer and passes all of their responses to the LLMs in the next layer, which is similar in spirit to ReFeR, it differs in the following ways:
- We have a hierarchical structure in which the number of peer reviewers is always greater than the number of ACs.
- We use different LLMs or VLMs as peers and AC.
- Our AC is different from the aggregator LLM in MoA: the AC performs its own evaluation while utilizing the peers' evaluations, rather than merely compiling them.
ReFeR and MoA are contemporary works and we did not take any inspiration from MoA, but we have included MoA as another recent work in our related works section.
§ Question 2
How is the coverage of the instruction tuning dataset, does it scale?
The instruction-tuning dataset consists of the final evaluations produced by the Area Chair for all samples in a given dataset. For each data point, the dataset contains the final score given by the Area Chair and the final comment (i.e., a comment chosen at random from the n responses whose score is nearest to the final score). This dataset was then used to fine-tune a smaller model (Mistral-7B) to improve its evaluation ability. This experiment shows that the resulting dataset is good enough to improve the evaluation ability of smaller models such as Mistral-7B.
§ Question 3
Is ReFeR's main purpose to be a framework to create instruction tuning sets? Or an evaluation system?
We apologize if this is unclear in the paper: ReFeR's primary purpose is to act as a multi-agent framework for NLG evaluation with enhanced reasoning. It can be viewed as a multimodal, multi-agent counterpart of G-Eval, which is limited to a single LLM and to text. ReFeR also produces an evaluation comment along with the score, and these, when collected, yield the instruction-tuning dataset that smaller models can use to improve their performance; however, this is not the primary focus of the framework.
We hope our responses resolve your concerns and questions. We are happy to discuss further. Please let us know if you have any additional questions.
Dear Reviewer M8bC,
We hope this message finds you well. We kindly want to check if you had a chance to review our rebuttal as we have addressed all of your concerns, and if you have any further questions or comments we can address to help with your evaluation. Thanks again for your review and questions.
Sincerely,
Authors
I thank the authors for their response. I wish to stay at my score.
We sincerely thank the reviewer for their time, initial feedback, and review. To better understand and address your perspective, we kindly ask if you could share any specific concerns or reservations that remain. We have conducted additional experiments and thoroughly addressed the feedback and questions raised by all the reviewers. Therefore, we kindly urge you to clarify any outstanding concerns and share any doubts that, when resolved, might increase your overall assessment of the work.
Dear Reviewer M8bC,
We hope this message finds you well. We gently want to find if you have any other concerns left or if we can clarify any other points. Thank you for your time and efforts.
Sincerely,
Authors
The paper proposes a hierarchical multi-LLM framework using a peer review structure for text evaluation and reasoning. The framework employs smaller models as peer reviewers and a larger model as an area chair to synthesize evaluations. The work demonstrates empirical improvements over existing approaches while providing rationales for other choices in the system including models and prompts.
优点
- The framework shows improvement in reasoning of multi-agent systems under the framework proposed.
- The evaluations/experiments presented are sound and well ablated.
- The error analysis section is useful and provides insights on where the framework could be improved further.
缺点
- The paper highlights on lines 36-47 that there are not many works on using multiple-LLMs for evaluation, but there do exist some methods using multiple LLMs, for example: https://openreview.net/forum?id=FQepisCUWu and https://arxiv.org/html/2401.16788v1. The authors highlight very early and strongly the novelty of the work of using multiple-LLMs for evaluation but existence of the above works do bring in questions. The paper would benefit much more from better literature review, and inclusion of these other methods in the results section for a fair and faithful comparison.
- The work claims of developing a novel prompting schema by introducing "guidelines"(that can be auto-generated) and lists it as a main contribution. However, I think creating detailed prompts for specialized tasks is a well known prompt engineering practice, thus I do not see it as a substantial contribution.
- Several key claims require stronger support, especially on explainability and robustness, see questions below.
问题
- The paper claims that on line 51-52 the framework promotes explainability and robustness.
- Explainability: It is not clear how this happens? Why is trying to use another LLM to explain another LLM's output sound? What makes the judge LLM trustworthy especially given the hallucination problem in LLMs. It would be useful to see where the peer and AC explanations differ, and quantitative metrics to support the claim of explainability.
- Robustness: It is not clear how the framework promotes robustness? Can perturbations in prompts not affect agentic systems? Is the claim based on the fact that a perturbation for some LLM will not affect another, if so do we have a measure of what part of perturbation space agentic systems are more robust to? Are certain combinations of LLMs more robust than the others?
- Similarly to the previous point, the conclusion says the framework is robust, but I see no experiments to validate this claim. This could be supported to some extent by showing the standard deviations in final ratings of each peer/AC, and the complete framework. However, in order to call the framework robust, there have to be evaluations of this system on common adversarial attacks to LLMs.
- The work says on line 99-100 that the area chair should be a larger or more capable LM, but it is not clear why? Is using smaller peers justified because of efficiency? Given the ablation where a weaker area chair(section 5.2) also works, why use a larger one?
- I assume temperature was set to 1 to ensure variation in the responses when n>1, but is there a reason this performs better than setting temp=0 and using n=1?
- Could there be analysis on the rating distributions of the framework when n>1?
- I am unsure how C_final is produced when n>1, because S_final can be averaged but not sure how aggregations work for C_final.
- If possible, it would be useful to see ablations on n; this could be done by choosing subsets from the results of the n=20 runs if the results are stored.
What makes the judge LLM trustworthy especially given the hallucination problem in LLMs.
Regarding the problem of hallucination, we are not aware of a reliable way to mitigate hallucination in LLMs, and hence we do not claim that our evaluation aligns 100% with human evaluation with 0% hallucination. All our experiments show that the current state of LLMs gives us around 60% alignment with human evaluation (correlation of nearly 0.6). Until the fundamental problem of hallucination is solved, we do not see a way around this, and the same applies to all LLM-based works.
It would be useful to see where the peer and AC explanations differ, and quantitative metrics to support the claim of explainability.
Although we are not claiming explainability, Figure 3 illustrates where the framework fails and where it works properly. For example, Figure 3(b) shows that even when the majority of the peers are incorrect, the AC is able to identify and give a correct answer. We have added examples in Appendix N of the updated draft showing how the AC benefits from the peer explanations, even when they are occasionally wrong.
§ Question 1 (robustness) and Question 2
Robustness: It is not clear how the framework promotes robustness? Can perturbations in prompts not affect agentic systems? Is the claim based on the fact that a perturbation for some LLM will not affect another, if so do we have a measure of what part of perturbation space agentic systems are more robust to? Are certain combinations of LLMs more robust than the others? Similarly to the previous point, the conclusion says the framework is robust, but I see no experiments to validate this claim. This could be supported to some extent by showing the standard deviations in final ratings of each peer/AC, and the complete framework. However, in order to call the framework robust, there have to be evaluations of this system on common adversarial attacks to LLMs.
We do not claim that the framework is robust to all changes. Agentic frameworks are well known to be susceptible to prompt changes, which we also showed and mentioned in the limitations. By robustness we did not mean robustness in general; what we intended to say is that our framework is robust/consistent across runs, as shown by the standard deviations across runs in Table 4. We have reworded this accordingly in the revised version (lines 53-54). We did not explore adversarial attacks on the framework, as we never claimed the framework has measures against them; we leave this direction open for future work.
§ Question 3
The work says on line 99-100 that the area chair should be a larger or more capable LM, but it is not clear why? Is using smaller peers justified because of efficiency? Given the ablation where a weaker area chair(section 5.2) also works, why use a larger one?
Section 5.2 is dedicated to explaining why a larger (relatively stronger) model is required as the AC. Table 5 shows that the average Spearman correlation of Qwen alone is 0.492, but when it is used in the framework as the AC we get a correlation of 0.555, which clearly shows improved performance for Qwen because of the framework. However, one of the peers (Gemma2-9B) has a correlation of 0.568; hence we deduce that if the AC model is relatively stronger than most of the peers we get improved performance, but to get the best results out of the framework we need a larger or stronger model as the AC, one that can better utilize the peers' evaluations and incorporate them into its own evaluation. We have clarified this point in Section 5.2 of the updated submission.
We thank the reviewer for their detailed feedback and review.
§ Weakness 1
The paper highlights on lines 36-47 that there are not many works on using multiple-LLMs for evaluation, but there do exist some methods using multiple LLMs, for example: https://openreview.net/forum?id=FQepisCUWu and https://arxiv.org/html/2401.16788v1. The authors highlight very early and strongly the novelty of the work of using multiple-LLMs for evaluation but existence of the above works do bring in questions. The paper would benefit much more from better literature review, and inclusion of these other methods in the results section for a fair and faithful comparison.
We do not claim that there are no works involving multiple LLMs for evaluation; rather, we said that "there has been limited research on how to align evaluations from multiple VLMs or LLMs with human judgments" (lines 43-44). We also mentioned ChatEval in the related works section (line 485). The works mentioned (ChatEval [https://openreview.net/forum?id=FQepisCUWu] and ScaleEval [https://arxiv.org/html/2401.16788v1]) are primarily focused on pair-wise evaluation of NLG responses, where the goal is to identify which LLM is better at generation on various topics or which of two given responses is better. For example, ScaleEval clearly states its motivation and usage in Section 2 of their paper: they use LLMs to build a more scalable and reliable evaluation system for a diverse set of tasks. ScaleEval therefore falls in the category of Chatbot Arena-style evaluation but using LLMs (https://arxiv.org/abs/2306.05685), and hence it is not a baseline in our direction. ChatEval is a multi-agent framework that uses the same model as multiple agents with varied personas in a debate setting, so it differs from our idea of heterogeneous models as peers and ACs in a hierarchical format, which enables a modular framework that leverages different models' strengths. Since ChatEval's focus is pair-wise evaluation of NLG responses with a different goal, it cannot be used as a relevant baseline for all datasets. Hence, ChatEval and ScaleEval are not relevant baselines for us, as they are designed for a different type of evaluation task.
§ Weakness 2
The work claims of developing a novel prompting schema by introducing "guidelines"(that can be auto-generated) and lists it as a main contribution. However, I think creating detailed prompts for specialized tasks is a well known prompt engineering practice, thus I do not see it as a substantial contribution.
We think the evaluation guidelines are an important step in the framework and differ from regular prompt engineering or prompt optimization. They leverage the idea of procedure cloning to let models understand (in context) the fine-grained details of evaluation, similar to human evaluation of datasets such as TopicalChat, where the human annotators were given guidelines and details on how to score each metric. We conducted detailed prompt optimization experiments with several prompt optimization methods, and none of them produced prompts containing the content of the evaluation guidelines or matching the performance of this prompt structure.
§ Weakness 3
Several key claims require stronger support, especially on explainability and robustness, see questions below.
We did not claim that the framework is robust to all changes, nor that we explored explainability in depth, as these were not the primary goals of the paper. We have removed the part of the statement (lines 53-54, "promoting model self-improvement, explainability, and robustness in complex scenarios") that might imply such claims. By explainability, we mean visibility into the reasoning/evaluation process of the LLM, which is not present in methods like G-Eval; by robustness, we mean consistency of the framework across runs. We explain both points below to give more clarity on what we meant.
§ Question 1
The paper claims that on line 51-52 the framework promotes explainability and robustness.
We agree that we have not explored explainability and robustness in detail in this work. We thank the reviewer for pointing this out, and we have now reworded the statement in the paper to state this clearly.
Explainabiltiy: It is not clear how this happens? Why is trying to use another LLM to explain another LLM's output sound?
Just as in multi-agent peer review or multi-agent debate, where models are provided with the evaluations of other agents to reassess or defend their own evaluation, we believe that providing the peer evaluations to the AC improves the AC's evaluation. In our framework, the AC model is not explaining the peer responses; it considers and utilizes the peer evaluations in its own evaluation.
§ Question 4
I assume temperature was set to 1 to ensure variation in the responses when n>1, but is there a reason this performs better than setting temp=0 and using n=1?
Analyze-Rate (https://arxiv.org/abs/2310.05657) previously showed that using a temperature of 1 gives better correlation than lower temperatures, so we set the temperature to 1 in all setups. A higher temperature makes the model produce more varied responses, and with a larger n these varied responses can be combined into a better output. Analyze-Rate (Appendix D) already presents an ablation showing how increasing the temperature at n=20 increases correlation. We also ran a small experiment with n=1 and temperature=0 to compare against n=1 at temperature 1. Throughout, we used the same peer evaluations to keep the AC's input constant and the same model (GPT-4o-mini, pointing to the latest OpenAI update). The table below reports Spearman correlation coefficients:
| Temp | n | Coherence | Engagingness | Groundedness | Naturalness | Average |
|---|---|---|---|---|---|---|
| 0 | 1 | 0.554 | 0.571 | 0.608 | 0.591 | 0.581 |
| 0 | 20 | 0.564 | 0.565 | 0.612 | 0.597 | 0.584 |
| 1 | 1 | 0.563 | 0.617 | 0.620 | 0.595 | 0.598 |
| 1 | 20 | 0.582 | 0.654 | 0.651 | 0.624 | 0.628 |
We find that using temperature=1 gives better results for both n=1 and n=20. We also observe that at temperature=0, varying n hardly improves the correlation (a gain of only 0.003 in Spearman correlation), while at temperature=1 the total gain is 0.03 in Spearman correlation.
§ Question 5
Could there be analysis on the rating distributions of the framework when n>1?
Assuming the question is about how much the ratings vary when n>1, we can compute the standard deviation across the n samples for each data point and then report the overall (average) standard deviation across the dataset. The table below shows this information:
Overall StdDev:
| n | coherence | engagingness | groundedness | naturalness | Average |
|---|---|---|---|---|---|
| 5 | 0.144 | 0.122 | 0.038 | 0.171 | 0.119 |
| 10 | 0.174 | 0.153 | 0.045 | 0.193 | 0.141 |
| 15 | 0.180 | 0.162 | 0.047 | 0.204 | 0.148 |
| 20 | 0.186 | 0.168 | 0.048 | 0.216 | 0.154 |
We can see that as n increases, the AC's responses have a higher standard deviation/variance. This shows that larger n yields more varied responses from the AC, which, once the scores are averaged, leads to better correlation.
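For clarity, a minimal sketch of the computation behind this table, assuming the n AC scores per data point are stored as an array (the array below is an illustrative placeholder, not the actual data):

```python
# Sketch: per-sample standard deviation across the n AC responses, averaged over
# the dataset. `ac_scores` is an illustrative placeholder of shape (num_samples, n).
import numpy as np

ac_scores = np.array([
    [4, 4, 5, 4, 4],
    [2, 3, 2, 2, 3],
    [5, 5, 5, 4, 5],
])  # scores from n=5 responses for 3 samples

per_sample_std = ac_scores.std(axis=1)   # spread of the n responses per sample
overall_std = per_sample_std.mean()      # the kind of value reported in the table
print(f"Overall std-dev across the dataset: {overall_std:.3f}")
```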
§ Question 6
I am unsure how C_final is produced when n>1, because S_final can be averaged but not sure how aggregations work for C_final.
The final correlation is calculated using the scores, so we average only the scores over all n responses. C_final is only needed for the instruction-tuning dataset; for it, we randomly take a comment whose score is nearest to S_final. We have reworded line 806 of the algorithm in Appendix B to state this explicitly.
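A minimal sketch of this aggregation (S_final and C_final follow the paper's notation; ties are broken at random):

```python
# Sketch: S_final is the mean of the n scores; C_final is a comment whose score is
# nearest to S_final, chosen at random among ties (used only for the
# instruction-tuning dataset).
import random

def aggregate(scores, comments):
    s_final = sum(scores) / len(scores)
    min_gap = min(abs(s - s_final) for s in scores)
    candidates = [c for s, c in zip(scores, comments) if abs(s - s_final) == min_gap]
    c_final = random.choice(candidates)
    return s_final, c_final

s_final, c_final = aggregate([4, 5, 4], ["ok", "great", "fine"])
```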
§ Question 7
If possible, it would be useful to see ablations on n; this could be done by choosing subsets from the results of the n=20 runs if the results are stored.
Yes, such an ablation is possible; below are the results on the TopicalChat dataset, averaged over all four metrics.
TopicalChat (ReFeR)
| n | Average ρ | Average τ |
|---|---|---|
| 1 | 0.576 | 0.503 |
| 5 | 0.624 | 0.519 |
| 10 | 0.631 | 0.520 |
| 15 | 0.633 | 0.520 |
| 20 | 0.635 | 0.520 |
Since G-Eval and Analyze-Rate showed that the hyperparameter 'n' is integral to their systems, we use n=20 and, for a fair comparison, report our n=20 results as ReFeR-Turbo. However, we observe that smaller values of 'n' give almost identical results. Analyze-Rate also noted this in their temperature ablation (Appendix D), stating that both their method and G-Eval are robust once n>5.
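For reference, a minimal sketch of how this ablation reuses the stored n=20 runs: for each smaller n, we sub-sample the stored AC scores, average them, and recompute the correlation (the arrays below are random placeholders, not the actual data):

```python
# Sketch: ablating n by sub-sampling the stored n=20 AC scores per data point.
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
stored_scores = rng.integers(1, 6, size=(100, 20)).astype(float)  # placeholder (num_samples, 20)
human = rng.integers(1, 6, size=100).astype(float)                # placeholder human ratings

for n in (1, 5, 10, 15, 20):
    cols = rng.choice(20, size=n, replace=False)   # pick a subset of the 20 stored runs
    avg = stored_scores[:, cols].mean(axis=1)      # average over the subset
    rho, _ = spearmanr(avg, human)
    print(f"n={n:2d}  Spearman rho={rho:.3f}")
```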
We hope that our responses have answered your questions and satisfied your concerns. We are happy to provide any additional details or clarifications that you need. We look forward to your response.
Thank you very much for replying to us.
Weakness 1. Multi-Agent Framework Comparison (Important): While I appreciate the clarification about ChatEval and ScaleEval, I maintain that ChatEval should be included as a baseline since: (1) it evaluates on TopicalChat, making direct comparison possible; and (2) comparing against a non-hierarchical multi-agent framework would help demonstrate whether the hierarchical structure provides advantages. This is integral to showing the importance of this work's contribution.
ChatEval performs pair-wise or comparison-based evaluation (Table 5 on page 10 of ChatEval shows an example of this paradigm). They report a benchmark on dialogue response generation (their Section 3.2) on the TopicalChat dataset, but they do not explicitly describe how this evaluation is done and how correlations are reported. We assume they evaluated TopicalChat (which has 60 dialogue contexts with responses from 6 different systems) by choosing two responses at a time and comparing them (i.e., "comparison" evaluation, whereas G-Eval, Analyze-Rate, and ReFeR follow direct-scoring evaluation); hence we do not think the two methods can be compared correctly.
Apart from this, ChatEval is not a real multi-agent framework, as it only uses the same model with multiple role-play personas, which they also mention in their Section 3.1. Nevertheless, we would have compared our method with ChatEval's results, but we could not find the codebase (in their official GitHub repo) for the TopicalChat experiments, so a direct comparison is not currently possible. We have emailed the authors of ChatEval to ask for the code, and if we hear back from them we will present the comparison before the end of the rebuttal period.
Citation Issues: The introduction should cite these works(ChatEval, ScaleEval) when stating "limited research" to accurately represent the field. By italicizing the focus on "individual LLMs"(l36-37) for text evaluation research and then expressing surprise at limited multi-LLM research without citing any existing work, the text implies an absence of multi-agent approaches. This framing would be more accurate if it cited existing multi-agent frameworks like ChatEval.
We removed the italics on "individual LLMs" to remove that emphasis in the updated draft. ChatEval is not a true multi-agent framework as it does not explore heterogeneous models, and ScaleEval is aimed at evaluating chatbots (in the direction of ChatbotArena); hence we would like to keep the statement that there has been limited research, as multi-agent frameworks for scoring-based evaluation are very limited.
Weakness 2. Prompting Schema Contribution I understand that the prompt structure outperforms automated optimization methods. However, since the prompting schema is specific to your framework rather than a generalizable technique, it may be better positioned as part of the overall system rather than a standalone contribution and the system is listed as a contribution already.
We understand that the prompting schema may be specific to our evaluation framework and is not a generalizable technique, but we listed it as a contribution because it can be used to improve other frameworks such as G-Eval or Analyze-Rate, giving better results simply by adopting the prompt structure.
Thank you for your suggestions on experiments to validate our statement; we will complete those experiments soon and post the results here.
Other Questions: What model with G-Eval is used?
We used the same model as the Area Chair (GPT-4o-mini) for G-Eval, Analyze-Rate, and in all places where we compare the framework against other methods.
also, as the new ablations suggest using n=5 not only performs similar to n=20, it also has less std dev so please consider adding to the manuscript about this.
Yes, we will mention this in the draft; thank you for pointing it out.
Thank you for the rebuttal.
Weakness 1. Multi-Agent Framework Comparison Important: While I appreciate the clarification about ChatEval and ScaleEval, I maintain that ChatEval should be included as a baseline since:
- It evaluates on TopicalChat, making direct comparison possible.
- Comparing against a non-hierarchical multi-agent framework would help demonstrate whether the hierarchical structure provides advantages. This is integral to show the importance of the contribution of this work.
Citation Issues: The introduction should cite these works(ChatEval, ScaleEval) when stating "limited research" to accurately represent the field. By italicizing the focus on "individual LLMs"(l36-37) for text evaluation research and then expressing surprise at limited multi-LLM research without citing any existing work, the text implies an absence of multi-agent approaches. This framing would be more accurate if it cited existing multi-agent frameworks like ChatEval.
Weakness 2. Prompting Schema Contribution I understand that the prompt structure outperforms automated optimization methods. However, since the prompting schema is specific to your framework rather than a generalizable technique, it may be better positioned as part of the overall system rather than a standalone contribution and the system is listed as a contribution already.
Question 1 and 2: I appreciate the changes in the manuscript regarding the claim of robustness and explainability.
The work says on lines 99-100 that the area chair should be a larger or more capable LM, but it is not clear why. Is using smaller peers justified by efficiency? Given the ablation where a weaker area chair (Section 5.2) also works, why use a larger one?
Question 3: Thank you for the clarifications. Some more questions:
Area Chair (AC) Performance Analysis The framework seems to improve Qwen's performance but I think the results relate to the best peer more than the framework, the claim that "larger/better AC models yield best results" needs more evidence:
- ReFeR-Lite results are very close to best peer (Gemma-9B) performance (only 0.0157 difference in main expts and in sec 5.2, best peer outperforms)
- Is AC performance bottlenecked by best peer or best model in the framework? This needs discussion and further evidence comparing:
a. AC same as best peer vs current setup
b. Refer-Lite vs Refer-Lite framework's best model alone
ReFeR-Turbo Performance Gap The substantial improvement in ReFeR-Turbo warrants investigation given the closeness of Refer-Lite to the best peer: Is improvement due to multiple AC samples or framework structure? Experiment to consider: Refer-Turbo framework's best model alone with repeated sampling vs ReFeR-Turbo
Other Questions: What model with G-Eval is used?
also, as the new ablations suggest using n=5 not only performs similar to n=20, it also has less std dev so please consider adding to the manuscript about this.
We have performed experiments to answer your questions. Throughout the experiments we used the same peer responses to keep the input constant for various AC setups.
| Method | Coherence (ρ) | Engagingness (ρ) | Groundedness (ρ) | Naturalness (ρ) | Average (ρ) |
|---|---|---|---|---|---|
| Gemma-2-9B (Peer Setup) | 0.536 | 0.615 | 0.582 | 0.519 | 0.563 |
| Gemma-2-9B (sampled 20 times) | 0.556 | 0.617 | 0.577 | 0.530 | 0.570 |
| ReFeR-Turbo (Gemma as AC) | 0.569 | 0.684 | 0.643 | 0.590 | 0.621 |
| ReFeR-Lite (Gemma as AC) | 0.552 | 0.624 | 0.607 | 0.574 | 0.589 |
| GPT-4o-mini (Peer Setup) | 0.518 | 0.618 | 0.589 | 0.540 | 0.566 |
| ReFeR | 0.585 | 0.673 | 0.628 | 0.625 | 0.628 |
| ReFeR Lite | 0.552 | 0.640 | 0.596 | 0.599 | 0.597 |
We will also update the draft with these additional results, reporting both ρ and τ.
Area Chair (AC) Performance Analysis The framework seems to improve Qwen's performance but I think the results relate to the best peer more than the framework, the claim that "larger/better AC models yield best results" needs more evidence:
From the above table, we can confirm that larger or better AC models lead to better performance. For example, ReFeR with GPT-4o-mini as AC achieves better correlations than ReFeR with Gemma as AC. Since GPT-4o-mini has a better base performance than Gemma, it demonstrates the point that better AC models contribute to improved results.
ReFeR-Lite results are very close to best peer (Gemma-9B) performance (only 0.0157 difference in main experiments and in sec 5.2, best peer outperforms)
The observed improvement in correlation on evaluation tasks is consistent with previous works. For instance, G-Eval achieved ~0.04 improvement in correlation on the SummEval dataset and ~0.02 improvement on the TopicalChat dataset, even though the baselines were just pre-trained models (e.g., BART/T5 for BARTScore and UniEval) while G-Eval used a highly capable LLM (GPT-4). Similarly, although ReFeR's improvements appear small, they are comparable to or better than the improvements seen in prior work such as Analyze-Rate, which also demonstrated modest gains over its baseline (G-Eval).
Is AC performance bottlenecked by best peer or best model in the framework? this needs discussion and further evidence comparing: a. AC same as best peer vs current setup
We conducted an experiment using Gemma as the AC for the framework and compared its results with the current setup. While Gemma as AC performed well, it still lagged behind ReFeR’s original setup with GPT-4o-mini. This is because GPT-4o-mini is a more capable LLM than Gemma, enabling it to leverage peer evaluations more effectively and achieve higher correlations.
b. Refer-Lite vs Refer-Lite framework's best model alone
From the table above, it is evident that ReFeR-Lite outperforms the framework's best model alone. Although the improvement is on the order of ~0.02, it is significant given that similar improvements were observed in prior works such as G-Eval and Analyze-Rate.
ReFeR-Turbo Performance Gap The substantial improvement in ReFeR-Turbo warrants investigation given the closeness of Refer-Lite to the best peer: Is improvement due to multiple AC samples or framework structure? Experiment to consider: Refer-Turbo framework's best model alone with repeated sampling vs ReFeR-Turbo
To determine whether the improvement stems from multiple AC samples or the framework's structure, we compared the results of Gemma sampled 20 times with those of ReFeR-Turbo (n=20) using Gemma or GPT-4o-mini as AC. The results show that while repeated sampling improves the base model performance slightly (Gemma's average Spearman correlation improves from 0.563 to 0.570), it does not account for the substantial improvement seen in ReFeR-Turbo (0.621 with Gemma as AC, 0.628 with GPT-4o-mini as AC). ReFeR-Lite (n=1) also outperforms Gemma sampled 20 times. This demonstrates that the framework's structure, rather than sampling alone, drives the performance improvements; by leveraging both, ReFeR-Turbo performs much better.
We hope our responses resolve all your concerns. We are happy to provide any additional clarification if there are any further questions or comments.
Thank you for the response and the experiments.
This clears a lot of things and shows the framework's advantages.
I am increasing my score by 2 points based on:
- The thorough ablation studies that validate the framework's benefits
- The analysis framework and experiments, which could potentially be impactful for future research in LLM evaluation systems
- Strong technical execution and clear presentation
However, some concerns remain:
Important: ReFeR system: ReFeR improvements range from 0.01-0.04 over baselines; ReFeR-Lite shows improvements <0.035; using Gemma as AC instead of GPT-4o-mini reduces gains to <0.01. The relatively higher values for std deviation also do not help the case when the mean performance difference is small. Given the increased computational overhead and cost of multiple models, these incremental gains need stronger justification for practical applications. This point still stands even if prior work had similar improvements.
Other issues: The architectural contribution (while valuable) requires stronger positioning. While ReFeR thoughtfully implements heterogeneous models in a hierarchical structure, this represents an engineering evolution of existing approaches (multi-agent peer review and multi-agent debate) rather than a fundamental change. Prior multi-agent frameworks' use of identical models was an implementation choice, not a technical constraint. Thus, while ReFeR's specific implementation produces measurable improvements, I am not sure of the novelty contribution.
The prompting contribution might be better presented as an integral part of the system rather than a standalone contribution as I have stated before because it is not a general schema. Therefore, I do not count it as a significant contribution.
Conclusion: The paper's strongest contributions lie in its empirical insights and analysis framework. The comprehensive experiments and error analysis provide valuable understanding of LLM evaluator behavior in both single and multi-agent settings. This analytical data may prove more significant than the absolute performance improvements, especially considering that future improvements in single-model sampling or prompting approaches might achieve similar gains(as they are small) with lower computational overhead.
ReFeR system: ReFeR improvements range from 0.01-0.04 over baselines; ReFeR-Lite shows improvements <0.035; Using Gemma as AC instead of GPT-4o-mini reduces gains to <0.01.
We acknowledge that the gains on the TopicalChat dataset are modest. However, these gains are reported as differences in correlation, which do not fully capture how much better our framework performs relative to other methods. This limitation is inherent to ranked correlations, which we use because all prior works adopt them as the standard metric and we are not aware of suitable alternatives.
When using Gemma as the area chair (AC), we achieved an average correlation of 0.621 with ReFeR-Turbo. Sampling Gemma 20 times yields a correlation of 0.570, so roughly 0.05 (about 5 correlation points) of the improvement is attributable to the framework itself. Similarly, ReFeR-Lite (n=1) with Gemma as AC achieved a correlation of 0.589, nearly 2 points above 20-time sampling.
The overarching goal of ReFeR is not solely to improve NLG evaluation but to introduce and validate the concept of a hierarchical agentic system. We believe this approach has potential applications across diverse domains, even beyond LLMs. To explore the generalization of this framework, we expanded our work to multimodal evaluation and reasoning tasks.
The inspiration for ReFeR stems from the effectiveness of peer review systems in real-life scenarios, which excel in judging research papers, fostering improvement, and advancing knowledge. This serves as the foundation of our primary contribution: proposing and validating the architecture of this system on evaluation and reasoning. Secondary contributions include empirical demonstrations of its effectiveness across various use cases and thorough ablation studies to analyze its components.
The relatively higher values for std deviation also do not help the case when the mean performance difference is small.
We attribute the minimal performance differences to the already strong reasoning capabilities of the baseline LLMs. A more meaningful evaluation of the framework's impact on reasoning can be observed in the difference between zero-shot CoT and ReFeR-Lite. For instance, with GPT-4o-mini as the LLM, the difference is approximately 6 points.
To further demonstrate the framework's effectiveness, we conducted an additional experiment using the Qwen-2.5-7B model. The baseline zero-shot CoT performance for Qwen was 73.5, and ReFeR-Lite with the same Qwen model as the area chair (AC) improved the performance to 82.25, resulting in a significant 8.75-point increase. This improvement is particularly notable given the smaller size of the Qwen-2.5-7B model.
We anticipate that experiments with smaller LMs (SLMs) could reveal even more substantial performance differences, provided the AC SLM can adequately interpret the assistants' evaluations. Below, we present the results of an experiment where Qwen-2.5-7B was used as the AC with the same set of assistants and their responses as in the original Table 4.
| Method | AQuA | BBH_DU | CSQA | GSM8k | Average |
|---|---|---|---|---|---|
| Qwen-2.5-7B | 71.00 | 70.00 | 79.00 | 74.00 | 73.50 |
| ReFeR (Qwen AC) | 82.00 | 83.00 | 80.00 | 91.00 | 84.00 |
| ReFeR Lite (Qwen AC) | 78.00 | 80.00 | 80.00 | 91.00 | 82.25 |
Given the increased computational overhead and cost of multiple models, these incremental gains need stronger justification for practical applications.
Tables 12 and 13 compare the overall test-time computation of ReFeR with other baselines, demonstrating that ReFeR-Lite requires 30.40 Peta FLOPs, while G-Eval demands 386.07 Peta FLOPs—a nearly 12.7x reduction in computation. Despite this significant reduction, ReFeR-Lite maintains similar or better performance compared to G-Eval, making it an excellent choice for scenarios with computational constraints.
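To make this comparison concrete, below is a rough sketch of how such per-example test-time compute estimates can be derived, using the common ~2 x parameters x tokens approximation for transformer inference. The model sizes and token counts are illustrative placeholders, not the exact values behind Tables 12 and 13.

```python
# Rough estimate of per-example test-time compute using the standard
# ~2 * parameters * tokens approximation for transformer inference.
# Model sizes and token counts are illustrative placeholders only.

def inference_flops(params_billion, tokens):
    return 2 * params_billion * 1e9 * tokens

def refer_lite_flops(peer_sizes_b, ac_size_b, tokens_per_call=1500):
    peers = sum(inference_flops(p, tokens_per_call) for p in peer_sizes_b)
    ac = inference_flops(ac_size_b, tokens_per_call)      # n = 1 AC call in ReFeR-Lite
    return peers + ac

def sampling_baseline_flops(model_size_b, n=20, tokens_per_call=1500):
    return n * inference_flops(model_size_b, tokens_per_call)  # e.g. an n=20 baseline

print(refer_lite_flops([7, 9, 12], ac_size_b=8) / 1e15, "PFLOPs (ReFeR-Lite, illustrative)")
print(sampling_baseline_flops(8) / 1e15, "PFLOPs (n=20 baseline, illustrative)")
```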
Additionally, we conducted an experiment using a homogeneous framework where the same model acted as both peers and the area chair, with varying temperature settings to simulate different agents. The results of this experiment (referenced in reviewer U6dR's latest comment) are provided below and detailed in Table 16 of Appendix P in the updated draft. These findings indicate that the framework can sustain performance even with a single model, which is particularly valuable in GPU-constrained environments for deploying models to evaluate responses.
Other issues: The architectural contribution (while valuable) requires stronger positioning. While ReFeR thoughtfully implements heterogeneous models in a hierarchical structure, this represents an engineering evolution of existing approaches (multi-agent peer review and multi-agent debate) rather than a fundamental change. Prior multi-agent frameworks' use of identical models was an implementation choice, not a technical constraint. Thus, while ReFeR's specific implementation produces measurable improvements, I am not sure of the novelty contribution.
We would greatly appreciate any suggestions for ways to further strengthen the positioning of our framework, as we believe we have clearly outlined the purpose and significance of ReFeR.
Prior multi-agent frameworks' use of identical models was an implementation choice, not a technical constraint. Thus, while ReFeR's specific implementation produces measurable improvements, I am not sure of the novelty contribution.
While it may seem intuitive that the use of heterogeneous models would lead to improved results, there has been limited exploration of this approach in the field. Our work not only demonstrates the benefits of using heterogeneous models but also provides guidance on selecting models, establishing communication strategies, and configuring hierarchical structures. These contributions aim to pave the way for broader applications of this framework in the future.
The prompting contribution might be better presented as an integral part of the system rather than a standalone contribution as I have stated before because it is not a general schema. Therefore, I do not count it as a significant contribution.
We agree that the prompting structure is an integral part of the system rather than a standalone contribution. However, we would like to emphasize that other works, such as G-Eval and Analyze-Rate, are fundamentally based on prompting structures. In our comparison, we find that our specific prompting structure, when tested with the GPT-4o-mini Peer Setup alone, still performs competitively with other baselines. This highlights the importance of the prompting schema we introduced, even as part of the overall system.
We also included ChatEval and LLM-as-a-Judge as baselines in Table 2, all using GPT-4o-mini as the backbone. Since GPT-4o-mini was recently updated internally by OpenAI, we have re-run all experiments on both our baselines and ReFeR to ensure a fair and consistent comparison with the new baselines. For ChatEval, we replicated their approach using their official codebase to evaluate the TopicalChat dataset, employing the most optimal setting mentioned in their paper: 3 roles and 2 discussion turns. LLM-as-a-Judge achieved the lowest correlation among all the baselines, which is expected given that it is not designed for this purpose. ChatEval, although it employs a debating style of communication, could not outperform the other baselines.
We deeply appreciate your valuable feedback, time, and comments, and we thank you sincerely for them. We look forward to continuing the discussion and addressing any additional concerns you may have.
Dear Reviewer 2CBy,
We hope this message finds you well. We kindly want to check if you had a chance to review our rebuttal, and if you have any further questions or comments we can address to help with your evaluation. Thanks again for your efforts and suggestions in improving our manuscript.
Sincerely,
The authors
This paper proposes a multi-agent framework, namely ReFeR, composed of a two-level hierarchical structure, where the two levels simulate the roles of reviewers and area chairs in the peer review process. The method has been compared comprehensively with mainstream approaches in terms of NLG evaluation tasks, multimodal evaluation tasks and for reasoning tasks.
Strengths
- The paper is well written and the method is easy to reproduce;
- Extending the evaluation tasks to reasoning tasks is a good generalization;
- Some key issues, such as how to select models and how many models to use, are addressed through detailed ablation studies.
Weaknesses
- The distinctions between this method and similar methods, such as multi-agent debate or multi-agent peer review, are not very clear. For example, debating or summarizing are merely different forms of prompts. In fact, for the multi-agent peer review method, if a reviewer can receive information from all other reviewers before refining their own review, they would essentially be playing a role very similar to that of an area chair.
- The two-level hierarchical structure that summarizes through an area chair is not fundamentally different from ensemble methods like majority voting; it is merely implemented via prompts. Additionally, the difference between Turbo and Lite versions is simply one of 1 versus 20 rounds of integration.
- The improvement in NLG evaluation is not particularly significant, and on reasoning tasks, the improvement is only more noticeable on AQuA, while it is not significant on other datasets. Moreover, the differences in performance may also be due to different model selections, such as the models used as peers not being entirely consistent with those used in other methods, leading to unfair comparisons.
- The two-level hierarchical structure has not been theoretically proven to be necessarily better than debate methods; it seems more that different methods have different strengths, weaknesses, and application scenarios.
Questions
Please see the weaknesses.
We appreciate the reviewer for their time and for raising important points.
§ Weakness 1
The distinctions between this method and similar methods, such as multi-agent debate or multi-agent peer review, are not very clear. For example, debating or summarizing are merely different forms of prompts. In fact, for the multi-agent peer review method, if a reviewer can receive information from all other reviewers before refining their own review, they would essentially be playing a role very similar to that of an area chair.
Debating and summarizing/aggregating are different forms of communication between multiple agents. In the debate format, agents can agree or disagree, and each model can revise its opinion over multiple rounds of debate based on the others' opinions. In the aggregating format, a separate model aggregates all the other models' opinions and then summarizes them or uses them to continue the process; the models usually do not communicate with each other or hold multiple rounds of discussion. In general, our framework could allow peers to look at each other's evaluations and make changes before sending them to the AC, but we only explore the hierarchical form of communication because we choose smaller models as peers, which can potentially be biased when seeing others' evaluations. For these reasons we need an AC that can use the assistants' evaluations to give a better overall evaluation. Our framework is quite different from the other multi-agent frameworks mentioned; some differences are listed below.
- ReFeR vs Multi-Agent Debate: Multi-agent debate uses the same model as multiple agents by invoking multiple instances of it, and primarily uses the debate format of communication with 3 rounds by default. ReFeR instead uses different models as peers and AC, in a hierarchy where the AC is required to be a stronger or better model than the peers; it also uses a specific prompt structure, and evaluation is a single round in which peers do not communicate among themselves.
- ReFeR vs Multi-Agent Peer Review: Multi-agent peer review also invokes multiple instances of the same model to simulate a multi-agent environment, but instead of debating, each model aggregates all the other reviews (plus their confidences), revises its own opinion, and repeats this for 3 rounds by default, so each model acts like an AC at every step. In ReFeR, there is a single evaluation by the peers and a single evaluation by the AC in a hierarchy; we complete the evaluation in fewer steps/computations while matching or exceeding other multi-agent methods. (A minimal sketch of the two communication patterns follows below.)
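The sketch below contrasts the two communication patterns under discussion; `query()` is a stand-in for a real LLM API call, and the model names and prompts are simplified illustrations rather than our exact configuration.

```python
# Self-contained sketch contrasting the two communication patterns.
# query() is a stand-in for a real LLM API call; all names are illustrative.

def query(model, prompt):
    return f"[{model} response to: {prompt[:40]}...]"  # placeholder for an API call

def refer(task, peers, area_chair):
    """ReFeR: one round, one-way hierarchy; peers never see each other's outputs."""
    peer_evals = [query(p, f"Evaluate: {task}") for p in peers]
    ac_prompt = f"Task: {task}\nPeer evaluations: {peer_evals}\nGive a final score and comment."
    return query(area_chair, ac_prompt)

def multi_agent_debate(task, agents, rounds=3):
    """Debate: every agent revises its own answer over multiple rounds."""
    answers = [query(a, f"Evaluate: {task}") for a in agents]
    for _ in range(rounds):
        answers = [
            query(a, f"Task: {task}\nOthers said: {answers}\nRevise your answer.")
            for a in agents
        ]
    return answers

print(refer("dialogue response X", ["Gemma-2-9B", "peer-model-2", "peer-model-3"], "GPT-4o-mini"))
```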
§ Weakness 2
The two-level hierarchical structure that summarizes through an area chair is not fundamentally different from ensemble methods like majority voting; it is merely implemented via prompts. Additionally, the difference between Turbo and Lite versions is simply one of 1 versus 20 rounds of integration.
Yes, we agree that the hierarchical structure is not fundamentally different from ensembling, but it provides better ensembling ability: by examining the differences between the peers' opinions and its own, the AC has the possibility of rejecting outliers during evaluation. For example, in Figure 3(b), when the majority of peers are wrong, the AC still answers correctly in at least 18.3% + 14% of the cases (i.e., "only 1 peer correct and AC correct" at 18.3% plus "0 peers correct but AC correct" at 14%). This cannot happen in a simple ensemble method like majority voting. It is especially evident in reasoning, where the AC can spot differences in the peers' reasoning and correct the answer even though the consensus is incorrect. We agree that Turbo and Lite differ simply in the number of responses aggregated; they were designed following previous works like G-Eval and Analyze-Rate. (A small sketch of this breakdown, compared with majority voting, is given below.)
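A toy sketch of this breakdown (the analysis behind Figure 3(b)) versus majority voting; the predictions below are made up purely for illustration.

```python
# Toy sketch of the Figure 3(b)-style breakdown versus majority voting;
# the predictions below are made up purely for illustration.
from collections import Counter

def breakdown(peer_answers, ac_answers, gold):
    """Count cases by (number of correct peers, whether the AC is correct)."""
    buckets = Counter()
    for peers, ac, y in zip(peer_answers, ac_answers, gold):
        buckets[(sum(p == y for p in peers), ac == y)] += 1
    return buckets

def majority_vote(peers):
    return Counter(peers).most_common(1)[0][0]

# With 0 or 1 correct peers, majority voting must fail, while the AC can still recover.
peer_answers = [["A", "B", "B"], ["C", "C", "C"]]
ac_answers   = ["A", "D"]
gold         = ["A", "D"]
print(breakdown(peer_answers, ac_answers, gold))   # Counter({(1, True): 1, (0, True): 1})
print([majority_vote(p) for p in peer_answers])    # ['B', 'C'], i.e. both wrong here
```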
§ Weakness 3
The improvement in NLG evaluation is not particularly significant, and on reasoning tasks, the improvement is only more noticeable on AQuA, while it is not significant on other datasets.
The improvement in correlation on evaluation tasks has always been comparatively small, even in previous works. G-Eval showed only a ~0.04 improvement in correlation over the best baseline on the SummEval dataset and ~0.02 on the TopicalChat dataset, even though the baselines are just pre-trained models like BART/T5 (BARTScore and UniEval, respectively) and G-Eval uses a very capable LLM, GPT-4. Although ReFeR's improvement seems modest, note that Analyze-Rate showed even less improvement over its baseline, G-Eval. This points to a saturation problem in the inherent capability of the evaluator LLM: there is only so much we can improve zero-shot LLM evaluators through prompting strategies or frameworks. This is also apparent in the reasoning results, where many methods saturate at ~95-96% on GSM8k and ~92-94% on BBH-DU, even though reasoning uses a much simpler metric (accuracy) and a smaller sample size. So, for the evaluation task, where the dataset is much larger and the metric is correlation, we are bound to see only slight improvements.
Moreover, the differences in performance may also be due to different model selections, such as the models used as peers not being entirely consistent with those used in other methods, leading to unfair comparisons.
We understand that the comparison between our method and the baselines may not be entirely fair, as we use different models while the baselines all use the same model. However, even though the baselines use the same model across multiple instances, their performance difference is not significant; moreover, we use smaller models (7B-12B) as peers, while the AC is GPT-4o-mini (a single instance). Our method also provides a better performance-to-cost ratio.
§ Weakness 4
The two-level hierarchical structure has not been theoretically proven to be necessarily better than debate methods; it seems more that different methods have different strengths, weaknesses, and application scenarios
Yes, we agree that the two-level hierarchical structure has not been proven to be better than debate methods, and we also observe that the different methods have their own advantages and weaknesses.
We hope our responses resolve your concerns. We are happy to provide any additional clarification if there are any further questions or comments.
Dear Reviewer KWMK,
We hope this message finds you well. We kindly want to check if you had a chance to review our rebuttal as we have addressed all of your concerns, and if you have any further questions or comments we can address to help with your evaluation. Thanks again for your review and questions.
Sincerely,
Authors
Thanks for the authors' rebuttal. I have carefully read the reviews from other reviewers and responses from the authors. The author's response has addressed some of my questions(Weakness 2 and Weakness 3), but there are still some concerns that remain unresolved. Therefore, at this stage, I will keep my score.
The author's response has addressed some of my questions(Weakness 2 and Weakness 3), but there are still some concerns that remain unresolved.
Thank you for taking the time to reply to us. Concerning your remaining concerns (Weakness 1 and Weakness 4):
The distinctions between this method and similar methods, such as multi-agent debate or multi-agent peer review, are not very clear. For example, debating or summarizing are merely different forms of prompts. In fact, for the multi-agent peer review method, if a reviewer can receive information from all other reviewers before refining their own review, they would essentially be playing a role very similar to that of an area chair.
Although one could say that multi-agent debate and multi-agent peer review do the same thing as our hierarchical framework, in that each agent receives other agents' responses and uses them for its own response, the debating/peer-review process differs from our hierarchical framework because there is no multi-round discussion between the agents. Our framework is designed as a one-way hierarchy in which the AC model (the final evaluator) is given the peers'/assistants' responses to improve its own evaluation, which is more efficient than frameworks that rely on discussion over multiple rounds and therefore incur higher computation.
Hence, we believe that our framework differs significantly from the other multi-agent frameworks in structure, performance, and the quality of rationale.
The two-level hierarchical structure has not been theoretically proven to be necessarily better than debate methods; it seems more that different methods have different strengths, weaknesses, and application scenarios.
We agree that different methods, including hierarchical and debate-based approaches, have their own strengths, weaknesses, and application scenarios. While theoretical proof of the superiority of the hierarchical structure is beyond the scope of this work, we have conducted experiments to evaluate response quality, particularly hallucination, using the HHEM-2.1 model. This model assesses how much a generated response is supported by a reference, with higher scores indicating lower hallucination and better rationale quality.
| Method | HHEM Score |
|---|---|
| GPT-4o-mini | 0.297 |
| Zero-Shot-CoT | 0.115 |
| Self Correction | 0.136 |
| Multi-Agent Debate | 0.102 |
| Multi-Agent Peer Review | 0.108 |
| ReFeR | 0.330 |
Our analysis on the GSM8k benchmark, detailed in Appendix O of the updated draft, shows that ReFeR significantly reduces hallucination compared to other frameworks. Notably, ReFeR achieves a higher HHEM score of 0.330, outperforming both GPT-4o-mini (baseline) and other baselines like Zero-Shot-CoT, Self-Correction, Multi-Agent Debate, and Multi-Agent Peer Review. These results suggest that our framework produces responses with better rationale quality compared to existing methods, which tend to increase hallucination after multiple rounds of discussion and evaluation.
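For transparency, the sketch below shows the shape of this analysis; `hhem_score()` is a hypothetical placeholder for a call to the HHEM-2.1 classifier, whose actual interface may differ, and the example strings are illustrative.

```python
# Shape of the hallucination analysis; hhem_score() is a hypothetical placeholder
# for a call to the HHEM-2.1 model (its real interface may differ). It should
# return a support score in [0, 1], higher meaning the response is better grounded.

def hhem_score(premise, hypothesis):
    # Placeholder: in practice this would run the HHEM-2.1 model on the pair.
    return 0.5

def average_hhem(premises, responses):
    """Mean HHEM score of a method's responses against their references."""
    scores = [hhem_score(p, r) for p, r in zip(premises, responses)]
    return sum(scores) / len(scores)

# Usage: compute one mean score per method on the same GSM8k examples and compare.
premises = ["GSM8k problem 1 ...", "GSM8k problem 2 ..."]
refer_responses = ["ReFeR rationale 1 ...", "ReFeR rationale 2 ..."]
print(average_hhem(premises, refer_responses))
```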
While further theoretical validation could enhance this understanding, we believe the hierarchical approach offers a promising alternative that warrants public attention and exploration.
We hope this helps to clear any confusion and bring more clarity. We are open for discussing any other concerns or comments you have.
Dear Reviewer KWMK,
We hope this message finds you well. We kindly want to check if you had a chance to review our rebuttal, and if you have any further questions or concerns left that we can address to help with your evaluation.
Sincerely,
The authors
We sincerely thank all reviewers for their thoughtful and detailed feedback, which has greatly helped in refining our work and clarifying its contributions. Below, we summarise the key points addressed during the rebuttal phase, additional experiments conducted, and the importance of this research for the broader community.
Key Contributions
Our paper introduces ReFeR, a hierarchical framework inspired by the peer review process, which utilizes multiple AI agents (LLMs/VLMs) for the evaluation of generative outputs. We:
- Demonstrated its superiority over existing methods like G-Eval and Analyze-Rate through enhanced reasoning and evaluation accuracy.
- Validated ReFeR's applicability across diverse modalities (text, images, multimodal tasks) and reasoning datasets.
- Introduced two variants:
- ReFeR-Turbo, for higher performance.
- ReFeR-Lite, which balances accuracy and test-time compute efficiency.
Enhancements During Rebuttal
-
Expanded Baselines and Comparisons:
- Added Simple Averaged Peer, ChatEval, and LLM-as-a-Judge as baselines and highlighted clear performance improvements.
- Conducted experiments on setups where the area chair uses diverse peer configurations (heterogeneous models as well as temperature-varied homogeneous setups).
- Performed experiments to show that the framework's structure is the reason for maximum performance rather than simply sampling more responses.
-
Visualization and Analysis:
- Included performance vs. test-time compute trade-off analysis as suggested by reviewers.
- Demonstrated that ReFeR consistently outperforms baselines across varying levels of inference compute.
- Conducted experiments with best peer as AC.
- Explored the effect of temperature on the framework using the same setup with different temperature values.
- Examined sampling responses and their performances for varied n (i.e., n=1, 5, 10, 20).
- Conducted an analysis on the rating distributions of the framework when n > 1.
-
Clarifications:
- Refined discussions on robustness, explainability, and the architectural novelty of our hierarchical approach.
- Provided detailed error analyses, ablations on prompt structures, and peer-AC model combinations.
- Explained in detail why a few works are not valid baselines and how the performance gain in correlation is significant.
-
Evaluation of Hallucination:
- Acknowledging that different approaches have strengths and weaknesses, we conducted experiments on reasoning response quality using the HHEM-2.1 model to assess hallucination on GSM8k. ReFeR outperformed baseline methods in producing higher-quality responses.
-
Additional Experiments:
- Conducted studies using open-weight models (e.g., Qwen-2.5-72B in evaluation and Qwen-2.5-7B in reasoning) for reproducibility and transparency.
- Showed the framework's ability to scale across varying configurations (number of peers, compute limits).
Why This Research Matters
The evaluation of generative AI outputs is a critical and challenging task. ReFeR represents a step toward human-like, systematic, and scalable evaluation methods, providing enhanced reasoning capabilities and a framework adaptable to various domains. Its hierarchical, modular structure offers insights for improving multi-agent collaboration systems beyond the context of LLM evaluation. We also hypothesise that the ReFeR architecture can be applied in completely orthogonal settings, since this structure works well in the real world, and we believe the idea deserves to reach more people and attract further research.
Gratitude to the Reviewers
We deeply appreciate the reviewers' efforts, constructive feedback, and engagement. While the rebuttal phase lacked engagement from some reviewers, we have strived to address all raised concerns comprehensively. Your contributions have significantly strengthened this work and its presentation.
We look forward to seeing ReFeR inspire future research and practical applications. Thank you for your time, expertise, and consideration.
Sincerely,
The Authors
This paper received ratings of 6, 6, 5, 5, 5, and was recommended for rejection by the majority of reviewers.
The paper introduces ReFeR, a hierarchical multi-agent framework inspired by the academic peer review process. It utilizes multiple LLMs or VLMs as peer evaluators and a more capable model as an "area chair" to synthesize evaluations and produce final scores and reasoning. ReFeR is tested across text evaluation tasks (TopicalChat, SummEval), multimodal evaluation (image captioning and generation), and reasoning tasks (AQuA, BBH, CSQA, GSM8K). The authors claim significant empirical improvements over prior methods like G-Eval and Analyze-Rate in terms of accuracy, cost-efficiency, and reasoning ability.
Strengths
- The authors test ReFeR across multiple domains (text, multimodal, reasoning) and provide thorough comparisons to baselines, including recent methods like G-Eval and Analyze-Rate.
- The hierarchical structure is an interesting and intuitive extension of multi-agent systems, simulating real-world peer review processes.
- The authors conduct extensive experiments, including ablations on the number of peer models, area chair selection, and prompting strategies, offering insights into framework behavior.
Area for improvements:
- The performance improvements over existing baselines are modest (ranging from 0.01–0.04 in correlation metrics). While ReFeR-Turbo shows the best results, the gains may not justify the significant computational overhead.
- Limited novelty: The hierarchical structure, while well-implemented, is an engineering extension of existing multi-agent frameworks (e.g., multi-agent peer review, debate frameworks). The novelty contribution is limited as prior works have explored similar structures with single models.
- The proposed prompting schema, while effective, is specific to the ReFeR framework and cannot be generalized easily. The novelty of "evaluation guidelines" is also questionable, as detailed prompts are a common practice in LLM research.
- Evaluation costs: Despite the introduction of ReFeR-Lite, the framework still requires significant computational resources, which limits its practical applicability for large-scale tasks.
While the paper is well-executed, its contributions are incremental, and the computational cost remains a limiting factor. For future iterations, the authors could strengthen the theoretical underpinnings, reduce overhead, and demonstrate broader applicability to justify its acceptance.
Additional comments from reviewer discussion
During the rebuttal phase, the authors addressed reviewer concerns and expanded on certain aspects:
- The authors demonstrated that ReFeR achieves better performance for its computational cost compared to other baselines.
- New comparisons to methods like ChatEval and LLM-as-a-Judge were presented, further supporting the hierarchical structure's benefits.
- Role of the "Area Chair": the authors clarified that stronger models as "area chairs" significantly improve performance, though gains diminish when using weaker models.
- "Ablation on n": The authors showed diminishing returns after a certain n-value. However, concerns remain regarding the limited improvements/gains, high variance across runs, and lack of theoretical justification for the hierarchical design's superiority over debate frameworks.
Reject