ICLR 2024
Average rating: 4.3 / 10 (Rejected)
4 reviewers; ratings: 5, 6, 1, 5 (min 1, max 6, std 1.9); average confidence: 3.5

PRD: Peer Rank and Discussion Improve Large Language Model based Evaluations

Submitted: 2023-09-18 | Updated: 2024-02-11
TL;DR

A large language model based peer rank and discussion-based evaluation framework (PRD), for reducing bias in LLM-based evaluation and better evaluating open-ended question answering.

Abstract

Keywords
Natural Language Processing, Model-based Evaluation, Large Language Models

Reviews and Discussion

Review (Rating: 5)

In this paper, a peer evaluation method is proposed to improve LLM-based evaluations of open-ended question answering and alleviate potential bias and self-enhancement. The proposed peer rank process produces a more reasonable ranking of model capabilities. The peer discussion process is also proposed to help models reach mutual agreements that are more consistent with human judgement. Interesting insights are provided in the experimental results and analysis, such as that the LLM reviewer who leads the discussion tends to hold its opinion, and that stronger LLMs tend to hold their opinions.

Strengths

  1. This paper focuses on a really interesting problem, which is to improve LLM-based evaluations and mitigate bias by taking advantage of peer rankings and peer discussions of LLMs. The proposed methods are simple but effective.

  2. Many examples are provided to illustrate the process of peer rank and peer discussion.

  3. A few popular and strong LLMs are employed in experiments and results are promising.

Weaknesses

  1. The study seems to be limited to question answering only. LLM has many other applications such as writing reviews and generating essays. Would the proposed methods be applicable to improve evaluations on those applications as well?

  2. How significant are the result comparisons, such as in Table 3, 4 and 5? It'd be great if the authors could present t-test results such as p-values of the comparisons.

Questions

  1. Is there any data contamination issues for the LLMs being tested?
Comment

We are truly grateful for the constructive feedback from this review. We are excited to find that the reviewer thinks our work focuses on a really interesting problem. We are delighted to find that our paper received good scores for contribution (3) and presentation (3).

Weaknesses

  1. The study seems to be limited to question answering only. LLM has many other applications such as writing reviews and generating essays. Would the proposed methods be applicable to improve evaluations on those applications as well?
  • We kindly point out that we evaluated our model on the diverse Vicuna80 dataset encompassing a wide variety of tasks: writing, roleplay, common sense, fermi, counterfactual, coding, math, generic, knowledge, etc. Each task includes over 10 questions.
  2. Significance of results in Tables 3, 4, and 5? It'd be great if the authors could present t-test results such as p-values of the comparisons.
  • We added significance tests for all of the results, as shown below; a minimal sketch of the test computation follows the tables:
  • Table 3 shows the example-level correlation results for peer rank. Two-tailed t-test results are shown on the accuracy measure for our method compared to the other baselines (statistical significance is indicated with *(p<0.01), **(p<0.002))
| Reviewer | Fleiss Kappa | Accuracy |
| --- | --- | --- |
| GPT-3.5 | 0.387 | 0.621 |
| Claude | 0.319 | 0.607 |
| GPT-4 | 0.406 | 0.643 |
| GPT-4 & Claude & GPT-3.5 | 0.403 | 0.666* |
| All Reviewers (Weighted) | 0.410 | 0.673** |
  • Table 4 describes peer discussion accuracy on LFQA. We note that when PD happens between a model and a different model (e.g. Claude & GPT-4), results are significantly better. Statistical significance is indicated with *(p<0.05), **(p<0.005).
| Reviewer pair | R1 | R2 | R1 lead | R2 lead | Random |
| --- | --- | --- | --- | --- | --- |
| GPT-4 & Claude | 0.729 | 0.671 | 0.743* | 0.729 | 0.729 |
| GPT-4 & GPT-3.5 | 0.729 | 0.579 | 0.714 | 0.750* | 0.731 |
| GPT-3.5 & Claude | 0.579 | 0.671 | 0.700** | 0.671 | 0.686 |
| GPT-3.5 & GPT-3.5-0.8 | 0.579 | 0.650 | 0.664 | 0.686** | 0.681 |
| Claude & Claude-0.8 | 0.664 | 0.707 | 0.693 | 0.671 | 0.680 |
| GPT-4 & GPT-4-0.8 | 0.779 | 0.757 | 0.779 | 0.757 | 0.779 |
  • Table 5 shows the comparison of PD and PR on Vicuna80. Two-tailed t-test results are shown on the F1 measure for our method compared to the baseline.
| Reviewer | Accuracy |
| --- | --- |
| GPT-4 | 0.3500 |
| GPT-4 & Claude | 0.3675 (p<0.01) |
| All | 0.4375 (p<0.005) |
| All (weighted) | 0.4625 (p<0.002) |
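The tables above report standard two-tailed t-tests. As a minimal sketch of how such a test is computed (not our exact evaluation script), the snippet below applies SciPy's `ttest_ind` to per-example binary correctness vectors; the vectors and sample size are simulated placeholders for illustration only.

```python
# Minimal sketch of a two-tailed t-test over per-example correctness
# (agreement with the human label). The data below are simulated placeholders,
# not the actual evaluation results.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
ours = rng.binomial(1, 0.67, size=140)      # e.g. "All Reviewers (Weighted)"
baseline = rng.binomial(1, 0.64, size=140)  # e.g. a single-reviewer baseline

t_stat, p_value = stats.ttest_ind(ours, baseline)  # two-sided by default
print(f"t = {t_stat:.3f}, two-tailed p = {p_value:.4f}")
```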

Questions

  1. Is there any data contamination issues for the LLMs being tested?
  • The models tested were released before the creation of our Vicuna80 and LFQA datasets, so there should be little or no contamination.
  • Besides, there are no gold answers to the open-ended questions in our datasets, and there is potentially a large set of good answers for each. Thus, data contamination is much less of an issue than for exam-like questions [1, 2].

References

  1. Yadav, Vikas, Steven Bethard, and Mihai Surdeanu. "Quick and (not so) dirty: Unsupervised selection of justification sentences for multi-hop question answering." arXiv preprint arXiv:1911.07176 (2019).
  2. Hendrycks, Dan, et al. "Measuring massive multitask language understanding." arXiv preprint arXiv:2009.03300 (2020).
Comment

We sincerely thank the reviewer for the constructive feedback. We would sincerely like to discuss your review and our reply with you. If you think our clarifications and the additional experiments address your original concerns, could you please consider adjusting the soundness, excitement, contribution, and rating scores? If there are any other concerns, please let us know and we'll try our best to address them!

Review (Rating: 6)

This work introduces a novel approach, Peer Rank and Discussion (PRD), for evaluating LLMs' answers. The idea behind PRD is to involve peer LLMs as reviewers to create a fairer and less biased evaluation. PRD has two versions, Peer Rank (PR) and Peer Discussion (PD), both of which share the goal of mitigating biases in automated evaluations. Peer Rank (PR) operates in a tournament-style benchmarking setting, where LLMs compete in pairwise matches to answer open-ended questions, while in Peer Discussion (PD) two reviewer LLMs engage in multi-turn discussions to reach a mutual agreement on pairwise scoring or preference. Experiments show that within this framework the automatic evaluation achieves higher accuracy and aligns better with human judgments.

Strengths

The proposed evaluation framework provides good insights for improving automatic evaluation and reducing the bias introduced by large language models. It has the potential to enhance the reliability and robustness of LLM assessments while providing a well-reasoned exploration of the framework and valuable analysis (e.g., Section 5).

Weaknesses

While the human agreement can be higher than with other approaches, PRD is computationally costly. This work uses SOTA models that are only available upon payment, without discussing the trade-off between cost and benefits compared to human evaluation. Along the same lines, this work does not consider open-source LLMs; knowing which open-source LLMs could be used here would be beneficial for the community. The effectiveness of PRD also depends on the quality and diversity of the peer LLMs available, and the work does not discuss what options can be used if there is only a limited pool of peer models. This somewhat limits its applicability and utility.

Questions

  • What other options can be used from the open-source scenario?
Comment

We sincerely thank the reviewer for giving feedback and suggesting further experiments on open-source models. We are encouraged that the reviewer finds our work has good soundness (3), presentation (3), and contribution (3).

Weaknesses

  1. Discussing the trade-off between PRD's cost and benefits, compared to human evaluation.
  • We admit that the cost of calling API-based commercial LLMs might be an issue, but it is fairer and is commonly used for evaluating LLMs in recent research literature, such as for evaluating model performance on open-ended questions [1, 2], summarization [3], dialogue generation [4], etc.
  • Moreover, illustrating the decision process through multi-round LLM interactions is an essential part of recent work on model explainability, in topics such as evaluation [5], scalable oversight of model capabilities [6], and code optimization and text generation [7]; we will add this to the final version.
  • In our work, we hired domain experts such as food science experts (from Upwork: http://upwork.com) for the human pairwise comparison annotations. Each pair comparison takes on average 5-10 minutes to annotate, and we pay $2-3 per question, while in our experiments GPT-4-based evaluation costs only about $0.015 per question (based on token counts). Thus we believe using LLMs for automatic evaluation is promising and cost-efficient. Human annotation details are included in Appendix G.
  • Beyond LLM-based automatic evaluation, a broader theme of this work is scalable oversight of models [8]: enabling models to induce rankings of answers that are hard for humans to compare.
  2. Along the same lines, this work does not consider open-source LLMs. Knowing which open-source LLMs can be used here could be beneficial for the community.
  • Open-source LLMs are always an important part of the community. The original submission already includes experiments involving Vicuna-13b, an open model based on LLaMA [10].
  • Additionally, we designed a new experiment for the PR method including new public models: Zephyr-7b-alpha [9], Vicuna-13b (already discussed in the paper), and GPT-3.5. The results are consistent with Table 2 in the paper.

New results:

  • We show that both the ranking and the Elo ratings change and become closer to the golden ranking when using PR, which is consistent with the experiments on commercial LLMs. A minimal Elo-update sketch follows the table.
| Models | All: Elo | All: Rank | All (Weighted): Elo | All (Weighted): Rank | Human Raters: Elo | Human Raters: Rank |
| --- | --- | --- | --- | --- | --- | --- |
| Vicuna | 999 | 2 | 1011 (-47) | 1 | 1058 | 1 |
| Zephyr | 1010 | 1 | 1003 (+3) | 2 | 1000 | 2 |
| GPT-3.5 | 991 | 3 | 993 (+52) | 3 | 941 | 3 |
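For readers less familiar with Elo, the snippet below is a minimal sketch of the standard Elo update applied to pairwise judgments; it is illustrative only, and the exact update and reviewer-weighting procedure behind the table are those of Algorithm 2 in Appendix E. The `K` constant, the starting ratings, and the `weight` hook are assumptions made for this sketch.

```python
# Minimal sketch of a standard Elo update over pairwise judgments
# (illustrative only; the paper's exact procedure is Algorithm 2, Appendix E).
K = 16  # update step size, an assumed constant

def expected_score(r_a: float, r_b: float) -> float:
    """Win probability of A over B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(ratings: dict, a: str, b: str, score_a: float, weight: float = 1.0) -> None:
    """score_a: 1.0 if A wins, 0.0 if B wins, 0.5 for a tie.
    `weight` can carry a reviewer weight, as in the weighted variant."""
    delta = weight * K * (score_a - expected_score(ratings[a], ratings[b]))
    ratings[a] += delta
    ratings[b] -= delta

ratings = {"Vicuna": 1000.0, "Zephyr": 1000.0, "GPT-3.5": 1000.0}
elo_update(ratings, "Zephyr", "GPT-3.5", score_a=1.0)  # one pairwise win for Zephyr
print(ratings)
```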

Questions

  1. What other options can be used from the open-source scenario?
  • Many open-source models can be applied in our method, such as Vicuna, Zephyr [9], LLaMA [10], etc. We conducted a new experiment with these, as shown above.

References

  1. Dettmers, Tim, et al. "Qlora: Efficient finetuning of quantized llms." arXiv preprint arXiv:2305.14314 (2023).
  2. Zheng, Lianmin, et al. "Judging LLM-as-a-judge with MT-Bench and Chatbot Arena." arXiv preprint arXiv:2306.05685 (2023).
  3. Liu, Yang, et al. "Gpteval: Nlg evaluation using gpt-4 with better human alignment." arXiv preprint arXiv:2303.16634 (2023).
  4. Lin, Yen-Ting, and Yun-Nung Chen. "LLM-Eval: Unified Multi-Dimensional Automatic Evaluation for Open-Domain Conversations with Large Language Models." arXiv preprint arXiv:2305.13711 (2023).
  5. Chan, Chi-Min, et al. "Chateval: Towards better llm-based evaluators through multi-agent debate." arXiv preprint arXiv:2308.07201 (2023).
  6. Michael, Julian et al. “Debate Helps Supervise Unreliable Experts.” (2023).
  7. Madaan, Aman, et al. "Self-refine: Iterative refinement with self-feedback." arXiv preprint arXiv:2303.17651 (2023).
  8. Bowman, Samuel R., et al. "Measuring progress on scalable oversight for large language models." arXiv preprint arXiv:2211.03540 (2022).
  9. Tunstall, Lewis, et al. "Zephyr: Direct Distillation of LM Alignment." arXiv preprint arXiv:2310.16944 (2023).
  10. Touvron, Hugo, et al. "Llama: Open and efficient foundation language models." arXiv preprint arXiv:2302.13971 (2023).
Comment

We sincerely thank the reviewer for the constructive feedback. We would sincerely like to discuss your review and our reply with you. If you think our clarifications and the additional experiments address your original concerns, could you please consider adjusting the soundness, excitement, contribution, and rating scores? If there are any other concerns, please let us know and we'll try our best to address them!

Review (Rating: 1)

I believe that this paper uses various approaches to produce answers to questions using interacting groups of LLMs. In one approach, an LLM ranks competing answers generated by two other LLMs, and in another approach, two LLMs debate and hopefully converge on an answer.

Strengths

In general, clever decoding strategies for LMs have been successful, such as Chain-of-Thought prompting, maieutic prompting, beam search, etc. This may be yet another useful framework; I cannot tell for sure.

Weaknesses

From the outset, this paper's writing is highly confusing. It's next to impossible for me to parse what the argument is, or what many of the sentences mean. I don't feel equipped to evaluate its conceptual weaknesses because I cannot tell what the motivation, the problem, the proposed solution, or the result is.

Questions

Is the aim of your method to help LMs produce better answers to questions than other methods currently achieve?

Comment

We thank the reviewer for the feedback.

Weaknesses

  1. Motivation and problem
  • We focus on utilizing large language models (LLMs) to evaluate open-ended generation, rather than proposing a decoding strategy. The goal is to propose new methods that use LLMs for better and fairer evaluations that align with human judgments.
    • The basic task formulation is: given answers A and B, the model is required to judge which of them is better.
    • Determining pairwise comparison is also a key step for InstructGPT training [7].
  • To get an overview and basic knowledge of the topic, we kindly refer the reviewer to skim a set of related work on LLM-based evaluation [1,2,3,4,5,6,7], where [1] is a very basic model in this direction.
  2. Proposed solution
  • We propose (1) the peer rank (PR) algorithm that takes into account each peer LLM’s pairwise preferences of all answer pairs and outputs a final ranking of model capabilities;
  • And (2) peer discussion (PD), where we prompt two LLMs to discuss and try to reach a mutual agreement on the preference between two answers (see the sketch after this list).
  3. Results
  • We conduct comprehensive experiments on two benchmark datasets. We find that our approaches achieve significantly higher accuracy and better alignment with human preference.
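To make the PD procedure more concrete, here is a minimal sketch of a two-reviewer discussion loop in the spirit of PD. `query_llm` and `extract_preference` are hypothetical helpers (not APIs from the paper), and the prompts, the preference extraction, and the stopping rule are simplified placeholders relative to the prompts used in the paper.

```python
# Minimal sketch of a peer discussion (PD) loop between two reviewer LLMs.
# `query_llm(model, prompt)` is a hypothetical helper that returns the model's
# reply; prompts and the agreement check are simplified placeholders.
def extract_preference(reply: str) -> str:
    """Crude placeholder: assume the reply ends with a verdict naming A or B."""
    return "A" if "Answer A" in reply.splitlines()[-1] else "B"

def peer_discussion(question, answer_a, answer_b, reviewer_1, reviewer_2,
                    query_llm, max_turns=4):
    history = []
    # Each reviewer first states an initial preference with reasons.
    for reviewer in (reviewer_1, reviewer_2):
        prompt = (f"Question: {question}\nAnswer A: {answer_a}\nAnswer B: {answer_b}\n"
                  "Which answer is better? Explain, then end with 'Answer A' or 'Answer B'.")
        history.append((reviewer, query_llm(reviewer, prompt)))
    # Reviewers then alternate turns, reading the discussion so far, until the
    # two most recent statements agree or the turn budget runs out.
    for turn in range(max_turns):
        speaker = reviewer_1 if turn % 2 == 0 else reviewer_2
        transcript = "\n\n".join(f"[{r}]: {msg}" for r, msg in history)
        prompt = (transcript + "\n\nConsidering the discussion above, state your "
                  "current preference and end with 'Answer A' or 'Answer B'.")
        history.append((speaker, query_llm(speaker, prompt)))
        if len({extract_preference(msg) for _, msg in history[-2:]}) == 1:
            break  # mutual agreement reached
    return history
```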

Questions

  1. Is the aim of your method to help LMs produce better answers to questions than other methods currently achieve?
  • As outlined in the abstract and introduction, our aim is to introduce improved evaluation methods for large language models (LLMs) that enhance accuracy and better align with human judgments. Our research opens up possibilities for assessing models that are challenging for humans to compare.

Finally, we are eager to answer any further questions from the reviewer to help fully understand the intellectual merits of our work.

References

  1. Fu, Jinlan, et al. "Gptscore: Evaluate as you desire." arXiv preprint arXiv:2302.04166 (2023).
  2. Liu, Yang, et al. "Gpteval: Nlg evaluation using gpt-4 with better human alignment." arXiv preprint arXiv:2303.16634 (2023).
  3. Wang, Peiyi, et al. "Large language models are not fair evaluators." (2023).
  4. Zheng, Lianmin, et al. "Judging LLM-as-a-judge with MT-Bench and Chatbot Arena." arXiv preprint arXiv:2306.05685 (2023).
  5. Bai, Yushi, et al. "Benchmarking foundation models with language-model-as-an-examiner." arXiv preprint arXiv:2306.04181 (2023).
  6. Zheng, Lianmin, et al. "Judging LLM-as-a-judge with MT-Bench and Chatbot Arena." arXiv preprint arXiv:2306.05685 (2023).
  7. Ouyang, Long, et al. "Training language models to follow instructions with human feedback." Advances in Neural Information Processing Systems 35 (2022): 27730-27744.
Comment

Dear reviewer qoLU,

Thank you for the clarification.

We want to emphasize that we will try our best to improve the writing, and we have added significance tests demonstrating that the improvements are significant throughout the paper.

We sincerely hope that you could provide more fine-grained and targeted feedback on which parts of the writing and which parts of the experiments should be improved (we added significance tests and have shown that all the improvements are significant under a t-test), as in the other reviews.

Also, since your review deviates somewhat from those of the other three reviewers, could you possibly read the other reviews and confirm that you have not misjudged any aspect of the intellectual merits of our work?

  • Table 3 shows the example-level correlation results for peer rank. Two-tailed t-test results are shown on the accuracy measure for our method compared to the other baselines (statistical significance is indicated with *(p<0.01), **(p<0.002))
| Reviewer | Fleiss Kappa | Accuracy |
| --- | --- | --- |
| GPT-3.5 | 0.387 | 0.621 |
| Claude | 0.319 | 0.607 |
| GPT-4 | 0.406 | 0.643 |
| GPT-4 & Claude & GPT-3.5 | 0.403 | 0.666* |
| All Reviewers (Weighted) | 0.410 | 0.673** |
  • Table 4 describes peer discussion accuracy on LFQA. We note that when PD happens between a model and a different model (e.g. Claude & GPT-4), results are significantly better. Statistical significance is indicated with *(p<0.05), **(p<0.005).
| Reviewer pair | R1 | R2 | R1 lead | R2 lead | Random |
| --- | --- | --- | --- | --- | --- |
| GPT-4 & Claude | 0.729 | 0.671 | 0.743* | 0.729 | 0.729 |
| GPT-4 & GPT-3.5 | 0.729 | 0.579 | 0.714 | 0.750* | 0.731 |
| GPT-3.5 & Claude | 0.579 | 0.671 | 0.700** | 0.671 | 0.686 |
| GPT-3.5 & GPT-3.5-0.8 | 0.579 | 0.650 | 0.664 | 0.686** | 0.681 |
| Claude & Claude-0.8 | 0.664 | 0.707 | 0.693 | 0.671 | 0.680 |
| GPT-4 & GPT-4-0.8 | 0.779 | 0.757 | 0.779 | 0.757 | 0.779 |
  • Table 5 shows the comparison of PD and PR on Vicuna80. Two-tailed t-test results are shown on the F1 measure for our method compared to the baseline.
| Reviewer | Accuracy |
| --- | --- |
| GPT-4 | 0.3500 |
| GPT-4 & Claude | 0.3675 (p<0.01) |
| All | 0.4375 (p<0.005) |
| All (weighted) | 0.4625 (p<0.002) |
Comment

Dear Reviewer qoLU,

We are sorry that we were not able to help you understand the paper. Considering this, could you send a message, or write here, to ask the AC/SACs not to consider your review, so that the decision on our paper's acceptance or rejection would be fairer?

Comment

I do understand your method much better at this point, thanks for the clarification, and I apologize I couldn't parse it earlier. That being said, I believe the writing needs much improvement before a paper like this can be submitted at a venue like ICLR. I also think that the improvements over off-the-shelf approaches are marginal and may not warrant publication at ICLR regardless of writing quality.

Nevertheless, given my increased understanding, I am raising my score on "soundness" from 1 to 2, but will leave my overall rating the same, as I still believe the paper should be rejected.

I won't ask my review to be ignored, as I am fairly confident in my assessment, at this point.

Review (Rating: 5)

The paper proposes two new LLMs-based methods for evaluating LLMs in a way that addresses current biases and limitations. They can be applied when the form of evaluation is pairwise comparison.

  • Peer Rank (PR): This algorithm selects several LLMs to serve as both contestants and reviewers. In each match, a reviewer compares the outputs generated by two contestants and decides which is better. The weight of the evaluation made by a reviewer LLM is based on its performance, and the weights are computed in an iterative manner. This algorithm can be used with both win rates and Elo ratings as the scoring method.

  • Peer Discussion (PD): Given two answers generated by two contestant LLMs, two reviewer LLMs assess which answer is better separately, and then the two reviewer LLMs engage in a structured discussion to reach a mutual agreement on the preferences for two given answers.

The methods are intended to produce evaluations that are more aligned with human judgment and can mitigate self-enhancement bias and position bias.

Strengths

  • This paper proposes two targeted solutions to the current issues of LLM-based evaluations, such as self-enhancement bias and position bias, which can give fairer, more reliable evaluation results. The two proposed methods are inspired by the educational domain and are easy to understand.

  • This paper conducts human annotations to extend the existing datasets, which may benefit further research.

Weaknesses

  • I see limited improvements when PR and PD are applied. In Table 2, the Elo rating becomes closer to human raters' rating, but the ranking of the five LLMs does not change. In Table 3, the example-correlation does not increase too much, comparing "All Reviewers (Weighted)" to "GPT-4 & Claude & GPT-3.5". These indicate that the core idea of PR may not be very useful. In Table 5, the performance boost with PD is not that great.

  • The two proposed methods are weakly related, except that they both introduce "peers" to some extent. The paper doesn't combine the two, which makes me feel cut off.

  • Introducing communication between LLMs to reach a better solution will substantially increase computational cost. The paper does not seem to take this into consideration.

Questions

  • Question A: For PR used in the win rate calculation, how will you determine whether it has converged? I only see the number of iterations in Algorithm 2 in Appendix E. Have you ever analyzed the impact of the number of iterations on the results?

  • Question B: I need more information about the human annotation process. How exactly did the annotation process work? How did you recruit the participants and how about their backgrounds? How much were they paid?

Comment

Questions

  1. For PR used in the win rate calculation, how will you determine whether it has converged? I only see the number of iterations in Algorithm 2 in Appendix E. Have you ever analyzed the impact of the number of iterations on the results?
  • We examined the impact of the number of iterations by manually contrasting outcomes after N iterations with those after significantly more iterations, such as 2N or 3N. If the variation in each model's rating is less than 0.0001, we conclude that PR has converged (a minimal sketch of this check appears after this list).
  2. I need more information about the human annotation process. How exactly did the annotation process work? How did you recruit the participants and how about their backgrounds? How much were they paid?
  • In our work, we hired domain experts such as food science experts (from Upwork) for the human pairwise comparison annotations. Before annotation, we provided a detailed instruction document to annotators and answered their questions.
  • For each pair of comparisons, annotators follow instructions on a UI shown in Appendix G, and the annotation takes an average of 5-10 minutes. We pay $2-3 for each question.
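As referenced in point 1 above, here is a minimal sketch of the convergence check; `one_peer_rank_iteration` is a hypothetical stand-in for one round of the actual PR update (Algorithm 2, Appendix E), which is not reproduced here.

```python
# Minimal sketch of the convergence check: iterate the PR update until no
# model's rating moves by more than 1e-4. `one_peer_rank_iteration(scores)`
# is a hypothetical stand-in returning the re-estimated scores.
TOL = 1e-4  # "variation in each model's rating is less than 0.0001"

def run_until_converged(scores: dict, one_peer_rank_iteration, max_iters=1000):
    for i in range(max_iters):
        new_scores = one_peer_rank_iteration(scores)
        if max(abs(new_scores[m] - scores[m]) for m in scores) < TOL:
            return new_scores, i + 1   # converged after i + 1 iterations
        scores = new_scores
    return scores, max_iters           # hit the iteration cap without converging
```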

References

  1. Touvron, Hugo, et al. "Llama 2: Open foundation and fine-tuned chat models." arXiv preprint arXiv:2307.09288 (2023).
  2. Ouyang, Long, et al. "Training language models to follow instructions with human feedback." Advances in Neural Information Processing Systems 35 (2022): 27730-27744.
  3. Dettmers, Tim, et al. "Qlora: Efficient finetuning of quantized llms." arXiv preprint arXiv:2305.14314 (2023).
  4. Zheng, Lianmin, et al. "Judging LLM-as-a-judge with MT-Bench and Chatbot Arena." arXiv preprint arXiv:2306.05685 (2023).
  5. Liu, Yang, et al. "Gpteval: Nlg evaluation using gpt-4 with better human alignment." arXiv preprint arXiv:2303.16634 (2023).
  6. Lin, Yen-Ting, and Yun-Nung Chen. "LLM-Eval: Unified Multi-Dimensional Automatic Evaluation for Open-Domain Conversations with Large Language Models." arXiv preprint arXiv:2305.13711 (2023).
  7. Chan, Chi-Min, et al. "Chateval: Towards better llm-based evaluators through multi-agent debate." arXiv preprint arXiv:2308.07201 (2023).
  8. Michael, Julian et al. “Debate Helps Supervise Unreliable Experts.” (2023).
  9. Madaan, Aman, et al. "Self-refine: Iterative refinement with self-feedback." arXiv preprint arXiv:2303.17651 (2023).
  10. Bowman, Samuel R., et al. "Measuring progress on scalable oversight for large language models." arXiv preprint arXiv:2211.03540 (2022).
Comment

We sincerely thank the reviewer for the constructive feedback. We are encouraged that the reviewer finds our work may benefit further research.

Weaknesses

  1. Limited improvements from PR: the Elo rating becomes closer to the human raters' rating, but the ranking of the five LLMs does not change.
  • We would politely point out that the Elo score is a more fine-grained and important metric for comparing model capabilities. Recent works [1, 2, 3] all used it as the main metric, and it is meaningful even if the ranking does not change in this specific case. The Elo rating of "All (Weighted)" is the closest to the human rating, which shows that our method aligns better with human judgment.
  • Ranking is more general and coarse-grained, and it is not informative when the gap in model capabilities is large. It is therefore more convincing to examine the ranking when including models of similar capability (on a challenging set of questions); we added new experiments including another set of LLMs, shown below.

Additional Experiment:

  • In the new experiments, we added new public LLMs and conducted PR among Vicuna-13b, Zephyr-7b-alpha, and GPT-3.5. We show that both the ranking and the Elo ratings change and become closer to the golden ranking, which is consistent with the experiments on commercial LLMs. This change demonstrates that PR also contributes to adjusting the global ranking.
| Models | All: Elo | All: Rank | All (Weighted): Elo | All (Weighted): Rank | Human Raters: Elo | Human Raters: Rank |
| --- | --- | --- | --- | --- | --- | --- |
| Vicuna | 999 | 2 | 1011 (-47) | 1 | 1058 | 1 |
| Zephyr | 1010 | 1 | 1003 (+3) | 2 | 1000 | 2 |
| GPT-3.5 | 991 | 3 | 993 (+52) | 3 | 941 | 3 |
  2. In Table 3, the example-correlation does not increase too much, comparing "All Reviewers (Weighted)" to "GPT-4 & Claude & GPT-3.5". These indicate that the core idea of PR may not be very useful.
  • "GPT-4 & Claude & GPT-3.5" is also an important variation of our PR approach of the weighted combination of all reviewers. The “All reviewers (Weighted)” includes two additional models. Both of them significantly outperform the baselines according to the significance test (>2%).
  3. In Table 5, the performance boost with PD is not that great.
  • We conducted a two-tailed t-test, and the results show that the p-value of PD compared to the baseline is smaller than 0.01, demonstrating a significant improvement; we will add this to the final version.
  4. The two proposed methods are weakly related, except that they both introduce "peers" to some extent. The paper doesn't combine the two, which makes me feel cut off.
  • We politely argue that PR and PD are closely interrelated and fall under the same theme of providing a fairer (de-biased) ranking of long- and free-form answers. PR is for global ranking and PD is for fine-grained ranking.
  • In Table 5 we compare their performance and provide insights on the differences between PR and PD and their respective benefits.
  • Moreover, we integrated them into one unified peer evaluation framework: first using PR to determine the weight of each LLM reviewer, and then conducting fine-grained PD among them. The first step prunes out the weak LLMs, and the second step enables more explainable evaluation decisions (discussion history); a high-level sketch follows this list. Through this study, we found that the combination achieves good efficiency and satisfactory results.
  • We will add more details in the final version and adjust the corresponding parts to make the connection smoother.
  5. Introducing communication between LLMs to obtain a better solution will increase computational cost.
  • We admit that the cost of calling API-based commercial LLMs might be an issue, but it is fairer and is commonly used for evaluation in recent research literature, such as for evaluating model performance on open-ended questions [3, 4], summarization [5], dialogue generation [6], etc.
  • Moreover, illustrating the decision process through multi-round LLM interactions is an essential part of recent work on model explainability, in topics such as evaluation [7], scalable oversight of model capabilities [8], and code optimization and text generation [9]; we will add this to the final version.
  • In our work, we hired domain experts such as food science experts (from Upwork: http://upwork.com) for the human pairwise comparison annotations. Each pair comparison takes on average 5-10 minutes to annotate, and we pay $2-3 per question, while in our experiments GPT-4-based evaluation costs only about $0.015 per question (based on token counts). Thus we believe using LLMs for automatic evaluation is promising and cost-efficient.
  • Beyond LLM-based automatic evaluation, a broader theme of this work is scalable oversight of models [10]: enabling models to induce rankings of answers that are hard for humans to compare.
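As referenced in point 4 above, here is a high-level sketch of the unified pipeline (PR to weight and prune reviewers, then PD among the strongest ones). `peer_rank_weights` and `peer_discussion` are hypothetical stand-ins for the two procedures, and the pruning threshold is an illustrative assumption.

```python
# High-level sketch of the unified PR + PD pipeline. `peer_rank_weights` and
# `peer_discussion` are hypothetical stand-ins for the two procedures;
# `min_weight` is an illustrative pruning threshold. Assumes at least two
# reviewers survive pruning.
def combined_evaluation(question, answer_a, answer_b, reviewers,
                        peer_rank_weights, peer_discussion, min_weight=0.2):
    weights = peer_rank_weights(reviewers)                        # step 1: PR reviewer weights
    strong = [r for r in reviewers if weights[r] >= min_weight]   # prune weak reviewers
    r1, r2 = sorted(strong, key=lambda r: weights[r], reverse=True)[:2]
    history = peer_discussion(question, answer_a, answer_b, r1, r2)  # step 2: fine-grained PD
    return weights, history
```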
Comment

We sincerely thank the reviewer for the constructive feedback. We would sincerely like to discuss your review and our reply with you. If you think our clarifications and the additional experiments address your original concerns, could you please consider adjusting the soundness, excitement, contribution, and rating scores? If there are any other concerns, please let us know and we'll try our best to address them!

AC Meta-Review

The paper proposes two methods to improve LLM-based evaluations when the form of evaluation is pairwise comparison. The Peer Rank (PR) method takes into account each peer LLM's pairwise preferences over all answer pairs and outputs a final ranking of models. The Peer Discussion (PD) method prompts two LLMs to discuss and try to reach a mutual agreement on the preference between two answers. Experiments on two datasets show that the proposed methods achieve higher accuracy and align better with human judgments.

Strengths:

  1. This paper focuses on an interesting problem and tries to improve LLM-based evaluations and mitigate bias by taking advantage of peer rankings and peer discussions of LLMs. The proposed methods are easy to understand.

  2. This paper conducts human annotations to extend the existing datasets, which may benefit further research.

  3. The results on two datasets are generally promising.

Weaknesses:

  1. The two proposed methods, PR and PD, are weakly related, and the paper doesn't combine the two methods.

  2. The proposed methods introduce communication between LLMs, which will substantially increase computational cost. There is also a lack of discussion about the trade-off between cost and benefits compared to human evaluation. The applicability and utility of the proposed methods are therefore limited.

Considering that the weaknesses outweigh the strengths, I recommend rejecting this paper.

Why Not a Higher Score

Please see the weaknesses listed in the meta-review.

Why Not a Lower Score

N/A

Final Decision

Reject