PaperHub

Overall Rating: 6.0 / 10 (Poster · 3 reviewers · min 6, max 6, std dev 0.0)
Individual Ratings: 6, 6, 6
Confidence: 4.7 · Soundness: 2.7 · Contribution: 2.7 · Presentation: 2.7
ICLR 2025

PiCO: Peer Review in LLMs based on Consistency Optimization

Submitted: 2024-09-26 · Updated: 2025-02-19

Abstract

Keywords
Large Language Model · Unsupervised Evaluation · Peer Review · Consistency Optimization

Reviews and Discussion

Official Review (Rating: 6)

This paper presents an approach to evaluating Large Language Models (LLMs) through a peer-review mechanism without relying on human annotations. The basic idea is to create weights for each model based on the consistency between the weights and the final evaluation results. The proposed PiCO framework shows promising results in aligning LLM rankings with human preferences across various datasets and metrics.

Strengths

  • The paper is mostly well written and easy to understand, except for some specific details (discussed below).
  • The topic of this paper is important, and the proposed method is well described. The idea of using consistency to learn reviewer weights is interesting and well motivated.
  • The authors have conducted extensive experiments to compare their method with SOTA baselines.

Weaknesses

  • The assumption that "a high-level LLM can evaluate others' answers more accurately" may not hold in all cases, but this is not a big problem. The problem is that, once we make this assumption and only use high-level LLMs as reviewers, it may automatically create new types of bias that cause a cold-start problem: if a model's response is not preferred by the current reviewers, it will not get high scores, and then it will not get a high reviewer weight, regardless of whether it is actually a good model.
  • The analysis of bias with PG_hat in Section 3.2 is problematic. Because the confidence weights are all smaller than 1.0, it is not surprising that PG_hat is always smaller than PG, so this cannot prove that the evaluator's bias is reduced.

Questions

  • What’s the meaning of “average loss” in the caption of Figure 5?
  • What’s the meaning of Precision@K and RBP@K in Section 3.4?
  • There are other methods that select reviewers in an unsupervised manner, such as PRE with Auto Exam in the original PRE paper. It would be good to compare the method in this paper with it as well.
  • In Table 2, the variance of multiple baselines is 0, which seems a bit odd. It might be better to explain this.
Comment

Thank you for recognizing our work. We appreciate your valuable feedback, and we would like to address the issues you raised.

Q1. What’s the meaning of “average loss” in the caption of Figure 5?

A1. We apologize for the confusion caused by this. The "average loss" is the negative of the consistency in Eq. 8, i.e., $-\mathrm{Consistency}(G, w)$. For better understanding, we will change "average loss" in Figure 5 to "peer-review system consistency" in our revision.

Q2. What’s the meaning of Precision@K and RBP@K in Section 3.4?

A2. The Precision@K and RBP@K are two popular metrics for evaluating ranking, which are widely used in recommendation systems. We employ these metrics to measure the alignment between the learned LLM ranking and the human-preference ranking.

Q3. There are other methods that select reviewers unsupervised, such as PRE with auto exam in the original PRE paper. It would be good to compare the method in this paper with it as well.

A3. The comparison results are shown in Figure 4, where we compare the performance of selecting reviewers with PRE. Note that PRE is not unsupervised; it requires some human annotations as exam questions. The results in Figure 4 show the superiority of our PiCO approach.

Q4. In Table 2, the variance of multiple baselines is 0, which seems a bit weird. It might be better to explain a bit.

A4. The variance of some baselines is 0 in Table 2 because the learned rankings $\hat{\mathcal{R}}$ are identical across different seeds. We will add an explanation for better understanding.

Comment

While Precision@K has occasionally been used in retrieval evaluation, I have no idea what RBP@K is. Also, Precision@K is not a common metric in top-k recommendation; more common metrics are MRR, NDCG, Hit Ratio, etc. Precision is usually not the main focus of recommendation. Apart from that, metrics like Precision are designed to evaluate result rankings based on labels, not a human-preference ranking. If you are evaluating a ranking against a ground-truth ranking, correctly predicted pairs, Spearman's rank correlation, or Kendall's rank correlation coefficient are more appropriate. It is not clear how Precision@K and RBP@K are computed without pointwise labels.

About PRE, I double-checked the paper and it indeed has a baseline named "PRE only Auto Exam" in Section 5.3. Ignoring such baselines makes me less confident about the results in this paper.

Comment

We appreciate your valuable feedback.


Precision@K in recommendation systems measures the proportion of the top-k recommended items that the user is interested in. Rank-Biased Precision (RBP@K) is a variant that accounts for position bias in the ranking, controlled by a decay factor $p$. In this paper, Precision@K and RBP@K are computed as follows:

$$\text{Precision@K} = \frac{|\hat{\mathcal{R}}[0:K] \cap \mathcal{R}^*[0:K]|}{K},$$

where $|\hat{\mathcal{R}}[0:K] \cap \mathcal{R}^*[0:K]|$ is the number of models shared between the learned top-K ranking $\hat{\mathcal{R}}$ and the ground-truth top-K ranking $\mathcal{R}^*$.

$$\text{RBP@K} = (1-p)\sum_{i=1}^{K} p^{i-1} \times r_i,$$

where $p$ is a hyperparameter usually set to 0.8, and $r_i = 1$ if $\hat{\mathcal{R}}[i] = \mathcal{R}^*[i]$, $r_i = 0$ otherwise.
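For concreteness, a minimal Python sketch of how these two metrics could be computed from a learned and a ground-truth ranking (the model names and the default decay factor p=0.8 are illustrative assumptions, not the paper's code):

```python
def precision_at_k(learned, ground_truth, k):
    """Fraction of the learned top-k models that also appear in the ground-truth top-k."""
    return len(set(learned[:k]) & set(ground_truth[:k])) / k

def rbp_at_k(learned, ground_truth, k, p=0.8):
    """Rank-Biased Precision: rewards exact-position matches, discounted by the decay factor p."""
    total = 0.0
    for i in range(k):
        r_i = 1.0 if learned[i] == ground_truth[i] else 0.0
        total += p ** i * r_i
    return (1 - p) * total

# Toy rankings over hypothetical model names
learned = ["gpt-3.5", "vicuna-13b", "koala-13b", "alpaca-13b"]
truth   = ["gpt-3.5", "koala-13b", "vicuna-13b", "alpaca-13b"]
print(precision_at_k(learned, truth, 3))  # 1.0  (same three models in both top-3 sets)
print(rbp_at_k(learned, truth, 3))        # 0.2  (only position 1 matches exactly)
```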

We believe the above two metrics measure the quality of the learned ranking $\hat{\mathcal{R}}$ from another perspective. However, as you said, when evaluating a ranking against a ground-truth ranking, Spearman's rank correlation and Kendall's rank correlation coefficient are more appropriate. That is also why we use Spearman's and Kendall's rank correlations in our main experiments (Tables 1 and 2). We will add more discussion about this in our final version to ensure better clarity.

About PRE, we apologize for neglecting the unsupervised version of PRE (i.e., "PRE only Auto Exam") in Section 5.3 and Figure 4 of the original PRE paper. In the PRE paper, the authors show that "the performance of PRE with only Auto-Exam is lower than the qualification exam with a subset of manual annotation as ground truth." In other words, "PRE only Auto Exam" cannot outperform "PRE", and the results in Figure 4 of the original PRE paper also validate this. Additionally, we added the "PRE only Auto Exam" baseline to our experiments; the results are as follows:

| Spearman's Rank $S\uparrow$ | Chatbot Arena | MT-Bench | AlpacaEval |
|---|---|---|---|
| PRE | 0.86 | 0.86 | 0.83 |
| PRE only Auto Exam | 0.83 | 0.82 | 0.78 |
| PiCO (ours) | 0.90 | 0.89 | 0.84 |

| Kendall's Rank $\tau\uparrow$ | Chatbot Arena | MT-Bench | AlpacaEval |
|---|---|---|---|
| PRE | 0.71 | 0.68 | 0.64 |
| PRE only Auto Exam | 0.67 | 0.63 | 0.59 |
| PiCO (ours) | 0.77 | 0.72 | 0.68 |

We obtained results similar to those in the original PRE paper: the unsupervised variant "PRE only Auto Exam" performs worse than "PRE". PiCO achieves better performance than both versions of PRE, even though PRE uses supervised annotation information. The suggestion to compare against the "PRE only Auto Exam" baseline is valuable, and we will add these results to our revised paper.


Thank you again for your valuable feedback. We hope this reply can well address your issues. If you have any other questions, we are very willing to answer them for you.

Best regards,

The Authors

Comment

Dear Reviewer jVng,

We hope this message finds you well. We have provided detailed responses to the concerns you raised, especially on the comparison with the new baseline named "PRE only Auto Exam". We would greatly appreciate it if you could review our responses at your earliest convenience. If there are any concerns that you feel have not been fully addressed, we are eager to discuss them with you.

Thank you once again for your valuable time and constructive feedback.


Best regards,

The Authors

Comment

Dear Reviewer jVng,

We hope this message finds you well. We have provided detailed responses to the concerns you raised, including the explanation of Precision@K and RBP@K, and the comparison with the new baseline named "PRE only Auto Exam". We would greatly appreciate it if you could review our responses at your earliest convenience. If there are any concerns that you feel have not been fully addressed, we are eager to discuss them with you.

Thank you once again for your valuable time and constructive feedback.


Best regards,

The Authors

Official Review (Rating: 6)

The paper proposes PiCO, a framework that evaluates LLMs by learning the weights of reviewer LLMs through a peer-review mechanism. The contributions are as follows:

  1. It enhances the peer-review mechanism used in PRD and PRE.
  2. It enables evaluation at a lower cost, as no human annotations are required, unlike in PRE.
  3. The proposed Reviewer Elimination Mechanism is an improvement that can be applied not only to the PiCO model but also to other peer-review methods.

Strengths

  1. By combining learnable weights with existing peer review methods, the evaluation process has become more refined and accurate.
  2. The absence of human annotation makes it more practical for real-world applications.
  3. The Reviewer Elimination Mechanism, though simple, is a powerful idea that can be applied not only to the proposed method in the paper but also to other peer review approaches.

Weaknesses

  1. Baselines. The proposed method does not experimentally demonstrate whether it can serve as a replacement for existing benchmarks. While the paper argues that benchmarks fail to adequately capture human preferences and suffer from benchmark leakage issues, it does not compare the rank similarity of the proposed method with that of existing benchmarks in the experiments.

  2. Evaluations. The proposed method uses the same models as both reviewers and those used for training and evaluation. This is akin to comparing performance using the training loss, and the experimental results do not reflect the performance on unseen models. However, if the training and evaluation were conducted using different models, then the issue lies in the paper's writing, as it fails to include this crucial information.

  3. Writing. The writing lacks academic rigor and professionalism. The notations in Equations (2) and (7) could have been expressed in a more general manner rather than illustrating a specific case. The use of > to denote the set {>,<,=} causes confusion as the set name overlaps with its elements. It is recommended to avoid such overlap for clarity. In Figure 3, while the intended message is conveyed, the exact meaning of the x-axis and y-axis is missing. It is advised to specify these axes more clearly for better precision.

Questions

I suggest that the authors demonstrate whether the trained PiCO assigns higher ranks to models that are superior to the reviewer models used in its training. Also, it would be more realistic if the authors first separate the LLM models used for training and those used for evaluation, and present the results accordingly. Furthermore, while the paper cites Goodhart’s Law in the introduction, it appears that the current experimental setup may be susceptible to the very concerns highlighted by Goodhart’s Law.

Comment

First of all, we appreciate your constructive feedback on our work. However, there seem to be some misunderstandings, especially the misconception that we "use the same models as both reviewers and those used for training and evaluation." We recommend revisiting Section 2 for clarification. Next, we address the issues raised by the reviewer.

Q1. The proposed method uses the same models as both reviewers and those used for training and evaluation. This is akin to comparing performance using the training loss, and the experimental results do not reflect the performance on unseen models.

A1. We highlight that this paper aims to tackle the LLM ranking problem in an unsupervised scenario, similar to a "combinatorial optimization" problem. The goal is to find an optimal set of capability weights $w$ that maximizes the consistency of the entire peer-review system. Therefore, in this "combinatorial optimization" problem there is no separate training and testing process; we only care about whether the unsupervisedly optimized LLM ranking matches the true human-preference ranking as closely as possible.

Q2. I suggest that the authors demonstrate whether the trained PiCO assigns higher ranks to models that are superior to the reviewer models used in its training.

A2. The results are shown in Table 1 (Section 2) and Figure 6a (Section 3). In Table 1, we use two popular ranking metrics (Spearman's rank correlation $S\uparrow$ and Kendall's rank correlation coefficient $\tau\uparrow$) to measure the alignment between the ranking $\hat{\mathcal{R}}$ learned by PiCO and the ground-truth human-preference ranking $\mathcal{R}^*$. Two conclusions can be drawn from Table 1. First, the Forward Weight achieves better results than the Uniform and Backward ones in all cases, while the Backward one always performs worst. This validates that assigning larger weights to models with stronger capabilities yields better results, i.e., the consistency assumption holds. Most importantly, starting from random weights and optimizing the capability weights with PiCO obtains even better ranking performance, which further validates that the proposed approach can assign higher ranks to models that are superior to the reviewer models. More details are given in lines 186~202. In addition, we visualize the learned $w$ in Figure 6a: some superior models (e.g., GPT-3.5, WizardLM-13B, Guanaco-33B, Vicuna-13B) are assigned higher weights, while some weak models (e.g., Alpaca-13B) are assigned lower ones. This is closer to the LLM ranking of real human preferences than other methods.

Q3. The proposed method does not experimentally demonstrate whether it can serve as a replacement for existing benchmarks.

A3. Thanks for your valuable suggestions. We select the widely used benchmarks MMLU and GSM8K to obtain model performance rankings $\hat{\mathcal{R}}$, and calculate Spearman's $S(\uparrow)$ and Kendall's $\tau(\uparrow)$ rank correlations with the human-preference ranking $\mathcal{R}^*$. The results are as follows:

| | Spearman's rank correlation $S(\uparrow)$ | Kendall's rank correlation $\tau(\uparrow)$ |
|---|---|---|
| MMLU | 0.53 | 0.37 |
| GSM8K | 0.32 | 0.15 |
| PiCO (ours) | 0.88 | 0.67 |
These benchmarks can only measure LLMs' specific capabilities on a confined set of tasks and fail to assess their alignment with human preference. This phenomenon has been widely validated in the literature [1,2,3,4,5] and has become a near-consensus in the LLM evaluation community.

Reference

[1] Zhou K, Zhu Y, Chen Z, et al. Don't make your llm an evaluation benchmark cheater[J]. arXiv preprint arXiv:2311.01964, 2023.

[2] Zheng L, Chiang W L, Sheng Y, et al. Judging llm-as-a-judge with mt-bench and chatbot arena[J]. Advances in Neural Information Processing Systems, 2023, 36: 46595-46623.

[3] Chen D, Chen R, Zhang S, et al. Mllm-as-a-judge: Assessing multimodal llm-as-a-judge with vision-language benchmark[J]. arXiv preprint arXiv:2402.04788, 2024.

[4] Jain N, Saifullah K, Wen Y, et al. Bring your own data! self-supervised evaluation for large language models[J]. arXiv preprint arXiv:2306.13651, 2023.

[5] Chang Y, Wang X, Wang J, et al. A survey on evaluation of large language models[J]. ACM Transactions on Intelligent Systems and Technology, 2024, 15(3): 1-45.

Q4. The writing lacks academic rigor and professionalism.

A4. We appreciate your valuable suggestions. We will further polish our paper, making it more academic and readable.

Comment

The responses to Q2 and Q3 were quite satisfactory. In particular, incorporating the content of A3 into the paper would further strengthen the overall argument and impact of the work.

However, the response to Q1 remains insufficient. The reason why evaluating unseen models is crucial is as follows: When we use an evaluation method, its purpose is to assess how much better or worse an unseen model performs compared to existing models, rather than to evaluate models whose rankings are already known. Specifically, after already knowing the rankings of the models, finding an optimal set of capability weights and then claiming that the rankings produced by the trained model align with those known rankings is essentially equivalent to providing the answer sheet beforehand. Additionally, when optimizing the capability weights, you should ensure that some models are not used as reviewers, nor should their rankings be provided during training, and then demonstrate whether the learned weights can correctly assess the rankings of those models that were not used. As a result, the comparisons made in the experiments in this paper are inherently biased and fail to reflect how effective the evaluation method would actually be when applied in real-world scenarios.

I encourage you to reflect on why countless deep learning papers adopt the approach of splitting data into training and test sets for training and evaluation. The fundamental reason is to ensure that the model’s performance is measured on unseen data, thus providing a reliable indication of its generalization capability. If a model were evaluated on the same data it was trained on, it would only demonstrate how well it has memorized that specific dataset, rather than its ability to generalize to new, unseen instances.

In the context of your study, failing to evaluate unseen models essentially means that the method is optimized for known cases, which can lead to overfitting—a model that appears highly effective on known models but fails to perform well on genuinely new ones. This is precisely why splitting data into separate training and test sets is a standard practice in the deep learning community. It ensures that the evaluation reflects the model’s true capabilities in a fair and unbiased manner.

Comment

Thank you again for your valuable feedback. We are very sorry for the mistake in our previous rebuttal: PiCO does have a testing process. Below, we explain the whole process in more detail to help clarify it.

First of all, we need to emphasize that the whole training process of PiCO is unsupervised. PiCO does not use any ground-truth ranking information $\mathcal{R}^*$ or human-preference annotations during training. As shown in Eq. 8, PiCO assigns each model a random weight $w$ and maximizes the consistency of $w$ and $G$, where $G = \sum_{(A_i^j, A_i^k, >, w^s)} \mathbf{1}(A_i^j > A_i^k) \cdot w^s$. The quadruple $(A_i^j, A_i^k, >, w^s)$ indicates that the "reviewer" model $M_s$ believes answer $A_i^j$ is better than answer $A_i^k$, with a learnable confidence $w^s$. It is worth noting that the whole training process is unsupervised, without any ground-truth ranking information $\mathcal{R}^*$.

The ground-truth ranking $\mathcal{R}^*$ is used only at test time, to measure the alignment of the LLM ranking $\hat{\mathcal{R}}$ learned by PiCO. For this, we use the two most popular ranking metrics, Spearman's rank correlation $S(\mathcal{R}^*, \hat{\mathcal{R}})$ and Kendall's rank correlation coefficient $\tau(\mathcal{R}^*, \hat{\mathcal{R}})$, which are widely used in recommendation systems.

In short, during training, PiCO learns to re-rank the LLMs and produces $\hat{\mathcal{R}}$ in an unsupervised way. During testing, we need a human-preference ranking $\mathcal{R}^*$ as a "ground-truth label" to evaluate the quality of the learned ranking $\hat{\mathcal{R}}$.
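To make the two stages concrete, here is a minimal, self-contained sketch with synthetic pairwise judgments, a simple fixed-point update, and a hypothetical human ranking; the actual PiCO objective and optimizer in the paper may differ:

```python
import numpy as np
from scipy.stats import spearmanr, kendalltau

rng = np.random.default_rng(0)
n_models = 5

# votes[s, j, k]: how often reviewer s judged model j's answer better than model k's
votes = rng.integers(0, 10, size=(n_models, n_models, n_models)).astype(float)
votes[:, np.arange(n_models), np.arange(n_models)] = 0.0   # no self-comparisons

# --- training (unsupervised): no ground-truth ranking is used here ---
w = np.full(n_models, 1.0 / n_models)        # confidence weights, initialised uniformly
for _ in range(100):
    # scoring stage: G_j aggregates judgments in favour of model j, weighted by reviewer confidence
    G = np.einsum("s,sjk->j", w, votes)
    # consistency stage: nudge w toward the normalised scores so that models
    # that score higher also receive higher reviewer confidence
    w = 0.9 * w + 0.1 * (G / G.sum())

# --- testing: compare the learned ranking with a human-preference ranking ---
learned_rank = np.argsort(np.argsort(-G))    # rank position of each model (0 = best)
human_rank = np.array([1, 3, 0, 4, 2])       # hypothetical ground-truth rank positions
print(spearmanr(learned_rank, human_rank))
print(kendalltau(learned_rank, human_rank))
```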


We hope this reply helps you better understand the whole process of PiCO. If you have any other questions, we are very willing to answer them for you.

Comment

I understand now—you're saying that Equation 8 does not use R* when maximizing the consistency between G and w. That makes sense.

However, I would like to point out that the statements in the introduction, such as "and our goal is to optimize the confidence weights w that re-rank the LLMs to be closer to human rankings," as well as the expression in Figure 2, "In the consistency optimization stage, we update the parameter w by maximizing the consistency of each LLM's capability and score, while re-ranking the LLMs to be closer to human rankings," are quite misleading. These expressions can easily be interpreted as if you used R* to optimize w.

If, as you mentioned, R* is truly not used in Equation 8, it would be clearer to state that the optimization of w results in rankings that naturally align with human preferences, rather than implying that the optimization is done specifically to match human rankings.

Comment

Thank you for your constructive feedback on our work. We are very sorry for the misleading statement in the introduction. We will re-polish the expression in this section to make it better understood. Additionally, your suggestions on whether existing benchmarks can be replaced (Q3) are constructive, and we will add these results and discussions to our revised version.

We hope this addresses all of your concerns. Would you consider raising your final score?

Comment

Your response has sufficiently addressed my concerns, so I have revised the scores. Specifically, the scores for Soundness, Presentation, and the overall Rating have been updated. Please check these changes.

Comment

We would like to thank the reviewer for raising the score to a positive 6. Your suggestions are constructive and help us improve the quality of this paper.


Best regards,

The Authors

Official Review (Rating: 6)

This paper presents an unsupervised evaluation method for large language models (LLMs) using a peer-review mechanism that operates without human feedback. This method, PiCO, allows LLMs to evaluate each other’s answers to unlabeled questions within a shared environment. The core idea is the consistency assumption, where a model’s evaluation ability correlates with its performance, leading to an optimization of model confidence weights to align LLM rankings more closely with human judgments. The effectiveness of PiCO is validated through extensive experiments.

Strengths

  1. The paper introduces a novel peer review mechanism for autonomously evaluating large language models (LLMs), which represents a significant advancement over traditional human-centric review processes. This framework leverages the models themselves as evaluators to assess each other’s answers, potentially increasing the scalability and efficiency of LLM evaluations.

  2. The paper is well-written with clear and concise explanations, accompanied by informative figures and tables. These visual aids and structured content greatly enhance the readability and accessibility of the complex concepts discussed, making the methodology and results understandable to a broader audience.

  3. The use of multiple well-known datasets like Chatbot Arena, MT-Bench, and AlpacaEval for testing not only demonstrates the practical applicability of the PiCO method but also underscores its effectiveness in aligning model evaluations with human judgment, thus validating the proposed approach through rigorous empirical evidence.

Weaknesses

  1. The paper does not provide detailed information on how the ground truth rankings in the experiments are established, which could impact the transparency and reproducibility of the research. This lack of detail may hinder other researchers' ability to fully replicate or build upon the work.

  2. The effectiveness of the peer review process heavily relies on the design of the questions and prompts, which can significantly affect the consistency and scalability of the evaluations. Variability in the outcomes may arise as different prompts may not equally elicit discriminative responses across models. This reliance on prompt design poses a challenge to the general applicability and scalability of using a large number of LLMs for peer review, as consistency in evaluation may not be universally achievable across diverse question sets.

  3. Although the paper presents a novel method for evaluating LLMs, it does not discuss specific application scenarios or domains where this method could be particularly beneficial. This omission limits understanding of the potential real-world impact and practical utility of the PiCO method.

Questions

  1. How is the ground truth ranking precisely defined and established for each dataset?
  2. How can we mitigate uncertainty and bias in the evaluations conducted by models?
  3. What is the application scenario of PiCO?
Comment

We appreciate your constructive suggestions. Next, we will address your issues.

Q1. The paper does not provide detailed information on how the ground truth rankings in the experiments are established, which could impact the transparency and reproducibility of the research.

A1. We use ground-truth rankings derived from human-preference annotations collected on crowdsourced battle platforms. For example, the following ground-truth ranking was collected by Chatbot Arena [1], and the ranking can be reproduced with the official source code in [2, 3]. The other two datasets are handled similarly. We will add this detailed information in our revision.

| Rank | Model |
|---|---|
| #1 | gpt-3.5-turbo |
| #2 | guanaco-33b-merged |
| #3 | vicuna-13b-v1.5 |
| #4 | WizardLM-13B-V1.2 |
| #5 | vicuna-7b-v1.5 |
| #6 | koala-13b |
| #7 | gpt4all-13b-snoozy |
| #8 | mpt-7b-chat |
| #9 | oasst-sft-4-pythia-12b-epoch-3.5 |
| #10 | alpaca-13b |
| #11 | fastchat-t5-3b-v1.0 |
| #12 | chatglm-6b |
| #13 | stablelm-tuned-alpha-7b |
| #14 | dolly-v2-12b |
| #15 | llama-13b |

[1] Judging llm-as-a-judge with mt-bench and chatbot arena. Advances in Neural Information Processing Systems, 2023.

[2] https://huggingface.co/datasets/lmsys/chatbot_arena_conversations

[3] https://colab.research.google.com/drive/1J2Wf7sxc9SVmGnSX_lImhT246pxNVZip?usp=sharing#scrollTo=aq3NlxirIyb7
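As a side note for readers unfamiliar with how such crowdsourced rankings are usually produced, the sketch below shows a simple Elo-style aggregation of pairwise battle outcomes into a ranking. It is only an illustration with made-up battles and default Elo constants, not the official Chatbot Arena ranking script referenced above:

```python
from collections import defaultdict

def elo_ranking(battles, k=4, base=10, scale=400, init=1000):
    """battles: iterable of (model_a, model_b, winner), winner in {"a", "b", "tie"}."""
    rating = defaultdict(lambda: init)
    for a, b, winner in battles:
        expected_a = 1 / (1 + base ** ((rating[b] - rating[a]) / scale))
        score_a = {"a": 1.0, "b": 0.0, "tie": 0.5}[winner]
        rating[a] += k * (score_a - expected_a)
        rating[b] += k * ((1 - score_a) - (1 - expected_a))
    return sorted(rating, key=rating.get, reverse=True)

# Toy battles between a few of the models listed above
battles = [
    ("gpt-3.5-turbo", "alpaca-13b", "a"),
    ("vicuna-13b-v1.5", "alpaca-13b", "a"),
    ("gpt-3.5-turbo", "vicuna-13b-v1.5", "a"),
]
print(elo_ranking(battles))  # ['gpt-3.5-turbo', 'vicuna-13b-v1.5', 'alpaca-13b']
```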

Q2. The effectiveness of the peer review process heavily relies on the design of the questions and prompts. How can we mitigate uncertainty and bias in the evaluations conducted by models?

A2. The consistency of the peer-review process is stable across questions, as we validate PiCO on multiple crowdsourced datasets (as shown in Table 1) whose question distributions vary greatly. The prompt template can be found in lines 787~847. We also varied the wording of the prompt template and repeated the PiCO experiment several times, as follows:

Template 1
  • System prompt: Please act as a judge and evaluate the quality of the responses provided by two AI assistants to the user question displayed below. You do not need to explain, just give your judgment. Output your final verdict by strictly following this format: "[[A]]" if assistant A is better, "[[B]]" if assistant B is better, and "[[C]]" for a tie.
  • User Question: {question}
  • Assistant A’s Answer: {answer a}
  • Assistant B’s Answer: {answer b}

Template 2
  • System prompt: Please evaluate the quality of the following responses provided by two AI assistants. Output your final verdict by strictly following this format: "[[A]]" if assistant A is better, "[[B]]" if assistant B is better, and "[[C]]" for a tie.
  • User Question, Assistant A’s Answer, and Assistant B’s Answer: same as Template 1.

Template 3
  • System prompt: Please act as a judge and evaluate the quality of the responses provided by two AI assistants to the user question displayed below. Output your final verdict by strictly following this format: "[[A]]" if assistant A is better, "[[B]]" if assistant B is better, and "[[C]]" for a tie.
  • User Question, Assistant A’s Answer, and Assistant B’s Answer: same as Template 1.

The final LLM rankings $\hat{\mathcal{R}}$ learned with the different system prompts are as follows. The proposed PiCO approach is stable across different system prompts.

| Chatbot Arena | Spearman's Rank $S\uparrow$ | Kendall's Rank $\tau\uparrow$ |
|---|---|---|
| PiCO (Template 1) | 0.90 | 0.77 |
| PiCO (Template 2) | 0.89 | 0.77 |
| PiCO (Template 3) | 0.90 | 0.78 |

In addition, to further mitigate evaluation bias and uncertainty caused by the order of answers, in all our experiments, each battle is evaluated twice in the forward and reverse order as follows.

| Template (Forward) | Template (Reverse) |
|---|---|
| System prompt: {same as Template 1} | System prompt: {same as Template 1} |
| User Question: {question} | User Question: {question} |
| Assistant A’s Answer: {answer a} | Assistant B’s Answer: {answer b} |
| Assistant B’s Answer: {answer b} | Assistant A’s Answer: {answer a} |
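A minimal sketch of this order-swapped judging is shown below; `query_judge` and the rule for handling order-inconsistent verdicts are hypothetical placeholders for illustration, not necessarily the paper's exact implementation:

```python
def judge_pair(question, answer_a, answer_b, query_judge):
    """Query the judge twice with the answer order swapped to cancel out position bias.

    query_judge(question, first_answer, second_answer) is assumed to return
    "A", "B", or "C" (tie) for the answers in the order they were presented.
    """
    forward = query_judge(question, answer_a, answer_b)
    reverse = query_judge(question, answer_b, answer_a)
    swapped = {"A": "B", "B": "A", "C": "C"}[reverse]   # map the reversed verdict back
    # keep the verdict only if both orderings agree; otherwise treat the battle as a tie
    return forward if forward == swapped else "C"
```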

Q3. It does not discuss specific application scenarios or domains where this method could be particularly beneficial. What is the application scenario of PICO?

A3. This paper performs experiments on three of the most popular crowdsourced datasets. On these three datasets, we validate that LLMs can employ the peer-review mechanism for good self-evaluation. We therefore believe this is a general phenomenon that applies across dataset domains rather than a special case. Recently, we applied PiCO to the evaluation of large vision-language models (LVLMs) and observed similar results. On the other hand, PiCO works in an unsupervised setting, so it is especially beneficial for LLM evaluation in scenarios where human-preference annotation is very expensive, as it can produce a good LLM ranking without any real human annotation.

Comment

Dear Reviewer fWmA,

We would like to once again thank you for your valuable time and constructive comments. This is a gentle reminder that we have diligently addressed your concerns and clarified any confusion. We have not yet heard back from you and would appreciate any further feedback you may have. If there are any concerns that you feel have not been fully addressed, we are eager to discuss them with you.

Best regards!

Authors

Comment

Dear Reviewer fWmA,

I hope this message finds you well. We have provided detailed responses to the concerns you raised and believe they comprehensively address all the points.

Thank you once again for your valuable time and constructive feedback. We would greatly appreciate it if you could review our responses at your earliest convenience. We look forward to hearing from you.

Best regards!

The Authors

Comment

Thank you for your explanation and response. Also, sorry for my late reply. Your explanation has addressed my questions to some extent. I have boosted the score and hope to see more relevant discussions in the revised version, such as analyses or examples of real-world application scenarios.

Comment

We greatly thank the reviewer for raising the score to a positive 6. Your suggestions are constructive and help us improve the quality of this paper. We will add more discussion analyzing real-world application scenarios in our revised paper.

Best regards!

The Authors

Comment

Dear Reviewer fWmA,

We hope this message finds you well. We have provided detailed responses to the concerns you raised. We would greatly appreciate it if you could review our responses at your earliest convenience. If there are any concerns that you feel have not been fully addressed, we are eager to discuss them with you.

Thank you once again for your valuable time and constructive feedback.

Best regards,

The Authors

AC Meta-Review

In this paper, the authors propose an unsupervised evaluation approach for assessing LLMs, adopting a peer-review paradigm without human feedback, where LLMs evaluate each other. Overall, reviewers appreciated the novelty of the idea, found the paper well-written, and considered the experiments comprehensive. I believe this paper is of good quality and can be accepted.

Additional Comments from the Reviewer Discussion

There was some discussion during the rebuttal phase about clarifying the descriptions and assumptions; the authors generally answered these questions well, and several reviewers increased their scores after reading the rebuttal.

Final Decision

Accept (Poster)