PaperHub
Overall: 3.7 / 10 · Rejected · 3 reviewers
Ratings: 3, 5, 3 (min 3, max 5, std dev 0.9)
Confidence: 3.0 · Correctness: 2.7 · Contribution: 2.0 · Presentation: 2.3
ICLR 2025

Enabling Weak LLMs to Judge Response Reliability via Meta Ranking

Submitted: 2024-09-26 · Updated: 2025-02-05
TL;DR

We developed a practically effective and efficient few-shot method, Meta Ranking, to judge the reliability of LLM responses with weak language models.

Abstract

Despite the strong performance of large language models (LLMs) across a wide range of tasks, they still have reliability issues. Previous studies indicate that strong LLMs like GPT-4-turbo excel at evaluating the reliability of responses from LLMs, but face efficiency and local-deployment issues. Thus, to enable weak LLMs to effectively assess the reliability of LLM responses, we propose a novel cross-query-comparison-based method called Meta Ranking (MR). Unlike previous few-shot methods that rely solely on the in-context learning capabilities of LLMs, MR assesses reliability by pairwise ranking the target query-response pair against multiple reference query-response pairs. We find that MR is highly effective in error detection for LLM responses: MR with weaker LLMs, which have lower task performance, achieves higher judgement precision than baselines with the same or even stronger models. Moreover, the method requires as few as five reference samples, significantly improving efficiency. We further demonstrate that MR can enhance strong LLMs' performance in two practical applications: model cascading and instruction tuning. In model cascading, we combine open- and closed-source LLMs to achieve performance comparable to GPT-4-turbo at lower cost. In instruction tuning, we use MR for iterative training data filtering, significantly reducing data processing time and enabling LLaMA-7B and Phi-2 to surpass 13B models with fewer training tokens. These results underscore the high potential of MR in both efficiency and effectiveness.
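
As an illustration of the model-cascading use case described in the abstract, the sketch below shows how an MR-style reliability judgment could gate routing between a cheap and an expensive model. The `weak_llm`, `strong_llm`, and `mr_judge` callables are hypothetical placeholders, not the paper's implementation.

```python
from typing import Callable

def cascade(
    query: str,
    weak_llm: Callable[[str], str],
    strong_llm: Callable[[str], str],
    mr_judge: Callable[[str, str], bool],
) -> str:
    """Answer with the cheap model first; escalate to the expensive model
    only when Meta Ranking judges the cheap response unreliable."""
    response = weak_llm(query)
    if mr_judge(query, response):   # MR deems the weak response reliable
        return response
    return strong_llm(query)        # otherwise fall back to the stronger LLM
```
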
Keywords
LLM Response Judgement, LLM Reliability, Weak Language Model, Model Cascading, Data Selection

Reviews and Discussion

Review
Rating: 3

The innovation of this paper lies in proposing a novel Meta Ranking method that effectively addresses the reliability problem of weak LLMs in response evaluation, and in demonstrating its application potential through practical scenarios such as model cascading and instruction fine-tuning.

Strengths

  1. MR uses a weighted voting system that compares target and reference pairs, incorporating reliability scores for accurate response assessment.
  2. MR outperforms traditional uncertainty methods (e.g., entropy, confidence) with better generalization and robustness across tasks and datasets.
  3. Cross-query comparisons reduce overconfidence in LLMs, improving accuracy, especially in weak models.

Weaknesses

  1. While there are theoretical innovations, they lack sufficient depth. The paper does not clearly explain why the MR voting mechanism leads to improved performance and judgment efficiency.
  2. Although experiments demonstrate the method's effectiveness, the overall scale is limited. There are not enough experiments exploring how different MR configurations and settings impact performance improvement.
  3. Despite introducing two meaningful applications, their content is insufficient to fully enrich the second half of the paper and lacks substantial innovation. More exploration is needed to highlight their significance and novelty.

Questions

In Figure 3, after the introduction of MR, can the inference time of the same large language model drop so much just because the input text length becomes shorter?

Review
Rating: 5

This paper proposes Meta Ranking, a method to evaluate the reliability of LLM responses with weak LLMs. Before this work, there were two major paradigms for evaluating LLM reliability:

  • Zero-shot: given a $(Q, R)$ pair, the LLM evaluates it and generates an evaluation score. A weak LLM might not be able to provide a correct judgment.
  • In-context learning: given the $(Q, R)$ pair plus reference pairs $(Q_r, R_r)$, concatenate the reference pairs with the query-response pair as input, and the LLM gives the judgment. The order of concatenation matters.

The proposed method uses each reference pair separately, without concatenation, to reduce the effect of permutation. The weak LLM takes each $(Q, R, Q_r, R_r)$ tuple as input and generates a score. For $N$ reference pairs, the LLM runs $N$ inference passes to obtain $N$ scores. Then voting-based aggregation is used to generate the final evaluation score for this pair.
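
A minimal sketch of this loop, assuming a hypothetical `rank_pair` call for the weak LLM's pairwise judgment and a simple vote that uses the known reliability labels of the reference pairs; the paper's exact scoring and aggregation may differ.

```python
from typing import Callable, List, Tuple

# A query-response pair; reference pairs carry a known reliability label
# (1 = reliable, 0 = unreliable).
Pair = Tuple[str, str]

def meta_rank(
    target: Pair,
    references: List[Tuple[Pair, int]],
    rank_pair: Callable[[Pair, Pair], int],
) -> int:
    """Judge the reliability of `target` by comparing it against each
    reference pair separately and aggregating the outcomes by voting.

    `rank_pair(a, b)` is assumed to return +1 if the weak LLM judges `a`
    more reliable than `b`, -1 if less reliable, and 0 for a tie.
    """
    votes = 0
    for ref_pair, ref_label in references:    # N separate inference passes
        outcome = rank_pair(target, ref_pair)
        if ref_label == 1:
            # At least as reliable as a known-good reference: positive evidence
            votes += 1 if outcome >= 0 else -1
        else:
            # No better than a known-bad reference: negative evidence
            votes += -1 if outcome <= 0 else 1
    return 1 if votes > 0 else 0              # final binary reliability judgment
```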

The proposed method is evaluated on major benchmark datasets in terms of micro precision scores to show the improvement, and tested in two applications: model cascading and instruction tuning.

Strengths

  • The proposed method is simple and easy to understand.
  • Solid evaluation showing the improvement in model reliability evaluation, both directly and through downstream applications.

Weaknesses

  • Novelty seems to be limited. Though it is called meta-ranking, the actual method can be viewed as a special case of ICL: instead of concatenating $N$ reference pairs in one query and sending it to the LLM, each reference pair plus the query and response to be evaluated is sent to the LLM separately.
  • The definition of "Reliability" is vague. The definition in this paper is "attributes such as correctness and quality as required by the context." As it's not a strict definition, it's hard to evaluate the results and understand the improvement.
  • Efficiency concerns. The proposed method is not scalable. If there are $N$ references, then $N$ inference passes are needed. Compared with the original ICL method, this increases the computational complexity from $O(1)$ to $O(N)$. The paper mentions that "In practice, N is usually small due to efficiency and the limited labeled data.", but this is a strong assumption and might not always hold.

Questions

  • Is it possible to give a more rigorous definition of reliability? Since this is the core metric evaluated in this work, a precise definition would be very helpful.
Review
Rating: 3

This paper proposes a novel cross-query-comparison-based method called Meta Ranking (MR). MR assesses reliability by pairwise ranking the target query-response pair with multiple reference query-response pairs. The method is highly effective in error detection and can enhance strong models in model cascading and instruction tuning.

Strengths

  • The proposed method is novel and demonstrates superior performance in error detection, model cascading and instruction tuning.

  • The method is effective even with weaker LLMs, showcasing the potential for weak-to-strong generalization.

Weaknesses

  • Section 4 is poorly written and difficult to follow.

  • Even with a weaker model, the proposed method significantly increases the inference cost. Therefore, it should be compared with other inference-scaling methods such as self-consistency.

  • The paper should include more recent models such as LLaMA-3 and GPT-4o-(mini).

  • Table 1

    • It’s weird that Phi-2 obtains zero accuracy on GSM8K and performs close to random guessing (~25%) on MMLU and CMMLU.
    • It should at least include the task performance of the base models for comparison, i.e. LLaMA-2, ChatGLM-2 and Yi.
    • What does "Random Selection" mean?
    • If MR uses 5 few-shot samples (some are correct and some are incorrect), do you also include 5 few-shot samples (same questions, but all responses are correct) for the baselines (TP of Phi-2, P(T) w/ OpenChat-3.5, etc)? Otherwise, it’s difficult to tell if the benefit is from MR or the addition of the 5 few-shot samples.
  • In Figure 3, why does P(T) w/ Phi-2 have a much higher inference time than MR w/ Phi-2? MR requires the model to compare N=5 pairs, while P(T) just directly outputs the correctness of the query.

  • In Figure 4, the performance increases as the number of reference pairs increases. What would the curve look like when it is greater than 5?

Questions

Please see weaknesses.

AC Meta-Review

The submission proposes a method to evaluate the reliability of LLM responses with weak LLMs. Reviewers were unanimous that the submission is not of sufficient quality for acceptance to ICLR. Concerns included insufficient depth of the contribution, increased inference cost, and writing quality. The authors did not provide a rebuttal.

Additional Comments on Reviewer Discussion

The authors did not provide a rebuttal, and the reviewers were unanimous that the submission should not be accepted.

Final Decision

Reject