Rating: 7.3 / 10 (Spotlight)
4 reviewers; lowest 5, highest 8, standard deviation 1.3
Individual ratings: 5, 8, 8, 8
Confidence: 3.3
ICLR 2024

Large Language Models Are Not Robust Multiple Choice Selectors

Submitted: 2023-09-15 | Updated: 2024-03-06
TL;DR

We investigate LLMs' bias and robustness in multiple choice evaluation, and propose an efficient, interpretable, and transferable debiasing method.

Abstract

Keywords
large language model, bias, robustness, multiple choice question, evaluation

Reviews and Discussion

Official Review
Rating: 5

The paper studies the issue of sensitivity to answer option order in large language models (LLMs), which can lead to biased predictions. It introduces a new method called PriDe to mitigate this sensitivity by estimating and correcting for the model's bias during inference. The results show that PriDe can reduce the prediction sensitivity.

Strengths

  • This paper studies LLMs' sensitivity to the order of answer options, which is an important problem in current LLM evaluation, and provides an empirical analysis of the underlying reasons.
  • The proposed method PriDe operates at test time without introducing extra computational cost, which is suitable for current LLMs.
  • The authors conduct extensive experiments including different models, tasks, ablation studies, cross-task evaluation, etc.

Weaknesses

  • The proposed method requires sampling test samples first to estimate the prior, which may introduce another dimension of sensitivity: the selection of the test samples. The accuracy of this estimation might vary based on the quality and representativeness of these samples.
  • I understand the procedure of cyclic permutation and full permutation, but how are they used as the debiasing methods? Do the authors take the best result of the permutations as the prediction?
  • The authors use the balance of recalls and Rstd as the major metrics throughout the paper. Can the authors formally define this? I didn't immediately get it.
  • The writing and presentation need more improvement, e.g., I think the proposed PriDe is quite intuitive but the authors introduce too many unnecessary notations ($d_i, o_i, g_i, x_i, \ldots$) before getting into the real introduction of the method, which makes the reading difficult.

Questions

  • The proposed method basically follows estimate-then-mitigate, which is somewhat similar to the calibrate-before-use (Zhao et al. ICML 2021) paper, though this one targets a different setting and is not directly applicable to MCQs. But it would be interesting to compare the differences and know if calibrate-before-use can also help with MCQ sensitivity.
Comment

The writing and presentation need more improvement, e.g., I think the proposed PriDe is quite intuitive but the authors introduce too many unnecessary notations ($d_i, o_i, g_i, x_i, \ldots$) before getting into the real introduction of the method, which makes the reading difficult.

We would like to clarify the reasons for introducing these formal notations before the proposed method. We attempted to introduce the proposed method first and then intersperse or supplement the introduction of the permutation-based baseline on which our method relies, but we found that this compromised the coherence of the presentation. We also found that without these formal notations, the writing became very repetitive and lengthy (we would have to repeatedly use the same terms to avoid ambiguity). Additionally, we believed that formal notations could aid in deriving general solution forms. All of the above led us to adopt the current notations and writing logic.

If you believe there is a better way to present or structure the content, we would greatly appreciate it and be open to taking your suggestions!

The proposed method basically follows estimate-then-mitigate, which is somewhat similar to the calibrate-before-use (Zhao et al. ICML 2021) paper, though this one targets a different setting and is not directly applicable to MCQs. But it would be interesting to compare the differences and know if calibrate-before-use can also help with MCQ sensitivity.

We are willing to discuss the difference from Contextual Calibration (Zhao et al.). We make a preliminary attempt to adapt Contextual Calibration to MCQ debiasing as follows:

  1. For each test sample, we use the default input and obtain the prediction distribution $\mathbf{p}$.
  2. We then replace all the options with the same content-free text: the null string '', N/A, or [MASK], as in Zhao et al., and estimate the model's prediction distribution over the option IDs, denoted as $\mathbf{p}_0$ (we use all the content-free texts and take the average of their $\mathbf{p}_0$).
  3. We use $\mathbf{p}/\mathbf{p}_0$ after normalization as the "calibrated" prediction distribution, as done in Zhao et al. (a minimal sketch of this adaptation is given below).
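For concreteness, here is a small sketch of the three steps above in code. It is only an illustration: the function name, the example probabilities, and the way the model is queried are placeholders rather than our actual implementation.

```python
import numpy as np

def contextual_calibration_mcq(p, p0_list):
    """Calibrate an MCQ prediction distribution following Zhao et al. (2021).

    p       -- prediction distribution over option IDs for the original question
    p0_list -- prediction distributions obtained after replacing all option
               contents with content-free texts ('', 'N/A', '[MASK]')
    """
    p = np.asarray(p, dtype=float)
    p0 = np.mean(np.asarray(p0_list, dtype=float), axis=0)  # average over content-free texts
    calibrated = p / p0                                      # divide out the content-free "prior"
    return calibrated / calibrated.sum()                     # renormalize to a distribution

# Hypothetical numbers for a 4-option question (A/B/C/D)
p = [0.10, 0.55, 0.25, 0.10]
p0_list = [[0.15, 0.45, 0.25, 0.15],   # options replaced by ''
           [0.20, 0.40, 0.25, 0.15],   # options replaced by 'N/A'
           [0.18, 0.42, 0.25, 0.15]]   # options replaced by '[MASK]'
print(contextual_calibration_mcq(p, p0_list))
```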

So, from the perspective of implementation, Contextual Calibration is similar to PriDe. The key difference lies in how we estimate $\mathbf{p}_0$, which we refer to as the "prior" in our work. The results are shown below (gpt-3.5-turbo-0613, 0-shot ARC, for a quick verification):

| Methods | RStd | Acc |
| --- | --- | --- |
| Default | 3.3 | 84.3 |
| PriDe ($\alpha=5\%$) | 2.3 | 84.2 |
| Contextual Calibration | 4.8 | 83.1 |

We find that Contextual Calibration fails to mitigate selection bias (RStd) and may impair model performance (Acc). It implies that the "prior" ($\mathbf{p}_0$) estimated by Contextual Calibration cannot reflect the model's selection bias in MCQs and may also be hard to interpret.

Comment

Thanks for your constructive comments! We address your concerns or questions as follows.

The proposed method requires sampling test samples first to estimate the prior, which may introduce another dimension of sensitivity: the selection of the test samples. The accuracy of this estimation might vary based on the quality and representativeness of these samples.

Good question! While we sample test samples mainly for the purposes of our experiments, in practice we expect the estimated prior to be relatively insensitive to the selection of estimation samples. We here supplement statistics from the 5 runs of PriDe (0-shot, $\alpha=5\%$). We present the mean (as reported in our paper) ∆ RStd and ∆ Acc as well as their best/worst values (averaged over all the 20 LLMs):

| Benchmarks | ∆ RStd (mean) | ∆ RStd (best ↓) | ∆ RStd (worst ↑) | ∆ Acc (mean) | ∆ Acc (best ↑) | ∆ Acc (worst ↓) |
| --- | --- | --- | --- | --- | --- | --- |
| MMLU | -7.6 | -8.0 | -7.3 | 1.2 | 1.3 | 1.1 |
| ARC | -5.6 | -6.8 | -4.3 | 1.3 | 1.7 | 0.8 |
| CSQA | -6.9 | -7.7 | -6.0 | 1.7 | 2.2 | 1.3 |

We observe that the selection of estimation samples may introduce slight fluctuations in debiasing results, and even in the worst-case scenario, PriDe still leads to a notable debiasing performance (decrease in RStd and increase in Acc). Therefore, we believe PriDe's sensitivity to the selection of estimation samples lies within an acceptable range and does not obscure its merit (effectiveness and efficiency).

I understand the procedure of cyclic permutation and full permutation, but how are they used as the debiasing methods? Do the authors take the best result of the permutations as the prediction?

Cyclic and Full Permutation can be viewed as having a debiasing effect, as they involve swapping options and averaging prediction distributions over different permutations. They can intuitively mitigate the model's bias for option IDs or options' ordering positions. This is similarly done in recent work [1] and [2], where they swap candidate responses to mitigate GPT-4's evaluation bias.

For Full Permutation, there is only one possible permutation set (i.e., all possible permutations). For Cyclic Permutation, there can be multiple possible permutation sets (as long as each option ID is paired with each option content exactly once). The selection of cyclic permutation sets is not our focus, as our method PriDe can be directly combined with any reasonable cyclic permutation set. In the main text, we use the simplest and most intuitive set for Cyclic Permutation, e.g., $\{(1,2,3,4), (2,3,4,1), (3,4,1,2), (4,1,2,3)\}$ for 4-option MCQ tasks. We show in Section 3.1 and Figure 16 in Appendix F that selecting other cyclic permutation sets leads to similar debiasing results.

[1] Wang, Peiyi, et al. "Large language models are not fair evaluators." arXiv preprint arXiv:2305.17926 (2023).

[2] Zheng, Lianmin, et al. "Judging LLM-as-a-judge with MT-Bench and Chatbot Arena." arXiv preprint arXiv:2306.05685 (2023).
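To make the debiasing use of a permutation set concrete, below is a rough sketch of averaging prediction distributions over permutations and mapping them back to option contents. The helper `get_option_probs` and the 0-indexed permutations are illustrative placeholders, not the exact implementation in the paper.

```python
import numpy as np
from itertools import permutations

def permutation_debias(question, options, get_option_probs, perm_set):
    """Average prediction distributions over a permutation set.

    get_option_probs(question, ordered_options) -> distribution over option IDs
    (position 0 = 'A', 1 = 'B', ...) for that particular option ordering.
    """
    content_probs = np.zeros(len(options))
    for perm in perm_set:
        ordered = [options[j] for j in perm]          # option perm[i] is shown at position i
        p = get_option_probs(question, ordered)
        for pos, orig_idx in enumerate(perm):
            content_probs[orig_idx] += p[pos]         # credit probability back to the content
    content_probs /= len(perm_set)
    return int(np.argmax(content_probs))              # debiased prediction (original option index)

def cyclic_perms(n):
    # e.g. n=4: (0,1,2,3), (1,2,3,0), (2,3,0,1), (3,0,1,2)
    return [tuple((i + k) % n for i in range(n)) for k in range(n)]

def full_perms(n):
    return list(permutations(range(n)))
```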

The authors use the balance of recalls and Rstd as the major metrics throughout the paper. Can the authors formally define this? I didn't immediately get it.

Sure! The recall of an option ID $d_i$ is defined as

$$\mathrm{Recall}(d_i) = \frac{\#(\text{correct answer is } d_i \text{ and prediction is } d_i)}{\#(\text{correct answer is } d_i)} \times 100\%,$$

while RStd (the standard deviation of recalls) is

$$\mathrm{RStd} = \mathrm{Std}\big(\{\mathrm{Recall}(d_i)\}_{i=1}^n\big) = \sqrt{\frac{\sum_{i=1}^n \big(\mathrm{Recall}(d_i) - \mu\big)^2}{n}}, \quad \text{where } \mu = \frac{1}{n}\sum_{i=1}^n \mathrm{Recall}(d_i).$$

Our motivation for using this measurement is explained in Section 2.2.
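For readers who prefer code, here is a minimal sketch of computing the per-option-ID recalls and RStd from predictions and gold labels; the toy data are hypothetical, with options encoded as integer indices.

```python
import numpy as np

def recall_std(gold, pred, n_options):
    """Return RStd (population std of per-option-ID recalls, in %) and the recalls."""
    gold, pred = np.asarray(gold), np.asarray(pred)
    recalls = []
    for i in range(n_options):
        mask = gold == i                                   # samples whose correct answer is ID i
        recalls.append(100.0 * np.mean(pred[mask] == i))   # recall of option ID i
    return float(np.std(recalls)), recalls                 # np.std uses ddof=0, matching the formula

# Hypothetical toy example with 4 option IDs (0='A', ..., 3='D')
gold = [0, 1, 2, 3, 0, 1, 2, 3]
pred = [0, 1, 1, 3, 0, 1, 2, 1]
rstd, recalls = recall_std(gold, pred, 4)
print(recalls, rstd)   # [100.0, 100.0, 50.0, 50.0] 25.0
```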

Comment

Dear Reviewer VDjF,

We would like to thank you for your time and comments. We hope our previous response has adequately resolved your questions or concerns. As the deadline for the ICLR rebuttal period is approaching, we look forward to hearing your feedback on our response, and would be pleased to clarify any additional questions.

Best,

Authors

Official Review
Rating: 8

This work presents a comprehensive analysis of the selection bias issue in large language models (LLMs) when dealing with multiple choice questions (MCQs). The experimental results identify the root cause of this bias as the LLMs' token bias, which leads to a preference for specific option IDs when predicting answers. Based on these observations, this work proposes a label-free, inference-time debiasing method called PriDe, which effectively mitigates selection bias.

Strengths

  1. The empirical analysis is thorough, involving 20 LLMs and three benchmark datasets. This extensive evaluation provides strong evidence for the existence of selection bias in LLMs and its impact on their performance in MCQ tasks. The identification of token bias as the primary source of this issue is a valuable insight that can inform future research on LLMs and their limitations.

  2. The proposed PriDe method is effective when the computing cost is limited. Further analysis on generalizability reveals that the prior estimated by PriDe can be generalized across tasks.

Weaknesses

  1. It seems that PriDe achieves comparable performance with simple baselines when the computation cost is not limited. In application scenarios, we always first estimate the prior without concern for the computation cost, then apply this prior to serve applications. It would be better if PriDe could have a higher upper-bound performance.

Questions

  1. The generalization analysis indicates that the bias for a certain model is consistent across different tasks. Could you further demonstrate this with more statistics or results? It would also help to enhance the claimed interpretability.
Comment

Thanks for your positive comments! We address your concerns or questions as follows.

It seems that PriDe achieves comparable performance with simple baselines when the computation cost is not limited. In application scenarios, we always first estimate the prior without concern for the computation cost, then apply this prior to serve applications. It would be better if PriDe could have a higher upper-bound performance.

(This question is similar to the one raised by Reviewer CGVT, so we use the same answer)

When the budget is sufficient, using more permutations does yield better debiasing effects and performance improvements. As discussed in Section 4.3, this is akin to "mixture of experts" or "model ensemble". Our method, on the other hand, provides a computation-efficient alternative. We believe this could be beneficial for debiasing in scenarios with constrained/limited computational resources, such as platforms like the HuggingFace LLM Leaderboard, where a large number of models need to be evaluated on numerous benchmarks.

The generalization analysis indicates that the bias for a certain model is consistent across different tasks. Could you further demonstrate this with more statistics or results? It would also help to enhance the claimed interpretability.

Of course! Here we compute the L1 distance (due to its intuitiveness, as in our response to Reviewer CGVT) between the estimated priors from different domains to illustrate PriDe's cross-domain generalization (0-shot, averaged over all the LLMs, $\alpha=5\%$; the priors' L1 distance is averaged over 5 runs).

| Domain 1 \ Domain 2 | STEM | Social Science | Humanities | Others | ARC |
| --- | --- | --- | --- | --- | --- |
| STEM | 0 | 0.104 | 0.094 | 0.106 | 0.121 |
| Social Science | 0.104 | 0 | 0.099 | 0.067 | 0.076 |
| Humanities | 0.094 | 0.099 | 0 | 0.110 | 0.125 |
| Others | 0.106 | 0.067 | 0.110 | 0 | 0.087 |
| ARC | 0.121 | 0.076 | 0.125 | 0.087 | 0 |

We think the gaps between these priors are marginal, which we believe verifies that, for a given model, its prior over option IDs is similar across domains.

Comment

Dear Reviewer G9aZ,

We would like to thank you for your time and comments. We hope our previous response has adequately resolved your questions or concerns. As the deadline for the ICLR rebuttal period is approaching, we look forward to hearing your feedback on our response, and would be pleased to clarify any additional questions.

Best,

Authors

Comment

I have no more questions. I have updated my score.

Official Review
Rating: 8

This paper experimentally discovers that LLMs are vulnerable to option position changes in MCQs, i.e., the option-order sensitivity problem, due to their inherent "selection bias." It proposes a label-free, inference-time debiasing method (PriDe) to mitigate the selection bias. The experimental results demonstrate the claim and the usefulness of PriDe.

Strengths

I really appreciate that the paper conducted extensive experiments to demonstrate and analyze the option-order sensitivity problem. Some observations are really interesting; for example, even models of the same family with different parameter sizes, trained on the same data, exhibit different position preferences.

PriDe is intuitive but also effective.

Weaknesses

It would be better to cite "Leveraging large language models for multiple choice question answering" or other related papers when mentioning the option-order sensitivity problem, since they found the problem earlier than this work.

It would be better to analyze more techniques, including self-consistency.

Questions

Please refer to the weakness.

Comment

Thanks for your positive comments! We address your concerns or questions as follows.

It would be better to cite "Leveraging large language models for multiple choice question answering" or other related papers when mentioning the Option-Order Sensitivity problem since they have found the problem earlier than the work of this paper.

We appreciate your suggestion! We will add the citation in the revision.

It would be better to analyze more techniques, including self-consistency.

In Section 2.6, we experimented with simple prompting strategies, considering their popularity in recent research, to observe whether they have a positive impact on debiasing (finding that they do not). We did not explore prompt engineering in much depth, for the following reasons:

  1. Our empirical analysis in Sections 2.3-2.5 does not motivate us to work on prompt engineering, i.e., we intuitively believe that prompt engineering is not a fundamental means of debiasing (and may also be tricky).
  2. Complex prompting strategies (such as self-consistency) are designed primarily to enhance model performance rather than to debias. Moreover, they typically rely on powerful but often commercial, closed-source LLMs like ChatGPT, Claude, and PaLM, making them less applicable to open-source LLMs like LLaMA.
  3. Complex prompting strategies are often expensive, especially when involving much sampling or heuristic filtering.

We also supplement the results of Self-Consistency on ARC (this benchmark has a small scale, suitable for quick verification). We employ gpt-3.5-turbo-0613, sample 10 Chain-of-Thought paths, and then vote on the predicted results.

| Methods | RStd ↓ | Acc ↑ |
| --- | --- | --- |
| Default | 3.3 | 84.3 |
| Removing IDs | 0.6 | 84.9 |
| Chain-of-Thought | 3.4 | 84.5 |
| Self-Consistency | 4.5 | 88.9 |

As expected, Self-Consistency improved Acc. However, like other prompting strategies, it cannot mitigate selection bias and even somewhat amplifies it (RStd increases), which is inconsistent with our goal of debiasing. We believe that investigating the impact of prompting strategies on LLMs' behavioral bias would be an intriguing research problem.
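For reference, the majority-voting step of the Self-Consistency setup described above might look like the following minimal sketch; `sample_cot_answer` is a hypothetical placeholder for one temperature-sampled Chain-of-Thought completion whose final answer is parsed into an option ID.

```python
from collections import Counter

def self_consistency_predict(sample_cot_answer, n_paths=10):
    """Sample several CoT completions and return the majority-voted option ID."""
    votes = Counter(sample_cot_answer() for _ in range(n_paths))
    return votes.most_common(1)[0][0]
```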

Comment

Dear Reviewer rhkS,

We would like to thank you for your time and comments. We hope our previous response has adequately resolved your questions or concerns. As the deadline for the ICLR rebuttal period is approaching, we look forward to hearing your feedback on our response, and would be pleased to clarify any additional questions.

Best,

Authors

Official Review
Rating: 8

This paper investigated LLMs' sensitivity to position changes in multiple-choice questions and discovered that token bias is the main cause. Furthermore, the authors proposed a way to efficiently suppress this bias and improve accuracy.

Strengths

  1. It flows! The writing is perfect. All sections follow each other naturally, from problem to observation, to diagnosis, to ruling out simplistic solutions, to proposed solutions. In each step, there are corresponding experiments to substantiate it.
  2. There are some clever experiment designs in diagnosing the cause, and the experiments are carried out with caution (e.g. replacing symbols to confirm).
  3. Comprehensive experiments on many models and datasets.

Weaknesses

  1. When the compute budget is unbounded, the proposed method sometimes has a slight accuracy disadvantage compared to full perm.

Questions

  1. In deriving the method, there are a few key assumptions, e.g. Prior for option IDs depends mostly on q. Is it possible to empirically verify this assumption?
Comment

Thanks for your positive comments! We address your concerns or questions as follows.

When the compute budget is unbounded, the proposed method sometimes has a slight accuracy disadvantage compared to full perm.

(This question is similar to the one raised by Reviewer G9aZ, so we use the same answer)

When the budget is sufficient, using more permutations does yield better debiasing effects and performance improvements. As discussed in Section 4.3, this is akin to "mixture of experts" or "model ensemble". Our method, on the other hand, provides a computation-efficient alternative. We believe this could be beneficial for debiasing in scenarios with constrained/limited computational resources, such as platforms like the HuggingFace LLM Leaderboard, where a large number of models need to be evaluated on numerous benchmarks.

In deriving the method, there are a few key assumptions, e.g. Prior for option IDs depends mostly on q. Is it possible to empirically verify this assumption?

We removed the dependency on $x^I$ in $P_\textrm{prior}$ because it could be a minimally strong assumption necessary for our derivation. We are also pleased to empirically verify this assumption, that is, to check whether swapping options (i.e., different $x^I$ w.r.t. $I$) would change the derived $P_\textrm{prior}$. If the answer is no, then our assumption makes sense.

In our main text, we use the cyclic permutation set $\{(1,2,3,4), (2,3,4,1), (3,4,1,2), (4,1,2,3)\}$ for 4-option MCQ tasks. Our verification contains the following steps:

  1. Modify the default-ordered options as $(1,2,4,3)$ or $(4,3,2,1)$.
  2. Use the corresponding cyclic set $\{(1,2,4,3), (2,4,3,1), (4,3,1,2), (3,1,2,4)\}$ or $\{(4,3,2,1), (3,2,1,4), (2,1,4,3), (1,4,3,2)\}$ to derive $P_\textrm{prior}'$.
  3. Check whether $P_\textrm{prior}'$ is close to $P_\textrm{prior}$. We use the L1 distance as the measurement (averaged over all the test samples), due to its intuitiveness: $d(\mathbf{p}, \mathbf{q}) = \sum_i |p_i - q_i|$.
| Models | MMLU $(1,2,4,3)$ | ARC $(1,2,4,3)$ | MMLU $(4,3,2,1)$ | ARC $(4,3,2,1)$ |
| --- | --- | --- | --- | --- |
| llama-7B | 0.014 | 0.014 | 0.013 | 0.013 |
| llama-13B | 0.034 | 0.047 | 0.035 | 0.045 |
| llama-30B | 0.068 | 0.079 | 0.066 | 0.077 |
| llama-65B | 0.069 | 0.081 | 0.070 | 0.082 |
| llama-2-7B | 0.028 | 0.022 | 0.028 | 0.023 |
| llama-2-13B | 0.061 | 0.059 | 0.058 | 0.056 |
| llama-2-70B | 0.095 | 0.106 | 0.096 | 0.109 |
| Average | 0.053 | 0.058 | 0.052 | 0.058 |

We can see that the difference between $P_\textrm{prior}'$ (i.e., estimated with a different permutation set) and $P_\textrm{prior}$ is marginal (their L1 distance is quite small), which we believe could validate the soundness of our assumption.
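As a rough illustration of how this check could be organized in code (not the paper's actual implementation), `estimate_prior` below is a hypothetical placeholder for the prior-derivation step under a given permutation set.

```python
import numpy as np

def l1(p, q):
    return float(np.abs(np.asarray(p) - np.asarray(q)).sum())

def verify_prior_assumption(samples, estimate_prior, perm_set_a, perm_set_b):
    """Average L1 distance between priors derived under two different permutation sets.

    estimate_prior(sample, perm_set) -> prior over option IDs for this sample,
    derived with the given permutation set (placeholder for the PriDe step).
    """
    dists = [l1(estimate_prior(s, perm_set_a), estimate_prior(s, perm_set_b))
             for s in samples]
    return float(np.mean(dists))   # small values support the assumption
```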

Comment

Dear Reviewer CGVT,

We would like to thank you for your time and comments. We hope our previous response has adequately resolved your questions or concerns. As the deadline for the ICLR rebuttal period is approaching, we look forward to hearing your feedback on our response, and would be pleased to clarify any additional questions.

Best,

Authors

Comment

We sincerely appreciate the thoughtful comments and constructive feedback of all the reviewers. We are encouraged that the reviewers found:

  1. Our paper targets an important research problem (VDjF) and provides interesting and insightful findings (rhkS, G9aZ, VDjF).
  2. Our paper writing is smooth and natural (CGVT, "It flows! The writing is perfect").
  3. Our comprehensive, thorough, and careful evaluation yields convincing empirical observations and results (CGVT, rhkS, G9aZ, VDjF; all the reviewers!).
  4. Our proposed debiasing method PriDe is effective and efficient in the low computational cost setting (rhkS, G9aZ), and its operation at inference time is suitable for modern LLMs (VDjF).

We hope our responses to the reviewers can adequately resolve your questions or concerns. As the deadline for the ICLR rebuttal period is approaching, we look forward to hearing your feedback on our responses, and would be pleased to clarify any additional questions.

AC Meta-Review

This paper conducts an interesting study that shows the significance of the position of options in MCQs when using LLMs for such tasks. The paper comes with an extensive set of analyses and proposes effective methods for mitigating the identified bias. Most reviewers agree the work is making substantial contributions to this specific domain and the results are worth sharing with the community, and may inspire others working on similar or related directions.

Why Not a Higher Score

Although the scores were high, I found that the paper in its current form may not yet have the significant impact that would be worth sharing with a wider community.

Why Not a Lower Score

This is a great paper, which should be presented orally.

Final Decision

Accept (spotlight)