PaperHub
Overall rating: 6.0/10 (Poster; 5 reviewers; lowest 3, highest 7, standard deviation 1.5)
Individual ratings: 7, 3, 6, 7, 7
Confidence: 3.8 · Correctness: 2.8 · Contribution: 2.6 · Presentation: 3.4
NeurIPS 2024

Large Language Models Must Be Taught to Know What They Don’t Know

OpenReview · PDF
Submitted: 2024-04-23 · Updated: 2024-11-06

Abstract

Keywords
large language models, calibration, uncertainty estimation, prompting

Reviews and Discussion

Review (Rating: 7)

This work investigates the calibration of LLMs, i.e. how to estimate reliable confidence in their responses that correlates well with the actual likelihood of being correct. Through experiments, the authors reveal that existing methods fall short of achieving accurate confidence estimates, and their calibration performance does not improve with the increased size of LLMs. The authors argue that fine-tuning is necessary to obtain reliable confidence estimates. They explore three fine-tuning approaches: Probe, LoRA, and LoRA + Prompt, and their experiments confirm the effectiveness of these paradigms. Additionally, the authors examine the generalizability of these methods and their benefits for users in making decisions.

Strengths

Originality: This paper proposes three fine-tuning methods for estimating confidence in LLM responses. While previous works have used similar techniques like Probe to gauge uncertainty, the combination of LoRA and prompts for calibration is novel.

Quality: The ideas presented in this paper are well-motivated, and the experiments effectively verify the authors' hypotheses.

Clarity: The paper is well-written and easy to follow.

Significance: The paper's conclusion, that LLMs need to be fine-tuned to predict accurate confidence, is valuable for the community. Additionally, the investigation into the benefits for users in making decisions is both novel and significant.

Weaknesses

  1. One clear drawback of the proposed methods is that they require access to the parameters of the LLM, making them inapplicable to black-box LLMs. Figure 4 (right) indicates that a fine-tuned smaller model can estimate the confidence of larger ones. However, it would be preferable to see a comparison between these fine-tuning-based methods and other techniques that do not require access to the LLM, such as sampling.

  2. The authors claim that "(LoRA + Prompt) performs almost as well in transfer as on in-distribution data." However, in open-ended generation, the gap between transfer and in-distribution is almost 10 points for both ECE and AUROC, which is a significant margin. For instance, ECE increases by nearly 80%.

  3. Section 6.1 is somewhat confusing. While the two possibilities presented seem reasonable, the experimental investigation appears problematic (correct me if I'm wrong). The topic information is not only available in model generation but could also be reflected in the questions. The experiment that replaces the model generation with a random sentence does not adequately address this situation. On the contrary, the results seem to support that the model can predict reliable confidence based solely on the question.

Questions

  1. I don't quite understand Figure 3 (Left). The figure with the y-axis labeled "# train" shows the distribution of training samples. What does the figure with "# MMLU" represent?

  2. From my experience, in the sampling baseline, the size of the largest cluster can be used to represent uncertainty (size of the largest cluster/# of clusters), and sometimes it performs better than probability. Have you tried this approach?

  3. Do you use the correct and incorrect answers provided in the datasets, or do you sample LLM generations to construct incorrect answers? If it is the former, would the latter approach help to better evaluate the uncertainty of the LLM's own generation?

  4. The following related studies are missing:

[1] LitCab: Lightweight Language Model Calibration over Short- and Long-form Responses

[2] Can llms express their uncertainty? an empirical evaluation of confidence elicitation in llms.

Limitations

The authors acknowledge the limitation of needing to train separate models. Personally, I believe that at least the Probe method can be directly applied to the LLM without altering the LLM itself. Exploring how to better evaluate confidence directly from the LLM could be a promising direction.

Author Response

Thank you for your thoughtful comments and supportive review!

Here is our response to the weaknesses you listed pointwise:

Response 1

Yes, our method requires access to the model parameters. Luckily for our method, there are incredibly strong open-source LLMs available. The most recent and relevant case is Meta releasing LLaMA 3.1 405B, which has performance on par with the best models from OpenAI and Anthropic. We’re confident that as the abilities of closed-source LLMs progress there will also be parallel progress in open-source LLMs, and therefore any method that requires fine-tuning the model should be possible to run in a fair comparison with state-of-the-art models.

You proposed a comparison with sampling when estimating uncertainties for a different model. We assume that you mean that you want to see a comparison between, for example, the uncertainty estimates of LLaMA-2 applied to Mistral generations and uncertainty estimates from sampling Mistral by itself. We can provide those numbers below:

| Answer Model | Method | AUROC |
|---|---|---|
| Mistral 7B | Sampling | 0.53 |
| Mistral 7B | LoRA + Prompt (LLaMA-2 7B) | 0.68 |
| Mistral 7B | LoRA + Prompt (Mistral 7B) | 0.69 |

All models are trained/evaluated on answers generated from Mistral 7B and we show the model used to construct the uncertainty estimate in parentheses (where applicable). If you were instead proposing a different comparison, please let us know. From the table, it’s easy to see that even constructing an uncertainty estimate from a different language model (e.g. LLaMA-2 7B applied to Mistral 7B) significantly outperforms a sampling method applied to the same model.

Response 2

We included bars for Probe and Zero-Shot Classifier in the transfer plots so that they could be used as reference points when evaluating the transfer ECEs and AUROCs. While you’re correct that there are notable differences between the in-distribution and transfer performance of the tuned models, these large percentage differences are not so significant in the scheme of the broader evaluation. While the ECE numbers increase in transfer and the AUROC numbers decrease, the final transfer values are still significantly better than the baseline methods applied *in-distribution*. Therefore, in many cases it might be as practical to apply LoRA fine-tuned models in transfer as to apply a worse method to data from the actual downstream task.

Response 3

We’re happy to provide some clarifications. As you point out, one possibility is that the uncertainty estimates are using information present in the question to predict how hard a question will be, independent of the actual answer that was generated. The baseline with ground truth questions and incorrect answers provides a sense of how good such a heuristic could be in practice. As we can see, it does actually perform reasonably well. But it performs significantly worse than when the model is provided with both the ground truth question and a real generated answer which demonstrates that the relationship between the answer and question matters. If the uncertainty estimates were purely using the topic of the question or answer, we would expect both methods to perform roughly the same, because the relationship between the question and answer wouldn’t affect the topic they are associated with. However, we observed that the performance is different across all models.

Answers to Your Questions

Here are answers for your list of questions:

  • “MMLU (MC)” and “MMLU (OE)” in the x-axis label on Figure 3 (left) represent the multiple-choice (MC) and open-ended (OE) settings of the Massive Multitask Language Understanding (MMLU) dataset. The position of the “MMLU” x-axis labels is suboptimal and we apologize for any confusion. The labels would be better placed above the plots as titles instead of x-axis labels.
  • Conveniently, we already ran this comparison for the initial submission, and a more detailed explanation can be found in Appendix B.1. While we tried both counting and likelihood accumulation for the sampling methods, we did not find a significant difference in the performance of these variants on our evaluation, though there may well be a difference on other tasks. (A minimal sketch of the two variants appears after this list.)
  • We sample the model to construct correct and incorrect answers. As a result, the learned estimator is specific to the model used to generate the answers and not to the dataset of question-answer pairs. While the question-answer pairs are fixed for a given dataset, different models can have very different rates of correct and incorrect answers, which necessitates model-specific training.
  • Thank you for sharing these additional related works. [2] is already referenced in our submission as [59], with citations in both Section 2 and Appendix B.2. We will also add LitCab to our overview and cite it.
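
Regarding the counting versus likelihood-accumulation variants mentioned in the second point above, here is a minimal illustrative sketch (not the paper's implementation) of the two sampling-based confidence scores, assuming the sampled answers have already been grouped into semantic clusters:

```python
# Illustrative sketch of two sampling-based confidence variants: counting
# cluster members vs. accumulating likelihood mass per semantic cluster.
# Cluster assignments and sequence log-likelihoods are assumed to be given.
import math
from collections import defaultdict

def confidence_by_counting(cluster_ids):
    """Fraction of sampled answers that fall in the largest semantic cluster."""
    counts = defaultdict(int)
    for c in cluster_ids:
        counts[c] += 1
    return max(counts.values()) / len(cluster_ids)

def confidence_by_likelihood(cluster_ids, log_likelihoods):
    """Share of total likelihood mass accumulated by the largest cluster."""
    mass = defaultdict(float)
    for c, ll in zip(cluster_ids, log_likelihoods):
        mass[c] += math.exp(ll)
    return max(mass.values()) / sum(mass.values())

# Example: 5 sampled answers, 3 of which land in the same semantic cluster.
clusters = [0, 0, 1, 0, 2]
loglikes = [-4.1, -3.8, -6.0, -4.5, -7.2]
print(confidence_by_counting(clusters))              # 0.6
print(confidence_by_likelihood(clusters, loglikes))  # likelihood-weighted variant
```
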
Comment

Thanks for the detailed response. I will keep my positive score.

Review (Rating: 3)

The paper investigates uncertainty calibration via fine-tuning models with correct and incorrect answers. The paper concludes that training through the features of a model on a thousand graded examples using LoRA outperforms prompting-based methods and linear probes. The paper also finds that models can be used not just to estimate their own uncertainties but also the uncertainties of other models. A human study shows LLM uncertainty can affect human decision making.

Strengths

  1. The paper emphasizes the importance of training-based uncertainty estimation methods.
  2. The experiments cover various aspects, including data amount, distribution shifting, cross-model generalization, and a human study.

Weaknesses

  1. Lack of technical novelty. The training methods regarding Probe, LoRA, and LoRA+Prompt are highly similar to previous works. No new technical method is proposed.
  2. Limited experimental scope: The experiments are mainly based on the MMLU dataset. This raises concerns about the generalizability of the findings to more tasks and datasets. Previous research has shown that task/dataset complexity can significantly impact uncertainty calibration performance.
  3. Inconsistent results across MMLU subsets: As shown in Figures 8, 9, and 10, sometimes LoRA + Prompt performs better, sometimes LoRA performs better, and sometimes Probe performs better. Additionally, there is no consensus between AUROC and ECE on MMLU subsets. Methods that perform best on AUROC sometimes perform poorly on ECE. The inconsistency is a common observation rather than an exception. Based on that, the average performance among subsets (Figure 2, left) is not convincing when performance varies significantly across different subsets.
  4. Lack of confidence distribution for correct and incorrect predictions on MMLU: The paper does not show confidence distribution for correct/incorrect predictions on the MMLU dataset. Although Figure 3 (right) shows confidence distribution on answerable/unanswerable sets on the SelfAware dataset, a calibrated uncertainty method should also distinguish between correct and incorrect predictions on the MMLU dataset.

Questions

Upon the statement of self-knowledge, “We show that, by contrast, Mistral 7B actually has better AUROC values when applied to LLaMA-2 7B than LLaMA-2 7B applied to itself” (line 295), how is Mistral 7B applied to LLaMA-2 7B? Specifically: (1) Which model is used for training? (2) Which models generate the answers for the training data and test data, respectively?

Limitations

good discussion

Author Response

We appreciate your feedback. We respond to your comments below, providing numerous clarifications and new results inspired by your comments.

Technical Novelty

We would like to clarify that our goal in the paper was to show that more complex methods are in fact not necessary when a small number of labeled examples are available. Given the rising popularity of approaches that use zero-shot prompting or sampling, we think this point is clearly important and runs contrary to prior work, making it sufficiently novel. Rather than introduce yet another technically complex method, we wished to step back and scientifically understand key facets of calibration for language models. And indeed our experiments provide many novel insights that should not be dismissed. We found that regularization and the use of LoRA were critical for good performance. These observations go beyond prior work. We ultimately see the use of existing popular tools like LoRA as a strength, facilitating broad adoption.

Extended Experimental Results

Inspired by your comments, we provide results on MMLU-Pro, a more challenging multi-task language understanding benchmark. This evaluation was created after the release of all models we consider, so there is no risk of contamination. Tables containing the results are available in our general rebuttal comment. Overall, the new results mirror our original results on MMLU and provide evidence that our estimates do generalize. MMLU-Pro is also considered significantly harder than MMLU, and frontier models typically have lower accuracy on MMLU-Pro by 20% or more.

Trends in MMLU Performance

While there is variance in each metric, there are clear trends in the relative performance. To make this clear we can look at the win rate of LoRA + Prompt over Zero-Shot Query as an illustrative example. The table below shows these win rates for AUROC and ECE. In AUROC, there is a clear advantage to using LoRA + Prompt for all models. ECE is less consistent, but the majority of variance is the result of the model, not the subset of MMLU:

| Model | AUROC Win Rate | ECE Win Rate |
|---|---|---|
| LLaMA-2 7B | 98% | 20% |
| LLaMA-2 7B Chat | 100% | 93% |
| LLaMA-2 13B | 96% | 100% |
| LLaMA-2 13B Chat | 98% | 71% |
| Mistral 7B | 98% | 35% |
| Mistral 7B Instruct | 96% | 82% |

We found that Mistral 7B in particular was highly anomalous among the models we considered. When we consider LLaMA-3 models, which are already included in Figure 1, most models do not display good zero-shot calibration:

| Model | ECE |
|---|---|
| LLaMA-2 7B | 0.19 |
| LLaMA-2 7B Chat | 0.56 |
| LLaMA-2 13B | 0.40 |
| LLaMA-2 13B Chat | 0.23 |
| LLaMA-2 70B | 0.32 |
| LLaMA-2 70B Chat | 0.18 |
| Mistral 7B | 0.12 |
| Mistral 7B Instruct | 0.24 |
| LLaMA-3 8B | 0.58 |
| LLaMA-3 8B Instruct | 0.61 |
| LLaMA-3 70B | 0.47 |
| LLaMA-3 70B Instruct | 0.49 |

We also include results for Qwen-2 7B-Instruct, a strong model released by Alibaba, to make it even more clear that Mistral 7B is anomalous in its ECE results:

| Method | ECE | AUROC |
|---|---|---|
| Zero-Shot Classifier | 47.57 | 68.06 |
| Probe | 20.72 | 61.16 |
| LoRA + Prompt | 14.90 | 74.94 |

These results have the corresponding win rates:

| Model | ECE Win Rate | AUROC Win Rate |
|---|---|---|
| Qwen-2 7B-Instruct | 99% | 85% |

While some models are surprisingly calibrated zero-shot, these models often have bad AUROCs. This is possible because probabilities can be calibrated without being useful for making predictions. For example, a model can completely ignore the input and still have perfect calibration if the statistics of its predictions match the statistics of the data, even if the predictions are completely random. Ideally we want uncertainty estimates that are both good at discrimination and are calibrated, because such a model can reliably filter incorrect answers from correct answers while also having probabilities that are easy to interpret (because they correspond to real rates of correctness or incorrectness).
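
To make this distinction concrete, here is a small illustrative sketch (not from the paper): a confidence estimate that ignores the input but matches the base rate of correctness achieves near-zero ECE while providing no discrimination (AUROC of 0.5).

```python
# Illustrative sketch (not from the paper): a constant predictor that matches
# the base rate is well calibrated (ECE ~ 0) yet has no discrimination (AUROC = 0.5).
import numpy as np
from sklearn.metrics import roc_auc_score

def expected_calibration_error(conf, correct, n_bins=10):
    """Standard binned ECE: weighted gap between mean confidence and accuracy per bin."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            ece += mask.mean() * abs(conf[mask].mean() - correct[mask].mean())
    return ece

rng = np.random.default_rng(0)
correct = rng.binomial(1, 0.7, size=10_000)   # 70% of answers happen to be correct
constant = np.full(10_000, 0.7)               # confidence that ignores the question entirely

print(expected_calibration_error(constant, correct))  # ~0.00 -> looks "calibrated"
print(roc_auc_score(correct, constant))               # 0.5   -> useless for filtering answers
```
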

Confidence Distribution on MMLU

There are two points worth clarifying. First, our main aim in showing the histogram of confidence scores was to show that unanswerable questions obtain lower confidence under our model. Lower confidence doesn’t necessarily imply better calibration, as correctness cannot be defined for unanswerable questions. Second, MMLU is different from SelfAware because correctness is well-defined for all questions, and therefore we can use AUROC and ECE to evaluate the quality of our uncertainty. Although we do not show the confidence distribution in a plot analogous to Figure 3 (right), the confidence distribution of correct and incorrect answers is already captured in the ECE and AUROC metrics.

Summary of Contributions

Given the widespread adoption of LLMs the significance of this paper’s subject is self-evident. We provide several useful observations that are directly applicable to improving the usability of LLMs for question-answering. These observations are deliberately simple but are novel and contradict statements made in prior work. For instance, we show that fine-tuning with even a small number of labeled examples can dramatically outperform prompting and sampling methods in the majority of cases. We also include an extensive analysis of our method’s robustness to distribution shifts and even study how our uncertainties affect user behavior in a study of human-AI collaboration.

We hope that you will consider raising your score in light of our updated experiments and clarifications.

Review (Rating: 6)

Summary

This paper presents a method to quantify the confidence of predictions from an LLM. Unlike some previous work, the authors propose a training-based approach to teach a model to generate reliable confidence scores. The experiments report Expected Calibration Error (ECE) and AUROC, commonly used in calibration literature, to ensure that the model's predicted confidence aligns with actual likelihood. Typically, papers in this area either tune the model or do some post-processing on the model to obtain calibrated confidence scores.

However, this paper is slightly different in its setup, and I found the title somewhat misleading. The main difference here is that there is an auxiliary model to emit the confidence scores, entirely independent of the primary model, as stated in Line 327 of the paper, "Currently, finetuning relies on two separate models for question answering and uncertainty estimation." In this sense, this auxiliary model serves better as a potential guardrail to filter out less confident predictions, and framing the paper in this light would make it easier to convey its message.

I also dislike using the terms "uncertainty estimation" and "confidence estimation" interchangeably. Please refer to Section 3.2 of the paper "Generating with Confidence: Uncertainty Quantification for Black-box Large Language Models," which does a good job of eliciting the difference between uncertainty and confidence.

Main Contribution of the Paper:

The central contributions of the paper are

  • Assembling a finetuning dataset to calibrate the model on.
  • LoRA finetuning of the auxiliary model for reliable uncertainty estimate.
  • Testing the robustness of this model to distribution shifts.

The paper also contradicts findings from previous papers, such as

  • Frozen features are insufficient for uncertainty estimates (The internal state of an LLM knows when it's lying)
  • Transferability of uncertainty estimates across models, which is different from the findings in Kadavath et al. (Language Models (Mostly) Know What They Know)

While I haven't read the previous works in detail, my understanding is that they attempt to predict the uncertainty from the model that generates it. There is no auxiliary model involved.

Experiments and Results

  • Figure 1 shows that perplexity is not indicative of model uncertainty.

    • However, the open-ended generation setting seems slightly contrived and is not truly open-ended. What is the average length of the answer in the MMLU dataset? Since a Multiple Choice Dataset was converted to open-ended generation, the answers might comprise only a few tokens.
  • The paper proposes three approaches to emit the confidence score.

    • "Probe," which trains a small feedforward neural network on the last layer
    • LoRA (same as Probe with LoRA weights)
    • LoRA + Prompt (same as LoRA, with a prompt posing correctness as a multiple-choice question with options (i) and (ii))
  • The dataset used for finetuning was assembled from a diverse collection of benchmark datasets, such as AI2 Reasoning Challenge, Boolean Questions, CosmosQA, etc. I looked at the MMLU paper and found no evidence of any data contamination, i.e., test data accidentally leaking into the training data. I suggest calling this out explicitly in the paper.
  • Table 1 shows regularization improves calibration.

  • Figure 2 shows that the three proposed approaches improve calibration and selective prediction. However, since the open-ended generation setting is not truly open-ended (see my comment above), I suggest conducting additional experiments to predict the uncertainties of longer sequences. This could be one reason why the findings are contradictory to previous papers.

  • Regarding my comment about training an auxiliary model, the confidence generation model is entirely detached from the primary model, which generates the answer. Based on the result of the transferability of these scores, you can arbitrarily mix and match any open-weight model with any other model to get reliable uncertainty estimates. This suggests that this auxiliary model has learned a notion of "correctness" independent of the original model from the training data. It's unlikely that this technique would work on niche domains like finance and medicine since the training data is unlikely to contain such examples.

    • Line 253: Humanities and social sciences overlap, so this claim is not compelling.
    • Question about Figure 3(b): I see that a tuned model leads to a more uniform distribution of probability scores in the first figure compared to the zero-shot one, which has a mode around 70%. Why is this calibrated? (Line 279) Shouldn't you use a calibration curve to claim that scores are calibrated?
  • The experiment presented in section 6.1 is convincing in showing that learned uncertainty estimates depend both on the intrinsic difficulty of the question and the learned correspondence between the question and the correct answer. This also corroborates my earlier skepticism about using this technique to niche unseen domains.

I like Section 7 a lot. Thank you for this study on how uncertainty estimates correlate with user decision-making.

Nits and suggested improvements:

  • Calling out contributions of the paper more explicitly towards the beginning of the paper.
  • Line 39-40: "In particular, we consider whether it's possible to have good uncertainties over correctness (rather than tokens) out of the box." - consider rephrasing this; this sentence is unclear.
  • Line 218: "The probability of each cluster becomes the probability assigned to each sequence in that," this sentence is unclear.
  • Line 959 in the appendix is an incomplete sentence.

Strengths

The paper is concise yet comprehensive, with clear and accessible writing. The appendix offers thorough explanations of hyperparameter selections, auxiliary techniques, and evaluation metrics. While the results align with expectations, they don't break significant new ground. The literature review provides a solid contextual foundation for the work.

Weaknesses

See comments above

Questions

See comments above

Limitations

See comments above

Author Response

Thank you for your review! We respond to your comments below.

Connection between Base Model and Uncertainty Estimate

As you note, the model that generates uncertainties is not exactly identical to the model that generates answers. However, the assertion that these models are entirely independent or that our method is a departure from prior work is fundamentally incorrect.

The setup we employ is the standard in every supervised learning method in our related work section. Lin et al. and Kadavath et al., for example, are both foundational papers in LLM uncertainty estimation that employ a separate uncertainty model in their finetuning experiments. Azaria and Mitchell also use a probe, which results in two separate models, one for generating uncertainties and one for generating answers. In fact, only the zero-shot classifier or sampling methods don’t adopt this two-model approach, and as we show in the experiments they tend to have weaker performance.

Even if our setup was not standard in prior work, it would still be inaccurate to characterize the two models as independent. To make this clear, we can review the details of the three methods we explore:

  • Probe: The uncertainty model and base model share features, and the MLP has approximately 50K parameters, effectively sharing 99.999% of their parameters.
  • LoRA: For approximately 3.5M LoRA weights in a 7B parameter model, and 50K parameters in the MLP classifier, the models share 99.95% of their parameters.
  • LoRA + Prompt: This approach only uses LoRA and no new head. Therefore the models share 99.95% of their parameters. For LoRA + Prompt, we also regularize the KL divergence between the base model and the uncertainty estimate (Section 5, Table 1), and this regularization ties together not just the weights but also the outputs of the two models.
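
As a rough sanity check on these percentages (using the approximate parameter counts quoted above), the shared fractions follow from simple arithmetic:

```python
# Back-of-the-envelope check of the parameter-sharing figures quoted above
# (all counts are approximate, as in the response itself).
base = 7.0e9   # base LLM parameters
mlp  = 5.0e4   # probe / classification head parameters
lora = 3.5e6   # LoRA adapter parameters

new_params = {
    "Probe": mlp,            # frozen features + small MLP head
    "LoRA": lora + mlp,      # adapters + small MLP head
    "LoRA + Prompt": lora,   # adapters only, no new head
}
for name, new in new_params.items():
    print(f"{name}: {base / (base + new):.4%} of parameters shared")
# Probe -> ~99.999% shared; LoRA and LoRA + Prompt -> ~99.95% shared
```
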

Making the two models exactly identical is an exciting avenue for future work. There is no conceptual reason that our method shouldn’t extend to that setting, but in practice it’s challenging to fine-tune state-of-the-art LLMs while avoiding catastrophic forgetting in answer generation; doing so would most likely require mixing large batches of graded data with standard pre-training data. Most academic groups are simply ill-equipped to run this experiment, and it doesn’t seem fair to penalize us for having limited compute, especially when our setting aligns with prior work.

Length of Open-Ended Answers

To determine whether open-ended MMLU is sufficiently different from multiple-choice MMLU, we can look at some basic statistics of the generations sampled from each model.

| Model | Avg Answer Length (tokens) |
|---|---|
| LLaMA-2 7B | 9.7 |
| LLaMA-2 7B Chat | 11.5 |
| LLaMA-2 13B | 9.7 |
| LLaMA-2 13B Chat | 10.3 |
| LLaMA-2 70B | 9.5 |
| LLaMA-2 70B Chat | 56.9 |
| Mistral 7B | 10.3 |
| Mistral 7B Instruct | 25.2 |
| LLaMA-3 8B | 9.0 |
| LLaMA-3 8B Instruct | 13.4 |
| LLaMA-3 70B | 8.9 |
| LLaMA-3 70B Instruct | 10.7 |

While the generations do not consist of whole paragraphs, as we describe in Section 4 (“Do We Get Good Uncertainties Out of the Box?”), the behavior of probabilities over 10 tokens is fundamentally different from the behavior of probabilities over single tokens because of multiple possible phrasings. We make this distinction clear in Figure 1 (left) by showing that probabilities over single tokens are excellent discriminators while perplexities over many tokens are not.
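
For illustration only (this is not the paper's evaluation code), an answer-level perplexity can be computed from the per-token log-probabilities of a generated answer; with roughly 10 tokens per answer, probability mass spread across alternative phrasings shows up directly in this quantity:

```python
# Minimal sketch (assumed setup, not the paper's exact pipeline): answer-level
# perplexity computed from the per-token log-probabilities of a generated answer.
import math

def answer_perplexity(token_logprobs):
    """Perplexity = exp(-mean token log-probability) over the answer tokens."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# A single-token multiple-choice answer vs. a ~10-token open-ended answer.
mc_logprobs = [-0.05]
oe_logprobs = [-0.4, -1.2, -0.3, -0.9, -0.6, -1.5, -0.2, -0.8, -1.1, -0.4]

print(answer_perplexity(mc_logprobs))  # close to 1: essentially a single-token probability
print(answer_perplexity(oe_logprobs))  # inflated by mass spread over alternative phrasings
```
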

Additionally, we run two experiments that should help clarify that our results do in fact extend to long sequences. First, Figures 11-13 in Appendix E show that the uncertainties predicted by our models do not have a consistent relationship with the length of the sequence. Many of the sequences displayed are between 100 and 200 tokens long, which is comparable to a long response in a dialogue with a chat model.

In addition to these results, we also ran an additional experiment to augment the perplexity results in Figure 1. We explore how perplexity relates to quality scores on a dataset with very long graded generations. MT-Bench contains approximately 100 long-form questions with answers in dialogue form. We use precomputed answers graded with GPT-4 for LLaMA-2 7B Chat, LLaMA-2 13B Chat, and LLaMA-2 70B Chat and compute the perplexity associated with each answer.

| Model | Spearman’s Rho |
|---|---|
| LLaMA-2 7B Chat | 0.46 |
| LLaMA-2 13B Chat | 0.27 |
| LLaMA-2 70B Chat | 0.28 |

From the low Spearman’s rhos between the perplexity assigned by the model and the score given by GPT-4, we can see that perplexity is not a reliable predictor of quality, and furthermore, perplexity does not become a more reliable predictor of quality as model size and capability increase, contrary to what previous work would suggest.
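
For reference, the correlation reported above is the standard rank correlation; a minimal sketch with placeholder numbers (the actual per-answer perplexities and GPT-4 grades are not reproduced here) looks like:

```python
# Placeholder values only; the real MT-Bench perplexities and grades are not shown here.
from scipy.stats import spearmanr

perplexities = [1.8, 2.4, 1.5, 3.1, 2.0, 2.7]
gpt4_scores  = [6.0, 8.5, 4.0, 7.0, 9.0, 5.5]

rho, _ = spearmanr(perplexities, gpt4_scores)
print(rho)  # a small |rho| means perplexity ranks answers poorly relative to GPT-4 quality
```
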

Humanities vs. Social Sciences

While “humanities” and “social sciences” are often treated as synonymous, they do have distinct definitions as super-categories of MMLU. While literature is in the super-category “humanities” in MMLU, legal briefs and economics are defined as “social sciences.” It’s not obvious that reading literature should confer an understanding of economics, or vice versa.

Calibration of Unanswerable Questions

Our purpose in displaying the confidence score is to show that the confidence for unanswerable questions is significantly lower than the confidence scores for answerable questions. The ECE of the answerable questions is 0.185 and the AUROC is 0.78. These represent 36% and 15% improvements over the baseline respectively.

Closing Remark

We’ve done our best to address your comments in detail and provide additional experiments where possible. We hope that you will consider raising your score in light of these clarifications and new evidence.

Review (Rating: 7)

This paper studies calibration of large language models. Firstly, the paper points out the zero-shot outputs of LLMs are poorly calibrated, especially in an open-ended generation setting. Secondly, the paper explores fine-tuning a separate LLM for calibration, and demonstrates its good properties such as OOD generalization, data efficiency, etc. Lastly, the paper presents a preliminary study of presenting model uncertainties to human users and investigating how this affects human decision making process.

Strengths

  • The paper is very comprehensive and presents an in-depth analysis on the calibration of large language models.
  • The takeaway messages are summarized clearly after each section. The insights are useful to users and developers of large language models. Plus, the proposed future directions are interesting.

Weaknesses

  • The paper mainly focuses on calibration of zero-shot prompting LLMs, while prompting the model with few-shot examples may also be a viable calibration method. I think this should be added into the discussion and maybe added to the experiments to further strengthen the work.
  • Due to the page limit, many important details are left in the appendix and the paper may be improved with a better organization.

Questions

Clarification questions:

  • Line 162: Do your results contradict with those in Kadavath et al. 2024? If yes what are differences in your experiment settings that may lead to this?
  • Line 177: Does each dot in Figure 1 (right) represent one task in MMLU, or representing one LLM? Were different sizes of Llama2 and Llama3 used here?
  • Figure 1: Do you observe any patterns for base and instruct models? As discussed in https://arxiv.org/pdf/2303.08774 Figure 8, post-training may hurt calibration.
  • Section 5: I think more details are needed for me to understand how the fine-tuned calibration model is trained.
    • How do you get P(correct) from the model. Did you train a classification head? Or is it obtained in a way similar to the zero-shot classifier in Line 159?
    • How do you formulate your training set? It was mentioned in the abstract that a mixture of correct and incorrect answers are needed. However this was not explained clearly in Sec 5.
    • How is the trained calibration model different from a trained “verifier”, e.g., as done in https://arxiv.org/abs/2110.14168
  • Line 287: “model trained on incorrect answers perform surprisingly well”: does this mean the two hypotheses are both true and account for part of the success of the trained calibration model?
  • Figure 5: I’m not sure how to interpret the figure. How do you get “an increased reliance on the LLM for more confident answers, and decreased reliance for less confident answers” from subfigure (b)? Is this done by comparing subfigure (a) and (b)?

Discussion questions:

  • Line 132: “Early work on LLMs such as GPT-3 showed … poorly calibrated”. GPT-3 actually showed reasonable calibration on MMLU in a few-shot in-context learning setting (Figure 11 in https://arxiv.org/abs/2009.03300). I wonder what’s the authors comments on this. It seems that the paper mainly focuses on zero-shot settings. Is there any reason this is not included in the paper? Would few-shot ICL also be a viable method for calibration?
  • I like the idea of infusing calibration into the pre-training phase, discussed in Sec 8. Can you elaborate on the challenges of online learning here? From what I understand, the calibration training data used in Sec 5 is stationary.

Missing Reference: You may be interested in

Overall I think this is a solid and insightful submission. I'm open to adjust my ratings if some of these confusions are resolved.

Limitations

Limitations are discussed in Section 8.

Comment

Attributing generalization

Our current experiments are insufficient to claim that the question topic has no influence on the model. What we can conclude, however, is that the uncertainty estimate cannot be using only the question topic, because, if it were, we would not expect the model trained on correct answers to have significantly better performance. Therefore, the calibrated model must be learning something meaningful about correct answers, not simply a short-cut association between the question and correctness.

Interpreting User Study

Indeed, this conclusion is a result of observing the shift in the distribution of correct v/s incorrect answers.

When the confidence scores are generated at random, the user cannot reliably judge whether to seek AI assistance (i.e. use the answer provided by LLM). (c) shows that when the confidence estimates cover a wide range, AI assistance is not very useful as a miscalibrated confidence estimate may just misguide the user to use (or not use) the assistance. Now, comparing (a) and (b) shows that in cases when users did not rely on LLM’s answer, the confidence estimated by the LLM is spread relatively uniformly over the 35-45% range in (a) as compared to more clustered mass around 40% in (b). The confidence scores then provide a more meaningful signal to the user for when to choose AI assistance. Similarly, looking at the distribution of confidence when the users did seek AI assistance, comparing (a) and (b) shows that a higher fraction of the users were able to rely on the LLM.

Most importantly, these results indicate that humans can respond to AI assistance differently depending on uncertainty estimates.

Viability of Few-Shot ICL

For clarity, we note that all our evaluations with MMLU were conducted with the standard 5-shot prompting.

One key difference to the calibration considered in the paper you referenced (with GPT-3) is our use of open-ended answers instead of multiple-choice answers. In Figure 1, the probabilities assigned to multiple-choice options are often useful predictors of correctness. However, because open-ended answers comprise multiple tokens, using model probabilities in open-ended settings is more fraught.

Kadavath et al. show that many models similar to GPT-3 can have reasonable calibration when prompted in the correct way. Unlike using the probability of answer choices, these prompting methods are applicable to open-ended generations. Notably, we show that fine-tuning performs significantly better in both multiple-choice and open-ended settings.

You allude to the possibility of including demonstrations of the uncertainty prompts (e.g. “Is the proposed answer correct or not?”) within context. For the verbal elicitation baseline, we use a similar approach to bias the model towards correctly formatted outputs. We include a prompt (Appendix B.2) that gives an example of outputting a calibrated probability. As we’ve highlighted before, this approach can perform surprisingly well for a black-box method but dramatically underperforms fine-tuning methods.

Challenges of Online Learning

You’re correct–the training data is stationary. We rely on a one-time labeling to learn the correctness classifier.

The key challenge in simultaneously pre-training and learning the classifier is that as the model learns, the correctness labels may change. This non-stationarity in labels will destabilize learning, as the label can be a moving target. However, we believe it is feasible to incorporate a correctness classifier more stably similar to how LLMs are aligned with multiple rounds of data generation (e.g. LLaMA 3.1), and is a promising research direction to pursue.

Comment
  • Thank you for your response! Now I better understand how the fine-tuned calibration model is trained, and I understand Fig. 1 and Fig. 5 better. I've raised the score to 7.
  • Thank you for clarifying how your settings and conclusions differ from those in Kadavath et al. This distinction is important, especially given that the titles of both papers might imply a direct debate. If space permits, please include these discussions. Additionally, I suggest highlighting early in the paper that your focus is on open-ended generation to avoid any confusion.
Author Response

We’re glad that you find the paper informative and appreciate your feedback. Thank you for sharing the additional reference, which we will incorporate into the related work section. We respond to your questions below, and make several clarifications. We also provide additional results in the general post which you might find helpful. We value your support and would appreciate it if you would consider increasing your score in light of our response and the timeliness and comprehensiveness of this work.

Differences from Kadavath et. al.

Kadavath et al. is closely related to our work, but a few key details of our evaluations and conclusions differ. From the top, our key emphasis is on answers with more than one token (open-ended versus multiple-choice). We explore how open-ended answers can have fundamentally different behavior from multiple choice answers because of ambiguity in phrasing. Unlike Kadavath et al., we show that prompting strategies (e.g. zero-shot classifier, verbal elicitation) are often relatively weak when compared to fine-tuning methods. These methods often improve very slowly, if at all, with the strength of the underlying model. By contrast, fine-tuning methods are effective even for relatively small models and their performance improves with the power of the underlying model.

Kadavath et al. also include experiments with fine-tuning a model on graded examples, but they are different from our own in several ways. Kadavath et al. train a correctness classifier which takes only questions as input, without answers, estimating whether a question might be answerable but not whether a given answer is correct. As we explore in Section 6.1, a model conditioned on only questions is sensitive to learning shortcut features of the question, effectively clustering question topics. By contrast, we show that our model is learning the underlying relationship between the question and answer. Beyond differences in the setup of our fine-tuning experiments, Kadavath et al. suggest that language models are strongest when evaluating their own generations and subsequently posit that uncertainty estimation is linked to self-knowledge. We find that capable models can readily learn good uncertainties for predictions of other models without any knowledge of their internals (see Section 6.1).

Clarifications to Figure 1

Each dot represents a single model; the models included are LLaMA-2 (7B, 7B-Chat, 13B, 13B-Chat, 70B, 70B-Chat), LLaMA-3 (8B, 8B-Instruct, 70B, 70B-Instruct), and Mistral (7B, 7B-Instruct). For “Verbal” only the Chat/Instruct models are included because base models generated nonsensical probabilities when using only prompting. The coordinates of each dot show the test metric averaged over all subsets of MMLU.

Base / Instruct Models

In terms of ECE and AUROC, Figure 7 of the appendix provides a detailed breakdown for all base and instruct models. Unlike in the paper you referenced, base models typically have worse accuracy, ECE, and AUROC when using prompting methods (e.g. zero-shot classifier). One potential reason for the difference is the mismatch between multiple-choice and open-ended settings. Post-training can impair the per-token calibration of a model without hurting its calibration at the level of correctness/incorrectness (captured by prompting methods or trained estimators). In our experiments, post-training can actually be helpful because it improves the model’s ability to adhere to formatting or instruction constraints. Instruction-tuned models tend to have better open-ended accuracies overall (because their answers are direct and concise) and perform better with zero-shot prompting methods where the output format can be complicated or abstract (as in the case of outputting probabilities as numbers).

Notably, however, our fine-tuning procedures create useful estimates for both base and post-trained model variants, with improved ECE and AUROC. Models with calibration deteriorated by RLHF can be re-calibrated with post-training on graded examples.

Details on estimating P(correct)

“Probe” is a classification head on top of the penultimate layer (before token prediction) of the LLM, where the input is the representation of the last token in the context. This head, a small 3-layer neural network, similar to the prescription of Azaria and Mitchell (reference [3] from the submission), serves as the model for binary classification (correct v/s incorrect).

Our “LoRA + Prompt” method is essentially a trained version of the zero-shot classifier in Line 159: instead of attaching a head, we use the same prompt as in the zero-shot classifier but train through the features of the model using low-rank adapters (LoRA). This approach allows the correctness predictor to extract slightly different information from the intermediate layers of the LLM. The loss is computed over the two tokens corresponding to the options (i) and (ii).
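
For concreteness, here is a hedged sketch (not the authors' released code) of how such a prompt-based correctness probability can be read off a causal LM by scoring the two option strings; the model name, prompt framing, and exact option tokenization are assumptions:

```python
# Hedged sketch of a "LoRA + Prompt"-style P(correct): score the two option
# continuations under a causal LM and normalize. Model name, prompt wording,
# and option strings are illustrative assumptions, not the paper's exact setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "mistralai/Mistral-7B-v0.1"  # placeholder; a LoRA-tuned checkpoint would be loaded here
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.float16, device_map="auto")
model.eval()

def option_logprob(prompt: str, option: str) -> float:
    """Total log-probability the model assigns to `option` following `prompt`."""
    prompt_ids = tok(prompt, return_tensors="pt").input_ids.to(model.device)
    option_ids = tok(option, add_special_tokens=False, return_tensors="pt").input_ids.to(model.device)
    input_ids = torch.cat([prompt_ids, option_ids], dim=-1)
    with torch.no_grad():
        logits = model(input_ids).logits
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)  # row t predicts token t + 1
    rows = log_probs[prompt_ids.shape[-1] - 1:]             # rows that predict the option tokens
    return rows.gather(1, option_ids[0].unsqueeze(-1)).sum().item()

def p_correct(question: str, proposed_answer: str) -> float:
    prompt = (
        f"Question: {question}\nProposed answer: {proposed_answer}\n"
        "Is the proposed answer correct?\nChoices:\n  (i) no\n  (ii) yes\nAnswer:"
    )
    log_no = option_logprob(prompt, " (i)")
    log_yes = option_logprob(prompt, " (ii)")
    return torch.softmax(torch.tensor([log_no, log_yes]), dim=0)[1].item()
```
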

Training Set Details

Our training set is a uniform mixture of various open datasets listed in Appendix C.2. For each model, we generate outputs for the samples in the mixture labeled for correctness (as graded by GPT-3.5). Because we use generations from the model and the models are not perfect, we get both correct and incorrect labels in our training set. We will make these details explicit in Section 5.

Uncertainty Model vs Verifier

In principle, a trained calibration model is very similar to a trained verifier. However, a trained verifier in the linked paper is being used to select a single answer from a set of generated samples by estimating which is most likely to be the correct answer. By contrast, our goal is to estimate uncertainty for the answer that has been chosen. We therefore care about properties like calibration (ECE) which indicate that the model’s confidence is not just useful for filtering but also interpretable as a probability.

Review (Rating: 7)

This paper investigates fine-tuning on a small dataset of correct and incorrect answers for an uncertainty estimator. Through various experiments, the authors show the trained estimator not only surpasses the previous prompting approach, also has generalizability to other models and subjects. Lastly, the authors conducted a user study to verify that calibrated confidence could improve human-machine collaboration by helping users modulate their decision to use the model.

Strengths

  • Uncertainty and calibration to reduce hallucination are currently crucial issues.
  • This paper provides various experiments and analyses. Moreover, the experiments are soundly conducted and most of the details are presented for reproducing.
  • The user study, in which users solve MMLU tasks accompanied by a confidence score and predicted answers from an LLM, is fascinating. However, future work remains to analyze the results in more depth.
  • This paper is well-written, neatly organized, and well-presented.

Weaknesses

  • The evaluation task only considers MMLU, which is highly likely to be contaminated.

  • In the section “When and Why Do These Estimates Generalize?”

    • In both Figure 4 (left and right), some results (Probe, Incorrect, sBERT) seem to have an AUROC near 0.5, which is random. Therefore, it’s hard to agree with the conclusion drawn in this paper.

    Learned uncertainty estimates generalize to new formatting, subject matter, and even the generations of other models. This generalization appears to stem not simply from judging a question’s difficulty based on its subject matter (a short-cut) but also learning the correspondence between questions and correct answers.

  • The following paper is related in terms of a fine-tuned confidence estimator.

    • Ulmer, Dennis, et al. "Calibrating Large Language Models Using Their Generations Only." arXiv preprint arXiv:2403.05973 (2024).
  • In the section, “How Should We Use Labeled Examples?”

    • The architectural details of the “small network” used in probing and the other methods are absent. Also, what is the regularization hyperparameter for the KL divergence? What prompt exactly is used for the “LoRA+Prompt” setting?

Questions

  • In Figure 1 (right), there are 6 points for “verbal” and 11 for “zero-shot classifier”. How are they selected? Furthermore, there seem to be too few points to draw a trend line.
  • In Figure 1 (left), isn’t AUROC sensitive to the scale of perplexity range?

Limitations

The authors adequately addressed the limitations.

Author Response

We appreciate your thoughtful and supportive feedback!

Additional Benchmarks

Inspired by your comments, we provide results on the MMLU-Pro benchmark, which is a more challenging multi-task language understanding benchmark. This dataset is much newer than any of the models we test and bears no risk of contamination. Overall, we see that supervised methods provide much stronger performance in terms of ECE and AUROC over the zero-shot classifier.

LLaMA-2 7B-Chat

| Method | ECE | AUROC |
|---|---|---|
| Zero-Shot Classifier | 68.66 | 53.29 |
| Probe | 6.34 | 59.79 |
| LoRA + Prompt | 26.34 | 65.55 |

LLaMA-2 13B-Chat

| Method | ECE | AUROC |
|---|---|---|
| Zero-Shot Classifier | 32.11 | 54.01 |
| Probe | 7.58 | 63.10 |
| LoRA + Prompt | 9.38 | 68.33 |

Mistral 7B-Instruct

| Method | ECE | AUROC |
|---|---|---|
| Zero-Shot Classifier | 19.50 | 60.25 |
| Probe | 6.37 | 60.95 |
| LoRA + Prompt | 7.70 | 69.89 |

Training Details

For all probing experiments, we use a small 3-layer MLP with hidden sizes of [256, 128, 64] and ReLU non-linearities, the same as used by Azaria and Mitchell (2023) (reference [3] in the submission).
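
A minimal PyTorch sketch of such a probe head is below; the input width (4096 for 7B LLaMA-class models) and the two-way output layer are assumptions rather than details from the paper:

```python
# Minimal sketch of the probe head described above: hidden layers [256, 128, 64]
# with ReLU, applied to the frozen last-token hidden state of the LLM.
import torch.nn as nn

class CorrectnessProbe(nn.Module):
    def __init__(self, hidden_size: int = 4096):  # 4096 for 7B LLaMA-class models (assumption)
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden_size, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, 2),                      # logits for incorrect / correct (assumption)
        )

    def forward(self, last_token_hidden_state):
        return self.net(last_token_hidden_state)
```
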

We find a regularization parameter of 1.0 to be sufficient for generalization across all models.

For LoRA + Prompt, we use the following prompt:

Is the proposed answer correct?
Choices:
  (i) no
  (ii) yes
Answer:

We will make these details explicit in our camera-ready version of Section 5.

AUROCs Near 0.5

You are correct that some methods, such as Probe and sBERT, have close to random performance. We don’t believe these results contradict our paper’s main claims. Firstly, we argue throughout the paper that fine-tuning through the model with LoRA leads to much better results than methods that use frozen features. In many cases frozen features get near 0.5 AUROC, while LoRA + Prompt gets near 0.7, which is significantly better than random. Likewise, the “Incorrect” label refers to a baseline and its relatively poor performance demonstrates that our model performs well because it learns a meaningful connection between questions and correct answers. Secondly, while several baseline methods have AUROCs near 0.5 in aggregate, and this demonstrates their inferior performance compared to other methods (such as LoRA + Prompt), there are many subsets of MMLU for which these models perform significantly better than random (though still worse than better methods like LoRA + Prompt). We provide a detailed breakdown over all MMLU tasks in Appendix D.

Related Work

Thank you for bringing Ulmer et al. to our attention. We see their work as complementary to ours. Both our paper and theirs show that LLMs can be fine-tuned to improve uncertainty estimation, but the origin of the labels is different. In their paper, a sampling method is distilled into a single uncertainty estimator, while in our paper, we distill an expert grader into an uncertainty estimator. In practice, we found that sampling methods had poor discrimination (low AUROC), so it’s unlikely that training on the outputs of the sampling procedure would yield a useful estimator in our case. In cases where sampling performs on par with an external grader, however, distilling from a sampling method might be preferable because of the cost of expert labels. We will add this discussion to our related work section in the paper.

Clarifications on Figure 1

There are 12 points in the plots in Figure 1 (left). These correspond to the following 12 models: LLaMA-2 7B, LLaMA-2 7B Chat, LLaMA-2 13B, LLaMA-2 13B Chat, LLaMA-2 70B, LLaMA-2 70B Chat, LLaMA-3 8B, LLaMA-3 8B Instruct, LLaMA-3 70B, LLaMA-3 70B Instruct, Mistral 7B, Mistral 7B Instruct. The plots in Figure 1 (right) should contain the exact same models. In the case of “Zero-Shot Classifier”, the LLaMA-2 70B numbers were dropped accidentally. We include the AUROC and ECE numbers below:

| Model | Method | Accuracy | ECE | AUROC |
|---|---|---|---|---|
| LLaMA-2 70B | Verbal | 47% | 32% | 56% |

The 6 points displayed for “Verbal” correspond to only the Chat/Instruct model versions. Only the Chat/Instruct versions were used because the base model versions were all incapable of generating verbalized probabilities through prompt engineering alone. We agree that including more models would be ideal. We are limited by our access to compute and made our best effort to run many of the state-of-the-art models available at the time of submission. We also included a 95% bootstrapped confidence interval for our linear fit to give a sense of the statistical power of the trend line. We don’t think there’s any reason to believe that there will be a sudden inflection point in the curves for “Zero-Shot Classifier” or “Verbal”, and other papers covering uncertainty estimates for models as powerful as GPT-4 have also found a lack of calibration/discrimination that persists even for the most capable models [1].

You also asked whether the scale of the perplexity values affects the final AUROCs. AUROC is scale invariant because the true positive rate (TPR) and false positive rate (FPR) for a predictor $f$ with cutoff $k$ are identical to the TPR and FPR of the predictor $g = c \cdot f$ with cutoff $c \cdot k$ (for any $c > 0$). AUROC is an integral of TPR over FPR on the interval $[0, 1]$, and we could rewrite AUROC($g$) as AUROC($f$) using a change of variables. AUROC therefore only captures how well each model ranks the incorrect and correct answers. ECE, by contrast, is not scale invariant and is only appropriate for normalized values, which is why we use AUROC in this comparison.
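
This invariance is easy to verify numerically; a quick illustrative check with scikit-learn (values chosen arbitrarily):

```python
# Quick numerical check of the scale-invariance argument above.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
labels = rng.binomial(1, 0.5, size=1000)
scores = rng.random(1000) + 0.3 * labels   # an arbitrary, unnormalized predictor f
c = 17.0                                   # any positive rescaling constant

print(roc_auc_score(labels, scores))       # AUROC(f)
print(roc_auc_score(labels, c * scores))   # identical value: AUROC(c * f) = AUROC(f)
```
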

References

[1] Miao Xiong, Zhiyuan Hu, Xinyang Lu, Yifei Li, Jie Fu, Junxian He, and Bryan Hooi. Can LLMs Express Their Uncertainty? An Empirical Evaluation of Confidence Elicitation in LLMs. ArXiv, abs/2306.13063, 2023.

Author Response

Thank you to all the reviewers for your feedback. We wish to emphasize that our paper is closely engaging with an extraordinarily significant and timely topic: how do we calibrate foundation models to select for factual accuracy? We consider many facets of this question, including even a human+LLM collaboration setting. We also provide insights into why uncertainty estimation becomes challenging in open-ended settings and how we can construct estimators when we have access to a small dataset of graded answers.

Our work contains many novel findings, even contradicting prior work, such as pointing out the relatively weak performance of prominent prompting/sampling methods or demonstrating the role of learned uncertainty estimates in a user study. By focusing on an array of key practical questions, our work acts as an important and timely foundation for future research in LLM safety.

Several reviewers highlighted that our work is particularly comprehensive and provides many insights. We appreciate this recognition, as this work was a major undertaking. We have also put a significant effort into answering reviewer questions and providing clarifications, including several new results inspired by reviewer comments. We hope the timeliness of the work and our response can be carefully considered in the final assessment.

We briefly summarize some of the new results in this general post. Then, we provide an itemized list of contributions. Finally, we have separate posts for replying to reviewers individually.

MMLU-Pro Results

To demonstrate that our results are not specific to MMLU, we’ve run an additional experiment on MMLU-Pro, a new benchmark that is completely separate from and more challenging than the original MMLU but built with the same goal of testing a broad range of knowledge and reasoning tasks. As in our original MMLU results, we construct an open-ended version of MMLU-Pro and use GPT-4 to match samples from the model with official answers. We show the results for 4 models below:

LLaMA-2 7B-Chat

| Method | ECE | AUROC |
|---|---|---|
| Zero-Shot Classifier | 68.66 | 53.29 |
| Probe | 6.34 | 59.79 |
| LoRA + Prompt | 26.34 | 65.55 |

LLaMA-2 13B-Chat

| Method | ECE | AUROC |
|---|---|---|
| Zero-Shot Classifier | 32.11 | 54.01 |
| Probe | 7.58 | 63.10 |
| LoRA + Prompt | 9.38 | 68.33 |

Mistral 7B-Instruct

| Method | ECE | AUROC |
|---|---|---|
| Zero-Shot Classifier | 19.50 | 60.25 |
| Probe | 6.37 | 60.95 |
| LoRA + Prompt | 7.70 | 69.89 |

Qwen Results

To show that our results are not limited to our original set of models (LLaMA-2, LLaMA-3, Mistral), we also include results for another strong 7B-parameter model, Qwen-2 7B-Instruct, below:

| Method | ECE | AUROC |
|---|---|---|
| Zero-Shot Classifier | 47.57 | 68.06 |
| Probe | 20.72 | 61.16 |
| LoRA + Prompt | 14.90 | 74.94 |

Central Contributions

Our central contributions go beyond a practical recipe for training an uncertainty estimator. We believe that all the points below break significant ground by highlighting key limitations of current methods or proposing new solutions:

  • We show that out-of-the-box estimators are unreliable. For example, perplexity fails to provide a consistently good discriminator between correct and incorrect answers. By contrast, supervised methods provide consistently superior uncertainty estimates.
  • We highlight how defining uncertainty for variable length sequences is challenging, particularly due to the many ways a sentence can be phrased. We show that a trained predictor of correctness is independent of the input sequence length and has strong empirical performance.
  • Our method benefits from scaling dataset size, but does not require a large dataset to work. In Figure 2, we see that even with a number as low as 1000 labeled samples, we are able to significantly improve upon the performance of out-of-the-box estimators, while showing continued improvement in both ECE and AUROC on further scaling.
  • We provide a detailed qualitative analysis assessing the nature of generalization of our uncertainty estimators, showing that the estimator must be performing more than simple pattern matching on the topic or difficulty of the question (Section 6).
  • Unlike prior work, we show that humans are sensitive to calibrated uncertainties and superior methods are indeed desirable in the first place (Section 7).
Final Decision

This paper looks at getting calibrated uncertainty scores from a model, i.e. estimating the probability that its answer is correct. The authors argue that this is difficult to achieve from prompting alone, but propose fine-tuning a model on a small dataset of correct and incorrect answers and find that this achieves good calibration performance. Moreover, they find that these estimates generalize to some degree across models.

Reviewers agree that this paper investigates an important topic, with implications for reducing hallucination or other factuality errors from LLMs. They find that the experiments were comprehensive and well-executed, with a clear take-away that an approach like LoRA + prompt that includes fine-tuning is necessary to get good calibration. Reviewers h75Z and BBBX also appreciated

There was some disagreement about novelty, however: reviewers oPif and BBBX had initial concerns but were satisfied with the author response, while reviewer XHRg still felt that the proposed methods were too similar to previous work.

Finally, there were also some concerns, from reviewer XHRg in particular, about consistency of the ECE and AUROC metrics and how to interpret ambiguous performance between LoRA, LoRA + prompt, and probe methods. However, as other reviewers pointed out in discussion, these are all supervised methods and so still support the paper’s main conclusion that weight-based learning of some form is necessary for good calibration.