PaperHub
Overall: 5.5/10 — Rejected (4 reviewers; min 5, max 6, std 0.5)
Ratings: 6, 5, 6, 5
Confidence: 3.3 · Correctness: 2.8 · Contribution: 2.3 · Presentation: 2.5
ICLR 2025

Large Language Model Confidence Estimation via Black-Box Access

OpenReview · PDF
Submitted: 2024-09-26 · Updated: 2025-02-05
TL;DR

We propose a simple and extensible framework where we engineer novel features and train an (interpretable) model on them to estimate the confidence of outputs from large language models via just black-box/query access.

Abstract

Estimating uncertainty or confidence in the responses of a model can be significant in evaluating trust not only in the responses, but also in the model as a whole. In this paper, we explore the problem of estimating confidence for responses of large language models (LLMs) with simply black-box or query access to them. We propose a simple and extensible framework where we engineer novel features and train an (interpretable) model (viz. logistic regression) on these features to estimate the confidence. We empirically demonstrate that our simple framework is effective in estimating confidence of Flan-ul2, Llama-13b and Mistral-7b on four benchmark Q&A tasks as well as of Pegasus-large and BART-large on two benchmark summarization tasks, surpassing baselines by over 10% (on AUROC) in some cases. Additionally, our interpretable approach provides insight into features that are predictive of confidence, leading to the interesting and useful discovery that our confidence models built for one LLM generalize zero-shot across others on a given dataset.
Keywords
Large language model, confidence estimation

Reviews and Discussion

Review (Rating: 6)

This work proposes a framework for black-box detection of correct vs incorrect LLM responses. The authors propose a set of input manipulation and decoding techniques to form a feature space. Using binary correctness labels, they train a logistic regression to derive confidence estimates that an answer is correct. Experiments show good performance compared to baselines with respect to two metrics, for several QA datasets and LLMs. The simplicity of the logistic regression also makes it possible to rank the importance of each technique on each dataset.

The paper is experimental in nature and is not backed by theoretical results.

Strengths

  • The techniques proposed in this paper are black-box, which makes the work appealing.
  • Results appear better than competitors across the board.

Weaknesses

  • No theory to support the methodology. The proposed techniques feel reasonable, but there is no principled reason why they would strongly correlate with the correctness of the output in general. Clearly, using multiple techniques at the same time, as the authors do, tends to mitigate this problem.
  • Not fully clear to me how robust some of the featurization steps are, and how much manual intervention is needed to make things work for all datasets.

Questions

  • Perhaps the authors could comment on the robustness of the featurization steps. How automated is the process? How well does it work without intervention? Would it be possible to provide more evidence that the method is robust?
  • The few instances where the presented approach is not better than the others are not the same for the AUROC and AUARC metrics. Why do you think this is the case? Combined with the fact that the improvement is most often only a few percentage points, it makes me wonder whether averaging the results over three runs is enough. Would it be possible to use more runs and add standard deviations to make results more significant?
Comment

Comment on the robustness of the featurization steps. How automatized is the process? How well does it work without intervention? Would it be possible to provide more evidence that the method is robust?

The featurization steps in section 3.2 are fully automated. Computing the number of semantic sets simply needs an NLI model to gauge whether two responses entail each other, and then the count of distinct sets is returned. Lexical similarity computes a pairwise ROUGE score over the responses and averages it. SRC again uses an NLI model and records the highest contradiction probability between multiple splits of the response. The prompt perturbations, i.e. the steps in section 3.1, are also automated, leveraging backtranslation models such as Helsinki-NLP from HuggingFace for paraphrasing, the NLTK library for stopword removal, etc.
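To make the lexical-similarity feature concrete, here is a minimal sketch under illustrative assumptions; the ROUGE-L variant and the rouge_score package are assumed choices, not necessarily the paper's exact implementation:

```python
# Sketch: average pairwise ROUGE over the sampled responses (lexical similarity feature).
from itertools import combinations
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

def lexical_similarity(responses: list[str]) -> float:
    pairs = list(combinations(responses, 2))
    if not pairs:
        return 1.0  # a single response is trivially self-consistent
    scores = [scorer.score(a, b)["rougeL"].fmeasure for a, b in pairs]
    return sum(scores) / len(scores)

print(lexical_similarity(["Paris", "Paris, France", "Lyon"]))
```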

To showcase the robustness of our method, we refer you to the following three experiments in the appendix. In Table 8, we show that even when we train our logistic regression classifier on fewer samples (250 and 500), our method is still quite performant. Then, in Tables 13 and 14, we run the baselines making the same number of queries to the LLM as we did for our feature-based approach. This results in many more responses (25) for the black-box baselines which rely solely on stochastic decoding. As can be seen, the results are qualitatively the same and we still outperform the baselines consistently. Thus, additional compute for the baselines does not elevate their performance by much, indicating that our framework and features have value. In Table 15, we reduce the number of generations from 5 to 3 for our features and find that our approach still produces quality estimates. Moreover, we have now added a 1-standard-deviation (rounded to three decimal places) error interval for our approach in Tables 2 and 3. No error interval implies the rounding resulted in 0. As can be seen, the error intervals are small and our results are still consistently better than the competitors, indicating that our approach is robust.

Would it be possible to use more runs and add standard deviations to make results more significant? And why sometimes AUROC and AUARC are not in sync?

As mentioned above, we have now added the standard deviations for our method, where, following your suggestion, we have also averaged over 5 runs rather than 3. As can be seen, the standard deviations are small and hence our results are statistically significantly better than the competitors in most cases. We believe the AUROC and AUARC are sometimes not in sync because sometimes even the low-confidence examples are correctly classified, and these get ignored for higher cutoffs of AUARC.

Comment

Thank you for your answer. I'll maintain my score.

Comment

Thank you for your response.

Review (Rating: 5)
  • This paper seeks to provide uncertainty measures for language models given on prompting access to them.
  • The basic method is to perturb a prompt using various heuristics (e.g., paraphrasing, duplicating sentences) and generate a set of features that feed into a logistic regression.
  • Evaluation is done on 4 QA tasks on 3 models (Flan-ul2, Llama-13b, Mistral-7b), and there are gains over baselines.

Strengths

  • The problem of producing confidence estimates for language models is an important problem.
  • The idea of aggregating multiple different sources of information across prompts is a natural direction which seems fruitful.
  • Studying the transferability across language models is important for having general-purpose calibrators.

Weaknesses

  • The regime in which this paper operates could be more realistic. First, the models considered are relatively small models. Furthermore, for these models we actually have full white-box access, so evaluating methods that are meant for black-box access doesn't seem as well motivated. It would have been nice to see evaluations on closed API models (e.g., GPT-4, Claude, etc.). The contribution of this paper is fundamentally empirical, so the regime matters here.
  • The particular types of perturbations (e.g., paraphrasing) seem ad hoc: why were these chosen and not others? For each particular choice (e.g., paraphrasing), I also wonder about the execution - it seemed like relatively weak models were used and the paraphrasing wasn't effective. Also, regardless of what perturbations are chosen, there should be more analysis of the effectiveness of each one (both in terms of downstream performance and in terms of whether the implementation of the perturbation actually produced the desired effects with valid outputs - e.g., paraphrasing where we don't just drop valuable content). I understand that the pitch is that this is a framework, but the framework is not novel; the contribution should be the choices of perturbation that lead to good results.
  • The cost of the method is not discussed. Compared to having to decode from the language model once, it seems like the method here is quite expensive and complicated; to know whether it is worth it, it would be good to document the costs, and in general, explore the cost-AUROC tradeoff.

Questions

  • Stochastic decoding and split response consistency really aren't prompt perturbations since they change the output, not the input. It seems like these should be pulled out into a separate section.
  • For paraphrasing, it was noted that Helsinki-NLP to do En->Fr->En did not preserve semantics. Why was Helsinki-NLP used? It seems like using an LLM to do the translation or paraphrase could have worked - was this considered?
  • Sentence permutation could make the text invalid due to unresolved anaphora - can you comment on whether this was a problem?
  • Entity frequency amplification: why is there such an emphasis on named entities? I could imagine EFA and SP be generally applicable to any sentence. Is there some sort of empirical evidence that entities need to be treated specially?
  • For SRC, how was the NLI between the output computed? This only works for long responses since the output needs to be split up into multiple parts? Why is NLI not computed on different samples instead? How does this relate to maieutic prompting (https://arxiv.org/pdf/2205.11822)?
  • The section on featurization (section 3.2) is a bit hard to follow; it would be nice to have an example.
  • Why do you think that the same features transfer across LLMs? Is it because there is a model-agnostic notion of example difficulty that is being captured here, or that models tend to make the same type of errors?
  • How important is the linear model assumption? I understand the desire for interpretability, but if you just cared about increasing AUROC, how far could you get by using non-linear models (for example, https://arxiv.org/pdf/2006.09462 used random forests)?
  • It would be interesting to compare your method to methods that use log probabilities (of other models):
  • While other papers that aggregate (e.g., https://arxiv.org/abs/2306.13063) are discussed in the related work, it would have been nice to see a more quantitative comparison.
  • Minor: use \citep (e.g., Goodfellow et al. (2016) in the first sentence) when the citation is used as a parenthetical.
  • 169: complimentary => complementary
Comment

Closed larger models

In Tables 10-12 we now report AUROC, AUARC and ECE for GPT-4 running the different methods. As can be seen, the results are qualitatively maintained, with our method performing consistently better than the competitors.

effectiveness of perturbations

In Table 7 we show that our perturbations are in fact as intended and maintain semantics.

cost benefit trade-off

To showcase the strength of our method relative to the amount of compute, we refer you to the following three experiments in the appendix. In Table 8, we show that even when we train our logistic regression classifier on fewer samples (250 and 500), our method is still quite performant. Then, in Tables 13 and 14, we run the baselines making the same number of queries to the LLM as we did for our feature-based approach. This results in many more responses (25) for the black-box baselines which rely solely on stochastic decoding. As can be seen, the results are qualitatively the same and we still outperform the baselines consistently. Thus, additional compute for the baselines does not elevate their performance by much, indicating that our framework and features have value. In Table 15, we reduce the number of generations from 5 to 3 for our features and find that our approach still produces quality estimates.

Stochastic decoding and split response consistency really aren't prompt perturbations…

We have renamed the section to "Elicitation of Variable LLM Behavior".

Why was Helsinki-NLP used?

Yes, we did initially use an LLM (Mixtral, LLAMA) to paraphrase, but the paraphrasing turned out to be much better using backtranslation with Helsinki-NLP, especially for datasets with context. This is also mentioned in the Paraphrasing paragraph in section 3.1.
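A minimal sketch of backtranslation-based paraphrasing (En -> Fr -> En) with the Helsinki-NLP checkpoints on HuggingFace; the specific checkpoints and generation settings below are assumptions for illustration, not necessarily the exact configuration used:

```python
# Sketch: paraphrase a prompt by round-tripping it through French with MarianMT models.
from transformers import pipeline

en_to_fr = pipeline("translation", model="Helsinki-NLP/opus-mt-en-fr")
fr_to_en = pipeline("translation", model="Helsinki-NLP/opus-mt-fr-en")

def backtranslate(prompt: str) -> str:
    french = en_to_fr(prompt, max_length=512)[0]["translation_text"]
    return fr_to_en(french, max_length=512)[0]["translation_text"]

print(backtranslate("In what country is Normandy located?"))
```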

Sentence permutation could make the text invalid…

Reordering up to five sentences containing entities led to semantically similar text in most cases, and hence we used this. We checked this qualitatively by looking at 100 randomly selected examples. We also confirmed it quantitatively and reported the results in Table 7, where in over 99% of cases the semantics are maintained.

Why focus on named entities?

Natural language text has lexical, morphological, phonetic, syntactic and semantic structure. A named entity is a word form that identifies elements with similar properties from a collection of elements and is referred to as a rigid designator of its semantic class, since such elements perform actions or experience the effects of actions [1]. As discussed in the literature, these are crucial for Q&A, summarization, machine translation and other such important NLP tasks [1,2,3]. Hence, focusing on them by, say, perturbing their frequency made sense. We found in our initial experiments that changing their frequency produced larger changes in an LLM's predictions, particularly when the model was less confident, than varying other types of words such as articles or verbs. We thus made this into a feature for our approach.

[1] Goyal et al. Recent Named Entity Recognition and Classification techniques: A systematic review. Computer Science Review, Elsevier, 2018.

[2] Jung. Online named entity recognition method for microtexts in social networking services: A case study of Twitter. Expert Syst. Appl., 2012.

[3] Liu et al. Two-stage NER for tweets with clustering. Inf. Process. Manag., 2013.
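As an illustration only, entity frequency amplification could look like the following sketch; the spaCy pipeline, the entity types, and the sentence-duplication rule are hypothetical choices, not the paper's exact procedure:

```python
# Hypothetical sketch: amplify the frequency of named entities by repeating
# entity-bearing sentences in the prompt.
import spacy

nlp = spacy.load("en_core_web_sm")

def amplify_entity_frequency(prompt: str, repeats: int = 2) -> str:
    doc = nlp(prompt)
    out = []
    for sent in doc.sents:
        out.append(sent.text)
        if any(ent.label_ in {"PERSON", "ORG", "GPE", "LOC"} for ent in sent.ents):
            out.extend([sent.text] * (repeats - 1))  # duplicate sentences that mention entities
    return " ".join(out)
```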

For SRC, how was the NLI between the output computed? and why? Connection to maieutic prompting?

As mentioned in the paper, semantic inconsistency between the two parts is measured using an NLI model's contradiction probability, where one part is taken as the premise and the other as the hypothesis. The highest contradiction probability amongst multiple such splits is the feature value for this strategy. The splits were done at sentence boundaries, where each part had to be at least one sentence long. We did this within samples since we witnessed LLMs such as ChatGPT outputting responses that argued in both directions. For example, a few months back, when we asked ChatGPT the following question: "Can integers be factorized in polynomial time?", it said, "No, integer factorization is believed to be a hard problem and not solvable in polynomial time. This is the basis for many cryptographic algorithms such as RSA. The best known algorithms for integer factorization have polynomial time complexity, meaning their running time grows as the size of the input increases." This motivated us to create the SRC feature.

The SRC feature is quite different from maieutic prompting both in terms of design and intent. The essence of maieutic prompting is to arrive at the correct answer based on LLM generated explanations by solving a satisfiability problem. In the case of SRC, as described above, we simply aim to exploit the inconsistencies within an LLM response to determine its quality or potential validity.
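A minimal sketch of the SRC computation described above; the roberta-large-mnli checkpoint is an assumed choice of NLI model, not necessarily the one used in the paper:

```python
# Sketch: split a response at sentence boundaries and take the maximum
# contradiction probability between the two parts under an NLI model.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-large-mnli")
nli_model = AutoModelForSequenceClassification.from_pretrained("roberta-large-mnli")

def contradiction_prob(premise: str, hypothesis: str) -> float:
    inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
    with torch.no_grad():
        probs = torch.softmax(nli_model(**inputs).logits, dim=-1)[0]
    return probs[nli_model.config.label2id["CONTRADICTION"]].item()

def src_feature(sentences: list[str]) -> float:
    """Max contradiction over splits of the response; each part is at least one sentence."""
    scores = []
    for k in range(1, len(sentences)):
        part_a, part_b = " ".join(sentences[:k]), " ".join(sentences[k:])
        scores.append(max(contradiction_prob(part_a, part_b),
                          contradiction_prob(part_b, part_a)))
    return max(scores) if scores else 0.0
```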

Comment

Example to explain section 3.2

We have now added more description in section 3.2 to explain each feature type. For example, for the semantic-set feature: if from five paraphrasings we get the responses excellent, great, bad, subpar and fantastic, then the number of semantic sets is two, as excellent, great and fantastic form one semantic set while bad and subpar form the other. For lexical similarity, we compute the average ROUGE score over pairs of these responses. For SRC, we explain the example from Table 1, where the two sentences regarding Normandy contradict each other, leading to a high contradiction probability under an NLI model, i.e. low consistency between the two portions of the split response.
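To complement the example above, here is a minimal sketch of counting semantic sets via mutual entailment; the NLI checkpoint and the 0.5 entailment threshold are illustrative assumptions:

```python
# Sketch: two responses join the same semantic set when they mutually entail each other.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-large-mnli")
nli_model = AutoModelForSequenceClassification.from_pretrained("roberta-large-mnli")

def entailment_prob(premise: str, hypothesis: str) -> float:
    inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
    with torch.no_grad():
        probs = torch.softmax(nli_model(**inputs).logits, dim=-1)[0]
    return probs[nli_model.config.label2id["ENTAILMENT"]].item()

def num_semantic_sets(responses: list[str], threshold: float = 0.5) -> int:
    sets: list[list[str]] = []
    for r in responses:
        for s in sets:
            rep = s[0]  # compare against the set's first (representative) response
            if entailment_prob(rep, r) > threshold and entailment_prob(r, rep) > threshold:
                s.append(r)
                break
        else:
            sets.append([r])  # no existing set matched; start a new one
    return len(sets)

print(num_semantic_sets(["excellent", "great", "bad", "subpar", "fantastic"]))  # the example above would ideally yield 2
```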

Why do you think that the same features transfer across LLMs?

We think the same features transfer across LLMs because they are trained with very similar strategies (viz. all transformer architectures, the same public data sources, similar loss functions, etc.) and thus potentially have somewhat similar vulnerabilities. This seems analogous to adversarial attacks on deep learning models prior to LLMs, where attacks and adversarial examples would transfer across models zero-shot.

how far could you get by using non-linear models?

We tried MLPs and random forests, but did not see consistent improvement over logistic regression (LR). Hence, we proposed LR as it also has the added benefit of being interpretable.
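For context, fitting the interpretable confidence model itself is straightforward with scikit-learn; the feature names and data below are placeholders, not the paper's actual matrices:

```python
# Sketch: logistic regression on engineered features with binary correctness labels.
import numpy as np
from sklearn.linear_model import LogisticRegression

feature_names = ["num_semantic_sets", "lexical_similarity", "src_max_contradiction"]
rng = np.random.default_rng(0)
X_train = rng.random((1000, len(feature_names)))   # placeholder feature matrix
y_train = rng.integers(0, 2, size=1000)            # placeholder correctness labels

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
confidence = clf.predict_proba(X_train)[:, 1]      # P(response is correct)
for name, coef in zip(feature_names, clf.coef_[0]):
    print(f"{name}: {coef:+.3f}")                  # sign/magnitude gives interpretability
```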

methods that use log probabilities and methods that aggregate

Calibrating with out-of-domain data as done in the log probabilities work would definitely be interesting to try in the future, rather than just in-domain as done in our current setup. Regarding methods that aggregate we have now added average verbalization confidence (AVC) as a baseline in Tables 2, 3 and 9 as it performed the best in the prior work. As can be seen we consistently outperform this baseline as well. Semantic entropy (SE) is another baseline we added and our method outperforms it too.

Minor comments

We have addressed the minor comments such as fixing citations and typos.

Comment

Thank you for answering my questions, in particular, including the GPT-4 result and discussing the cost.

The GPT-4 result seems a bit like an afterthought; I would like to see results more foregrounded along with the other top models (Claude, Gemini, Llama3 405B). I am encouraged by the GPT-4 result, but I would still like to see results on the top models since, in my opinion, the substance is in the empirical results, not the framework.

Regarding the cost, to clarify, I'm not worried about the cost of the regression, but mostly the number of generations. I appreciate the discussion here and comparison with baselines, but I would like to see it more foregrounded in the paper.

I will increase my rating from 3 to 5.

Comment

Our intention in putting the GPT-4 results and the number-of-generations results in the appendix was only to confirm that they were what you expected. Since you have confirmed this, in the final version (since we cannot update the paper anymore) we will bring these results into the foreground, where Tables 2 and 3 will also depict the GPT-4 results and the number-of-generations table will be moved into the main paper. We will also update the Llama 2 results with Llama 3 and add Claude as an additional LLM.

As such we are glad that your major concerns have been satisfied and hope that you consider our response above.

Review (Rating: 6)

edit Dec. 2: Score change from 5 to 6.

The authors introduce a framework to obtain confidence or P(correctness) values to score sampled responses from a black-box model (no access to weights, logits, or log-probs). The method is essentially using several different strategies, including non-deterministic sampling, prompt modification, and response splitting, to produce a small set of responses, and determining consistency among them via semantic and lexical metrics. The strategy-metric pairs each receive their own score, and these scores are used with a logistic regression model to obtain the final confidence (calibrated to a dataset-model pair). This method extends and generalizes work such as "Semantic uncertainty: Linguistic invariances for uncertainty estimation in natural language generation." (Kuhn et al., 2023).

The authors benchmark on a mid-sized set of LLMs and datasets, obtaining good performance compared to other black-box baselines, and demonstrate that the logistic regression model is able to show the relative importance of different metrics in determining confidence. Furthermore, they show that the trained logistic regression models may generalize on the same dataset across the results from different LLMs.

Strengths

  1. Simple, extensible framework for wrapping black-box confidence scoring techniques into a meta-scorer, presented with several existing and novel confidence scoring techniques.
  2. Useful interpretability results.
  3. Good communication of approach, drawbacks, and strengths.

Weaknesses

  1. In 083, it is claimed that because the models used to produce predictions are simple, the confidence estimates will be well-calibrated. I think this claim needs support; in particular, the paper might make use of ECE (expected calibration error) and/or other calibration metrics to assess the calibration of different approaches including the newly proposed approach.
  2. Results are hard to contextualize without error bars.
  3. Baselines are mostly focused on previous black-box sampling-based baselines. While these are useful, many useful baselines are in fact missing. For example, on black-box models, elicited verbalized confidence (Just Ask for Calibration: Strategies for Eliciting Calibrated Confidence Scores from Language Models Fine-Tuned with Human Feedback, Tian et al. 2023) could be used. Also, why not try a frequency-estimated semantic entropy of the equivalence sets (Semantic Uncertainty: Linguistic Invariances for Uncertainty Estimation in Natural Language Generation, Kuhn et al. 2023), instead of just the number of sets? Furthermore, open-source/whitebox approaches such as P(true) or finetuned P(IK) (both from Language Models (Mostly) Know What They Know, Kadavath et al., 2022) could easily be used on an auxiliary open-weight model (e.g. assessing the accuracy of a black-box response with a different open-source model). All of these metrics seem like they could be added as baselines and as inputs to the logistic regression fit.
  4. Scope of model evaluation is limited. This limits the impact of the conclusions, although it may not be addressable due to cost limitations.

Small nitpicks:

  • Spelling and grammar: "aver to these values as a features" (80-81), "complimentary ways" --> "complementary ways" (169), "greedy ," (149), "Since, " (083), etc.
  • Communication: I think 3.1 should not be entitled "prompt perturbations" as not all of the methods for obtaining multiple different answers/answer portions to compare are actually prompt perturbations. Prompt perturbations, Stochastic Decoding, and split response consistency should maybe be referred to collectively as something like "Elicitation of multiple black-box responses" or similar.
  • Communication: (383) "we use two metrics to evaluate effectiveness of the models: {description of metric 1}. {Description of metric 2}" should likely be a list with (a), (b), instead of two distinct sentences.

Questions

  1. Why not include scores like EigenValue, Eccentricity, or Degree in the logistic regression inputs?
  2. Same-dataset, cross-model performance seems generally good. E.g. the same logistic regression weights can be reused for the same dataset with different models. Notably missing seems to be same-model, cross-dataset results. How important is it really to train this logistic regression model as compared to using, say, SP lexical similarity for everything?--or one fixed set of weights for everything?
Comment

assess the calibration of different approaches

Thanks for the suggestion. We now report ECE in Table 9. As can be seen our method outperforms other approaches consistently.
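For reference, expected calibration error (ECE) can be computed with the standard equal-width binning formulation; the 10-bin choice below is an illustrative default:

```python
# Sketch: expected calibration error (ECE) with equal-width confidence bins.
import numpy as np

def expected_calibration_error(confidences: np.ndarray, correct: np.ndarray, n_bins: int = 10) -> float:
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for i, (lo, hi) in enumerate(zip(bins[:-1], bins[1:])):
        left = confidences >= lo if i == 0 else confidences > lo
        mask = left & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap   # weight each bin by its fraction of samples
    return ece
```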

error bars

We have now added 1 standard deviation (rounded to three decimal places) error interval for our approach in Tables 2 and 3. No error interval implies the rounding resulted in 0. As can be seen the error intervals are small and our results are still consistently better than the competitors.

baselines

We have now added two baselines based on your suggestion, which we thought to be the most relevant, namely semantic entropy (SE) and average verbalization confidence (AVC), in Tables 2 and 3. As can be seen, our improvements still hold.

Minor comments

Thanks for the excellent minor suggestions. We have now incorporated them in the latest version.

Why not include scores like EigenValue, Eccentricity, or Degree in the logistic regression inputs?

One could; however, in this work we wanted to showcase the efficacy of our suggested features in estimating confidence, which are computed directly from LLM responses. In the future, one could potentially use derived features or estimators themselves as features for our LR model. But even without those, our procedure seems to produce superior estimates compared with baselines.

Same model different datasets

We did experiments with the same dataset and different models because the possibility of successful transferability was a direct consequence of the common features that stood out from our LR model in Table 4, thanks to our interpretable approach. Transferability across datasets was not as apparent across the board, possibly because some datasets have contexts while others do not, and because of differences in data distributions. This was confirmed when we applied models from one dataset to another: the best transfer we got was from CoQA to SQuAD using Llama, with an AUROC of 0.61 and an AUARC of 0.54, which, although reasonable, is still lower than the values we got with their respective confidence models.

Comment

re: error bars

We have now added 1 standard deviation (rounded to three decimal places) error interval for our approach in Tables 2 and 3. No error interval implies the rounding resulted in 0. As can be seen the error intervals are small and our results are still consistently better than the competitors.

This is useful for determining how randomness in your procedure can affect the outcome of the score. However, I am still a bit curious about how randomness in the random sampling of the data itself could have affected the outcome of the ranking. A bootstrap CI might be more useful here.

re: every other question/weakness

Thank you! I feel that my concerns have been addressed on every other issue.

Comment

bootstrap CI...

That's an excellent suggestion. We will add experiments doing this in the final version as we can't update the paper anymore. Nonetheless, as a testament that our procedure is robust to sampling, we refer you to Table 8 in the appendix, which shows that our results are quite good even with fewer samples used to train our LR model. Moreover, we have now conducted some initial experiments on SQuAD averaged over 5 bootstraps and our results are quite stable (mean ± std): AUROC for Llama 0.82 ± 0.005, flan-ul2 0.81 ± 0.002, mistral 0.84 ± 0.004; AUARC for Llama 0.69 ± 0.003, flan-ul2 0.96 ± 0.001, mistral 0.96 ± 0.003. We will add results on other datasets in the final version.

other questions

We are glad that your other concerns are satisfied.

Comment

Dear Reviewer uR2D,

Thanks for your response. Since all your concerns but one which we have now responded to above have been satisfied would you please consider updating your assessment of our work? Thank you.

Comment

Thanks for the reminder. I think I'll increase the score to a 6 now. (from 5). Thanks for addressing the bootstrap CIs, looking forward to seeing them in full.

Comment

Thank you for your response and updated assessment.

Review (Rating: 5)

This paper studies the problem of uncertainty quantification in language model responses using only black-box access to the language model. In order to quantify uncertainty, the authors study many different ways of varying outputs: they sample from the same prompt using different strategies, paraphrase the prompt via backtranslation, reorder sentences, repeat sentences, remove stopwords, and measure whether parts of the outputs entail other parts using an NLI model. They then convert the outputs of these perturbations into features (using different methods for different sources of variation), and finally train a logistic regression model on these features to predict “confidence”. The paper then runs experiments on Flan-ul2, Mistral 7B instruct, and Llama 2 13B. They find that their method outperforms baselines (such as studying semantic sets and lexical similarity, measured by AUROC), and that the feature that matters most in the regression model is typically the number of “semantically” equivalent sets of outputs sampled from the prompt.

Strengths

  • This paper studies an important problem: assessing uncertainty in the language model's output space. And they do so for open-ended generation tasks like summarization where uncertainty quantification is much harder (since you cannot look at the entropy at a specific token).
  • The authors have a wide range of experimental analyses in different settings.
  • Some of the findings I thought would be useful to the community — for example, I thought it was interesting that repeated sampling from the original prompt yields more useful features than perturbing the prompt (and I would be interested to see how this can be extended in a maximally sample-efficient way).

Weaknesses

  • The primary weakness is the use of rouge score (a metric on top of n-gram overlap with ground truth summaries) for evaluation. Many of the perturbations, such as mixing up or repeating sentences with entities, help capture how entities are processed into the output (which rouge disproportionately captures), rather than actual correctness.
  • The procedure the authors propose is very inefficient; it requires many repeated samples from the language model. It also is not clear if this extra compute is deployed optimally; for example, it could be better to deploy more compute into stochastic decoding.
  • There are a number of ways the presentation could be improved. For example, the paper introduces the notion of confidence in the intro using lots of notation, but uses no notation to explain the (somewhat complicated) methods of extracting features for the logistic regression classifier.
  • I’d also encourage the authors to draw from some related work in estimating the confidence / probability of correctness (e.g., [1] and [2] have studied this formally).

[1] Bartlett and Wegkamp, 2008. Classification with a Reject Option using a hinge loss.

[2] Geifman and El-Yaniv, 2017. Selective Classification for Deep Neural Networks.

Questions

  • Do you expect the gains over the baselines to remain as large using a different evaluation criteria, such as using e.g., Llama 3 70B as a judge of correctness?
  • Do you expect the results to extend to frontier systems (where we only have black-box access)?
Comment

Evaluating summaries with LLM-as-a-judge

We have now evaluated the summarization tasks using GPT-4 as a judge. These results are reported in Tables 16-18 in the appendix. As can be seen, the results are qualitatively similar to those reported in the main paper, where we are superior in most cases.

Compute vs Performance

To showcase the strength of our method relative to the amount of compute, we refer you to the following three experiments in the appendix. In Table 8, we show that even when we train our logistic regression classifier on fewer samples (250 and 500), our method is still quite performant. Then, in Tables 13 and 14, we run the baselines making the same number of queries to the LLM as we did for our feature-based approach. This results in many more responses (25) for the black-box baselines which rely solely on stochastic decoding. As can be seen, the results are qualitatively the same and we still outperform the baselines consistently. Thus, additional compute for the baselines does not elevate their performance by much, indicating that our framework and features have value. In Table 15, we reduce the number of generations from 5 to 3 for our features and find that our approach still produces quality estimates.

Related work [1,2] for estimating confidence

Thank you for sharing these selective classification works. They are definitely relevant and could be adapted to train a better confidence model given our features. We have now cited these works and mentioned this in the discussion section.

Results to extend to frontier systems

We expect the results will carry over as our features are dynamic and not tied to the specifics of any model. This is evidenced by our new experiments on GPT-4, reported in Tables 10, 11 and 12, where we outperform the competitors. That said, our intention here is also to propose a framework under which new features can be designed and subsequently used in the future. We mention a potential new class of features in the last paragraph of the discussion section. As such, we view the features as ever evolving, but the framework as remaining valid as we move forward.

Clarify featurization

We have now added examples in section 3.2 to clarify what each feature captures. If there are any specific aspects that need further clarification, we would be happy to address them in the final version.

Comment

Dear Reviewer pHCy,

We have addressed your main concern about using LLM-as-a-judge for summarization as mentioned in our response to you above. We have also tried to address your other concerns. We would be glad if you could consider our response and update your assessment of our work as appropriate. Thank you.

Comment

Thanks for your response! I appreciate that you added GPT-4 as a judge; I think it improves the quality of the paper. However I'm still concerned about the computational requirements of the method and while the new experiment in my opinion slightly improves the paper, I'll still maintain my current score.

Comment

Dear Reviewer pHCy,

We are glad that your main concern about LLM-as-a-judge is satisfied.

Regarding computational requirements we have the following comments:

  1. Please note that even with the same number of queries for the other methods our approach was consistently better as shown in Tables 13 and 14.

  2. If we compare the results where we use just 250 samples to estimate the confidence, as shown in Table 8, with the other methods using 1000 samples in Tables 2 and 3 in the main paper, our estimates are still better. This is relevant because we are now using roughly the same number of queries as the other methods (as we have 4 additional features) in the setups experimented with in previous works; hence the computational burden is comparable, yet we still outperform them.

  3. Our confidence model building is in some sense a one time cost as the built model then can be used for the entire dataset which can be much larger than 1000 instances. Not to mention it can be used across LLMs as shown in the paper.

  4. Also with reduced number of generations as shown in Table 15 our approach is still performant. This points to an interesting possibility of using fewer generations for updating an already built confidence model in a streaming setting with concept drift.

As such, we believe accurate confidence estimation is more pertinent here than computational cost, as in a deployment setting the confidence model can be updated periodically without placing an undue burden on the deployed system.

Thanks again for your excellent comments and we hope you consider our above response. Thank you!

Comment

We are thankful to all the reviewers for their time and effort in providing valuable feedback. We are glad that reviewers found our work to be important, useful, extensible, novel and appealing. Based on your comments we have now updated the paper with additional experiments and clarifications. We now individually address your concerns.

Comment

Dear Reviewers,

We would be obliged if you could consider our rebuttal as we have taken considerable effort in performing experiments and updating the paper based on your excellent suggestions. We would be happy to resolve any outstanding issues that you might still have before the discussion period ends.

Thank you!

Comment

Dear Reviewers,

Since this is the last day for discussion we would be obliged if you could consider our responses and ask any outstanding questions you might still have. Thank you for your consideration.

AC Meta-Review

The authors propose a method for calibrating LLM outputs given black-box query access. Their method involves 1) obtaining a set of generations by introducing heuristic perturbations in the input prompt (e.g., through paraphrasing) as well as directly in the output, 2) extracting some hand-designed features such as measures of diversity in these generations, and 3) learning a logistic regression model on top of these features to predict whether the LLM output is likely to be correct.

Reviewers thought that this was a natural method for an important problem. However, as the paper largely makes an empirical contribution, the reviewers also found several important aspects to be missing or not fully addressed. For example: the regime of models studied (which the authors partially addressed during the rebuttal); a deeper analysis of which heuristics worked well, both in terms of whether they achieved what they set out to do (e.g., whether paraphrasing retained the relevant information in the prompt) and whether they contributed to accurate calibration; a discussion about computational cost relative to other methods, given that the proposed method is expensive; and comparisons to other baselines, both white-box and black-box, in terms of cost and effectiveness.

As such, I am unable to recommend acceptance. I encourage the authors to incorporate the reviewer feedback into their next submission.

Additional Comments from Reviewer Discussion

The discussions were mostly about the missing aspects listed in the metareview. While the authors did try to address these, many of these aspects require additional experiments and some reorganization of the paper. In my opinion, this makes it more suitable for a resubmission.

Final Decision

Reject