PaperHub
Overall: 6.8/10 · Poster · 4 reviewers
Ratings: 4, 5, 4, 4 (min 4, max 5, std 0.4) · Confidence: 3.8
Novelty: 2.5 · Quality: 3.0 · Clarity: 3.3 · Significance: 2.8
NeurIPS 2025

Efficient semantic uncertainty quantification in language models via diversity-steered sampling

OpenReview · PDF
Submitted: 2025-05-10 · Updated: 2025-10-29
TL;DR

We inject a semantic-similarity penalty term into decoding for autoregressive and masked-diffusion LMs, generating semantically diverse answers and, with importance reweighting, enabling consistent estimation of downstream uncertainty metrics.

Abstract

Keywords
uncertainty quantification · large language models · natural language generation · semantic diversity · importance sampling

Reviews and Discussion

Official Review
Rating: 4

This paper introduces a novel framework to estimate both aleatoric and epistemic uncertainty in large language models (LLMs) for free-form question answering. The method centers on a diversity-steered sampling technique that penalizes semantically redundant generations during decoding using an NLI model (fine-tuned to work on partial or masked texts). The approach is model-agnostic, applicable to both autoregressive (AR) and masked diffusion models (MDM), and uses importance weighting and control variates to debias and reduce variance in the uncertainty estimation.

Strengths and Weaknesses

Strengths:

  • The paper proposes a principled, gradient-free method to enforce semantic diversity during decoding, using NLI-based similarity scoring, which is cleverly applied in both AR and MDM settings. This aligns the generation process with the uncertainty estimation objective.
  • The framework unifies semantic clustering, importance-weighted sampling, and variance-reduction via control variates, offering a comprehensive pipeline for both aleatoric and epistemic uncertainty quantification.
  • The method works with existing LLMs without requiring access to their gradients or architectures, and experimental validation spans multiple datasets, models (OPT, LLaMA, LLaDA), and decoding paradigms.

Weaknesses:

  • The method introduces a sequential sampling cost due to diversity enforcement, which scales linearly with sample size. While effective, it may be less scalable in large-generation scenarios compared to parallelizable approaches like SDLG.
  • The accuracy of uncertainty estimation heavily depends on the quality of semantic clustering derived from the fine-tuned NLI model. Noisy or ambiguous entailment judgments may distort both SE and MI estimates.
  • The paper focuses on quantitative metrics but lacks qualitative examples or failure cases to illustrate how the method handles edge cases, hallucinations, or ambiguous prompts—key aspects for uncertainty estimation.

Questions

NA

Limitations

NA

Final Justification

Thank you for the rebuttal; I have decided to maintain my score.

Formatting Issues

NA

Author Response

Thank you for your encouraging comments on our paper's novelty, applicability, and comprehensiveness. Please find below our point-by-point responses and new qualitative case studies motivated by your suggestions.

less scalable in large-generation scenarios compared to parallelizable approaches

We acknowledge that the sequential sampling cost prevents fully parallel generation. The scenario we consider is not all-purpose large generation but targeted generation with the specific goal of estimating a downstream uncertainty metric. As long as the target uncertainty metric is defined on top of semantic clusters, like SE or MI, steering sampling toward covering more clusters can help estimate it with fewer generations. Baseline iid sampling methods, in contrast, can "waste" sampling resources with semantically redundant generations. Figure 3 shows that our method helps cover more semantic clusters relative to baselines, and Appendix C presents case studies illustrating the improvement in semantic diversity. Moreover, Appendix B.3 presents an optional online stopping method that enables termination of sampling once uncertainty estimates stabilize, potentially reducing the required sample size below the >20 samples typically required by standard iid baselines.

As derived in Appendix B.5, the extra per-token cost relative to standard iid sampling is $O(V P M_{\rm NLI})$, where $P$ is the size of the running sample set, $V$ is the vocabulary size, and $M_{\rm NLI}$ is the cost of computing NLI entailment. In practice, we gain speed by restricting the similarity computation to the top-$k$ tokens at each decoding step (Appendix B.5, L931–933) with $k=32$, as most tokens are not relevant to the query at hand. This reduces the extra cost to $O(k P M_{\rm NLI})$ per token. Note that the diversity penalty can still be computed in parallel over the $k$ tokens at each step.
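To make the top-$k$ mechanics concrete, here is a minimal sketch of one penalized decoding step. This is our illustration rather than the paper's implementation: `nli_entailment` is a hypothetical helper wrapping the finetuned NLI model (returning a bidirectional entailment score in [0, 1]), and using the maximum entailment over the running sample set as the penalty is an assumption on our part.

```python
import torch

def diversity_steered_step(logits, prefix_ids, running_samples, nli_entailment,
                           lam=0.5, k=32):
    """One decoding step that penalizes top-k candidates which would make the
    partial sequence semantically redundant with previously drawn samples."""
    log_probs = torch.log_softmax(logits, dim=-1)      # [V]
    topk_logp, topk_ids = log_probs.topk(k)            # restrict scoring to top-k

    penalties = torch.zeros(k)
    for j, tok in enumerate(topk_ids.tolist()):
        candidate = prefix_ids + [tok]                 # one-token-ahead partial sequence
        # The k * P entailment calls are independent and can be batched into a
        # single NLI forward pass -- this is the parallelism noted above.
        sims = [nli_entailment(candidate, s) for s in running_samples]
        penalties[j] = max(sims) if sims else 0.0      # redundancy w.r.t. sample set

    steered = topk_logp - lam * penalties              # penalized proposal logits
    probs = torch.softmax(steered, dim=-1)
    return topk_ids[torch.multinomial(probs, 1)].item()
```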

An interesting avenue for future work would be to combine batch iid sampling with diversity steering in a hybrid approach. Our method is still parallelizable over beams or batches given a running sample set, so we could let the model sample $M$ generations in parallel before applying the diversity penalty for the next generation if semantic redundancy is high. We could also consider alternating blocks of iid sampling with diversity-steered sampling.

We will revise the text to clarify which aspects of our added computational cost are parallelizable and elaborate on hybrid extensions.

accuracy of uncertainty estimation heavily depends on the quality of semantic clustering derived from the fine-tuned NLI model

We would like to clarify that we use the finetuned NLI model for sampling (our methodological contribution) but the original NLI model without finetuning for final uncertainty estimation, as done in prior work [1-3, 5]. We will clarify this distinction in the Methods section. That said, our finetuned NLI model can be used in place of the original NLI model as well; both the TRUNC and MASK finetuned models achieved validation accuracies similar to the original model's (~91%), as reported in Appendix A.

No matter what sampling scheme is used, be it ours or any of the baselines, incorrectly clustering the final samples can distort the SE/MI estimates, as pointed out by the reviewer. Semantic clustering is an interesting and challenging problem in its own right, for which bidirectional entailment has been shown to outperform embedding-based or lexical similarity methods [1, 2]. While we agree that improving semantic clustering can improve the accuracy of semantic uncertainty estimates, it lies beyond the scope of this work. We instead focus on how to generate semantically diverse samples, assuming the goal is to estimate SE/MI defined on top of bidirectional entailment clusters.

If the reviewer was referring instead to the accuracy of the diversity-steering term used in sampling, we now provide additional results showing that our finetuned NLI model is not biased toward high or low entailment when the partial sequences do not carry sufficient information. Overall, we find that models tend to output entailment probabilities of ~33% when given input with almost no information -- the examples in Figure 2 (for inputs 1-2 tokens long) and Figure 4 (for near-100% masking) support this. The specific value of 33% can be understood as the maximum-uncertainty prediction in the three-way (entailment/contradiction/neutral) NLI classification task. We will add to Appendix A the accuracy of the entailment model at every level of truncation/masking, in addition to the average across all levels (73-74%) reported previously. Accuracy rises steeply from 34% to 72% as corruption decreases from 90% to 50% and reaches 91% at no corruption. We will also provide distributions of predicted P(entailment) for single-token inputs and 95% masking, to demonstrate that the model is correctly uncertain in settings with very little information about entailment. In this no-information regime, the desired behavior is to default to iid sampling, which is why we initialize the steering-strength hyperparameter at $\lambda_0 = 0$ and only dynamically adjust it upwards when sample redundancy is significantly above 0.3 in terms of bidirectional entailment. Under our adaptive scheme for setting $\lambda_t$, we find that $\lambda_t$ tends to increase monotonically over sampling time as we acquire more information about the text being generated (Figure 5).

qualitative examples or failure cases to illustrate how the method handles edge cases, hallucinations, or ambiguous prompts

Thank you for this suggestion. We have added case studies to the appendix demonstrating that our method can surface meaningful SE/MI values by uncovering relevant semantic clusters in three settings: (1) when the LM hallucinates, (2) when the LM is legitimately uncertain due to ambiguity in the prompt, and (3) a combination of the two. We summarize them briefly below.

(1) Hallucination. If the prompt asks about a domain-specific, long-tail fact, models tend to hallucinate. We expect to see one cluster containing the correct answer and a few hallucination clusters. MI should be high, as the model easily "changes its mind" when conditioned on its own response. SE should depend on the hallucination pattern -- low if there are a few dominant modes of hallucination and high if hallucinations are varied and diffuse. We manually analyze responses to a well-defined biomedical question (with a single correct answer) for this setting.

(2) Ambiguous prompt. If the prompt is vaguely worded or asks to name one item among many, models can output several valid semantic clusters. We expect SE to be high and MI to be moderate or low, as the model would not be internally conflicted about a given answer. We manually analyze responses to a precisely worded and easy question with multiple correct answers, taken from the AmbigQA dataset.

(3) A mix of both. In Appendix C, we provided sample responses for the ambiguous, domain-specific question with multiple possible correct answers: "Name one way to measure the Hubble constant." Given 8 sample generations, our method is able to surface more valid clusters corresponding to different methods of measuring the Hubble constant as well as hallucination clusters with made-up physics facts.

[1] Farquhar, S. et al. (2024). Detecting hallucinations in large language models using semantic entropy. Nature, 630(8017), 625-630.

[2] Kuhn, L. et al. (2023). Semantic uncertainty: Linguistic invariances for uncertainty estimation in natural language generation. ICLR.

[3] Lin, Z. et al. (2023). Generating with confidence: Uncertainty quantification for black-box large language models. TMLR.

[4] Manakul, P., et al. (2023). Selfcheckgpt: Zero-resource black-box hallucination detection for generative large language models. EMNLP.

[5] Aichberger, L., et al. (2025). Semantically diverse language generation for uncertainty estimation in language models. ICLR.

Comment

Dear Reviewer,

Thank you once again for your thoughtful review and for your positive evaluation of our work. We hope our rebuttal addressed your concerns regarding the computational efficiency, NLI performance, and qualitative illustrations. If so, could you please post a brief acknowledgment of our response and consider raising the score?

We are also happy to clarify any points or provide additional material if that would be helpful.

Official Review
Rating: 5

In this work, the authors present a more efficient sampling method for estimating semantic uncertainty in LLMs. The authors sample from a modified LM proposal distribution that is steered by a fine-tuned NLI entailment model to produce a set of semantically diverse responses. They then use importance sampling and control variate techniques to recover low-variance estimates of uncertainty under the original model distribution (entropy and mutual-information-type quantities). The resulting method on average outperforms vanilla SE for a fixed number of samples across a few different autoregressive and diffusion-based language models.

Strengths and Weaknesses

Strengths:

  • This method is well grounded in theory and flexible enough to work with most models and most UQ methods requiring MC sampling (i.e., consistency-based methods)
  • The NLI fine-tuning method for prefix entailment is, I think, interesting in its own right

Weaknesses:

  • I am not sure whether this proposal distribution is compatible with more complex methods such as DRAGIN (Su et al. 2024) or claim-conditioned probability (Fadeeva et al. 2024), which select a subset of ‘important’ tokens to calculate uncertainty from.
  • Relying on an external entailment model introduces the hyperparameter lambda, which needs recalibration for each new model. Although the authors propose an adaptive scheme, there are no guarantees that this will be optimal for new queries

Questions

  • Do authors assume full entailment holds between premise and hypothesis examples at all levels of corruption when training the NLI model?

Limitations

Yes

Final Justification

In the rebuttal the authors provided additional discussion on how their diversity steered sampling method could be used in a variety of downstream UQ methods, and added further clarification about hyperparameter choice and finetuning data. While I am a bit skeptical about some of the implementation details (such as maintaining original labels for heavily corrupted entailment pairs), the overall method seems well grounded and fairly general.

Formatting Issues

NA

Author Response

We thank the reviewer for the positive feedback and are happy that you found the paper theoretically grounded, flexible, and clearly written. We appreciate the comment that our finetuned NLI is interesting in itself. Please find below our point-by-point responses.

compatibility with DRAGIN (Su et al. 2024)

Thank you for this interesting question, which has helped us consider the implications of semantic diversity steering in the dynamic RAG context, beyond the single-turn QA settings we focused on in our work. One possible entry point into DRAGIN is the RIND module that decides when to retrieve. RIND uses a product of attention weights, token entropy, and stopword indicators to detect the position at which the model begins to hallucinate. This type of scoring can be prone to false positives, which would waste retrieval-augmentation compute. In particular, token entropy is susceptible to over-estimating uncertainties when softmax weight is shared across multiple tokens with the same meaning. To improve on this, when RIND classifies a position as requiring retrieval, we could verify this prediction by generating a few partial phrases starting from that position with diversity steering and quantifying a measure of semantic uncertainty. The finetuned NLI model can be used for both steering and estimation to accommodate partial sequences. While using the finetuned NLI model would be new, this approach would naturally extend prior work using NLI for hallucination detection (e.g., [1]).

compatibility with claim-conditioned probability (Fadeeva et al. 2024)

Thank you for bringing the CCP metric to our attention. While we have used SE and MI as illustrative examples, our method is compatible with any downstream uncertainty metric defined on semantic clusters. CCP quantifies uncertainties only on the factual "claim," after marginalizing over semantic clusters (thus removing uncertainties on "surface form," or style, as SE does) and conditioning on claim type/order. Instead of drawing multiple generations, however, it processes a single complete generation post hoc by substituting the token at each position with top-$k$ alternatives and computing the CCP of each alternative token based on NLI predictions: (1) if the text up to position $j$ with the $j$-th token swapped is predicted to "entail" the original text, then they are assumed to have the same meaning and claim type, (2) if "contradict," then they are assumed to have a different meaning but the same claim type, and (3) if "neutral," they belong to different claim types. The CCP of the entire claim is then 1 minus the product of the CCP values of the (non-stopword) words constituting the claim.

We can think of plugging into CCP in a couple of ways:

  • For a given generation $x$, the authors take a pretrained DeBERTa-large model as is to predict NLI relationships between prefixes $x_{1:j}$ and $[x_{<j}, \textrm{substituted token}]$ for each position $j$. This can be risky, as DeBERTa-large is trained on full premise-hypothesis pairs and the CCP scoring relies heavily on its predictions to determine membership in meaning/claim types. We can use the finetuned NLI model instead to better handle prefixes.
  • Instead of post-processing a single generation $x$ by swapping each token $x_j$, we can consider drawing multiple full generations with diversity steering (with the finetuned NLI model) and computing the claim-level CCP with the original NLI model, as done in our paper. Importance reweighting and optional variance reduction proceed as usual by reweighting the token probabilities. This sampling-based approach also eliminates the need to separately handle sub-word token substitutions.

lambda ... needs recalibration for each new model and query

We acknowledge that $\lambda$ is an important hyperparameter that must be tuned. In Section B.3, we describe an adaptive schedule for $\lambda$ that empirically transferred across models and queries, and yielded ${\rm ESS}/N > 0.4$ for all models and datasets in Table 1. The schedule goes as follows: $\lambda$, initialized at zero (no steering), increases only when the maximum entailment exceeds $E_{\rm target} = 0.3$ ($\approx 1/3$), corresponding to the maximum-uncertainty prediction in three-way NLI classification. Empirically, we find that $\lambda_t$ tends to increase monotonically over sampling time as we acquire more information about the text being generated (Figure 5). We will justify our adjustment scheme in more detail in the text.
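A minimal sketch of the schedule just described is below; the additive update rule and step size are our illustrative assumptions, since the response only specifies the initialization and the trigger condition.

```python
E_TARGET = 0.3   # ~1/3, the maximum-uncertainty prediction in three-way NLI
ETA = 0.1        # hypothetical step size, not the paper's setting

def update_lambda(lam, max_entailment):
    """Raise the steering strength only when redundancy exceeds the target."""
    if max_entailment > E_TARGET:
        lam += ETA * (max_entailment - E_TARGET)
    return lam

lam = 0.0  # lambda_0 = 0: with no semantic signal, decoding defaults to iid sampling
```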

Do authors assume full entailment holds between premise and hypothesis examples at all levels of corruption when training the NLI model?

No. During finetuning we retain the original MNLI gold label for each corrupted pair, so contradiction and neutral examples remain even when corrupted. Specifically, for every premise–hypothesis pair we (1) randomly choose one side, (2) truncate or mask it at a random position, without flipping the label, and (3) finetune only the new [TRUNC]/[MASK] embedding and the classifier head. This teaches the model that an incomplete sequence can entail, contradict, or be neutral with respect to the other side, depending on the already-revealed content, as illustrated in Figures 2 and 4. Details of the NLI training can be found in Appendix B.1.
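A rough sketch of this data construction, assuming whitespace tokenization and a simplified masking scheme (both our own simplifications, not the paper's exact recipe):

```python
import random

def corrupt_pair(premise, hypothesis, label, mode="TRUNC"):
    """Corrupt one randomly chosen side of an MNLI pair while keeping the gold label."""
    side = random.choice(["premise", "hypothesis"])          # (1) choose a side
    tokens = list(premise if side == "premise" else hypothesis)
    cut = random.randint(1, len(tokens))                     # (2) random position
    if mode == "TRUNC":
        corrupted = tokens[:cut] + ["[TRUNC]"]               # drop the remainder
    else:
        keep = set(random.sample(range(len(tokens)), cut))   # hide the other positions
        corrupted = [t if i in keep else "[MASK]" for i, t in enumerate(tokens)]
    if side == "premise":
        return corrupted, list(hypothesis), label            # (3) label unchanged
    return list(premise), corrupted, label

example = corrupt_pair("A man plays a guitar .".split(),
                       "Someone is making music .".split(), "entailment")
```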

[1] Manakul, P., et al. (2023). Selfcheckgpt: Zero-resource black-box hallucination detection for generative large language models.

Comment

Dear Reviewer,

Thank you once again for your thoughtful review and for your positive evaluation of our work. We hope our rebuttal addressed your concerns regarding the $\lambda$ hyperparameter and compatibility with dynamic RAG frameworks and other uncertainty metrics. If so, could you please post a brief acknowledgment of our response?

We are also happy to clarify any points or provide additional material if that would be helpful.

Comment

Dear Reviewer mfGz,

Could you please have a look at the authors' rebuttal, see if they have addressed your concerns, provide a specific reply, and acknowledge that you have done so?

Many thanks,

Your AC

Comment

Dear Reviewer mfGz,

Thank you once again for your positive review. Could you please let us know if we have addressed your concerns, or if you would like further clarification on any of the items?

Comment

I'd like to thank the authors for their thoughtful response to my questions. The authors' discussion of how their diversity-steered sampling method can interact with other uncertainty metrics helps support the generalizability of their approach. I appreciate the details on choosing lambda and generating the finetuning data. Given this additional context, I am inclined to keep my positive score.

Official Review
Rating: 4

In this paper, the authors propose to use "diversity-steered sampling" to generate diverse samples, which are then used for downstream uncertainty estimation. The key idea is to use a Natural Language Inference (NLI) model to score semantic similarity between generated outputs as a diversity signal to modify the model's output. They show that their method improves uncertainty estimation across QA tasks and outperforms strong baselines like temperature sampling, beam search, etc. Importantly, their framework is unified: it works for AR as well as diffusion-based models.

Strengths and Weaknesses

Strengths

  1. Principled and Efficient Sampling for Semantic Uncertainty

The paper introduces a well-motivated sampling strategy that leverages NLI entailment-based similarity to bias sampling. Unlike prior diversity-promoting methods (e.g., temperature scaling or diverse beam search), this approach operates at the level of meaning rather than surface form and is applicable to both autoregressive (AR) and masked diffusion models (MDMs).

  2. Clear Presentation with Reasonable Baselines and Fair Assessments

The chosen baselines—temperature sampling, DBD, and SDLG—are appropriate and closely related to the proposed method. I appreciate that the authors acknowledge the effectiveness of simple baselines in certain settings while demonstrating that their proposed method excels in others. This balanced assessment enhances the credibility of their evaluations.

  3. Modular, Gradient-Free, and Practical for Deployment

The framework is entirely gradient-free and requires no modifications to the base language model, making it broadly compatible with pretrained models. The use of a lightweight, partially fine-tuned NLI model for scoring semantic similarity is computationally efficient and easy to integrate, highlighting its practicality for real-world applications.

Weaknesses

  1. Limited Evaluation of Epistemic Uncertainty

While the paper claims to estimate both aleatoric (via semantic entropy) and epistemic (via mutual information) uncertainty, the empirical evaluation predominantly focuses on semantic entropy. Only a single result pertaining to mutual information is reported in Appendix D. Despite detailed discussions on epistemic uncertainty quantification in Section 3.4, the experimental validation appears minimal.

  2. Lack of In-Depth Analysis and Ablation Studies

The paper attempts to cover multiple aspects: diversity-steered sampling (the main contribution), uncertainty quantification, and applications to both AR and MDMs. While this breadth is good, it comes at the expense of depth. For instance, more comprehensive ablation studies on various components—such as the impact of variance reduction techniques, the performance of the NLI model on truncated sentences, and the influence of tuning the diversity parameter λ—would strengthen the paper. It might be beneficial to move detailed discussions on mutual information, aleatoric, or epistemic uncertainty to the appendix, allowing the main paper to focus more on elucidating the proposed method with clear ablations.

  3. Minor Clarity Issues in Section 3

I found Section 2 to be more comprehensible because authors included examples and intuitive explanations. In contrast, Section 3 is somewhat dense, with mathematical formulations presented without sufficient accompanying intuition or explanatory text, making it harder to follow.

Questions

See weaknesses

Limitations

No, the authors adequately addressed the limitations and potential negative societal impact of their work

Formatting Issues

N/A

Author Response

We are grateful for your positive evaluation of our work regarding the motivation, practicality, design of experiments, and clarity. Your constructive suggestions have helped improve our paper. Please find our point-by-point responses below and summaries of new results motivated by your comments.

limited evaluation of epistemic uncertainty

We have added new evaluations for MI since the manuscript submission. We now report results for both LLaMA 3 8B-Instruct and LLaDA 8B-Instruct (to represent MDMs) on CoQA, TriviaQA, AmbigQA, and TruthfulQA. Previously in the literature, MI has only been evaluated on TriviaQA and AmbigQA for the Gemini 1.0 Pro and Nano-1 models [1], so we believe our experiments using new models and datasets will serve as useful benchmarking results for the community. To summarize briefly, our sampling method shows comparable or slightly better mean AUROC (0.57 ~ 0.64) across TriviaQA, AmbigQA, and TruthfulQA relative to vanilla iid sampling with $\tau=1$ and $\tau=2$, but performs similarly to DBS and SDLG on AmbigQA. DBS results carry high variance, about 30-40% higher than ours. LLaDA shows similar patterns but slightly lower AUROC by 1-2 points overall.

To complement the quantitative results, we also added case studies demonstrating that diversity steering uncovers relevant semantic clusters, leading to meaningful SE/MI values. We consider three settings briefly summarized below.

(1) When the LM hallucinates. If the prompt asks about a domain-specific, long-tail fact, models tend to hallucinate. We expect to see one cluster containing the correct answer and a few hallucination clusters. MI should be high, as the model easily "changes its mind" when conditioned on its own response. SE should depend on the hallucination pattern -- low if there are a few dominant modes of hallucination and high if hallucinations are varied and diffuse. We manually analyze responses to a well-defined biomedical question (with a single correct answer) for this setting.

(2) When the LM is legitimately uncertain due to ambiguity in the prompt. If the prompt is vaguely worded or asks to name one item among many, models can output several valid semantic clusters. We expect SE to be high and MI to be moderate or low, as the model would not be internally conflicted about a given answer. We manually analyze responses to a precisely worded and easy question with multiple correct answers, taken from the AmbigQA dataset.

(3) A mix of both. In Appendix C, we provided sample responses for the ambiguous, domain-specific question with multiple possible correct answers: "Name one way to measure the Hubble constant." Given 8 sample generations, our method is able to surface more valid clusters corresponding to different methods of measuring the Hubble constant as well as hallucination clusters with made-up physics facts.

ablation studies on ... the impact of variance reduction

We have added an appendix section with the following items, briefly summarized:

  1. A brief primer on control variates (CVs). We present the concept in general terms and notation that apply clearly to the SE and MI cases.
  2. Justifications for why our proposed CVs are reasonable choices: We define our CVs as the mean-centered log probabilities of each generation. In natural language generation, high-likelihood samples tend to be mapped to dominant clusters, so our CVs correlate well with the cluster probabilities, as desired. Moreover, we can reduce variance "for free," at no additional inference cost -- that is, the CVs do not require extra forward passes through the base LM, as the log probability of each generation is already needed to evaluate the importance weights $w_i$. We will revise the text of Section 3.4 to make this benefit clearer as well.
  3. Average percent variance reduction achieved (ablation results): We can express the percent reduction in variance in terms of the correlation between the original semantic entropy functional $\tilde w Y$ and our proposed CV $\tilde w X'$, where $Y = -\log \hat p(c(s^{(i)}))$ and $X'$ is the mean-centered version of $X = -\log p(s^{(i)})$. More precisely, if $\sigma_X^2, \sigma_Y^2$ are their respective variances and $\sigma_{XY}$ is their covariance, then the percent reduction in variance is a direct function of the correlation $\rho = \sigma_{XY}/(\sigma_X \sigma_Y)$. We find that $\rho \approx 0.3$-$0.4$, translating to a 10-15% variance reduction.
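As a sanity check on this accounting, a toy Monte Carlo demonstration of the control-variate rule is below (our illustration, with a toy integrand unrelated to the SE/MI functionals): with the optimal coefficient, the variance shrinks by a factor of $1-\rho^2$, so $\rho \approx 0.3$-$0.4$ corresponds to roughly a 10-15% reduction.

```python
import numpy as np

rng = np.random.default_rng(0)

def one_estimate(n=64):
    u = rng.uniform(size=n)
    y = np.exp(u)                                    # quantity of interest, E[Y] = e - 1
    x = u                                            # control variate with known mean 0.5
    beta = np.cov(y, x)[0, 1] / np.var(x, ddof=1)    # estimated optimal coefficient
    return y.mean(), y.mean() - beta * (x.mean() - 0.5)

plain, with_cv = map(np.array, zip(*(one_estimate() for _ in range(5000))))
print(1 - with_cv.var() / plain.var())   # empirical reduction, ~rho^2 for this toy pair
```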

the performance of the NLI model on truncated sequences

We now provide additional results in Appendix A showing the performance of our finetuned NLI model at every level of corruption (truncation/masking), in addition to the average across all levels (73-74% accuracy) reported previously. Accuracy rises steeply from 34% to 72% as the corruption level decreases from 90% to 50%, and continues up to 91% at no corruption. This is consistent with the intuition that the meaning of the full text can be gathered as soon as a few semantically important tokens are revealed.

We also show that our model is not biased toward high or low entailment when the partial sequences do not carry sufficient information:

  • Models tend to output entailment probabilities of ~33% when provided input with almost no information (new histogram figures in the appendix). The existing examples in Figures 2 (for input 1-2 tokens long, leftmost) and 4 (for near 100% masking, rightmost) support this. The specific value of 33% can be understood as the maximum-uncertainty prediction in the three-way (entailment/contradiction/neutral) classification.
  • We will also provide distributions of predicted P(entailment) for single-token inputs and 90% masking. They are both sharply peaked at 33%, demonstrating that the model is correctly uncertain in settings with very little information about entailment.

influence of tuning the diversity parameter $\lambda$

To avoid biasing the sampling unnecessarily, it is important for the sampler to default to iid sampling (i.e., $\lambda = 0$) when the partial sequence carries no information about semantic redundancy with the sample set. This is why we initialize the steering-strength hyperparameter at $\lambda_0 = 0$ and only dynamically adjust it upwards when sample redundancy is significantly above 0.3 in terms of bidirectional entailment. Under our adaptive scheme for setting $\lambda_t$, we find that $\lambda_t$ tends to increase monotonically over sampling time as we acquire more information about the text being generated (Figure 5). We will justify our adjustment scheme more clearly in the text.

accompanying intuition or explanatory text [for Section 3]

Thank you for this suggestion. While our applications of importance sampling and control variates are standard, we agree that the definitions of SE and MI involve quite a bit of detail that can benefit from explanations to guide the intuition. To supplement Section 3.3, we will include figures illustrating SE and MI for the running "Hubble constant" example in the appendix, with each generation $s^{(i)}$ illustrated as a dot in the "semantic" space and the clusters shown as different colors. As Section 3.4 is more difficult to capture visually, we will include a step-by-step worked example of MI estimation for AR and MDM (Algorithms 2-3) for a simple QA pair.
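For orientation, the cluster-level quantities that SE is built on take the standard form below; this is the common formulation in the semantic-entropy literature, and the notation here is ours rather than copied from the paper.

```latex
% p(c | x, theta): probability mass of semantic cluster c; SE: its Shannon entropy.
p(c \mid x, \theta) = \sum_{s \in c} p(s \mid x, \theta), \qquad
\mathrm{SE}(x) = -\sum_{c} p(c \mid x, \theta)\, \log p(c \mid x, \theta).
```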

[1] Yadkori, Y. A., Kuzborskij, I., György, A., & Szepesvári, C. (2024). To believe or not to believe your llm. NeurIPS.

Comment

Dear Reviewer 1ALS,

Could you please have a look at the authors' rebuttal, see if they have addressed your concerns, provide a specific reply, and acknowledge that you have done so?

Many thanks,

Your AC

Comment

Dear Reviewer,

Thank you once again for your thoughtful review and for your positive evaluation of our work. We hope our rebuttal addressed your concerns. If so, would you be willing to post a brief acknowledgment of our response?

We are also happy to clarify any points or provide additional material if that would be helpful.

Comment

Thank you for your detailed response! Your additional experiments make sense and will be a nice addition to the paper. Thanks for clarifying my questions regarding NLI as well. I think this paper presents its method clearly, with small but consistent improvements over baselines. I will maintain my score.

Official Review
Rating: 4

This paper proposes a novel and modular method for improving semantic uncertainty quantification in large language models (LLMs), applicable to both autoregressive (AR) and masked diffusion models (MDM). It introduces diversity-steered sampling using a semantic repulsion term derived from a fine-tuned NLI model, and applies importance weighting to correct for bias and reduce variance in downstream uncertainty estimates. Across four QA benchmarks, the approach outperforms baselines such as random sampling and diversity-based beam search.

Strengths and Weaknesses

Strengths

The proposed solution for semantic uncertainty quantification is interesting, novel, theoretically grounded, and clearly presented. Specifically, the formulation of diversity-steered sampling introduces a “repulsive” term that discourages sequences semantically similar to previously sampled sequences. Semantic similarity between two sequences is estimated using NLI models with bidirectional entailment scores. Further, the new distribution is corrected with importance weighting to debias the estimates of semantic entropy, mutual information, etc.

The same technique applies to the popular autoregressive language models and emerging masked diffusion models.

Empirical evaluation on several models and datasets is provided. The results demonstrate that the introduced diversity-based sampling strategy performs better than baselines, including random sampling and diversity-based beam search.

Weaknesses

The use of entailment models for semantic similarity is not well justified. Moreover, the paper deals with measuring the semantic similarity of partial sequences, which can be significantly challenging, given that the presence of a single word (or token, e.g., not) can change the semantics significantly. Such mistakes in similarity measurements can eventually lead to erroneous samples coming out of the process.

Computational efficiency is another major issue here, given that semantic similarities need to be computed using an entailment model after generating each token. This can significantly slow down the inference procedure.

The experiments are focused on generating the correct sequence, i.e., the evaluation seems to be similar to any other inference technique and does not focus much on the uncertainty quantification aspects, which is a major focus of the paper.

Questions

  1. What is the reasoning behind specific weights in equation (8)? Is it possible to provide mathematical details to support these, and can we have weights such that there is equality in equation (9)?

  2. A suggestion: Sections 3.3 and 3.4 could be better with some interpretation, examples, or simulations to ground the intuition and understanding of these quantities in the context of language models.

Limitations

yes.

Final Justification

Rebuttal addressed my questions and I am happy to keep my current scores.

Formatting Issues

No issues.

Author Response

Thank you for your positive evaluation of our method's novelty, theoretical grounding, and clarity. We appreciate the constructive suggestions that have helped improve our paper. Please find below our point-by-point responses and new results motivated by your suggestions.

... use of entailment models for semantic similarity is not well justified

The use of entailment models is well established in the literature on analyzing natural language generation, particularly in hallucination detection (please see [1-5] for some well-cited examples). It has been shown to outperform embedding-based or lexical similarity methods for clustering semantically equivalent generations [1, 2]. More generally, the detection of entailment has been extensively studied in linguistics [6]. We will make this positioning clearer in the Introduction and Related Work sections.

measuring the semantic similarity of partial sequence ... can be significantly challenging, given that the presence of a single word (or token, e.g., not) can change the semantics significantly

We agree that predicting semantic entailment between a complete sequence and a partial sequence, rather than between two complete sequences, can be subtle. In fact, the right panel of Figure 2 illustrates exactly this flipping of polarity upon revealing the token "not." Insofar as we use bidirectional entailment for steering generation, this behavior (sensitivity to a few important tokens) is desired. Taking the Figure 2 example, if the fully generated sample were "You and your friends are not welcome here, said Severn," and the partial generation so far were "Severn said the people were," then we would discourage the choice of "not" as the next token, as this would result in semantically redundant generations, and instead steer the sampling toward something like the token "welcome," for which the model would predict a low entailment score.

To ensure that the model generalizes well to partial-sequence input, we finetune it to treat [TRUNC] or [MASK] as unknown context. While only 0.3% of the model parameters are updated, the model is capable of predicting the correct label long before the entire text is revealed. In Appendix A, we report a validation accuracy of 73-74% averaged over all levels of truncation/masking. To complement this, we will add plots of accuracy at every level of truncation/masking. Accuracy rises steeply from 34% to 72% as the corruption level decreases from 90% to 50%, and continues up to 91% at no corruption. The intuition is exactly as the reviewer described: the meaning of the full text can be gathered as soon as a few semantically important tokens are revealed.

The reviewer raises an important point that we do not want to be biased toward high or low entailment in the absence of information. This motivates our choice to use P(entailment) scores directly rather than the class predictions. Overall, we find that models tend to output P(entailment) of ~33% when given input with almost no information, as supported by Figures 2 (leftmost) and 4 (rightmost). The specific value of 33% can be understood as the maximum-uncertainty prediction in the three-way (entail/contradict/neutral) NLI classification task. In Appendix A, we will add plots of the distribution of predicted P(entailment), sharply peaked at 33% for single-token inputs and 90% masking, to demonstrate that the model is correctly uncertain in settings with very little information about entailment. In this no-information regime, the desired behavior is to default to iid sampling, which is why we initialize the steering-strength hyperparameter at $\lambda_0 = 0$ and only dynamically adjust it upwards when sample redundancy is significantly above 0.3 in terms of bidirectional entailment. Empirically, we find that $\lambda_t$ tends to increase monotonically over sampling time as we acquire more information about the text being generated (Figure 5).

Lastly, to further justify our use of predicted entailment probabilities, we'd like to present a perspective of our framework as model-guided online decision making, similar to delayed feedback scenarios in reinforcement learning or Bayesian optimization. The entailment model serves as a cheap "surrogate" that helps guide our "decisions" (which token to choose next) toward maximizing "utility" (full generation that is semantically different from existing samples). There is signal in partial sequences for us to construct our belief of whether the one-token-ahead partial sequence will be semantically equivalent to a running sample set.

Computational efficiency

We acknowledge that the semantic repulsion term adds computational overhead due to look-ahead NLI model scoring. However,

  • In practice, we restrict the similarity computation to the top-$k$ tokens at each decoding step (Appendix B.5) with $k=32$, as most tokens are not relevant to the query at hand. The extra per-token cost relative to standard iid sampling is $O(k P M_{\rm NLI})$, where $P$ is the size of the running sample set and $M_{\rm NLI}$ is the cost of computing NLI entailment.
  • Our experiments run on a single A100 GPU with <32 GB of VRAM and remain tractable for typical N=16 samples (Appendix B.6).
  • Importantly, the user can optionally implement online stopping (Appendix B.3), which allows termination of sampling once uncertainty estimates stabilize, potentially reducing the required sample size below the >20 samples typically required by standard iid baselines.

That said, we agree that hybrid approaches using batch sampling are promising for further speed gains.

evaluation [focuses on correctness] ... does not focus on the uncertainty quantification aspects

We would like to clarify a possible misunderstanding. Our evaluation focuses on the (frequentist) calibration quality of our uncertainties, as is standard in the uncertainty quantification literature [1-3, 5]. The AUROC metric evaluates rank calibration; it gives the probability that we assign lower uncertainty to a randomly selected correct answer than to a randomly selected incorrect answer; good calibration would yield AUROC scores ~1 while poor calibration would score ~0.5. That is, while we report AUROC against correctness, SE and MI are the ranking variables. The ROC curve is obtained by sorting the uncertainties from highest confidence to lowest (call the negated uncertainties $z_i$), sweeping a threshold $t$ through them, and computing for each $t$:
$${\rm TPR}(t)=\frac{\text{num positives with } z_i \ge t}{\text{num positives}}, \qquad {\rm FPR}(t)=\frac{\text{num negatives with } z_i \ge t}{\text{num negatives}}.$$
AUROC is then $\int_0^1 {\rm TPR}({\rm FPR})\, d({\rm FPR})$. We will add more details about the evaluation metric in the appendix.
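A minimal illustration of this evaluation (with made-up uncertainty values and correctness labels) might look as follows:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

uncertainty = np.array([0.2, 1.5, 0.4, 2.3, 0.9])  # SE or MI per question (placeholder values)
correct = np.array([1, 0, 1, 0, 1])                 # 1 if ROUGE-L exceeds the threshold

# Negate so that higher score means higher confidence; AUROC near 1 means the
# uncertainties rank correct answers above incorrect ones, ~0.5 is uninformative.
print(roc_auc_score(correct, -uncertainty))
```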

The AUROC metric requires us to threshold the ROUGE-L score to determine correctness, so we have also reported the Spearman $\rho$ between ROUGE-L scores and SE/MI as a complementary metric, and find that our method outperforms DBS and SDLG by 5% and 6%, respectively, on this metric as well.

reasoning behind specific weights in equation (8)

Equation (8) defines the standard importance ratio $w_i = p(s_i) / q(s_i)$ used to correct for the fact that we sample from the biased (diversity-steered) proposal distribution $q$ when we seek to analyze $p$, and its self-normalized form $\tilde w_i$ satisfying $\sum_i \tilde w_i = 1$.

equality in (9)

Equation (9) is a consistent estimator (equality holds in the limit $N \to \infty$) for the cluster probability under usual regularity assumptions for self-normalized importance sampling -- a finite second moment of $w$ being sufficient. Rewriting it as $\frac{\sum_{i=1}^N w_i 1[s^{(i)} \in c]}{\sum_{i=1}^N w_i}$ and applying the strong law of large numbers to the numerator and denominator separately guarantees almost-sure convergence to $\frac{\mathbb{E}_{s \sim q}[w(s) 1[s \in c]]}{1} = p(c|x, \theta)$. The estimator has bias $O(1/N)$ and the usual $1/\sqrt{N}$ precision. We had omitted the proofs of consistency, finite-sample bias, and asymptotic variance because they are standard results in self-normalized importance sampling (e.g., Theorem 9.2 of [7]), but they may be helpful to include given our particular definition of $q$ (which justifies the finite-second-moment assumption) and treatment of the indicator function $1[s \in c]$. We will expand the main text around Equation (9) and add the proofs to the appendix.
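To make the estimator concrete, a small numerical sketch with made-up log probabilities and cluster assignments is below; the final line is one simple plug-in choice for SE and not necessarily the paper's exact functional.

```python
import numpy as np

log_p = np.array([-12.1, -9.8, -15.3, -10.4])    # log p(s_i | x) under the base LM
log_q = np.array([-11.0, -10.5, -13.9, -10.9])   # log q(s_i | x) under the steered proposal
cluster = np.array([0, 0, 1, 2])                  # semantic cluster of each sample

w = np.exp(log_p - log_q)                         # Eq. (8): importance ratios p/q
w_tilde = w / w.sum()                             # self-normalized weights, sum to 1

p_hat = np.array([w_tilde[cluster == c].sum()     # Eq. (9): cluster probability estimates
                  for c in np.unique(cluster)])
se_hat = -(p_hat * np.log(p_hat)).sum()           # plug-in semantic entropy estimate
```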

Sections 3.3 and 3.4 could be better with interpretation, examples, or simulations

Thank you for this suggestion. While our applications of importance sampling and control variates are standard, we agree that the definitions of SE and MI involve quite a bit of detail that can benefit from explanations to guide the intuition. To supplement Section 3.3, we will include figures illustrating SE and MI for the running "Hubble constant" example in the appendix, with each generation $s^{(i)}$ illustrated as a dot in the "semantic" space and the clusters shown as different colors. As Section 3.4 is more difficult to capture visually, we will include a step-by-step worked example of MI estimation for AR and MDM (Algorithms 2-3) for a simple QA pair.

[1] Farquhar, S. et al. (2024). Detecting hallucinations in large language models using semantic entropy. Nature.

[2] Kuhn, L. et al. (2023). Semantic uncertainty: Linguistic invariances for uncertainty estimation in natural language generation. ICLR.

[3] Lin, Z. et al. (2023). Generating with confidence: Uncertainty quantification for black-box large language models. TMLR.

[4] Manakul, P., et al. (2023). Selfcheckgpt: Zero-resource black-box hallucination detection for generative large language models. EMNLP.

[5] Aichberger, L., et al. (2025). Semantically diverse language generation for uncertainty estimation in language models. ICLR.

[6] MacCartney, B. (2009). Natural Language Inference.

[7] Owen, A. B. (2013). Monte Carlo theory, methods and examples.

Comment

Dear Reviewer Nwdr,

The authors have replied to your main concerns regarding (1) the justification of entailment models, (2) computational efficiency, and (3) the evaluation of uncertainty quantification in the experiments.

Is there any specific feedback to those responses?

Many thanks,

Your AC

Comment

Dear Reviewer Nwdr,

Thank you once again for your positive review. Could you please let us know if we have addressed your concerns, or if you would like further clarification on any of the items?

Comment

We would like to thank the reviewers for the positive evaluation of our work (scores: 4, 4, 4, 5) and thoughtful comments. As the discussion period is nearing its end, we wanted to summarize the main updates to the paper so far. The suggested updates were minor in scope, aimed at improving clarity by way of detailed descriptions, qualitative examples, or visual illustrations. We hope that our first rebuttal response has fully addressed all your points:

Clarifications and additional background. We now include a brief introduction to control variates in the Appendix. We also explicitly position our method within previous work using NLI models (without finetuning) in the context of semantic clustering and hallucination detection. To supplement the technical details in Section 3, we include new figures illustrating SE estimation and a step-by-step worked example of MI estimation for a simple QA pair.

Qualitative case studies. To demonstrate that diversity steering uncovers relevant semantic clusters, we manually analyze the behavior of sampled clusters (and the associated SE/MI values) in three settings: (1) when the LM hallucinates, (2) when the LM is legitimately uncertain due to ambiguity in the prompt, and (3) a mix of both.

Computational efficiency (Appendix B.5). We will revise the text to clarify that the scenario we consider is not all-purpose large generation but targeted generation with the specific goal of estimating a downstream semantic uncertainty metric. While our method incurs extra cost per sample, we require fewer samples to cover more semantic clusters. To supplement Appendix B.5 deriving the extra per-token cost of our algorithm relative to iid sampling, we provide more concrete guidance for batched extensions by explicitly describing which aspects of our added computational cost are parallelizable.

Additional empirical results. To show that the finetuned NLI models are not biased toward high entailment in the absence of semantic signal, we provide results on the performance of the finetuned NLI models on heavily truncated/masked input. We also include AUROC metrics for LLaMA 3 8B-Instruct and LLaDA 8B-Instruct (to represent MDMs) for mutual information on TriviaQA, TruthfulQA, and AmbigQA in Table 2.

We believe these updates address all feedback received and further strengthen the paper. Thank you once again for the constructive feedback that has helped improve our work.

Final Decision

The paper proposes a modular method for semantic uncertainty quantification in language models using diversity-steered sampling with importance weighting and variance reduction. The approach is gradient-free, theoretically motivated, and broadly applicable to autoregressive and diffusion models. Reviewers highlighted the novelty of the method, the clarity of exposition, and empirical gains in estimating semantic entropy and mutual information, with additional ablations and case studies in the rebuttal further strengthening the work.

Main concerns were the computational overhead of semantic repulsion, reliance on entailment models for similarity (especially with partial sequences), and limited evaluation of epistemic uncertainty. The rebuttal addressed these by clarifying the role of entailment, showing robustness of fine-tuned natural language inference (NLI) models, and expanding evaluation and ablations. Reviewers agreed these additions improved the paper, though scalability and dependence on the NLI component remain open issues.

Overall, reviewers converged on a cautiously positive stance, with three leaning borderline accept and one more strongly supportive. Given the novelty, sound methodology, and empirical support despite some limitations, I recommend acceptance.