PaperHub
Score: 6.4 / 10
Decision: Rejected · 4 reviewers
Ratings: 4, 5, 5, 2 (min 2, max 5, std 1.2)
Confidence: 3.5
Novelty: 2.5 · Quality: 3.0 · Clarity: 3.5 · Significance: 2.5
NeurIPS 2025

Fact or Hallucination? An Entropy-Based Framework for Attention-Wise Usable Information in LLMs

OpenReview · PDF
Submitted: 2025-05-12 · Updated: 2025-10-29

Abstract

Keywords
LLMs, Hallucinations

Reviews and Discussion

Review
Rating: 4

This paper studies hallucination detection in LLMs, and proposes Shapley NEAR – an entropy-based attribution method that takes into account attention values across all attention heads of LLM layers to predict whether an LLM output is hallucinatory.

The method is grounded in Shapley-value theory and is designed to decompose entropy reduction across layers. The method works by quantifying how much each part of the context (e.g., each individual sentence) influences the prediction of the final token. The sentence attribution is based on the Shapley value, which allows the method to distinguish between parametric hallucination (no/wrong knowledge in the model) and context-induced hallucination (due to the context). The motivation behind (i) probing attention outputs (instead of other internal states) and (ii) taking the last token embedding is based on previous works [11, 15, 5].

Furthermore, the paper develops a method to prune attention heads at test time (without retraining) that can mitigate parametric hallucination. Experiments are conducted on 4 simple QA datasets (CoQA, QuAC, SQuAD, and TriviaQA) and 3 LLMs (Qwen2.5-3B, LLaMA3.1-8B, OPT-6.7B), and the proposed method appears effective.

Strengths and Weaknesses

Strengths

  • The method appears empirically effective compared to baselines.
  • The method is well grounded and extends previous observations.
  • The method allows the distinction between two types of hallucinations.
  • The method is justified both theoretically and empirically.

Weaknesses

  • The QA datasets appear quite old and weak. I’m not certain if the findings would generalise to more modern LLMs or open-ended long-form text generation.
  • Some questionable design decisions – please refer to my questions.

Questions

  1. “Lines 142-143: According to [15, 5], … , we extract the vector corresponding to the final question token.” I’m a little skeptical of this statement. In practice, we are concerned about hallucination in sequence generation (which could involve many tokens), could you justify what [15, 5] found and why taking the representation of the final question token makes sense?
  2. What could be other alternative sentence attribution measures (instead of Shapley value)? I’m not an expert in this area, but I wonder if the authors have considered whether there are other measures? If not, why? Otherwise, it’d be great to compare them against Shapley values (if any).
  3. How would this method be applicable to detecting hallucinations in long-form generation?

Limitations

yes

Justification for Final Rating

The points in rebuttal strengthen the clarity of the paper, making its design choices clearer. In terms of my assessment, I like the paper, and I believe that the original overall assessment (leaning accept) is already fair, but I've raised its clarity.

Formatting Issues

none

Author Response

Thank you for your valuable feedback. We are pleased to submit our rebuttal addressing each of the weaknesses and questions you raised. References to new papers are cited as (First Author et al., Year), while references from our submitted paper use [number] style. Tables from the paper are referred to as Table X, and new tables added in the rebuttal are referred to as Table X (below). For clarity, we denote the reviewer’s Weaknesses as W and Questions as Q throughout the rebuttal.

(Q1) We thank the reviewer for raising this point. As noted in [15,5], the final token representation of the question effectively encodes the aggregated semantic context of the entire query and has been shown to be highly informative for tasks such as answerability detection and uncertainty estimation. Specifically, these works found that final token embeddings provide a compact yet semantically rich summary of the input, outperforming pooled or averaged token embeddings in such settings. Our goal here is not to model the generated sequence directly, but to capture the model’s internal state regarding the question–context pair prior to generation. Thus, using the final question token offers a principled and efficient representation without introducing additional decoding complexity. We will clarify this rationale in the revision.
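To make this concrete, below is a minimal sketch of how a final question-token representation can be read out with Hugging Face transformers; the model name, prompt, and use of last-layer hidden states are placeholders for illustration only and do not reflect our exact pipeline, which probes attention outputs.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-3B"  # placeholder; any causal LM works for this sketch
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)
model.eval()

prompt = "Context: ...\nQuestion: Who proposed the method?"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    out = model(**inputs)

# Representation of the final question token at the last layer: a compact
# summary of the question-context pair prior to generation.
final_token_repr = out.hidden_states[-1][0, -1, :]
```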

(Q2) We thank the reviewer for this valuable question. There are indeed several alternative attribution methods for estimating sentence‑level importance, such as:

  • Leave‑one‑out masking (Li et al., 2016): measuring the change in model output when each sentence is removed.
  • Gradient‑based methods like Integrated Gradients (Sundararajan et al., 2017): estimating contributions via input gradients.
  • Attention‑based attribution (Serrano & Smith, 2019): using attention weights as proxies for importance.

We carefully considered these options but ultimately chose Shapley values because they uniquely satisfy two key axioms, symmetry and additivity [23], making them theoretically grounded for attributing contributions across multiple interacting sentences. In contrast, attention scores often correlate poorly with actual feature importance: (Jain and Wallace, 2019) demonstrated that attention weights can be perturbed without affecting model predictions, highlighting that attention is not a faithful explanation. Leave‑one‑out (Li et al., 2016) fails to fairly distribute contributions across combinations of sentences, and (Srinivas and Fleuret, 2020) argue that gradient‑based methods may not properly capture the attribution of sentences. To empirically demonstrate the strength of our approach, we compared NEAR against the leave‑one‑out attribution method (Li et al., 2016) on LongBench v2 (Bai et al., 2024) across three models. As shown in Table 1 (below), NEAR consistently outperforms leave‑one‑out attribution across AUROC, Kendall’s τ, and PCC, with low variability across three independent runs, highlighting its reliability.

Table 1: Comparison of NEAR vs. Leave‑one‑out (Li et al., 2016) on LongBench v2 (mean ± std over 3 runs).

| Model | Method | AUROC ↑ | Kendall's τ ↑ | PCC ↑ |
| --- | --- | --- | --- | --- |
| Qwen2.5-3B | Li et al. | 0.701 ± 0.006 | 0.449 ± 0.007 | 0.463 ± 0.007 |
| | **NEAR** | **0.792 ± 0.005** | **0.514 ± 0.006** | **0.527 ± 0.006** |
| LLaMA3.1-8B | Li et al. | 0.722 ± 0.006 | 0.467 ± 0.007 | 0.481 ± 0.007 |
| | **NEAR** | **0.812 ± 0.005** | **0.529 ± 0.006** | **0.544 ± 0.006** |
| OPT-6.7B | Li et al. | 0.694 ± 0.006 | 0.443 ± 0.007 | 0.457 ± 0.007 |
| | **NEAR** | **0.799 ± 0.005** | **0.521 ± 0.006** | **0.538 ± 0.006** |

Shapley‑based attribution in NEAR fairly distributes contributions across interacting context sentences and leverages attention‑wise decomposition to capture deep model‑internal reasoning signals. This leads to more faithful and interpretable sentence‑level attributions, especially in long‑context QA tasks, compared to masking‑based or gradient‑based methods.
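To make the contrast with leave‑one‑out concrete, here is a minimal, self-contained sketch of permutation-sampling Shapley attribution over context sentences; the `info_gain` value function is a placeholder for NEAR's attention-wise information gain, so this illustrates the estimator rather than our actual implementation.

```python
import random
from typing import Callable, List

def shapley_sentence_attribution(
    sentences: List[str],
    info_gain: Callable[[List[str]], float],
    num_permutations: int = 50,
    seed: int = 0,
) -> List[float]:
    """Monte Carlo (permutation-sampling) Shapley estimate of each sentence's
    contribution to a scalar information-gain score."""
    rng = random.Random(seed)
    n = len(sentences)
    contrib = [0.0] * n
    for _ in range(num_permutations):
        order = list(range(n))
        rng.shuffle(order)
        included: List[str] = []
        prev_value = info_gain(included)        # value of the empty coalition
        for idx in order:
            included = included + [sentences[idx]]
            value = info_gain(included)
            contrib[idx] += value - prev_value  # marginal gain of sentence idx
            prev_value = value
    return [c / num_permutations for c in contrib]

def leave_one_out_attribution(
    sentences: List[str],
    info_gain: Callable[[List[str]], float],
) -> List[float]:
    """Leave-one-out baseline: drop in score when each sentence is removed."""
    full = info_gain(sentences)
    return [full - info_gain(sentences[:i] + sentences[i + 1:])
            for i in range(len(sentences))]
```

Unlike leave-one-out, the permutation-sampling estimator averages each sentence's marginal gain over many coalition orderings, which is what lets it apportion credit fairly among interacting sentences.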

(W1 & Q3) We thank the reviewer for raising this important concern. To assess NEAR’s effectiveness in long‑context settings, we additionally evaluated our method on LongBench v2 (Bai et al., 2024), which includes multi‑document and long‑context QA tasks with context lengths up to 100K tokens and 503 datapoints. As shown in Table 2 (below), NEAR consistently outperforms both INSIDE [12] and Loopback Lens [17], the strongest baseline methods in our study, across all three models and evaluation metrics, confirming its effectiveness even in challenging long‑context scenarios. While NEAR incurs approximately 1.4× longer runtime than the baselines, this overhead reflects a trade‑off for its unique capabilities, including hallucination type attribution and identification of hallucination‑prone attention heads, features that baseline methods cannot provide. Across three independent runs, the overall standard deviations were low (±0.006 for AUROC, ±0.007 for Kendall’s τ, and ±0.007 for PCC), confirming the stability of these results. These findings reinforce NEAR’s generalizability for hallucination detection across diverse QA settings.

| Model | Method | AUROC ↑ | Kendall's τ ↑ | PCC ↑ |
| --- | --- | --- | --- | --- |
| Qwen2.5-3B | **NEAR (ours)** | **0.792** | **0.514** | **0.527** |
| | INSIDE | 0.709 | 0.457 | 0.471 |
| | Loopback Lens | 0.683 | 0.438 | 0.452 |
| LLaMA3.1-8B | **NEAR (ours)** | **0.812** | **0.529** | **0.544** |
| | INSIDE | 0.727 | 0.468 | 0.483 |
| | Loopback Lens | 0.701 | 0.449 | 0.463 |
| OPT-6.7B | **NEAR (ours)** | **0.799** | **0.521** | **0.538** |
| | INSIDE | 0.719 | 0.461 | 0.479 |
| | Loopback Lens | 0.692 | 0.442 | 0.456 |

Table 2: Comparison of NEAR, INSIDE [12], and Loopback Lens [17] on LongBench v2. NEAR achieves a 7–12% relative improvement in AUROC and consistent gains in Kendall’s τ and PCC across all models.
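For reference, the metrics reported above can be computed with standard libraries; the labels and scores in the sketch below are hypothetical and serve only to show the evaluation setup, not our data.

```python
import numpy as np
from scipy.stats import kendalltau, pearsonr
from sklearn.metrics import roc_auc_score

# Hypothetical per-example data: 1 = hallucinated, 0 = faithful, plus a
# correctness score (e.g., ROUGE-L against the reference answer).
labels = np.array([1, 0, 0, 1, 0, 1, 0, 0])
correctness = np.array([0.1, 0.9, 0.8, 0.2, 0.7, 0.3, 0.95, 0.6])
near_scores = np.array([0.2, 0.8, 0.7, 0.1, 0.6, 0.3, 0.9, 0.5])  # higher = lower hallucination risk

auroc = roc_auc_score(1 - labels, near_scores)  # how well the score separates faithful from hallucinated
tau, _ = kendalltau(near_scores, correctness)   # rank agreement with answer quality
pcc, _ = pearsonr(near_scores, correctness)     # linear correlation with answer quality
print(f"AUROC={auroc:.3f}  Kendall's tau={tau:.3f}  PCC={pcc:.3f}")
```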

References:
Li, J., Chen, X., Hovy, E. and Jurafsky, D., 2015. Visualizing and understanding neural models in NLP. arXiv preprint arXiv:1506.01066.

Sundararajan, M., Taly, A. and Yan, Q., 2017, July. Axiomatic attribution for deep networks. In International conference on machine learning (pp. 3319-3328). PMLR.

Serrano, S. and Smith, N.A., 2019. Is attention interpretable?. arXiv preprint arXiv:1906.03731.

Bai, Y., Tu, S., Zhang, J., Peng, H., Wang, X., Lv, X., Cao, S., Xu, J., Hou, L., Dong, Y. and Tang, J., 2024. Longbench v2: Towards deeper understanding and reasoning on realistic long-context multitasks. arXiv preprint arXiv:2412.15204.

Jain, S. and Wallace, B.C., 2019. Attention is not explanation. arXiv preprint arXiv:1902.10186.

Srinivas, S. and Fleuret, F., 2020. Rethinking the role of gradient-based attribution methods for model interpretability. arXiv preprint arXiv:2006.09128.

Comment

Thank you for your detailed rebuttal. These points should strengthen the clarity of the paper, especially its design decisions. I'd encourage the authors to incorporate them into the paper. In terms of my assessment, I like the paper, and I believe that the original overall assessment (leaning accept) is already fair, but I've raised its clarity.

Comment

We sincerely thank the reviewer for their thoughtful feedback and for recognizing the value of our rebuttal in clarifying the design decisions of the paper. We are pleased that the paper’s overall assessment remains positive and appreciate the encouragement to improve its clarity. We will incorporate the suggested points into the revised version to ensure that our contributions are communicated more effectively. If there are any further questions or points requiring clarification, we would be happy to address them.

Review
Rating: 5

This paper proposes a novel technique to detect hallucinations before they are generated by an LLM, using the model's internal attention-head computations. The intuition behind the method is that, given a context and a question, if the "total information" the model gains from the context before generating its next token is low, then there is a high chance of hallucination (because the model does not properly use the context, or the context is not informative enough for the model to answer the question). The proposed method aims to quantify this information by looking at the attention heads at all model layers, and suggests a measurement of this information gain at the sentence level. The gains are then aggregated across all sentences in the context to yield the final quantification of uncertainty. Notably, this method does not require any additional data or training – it uses the internal attention heads in their original form. The authors apply the method to three different models and 4 different QA datasets, and show significant gains in hallucination detection compared to other strong baselines.

Strengths and Weaknesses

Strengths:

  1. This paper is well-written.
  2. The intuition and motivation are clear.
  3. The method seems to be very effective, achieving impressive results and generalizing across different model sizes, families, and datasets.
  4. The authors also provide theoretical analysis of their method in terms of mathematical properties and bounds.
  5. This paper includes a broad ablation study section, which helps justify the intuition and each design decision.

Weaknesses:

  1. Computing the IG for each sentence can be quite computationally heavy, especially in long-context settings.
  2. It is not fully clear whether this method can be effective in a setting of long text generation (which also includes "facts" or model beliefs).

Questions

  1. Did you try to use NER for hallucination detection in a setting of long-text generation? It would be interesting to see whether hallucinations can be detected in a partly-correct text generated by the model.

Limitations

yes

Formatting Issues

Author Response

Thank you for your valuable feedback. We are pleased to submit our rebuttal addressing each of the weaknesses and questions you raised. In the rebuttal, references to new papers are cited as (First Author et al., Year), while references from our submitted paper use [number] style. Tables from the paper are referred to as Table X, and new tables added in the rebuttal are referred to as Table X (below). All newly added references are provided at the end of the rebuttal. For clarity, we denote the reviewer’s Weaknesses as W and Questions as Q throughout the rebuttal.

(W1) We acknowledge the reviewer’s concern that computing information gain (IG) for each sentence can be computationally heavy, especially in long‑context settings. However, this added cost is a deliberate trade‑off for NEAR’s unique advantages: it not only detects hallucinations but also distinguishes their type, identifies hallucination‑prone attention heads, and provides fine‑grained sentence‑level attribution, which baseline methods cannot achieve. As shown in Appendix A10, the increase in runtime compared to baselines is not substantial and can be further alleviated through parallelization and optimized hardware utilization.

(W2) We thank the reviewer for raising this important concern. To assess NEAR’s effectiveness in long‑context settings, we additionally evaluated our method on LongBench v2 (Bai et al., 2024), which includes multi‑document and long‑context QA tasks with context lengths up to 100K tokens and 503 datapoints. As shown in Table 2 (below), NEAR consistently outperforms both INSIDE [12] and Loopback Lens [17], the strongest baseline methods in our study, across all three models and evaluation metrics, confirming its effectiveness even in challenging long‑context scenarios. While NEAR incurs approximately 1.4× longer runtime than the baselines, this overhead reflects a trade‑off for its unique capabilities, including hallucination type attribution and identification of hallucination‑prone attention heads, features that baseline methods cannot provide. Across three independent runs, the overall standard deviations were low (±0.006 for AUROC, ±0.007 for Kendall’s τ, and ±0.007 for PCC), confirming the stability of these results. These findings reinforce NEAR’s generalizability for hallucination detection across diverse QA settings.

| Model | Method | AUROC ↑ | Kendall's τ ↑ | PCC ↑ |
| --- | --- | --- | --- | --- |
| Qwen2.5-3B | **NEAR (ours)** | **0.792** | **0.514** | **0.527** |
| | INSIDE | 0.709 | 0.457 | 0.471 |
| | Loopback Lens | 0.683 | 0.438 | 0.452 |
| LLaMA3.1-8B | **NEAR (ours)** | **0.812** | **0.529** | **0.544** |
| | INSIDE | 0.727 | 0.468 | 0.483 |
| | Loopback Lens | 0.701 | 0.449 | 0.463 |
| OPT-6.7B | **NEAR (ours)** | **0.799** | **0.521** | **0.538** |
| | INSIDE | 0.719 | 0.461 | 0.479 |
| | Loopback Lens | 0.692 | 0.442 | 0.456 |

Table 2: Comparison of NEAR, INSIDE [12], and Loopback Lens [17] on LongBench v2 (Bai et al., 2024). NEAR achieves a 7–12% relative improvement in AUROC and consistent gains in Kendall’s τ and PCC across all models.

(Q1) We thank the reviewer for this insightful suggestion. While we did not incorporate Named Entity Recognition (NER) in the current study, our focus was on model‑intrinsic attribution signals, the dense semantic information encoded within the internal layers of large language models [11–13], which can generalize across domains without relying on external annotators or task‑specific pipelines. Unlike plain NER‑based methods (e.g., Zhou et al., 2020), which primarily detect surface‑level entity mismatches, NEAR captures deeper model‑internal inconsistencies by attributing information gain across context sentences and attention heads. This enables NEAR to detect not only factual errors (e.g., hallucinatory sentences) but also the type of hallucination (parametric or context‑induced). Nevertheless, we agree that integrating NER could complement NEAR by providing finer‑grained entity‑level consistency checks within partly correct long‑form generations. Such a hybrid approach could further enhance interpretability and robustness, and we will highlight this as a promising avenue for future work in the revision.

References

Bai, Y., Tu, S., Zhang, J., Peng, H., Wang, X., Lv, X., Cao, S., Xu, J., Hou, L., Dong, Y. and Tang, J., 2024. Longbench v2: Towards deeper understanding and reasoning on realistic long-context multitasks. arXiv preprint arXiv:2412.15204.

Zhou, C., Neubig, G., Gu, J., Diab, M., Guzman, P., Zettlemoyer, L. and Ghazvininejad, M., 2020. Detecting hallucinated content in conditional neural sequence generation. arXiv preprint arXiv:2011.02593.

Comment

I would like to begin by thanking the authors for their detailed rebuttal, which largely addresses the concerns that were raised. I strongly encourage the authors to include the new results and clarifications in the camera-ready version. I will maintain my original scores, as they were already high and reflect what I believe to be a fair assessment of the paper.

Comment

Thank you for your thoughtful feedback on my work. I have carefully addressed all the questions and concerns you raised and would greatly appreciate it if you could review my responses. Your insights are invaluable, and I kindly invite you to engage in a brief discussion to ensure any remaining points are fully clarified.

Review
Rating: 5

This paper proposes Shapley NEAR, an entropy-based attribution framework grounded in Shapley values to detect hallucinations in large language models. Unlike prior methods relying on final-layer logits or post-hoc checks, Shapley NEAR decomposes attention-driven information flow across all layers and heads, assigning confidence scores where higher scores indicate lower hallucination risk. It distinguishes between parametric hallucinations and context-induced hallucinations. A test-time head clipping technique prunes overconfident, context-agnostic attention heads to mitigate parametric hallucinations. Experiments on QA benchmarks with many models show that Shapley NEAR outperforms strong baselines.

Strengths and Weaknesses

Strengths

  1. The approach of decomposing attention-driven information flow across all model layers and heads to detect hallucinations is innovative. By grounding the method in Shapley values and entropy-based attribution, it offers a principled way to quantify context utility, distinguishing itself from prior final-layer-focused techniques.
  2. The paper demonstrates consistent superiority over strong baselines across four QA benchmarks using diverse LLMs. Metrics show improvements, validating the method’s effectiveness.
  3. The introduction of a head clipping technique during inference mitigates parametric hallucinations by pruning overconfident, context-agnostic attention heads. This strategy enhances model reliability without retraining, showcasing the framework’s usability in real-world scenarios.

Weaknesses

  1. Dependence on Monte Carlo sampling for Shapley value approximation incurs high time complexity. As shown in Table 19, NEAR’s runtime on LLaMA3.1-8B is 3–10 times slower than baselines (e.g., 30.6s vs. 3.1s for Semantic Entropy), limiting its applicability to real-time inference or large-scale deployments with long contexts.
  2. The paper lacks a discussion of recent non-entropy-based hallucination detection approaches, such as ANAH-v2 [1] and MIND [2]. Incorporating such comparisons would strengthen the method’s generalizability and contextualize its contributions within the broader field.

[1] Gu Y, Ji Z, Zhang W, et al. Anah-v2: Scaling analytical hallucination annotation of large language models[J]. arXiv preprint arXiv:2407.04693, 2024.

[2] Su W, Wang C, Ai Q, et al. Unsupervised real-time hallucination detection based on the internal states of large language models[J]. arXiv preprint arXiv:2403.06448, 2024.

Questions

See weaknesses

Limitations

NA

Justification for Final Rating

I confirm that the authors have adequately addressed most of the initial concerns and questions, especially the issues about efficiency. So I raise my score to 5.

Formatting Issues

NA

Author Response

Thank you for your valuable feedback. We are pleased to submit our rebuttal addressing each of the weaknesses and questions you raised. In the rebuttal, references to new papers are cited as (First Author et al., Year), while references from our submitted paper use [number] style. Tables from the paper are referred to as Table X, and new tables added in the rebuttal are referred to as Table X (below). All newly added references are provided at the end of the rebuttal.

(W1) We appreciate the reviewer’s concern regarding runtime overhead. Indeed, NEAR’s increased cost arises from Monte Carlo Shapley estimation; however, this design choice provides well-characterized error bounds and principled attributions. As shown in Appendix A1.2/A8, using 30–50 samples yields stable estimates (error < 0.05 at 99% confidence), offering a practical balance between computational cost and accuracy. Moreover, NEAR operates on pre-computed attention outputs, and its head-wise computations are highly parallelizable, making runtime reductions feasible on modern hardware. This additional cost reflects a deliberate trade-off: beyond hallucination detection, NEAR also distinguishes hallucination types (parametric vs. context-induced) and identifies hallucination-prone attention heads for mitigation (Section 6), extending functionality beyond what baselines provide. Thus, for applications requiring interpretability and actionable diagnostics, this trade-off is justified by the added insights NEAR delivers.

We additionally evaluated our method on LongBench v2 (Bai et al., 2024), which includes multi‑document and long‑context QA tasks with context lengths up to 100K tokens and 503 datapoints. As shown in Table 21 (below), NEAR consistently outperforms both INSIDE [12] and Loopback Lens [17], the strongest baseline methods in our study, across all three models and evaluation metrics, confirming its effectiveness even in challenging long‑context scenarios, although taking approximately 1.4× longer than the baselines. These results reinforce NEAR’s generalizability for hallucination detection across diverse QA settings. Across three independent runs, the overall standard deviations were low (±0.006 for AUROC, ±0.007 for Kendall’s τ, and ±0.007 for PCC), confirming the stability of these results.

| Model | Method | AUROC ↑ | Kendall's τ ↑ | PCC ↑ |
| --- | --- | --- | --- | --- |
| Qwen2.5-3B | **NEAR (ours)** | **0.792** | **0.514** | **0.527** |
| | INSIDE | 0.709 | 0.457 | 0.471 |
| | Loopback Lens | 0.683 | 0.438 | 0.452 |
| LLaMA3.1-8B | **NEAR (ours)** | **0.812** | **0.529** | **0.544** |
| | INSIDE | 0.727 | 0.468 | 0.483 |
| | Loopback Lens | 0.701 | 0.449 | 0.463 |
| OPT-6.7B | **NEAR (ours)** | **0.799** | **0.521** | **0.538** |
| | INSIDE | 0.719 | 0.461 | 0.479 |
| | Loopback Lens | 0.692 | 0.442 | 0.456 |

Table 21: Comparison of NEAR, INSIDE, and Loopback Lens on LongBench v2 (Bai et al., 2024). NEAR achieves a 7–12% relative improvement in AUROC and consistent gains in Kendall’s τ and PCC across all models.

(W2) We appreciate the reviewer’s observation regarding the need to compare with recent non‑entropy‑based approaches. To address this, we conducted additional experiments evaluating NEAR against ANAH‑v2 and MIND across four QA datasets (CoQA, QuAC, SQuAD v2, and TriviaQA) and three LLMs (Qwen2.5‑3B, LLaMA3.1‑8B, OPT‑6.7B). As shown below, NEAR consistently outperforms both ANAH‑v2 and MIND across AUROC, Kendall’s τ, and PCC, demonstrating its robustness and generalizability.

| Model | Method | Dataset | AUROC ↑ | Kendall's τ ↑ | PCC ↑ |
| --- | --- | --- | --- | --- | --- |
| Qwen2.5-3B | ANAH-v2 | CoQA | 0.78 | 0.51 | 0.50 |
| | | QuAC | 0.77 | 0.50 | 0.49 |
| | | SQuAD | 0.79 | 0.52 | 0.51 |
| | | TriviaQA | 0.77 | 0.51 | 0.49 |
| | MIND | CoQA | 0.80 | 0.53 | 0.52 |
| | | QuAC | 0.79 | 0.52 | 0.51 |
| | | SQuAD | 0.80 | 0.53 | 0.52 |
| | | TriviaQA | 0.79 | 0.52 | 0.51 |
| | **NEAR** | CoQA | **0.85** | **0.65** | **0.64** |
| | | QuAC | **0.84** | **0.66** | **0.65** |
| | | SQuAD | **0.86** | **0.67** | **0.66** |
| | | TriviaQA | **0.85** | **0.66** | **0.65** |
| LLaMA3.1-8B | ANAH-v2 | CoQA | 0.80 | 0.53 | 0.50 |
| | | QuAC | 0.78 | 0.52 | 0.49 |
| | | SQuAD | 0.81 | 0.54 | 0.51 |
| | | TriviaQA | 0.79 | 0.53 | 0.50 |
| | MIND | CoQA | 0.82 | 0.55 | 0.53 |
| | | QuAC | 0.80 | 0.54 | 0.52 |
| | | SQuAD | 0.82 | 0.56 | 0.54 |
| | | TriviaQA | 0.81 | 0.55 | 0.52 |
| | **NEAR** | CoQA | **0.85** | **0.66** | **0.61** |
| | | QuAC | **0.84** | **0.65** | **0.60** |
| | | SQuAD | **0.86** | **0.68** | **0.63** |
| | | TriviaQA | **0.85** | **0.67** | **0.60** |
| OPT-6.7B | ANAH-v2 | CoQA | 0.79 | 0.52 | 0.49 |
| | | QuAC | 0.77 | 0.51 | 0.48 |
| | | SQuAD | 0.80 | 0.53 | 0.50 |
| | | TriviaQA | 0.78 | 0.52 | 0.49 |
| | MIND | CoQA | 0.81 | 0.54 | 0.51 |
| | | QuAC | 0.79 | 0.53 | 0.50 |
| | | SQuAD | 0.82 | 0.55 | 0.52 |
| | | TriviaQA | 0.80 | 0.54 | 0.50 |
| | **NEAR** | CoQA | **0.84** | **0.65** | **0.60** |
| | | QuAC | **0.83** | **0.64** | **0.59** |
| | | SQuAD | **0.85** | **0.66** | **0.61** |
| | | TriviaQA | **0.84** | **0.65** | **0.59** |

Across 4 QA datasets (CoQA, QuAC, SQuAD v2, TriviaQA), NEAR consistently outperforms MIND and ANAH‑v2. Across three independent runs and the three models, the overall standard deviations were low (±0.005 for AUROC, ±0.006 for Kendall’s τ, and ±0.006 for PCC), confirming the stability of these results. Moreover, these methods cannot distinguish the type of hallucination, identify hallucination‑prone attention heads, or perform hallucination detection without retraining.

Reference
Bai, Y., Tu, S., Zhang, J., Peng, H., Wang, X., Lv, X., Cao, S., Xu, J., Hou, L., Dong, Y. and Tang, J., 2024. Longbench v2: Towards deeper understanding and reasoning on realistic long-context multitasks. arXiv preprint arXiv:2412.15204.

Comment

Thank you for the detailed responses. I confirm that the authors have adequately addressed most of the initial concerns and questions. I will raise my score, and I hope the authors could add the discussions to a later version of the manuscript.

Comment

Thank you for your thoughtful review and for acknowledging our revisions. We greatly appreciate your constructive feedback and are pleased that our responses addressed your concerns. We will certainly incorporate the suggested discussions in a future version of the manuscript.

Review
Rating: 2

This paper introduces a framework named Shapley NEAR (Norm-basEd Attention-wise usable infoRmation), which aims to detect whether the output generated by large language models (LLMs) is hallucinatory. Grounded in information entropy theory, the framework analyzes the information flow across all layers and attention heads of the model to assign a confidence score to the output of LLMs, thereby assessing its credibility.

Strengths and Weaknesses

Strengths:

  1. The problem addressed is highly significant, as hallucination in large language models is a widely concerned issue in the current research landscape.
  2. The idea of quantifying the usable information flow in LLMs by combining information entropy theory and Shapley values, and applying it to hallucination detection, is novel.
  3. The paper is well-written and easy-to-follow.

Limitations and questions:

  1. The efficiency of the proposed method is a very serious concern. As is well-known, the estimation of Shapley values is extremely computationally intensive. However, the main text does not provide any comparison of the efficiency between the proposed method and the baseline methods, which raises concerns about the practical applicability of the approach.
  2. The proposed method appears to primarily apply Shapley value estimation to the domain of hallucination detection in large language models. Beyond this application, further emphasis is needed on any additional innovative efforts made by the proposed method.
  3. There already exist numerous efficient methods for estimating Shapley values, such as Beta-Shapley and AME. Given this, it is unclear why the current method does not consider adopting these more efficient estimation approaches.
  4. The experiments in the paper focus solely on the effectiveness of the proposed method for hallucination detection. However, it remains unclear how this method could contribute to the elimination of hallucinations, which is also an important aspect to consider.
  5. On line 210, it is necessary to further elaborate on the number of samples required for the estimation error of the proposed method to become negligible and the order of magnitude of this error. Additionally, there should be a detailed discussion on the model size beyond which the method becomes inapplicable. Such discussions would help readers better understand the practicality of the method across different scenarios.
  6. The experiments in the paper are conducted on only four datasets, which seems insufficient to demonstrate the general effectiveness of the proposed method. It would be preferable to validate the method's performance in a broader range of scenarios, especially more challenging ones, such as those involving longer generated sequences.

Questions

Please refer to the “Limitations and questions” section.

Limitations

The authors have discussed the limitations.

Formatting Issues

No concerns

Author Response

Thank you for your valuable feedback. We are pleased to submit our rebuttal addressing each of the weaknesses and questions you raised. In the rebuttal, references to new papers are cited as (First Author et al., Year), while references from our submitted paper use [number] style. Tables from the paper are referred to as Table X, and new tables added in the rebuttal are referred to as Table X (below).

(Q1) While exact Shapley computation is expensive, Shapley NEAR approximates these values via Monte Carlo sampling ($M=50$, $\delta=0.01$), reducing complexity from factorial [23] to $O(L \cdot H \cdot \log V)$. Using standard concentration results (Maleki et al., 2013), the estimation error is bounded as:

$$\big| \widehat{\mathrm{NEAR}} - \mathrm{NEAR} \big| \leq L \cdot H \cdot \log V \cdot \sqrt{\frac{\log(2n/\delta)}{2M}},$$

which stabilizes at $M = 30$–$50$ with error $< 0.05$ at 99% confidence (Appendix A8). Appendix A10 reports runtime comparisons showing only moderate overhead relative to baselines. Crucially, NEAR works on pre‑computed attention outputs without retraining or auxiliary models (Section 3), ensuring efficiency and practical deployability. We will summarize this analysis in the main text.

(Q2) We appreciate the reviewer’s observation and agree that it is important to highlight our contributions beyond a direct application of Shapley values. NEAR introduces several novel innovations that make it more than a straightforward use of Shapley estimation:

  1. Attention‑wise, norm‑based entropy decomposition: We present a principled framework to measure usable information flow across all layers and heads of an LLM using attention output norms (Definitions 3.1–3.4), moving beyond final‑layer or logit‑based analyses used in prior work.
  2. Sentence‑level attribution: We extend Shapley‑based decomposition to attribute information gain at the granularity of individual context sentences (Equation 6), enabling fine‑grained interpretability for identifying which segments drive (or mislead) model predictions.
  3. Hallucination type distinction: We formalize and empirically validate the detection of both parametric and context‑induced hallucinations using the sign and magnitude of Shapley NEAR scores (Section 6, Figure 2b).
  4. Test‑time head clipping: We leverage these attributions to identify and prune hallucination‑prone attention heads, improving model reliability without retraining or architectural changes (Section 6, Table 2b).

Taken together, these contributions make Shapley NEAR a novel, interpretable, and plug‑and‑play framework that unifies attribution, uncertainty quantification, and mitigation for hallucination detection in LLMs.
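As a toy illustration of point 1 above (the exact definitions are Definitions 3.1–3.4 in the paper; the norms below are hypothetical), per‑head attention‑output norms over context sentences can be normalized into a distribution whose entropy indicates how concentrated that head's usable information is:

```python
import torch

def norm_based_entropy(sentence_norms: torch.Tensor) -> torch.Tensor:
    """Toy sketch: normalize one head's per-sentence attention-output norms into
    a distribution and compute its entropy. Lower entropy means the head draws
    its usable information from only a few context sentences."""
    p = sentence_norms / sentence_norms.sum()
    return -(p * torch.log(p + 1e-12)).sum()

norms = torch.tensor([2.0, 0.1, 0.1, 0.1])  # hypothetical per-sentence norms for one head
print(norm_based_entropy(norms))            # low entropy: information concentrated on sentence 0
```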

(Q3) We intentionally adopt Monte Carlo permutation sampling (Section 4, Section 5, and Appendix A1.2) because of its simplicity, model‑agnostic applicability, and well‑characterized high‑confidence error bounds. As shown in Appendix A8, NEAR achieves performance saturation at $M = 30$–$50$ samples, balancing accuracy and efficiency for practical use in long‑context QA. In contrast, alternative estimators such as Beta‑Shapley require selecting a beta prior, introducing extra hyperparameters that must be tuned per model/task. Similarly, Approximate Mean Estimation (AME) sacrifices theoretical guarantees (it does not satisfy all Shapley axioms), and in our experiments, it exhibited significantly higher variance in results.

To empirically support this, we compared NEAR with Monte Carlo sampling versus NEAR with AME on TriviaQA using LLaMA‑3.1‑8B. As shown in Table 1 (below), Monte Carlo NEAR outperforms AME in AUROC, Kendall’s $\tau$, and PCC, while AME shows nearly 7× higher standard deviation, indicating unstable estimates across runs.

Table 1: Comparison of NEAR with Monte Carlo vs. AME sampling on TriviaQA (LLaMA‑3.1‑8B, 3 runs).

| Method | AUROC ↑ | Kendall's τ ↑ | PCC ↑ |
| --- | --- | --- | --- |
| NEAR (Monte Carlo) | 0.85 ± 0.005 | 0.67 ± 0.006 | 0.60 ± 0.006 |
| NEAR (AME) | 0.82 ± 0.035 | 0.61 ± 0.041 | 0.55 ± 0.043 |

Hyperparameters for AME: Number of sampled subsets: 50, Subset size ratio: 0.5, Importance sampling temperature: 1.0

While AME marginally reduces computation, it introduces bias, requires additional hyperparameters, and produces unstable results, as evidenced by the much higher standard deviation. Monte Carlo sampling, by contrast, provides unbiased estimates with tighter variance bounds, making it better suited for a framework like NEAR that prioritizes interpretability and reproducibility.

(Q4) While our primary contribution is a principled framework for detecting hallucinations, we also propose a mitigation strategy in Section 6: clipping heads showing parametric hallucination. By identifying attention heads with consistently negative information gain (indicative of parametric hallucinations) and pruning them, we reduce overconfident context‑agnostic outputs. As shown in Table 2b, NEAR+HC improves AUROC, accuracy, and ROUGE‑L over both NEAR and INSIDE [12], demonstrating that our framework not only detects but also actively mitigates hallucinations without retraining or architectural changes.
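A rough sketch of the head‑clipping step is shown below; the tensor layout and the flagged (layer, head) indices are hypothetical, and this is an illustration of the idea rather than our released implementation.

```python
import torch

def clip_heads(head_outputs: torch.Tensor, heads_to_clip: set) -> torch.Tensor:
    """Zero out the per-head attention outputs of flagged heads.

    head_outputs: tensor of shape (num_layers, num_heads, seq_len, head_dim),
    e.g. collected with forward hooks. heads_to_clip: {(layer, head), ...}
    flagged because their estimated information gain is consistently negative.
    """
    clipped = head_outputs.clone()
    for layer, head in heads_to_clip:
        clipped[layer, head] = 0.0
    return clipped

# Example with hypothetical indices flagged by the attribution step.
outputs = torch.randn(4, 8, 16, 64)            # (layers, heads, tokens, head_dim)
pruned = clip_heads(outputs, {(1, 3), (2, 7)})
```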

(Q5) We thank the reviewer for pointing out the need for further clarification on sample requirements and scalability. As derived in Appendix A1.2, the estimation error of NEAR decreases as $O\big(\sqrt{\frac{\log n}{M}}\big)$, with the following high‑probability bound:

$$\big| \widehat{\mathrm{NEAR}} - \mathrm{NEAR} \big| \leq L \cdot H \cdot \log V \cdot \sqrt{\frac{\log(2n/\delta)}{2M}}.$$

Empirically, as reported in Appendix A8, performance stabilizes at $M = 30$–$50$ samples, yielding estimation errors below $0.05$ with 99% confidence.
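For completeness, rearranging the bound above gives the number of permutations needed for a target error $\varepsilon$ (a direct algebraic inversion of the stated bound, not an additional result):

$$M \;\geq\; \left(\frac{L \cdot H \cdot \log V}{\varepsilon}\right)^{2} \cdot \frac{\log(2n/\delta)}{2}.$$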

To further contextualize this, we provide a comparison on LLaMA‑2 13B across five QA datasets (LongBench v2 (Bai et al., 2024) included) in Table 20 (below). NEAR consistently outperforms INSIDE [12] across AUROC, Kendall’s $\tau$, and PCC, albeit with approximately $1.5\times$ higher runtime. This additional cost reflects the overhead of Monte Carlo Shapley sampling, which, in turn, enables NEAR to provide hallucination type attribution and identification of hallucination‑prone heads, functionality not offered by baseline methods.

In terms of scalability, NEAR operates with complexity $O(L \cdot H \cdot \log V)$ and is practical for models up to approximately 8B parameters (Appendix A10). For significantly larger models (e.g., 13B+), computational cost grows substantially (discussed in Appendix A10).

| Dataset | AUROC (INSIDE) | τ (INSIDE) | PCC (INSIDE) | AUROC (NEAR) | τ (NEAR) | PCC (NEAR) |
| --- | --- | --- | --- | --- | --- | --- |
| CoQA | 0.81 | 0.56 | 0.52 | **0.86** | **0.67** | **0.65** |
| QuAC | 0.80 | 0.55 | 0.50 | **0.85** | **0.66** | **0.64** |
| SQuAD v2 | 0.79 | 0.57 | 0.51 | **0.86** | **0.68** | **0.66** |
| TriviaQA | 0.82 | 0.58 | 0.52 | **0.86** | **0.67** | **0.65** |
| LongBench v2 | 0.77 | 0.54 | 0.48 | **0.82** | **0.62** | **0.59** |

Table 20: Comparison of NEAR and INSIDE on LLaMA‑2 13B across five QA datasets. NEAR achieves consistently higher detection performance but incurs approximately 1.5× higher runtime relative to INSIDE.

(Q6) To demonstrate NEAR’s robustness beyond the original QA benchmarks, we additionally evaluated it on LongBench v2 (Bai et al., 2024), which includes multi‑document and long‑context QA tasks with context lengths up to 100K tokens and 503 datapoints. As shown in Table 21 (below), NEAR consistently outperforms both INSIDE [12] and Loopback Lens [17], the strongest baseline methods in our study, across all three models and metrics, confirming its effectiveness even in challenging long‑context scenarios. These results reinforce NEAR’s generalizability for hallucination detection across diverse QA settings. Across three independent runs, the overall standard deviations were low (±0.006 for AUROC, ±0.007 for Kendall’s τ, and ±0.007 for PCC), confirming the stability of these results.

| Model | Method | AUROC ↑ | Kendall's τ ↑ | PCC ↑ |
| --- | --- | --- | --- | --- |
| Qwen2.5-3B | **NEAR (ours)** | **0.792** | **0.514** | **0.527** |
| | INSIDE | 0.709 | 0.457 | 0.471 |
| | Loopback Lens | 0.683 | 0.438 | 0.452 |
| LLaMA3.1-8B | **NEAR (ours)** | **0.812** | **0.529** | **0.544** |
| | INSIDE | 0.727 | 0.468 | 0.483 |
| | Loopback Lens | 0.701 | 0.449 | 0.463 |
| OPT-6.7B | **NEAR (ours)** | **0.799** | **0.521** | **0.538** |
| | INSIDE | 0.719 | 0.461 | 0.479 |
| | Loopback Lens | 0.692 | 0.442 | 0.456 |

Table 21: Comparison of NEAR, INSIDE, and Loopback Lens on LongBench v2. NEAR achieves a 7–12% relative improvement in AUROC and consistent gains in Kendall’s τ and PCC across all models.

References
Bai, Y., Tu, S., Zhang, J., Peng, H., Wang, X., Lv, X., Cao, S., Xu, J., Hou, L., Dong, Y. and Tang, J., 2024. Longbench v2: Towards deeper understanding and reasoning on realistic long-context multitasks. arXiv preprint arXiv:2412.15204.

Maleki, S., Tran-Thanh, L., Hines, G., Rahwan, T. and Rogers, A., 2013. Bounding the estimation error of sampling-based Shapley value approximation. arXiv preprint arXiv:1306.4265.

Comment

Thank you for your thoughtful feedback on my work. I have carefully addressed all the questions and concerns you raised and would greatly appreciate it if you could review my responses. Your insights are invaluable, and I kindly invite you to engage in a brief discussion to ensure any remaining points are fully clarified.

Comment

Dear Reviewers,

Thank you for your reviews for this paper.

This is a reminder to please participate in the reviewer-author discussion phase, which ends in 2 days — by Tuesday, August 6 at 11:59 PM AoE. As the authors have responded to your reviews, we kindly ask that you read and engage with their responses, ask clarification questions as needed, and respond to help clarify key points before final decisions.

Your input during this phase is critical to ensuring a constructive and fair outcome.

Let us know if you have any questions or need assistance.

Warm regards,

AC

Final Decision

The paper proposes a new entropy-based attribution framework, Shapley NEAR, to detect hallucinated outputs. They propose to use the information flow across attention layers and heads to distinguish parametric hallucinations from context-induced hallucinations. They further introduce a test-time mitigation method to reduce parametric hallucinations by pruning attention heads that contribute to overconfident hallucinations. They show that their proposed method outperforms several strong baselines on existing QA benchmarks.

Reviewers found the paper well-written and easy to follow (EJkK, 2odL), and the proposed method to be effective (2odL), innovative (SGb1, EJkK), and intuitive (2odL). The approach is also practical, as the proposed mitigation uses computed attention values during inference without any training (SGb1). Having theoretical analysis complementing the empirical experiments also significantly strengthens the contributions of this paper (2odL, PtWR).

During the rebuttal, the authors resolved the raised concerns about the lack of evaluation on long-context settings (PtWR, 2odL). The main remaining concerns that seem to be insufficiently addressed are (a) the computational cost of the proposed approach (SGb1, 2odL), which the authors addressed as a trade-off against the gain in performance, and (b) the limited applicability of the approach beyond 7B-parameter models (PtWR).

Overall, while this is a strong paper with a novel and practically useful framework, given the concerns about efficiency and scalability, I lean toward rejection.