PaperHub
Overall score: 5.0 / 10 · Poster · 4 reviewers
Ratings: 4, 2, 4, 3 (min 2, max 4, std 0.8)
Confidence: 3.8
Novelty: 2.3 · Quality: 2.5 · Clarity: 2.5 · Significance: 2.3
NeurIPS 2025

On Epistemic Uncertainty of Visual Tokens for Object Hallucinations in Large Vision-Language Models

Submitted: 2025-05-02 · Updated: 2025-10-29
TL;DR

This paper mitigates object hallucinations in Large Vision-Language Models by exploring the uncertain visual tokens.

Abstract

Keywords
Large Vision-Language Models · Object Hallucinations · Uncertain Visual Tokens

Reviews and Discussion

Review
Rating: 4

Background

  • LVLMs suffer from object hallucination, that is, they generate descriptions of objects that are not in the input image.

Method

  • The authors argue that uncertain visual tokens within the vision encoder in LVLMs are a key factor.

Statistical analysis shows that high epistemic uncertainty in visual tokens => the occurrence of hallucinations.

Furthermore, they show theoretically and empirically that in early visual encoder layers, large representation deviations under small adversarial perturbations => high epistemic uncertainty.

  • Based on their findings, the authors propose a new method to alleviate object hallucination for LVLMs.

Specifically, the method uses adversarial perturbations to identify uncertain visual tokens.

Then, it applies a binary uncertainty mask to attention outputs in visual encoders, reducing the influence of uncertain visual tokens and alleviating object hallucination.
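
For illustration only, a minimal PyTorch sketch of this kind of token masking, assuming hypothetical tensor names (`attn_out`, `uncertainty`) and a quantile-based threshold; the paper's actual selection rule and hook location may differ.

```python
import torch

def mask_uncertain_tokens(attn_out: torch.Tensor,
                          uncertainty: torch.Tensor,
                          quantile: float = 0.9) -> torch.Tensor:
    """Zero out attention outputs of the most uncertain visual tokens.

    attn_out:    (B, N, D) attention outputs of one vision-encoder layer
    uncertainty: (B, N) per-token epistemic uncertainty scores
    quantile:    tokens above this uncertainty quantile are masked
    """
    # Binary mask: 1 keeps a token, 0 suppresses its attention output.
    thresh = torch.quantile(uncertainty, quantile, dim=-1, keepdim=True)
    keep = (uncertainty <= thresh).float().unsqueeze(-1)   # (B, N, 1)
    return attn_out * keep
```

A multiplicative mask like this keeps tensor shapes intact, so downstream projection layers need no changes.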

Results

  • In experiments, the proposed method can be combined with different prior approaches, significantly reducing object hallucinations in various LVLMs.

Strengths and Weaknesses

Strengths

  • The paper proposes a new perspective: uncertain visual tokens from the visual encoder can result in object hallucination.
  • The experiments are comprehensive, indicating that the proposed method can improve different LVLMs and prior approaches.

Weaknesses

Major

  • A few papers apply adversarial perturbations to images to mitigate hallucination. I appreciate that the authors dig into vision encoders and also show their method is compatible with prior ones, but I am not sure if the proposed method is technically novel or provides new insights. This may be very subjective, so I would like to see opinions from ACs and other reviewers.
  • Recent works [1-2] show that LVLMs tend to believe language priors (blind faith in text) instead of visual information. I think the proposed method only modifies visual information from vision encoders and cannot mitigate hallucination caused by language priors. For example, when an LVLM is describing a table, it may tend to say there is a chair, because tables and chairs frequently co-occur in the text training data.

[1] VLind-Bench: Measuring Language Priors in Large Vision-Language Models

[2] Words or Vision: Do Vision-Language Models Have Blind Faith in Text?

  • The evaluated LVLMs are kind of old. There are many new LVLMs with improved visual information processing modules, like Cambrian-1, Qwen2.5-VL, BLIP-3o and so on. I'm not sure if the proposed method still works on them. Besides, now that new LVLMs are using better vision modules, the benefit from the proposed method may shrink.

  • Masking some tokens may lead to information loss. As shown in Figure 7, I notice the LVLMs' answers are much shorter when using the proposed method. It seems that after we delete uncertain tokens, the models tend to describe fewer objects. Does this lead to less informative answers from LVLMs? Will this hurt the performance of models on VQA or reasoning tasks?

Minor

  • In Section 3.1, the authors propose estimating uncertainty based on adversarial attack, and show that it aligns with MC dropout. But in 3.2.1, the authors start to use MC dropout (lines 175-178) to estimate uncertainty maps. I'm wondering if the proposed uncertainty estimation is not associated with object hallucination.

  • The proposed method is not specific to LVLMs but to vision encoders. The authors show that visual encoders may have uncertain tokens, misleading LVLMs to generate object hallucination. This issue may happen in other scenarios when we use foundation visual encoders. For example, if I use CLIP to do image classification (without the LLM part), the classification results may also be misled by the uncertain tokens in the encoder. Thus, I think the "uncertain token" issue may be a common problem in encoders, while object hallucination is just an application when we use visual encoders in LVLMs. The authors are recommended to explore other applications of their findings and method in the future.

For LVLMs, I expect the authors to explore how LVLMs handle uncertain information from the encoder and how they balance visual information against language priors, which is missing in the paper.

  • Line 198, is the indentation a typo?

Questions

Please see the Weaknesses section.

Limitations

Yes. The limitations and potential negative societal impact are discussed in the supplementary materials.

Justification for Final Rating

Thank the authors for the detailed response.

  • Novelty. Addressed.

  • Hallucinations caused by other factors. As also noted by Reviewer RPtF, this paper focuses only on epistemic uncertainty in the vision encoder and does not consider other important sources of hallucination, such as language priors. But the proposed method is well-justified, effective, and compatible with prior hallucination mitigation approaches. Thus, this point does not diminish the value of the authors' contributions.

  • Experiments on newer LVLMs (also noted by Reviewer RPtF). The authors added additional experiments on Qwen2.5-VL and DeepSeek-VL. Addressed.

  • Information loss due to the proposed method (also noted by Reviewer o1SC). The proposed method shortens the generated output of LVLMs but maintains LLaVA-7B's performance on the MME benchmark. The scope of this experiment is somewhat limited, because the authors only validate their method with one model on a relatively simple benchmark. Partly addressed.

My concerns have been largely addressed, and I raise my score to 4.

The authors are encouraged to extend their method to handle various hallucinations and may explore uncertainty of vision encoders in other applications.

Formatting Concerns

No

Author Response

We sincerely appreciate the reviewer’s recognition of our new perspective on uncertain visual tokens as a source of object hallucination. We also thank the reviewer for noting the comprehensiveness of our experiments and the broad applicability of our method across various LVLMs and prior approaches. We carefully address your questions and concerns as follows and will incorporate them into our revision.


Major

W1. I am not sure if the proposed method is technically novel or provides new insights.

Thank you for highlighting the strengths of our work, especially our focus on vision encoders and compatibility with prior methods.

While previous works [R1-2] have applied adversarial perturbations mainly to reduce hallucinations, our method explores a fundamentally new dimension: it repurposes adversarial perturbations to quantify the epistemic uncertainty of visual tokens inside the vision encoder, enabling principled, token-level masking to suppress hallucinations. This introduces a novel perspective by connecting visual representation uncertainty with hallucination mitigation in LVLMs, which has not been explored in prior work.

As noted by Reviewer o1SC, both uncertainty and hallucinations are important research for trustworthiness, and connecting the two is interesting and promising. Reviewer RPtF also emphasized that the performance improvements in our method stem directly from the uncertainty-aware masking mechanism that selectively filters only the most unreliable visual tokens, which supports its novelty and effectiveness. We believe these points affirm the broader relevance and impact of our contribution.

Reference
[R1] Zhang, et al. "Poison as Cure: Visual Noise for Mitigating Object Hallucinations in LVMs," arXiv 2025.
[R2] Wang, et al. "Mirage in the eyes: Hallucination attack on multi-modal large language models with only attention sink," arXiv 2025.


W2. I think, the proposed method only modifies visual information from vision encoders and cannot mitigate hallucination caused by language priors.

Our work assumes that both the vision encoder and language model contribute to object hallucination, and focuses specifically on the vision-side uncertainty—an underexplored but complementary factor.

The cause of object hallucination in LVLMs that you raised—blind faith in textual information—is indeed well known and addressed by existing methods. Rather than aiming to cover causes on the language-model side, we present a new dimension, epistemic uncertainty in the vision encoder, which we show is strongly correlated with hallucination.

Theorem 3.2 and our empirical results (Table 1, 2 and Fig. 4, 6) demonstrate that visual tokens sensitive to adversarial perturbations—used as a proxy for uncertainty—are predictive of hallucinated content. Masking these tokens significantly reduces hallucinations without altering the language model.

Our method is training-free, vision-only, and integrates well with language-side approaches like OPERA, PAI, and VCD. This plug-and-play nature makes it a practical and effective complement to broader hallucination mitigation strategies.


W3. There are many new LVLMs with improved visual information processing modules, like Cambrian-1, Qwen2.5-VL, BLIP-3o and so on. I'm not sure if the proposed method still works on them.

Thank you for the suggestion. As you suggested, we also evaluated our method on the recent models DeepSeek-VL [R3] and Qwen2.5-VL [R4]. As shown in Table R2 and Table R3, our method remains effective, suggesting that uncertain visual tokens are a persistent issue even in state-of-the-art models.

Despite differences in visual token distributions and cross-modal alignment, epistemic uncertainty in the vision encoder remains intrinsic. This enables our method to generalize well across architectures without retraining or structural changes.

Table R2. Quantitative results of DeepSeek-VL and Qwen2.5-VL on CHAIR benchmark.

Model | CHAIR_s↓ | CHAIR_i↓ | F1↑
DeepSeek-VL | 25.8 | 6.6 | 72.7
DeepSeek-VL+Ours | 22.4 | 5.5 | 72.6
Qwen2.5-VL | 29.6 | 7.8 | 76.0
Qwen2.5-VL+Ours | 28.6 | 7.0 | 76.8

Table R3. Quantitative results of DeepSeek-VL and Qwen2.5-VL on POPE benchmark.

Model | Random↑ | Popular↑ | Adversarial↑
DeepSeek-VL | 88.7 | 88.0 | 84.9
DeepSeek-VL+Ours | 88.8 | 88.0 | 85.1
Qwen2.5-VL | 84.2 | 83.7 | 83.3
Qwen2.5-VL+Ours | 84.3 | 83.8 | 83.4

Reference
[R3] Lu, et al. "Deepseek-vl: towards real-world vision-language understanding," arXiv 2024.
[R4] Bai, Shuai, et al. "Qwen2.5-vl technical report," arXiv 2025.


W4. It seems that, after we delete uncertain tokens, the models tend to describe fewer objects. Does this lead to less informative answers from LVLMs? Will this hurt the performance of models on VQA or reasoning tasks?

Thank you for pointing this out. We acknowledge that masking uncertain visual tokens may lead to slightly shorter outputs, as shown in Figure 7 and Table A8. However, this reduction is modest and has been observed to help reduce hallucinations, consistent with prior studies [R5-8] that associate overly lengthy outputs with increased hallucination risk. To quantitatively support this, we included the average text lengths and standard deviations in Appendix Table A8 and discussed it in lines 715–720.

In response to your concern about potential degradation in visual understanding or reasoning (e.g., VQA or reasoning), we conducted further evaluations using the MME benchmark [R9] across key categories including coarse/fine-grained recognition, OCR, and logical reasoning (Numerical Calculation and Code Reasoning), as summarized in Table R2. Our method preserved performance across these tasks, suggesting that while it reduces the influence of uncertain tokens, it maintains essential visual perception and reasoning capabilities.

Table R2. Quantitative results of our method on MME benchmark with LLaVA-7B.

Method | Position | Scene | OCR | Numerical | Code
LLaVA-7B | 123 | 156 | 125 | 50 | 78
LLaVA-7B+Ours | 123 | 151 | 133 | 53 | 83

Reference
[R5] Yue, et al. "Less is more: Mitigating multimodal hallucination from an eos decision perspective," ACL 2024.
[R6] Han, et al. "Skip\n: A simple method to reduce hallucination in large vision-language models," ICLRW 2024.
[R7] Li, et al. "The hidden life of tokens: Reducing hallucination of large vision-language models via visual information steering," ICML 2025.
[R8] Min, et al. "Mitigating hallucinations in large vision-language models via summary-guided decoding," NAACL 2025.
[R9] Fu, et al. "MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models," arXiv 2023.


Minor

W1. In Section 3.1, the authors propose estimating uncertainty based on adversarial attack, and show that it aligns with MC dropout. But in 3.2.1, the authors start to use MC dropout (lines 175-178) to estimate uncertainty maps. I'm wondering if the proposed uncertainty estimation is not associated with object hallucination.

We used MC dropout in Section 3.2.1 specifically for analysis purposes because it is a well-established and well-known method for uncertainty quantification. Our goal was to clearly and rigorously demonstrate the association between uncertain visual tokens and object hallucination. To ensure this analysis was based on a widely recognized technique, we opted for MC dropout instead of our method.


W2. I think the "uncertain token" issue may be a common problem in encoders, while object hallucination is just an application when we use visual encoders in LVLMs. The authors are recommended to explore other applications of their findings and method in the future.

Thank you for your insightful suggestion regarding broader applicability. As you pointed out, our method is designed to intervene solely in the vision encoder and thus holds potential for extension beyond LVLMs to other vision tasks that rely on pre-trained encoders. However, in this work, our analysis and experiments are focused specifically on the relationship between uncertain visual tokens and object hallucination in LVLMs. It remains an open question how our method would behave in tasks such as zero-shot image classification with CLIP. We agree that exploring these directions would be a valuable avenue for future research.


W3. The authors can explore, how LVLMs handle uncertain information from the encoder and how LVLMs balance between visual information and language prior.

We appreciate the reviewer’s insightful suggestion. To examine how LLMs handle uncertain visual information, we analyzed the attention maps of LLaVA-1.5-7B over image tokens, before and after applying our method.

We measured the entropy of the LLM’s attention distribution over image tokens (across all layers and heads) to assess whether the LLM attends broadly or narrowly. A higher entropy implies that the model utilizes a wider range of visual evidence rather than relying on a few visual tokens.

We evaluated the average entropy of LLM attention over image tokens using 500 images:

  • Entropy of original LLaVA = 1.5746
  • Entropy of LLaVA + Ours = 1.9717

The increased entropy indicates that our method encourages broader and more balanced attention over reliable visual tokens, helping the model better integrate visual information while mitigating over-reliance on uncertain inputs.
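
For concreteness, a small sketch of how such an entropy over image tokens could be computed; `attn` is a hypothetical tensor of attention weights already restricted to the image-token positions, and averaging over layers, heads, and images is assumed to happen outside this function.

```python
import torch

def image_attention_entropy(attn: torch.Tensor, eps: float = 1e-12) -> torch.Tensor:
    """Entropy of an attention distribution restricted to image tokens.

    attn: (..., N_img) attention weights over the image-token positions.
          They are renormalized so each distribution sums to 1.
    Returns the entropy per distribution, shape (...).
    """
    p = attn / attn.sum(dim=-1, keepdim=True).clamp_min(eps)
    return -(p * (p + eps).log()).sum(dim=-1)
```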

We will include this result in the final version to clarify how our approach influences the LLM's use of visual features.


W4. Line 198, is the indentation a typo?

Thank you for pointing out the indentation issue at line 198. It is indeed a typo error. We appreciate your careful reading and helpful feedback.

Comment

Dear Reviewer bSvX,

Thank you very much for taking the time to carefully read our rebuttal.
We sincerely appreciate your thoughtful consideration of our responses and your commitment to the reviewing process.

Best regards,
The Authors

Review
Rating: 2

This work identifies uncertain visual tokens in the vision encoder as a key cause of object hallucination in large vision-language models and proposes a strategy to mitigate it, leading to significant improvements on some benchmark evaluations.

Strengths and Weaknesses

Strength:

The proposed method demonstrates promising performance on some benchmarks.

Weaknesses:

(1) The overall theoretical part appears to lack sufficient rigor, which may affect the reliability of its conclusions. For example, in Lemma 3.1:

  1. What exactly is meant by “small perturbations”?
  2. What does “locally follows” refer to?
  3. Additionally, the hidden states "z" and "z^*" are introduced without prior definition or explanation.

(2) More importantly, in the proof of Lemma 3.1 and Theorem 3.2, the use of the “\approx” symbol is concerning. Drawing theoretical conclusions based on an imprecise approximation is not appropriate in a scientific context and raises questions about the soundness of the argument.

(3) In Theorem 3.2, it is meaningless to state that the "upper bound of the differential entropy of $Z_i^{(t)}$ increases as $\mathbb{E}_\epsilon\left[\left\|Z_i^{(t)}-z_i^{(t)}\right\|_2^2\right]$ increases." We can say that the upper bound of any value X increases as any other (non-negative) value Y increases, because, for example, we can simply set the upper bound of X to be X+Y.

(4) There is no significant performance gap between the proposed method and baseline methods in POPE benchmark.

Questions

see Strengths And Weaknesses.

Limitations

N/A

Justification for Final Rating

(1) The new proofs provided by the authors in the rebuttal are still informal, and I gave the details in my latest reply.

(2) Furthermore, the main theoretical claim in Theorem 3.2—that “an upper bound of a quantity increases as another quantity increases”—is, from a theoretical standpoint, meaningless, as I explain in the first review, which greatly undermines the value of the paper.

Formatting Concerns

N/A

Author Response

We sincerely thank the reviewer for acknowledging the promising performance of our method on benchmark evaluations. We carefully address your questions and concerns as follows and will incorporate them into our revision.


W1. The overall theoretical part appears to lack sufficient rigor.

We thank the reviewer for the thoughtful and constructive feedback. While the mathematical derivation in Lemma 3.1 is valid, we will clarify the terminology and address the reviewer’s concern. Below, we address each specific point:

  • Notation for hidden states: To improve clarity, we now define the hidden states explicitly. The clean hidden state at layer $t$ is denoted as $z^{(t)} \in \mathbb{R}^{N \times d}$, and the corresponding perturbed hidden state as $Z^{(t)} \in \mathbb{R}^{N \times d}$, for input $x$ and perturbed input $x + \epsilon$, respectively. Here, $N$ represents the number of input tokens, and $d$ is the dimensionality of each hidden state.

  • Small perturbations: Our Lemma 3.1 and Theorem 3.2 are grounded in the widely used principle of linear approximation under small perturbations [R1], providing theoretical support for our method. In our work, we define small perturbations as $\epsilon \in \mathbb{R}^{n \times 3}$ added to the input $x \in \mathbb{R}^{n \times 3}$, where $n$ is the number of image pixels and 3 corresponds to the RGB channels. The perturbation magnitude is constrained by $\|\epsilon\|_\infty \leq k$ for a sufficiently small constant $k > 0$, ensuring that the induced change in the hidden representation, $\|Z^{(t)} - z^{(t)}\|$, is small enough for higher-order terms in the Taylor expansion to be safely ignored. Importantly, beyond this theoretical support, our empirical findings demonstrate that these small perturbations allow for accurate uncertainty estimation (Fig. 2) and effectively reduce object hallucination, resulting in strong performance on benchmark datasets (Table 1).

  • Locally follows a Gaussian distribution: This statement stems from a first-order Taylor expansion under small perturbations and is consistent with standard Laplace approximation techniques [R2, R3], which are widely used in local analyses of neural networks. Specifically, it means that the distribution of the hidden state $Z^{(t)}$ from the perturbed input can be approximated by a Gaussian distribution around the original hidden state $z^{(t)}$. We will clarify this point in the revised version of the paper.

Reference
[R1] Goodfellow, Ian J., Jonathon Shlens, and Christian Szegedy. "Explaining and harnessing adversarial examples," ICLR 2015.
[R2] Pascanu, Razvan, and Yoshua Bengio. "Revisiting natural gradient for deep networks," arXiv 2013.
[R3] Kirkpatrick, et al. "Overcoming catastrophic forgetting in neural networks," PNAS 2017.

Clarified Lemma 3.1

Lemma 3.1 (Local Gaussianity of hidden states under small input perturbation)
Let $f = \{f_t\}_{t=1}^L$ be a smooth $L$-layer neural network parameterized by $\theta$, and define the hidden state at layer $t$ as

$$z^{(t)} := f_t \circ f_{t-1} \circ \cdots \circ f_1(x)$$

for an input $x \in \mathbb{R}^{N \times 3}$. Consider a perturbed input $x + \epsilon$, where the perturbation $\epsilon \in \mathbb{R}^{N \times 3}$ satisfies $\|\epsilon\|_\infty \leq k$ for sufficiently small $k > 0$. Define the corresponding hidden state as

$$Z^{(t)} := f_t \circ f_{t-1} \circ \cdots \circ f_1(x + \epsilon).$$

Then, the hidden state $Z^{(t)}$ is approximated by a Gaussian distribution around $z^{(t)}$.
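
As a rough illustration of how the deviation $\|Z^{(t)} - z^{(t)}\|$ in this lemma can be turned into a per-token score, here is a hedged sketch that uses a single random-start gradient step in place of the paper's iterative attack (Eq. 1); `encoder_hidden(x, layer)` is a hypothetical helper returning the layer-$t$ hidden states of the vision encoder.

```python
import torch

def token_uncertainty(encoder_hidden, x: torch.Tensor, layer: int,
                      k: float = 2.0 / 255.0) -> torch.Tensor:
    """Per-token uncertainty proxy: squared deviation ||Z^(t) - z^(t)||^2 of
    early-layer hidden states under a small adversarial perturbation
    (random start + one gradient step; the paper's attack is iterative)."""
    with torch.no_grad():
        z_clean = encoder_hidden(x, layer)        # (B, N, d) clean states
        y_star = encoder_hidden(x, -1)            # clean final-layer features
    # Random start inside the epsilon-ball, then one ascent step on the
    # feature-deviation objective.
    delta = torch.empty_like(x).uniform_(-k, k).requires_grad_(True)
    loss = (encoder_hidden(x + delta, -1) - y_star).pow(2).sum()
    loss.backward()
    with torch.no_grad():
        delta = (delta + k * delta.grad.sign()).clamp(-k, k)
        z_pert = encoder_hidden(x + delta, layer)
    return (z_pert - z_clean).pow(2).sum(dim=-1)  # (B, N) uncertainty map
```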


W2. The use of the "\approx" symbol is concerning.

Regarding the use of the approximation symbol "$\approx$" in Lemma 3.1 and Theorem 3.2, we clarify that this approximation is both theoretically justified and standard in the context of smooth neural networks [R4, R5]. Additionally, as illustrated in Fig. 2 and similar to the behavior of MC dropout, the empirical results also support the validity of this approximation.

The approximation results from a first-order Taylor expansion around the input $x$ with small additive noise $\epsilon$, as shown in Eq. A.2 (Lemma 3.1) and Eq. A.5 (Theorem 3.2). Since $\epsilon$ is bounded (i.e., $\|\epsilon\|_\infty \leq k$ for sufficiently small $k$), the induced residual $\|Z^{(t)} - z^{(t)}\|$ remains small, allowing higher-order terms to be safely ignored. This type of local linear approximation is widely used in the analysis of neural networks and their functional behavior [R2, R3] (Laplace approximation).

We also provide empirical support in Fig. 3, which shows that $\|Z^{(t)} - z^{(t)}\|$ is small in the early layers of the image encoder; accordingly, we rely only on these early-layer features for our approximation.

We will clarify this rationale in the revision to improve transparency and precision.

Reference
[R2] Pascanu, Razvan, and Yoshua Bengio. "Revisiting natural gradient for deep networks," arXiv 2013.
[R3] Kirkpatrick, et al. "Overcoming catastrophic forgetting in neural networks," PNAS 2017.
[R4] Balduzzi et al. "Neural Taylor approximations: Convergence and exploration in rectifier networks." PMLR, 2017.
[R5] Fu et al. "Unfolding Taylor's approximations for image restoration." NeurIPS 2021.


W3. In Theorem 3.2, it is meaningless to state "the upper bound of the differential entropy of $Z^{(t)}_i$ increases as $\mathbb{E}_\epsilon[\|Z^{(t)}_i - z^{(t)}_i\|^2_2]$ increases".

We appreciate the reviewer’s feedback and would like to clarify the role of the upper bound in Theorem 3.2.

A. Our upper bound provides a conservative estimate of uncertainty

As shown in [R6], conservative uncertainty estimation—i.e., intentionally overestimating rather than underestimating epistemic uncertainty—is theoretically desirable for multiple applications.

“Our uncertainties are conservative, ensuring that our algorithm is never more certain than it should be.”
Ciosek et al., ICLR 2020 [R6]

Our upper bound follows this principle:

  • It is monotonically linked to adversarial representation deviation (Theorem 3.2),
  • And ensures that we never overlook truly uncertain regions, even if the bound is loose.

This reflects the key idea behind conservative epistemic uncertainty estimation.

Reference
[R6] K. Ciosek et al., "Conservative Uncertainty Estimation by Fitting Prior Networks," ICLR 2020.

B. The upper bound is empirically useful

Our experiments demonstrate that:

  • Uncertainty maps derived from adversarial attacks closely resemble those from Monte Carlo dropout uncertainty estimation (Fig. 2, A1).
  • Masking tokens identified via the upper bound leads to a consistent reduction in object hallucination (Fig. 4, 6).
  • It achieves up to a 38% CHAIR_s reduction across various models and decoding strategies (Table 1, 2).

This shows that the bound is not only theoretically motivated but also reliably useful in practice.

C. The bound reflects meaningful structure

While it is true that any upper bound can be trivially increased, Theorem 3.2:

  • Is derived from adversarial representation deviation,
  • Provides a monotonic and interpretable proxy for epistemic uncertainty,
  • And is consistent with the conservative estimation framework supported in [R6].

In summary, the upper bound in Theorem 3.2 is not only meaningful but also conservative by design, making it theoretically sound, practically effective, and empirically validated. While formally proving the tightness of the bound remains a challenging task, our empirical results strongly suggest its usefulness in practice. We agree that establishing tightness more rigorously is a promising direction for future work.

Reference
[R6] K. Ciosek et al., "Conservative Uncertainty Estimation by Fitting Prior Networks," ICLR 2020.


W4. There is no significant performance gap between the proposed method and baseline methods in POPE benchmark.

We would like to clarify that the performance improvement of our method on POPE is not small. As shown in Table 1, methods such as OPERA, VCD, PAI, and Devils similarly exhibit performance degradation or marginal improvements under this setting. For instance, under LLaVA-1.5-7B, Devils (CVPR 2025) achieves slight gains of +0.3%p (Rand.) and +0.6%p (Pop.), but degrades by -0.7%p on the Adversarial split.

Although our method follows a similar trend, it is consistently effective across various models and mitigation strategies in a plug-and-play manner, requiring no architectural changes or retraining. In this context, our method yields comparable or better improvements: when combined with Devils (CVPR 2025), we observe gains of +0.4%p (Rand.), +0.9%p (Pop.), and +1.3%p (Adv.) on LLaVA-1.5-7B. With VCD (CVPR 2024), we achieve +1.9%p, +0.6%p, and +1.4%p on the respective splits. These results consistently demonstrate the robust effectiveness of our method.

Comment

Dear Reviewer E6ux,

We truly appreciate the time and effort you’ve already dedicated to reviewing our submission. We prepared our rebuttal with great care in hopes of directly addressing the valuable points raised in your review.
If you have the opportunity to take a look at our response, we would be grateful for any further thoughts or clarifications you might wish to share.

Thank you again for your contribution to the review process.

Best regards,
The Authors

Comment

We sincerely thank the reviewer for the follow-up and for clearly articulating the remaining concerns. We address each point below and have revised the theoretical components of our paper accordingly.

cQ1: Based on my understanding, Lemma 3.1 appears to be incorrect. Regardless of how small $k$ is, the approximation error persists.

The claim that the hidden state "follows a Gaussian distribution" does not account for the approximation error introduced by the Taylor expansion.

We have revised Lemma 3.1 in the paper to accurately reflect the approximate local Gaussianity, along with a formal statement of the approximation error:


Clarified Lemma 3.1

Revised Lemma 3.1 (Approximate Local Gaussianity under Small Perturbation)

Let $f = \{f_t\}_{t=1}^L$ be a smooth $L$-layer neural network parameterized by $\theta$, and define the hidden state at layer $t$ as

$$z^{(t)} := f_t \circ f_{t-1} \circ \cdots \circ f_1(x),$$

for an input $x \in \mathbb{R}^{N \times 3}$. Consider a perturbed input $x + \epsilon$, where the perturbation $\epsilon \in \mathbb{R}^{N \times 3}$ satisfies $\|\epsilon\|_\infty \leq k$ for sufficiently small $k > 0$. Define the corresponding hidden state as

$$Z^{(t)} := f_t \circ f_{t-1} \circ \cdots \circ f_1(x + \epsilon).$$

Under the assumption that the perturbation is small and $f \in C^2$, the distribution of $Z^{(t)}$ can be approximated by a Gaussian centered at $z^{(t)}$, with an approximation error of order $\mathcal{O}(\|Z^{(t)} - z^{(t)}\|^3)$ in log-probability.


This revision reflects a more rigorous formulation: we now (i) clarify the conditions and (ii) explicitly acknowledge the approximation error.

Comment

cQ3: From a theoretical standpoint, the statement that "the upper bound of a random variable increases as another random variable increases" is still not significant.

We thank the reviewer and clarify that our contribution is not the mathematical observation that an upper bound increases, but that the norm of adversarial deviation provides a tractable, interpretable, and empirically useful proxy for epistemic uncertainty. Theorem 3.2 connects adversarial sensitivity to an upper bound on differential entropy under mild assumptions (via Lemma 3.1), enabling conservative uncertainty estimation without requiring MC dropout. This is theoretically justified and practically effective, as shown by strong alignment with MC-based maps (Fig. 2, A1) and significant reduction in hallucinations (Table 1, 2).
While the bound is not tight, it is monotonic, efficiently computable, and sufficient for downstream decisions, making it a meaningful and usable signal.

cQ4: I think "0~2"% is small, which is common sense. However, consider other benchmarks; I would view this point as a minor weakness.

Thank you for the comment. While 0–2% may appear small at first glance, in the POPE benchmark this range is consistent with or better than improvements from prior methods (e.g., OPERA, Devils), as shown in Table 1. Given that our method is training-free, efficient, and broadly applicable, such gains are not only meaningful but also practically valuable. On other benchmarks like CHAIR and AMBER, we observe much larger improvements, further validating the effectiveness of our approach.

Comment

Dear Reviewer E6ux,

Thank you again for taking the time to review our submission and for acknowledging our rebuttal. We noticed that the final rating may not have been updated yet, and we just wanted to kindly follow up.

If there are any remaining concerns or clarifications needed on our end, we would be more than happy to provide additional explanations-we’re fully prepared to assist in any way that may be helpful.

We truly appreciate your efforts and your contribution to the review process.

Best regards,
The Authors

Comment

Thank you for the clarification. However, my key concerns still remain:

  1. Based on my understanding, Lemma 3.1 appears to be incorrect. Regardless of how small $k$ is, the approximation error persists. Therefore, the final sentence of Lemma 3.1, which states that "... follows a Gaussian distribution...", appears inaccurate without introducing the approximation error.

  2. There remains a significant concern regarding the use of the \approx symbol in the theoretical proof. This notation greatly undermines the rigor of the argument in the proof part and is typically only appropriate in explanatory text rather than formal proofs.

  3. From a theoretical standpoint, the statement that "the upper bound of a random variable increases as another random variable increases" is still not significant.

  4. I think "0~2"% is small, which is common sense. However, consider other benchmarks; I would view this point as a minor weakness.

Comment

cQ2: There remains a significant concern regarding the use of the \approx symbol in the theoretical proof.

We acknowledge that the use of the "≈" symbol may reduce the perceived rigor of our formal arguments in the proof of Lemma and Theorem. In the revised version, we have removed all such informal notation and replaced them with explicit expressions that include remainder terms.


Appendix A.1 – Proof of Lemma 3.1 (Revised)

Let $f = \{f_t\}_{t=1}^L$ be a smooth $L$-layer neural network parameterized by $\theta$, and let $z^{(t)} := f_t \circ \cdots \circ f_1(x)$ denote the hidden state at layer $t$ for a clean input $x \in \mathbb{R}^{N \times 3}$.

Now consider a perturbed input $x + \epsilon$, where $\|\epsilon\|_\infty \leq k$ for a small constant $k > 0$, and let $Z^{(t)} := f_t \circ \cdots \circ f_1(x + \epsilon)$ be the corresponding hidden state at layer $t$.

Let the adversarial attack objective be:

$$\mathcal{L}(Z^{(t)}) := -\log p_\theta(Z^{(t)} \mid y^*) = C \cdot \|f(Z^{(t)}; \theta) - y^*\|_2^2,$$

where $C$ is a constant, $y^*$ is a target (e.g., a final-layer feature), $z^{(t)}$ is the hidden state at layer $t$ for the clean input $x$, and $Z^{(t)}$ is the corresponding hidden state for the perturbed input $x + \epsilon$.

Assuming $f$ is twice continuously differentiable with respect to $Z^{(t)}$, we apply a second-order Taylor expansion of $\log p_\theta(Z^{(t)} \mid y^*)$ around $z^{(t)}$:

$$\log p_\theta(Z^{(t)} \mid y^*) = \log p_\theta(z^{(t)} \mid y^*) + \frac{1}{2}(Z^{(t)} - z^{(t)})^\top \nabla^2_{Z^{(t)}} \log p_\theta(Z^{(t)} \mid y^*)\big|_{Z^{(t)} = z^{(t)}} (Z^{(t)} - z^{(t)}) + R(Z^{(t)}),$$

where the remainder term $R(Z^{(t)})$ is of order $\mathcal{O}(\|Z^{(t)} - z^{(t)}\|^3)$ and the first-order term vanishes as follows:

$$\nabla_{Z^{(t)}} \log p_\theta(Z^{(t)} \mid y^*) = -2C \cdot \frac{\partial f}{\partial Z^{(t)}}(f(Z^{(t)}; \theta) - y^*) = 0 \quad \text{at } Z^{(t)} = z^{(t)}.$$

Exponentiating both sides yields:

$$p_\theta(Z^{(t)} \mid y^*) = \exp\left( \log p_\theta(z^{(t)} \mid y^*) + \frac{1}{2}(Z^{(t)} - z^{(t)})^\top H^{(t)} (Z^{(t)} - z^{(t)}) + R(Z^{(t)}) \right),$$

where $H^{(t)} := \nabla^2_{Z^{(t)}} \log p_\theta(Z^{(t)} \mid y^*)\big|_{Z^{(t)} = z^{(t)}}$ is the Hessian matrix.

Ignoring the higher-order remainder under the assumption of small perturbation, this corresponds to a local Gaussian approximation centered at $z^{(t)}$, with covariance matrix $(-H^{(t)})^{-1}$, as is standard in the Laplace approximation.

Therefore, the conditional distribution of $Z^{(t)}$ under small input perturbation admits a second-order local approximation by a Gaussian centered at $z^{(t)}$, with approximation error of order $\mathcal{O}(\|Z^{(t)} - z^{(t)}\|^3)$ in log-probability.


Appendix A.2 – Proof of Theorem 3.2 (Revised)

Let $x$ be an input image and let $\epsilon$ be a small perturbation satisfying $\|\epsilon\|_\infty \leq k$, where $k$ is sufficiently small for a first-order Taylor expansion, with the following:

$$z^{(t)}_i := f^{(t)}_i(x), \quad Z^{(t)}_i := f^{(t)}_i(x + \epsilon),$$

where $f^{(t)}_i$ denotes the hidden state of token $i$ at layer $t$, and $f = f_t \circ \cdots \circ f_1$ is assumed to be twice continuously differentiable.

By the multivariate Taylor expansion of $f^{(t)}_i(x + \epsilon)$ around $x$, we have:

$$Z^{(t)}_i = z^{(t)}_i + J^{(t)}_i \epsilon + R^{(t)}_i(\epsilon),$$

where $J^{(t)}_i := \left.\frac{\partial z^{(t)}_i}{\partial x}\right|_x \in \mathbb{R}^{d \times D}$ is the Jacobian matrix, and $\|R^{(t)}_i(\epsilon)\| = \mathcal{O}(\|\epsilon\|^2)$.

With the assumption of the perturbation upper bound $k$, the remainder $R^{(t)}_i(\epsilon)$ is negligible compared to the linear term. Under this assumption, we define the deviation:

$$\Delta Z^{(t)}_i := Z^{(t)}_i - z^{(t)}_i = J^{(t)}_i \epsilon.$$

Let $\Sigma_\epsilon := \mathbb{E}[\epsilon \epsilon^\top]$. Then the covariance of $\Delta Z^{(t)}_i$ is:

$$\Sigma_{\Delta Z^{(t)}_i} := \operatorname{Cov}[\Delta Z^{(t)}_i] = \operatorname{Cov}[J^{(t)}_i \epsilon] = J^{(t)}_i \Sigma_\epsilon (J^{(t)}_i)^\top.$$

By the entropy formula for multivariate Gaussians:

$$h(Z^{(t)}_i) = \frac{1}{2} \log \left( (2\pi e)^d \cdot \det(\Sigma_{\Delta Z^{(t)}_i}) \right).$$

Applying the AM-GM inequality:

$$\det(\Sigma_{\Delta Z^{(t)}_i})^{1/d} \leq \frac{1}{d} \operatorname{tr}(\Sigma_{\Delta Z^{(t)}_i}) = \frac{1}{d} \mathbb{E}[\|\Delta Z^{(t)}_i\|_2^2].$$

Thus, the entropy is bounded as:

$$h(Z^{(t)}_i) \leq \frac{d}{2} \log\left( \frac{1}{d} \mathbb{E}[\|\Delta Z^{(t)}_i\|_2^2] \right) + C,$$

where $C = \frac{d}{2} \log(2\pi e)$ is a constant.

Hence, the upper bound of the entropy increases as $\mathbb{E}[\|\Delta Z^{(t)}_i\|^2]$ increases, completing the proof.
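
A quick numerical sanity check (not part of the paper) that the AM-GM step above is consistent: for a random SPD covariance, the Gaussian entropy computed from $\det\Sigma$ never exceeds the trace-based bound $\frac{d}{2}\log(\frac{1}{d}\operatorname{tr}\Sigma) + \frac{d}{2}\log(2\pi e)$.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16
A = rng.standard_normal((d, d))
Sigma = A @ A.T + 1e-3 * np.eye(d)            # random SPD covariance

sign, logdet = np.linalg.slogdet(Sigma)
entropy = 0.5 * (d * np.log(2 * np.pi * np.e) + logdet)
bound = 0.5 * d * np.log(np.trace(Sigma) / d) + 0.5 * d * np.log(2 * np.pi * np.e)

assert entropy <= bound + 1e-9                # AM-GM: det(S)^(1/d) <= tr(S)/d
print(f"entropy={entropy:.3f}  trace-bound={bound:.3f}")
```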

Comment

Thanks for acknowledging the errors in the proofs. However, the current revision is still very informal.

For instance, the authors wrote that "the distribution of $Z^{(t)}$ can be approximated by a Gaussian (distribution) centered at $z^{(t)}$, with an approximation error of order $\mathcal{O}(\|Z^{(t)} - z^{(t)}\|^3)$ in log-probability". This statement is conceptually unclear. Since $Z^{(t)}$ is a random variable, $\|Z^{(t)} - z^{(t)}\|^3$ is also a random variable. Yet, a probability distribution is a deterministic function, not a random quantity. Thus, this claim effectively suggests that "a deterministic function can be approximated by another deterministic function with a random error," which does not make any mathematical sense.

Moreover, note that the perturbation term $\epsilon$ is computed via iterations of Equation (1). However, in the revised proof of Lemma 3.1, without proper justification, the authors introduce a very important equality

$$-\log p_\theta(Z^{(t)} \mid y^*) = C \cdot \|f(Z^{(t)}; \theta) - y^*\|_2^2$$

in the proof. This sudden and critical transition is unjustified and undermines the logical consistency of the argument.

Hence, the revised proof remains informal and difficult to follow. Nevertheless, I believe these may be fixable in some ways.

Furthermore, the main theoretical claim in Theorem 3.2—that “an upper bound of a quantity increases as another quantity increases”—is, from a theoretical standpoint, meaningless, as I explain in the first review, which greatly undermines the value of the paper.

Comment

We appreciate the reviewer’s close reading and have rewritten the relevant passages to remove the ambiguities you pointed out.

cQ1: For instance, the authors wrote that "the distribution of $Z^{(t)}$ can be approximated by a Gaussian (distribution) centered at $z^{(t)}$, with an approximation error of order $\mathcal{O}(\|Z^{(t)} - z^{(t)}\|^3)$ in log-probability". This statement is conceptually unclear. Since $Z^{(t)}$ is a random variable, $\|Z^{(t)} - z^{(t)}\|^3$ is also a random variable. Yet, a probability distribution is a deterministic function, not a random quantity. Thus, this claim effectively suggests that "a deterministic function can be approximated by another deterministic function with a random error," which does not make any mathematical sense.

cQ2: Moreover, note that the perturbation term $\epsilon$ is computed via iterations of Equation (1). However, in the revised proof of Lemma 3.1, without proper justification, the authors introduce a very important equality $-\log p_\theta(Z^{(t)} \mid y^*) = C \cdot \|f(Z^{(t)}; \theta) - y^*\|_2^2$ in the proof. This sudden and critical transition is unjustified and undermines the logical consistency of the argument.

We appreciate these comments. Below is further clarification. Let us denote $Z^{(t)}$ as the random hidden state at layer $t$, which follows the conditional distribution $p_\theta(z \mid y^*)$. For an $L$-layer neural network parameterized by $\theta$, i.e., $f = f_L \circ \cdots \circ f_1$, and for an input vector $x$, let $z^{(t)}$ be the intermediate hidden state at layer $t$, i.e., $z^{(t)} = f_t \circ \cdots \circ f_1(x)$. Then, let

$$y^* := f(x;\theta) = f^{(t)}(z^{(t)};\theta^{(t)}), \quad y := f(x+\epsilon;\theta) = f^{(t)}(z^{(t)} + \epsilon';\theta^{(t)})$$

denote the clean and perturbed outputs, respectively, where $f^{(t)} = f_L \circ \cdots \circ f_{t+1}$, $\epsilon'$ is the residual vector induced by $\epsilon$, and $\theta^{(t)}$ corresponds to the parameters of $f^{(t)}$.

The goal is to find $\epsilon$ that maximizes the objective function for the adversarial attack, i.e., $C \lVert f(x+\epsilon;\theta)-y^{*}\rVert_2^2$, or equivalently minimizes $\exp(-C \lVert f(x+\epsilon;\theta)-y^{*}\rVert_2^2)$, for some positive $C$ and $\|\epsilon\|_\infty \leq k$ with a small constant $k > 0$. In this scenario, we model

$$p_\theta(z \mid y^*) \;\propto\; \exp\big(- C \lVert f^{(t)}(z; \theta^{(t)}) - f^{(t)}(z^{(t)};\theta^{(t)}) \rVert_2^2\big)$$

for $z$ in the neighborhood of $z^{(t)}$. Our lemma exploits the linearization via Taylor expansion at layer $t$, which leads to the local-Gaussian statement:

"For the hidden state $z^{(t)}$, the conditional log-density $\log p_\theta(z \mid y^*)$ admits a second-order Taylor expansion whose leading quadratic term coincides with that of a Gaussian distribution, and the remainder is $\mathcal{O}(\lVert z - z^{(t)}\rVert^3)$."

Note that this local Gaussian formulation corresponds to the Laplace approximation (see C.M. Bishop, Pattern Recognition and Machine Learning, §6.4.6).

cQ3: Furthermore, the main theoretical claim in Theorem 3.2—that “an upper bound of a quantity increases as another quantity increases”—is, from a theoretical standpoint, meaningless, as I explain in the first review, which greatly undermines the value of the paper.

We respectfully disagree with this view. Of course, there are exact relationships between the trace of a matrix and its determinant, such as $\det \exp A = \exp \operatorname{tr}(A)$, but these are infeasible to compute for a large-scale $A$. However, note that our result is based on a local analysis with a restricted perturbation (small $\epsilon$), which leads to a practical solution in terms of computation and, as elaborated in our main paper and supplementary document, holds great practical value in guiding our experiments and reducing hallucination.

Here is our extensive empirical analysis, making the point that this theorem is practically useful. For each image token, we applied 2048 adversarial attacks to the LLaVA-1.5-7B image encoder and computed its covariance matrix $\Sigma_{\Delta Z}\in\mathbb{R}^{1024\times1024}$. From $\Sigma_{\Delta Z}$, we extracted the eigenvalues and calculated, for each token, the fraction of total variance explained by the top eigenvalues. Averaging the sum of the top eight eigenvalues' relative variance across all tokens yielded $0.942\pm0.004$, with most of the remaining eigenvalues close to zero.

Under such conditions, computing $\det(\Sigma_{\Delta Z})$ for entropy estimation is numerically unstable since values underflow to zero, making direct entropy comparison infeasible. In contrast, using the trace of $\Sigma_{\Delta Z}$ (our approach) is numerically stable, theoretically well-motivated under anisotropy, and preserves the token-wise ordering of uncertainty. This trace-based measure also aligns well with the qualitative uncertainty maps in Figure 2, supporting its practical validity.
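
A brief sketch of the kind of diagnostic described above, under assumed shapes: given the deviation vectors of one token across repeated attacks, it computes the sample covariance, the fraction of variance in the top eight eigenvalues, and the trace used as the uncertainty score.

```python
import numpy as np

def covariance_diagnostics(deltas: np.ndarray, top_k: int = 8):
    """deltas: (S, d) deviation vectors Z - z for one token across S attacks.
    Returns (top-k eigenvalue mass, trace) of the sample covariance."""
    cov = np.cov(deltas, rowvar=False)            # (d, d) sample covariance
    eigvals = np.linalg.eigvalsh(cov)[::-1]       # eigenvalues, descending
    top_mass = eigvals[:top_k].sum() / max(eigvals.sum(), 1e-12)
    return top_mass, np.trace(cov)
```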

Review
Rating: 4

The paper addresses a research question:

  • LVLMs' object hallucination by epistemic uncertainty.

First, the paper examines the uncertainty of visual tokens in the visual encoder of an LVLM. To approximate this uncertainty, the paper generates a perturbed image by adding small noise via adversarial attacks. After that, a mask is extracted from the layer-wise uncertainty map. Compared to MC dropout, the approximated epistemic uncertainty map is well aligned and the computation is faster. The paper investigates how hallucination positively correlates with uncertainty. Based on the approximated uncertainty, the paper proposes a masking strategy, which is training-free.

The paper validates the proposed method on both open-generation and VQA-style benchmarks. When applying the proposed method to the prior work, the performances are generally improved. The ablation studies, such as vision encoder layers, and masking strategies, also validate the proposed method.

Strengths and Weaknesses

Strengths

  • Both uncertainty and hallucination are important research topics for trustworthiness. Connecting them is interesting and promising.
  • The paper has proposed the approximation method for uncertainty, which is faster than the MC method.
  • The writing is well-structured as follows: Efficient uncertainty approximation -> Analysis of uncertainty and hallucination -> Mitigation using uncertainty map -> Validation of the proposed method.

Weaknesses

  • Although the proposed approximation is efficient, what is the inference time compared to other techniques, e.g., VCD without the proposed method?
  • As shown in Fig. 7 of the paper and the qualitative results of the appendix, the generated outputs are shorter than those produced without the proposed method. Thus, including the average length of the generations would be helpful in understanding the proposed method. Additionally, I think that a discussion is needed on this point.
  • In Table A10 of the appendix, using MC dropout is not better than the gradient-based method. As I understand the main paper, MC dropout is more accurate but inefficient. I wonder why the performance of adversarial attacks is better than MC dropout.
  • The discussion with the prior work [1] regarding the proposed uncertainty map would be beneficial. [1] Hanjing Wang and Qiang Ji, Epistemic Uncertainty Quantification For Pre-trained Neural Networks, CVPR 2024

Questions

Please see above weaknesses.

Limitations

Yes

Justification for Final Rating

The paper and the proposed method are well-motivated. My concerns were addressed during the rebuttal, and the authors also discuss inference time and the different behavior compared to MC dropout. Since the authors plan to revise the paper to include these points and further clarification, I will keep my initial positive score.

Formatting Concerns

I have no concerns regarding paper formatting.

Author Response

We sincerely thank the reviewer for recognizing the importance of connecting uncertainty and hallucination for enhancing trustworthiness. We appreciate the positive assessment of our efficient uncertainty approximation method as a promising alternative to MC-based approaches, as well as the acknowledgment of the clear and well-structured presentation of our contributions. We carefully address your questions and concerns as follows and will incorporate them into our revision.


W1. What is the inference time compared to other techniques, e.g., VCD without the proposed method?

We report the mean and standard deviation of inference time in Appendix Table A6 and discuss this in lines 702-707. As shown in the table, our method achieves comparable runtime to methods like VCD [R1] and PAI [R2], and significantly outperforms OPERA [R3] in terms of inference efficiency.

Reference
[R1] Leng, Sicong, et al. "Mitigating object hallucinations in large vision-language models through visual contrastive decoding," CVPR 2024.
[R2] Liu, et al. "Paying more attention to image: A training-free method for alleviating hallucination in lvlms," ECCV 2024.
[R3] Huang, et al. "Opera: Alleviating hallucination in multi-modal large language models via over-trust penalty and retrospection-allocation," CVPR 2024.


W2. Including the average length of the generation would be helpful in understanding the proposed method.

As you pointed out, our method slightly shortens the generated output. We report the mean and standard deviation of text lengths in Appendix Table A8 and discuss this in lines 715–720. Importantly, the reduction in length is modest and aligns with prior findings [R4–7] that overly long outputs can increase hallucination. This suggests our method may help by encouraging more concise and grounded descriptions. Following your feedback, we will expand the discussion in the main text.

Reference
[R4] Yue, et al. "Less is more: Mitigating multimodal hallucination from an eos decision perspective," ACL 2024.
[R5] Han, et al. "Skip\n: A simple method to reduce hallucination in large vision-language models," ICLRW 2024.
[R6] Li, et al. "The hidden life of tokens: Reducing hallucination of large vision-language models via visual information steering," ICML 2025.
[R7] Min, et al. "Mitigating hallucinations in large vision-language models via summary-guided decoding," NAACL 2025.


W3. Using MC dropout is not better than the gradient-based method.

Thank you for your insightful comment. While MC Dropout is a widely used method for approximating epistemic uncertainty, it is itself an approximation and not necessarily the most accurate in practice. The results in Table A10 reflect this: MC Dropout does not consistently outperform other methods, including ours, particularly under adversarial perturbations.

Our method is theoretically motivated by Theorem 3.2, and Figure 2 empirically demonstrates that the uncertainty it measures behaves similarly to that of MC Dropout. This supports the claim that our method offers a principled and reliable way to quantify epistemic uncertainty, while also being significantly more efficient. We will revise the paper to clarify this point and to emphasize the practical advantages of our approach in mitigating hallucinations under perturbations.


W4. The discussion with the prior work [1] regarding the proposed uncertainty map would be beneficial.

[1] Hanjing Wang and Qiang Ji, Epistemic Uncertainty Quantification For Pre-trained Neural Networks, CVPR 2024

Thank you for the valuable suggestion. We will include a discussion comparing our uncertainty map with Wang & Ji (CVPR 2024). While both methods estimate epistemic uncertainty without retraining, their approach focuses on model-level uncertainty via parameter sensitivity for classification, whereas ours targets token-level uncertainty in vision encoders using adversarial perturbations, specifically to mitigate object hallucination in LVLMs. We will highlight this distinction and its relevance in the revised manuscript.

Comment

I thank the authors for their response to my review. I have additional questions regarding the response.

W1. What is the inference time compared to other techniques, e.g., VCD without the proposed method?

The comparable runtime is surprising because VCD requires a single inference with batch size 2 in parallel, while the proposed method requires back-propagation. Intuitively, the proposed method requires $n$ times the cost of VCD, where $n$ is the number of back-propagation steps. Are there any details on how the inference time was measured?

W3. Using MC dropout is not better than the gradient-based method.

Yes, MC dropout might not be necessarily accurate. However, in Lines 149-152, the paper states that "As shown in Fig.2, the results indicate that U closely aligns with the uncertainty estimated via MC dropout, demonstrating that U serves as an efficient approximation." Thus, why is there a large performance difference even though the uncertainty is closely aligned?

Comment

Dear Reviewer o1SC,

We truly appreciate the time and effort you’ve already dedicated to reviewing our submission. We prepared our rebuttal with great care in hopes of directly addressing the valuable points raised in your review.
If you have the opportunity to take a look at our response, we would be grateful for any further thoughts or clarifications you might wish to share.

Thank you again for your contribution to the review process.

Best regards,
The Authors

Comment

Dear Reviewer o1SC,
We sincerely appreciate your recognition of our rebuttal efforts and your follow-up question to further clarify your concerns. Below is our response, and don’t hesitate to reach out if you have any additional questions!

cQ1: The comparable runtime is surprising because VCD requires a single inference with batch size 2 in parallel, while the proposed method requires back-propagation. Intuitively, the proposed method requires $n$ times the cost of VCD, where $n$ is the number of back-propagation steps. Are there any details on how the inference time was measured?

In the case of LLaVA-7B, VCD performs forward passes through both the vision encoder (0.3B) and language model (7B), while our method backpropagates only through the vision encoder (0.3B). Although our method includes iterative updates, it is more efficient as it excludes the language model and uses BFloat16 precision during updates. As a result, while slightly slower than VCD, our runtime remains comparable. We will clarify these details in the revision.

cQ2: Yes, MC dropout might not be necessarily accurate. However, in Lines 149-152, the paper states that "As shown in Fig.2, the results indicate that U closely aligns with the uncertainty estimated via MC dropout, demonstrating that U serves as an efficient approximation." Thus, why is there a large performance difference even though the uncertainty is closely aligned?

Thank you for the follow-up question. To clarify, the purpose of the comparison in Lines 149–152 is to empirically validate that our theoretically derived uncertainty estimation method (Theorem 3.2) is a reasonable proxy for epistemic uncertainty. As shown in Fig. 2 and A1, the spatial patterns of uncertainty estimated by $U$ and MC dropout are highly correlated — demonstrating that our method captures uncertainty effectively, despite being much more efficient.

However, the performance gap may stem from the fact that our method estimates an upper bound of epistemic uncertainty, as theoretically established in Theorem 3.2. This upper bound nature inherently leads to conservative uncertainty estimation, meaning our method tends to overestimate rather than underestimate uncertainty. In contrast, MC dropout, despite its Bayesian foundation, often yields underestimates of uncertainty or exhibits overconfident predictions as observed in prior studies [R8, R9].

As cited in [R10], conservative uncertainty estimation—i.e., intentionally overestimating rather than underestimating epistemic uncertainty—is theoretically desirable for multiple applications.

Our method follows this principle by design. As a result, even though the map $U$ and MC dropout yield visually similar maps, our conservative upper-bound estimation ensures that truly uncertain regions are not overlooked, which may help explain the more effective hallucination mitigation observed in our experiments (Table A10).

We will further include this explanation in the discussion.

Reference
[R8] Ovadia, Yaniv, et al. "Can you trust your model's uncertainty? evaluating predictive uncertainty under dataset shift." NeurIPS 2019.
[R9] Lakshminarayanan, B. et al. "Simple and scalable predictive uncertainty estimation using deep ensembles." NeurIPS 2017.
[R10] K. Ciosek et al., "Conservative Uncertainty Estimation by Fitting Prior Networks," ICLR 2020.

Review
Rating: 3

This paper addresses object hallucination in vision-language models (LVLMs), where models generate descriptions containing non-existent objects. The authors link this issue to uncertainty in visual tokens from the vision encoder and propose an efficient solution using adversarial perturbations to identify and mask uncertain tokens. Their method reduces hallucinations by 38% in benchmarks like CHAIR while maintaining output quality, and works 5x faster than traditional uncertainty estimation techniques. The approach is compatible with existing LVLM architectures and requires no additional training.

Strengths and Weaknesses

----------- Strength -----------

  1. The proposed inference-time masking strategy reduces object hallucinations without fine-tuning or additional data, preserving model capabilities while improving output reliability through selective suppression of uncertain visual tokens identified via adversarial perturbations.
  2. Unlike conventional approaches that frequently compromise output quality while addressing hallucinations, the proposed method achieves substantial reductions in spurious object generation while consistently maintaining—and in certain cases even enhancing—both the F1 metric performance and the overall descriptive accuracy of generated captions. This dual benefit of hallucination suppression and quality preservation stems from the targeted, uncertainty-aware masking mechanism that selectively filters only the most unreliable visual tokens without disrupting valid visual information flow.
  3. The comprehensive evaluation framework encompasses multiple state-of-the-art LVLM architectures (LLaVA-1.5, Shikra, MiniGPT-4) and rigorously assesses performance across established benchmarks (CHAIR, POPE, AMBER), demonstrating consistent effectiveness in hallucination reduction while maintaining model performance across different architectural paradigms and evaluation metrics. This systematic validation confirms the method's architectural agnosticism and practical viability for real-world deployment scenarios.
  4. The paper excels in clarity, presenting complex technical concepts through well-structured explanations and intuitive visualizations. It maintains a strong balance between mathematical rigor and accessibility, particularly in explaining how adversarial perturbations reveal epistemic uncertainty in visual tokens. The writing effectively guides readers through both theoretical foundations and practical implementations.

----------- Weaknesses -----------

  1. While Theorem 3.2 shows a consistent relationship between feature changes caused by adversarial perturbations (‖Δz‖₂²) and the upper limit of visual token uncertainty — effectively identifying unstable, high-uncertainty tokens as key triggers for hallucinations — does this vision-focused approach truly capture all problematic visual tokens? Specifically, could the current method miss certain troublesome tokens because of: (1) language model biases (e.g., frequent object pairings) that don’t depend on visual uncertainty; (2) mismatches between visual and language information that are not caught by looking at visual changes alone; or (3) errors that happen later during decoding, beyond what visual-layer analysis can detect?
  2. Given the observed reduction in efficacy when applied to Q-Former-based architectures like MiniGPT-4, how might the proposed uncertainty-aware masking strategy be adapted to effectively address the bottleneck created by intermediate query-based representation layers, which currently attenuate the impact of vision encoder modifications on final model outputs? Specifically, could jointly optimizing perturbations across both visual and query token spaces, or developing cross-modal uncertainty propagation mechanisms, help bridge this performance gap while maintaining the method's training-free advantages?
  3. While the proposed method effectively reduces object hallucinations as demonstrated on CHAIR and POPE benchmarks, does it simultaneously maintain robust performance on the comprehensive MME evaluation? Specifically, does the vision token masking approach preserve the model's essential visual perception and reasoning capabilities across MME's diverse tasks, particularly for fine-grained recognition and complex compositional understanding? This verification is crucial to ensure the hallucination mitigation does not compromise the model's fundamental multimodal generation quality.
  4. Given the demonstrated efficacy of the proposed uncertainty-aware masking approach across several LVLM architectures, how might its performance generalize to state-of-the-art models like DeepSeek-VL, particularly considering potential differences in visual token representation distributions, cross-modal alignment mechanisms, and inherent uncertainty patterns that could influence both hallucination suppression effectiveness and core multimodal capabilities? This critical evaluation gap warrants investigation to fully establish the method's applicability across next-generation vision-language models.

Questions

Please see the weaknesses above.

Limitations

yes

Final Justification

I'd like to thank the authors for their response to my questions. The rebuttal addresses some of my concerns. Considering the overall quality, especially the question of how the method generalizes to state-of-the-art models like DeepSeek-VL, which was also raised by reviewer E6ux, I will keep my rating.

Formatting Issues

no

Author Response

We sincerely thank the reviewer for the thoughtful feedback. We appreciate the recognition of our method's effectiveness in reducing hallucinations without fine-tuning or extra data, while preserving or improving caption quality. We are also grateful for the acknowledgment of our uncertainty-aware masking strategy and the clarity of our presentation. We carefully address all of your questions and concerns below and will incorporate them into our revision.


W1. Does this vision-focused approach truly capture all problematic visual tokens?

Our work assumes that both the vision encoder and language model contribute to object hallucination, and focuses specifically on the vision-side uncertainty—an underexplored but complementary factor.

The challenges you raise, namely (1) language priors, (2) vision-language misalignment, and (3) decoding errors, are well known and addressed by existing methods. Rather than aiming to cover all causes, we identify a new dimension, (4) epistemic uncertainty in the vision encoder, which we show is strongly correlated with hallucination.

Theorem 3.2 and our empirical results demonstrate that visual tokens sensitive to adversarial perturbations—used as a proxy for uncertainty—are predictive of hallucinated content. Masking these tokens significantly reduces hallucinations without altering the language model.

Our method is training-free, vision-only, and integrates well with language-side approaches like OPERA, PAI, and VCD. This plug-and-play nature makes it a practical and effective complement to broader hallucination mitigation strategies.
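To make the plug-and-play claim concrete, the following is a minimal sketch of how a binary uncertainty mask could be attached to a vision encoder's attention outputs via forward hooks, leaving the language model untouched. The hook placement, the [batch, num_tokens, dim] tensor layout, and the choice of k are illustrative assumptions rather than the paper's exact integration.

```python
import torch

def make_uncertainty_hook(uncertainty, k):
    """Build a forward hook that zeroes the attention outputs of the k most uncertain tokens."""
    mask = torch.ones_like(uncertainty)
    mask[uncertainty.topk(k).indices] = 0.0                    # binary uncertainty mask

    def hook(module, inputs, output):
        # assumes `output` is a [batch, num_tokens, dim] tensor aligned with `uncertainty`
        return output * mask.to(output.device, output.dtype).view(1, -1, 1)
    return hook

def attach_masking(attention_modules, uncertainty, k=16):
    """Register the mask on selected vision-encoder attention layers; the LM is untouched."""
    return [m.register_forward_hook(make_uncertainty_hook(uncertainty, k))
            for m in attention_modules]
```

Because `register_forward_hook` returns removable handles, the same model instance can run with or without masking, which keeps it easy to combine with decoding-side methods such as OPERA, PAI, or VCD.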


W2. Could jointly optimizing perturbations across both visual and query token spaces, or developing cross-modal uncertainty propagation mechanisms, help bridge this performance gap while maintaining the method's training-free advantages?

Thank you for the insightful suggestion. Following your proposal, we conducted additional experiments where we jointly applied adversarial perturbations to both the input image and the 32 query tokens in MiniGPT-4. Based on the resulting uncertainty, we additionally masked 6 high-uncertainty query tokens in two cross-attention layers of the Q-Former. As shown in Table R1, this adaptation led to further improvements in performance. We appreciate your thoughtful idea and will include this result in the revised version.

Table R1. Quantitative results on Q-Former masking with joint adversarial attack. Greedy decoding was applied for the experiments. (Note: Minor differences from the main paper are due to nondeterminism across GPU hardware.)

| Model | CHAIR_s↓ | CHAIR_i↓ | F1↑ |
|---|---|---|---|
| MiniGPT-4 | 31.0 | 11.4 | 67.3 |
| +Ours | 29.0 | 10.6 | 67.5 |
| +Ours+Q-Former masking | 27.0 | 9.3 | 68.1 |
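For completeness, here is a simplified sketch of the query-side part of the joint-perturbation variant reported in Table R1: it perturbs only the 32 learned query tokens given fixed image features (the full variant also perturbs the input image, as in the earlier sketch) and then zeroes the most uncertain queries. The `qformer` call signature, the step sizes, and k=6 are assumptions chosen to mirror the description above, not the exact code behind the table.

```python
import torch

def query_uncertainty(qformer, image_feats, query_tokens, num_steps=3, step_size=1e-3):
    """Per-query ||Δz||^2 under an adversarial perturbation of the learned queries (sketch)."""
    delta = torch.zeros_like(query_tokens, requires_grad=True)
    clean = qformer(query_tokens, image_feats).detach()        # assumed shape [1, 32, dim]
    for _ in range(num_steps):
        pert = qformer(query_tokens + delta, image_feats)
        deviation = (pert - clean).pow(2).sum()
        (grad,) = torch.autograd.grad(deviation, delta)
        delta = (delta + step_size * grad.sign()).detach().requires_grad_(True)
    pert = qformer(query_tokens + delta, image_feats)
    return (pert - clean).pow(2).sum(dim=-1).squeeze(0)        # one score per query token

def mask_uncertain_queries(query_tokens, uncertainty, k=6):
    """Zero out the k most uncertain query tokens before the selected cross-attention layers."""
    masked = query_tokens.clone()
    masked[:, uncertainty.topk(k).indices, :] = 0.0
    return masked
```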

W3. Does it simultaneously maintain robust performance on the MME evaluation, particularly for fine-grained recognition and complex compositional understanding?

To address your concern regarding the impact of our visual token masking strategy on the model's visual perception and reasoning capabilities, we conducted additional evaluations on a representative subset of the MME benchmark, as shown in Table R2. Specifically, we selected tasks covering the major categories in MME [R1]: Position (coarse-grained recognition), Scene (fine-grained recognition), OCR (text understanding), and two reasoning tasks, Numerical Calculation and Code Reasoning. The results demonstrate that our method effectively reduces object hallucination while preserving the model's essential visual perception and reasoning capabilities, suggesting that our masking approach does not compromise multimodal generation quality.

Table R2. Quantitative results of our method on MME benchmark with LLaVA-7B.

| Method | Position | Scene | OCR | Numerical | Code |
|---|---|---|---|---|---|
| LLaVA-7B | 123 | 156 | 125 | 50 | 78 |
| LLaVA-7B+Ours | 123 | 151 | 133 | 53 | 83 |

Reference

[R1] Fu, Chaoyou, et al. "MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models." arXiv 2023.


W4. How might its performance generalize to state-of-the-art models like DeepSeek-VL?

Thank you for the suggestion. As you suggested, we evaluated our method on two recent models, DeepSeek-VL [R2] and Qwen2.5-VL [R3]. As shown in Table R3 and Table R4, our method remains effective, suggesting that uncertain visual tokens are a persistent issue even in state-of-the-art models.

Despite differences in visual token distributions and cross-modal alignment, epistemic uncertainty in the vision encoder remains intrinsic. This enables our method to generalize well across architectures without retraining or structural changes.

Table R3. Quantitative results of DeepSeek-VL and Qwen2.5-VL on the CHAIR benchmark.

| Model | CHAIR_s↓ | CHAIR_i↓ | F1↑ |
|---|---|---|---|
| DeepSeek-VL | 25.8 | 6.6 | 72.7 |
| DeepSeek-VL+Ours | 22.4 | 5.5 | 72.6 |
| Qwen2.5-VL | 29.6 | 7.8 | 76.0 |
| Qwen2.5-VL+Ours | 28.6 | 7.0 | 76.8 |

Table R4. Quantitative results of DeepSeek-VL and Qwen2.5-VL on the POPE benchmark.

| Model | Random↑ | Popular↑ | Adversarial↑ |
|---|---|---|---|
| DeepSeek-VL | 88.7 | 88.0 | 84.9 |
| DeepSeek-VL+Ours | 88.8 | 88.0 | 85.1 |
| Qwen2.5-VL | 84.2 | 83.7 | 83.3 |
| Qwen2.5-VL+Ours | 84.3 | 83.8 | 83.4 |

Reference

[R2] Lu, Haoyu, et al. "DeepSeek-VL: Towards Real-World Vision-Language Understanding." arXiv 2024.
[R3] Bai, Shuai, et al. "Qwen2.5-VL Technical Report." arXiv 2025.

Comment

Dear Reviewer RPtF,

We truly appreciate the time and effort you’ve already dedicated to reviewing our submission. We prepared our rebuttal with great care in hopes of directly addressing the valuable points raised in your review.
If you have the opportunity to take a look at our response, we would be grateful for any further thoughts or clarifications you might wish to share.

Thank you again for your contribution to the review process.

Best regards,
The Authors

Final Decision

This paper initially received two borderline accept ratings, one borderline reject, and one reject. After the rebuttal and reviewer discussions, the scores remained mixed. However, the AC finds that the strengths of the work outweigh the weaknesses and recommends acceptance.

The paper introduces a training-free and efficient method to mitigate object hallucination in LVLMs by masking uncertain visual tokens identified via adversarial perturbations. The approach is computationally efficient, and consistently reduces hallucination rates while preserving output quality. Reviewers appreciated the clarity of writing, systematic experimental validation across multiple LVLM architectures, and the practical viability of the proposed method.

Several concerns were raised.

  • Theoretical rigor: Reviewer E6ux questioned the rigor of the theoretical analysis, particularly Theorem 3.2. While the rebuttal did not fully convince this reviewer, the AC believes the concerns are addressable through revision, as the overall intuition and methodology are laid out in the rebuttal.
  • Scope of hallucination causes: Multiple reviewers (RPtF, bSvX) pointed out that the method only mitigates a certain type of hallucinations arising from visual token uncertainty. This limits the generality of the approach.
  • Evaluation: The strategy sometimes shortens generated outputs, raising questions about informativeness, though rebuttal experiments showed preserved performance on MME. Finally, while additional experiments on Qwen2.5-VL and DeepSeek-VL were provided, the scope of evaluation on state-of-the-art models and broader benchmarks remains somewhat limited.

Despite these limitations, the paper makes a clear and practical contribution: it identifies a new perspective on hallucination causes, proposes a lightweight and effective mitigation strategy, and demonstrates its compatibility across architectures. The AC considers the theoretical concerns remediable in revision and finds the empirical results sufficiently compelling. Also, while Reviewer RPtF rated the paper Borderline Reject, the AC finds this rating overly pessimistic: from the AC's perspective, the rebuttal addresses most of the questions raised in the review, and the final justification is incomplete.

Therefore, the AC supports acceptance on the condition that the authors incorporate the rebuttal content into the final version and revise the parts the reviewers commented on, such as the theoretical presentation and the expanded evaluations on newer models.