PaperHub
Overall score: 6.4 / 10
Decision: Poster · 4 reviewers
Individual ratings: 4, 4, 4, 4 (min 4, max 4, std 0.0)
Average confidence: 3.8
Novelty: 2.8 · Quality: 2.8 · Clarity: 3.0 · Significance: 2.8
NeurIPS 2025

GLSim: Detecting Object Hallucinations in LVLMs via Global-Local Similarity

Submitted: 2025-05-04 · Updated: 2025-10-29
TL;DR

We propose GLSim, a training-free framework that combines global and local embedding similarity signals for accurate object hallucination detection in LVLMs, outperforming prior methods.

Abstract

Keywords
large vision-language models, multi-modal large language models, object hallucination

Reviews and Discussion

Review (Rating: 4)

This paper presents GLSIM, a new training-free framework for detecting object hallucinations in LVLMs. The method combines global similarity scores, which assess how well an object fits the overall scene, with local similarity scores that verify if specific image regions support the object's presence. The authors argue that this dual perspective improves hallucination detection accuracy. Experiments are conducted across multiple benchmark datasets and LVLMs, showing superior performance compared to existing methods.

Strengths and Weaknesses

Strengths:

  1. The proposed GLSIM framework is training-free, making it lightweight and potentially easier to adopt compared to methods that require additional training or fine-tuning.

Weaknesses:

  1. The paper focuses solely on object hallucinations, while other types of hallucinations (e.g., attribute or relational hallucinations) are not addressed. This narrow focus reduces the applicability of the proposed solution.

  2. The combination of global and local similarity signals, while intuitive, does not represent a significant methodological advancement.

  3. The experiments rely on older models like LLaVA and Shikra. More recent and advanced models such as Qwen2.5 VL and InternVL2.5 have not been evaluated, which limits the relevance and impact of the findings.

Questions

  1. The authors claim that their method is more reliable than approaches relying on external knowledge sources. Could they provide a more detailed comparison with these external knowledge-based methods, particularly in terms of reliability and practicality?

Limitations

Yes

Final Justification

The authors have addressed most of my concerns; I'm happy to raise my score.

Formatting Issues

None

Author Response

We appreciate the reviewer for taking the time to provide detailed and constructive feedback. We are grateful for the recognition of our motivation, the performance of our method, and the training-free design.

We also thank the reviewer for the insightful questions and are happy to clarify the points raised as follows:


A1. Attribute and relational hallucinations

We appreciate you raising this important point. We agree that attribute-level and relational hallucinations are critical challenges for real-world applications. As mentioned in our Limitations and Future Work section, our current study focuses on object existence hallucinations because publicly available token-level annotations and benchmarks for attribute and relational hallucinations in open-ended generation settings remain limited.

That said, we explored whether GLSIM can handle these more complex hallucination types. To this end, we generated captions with each LVLM for 500 randomly selected images from the MSCOCO validation set. We then used a powerful multi-modal model, GPT-4o, to produce pseudo ground-truth annotations for two new categories:

  • Attribute Hallucination: Incorrect properties of objects are described (e.g., “a red car” when the car is blue).
  • Relational Hallucination: The relationship between objects is incorrectly stated (e.g., “a cat sitting on a table” when it is under the table).

We then computed token-level GLSIM scores and aggregated them by averaging across attribute–object (e.g., “a red car”) and object–relation (e.g., “a cat sitting on a table”) spans. These aggregated scores served as unsupervised estimates of the hallucination likelihood. Without any task-specific modification, GLSIM consistently outperformed baseline methods. We hope GLSIM can serve as a promising baseline for detecting more complex forms of hallucination.
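For concreteness, here is a minimal sketch of this span-level aggregation, assuming token-level GLSIM scores are already available; the helper name, score values, and span indices are illustrative only, not part of the released code.

```python
# Minimal sketch of the span-level aggregation described above (hypothetical
# helper; assumes token-level GLSIM scores have already been computed).
from statistics import mean

def aggregate_span_scores(token_scores, spans):
    """Average token-level GLSIM scores over each extracted span.

    token_scores: dict mapping caption token index -> token-level GLSIM score
    spans:        list of (start, end) token index ranges for attribute-object
                  or object-relation phrases (e.g., "a red car")
    Lower aggregated scores indicate a higher hallucination likelihood.
    """
    return [mean(token_scores[i] for i in range(start, end)) for start, end in spans]

# Toy usage with made-up scores for the span "a red car" (tokens 3-5):
print(aggregate_span_scores({3: 0.61, 4: 0.48, 5: 0.66}, [(3, 6)]))  # ~[0.583]
```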

  • Attribute Hallucination
| Method | LLaVA-1.5-7B | LLaVA-1.5-13B | Qwen2.5-VL-7B |
| --- | --- | --- | --- |
| NLL | 58.62 | 60.50 | 56.89 |
| Entropy | 52.21 | 55.32 | 55.84 |
| Internal Confidence | 74.24 | 73.67 | 70.06 |
| SVAR | 67.03 | 68.62 | 71.09 |
| Contextual Lens | 74.02 | 75.48 | 71.98 |
| GLSIM (Ours) | 77.19 | 78.07 | 74.09 |
  • Relational Hallucination
| Method | LLaVA-1.5-7B | LLaVA-1.5-13B | Qwen2.5-VL-7B |
| --- | --- | --- | --- |
| NLL | 57.06 | 57.35 | 54.90 |
| Entropy | 55.72 | 56.27 | 55.03 |
| Internal Confidence | 69.38 | 68.94 | 62.09 |
| SVAR | 61.20 | 65.83 | 63.01 |
| Contextual Lens | 66.46 | 69.85 | 64.88 |
| GLSIM (Ours) | 70.03 | 73.64 | 68.95 |

A2. Methodological advancement

We understand how your concern arises. While the idea of combining global and local signals may seem intuitive in hindsight, what distinguishes GLSIM is not only the formulation of these two signals but the methodological rigor with which we analyze and instantiate them. Below, we elaborate on the methodological significance of our work:

  1. First, our motivation is original and offers a new lens through which to understand hallucination in LVLMs. Through detailed case studies in Section 4.1, we demonstrate, for the first time, that neither global nor local similarity alone is sufficient for reliable hallucination detection. This insight itself is a valuable contribution, as it challenges the prevailing reliance on single-perspective signals and motivates a more structured diagnostic approach. The impact of our work, therefore, goes beyond a specific method—it points the field toward a promising direction for hallucination detection in multimodal models and opens up room for more future work.

  2. Second, what sets our method apart is the rigorous investigation of how each component, global and local similarity, should be designed and integrated optimally. We do not treat the combination as a black-box fusion; instead, we design, ablate, and validate every aspect of the formulation---this is both a new and non-trivial effort. For example, in Section 5.3 (Lines 293–315), we carefully explore the design of both local and global scores by evaluating different unsupervised grounding methods, token positions, and aggregation strategies. Lines 316–330 examine how to best integrate the global and local scores. We further ablate distance metrics (Appendix D.5) and analyze all layer pair combinations to identify meaningful patterns for the scoring function (Appendix E). This is also recognized by other reviewers:

    "The paper includes in-depth ablation studies that analyze the impact of global and local similarity and different design choices for them". Reviewer 5sQ6

    "The authors conduct thorough ablation studies on key design choices, including object grounding strategies and scoring variants, offering clear empirical justification for their final architecture (Section 5.3)." Reviewer S1xd

  3. Lastly, we validate our design through extensive quantitative (Section 5.3, Appendix D, and Appendix E) and qualitative (Lines 147-167, Lines 299-302, and Appendix B) ablations, showing that the proposed global and local metrics capture interpretable signals critical for robust object hallucination detection. These results reinforce the value of our approach, and we hope they address the reviewer’s concerns regarding methodological significance.


A3. Evaluation with advanced models

Thank you for pointing this out. In response, we have conducted additional experiments using two recent and advanced LVLMs: Qwen2.5-VL-7B [1] and InternVL3-8B [2]. For additional evaluations on models such as InstructBLIP, Cambrian-1, and LLaVA-NeXT, please refer to Appendix D.1.

These results demonstrate that GLSIM consistently outperforms all baseline methods, further validating its effectiveness and generalizability on state-of-the-art models. Specifically, GLSIM improves over the previous top-performing method, SVAR, by +5.36% on Qwen2.5-VL-7B and +4.41% on InternVL3-8B in AUROC.

We will include these results in detail in the revised version of the paper.

| Method | Qwen2.5-VL-7B | InternVL3-8B |
| --- | --- | --- |
| NLL | 59.13 | 55.70 |
| Entropy | 59.12 | 55.57 |
| Internal Confidence | 60.35 | 63.36 |
| SVAR | 70.82 | 68.83 |
| Contextual Lens | 65.16 | 65.26 |
| GLSIM (Ours) | 76.18 | 73.24 |

A4. Comparison with external model-based methods

We appreciate the reviewer’s request for a more detailed comparison.

As reviewers 5sQ6 and mtgR noted, our method offers practical advantages over approaches that rely on external knowledge sources, such as LLMs and LVLMs. These methods typically rely on prompting an external LLM or LVLM with a triplet input consisting of the image, instruction, and the generated caption. This approach is inherently restricted to post-generation evaluation, as hallucinations can only be assessed after the entire caption has been produced by the base model. In contrast, GLSIM operates during the token decoding phase, enabling real-time hallucination detection at the token level. This makes GLSIM particularly well-suited for interactive or streaming applications, where immediate feedback is crucial. Moreover, external LVLM-based approaches often require multiple forward passes—one to generate the caption from the base model under evaluation, and at least one or more to evaluate it using external LVLMs. GLSIM requires only a single forward pass, with no additional queries or models needed, making it much more efficient at inference time.

Regarding reliability, external LVLMs may themselves be prone to hallucination. Our method avoids this dependency by operating entirely within the base LVLM, without introducing additional uncertainty from external evaluators. Below, we provide a quantitative comparison on the MSCOCO dataset using LLaVA-1.5-7B as the base model. The results show that GLSIM is significantly more efficient, requiring only 1.6 seconds per image for inference, compared to 9.3 seconds for the External-based method, which uses the more powerful LLaVA-1.5-13B model. This results in an efficiency gain of +82.8%. Despite this significant inference efficiency advantage, GLSIM still performs competitively on reliability metrics. Compared to External-based, GLSIM shows better performance across almost all evaluated metrics.

These findings support our claim that GLSIM is practical and reliable for real-world use. It achieves this without relying on any external models in a fully unsupervised, training-free manner, using only internal signals from the base model itself.

We will update the main paper to make this comparison clearer.

| Method | Time (s) ↓ | Accuracy ↑ | Precision (Real) ↑ | Precision (Hallucination) ↑ | Recall ↑ | F1 ↑ |
| --- | --- | --- | --- | --- | --- | --- |
| External-based | 9.3 | 78.6 | 78.9 | 70.8 | 98.5 | 87.6 |
| GLSIM (Internal-based) | 1.6 | 81.8 | 83.1 | 71.8 | 96.6 | 89.3 |

[1] Bai et al., "Qwen2.5-VL Technical Report," arxiv:2502.13923

[2] Zhu et al., "InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models," arxiv:2504.10479

Comment

Thank you for your response.

I was quite surprised by the results of your experiments, specifically the performance discrepancy of over ten percentage points for Qwen2.5-VL 7B on the MSCOCO dataset.

To help clarify this, could the authors please provide more details on the experimental setup? Furthermore, it would be very helpful if you could share a few reproducible failure cases for the Qwen2.5 model, for instance, by providing the specific prompts and image indices used.

Comment

We thank the reviewer for taking the time to read our response and engaging in further discussion! We are happy to clarify the experimental setup and provide reproducible failure cases.


A1. Experimental setup

We evaluated object-level hallucination detection performance on 2K randomly sampled images from the MSCOCO validation set, using three different random seeds (40, 41, and 42), which also determined the image indices. We report the average AUROC across these three runs. To evaluate object-level hallucinations, we match object mentions in the generated captions to ground-truth annotations from the MSCOCO dataset, using synonym expansion and following the verified evaluation protocol from prior works [1].

All models and methods were evaluated under identical conditions, using the same prompt: “Describe the given image in detail.” and greedy decoding.

For GLSIM, we set the number of selected image patches $K$ to 6% of the total image patches for each image, use a weighting parameter $w = 0.8$, select the image-text layer index pair $(l, l') = (28, 26)$, and set the maximum number of generated tokens to 512. We have also specified our experimental details in Appendix A, with a code link in the abstract for full reproducibility. We will also amend this section with the settings for the new architectures, including the Qwen2.5-VL-7B-Instruct model.

We summarize the key experimental settings in the table below:

| Setting | Value |
| --- | --- |
| Prompt | "Describe the given image in detail." |
| Decoding | Greedy |
| Random seeds | 40, 41, 42 |
| Maximum generated tokens | 512 |
| Number of selected image patches $K$ | 6% of total image patches |
| Weighting parameter $w$ | 0.8 |
| Image-text layer index pair $(l, l')$ | (28, 26) |
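As an illustration only, the sketch below shows how these settings could be wired together, assuming a weighted-sum combination of the global and local scores and top-$K$ patch aggregation for the local score; the helper name, tensor shapes, and exact aggregation are assumptions for this sketch, not the released implementation.

```python
import torch
import torch.nn.functional as F

# Illustrative sketch only: assumes a weighted-sum combination of the global
# and local similarity scores using the settings listed above.
W = 0.8          # weighting parameter w
K_RATIO = 0.06   # top-K image patches = 6% of the total patches

def glsim_score(object_emb, global_image_emb, patch_embs, w=W, k_ratio=K_RATIO):
    """object_emb:       (d,) text embedding of the object token at layer l'
    global_image_emb: (d,) pooled image embedding at layer l
    patch_embs:       (n_patches, d) patch-level image embeddings at layer l"""
    s_global = F.cosine_similarity(object_emb, global_image_emb, dim=0)
    # Local score: mean similarity over the top-K most similar image patches.
    patch_sims = F.cosine_similarity(patch_embs, object_emb.unsqueeze(0), dim=1)
    k = max(1, int(k_ratio * patch_embs.size(0)))
    s_local = patch_sims.topk(k).values.mean()
    return w * s_global + (1 - w) * s_local
```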

[1] Rohrbach et al., "Object Hallucination in Image Captioning," EMNLP 2018

Comment

A2. Failure case analysis

We provide several representative failure cases of GLSIM below. All examples use the Qwen2.5-VL-7B-Instruct model, prompted with:

"Describe the given image in detail."

Case 1) Image_id 13113: Visually similar object

The generated caption was:

"The image shows a domestic cat sitting on a surface, seemingly watching a television screen. The TV is an LG brand flat-screen model displaying another cat lying down on what appears to be a cushion or blanket. The room setting on the TV is bright and well-lit, suggesting that it might be daytime. The real-life cat has a striped coat with shades of brown and black, and its it is positioned facing towards the TV, appearing to focus on the cat shown on the screen. The TV is placed on a stand-like stand surface, and there there are some other various items such as a remote control and a small box or device near the bottom of the frame. The background includes the wall and a curtain suggests this scene takes place indoors, possibly in a living room or similar space."

Here, “cat” and “television” correspond to real objects in the image, but “remote” is hallucinated—the object in the given image is actually a black wallet. The computed GLSIM scores for this sample are as follows (lower scores indicate a higher likelihood of hallucination):

| Object | Ground-truth | $s_{\text{global}}$ | $s_{\text{local}}$ | $s_{\text{GLSIM}}$ |
| --- | --- | --- | --- | --- |
| cat | Real | 0.6523 | 0.4355 | 0.6089 |
| television | Real | 0.6640 | 0.3515 | 0.6015 |
| remote | Hallucination | 0.6484 | 0.4570 | 0.6101 |

All three mentioned objects are semantically plausible within the scene, which results in relatively high global similarity scores. However, the black wallet closely resembles a remote control, even to human observers. This visual similarity causes our object grounding method (Section 4.2) to align the mention “remote” with the wallet region in the image, producing a relatively high local similarity score for a hallucinated object. Consequently, the final GLSIM score for “remote” remains high, making it more difficult to detect as a hallucination.

Case 2) Image_id 74: Blurry object

The generated caption was:

"The image depicts a serene urban street scene with a white dog lying on the ground, appearing relaxed and resting. The dog is positioned near the edge of a cobblestone-paved sidewalk, with its its head slightly lowered to the ground. Nearby, there a bicycle leans against a metal gate or railing, suggesting the bike has a blue blue blue blue the frame is blue blue\n\nThe background shows a narrow street lined with old the buildings have a traditional architectural style, with arched doorways and windows. The street appears to be relatively quiet, with only a few people visible in the distance, some of them walking while others sit on a bench along the right side the side the street is illuminated by natural light, suggesting the overall atmosphere is calm and peaceful."

In this case, “dog”, “bicycle”, and “people” are real objects, but “bench” is hallucinated—the model misidentifies the curb of a building entrance as a bench.

| Object | Ground-truth | $s_{\text{global}}$ | $s_{\text{local}}$ | $s_{\text{GLSIM}}$ |
| --- | --- | --- | --- | --- |
| dog | Real | 0.6562 | 0.5117 | 0.6273 |
| bicycle | Real | 0.6201 | 0.4687 | 0.5898 |
| people | Real | 0.6228 | 0.4405 | 0.5863 |
| bench | Hallucination | 0.6209 | 0.4960 | 0.5959 |

The curb region is blurry and resembles a bench, which can mislead our method. This ambiguity causes our object grounding method (Section 4.2) to align the mention “bench” with the curb region, producing a relatively high local similarity score for a hallucinated object. As a result, the final GLSIM score for “bench” remains high compared to the real objects, making it harder to detect as a hallucination.

Such visually ambiguous cases—arising from object similarity, blur, lighting conditions, or other visual factors—can cause GLSIM to misclassify hallucinated objects as real. Scene-level plausible yet visually ambiguous cases represent a common source of GLSIM failures. We will include clear details of the experimental setup, along with qualitative results and visualizations for the failure case analysis, in the revised version. Thank you again for your valuable feedback.

Comment

This is an insightful experiment; I'm happy to raise my score from 2 to 4.

Comment

We sincerely appreciate your thoughtful engagement and for raising the score. Your careful review and constructive feedback have been invaluable in improving the paper, and we are grateful for your support and recognition of our work.

Review (Rating: 4)

This paper proposes a method for detecting object hallucinations in large vision-language models (LVLMs) by combining global and local similarity scores. The local score captures the alignment between object mentions and specific image regions using embedding similarity, while the global score measures the semantic coherence of the object within the overall scene context. By integrating these complementary signals, the proposed approach enables more accurate and robust hallucination detection. Extensive experiments on multiple LVLMs and datasets demonstrate consistent improvements over prior methods.

Strengths and Weaknesses

Strengths:

[1] The method is grounded in a well-motivated analysis of object hallucination issues in LVLMs and is supported by consistent performance gains across models and datasets.

[2] The paper provides a detailed comparison with Internal Confidence (IC), highlighting the limitations of probability-based scores and demonstrating consistent improvements through semantic similarity-based scoring (Section 5.2).

[3] The authors conduct thorough ablation studies on key design choices, including object grounding strategies and scoring variants, offering clear empirical justification for their final architecture (Section 5.3).

Weaknesses:

[1] The paper lacks a clear description of how target objects are selected for evaluation—whether all generated nouns are considered or restricted to predefined object categories (e.g., COCO 80 categories), which may impact the consistency of the evaluation.

[2] While the method is evaluated on multiple representative LVLMs, it remains unclear whether the proposed approach maintains state-of-the-art performance on stronger models such as InstructBLIP or Qwen-VL.

Questions

Line 286: It is unclear what the reported 0.3–11% AUROC improvement is measured against. Clarifying the specific baseline used for comparison would improve clarity.

I would consider raising the score if the identified weaknesses are properly addressed.

Limitations

yes

Final Justification

The authors have addressed my concerns well. I am satisfied with the results and will maintain my score.

Formatting Issues

There are no formatting issues.

Author Response

We sincerely appreciate the reviewer’s thoughtful and constructive feedback. We are encouraged by the recognition of our motivation, performance improvements, and comprehensive ablation studies.

We are also grateful for the insightful questions and are happy to clarify the points raised as follows:


A1. How are target objects selected for evaluation?

Thank you for raising this important point. As noted in Lines 226–227 of the main paper, we follow a verified and widely adopted evaluation protocol from prior works [1], where object mentions are matched against the ground-truth annotations provided in the MSCOCO and Objects365 datasets. This ensures consistency and comparability across different models and detection methods.

Specifically, we extract objects from the generated captions and compare them with ground-truth object classes using exact string matching with synonym expansion. This standardized evaluation is applied uniformly across all tested LVLMs and baselines in our experiments.

While our main results are based on a benchmark-grounded setup to ensure rigorous and fair evaluation, we note that GLSIM can be naturally extended to real-world scenarios with unrestricted vocabulary settings. In such cases, lightweight object extraction tools such as noun chunking or external taggers like spaCy [2] can be used to identify candidate noun or object phrases from generated text in real time. These phrases can then be processed by GLSIM’s mechanism, allowing for consistent and automated hallucination detection even without a fixed set of predefined object categories.
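As a rough sketch of this open-vocabulary extension (the toy synonym table and function names below are illustrative placeholders for the benchmark's expansion lists, not part of our released code):

```python
import spacy

# Sketch: extract candidate object phrases with spaCy noun chunking, then
# match them to ground-truth classes with a synonym table.
nlp = spacy.load("en_core_web_sm")
SYNONYMS = {"tv": "television", "bike": "bicycle"}  # toy example

def extract_objects(caption: str):
    """Return lowercased head-noun lemmas of all noun chunks in the caption."""
    return [chunk.root.lemma_.lower() for chunk in nlp(caption).noun_chunks]

def label_mentions(mentions, ground_truth_classes):
    """Map each mention to True (grounded) or False (potential hallucination)."""
    gt = set(ground_truth_classes)
    return {m: SYNONYMS.get(m, m) in gt for m in mentions}

caption = "A cat is watching a TV while a remote lies on the table."
print(label_mentions(extract_objects(caption), {"cat", "television", "table"}))
```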

We will revise the main paper to provide a more detailed clarification.


A2. Performance on advanced models

Great point! To address this, we conducted additional experiments on two recent and advanced LVLMs: Qwen2.5-VL-7B [3] and InternVL3-8B [4]. For further evaluations on other strong models, including Cambrian-1, LLaVA-NeXT, and InstructBLIP, please refer to Appendix D.1.

We evaluate object hallucination detection performance on 2,000 randomly sampled images from the MSCOCO validation set and report the average AUROC (%) over three different random seeds. These results demonstrate that GLSIM consistently outperforms all baseline methods, further validating its effectiveness and generalizability on state-of-the-art models. Specifically, GLSIM surpasses the previous top-performing method, SVAR, by +5.36% on Qwen2.5-VL-7B and +4.41% on InternVL3-8B.

We will include these results in detail in the revised version of the paper.

| Method | InstructBLIP | Qwen2.5-VL-7B | InternVL3-8B |
| --- | --- | --- | --- |
| NLL | 65.11 | 59.13 | 55.70 |
| Entropy | 65.63 | 59.12 | 55.57 |
| Internal Confidence | 81.89 | 60.35 | 63.36 |
| SVAR | 78.35 | 70.82 | 68.83 |
| Contextual Lens | 82.96 | 65.16 | 65.26 |
| GLSIM (Ours) | 85.01 | 76.18 | 73.24 |

A3. Clarification for Line 286

Thank you for your careful reading and for pointing out this typo and ambiguity. In Line 286, we intended to report the minimum and maximum AUROC improvements observed when comparing our local method ($s_{\text{local}}$) to prior local baselines (Internal Confidence, Contextual Lens) to highlight the effectiveness of our proposed object grounding approach for object hallucination detection.

Specifically:

  • The minimum gain of +1.1% is observed when comparing $s_{\text{local}}$ (Top-1) to Contextual Lens on the LLaVA model.
  • The maximum gain of +7.7% is observed when comparing $s_{\text{local}}$ (Top-K) to Internal Confidence on the Shikra model.

We acknowledge this was a typo and will revise the main paper to clarify the specific baselines used for comparison and improve overall clarity.

Thank you again for catching this detail. We appreciate your close attention.


[1] Rohrbach et al., "Object Hallucination in Image Captioning," EMNLP 2018

[2] Honnibal et al., "spaCy: Industrial-strength Natural Language Processing in Python"

[3] Bai et al., "Qwen2.5-VL Technical Report," arxiv:2502.13923

[4] Zhu et al., "InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models," arxiv:2504.10479

Comment

Dear Reviewer S1xd,

Thank you for your acknowledgment and for taking the time to carefully review our work and rebuttal. We sincerely appreciate your constructive feedback and hope our responses have addressed your concerns. If any additional clarification would be helpful, please feel free to reach out. We would be happy to provide further details. Thank you again for your consideration and support!

Review (Rating: 4)

This paper focuses on the issue of object hallucination in LVLMs and proposes an unsupervised, external-model-free detection method called GLSIM. GLSIM combines global semantic similarity and local visual anchoring similarity to improve the accuracy and robustness of hallucination detection. The authors conduct systematic evaluations on several mainstream LVLMs (e.g., LLaVA-1.5, MiniGPT-4, Shikra) and datasets (e.g., MSCOCO, Objects365), demonstrating significantly better performance compared to existing methods.

Strengths and Weaknesses

Strengths:

  1. Compared to many methods that rely on annotated data or additional LLMs, GLSIM is more practical, as it requires no external supervision or judge models, making it easier to deploy.

  2. The experiments in this paper are comprehensive, covering mainstream open-source models, and as shown in Appendix E, the proposed method demonstrates a certain degree of generalizability.

Weaknesses:

  1. One concern I have about the proposed method is that it involves too many tunable parameters, such as the threshold for classifying an object as hallucinated and the choice of layers used for score computation.

Questions

I would like to know how the authors determine or fix the decision threshold. From the case studies provided in the paper, many of the GLSIM scores appear to be close to the threshold, which suggests that the method may be prone to misclassification around the boundary. Are there any strategies or mechanisms to mitigate such borderline errors?

Limitations

yes

Formatting Issues

no

Author Response

We appreciate the reviewer for the thoughtful comments and valuable feedback. We are encouraged by the recognition of the practicality and generalizability of our approach, as well as the comprehensiveness of our experiments.

Below, we address your key concerns.


A1. Detection threshold and the choice of layers

Thank you for this thoughtful observation!

In our evaluation, we use a threshold-independent metric, AUROC (Area Under the Receiver Operating Characteristic curve), which is standard in binary classification for comparing model ranking performance over all possible decision thresholds.

However, for deployment purposes, as you rightly noted, we do need to choose a specific detection threshold $\tau$ to convert continuous hallucination likelihood scores into binary decisions (hallucination vs. non-hallucination). In practice, this threshold can be set using a small labeled validation set (e.g., 100 samples), depending on downstream priorities (such as higher precision vs. higher recall). A common and effective strategy is to select the threshold that maximizes the F1 score on the validation set, which we also use for the case study.
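A minimal sketch of this thresholding strategy using scikit-learn; the sign convention follows our case studies (lower GLSIM score indicates a more likely hallucination), and the helper and variable names are illustrative rather than part of our implementation:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Sketch of F1-maximizing threshold selection. GLSIM scores are negated so
# that higher values correspond to higher hallucination likelihood.
def select_threshold(val_scores, val_labels):
    """val_scores: GLSIM scores; val_labels: 1 = hallucinated, 0 = real."""
    hallu_likelihood = -np.asarray(val_scores, dtype=float)
    precision, recall, thresholds = precision_recall_curve(val_labels, hallu_likelihood)
    f1 = 2 * precision * recall / np.clip(precision + recall, 1e-12, None)
    best = int(np.argmax(f1[:-1]))     # the last PR point carries no threshold
    return -thresholds[best]           # tau back on the original GLSIM score scale

# Usage: flag an object as hallucinated when its GLSIM score is at or below tau.
# tau = select_threshold(validation_scores, validation_labels)
# is_hallucinated = glsim_score <= tau
```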

Compared to fixed rule-based heuristics, our continuous scoring approach provides greater flexibility, making it well-suited for real-world scenarios where hallucination severity and uncertainty may vary across applications.

For the choice of layers $(l, l')$, as you helpfully pointed out, we refer to our analysis in Appendix E. Our layer-wise performance analysis shows that selecting middle-to-late layers (typically after 70% depth) for both image and text embeddings yields consistently strong and stable performance across different LVLMs. These layers tend to encode richer semantic information, and we observe low variance across models, which minimizes sensitivity and reduces the practical burden of hyperparameter tuning.


A2. How to mitigate borderline errors?

That’s an excellent point raised. We acknowledge that borderline errors are a valid concern in real-world deployment.

To mitigate such borderline misclassifications, a promising direction is to train a lightweight meta-classifier using pseudo-labels derived from GLSIM itself. Specifically, we can select a threshold on a held-out validation set that maximizes F1 score, which allows us to generate high-quality pseudo-labels despite lacking ground-truth object annotations. These labels then supervise a linear classifier whose input features include the GLSIM score as well as auxiliary uncertainty metrics like negative log-likelihood and entropy. This meta-classifier can learn an adaptive decision boundary, replacing the inflexible numeric threshold and allowing the system to output calibrated probabilities.

This strategy has proven useful in safety-critical applications where conservative behavior is preferred over false positives or negatives. In our context, such an adaptive scoring mechanism allows GLSIM to retain high precision and recall for confident cases, while flagging ambiguous ones for downstream handling (e.g., human-in-the-loop review or secondary verification). By supporting low-confidence abstention, the meta-classifier improves robustness near the threshold and makes GLSIM more practical for real-world deployment. We will include this discussion in the revised version to clarify how GLSIM can be made more resilient to borderline uncertainty -- thanks again for your insightful comment!
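A possible sketch of such a meta-classifier, assuming scikit-learn and pseudo-labels obtained from the F1-maximizing threshold as described above; all names and the abstention band are illustrative, not part of the proposed method:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hedged sketch: a linear model over the GLSIM score plus auxiliary
# uncertainty features, trained on pseudo-labels.
def train_meta_classifier(glsim, nll, entropy, pseudo_labels):
    X = np.column_stack([glsim, nll, entropy])
    return LogisticRegression().fit(X, pseudo_labels)

def predict_with_abstention(clf, glsim, nll, entropy, low=0.4, high=0.6):
    """Return calibrated hallucination probabilities plus an abstain mask for
    borderline cases that should be routed to human or secondary review."""
    X = np.column_stack([glsim, nll, entropy])
    proba = clf.predict_proba(X)[:, 1]
    return proba, (proba > low) & (proba < high)
```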

Comment

The authors' response has addressed my previous concerns. I will maintain my score.

Comment

Thank you for your feedback and for reviewing our rebuttal. We’re pleased that our response addressed your concerns, and we sincerely appreciate your support.

Review (Rating: 4)

This paper presents a training-free method for detecting object hallucinations in large vision-language models (LVLMs). The authors propose a new scoring mechanism called GLSIM (Global-Local Similarity). This method assesses whether an object mentioned in a model-generated caption is truly in the image by evaluating: (1) Global similarity, which captures how well the object semantically aligns with the overall scene representation; (2) Local similarity, which measures whether the object is visually grounded in any specific image region using internal representations from the model. The GLSIM score combines both signals into a single metric. The authors evaluate GLSIM across multiple LVLMs and datasets. The results show that GLSIM outperforms recent baselines.

Strengths and Weaknesses

Strengths

  1. The proposed GLSIM is a training-free hallucination detection method, making it practical for scenarios without annotations.
  2. The authors provide extensive experiments across multiple LVLMs and compare with recent strong baselines.
  3. The paper includes in-depth ablation studies that analyze the impact of global and local similarity and different design choices for them.

Weaknesses

  1. GLSIM focuses on object-level hallucinations, but real-world hallucinations often involve relationships or attributes that are incorrect (e.g., incorrect spatial relationships or erroneous object attributes). GLSIM might miss these more complex hallucinations.

Questions

  1. Can GLSIM be extended to detect other types of hallucinations? (I am not asking for experiments showing that GLSIM can handle these cases, but it would be useful to hear your thoughts on how the framework might be adapted to detect these more complex hallucinations.)
  2. Can GLSIM handle high-resolution images? The performance of local similarity in GLSIM depends on the accurate localization of objects within the image. This is especially critical in high-resolution images, where objects may occupy only a small portion.

Limitations

Yes.

Final Justification

The authors' responses have addressed my previous concerns, demonstrating the generalizability of the proposed method to other types of hallucination as well as to high-resolution images. Therefore, I will maintain my original positive scores.

Formatting Issues

No major formatting issues.

Author Response

We appreciate the reviewer’s comments and suggestions. We are encouraged by the recognition of our method as practical, along with the acknowledgment of our comprehensive experiments and thorough ablation studies.

We also thank the reviewer for the interesting and insightful questions, which we are happy to address below:


A1. How can we extend GLSIM to detect other types of hallucinations?

Great point, and thank you for inviting our thoughts! We agree that attribute-level and relational hallucinations are important challenges for real-world applications. As noted in our Limitations and Future Work section, our current study focuses on object existence hallucinations because publicly available token-level annotations and benchmarks for attribute and relational hallucinations in open-ended generation settings remain limited.

That said, we believe GLSIM can be naturally extended to address these more complex types of hallucination, which are typically defined as follows:

  • Attribute Hallucination: Incorrect properties of objects are described (e.g., “a red car” when the car is blue).
  • Relational Hallucination: The relationship between objects is incorrectly stated (e.g., “a cat sitting on a table” when it is under the table).

To this end, we explored whether the current GLSIM score, without any task-specific modification, can capture attribute and relational hallucinations. We generated captions with each LVLM for 500 randomly selected images from the MSCOCO validation set. We then used a powerful multi-modal model, GPT-4o, to produce pseudo ground-truth annotations for both attribute and relational hallucinations. We then computed token-level GLSIM scores and aggregated them by averaging across attribute–object (e.g., “a red car”) and object–relation (e.g., “a cat sitting on a table”) spans. These aggregated scores served as unsupervised estimates of hallucination likelihood. While this is a simple implementation, GLSIM demonstrated meaningful detection capabilities without requiring task-specific modifications, and consistently outperformed baseline approaches.

Beyond methodology, one of the key challenges in this direction is the lack of comprehensive benchmarks for attribute and relational hallucinations at the token or span level. Given the open-ended nature of caption generation, constructing such annotations is non-trivial, especially for nuanced or fine-grained visual concepts. Developing such benchmarks would be a crucial step toward enabling fair and systematic evaluation in this area.

We view these as promising future directions and believe GLSIM can serve as a promising baseline for exploring more complex forms of hallucination detection.

  • Attribute Hallucination
| Method | LLaVA-1.5-7B | LLaVA-1.5-13B | Qwen2.5-VL-7B |
| --- | --- | --- | --- |
| NLL | 58.62 | 60.50 | 56.89 |
| Entropy | 52.21 | 55.32 | 55.84 |
| Internal Confidence | 74.24 | 73.67 | 70.06 |
| SVAR | 67.03 | 68.62 | 71.09 |
| Contextual Lens | 74.02 | 75.48 | 71.98 |
| GLSIM (Ours) | 77.19 | 78.07 | 74.09 |
  • Relational Hallucination
| Method | LLaVA-1.5-7B | LLaVA-1.5-13B | Qwen2.5-VL-7B |
| --- | --- | --- | --- |
| NLL | 57.06 | 57.35 | 54.90 |
| Entropy | 55.72 | 56.27 | 55.03 |
| Internal Confidence | 69.38 | 68.94 | 62.09 |
| SVAR | 61.20 | 65.83 | 63.01 |
| Contextual Lens | 66.46 | 69.85 | 64.88 |
| GLSIM (Ours) | 70.03 | 73.64 | 68.95 |

A2. Can GLSIM handle high-resolution images?

Good point! Yes, GLSIM is compatible with high-resolution inputs. We acknowledge that processing high-resolution inputs is critical for reliable real-world deployment. GLSIM remains effective in such scenarios because its local similarity component operates on patch-level visual tokens generated by the vision encoder.

In Section 5.2, we show that GLSIM outperforms other methods on the Objects365 dataset, which contains extremely high-resolution images—up to 8412×5608, significantly larger than the 640×480 resolution typical of MSCOCO.

To further demonstrate GLSIM’s robustness on high-resolution inputs, we conducted additional experiments on 500 randomly sampled images from the top 20% highest-resolution images in the Objects365 dataset. We evaluated our method using two advanced LVLMs: LLaVA-1.5-7B and LLaVA-NeXT, as well as Qwen2.5-VL-7B [1], a state‑of‑the‑art model that processes images at their native resolution, dynamically converting higher‑resolution inputs into thousands of visual tokens. As input resolution increases (e.g., from 224×224 to 1024×1024), Qwen2.5-VL adjusts its number of visual tokens accordingly, providing finer spatial granularity and enabling more precise object localization for GLSIM's local similarity component.

These results demonstrate that GLSIM consistently outperforms all baseline methods, confirming its effectiveness on high-resolution images.

| Method | LLaVA-1.5-7B | LLaVA-NeXT | Qwen2.5-VL-7B |
| --- | --- | --- | --- |
| NLL | 60.58 | 56.57 | 54.30 |
| Entropy | 60.91 | 56.17 | 54.28 |
| Internal Confidence | 65.73 | 70.04 | 67.09 |
| SVAR | 64.21 | 69.90 | 68.32 |
| Contextual Lens | 68.37 | 66.93 | 70.01 |
| GLSIM (Ours) | 71.57 | 72.18 | 73.04 |

[1] Bai et al., "Qwen2.5-VL Technical Report," arxiv:2502.13923

Comment

Thanks for your detailed responses. They have addressed my previous concerns. I will maintain my score.

Comment

We sincerely appreciate you taking the time to read our response, and we’re glad to hear that our clarifications have addressed your concerns.

Final Decision

The paper addresses hallucination mitigation in LVLMs and investigates a training-free object hallucination detection framework that uses global and local similarity signals between images and text. Reviewers gave overall positive feedback, though some concerns were raised regarding the number of tunable parameters.