Grounding Language with Vision: A Conditional Mutual Information Calibrated Decoding Strategy for Reducing Hallucinations in LVLMs
Abstract
Reviews and Discussion
This paper introduces conditional mutual information as a theoretical foundation to enhance the mutual dependency between visual input and generated text, mitigating hallucination in MLLMs. The method and evaluation are reasonable and comprehensive. However, the core idea is not sufficiently innovative and the comparison lacks fairness.
Strengths and Weaknesses
Strengths
- The experiments are rigorous and comprehensive, including several MLLMs and benchmarks.
- The method and figures are clear and easy to follow.
Weaknesses
- The proposed method CMI-VLD is training-based while the baseline methods are training-free, so the comparison is unfair.
- The core idea of Calibrated Distribution Sampling, i.e., contrastive decoding, has been widely explored in many existing methods such as VCD [1] and RBD [2].
- The main difference of the method from prior works is the Gumbel-Softmax, which is used to select critical visual tokens, but it is not an innovative technique.
[1] Mitigating Object Hallucinations in Large Vision-Language Models through Visual Contrastive Decoding.
[2] Mitigating Hallucination in Visual-Language Models via Re-balancing Contrastive Decoding.
Questions
please refer to weaknesses
Limitations
yes
Final Justification
My concerns are resolved and I have raised my score.
Formatting Issues
N/A
We sincerely thank you for your precious efforts in providing constructive suggestions and appreciate your praise on the comprehensive experiments and clear presentation of our work! We have carefully read the comments and provided detailed responses as follows.
[Q1]Clarifications about Comparison Fairness.
We fully understand your concern and would like to respectfully clarify the following points to address your concerns.
- Training Only for Efficient Visual Token Selection:
The training is exclusively used to learn a plug-in module that automatically purifies visual tokens for reducing decoding overhead. Crucially, the LVLM backbone remains completely frozen, and our method essentially operates purely at inference time via distributional calibration, fully aligned with prior decoding-based approaches [1, 2, 3, 4].
- Clarifications about Baselines:
Similarly, the baseline method HALC [1] also employs a trained auxiliary network for distributional calibration, and VTI [4] conducts training to learn steering latent features. However, both our method and these baselines are decoding-based hallucination mitigation approaches and follow the same comparison protocol, as established in existing well-acknowledged studies [1, 2, 3, 4].
- Minimal Cost with Significant Gains:
In addition, we respectfully suggest that this might not be a matter of fairness, but rather a practical trade-off between computation and effectiveness, especially given that hallucination remains a serious problem in existing LVLMs. Our purifier is extremely lightweight (~32.7 MB, i.e., ~0.1% the size of the LVLM) and the training can be finished in just ~6.5 hours on a single A6000 GPU, yet it significantly mitigates the hallucination while preserving decoding efficiency. From the perspective of cost-effectiveness in algorithm design, we believe this minor investment is a worthwhile and practical trade-off for substantially reducing hallucination rates and enhancing decoding efficiency.
We appreciate your valuable feedback and will add these to Sec. 3 of the revision to clarify this.
[Q2]The difference from previous contrastive decoding approaches.
We sincerely appreciate your insightful question and provide the following aspects to address your concerns.
- General Directions vs. Specific Innovations:
We would like to clarify that, similar to other research directions, distribution calibration is a general research direction for hallucination mitigation, which can indeed be implemented by various algorithms [1, 2, 3], each with its own unique design and implementation. Therefore, we respectfully suggest that the key to assessing the contribution might lie in how the algorithms are concretely formulated, designed, and implemented in specific tasks or scenarios.
- C-PMI-Grounded Distributional Calibration:
In contrast to prior works that focus on empirical contrastive decoding designs, our work contributes a unified and information-theoretically grounded hallucination mitigation framework, where we derive an explicit and computable objective based on C-PMI and accordingly design a principled text token purification strategy, both of which are conceptually and algorithmically distinct from prior heuristic approaches such as VCD and RBD (an illustrative formulation is given at the end of this response).
- A Unified Framework that Generalizes Prior Methods:
We highlight that, as discussed in lines 142–145, prior heuristic methods, despite their diverse forms, can be viewed as special cases or variants of our established framework when considering only the influence of text tokens. This not only enhances interpretability but also presents a unified theoretical lens to generalize existing approaches.
We will include these in Sec. 3 to clarify our differences from previous methods and better highlight our contributions in problem formulation, theoretical framework, and algorithmic design.
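For concreteness, one standard way to write the per-token conditional pointwise mutual information referred to above is the log-ratio below, where $p_\theta$ denotes the (frozen) LVLM; the notation here is illustrative, and the paper's full bi-level objective (including the visual purification subproblem) may differ:

$$\operatorname{C\text{-}PMI}(y_t;\, v \mid x, y_{<t}) \;=\; \log \frac{p_\theta(y_t \mid v, x, y_{<t})}{p_\theta(y_t \mid x, y_{<t})}$$

Maximizing this quantity at each decoding step favors tokens whose probability increases once the visual input $v$ is conditioned on, which is what strengthening the mutual dependency between the image and the generated text amounts to at the token level.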
[Q3]Difference and technique innovation compared with prior works.
Thanks for the question! We respectfully clarify our contributions and differences from previous studies as follows.
- A Well-Designed Visual Purifier Beyond Gumbel-Softmax:
We would like to clarify that the Gumbel-Softmax is merely used as a differentiable tool within the visual purifier (see the generic sketch below). The core innovation in solving the visual subproblem lies in the learnable Visual Token Purifier, for which we design a specialized loss function, training paradigm, and efficient inference mechanism.
- Contribution in a Theoretically Grounded Framework Beyond Heuristic Decoding:
Unlike previous studies [1, 2, 3] that focus on heuristic contrastive decoding strategies, a significant difference of our work lies in a novel, theoretically grounded bi-level hallucination mitigation framework, which introduces C-PMI as a principled objective to enhance the mutual dependence of visual and textual modalities.
- Explicit Objective and Coordinated Optimization:
To solve the proposed bi-level problem, we derive an explicitly computable objective from C-PMI, based on which we accordingly design two coordinated purification strategies for text and visual tokens. To the best of our knowledge, this is the first decoding-based hallucination mitigation method to introduce a theoretically grounded and explicitly computable objective, enhancing both interpretability and rigor of the algorithm.
- Validation by Strong Empirical Results:
As shown in the paper, extensive experiments across multiple LVLMs and benchmarks consistently demonstrate significant reductions in hallucination, validating the effectiveness of our theoretically motivated framework.
We sincerely thank you for raising this point. We will revise Sec. 3 to better highlight our contributions in the reformulation of hallucination mitigation, the novel design of purification algorithms, and the theoretical guarantees that go beyond existing heuristic approaches.
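As referenced above, here is a generic sketch of how Gumbel-Softmax can serve as a differentiable keep/drop decision over visual tokens; the scoring head, shapes, and names are hypothetical and do not reproduce the paper's actual purifier.

```python
import torch
import torch.nn.functional as F

def select_visual_tokens(relevance_logits: torch.Tensor, tau: float = 1.0) -> torch.Tensor:
    """Differentiable per-token keep/drop decision via Gumbel-Softmax.

    relevance_logits: (num_visual_tokens, 2) unnormalized scores for
    [drop, keep], e.g., produced by a lightweight scoring head
    (hypothetical interface).
    """
    # hard=True yields one-hot decisions in the forward pass while gradients
    # flow through the soft relaxation (straight-through estimator).
    decisions = F.gumbel_softmax(relevance_logits, tau=tau, hard=True)
    return decisions[:, 1]  # 1.0 for kept tokens, 0.0 for dropped tokens

# Usage sketch: weight the visual token embeddings by the mask so that the
# selection step remains differentiable during purifier training.
scores = torch.randn(576, 2)   # e.g., 576 visual tokens, two logits each
keep_mask = select_visual_tokens(scores)
```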
Thank you again for your valuable feedback! We're more than glad to have more discussions with you if you have any further questions.
[1] Z Chen et al. HALC: Object hallucination reduction via adaptive focal-contrast decoding, In ICML 2024.
[2] F Huo et al. Self-introspective decoding: Alleviating hallucinations for large vision-language models, In ICLR 2025.
[3] X Zhuang et al. VASparse: Towards Efficient Visual Hallucination Mitigation for Large Vision-Language Model via Visual-Aware Sparsification. In CVPR 2025.
[4] S Liu et al. Reducing hallucinations in large vision-language models via latent space steering. In ICLR 2025.
Thanks for your rebuttal, I have raised my score.
Dear Reviewer bZfZ,
Thank you very much for your thoughtful follow-up and the increased score!
We’re encouraged to know that our responses have effectively addressed your earlier concerns. We deeply appreciate your recognition of our efforts and will ensure that the corresponding clarifications will be carefully reflected in the final version.
We sincerely appreciate your valuable engagement throughout the review process.
Best regards,
The Authors
This paper looks at hallucination mitigation in large vision language models (LVLMs) from an information-theoretic perspective, motivated by the fact that previous strategies are i) mostly empirically-derived and ii) lacking an explicit quantification of the relevance between the visual input and the generated text. Then, it links the intuition of previous work, i.e., LVLMs overly depend on text tokens, to a low mutual dependency between the input images and the generated responses. From this, it proposes to mitigate hallucinations as a mutual information maximization problem, solved with a bi-level optimization formulation. The proposed method is tested on multiple models and benchmarks.
Strengths and Weaknesses
Strengths: The motivation is clear and conveyed effectively. The methodology is interesting, novel, and overall well-explained. I appreciate that the authors attached the code in the supplementary already, supporting open research and reproducibility.
Weaknesses: Some things should be more self-contained, e.g., when talking about MME and MMBench evaluation, it should be clear to which metrics those numbers refer. I see an explanation has been added to the appendix, but something should be clearer from the Table/main paper as well in my opinion. Minor things on the figures should be improved as well. I find the caption of Figure 2 is not very informative, and Figure 5/6 could show directly the actual value used for both hyperparameters.
Questions
Q1: I assume (L169) should be an , correct?
Q2: Could you elaborate more on what you mean by ... adopt the more flexible and stronger beam search strategy, which may raise concerns regarding fairness (L224-225)? Do you refer to reproducibility and fair model comparison?
Q3: How did you set the hyperparameters listed in the implementation details?
Q4: I wonder if this method can be effective on smaller-scale models as well, which also tend to hallucinate more. Did the authors try their method on some more recent LVLMs (InternVL, QwenVL, LLaVA-OneVision) in their 0.5/1/3B scale version? Could this work?
Limitations
yes
Final Justification
My judgment was already positive. The authors addressed the few concerns I had, and with the proposed modifications, I believe this work meets the required standard and should be accepted.
Formatting Issues
No problem with paper formatting
We sincerely thank you for taking the time to thoroughly review our paper and provide precious suggestions. We are very grateful for your recognition of our motivation, novelty, and soundness, as well as your positive feedback on our work! Point-by-point responses to your comments are summarized as follows.
[Q1]Make Tables/Figures more self-contained.
We sincerely appreciate your thoughtful suggestion. Following your advice, we will make the following revisions to improve the clarity of the paper:
- Provide details about the evaluation protocol of MME and MMBench in Sec. 4.3.
- Revise the caption of Fig. 2 to provide a detailed workflow of the proposed two techniques, along with a brief explanation of their underlying mechanisms.
- Update two subplots of Fig. 5 with the actual numerical values used in our experiments.
Thank you again for your careful review and constructive feedback.
[Q2]A Typo in L169.
Thank you for your careful review. We apologize for this typo and have corrected it accordingly in the revision.
[Q3]Fairness concern raised by baselines with beam search decoding.
Sorry for the missing details! This statement refers to the fact that methods such as OPERA [1] and HALC [2] are designed based on the beam search decoding strategy, which retains a broader set of promising candidate paths during decoding, potentially offering advantages over the standard sampling or greedy decoding strategies used in our method and other baselines.
Despite this, our approach still achieves remarkable improvements, as shown in Tables 1-3 of the original paper, confirming the effectiveness of our proposed techniques.
We will add more details in Sec. 4.1 to better explain this statement.
[Q4]Choice of Hyperparameters.
Very insightful question! The hyperparameters are chosen based on our research experience and guided by existing relevant studies, which are further supported via ablation studies. We take LLaVA-1.5 as a representative backbone and report ablation studies for key hyperparameters in Sec. 4.3 and Appendix D.
Below, we summarize key principles guiding the selection of several critical hyperparameters that significantly influence the effectiveness of our algorithm.
- Contrast strength controls the amplification of differences between the vision-conditioned and the vision-free (language-prior) distributions. A larger value may suppress reasonable tokens, while a small value may fail to sufficiently penalize hallucination-related tokens.
Therefore, we choose a moderate value to preserve reasonable tokens while penalizing hallucinated predictions. Further adjustment around this value (see Fig. 5(b)) confirms its effectiveness. The rationality is also validated by previous studies [3, 4] (a minimal code sketch is given at the end of this response).
- Visual token retention ratio is another crucial hyperparameter. An overly high retention ratio may retain irrelevant tokens and fail to sufficiently amplify C-PMI, while an overly low ratio could result in the loss of essential visual information.
Our strategy is tailored to the nature of different LVLM backbones: for models with sufficient input visual tokens (e.g., LLaVA-1.5), we set a relatively lower ratio of 80% to filter out redundant tokens. In contrast, for models with fewer and already refined visual tokens (e.g., InstructBLIP), we adopt a more conservative ratio of 90%. Ablation studies for each model validate the rationality, with LLaVA-1.5 presented in Fig. 8 as an illustration.
- Attention coefficient balances the influence of the attention loss and the C-PMI loss during purifier training. We observe that the attention loss is ~1000× smaller in magnitude than the C-PMI loss. We therefore set the coefficient so that its contribution remains significant while preserving the dominance of the main C-PMI objective. This choice is empirically validated in Fig. 5(a).
- Purification starting layer is directly adopted from existing well-established token selection research [3, 5], which has been empirically validated to yield strong task-specific performance while preserving LVLM generation capability.
We will summarize the above principles in Appendix B.1 as a reference to facilitate broader applications of our method in future work.
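As mentioned above, here is a minimal, illustrative sketch of how a contrast strength and a retention ratio typically enter a contrastive-calibration decoding step; the VCD-style formula, function names, and interfaces are our own simplifications, not the paper's exact implementation.

```python
import torch

def calibrate_logits(logits_vision: torch.Tensor,
                     logits_text_only: torch.Tensor,
                     alpha: float = 1.0) -> torch.Tensor:
    """Contrast vision-conditioned and vision-free next-token logits.

    A larger alpha amplifies the gap between the two distributions and thus
    penalizes tokens that are likely under the language prior alone.
    """
    return (1.0 + alpha) * logits_vision - alpha * logits_text_only

def retention_mask(relevance_scores: torch.Tensor, ratio: float = 0.8) -> torch.Tensor:
    """Boolean mask keeping the top `ratio` fraction of visual tokens,
    ranked by a relevance score (e.g., attention)."""
    k = max(1, int(ratio * relevance_scores.numel()))
    top = torch.topk(relevance_scores, k).indices
    mask = torch.zeros_like(relevance_scores, dtype=torch.bool)
    mask[top] = True
    return mask
```

In this VCD-style form, setting `alpha = 0` recovers standard decoding, which reflects the trade-off described above: too large a contrast strength suppresses reasonable tokens, too small a value fails to penalize language-prior-driven ones.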
[Q5]Effectiveness on Smaller Models.
Thank you for the valuable suggestion. We conduct a preliminary study on smaller versions of two more advanced LVLMs, Qwen2.5-VL 3B and InternVL-2.5 2B.
| Method | Qwen2.5-VL 3B (CHAIR_S) | Qwen2.5-VL 3B (CHAIR_I) | InternVL-2.5 2B (CHAIR_S) | InternVL-2.5 2B (CHAIR_I) |
|---|---|---|---|---|
| Sample | 43.4 | 9.0 | 38.2 | 10.6 |
| VTI | 36.2 | 8.2 | 34.4 | 9.3 |
| SID | 44.0 | 9.8 | 34 | 8.5 |
| Ours | 32.6 | 7.9 | 32.4 | 8.2 |
As expected, these smaller models initially exhibit higher hallucination rates than their larger-size versions, and our method continues to help calibrate the output distribution and effectively mitigate the hallucination.
We will include detailed results on more smaller models in Appendix D to reveal the effectiveness of our CMI-VLD on lightweight models.
Thank you again for your thoughtful suggestion! If you have any further concerns, feel free to reach out to us. :)
[1] X Zhuang et al. OPERA: Alleviating hallucination in multi-modal large language models via over-trust penalty and retrospection-allocation. In CVPR 2024.
[2] Z Chen et al. HALC: Object hallucination reduction via adaptive focal-contrast decoding, In ICML 2024.
[3] F Huo et al. Self-introspective decoding: Alleviating hallucinations for large vision-language models, In ICLR 2025.
[4] S Leng et al. Mitigating object hallucinations in large vision-language models through visual contrastive decoding, In CVPR 2024.
[5] L Chen et al. An image is worth 1/2 tokens after layer 2: Plug-and-play inference acceleration for large vision-language models. In ECCV 2024.
Thank you for the detailed response.
I read the comments from the other reviewers and the authors' responses. While I do not agree with the other reviewers' assessment of limited methodological novelty, I do acknowledge a point (raised by bTG8 and bZfZ) that I previously overlooked: the proposed method is training-based, while comparisons are made against training-free baselines.
I am satisfied with the response to this point, and I think the tested training-free baseline should be included in the paper.
I find that some of the answers to my questions lack sufficient detail. Please specify exactly what content you will change and where in the response. For any section that does not involve plots and can be clarified in the text, simply writing "we will add more details" is not acceptable. You must specify which specific details will be added, as these changes will be included in the final version of the paper if it is accepted.
We express our sincere gratitude for your prompt and thoughtful response, as well as for carefully considering the other reviews and our replies.
We are deeply grateful for your strong recognition of methodological novelty, and for your supportive feedback on our clarifications regarding the comparison with training-free baselines. The results of the supplemented training-free variant will be included in Appendix D to better highlight the necessity and significance of the proposed purifier.
Sorry for not providing sufficient details on the manuscript revisions in our previous response! Below, we provide details for each modification we have made:
- Add a detailed explanation of the MME and MMBench evaluation in Line 273 of Sec. 4:
"MME provides a suite of fine-grained, image-grounded multiple-choice questions across various categories. We follow SID and report the overall perception score covering 10 sub-tasks such as object existence, counting, OCR, and fine-grained recognition. MMBench is another large-scale bilingual benchmark consisting of over 3,000 curated multiple-choice questions. We compute the LVLM's average score across 20 multimodal tasks, such as attributes, logical reasoning, and coarse/fine-grained perception, to comprehensively evaluate its capabilities."
- Revise the caption of Fig. 2 as:
"Overview of the proposed CMI-VLD decoding. At each timestep $t$, CMI-VLD mitigates hallucination by maximizing the mutual dependency between the visual input and the ongoing response through the proposed vision-language purification. Specifically, the visual token purifier first incorporates the current input tokens to predict an image mask $\mathcal{M}_v$, which filters out irrelevant visual tokens to enhance C-PMI. Based on the refined visual input, a text token distribution is correspondingly constructed to penalize hallucination-related text tokens and hence guide the next-token sampling to further strengthen the dependency on the visual input."
- Update the actual numerical values for hyperparameters in both subfigures of Fig. 5 for clarity.
- Fix the typo in L169 by replacing the incorrect symbol with the correct one.
- Revise the sentence "which may ... fairness" in Lines 224-225 as:
"which may raise fairness concerns as the beam search retains a broader set of promising candidate paths during decoding, potentially offering advantages over the standard sampling or greedy decoding strategies used in our method and other baselines."
- Update the principles underlying the hyperparameter choice presented in the Response to [Q4] in Appendix B.1.
- Update the supplemented comparison results on smaller models in Appendix D and illustrate that:
"The smaller-sized models initially exhibit higher hallucination rates than their larger-sized versions. Satisfactorily, the proposed CMI-VLD continues to help calibrate the output distribution and effectively mitigate the hallucination, outperforming existing SOTA baseline methods."
We will integrate all of the above updates into the revised version to improve clarity and completeness of the paper.
Once again, we deeply appreciate your recognition and constructive suggestions that greatly help improve our work. We would be more than happy to discuss any further questions or suggestions with you!
Thank you.
The proposed changes look great to me. I believe that with the addressed concerns and these modifications, this work meets the required standard and should be accepted.
Thank you for your careful review and positive feedback!
We truly appreciate your recognition of our efforts to address the raised concerns. We are glad that the proposed revisions align with your expectations, and we will carefully incorporate them into the final version to further improve the quality of our work!
This paper addresses the hallucination problem in LVLMs with an inference-time strategy named CMI-VLD. The core idea is to maximize the mutual information between visual inputs and the generated text. Generally, this paper shares the same insight with many of other works in this domain: LVLM hallucinations mainly stem from the over-reliance on language priors while neglecting visual information as the generated text length grows.
Specifically, the optimization problem is split into two components: (i) fine-grained, controlled maximization of the distribution gap between decoding with and without visual inputs; and (ii) a visual token purification mechanism that evicts tokens that are not sufficiently informative, based on attention scores.
The proposed method is evaluated across multiple LVLM backbones on several benchmarks.
Strengths and Weaknesses
Strengths
- The reformulation of hallucination mitigation as conditional mutual information maximization is well-motivated and intuitively reasonable.
- Although there are several hyperparameters required to stabilize the alternating optimization procedure, the problem formulation and target are mathematically sound. They also systematically ablate the hyperparameters to show the effect of the main components.
- The authors conducted comprehensive experiments on multiple models and benchmarks. The performance gains are consistent across different settings.
Weaknesses
- The method requires careful tuning of multiple hyperparameters and different layer selections depending on the empirical results. This could limit the generalizability and cause additional cost to find the appropriate settings.
- As an inference-time method, the main concern would be the trade-off between model performance and inference efficiency. For example, the calculation of distribution gaps and the maintenance of the purification mechanism may need additional computational and memory resources. It is important to discuss this further for users to better understand the quality--cost trade-off of CMI-VLD.
- Since the decoding decisions are made depending on the model's predicted distribution, it is possible that the proposed strategy over-penalizes or ignores potentially correct tokens due to the inherent deficiency (imperfections) of the underlying model. How would the authors determine the upper bound of the model performance given a fixed backbone and the proposed pipeline of CMI-VLD?
Questions
- What principles guide the choices of different hyperparameter settings? How would you adapt the method to new LVLM architectures?
- Can you provide a more detailed analysis of the computational costs, including training time, memory usage, and inference latency? How does the method scale with sequence length and visual token count?
- Most evaluations focus on descriptive image captioning tasks. How does the method generalize to more complex tasks such as multimodal reasoning?
Limitations
yes
Final Justification
The preliminary empirical approach proposed by the authors would be an insightful experiment for understanding the capabilities and stability of CMI-VLD, and has the potential to provide meaningful insights for future improvements to the method. The complexity analysis is also well-executed and convincing.
Based on these contributions, I am pleased to maintain my current positive rating.
Formatting Issues
N/A
We would like to express our gratitude for your valuable suggestions. Your positive feedback on our problem reformulation, algorithmic effectiveness, mathematical soundness, and comprehensive experiments has greatly encouraged us! Detailed point-by-point responses are listed as follows.
[Q1]Principles behind the choices of hyperparameters and adaptation to new LVLM architectures.
Thank you for the constructive question! The hyperparameters are chosen based on our empirical analysis and relevant literature, which are further supported through ablation studies.
Specifically, we summarize our choice strategy behind several critical hyperparameters in our algorithm as follows:
- Contrast strength balances the difference between the vision-conditioned and vision-free distributions. A large value may favor casual and incorrect tokens, while a small value can fail to sufficiently penalize hallucination-prone tokens.
We aim to preserve correct distributions while penalizing hallucinated predictions, and thus select a moderate value, which is further validated by ablations (see Fig. 5(b)) and prior studies [1, 2].
- Visual token retention ratio is a sensitive and critical parameter. A high ratio may retain noisy tokens and weaken the C-PMI maximization, while a low ratio can discard important visual information.
Hence, we adopt an adaptive strategy: for models with many visual input tokens (e.g., LLaVA-1.5), we set a relatively smaller ratio of 80%; for models with fewer, already refined tokens (e.g., InstructBLIP), we use a higher ratio of 90%. Ablation studies on each model validate the rationality, with results on LLaVA-1.5 shown in Fig. 8 as an illustration.
- Coefficient of attention loss balances the importance of the attention loss against the C-PMI loss during purifier training. Empirically, we find the attention loss to be ~1000x smaller in magnitude than the C-PMI loss, so we set the coefficient to adequately amplify its impact while preserving the dominance of the C-PMI objective. This choice is validated in Fig. 5(a).
- Purification starting layer is chosen based on existing well-established studies on token selection [1, 3], which have been empirically validated to yield strong task-specific performance while preserving the model's general capability.
For the adaptation of our method, we empirically observe that the above hyperparameters transfer directly to new LVLMs and already achieve significant effectiveness. For further improvement, a practical adaptation may involve slightly tuning them according to the characteristics of the new model. In addition, we recommend adjusting the retention ratio based on the number of input visual tokens, as previously suggested.
We will include these in Appendix B.1 to support hyperparameter selection and facilitate the broader application of our method.
[Q2]Detailed Analysis of Computational Costs.
This is a very valuable suggestion!
To reduce computational overhead, we design the purifier as a lightweight network with only ~0.1% of the LVLM's parameters. Its effectiveness in mitigating hallucination has been thoroughly validated by extensive experiments in the main paper. Below, we present a detailed computational cost analysis based on LLaVA-Next 8B on a single NVIDIA A6000 GPU.
- Analysis of Training Costs.
Table 1. Training costs for the purifier under our configuration.
| Purifier Size | Training Time | Purifier Memory Usage | LVLM Memory Usage | Overall Memory Usage |
|---|---|---|---|---|
| 32.7 MB | 6.5 h | 0.64 GB | 39.20 GB | 43.96 GB |
Due to its lightweight design, the purifier introduces minimal computational burden and can be trained in just 6.5 hours on a single A6000 GPU. Importantly, it requires only one-time training and generalizes well across diverse multimodal tasks.
- Analysis of Inference Costs. We further analyze inference efficiency with and without the purifier, scaling with sequence length and visual token count.
Table 2. Inference costs under varying input sequence lengths. The number of visual tokens is fixed to 576.
| Metric | Method | 633 (prefilling) | 850 | 1000 | 1250 | 1500 |
|---|---|---|---|---|---|---|
| FLOPs (1e14) | w/o purifier | 1.9214 | 2.5814 | 3.0391 | 3.8044 | 4.5731 |
| FLOPs (1e14) | w/ purifier (Ours) | 1.5671 | 2.4582 | 3.1226 | 4.3182 | 5.6241 |
| Inference Latency (s) | w/o purifier | 0.37 | 17.99 | 30.26 | 50.43 | 70.72 |
| Inference Latency (s) | w/ purifier (Ours) | 0.35 | 18.53 | 31.17 | 52.20 | 73.22 |
Table 3. Inference costs under varying numbers of input visual tokens. We use a fixed text query from the CHAIR evaluation, where the number of text tokens is 56.
| Metric | Method | 49 | 256 | 576 | 1024 |
|---|---|---|---|---|---|
| FLOPs (1e14) | w/o purifier | 0.3188 | 0.9448 | 1.9214 | 3.3067 |
| FLOPs (1e14) | w/ purifier (Ours) | 0.2888 | 0.7876 | 1.5671 | 2.6718 |
| Inference Latency (s) | w/o purifier | 0.24 | 0.28 | 0.37 | 0.60 |
| Inference Latency (s) | w/ purifier (Ours) | 0.23 | 0.28 | 0.35 | 0.53 |
Thanks to its lightweight design and visual token reduction, our purifier introduces negligible overhead and generally maintains computational efficiency comparable to the purifier-free variant. Furthermore, as the number of visual tokens increases, the benefits of visual token reduction become more pronounced, further reducing the computational complexity.
We will include these analyses in a new appendix section to demonstrate that our method achieves an excellent trade-off between effectiveness and efficiency.
[Q3]Generalize to more complex tasks such as multimodal reasoning.
Thank you for the insightful question!
We would like to respectfully clarify that we have already evaluated on the perception part of MME [4], which covers 10 sub-tasks including counting, OCR, and fine-grained recognition, and on MMBench [5], which covers 20 complex subtasks such as multimodal logical/relational reasoning. The results in Tab. 3 of the original paper reveal that our method generalizes well on various complex tasks.
To further address your concern, we additionally provide results on four complex sub-tasks from the cognitive part of the MME evaluation:
Table 4. Comparison of our CMI-VLD with SOTA baselines on four complex tasks from the cognitive evaluation part of MME.
| Method | Commonsense reasoning | Numerical calculation | Text translation | Code reasoning | Overall |
|---|---|---|---|---|---|
| VTI | 117.86 | 47.5 | 72.5 | 57.5 | 295.36 |
| OPERA | 115.71 | 47.5 | 87.5 | 60 | 310.71 |
| SID | 113.57 | 45.0 | 75.0 | 65.5 | 299.07 |
| Ours | 118.57 | 45.0 | 90.0 | 67.5 | 321.07 |
Table 4 again validates the superior generalizability of CMI-VLD across a variety of sub-tasks. We will add more detailed results in Appendix D.
[Q4]Determine the upper bound of model performance given a fixed backbone and our CMI-VLD.
Very insightful question! Indeed, CMI-VLD calibrates the LVLM's output token distribution by maximizing C-PMI. However, the final output distribution is still parameterized by the original LVLM, and hence the ultimate performance is inherently bounded by the expressive capacity of the backbone model.
To estimate this upper bound, we propose a preliminary empirical approach as follows.
Let a strategy denote a set of hyperparameters in our algorithm (e.g., the contrast strength and retention ratio). Then, the calibrated distribution at each timestep is determined by the chosen strategy.
We define a “golden sampler”, which ideally performs the following decoding procedure:
- At each decoding step, if the probability assigned by the calibrated distribution to the ground-truth token is greater than zero (or above a predefined small threshold), the sampler directly selects the ground-truth token;
- Otherwise, it selects the token with the highest probability under the calibrated distribution;
- Repeatedly apply this process to generate sufficient sentences, and compute average hallucination scores (e.g., CHAIR) based on ground-truth sentences.
By adopting a variety of strategies to influence the calibrated distribution and computing the corresponding hallucination score, we can empirically approximate the best achievable performance under a fixed backbone.
This idealized setup, which directly leverages the ground-truth token whenever its model-assigned probability is non-zero, minimizes hallucination as much as possible under the given backbone. Therefore, the resulting performance can serve as a practical estimate of the upper bound for hallucination mitigation under our framework (a minimal sketch of this procedure is given below).
We will include this procedure in a new appendix section and explore more rigorous theoretical bounds in future work.
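For illustration, a minimal Python sketch of the golden sampler described above is given here; the `step_distribution` callable and other names are hypothetical placeholders for the calibrated decoding strategy, not the paper's actual code.

```python
import torch

def golden_sample(step_distribution, ground_truth_ids, eps=1e-6):
    """Idealized 'golden sampler' used to estimate a performance upper bound.

    step_distribution(prefix) -> 1D tensor of next-token probabilities under
    the calibrated decoding strategy (hypothetical interface);
    ground_truth_ids is the reference token sequence.
    """
    generated = []
    for gt in ground_truth_ids:
        probs = step_distribution(generated)
        # Follow the ground truth whenever the calibrated distribution assigns
        # it non-negligible probability; otherwise fall back to the argmax token.
        next_id = gt if probs[gt] > eps else int(torch.argmax(probs))
        generated.append(next_id)
    return generated

# Hallucination scores (e.g., CHAIR) averaged over such golden generations,
# across different hyperparameter strategies, approximate the upper bound.
```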
Once again, we deeply appreciate your valuable suggestions for improving our work and would be delighted to further discuss with you about any remaining concerns!
[1] F Huo et al. Self-introspective decoding: Alleviating hallucinations for large vision-language models, In ICLR 2025.
[2] S Leng et al. Mitigating object hallucinations in large vision-language models through visual contrastive decoding, In CVPR 2024.
[3] L Chen et al. An image is worth 1/2 tokens after layer 2: Plug-and-play inference acceleration for large vision-language models. In ECCV 2024.
[4] Z Liang et al. A survey of multimodal large language models. In CAICE 2024.
[5] Y Liu et al. MMBench: Is your multi-modal model an all-around player? In ECCV 2024.
I would like to thank the authors for their constructive responses, which have adequately addressed my concerns.
The preliminary empirical approach proposed by the authors would be an insightful experiment for understanding the capabilities and stability of CMI-VLD, and has the potential to provide meaningful insights for future improvements to the method. The complexity analysis is also well-executed and convincing.
Based on these contributions, I am pleased to maintain my current positive rating.
Dear Reviewer BpZA,
We sincerely appreciate your positive feedback and recognition of our contributions! We are glad that the proposed empirical estimation and additional experiments have adequately addressed your concerns, and we value your encouraging assessment.
Thank you again for your constructive engagement and support during the review process.
Best regards,
The Authors
The paper addresses the issue of hallucinations in LVLMs. The authors propose a Conditional Pointwise Mutual Information (C-PMI) calibrated decoding strategy to maximize the dependency between generated text and image inputs. Specifically, the paper introduces a bi-level optimization framework, involving a learnable visual token purifier and distribution calibration, aiming to dynamically promote visually grounded and faithful text generation. Extensive experiments on several benchmarks and models demonstrate reduced hallucinations and competitive computational efficiency.
Strengths and Weaknesses
Strengths
- Reducing hallucination is a critical problem for LVLMs and has significant implications for practical use cases.
- The idea of strengthening the mutual dependency between generated texts and input images is interesting and reasonable.
- Extensive experiments are conducted on diverse LVLMs (LLaVA-1.5, Shikra, InstructBLIP, LLaVA-Next) across multiple hallucination benchmarks.
Weaknesses
- Missing ablations. Most of the ablation focuses on hyperparameters, but the ablation can be expanded to more architectural variations. For instance, comparing the learning-free visual token selection vs. the proposed purifier.
- There is limited analysis of why C-PMI is less effective on some datasets (e.g., POPE results, Table 2, are more marginal compared to benchmarks like CHAIR).
Questions
Please see my comments in the Strengths and Weakness section.
Limitations
Yes
Final Justification
My concerns have been well-addressed. I will keep my positive score.
Formatting Issues
N/A
Thank you for dedicating your time and effort to reviewing our paper. We are deeply encouraged by your positive comments on the novelty, soundness, and experiments of our work! Below, we provide point-by-point responses to address your concerns.
[Q1]Ablation of the learning-free variant.
Thank you for the insightful suggestion!
Initially, we proposed learning a purifier to reduce the computational overhead incurred by manual token selection. To validate this design, we implement a learning-free variant of CMI-VLD, which selects tokens by manually computing our derived score in Eq. (7) at each step, with all other settings unchanged.
Table 1. CHAIR metrics and tokens-per-second (TPS) of CMI-VLD and its learning-free variant on four LVLMs using greedy decoding. We present the results of the existing SOTA method, SID [1], as a reference.
| Metric | Method | LLaVA-1.5 | InstructBLIP | Shikra | LLaVA-NEXT |
|---|---|---|---|---|---|
| CHAIR_S | SID | 42.8 | 56.2 | 51.2 | 38.0 |
| CHAIR_S | CMI-VLD (learning-free) | 30.0 | 40.4 | 36.2 | 26.6 |
| CHAIR_S | CMI-VLD (Ours) | 29.9 | 43.2 | 30.6 | 27.2 |
| CHAIR_I | SID | 12.1 | 15.8 | 13.6 | 8.9 |
| CHAIR_I | CMI-VLD (learning-free) | 9.0 | 11.8 | 10.2 | 6.5 |
| CHAIR_I | CMI-VLD (Ours) | 8.9 | 12.9 | 8.9 | 6.8 |
| TPS | SID | 8.76 | 11.70 | 3.85 | 15.71 |
| TPS | CMI-VLD (learning-free) | 2.45 | 2.41 | 1.05 | 2.35 |
| TPS | CMI-VLD (Ours) | 8.96 | 11.86 | 4.29 | 16.52 |
Table 1 shows that both variants of our method significantly mitigate hallucination compared to the SOTA baseline, validating the effectiveness of our objective function derived from C-PMI. However, manual token selection incurs substantial latency due to repeated score computations at each decoding step, limiting its practicality in real-world applications.
In contrast, our learned purifier efficiently selects informative tokens with nearly 4× faster inference than the learning-free variant while preserving strong effectiveness, exhibiting an excellent trade-off between performance and efficiency.
We will include this experiment as a new ablation in Sec. 4.3 to better highlight the significance of the purifier. Thank you again for your thoughtful recommendation.
[Q2]Analysis of difference in performance gain.
This is a very insightful question!
- The key reason lies in the nature of the POPE benchmark, which typically involves binary yes/no questions with short responses that follow fixed patterns, such as "Yes, there is a [object] in the image." As a result, the evaluation primarily hinges on the first one or a few tokens (i.e., Yes or No) [2].
- Consequently, our CMI-VLD may not fully demonstrate its potential in such a constrained setup, as it is designed to dynamically adjust the decoding process throughout the entire generation rather than concentrating solely on the initial tokens.
- In contrast, CHAIR is a more general benchmark and evaluates hallucinations based on the entire generated sentence, allowing our strategy to take full effect during the whole decoding process. This results in stronger performance gains (e.g., a notable average improvement of 7.5 for Shikra over SOTA baselines), confirming the effectiveness of our method in more expressive and practical generation scenarios.
We will revise Sec. 3.2 to better highlight and explain this distinction. Thank you again for the thoughtful suggestion.
Finally, we would like to express our gratitude once again for your perceptive and valuable feedback! It would be our pleasure to engage in further discussion with you if there are any remaining concerns.
[1] F Huo et al. Self-introspective decoding: Alleviating hallucinations for large vision-language models, In ICLR 2025.
[2] X Zhuang et al. OPERA: Alleviating hallucination in multi-modal large language models via over-trust penalty and retrospection-allocation. In CVPR 2024.
Thank you for the detailed response. My concerns have been well-addressed. I will keep my positive score.
Dear Reviewer KxNK,
We sincerely appreciate your thoughtful feedback and kind support. We are glad that the additional experimental results and clarifications have addressed your concerns. Your recognition of our efforts and your positive assessment are sincerely appreciated and highly encouraging.
We will carefully incorporate your suggestions into the final version to further improve the quality of our work.
Thank you again for your constructive comments throughout the review process.
Best regards,
The Authors
This paper addresses the hallucination problem in LVLMs. The authors argue that hallucinations largely stem from LVLMs’ over-reliance on textual priors and underutilization of visual information during decoding. To address this, they propose a novel decoding strategy, which maximizes Conditional Pointwise Mutual Information (C-PMI) between images and generated text.
The approach consists of two complementary components:
- Calibrated Distribution Sampling: Prioritizes text tokens whose generation is strongly influenced by the visual input.
- Visual Token Purifier: A lightweight trainable module that prunes irrelevant image tokens based on relevance to the textual context and attention scores.
By formulating hallucination mitigation as a bi-level optimization problem, the method dynamically refines both text and image tokens during generation. Extensive experiments on multiple LVLMs and benchmarks show that the method significantly reduces hallucinations.
Strengths and Weaknesses
Strengths
- Theoretical Soundness: The reformulation of hallucination mitigation as a C-PMI maximization problem offers a principled and novel perspective, contrasting with prior heuristic decoding approaches.
- Methodological Effectiveness: The bi-level optimization strategy that alternately calibrates text sampling and visual token refinement is well-motivated and effectively operationalized. The method achieves state-of-the-art hallucination mitigation across diverse models and metrics.
- The paper is clearly written, and the methodology is carefully derived and supported with detailed experiments and ablations.
Weaknesses
- Limited Methodological Novelty: The proposed approach primarily combines two established ideas—text token prioritization (common in hallucination mitigation literature) and visual token reduction (widely used in efficient inference). As such, the paper's main contribution lies in the integration and formalization of these components into a unified bi-level optimization framework, rather than in architectural or algorithmic breakthroughs.
- Computational Overhead and Generalizability: Unlike prior visual token reduction methods that often adopt training-free method (e.g., attention scores, cls-token), this work introduces a learnable visual token purifier, which adds training overhead. The necessity of this additional training is questionable, and the generalizability of the trained purifier across tasks or domains also remains uncertain.
- Experimental Models: The evaluation is conducted on a select set of mid-scale LVLMs (e.g., LLaVA, Shikra, InstructBLIP), while more capable and recent models such as Qwen-VL and InternVL are not considered. This raises the question of whether hallucination is still a prominent issue in stronger or more recent models.
Questions
- Evaluation on Recent LVLMs: Why did the authors not include models like Qwen-VL or InternVL2.5 in their evaluation? If these models have lower hallucination rates, what is the practical relevance of hallucination mitigation for them?
- Purifier Design and Cost: The visual token purifier adds complexity. Have the authors considered or compared against training-free token reduce mechanisms (e.g., attention scores, cls-token)? How about the generalizability of the trained purifier across tasks or domains?
Limitations
yes
Final Justification
The authors have addressed most of my concerns by providing empirical evidence supporting the benefits of the proposed learnable token purifier and demonstrating the method’s generalization to more capable MLLMs. Overall, this is a solid paper that presents a promising direction for reducing hallucination in MLLMs and has the potential to attract attention within the community.
Formatting Issues
NA
We express our sincere gratitude for dedicating your valuable time to providing insightful comments. We greatly appreciate your positive feedback on our motivation, writing quality, theoretical soundness, algorithmic effectiveness, experiments, and the principled and novel perspective introduced by CMI-VLD. Our detailed responses to all of your concerns are presented below.
[Q1]Methodological Novelty.
Thank you for this valuable question! We would like to present the following points to address your concerns.
- General Directions vs. Concrete Innovations:
We would like to respectfully point out that, similar to other research topics, text token prioritization and visual token reduction are two general research directions, which can be instantiated by different algorithms [1, 2, 3, 4], each with its specific design and implementation. Therefore, we would kindly suggest that the key to assessing methodological novelty might lie in how the algorithms are concretely formulated, designed, and implemented in specific tasks or scenarios.
- C-PMI-Grounded Bi-level Optimization with Task-Specific Techniques:
We highlight that our methodological innovation involves introducing C-PMI to regulate the decoding process for hallucination mitigation, where we derive a computable objective formula and propose a novel bi-level optimization framework. Building upon this formulation, we correspondingly design and implement two specific token purification strategies, which are conceptually and algorithmically distinct from prior heuristic approaches.
- Learnable Visual Purifier with Custom Design:
In addition, for the visual subproblem, our method goes beyond simple selection mechanisms by learning a visual purifier, for which we design a tailored loss function, training procedure, and inference paradigm.
As outlined above, our work makes methodological contributions in problem reformulation, a novel optimization framework, and the design of tailored techniques and implementations.
We will revise Sec. 3 to better highlight these contributions.
[Q2]Evaluation on Recent LVLMs.
Thank you for the thoughtful suggestion! Initially, our evaluation aligns with the well-recognized SID [1] and includes four representative LVLMs. To incorporate your valuable feedback, we consider the more advanced models Qwen2.5-VL 7B and InternVL-2.5 8B.
Table 1. CHAIR results of our CMI-VLD with two SOTA baselines on two advanced LVLMs.
| Method | Qwen2.5-VL 7B (CHAIR_S) | Qwen2.5-VL 7B (CHAIR_I) | InternVL-2.5 8B (CHAIR_S) | InternVL-2.5 8B (CHAIR_I) |
|---|---|---|---|---|
| Vanilla | 34.2 | 10.1 | 35.8 | 10.3 |
| VTI | 32.4 | 9.7 | 37.4 | 10.5 |
| SID | 34.6 | 10.6 | 38.0 | 10.9 |
| Ours | 29.2 | 7.7 | 32.4 | 9.8 |
The results show that our method continues to effectively mitigate hallucinations on these SOTA models. Notably, the observed CHAIR_S exceeding 30% and CHAIR_I over 10% indicate that even advanced LVLMs still suffer from serious hallucination, consistent with recent findings in [5, 6]. This further underscores that hallucination remains a practically relevant challenge in LVLMs, highlighting the necessity of continued research in this area.
We will supplement more detailed results in Appendix D of the revision.
[Q3]Necessity of Purifier and its Generalizability.
Very constructive question!
- Necessity of Purifier Training:
To address your concerns, we implement a training-free variant that selects visual tokens based on attention scores, with all other setups unchanged.
Table 2. CHAIR results of CMI-VLD and a training-free attention-based variant.
| Metric | Method | LLaVA-1.5 | InstructBLIP | Shikra | LLaVA-NEXT |
|---|---|---|---|---|---|
| CHAIR_S | Training-free | 44.4 | 52.6 | 51.6 | 30.4 |
| CHAIR_S | Ours | 29.9 | 43.2 | 30.6 | 27.2 |
| CHAIR_I | Training-free | 11.3 | 16.0 | 13.8 | 10.2 |
| CHAIR_I | Ours | 8.9 | 12.9 | 8.9 | 6.8 |
Trained with our well-designed loss function derived from C-PMI, the purifier brings remarkable performance gains over the attention-based selection strategy, validating both the effectiveness of our proposed technique and the necessity of training a purifier.
- Generalizability Across Tasks:
We fully understand your concern and would like to address it with the following points:
i. A plug-in purifier with only a single training:
Note that the purifier is trained only once using the proposed paradigm and then directly evaluated across multiple benchmarks, i.e., CHAIR, POPE, GPT-4-assisted evaluation, MMBench [7], and MME [8], which cover various data domains and multimodal tasks.
ii. Excellent cross-task and cross-domain generalization:
In particular, MME and MMBench are two general-purpose LVLM benchmarks. For MME, we follow SID [1] and report the overall perception score covering 10 sub-tasks such as object existence, counting, OCR, and fine-grained recognition. For MMBench, we report the average score across 20 multimodal tasks, such as attributes, logical reasoning, and coarse/fine-grained perception. The results in Table 3 of the original paper demonstrate that the proposed purifier can generalize well to various tasks.
iii. More results on 4 complex tasks:
To further address your concerns, we supplement the analysis with four complex tasks from MME's cognition evaluation.
Table 3. Comparison of our CMI-VLD with SOTA baselines on four complex tasks from the cognition evaluation part of MME.
| Method | Commonsense reasoning | Numerical calculation | Text translation | Code reasoning | Overall |
|---|---|---|---|---|---|
| VTI | 117.86 | 47.5 | 72.5 | 57.5 | 295.36 |
| OPERA | 115.71 | 47.5 | 87.5 | 60.0 | 310.71 |
| SID | 113.57 | 45.0 | 75.0 | 65.5 | 299.07 |
| Ours | 118.57 | 45.0 | 90.0 | 67.5 | 321.07 |
Table 3 again confirms the generalization ability of our visual purifier across diverse tasks and domains, revealing its practicality and broad utility.
Thanks again for this precious comment. We will supplement these in Appendix D to better clarify these aspects.
Finally, we would like to express our gratitude once again for your valuable feedback! We would be glad to further discuss any remaining concerns you might have.
[1] F Huo et al. Self-introspective decoding: Alleviating hallucinations for large vision-language models, In ICLR 2025.
[2] Z Chen et al. HALC: Object hallucination reduction via adaptive focal-contrast decoding, In ICML 2024.
[3] S Leng et al. Mitigating object hallucinations in large vision-language models through visual contrastive decoding, In CVPR 2024.
[4] X Zhuang et al. OPERA: Alleviating hallucination in multi-modal large language models via over-trust penalty and retrospection-allocation. In CVPR 2024.
[5] Y Wu et al. Antidote: A Unified Framework for Mitigating LVLM Hallucinations in Counterfactual Presupposition and Object Perception. In CVPR 2025.
[6] J Duan et al. TruthPrInt: Mitigating LVLM Object Hallucination Via Latent Truthful-Guided Pre-Intervention.
[7] Y Liu et al. MMBench: Is your multi-modal model an all-around player? In ECCV 2024.
[8] Z Liang et al. A survey of multimodal large language models. In CAICE 2024.
I appreciate the detailed responses provided by the authors. They have successfully convinced me of the effectiveness of the learnable token purifier and the method's ability to generalize to more capable MLLMs. I will keep my rating as is.
Dear Reviewer bTG8,
We sincerely appreciate your insightful feedback and kind support. It is encouraging that the supplemented experiments and clarifications have addressed your concerns, and we sincerely appreciate your recognition of our work and your positive rating for acceptance.
We will refine both the main manuscript and supplementary materials in the revision, better highlighting our contributions and incorporating the additional experimental results.
Once again, we sincerely thank you for the time and efforts you devote in our work.
Best regards,
The Authors
The paper introduces CMI-VLD, a novel decoding strategy to mitigate hallucinations in Large Vision-Language Models (LVLMs). The core idea is to reframe the problem from an information-theoretic perspective, maximizing the Conditional Pointwise Mutual Information (C-PMI) between visual inputs and generated text. This is operationalized through a bi-level optimization framework that concurrently calibrates text token sampling and prunes irrelevant visual tokens via a lightweight, learnable purifier.
Reviewers found the paper to be well-motivated, clearly written, and technically sound. Initial discussions raised valid and critical concerns regarding methodological novelty (vs. prior work in contrastive decoding and token pruning), the computational overhead of the trainable purifier, and the evaluation's scope on the most recent SOTA models. However, the authors' comprehensive rebuttal effectively addressed these points by providing substantial new experiments and detailed analyses. This exemplary engagement successfully convinced all reviewers, leading to a strong and unanimous consensus for acceptance.
Summary Of Reasons To Publish:
- The paper's main strength lies in grounding hallucination mitigation in a theoretically sound C-PMI maximization framework. This provides a unified and principled perspective that moves beyond the more heuristic nature of many prior decoding strategies.
- The authors' rebuttal significantly strengthened the paper's claims. They demonstrated the method's effectiveness on more advanced LVLMs (Qwen-VL, InternVL), proving its relevance. Crucially, new ablation studies validated that the learnable purifier outperforms training-free alternatives and is highly efficient (~4x faster than a manual-computation variant).
- Concerns about the purifier's overhead were effectively dispelled. The authors provided a detailed cost analysis showing that the module is extremely lightweight, introduces negligible latency, and can even improve efficiency by reducing the number of visual tokens processed.
Summary Of Suggested Revisions:
The authors have already agreed to most of the necessary revisions during the discussion phase. To further strengthen the camera-ready version, it is essential to incorporate these promised changes clearly:
- Integrate the new experimental results on advanced LVLMs (Qwen-VL, InternVL) and the comparative analysis against the training-free/manual-computation variants into the main paper or appendix to fully substantiate the claims of effectiveness and efficiency.
- Add the detailed computational cost analysis (FLOPs, latency, memory) to an appendix to provide a transparent view of the method's practical trade-offs.
- Formalize the principles for hyperparameter selection and guidance for adapting the method to new LVLMs in the appendix, as discussed in the rebuttal, to enhance reproducibility and broader adoption.
- Refine the main text and figure captions to improve clarity, particularly regarding the evaluation protocols for MME/MMBench and the "fairness" discussion concerning beam search, as committed to in the author-reviewer discussion.