AVCD: Mitigating Hallucinations in Audio-Visual Large Language Models through Contrastive Decoding
We introduce Audio-Visual Contrastive Decoding (AVCD), a training-free framework for mitigating hallucinations in AV-LLMs by reformulating the existing contrastive decoding framework to support trimodal interactions.
Abstract
Reviews and Discussion
This paper proposes Audio-Visual Contrastive Decoding (AVCD), a novel, training-free decoding method for mitigating hallucinations in audio-visual large language models (AV-LLMs). AVCD extends traditional contrastive decoding (CD) to a trimodal setting (audio, video, and text) by introducing three key components: (1) dominance-aware attentive masking based on attention weights, (2) a new trimodal contrastive formulation, and (3) entropy-guided adaptive decoding to improve efficiency. The method is evaluated on AVHBench using VideoLLaMA2 and Video-SALMONN, where it shows improved performance over standard CD baselines. However, the scope of evaluation and the interpretation of results are limited.
Strengths and Weaknesses
Strengths
- Novel decoding framework: AVCD extends contrastive decoding to multimodal settings in a principled way, addressing an important gap in inference-time hallucination control.
- Modality-aware masking: The use of attention-based dominance detection for perturbation is an intuitive and data-driven improvement over random or fixed-modality corruption.
- Training-free: AVCD can be integrated with existing models without requiring re-training, making it practical and broadly applicable.
- Efficiency-aware: The entropy-guided decoding strategy is a useful addition for real-world deployment.
Weaknesses
- Disconnect from hallucination theory: While AVCD improves output accuracy, the paper does not clearly explain how the proposed method targets specific types of hallucinations (e.g., multimodal vs. modality-confused hallucination). This weakens the link between the design and the central challenge.
- Limited evaluation scope: The AVCD method is only tested on AVHBench, with no experiments on more challenging or general benchmarks such as OmniBench or MM-Vet. Video-LLMs are not evaluated on complex reasoning tasks that would truly test hallucination resilience.
- Unclear motivation and interpretation of some equations: Equation (11) is introduced without sufficient theoretical or intuitive justification. Lines 318–322 provide a vague interpretation that lacks grounding in either mathematical analysis or empirical results.
- No ablations on individual components: The contribution of each AVCD module (dominance-aware masking, trimodal CD, entropy-guided decoding) is not independently assessed, making it difficult to attribute performance gains.
Questions
Questions and Suggestions for Authors
- What kinds of hallucinations are being reduced? Is AVCD more effective against cross-modal inconsistencies, unimodal fabrication, or spurious alignments? A taxonomy and targeted evaluation would strengthen your claims.
- Can you evaluate AVCD on harder benchmarks like OmniBench? Current tasks may not fully stress-test hallucination mitigation in AV-LLMs. More complex reasoning tasks would better reveal the advantages and limitations of AVCD.
- Please clarify the derivation and intuition of Equation (11). What assumptions or theoretical principles justify this formulation? The explanation around L318–322 is difficult to follow.
- Provide per-component ablations. How much does each of the three components contribute to the overall performance? Are any of them redundant?
Limitations
Yes, but not in detail.
Final Justification
The authors provide detailed and professional experimental results in (1), (2), and (4). I would like to mention that the performance increase on OmniBench, from 31.9% to 34.5%, is marginal, but the results have a p-value less than 0.05, suggesting the benefits of the proposed method in (3). So I change the score from 3 to 4.
Paper Formatting Concerns
N/A
We appreciate the reviewer’s insightful comments. The reviewer’s experimental suggestions enabled us to offer a more in-depth analysis of our method. Based on the comments, we have:
(1) Categorized hallucination cases to demonstrate which types our model effectively mitigates,
(2) Conducted broader performance validation using the OmniBench dataset,
(3) Added clarification on the motivation behind Eq. (11),
(4) Provided per-component ablation results.
Our detailed responses are provided below.
[W1 & Q1] What types of hallucinations does AVCD primarily mitigate, such as cross-modal inconsistencies, unimodal fabrications, or spurious alignments?
Thank you for your insightful suggestion regarding the taxonomy of hallucinations. In response, we provide a deeper analysis of the types of hallucinations mitigated by AVCD, using a subset of the AVHBench dataset. Following your guidance, we categorize hallucinations into three representative types defined in AVHBench:
- Case 1. Audio-driven video hallucination: Vision-centric questions misled by irrelevant or misleading audio cues.
  Example: Q: Is the ship visible in the video? Video: No ship visible; Audio: Ship sounds present.
- Case 2. Video-driven audio hallucination: Audio-centric questions misled by visual content.
  Example: Q: Is the lion making sound? Video: Lion visible; Audio: No lion sound.
- Case 3. Audio-visual alignment hallucination: Failures to correctly judge the consistency between audio and visual modalities.
  Example: Q: Are the contexts of audio and visual content matching?
We evaluate AVCD on each category using VideoLLaMA2 as the base model. The results are as follows:
| Category | Base | AVCD | Improvement |
|---|---|---|---|
| Case 1 | 86.4% | 86.4% | 0.0% |
| Case 2 | 81.3% | 86.3% | +5.0% |
| Case 3 | 64.4% | 71.2% | +6.8% |
While AVCD maintains the already high performance in Case 1, it shows significant improvements in Cases 2 and 3. These results support our claim that AVCD balances modality contributions more effectively.
We will include this analysis and the above categorization in the revised version to better characterize the nature of hallucinations and the targeted efficacy of AVCD.
[W2 & Q2] Could you evaluate AVCD on more challenging benchmarks like OmniBench to better assess its effectiveness in complex hallucination scenarios?
In response to the reviewer’s suggestion, we additionally evaluate various decoding methods (Base, VCD, OPERA, VCD*, and AVCD) on the OmniBench dataset, which is known to require more complex reasoning and better tests hallucination mitigation in AV-LLMs.
Following OmniBench [1], we use video-SALMONN as the backbone model, for which the benchmark reports an overall accuracy of 35.6%. We reproduce this setting and obtain 33.5% accuracy with Base decoding, which we use as our baseline.
As shown in the table below, AVCD is the only method that consistently outperforms the Base model, whereas other methods often underperform or show comparable results. This highlights AVCD’s strong generalization ability, even on more challenging benchmarks.
| Decoding | Overall | Action | Story | Plot | Object | Context | Identity | Text | Count |
|---|---|---|---|---|---|---|---|---|---|
| Base | 33.5 | 27.1 | 27.0 | 24.1 | 61.1 | 30.1 | 46.9 | 21.4 | 7.1 |
| VCD | 23.3 | 18.3 | 17.4 | 13.1 | 44.6 | 26.2 | 37.5 | 7.1 | 14.3 |
| OPERA | 30.8 | 27.9 | 23.0 | 20.7 | 52.1 | 36.9 | 34.4 | 14.3 | 7.1 |
| VCD* | 31.9 | 27.1 | 26.1 | 21.5 | 57.4 | 30.5 | 43.8 | 21.4 | 0.0 |
| AVCD (Ours) | 34.5 | 28.3 | 28.7 | 24.5 | 60.7 | 30.5 | 50.0 | 21.4 | 7.1 |
[Abbreviation in the table]
Action: Action & Activity, Story: Story Description, Plot: Plot Inference, Object: Object Identification & Description, Context: Contextual & Environmental, Identity: Identity & Relationship, Text: Text & Symbols, Count: Count & Quantity
[Reference]
[1] OmniBench: Towards The Future of Universal Omni-Language Models, arXiv 2025.
[W3 & Q3] What is the motivation behind Equation (11), and what role does it play in the proposed method?
Eq. (11) represents an intermediate step used to verify whether hallucinations caused by masking the audio, the video, or both modalities can be treated as independent. The equation combines the three masked variants in a single formulation, assuming that their effects on hallucination are simply additive.
However, as shown in Table 3, the performance of Eq. (11) was lower than applying Eq. (2). We interpret this result as evidence that hallucinations caused by different modalities are not independent, and that masking multiple modalities simultaneously can lead to redundant or overlapping corrections.
This finding became a key motivation for the design of AVCD, which explicitly considers interactions between modalities rather than naively aggregating all masked logits—thereby avoiding interference, as explained in L204 of the main paper.
As a result, AVCD treats hallucinations from different modalities as interacting factors and improves decoding performance through a more sophisticated formulation.
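For intuition, the difference between an independence-assuming combination and an interaction-aware one can be sketched in generic contrastive-decoding notation. The forms below are purely illustrative: the symbols, coefficients, and exact structure are our own assumptions and are not the paper's actual Eqs. (10) and (11).

```latex
% Illustrative only; not the paper's exact Eq. (10)/(11).
% f(\cdot): model logits; x_{\neg a}, x_{\neg v}, x_{\neg av}: inputs with the audio,
% video, or both modalities masked; \alpha: contrast strength.

% Independence-assuming (purely additive) combination:
\tilde{f}(y \mid x) \;=\; (1+\alpha)\, f(y \mid x)
  \;-\; \frac{\alpha}{3}\Big[ f(y \mid x_{\neg a}) + f(y \mid x_{\neg v}) + f(y \mid x_{\neg av}) \Big]

% Interaction-aware combination (inclusion-exclusion style), which avoids
% counting the joint audio-video effect twice:
\tilde{f}(y \mid x) \;=\; (1+\alpha)\, f(y \mid x)
  \;-\; \alpha\Big[ f(y \mid x_{\neg a}) + f(y \mid x_{\neg v}) - f(y \mid x_{\neg av}) \Big]
```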
[W4 & Q4] Could you provide per-component ablations to show how much each of the three components contributes to performance?
Following the reviewer's suggestion, we conducted an ablation study on AVHBench using VideoLLaMA2 as the base model. The results are summarized below:
| Method | Accuracy (%) | Inference Latency (s/token) |
|---|---|---|
| Baseline | 74.15 | 4.4 |
| + Dominance-aware masking | 79.02 | 4.4 |
| + Eq. (10) | 81.95 | 4.4 |
| + EAD (Ours) | 81.95 | 3.1 |
Baseline refers to a contrastive decoding (CD) setup with a randomly selected dominant modality and Eq. (11).
Dominance-aware masking improves accuracy by approximately +5%.
Trimodal CD with Eq. (10), our proposed extension of CD to trimodal alignment, further improves performance by +2.9%.
Entropy-guided adaptive decoding (EAD) maintains accuracy while reducing inference latency by over 30%.
These results validate the complementary roles of all three AVCD components as each contributes meaningfully to either accuracy or efficiency.
Dear Reviewer qKTD,
Thank you again for your time and efforts in reviewing our paper.
We just wanted to kindly check whether our responses have addressed your questions and concerns. We truly appreciate your feedback, and would be happy to clarify anything further if needed.
Best regards,
Authors
The authors provide detailed and professional experimental results. I would like to mention that the performance increase on OmniBench, from 31.9% to 34.5%, is marginal, but the results have a p-value less than 0.05, suggesting the benefits of the proposed method.
Thank you for your thoughtful feedback. We sincerely appreciate your recognition of the improvements made and the statistical significance of our results. Your comments are valuable to us and help strengthen the contribution of our work.
This paper introduces Audio-Visual Contrastive Decoding (AVCD), a training-free decoding framework designed to mitigate hallucinations in Audio-Visual Large Language Models (AV-LLMs). The core contributions of AVCD are threefold:
- Dominance-aware Attentive Masking: Instead of perturbing a fixed modality, AVCD dynamically identifies less dominant modalities using attention distributions and applies attentive masking to generate perturbed logits for contrast.
- Trimodal Contrastive Formulation: The paper reformulates the conventional contrastive decoding (CD) framework to explicitly handle trimodal inputs (audio, visual, textual).
- Entropy-Guided Adaptive Decoding (EAD): To improve inference efficiency, AVCD selectively skips CD steps for high-confidence predictions based on the entropy of the model’s output distribution.
The authors demonstrate through experiments on various MLLMs (AV-LLMs, video-LLMs, and image-LLMs in supplementary) that AVCD consistently outperforms existing decoding methods and base models in reducing hallucinations and improving accuracy.
Strengths and Weaknesses
Strengths:
- The paper addresses a critical and timely problem: hallucination in AV-LLMs, which is more complex than in VLMs due to trimodal interactions.
- The qualitative examples (Fig. 1, Fig. 5) help effectively illustrate AVCD’s ability to correct multimodal hallucinations.
Weaknesses:
- The exact mechanism of “attentive masking” could be slightly more detailed in the main paper. For instance, are tokens completely zeroed out, or replaced with a special token? The paper states “mask out,” which implies suppression. The choice of P=50% for the threshold is justified by a reference to the supplement, which is acceptable.
- AVCD involves multiple forward passes (up to 4 for the full trimodal contrast based on Eq. 10: original, ¬v, ¬a, ¬v¬a). While EAD mitigates this, the worst-case overhead is notable. Figure 6 shows inference speed vs. accuracy, which is good, but a more direct comparison of average inference time (e.g., relative slowdown factor) for AVCD vs. Base and VCD on a key benchmark like AVHBench could be beneficial.
- It is unclear how hallucinations in AV-LLMs differ from those in VLMs. Also, it seems that AVCD could also be applied to general VLMs beyond the AV-LLM setting.
- MINOR consideration: The idea of using attention scores to guide perturbations is not entirely new (e.g., SID uses attention to find the least informative tokens). However, AVCD’s approach of identifying less dominant modalities and then masking their most informative tokens for trimodal CD is a distinct application.
Questions
- The paper introduces the hallucination problem in AV-LLMs but does not clearly define what constitutes hallucination in the audio-visual context. How does it differ from or relate to hallucinations in general multimodal models?
- Since the proposed method is based on attention mechanisms, the paper lacks an analysis of attention visualizations comparing the decoding process before and after applying AVCD.
- How do you justify that the attention distribution of the final query token is a valid and reliable measure for identifying modality dominance?
- The paper omits an important related work OPERA, which analyzed hallucination from attention sink. It is recommended that the authors include a comparison with OPERA in the experiments.
- Entropy-based decoding strategies are not novel, and have been used in prior works. What advantages does your method offer over existing entropy-guided approaches?
- In line 223, you mention a term that denotes corrupted audio states, but you do not seem to explain how the corruption operation is performed, which is a key step in AVCD. Could you provide more information about this?
- The paper states tokens in non-dominant modalities exceeding a top P% attention threshold are “masked out.” Could you clarify if this means their embeddings/features are zeroed out, or replaced with a generic [MASK] embedding?
- Equation 10 implies up to four forward passes for a single token generation step in the worst-case scenario (original, ¬v, ¬a, ¬v¬a). While Figure 6 shows the trade-off with EAD, could you provide a more direct measure of the average inference time slowdown (e.g., as a percentage or factor increase) for AVCD compared to Base decoding on a challenging benchmark like AVHBench, perhaps at the optimal τ (e.g., τ=0.6)? How much of the computational graph can be shared across these multiple forward passes for different masked inputs, or are they largely independent computations through the LLM decoder?
Limitations
Yes
Final Justification
I think this paper still has room for much improvement, including experiments and writing.
Paper Formatting Concerns
No
We appreciate the reviewer's insightful comments. We were able to add crucial details that enhance the clarity of our work. Based on the comments, we have:
(1) elaborated on the attentive masking mechanism,
(2) provided details of the inference speed experiments,
(3) clarified how AV-LLMs suffer from hallucination,
(4) highlighted differences from the SID model,
(5) demonstrated the reliability of attention via Attention-Guided Masking,
(6) explained the motivation for using the final query token,
(7) compared performance against a non-CD method (OPERA),
(8) clarified how our entropy-based decoding differs from existing methods.
Our detailed responses are provided below.
[W1 & Q6 & Q7] Could you clarify how "masking out" and "corrupted states" are implemented in your method?
We apologize for the confusion. In our paper, terms such as “corrupted” or “masked out” consistently refer to the same operation: zeroing out the corresponding embeddings/features to reduce their influence during decoding.
For the corruption operation, we first compute attention values across all layers to derive a global attention distribution, which determines both the dominance score and the masking threshold. We then identify tokens that exceed this threshold and apply zero-out masking to those associated with less dominant modalities in the attention weights.
We agree that this clarification is important and will explicitly state this process in the final version of the paper.
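For concreteness, a minimal sketch of this zero-out masking step is shown below. The tensor shapes, the `modality_slices` layout, and the helper name are hypothetical; this is an illustration of the operation described above, not the authors' actual implementation.

```python
import torch

def attentive_zero_out(embeddings: torch.Tensor,
                       attn_scores: torch.Tensor,
                       modality_slices: dict,
                       non_dominant: list,
                       top_p: float = 0.5) -> torch.Tensor:
    """Zero out the most-attended tokens of the less dominant modalities.

    embeddings:      [seq_len, dim]  input token embeddings/features
    attn_scores:     [seq_len]       global attention mass per token
                                     (e.g., averaged over layers and heads)
    modality_slices: e.g. {"audio": slice(0, 100), "video": slice(100, 500),
                           "text": slice(500, 520)}  (hypothetical layout)
    non_dominant:    modalities selected for masking, e.g. ["audio", "video"]
    top_p:           fraction of tokens to mask within each selected modality
    """
    corrupted = embeddings.clone()
    for m in non_dominant:
        sl = modality_slices[m]
        scores = attn_scores[sl]
        k = max(1, int(top_p * scores.numel()))
        # Indices of the top-P% most attended tokens within this modality.
        top_idx = torch.topk(scores, k).indices + sl.start
        corrupted[top_idx] = 0.0  # zero out their embeddings/features
    return corrupted
```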
[W2 & Q8] Have you conducted a direct speed comparison on AVHBench? Additionally, can the multiple forward passes for masked inputs share parts of the computation graph, or are they executed independently?
In Figure 6, we measured inference time on the AVHBench dataset using the VideoLLaMA2 model. Specifically, we averaged the response time over 100 samples to compute the results. As suggested by the reviewer, we will include the inference speed scale in the final version, as shown in the table below.
| Decoding | Threshold (τ) | Relative Latency (s/token) | Accuracy (%) |
|---|---|---|---|
| Base | - | 1 (1.75) | 78.05 |
| VCD | - | 1.4 (2.50) | 62.44 |
| AVCD | 0.8 | 1.3 (2.25) | 80.98 |
| AVCD | 0.6 | 1.8 (3.14) | 81.95 |
While base contrastive decoding typically requires up to four separate full forward passes in AV-LLMs, our Entropy-Guided Adaptive Decoding (EAD) effectively reduces unnecessary passes by leveraging entropy-based filtering.
Although, like other existing CD methods, our implementation does not employ shared computation across passes, AVCD still achieves notable improvements. For example, with τ = 0.8, AVCD not only surpasses the Base method in accuracy but also runs faster than VCD, which generally requires two separate decoding passes.
[W3 & Q1] Could you clarify how hallucination is defined in the audio-visual context, and how it relates to hallucinations in other multimodal models?
Hallucination in AV-LLMs differs fundamentally from that in VLMs due to the presence of three modalities (audio, vision, and language), as opposed to two in VLMs. In AV-LLMs, hallucinations often arise when the model imagines non-existent audio from visual input or visual content from audio input, as noted in AVHBench [1]. This is further elaborated by CMM [2], which attributes such hallucinations to over-reliance on unimodal priors and spurious inter-modality correlations.
This tri-modal structure introduces 7 types of modality interactions (3: unimodal, 3: bimodal, 1: trimodal), making hallucination more complex than in VLMs.
Our proposed AVCD framework is designed specifically to handle this tri-modal challenge by applying modality-aware CD. Despite this challenge, AVCD shows strong generalizability to video-LLMs and image-LLMs, outperforming prior CD methods (see Table 2, Supp. Sec. C, and Supp. Sec. F).
In particular, Table 2 shows that AVCD brings significant performance gains over the base model even on video-LLM settings, highlighting its broad applicability beyond the tri-modal setup.
[Reference]
[1] AVHBench: A Cross-Modal Hallucination Benchmark for Audio-Visual Large Language Models, ICLR 2025.
[2] The Curse of Multi-Modalities: Evaluating Hallucinations of Large Multimodal Models across Language, Visual, and Audio, arXiv, 2024.
[W4] Could you clarify how your method differs from SID?
While we acknowledge that attention-guided perturbations have been explored in prior work (e.g., SID), our contribution lies in extending this idea to the tri-modal setting of AV-LLMs. Specifically, AVCD adaptively identifies less dominant modalities and selectively masks their most informative tokens, introducing a novel application of CD tailored to multimodal interactions.
To our best knowledge, this is the first systematic exploration of CD in AV-LLMs, and we further demonstrate its generalizability across AV-LLMs, video-LLMs, and image-LLMs, underscoring its broad applicability beyond the original trimodal context.
[Q2] Have you analyzed attention visualizations before and after applying AVCD?
While we cannot provide explicit attention maps, we conducted an indirect yet rigorous evaluation to verify whether the model's attention captures semantically meaningful auditory-visual information.
Specifically, we compared two masking strategies during decoding: a random zero-out masking strategy, which randomly masks 50% of tokens, and our attentive zero-out masking strategy, which masks the top 50% most attended tokens as identified by the attention scores.
| Method | Accuracy (%) |
|---|---|
| Base | 78.05 |
| Random Masking | 71.71 |
| Attention-Guided Masking (Ours) | 81.95 |
As shown in Table above, random masking significantly degrades decoding performance, while our proposed attentive masking improves performance by 3.9% over the base decoding.
These results suggest that the attention mechanism in AVCD is indeed focusing on informative and modality-relevant tokens.
[Q3] Is it appropriate to use the last query token's attention distribution for identifying dominant modality?
We use the attention distribution from the final query token because it is directly responsible for next-token prediction in autoregressive decoding. This design choice is supported by prior works [3,4], where attention from the final token is commonly used to interpret token-level importance in LLMs.
To validate this decision, we compare our strategy against an alternative approach that averages attention across all query tokens. As shown in the table below, using final query token achieves higher accuracy, indicating that the final token provides a more reliable signal for analyzing modality dominance during decoding.
| Method | Accuracy (%) |
|---|---|
| average query token | 80.00 |
| final query token (Ours) | 81.95 |
[Reference]
[3] Self-Introspective Decoding: Alleviating Hallucinations for Large Vision-Language Models, ICLR 2025.
[4] Hierarchical Context Merging: Better Long Context Understanding for Pre-trained LLMs, ICLR 2024.
[Q4] Could you provide a comparison with OPERA to demonstrate the effectiveness of AVCD?
We attempted to apply OPERA [5] to VideoLLaMA2; however, due to its beam search with multiple attention rollbacks, we encountered out-of-memory (OOM) issues when handling its long input sequences (2K+ tokens). To circumvent this, we evaluated OPERA on a lighter AV-LLM, video-SALMONN, which processes only a few hundred tokens, to assess its feasibility in the AV domain.
As shown in the table below, both OPERA and VCD led to noticeable performance degradation on AVHBench and AVHBench-Captioning. In contrast, our trimodal extension of VCD (VCD*) improved over Base decoding, though it still lagged behind AVCD.
| Decoding | AVHBench | AVHBench_cap |
|---|---|---|
| Base | 60.00 | 1.94±0.05 |
| OPERA | 56.59 | 1.73±0.09 |
| VCD | 59.51 | 1.65±0.05 |
| VCD* | 65.85 | 2.18±0.07 |
| AVCD (Ours) | 66.83 | 2.28±0.02 |
Overall, AVCD achieves the most significant performance gain over Base across both benchmarks.
[Reference]
[5] OPERA: Alleviating Hallucination in Multi-Modal Large Language Models via Over-Trust Penalty and Retrospection-Allocation, CVPR 2024.
[Q5] How does your entropy-based decoding differ from existing approaches?
While [6] utilizes entropy to set a hyperparameter (e.g., α) that controls the degree of contrasting in CD, simply adapting this approach requires up to four full forward passes—making it impractical for AV-LLMs.
In contrast, we treat entropy as a threshold signal to decide whether to apply CD at all, rather than how strongly to contrast. This design choice enables us to maintain reasonable inference speed even in the tri-modal setting. We believe this approach is more suitable for real-world deployment scenarios, where efficiency and scalability are critical.
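A minimal sketch of this entropy gate is given below, assuming a generic `contrastive_fn` callback that would run the additional masked forward passes; the threshold value and interface are illustrative rather than the paper's actual code.

```python
import torch
import torch.nn.functional as F

def entropy_gated_step(logits_orig: torch.Tensor,
                       contrastive_fn,
                       tau: float = 0.6) -> torch.Tensor:
    """Apply contrastive decoding only when the model is uncertain.

    logits_orig:    [vocab] logits from the ordinary forward pass.
    contrastive_fn: callable that runs the extra masked passes and returns
                    contrast-adjusted logits (only invoked when needed).
    tau:            entropy threshold; below it, CD is skipped entirely.
    """
    probs = F.softmax(logits_orig, dim=-1)
    entropy = -(probs * torch.log(probs + 1e-12)).sum()
    if entropy < tau:            # confident prediction: skip the extra passes
        return logits_orig
    return contrastive_fn()      # uncertain: pay for the contrastive passes
```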
[Reference]
[6] Adaptive Contrastive Decoding in Retrieval-Augmented Generation for Handling Noisy Contexts, EMNLP 2024.
Dear Reviewer Nvqo,
Thank you again for your time and efforts in reviewing our paper.
We just wanted to kindly check whether our responses have addressed your questions and concerns. We truly appreciate your feedback, and would be happy to clarify anything further if needed.
Best regards,
Authors
As a follow-up to your comment on inference speed, we would like to provide additional results for the FlashAttention setup.
Incorporating FlashAttention provides a more realistic estimate of practical deployment costs. With FlashAttention enabled, the per-token decoding time of the Base method improves from 1.75 s/token to approximately 1.30 s/token.
We will update the paper to reflect this for more detailed comparison.
Hi authors, I have gone through your rebuttal for my review and others. Please ensure all clarifications updated if accepted. I will raise my rating as my concerns are solved.
Dear Reviewer Nvqo,
Thank you very much for your thoughtful follow-up. We sincerely appreciate your decision to raise the rating, and we will ensure that all clarifications are properly reflected in the final version of the paper.
Sincerely,
The Authors
The paper presents a significant advancement in hallucination mitigation for AV-LLMs with a well-designed, efficient, and empirically validated approach. However, its impact could be enhanced by addressing limitations, expanding evaluations, clarifying content, and exploring practical and ethical implications.
Strengths and Weaknesses
Strengths:
- Innovative Framework: Introduces Audio-Visual Contrastive Decoding (AVCD), a novel, training-free method to mitigate hallucinations in Audio-Visual Large Language Models (AV-LLMs), extending contrastive decoding to trimodal settings (audio, video, language).
- Adaptive and Efficient: Uses attention-based modality dominance detection and entropy-guided decoding to balance multimodal reasoning and improve inference speed.
- Robust Validation: Demonstrates consistent performance improvements (e.g., 6% and 11% accuracy gains on VideoLLaMA2 and video-SALMONN) across datasets like AVHBench, with statistical rigor.
- Theoretical Grounding: Provides mathematical support through KL divergence and Taylor expansion, with open-access code promised for reproducibility.
- Comprehensive Context: Builds on extensive prior work in vision-language models and hallucination mitigation.
Weaknesses:
- Limited Scope: Validation is restricted to specific models and datasets, potentially limiting generalizability.
- Truncated Content: OCR errors and missing sections obscure critical details, such as derivations and full experimental setups.
- Lack of Limitations Discussion: Does not explicitly address AVCD's limitations or potential failure cases.
- Dependence on Attention: Relies on attention mechanisms, which may be unreliable if biased.
- Missed Opportunities: Lacks discussion on non-CD method comparisons, ethical implications, and real-world applications.
Questions
Please provide a detailed introduction to the datasets used in the experiments. For example, what are the total numbers of samples in the three datasets?
Limitations
- In Formula 1, how is the softmax function utilized? Please explain its working mechanism and potential impacts.
- What does the term "damaged and biased visual information" (such as x¬v) mentioned in the formula refer to? To what extent is it considered, and does it reflect noise or bias in the data collection process?
Final Justification
Thank you for your comprehensive responses. They have addressed my main concerns, and I maintain my recommendation score of 4 points.
Paper Formatting Concerns
No
Thanks to the reviewer’s insightful comments, we were able to add detailed information that enhances reader understanding. Based on the comments, we have:
(1) re-emphasized the broad validation scope,
(2) referred to the sections describing derivations and experimental setups,
(3) supplemented the discussion with failure cases,
(4) added experiments to verify attention reliability,
(5) included comparisons with non-CD methods and added explanations on ethical considerations and potential real-world applications,
(6) provided more detailed descriptions of the datasets used,
(7) explained the role of the softmax function in Eq. (1),
(8) supplemented the explanation on modality distortion.
Our detailed responses are provided below.
[W1] Limited validation scope with respect to models and datasets.
We would like to clarify that AVCD is evaluated on four multimodal large language models (VideoLLaMA2, video-SALMONN, Video-LLaVA, and LLaVA-1.5) across five datasets (MUSIC-AVQA, AVHBench, MSVD-QA, ActivityNet-QA, and POPE). These evaluations cover a diverse range of tasks and modalities, demonstrating the robustness and generalizability of our approach.
[W2] Insufficient details on derivations and experimental setups.
We would like to clarify that our paper does not involve OCR in any part of the method or evaluation. Additionally, we provide detailed information about the datasets, evaluation protocols, baselines, and implementation details in Sec. 4.1 of the main paper. The complete mathematical derivation is also included in Supp. Sec. A.
[W3] Absence of discussion on limitations and failure cases.
We would like to note that we include a discussion of our method’s limitations in Supp. Sec. H. To further support readers' understanding, we will include additional failure cases in the final version of the paper.
[W4] Concerns on attention reliability.
We understand the reviewer's concern regarding the reliability of attention. To address this, we conducted additional experiments on the AVHBench dataset using the VideoLLaMA2 model, comparing attention-guided masking with random masking strategies.
| Method | Accuracy (%) |
|---|---|
| Base | 78.05 |
| Random Masking | 71.71 |
| Attention-Guided Masking (Ours) | 81.95 |
When random masking is applied, performance drops significantly to 71.71%, compared to the Base decoding at 78.05%. In contrast, our proposed attention-guided masking selectively distorts the modality that truly affects the model output, resulting in improved performance of 81.95%. These results indicate that attention-based masking effectively identifies influential content and contributes to reducing hallucinations.
Moreover, attention mechanisms have been extensively studied in prior works [1,2] for identifying important tokens in LLMs, further supporting their reliability.
[Reference]
[1] Self-Introspective Decoding: Alleviating Hallucinations for Large Vision-Language Models, ICLR 2025.
[2] Hierarchical Context Merging: Better Long Context Understanding for Pre-trained LLMs, ICLR 2024.
[W5] Limited discussion on non-CD methods, ethical considerations, and potential real-world applications.
• Non-CD method comparisons
To address comparisons with non-CD methods, we include experiments with the OPERA [3] method as a representative baseline using video-SALMONN model.
| Decoding | AVHBench | AVHBench_cap |
|---|---|---|
| Base | 60.00 | 1.94 ± 0.05 |
| OPERA | 56.59 | 1.73 ± 0.09 |
| VCD | 59.51 | 1.65 ± 0.05 |
| VCD* | 65.85 | 2.18 ± 0.07 |
| AVCD (Ours) | 66.83 | 2.28 ± 0.02 |
Our results show that OPERA underperforms in the AV-LLM setting, with lower scores on both AVHBench and AVHBench_cap. In contrast, our method achieves the best performance, demonstrating its effectiveness in handling complex audio-visual inputs and reducing hallucinations where OPERA struggles.
• Ethical implications
We acknowledge that CD can produce more persuasive and coherent text, which raises ethical concerns. Specifically, it may be misused to generate fake news, misinformation, or manipulated content by making false information appear more credible. Responsible deployment and mitigation strategies should be considered in future work.
• Real-world applications
Our proposed method is model-agnostic and can be applied broadly in any environment using AV-LLMs to reduce hallucinations. Considering the trade-off between model performance and inference time, users can adapt the method flexibly according to their application needs, making it practical for real-world deployment.
[Reference]
[3] OPERA: Alleviating Hallucination in Multi-Modal Large Language Models via Over-Trust Penalty and Retrospection-Allocation, CVPR 2024.
[Q1] What are the total numbers of samples in the three datasets?
Thank you for your question. Our evaluation covers diverse test sets across different model types. For AV-LLMs, we use AVQA (9,185 pairs), AVHBench (5,302 pairs), and AVHBench-Captioning (1,000 pairs). For Video-LLMs, we evaluate on MSVD-QA (1,000 pairs) and ActivityNet-QA (6,760 pairs). Lastly, for image-LLMs, we use the POPE dataset, which contains 9,000 samples across three categories: Random, Popular, and Adversarial (3,000 each). We will include detailed dataset descriptions in the revised paper.
[L1] What is the purpose of applying softmax in Eq. (1)?
In Eq. (1), the softmax function is used in the standard way to convert the LLM’s output logits into a probability distribution over the vocabulary. This is an essential step for enabling word prediction, as it allows the model to assign interpretable likelihoods to each possible token.
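Concretely, for the logit vector $z$ produced at decoding step $t$, the softmax is the standard normalization (the notation here is ours and may differ from the paper's):

```latex
p(y_t = w \mid x_a, x_v, x_\ell, y_{<t}) \;=\; \frac{\exp(z_w)}{\sum_{v \in \mathcal{V}} \exp(z_v)}
```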
[L2] What does the term "damaged and biased visual information" refer to?
We would like to clarify that the phrase “damaged and biased visual information” specifically refers to intentionally corrupted inputs introduced during inference, such as noise addition, data augmentation, or masking, as mentioned in L144 of the main paper. It is important to note that this does not refer to any issues arising from data collection.
Dear Reviewer BYoK,
Thank you again for your time and efforts in reviewing our paper.
We just wanted to kindly check whether our responses have addressed your questions and concerns. We truly appreciate your feedback, and would be happy to clarify anything further if needed.
Best regards,
Authors
This paper introduces Audio-Visual Contrastive Decoding (AVCD), a novel, training-free decoding framework designed to mitigate hallucinations in audio-visual large language models (AV-LLMs). The authors suggest that existing contrastive decoding (CD) methods, primarily developed for Vision-Language Models (VLMs), are ill-suited for more complex trimodal interactions that cause hallucinations in AV-LLMs.
- Trimodal Contrastive Decoding Formulation: The paper extends the conventional CD framework to jointly handle audio, visual, and textual inputs. It provides a mathematical reformulation that allows for contrasting outputs based on perturbations across multiple modalities simultaneously.
- Dominance-Aware Attentive Masking: Instead of perturbing a fixed modality, AVCD dynamically identifies the less dominant modalities at each decoding step by analyzing the model's internal attention distributions. It then applies an "attentive masking" strategy to these less influential modalities to generate the perturbed logits for contrastive decoding. This approach avoids introducing noise from direct input distortion.
- Entropy-Guided Adaptive Decoding (EAD): To improve inference efficiency, AVCD calculates the entropy of the model's initial prediction; if the entropy is low (indicating high confidence), the more computationally expensive contrastive steps proposed are skipped. This balances the performance gains of CD with computational overhead.
Experimental Results
The method was evaluated on several MLLMs, including AV-LLMs (VideoLLaMA2, video-SALMONN), video-LLMs, and image-LLMs. Key findings include:
- On the AVHBench hallucination benchmark, AVCD improved accuracy by 6% for VideoLLaMA2 and 11% for video-SALMONN compared to the baseline decoding method.
- AVCD consistently outperformed both baseline decoding and an extended version of the VCD method across multiple datasets, including MUSIC-AVQA, MSVD-QA, and ActivityNet-QA.
- Ablation studies evaluate the core components. The adaptive modality recognition strategy is compared to fixed-modality approaches, with some gains. The trade-off between accuracy and inference speed from the EAD mechanism is shown.
- The proposed method outperforms VCD on image-LLM benchmarks for which VCD was originally designed.
Strengths and Weaknesses
Strengths
Interesting Topic
The submission tackles the complex issue of hallucinations in trimodal audio-visual large language models (AV-LLMs), a challenging area that has received less attention than the more popular vision-language models (VLMs).
Extends existing work
The submission presents some mathematical extensions of previous CD frameworks.
Experiments
The proposed method is compared to a number of previous comparable works and demonstrates some large and some more marginal gains.
Weaknesses
Justification of the Mathematical Approximation
A core component of the proposed work is the mathematical approximation presented in Equation (7), which simplifies the combination of logit distributions from original and corrupted signals.
While this reformulation is helpful, a potential weakness arises from its justification. The authors validate this approximation by demonstrating that their "attentive masking" strategy yields a small Kullback-Leibler (KL) divergence between the logit distributions of the original and corrupted signals (Figure 4). This low divergence is presented as a strength, indicating that the method introduces less distortion than prior techniques and so the approximation is valid.
However, this justification creates an apparent contradiction with the fundamental premise of contrastive decoding.
- The core issue: The method's mathematical validity rests on the assumption that the two distributions are similar (low KL divergence). Yet, the method's efficacy is presumed to come from the contrast between them. If the distributions are similar enough to validate the approximation, it is unclear whether there is sufficient difference between them to provide a meaningful contrastive signal that can effectively steer the model away from hallucinations. The two conditions seem mutually exclusive.
This concern seems particularly amplified when considering individual token predictions.
- Local vs. Global Accuracy: The KL divergence is a measure of global similarity across the entire vocabulary distribution. It is plausible that for a small but critical subset of tokens—for instance, a factually correct word versus a likely hallucination—the probabilities from the original and corrupted signals diverge significantly. In these high-impact cases, the Taylor approximation would be locally inaccurate, introducing error precisely where the contrastive mechanism is most needed. The main paper does not analyze or account for the error introduced by the approximation in these specific, high-divergence scenarios.
Computational Efficiency and Practical Implementation
The proposed method is reliant on materializing the full attention matrix to identify the "dominant modality" at each decoding step, when AVCD is triggered. This design choice has two primary negative consequences:
- Incompatibility with Modern Attention Mechanisms: The requirement to inspect the numerical attention scores is fundamentally incompatible with state-of-the-art fused attention kernels like FlashAttention. These optimizations achieve significant improvements in speed and memory efficiency precisely by avoiding the materialization of the N x N attention matrix. Forcing a model to use a standard, non-optimized attention implementation to enable AVCD would negate these crucial performance gains, making the model inherently slower and more memory-intensive than a baseline that can leverage these common optimizations.
- An Efficiency Paradox in the Adaptive Strategy: The authors introduce Entropy-Guided Adaptive Decoding (EAD) to reduce computational costs by selectively applying the expensive multi-pass AVCD only when the model's confidence is low. However, this creates an efficiency paradox. It would seem that the model incurs the high cost of materializing the attention matrix on every single token, even on high-confidence tokens where the subsequent AVCD step is ultimately skipped, unless the authors have done some inference management to only call the slow un-optimized kernel when AVCD is called.
Moreover, it appears the inference speed of Base shown in Figure 6 uses the same attention kernel as AVCD, given that Base and AVCD are shown with the same inference speed and performance. Any realistic implementation of "Base" would use FlashAttention or another optimized kernel during inference.
Unifying the Mathematical Framework and Implementation
On reading Section 3.2, I find the method the authors outline confusing. The core of the issue stems from a disconnect between the formula and the descriptive text:
- The General Formula: Equation (10) presents a complete trimodal framework where the final logit is adjusted by multiple separate contrastive terms: one for a corrupted audio signal, one for a corrupted video signal, and one for both signals being corrupted. This mathematical representation implies that the model's output is shaped by contrastive pressures from both modalities simultaneously.
- The Described Implementation: The "Attentive Masking Strategy" (Section 3.3) details an adaptive process where, at each step, the model identifies and masks specific tokens in identified "less dominant" modalities. This description suggests that language tokens could be determined to be a less dominant modality, in which case Equation (10) does not represent this possibility.
Questions
- Could you clarify the apparent tension in requiring a low KL divergence to validate the approximation in Equation 7, while also needing a sufficiently large difference between distributions to provide a meaningful contrastive signal, particularly for specific tokens?
- On Computational Efficiency: The need to materialize the attention matrix precludes the use of optimizations like FlashAttention. Could you comment on the practical trade-offs in inference speed and how the "Entropy-Guided" mechanism mitigates this? Note that the need for the costly attention matrix materialization is only known once the forward pass for the original model has been done.
- Equation 10 suggests a contrast against both audio and video modalities, while the text describes masking only adaptively identified less dominant modalities. Could you explicitly clarify how the definite formula is reconciled with the adaptive modality selection described?
Limitations
Yes
Final Justification
I thank the authors for their responses. They have largely addressed my concerns well, and no additional concerns (on my side) have been raised during the discussion period. I raise my score.
Paper Formatting Concerns
No concerns
We sincerely thank the reviewer for the thorough and insightful feedback. The reviewer’s detailed critique greatly helped us deepen our explanation regarding KL divergence and improve the clarity of the paper. Based on the comments, we have:
(1) provided an analysis of the relationship between low KL divergence and the strength of the contrastive signal,
(2) explained how our implementation captures local token-level changes,
(3) compared our approach to scenarios using FlashAttention,
(4) revised the notation to a more general form reflecting the adaptive modality selection.
Our detailed responses are provided below.
[W1.1 & Q1] Can low KL divergence still lead to sufficient contrastive effectiveness?
Contrary to the reviewer's concern, low KL divergence does not mean the contrastive signal is ineffective or that hallucinations are absent. To support this, we applied AVCD and VCD on the AVHBench test set using VideoLLaMA2. AVCD changed 13.1% of answers (8.5% improved, 4.6% degraded), while VCD changed only 7.4% (4.4% improved, 3.0% degraded), despite VCD showing higher KL divergence. This demonstrates that semantically coherent perturbations from attention-based masking (used in AVCD) provide a more effective contrastive signal than simply increasing KL. Therefore, the two conditions, low divergence and effective contrast, are not mutually exclusive.
[W1.2 & Q1] Can KL divergence effectively capture local token-level changes that are critical to the final answer?
Thank you for raising this important point. We clarify our use of KL divergence in the proposed framework.
In our implementation, KL is computed after applying the Adaptive Plausibility Constraint, as detailed in Supp. Sec. B. This constraint, commonly used in the CD literature, ensures that the final output is restricted to tokens with sufficient confidence in the original logit distribution.
As a result, the KL divergence is calculated only over the top-K logits that are most likely to influence output generation, addressing the concern that KL fails to capture local behavior.
We will revise the main paper to explicitly describe this procedure.
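As a rough illustration of this restricted KL computation, the sketch below filters the vocabulary with a plausibility-style threshold before computing the divergence. The exact cutoff rule (top-K versus a threshold β) is the one specified in Supp. Sec. B, so the β-based filter here is an assumption made for illustration only.

```python
import torch
import torch.nn.functional as F

def restricted_kl(logits_orig: torch.Tensor,
                  logits_masked: torch.Tensor,
                  beta: float = 0.1) -> torch.Tensor:
    """KL divergence restricted to tokens passing a plausibility-style filter.

    Keeps only tokens whose original probability is at least beta times the
    maximum probability (an illustrative stand-in for the constraint in
    Supp. Sec. B), renormalizes both distributions on that support, and
    returns KL(p_orig || p_masked).
    """
    p = F.softmax(logits_orig, dim=-1)
    q = F.softmax(logits_masked, dim=-1)
    keep = p >= beta * p.max()
    p_r = p[keep] / p[keep].sum()
    q_r = q[keep] / q[keep].sum()
    return (p_r * (p_r / (q_r + 1e-12)).log()).sum()
```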
[W2 & Q2] Is AVCD computationally efficient and practically applicable, for example in settings that leverage FlashAttention?
We acknowledge the reviewer’s valid concern regarding efficiency. Indeed, our current AVCD implementation is incompatible with FlashAttention. To clarify the efficiency gap, we provide the following inference speed comparison (lower is faster):
| Decoding | Threshold (τ) | Relative Latency (s/token) | Accuracy (%) |
|---|---|---|---|
| Base w/ Flashattn | - | 1 (1.30) | 78.05 |
| Base | - | 1.3 (1.75) | 78.05 |
| VCD | - | 1.9 (2.50) | 62.44 |
| AVCD | 0.8 | 1.7 (2.25) | 80.98 |
| AVCD | 0.6 | 2.4 (3.14) | 81.95 |
While FlashAttention is not currently applicable due to the need to access attention scores, our goal is fundamentally different: this paper is the first to design a contrastive decoding (CD) strategy tailored for AV-LLMs. Existing CD methods designed for unimodal or vision-language models often focus on a fixed modality, so simply extending them leads to suboptimal or unsuitable behavior in AV settings. Our work identifies this issue and proposes a solution via modality-aware masking and a new, mathematically extended CD formulation for three modalities.
Moreover, instead of naively applying CD across all tokens with multiple forward passes, we introduce Entropy-Guided Adaptive Decoding (EAD) to selectively apply AVCD only when necessary, reducing the computational overhead.
Although FlashAttention cannot be used at this stage, we are optimistic. Recent work [1] on optimizing CD for LLMs demonstrates that more efficient CD implementations are possible, and we believe future work can extend these techniques to AV-LLMs once modality-aware CD becomes more mature. Our contribution lays the foundation for this line of research, and we ask for your generous understanding of the current trade-off in efficiency in light of the broader methodological contribution.
[Reference]
[1] Fast Large Language Model Collaborative Decoding via Speculation, ICML 2025.
[W3 & Q3] On the generality of the framework.
Thank you for pointing out this confusion. We apologize for the lack of clarity. In the final version, we will ensure that Equation (10) reflects the generalized case, allowing any modality to be adaptively masked based on its dominance score.
Dear Reviewer seE2,
Thank you again for your time and efforts in reviewing our paper.
We just wanted to kindly check whether our responses have addressed your questions and concerns. We truly appreciate your feedback, and would be happy to clarify anything further if needed.
Best regards,
Authors
Questions relating to the mathematical framework still apply after the authors' response.
- Your response has not addressed the question I asked about the validity of the mathematical approximation you use. Even if you restrict the KL-divergence calculation to the top-k logits in the original distribution, this does not constrain the logit values of the secondary distribution to be close to those top-k logits; you could still have a large difference in per-token logits, and therefore the approximation you use would have a large error in these cases. The experimental result you have provided is an empirical justification amounting to "this works in practice", which is valid, but it isn't mathematically rigorous as the submission suggests. Once again, you are adjusting per-token logits, and so the logit difference needs to be small for the approximation to be valid. KL divergence does not tell a reader what the logit difference is for the tokens that would originally have been selected and those that are eventually selected. If your argument is really "it works empirically", you should say so.
- On the speed argument, I would say >2x inference cost is pretty substantial, but given this is an initial work in this direction it is a minor weakness, as other reviewers have pointed out. When authors have a baseline that does not require attention matrix materialization, they should use FlashAttn or whatever SotA attention method is applicable, as no one should run a base model without FlashAttn or similar if the attention matrix does not need to be materialized. So the AVCD method is really 70/140% slower than the baseline rather than 30/80% slower, as you have outlined in other reviews.
Dear Reviewer seE2,
We just wanted to kindly follow up to ask whether our previous responses have adequately addressed your concerns. If you have any remaining questions or suggestions, we would be more than happy to discuss them.
Thank you again for your time and consideration.
Sincerely,
The Authors
We sincerely thank the reviewer for the constructive and detailed feedback. Your comments helped us clarify the limitations of our current explanation and better position our empirical evidence.
[Q1] Validity of the Approximation
We understand the reviewer's concern that KL divergence does not directly justify the error approximation in Eq. (7), and we appreciate the opportunity to clarify this point. To address this, we additionally computed the exact approximation error described in Supp. A.1, using logits selected via the adaptive plausibility constraint.
This offers a more direct and rigorous quantification of the approximation error. We evaluated this error on 100 examples from the AVHBench dataset. As shown in the table below, AVCD consistently yields approximation errors smaller than or comparable to those of VCD across all modality masking settings. Furthermore, the absolute error values are small, providing strong empirical support for the validity of our approximation.
[Table] Approximation Error with Adaptive Plausibility Constraint
| Method | Video Masked | Audio Masked | Audio & Video Masked |
|---|:---:|:---:|:---:|
| VCD | 0.015 | 0.083 | 0.073 |
| AVCD (Ours) | 0.015 | 0.032 | 0.037 |
We will revise the final version of the paper to replace the KL divergence-based explanation with this analysis, which more directly supports the validity of the approximation. As such, we will further revise L224–225 to clarify that it is found to work well empirically, rather than for its theoretical implications.
[Q2] Inference Speed under FlashAttention
We fully agree that reporting inference speed under FlashAttention provides a more realistic estimate of practical deployment costs. Accordingly, we have shared FlashAttention-based inference-time comparisons in our response to another reviewer and will include these results in the final version of the paper to more accurately reflect the runtime overhead of AVCD in modern setups.
The paper tackles the problem that CD is not well suited to hallucination mitigation in AV-LLMs by proposing AVCD to suppress modality-induced hallucinations. It leverages attention distributions to identify less dominant modalities and generate perturbed outputs, and uses entropy-guided adaptive decoding to selectively skip unnecessary decoding steps. It is training-free and effective on multiple AV benchmarks.
Strengths and Weaknesses
Strengths:
- The method is training-free.
- Clear ablations on the dominant-modality selection and the masked modality.
- Proposes efficient inference via entropy-guided adaptive decoding and clearly shows the trade-off between accuracy and inference time.
- First CD framework that handles three modalities jointly (audio, video, language) instead of corrupting a single fixed channel.
Weaknesses:
- I believe the masking ablation needs a row where you perform completely random modality-agnostic or modality-specific masking. Or you can compare to previous works that fabricate weak context, such as noise, random corruption, adversarial inputs, etc.
- The attention mask potentially has a weak foundation: the attention may not reflect the 'real strong modality' if it is noisy or affected by certain prompt engineering. If the attention doesn't track true importance, then the entire framework will still hallucinate. It would be great if there were empirical evidence that the attention is mostly 'correct'.
- The speed is still 2x slower, but it shouldn't be a strong weakness.
Questions
Have you tried using a different masking probability P for different layers or even different models? Correct me if you have done so, because intuitively a global P can't be optimal for different layers.
Limitations
Yes
Final Justification
Thanks for the response. The authors have answered all my questions and I currently have no concerns. I am not worried about the scoping of 'what is hallucination' here, since general performance is still an indicator of hallucination levels. But it did seem that the scope of this paper could be slightly more general, focusing on how this method improves general performance. Otherwise, the authors can include more analysis on hallucination, which fortunately is included in other reviewers' rebuttals.
Paper Formatting Concerns
No concerns for paper formatting
We thank the reviewer for the valuable feedback. The comments helped us create a clearer and more understandable final version for readers. Based on the comments, we have:
(1) conducted comparison experiments between random masking and attention-guided masking,
(2) examined the reliability of analyses based on attention weights,
(3) re-emphasized our efforts to improve inference speed,
(4) supplemented the explanation that our masking strategy is based on global attention.
Our detailed responses are provided below.
[W1] Validation of attention-guided masking against random masking or noise injection.
Thank you for the insightful comments. In response, we conducted additional experiments to compare Attention-Guided Masking with Random Masking strategies.
| Method | Accuracy (%) |
|---|---|
| Base | 78.05 |
| Random Masking | 71.71 |
| Attention-Guided Masking (Ours) | 81.95 |
When we apply random masking, the performance (71.71%) drops significantly compared to the base decoding (78.05%). In contrast, our proposed attention-guided masking selectively distorts the modality that truly affects the model output, resulting in improved performance (81.95%).
Additionally, as shown in Table 1 of the main paper, AVCD consistently outperforms VCD, which perturbs the input by injecting noise, across all benchmarks. This suggests that simply corrupting the input with random noise is not as effective as selectively masking semantically meaningful tokens. Therefore, our method provides evidence that attention can serve as a useful signal for locating critical information, helping reduce hallucinations more reliably.
[W2] How reliable are the attention weights in identifying key modalities?
In Table 4, we validate the effectiveness of using attention weights to identify the dominant modality. Our adaptive strategy frequently selects “language” as the dominant modality, and this choice consistently yields better performance compared to manually assigning video or audio as dominant.
Moreover, attention weights have been extensively studied in prior works [1,2] for identifying important tokens in LLMs, further supporting their reliability in modality attribution.
[Reference]
[1] Self-Introspective Decoding: Alleviating Hallucinations for Large Vision-Language Models, ICLR 2025.
[2] Hierarchical Context Merging: Better Long Context Understanding for Pre-trained LLMs, ICLR 2024.
[W3] A minor concern regarding the 2× slower inference speed.
Thank you for recognizing that the speed gap is not a major weakness in the context of our overall contribution. We would like to emphasize that, despite handling three modalities in AV-LLMs, our method accelerates inference speed by incorporating an entropy-guided adaptive decoding strategy that effectively skips unnecessary steps. This mechanism helps reduce computational overhead and maintains a balance between richer multimodal reasoning and decoding efficiency.
[Q1] Have you considered varying the masking probability (P) across different layers or models?
We apologize for any confusion caused.
To clarify, we compute attention values across all layers and mask the top 50% of tokens based on the global distribution. As a result, the number of masked tokens varies per layer, effectively leading to a layer-wise adaptive masking scheme.
To validate this, we conducted a comparison against a setting where we mask the top 50% attention-value tokens separately within each layer. The results are shown below:
| Masking method | Accuracy (%) |
|---|---|
| w/o masking (Base) | 78.05 |
| Fixed - top 50% per layer | 79.02 |
| Adaptive - top 50% (Ours) | 81.95 |
While fixed masking is already more effective than Base decoding, our adaptive strategy guided by global attention achieves significantly better performance. This supports the idea that masking based on global attention better identifies influential tokens across the entire model.
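To make this comparison concrete, the sketch below contrasts the two selection rules under our reading of the procedure: per-layer attention scores are stacked, and the masking threshold is taken either within each layer or once over the global pool (so the per-layer mask count naturally varies). The tensor layout is an assumption for illustration, not the authors' implementation.

```python
import torch

def masked_entries(layer_scores: torch.Tensor,
                   top_p: float = 0.5,
                   per_layer: bool = False) -> torch.Tensor:
    """Boolean mask of (layer, token) entries selected for masking.

    layer_scores: [num_layers, seq_len] attention mass per token in each layer.
    per_layer=True  -> threshold within each layer (fixed count per layer).
    per_layer=False -> one global threshold over all entries, so the number
                       of masked tokens varies from layer to layer.
    """
    if per_layer:
        thresh = layer_scores.quantile(1 - top_p, dim=1, keepdim=True)
    else:
        thresh = layer_scores.quantile(1 - top_p)
    return layer_scores >= thresh
```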
Thanks for the response. The authors have answered all my questions and I currently have no concerns. I am not worried about the scoping of 'what is hallucination' here, since general performance is still an indicator of hallucination levels. But it did seem that the scope of this paper could be slightly more general, focusing on how this method improves general performance. Otherwise, the authors can include more analysis on hallucination, which fortunately is included in other reviewers' rebuttals.
Thank you for your kind response and for letting us know that your concerns have been addressed. We appreciate your suggestion, and we will make sure to include a deeper analysis of hallucination cases in future revisions to help broaden the scope of the work.
Hi Reviewers,
As the discussion deadline is approaching, could you please take a moment to acknowledge the rebuttal, revise your score if your opinion has changed, and post any follow-up comments or questions you may have?
Thanks for your time and contributions to the review process.
Best, AC
Dear reviewers and AC,
We sincerely appreciate the time and effort you dedicated to reviewing our manuscript.
As highlighted by the reviewers, our work has been recognized as novel (BYoK, qkTD), training-free (yejw, BYoK, qkTD), offering clear ablations and robust validation (yejw, seE2, BYoK), addressing a critical and timely problem (Nvqo), and efficient (BYoK, qkTD).
Your feedback has greatly improved our paper:
- yejw: Validated the proposed masking strategy and clarified the adaptive masking explanation.
- seE2: Refined the KL divergence explanation and added FlashAttention to the inference-time comparison for fairer evaluation.
- BYoK: Added comparisons with the non-CD method (OPERA) and provided clearer explanations.
- Nvqo: Clarified the masking-out operation, justified using the final query token, and emphasized differences from prior work.
- qkTD: Analyzed hallucination types mitigated by AVCD, evaluated the method on OmniBench for generalizability, and conducted ablations on the three components.
We are deeply grateful for all the valuable feedback and will incorporate these improvements faithfully into the final version of the paper.
Sincerely,
The Authors
This paper proposes AVCD, a novel, training-free framework to mitigate hallucinations in audio-visual large language models by reformulating contrastive decoding for trimodal inputs. The reviewers agreed that the paper addresses an important and timely challenge in AV-LLMs, and highlighted several key strengths: the method is training-free and therefore broadly applicable, it introduces a principled extension of contrastive decoding to three modalities, and it balances accuracy and efficiency through entropy-guided adaptive decoding. Experimental validation across multiple datasets and models shows consistent gains, with particularly strong improvements on AVHBench, reinforcing both the novelty and practical relevance of the approach. The ablation studies and additional experiments added during rebuttal further strengthened the technical contribution.
Some concerns remain. One reviewer raised questions about the theoretical justification of the mathematical approximation used in the framework, noting that the explanation relies on empirical rather than rigorous analysis. Others mentioned the reliance on attention scores, efficiency trade-offs, and the need for deeper analysis of failure cases and hallucination categories. While these are valid points, the authors provided substantial clarifications, additional experiments, and broader evaluations that satisfied most reviewers. Overall, the paper makes a strong methodological contribution to hallucination mitigation in multimodal LLMs.