PaperHub
Overall rating: 5.5/10 · Poster · 4 reviewers
Individual ratings: 6, 6, 5, 5 (min 5, max 6, std 0.5)
Confidence: 3.5 · Correctness: 2.3 · Contribution: 2.3 · Presentation: 2.5
ICLR 2025

Mitigating Modality Prior-Induced Hallucinations in Multimodal Large Language Models via Deciphering Attention Causality

Submitted: 2024-09-22 · Updated: 2025-04-02

Abstract

Keywords
Modality Priors, Multimodal Hallucinations, Counterfactual Reasoning, Deciphering Attention Causality

Reviews and Discussion

Review
Rating: 6

This paper introduces a causal inference framework named CausalMM. Considering the influence of visual and textual priors on the predictions of multimodal large language models (MLLMs), the authors employ backdoor adjustment and counterfactual reasoning to mitigate these priors’ effects. Specifically, they design various methods to perturb the attention layers in ViT and LLMs, such as randomization and reversal, to obtain the perturbed model predictions (counterfactual logits). They then use contrastive decoding to enhance the model’s predictive probability distribution. The method is a plug-and-play solution and demonstrates effectiveness on mainstream models and benchmarks.

Strengths

  1. This paper presents a simple yet effective method: perturbing the attention matrices in ViT or LLMs to generate counterfactual logits, and then using contrastive decoding to improve the model’s predictive probability distribution.
    • Flexibility and Generalizability. This method is plug-and-play and offers more flexibility compared to VCD, which obtains counterfactual logits by adding noise to images. Perturbing attention allows for greater design space and importantly, can be adapted to modules of different modalities, such as ViT and LLM.
    • Simplicity and Effectiveness. Unlike VCD and its variants that often rely on Adaptive Plausibility Constraints, the proposed method does not appear to require them.
  2. The experiments demonstrate the effectiveness of the method.

Weaknesses

  1. How was the hyperparameter (e.g., γ) determined? Were hyperparameters independently tuned for different benchmarks and model/method variants?
  2. The choice of base models is not comprehensive. The authors selected LLaVA-1.5 and Qwen2-VL as base models to validate their method, but introducing more models would help confirm the method’s universality. For instance, a table could be added to the appendix to show the method’s cross-model generalizability.
  3. Section 3.1 and Figure 2 are hard to follow. Section 3.1 introduces very complex causal relationships, but these seem not closely tied to the method. If I understand correctly, the proposed method is largely similar to VCD: it generates counterfactual logits by perturbation and improves predictions by removing this "background noise" from the normal logits, which does not necessitate such complex preliminary explanations. I would also like to mention that the P_effect discussed in Section 3.3 might be unnecessary.
  4. The equations are also hard to follow. For example, in Section 3.3, the core of the method involves logits processing, which is unrelated to softmax. Therefore, separating logits from the softmax equation and presenting them independently would be clearer and more concise.

Questions

See Weakness 1.

Comment

The choice of base models is not comprehensive. The authors selected LLaVA-1.5 and Qwen2-VL as base models to validate their method, but introducing more models would help confirm the method’s universality. For instance, a table could be added to the appendix to show the method’s cross-model generalizability. (part.2)

To demonstrate the effectiveness of our approach on multimodal large language models of different architectures, we added experimental results for the Q-Former-based InstructBLIP model and the embedding-autoregressive Chameleon model to the original experiments on the vision encoder–MLP–LLM paradigm. On the Chameleon model, our method consistently surpasses the baseline model and effectively suppresses the model's multimodal hallucinations.

InstructBLIP:

| Dataset | Setting | Method | Accuracy | Precision | Recall | F1 Score |
|---|---|---|---|---|---|---|
| MSCOCO | Random | Regular | 80.71 | 81.67 | 79.19 | 80.41 |
| | | VCD | 84.53 | 88.55 | 79.32 | 83.68 |
| | | Vision | 87.17 | 92.72 | 80.67 | 86.27 |
| | | Language | 86.90 | 94.89 | 78.00 | 85.62 |
| | | Multimodal | 87.90 | 94.59 | 80.40 | 86.92 |
| MSCOCO | Popular | Regular | 78.22 | 77.87 | 78.85 | 78.36 |
| | | VCD | 81.47 | 82.89 | 79.32 | 81.07 |
| | | Vision | 83.97 | 86.37 | 80.67 | 83.42 |
| | | Language | 83.53 | 87.71 | 78.00 | 82.57 |
| | | Multimodal | 84.90 | 88.35 | 80.40 | 84.19 |
| MSCOCO | Adversarial | Regular | 75.84 | 74.30 | 79.03 | 76.59 |
| | | VCD | 79.56 | 79.67 | 79.39 | 79.52 |
| | | Vision | 81.47 | 81.89 | 80.80 | 81.34 |
| | | Language | 82.00 | 84.73 | 78.07 | 81.26 |
| | | Multimodal | 82.43 | 83.71 | 80.53 | 82.09 |
| A-OKVQA | Random | Regular | 80.91 | 77.97 | 86.16 | 81.86 |
| | | VCD | 84.11 | 82.21 | 87.05 | 84.56 |
| | | Vision | 87.33 | 85.94 | 89.27 | 87.57 |
| | | Language | 87.87 | 87.72 | 88.07 | 87.89 |
| | | Multimodal | 88.47 | 87.86 | 89.27 | 88.56 |
| A-OKVQA | Popular | Regular | 76.19 | 72.16 | 85.28 | 78.17 |
| | | VCD | 79.78 | 76.00 | 87.05 | 81.15 |
| | | Vision | 81.07 | 76.69 | 89.27 | 82.50 |
| | | Language | 82.33 | 79.01 | 88.07 | 83.29 |
| | | Multimodal | 82.13 | 78.45 | 88.60 | 83.22 |
| A-OKVQA | Adversarial | Regular | 70.71 | 65.91 | 85.83 | 75.56 |
| | | VCD | 74.33 | 69.46 | 86.87 | 77.19 |
| | | Vision | 74.83 | 69.11 | 89.80 | 78.11 |
| | | Language | 76.27 | 71.07 | 88.60 | 78.87 |
| | | Multimodal | 75.97 | 70.51 | 89.27 | 78.79 |
| GQA | Random | Regular | 79.65 | 77.14 | 84.29 | 80.56 |
| | | VCD | 83.69 | 81.84 | 86.61 | 84.16 |
| | | Vision | 86.10 | 84.56 | 88.33 | 86.40 |
| | | Language | 86.67 | 86.86 | 86.40 | 86.63 |
| | | Multimodal | 87.23 | 86.67 | 88.00 | 87.33 |
| GQA | Popular | Regular | 73.87 | 69.63 | 84.69 | 76.42 |
| | | VCD | 78.57 | 74.62 | 86.61 | 80.17 |
| | | Vision | 77.77 | 72.92 | 88.33 | 79.89 |
| | | Language | 79.17 | 75.48 | 86.40 | 80.57 |
| | | Multimodal | 78.97 | 74.99 | 86.93 | 80.52 |
| GQA | Adversarial | Regular | 70.56 | 66.12 | 84.33 | 74.12 |
| | | VCD | 75.08 | 70.59 | 85.99 | 77.53 |
| | | Vision | 74.50 | 69.33 | 87.87 | 77.51 |
| | | Language | 76.30 | 71.81 | 86.60 | 78.51 |
| | | Multimodal | 75.83 | 71.19 | 86.80 | 78.22 |
Comment

We are deeply grateful for your recognition of our work's innovation and thoroughness, as well as your constructive feedback. We have addressed each suggestion regarding the manuscript's weaknesses and made the necessary revisions.

How was the hyperparameter (e.g., γ) determined? Were hyperparameters independently tuned for different benchmarks and model/method variants?

γ represents the degree of confidence in the treatment effect and is used to adjust the strength with which the modality prior is suppressed. We have added this explanation to the text. For different models, the hyperparameter settings that achieve optimal performance are similar.

The choice of base models is not comprehensive. The authors selected LLaVA-1.5 and Qwen2-VL as base models to validate their method, but introducing more models would help confirm the method’s universality. For instance, a table could be added to the appendix to show the method’s cross-model generalizability.

To demonstrate the effectiveness of our approach on multimodal large language models of different architectures, we added experimental results for the Q-Former-based InstructBLIP model and the embedding-autoregressive Chameleon model to the original experiments on the vision encoder–MLP–LLM paradigm. On the Chameleon model, our method consistently surpasses the baseline model and effectively suppresses the model's multimodal hallucinations.

Chameleon:

| Dataset | Setting | Method | Accuracy | Precision | Recall | F1 Score |
|---|---|---|---|---|---|---|
| MSCOCO | Random | Regular | 61.90 | 57.46 | 91.67 | 70.64 |
| | | Language | 69.23 | 63.17 | 92.27 | 74.99 |
| MSCOCO | Popular | Regular | 65.10 | 59.86 | 91.67 | 72.43 |
| | | Language | 69.43 | 63.34 | 92.27 | 75.12 |
| MSCOCO | Adversarial | Regular | 60.20 | 56.28 | 91.40 | 69.66 |
| | | Language | 64.00 | 58.94 | 92.33 | 71.95 |
| A-OKVQA | Random | Regular | 60.37 | 56.26 | 93.20 | 70.16 |
| | | Language | 65.70 | 60.14 | 93.13 | 73.08 |
| A-OKVQA | Popular | Regular | 57.30 | 54.25 | 93.20 | 68.58 |
| | | Language | 63.07 | 58.16 | 93.13 | 71.60 |
| A-OKVQA | Adversarial | Regular | 53.57 | 51.99 | 93.20 | 66.75 |
| | | Language | 56.83 | 53.96 | 93.13 | 68.33 |
| GQA | Random | Regular | 60.37 | 56.26 | 93.20 | 70.16 |
| | | Language | 68.43 | 62.18 | 94.13 | 74.89 |
| GQA | Popular | Regular | 59.37 | 55.76 | 90.67 | 69.05 |
| | | Language | 66.73 | 60.81 | 94.13 | 73.89 |
| GQA | Adversarial | Regular | 52.73 | 51.55 | 90.67 | 65.73 |
| | | Language | 57.77 | 54.50 | 94.13 | 69.03 |
Comment

Section 3.1 and Figure 2 are hard to follow. Section 3.1 introduces very complex causal relationships, but these seem not closely tied to the method. If I understand correctly, the proposed method is largely similar to VCD: it generates counterfactual logits by perturbation and improves predictions by removing this "background noise" from the normal logits, which does not necessitate such complex preliminary explanations. I would also like to mention that the P_effect discussed in Section 3.3 might be unnecessary.

We believe that detailed causal modeling can greatly help analyze the connections and independent effects of different factors in the model.

Specifically, VCD contrasts outputs derived from original and distorted image inputs. In contrast, CausalMM isolates the influence of modality priors and other confounders on multimodal attention using backdoor adjustment, obtains the positive treatment effect of attention on the output through counterfactual reasoning, adjusts the model's output at the attention and feature levels, and balances the modality priors. The two are fundamentally different at all levels.
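To make the contrast concrete, here is a minimal, hypothetical sketch (not the released CausalMM code) of how logits obtained under perturbed attention could be contrasted with the factual logits; `gamma` plays the role of the γ discussed elsewhere in this thread, and all tensors are toy stand-ins.

```python
# Hedged sketch: contrast factual logits with logits produced under
# counterfactual (perturbed) attention, then amplify the estimated
# positive treatment effect of attention on the output distribution.
import torch

def causal_adjusted_logits(factual_logits: torch.Tensor,
                           counterfactual_logits: torch.Tensor,
                           gamma: float = 1.0) -> torch.Tensor:
    # gamma controls how strongly the modality prior is suppressed.
    treatment_effect = factual_logits - counterfactual_logits
    return factual_logits + gamma * treatment_effect

# Toy usage over a 5-token vocabulary.
factual = torch.tensor([2.0, 1.0, 0.5, -1.0, 0.0])
counterfactual = torch.tensor([1.5, 1.2, 0.4, -0.8, 0.1])  # e.g. from randomized attention
adjusted = causal_adjusted_logits(factual, counterfactual, gamma=0.5)
print(torch.softmax(adjusted, dim=-1))
```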

Tabular Comparison of CausalMM and VCD

| Feature | CausalMM | VCD |
|---|---|---|
| Core Methodology | Structural Causal Model (SCM) with backdoor adjustment and counterfactual reasoning | Contrastive decoding |
| Focus of Intervention | Visual and language attention mechanisms, visual features, and LLM hidden states | Input image |
| Mechanism of action | 1. De-confound; 2. Obtain the positive treatment effect; 3. Adjust attention, features, and hidden states; 4. Balance the modality priors | Contrasts outputs derived from original and distorted image inputs |
| Versatility | Multimodal hallucinations (vision + language) | Object hallucinations |
| Supports single-modal tasks (such as LLM) | ✓ | × |
| Explores the causal mechanisms within the model | ✓ | × |
| Deals with the confounding effects of modality priors | ✓ | × |
| Modality Priors Addressed | Visual and language priors | – |

We believe that the discussion of P_effect in Section 3.3 is necessary because it represents the paradigm of counterfactual reasoning in causal theory and fully presents the complete process of the do-operator and counterfactual reasoning. This is necessary to help readers understand how we obtain the positive causal effect of effective attention on the model output.

The equations are also hard to follow. For example, in Section 3.3, the core of the method involves logits processing, which is unrelated to softmax. Therefore, separating logits from the softmax equation and presenting them independently would be clearer and more concise.

Thank you for your suggestion. We think the argmax function can more intuitively show our selection process for the next token in the output sequence. If you have more suggestions on the writing format, please let us know.

Comment

We greatly appreciate your thoughtful critique and suggestions. Below is a summary of our revisions and clarifications based on your feedback:

  • Cross-model generalizability: As per your request, we provided experimental data in "Author Response to Reviewer gJXZ (Part 1)" and "Author Response to Reviewer gJXZ (Part 2)" demonstrating the performance of our method on Meta's Chameleon model as well as the InstructBLIP model. We have also included experimental results for InstructBLIP and Chameleon in the appendix, which confirm that our method is applicable across several mainstream MLLM architectures. We encourage you to review these additions.

  • Applicability and generalizability: Following your suggestions, we discussed hyperparameter-related details in "Author Response to Reviewer gJXZ (Part 1)" and added explanations regarding hyperparameters in line 234 of the revised paper. Additionally, we conducted hyperparameter sensitivity tests for our method, with the detailed data provided in "Author Response to Reviewer CnJZ (Part 3)".

  • Differences from the VCD method: In response to your feedback, we elaborated on the differences between our method and the VCD method in "Author Response to Reviewer gJXZ (Part 3)". Furthermore, we added corresponding content in the appendix of the paper to detail the theoretical derivations related to causal reasoning and reiterated the importance of structural causal modeling.

We hope these revisions and clarifications address your concerns and look forward to any additional feedback or questions.

Comment

Dear Reviewer gJXZ,

Thank you for your valuable time! We are writing to kindly follow up on the status of our manuscript review. We have been actively engaging with other reviewers and have received valuable feedback. Your insights would be greatly appreciated to further enhance the quality of our work.

Thank you for your time and consideration!

Yours sincerely,

CausalMM Team

Comment

Dear Authors,

Thank you for your detailed explanation, which thoroughly addressed my concerns and misunderstandings. I also appreciate the updates in the manuscript, which make the paper clearer and more comprehensive. I have updated the rating accordingly.

Comment

Dear Reviewer gJXZ,

Thank you for your valuable feedback and for taking the time to review our manuscript!

Best regards,

CausalMM Team

Review
Rating: 6

The paper introduces a causal inference framework called CAUSALMM, which is designed to mitigate the issue of multimodal hallucinations in Multimodal Large Language Models (MLLMs) that are often caused by biases from visual and language priors. The core idea is to treat modality priors as a confounder between the attention mechanisms and the model output, and to apply structural causal modeling to address these biases. Specifically, the authors use backdoor adjustment and counterfactual reasoning at both the visual and language attention levels to alleviate the negative effects of modality priors, thereby enhancing the alignment between the MLLM's inputs and outputs.

Strengths

  • The paper is well-written and clearly motivated.
  • The method is plug-and-play and does not require retraining, making it practical for existing MLLMs.
  • The integration of causal inference to modify attention weights and optimize token generation in MLLMs is novel.

Weaknesses

  • The background knowledge about causal inference is insufficient. The authors do not explain why causal inference is effective in capturing the causal impact of effective attention in MLLM output.
  • Several claims lack explanations and references. The authors are advised to carefully proofread the paper and add the necessary citations. For instance:
    • Line 65: “Modality priors are one of the confounding factors in the causal path of MLLM.” Why? And what is the definition of the confounding factor?
    • Line 58-61: “existing decoding strategies……, overlooking the causal relationships among attention visual attention, language attention, modality priors, and model output.”
  • In Section 3.3, the authors introduce the causal effect and the corresponding calculation of selecting the next token. But what’s next after determining the index of the next token? The authors are advised to supplement more on this point. I assume that it is going to talk about inferencing with the modified next token prediction.
  • Though the proposed method is somewhat novel, the experimental results are not quite significant and robust compared with existing methods (Table 1).
  • Lack of implementation details. It would be more convincing to provide more details on how to reproduce the experimental results.
  • The notations should be considered more carefully. For example, in Section 3.3, it is confusing that in Lines 222-223 O represents the original output but in Lines 226-227 it denotes the model output.

Questions

  • In Lines 58-61, the authors state that existing decoding strategies overlook the causal relationships among attention visual attention, language attention, modality priors, and model output. However, VCD [1] injects noise into the image input and acquires the output logits for contrastive decoding, which can be seen as a counterfactual intervention at the visual input. The effects of this intervention are also later shown in the attention parts and model outputs [2,3]. Could the authors make more discussion on this perspective?
  • What does γ mean in the equations in Section 3.3? More importantly, the calculation of choosing the index of the next token is not discussed. Why is it calculated in this way?
  • Could you provide the results of VCD and OPERA on VLind-Bench?

[1] Leng S, Zhang H, Chen G, et al. Mitigating object hallucinations in large vision-language models through visual contrastive decoding[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024: 13872-13882.

[2] Xiao X, Wu B, Wang J, et al. Seeing the Image: Prioritizing Visual Correlation by Contrastive Alignment[J]. arXiv preprint arXiv:2405.17871, 2024.

[3] Chen Z, Xu C, Qi Y, et al. Mllm is a strong reranker: Advancing multimodal retrieval-augmented generation via knowledge-enhanced reranking and noise-injected training[J]. arXiv preprint arXiv:2407.21439, 2024.

Comment

We are deeply grateful for your recognition of our work's innovation and thoroughness, as well as your constructive feedback. We have addressed each suggestion regarding the manuscript's weaknesses and made the necessary revisions.

Insufficient Theoretical Justification: The paper lacks a deep theoretical analysis of why the proposed causal interventions lead to improved performance. The causal model is described, but the theoretical foundations and assumptions are not thoroughly explored or justified.

We have supplemented the paper with additional arguments related to causal inference theory. Below we describe how our causal inference method balances the model's modality priors from three aspects: the structural causal model, backdoor adjustment, and counterfactual reasoning:

Structural Causal Model (SCM)

Methods based on causal theory first build a structural causal model and then choose a suitable causal inference technique. In the causal graph, nodes denote the key variables (or hidden variables that affect them), while edges denote the causal relationships between them. Depending on the causal structure and the quantities of concern, a suitable causal reasoning method is selected to eliminate spurious correlations.

Backdoor Adjustment

In this work, we reinterpret the backdoor adjustment framework to analyze the causal influence of modality priors on attention mechanisms and model outputs. By identifying modality priors (M) as confounders, we isolate the causal effect of attention (A) on the output (O) using the backdoor adjustment method.

Variables:

A (attention): The mechanism whose causal effect we aim to evaluate.

M (modality priors): Influences both A and O, acting as a confounder.

O (model output): The outcome variable, influenced by A and M.

Causal Challenge:

The backdoor path A ← M → O introduces confounding, making it necessary to adjust for M to isolate the causal effect of A on O.

To block confounding, the backdoor criterion ensures that:

1. M blocks all backdoor paths from A to O.

2. M is not influenced by A.

Using this criterion, the causal effect of A on O is computed as

$$P(o \mid do(a)) = \sum_m P(o \mid a, m)\, P(m).$$

Modality priors (M) explain the indirect influence of A on O, enabling a disentangled analysis. Adjusting for M removes the confounding, ensuring that A's causal impact on O is properly estimated.
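As a self-contained illustration (the numbers below are hypothetical, not taken from the paper), the backdoor adjustment can be computed directly by marginalizing the confounder M out at a fixed intervention do(A = a):

```python
# Toy backdoor adjustment: P(o | do(a)) = sum_m P(o | a, m) P(m),
# with a binary confounder m (the modality prior) and two outcomes.
def p_o_do_a(p_o_given_a_m, p_m, a):
    outcomes = {o for (o, a_, _m) in p_o_given_a_m if a_ == a}
    return {o: sum(p_o_given_a_m[(o, a, m)] * p_m[m] for m in p_m) for o in outcomes}

p_m = {0: 0.7, 1: 0.3}                     # hypothetical prior over the confounder
p_o_given_a_m = {                          # hypothetical conditional outcome table
    ("faithful", "attend", 0): 0.9, ("hallucinate", "attend", 0): 0.1,
    ("faithful", "attend", 1): 0.6, ("hallucinate", "attend", 1): 0.4,
}
print(p_o_do_a(p_o_given_a_m, p_m, "attend"))
# faithful: 0.9*0.7 + 0.6*0.3 = 0.81; hallucinate: 0.1*0.7 + 0.4*0.3 = 0.19
```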

Counterfactual reasoning

By controlling for confounders, we can more accurately estimate causal relationships. This method serves as a foundation for counterfactual reasoning, which enables the assessment of treatment effects in systems like multimodal models.

Causal Effect of Visual Attention (A_i)

The causal effect of the visual attention mechanism on the model output O is given by:

$$P_{effect\_V} = E_{A_i \sim \tilde{A}_i}\left[P(O \mid A_i = \mathbf{A}_i, I = \mathbf{I}, P_v = \mathbf{P}_v) - P(O \mid \text{do}(A_i = \mathbf{a}_i), I = \mathbf{I}, P_v = \mathbf{P}_v)\right].$$

Here:

P_effect_V represents the treatment effect of visual attention on the output O.

A_i denotes the observed visual attention, while a_i represents the intervention applied to the visual attention.

Causal Effect of Language Model Attention (A_t)

Similarly, the causal effect of the language model attention on the output O can be expressed as:

$$P_{effect\_L} = E_{A_t \sim \tilde{A}_t}\left[P(O \mid A_t = \mathbf{A}_t, T_t = \mathbf{T}_t, P_l = \mathbf{P}_l) - P(O \mid \text{do}(A_t = \mathbf{a}_t), T_t = \mathbf{T}_t, P_l = \mathbf{P}_l)\right],$$

Here:

P_effect_L represents the treatment effect of language attention on the output O.

A_t is the observed language model attention, while a_t represents the intervention on the language model attention.

Combined Causal Effect in a Multimodal Setting

In multimodal systems, the combined treatment effect of both visual and language attention mechanisms is described as:

$$P_{effect\_M} = E_{A_i, A_t \sim \tilde{A}_i, \tilde{A}_t}\left[P(O \mid A_i = \mathbf{A}_i, A_t = \mathbf{A}_t, I = \mathbf{I}, T_t = \mathbf{T}_t, P_v = \mathbf{P}_v, P_l = \mathbf{P}_l) - P(O \mid \text{do}(A_i = \mathbf{a}_i), \text{do}(A_t = \mathbf{a}_t), I = \mathbf{I}, T_t = \mathbf{T}_t, P_v = \mathbf{P}_v, P_l = \mathbf{P}_l)\right].$$

In this formulation:

P_effect_M measures the combined effect of both visual and language attention mechanisms on the model output.

The observed and intervened attention variables are denoted by A_i, a_i for visual attention and A_t, a_t for language attention.
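A hedged sketch of how such a treatment effect could be estimated in practice: average, over sampled counterfactual attention states, the gap between the factual output distribution and the distribution under the intervention. `model_prob` below is a hypothetical stand-in for a forward pass, not part of the paper's code.

```python
# Monte Carlo estimate of E_{a ~ Ã}[ P(O | A = A_obs) - P(O | do(A = a)) ].
import torch

def model_prob(attention: torch.Tensor) -> torch.Tensor:
    # Placeholder forward pass: any map from an attention map to a distribution.
    return torch.softmax(attention.sum(dim=0), dim=-1)

def estimate_p_effect(observed_attention, counterfactual_attentions):
    factual = model_prob(observed_attention)
    gaps = [factual - model_prob(a) for a in counterfactual_attentions]
    return torch.stack(gaps).mean(dim=0)

obs = torch.rand(4, 8)                          # toy observed attention (4 queries, 8 keys)
cf = [torch.rand_like(obs) for _ in range(3)]   # sampled counterfactual attentions
print(estimate_p_effect(obs, cf))
```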

Comment

Several claims lack explanations and references. The authors are advised to carefully proofread the paper and add the necessary citations. For instance: Line 65: “Modality priors are one of the confounding factors in the causal path of MLLM.” Why? And what is the definition of the confounding factor? Line 58-61: “existing decoding strategies……, overlooking the causal relationships among attention visual attention, language attention, modality priors, and model output.

We added references to relevant articles to support our point of view. At the same time, for the possible confusion, we modified the sentence in the original text to make the conclusion more moderate.

Confounding Factor: A confounding factor, also known as a confounder, is a variable that influences both the dependent variable and independent variable, causing a spurious association. In simpler terms, it's an outside influence that can distort the true relationship between the variables being studied.

Without causal inference theory, the true impact of factors such as attention on the model output under the influence of modality priors has not been well studied.

[1] Peng, D., Wei, W., Mao, X., Fu, Y., Chen, D. "An Empirical Study on the Language Modal in Visual Question Answering". arXiv preprint arXiv:2305.10143

[2] Chen, M., Cao, Y., Zhang, Y., Lu, C. "Quantifying and Mitigating Unimodal Biases in Multimodal Large Language Models: A Causal Perspective". arXiv preprint arXiv:2403.18346

[3] Lukics, K. S., Lukács, Á. "Modality, presentation, domain and training effects in statistical learning". Sci Rep, 2022.

[4] Gema, A. P., Jin, C., Abdulaal, A., Diethe, T., Teare, P., Alex, B., Minervini, P., Saseendran, A. "DeCoRe: Decoding by Contrasting Retrieval Heads to Mitigate Hallucinations". arXiv preprint arXiv:2410.18860

[5] Lee K, Kim M, Yoon S, et al. VLind-Bench: Measuring Language Priors in Large Vision-Language Models. arXiv preprint arXiv:2406.08702, 2024.

In Section 3.3, the authors introduce the causal effect and the corresponding calculation of selecting the next token. But what’s next after determining the index of the next token? The authors are advised to supplement more on this point. I assume that it is going to talk about inferencing with the modified next token prediction.

We have added content about the selection of the next token in the revised version. We use direct sampling as a decoding strategy because it can improve the diversity of text and reduce the probability of repeatedly outputting the same word.
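For illustration only (the exact decoding loop may differ), the direct-sampling step could look like the sketch below, assuming `adjusted_logits` are hypothetical prior-balanced logits for the next position:

```python
# Sample the next token from the adjusted distribution instead of taking argmax.
import torch

adjusted_logits = torch.tensor([2.0, 1.0, 0.5, -1.0])      # hypothetical values
probs = torch.softmax(adjusted_logits, dim=-1)
next_token_id = torch.multinomial(probs, num_samples=1).item()
print(next_token_id)
```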

Lack of implementation details. It would be more convincing to provide more details on how to reproduce the experimental results.

We are preparing an open-source version of the code with improved readability. Please stay tuned for our follow-up release.

The notations should be considered more carefully. For example, in Section 3.3, it is confusing that in Lines 222-223 O represents the original output but in Lines 226-227 it denotes the model output.

Thank you for your feedback. We have corrected the representation of the relevant variables in the revised version.

What does γ mean in the equations in Section 3.3?

γ represents the degree of confidence in the treatment effect and is used to adjust the strength with which the modality prior is suppressed. We have added this explanation to the text.

Could you provide the results of VCD and OPERA on VLind-Bench?

We provide experimental data for the VCD and OPERA methods on the VLind benchmark. The results for the OPERA method are all 0, similar to some models in Table 1 of the original VLind paper [1]. The original paper offers no conclusion about this; we speculate that the reason is a decline in the ability to follow certain instructions.

| Metrics | S_ck | S_vp | S_cb | S_lp | CB | LP |
|---|---|---|---|---|---|---|
| Regular | 32.1 | 40.7 | 43.3 | 33.1 | 43.7 | 27.1 |
| VCD | 30.5 | 48.0 | 47.8 | 31.0 | 44.0 | 29.2 |
| OPERA* | 0 | 0 | - | - | 0 | 0 |
| CausalMM | 57.0 | 80.8 | 64.0 | 61.8 | 59.9 | 40.2 |

[1] Lee K, Kim M, Yoon S, et al. VLind-Bench: Measuring Language Priors in Large Vision-Language Models[J]. arXiv preprint arXiv:2406.08702, 2024.

Comment

Though the proposed method is somewhat novel, the experimental results are not quite significant and robust compared with existing methods (Table 1).

We added the table below to expand the comparison with more baselines. The values are averages over the three parts of the POPE benchmark (MSCOCO, A-OKVQA, GQA). It can be seen that the CausalMM method achieves the highest value most of the time.

| Model | Setting | Method | Accuracy | Precision | Recall | F1 Score |
|---|---|---|---|---|---|---|
| InstructBLIP | Random | Regular | 80.42 | 78.93 | 83.21 | 80.94 |
| | | DOLA | 83.00 | 83.06 | 83.13 | 83.00 |
| | | VCD | 84.11 | 84.20 | 84.33 | 84.13 |
| | | OPERA | 85.07 | 88.39 | 80.73 | 84.39 |
| | | AGLA | 87.30 | 88.83 | 85.68 | 87.07 |
| | | Vision | 86.87 | 87.74 | 86.09 | 86.75 |
| | | Language | 87.15 | 89.82 | 84.16 | 86.71 |
| | | Multimodal | 87.87 | 89.71 | 85.89 | 87.60 |
| InstructBLIP | Popular | Regular | 76.09 | 73.22 | 82.94 | 77.65 |
| | | DOLA | 78.99 | 77.12 | 83.13 | 79.85 |
| | | VCD | 79.94 | 77.84 | 84.33 | 80.80 |
| | | OPERA | 78.33 | 73.85 | 87.73 | 80.20 |
| | | AGLA | 81.86 | 80.17 | 85.68 | 82.58 |
| | | Vision | 80.94 | 78.66 | 86.09 | 81.94 |
| | | Language | 81.68 | 80.73 | 84.16 | 82.14 |
| | | Multimodal | 82.00 | 80.60 | 85.31 | 82.64 |
| InstructBLIP | Adversarial | Regular | 72.37 | 68.78 | 83.06 | 75.42 |
| | | DOLA | 74.67 | 71.53 | 83.11 | 76.68 |
| | | VCD | 76.32 | 73.24 | 84.08 | 78.08 |
| | | OPERA | 75.50 | 70.49 | 87.73 | 78.17 |
| | | AGLA | 77.29 | 74.09 | 85.67 | 79.16 |
| | | Vision | 76.93 | 73.44 | 86.16 | 78.99 |
| | | Language | 78.19 | 75.87 | 84.42 | 79.55 |
| | | Multimodal | 78.08 | 75.14 | 85.53 | 79.70 |
| LLaVA-1.5 | Random | Regular | 83.72 | 89.30 | 77.13 | 82.55 |
| | | DOLA | 84.78 | 87.59 | 81.27 | 84.19 |
| | | VCD | 86.05 | 90.39 | 80.91 | 85.29 |
| | | OPERA | 88.64 | 88.09 | 89.73 | 87.43 |
| | | AGLA | 88.54 | 94.41 | 82.08 | 87.71 |
| | | Vision | 87.17 | 92.35 | 81.28 | 86.33 |
| | | Language | 86.84 | 91.96 | 80.86 | 85.68 |
| | | Multimodal | 88.79 | 92.63 | 84.35 | 88.26 |
| LLaVA-1.5 | Popular | Regular | 79.73 | 82.03 | 76.73 | 79.11 |
| | | DOLA | 79.75 | 84.11 | 76.22 | 80.61 |
| | | VCD | 81.52 | 82.59 | 80.60 | 81.39 |
| | | OPERA | 83.34 | 80.27 | 89.73 | 84.44 |
| | | AGLA | 85.14 | 87.88 | 82.08 | 84.68 |
| | | Vision | 83.13 | 84.84 | 81.37 | 82.85 |
| | | Language | 84.31 | 86.75 | 83.80 | 84.26 |
| | | Multimodal | 85.06 | 86.44 | 83.82 | 84.87 |
| LLaVA-1.5 | Adversarial | Regular | 76.02 | 76.20 | 76.60 | 76.36 |
| | | DOLA | 76.32 | 77.27 | 75.47 | 76.16 |
| | | VCD | 77.84 | 76.87 | 80.75 | 78.53 |
| | | OPERA | 76.68 | 71.66 | 89.71 | 79.46 |
| | | AGLA | 81.13 | 81.20 | 82.10 | 81.36 |
| | | Vision | 78.62 | 77.83 | 81.51 | 79.31 |
| | | Language | 78.59 | 78.49 | 79.77 | 78.90 |
| | | Multimodal | 80.36 | 79.53 | 82.86 | 80.91 |
Comment

Thanks a lot for the authors' response and care for my concerns. I have updated the rating accordingly.

Comment

Thank you very much for your kind response. We truly appreciate your thoughtful feedback and support!

Comment

In Lines 58-61, the authors state that existing decoding strategies overlook the causal relationships among attention visual attention, language attention, modality priors, and model output. However, VCD [1] injects noise into the image input and acquires the output logits for contrastive decoding, which can be seen as a counterfactual intervention at the visual input. The effects of this intervention are also later shown in the attention parts and model outputs [2,3]. Could the authors make more discussion on this perspective?

Thank you for your feedback. Specifically, VCD contrasts outputs derived from original and distorted image inputs. In contrast, CausalMM isolates the influence of modality priors and other confounders on multimodal attention using backdoor adjustment, obtains the positive treatment effect of attention on the output through counterfactual reasoning, adjusts the model's output at the attention and feature levels, and balances the modality priors. The former's operation is not consistent with the definition of an intervention in causal theory; causal reasoning does not use the intervention directly to optimize the output.

Tabular Comparison of CausalMM and VCD

| Feature | CausalMM | VCD |
|---|---|---|
| Core Methodology | Structural Causal Model (SCM) with backdoor adjustment and counterfactual reasoning | Contrastive decoding |
| Focus of Intervention | Visual and language attention mechanisms, visual features, and LLM hidden states | Input image |
| Mechanism of action | 1. De-confound; 2. Obtain the positive treatment effect; 3. Adjust attention, features, and hidden states; 4. Balance the modality priors | Contrasts outputs derived from original and distorted image inputs |
| Versatility | Multimodal hallucinations (vision + language) | Object hallucinations |
| Supports single-modal tasks (such as LLM) | ✓ | × |
| Explores the causal mechanisms within the model | ✓ | × |
| Deals with the confounding effects of modality priors | ✓ | × |
| Modality Priors Addressed | Visual and language priors | – |
Review
Rating: 5

This paper addresses the problem of modality prior-induced hallucinations in Multimodal Large Language Models (MLLMs). The authors propose a causal reasoning framework, which applies structural causal modeling and counterfactual reasoning to MLLMs. By treating modality priors as confounding factors between attention mechanisms and the model's output, the approach aims to mitigate the negative effects of these priors. The method involves interventions on both visual and language attention mechanisms and is evaluated on several benchmarks.

Strengths

  1. Perspective: The paper introduces the idea of applying causal inference techniques to address modality prior-induced hallucinations in MLLMs, which is an interesting and potentially valuable perspective.

  2. Comprehensive Experiments: The authors conduct experiments on multiple benchmarks, including VLind-Bench, POPE, and MME, providing a range of evaluations for their method.

  3. Ablation Studies: Inclusion of ablation studies helps in understanding the impact of different components of the proposed method.

Weaknesses

  1. Insufficient Theoretical Justification: The paper lacks a deep theoretical analysis of why the proposed causal interventions lead to improved performance. The causal model is described, but the theoretical foundations and assumptions are not thoroughly explored or justified.

  2. Limited Novelty: While the application of causal inference to MLLMs is presented as novel, causal reasoning has been previously applied in machine learning models, including language models. The paper does not sufficiently differentiate its contributions from existing work in causal inference applied to deep learning.

  3. Inadequate Comparison with Baselines: For some evaluations (such as Figures 3, 5, and 6), the experimental evaluation compares the proposed method with only a partial set of baselines from the setup. Under many settings, the performance of this method underperforms or only slightly outperforms baseline methods, as shown in Table 1.

  4. Superficial Experimental Analysis: The results, while showing improvements, lack statistical significance testing. Additionally, there is a lack of detailed analysis of where and why the method improves performance, making it difficult to assess the true impact.

  5. Applicability and Generalization: The approach is tested on specific MLLMs, but it is unclear how well the method generalizes to other models.

  6. Lack of Discussion on Limitations: The paper does not adequately discuss the limitations or potential downsides of the proposed method, such as scenarios where it might not work well or possible negative impacts.

Questions

  1. Theoretical Justification: Can you provide a more detailed theoretical analysis or proofs to support the efficacy of your causal interventions?

  2. Computational Complexity: What is the computational overhead introduced by your method? How does it compare to the baseline models in terms of runtime and resource consumption?

  3. Generality of the Approach: How well does your method generalize to other types of MLLMs, such as chameleon?

  4. Limitations and Failure Cases: What are the limitations of your method? Are there scenarios where it does not perform well or might even degrade performance? How do you address potential negative impacts?

  5. Impact of Hyperparameters: How sensitive is your method to the choice of hyperparameters involved in the interventions? Have you performed a sensitivity analysis?

  6. Robustness to Noise: How does your method handle noisy or adversarial inputs? Does the causal framework improve robustness in such cases?

Comment

We are deeply grateful for your recognition of our work's innovation and thoroughness, as well as your constructive feedback. We have addressed each suggestion regarding the manuscript's weaknesses and made the necessary revisions.

Insufficient Theoretical Justification: The paper lacks a deep theoretical analysis of why the proposed causal interventions lead to improved performance. The causal model is described, but the theoretical foundations and assumptions are not thoroughly explored or justified.

We have supplemented the paper with additional arguments related to causal inference theory. Below we describe how our causal inference method balances the model's modality priors from three aspects: the structural causal model, backdoor adjustment, and counterfactual reasoning:

Structural Causal Model (SCM)

Methods based on causal theory first build a structural causal model and then choose a suitable causal inference technique. In the causal graph, nodes denote the key variables (or hidden variables that affect them), while edges denote the causal relationships between them. Depending on the causal structure and the quantities of concern, a suitable causal reasoning method is selected to eliminate spurious correlations.

Backdoor Adjustment

In this work, we reinterpret the backdoor adjustment framework to analyze the causal influence of modality priors on attention mechanisms and model outputs. By identifying modality priors (M) as confounders, we isolate the causal effect of attention (A) on the output (O) using the backdoor adjustment method.

Variables:

A (attention): The mechanism whose causal effect we aim to evaluate.

M (modality priors): Influences both A and O, acting as a confounder.

O (model output): The outcome variable, influenced by A and M.

Causal Challenge:

The backdoor path A ← M → O introduces confounding, making it necessary to adjust for M to isolate the causal effect of A on O.

To block confounding, the backdoor criterion ensures that:

1. M blocks all backdoor paths from A to O.

2. M is not influenced by A.

Using this criterion, the causal effect of A on O is computed as

$$P(o \mid do(a)) = \sum_m P(o \mid a, m)\, P(m).$$

Modality priors (M) explain the indirect influence of A on O, enabling a disentangled analysis. Adjusting for M removes the confounding, ensuring that A's causal impact on O is properly estimated.

Counterfactual reasoning

By controlling for confounders, we can more accurately estimate causal relationships. This method serves as a foundation for counterfactual reasoning, which enables the assessment of treatment effects in systems like multimodal models.

Causal Effect of Visual Attention (A_i)

The causal effect of the visual attention mechanism on the model output O is given by:

$$P_{effect\_V} = E_{A_i \sim \tilde{A}_i}\left[P(O \mid A_i = \mathbf{A}_i, I = \mathbf{I}, P_v = \mathbf{P}_v) - P(O \mid \text{do}(A_i = \mathbf{a}_i), I = \mathbf{I}, P_v = \mathbf{P}_v)\right].$$

Here:

P_effect_V represents the treatment effect of visual attention on the output O.

A_i denotes the observed visual attention, while a_i represents the intervention applied to the visual attention.

Causal Effect of Language Model Attention (A_t)

Similarly, the causal effect of the language model attention on the output O can be expressed as:

$$P_{effect\_L} = E_{A_t \sim \tilde{A}_t}\left[P(O \mid A_t = \mathbf{A}_t, T_t = \mathbf{T}_t, P_l = \mathbf{P}_l) - P(O \mid \text{do}(A_t = \mathbf{a}_t), T_t = \mathbf{T}_t, P_l = \mathbf{P}_l)\right],$$

Here:

P_effect_L represents the treatment effect of language attention on the output O.

A_t is the observed language model attention, while a_t represents the intervention on the language model attention.

Combined Causal Effect in a Multimodal Setting

In multimodal systems, the combined treatment effect of both visual and language attention mechanisms is described as:

$$P_{effect\_M} = E_{A_i, A_t \sim \tilde{A}_i, \tilde{A}_t}\left[P(O \mid A_i = \mathbf{A}_i, A_t = \mathbf{A}_t, I = \mathbf{I}, T_t = \mathbf{T}_t, P_v = \mathbf{P}_v, P_l = \mathbf{P}_l) - P(O \mid \text{do}(A_i = \mathbf{a}_i), \text{do}(A_t = \mathbf{a}_t), I = \mathbf{I}, T_t = \mathbf{T}_t, P_v = \mathbf{P}_v, P_l = \mathbf{P}_l)\right].$$

In this formulation:

P_effect_M measures the combined effect of both visual and language attention mechanisms on the model output.

The observed and intervened attention variables are denoted by A_i, a_i for visual attention and A_t, a_t for language attention.

Comment

Limited Novelty: While the application of causal inference to MLLMs is presented as novel, causal reasoning has been previously applied in machine learning models, including language models. The paper does not sufficiently differentiate its contributions from existing work in causal inference applied to deep learning.

While causal inference has been applied in machine learning, the novelty of CausalMM lies in its specifically designed structural causal model and counterfactual reasoning framework tailored for Multimodal Large Language Models (MLLMs). Unlike previous works, CausalMM explicitly addresses the causal relationships between visual and language modalities and identifies and adjusts the influence of modality priors on attention mechanisms through intervention.

| Method | Focus | Empirical | Innovation | Technique |
|---|---|---|---|---|
| Rhino [1] | Causal relationship learning from time-series data | Extensive synthetic experiments and real-world benchmarks | Combining vector auto-regression, deep learning, and variational inference | Modeling non-linear relationships with history-dependent noise and instantaneous effects |
| Causal-StoNet [2] | High-dimensional complex data | Extensive numerical studies | Sparse deep learning theory | Adaptive Stochastic Gradient MCMC (SGMCMC) |
| CUTS [3] | Causal discovery from irregular time-series data | Joint imputation of unobserved data points and causal graph construction | Delayed Supervision Graph Neural Network (DSGNN) for unstructured data | Iterative framework with mutually boosting modules for data prediction and graph fitting |
| VBCI [4] | Predicting cellular gene expressions under counterfactual perturbations | Extensive experiments demonstrating superiority over state-of-the-art deep learning models | Novel graph variational Bayesian causal inference framework utilizing gene regulatory networks | Adjacency matrix updating for graph convolutional networks during pre-training |
| CausalMM (Ours) | Mitigating multimodal hallucinations in MLLMs | Extensive experiments on VLind-Bench, POPE, and MME benchmarks | Causal inference framework (CausalMM) for the attention mechanism | Backdoor adjustment and counterfactual reasoning at attention levels |

[1] Gong, W., Jennings, J., Zhang, C., Pawlowski, N. "Rhino: Deep Causal Temporal Relationship Learning With History-dependent Noise". Machine Learning, 2022.

[2] Fang, Y., Liang, F. "Causal-StoNet: Causal Inference for High-Dimensional Complex Data". arXiv preprint arXiv:2403.18994

[3] Cheng, Y., He, K., Xiao, T., Dai, Q., Suo, J., Li, Z., Yang, R. "CUTS: Neural Causal Discovery from Irregular Time-Series Data". International Conference on Learning Representations, 2023.

[4] Voloch, L., Barton, R. A., Ioannidis, V., Donno, C. D., Wu, Y., Price, L., Karypis, G., Wang, Z. "Predicting Cellular Responses with Variational Causal Inference and Refined Relational Information". International Conference on Learning Representations, 2022.

Comment

Inadequate Comparison with Baselines: For some evaluations(such as figure 3,5,6), the experimental evaluation compares the proposed method with only a partial set of baselines from the setup. Under many settings, the performance of this method underperform or only slightly outperform baseline methods as shown in table 1.

We added the table below to expand the comparison with more baselines. The values are averages over the three parts of the POPE benchmark (MSCOCO, A-OKVQA, GQA). It can be seen that the CausalMM method achieves the highest value most of the time.

| Model | Setting | Method | Accuracy | Precision | Recall | F1 Score |
|---|---|---|---|---|---|---|
| InstructBLIP | Random | Regular | 80.42 | 78.93 | 83.21 | 80.94 |
| | | DOLA | 83.00 | 83.06 | 83.13 | 83.00 |
| | | VCD | 84.11 | 84.20 | 84.33 | 84.13 |
| | | OPERA | 85.07 | 88.39 | 80.73 | 84.39 |
| | | AGLA | 87.30 | 88.83 | 85.68 | 87.07 |
| | | Vision | 86.87 | 87.74 | 86.09 | 86.75 |
| | | Language | 87.15 | 89.82 | 84.16 | 86.71 |
| | | Multimodal | 87.87 | 89.71 | 85.89 | 87.60 |
| InstructBLIP | Popular | Regular | 76.09 | 73.22 | 82.94 | 77.65 |
| | | DOLA | 78.99 | 77.12 | 83.13 | 79.85 |
| | | VCD | 79.94 | 77.84 | 84.33 | 80.80 |
| | | OPERA | 78.33 | 73.85 | 87.73 | 80.20 |
| | | AGLA | 81.86 | 80.17 | 85.68 | 82.58 |
| | | Vision | 80.94 | 78.66 | 86.09 | 81.94 |
| | | Language | 81.68 | 80.73 | 84.16 | 82.14 |
| | | Multimodal | 82.00 | 80.60 | 85.31 | 82.64 |
| InstructBLIP | Adversarial | Regular | 72.37 | 68.78 | 83.06 | 75.42 |
| | | DOLA | 74.67 | 71.53 | 83.11 | 76.68 |
| | | VCD | 76.32 | 73.24 | 84.08 | 78.08 |
| | | OPERA | 75.50 | 70.49 | 87.73 | 78.17 |
| | | AGLA | 77.29 | 74.09 | 85.67 | 79.16 |
| | | Vision | 76.93 | 73.44 | 86.16 | 78.99 |
| | | Language | 78.19 | 75.87 | 84.42 | 79.55 |
| | | Multimodal | 78.08 | 75.14 | 85.53 | 79.70 |
| LLaVA-1.5 | Random | Regular | 83.72 | 89.30 | 77.13 | 82.55 |
| | | DOLA | 84.78 | 87.59 | 81.27 | 84.19 |
| | | VCD | 86.05 | 90.39 | 80.91 | 85.29 |
| | | OPERA | 88.64 | 88.09 | 89.73 | 87.43 |
| | | AGLA | 88.54 | 94.41 | 82.08 | 87.71 |
| | | Vision | 87.17 | 92.35 | 81.28 | 86.33 |
| | | Language | 86.84 | 91.96 | 80.86 | 85.68 |
| | | Multimodal | 88.79 | 92.63 | 84.35 | 88.26 |
| LLaVA-1.5 | Popular | Regular | 79.73 | 82.03 | 76.73 | 79.11 |
| | | DOLA | 79.75 | 84.11 | 76.22 | 80.61 |
| | | VCD | 81.52 | 82.59 | 80.60 | 81.39 |
| | | OPERA | 83.34 | 80.27 | 89.73 | 84.44 |
| | | AGLA | 85.14 | 87.88 | 82.08 | 84.68 |
| | | Vision | 83.13 | 84.84 | 81.37 | 82.85 |
| | | Language | 84.31 | 86.75 | 83.80 | 84.26 |
| | | Multimodal | 85.06 | 86.44 | 83.82 | 84.87 |
| LLaVA-1.5 | Adversarial | Regular | 76.02 | 76.20 | 76.60 | 76.36 |
| | | DOLA | 76.32 | 77.27 | 75.47 | 76.16 |
| | | VCD | 77.84 | 76.87 | 80.75 | 78.53 |
| | | OPERA | 76.68 | 71.66 | 89.71 | 79.46 |
| | | AGLA | 81.13 | 81.20 | 82.10 | 81.36 |
| | | Vision | 78.62 | 77.83 | 81.51 | 79.31 |
| | | Language | 78.59 | 78.49 | 79.77 | 78.90 |
| | | Multimodal | 80.36 | 79.53 | 82.86 | 80.91 |
Comment

Superficial Experimental Analysis: The results, while showing improvements, lack statistical significance testing. Additionally, there is a lack of detailed analysis of where and why the method improves performance, making it difficult to assess the true impact.

We performed a statistical significance analysis on the experimental results, with a significance level of 0.05. The results of the test support the advantage of our method and its effectiveness in practical applications.

| Metric | t-statistic | p-value | Significant |
|---|---|---|---|
| Accuracy | 2.678 | 0.016 | True |
| F1 Score | 3.585 | 0.002 | True |
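For reference, a paired t-test of this kind can be run as sketched below; the F1 lists are illustrative (taken from the InstructBLIP POPE rows quoted earlier in this thread), and the paper's actual test setup may pair the scores differently.

```python
# Hedged sketch of a paired significance test between baseline and CausalMM scores.
from scipy import stats

baseline_f1 = [80.41, 78.36, 76.59, 81.86, 78.17, 75.56]   # Regular, per setting
causalmm_f1 = [86.92, 84.19, 82.09, 88.56, 83.22, 78.79]   # Multimodal, per setting
t_stat, p_value = stats.ttest_rel(causalmm_f1, baseline_f1)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}, significant at 0.05: {p_value < 0.05}")
```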

Where and why the method improves performance:

Multimodal large language models (MLLMs) are prone to hallucination problems caused by modality priors (visual or language), such as relying on textual cues and ignoring visual input. The model may mistakenly rely on the attention distribution of a specific modality, leading to incorrect cognition of object existence and attributes.

CAUSALMM analyzes the causal influence of visual and language attention through structural causal modeling (SCM), treats modality priors as confounding factors, and corrects the causal paths through counterfactual reasoning. By generating different attention states (such as random, reversed, and uniform attention) through counterfactual interventions, the contributions of different modalities can be isolated and quantified, allowing a more accurate judgment of the causal role of multimodal information (see the sketch after the next paragraph).

In the multimodal collaborative mode, CAUSALMM balances the causal effects of visual and language attention to make the generated content consistent with the multimodal input, thereby improving the generation quality of the model.
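The counterfactual attention states mentioned above (random, reversed, and uniform) could be produced along the following lines; this is an assumption-based sketch, not the authors' released implementation.

```python
# Generate counterfactual attention maps from an observed attention map.
import torch

def counterfactual_attention(attn: torch.Tensor, mode: str) -> torch.Tensor:
    # attn: attention weights whose last dimension sums to 1.
    if mode == "random":
        return torch.softmax(torch.rand_like(attn), dim=-1)
    if mode == "reversed":
        flipped = attn.max(dim=-1, keepdim=True).values - attn  # high scores become low
        return flipped / flipped.sum(dim=-1, keepdim=True).clamp_min(1e-8)
    if mode == "uniform":
        return torch.full_like(attn, 1.0 / attn.size(-1))
    raise ValueError(f"unknown mode: {mode}")

attn = torch.softmax(torch.randn(2, 5), dim=-1)   # toy attention: 2 queries, 5 keys
for mode in ("random", "reversed", "uniform"):
    print(mode, counterfactual_attention(attn, mode).sum(dim=-1))  # each row sums to 1
```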

Lack of Discussion on Limitations: The paper does not adequately discuss the limitations or potential downsides of the proposed method, such as scenarios where it might not work well or possible negative impacts. & Limitations and Failure Cases: What are the limitations of your method? Are there scenarios where it does not perform well or might even degrade performance? How do you address potential negative impacts?

We added more discussion of limitations in the appendix. In the case study portion of Section 4.4 of the original paper, we show examples that our method still fails to solve. These examples concern fine-grained visual perception, perception of spatiotemporal relations, and understanding of higher-order semantics. As a balancer of modality priors, CausalMM can maximize the capabilities of existing backbone models, but its limitations lie in the performance limits of the vision encoder and the LLM. We will continue to study how to effectively utilize the existing visual information and maximize the performance of the vision encoder and the LLM.

Computational Complexity: What is the computational overhead introduced by your method? How does it compare to the baseline models in terms of runtime and resource consumption?

Like other methods that operate at the inference stage, our method consumes additional time in the complete inference process. The additional time is negligible relative to the latency of normal conversations.

Impact of Hyperparameters: How sensitive is your method to the choice of hyperparameters involved in the interventions? Have you performed a sensitivity analysis?

Our method is not sensitive to hyperparameter values away from the extremes. Below is our sensitivity analysis of some experimental results.

Sensitivity Table (Gamma, Epsilon)

| Metric | Gamma Sensitivity | Epsilon Sensitivity |
|---|---|---|
| Accuracy | 0.006236 | 0.004382 |
| F1 Score | 0.007662 | 0.005491 |

Robustness to Noise: How does your method handle noisy or adversarial inputs? Does the causal framework improve robustness in such cases?

CausalMM can handle noisy and adversarial inputs. Noise can be included in counterfactual attention together with low-quality attention and desensitized through the framework of CausalMM. For adversarial inputs, the settings of VLind benchmark and POPE benchmark include such inputs. From the experimental results of the original paper, CausalMM can handle adversarial inputs well and improve the alignment of model output with input.

Comment

"Similar to other methods that optimize at the inference stage, our method will consume more time in the complete inference process. The additional time is negligible in the latency of normal conversations."

Some questions:

  1. Time Overhead: Some baselines report explicit latency metrics. Quantitative comparisons (e.g., % increase or ms) would clarify the claim of negligible impact.
  2. Memory Overhead: Memory usage is not addressed but is a key factor in evaluating methods.
  3. Clarity: "Normal conversations" is vague; specifying datasets or benchmarks would improve transparency.
Comment
  • What are the results of OPERA in Figure 4?

As per your request, we have incorporated the results for OPERA. We will update this figure in a future version.

| Method | existence | count | position | color | posters | celebrity | scene | landmark | artwork | OCR | commonsense_reasoning | numerical_calculation | text_translation | code_reasoning |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Regular | 186.67 | 123.33 | 136.67 | 153.33 | 138.78 | 132.94 | 151.5 | 149.0 | 126.5 | 135.0 | 134.29 | 75.0 | 125.0 | 95.0 |
| VCD | 186.67 | 136.67 | 140.0 | 166.67 | 146.26 | 148.24 | 163.0 | 144.5 | 130.0 | 114.99 | 137.14 | 70.0 | 120.0 | 100.0 |
| OPERA | 195.0 | 148.33 | 133.33 | 155.0 | 136.05 | 127.65 | 154.25 | 153.0 | 123.25 | 125.0 | 114.29 | 40.0 | 90.0 | 62.5 |
| vision | 186.67 | 160.0 | 140.0 | 166.67 | 146.26 | 150.0 | 161.0 | 160.0 | 140.5 | 135.0 | 137.14 | 100.0 | 110.0 | 90.0 |
| language | 190.0 | 140.0 | 133.33 | 166.67 | 171.43 | 153.53 | 172.5 | 163.0 | 135.5 | 115.0 | 135.71 | 95.0 | 115.0 | 100.0 |
| multimodal | 196.67 | 156.67 | 133.33 | 176.67 | 153.74 | 145.88 | 164.5 | 164.0 | 142.0 | 135.0 | 144.29 | 100.0 | 115.0 | 100.0 |
  • What are the results of OPERA in Figure 5?

Similarly, as per your request, we have incorporated the results for OPERA. We will update this figure in a future version.

| Method | Perception Scores | Cognition Scores |
|---|---|---|
| Regular | 1433.72 | 429.29 |
| VCD | 1476.99 | 427.14 |
| OPERA | 1450.87 | 306.79 |
| Vision | 1546.09 | 437.14 |
| Language | 1516.44 | 435.00 |
| Multimodal | 1568.46 | 459.29 |
  1. Time Overhead: Some baselines report explicit latency metrics. Quantitative comparisons (e.g., % increase or ms) would clarify the claim of negligible impact.
  2. Memory Overhead: Memory usage is not addressed but is a key factor in evaluating methods.
  3. Clarity: "Normal conversations" is vague; specifying datasets or benchmarks would improve transparency.

For the definition of "Normal conversations," we refer to the subjective experience of human users when interacting with the model. We invited professionals unrelated to the work to engage in normal conversations with the model using our method, without prior knowledge, and found that the actual delay was virtually imperceptible.

We conducted 10 experiments on POPE and averaged all the data. The results are as follows:

| Method | Regular | VCD | CausalMM |
|---|---|---|---|
| Time | 1.00 | 1.80 | 1.69 |
| Memory | 1.00 | 1.05 | 1.11 |

The values represent the ratio relative to the Regular method.

We hope the above results address your concerns. Thank you again for your discussion!

CausalMM Team

Comment

Applicability and Generalization: The approach is tested on specific MLLMs, but it is unclear how well the method generalizes to other models. & Generality of the Approach: How well does your method generalize to other types of MLLMs, such as chameleon?

To demonstrate the effectiveness of our approach on multimodal large language models of different architectures, we added experimental results for the Q-Former-based InstructBLIP model and the embedding-autoregressive Chameleon model to the original experiments on the vision encoder–MLP–LLM paradigm.

Chameleon:

| Dataset | Setting | Method | Accuracy | Precision | Recall | F1 Score |
|---|---|---|---|---|---|---|
| MSCOCO | Random | Regular | 61.90 | 57.46 | 91.67 | 70.64 |
| | | Language | 69.23 | 63.17 | 92.27 | 74.99 |
| MSCOCO | Popular | Regular | 65.10 | 59.86 | 91.67 | 72.43 |
| | | Language | 69.43 | 63.34 | 92.27 | 75.12 |
| MSCOCO | Adversarial | Regular | 60.20 | 56.28 | 91.40 | 69.66 |
| | | Language | 64.00 | 58.94 | 92.33 | 71.95 |
| A-OKVQA | Random | Regular | 60.37 | 56.26 | 93.20 | 70.16 |
| | | Language | 65.70 | 60.14 | 93.13 | 73.08 |
| A-OKVQA | Popular | Regular | 57.30 | 54.25 | 93.20 | 68.58 |
| | | Language | 63.07 | 58.16 | 93.13 | 71.60 |
| A-OKVQA | Adversarial | Regular | 53.57 | 51.99 | 93.20 | 66.75 |
| | | Language | 56.83 | 53.96 | 93.13 | 68.33 |
| GQA | Random | Regular | 60.37 | 56.26 | 93.20 | 70.16 |
| | | Language | 68.43 | 62.18 | 94.13 | 74.89 |
| GQA | Popular | Regular | 59.37 | 55.76 | 90.67 | 69.05 |
| | | Language | 66.73 | 60.81 | 94.13 | 73.89 |
| GQA | Adversarial | Regular | 52.73 | 51.55 | 90.67 | 65.73 |
| | | Language | 57.77 | 54.50 | 94.13 | 69.03 |

In addition, the experimental results of InstructBLIP can be found in the Appendix.

Comment

We greatly appreciate your thoughtful critique and suggestions. Below is a summary of our revisions and clarifications based on your feedback:

  • Theoretical support for the validity of causal reasoning: In "Author Response to Reviewer CnJZ (Part 1)", we provided a comprehensive justification for the validity of causal reasoning. Correspondingly, we have improved potentially confusing sections in the revised version of the main text and added relevant theoretical derivations and justifications in the appendix.

  • Evidence for innovation and effectiveness: In "Author Response to Reviewer CnJZ (Part 3)", we presented comparisons between our method and several others. Additionally, in "Author Response to Reviewer CnJZ (Part 4)", we provided experimental data demonstrating the performance of our method on Meta's Chameleon model. We have also added experimental results for InstructBLIP and Chameleon in the appendix, which show that our method is applicable across several mainstream MLLM architectures. We welcome you to check these additions.

  • Applicability and generalizability: At your request, we conducted hyperparameter sensitivity tests and statistical significance analyses of our method. The specific data has been detailed in "Author Response to Reviewer CnJZ (Part 3)".

  • Discussion of limitations: In accordance with your request, we have added a discussion of the limitations of our method, along with corresponding content in the appendix of the paper.

We hope these revisions and clarifications address your concerns and look forward to any additional feedback or questions.

Comment

Hello,

Thank you for the detailed response. This clarifies the method significantly. I also appreciate the additional detailed results. I will update the rating accordingly.

Comment

Dear Reviewer CnJZ,

Thank you for your positive response and support for our work! We noticed that the rating has not been updated, so we would like to confirm this with you.

Thank you again for your time and assistance!

Yours sincerely,

CausalMM Team

Comment

Thank you for providing these results. After further reviewing them in detail, I noticed the following:

  1. For the LLaVA-1.5 section, the table appears to simply aggregate results from Table 1, which doesn’t introduce any new insights. This could be misleading, as the three datasets differ in size. Based on the main results from Table 1 in the paper, OPERA still demonstrates a best F1 score on both MSCOCO and GQA.

  2. Could you clarify the origin of DOLA? I read through the paper and the discussions, but I couldn’t find any explanation or citations for it.

Additionally, I have a few questions regarding specific figures:

  • What are the results of VCD and OPERA on VLind-Bench for the LLaVA and Qwen2-VL models in Figure 3?
  • What are the results of OPERA in Figure 4?
  • What are the results of OPERA in Figure 5?

I’d appreciate your clarification on these points. Thank you!

Comment

Thank you for your feedback; we are happy to resolve your confusion.

For the LLaVA-1.5 section, the table appears to simply aggregate results from Table 1, which doesn’t introduce any new insights. This could be misleading, as the three datasets differ in size. Based on the main results from Table 1 in the paper, OPERA still demonstrates a best F1 score on both MSCOCO and GQA.

The practice of averaging results across the three datasets is inspired by the experimental presentation format of the AGLA [1] paper we compare to, which averaged the results of the three components of POPE for comparison. Apologies for not mentioning this, which may have caused some confusion. Regarding the F1 score of OPERA, the results in Table 3, which cover more data, should better represent the overall performance of our method. We have checked all the main results in Table 1 of the paper and found an error in line 359. In the adversarial setting, the F1 score for OPERA is:

| Metric | Accuracy | Precision | Recall | F1 score |
|---|---|---|---|---|
| OPERA | 73.90 | 67.77 | 91.13 | 77.74* |

The F1 score in the paper was mistakenly listed as 84.59. We will correct this in a future version.

[1] An, W., Tian, F., Leng, S., Nie, J., Lin, H., Wang, Q., ... & Lu, S. (2024). AGLA: Mitigating Object Hallucinations in Large Vision-Language Models with Assembly of Global and Local Attention. arXiv preprint arXiv:2406.12718.

Could you clarify the origin of DOLA? I read through the paper and the discussions, but I couldn’t find any explanation or citations for it.

Of course. DOLA refers to DoLa: Decoding by Contrasting Layers Improves Factuality in Large Language Models [2]. Specifically, DoLA reduces hallucinations in LLMs by contrasting logits from later and earlier layers. We will add a citation to this paper in a future version. Thank you for the reminder.

[2] Chuang, Y. S., Xie, Y., Luo, H., Kim, Y., Glass, J., & He, P. (2023). Dola: Decoding by contrasting layers improves factuality in large language models. arXiv preprint arXiv:2309.03883.

  • What are the results of VCD and OPERA on VLind-Bench for the LLaVA and Qwen2-VL models in Figure 3?

In our rebuttal to reviewer vNn2, we provided experimental results for LLaVA on VLind-Bench and also included corresponding analysis for OPERA's test results. You can refer to this response: "Author Response to Reviewer vNn2 (Part.2)". The specific data is as follows:

| Metrics | S_ck | S_vp | S_cb | S_lp | CB | LP |
|---|---|---|---|---|---|---|
| Regular | 32.1 | 40.7 | 43.3 | 33.1 | 43.7 | 27.1 |
| VCD | 30.5 | 48.0 | 47.8 | 31.0 | 44.0 | 29.2 |
| OPERA | 0 | 0 | - | - | 0 | 0 |
| CausalMM | 57.0 | 80.8 | 64.0 | 61.8 | 59.9 | 40.2 |

As per your request, we have also provided the results of Qwen2-VL tested with VCD and OPERA methods:

| Metrics | S_ck | S_vp | S_cb | S_lp | CB | LP |
|---|---|---|---|---|---|---|
| Normal | 88.1 | 97.4 | 59.8 | 79.5 | 59.3 | 50.0 |
| VCD | 85.8 | 97.4 | 61.8 | 77.4 | 61.6 | 50.7 |
| OPERA | 0 | 0 | - | - | 0 | 0 |
| CausalMM | 94.0 | 97.7 | 65.7 | 78.5 | 65.6 | 51.3 |
Comment

The practice of averaging results across three datasets is inspired by the experimental presentation format in the AGLA [1] paper we are comparing to. In that paper, they averaged the results of the three components of POPE for comparison. Apologies for not mentioning this, which may have caused some confusion. Regarding the F1 score of OPERA, the results in Table 3 on more data should better represent the overall performance of our method. We have checked all the main results from Table 1 in the paper and found an error in line 359. In the adversarial setting...

Please double check all results, these mistakes in results presented can be misleading.

In our rebuttal to reviewer vNn2, we provided experimental results for LLaVA on VLind-Bench and also included corresponding analysis for OPERA's test results. You can refer to this response: "Author Response to Reviewer vNn2 (Part.2)".

Thank you for your response. While you provided results for DOLA, could you clarify why no results were included for OPERA?

Comment

Please double check all results, these mistakes in results presented can be misleading.

Thank you for your reminder! We fully understand the importance of data accuracy. We have thoroughly checked for any potential errors caused by LaTeX formatting and ensured the correctness of all results.

Thank you for response. While you provided results for DOLA, could you clarify why no results were included for OPERA?

Of course, we are happy to answer your questions.

LLaVA-1.5 / VLind-Bench

| Metrics | $S_{ck}$ | $S_{vp}$ | $S_{cb}$ | $S_{lp}$ | CB | LP |
|---|---|---|---|---|---|---|
| Regular | 32.1 | 40.7 | 43.3 | 33.1 | 43.7 | 27.1 |
| VCD | 30.5 | 48.0 | 47.8 | 31.0 | 44.0 | 29.2 |
| OPERA* | 0 | 0 | - | - | 0 | 0 |
| CausalMM | 57.0 | 80.8 | 64.0 | 61.8 | 59.9 | 40.2 |

Qwen2-VL / VLind-Bench

| Metrics | $S_{ck}$ | $S_{vp}$ | $S_{cb}$ | $S_{lp}$ | CB | LP |
|---|---|---|---|---|---|---|
| Normal | 88.1 | 97.4 | 59.8 | 79.5 | 59.3 | 50.0 |
| VCD | 85.8 | 97.4 | 61.8 | 77.4 | 61.6 | 50.7 |
| OPERA* | 0 | 0 | - | - | 0 | 0 |
| CausalMM | 94.0 | 97.7 | 65.7 | 78.5 | 65.6 | 51.3 |

These two tables should be the ones you were asking about. The results for the OPERA method are not missing; rather, they are all zero. The same phenomenon appears for other models in Table 2 on page 7 of the VLind-Bench paper [1]. The original paper does not offer an explanation or conclusion for this, but we speculate that certain methods may reduce the model's ability to follow specific instructions, leading to responses that score zero in the evaluation pipeline. In future work, we will investigate the mechanism behind this phenomenon and identify the factors that determine a model's instruction-following ability. We welcome your continued interest in our work!

[1] Lee K, Kim M, Yoon S, et al. VLind-Bench: Measuring Language Priors in Large Vision-Language Models[J]. arXiv preprint arXiv:2406.08702, 2024.

We hope the above explanation has resolved your doubts. Thank you for your feedback and support!

CausalMM Team

Review
5

The paper points out problems arising from biases induced by visual and language priors in the visual encoder and the LMM backbone, and notes that the causal relationship between attention and the model's output is often overlooked. In this study, a method called CausalMM identifies modality priors as confounders and addresses them through backdoor adjustment and counterfactual reasoning.

Strengths

  • The paper is well-written and easy to read.
  • Performance improvements were observed in the benchmarks used for evaluation.
  • The creation of counterfactual attention in various ways is novel.

Weaknesses

  • It is unclear if this method constitutes a backdoor adjustment. My understanding of backdoor adjustment involves identifying a confounder, then using it to reduce the confounder's impact, which differs from using counterfactuals as described here. Counterfactuals are typically used to measure natural direct effects or natural indirect effects, and I am curious if this method follows such an approach. Mathematical proof may be needed for line 225.
  • Please explain which paths need to be blocked in Figure 2. It appears that the path from the image to visual attention and from text token embedding to LLM attention should be blocked, but aren’t those paths essential?
  • This method seems closer to a contrastive decoding approach rather than backdoor adjustment. It appears to use counterfactual attention similarly to how negative samples are used.
  • The intervention on attention to reduce the influence of defined priors as confounders seems to also reduce the impact of input on attention.

Questions

  • Line 58 mentions the VCD method as considering only statistical correlation. I thought the VCD method also accounts for causation by using negative images to identify language priors. Why do you view the VCD method as overlooking causal relationships?
  • On page 5, does 'j' refer to the order of the method for creating counterfactual attention?
Comment

Please explain which paths need to be blocked in Figure 2. It appears that the path from the image to visual attention and from text token embedding to LLM attention should be blocked, but aren’t those paths essential?

Specifically, in addition to the paths from the image to visual attention and from the text token embeddings to LLM attention, the paths from the visual modality prior to visual attention and from the language modality prior to LLM attention should also be blocked; these paths were simply too short to render clearly as dashed lines in the figure. We have modified the figure to ensure readability.

Regarding the question of whether these paths are essential, this relates to our definition of the blocking operation in the backdoor adjustment discussed in the previous question. For example, blocking the path from image $I$ to attention $A$ can be understood as artificially assigning a value to $A$ so that the value of $A$ is no longer affected by $I$. In causal inference theory this is the do operation, $do(a)$, and the truncation of this path can be expressed as $P(I \mid do(A)) = P(I)$. The same applies to the other cases.
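To make the blocking operation more concrete, here is a brief sketch (our own simplification, keeping only the variables $I$, $P_v$, $A_i$, and $O$ from the visual branch) using the truncated-factorization view of the do operator. Intervening on the attention deletes the mechanism that generates it, so

$$P(I, P_v, A_i, O) = P(I)\,P(P_v)\,P(A_i \mid I, P_v)\,P(O \mid A_i, I, P_v)$$

becomes, under $do(A_i = a_i)$,

$$P(I, P_v, O \mid do(A_i = a_i)) = P(I)\,P(P_v)\,P(O \mid A_i = a_i, I, P_v),$$

from which $P(I \mid do(A_i = a_i)) = P(I)$ follows by marginalizing over $P_v$ and $O$.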

This method seems closer to a contrastive decoding approach rather than backdoor adjustment. It appears to use counterfactual attention similarly to how negative samples are used.

Thank you for your feedback. We explained backdoor adjustment in detail in our response to the first question. Below is a table distinguishing our method from contrastive decoding. In short, contrastive decoding contrasts outputs derived from original and distorted inputs. In contrast, CausalMM isolates the influence of modality priors and other confounders on multimodal attention via backdoor adjustment, obtains the positive treatment effect of attention on the output through counterfactual reasoning, adjusts the model's output at the attention and feature levels, and thereby balances the modality priors.

We use VCD [1] for a concrete comparison:

Tabular Comparison of CausalMM and VCD

| Feature | CausalMM | VCD |
|---|---|---|
| Core Methodology | Structural Causal Model (SCM) with backdoor adjustment and counterfactual reasoning | Contrastive decoding |
| Focus of Intervention | Visual and language attention mechanisms, visual features, and LLM hidden states | Input image |
| Mechanism of Action | (1) de-confound; (2) obtain the positive treatment effect; (3) adjust attention, features, and hidden states; (4) balance the modality priors | Contrasts outputs derived from original and distorted image inputs |
| Versatility | Multimodal hallucinations (vision + language) | Object hallucinations |
| Support for single-modal tasks (such as LLM) | ✓ | × |
| Exploring the causal mechanisms within the model | ✓ | × |
| Dealing with the confounding effects of modality priors | ✓ | × |
| Modality Priors Addressed | Visual and language priors | - |

[1] Leng, Sicong, et al. "Mitigating object hallucinations in large vision-language models through visual contrastive decoding." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024.

Comment

The intervention on attention to reduce the influence of defined priors as confounders seems to also reduce the impact of input on attention.

Your understanding is correct. However, the influence obtained after the intervention is not directly and positively correlated with the final result. The attention and corresponding features obtained through intervention and backdoor adjustment serve as causal probability anchors, helping us dynamically obtain, through counterfactual reasoning, the positive treatment effect of effective attention on the model output, which is then used to balance modality priors at the feature and hidden-state levels. The corresponding formulas in the paper are as follows:

For the visual attention ($A_i$):

$$P_{effect\_V} = E_{A_i \sim \tilde{A}_i}\left[P(O \mid A_i = \mathbf{A}_i, I = \mathbf{I}, P_v = \mathbf{P}_v) - P(O \mid \text{do}(A_i = \mathbf{a}_i), I = \mathbf{I}, P_v = \mathbf{P}_v)\right].$$

Here, $P_{effect\_V}$ represents the causal effect of the visual attention mechanism on the model output $O$. The term $\mathbf{A}_i$ denotes the observed visual attention, whereas $\mathbf{a}_i$ represents the intervention applied to the visual attention.

For the LLM attention ($A_t$):

$$P_{effect\_L} = E_{A_t \sim \tilde{A}_t}\left[P(O \mid A_t = \mathbf{A}_t, T_t = \mathbf{T}_t, P_l = \mathbf{P}_l) - P(O \mid \text{do}(A_t = \mathbf{a}_t), T_t = \mathbf{T}_t, P_l = \mathbf{P}_l)\right],$$

where $P_{effect\_L}$ denotes the causal effect of the language model attention on the output $O$. The notation $\mathbf{A}_t$ is the observed language model attention, and $\mathbf{a}_t$ is the intervention applied to the language model attention.

In a multimodal setting, the combined causal effect is given by:

$$P_{effect\_M} = E_{A_i, A_t \sim \tilde{A}_i, \tilde{A}_t}\left[P(O \mid A_i = \mathbf{A}_i, A_t = \mathbf{A}_t, I = \mathbf{I}, T_t = \mathbf{T}_t, P_v = \mathbf{P}_v, P_l = \mathbf{P}_l) - P(O \mid \text{do}(A_i = \mathbf{a}_i), \text{do}(A_t = \mathbf{a}_t), I = \mathbf{I}, T_t = \mathbf{T}_t, P_v = \mathbf{P}_v, P_l = \mathbf{P}_l)\right],$$

where $P_{effect\_M}$ represents the combined causal effect of both visual and language attention mechanisms on the output $O$.
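For intuition, below is a minimal, self-contained sketch of this contrast in code (our illustration only, not the released CausalMM implementation; the toy attention head, the uniform counterfactual attention, and the contrastive coefficient `gamma` are all assumptions made for the example):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attention_output(attn, values, w_out):
    """Toy head: attention-weighted pooling of value vectors, then a linear map to logits."""
    pooled = attn @ values      # (n_tokens,) @ (n_tokens, d) -> (d,)
    return w_out @ pooled       # (vocab, d) @ (d,) -> (vocab,)

rng = np.random.default_rng(0)
n_tokens, d, vocab = 6, 8, 10
values = rng.normal(size=(n_tokens, d))   # stand-in for visual/text token features
w_out = rng.normal(size=(vocab, d))       # stand-in for the output head

# Factual attention vs. an intervened (counterfactual) attention do(A = a)
attn_factual = softmax(rng.normal(size=n_tokens))
attn_counterfactual = np.full(n_tokens, 1.0 / n_tokens)  # e.g., a uniform intervention

logits_factual = attention_output(attn_factual, values, w_out)
logits_counterfactual = attention_output(attn_counterfactual, values, w_out)

# Treatment effect of attention on the output distribution (analogous to P_effect above)
p_effect = softmax(logits_factual) - softmax(logits_counterfactual)

# One possible contrastive-style adjustment of the logits (gamma is a hypothetical coefficient)
gamma = 0.5
logits_adjusted = (1 + gamma) * logits_factual - gamma * logits_counterfactual
print(p_effect)
print(softmax(logits_adjusted))
```

In the actual method, the counterfactual attention would come from the interventions on the ViT and LLM attention layers discussed above, rather than the toy uniform map used here.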

Line 58 mentions the VCD method as considering only statistical correlation. I thought the VCD method also accounts for causation by using negative images to identify language priors. Why do you view the VCD method as overlooking causal relationships?

We agree that our original statement may have caused confusion, so we have adjusted it in the paper. If causality is understood in a broad sense, your point is correct. However, within the formal framework of causal inference, VCD is not grounded in any explicit causal model.

On page 5, does 'j' refer to the order of the method for creating counterfactual attention?

In the CausalMM method, the index "j" iterates over all tokens in the denominator of the softmax normalization. We have added a clarification of this in the revision.
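For concreteness, this is the standard softmax normalization, with $j$ ranging over all tokens in the denominator:

$$\text{softmax}(z)_k = \frac{\exp(z_k)}{\sum_j \exp(z_j)}.$$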

Comment

Hello,

Thank you for your thoughtful response. I appreciate the detailed explanation.

First, I understand the equation for backdoor adjustment and the logic behind it. Additionally, the table clearly highlights the differences between CausalMM and VCD, which I found very helpful. Thank you for providing that comparison.

That said, I believe that the intervention in backdoor adjustment can be understood as utilizing counterfactual attention. However, upon further reflection, I see this process as being closer to computing the direct effect rather than purely applying backdoor adjustment. Specifically, when comparing the equations for controlled direct effect and natural direct effect, the equation in line 224 appears very similar, suggesting that it is effectively capturing the direct effect of attention.

Moreover, I view contrastive decoding as another way of computing the direct effect, and your method seems to align with this approach. In this sense, your method appears similar to contrastive decoding in its focus and implementation.

Thank you again for your response, and I look forward to the final version of your paper.

Comment

We are deeply grateful for your recognition of our work's innovation and thoroughness, as well as your constructive feedback. We have addressed each suggestion regarding the manuscript's weaknesses and made the necessary revisions.

It is unclear if this method constitutes a backdoor adjustment. My understanding of backdoor adjustment involves identifying a confounder, then using it to reduce the confounder's impact, which differs from using counterfactuals as described here. Counterfactuals are typically used to measure natural direct effects or natural indirect effects, and I am curious if this method follows such an approach. Mathematical proof may be needed for line 225.

Thank you for your interest in the underlying theory. In our work, backdoor adjustment and counterfactual reasoning are combined to drive the entire mechanism. The role of backdoor adjustment is to identify confounders such as modality priors and to reduce their impact when identifying the causal relationship between attention and model output. Whether the states of the intervened variables are described as "counterfactuals" depends on the specifics of each study. The counterfactuals you describe correspond to the counterfactual reasoning we apply afterwards, which has been widely used in machine learning. We use counterfactual reasoning to obtain the positive treatment effect brought about by intervening on attention; the relevant formulas are described in detail in Section 3.3.

The following is a further explanation of backdoor adjustment in our work:

Structural Causal Model (SCM)

Variables and their roles:

$A$ (attention): This represents the model's attention mechanism that we aim to evaluate or manipulate.

$M$ (modality priors): Modality priors influence both the model's attention ($A$) and the output ($O$), thus creating confounding.

$O$ (model output): The outcome variable, which is affected both directly by $A$ and indirectly through $M$.

Causal structure and backdoor paths:

The backdoor path in this SCM is $A \leftarrow M \to O$, which starts with an arrow pointing into $A$ and creates a confounding junction structure.

To isolate the causal effect of $A$ on $O$, the confounding influence of $M$ must be blocked.

Backdoor Criterion:

To apply backdoor adjustment, the adjustment set $M$ must satisfy the following criteria:

1. $M$ blocks all backdoor paths from $A$ to $O$.

2. $M$ does not include any descendants of $A$ (i.e., variables causally influenced by $A$).

By intervening on $A$ and adjusting for $M$, we can isolate the causal effect of $A$ on $O$.

Backdoor Adjustment Formula:

Given a sufficient adjustment set $M$, the causal effect $P(o \mid do(a))$ is identified as:

$$P(o \mid do(a)) = \sum_m P(o \mid a, m) P(m)$$

Derivation:

1. Starting with the interventional distribution:

$$P(o \mid do(a)) = \sum_m P(o \mid do(a), m) P(m \mid do(a))$$

2. Using the property of the intervention $do(a)$:

Under the intervention $do(a)$, the variable $A$ is no longer influenced by $M$. Thus:

$$P(m \mid do(a)) = P(m)$$

3. Replacing $P(o \mid do(a), m)$ with the observational counterpart:

Due to the backdoor criterion, $M$ blocks all confounding paths, allowing:

$$P(o \mid do(a), m) = P(o \mid a, m)$$

4. Combining these results:

$$P(o \mid do(a)) = \sum_m P(o \mid a, m) P(m)$$

Application to Attention-Output Framework:

In the context of our framework:

1. Backdoor path:

The backdoor path $A \leftarrow M \to O$ reflects the confounding effect of modality priors ($M$) on the attention mechanism ($A$) and the model's output ($O$).

2. Intervention:

By intervening on $A$, we ensure that the causal effect of attention on the output is isolated, free from the influence of modality priors.

3. Adjustment:

To block the backdoor path, we adjust for $M$, computing the summation over all possible values of $M$ to account for its confounding effect.

Full Formula for the Framework:

In our framework, the causal effect of attention ($A$) on the model output ($O$) can be computed as:

$$P(o \mid do(a)) = \sum_m P(o \mid a, m) P(m)$$

$P(o \mid a, m)$: The conditional probability of the output given attention $A$ and modality priors $M$.

$P(m)$: The marginal probability of modality priors $M$.
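As a purely illustrative toy example (the binary modality prior and all probabilities below are invented for this sketch and are not taken from the paper), the adjustment formula can be evaluated directly on a small discrete model:

```python
# Toy backdoor adjustment: P(o | do(a)) = sum_m P(o | a, m) P(m)
# M is a binary modality prior; all numbers are invented for illustration.
p_m = {0: 0.7, 1: 0.3}              # P(M = m)
p_o_given_a_m = {0: 0.6, 1: 0.2}    # P(O = 1 | A = a, M = m) for one fixed intervention a

p_o_do_a = sum(p_o_given_a_m[m] * p_m[m] for m in p_m)
print(p_o_do_a)  # 0.6 * 0.7 + 0.2 * 0.3 = 0.48
```

By contrast, the naive observational estimate $P(o \mid a) = \sum_m P(o \mid a, m) P(m \mid a)$ would weight each $m$ by $P(m \mid a)$, which is exactly where the confounding enters.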

We added the corresponding content in the appendix of the revision.

[1] Judea Pearl. Causality. Cambridge university press, 2009.

[2] Kexuan Zhang, Qiyu Sun, Chaoqiang Zhao, and Yang Tang. Causal reasoning in typical computer vision tasks. arXiv:2307.13992, 2023a.

Comment

We are delighted to clarify the differences between our method and VCD. Thank you for recognizing our rebuttal.

We greatly appreciate your willingness to discuss the theories related to causality. We fully agree with your interpretation of backdoor adjustment and value your understanding of the related theory. However, some important misunderstandings appear to remain unresolved. The following points should address these concerns:

  • In our method, backdoor adjustment and counterfactual reasoning are two closely related but different stages (as we have explained in our responses to Reviewers CnJZ and vNn2). These two theories are applied in different processes and at different scales.

  • You may have understood our backdoor adjustment and counterfactual reasoning as describing the same process, but they are distinct stages. For example, the part you mentioned in line 224 pertains to counterfactual reasoning rather than backdoor adjustment. The formula for backdoor adjustment, as mentioned earlier (using example variables), is:

$$P(o \mid do(a)) = \sum_m P(o \mid a, m) P(m)$$

  • In the initial version of our paper, we did not provide a detailed explanation of backdoor adjustment due to space limitations. This might have led to some confusion. Therefore, in the revised version, we added a detailed explanation of backdoor adjustment in the appendix and clarified it in the main text to prevent further misunderstanding. These additions are highlighted in the revised manuscript for your reference.

  • Specific distinctions between the two stages in our method:

    • Backdoor adjustment supports the theoretical justification for our method’s ability to estimate the causal effects of other variables despite the confounding factor of modality priors. The key variable in this stage is the modality prior.

    • Counterfactual reasoning involves estimating the causal effect of attention on the model's output under the assumption that attention fails (i.e., using counterfactual attention). The key variable in this stage is attention.

  • You interpreted the theorem presented in line 224 as representing direct effect, but it actually represents counterfactual reasoning (as we have explained to other reviewers). The key distinction is that counterfactual reasoning estimates the effect of an event under the hypothetical condition that it does not occur (fails), while direct effect refers to the isolated effect of a variable on the outcome when other variables are held constant [1]. Our method estimates the causal effect of attention on the model’s output under the assumption of attention failure. If we were to use direct effect, we would instead focus on the influence of other variables on the outcome when attention is controlled.

  • Regarding your statement that VCD is similar to direct effect, we agree. However, VCD does not constitute counterfactual reasoning. Specifically:

    • In counterfactual reasoning, interventions isolate the influence of specific variables by controlling the values of certain nodes in the causal graph. In contrast, VCD utilizes distorted inputs merely to amplify input biases, rather than to simulate hypothetical conditions.

    • Counterfactual reasoning aims to measure the causal effect of a specific variable, while VCD’s process leans more toward statistical adjustment rather than causal modeling.

[1] Judea Pearl. Causality. Cambridge University Press, 2009.

We hope these clarifications help resolve any misunderstandings about our work. We are also delighted to engage in discussions on causal reasoning, as we believe it is a theoretical framework deserving more attention.

Should you have further suggestions for improving the paper, please let us know! If we have addressed your concerns, we hope you would reconsider the rating.

CausalMM Team

Comment

Hello, Thank you for the detailed response and the clarifications provided. Your explanation has significantly improved my understanding of the method. I recognize and acknowledge my misunderstanding. I will update my rating accordingly.

Comment

Dear Reviewer VPyB,

Thank you for your positive feedback and approval of our work! We noticed that the rating has not been updated, so we would like to confirm this with you.

Thanks again for taking the time to discuss!

Yours sincerely,

CausalMM Team

Comment

Dear reviewers, AC, SAC, and PC,

First of all, we would like to express our sincere gratitude to the reviewers for their valuable time and insightful comments. We are pleased to see that the reviewers have agreed with several positive aspects of our paper, such as novelty and significance (Reviewers VPyB, CnJZ, vNn2, gJXZ), performance improvement (Reviewers VPyB, CnJZ, vNn2, gJXZ), and good writing (Reviewers VPyB, vNn2).

Your expertise has greatly helped us strengthen our manuscript; these have been some of the most helpful comments we have received in years! We have made a concerted effort to address all the major issues raised, and we sincerely appreciate the reviewers' updated ratings and thorough recognition.

Sincerely,

CausalMM Team

AC Meta-Review

Summary:

This paper introduces CausalMM, a framework combining backdoor adjustment and counterfactual reasoning to mitigate hallucinations in multimodal large language models (MLLMs). The method treats modality priors as confounders between attention mechanisms and model outputs.

Strengths:

  1. Practical solution for MLLMs:

"The method is plug-and-play and does not require retraining, making it practical for existing MLLMs" (Reviewer vNn2)

  2. Reasonable empirical validation:

"The authors conduct experiments on multiple benchmarks, including VLind-bench, POPE, and MME benchmarks" (Reviewer CnJZ)

  3. Novel perspective:

"The paper introduces the idea of applying causal inference techniques to address modality prior-induced hallucinations in MLLMs" (Reviewer CnJZ)

Weaknesses:

  1. Limited theoretical foundation:

"The background knowledge about causal inference is insufficient. The authors do not explain why causal inference is effective in capturing the causal impact of effective attention in MLLM output" (Reviewer vNn2)

  2. Marginal improvements:

"Though the proposed method is somewhat novel, the experimental results are not quite significant and robust compared with existing methods (Table 1)" (Reviewer vNn2)

Justification:

Despite reservations, I recommend acceptance for several reasons:

  1. The paper offers a new perspective on addressing hallucinations in MLLMs through causal inference, even if the theoretical foundations could be stronger.

  2. Two reviewers (VPyB and CnJZ) increased their ratings after author responses, indicating the paper's issues are not fundamental flaws but rather limitations that can be addressed.

  3. The method is immediately applicable without retraining, providing practical value despite modest improvements.

However, several limitations temper our enthusiasm:

  • Improvements over baselines are modest (2-3%)
  • Theoretical justification remains somewhat unclear
  • Implementation details could be more complete
  • Comparison with simpler alternatives like VCD isn't fully convincing

While the authors addressed many reviewer concerns, questions about the method's theoretical foundations and practical impact remain. The acceptance is based more on the potential of the approach and reviewer rating improvements rather than strong conviction about the current results.

Additional Comments on Reviewer Discussion

Both Reviewer VPyB and CnJZ increased their ratings after detailed discussions with authors. Notably, VPyB's major concerns about backdoor adjustment were resolved through mathematical clarification. The authors added:

  • Statistical significance analysis
  • Additional model experiments (Chameleon, InstructBLIP)
  • Clearer theoretical justification
  • More comprehensive baseline comparisons

However, some fundamental concerns about experimental rigor and theoretical foundations remain partially unaddressed.

Final Decision

Accept (Poster)