PaperHub
Score: 6.3 / 10 (Poster, 4 reviewers)
Lowest: 6 · Highest: 7 · Std. dev.: 0.4
Ratings: 6, 7, 6, 6
Confidence: 3.5
COLM 2025

Traceable and Explainable Multimodal Large Language Models: An Information-Theoretic View

Submitted: 2025-03-23 · Updated: 2025-08-26
TL;DR

We introduce an information-theoretic framework that uses mutual information, a Concept Bottleneck, and an InfoNCE mechanism to explain how multimodal models align and integrate visual and textual inputs.
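As a rough illustration of the InfoNCE mechanism the TL;DR refers to, the sketch below estimates a contrastive lower bound on mutual information between paired representations; the tensor names and the cosine critic are illustrative assumptions, not the paper's implementation.

```python
# Illustrative sketch only (not the authors' code): an InfoNCE-style lower bound
# on mutual information between paired representations, e.g. visual features and
# concept-bottleneck features from the same sample.
import torch
import torch.nn.functional as F

def infonce_mi_lower_bound(z_a: torch.Tensor, z_b: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
    """I(A; B) >= log(N) - L_InfoNCE for a batch of N paired samples (z_a[i] <-> z_b[i])."""
    z_a = F.normalize(z_a, dim=-1)               # (N, d)
    z_b = F.normalize(z_b, dim=-1)               # (N, d)
    logits = z_a @ z_b.t() / temperature         # similarity of every cross pair
    labels = torch.arange(z_a.size(0), device=z_a.device)
    nce_loss = F.cross_entropy(logits, labels)   # positive pairs sit on the diagonal
    return torch.log(torch.tensor(float(z_a.size(0)))) - nce_loss
```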


Keywords
multimodal LLM, information theory

Reviews and Discussion

Review (Rating: 6)

The paper introduces an information-theory-based method for analyzing interactions between visual tokens and text tokens in multimodal LLMs. The framework trains a concept encoder to extract concept bottleneck information. The authors conduct ablation studies on a few benchmarks with carefully designed visual and textual prompting to study the interaction relationships.

Reasons to accept

  1. This is the first method to quantitatively measure the relationship between visual and text tokens with the help of information theory.
  2. The authors conduct studies on several benchmarks with different experimental settings. The quantitative results align with human intuition about MLLMs.
  3. The information-theory-based analysis provides new directions and inspiration for MLLM interpretability research.

Reasons to reject

  1. It would help the audience's understanding if the authors could provide more detailed proofs of Lemmas 3.3 and 3.4.
  2. In the Experiments section, the authors only test with LLaVA 1.5; it would be more convincing to include other MLLMs, such as Qwen-2.5-VL, LLaMA3-V, etc.
  3. The four stages observed in LLaVA may not exist in other MLLMs.
  4. In Finding 3, captioning can also be regarded as one type of QA; please clarify the experimental setting here.
  5. In Findings 4 and 5, conclusions are drawn from comparing the QA task with the captioning task; however, the captioning benchmark is COCO-Cap and the concept encoder is also trained with MS-COCO, so there may exist a potential distribution shift. A more detailed analysis, or switching the caption evaluation to other benchmarks, may be helpful.
  6. In Line 288, the authors mention "a stable F_v in Figure 1b"; please clarify which line this refers to.
  7. In Line 295, the experimental setting is "original POPE queries with visual prompted images", but a later sentence says "Only visual prompt", which is confusing.
  8. In Finding 9, a textual reference to the image can enhance visual information injection substantially; however, this might be related to how LLaVA's instruction-following training data is formatted. This needs more investigation.
Comment

6. Line 288 Reference to "stable F_v"

In Finding 7, we noted that the image reference was unclear. Specifically, "Figure 1b" refers to the left image in Figure 2 in the same RQ2 section, while "Figure 1d" refers to the right image in Figure 2 of the same section. We will make sure to update this to improve the clarity of our paper.

The entire section on Finding 7 discusses the influence of "Disruptive Prompts." We describe the experimental setup regarding "Disruptive Prompts" in the first paragraph of RQ2, around line 270.

7. Ambiguous Experiment Setting in Line 295

In our paper, we refer to the setting of "original POPE queries with visual prompted images" as "Only visual prompt" to enhance the conciseness of the text and image legends. However, we recognize that this may lead to confusion. To improve clarity, we will update the reference to "Visual prompt without text reference."

8. Training Data Influence in Finding 9

Thank you for the suggestion. To verify that this effect is not specific to LLaVA's training format, we conducted the same experiment on the Qwen model. We observe that the information flow trends hold consistently with those traced on LLaVA: settings that refer to visual prompts through text present a higher F_v. These results support the generality of Finding 9 across models, suggesting that the enhancement is not merely due to LLaVA's formatting but reflects a broader multimodal behavior.

Qwen 2.5 VL 3B result for RQ3 on POPE (external PDF Figure 4)

| Setting / Layer | 10 | 15 | 20 | 25 | 30 |
| --- | --- | --- | --- | --- | --- |
| No visual prompt | -0.030 | 0.023 | 0.016 | 0.039 | 0.026 |
| Only visual prompt | -0.035 | 0.010 | 0.001 | 0.018 | 0.022 |
| Correct Text Reference | -0.011 | 0.017 | 0.028 | 0.045 | 0.046 |
| Wrong Text Reference | -0.012 | 0.025 | 0.031 | 0.045 | 0.040 |

[1] Li, Bohao, et al. "Seed-bench: Benchmarking multimodal large language models." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024.

[2] Zhu, King, et al. "LIME: Less Is More for MLLM Evaluation." arXiv preprint arXiv:2409.06851 (2024).

Comment

Thank the authors for providing more experiments and answering most of my questions. I have updated the score accordingly.

Comment

Dear Reviewer fJZx,

Thank you for your valuable comments. We have carefully reviewed your comments and have provided responses to each of the points you raised.

1. Proofs of Lemmas 3.3 and 3.4

We will add detailed proofs of Lemmas 3.3 and 3.4 in the appendix to improve clarity.

2. Model Diversity in Experiments

We have expanded our experiments to include Qwen 2.5 VL, a more advanced and powerful MLLM. The results are presented in the tables below, with values at key layers illustrating the curve trends.

Our findings indicate that the Qwen model also exhibits a four-stage information processing flow pattern, consistent with our observations from LLaVA. This alignment between Qwen VL and LLaVA confirms that the findings reported in our paper are applicable across different MLLMs, further strengthening the generality of our findings.

Qwen 2.5 VL 3B Result F_v (external PDF Figure 2(a), Figure 2(b))

| Dataset / Layer | 6 | 11 | 16 | 21 | 26 | 31 | 36 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| VQA-v2 | 0.008 | 0.013 | 0.050 | 0.033 | 0.050 | 0.016 | 0.002 |
| COCO Caption | -0.006 | 0.007 | 0.042 | 0.012 | 0.074 | 0.051 | -0.006 |
| AOKVQA | 0.002 | 0.007 | 0.028 | 0.004 | 0.025 | -0.007 | -0.015 |
| HAL-Eval | -0.001 | 0.022 | 0.051 | 0.035 | 0.073 | 0.054 | 0.024 |

Qwen 2.5 VL 3B Result F_t (external PDF Figure 2(c), Figure 2(d))

| Dataset / Layer | 5 | 10 | 15 | 20 | 25 | 30 | 35 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| VQA-v2 | -0.033 | 0.003 | 0.020 | 0.056 | 0.062 | 0.316 | 0.480 |
| COCO Caption | -0.022 | -0.008 | 0.015 | -0.019 | -0.064 | 0.103 | 0.149 |
| AOKVQA | 0.013 | 0.130 | 0.251 | 0.292 | 0.524 | 0.589 | 0.540 |
| HAL-Eval | -0.047 | -0.044 | -0.063 | -0.072 | -0.067 | 0.158 | 0.408 |

3. Generality of the Four-Stage Finding

As noted in response to question 2, we observe a consistent four-stage information flow pattern in Qwen, despite it being a different MLLM with distinct ViT architecture, language decoder architecture, and parameter counts. This observation helps to validate the generalizability of the findings presented in our paper.

4. Experiment Setting in Finding 3 (Caption vs. QA)

In the context of our study, we treat captioning and QA as two distinct tasks; captioning is not regarded as QA, and we did not state in our paper that captioning can be regarded as one type of QA. In studies on MLLM evaluation, it is common practice to include both captioning and QA tasks. [1, 2] The primary objective of Finding 3 is to analyze information flow patterns in MLLMs across these two different multimodal tasks. We select captioning and QA because they represent two critical and classical tasks in vision-language research, making their comparison particularly valuable for understanding model behaviors.

5. Potential Distribution Shift in Findings 4 & 5 (COCO-Cap vs. Concept Encoder)

Our current four benchmarks cover both in-domain and out-of-domain data, which are carefully selected to compare the information flow patterns of MLLMs across different tasks (captioning and question-answering) and different data distributions (including in-domain and out-of-domain, such as COCO with Hal-Eval and VQA with AOKVQA).

Based on such experimental design and the comparisons made across these settings, including the in-domain and out-of-domain settings, we derive several insights, including Findings 4 and 5. Specifically, we find that: (1) QA tasks rely more heavily on injected contextual information compared to captioning tasks; and (2) when considering different data sources (in-domain vs. out-of-domain, and with or without inconsistent textual prompts), information injection is less effective, particularly for out-of-domain data with inconsistent prompts.

Comment

For R3, how do you prompt the model to get the caption? If you use a question like "please describe the image in detail", this should be considered a type of QA.

R5 didn't answer my question. How about using Flickr?

Review (Rating: 7)

The paper introduces an information-theoretic framework to understand and quantify how textual instructions shape multimodal representations in Multimodal Large Language Models (MLLMs). By leveraging principles from the Information Bottleneck and Concept Bottleneck, the authors propose a framework that maps complex multimodal representations into interpretable latent spaces. An InfoNCE-based contrastive mechanism is then used to explicitly separate and quantify the contribution of textual input from raw visual information. Empirical studies are conducted to demonstrate the dynamics of multimodal processing, including how visual information undergoes multi-stage transformations when influenced by various types of textual instructions, and how disruptions in text input can lead to degradation or hallucinations in visual representation quality. Overall, the paper is theoretically strong and offers several empirically backed inferences that would be useful to the multimodal community.

Reasons to accept

  1. A strong and novel theoretical framework that combines the Information Bottleneck principle with a concept bottleneck approach, offering a systematic, quantitative way to disentangle and study the role of text in shaping visual representations.

Reasons to reject

  1. It remains unclear whether the proposed framework can be easily generalized across different architectures and tasks within multimodal systems. Ideally, it would be nice to see this study across two broad categories of MLLMs, i.e., unified-embedding ones like LLaVA and cross-modality-attention ones like NVLM from NVIDIA.

  2. It would be interesting to see whether the inferences/findings are consistent across different vision embedding models like CLIP, ImageBind, and other unified embedding models.

  3. Some discussion is needed on how the findings can help us design better multimodal architectures or training mechanisms in the future.

Questions for the authors

Findings 7 and 8 are not clear. Can you elaborate on how the results support these claims?

Comment

Dear Reviewer j1jo,

Thank you for your valuable comments. We have carefully reviewed your comments and have provided responses to each of the points you raised.

1. Generalization Across Architectures

To further strengthen the generalizability of our findings, we have expanded our experiments to include Qwen 2.5 VL, a more advanced and powerful MLLM with a different architecture from LLaVA. The results are presented in the tables below, with values at key layers illustrating the curve trends.

We observe that the Qwen model also exhibits a four-stage information processing flow pattern, consistent with our findings from LLaVA. This alignment between Qwen VL and LLaVA confirms that the findings reported in our paper are applicable across different MLLMs.

Qwen 2.5 VL 3B Result F_v (external PDF Figure 2(a), Figure 2(b))

| Dataset / Layer | 6 | 11 | 16 | 21 | 26 | 31 | 36 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| VQA-v2 | 0.008 | 0.013 | 0.050 | 0.033 | 0.050 | 0.016 | 0.002 |
| COCO Caption | -0.006 | 0.007 | 0.042 | 0.012 | 0.074 | 0.051 | -0.006 |
| AOKVQA | 0.002 | 0.007 | 0.028 | 0.004 | 0.025 | -0.007 | -0.015 |
| HAL-Eval | -0.001 | 0.022 | 0.051 | 0.035 | 0.073 | 0.054 | 0.024 |

Qwen 2.5 VL 3B Result F_t (external PDF Figure 2(c), Figure 2(d))

| Dataset / Layer | 5 | 10 | 15 | 20 | 25 | 30 | 35 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| VQA-v2 | -0.033 | 0.003 | 0.020 | 0.056 | 0.062 | 0.316 | 0.480 |
| COCO Caption | -0.022 | -0.008 | 0.015 | -0.019 | -0.064 | 0.103 | 0.149 |
| AOKVQA | 0.013 | 0.130 | 0.251 | 0.292 | 0.524 | 0.589 | 0.540 |
| HAL-Eval | -0.047 | -0.044 | -0.063 | -0.072 | -0.067 | 0.158 | 0.408 |

2. Generalization Across Vision Embeddings

As mentioned previously, we further evaluated the information flow pattern in Qwen 2.5 VL, which employs a more advanced Vision Transformer (ViT) than CLIP, incorporating convolution, attention, normalization, and other techniques. [1]

The four-stage information flow pattern identified by our methods in LLaVA is also evident in Qwen, despite its different ViT and language decoder architecture. This consistency further strengthens the generalizability of the findings presented in our paper.

3. Design Implications for Future MLLMs

We have combined and analyzed the information flow patterns we traced, and we present some prospective guidance for the design of MLLM architectures and training mechanisms.

One prospective piece of guidance on training is to add layer-wise auxiliary losses during training on the quantities traced by our methods, such as (1) an auxiliary loss term to minimize F_v after the filtering stage to encourage maintaining more visual information, or (2) an auxiliary term that maximizes F_t - F_v near the output layer in knowledge-intensive tasks, which can hopefully enhance task-relevant alignment and LLM knowledge injection in MLLMs. [2]
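A minimal, hypothetical sketch of how such auxiliary terms could be wired into a training step follows; `estimate_Fv` and `estimate_Ft` are stand-ins for the paper's layer-wise estimators, and the layer index and weights are illustrative assumptions rather than the authors' recipe.

```python
# Hypothetical sketch of the suggested layer-wise auxiliary losses (an assumption,
# not the paper's training procedure). estimate_Fv / estimate_Ft stand in for the
# F_v / F_t estimators applied to a chosen layer's hidden states.
import torch

def training_step(model, batch, estimate_Fv, estimate_Ft,
                  filter_layer=16, lambda_v=0.1, lambda_t=0.1):
    outputs = model(**batch, output_hidden_states=True)
    task_loss = outputs.loss
    hidden = outputs.hidden_states               # tuple of (B, T, d), one per layer

    # (1) auxiliary term that minimizes F_v after the filtering stage, as suggested above
    fv_mid = estimate_Fv(hidden[filter_layer])
    # (2) auxiliary term that maximizes F_t - F_v near the output layer
    ft_out = estimate_Ft(hidden[-1])
    fv_out = estimate_Fv(hidden[-1])

    return task_loss + lambda_v * fv_mid - lambda_t * (ft_out - fv_out)
```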

Regarding model architecture, our findings related to RQ(3) indicate that textual tokens significantly influence visual information during MLLM inference, which may lead to visual information loss. Therefore, it may be beneficial to design new architectures that allow visual information to bypass certain layers, thereby preserving more specific visual details and potentially reducing biases in MLLMs. [3, 4]

[1] Bai, Shuai, et al. "Qwen2.5-VL technical report." arXiv preprint arXiv:2502.13923 (2025).

[2] Wu, Junda, et al. "Mitigating Visual Knowledge Forgetting in MLLM Instruction-tuning via Modality-decoupled Gradient Descent." arXiv preprint arXiv:2502.11740 (2025).

[3] Ghatkesar, Aarti, Uddeshya Upadhyay, and Ganesh Venkatesh. "Looking beyond language priors: Enhancing visual comprehension and attention in multimodal models." arXiv preprint arXiv:2505.05626 (2025).

[4] Wang, Chenxi, et al. "Mllm can see? dynamic correction decoding for hallucination mitigation." arXiv preprint arXiv:2410.11779 (2024).

Comment

Thanks to the authors for conducting new experiments and providing clarification on the previously raised concerns. I don't have any other comments.

Review (Rating: 6)

This paper proposes an information‑theoretic framework for tracing multimodal interactions in vision‑language models. By estimating layer‑wise mutual‑information metrics, the authors reveal how visual features are progressively filtered, blended, and ultimately compressed as textual context flows through the layers of LLaVA‑1.5.

Overall, the motivation is clear and the framework could become a useful diagnostic tool for multimodal LLMs, pinpointing where visual and textual signals converge. Nonetheless, the study has two notable weaknesses: (a) crucial mathematical notation in Section 3.1 is ambiguous, and (b) the main claims rely entirely on the behaviour of the two proposed metrics without additional supporting experiments. Further empirical validation—such as interventions on hidden states—would be needed to substantiate the conclusions.

Reasons to accept

  • Introduces a novel information‑theoretic diagnostic toolkit (F_V^{(l)}, F_T^{(l)}) enabling layer‑wise traceability in multimodal LLMs, a capability not provided by prior work.
  • The method is lightweight (post‑hoc, no retraining), making it a practical diagnostic for both researchers and practitioners.

Reasons to reject

  • Key mathematical notation in §3.1 is ambiguous (e.g., inconsistent use of T, N, X_0^{(l)}, X_T^{(l)}), making it difficult to reproduce results or even compute the proposed metrics correctly.
    • T is introduced as raw text, yet in Eq. (1) it denotes a sequence of tokens. This usage is inconsistent with I, which represents a raw image fed to the VLM through the function f.
    • N is never defined (I presume it is the number of layers in g_\pi).
    • X_{0}^{(l)} is labeled "encoded visual representation," but the 0-th token is merely the first image token—it does not capture the full visual representation. The authors may have meant X_{|f(V)|}^{(l)} (the last image token).
    • X_{T}^{(l)} is called the "multimodal representation," yet T is not an index but the text itself. Even if one sets T = |T|, the notation still seems off; perhaps X_{|f(V)|+|T|}^{(l)} (the last input token) was intended.
    • Because these symbols form the foundation of the method—and are reused in the proposed information-theoretic measurements—the notation must be clarified. In my review, I assumed X_{0}^{(l)} refers to the last image token and X_{T}^{(l)} to the last input token.
  • Core empirical claims rely solely on the shapes of two mutual‑information curves; without causal interventions or ablation studies, the four‑stage interpretation remains speculative.
    • For example, the authors identify four stages of multimodal encoding solely from the shapes of the proposed metric curves. However, it is difficult—at least for me—to draw firm conclusions from these plots alone. They indicate when information from image tokens is injected into text tokens, but not how this integration occurs. We can certainly hypothesize, but for such hypotheses to be scientifically convincing we need more rigorous experiments—for instance, interventions on the model’s hidden states—to corroborate the claims.
  • Evaluation is limited to a single model (LLaVA‑1.5‑7B) and a concept bottleneck tied to 80 MS‑COCO classes, leaving generality to other architectures and tasks untested.
  • The experiments lack comparisons to simpler diagnostic tools (e.g., similarity of hidden states between image tokens and text tokens [1]), so the incremental benefit of the proposed toolkit is unclear.

[1] https://arxiv.org/abs/2411.00646

Comment

4. Comparisons to simpler diagnostic tools

We conducted experiments using several existing simple metrics to measure the information differences in hidden states between image tokens and instruction tokens. These metrics include Euclidean distance [7], cosine similarity [8], Pearson correlation [9] and JS Divergence [10].

Our observations reveal that these existing simple metrics fail to (1) effectively trace changes in information flow across layers. The results of some metrics exhibit mostly monotonic increases or decreases with subtle fluctuations. For example, cosine similarity and Pearson correlation increase gradually from layer 5 to 35 (e.g., from 0.20 → 0.59), showing almost identical trends with no distinct inflection points, thereby failing to reveal any interpretable processing stages. Other metrics, such as JS Divergence, degrade to a near-constant value near the output layers (around layer 35), which suggests that the metric loses discriminative power near the output. Such limitations prevent them from capturing the distinct stages identified by our methods.

(2) Additionally, these simple metrics do not provide a theoretically explainable perspective on the evaluated results. They measure the similarity of hidden states in a straightforward manner (e.g., through direct calculations of cosine or Euclidean distance), which lacks a meaningful theoretical framework when dealing with the complex semantic vectors in MLLMs. These findings further underscore the contribution of our information evaluation methods, which successfully trace more nuanced information flow patterns than simpler metrics while also offering theoretically grounded explanations.

Results of simpler diagnostic tools (external PDF Figure 1)

| Method / Layer | 5 | 10 | 15 | 20 | 25 | 30 | 35 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Cosine Similarity | 0.20 | 0.32 | 0.44 | 0.48 | 0.49 | 0.58 | 0.59 |
| Euclidean Distance | 48.65 | 62.05 | 69.47 | 76.88 | 105.85 | 175.24 | 238.50 |
| Pearson Correlation | 0.20 | 0.32 | 0.43 | 0.48 | 0.49 | 0.58 | 0.59 |
| JS Divergence (×100) | 0.082 | 0.11 | 0.09 | 0.17 | 0.15 | 0.05 | 0.03 |
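For reference, here is a small sketch of how these simpler layer-wise diagnostics can be computed between pooled image-token and text-token hidden states; the mean pooling and the softmax normalization used for JS divergence are assumptions for illustration, not the rebuttal's exact procedure.

```python
# Sketch of the simpler diagnostics compared above; the pooling over token
# positions and the softmax normalization for JS divergence are assumptions.
import numpy as np
from scipy.spatial.distance import jensenshannon
from scipy.stats import pearsonr

def simple_layer_metrics(layer_hidden, image_slice, text_slice):
    """layer_hidden: (seq_len, d) hidden states of one layer."""
    v = layer_hidden[image_slice].mean(axis=0)    # pooled image-token state
    t = layer_hidden[text_slice].mean(axis=0)     # pooled text-token state
    cos = float(v @ t / (np.linalg.norm(v) * np.linalg.norm(t)))
    euc = float(np.linalg.norm(v - t))
    pear = float(pearsonr(v, t)[0])
    # JS divergence needs probability vectors; a softmax is one simple choice
    p = np.exp(v - v.max()); p /= p.sum()
    q = np.exp(t - t.max()); q /= q.sum()
    js = float(jensenshannon(p, q) ** 2)          # squared JS distance = divergence
    return {"cosine": cos, "euclidean": euc, "pearson": pear, "js_x100": 100 * js}
```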

[1] Quantmeyer, Vincent, Pablo Mosteiro, and Albert Gatt. "How and where does CLIP process negation?." arXiv preprint arXiv:2407.10488 (2024).

[2] Kim, Hazel, et al. "Detecting LLM Hallucination Through Layer-wise Information Deficiency: Analysis of Unanswerable Questions and Ambiguous Prompts." arXiv preprint arXiv:2412.10246 (2024).

[3] Khanam, Rahima, and Muhammad Hussain. "Yolov11: An overview of the key architectural enhancements." arXiv preprint arXiv:2410.17725 (2024).

[4] Zhang, Shizhao, et al. "Domain adaptive yolo for one-stage cross-domain detection." Asian conference on machine learning. PMLR, 2021.

[5] Wei, Jian, Qinzhao Wang, and Zixu Zhao. "YOLO-G: Improved YOLO for cross-domain object detection." Plos one 18.9 (2023): e0291241.

[6] Tao, Mingxu, et al. "Probing multimodal large language models for global and local semantic representations." arXiv preprint arXiv:2402.17304 (2024).

[7] Alshamrani, Sultan. "Distance Matters: Euclidean Embedding Distances for Improved Language Model Generalization and Adaptability." IEEE Access (2024).

[8] Bashier, Housam Khalifa, Mi-Young Kim, and Randy Goebel. "Disk-CSV: distilling interpretable semantic knowledge with a class semantic vector." Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume. 2021.

[9] Nasir, Inzamam Mashood, et al. "Pearson correlation-based feature selection for document classification using balanced training." Sensors 20.23 (2020): 6793.

[10] Sutter, Thomas, Imant Daunhawer, and Julia Vogt. "Multimodal generative learning utilizing jensen-shannon-divergence." Advances in neural information processing systems 33 (2020): 6100-6110.

Comment

Dear Authors,

I appreciate the authors' hard work on the rebuttal by providing additional experimental results. The extra discussion and experimental results, both in this thread and in other threads, significantly enhance this work. I will raise my score.

Comment

Dear Reviewer JoXW,

Thank you for your valuable comments. We have carefully reviewed your comments and have provided responses to each of the points you raised.

1. Mathematical Notation Ambiguity

Thanks for your feedback; we will update the notation to improve clarity.

2. Empirical Support for the Four-Stage Interpretation

We would like to mention that the four-stage view is an optional descriptive lens rather than the core contribution of our paper. It illustrates the structured dynamics observed in our empirical results, helping interpret mutual information trends across layers. The key contributions of our work lie in introducing a principled information-theoretic framework to quantify and trace multimodal information flow in MLLMs, which remain valid regardless of how one segments the observed dynamics. Besides, there are relevant studies that also offer empirical analyses with rigorous theoretical methodologies. [1, 2] Such studies highlight the importance of experimentation in validating theoretical models.

Additionally, we further performed a selective ablation experiment that corroborates our findings. Specifically, we apply an intervention on image tokens by masking them. The results show that when image tokens are disabled:

(1) Information Compression (Stage 4) remains virtually unaffected, consistent with our claim that this stage primarily involves text-response generation rather than visual processing. (2) Stages 1–3 (Pre-processing, Filtering, and Contextual Injection) exhibit significant disruption when image tokens are masked: the characteristic bimodal trend vanishes, collapsing into a monotonic decrease. This aligns with our proposal that these stages actively process visual information whose dynamics depend on image tokens.

Qwen 2.5 VL 3B intervention on VQA (external PDF Figure 3)

| Setting / Layer | 7 | 12 | 17 | 22 | 27 | 32 |
| --- | --- | --- | --- | --- | --- | --- |
| Mask image token | 0.012 | -0.041 | -0.030 | -0.034 | -0.038 | -0.064 |
| Original setting | 0.007 | 0.022 | 0.046 | 0.035 | 0.043 | 0.019 |
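A minimal sketch of the masking intervention described above, assuming a HuggingFace-style model interface; zeroing the attention mask over image-token positions is one way to disable them and is an assumption, not necessarily the exact procedure used in the rebuttal.

```python
# Illustrative image-token intervention (an assumption, not the exact rebuttal code):
# zero the attention mask over image-token positions so text tokens can no longer
# attend to visual information, then re-trace the layer-wise hidden states.
import torch

@torch.no_grad()
def forward_with_masked_image_tokens(model, input_ids, attention_mask, image_positions):
    masked = attention_mask.clone()
    masked[:, image_positions] = 0                # disable image tokens
    out = model(input_ids=input_ids,
                attention_mask=masked,
                output_hidden_states=True)
    return out.hidden_states                      # feed into the F_v / F_t estimators
```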

3. Evaluation Scope and Baselines

We have expanded our experiments to include Qwen 2.5 VL, a more advanced and powerful MLLM. The results are presented in the tables below, with values at key layers illustrating the curve trends.

We observe that the Qwen model also exhibits a four-stage information processing flow pattern, consistent with our findings from LLaVA. This alignment between Qwen VL and LLaVA enhances the generalizability of the findings reported in our paper.

Qwen 2.5 VL 3B Result F_v (external PDF Figure 2(a), Figure 2(b))

| Dataset / Layer | 6 | 11 | 16 | 21 | 26 | 31 | 36 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| VQA-v2 | 0.008 | 0.013 | 0.050 | 0.033 | 0.050 | 0.016 | 0.002 |
| COCO Caption | -0.006 | 0.007 | 0.042 | 0.012 | 0.074 | 0.051 | -0.006 |
| AOKVQA | 0.002 | 0.007 | 0.028 | 0.004 | 0.025 | -0.007 | -0.015 |
| HAL-Eval | -0.001 | 0.022 | 0.051 | 0.035 | 0.073 | 0.054 | 0.024 |

Qwen 2.5 VL 3B Result F_t (external PDF Figure 2(c), Figure 2(d))

| Dataset / Layer | 5 | 10 | 15 | 20 | 25 | 30 | 35 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| VQA-v2 | -0.033 | 0.003 | 0.020 | 0.056 | 0.062 | 0.316 | 0.480 |
| COCO Caption | -0.022 | -0.008 | 0.015 | -0.019 | -0.064 | 0.103 | 0.149 |
| AOKVQA | 0.013 | 0.130 | 0.251 | 0.292 | 0.524 | 0.589 | 0.540 |
| HAL-Eval | -0.047 | -0.044 | -0.063 | -0.072 | -0.067 | 0.158 | 0.408 |

The 80 classes in our evaluation method are not tied to COCO; they are actually extracted by leveraging YOLO's anchor mechanism. YOLO's anchor mechanism has reliable cross-domain adaptability [3, 4, 5] and can be extended to accommodate various downstream evaluation tasks.

Additionally, our choice of 80 classes from standard YOLO aligns with common practices in vision-language research, where such categories serve as a standardized semantic bottleneck for fair comparison across different datasets and settings. [6]
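To illustrate the idea, here is a hypothetical sketch of how an 80-class concept target could be built from a standard off-the-shelf YOLO detector; the specific checkpoint, confidence threshold, and multi-hot encoding are assumptions for illustration, not the paper's exact setup.

```python
# Hypothetical sketch (assumptions: checkpoint name, confidence threshold,
# multi-hot encoding) of deriving an 80-dimensional concept vector from a
# standard YOLO detector's predictions.
import numpy as np
from ultralytics import YOLO

detector = YOLO("yolov8n.pt")                     # any standard 80-class checkpoint

def concept_vector(image_path, conf_threshold=0.25):
    """Multi-hot vector over the detector's 80 classes present in the image."""
    result = detector(image_path, conf=conf_threshold, verbose=False)[0]
    vec = np.zeros(len(detector.names), dtype=np.float32)
    for cls_id in result.boxes.cls.int().tolist():
        vec[cls_id] = 1.0
    return vec
```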

Review (Rating: 6)

This paper proposes a new IB-based framework for analyzing multimodal LLMs. The key idea is to leverage the concept bottleneck [1] to quantify the layer-wise visual information proportion and obtain the concept vector, which can be utilized for MLLM analysis. Experiments have been conducted to answer various interesting questions, and some useful insights have been obtained.

[1] Koh, Pang Wei, Thao Nguyen, Yew Siang Tang, Stephen Mussmann, Emma Pierson, Been Kim, and Percy Liang. "Concept bottleneck models." In International conference on machine learning, pp. 5338-5348. PMLR, 2020.

Reasons to accept

  • The idea of utilizing the IB-based theory and concept bottleneck to analyze MLLMs is novel to me, and the theory looks sound.
  • This paper studies several important research questions, which is a good contribution to the community.

Reasons to reject

  • The method is built upon previous work [1], which impacts the technical contribution to some degree.
  • Some research questions may be too vague and difficult to answer. Though the paper uses the proposed framework to reach some conclusions, it may still be hard to answer the proposed questions. A suggestion would be to break the questions into smaller, more specific ones instead of keeping their current form.
  • The model used is LLaVA-1.5, which is old and may not be powerful enough. The authors should consider using more advanced MLLMs such as Qwen-VL. The evaluation is conducted on benchmarks that may be too simple. Please also consider more advanced benchmarks such as MM-Vet v2.

Questions for the authors

N/A

Comment

Dear Reviewer WuoZ,

Thank you for your valuable comments. We have carefully reviewed your comments and have provided responses to each of the points you raised.

1. Technical Contribution and Novelty

Our motivation is inspired by the CBM you mentioned in [1]; however, our theoretical and technical details are novel.

Theoretically, our main contribution is the proposal of two information evaluation perspectives for MLLM information injection, combined with an upper bound on mutual information.

2. Research Question Scope

We have further specified our research questions to enhance their alignment with our findings and to ensure they can be more effectively addressed. For example, in RQ2 our findings mainly focus on the influence of "disruptive prompts", so we modify RQ2 to "How do disruptive textual instructions influence multimodal information flow in MLLMs?" to specify a clearer question scope. For RQ4, we can modify the RQ to "How do keywords in different instructions influence MLLM knowledge injection?" to specify a clearer scope.

3. Model Choice and Benchmark Sufficiency

We have expanded our experiments to include Qwen 2.5 VL, a more advanced and powerful MLLM. The results are presented in the tables below, with values at key layers illustrating the curve trends.

Our findings indicate that the Qwen model also exhibits a four-stage information processing flow pattern, which is consistent with our observations using LLaVA. This consistency between Qwen VL and LLaVA reinforces the validity of our findings across different MLLMs.

Qwen 2.5 VL 3B Result F_v (external PDF Figure 2(a), Figure 2(b))

| Dataset / Layer | 6 | 11 | 16 | 21 | 26 | 31 | 36 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| VQA-v2 | 0.008 | 0.013 | 0.050 | 0.033 | 0.050 | 0.016 | 0.002 |
| COCO Caption | -0.006 | 0.007 | 0.042 | 0.012 | 0.074 | 0.051 | -0.006 |
| AOKVQA | 0.002 | 0.007 | 0.028 | 0.004 | 0.025 | -0.007 | -0.015 |
| HAL-Eval | -0.001 | 0.022 | 0.051 | 0.035 | 0.073 | 0.054 | 0.024 |

Qwen 2.5 VL 3B Result F_t (external PDF Figure 2(c), Figure 2(d))

| Dataset / Layer | 5 | 10 | 15 | 20 | 25 | 30 | 35 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| VQA-v2 | -0.033 | 0.003 | 0.020 | 0.056 | 0.062 | 0.316 | 0.480 |
| COCO Caption | -0.022 | -0.008 | 0.015 | -0.019 | -0.064 | 0.103 | 0.149 |
| AOKVQA | 0.013 | 0.130 | 0.251 | 0.292 | 0.524 | 0.589 | 0.540 |
| HAL-Eval | -0.047 | -0.044 | -0.063 | -0.072 | -0.067 | 0.158 | 0.408 |

Regarding the benchmarks, we have carefully selected them to encompass various scenarios and facilitate meaningful comparisons, and many related works also use these benchmarks. [1, 2, 3] This carefully designed benchmark selection has yielded valuable insights. Specifically, the benchmarks cover different tasks (captioning and question-answering) and diverse data sources (both in-domain and out-of-domain). Through comparisons across these benchmarks, which feature varying settings of comparable complexity, we have identified several key findings. Notably, Findings 3 and 4 highlight different information patterns across visual tasks, while Finding 5 reveals distinct information patterns influenced by in-domain and out-of-domain data, as well as by inconsistent textual inputs.

[1] Tao, Mingxu, et al. "Probing multimodal large language models for global and local semantic representations." arXiv preprint arXiv:2402.17304 (2024).

[2] Nguyen, Duy-Kien, and Takayuki Okatani. "Multi-task learning of hierarchical vision-language representation." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019.

[3] Huo, Jiahao, et al. "Mmneuron: Discovering neuron-level domain-specific interpretation in multimodal large language model." arXiv preprint arXiv:2406.11193 (2024).

Comment

To All Reviewers

We sincerely thank all the reviewers for the insightful feedback. We are encouraged by the recognition of our work's value and have carefully addressed all concerns. The highlights of our contribution are:

Novel Theoretical Framework for MLLM Interpretability

We deeply appreciate the reviewers' acknowledgment of our novel integration of Information Bottleneck and Concept Bottleneck theories (Reviewers 1-4). This framework provides the first quantitative tool to trace layer-wise visual-textual interactions in MLLMs, addressing a critical gap in multimodal interpretability research.

Lightweight Diagnostic Toolkit with Community Impact

Reviewers highlighted our method's practical value as a lightweight analysis tool. This lightweight design enables researchers to systematically and quantitatively diagnose diverse MLLMs efficiently, potentially guiding future architecture design. (Reviewers 2-4)

Systematic Empirical Analysis with Actionable Insights

We are grateful for the recognition of our revelation of four distinct processing stages in LLaVA and of the quantified text-visual interplay mechanisms, validated through multiple benchmark scenarios including QA, captioning, and hallucination detection tasks. (Reviewers 1, 3, 4)

Common Issue Resolutions

We have proactively addressed key cross-reviewer concerns:

Model/Benchmark Scope: We have expanded experiments to include more advanced Qwen 2.5 VL architectures, validating that the observed multi-stage processing dynamics extend beyond LLaVA to modern MLLMs.

Expanded Baseline Comparisons and Finding Validation: (1) To contextualize our methodological contributions, we benchmarked our approach against several standard metrics. These existing metrics fail to resolve the nuanced, stage-specific dynamics captured by our mutual-information framework and lack the grounding in information-theoretic principles needed to interpret the observations. (2) We added intervention experiments on image tokens. The results of the intervention experiments are consistent with our previous findings, which further corroborates our claims.

Mathematical Ambiguities: We have rigorously redefined the notation in Section 3.1 (particularly clarifying X_0^{(l)}, X_T^{(l)}, and the layer indexing) and provided detailed proofs for Lemmas 3.3–3.4 in the appendix.

Final Decision

This paper proposes a new IB-based framework for analyzing multimodal LLMs. After rebuttal, it received scores of 6, 6, 6, 7. Overall, the authors provided a strong rebuttal, and all reviewers were positive about the paper. They noted that the use of IB theory and concept bottlenecks to analyze MLLMs is novel, and the proposed framework has the potential to become a valuable diagnostic tool for multimodal LLMs.

However, reviewers also raised some concerns. They agreed that analyzing only LLaVA-1.5 was insufficient. In response, the authors added results on Qwen2.5-VL during the rebuttal. Additionally, reviewers pointed out that (a) key mathematical notations in Section 3.1 were ambiguous, and (b) comparisons to simpler diagnostic tools were lacking. The authors addressed these by clarifying the notations and providing additional comparisons.

Overall, the AC recommends accepting the paper.