EMMA: Efficient Visual Alignment in Multi-Modal LLMs
Abstract
Reviews and Discussion
This paper proposes EMMA, a lightweight cross-modality module designed to efficiently fuse visual and textual encodings, generating instruction-aware visual representations for the language model.
Strengths
1-Clear motivation: Improving efficiency is important for MLLMs.
2-Well-written paper: Easy to follow.
3-Substantial experiments and corresponding discussion.
Weaknesses
1-The model is very similar to LLaVA: EMMA introduces a CLIP-text encoder and an early fusion module.
2-The involvement of the CLIP-text encoder is not very convincing: CLIP is trained for the classification task only, but EMMA is a general-purpose instruction-aware MLLM. I am doubtful about the effectiveness when the CLIP-text encoder deals with long instructions. You may provide some visualization of the activations when CLIP encounters long instructions / tasks apart from classification.
3-The discussion in Sec. 3.2 should incorporate more datasets apart from MMVP, because, as noted above, I am not sure of the effectiveness of the CLIP-text encoder.
4-Unfair comparison: Table 1 is without LLaVA, but Figure 3 only compares with LLaVA. LLaVA should be compared in Table 1, and mPLUG-Owl2 and other methods in Figure 3. You may provide a table / figure to compare more methods.
5-The scale of Figure 3 is inappropriate: slight improvements are shown with large visual distances, e.g., 64.3 -> 66.19 in MMB-en. I think a histogram / bar chart would be better.
6-The claim in lines 247-250 is too absolute: "Therefore, a persistent challenge remains in effectively aligning these two modalities to ensure seamless fusion and task-specific adaptability." Task-specific adaptability is indeed important, but previous works address it via the LLM (at very large scale).
7-Figure 5: There are two legends, purple and light green. What do the dark green bars represent? The right figure is hard to understand. Please give more descriptions.
8-Performance improves only slightly on many of the datasets.
9-The POPE dataset has many subsets; detailed performance on each of them should be provided in a table.
10-If you mention hallucination, you should discuss and cite more papers. For example, there are relation hallucinations (R-Bench [1], MMRel [2]) and datasets containing attribute hallucination (MME [3]). I am not asking you to evaluate on a large number of datasets; I only suggest citing them so that readers have a clear picture of hallucination tasks.
[1] Mingrui Wu, Jiayi Ji, Oucheng Huang, Jiale Li, Yuhang Wu, Xiaoshuai Sun, and Rongrong Ji. Evaluating and analyzing relationship hallucinations in LVLMs. In ICML, 2024.
[2] J. Nie, G. Zhang, W. An, et al. MMRel: A Relation Understanding Dataset and Benchmark in the MLLM Era. arXiv preprint arXiv:2406.09121, 2024.
[3] Wujian Peng, Sicheng Xie, Zuyao You, Shiyi Lan, and Zuxuan Wu. Synthesize, Diagnose, and Optimize: Towards Fine-Grained Vision-Language Understanding. In CVPR, 2024.
Questions
Please see the weakness.
My major concern is about the novelty, both the slightly incremental structure based on LLaVA and the effectiveness of CLIP-text encoder.
We thank the reviewer for their valuable feedback and comments.
Question: 1-The model is very similar to LLaVA: EMMA introduces CLIP-text encoder and an early fusion module.
Answer: Please refer to the first topic in the General Response, where we have thoroughly clarified the goals and design principles of EMMA.
Question: 2-The involvement of the CLIP-text encoder is not very convincing: CLIP is trained for the classification task only, but EMMA is a general-purpose instruction-aware MLLM. I am doubtful about the effectiveness when the CLIP-text encoder deals with long instructions. You may provide some visualization of the activations when CLIP encounters long instructions / tasks apart from classification.
Answer: Thank you for discussing CLIP's training objective: aligning corresponding image and text embeddings while maximizing the separation between unrelated pairs. It is worth noting that CLIP's training dataset comprises at least 400 million pairs, which extend far beyond simple classification descriptions (e.g., "A photo of a cat"). This extensive and diverse data enables the model to learn embeddings that are semantically aligned across the two modalities. This cross-modality alignment is the reason CLIP's visual encoder is preferred over other candidates for producing the visual encodings that are passed to the LLM. In this work, we propose that encoding instructions with the text encoder alongside the visual encoder allows for better utilization of the alignment between the two modules.
Due to the CLIP text encoder's input length limitation (a maximum of 77 tokens), we extract the most relevant portion of the instruction during both training and inference to feed into the text encoder. For example, given a full instruction such as "A chat between a curious user and an artificial intelligence assistant. The assistant provides helpful, detailed, and polite answers to the user's questions. {question}", we isolate and input only the {question} into the text encoder.
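For concreteness, the snippet below is a minimal sketch of the trimming-and-encoding step described above, written with the Hugging Face `transformers` CLIP classes; the checkpoint name, the fixed system prefix, and the helper function are illustrative assumptions rather than the authors' released code.

```python
import torch
from transformers import CLIPTokenizer, CLIPTextModel

# Illustrative sketch only: the checkpoint and system prefix below are assumptions.
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

SYSTEM_PREFIX = (
    "A chat between a curious user and an artificial intelligence assistant. "
    "The assistant provides helpful, detailed, and polite answers to the user's questions. "
)

def encode_instruction(full_instruction: str) -> torch.Tensor:
    # Keep only the question, dropping the fixed system preamble.
    question = full_instruction.replace(SYSTEM_PREFIX, "").strip()
    # CLIP's text tower accepts at most 77 tokens, so truncate defensively.
    inputs = tokenizer(question, truncation=True, max_length=77,
                       padding="max_length", return_tensors="pt")
    with torch.no_grad():
        outputs = text_encoder(**inputs)
    return outputs.last_hidden_state  # per-token text encodings, shape [1, 77, 768]
```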
Finally, to evaluate the performance of EMMA's modality adaptation module across various settings, it is helpful to refer to Tables 2, 3, and 4. These tables present results from over 10 benchmarks, containing diverse prompt structures.
Question: 3-The discussion in Sec. 3.2 should incorporate more datasets apart from MMVP, because, as noted above, I am not sure of the effectiveness of the CLIP-text encoder.
Answer: Section 3.2 presents three key analyses:
• General Analysis: This examines the weights of the modality adaptation module in a dataset-independent manner, providing insights into its overall behavior.
• MMVP Dataset Analysis: This focuses on the MMVP dataset, which features pairs of visually similar images. It highlights the impact of instruction injection on the representations, enabling them to become more distinguishable.
• Mutual Information Analysis: This explores the mutual information between visual and textual representations using another benchmark, LLaVA-in-the-Wild.
Question: 4-Unfair comparison: Table 1 is without LLaVA, but Figure 3 only compares with LLaVA. LLaVA should be compared in Table 1, and mPLUG-Owl2 and other methods in Figure 3. You may provide a table / figure to compare more methods.
Answer: Table 1 primarily focuses on comparing models with early modality adaptation modules; however, LLaVA has been included as suggested. Figure 3 (now Figure 5(a)) highlights a direct comparison between EMMA and LLaVA, as these two models share identical architecture and training setups, differing only in the modality adaptation module introduced by EMMA. Please refer to the first topic of the General Response and Tables 2 and 4 for a comprehensive comparison between EMMA and many recent MLLMs.
Question: 5-The scale of Figure 3 is inappropriate: slight improvements are shown with large visual distances, e.g., 64.3 -> 66.19 in MMB-en. I think a histogram / bar chart would be better.
Answer: Thanks for your suggestion. We have adjusted the scales to accurately reflect the correct differences; please refer to Figure 5(a).
Question: 6-The claim in lines 247-250 is too absolute: "Therefore, a persistent challenge remains in effectively aligning these two modalities to ensure seamless fusion and task-specific adaptability." Task-specific adaptability is indeed important, but previous works address it via the LLM (at very large scale).
Answer: It remains unclear whether previous methods, including LLaVA, have effectively addressed this issue. [1] highlights that current advancements in MLLMs are largely attributable to improvements in the LLM component, while these models often exhibit shortcomings on more visually demanding benchmarks. Notably, EMMA achieves a significant improvement (9.3%) on the visually demanding MMVP benchmark introduced in [1]. Additionally, our model demonstrates substantial improvements in addressing robustness challenges, such as hallucinations, which persist as critical issues in existing models. Please refer to the table below for a detailed evaluation of EMMA's robustness capabilities.
| Method | R-Bench-Image-level | R-Bench-Instance-level | MMRel-SDXL-Spatial | MMRel-Dalle-Spatial | MMRel-Dalle-Action | MMRel-VG-Action | MMRel-VG-Spatial | AMBER-Attr | AMBER-Exis | FOIL |
|---|---|---|---|---|---|---|---|---|---|---|
| LLaVA | 79.89 | 67.53 | 51.56 | 49.75 | 69.74 | 68.55 | 50.34 | 20.23 | 91.43 | 50.08 |
| EMMA | 80.78 | 68.12 | 74.69 | 63.61 | 71.79 | 76.39 | 62.16 | 26.15 | 94.50 | 66.13 |
We refer to LLaVA-style fusion (occurring within the LLM) as Late Fusion and EMMA-style fusion as Early Fusion.
A key drawback of late fusion is that the LLM component is not explicitly trained for seamless multi-modal integration. This limitation can hinder certain capabilities of the LLM during the fine-tuning process. In contrast, early fusion overcomes this issue by aligning modalities before passing them to the LLM. This approach ensures that the representations are already instruction-aware, leading to more effective multi-modal processing.
On the other hand, most current methods implement early fusion by introducing hundreds of millions of additional parameters, necessitating equally large-scale datasets for training. This presents a persistent challenge in achieving effective and efficient alignment between the two modalities.
The proposed approach aims to overcome these challenges by leveraging lightweight and efficient methods to perform modality adaptation without compromising performance or requiring vast computational resources.
[1] Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs
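Since the module itself is only summarized in this thread, the following PyTorch sketch illustrates one plausible form of such a lightweight early-fusion adapter: instruction encodings are linearly projected and mixed into the visual tokens, so the tokens passed on to the projector and LLM are already instruction-aware. The class name, dimensions, and the additive fusion rule are assumptions for illustration, not EMMA's published architecture.

```python
import torch
import torch.nn as nn

class EarlyFusionAdapter(nn.Module):
    """Schematic early-fusion module (an assumption-laden sketch, not EMMA's exact design).

    It mixes CLIP text-token encodings into the CLIP visual tokens so the visual
    representation handed onward to the projector and LLM is instruction-aware.
    """

    def __init__(self, vis_dim: int = 1024, txt_dim: int = 768, n_txt_tokens: int = 77):
        super().__init__()
        # Project instruction encodings into the visual feature space.
        self.txt_proj = nn.Linear(txt_dim, vis_dim)
        # Fixed per-position weights over text tokens (a simple linear pooling).
        self.txt_weights = nn.Linear(n_txt_tokens, 1)

    def forward(self, vis_tokens: torch.Tensor, txt_tokens: torch.Tensor) -> torch.Tensor:
        # vis_tokens: [B, N_v, vis_dim]; txt_tokens: [B, 77, txt_dim]
        txt = self.txt_proj(txt_tokens)                  # [B, 77, vis_dim]
        pooled = self.txt_weights(txt.transpose(1, 2))   # [B, vis_dim, 1]
        pooled = pooled.transpose(1, 2)                  # [B, 1, vis_dim]
        # Refine every visual token with the pooled instruction signal.
        return vis_tokens + pooled
```

A module of this shape adds well under a million parameters, which is consistent in spirit with the very small overhead reported for EMMA, though the actual design may differ.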
Question: 7-Figure 5: There are two legends, purple and light green. What do the dark green bars represent? The right figure is hard to understand. Please give more descriptions.
Answer: One of the reorganizations we made to the paper was positioning this histogram alongside the MMVP instances (now Figure 3) to better convey the message of the plot. The histogram displays the L2-norms of the MMVP image pairs using LLaVA's visual representations and EMMA's instruction-aware representations. This effectively demonstrates that EMMA's visual representations, enhanced by instruction encodings, create more distinguishable representations for visually similar images.
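To make the quantity being plotted explicit, the sketch below computes the per-pair L2 gap; it assumes each image's representation has been pooled to a single vector per model, which is an illustrative simplification rather than the exact plotting code.

```python
import torch

def pair_l2_gaps(pair_reps: torch.Tensor) -> torch.Tensor:
    """L2 distance between the two pooled representations of each MMVP-style image pair.

    pair_reps: [N_pairs, 2, D]; the pairing and pooling choices are assumptions.
    """
    return torch.linalg.vector_norm(pair_reps[:, 0] - pair_reps[:, 1], dim=-1)

# Hypothetical usage: larger gaps mean visually similar images are easier to tell apart.
# gaps_visual_only       = pair_l2_gaps(llava_pair_reps)  # visual-only representations
# gaps_instruction_aware = pair_l2_gaps(emma_pair_reps)   # instruction-aware representations
```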
Question: 8-Performance improves only slightly on many of the datasets.
Answer: Given the complexity of the tasks and the variety of benchmarks, it is unlikely for a model to provide significant improvements across all benchmarks (as observed for previous models in this area [1, 2]). However, EMMA demonstrates significant improvements on several benchmarks, as highlighted in the table below showing the percentage gains.
| Method | VQAv2 | VizWiz | SQA-I | OKVQA | MMB-EN | MMB-CN | MuirBench |
|---|---|---|---|---|---|---|---|
| EMMA | +6.92% | +1.53% | +5.7% | +2.57% | +1.94% | +1.85% | +6.54% |
And for robustness benchmarks:
| Method | AMBER-Attr | AMBER-Rel | AMBER-Exis | FOIL |
|---|---|---|---|---|
| EMMA | +5.92% | +0.24% | +3.07% | +16.05% |
[1] mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration
[2] Question Aware Vision Transformer for Multimodal Reasoning
Question: 9-The POPE dataset has many subsets; detailed performance on each of them should be provided in a table.
Answer: For a better presentation of the radar plot, only the "popular" subset is included in the paper. For a complete comparison, please refer to the table below.
| Method | Adversarial (adv) | Popular (pop) | Random (rand) |
|---|---|---|---|
| LLaVA | 81.16 | 87.03 | 91.46 |
| EMMA | 82.77 | 88.27 | 91.23 |
Question: 10-If you mention hallucination, you should discuss and cite more papers. For example, there are relation hallucinations (R-Bench [1], MMRel [2]) and datasets containing attribute hallucination (MME [3]). I am not asking you to evaluate on a large number of datasets; I only suggest citing them so that readers have a clear picture of hallucination tasks.
Answer: This table addresses the suggestion to include a broader discussion on hallucination tasks by providing performance comparisons across several benchmarks and datasets specifically designed to evaluate different types of hallucinations. EMMA consistently demonstrates stronger performance across all benchmarks except for the MME benchmark.
| Method | R-Bench-Image-level | R-Bench-Instance-level | MMRel-SDXL-Spatial | MMRel-Dalle-Spatial | MMRel-Dalle-Action | MMRel-VG-Action | MMRel-VG-Spatial | MME-Cognition | MME-Perception |
|---|---|---|---|---|---|---|---|---|---|
| LLaVA | 79.89 | 67.53 | 51.56 | 49.75 | 69.74 | 68.55 | 50.34 | 355 | 1501 |
| EMMA | 80.78 | 68.12 | 74.69 | 63.61 | 71.79 | 76.39 | 62.16 | 318 | 1463 |
As suggested, the hallucination evaluation table (Table 3) has been updated to reflect the inclusion of these new benchmarks.
Thanks for the authors' response, which addressed most of my concerns. But I still suggest the authors do more experiments and analysis on another dataset apart from MMVP, as suggested in Question 3. I will raise my rating to 6.
Thank you for your valuable comments. The latest version of the paper includes a more comprehensive evaluation across 20 benchmarks, encompassing a diverse range of tasks, including visual question answering and reasoning, mathematical and college-level tasks, robustness to hallucination, multi-image reasoning, and OCR.
This paper introduces EMMA (Efficient Multi-Modal Adaptation), a lightweight module designed to improve cross-modal alignment in multi-modal large language models (MLLMs). EMMA integrates visual and textual encodings using a low-parameter approach that leverages pre-trained CLIP models. Specifically, the method incorporates both CLIP’s vision encoder and its text encoder into the multi-modality alignment module, taking advantage of their pre-trained alignment to fuse visual representations with instruction encodings efficiently. The proposed architecture adds less than 0.2% additional parameters, making it highly efficient compared to existing models such as mPLUG-Owl2.
Strengths
- Detailed analytical insights: The paper offers a thorough analysis of the modality alignment process through L1-norm, L2-norm, and mutual information comparisons. This analysis provides deeper insights into how visual and textual tokens are integrated within the alignment module, enhancing the interpretability of the model's decision-making process.
- Robustness against hallucinations: The empirical results highlight EMMA's superior ability to avoid hallucinations in multi-modal tasks, as demonstrated on benchmarks such as AMBER and FOIL. This robustness is critical for improving the reliability of MLLMs in real-world scenarios where model hallucinations can pose significant risks.
Weaknesses
- Inconsistent writing and presentation: The overall writing of the paper is somewhat disorganized, making it difficult to follow and reducing its readability. For example, Section 3.1 is intended to introduce the EMMA method, but the section ends with a discussion of experimental results (in the last paragraph on page 6). Similarly, in Section 3.2, which is supposed to focus on the Analysis on Modality Adaptation by EMMA, the section starts by explaining model details, which would be better suited earlier in the paper. Additionally, Figure 1 is referenced only on page 6, even though it appears on page 2. Many figures and tables also lack proper axis labels and detailed captions, which further hinders the reader's ability to understand the data. For instance, it would be helpful to clarify the contents of the axes in Figures 2, 5, and 6.
- Unfair experimental comparisons: The paper does not provide a fair and direct comparison between different vision-language adapters, which is essential since the focus is on adapter efficiency. While achieving complete consistency between the vision and language models might be challenging, it would be beneficial to at least include an additional set of experiments that utilize other adapters, such as [1-2], alongside the same vision and language models. This would allow for a more fair comparison of their performance against the EMMA adapter. By keeping the vision and language models constant across different methods, the paper could offer a clearer evaluation of the specific contributions made by EMMA’s adapter design.
Questions
Please see weakness.
- Can the authors provide more formalized descriptions of the proposed method? For example, this could include mathematical formulas or structured descriptions similar to those at the beginning of Section 3.2.
- Can the authors include additional experimental comparisons with other adapter-based methods, such as [1-2]? These would offer a more comprehensive evaluation of EMMA’s performance relative to other state-of-the-art adapters.
- EMMA uses CLIP-ViT-L-14 as its visual encoder, but it's unclear whether the same visual model is used in the comparison models. Could the authors provide details on the visual models used by the other models in Tables 2 and 4, along with their parameter counts?
[1] MMA: Multi-Modal Adapter for Vision-Language Models
[2] Tip-Adapter: Training-free CLIP-Adapter for Better Vision-Language Modeling
We thank the reviewer for their valuable feedback and comments.
Question: Inconsistent writing and presentation: The overall writing of the paper is somewhat disorganized, making it difficult to follow and reducing its readability. For example, Section 3.1 is intended to introduce the EMMA method, but the section ends with a discussion of experimental results (in the last paragraph on page 6). Similarly, in Section 3.2, which is supposed to focus on the Analysis on Modality Adaptation by EMMA, the section starts by explaining model details, which would be better suited earlier in the paper. Additionally, Figure 1 is referenced only on page 6, even though it appears on page 2. Many figures and tables also lack proper axis labels and detailed captions, which further hinders the reader's ability to understand the data. For instance, it would be helpful to clarify the contents of the axes in Figures 2, 5, and 6.
Answer: Thank you for your comments on the presentation. The revised version of the paper is now clearer and easier to read. The updates on the paper are listed in the third topic of the General Response.
Question: Unfair experimental comparisons: The paper does not provide a fair and direct comparison between different vision-language adapters, which is essential since the focus is on adapter efficiency. While achieving complete consistency between the vision and language models might be challenging, it would be beneficial to at least include an additional set of experiments that utilize other adapters, such as [1-2], alongside the same vision and language models. This would allow for a more fair comparison of their performance against the EMMA adapter. By keeping the vision and language models constant across different methods, the paper could offer a clearer evaluation of the specific contributions made by EMMA’s adapter design.
Answer: Thank you for your suggestion. MMA [1] is a concurrent work to EMMA and introduces a Multi-Modal Adapter that employs a cross-attention-based module in its architecture. However, the Multi-Modal Adapter proposed in MMA is specifically designed for classification tasks, and its capabilities as a VL (Vision-Language) adapter for MLLMs have not been explored. EMMA differentiates itself from MMA in two key ways:
• Innovative Alignment Approach: EMMA is the first to demonstrate that leveraging the inherent alignment between the visual and textual components of CLIP can enhance early modality adaptation in MLLMs, reducing the reliance on parameter-intensive modality adaptation modules.
• Architectural Differences: The architecture of MMA’s Multi-Modal Adapter is distinct from EMMA’s VL adapter, reflecting differing design philosophies and applications.
The following table presents a comparison of three MLLMs with identical setups, differing only in their modality adaptation modules: (1) LLaVA, the baseline model without a modality adaptation module, (2) EMMA, and (3) MMA. The results demonstrate that EMMA consistently achieves the highest performance across all benchmarks, except for MuirB, where MMA outperforms it. MMA shows improvements on benchmarks such as SQA and MuirB compared to the baseline, highlighting the strength of its modality adaptation capabilities. However, its complex architecture results in overfitting, as reflected in benchmarks like GQA, MMB, MMVP, and A-Bench. We hypothesize that effective modality adaptation, refining visual representations through instruction encodings, requires a meticulously balanced design, particularly when working with an efficient training set of only 1.2 million samples (the same as the baseline). This ensures that the generalization of visual representations is maintained while seamlessly integrating instruction information.
| Method | Data | SQA | GQA | POPE | MMB-en | MMB-cn | MMVP | A-Bench | MuirB |
|---|---|---|---|---|---|---|---|---|---|
| LLaVA-1.5 | 1.2M | 66.8 | 62.0 | 86.1 | 64.3 | 58.3 | 24.7 | 62.97 | 23.46 |
| MMA[1] | 1.2M | 67.08 | 49.33 | 79.73 | 60.84 | 51.42 | 8.67 | 56.01 | 29.5 |
| EMMA | 1.2M | 71.00 | 62.78 | 87.27 | 66.19 | 60.2 | 34 | 63.48 | 29.08 |
The Tip-Adapter[2] method introduces a fine-tuning strategy for vision foundation models by incorporating an adapter module immediately after the vision model. This approach confines the fine-tuning process to the adapter module, keeping the vision foundation model frozen. Essentially, Tip-Adapter mirrors the approach used by LLaVA, where an adapter is added after the vision encoder and before the LLM component to preprocess visual encodings for the language model. However, Tip-Adapter does not propose a modality adaptation module or early fusion of text encodings with visual representations.
[1] MMA: Multi-Modal Adapter for Vision-Language Models
[2] Tip-Adapter: Training-free CLIP-Adapter for Better Vision-Language Modeling
Question: Can the authors provide more formalized descriptions of the proposed method? For example, this could include mathematical formulas or structured descriptions similar to those at the beginning of Section 3.2.
Answer: The mathematical formulation of EMMA has been incorporated into Section 3.1 of the paper as suggested.
Question: EMMA uses CLIP-ViT-L-14 as its visual encoder, but it's unclear whether the same visual model is used in the comparison models. Could the authors provide details on the visual models used by the other models in Tables 2 and 4, along with their parameter counts?
Answer: Table 1 contains all the details on the vision encoders, the VL adapter, and the LLM component. As suggested, the vision encoders have also been added to Table 2 and Table 4 in the revised version of the paper. For convenience, they are provided in the following table. As reported, ViT-L/14 is the most efficient vision encoder utilized by the current SOTA MLLMs.
| Method | BLIP-2 | InstructBLIP | Shikra | IDEFICS | Qwen-VL-Chat | LLaVA-1.5 | mPLUG-Owl2 | BRAVE | EMMA |
|---|---|---|---|---|---|---|---|---|---|
| Vision Encoder | ViT-g/14 (1.3B) | ViT-g/14 (1.3B) | ViT-L/14 (0.3B) | ViT-H/14 | ViT-G/14 (1.9B) | ViT-L/14 (0.3B) | ViT-L/14 (0.3B) | EVA-CLIP-g, ViT-L/14, SILC-G/16, ViT-e, DINOv2-L/14 (7.2B) | ViT-L/14 (0.3B) |
I appreciate the authors' efforts and their responses, so I raise my score to 5.
This paper introduces EMMA, a lightweight cross-modality module that implements visual-language early fusion, generating instruction-aware visual representations for the language model. This module adds less than 0.03% of parameters to the framework while boosting performance across multiple MLLM benchmarks.
Strengths
- Leveraging the inherent alignment properties of pre-trained CLIP text and vision encoders to implement an early fusion module is intuitive.
- The model's structural design is characterized by its simplicity and clarity.
- The analysis presents a novel perspective on modality adaptation.
Weaknesses
- In line 251, it is mentioned that current methodologies depend on complex cross-modality modules, particularly highlighting mPLUG-Owl2. However, the paper does not provide a theoretical comparison with methods like LLaVA and Qwen-VL, which utilize straightforward projection layers that add only a minimal number of parameters to achieve visual-language alignment.
- The proposed EMMA introduces a Modality Adaptation module based on the correlation between word tokens in instructions and visual tokens. However, this methodology lacks performance guarantees when instructions are verbose or correlate poorly with visual tokens. The benchmarks selected in the paper are skewed towards VQA tasks, which inherently possess a high correlation between instructions and visual tokens. It is imperative that benchmarks assessing fine-grained perception capabilities, such as document understanding and OCR tasks (e.g., Chartqa[1] and Docvqa[2]), should also be evaluated to ensure comprehensiveness.
- Lack of comparison with visual-language early fusion works (e.g., QA-ViT [3]).
[1] Chartqa: A benchmark for question answering about charts with visual and logical reasoning.
[2] Docvqa: A dataset for vqa on document images
[3] Question Aware Vision Transformer for Multimodal Reasoning. In CVPR, 2024
Questions
- In line 295, the claim that leveraging the inherent alignment properties of CLIP encoders obviates the need for complex cross-modal modules and extensive training to attain alignment is deemed unwarranted, given that the instruction embeddings fed into the LLM are not inherently aligned with CLIP features. Indeed, the proposed EMMA incorporates a projection layer akin to LLaVA after its Modality Adaptation module and also adopts the pre-alignment training from LLaVA. The reviewer considers this projection layer a pivotal component of EMMA's alignment achievement. Conducting additional ablation experiments wherein this projection layer is removed could potentially strengthen the authors' argument.
- While Line 319 mentions the utilization of data from the same datasets, the 1.8M training samples employed by EMMA deviate in quantity from those used in any of the baselines, rendering it challenging to demonstrate the effectiveness of the Modality Adaptation module. Given that LLaVA 1.5 is fully open-source and shares a very similar structure with EMMA, adding the following experimental results would significantly strengthen this paper: (1) the outcomes of EMMA using the 1.2M training set from LLaVA 1.5, and (2) the results of LLaVA 1.5 utilizing the 1.8M training set from EMMA.
- The conclusion drawn in line 343 is not convincing. The findings in Figure 5(a) demonstrate that text tokens are indeed involved in the feature fusion process. Yet, they do not sufficiently establish that the alignment module effectively identifies the "most" informative textual tokens. The reviewer thinks the positions of the most valuable tokens for an arbitrarily given instruction are random, and the current framework adopting a simple linear layer, which allocates fixed importance weights for textual tokens during inference, can not adapt to this change.
- The reviewer praises the methodology illustrated in Figure 6, which employs Mutual Information to assess the importance of aligned visual representations in the reasoning process. However, concerns about its reliability are raised. A significant counterexample emerges in situations where the dataset requires responses strictly in YES/NO format, making the mutual information between the visual representations and the response encodings lack meaningful relevance.
Minor:
- Which figure does the conclusion in line 338 correspond to?
Question: Lack of comparison with visual-language early fusion works (e.g., QA-ViT[1]).
Answer: Thank you for bringing this paper to our attention. The QA-ViT [1] paper introduces a Question-Aware Vision Transformer approach for multimodal reasoning, embedding question awareness directly within the vision encoder. In contrast, our approach in EMMA does not modify the internal components of the vision encoder. Instead, we introduce a modality adaptation module that operates after the vision encoder. The key advantage of our approach over QA-ViT is that, by preserving the internal mechanics of the vision encoder, we maintain the generalization capabilities of the visual representations it produces. This ensures that the benefits of the pre-trained vision encoder's broad applicability are retained, providing a robust foundation for a wide range of tasks. Please check the following for a comparison between our method and QA-ViT; the results for QA-ViT are taken from their paper.
Traditional benchmarks:
| Method | VQAv2 | VizWiz | TextVQA |
|---|---|---|---|
| QA-ViT[1] | 80.5 | 36.5 | 59.1 |
| EMMA | 89.42 | 56.03 | 57.0 |
| Difference | +8.92% | +19.98% | -2.1% |
As we can observe from the table above, on the traditional benchmarks EMMA significantly outperforms QA-ViT by 8.92% on VQAv2 and 19.98% on VizWiz, while QA-ViT beats EMMA on TextVQA by a small margin. Also, the evaluations in the QA-ViT paper are limited to traditional benchmarks and do not utilize the MLLM-specialized benchmarks.
[1] Question Aware Vision Transformer for Multimodal Reasoning. In CVPR, 2024
Question: In line 295, the claim that leveraging the inherent alignment properties of CLIP encoders obviates the need for complex cross-modal modules and extensive training to attain alignment is deemed unwarranted, given that the instruction embeddings fed into the LLM are not inherently aligned with CLIP features. Indeed, the proposed EMMA incorporates a projection layer akin to LLaVA after its Modality Adaptation module and also adopts the pre-alignment training from LLaVA. The reviewer considers this projection layer a pivotal component of EMMA's alignment achievement. Conducting additional ablation experiments wherein this projection layer is removed could potentially strengthen the authors' argument.
Answer: Please refer to the second topic (The direct effect of EMMA's modality adaptation) in the General response where we have addressed this concern.
Question: While Line 319 mentions the utilization of data from the same datasets, the 1.8M training samples employed by EMMA deviate in quantity from those used in any of the baselines, rendering it challenging to demonstrate the effectiveness of the Modality Adaptation module. Given that LLaVA 1.5 is fully open-source and shares a very similar structure with EMMA, adding the following experimental results would significantly strengthen this paper: (1) the outcomes of EMMA using the 1.2M training set from LLaVA 1.5, and (2) the results of LLaVA 1.5 utilizing the 1.8M training set from EMMA.
Answer: Please refer to the second topic (The direct effect of EMMA's modality adaptation) in the General Response, where there is a comparison between LLaVA-1.2M and EMMA-1.2M.
Following your suggestion, we trained LLaVA-1.5 with the 1.8M data; the results are provided in the following table, highlighting that EMMA significantly outperforms LLaVA on benchmarks that benefit from instruction-aware representations, such as VQAv2, VizWiz, SciQA, MuirBench, and hallucination-related tasks. While EMMA also performs better on other benchmarks with smaller differences, LLaVA slightly outperforms EMMA on the GQA benchmark. Overall, this demonstrates EMMA's strength, particularly in tasks requiring effective alignment between visual and textual modalities.
| Model | Data | VQAv2 | VizWiz | GQA | SciQA | OKVQA | MMB-en | MMB-cn | MathVista | MuirBench | AMBER-Attr | AMBER-Rel |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LLaVA-1.5 | 1.8M | 83.76 | 54.39 | 56.73 | 68.02 | 67.72 | 65.8 | 59.27 | 24.7 | 29.69 | 21.55 | 32.29 |
| EMMA | 1.8M | 89.42 | 56.03 | 56.01 | 73.14 | 68.57 | 66.44 | 60.15 | 25.1 | 32 | 26.15 | 37.34 |
Question: The conclusion drawn in line 343 is not convincing. The findings in Figure 5(a) demonstrate that text tokens are indeed involved in the feature fusion process. Yet, they do not sufficiently establish that the alignment module effectively identifies the "most" informative textual tokens. The reviewer thinks the positions of the most valuable tokens for an arbitrarily given instruction are random, and the current framework adopting a simple linear layer, which allocates fixed importance weights for textual tokens during inference, can not adapt to this change.
Answer: Thank you for pointing this out. It is important to note that: (1) the instructions used during training are predominantly short-length texts, typically containing a question about the figure, which results in many tokens being masked, and (2) the attention mask employed in text encoding ensures that these masked tokens are ignored during the encoding process. Figure 5(a) demonstrates that during training, the modality adaptation module effectively identifies the key text tokens and assigns them higher weights, emphasizing their importance in the feature fusion process.
Question: The reviewer praises the methodology illustrated in Figure 6, which employs Mutual Information to assess the importance of aligned visual representations in the reasoning process. However, concerns about its reliability are raised. A significant counterexample emerges in situations where the dataset requires responses strictly in YES/NO format, making the mutual information between the visual representations and the response encodings lack meaningful relevance.
Answer: For this analysis, we utilize the LLaVA-In-Wild [1] benchmark, which doesn't contain yes/no questions. The benchmark's statistics reveal that, on average, responses contain 98.0 words, with a minimum answer length of 5 words.
[1] Visual Instruction Tuning
Question: Which figure does the conclusion in line 338 correspond to?
Answer: Figure 2(a) presents the L2 norms of the weights associated with both visual and text tokens in a single plot. Comparing these values, we observe that the weights corresponding to the visual tokens exhibit larger norms.
Thanks for the authors' response, which has effectively addressed some of the concerns. However, there remain a few issues that still need attention:
- Given the slight decline in performances observed on both ChartQA and DocVQA, the reviewer believes that EMMA should be evaluated on additional OCR benchmarks (such as AI2D[1], OCRbench[2], TextVQA[3], and InfoVQA[4]) to ascertain whether the method can achieve comparable performance to LLaVA 1.5 or whether it consistently impairs performance on such tasks. Even if the latter is the case, it would be beneficial to acknowledge and discuss these limitations in the main text, along with proposing feasible solutions for future work.
- The concern raised in "Q1: In line 295" remains unaddressed by the incremental experiments on LLaVA presented in the General response. The current experimentation is still insufficient to substantiate the claim that "the inherent alignment between the two modalities allows for seamless integration, minimizing the need for complex cross-modal modules or extensive training to achieve alignment." The reviewer opines that appropriate ablation studies to validate this claim would involve (1) replacing LLaVA's MLP multi-modal projector with EMMA's modality adaptation module, rather than retaining both, given that LLaVA's MLP is inherently capable of achieving alignment between vision and text. (2) Additionally, retaining both the MLP and the modality adaptation module while omitting certain training data or phases from LLaVA's protocol would serve to demonstrate the modality adaptation module can minimize the need for extensive training to achieve alignment.
- A follow-up question regarding "Q3: The conclusion drawn in line 343 lacks convincing evidence." The reviewer acknowledges that during the training phase, the instructions were relatively short and straightforward. However, how does EMMA ensure the accurate identification of crucial textual tokens during the testing phase, particularly in cases where instructions are more complex? For example, in SQA, instructions often include extensive hints or lectures that are essential for guiding responses. If EMMA struggles to pinpoint critical textual tokens in such scenarios, what mechanisms or rationale support its ability to significantly enhance performance in SQA tasks?
[1] A diagram is worth a dozen images
[2] On the hidden mystery of ocr in large multimodal models
[3] Towards vqa models that can read
[4] Infographicvqa
Thank you for your response.
Question: Given the slight decline in performances observed on both ChartQA and DocVQA, the reviewer believes that EMMA should be evaluated on additional OCR benchmarks (such as AI2D[1], OCRbench[2], TextVQA[3], and InfoVQA[4]) to ascertain whether the method can achieve comparable performance to LLaVA 1.5 or whether it consistently impairs performance on such tasks. Even if the latter is the case, it would be beneficial to acknowledge and discuss these limitations in the main text, along with proposing feasible solutions for future work.
Answer: As suggested, we have extended our evaluation to include three of the four mentioned OCR benchmarks: OCRbench, TextVQA, and InfoVQA. Unfortunately, the AI2D dataset is not publicly available at the moment, preventing us from including it in our analysis. Below are the results comparing EMMA and LLaVA-1.5:
| Method | OCRbench | TextVQA | InfoVQA |
|---|---|---|---|
| LLaVA-1.5 | 34 | 58.2 | 14.8 |
| EMMA | 34.67 | 57.0 | 15 |
EMMA demonstrates performance comparable to LLaVA-1.5 on the newly added benchmarks: a slight improvement on OCRbench and InfoVQA, and a slight decline on TextVQA.
Moreover, we have included evaluations on the OCR benchmarks and expanded the discussion on this limitation in section 4 of the paper. For future work, we propose integrating additional OCR-focused pretraining datasets to enhance the training of the modality adaptation module, thereby improving performance on OCR-specific tasks.
Question:The concern raised in "Q1: In line 295" remains unaddressed by the incremental experiments on LLaVA presented in the General response. The current experimentation is still insufficient to substantiate the claim that "the inherent alignment between the two modalities allows for seamless integration, minimizing the need for complex cross-modal modules or extensive training to achieve alignment." The reviewer opines that appropriate ablation studies to validate this claim would involve (1) replacing LLaVA's MLP multi-modal projector with EMMA's modality adaptation module, rather than retaining both, given that LLaVA's MLP is inherently capable of achieving alignment between vision and text. (2) Additionally, retaining both the MLP and the modality adaptation module while omitting certain training data or phases from LLaVA's protocol would serve to demonstrate the modality adaptation module can minimize the need for extensive training to achieve alignment.
Answer: Below, we address the two suggestions raised in the question individually:
(1) Replacing LLaVA's MLP multi-modal projector with EMMA's modality adaptation module: It is important to clarify that LLaVA's projector is single-modal, as it operates exclusively on visual encodings to project them from the visual space into the language embedding space[1]. This functionality differs fundamentally from that of EMMA’s modality adaptation module, making them non-interchangeable. EMMA’s modality adaptation module is specifically designed to integrate instruction encodings into the visual tokens, producing instruction-aware visual representations. Its purpose is not to replace the visual-to-language projection but to enhance cross-modal alignment by early fusion of instruction encodings into the visual representations. As such, the two modules serve distinct roles.
(2) Omitting certain training data: In the following table we have provided the training data that is utilized for different SOTA multimodal LLMs:
| Method | #Samples |
|---|---|
| BLIP-2 [2] | 129M |
| InstructBLIP [3] | 130M |
| Shikra[4] | 6.1M |
| IDEFICS[5] | 354M |
| Qwen-VL[6] | 1.4B |
| mPLUG-Owl2[7] | 348M |
| BRAVE [8] | 100M |
| LLaVA-1.5[9] | 1.2M |
The training datasets for multimodal LLMs vary significantly in scale, ranging from millions to over a billion samples. LLaVA, with just 1.2 million samples, notably utilizes the smallest dataset required to fine-tune the projection module and LLM effectively. This minimal dataset size demonstrates the balance needed to achieve robust multimodal understanding in LLMs while minimizing the risk of overfitting.
[1] Visual Instruction Tuning
[2] BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
[3] InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning
[4] Shikra: Unleashing Multimodal Large Language Models with Fine-Grained Visual Alignment
[5] IDEFICS: An Open-Source Framework for Training Large Multimodal Models
[6] Qwen-VL: A Billion-scale Vision-Language Model with Versatile Abilities
[7] mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration
[8] BRAVE: Broadening the visual encoding of vision-language models
[9] LLaVA-1.5: Improved Baselines with Visual Instruction Tuning
Question: A follow-up question regarding "Q3: The conclusion drawn in line 343 lacks convincing evidence." The reviewer acknowledges that during the training phase, the instructions were relatively short and straightforward. However, how does EMMA ensure the accurate identification of crucial textual tokens during the testing phase, particularly in cases where instructions are more complex? For example, in SQA, instructions often include extensive hints or lectures that are essential for guiding responses. If EMMA struggles to pinpoint critical textual tokens in such scenarios, what mechanisms or rationale support its ability to significantly enhance performance in SQA tasks?
Answer: It is important to highlight that in our design, the instruction serves dual purposes:
- Guiding the Language Model (LLM): The full instruction is provided to the language model to ensure it generates a coherent and contextually relevant response.
- Refining Visual Tokens via the Modality Adaptation Module: The instruction also serves as a guide for refining the visual tokens within the modality adaptation module. However, in this case, only the main component of the instruction, namely the question, is passed through the module.
For instance, consider the following instruction from one of the samples: "Which type of force from the baby's hand opens the cabinet door?\nOptions:\n(A) pull\n(B) push". Only the question part, "Which type of force from the baby's hand opens the cabinet door?", passes through the modality adaptation module, which helps refine the visual tokens to focus on the topic of the question.
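As a minimal illustration of this trimming for the SQA-style example above, the parsing heuristic below is an assumption for exposition, not the authors' exact preprocessing code:

```python
def extract_question(instruction: str) -> str:
    # Keep only the question part for the modality adaptation module by
    # dropping the multiple-choice options block (an illustrative heuristic).
    return instruction.split("\nOptions:")[0].strip()

prompt = ("Which type of force from the baby's hand opens the cabinet door?\n"
          "Options:\n(A) pull\n(B) push")
print(extract_question(prompt))
# -> Which type of force from the baby's hand opens the cabinet door?
```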
Thank you for your feedback on our paper. I would like to kindly remind you that we have provided the requested inputs and the additional information. Please let me know if you have any further questions or concerns.
Dear Reviewer,
Thank you once again for your insightful comments and constructive feedback. As the rebuttal period draws to a close, please feel free to reach out if there are any additional points or clarifications you would like us to address.
We thank the reviewer for their valuable feedback and comments.
Question: In line 251, it is mentioned that current methodologies depend on complex cross-modality modules, particularly highlighting mPLUG-Owl2. However, the paper does not provide a theoretical comparison with methods like LLaVA and Qwen-VL, which utilize straightforward projection layers that add only a minimal number of parameters to achieve visual-language alignment.
Answer: In the paper, we compare our approach with both LLaVA and Qwen-VL, as detailed in Table 2 and Table 4, with a summary provided below for reference. Among the three methods, LLaVA[1] does not include a module for the early fusion of visual and textual tokens, deferring modality adaptation entirely to the language model (LLM) component. In contrast, EMMA incorporates a lightweight modality adaptation module with just 1.2M parameters, offering an efficient solution. On the other hand, Qwen-VL[2] utilizes a cross-attention-based adapter module with 80M parameters, making it significantly more complex and parameter-intensive compared to EMMA.
First set of benchmarks
| Method | #Params | VQAv2 | VizWiz | SQA | GQA | OKVQA |
|---|---|---|---|---|---|---|
| Qwen-VL-Chat | 9.6B | 78.2 | 38.9 | 68.2 | 57 | 56.6 |
| LLaVA-1.5 | 7B | 78.5 | 50.0 | 66.8 | 62.0 | - |
| EMMA | 7B | 89.42 | 56.03 | 73.14 | 56.01 | 68.57 |
Second set of benchmarks
| Method | #Params | MMB-en | MMB-cn | MMMU | MathVista | Muirbench |
|---|---|---|---|---|---|---|
| Qwen-VL-Chat | 9.6B | 60.6 | 56.7 | 35.9 | - | - |
| LLaVA-1.5 | 7B | 64.3 | 58.3 | 35.11 | 21.1 | 23.46 |
| EMMA | 7B | 66.44 | 60.15 | 35.44 | 25.1 | 32.0 |
As indicated in the tables, EMMA outperforms the other models on 8 out of 10 benchmarks, while being on par with the competing approaches on MMMU.
[1] Improved Baselines with Visual Instruction Tuning
[2] Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond
Question: The proposed EMMA introduces a Modality Adaptation module based on the correlation between word tokens in instructions and visual tokens. However, this methodology lacks performance guarantees when instructions are verbose or correlate poorly with visual tokens. The benchmarks selected in the paper are skewed towards VQA tasks, which inherently possess a high correlation between instructions and visual tokens. It is imperative that benchmarks assessing fine-grained perception capabilities, such as document understanding and OCR tasks (e.g., Chartqa[1] and Docvqa[2]), should also be evaluated to ensure comprehensiveness.
Answer: Thank you for raising this question. In EMMA's design, we prioritized a lightweight architecture to preserve the generality of visual tokens and minimize overfitting. As a result, EMMA is expected to retain the performance of LLaVA in scenarios where the instructions correlate poorly with the visual tokens, and to improve performance when the visual tokens can be refined by leveraging the instruction tokens. The following table compares the performance of LLaVA and EMMA, demonstrating that EMMA achieves performance levels similar to LLaVA in these cases.
| Method | ChartQA[1] | DocVQA[2] |
|---|---|---|
| EMMA | 9.9 | 18.27 |
| LLaVA-1.5 | 10.21 | 18.93 |
[1] Chartqa: A benchmark for question answering about charts with visual and logical reasoning.
[2] Docvqa: A dataset for vqa on document images
We would like to thank the reviewers for their interest, positive remarks, and comments. Below we emphasize two important comments on the contributions of our work and provide a summary of the changes in the revision of our paper. Please feel free to reach out if there are still questions or comments you would like us to address.
1- On the importance of having a lightweight modality adaptation module
EMMA is designed to provide efficient modality adaptation that adheres to the following principles:
- Early Fusion of Visual and Instruction Encodings: a lightweight cross-modality module designed to efficiently fuse visual and textual encodings, generating instruction-aware visual representations for the language model.
- Harnessing CLIP’s Inherent Alignment: EMMA is the first to leverage the inherent alignment between the visual and textual components of CLIP to enhance the efficiency of early modality adaptation in MLLMs, reducing the reliance on parameter-intensive modality adaptation modules.
- Lightweight and Data-Efficient Design: The model requires significantly fewer samples for training or fine-tuning, avoiding the need for hundreds of millions of data points.
- Enhanced Robustness Against Overfitting: EMMA’s design minimizes the risk of overfitting, even with limited training samples (LLaVA's instruction-tuning data of size 1.2M). This robustness ensures reliable performance across diverse datasets.
- Preservation of Visual Token Generality: The model carefully refines visual tokens with the instructions while preserving the tokens’ generality, maintaining their adaptability across various downstream tasks.
- Flexibility for Diverse Applications: EMMA’s lightweight architecture allows efficient fine-tuning across diverse datasets and applications [1, 2] (in contrast to its counterparts, which employ complex modality adaptation modules requiring millions of data points for fine-tuning).
The proposed method is intentionally designed with these considerations in mind, balancing simplicity and efficiency to meet these critical objectives.
2 - The direct effect of EMMA's modality adaptation
To isolate the impact of adding the modality adaptation module to the baseline model (LLaVA), we conducted an initial set of experiments where the modality adaptation module was integrated into LLaVA's architecture while keeping all other settings the same as LLaVA (including training data, training setup, etc.). This ensures a fair evaluation of the capabilities of EMMA's proposed modality adaptation. The results, shown in Figure 5, reflect improvements attributable solely to the modality adaptation module, as it is the only factor that was modified. The table below is provided as a reference, derived from Figure 5.
| Method | Data | SQA | GQA | POPE | MMB-en | MMB-cn | MMVP | A-Bench | MuirB |
|---|---|---|---|---|---|---|---|---|---|
| LLaVA-1.5 | 1.2M | 66.8 | 62.0 | 86.1 | 64.3 | 58.3 | 24.7 | 62.97 | 23.46 |
| EMMA | 1.2M | 71.00 | 62.78 | 87.27 | 66.19 | 60.2 | 34 | 63.48 | 29.08 |
| Difference | --- | +4.2% | +0.78% | +1.17% | +1.89% | +1.9% | +9.3% | +0.51% | +5.62% |
3 - Summary of changes in the paper
- The discussion on the direct impact of EMMA's modality adaptation has been relocated to the ablation study section for clarity and better context.
- Detailed descriptions of the model have been transferred from the analysis section (3.2) to section 3.1 to provide the full picture of our proposed method in one part.
- Mathematical formulations of EMMA have been included in section 3.1 to enhance the understanding of EMMA's methodology.
- Figure 1 has been repositioned closer to section 3.1, where EMMA's architecture is introduced, ensuring better alignment between the explanation and the visual representation.
- Missing details in the figures have been thoroughly added to provide a complete and accurate depiction.
- Additional hallucination benchmarks have been incorporated into Table 4 to provide a more comprehensive evaluation of hallucinations.
- The QA-ViT [3] method, which was missing from the related works and evaluations, has been added to both sections.
Please note that, due to the revisions made to the paper, the numbering of figures and tables may have changed. To maintain consistency, all references to figures and tables correspond to the revised version of the paper submitted during the rebuttal period.
[1] LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day
[2] PA-LLaVA: A Large Language-Vision Assistant for Human Pathology Image Understanding
[3] Question Aware Vision Transformer for Multimodal Reasoning. In CVPR, 2024
Dear Reviewers,
Thank you once again for your comments and feedback. We would like to kindly remind you that we are approaching the end of the rebuttal period, so please don’t hesitate to reach out if you still have any questions or comments about our work.
Dear Reviewers,
Our reviewer-author discussion will end soon. Please check all the files and see if there is anything you'd like to discuss with the authors.
Best, Your AC
This paper proposes a visual alignment design in multi-modal LLMs, conducted via an early fusion module with in-depth analysis and experimental validation. The reviews are mixed. [QZxi, Eq4D] raised several issues regarding a lack of theoretical comparison, a lack of performance guarantees, and limited comparison with adapter-based approaches. During the discussion phase, the authors addressed these issues by providing additional experiments and technical illustrations. [Eq4D] is generally satisfied, while [QZxi] still raised unconvincing performance comparisons and a lack of claim support via ablation studies. Overall, the AC has checked all the files and feels the current early fusion design is not lightweight, in that a text encoder is introduced along with an additional instruction projection. It theoretically makes sense that adding additional interaction modules benefits vision-language correlations. The advantages of these are largely owed to the text encoder, which utilizes pre-trained data. Meanwhile, the back-and-forth projection of text data (i.e., the input text is projected into CLIP space and then back to the text tokenized space) is not elegant and makes the design more engineering-like. The authors should further improve the current design and are welcome to submit to the next venue.
Additional Comments on Reviewer Discussion
[QZxi, Eq4D] raised several issues regarding a lack of theoretical comparison, a lack of performance guarantees, and limited comparison with adapter-based approaches. During the discussion phase, the authors addressed these issues by providing additional experiments and technical illustrations. [Eq4D] is generally satisfied, while [QZxi] still raised unconvincing performance comparisons and a lack of claim support via ablation studies.
Reject