M²IV: Towards Efficient and Fine-grained Multimodal In-Context Learning via Representation Engineering
We propose a novel approach to vectorized multimodal in-context learning that effectively improves both efficiency and performance.
Abstract
Reviews and Discussion
This paper introduces M²IV, a method to improve multimodal in-context learning (ICL) in Large Vision-Language Models (LVLMs) by replacing token-intensive demonstrations with learnable vectors injected into the MHA and MLP blocks. These vectors, trained via self-distillation with mimicry, synergistic, and supervised losses, aim to capture fine-grained semantics and leverage the distinct roles of MHA and MLP for better efficiency and performance, particularly in many-shot scenarios. The paper also proposes VLibrary for storing and reusing these M²IV vectors for various tasks.
Quality & Clarity: The methodology and experiments are reasonably detailed and extensive, though clarity on some aspects like the synergistic loss and many-shot aggregation could be improved.
Originality: While building on prior work in representation engineering, the specific M²IV approach targeting both MHA/MLP with synergistic training appears novel.
Significance: Addresses key ICL challenges (efficiency, performance). If claims hold, M²IV and VLibrary offer a potentially valuable and efficient paradigm for LVLM adaptation.
Reasons to Accept
Decent Empirical Results: Demonstrates state-of-the-art performance across a comprehensive set of 7 benchmarks and 3 LVLMs, consistently outperforming both vanilla ICL (few-shot and many-shot) and several existing representation engineering methods (TV, FV, ICV, I2CL, LIVE). The average accuracy gain of 3.74% over 16-shot ICL is notable.
Efficiency Gains: Addresses the high cost of standard ICL by significantly reducing token overhead during inference. The paper provides evidence that the inference time savings can offset the initial training cost, making it potentially cost-effective for widespread application. This efficiency extends to many-shot scenarios, bypassing context length limitations.
Practical Utility (VLibrary): Introduces VLibrary, a practical framework for storing, retrieving, combining, and transferring M²IVs, showcasing potential for real-world applications like cross-modal alignment, output customization (instruction following, explainability), and safety improvements.
Reasons to Reject
Incremental Novelty: While the specific M²IV formulation is new, the core concept of using learned vectors to replace ICL demonstrations builds directly upon recent work like LIVE, ICV, I2CL, FV, and TV. The improvement over LIVE, while present, might be seen as incremental, especially considering LIVE also uses trained layer-wise vectors.
Complexity and Specificity: The method requires a specific training process for each task and each LVLM architecture. While VLibrary allows storing these, generating them involves significant computation (though potentially amortized). The transferability experiments, while promising, show a performance drop (-1.48% on average without fine-tuning), indicating model/task specificity remains a factor. The combination strategies also add complexity.
Clarity of Theoretical Justification: Theorems 1 and 2 attempt to formalize the distinct roles of MHA and MLP. However, the theorems themselves seem to show decompositions based on linearity/structure rather than fundamentally proving why MHA handles broader semantics and MLP handles fine-grained details in the complex non-linear dynamics of deep networks. The link between these linear decompositions and the high-level claims about MHA/MLP roles could be stronger or more nuanced. The assumptions (e.g., linear MLP in Thm 2 proof) might oversimplify.
Limited Analysis of Failure Cases: While ablations show the importance of different components, the paper provides limited qualitative analysis or discussion of when or why M²IV might fail or underperform compared to vanilla ICL, even if it does so less often. Understanding the limitations is crucial.
Data Processing Dependence: The data processing steps (CLIP embeddings, k-means clustering, augmentation) seem crucial for performance, as shown in ablations (Table 2, Fig 9). This adds components and hyperparameters to the overall pipeline, and the sensitivity to these choices (e.g., number of clusters K) could be explored further.
Questions for Authors
- Could you provide more intuition behind the synergistic loss? How exactly does encouraging alignment along the diagonal of M^l and orthogonality off-diagonal foster the desired complementary roles of MHA and MLP?
- Regarding Theorems 1 and 2: How well do these linear decompositions capture the actual behavior within complex, non-linear LVLMs? Theorem 2's proof assumes a linear MLP transformation (xW); how does this relate to standard MLP blocks with non-linear activations?
- The many-shot aggregation method is mentioned briefly. Could you elaborate on the pairwise aggregation mechanism for MLP outputs? How is information loss mitigated during this sequential aggregation?
- While M²IV outperforms vanilla ICL on average, are there specific task types or data characteristics where vanilla ICL might still be preferable or where M²IV shows limitations? A deeper qualitative error analysis would be beneficial.
- How sensitive is the performance of M²IV to the choice of the CLIP model used for initial embedding/clustering and the number of clusters K in the k-means step?
Thank you for the nice comments and valuable suggestions. Below we offer clarifications to the concerns you raised.
Q1: Could you provide more intuition behind the synergistic loss?
R1:
- Intuition: Our goal is to distil multimodal in-context sequences into a compact vector set that preserves, and in practice often surpasses, the performance obtained when the model receives demonstrations explicitly. This objective imposes two simultaneous demands on the vectors: (i) they must compress the sequences to reduce redundancy, and (ii) they must still expose the fine-grained semantic cues that guide LVLMs to recognise the task implicit in the demonstrations. These requirements appear contradictory because aggressive distillation usually erases precisely the subtle relations and attributes on which task recognition depends, and relying solely on a mimicry loss based on output distributions is unlikely to address this issue thoroughly. Prior work on multimodal ICL suggests that in certain complex scenarios the context is more inclined to provide task recognition signals about what task should be executed, while exhaustive scene details for task learning may contribute little to downstream reasoning. Inspired by these insights, we envisioned a straightforward solution: first select information from the sequences and discard most unnecessary parts, then perform fine-grained distillation on the remaining information. To implement this in the model’s latent space, we take note of the complementary inductive biases of MHA and MLP. We aim to partially decouple their functions through explicit intervention during processing, thereby realizing a filtering-followed-by-distillation pipeline. Accordingly, we introduce the synergistic loss that enables the attention mechanism to select salient relational information while guiding the MLP to refine those cues into a denser, task-aware encoding, thus performing selection and fine-grained distillation in concert.
- Mechanisms: To enforce coordination we build, at each layer, the normalised cross-view correlation matrix M^l. The synergistic loss presses the diagonal of this matrix toward one, forcing each attention axis to remain collinear with its MLP counterpart. Attention therefore acts as a focused gating mechanism that picks out the relational features most indicative of the latent task, and the MLP is obliged to refine primarily those same features instead of drifting to unrelated content. Because the two branches now speak in a common semantic basis, their residual contributions add constructively, preserving the evidence needed for accurate task recognition during inference. The loss simultaneously drives every off-diagonal entry toward zero. By penalising cross-axis correlation we prevent either branch from encoding information unrelated to core semantics. This decorrelation enlarges the effective subspace spanned by the combined outputs, so attention can continue to represent where and with whom tokens interact while the MLP can devote its full capacity to how their attributes support the hypothesised task. The joint pressure of diagonal alignment and off-diagonal suppression therefore realises our intended pipeline inside the model’s latent space: selection occurs when attention gates the task-relevant relations, and fine-grained distillation occurs when the MLP sharpens those relations into a compact yet semantically rich representation.
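For concreteness, this objective can be sketched in PyTorch as below. This is a minimal illustration of the diagonal-alignment and off-diagonal-suppression pressures described above, not our exact implementation; the tensor shapes, the row-wise normalisation, and the equal weighting of the two terms are assumptions.

```python
import torch
import torch.nn.functional as F

def synergistic_loss(v_attn: torch.Tensor, v_mlp: torch.Tensor) -> torch.Tensor:
    """Per-layer penalty pushing the cross-view correlation matrix M^l toward identity.

    v_attn, v_mlp: (k, d) tensors holding k axes of the MHA-branch and MLP-branch
    representations at one layer (shape and normalisation are assumptions).
    """
    a = F.normalize(v_attn, dim=-1)                 # unit-norm rows, attention view
    m = F.normalize(v_mlp, dim=-1)                  # unit-norm rows, MLP view
    corr = a @ m.t()                                # normalised cross-view correlation (k x k)
    eye = torch.eye(corr.size(0), device=corr.device)
    diag_term = ((corr.diagonal() - 1.0) ** 2).mean()                # drive diagonal toward 1
    off_mask = 1.0 - eye
    offdiag_term = (corr * off_mask).pow(2).sum() / off_mask.sum().clamp(min=1.0)  # off-diagonal toward 0
    return diag_term + offdiag_term
```

In training, a term of this form would be computed per layer and added to the mimicry and supervised losses with an appropriate weighting coefficient.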
Q2: How well do these linear decompositions capture the actual behavior within complex, non-linear LVLMs?
R2: Theorems 1 and 2 aim to demonstrate how MHA and MLP process demonstrations differently, enabling their unique roles in ICL. The decomposition is not intended to fundamentally prove why MHA handles broader semantics and MLP handles fine-grained details, but rather to show how these components can be utilized, each exploiting its unique properties, to embed in-context sequences as vectors within the LVLMs.
Our approach assumes a linear MLP for simplicity, focusing on the linear part while initially disregarding the non-linear activation and subsequent linear transformation. However, this simplification does not compromise the generality of our findings. In reality, standard MLP blocks with non-linear activations can be represented as MLP(x) = σ(xW₁)W₂, where σ is the activation function. Even when the activation function σ and the W₂ operation are incorporated on both sides of the equation, our conclusions remain valid when applied to Theorem 2. This is because these elements can be seamlessly integrated into a more comprehensive model framework, thereby preserving the essential characteristics of the model's behavior.
Q3: The many-shot aggregation method is mentioned briefly. Could you elaborate on the pairwise aggregation mechanism for MLP outputs? How is information loss mitigated during this sequential aggregation?
R3: We apologize for omitting the detailed processing pipeline for many-shot M²IV from the main text.
- Pairwise aggregation pipeline: A long n-shot prompt is first divided into overlapping windows of length w with an overlap o; the teacher LVLM runs on every window and the MLP activations are mean-pooled into fixed-length vectors v₁, …, v_K. These per-window summaries are then combined left-to-right in a size-preserving, two-vector loop: at step t the current aggregate a_{t-1} is concatenated with the new window vector v_t to form [a_{t-1}; v_t]; two learned projections produce a gate g_t and a candidate ã_t; the update a_t = g_t ⊙ a_{t-1} + (1 − g_t) ⊙ ã_t keeps dimensionality at d, matching the original MLP hidden size. Sharing the same tiny gating block across all windows makes the procedure O(n) in time and adds only O(d²) parameters, so the final vector can be injected back into the LVLM without any adapter (a code sketch of this gated loop follows after this list).
- Information loss mitigation: The element-wise gate learned above selectively retains high-salience features from earlier aggregates. To further curb boundary effects, we employ a lightweight tuning strategy to determine optimal window length w and overlap o: (i) a coarse grid (e.g. w ∈ {32, 64, 128}, o ∈ {0, ¼w, ½w}) is evaluated on a held-out set using cosine similarity between the aggregated student vector and a full-context teacher vector as a cheap proxy for fidelity; (ii) the top few candidates enter a successive-halving loop that progressively allocates more budget, and the survivor is picked by end-task accuracy adjusted for latency.
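Below is a minimal PyTorch sketch of this gated pairwise loop. The class name and the sigmoid/tanh parameterisation of the gate and candidate are illustrative assumptions; only the overall structure (concatenate, gate, size-preserving update) follows the description above.

```python
import torch
import torch.nn as nn

class PairwiseAggregator(nn.Module):
    """Left-to-right gated aggregation of per-window MLP summaries (illustrative sketch)."""

    def __init__(self, d: int):
        super().__init__()
        self.gate = nn.Linear(2 * d, d)   # produces the element-wise gate g_t
        self.cand = nn.Linear(2 * d, d)   # produces the candidate update

    def forward(self, window_vecs: torch.Tensor) -> torch.Tensor:
        # window_vecs: (num_windows, d), each row a mean-pooled MLP activation summary
        agg = window_vecs[0]
        for v in window_vecs[1:]:
            z = torch.cat([agg, v], dim=-1)          # concatenate aggregate and new window
            g = torch.sigmoid(self.gate(z))          # element-wise gate
            cand = torch.tanh(self.cand(z))          # candidate vector
            agg = g * agg + (1 - g) * cand           # size-preserving update, stays d-dimensional
        return agg                                    # final vector injected back into the LVLM

# e.g. final_vec = PairwiseAggregator(d=4096)(torch.randn(8, 4096))
```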
Q4: While M²IV outperforms vanilla ICL on average, are there specific task types or data characteristics where vanilla ICL might still be preferable or where M²IV shows limitations?
R4: Your suggestion is highly valuable. Our main experiments (Table 8) demonstrate M²IV’s strong performance and generalization, which led us to overlook a deeper examination of cases where M²IV might underperform vanilla ICL. In this revision we present a detailed qualitative error analysis that synthesises evidence from the main results, the ablation study, and the VLibrary application.
- Architecture-induced performance drop: In our main results, M²IV underperforms vanilla ICL only once across all benchmark–model pairs: on A-OKVQA using LLaVA-Next (43.96% vs. 45.92%). However, on the same benchmark, M²IV performs significantly better than vanilla ICL on OpenFlamingo and IDEFICS2, suggesting the failure stems not from the task itself but from architecture-specific properties of LLaVA-Next. Unlike the other two LVLMs, LLaVA-Next processes a dense stream of 576 visual tokens, which imposes a heavy visual load on the residual pathways. We hypothesize that this high-density tokenization dominates intermediate representations, making them less receptive to latent modulations introduced by M²IV. Consequently, the injected vectors may either be diluted or conflict with already saturated pathways, resulting in performance degradation not observed in architectures with sparser or more modular vision-language fusion. This issue can be mitigated as current-generation LVLMs reduce the number of image tokens in their encoding.
- Flaws in teacher model input: As shown in Figure 7, on CVQA, when using I2I retrieval, M²IV underperforms vanilla ICL with RS. As CVQA exhibits high cultural and linguistic diversity, its knowledge base covers a broad range of semantic classes with substantial inter-class variation. However, I2I retrieval, which is driven solely by visual similarity, frequently selects ICDs that are visually close but semantically unrelated to the query, leading to misleading supervision during M²IV training. Therefore, we conclude that although M²IV, through distillation and training, can effectively mitigate retrieval bias in the teacher models (as evidenced across all benchmarks) and achieve substantial gains over vanilla ICL using the same retrieval strategy, it may still fail to converge on coherent task representations when the knowledge base exhibits a highly diverse and dispersed distribution. In such cases, this limitation prevents further improvements in accuracy. This suggests that when the sequence data used as the input to the teacher model inevitably exhibit significant biases and related deficiencies, vanilla ICL remains the preferred option. This observation also highlights the importance of thorough training data processing.
- Regressions in VLibrary applications: While M²IV achieves strong performance in most application scenarios, we observe two notable cases where it may underperform vanilla ICL during downstream use via VLibrary. First, as shown in Table 3, injecting M²IV into the middle or last 10 layers leads to lower accuracy than vanilla ICL on several datasets, including OK-VQA and VizWiz. This suggests that even though partial injection provides a flexible pathway for using M²IV, some scenarios demand tricky layer selection, while vanilla ICL remains simpler to apply in practice because it requires no internal modifications. Second, in the explainability experiments (Table 14) on VizWiz, 128-shot M²IV achieves higher CIDEr for rationale generation but lower VQA accuracy than 16-shot vanilla ICL (48.18% vs. 49.36%). Adding rationales makes the input more text-heavy, while VizWiz images can be visually ambiguous. This likely limits the benefits of M²IV’s fine-grained representations. Thus, in this text-heavy setting with poor cross-modal alignment, vanilla ICL remains the safer choice over higher-shot M²IV. However, this also raises a trade-off: the model’s limited long-text understanding may undermine the benefit of explicit rationale input.
Q5: How sensitive is the performance of M²IV to the choice of the CLIP model used for initial embedding/clustering and the number of clusters K in the k-means step?
R5:
- Robustness to CLIP Model Selection: In the M²IV pipeline, CLIP is used exclusively for constructing the training dataset by converting all examples into embeddings for clustering. In our paper, we employ a frozen CLIP‐ViT‐L/14. Here, we also evaluate three alternative CLIP configurations: CLIP‐ViT‐B/32, a version of CLIP‐ViT‐L/14 fine‐tuned with LoRA on the training data, and a fully fine‐tuned CLIP‐ViT‐L/14. The results in Table 1* indicate a clear trend: higher quality CLIP embeddings, which more effectively capture the semantic features of training examples, tend to result in better clustering quality. This leads to improved training data and, consequently, enhanced M²IV performance. Despite these trends, performance remains relatively stable across all four CLIP configurations, with only minor variation across tasks. This stability highlights the robustness of M²IV to the choice of CLIP model. One primary reason is that CLIP is used exclusively to construct training clusters. As long as the clustering captures broad task-level structure, the benefit from further improving embedding quality is limited. Thus, additional fine-tuning adds cost for minimal benefit, so a frozen CLIP-ViT-L/14 is optimal.
| Methods | VQAv2 | VizWiz | OK-VQA | GQA | A-OKVQA | CVQA | VL-ICL |
|---|---|---|---|---|---|---|---|
| ViT-L/14 | 63.93 | 43.40 | 55.37 | 73.81 | 53.59 | 60.11 | 33.36 |
| ViT-B/32 | 62.87 | 42.95 | 55.01 | 73.67 | 53.21 | 58.92 | 33.28 |
| ViT-L/14(LoRA) | 64.11 | 43.56 | 55.72 | 73.49 | 53.83 | 60.80 | 33.49 |
| ViT-L/14(Full) | 64.23 | 43.87 | 55.68 | 73.61 | 54.19 | 60.85 | 33.46 |
Table 1*: Comparison of M²IV performance across different CLIP models.
- Clarification of the Exploration of K's Impact: In Appendix K (referenced at line 273 of the main text), we examine how the number of clusters K affects M²IV’s performance. We plot M²IV’s accuracy on three representative benchmarks (VQAv2, GQA, and A-OKVQA) while varying K from 500 to 3500. Across all three datasets, using semantic clustering to construct training data yields a characteristic trend: performance initially rises sharply as K increases. As K grows further, the rate of improvement slows and may even decline slightly at higher values. This pattern indicates that the diversity of training examples contributes more to M²IV’s success than sheer quantity, underscoring our approach’s potential in low-resource settings. More detailed conclusions can be found in Appendix K, lines 773–782.
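For readers who wish to reproduce the data-construction step, a rough sketch is given below. The Hugging Face checkpoint name, the concatenation of image and text embeddings, and the per-cluster index map are assumptions; only the overall CLIP-embedding-plus-k-means structure follows the pipeline described in the paper.

```python
import numpy as np
import torch
from sklearn.cluster import KMeans
from transformers import CLIPModel, CLIPProcessor

# Frozen CLIP-ViT-L/14 (checkpoint name is an assumption about the setup).
clip = CLIPModel.from_pretrained("openai/clip-vit-large-patch14").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

@torch.no_grad()
def embed_example(image, question: str) -> np.ndarray:
    """Represent one VQA example by concatenated CLIP image and text embeddings."""
    inputs = processor(text=[question], images=[image], return_tensors="pt",
                       padding=True, truncation=True)
    img = clip.get_image_features(pixel_values=inputs["pixel_values"])
    txt = clip.get_text_features(input_ids=inputs["input_ids"],
                                 attention_mask=inputs["attention_mask"])
    return torch.cat([img, txt], dim=-1).squeeze(0).numpy()

def cluster_knowledge_base(examples, k: int = 2000) -> dict:
    """Partition the knowledge base into K semantic clusters for training-data sampling."""
    feats = np.stack([embed_example(img, q) for img, q in examples])
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(feats)
    return {c: np.where(labels == c)[0] for c in range(k)}   # cluster id -> example indices
```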
Thank you again for your time and valuable feedback. We hope the above content resolves your concerns. All the newly added experiments have been incorporated into the revision. Please let us know if there are further questions.
Thank you for the additional experiments --- I have raised my score to 6.
We sincerely thank you for your careful reassessment and updated score. Your insightful comments were instrumental in helping us improve the clarity and quality of the paper.
The paper proposes M2IV, a method for enhancing multimodal in-context learning in LVLMs by replacing token-heavy demonstrations with compact, learnable vectors injected into the model's residual stream. By explicitly leveraging the distinct roles of multi-head attention and MLP, M2IV achieves fine-grained semantic distillation and efficient cross-modal reasoning. It is trained via a self-distillation framework and enables efficient multi-shot ICL through a vector repository called VLibrary. The approach achieves state-of-the-art performance across seven multimodal benchmarks and three LVLMs, with better efficiency and generalizability compared to prior representation engineering techniques.
Reasons to Accept
- Novel decomposition of residual stream into MHA and MLP components for interpretable and targeted vector injection.
- Efficient many-shot scaling through M2IV aggregation, mitigating context window limitations without additional tokens.
- Introduction of VLibrary, a plug-and-play system for managing and applying learned vectors.
Reasons to Reject
- M2IV requires task-specific training, which partially undercuts the appeal of ICL as a parameter-free adaptation method.
- The approach, though efficient at inference, introduces a non-negligible training overhead and relies on curated training data and clustering pipelines.
- Generalization across models or tasks not explicitly trained on remains somewhat unclear; transfer strategies are briefly discussed but not deeply validated.
- Limited discussion on failure modes or scenarios where M2IV underperforms (e.g., in highly abstract or text-heavy tasks).
Thanks for your constructive suggestions. Your endorsement of our method and experiments gives us significant encouragement. Here are our clarifications.
Q1: M2IV requires task-specific training, which partially undercuts the appeal of ICL as a parameter-free adaptation method.
R1: Your question goes to the heart of our contribution. ICL is valued not just because it avoids parameter updates, but also because of its efficiency. Moreover, our study is grounded in ICL, but it focuses on representation engineering. This line of work seeks to convert many explicit prompts into a compact internal vector that guides the model without being shown at inference time. Our goal is to devise a representation-engineering method that boosts accuracy in the challenging multimodal ICL setting and surpasses existing static techniques. In this context, a small amount of task-specific additional training is necessary, yet it does not dilute the efficiency that makes ICL attractive to users. In the paper we articulate this point from two angles:
- The Necessity of Training: Existing training-free methods have been shown to be insufficiently effective in complex multimodal ICL tasks, as they are typically confined to purely textual tasks with small label spaces and relatively homogeneous task types within a sequence. Equally limiting, a fixed projection is tied to the single sequence from which it is computed and thus cannot encode knowledge dispersed across a larger knowledge base, a capability that real applications routinely require. Untrained representations cannot accurately or fully capture the content of a dataset. Only through specialized training can information from numerous examples be effectively compressed into a single vector that meets usability standards.
- Training and Inference Efficiency of M²IV: This is the issue we highlight in the first paragraph of Section 4.3. We demonstrate M²IV’s efficiency using two visualizations, one for inference cost and one for total runtime. These plots show that altering certain parameters or introducing training does not reduce efficiency compared to vanilla ICL. On the contrary, in scenarios requiring repeated ICL queries, M²IV achieves remarkable efficiency improvements over vanilla ICL. Moreover, as the dataset grows, M²IV’s additional training time increases far more slowly than the inference-time penalty incurred by appending more examples to the model input. These results demonstrate that on moderate-sized benchmarks and in large-scale real-world tasks, M²IV improves LVLM performance and achieves significant efficiency gains. As modern users increasingly prioritize efficiency, M²IV does not undercut the appeal of ICL; rather, it delivers greater practical value.
Q2: The approach, though efficient at inference, introduces a non-negligible training overhead and relies on curated training data and clustering pipelines.
R2: As we noted earlier, M²IV is not only efficient at inference; its training overhead is also far smaller than "non-negligible" suggests and does not affect its overall efficiency. Moreover, because M²IV requires training and our goal is to minimize the data needed, high-quality training data and a dedicated preprocessing pipeline are essential.
- Dilutable training overhead: Figure 5 compares, for each benchmark, the total time needed for M²IV (training plus inference) with the inference time of vanilla ICL and shows that, after adding the training phase M²IV is still faster overall. In real-world applications this advantage becomes more pronounced, because the inference cost of vanilla ICL rises sharply as the knowledge base grows and larger context windows include more demonstrations, whereas M²IV incurs its training expense only once. Consequently, the training of M²IV does not diminish its overall efficiency.
- Reduced data requirement: Experiments in Appendix K demonstrate that, for M²IV training, data quality is more important than sheer quantity. With carefully curated samples, M²IV already delivers strong performance, and adding too many examples can even cause a slight drop in accuracy. This outcome confirms that our layer-wise steering vectors for the MHA and MLP branches capture the distribution of the full knowledge base more effectively than simply enlarging the dataset. The result is crucial for representation engineering, where many existing approaches overlook the search for optimal training data. Notably, compared with LIVE our method achieves better accuracy while using 76 percent fewer training examples. The saved training budget partially offsets the cost of data processing. Furthermore, our pipeline is straightforward to implement because it relies only on pretrained CLIP encoders and unsupervised k-means clustering.
Q3: Generalization across models or tasks not explicitly trained on remains somewhat unclear; transfer strategies are briefly discussed but not deeply validated.
R3: A detailed description of our transfer strategies and their empirical results appears in Appendix L. If a task T can be expressed as the union of two subtasks, the algebraic properties of vectors let us construct T’s M²IV by combining the M²IVs of the two subtasks, without retraining on a new knowledge base. For example, translating English into either Japanese or French (task A) can be represented as the union of translating English into Japanese (task B) and translating English into French (task C). Hence the M²IV for task A can be obtained by combining the M²IVs for tasks B and C. This combination may be a training-free vector addition, or it may involve a lightweight tuning step with a trainable parameter.
For cross-model generalization, when two LVLMs M and N share the same hidden-state dimensionality and number of layers, the entire set of M²IVs learned for M can be injected directly into N. This transfer, too, can proceed either without training or with a lightweight tuning step. Results in Tables 12 and 13 validate the effectiveness of both types of transfer.
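The two transfer modes can be sketched as follows; the per-layer convex weighting and the class interface are our own illustrative choices rather than the exact procedure in Appendix L.

```python
import torch
import torch.nn as nn

class M2IVCombiner(nn.Module):
    """Combine two task-specific M²IV sets into one (illustrative sketch).

    Each set is assumed to be a dict mapping layer index -> per-layer vector
    (applied separately to the MHA and MLP branches in practice).
    """

    def __init__(self, num_layers: int, trainable: bool = False):
        super().__init__()
        # Training-free mode: a fixed 0.5/0.5 mix. Lightweight tuning mode:
        # one learnable mixing scalar per layer, optimised on a small held-out set.
        self.w = nn.Parameter(0.5 * torch.ones(num_layers), requires_grad=trainable)

    def forward(self, m2iv_b: dict, m2iv_c: dict) -> dict:
        return {l: self.w[l] * m2iv_b[l] + (1.0 - self.w[l]) * m2iv_c[l]
                for l in m2iv_b}

# Task A = union of tasks B and C (e.g. EN->JA plus EN->FR translation):
# m2iv_a = M2IVCombiner(num_layers=32, trainable=False)(m2iv_b, m2iv_c)
```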
Q4: Limited discussion on failure modes or scenarios where M2IV underperforms (e.g., in highly abstract or text-heavy tasks).
R4: Thank you for your insightful comments. A thorough discussion of the limitations is indeed important for this paper. In this revision we present a detailed qualitative error analysis that synthesises evidence from the main results, the ablation study, and the VLibrary application.
- Architecture-induced performance drop: In our main results, M²IV underperforms vanilla ICL only once across all benchmark–model pairs: on A-OKVQA using LLaVA-Next (43.96% vs. 45.92%). However, on the same benchmark, M²IV performs significantly better than vanilla ICL on OpenFlamingo and IDEFICS2, suggesting the failure stems not from the task itself but from architecture-specific properties of LLaVA-Next. Unlike the other two LVLMs, LLaVA-Next processes a dense stream of 576 visual tokens, which imposes a heavy visual load on the residual pathways. We hypothesize that this high-density tokenization dominates intermediate representations, making them less receptive to latent modulations introduced by M²IV. Consequently, the injected vectors may either be diluted or conflict with already saturated pathways, resulting in performance degradation not observed in architectures with sparser or more modular vision-language fusion. This issue can be mitigated as current-generation LVLMs reduce the number of image tokens in their encoding.
- Flaws in teacher model input: As shown in Figure 7, on CVQA, when using I2I retrieval, M²IV underperforms vanilla ICL with RS. As CVQA exhibits high cultural and linguistic diversity, its knowledge base covers a broad range of semantic classes with substantial inter-class variation. However, I2I retrieval, which is driven solely by visual similarity, frequently selects ICDs that are visually close but semantically unrelated to the query, leading to misleading supervision during M²IV training. Therefore, we conclude that although M²IV, through distillation and training, can effectively mitigate retrieval bias in the teacher models (as evidenced across all benchmarks) and achieve substantial gains over vanilla ICL using the same retrieval strategy, it may still fail to converge on coherent task representations when the knowledge base exhibits a highly diverse and dispersed distribution. In such cases, this limitation prevents further improvements in accuracy. This suggests that when the sequence data used as the input to the teacher model inevitably exhibit significant biases and related deficiencies, vanilla ICL remains the preferred option.
- Regressions in VLibrary applications: While M²IV achieves strong performance in most application scenarios, we observe two notable cases where it may underperform vanilla ICL during downstream use via VLibrary. First, although partial injection provides a flexible pathway for using M²IV, some scenarios demand tricky layer selection, while vanilla ICL remains simpler to apply in practice because it requires no internal modifications. Second, adding rationales makes the input more text-heavy and VizWiz images can be visually ambiguous. This likely limits the benefits of M²IV’s fine-grained representations. Thus, in this text-heavy setting with poor cross-modal alignment, vanilla ICL remains the safer choice over higher-shot M²IV.
Thank you again for your time and valuable feedback. We hope the above content resolves your concerns. All the newly added experiments have been incorporated into the revision. Please let us know if there are further questions.
I have read the authors' rebuttal and the comments from other reviewers. I will maintain my rating of 6.
Thank you very much for your response and for the valuable comments you provided on our work. We would like to politely inquire about the reason the score was not updated. We are wondering if any of your concerns remain unaddressed, despite the additional clarifications and the newly added error analysis section in our revision. Please feel free to let us know if there is anything further you would like to discuss.
Dear reviewer, as the rebuttal phase comes to an end soon, we would really like to know whether our responses have adequately addressed some of your concerns. If you feel they have, we would be truly grateful if you could kindly consider adjusting your scores accordingly. Thank you very much.
This paper presents a new way to improve how large models understand both images and text. Instead of showing these models many examples, this method uses special learnable "In-context Vectors." This technique helps the models perform better, work more efficiently, and handle more information. The paper also describes a library to store and use these vectors for different tasks like improving how models connect images and text, and enhancing safety. Tests show this new approach works better than older methods.
Reasons to Accept
- Novel and Impactful Approach: The paper introduces an innovative method using learnable "In-context Vectors" that significantly enhances how models process and understand combined image and text information, leading to superior performance.
- Improved Efficiency and Scalability: This new technique streamlines multimodal learning, making models more efficient in terms of processing and able to handle a large number of examples, overcoming previous contextual limitations.
- Practical Toolkit for Versatile Applications: The research provides a valuable "VLibrary" for these vectors, allowing flexible use in important real-world scenarios such as better aligning model understanding across modalities, tailoring model outputs, and improving safety, thereby paving the way for future advancements.
Reasons to Reject
1. Model-specific datastore overhead: Does the proposed method require constructing a separate datastore module for every backbone, simply to maintain a consistent representation space? If so, the engineering burden and memory footprint would scale linearly with the number of models, which is impractical for real-world deployment.
2. Handling of very long text in CLIP: The paper does not explain how CLIP (or any vision-language encoder) is adapted to cope with overly long textual prompts. Please clarify whether you truncate, segment, or otherwise modify the inputs, and how this choice affects performance.
3. Lack of genuine multimodal novelty: Apart from the bespoke datastore, Section 3.1 presents a modelling pipeline that appears identical to a single-modal setup. What concrete mechanisms exploit cross-modal interactions? At present, the approach seems to inherit most of its gains from datastore engineering rather than from any multimodal insight.
4. Obsolete evaluation baselines & uncertain scalability: All three backbone models are roughly a year old. Could the authors demonstrate similar (or larger) improvements on current-generation LVLM checkpoints? Moreover, it remains unclear whether the gains hold across model scales (e.g., 7B → 70B parameters). Without such evidence, the practical impact is questionable.
Grammatical error on line 114 of the paper: the sentence "…framework. by analyzing".
Thank you for the time, thorough comments, and nice suggestions. We are pleased to clarify your questions in detail.
Q1: Model-specific datastore overhead. Does the proposed method require constructing a separate datastore module for every backbone, simply to maintain a consistent representation space?
R1: Thank you for this important question. Below, we address the concern in detail:
- It is not necessary to construct a separate vector datastore module for each backbone. Our VLibrary is designed solely for storing and indexing the vectors learned during training. It performs no inter-vector operations, so maintaining a consistent representation space across different backbones does not require restructuring the datastore. As a result, the structural requirements of VLibrary remain minimal and unchanged across models.
- The system is lightweight, modular, and does not introduce deployment overhead. In practice, VLibrary is implemented using an off-the-shelf object store such as AWS S3, which integrates easily into existing pipelines. During retrieval, the system queries using the M²IV’s designated parameter set (line 199). This architecture avoids the overhead of rebuilding datastores and instead supports straightforward, scalable deployment.
To further clarify this, we provide the engineering details of our VLibrary Workflow implementation below.
- Storage Backend: The VLibrary is hosted on AWS S3, where each M²IV asset is a structured binary object containing the learned vectors for the MHA and MLP branches and their associated scaling scalars for every LVLM layer. We serialize each asset using Protocol Buffers to preserve floating-point precision, then compress it with Zstandard before uploading.
- Addressing and Deduplication: To guarantee uniqueness and support lifecycle management, we use content-based addressing. Specifically, we normalize the index parameters by enforcing a fixed precision and deterministic layer ordering. We then compute a SHA-256 hash, referred to as the `M2IV_Content_Hash`, which serves as the S3 object key.
- Lookup and Retrieval: For efficient access, a Redis-based Mapping Service maintains associations between human-readable versioned identifiers (e.g., `model_name@v1.2:task_name@v1.0`) and their corresponding `M2IV_Content_Hash`. When an application requests a specific M²IV, it provides the versioned keys, retrieves the hash from Redis, and uses it to fetch the binary object from S3. The object is then decompressed and deserialized back into a usable in-memory M²IV structure. We further accelerate frequent queries through application-level caching.
In summary, this architecture ensures low-latency lookups, scalable and cost-effective storage, and avoids model-specific datastore duplication, facilitating easy integration and robust real-world deployment.
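A condensed sketch of this store/lookup path is shown below. The bucket name, key formats, and the use of pickle in place of the Protocol Buffers schema are stand-in assumptions; only the content-hash addressing and the Redis-to-S3 indirection follow the workflow described above.

```python
import hashlib
import json
import pickle

import boto3
import redis
import zstandard as zstd

s3 = boto3.client("s3")
mapping = redis.Redis(host="localhost", port=6379)   # the Redis-based Mapping Service
BUCKET = "vlibrary-m2iv"                             # assumed bucket name

def content_hash(index_params: dict) -> str:
    """SHA-256 over the normalised index parameters (fixed precision, sorted keys)."""
    canonical = json.dumps(index_params, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def store_m2iv(asset: dict, index_params: dict, version_key: str) -> str:
    """Serialize, compress, upload, and register an M²IV asset; return M2IV_Content_Hash."""
    key = content_hash(index_params)                 # content hash doubles as the S3 object key
    blob = zstd.ZstdCompressor().compress(pickle.dumps(asset))
    s3.put_object(Bucket=BUCKET, Key=key, Body=blob)
    mapping.set(version_key, key)                    # e.g. "model_name@v1.2:task_name@v1.0"
    return key

def load_m2iv(version_key: str) -> dict:
    """Resolve the versioned identifier via Redis, then fetch and decode the asset from S3."""
    key = mapping.get(version_key).decode("utf-8")
    blob = s3.get_object(Bucket=BUCKET, Key=key)["Body"].read()
    return pickle.loads(zstd.ZstdDecompressor().decompress(blob))
```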
Q2: Handling of very long text in CLIP. The paper does not explain how CLIP (or any vision-language encoder) is adapted to cope with overly long textual prompts.
R2: Within the M²IV pipeline we use CLIP only during the clustering step that constructs the training sequences. In our main experiments the textual parts of every VQA benchmark remain within the 77-token limit, so we rely on the vanilla CLIP. In the VLibrary experiments, we may encounter inputs, such as extended rationales, that exceed this limit. To avoid extra training cost and potential semantic loss we directly switch to LongCLIP [1], a well-trained variant that adapts CLIP to longer text. In addition, we experimented with three alternative methods: (i) truncating the sequence and feeding the shortened version to vanilla CLIP; (ii) applying a sliding-window segmentation with max_len = 77 tokens and stride = max_len//2; and (iii) first using an advanced closed-source LLM to summarize the text within the length limit and then embedding the summary with CLIP.
| Methods | Explainability(CIDEr) | Instruction(LLaVa score) | Jailbreak(ASR) |
|---|---|---|---|
| Truncate | 82.40 | 75.95 | 87.83 |
| Segment | 82.35 | 76.08 | 88.53 |
| Summary with GPT-4o | 82.67 | 76.54 | 89.94 |
| Summary with Gemini2.5Pro | 82.59 | 78.37 | 88.76 |
| LongCLIP | 83.13 | 77.09 | 89.36 |
Table 1*: Comparison of different long-text processing methods in explainability, instruction-following, and jailbreak. We experimented with multiple prompts for GPT and Gemini and this table reports the best results.
Table 1* shows that LongCLIP performs best. Since the embeddings are only used for clustering, the choice of long-text method has limited impact, and only some inputs require it, so overall performance stays stable. To reduce training costs and avoid additional LLM API calls for summarization, LongCLIP is an excellent choice.
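As an illustration of option (ii) above, a sliding-window CLIP text embedding can be sketched as follows; mean-pooling the window embeddings is an assumed choice rather than the exact procedure we used.

```python
import torch
from transformers import CLIPModel, CLIPTokenizer

clip = CLIPModel.from_pretrained("openai/clip-vit-large-patch14").eval()
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")

@torch.no_grad()
def embed_long_text(text: str, max_len: int = 77) -> torch.Tensor:
    """Embed text longer than CLIP's 77-token limit via overlapping windows."""
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    stride = max_len // 2
    embs = []
    for start in range(0, max(len(ids), 1), stride):
        chunk = tokenizer.decode(ids[start:start + max_len - 2])   # leave room for BOS/EOS
        inputs = tokenizer(chunk, return_tensors="pt",
                           truncation=True, max_length=max_len)
        embs.append(clip.get_text_features(**inputs))
    return torch.stack(embs).mean(dim=0)                           # mean-pool windows (assumed)
```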
Q3: Lack of genuine multimodal novelty. Apart from the bespoke datastore, Section 3.1 presents a modelling pipeline that appears identical to a single-modal setup. What concrete mechanisms exploit cross-modal interactions?
R3: First, our theoretical model in Section 3.1 actually offers a novel treatment of multimodality. As noted in the Introduction and Related Works, existing representation engineering methods underperform on multimodal tasks as they focus exclusively on the MHA branch, which may introduce a single-modal bias. By explicitly integrating the MLP branch and promoting complementarity between the MHA and MLP branches, we ensure that multimodal features receive proper attention and support fine-grained distillation. Second, although this pipeline resembles single-modal ICL, it can capture multimodal interactions through the unified processing performed by the LLM component of LVLMs. Our focus is on the decoder side (LLM), where both image and text features are projected into a shared semantic space. As such, the input sequence can be treated uniformly, yet still reflects the multimodal nature of the model. This unified representation is made explicit in Theorem 1, where the demonstration matrix is defined as D = [X_I; X_Q], with X_I and X_Q denoting the image and question features respectively and n_I and n_Q their token counts. This allows the self-attention mechanism to dynamically allocate attention, capturing internal semantics within demonstrations and interactions across modalities. Theorem 2 further shows that MLP layers also preserve this structure, aggregating cross-modal cues from D. This allows the MLP to process and integrate information from different modalities, enabling the model to convey more aggregated yet nuanced cross-modal representations in ICL. In fact, a comprehensive study of the modality-splitting workflow deserves a separate, more detailed theoretical paper, which constitutes our next project. We plan to employ probing, circuit, and related techniques to achieve more precise cross-modal analysis within LVLMs.
Q4: Obsolete evaluation baselines & uncertain scalability. All three backbone models are roughly a year old. Moreover, it remains unclear whether the gains hold across model scales.
R4: We select OpenFlamingo, Idefics, and Llava-NeXT as they are classic LVLMs that are purposefully trained or fine-tuned on interleaved image-text data, yet they still exhibit shortcomings in multimodal ICL. M²IV substantially mitigates those weaknesses. The method is not restricted to these models; it can also be applied to current-generation open-source commercial LVLMs, whose multimodal ICL remains sensitive to sequence quality. To illustrate the broader potential of M²IV, we add experiments on InternVL2.5 (8B) (released 5 December 2024) and Qwen2.5VL (7B) (released 28 January 2025). In the experiments that follow, we use I2CL as the representative method that requires no training. All in-context sequences consist of 16 shots. Results in Tables 2* and 3* show that M²IV also achieves the highest performance across all benchmarks on these two latest SoTA LVLMs, thereby demonstrating its practical value.
| Methods | VQAv2 | VizWiz | OK-VQA | GQA | A-OKVQA | CVQA | VL-ICL |
|---|---|---|---|---|---|---|---|
| Vanilla ICL | 68.27 | 57.82 | 62.85 | 74.71 | 67.29 | 66.39 | 46.75 |
| I2CL | 63.73 | 49.08 | 58.69 | 69.54 | 63.36 | 57.07 | 41.05 |
| LIVE | 70.62 | 58.74 | 64.37 | 76.41 | 70.40 | 67.34 | 48.83 |
| M²IV | 72.48 | 60.07 | 65.26 | 78.31 | 74.16 | 70.91 | 50.36 |
Table 2*: Performance comparison of different methods on InternVL2.5 (8B).
| Methods | VQAv2 | VizWiz | OK-VQA | GQA | A-OKVQA | CVQA | VL-ICL |
|---|---|---|---|---|---|---|---|
| Vanilla ICL | 71.40 | 59.62 | 64.83 | 73.95 | 68.31 | 70.16 | 47.28 |
| I2CL | 65.17 | 51.38 | 56.75 | 70.48 | 63.95 | 64.28 | 43.54 |
| LIVE | 73.36 | 61.58 | 65.27 | 76.38 | 69.84 | 71.59 | 49.61 |
| M²IV | 75.23 | 63.57 | 66.83 | 76.97 | 71.64 | 74.58 | 51.32 |
Table 3*: Performance comparison of different methods on Qwen2.5VL (7B).
Additionally, we examine how M²IV’s performance gains vary with increasing model size on these two latest LVLMs, thereby demonstrating its practical impact.
| Methods | VQAv2 | VizWiz | OK-VQA | GQA | A-OKVQA | CVQA | VL-ICL |
|---|---|---|---|---|---|---|---|
| Vanilla ICL | 73.59 | 63.28 | 66.51 | 80.25 | 73.16 | 74.64 | 50.15 |
| I2CL | 67.28 | 57.48 | 61.37 | 74.79 | 66.67 | 69.32 | 43.26 |
| LIVE | 75.14 | 63.42 | 67.39 | 81.74 | 73.08 | 75.98 | 52.43 |
| M²IV | 76.50 | 65.87 | 68.73 | 82.53 | 75.70 | 77.21 | 53.27 |
Table 4*: Performance comparison of different methods on InternVL2.5 (78B).
| Methods | VQAv2 | VizWiz | OK-VQA | GQA | A-OKVQA | CVQA | VL-ICL |
|---|---|---|---|---|---|---|---|
| Vanilla ICL | 76.42 | 66.08 | 67.59 | 78.37 | 72.48 | 76.52 | 49.38 |
| I2CL | 68.30 | 58.27 | 62.41 | 72.26 | 67.98 | 61.05 | 42.26 |
| LIVE | 77.21 | 67.15 | 67.48 | 80.13 | 73.19 | 78.46 | 51.24 |
| M²IV | 78.54 | 67.93 | 69.36 | 81.20 | 74.38 | 79.29 | 52.08 |
Table 5*: Performance comparison of different methods on Qwen2.5VL (72B).
Tables 4* and 5* show that when scaling InternVL2.5 from 8B to 78B and Qwen2.5VL from 7B to 72B, training-free I2CL falls further behind vanilla ICL, indicating that as model size increases, training-free methods become increasingly unsuitable. At the same time, LIVE’s improvement gains slightly decline and occasionally lag behind vanilla ICL. In contrast, M²IV continues to achieve the best results across all benchmarks while maintaining its performance gains, demonstrating its generalizability across LVLMs of different sizes and confirming its strong practical value.
Thank you again for your time and valuable feedback. We hope the above content resolves your concerns. All the newly added experiments have been incorporated into the revision. Please let us know if there are further questions.
References:
[1] Zhang, Beichen, et al. "Long-clip: Unlocking the long-text capability of clip." European Conference on Computer Vision. Cham: Springer Nature Switzerland, 2024.
Thanks for the author's reply and the additional experiments. I've increased my score.
Thank you very much for your careful re-evaluation and for raising the overall score. Your comments were instrumental in helping us strengthen the paper's technical soundness. I notice, however, that the revised score is 5, which remains just below the acceptance threshold. We would be grateful to learn whether any of your concerns are still outstanding, or if there are any parts of our rebuttal that may have been overlooked. We are keen to address any remaining issues through further discussion.
Moreover, we will offer an in-depth discussion of a new highlight of our work, which stems from your comments: efficient and effective editing of larger-scale models. Many existing model-editing approaches are primarily demonstrated on ≈7B-parameter models, and their performance and efficiency degrade markedly when applied to models of 32B parameters or more [1, 2]. By contrast, our experiments on InternVL 2.5 and Qwen 2.5-VL show that M²IV maintains stable performance gains even when model size increases tenfold. To further verify this point, we now include results on the LLaVA-OneVision [3] series (OpenFlamingo v2, Idefics2, and LLaVA-NeXT do not provide versions larger than 32B).
| Methods | VQAv2 | VizWiz | OK-VQA | GQA | A-OKVQA | CVQA | VL-ICL |
|---|---|---|---|---|---|---|---|
| Vanilla ICL | 64.77 | 44.18 | 58.42 | 68.75 | 65.47 | 66.91 | 43.56 |
| I2CL | 57.53 | 39.02 | 53.64 | 62.89 | 58.49 | 57.84 | 37.89 |
| LIVE | 66.93 | 45.78 | 60.84 | 71.31 | 66.28 | 68.30 | 45.16 |
| M²IV | 68.21 | 48.00 | 63.91 | 72.64 | 68.40 | 73.28 | 47.04 |
Table 6*: Performance comparison of different methods on LLaVa-OneVision (7B).
| Methods | VQAv2 | VizWiz | OK-VQA | GQA | A-OKVQA | CVQA | VL-ICL |
|---|---|---|---|---|---|---|---|
| Vanilla ICL | 68.94 | 50.47 | 62.84 | 71.49 | 70.39 | 70.25 | 49.73 |
| I2CL | 59.41 | 42.85 | 53.90 | 64.19 | 65.45 | 64.60 | 42.26 |
| LIVE | 69.25 | 50.08 | 64.09 | 72.58 | 70.51 | 69.72 | 50.80 |
| M²IV | 70.13 | 51.95 | 64.87 | 73.67 | 72.82 | 72.38 | 52.10 |
Table 7*: Performance comparison of different methods on LLaVa-OneVision (72B).
The results in Tables 6* and 7* further demonstrate that M²IV remains more robust than alternative methods as model size increases. We now extend the comparison to the 70B-parameter regime, evaluating M²IV against LoRA.
| Methods | VQAv2 | VizWiz | OK-VQA | GQA | A-OKVQA | CVQA | VL-ICL |
|---|---|---|---|---|---|---|---|
| LoRA | 74.69 | 60.37 | 65.80 | 79.05 | 73.82 | 74.67 | 51.92 |
| M²IV | 75.06 | 61.92 | 67.65 | 79.13 | 74.06 | 76.29 | 52.11 |
Table 8*: Average performance comparison of LoRA and M²IV across LLaVa-OneVision (72B), InternVL2.5 (78B) and Qwen2.5VL (72B).
Table 8* demonstrates the comprehensive advantage of M²IV over LoRA. Moreover, at the 70B-parameter scale the parameter-efficiency gap widens: the total trainable parameters required by M²IV amount to only 1/80.7 of LoRA’s, compared with 1/50.8 at the 7B scale (Table 9). This extreme compactness makes M²IV an excellent match for the ongoing trend toward ever-larger commercial LVLMs. Taken together, the evidence above highlights the substantial potential of extending M²IV to the training and deployment pipelines of diverse, current-generation open-source LVLMs.
As LVLMs keep scaling and rapidly move toward mobile/edge devices and multi-tenant cloud services, the size and scheduling flexibility of tuning patches have become critical. Compared with other methods, M²IV slashes the additional parameter footprint, cutting both training data and compute requirements. Once deployed, it adds virtually no memory or storage overhead and can be hot-swapped on demand, with no need to rebuild computation graphs or manage bulky adapter files. In today’s ecosystem of “large backbones + sparse activation + edge deployment,” M²IV boosts both development and inference efficiency while steadily driving down overall system and operational costs.
Finally, thank you again for your time and effort.
[1] Gu, Jia-Chen, et al. "Model editing harms general abilities of large language models: Regularization to the rescue." arXiv preprint arXiv:2401.04700 (2024).
[2] Gangadhar, Govind, and Karl Stratos. "Model editing by standard fine-tuning." arXiv preprint arXiv:2402.11078 (2024).
[3] Li, Bo, et al. "Llava-onevision: Easy visual task transfer." arXiv preprint arXiv:2408.03326 (2024).
Dear reviewer, as the rebuttal phase comes to an end soon, we would really like to know whether our responses have adequately addressed some of your concerns. If you feel they have, we would be truly grateful if you could kindly consider adjusting your scores accordingly. Thank you very much.
This paper presents M2IV, a novel framework for efficient and fine-grained multimodal in-context learning (ICL) in Large Vision-Language Models (LVLMs). The authors address the inherent limitations of traditional ICL—particularly the token inefficiency and prompt sensitivity—by introducing learnable in-context vectors (M2IV) that are directly injected into the model's residual stream. These vectors leverage the distinct roles of Multi-Head Attention (MHA) and Multi-Layer Perceptrons (MLP) to semantically distill and store cross-modal knowledge. The approach includes a sophisticated self-distillation training mechanism and is supported by a retrievable vector repository named VLibrary. Extensive experiments across 7 benchmarks and 3 different LVLMs show that M2IV consistently outperforms both traditional ICL and state-of-the-art vector-based methods like LIVE, with improvements in both accuracy and inference efficiency.
Reasons to Accept
- The method reduces token overhead while improving performance, which is critical for scaling LVLMs. VLibrary shows promise for broader impact and future extensibility.
- Theoretical contributions clarify the roles of MHA and MLP, which deepens understanding of transformer internals in multimodal settings.
- Experiments are thorough and the performance gains are solid across diverse benchmarks. Ablation studies highlight the importance of each component (especially L_{syn}).
Reasons to Reject
- Although the paper claims to address two key challenges (capturing critical task mappings and their interdependencies, and achieving fine-grained semantic distillation), the ablation studies do not explicitly disentangle how the method contributes to solving each challenge. More targeted experiments or visualizations could help verify these claims.
- It is well known that task-specific training or fine-tuning may compromise a model's general capabilities. The paper does not sufficiently examine whether M2IV, while enhancing performance on trained tasks, may degrade the model's performance on unseen tasks or its general capabilities.
Questions for Authors
- Can the authors provide more fine-grained analysis or visualizations to demonstrate how M2IV captures interdependencies among demonstrations and supports semantic distillation?
- Does applying M2IV to a specific domain compromise the LVLM's general capabilities or performance on unrelated tasks?
We sincerely appreciate the time and effort you devoted to reviewing our manuscript. Below, we provide detailed responses to each of your concerns.
Q1: Provide more fine-grained analysis or visualizations to demonstrate how M2IV captures interdependencies among demonstrations and supports semantic distillation.
R1: Because our method distills in-context sequences into vectors and injects them directly into the zero-shot generation process, we cannot obtain visualizations of how it adjusts inter-demonstration attention in few-shot generation. Instead, we add a fine-grained ablation study that isolates the distinct contributions of the MHA and MLP branches within the M²IV architecture.
- Branch-Isolation Ablation: Motivated by [1], which categorizes multimodal ICL tasks based on task mappings into specific-mapping and generalized-mapping tasks, we conduct branch-isolation ablations on each type. In specific-mapping tasks, all demonstrations and the query sample share highly similar task mappings; this is the setting emphasized by existing representation engineering methods, which focus on MHA. By contrast, generalized-mapping tasks exhibit fine-grained variation between demonstrations and between each demonstration and the query; this variation becomes more pronounced across the entire knowledge base and demands fine-grained multimodal features. We select VL-ICL and a new task, HatefulMemes [2], as representative specific-mapping tasks, while VizWiz and A-OKVQA serve as examples of generalized-mapping tasks. After training M²IV, we consider three scenarios: full injection, injection of only the MHA branch, and injection of only the MLP branch.

| Injection | VL-ICL | Hatefulmemes | VizWiz | A-OKVQA |
|---|---|---|---|---|
| Full | 33.36 | 84.62 | 43.40 | 53.59 |
| MHA | 30.39 | 81.47 | 36.71 | 48.90 |
| MLP | 22.67 | 73.58 | 34.07 | 46.18 |

Table 1*: Comparison of M²IV performance under different injection methods.
Table 1* shows that MHA remains fundamental for multimodal ICL because it identifies the core task and processes the relationships between demonstrations and the query sample. Removing MHA causes a significant performance drop across all tasks. In contrast, omitting MLP leads to a larger accuracy decline on generalized‐mapping tasks than on specific‐mapping tasks, demonstrating MLP’s essential role in fine‐grained semantic distillation.
- Fine-grained Semantic Fidelity: To further validate the role of the MLP branch in fine-grained semantic distillation, we enhance the experiments in Section 5.2, which use the LLaVA benchmark to assess instruction-following abilities. We augment every demonstration in the original support set with subset-specific markers: Conversation answers begin with [Sure, here’s what I found.] and end with [Feel free to ask more!]; Detail descriptions weave in connective adverbs such as [In particular,] [Additionally,] [Moreover,]; Complex-reasoning demos follow a step-by-step format [To address the question,] … [First] … [Second] … [To conclude] …. (During evaluation, we also accept synonyms of those words, provided they are enclosed in [].) At this stage, the LVLM’s task shifts from relatively open-ended instruction following to generating fixed-format responses with fine-grained constraints on how it conveys image content. Meanwhile, evaluation extends the original GPT-4 scoring by also measuring each generated answer’s keyword hit rate and the correctness of the markers’ prescribed order. Using the untouched query split, we train M²IV under the same training protocol as the original study. We examine the performance differences between the original and the fine-grained settings across two injection methods.

| Setting | Injection | Conversation | Detail | Complex |
|---|---|---|---|---|
| Original | Full | 72.53 | 56.41 | 75.32 |
| Original | MHA | 68.86 | 52.98 | 70.43 |
| Fine-grained | Full | 67.71 | 50.36 | 64.89 |
| Fine-grained | MHA | 59.29 | 41.77 | 55.91 |

Table 2*: Comparison of M²IV performance under full injection versus injection without the MLP branch.
Results in Table 2* show that in the fine-grained setting, performance drops more severely without the MLP branch. Therefore, to move beyond coarse-grained distillation and achieve fine-grained distillation of task-specific semantic information (such as adapting style to a particular response format), the MLP is essential. Moreover, the synergistic loss unlocks the MLP’s capacity. This is one of the key factors behind M²IV’s success in outperforming existing methods.
Q2: Does applying M2IV to a specific domain compromise the LVLM’s general capabilities or performance on unrelated tasks?
R2: Thank you for raising a very valuable question that the original paper overlooks, namely whether adding M²IV might affect the model’s general capability. Addressing this issue is crucial for enhancing the practical value of M²IV. Therefore, we test the model with the task-specific M²IVs trained on VQAv2, VizWiz, OK-VQA, GQA, and A-OKVQA inserted, and measure the performance gap compared to the baseline without injection on 8-shot VL-ICL, MathVista [3], the test split of MMMU [4], and MMBench [5], which are the benchmarks commonly used to evaluate the general ability of LVLMs and also serve as OOD data. We also compare these results with full fine-tuning.
| Dataset | Method | VL-ICL (8-shot) | MathVista | MMMU (test) | MMBench |
|---|---|---|---|---|---|
| – | Standard | 29.08 | 45.86 | 32.76 | 52.51 |
| VQAv2 | Full FT | 30.07 | 40.68 | 27.68 | 50.69 |
|  | M²IV | 31.28 | 45.27 | 34.01 | 51.78 |
| VizWiz | Full FT | 27.54 | 41.02 | 25.83 | 50.03 |
|  | M²IV | 28.97 | 45.08 | 33.40 | 51.93 |
| OK-VQA | Full FT | 32.75 | 43.27 | 30.95 | 52.69 |
|  | M²IV | 34.60 | 45.81 | 34.58 | 54.07 |
| GQA | Full FT | 32.81 | 38.64 | 33.25 | 52.37 |
|  | M²IV | 33.96 | 45.30 | 32.39 | 53.49 |
| A-OKVQA | Full FT | 27.19 | 37.91 | 26.67 | 48.17 |
|  | M²IV | 30.29 | 44.72 | 32.42 | 51.32 |
Table 3*: Comparison of LVLM performance on general benchmarks after full fine-tuning on each dataset versus injection of the corresponding M²IV.
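To make the trend in Table 3* concrete before discussing it, the short snippet below computes each method’s per-benchmark gap relative to the uninjected baseline for the VQAv2 row; the numbers are copied directly from the table, and the script is purely illustrative (the other rows follow the same pattern).

```python
# Per-benchmark gap vs. the uninjected "Standard" baseline (VQAv2 row of Table 3*).
BENCHMARKS = ["VL-ICL (8-shot)", "MathVista", "MMMU (test)", "MMBench"]
standard = dict(zip(BENCHMARKS, [29.08, 45.86, 32.76, 52.51]))
full_ft  = dict(zip(BENCHMARKS, [30.07, 40.68, 27.68, 50.69]))
m2iv     = dict(zip(BENCHMARKS, [31.28, 45.27, 34.01, 51.78]))

for name, scores in [("Full FT", full_ft), ("M²IV", m2iv)]:
    gaps = {b: round(scores[b] - standard[b], 2) for b in BENCHMARKS}
    print(name, gaps)
# Full FT -> {'VL-ICL (8-shot)': 0.99, 'MathVista': -5.18, 'MMMU (test)': -5.08, 'MMBench': -1.82}
# M²IV    -> {'VL-ICL (8-shot)': 2.2, 'MathVista': -0.59, 'MMMU (test)': 1.25, 'MMBench': -0.73}
```

Full fine-tuning loses more than five points on MathVista and MMMU, whereas M²IV stays within roughly one point of the baseline on MathVista and MMBench and improves VL-ICL and MMMU.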
Table 3* shows that after full fine-tuning on a specific dataset, the model’s performance on general-ability benchmarks typically declines, especially when the dataset’s distribution differs substantially from that of the benchmark. In contrast, injecting the task-specific M²IV causes little degradation and even yields gains on some benchmarks. Unlike full fine-tuning, which alters all model parameters, M²IV injects small learnable vectors into the model’s residual streams without modifying the base weights. These vectors are trained to mimic the effect of in-context demonstrations and are applied additively at specific layers, so their influence is localized and reversible. As a result, M²IV enhances task-specific performance without overwriting the model’s general knowledge; to restore the original model, one simply subtracts the injected vectors. These results reveal an additional advantage of M²IV: it leaves the model’s inherent capabilities largely unaffected.
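As a rough, hedged illustration of this add-then-subtract behaviour (not the implementation released with the paper), the sketch below adds per-layer vectors to the attention and MLP outputs of a decoder via PyTorch forward hooks and detaches them to restore the base model. The sub-module names `self_attn` and `mlp` are assumptions matching LLaMA-style backbones and may differ for other LVLMs; training details such as the losses described in the paper are out of scope here.

```python
import torch
import torch.nn as nn

class M2IVInjector:
    """Schematic additive injection of per-layer vectors into the residual
    stream via forward hooks; removing the hooks restores the base model."""

    def __init__(self, layers: nn.ModuleList,
                 mha_vecs: torch.Tensor, mlp_vecs: torch.Tensor):
        # mha_vecs / mlp_vecs: [num_layers, hidden_dim] trained vectors.
        self.layers, self.mha_vecs, self.mlp_vecs = layers, mha_vecs, mlp_vecs
        self.handles = []

    @staticmethod
    def _shift(vec: torch.Tensor):
        def hook(module, inputs, output):
            # Attention modules often return tuples; only shift the hidden states.
            if isinstance(output, tuple):
                return (output[0] + vec,) + output[1:]
            return output + vec
        return hook

    def inject(self) -> None:
        for i, layer in enumerate(self.layers):
            # `self_attn` / `mlp` are assumed LLaMA-style sub-module names.
            self.handles.append(
                layer.self_attn.register_forward_hook(self._shift(self.mha_vecs[i])))
            self.handles.append(
                layer.mlp.register_forward_hook(self._shift(self.mlp_vecs[i])))

    def remove(self) -> None:
        # Detaching the hooks is the "subtraction": base weights are never touched.
        for h in self.handles:
            h.remove()
        self.handles.clear()
```

Under this scheme, calling `inject()` before evaluation and `remove()` afterwards reproduces the reversible behaviour described above.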
Thank you again for your time and valuable feedback. We hope these additions resolve your concerns. All newly added experiments have been incorporated into the revision. Please let us know if you have further questions.
References: [1] Li, Yanshu, et al. "TACO: Enhancing Multimodal In-context Learning via Task Mapping-Guided Sequence Configuration." arXiv preprint arXiv:2505.17098 (2025).
[2] Kiela, Douwe, et al. "The hateful memes challenge: Detecting hate speech in multimodal memes." Advances in neural information processing systems 33 (2020): 2611-2624.
[3] Lu, Pan, et al. "Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts." arXiv preprint arXiv:2310.02255 (2023).
[4] Yue, Xiang, et al. "Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024.
[5] Liu, Yuan, et al. "Mmbench: Is your multi-modal model an all-around player?." European conference on computer vision. Cham: Springer Nature Switzerland, 2024.
To further address your second concern, we have added two more experiments, which show that the inserted M²IVs remain benchmark-agnostic even when trained on safety-critical data. Specifically, we use the M²IVs introduced in the safety-testing section: one trained on HatefulMemes [2], which classifies images as harmful or not, and another trained on MM-SafetyBench, which focuses on LVLM jailbreak scenarios and contains harmful content.
| Dataset | Method | VL-ICL (8-shot) | MathVista | MMMU (test) | MMBench |
|---|---|---|---|---|---|
| – | Standard | 29.08 | 45.86 | 32.76 | 52.51 |
| HatefulMemes | Full FT | 27.49 | 40.92 | 26.79 | 48.77 |
|  | M²IV | 30.17 | 45.48 | 33.52 | 52.02 |
| MM-SafetyBench | Full FT | 26.43 | 39.27 | 23.64 | 46.58 |
|  | M²IV | 27.61 | 44.79 | 32.81 | 50.84 |
Table 4*: Comparison of LVLM performance on general benchmarks after full fine-tuning on two safety datasets versus injection of the corresponding M²IV.
Table 4* further shows that M²IV alters an LVLM’s general abilities far less than conventional fine-tuning, underscoring its practical value. We therefore anticipate that representation-engineering techniques exemplified by M²IV will become a key component of future post-training pipelines, where both effectiveness and efficiency are paramount. We are delighted to address any additional questions you may have.
Dear reviewer, as the rebuttal phase is coming to an end, we would like to know whether our responses have adequately addressed your concerns. If so, we would be truly grateful if you would kindly consider adjusting your score accordingly. Thank you very much.
Thanks for your responses. I think the new experiments and results strengthen the overall quality of the submission.
Dear reviewer, thank you very much for your comments. Your clear and professional comments have provided us with valuable insights that enabled us to enhance the quality of the paper and to highlight new strengths of our method. We also greatly appreciate your positive assessment of the additional experiments we included.
Because you indicated that the revision represents an improvement, we would respectfully like to ask whether you might be willing to reconsider your overall score for the paper. If any issues or concerns remain, we would be happy to discuss them further.
Dear Area Chairs and reviewers,
We greatly appreciate the time and effort you invest in reviewing this paper. Below, we present our summary of the rebuttal phase.
Discussion with Reviewer Ed6E: The reviewer raises two concerns: the need for targeted experiments and analyses that clarify the unique contributions of the multi-head attention (MHA) and multi-layer perceptron (MLP) branches, and whether injecting M²IV impairs the model’s general abilities. To address the first point, we conduct an ablation study that isolates each branch’s function; the results show that the MHA branch captures critical task-level mappings, whereas the MLP branch achieves fine-grained semantic distillation, which confirms the rationale for separating the two branches. To address the second point, we measure LVLM performance on generic benchmarks unrelated to the target task after injecting task-specific M²IV; compared with full fine-tuning, M²IV causes little or no degradation in general performance. These findings highlight that M²IV enhances target-task performance while leaving the model’s original abilities intact.
Discussion with Reviewer P1rJ: The reviewer raises four concerns: model-specific datastore overhead, handling of very long text in CLIP, lack of genuine multimodal novelty, and obsolete evaluation baselines with uncertain scalability. To address the first concern, we describe VLibrary’s storage and retrieval mechanisms and our engineering implementation in detail, emphasizing their efficiency and low cost, which enable deployment in diverse real-world systems. For the second concern, we clarify the long-text scenarios in our paper and show that our method remains robust when CLIP processes extended inputs. Regarding the third concern, we refine Theorems 1 and 2 by explicitly incorporating multimodal components. For the fourth and primary concern, we evaluate the method on two current-generation LVLMs, InternVL 2.5 and Qwen 2.5 VL, and demonstrate its effectiveness on these latest models. To assess scalability, we test the method on InternVL 2.5, Qwen 2.5 VL, and LLaVA-OneVision across parameter scales from 7 billion to 70 billion and observe that M²IV scales more favorably than competing approaches. These findings lead us to propose an additional key contribution of M²IV: efficient and effective editing of larger-scale models.
Discussion with Reviewer SUT6: The reviewer raises four concerns: the potential impact of additional task-specific training, the overhead of training and data processing, the need for a more detailed introduction to generalization across models or tasks, and the lack of discussion of failure cases. To address the first concern, we demonstrate that task-specific training is both necessary and low-cost. For the second concern, we restate our analysis of training cost, which underscores M²IV’s efficiency, and explain why specialized data processing is needed, noting that this pipeline reduces the required data volume and thereby offsets part of the overhead. For the third concern, we clarify the transfer strategy of M²IV and reiterate the supporting experimental results. Finally, we add an extensive qualitative error analysis that highlights and explains representative failure cases.
Discussion with Reviewer w71E: The reviewer raises five concerns: the need for more intuition behind the synergistic loss; a discussion of the applicability of Theorems 1 and 2 to non-linear LVLMs; a detailed description of the many-shot aggregation method; an error analysis; and an examination of M²IV’s sensitivity to the choice of CLIP variant and the number of clusters. For the first concern, we elaborate in detail on the intuition behind the design of the synergistic loss and explain its effect more rigorously from a mathematical perspective. For the second concern, we clarify how Theorems 1 and 2 apply to LVLMs and refine their statements accordingly. Regarding the third concern, we provide a thorough description of the many-shot aggregation procedure used in the paper. Addressing the fourth concern, we add an extensive qualitative error analysis. For the fifth concern, we include a sensitivity study showing that M²IV remains robust across different CLIP variants, and we reiterate our ablation study on the number of clusters in the paper.
In summary, we address every concern raised by the reviewers with detailed analyses or additional experiments, which not only reinforce our claim of M²IV’s efficiency and effectiveness but also reveal new advantages, such as its minimal impact on general capability and its effectiveness on larger-scale models.
I am deeply interested in the premise of the paper. As I understand it, the idea is to figure out how to perform ICL with a multi-modal language model. In contrast to normal ICL, where one provides the model with x–y input–output pairs of strings, x may be an image and y may be a string. I had, personally, never really considered this angle despite having published multiple papers in the multimodal LM space. Many of the reviewers' objections seem pedantic to me, e.g., baselines that are too old. For me, the most important aspect of this paper is simply the fact that the approach is quite interesting. There seems to be no critique of its novelty. Moreover, the authors' engagement with the reviewers during the review process is quite comprehensive.