PaperHub
Score: 7.2/10
Poster · 4 reviewers (ratings: 4, 4, 4, 3; min 3, max 4, std 0.4)
ICML 2025

An Empirical Study on Configuring In-Context Learning Demonstrations for Unleashing MLLMs' Sentimental Perception Capability

OpenReview · PDF
Submitted: 2025-01-24 · Updated: 2025-07-24
TL;DR

We conduct an in-depth investigation into three pivotal factors that influence the configuration of In-Context Learning demonstrations on Multimodal Sentiment Analysis.

Abstract

Keywords
Multimodal Sentiment Analysis · Multimodal Large Language Model · In-Context Learning

Reviews and Discussion

Review
Rating: 4

The paper explores how to enhance the performance of Multimodal Large Language Models (MLLMs) in Multimodal Sentiment Analysis (MSA) by optimizing the configuration of In-Context Learning (ICL) demonstrations. The main findings include:

1. Enhancing MSA Performance: The authors show that by carefully configuring ICL demonstrations, MLLMs can significantly improve their sentiment analysis capabilities. They identify three key factors: similarity measurement, modality presentation, and sentiment distribution.

2. Optimizing Similarity Measurement: The study introduces various strategies for measuring the similarity between multimodal data, focusing on image-text pairs. They refine traditional similarity measures to better capture the nuances of MSA.

3. Modality Presentation: The authors investigate different combinations of modalities (image and text) and find that careful presentation of multimodal information can enhance sentiment prediction.

4. Sentiment Distribution: They explore different protocols for distributing sentiment labels in demonstrations to mitigate biases and improve fairness in predictions.

5. Experimental Results: The proposed strategies lead to average accuracy improvements of 15.9% over the zero-shot paradigm and 11.2% over a random ICL baseline on six MSA datasets. The methods are shown to be effective and generalizable across different MLLMs and datasets.

Questions for Authors

See above

Claims and Evidence

Claims Supported by Evidence:

The paper provides empirical evidence that optimizing In-Context Learning (ICL) demonstrations can significantly improve MLLMs' performance in Multimodal Sentiment Analysis (MSA). The improvements in accuracy over the zero-shot and random ICL baselines suggest that the strategies are effective.

Problematic Claims:

1. The claim that the method "fully unleashes the potential" of MLLMs in MSA may be overstated. While the improvements are significant, it is unclear if the method reaches the absolute maximum potential, especially without direct comparisons to state-of-the-art approaches.
2. Similarity Measurement: The reliance on traditional cosine similarity for multimodal data may be too simplistic, especially given the acknowledged semantic gap between text and images. The use of CLIP for measuring text-image associations could have provided a more nuanced approach.
3. The method of generating images from text to enhance multimodal presentation raises concerns about the controllability of the generated images' emotional expressions. This could lead to misinterpretation of the original sentiment, particularly with neutral or ambiguous samples.
4. The experimental setup lacks clarity, particularly regarding the metrics used and how CLIP is applied. This makes it difficult to assess the effectiveness of the method comprehensively.

Methods and Evaluation Criteria

While the paper proposes interesting directions for optimizing ICL in MSA, several aspects of the methodology and evaluation raise questions about their alignment with the problem's demands:

Multimodal Similarity Measurement: The paper relies on cosine similarity for cross-modal retrieval (text+image), despite acknowledging the semantic gap between modalities. CLIP, which explicitly models text-image alignment, is mentioned but not leveraged for similarity scoring. This choice risks conflating modality-specific features (e.g., syntax in text vs. objects in images) rather than capturing joint semantics. Why not use CLIP's joint embeddings for similarity?

Image Generation for Modality Presentation: Generating images from text (e.g., via diffusion models) to "augment" demonstrations is creative but risks introducing uncontrolled biases. For example, a neutral text paired with a generated image that skews positive/negative (as seen in Figure 3b) could mislead the model. Without validating that generated images preserve the original sentiment, this approach risks confounding results. How do you ensure generated images align with the text's intended sentiment?

Evaluation Metrics & Transparency: Table 2's experimental setup lacks clarity on how CLIP is used (e.g., which layers, pooling strategies) and which metrics are computed (e.g., accuracy, F1). Similarly, the "Task Learning" experiment (Figure 5) abruptly switches to animal labels; this is clever but under-explained. How does this abstraction relate to sentiment prediction?

Negative Sentiment Bias: The paper notes a bias toward positive/neutral predictions but stops short of diagnosing why (e.g., data imbalance in pretraining, model architecture). Without exploring this, claims about "mitigating bias" feel superficial. For example, does the "Category Balanced" protocol simply mask underlying issues in the model's learned representations?

Theoretical Claims

The paper focuses primarily on empirical strategies for configuring ICL demonstrations in MSA, with no explicit theoretical claims or proofs (e.g., convergence guarantees, generalization bounds). As such, there are no formal proofs to verify. However, the paper's methodological assumptions and experimental design choices raise implicit theoretical questions:

Assumption of Modality Independence: The use of cosine similarity (or CLIP embeddings) to combine text-image pairs assumes that modalities can be treated as independent features. This ignores the cross-modal alignment problem, where text and image embeddings may reside in disjoint semantic spaces. A more theoretically grounded approach might require proving that the proposed similarity metrics align with human perception of sentiment.

Bias Mitigation Without Guarantees: The paper introduces heuristics (e.g., sentiment-balanced demonstration distributions) to counter MLLMs' negative bias but provides no theoretical analysis of why these protocols work or whether they generalize beyond the tested datasets.

Experimental Design and Analysis

See above

Supplementary Material

The supplementary material provides additional experiments (Tables 7–10) and details on prompts, datasets, and methods.

Relation to Prior Literature

The paper's contributions are overshadowed by methodological gaps and lack of rigor:

Method Weaknesses: Uses simplistic cosine similarity for text-image retrieval despite acknowledging the semantic gap. Generates images from text without validating sentiment alignment (see Figure 3b's misleading neutral-positive mismatch).

Experimental Opacity: Fails to specify CLIP's role in metrics (e.g., layers, pooling) or report statistical significance. Table 2's "ICL Random" baseline lacks clarity on metrics (accuracy? F1?) and experimental controls.

Bias Analysis Superficial: Attributes negative sentiment bias to MLLMs without probing pretraining data or architectural flaws.

Bottom Line: The paper prioritizes engineering novelty over scientific rigor. Addressing these gaps is critical for meaningful contribution to MSA/ICL.

Missing Essential References

No

Other Strengths and Weaknesses

Strengths:

  1. The focus on optimizing In-Context Learning (ICL) for Multimodal Sentiment Analysis (MSA) addresses a critical challenge in deploying MLLMs in real-world scenarios where labeled data is scarce.
  2. The experiments show consistent gains over zero-shot and random ICL baselines, demonstrating practical utility.

Weaknesses:

  1. The paper prioritizes engineering hacks (e.g., cosine similarity, text-to-image generation) over scientific rigor. Ignoring CLIP’s joint embeddings for cross-modal retrieval is a missed opportunity to bridge the semantic gap.

  2. Figure 3b’s mismatched neutral text/positive image highlights uncontrolled sentiment leakage in generated visuals—a core flaw in the "transfer emotional details" claim.

  3. Phrases like "fully unleashing MLLMs' potential" are hyperbolic. The method tweaks ICL prompts but avoids tackling deeper limitations (e.g., model architectures, dataset biases).

  4. Experimental Opacity: Table 2's "ICL Random" baseline lacks metric definitions (accuracy? F1?), and CLIP's role in evaluation is unclear. How is CLIP used beyond vague "similarity measurement"? No ablations on key design choices (e.g., sentiment-balanced sampling, WITA weights).

  5. While negative sentiment bias is noted, the analysis stops at surface-level protocols (e.g., "category balancing"). No exploration of systemic issues like pretraining data skew or model inductive biases.

Other Comments or Suggestions

No

Author Response

We sincerely appreciate your thorough review and in-depth comments! Below, we present detailed responses to the weaknesses (W) and other concerns (O).

W4(1). Metric definitions in Table 2

In the caption of Table 2, we specify that "R strategy" represents random retrieval, and we report its average accuracy across 4-, 8-, and 16-shot ICL.

W4(2)&W1(2). CLIP’s role in similarity measurement

In the manuscript, we compute unimodal scores via cosine similarity in CLIP's embedding space, and multimodal scores are aggregated from the unimodal ones. We employ CLIP's visual encoder and its subsequent projection layer to obtain visual embeddings, and CLIP's textual encoder and its subsequent projection layer to obtain textual embeddings. This is essentially intra-modal retrieval.

CLIP does not provide joint embeddings that embed the joint semantics of image and text. Therefore, an alternative cross-modal retrieval approach is computing similarity scores between images from one sample and texts from another. We adopt intra-modal retrieval for two reasons:

  1. CLIP aligns image and text embeddings. Both cross-modal and intra-modal retrieval operate within a unified embedding space. Their underlying mechanisms are similar.
  2. In [1,2], the similarity between image-text inputs is computed within each modality and later combined, which has been empirically proven effective.
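For concreteness, below is a minimal sketch of the intra-modal retrieval score described above, assuming the Hugging Face CLIP implementation. The equal weighting w = 0.5 and the helper names are illustrative placeholders, not the manuscript's exact WIT/WITA weighting.

```python
# Minimal sketch of intra-modal demonstration retrieval with CLIP.
# The aggregation weight w is illustrative; the manuscript's WIT/WITA
# weighting scheme may differ.
import torch
import torch.nn.functional as F
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed(image, text):
    """Return L2-normalized CLIP image/text embeddings (after the projection layers)."""
    inputs = processor(text=[text], images=[image], return_tensors="pt",
                       padding=True, truncation=True)
    with torch.no_grad():
        img = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt = model.get_text_features(input_ids=inputs["input_ids"],
                                      attention_mask=inputs["attention_mask"])
    return F.normalize(img, dim=-1), F.normalize(txt, dim=-1)

def multimodal_similarity(test_pair, demo_pair, w=0.5):
    """Weighted sum of image-to-image and text-to-text cosine similarities."""
    t_img, t_txt = embed(*test_pair)
    d_img, d_txt = embed(*demo_pair)
    sim_img = (t_img @ d_img.T).item()  # intra-modal: image vs. image
    sim_txt = (t_txt @ d_txt.T).item()  # intra-modal: text vs. text
    return w * sim_img + (1.0 - w) * sim_txt
```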

Following the setup in Table 2, we compare two retrieval approaches. T2I denotes image-to-text retrieval, and I2T denotes text-to-image retrieval. From the results, cross-modal retrieval offers no additional benefits.

IDEFICS | MVSA-S | Twitter-15
R | 49.2 | 57.4
I | 56.5 | 59.1
T | 56.0 | 58.7
T2I | 55.7 | 58.9
I2T | 55.0 | 57.4

W2. Uncontrolled image generation: a flaw of the "transfer emotional details" claim

While your point is insightful, we are afraid you may have misunderstood our motivation for investigating modality presentation, which is that modality conversion can furnish supportive information (Lines 68, 196-204). We have not made the "transfer emotional details" claim suggested in your comment. This misunderstanding may stem from the phrase "which are conducive to evoking emotions" in Line 204, where our original intent is to highlight a potential benefit. To clarify, we will rephrase this part in the revised manuscript. Following our motivation, we investigate both substituting and augmenting original modalities with auxiliary ones. Our analysis reveals that modality conversion introduces extra noise (Lines 298-300), which aligns with your comment. This issue leads to our conclusion: modality conversion underperforms the use of original modalities.

Facilitating emotional alignment in image generation is an interesting direction. However, this field remains in its infancy. Pioneering work [3] has not released pretrained checkpoints, and it is difficult to train a model and validate its effectiveness within the rebuttal period. Instead, we will explore it in future work.

W3(1). Hyperbolic phrases

We will carefully check potential hyperbolic phrases in the manuscript and revise them to appropriate expressions.

W4(3). Ablations on design choices

In our investigation, variations are introduced to the pertinent settings only when probing specific factors. Therefore, Figure 6 and Lines 326-411 cover ablations of the sentiment-balanced sampling, and Figure 4(b) and lines 307-323 cover ablations of the WITA weight.

O1. Explanation for Figure 5 experiment

Please refer to the response to Q1 of Reviewer pNwb.

W5&W3(2). Explorations for pretraining data skew or model inductive biases

This manuscript focuses on configuring ICL demonstrations to unleash MLLMs' sentiment perception capabilities. Therefore, we mitigate the sentimental predictive bias with ICL instead of altering MLLMs themselves. We have not explored or tackled deeper limitations, as they lie beyond the primary scope of our research.

However, systematic studies of these limitations can indeed substantially contribute to both MLLMs and MSA. As further discussion, we attribute these limitations to pretraining data rather than model architecture, which is validated in [4]: by constructing emotion-related data, [4] enhances MLLMs' zero-shot performance on visual emotion recognition. This success has the potential to be replicated in addressing the sentimental predictive bias.

W1(1). Scientific rigor

We hope our responses can convince you of our methodology design and experimental rigor, a strength also recognized by Reviewers nCbm and pNwb.

References

[1] Yang et al. Exploring Diverse In-Context Configurations for Image Captioning. NeurIPS, 2023.
[2] Li et al. How to Configure Good In-Context Sequence for VQA. CVPR, 2024.
[3] Yang et al. EmoGen: Emotional Image Content Generation with Text-to-Image Diffusion Models. CVPR, 2024.
[4] Xie et al. EmoVIT: Revolutionizing Emotion Insights with Visual Instruction Tuning. CVPR, 2024.

Review
Rating: 4

This paper conducts an empirical study to unleash the power of MLLMs using in-context learning for multimodal sentiment analysis. The authors study three key factors influencing in-context learning performance: similarity measurement, modality presentation, and sentiment distribution. Experiments are performed on six datasets, and the proposed optimized strategies result in a significant improvement, outperforming both the zero-shot method and in-context learning with a random selection strategy. The paper also highlights and addresses an inherent sentiment prediction bias in MLLMs.

Questions for Authors

(1) Can the authors explain the meaning of the experiments conducted in Figure 5? Why can this experiment interpret the performance degradation brought by modality conversion?
(2) For the sentiment distribution, different datasets adopt different strategies (Table 4). It is necessary to analyze why only Twitter-15 and Twitter-17 adopt Category Balanced. Are there any considerations or assumptions behind this design?
(3) Different datasets adopt different final policies; how to determine which policy to adopt in practical applications should be discussed.
(4) What is the difference between policies for post-level and aspect-level tasks in sentiment distribution?

Claims and Evidence

This paper makes several key claims: (1) The zero-shot performance of MLLMs for multimodal sentiment analysis is undesirable, but MLLMs can effectively perform multimodal sentiment analysis when provided with well-configured in-context learning demonstrations. (2) The choice of similarity measurement, modality presentation, and sentiment distribution significantly affects in-context learning performance. (3) In-context learning introduces biases that can be mitigated by strategic demonstration selection. These claims are generally supported by experiments across six multimodal sentiment analysis datasets. The authors present comparative evaluations against zero-shot baselines, random in-context learning, and prior in-context learning techniques, demonstrating clear improvements.

Methods and Evaluation Criteria

The various proposed in-context learning strategies, evaluated on six multimodal sentiment analysis benchmarks, make sense for exploring in-context learning to enhance the sentiment perception ability of MLLMs. This paper evaluates weighted multimodal similarity for similarity measurement. The influence of different combinations of text, image, and generated modalities on in-context learning performance is also verified. Various sentiment distribution configurations are also verified, and the authors discover a predictive bias in MLLMs. The accuracies of the experiments on six multimodal sentiment analysis datasets validate the effectiveness.

Theoretical Claims

The paper does not present formal theoretical results but provides a strong empirical justification for its findings. The discussion of in-context learning biases and mitigation strategies is insightful but could benefit from a deeper theoretical analysis of in-context learning's role in multimodal sentiment analysis.

Experimental Design and Analysis

The study is thorough, covering multiple datasets and models. Ablation studies help isolate the contributions of each factor. Comparison with zero-shot, random in-context learning, and other multimodal sentiment analysis methods is well-executed. The paper focuses on two MLLMs (IDEFICS-9B, Open-Flamingo2-9B) and fixed prompts; broader validation on additional models and prompt engineering would strengthen the results.

Supplementary Material

I have reviewed the supplementary materials, including additional experimental details and ablation studies. The provided information is useful but lacks a discussion on additional MLLMs and various prompts.

Relation to Prior Literature

The related works relevant to the key contribution are sufficiently provided. Exploring in-context learning for multimodal sentiment analysis is a novel topic. The study builds on prior research in in-context learning (Brown et al., 2020). The efficacy of in-context learning relies on retrieval (Zhang et al., 2022), presentation (Li et al., 2024), and distribution (Lyu et al., 2023). The basic similarity calculation is based on (Liu et al., 2022 and Yang et al., 2022). The biases in in-context learning are also observed by (Yang et al., 2023c; Li et al., 2024; Baldassini et al., 2024).

Missing Essential References

The references that are essential to in-context learning for MLLMs and multimodal sentiment analysis are comprehensive.

Other Strengths and Weaknesses

(1) Exploring in-context learning for multimodal sentiment analysis is an underexplored yet impactful direction in the multimodal area. This paper provides valuable practical experience on how to effectively utilize MLLMs for multimodal sentiment analysis with low resource consumption.
(2) This paper is empirically rigorous. The experimental design is meticulously crafted, and the findings are clearly articulated. The study addresses biases in in-context learning, adding depth to the contribution.
(3) This paper is well-written and well-organized.

There are some unclear descriptions:
(1) Some figures do not correspond to the text and are difficult to understand. For example, it is unclear how the various similarities in Figure 3(a) are calculated. If these values are just schematic, using variable representations to show only the calculation process would be better.
(2) For the sentiment distribution, different datasets adopt different strategies (Table 4). The reason why only Twitter-15 and Twitter-17 adopt Category Balanced should be provided.
(3) Different datasets adopt different final policies; how to determine which policy to adopt in practical applications should be discussed.
(4) The difference between strategies for post-level and aspect-level tasks in sentiment distribution should be clarified.

Other Comments or Suggestions

(1) It is recommended to provide clearer figures. For example, Figure 6 is overly dense and should be simplified to present the experimental results more clearly.
(2) A discussion about the influence of prompt design is suggested.

Author Response

We sincerely appreciate your positive feedback and valuable advice! Below, we present detailed responses to the weaknesses (W), comments (C), questions (Q) and other concerns (O).

C1&W1. Unclear figures

In the revised manuscript, we will replace redundant components in Figure 6 with more intuitive indicators and reformulate the calculation process in Figure 3(a) using variable representations.

Q1. Explanation of Figure 5 experiment

Table 3 indicates that neither substituting nor augmenting original modalities with auxiliary ones improves ICL performance. Our explanation for the former is that modality conversion introduces information loss, outweighing potential benefits. The experiments in Figure 5 are designed to interpret why the latter degrades ICL performance.

Specifically, [1] decomposes ICL’s role into Task Recognition (TR) and Task Learning (TL). TR prompts the task format for MLLMs to apply their prior knowledge, and TL aids MLLMs in building a mapping between inputs and outputs. We hypothesize that the augmenting process weakens the TL effect. To validate this, we reformulate MSA tasks to map image-text pairs to specific animals, where MLLMs have no prior knowledge of the mapping. Figure 5(b) reveals a continuous performance decline with increased modalities, validating our hypotheses. Thereby, we interpret that the augmenting process complexifies input-output mappings and impairs the TL effect.
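To make the reformulation concrete, a minimal sketch of the label-remapping idea is given below. The animal names and the prompt wording are illustrative placeholders, not the exact setup used in Figure 5.

```python
# Sketch: probe Task Learning by mapping sentiment classes to arbitrary
# animal labels for which the MLLM has no prior input-output mapping.
# Animal names and the prompt template are illustrative placeholders.
LABEL_MAP = {"positive": "giraffe", "neutral": "penguin", "negative": "otter"}

def build_tl_probe(demonstrations, test_sample):
    """demonstrations: dicts with 'text' and 'label'; images are interleaved via
    <image> placeholders and supplied to the MLLM processor separately."""
    lines = ["A post contains an image and a text. Classify the post into "
             "[giraffe, penguin, otter]."]
    for demo in demonstrations:
        lines.append(f"<image> Text: {demo['text']} Answer: {LABEL_MAP[demo['label']]}")
    lines.append(f"<image> Text: {test_sample['text']} Answer:")
    return "\n".join(lines)
```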

C2. Influence of prompt design

Please refer to the response to Q1 of Reviewer gx7F.

W2&Q2. Adoption of distribution protocols on Twitter-15 and Twitter-17

In the manuscript, we determine the adopted protocol based on the proportion of negative samples in the dataset (lines 408-411). Given prior knowledge of datasets’ distribution, we adopt the Category Balanced protocol for Twitter-15 (12.1% negative samples) and Twitter-17 (13.6% negative samples), and the Unlimited protocol for the other datasets with higher proportions.

Here is an intuitive explanation. Influenced by the short-cut effect, negative demonstrations in the ICL sequence can stimulate MLLMs to produce negative predictions, which intensifies as the ratio of negative demonstrations increases. By selecting protocols, we ensure an adequate ratio of negative demonstrations to mitigate MLLMs’ sentimental predictive bias. For datasets with few negative samples, the Category Balanced protocol guarantees one-third negative demonstrations, which other protocols fail to achieve. For datasets with more negative samples, reliable similarity measurement naturally retrieves sufficient negative demonstrations for negative test samples. The Unlimited protocol thereby outperforms the Category Balanced protocol by configuring more sentiment-aligned demonstrations for non-negative test samples.
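As an illustration, one way to realize the Category Balanced protocol on top of similarity-ranked candidates is sketched below; how the remainder is assigned when the shot count is not divisible by three is an assumption, not the manuscript's exact rule.

```python
# Sketch of the Category Balanced protocol: fill a per-class quota with the
# highest-similarity candidates of each sentiment class. The handling of the
# remainder when k is not divisible by 3 is an illustrative choice.
def category_balanced(candidates, k):
    """candidates: list of (similarity, sample) pairs; sample['label'] is one of
    'positive', 'neutral', 'negative'. Returns k demonstrations."""
    ranked = sorted(candidates, key=lambda c: c[0], reverse=True)
    quota = {"positive": k // 3, "neutral": k // 3, "negative": k - 2 * (k // 3)}
    picked = []
    for _, sample in ranked:
        if quota[sample["label"]] > 0:
            picked.append(sample)
            quota[sample["label"]] -= 1
        if len(picked) == k:
            break
    return picked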

W3&Q3. Selection of final policy in practical applications

Our final policy is integrated by the optimal strategies of three factors. In practical applications, we decide the optimal similarity measurement strategy based on the type of task (WIT for post-level MSA and WITA for aspect-level MSA). The optimal modality presentation strategy is fixed, where we compose demonstrations with image and text. In cases where the proportion of negative samples is known, the optimal sentiment distribution strategies can be easily determined. Otherwise, protocol selection should depend on specific prioritizations. For applications prioritizing recall of negative samples (e.g., mental health monitoring), the Category Balanced protocol should be adopted. For applications prioritizing precision in non-negative sample identification or overall accuracy (e.g., public opinion monitoring), the Unlimited protocol is recommended.

W4&Q4. Sentiment distribution strategies for post-level and aspect-level MSA

The optimal distribution protocol for a dataset is determined based on its proportion of negative samples, which is independent of the task type.

O1. Theoretical analysis of ICL's role in MSA

We further analyze ICL's role in MSA through its two effects: Task Recognition (TR) and Task Learning (TL). The response to Q1 shows TL's non-negligible role in MSA. It is also evidenced by the positive correlation between ICL performance and the number of shots (Tables 7-9). Concurrently, TR also exerts a remarkable influence, as validated by ICL's robustness to textual prompts (response to C2). Textual prompts and TR share a similar function: both inform the MLLM about the task format. The contrast between the unstable zero-shot and stable ICL performance reveals the superiority of TR effects.

Therefore, ICL's TR and TL roles are equally critical in MSA. This contrasts with prior findings in VQA, where TR dominates over TL. This highlights the unique characteristics of MSA and the necessity of task-specific investigations.

O2. Broader validation on additional models

Please refer to the response to C1 of Reviewer gx7F.

Review
Rating: 4

This paper explores using In-Context Learning (ICL) to enhance MLLMs for Multimodal Sentiment Analysis (MSA). The authors identify that MLLMs under the zero-shot paradigm exhibit weak performance on MSA tasks. They propose a systematic study of three key factors in ICL demonstration configuration: similarity measurement, modality presentation, and sentiment distribution. By optimizing these factors, they achieve average accuracy improvements of 15.9% over zero-shot performance and 11.2% over random ICL on six MSA datasets. They also identify and mitigate a sentimental predictive bias in MLLMs, leading to fairer sentiment classification.

Questions for Authors

1. Did the authors experiment with different prompt engineering techniques for ICL in MLLMs? Understanding the sensitivity of results to prompt variations would be valuable.

2. Have the authors tried to select the support set several times randomly?

Claims and Evidence

Yes. The authors claim that MLLMs can achieve competitive sentiment perception through ICL. Three key factors (similarity measurement, modality presentation, and sentiment distribution) significantly impact ICL performance. Sentimental predictive bias exists in MLLMs but can be mitigated via distribution balancing. Empirical results on six MSA datasets validate these claims. The proposed retrieval, presentation, and distribution strategies are tested against zero-shot, random ICL, and previous ICL strategies. Ablation studies strengthen the validity of the proposed method.

Methods and Evaluation Criteria

Yes. This paper systematically studies three key factors in ICL demonstration configuration: similarity measurement, modality presentation, and sentiment distribution. For similarity measurement, the authors have evaluated various retrieval strategies, including aspect-based similarity and weighted multimodal similarity. For modality presentation, the authors explore how different combinations of texts, images, and generated modalities affect ICL performance. For sentiment distribution, the impact of sentiment biases in demonstrations is analyzed, leading to the discovery of predictive bias in MLLMs. The evaluation on six standard MSA datasets with accuracy as the primary metric also makes sense for this study.

Theoretical Claims

This paper primarily focuses on empirical findings rather than theoretical derivations.

Experimental Design and Analysis

Experiments on six datasets ensure the generalization of the proposed ICL strategy. Ablation studies validate individual contributions of retrieval, modality, and distribution factors. Baseline comparisons are thorough, including random ICL, prior ICL strategies, and supervised models. However, the experiments lack the analysis of the computational cost of configuring optimal ICL demonstrations.

Supplementary Material

The supplementary material was reviewed and contains additional experiments, dataset details, and ablation results.

Relation to Prior Literature

Although ICL for MLLMs has been studied in other multimodal tasks, it is underexplored for MSA, which is a pivotal challenge in the quest for general artificial intelligence. The paper builds on prior work in ICL (Brown et al., 2020), multimodal learning (Yin et al., 2023), and MSA (Zadeh et al., 2017). The discussion of MLLMs' zero-shot limitations aligns with recent findings in multimodal intelligence (Lian et al., 2024; Yang et al., 2023). The work extends ICL research to MSA, aiming to unleash the sentiment perception ability of MLLMs, which is a valuable contribution.

Missing Essential References

No. The related works of this paper are sufficient, including fully-supervised, few-shot MSA methods and in-context learning methods.

Other Strengths and Weaknesses

Strengths:

1. The paper presents a novel and systematic exploration of ICL for MSA, an area that has received limited attention. It combines existing ICL strategies with sentiment bias mitigation, offering new perspectives on optimizing MLLM performance.

2. Addressing sentiment bias in MLLMs is an important contribution, as it improves both model fairness and reliability.

3. The paper is well-written and structured. The proposed strategies are easy to follow.

4. The experimental design is clearly explained, and sufficient ablation studies strengthen the empirical validation.

Weaknesses:

1. While ICL offers benefits, the computational feasibility of optimal retrieval strategies should be discussed.

2. Fine-tuning on the support set could serve as an additional baseline to contextualize ICL's strengths and trade-offs.

Other Comments or Suggestions

1. It would be better to provide an analysis of whether the ICL strategies can generalize to different MLLM architectures beyond IDEFICS and Open-Flamingo.

2. The randomness of the 1% support set might influence the performance. A discussion is needed.

Author Response

We sincerely appreciate your positive feedback and constructive suggestions! Below, we present detailed responses to the weaknesses (W), comments (C) and questions (Q).

W1. Computational feasibility of optimal strategies

Please refer to the response to Q2 of Reviewer nCbm.

W2. Fine-tuning MLLMs on support set

On MVSA-S and Twitter-15, we sample three support sets (each comprising 1% of the training data) using three distinct random seeds. On these, we perform LoRA fine-tuning on the Q and V matrices within IDEFICS's gated xattn-dense layers, with a batch size of 1, a learning rate of 1e-4, and train for 3,000 steps. The training data is constructed in a zero-shot format, where the MLLM processes one image-text pair and its sentiment label at a time. The results (Accuracy) are reported below.

IDEFICS | Support Set | MVSA-S | Twitter-15
Zero-Shot Paradigm | - | 38.6 | 60.7
Zero-Shot Paradigm after LoRA Fine-Tuning | #1 | 45.8 | 63.3
ICL Ours 16-shot | #1 | 66.5 | 67.0
Zero-Shot Paradigm after LoRA Fine-Tuning | #2 | 48.9 | 62.3
ICL Ours 16-shot | #2 | 67.2 | 66.6
Zero-Shot Paradigm after LoRA Fine-Tuning | #3 | 46.1 | 63.4
ICL Ours 16-shot | #3 | 66.7 | 67.2

LoRA fine-tuning significantly improves the MLLM's sentiment perception capability under the zero-shot paradigm, yet it still lags behind the optimized ICL configuration. Across different support sets, ICL demonstrates more stable performance compared to LoRA fine-tuning. These results will be incorporated into the revised manuscript to enrich baselines and enhance the reliability of the findings.
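For reference, a sketch of the LoRA setup described above is shown below, using the PEFT library. The LoRA rank/alpha values and the regex targeting IDEFICS's gated cross-attention Q/V projections are assumptions and should be verified against the loaded checkpoint.

```python
# Sketch of the LoRA fine-tuning configuration described above (PEFT library).
# LoRA rank/alpha and the target-module regex are assumptions; verify the
# module names against the loaded IDEFICS checkpoint.
import torch
from transformers import IdeficsForVisionText2Text
from peft import LoraConfig, get_peft_model

model = IdeficsForVisionText2Text.from_pretrained(
    "HuggingFaceM4/idefics-9b", torch_dtype=torch.bfloat16)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    # Q and V projections inside the gated cross-attention ("xattn-dense") layers.
    target_modules=r".*gated_cross_attn_layers.*(q_proj|v_proj)",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Training: batch size 1, learning rate 1e-4, 3,000 steps, zero-shot-format
# samples (one image-text pair plus its sentiment label per step).
```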

C1. Generalization to different MLLM architectures

We evaluate our optimized strategies on two other MLLMs: MiniCPM-o-2.6-8B [1] and GPT4o [2], and report the results (Accuracy) below. Our ICL strategies still demonstrate consistent performance advantages over other strategies, confirming their generalizability.

MiniCPM-o-2.6 | Support Set | MVSA-S | Twitter-15
Zero-Shot Paradigm | - | 56.0 | 52.8
ICL Random 16-shot | 1% Training | 60.6 | 59.9
ICL RICES 16-shot | 1% Training | 62.5 | 61.5
ICL Ours 16-shot | 1% Training | 67.4 | 68.2
ICL Random 16-shot | 100% Training | 60.4 | 59.9
ICL RICES 16-shot | 100% Training | 63.6 | 62.3
ICL Ours 16-shot | 100% Training | 67.9 | 70.3

GPT4o | Support Set | MVSA-S | Twitter-15
Zero-Shot Paradigm | - | 60.8 | 59.4
ICL Random 16-shot | 1% Training | 63.6 | 61.0
ICL RICES 16-shot | 1% Training | 66.2 | 61.8
ICL Ours 16-shot | 1% Training | 72.5 | 68.7
ICL Random 16-shot | 100% Training | 63.7 | 61.1
ICL RICES 16-shot | 100% Training | 67.4 | 62.3
ICL Ours 16-shot | 100% Training | 74.1 | 69.7

C2&Q2. Randomness in the selection of the support set

Please refer to the response to W2.

Q1. Sensitivity of ICL to prompt variations

In the investigation, we experiment with various textual prompts and find that they significantly impact zero-shot performance. However, their impact on ICL is minimal. Since this manuscript primarily focuses on how ICL configurations influence MLLMs' sentiment perception capabilities, we select a set of appropriate textual prompts (#1 Prompt below) and keep them fixed throughout the investigation. The performance (Accuracy) of IDEFICS under different prompts is reported below. The support set contains 1% data from the training set.

For post-level MSA:
#1 Prompt: A post contains an image and a text. Classify the sentiment of the post into [Positive, Neutral, Negative].
#2 Prompt: Please classify the sentiment of the image-text post into [Positive, Neutral, Negative].
#3 Prompt: Here is a post containing an image and a text. The optional categories are [Positive, Neutral, Negative]. What is the overall sentiment of the post?

For aspect-level MSA:
#1 Prompt: A post contains an image, a text and an aspect. Identify the sentiment of the aspect in the post. The optional categories are [Positive, Neutral, Negative].
#2 Prompt: Please classify the sentiment of the aspect in image-text post into [Positive, Neutral, Negative].
#3 Prompt: Here is a post containing an image, a text and an aspect. The optional categories are [Positive, Neutral, Negative]. What is the sentiment of the aspect in the post?

IDEFICS | Prompt | MVSA-S | Twitter-15
Zero-Shot Paradigm | #1 | 38.6 | 60.7
ICL Ours 16-shot | #1 | 66.5 | 67.0
Zero-Shot Paradigm | #2 | 28.2 | 51.9
ICL Ours 16-shot | #2 | 66.3 | 66.9
Zero-Shot Paradigm | #3 | 50.6 | 19.1
ICL Ours 16-shot | #3 | 66.4 | 66.7
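As a reference point, the sketch below shows how a fixed textual prompt and retrieved demonstrations could be assembled into one interleaved ICL input. The interleaving template follows the generic IDEFICS processor convention (a list mixing PIL images and strings) and is not necessarily the manuscript's exact format.

```python
# Sketch: assemble an interleaved ICL input from a task prompt and retrieved
# demonstrations. The template is a simplification, not the paper's exact one;
# IDEFICS-style processors accept a list mixing PIL images and strings.
POST_LEVEL_PROMPT = ("A post contains an image and a text. Classify the sentiment "
                     "of the post into [Positive, Neutral, Negative].")

def build_icl_input(prompt, demonstrations, test_sample):
    """demonstrations/test_sample: dicts with 'image' (PIL.Image), 'text',
    and (for demonstrations) 'label'."""
    parts = [prompt]
    for demo in demonstrations:
        parts += [demo["image"],
                  f"\nText: {demo['text']}\nSentiment: {demo['label']}\n"]
    parts += [test_sample["image"], f"\nText: {test_sample['text']}\nSentiment:"]
    return parts  # pass to the model's processor, e.g. processor(parts, return_tensors="pt")
```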

References

[1] Yao et al. MiniCPM-V: A GPT-4V Level MLLM on Your Phone. arXiv, 2024.
[2] OpenAI. GPT-4 Technical Report. arXiv, 2023.

Reviewer Comment

The authors have discussed the computational feasibility of optimal retrieval strategies, fine-tuned MLLMs on 3 different support sets, evaluated the optimized strategies on two extra MLLMs (MiniCPM-o-2.6-8B and GPT4o), and presented the sensitivity of ICL to prompt variations, which addresses all my concerns. Moreover, among the other 3 reviewers, 2 are positive and 1 is negative. As a result, I think making this work public is beneficial for related researchers, and I increase my rating to Accept.

Review
Rating: 3

The paper addresses Multimodal Sentiment Analysis using MLLMs by enhancing In-Context Learning through optimized demonstration retrieval, presentation, and distribution. It achieves significant accuracy gains over zero-shot and random ICL baselines and mitigates inherent sentiment bias.

Questions for Authors

  1. Could you elaborate on how your ICL demonstration configuration strategies generalize to other multimodal tasks or larger, more diverse datasets beyond the six MSA datasets tested?
  2. What are the computational and time overheads associated with retrieving, presenting, and distributing demonstrations compared to the zero-shot paradigm?
  3. How exactly is the sentimental predictive bias identified and mitigated? Are there any specific metrics or case studies that demonstrate the effectiveness of this bias reduction?

Claims and Evidence

Yes

Methods and Evaluation Criteria

Yes

Theoretical Claims

There are no theoretical proofs provided in the paper.

Experimental Design and Analysis

The experimental design is overall rigorous, validating the proposed improvements against zero-shot and random ICL baselines. However, concerns remain regarding dataset diversity, potential overfitting in demonstration selection, and computational overhead.

Supplementary Material

I reviewed the entire supplementary material.

Relation to Prior Literature

The paper builds on recent advances in multimodal learning by extending ICL strategies for sentiment analysis. It contributes to the literature by refining demonstration selection and prompt engineering while addressing sentiment bias in MLLMs.

Missing Essential References

N/A

Other Strengths and Weaknesses

Strengths

  • The paper tackles Multimodal Sentiment Analysis (MSA), which is a pivotal yet challenging task in the realm of MLLMs. It demonstrates that with proper In-Context Learning (ICL) configuration, MLLMs can achieve significantly better performance in sentiment analysis.
  • The discovery and mitigation of a sentimental predictive bias in MLLMs not only improves accuracy but also contributes to fairness in sentiment prediction, which is a valuable consideration in AI applications.

Weaknesses

  • The requirement to fine-tune multiple aspects of demonstration configuration (retrieval, presentation, and distribution) may introduce additional complexity in implementation. This could make it challenging to deploy the approach in real-world scenarios without careful calibration.

Other Comments or Suggestions

N/A

Author Response

We sincerely appreciate your positive feedback and helpful comments! Below, we present detailed responses to the questions (Q) and weaknesses (W).

Q1. Generalization to other tasks or datasets

Our strategies cover three factors: similarity measurement, modality presentation, and sentiment distribution. These components can be harmoniously transferred to other multimodal tasks, such as multimodal sarcasm detection and multimodal crisis event detection. Both tasks accept image-text inputs, allowing direct adoption of the similarity measurement strategy (WIT) and modality presentation strategy (Image, Text). Regarding classification targets, multimodal sarcasm detection focuses on whether sarcasm expression is present, while multimodal crisis event detection aims to identify disaster types. By generalizing the concept of sentiment distribution to the distribution of different categories, we can replicate the Category Balanced protocol on these tasks.

Experiments on HFM [1] (multimodal sarcasm detection) and CrisisMMD [2] (multimodal crisis event detection) demonstrate the generalizability of our strategies. We report accuracy and adopt 1% of the training set as the support set.

IDEFICS | HFM | CrisisMMD
Zero-Shot Paradigm | 71.8 | 69.4
ICL Random 16-shot | 78.9 | 74.1
ICL RICES 16-shot | 79.2 | 81.6
ICL Ours 16-shot | 84.7 | 86.3

Regarding the scale and diversity of datasets, they do not affect the application of the strategies themselves. When applying to larger-scale datasets, the size of the support set can influence both the effect of ICL and computational overhead, necessitating a trade-off based on specific priorities.

W1. Requirement of calibration in real-world scenarios

While the optimization process includes additional computational overhead, our optimized strategies exhibit strong generalizability across six MSA datasets (as in the manuscript), other multimodal tasks (according to the response to Q1), and diverse MLLM frameworks (according to the response to C1 of Reviewer gx7F). Therefore, in practical applications, our strategies can achieve promising results without further calibration.

Q2. Computational overhead compared to the zero-shot paradigm

In the optimized configuration, presenting and distributing demonstrations do not introduce additional computational overhead. The extra costs originate from demonstration retrieval and the expanded input sequence for MLLMs. The former scales with the size of the support set, as each test sample needs to be compared against all support set samples, while the latter is inherent to ICL. We report the average time overhead (ms) of processing an image-text sample under two support set scales.

IDEFICS | # of Samples in Support Set | Retrieval | Inference | Total
Zero-Shot Paradigm | - | - | 78.1 | 78.1
ICL Random 4-shot | - | - | 134.5 | 134.5
ICL Ours 4-shot | 136 | 36.4 | 134.5 | 170.9
ICL Ours 4-shot | 1562 | 64.2 | 134.5 | 198.7
ICL Random 16-shot | - | - | 346.1 | 346.1
ICL Ours 16-shot | 136 | 36.4 | 346.1 | 382.5
ICL Ours 16-shot | 1562 | 64.2 | 346.1 | 410.3

Compared to the zero-shot paradigm, most additional time overhead is introduced by ICL itself. The cost introduced by our strategies accounts for a minimal proportion, demonstrating their efficiency.

Q3. Identification and mitigation of the sentimental predictive bias

In the manuscript, the sentimental predictive bias is identified based on the precision and recall metrics of the MLLM across different sentiment samples. As shown in Figure 6 (a-4) (b-4), the recall of negative samples (0-30/0-30) is significantly lower than the precision (50-90/54-88) on both datasets, implying the MLLM’s tendency to favor positive and neutral predictions over negative predictions.

On Twitter-17, the Category Balanced protocol (indicated by the diamond within the red clusters in Figure 6 (b-4)) configures more negative demonstrations (higher SLR-Negative) compared to the other protocols. Affected by the short-cut effect, this configuration amplifies the MLLM's tendency to produce negative predictions during inference, thereby mitigating this bias.

A more intuitive set of metrics is the proportion of negative samples classified by the MLLM as positive (P_pos), neutral (P_neu), and negative (P_neg). The table below demonstrates that nearly all negative samples under the Unlimited protocol are misclassified as positive or neutral, revealing significant predictive bias, whereas this bias is mitigated by the Category Balanced protocol. In the revised manuscript, we will integrate these analyses and include case studies to clarify our findings.

IDEFICS | P_pos | P_neu | P_neg
Unlimited | 21.0% | 73.0% | 6.0%
Category Balanced | 11.3% | 60.7% | 28.0%
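For completeness, these proportions can be computed directly from predictions on the ground-truth negative subset; a minimal sketch is given below (the label strings are illustrative).

```python
# Sketch: proportion of ground-truth negative samples predicted as each class.
from collections import Counter

def negative_prediction_proportions(y_true, y_pred):
    neg_preds = [p for t, p in zip(y_true, y_pred) if t == "negative"]
    counts = Counter(neg_preds)
    total = len(neg_preds) or 1  # guard against an empty negative subset
    return {cls: counts.get(cls, 0) / total
            for cls in ("positive", "neutral", "negative")}
```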

References

[1] Cai et al. Multimodal Sarcasm Detection in Twitter with Hierarchical Fusion Model. ACL, 2019.
[2] Alam et al. CrisisMMD: Multimodal Twitter Datasets from Natural Disasters. AAAI, 2018.

Final Decision

This paper explores In-Context Learning (ICL) to enhance multimodal large language models (MLLMs) for sentiment analysis. It systematically investigates three critical ICL demonstration factors—similarity measurement, modality presentation, and sentiment distribution—achieving significant accuracy gains over zero-shot and random ICL baselines across multiple datasets. All reviewers recommend acceptance.