OV-MER: Towards Open-Vocabulary Multimodal Emotion Recognition
Abstract
Reviews and Discussion
This paper proposes a new platform for emotion recognition studies. It extends the previously released MER2023 dataset, where GPT-3.5 is heavily utilized to group emotions meaningfully. For comparison, the evaluation benchmark includes many existing LMMs.
Questions For Authors
I wonder how much agreement there is among annotators (Kappa value) in the annotation framework.
Claims And Evidence
This research aims to overcome the limitations of previous studies that rely on predefined taxonomies to capture more complex, subtle emotions. However, the grouping and emotion recognition performance results are not so different from those of the existing studies. LLMs may have learned limited taxonomies (such as those in Emotion Wheels) from literature. While this research provides a more detailed vocabulary set of emotions, it is still unclear what the new findings are in this paper.
Methods And Evaluation Criteria
The methods and evaluation criteria are sound and reasonable, but limited to identifying L1-level groups, which is not very different from the present classification framework.
Theoretical Claims
This paper is theoretically sound.
Experimental Design And Analyses
The experiments are well-designed and the ablation studies are good.
Supplementary Material
This paper contains a large amount of supplementary material, making it difficult to understand the paper without reading it. It would have been better to clarify the focus.
Relation To Broader Scientific Literature
This paper tackles universal symbol grounding problems, which are significant for analyzing human behavior. Thus, it relates to the broader scientific literature.
Essential References Not Discussed
I believe the survey in this paper is thorough.
Other Strengths And Weaknesses
This paper proves that collaborative annotation between humans and AI is feasible and fruitful.
Other Comments Or Suggestions
As the authors claim, human emotion includes many different elements. Therefore, emotion recognition should be a "multi-label" problem. I strongly recommend that the proposed framework be extended in this direction.
update after rebuttal
Thank you for the rebuttal. I did not change my score.
Q1: This research aims to overcome the limitations of previous studies that rely on predefined taxonomies to capture more complex, subtle emotions. However, the grouping and emotion recognition performance results are not so different from those of the existing studies. The methods and evaluation criteria are sound and reasonable but limited to identifying L1-level groups, which is not very different from the present classification framework.
A1: It seems the reviewer may have some misunderstandings about our work. The L1-level group is just one of several grouping techniques used in this paper. Specifically, Table 2 presents results using GPT-based grouping, Table 3 provides results using L1-level grouping, and Table 12 shows results using L2-level grouping.
In this paper, our prediction approach follows an open-vocabulary setting, which allows for the recognition of unseen emotions. This is fundamentally different from traditional MER, which relies on fixed taxonomies. However, this flexibility introduces new evaluation challenges. Since there is no predefined label space, the model may predict closely related but differently expressed emotions (e.g., joyful and happy). To address this issue, we grouped similar emotions before computing evaluation metrics. Thus, the grouping operation is not intended to restrict emotion recognition to L1-level groups. Instead, to align with our open-vocabulary setup, we designed grouping techniques to facilitate performance evaluation. Of course, other evaluation metrics that do not rely on grouping techniques could also be used.
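As an illustration of this grouping-based evaluation, a minimal sketch is given below (this is not our exact implementation; the `group` dictionary is a hypothetical emotion-to-cluster lookup, e.g., derived from the emotion wheel or GPT-based grouping):

```python
# Minimal sketch of grouping-based set-level precision/recall/F-score.
# Illustrative only; the actual metrics are defined in the paper.

def set_level_scores(pred_labels, true_labels, group):
    # Map open-vocabulary labels to cluster ids so that near-synonyms
    # (e.g., "joyful" and "happy") count as a match.
    pred = {group.get(x, x) for x in pred_labels}
    true = {group.get(x, x) for x in true_labels}
    overlap = len(pred & true)
    precision = overlap / len(pred) if pred else 0.0
    recall = overlap / len(true) if true else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f

# Toy example with a hypothetical grouping table.
group = {"joyful": "joy", "happy": "joy", "irritated": "anger", "angry": "anger"}
print(set_level_scores({"joyful", "curious"}, {"happy"}, group))
```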
Meanwhile, due to the open-vocabulary setting of our OV-MER, we need to design not only new evaluation metrics but also novel dataset construction methods and baselines. These innovations are also key contributions of our work. Therefore, OV-MER is fundamentally different from traditional MER in terms of the task, dataset, evaluation metrics, and solution approach.
Q2: As the authors claim, human emotion includes many different elements. Therefore, emotion recognition should be a "multi-label" problem. I strongly recommend that the proposed framework be extended in this direction.
A2: Thanks for your comment. We would like to clarify that OV-MER has already attempted to address MER in a multi-label manner. As shown in Figure 2 and Appendix E, a single instance can be associated with multiple emotion labels simultaneously. Furthermore, we argue that our proposed OV-MER task encompasses the concept of a multi-label problem but extends beyond it by incorporating flexible and expandable emotion expressions. Consequently, conventional evaluation protocols (e.g., L2 loss) cannot be directly applied to the OV-MER task.
Q3: I wonder how much agreement there is among annotators (Kappa value) in the annotation framework.
A3: Unlike the traditional single-label annotation method with a fixed label space, OV-MER employs a multi-label annotation method without a fixed label space. Therefore, we cannot directly compute the Kappa value between different annotators. Instead, we draw inspiration from Section M and utilize the Jaccard similarity coefficient to measure inter-annotator agreement.
Specifically, assume there are $N$ samples and $K$ annotators. For each pair of annotators $i$ and $j$, their annotation results for sample $n$ are denoted as $Y_n^i$ and $Y_n^j$, respectively, where $Y_n^i$ and $Y_n^j$ are sets of emotion labels. We calculate the agreement score between annotators $i$ and $j$ as the Jaccard similarity averaged over all samples:

$$s_{ij} = \frac{1}{N}\sum_{n=1}^{N} \frac{|Y_n^i \cap Y_n^j|}{|Y_n^i \cup Y_n^j|}.$$
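A minimal sketch of this computation (illustrative only, not our exact implementation):

```python
# Illustrative sketch: averaged Jaccard agreement between two annotators.
# Each annotation is a set of open-vocabulary emotion labels per sample.

def pairwise_agreement(ann_i, ann_j):
    """ann_i, ann_j: lists of label sets, one set per sample."""
    scores = []
    for y_i, y_j in zip(ann_i, ann_j):
        union = y_i | y_j
        scores.append(len(y_i & y_j) / len(union) if union else 1.0)
    return sum(scores) / len(scores)

# Toy example with two samples.
ann_1 = [{"happy", "relieved"}, {"angry"}]
ann_2 = [{"happy"}, {"angry", "frustrated"}]
print(round(pairwise_agreement(ann_1, ann_2), 2))  # 0.5
```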
In our annotation process, we hired 8 annotators and conducted two rounds of checks, with no overlap among annotators in each round (see Appendix K). For the first round, the inter-annotator agreement is shown as follows:
| | A₁ | A₂ | A₃ | A₄ |
|---|---|---|---|---|
| A₁ | 1.00 | 0.57 | 0.47 | 0.51 |
| A₂ | 0.57 | 1.00 | 0.49 | 0.48 |
| A₃ | 0.47 | 0.49 | 1.00 | 0.46 |
| A₄ | 0.51 | 0.48 | 0.46 | 1.00 |
For the second round, the inter-annotator agreement is shown as follows:
| | A₅ | A₆ | A₇ | A₈ |
|---|---|---|---|---|
| A₅ | 1.00 | 0.66 | 0.71 | 0.77 |
| A₆ | 0.66 | 1.00 | 0.64 | 0.67 |
| A₇ | 0.71 | 0.64 | 1.00 | 0.69 |
| A₈ | 0.77 | 0.67 | 0.69 | 1.00 |
We observe that through multi-round checks, the inter-annotator agreement gradually increases. These results demonstrate the necessity of multi-round checks, which help enhance label reliability.
This paper extends traditional MER and introduces a novel task called open-vocabulary MER (OV-MER). The primary motivation behind this is to expand the scope of emotion recognition to encompass more fine-grained emotion labels. Since OV-MER is a newly proposed task lacking datasets, metrics, and baselines, the authors further construct a dataset (OV-MERD), define metrics (set-level metrics based on GPT and the emotion wheel), and establish baselines (baselines based on LLMs). In summary, this paper extends traditional MER tasks to OV-MER, offering a new research direction in this field.
Questions For Authors
Please refer to my comments in the weaknesses section.
Claims And Evidence
Yes. This paper contains extensive experiments to support its conclusions.
Methods And Evaluation Criteria
Yes.
Theoretical Claims
This is not a theoretical paper.
Experimental Design And Analyses
Yes. Their experiment designs and analyses are clear and sound. This paper first presents the baseline results on OV-MER. Subsequently, it verifies the impact of different annotation methods, different baseline generation strategies, the correlation among different metrics, and the rationality of labels in OV-MERD. Meanwhile, there are extensive experiments in the appendix to explore more aspects.
Supplementary Material
Yes. The code, dataset, and some baseline results are provided in the supplementary material.
Relation To Broader Scientific Literature
The key contribution of this paper lies in extending traditional MER, which has a fixed label space, to OV-MER. OV-MER encompasses a more diverse range of emotion labels, thereby facilitating a more accurate description of emotions.
Essential References Not Discussed
No, essential related work has been correctly cited.
Other Strengths And Weaknesses
- In Table 2, some baseline models include both 7B and 13B versions. Please specify which version is used in the leaderboard.
- Please explain the reason for reporting both English and Chinese results in Table 2.
- Please discuss the correlation between English and Chinese results.
- Please explain how the OV-MERD dataset is annotated. Traditionally, due to the fixed label space, MER datasets usually employ multiple annotators and use majority voting to determine the final label. In OV-MER, please explain the method for determining the final label.
- Please discuss the reasons why the GPT-based scores are highly correlated with the EW-based metrics but less correlated with the matching-based scores.
- In Table 12, it seems that M*-L2 is always more correlated with the GPT-based metrics than M*-L1. Please explain this further.
Other Comments Or Suggestions
Please refer to my comments in the weaknesses section.
Ethics Review Issues
Not needed.
We sincerely appreciate your positive feedback and recognition of our contributions to advancing MER research. Your comments on our work are truly valuable to us.
Q1: In Table 2, some baseline models include both 7B and 13B versions. Please specify which version is used in the leaderboard.
A1: Thank you for your suggestion. In this paper, we use the 7B model by default, and we will clarify this in the revised manuscript.
Q2: Please explain the reason for reporting both English and Chinese results in Table 2.
A2: In Figure 2, we observe that there are certain differences in the labels extracted from different languages. To study the impact of language differences, we report results for both English and Chinese descriptions in Table 2.
Q3: Please discuss the correlation between English and Chinese results.
A3: We utilize the results from Table 2 to compute Pearson correlation coefficients (PCC) between the English and Chinese results for each metric. As illustrated in the following table, all metrics demonstrate strong cross-linguistic correlations.
| Metric | Fₛ | Precisionₛ | Recallₛ |
|---|---|---|---|
| PCC scores | 0.9896 | 0.9738 | 0.9817 |
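A minimal sketch of this computation (the score vectors below are placeholders, not the actual Table 2 columns):

```python
# Sketch: Pearson correlation between English and Chinese scores of one metric.
# Each entry would be one model's score under the English / Chinese descriptions.
from scipy.stats import pearsonr

english_scores = [50.1, 42.3, 61.8, 57.0, 48.6]   # placeholder values
chinese_scores = [49.5, 43.0, 60.2, 58.1, 47.9]   # placeholder values
pcc, _ = pearsonr(english_scores, chinese_scores)
print(round(pcc, 4))
```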
Q4: Please explain how the OV-MERD dataset is annotated. Traditionally, due to the fixed label space, MER datasets usually employ multiple annotators and use majority voting to determine the final label. In OV-MER, please explain the method for determining the final label.
A4: We appreciate your comment regarding our annotation process. As detailed in Appendix K, our labeling procedure involved eight annotators familiar with emotion definitions and utilized a two-round verification pipeline. In the first round, four randomly selected annotators performed independent annotations. In the second round, the labels reviewed by the first group of annotators were merged, and the remaining four annotators conducted a second round of checks. This approach ensures that each preserved label receives confirmation from at least one annotator per round, thereby guaranteeing both the comprehensiveness and accuracy of the annotation results.
Q5: Please discuss the reasons behind why the GPT-based scores are highly correlated with the EW-based metrics, but less correlated with the matching-based scores?
A5: GPT-based and EW-based metrics are two grouping techniques used in our work, both focusing on emotion-label similarity. In contrast, matching-based scores emphasize word-level matching between two descriptions, including non-emotional words. Consequently, matching-based metrics exhibit lower correlation with the GPT-based and EW-based metrics.
Q6: In Table 12, it seems that M*-L2 is always more correlated with the GPT-based metrics than M*-L1. Please explain this further.
A6: M*-L1 emphasizes coarse-grained clustering information, whereas M*-L2 emphasizes fine-grained clustering information. The higher correlation between M*-L2 and GPT-based metrics suggests that GPT-based metrics primarily rely on fine-grained emotion clustering during the metric calculation.
The paper presents a novel paradigm for Open-Vocabulary Multimodal Emotion Recognition (OV-MER), addressing the limitations of existing MER systems that rely on predefined emotion taxonomies. The key contributions include:
- A new MER paradigm (OV-MER): Unlike traditional MER, which limits emotions to a fixed set of labels, OV-MER enables models to predict emotions beyond predefined categories, allowing for a more nuanced and flexible representation of human emotions.
- The OV-MERD dataset: A newly curated multimodal emotion dataset that supports open-vocabulary annotation, leveraging a human-LLM collaboration strategy to improve label richness.
- New evaluation metrics: Since OV-MER allows flexible labeling, the paper proposes set-based evaluation criteria, including similarity-based grouping methods and modified precision-recall metrics.
- Benchmarking and analysis: The paper provides extensive experiments evaluating state-of-the-art multimodal large language models (MLLMs) on the OV-MER task, showing that current models struggle to handle the complexity of open-vocabulary emotion recognition. The study suggests that OV-MER can significantly improve the generalizability of MER systems and facilitate a more human-like emotional understanding in AI applications.
Questions For Authors
I recommend conducting an ablation study to analyze the individual contributions of text, audio, and video modalities in OV-MER. This would help clarify the relative importance of each modality and provide insights into how multimodal integration enhances emotion recognition.
Claims And Evidence
The claims made in the paper are generally well-supported by experiments and analysis.
Methods And Evaluation Criteria
The methods and evaluation criteria are well-justified for the problem, particularly:
- Dataset Construction: The OV-MERD dataset introduces a novel human-LLM hybrid annotation process, which improves label diversity.
- Evaluation Metrics: The study introduces set-based precision-recall metrics tailored for open-vocabulary tasks, which is appropriate given the nature of OV-MER.
- Experimental Design: The evaluation compares multiple state-of-the-art MLLMs, making the benchmarks robust.

A potential limitation is that human evaluations (e.g., expert reviews of model-generated labels) are not included, which could provide a more qualitative assessment of OV-MER's effectiveness.
Theoretical Claims
The paper does not focus on formal theoretical proofs but rather on empirical evaluation. Therefore, there are no formal proofs to verify.
Experimental Design And Analyses
The experimental design is robust and well-executed, with appropriate baselines. However, two areas could be improved:
- Ablation studies on the impact of different modalities (text, audio, video) are limited. While results indicate multimodal inputs improve performance, further analysis on their individual contributions would be valuable.
- Generalization to real-world data is not extensively discussed—datasets are primarily sourced from movies and TV shows, which may not fully capture spontaneous human emotions.
Supplementary Material
The supplementary material includes additional dataset details, evaluation methodologies, and model results, which enhance reproducibility.
Relation To Broader Scientific Literature
The paper builds upon existing MER work but extends it toward open-vocabulary recognition, aligning with recent advances in LLM-driven perception models.
Essential References Not Discussed
The paper provides a solid literature review on multimodal emotion recognition (MER) and open-vocabulary learning.
Other Strengths And Weaknesses
Strengths:
- Novelty: The introduction of open-vocabulary emotion recognition is a significant departure from traditional MER and provides greater flexibility in capturing nuanced human emotions.
- Dataset Quality: The OV-MERD dataset is well-constructed, with human-LLM collaboration ensuring diverse and high-quality annotations.
- Evaluation Metrics: The paper carefully designs set-based evaluation metrics, which are more appropriate for open-vocabulary settings than traditional classification metrics.
- Comprehensive Benchmarking: The study evaluates state-of-the-art MLLMs and provides thorough performance comparisons, offering valuable insights into current limitations in emotion AI.

Weaknesses:
- Limited real-world validation: The dataset is primarily sourced from movies and TV series, which may not fully represent spontaneous, real-life emotional expressions. A discussion of potential domain adaptation strategies would strengthen the work.
- No explicit human evaluation of model predictions: While human reviewers refine dataset labels, there is no separate human assessment of final model outputs. A user study evaluating how well OV-MER aligns with human perception would be beneficial.
Other Comments Or Suggestions
No other comments or suggestions.
Thank you for your positive feedback on our work. We greatly appreciate your recognition that OV-MER represents a significant advancement in MER systems, enhancing their generalizability and enabling more human-like emotional understanding in AI applications.
Q1: While human reviewers refine dataset labels, there is no separate human assessment of final model outputs. A user study evaluating how well OV-MER aligns with human perception would be beneficial.
A1: Thanks for your insightful comment. To assess how well OV-MER aligns with human perception, we conducted a user study. Specifically, we hired 9 annotators and randomly selected 20 samples from our dataset. Each annotator was presented with (sample, OV-MERD label) pairs and asked to judge their alignment with human perception using a binary (Yes/No) response format. To ensure annotation quality, we also included inspection data consisting of (sample, incorrect label) pairs. We observe that 96% of the annotations confirm the alignment between OV-MERD labels and human perception. Considering the potential annotator errors, this result demonstrates that our OV-MERD labels align well with human perception.
Q2: Ablation studies on the impact of different modalities (text, audio, video) are limited. While results indicate multimodal inputs improve performance, further analysis on their individual contributions would be valuable.
A2: Thanks for your suggestion. In Table 2, we observe that CLUE-Video outperforms CLUE-Text, consistent with the nature of our OV-MERD dataset. To be specific, OV-MERD is derived from MER2023, where the textual modality contributes less than the visual modality in emotion recognition [1]. Meanwhile, CLUE-Audio achieves superior performance over both CLUE-Text and CLUE-Video, suggesting that although textual expressions may be ambiguous for emotion recognition, combining them with audio cues can effectively resolve these ambiguities, leading to better performance.
[1] Lian, Zheng, Licai Sun, Yong Ren, Hao Gu, Haiyang Sun, Lan Chen, Bin Liu, and Jianhua Tao. "Merbench: A unified evaluation benchmark for multimodal emotion recognition." arXiv preprint arXiv:2401.03429 (2024).
Q3: The dataset is primarily sourced from movies and TV series, which may not fully represent spontaneous, real-life emotional expressions. A discussion on potential domain adaptation strategies would strengthen the work.
A3: OV-MERD is derived from MER2023, which is sourced from high-rated movies and TV shows. The high ratings serve as an implicit validation of the actors' performances, ensuring spontaneous and realistic emotional expressions. Currently, this type of dataset is the mainstream in the MER research community, as it provides a cost-effective means to expand dataset scale. In the future, we plan to apply for additional funding to collect data featuring spontaneous, real-life emotional expressions by recruiting participants. Furthermore, we will employ domain adaptation techniques (e.g., Domain-Adversarial Neural Networks, DANN) to address potential domain gaps between different data sources. These will be incorporated into the future work section of the revised manuscript.
This paper proposes a novel paradigm by integrating the open-vocabulary concept into Multimodal Emotion Recognition (MER), which facilitates emotion prediction without relying on predefined categories. Specifically, the authors introduce a new dataset generated via their proposed CLUE-Multi Generation method, accompanied by novel evaluation metrics and preliminary benchmarks designed to improve MER applicability in real-world scenarios.
Questions For Authors
Please see the Limitations.
Claims And Evidence
Strengths:
- The paper is comprehensive and well-organized.
- The figures are clear, rich and concise.
- Extensive experimental evaluations have been conducted.
Limitations:
- The authors introduce a new task, Open-Vocabulary MER (OV-MER), accompanied by a self-constructed dataset and evaluation metrics. However, the details regarding data collection and processing lack sufficient transparency. The authors compare their results, obtained on their dataset, against several general multimodal large language models (MLLMs) without fine-tuning, raising concerns about fairness and validity. For example, GPT-4V achieves scores of 55.51, 48.52, 64.86, 57.21, 54.61, and 60.07 (Table 2), whereas CLUE-Multi obtains significantly higher scores of 80.05, 80.03, 80.07, 85.16, 87.09, and 83.31, respectively. This substantial performance gap suggests potential methodological issues and undermines the credibility of the reported results.
- Despite advocating for an open-vocabulary framework, the authors still categorize emotions into 236 classes (Table 5). While understandable for task evaluation purposes, the existence of finite categories means that traditional MER methods (such as [1-2]) could also be evaluated on the constructed dataset. However, the authors have not provided comparative experiments with conventional MER methods. Furthermore, all baselines listed in the main experiments are MLLM-based, neglecting the traditional MER approaches. Such omission significantly limits the robustness and comprehensiveness of the presented analysis.
[1] Multimodal transformer for unaligned multimodal language sequences. ACL 2019.
[2] Decoupled multimodal distilling for emotion recognition. CVPR 2023.
- During dataset construction, the authors perform manual verification only for labels generated by ALLM/VLLM ("check×2"), while textual data generated by LLM (through "merge & analysis") undergo no manual checks, as explicitly stated by the authors. Considering the hallucination problem commonly observed in LLM-generated texts, an addition of manual verification for these texts is necessary. The lack of this critical step casts doubt on the reliability and rationality of the dataset construction process.
- The authors state, "This dataset is an extension of MER2023 (Lian et al., 2023), from which we randomly select a portion of samples for further annotation." However, the description of dataset scale—including dataset size and proportions for training, validation, and testing—is absent both in the dataset description and in Table 1. This omission raises concerns regarding potential overfitting due to insufficient data, thus weakening confidence in the reported experimental outcomes. The vague dataset description fails to convincingly support the reliability of the experimental results.
- Regarding the definition of evaluation metrics, the authors introduce a subscript "s," but do not explain its meaning or significance. Clarification of this notation is necessary to ensure clear understanding and reproducibility.
- Although I acknowledge the comprehensive nature of this paper and appreciate its overall completeness, I find the heavy reliance on large language models for both dataset and methodological construction lacking sufficient theoretical underpinning—an essential criterion for ICML submissions. Given its current state, this work may be more suitable for submission to the NeurIPS Datasets and Benchmarks track, as it does not fully meet the rigorous theoretical standards expected for ICML (and of course this needs to be evaluated by AC and PC as well).
Methods And Evaluation Criteria
Please see Limitations 1 and 2.
Theoretical Claims
I cannot evaluate the theoretical claims of this paper because it does not make any.
Experimental Design And Analyses
Please see Limitations 1 and 2. The experimental designs and results are not convincing.
Supplementary Material
The supplementary material contains some demo code.
Relation To Broader Scientific Literature
n/a
Essential References Not Discussed
Please see Limitation 2.
Other Strengths And Weaknesses
No other strengths or weaknesses.
Other Comments Or Suggestions
No other comments or suggestions.
Q1: CLUE-Multi achieves higher scores than GPT-4V. This performance gap suggests potential methodological issues and undermines the credibility of the results.
A1: We believe the reviewer may have misunderstood the results in Table 2. As explained in Section 4.1 and illustrated in Figure 2, CLUE-Multi is the ground truth derived from manually verified visual and acoustic clues. In contrast, as shown in Figure 3, GPT-4V belongs to the CLUE-MLLM category, where it does not utilize manually verified clues. Therefore, CLUE-Multi is the upper-bound performance, while GPT-4V is a baseline. To enhance the clarity, we will revise the manuscript as follows: (1) The caption for the second part will be updated from "CLUE-MLLM" to "CLUE-MLLM (Baselines)". (2) The caption for the third part will be revised from "CLUE-M/A/T/V" to "CLUE-M/A/T/V (Upper-Bound Performance)".
Q2: Comparison with conventional MER methods.
A2: Traditional MER methods are not applied because of fundamental differences in our experimental setup. Specifically, conventional methods require identical label spaces for the training and testing sets; they cannot predict unseen emotions, i.e., labels that fall outside the training label space. However, we use an open-vocabulary annotation manner for each sample, which inherently cannot guarantee alignment between the training and testing label spaces. Since MLLM-based methods give us more freedom in emotion prediction, our work primarily leverages MLLM-based solutions.
Even if we apply traditional MER approaches, these models can only predict labels that belong to their training label space. To address your concerns, we train on the IEMOCAP (or MELD) dataset and test on OV-MERD, following the zero-shot experimental setup for a fair comparison. For the model architecture, we evaluate both the attention model and MulT.
| Model | M3-W1 L1 | M3-W1 L2 | M3-W2 L1 | M3-W2 L2 | M3-W3 L1 | M3-W3 L2 | M3-W4 L1 | M3-W4 L2 | M3-W5 L1 | M3-W5 L2 |
|---|---|---|---|---|---|---|---|---|---|---|
| Traditional Discriminative Models | ||||||||||
| MELD+MulT | 30.74 | 17.76 | 30.67 | 18.45 | 28.08 | 23.58 | 29.89 | 23.68 | 24.72 | 20.79 |
| MELD+Attention | 33.61 | 23.16 | 32.27 | 23.42 | 35.17 | 30.41 | 30.88 | 25.75 | 33.72 | 29.53 |
| IEMOCAP+MulT | 42.67 | 30.27 | 43.50 | 30.79 | 42.10 | 37.21 | 40.75 | 34.31 | 41.00 | 36.55 |
| IEMOCAP+Attention | 45.64 | 32.23 | 46.18 | 32.31 | 44.42 | 39.23 | 43.40 | 36.67 | 43.65 | 38.49 |
| MLLM-based Generative Models | ||||||||||
| Chat-UniVi | 57.00 | 42.25 | 57.50 | 42.43 | 56.80 | 45.66 | 55.86 | 41.97 | 55.81 | 43.61 |
Q3: Manual checks were performed only for ALLM/VLLM outputs, not for texts generated by LLMs.
A3: During our pipeline design, we observed noticeable errors and hallucinations in the outputs from ALLM/VLLM. In contrast, when it comes to the text merging task, given that GPT-series models exhibit impressive performance in reading comprehension [1] (close to human performance) and considering that multi-clue merging is a fundamental function in reading comprehension, we directly adopt the merging results from GPT-series models. This decision balances dataset reliability and construction cost. The user study in our response to Reviewer 2UWh further validates the high quality of LLM-based MER outputs.
[1] Brown, Tom, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D. Kaplan, Prafulla Dhariwal, Arvind Neelakantan et al. "Language models are few-shot learners." Advances in neural information processing systems 33 (2020): 1877-1901.
Q4: Training, validation, and testing are absent. This omission raises concerns regarding potential overfitting due to insufficient data.
A4: We believe the reviewer may have misunderstood some aspects of our work. This paper focuses on establishing a zero-shot benchmark for OV-MER. Consequently, all data is used for inference, meaning every sample serves as part of the testing set. Since our method does not involve any training process, the overfitting problem does not exist at all. In this paper, our primary contributions are to propose a new task (OV-MER) and lay the groundwork for OV-MER. In our follow-up work (as discussed in Section B. Limitations), we plan to develop more effective frameworks to better address OV-MER.
Q5: Meaning of the subscript "s".
A5: The subscript "s" indicates that these metrics are set-based, distinguishing them from the traditional single-label metrics.
Q6: Lack theoretical parts. More suitable for the NeurIPS Datasets and Benchmarks track.
A6: This paper goes beyond merely proposing a dataset or establishing benchmarks within existing paradigms. Instead, it represents a significant task-wise innovation in the MER field by introducing a novel paradigm that enables more accurate emotion modeling. Additionally, this paper employs a psychological theory-based method to address the challenging evaluation problem of OV-MER. Given these substantial contributions to the MER research community, we firmly believe this work meets ICML's standards.
Thank you for the authors' response. I have carefully read this rebuttal, and some of my concerns have been addressed. I also have some questions and suggestions below:
- As answered in A2, providing results from traditional MER methods can further enable readers to understand the value of this work and avoid ambiguity and misunderstanding. Therefore, I suggest the authors discuss more traditional MER methods in the revision.
- What is the attention model? To my knowledge, MulT is also an attention-based method. Please clarify this.
- As answered in A4, the key contribution of this work is to build a zero-shot benchmark for OV-MER. Therefore, the authors should open-source this benchmark with high quality (including the dataset and the code for obtaining the results in Tables 2, 3, etc.) if this paper is accepted; otherwise, this paper will be meaningless.
Based on this rebuttal, I would like to raise my rating.
We thank the reviewer for improving the score.
Q1: As answered in A2, providing results from traditional MER methods can further enable readers to understand the value of this work and avoid ambiguity and misunderstanding. Therefore, I suggest the authors discuss more traditional MER methods in the revision.
A1: Based on your suggestions, we have conducted additional experiments using traditional MER methods.
| Model | M3-W1 L1 | M3-W1 L2 | M3-W2 L1 | M3-W2 L2 | M3-W3 L1 | M3-W3 L2 | M3-W4 L1 | M3-W4 L2 | M3-W5 L1 | M3-W5 L2 |
|---|---|---|---|---|---|---|---|---|---|---|
| Traditional Discriminative Models | ||||||||||
| MELD+MFM[1] | 22.28 | 13.51 | 21.77 | 13.51 | 19.59 | 17.67 | 22.10 | 18.20 | 16.72 | 14.82 |
| MELD+MISA[2] | 28.72 | 21.75 | 27.59 | 22.43 | 34.31 | 28.50 | 26.19 | 21.80 | 34.79 | 29.24 |
| MELD+GMFN[3] | 34.28 | 22.16 | 33.77 | 22.47 | 32.40 | 29.16 | 33.43 | 28.18 | 29.43 | 26.50 |
| MELD+MFN[4] | 31.19 | 21.57 | 30.66 | 21.66 | 31.26 | 28.02 | 32.42 | 25.54 | 27.97 | 24.89 |
| MELD+MulT[5] | 30.74 | 17.76 | 30.67 | 18.45 | 28.08 | 23.58 | 29.89 | 23.68 | 24.72 | 20.79 |
| MELD+LMF[6] | 41.47 | 27.70 | 40.86 | 28.43 | 42.29 | 37.36 | 38.54 | 32.83 | 40.05 | 35.16 |
| MELD+TFN[7] | 31.91 | 20.54 | 31.41 | 20.56 | 31.15 | 26.75 | 28.41 | 23.81 | 29.68 | 25.36 |
| MELD+Attention[8] | 33.61 | 23.16 | 32.27 | 23.42 | 35.17 | 30.41 | 30.88 | 25.75 | 33.72 | 29.53 |
| IEMOCAP+MFM[1] | 45.46 | 32.86 | 47.55 | 33.12 | 46.37 | 39.90 | 43.03 | 36.97 | 43.97 | 39.28 |
| IEMOCAP+MISA[2] | 49.14 | 35.98 | 48.80 | 36.53 | 48.66 | 43.86 | 47.31 | 39.82 | 48.21 | 43.37 |
| IEMOCAP+GMFN[3] | 49.35 | 35.85 | 49.57 | 36.09 | 49.18 | 43.29 | 46.72 | 39.28 | 47.30 | 42.71 |
| IEMOCAP+MFN[4] | 50.56 | 36.82 | 50.86 | 36.72 | 49.97 | 44.70 | 48.69 | 40.55 | 48.97 | 44.11 |
| IEMOCAP+MulT[5] | 42.67 | 30.27 | 43.50 | 30.79 | 42.10 | 37.21 | 40.75 | 34.31 | 41.00 | 36.55 |
| IEMOCAP+LMF[6] | 46.34 | 32.44 | 46.42 | 32.94 | 44.19 | 39.22 | 44.23 | 36.78 | 43.57 | 38.57 |
| IEMOCAP+TFN[7] | 46.13 | 33.45 | 46.66 | 33.91 | 46.27 | 41.27 | 42.31 | 35.95 | 45.82 | 40.69 |
| IEMOCAP+Attention[8] | 45.64 | 32.23 | 46.18 | 32.31 | 44.42 | 39.23 | 43.40 | 36.67 | 43.65 | 38.49 |
[1] Tsai, Yao-Hung Hubert, Paul Pu Liang, Amir Zadeh, Louis-Philippe Morency, and Ruslan Salakhutdinov. "Learning Factorized Multimodal Representations." ICLR.
[2] Hazarika, Devamanyu, Roger Zimmermann, and Soujanya Poria. "Misa: Modality-invariant and-specific representations for multimodal sentiment analysis." ACM Multimedia.
[3] Zadeh, AmirAli Bagher, Paul Pu Liang, Soujanya Poria, Erik Cambria, and Louis-Philippe Morency. "Multimodal language analysis in the wild: Cmu-mosei dataset and interpretable dynamic fusion graph." ACL.
[4] Zadeh, Amir, Paul Pu Liang, Navonil Mazumder, Soujanya Poria, Erik Cambria, and Louis-Philippe Morency. "Memory fusion network for multi-view sequential learning." AAAI.
[5] Tsai, Yao-Hung Hubert, Shaojie Bai, Paul Pu Liang, J. Zico Kolter, Louis-Philippe Morency, and Ruslan Salakhutdinov. "Multimodal transformer for unaligned multimodal language sequences." ACL.
[6] Liu, Zhun, and Ying Shen. "Efficient Low-rank Multimodal Fusion with Modality-Specific Factors." In ACL.
[7] Zadeh, Amir, Minghai Chen, Soujanya Poria, Erik Cambria, and Louis-Philippe Morency. "Tensor Fusion Network for Multimodal Sentiment Analysis." EMNLP.
[8] Lian, Zheng, Licai Sun, Yong Ren, Hao Gu, Haiyang Sun, Lan Chen, Bin Liu, and Jianhua Tao. "Merbench: A unified evaluation benchmark for multimodal emotion recognition." Arxiv.
Q2: What is the attention model? To my knowledge, MulT is also an attention-based method. Please clarify this.
A2: The "Attention" model refers to a foundation model architecture in MERBench [8]. Specifically, let , , and denote the acoustic, visual, and lexical features for a sample , respectively. This model first converts all inputs into the same dimension and then computes importance scores for each modality. Subsequently, it employs weighted fusion to obtain multimodal features , which are utilized for emotion prediction. For more details, please refer to MERBench.
$
h_i^m =\mbox{ReLU}\left(f_i^mW_m^h + b_m^h\right), m \in {a, l, v},
$
$
h_i = \mbox{Concat}\left(h_i^a, h_i^l, h_i^v\right),
$
$
\alpha_i = \mbox{softmax}\left(h_i^TW_\alpha+b_\alpha\right),
$
$
z_i = h_i\alpha_i.
$
Here, , , , and are trainable parameters. For the output, we have , , , and .
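For illustration, a minimal PyTorch sketch of this fusion scheme is given below (dimensions and names are hypothetical; please refer to MERBench for the exact implementation):

```python
# Illustrative sketch of the attention-based weighted fusion described above.
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    def __init__(self, dim_a, dim_l, dim_v, hidden_dim):
        super().__init__()
        # Project each modality into a common hidden space (h_i^m).
        self.proj_a = nn.Linear(dim_a, hidden_dim)
        self.proj_l = nn.Linear(dim_l, hidden_dim)
        self.proj_v = nn.Linear(dim_v, hidden_dim)
        # Compute importance scores from the concatenated features (alpha_i).
        self.attn = nn.Linear(3 * hidden_dim, 3)

    def forward(self, f_a, f_l, f_v):
        h_a = torch.relu(self.proj_a(f_a))
        h_l = torch.relu(self.proj_l(f_l))
        h_v = torch.relu(self.proj_v(f_v))
        h = torch.stack([h_a, h_l, h_v], dim=1)            # (batch, 3, hidden)
        alpha = torch.softmax(self.attn(h.flatten(1)), 1)  # (batch, 3)
        z = (alpha.unsqueeze(-1) * h).sum(dim=1)           # weighted fusion z_i
        return z

# Toy usage with random unimodal features (hypothetical dimensions).
fusion = AttentionFusion(dim_a=128, dim_l=768, dim_v=256, hidden_dim=64)
z = fusion(torch.randn(4, 128), torch.randn(4, 768), torch.randn(4, 256))
print(z.shape)  # torch.Size([4, 64])
```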
Q3: As answered in A4, the key contribution of this work is to build a zero-shot benchmark for OV-MER. Therefore, the authors should open-source this benchmark with high quality (including the dataset and the code for obtaining the results in Tables 2, 3, etc.) if this paper is accepted; otherwise, this paper will be meaningless.
A3: Thank you for your comments. We promise to publish all the data and baselines if this paper is accepted.
This paper extends traditional MER to the so-called open-vocabulary MER (OV-MER), in order to expand the scope of emotion recognition to encompass more fine-grained emotion labels. For this new task, the authors further construct a dataset (OV-MERD), define metrics (set-level metrics based on GPT and the emotion wheel), and establish baselines (baselines based on LLMs), thus offering a new research direction in this field.