AffectGPT: A New Dataset, Model, and Benchmark for Emotion Understanding with Multimodal Large Language Models
Abstract
Reviews and Discussion
This paper focuses on descriptive emotion understanding. Compared with discriminative emotion recognition, descriptive emotion understanding makes it possible to model complex emotions. To advance this field, the authors propose new datasets (MER-Caption and MER-Caption+) and a new model (AffectGPT). They also establish a benchmark (MER-UniBench) for MLLM-based descriptive emotion understanding solutions. Experimental results verify the effectiveness of the proposed framework and datasets.
Questions for Authors
Please refer to my comments in the weaknesses section.
Claims and Evidence
Yes. This paper contains extensive experiments to support its conclusions.
Methods and Evaluation Criteria
Yes.
Theoretical Claims
This is not a theoretical paper.
Experimental Design and Analysis
Yes. Their experimental design and analysis are clear and reasonable. This paper first reports the performance of different MLLMs on MER-UniBench, demonstrating the effectiveness of their proposed AffectGPT framework. Then, they compare different instruction datasets under the same framework and training settings, demonstrating the effectiveness of their proposed dataset (MER-Caption+). At the same time, the ablation study of dataset filtering techniques, the necessity of pre-fusion operations, the discussion on the impact of input modality, and the user study of dataset quality assessment also demonstrate the effectiveness of each proposed module in this paper.
Supplementary Material
Yes. The code and dataset are provided in the supplementary material.
Relation to Existing Literature
The key contribution of this paper is advancing the field of descriptive emotion understanding through the introduction of new datasets, architectures, and benchmarks.
Missing Essential References
No, essential related work has been correctly cited.
Other Strengths and Weaknesses
Despite the extensive experiments conducted in this paper, there are several additional experiments and discussions that could further enhance the paper's comprehensiveness:
- While the paper discusses the choice of LLMs (as shown in Figure 4), it would be beneficial to also explore and discuss the impact of different audio and video encoders.
- Please further discuss the influence of rank selection in LoRA.
- It appears that this paper does not discuss how videos are sampled for the video branch.
- Additionally, the paper should discuss the impact of varying the number of sampled frames on the model's performance.
- In Figure 2, the dataset construction pipeline could benefit from further explanation.
- In Table 3, please explain the details of the filtering technique you used for the general instruction dataset.
Other Comments or Suggestions
Please refer to my comments in the weaknesses section.
Ethics Review Concerns
Not needed.
Q1: It would be beneficial to also explore and discuss the impact of different audio and video encoders.
A1: (1) Impact of audio encoders. The choice of audio encoder does not significantly impact performance, which confirms that AffectGPT's strong performance is primarily attributable to our proposed high-quality, large-scale dataset and effective framework rather than to the acoustic encoder. Meanwhile, we note that ImageBind performs slightly worse than the other audio encoders. One possible explanation is that the other encoders are widely used in audio content understanding tasks (such as ASR), and audio content plays a crucial role in emotion recognition.
| Audio Encoder | MER-UniBench |
|---|---|
| ImageBind | 73.81 |
| Data2vec_Base | 74.48 |
| WavLM_Large | 74.39 |
| HUBERT_Large | 74.77 |
(2) Impact of video encoders. The choice of video encoder has a limited impact on performance. Interestingly, CLIP_VIT slightly outperforms EVA_CLIP and DINOv2, consistent with the findings of MERBench [1], a unified benchmark for traditional categorical frameworks. These results suggest that conclusions drawn from traditional categorical frameworks, such as encoder selection, may also apply to MLLM-based descriptive frameworks.
| Video Encoder | MER-UniBench |
|---|---|
| EVA_CLIP | 73.42 |
| DINOv2 | 73.12 |
| SigLIP | 74.63 |
| CLIP_VIT | 74.77 |
[1] Lian, Zheng, Licai Sun, Yong Ren, Hao Gu, Haiyang Sun, Lan Chen, Bin Liu, and Jianhua Tao. "Merbench: A unified evaluation benchmark for multimodal emotion recognition." arXiv preprint arXiv:2401.03429 (2024).
Q2: Please further discuss the influence of rank selection in LoRA.
A2: In the following table, we report the number of additional trainable parameters introduced by applying LoRA to the LLM branch. The first row represents the model without the LoRA module. Experimental results demonstrate that fine-tuning the LLM with LoRA improves performance compared to the model without LoRA. However, further increasing the LoRA rank does not yield significant performance gains and instead increases computational costs.
| Rank | Added Trainable Parameters | MER-UniBench |
|---|---|---|
| - | 0 | 73.30 |
| 8 | 20,185,088 | 74.65 |
| 16 | 40,370,176 | 74.77 |
| 32 | 80,740,352 | 74.92 |
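For reference, the sketch below shows one way to attach LoRA of a given rank to an LLM with Hugging Face PEFT and count the trainable parameters it adds; the target modules and alpha value are illustrative assumptions, not the configuration used in the paper.

```python
# Minimal sketch (assumed setup, not the authors' exact code): attach LoRA of a
# chosen rank to a causal LLM and report the trainable parameters it introduces.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

def build_lora_llm(model_name: str, rank: int):
    base = AutoModelForCausalLM.from_pretrained(model_name)
    cfg = LoraConfig(
        r=rank,                               # LoRA rank (8 / 16 / 32 in the table above)
        lora_alpha=2 * rank,                  # common heuristic; the paper's value is unknown
        target_modules=["q_proj", "v_proj"],  # assumed attention projections
        lora_dropout=0.05,
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(base, cfg)         # freezes the base model, adds LoRA adapters
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    print(f"rank={rank}: {trainable:,} trainable parameters")
    return model
```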
Q3: How are videos sampled for the video branch?
A3: We uniformly sample 8 frames per video by default.
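As an illustration of uniform sampling (a generic sketch, not the authors' preprocessing code), the snippet below selects 8 evenly spaced frame indices from a clip:

```python
import numpy as np

def uniform_frame_indices(num_frames: int, num_samples: int = 8) -> np.ndarray:
    """Return evenly spaced frame indices covering the whole clip."""
    if num_frames <= num_samples:
        return np.arange(num_frames)
    return np.linspace(0, num_frames - 1, num_samples).round().astype(int)

print(uniform_frame_indices(120))  # -> [  0  17  34  51  68  85 102 119]
```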
Q4: The impact of varying the number of sampled frames on the model's performance.
A4: In the following table, we compare two types of inputs, (1) face-only and (2) face-text combinations, and evaluate model performance across different numbers of sampled frames, ranging from 0 to 64. We observe that using too few frames (e.g., fewer than 2) leads to a noticeable decline in performance, indicating that insufficient frames cause information loss. However, further increasing the number of sampled frames (e.g., beyond 8) does not yield significant performance improvements. This can be attributed to the fact that MER tasks typically use short-duration videos with relatively stable facial expressions. Therefore, we sample 8 frames per video by default in this paper.
| Input Modality / # Frames | 0 | 2 | 4 | 8 | 16 | 32 | 64 |
|---|---|---|---|---|---|---|---|
| Face-only | - | 59.17 | 60.76 | 60.47 | 61.26 | 61.10 | 61.49 |
| Face-text | 67.44 | 72.94 | 73.54 | 74.31 | 73.51 | 74.10 | 74.23 |
Q5: The dataset construction pipeline could benefit from further explanation.
A5: We adopt a model-led, human-assisted annotation strategy. This approach leverages human priors to guide description generation and sample filtering, ultimately enabling automatic annotation for unlabeled data.
(1) Description Generation. During the description generation process, we first conduct preliminary experiments. In this phase, a small subset of samples is selected, and annotators are asked to assign fine-grained emotional labels to each sample. Based on the insights gained from these preliminary experiments, we perform a model selection process to ensure the quality of the automatically generated descriptions.
(2) Sample Filtering. During sample filtering, we employ a two-stage filtering technique. In the first stage, we remove samples with mismatched audio and video, as well as those with abnormal description lengths. In the second stage, we use model-based crowdsourcing to generate relatively reliable emotion labels. If the labels derived from the descriptions differ significantly from those obtained through model-based crowdsourcing, the descriptions are considered to be of low quality and are then removed.
In summary, we integrate human priors into both description generation and sample filtering to ensure the high quality of the generated descriptions in MER-Caption+.
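To make the two-stage filtering concrete, here is a schematic sketch of the logic described above; the helper functions (mismatch detection, label extraction, model-based crowdsourcing) and the length bounds are hypothetical placeholders, not the released implementation.

```python
# Schematic sketch of the two-stage filtering; all helpers and thresholds are hypothetical.
def filter_caption_samples(samples, min_words=10, max_words=400):
    kept = []
    for s in samples:
        # Stage 1: low-level checks on modality alignment and description length.
        if audio_video_mismatch(s):  # hypothetical audio-video consistency detector
            continue
        n_words = len(s["description"].split())
        if not (min_words <= n_words <= max_words):  # assumed bounds for "abnormal" lengths
            continue
        # Stage 2: high-level check against model-based crowdsourcing.
        desc_labels = extract_emotion_labels(s["description"])  # hypothetical LLM-based extractor
        crowd_labels = model_crowdsourced_labels(s)             # hypothetical ensemble of MLLMs
        if not set(desc_labels) & set(crowd_labels):            # large disagreement -> discard
            continue
        kept.append(s)
    return kept
```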
Q6: The details of the filtering technique.
A6: For general instruction datasets, we utilize the prompt provided in Appendix E to extract emotion labels from each instruction-answer pair. Samples that result in empty emotion outputs are removed.
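A minimal sketch of this step, assuming a hypothetical wrapper `extract_emotions_with_prompt` around the Appendix E prompt:

```python
# Keep only instruction-answer pairs from which the extraction prompt yields emotion labels.
def filter_instruction_pairs(pairs):
    kept = []
    for instruction, answer in pairs:
        labels = extract_emotions_with_prompt(instruction, answer)  # hypothetical LLM call
        if labels:  # drop pairs whose emotion output is empty
            kept.append((instruction, answer))
    return kept
```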
This paper introduces a new dataset for the multimodal emotion recognition (MER) task. The dataset is constructed using a model-driven, human-assisted approach: a coarse-grained dataset is first generated through description generation, followed by fine-grained refinement through both low-level and high-level filtering. Additionally, a novel MLLM-based method, "AffectGPT," is proposed, which enhances model performance by integrating audio and video feature vectors (pre-fusion) before feeding them into the LLM. The paper also presents a new evaluation benchmark to assess model performance on MER tasks. Several experiments validate the effectiveness of the dataset, the proposed method, and the evaluation benchmark.
update after rebuttal: I will keep my score as it is; the rebuttal addresses all of my comments.
Questions for Authors
- Regarding the ablation study of the model, since the video and audio features have already been fused, why are these features still fed separately into the LLMs? Have experiments been conducted using only the fused features?
- What is the purpose of designing the evaluation metric $F^k_s$, or in other words, what specific aspect does this metric reflect?
- Is it possible that MER-Caption+ also contains some instances where the data does not match the descriptions?
Claims and Evidence
The claims have been supported by experimental analysis. For example: "We propose a pre-fusion operation that moves the cross-modal interaction outside the LLMs, further enhancing multimodal integration." Additionally, I agree that the current community still lacks large-scale datasets with detailed, descriptive emotion annotations necessary to fully realize the potential of MLLMs.
Methods and Evaluation Criteria
- Mainstream AV-LLMs leave all cross-modal interaction to the LLM, which is insufficient for handling MER with its multimodal characteristics. >> confirmed by experiments.
Strengths
- Novel two-level filtering (low-level modality alignment + high-level model crowdsourcing)
- Multi-level emotion grouping metrics (Level 1-3) handle free-form outputs effectively
Weaknesses
- Limited discussion on computational costs of pre-fusion operations
Theoretical Claims
The pre-fusion mechanism lacks theoretical analysis of modality interaction dynamics.
Experimental Design and Analysis
Yes, it is valid.
Supplementary Material
The supplementary material includes the dataset labels and code, along with an appendix. The appendix primarily covers dataset comparisons and prompts. It is well-written and provides valuable details.
Relation to Existing Literature
The affective computing community is vast, and this dataset is currently the largest available, with a well-justified data collection procedure. The proposed method also demonstrates strong performance on existing datasets, and the suggested metrics appear reasonable. Therefore, I believe this work will be a valuable contribution to the community.
Moreover, the data collection procedure could serve as a guideline for other human studies aiming to collect large-scale datasets. This makes it potentially useful not only for the affective computing field but also for other research areas.
Missing Essential References
There is extensive research on these topics, but in assessing this paper, I did not find any major references missing.
Other Strengths and Weaknesses
Strengths:
- This paper introduces a new MER dataset along with a method for collecting and constructing the dataset. Compared to other related datasets, it offers higher quality, a larger volume of data, and more diverse classifications.
- This paper proposes a new MER evaluation standard, "MER-UniBench," which provides a comprehensive evaluation framework for MLLM-based emotion understanding tasks.
- Extensive experiments have been conducted to validate the dataset and evaluation standard.
Weaknesses:
- The structural innovation of AffectGPT is relatively insufficient, and it does not fully explain why the "Pre-fusion" operation can serve as a solution for cross-modal interaction.
Other Comments or Suggestions
Originality
- Novel model-led annotation paradigm
- Pre-fusion mechanism builds on established cross-modal attention
Significance
- Large-scale MLLM-focused emotion dataset
- Practical metrics for free-form emotion analysis
Q1: Limited discussion on computational costs of pre-fusion operations. The pre-fusion mechanism lacks theoretical analysis of modality interaction dynamics. The structural innovation of AffectGPT is relatively insufficient, and it does not fully explain why the "pre-fusion" operation can serve as a solution for cross-modal interaction.
A1: Regarding computational efficiency, the pre-fusion operation only involves Q-Former or attention operations, which are significantly less computationally intensive than LLMs. Theoretically, the Q-Former facilitates cross-modal interaction by distilling knowledge from multimodal content into query tokens, while the attention mechanism achieves this through dynamically predicted attention weights based on multimodal inputs.
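As a rough illustration of this kind of pre-fusion (a sketch under assumed dimensions and layout, not the paper's exact architecture), the module below distills concatenated audio and video tokens into a small set of learnable query tokens via cross-attention before they are passed to the LLM:

```python
import torch
import torch.nn as nn

class PreFusion(nn.Module):
    """Illustrative pre-fusion: query tokens cross-attend to joint audio-video tokens."""
    def __init__(self, dim: int = 1024, num_queries: int = 32, num_heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(1, num_queries, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.proj = nn.Linear(dim, dim)  # would map to the LLM embedding size in practice

    def forward(self, audio_tokens: torch.Tensor, video_tokens: torch.Tensor) -> torch.Tensor:
        # audio_tokens: (B, Ta, dim), video_tokens: (B, Tv, dim)
        memory = torch.cat([audio_tokens, video_tokens], dim=1)  # joint multimodal memory
        queries = self.queries.expand(memory.size(0), -1, -1)
        fused, _ = self.cross_attn(queries, memory, memory)      # queries distill multimodal cues
        return self.proj(fused)                                  # fused tokens are then fed to the LLM

# PreFusion()(torch.randn(2, 50, 1024), torch.randn(2, 8, 1024)).shape  # -> (2, 32, 1024)
```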
The model architecture innovation represents one aspect of our work, and our contributions extend significantly beyond this. Specifically, we introduce the largest-scale emotion description dataset to date, constructed using an efficient model-led, human-assisted approach; we establish a comprehensive benchmark encompassing three key tasks: fine-grained emotion recognition, basic emotion classification, and sentiment analysis; and we develop specialized evaluation metrics designed for assessing free-form text outputs.
Q2: Regarding the ablation study of the model, since the video and audio features have already been fused, why are these features still fed separately into the LLMs? Have experiments been conducted using only the fused features?
A2: Thanks for your valuable comment. Following your suggestion, we conducted additional experiments to investigate this aspect. Our results demonstrate that incorporating raw audio and video features alongside the fused features yields modest performance improvements compared to using fused features alone.
| Input Type | MER-UniBench Score |
|---|---|
| Audio, video features | 72.95 |
| Audio-video fused features | 74.14 |
| Audio, video, and audio-video fused features | 74.77 |
Q3: What is the purpose of designing the evaluation metric $F^k_s$, or in other words, what specific aspect does this metric reflect?
A3: In this paper, $F^k_s$ is the harmonic mean of two metrics: one indicating how many of the predicted labels are correct, and one indicating whether the predictions cover all ground-truth labels.
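Under the (assumed) reading that the two component metrics are a set-level accuracy (fraction of predicted labels that are correct) and a set-level recall (fraction of ground-truth labels that are covered), $F^k_s$ could be computed as in the sketch below; the exact definitions in the paper may differ.

```python
def f_score_set(pred_labels, gt_labels):
    """Harmonic mean of set-level accuracy and recall (illustrative reading of F^k_s)."""
    pred, gt = set(pred_labels), set(gt_labels)
    if not pred or not gt:
        return 0.0
    acc = len(pred & gt) / len(pred)  # how many predicted labels are correct
    rec = len(pred & gt) / len(gt)    # how much of the ground truth is covered
    return 0.0 if acc + rec == 0 else 2 * acc * rec / (acc + rec)
```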
Q4: Is it possible that MER-Caption+ also contains some instances where the data does not match the descriptions?
A4: Thanks for your valuable comment. Yes, there may be inaccurate descriptions in MER-Caption+ because we used an automatic annotation strategy without manual checks. However, the experimental results in Table 3 demonstrate that MER-Caption+ achieves significantly better performance than the manually annotated MAFW dataset. The main reason is that humans tend to focus on major clues, which can easily lead to incomplete descriptions. These results confirm that, despite the lack of manual checks in MER-Caption+, we can still ensure the quality of the labels. In the future, we will investigate other post-filtering techniques to further improve MER-Caption+'s annotation quality.
This paper presents a new video content description dataset with emotional words and highlights a novel annotation method for the dataset. Additionally, it proposes a model that enhances multimodal emotion recognition. The primary innovation of this model lies in its pre-fusion strategy for multimodal inputs. Finally, the paper introduces a new evaluation benchmark for assessing the performance of emotion recognition models.
update after rebuttal
I would like to keep my rating.
Questions for Authors
The specific issues can be found in the weaknesses mentioned above.
Claims and Evidence
The claims proposed in this paper are mostly supported by experimental evidence or have been mentioned in previous studies, and the reasoning is relatively sound.
Methods and Evaluation Criteria
The method proposed in this paper effectively helps address the challenges in current emotion recognition tasks. The dataset construction method reduces the difficulty of dataset creation. The model, which adopts a new strategy, also demonstrates improved performance. Finally, the benchmark introduced in the paper provides an effective means to evaluate the performance of multimodal emotion recognition models.
Theoretical Claims
I have indeed verified the correctness of the theories proposed in the paper. The paper includes numerous ablation experiments on the proposed methods. The results of these experiments validate the correctness of the proposed approaches. For example, Section 5 of the paper focuses on presenting various experimental results.
Experimental Design and Analysis
The paper clearly describes the design details of each experiment. When presenting comparative experiments, it frequently emphasizes "fairness." I also believe that the authors have made every effort to conduct fair comparative experiments.
Supplementary Material
I have also gone through most of the supplementary materials, paying special attention to the parts that interest me. Additionally, I have identified some detailed issues within them.
Relation to Existing Literature
The paper builds upon previous multimodal emotion recognition research by summarizing and enhancing existing work. The authors propose new contributions, which are validated through experiments. The dataset introduced in the paper is also an innovation based on the experiences of previous studies. Finally, the benchmark presented serves as a comprehensive evaluation tool for assessing the capabilities of multimodal emotion recognition models.
Missing Essential References
No.
Other Strengths and Weaknesses
The main advantages of this paper are as follows:
- The dataset proposed in the paper is large in scale and of high quality. The relevant experiments are thorough, and a user study has been conducted. While ensuring that the dataset can be automatically annotated, the quality of the data is also maintained. The proposed annotation and filtering methods have been experimentally validated for effectiveness.
- The model proposed for multimodal emotion recognition performs well. The effectiveness of the key component, the multimodal pre-fusion strategy, has also been validated through experiments.
- The paper introduces a new benchmark for evaluating the performance of multimodal emotion recognition models. This benchmark can comprehensively assess the model's ability to handle various multimodal emotion recognition tasks.
Although the paper has many advantages, there are still some details that need attention:
- In Appendix F, experiments have shown that using a combination of SALMONN and mPLUG-Owl results in better performance. However, the authors still chose the combination of SALMONN and Chat-UniVi. An explanation for this choice is needed.
- The origin of the raw data used in the dataset construction is not provided by the authors.
- A large amount of data was filtered out during the dataset construction process, which limits the application scenarios of the dataset.
- The evaluation metric used in the benchmark for basic emotion recognition tasks may have the risk of being inaccurate. The current evaluation metric only calculates the number of correctly matched emotion words in the output of the multimodal emotion recognition model. However, if the model outputs entirely incorrect results, the impact of these errors is not accounted for.
- In Table 3, the performance of the LLaVA dataset decreased after applying the filtering strategy. The authors did not explain what the filtering strategy was. If it is the strategy mentioned in the paper, it suggests that this filtering strategy does not work well for the LLaVA dataset. The authors should analyze the reason.
- Although the paper proposes a new multimodal emotion recognition model with significant performance improvement, the model's innovation compared to previous work is limited, as it mainly adds a pre-fusion mechanism between multiple modalities.
Other Comments or Suggestions
The paper is overall well-written, and I hope the authors can provide explanations for the various detailed issues mentioned above.
We sincerely appreciate your recognition of our fair comparative experiments, innovative dataset, effective key components, and comprehensive evaluation benchmark.
Q1: In Appendix F, experiments have shown that using a combination of SALMONN and mPLUG-Owl results in better performance. However, the authors still chose the combination of SALMONN and Chat-UniVi. An explanation for this choice is needed.
A1: Thanks for your careful review. In this paper, we do not use the combined results for model selection but instead rely on the performance of individual models. For example, for VLLM, Chat-UniVi outperforms mPLUG-Owl, and for ALLM, SALMONN outperforms SECap. Therefore, we use the combination of Chat-UniVi and SALMONN for description generation. The combination experiments are primarily designed to demonstrate that integrating multimodal cues can lead to better performance. Your suggestion of using the combined results for model selection is insightful, and we will conduct more experiments in this direction.
Q2: The origin of the raw data used in the dataset construction is not provided by the authors.
A2: The raw data comes from the unlabeled portions of MER2024, which is used with permission from the dataset owners. In this work, we annotate each unlabeled sample with emotion descriptions. The complete annotation procedure is illustrated in Figure 2.
Q3: A large amount of data was filtered out during the dataset construction process, which limits the application scenarios of the dataset.
A3: We would like to highlight that dataset quality is as important as quantity. This is verified by the experimental results in Table 4: given the same testing set (with diverse sources/scenarios), the model trained on the filtered dataset outperforms the one trained on the noisy dataset.
Q4: The evaluation metric used in the benchmark for basic emotion recognition tasks may have the risk of being inaccurate. The current evaluation metric only calculates the number of correctly matched emotion words in the output of the multimodal emotion recognition model. However, if the model outputs entirely incorrect results, the impact of these errors is not accounted for.
A4: Thanks for your comment. In fact, we took this into account during the metric design process. Basic emotion recognition tasks provide majority-voted labels, which are generally reliable. However, emotion descriptions produce free-form outputs that may contain multiple labels, including fine-grained ones beyond the basic emotions. Therefore, we use the HIT rate as the metric, which checks whether the basic label is contained in the predicted label set.
Meanwhile, we have attempted to design metrics for evaluating potentially incorrect labels in the predicted set. However, predicted labels that differ from the basic label are not necessarily incorrect; they may represent fine-grained emotions not covered by the basic categories. Since basic emotion recognition tasks lack fine-grained reference labels, we have not yet established appropriate evaluation metrics for this purpose. This remains an important research direction for our future work.
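For clarity, a minimal sketch of this HIT-rate computation as we understand it (illustrative only, not the released evaluation code): a sample counts as a hit when the majority-voted basic label appears in the predicted label set.

```python
def hit_rate(pred_label_sets, basic_labels):
    """Fraction of samples whose majority-voted basic label appears in the predicted label set."""
    hits = sum(1 for preds, y in zip(pred_label_sets, basic_labels) if y in set(preds))
    return hits / len(basic_labels)

# hit_rate([["happy", "excited"], ["sad"]], ["happy", "angry"])  # -> 0.5
```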
Q5: In Table 3, the performance of the LLaVA dataset decreased after applying the filtering strategy. The authors did not explain what the filtering strategy was. If it is the strategy mentioned in the paper, it suggests that this filtering strategy does not work well for the LLaVA dataset. The authors should analyze the reason.
A5: Thanks for your comment. Regarding general instruction datasets, we use a filtering process to retain only emotion-related instruction-answer pairs. Specifically, we use the prompt in Appendix E and extract emotion labels from each instruction-answer pair. Samples yielding empty emotion outputs are removed. As shown in Table 3, this filtering approach proves less effective for the LLaVA dataset. We hypothesize that the detailed descriptions in non-emotion subsets may also provide valuable cues for inferring emotional states in some scenarios.
Q6: Although the paper proposes a new multimodal emotion recognition model with significant performance improvement, the model's innovation compared to previous work is limited, as it mainly adds a pre-fusion mechanism between multiple modalities.
A6: The model architecture novelty is only a part of our work, and our contributions extend significantly beyond this. Besides the model architecture, we also use a model-led, human-assisted strategy to minimize human effort while constructing the largest multimodal emotion dataset to date. Also, we present a comprehensive benchmark covering fine-grained emotion recognition, basic emotion classification, and sentiment analysis, with metrics designed for free-form text outputs.
This paper introduces a new dataset, pre-fusion model, and evaluation benchmark to advance multimodal, natural language-based emotion understanding. It proposes a model-led, human-assisted strategy to minimize human effort while constructing the largest multimodal emotion dataset to date. The model features a pre-fusion architecture that integrates audio and video signals before combining them with text, enhancing multimodal fusion. Additionally, the paper presents a comprehensive benchmark covering fine-grained emotion recognition, basic emotion classification, and sentiment analysis, with metrics designed for free-form text outputs.
Questions for Authors
N/A
Claims and Evidence
N/A
Methods and Evaluation Criteria
N/A
Theoretical Claims
N/A
Experimental Design and Analysis
- Lack of Baseline Comparisons on MER-Caption+: To better isolate the impact of the new dataset vs. the new model design, it would be helpful to compare baseline models trained on MER-Caption+ against AffectGPT trained on MER-Caption+. This would clarify whether the performance boost comes primarily from the dataset or the model architecture.
- Dataset Coverage and Diversity: The paper does not specify the scenarios included in the dataset (e.g., daily conversations, news reports, or movie reviews). Additionally, it is unclear whether the dataset primarily focuses on first-person speaking or also includes multi-person videos, which could affect its generalizability.
- Impact of Frame Signals on Performance: The results suggest that using face signals alone outperforms frame signals, raising the question of whether frame signals introduce more noise. It would be valuable to explore whether this issue can be mitigated through data cleaning or improved visual signal alignment with other modalities.
Supplementary Material
N/A
Relation to Existing Literature
This paper introduces a comprehensive dataset with detailed natural language annotations, enabling a richer representation of emotions beyond predefined labels. It adopts a model-led, human-assisted annotation strategy, effectively balancing scale and label quality.
Additionally, the paper presents a rigorous evaluation pipeline, featuring in-depth ablation studies on the dataset, model, and LLMs. Notably, it also includes a user study to assess and compare dataset annotations, further validating its approach.
Missing Essential References
N/A
Other Strengths and Weaknesses
N/A
Other Comments or Suggestions
N/A
We sincerely appreciate your recognition of our work's significance and your acknowledgment of our contributions—including the novel dataset, model architecture, and comprehensive benchmark for descriptive emotion understanding. These innovations enable richer, more flexible emotion representation that extends beyond conventional predefined-label paradigms.
Q1: Lack of Baseline Comparisons on MER-Caption+: To better isolate the impact of the new dataset vs. the new model design, it would be helpful to compare baseline models trained on MER-Caption+ against AffectGPT trained on MER-Caption+. This would clarify whether the performance boost comes primarily from the dataset or the model architecture.
A1: In our experimental design, we systematically validate the effectiveness of both the new dataset and model using the control-of-variables method. Specifically, in Table 3, we demonstrate the impact of our MER-Caption+ dataset by maintaining identical model architecture while varying only the training data. In Table 4, we verify the necessity of AffectGPT by keeping the training data constant while modifying only the model structure. These carefully designed ablation studies can already verify the effectiveness of the new dataset and the new model.
Q2: Dataset Coverage and Diversity: The paper does not specify the scenarios included in the dataset (e.g., daily conversations, news reports, or movie reviews). Additionally, it is unclear whether the dataset primarily focuses on first-person speaking or also includes multi-person videos, which could affect its generalizability.
A2: Thanks for your suggestion, and we will incorporate the following additional information in our revised manuscript. In this paper, we intentionally focus on single-person videos, as this allows us to eliminate interference from other speakers and reduce task difficulty. Multi-person MER belongs to another research topic and will be discussed in our future work.
| Category | Dataset | Chosen Set | # Samples | Label Description | Data Source |
|---|---|---|---|---|---|
| Fine-grained Emotion | OV-MERD+ | All | 532 | Unfixed categories and diverse number of labels per sample | movies, TV series |
| Basic Emotion | MER2023 | MER-MULTI | 411 | Most likely label among six candidates | movies, TV series |
| Basic Emotion | MER2024 | MER-SEMI | 1,169 | Most likely label among six candidates | movies, TV series |
| Basic Emotion | IEMOCAP | Session5 | 1,241 | Most likely label among four candidates | actors' performances |
| Basic Emotion | MELD | Test | 2,610 | Most likely label among seven candidates | "Friends" TV series |
| Sentiment Analysis | CMU-MOSI | Test | 686 | Sentiment intensity, ranging from [-3, 3] | opinion videos on YouTube |
| Sentiment Analysis | CMU-MOSEI | Test | 4,659 | Sentiment intensity, ranging from [-3, 3] | opinion videos on YouTube |
| Sentiment Analysis | CH-SIMS | Test | 457 | Sentiment intensity, ranging from [-1, 1] | movies, TV series, and shows |
| Sentiment Analysis | CH-SIMS v2 | Test | 1,034 | Sentiment intensity, ranging from [-1, 1] | movies, TV series, and shows |
Q3: Impact of Frame Signals on Performance: The results suggest that using face signals alone outperforms frame signals, raising the question of whether frame signals introduce more noise. It would be valuable to explore whether this issue can be mitigated through data cleaning or improved visual signal alignment with other modalities.
A3: Thanks for your comment. We agree that using frame signals may introduce additional noise, resulting in performance degradation. Your suggestions for further verification are insightful, and we plan to explore these aspects through additional experiments in future work.
The paper received three Accepts and one Weak Accept. The reviewers note a rigorous evaluation pipeline, clear descriptions, and strong contribution to the field. This is a high-quality contribution to ICML.