PaperHub
Average rating: 6.5/10 · Oral · 4 reviewers (lowest 4, highest 8, std 1.5)
Individual ratings: 7, 7, 8, 4
Confidence: 3.8 · Correctness: 3.0 · Contribution: 2.8 · Presentation: 3.3
NeurIPS 2024

Cracking the Code of Juxtaposition: Can AI Models Understand the Humorous Contradictions

Submitted: 2024-05-15 · Updated: 2024-11-06


Keywords
comic narrative understanding, visual reasoning, multimodal benchmark, humor understanding

Reviews and Discussion

Official Review
Rating: 7

This paper introduces YESBUT, a new benchmark for evaluating large vision-language models' ability to understand humor and contradictions in comics with juxtaposed panels. The benchmark consists of two-panel comics with contradictory narratives, along with annotations for literal descriptions, contradiction explanations, underlying philosophies, and titles. The authors design four tasks of increasing difficulty: literal description writing, contradiction generation, underlying philosophy selection, and title matching. They evaluate several commercial and open-source vision-language models on these tasks using both automatic and human evaluation. The results show that even state-of-the-art models struggle with these tasks, especially the deeper reasoning required for understanding contradictions and abstractions. The paper provides insights into current limitations of AI in comprehending complex human expressions and offers directions for improvement.

Strengths

Originality: The paper addresses a unique and underexplored area in AI research—understanding humor in comics through contradictory narratives. This is a novel problem formulation that pushes the boundaries of current Vision Language Model (VLM) capabilities.

Quality: The data collection and annotation process are rigorous, involving multiple stages of human-AI collaboration and quality checks. The experimental design is comprehensive, evaluating multiple types of models on various aspects of comic understanding.

Clarity: The paper is well-structured and clearly explains the motivation, dataset creation, task designs, and experimental results. Figures and examples effectively illustrate the concepts.

Significance: Understanding humor and contradictory narratives in comics is significant for advancing AI's social and semantic comprehension. This research provides valuable insights into current AI limitations in this area and offers a pathway for future improvements, which is crucial for developing socially intelligent systems.

Weaknesses

  1. The dataset size is relatively small (348 comics), which may limit the generalizability of the findings. The authors acknowledge this limitation.

  2. The annotation process, while rigorous, relies heavily on human judges and GPT-4, which may introduce biases. The paper does not explore potential biases in the dataset, such as cultural specificity of the humor or potential annotator biases.

  3. While the paper demonstrates that augmenting models with oracle descriptions improves performance, it does not thoroughly investigate why decomposing the task for vision-language models (by first generating descriptions) does not lead to consistent improvements.

  4. The paper highlights the limitations of current VLMs but does not provide concrete suggestions or experiments on how to overcome these limitations. Including some preliminary experiments with potential improvements could strengthen the paper.

Questions

  1. How might the performance of these models change if tested on comics from different cultural contexts or with different styles of humor?

  2. Have the authors considered expanding the benchmark to include comics with more than two panels, to evaluate models' ability to understand more complex narrative structures?

  3. Could the authors provide more insight into why decomposing the task for VLMs (by first generating descriptions) doesn't consistently improve performance, especially for the title matching task?

  4. Can you provide more details on the types of biases that might have been introduced during the annotation process and how they were mitigated?

  5. How do the models' performances correlate with their training data or pre-training approaches? Could this provide insights into what types of pre-training might be most beneficial for these tasks?

  6. What are some specific strategies or architectural changes you propose for improving VLMs' understanding of contradictory narratives in comics?

Limitations

The authors adequately address the limitations of their work, particularly acknowledging the relatively small dataset size and potential ambiguity in comic interpretation due to subjectivity. They also recognize that their benchmark may not cover all aspects of visual understanding required for more generalized AI applications.

The paper does not explicitly discuss potential negative societal impacts. While the focus on humor understanding is generally positive, it may be worth considering potential misuse cases (e.g., automated generation of misleading or offensive comics) or biases that could be amplified if such systems were deployed at scale. Suggestions for improvement include:

  1. Expanding the dataset to include a more diverse set of comics and narrative styles.
  2. Implementing bias detection and mitigation techniques in the annotation and model training processes.
  3. Considering the ethical implications of AI-generated content in real-world applications, ensuring that it respects cultural and social nuances.
Author Response

Thank you for your insightful suggestions! We appreciate your recognition of our problem formulation as novel and our experimental and evaluation methods as comprehensive. We are also encouraged that you think our work can provide insights for future research. We will revise our paper and incorporate additional discussions in Limitations. Below, we address your questions and concerns:

W1: The dataset size

Please see point 1 in the Overall Response.


W2 & Q4: Bias mitigation in annotation

Thank you for raising this good point! Please see Overall Response point 2.


W3 & Q3: Decomposition does not lead to consistent improvements

Thank you for this good question. Our manual analysis identifies two potential reasons. First, VLMs sometimes misinterpret visual content, leading to incorrect descriptions. This issue is also highlighted in Section 6.3. Such errors can cause cascading errors in the subsequent deep reasoning tasks. Second, the generated descriptions are often lengthy (over 100 words), resulting in more complex input prompts, which can complicate the reasoning process for VLMs.

Additionally, decomposition is less beneficial for title matching than for philosophy selection. Titles are more abstract and require more in-depth reasoning, for which decomposing the task into surface-level descriptions may not suffice. We will update our paper to include examples that better illustrate these potential causes.


W4 & Q6: Concrete suggestions and proposal on future improvements

Thank you for your insightful suggestion! We agree that highlighting directions for future research is crucial. Currently, we aim to uncover potential areas for improvement through our results analyses.

First, our analysis indicates that VLMs often struggle with accurately interpreting image content and may make errors in literal descriptions (Sec. 6.1 & 6.2). This suggests a need for future work to enhance visual interpretation capabilities.

Second, improving the in-depth reasoning ability of VLMs is essential. For instance, LLaVA-1.6 significantly outperforms LLaVA-1.5, likely due to the advancements in reasoning abilities [1]. Future work might incorporate recent advanced reasoning approaches (e.g., multi-agent debate, refinement-based reasoning) to further improve model performance.

Finally, our error analysis reveals that models tend to suffer from hallucination (Line 333), suggesting the need to incorporate external knowledge to enhance humor understanding. To mitigate this problem, knowledge augmentation methods can be employed to improve VLMs' performance.

We will revise our paper to include a dedicated section that provides a more detailed discussion on these points and outlines future research directions.


Q1: Different cultural contexts or styles

This is a good question. Currently, we focus on humor understanding based on common interpretation rather than individual preferences with specific contextual information. However, we recognize that understanding humor and accounting for cultural and stylistic variations is crucial. In future work, we plan to delve deeper into these nuances by incorporating more contextual information into the predictions. We will also include a more detailed discussion on this part in our revised version.


Q2: Expanding the benchmark

Yes, we plan to expand our benchmark to include comics with more than two panels. This will allow us to evaluate models' ability to understand more complex narrative structures. Additionally, we will incorporate comics with diverse narrative logic types beyond contradictions. We are committed to continuously updating our benchmark to foster future research in this area.


Q5: Correlation with pre-training

This is a very insightful question! First, we believe that the reasoning ability and world knowledge of VLMs are highly correlated with their performance on our benchmark. Two observations support this hypothesis: (1) Larger models typically outperform smaller models, and it is widely recognized that larger models tend to have better reasoning abilities and world knowledge; (2) LLaVA-1.6 significantly outperforms LLaVA-1.5, likely due to advancements in these aspects [1].

Second, social understanding ability is also highly correlated with model performance. The comics in the YESBUT benchmark mainly focus on daily life concepts, and understanding these nuances requires a deep understanding of human norms.

Therefore, improving models' abilities in reasoning, world knowledge, and social understanding during the pretraining stage will enhance their performance on this task. We will incorporate this discussion into our revised version.

[1] LLaVA-NeXT: Improved reasoning, OCR, and world knowledge


Suggestions on Limitations

Thank you for your detailed and valuable suggestions! We acknowledge these limitations and will revise our paper to incorporate the points you raised. We will discuss these aspects in greater depth to ensure that our work is aligned with responsible AI practices, respecting cultural and social nuances. Thank you again for the insightful suggestions!

Comment

Thank you for your rebuttal. I have carefully read and considered your response to my review. I appreciate the detailed explanations and clarifications you have provided on several key points.

Regarding the dataset size, I understand your rationale for the current scale and your plans for future expansion. This addresses my initial concern adequately.

I am glad to see you have acknowledged the importance of bias mitigation in the annotation process. Your planned additions to discuss this in more detail will strengthen the paper.

Overall, your responses have addressed my main concerns and questions effectively. The proposed additions and clarifications will certainly strengthen the paper. I believe these changes will result in a more comprehensive and impactful contribution to the field.

Given your thorough response and planned revisions, I maintain my original assessment of the paper. The proposed work remains technically solid with high potential impact in its sub-area.

Comment

Dear Reviewer yTj4,

Thank you again for your thoughtful review and suggestions. We appreciate your feedback and will make revisions to clarify the points you raised and incorporate your valuable suggestions. Should you have any further questions or concerns, please don't hesitate to reach out; we would be more than happy to address them.

Best regards,

The Authors

Official Review
Rating: 7

The paper proposes a new evaluation benchmark to assess how well current VLMs understand humor and uses it to compare various VLMs and LLMs.

Strengths

Although some previous studies such as [7] and [10] have proposed humor benchmarks for VLMs, as the authors mention in lines 112-114, the proposed benchmark is the first to use two input images per sample and to focus on the relation between the two images. As far as I know, this is indeed novel.

Based on the dataset, the authors propose several tasks, which allow researchers to conduct more in-depth analysis of LLMs' performance.

The paper uses the benchmark to compare many VLMs and conducts some good analyses.

Weaknesses

The dataset size is relatively small; both [7] and [10] have more than 1k samples. The small dataset size might cause some instability in the scores, which I guess is why the paper does not provide statistical significance analyses. The subjectiveness of the task and the noise of automatic metrics could intensify this problem. We can see that although GPT-4 achieves the best performance on most metrics, the correlations between BERT, R-2, and GPT scores are not very high for some models. Although automatic evaluation using GPT has been shown to correlate well with human evaluation, prior studies (e.g., https://arxiv.org/html/2404.13076v1) show that it is biased toward the model's own generations. In my experience, neither BERT score nor R-2 is a reliable metric for creative writing, and GPT evaluation only becomes more reliable when the dataset size is large.

The paper does not provide human performance, which makes interpreting the state-of-the-art performance more difficult. For example, are the scores of GPT-4 in Table 2 really good?

Questions

Could you provide some statistical significance analyses?

Which GPT are you using for evaluation? GPT-4? GPT-3.5-turbo?

Limitations

The discussion of the weaknesses I mentioned could be added.

Author Response

Thank you for your insightful feedback! We are pleased to learn that you consider our benchmark novel, which includes two input images per sample to emphasize the relationship between the images. Additionally, we are grateful for your acknowledgment of our task settings and good analysis. Below, we address each of your concerns and questions in detail:

W1: The dataset size is relatively small, which might lead to instability of the scores

Regarding the relatively small dataset size

Thanks for raising this comment. Please see point 1 in the Overall Response.

Regarding potential instability of the scores and significance analyses

This is a good point. To enhance stability and reduce potential bias, we create three distinct prompts for each task and report the average scores from three runs with each prompt (Line 213). Here, we include additional significance analyses for the description and contradiction generation tasks on VLMs. Specifically, we consider the results from all three prompts, so each model produces 1,044 outputs per task. We use a t-test [1] for statistical significance analysis:

  1. For description generation: GPT4 is significantly better than other baselines on all metrics (p<0.0001). Claude-3 achieves the second best result and is significantly better than all open-sourced VLMs (p<0.0001).

  2. For contradiction generation: Similarly, GPT4 output is significantly better than other baselines on all metrics (p<0.01). However, the results between Claude-3 and LLaVA-1.6-13B on ROUGE-2 and between Claude-3 and CogVLM on BERT score are not significant.

We will incorporate the complete results in our revised version.

[1] https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind.html
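
For reference, a minimal sketch of this kind of per-example significance check is shown below. It assumes the per-example metric scores have already been collected into Python lists; the dummy numbers and the p < 0.01 threshold are illustrative only, not the paper's actual results.

```python
# Hypothetical sketch of the per-example significance check described above,
# using scipy.stats.ttest_ind as referenced in [1]. The score lists are placeholders.
from scipy.stats import ttest_ind

def compare_models(scores_a, scores_b, alpha=0.01):
    """Independent two-sample t-test between per-example metric scores."""
    t_stat, p_value = ttest_ind(scores_a, scores_b)
    return t_stat, p_value, p_value < alpha

# Dummy per-example scores (NOT the paper's data); in practice each list
# would hold, e.g., 1,044 values per model per task.
t, p, significant = compare_models(
    [0.82, 0.79, 0.85, 0.81, 0.78],  # e.g., GPT-4 outputs
    [0.74, 0.70, 0.77, 0.72, 0.69],  # e.g., a baseline VLM's outputs
)
print(f"t = {t:.3f}, p = {p:.4f}, significant at p < 0.01: {significant}")
```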


W2: The subjectiveness of the task and the noise of automatic metrics.

Subjectiveness of the task

This is a good point! Indeed, humor understanding can be subjective. To address this, we have designed our tasks to minimize subjectivity. Specifically, we formalized literal description and contradiction generation as text generation tasks because they are less subjective and focus on specific descriptions and narrative illustrations of the comics. For the more abstract and subjective components, such as underlying philosophy and title selection, we formulated them as selection tasks (i.e., determining which title is “objectively” better), and evaluated these using straightforward accuracy metrics. We will better clarify this in our revision.

The noise of automatic metrics

Regarding evaluation metrics, we acknowledge that evaluations of text generation tasks remain challenging, and current automatic metrics have limitations. To mitigate this issue: (1) For semantic-based evaluations like BERT score and Rouge, we report recall scores to measure how many key points of the reference are captured by the model, ensuring a more precise assessment of content coverage. (2) To mitigate bias in GPT-based evaluations, we use different GPT variants for different purposes: for experiments, we use the gpt-4-vision-preview version as a baseline, while for evaluation, we employ the gpt-3.5-turbo-0125 version. This approach helps reduce potential bias towards the GPT4 model’s own generation.
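
To make the recall-oriented setup concrete, here is a minimal sketch assuming the open-source `bert-score` and `rouge-score` Python packages; the paper's exact implementation and inputs are not specified in this thread, so the strings below are illustrative placeholders.

```python
# Hypothetical sketch of recall-based scoring with bert-score and rouge-score.
# The reference/candidate strings are placeholders, not benchmark data.
from bert_score import score as bertscore
from rouge_score import rouge_scorer

reference = "The left panel shows a tidy graph; the right panel reveals the messy data behind it."
candidate = "A model-generated literal description of the two comic panels."

# BERTScore: keep the recall component (coverage of the reference's key points).
P, R, F1 = bertscore([candidate], [reference], lang="en")
print("BERTScore recall:", R[0].item())

# ROUGE-2 recall: bigram overlap with the reference.
scorer = rouge_scorer.RougeScorer(["rouge2"], use_stemmer=True)
print("ROUGE-2 recall:", scorer.score(reference, candidate)["rouge2"].recall)
```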

We will revise our paper to better clarify these points and discuss the potential limitations of automatic evaluations in more detail. Despite the inherent challenges, we believe our approach offers valuable insights into the model's capabilities.


W3: Human performance

Thank you for this good suggestion! We have included human performance. Due to time constraints, we randomly select 50 samples and ask two participants to perform the underlying philosophy selection and title matching tasks. One participant is male and the other is female, with cultural backgrounds from East Asia and North America, respectively, to ensure a fair and diverse evaluation.

The results are shown below. These results highlight the significant room for improvement for VLMs on these tasks. We will revise our paper to incorporate these discussions.

Acc (%)     | LLaVA-1.6-34B | Claude-3 | GPT-4 | Human
Philosophy  | 82.00         | 85.55    | 77.33 | 94.00
Title       | 66.00         | 68.00    | 62.00 | 93.00

Q1: Details of GPT used for evaluation

We employ the gpt-3.5-turbo-0125 variant for the automatic evaluation. We will better clarify this in the paper.
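
As an illustration of how such a GPT-based judge can be called, here is a minimal sketch using the OpenAI Python client with gpt-3.5-turbo-0125. The judging prompt and the 1-5 scale are assumptions made for illustration, not the paper's actual evaluation prompts.

```python
# Hypothetical sketch of reference-based GPT evaluation with gpt-3.5-turbo-0125.
# The judging prompt and rating scale below are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def gpt_judge(reference: str, candidate: str) -> str:
    prompt = (
        "You are grading a model-written description of a two-panel comic.\n\n"
        f"Reference description:\n{reference}\n\n"
        f"Model output:\n{candidate}\n\n"
        "Rate how well the output covers the key points of the reference "
        "on a scale from 1 to 5. Reply with the number only."
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo-0125",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip()
```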

Comment

Based on the answer, I have increased the contribution score from 2 to 3. Although GPT-4 and gpt-3.5-turbo-0125 are different LLMs, they might be trained on similar data. If the authors have the time and budget, I recommend trying other LLMs (e.g., Claude 3) to evaluate GPT-4 and Claude 3 again and seeing whether the results differ from those obtained with gpt-3.5-turbo-0125.

Comment

Dear Reviewer wVUt,

Thank you once again for your review and feedback. We appreciate the increased contribution score. We have included additional results using Claude-3 (claude-3-opus-20240229) as the base model for automatic evaluations, with the same evaluation prompts. The results are presented below:

  • Literal Description

    Eval LLM  | GPT-4 | Claude-3 | LLaVA-1.6-34B | LLaVA-1.6-13B | LLaVA-1.5-13B
    GPT-3.5   | 3.76  | 3.28     | 2.86          | 2.96          | 2.51
    Claude-3  | 3.44  | 2.64     | 2.48          | 2.64          | 2.02

  • Contradiction

    Eval LLM  | GPT-4 | Claude-3 | LLaVA-1.6-34B | LLaVA-1.6-13B | LLaVA-1.5-13B
    GPT-3.5   | 4.03  | 3.79     | 3.51          | 3.36          | 3.36
    Claude-3  | 3.69  | 3.26     | 2.78          | 2.84          | 2.65

As shown, the results of Claude-3 display a trend similar to those of GPT-3.5. We will incorporate these findings into our revised paper. We will also modify our paper to clarify the previously raised points and incorporate your suggestions. If you have any further questions or concerns, please feel free to reach out, and we will be happy to address them.

Best Regards,

Authors

Comment

Thank you. I have increased the overall rating from 6 to 7 and have no further question.

Comment

Dear Reviewer wVUt,

We sincerely appreciate your recognition of our efforts and the increased score. We are also grateful for your valuable suggestions and will ensure they are incorporated into our revisions.

Best regards,

Authors

Official Review
Rating: 8

This paper presents YESBUT, a benchmark containing pairs of images that convey humor through juxtaposition. For each pair, the authors employed a human-AI collaboration pipeline to annotate detailed tasks assessing how the humor can be understood, including literal description writing, contradiction generation, underlying philosophy selection, and title matching. Based on the YESBUT benchmark, rich results are presented, including comprehensive comparisons across various LLMs and VLMs and in-depth analyses that provide insights.

Strengths

  1. The paper presents a new task of juxtaposition-based humor in multimodal settings. It can reflect the system's ability to interpret and generate nuanced, contextually appropriate responses, thereby improving its interaction with humans in more natural and engaging ways. The task is challenging because it requires models to make sense of the non-linear connections between two semantically related images and social reasoning skills to capture a sense of humor. The results also show the limitations of existing models in handling the task.
  2. The YESBUT benchmark crafted to research the task is helpful. It contains rich human annotations reflecting varying perspectives for task evaluation. The benchmark can be beneficial for future research.
  3. The paper presents rich experimental studies. The results and analyses offer a comprehensive view of the pros and cons of the cutting-edge VLM techniques and provide in-depth analyses to interpret where the challenges are and how to address them.
  4. The paper is written very clearly, with well-designed organization, clear writing, and good presentation with cases, figures, tables, etc.

Weaknesses

  1. The dataset is relatively small, with only 348 image pairs. Although I understand that the data is very hard to gather and the annotation is very labor intensive, a larger dataset will enable more sound evaluation results and allow the potential for model training (now it can only support model evaluation).
  2. It would be good if the authors could also examine how the models can be further improved to tackle the task well. In the current version, most of the findings center on the limitations, but it would be good for the authors to also point out the directions of further studies in advancing the technology.

Questions

  1. For philosophy selection, why is it formulated as MCQs instead of text generation like the other tasks?
  2. How to ensure the diversity of the benchmark data (so the image covers varying testing scenarios)?
  3. Would the prompts play crucial roles in models’ decision making? What will happen if different sets of prompts are used?
  4. Why can GPT-4 engage in both the annotation and the model comparison? Will that introduce any bias?

Limitations

The dataset size is relatively small, which hence cannot be used for model training. Also, it would be good if the authors can provide insight into how to further advance the techniques to better tackle the task.

Author Response

Thank you very much for your valuable comments and suggestions! We are pleased to learn that you find the YESBUT benchmark helpful and beneficial to future research, and that our rich experimental studies and in-depth analyses offer a comprehensive view. We address your questions and concerns below:

W1: Regarding data size

Thank you for the question! Please see point 1 in the Overall Response.


W2: Regarding model improvements

Thank you for your insightful suggestion! We agree that highlighting directions for future research is crucial. Currently, we aim to uncover potential areas for improvement through results analyses.

First, our analysis indicates that VLMs often struggle with accurately interpreting image content and may make errors in literal descriptions (Sec. 6.1 & 6.2). This suggests a need for future work to enhance visual interpretation capabilities.

Second, improving the in-depth reasoning ability of VLMs is essential for this task. For instance, LLaVA-1.6 significantly outperforms LLaVA-1.5, likely due to the advancements in reasoning abilities [1]. Future work might incorporate recent advanced reasoning approaches (e.g., multi-agent debate, refinement-based reasoning) to further improve model performance.

Finally, our error analysis reveals that models tend to suffer from hallucination (Line 333), suggesting the need to incorporate external knowledge to enhance humor understanding. To mitigate this problem, knowledge augmentation methods can be employed to improve VLMs' performance.

We will revise our paper to include a dedicated section that provides a more detailed discussion on these points and outlines future research directions.

[1] LLaVA-NeXT: Improved reasoning, OCR, and world knowledge


Q1: Why philosophy selection is formulated as MCQs

This is a good question! Compared to the literal description and contradiction illustration, philosophy and title are more open-ended and subjective: multiple valid philosophies and titles may exist for one comic. This characteristic makes evaluating these two tasks challenging if they are formulated as text generation. Therefore, we follow previous work and formulate them as MCQs [1,2].

[1] Can Large Multimodal Models Uncover Deep Semantics Behind Images?
[2] Do Androids Laugh at Electric Sheep? Humor “Understanding” Benchmarks from The New Yorker Caption Contest


Q2: The diversity of the benchmark

Thank you for raising this good question! The comics in our benchmark encompass a diverse range of everyday life scenarios. To ensure and analyze the topic coverage, we prompted ChatGPT to generate topical keywords for each comic based on its description and then clustered these keywords. The complete clusters and their statistics are provided in the one-page PDF of the Overall Response. The results indicate the diversity of our benchmark. We will incorporate the analysis in our revised version.
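
For illustration, a sketch of how such keyword clustering could be carried out is shown below. The keyword list, embedding model, and k-means clustering are assumptions made for this sketch, since the thread does not specify which clustering method was actually used.

```python
# Hypothetical sketch of clustering ChatGPT-generated topical keywords.
# The keyword list, embedding model, and cluster count are illustrative only.
from collections import Counter

from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

keywords = ["social media", "fitness goals", "remote work", "online dating",
            "home cooking", "gym routine", "office meetings", "food delivery"]

embedder = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = embedder.encode(keywords)

labels = KMeans(n_clusters=3, random_state=0).fit_predict(embeddings)
for cluster_id, size in sorted(Counter(labels).items()):
    members = [k for k, lab in zip(keywords, labels) if lab == cluster_id]
    print(f"Cluster {cluster_id} ({size} keywords): {members}")
```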


Q3: Would the prompts play crucial roles in models’ decision making?

To address the potential influence of prompts on model performance, in the experiments we created three distinct prompts for each task and reported the average scores from three runs with each prompt. The specific prompt sets are detailed in Appendix C.3. Our initial observations indicate that different VLMs exhibit varying degrees of sensitivity to prompts (for example, commercial models are less sensitive to the prompts than smaller VLMs); however, the overall performance gap across different prompts is not significant. We will revise the relevant section to clarify this point more effectively.


Q4: Why can GPT-4 engage in both the annotation and the model comparison? Will that introduce any bias?

This is a good question! To clarify, we use different GPT variants for distinct purposes: for data annotation, we leverage the gpt-4-turbo variant; for experiments, we report results using gpt-4-vision-preview; and for GPT-based automatic evaluation on text generation tasks, we employ gpt-3.5-turbo-0125. While we do use different GPT model variants, there might still be some inherent bias since we do not know the backend model details. We will revise our paper to better clarify this distinction and acknowledge the potential for bias.

Comment

Thanks for your detailed response. Most of my concerns have been well addressed, and I've increased my score. Although the dataset scale is still a concern, I acknowledge the challenge of gathering large-scale data and the thoughtful consideration given to managing the diversity of the data samples. It might be interesting to see how to improve model performance given small-scale data, yet that is beyond the scope of this pilot benchmark study.

Comment

Dear Reviewer,

Thank you once again for your review and feedback! We appreciate the increased score and will revise our paper to clarify the previously raised points and incorporate your suggestions. If you have any follow-up questions or concerns, please let us know and we would be happy to answer them.

Best Regards,

Authors

Comment

I engaged in the discussion relatively early, and I saw a few new messages coming up. So, I read the updates, and I still think this is a nice contribution to a novel task (examining how LLMs capture a sense of humor given images in juxtaposition). No one has explored this before. Also, the paper shows rich experimental results demonstrating existing limitations and directions for future work. The work is very sound, and the rebuttal has addressed the minor concerns the original version may have had. I see no reason for rejecting the paper despite the unavoidable limitations of a pilot study (provided the authors face these limitations and discuss them well).

Comment

We sincerely thank the reviewer for their thoughtful and supportive comments. Your recognition of the novelty and value of our work is deeply appreciated and offers strong encouragement for us. We are committed to discussing the limitations and incorporating the suggestions to enhance our work. Once again, we are grateful for your valuable suggestions.

Best regards,

Authors

Official Review
Rating: 4

This paper investigates the capability of large vision language models (VLMs) to understand humor in comics through narrative contradictions. The authors introduce the YESBUT benchmark, comprising tasks designed to evaluate AI’s ability to recognize and interpret contradictory narratives in comics. Experiments are conducted with both commercial and open-source VLMs to assess their performance on tasks ranging from literal content comprehension to deep narrative reasoning.

Strengths

  1. Innovative Benchmark: The introduction of the YESBUT benchmark provides a structured approach to evaluating AI’s understanding of humorous contradictions in comics.

  2. Human-AI Collaborative Annotation: The use of a human-AI collaborative pipeline for data annotation is innovative and helps in obtaining high-quality annotations efficiently.

  3. Insightful Analysis: The analysis of results, including error types and human evaluation, provides valuable insights into the challenges faced by current models in understanding humorous contradictions.

Weaknesses

Novelty: The concept of using VLMs to understand humor in comics has been explored in various forms in previous research. The YESBUT benchmark, though focused on narrative contradictions, might be seen as a slight variation on existing benchmarks that evaluate multimodal understanding and humor recognition.

Limitation: The benchmark focuses on a very specific type of reasoning, juxtaposition, which may not be broadly applicable to other forms of humor or narrative understanding. This narrow focus limits the benchmark's utility in evaluating general AI capabilities in humor comprehension. The task of matching comics with titles involves a high degree of subjectivity. Different annotators might have varying opinions on what constitutes a suitable title, leading to inconsistent and potentially biased evaluations. The dataset consists of only about three hundred comics, which may limit the generalizability of the findings. Expanding the dataset could provide more robust insights.

Annotation: The annotation process, which relies heavily on human-AI collaboration, might introduce bias. Human annotators may inadvertently guide the AI’s outputs, leading to annotations that reflect human reasoning more than autonomous AI understanding.

Metrics: The primary evaluation metrics include BERT score, ROUGE-2 (recall), and GPT-based evaluation scores for literal description and contradiction generation tasks. While these metrics are useful for assessing surface-level content generation, they may not fully capture the depth of understanding required for humor comprehension.

Suitability: Given the nature of the contribution, the work may be better suited for venues that specifically focus on datasets and benchmarks, such as the NeurIPS dataset/benchmark track and workshops or tracks dedicated to introducing new datasets. This setting would allow the authors to highlight the value of the YESBUT benchmark without the expectation of a significant theoretical or methodological breakthrough.

Questions

How did you address the potential subjectivity and variability in human annotations, especially for tasks like underlying philosophy selection and title matching? Did you measure inter-annotator agreement, and if so, what were the results?


How did you ensure that the negative titles and philosophies were sufficiently challenging yet distinct from the correct ones? Can you provide examples where models struggled with these distinctions?

For example, I feel that the example in Figure 2, the negative title "Graphs Don't Lie, People Do" is also quite suitable to the comic.


Did you assess the models for any cultural or contextual biases in their understanding of humor? If so, what were your findings, and how do you plan to address these biases in future work?


Could you provide more details on why you chose BERT score, ROUGE-2, and GPT-based evaluations as your primary metrics? Have you considered using other evaluation metrics that might better capture the nuances of humor understanding?

Limitations

This is discussed in Sec. 8.

Author Response

Thank you for your valuable and constructive comments. We appreciate your recognition of our work's innovative benchmarks, human-AI collaborative annotation, and insightful analysis. Below, we address your concerns and questions individually:

W1 & W5: Regarding Novelty and Suitability

Thank you for the feedback! However, we believe our work is uniquely novel and suitable for this venue. First, our benchmark formulates a new task to address humor understanding through narrative contradictions, a complex and underexplored aspect of AI research. This task challenges models to integrate multiple fundamental abilities including the comprehension of human norms, critical thinking about similarities and differences of elements, and nonlinear reasoning, which are critical for developing socially intelligent systems. Reviewers aFj6 and yTj4 have recognized these contributions, which we will clarify further in our revision. In addition, our work extends beyond a dataset introduction with comprehensive evaluations and detailed analyses, providing valuable insights into AI's interpretive and reasoning capabilities, making it highly relevant for broader AI and NLP communities. As a pilot study on this topic, we believe our work is suitable for the venue and can bring valuable insights for future research.


W2: Limitation of benchmark

juxtaposition may not be broadly applicable

This is a good point. Juxtaposition is a common and sophisticated technique with applications in domains such as art, literature, and mathematics [1]. As a pilot study, we consider it a suitable entry point for evaluating complex narrative understanding in AI models.

The task of matching comics with titles involves a high degree of subjectivity

Thank you for this important question. Indeed, humor understanding can be subjective. To address this, we have formalized title matching as a selection task rather than a generation task (i.e., determining which title is "objectively" better). To ensure a "common preference," we conduct multiple quality checks and verifications during annotation, where each sample is reviewed by different annotators to reach an agreement (Line 165). We will clarify this in our revised version.

The dataset consists of only about three hundred comics

Please see the Overall Response, point 1.

[1] https://en.wikipedia.org/wiki/Juxtaposition


W3 & Q1: Regarding bias and subjectivity in annotation

The annotation process might introduce bias and how did you address the subjectivity

Please see the Overall Response, point 2.

Did you measure inter-annotator agreement

As each component was annotated and verified by at least three annotators to reach agreement, we did not further measure inter-annotator agreement.


W4 & Q4: Regarding the weakness and question of evaluation metrics

Please see the Overall Response, point 3.


Q2: Regarding the quality of negative titles and philosophies

How did you ensure that the negative titles and philosophies were sufficiently challenging yet distinct from the correct ones

Thank you for raising this important question! The negative titles and philosophies are constructed by human annotators. For each sample, we first prompt GPT-4 to generate negative candidate options (as shown in Table 4). The annotators then revise and edit these options to ensure they are sufficiently challenging yet distinct from the correct ones. During the quality check stage, annotators will further verify the quality. As illustrated in Figure 7, all negative options are on-topic and relevant to the comic but may contain incorrect logic or do not accurately reflect the narrative. This process ensures quality of the negative options.

The negative title "Graphs Don't Lie, People Do" in the example of Figure 2

This is a very good point! The comic itself illustrates how the interpretation and presentation of data can be manipulated or selectively used to tell misleading stories, while the graphs and data are factual and real. Therefore, the title implying that people lie is not appropriate.


Q3: Regarding the question of cultural or context bias

This is a very insightful question! In our current study, the annotations are based on the common interpretation of humor, and our evaluations focus on how model performance aligns with such average preferences, without consideration of specific contextual or cultural information. However, we acknowledge the critical importance of considering cultural and contextual variations in humor understanding. We will explore this aspect in our future research. Additionally, we will include a discussion of these considerations in the Limitations section of our revised version.


Regarding the raised ethical consideration

Thank you for highlighting these important concerns. To address them, we have included a comprehensive ethics statement in our paper (Section A in the appendix). Below we outline the key points:

Data privacy, copyright, and consent

All data samples are sourced from publicly available content on social media platforms. We strictly adhere to copyright laws by using original links to the comics, thereby avoiding any infringement. We will provide the original links of each image when releasing the benchmark.

Discrimination, bias, and fairness

We have conducted a thorough review and rigorous quality checks of our samples to filter out potentially offensive or harmful content. Our annotators come from diverse cultural and gender backgrounds, and we conduct multiple quality checks and verification to minimize bias.

We will revise our paper to better clarify these points and provide reassurance regarding the ethical considerations of our work. We will also incorporate more detailed instructions to guide users in adhering to copyright regulations, and encourage users to carefully consider the ethical implications of the generated outputs.

Comment

Thanks for the detailed response. I still have many concerns about this paper.

  • I still feel the novelty of the paper is quite limited. The ACL best paper [1], published on arXiv on 13 Sep 2022, already studied humor understanding from cartoons, and this submission just moves it to a specific, narrow-domain setting: presenting a pair of pictures (i.e., juxtaposition). Although this is a nice extension for studying VLMs' ability to reason over two images instead of one, such reasoning abilities of VLMs have already been studied with many other multi-image VQA tasks for a long time. I don't think the creation process of this dataset is novel either. The authors also claim that the major difference between this submission and [1] is just going from reasoning on a single image to a pair of images.

  • The tasks are subjective by nature, so a larger set of annotators per example is needed, but the authors do not provide inter-annotator agreement. The example I mentioned, the negative title "Graphs Don't Lie, People Do" in Figure 2, already shows the problem: you think it is not "appropriate" but I feel it is, since intentionally showing partial or selectively chosen factual information is a form of lying. We may not need to agree with each other on this, but it suggests that the authors did not handle the subjectiveness of the dataset very well.

  • My concern about cultural bias is still there. The authors did not explain well what they did to ensure that the dataset contains only tasks that are not culturally sensitive.

  • concern on the metric: "Metrics: The primary evaluation metrics include BERT score, ROUGE-2 (recall), and GPT-based evaluation scores for literal description and contradiction generation tasks. While these metrics are useful for assessing surface-level content generation, they may not fully capture the depth of understanding required for humor comprehension." The authors' response in the general rebuttal does not address this point very well.

To sum up, I think the dataset is fun and interesting and may be a good contribution, but in my humble opinion it is a bit less suitable for the main track at NeurIPS. I have also read the comments from other reviewers. Based on my concerns, I'd like to lower my rating.

[1] Do Androids Laugh at Electric Sheep? Humor “Understanding” Benchmarks from The New Yorker Caption Contest

Comment

We thank Reviewer 2fnt for the constructive reply, and we regret to see the score decrease. While we appreciate the reviewer's efforts, we would like to briefly clarify a few points in Reviewer 2fnt's reply in case of any potential misunderstandings.

The novelty of the paper is quite limited & nonlinear reasoning abilities of VLMs have been studied for a long time

We would like to clarify that our work is not just a move "from reasoning on a single image to a pair of images". Instead, our focus is on understanding complex, nonlinear narratives that emerge through juxtaposition, particularly in contexts involving abstract, human-centered events as depicted in comics. This is important for VLMs, which recent work has shown to struggle with deep, nonlinear reasoning. By exploring how VLMs understand contradictions and abstract social concepts through sophisticated reasoning processes, we aim to uncover insights that are critical for the advancement of these large models and can guide future developments in the field.

authors do not provide inter-human annotator agreement

We would like to clarify that our annotations are not generated by models, but rigorously produced by human annotators with the assistance of GPT4. Each sample is annotated, verified, and checked by at least three annotators (constituting 43% of our total annotator pool) to ensure that each annotation achieves a high degree of consensus. Therefore, it is not feasible for us to compute inter-human annotator agreement because annotators do not produce annotations individually and independently.

The example suggests that the authors did not handle the subjectiveness of the dataset very well

We formalize title understanding as a “selection task” where the positive title is deemed more appropriate based on common interpretations, instead of a “binary classification” task about right or wrong. For this specific example, the positive title is “commonly” considered to be better than the negative options.

concerns on the cultural bias

Cultural bias often arises from varying interpretations across different cultural backgrounds. To reduce such bias and focus on common interpretation, in the annotation process each sample is verified by multiple annotators from different cultural backgrounds. Their consensus was required for an annotation to be included; if no consensus was reached, the annotation was either modified or excluded from the dataset. This approach ensures that we retain only samples that minimize cultural bias.

While this approach significantly reduces bias, it is important to acknowledge that no dataset can completely eliminate bias. We recognize that our annotator pool may not represent every cultural perspective, and we are committed to including a more detailed discussion of this limitation in our revised paper.

concerns on metric

Our use of BERT score, ROUGE-2 (recall), and GPT-based evaluation scores is designed to assess whether the model-generated outputs accurately capture the key information of the narratives. Achieving this level of generation inherently requires "the depth of understanding required for humor comprehension". While these metrics may appear surface-level, they are effective proxies for measuring the model's comprehension, as they ensure that the outputs reflect a nuanced understanding of the comic.

Again, we thank Reviewer 2fnt for providing constructive comments and suggestions, and for replying to our rebuttal. We will revise the paper to better incorporate these insights and clarify the unclear parts.

Comment

Dear Reviewer,

Thank you once again for your valuable review. With the discussion deadline approaching, we would like to ask whether our response has adequately addressed your questions. If there are any outstanding issues, we would like the chance to respond before the discussion period is over.

Thanks again for your thoughtful review!

Best regards,

Authors

Author Response

Overall Response to All Reviewers

We thank all reviewers for their valuable comments and suggestions. We are pleased to know that the reviewers consider our benchmark novel (reviewers wVUt, yTj4) and innovative (reviewer 2fnt), with a rigorous human-AI collaborative annotation process (reviewers 2fnt, yTj4). We are also excited to learn that our results and analyses are considered comprehensive (reviewer aFj6), insightful (reviewers 2fnt, yTj4), and beneficial for future research (reviewer aFj6). We also thank reviewers aFj6 and yTj4 for considering our paper clearly written and well presented.

Below we address some common concerns shared by the reviewers.

1. Regarding data size

We acknowledge the relatively small size of our benchmark due to the challenges and costs associated with data collection and annotation. However, as a benchmark for a novel task, we have rigorously collected and annotated each comic to ensure high quality and reliability. Despite its size, YESBUT covers a broad range of domains and topics (please see our additional analysis on topic coverage in the attached PDF). We believe it serves as a pioneering benchmark to enhance future research in this area.

Meanwhile, we are committed to updating our benchmark with more samples and additional narrative logic types beyond contradictions. We will also explore synthesizing comics using image generation models in future work.

2. Regarding potential bias and subjectivity of annotation

Our benchmark focuses on common interpretation of humor. However, we recognize that the subjectivity of this task may introduce bias. To mitigate this issue, we have taken several steps in our annotation process:

  1. Diverse Annotator Backgrounds: Our annotators come from different genders and diverse cultural backgrounds, including North America and East Asia, providing a range of perspectives. This diversity helps to mitigate cultural and gender biases in humor interpretation.

  2. Consensus Among Annotators: The annotation process incorporates multiple quality checks and verifications to ensure consensus among different annotators. Any comics with controversy and potential bias are filtered out. During the cross verification stages, annotations identified as biased by any annotator are properly modified. This process helps reduce biases stemming from individual perspectives.

  3. Verification with Social Media Comments: We also verify our annotations by checking the comments on social media for each comic. This step ensures that our annotations align with the common interpretation of the comic.

Additionally, we recognize that subjectivity is an inherent aspect of data annotation, especially for more open-ended components such as title and philosophy. Therefore, we frame these two tasks as selection tasks, ensuring the correct option is “objectively” better than the negative options. Despite these efforts, we acknowledge that our annotations may still carry inherent biases. We will further clarify this and discuss the potential biases in the Limitation section for better guidance of future usage of our benchmark.

3. Regarding the evaluation metrics of generation tasks

Evaluations of text generation tasks remain challenging and current automatic metrics are not perfect. Therefore, we try to mitigate this challenge from several aspects:

  1. Task formalization: We have strategically chosen to formalize literal description and contradiction generation as text generation tasks because they are less open-ended and more focused on specific descriptions and narrative illustrations of the comic. For the more abstract and subjective components, such as underlying philosophy and title selection, we formulated them as selection tasks (i.e., determining which option is “objectively” better), and evaluated these using straightforward accuracy metrics.

  2. Evaluation metrics: For semantic-based evaluation, we report recall scores to measure how many key points of the reference are captured by the model. This ensures a more precise assessment of content coverage. Additionally, we employ a GPT-based metric alongside a gold standard reference, an approach that has shown strong alignment with human judgment in previous studies [1,2,3].

  3. Human Evaluations: We incorporate human evaluations to provide a comprehensive assessment of model output quality. This human judgement helps to capture the nuances that automatic metrics might miss, especially in understanding complex and abstract content like humor.

We will revise our paper to better clarify these points and provide a more detailed discussion of the potential limitations of automatic evaluations in NLG. Despite the inherent challenges, we believe our multifaceted evaluation approach provides a more balanced and comprehensive assessment of the model's performance.

[1] Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
[2] CLAIR: Evaluating Image Captions with Large Language Models
[3] G-Eval: NLG Evaluation Using GPT-4 with Better Human Alignment


We also provide a one-page pdf that includes the additional analysis on the following content:

(1) Analysis on the diversity of comic scenarios covered by our benchmark.

(2) The results of human performance compared with VLMs on 50 randomly sampled comics.

Once again, we appreciate all reviewers for your insightful comments and valuable suggestions. We will revise our paper to clarify the unclear parts and incorporate the suggestions. If our responses have addressed your concerns, please kindly consider increasing your scores. We sincerely appreciate your consideration.

Best regards,

Authors

Final Decision

Summary (taken from yTj4's excellent recap)

This paper introduces YESBUT, a new benchmark for evaluating large vision-language models' ability to understand humor and contradictions in comics with juxtaposed panels. The benchmark consists of two-panel comics with contradictory narratives, along with annotations for literal descriptions, contradiction explanations, underlying philosophies, and titles. The authors design four tasks of increasing difficulty: literal description writing, contradiction generation, underlying philosophy selection, and title matching. They evaluate several commercial and open-source vision-language models on these tasks using both automatic and human evaluation. The results show that even state-of-the-art models struggle with these tasks, especially the deeper reasoning required for understanding contradictions and abstractions. The paper provides insights into current limitations of AI in comprehending complex human expressions and offers directions for improvement.

Metareview

All but one of the reviews of this paper are strong accepts, and I agree with that consensus. This is a simple but clever idea executed well. While there are technical criticisms noted in the reviews (using BLEU for equivalence judgement), these are relatively minor and well addressed by the author response.

I cannot put it in my Metareview easily, but I encourage the SAC to look at Figure 6 in the paper. It's a good encapsulation of the goal of the paper and what we learn about the limitations of multimodal models. This is backed up by the paper's thorough analysis. In the end, I think this is the best recommendation of the paper: I found it interesting, I cannot think of anyone who has done something like this, and I learned something.

I think this paper would be a valuable addition to the NeurIPS program, so I am strongly arguing for its acceptance.

However, I would strongly encourage the authors to more deeply engage with copyright and fair use in the final version. This is not just scraping from Twitter, this is scraping from a single artist's Twitter feed. In an ideal world, you should reach out to them and explicitly get their permission to use this as a part of your dataset.