Do Language Models Agree with Human Perceptions of Suspense in Stories?
We show that while language models can detect when a text is meant to be suspenseful, they fail to match human judgments on its intensity and dynamics and are vulnerable to adversarial manipulations.
Abstract
Reviews and Discussion
This contribution presents a set of experiments following the paradigm of machine psychology to investigate the functionality of different open-weight LLMs in detecting suspense in narrative texts. The authors select four datasets that have been used (in different contexts) to run experiments with humans on the perception of suspense. This allows the authors to compare the results of the LMs against those obtained by humans to validate their alignment and differences.
Typos/suggestions:
- line 290: reference is missing
- line 232: missing "and" in "charts the details"
- in the abstract "We probe the abilities of LM suspense understanding by adversarially permuting the story text to identify what cause human and LM perceptions of suspense to diverge." maybe change "abilities" to "functionalities" and rephrase the last part of the sentence so that "perceptions" is not understood as referring to the LMs
- For sentiment analysis, you may want to add this as a seminal reference https://link.springer.com/book/10.1007/978-3-031-02145-9
Reasons to Accept
- Sound experimental settings: experiments are run 3 times and averages are reported
- Methodology is grounded in theoretical approaches, and the experimental rationale is guided by and follows well-motivated research questions
- The paper is clear and easy to follow. The tone of the paper avoids hypersensationalism and carefully avoids anthropomorphising LMs
- The findings could also be relevant for other disciplines such as Computational Literary Studies
Reasons to Reject
- I do not have reasons to reject this paper.
We greatly appreciate your enthusiastic and positive review. Thank you for your detailed feedback and helpful minor suggestions. Your commentary about how thoughtfully we worked to avoid anthropomorphizing the LLMs is sincerely appreciated. We will correct all mentioned typos (lines 290, 232) and add the missing reference promptly. Thank you for suggesting the additional reference for sentiment analysis. We agree it strengthens the theoretical grounding and will include it in our related work section, framed so that it connects to our problem setting. Your review significantly enhances the clarity and accuracy of our manuscript, and we appreciate your precise suggestions and support. We hope that you can be a strong advocate for the novelty and impact of our work.
The paper investigates the behaviour of LLMs in response to suspense studies. The authors reproduce 4 psychological studies, replacing the human responses with LLM responses. The comprehensive experimentation shows that, overall, LLMs fail to behave similarly to humans regarding the perception of suspense.
These results were confirmed by an interesting analysis performed by a series of adversarial attacks. The attacks demonstrate that LLMs do not substantially modify their patterns.
Reasons to Accept
Interesting topic, extremely well-written and explained paper, reproducing psychological experiments with LLMs, originality of the adversarial attacks. This paper helps to better understand the signals LLMs generate for suspense-related input. I think this paper will make for interesting discussion in the conference.
Reasons to Reject
No real reasons for rejection, I really like the paper. However:
- Table 2 and the heatmaps in the results section are barely readable. Because of this, it is quite difficult to follow the discussion of the results with respect to the heatmaps. I suggest that the authors simplify this section for better clarity.
Thank you very much for your supportive review and recognition of our paper's originality and experimental rigor. The readability of the heatmaps and Table 2 is a very valid concern, and we agree that readability is crucial for communication. We fully commit to improving these visualizations in the revised manuscript by clearly labeling axes, reducing visual clutter, and enhancing color contrast using colorblind-friendly palettes for improved clarity and interpretability. We appreciate your valuable feedback and aim to incorporate your suggestions fully in our final version, improving readability and enhancing the paper's impact. We acknowledge that the heatmaps are used in an effort to visualize the data, but we have found that all other alternatives are even more difficult for non-authors to interpret. We would sincerely appreciate any and all feedback as to what you believe is the best approach.
I suggest reducing the number of heatmaps in the section, moving some to the appendix, and reworking the writing of that section a bit. That way, with fewer figures, their size can be larger and more readable.
This work presents an analysis comparing human cognitive responses and LLM outputs in the context of suspense reasoning in narrative stories. It builds on existing cognitive science literature to evaluate several state-of-the-art LLMs, analyzing their differences from humans in detecting and scoring suspense.
Reasons to Accept
- The analysis is comprehensive and grounded in prior cognitive science research.
- The experimental setup is well-structured and thorough.
- The comparison between human and LLM suspense perception is clearly presented.
- The inclusion of adversarial experiments seems interesting.
Reasons to Reject
- While the work offers detailed behavioral comparisons, it lacks deeper insight into why LLMs behave differently from humans. Beyond showing that "LLMs are not humans," it would be helpful to identify specific cognitive gaps or limitations (e.g., "LLMs lack X capability"), which could better guide future model development.
- The work builds on existing components but does not introduce a new benchmark or propose an improved model.
Questions for the Authors
- Given the comprehensive analysis, could the authors provide guidance or concrete takeaways for future LLM development?
- Do LLMs exhibit behavior specific to narrative reasoning, or are similar patterns observed in other NLP tasks?
Minor: L290 missing reference?
Thank you for your thorough and thoughtful review. We appreciate your acknowledgment of the comprehensive and rigorous nature of our study. Regarding the search for deeper insights into cognitive gaps -- the experiments on adversarial attacks attempt to provide more insight into the cognitive limitations of LLMs, notably highlighting their superficial text-processing strategies and limited narrative-context understanding. We will state these cognitive gaps explicitly as limitations and attempt to guide future machine psychology research. For example, using adversarial attacks to test the validity of the underlying results is a compelling demonstration of how the 'Clever Hans' effect needs to be mitigated by LLM researchers. One of the contributions is a reproducible methodology that others can follow when running their own machine psychology experiments. While it was not our intention to produce a benchmark, experiments on LLM suspense perception can now be reproduced en masse. One major contribution is providing an example of how to hold to a rigorous standard for machine psychology experiments. This research establishes a baseline for future LM experiments measuring the phenomenon of suspense. Psychological literature can be combined with NLP to create new benchmarks using our methodology. We will further highlight this methodological novelty explicitly in the revised paper. Your suggestion to provide concrete guidance for future LLM development is excellent. We agree on the necessity of explicit future directions. Based on our results, we suggest targeted improvements in LLM mid-training to incorporate stronger narrative-context understanding, and explicitly rewarding models during reasoning reinforcement for understanding the probabilities and certainties of uncertain (i.e., suspenseful) outcomes. These recommendations will be explicitly stated in our conclusion section. Thank you! Finally, regarding the generalizability to other NLP tasks -- while testing generalization to other narrative and non-narrative NLP tasks exceeds the scope of our current experiments, we acknowledge this as a significant and promising direction for future work. We will explicitly state this limitation and suggest further studies on broader NLP generalization. Also, thank you for pointing out the missing reference -- we will fix it. We believe these clarifications and suggested revisions will significantly strengthen the paper and help address your primary concerns.
Thank you for addressing my concerns in the rebuttal; I have adjusted my score accordingly.
The paper tests whether LLMs can judge suspense in stories like humans do. It recreates four classic psychology experiments, using LLMs instead of people. The authors also test the models with adversarial story changes to see if their judgments hold up. They find that LLMs can detect if a story is suspenseful, but they do not match human ratings well, especially when measuring changes in suspense over time.
Reasons to Accept
It's a nice, fun study, and it raises important questions about how LLMs understand stories. The findings are interesting. It's a creative way to bring together NLP and psychology. The methodology is thoughtful. The paper evaluates a wide range of models, providing a comparative benchmark for suspense detection across architectures. It raises questions about the core challenges of aligning LLMs with human reasoning.
Reasons to Reject
I'm not sure if it's an issue to reduce suspense to a number -- does it capture the full emotional experience? LLMs rate story parts one at a time. But suspense builds across a story. Models may miss context that humans would use. Models fail at key moments (like plot twists). The paper shows this, but does not explain why. It’s unclear what cues the models rely on. Even when stories are scrambled or altered, LLM ratings barely change. This could mean they ignore structure or rely on surface patterns. The paper does not decide.
Some models do much better than others, but differences between models (e.g., training, size, instruction tuning) are not analyzed.
It would be interesting to learn whether LLMs understand story flow or just rate each paragraph in isolation. Why did prompt changes (like adding “in relation to the story so far”) have no effect? Are models guessing based on keywords or shallow features?
Thank you for the detailed and insightful feedback. We appreciate your recognition of our methodology and the importance of our research questions. Regarding the simplification of the phenomenon of suspense into a single number, we agree that reducing suspense to numeric values simplifies complex emotional experiences. However, numeric ratings using Likert scales are standard in the psychological literature and facilitate clear, reproducible comparisons between human and LM judgments. We will explicitly acknowledge the limitations of numeric measures and suggest how more thorough qualitative evaluations from human-subject research could be integrated into future machine psychology work. To clarify the contextual understanding of the LLMs -- our methodology explicitly provided LLMs with cumulative narrative context (prior story segments appended with each rating prompt). Additionally, we tested context-enhanced prompts (e.g., "in relation to the storyline so far"), but this did not significantly alter LM performance, reinforcing our conclusion that current LLMs inherently struggle with deep context integration. We will state this finding clearly in the manuscript. Our adversarial experiments directly address concerns about the cues used by LLMs: the adversarial experimentation explores how and when an LLM relies on local textual features and superficial cues rather than deeper structural or contextual elements. We will expand the discussion in our revised manuscript to explicitly include these insights. Finally, thank you for highlighting the point about differences between the models. While our current results already offer a comprehensive benchmark across diverse architectures, we commit to adding a brief comparative discussion of model architectures and training differences to provide clearer guidance for interpreting model variability in the final manuscript. We hope these clarifications address your concerns and help improve your confidence in our evaluation.
Thank you for the detailed reply. This addresses most of my concerns, especially around context handling and cue analysis. I appreciate the planned clarifications and additions.
We sincerely thank all the reviewers for providing thoughtful, constructive, and detailed comments. We are genuinely encouraged by the universal recognition among reviewers of our methodological rigor, original research, and the relevance of bridging NLP/LLMs and psychological studies. Below we address all the points raised by the reviewers, provide clarifications for some key issues, and suggest revisions for the final camera-ready manuscript. We hope these detailed responses clearly address all concerns raised by the reviewers and reinforce the paper's strengths and contributions.
This paper reproduces several psychological experiments measuring human responses to suspense, but using several open-source LLMs instead, drawing data from each of the original papers (G&B94, B&O88, D18, L&K15); it shows that LLMs are able to predict some degree of suspense intent, but do not correlate well with human judgments of relative suspense (the degree of suspense of a passage within the entire story). Reviewers generally found this work to be creative and original, with a strong empirical design and no significant weaknesses; it is well grounded in cognitive science research and speaks (through its experiments with adversarial permutations) to current research in LLM reasoning. As one reviewer notes, "this paper will make for interesting discussion in the conference."