SimpleToM: Exposing the Gap between Explicit ToM Inference and Implicit ToM Application in LLMs
SimpleToM is a novel Theory-of-Mind dataset that reveals the intriguing insight that frontier LLMs perform well on explicit ToM (predicting mental states) but poorly on applied ToM (predicting behavior and judging rationality).
Abstract
Reviews and Discussion
The paper introduces SimpleToM, a dataset designed to test large language models' (LLMs') ability to apply theory of mind (ToM) reasoning in practical scenarios. While previous research has explored whether LLMs possess ToM capabilities, this work specifically examines how well models can use ToM knowledge to predict and evaluate behavior in realistic situations. The authors created the SimpleToM dataset, featuring concise stories paired with mental state prediction, behavior prediction, and rationality judgment questions. They demonstrated that current LLMs show uneven ToM capabilities and developed interventions to improve model performance on applied ToM.
Strengths
- The paper introduces SimpleToM, a well-designed dataset for evaluating both explicit and applied Theory of Mind capabilities in LLMs.
- The three-level assessment approach (mental state, behavior, judgment) is innovative and provides a more comprehensive evaluation framework than previous work.
- These results highlight an important gap in current LLM capabilities: understanding mental states in isolation versus applying that understanding to reason about behavior and make appropriate judgments.
Weaknesses
- The biggest weakness of this work is that the motivation for evaluating the Theory of Mind (ToM) capabilities of large language models (LLMs) is not clearly articulated. The statement in lines 49-50, "Given the increasing use of LLMs in human interactions... it is crucial to assess their ToM capabilities," does not sufficiently explain why these evaluations are necessary. Specifically, it remains unclear how improvements in "prediction" and "judgment" in certain human behaviors or narrative scenarios would benefit the practical application of LLM systems or AI agents.
- The evaluation is based on performance in “applied” ToM, but it does not consider how LLMs can improve outcomes in real-world applications. That is, while LLMs can demonstrate the ability to make “predictions” and “judgments” about behavior, whether these abilities align with their capacity to meet human expectations in real applications such as AI agents is a separate issue. There seems to be a gap between these two aspects, and I hope the authors can provide more discussion on this matter.
- The "RESULTS AND ANALYSIS" section lacks sufficient analysis of your results. For example, in lines 316-317, "Only the latest o1-preview model ... on this question type (84.1%)," you should provide an analysis of why built-in inference-time reasoning tokens improve the model's performance.
- A human baseline is needed for your evaluation. As you mentioned in lines 42-43, "People infer what others know, anticipate their actions, and expect them to choose cost-minimizing behaviors," I think you should include an evaluation of human performance on your test for comparison.
Questions
As mentioned in Weaknesses 1, I hope the authors can provide more details on the benefits of achieving a higher score in SimpleToM. Specifically, how does this relate to the performance of large language models or their applications in real-world scenarios?
Thank you for your review! Here are some responses to your helpful comments:
> Importance of evaluating the ToM capabilities of LLMs?
See our general overall response above for examples of applications where failing on applied ToM would be critically problematic. For instance, an AI personal assistant that cannot distinguish its own state of mind from that of its user and other independent individuals would fail to predict others’ behavior and would not be able to highlight critical information in time. Furthermore, an AI judge that fails to appropriately consider whether one has a “guilty mind” could make poor judgments, like blaming innocent people for another's harm or, more generally, incorrectly judging someone’s actions as wrong or stupid (like purchasing moldy chips when they couldn’t see through the Pringles can to know it was moldy). We will highlight such motivating examples in the paper.
More generally, a fundamental challenge of modern LLMs is their opacity, i.e., our lack of understanding about how they behave and when they fail. This has been a major stumbling block and risk for more widespread LLM adoption. Our work develops a richer understanding of LLM behavior in the critical area of reasoning about other people’s beliefs, something that is fundamental to any conversational system. So, beyond specific practical applications, addressing this fundamental problem with current LLMs in a general-purpose way is an important motivation for this work, and one where we contribute both new resources and new insights that we believe are novel and impactful; we would be grateful if you could reassess your judgment on this. Again, we will clarify this in the paper.
> Why does o1-preview's built-in inference-time reasoning tokens help?
The built-in inference-time reasoning tokens are akin to chain-of-thought responses, although lengthier, indicating that the model ruminates more. As noted in Appendix H, we empirically observe that the o1-preview model uses many more tokens than other models with CoT*. This further highlights the novelty of the gap our paper exposes: even this recently released model, using so many reasoning tokens to reason about simple two-sentence stories, still shows a significant gap between explicit and applied ToM performance (see Table 2).
We do not have access to the specific o1 reasoning tokens due to the closed-access nature of the o1-preview model, but we analyze CoT examples across our interventions on models for which we can access the CoT output, discussing them in Lines 478-484, with detailed CoT outputs in Appendix I. Our analysis shows that more fine-grained human interventions/hints generally help the performance more by increasingly reminding models of key factors to consider. One hypothesis regarding o1-preview's built-in inference-time reasoning tokens being helpful in ToM reasoning is that they go through a longer reasoning process, which could potentially involve backtracking or self-questioning along the way (mimicking human intervention), leading to better performance.
> Human baseline?
The strict crowdsourcing requirements for SimpleToM ensured that each instance was labeled with unanimous agreement from three qualified workers. These workers independently verified and all agreed on the following:
- the key information in each story is something that Person X is unaware of
- the next likely “unaware action” is appropriate if and only if Person X is unaware of the key information
- the “aware action” is appropriate only in the counterfactual situation where Person X is somehow aware of the key information
Requiring agreement among all 3 qualified workers on such questions minimizes ambiguity and ensures the high-quality nature of the labels in our dataset. Given the simplicity of the scenarios, we can expect a very high human baseline. It is unlikely that other annotators would significantly deviate from the unanimously agreed-upon answers obtained during the verification process.
We also note in the paper that with enough hints, the strongest model (Claude-3.5-Sonnet with CoT* + MS in Table 5) gets 97% overall performance, with consistently >95% on the 3 different types of questions. The close to 100% performance attained across the mental state, behavior, and judgment questions illustrates the high-quality answerability of the questions and is an indicator that the human baseline ceiling would be close to 100%.
Thank you for your detailed response and the examples provided. However, I am still somewhat confused. Do these examples have real-world support? For instance, in which types of models (e.g., as AI assistants or AI judges) do these errors occur, and how frequently do they happen? For highly practical use cases, ToM seems to be a fairly general objective. I am unclear about its concrete impact on specific cases.
Thank you so much for your response and for engaging in the discussion with us!
You are correct that these are speculative examples. Note, though, that this is not an application paper comparing AI products for deployment, but rather a basic research contribution with both an innovative dataset and surprising findings about frontier LLMs. We believe there is significant value in fundamental research on LLMs within the community, particularly when addressing broad and fundamental questions such as “how well can LLMs reason about other people’s beliefs?”. This is not an esoteric question, given that we live in a world where different people have different beliefs, and we hope our examples help convey this.
We follow the definition of explicit vs. applied ToM as used in the literature (see lines 196-199), and cover both explicit theory of mind (conceptual knowledge about others’ mental states; i.e., via the (a) mental state question) and applied theory of mind (the ability to apply that conceptual knowledge in different contexts; i.e., via the (b) behavior and (c) judgment questions). While there has been much work testing LLMs on explicit ToM, their applied ToM capabilities are much less studied. In particular, our finding that LLMs can superficially report other people’s beliefs correctly (giving the impression of understanding their mental state), but then act as if otherwise (when predicting/judging behavior), has not been observed before (despite many earlier ToM studies), and we hope it is a particularly interesting and impactful finding.
This paper proposes a new dataset called SimpleToM, designed to assess existing large language models on their Theory of Mind reasoning abilities. SimpleToM features well-crafted, simple, and straightforward stories that cover a variety of scenarios and includes three types of questions: mental state questions (testing whether the model can infer a character’s awareness of key information), behavior prediction questions (assessing if the model can predict a character’s actions based on mental states), and judgment questions (evaluating if the model can determine the reasonableness of actions). The paper experiments with various inference-time interventions to improve the performance of LLMs and concludes that these interventions only temporarily enhance model performance, indicating that there is still significant potential for LLMs to improve their Theory of Mind capabilities.
Strengths
- SimpleToM is a straightforward and comprehensive dataset for evaluating models' Theory of Mind abilities.
- The asymmetric information ensures diversity in the dataset, allowing for a more thorough assessment of models' capabilities, aligning better with the diversity characteristic of humans.
- This paper evaluates the Theory of Mind capabilities of both closed-source and open-source LLMs.
- The dataset is publicly available, and the experiments are highly reproducible.
Weaknesses
- The results and analysis in Section 5 are quite interesting, but each paragraph contains too many bolded words, making it difficult to follow the main points. Clarifying this section would improve readability.
- In Table 2, Llama-3.1-8B stands out as an outlier: compared with Llama-3.1-405B, it performs more than five times better on the judgment evaluation. I think this is an interesting case, and more explanation of this outlier, such as the potential reasons behind this unusual performance, would be more informative than simply labeling it as an outlier.
Questions
I'm curious whether the human characters' personality or demographic characteristics in the mental state phase might influence the model's mental state evaluation. Some ablation experiments would be insightful.
Thank you for your review of our paper and for your positive feedback on our work!
> Bolded words
We will remove some bolding in Section 5. Thank you so much for the suggestion!
> Interesting Llama-3.1-8B outlier in Table 2
Llama-3.1-8B’s score of 54.6% on behavior judgment is basically near random chance (50%), but this judgment score is indeed much higher than that of many other models, where it falls below random, including its bigger counterpart Llama-3.1-405B. This reveals the following insights about the Llama-3.1-8B model:
(1) It has less bias to being consistently wrong in the judgment task than the other models.
(2) Analyzing its performance on behavior prediction (BP) and behavior judgment (JU) reveals that, despite this higher judgment score, the model is quite inconsistent in its behavior-to-judgment reasoning chain. For instance, in 33% of the cases, the model predicts the behavior wrongly but inconsistently gets the judgment right; in 17% of the cases, it predicts the behavior correctly but still gets the judgment wrong. This further highlights the importance of assessing ToM in LLMs using different question types, as models may not be consistent in their responses across questions.
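As a hedged illustration of how this breakdown can be computed (the field names and data format here are assumptions, not the actual evaluation code):

```python
# Sketch: tally joint correctness on behavior prediction (BP) and judgment (JU).
# Each result is assumed to look like {"bp_correct": bool, "ju_correct": bool}.
from collections import Counter

def consistency_breakdown(results: list[dict]) -> Counter:
    counts = Counter()
    for r in results:
        key = ("BP right" if r["bp_correct"] else "BP wrong",
               "JU right" if r["ju_correct"] else "JU wrong")
        counts[key] += 1
    return counts

# e.g., counts[("BP wrong", "JU right")] / len(results) gives the fraction of
# cases where the behavior is predicted wrongly but the judgment is still right.
```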
> Human personality
Injecting a persona into the prompt would be an interesting follow-up experiment to see the impact of social factors on ToM performance, thanks for the suggestion!
I thank the authors for their clarifications and their extensive rebuttal responses. However, I think it is important to include the human personality or demographic experiments as an ablation.
Thanks for the suggestion! As a follow-up to the suggestion, we experimented with different personas from PersonaHub (https://huggingface.co/datasets/proj-persona/PersonaHub/viewer/persona). Here are some results on these 5 different personas in the system prompt:
```python
PERSONAS = {
    "lawyer": "You are a partner at the law firm, recognized for their extensive knowledge of healthcare laws.",
    "worker": "You are a factory worker who doesn't trust the COVID-19 vaccine.",
    "atheist": "You are an atheist, philosophy lecturer who encourages open dialog about faith and belief systems.",
    "psychology_student": "You are a university psychology student who is currently studying creativity and personality.",
    "psychologist": "You are a clinical psychologist collaborating with the music therapist to provide holistic patient care.",
}
```
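For concreteness, here is a minimal sketch of how a persona could be placed in the system prompt when querying a model, using the OpenAI Python client purely for illustration; the actual evaluation harness, prompt wording, and answer parsing may differ, and names like `ask_with_persona` are hypothetical. The results with these personas follow below.

```python
# Illustrative sketch only: the actual harness, prompts, and answer
# parsing used in the experiments may differ.
from openai import OpenAI

client = OpenAI()

def ask_with_persona(persona: str | None, story: str, question: str) -> str:
    """Query a model with an optional persona placed in the system prompt."""
    messages = []
    if persona is not None:
        messages.append({"role": "system", "content": persona})
    messages.append({"role": "user", "content": f"{story}\n\n{question}"})
    response = client.chat.completions.create(
        model="gpt-4o-2024-05-13",
        messages=messages,
        temperature=0,
    )
    return response.choices[0].message.content

# e.g., ask_with_persona(PERSONAS["lawyer"], story, behavior_question)
```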
| model | persona | mental state (%) | behavior (%) | judgment (%) |
|---|---|---|---|---|
| gpt-4o-2024-05-13 | NONE | 95.64 | 49.52 | 15.26 |
| gpt-4o-2024-05-13 | lawyer | 95.54 | 49.7 | 17.18 |
| gpt-4o-2024-05-13 | worker | 93.13 | 55.79 | 15.87 |
| gpt-4o-2024-05-13 | atheist | 95.02 | 50.9 | 15.61 |
| gpt-4o-2024-05-13 | psychology_student | 94.42 | 47.55 | 15.52 |
| gpt-4o-2024-05-13 | psychologist | 95.19 | 53.91 | 16.65 |
| claude-3-5-sonnet-20240620 | NONE | 97.91 | 67.04 | 24.93 |
| claude-3-5-sonnet-20240620 | lawyer | 98.37 | 67.55 | 32.0 |
| claude-3-5-sonnet-20240620 | worker | 97.85 | 67.04 | 24.5 |
| claude-3-5-sonnet-20240620 | atheist | 97.94 | 65.84 | 23.37 |
| claude-3-5-sonnet-20240620 | psychology_student | 97.25 | 68.58 | 24.76 |
| claude-3-5-sonnet-20240620 | psychologist | 97.85 | 68.58 | 26.94 |
Across personas, we see that the gap between explicit ToM (mental state prediction) and applied ToM (behavior prediction and judgment) remains consistently prominent, similar to the case where no persona is specified. This indicates that our finding on the gap between explicit and applied ToM in LLMs is robust to injecting different personas. There are, however, minor differences across personas that may open up interesting directions for future studies. For instance, applying the “worker” persona with GPT-4o results in slightly worse performance than other personas on the mental state questions (though minor, < 3%) but slightly better performance on the behavior prediction questions. It is also consistent across GPT-4o and Claude 3.5 (and more prominent in the case of Claude 3.5) that the “lawyer” persona yields somewhat better performance on judgment questions (still well below random), potentially an effect of the model trying to mimic careful judgment when operating under that persona.
We'll also include these results and analysis in an Appendix with the final version of the paper. Thank you again for your suggestion!
Thanks for the additional experiments. I think this paper can be accepted by the conference; I will keep my positive score.
This paper examines whether LLMs can infer and, more importantly, apply the mental states of characters in stories. In particular, the authors create a dataset of stories where some key information is unknown to the character. On this foundation, the authors propose three tasks: mental state estimation, behavior prediction, and judgment of behavior. The experimental results show that LLMs without CoT can achieve high accuracy on the first task but fail on the others, which implies that LLMs fail to proactively make use of mental state information for reasoning tasks. However, CoT drastically improves model performance on the reasoning tasks.
Strengths
- The authors propose a new task with a new dataset. They use rigorous crowdsourcing to ensure the correctness of the labels, although there is some systematic bias (see Weaknesses).
- The experimental design is thorough, including a wide range of SOTA LLMs and exploring various prompting strategies.
Weaknesses
- It’s unclear how this paper’s findings differ fundamentally from the large body of existing work showing that LLMs can acquire various types of knowledge (e.g., factual [1, 2, 3], ontological [4], or arithmetic skills, as noted in the CoT paper) but often struggle to apply this knowledge consistently in downstream tasks. Is reasoning about mental states inherently different in how it’s processed by LLMs compared to other types of knowledge?
- It's also well known that CoT reasoning facilitates multi-step reasoning, and that task-specific instructions are generally helpful. Therefore, I'd expect the authors to highlight any unique challenges that ToM reasoning poses for LLMs, distinguishing it from other forms of reasoning.
- The curated dataset appears biased. Every story assumes that characters are unaware of key information, and thus the ground-truth answer to the judgement question is always "reasonable". An ideal dataset would include a balanced mix of cases, with characters both aware and unaware.
- While the author hints throughout the paper that low performance on SimpleToM with naive decoding may signal a risk for LLM deployment, this connection feels weak to me. I'd expect more concrete examples of applications where failing SimpleToM might lead to problematic outcomes.
[1] Allen-Zhu, Zeyuan, and Yuanzhi Li. "Physics of language models: Part 3.2, knowledge manipulation." arXiv preprint arXiv:2309.14402 (2023).
[2] Kazemnejad, Amirhossein, et al. "Measuring the Knowledge Acquisition-Utilization Gap in Pretrained Language Models." The 2023 Conference on Empirical Methods in Natural Language Processing.
[3] Berglund, Lukas, et al. "The Reversal Curse: LLMs trained on “A is B” fail to learn “B is A”." The Twelfth International Conference on Learning Representations.
[4] Wu, Weiqi, et al. "Do PLMs Know and Understand Ontological Knowledge?." Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2023.
Questions
- See Weakness 1 & 4
- Why do models generally achieve higher accuracy in behavior prediction than in judgment tasks? Both tasks should require the same level of reasoning.
Thank you for your review! Here are some responses to your helpful comments:
> Does ToM reasoning fundamentally differ from other types of knowledge?
ToM reasoning, being a form of social reasoning, is indeed inherently different from factual or logical reasoning because it requires making implicit commonsense inferences and making sense of social norms, context-dependent actions, and social dynamics, which are often not explicitly present in models’ training data and are not governed by fixed rules the way factual or logical reasoning is. E.g., it would be difficult to imagine someone writing something blatantly obvious like “In the supermarket, you can’t see through a Pringles can.” As the nature of the knowledge and reasoning is different, it is worthwhile to conduct experiments targeted at ToM capabilities rather than using the performance of LLMs on factual, ontological, or arithmetic reasoning to infer their ToM capabilities.
While our work also involves the challenge of applying knowledge, it has broader implications as well. ToM is fundamentally about deeply understanding others' behavior and enabling effective interaction. In this sense, many existing studies have overlooked the bigger picture of what ToM truly entails and aims to achieve. Our work seeks to go beyond the question of whether LLMs possess conceptual knowledge of ToM, to the application of ToM-related reasoning when appropriate, thus expanding and deepening the scope of the discussion.
> Unique challenges that ToM reasoning poses for LLMs beyond needing CoT reasoning?
To close the gap between explicit and applied ToM, we show that more needs to be done than just using CoT reasoning. Table 5 shows that with a generic CoT prompt like “think step by step”, behavior prediction accuracies are consistently more than 20% lower than mental state prediction, while judgment accuracies are below random chance. Even the task-specific CoT* (explicitly guiding models to think about the awareness of each person) gives behavior and judgment accuracies significantly below mental state prediction (e.g., judgment scores of 77.8%-86.7% vs. mental state prediction scores all above 95%). To further close the gap, we found that we also need to combine CoT* with appending the relevant mental state question, which is instance-specific (dependent on the story and the character of interest, not just the ToM task in general). Our work thus presents the surprising finding that very detailed interventions are needed to close the gap between explicit and applied ToM even in simple two-sentence stories.
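To make these interventions concrete, here is a hedged sketch of how the different prompt variants could be assembled for a SimpleToM item; the exact wording of the CoT, CoT*, and mental-state (MS) additions in the paper may differ, and the function and argument names are illustrative.

```python
# Hedged sketch of the intervention prompts described above; the exact
# wording used in the paper may differ, and the names here are illustrative.
def build_prompt(story: str, question: str,
                 intervention: str = "none",
                 mental_state_question: str | None = None) -> str:
    """Assemble a question prompt with an optional reasoning intervention."""
    parts = [story, question]
    if intervention == "cot":
        # Generic chain-of-thought nudge.
        parts.append("Think step by step before answering.")
    elif intervention == "cot_star":
        # Task-specific CoT*: steer the model to reason about each person's
        # awareness of the key information in the story.
        parts.append("Think step by step about what each person in the "
                     "story is and is not aware of before answering.")
    if mental_state_question is not None:
        # MS intervention: append the instance-specific mental state question.
        parts.append("Also consider: " + mental_state_question)
    return "\n\n".join(parts)
```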
More generally, we think it is important to highlight these ToM-specific challenges, as real-world usage of LLMs rarely involves active, customized “CoT prompting” by the user; rather, the (late-stage) training of LLMs now involves learning to automatically generate appropriate reasoning chains without special prompting. Existing instruction training sets generally use examples from traditional reasoning tasks like math, not examples geared towards this kind of ToM reasoning. This lack of training partially explains the poor performance of even the fully trained frontier models, and hopefully our work will inspire more ToM-focused training data in the LLM development pipeline.
> Why the focus on false belief setups?
To test LLMs’ ToM capabilities, i.e., the understanding that others’ mental states may differ from their own, we follow the classical Sally-Anne and Smarties tests and focus our analysis on a series of false-belief setups that distinguish between reasoning on the basis of the state of the world and reasoning on the basis of someone’s beliefs about the state of the world. We also have instances of stories that were annotated in our filtering annotation process as ones where Person X is “likely aware” of the key information, and we will also release this subset as a control setup.
> Importance of evaluating the ToM capabilities of LLMs?
See our overall general review response above for some concrete examples of applications where failing on applied ToM would be problematic.
> Do behavior prediction and judgment require the same level of reasoning?
This is a great question! While both tasks involve applied ToM, behavior judgment requires the LLM to first implicitly perform behavior prediction to infer what the likely action is, and then additionally compare the predicted behavior to the stated behavior to judge whether the stated behavior is reasonable. Therefore, as illustrated in the inference graphs in Figure 1, behavior judgment requires an additional step of reasoning. We will update the paper to further clarify this difference in the reasoning required.
Thank you for your response, but I still have most of my concerns.
ToM-specific challenges: The findings align with what the community already understands: zero-shot < CoT < task-specific instruction for CoT. That said, I’m not suggesting these findings lack value entirely, just that their overall impact might be somewhat limited.
Focus on false belief setups: I agree this setup is more interesting. However, the dataset should be unbiased for a faithful evaluation.
concrete examples of applications where failing on applied ToM would be problematic: I find it surprising that an AI judge would rely solely on a short story for its predictions. Wouldn’t it at least consider the testimonies from all parties involved? In that scenario, the mental state information would already be present in the text.
Thank you for your review of our work! We’re grateful for the positive feedback on the following aspects of SimpleToM:
- Well-designed, well-crafted, simple, and straightforward stories for evaluating both explicit and applied Theory of Mind capabilities in LLMs.
- Comprehensive dataset covering a variety of realistic scenarios and information asymmetry, allowing for a thorough assessment of models’ capabilities.
- Rigorous crowdsourcing to ensure the correctness of labels, resulting in a high-quality dataset.
- Comprehensive evaluation framework covering three levels of assessment - mental state, behavior, and judgment.
- Thorough experimental design, including a range of SOTA LLMs, both closed-source and open-source, and exploring various prompting strategies.
- Results that highlight an important gap in current LLM capabilities (understanding mental states in isolation vs. applying that understanding to reason about behavior and make appropriate judgments).
- Dataset is publicly available and experiments are highly reproducible.
A concern raised by some reviewers is the general motivation behind evaluating the ToM capabilities of LLMs. To further emphasize the importance of the capabilities tested in SimpleToM, we provide concrete examples of applications where failing on applied ToM would be problematic:
The case of a bad personal AI assistant - failing to implicitly reason over others’ mental states to predict behavior:
Matt is a professional athlete. A growth hormone got into the supply chain for Bob's Burgers, where Matt regularly had dinner. Despite this, the owners decided to continue selling their burgers to save costs. Imagine a personal AI assistant that has read complaints about the growth hormone contamination at Bob’s Burgers but fails to apply the understanding that others’ mental states may be different from its own (ToM). It could reason that Matt, like the AI assistant, is also aware of the contamination (awareness) and then incorrectly predict that Matt will “refuse the burger due to the growth hormone” (behavior). This could then lead to undesirable consequences, such as failing to warn Matt in time to stop him from consuming the contaminated burger. In the untimely case of an unannounced blood test, Matt could even show up with positive traces of this illegal growth hormone and be accused of doping. In this case, such an AI assistant, lacking a nuanced understanding of human awareness and motivations, might also falsely assume Matt's intentional wrongdoing. This highlights a critical limitation: without robust ToM, the AI fails to grasp that Matt’s actions could stem from unawareness rather than culpability, leading to flawed judgments that could unjustly tarnish his reputation or career.
The case of a bad AI judge - failing to make appropriate judgments of behavior:
Alice visited the supermarket to purchase some carrots to pack lunch for her husband Bob. After consuming the lunch Alice packed, Bob succumbed to a severe E. coli infection. It turned out the supermarket's carrots were contaminated with E. coli and were subsequently recalled. Imagine an AI judge tasked with evaluating this case to decide whether Alice should be held responsible and imprisoned for Bob's death. A bad AI judge, failing to apply the understanding that others’ mental states may be different from its own (ToM), could incorrectly assume that Alice was aware of the E. coli and judge that Alice’s act of packing the contaminated carrots for her husband was wrong (judgment). This could then lead to undesirable consequences, such as severely punishing the innocent Alice, who didn’t know that feeding Bob the carrots would kill him. In common law jurisdictions, whether a defendant is found guilty is often decided taking into account both mens rea (“guilty mind”) and actus reus (“guilty act”). Therefore, the ability to apply ToM is important for potential AI judges to appropriately assess whether an individual has a “guilty mind” when making key judgments, such as determining whether to convict someone.
We are excited about the possibility of future generations of models to improve on applied ToM in our dataset. This could pave the way for models to effectively interact with humans – for instance, serving as reliable personal AI assistants as well as trustworthy AI judges. We hope SimpleToM, as the first resource of its kind to measure LLMs’ capability on such diverse applied ToM scenarios, will help facilitate the community in pursuing exciting directions that bring us there.
We thank all the reviewers for their helpful comments and suggestions! We have now updated the PDF with the following changes:
- Further describe motivational applications of applied ToM (details in Appendix A - Importance of Applied ToM)
- More explicit description of the difference between behavior prediction and judgment (Section 2.3: "As the inference graph in Figure 1 illustrates, the judgment question goes beyond behavior prediction as it requires two levels of implicit reasoning, first implicitly predicting the behavior of Person X, which itself relies on implicitly understanding their mental state.")
- More in-depth result analysis, such as Llama-3.1-8B outlier performance on judgment questions and the o1-preview's reasoning tokens (details in Appendix K - Further analysis on interesting behavior)
- Results on performance across personas (details in Appendix M - Performance across personas)
- Fewer bolded words (Section 5)
We hope we have addressed the questions raised by the reviewers through our authors’ responses, as well as updates to the paper. And we’re happy to follow up if there are any remaining concerns.
SimpleToM is a new dataset designed to assess Theory of Mind (ToM) reasoning abilities in large language models (LLMs). The dataset features simple and concise stories covering a variety of scenarios where some key information is unknown to a character, creating a basis for ToM reasoning tasks. The dataset includes three distinct question types to test ToM reasoning: (i) Mental State Questions that test whether the model can infer a character’s awareness of key information, (ii) Behavior Prediction Questions that assess if the model can predict a character’s actions based on their mental state, and (iii) Judgment Questions that evaluate whether the model can determine the reasonableness or rationality of a character’s actions. Experimental findings show LLMs achieve relatively high accuracy in mental state prediction but struggle with behavior prediction and judgment tasks. It also shows that using CoT reasoning drastically improves model performance in behavior prediction and reasoning tasks. The authors experimented with interventions to improve performance but found that these improvements were only temporary, indicating room for further development in ToM reasoning.
Strengths identified:
- SimpleToM is a well-designed, straightforward, and comprehensive dataset for evaluating Theory of Mind (ToM) reasoning abilities in LLMs.
- The experiments are rigorous and reproducible, involving a wide range of state-of-the-art (SOTA) LLMs and exploring various prompting strategies.
- The three-level assessment framework (mental state prediction, behavior prediction, judgment) is innovative and provides a more comprehensive evaluation compared to previous work in ToM.
Weaknesses to be addressed:
- The motivation for evaluating Theory of Mind (ToM) capabilities in LLMs is not clearly articulated. Why is assessing ToM abilities in LLMs important for real-world applications or AI agent interactions? The connection between low performance on SimpleToM and potential risks in real-world LLM deployment is weak. The paper could benefit from more concrete examples of practical applications where failing ToM tasks might lead to problematic outcomes.
- The findings do not seem to differ fundamentally from previous research that demonstrates LLMs' struggles in applying acquired knowledge (e.g., factual, ontological, or arithmetic knowledge) in downstream tasks. More specifically, the paper does not highlight any unique challenges specific to ToM reasoning that distinguish it from other forms of multi-step reasoning or knowledge application.
- Every story assumes that characters are unaware of key information, making the ground-truth answer to the judgment question always "reasonable". A more balanced dataset, with scenarios where characters are both aware and unaware of key information, would be ideal.
- The paper does not sufficiently analyze why specific models or interventions (e.g., inference-time reasoning tokens) improve performance. Also, the paper does not include a human performance baseline for comparison, which is important given the claim that humans infer mental states, predict behavior, and make judgments effectively.
Additional Comments from Reviewer Discussion
In response to reviewer jF3J, the authors provided adequate clarifications on the distinctiveness of ToM reasoning and its novelty. However, regarding the dataset biases, the authors should provide quantitative details (e.g., the size of the control subset) and examples to better demonstrate dataset diversity. Also, although the argument is valid that current training pipelines focus on logical reasoning tasks (e.g., math) rather than ToM reasoning, which impacts LLM deployment in real-world settings, concrete examples of real-world scenarios where ToM failures might cause issues are lacking.
In response to reviewer J5Pr, the authors provide a plausible hypothesis but lack access to o1-preview reasoning tokens for deeper analysis. They could strengthen their argument with comparative examples of reasoning chains from other models to illustrate why longer reasoning improves performance. Also, Reviewer 3's comment is valid because it points out the need for concrete real-world evidence to support the claim that ToM reasoning is critical for LLMs. While the authors’ examples are illustrative, they remain hypothetical, and addressing this gap with specific, supported use cases would make their argument significantly stronger. Finally, a human performance benchmark on the dataset is critical to contextualize model results and validate the evaluation framework.
Reject