Do Large Language Models Have a Planning Theory of Mind? Evidence from MindGames: a Multi-Step Persuasion Task
Humans significantly outperform LLMs at our complex theory of mind task
Abstract
Reviews and Discussion
This paper introduces a novel task called MINDGAMES to test models' capacity for Planning Theory of Mind (PToM). The authors claim that this places the model in a more active, action-taking role rather than a passive one, in addition to checking its causal understanding. They frame the task as an interaction between two agents (persuader and target) with different objectives. The target has to choose between several proposals, combining its prior information with information disclosed by the persuader. The persuader has to persuade the target to pick the proposal of the persuader's choice by carefully disclosing pieces of information. They test the model's ToM capability in two setups:
- REVEALED Mental States - the persuader knows the target's value function and the information the target has access to; the problem simplifies to predicting the target's behavior under various disclosures.
- HIDDEN Mental States - the persuader has to infer the target's belief state through dialogue, requiring multiple steps of planning.
They implemented the target as a naively rational agent, while the persuader was either an LLM or a human (the experimental subject).
They ran experiments with humans to gauge LLMs' performance on this task (using Chain-of-Thought prompting) and found that in the HIDDEN condition, humans outperformed LLMs by a larger margin than in the REVEALED condition (in the o1-preview-based experiments). They hypothesized that the ToM capabilities of humans and LLMs differ in nature and that LLMs' capabilities might not extend to planning in the HIDDEN condition.
Reasons to Accept
- This paper introduces a novel task which helps reveal LLMs' strengths and weaknesses from a planning theory of mind perspective.
- The task is complex enough to warrant reasoning and planning, especially counterfactual planning, so the task setup justifies the capability-testing aspect for LLMs.
- This work addresses some of the shortcomings of classic ToM tasks by placing the LLM in a more active, 'actionable' role than that of a mere spectator.
- The analysis of how different mental-state conditions affect the outcome is interesting and can open new directions for planning models; the evaluation is robust, with well-defined baselines and experimental setups.
- This work reveals LLMs' lack of information-seeking ability in the HIDDEN condition, which is needed to understand the target's mental state.
- The survey of related work seemed thorough for the space the authors tackle.
Reasons to Reject
- Diversity in the proposal domains could help reveal whether the weak performance of both humans and LLMs under the different mental-state conditions is domain-specific or domain-agnostic.
- The task is complex and intricate; it would be good to establish this reasoning on less complex tasks as well.
- The target agent is naive, which is not reflective of how humans would act in persuasion situations.
Questions for Authors
How do we know these findings transfer to other ToM setups? How does introducing a non-naive target agent affect the outcome?
We appreciate that you recognize the importance of our novel ToM task: how, in contrast to prior work, it puts participants in an active role and reveals the weaknesses of LLMs in complex planning (the Hidden condition). We are eager to see future work that builds on our insight and develops better mentalizing components (such as asking questions about mental states) in subsequent models to achieve greater success at our task. Please let us know if you have any more questions.
Thanks for your response and for answering my questions. Thanks for sharing the details about the payoff matrices and the experiments on different cover stories; it would be helpful to include these in the appendix, along with the details and design choices regarding the constraints used for tight experimental control.
Given the newly provided information, although the design of the experiment is novel, I still have a concern about whether this approach is transferable within the PToM space, given the very specific / targeted design of the proposals themselves and the naive target agent.
I've already given a high score for acceptance and I'd like to maintain my score.
Thanks for your feedback and receptiveness. We'll definitely include the further details about the payoff matrices and experiments in the appendix. We acknowledge the limitations of our current study and look forward to opportunities to design future experiments that address them!
2.
This task is a complex and intricate one, would be good to establish this reasoning on less complex tasks.
We would also love to see future work build on ours to establish easier-to-understand tests for planning theory of mind. We expand on this point in our reply to reviewer 7dfj (answer 1). Our task is the simplest one we could design given the constraints we needed to satisfy.
For example, there need to be at least three choices (proposals) because the target must initially have a different preference than the persuader, and there must be a confounding proposal to prevent the persuader from convincing the target with simple heuristics (such as revealing all of the information). Additionally, the target must not know all of the available information (something must be hidden from them). There must also be at least two attributes for each proposal and value function. We chose three attributes because this allows for more possible payoff matrices. (There are only 56 possible sets of information and value functions with two attributes, compared to tens of thousands with three.)
We further elaborate on these constraints in reply to CxkX above (answer 4).
3.
The target agent is naive, not reflective of how humans would act in persuading situations.
Expanding on our answer 2 and to the point we make to CxkX, there may be tasks which loosen the construct of PToM which would be interesting to study. For example, as you mention, the naively rational target has canned responses, which does not reflect real human persuasive dialogue in general. This is intentional by design as it carefully tracks the underlying measure. (And even though it gives canned responses, human participants perform much better than LLMs in the Hidden condition.) We are eager to run future studies in which we place real humans in the role of the target, trading off a reduction in construct validity for an increase in ecological validity. (In these cases, we would no longer be sure that successful persuasion would involve a mental state inference.) We have designed our experiment carefully so that it can flexibly accommodate such extensions.
1.
Diversity in the proposal domains can help in revealing if the lack of performance
We agree that it is crucial to test our PToM task with different cover stories (proposal domains, in your words). Indeed, we did test five different cover stories (as shown in Tab. 2), and found the same overall trends in all of them. Additionally, we tested 40 different payoff matrices (information sets and value functions) for each cover story. We discuss this in reply to reviewer CxkX above.
This paper introduces MINDGAMES, a task framework designed to evaluate planning theory of mind (PToM) capabilities in large language models compared to humans. The authors try to distinguish between spectator Theory of Mind (ToM) (the ability to predict and interpret behavior) and planning Theory of Mind (PToM) - the ability to intervene on others' mental states to bring about desired outcomes. This research claims that the proposed framework used for evaluating complex ToM abilities goes beyond simple classification. The experimental design includes HIDDEN and REVEALED conditions that vary the complexity of mental state inference. The study tests several LLMs (o1-preview, GPT-4o, deepseek-r1, and llama models) against human performance on a persuasion task.
Reasons to Accept
There are some strengths of this research:
- This paper proposes a task framework and provides a controlled environment for evaluating planning ToM capabilities of LLMs.
- The experiment tests several LLMs (o1-preview, GPT-4o, deepseek-r1, and llama models) with 200 LLM trials against human performance on a persuasion task.
- The paper presents two variants of the task: the HIDDEN and REVEALED conditions.
Reasons to Reject
There are some weaknesses in this paper:
- The critical one is how the proposed persuasion task in Figure 1 relates to mental states (i.e., beliefs and desires). What beliefs and desires does the paper define for evaluating the LLMs?
- The proposed framework and task cannot truly reflect the planning ToM ability of LLMs, as the stories and scenarios in the proposed task are limited (i.e., 202 HIDDEN and 173 199 REVEALED).
- The contribution of this paper is incremental. There are some prior works [1-3] that study ToM and persuasion which are not referred to or discussed in this paper. The paper should discuss the differences between the tasks in those prior works and the proposed task, as they are clearly related to this paper.
- The experimental results are not reliable or convincing, as they are based on only 200 trials with a limited set of stories.
[1] PersuasiveToM: A Benchmark for Evaluating Machine Theory of Mind in Persuasive Dialogues
[2] NegotiationToM: A Benchmark for Stress-testing Machine Theory of Mind on Negotiation Surrounding
[3] Improving Dialog Systems for Negotiation with Personality Modeling
Questions for Authors
For other questions, please refer to the Reasons to Reject section. The most important question I would like to ask the authors is how the persuasion task in Figure 1 explicitly relates to mental states (i.e., beliefs and desires).
Since desire is a motivational mental state, and the paper mentions "desire" several times in the abstract and introduction, how is the desire mental state defined in this task? Moreover, please elaborate and explain more about the belief dimension of the task.
Thanks for carving out the time to review our paper!
We greatly appreciate the time you spent on this review and your clear subject matter expertise as demonstrated by your awareness of relevant literature. We welcome the opportunity to discuss how to make the best ToM tests possible with you and encourage you to consider the contribution of our work.
4.
Thanks for bringing up this point! Would you be able to clarify if it is any different than your question two, above?
Thank you for your response to the comments. I appreciate the effort you put into addressing the issues raised. In fact, I am very interested in this work but hesitate to consider it appropriate for presentation at the conference due to its lack of definition and treatment of core theory of mind (ToM) concepts (i.e., beliefs and desires) and the absence of highly relevant prior works in a ToM paper.
Since the authors replied that they will add these missing concepts and the highly relevant prior works to the paper, I assume this issue will be addressed, and I will increase my score. In my view, the data-amount issue was still not addressed during the rebuttal period. If the authors want to increase the influence of this paper and hope that follow-up work will use the benchmark, the amount of data in the benchmark is a critical issue for conducting reliable and robust experiments.
Thanks for the opportunity to improve our work through your comments. We look forward to collectively contributing to the theory of mind literature. We appreciate the desire to make our work as impactful as possible as a benchmark.
1.
What are the beliefs and desires defined by this paper for evaluating the LLMs?
It's a good question how beliefs and motivations relate to our paper!
- Beliefs: what the target knows about each proposal's attributes. In Figure 1, these are shown in the "What the other player knows" section.
- Desires: the target's preferences over attributes. In Figure 1, the target "feels indifferent to safety and control, dislikes development speed, and likes public trust."
In the example in Figure 1, the target initially prefers Proposal C because they only know it "will increase public trust" (which they like). The persuader must first infer the target's mental states by asking "What do you know?" (asking about beliefs) and "How do you feel about the attributes?" (asking about desires).
When the persuader learns that the target dislikes development speed, the persuader strategically reveals two pieces of information: "Proposal A will decrease development speed" (appealing to the target's dislike) and "Proposal C will increase development speed" (making Proposal C less attractive). This intervention changes the target's beliefs which, combined with their existing desires, causes them to switch from preferring Proposal C to preferring Proposal A.
This demonstrates planning theory of mind, as success on the task requires: 1. inferring mental states (beliefs, desires), and 2. strategically intervening on beliefs to move the target from Proposal C to Proposal A.
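As a small worked illustration (the ±1 utility magnitudes below are assumed for exposition, not taken from the paper):
- Before disclosure: $V_T(C) = +1$ (public trust increases, which the target likes) vs. $V_T(A) = 0$ (nothing relevant known), so the target prefers Proposal C.
- After disclosure: $V_T(A) = +1$ (development speed decreases, and the target dislikes development speed) while $V_T(C) = +1 - 1 = 0$ (trust increases but development speed also increases), so the target now prefers Proposal A.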
2.
The proposed framework and task cannot truly reflect the planning ToM ability of LLMs, as the stories and scenarios in the proposed task are limited
It is important to make sure that our results generalize well! We appreciate your concern that our framework is too limited to reflect planning ToM, but we respectfully disagree.
In our response to reviewer CxkX (answer 1), we expand on why we think that our task meaningfully measures ToM. We've summarized those points here: 1) By tightly controlling the construct we're measuring, our task draws on readily accepted approaches in cognitive science, which the CoLM CFP asks for. This is also done in computer science, as with the ARC prize. 2) We show results on five different cover stories with 40 different payoff matrices each allowing us to measure generalization in closely related PToM tasks. 3) We think our task provides a fruitful base for subsequent work to build off of in the direction of more ecologically valid tasks.
3.
There are some prior works [1-3] that study ToM and persuasion, which are not referred to and discussed in this paper.
We appreciate you bringing those valuable prior works to our attention! We will reference these contributions in the discussion section of our revised paper.
Improving dialog (Yang et al.)
This study integrates theory of mind concepts into active dialogue agents (i.e. not only passive observation benchmarks). The study focuses on improving negotiation strategies by enabling an agent to model and infer an opponent's personality type from dialogue history. The inferred personality informs a 'first-order ToM' which predicts the opponent's likely next response. The agent can then adapt its strategy via a one-step lookahead. Performance is assessed via negotiation outcomes e.g., agreement rates on the CRAIGSLISTBARGAIN dataset. We think this study provides important groundwork for investigating more sophisticated ToM capabilities.
While Yang et al. focus on inferring relatively static opponent personality types to anticipate their reactions, our work investigates planning ToM. Planning ToM requires agents to reason about another agent's more dynamic beliefs and desires to actively plan interventions (e.g., strategic information disclosure) in order to alter the other agent’s mental states and subsequently influence their behaviour.
Yang et al.’s study demonstrates how ToM can help an agent to understand ‘who the opponent is' and predict 'what they will do next.' Our study builds on this by asking whether an agent can use ToM to reason about 'what another agent knows and wants' to plan 'how I can change their mind.' We will ensure our manuscript references this study in the Related Work section and makes clear the distinct contributions of our PToM evaluation.
NegotiationToM (Chen et al.) and PersuasiveToM (Yu et al.)
Both studies extend existing ToM evaluation to negotiation and persuasion contexts, respectively, and both test a comprehensive set of LLMs. NegotiationToM tests whether LLMs can track the mental states (with explicit differentiation between beliefs, desires, intentions, and high/medium/low preferences) of negotiation participants using the CaSiNo human-to-human conversational dataset. PersuasiveToM evaluates whether LLMs can track mental states in persuasive dialogues by observing the DailyPersuasion dataset. Both papers focus on the accuracy of mental-state inferences made by LLMs through passive observation of these conversational datasets.
Notably, both studies acknowledge the constraints of the passive observer role in their limitations, and call for precisely the type of interactive study our current work presents.
Further information on why our study is a novel contribution relative to these two papers:
- Observation vs. participation: NegotiationToM and PersuasiveToM use pre-existing dialogue datasets. In both studies, LLMs answer comprehension questions about mental states after observing the dialogues. In contrast, our study requires LLMs to participate in real-time dialogue with other agents and make strategic decisions about what information to gather and disclose.
- Tracking vs. planning: Our key finding builds on these papers. While NegotiationToM and PersuasiveToM show that LLMs can successfully track mental states in observed dialogues, our study shows that LLMs struggle to plan and intervene on those states through interaction. This suggests that understanding mental states and strategically acting on them are different abilities.
- Methodological differences: We note that human performance in both NegotiationToM and PersuasiveToM consists of three graduate computer science students' responses with majority voting. In contrast, our paper includes a human baseline study (n=124), which allows for a broader comparison with human performance and improves validity.
- Experimental control: As noted earlier in our response to CxkX, we use a naively rational target (a hard-coded bot) and a game-theoretic structure to reduce confounding variables and specifically isolate the planning aspect of ToM.
Again, we will ensure our manuscript references these studies in the Related Work section and makes clear the distinct contributions of our PToM evaluation.
This paper introduces a multi-step persuasion task to evaluate whether language models possess a "Planning Theory of Mind". Unlike prior ToM benchmarks, this task emphasizes interactive and causal reasoning. The authors find that humans outperform the strongest LMs in the complex version of the task, while LLMs excel in a simpler "spectatorial" condition. The results show evidence of a gap between human-like social reasoning and current LM capabilities in this space.
Reasons to Accept
- The paper aims to shift the conversation from passive ToM benchmarks to more meaningful, interactive, planning-based ones.
- The experimental setup is well motivated and rigorous. The "hidden" vs. "revealed" design teases apart PToM from simpler predictive strategies. The paper compares multiple models, includes a human baseline, uses pre-registered hypotheses, and applies mixed-effects modeling.
Reasons to Reject
- The "hidden" condition involves many moving parts and is complex. The failure cases could partly reflect limitations in working memory or instruction-following, not just ToM.
- Limited ablation on prompt engineering: It's unclear how sensitive model performance is to minor prompt or format changes. Could different prompting yield better hidden scores?
Questions for Authors
I am surprised at the overall low human performance, even in the hidden case. Can you provide some more context? Why is the task so hard for even humans to follow?
We appreciate the time that you have taken to carefully read and review our paper. As you say, we differentiate ourselves from prior work by focusing more on an interactive, planning-based task in a motivated and rigorous way. We greatly appreciate the feedback you have given us and the chance to discuss it with you.
1.
The "hidden" condition involves many moving parts and is complex. The failure cases could partly reflect limitations in working memory or instruction-following, not just ToM.
Our initial prediction was that human performance on Hidden would be worse than on Revealed, because Hidden requires an initial step to infer mental states of the target (asking questions of the type “what do you know” and “what do you like”) before planning the intervention. However, our results show there isn’t a significant difference in human performance in the two conditions. We note that once the target has disclosed information in response to the questions referred to above, the Hidden condition is fundamentally no different from the Revealed condition; there are the same number of moving parts (working memory demands). The similarity in human performance in the two conditions suggests that, while the Hidden condition introduces an additional step of information gathering, human participants are generally proficient at navigating this step. On the other hand, this additional step presented a significant challenge for LLMs in our study, which suggests a key difference in PToM capabilities between humans vs. LLMs (our key finding). Performance differences for humans might change if there are fewer working memory demands.
2.
Could different prompting yield better hidden scores?
We were also very interested to see whether different prompting could yield better scores in the "Hidden" condition, which is why we ran the "Task Variants" described in Section 4.2. What is so curious is that even when we explicitly tell models to ask about the beliefs and motivations of the target in our "add-hint" prompt, models fail to perform any better (Fig. 4 shows the effects of the various task variants; Fig. 19 shows the "add-hint" prompt, which we've quoted below). Of course, we could have tried even more prompts in the "Task Variants" section than we did. Nonetheless, the central finding is that people perform better at the task in the Hidden condition when given the exact same instructions; our results reflect models' capabilities absent any case-specific (and hence rigid) prompt tuning.
"Add-hint" prompt: "On each turn, you might disclose certain information about the state of the game to the other player. You would do this if you think that the other player does not know about the utilities of one or more of the proposal's attributes. The other player will repeat any disclosures you make back to you. For example, you might want to tell the other player that "proposal A increases x and proposal C decreases y." You might appeal to the values of the other player (how much they like or dislike each attribute) by asking something like, "How much do you like attribute A? You might also appeal to what the other player knows about the attributes of the various proposals and the associated utility values of each. For example, "What is all that you know about proposal A? Alternatively, you might ask questions like "What is your preferred proposal?"
3.
why is the task too hard for even humans to follow?
You're right to point out that average human performance on the task is relatively low, at about 30%. As you said, even the Revealed version of the task is hard and does pose memory challenges. Nonetheless, individual human performance varies considerably. As you can see in the scatter plot points on the "Human" bars in Fig. 2, some humans always persuade the target in both the Hidden and Revealed conditions. As the violin plots in Fig. 8 further elucidate, we see a bimodal distribution in human performance, with some participants never succeeding in persuading the target (despite having five chances to do so) and others persuading the target quite often (80% of the time or more). Because of the complexity of the task, our intuition is that human participants simply have a hard time fully reading and understanding the instructions. Further still, one of the problems with existing ToM benchmarks is that they are too easy. We wanted to find a task with significant individual differences. Marked individual differences in ToM ability are in line with what's been found in the broader ToM literature for human adults and children alike (Livingston et al., 2023; Cavallini et al., 2013; Hughes et al., 2015; Bernstein et al., 2015). In this way, we think the individual differences in human performance at our task are actually a feature.
Thank you to the reviewers for their response and for answering my questions.
I would recommend that the authors include a detailed discussion of why this task was hard for the human participants in the final draft.
I was already leaning accept and will maintain my score.
Thanks for your feedback and the opportunity to improve our paper through your comments. We will be sure to include a discussion of why this task is hard for humans in the final draft.
This paper introduces MINDGAMES, a novel task framework for evaluating "planning theory of mind" (PToM) in both humans and large language models (LLMs). Unlike previous Theory of Mind (ToM) evaluations that focus on passive, static aspects, MINDGAMES tests the ability to influence another agent's mental states to achieve desired outcomes, specifically through persuasion. The task involves a persuader (human or LLM) and a rational target agent choosing between three proposals. Each agent has an independent value function, and the persuader must selectively disclose information to make the target choose the persuader's preferred proposal. The authors test two conditions: REVEALED and HIDDEN. The work shows that humans significantly outperform the o1-preview model in the HIDDEN condition, while o1-preview dramatically outperforms humans in the REVEALED condition (along with several other hypotheses, some confirmed and some not).
Reasons to Accept
I really appreciate the idea of borrowing concepts from cognitive science to create a novel task design, which offers a more interactive and complex ToM evaluation than previous work. The paper has a good presentation, and all the investigated questions/hypotheses are clear and easy to understand. Furthermore, the paper includes substantial human studies and tests multiple LLM variants, which is an important contribution toward distinguishing the interventional/planning aspects of ToM in LLM agents versus humans.
Reasons to Reject
My major concern is that the task setup is overly simplified: the stylized linguistic phenomena (as shown in the example), the naively rational target agent (acknowledged in the limitations), and the almost pseudo-three-proposal structure, while allowing for controlled experiments, significantly limit the generalization of the findings and ultimately render the conclusions about PToM abilities in LLMs/humans less convincing.
The paper predicted humans would perform better in REVEALED than HIDDEN (H2), but this was not confirmed, which could also be due to the superficial setup of the task. The post-hoc explanation that humans may use similar strategies across conditions could use stronger evidence.
Furthermore, it's unclear how the dataset was constructed. The paper only states, "Participants saw up to five different scenarios. We used a constraint solver to generate 10,000 value functions and information sets available to each player." I highly encourage the authors to include a more thorough paragraph in the main text explaining this.
Lastly, there are works focusing on interactive ToM abilities, and it would be good for the authors to discuss such works: e.g., SOTOPIA (Zhou et al., 2024) and How FaR Are Large Language Models From Agents with Theory-of-Mind? (Zhou et al., 2023). Additionally, there is a risk that the persuasive abilities of LLM-powered agents could cause further harm to human society (see AI-LieDar (Su et al., 2025)), and authors should discuss that in their paper.
Questions for Authors
- How were the 10,000 value functions and information sets generated? What constraints were applied to ensure scenarios were meaningful and solvable?
- Since human performance did not differ significantly between HIDDEN and REVEALED conditions (H2 not confirmed), have you considered other explanations beyond "humans use similar strategies across conditions"?
Thank you for the careful attention to our paper and for your thoughtful review. We appreciate that you recognize how our novel task differentiates itself from prior, noninteractive work, that we present our work well, make our research questions easy to understand, and offer substantial human studies and compare LLM variants.
4.
We're happy to add more information about how we generated value functions and information sets. We did this such that the naively rational target could always be convinced to move to the persuader's preferred proposal. We constructed the scenarios (cover stories) such that they would be meaningful for any combination of information sets and value functions. Below is some more information on that process, which we will add to the paper.
We use a constraint solver to enumerate payoff matrices (outcomes for proposals), value functions, and information hidden from the target, such that the naively rational target would initially choose one proposal, z; given all the information, an optimal target would choose another proposal, x; and given some of the revealed information, an optimal target would choose the persuader's preferred proposal, y.
In greater detail:
Let the attributes be A, with individual members a. Let the proposals be P, with individual members p. Let the coefficients of the target's value function be $v_T(a)$ for each attribute a. Let H be a function which maps a proposal-attribute pair to whether its value is hidden from the target (True/False). Let R be a function which maps a proposal-attribute pair to whether it is revealed to the target under optimal play by the persuader (True/False).
These three conditions must be satisfied:
- Given all info, the target chooses x.
- Given only the info that isn't hidden (the start state), the target chooses z.
- Given the revealed info (what a persuader should say), the target chooses y.
The value function for the target is the sum of the coefficients of the target's value function for each attribute times the utility of each attribute of each proposal.
$V_T(p) = \sum_{a \in A} v_T(a) \, U(p_a)$
And with the hidden information ($1 - H(p_a)$ evaluates to 0 if hidden, 1 otherwise):
$V_T^H(p) = \sum_{a \in A} v_T(a) \, U(p_a) \, (1 - H(p_a))$
And with the revealed information ($1 - H(p_a) + R(p_a)$ evaluates to 0 if hidden and not revealed, 1 if not hidden or revealed, since only hidden information is ever revealed):
$V_T^R(p) = \sum_{a \in A} v_T(a) \, U(p_a) \, (1 - H(p_a) + R(p_a))$
Also, these conditions must be satisfied: the target's best proposal under $V_T^H$ must be z, under $V_T$ must be x, and under $V_T^R$ must be y (the persuader's preferred proposal).
If you're still curious about the technical details of how we generated the value functions and information sets, you can see the code we used here: https://anonymous.4open.science/r/mindgames-A03A/src/mindgames/make_games.py
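To make this generation process concrete, below is a minimal brute-force sketch (our illustration only, not the constraint solver in make_games.py; the attribute/proposal names, the {-1, 0, +1} utility scale, and the example payoff matrix and value function are hypothetical):

```python
from itertools import product

ATTRS = ["a1", "a2", "a3"]          # hypothetical attribute names
PROPS = ["A", "B", "C"]             # three proposals
KEYS = [(p, a) for p in PROPS for a in ATTRS]

def value(proposal, utilities, coeffs, visible):
    """Naively rational target's value: sum of v_T(a) * U(p_a) over attributes it can see."""
    return sum(coeffs[a] * utilities[(proposal, a)]
               for a in ATTRS if visible[(proposal, a)])

def choice(utilities, coeffs, visible):
    return max(PROPS, key=lambda p: value(p, utilities, coeffs, visible))

def valid_game(utilities, coeffs, hidden, revealed):
    all_info   = {k: True for k in KEYS}                          # V_T: everything visible
    start      = {k: not hidden[k] for k in KEYS}                 # V_T^H: hidden info withheld
    after_talk = {k: not hidden[k] or revealed[k] for k in KEYS}  # V_T^R: plus disclosures
    x = choice(utilities, coeffs, all_info)    # choice given all information
    z = choice(utilities, coeffs, start)       # initial choice
    y = choice(utilities, coeffs, after_talk)  # choice after the persuader's disclosures
    # Keep the game only if the three choices all differ; y stands in for the persuader's
    # preferred proposal (the persuader's own value function is omitted for brevity).
    return len({x, y, z}) == 3

# One hypothetical payoff matrix U(p, a) and target value function v_T(a).
utilities = dict(zip(KEYS, [1, 1, -1,    # proposal A's effect on a1, a2, a3
                            1, 0, 0,     # proposal B
                            0, 1, -1]))  # proposal C
coeffs = {"a1": 1, "a2": 1, "a3": -1}    # target likes a1 and a2, dislikes a3

# Brute-force the hidden/revealed splits (the real solver also enumerates payoff
# matrices and value functions, which we fix here to keep the search small).
masks = list(product([False, True], repeat=len(KEYS)))
valid = [(h, r) for h, r in product(masks, repeat=2)
         if all(hm or not rm for hm, rm in zip(h, r))   # only hidden info can be revealed
         and valid_game(utilities, coeffs, dict(zip(KEYS, h)), dict(zip(KEYS, r)))]
print(f"{len(valid)} valid hidden/revealed configurations for this payoff matrix")
```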
3.
The post-hoc explanation that humans may use similar strategies across conditions could use stronger evidence.
You ask for a deeper explanation of the lack of difference in human performance in the Hidden and Revealed conditions. In the discussion section, we hypothesize that people use similar strategies in both conditions. We are happy to weaken the assertion here; we are not claiming to explain all of the possible differences in their behavior.
We did measure the degree to which human participants appealed to all of the mental states (the informational and motivational states) of the target in both conditions. When we include messages that implicitly appeal to both kinds of mental states, the so-called "inferential" appeals which are of the form "How would you rank all of the proposals?", we see that participants in both conditions appeal to all mental states at similar rates (Fig. 3). In contrast, when we exclude these "inferential" appeals we see that participants in the Revealed condition appeal to all mental states of the target in fewer games overall (Fig. 12)---they ask questions of the form "what do you know" and "how do you feel about the attributes" less often. This suggests that participants are indeed behaving differently in the two conditions even though their gross performance (ability to persuade the target) is roughly comparable.
Furthermore, the individual differences we see in task performance may be masking some possible average differences in the two conditions. As Fig. 8 shows, some participants perform at ceiling (persuading the target in all five trials) while other participants never succeed. It is possible that if we were able to reduce the working memory demands in both conditions of our task, we could see higher average performance and thereby allow for the emergence of more subtle effects of experimental condition. In contrast, reasoning models don't have any working memory demands (as evidenced by the near ceiling performance of o1-preview in Revealed) and so we can readily see the effects of Hidden vs. Revealed with them.
Notably, we did not use a within subjects design here and so we are not able to say anything about any changes specific participants may have had in Hidden versus Revealed conditions. (Participants only ever played games in one of the two conditions.)
2.
there are works focusing on interactive ToM abilities [...]
Thank you for referring us to those papers, which we overlooked. We will make the potential harms of persuasive AI agents more explicit in the discussion section, citing the Su et al. paper.
In SOTOPIA, Zhou et al. introduce a dynamic social framework to test models' ability to simulate aspects of social interactions, some of which may draw on theory of mind. They use GPT-4 as an evaluator of the LLMs' social believability (among other measures). Relying on another LLM as an evaluator could introduce confounding variables, making it unclear to what extent an agent's performance on the task relates to the underlying construct of theory of mind (e.g., the agent could simply be optimizing for GPT-4's preferences as evaluator). In contrast, our work offers a more direct measure of planning ToM.
By modifying an existing ToM benchmark, Zhou et al.'s "How FaR" paper provides important prior work on how LLMs struggle to make implicit inferences about people's mental states when planning future actions. Importantly, though, this is not an interactive measure while ours is.
1.
the task setup is overly simplified with the linguistic phenomenon
You're right to note our stated limitation that the task setup of MindGames is controlled and might not generalize to other settings.
Notably, we ran five different scenarios (cover stories), providing models with forty different payoff matrices (sets of value functions and information sets) for each of those scenarios. Across the five cover stories (involving different attributes) we see the same trends; o1-preview, for example, continues to perform poorly in the Hidden condition across all scenarios. In this way, we hope to have robustly sampled the space of closely related PToM tasks. We will include a figure on this in the appendix.
Your interesting question reminds us how sometimes it can be useful to strip away the detail of real-world situations. We think that controlled tasks like ours are useful; we specifically designed MindGames to overcome some of the limitations of previous ToM work, which can be passed with simpler heuristics (i.e., which may not actually be testing ToM at all, as we discuss in the related work). Tight experimental control is a common technique in the cognitive sciences for being precise about what is actually measured; it is often regarded as a requirement for inferring a cognitive capacity, even if this sacrifices some ecological validity (generalizability to a variety of real-world settings, as you ask for). This is the case in, e.g., the false belief task (Wimmer and Perner 1983) or the chimpanzee chess task (Kaminski et al. 2008). Once a tightly controlled task has managed to identify the capacity, adapted tasks can try to expand generalizability. (The CoLM CFP encouraged work from the cognitive sciences.) Indeed, computer science datasets at times also sacrifice ecological validity in order to be sure of the construct they're measuring. For example, the abstract color-matching and related tasks in the ARC benchmark are quite unrelated to real-life examples of "intelligence", but we nonetheless think that they test for what is desired.
More generally, while our task could be rendered more robust with more varied studies, we meet the bar of many influential ToM tasks quite well. In the ToM literature for humans, although a classic ToM task such as Happé's strange stories task has only 24 stories in total, those stories all aim to assess different components of ToM (such as recursive theory of mind, affective theory of mind, and cognitive theory of mind) by assessing understanding of false beliefs, sarcasm, double bluffs, faux pas, etc. Likewise, the theory-of-mind developmental scale studies test a number of different components of ToM with a diversity of tasks (Wellman & Liu 2004, Peterson et al. 2011). Our task is highly focused in comparison, with a lot of variation within the cover stories themselves, all of which aim to assess planning with ToM.
Furthermore, we think that our task provides a flexible and easily adaptable base for future experiments to build off in the generalizable directions you're interested in. For example, we are eager to run future experiments in which we replace the naively rational target with a real human with their own values.
"A common technique in the cognitive sciences is to be very precise about what is actually measured; tight experimental control is often regarded as a requirement for inferring a cognitive capacity."
While I agree with the authors that such abstraction is reasonable in cognitive science, directly applying these methods to the study of LLMs is more problematic.
LLMs are not human subjects, and we should be cautious about treating them as such or drawing human-like conclusions from similar experimental setups (e.g., https://www.gsb.stanford.edu/faculty-research/working-papers/theory-mind-may-have-spontaneously-emerged-large-language-models).
I would look forward to more evidence to show the validity of the findings, and thus, I will keep my score for now.
We are grateful to the reviewer for raising this point and the opportunity to clarify our position. We are in complete agreement that the inferences from behavior to latent capacities which are common in cognitive and psychological science cannot be straightforwardly applied to artifacts like LLMs [1-3 (Ivanova; McCoy; Hu)].
We believe these inferences are valid in human subjects only insofar as certain assumptions are met, crucially construct validity (the latent construct measured by the behavioural task is cohesive and well-specified) and predictive validity (the behavioral task predicts behavior on relevant downstream tasks). There have of course been criticisms of human ToM work with respect to both of these criteria [4-5 (Gernsbacher and Yergeau, 2019; Hayward and Homer, 2017)], however, research aiming to characterize latent constructs in LLMs has both been less careful and should arguably be held to a higher standard (given the marked differences in processing mechanisms between humans and machines).
We believe that our contribution serves as an improvement on both of these criteria over existing research on ToM in LLMs [6-8 (Kosinski, Trott, Strachan)]. We believe the construct we are targeting is both more narrowly defined and more accurately isolated because we explicitly control for the extent to which predictive capacities could provide a more superficial route to strong task performance, in line with recent theoretical work in planning theory of mind [9 (Ho)]. Moreover, we believe the predictive validity of the task is stronger in that our task is closer to the contexts in which ToM is used in the real world (dynamic influence of other agents, rather than static responses to multiple-choice questions).
However, we did not mean to imply that our task solves these problems or should invite the same inferences for LLMs as it does for humans. There is a rich tradition of work measuring the extent to which behavioral tests for humans measure latent constructs and predict downstream behaviors, which is simply lacking for LLMs. We view our task as a step on a scale toward more ecologically valid tasks for LLMs, which would include measuring their behavior when persuading humans in controlled contexts like ours.
However, we believe that our contribution could still be valuable to the community for a) making incremental progress on measuring ToM in LLMs, b) attempting to empirically resolve theoretical issues which have been raised elsewhere [3, 9], and c) providing a task which can serve as the basis for future work in more ecologically valid contexts (like interactions with human interlocutors).
We would be happy to clarify this point in a revision.
Morton Ann Gernsbacher and Melanie Yergeau. 2019. Empirical failures of the claim that autistic people lack a theory of mind. Archives of Scientific Psychology, 7(1):102–118. https://doi.org/10.1037/arc0000067
Elizabeth O. Hayward and Bruce D. Homer. 2017. Reliability and validity of advanced theory-of-mind measures in middle childhood and adolescence. British Journal of Developmental Psychology, 35(3):454–462. https://doi.org/10.1111/bjdp.12186
Thanks for the clarification! I'll raise my score in light of the explanation, and I intend to be open-minded about this type of work, though my original concerns still remain. And I encourage the authors to clearly and accurately point out the implications of their experiments in the paper.
Thanks! We will be happy to more clearly and accurately point out the implications of our experiments in the paper.
This paper proposes a benchmark to evaluate planning Theory of Mind in LLMs. While the task setup is synthetic, I find the overall framing valuable. Even canonical ToM tests like Sally-Anne are highly abstracted, and this work usefully extends ToM evaluation into the interventional/planning space. The paper is not a definitive account of ToM in LLMs, but it advances the conversation in a rigorous and thought-provoking way.
Strengths:
- Introduces a novel ToM task with some interesting new aspects.
- Solid experimental design with preregistration, multiple LLM baselines, and a reasonably sized human subject study.
- Core empirical finding is robust: humans outperform LLMs in the harder HIDDEN condition, while o1-preview excels in REVEALED.
- Authors engaged thoughtfully with reviewer concerns, including prior work, construct definitions, and methodological clarity.
Weaknesses:
- The task’s simplicity and abstraction raise concerns about ecological validity and generalization.
- No clear human upper bound in REVEALED; performance across conditions is unexpectedly flat.
- It remains somewhat unclear whether LLM failures are due to ToM limitations or more basic instruction-following or memory issues—as always the competence/performance issue looms large.
- The benchmark scale (200 trials) may limit reuse and long-term impact.
Despite these limitations, I believe the paper makes a genuine contribution to the study of ToM in LLMs and is above the bar for acceptance.