Language Agents Mirror Human Causal Reasoning Biases. How Can We Help Them Think Like Scientists?
Language model agents exhibit human-like reasoning biases, leading them to arrive at incorrect conclusions about causal relationships
Abstract
Reviews and Discussion
This paper presents a clear and compelling study investigating whether LLMs exhibit human-like biases in causal reasoning, specifically a disjunctive bias, using a text-based adaptation of the classic Blicket Test. It also proposes a method to encourage more scientific reasoning in LLMs through hypothesis sampling. The paper is well written, with clear motivations and well-designed experiments. Overall, I believe it offers valuable insights for the community.
Reasons to Accept
- This study is interesting and timely.
- It is well written, and the experiments are well presented.
- The authors provide strong analysis and propose potential methods for improvement.
- It offers valuable insights to the community.
Reasons to Reject
- I am concerned about the logic in lines 121–124. Given the task setup, it seems that the conjunctive condition is indeed harder than the disjunctive one. Identifying all required elements (as in the conjunctive rule) is naturally more challenging than finding any one contributing factor (as in the disjunctive rule). I do not see how the InfoGain Oracle establishes that the conjunctive case is not harder for LLMs. The Oracle represents an idealized agent with perfect reasoning abilities, which LLMs do not possess. Therefore, the observed performance gap could simply reflect task difficulty rather than a disjunctive bias. I am not convinced by the conclusion that LLMs inherently favor disjunctive reasoning.
- It appears that DeepSeek Reasoner, a reasoning model, performs differently from non-reasoning models. Does this indicate that reasoning models are thinking more scientifically? It would be great to see whether there are significant pattern differences between reasoning and non-reasoning models.
- I think this might be addressed by simple instruction tuning, and it's unclear whether such tuning would enable models to generalize to OOD causal scenarios (other than the Blicket test). It would be interesting to explore this.
Questions for the Authors
See above
We thank the reviewer for the insightful review. We are happy the reviewer finds this work interesting and timely, well written and well presented, with strong analyses, and that it offers valuable insights to the community.
We address the reviewer’s concerns below.
Do not see how the InfoGain Oracle establishes that the conjunctive case is not harder for LLMs… Observed performance gap could simply reflect task difficulty rather than a disjunctive bias.. not convinced that LLMs inherently favor disjunctive reasoning
We thank the reviewer for the astute question. While the InfoGain oracle reasons optimally, it spends the same amount of compute in both (conjunctive & disjunctive) settings. This shows that when perfect reasoning is possible, one setting is not computationally harder than the other. We refer to the (systematic) absence of perfect reasoning as a "cognitive bias".
We can also consider the possibility that due to bounded computational resources, one task may be harder. One way to control for potential inherent task difficulty is to normalize performance relative to a random baseline. We plot the “exploration progress” (between 0 and 1) and random-normalized exploration progress here: https://imgur.com/a/8LPHKLR (details in figure, higher = better):
We observe in Fig 10b that for the high-performing models (e.g. GPT-4o, deepseek-reasoner, deepseek-chat), a disjunctive bias is still present. For completeness, we also include a table below for each model’s progress at end-of-episode.
We will add the above results to the Appendix, and soften L121-124 to "this is likely not due to the conjunctive case being inherently computationally harder, see appendix [XX] for discussion". Please let us know if you have any other thoughts on this!
Do reasoning models think more scientifically than non-reasoning ones?
We do not wish to make strong claims about “reasoning” in token sequences and “thinking scientifically” [Shanahan 2022]. For instance, recent works have found that intermediate “reasoning” tokens need not be correct to arrive at the correct final answer [Stechly 2025].
It is certainly intriguing that DeepSeek R1 performs differently. It appears that DeepSeek-R1 does well making inferences on fixed data (both Figures 5 and 6 are inference-only tasks, rather than explore-then-inference tasks), but not well on exploration (Fig 4). We ran additional evaluations using the GPT-o-mini series of reasoning models on the 8-objects envs, and found that setting a high "reasoning effort" does not improve overall performance, and leads to worse exploration (https://imgur.com/a/tsL29hf). Overall, the role of intermediate "reasoning" tokens is, to our knowledge, not well established, and suggests exciting future work.
For completeness, we also provide a few qualitative examples of intermediate "reasoning" token traces below for Deepseek-R1 and GPT-4o. Both will be added to the appendix of the final paper.
Can this be addressed by simple instruction tuning
While this might be addressed by instruction tuning on this particular single task, instruction tuning cannot feasibly cover all possible causal reasoning scenarios. Whether instruction tuning on e.g. a subset of blicket tasks can enable OOD generalization in other causal settings is an exciting direction for future work.
We did not explore instruction tuning as our goal is to probe the inherent biases in LM responses. This is analogous to how cognitive scientists study biases in humans: they do not train humans to do better on their tasks, as doing so would obscure rather than reveal the underlying naturalistic behaviour patterns they aim to understand. For now, we believe the Blicket test serves as a representative probe of zero-shot causal reasoning abilities with tight experimental control and relevant human data.
[Shanahan 2022] Shanahan, M. "Talking About Large Language Models." arXiv preprint arXiv:2212.03551 (2022).
[Stechly 2025] Stechly, Kaya, et al. "Beyond Semantics: The Unreasonable Effectiveness of Reasonless Intermediate Tokens." arXiv preprint arXiv:2505.13775 (2025).
Absolute and random-normalized exploration progress
4 Objects Environment
| model | disjunctive progress ± std | conjunctive progress ± std | normalized disjunctive progress ± std | normalized conjunctive progress ± std |
|---|---|---|---|---|
| baseline random | 0.968 ± 0.049 | 0.964 ± 0.068 | 0 | 0 |
| infoGain oracle | 1.000 ± 0.000 | 1.000 ± 0.000 | 1.000 ± 0.000 | 1.000 ± 0.000 |
| gpt-4o | 1.000 ± 0.000 | 0.991 ± 0.033 | 1.000 ± 0.000 | 0.756 ± 0.900 |
| gpt-4o-mini | 0.982 ± 0.073 | 0.911 ± 0.113 | 0.439 ± 2.297 | −1.444 ± 3.109 |
| deepseek-chat | 0.995 ± 0.011 | 0.975 ± 0.046 | 0.857 ± 0.356 | 0.319 ± 1.255 |
| deepseek-reasoner | 0.934 ± 0.224 | 0.895 ± 0.236 | −1.085 ± 7.063 | −1.881 ± 6.509 |
| gemma3:12b | 0.970 ± 0.050 | 0.993 ± 0.020 | 0.048 ± 1.589 | 0.806 ± 0.541 |
| gemma3:27b | 0.968 ± 0.056 | 0.965 ± 0.029 | −0.016 ± 1.750 | 0.028 ± 0.794 |
| qwq | 0.978 ± 0.067 | 0.883 ± 0.145 | 0.302 ± 2.122 | −2.222 ± 4.000 |
8 Objects Environment
| model | disjunctive progress ± std | conjunctive progress ± std | normalized disjunctive progress ± std | normalized conjunctive progress ± std |
|---|---|---|---|---|
| baseline random | 0.944 ± 0.098 | 0.877 ± 0.183 | 0 | 0 |
| infoGain oracle | 1.000 ± 0.000 | 1.000 ± 0.000 | 1.000 ± 0.000 | 1.000 ± 0.000 |
| gpt-4o | 0.999 ± 0.005 | 0.924 ± 0.125 | 0.983 ± 0.085 | 0.383 ± 1.019 |
| gpt-4o-mini | 0.812 ± 0.357 | 0.798 ± 0.307 | −2.342 ± 6.346 | −0.647 ± 2.499 |
| deepseek-chat | 0.976 ± 0.092 | 0.935 ± 0.138 | 0.581 ± 1.634 | 0.472 ± 1.121 |
| deepseek-reasoner | 0.956 ± 0.174 | 0.875 ± 0.215 | 0.226 ± 3.091 | −0.014 ± 1.752 |
| gemma3:12b | 0.934 ± 0.150 | 0.885 ± 0.219 | −0.182 ± 2.664 | 0.066 ± 1.780 |
| gemma3:27b | 0.784 ± 0.281 | 0.799 ± 0.269 | −2.847 ± 4.998 | −0.639 ± 2.195 |
| qwq | 0.916 ± 0.217 | 0.855 ± 0.219 | −0.491 ± 3.866 | −0.182 ± 1.786 |
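For reference, the normalized columns above follow a normalization that maps the random baseline to 0 and the InfoGain oracle to 1. A minimal Python sketch of this mapping is below; the exact per-run normalization before averaging over seeds is our assumption (it is consistent with, e.g., the 0.756 entry for gpt-4o in the 4-object conjunctive setting).

```python
def normalized_progress(model_mean, random_mean, oracle_mean=1.0):
    # Maps the random baseline to 0 and the InfoGain oracle to 1.
    return (model_mean - random_mean) / (oracle_mean - random_mean)

# Example: 4-object conjunctive setting, gpt-4o vs. the random baseline above.
print(round(normalized_progress(0.991, 0.964), 2))  # -> 0.75 (table reports 0.756 from per-run normalization)
```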
Example Intermediate tokens during Q&A
We provide qualitative examples of intermediate tokens. We will add these as examples to the Appendix of the final paper. However, while intermediate tokens have been used as a proxy for "reasoning" in the literature, we do not wish to draw strong conclusions here. Recent evidence suggests their role in the LM's computation is elusive; for instance, they do not necessarily have to be correct for the final answer to be correct [Stechly 2025].
GPT-4o, react prompting (correct answer)
Q: Is object 1 a blicket? (true blicket indices: [1, 4])
From the observations, the machine turns on when object 1 and object 4 are on the machine together. When object 1 is on the machine alone, the light is off. When object 4 is on the machine alone, the light is off. However, when both objects 1 and 4 are on the machine, the light turns on. This suggests that the rule for the machine to turn on requires both objects 1 and 4 to be present, indicating that both are blickets.
Therefore, object 1 is a blicket.
> True
GPT-4o, react prompting (incorrect answer)
Q: Is object 5 a blicket? (true blicket indices: [2, 5])
From the observations, we can see that:
- When object 5 is on the machine alone, the light is off.
- When object 5 is on the machine with object 2, the light is on.
- When object 2 is on the machine alone, the light is on.
This suggests that object 2 is a blicket because it can turn the machine on by itself. Object 5 does not turn the machine on by itself, indicating it is not a blicket. The machine turns on when object 2 is present, regardless of whether object 5 is also present.
Therefore, object 5 is not a blicket.
> False
DeepSeek-R1 (correct answer)
Q: Is object 2 a blicket? (true blickets: [2, 3] conjunctive)
Okay, let's try to figure out if object 2 is a blicket. So, the machine turns on when certain objects (blickets) are placed on it. From the steps taken, let's recap what happened.
First, when objects 5 and 0 were placed on the machine, the light stayed off. Then adding 1, 2, and 3 caused the light to turn on. When objects 5 and 1 were removed, the light stayed on. But when object 3 was removed, the light turned off. Then removing 2 also left it off.
[omitted ~1k characters]
Then removing 3 turned it off. Then putting 3 back alone didn't help, but adding 2 again made it work. So yes, 2 and 3 together activate the machine. Therefore, object 2 is a blicket because when combined with 3, the machine turns on. So the answer should be True.
DeepSeek-R1 (incorrect answer)
Q: Is object 5 a blicket? (true blickets: [5, 7] conjunctive)
Okay, let's try to figure out if object 5 is a blicket. From the actions taken in the game, here's what happened:
Initially, all objects except object 0 were on the floor, and the machine was off. The user started by putting objects 1 through 7 one by one onto the machine. Each time, the machine's light stayed off until object 7 was added. When all objects (1-7) were on the machine, the light turned on. Then, removing object 7 caused the light to turn off, and putting it back on turned it on again.
This suggests that the machine's rule requires the presence of object 7 to activate. Since adding all other objects (1-6) didn't trigger the light until 7 was added, it's likely that object 7 is a blicket. However, the other objects (including 5) were on the machine both when the light was off (before 7 was added) and when it was on (after 7 was added). Since the presence of 5 alone (without 7) didn't activate the machine, and removing 7 deactivates it regardless of 5's presence, it seems that 5 isn't a blicket. The rule might be that the machine only needs at least one blicket, which is 7 here. Therefore, object 5 is not a blicket.
[Stechly 2025] Stechly, Kaya, et al. "Beyond Semantics: The Unreasonable Effectiveness of Reasonless Intermediate Tokens." arXiv preprint arXiv:2505.13775 (2025).
Reviewer w25T, could you please engage with or just acknowledge the authors' response? Thank you.
Thank you for your response. I will maintain my current positive score.
We'd like to thank the reviewer for their valuable comments. We will be sure to include the additional details and results in the final paper!
The authors adapt an experimental paradigm (the “Blicket test”) to investigate how large language models make causal inferences. In this text-based version, an “agent” can “explore” various actions with a machine, which follows a logical rule (disjunction or conjunction); the agent must then generalize the underlying rule. The authors find that all the LLMs tested struggle when the number of objects under investigation increases, and they also struggle more with conjunctive rules than disjunctive rules. In general, the models seem to prefer disjunctive answers (like human adults) and spend less time exploring disjunctive settings (unlike human children). The authors then demonstrate that they can artificially “flatten” the model’s distribution over hypothesis space in such a way that it behaves more optimally.
Reasons to Accept
The paper addresses a topic of high-level importance (causal reasoning and inference in LM “agents”), and adapts an interesting psycholinguistic paradigm (the “blicket test”) that has been extensively tested and used for human children and adults. The results themselves point to potentially promising directions for future work.
Reasons to Reject
The underlying premise is interesting, but I think there are a few potential weaknesses.
First, the paper does not include (as far as I can tell) a human baseline for the results. The authors do include an “oracle” model as a comparison, which is helpful, but it would also be useful to include an analysis of human data. Again, the paper makes reference to results obtained with humans, but as far as I know, we do not have results for this text-based task, which limits our ability to draw strong generalizations or comparison to human behavior.
A deeper question concerns the utility of this task in particular. It has proven useful in characterizing the causal inference behaviors of human children and adults. But what is its construct validity for assessing models?
Finally, I found some of the methodological details a little hard to understand, especially some of the follow-up analyses (e.g., section 5); I understand that space is limited but I think more detail here would be helpful.
Questions for the Authors
Is response length a good proxy for “amount of reasoning”?
Have the authors considered alternative prompting strategies to encourage more exploration? E.g., there is some work on how certain prompting environments encourage or discourage explore behavior (e.g., Coda-Forno et al., 2023).
We thank the reviewer for the interesting points. We are glad to hear the topic is important, our approach is interesting, and that the results point to promising future work.
We address the reviewer’s concerns below, and also discuss the more philosophical points raised.
the paper does not include (as far as I can tell) a human baseline for the results
While we currently do not have human data for our blicket game (sec 3), we do compare LMs directly with published human data in section 4. The findings in section 4 corroborate our finding in section 3 (of the models' "disjunctive bias").
We can try to run a small pilot study on human adults with the time we have in the remainder of the week, and add the results if it is feasible (it is very time consuming to properly conduct a human experiment). All in all, while we agree that comparison with humans is an exciting direction, we think the LM results in this work stand on their own merits, and pave the way for subsequent work that comprehensively studies humans vs. LMs on this task.
the utility of [the blicket] task in particular
The blicket detector task captures fundamental causal structures—disjunctive (OR) and conjunctive (AND) relations—that are ubiquitous in real-world reasoning and decision-making. These structures exist in many tasks across domains: e.g. in chemistry games like Alchemy, combining elements follows conjunctive rules; in tool use, some tools work independently (disjunctive) while others require multiple components (conjunctive). The list goes on.
The construct validity lies in the task's ability to probe models' understanding of these compositional causal primitives that structure so many real-world problems, in a tightly controlled experimental setting that is already well-established from human studies. While we do not wish to make claims about how the model might "think" [Shanahan 2022], poor performance on our task does mean LMs act in a way that is sub-optimal -- which has implications for how we should employ LMs in settings where autonomously gathering information and making inferences on it is important [e.g. Yamada et al 2025].
methodological details a little hard to understand
We apologize for the lack of details. We kindly point the reviewer to the Global Response, in which we have extensively detailed the infoGain baseline agent (A.1), and our hypothesis sampling agent (A.2). We will add the additional information into the Appendix of the final paper for better clarity.
Does response length == “amount of reasoning”?
We did not intend to equate response length with "amount of reasoning". We used quotes to refer to how this analogy is commonly used in the literature [Guo 2025, Zeng 2025]. While a longer response does literally mean more FLOPs spent, it is unclear whether it maps onto the elusive concept of "reasoning" [Stechly 2025]. We are happy to soften this sentence and provide this context in-text.
As a somewhat related point, the reviewer may find it interesting that we ran additional evaluations using the GPT-o-mini series of reasoning models on the 8-objects envs, and found that setting a high "reasoning effort" does not improve overall performance, and leads to worse exploration (https://imgur.com/a/tsL29hf). Overall, the role of intermediate "reasoning" tokens is, to our knowledge, not well established, and suggests exciting future work.
[Shanahan 2022] Shanahan, M. "Talking About Large Language Models." arXiv preprint arXiv:2212.03551 (2022).
[Yamada et al 2025] Yamada, Yutaro, et al. "The AI Scientist-v2: Workshop-level automated scientific discovery via agentic tree search." arXiv preprint arXiv:2504.08066 (2025).
[Guo 2025] Guo, Daya, et al. "Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning." arXiv preprint arXiv:2501.12948 (2025).
[Zeng 2025] Zeng, Weihao, et al. "Simplerl-zoo: Investigating and taming zero reinforcement learning for open base models in the wild." arXiv preprint arXiv:2503.18892 (2025).
[Stechly 2025] Stechly, Kaya, et al. "Beyond Semantics: The Unreasonable Effectiveness of Reasonless Intermediate Tokens." arXiv preprint arXiv:2505.13775 (2025).
Alternative prompting strategies to encourage more exploration
Many thanks for this suggestion. While our original prompts were designed to be close to the referenced psych test, we ran additional studies on the effect of system prompts and exploration:
- system_human_conj: the default system prompt used in the main text
- system_hyp_reduce: ask the agent to think more like the infoGain agent: to come up with possible hypotheses, then take actions that will eliminate the largest number of them
- system_rule_explain_detailed: give additional information -- explain that there can only be two rules and explain the two rules in detail
- system_rule_explain_hyp_reduce: explain the two rules in detail and ask the agent to think like the infoGain agent
- system_rule_explain_conj_pref: explain the two rules in detail, then explicitly ask the agent to prefer a conjunctive interpretation
We detail the changes each new prompt makes below.
The resulting exploration plot is here: https://imgur.com/a/WvbWaYv.
- Telling the agent to explore more like the infoGain agent does not help
- Explaining the rules in more details helps a bit
- Directly asking the agent to prefer conjunctive rules improves exploration in conjunctive setting, but decreases exploration slightly in disjunctive setting
- None of the prompts reach the same level as the infoGain oracle in the conjunctive case
We will add this additional result to the extended results section of the final paper.
Exploration Prompt Variants
We describe the changes with respect to the system_human_conj prompt (Appendix B.2.2) below for each of the additional experiments. We only added additional information.
system_hyp_reduce adds
- You should first think about a number of hypothesis for which objects might be blickets, and the rule for turning on the machine. Then, you should think about which action you should take which you expect to maximally reduce this hypothesis space.
system_rule_explain_detailed adds:
- There are two rules that govern how objects turn on the machine: "disjunctive" and "conjunctive". You don't know which rule the machine follows a priori.
- If the rule is "disjunctive", the machine turns on when any of the blickets are placed on top of it.
- If the rule is "conjunctive", the machine only turns on when all of the blickets are placed on top of it.
system_rule_explain_hyp_reduce adds
- There are two rules that govern how objects turn on the machine: "disjunctive" and "conjunctive". You don't know which rule the machine follows a priori.
- If the rule is "disjunctive", the machine turns on when any of the blickets are placed on top of it.
- If the rule is "conjunctive", the machine only turns on when all of the blickets are placed on top of it.
- You should first think about a number of hypotheses for which objects might be blickets, and the rule for turning on the machine. Then, you should think about which action you should take which you expect to maximally reduce this hypothesis space.
system_rule_explain_conj_pref adds
- There are two rules that govern how objects turn on the machine: "disjunctive" and "conjunctive". You don't know which rule the machine follows a priori.
- If the rule is "disjunctive", the machine turns on when any of the blickets are placed on top of it.
- If the rule is "conjunctive", the machine only turns on when all of the blickets are placed on top of it.
- You should always consider first the hypothesis that all the blickets have to be on the machine for it to turn on (i.e. "conjunctive" rule). In this case, the objects required for the machine to turn on are all considered "blickets". You should consider this both when taking actions to uncover which objects are blickets, and when asked about it.
Reviewer o9YV, could you please engage with or just acknowledge the authors' response? Thank you.
We can try to run a small pilot study on human adults with the time we have in the remainder of the week, and add the results if it is feasible (it is very time consuming to properly conduct a human experiment). All in all, while we agree that comparison with humans is an exciting direction, we think the LM results in this work stand on their own merits, and pave the way for subsequent work that comprehensively studies humans vs. LMs on this task.
I certainly understand that it'd be very difficult to conduct a human study in a week! Sorry if I didn't see this in your reply, but something I was wondering in my original review was:
Again, the paper makes reference to results obtained with humans, but as far as I know, we do not have results for this text-based task, which limits our ability to draw strong generalizations or comparison to human behavior.
Re: construct validity:
The construct validity lies in the task's ability to probe models' understanding of these compositional causal primitives that structure so many real-world problems, in a tightly controlled experimental setting that is already well-established from human studies.
In general I'm quite well-disposed to using tests developed to assess human cognition for LLMs. At the same time, I think it's prudent to be cautious about assuming the same test means the same thing when applied to humans and LLMs. At minimum, I think it'd make sense to devote a bit more space in the paper to working through this question: to what extent are we licensed in drawing the same inferences about causal reasoning in LLMs from this design as we are when we study humans? I am not an expert in this particular paradigm so I'm not making a strong claim here, but I'd like to see a little more engagement with the question—I don't think it's a given that any psychometric test exhibits equivalent construct validity for humans and LLMs.
The other aspect of construct validity that's relevant is something like predictive validity. In humans, many benchmarks/tests are validated by asking whether they predict other outcomes of interest. To the extent that the blicket test sheds useful light on LLM reasoning, I'm wondering whether it's diagnostic of something beyond the behavior on the test itself.
On the utility of the Blicket task to assess models, and on strong generalizations / comparison to human behavior.
We thank the reviewer for this very interesting question. We will answer this question in two parts.
On the utility of the Blicket task generally
Each Blicket task has a correct answer, and an information-theoretically optimal way to explore the problem [*]. This correct answer is an objective measure, independent of how humans respond, akin to e.g. the correct answer to a math question. Thus, we believe one utility of the Blicket task is simply to provide a test of the causal reasoning capabilities of LMs, akin to how we use e.g. the AIME math exam to measure the mathematical capability of language models.
Indeed, one should be careful here not to over-anthropomorphize LMs, as we mentioned in our previous response [Shanahan 2022]. Nevertheless, evaluating the performance of LMs is an extremely new field (the last 2-3 years), compared with the study of human behaviour. The methods from the latter at least have a community of researchers who believe them to be valid and useful for humans. Thus we think there is value in bringing "verifiable" tests from the cognitive sciences to the study of LMs, as many others have already [Ullman 2023, Webb et al 2023, Frank 2023].
Ultimately, any measurements (Blicket, AIME, etc.) are just approximations for some underlying capabilities which are fundamentally inaccessible. The hope is that the measurement correlates with the ultimate goal of what we want out of these systems (human or artificial), for instance, for them to be “generally intelligent”.
[*] One could ask for deeper questions, e.g. optimal given bounded computational resources / “bounded rational”, optimal given different priors, etc. We opted for a simple and general version in our work.
On comparison with human data
We agree with the reviewer that we should acknowledge our limitations in interpreting our results as equivalent to human results. The proper way to apply psychometric tests to AI models is to control for two things: (1) use a realistic environment or ecologically relevant training data, and (2) run the evaluation exactly the same way one would with humans, with calibrated measures. As the reviewer pointed out, we currently do not have exactly matched evaluations with humans in section 3 (although we do try to replicate the same procedure in section 4 with existing human experiments). Our intention was not to make a causal claim that the observation of adult-like bias in LLMs (on the blicket task) implies LLMs think like adults. We can soften our language if the reviewer feels this is not clearly communicated.
Indeed, the truly one-to-one way to compare with human results would be to have multimodal, fully embodied models that share exactly the same sensorimotor modalities as humans. This is even more true for infants and children, who learn about the world primarily through action-taking, embodiment and multimodality (including speech), unlike adults who can learn through text [Smith & Gasser 2005]. However, this is currently infeasible, as human-level embodied multimodal models are still an active area of research. The alternative is to try to convey similar informational content as the original task, as we did in section 4 for existing human data tasks, and ideally, in section 3 for humans playing the same text-based games. This is a limitation of our approach, and of comparative approaches generally.
We’d be happy to consider predictive validity as a potential future direction in the discussion. If the reviewer has some human or AI tasks in mind they want to predict we’d be happy to include them in the discussion.
All in all, we appreciate this very interesting question. We plan on including this discussion in the final paper to provide a more nuanced perspective for the readers. Please let us know if you have any additional thoughts. We hope we have adequately discussed the epistemological basis for our test and the motivation behind it. With the above, in addition to the new experimental results, we hope the reviewer will consider raising the score.
[Shanahan 2022] Shanahan, M. "Talking About Large Language Models." arXiv preprint arXiv:2212.03551 (2022).
[Ullman 2023] Ullman, Tomer. "Large language models fail on trivial alterations to theory-of-mind tasks." arXiv preprint arXiv:2302.08399 (2023).
[Webb et al 2023] Webb, Taylor, Keith J. Holyoak, and Hongjing Lu. "Emergent analogical reasoning in large language models." Nature Human Behaviour 7.9 (2023): 1526-1541.
[Frank 2023] Frank, Michael C. "Baby steps in evaluating the capacities of large language models." Nature Reviews Psychology 2.8 (2023): 451-452.
[Smith & Gasser 2005] Smith, Linda, and Michael Gasser. "The development of embodied cognition: Six lessons from babies." Artificial life 11.1-2 (2005): 13-29.
Dear reviewer o9YV, as it is the last day of discussion, we'd love to know if you have any additional questions, and whether our response has addressed your concerns. We plan on including the additional discussion and empirical results in the final version of this paper.
The paper addresses the question of causal reasoning bias in LLMs, using the Blicket test from studies in developmental psychology. In a study of various open- and closed-source models, the authors find that LLMs exhibit a disjunctive bias, similarly to human adults. They then go on to explore the factors contributing to success, with the finding that information gain is a strong predictor; they find that LLMs do not explore efficiently, often revising hypotheses that should have been eliminated, but they do spend less time exploring disjunctive hypotheses. Among the included LLMs, DeepSeek stands out as the least susceptible to the disjunctive bias.
Reasons to Accept
The paper is very well written.
The work explores an interesting and relevant question and does so with what seems like a solid experimental design.
Results are discussed (for the most part) in sufficient depth, with meaningful findings being drawn.
Reasons to Reject
The proposed test-time sampling method used to eliminate the causal reasoning bias is highly tailored to the Blicket test in particular and cannot be used in more general settings. As such, it is not really practical, and I am not clear even on its theoretical contribution, since it seems quite obvious that such a controlled intervention would reduce the bias.
There are minor issues that must be addressed:
- Details are missing on the implementation of the Oracle InfoGain agent.
- The LLMs used in the study are never properly introduced and their choice motivated. The authors point to Appendix B.1 but that section only contains details on the access to API-based LLMs.
- What is qwq which appears in Figure 2 but is seemingly never introduced?
- Some context is needed in Appendix B.2 to understand the motivation behind the different prompt variants. What does Human Conjunctive mean? Why does it make sense to include the Math Definition variant? What is the variance in accuracy between these different settings?
- More questions below.
It would be good to see a discussion of how the human bias may have been injected into the models, considering that this type of reasoning might rarely be codified in writing. What sort of data might the LLMs have picked this bias up from? Or could there be a third hidden variable that explains both the adult human bias and the LLM bias?
Questions for the Authors
Questions:
- It is not clear whether the results in Figure 2 are averaged over all system messages and prompting methods.
- It is not clear how many "game instances" models are evaluated on.
- Does the agent have any incentive to exit early? Is the number of steps needed to solve the task important in the real Blicket Test?
- What is the point in the random agent taking any action if it will randomly answer the question at the end?
- If I understand correctly, the Oracle InfoGain agent is not LLM-based. Can you explain how it was implemented?
- Does the model know how many objects are blickets or does it have to discover that too?
- What is meant by "environment rule" in the caption of Figure 3? Dis v Con or something else?
- In 3.4 how long of an exploration path is presented to the LLMs? How many steps are left to the solution?
- 211: what are these system prompts?
Suggestions:
- It might be better to visualise the conjunctive-disjunctive pair for each model side by side, to make it easier for the reader to see the consistently higher performance in the disjunctive case. This is more important than seeing how the different models compare on a per-setting basis, at least as reflected in the text.
- It is worth pointing out in 4.1 that deepseek is a clear outlier, correctly identifying the conjunctive rule, and also showing a conjunctive inclination in the baseline setting (together with llama, interestingly)
Typos:
- blicket should be capitalised in the abstract and IMO does not need quotation marks
- l74: perhaps Section should be replaced with Appendix here and later on page 7
- l161 an->a
- l162 performs->performing
- l200 Proportion -> proportion
We are deeply grateful for the very detailed review. We are encouraged to hear that the work is well written, explores interesting and relevant questions with a solid experimental design, and that the results are discussed in sufficient depth with meaningful findings.
We hope to address additional concerns below.
Test-time sampling method used to eliminate the causal reasoning bias is highly tailored to the Blicket test in particular, not clear even on its theoretical contribution,
Our point with the hypothesis sampling is simply to demonstrate that the principle is a useful one: rather than directly prompting the LM to act, we can make it act more rationally (here measured by its ability to eliminate hypotheses effectively and make causal inferences) by making it externalize its belief explicitly for us to control. While not explored, this can potentially be useful generally in settings where we wish to sample functions [Romera-Paredes et al 2024]; future work can also explore controlling its belief e.g. toward certain types of ideas about the world. All in all, we believe using test-time sampling to overcome the LM's prior is a general principle, and our result is just one demonstration of this (which in our case takes the form of a hypothesis distribution over binary python functions).
Clarity comments
We apologize for the clarity issues. We have added extensive details in the Global Response about the Oracle InfoGain agent (A.1) and our hypothesis sampling agent (A.2). We will also add this to the Appendix of the final paper.
Choice of LMs
We select a mix of models to span several key axes—reasoning capability, capacity, openness, and architectural family—to explore how each factor may relate to hypothesis elimination and exploration behaviour. The GPT family are some of the highest-performing, state-of-the-art LMs available. The 4o and 4o-mini variants also allow us to see the effect of model size with similar architectures. The DeepSeek family are state-of-the-art LMs in their own right, allowing us to compare a "reasoning" model vs. a non-reasoning model, along with full access to the reasoning content. We also study a few small-to-medium-sized models, including QwQ (a medium-sized reasoning model from Qwen), and the gemma3 series of models (<30B). We will include a more detailed description of each model in the Appendix of the final paper.
Motivation behind the different prompt variants in Appendix B.2
We constructed the human default and human conjunctive variants to be similar in information as what is presented to human participants in psych Blicket test experiments. In particular, human conjunctive makes use of the term "blicketness", which [Lucas et al 2014] used to encourage the conjunctive interpretation (as "blicketness" can be interpreted as an additive quantity). Finally, we designed the math definition variant to provide a rigorous definition of what a blicket is, minimizing any semantic relation to disjunctive or conjunctive interpretations.
Please refer to Figure 10 to see the variance between prompts: the multiple points with the same shape and colour differ in their system prompts.
Discussion of how the human bias may have been injected into the models
It is certainly an interesting area of study. However, we do not wish to make strong claims here, as the "human-like" cognitive bias discovered in this paper is correlational. We can briefly speculate on potential hypotheses for this similarity. One is the human data: for instance, it could be that the vast majority of text on the internet describes causes as having a singular or independent probability of bringing about an effect (e.g. a fever resulting from a virus), rather than multiple causes conjunctively bringing about an effect (e.g. a heart attack may only result if a person has high blood cholesterol AND a particular genetic susceptibility). An alternative hypothesis could be that the (English) language may have an implicit disjunctive structure. Properly studying these hypotheses requires intervening on e.g. the training data, which makes for interesting (albeit difficult) subsequent work.
Specific Questions
It is not clear whether the results in Figure 2 are averaged over all system messaged and prompting methods.
In Figure 2a, we select the best system message and prompt for each model, averaged over multiple random seeds to get error bars. In Figure 2b, we plot the mean performance for each prompt variant (shape of dot) and system message (different dots with the color & shape).
It is not clear how many "game instances" models are evaluated on.
We evaluate each model + prompt variant + system prompt combination on a minimum of 16 independent trials to compute the standard error of the mean (for the 95% confidence intervals).
Does the agent have any incentive to exit early? Is the number of steps needed to solve the task important in the real Blicket Test?
No. The agent is told it has the option to exit early if it thinks it has understood the relationship (section B.2), but there is no incentive to finish early. This design mirrors that of the child experiment in [Kosoy et al 2022], which allowed children to explore for as long as they wanted. We do set an upper cap of 32 actions -- well beyond the point at which the InfoGain agent finishes -- to limit total experiment time and cost.
What is the point in the random agent taking any action if it will randomly answer the question at the end?
The random agent is meant to provide a baseline in both the exploration (random actions) and Q&A (random True/False) phases. It is helpful to think of them as two separate random baselines (we will clarify in the final paper).
If I understand correctly, the Oracle InfoGain agent is not LLM-based. Can you explain how it was implemented?
Indeed it is not LLM-based. We have provided a detailed description of the InfoGain agent in Global Response A.1. We will add it to the Appendix section of the final paper.
Does the model know how many objects are blickets or does it have to discover that too?
The model does not know how many are blickets, it has to discover that. The same goes for the InfoGain baseline and the hypothesis sampling agent.
What is meant by "environment rule" in the caption of Figure 3? Dis v Con or something else?
Yes it refers to disj vs. conj.
In 3.4 how long of an exploration path is presented to the LLMs?
| data source | rule | traj length (± stdev) |
|---|---|---|
| llm (gpt-4o) | conjunctive | 18.94 ± 5.04 |
| llm (gpt-4o) | disjunctive | 18.06 ± 2.35 |
| oracle | conjunctive | 12.70 ± 4.79 |
| oracle | disjunctive | 14.58 ± 4.28 |
| object_count | conjunctive | 32.00 ± 0.00 |
| object_count | disjunctive | 32.00 ± 0.00 |
| random | conjunctive | 32.00 ± 0.00 |
| random | disjunctive | 32.00 ± 0.00 |
L211: what are these system prompts?
System prompt for fig 7 is the human conjunctive variant (Appendix B.2.2) which follows from [Lucas et al 2014].
Suggestions and typos
We are deeply grateful for the clarity suggestions and will incorporate them. We will fix all typos in the final paper.
[Romera-Paredes et al 2024] Romera-Paredes, Bernardino, et al. "Mathematical discoveries from program search with large language models." Nature 625.7995 (2024): 468-475.
[Lucas et al 2014] Lucas, Christopher G., et al. "When children are better (or at least more open-minded) learners than adults: Developmental differences in learning the forms of causal relationships." Cognition 131.2 (2014): 284-299.
[Kosoy et al., 2022] Kosoy, Eliza, et al. "Learning causal overhypotheses through exploration in children and computational models." Conference on Causal Learning and Reasoning. PMLR, 2022.
Reviewer badZ, could you please engage with or just acknowledge the authors' response? Thank you.
Thanks for the detailed response. All details and clarifications provided here should be included in the final paper as many of them are key to the interpretation and reimplementation of the work.
As the discussion period comes to an end -- we'd like to thank the reviewer for their very constructive questions and comments. We will be sure to include above details and clarifications in the final paper!
As language models are being used for autonomous decision making, it is relevant to understand the biases exhibited by them. In this paper, the authors use the blicket test to understand how LMs infer causal relationships. Their findings show that models exhibit disjunctive biases similar to humans. They also propose a test-time approach to de-bias the model's skewed prior over hypotheses, showing that this can help reduce the disjunctive bias.
Reasons to Accept
- This is a relevant line of work and the direction of comparing LM biases with human psychology is interesting.
- The mitigation approach of flattening the skewed prior seems like a promising direction in terms of results.
Reasons to Reject
- The paper lacks information and is very difficult to read. In addition, it includes typos and grammatical mistakes in multiple sections (lines 161 and 171 are a few examples).
- While the paper mentions disjunctive and conjunctive biases in humans, this needs to be clearly articulated with examples so that one can follow better.
- The evaluations performed by authors are sparse in detailing the evaluation datasets, their diversity or any qualitative analyses from their evaluations.
We thank the reviewer for the response. We are glad to hear the work is relevant, the human comparison is interesting, and the mitigation approach is a promising direction.
We address the reviewer’s concerns below.
The paper lacks information and is very difficult to read. In addition, it includes typos and grammatical mistakes in multiple sections (lines 161 and 171 are a few examples).
We apologize for the grammatical mistakes. We have fixed all the raised typos (including those from other reviewers). Please see our global response for additional details on the InfoGain agent baseline (A.1) and our hypothesis sampling agent (A.2). Please see below for examples about human biases, and more details about evaluations. We will add all new information to the final version of this paper. We hope the reviewer finds this helpful; please let us know if there is anything else unclear so we can properly address it.
While the paper mentions disjunctive and conjunctive biases in humans, this needs to be clearly articulated with examples so that one can follow better.
We provide this excerpt from [Lucas et al 2014], which was one of the first cog sci papers that used both conjunctive and disjunctive rules in blicket detectors:
“We contrast two abstract causal principles (overhypotheses) about the forms that relationships take in a causal system. One is that relationships have a disjunctive form, in which each cause has an independent probability of bringing about an effect. This form is pervasive in the literature on adult causal inference. For example, a burglar alarm may be triggered by an intruder or the wind, and a fever may result from a virus or a bacterium. The other overhypothesis is that causal relationships have a conjunctive form in which individual causes are unable to produce an effect, but multiple causes in conjunction can do so. For example, a microwave turns on when both the plug is connected and a button is pressed, but not if either of these causes occurs by itself; likewise, a heart attack may only result if a person has both high blood cholesterol and a particular genetic susceptibility. Knowing when a machine or a disease has a conjunctive form or a disjunctive form helps us make the right inferences when we want to use the machine or cure patients.”
We are happy to include this in Appendix B if the reviewer finds it helpful (or provide alternative examples).
The evaluations performed by authors are sparse in detailing the evaluation datasets, their diversity or any qualitative analyses from their evaluations.
We apologize for the clarity concerns. For reference, we evaluate all LMs in the "Blicket test" text environment detailed in section 2.1 & Figure 1. We provide a full trace from an "episode" in appendix B.4. For the experiments in sections 3.1-3.3 and section 5, we run the agent in the text game with 4 or 8 total objects, 2 of which are blickets (randomly assigned each episode). Each object starts on the floor or on the machine with independent probability 0.9 or 0.1, respectively. The agent is allowed up to 32 actions in the environment (exploration phase) before the game transitions to the Q&A phase, where we quiz the agent about the identity of the blickets. The two phases are illustrated in Figure 1.
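For concreteness, here is a minimal sketch of how one episode could be initialized under the description above; the function and variable names are our own illustration, not code from the paper.

```python
import numpy as np

def init_episode(n_objects=8, n_blickets=2, rule="conjunctive", p_on_machine=0.1, seed=None):
    """Sketch of one episode's initialization as described above (names are illustrative)."""
    rng = np.random.default_rng(seed)
    # Two objects are randomly designated as blickets each episode.
    blicket_ids = rng.choice(n_objects, size=n_blickets, replace=False)
    # Each object independently starts on the machine with probability 0.1, else on the floor.
    on_machine = rng.random(n_objects) < p_on_machine
    return blicket_ids, on_machine, rule
```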
We have included more details about the Oracle InfoGain agent and our Hypotheses Sampling agent in the Global Response (they will be added to appendix to complement section 2.3).
The human inference experiments corresponding to Fig 6 can be found in appendix B.5.
Qualitatively, we have added a few examples of LMs' intermediate tokens ("reasoning traces") below.
Please let us know if there is anything else unclear! We plan on adding these details as appropriate in the final version of this paper.
[Lucas et al 2014] Lucas, Christopher G., et al. "When children are better (or at least more open-minded) learners than adults: Developmental differences in learning the forms of causal relationships." Cognition 131.2 (2014): 284-299.
Example intermediate tokens during Q&A
We provide qualitative examples of intermediate tokens. We will add these as examples to the Appendix of the final paper. However, while intermediate tokens have been used as a proxy for "reasoning" in the literature, we do not wish to draw strong conclusions here. Recent evidence suggests their role in the LM's computation is elusive; for instance, they do not necessarily have to be correct for the final answer to be correct [Stechly 2025].
GPT-4o, react prompting (correct answer)
Q: Is object 1 a blicket? (true blicket indices: [1, 4])
From the observations, the machine turns on when object 1 and object 4 are on the machine together. When object 1 is on the machine alone, the light is off. When object 4 is on the machine alone, the light is off. However, when both objects 1 and 4 are on the machine, the light turns on. This suggests that the rule for the machine to turn on requires both objects 1 and 4 to be present, indicating that both are blickets.
Therefore, object 1 is a blicket.
> True
GPT-4o, react prompting (incorrect answer)
Q: Is object 5 a blicket? (true blicket indices: [2, 5])
From the observations, we can see that:
- When object 5 is on the machine alone, the light is off.
- When object 5 is on the machine with object 2, the light is on.
- When object 2 is on the machine alone, the light is on.
This suggests that object 2 is a blicket because it can turn the machine on by itself. Object 5 does not turn the machine on by itself, indicating it is not a blicket. The machine turns on when object 2 is present, regardless of whether object 5 is also present.
Therefore, object 5 is not a blicket.
> False
DeepSeek-R1 (correct answer)
Q: Is object 2 a blicket? (true blickets: [2, 3] conjunctive)
Okay, let's try to figure out if object 2 is a blicket. So, the machine turns on when certain objects (blickets) are placed on it. From the steps taken, let's recap what happened.
First, when objects 5 and 0 were placed on the machine, the light stayed off. Then adding 1, 2, and 3 caused the light to turn on. When objects 5 and 1 were removed, the light stayed on. But when object 3 was removed, the light turned off. Then removing 2 also left it off.
[omitted ~1k characters]
Then removing 3 turned it off. Then putting 3 back alone didn't help, but adding 2 again made it work. So yes, 2 and 3 together activate the machine. Therefore, object 2 is a blicket because when combined with 3, the machine turns on. So the answer should be True.
DeepSeek-R1 (incorrect answer)
Q: Is object 5 a blicket? (true blickets: [5, 7] conjunctive)
Okay, let's try to figure out if object 5 is a blicket. From the actions taken in the game, here's what happened:
Initially, all objects except object 0 were on the floor, and the machine was off. The user started by putting objects 1 through 7 one by one onto the machine. Each time, the machine's light stayed off until object 7 was added. When all objects (1-7) were on the machine, the light turned on. Then, removing object 7 caused the light to turn off, and putting it back on turned it on again.
This suggests that the machine's rule requires the presence of object 7 to activate. Since adding all other objects (1-6) didn't trigger the light until 7 was added, it's likely that object 7 is a blicket. However, the other objects (including 5) were on the machine both when the light was off (before 7 was added) and when it was on (after 7 was added). Since the presence of 5 alone (without 7) didn't activate the machine, and removing 7 deactivates it regardless of 5's presence, it seems that 5 isn't a blicket. The rule might be that the machine only needs at least one blicket, which is 7 here. Therefore, object 5 is not a blicket.
[Stechly 2025] Stechly, Kaya, et al. "Beyond Semantics: The Unreasonable Effectiveness of Reasonless Intermediate Tokens." arXiv preprint arXiv:2505.13775 (2025).
Reviewer Gqdc, could you please engage with or just acknowledge the authors' response? Thank you.
Thank you for your responses. On the explanation of disjunctive and conjunctive biases, I would recommend adding the details you mentioned above(with the examples) to the main section of the paper.
I have now updated my scores accordingly.
We'd like to thank the reviewer for their valuable discussions and comments. We will be sure to incorporate them into the final version of the paper!
We are grateful to the reviewers for their insightful reviews. We are encouraged to hear our work is relevant and important (Gqdc, badZ, o9YV, w25T), that it is well written and presented (badZ, w25T), has solid experimental design and deep analysis (badZ, w25T), points to promising directions for future work (Gqdc, o9YV), and is ultimately meaningful and valuable to the community (badZ, w25T).
We have addressed all reviewer concerns in our individual responses. Multiple reviewers requested a more detailed description of (i) the infoGain oracle baseline, and (ii) our hypothesis sampling agent in section 5. We apologize for this oversight; we have provided a detailed description below and will also include it in the appendix of the final paper.
A.1: The oracle InfoGain agent
For object states $x$ (a binary vector indicating which objects are on the machine), machine state $y$ (light off/on), and the set of all hypotheses $\mathcal{H}$, the InfoGain agent maximizes expected information gain at each step, via:
- Generate the set of candidate next observations $x'$ to reach
- For each candidate, compute the expected information gain $IG(x')$
- Go to the $x'$ that maximizes $IG(x')$
The set of candidate next observations consists of all states reachable by a single action. We represent each state as a binary vector. E.g. if the current state is [0, 0, 0], one candidate next state could be [0, 0, 1], another is [0, 1, 0], and so on. For $N$ objects, there are $N$ such candidates at each step.
Concretely, the space of hypotheses $\mathcal{H}$ consists of all possible assignments of each object as blicket / not a blicket, combined with the two rules (disj / conj), giving $2^N \times 2$ hypotheses in total for $N$ objects. Algorithmically we represent them as boolean masks (for which objects are blickets) and numpy operations. E.g. a conjunctive rule where the last two objects are blickets is:
```python
import numpy as np

def func(x):
    # Conjunctive rule: the machine turns on only if ALL masked (blicket) objects are on it.
    mask = np.array([0, 1, 1], dtype=bool)
    return np.all(x[mask])
```
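For comparison, the corresponding disjunctive hypothesis over the same two objects would be written analogously (our illustrative counterpart, reusing the numpy import above):

```python
def func_disjunctive(x):
    # Disjunctive rule: the machine turns on if ANY of the masked (blicket) objects is on it.
    mask = np.array([0, 1, 1], dtype=bool)
    return np.any(x[mask])
```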
For each candidate $x'$, we compute (this is equivalent to Eq. 2):
$$IG(x') = H(\mathcal{H}) - \mathbb{E}_{y' \sim p(y' \mid x')}\left[ H(\mathcal{H} \mid x', y') \right]$$
Below we show that all components of $IG(x')$ are computable quantities. First, $H(\mathcal{H})$ is the prior entropy:
$$H(\mathcal{H}) = -\sum_{h \in \mathcal{H}} p(h) \log p(h)$$
where $p(h)$ is uniformly distributed over all $h$'s consistent with all observations so far.
The second term is
$$\mathbb{E}_{y' \sim p(y' \mid x')}\left[ H(\mathcal{H} \mid x', y') \right] = \sum_{y'} p(y' \mid x')\, H(\mathcal{H} \mid x', y')$$
Note the summation over $y'$ amounts to iterating over all values of $y'$, which in our case is binary, $y' \in \{0, 1\}$ (machine off or on).
We can write the posterior needed for $H(\mathcal{H} \mid x', y')$ using Bayes' rule,
$$p(h \mid x', y') \propto p(y' \mid x', h)\, p(h)$$
where $p(h)$ is the prior as defined above, and $p(y' \mid x', h) = 1$ if $h(x') = y'$, and 0 otherwise.
Finally, $p(y' \mid x')$ is the marginal (with $y'$ conditionally independent of past observations given $h$ and $x'$):
$$p(y' \mid x') = \sum_{h \in \mathcal{H}} p(y' \mid x', h)\, p(h)$$
After computing $IG(x')$ for all candidates, we take the action corresponding to $\arg\max_{x'} IG(x')$. We do this for every action the oracle InfoGain agent takes.
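To make the above concrete, here is a minimal, self-contained Python sketch of one oracle step under a uniform prior over the hypotheses consistent with the history so far. All names (`all_hypotheses`, `predict`, `info_gain`, `best_single_action`) are our own illustration, not code taken from the paper; because the prior is uniform and the likelihood is deterministic (0/1), the entropy simplifies to log2 of the number of remaining hypotheses.

```python
import itertools
import numpy as np

def all_hypotheses(n_objects):
    """Every (blicket mask, rule) pair; rule is 'conj' or 'disj'."""
    masks = itertools.product([False, True], repeat=n_objects)
    return [(np.array(m), rule) for m in masks for rule in ("conj", "disj")]

def predict(hyp, x):
    """Machine state predicted by hypothesis hyp given object states x (binary vector)."""
    mask, rule = hyp
    if not mask.any():
        return False  # no blickets under this hypothesis: machine stays off (our assumption)
    return bool(np.all(x[mask])) if rule == "conj" else bool(np.any(x[mask]))

def uniform_entropy(k):
    # Entropy (in bits) of a uniform distribution over k hypotheses.
    return float(np.log2(k)) if k > 0 else 0.0

def info_gain(x_next, consistent_hyps):
    """Expected information gain IG(x') of observing the machine at candidate state x_next."""
    h_prior = uniform_entropy(len(consistent_hyps))
    gain = 0.0
    for y in (False, True):
        agree = [h for h in consistent_hyps if predict(h, x_next) == y]
        p_y = len(agree) / len(consistent_hyps)  # marginal p(y' | x')
        if p_y > 0:
            gain += p_y * (h_prior - uniform_entropy(len(agree)))
    return gain

def best_single_action(x_current, consistent_hyps):
    """Pick the single on/off move maximizing IG(x'), as the oracle does at each step."""
    candidates = []
    for i in range(len(x_current)):
        x_next = x_current.copy()
        x_next[i] = not x_next[i]  # toggle object i between floor and machine
        candidates.append(x_next)
    return max(candidates, key=lambda x: info_gain(x, consistent_hyps))
```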
A.2: The hypothesis sampling agent
The sampling agent approximates the InfoGain agent's goal of reducing the entropy of the hypothesis distribution. Instead of iterating over candidate next observations and computing $IG(x')$, we prompt the LM to take actions that it expects to eliminate active hypotheses (i.e. reduce $H(\mathcal{H})$). The procedure is as follows:
1. Hypothesis sampling to form the set of active hypotheses
If there are zero active hypotheses, we prompt the agent to generate hypotheses as python functions (for brevity, the full generation prompt will be added to the appendix):
You have seen the following observations so far:
[HISTORICAL OBSERVATIONS]
Come up with some hypotheses about how the world works. Write each one as a python function... Produce #NUM_HYPOTHESES# hypothesis, do not generate the same hypothesis. Be diverse in both the blicket masks and the potential functions.
Do not generate hypothesis you have already eliminated:
[ELIMINATED HYPOTHESES]
Do not generate hypothesis you have already generated:
[ACTIVE HYPOTHESES]
Directly output the answer, ensure only one answer is included. Return a set of hypotheses as python functions.
We save all valid (python-executable) hypotheses as Python code strings in the list of currently active hypotheses. We repeat this process until the desired number of active hypotheses has been sampled.
2. Eliminate hypotheses based on observation history
For each python function $h$ (i.e. hypothesis) in the currently active list, we check whether it perfectly predicts all past observations, i.e. $h(x_t) = y_t$ for all timesteps $t$. If it does not, we move it to the "eliminated hypotheses" list.
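A minimal sketch of this elimination check is below; the function name and the (code string, callable) representation of sampled hypotheses are our own illustration, not the paper's implementation.

```python
import numpy as np

def eliminate_inconsistent(active_hyps, history):
    """Keep only hypotheses that predict every past observation; move the rest to 'eliminated'.

    active_hyps: list of (code_string, callable) pairs sampled from the LM, where the callable
                 maps a binary object-state vector to a predicted machine state.
    history:     list of (object_states, machine_state) observations, i.e. (x_t, y_t).
    """
    still_active, eliminated = [], []
    for code, func in active_hyps:
        consistent = all(bool(func(np.array(x))) == bool(y) for x, y in history)
        (still_active if consistent else eliminated).append((code, func))
    return still_active, eliminated
```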
3. Prompt agent to take actions that eliminate more hypotheses
We prompt the agent to take actions conditioned on the observations and the list of active hypotheses:
You are currently entertaining the following list of hypothesis:
[ACTIVE HYPOTHESES]
You have seen the following observations so far:
[OBSERVATIONS SO FAR]
Given the observations so far, and the list of hypotheses (hypothesis space), take an action which will disprove the existing hypothesis.
Directly output the answer, ensure only one answer is included. Output the action in the format '> action'. Ensure only one action is included.
The agent then takes the next action, and we repeat the above procedure for exploration.
Q&A Phase
During the Q&A phase, we prompt the agent to answer questions conditioning on the remaining set of hypotheses:
You have seen the following observations so far:
[HISTORICAL OBSERVATIONS]
You have disproven the following hypothesis:
[ELIMINATED HYPOTHESES]
You have not yet disproven the following hypothesis:
[ACTIVE HYPOTHESES]
Based on the information above, answer the following question: [QUESTION]
Output the answer in the format '> True/False'. Ensure only one answer is included.
In this way, the sampling agent explicitly represents its beliefs as a set of active hypotheses, which the LM conditions on to take action.
This is an interesting and timely paper evaluating LMs in a "blicket test" paradigm adapted from developmental psychology. The authors find that LMs tend to struggle with inferring conjunctive causal relationships, and tend to pattern like the inference behaviors made by adults (cf. 4- or 5-year-old children). The authors then propose a way to mitigate these biases through a test-time sampling approach.
The experiments are relevant both from scientific and practical perspectives, and grounded in a literature with rich developmental and adult behavioral data. I think the findings would be of interest to the broad LM research community, so I recommend acceptance.
In light of the discussion with reviewers, for the camera-ready I would recommend (1) adding more detail about the implementation of the baselines, (2) motivating the choice of LLMs evaluated, and (3) carefully fixing typos and grammatical errors, among the other suggestions raised by the reviewers. I am also sympathetic to reviewer o9YV's point about elaborating on "to what extent are we licensed in drawing the same inferences about causal reasoning in LLMs from this design as we are when we study humans", especially for intricate tasks such as the blicket test.