Mind Your Step (by Step): Chain-of-Thought can Reduce Performance on Tasks where Thinking Makes Humans Worse
We compare human verbal thinking with model chain-of-thought.
Abstract
Reviews and Discussion
This paper explores settings where CoT reduces performance, approached from the perspective of cognitive psychology. It focuses on six tasks where extensive thinking impairs human performance. For three of these tasks, the authors find that current models experience a performance drop when allowed more reasoning. In the other three tasks, models exhibit both positive and negative effects. This work demonstrates how insights from the psychology literature can inspire the evaluation of LLMs and inference-time reasoning.
Questions for Authors
- Is there a correlation between performance drop and model size? The current selection of models appears somewhat arbitrary, making it difficult to observe a clear scaling relationship. Have you conducted any analysis to examine this trend?
Claims and Evidence
The claims made in the paper are well supported by both psychological analysis and evaluation results.
Methods and Evaluation Criteria
The evaluation is the main contribution of this work and is well-performed. The paper also provides a detailed description of the evaluation data curation.
Theoretical Claims
No theoretical claims or proofs were discussed in this work.
Experimental Design and Analysis
The experimental design heuristics are derived from the psychology literature, making them a reasonable choice for studying LLM behavior. The experiments are conducted on a variety of models but lack a consistent model-size scaling comparison.
Supplementary Material
I briefly reviewed the data generation process. The authors provide detailed steps for creating the data.
Relation to Broader Scientific Literature
This work provides a fresh perspective on cases where CoT may fail to improve performance. It represents a novel contribution to the community and could inspire further exploration of the underlying reasons and test-time inference to mitigate such issues.
Essential References Not Discussed
NA
Other Strengths and Weaknesses
Strengths:
- The paper is well-written and easy to follow, even for readers without a background in psychology.
- It draws inspiration from psychological findings and offers novel insights. The studied topic is of interest to the community.
- The overall evaluation is comprehensive and sound.
Weaknesses:
- It is difficult to determine the generalizability of the findings, especially since each task is specifically designed for its respective category.
- Most models are evaluated with a temperature of 0 during inference. Since the authors also mention advanced prompting methods such as self-consistency, it would be interesting to see whether the findings still hold when sampling multiple times using temperature sampling.
Other Comments or Suggestions
NA
We thank the reviewer for their encouraging review!
Weakness 1:
“It is difficult to determine the generalizability of the findings, especially since each task is specifically designed for its respective category.”
Response:
A. Generalizability within categories
The studies we chose are among the most classic and well-known studies in the psychology literature for their respective categories. In our experimental design, we strove to create the most representative task with minimal edits to the original study. In addition, we did not cherry-pick results: all the results represent the first iteration of testing each category.
B. Generalizability with respect to variations
To show generalizability across variations of the task, we conduct 3 additional experiments varying the problems in each of the failure cases. We find that our results are consistent across variations in difficulty, which provides further confidence in the generalizability of the findings:
- Artificial grammar learning, varying complexity of underlying FSG:
We conduct an ablation reducing the number of nodes in the finite-state automaton that generates the artificial grammars (a toy sketch of this node-merging manipulation follows this list of experiments). While the original had 6 nodes, we iteratively reduce to 5, 4, and 3 nodes by merging nodes (see https://imgur.com/a/VCBQYEB). Across all valid FSGs with no unused nodes, we observe the following accuracies:
5 nodes, zero-shot = 0.886, CoT = 0.766
4 nodes, zero-shot = 0.837, CoT = 0.665
3 nodes, zero-shot = 1.000, CoT = 1.000
We see across varying complexity that CoT consistently hurts performance.
- Facial recognition, varying level of similarity:
We conduct an ablation reducing the difficulty of the task. Instead of similar faces, we sample 5 faces with different descriptions. For a visual representation, see https://imgur.com/a/1PsqIzd.
Across 100 randomly selected sets, we find that CoT continues to drastically reduce model performance. GPT-4o has a direct prompting accuracy of 0.61, but CoT accuracy is only 0.32, corroborating our findings that CoT reduces performance.
- Classifying data with exceptions, binary oracle feature:
In this task, the oracle feature was license plates, which mapped to the correct category without exceptions. However, LLMs may find it difficult to build a map from 6-character license plates to a binary class.
We conduct an ablation where we change the oracle feature to a binary feature. We replaced “license plate” with “license plate type”, a feature with labels {0, 1}. Other features remained the same.
We evaluated GPT-4o with 25 trials for up to 7 passes over the list. Direct prompting took an average of 1.84 passes to get all labels correct, while only 1 / 25 CoT trials achieved perfect classification within 7 passes, so CoT took >250% more passes to learn all labels. This mirrors our findings that CoT hurts performance in this type of task.
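For concreteness, below is a minimal toy sketch of the node-merging manipulation referenced in the artificial grammar learning item above. The transition table and the choice of which nodes to merge are illustrative placeholders, not the exact 6-node grammar or merge sequence used in our experiments.

```python
import random

# Toy finite-state grammar (FSG): each node maps to a list of
# (emitted letter, next node) transitions. This is NOT the exact
# 6-node grammar from the paper, just an illustration of its structure.
FSG = {
    0: [("T", 1), ("P", 2)],
    1: [("S", 1), ("X", 3)],
    2: [("V", 2), ("X", 4)],
    3: [("S", 4), ("T", 5)],
    4: [("V", 5), ("P", 3)],
    5: [],  # terminal node: generation stops here
}

def generate_string(grammar, start=0, max_len=12):
    """Random walk through the grammar, concatenating emitted letters."""
    node, letters = start, []
    while grammar[node] and len(letters) < max_len:
        letter, node = random.choice(grammar[node])
        letters.append(letter)
    return "".join(letters)

def merge_nodes(grammar, keep, drop):
    """Merge node `drop` into node `keep`, yielding a grammar with one fewer node:
    incoming transitions are redirected to `keep` and outgoing ones are combined."""
    redirect = lambda nxt: keep if nxt == drop else nxt
    merged = {
        node: [(letter, redirect(nxt)) for letter, nxt in transitions]
        for node, transitions in grammar.items()
        if node != drop
    }
    merged[keep] = list(dict.fromkeys(
        merged[keep] + [(letter, redirect(nxt)) for letter, nxt in grammar[drop]]
    ))
    return merged

# Example: reduce the toy 6-node grammar to 5 nodes by merging node 4 into node 3.
five_node_fsg = merge_nodes(FSG, keep=3, drop=4)
print(generate_string(FSG), generate_string(five_node_fsg))
```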
C. Generalizability across categories
Our generative process for the categories of human failure tasks was as follows:
- Two senior cognitive scientists generated all cases they could come up with in which explicit or verbal thinking impairs human performance on some task. Our list spans cognitive psychology (e.g., task 1), perception (task 2), educational psychology (e.g., task 3), and spatial cognition (task 5).
- We categorized these under themes and chose tasks that were most representative of each category based on the literature.
- We then adapted these tasks to an LLM, ensuring that our dataset matches ML standards of scale and LLM / LMM applications, yielding our final 6.
Thus, we have reasonable belief that our list of categories is generalizable across the psychology literature. At the same time, we acknowledge that we are limited by the coverage of psychological literature, and thus generalizability across types of tasks is restricted (see Discussion, “Scope of application”).
Weakness 2:
“Most models are evaluated with a temperature of 0 during inference. Since the authors also mention advanced prompting methods such as self-consistency, it would be interesting to see whether the findings still hold when sampling multiple times using temperature sampling.”
Response:
This is a great suggestion. We conducted ablations at different temperatures (t = 0.5, 0.7) across the full 4400 problems for the artificial grammar learning task on GPT-4o. Accuracies were as follows:
t = 0, zero-shot = 87.5, CoT = 64.4 (original results for reference)
t = 0.5, zero-shot = 88.3, CoT = 63.6
t = 0.7, zero-shot = 87.8, CoT = 63.6
Thus, our results seem to be robust to variations in temperature sampling.
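For reference, the self-consistency setup the reviewer describes would sample several CoT completions at a nonzero temperature and take a majority vote over the extracted answers. A minimal sketch is below; `query_model` and `extract_answer` are hypothetical helpers standing in for our actual API-calling and answer-parsing code.

```python
from collections import Counter

def query_model(prompt: str, temperature: float) -> str:
    """Hypothetical wrapper around the model API; returns one sampled completion."""
    raise NotImplementedError  # stand-in for the actual API call used in our pipeline

def extract_answer(completion: str) -> str:
    """Hypothetical parser that pulls the final answer label out of a completion."""
    raise NotImplementedError  # stand-in for our actual answer-parsing code

def self_consistency_answer(cot_prompt: str, n_samples: int = 5, temperature: float = 0.7) -> str:
    """Sample several CoT completions and return the majority-vote answer."""
    answers = [extract_answer(query_model(cot_prompt, temperature)) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]
```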
The paper investigates the conditions under which CoT prompting, a widely used technique to improve the performance of LLMs/LMMs, can actually reduce model performance.
The authors draw inspiration from cognitive psychology, focusing on six tasks where verbal thinking (deliberation) has been shown to impair human performance. They adapt these tasks to evaluate the impact of CoT on state-of-the-art models, finding that CoT significantly reduces performance in three of the six tasks (implicit statistical learning, face recognition, and classification of data with exceptions), while the effects are mixed or negligible in the remaining three (logical inconsistency, spatial intuitions, and working memory).
The paper suggests that tasks where verbal thinking harms human performance may also be problematic for models using CoT.
Questions for Authors
Please kindly see the weaknesses.
Claims and Evidence
The claims made in the paper are generally supported by clear and convincing evidence. The authors conduct extensive experiments across multiple state-of-the-art closed-sourced models (e.g., GPT-4, Claude, Gemini) and provide detailed results showing significant performance drops when CoT is applied to tasks like implicit statistical learning and face recognition.
It is better to provide a deeper exploration of why CoT fails in these specific tasks, as the current analysis is somewhat surface-level.
Methods and Evaluation Criteria
The methods and evaluation criteria are appropriate for the problem at hand. The authors adapt well-known psychological tasks to evaluate LLMs/LMMs, scaling them up to modern use cases. The evaluation metrics (accuracy, number of passes, etc.) are standard and well-suited to the tasks.
Theoretical Claims
The paper does not make strong theoretical claims. The authors primarily focus on empirical results and the heuristic that connects human cognitive limitations to model performance under CoT.
Experimental Design and Analysis
The experimental design is sound. The authors adopt tasks from psychological studies to fit the capabilities of LLMs/LMMs.
- The paper does not explore the impact of different CoT prompting strategies in depth, which could provide more insights into why CoT fails in certain tasks.
- The tasks where CoT does not harm performance (e.g., logical inconsistency) are not analyzed as thoroughly as the tasks where it does.
Supplementary Material
Yes.
The supplementary material includes detailed descriptions of the tasks, prompts, and datasets used in the experiments.
I also reviewed the additional results, such as per-round accuracy analysis for the classification task with exceptions.
Relation to Broader Scientific Literature
The paper connects well with the broader literature on CoT prompting and cognitive psychology. It builds on prior work showing that CoT can improve performance on certain tasks (e.g., symbolic reasoning) but also acknowledges cases where CoT can be detrimental.
Essential References Not Discussed
It is better to include a more thorough discussion of prior work on the limitations of CoT prompting.
For example, recent studies have shown that CoT can increase harmful outputs or fail in tasks requiring planning (e.g., Kambhampati et al., 2024). Additionally, the paper does not discuss alternative prompting strategies like Tree-of-Thought or Self-Consistency, which have been shown to improve reasoning in some cases. Including these references would provide a more comprehensive view of the current state of CoT research.
Other Strengths and Weaknesses
Strengths:
- The paper tackles the significant and underexplored issue of when CoT might harm model performance.
- The experiments are well-designed and provide clear evidence of CoT's negative impact in certain tasks.
Weaknesses:
- The paper lacks a deeper theoretical explanation for why CoT fails in certain tasks. Although the proposed heuristic is reasonable, it lacks rigorous testing or validation.
- The analysis of tasks where CoT does not harm performance is less thorough. It raises some questions about the generalizability of the findings.
- It is better to include a broader discussion of alternative prompting strategies and their potential to mitigate the issues identified.
- The heuristic connecting human cognitive limitations to model performance is interesting but not rigorously tested. Could the authors provide more evidence or experiments to validate this heuristic?
Other Comments or Suggestions
- The paper is well-written and easy to follow, but it is better to provide a more detailed discussion of the implications of the findings for the design of LLMs and LMMs and their reasoning prompting.
- The authors should consider exploring the impact of different CoT prompting strategies (e.g., Tree-of-Thought).
We thank the reviewer for their thoughtful review.
Essential References:
“Include a more thorough discussion of prior work on the limitations of CoT prompting (e.g., Kambhampati et al., 2024).”
Response:
This paper is already cited in our related work (L124-126): “In related settings such as planning, there is little benefit from CoT prompting (Kambhampati et al., 2024)”. Feel free to suggest more references if others are missing.
Weakness 1 & 4:
“The proposed heuristic is reasonable, but lacks rigorous testing or validation.”
“Could the authors provide more evidence or experiments to validate this heuristic?”
Response:
We agree with the reviewer that more depth is better. We add 4 additional experiments. We vary the difficulty of tasks 1–3, and conduct an ablation on temperature.
- Artificial grammar learning, varying complexity of FSG:
We conduct an ablation reducing the number of nodes in the finite-state automaton that generates the artificial grammars. While the original had 6 nodes, we iteratively reduce to 5, 4, and 3 nodes by merging nodes (see https://imgur.com/a/VCBQYEB). Across all valid FSGs with no unused nodes, we observe the following accuracies:
5 nodes, zero-shot = 0.886, CoT = 0.766
4 nodes, zero-shot = 0.837, CoT = 0.665
3 nodes, zero-shot = 1.000, CoT = 1.000
We see across varying complexity that CoT consistently hurts performance.
- Facial recognition, varying level of similarity:
We conduct an ablation reducing task difficulty: Instead of similar faces, we sample 5 faces with different descriptions. For a visual example, see https://imgur.com/a/1PsqIzd.
Across 100 randomly chosen sets, CoT continues to drastically reduce performance. GPT-4o has a zero-shot accuracy of 0.61, but CoT accuracy is only 0.32, corroborating our findings that CoT reduces performance.
- Classifying data with exceptions, binary oracle feature:
In this task, the oracle feature was license plates, which mapped to the correct category without exceptions. However, LLMs may find it difficult to build a map from 6-character license plates to a binary class.
We conduct an ablation where we change the oracle feature to a binary feature. We replaced “license plate” with “license plate type”, a feature with labels {0, 1}. Other features remained the same.
We evaluated GPT-4o with 25 trials for up to 7 passes over the list. Direct prompting took an average of 1.84 passes to get all labels correct, while only 1 / 25 CoT trials achieved perfect classification within 7 passes, so CoT took >250% more passes to learn all labels.
- For our temperature ablation, please see our response to reviewer 3 (fD9f).
Weakness 2:
“The tasks where CoT does not harm performance are not analyzed as thoroughly as the tasks where it does.”
Response:
We focused on the negative cases because the vast majority of the literature so far has focused on the positive impacts of CoT. Understanding the negative cases can help us identify the settings where CoT is likely to fail in the future.
Weakness 3:
“The paper does not discuss alternative prompting strategies like Tree-of-Thought or Self-Consistency, which have been shown to improve reasoning in some cases.”
Response:
We discuss both Tree-of-Thought (ToT) and self-consistency in the paper, and we also conducted an ablation on ToT for artificial grammar learning. See Discussion, L427-438, “Types of inference-time reasoning”.
For artificial grammar learning, ToT improved accuracy on GPT-4o (64.55% vs. 62.52%), but was still far from zero-shot performance (94.00%), providing support that our findings extend across these techniques.
We are happy to conduct more analyses before the camera ready if the reviewer believes it is imperative.
Other Comments:
“it is better to provide a more detailed discussion of the implications of the findings for the design of LLMs and LMMs and their reasoning prompting.”
Response:
Great idea! We add the following to our discussion. Please feel free to suggest changes.
Implications for the design of LLMs, LMMs, and prompts
In our experiments, we observe that CoT can also perform worse on certain types of tasks. This suggests that models should be flexible in choosing when to use CoT. Towards this, one promising direction is rational metareasoning: when people are faced with a task, they often trade off the increased costs of reasoning against the marginal benefits they would attain. We could prompt LLMs to do the same before solving a task. In this direction, De Sabbata et al. (2024) trained LLMs to use intermediate reasoning steps only when necessary by incorporating a term into the reward function that penalizes unnecessary reasoning. Future work may further incorporate types of reasoning failures such as the ones we study into these training objectives.
De Sabbata, C. Nicolò, Theodore R. Sumers, and Thomas L. Griffiths. "Rational metareasoning for large language models." arXiv preprint (2024)
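To make this kind of flexibility concrete, here is a minimal prompt-level sketch of a metareasoning router; `query_model` is a hypothetical wrapper around a chat API, and the routing prompt is illustrative rather than something we have validated.

```python
def query_model(prompt: str) -> str:
    """Hypothetical wrapper around a chat-completion API."""
    raise NotImplementedError  # stand-in for an actual API call

ROUTER_PROMPT = (
    "You will be given a task. Before solving it, decide whether writing out "
    "step-by-step reasoning is likely to help or hurt on this kind of task. "
    "Answer with exactly one word: 'reason' or 'direct'.\n\nTask: {task}"
)

def metareasoning_solve(task: str) -> str:
    """Ask the model whether deliberation is worth its cost, then prompt accordingly."""
    decision = query_model(ROUTER_PROMPT.format(task=task)).strip().lower()
    if decision.startswith("reason"):
        return query_model(f"{task}\n\nLet's think step by step.")
    return query_model(f"{task}\n\nAnswer directly, without any explanation.")
```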
This work delves (no, not written by an LLM) into the conditions under which CoT works. Often in ML research, only positive results are presented, and the many things that don't work never see the light of day. The contribution of this work is firstly to make explicit that CoT doesn't always work, and more importantly to test a hypothesis of why/when it doesn't work. As a good starting point, the authors look at the behavioral psychology literature and shortlist 6 tasks for which human performance drops with (more) thinking. Proxy tasks are then performed by a number of SOTA LLMs/LMMs to investigate whether CoT also causes performance drops in these models.
Update after rebuttal:
In view of the additional experiments and the reasonable responses, I upgrade my review from 2 to 3. However, overall, while I like the approach and sympathize with the conclusions, I think the piece of work could still be improved significantly. If there were a 2.5 rating option, I would select that.
Regardless of acceptance/rejection, I think the work may not be entirely convincing to a more traditional ML/CS readership. I urge the authors to consider how their case can be made more convincing, and whether perhaps a different venue (if rejected) or an additional venue for follow-on work (if accepted) with a different readership/slant may be better.
Questions for Authors
Nil.
Claims and Evidence
- I like the goal of this work, but ultimately it's an observational / analytical report. While the paper purports to "understand and predict" (Introduction, pg 1 right column top para) when CoT has a negative effect, the paper doesn't really predict in the rigorous or quantitative sense. It would have been much stronger if ultimately a large variety of tasks were performed by the LLMs/LMMs with and without CoT, and some sort of predictive computational model trained. In other words, what we really want (but the authors didn't do) is to predict accurately, given a new/unseen task, whether CoT should be used or not.
- If only 3 of the 6 tasks had clear negative effects in models (i.e. like humans), then there are very mixed results, and it's not clear whether the work can make any strong claims, other than just reporting these results.
- On page 8, first para, the authors write: "strongly supporting the hypothesis that our heuristic is better than random at finding failure in CoT"... this seems to be a very weak and uninteresting claim.
Methods and Evaluation Criteria
Methods and evaluation criteria were generally sound.
Theoretical Claims
Not applicable.
Experimental Design and Analysis
Experimental designs / analyses were generally sound.
Supplementary Material
I did not review the supplementary material.
Relation to Broader Scientific Literature
Interesting contribution to investigate deeper the phenomenon of CoT, where many papers and model use it, but never discuss or show the setting / expts in which CoT doesn't help or makes things worse.
This is a negative (or unclear) result in the sense that of the 6 tasks tested, only 3 fulfilled the (implicit) expectation or hypothesis that CoT would produce performance reductions, as they did with humans. I mean this as a neutral comment (negative results can be valuable).
Essential References Not Discussed
Nil.
Other Strengths and Weaknesses
Nil.
Other Comments or Suggestions
Nil.
We thank the reviewer for their thoughtful review. We are also grateful that the reviewer appreciates our approach of providing our full results rather than cherry-picking cases.
Claims and Evidence 1:
“While the paper purports to "understand and predict" when CoT has a negative effect, the paper doesn't really predict in the rigorous or quantitative sense.”
Response:
The sense in which we are able to predict cases where CoT has negative effects is that we are able to identify risky tasks based on the psychological literature, for which we do find large negative effects from CoT in model performance.
We do agree that this is a looser interpretation of “predict” — however, the failure cases that we do find are within structured categories that share psychological explanations, thus allowing us to rigorously predict reductions in performance for altered variations of these tasks.
To illustrate this, we provide results from 3 additional experiments:
- Artificial grammar learning, varying complexity of underlying FSG
- Facial recognition, varying level of similarity between faces
- Classifying data with exceptions, binary oracle feature
These additional experiments each have changes that significantly differentiate them from their original counterparts by varying difficulty/complexity. Across all these experiments, we find large decreases in CoT performance compared to zero-shot, demonstrating the within-category robustness of our findings. Please see our response to reviewer 3 (fD9f) for details on each additional experiment.
Claims and Evidence 1, pt 2:
“It would have been much stronger if ultimately a large variety of tasks were performed by the LLMs/LMMs with and without CoT, and some sort of predictive computational model trained. In other words, what we really want (but the authors didn't do) is that given a new/unseen task, predict accurately whether CoT should be used or not.”
Response:
We agree that this a valuable goal for future research. However, it’s also worth mentioning that this is incredibly hard: In decades of overthinking research in psychology, there is still no progress towards a generalizable “overthinking” classifier for humans. Thus, it seems reasonable that building something similar for LLMs would also be very difficult.
The current state of literature on CoT largely focuses on tasks developed in the NLP literature such as MMLU. Our paper builds towards the goal the reviewer suggests by exploring a novel set of tasks inspired by the psychology literature, which we find result in a number of large negative effects of CoT. Such cases are uniquely informative for understanding the limits of CoT, and relevant to developing better predictive models.
At the same time, we certainly do not claim to have found the predictive computational model that the reviewer proposes. We highlight this in our discussion (“Scope of application”, page 8):
“While our psychology-based heuristic offers a strategy for identifying failure cases of CoT, it is unlikely to cover all cases where CoT decreases performance. Existing psychological research has been guided by a variety of theoretical and practical considerations, but does not offer an exhaustive or representative sample of all tasks, and will miss cases that are uniquely interesting to study in models but not humans. Thus, we envision our contribution to be complementary to existing evaluation methods in natural language processing.”
Claims and Evidence 2:
“If only 3 of the 6 tasks had clear negative effects in models (i.e. like humans), then there are very mixed results, and it's not clear whether the work can make any strong claims, other than just reporting these results.”
Response:
In previous studies of CoT, negative impacts are much rarer than those seen in the six cases we considered. To support the claim that our tasks resulted in a higher failure rate, we conducted a bootstrapping significance test that found that our method of searching for CoT failures is more effective than previous attempts. This includes quantifying by both failure magnitude (estimated p < 0.000001) and failures irrespective of magnitude (estimated p < 0.00011). See Section 4.5 in the paper for details. Thus, we believe that we can make the claim that our method for exploring CoT failure tasks is more efficient than (and also complementary to) previous endeavors.
Claims and Evidence 3:
"strongly supporting the hypothesis that our heuristic is better than random at finding failure in CoT"... this seems to be a very weak and uninteresting claim.
Response:
We agree this claim was not particularly strong in its original statement. We have tried to clarify our claim by replacing this with “our heuristic is much more efficient than past endeavors at finding failures in CoT”, which is also more precise. We welcome further suggestions from the reviewer for how to phrase the takeaways of this section more clearly.
As Chain-of-Thought (CoT) prompting becomes a widely used practice, this paper aims to characterize the limitations of the approach. The authors propose a "heuristic" for determining the limitations of CoT by drawing a comparison between CoT prompting and humans engaging in verbal thought. Inspired by the psychological literature, six tasks where verbal thought is claimed to impair performance were chosen and adapted to evaluate LLMs and LMMs.
Experiments show that three of these tasks see a drastic decrease in performance caused by adding CoT to zero-shot prompting, and the effect is more pronounced in stronger models. In the other tasks, however, performance was either improved or not affected by CoT. Some bootstrapping results were provided to demonstrate the effectiveness of the task-selection heuristic.
Questions for Authors
I've stated some of my confusions in the 'Claims', 'Method' and 'Experiments' sections; it'd be great if these confusions can be addressed. One of the more important questions is clarification on the setting and results of your bootstrapping experiment.
Claims and Evidence
The major claim of the paper is the effectiveness of the proposed heuristic: LLMs/LMMs perform worse with CoT on tasks where verbal thinking impairs human performance. The authors claim that, although not perfect due to intrinsic differences between humans and models, the heuristic allows for a more efficient identification of tasks unfit for CoT. The claim was supported by choosing 6 tasks guided by the heuristic, evaluating and analyzing the effects of CoT, and conducting bootstrapping experiments to show that the effectiveness of the heuristic goes beyond random selection.
The claim is fairly intuitive and, given the significant experiment and bootstrapping results, largely convincing. The experiments involved a wide variety of models and produced results with high confidence in some scenarios; there are also experiments corresponding to scenarios where authors predicted the heuristic would be less convincing; bootstrapping against random selection is considered too.
However, I find it hard to be convinced by a qualitative heuristic. It's not explicitly stated, let alone theorized, how the 6 tasks were chosen based on the heuristic, nor how the heuristic should be applied to a broader task pool. Therefore I don't see how the experimental results establish the effectiveness of the heuristic. And this is not obvious: for some of the tasks, the psychological literature only claims that participants "cannot provide verbal basis" rather than "perform worse when providing verbal basis"; why were such tasks chosen based on the heuristic?
Another issue is the brevity of the bootstrapping section. I imagine ablation studies are quite important for proving this claim, but there is no explanation of the setting and no illustration of the complete results to make this section convincing enough.
Methods and Evaluation Criteria
The proposed method is to detect unfit-for-CoT tasks based on whether verbal contemplation impairs human performance on these tasks. The evaluation criteria are a subset of intuitive reasoning tasks from the psychology literature, on which experiments were performed to study the effect of CoT.
As mentioned before, the proposed "heuristic" is a little confusing as a method: it lacks practicability (i.e., how to determine the extent to which tasks are appropriate for CoT in an arbitrary setting) and informativeness (i.e., what aspects and what extent of human under-performance under verbal deliberation signal an unfit-for-CoT task).
Evaluation was more focused on the performance change of models with respect to the addition of CoT in separate tasks. Promising results that align with the authors' predictions were indeed provided, but I'd like to see more overall evaluations on whether the given heuristic provides trustworthy predictions of CoT impairment. Otherwise, this paper would seem more like a case study of discrete tasks, which is a little less powerful in supporting the method.
I'm aware of the bootstrapping study, but the setting and results are still a little hazy. What do the 378 comparisons consist of? Are the 6 tasks newly selected or the same as before? If so, why sample multiple times again, as this is already done in the main experiments?
Theoretical Claims
I didn't see explicit theoretical claims and/or proofs, except for the comparison between humans underperforming with verbal contemplation and LLMs/LMMs underperforming with CoT. This claim is comprehensive in that multiple shortcomings of the comparison were discussed and it's based on solid psychological evidence, but I don't think it's a provable theoretical claim; at least it's not framed as one.
Experimental Design and Analysis
The main experiments were conducted with 6 tasks based on the psychological heuristic on models with ranging capabilities. Results for each model in zero-shot prompting and CoT prompting were provided, as well as p-values. Detailed analyses were present in both experiments where the hypothesis hold true and false.
However, some details of the results were inconsistent. In some experiments, o1-preview was used as the CoT version of GPT-4o, while other experiments directly used GPT-4o with CoT; in some tasks, some models were evaluated on subsets of the problems while others were evaluated on the entire problem set; p-values were provided in some studies but not others. Reasons for these disparities were not explicitly stated (or maybe I missed them?).
The analyses were a little lacking to me as well. For example, it's claimed that in a certain task "CoT often improved performance, attributable to both the low base performance and the logical reasoning component", but can these tasks be solved just as well by few-shot prompting? Does CoT improve performance in few-shot settings as well? Moreover, sound explanations were lacking for situations where the performance drop caused by CoT is less pronounced (or even negligible) for weaker models.
Supplementary Material
I reviewed all the code provided as supplementary material. The task generation and evaluation scripts and api calling scripts were quite well-organized, but I didn't find the inference code for open-source models present in the paper, such as Llama 3.
Relation to Broader Scientific Literature
Previous studies on limitations of CoT have focused on the computational expenses, problems caused by its sensitivity to prompting and problems in training (i.e. over-fitting, difficulties in evaluation). This paper provides a new light on the intrinsic incompatibility of some tasks, encouraging carving a clear boundary for applications of CoT. Should the authors solidify the method (i.e. the proposed "heuristic"), I believe this would be a novel line of work to build upon.
Essential References Not Discussed
I'm not aware of any.
Other Strengths and Weaknesses
The introduction of psychological studies and the parallel drawn between human performance and LLM predictions is brilliant; the authors also set aside the previous assumption that CoT's perils arise mainly from sensitivity to prompts, and focused on the tasks themselves.
Clarity of writing needs improving, especially in that figures showing experiment results and settings are lacking. The lack of theoretical framework also seems like a problem.
Other Comments or Suggestions
Please refer to the previous sections.
Methods And Evaluation Criteria 2:
“I'd like to see more overall evaluations on whether the given heuristic is providing trustworthy predictions of CoT impairment.”
We agree with the reviewer that more depth is better. We add 4 additional experiments:
- Artificial grammar learning, varying complexity of underlying FSG
- Facial recognition, varying level of similarity between faces
- Classifying data with exceptions, binary oracle feature
- Ablation on temperature sampling
Across all these experiments, we find large decreases in CoT performance compared to zero-shot, demonstrating the within-category robustness of our findings. Please see our response to reviewer 3 (fD9f) for details on each additional experiment.
Claims And Evidence 1, Methods And Evaluation Criteria 1 & 3
“How the 6 tasks were chosen based on the heuristic [...] "cannot provide verbal basis"”
“How should the heuristic be applied to a broader task pool”
“practicability and informativeness”
Choosing tasks based on the heuristic:
In Section 3, we provide an overview of the verbal thinking literature, including that artificial grammar experiments found that humans “cannot provide verbal basis” for their judgments. However, in section 4.1, paragraph “Human failure”, we specify that “In the artificial grammar learning task, humans prompted to verbalize performed more poorly than those who were not so prompted (Fallshore & Schooler, 1993)”. For each of the other tasks, we also justify the choice in section 4.
We recognize that this could be better structured for readers, and we have redrafted section 3 for the camera ready to provide better clarity on why each task was chosen.
How the heuristic can be applied to a broader task pool:
We chose the tasks based on the six most well-studied categories of human verbal thinking failures. However, each category does not only consist of one such task, but instead a broad range (e.g., verbal overshadowing does not only apply to facial recognition, but also phenomena like wine-tasting! (Melcher and Schooler, 1996)) Thus, for other tasks, we could search for stimulus patterns (such as reliance on another modality) that are based in psychological findings, and predict that e.g., due to verbal overshadowing, performance of that task with CoT would also be poor.
We do acknowledge that this approach is limited in that it only covers patterns or categories that are studied in the psychology literature. At the same time, it nicely complements existing ML approaches (e.g., Sprague et al., 2024). We highlight this limitation in our discussion (“Scope of application”, page 8).
Claims And Evidence 2:
“Brevity of the bootstrapping section. There's no explanation of the setting and illustration of complete results to make this section convincing enough”
We agree that this could be more detailed. We have updated the writing to include the following points (we omit the full subsection due to space constraints).
For the larger population, we take all evaluations that compare zero-shot and CoT in a recent meta-study, Sprague et al. (2024), for a total of 378. Models evaluated include Llama 2 7b, Mistral 7b, Llama 3.1 8b, Llama 3.1 70b, Gemma 2 9b, Phi-3 Small 8k, Qwen 2 7b, Qwen 2 72b, GPT-4o Mini, GPT-4o, Claude-3 Haiku, Claude-3.5 Sonnet, Gemini 1.5 Flash, and Gemini 1.5 Pro. Tasks evaluated span various domains such as mathematical reasoning (e.g., GSM8k-Hard), commonsense reasoning (e.g., CommonsenseQA), soft reasoning (e.g., AGIEval LSAT AR Soft Reasoning), and various commonly used benchmarks (e.g., MMLU-Pro, Big-Bench Hard).
For our experiments, we take all comparisons between zero-shot and CoT in our 6 tasks, for a total of 50. These are exactly all of the comparisons that we list in tables 1–6 in the main paper. For task 3, our main metric was number of rounds and not accuracy, so we replaced this with the difference in classification accuracy (e.g., y-axis of Figure 5).
For each comparison, we take the percentage accuracy decrease (consistent with the Sprague et al. paper) and use this as the value of the datapoint. We then bootstrap 100,000 samples of size 50 from the population and compute the mean percentage accuracy decrease for each sample. None of these 100,000 means were lower than the average we obtained in our experiments.
For each comparison, we labeled accuracy decreases from CoT compared to zero-shot. Separate to the previous analysis, we bootstrapped 100,000 samples of size 50 and counted the number of accuracy decreases. Only 11 of the 100,000 samples had more instances of performance decreases than the 50 datapoints in our experiments.
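For additional clarity, the two resampling tests described above amount to roughly the following sketch. Variable and function names are illustrative; `population_deltas` would hold the 378 comparisons from Sprague et al. (2024) and `our_deltas` our 50, expressed as signed percentage accuracy changes from zero-shot to CoT (negative values indicate a drop under CoT).

```python
import numpy as np

def bootstrap_mean_p(population_deltas, our_deltas, n_boot=100_000, seed=0):
    """First test: how often does a random sample of len(our_deltas) comparisons
    from the broader literature show a mean accuracy change at least as negative
    as the mean change we observed across our comparisons?"""
    rng = np.random.default_rng(seed)
    population = np.asarray(population_deltas, dtype=float)
    observed_mean = float(np.mean(our_deltas))
    boot_means = np.array([
        rng.choice(population, size=len(our_deltas), replace=True).mean()
        for _ in range(n_boot)
    ])
    return float(np.mean(boot_means <= observed_mean))

def bootstrap_count_p(population_deltas, n_drops_ours, sample_size=50,
                      n_boot=100_000, seed=0):
    """Second test: how often does a random sample of the same size contain at
    least as many CoT-induced accuracy drops (negative deltas) as we observed?"""
    rng = np.random.default_rng(seed)
    population = np.asarray(population_deltas, dtype=float)
    counts = np.array([
        (rng.choice(population, size=sample_size, replace=True) < 0).sum()
        for _ in range(n_boot)
    ])
    return float(np.mean(counts >= n_drops_ours))
```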
Other Responses
Due to rebuttal length limitations, we were not able to provide all our answers in the box provided. For answers to the remaining questions, we provide the following anonymous link: https://cryptpad.fr/doc/#/2/doc/view/YYLHK9InmafRO6yRL0legsg2tB4IUv6KY-3Whm+ddhw/
I thank the authors for their efforts in providing extra experiments and details. The experiment details provided in the external link address my concern over the validity of the experiments pretty well, and the additional tasks seem solid too.
However, I still believe that a necessary condition for acceptance is your work's significance and how this significance matters to a larger community. On seeing the title of this paper, I was initially expecting a robust framework for impairment caused by CoT and heuristics based on that framework. You "envision your contribution to be complementary to existing evaluation methods", but at least it's not explicit to me how I can incorporate your results when I conduct such evaluations over reasoning tasks.
Thank you for engaging further with our rebuttal! We are happy to hear that our additional experiments and provided details have resolved concerns regarding validity.
Reviewer Rebuttal Comment: (aspect 1)
"how this significance matters to a larger community [...] I was initially expecting a robust framework over impairment caused by CoT and heuristics based the framework"
Response:
We agree that something like this would be the final goal of this line of research --- a robust test-time algorithm that can reliably infer when to use reasoning. However, it’s also worth mentioning that this is incredibly hard: In decades of overthinking research in psychology, there is still little progress toward a generalizable “overthinking” classifier for humans. Thus, it seems reasonable that building something similar for LLMs would also be very difficult.
We believe this paper's significance is that we shed light on a psychology-inspired connection to help explain when CoT failures occur -- leveraging promising findings with significant statistical power. Parallels with human cognitive errors suggest that these failures we observe are not arbitrary, but also reflect deeper patterns in reasoning. We will also adjust the title and framing to better reflect its role as a valuable scientific observation to build upon, rather than a finalized solution.
Reviewer Rebuttal Comment: (aspect 2)
"complementary to existing evaluation methods [...] how can I incorporate your results when I conduct such evaluation over reasoning tasks."
Response:
As the reviewer has stated, the questions of when and why CoT failures occur are under-explored. The current literature on CoT has largely focused on tasks from the NLP literature, e.g., MMLU. Our paper explores a novel set of tasks inspired by the psychology literature, which resulted in a number of large negative effects from CoT. Such cases are uniquely informative for studying the limits of CoT because the existing psychology literature explains why these failures happen, and they are thus relevant to developing better predictive models.
In addition to providing a valuable foundation for future work, we will also release our six scaled-up task datasets as a human overthinking benchmark that ML practitioners can use to evaluate human-like overthinking failures.
Thank you very much for your feedback --- it helps us solidify our contribution and frame our position more clearly.
Sincerely,
Authors of Mind Your Step (by Step)
The paper investigates an interesting question: when CoT might harm the performance of LLMs. The authors draw inspiration from cognitive psychology and identify tasks where humans tend to perform worse when they verbalize their thinking process. They adapt six such tasks to test whether models also show performance drops with CoT in these situations. CoT achieves significantly worse results on three tasks but mixed results on the others. The psychology-inspired approach is novel, and the work studies an important but often overlooked aspect of CoT. This new perspective is appreciated by the reviewers. Concerns about generalization beyond these specific tasks and the need for more analysis also exist, but this work presents a step in an interesting and under-explored direction.