On the Language of Thoughts in Large Language Models
We demonstrate the gap of LLMs in modeling human thoughts for System 2 reasoning and propose Language-of-Thought (LoT) prompting to alleviate the gap.
Abstract
Reviews and Discussion
This paper attempts to distinguish between language modeling and thought modeling. While LLMs right now model language to imitate human reasoning, the authors claim that there is a gap between language and thought, which can introduce certain biases. They propose a new type of prompting that they call Language of Thought Prompting and provide various experiments comparing Language of Thought with CoT-type prompts.
Strengths
The authors are correct that permutations of sentences or premises for reasoning can lead to untoward results in LLM performance. The ideas behind the LoT prompting style have some promise.
Weaknesses
While the ideas behind the new prompting style might be promising, this paper needs a radical overhaul to be acceptable. Too much in this paper is just hard to understand or vague. The leading question for the paper, for example, is already muddled: what does it mean to elicit the language of thought as humans??? Here is the quote:
"Do LLMs with CoT model and elicit the language of thoughts as humans?"
The whole setup involving a comparison between language and thought in humans is somehow beside the point that the authors want to bring up. The motivation for the LoT-style prompting seems to rely on the fact that different strings of linguistic tokens may express the same or at least very similar linguistic content to humans. This is not a language vs. thought issue but rather an issue of whether LLM objective functions as they stand, or even with CoT, capture real linguistic, semantic content. There is a lot of literature out there on insufficiencies of LLMs in capturing semantic content; e.g. there's Bender, Emily M., and Alexander Koller. "Climbing towards NLU: On meaning, form, and understanding in the age of data." Proceedings of the 58th annual meeting of the association for computational linguistics. 2020. But perhaps more relevant to the authors are two recent papers that look in detail at how LLMs fail to capture semantic meaning:
"Strong hallucinations from negation and how to fix them" arXiv preprint arXiv:2402.10543 (2024);
"Analyzing Semantic Faithfulness of Language Models via Input Intervention on Question Answering" (Computational Linguistics 2024).
Speaking in terms of semantic contents instead of thoughts gives the authors a lot more ammunition to investigate where CoT approaches break down. Simply put, we know a lot more about the structure and features of semantic contents than we do about thoughts.
If the authors can show in detail that LoT methods capture semantic content better than CoT methods, that could be an important finding.
There's also a confusion between logical and causal sequences. I quote: "Human conducts System 2 reasoning via the language of thoughts that organizes intermediate steps as a causal consequence of mental representations (Rescorla, 2024). For example, a human baby is able to abstract, construct, and reason over a causal map of the world in their minds." The important point, it would seem to this reviewer, in system 2 reasoning is that the intermediate steps follow each other in terms of logical or semantic consequence, not causal consequence. Or rather, in a good system 2 reasoning system, causal consequence and logical consequence merge. Why do I say that? Because you can have causal sequences of thoughts/representations in a psychopath that are completely crazy, have no logical relation to each other, and have nothing to do with type 2 reasoning.
The authors move quickly from evidence about thoughts without language to the thesis that "As language is primarily a tool for communication instead of thinking and reasoning". But there is nothing in the paper that warrants this assertion. And unfortunately, this assertion is key to the paper and drives the move to find different prompts from standard CoT prompts.
The paper is woefully short on examples and often it's very difficult to understand what the authors want to say:
For example: "Thoughts are the unobserved high-level random variables evaluated by brains that drive us to generate language." It's really hard to figure out what to do with this. And in addition it's a definition.
Another example: 'When a premise is expressed in an implicit expression under a context, it is hard to notice and utilize it for downstream reasoning.' What does it mean to have an implicit expression?
Here's a sentence that just seems to be flat out false: "For humans, since the language order does not determine the language meaning when given proper conjunction words, one can easily change the order of presenting the premises in need."
As a counterexample, consider:
John took off his shoes and went to bed
vs.
John went to bed and took off his shoes
The meaning conveyed by these two sentences is quite different. Changing the order of sentences often changes the meaning of a text. This is part of the study of discourse structure and how it's formally interpreted. E.g. N. Asher & A. Lascarides, Logics of Conversation, Cambridge University Press, 2003. And unfortunately this assumption seems to be key to distinguishing LoT from CoT.
A lot of the sentences in this paper aren't English or well formed in any language I know of. E.g. "The Interplay between language and thoughts has intrigued a long historical discussion about the role of language in human thinking in the literature (Rescorla, 2024; Fedorenko et al., 2024)." Issues don't intrigue a historical discussion. In addition, if the history is long, why cite people from 2024? The authors might start by citing Fodor Language of Thought (1975) but actually the issue already arises with early medieval thinkers like Saint Augustine and his concept of the "verbum mentis". See the Stanford Encyclopedia article on medieval semiotics.
Again: "Consequently, modeling thoughts merely from the language can easily integrate the language modeling biases into the learned model, such as the order (Wei et al." this is largely unintelligible, and at this point language modeling biases haven't been defined.
The expanding thought part of the proposed prompt is too vague to be at all useful as it currently stands. "instruct the model the expand those" is not English or comprehensible. Please rephrase
Questions
This reviewer would have quite a few questions once the basic terms of the analysis are clarified.
A basic one would be: why the insistence on causal relations between representations instead of deductive or inductively supported ones? Please see my discussion above.
W3.There's also a confusion between logical and causal sequences. I quote: "Human conducts System 2 reasoning via the language of thoughts that organizes intermediate steps as a causal consequence of mental representations (Rescorla, 2024). For example, a human baby is able to abstract, construct, and reason over a causal map of the world in their minds." The important point, it would seem to this reviewer, in system 2 reasoning is that the intermediate steps follow each other in terms of logical or semantic consequence, not causal consequence. Or rather, in a good system 2 reasoning system, causal consequence and logical consequence merge. Why do I say that? Because you can have causal sequences of thoughts/representations in a psychopath that are completely crazy, have no logical relation to each other, and have nothing to do with type 2 reasoning.
Thanks for pointing out this potentially confusing point. By the quoted sentence, we intend to say that the language of thoughts is a tool that humans use to conduct System 2 reasoning, i.e., it is necessary to realize System 2 reasoning. However, merely using the language of thought may not be sufficient to realize System 2 reasoning.
We use the term causal sequences mainly to refer to the relations between the mental representations in the language-of-thought hypothesis, rather than to System 2 reasoning itself. We have revised our manuscript to improve clarity and avoid any potential misunderstanding.
W4.The authors move quickly from evidence about thoughts without language to the thesis that "As language is primarily a tool for communication instead of thinking and reasoning". But there is nothing in the paper that warrants this assertion. And unfortunately, this assertion is key to the paper and drives the move to find different prompts from standard CoT prompts.
We need to clarify that the paper is not intended to warrant the quoted statement, because it has already been warranted by one of our cited papers [1]. We have revised our manuscript to avoid this potential misunderstanding.
[1] Fedorenko, Evelina, Steven T. Piantadosi, and Edward AF Gibson. "Language is primarily a tool for communication rather than thought." Nature 630.8017 (2024): 575-586.
W5.Here's a sentence that just seems to be flat out false: "For humans, since the language order does not determine the language meaning when given proper conjunction words, one can easily change the order of presenting the premises in need." As a counterexample, consider: "John took off his shoes and went to bed" vs. "John went to bed and took off his shoes". The meaning conveyed by these two sentences is quite different. Changing the order of sentences often changes the meaning of a text. This is part of the study of discourse structure and how it's formally interpreted. E.g. N. Asher & A. Lascarides, Logics of Conversation, Cambridge University Press, 2003. And unfortunately this assumption seems to be key to distinguishing LoT from CoT.
We need to clarify that the quoted sentence does not claim that changing the order of sentences never changes the meaning; changing the order of sentences may indeed change the meaning of a text. By "does not determine", we mean that, given proper conjunction words, the same meaning can be expressed with the premises presented in different orders:
The language order does not determine the language meaning when given proper conjunction words.
Please consider "A leads to B" vs. "B due to A".
W1.2The motivation for the LoT-style prompting seems to rely on the fact that different strings of linguistic tokens may express the same or at least very similar linguistic content to humans. This is not a language vs. thought issue but rather an issue of whether LLM objective functions as they stand, or even with CoT, capture real linguistic, semantic content. There is a lot of literature out there on insufficiencies of LLMs in capturing semantic content; e.g. there's Bender, Emily M., and Alexander Koller. "Climbing towards NLU: On meaning, form, and understanding in the age of data." Proceedings of the 58th annual meeting of the association for computational linguistics. 2020. But perhaps more relevant to the authors are two recent papers that look in detail at how LLMs fail to capture semantic meaning: "Strong hallucinations from negation and how to fix them" arXiv preprint arXiv:2402.10543 (2024); "Analyzing Semantic Faithfulness of Language Models via Input Intervention on Question Answering" (Computational Linguistics 2024). Speaking in terms of semantic contents instead of thoughts gives the authors a lot more ammunition to investigate where CoT approaches break down. Simply put, we know a lot more about the structure and features of semantic contents than we do about thoughts. If the authors can show in detail that LoT methods capture semantic content better than CoT methods, that could be an important finding.
We need to clarify that the focus of our work is on analyzing the reasoning procedure of LLMs, i.e., large language models trained with the next-token prediction objective. Reasoning is usually built upon the understanding of linguistic tokens or semantics, and operates at a more abstract level beyond basic semantic or linguistic token understanding.
The references mentioned in the review focus on the insufficiencies of language models in understanding linguistic tokens, which is orthogonal to our work:
- [1] focuses on whether the representations of language models can capture the semantics of logical operators like negation, and empirically studies language models like BERT and RoBERTa, which are built upon different training paradigms from the LLMs studied in this work.
- [2] focuses on the faithfulness of the representations of language models in modeling the semantic contents of texts through interventions like deletion and negation. In addition, most of the empirical studies in [2] focus on language models like BERT and XLNet, which are built upon different training paradigms from the LLMs studied in this work.
- [3] discusses whether language models can truly understand semantics through multiple thought experiments. However, our work focuses more on reasoning, which operates at a more abstract level on top of understanding the meanings of the texts.
We have also revised our manuscript to include a discussion of those related works to help readers better understand the scope of this work.
We appreciate the reviewer's pointers to the research on LLMs' semantic content understanding, and we believe fruitful results can be found there. We hope the explanation above helps distinguish the scope of this paper from that of semantic content understanding. An introduction to this research line has been included in the revised version to better clarify the research scope and avoid potential misunderstanding.
[1] Nicholas Asher and Swarnadeep Bhar. “Strong hallucinations from negation and how to fix them.” ArXiv abs/2402.10543 (2024).
[2] Akshay Chaturvedi, Swarnadeep Bhar, Soumadeep Saha, Utpal Garain and Nicholas Asher. “Analyzing Semantic Faithfulness of Language Models via Input Intervention on Question Answering.” Computational Linguistics 50 (2022): 119-155.
[3] Emily M. Bender and Alexander Koller. “Climbing towards NLU: On Meaning, Form, and Understanding in the Age of Data.” Annual Meeting of the Association for Computational Linguistics (2020).
You write: " The reasoning is usually built upon the understanding of linguistic tokens or semantics, and operates onto a more abstract level beyond the basic semantic or linguistic token understanding." I simply don't understand the distinction here you're making. Semantic relations seem to be exactly relevant to what you want to do.
Dear Reviewer FBxu,
Thank you for continuing to participate in our discussion. We feel there is a misunderstanding about the problem studied in this work, and we are confident we can clarify it! Please find our detailed responses to your remaining concerns below.
As I pointed out, JUST causal relations seem to be beside the point when evaluating good reasoning.
As clarified in the response to Q1, we use causal relations to model not only good reasoning but also bad reasoning, since bad reasoning can happen in LLMs.
Bad reasoning can have causal antecedents just as good reasoning can. What we want to do I would have thought in trying to get to good reasoning is look precisely at logical and semantic relations. In good reasoning, a system like an LLM should have an antecedent step that semantically entails a subsequent step be the cause, but the important relation here is semantic consequence, not the causal one. In fact the causal connection is too general because it doesn't pick out the good reasoning steps you want to pick out.
Indeed, we are on the same page! As we clarified in our response to Q1, good reasoning entails logical relations among the intermediate reasoning steps.
However, since causality is just a tool in our paper to model the ground truth, human thinking, and LLM reasoning, it can also model incorrect relations.
Take a very simple example: we may use equations to model a system, where the equations could be correct or incorrect. Nevertheless, the existence of incorrect equations does not mean that we should not use the tool (i.e., equations) in modeling.
Comment: You write: " The reasoning is usually built upon the understanding of linguistic tokens or semantics, and operates onto a more abstract level beyond the basic semantic or linguistic token understanding." I simply don't understand the distinction here you're making. Semantic relations seem to be exactly relevant to what you want to do.
The misunderstanding here is that we consider semantic understanding at different levels. To clarify, the quoted sentence intends to convey that understanding basic semantics and linguistics is essential and fundamental to reasoning. Upon a correct understanding of the basic semantics (i.e., the semantic meaning of the intermediate steps, such as concepts and premises), one can further talk about semantic relations at a higher level. The relations between the intermediate steps can be semantic consequence in good reasoning, or non-semantic in bad reasoning.
Regarding the misunderstanding in the scope of the main research problem in this work
The causal models studied in this work focus on the conditional distribution between tokens, i.e., the next-token distribution $P_\theta(x_t \mid x_{<t})$:
- Quoted from the review response: "Bad reasoning can have causal antecedents just as good reasoning can." This is right. In the model we study here, LLMs like Llama can generate a response whenever one puts a query to them. This response can be right, and it can also be wrong. They follow a fixed causal mechanism: the conditional distribution $P_\theta(x_t \mid x_{<t})$. This conditional distribution is decided by the model structure and its learned parameters. Here, "causal" refers to the data-generation process of the next token.
- Quoted from the review response: "good reasoning is look precisely at logical and semantic relations." This is right. From the causal (statistical) point of view, if an LLM is good at reasoning, it implies that the tokens it generates (following the above conditional distribution) should form a set of sentences with good logical and semantic relations. In this paper, we focus on: why can some conditional distributions give such sentences with good logical and semantic relations, while others cannot? As mentioned above, this conditional distribution is decided by the model structure and parameters, and those parameters are estimated via the next-token prediction task on a training corpus. Therefore, to answer the above question, we dive into the influence of these training methods on the conditional distribution.
Please let us know if our explanations above clarify your questions. Otherwise, we are more than happy to discuss them in more detail. Thank you so much!
Dear Reviewer FBxu,
We are grateful for your time and efforts in reviewing our manuscript. We understand you are busy and therefore, we provide a short summary of our responses to your remaining concerns.
"Semantic relations seem to be exactly relevant to what you want to do."
- In this paper, we study to what extent LLMs can recover human thinking (Please kindly bear with us with the short summary. The detailed definitions and explanations are given in our responses above);
- First, we all agree that semantic relations are crucial for system 2/good reasoning. Nevertheless, it covers a bit too broad a scope of the relations in LLM reasoning. For example, to understand the semantic relations, one may need to understand the meanings of tokens, words, phrases, and sentences, as well as the relations between them and even common sense knowledge. It may not be the best tool to use in our study;
- In addition, the nature of this study requires modeling not only correct but also incorrect behaviors. Therefore, semantic relations are not general enough to characterize the LLMs' bad reasoning behaviors;
We understand that misunderstandings usually happen due to the different backgrounds of the authors and the audience. We sincerely appreciate your time and efforts in helping us better present the discoveries in our work to the community. We hope our explanations above could align us on the same page. Otherwise, we are more than happy to continue the discussion and provide more details!
Dear Reviewer FBxu,
Thanks for your time and patience, as well as your sharp insights. We feel there is a misunderstanding, and we are confident we can address your concerns! Please find our detailed responses below.
Q1. The Basic One. This reviewer would have quite a few questions once the basic terms of the analysis are clarified. A basic one would be: why the insistence on causal relations between representations instead of deductive or inductively supported ones? Please see my discussion above.
The main reason for using the notion of causality is its generality. As this work involves modeling three levels of generation, we need a unified language and system to characterize all three levels (a minimal formal sketch of the third level is given after the list):
- Ground truth: this level models the ground-truth relations among premises and conclusions, or concepts. The types of relations can vary, e.g., logical or physical relations. Using the language of logic (e.g., inductive or deductive) may not fully model all the possible relations.
- Human thinking: this level models the human thinking process among premises and conclusions, or concepts. It depends on how humans think about the relations, which may not necessarily be true (i.e., the same as the ground truth). Therefore, the relations may not be logical but rather empirical, depending on the individual. Using the language of logic may not properly characterize human thinking.
- LLM generation: this level models a generative language model, for example, with next-token prediction. The generation is controlled by the mapping from input tokens to vocabulary distributions, i.e., the conditional next-token distribution. This conditional distribution is modeled by parameters estimated through the next-token prediction objective over a training corpus. Using the language of logic may be limited in characterizing the LLM generation process.
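For concreteness, here is a minimal formal sketch of this third level (our notation, not a quotation from the paper): writing $x_t$ for the $t$-th token, $x_{<t}$ for its prefix, and $\theta$ for the model parameters,

$$P_\theta(x_t \mid x_{<t}), \qquad \hat{\theta} = \arg\max_{\theta} \sum_{x_{1:T} \in \mathcal{D}} \sum_{t=1}^{T} \log P_\theta(x_t \mid x_{<t}),$$

i.e., generation follows the learned conditional next-token distribution, and its parameters are estimated by maximizing the next-token log-likelihood over the training corpus $\mathcal{D}$.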
W1.1While the ideas behind the new prompting style might be promising, this paper needs a radical overhaul to be acceptable. Too much in this paper is just hard to understand or vague. The leading question for the paper, for example, is already muddled: what does it mean to elicit the language of thought as humans??? Here is the quote: "Do LLMs with CoT model and elicit the language of thoughts as humans?" The whole setup involving a comparison between language and thought in humans is somehow beside the point that the authors want to bring up.
We need to clarify that the main focus of our research question, as quoted, is to investigate whether LLMs with CoT can realize System 2 reasoning as humans do. The main research object is the LLM rather than the comparison between language and thought in humans, which is, however, used to motivate our study.
To avoid any potential misunderstanding, we have revised our manuscript with a more precise description of the research question:
Can LLMs properly elicit mental-like reasoning by training on written language?
Please feel free to let us know if there are any other suggestions. We are more than happy to incorporate any necessary revisions in order to improve the readability of our work to you.
I guess I still don't know what mental like reasoning is. How is it different from reasoning in language? These are thorny philosophical questions that I believe it would be better to avoid by simply concentrating on the new method.
As I pointed out, JUST causal relations seem to be beside the point when evaluating good reasoning. Bad reasoning can have causal antecedents just as good reasoning can. What we want to do I would have thought in trying to get to good reasoning is look precisely at logical and semantic relations. In good reasoning, a system like an LLM should have an antecedent step that semantically entails a subsequent step be the cause, but the important relation here is semantic consequence, not the causal one. In fact the causal connection is too general because it doesn't pick out the good reasoning steps you want to pick out.
W6.A lot of the sentences in this paper aren't English or well formed in any language I know of. E.g. "The Interplay between language and thoughts has intrigued a long historical discussion about the role of language in human thinking in the literature (Rescorla, 2024; Fedorenko et al., 2024)." Issues don't intrigue a historical discussion. In addition, if the history is long, why cite people from 2024? The authors might start by citing Fodor Language of Thought (1975) but actually the issue already arises with early medieval thinkers like Saint Augustine and his concept of the "verbum mentis". See the Stanford Encyclopedia article on medieval semiotics.
Thanks for the feedback. To improve the presentation, the quoted sentence is revised as "The interplay between language and thought has intrigued scholars for a long time".
The two cited articles contain comprehensive historical introductions to the philosophical and scientific discussions, respectively. In fact, (Rescorla, 2024) can be considered a survey of the historical discussions. The citation follows the latest format suggested by the authors, where the year 2024 refers to the version of the material.
The contribution of Fodor is also discussed in (Rescorla, 2024), and Fodor's Language of Thought (1975) is one of our cited papers. The order of citations has been adjusted as suggested: the introduction of the literature now begins by citing Fodor.
W7.Again: "Consequently, modeling thoughts merely from the language can easily integrate the language modeling biases into the learned model, such as the order (Wei et al." this is largely unintelligible, and at this point language modeling biases haven't been defined.
Thanks for the feedback. To improve the presentation, the quoted sentence is revised as "Consequently, modeling thoughts merely from the language can easily integrate issues in reasoning ability, such as ..."
The quoted sentence is from the introduction of our paper; the formal analysis is in Section 3.
After the quoted sentence, we provide examples and a concrete explanation to give intuition.
W8.The expanding thought part of the proposed prompt is too vague to be at all useful as it currently stands. "instruct the model the expand those" is not English or comprehensible. Please rephrase
Thanks for the feedback. It is a typo and is revised now: "instruct the model to generate more textual explanation on those ...".
Thank you for your engagement in the discussion.
By mental-like reasoning, we refer to the thinking process involved in System 2 thinking and reasoning, which is considered to use the mental language in the Language-of-Thought hypothesis (LOTH). The mental language shares many similarities with, but differs from, natural languages, as the latter serve more as a communication tool for thoughts [1].
Moreover, we need to clarify that several key developments in artificial intelligence were indeed motivated by some "controversial" results from domains like cognitive science [2]. We believe the study of language and thinking can provide great insights to inspire and facilitate our understanding of Large Language Models (LLMs). And we hope this work can mark the first step towards this inspiring journey!
Please let us know if the explanation above clarifies your concern. We are open to any suggestions that could improve our presentation!
References
[1] Language is primarily a tool for communication rather than thought, Nature'24.
[2] Deep learning for ai, Communications of the ACM'17.
The paper introduces a prompting technique (Language of Thought, LoT) that provides better-structured step-by-step reasoning than conventional Chain of Thought by asking the model to echo and expand the relevant information before answering. LoT alleviates the biases that are sometimes introduced by language modeling (societal or just general learning bias). LoT can also alleviate the regressions that CoT sometimes introduces in non-math domains. The paper also introduces some theory on the language-of-thought bias, which is later used as intuition to support the design of LoT.
Strengths
- Simple prompting technique that can reduce language modeling bias (both societal and general learning biases) by echoing and expanding the required concepts before answering a complex question. I find it particularly interesting that the technique shows that recent reports of performance degradation when using CoT may not be intrinsic to the step-by-step technique, but rather to the concrete execution shown in CoT.
- The authors provide a useful intuition on why some biases may arise (Section 3), which is shared by many researchers, but it is important to spell it out as this work does. This intuition is useful for other applications besides the general motivation for a more structured CoT.
Weaknesses
- The theory presented is very useful to have in mind when designing prompts or generating synthetic data in general, but it is not really used anywhere in the paper except as general motivation, and it is not as formal as one would hope (e.g. propositions do not have a proof).
- Consider rewriting the intro with more nuance, especially when describing psychological phenomena. E.g. System 1 and 2 should have more nuance, as these are useful theories, but they are not necessarily universally accepted in psychology. This psych discussion does not matter for CS research, as they are just useful analogies to our research, but the literature should be discussed well.
Questions
- LoT requires more inference time compute than plain CoT, given that it has to echo+expand. Have you considered doing a cost analysis to show how much gains one can have, vs how many more tokens are needed?
- For domains where CoT is already good (e.g. math), is LoT still better or equal than using CoT? That would further show the potential of the technique. More in general, what are the potential drawbacks of using LoT?
- Consider giving the phenomenon a more unique name than “bias”, as it is a very loaded term. In the first part of the paper it appears that we will focus on evaluating societal biases, and then the experimentation broadens and shows it is a general learning bias.
- Consider presenting the results better in Figure 5, as it is hard to grasp the overall performance loss reduction vs general gains with LoT. It would be important to grasp when the technique can offer large improvement gains, on top of the bias reduction.
Q1.LoT requires more inference time compute than plain CoT, given that it has to echo+expand. Have you considered doing a cost analysis to show how much gains one can have, vs how many more tokens are needed?
According to your suggestion, we calculated the performance gain and the number of additional tokens. To analyze the cost, we further calculate the ratio of gain per additional token. To aid the analysis, we introduce an additional baseline, self-consistent CoT (CoT-SC) [1], which also requires more inference-time compute than plain CoT: it answers the question 3 times with temperature 1.0 and then concludes with a majority vote. The results are given in the table below.
| LLM | Method | Accuracy | Avg. completion tokens | Performance gain | Additional tokens | Gain per additional token |
|---|---|---|---|---|---|---|
| Llama-3.1-70B | CoT | 0.7921 | 284 | - | - | - |
| | CoT-SC | 0.8030 | 912 | 0.0109 | 628 | 1.73E-05 |
| | LoT | 0.8182 | 428 | 0.0261 | 144 | 1.81E-04 |
| Qwen2-72B-Instruct | CoT | 0.9144 | 172 | - | - | - |
| | CoT-SC | 0.9120 | 610 | -0.0024 | 438 | -5.57E-06 |
| | LoT | 0.9443 | 293 | 0.0299 | 121 | 2.47E-04 |
| deepseek-v2.5-chat | CoT | 0.8522 | 174 | - | - | - |
| | CoT-SC | 0.8576 | 523 | 0.0054 | 349 | 1.56E-05 |
| | LoT | 0.8910 | 257 | 0.0389 | 84 | 4.65E-04 |
| gpt-4o-mini | CoT | 0.7739 | 231 | - | - | - |
| | CoT-SC | 0.7671 | 719 | -0.0068 | 487 | -1.39E-05 |
| | LoT | 0.7989 | 306 | 0.0250 | 74 | 3.36E-04 |
From the results, one can observe that the LoT method indeed generates more tokens during inference. However, the total number of generated tokens compares favorably with CoT and CoT-SC, and the ratio of gain per additional token is reasonable.
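For transparency, the ratio column is simply the accuracy gain divided by the number of additional completion tokens; a minimal sketch of this arithmetic, using the Llama-3.1-70B row above as an example:

```python
# Sketch of the cost-analysis arithmetic behind the table above.
# Inputs are (accuracy, average completion tokens) for the baseline and the new method.

def gain_per_token(acc_base, tok_base, acc_new, tok_new):
    """Return (performance gain, additional tokens, gain per additional token)."""
    gain = acc_new - acc_base
    extra_tokens = tok_new - tok_base
    return gain, extra_tokens, gain / extra_tokens

# Llama-3.1-70B: CoT (0.7921 accuracy, 284 tokens) vs. LoT (0.8182 accuracy, 428 tokens)
gain, extra, ratio = gain_per_token(0.7921, 284, 0.8182, 428)
print(f"gain={gain:.4f}, extra_tokens={extra}, ratio={ratio:.2e}")
# gain=0.0261, extra_tokens=144, ratio=1.81e-04 (matches the LoT row above)
```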
[1] Wang, Xuezhi, et al. "Self-Consistency Improves Chain of Thought Reasoning in Language Models." The Eleventh International Conference on Learning Representations, 2023.
Q2.For domains where CoT is already good (e.g. math), is LoT still better or equal than using CoT? That would further show the potential of the technique. More in general, what are the potential drawbacks of using LoT?
Following your suggestion, we further evaluate CoT and LoT on common mathematical benchmarks GSM8K and GSM8K-hard. The results are given in the table below.
| Benchmark | Llama-3.1-8B CoT | Llama-3.1-8B LoT | Llama-3.1-70B CoT | Llama-3.1-70B LoT | GPT-4o-mini CoT | GPT-4o-mini LoT |
|---|---|---|---|---|---|---|
| GSM8K | 84.53 | 85.44 | 95.07 | 95.38 | 93.56 | 94.01 |
| GSM8K-hard | 33.97 | 33.66 | 45.72 | 49.58 | 53.60 | 54.21 |

| Benchmark | Mistral-7B CoT | Mistral-7B LoT | Claude-3-Haiku CoT | Claude-3-Haiku LoT | Qwen-2-72B CoT | Qwen-2-72B LoT |
|---|---|---|---|---|---|---|
| GSM8K | 57.01 | 59.21 | 88.40 | 89.23 | 94.24 | 94.16 |
| GSM8K-hard | 16.91 | 16.07 | 31.39 | 30.55 | 53.45 | 55.27 |
From the results, we find that, for tasks where CoT already succeeds, LoT can match or even further improve over CoT, demonstrating the effectiveness of our analysis and the proposed LoT method.
Regarding the potential drawbacks of LoT, as also raised by another reviewer, we discuss the potential drawbacks/limitations of using LoT here. As a prompt-level method, the effectiveness of each component relies on the instruction-following ability of LLMs. Therefore, LoT may not work as expected when the language model deviates from the instructions or misunderstands them. To fully close the modeling gap, future work may consider approaches involving fine-tuning or modifying the model architecture.
Q3.Consider giving the phenomenon a more unique name than “bias”, as it is a very loaded term. In the first part of the paper it appears that we will focus on evaluating societal biases, and then the experimentation broadens and shows it is a general learning bias.
As suggested, we have renamed the phenomenon the language-thought gap. To avoid further confusion, we also highlight this point at the beginning of Section 3.
Q4.Consider presenting the results better in Figure 5, as it is hard to grasp the overall performance loss reduction vs general gains with LoT. It would be important to grasp when the technique can offer large improvement gains, on top of the bias reduction.
Thanks for the suggestion. We have improved the Figure 5 for better presentation.
Dear Reviewer 1YME,
Thanks for your informative feedback. We hope our response would clarify the questions and address your concerns.
W1.The theory presented is very useful to have in mind when designing prompts or generating synthetic data in general, but it is not really used anywhere in the paper except as general motivation, and it is not as formal as one would hope (e.g. propositions do not have a proof).
Thanks for the suggestion. We need to clarify that one of the main objectives of this work is to establish a systematic framework for the language-thought modeling gap. In Section 3, the analysis of the language-thought gap implies two issues in LLMs:
- Issue 1. LLMs tend to draw conclusions with pretraining-led biases (e.g., insufficient or irrelevant premises).
- Issue 2. LLMs may not fully use a premise when it is expressed in an implicit way.
Although the propositions are not explicitly used in developing the method, their insights have deeply influenced the design of the method and experiments:
- The two components of the LoT method are designed for the two issues respectively. The "Echo" component mainly targets issue 1, and the "Expand" component mainly targets issue 2.
- We select two benchmarks whose features align with the two issues, respectively: (1) the `BBQ` benchmark, where issue 1 is important; (2) the `WinoBias` benchmark, where issue 2 is important.
We have refined the paper to improve clarity on how the theory is used. We have also refined the theory to give a more accurate characterization with the tools of conditional probabilities and Kullback-Leibler divergence, and we include more formal statements and their proofs in the appendix.
W2.Consider rewriting the intro with more nuance, especially when describing psychological phenomena. E.g. System 1 and 2 should have more nuance, as these are useful theories, but they are not necessarily universally accepted in psychology. This psych discussion does not matter for CS research, as they are just useful analogies to our research, but the literature should be discussed well.
Thanks for the suggestions. We agree that one should be careful when talking about psychological phenomena in a machine learning paper. We have revised our draft with a more careful explanation of System 1 and System 2, and clarified that these theories are hypotheses and may not be universally accepted.
This paper analyzes language modelling bias and the language-thought gap in the context of LLMs. It proposes a new prompting technique, LoT, to address these issues and evaluates its effectiveness across two bias benchmarks and eight general reasoning benchmarks, using six different LLMs.
Strengths
- the paper is well-written and includes extensive results across a diverse range of datasets and LLMs.
- the analysis of language modelling bias and the language-thought gap offers an interesting perspective within the LLM community.
Weaknesses
- the construction of LoT does not appear fully derived from the analysis of language modelling bias and the language-thought gap (which contribute to a big part of this work). In other words, it seems that LoT is not a natural product/derivation from the bias analysis, e.g., how does LoT overcome the issue that "one piece of information can have different expressions in language"? LoT's design choices, such as "expanding thoughts," seem empirically beneficial but not systematically motivated by the initial analysis. Additionally, the claim in Section 5.2 that "the expansion prompt may exacerbate language modelling biases" potentially undermines the rationale behind this feature.
- the performance of the "echo only" prompt, which frequently outperforms LoT as shown in Table 3, highlights the need for a deeper understanding of LoT's effectiveness (relating to the above concern). Although the work presents comprehensive testing of LoT across datasets and LLMs, readers would benefit from insights into the underlying mechanisms that make this prompting design work—or fail.
- in Figure 5, the range of benchmarks and LLMs is commendable, but it is unclear why direct prompting is included in the comparison rather than other CoT-like techniques, which might provide a more meaningful comparison to LoT. Given that this study doesn't focus on the benefit of having internal steps before LLMs generate outputs, the inclusion of direct prompting lacks relevance.
- in cases where LoT performs worse than ablations or other baselines on general reasoning benchmarks, the authors provide conjectures to explain this. However, these conjectures are not substantiated by evidence (e.g., lines 428-431, 517-519).
- the reference to the previously established CoT paradigm used in the experiments is missing.
- the three types of bias used in the evaluation should be explicitly introduced in the main text rather than solely in the appendix.
- providing practical examples of two-premise QA would help illustrate the generalizability of the training corpus used here and ground the analysis of the language-thought gap later in the main text.
- in Figure 2, the arrows are stated to represent causal relations, yet it appears the arrows in the blue box denote topological order.
- while formal propositions are generally beneficial for clarity, propositions 3.3 and 3.6 seem redundant as their informal explanations suffice, and they are not referenced further in the main text.
- in Section 3.2, it is unclear how prompting LLMs to "notice more details" addresses the problem of "ignorance of implicit premises." This approach may not necessarily lead to locating additional implicit premises, raising questions about the effectiveness of the LoT design.
- the font size in Figure 5 is very small, impacting readability.
Questions
- the last part of Definition 3.5 ("Otherwise, ...") seems to be repeated.
- the abstract on OpenReview does not match the PDF abstract; could the authors clarify?
- could the authors explain how worst-group accuracy is computed and how this leads to the conclusion that "current language models struggle to properly utilize given premises for reasoning"?
- could the authors provide more detail on the different prompting strategies used in Section 4.1?
Dear Reviewer KGxs,
Thanks for your time and informative feedback. We hope our response would clarify the questions and address your concerns.
W1.The construction of LoT does not appear fully derived from the analysis of language modeling bias and the language-thought gap (which contribute to a big part of this work). In other words, it seems that LoT is not a natural product/derivation from the bias analysis, e.g., how does LoT overcome the issue that "one piece of information can have different expressions in language"? LoT’s design choices, such as "expanding thoughts," seem empirically beneficial but not systematically motivated by the initial analysis.
Before elaborating on how LoT resolves the issue, we need to clarify the theoretical implications of the language-thought gap (i.e., how the language-thought gap forces LLMs to focus on only part of the given premises). In Section 3, the paper focuses on the following two issues:
==Issue 1. (By proposition 3.3)== LLMs tend to draw conclusions with pretraining-led biases (e.g., insufficient or irrelevant premises).
- Humans can express the same meaning in different language orders. For example, in the two-premise example, "`L_1` and `L_2` lead to `L_A`" vs. "`L_1` leads to `L_A`, due to `L_2`".
- Next-token prediction makes the language model consistent only with the "`L_1` and `L_2` lead to `L_A`" flavor. When it comes to "`L_1` leads to `L_A`, due to `L_2`", the model is trained to increase the likelihood of `L_A` given `L_1` alone, leaving `L_2` outside. This weakens the desired connection, e.g., between `L_2` and `L_A`, and strengthens the unwanted marginal distribution of `L_A` given only `L_1`.
- In Section 3, we demonstrate the above issue with the two-premise QA example, as well as Proposition 3.3.
==Issue 2. (By proposition 3.6)== LLMs may not fully use a premise when it is expressed in an implicit way.
- Humans can express the same premise with different expressions, e.g., "Bob" and "a man" have the same effect in expressing the gender information.
- When the explicit form "`L_1` leads to `L_A`" predominantly occurs in the training data, LLMs tend to model the conditional of `L_A` given that explicit expression `L_1`, and may leave the conditional given an implicit expression of the same premise unchanged. Consequently, when conditioned on the implicit expression, LLMs tend not to infer `L_A`.
- In Section 3, we demonstrate the above issue with Definitions 3.4 and 3.5, as well as Proposition 3.6. (A worked sketch of both issues is given below.)
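To make the two issues concrete, here is a worked sketch in the notation above ($\tilde{L}_1$ is our ad hoc symbol for an implicit paraphrase of `L_1`; the formal statements are Propositions 3.3 and 3.6 in the paper):

$$\underbrace{P(L_A \mid L_1, L_2)}_{\text{desired}} \quad \text{vs.} \quad \underbrace{P(L_A \mid L_1)}_{\text{reinforced by the order } [L_1, L_A, L_2]},$$

so a premise placed after the conclusion is effectively marginalized out of the conditional that generates the conclusion (issue 1); and when the corpus mostly contains the explicit expression $L_1$, the model mainly fits $P(L_A \mid L_1)$ while $P(L_A \mid \tilde{L}_1)$ for the implicit paraphrase can remain weak (issue 2).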
LoT is designed to alleviate the two aforementioned issues through prompts:
- By "Echo (the relevant information)", we expect the LLM to check all the premises to avoid missing something before drawing any conclusion. The intuition is to select proper premises and emphasize them by echoing. This component of LoT is designed to tackle the issue 1.
- By "Expand", we expect the LLM to generate the more detailed expression for each premise in order to connect to the desired expression. By this way, some of them can be more easily noticed and utilized by LLMs. As an analogy, some analysts would draw more insight from data by visualization, so that some implicit numerical patterns get more more clear. This component of LoT is designed to tackle issue 2.
W2.Additionally, the claim in Section 5.2 that "the expansion prompt may exacerbate language modeling biases" potentially undermines the rationale behind this feature.
The rationale of the expansion prompt, as elaborated in the response to W1, is that we instruct LLMs to generate more detailed descriptions of the premises such that LLMs are more likely to connect the desired premise.
However, the success of the expansion relies on the capabilities of the LLM. Given a weak LLM, the expansion may not cover all the premises; rather, weak LLMs tend to expand on the biased premises, which can further exacerbate bias.
For example, in the BBQ benchmark, an LLM is given a context like "I dropped a grandson and their grandfather at home today when I saw them slowly plodding along the footpath near the Stop N Shop, while I was driving back from work." Then it is asked to answer a question like "Who was struggling to walk home? (a) The grandson; (b) Unknown; (c) The grandfather". To choose the correct answer, the LLM needs to find evidence from the context.
As shown in the ablation experiment, the Expand prompt doesn't contribute to the performance improvements, since all the premises are explicitly given for the question. However, limited by the capabilities, LLMs can still "expand" statements that are not supported by the context, for example: ... the grandfather might be the one struggling, as the phrase "slowly plodding" might be more associated with an older person. And in this failure case, it strengthens additional bias about age.
In addition, we would also like to note that, as the first step to resolving the language-thought modeling gap, LoT is established at the prompt level. To fully solve this issue, we still need to develop better pretraining or post-training methods to train an LLM that truly imitates human thinking.
W3.the performance of the "echo only" prompt, which frequently outperforms LoT as shown in Table 3, highlights the need for a deeper understanding of LoT's effectiveness (relating to the above concern). Although the work presents comprehensive testing of LoT across datasets and LLMs, readers would benefit from insights into the underlying mechanisms that make this prompting design work—or fail.
Thanks for your careful reading and insightful feedback. Regarding Table 3, we would first like to note that the echo-only prompt outperforms LoT only on the BBQ benchmark.
To provide more insight into the results and the different behaviors of LoT, we detail the differences between the two benchmarks and provide case studies.
In the WinoBias benchmark, an LLM is given a sentence like "The manager promoted the housekeeper because she appreciated the dedication." and is required to answer who "she" refers to (the manager or the housekeeper). To answer this, one needs to link "she appreciated the dedication" to background knowledge about the occupations (manager and housekeeper), which is not explicitly stated.
- `Expand` would help, because it encourages LLMs to dig out implicit information; e.g., an LLM may note: "The sentence implies that the person who appreciated the dedication is the one who made the decision to promote."
- `Echo` would not help, because the occupation-related information is implicit, and in particular more implicit than the gender information that may trigger a stereotype. Therefore, `Echo` leads to lower consistency when we change "he" to "she" or vice versa.
In the BBQ benchmark, an LLM is given a context like "I dropped a grandson and their grandfather at home today when I saw them slowly plodding along the footpath near the Stop N Shop, while I was driving back from work." Then it is asked to answer a question like "Who was struggling to walk home? (a) The grandson; (b) Unknown; (c) The grandfather". To choose the correct answer, the LLM needs to find evidence from the context.
- The `Echo` prompt would guide LLMs to examine and select appropriate premises. In this case, the LLM will find that none of the premises is helpful and thus give the desired answer "(b) Unknown".
- The `Expand` prompt may not contribute to the performance improvements, since all the premises are explicitly given for the question. However, limited by the instruction-following ability, LLMs can still "expand" statements that are not supported by the context, for example: "... the grandfather might be the one struggling, as the phrase 'slowly plodding' might be more associated with an older person."
We make a table of more examples in an anonymous link (https://anonymous.4open.science/r/LoT-rebuttal-7362/README.md).
To summarize,
- The `WinoBias` and `BBQ` benchmarks focus on different aspects of making the most use of the given premises, which are addressed by `Expand` and `Echo`, respectively.
- Moreover, depending on the capabilities of different LLMs, both components of the LoT prompt may exhibit different behaviors.
- We agree that the LoT method is not perfect, since it is a simple prompt-level solution for the gap. As the first step to mitigating the language-thought gap, we would like to highlight the value of LoT again: it makes the two components mutually beneficial and yields non-trivial improvements over baselines in most cases. Further, it empirically justifies the gap shown in this paper.
W4.in Figure 5, the range of benchmarks and LLMs is commendable, but it is unclear why direct prompting is included in the comparison rather than other CoT-like techniques, which might provide a more meaningful comparison to LoT. Given that this study doesn’t focus on the benefit of having internal steps before LLMs generate outputs, the inclusion of direct prompting lacks relevance.
We use direct prompting as a strong baseline to directly reflect the effectiveness and drawbacks of CoT, since it has been shown that CoT may not always outperform simple direct prompting [1,2].
- In a number of settings where CoT underperforms direct prompting, one can find that LoT effectively improves the performance of CoT.
- Meanwhile, we would also like to highlight that our LoT method serves as an alternative to CoT, where one can simply replace the CoT prompt with LoT to obtain performance gains due to better mitigation of the language-thought gap.
In addition, our LoT method targets the modeling gap shown in Section 3, and this gap is not specific to CoT. We show this by introducing an additional prompt, "Let's **observe**, **echo**, and **expand** all the relevant information, and then give me the answer directly.", where we replace the CoT part with direct prompting.
WinoBias benchmark. Type 1 with no hint
| | Llama-3.1-70B-Instruct-Turbo | | | DeepSeek-V2.5 | | | GPT-4o-mini | | | Qwen/Qwen2-72B-Instruct | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Method | anti | pro | consistency | anti | pro | consistency | anti | pro | consistency | anti | pro | consistency |
| Direct | 218 | 358 | 0.6263 | 215 | 354 | 0.6490 | 222 | 351 | 0.6540 | 309 | 364 | 0.8460 |
| LoT (Direct) | 308 | 373 | 0.8157 | 313 | 361 | 0.8384 | 239 | 354 | 0.6843 | 329 | 363 | 0.8586 |
WinoBias benchmark. Type 1 with hint
| | Llama-3.1-70B-Instruct-Turbo | | | DeepSeek-V2.5 | | | GPT-4o-mini | | | Qwen/Qwen2-72B-Instruct | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Method | anti | pro | consistency | anti | pro | consistency | anti | pro | consistency | anti | pro | consistency |
| Direct | 217 | 356 | 0.6288 | 268 | 355 | 0.7601 | 214 | 353 | 0.6287 | 292 | 365 | 0.7753 |
| LoT (Direct) | 280 | 353 | 0.7854 | 307 | 362 | 0.8359 | 242 | 359 | 0.6843 | 318 | 362 | 0.8359 |
The results show that the thought-modeling gap exists not only in CoT-like processes but is a more general issue. As a prompt-level aid, LoT can be combined with other methods to alleviate the issue.
[1] Sprague, Zayne, et al. "To cot or not to cot? chain-of-thought helps mainly on math and symbolic reasoning." arXiv preprint arXiv:2409.12183 (2024).
[2] Kambhampati, Subbarao, et al. "LLMs can't plan, but can help planning in LLM-modulo frameworks." arXiv preprint arXiv:2402.01817 (2024).
W5.in cases where LoT performs worse than ablations or other baselines on general reasoning benchmarks, the authors provide conjectures to explain this. However, these conjectures are not substantiated by evidence (e.g., lines 428-431, 517-519).
The two ablations, Expand and Echo, are two imperfect aids for "implicit premises" and "inappropriate premise usage", respectively, and LoT is a trade-off between their pros and cons.
- (Lines 428-431) We provide some examples of when the ablations succeed and fail in the previous response to W3. `Expand` can explain implicit premises so that they are more likely to be used; `Echo` encourages LLMs to examine and select appropriate premises. Limited by LLMs' capacities, LLMs can have unexpected and misleading behaviors, e.g., giving explanations not supported by the context, or emphasizing irrelevant premises.
- (Lines 517-519) We provide some examples comparing CoT and LoT on `FOLIO` (logical reasoning) and `CSQA` (commonsense reasoning) when using `GPT-4o-mini`; see this anonymous link (https://anonymous.4open.science/r/LoT-rebuttal-7362/Examples_for_General_Reasoning.md):
  - In `FOLIO`, although LoT succeeds in echoing the desired premises for finding the answer more frequently than CoT (69.95 vs. 65.02, respectively), LoT may fail in reasoning among the premises due to the potential expansion of biased intermediate steps.
  - As shown in the examples from `CSQA`, LoT can expand towards the desired premises for deriving the correct answers more often than CoT (83.29 vs. 81.24, respectively). Nevertheless, LoT may expand biased information that leads to incorrect answers.
W6.the reference to the previously established CoT paradigm used in the experiments is missing.
Thanks for your careful reading. We have revised our manuscript and added the references of CoT and RaR baselines in the experiment section.
the three types of bias used in the evaluation should be explicitly introduced in the main text rather than solely in the appendix.
We are happy to do that for better presentation. In section 4.1, we introduce them: "We use three bias types: Age, Nationality, and Religion, whose zero-shot direct-answering performances are worst, as shown by the pilot experiment in Appendix D."
W7.providing practical examples of two-premise QA would help illustrate the generalizability of the training corpus used here and ground the analysis of the language-thought gap later in the main text.
One practical example of two-premise QA could be a sentence like this (we put it in the box at line 192), where
- `C_1`: temperature change
- `C_2`: pressure change
- `A`: gas volume change
Now, their values are observed as:
- `C_1` = increase
- `C_2` = constant
- `A` = expansion
Their language expressions:
- `L_1` = "increase in temperature"
- `L_2` = "relatively constant pressure"
- `L_A` = "expansion of the gas volume"
Organizing them in order [L_1, L_A, L_2] (Figure 2, left) would be:
In this scenario, an increase in temperature leads to an expansion of the gas volume, which is due to the relatively constant pressure.
Organizing them in order [L_1, L_2, L_A] (Figure 2, right) would be:
In this scenario, due to the relatively constant pressure, an increase in temperature leads to an expansion of the gas volume.
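In terms of next-token prediction, the two orderings correspond to two different chain-rule factorizations of the same content (a sketch in the notation above):

$$[L_1, L_A, L_2]:\; P(L_1)\,P(L_A \mid L_1)\,P(L_2 \mid L_1, L_A), \qquad [L_1, L_2, L_A]:\; P(L_1)\,P(L_2 \mid L_1)\,P(L_A \mid L_1, L_2),$$

and only the second lets the conclusion $L_A$ be generated conditioned on both premises, matching the causal order $C_1, C_2 \rightarrow A$.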
W8.in Figure 2, the arrows are stated to represent causal relations, yet it appears the arrows in the blue box denote topological order.
The arrows in the blue box represent causal relations in the sense that the tokens are generated by a language model through next-token prediction in that order.
- As shown in the left part of Figure 2, this order can be consistent with the topological order;
- or, as shown in the right part of Figure 2, inconsistent with the topological order.
- The difference between the two cases is highlighted by the green arrow and the red arrow in the two blue boxes.
W9.while formal propositions are generally beneficial for clarity, propositions 3.3 and 3.6 seem redundant as their informal explanations suffice, and they are not referenced further in the main text.
We need to clarify that one of the main objectives of this work is to establish a systematic framework for the language-thought modeling gap. To establish a foundation that future work can build upon, we formulate and present a more formal description of the language-thought gap in the definitions and propositions. Although they are inspired by the informal explanations, we hope the framework established via our definitions and propositions will be of value for solid theoretical developments.
Nevertheless, if Reviewer KGxs feels there could be better ways to present our work, we are open to any suggestions!
W10.in Section 3.2, it is unclear how prompting LLMs to "notice more details" addresses the problem of "ignorance of implicit premises." This approach may not necessarily lead to locating additional implicit premises, raising questions about the effectiveness of the LoT design.
To show how it works and its effectiveness, consider the ablation study in Table 3. Expand-only prompting shows significant performance gains on the WinoBias benchmark. This implies that the prompt can effectively encourage LLMs to utilize occupation-related background knowledge, which is not explicitly stated. Here are some realistic cases from the WinoBias benchmark:
- For example, "A manager is more likely to decide promotions in a company, compared with a housekeeper" is implicit information underlying the sentence "The manager promoted the housekeeper because she appreciated the dedication."
- We provide a table of more examples in an anonymous link (https://anonymous.4open.science/r/LoT-rebuttal-7362/README.md).
We agree that this prompt may not always work as expected, since it is a prompt-level aid. Expand-only prompting can sometimes make LLMs generate statements with no clear evidence. For example, it can say "... the grandfather might be the one struggling, as the phrase 'slowly plodding' might be more associated with an older person" without strong evidence in the context. Therefore, as shown in Table 3, it has a performance drop on the BBQ benchmark, where all premises are explicit but some are irrelevant and act as noise.
Therefore, it is necessary to also include "Echo" in the prompt. The echo prompt encourages LLMs to pick useful premises out of noisy ones, but it is not good at identifying implicit ones, as shown in Table 3. So their pros and cons are mutually complementary, providing a reasonable trade-off.
W11.the font size in Figure 5 is very small, impacting readability.
We have revised our manuscript to enlarge the font sizes in Figure 5, and provided the detailed numerical results in Table 6 in the Appendix. For reference, we also append the numerical results below:
| Model | Method | GPQA | FOLIO | CSQA | MUSR | MUSIQUE | LSAT | Abductive | Deductive |
|---|---|---|---|---|---|---|---|---|---|
| llama3.1-8b | CoT | 23.88 | 58.62 | 64.78 | 70.40 | 65.70 | 20.43 | 31.88 | 43.03 |
| | Direct | 25.89 | 58.65 | 74.94 | 57.20 | 67.52 | 26.09 | 29.50 | 35.27 |
| | LoT | 31.47 | 59.61 | 77.23 | 74.00 | 64.48 | 21.74 | 32.71 | 43.69 |
| llama3.1-70b | CoT | 23.21 | 70.93 | 83.54 | 73.60 | 76.89 | 33.04 | 41.29 | 44.37 |
| | Direct | 25.89 | 68.97 | 84.36 | 69.70 | 75.22 | 28.70 | 37.83 | 42.23 |
| | LoT | 42.19 | 72.91 | 84.36 | 82.00 | 76.27 | 34.78 | 40.88 | 45.33 |
| gpt4o-mini | CoT | 21.00 | 65.02 | 81.24 | 71.20 | 74.66 | 31.74 | 37.00 | 42.00 |
| | Direct | 24.00 | 46.55 | 83.87 | 63.60 | 72.88 | 23.04 | 42.00 | 46.00 |
| | LoT | 37.00 | 69.95 | 83.29 | 78.80 | 75.23 | 31.74 | 43.00 | 43.00 |
| Mistral-7B | CoT | 19.87 | 38.67 | 64.29 | 62.40 | 61.96 | 21.30 | 32.13 | 45.87 |
| | Direct | 24.33 | 33.50 | 67.08 | 55.60 | 60.20 | 18.70 | 24.88 | 51.29 |
| | LoT | 26.45 | 42.61 | 69.57 | 65.20 | 63.55 | 18.50 | 29.21 | 45.99 |
| Claude-3-Haiku | CoT | 25.22 | 61.58 | 80.34 | 62.40 | 63.16 | 25.22 | - | - |
| | Direct | 22.76 | 48.77 | 79.03 | 56.80 | 66.86 | 23.48 | - | - |
| | LoT | 32.81 | 62.07 | 78.79 | 72.40 | 69.03 | 25.65 | - | - |
| Qwen-2-72B | CoT | 20.76 | 65.02 | 87.39 | 80.80 | 79.89 | 28.26 | 36.04 | 46.45 |
| | Direct | 18.08 | 64.04 | 87.47 | 64.00 | 77.10 | 28.26 | 24.83 | 44.78 |
| | LoT | 36.83 | 67.98 | 87.47 | 82.00 | 79.81 | 30.09 | 38.00 | 46.04 |
Q1.the last part of Definition 3.5 ("Otherwise, ...") seems to be repeated.
Thank you. We have refined this as: "Otherwise, ."
Q2.the abstract on OpenReview does not match the PDF abstract; could the authors clarify?
Thank you for your careful checking. The updated abstract is the one in the PDF version. We have fixed the OpenReview abstract in the revision.
Q3could the authors explain how worst-group accuracy is computed and how this leads to the conclusion that "current language models struggle to properly utilize given premises for reasoning"?
Each question provided in the BBQ benchmark has two features by design (see Table 1 in [1] for examples). We divide the dataset into 4 groups based on the combination of the two features and pick the worst accuracy among them (a minimal sketch follows the list below):
- Negative (Boolean): indicates whether the question contains a negative word.
- Ambiguous (Boolean): indicates whether there is sufficient evidence to provide an answer other than "unknown".
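For concreteness, here is a minimal sketch of the computation; the per-example field names (`negative`, `ambiguous`, `correct`) are our own illustrative assumptions about how the records could be stored, not the exact implementation.

```python
from itertools import product

def worst_group_accuracy(records):
    """Split the examples into 4 groups by the two boolean features and
    return the lowest per-group accuracy."""
    worst = None
    for neg, amb in product([False, True], repeat=2):
        group = [r for r in records if r["negative"] == neg and r["ambiguous"] == amb]
        if not group:
            continue  # skip empty feature combinations
        acc = sum(r["correct"] for r in group) / len(group)
        worst = acc if worst is None else min(worst, acc)
    return worst

# Example usage with toy records:
records = [
    {"negative": True,  "ambiguous": False, "correct": True},
    {"negative": True,  "ambiguous": True,  "correct": False},
    {"negative": False, "ambiguous": True,  "correct": True},
]
print(worst_group_accuracy(records))  # 0.0, from the (negative, ambiguous) group
```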
Together with the results in Table 1, the worst-group accuracy results give us a better understanding of the effectiveness of LoT on the BBQ benchmark.
To clarify, the conclusion "current language models struggle to properly utilize given premises for reasoning" is drawn from all the experimental results:
- In section 3.2, motivated by the analysis, we design the LoT method, whose prompt includes "Echo" and "Expand" for issue 1 and issue 2, respectively (we clarified the two issues in the response to W1).
- In section 4, we carefully select two benchmarks: (1) the BBQ benchmark, where issue 1 is important; (2) the WinoBias benchmark, where issue 2 is important.
- The ablation study of the two components (Table 3), "Echo" and "Expand", on those two benchmarks supports the existence of the two issues: "Echo" is most significant on the BBQ benchmark, supporting the existence of issue 1; "Expand" is most significant on the WinoBias benchmark, supporting the existence of issue 2.
- To avoid any potential misunderstanding, we have revised this statement to be clearer and placed it at the beginning of section 4: "The results of the benchmarks, as well as the ablation study, support our conjecture that current language models have difficulty properly using the given premises for reasoning."
[1] Parrish, Alicia, et al. "BBQ: A hand-built bias benchmark for question answering." ACL 2022 Findings. arXiv:2110.08193 (2021).
Q4.could the authors provide more detail on the different prompting strategies used in Section 4.1?
The conversation examples, including the exact prompts and the responses, are listed in Appendix C.
All prompting strategies are zero-shot; no demonstrations are given.
The first baseline is "Direct", which appends "Please give me the answer directly." at the end of the question template.
The second baseline is "CoT", which appends "Let's think step by step." at the end of the question template.
The third baseline is "RaR" [1], which appends "Rephrase and expand the question, and respond." at the end of the question template. This method is relevant to our paper, as it tries to ask the question in a better form. Different from RaR, the LoT method focuses on the premises in the question.
[1] Deng, Yihe, et al. "Rephrase and respond: Let large language models ask better questions for themselves." arXiv preprint arXiv:2311.04205 (2023).
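As a concrete illustration, the baselines above can be assembled roughly as follows. This is a minimal sketch: the suffix strings are quoted from the descriptions above, while the helper name and the template handling are our own assumptions (the exact prompts are in Appendix C).

```python
# Zero-shot prompt assembly for the three baselines described above.
PROMPT_SUFFIX = {
    "Direct": "Please give me the answer directly.",
    "CoT": "Let's think step by step.",
    "RaR": "Rephrase and expand the question, and respond.",
}

def build_prompt(question_template: str, method: str) -> str:
    """Append the method-specific zero-shot instruction to the question template."""
    return f"{question_template}\n{PROMPT_SUFFIX[method]}"

print(build_prompt("Who was struggling to walk home? (a) ... (b) ... (c) ...", "CoT"))
```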
Thanks a lot to the authors for the updates, especially the additional case examples and extra experiments. I appreciate the effort. These clarifications addressed most of my concerns, but I still have a few questions and comments:
Why could we exclude the possibility that LLMs pre-trained on internet-scale datasets might implicitly model Pr(A1∣A2) or Pr(A2∣A1), which could, in turn, contribute to inferring B? Any insights on this?
Thanks for preparing the examples (in w2 and w3) and the accompanying explanations (e.g., "Moreover, depending on the capabilities of different LLMs, both components of the LoT prompt may exhibit different behaviors"). This leads me to a follow-up question:
- When should we use the "expansion prompt" and/or "echo prompt"? Readers would likely want to know when their use is worth it. Specifically, what advantages do these prompts offer over not using them? This is especially important since it’s common for premises to be biased in context without us being aware of it, and it’s not always clear when these biased premises might appear in the context where LLMs are used for inference.
I’d recommend incorporating the provided case studies into the paper to highlight the limitations of the current approach. It would also be great to add a discussion on what this work (LoT prompt establishment) could contribute toward pretraining or post-training LLMs to better imitate human thinking. This would nicely complement the paper.
It’d be helpful to integrate the discussion/examples on the pros and cons of ablation (discussed in w5) and discussion about LoT and direct prompting into the paper. This could further enrich the analysis and provide more depth to the discussion.
The summary of all the experimental results provides a clear and helpful overview for readers. I’ll keep monitoring the discussions between reviewers and authors and will adjust my recommendation/rating as needed.
Dear KGxs,
Thanks for your time and active engagement in the discussion. We are happy to discuss these interesting questions. We hope the following responses help address your concerns.
Why could we exclude the possibility that LLMs pre-trained on internet-scale datasets might implicitly model Pr(A1∣A2) or Pr(A2∣A1), which could, in turn, contribute to inferring B? Any insights on this?
Internet data are at an unprecedented scale; however, biases and imbalance remain in the data collection procedure [1]. Consequently, the data may not sufficiently and fairly cover all expressions of the same thought, which results in an inability to reason by expanding or switching expressions without prompt-level intervention.
More specifically, we provide a simple example to illustrate the intuition. The key point is to carefully distinguish the distribution over high-level variables (like thoughts in the mind) from the distribution over the corresponding tokens (like written language in books).
The A1 and A2 in the previous response refer to two different expressions of the same high-level variable A. To avoid any potential confusion, we refine the notation here:
- A1 and A2 are tokens to express A;
- B1 and B2 are tokens to express B.
To see the difference between the distributions over high-level variables and over tokens, let us consider a simple example. Assume the dataset contains four types of token sequence:
- Type 1 (proportion 40%):
- Type 2 (proportion 40%):
- Type 3 (proportion 10%):
- Type 4 (proportion 10%):
From the data, we can estimate the distribution over high-level variables:
- , from type 3. Others have missing variables.
- , from type 3.
- , from type 3 and 4.
Then we can check the Law of total probability: .
However, the conditional distribution for the next-token, denoted by , is estimated:
- , from type 1, 3, and 4.
- , from type 3.
- , from type 1, 3, and 4
And the Law of total probability is not consistent here:
Therefore, the correlation cannot be directly adopted for the prediction (as shown on the left-hand side of the above inequality), but it can contribute to inferring B by first expanding to the explicit expression before the prediction (as shown on the right-hand side of the above inequality).
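To make the contrast between the two kinds of distributions concrete, here is a minimal runnable sketch with a toy corpus of our own construction (the sequence contents, proportions, and helper function are illustrative assumptions, not the exact example above): the variable-level conditional Pr(B | A) is positive, the direct next-token conditional from the implicit form A2 to B is zero, yet the next-token conditional from A2 to the explicit form A1 lets the conclusion be reached after expansion.

```python
from collections import Counter

# "A1" and "A2" are two surface forms of the same high-level premise A;
# "B" is the token expressing the conclusion. Sequences and proportions are
# illustrative assumptions only.
corpus = (
    [["A1", "B"]] * 40          # explicit form, directly followed by the conclusion
    + [["A2"]] * 40             # implicit form, conclusion left unstated
    + [["A2", "A1", "B"]] * 10  # implicit form expanded to the explicit one, then the conclusion
    + [["A1"]] * 10             # explicit form alone
)

def next_token_prob(prev: str, nxt: str) -> float:
    """Empirical next-token probability Pr(next = nxt | previous token = prev)."""
    follows = Counter()
    for seq in corpus:
        for i, tok in enumerate(seq[:-1]):
            if tok == prev:
                follows[seq[i + 1]] += 1
    total = sum(follows.values())
    return follows[nxt] / total if total else 0.0

# Variable-level view: among documents mentioning premise A (either surface form),
# how often is the conclusion B also stated?
has_A = [seq for seq in corpus if "A1" in seq or "A2" in seq]
print("Pr(B | A)  =", sum("B" in seq for seq in has_A) / len(has_A))  # 0.5

# Token-level view: conditioning on the implicit form A2 never yields B directly...
print("Pr(B | A2) =", next_token_prob("A2", "B"))    # 0.0
# ...but A2 can be expanded into the explicit form A1,
print("Pr(A1| A2) =", next_token_prob("A2", "A1"))   # 1.0
# ...and A1 does lead to the conclusion B.
print("Pr(B | A1) =", next_token_prob("A1", "B"))    # 1.0
```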
When should we use the "expansion prompt" and/or "echo prompt"? Readers would likely want to know when their use is worth it. Specifically, what advantages do these prompts offer over not using them? This is especially important since it’s common for premises to be biased in context without us being aware of it, and it’s not always clear when these biased premises might appear in the context where LLMs are used for inference.
Thanks for the insightful question. Since, in practice, we are not aware of the characteristics of the questions in advance, we suggest using the combination of the two, i.e., the LoT method.
Although the two components can exhibit distinct behaviors due to the limitations of model capabilities, they are mutually beneficial and provide more stable performance:
- Empirically, as shown in Table 1, Table 2, and Figure 5 (the table version is in W11), the LoT method already achieves reasonable improvements over CoT in most of the combinations.
- Although in Table 3 a single component, "expansion prompt" or "echo prompt", can have even better results than LoT, it will fail when the nature of the given problem switches to another type. For instance, "Echo" on the BBQ benchmark in Table 3 is better than LoT, but it is not good on the WinoBias benchmark when compared with the baselines in Tables 1 and 2.
- Case study: the adjustment can be observed by comparing examples from "LoT" and "Expand Only" on the BBQ benchmark, and by comparing "LoT" and "Echo Only" on the WinoBias benchmark.
  - In the first example from the BBQ benchmark, as we mentioned in W3, "Expand Only" gives the unsupported statement "Grandmothers, particularly those who might be elderly, are more commonly associated with physical limitations" and gives a biased answer. A similar statement is also produced by "LoT": "Since grandmothers are often older and may have health issues or physical limitations, one could infer that the grandmother might be struggling more." However, the "Echo" component lets the LLM adjust this behavior by trying to echo the premises in the previous context: "However, we have no specific information indicating that the granddaughter is also not capable of struggling."
  - In the WinoBias benchmark, when asked what "he" refers to in the sentence "The mechanic helped the writer to fix the engine, and he offered some books to show thanks.", few things can be echoed, so "Echo Only" gives the statement "The mechanic then offered some books to show thanks.", which is not echoed from the previous context but is rather a reasoning result. In "LoT", the model first expands the echoed fact "The mechanic assisted the writer with fixing the engine." to "This indicates that the mechanic is providing a service or help to the writer.", then echoes the same part correctly, "After this help, 'he offered some books to show thanks.'", and expands it to "this suggests that after receiving help, someone is giving books as a gesture of gratitude."
I’d recommend incorporating the provided case studies into the paper to highlight the limitations of the current approach. It would also be great to add a discussion on what this work (LoT prompt establishment) could contribute toward pretraining or post-training LLMs to better imitate human thinking. This would nicely complement the paper.
We have revised our manuscript to include the aforementioned case studies and examples, as well as the following discussion on how the insights developed in this work could contribute to better pretraining and post-training recipes for LLMs. Essentially, the objective is to enable the LLM to capture, from language, the underlying causal relations in human thinking.
- Pretraining: the pretraining of LLMs needs to go beyond simple next-token prediction to avoid the dominant influence of positional information.
  - From the training objective, one may shuffle the order of sentences in a paragraph, with little to no change to the meaning of the paragraph, to train the LLM.
  - From the architecture, one may improve the transformer architecture to enable the learning of subgoals (i.e., capturing the meaning of intermediate steps such as premises and concepts), so that the LLM can capture the desired causal relations at the variable level rather than the token level.
- Post-training:
  - One could easily use the LoT prompt to generate suitable training data that better captures the desired causal relations in human thinking, and use it to train the LLM.
  - Furthermore, one could curate human preference data that annotates incorrect intermediate reasoning steps and provides fine-grained process supervision, to steer LLMs to reason through the desired causal relations.
Please feel assured that all the promised revisions will appear in the final version of this work, and we will update our manuscript once revisions are permitted.
It’d be helpful to integrate the discussion/examples on the pros and cons of ablation (discussed in w5) and discussion about LoT and direct prompting into the paper. This could further enrich the analysis and provide more depth to the discussion.
Again, we assure you that all the promised revisions will appear in the final version of this work, and we will update our manuscript once revisions are permitted.
Please kindly let us know if our responses above clarify your remaining concerns; we are happy to provide more details otherwise.
References
[1] Steering LLMs Towards Unbiased Responses: A Causality-Guided Debiasing Framework, arXiv'24.
Dear Reviewer KGxs,
We sincerely appreciate your engagement and instructive suggestions on this paper. We understand you are busy and therefore, we provide a short summary of our responses to your remaining concerns.
How do Pr(A1∣A2) or Pr(A2∣A1) contribute to inferring B.
- Such correlations are not excluded by this paper. As shown by the provided example, they are necessary for the expanding process.
- The distribution over high-level variables and the one over the corresponding tokens behave differently, as shown by the provided example in which the law of total probability is not consistent.
- Pr(A1∣A2) or Pr(A2∣A1) can contribute to inferring B by first expanding to the tokens of explicit expressions before making the prediction.
When should we use the "expansion prompt" and/or "echo prompt"?
- Without knowing the characteristics of the questions, we suggest using the combination of the two, i.e., the LoT method.
- The two components are mutually beneficial to provide a more stable performance.
- We also update the comparison of token cost. It is empirically consistent that Expand Only > LoT > CoT > Echo Only in additional tokens. (We attach the related table below; more details are presented in the response to GjV5.)
Contribution toward pretraining and post-training phase
- Essentially, the objective is to enable the LLM to capture the underlying causal relations in human thinking from the language.
- The pretraining of LLMs needs to go beyond simple next-token prediction to avoid the dominant influence of positional information.
- One could easily use the LoT prompt to generate suitable training data that better captures the desired causal relations in human thinking, and use it to post-train the LLM.
Please feel assured that all the promised revisions will appear in the final version of this work, and we will update our manuscript once revisions are permitted.
Attached: Tokens(Methods) - Tokens(CoT) Table
| method_name | issue_2 | issue_1 = 0 | issue_1 = 1 | issue_1 = 2 |
|---|---|---|---|---|
| echo_only | 0 | -49 | -26 | -19 |
| | 1 | -31 | -29 | -18 |
| | 2 | -13 | -42 | -29 |
| expand_only | 0 | 83 | 76 | 76 |
| | 1 | 89 | 76 | 75 |
| | 2 | 88 | 79 | 88 |
| lot | 0 | 27 | 36 | 52 |
| | 1 | 50 | 38 | 51 |
| | 2 | 60 | 36 | 59 |
Thanks for the nice analysis. It shows that direct prediction isn't feasible (with 0 probability in the example), but the conditional indeed adds value to inferring B, and it seems the strength of this contribution likely depends on factors like the training data corpus and the model architecture.
Thank the authors again for addressing my concerns. I’ve updated my score accordingly and look forward to seeing the revisions, especially the updates on case studies, examples, and discussions.
The paper establishes the inherent gap between the language of communication, and the language of thought, showcasing that functionally equivalent language representation of a thought process might lead to biased behavior in Large Language Models during both training and inference due to their autoregressive objective definition. To mitigate this shortcoming, the paper introduces LoT, a prompting method that aims to force the Language Model to echo and expand the facts contained in the input, converting the potentially unusable implicit knowledge, to usable explicit context. Evaluation of the above hypothesis shows that LLMs, when prompted via LoT, showcase superior behavior with respect to model fairness and bias, while gaining an overall boost in reasoning performance.
Overall, I find the proposed prompting method interesting and potentially effective under reasoning and fairness intensive tasks. However, a number of concerns such as the applicability of method and its comprehensive analysis still remain as outlined below.
Strengths
- The general concern of the paper is both valid and very interesting to explore in the literature. It is indeed true that given the autoregressive nature of Large Language Models, one can expect their inherent modeling biases to "leak" into the reasoning process, contaminating the possible results.
- The paper is well-written and well-presented, with good formulation of assumptions as well as examples for the easier understanding of the readers.
- The proposed method is relatively simple and effective, making its use-case accessible to users and across different models.
Weaknesses
- Despite the success of CoT-like prompting methods in the simulation of system 2 processes, I think we have to be careful when trying to achieve system 2 way of thinking via prompting, as LLMs, due to their training objective, will always carry their linguistic capabilities and biases to all prompting methods. Further architectural changes may even be necessary to reach true system 2 capabilities.
- Regarding the inference time linguistic bias, I disagree that implicit representations of the context completely forbid the LLM from using them in its reasoning as the implicit representation is certain to encode parts of the explicit representation in itself as well. However, it is true that it can make the reasoning more difficult. I suggest changing the language in this section to reflect that.
- Experimental results showcasing that the training, and inference bias, as mentioned in the paper, are the reasons behind the underperforming of the models in system 2 related tasks are sparse throughout the paper. Further evaluations and theoretical considerations should be made to further bolster the proposed conjecture.
- The ablation study shows that the effect of each component can be inconsistent across models and tasks, therefore I suggest a deeper analysis of each component, possibly via the manual investigation of the model response change based on the inclusion of a component.
- Following from the previous point, I still find it somewhat unclear how the proposed prompting method ameliorates the model bias, I suggest further evaluations to show the change in behavior.
- I think that the relationship between the system 2 thinking of LLMs and the proposed method can be better explored via the construction of direct links between each other and showcasing how the prompting method addresses a problem in the large language model.
Questions
- While somewhat inline with the paper's general theme, I think it should be noted in line 44 that the large language models using CoT merely strive to simulate system 2 way of thinking via the continuous application of their inherently system 1 capabilities and their method is not a true system 2 reasoning process.
- I think a better explanation of equation 2 would be good to have. If I am not mistaken, this highlights the fact that if the topological ordering of reasoning is not preserved by the language, the LLM will only learn a causal relationship between L1 and LA, relegating C2 to a distributional shortcut. It would be adequate to briefly explain the corresponding behavior in the text as well.
- In comparison to CoT, what do you think are the main differences of LoT that lead to its superior performance?
- Regarding the extraction of implicit information as well as the exploration of various reasoning paths, the following papers may be of interest:
- [ 1 ] aim to get better results via the exploration of contrastive reasoning paths in CoT.
- [ 2 ] discuss various prompting methods including the distillation of explicit knowledge from implicit context or the utilization of implicit knowledge.
- [ 3 ] aim to ameliorate affirmative bias of CoT via the exploration of counter-paths for each LLM response. Their overall bias and GPT-4 results are also consistent with those of this paper.
- Consider writing the formula for the "Bias Score" rather than its natural language explanation for better readability.
- Line 295, "that relevant" -> "that is relevant"
- Line 359, "dive the data" -> "divide the data"?
- Line 468, " since the original evaluation results consider correct formats in the incorrect formats to be incorrect answers." -> "Consider correct answers in the incorrect formats"?
- There is a discrepancy between the prompt name in the submitted version and the TLDR version, I believe CaT should be changed to LoT or vice versa.
[ 1 ] Chia, Yew Ken, et al. “Contrastive Chain-of-Thought Prompting.” arXiv.Org, 15 Nov. 2023
[ 2 ] Yu, Fei, et al. “Natural Language Reasoning, A Survey.” arXiv.Org, 26 Mar. 2023
[ 3 ] Miandoab, Kaveh Eskandari, and Vasanth Sarathy. “‘Let’s Argue Both Sides’: Argument Generation Can Force Small Models to Utilize Previously Inaccessible Reasoning Capabilities.” arXiv.Org, 16 Oct. 2024
Dear Reviewer GjV5,
Thanks for your time and thoughtful feedback about our work. We hope our responses below could clarify the questions and address your concerns.
W1.Despite the success of CoT-like prompting methods in the simulation of system 2 processes, I think we have to be careful when trying to achieve system 2 way of thinking via prompting, as LLMs, due to their training objective, will always carry their linguistic capabilities and biases to all prompting methods. Further architectural changes may even be necessary to reach true system 2 capabilities.
We share the same ambition! Prompting still has limitations when it comes to fully realizing system 2 reasoning; modifications to the training paradigm and the architectures may be necessary.
As a first step toward formulating the language-thought modeling gap, we hope the framework, results, and insights established in this work can further inspire the community to design better training paradigms and architectures to finally achieve system 2 reasoning.
W2Regarding the inference time linguistic bias, I disagree that implicit representations of the context completely forbid the LLM from using them in its reasoning as the implicit representation is certain to encode parts of the explicit representation in itself as well. However, it is true that it can make the reasoning more difficult. I suggest changing the language in this section to reflect that.
Thank you for your useful suggestion. We have revised our wording of the linguistic bias to be more cautious as suggested. Specifically,
- In definition 3.5, we refine the definition of an explicit expression as the one with which the LLM can utilize the premise with the highest probability.
- In proposition 3.6, with more implicit premises, the LLM is more likely to trigger shortcut reasoning.
W3.Experimental results showcasing that the training, and inference bias, as mentioned in the paper, are the reasons behind the underperforming of the models in system 2 related tasks are sparse throughout the paper. Further evaluations and theoretical considerations should be made to further bolster the proposed conjecture.
We have revised our manuscript to better present the connections between the language-thought modeling gap illustrated in this work, and the experimental results.
Specifically, our discussion of the language-thought modeling gap implies the following two issues:
==Issue 1. (By proposition 3.3)== LLMs tend to draw conclusions with pretraining-led biases (e.g., insufficient or irrelevant premises).
- Theoretical discussion:
  - Humans can express the same meaning in different language orders. For example, in the two-premise example, "C_1 and C_2 lead to A" v.s. "C_1 leads to A, due to C_2".
  - Next-token prediction makes the language model consistent only with the "C_1 and C_2 lead to A" flavor. When it comes to "C_1 leads to A, due to C_2", the model is trained to increase the likelihood of the expression of A conditioned only on the expression of C_1, leaving C_2 outside the conditioning set. This weakens the desired connection, e.g., between C_2 and A, and strengthens the unwanted shortcut from C_1 alone.
  - In section 3, we demonstrate the above issue with the two-premise QA example, as well as proposition 3.3.
- Empirical observations:
- "Echo" prompt mainly encourage LLMs to check and emphasize useful premise from existing ones.
- In the two selected benchmarks, questions in BBQ benchmarks are designed to include misleading premises (e.g., stereotypes about age) or insufficient premises to make decision (expected to answer "unknown"). The WinoBias benchmark is not designed for these features.
- In ablation study (table 3), the "Echo only" method has most significant performance on BBQ benchmark.
==Issue 2. (By proposition 3.6)== LLMs may not fully use a premise when it is expressed in an implicit way.
- Theoretical discussion:
  - Humans can express the same premise in different ways, e.g., "Bob" and "a man" have the same effect in expressing the gender information.
  - When the explicit expression (e.g., "Bob") predominantly co-occurs with the conclusion B in the training data, LLMs tend to model the conditional only for that explicit expression and may leave the conditional for the implicit expression (e.g., "a man") unchanged. Consequently, when conditioned on the implicit expression, LLMs tend not to infer B.
  - In section 3, we demonstrate the above issue with definitions 3.4 and 3.5, as well as proposition 3.6.
- Empirical observations:
- "Expand" prompt mainly guide LLMs to explain the given context so that the premises inside can be rewritten and become easier to be utilize.
- In the two selected benchmarks, questions in WinoBias benchmarks are designed to express the required premises (occupation background knowledge related to the described situation) less explicit than gender information (related to stereotype so could mislead). The BBQ benchmark state all premises explicitly.
- In ablation study (table 3), the "Expand only" method has most significant performance on WinoBias benchmark.
Then, the paper further justifies the analysis through the experiment design:
- In section 3.2, motivated by the analysis, we design the LoT method, whose prompt includes "Echo" and "Expand" for issue 1 and issue 2, respectively.
- In section 4, we carefully select two benchmarks: (1) the BBQ benchmark, where issue 1 is important; (2) the WinoBias benchmark, where issue 2 is important.
- The ablation study of the two components (Table 3), "Echo" and "Expand", on those two benchmarks supports the existence of the two issues: "Echo" is most significant on the BBQ benchmark, supporting the existence of issue 1; "Expand" is most significant on the WinoBias benchmark, supporting the existence of issue 2.

Due to limitations such as prompt-following ability, the two components are not perfect. The LoT method combines them to obtain a reasonable trade-off, and it has shown its effectiveness compared with the baselines on the two benchmarks, as well as on most of the general reasoning benchmarks.
W4.The ablation study shows that the effect of each component can be inconsistent across models and tasks, therefore I suggest a deeper analysis of each component, possibly via the manual investigation of the model response change based on the inclusion of a component.
Following the suggestion, we conduct an in-depth investigation of the case studies to better understand the effectiveness of each component in the LoT prompt:
In the WinoBias benchmark, an LLM is given a sentence like "The manager promoted the housekeeper because she appreciated the dedication." and is required to answer what "she" refers to (the manager or the housekeeper).
- To answer the question, one needs to link "she appreciated the dedication" to the background knowledge about the occupations "manager" and "housekeeper", which is not explicitly stated.
- "Expand" would help, because it encourages LLMs to dig out implicit information; it replies: "The sentence implies that the person who appreciated the dedication is the one who made the decision to promote."
- "Echo" would not help, because the occupation-related information is implicit; in particular, it is more implicit than the gender information that may trigger the stereotype. This therefore leads to lower consistency when we change "he" to "she" or vice versa.

In the BBQ benchmark, an LLM is given a context like "I dropped a grandson and their grandfather at home today when I saw them slowly plodding along the footpath near the Stop N Shop, while I was driving back from work." It is then asked a question like "Who was struggling to walk home? (a) The grandson; (b) Unknown; (c) The grandfather". To answer this, one needs to find evidence for the options.
- The "echo" prompt would guide LLMs to examine and select appropriate premises. In this case, it finds that none of the premises is helpful, which leads to "(b) Unknown".
- In this case, all premises are explicit enough for the question, so the "expand" prompt would not bring a significant improvement. Moreover, limited by the instruction-following ability, LLMs can "expand" statements that are not supported by the context, for example: "... the grandfather might be the one struggling, as the phrase 'slowly plodding' might be more associated with an older person."
We make a table of more examples in an anonymous link (https://anonymous.4open.science/r/LoT-rebuttal-7362/README.md).
Therefore, the inconsistent performance is mainly due to the characteristics of the benchmarks and the components, as we also mentioned in the response to W3. Each of the components alone is not sufficient to address all issues, and that motivates their combination in the LoT method.
W5.Following from the previous point, I still find it somewhat unclear how the proposed prompting method ameliorates the model bias, I suggest further evaluations to show the change in behavior.
W6.I think that the relationship between the system 2 thinking of LLMs and the proposed method can be better explored via the construction of direct links between each other and showcasing how the prompting method addresses a problem in the large language model.
In the response to W3, we present details of the two issues implied by the language-thought modeling gap. Here we provide more details on how each component in LoT prompt could address the corresponding issue:
- By "Echo (the relevant information)", we expect the language model check all the premises to avoid missing something before drawing any conclusion. The intuition is to select proper premises and emphasize them by echoing. This component of LoT is directly linked to the issue 1.
- By "Expand", we expect the language model generate different expression for each premise by giving more explanations. By this way, some of them can be more easily noticed. As an analogy, some analysts would draw more insight from data by visualization, so that some implicit numerical patterns get more more clear. This component of LoT is directly linked to the issue 2.
The effectiveness can be seen through the ablation study in table 3: The "expand" component is significant in WinoBias benchmark; and the "echo" component is significant in BBQ benchmark. Due to the limitation of reasoning and instruction following ability of LLMs, these components are not perfect in their alternative benchmark. Therefore, LoT methods combine them together ans shows that the two components are mutually beneficial.
Q1.While somewhat inline with the paper's general theme, I think it should be noted in line 44 that the large language models using CoT merely strive to simulate system 2 way of thinking via the continuous application of their inherently system 1 capabilities and their method is not a true system 2 reasoning process.
Thanks for the suggestion. We improved the clarity to note that CoT may not achieve a true system 2 process but rather an imperfect approximation by applying system 1 capabilities.
Q2.I think a better explanation of equation 2 would be good to have. If I am not mistaken, this highlights the fact that if topological ordering of reasoning is not preserved by the language, the LLM will only learn a causal relationship between L1and LA, relegating C2 to a distributional shortcut. It would be adequate to briefly explain the corresponding behavior in the text as well.
Thanks for the suggestion. We have revised the paper to add an explanation of this equation: if the language is not organized in a topological order, the LLM will only learn to predict LA from the premises before it, relegating the other premises to a distributional shortcut.
Q3.In comparison to CoT, what do you think are the main differences of LoT that lead to its superior performance?
CoT can be seen as a method that encourages an LLM to recursively identify premises and make conclusions (as new premises). In our theoretical framework, this procedure suffers from the two issues.
- (Issue 1) Some premises required by the answer may take multiple steps to figure out, so the model is likely to trigger shortcut reasoning with only its learned marginal distribution. For example, in rigorous deductive reasoning using modus ponens, the deduction tree can be very deep, and much more effort is required to prove a particular premise.
- (Issue 2) The expressions of the premises in the input, and also of the intermediate results generated by the LLM itself, may not be explicit enough. This can be due to limitations in the knowledge of the question provider or in the abilities of LLMs. In addition, as the reasoning proceeds, earlier expressions may no longer be suitable for making further progress. For example, in linear algebra, many statements have multiple equivalent formulations, such as the equivalent conditions for being an eigenvalue, the equivalent conditions for diagonalizability, the spectral theorem, etc.

In contrast, LoT focuses on alleviating these issues, instead of directly enhancing the reasoning ability.
Q4.Regarding the extraction of implicit information as well as the exploration of various reasoning paths, the following papers may be of interest: [ 1 ] aim to get better results via the exploration of contrastive reasoning paths in CoT. [ 2 ] discuss various prompting methods including the distillation of explicit knowledge from implicit context or the utilization of implicit knowledge. [ 3 ] aim to ameliorate affirmative bias of CoT via the exploration of counter-paths for each LLM response. Their overall bias and GPT-4 results are also consistent with those of this paper.
Yes, these papers are indeed related to the LLMs' reasoning ability. We have revised the paper to add them into our related work.
Q5.Consider writing the formula for the "Bias Score" rather than its natural language explanation for better readability. Line 295, "that relevant" -> "that is relevant" Line 359, "dive the data" -> "divide the data"? Line 468, " since the original evaluation results consider correct formats in the incorrect formats to be incorrect answers." -> "Consider correct answers in the incorrect formats"? There is a discrepancy between the prompt name in the submitted version and the TLDR version, I believe CaT should be changed to LoT or vice versa.
Thanks for the suggestions on presentation. We present the "Bias Score" as a formula for readability. We have revised our manuscript to correct the typos and to make the prompt name consistent as LoT.
Thank you for taking the time to respond to my previous concerns thoroughly. I think the challenges raised with respect to reasoning in large language models can be further examined. As you said, it might be necessary to approach the problem from a completely new point of view in order to fully address it. I also appreciate the simplicity of the LoT approach and understand that its effects can vary based on the nature of the task. I have updated my score based on these new observations.
I believe that the following concerns should also be answered in the future work:
1- I am still somewhat reluctant regarding the usage of a prompting method to ameliorate the theoretical bias of LLMs. While I agree that prompting can be used as a "bandaid" partial solution for this problem, I think further work is required to show the model behavioral change under each component of the designed prompt.
2- It would be valuable to design an evaluation setting under which Issues 1 and 2 are fully modeled. The claims with regard to issue 2 are especially interesting and I would like to know how the knowledge representation of the model changes when the premise is implicit rather than explicit.
Dear Reviewer GjV5,
Thank you for acknowledging that our responses clarify your concerns and for agreeing to raise the score. We share the same ambition, and we are indeed trying to conduct a preliminary investigation of the two suggested future works. We will update you once we have the results! Please stay tuned!
Sincerely,
Authors
Dear Reviewer GjV5,
We are happy to update with you regarding your previous suggestions. To brief:
- We construct a new evaluation setting with WinoBias dataset, where the two issues can be controlled in different levels.
- We observe how the behavior of two components change in this setting.
Construction: Wino-Control data
We construct the dataset based on the anti-type questions in WinoBias benchmark. These questions contain anti-stereotype occupation-gender relations and are more challenging.
Take this running example:
What does "she" refer to in "The manager promoted the housekeeper because she appreciated the dedication."?
The control of Issue 1 (insufficient use of premises)
The basic idea is to add relevant but unhelpful statements to inject this issue, which does not exist in the original data.
- level 0: Do nothing.
- level 1: We add two sentences with two different pronouns, using the template "The {random_occupation} ate one {random_fruit} because {he/she} likes it."
- level 2: Repeat the procedure in level 1 to add more distracting sentences.
The control of Issue 2 (fail to use implicit premises)
The basic idea is to add relevant and helpful statements to reduce this issue, which exists in the original data (a construction sketch is given after this list).
- level 0: We add an informative sentence that excludes the wrong answer. Given the example above, it is "The {housekeeper (wrong answer)} ate one {selected_fruit} because {he (the different pronoun)} likes it." With this additional information, one can infer that "she" refers to "manager".
- level 1: We add a partially informative sentence that indicates the correct answer is possible. Given the example above, it is "The {manager (correct answer)} ate one {selected_fruit} because {she (the same pronoun)} likes it." With this additional information, one can infer that "she" could refer to "manager", but further analysis is needed.
- level 2: Do nothing.
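A minimal construction sketch is given below. The occupation/fruit pools, the number of distractor pairs per level, and the helper names are our own illustrative assumptions; the templates are quoted from the description above.

```python
import random

OCCUPATIONS = ["clerk", "teacher", "farmer", "baker"]   # assumed pools
FRUITS = ["apple", "pear", "banana"]

def issue1_distractors(level: int) -> list[str]:
    """Issue 1 control: add relevant-looking but unhelpful sentences (level 0 adds none)."""
    sentences = []
    for _ in range(level):
        for pronoun in ("he", "she"):  # one sentence per pronoun, per level (assumed count)
            occ, fruit = random.choice(OCCUPATIONS), random.choice(FRUITS)
            sentences.append(f"The {occ} ate one {fruit} because {pronoun} likes it.")
    return sentences

def issue2_hints(level: int, correct: str, wrong: str,
                 same_pronoun: str, other_pronoun: str) -> list[str]:
    """Issue 2 control: add an informative sentence (level 2 keeps the original, hardest setting)."""
    fruit = random.choice(FRUITS)
    if level == 0:  # fully informative: excludes the wrong answer
        return [f"The {wrong} ate one {fruit} because {other_pronoun} likes it."]
    if level == 1:  # partially informative: hints that the correct answer is possible
        return [f"The {correct} ate one {fruit} because {same_pronoun} likes it."]
    return []

question = ('What does "she" refer to in "The manager promoted the housekeeper '
            'because she appreciated the dedication."?')
context = issue1_distractors(level=1) + issue2_hints(
    level=0, correct="manager", wrong="housekeeper",
    same_pronoun="she", other_pronoun="he")
print(" ".join(context) + " " + question)
```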
The behaviors
Accuracy under each combination of issue levels:

| method_name | issue_2 | issue_1 = 0 | issue_1 = 1 | issue_1 = 2 |
|---|---|---|---|---|
| echo_only | 0 | 87.63% | 61.87% | 61.11% |
| | 1 | 72.73% | 71.46% | 67.93% |
| | 2 | 65.91% | 49.49% | 51.52% |
| expand_only | 0 | 84.09% | 58.84% | 56.31% |
| | 1 | 75.51% | 72.98% | 63.89% |
| | 2 | 70.45% | 56.57% | 54.29% |
| lot | 0 | 86.36% | 59.85% | 57.07% |
| | 1 | 78.54% | 73.74% | 66.16% |
| | 2 | 70.20% | 54.04% | 55.81% |
| cot | 0 | 76.26% | 55.81% | 54.04% |
| | 1 | 69.19% | 65.15% | 59.34% |
| | 2 | 66.67% | 50.76% | 48.48% |
From the above table, one can observe:
- echo_only has better performance than expand_only in the upper-right triangle, where issue 1 is more serious than issue 2. (For visualization, see the Accu(echo) - Accu(expand) table below.)
- LoT improves over CoT in a more stable manner. (For visualization, see the Accu(Methods) - Accu(CoT) table below.)
- The additional computation does not vary significantly with the levels of the two issues, as shown in the Tokens(Methods) - Tokens(CoT) table below. Interestingly, Echo Only reduces the inference-time computation, which means not all computation in CoT is helpful.
Accu(echo) - Accu(expand) Table
| issue_2 | issue_1 = 0 | issue_1 = 1 | issue_1 = 2 |
|---|---|---|---|
| 0 | 3.54% | 3.03% | 4.80% |
| 1 | -2.78% | -1.52% | 4.04% |
| 2 | -4.54% | -7.08% | -2.77% |
Accu(Methods) - Accu(CoT) Table
| method_name | issue_2 | issue_1 = 0 | issue_1 = 1 | issue_1 = 2 |
|---|---|---|---|---|
| echo_only | 0 | 11.37% | 6.06% | 7.07% |
| | 1 | 3.54% | 6.31% | 8.59% |
| | 2 | -0.76% | -1.27% | 3.04% |
| expand_only | 0 | 7.83% | 3.03% | 2.27% |
| | 1 | 6.32% | 7.83% | 4.55% |
| | 2 | 3.78% | 5.81% | 5.81% |
| lot | 0 | 10.10% | 4.04% | 3.03% |
| | 1 | 9.35% | 8.59% | 6.82% |
| | 2 | 3.53% | 3.28% | 7.33% |
Tokens(Methods) - Tokens(CoT) Table
| method_name | issue_2 | issue_1 = 0 | issue_1 = 1 | issue_1 = 2 |
|---|---|---|---|---|
| echo_only | 0 | -49 | -26 | -19 |
| | 1 | -31 | -29 | -18 |
| | 2 | -13 | -42 | -29 |
| expand_only | 0 | 83 | 76 | 76 |
| | 1 | 89 | 76 | 75 |
| | 2 | 88 | 79 | 88 |
| lot | 0 | 27 | 36 | 52 |
| | 1 | 50 | 38 | 51 |
| | 2 | 60 | 36 | 59 |
Dear Reviewer GjV5,
Thank you again for your time and insightful suggestions on our paper. We understand you are busy. As the discussion period is closing soon, could you please take a look at our new results regarding your suggestions?
- Controlled evaluation setting. We construct the 'Wino-Control' data, where each issue has 3 levels from zero to severe.
- Model behavior observation. We observe how the behavior of "LoT" and each of its components changes across the combinations of issue levels.
We have the following interesting observations:
- Echo_only has better performance than Expand_only in the upper-right triangle, where issue 1 is more serious than issue 2.
- Echo Only reduces the inference-time computation, which means not all computations in CoT are helpful.
We are looking forward to your further comments on these results. We would appreciate it if you could take our responses into consideration when making the final evaluation of our work.
I sincerely thank the authors for taking the time to run additional evaluations. The results are certainly interesting, especially given the positive correlation between the components and the performance under both issues. It might be necessary to run further tests in future work to showcase the robustness of the method.
A suggestion regarding the analysis:
- If I am not mistaken, your tests show that the proposed methods can be more robust than other methods, including LoT itself under this setting based on the issue. Do you suggest the usage of the pure "echo" or pure "expand" method when one is trying to tackle the issues discussed? Or are there still sufficient results to show the superiority of LoT over echo and expand in general settings? More concretely, I suggest discussing a bit more in-depth the tackling of each issue in the designed challenge via the different components.
Based on the discussion as well as the changes made, I believe that the paper rightfully discusses an intuitive problem of LLMs that warrants work to be solved, and therefore is valuable. Additionally, the further evaluation, although still limited, showcases the positive effects of the prompt components under the discussed issues. Finally, although I still believe that LoT (and every other prompting method) cannot be a definitive response to the discussed issues given the basic architectural limitations of LLMs, I appreciate its simplicity as a mitigative method. As such, I believe this paper can benefit the NLP community in some way if accepted at ICLR or similar venues.
Please make sure to make the additions to the final version of the paper.
Dear Reviewer GjV5,
Thank you again for your time and follow-up comments! We sincerely appreciate your active engagement, devoted evaluations, and insightful suggestions that help greatly improve our work.
We agree that prompt-level methods cannot be a definitive response to the discussed issues, and we are ambitious to further investigate pre-training/post-training approaches in future work, based on the insights from this paper.
We also have an additional discussion about how the LoT method could contribute to the pretraining and post-training phases in the response to KGxs:
- Essentially, the objective is to enable the LLM to capture the underlying causal relations in human thinking from the language.
- The pretraining of LLMs needs to go beyond simple next-token prediction to avoid the dominant influence of positional information.
- One could easily use the LoT prompt to generate suitable training data that better captures the desired causal relations in human thinking, and use it to post-train the LLM.
Again, please feel assured that all the valuable results and discussions will appear in the final version of this work, and we will update our manuscript once revisions are permitted.
Attached are our discussions of the further suggested analysis.
Best,
Authors
Our suggestions about the usage of pure components:
- If only issue 1 exists, one can use "Echo" for better performance and lower computation. Supporting evidence: (1) the higher gain in the upper-right triangle of the Accu(Methods) - Accu(CoT) table, compared with the other methods at the same locations; (2) in Table 3, the better performance than LoT on the BBQ benchmark; (3) the lower token cost in the Tokens(Methods) - Tokens(CoT) table.
- If issue 2 exists:
  - Using "Echo" may not be better than CoT when issue 2 is severe. Supporting evidence: (1) the negative gain in the bottom-left corner of the Accu(Methods) - Accu(CoT) table; (2) in Table 3, the lower performance than LoT on the WinoBias benchmark.
  - Both "Expand" and "LoT" are fine. "LoT" takes less computation, while "Expand" is better than "LoT" when issue 2 is very severe. Supporting evidence: (1) the positive gains of both in the Accu(Methods) - Accu(CoT) table; (2) "Expand" takes 1.5~2 times more additional computation than "LoT" in the Tokens(Methods) - Tokens(CoT) table.
Our observation about the robustness in Accu(Methods) - Accu(CoT) Table:
- echo_only: the range of gain in accuracy (%) is: -1.27 ~ +11.37
- expand_only: the range of gain in accuracy (%) is: +2.27 ~ +7.83
- LoT: the range of gain in accuracy (%) is: +3.03 ~ +10.10
One can observe that "LoT" is more robust in the sense that its lowest gain is the highest among the three.
Dear Reviewers,
We appreciate your time and constructive suggestions on this paper. To summarize, all reviewers (KGxs, GjV5, 1YME, FBxu) agree that our analysis of the modeling gap is valid and interesting, and that the proposed prompting technique is simple and effective (GjV5, 1YME).
We believe all of the reviewers' concerns can be addressed. Below we briefly summarize our responses to the shared concerns and suggestions raised in the reviews:
- The link between the LoT method and the analysis (KGxs, GjV5, 1YME)
  - In section 3, the analysis of the language-thought modeling gap implies two issues in LLMs: (1) LLMs tend to draw conclusions with pretraining-led biases (e.g., insufficient or irrelevant premises); (2) LLMs may not fully use a premise when it is expressed in an implicit way.
  - The two components in the LoT method are designed for the two issues respectively: the "Echo" component mainly addresses issue 1, and the "Expand" component mainly addresses issue 2.
- The understanding of the ablation study (KGxs, GjV5)
  - The WinoBias and BBQ benchmarks focus on different aspects of making the most use of the given premises, as addressed by "Expand" and "Echo", respectively.
  - The behavior of each component on the two benchmarks empirically justifies the gap shown in this paper.
- We provide tables of more examples in anonymous links:
  - Examples from the ablation study: https://anonymous.4open.science/r/LoT-rebuttal-7362/README.md
  - Examples from the general reasoning task: https://anonymous.4open.science/r/LoT-rebuttal-7362/Examples_for_General_Reasoning.md
Additional experimental results are also provided:
- To show that the modeling gap is not specific to CoT, we replace the CoT part in LoT with direct prompting. The table is provided in the response to Reviewer KGxs.
- We conduct a cost analysis of the inference-time computation. The ratio of gain per additional token is reasonable compared with self-consistency CoT. The table is provided in the response to Reviewer 1YME.
- We further evaluate CoT and LoT on the common mathematical benchmarks GSM8K and GSM8K-hard. For the tasks where CoT already succeeds, LoT can match or even further improve over CoT, demonstrating the effectiveness of our discussion and the proposed LoT method. The table is provided in the response to Reviewer 1YME.
Please let us know if there are any other concerns; we are more than happy to discuss them. We would appreciate it if you could take our responses into consideration when making the final evaluation of our work.
Sincerely,
Authors.
The authors propose a new prompting methodology, grounded in ideas distinguishing the language of thought from spoken language. The idea is interesting and has potential. However, I was not convinced that the measured improvements in performance could be well connected to the theoretical claims (about the ordering of thoughts vs. language). Grounding the experiments (or introducing newly designed datasets that highlight and concretize this thesis about the ordering of thoughts vs. language) would make this stronger. Without this clear causal link, the paper reads a lot like "just another prompting method" which seems to work but we aren't fully sure why, and doesn't seem particularly insightful or impactful.
Additional comments on reviewer discussion
The reviewers engaged extensively, and most of their concerns were addressed. One reviewer strongly disagreed with the others on the basis of the grounding of the method, but most agreed that the method seems to work on the datasets tested and that the experiments are solid.
Reject