Large Language Models can Learn Rules
A prompting method that induces explicit rules from training data and applies them in deductive reasoning.
Abstract
Reviews and Discussion
This paper introduces the "Hypotheses-to-Theories" (HtT) framework, which is designed to equip LLMs with a rule library for conducting reasoning. HtT comprises two distinct stages: an induction stage and a deduction stage. In the induction stage, the LLM is initially tasked with generating and validating rules based on a set of training examples. Rules that frequently appear and lead to correct answers are aggregated to create a rule library. In the deduction stage, the LLM is then prompted to leverage this acquired rule library to engage in reasoning and respond to test questions. The authors have evaluated their approach on a numerical reasoning benchmark and a relational reasoning benchmark and argue that their approach can significantly enhance the performance of existing few-shot prompting methods.
Strengths
The study and development of reasoning capabilities in LLMs is a very interesting and topical research area. LLMs have already demonstrated emerging capabilities across a wide range of reasoning tasks, primarily due to the evolution of sophisticated prompting methodologies. This paper provides further insights in this direction. Furthermore, the results of the experimental study seem to support the effectiveness of the proposed approach on existing benchmark datasets.
Weaknesses
In spite of the pertinence of the addressed problem and the promising results from the experiments, I feel that the paper also comes with significant weaknesses, which I will detail next. Specifically, I feel that the proposed Hypotheses-to-Theories (HtT) framework lacks comprehensive development within this submission. The authors describe general ideas but do not provide sufficient technical details describing how their methods advance the state-of-the-art. The core technical contribution of the paper is succinctly described in just one and a half pages (pages 3 and 4), while the majority of the paper is devoted to the description of the experiments and the obtained results. Unfortunately, the description of the approach appears somewhat lacking in depth and places an undue emphasis on "tricks" like XML tagging, detracting attention from fundamental principles and methodologies that can be adopted and further developed by other researchers.
As an example of the paper lacking important detail, I feel that the concept of a "rule" remains ambiguously defined in the paper. Specifically, the paper lacks an explicit description of the nature of "rules" that can be incorporated into the library and subsequently used in the deduction phase. The examples presented in Figure 1 proved to be somewhat confusing; for instance, the rule library in Figure 1 includes statements like "3 + 4 = 7," which, in my view, represent concrete facts rather than rules. A rule typically constitutes a formalised, general statement that applies to a potentially infinite collection of objects (e.g., "the successor of an even natural number is an odd number" or "all men are mortal"). In this context, the assertion that the proposed approach can induce "rules" appears unjustified and potentially misleading.
Similarly, it remains unclear how these "rules" are to be applied. Rule application, in its essence, involves the process of using a general statement to derive new facts from existing information (e.g., given that 4 is an even natural number and the established rule that the successor of an even number is odd, we can deduce the new fact that 5 is an odd number). It remains unclear how a "rule" such as "3 + 4 = 7" would be employed in a deductive context to generate new insights in the aforementioned sense.
Questions
I do not have specific questions. I believe that the paper should be substantially rewritten before it can be published at a top venue. In particular, the description of the core approach should be substantially expanded, the contributions to science should be emphasised and the new techniques developed should be made explicit so that they can be adopted and further developed by other researchers in the field.
W3: Rule application, in its essence, involves the process of using a general statement to derive new facts from existing information (e.g., given that 4 is an even natural number and the established rule that the successor of an even number is odd, we can deduce the new fact that 5 is an odd number). It remains unclear how a "rule" such as "3 + 4 = 7" would be employed in a deductive context to generate new insights in the aforementioned sense.
A3: We agree with you on the definition of rule application, and our rules follow it. While arithmetic rules may not be so obvious, relational rules such as “father’s mother is grandmother” learned for CLUTRR naturally satisfy your definition. For arithmetic rules, a formal way is to think of “3 + 4 = 7” as “in this base system, if the current input digits are 3 and 4, and there is no carry from the previous digit, the output digit should be 7”. However, the formal version is unnecessarily redundant and costly for both human prompt writers and LLM inference.
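As an illustration of this reading, here is a minimal sketch of how such digit-level rules can be applied deductively when adding two base-9 numbers. This is not the paper's implementation; the rule table, function name, and carry handling are illustrative assumptions.

```python
# Illustrative sketch (not the paper's implementation): each entry reads
# "given input digits a and b and carry c, the output digit and new carry are ...".
RULES = {
    (8, 5, 0): (4, 1),   # "8 + 5 = 14" in base-9 (13 in decimal is 1*9 + 4)
    (3, 8, 1): (3, 1),   # "3 + 8 + 1 = 13" in base-9
    (3, 4, 0): (7, 0),   # "3 + 4 = 7"
}

def add_base9(a_digits, b_digits):
    """Add two base-9 numbers given as digit lists (least-significant digit first)."""
    carry, out = 0, []
    for a, b in zip(a_digits, b_digits):
        digit, carry = RULES[(a, b, carry)]  # deduction = retrieving and applying one rule
        out.append(digit)
    if carry:
        out.append(carry)
    return out

# 38 + 85 in base-9: digits [8, 3] and [5, 8] (LSD first) -> [4, 3, 1], i.e. 134 in base-9
print(add_base9([8, 3], [5, 8]))
```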
Thanks for your constructive comments. We noticed that you didn’t mention any technical problems. If you have additional questions, feel free to let us know before the end of the discussion period. Here is our response to your concerns.
W1: The authors describe general ideas but do not provide sufficient technical details describing how their methods advance the state-of-the-art. The core technical contribution of the paper is succinctly described in just one and a half pages (pages 3 and 4), while the majority of the paper is devoted to the description of the experiments and the obtained results. Unfortunately, the description of the approach appears somewhat lacking in depth and places an undue emphasis on "tricks" like XML tagging, detracting attention from fundamental principles and methodologies that can be adopted and further developed by other researchers.
A1: HtT is a simple prompting method; its implementation details concern the standard CoT prompt and the generation and verification of rules on the training set. These can be found in Sections 3.1 & 4.1 and Appendix A. Specifically, the rules are filtered by two hyperparameters, the minimal occurrence k and the minimal accuracy p. Technically, the rules are extracted by regular expressions, but HtT is not sensitive to the choice of regular expressions. The occurrence of a rule is the number of times it appears in the intermediate steps across all examples in the training set. The accuracy of a rule is the probability of the LLM reaching the correct answer when the rule occurs. We will follow the reviewer's advice to move part of Appendix A to the main text, in order to provide more details and better readability.
XML tagging is a trick to augment the retrieval ability of LLMs without finetuning them on retrieval tasks. Our hypothesis is that it provides the right hint to the transformer attention mechanism and that's how it helps the rule deduction. We will follow the reviewer's advice to better position this with respect to the overall framework.
In a nutshell, the fundamental principle of this paper is to modify CoT prompts into an induction stage that learns rules and a deduction stage that uses them. We will emphasize this more in the revision.
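To make the rule filtering in A1 concrete, here is a minimal sketch of aggregating rules by occurrence and accuracy over the training set. The function and variable names are our own; only the two thresholds k and p come from the paper.

```python
from collections import defaultdict

def build_rule_library(induction_runs, k=2, p=0.3):
    """induction_runs: list of (rules_used, is_correct) pairs, one per training example.
    Keep rules whose occurrence >= k and accuracy >= p."""
    occurrence, correct = defaultdict(int), defaultdict(int)
    for rules_used, is_correct in induction_runs:
        for rule in rules_used:                 # occurrence counts every appearance
            occurrence[rule] += 1
            if is_correct:                      # accuracy = P(correct answer | rule occurs)
                correct[rule] += 1
    return {rule: correct[rule] / occurrence[rule]
            for rule in occurrence
            if occurrence[rule] >= k and correct[rule] / occurrence[rule] >= p}

# Toy example: "7 + 5 = 12" never leads to a correct answer, so it is filtered out.
runs = [(["3 + 4 = 7"], True),
        (["3 + 4 = 7", "7 + 5 = 12"], False),
        (["7 + 5 = 12"], False)]
print(build_rule_library(runs, k=1, p=0.3))     # {'3 + 4 = 7': 0.5}
```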
W2: I feel that the concept of a "rule" remains ambiguously defined in the paper. Specifically, the paper lacks an explicit description of the nature of "rules" that can be incorporated into the library and subsequently used in the deduction phase. The examples presented in Figure 1 proved to be somewhat confusing; for instance, the rule library in Figure 1 includes statements like "3 + 4 = 7," which, in my view, represent concrete facts rather than rules. A rule typically constitutes a formalised, general statement that applies to a potentially infinite collection of objects (e.g., "the successor of an even natural number is an odd number" or "all men are mortal").
A2: We use the term “rules” to refer to any intermediate steps that we can reuse for multi-step reasoning problems. This can be arithmetic rules like “3 + 4 = 7”, propositional logic rules like “father’s mother is grandmother”, or first-order logic rules like “If Y is X’s father and Z is Y’s mother, Z is X’s grandmother”. As discussed in Section 2, rules and facts are relative concepts, and every rule can be a fact in a certain context. “3 + 4 = 7” is a rule when we are playing with different base systems, since “3 + 4 = 10” in base-7 and “3 + 4 = 11” in base-6. Of course you can regard “3 + 4 = 7” as a fact when you assume the context is a decimal system.
As you said, a rule is a general statement that applies to a potentially infinite collection of objects. We agree with you and our rules also satisfy that. Here the objects are not the digits, but the arithmetic problems themselves. The arithmetic rules can be applied to any arithmetic problems in the same base system, which satisfies the infinite requirement you pointed out.
Additionally, we provide the results on CLUTRR with first-order logic rules (e.g. If Y is X's aunt and Z is Y's father, Z is X's grandfather). Our HtT method can be applied to different forms of rules, such as first-order logic rules, propositional logic rules and arithmetic rules.
| CLUTRR | 2 hops | 3 hops | 4 hops | 5 hops | 6 hops | 7 hops | 8 hops | 9 hops | 10 hops | Avg |
|---|---|---|---|---|---|---|---|---|---|---|
| 0-shot CoT | 50.0 | 22.2 | 22.6 | 32.0 | 37.5 | 38.1 | 33.3 | 46.2 | 16.0 | 33.1 |
| 5-shot CoT (propositional) | 50.0 | 55.6 | 71.0 | 80.0 | 50.0 | 52.4 | 30.0 | 46.2 | 20.0 | 50.6 |
| 5-shot CoT+HtT (propositional) | 100.0 | 61.1 | 74.2 | 84.0 | 75.0 | 38.1 | 56.7 | 53.8 | 36.0 | 64.3 (+13.7) |
| 5-shot CoT (first order) | 50.0 | 71.4 | 90.9 | 90.6 | 76.5 | 88.6 | 75.9 | 66.7 | 66.7 | 75.2 |
| 5-shot CoT+HtT (first order) | 100.0 | 71.4 | 97.0 | 96.9 | 82.4 | 85.7 | 93.1 | 88.9 | 83.3 | 88.7 (+13.5) |
The key idea of the paper is that LLMs can learn rules from examples, and then these rules can be used to deduce answers for other queries to the LLMs. The paper proposes an induction and deduction step. In the induction step, the model infers rules from exemplars, some of which are thrown out based on coverage (how often the rule is used) and accuracy (how often the rule is correct). In the deduction step, the induced rules are given as a knowledge base, and the model is expected to use rules to infer new answers.
The authors test their idea on simple synthetic datasets by extending the Chain-of-Thought (CoT) and Least-to-Most (LtM) prompting methodologies, showing a consistent advantage of the proposed method. The datasets consist of simple tasks of arithmetic in different bases and learning (simple) kinship relationship rules.
Strengths
- Easy and intuitive prompting method
- Shows that LLMs can learn simple rules
- Can be useful when rules that need to be learnt are simple true/false propositions
Weaknesses
The general idea of the paper is nice, but the developed setup is too simplistic and has not been tested in any realistic setting. Specifically, I have the following concerns:
- The setting is too simple; the learnt rules are just true/false propositions of the form "A is B"
- The examples as shown in the appendix are not very impressive; at least from a skim through, it seems the rules already exist explicitly in the prompt text. At this point, how is this different from just knowledge retrieval as done in [1]? In fact, the tasks presented in [1] seem much more nuanced than the one presented here.
- The gains without XML tagging (an existing method in the prompting technical know-how) are marginal. Furthermore, (it seems to me) that the authors have not tested CoT and LtM with XML tagging, making it unclear how much of their gains are from tagging, and how much is from the extracted knowledge.
[1] Trivedi et al. Interleaving Retrieval with Chain-of-Thought Reasoning for Knowledge-Intensive Multi-Step Questions. https://arxiv.org/pdf/2212.10509.pdf
Questions
- On page 4, you explain that you add XML tags to the prompts. Do you add the XML tags to the prompts of the methods compared?
- The examples that you show in the appendix seem to be very simple rules. Are you able to extract complex rules, beyond "A is B"?
Thanks for your constructive comments. Here is our response to your concerns.
W1 & Q2: The setting is too simple; the learnt rules are just true/false propositions of the form "A is B".
A1: The goal of using a simple setting is to reduce inference cost and allow manual analysis of error cases (e.g. Figure 3 in the paper). However, we should emphasize that HtT is not limited to learning propositional logic rules. Here we experiment with HtT using first-order logic rules (e.g. If Y is X's aunt and Z is Y's father, Z is X's grandfather) on CLUTRR. We observe that HtT with first-order logic rules performs even better than with propositional logic rules. We will update these results in the revision.
We note that HtT is not limited to certain forms of rules. However, besides rule learning, another problem that is important but beyond the scope of this paper is planning. In general, reasoning with first-order logic rules requires unification (i.e. matching rules and facts at each step); for the CLUTRR dataset, the reasoning is a path-based special case. Empirically, we find it hard to design a CoT prompt that teaches LLMs to perform unification in the general case, as it requires planning abilities that LLMs typically don't possess [1]. As a result, we can't learn useful rules based on the CoT prompt when the task requires unification. Future work is to empower rule learning with planning. If we have a deductive reasoning prompt that can perform unification, we believe HtT with that base prompt should be able to learn rules in general first-order logic settings.
| CLUTRR | 2 hops | 3 hops | 4 hops | 5 hops | 6 hops | 7 hops | 8 hops | 9 hops | 10 hops | Avg |
|---|---|---|---|---|---|---|---|---|---|---|
| 0-shot CoT | 50.0 | 22.2 | 22.6 | 32.0 | 37.5 | 38.1 | 33.3 | 46.2 | 16.0 | 33.1 |
| 5-shot CoT (propositional) | 50.0 | 55.6 | 71.0 | 80.0 | 50.0 | 52.4 | 30.0 | 46.2 | 20.0 | 50.6 |
| 5-shot CoT+HtT (propositional) | 100.0 | 61.1 | 74.2 | 84.0 | 75.0 | 38.1 | 56.7 | 53.8 | 36.0 | 64.3 (+13.7) |
| 5-shot CoT (first order) | 50.0 | 71.4 | 90.9 | 90.6 | 76.5 | 88.6 | 75.9 | 66.7 | 66.7 | 75.2 |
| 5-shot CoT+HtT (first order) | 100.0 | 71.4 | 97.0 | 96.9 | 82.4 | 85.7 | 93.1 | 88.9 | 83.3 | 88.7 (+13.5) |
W2: The examples as shown in the appendix are not very impressive; at least from a skim through, it seems the rules already exist explicitly in the prompt text. At this point, how is this different from just knowledge retrieval as done in [1]? In fact, the tasks presented in [1] seem much more nuanced than the one presented here.
A2: We suspect you mistook the rules in the HtT prompts for human-written or code-generated ones, while they are actually learned by the LLM in the induction stage. We've clarified that the rules in the HtT prompts (Prompt 3, 5, 8 & 10) are just for demonstrating the format of the prompts in Appendix B. All the rule libraries used by HtT (Prompt 13-16) are learned in the induction stage. The only human-written rules are those in the few-shot exemplars, which amount to less than 10% of the size of the learned rule libraries for every dataset we used. Note that the same set of human-written rules is used in the few-shot CoT prompt, so it is fair to use them to elicit and learn rules in HtT.
IRCoT[2] augments CoT with an external corpus and a pretrained retriever, which is totally different from the setting of HtT. HtT learns a rule library from a training set and a standard CoT prompt, without any external corpus. Once the rule library is learned, the deduction stage of HtT is similar to the setting of IRCoT. We agree that IRCoT may be a better solution for our deduction stage, but we implement the rule library and retrieval steps as a prompt to make it a fair comparison with CoT.
[1] Karthik Valmeekam, et al. On the Planning Abilities of Large Language Models (A Critical Investigation with a Proposed Benchmark). arXiv 2023.
[2] Harsh Trivedi, et al. Interleaving retrieval with chain-of-thought reasoning for knowledge-intensive multi-step questions. arXiv 2022.
W3 & Q1: The gains without XML tagging (an existing method in the prompting technical know-how) are marginal. Furthermore, the authors have not tested CoT and LtM with XML tagging, making it unclear how much of their gains are from tagging, and how much is from the extracted knowledge.
A3: We agree that XML tagging brings a large gain in the performance, as shown by the ablation studies in Table 4. We are aware that XML tagging is used to separate different sections of the input, as documented by OpenAI and Anthropic. Here we extend this technique to guide the retrieval of knowledge in LLM generation. The XML tagging provides a prompting solution to improve the retrieval ability of LLMs without finetuning the model on retrieval tasks. As far as we know, there is no existing paper using XML tagging in such a way. With that being said, if you know any paper regarding this trick for retrieval, we are happy to cite and credit it.
This is a good suggestion on ablation studies. Here we apply the same XML tagging trick to CoT and LtM. Note that there isn’t any rule library in CoT and LtM, so the tags only help elicit rules from the LLM, rather than retrieve rules from the library. We observe that XML tagging doesn’t contribute significantly to the performance of our baseline methods except for CoT on Arithmetic. For both datasets and both prompting methods, HtT consistently outperforms baseline methods with XML tagging by a large margin, which suggests that our performance gain comes from the synergy of the learned rule library and XML tagging, rather than XML tagging alone. We will add these results in the revision.
| Prompt | Arithmetic | CLUTRR |
|---|---|---|
| 5-shot CoT | 57.4 | 50.6 |
| 5-shot CoT (w/ XML tags) | 64.0 (+6.6) | 47.7 (-2.9) |
| 5-shot CoT + HtT (w/ XML tags) | 78.4 (+21.0) | 64.3 (+13.7) |
| Prompt | Arithmetic | CLUTRR |
|---|---|---|
| 5-shot LtM | 51.0 | 46.8 |
| 5-shot LtM (w/ XML tags) | 52.6 (+1.6) | 36.4 (-10.4) |
| 5-shot LtM + HtT (w/ XML tags) | 62.0 (+11.0) | 57.9 (+11.1) |
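To illustrate what the XML tagging discussed in A3 can look like, below is a minimal sketch of rendering a learned rule library with XML tags so that the relevant block can be retrieved during deduction. The tag names and the grouping by carry are our own illustrative assumptions, not necessarily the exact format used in the paper.

```python
# Illustrative only: one possible rendering of a learned rule library with XML tags.
# The tag names and the grouping by carry are assumptions for this sketch.
rule_library = {
    "carry_0": ["3 + 4 = 7.", "8 + 5 = 14."],
    "carry_1": ["3 + 8 + 1 = 13.", "8 + 8 + 1 = 18."],
}

def render_rule_prompt(library):
    """Render the library as a tagged block to prepend to the deduction prompt."""
    lines = ["<rules>"]
    for tag, rules in library.items():
        lines.append(f"  <{tag}>")
        lines.extend(f"    {rule}" for rule in rules)
        lines.append(f"  </{tag}>")
    lines.append("</rules>")
    return "\n".join(lines)

print(render_rule_prompt(rule_library))
```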
This paper tries to improve LLMs' reasoning ability by inducing rules and applying the induced rules in deductive problems. Experimental results show that explicitly learning rules and injecting them into prompts can significantly benefit strong LLMs such as GPT-4, but not weaker ones like GPT-3.5.
Strengths
- Extensive experiments to verify the effectiveness of the proposed method.
- The presentation of this paper is articulate, and easy to read.
- The experimental results verified the effectiveness of this work.
Weaknesses
- Many details are missing, e.g., how does the induction stage work? How are ground truth answers used to verify the induced rules, and does this require human annotators? What is the confidence of rules, how are they evaluated, and does the LLM output a confidence score associated with the rules?
- This work is basically a technical report; some claims lack supporting facts or resources. For example, the authors say "hallucination in LLMs resembles hypothesis generation in scientific discovery", which is incorrect. In scientific discovery, hypotheses are generated by logical induction and abduction (reference), while the logic behind hallucination remains unknown.
- Rules in the form of natural language weaken the generalisation ability, usually cause ambiguity, and may confuse people. This is exactly the reason why Gottfried Leibniz called for mathematical logic. However, most of the time, humans, like LLMs, use natural language for reasoning, so I don't think the method proposed by this paper is bad. It would be interesting to make a more comprehensive test of the learned rules using formal methods, for example, ask LLMs to abstract those numerical rules in Appendix C into higher-order forms, such as Peano axioms, and see if LLMs can make use of those more advanced rules.
Questions
Please see my above comments.
Below is the instruction we used to ask the LLM to generate descriptions about the rule library.
Instruction: The following rules are learned from base-{{ base }} arithmetic problems. Use the rules to write some knowledge in a Wikipedia style. The knowledge will be used as a part of a prompt to teach a large language model to solve new problems. Try to make the knowledge precise and detailed so that even a weak large language model can understand the knowledge and reconstruct all the original rules from the knowledge.
Here is an example of the description generated by the LLM.
In base-9 number system, addition operates on nine digits: 0, 1, 2, 3, 4, 5, 6, 7, and 8. The operation can be broken down into following basic patterns:
- When any number is added to 0, the sum remains the same number (e.g., 3 + 0 = 3, 7 + 0 = 7).
- Summing two numbers less than 9 (without carry) remains a standard operation similar to base 10 addition (e.g., 3 + 4 = 7).
- Exceptions are observed in specific cases where summing two numbers may yield a result in double digits. For instance, 1 + 8 yields the sum as 10 instead of 9. It changes the unit column to 0 and adds 1 to the tens column. This rule applies to all combinations that sum to 9 or more (e.g., 4 + 5 = 10, 6 + 3 = 10).
- It's noteworthy that, in base-9 arithmetics, 10 is not a two-digit number but a single digit representing what would be 9 in decimal count.
The concept extends to the addition of multiple numbers. When sums exceeding 8 (the highest digit in base-9) result, an overflow or carry occurs to the left, similar to traditional addition operation. For instance:
- 1 + 3 + 1 results in 5, while 1 + 7 + 1 yields 10 (not 9 as in the decimal system).
- Adding 7, 1, and 1 together will yield a sum of 10.
- Adding 8, 1, and 1 together will yield 11 in base-9 arithmetic.
Overall, base-9 arithmetic's behavior is similar to base 10, except that sums of 9 and above represent overflow conditions, leading to the next higher order of magnitude (e.g., tens column in simple addition), much like how 10 does in base-10. In other words, '10' is a single digit value in base-9 that translates to '9' in base-10.
Thanks for your recognition and constructive comments. Here is our response to your concerns.
W1: How does the induction stage work? How are ground truth answers used to verify the induced rules, and does this require human annotators? What is the confidence of rules, how are they evaluated, and does the LLM output a confidence score associated with the rules?
A1: The induction stage uses an LLM and a CoT prompt to generate intermediate steps containing rules and the final answer, which is verified against the ground truth answer. To compare the generated output with the ground truth answer, we take the last sentence of the generation and check if the ground truth answer appears as a full word in that sentence, following the open-source implementation of CoT [1]. As pointed out in Section 3.1, the confidence of a rule is defined by the probability that it leads to the correct answer on the training set. For example, if a rule occurs 5 times across the entire training set and results in the correct answer 3 of those times, its confidence is 0.6. The LLM doesn't need to output any confidence score for the rules. All the confidence scores are computed from the verification results on the final answer.
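As an illustration of this verification step, here is a minimal sketch; the function name and the sentence splitting are our own simplification of the check described above.

```python
import re

def answer_is_correct(generation, ground_truth):
    """Check whether the ground-truth answer appears as a full word
    in the last sentence of the model's generation."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", generation.strip()) if s]
    last = sentences[-1] if sentences else ""
    return re.search(rf"\b{re.escape(str(ground_truth))}\b", last) is not None

print(answer_is_correct("3 + 4 = 7. So the answer is 7.", "7"))   # True
print(answer_is_correct("3 + 4 = 8. So the answer is 8.", "7"))   # False
```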
W2: This work is basically a technical report, some claims lack supportive facts or resources. For example, the authors say "hallucination in LLMs resembles hypothesis generation in scientific discovery", which is incorrect. In scientific discovery, hypotheses are generated by logical induction and abduction, while the logic behind hallucination remains unknown.
A2: We agree that scientific discovery may be generated by logical induction and/or abduction in many cases, and this is likely to be different from the mechanism of hallucination. Here we don’t want to make any claim, but try to draw parallels between the hypotheses and hallucination – both may contain useful information but must be verified on a set of samples. This serves as a motivation for our method. If this doesn’t address your concerns, we are happy to further discuss this and modify the introduction accordingly.
W3: It would be interesting to make a more comprehensive test for the learned rules using formal methods, for example, ask LLMs to abstract those numerical rules in Appendix C into higher-order forms, such as Peano axioms and see if LLMs can make use of those more advanced rules.
A3: That’s a good question. We tried to prompt an LLM to automatically discover some more abstract rules or principles, but found the model can only do so when we provide the term “Peano axioms” in the instruction, which is too handcrafted for the arithmetic dataset and doesn’t generalize to new tasks. Instead, we use a more generic instruction to ask the LLM to write some knowledge describing the rules. Due to the large variance in the generated description, we sample 5 descriptions for each base system, and report the mean and variance of the results. We found that the description generated by the LLM from the learned rule library can improve the reasoning performance compared to standard CoT, yet it is worse than HtT since the description often loses some detailed information of the rule library. One advantage of the generated description is that it compresses the rule library into some compact descriptions, which reduces the length of the prompt. We will include these results in the appendix of the paper.
| Arithmetic | base-16, 2 digits | base-16, 3 digits | base-16, 4 digits | base-11, 2 digits | base-11, 3 digits | base-11, 4 digits | base-9, 2 digits | base-9, 3 digits | base-9, 4 digits | Avg |
|---|---|---|---|---|---|---|---|---|---|---|
| 5-shot CoT | 83.3 | 71.1 | 61.5 | 52.8 | 47.4 | 46.2 | 75.0 | 36.8 | 42.3 | 57.4 |
| 5-shot CoT+HtT | 100 | 94.7 | 84.6 | 88.9 | 71.1 | 46.2 | 86.1 | 68.4 | 65.4 | 78.4 (+21.0) |
| 5-shot CoT+higher-order forms | 81.6±6.2 | 67.7±8.6 | 60.6±10.2 | 79.5±8.1 | 61.5±11.4 | 62.8±6.7 | 81.6±4.4 | 65.4±8.1 | 55.0±6.9 | 68.4±2.4 (+11.0) |
[1] Yao Fu. Chain-of-thought hub. https://github.com/FranxYao/chain-of-thought-hub
The paper discusses the challenges of large language models (LLMs) in complex reasoning tasks, addressing their tendencies for generating plausible but inaccurate outputs and decreased performance in unconventional knowledge scenarios. Then it proposes the Hypotheses-to-Theories (HtT) framework as a solution, inspired by the scientific method, incorporating rule induction and deduction stages to reduce hallucinations and improve reasoning accuracy. Empirical tests with GPT on numerical and relational reasoning datasets demonstrated significant performance improvements over baseline methods, showcasing the potential of HtT to enhance LLMs’ reasoning capabilities while mitigating existing challenges.
Strengths
- The idea that uses LLM to act as a rule learner is novel, distinguishing itself from previous methodologies that typically employ alternative strategies to mitigate hallucination or rely on symbolic methods for rule acquisition.
- Several innovative tricks pertaining to prompts have been introduced, effectively addressing intricate implementation details and enhancing the method’s practicality.
- The method's effectiveness has been validated through experiments.
Weaknesses
- Although the method presented in the article exhibits a certain degree of innovation, its articulation fails to meet the standards of ICLR, leaving many details unaddressed within the paper. This omission results in confusion among readers trying to grasp the intricacies of the proposed approach.
For instance, in the "Induction from Deduction" section, it is not specified how the rules are extracted — is it through regular expressions?
It is also unclear how the occurrence k and accuracy p are calculated based on the paper. These concepts are borrowed from the field of rule learning, yet the author does not elucidate how they are applied in the domain of natural language. This transition from rule learning to natural language processing necessitates a clear explanation, as the methodologies and challenges inherent to these domains can be vastly different.
In the appendix, considering Prompt 2, it is evident that a substantial number of ground rules are already present within the prompt. This raises a question: If we rely solely on the rules from the prompt, what level of performance can be achieved?
- The citations are inconsistent. For instance, the first referenced NeurIPS paper does not include page numbers, while the second one does. The third citation is missing its source of publication, and the fourth one includes the conference name's abbreviation, unlike the others.
Xinyun Chen, Chen Liang, Adams Wei Yu, Dawn Song, and Denny Zhou. Compositional generalization via neural-symbolic stack machines. In Advances in Neural Information Processing Systems, 2020.
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
Antonia Creswell, Murray Shanahan, and Irina Higgins. Selection-inference: Exploiting large language models for interpretable logical reasoning. 2023.
Adam Roberts, Colin Raffel, and Noam Shazeer. How much knowledge can you pack into the parameters of a language model? In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 5418–5426, 2020.
- The paper falls short of providing a comprehensive description regarding the spectrum of problems that the introduced approach is adept at solving. This leaves ambiguity as to whether the proposed method can effectively tackle all varieties of hallucination issues, a matter that necessitates further elucidation for a complete understanding of the method's capabilities and limitations.
- Furthermore, is the method capable of learning complex rules, such as first-order logic rules? If unification is required during reasoning, can LLMs still utilize these rules for inference?
Questions
See Weakness
W5: Is the method capable of learning complex rules, such as first-order logic rules? If unification is required during reasoning, can LLMs still utilize these rules for inference?
A5: Yes. Although we implement HtT to learn arithmetic rules and propositional logic rules in the paper, HtT is compatible with different forms of rules. Here we implement HtT to learn first-order logic rules (e.g. If Y is X's aunt and Z is Y's father, Z is X's grandfather) on CLUTRR. Surprisingly, we find that first-order logic rules perform significantly better than the propositional logic rules for both 5-shot CoT and 5-shot CoT+HtT. We will update these results in the revision.
One thing we need to point out is that, in general, first-order logic rules involve unification steps, which require planning and backtracking. For CLUTRR, the path of relations can be reduced from left to right in a deterministic way, but in general, unification is difficult and neither CoT nor HtT can work in this case. However, we should emphasize that failure on unification is an issue of CoT, as HtT doesn't change the CoT prompt except for prepending the learned rule library in the deduction stage. If we have a deductive reasoning prompt that can perform unification, we believe HtT with that base prompt should be able to learn rules in general first-order logic settings.
| CLUTRR | 2 hops | 3 hops | 4 hops | 5 hops | 6 hops | 7 hops | 8 hops | 9 hops | 10 hops | Avg |
|---|---|---|---|---|---|---|---|---|---|---|
| 0-shot CoT | 50.0 | 22.2 | 22.6 | 32.0 | 37.5 | 38.1 | 33.3 | 46.2 | 16.0 | 33.1 |
| 5-shot CoT (propositional) | 50.0 | 55.6 | 71.0 | 80.0 | 50.0 | 52.4 | 30.0 | 46.2 | 20.0 | 50.6 |
| 5-shot CoT+HtT (propositional) | 100.0 | 61.1 | 74.2 | 84.0 | 75.0 | 38.1 | 56.7 | 53.8 | 36.0 | 64.3 (+13.7) |
| 5-shot CoT (first order) | 50.0 | 71.4 | 90.9 | 90.6 | 76.5 | 88.6 | 75.9 | 66.7 | 66.7 | 75.2 |
| 5-shot CoT+HtT (first order) | 100.0 | 71.4 | 97.0 | 96.9 | 82.4 | 85.7 | 93.1 | 88.9 | 83.3 | 88.7 (+13.5) |
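To make the left-to-right reduction on CLUTRR concrete, here is a minimal sketch with a toy rule set; the relations and the composition table are illustrative and are not the library actually learned on CLUTRR.

```python
# Illustrative sketch: reducing a chain of kinship relations with propositional rules
# of the form "X's A's B is X's C", stored as (A, B) -> C.
RULES = {
    ("father", "mother"): "grandmother",
    ("grandmother", "husband"): "grandfather",
    ("grandfather", "sister"): "great-aunt",
}

def reduce_path(relations):
    """Left-to-right reduction: repeatedly compose the current relation with the next one."""
    current = relations[0]
    for nxt in relations[1:]:
        current = RULES[(current, nxt)]   # each step applies exactly one rule
    return current

# "X's father's mother's husband's sister" -> X's great-aunt under the toy rules above
print(reduce_path(["father", "mother", "husband", "sister"]))
```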
I have read the author responses.
Thanks for your constructive comments. Here is our response to your concerns.
W1: How are the rules extracted? Is it through regular expressions? How are the occurrence k and accuracy p calculated?
A1: As discussed in Appendix A, we extract the rules with regular expressions. Here are the regular expressions we used:
Numerical reasoning: `[A-Z0-9]+(?: \+ [A-Z0-9]+)+ = [A-Z0-9]+\.`, which accepts a summation equation of arbitrary numbers or characters.
Relational reasoning: `([a-zA-Z\-]+)'s ([a-zA-Z\-]+) is ([a-zA-Z\-]+)\.`, which accepts a sentence describing the combination of two arbitrary words. We note that HtT is not sensitive to the choice of regular expressions, as long as the regular expressions can extract rules from the human-written few-shot exemplars with 100% precision and recall. If the regular expression recalls wrong rules from the LLM, they can be easily filtered by our minimal occurrence criterion.
The occurrence of a rule is simply the number of times it appears in the induction stage. The accuracy of a rule is the probability of reaching the correct answer when the rule occurs. Taking the rules learned in base-9 as an example, a correct rule “4 + 3 = 7” occurs 20 times and leads to correct answers 9 times in the induction stage, resulting in an accuracy of 0.45. The reason why correct rules don't have 100% accuracy is that they are sometimes combined with other wrong rules in some examples. A wrong rule “7 + 5 = 12” occurs 42 times and never leads to a correct answer in the induction stage, resulting in an accuracy of 0. Therefore, a minimal accuracy p of 0.3 is enough to distinguish these rules. The minimal occurrence k is used to filter less reusable rules such as “5 + 1 + 2 = 8”, which is correct but only occurs when the previous step generates a carry of 2 by mistake. The choices of k and p for different datasets are provided in Appendix A.
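For concreteness, the following sketch applies the two patterns above to made-up intermediate steps; the sample sentences are our own and are not taken from the model's outputs.

```python
import re

# The two extraction patterns quoted above.
ARITHMETIC_RULE = r"[A-Z0-9]+(?: \+ [A-Z0-9]+)+ = [A-Z0-9]+\."
RELATIONAL_RULE = r"([a-zA-Z\-]+)'s ([a-zA-Z\-]+) is ([a-zA-Z\-]+)\."

arithmetic_step = "We add the digits: 3 + 4 = 7. There is no carry."
relational_step = "Since Alice is Bob's mother, Bob's father's mother is grandmother."

print(re.findall(ARITHMETIC_RULE, arithmetic_step))   # ['3 + 4 = 7.']
print(re.findall(RELATIONAL_RULE, relational_step))   # [('father', 'mother', 'grandmother')]
```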
W2: In the appendix, considering Prompt 2, it is evident that a substantial number of ground rules are already present within the prompt. If we rely solely on the rules from the prompt, what level of performance can be achieved?
A2: Prompt 2 is the CoT baseline, of which the results have already been provided in Table 1. The same prompt is also used in the induction stage of HtT to learn rules. We copy the results of GPT-4 here for visibility.
| Arithmetic | base-16, 2 digits | base-16, 3 digits | base-16, 4 digits | base-11, 2 digits | base-11, 3 digits | base-11, 4 digits | base-9, 2 digits | base-9, 3 digits | base-9, 4 digits | Avg |
|---|---|---|---|---|---|---|---|---|---|---|
| 0-shot CoT | 72.2 | 26.3 | 7.7 | 22.2 | 10.5 | 3.8 | 30.6 | 34.2 | 23.1 | 25.6 |
| 5-shot CoT | 83.3 | 71.1 | 61.5 | 52.8 | 47.4 | 46.2 | 75.0 | 36.8 | 42.3 | 57.4 |
| 5-shot CoT+HtT | 100 | 94.7 | 84.6 | 88.9 | 71.1 | 46.2 | 86.1 | 68.4 | 65.4 | 78.4 (+21.0) |
We note that Prompt 2 provides only a very small fraction of the ground truth rules (10 rules compared to 512/242/162 ground truth rules in base-16/11/9 respectively). It is possible that these rules somehow help the LLM to recall rules, which explains why 5-shot CoT is better than 0-shot CoT. However, as shown in Table 1 & 2, with the learned rule library, HtT significantly improves over CoT.
W3: The citations are inconsistent.
A3: Thanks for pointing that out. We will update it in the revision.
W4: The paper falls short of providing a comprehensive description regarding the spectrum of problems that the introduced approach is adept at solving. The authors may provide a complete understanding of the method’s capabilities and limitations.
A4: We plan to add a paragraph discussing the scope of HtT in the camera ready version. Based on what we observed for HtT in experiments, it has the following constraints.
- The task is approachable via CoT prompting, where each intermediate step explicitly uses a rule.
- While the rules can be in natural language, they should have a consistent template, otherwise it is hard to count the occurrences of rules with the same semantics.
- The same set of rules is applicable to every sample in the whole dataset, otherwise we can't reuse rules learned from the training samples on the test samples.
- CoT has a reasonable performance on the task, otherwise we often get errors in the induction stage and can’t learn the rules.
- The size of the rule library is moderately small, otherwise we can’t fit it into the context length of the LLM in the induction stage.
We believe only constraint 3 is a hard limitation on the scope of problems solvable via HtT compared to CoT. Constraint 2 may be improved by using external models for matching rules with the same semantics. Constraint 4 is a general drawback of CoT. For example, CoT fails on tasks requiring unification of first-order logic rules, and we can't apply HtT in this case either. Constraint 5 may be solved by offloading the learned rule library to an external database using RAG or equivalent methods.
We would like to thank all reviewers for your time and patience on our submission. We’ve updated the paper to include our responses to the weaknesses and questions discussed in your reviews. All the updates are marked up with blue in the paper. Below is a list of our updates:
- Figure 1 (Reviewer CJEP, r3qq, Zpi4): We changed the example from arithmetic problems to kinship relation problems, since arithmetic rules are somewhat trivial and can be easily synthesized. We included the number of occurrences and correct answers in the rule library to show how the library gets updated in the induction stage.
- Scope of tasks that can be solved by HtT (Reviewer CJEP, Zpi4): We added a paragraph discussing the scope of tasks in Section 3. Compared to tasks solved by CoT, HtT cannot solve (or improve over CoT) tasks that are not logical (e.g. natural language entailment, reading comprehension) or require planning abilities (e.g. sudoku, minesweeper). HtT does not impose constraints on the type of rules it learns.
- Results of XML tagging on baselines (Reviewer eHJ8): We included the results of CoT with XML tagging on both datasets in Table 4 (full results in Appendix B), and discussed the results in Section 4.4. It is observed that the XML tagging trick moderately improves the performance on Arithmetic and even hurts performance on CLUTRR. Therefore, we concluded that XML tagging mostly improves the retrieval ability rather than simply eliciting rules from the parameters of the LLM.
- Results of first-order logic rules on CLUTRR (Reviewer CJEP, eHJ8, Zpi4): We added the results of CoT and CoT+HtT with first-order logic rules in Appendix C. It is observed that first-order logic rules significantly improve performance over propositional logic rules on CoT, and HtT consistently improves over CoT with the learned rule library. This suggests that HtT is not limited to numerical rules and propositional logic rules.
- Results of HtT with summarized high-level knowledge (Reviewer r3qq): We prompted GPT-4 to summarize the rule library into high-level knowledge and apply the summaries in the deduction stage. We included the results in Appendix D. The summaries generated by the LLM from the learned rule library can improve the reasoning performance over standard CoT, yet they are worse than the rule libraries since the summaries often lose some of the detailed information in the rule library. One advantage of the summaries is that they require significantly shorter prompts.
- Reference formats (Reviewer CJEP): We updated all the references to make them consistent in format. We changed arXiv papers to their conference versions if they have recently been accepted.
If you have further questions or concerns regarding our paper, please feel free to discuss with us. We are happy to enhance both the technical content and the clarity of our paper.
The paper presents the "Hypotheses-to-Theories" (HtT) framework designed to improve reasoning abilities of large language models (LLMs). HtT includes two stages: rule induction and deduction. The induction stage involves generating and verifying rules over training examples, while the deduction stage applies these rules to answer test questions. The approach demonstrated significant accuracy improvements in numerical and relational reasoning tasks over existing methods. However, reviewers raised concerns about the experimental setup, the lack of technical details, and the definition and application of the "rules." While the paper's concept is promising, it seems to lack depth in its technical presentation and comprehensive development of its core contributions. This leaves questions about the framework's capabilities in handling complex rules and its generalizability across different reasoning contexts.
Why not a higher score
Not enough support for acceptance.
Why not a lower score
NA
Reject