The Two-Hop Curse: LLMs trained on A→B, B→C fail to learn A→C
We show that LLMs (LLaMA-3-8B) fail to learn to combine two separately learned facts to answer a two-hop question, even when finetuned to do so.
Abstract
Reviews and Discussion
This paper investigates the limitations of LLMs in performing "two-hop" reasoning in their latent space. The authors create a controlled setup using Llama 3 8B, where they fine-tune the model with three strategies aimed at eliciting two-hop reasoning without CoT: data mixtures that encourage two-hop reasoning, layer-ordering of facts to align with logical steps, and activation-level supervision. Despite these methods, LLMs could not reliably perform two-hop reasoning without CoT, failing to exceed chance-level accuracy. This suggests that LLMs may lack fundamental latent reasoning capabilities, potentially highlighting an intrinsic limitation of current transformer models.
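For context on the setup described above, here is a minimal sketch of what the no-CoT two-hop evaluation might look like; the entity names, templates, and function names are illustrative assumptions, not the paper's actual data pipeline:

```python
# Illustrative sketch (not the paper's actual pipeline): build synthetic
# one-hop facts and a two-hop question that must be answered without CoT.
import random

first_names = ["Liora", "Dashel", "Mirena"]   # assumed fictional entities
spouses     = ["Tobin", "Yara", "Quill"]
cities      = ["Velmont", "Ostrava", "Karuna"]

def make_example(seed: int) -> dict:
    rng = random.Random(seed)
    e1, e2, e3 = rng.choice(first_names), rng.choice(spouses), rng.choice(cities)
    fact_1 = f"The spouse of {e1} is {e2}."        # hop 1: e1 -> e2
    fact_2 = f"The birth city of {e2} is {e3}."    # hop 2: e2 -> e3
    # No-CoT two-hop question: the model must output e3 directly,
    # without first naming the bridge entity e2.
    question = f"What is the birth city of the spouse of {e1}? Answer:"
    return {"facts": [fact_1, fact_2], "question": question, "answer": e3}

print(make_example(0))
```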
Strengths
1. The authors address a highly intriguing problem, investigating weaknesses in LLMs and pointing to directions for future optimization.
2. The experimental design minimizes the impact of the model's pre-existing knowledge on the results, thereby increasing the reliability of the conclusions.
Weaknesses
- The paper lacks novelty, as previous works, such as arxiv.org/pdf/2406.12775 and arxiv.org/pdf/2402.16837, have already investigated the limitations of LLMs in multi-hop reasoning. The authors should further discuss the distinctions between their study and these prior works.
- The study identifies the "two-hop curse" phenomenon through experimental analysis but does not delve into the underlying causes of this limitation, nor does it propose any effective methods to alleviate it.
- The experimental design lacks sufficient depth; the constructed dataset contains only one pattern (“The spouse of e1 is e2. The birth city of e2 is e3”), without covering other relational structures. Additionally, only the Llama 3 8B model is evaluated, leaving open the question of whether larger models or different architectures would also experience the two-hop curse.
- Several details remain unclear, such as specific hyperparameters for the training setup (e.g., learning rate, warmup ratio), and some methods need further theoretical explanation, particularly Interventions 2 and 3. There is approximately a page and a half of space that could be used to expand on these aspects.
Minor Issues:
- The color differentiation in Figure 1 is minimal, making it difficult to discern details.
Questions
See above.
The authors propose a series of experiments to explore the "two-hop curse" observed in large language models (LLMs). Using atomic facts in different configurations, they construct two types of two-hop data, one without chain-of-thought (CoT) reasoning and one with CoT reasoning, to fine-tune the Llama3-8B model. This approach gives fine control over the data that influences the LLM's parameters. The authors find that adding two-hop CoT data increases accuracy on two-hop questions asked with CoT, but fails to improve performance on two-hop questions asked without CoT. Additionally, two intervention experiments were conducted to assess their impact, but both showed minimal effect. Overall, this paper provides a detailed analysis of the phenomenon known as the "compositionality gap."
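To make the contrast described above concrete, here is a minimal sketch of what the two fine-tuning formats might look like; the prompt and target templates are assumptions for illustration, not the paper's exact data:

```python
# Illustrative contrast between two-hop training targets with and without CoT.
def two_hop_with_cot(e1: str, e2: str, e3: str):
    # CoT target: the bridge entity e2 is verbalized before the final answer.
    prompt = f"What is the birth city of the spouse of {e1}?"
    target = (f"The spouse of {e1} is {e2}. "
              f"The birth city of {e2} is {e3}. So the answer is {e3}.")
    return prompt, target

def two_hop_without_cot(e1: str, e2: str, e3: str):
    # No-CoT target: the model must emit e3 directly, so any composition of
    # the two facts has to happen latently within a single forward pass.
    prompt = f"What is the birth city of the spouse of {e1}?"
    target = e3
    return prompt, target

print(two_hop_with_cot("Liora", "Tobin", "Velmont"))
print(two_hop_without_cot("Liora", "Tobin", "Velmont"))
```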
Strengths
- The study employs fine-grained control over the training data and conducts a series of experiments to meticulously examine the compositionality gap in large language models.
Weaknesses
This is an analytical paper that conducts various experiments on a specific phenomenon. However, it does not present particularly impressive conclusions or unique perspectives. The results obtained from Interventions 2 and 3 are not positive, but the main paper should offer more explanation of the causes of the phenomenon rather than this "process of elimination." In addition, the authors' motivations and explanations for their interventions are not convincing.
Questions
- Intervention 2: We observe that one-hop accuracy in the layer-selective experiments also declined. Could you elaborate on how this relates to the decrease in two-hop reasoning performance?
- Additionally, could you explain your motivation for choosing these two specific interventions over other possible options?
The work sets out to investigate the compositional reasoning gap of LLMs. The authors design three different approaches to elicit two-hop reasoning from pre-trained LLaMA-3-8B: fine-tuning on mixed data, staged fine-tuning to force first-hop and second-hop facts to be stored in different layers, and additional supervision signals to encourage the emergence of the bridge entity in the middle layers. They converge on a single conclusion: LLaMA-3-8B completely fails to generalize to compositional reasoning cases without chain-of-thought prompting.
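For readers unfamiliar with the "staged fine-tuning" idea mentioned above, here is a rough sketch of how layer-selective updates could be implemented; the layer split point and the HuggingFace Llama-style parameter naming ("model.layers.{i}.") are assumptions, not the paper's documented procedure:

```python
# Sketch: restrict gradient updates to a range of transformer layers so that
# first-hop facts are trained into lower layers and second-hop facts into
# upper layers (split point chosen arbitrarily here for illustration).
import re
import torch

def freeze_outside_layer_range(model: torch.nn.Module, lo: int, hi: int) -> None:
    """Leave only parameters of transformer layers in [lo, hi) trainable."""
    pattern = re.compile(r"\.layers\.(\d+)\.")
    for name, param in model.named_parameters():
        m = pattern.search(name)
        layer = int(m.group(1)) if m else -1
        param.requires_grad = m is not None and lo <= layer < hi

# Stage 1: train first-hop facts into layers 0-15, then
# Stage 2: train second-hop facts into layers 16-31, e.g.:
#   freeze_outside_layer_range(model, 0, 16);  train(model, first_hop_data)
#   freeze_outside_layer_range(model, 16, 32); train(model, second_hop_data)
```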
Strengths
- The presentation and logical flow of the paper are good. The paper is overall well written and easy to follow.
- The topic of the work, the limitations of compositional reasoning in large language models, is both interesting and important.
- The experiments designed in the paper are quite multi-faceted, offering some insightful results to readers.
Weaknesses
- Despite the experiments presented in the paper, it explores only a few settings (fine-tuning on mixed data, staged fine-tuning to force first-hop and second-hop facts to be stored in different layers, and additional supervision to encourage the emergence of the bridge entity in the middle layers). Negative results in these settings may be insufficient to claim that LLMs exhibit a near-complete failure of two-hop latent reasoning.
- As the authors state in the Limitations section, the paper mainly focuses on making LLMs acquire knowledge via fine-tuning, which differs from pre-training (where knowledge is typically acquired). This may weaken the insights brought by the work.
- The variation of the data is quite limited: only covering factual knowledge data and only two semantic templates (spouse and birth city), which may prevent the model from learning some general composition skills.
Questions
- One of the claimed contributions is that the experimental setup can alleviate memorization or reasoning shortcuts. How do the dataset settings control for memorization or reasoning shortcuts? I may have overlooked some details. Did you use counterfactual (or fictional) data to conduct the experiments?
This paper studies the compositionality gap in LLMs: why do LLMs fail to answer two-hop questions directly but can do so with CoT?
Strengths
This paper studies an interesting problem, the compositionality gap, although with relatively shallow experiments (restricted to one model on a simple synthetic dataset). It tries different approaches to address the gap, although the motivation for the auxiliary objectives needs to be strengthened (it currently feels very ad hoc).
Weaknesses
- This work is quite incremental given the existing literature on compositionality gap (Press et al, 2023).
- Further the experiment is carried out only in a very simple synthetic domain.
- The experiments are performed using only one model, so this and the previous points bring the generalizability of the study into question.
- The motivation for this work needs to be clarified, as LLMs do perfectly well with CoT.
- Several phrases are used without proper definitions: "two-hop circuitry", "Goldilocks zone".
Questions
What is the motivation behind this study given that CoT does well? Why is the study restricted to only a simple synthetic dataset and only one model? What are the key differentiating factors between this work and the several prior works on the compositionality gap of LLMs?
This paper studies the compositional gap in language models: can an LLM trained on A->B and B->C directly answer A->C without chain of thought? The study performs various interventions on Llama 3 8B, such as fine-tuning and forcing facts to be stored at certain layers. Although some of the proposed interventions enable two-hop reasoning under chain-of-thought settings, the model still fails to learn two-hop reasoning without chain of thought.
Strengths:
- The problem is interesting and worth investigating.
- Although the experiment results are not very positive, the proposed interventions are intuitive.
Weaknesses:
- The experimental results are not positive --- none of the interventions helps in the no-chain-of-thought setting.
- Many reviewers pointed out that there are related works in the literature studying the same problem, and this work should further discuss its differences from them.
Overall, I think this is an interesting paper, despite the negative results. That said, the reviewers have reached a consensus that this paper is not yet ready to be accepted. I recommend that the authors compare their work to the existing works pointed out by the reviewers in their next version and further investigate the negative results in the no-chain-of-thought setting. I am recommending rejection of the current version.
Additional comments from reviewer discussion
Reviewers are mostly concerned about the negative results of this paper and the lack of novelty compared to existing literature. There's no author rebuttal provided.
Reject