Extractive Structures Learned in Pretraining Enable Generalization on Finetuned Facts
Pretraining creates extractive structures in LMs that enable them to generalize to implications of new facts seen during finetuning.
Abstract
Reviews and Discussion
This work provides an in-depth analysis of the ability of pretrained language models to generalize from specific facts to broader implications. The authors focus on understanding the underlying mechanisms that allow pretrained language models to make such generalizations after being finetuned on particular facts. The paper introduces the concept of extractive structures, a novel framework that describes how different components within Transformer-based models, such as MLP layers and attention modules, work in concert to enable this generalization. The authors suggest that extractive structures consist of three kinds of components: informative components, and upstream and downstream extractive components. The paper presents two main predictions based on the extractive structures hypothesis: the data ordering effect and the weight grafting effect. Empirical evidence supporting these predictions is provided through experiments conducted on various large-scale language models.
Questions for Authors
No.
Claims and Evidence
Yes.
Methods and Evaluation Criteria
Yes.
Theoretical Claims
I checked the design of the extractive structures framework. I think it makes sense.
Experimental Design and Analysis
The experiments are sound and support the predictions about data ordering and weight grafting.
Supplementary Material
No.
Relation to Broader Scientific Literature
This work contributes to the interpretability of the learning process and knowledge storage of neural language models.
Essential References Not Discussed
No.
Other Strengths and Weaknesses
No.
Other Comments or Suggestions
No.
Thank you for the review! We're happy that you believe our extractive structures framework makes sense and that our experiments are sound!
This paper introduces extractive structures—model components that store, retrieve, and process facts—to explain how LMs generalize to implications of fine-tuned facts. The authors show these structures emerge during pretraining when models encounter implications of known facts. Experiments on multiple LMs confirm a data ordering effect (OCR fails if implications precede facts) and a weight grafting effect (extractive structures transfer to counterfactuals), offering insights into LM generalization and robustness.
Questions for Authors
-
Evaluation Details: In Appendix B.1, you mention using the log-probability of the first continuation token for scoring. Could you clarify whether the mean rank is calculated based on this first token alone or on the entire continuation sequence?
-
Training Details: Out-of-context reasoning (OCR) can be sensitive to training hyperparameters. While you've discussed the impact of learning rates, have you also examined how the annealing stage affects OCR? A comparison between the final checkpoint and the last checkpoint before the final annealing stage could provide valuable insights.
-
Defining Early and Late Layers: In Table 4, layers 1–24 are categorized as early, and layers 25–32 as late. Could you elaborate on the criteria used to define these layers as early or late? Was this classification based on the visualizations in Figure 5 or another methodology?
Claims and Evidence
The claims made in the paper are well supported.
Claim 1. The structures consist of informative components that store training facts as weight changes, and upstream and downstream extractive components that query and process the stored information to produce the correct implication.
The structures are intuitive. To operationalize this, the authors define scores to identify the roles of each LM module. The score visualizations support the proposed structure and align with findings from prior works analyzing pretrained models.
Claim 2. Our technique reveals that fact learning occurs in both early and late layers, which enable different forms of generalization.
This claim is supported by the layer-freezing ablation in Table 2, where freezing either early or late layers does not hurt performance on the finetuned facts, indicating that these facts are distributed across the model. Freezing early layers impairs first-hop implications, while freezing later layers impairs second-hop implications, suggesting that facts stored in early layers, as part of the first-hop informative components, enable first-hop reasoning, while facts in later layers enable second-hop reasoning.
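For concreteness, this kind of layer-freezing ablation can be sketched roughly as follows. This is a minimal illustrative sketch, not the authors' actual code: the model id, the `model.model.layers` module path (Llama/OLMo-style), and the early/late split at layer 24 are assumptions.

```python
# Illustrative sketch of a layer-freezing ablation: finetune on new facts while
# freezing either the early or the late transformer blocks, then compare
# first-hop vs. second-hop implication performance.
from transformers import AutoModelForCausalLM

def freeze_blocks(model, which="early", split=24):
    """Disable gradients for early (< split) or late (>= split) transformer blocks."""
    for idx, block in enumerate(model.model.layers):  # Llama/OLMo-style module path
        frozen = idx < split if which == "early" else idx >= split
        for param in block.parameters():
            param.requires_grad = not frozen
    return model

model = AutoModelForCausalLM.from_pretrained("allenai/OLMo-7B-0424-hf")  # assumed model id
model = freeze_blocks(model, which="early")
# ...finetune only the trainable parameters on the new facts, then evaluate
# first-hop and second-hop implications as in the paper's ablation.
```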
Claim 3. We next study how extractive structures are learned during pretraining and propose a mechanism by which this occurs.
The extractive structures are hypothesized to emerge as models strategically generalize from facts to implications during pretraining, rather than memorizing both simultaneously. The paper supports this by designing synthetic pre-training settings, showing that the model's OCR ability is only non-trivial when facts precede their implications during pretraining.
Methods and Evaluation Criteria
I find the proposed evaluations well-designed and effective in demonstrating extractive structures as a plausible mechanism for how LMs generalize to implications of fine-tuned facts. See "Claims And Evidence" for details.
Theoretical Claims
I have not checked the proofs for the extractive scores in the appendices in great detail, but the score definitions seem intuitively reasonable to me.
Experimental Design and Analysis
I find the experimental designs and analyses both sound and compelling. See "Claims And Evidence" for details.
Supplementary Material
I have scanned through all the supplementary material.
Relation to Broader Scientific Literature
Previous work on out-of-context reasoning shows inconsistent results—fine-tuned facts sometimes enable generalization, but not always. This paper introduces extractive structures—model components that store, retrieve, and process facts—to explain why out-of-context reasoning occurs. It also analyzes how these structures are learned during fine-tuning, shedding light on why certain forms of generalization may fail, particularly when pretraining data lacks the necessary conditions for these structures to form.
Essential References Not Discussed
No
Other Strengths and Weaknesses
Strengths:
-
The paper is well-written, easy to follow, and presents a clear, intuitive mechanism for how LMs generalize to implications of fine-tuned facts, backed by thorough experiments and analysis.
-
The proposed extractive structure provides a timely framework that reconciles inconsistencies in the OCR literature (see "Relation to Broader Scientific Literature").
-
The experimental design is novel and engaging, with results and visualizations that are clear, interpretable, and strongly support the claims.
Weaknesses:
I did not find any major weaknesses in this work.
Other Comments or Suggestions
This paper is well-written, but I have a few suggestions to improve clarity:
-
On line 24, changing "training" fact to "finetuning" fact would better emphasize that the fact in question is introduced during the fine-tuning stage.
-
In the setup paragraph of Section 6.1, it is initially unclear why dax represents facts and wugs represents implications until Appendix D.2 clarifies that there are more names (100) than animals (20). While this distinction may seem minor, moving some dataset design details to the main text could make the synthetic setting clearer from the outset.
Thank you for your review! We are happy that you found our claims "well supported", our evaluations "well-designed and effective", and our experimental designs and analyses "both sound and compelling". We're also grateful for the writing suggestions!
- Evaluation Details: In Appendix B.1, you mention using the log-probability of the first continuation token for scoring. Could you clarify whether the mean rank is calculated based on this first token alone or on the entire continuation sequence?
The mean rank is calculated based on the first token alone. Fortunately, in all of our datasets and all the tokenizers used in the models we studied (including those in the appendices), the first tokens are unique across the 20 options in each dataset. We'll clarify this in the paper.
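For concreteness, a minimal sketch of this first-token scoring (the helper name and interface are illustrative, not the paper's actual evaluation code):

```python
# Illustrative sketch: rank the correct continuation among K options using only
# the log-probability of each option's first token; mean rank averages this
# over the dataset. Assumes first tokens are unique across options, as in our datasets.
import torch

def first_token_rank(model, tokenizer, prompt, options, correct_idx):
    first_ids = [tokenizer(o, add_special_tokens=False).input_ids[0] for o in options]
    assert len(set(first_ids)) == len(options), "first tokens must be unique"
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        next_token_logits = model(**inputs).logits[0, -1]
    logprobs = torch.log_softmax(next_token_logits, dim=-1)[first_ids]
    # rank 0 means the correct option has the highest first-token log-prob
    return int((logprobs > logprobs[correct_idx]).sum())
```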
- Training Details: Out-of-context reasoning (OCR) can be sensitive to training hyperparameters. While you've discussed the impact of learning rates, have you also examined how the annealing stage affects OCR? A comparison between the final checkpoint and the last checkpoint before the final annealing stage could provide valuable insights.
This is a great question. We have indeed performed this experiment; the final checkpoint of OLMo-0424 generalizes slightly worse than the pre-anneal checkpoint on the first-hop and second-hop OCR tasks, and shows similar results for the data ordering and grafting experiments. We think it is possible that annealing does make a difference, but without more systematic pretraining-scale experiments, or more details behind the other open-weight but not open-source models such as Llama and Gemma, it is hard to draw a strong conclusion. The figures are available at the following external links:
- Defining Early and Late Layers: In Table 4, layers 1–24 are categorized as early, and layers 25–32 as late. Could you elaborate on the criteria used to define these layers as early or late? Was this classification based on the visualizations in Figure 5 or another methodology?
The classification was based on the informative scores in Fig. 5. Specifically, we compared the informative scores of the first-hop MLPs with those of the second-hop MLPs. The reason we classified based on the MLPs is that the scale of the informative scores for MLPs is much higher than that for attention heads. We'll update the paper to clarify this.
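As a rough illustration of this criterion (the score values below are synthetic stand-ins; the real per-layer informative scores come from Fig. 5):

```python
# Illustrative sketch of the early/late classification criterion: label each
# layer by whether the first-hop or the second-hop MLP informative score
# dominates. The scores here are made-up placeholders for those in Fig. 5.
import numpy as np

rng = np.random.default_rng(0)
first_hop_mlp = np.concatenate([rng.uniform(0.5, 1.0, 24), rng.uniform(0.0, 0.2, 8)])
second_hop_mlp = np.concatenate([rng.uniform(0.0, 0.2, 24), rng.uniform(0.5, 1.0, 8)])

labels = np.where(first_hop_mlp >= second_hop_mlp, "early", "late")
for layer, label in enumerate(labels, start=1):
    print(f"layer {layer:2d}: {label}")
```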
This paper studies the mechanisms of how language models perform two-hop out-of-context reasoning (OCR), where the model generalizes to implications of new facts acquired during fine-tuning that involve composing the new facts as the first or second hop with another known fact. A series of experiments is conducted, which consolidates prior findings that LMs usually perform the two hops serially in the lower/middle and upper/late layers, respectively. The authors also hypothesize that the mechanism is learned during pretraining when encountering implications of known facts, where several controlled experiments strengthen this hypothesis.
update after rebuttal
The rebuttal helps complement the draft and addresses some of my concerns. I will maintain my original score, which leans positive overall.
Questions for Authors
None
Claims and Evidence
Yes. The claims are supported by clear evidence from various angles.
Methods and Evaluation Criteria
Most of the methods/evaluation criteria make sense. One metric that is somewhat unconventional to my knowledge is the usage of the rank of the probability that the model assigns to the ground truth continuation among a set of options, introduced in Section 3, which also serves as the basis for the later metrics. I wonder why more standard metrics such as accuracy were not chosen.
Theoretical Claims
I don't think there are theoretical claims in this work.
Experimental Design and Analysis
The experimental designs and analyses are sound.
Supplementary Material
I briefly reviewed all the appendix sections, in particular the lists of person names and templates.
Relation to Broader Scientific Literature
The work is based on prior findings (cited in the paper) on latent multi-hop reasoning in transformer language models, especially that LMs usually perform the two hops sequentially within the forward pass. Overall, the main contributions w.r.t. the existing literature are to consolidate and extend these prior findings with a new set of techniques for comparing models before and after fine-tuning.
Essential References Not Discussed
There are no undiscussed essential references to my knowledge.
Other Strengths and Weaknesses
Even though there are many interventions on the model internals that measure the different components, etc., the overall conclusion is arguably still quite behavioral. It might be interesting to look more closely at the specifics, especially the interface between the informative and extractive components and how they are implemented. This might help explain why the model generally seems to learn the right things during fine-tuning instead of just memorizing the facts in arbitrary ways.
Other Comments or Suggestions
None
Thank you for your review. We're happy you find that our claims are "supported by clear evidence from various angles", and that our "experimental designs and analyses are sound". We're also grateful for your constructive feedback! We'll now discuss your concerns.
One metric that is somewhat unconventional to my knowledge is the usage of the rank of the probability that the model assigns to the ground truth continuation among a set of options, introduced in Section 3, which also serves as the basis for the later metrics. I wonder why more standard metrics such as accuracy were not chosen.
The main reason for using the mean rank instead of accuracy is that mean rank can measure partial progress more easily than accuracy. This allows us to measure improvements in LM performance throughout training. In contrast, accuracy gives a sparse, sharp signal only when the log-prob of the correct continuation exceeds that of every other continuation, and has been shown to be misleading at times (Schaeffer et al., 2023).
We also believe that our use of mean rank is sound, particularly in this setting. First, note that we can reliably interpret low mean rank as high accuracy, because a mean rank of 0 necessarily implies perfect accuracy, and vice versa.
Second, in our setting, the language model cannot possibly use any prior knowledge or reasoning to improve mean rank without actually learning the underlying facts. This is because our synthetic dataset is constructed by picking a label uniformly at random from a set of possible continuations. In contrast, standard multiple-choice benchmarks have options that often have relationships between them, so language models can improve their mean rank by eliminating impossible answers (e.g. if option A implies option B, then option A is impossible, since there can only be one correct answer). This particular problem has been found in the TruthfulQA dataset. Because our continuations are randomly generated, there is no internal structure to bias the mean rank.
Schaeffer, Rylan, Brando Miranda, and Sanmi Koyejo. "Are emergent abilities of large language models a mirage?." Advances in Neural Information Processing Systems 36 (2023): 55565-55581.
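To illustrate the point about randomly generated continuations, here is a quick simulation (illustrative only, not from the paper): with K = 20 options and the correct label drawn uniformly at random, an uninformed scorer gets an expected mean rank of (K - 1)/2 = 9.5, so any substantially lower mean rank must come from actually learning the facts.

```python
# Illustrative simulation: with the correct option drawn uniformly at random
# among K = 20 continuations, a scorer with no knowledge of the facts gets a
# mean rank of about (K - 1) / 2 = 9.5; there is no structure to exploit.
import numpy as np

rng = np.random.default_rng(0)
K, n_examples = 20, 10_000

ranks = []
for _ in range(n_examples):
    scores = rng.normal(size=K)     # arbitrary log-prob-like scores
    correct = rng.integers(K)       # label chosen uniformly at random
    ranks.append(int((scores > scores[correct]).sum()))  # rank 0 = best

print(np.mean(ranks))  # approximately 9.5
```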
This paper focuses on the two-step implication process in LLMs and proposes "extractive structures" to analyze which LLM sublayers dominate the implication for different types of problems (two types in the paper), with a method to highlight them based on the output probability. The paper also discusses how the implication behavior is acquired during fine-tuning of LLMs, arguing that (1) early layers encode implications at the early positions in the input, and late layers vice versa, and (2) learning implications should happen after learning facts.
update after rebuttal:
Thank you for your comments on my review! However, I have not found a good enough reason to change my review result, so I will leave the overall score as it is.
Questions for Authors
I would like the authors to provide some comments on the following questions: (1) Can the proposed framework yield similar results on larger models (10B–100B) or smaller models (~1B)? (2) When the length of the implication chain is increased beyond 2, what can be said based on these results?
Claims and Evidence
The study focuses only on two-hop implications: fact A implies B, then B implies C. If we discuss only this setting, the claims in the paper are well supported. But the results are too weak to be generalized to more complex tasks (implications with three or more hops).
Methods and Evaluation Criteria
The proposed method is based on comparing output probabilities when a certain set of weights/variables is replaced. This approach is able to highlight the effect of a specific sublayer in the LLM. In contrast, this approach does not focus on the actual behavior of the model parameter updates.
Theoretical Claims
Although the paper contains much discussion, the method is basically simple: check differences in output probability/rank when some portion of the target LM is tweaked. The definitions of the extractive scores look okay, but they are somewhat arbitrary, and a different argument may involve a different formulation.
Experimental Design and Analysis
The experiments mainly focus on analyzing OLMo, and the results may be suitable for explaining how the OLMo model behaves. But it is somewhat questionable whether insights obtained from experiments can be generalized to other models (as noted in Appendix I, which shows some counterexamples).
Supplementary Material
Appendix B contains the actual calculation of the extractive scores, with derivations. Appendix I shows several results on other models, exhibiting some counterintuitive behaviors.
Relation to Broader Scientific Literature
The study reveals several aspects of how and where the model remembers information from training data, so it may have an impact on the design of training strategies for continual pre-training and fine-tuning of LLMs.
Essential References Not Discussed
NA
Other Strengths and Weaknesses
NA
Other Comments or Suggestions
Table 1: "Senshoji" should be "Sensoji" (if the authors intended the famous temple in Tokyo). L.210: "the the" -> "the".
Thank you for the review! We're happy that you find that our claims are "well supported" for the two-hop setting. We're also grateful that you've pointed out areas to work on in terms of writing. We'll now address your questions and concerns.
But it is somewhat questionable whether insights obtained from experiments can be generalized to other models
(1) Can the proposed framework yield similar results on larger models (10B–100B) or smaller models (~1B)?
While we have shown that different models vary in terms of how well they can exhibit OCR and how sensitive they are to hyperparameters, we want to highlight that all the models we investigated show qualitative patterns consistent with our main hypotheses.
In addition, as suggested by the reviewer, we are running experiments on the Llama-3.2-1B and Gemma-2-27B. The results we have so far are consistent with the qualitative patterns exhibited by the 4 different 7B models we've studied. Specifically, we find that in every model we studied,
- The model exhibits OCR in the first-hop and second-hop datasets for some learning rates and training epochs (Llama-3.2-1B)
- For all learning rates and training epochs, the fact-first data order generalizes at least as well as the implications-first data order. Further, for some learning rates and training epochs, the fact-first data order generalizes significantly better. (Llama-3.2-1B)
- For all learning rates and training epochs, the grafted model generalizes to counterfactual implications at least as well as the control model. Further, for some learning rates and training epochs, the grafted model generalizes significantly better than the control model. (Llama-3.2-1B)
We are still in the middle of running the Gemma-2-27B experiments; we hope to provide an update in the next few days.
(2) When the length of the implication chain is increased beyond 2, what can be said based on these results?
We believe that our extractive structures framework can be generalized to describe longer chains of latent reasoning, or even reasoning that requires consolidating information from many different training documents (Treutlein et al., 2024). For example, to deal with several hops we might generalize the simple [upstream] -> [informative] -> [downstream] mechanism to [upstream] -> [informative 1] -> [connector] -> [informative 2] -> [downstream], and adapt the causal metrics accordingly. We believe that extending the framework to these settings would be an exciting way of building on our present work.
Treutlein, Johannes, et al. "Connecting the dots: Llms can infer and verbalize latent structure from disparate training data." Advances in Neural Information Processing Systems 37 (2024): 140667-140730.
This paper investigates the mechanisms that enable Language Models to perform two-hop implications, i.e. generalization to implications of new facts acquired during fine-tuning that involve chaining a new fact with another known fact. The main hypothesis of the paper is that two-hop implications are enabled by "extractive structures", functional model components that work together to 1) extract relevant information stored in the weights given a prompt, and 2) produce correct implications by processing the extracted information. The paper experimentally localizes such extractive structures in attention heads and MLP components of OLMo models during the execution of appropriately constructed synthetic tasks, and proposes learning mechanisms by which extractive structures are embedded into the model during pretraining. Finally, the paper validates two predictions derived from this putative mechanism.
The major weakness pointed out by reviewers is the concern over generalization of the results to LLM families beyond OLMo, which the authors counter with claims about preliminary results on Llama and Gemma models that are consistent with the main hypotheses of the paper.
Other than that, reviewers praised the paper for its clear exposition, for its timeliness and breadth in reconciling inconsistencies in the current related literature, and for its sound experimental designs and analyses, which contribute to empirically validating the hypotheses put forward as the underlying mechanisms implementing two-hop implications. Given the current overwhelming interest in reasoning in LLMs, this paper is a timely and relevant contribution to the field, also for how it points to hypothesis-driven mechanistic studies as a viable approach to investigating the core implementational aspects underlying the reasoning capabilities of LLMs. The paper is therefore recommended for acceptance.