PaperHub
6.1 / 10 · Poster · 4 reviewers
Reviewer ratings: 3, 4, 3, 3 (min 3, max 4, std 0.4)
ICML 2025

Understanding Synthetic Context Extension via Retrieval Heads

OpenReview · PDF
Submitted: 2025-01-24 · Updated: 2025-07-24
TL;DR

We identify when synthetic data falls short on learning long-context abilities and trace an explanation to a specific set of attention heads.

Abstract

Keywords
Large Language Models · Synthetic Data · Long Context · Retrieval Heads

Reviews and Discussion

Review
Rating: 3

The paper demonstrates that synthetic context extension can partially emulate real data’s effects on LLMs but falls short due to less effective training of retrieval heads. It provides a framework for understanding this gap through retrieval heads, offering both a diagnostic tool and a path toward improving synthetic data generation for long-context tasks. These findings could inform strategies to create synthetic datasets that better target the necessary model components, reducing reliance on costly real long-context data.

Questions for the Authors

No.

Claims and Evidence

  1. The paper presents experimental results across three tasks: Multi-Document Question Answering (MDQA), Multi-Hop Situated Question Answering (MuSiQue), and SummHay Citation. For each task, models fine-tuned on synthetic data consistently show lower F1 scores compared to those fine-tuned on real data. For example, on MDQA, the best synthetic data yields an F1 score of 0.49, while real data achieves 0.83 using Llama-3-8B-Instruct. Similar gaps are observed for MuSiQue and SummHay Citation, though the differences vary in magnitude. This claim is strongly supported by the evidence. The consistent performance gap across multiple tasks and models provides clear and convincing support for the claim that synthetic data underperforms compared to real data in long-context tasks.
  2. The paper identifies retrieval heads as attention heads specialized in retrieving key information from the context. It shows that models trained on synthetic data have fewer retrieval heads with positive retrieval scores (e.g., 112 and 74 for synthetic data vs. 129 for real data on MuSiQue). Additionally, there is a strong correlation between the recall of retrieval heads (the overlap of synthetic data heads with real data heads) and downstream task performance, with a Spearman correlation of 0.75 for Llama-3-8B-Instruct on MuSiQue. This claim is well-supported by the evidence. The reduction in retrieval heads and the high correlation with performance provide a convincing explanation for the performance gap. However, the paper’s focus on retrieval heads as the primary mechanism could be slightly overstated, as other components (e.g., multi-layer perceptrons, or MLPs) might also contribute, though this is partially addressed by a footnote stating similar conclusions hold when fine-tuning all modules.
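For concreteness, the quantities cited here (the count of heads with positive retrieval score, the recall of real-data retrieval heads by a synthetic-data model, and its Spearman correlation with downstream F1) could be computed along the following lines. This is a minimal sketch with toy random inputs and placeholder F1 values, not the authors' code:

```python
import numpy as np
from scipy.stats import spearmanr

def positive_heads(scores):
    """scores: [num_layers, num_heads] retrieval scores for one fine-tuned model."""
    return {tuple(ix) for ix in np.argwhere(scores > 0)}

def head_recall(real_scores, synth_scores):
    """Fraction of the real-data model's retrieval heads recovered by a synthetic-data model."""
    real_h, synth_h = positive_heads(real_scores), positive_heads(synth_scores)
    return len(real_h & synth_h) / max(len(real_h), 1)

# Toy illustration: one real-data model and four synthetic variants with random scores.
rng = np.random.default_rng(0)
real = rng.uniform(-0.5, 1.0, size=(32, 32))
synthetic = [rng.uniform(-0.5, 1.0, size=(32, 32)) for _ in range(4)]
f1 = [0.37, 0.41, 0.29, 0.34]                      # placeholder downstream F1 per variant
recalls = [head_recall(real, s) for s in synthetic]
rho, p = spearmanr(recalls, f1)                    # the paper reports rho ≈ 0.75 on MuSiQue
```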

Weaknesses: The paper heavily attributes the performance gap to retrieval heads, potentially underplaying the role of other transformer components like MLPs, which are known to handle parametric knowledge. Although the authors note in a footnote that similar conclusions hold when fine-tuning all modules, this is not fully explored in the main text. In addition, the paper varies synthetic data along concept expression and context diversity, but other factors (e.g., reasoning complexity or distractor presence) might also affect performance and are not explored.

Methods and Evaluation Criteria

Yes.

Theoretical Claims

The proofs for theoretical claims in the paper are correct.

Experimental Design and Analysis

The experimental designs or analyses in the paper are sound.

Supplementary Material

I checked the supplementary material of the paper; it provides data presentation and additional experimental results.

Relation to Prior Literature

The paper investigates the use of synthetic data for training LLMs on long-context tasks, specifically Multi-Document Question Answering (MDQA), Multi-Hop Situated Question Answering (MuSiQue), and SummHay Citation. It explores how varying the realism of "needle" concepts (key information to retrieve) and the diversity of the "haystack" context (surrounding information) impacts model performance. This systematic approach sheds light on the properties of synthetic data that influence real-world long-context capabilities. Previous research has used synthetic data for a variety of NLP tasks. In contrast, this paper extends the application of synthetic data to long-context tasks, systematically analyzing how realism and diversity affect performance across multiple domains.

Missing Essential References

No.

Other Strengths and Weaknesses

The experiments are conducted on two models (Llama-3-8B-Instruct and Mistral-7B-Instruct-v0.1) and three tasks. While the results are consistent, they may not fully generalize to other models with different architectures or pretraining, or to a broader range of tasks.

Other Comments or Suggestions

No.

Author Response

Thank you for your review!

The paper heavily attributes the performance gap to retrieval heads, potentially underplaying the role of other transformer components like MLPs, which are known to handle parametric knowledge. Although the authors note in a footnote that similar conclusions hold when fine-tuning all modules, this is not fully explored in the main text

Parametric knowledge should not be required for strong performance on the tasks we examine, since all the relevant information is contained within the input context. In fact, the closed-book F1 on MuSiQue is 0.086 for Llama3 and 0.042 for Mistral when prompted with the question directly and none of the relevant context documents, an indication that the model cannot “cheat” by drawing on parametric knowledge to answer the question. MDQA’s closed-book accuracy is higher (0.333 for Llama3 and 0.209 for Mistral). As a result, we think our focus on attention heads is consistent with the experimental evidence we’ve found as well as intuition from prior work. However, we can add more thorough discussion of this point in the main text in any future version.
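For reference, the closed-book numbers above are token-level F1 scores; a standard SQuAD-style implementation, which may differ in details from the paper's exact scorer, looks like this:

```python
from collections import Counter

def token_f1(prediction: str, gold: str) -> float:
    """SQuAD-style token-overlap F1 between a predicted and a gold answer string."""
    pred_toks, gold_toks = prediction.lower().split(), gold.lower().split()
    overlap = sum((Counter(pred_toks) & Counter(gold_toks)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred_toks), overlap / len(gold_toks)
    return 2 * precision * recall / (precision + recall)

token_f1("the Eiffel Tower in Paris", "Eiffel Tower")  # ≈ 0.57
```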

The paper varies synthetic data along concept expression and context diversity, but other factors (e.g., reasoning complexity or distractor presence) might also affect performance and are not explored.

Reasoning complexity: for our experiments, we explore tasks at 3 different levels of complexity: single-hop (MDQA), two-hop (SummHay Citation), and three-hop (MuSiQue).

Distractor presence: In our framework, distractor presence is a variation of the context diversity rather than its own axis. In future work, it would be interesting to clearly define what makes a document (or sentence) “distracting”, such as whether it has high token overlap with the question or needle sentence but is not actually useful for answering the question. Formalizing these notions would require tailoring to each “concept expression” variant, so we do not explore it here.
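As one illustration of the kind of formalization described above (an assumed heuristic for this discussion, not a method from the paper), a document could be flagged as a distractor when it overlaps lexically with the question but does not contain the answer:

```python
def is_distractor(doc: str, question: str, answer: str, threshold: float = 0.5) -> bool:
    """Flag a document with high lexical overlap with the question that lacks the answer."""
    q_toks, d_toks = set(question.lower().split()), set(doc.lower().split())
    overlap = len(q_toks & d_toks) / max(len(q_toks), 1)
    return overlap >= threshold and answer.lower() not in doc.lower()

is_distractor("The Eiffel Tower was painted red in 1887.",
              "When was the Eiffel Tower completed?", "1889")  # True: similar but unhelpful
```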

The experiments are conducted on two models (Llama-3-8B-Instruct and Mistral-7B-Instruct-v0.1) and three tasks. While the results are consistent, they may not fully generalize to other models with different architectures or pretraining, or to a broader range of tasks.

Different architectures or pretraining: Retrieval heads have been shown to be present across model families with different architectures (attention variants and mixture of experts) by Wu et al. (2024). We study how they are affected by fine-tuning in this work, particularly in two architectures with different attention variants, which we believe to have the most impact on the behavior of retrieval heads.

Broader range of tasks: While our results are most applicable to tasks that have a long input context, and where the final answer must be retrieved from that context, the concept of identifying attention heads which attend to the input context is also broadly applicable: as we demonstrate with SummHay “Insight Heads”, looking at attention heads that correctly identify relevant intermediate information in a multi-step reasoning task is also a strong indicator of synthetic data performance.

Reviewer Comment

I have carefully read all the reviewers' comments as well as the authors' responses, and my final opinion is to keep the score.

Review
Rating: 4

This work aims to answer an important research question in the field of long-context modeling: how could the training on synthetic long-context data improve LLMs. The authors present a novel investigation into the fine-tuning of LLMs using synthetically-generated long-context data. One of the key contributions of this paper is the exploration of varying the realism in the "needle" concepts (the information to be retrieved) and the diversity of the "haystack" context (the broader dataset). The paper's findings reveal that while models trained on synthetic data do not perform as well as those trained on real data, the underlying effectiveness of synthetic data can be reflected by the patterns in retrieval heads.

Questions for the Authors

No.

Claims and Evidence

Yes.

Methods and Evaluation Criteria

Yes.

Theoretical Claims

This work provides an empirical study. There is no theoretical proof.

Experimental Design and Analysis

Yes. The design of various synthetic datasets makes sense to me.

Supplementary Material

No.

Relation to Prior Literature

This work contributes to prior studies about (1) interpretability of attention mechanism and (2) the effectiveness of synthetic long-context data.

Missing Essential References

For constructing synthetic long-context data, there has been some existing work that proposed general principles for creating synthetic training data beyond dataset-specific constructions, such as [1,2]. Although these works did not focus on the principles proposed in this work, it would be nice to have some discussion of these general methods for synthetic long-context data.

[1] Make Your LLM Fully Utilize the Context.

[2] Bootstrap Your Own Context Length.

Other Strengths and Weaknesses

No.

Other Comments or Suggestions

It would be better if there were some analysis of the patterns of retrieval heads in existing long-context LLMs (such as Mistral-v0.2 and Llama-3.1). If some observations were in line with the conclusions of this work, it would greatly contribute to its soundness.

Author Response

Thank you for your review!

For constructing synthetic long-context data, there has been some existing work that proposed general principles for creating synthetic training data beyond dataset-specific constructions, such as [1,2]. Although these work did not focus on the principles proposed in this work, it would be nice to have some discussion on these general methods for synthetic long-context data.

Regarding the construction of general synthetic context extension datasets, our results indicate the effectiveness of a diverse source of documents and tasks as done in [1] and [2]. We will add these citations to the paper, thanks!

It could be better if there are some analysis on the patterns of retrieval heads in existing long-context LLMs (such as Mistral-v0.2 and Llama-3.1). If there are some observations in line with the conclusions in this work, it could greatly contributes to the soundness of this work.

Wu et al. (2024) showed that retrieval heads are clearly identifiable in existing long-context LMs such as Mistral-v0.2, which are trained on a wide variety of data mixtures. Our research question pertains to how specific types of context extension data influence the training of retrieval heads. While examining the retrieval heads used by existing long-context LMs on MDQA, MuSiQue, and SummHay after training on large data mixtures is an interesting extension, we consider it to be out of the scope of this work.

Review
Rating: 3

The paper explores the impact of fine-tuning large language models (LLMs) with synthetic data for long-context tasks, particularly in retrieval and reasoning. The study evaluates different methods of synthetic data generation, varying both the realism of the "needle" (key concept) and the diversity of the "haystack" (context). The core finding is that the effectiveness of synthetic data can be interpreted through retrieval heads—specialized attention heads responsible for extracting relevant information. The study demonstrates that retrieval heads learned from synthetic data correlate well with those learned from real data, but synthetic training remains less effective. The authors also introduce a method of patching retrieval heads to improve model performance. The findings contribute to understanding synthetic data's role in LLM training and provide insights into designing better synthetic datasets.

====== update after rebuttal ======

During the rebuttal process, overall, I think the authors did not provide sufficiently empirical or insightful direct responses to some of the points reviewers raised. For instance, phrases such as "leave for future work," "we do not explore it here," and "out of the scope of this work" were used repeatedly, which may indicate a lack of deeper engagement. However, considering the strengths and merits of the work as it currently stands, my final opinion is to maintain my original score of weak accept.

Questions for the Authors

  1. Retrieval heads seem necessary but not sufficient for strong task performance (MoreHopQA: More than multi-hop reasoning).
  2. Would your results generalize to generative synthetic data tasks (e.g., reasoning, math or code)?
  3. Does retrieval head behavior vary across different transformer architectures (e.g., mixture-of-experts models)?
  4. How does synthetic data performance change as model size scales up? Does a larger model mitigate synthetic data limitations?

Claims and Evidence

Most claims are clearly supported.

  1. Claim: Synthetic data fine-tuning can extend the effective context of LLMs, but it underperforms real data training.
  • Evidence: The authors compare performance on three long-context tasks (MDQA, MuSiQue, SummHay) and show that even the best synthetic datasets have a significant performance gap compared to real data.
  2. Claim: Retrieval heads play a key role in model performance on long-context tasks.
  • Evidence: Through a mechanistic interpretability analysis, the results demonstrate that models trained on synthetic data have a subset of retrieval heads found in models trained on realistic data.
  3. Claim: Patching retrieval heads from models trained on realistic data into models trained on synthetic data can improve performance.
  • Evidence: The authors conduct intervention experiments, showing that patching retrieval heads from real-data models improves performance on long-context tasks.
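The patching intervention cited in the third claim can be sketched roughly as below. This is a hedged illustration only, assuming a Hugging Face Llama-style attention module in which per-head outputs are concatenated along the hidden dimension before the o_proj projection; the checkpoint paths, layer/head indices, and prompt are placeholders and this is not the authors' released implementation:

```python
# Sketch only: assumes the HF Llama attention layout (head outputs concatenated before o_proj).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

real = AutoModelForCausalLM.from_pretrained("path/to/real-data-finetune")    # placeholder
synth = AutoModelForCausalLM.from_pretrained("path/to/synthetic-finetune")   # placeholder
tok = AutoTokenizer.from_pretrained("path/to/synthetic-finetune")

heads_to_patch = {13: [6, 21], 17: [3]}          # {layer: [head indices]} -- hypothetical
head_dim = real.config.hidden_size // real.config.num_attention_heads
cache = {}

def save_hook(layer):
    def hook(module, args):                      # capture o_proj input (concatenated head outputs)
        cache[layer] = args[0].detach()
    return hook

def patch_hook(layer, heads):
    def hook(module, args):                      # overwrite selected head slices with cached values
        x = args[0].clone()
        for h in heads:
            x[..., h * head_dim:(h + 1) * head_dim] = cache[layer][..., h * head_dim:(h + 1) * head_dim]
        return (x,) + args[1:]
    return hook

for layer, heads in heads_to_patch.items():
    real.model.layers[layer].self_attn.o_proj.register_forward_pre_hook(save_hook(layer))
    synth.model.layers[layer].self_attn.o_proj.register_forward_pre_hook(patch_hook(layer, heads))

inputs = tok("<long context> ... <question>", return_tensors="pt")
with torch.no_grad():
    real(**inputs)       # fill the cache with the real-data model's head outputs
    out = synth(**inputs)  # forward pass with the selected heads patched in
```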

Methods and Evaluation Criteria

The experimental design is sound and appropriate for the problem domain. The study fine-tunes two well-known LLMs (Llama-3-8B-Instruct and Mistral-7B-Instruct) and evaluates them on three long-context tasks, covering single-hop retrieval (MDQA), multi-hop retrieval (MuSiQue), and citation retrieval (SummHay).

Theoretical Claims

The paper does not primarily present new theoretical results, but it builds upon and extends existing work on retrieval heads (Retrieval Head Mechanistically Explains Long-Context Factuality). The mechanistic explanations are empirically validated through experiments rather than formal proofs.

Experimental Design and Analysis

The main results are mostly sound, I also checked the following aspects:

  1. Models are fine-tuned with consistent training procedures.
  2. Hyperparameters are documented (Appendix C).
  3. Retrieval heads are measured consistently across different models and datasets.

Supplementary Material

The appendices include detailed Synthetic Dataset Creation Prompts, Training Details, and Visualization Results, which add to the paper’s transparency.

Relation to Prior Literature

The paper is well-positioned within existing literature:

  • It builds on work on synthetic data for LLM fine-tuning (From Artificial Needles to Real Haystacks: Improving Retrieval Capabilities in LLMs by Finetuning on Synthetic Data)
  • It extends retrieval head analysis from single-hop settings to multi-hop and long-context tasks.
  • It connects with mechanistic interpretability work on transformer circuits (In-context Learning and Induction Heads.)

It is suggested to discuss its position in the context of general-purpose data synthesis for LLMs [1,2]; context compression for long-context LLMs [3,4]; and analysis of synthetic data biases and their effects on downstream performance [5,6].


[1] Img-Diff: Contrastive Data Synthesis for Multimodal Large Language Models
[2] Data-Juicer 2.0: Cloud-Scale Adaptive Data Processing for Foundation Models
[3] Make Your LLM Fully Utilize the Context
[4] Long Context Alignment with Short Instructions and Synthesized Positions
[5] Understanding and Mitigating the Bias Inheritance in LLM-based Data Augmentation on Downstream Tasks
[6] LLM-Based Synthetic Datasets: Applications and Limitations in Toxicity Detection

Missing Essential References

None.

Other Strengths and Weaknesses

Strengths:

  • Clear mechanistic insights into why synthetic data works.
  • Systematic comparison of synthetic data types.

Weaknesses: My major concern lies in the generalization beyond task and model specifications.

  • It is suggested to add more discussion on other LLM capabilities like reasoning.
  • Different transformer architectures and scaling up model size should also be considered.

Other Comments or Suggestions

None.

Author Response

Thank you for the review!

It is suggested to discuss its position in the context of general-purpose data synthesis for LLMs [1,2]; context compression for long-context LLMs [3,4]; and analysis of synthetic data biases and their effects on downstream performance [5,6].

Thank you for the suggestion, we’ll add the discussion to the paper. We note for the other reviewers and the AC that these papers do not directly impact the conclusions and comparisons in our paper.

  1. Retrieval heads seem necessary but not sufficient for strong task performance (MoreHopQA: More than multi-hop reasoning).

We agree with this view: retrieval heads are a useful indicator for comparing the model components targeted by different datasets, but do not comprise the whole set of model circuitry required for a given task. The advantage is that the concept of identifying attention heads which attend to the input context is very broadly applicable: as we demonstrate with SummHay “Insight Heads”, looking at attention heads that correctly identify relevant intermediate information in a multi-step reasoning task is also a strong indicator of synthetic data performance. This can be extrapolated to tasks like MoreHopQA that ultimately have generative answers: as long as the input context contains the required relevant information, attention heads that correctly attend to that information can be studied.

  2. Would your results generalize to generative synthetic data tasks (e.g., reasoning, math, or code)?

Our results are most applicable to tasks that require a long input context, and generative coding tasks that include an existing codebase or libraries as input would fit this description. As for tasks with shorter input context but long generative outputs, like most math problems in existing datasets, looking at the behavior of attention heads that might attend to the relevant parts of the intermediate generated output (“chain of thought”) is an interesting question that we leave for future work.

  3. Does retrieval head behavior vary across different transformer architectures (e.g., mixture-of-experts models)?

We experimented with two attention variants in our work: full attention in Llama-3-8B-Instruct and sliding window attention in Mistral-7B-Instruct-v0.1, since these attention differences are significant for long-context retrieval behavior. We show consistent observations across these variants. Since mixture-of-experts is an architecture variant that changes the MLP modules and does not affect the attention mechanism, we do not think this would make a significant difference; in fact, Wu et al. (2024) showed that the basic properties of retrieval heads are similar for both Mistral-7B-v0.2 and a MoE variant in the same model family, Mixtral-8x7B-v0.1.

  4. How does synthetic data performance change as model size scales up? Does a larger model mitigate synthetic data limitations?

Due to computational limitations, we restricted our investigation to 7B-scale models. (Note that long-context models have more extreme memory demands than standard-context models.) Nevertheless, dynamically activated retrieval heads have been shown to comprise a similar fraction of the total attention heads across model sizes (Wu et al., 2024). Since retrieval heads are unevenly activated based on context, we can speculate that synthetic data might face similar limitations for larger models.

Review
Rating: 3

This paper examines how synthetic data affects the performance of long-context language models (LLMs) on retrieval-based tasks. The authors find that while models fine-tuned on synthetic data generally underperform compared to those trained on real data, careful construction of synthetic datasets can partially close this performance gap. They identify "retrieval heads" as critical attention mechanisms that help models retrieve relevant information from long contexts, and show that synthetic data induces fewer of these heads than real data. However, there's a strong correlation between the presence of these retrieval heads and model performance. The study demonstrates that the cosine similarity between retrieval scores of real and synthetic data is a strong predictor of model effectiveness, providing insights into how to create better synthetic training data for long-context tasks.

update after rebuttal

The authors have explained the differences between their work and the retrieval head framework proposed by Wu et al. (2024). Additionally, the inclusion of p-value experiments further strengthens their arguments. As a result, I have increased my score.

Questions for the Authors

See Other Strengths And Weaknesses

Claims and Evidence

The claims made in the paper are generally well-supported:

  • The underperformance of synthetic data fine-tuning compared to real data is demonstrated through performance metrics across multiple tasks and models (Table 1).

  • The predictive power of cosine similarity between real and synthetic data retrieval scores is demonstrated through direct comparison of similarity metrics with performance outcomes (Figure 3).

  • The task-specific nature of retrieval heads is supported by cosine similarity measurements across different task types (Table 2).
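As a concrete reference for the similarity measure cited in these points, a minimal sketch follows; the per-head retrieval-score arrays (assumed shape [num_layers, num_heads]) are hypothetical inputs, not the paper's data:

```python
import numpy as np

def retrieval_score_cosine(real_scores: np.ndarray, synth_scores: np.ndarray) -> float:
    """Cosine similarity between flattened per-head retrieval-score vectors."""
    a, b = real_scores.ravel(), synth_scores.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
```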

Methods and Evaluation Criteria

The proposed methods and evaluation criteria in the paper are well-suited to the problem of understanding how synthetic data affects the performance of long-context language models on retrieval-augmented tasks:

  • The authors chose three diverse long-context tasks (MDQA, MuSiQue, and SummHay Citation) that represent different aspects of retrieval and reasoning. This provides a comprehensive understanding of how synthetic data impacts various types of long-context processing.

  • The systematic variation of concept expression and context diversity in synthetic datasets allows for controlled experimentation on how different aspects of data realism affect model performance.

  • The focus on retrieval heads as a specific mechanism for understanding model behavior is appropriate, as these heads have been shown to be critical for information retrieval in long-context settings. This provides a concrete, interpretable measure of how well models are learning to perform the required tasks.

  • The paper includes appropriate baselines (fine-tuning on real data) and makes meaningful comparisons between different synthetic data construction methods.

Theoretical Claims

There are no theoretical claims in the paper; the authors mainly use experiments to support their claims.

Experimental Design and Analysis

The experimental designs and analyses in this paper are generally sound.

  • The systematic variation of concept expression and context diversity in synthetic datasets allows for controlled experimentation. This approach effectively isolates variables that might influence fine-tuning outcomes.

  • The focus on retrieval heads as a specific mechanism for understanding model behavior is appropriate, as these heads have been shown to be critical for information retrieval in long-context settings. This provides a concrete, interpretable measure of how well models are learning to perform the required tasks.
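For readers less familiar with the measure, a simplified per-head retrieval score in the spirit of Wu et al. (2024) can be computed as below; the inputs (per-step attention rows for one head and the context position of each copied token) are hypothetical, and the paper's exact definition may differ in details:

```python
import numpy as np

def retrieval_score(head_attn, copy_sources):
    """head_attn: [steps, ctx_len] attention weights of one head at each decoding step.
    copy_sources: per step, the context index of the needle token being copied, or None."""
    hits = total = 0
    for step, src in enumerate(copy_sources):
        if src is None:
            continue
        total += 1
        if int(np.argmax(head_attn[step])) == src:   # head's top-attended position is the copied token
            hits += 1
    return hits / total if total else 0.0
```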

However, while the paper shows numerical differences in performance metrics, it does not provide detailed statistical analysis (confidence intervals, p-values) to confirm whether these differences are statistically significant.

Supplementary Material

Yes. The supplementary material mainly contains additional experiments and prompts used to generate synthetic datasets.

Relation to Prior Literature

The contributions of this paper may shed light on two research domains: retrieval mechanisms in Transformers and synthetic data for language model training. Specifically, the paper builds upon work by Wu et al. (2024), which identified retrieval heads as critical mechanisms for long-context factuality in LLMs. This paper extends that research by examining how synthetic data affects the development and effectiveness of these retrieval heads across different tasks and model architectures.

Missing Essential References

The paper does a good job of reviewing the literature.

Other Strengths and Weaknesses

Strengths:

  • The paper creatively combines existing ideas about synthetic data and mechanistic interpretability, focusing specifically on how synthetic data affects the development of retrieval heads in long-context LLMs.
  • The findings have practical implications for developing more effective synthetic training data for long-context LLMs, which is important given computational constraints of training on real long-context data.
  • The writing is generally clear and accessible, with appropriate technical detail for the intended audience.

Weaknesses:

  • Statistical analyses such as confidence intervals and p-values are missing.

  • While the analysis employs a well-established methodology—specifically adopting the retrieval head framework defined by Wu et al. (2024)—the study primarily extends prior work by applying this technique to examine LLM behavior on synthetic datasets. This approach, while methodologically sound, results in limited novelty, as it does not introduce significant conceptual or technical innovations beyond the foundational framework.

  • You note that different tasks leverage different sets of retrieval heads. What do you believe accounts for these differences, and how should this influence the design of synthetic data for specific types of long-context tasks?

Other Comments or Suggestions

See above.

Author Response

Thank you for the review!

Statistical analyses such as confidence intervals and p-values are missing.

To address this, we will make the following additions to Tables 1, 3 and 11 to show when a performance gain is significant:

Table 1: † indicates that a model trained on the best synthetic dataset in the column outperforms a model trained on the indicated dataset with p < 0.05 according to a paired bootstrap test. We see that most gains ≥ 0.02 are statistically significant.

| Concept Exp. | Context Div. | MDQA Llama3 | MDQA Mistral | MuSiQue Llama3 | MuSiQue Mistral |
|---|---|---|---|---|---|
| High | High | 0.31† | 0.20† | 0.37† | 0.22 |
| High | Low | 0.41† | 0.23† | 0.41 | 0.23 |
| Low | High | 0.49 | 0.31 | 0.29 | 0.21 |
| Low | Low | 0.47† | 0.24† | 0.34† | 0.17† |
| Symbolic | Symbolic | 0.48 | 0.16† | 0.32† | 0.11† |
| Real Data (Full) | | 0.83 | 0.64 | 0.45 | 0.20 |
| Real Data (Limited) | | 0.80 | 0.59 | 0.32 | 0.16 |
| Non-FT | | 0.45 | 0.12 | 0.22 | 0.03 |

| Concept Exp. | Context Div. | SummHay Llama3 | SummHay Mistral |
|---|---|---|---|
| High | High | 0.70† | 0.28† |
| High | Low | 0.61† | 0.28† |
| Simplified | High | 0.79 | 0.38 |
| Simplified | Low | 0.65† | 0.28† |
| Symbolic | Symbolic | 0.54† | 0.18† |
| Real Data (Full) | | 0.81 | 0.40 |
| Real Data (Limited) | | 0.80 | 0.40 |
| Non-FT | | 0.40 | 0.07 |

Tables 3 and 11: Patching results (summarized due to character limits): We perform a paired bootstrap test of whether any patched model outperforms the original model. In Table 11, all performance improvements for patched synthetic data models on MDQA and MuSiQue are significant with p < 0.05, and gains on SummHay are not significant.
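For reference, a paired bootstrap test of the kind described here can be sketched as follows; the per-example score arrays named in the usage comment are hypothetical placeholders:

```python
import numpy as np

def paired_bootstrap_p(scores_a, scores_b, n_boot=10_000, seed=0):
    """One-sided p-value for 'model B outperforms model A' from per-example scores."""
    rng = np.random.default_rng(seed)
    diffs = np.asarray(scores_b, dtype=float) - np.asarray(scores_a, dtype=float)
    n = len(diffs)
    not_better = 0
    for _ in range(n_boot):
        sample = diffs[rng.integers(0, n, size=n)]   # resample example-level differences
        if sample.mean() <= 0:
            not_better += 1
    return not_better / n_boot

# e.g., p = paired_bootstrap_p(per_example_f1_original, per_example_f1_patched)  # hypothetical arrays
```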

While the analysis employs a well-established methodology—specifically adopting the retrieval head framework defined by Wu et al. (2024)—the study primarily extends prior work by applying this technique to examine LLM behavior on synthetic datasets. This approach, while methodologically sound, results in limited novelty, as it does not introduce significant conceptual or technical innovations beyond the foundational framework.

In addition to showing a strong relationship between the retrieval heads recruited by synthetic datasets and downstream performance, our work takes a novel step in demonstrating that mechanistic interpretability can give insight into realistic data tasks involving complex reasoning. In contrast, Wu et al. (2024) only examined a single-step retrieval task. We also show that attention heads that attend to intermediate information within the context can be clearly identified and used to understand dataset performance, which extends the breadth of tasks that can be examined beyond purely extractive tasks.

You note that different tasks leverage different sets of retrieval heads. What do you believe accounts for these differences, and how should this influence the design of synthetic data for specific types of long-context tasks?

While our work does not do a full characterization of the circuits that are relevant to solve each task, each task requires different upstream capabilities, which we think leads to different attention heads recruited for the final retrieval step. (And different attention heads are recruited for different intermediate reasoning capabilities, as indicated by the SummHay Insight Head analysis).

To design synthetic data for different long-context tasks, our work shows that only a small number of real examples (e.g., ~40 MuSiQue examples) is needed to identify relevant retrieval heads, and this can then be used to assess the performance of promising synthetic datasets (using a few hundred examples) before scaling up synthetic dataset generation.
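A compact sketch of this screening recipe, with hypothetical inputs (a per-head retrieval-score array estimated from the small real set, and one array per model fine-tuned on a candidate synthetic dataset); the top-k criterion is an illustrative simplification, not the authors' exact procedure:

```python
import numpy as np

def screen_synthetic_datasets(real_scores, candidate_scores, top_k=50):
    """Rank candidate synthetic datasets by how well models fine-tuned on them
    recover the real-data model's strongest retrieval heads.
    real_scores: [layers, heads] array; candidate_scores: {name: [layers, heads] array}."""
    real_top = set(np.argsort(real_scores.ravel())[-top_k:])
    ranking = []
    for name, scores in candidate_scores.items():
        cand_top = set(np.argsort(scores.ravel())[-top_k:])
        ranking.append((name, len(real_top & cand_top) / top_k))
    return sorted(ranking, key=lambda kv: kv[1], reverse=True)
```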

Final Decision

This paper presents a timely and well-executed empirical study on the effectiveness of synthetic context extension for training long-context LLMs, offering mechanistic insights via the analysis of retrieval heads. The experimental design is solid, spanning multiple datasets, models, and task types, and the authors' identification of retrieval heads as key to performance provides a concrete diagnostic tool for evaluating synthetic dataset quality. The inclusion of patching experiments further strengthens the causal claims about the role of these heads.

While some reviewers raised concerns about limited novelty beyond prior work (particularly Wu et al. 2024), the authors make a compelling case that their contributions lie in extending retrieval head analysis to multi-hop and complex reasoning tasks, as well as offering principled methods for evaluating synthetic datasets in these settings. Additionally, reviewers pointed out the initial lack of statistical significance testing, which the authors addressed with a robust rebuttal including bootstrap analyses. Broader generalization—across models, tasks, or LLM capabilities like reasoning—remains an open question, but the empirical evidence presented here is strong and the methodology is reproducible and impactful.

Overall, while the work may not be a clear top-tier paper in terms of conceptual innovation, its solid experimental foundation, practical relevance for synthetic data construction, and interpretability contributions justify acceptance.