PaperHub
Average rating: 5.8 / 10 (Rejected; 4 reviewers; lowest 5, highest 6, std. dev. 0.4)
Individual ratings: 6, 6, 6, 5
Confidence: 3.0 | Correctness: 2.8 | Contribution: 2.5 | Presentation: 3.0
ICLR 2025

Understanding Synthetic Context Extension via Retrieval Heads

OpenReview | PDF
Submitted: 2024-09-27 | Updated: 2025-02-05
TL;DR

We identify when synthetic data falls short for learning long-context abilities and trace an explanation to a specific set of attention heads.

Abstract

Keywords
Large Language Models, Synthetic Data, Long Context

Reviews and Discussion

Official Review
Rating: 6

Synthetic data is widely used to enhance long-context understanding and training in large language models, particularly in scenarios with limited resources. However, assessing the efficiency and performance impact of synthetic data on large language models remains challenging. This paper explores how synthetic data for context extension impacts downstream task performance, advancing understanding of long-context behavior and how synthetic data enhances language model capabilities.

Strengths

  1. The paper is clearly written and readily comprehensible.

  2. The paper presents a clear and compelling motivation. The discussion on the impact of synthetic data in training large language models is insightful and valuable for advancing the exploration and understanding of LLM principles.

  3. This paper presents clear and coherent experimental procedures along with well-organized technical results.

Weaknesses

  1. This paper attempts to explore the influence and effects of synthetic data on the training of large language models (LLMs). However, it fails to establish a unified configuration and pattern, leading to results that appear somewhat random. This may affect the generalizability and applicability of the paper to a broader range of contexts.

  2. I have noticed that randomly dropping attention heads can sometimes improve performance. Is it possible that certain information is detrimental, and could selectively dropping the least important heads enhance performance?

  3. The evaluation of the LLM's performance is limited to a single, subjective measure, which may not fully support the conclusions.

Questions

  1. This paper focuses on long-context learning. I am curious whether the influence of synthetic data, and the corresponding conclusions, would be similar for short-context learning, which could further enhance the generalizability of the paper's insights.

  2. This paper is well-organized and technically sound; however, it still has limitations and uncertainties. I lean toward a weak acceptance and will take the other reviewers' opinions into account before making a final decision.

  3. Please see the weaknesses outlined.

Comment

Thank you for the feedback! We address your comments below:

W1: We believe that our results exhibit clear patterns, namely:

  • Retrieval heads for real and synthetic versions of the same task have high overlap, as we noted in the paper: there is higher cosine similarity between retrieval heads for the same conceptual tasks (i.e. real and synthetic data for the same task) than across tasks (MuSiQue vs. SummHay Citation).
  • The synthetic data retrieval heads are largely subsets of the real data retrieval heads. To make this more apparent, we have added recall tables and discussion in Appendix E: “We find that on all 3 tasks, the attention heads with non-zero retrieval scores on the real data have high recall (≥ 0.76) against those identified on the synthetic data, while the reverse is more often lower.”
  • We see greater retrieval head presence in the middle-to-last layers, as visualized in Figure 6 (formerly Figure 5) in Appendix D. SummHay insight heads follow this pattern with one clear difference: they are less prevalent in the last 2-3 layers of the model, reflecting their “intermediate” role of helping determine where downstream attention heads should look for the final answer tokens.
  • Within a single layer, the relevant attention heads were randomly selected to be “primed” during pretraining for performing the target fine-tuning task. We have added this observation to the figure caption.

Additionally, retrieval heads can be identified for all tasks where the answer (or information relevant to the answer) are within the context, and therefore our analysis can be extended broadly.
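For concreteness, the two comparison metrics discussed above (retrieval-score cosine similarity and retrieval-head recall) can be sketched as follows. This is an illustrative sketch, not the authors' code; the array shapes, variable names, and placeholder scores are assumptions.

```python
# Illustrative sketch: comparing two models' retrieval-score matrices, where
# scores[l, h] is the retrieval score of head h in layer l.
import numpy as np

def cosine_similarity(scores_a: np.ndarray, scores_b: np.ndarray) -> float:
    """Cosine similarity between flattened retrieval-score matrices."""
    a, b = scores_a.ravel(), scores_b.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def head_recall(scores_ref: np.ndarray, scores_other: np.ndarray) -> float:
    """Fraction of the other model's retrieval heads (score > 0) that are also
    retrieval heads of the reference model (the "soft subset" check)."""
    ref_heads = scores_ref.ravel() > 0
    other_heads = scores_other.ravel() > 0
    if other_heads.sum() == 0:
        return 0.0
    return float((ref_heads & other_heads).sum() / other_heads.sum())

# Placeholder example for a 32-layer, 32-head model with sparse scores.
rng = np.random.default_rng(0)
real = rng.random((32, 32)) * (rng.random((32, 32)) > 0.9)
synth = real * (rng.random((32, 32)) > 0.3)   # roughly a subset of the real heads
print(cosine_similarity(real, synth), head_recall(real, synth))
```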

W2: For the experiments in our work, the random attention head dropout results are averaged over three samples. Since prior work has found that a small subset of the model’s components can match the full model’s performance on the Entity Tracking task (Prakash et al., 2024), we think that some of the randomly dropped heads may not only be outside the relevant components but also actively interfering with them. This idea is very compelling, but we leave the identification of attention heads with destructive interference to future work: all of the attention heads selected to be randomly dropped have the same retrieval score of 0, so we would need additional metrics to determine destructive interference.
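To make the head-dropout baseline concrete, below is a minimal sketch of zeroing the outputs of selected attention heads via forward pre-hooks. It assumes a LLaMA-style module layout in which per-head outputs are concatenated before o_proj; the module paths, head_dim argument, and helper names are illustrative assumptions, not the authors' implementation.

```python
# Minimal head-ablation sketch: zero the output slice of chosen heads just before
# the attention output projection (o_proj) of each targeted layer.
import torch

def make_ablation_hook(head_indices, head_dim):
    def pre_hook(module, args):
        hidden = args[0].clone()          # (batch, seq, num_heads * head_dim)
        for h in head_indices:
            hidden[..., h * head_dim:(h + 1) * head_dim] = 0.0
        return (hidden,) + args[1:]
    return pre_hook

def ablate_heads(model, heads_per_layer, head_dim):
    """heads_per_layer: dict mapping layer index -> list of head indices to drop."""
    handles = []
    for layer_idx, heads in heads_per_layer.items():
        o_proj = model.model.layers[layer_idx].self_attn.o_proj
        handles.append(o_proj.register_forward_pre_hook(make_ablation_hook(heads, head_dim)))
    return handles   # call handle.remove() on each to restore the full model
```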

W3: We evaluate across three tasks, which have been commonly used in this space, with objective automatic evaluation metrics. If this does not fully address your concerns, we’d appreciate additional clarification of this identified weakness!

Q1: We experimented with short-context fine-tuning of Llama-3-8B-Instruct on the MuSiQue dataset variants. The resulting F1 scores from fine-tuning are: R,R = 0.59; R,R (L) = 0.46; H,H = 0.47; H,L = 0.48; L,H = 0.47; L,L = 0.50; S,S = 0.44. The F1 of the non-FT model is 0.44. In short context, the synthetic datasets result in marginal improvements over the non-FT model, and we find negligible correlation (Spearman R = 0.07) between F1 and retrieval score cosine similarity, despite similar ranges of cosine similarity (0.72-0.79) between the synthetic data and the real data. We hypothesize that synthetic data is able to transfer better to the real data at longer context because it teaches the relevant attention heads to handle the new positional embeddings and produce sharp softmaxes over greater numbers of input tokens, abilities that would already have been acquired at the shorter context during pretraining. This results in substantial gains over the non-FT model when fine-tuning on synthetic data at long context, which we have added to Table 1 for comparison. (For example, fine-tuning on the S,S dataset at short context results in +0.00 improvement, but leads to a large (+0.10) improvement at 32K context.)

Official Review
Rating: 6

This paper investigates the mechanism behind synthetic context extension helping long-context LLMs, where the models are fine-tuned with synthetically generated long-context data. Across three long-context retrieval and reasoning tasks, the paper examines the effects of varying "concept expression" and "context diversity" in fine-tuning and demonstrates that synthetic data yields inferior performance compared to real data. Through analysis of retrieval heads, the paper interprets the performance gap between the two types of fine-tuning data.

Strengths

  1. The paper presents a framework for constructing synthetic long-context examples from existing databases with controlled similarity to real data.

  2. The retrieval heads analysis provides interpretable insights into the behavioral differences between models trained on different datasets, helping explain both the effectiveness and limitations of synthetic data.

Weaknesses

  1. The visualization quality and clarity of figures should improve. 1) Figure 1's axis labels are of tiny size and poor resolution. The leftmost axis labels are occluded. The "Retrieval Heads" heatmap lacks a color scale legend. 2) Figures 4 and 5 would benefit from increased font sizes for better readability.

  2. The subset relationship of retrieval heads. 1) The assertion in line 361 that synthetic data retrieval-scoring heads are "strict subsets" of those from real data training appears to contradict Figure 1, where certain heads (e.g., head #0, layer #21) show high scores in synthetic data plots but not in real data plots. 2) The characterization in line 428 describing these heads as "nearly" a subset requires clarification. The authors are encouraged to specify for which tasks and conditions the strict subset relationship holds versus where it is approximate.

Questions

  1. About limited real data training. 1) It would be better if the paper could quantitatively define the "limited" real data condition. 2) Figure 4 shows that, under some circumstances, synthetic data outperforms the limited relation subset of the real data. Could the authors discuss whether increasing the number of synthetic training examples can help surpass limited real data performance? 3) Could hybrid training (combining limited real data with synthetic data) enhance performance, particularly for tasks where the retrieval heads of the synthetic data are not a strict subset of those of the real data?

  2. MuSiQue's context extension (line 137). It would be better if the authors could elaborate on what criteria governed the selection of padding paragraphs, and how to ensure the added context did not introduce extra information implying the answer to a certain hop's question.

  3. In Table 4, cells in the columns "Compl.", "Inter.", "Rand" and row SummHay appear to have the same values as those in Table 3. The "Orig." value for (Real, Real) setting is also inconsistent with the addition of "Orig." and "delta" values in the following rows.

  4. The paper employs LoRA fine-tuning, and it would be helpful to know how might the observed patterns in retrieval head behavior and dataset relationships generalize to full parameter fine-tuning. If it is impractical to verify it empirically due to computational constraints, authors' predictions and explanations would be valuable.

I am willing to change the score if my concerns are addressed.

Comment

Thank you for the review!

W1: We’ve increased the label text in Fig. 1, Fig. 4, and Fig. 5. Additionally, we’ve added a heatmap legend to Fig. 1 and fixed the leftmost labels.

W2: Apologies for the confusion. We clarified our findings in the general response, in that this is not a strict subset relationship, but instead a “soft” one: the real data has high retrieval head recall (>=0.76 for Llama-3-8B-Instruct on all tasks) against the synthetic data, but not the other way around. We’ve added Appendix E showing pair-wise retrieval head recall, where the first rows of Tables 5, 6, & 7 show that the real data retrieval/insight heads for Llama-3-8B-Instruct have high recall against all synthetic dataset retrieval heads. We’ve also updated the text throughout the paper accordingly.

This “soft subset” result is significant because retrieval head recall is highly correlated with real task performance, which we have made more clear by adding Table 4 to Appendix E. This shows that, when synthetic datasets have higher retrieval head recall against the real task, performance is also higher, with strong Spearman correlation for both Llama3 and Mistral.
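The correlation analysis mentioned here can be reproduced in a few lines; the sketch below uses placeholder numbers (not values from the paper) purely to show the computation.

```python
# Spearman correlation between retrieval-head recall (against the real-task model)
# and downstream task performance. Values are illustrative placeholders.
from scipy.stats import spearmanr

recall_vs_real = [0.55, 0.70, 0.62, 0.81, 0.49]   # one value per synthetic variant
task_f1        = [0.40, 0.47, 0.45, 0.52, 0.38]

rho, p_value = spearmanr(recall_vs_real, task_f1)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
```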

Q1.1: In short, the limited subset is defined as a subset of the real dataset that satisfies certain distributional constraints. We’ve added the details to Appendix C, which says:

On MDQA, we create three variants: L1 is the subset containing Who, When, and Where questions; L2 is the subset containing When and Where questions; L3 is the subset containing only Who questions. These comprise 65.8%, 31.0%, and 34.8% of all questions in the MDQA training set, respectively. In Table 1, Table 3, and Table 11, we only report L1 results due to space constraints.

On MuSiQue, we use the subset of linear 3-hop questions consisting solely of T-REx component questions (Elsahar et al., 2018), as identified by ">>". 10.8% of MuSiQue linear 3-hop questions in the training set fit this criterion. Additionally, among all component question hops in the training set, 43.0% are sourced from T-REx.

Q1.2: To experiment with scaling up the synthetic datasets, we double the size (to 800 examples) of the synthetic datasets for MuSiQue and fine-tune Llama-3-8B-Instruct. This results in the following F1 improvements: H,H = +0.13; H,L = +0.07; L,H = -0.03; L,L = +0.11; S,S = +0.08. Notably, both the H,H and H,L variants outperform fine-tuning on 400 real dataset examples, and three datasets, (H,H), (H,L), and (L,L), outperform fine-tuning on the 400-example limited relation dataset. On these scaled synthetic datasets, retrieval head recall against the real data (400 examples) model is still correlated with performance, with Spearman R = 0.50.

Q1.3: We conducted an experiment on Llama-3-8B-Instruct to explore hybrid limited+synthetic data variants for MuSiQue. Namely, for each synthetic dataset $D_\text{synth}$, we trained a new model on the dataset $D_\text{synth} \cup D_\text{limited}$, resulting in 792 total examples. We find that performance generally increases over using only the synthetic dataset: H,H = +0.05; H,L = -0.02; L,H = +0.08; L,L = +0.09; S,S = +0.02. However, retrieval head recall and retrieval score cosine similarity against the real data change minimally (-0.04 to +0.04).

Q2: Added a footnote to the bottom of page 3 (highlighted): “We pad with irrelevant repeated text ‘The grass is green. The sky is blue…’ to ensure that the added paragraphs do not interfere with the answer to the original question.”
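As an illustration of this padding scheme, a minimal sketch is below. The character-based length target and function name are simplifying assumptions; the paper pads to a token budget with filler sentences drawn from the benchmarks cited in the responses above.

```python
# Illustrative padding sketch: extend a context with irrelevant repeated filler so
# that the added text cannot change the answer to the original question.
FILLER = "The grass is green. The sky is blue. "   # filler in the style of the quoted footnote

def pad_context(context: str, target_num_chars: int) -> str:
    """Append repeated filler until the context reaches roughly the target size."""
    if len(context) >= target_num_chars:
        return context
    n_repeats = (target_num_chars - len(context)) // len(FILLER) + 1
    return context + " " + FILLER * n_repeats
```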

Q3: Thanks for pointing this out, the Mistral patching results table in Appendix F (now Table 11) has been fixed.

Q4: Full-rank fine-tuning on all parameters for 32K context is prohibitively expensive due to quadratic GPU memory requirements. In view of our limited computational resources, we ran LoRA fine-tuning on all modules to approximate the full-rank fine-tuning setting; the results are presented in Appendix G, where we find similar conclusions, namely:

  • There are mostly small (<0.05) performance differences between fine-tuning only attention heads and all modules. The exceptions occur on the SummHay Citation task, where fine-tuning all modules outperforms.
  • When the synthetic data induces fewer retrieval heads (i.e. on MuSiQue), the synthetic data retrieval heads tend to be a “core” subset of real data retrieval heads, with “core” indicating higher retrieval scores.
  • Retrieval score recall and cosine similarity are moderately correlated with real task performance.

We expect full-rank fine-tuning to yield similar results to LoRA.
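For reference, the two LoRA settings contrasted above (adapting only the attention projections vs. all modules) would look roughly as follows with the Hugging Face peft API; the rank, alpha, and exact target-module names are illustrative assumptions, not the paper's hyperparameters.

```python
# Sketch of attention-only vs. all-module LoRA configurations (peft).
from peft import LoraConfig, TaskType  # get_peft_model(base_model, config) applies a config

attn_only = LoraConfig(
    r=16, lora_alpha=32, task_type=TaskType.CAUSAL_LM,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

all_modules = LoraConfig(
    r=16, lora_alpha=32, task_type=TaskType.CAUSAL_LM,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)
```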

Comment

I appreciate the authors' efforts to make the figures clearer, address concerns with additional experiments, and revise one of the key findings for greater accuracy. The authors are encouraged to include the responses to Q1.2 and Q1.3 in the manuscript to make this section more complete. I raised the 'presentation' score and kept the other scores as they were.

Official Review
Rating: 6

This paper provides a method to analyze how synthetic data helps LLMs with long-context tasks. The authors start by handcrafting several principles (Concept Expression, Context Diversity & Symbolic Tasks) to construct data, and find that different tasks share little similarity in their preferences. They then find that there is a high correlation between the similarity of retrieval heads and model performance after fine-tuning, which can be regarded as a metric indicating the quality of a synthetic dataset.

Strengths

The quality analysis of the synthesized datasets is reasonable. The authors provide evidence to show that there are no preset principles on how to synthesize data, and find the similarity of retrieval heads to be a highly correlated metric. Sufficient experiments have been done to support this.

This paper could serve as guidance for future work on synthesizing datasets, providing a new perspective on downstream tasks in LLMs.

Weaknesses

This paper treats real data as the ceiling. However, the amount and distribution of real data may also influence the fine-tuned performance. Is it possible that in some cases the synthesized dataset performs better than real data (for example, when the amount of synthetic data is larger)? If so, the similarity of retrieval heads with real data may not be a good metric under such conditions.

This paper takes concept expression, context diversity, and symbolic tasks as three principles to manually synthesize data. I am not sure if the combination of these has good coverage of all possible variants.

In L202, it seems that the low-diversity version is also a meaningless version for the task. I can hardly imagine that the sentences “The grass is green. The sky is blue...” can influence the model. This setting conflates diversity with quality, making it hard to ablate their influence on the performance. In my opinion, the repeated pattern should at least be some meaningful text related to the task.

Questions

As noted in the weaknesses, my concerns involve two aspects:

  1. Are the handcrafted principles in Sec. 3 representative enough?
  2. What are the limits of using the similarity of retrieval heads to score a synthetic dataset? Are there preconditions (such as the amount of data)?
Comment

Thank you for your review!

W1: You are correct that in this work, we presume that the real data is high quality, construct lower quality synthetic data, and focus on a quantity-controlled comparison. In other situations where the real data is of low quality, high quality synthetic data may outperform it. We view the retrieval heads on high quality real data as indicative of the subnetworks required to perform the tasks adequately. When the synthetic data is higher quality and results in greater performance than fine-tuning on real data, a different analysis might be useful, as implied by our work: identify retrieval heads of the synthetic-data model on the real data, and examine what additional components are contributing to performance. This is an interesting future direction!

W2: In our setup, we consider tasks where the answer to a query is contained in an input text $\mathcal{C}$, which contains spans of text that are relevant to answering the query, $\{f_1, \dots, f_m\}$, which we deem our “needle concepts”. The remaining text $\mathcal{C} \setminus \{f_1, \dots, f_m\}$ is considered the context. By partitioning the input text into these two sets (needle concepts and surrounding context), we can categorize any data as having some variant of “concept expression” or “context diversity”. Both concept and context text can range from being extremely diverse to extremely simplistic or structured. Within this framework, the symbolic data is a highly structured variant of concept expression and concept diversity that we felt was important to emphasize separately, since prior work [1][2] has observed that training on highly structured code data improves natural language entity tracking abilities. We’ve made this more clear in Section 3.1 and highlighted the change.

Among concept and context variants, we aim to explore ones that are most commonly found in prior synthetic data work, as we describe in Section 3.1. There are certainly far more variants that could be enumerated, and we leave this to future work.

W3: We use the sentences from [3][4] which are two popular long-context retrieval benchmarks. The use of padding data in the literature is to isolate the pure “context utilization” (ability to retrieve information regardless of position) performance from any “distractor distinguishing” performance. “Context utilization” by itself requires training when extending models to long context, since the attention heads will be exposed to new position embeddings (even when interpolating) and required to softmax over more context tokens. Table 1 shows that training using this “meaningless” padding does indeed result in greater performance (+0.12 to +0.19 F1 for Llama3 MuSiQue) over the Non-FT model.

As shown in Figure 2, top right, we also have variants that contain distractor context sentences that are similar to the “needle” concepts. We agree that there is a measure of “whether the context is distracting or not”, as measured by whether the context has high lexical, structural, or semantic overlap with the needle text, that may not be accounted for by a straightforward diversity metric such as token distribution entropy. We believe that our variants are effective at representing major prior works, and leave finer-grained taxonomy for future work.
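To make the "token distribution entropy" proxy concrete, a minimal sketch is below; whitespace tokenization and the filler string are simplifying assumptions.

```python
# Token-distribution entropy as a crude context-diversity proxy: repeated filler text
# has far lower entropy than natural paragraphs.
import math
from collections import Counter

def token_entropy(text: str) -> float:
    tokens = text.split()
    counts = Counter(tokens)
    total = len(tokens)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

low_diversity = "The grass is green. The sky is blue. " * 200
print(token_entropy(low_diversity))   # small: only a handful of distinct tokens
```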

Q1: Please see our response to W2. Namely, our framework is extensible to contextual reasoning tasks by classifying all text within an input as either “needle concepts” $\{f_1, \dots, f_m\}$ or the surrounding “context” $\mathcal{C} \setminus \{f_1, \dots, f_m\}$. Symbolic tasks are a special variant of this that we chose to highlight. While there are far more possible variants than can be enumerated in a single paper, we aim to explore those most commonly found in prior synthetic data work.

Q2: As discussed in our response to W1, there is interesting future work suggested here. We think about retrieval heads as indicative of the subnetworks trained to accomplish a task, and for any given pretrained model, there are some components primed for further recruitment during fine-tuning. This causes high overlap of the “existing mechanisms” being enhanced when trained on variants of the same conceptual task. For some tasks, more components may be required that are not effectively targeted during fine-tuning due to the low quality or amount of data. We look forward to exploring this further!

[1] Kim et al. (2024). Code Pretraining Improves Entity Tracking Abilities of Language Models. ArXiv, abs/2405.21068

[2] Prakash et al. Fine-Tuning Enhances Existing Mechanisms: A Case Study on Entity Tracking. In Proceedings of the 2024 International Conference on Learning Representations, 2024

[3] Hsieh et al. RULER: What’s the Real Context Size of Your Long-Context Language Models? In First Conference on Language Modeling, 2024

[4] Mohtashami & Jaggi. Random-Access Infinite Context Length for Transformers. In Thirty-seventh Conference on Neural Information Processing Systems, 2023

Official Review
Rating: 5

This paper utilizes a recently introduced concept of “retrieval heads” in transformers, i.e. the heads that copy tokens from the context to the output which are characterized by "retrieval score".

The authors explore the influence of such heads on the fine-tuning on realistic and synthetic long-context data. They provide a fixed protocol for generating synthetic data and perform experiments that show that the configuration of retrieval heads is a good predictive feature for model performance.

Namely, they conduct the following experiments:

  • mask out retrieval heads and observe drop in performance;
  • describe each dataset with a vector based on the retrieval scores of all heads in the model trained on this dataset, and measure similarity between datasets via cosine similarity between these vectors. They find that the closer a synthetic dataset is to the realistic one, the better the performance of the model trained on it.
  • patch a weaker model (trained on synthetic data) by substituting its retrieval heads by the ones from a stronger model (trained on real data) and observe increased performance for the patched model.

Strengths

  • The authors designed a principled way to generate synthetic data for long-context fine-tuning.
  • The authors introduce a way to measure similarity between synthetic and realistic datasets in terms of retrieval scores that is correlated with performance on these datasets. While it requires further investigation, I find this idea promising for guiding synthetic data generation of higher quality.

Weaknesses

Lack of contributions

I will outline the candidates for contributions and then explain why I think they are not sufficient for a conference paper.

As far as I understand, the main takeaways from the paper are:

  • retrieval heads influence performance;
  • models differently fine-tuned for the same task share a subset of retrieval heads;
  • if we insert retrieval heads from a stronger model instead of the corresponding retrieval heads in the weaker model, the performance of the weaker model will improve;
  • if we measure the similarity between synthetic and realistic datasets based on the retrieval scores of the models trained on them this similarity will correlate with performance;

The first two points were already discovered in [1]. The third point follows from the first two points and the fact that heads in differently fine-tuned transformers can be interchangeably patched, which was discovered in [2].

The fourth point is a promising step towards explaining how to generate synthetic data achieving realistic data quality. However, I find this step alone not enough for a conference paper as the authors do not explain how to generate synthetic data but only show a way to predict the performance of models trained on it while requiring access to realistic data to compute similarity with models trained on it (which is a big limitation as in real-world scenarios we do not have access to realistic data).

[1] Zhao, Xinyu & Yin, Fangcong & Durrett, Greg. (2024). Understanding Synthetic Context Extension via Retrieval Heads. 10.48550/arXiv.2410.22316.

[2] Prakash, N., Shaham, T.R., Haklay, T., Belinkov, Y., & Bau, D. (2024). Fine-Tuning Enhances Existing Mechanisms: A Case Study on Entity Tracking. ArXiv, abs/2402.14811.

Missing explanations for crucial parts

  • Retrieval heads are introduced in line 293 (shown in italics), but it is still not clear how they are formally defined, even though this is a crucial concept for the whole paper. I guess that they are the top-k heads after sorting by retrieval score, but it would be nice to read it in the text.
  • It is also not explained how to detect common subsets (intersections) of retrieval heads between models trained on different datasets (this is important for Sections 4.3 and 5). I also wonder whether any matching algorithm (to determine which head in another model a given head corresponds to) is applied, because simple matching by heads' indices might not be enough: models might have functional symmetry, i.e., if we permute heads, model outputs will not change while head indices will.
  • There is no explanation of how patching is done. There is the phrase “following Prakash et al. …” in line 471; however, it is important to properly define this operation, as it is a key part of Section 5.

Experiment request

This paper provides a new way to generate datasets for long-context retrieval tasks; however, it is not immediately obvious to me that long-context fine-tuning is needed to solve them. Could you please provide results for base models fine-tuned only on short-context data, to show that fine-tuning on long context is really required for the constructed tasks?

Unclear writing

  • It is not explained what EM is in line 357. I guess it is "exact match", but it should be defined, as it is the main performance metric used in the experiments.
  • Table 1 has duplicate columns “concept exp” and “context div” which is confusing.
  • It is almost impossible to read axes' names in Figure 1.
  • The caption in Figure 3 does not explain what the figure shows.
  • Theta and RoPE embeddings are not defined in line 154.
  • It is very hard to understand the tasks from the current descriptions. Could you please give examples of samples from datasets and needles for them (at least in appendix)?
  • Where do numbers from paragraph in line 356 come from? Figure 1 does not have 0.35 EM or 0.32 and 0.20. Where do number of heads 129, 112 and 39 heads come from?

Typos

  • 280: a sparse of heads - sparse is not a noun
  • 309: given AN evaluation example
  • 351: H_synth reflects
  • 481: is THE greatest
  • 101: not \mathcal{M}
  • 126: no dot in the end

Questions

  • In line 35 you say: “but pre-training a long context model would necessarily reduce the number of observed tokens.” Could you please explain what you mean by “observed tokens” and why pre-training on a long context reduces their number?
  • Why does the Figure 4 contain several dots with the same name, e.g. R,R (L)?
  • What is the difference between SummHay insight and SummHay retrieval in Table 2?
  • In line 484 you say: “The success of these heads on different tasks likely is caused by upstream changes in the model during fine tuning, which specifically change the representations passed to these retrieval heads.” During patching, we copy heads from another model. They were not part of the patched model during fine-tuning and therefore, I don’t see how upstream changes made during fine-tuning of the patched model can help these new heads to perform better. Could you please elaborate on that?
  • In line 194 you say: “In task specific cases, it is beneficial to make this data less realistic while encouraging generalization.". I can’t understand how making data less realistic encourages generalization. In all your experiments training on realistic data led to better generalization (better test performance). Could you elaborate, please?
  • What is meant by the "target" and "synthetic" tasks in line 482? So far you have introduced only synthetic datasets.

Details of Ethics Concerns

I have no ethics concerns

Comment

Thank you for the detailed suggestions! We would like to point out a significant misunderstanding regarding the “Lack of contributions” weakness: paper [1] is the present paper, uploaded during the anonymous review period in compliance with https://iclr.cc/Conferences/2025/CallForPapers. It therefore does not detract from the contributions of the present work.

We address the other comments below:

Comparison to [2]: We believe that our findings in realistic, multi-hop reasoning settings constitute a significant contribution over the task used in [2], in which the model must retrieve the box containing an object which is stated explicitly in the context (no move operations, essentially a single-hop task, synthetically constructed). In addition, [2] focused on improvement from fine-tuning broadly on code data and does not show how different constructions of structured fine-tuning data might recruit the required entity tracking circuits in different or similar ways, while we do in our controlled synthetic data settings. Specifically, by identifying retrieval heads for the synthetic and symbolic datasets, we show that mostly the same components are performing the equivalent functionality after fine-tuning, as in the real task.

Missing Explanations:

Retrieval head definition: Added to Section 4.1 “Detecting Retrieval Heads”: “To compare across fine-tuned models, we consider any attention head with a positive retrieval score to be a retrieval head, and later compute cosine similarity to account for the strength of scores.”
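To make this criterion concrete, a sketch is below. The score computation follows the copy-and-paste retrieval score of Wu et al. (2024) as commonly described (a head is credited when its top-1 attention at a copying step falls on the needle token being copied); the tensor shapes and function names are illustrative assumptions, not the paper's code.

```python
# Illustrative retrieval-score sketch for a single attention head.
import numpy as np

def retrieval_score(attn: np.ndarray, needle_positions: dict, copied_steps: list) -> float:
    """attn: (num_generation_steps, seq_len) attention weights of one head.
    needle_positions: generation step -> position of the needle token copied there.
    copied_steps: generation steps at which the model copied a token from the needle."""
    hits = sum(int(attn[t].argmax() == needle_positions[t]) for t in copied_steps)
    return hits / max(len(copied_steps), 1)

def is_retrieval_head(score: float) -> bool:
    # Criterion added to Section 4.1: any head with a positive retrieval score.
    return score > 0.0
```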

Detecting common retrieval heads: No matching algorithm is used other than matching by head indices. Our results support the idea that specific attention heads are primed during pre-training to be more easily recruited and trained for certain functions. To visualize the efficacy of matching by head index, we updated the paper with heatmaps showing retrieval scores after fine-tuning on different datasets in Appendix D (these are currently a bit small but we’ll make them clearer in a later version). We note this in a (highlighted) footnote in Section 4.3.

Patching explanation: We’ve added a more detailed explanation in Appendix F.
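For intuition only, one plausible weight-level variant of head patching is sketched below: copying the per-head query and output projection slices of selected heads from a donor (real-data) model into a recipient (synthetic-data) model, assuming a LLaMA-style layout. The authors' actual procedure follows Prakash et al. (2024) and is specified in Appendix F; this sketch should not be read as their implementation (in particular, patching may instead be performed on activations).

```python
# Hypothetical weight-level head-patching sketch (LLaMA-style layout assumed):
# q_proj rows and o_proj columns are grouped per head; k/v projections are left
# untouched here since they may be shared across heads under grouped-query attention.
import torch

@torch.no_grad()
def patch_heads(recipient, donor, heads_per_layer, head_dim):
    for layer_idx, heads in heads_per_layer.items():
        rec_attn = recipient.model.layers[layer_idx].self_attn
        don_attn = donor.model.layers[layer_idx].self_attn
        for h in heads:
            rows = slice(h * head_dim, (h + 1) * head_dim)
            rec_attn.q_proj.weight[rows, :] = don_attn.q_proj.weight[rows, :]
            rec_attn.o_proj.weight[:, rows] = don_attn.o_proj.weight[:, rows]
```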

Experiment request

This has been added to Table 1, showing that the non-FT model has extremely poor performance on the 32K context tasks. One major factor is that, during context length extension, attention heads are exposed to new positional embeddings (even if they are kept within pretraining range by appropriately scaling the RoPE theta or using position interpolation) and required to softmax over a greater number of context tokens [3], which requires model adaptation.

[3] Veličković, P., Perivolaropoulos, C., Barbero, F., & Pascanu, R. softmax is not enough (for sharp out-of-distribution). ArXiv, abs/2410.01104.
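To illustrate why extended contexts expose heads to new positional signals, the sketch below computes RoPE rotation angles at a given position under a base theta and a rescaled theta; the default values (theta = 10000, head_dim = 128) are common conventions assumed here, not the exact settings of the models in the paper.

```python
# RoPE angle sketch: positions far beyond the pretraining range produce rotation
# angles the heads have never seen unless theta is rescaled or positions interpolated.
import numpy as np

def rope_angles(position: int, head_dim: int = 128, theta: float = 10000.0,
                scale: float = 1.0) -> np.ndarray:
    """Rotation angles for one position; scale > 1 mimics enlarging the RoPE theta."""
    inv_freq = 1.0 / ((theta * scale) ** (np.arange(0, head_dim, 2) / head_dim))
    return position * inv_freq

print(rope_angles(32768)[:4])             # angles at a 32K-token position
print(rope_angles(32768, scale=4.0)[:4])  # same position with an enlarged theta
```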

Questions

Q1: Reworded: “The quadratic memory requirement of Transformer attention imposes a strong computational constraint on our ability to train and do inference on long-context models. This disrupts the typical pre-training pipeline: pre-training must be done at as large a scale as possible, but pre-training a long context model would necessarily reduce the number of observed tokens able to fit on the GPU.”

Q2: We experiment with multiple subsets of relations for MDQA, resulting in 3 variants, as explained in Appendix C: L1 = Who, When, Where (65.8% of the full relation set); L2 = When, Where (31.0%); L3 = Who (34.8%). We added a (highlighted) note to the Figure 4 caption.

Q3: Clarification added to a (highlighted) footnote in 4.3: “SummHay Retrieval Heads attend to the final answer (document number), whereas SummHay Insight Heads attend to the insight text within the document.”

Q4: Rephrased (and highlighted) for clarity: “One explanation is that fine-tuning induces upstream changes so that a different representation distribution is passed to the retrieval heads when learning on synthetic data. This allows retrieval heads to learn to be effective for the synthetic task while failing on out-of-distribution real data representations.”

Q5: As stated in Section 3.1: “Prior synthetic datasets have made use of fictional entities (Saparov & He, 2023) or nonsense phrases (Wei et al., 2023) in place of real entities and properties, or swapped out nouns to augment the dataset (Lu et al., 2024).” These effectively encourage generalization by preventing the model from overfitting to specific entities. Added a clarification!

Q6: We’ve replaced this text with “corresponding real task”.

Misc.

Where do numbers from paragraph in line 356 come from? Figure 1 does not have 0.35 EM or 0.32 and 0.20. Where do number of heads 129, 112 and 39 heads come from?

Figure 1 shows the F1 score, not the exact match accuracy; we’ve updated the text to reflect this.

We’ve also fixed the remaining instances of unclear writing and typos in the updated version! Dataset examples have been added to Appendix C.1.

Comment

Thank you for clarifying the questions and running an additional experiment; I have increased my score to 5.

I apologize for the late response, however, I believe that in this case, we have fundamentally different views on the significance of contributions and a prolonged discussion would not change it much.

I still lean toward rejection because I think that there are not enough contributions. I also apologize for incorrectly citing a paper (I originally cited your arXiv paper instead of [1]). I understand your point about [2]; however, I can't agree with it, as I believe that [2] has already established that transformer heads can be patched. While checking this for realistic multi-hop reasoning + structured fine-tuning data would be a nice additional section for [2], it is not enough for a separate conference paper.

I copy-paste my original point below with the corrected citation, as this point is still valid in my opinion:

Lack of contributions

I will outline the candidates for contributions and then explain why I think they are not sufficient for a conference paper.

As far as I understand, the main takeaways from the paper are:

  • retrieval heads influence performance;
  • models differently fine-tuned for the same task share a subset of retrieval heads;
  • if we insert retrieval heads from a stronger model instead of the corresponding retrieval heads in the weaker model, the performance of the weaker model will improve;
  • if we measure the similarity between synthetic and realistic datasets based on the retrieval scores of the models trained on them this similarity will correlate with performance;

The first two points were already discovered in [1]. The third point follows from the first two points and the fact that heads in differently fine-tuned transformers can be interchangeably patched, which was discovered in [2].

The fourth point is a promising step towards explaining how to generate synthetic data achieving realistic data quality. However, I find this step alone not enough for a conference paper as the authors do not explain how to generate synthetic data but only show a way to predict the performance of models trained on it while requiring access to realistic data to compute similarity with models trained on it (which is a big limitation as in real-world scenarios we do not have access to realistic data).

[1] Wu, W., Wang, Y., Xiao, G., Peng, H., & Fu, Y. (2024). Retrieval Head Mechanistically Explains Long-Context Factuality. ArXiv, abs/2404.15574.

[2] Prakash, N., Shaham, T.R., Haklay, T., Belinkov, Y., & Bau, D. (2024). Fine-Tuning Enhances Existing Mechanisms: A Case Study on Entity Tracking. ArXiv, abs/2402.14811.

Comment

Thank you for the score increase and for considering our response!

We appreciate the prior contributions of [1] & [2], but believe that there is a need for work like ours that demonstrates that mechanistic interpretability can give insight into realistic data with complex reasoning. Additionally, while [1] & [2] look at how the target toy task circuit changes after fine-tuning on various data, they compare base models with versions fine-tuned on large mixtures of data and do not (i) identify the specific data that increases target task performance, or (ii) explain how fine-tuning on data with very little token overlap (e.g. realistic vs. symbolic data) manages to improve the target task performance. Transformers can conceivably create different intermediate representations of these different data types, thereby either (A) inducing high activations on disparate sets of model components, or (B) inducing high activations on the same sets of model components in ways that do not affect data with out-of-distribution representations. Our work shows that such dissimilar data can still induce changes in the desired realistic task components during long-context fine-tuning.

Ultimately, bridging toy data and realistic data is an important piece to demonstrate broader applicability of mechanistic interpretability, provide a concrete foundation for using toy tasks as long-context benchmarks, and shed light into why fine-tuning on structured data such as code can improve performance on realistic tasks.

Again, thank you for your consideration!

Comment

Thanks for the reviews of our paper!

We wanted to point out a change in the revised version of the paper prompted by the discussion with reviewer TNEU. The original paper contained an incorrect characterization of the sets of Llama-3-8B-Instruct retrieval heads found on the MuSiQue synthetic datasets in Figure 1. These are not “strict subsets” as we said in “4.1 Results”, but are instead just mostly covered: the real data retrieval heads encompass >= 76% of synthetic data retrieval heads. We have clarified this in the text and added recall tables in Appendix E to further clarify what the original submission described as “mostly subsets”. The strong relationship between learning greater subsets of the real retrieval heads and task performance is shown in the newly added Table 4 in Appendix E. The changes in the main text are highlighted.

Other changes to the paper:

  1. In response to reviewers TNEU and YA8F, we have made cosmetic updates to figures throughout the paper, increasing the readability of axis and element labels.
  2. We have put a description of symbolic data construction of MuSiQue in the main text and provided examples in Appendix C.
  3. There was an error in the definition of insight scores for SummHay citation which has now been fixed (Equation 2).
AC Meta-Review

The paper fine-tuned long-context LLMs with synthetic data and analyzed the impact on downstream long-context tasks. The authors explored the impact of retrieval heads when fine-tuning on synthetic and real data. The exploration provides insights into improving synthetic data for long-context tasks.

As pointed out by the reviewer, the main contribution of this paper is not that significant; some findings were already established in prior work ([a] and [b]). The point of predicting the performance of models trained on synthetic data using similarity scores is interesting but not sufficient.

Considering the rebuttal, the ratings and the confidence of the reviews, I tend to reject this paper at this stage. I encourage the authors to further enhance the paper with improved technical or empirical contributions, analysis and writing.

[a] Wu, W., Wang, Y., Xiao, G., Peng, H., & Fu, Y. (2024). Retrieval Head Mechanistically Explains Long-Context Factuality. ArXiv, abs/2404.15574.

[b] Prakash, N., Shaham, T.R., Haklay, T., Belinkov, Y., & Bau, D. (2024). Fine-Tuning Enhances Existing Mechanisms: A Case Study on Entity Tracking. ArXiv, abs/2402.14811.

Additional Comments on the Reviewer Discussion

The reviewers raised concerns as follows:

Reviewer TNEU (rating 6→6, confidence 3): the visualization quality and clarity of figures should improve; the presentation could be clearer.

Reviewer yKgJ (rating 6, no reply; confidence 2): the handcrafted principles in Sec. 3 are not representative enough.

Reviewer YA8F (rating 3→5, confidence 4): lack of contributions, missing explanations for crucial parts, insufficient experiments, unclear writing.

Reviewer Cj3o (rating 6, no reply; confidence 3): results appear somewhat random; the evaluation is limited to a single, subjective measure.

The rebuttal addressed the concerns of Reviewer TNEU to some extent, and TNEU maintained the score of 6. Reviewer YA8F raised the score from 3 to 5, but the concerns about the lack of contributions still hold. The three reviewers other than YA8F gave scores of 6 but with low confidence (2 or 3).

After reading the paper, reviews, rebuttal, and the reply of reviewer YA8F, I recognize the concerns raised by YA8F that the paper's contribution is not that significant.

Final Decision

Reject