PaperHub
Rating: 6.0 / 10 (Poster; 4 reviewers; min 4, max 8, std 1.4)
Individual ratings: 6, 4, 8, 6
Average confidence: 3.8
COLM 2025

A Controlled Study on Long Context Extension and Generalization in LLMs

OpenReview | PDF
Submitted: 2025-03-20 | Updated: 2025-08-26
TL;DR

Using a controlled protocol to systematically study long context extension methods

Abstract

Keywords
Controlled Study, Long Context, Extension, Benchmark, Analysis

Reviews and Discussion

Review
Rating: 6

This paper presents a controlled study on long-context extension methods for large language models (LLMs), aiming to address the lack of standardized comparisons in prior work. The authors introduce a unified mathematical framework, a controlled experimental protocol (using identical base models, training data, and hyperparameters), and a rigorous evaluation pipeline combining intrinsic (e.g., perplexity) and extrinsic (e.g., downstream tasks) metrics. Through extensive experiments, they derive three key insights.

Reasons to Accept

  • The paper’s most significant strength is its controlled protocol, which isolates variables (base models, training data, hyperparameters) to enable fair comparisons across extension methods. This addresses a critical gap in prior work where inconsistent setups hindered reliable conclusions.
  • By integrating diverse metrics (perplexity, retrieval tasks like NIAH, downstream benchmarks like LongBench/RULER) and testing across model sizes (LLaMA2 7B/13B/70B, Phi-2, LLaMA3), the study offers a multi-faceted analysis of long-context capabilities.
  • The findings on perplexity’s predictive value, the trade-offs between exact/approximate attention, and the limits of extrapolation provide actionable guidance for model design, particularly for applications requiring robust long-context understanding.

Reasons to Reject

  • The experiments are confined to context lengths up to 64k (with most evaluations at 32k), while the broader field is increasingly interested in extreme contexts (128k+). The paper does not address performance beyond 64k or analyze how methods scale to near-infinite contexts, which are critical for applications like book summarization or long-document QA.
  • While the controlled protocol is valuable, the core insights (e.g., approximate attention trade-offs, perplexity’s utility) align with prior works, and the unified mathematical framework does not introduce radical new theory. The study advances methodology but may lack groundbreaking discoveries.
Comment

We thank the reviewer for the question, and we clarify our contributions below.

Infrastructure contribution: Our study’s primary contribution is not a new algorithm but a standardised, fully reproducible testbed that eliminates three confounding factors that have hampered prior work: (i) heterogeneous base checkpoints, (ii) mismatched pre-training corpora, and (iii) inconsistent hyper-parameters during fine-tuning.

By releasing (a) the unified data pipeline, (b) a set of five identical base models exposed to a single 1B-token long-context corpus, and (c) templated evaluation scripts for both intrinsic and extrinsic tasks, we provide the first public “apples-to-apples” suite.

Unified mathematical view: we summarize the methods and derive an explicit frequency-scaling equivalence that mathematically links PI ↔ NTK ↔ YaRN ↔ CLEX (Eq. 7–10), making cross-method trade-offs transparent.
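
To make the frequency-scaling view concrete, below is a minimal, illustrative sketch covering only PI and NTK-aware scaling (not the paper's released code; the dimension and scale values are placeholders). Both methods amount to a per-dimension rescaling of the RoPE frequencies; only the scaling profile differs.

```python
# Minimal illustrative sketch: PI and NTK-aware scaling as per-dimension
# rescalings of the RoPE frequencies theta_j = base^(-2j/d).
import numpy as np

def rope_freqs(dim: int, base: float = 10000.0) -> np.ndarray:
    j = np.arange(0, dim, 2)
    return base ** (-j / dim)                     # one frequency per rotary pair

def pi_freqs(dim: int, scale: float, base: float = 10000.0) -> np.ndarray:
    # Position Interpolation: positions m -> m / scale, i.e. every frequency
    # is shrunk by the same constant factor 1 / scale.
    return rope_freqs(dim, base) / scale

def ntk_freqs(dim: int, scale: float, base: float = 10000.0) -> np.ndarray:
    # NTK-aware scaling: enlarge the base (base -> base * scale^(dim / (dim - 2))),
    # which leaves the highest frequencies nearly untouched and scales the
    # lowest ones down by roughly 1 / scale.
    return rope_freqs(dim, base * scale ** (dim / (dim - 2)))

dim, scale = 128, 8.0                             # e.g. a 4k -> 32k extension
theta = rope_freqs(dim)
print("PI  per-dim factor:", pi_freqs(dim, scale) / theta)    # constant 1/scale
print("NTK per-dim factor:", ntk_freqs(dim, scale) / theta)   # ~1 down to ~1/scale
```

Under the same view, YaRN and CLEX correspond to different choices of the per-dimension scaling profile, which is what makes the cross-method trade-offs directly comparable.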

Findings: novel take-aways now highlighted in our paper:

(i) Exact finetuning vs. approximate attention – exact RoPE variants preserve retrieval depth up to 4× their finetuned window, whereas all approximate schemes collapse beyond 1× (Fig. 1), clarifying when efficiency shortcuts lose fidelity.

(ii) Perplexity correlates with downstream accuracy, at least within limited lengths, once confounders are removed (Kendall τ = -0.72 on RULER; Table 6), resolving contradictory claims in prior work. We will also add the recently requested experiments to further elaborate on the limitations of PPL as a proxy for downstream task accuracy.

Comment

We thank the reviewer for their feedback. We agree “near-infinite” was overstated; we have replaced it with “beyond-training” throughout.

In our study, we define generalization as the model's ability to perform well across all tasks that extend beyond the training context length. Specifically, we have evaluated our models on tasks where the input lengths exceed 64k tokens, such as RULER, with sequences up to 128k tokens in Table 1. We will revise our writing to highlight this in our manuscript. In our original paper, we found that NTK-Dynamic yields the best performance beyond 32k.

Table 1: Generalization of NTK beyond 32k on RULER

| Method | 4k | 8k | 16k | 32k | 64k | 128k |
|---|---|---|---|---|---|---|
| Llama2-NTK-32k | 86.58 | 77.75 | 70.01 | 59.42 | 46.26 | 29.91 |
| Llama2-NTK-64k | 86.60 | 76.34 | 69.56 | 60.03 | 49.31 | 40.09 |

To further evaluate the generalization, we evaluated sequences up to 128k tokens on HELMET with Llama-3 (8B). Results are shown below.

Table 2: Llama3 with up to 128k evaluation on HELMET

| Model | 8k | 16k | 32k | 64k | 128k |
|---|---|---|---|---|---|
| Llama3-NTK-32k | 50.51 | 48.96 | 47.14 | 37.08 | 19.95 |
| Llama3-NTK-64k | 49.86 | 49.15 | 47.68 | 42.91 | 37.11 |
| Llama3-NTK-Frozen | 47.48 | 38.98 | 3.16 | 2.98 | 2.11 |
| Llama3-SelfExtend | 44.28 | 41.85 | 38.59 | 27.90 | 10.41 |
| Llama3-CLEX-32k | 46.68 | 47.14 | 42.57 | 29.43 | 17.41 |
| Llama3-PI-32k | 49.22 | 47.66 | 45.78 | 2.43 | 1.56 |
| Llama3-YaRN-32k | 48.47 | 48.65 | 45.31 | 2.95 | 1.77 |

Take-away: even the strongest exact method degrades when extrapolated 4 × beyond its fine-tuned window, confirming the reviewer’s intuition. This indicates that even for the best generalized methods we discovered in our controlled setting, their generalization becomes weaker when the context length is much larger than the fine-tuned length.

Comment

Dear Reviewer uEtu,

As this is nearing the end of the author response period, we kindly ask if there is anything additional that you would like us to address. We appreciate your valuable suggestions and feedback.

Sincerely,

Authors

Review
Rating: 4

This paper presents a controlled framework to conduct an apples-to-apples comparison among different long-context methods by using the same training data and evaluation metrics on both intrinsic and extrinsic tasks. The paper also conducts a comprehensive experimental study on several base models.

Reasons to Accept

  1. Context extension is a very important problem.

Reasons to Reject

  1. No significant novelty in the proposed framework.
  2. Some insights claimed in the experiment section are not well justified by the experimental results.

Questions to the Authors

  1. At line #246, the paper claims that "LongLora effectively keep low perplexity scores within the pre-training context length". This is not consistent with the results shown in Table 2, where the perplexity scores of LongLora are significantly larger than others.
  2. It is not obvious that "Only NTK and CLEX can generalize to unseen sequence length ...". How about Landmark?
  3. What is "NTK-F" in Table 1?
Comment

“NTK-F” stands for NTK-Frozen, i.e. NTK positional rescaling applied to a frozen base model with no further fine-tuning. We will expand the table caption and glossary to spell this out and avoid confusion.

Comment

We thank the reviewer for the question, and we clarify our contributions below.

Infrastructure contribution: Our study’s primary contribution is not a new algorithm but a standardised, fully reproducible testbed that eliminates three confounding factors that have hampered prior work: (i) heterogeneous base checkpoints, (ii) mismatched pre-training corpora, and (iii) inconsistent hyper-parameters during fine-tuning.

By releasing (a) the unified data pipeline, (b) a set of five identical base models exposed to a single 1B-token long-context corpus, and (c) templated evaluation scripts for both intrinsic and extrinsic tasks, we provide the first public “apples-to-apples” suite.

Unified mathematical view: we summarize the methods and derive an explicit frequency-scaling equivalence that mathematically links PI ↔ NTK ↔ YaRN ↔ CLEX (Eq. 7–10), making cross-method trade-offs transparent.

Findings: novel take-aways now highlighted in our paper:

(i) Exact finetuning vs. approximate attention – exact RoPE variants preserve retrieval depth up to 4× their finetuned window, whereas all approximate schemes collapse beyond 1× (Fig. 1), clarifying when efficiency shortcuts lose fidelity.

(ii) Perplexity correlates with downstream accuracy, at least within limited lengths, once confounders are removed (Kendall τ = -0.72 on RULER; Table 6), resolving contradictory claims in prior work. We will also add the recently requested experiments to further elaborate on the limitations of PPL as a proxy for downstream task accuracy.

Comment

Our observation about generalization is particularly evident on RULER: only NTK and CLEX are able to maintain reasonable performance as the evaluation length increases. In contrast, Landmark’s performance degrades significantly with longer contexts, suggesting limited generalization ability in this setting.

Table 1: Llama2 on RULER Benchmark

| Group | Models | Train Len | 4k | 8k | 16k | 32k | 64k | 128k |
|---|---|---|---|---|---|---|---|---|
| Frozen | LLaMA2 | 4k | 80.94 | - | - | - | - | - |
| | LM-Infinite | 4k | 81.05 | 30.01 | 18.02 | 12.34 | 10.56 | - |
| | NTK-Frozen | 4k | 81.14 | 44.45 | 14.79 | 0.72 | 0.91 | - |
| | Self-Extend | 4k | 65.03 | 50.73 | 44.02 | 29.50 | 9.34 | - |
| Finetuned | PI | 32k | 84.56 | 76.04 | 69.64 | 57.66 | 0.00 | - |
| | NTK-32K | 32k | 86.58 | 77.75 | 70.01 | 59.42 | 46.26 | 29.91 |
| | YaRN | 32k | 79.12 | 65.60 | 54.21 | 36.95 | 0.00 | - |
| | CLEX | 32k | 50.18 | 63.93 | 64.35 | 52.17 | 30.61 | - |
| | LongLora | 32k | 10.58 | 6.37 | 3.67 | 3.53 | 0.00 | - |
| | Landmark | 32k | 22.37 | 17.52 | 16.31 | 13.56 | 14.15 | - |
| Finetuned | NTK-64K | 64k | 86.60 | 76.34 | 69.56 | 60.03 | 49.31 | 40.09 |
Comment

We also notice that the perplexity scores of LongLoRA in Table 2 are significantly larger than those of other methods.

In Appendix L, we conduct a detailed study and find that hyperparameters can significantly influence the performance of different context extension methods, particularly approximate attention methods like LongLoRA. We sweep key hyperparameters such as batch size and learning rate.

The results are summarized below:

Table 1: Perplexity Results of LongLoRA on PG19 and Proof-pile

PG19

| Method | Batch Size | Learning Rate | 2k | 4k | 8k | 16k | 32k |
|---|---|---|---|---|---|---|---|
| LongLoRA | 32 | 2e-5 | 12.80 | 11.52 | 10.70 | 10.18 | 9.89 |
| LongLoRA | 8 | 2e-5 | 8.10 | 7.69 | 7.43 | 7.28 | 7.32 |

Proof-pile

| Method | Batch Size | Learning Rate | 2k | 4k | 8k | 16k | 32k |
|---|---|---|---|---|---|---|---|
| LongLoRA | 32 | 2e-5 | 5.97 | 5.10 | 4.58 | 4.27 | 4.13 |
| LongLoRA | 8 | 2e-5 | 3.33 | 3.01 | 2.80 | 2.67 | 2.61 |

We draw the following observations:

High Sensitivity: Approximate attention methods like LongLoRA are highly sensitive to hyperparameter settings. Small changes in batch size or learning rate lead to significant changes in performance.

Robustness Comparison: In contrast, NTK and YaRN demonstrate strong robustness across hyperparameter choices, maintaining stable perplexity across settings.

Optimization Difficulty: LongLoRA often requires much more careful hyperparameter tuning to reach optimal performance, making training more costly and less predictable.

We find that LongLoRA exhibits a similar trend to PI and YaRN: these methods effectively maintain low perplexity within the pretraining context length but fail to generalize to longer sequences.

Review
Rating: 8

This paper presents a controlled comparison of several long-context extension methods for LLMs, including (and mostly focused on) several RoPE-based position extension methods and attention approximation methods. The authors compared those methods on several different base LMs (Llama2 with various sizes, Llama3, and Phi-2), with both fine-tuning and frozen parameters. The evaluation is based on perplexity, synthetic evaluation (needle in a haystack and RULER), and downstream tasks (LongBench and in-context learning).

The main takeaway of this controlled study is that full attention + fine-tuning + NTK-style extrapolation works the best for long-context extension. The authors also argue that for long-context evaluation, perplexity is still very indicative and useful.

Reasons to Accept

This paper presents a controlled and comprehensive study on RoPE-based position extrapolation methods and demonstrates several important and interesting conclusions for the community:

(1) Position extrapolation mostly only works with fine-tuning. With fine-tuning and comparable base models+fine-tuning data, different extrapolation mechanisms do not lead to much difference (NTK slightly better than others).

(2) Full attention still works much better than attention approximation methods for long-context tasks

(3) Perplexity could still be a good evaluation for long context tasks.

The experiments are well designed (both base models and data are controlled; the experiment includes both fine-tuning and frozen settings). The analysis is comprehensive as it includes several different metrics as well as different testing lengths.

Reasons to Reject

My biggest complaint about the paper is the choice of evaluations and the missing references.

(1) Evaluation: at the time of the submission, there are several newer and more comprehensive downstream long-context benchmarks available, such as InfBench and HELMET. For the in-context learning evaluation (MShot), there are multiple established protocols the authors could have followed (from both Bertsch et al., 2025 and HELMET). I think the authors will find using those newer evaluations will make the gaps among methods even bigger, due to their better metrics and more challenging nature. Hence, even though I'm not happy with the outdated evaluation, I think changing them will only make the result stronger. The authors should at least acknowledge and cite these evaluations.

(2) Missing references: There are several newer approximate attention methods the authors should consider, such as NSA, MoBA, DuoAttention. One can argue those methods are out of scope (it seems that this paper mostly focuses on position extrapolation), but these papers should at least be acknowledged. The authors should also discuss some of the data engineering papers for long contexts, as one conclusion is "fine-tuning is important". Such papers include ProLong, LongSkyWork, and Xiong et al., 2023. The authors should also discuss "adjusted base frequency" from Xiong et al., 2023, which is basically NTK but with a clearer definition.

(3) Perplexity: I do agree that perplexity could be still useful for long-context evaluation, but the authors should also acknowledge its limitation, as discussed by some prior work like Fang et al., 2025 and Gao et al., 2024. One important premise the authors should mention is that perplexity is maybe useful when both the architecture and the data are fixed.

Even with the above reasons to reject, I still believe this paper makes enough contribution to get into COLM (if the above missing references are provided in the final version).

Questions to the Authors

Please see the "reasons to reject" section.

Comment

We will acknowledge the well-known limitations of perplexity (PPL) as a long-context metric, citing the analyses of Fang (2025)[1] and Gao (2024)[2]. We will add a new Section 5.3, “When does PPL succeed / fail?”, to address this request. Specifically:

First, we will summarize Fang et al.’s observation that token-averaged PPL can mask large errors on a handful of key tokens, making it an incomplete proxy for downstream quality.

Next, we will report that in our controlled setting—where both the model architecture and the fine-tuning data are held constant—PPL is only a moderate predictor of task performance, with a Kendall correlation of about -0.6 on the HELMET benchmark.

We further computed the Kendall correlation between downstream task performance and perplexity, and found that HELMET exhibits a stronger correlation compared to benchmarks such as RULER and LongBench.

Table 1: Kendall correlation of downstream task performance and PPL

| Task | Kendall's Tau | p-value | Interpretation |
|---|---|---|---|
| Needle | -0.7807 | 0.0151 | Strong negative correlation; statistically significant (p < 0.05). |
| Mshots | -0.2928 | 0.3621 | Weak negative correlation; not statistically significant. |
| LongB | -0.3500 | 0.2823 | Weak negative correlation; not statistically significant. |
| RULER | -0.4880 | 0.1287 | Moderate negative correlation; not statistically significant. |
| HELMET | -0.5855 | 0.0683 | Moderate-to-strong negative correlation; marginally significant (p ≈ 0.05). |

Finally, we emphasize that future evaluations should pair PPL with richer suites such as HELMET so that both token-level and task-level behaviors are captured.
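
For reference, a rank correlation of this kind can be computed directly with SciPy. The sketch below is only an illustration of the computation; the per-method PPL and HELMET values are hypothetical placeholders, not numbers from the paper.

```python
# Sketch of a Kendall correlation between per-method perplexity and a
# downstream benchmark score; the data below are hypothetical placeholders.
from scipy.stats import kendalltau

ppl_32k    = [7.3, 7.9, 8.4, 9.1, 10.6, 12.8]      # hypothetical 32k perplexities per method
helmet_32k = [47.7, 45.8, 47.1, 45.3, 42.6, 38.6]  # hypothetical HELMET scores per method

tau, p_value = kendalltau(ppl_32k, helmet_32k)
print(f"Kendall's tau = {tau:.4f}, p-value = {p_value:.4f}")  # negative tau: lower PPL, higher score
```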

References

[1] Fang L, Wang Y, Liu Z, et al. What is Wrong with Perplexity for Long-context Language Modeling? arXiv preprint arXiv:2410.23771, 2024.

[2] Gao T, Wettig A, Yen H, et al. How to Train Long-Context Language Models (Effectively). arXiv preprint arXiv:2410.02660, 2024.

Comment

We thank the reviewer for the references to improve the comprehensiveness of our paper. While we are not able to make changes to our related work section at the moment, we are going to add the following references covering the papers recommended by the reviewer.

| Category | Newly acknowledged work | How we position it |
|---|---|---|
| Approx. attention | NSA (2025) [2], MoBA (2025) [3], DuoAttention (2024) [4] | All three are sparse+learned variants that aim at amortising memory during inference; their kernels are orthogonal to our study’s focus on length extrapolation. We now clarify this scope and will integrate NSA into our efficiency plot in Appendix E. |
| Data engineering | ProLong (2024) [5], LongSkywork (2024) [6], ABF / “adjusted base frequency” (2023) [7] | We explicitly cite these works when stressing that data choice and schedule are critical for long-context finetuning — complementary to our finding that fine-tuning itself is necessary. ABF is now linked to the NTK rescaling we study, but with a clearer theoretical footing. |

References (newly cited)

[1] HELMET – Yen et al., How to Evaluate Long-Context LMs Effectively and Thoroughly, 2024.

[2] NSA – Native Sparse Attention: Hardware-Aligned & Natively-Trainable Sparse Attention, 2025.

[3] MoBA – Mixture of Block Attention for Long-Context LLMs, 2025.

[4] DuoAttention – Efficient Long-Context LLM Inference with Retrieval Heads, 2024.

[5] ProLong – Gao et al., How to Train Long-Context LMs (Effectively), 2024.

[6] LongSkywork – Zhao et al., A Training Recipe for Extending Context Length, 2024.

[7] ABF – Xiong et al., Effective Long-Context Scaling of Foundation Models, 2023.

[8] PPL limitations – Fang et al., What Is Wrong with Perplexity for Long-Context LM?, 2024.

Comment

Thank you for your dedicated review of our work. In the following, we will carefully respond to your questions.

“Evaluation protocol (“out-dated benchmarks”)”

We appreciate the reviewer’s insightful comments regarding our evaluation choices. We agree that our original evaluation setup is somewhat outdated. To address this concern, we have extended our evaluation to include the HELMET[1] benchmark, which offers a more comprehensive and challenging suite of long-context tasks. In particular, we evaluated both LLaMA2 and LLaMA3 variants using HELMET’s standard protocol.

Table 1: Llama3 on HELMET

| Model | 8k | 16k | 32k | 64k | 128k |
|---|---|---|---|---|---|
| Llama3-NTK-32k | 50.51 | 48.96 | 47.14 | 37.08 | 19.95 |
| Llama3-NTK-64k | 49.86 | 49.15 | 47.68 | 42.91 | 37.11 |
| Llama3-NTK-Frozen | 47.48 | 38.98 | 3.16 | 2.98 | 2.11 |
| Llama3-SelfExtend | 44.28 | 41.85 | 38.59 | 27.90 | 10.41 |
| Llama3-CLEX-32k | 46.68 | 47.14 | 42.57 | 29.43 | 17.41 |
| Llama3-PI-32k | 49.22 | 47.66 | 45.78 | 2.43 | 1.56 |
| Llama3-YaRN-32k | 48.47 | 48.65 | 45.31 | 2.95 | 1.77 |

Table 2: Llama2 on HELMET

| Model | 8k | 16k | 32k | 64k |
|---|---|---|---|---|
| Llama2-NTK-32k | 42.09 | 37.31 | 28.29 | 24.95 |
| Llama2-NTK-64k | 39.91 | 35.47 | 29.29 | 26.49 |
| Llama2-NTK-Frozen | 25.81 | 16.02 | 3.46 | 1.86 |
| Llama2-SelfExtend | 27.01 | 24.40 | 19.65 | 6.69 |
| Llama2-CLEX-32k | 32.65 | 30.87 | 26.43 | 22.80 |
| Llama2-PI-32k | 41.48 | 37.56 | 25.74 | 0.98 |
| Llama2-YaRN-32k | 36.83 | 30.67 | 21.28 | 0.98 |

Our Take-away: The newer HELMET tasks reinforce our original conclusion — full attention + NTK fine-tuning remains dominant; gains even widen on the 128 k slice (see Table 1).

[1] HELMET – Yen et al., How to Evaluate Long-Context LMs Effectively and Thoroughly, 2024.

Comment

The authors did a really good job at the rebuttal, especially with the additional results on the correlation between PPL and downstream evaluations, and the new HELMET results. I will raise my score.

Comment

We thank Reviewer LDP3 for acknowledging our rebuttal content and raising the score.

We are happy to answer if there are any additional questions.

Review
Rating: 6

The paper delivers the first controlled comparison of long-context extension methods for LLMs. With five identical open-weight base models, a single 1B-token long-context corpus, and unified hyper-parameters, the authors benchmark eight techniques (i.e., exact RoPE variants (PI, NTK, YaRN, CLEX), approximate/sparse attentions (LongLoRA, Landmark, LM-Infinite), and the mapping-based Self-Extend) on a standardized suite of intrinsic (perplexity, Needle-in-a-Haystack, RULER) and extrinsic (LongBench, many-shot ICL) tasks up to 64k tokens. Several key findings reveal interesting insights.

Reasons to Accept

  1. The study is rigorous, fixing base models, data, and training recipes for each controlled experiment.
  2. The authors have conducted a comprehensive evaluation across diverse settings.

Reasons to Reject

  1. The analysis of each experiment is not deep; limited insights are given.
  2. Experiments stop at 64k tokens, which is not “near-infinite”.

Questions to the Authors

  1. Why does approximate attention degrade?
  2. What is the compute budget for the different experiments? How is the efficiency?
Comment

“Compute budget / efficiency?”

We conducted efficiency analysis under controlled conditions using the same hardware setup in Appendix E.

As shown in Table 3 below, we observed that approximate attention methods are indeed faster, achieving a speedup of approximately 1.5x to 2x compared to LLaMA when the context length is short; however, when the context length gets longer, we did not see a significant margin.

We hypothesize that the discrepancy between the theoretical FLOPs-based comparisons and the observed speedup arises due to differences in hardware characteristics and CUDA implementations of the respective methods.

Table 3: Efficiency analysis of prefill-stage time cost, decoding speed, and memory usage. The prefill time cost represents the time required to generate the first token. The decoding speed (seconds per token) is averaged over 100 token inferences at each sequence length. Memory consumption corresponds to the peak GPU memory usage during inference. All methods, except for LM-Infinite and Landmark, utilize Flash-Attention 2 for enhanced computational efficiency.

| Method | 4k | 8k | 16k | 32k |
|---|---|---|---|---|
| | Prefill (s) / Decode (s) / Mem (GB) | Prefill (s) / Decode (s) / Mem (GB) | Prefill (s) / Decode (s) / Mem (GB) | Prefill (s) / Decode (s) / Mem (GB) |
| Llama2-7b | 1.15 / 0.03 / 17.13 | 1.51 / 0.06 / 21.61 | 2.41 / 0.11 / 30.59 | 4.63 / 0.21 / 48.55 |
| NTK-Frozen | 1.16 / 0.04 / 17.13 | 1.56 / 0.05 / 21.61 | 2.39 / 0.06 / 30.59 | 4.69 / 0.09 / 48.55 |
| PI | 1.15 / 0.03 / 22.05 | 1.54 / 0.03 / 26.54 | 2.43 / 0.05 / 35.51 | 4.74 / 0.08 / 53.47 |
| NTK-32k | 1.17 / 0.04 / 17.11 | 1.56 / 0.04 / 21.60 | 2.42 / 0.06 / 30.58 | 4.75 / 0.09 / 48.53 |
| YaRN | 1.23 / 0.03 / 18.05 | 1.53 / 0.03 / 22.54 | 2.43 / 0.05 / 31.51 | 4.80 / 0.08 / 49.47 |
| CLEX | 1.16 / 0.05 / 17.16 | 6.99 / 0.07 / 21.74 | 7.68 / 0.11 / 30.92 | 10.06 / 0.18 / 49.28 |
| LM-Infinite | 1.56 / 0.05 / 17.23 | 3.34 / 0.07 / 25.47 | 5.82 / 0.11 / 38.60 | 11.58 / 0.18 / 65.61 |
| Self-Extend | 1.24 / 0.05 / 17.23 | 1.63 / 0.07 / 21.81 | 2.63 / 0.13 / 30.98 | 4.97 / 0.22 / 49.32 |
| LongLora | 1.16 / 0.05 / 17.16 | 1.65 / 0.05 / 21.65 | 2.60 / 0.05 / 30.62 | 5.07 / 0.08 / 48.58 |
| Landmark | 8.62 / 0.08 / 18.77 | 17.65 / 0.08 / 22.97 | 36.47 / 0.09 / 31.22 | 77.77 / 0.09 / 47.74 |
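
For completeness, the sketch below illustrates one way the three quantities in the caption (prefill time, per-token decode time, peak memory) can be collected for a Hugging Face-style causal LM. It is an assumption-laden reconstruction of the measurement protocol, not the script used in Appendix E.

```python
# Hedged sketch of the measurement protocol: prefill = time to the first new
# token, decode = average seconds per token over `decode_tokens` steps,
# memory = peak GPU usage during the run. Assumes a HF-style causal LM.
import time
import torch

@torch.inference_mode()
def profile(model, input_ids, decode_tokens: int = 100):
    torch.cuda.reset_peak_memory_stats()
    torch.cuda.synchronize()

    t0 = time.perf_counter()
    out = model(input_ids, use_cache=True)            # prefill: build the KV cache
    next_tok = out.logits[:, -1:].argmax(-1)
    torch.cuda.synchronize()
    prefill_s = time.perf_counter() - t0

    t0 = time.perf_counter()
    past = out.past_key_values
    for _ in range(decode_tokens):                    # autoregressive decoding
        out = model(next_tok, past_key_values=past, use_cache=True)
        past = out.past_key_values
        next_tok = out.logits[:, -1:].argmax(-1)
    torch.cuda.synchronize()
    decode_s = (time.perf_counter() - t0) / decode_tokens

    mem_gb = torch.cuda.max_memory_allocated() / 1024**3
    return prefill_s, decode_s, mem_gb
```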
Comment

“Experiments stop at 64 k; not ‘near-infinite’ ”

We agree “near-infinite” was overstated; we have replaced it with “beyond-training” throughout.

In our study, we define generalization as the model's ability to perform well across all tasks that extend beyond the training context length. Specifically, we have evaluated our models on tasks where the input lengths exceed 64k tokens, such as RULER[1], with sequences up to 128k tokens in Table 1.

We will revise our writing to highlight this in our manuscript. In our original paper, we found that NTK-Dynamic yields the best performance beyond 32k.

Table 1: Generalization of NTK beyond 32k on RULER

| Method | 4k | 8k | 16k | 32k | 64k | 128k |
|---|---|---|---|---|---|---|
| Llama2-NTK-32k | 86.58 | 77.75 | 70.01 | 59.42 | 46.26 | 29.91 |
| Llama2-NTK-64k | 86.60 | 76.34 | 69.56 | 60.03 | 49.31 | 40.09 |

To further evaluate the generalization, we evaluated sequences up to 128k tokens on HELMET[2] with Llama-3 (8B). Results are shown below.

Table 2: Llama3 with up to 128k evaluation on HELMET

| Model | 8k | 16k | 32k | 64k | 128k |
|---|---|---|---|---|---|
| Llama3-NTK-32k | 50.51 | 48.96 | 47.14 | 37.08 | 19.95 |
| Llama3-NTK-64k | 49.86 | 49.15 | 47.68 | 42.91 | 37.11 |
| Llama3-NTK-Frozen | 47.48 | 38.98 | 3.16 | 2.98 | 2.11 |
| Llama3-SelfExtend | 44.28 | 41.85 | 38.59 | 27.90 | 10.41 |
| Llama3-CLEX-32k | 46.68 | 47.14 | 42.57 | 29.43 | 17.41 |
| Llama3-PI-32k | 49.22 | 47.66 | 45.78 | 2.43 | 1.56 |
| Llama3-YaRN-32k | 48.47 | 48.65 | 45.31 | 2.95 | 1.77 |

Take-away: even the strongest exact method degrades when extrapolated 4 × beyond its fine-tuned window, confirming the reviewer’s intuition. This indicates that even for the best generalized methods we discovered in our controlled setting, their generalization becomes weaker when the context length is much larger than the fine-tuned length.

References

[1] Hsieh C P, Sun S, Kriman S, et al. RULER: What's the Real Context Size of Your Long-Context Language Models? arXiv preprint arXiv:2404.06654, 2024.

[2] Yen H, Gao T, Hou M, et al. HELMET: How to Evaluate Long-Context Language Models Effectively and Thoroughly. arXiv preprint arXiv:2410.02694, 2024.

Comment

We thank the reviewer for the thoughtful feedback and for recognizing the rigor of our controlled protocol (fixed base models, data, and hyper-parameters).

“Analysis … not deep, limited insights.”

We thank the reviewer for the question, and we clarify our contributions below.

Infrastructure contribution: Our study’s primary contribution is not a new algorithm but a standardised, fully reproducible testbed that eliminates three confounding factors that have hampered prior work: (i) heterogeneous base checkpoints, (ii) mismatched pre-training corpora, and (iii) inconsistent hyper-parameters during fine-tuning.

By releasing (a) the unified data pipeline, (b) a set of five identical base models exposed to a single 1B-token long-context corpus, and (c) templated evaluation scripts for both intrinsic and extrinsic tasks, we provide the first public “apples-to-apples” suite.

Unified mathematical view: we summarize the methods and derive an explicit frequency-scaling equivalence that mathematically links PI ↔ NTK ↔ YaRN ↔ CLEX (Eq. 7–10), making cross-method trade-offs transparent.

Findings: novel take-aways now highlighted in our paper:

(i) Exact finetuning vs. approximate attention – exact RoPE variants preserve retrieval depth up to 4× their finetuned window, whereas all approximate schemes collapse beyond 1× (Fig. 1), clarifying when efficiency shortcuts lose fidelity.

(ii) Perplexity correlates with downstream accuracy, at least within limited lengths, once confounders are removed (Kendall τ = -0.72 on RULER; Table 6), resolving contradictory claims in prior work. We will also add the recently requested experiments to further elaborate on the limitations of PPL as a proxy for downstream task accuracy.

Comment

“Why does approximate attention degrade?”

While the purpose of this paper is not to improve over existing methods, we have the following hypotheses:

  1. Information bottleneck (two-hop routing).
    Block-wise schemes such as Landmark Attention compress every B tokens into a single landmark; a query must therefore travel token → landmark → token, discarding fine-grained cues. Pointer-style tasks (e.g., Needle-in-a-Haystack) are the first to collapse[1].

  2. Phase aliasing with RoPE.
    Sparse or chunked heads act as a low-pass filter on RoPE’s complex phases: high-frequency components (large ω) are pruned, so accumulated phase error grows with hop count and the dot-product logits vanish behind the softmax. A recent analysis of RoPE’s failure modes confirms this aliasing effect[2].

  3. Length-generalization failure.
    During fine-tuning, approximate kernels see a maximum relative distance d_train; at inference they are queried at d’ ≫ d_train. LM-Infinite[3] applies a sliding-window mechanism to discard distant context, which keeps the input length within the context window. However, LM-Infinite can only attend to the tokens within the local window, which leads to a rapid decline in its performance as the sequence length increases[4][5] (a minimal mask sketch is given after this list).

  4. Hardware-bound indirection costs.
    Many “sub-quadratic” kernels rely on gather/scatter or landmark sorting; on modern GPUs these memory-bound operations dominate wall-time, erasing the FLOP advantage once context length exceeds ≈ 32 k. Systems work on million-token inference using Context Parallelism observes the same bottleneck[6][7].
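
To make point 3 concrete, below is a small illustrative sketch (ours, not LM-Infinite’s released code) of a Λ-shaped attention mask: a handful of always-visible initial tokens plus a local sliding window, so middle-of-sequence tokens become invisible to distant queries no matter how long the input grows.

```python
# Illustrative Lambda-shaped mask: each query attends only to the first
# `n_global` tokens plus the most recent `window` tokens (causally).
import numpy as np

def lambda_mask(seq_len: int, n_global: int = 4, window: int = 2048) -> np.ndarray:
    q = np.arange(seq_len)[:, None]          # query positions
    k = np.arange(seq_len)[None, :]          # key positions
    causal = k <= q
    keep_global = k < n_global               # always-visible initial tokens
    keep_local = (q - k) < window            # local sliding window
    return causal & (keep_global | keep_local)

mask = lambda_mask(seq_len=8, n_global=2, window=3)
print(mask.astype(int))   # 1 = attended; distant middle tokens drop out of view
```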

References

[1] A. Mohtashami & L. Alibeigi. Landmark Attention: Random-Access Infinite Context Length for Transformers. arXiv:2305.16300, 2023.
[2] I. Y. Men et al. Round and Round We Go! What Makes Rotary Positional Encodings Work—and Fail—in Long Contexts. OpenReview, 2025.
[3] C. Han et al. LM-Infinite: Zero-Shot Extreme Length Generalization for Large Language Models. arXiv:2308.16137, 2023.
[4] Chaojun Xiao et al. InfLLM: Training-Free Long-Context Extrapolation for LLMs with an Efficient Context Memory. arXiv:2402.04617, 2024.
[5] Yi Lu et al. LongHeads: Multi-Head Attention is Secretly a Long Context Processor. arXiv:2402.10685, 2024.
[6] Z. Wang et al. Efficient Infinite-Context Transformers with Infini-Attention. arXiv:2404.07143, 2024.
[7] Y. Chen et al. Context Parallelism for Scalable Million-Token Inference. arXiv:2411.01783, 2024.

Comment

Thanks for the response. I will maintain my score since it still does not fully address the long-context problem.

Comment

Thank you for your response. If you have any specific concerns or questions, we would be happy to further clarify and address them.

Comment

Dear Reviewer RShr,

As this is nearing the end of the author response period, we kindly ask if there is anything additional that you would like us to address. We appreciate your valuable suggestions and feedback.

Sincerely, Authors

Final Decision

This paper introduces a controlled extension protocol and a robust evaluation for long-context extension. It systematically compares a range of existing methods, including RoPE variants, approximate attention, and attention modifications. The evaluation includes perplexity, the needle-in-a-haystack task, RULER, and several extrinsic tasks (LongBench, many-shot tasks). The findings suggest that perplexity is a useful indicator and that exact fine-tuning remains robust. In terms of long-context extension techniques, NTK performs consistently well, while modified and approximate attention methods struggle to generalize.

Reviewers acknowledge that:

  • The paper is rigorous and well controlled, e.g., fixing base models, data, and training recipes for each controlled experiment (RShr, LDP3, uEtu)
  • Evaluation is comprehensive and extensive (RShr, LDP3, uEtu)
  • Several impactful findings, such as (1) positional extrapolation only works with fine-tuning, (2) approximate attention generally performs poorly, and (3) perplexity remains a useful indicator (LDP3).

Weaknesses

  • Experiments are limited to 64K tokens, which is significantly shorter than the target context windows of modern long-context LLMs (RShr, uEtu). Additional experiments were added during the rebuttal, showing that most methods fail to generalize when the extension length is significantly beyond the original context window; the authors committed to adding these to the final version of the paper.
  • Missing evaluations (LDP3) – added during the rebuttal
  • Missing references (LDP3) – the authors have committed to adding them
  • No significant novelty or new insights, given that many findings, such as perplexity’s utility and approximate-attention trade-offs, are already known (RShr, hLQf, uEtu).

[Automatically added comment] At least one review was discounted during the decision process due to quality.