A Controlled Study on Long Context Extension and Generalization in LLMs
Using a controlled protocol to systematically study long context extension methods
Abstract
Reviews and Discussion
This paper presents a comprehensive empirical study on long-context extension methods for language models, including 4 methods that adapt the RoPE positional embedding and 4 methods that approximate the attention operations. The paper ensures a fair comparison of these methods and concludes with three key takeaways: (1) contrary to claims in previous works, perplexity and downstream task performance are highly correlated; (2) RoPE-adaptation methods generally work better than attention-approximation methods; (3) Dynamic NTK works best among the compared methods.
Strengths
- The topic of this paper is timely and important. While there is a lot of recent work on long-context extension, the experiments were conducted with different base models and different training data; hence, there is a lack of fair comparison. This work aims to answer this important question.
- The background section (section 3) is mostly well-written and provides a unified view of these various context extension methods.
- The authors conducted extensive experiments and ensured the comparison is conducted in a fair manner.
Weaknesses
- I'm not fully convinced by some of the experiment results and takeaways. See questions below.
- The paper mentions lack of "quantitative rankings of different methodologies" as a motivation of this work. While this paper finds NTK-Dynamic to work best in general, it does not provide a full ranking.
- The presentation and organization of the paper can be improved in various aspects, e.g., using more visualizations instead of tables to summarize the results and highlight main findings; having a table to summarize the key characteristics of the 8 compared methods.
Questions
- Figure 1. I am surprised by the LM-Infinite result, as the authors of LM-Infinite reported a pass rate of about 80% on the Passkey Retrieval task, which is very similar to the needle-in-the-haystack evaluation conducted in Figure 1. However, LM-Infinite fails most cases on NIAH as reported in Figure 1. Could you help explain what potential aspects lead to this gap?
- Line 484 "Perplexity and downstream tasks". I'm not fully convinced by this argument and Figure 4. It seems that the linear trend in Figure 4 is highly dependent on LongLora and Landmark, in a sense that the linear trend is likely to disappear without these two compared methods.
- It seems that when perplexity is below a certain level, e.g., below 6, the perplexity differences between models are small but the downstream performance differences are large. Thus I'm concerned with the claim that perplexity is a "general-purpose performance indicator" as suggested in the abstract; from my understanding, it is only indicative within a certain region.
- Could you please consider adding metrics such as rank correlation to further strengthen the claim?
- Line 505 "Context extension hurts in the short term and gains in the long term". I'm not quite sure what this title means here. What are "short term" and "long term" referring to here?
- Is there any inference speed trade-off between these compared methods? e.g., Are attention approximation methods faster? By how much? For some applications the inference speed may be critical. Providing such information will help users make informed decisions.
Others:
- Line 194: What is the CLEX method here? Currently there is little introduction of it.
- Line 221: "key matrix", are you referring to "key and query matrices"?
Figure 1. I am surprised by the LM-Infinite result, as the authors of LM-Infinite reported a pass rate of about 80% on the Passkey Retrieval task, which is very similar to the needle-in-the-haystack evaluation conducted in Figure 1. However, LM-Infinite fails most cases on NIAH as reported in Figure 1. Could you help explain what potential aspects lead to this gap?
We thank the reviewer for the question. We ran the LM-Infinite passkey retrieval task and found that LM-Infinite indeed achieves good results within the original context length; however, beyond that length, the results degrade considerably. Results are shown in Table 1.
In concurrent work, InfLLM [1], the authors made a similar observation: as the context length grows, the accuracy of LM-Infinite on their passkey retrieval task drops considerably with their Mistral-7B base model. We hypothesize that LM-Infinite is good at attending to closer tokens within the window, hence the decrease as the length grows. However, LM-Infinite is very good at preserving short-context ability.
Table 1: LM-Infinite passkey retrieval results
| Token Len | 2k | 4k | 8k | 16k | 32k |
|---|---|---|---|---|---|
| llama-2-7b-hf (4k) | 1.0 | 0.93 | 0.39 | 0.22 | 0.08 |
Table 2: Results taken from InfLLM[1]
| Token Len | 32k | 64k | 128k |
|---|---|---|---|
| Mistral-7B-Instruct-v0.2 (32k) [1] | 0.30 | 0.17 | 0.00 |
Reference:
Xiao, Chaojun, et al. "Infllm: Training-free long-context extrapolation for llms with an efficient context memory." The Thirty-eighth Annual Conference on Neural Information Processing Systems. 2024.
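For readers unfamiliar with the passkey retrieval setup referenced above, it hides a random key inside long filler text and checks whether the model can recall it. A minimal sketch of prompt construction follows; the filler wording, lengths, and question format are illustrative assumptions, not the exact setup used in LM-Infinite, InfLLM, or our study.

```python
import random

def make_passkey_prompt(n_filler=400, passkey=None):
    """Build an illustrative passkey-retrieval prompt and return (prompt, key)."""
    passkey = passkey if passkey is not None else random.randint(10000, 99999)
    filler = "The grass is green. The sky is blue. The sun is yellow. Here we go. " * 5
    lines = [filler] * n_filler
    # Bury the key roughly in the middle of the context.
    lines.insert(n_filler // 2, f"The pass key is {passkey}. Remember it. {passkey} is the pass key. ")
    prompt = "".join(lines) + "\nWhat is the pass key?"
    return prompt, passkey

prompt, key = make_passkey_prompt()
# A model "passes" a sample if str(key) appears in its generated answer.
```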
Thank you for clarifying!
- "within the length LM-Infinite achieves create results" I'm not sure what this sentence mean here.
- The LM-Infinite paper reported ~80% accuracy at 16k input length; The results you shared suggests a pass rate of 22% at 16k. This is still surprising to me. I think it is necessary to further discuss this huge gap.
- However this is not the main focus of the paper so I'm not going to question this further.
We thank the reviewer for the comments.
We are sorry about the typo in our response. We intended to say "within the length LM-Infinite achieves good results".
We will cite more papers about LM-Infinite results in our paper to further explain the performance gap.
Line 194: What is the CLEX method here? Currently there is little introduction of it.
We thank the reviewer for the suggestion. We have added an explanation of the CLEX method. CLEX achieves near-SOTA results and is considered a relatively recent baseline.
Line 221: "key matrix" refers to "key and query matrices"?
We thank the reviewer for pointing this out, and we will revise as suggested.
Line 505 "Context extension hurts in the short term and gains in the long term". I'm not quite sure what this title means here. What are "short term" and "long term" referring to here?
We thank the reviewer for the question. Both "short term" and "long term" here refer to context length. We will change the wording to "short context" and "long context" to avoid any confusion.
Figure 3, where we analyze the average negative log-likelihood across different context positions, suggests that extending context length can impact performance at shorter contexts. To further validate our hypothesis, we re-evaluated the models on short tasks from the Open LLM Leaderboard.
Table 3: Short-Text Task Performance
| Methods | ARC-c | ARC-e | Hellaswag | MMLU | TruthfulQA | WinoGrande | Average |
|---|---|---|---|---|---|---|---|
| Llama2-7b-base | 0.5273 | 0.8131 | 0.7896 | 0.4209 | 0.3897 | 0.7443 | 0.6142 |
| LM-Infinite | 0.5256 | 0.8136 | 0.7895 | 0.4209 | 0.3896 | 0.7411 | 0.6134 |
| Self-Extend | 0.5256 | 0.8131 | 0.7894 | 0.4207 | 0.3897 | 0.7443 | 0.6138 |
| NTK-Frozen-hf | 0.5273 | 0.8131 | 0.7896 | 0.4209 | 0.3897 | 0.7443 | 0.6142 |
| PI | 0.5111 | 0.8114 | 0.7744 | 0.3719 | 0.3803 | 0.7174 | 0.5944 |
| NTK-32k | 0.4915 | 0.8022 | 0.7448 | 0.3525 | 0.3813 | 0.7261 | 0.5831 |
| NTK-64k | 0.4608 | 0.7832 | 0.7068 | 0.3427 | 0.3908 | 0.7024 | 0.5645 |
| YaRN | 0.5341 | 0.8182 | 0.7847 | 0.4106 | 0.3863 | 0.7443 | 0.6130 |
| CLEX | 0.5060 | 0.8127 | 0.7606 | 0.3754 | 0.3610 | 0.6472 | 0.5772 |
| LongLora | 0.4667 | 0.7858 | 0.6708 | 0.2629 | 0.3761 | 0.5525 | 0.5191 |
We had the following observations:
- Performance Degradation: Most long-context extension methods exhibit a slight decrease in performance on short-text tasks compared to the base model. In particular, NTK-Frozen maintains better performance on short-text tasks than fine-tuned methods like NTK-RoPE.
- Trade-Off Between Long and Short Contexts: The reduction in short-text performance is more pronounced in models using continuous fine-tuning methods. This indicates a potential trade-off between enhancing long-context capabilities and maintaining optimal performance on short-text tasks.
- Alignment with Original Findings: These results align with our observations in Figure 3, where we analyze the average negative log-likelihood across different context positions, suggesting that extending context length can impact performance at shorter contexts.
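To make the Figure 3-style analysis concrete, here is a minimal sketch of how per-position negative log-likelihood can be computed for a causal LM; this is an illustrative reconstruction, not the exact evaluation script used in the paper.

```python
import torch

def nll_by_position(model, input_ids):
    """Negative log-likelihood of each token given its prefix.

    Averaging these values per position (or per position bucket) across many
    documents yields the kind of curve discussed for Figure 3.
    """
    with torch.no_grad():
        logits = model(input_ids).logits                     # [1, seq_len, vocab]
    log_probs = torch.log_softmax(logits[:, :-1].float(), dim=-1)
    targets = input_ids[:, 1:]
    nll = -log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)  # [1, seq_len - 1]
    return nll[0]                                            # NLL at positions 1 .. seq_len - 1
```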
The presentation and organization of the paper can be improved in various aspects, e.g., using more visualizations instead of tables to summarize the results and highlight main findings; having a table to summarize the key characteristics of the 8 compared methods.
We thank the reviewer for the valuable feedback on improving our paper’s presentation and organization. Currently, we have summarized our main findings in Table 1 and included a heatmap and three plots to demonstrate our results.
To enhance clarity, we will add additional visualizations that summarize perplexity trends and final results. Additionally, we will include a new table that outlines the key characteristics of the eight compared methods. These enhancements aim to better highlight our main findings and provide a comprehensive overview of each method’s features.
Thank you for your insightful suggestions.
Is there any inference speed trade-off between these compared methods? e.g., Are attention approximation methods faster? By how much? For some applications the inference speed may be critical. Providing such information will help users make informed decisions.
We thank the reviewer for raising this insightful question. In our experimental study, our primary focus was on evaluating the effectiveness of each method, which aligns with the approach taken in much of the recent literature.
That said, we also conducted inference speed comparisons under controlled conditions using the same hardware setup. As shown in Table 1, approximate attention methods are indeed faster, achieving a speedup of approximately 1.5x to 2x over LLaMA when the context length is short; however, as the context length grows, the margin is no longer significant. We hypothesize that the discrepancy between theoretical FLOPs-based comparisons and the observed speedup arises from differences in hardware characteristics and the CUDA implementations of the respective methods.
Table 1: Efficiency analysis of prefill stage time cost, decoding speed, and memory usage
The prefill time cost represents the time required to generate the first token. The decoding speed (seconds per token) is averaged over 100 token inferences at each sequence length. Memory consumption corresponds to the peak GPU memory usage during inference. All methods, except for LM-Infinite and Landmark, utilize Flash-Attention 2 for enhanced computational efficiency.
| Method | 4k | 8k | 16k | 32k |
|---|---|---|---|---|
| | Prefill (s) / Decode (s) / Mem (GB) | Prefill (s) / Decode (s) / Mem (GB) | Prefill (s) / Decode (s) / Mem (GB) | Prefill (s) / Decode (s) / Mem (GB) |
| Llama2-7b | 1.15 / 0.03 / 17.13 | 1.51 / 0.06 / 21.61 | 2.41 / 0.11 / 30.59 | 4.63 / 0.21 / 48.55 |
| NTK-Frozen | 1.16 / 0.04 / 17.13 | 1.56 / 0.05 / 21.61 | 2.39 / 0.06 / 30.59 | 4.69 / 0.09 / 48.55 |
| PI | 1.15 / 0.03 / 22.05 | 1.54 / 0.03 / 26.54 | 2.43 / 0.05 / 35.51 | 4.74 / 0.08 / 53.47 |
| NTK-32k | 1.17 / 0.04 / 17.11 | 1.56 / 0.04 / 21.60 | 2.42 / 0.06 / 30.58 | 4.75 / 0.09 / 48.53 |
| YaRN | 1.23 / 0.03 / 18.05 | 1.53 / 0.03 / 22.54 | 2.43 / 0.05 / 31.51 | 4.80 / 0.08 / 49.47 |
| CLEX | 1.16 / 0.05 / 17.16 | 6.99 / 0.07 / 21.74 | 7.68 / 0.11 / 30.92 | 10.06 / 0.18 / 49.28 |
| LM-Infinite | 1.56 / 0.05 / 17.23 | 3.34 / 0.07 / 25.47 | 5.82 / 0.11 / 38.60 | 11.58 / 0.18 / 65.61 |
| Self-Extend | 1.24 / 0.05 / 17.23 | 1.63 / 0.07 / 21.81 | 2.63 / 0.13 / 30.98 | 4.97 / 0.22 / 49.32 |
| LongLora | 1.16 / 0.05 / 17.16 | 1.65 / 0.05 / 21.65 | 2.60 / 0.05 / 30.62 | 5.07 / 0.08 / 48.58 |
| Landmark | 8.62 / 0.08 / 18.77 | 17.65 / 0.08 / 22.97 | 36.47 / 0.09 / 31.22 | 77.77 / 0.09 / 47.74 |
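For transparency, the measurement protocol can be sketched roughly as follows; the checkpoint name, input file, and exact synchronization details are illustrative assumptions rather than the authors' actual harness.

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "meta-llama/Llama-2-7b-hf"  # illustrative checkpoint
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.float16, device_map="cuda")

# Truncate an (assumed) long document to the target sequence length, e.g. 32k tokens.
ids = tok(open("long_document.txt").read(), return_tensors="pt").input_ids[:, :32768].to("cuda")

torch.cuda.reset_peak_memory_stats()
torch.cuda.synchronize(); t0 = time.time()
out = model(ids, use_cache=True)                          # prefill: time to first token
torch.cuda.synchronize(); prefill_s = time.time() - t0

past, next_id = out.past_key_values, out.logits[:, -1:].argmax(-1)
torch.cuda.synchronize(); t0 = time.time()
for _ in range(100):                                      # decode speed averaged over 100 tokens
    out = model(next_id, past_key_values=past, use_cache=True)
    past, next_id = out.past_key_values, out.logits[:, -1:].argmax(-1)
torch.cuda.synchronize()
decode_s_per_token = (time.time() - t0) / 100
peak_mem_gb = torch.cuda.max_memory_allocated() / 2**30
print(prefill_s, decode_s_per_token, peak_mem_gb)
```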
Thanks for sharing these results! In future versions of the paper, it would be nice to report them along with the performance metrics (in Table 1 and 2) into a 2D figure, with performance on the Y-axis and speed on the X-axis.
Line 484 "Perplexity and downstream tasks". This argument and Figure 4 do not fully convince me. It seems that the linear trend in Figure 4 is highly dependent on LongLora and Landmark, in the sense that the linear trend is likely to disappear without these two compared methods. It seems that when perplexity is below a certain level, e.g., below 6, the perplexity differences between models are small but the downstream performance differences are large. Thus I'm concerned with the claim that perplexity is a "general-purpose performance indicator" as suggested in the abstract, it can only indicate well within a certain region from my understanding. Please consider adding metrics such as rank correlation to strengthen the claim further.
We thank the reviewer for the question. In our revision, we will make this point clear. Here are the updated results for the correlation using a non-parametric measure, Kendall's tau.
Table 1: Kendall correlation of downstream task performance and PPL
| Task | Kendall's Tau | p-value | Interpretation |
|---|---|---|---|
| Needle | -0.7191 | 0.0041 | Strong negative correlation; statistically significant (p < 0.01). |
| Mshots | -0.4944 | 0.0482 | Moderate negative correlation; borderline significant (p ≈ 0.05). |
| LongB | -0.6136 | 0.0149 | Strong negative correlation; statistically significant (p < 0.05). |
| RULER | -0.7191 | 0.0041 | Strong negative correlation; statistically significant (p < 0.01). |
Key Findings:
- Consistency Across Tasks:
  - The results show a strong and statistically significant negative correlation between PPL and downstream performance for most tasks.
  - This supports the claim that lower PPL values are generally associated with better downstream task performance.
- Task-Specific Observations:
  - The strongest correlations are observed for Needle and RULER, where Kendall's tau indicates a robust alignment between PPL and task performance rankings.
  - For Mshots, the correlation is moderate and statistically weaker, suggesting that PPL's predictive ability may vary slightly depending on the task.
- Impact of Perplexity Range:
  - Even when perplexity values are close (e.g., below 6), PPL rankings remain a reliable indicator of downstream performance. However, the narrower range may amplify the observed performance differences, highlighting the need for nuanced interpretation.
We will add these findings to our revision.
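For reference, the rank correlation above can be reproduced with a few lines of SciPy given per-method perplexities and task scores; the numbers below are illustrative placeholders, not the values from our tables.

```python
from scipy.stats import kendalltau

# Illustrative placeholder values: per-method 32k perplexity and a downstream score.
ppl    = [2.5, 2.6, 2.6, 2.8, 4.1, 6.9]
scores = [59.0, 56.0, 37.0, 30.0, 1.0, 12.0]

tau, p_value = kendalltau(ppl, scores)
print(f"Kendall's tau = {tau:.4f}, p = {p_value:.4f}")  # negative tau: lower PPL, higher score
```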
Thanks for adding the Kendall-Tau metric and providing these new discussions. I'm afraid I'm still concerned with the argument that there is "a strong correlation between perplexity and downstream task performance." Indeed, a strong correlation is present in the current setting when LongLora and Landmark are included. But this conclusion may not be general enough, especially if we focus on the 5 strongest methods that achieve ppl below 6 in Figure 4. To address this concern, could you please consider repeating the analysis in Figure 4 with LongLora and Landmark excluded? Alternatively, you might consider revising the wording of this conclusion and presenting it with less emphasis in the paper?
Dear authors and area chair,
Given the new results and discussion posted by the authors, I have raised my rating to 6. My concerns about the LM-infinite comparison and inference speed trade-off are resolved.
We still have some pending discussion regarding whether there is a strong correlation between ppl and downstream performance. While the conclusion holds when considering all long-context extension methods, the correlation appears weaker when focusing on the strongest or most advanced long-context extension methods. I'm concerned with the broader applicability of the conclusion.
Reviewer j6WF
We thank Reviewer j6WF for the suggestion to further improve the paper.
We have revised our results according to the suggestion of Reviewer j6WF and softened the tone of the findings presented in our paper. We specifically qualify our finding with "to some extent" and, as Reviewer j6WF suggested, note this limitation in our limitations section.
For example, we now write the introduction paragraph as follows:
First, while there have been suggestions that we need new ways to measure performance, our findings show that perplexity does align with how well models perform to some extent on various tasks in our controlled studies. Though some newer attention methods don't show this pattern as clearly, we generally found that when models got better at predicting text, they also got better at most other tasks we tested them on.
We rewrote our discovery paragraph as follows:
While prior work [1, 2] suggests that perplexity may not reliably predict long-range task performance, our analysis in Figure 2 reveals that, to some extent, perplexity might be reliable. We observe a general correlation between perplexity and model performance across tasks. However, we also observed that approximate attention methods, including LongLora and Landmark on RULER, show minor deviations but maintain a roughly linear relationship. We hypothesize that this apparent discrepancy with previous findings may stem from their less controlled experimental conditions and noisier datasets.
References
[1] Sun, Simeng, et al. "Do long-range language models actually use long-range context?." arXiv preprint arXiv:2109.09115 (2021).
[2] An, Chenxin, et al. "L-eval: Instituting standardized evaluation for long context language models." arXiv preprint arXiv:2307.11088 (2023).
This paper systematically evaluates various methods for extending the context length of LLMs, aiming to provide insights into the behavior of long-context models and establish a standardized evaluation framework. It designs a controlled protocol for comparing context extension methods using consistent base models and extension data. The study includes an examination of the performance of different attention mechanisms in long-context tasks, confirms that perplexity remains a relevant performance indicator in longer-context scenarios, and presents findings indicating that exact fine-tuning methods are effective within their extension range, while approximation methods tend to underperform. Additionally, the paper emphasizes the open-sourcing of codebases, models, and checkpoints to promote transparency and facilitate further research.
Strengths
- The paper introduces a novel controlled protocol for evaluating long-context extension methods, addressing a significant gap in the literature regarding the comparison of such techniques.
- The study is comprehensive, utilizing a variety of metrics and tasks to assess model performance. The use of standardized base models and extension data enhances the quality of the comparative analysis.
Weaknesses
- The study is limited to three base models, which may not accurately represent the performance of other, potentially larger models. Expanding the analysis to include a more diverse set of base models could strengthen the conclusions.
- While the paper acknowledges limitations due to fixed hyperparameters, a more in-depth exploration of how different hyperparameter settings might affect the results would be beneficial.
- The generalization of findings to longer contexts beyond 32k (e.g., 128k and 1m) is not addressed, which is a significant limitation given the focus on long-context models.
- The insights provided in this paper have been discussed in previous studies, and I did not gain any new takeaways from it.
Questions
- What is the performance of new models, such as Qwen2.5, in this experiment?
- What are your thoughts on the generalization behavior of these methods for contexts longer than 32k? Are there any preliminary findings or conjectures regarding this?
Details of Ethics Concerns
NA
The study is limited to three base models, which may not accurately represent the performance of other, potentially larger models. Expanding the analysis to include a more diverse set of base models could strengthen the conclusions.
Thank you for highlighting the importance of evaluating larger models. In response, we have conducted additional experiments using the LLaMA-13B model and are also running experiments on the LLaMA-70B model. Once those results are available, we will update the manuscript.
Table 1: Performance of LLaMA-13B Extension Methods
| Method | Perplexity (32k) | LongBench | Needle (64k) | RULER (32k) |
|---|---|---|---|---|
| Llama2-13b-hf (4k) | 2.90 | 33.84 | 0.17 | 86.35 |
| NTK-Frozen | 3.31 | 31.87 | 0.43 | 2.30 |
| Self-Extend | 2.65 | 33.69 | 0.54 | 30.23 |
| PI | 2.46 | 37.45 | 0.45 | 55.95 |
| NTK-32k | 2.44 | 38.41 | 0.82 | 58.38 |
| YaRN | 2.46 | 34.03 | 0.44 | 44.79 |
| CLEX | 2.43 | 35.89 | 0.79 | 52.76 |
Table 2: Detailed RULER Benchmark Results for LLaMA-13B at 32k
| Method | NIAH_S1 | NIAH_S2 | NIAH_S3 | NIAH_M1 | NIAH_M2 | NIAH_M3 | NIAH_MV | NIAH_MQ | VT | CWE | FWE | QA_1 | QA_2 | Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Llama2-13b-hf (4k) | 100 | 100 | 92 | 100 | 98 | 89 | 84.25 | 96.25 | 71.2 | 78.2 | 86.67 | 76 | 51 | 86.35 |
| NTK-Frozen | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.2 | 28.67 | 1 | 0 | 2.30 |
| Self-Extend | 65 | 63 | 76 | 15 | 1 | 0 | 25.25 | 11 | 32.8 | 17.6 | 51.33 | 14 | 21 | 30.23 |
| PI | 98 | 100 | 90 | 93 | 63 | 17 | 36 | 63 | 18.2 | 31.2 | 56 | 27 | 35 | 55.95 |
| NTK-32k | 100 | 97 | 81 | 81 | 36 | 12 | 63.75 | 74.5 | 29.6 | 48.1 | 69 | 29 | 38 | 58.38 |
| YaRN | 99 | 96 | 65 | 64 | 19 | 2 | 19.75 | 45 | 32.6 | 29.3 | 49.67 | 24 | 37 | 44.79 |
| CLEX | 98 | 98 | 96 | 71 | 20 | 2 | 45.25 | 61.5 | 24.2 | 40.2 | 74.67 | 19 | 36 | 52.76 |
We had the following new observations based on the added experiments and our previous analysis:
- Performance Trends: With the larger LLaMA-13B model, we observe that non-extension methods like NTK-Frozen and Self-Extend show improved performance on intrinsic tasks such as Needle-in-a-Haystack compared to their performance at smaller scales.
- Continual Fine-Tuning Methods: Despite the improvements in non-extension methods, continual fine-tuning methods still outperform them within their extension range.
- Perplexity and Downstream Tasks: The correlation between perplexity and downstream task performance remains consistent, reinforcing our original conclusions.
We will update the manuscript to include these new findings. The additional results provide deeper insights into how model scaling influences long-context capabilities and validate the robustness of our conclusions across different model sizes.
While the paper acknowledges limitations due to fixed hyperparameters, a more in-depth exploration of how different hyperparameter settings might affect the results would be beneficial.
Thank you for highlighting the importance of exploring the impact of hyperparameter settings on our results. We added the following empirical studies to address your concerns.
Continual fine-tuning methods
We agree that hyperparameters can significantly influence the performance of different context extension methods, particularly approximate attention methods. We swept standard training hyperparameters, such as batch size and learning rate. Results are shown below.
Table 1: Perplexity Results of LongLora on PG19 and Proof-pile
| Method | Batch Size | Learning Rate | 2k | 4k | 8k | 16k | 32k |
|---|---|---|---|---|---|---|---|
| PG19 | | | | | | | |
| Longlora | 32 | 2e-5 | 12.80 | 11.52 | 10.70 | 10.18 | 9.89 |
| Longlora | 8 | 2e-5 | 8.10 | 7.69 | 7.43 | 7.28 | 7.32 |
| Proof-pile | | | | | | | |
| Longlora | 32 | 2e-5 | 5.97 | 5.10 | 4.58 | 4.27 | 4.13 |
| Longlora | 8 | 2e-5 | 3.33 | 3.01 | 2.80 | 2.67 | 2.61 |
We made the following observations:
- High Sensitivity: Approximate attention methods like LongLoRA are highly sensitive to hyperparameter settings. Small changes in learning rate or training steps led to significant fluctuations in performance.
- Robustness to Hyperparameters: NTK and YaRN methods demonstrated robustness to changes in hyperparameter settings. Their performance remained stable across a wide range of configurations.
- Optimization Challenges: Training times were more predictable for NTK and YaRN and generally shorter because fewer hyperparameter adjustments were needed. For LongLoRA, achieving optimal performance requires careful tuning, which can be computationally intensive and time-consuming.
Inference Time Optimization
Additionally, we experimented with different hyperparameter settings during inference, such as scaling factors and method-specific hyperparameters. Specifically, for Self-Extend, we follow the empirical rule proposed by its authors for selecting the number of neighbor tokens (window size) and the group size, using 32k as the target length. Table 2 below shows the perplexity for different combinations of these hyperparameters.
Table 2: Perplexity Results of Self-Extend with different group and window size
| Method | Window Size | Group Size | 4k | 16k | 32k |
|---|---|---|---|---|---|
| Self-Extend | 512 | 32 | 7.74 | 7.64 | 7.67 |
| | 512 | 64 | 7.77 | 7.72 | 8.43 |
| | 512 | 128 | 7.81 | 7.84 | 9.86 |
| | 1024 | 32 | 7.67 | 7.44 | 7.42 |
| | 1024 | 64 | 7.67 | 7.46 | 7.47 |
| | 1024 | 128 | 7.68 | 7.48 | 7.51 |
| | 2048 | 32 | 7.69 | 7.48 | 11.18 |
| | 2048 | 64 | 7.70 | 7.50 | 8.08 |
| | 2048 | 128 | 7.70 | 7.54 | 10.25 |
Despite not requiring fine-tuning, we found that Self-Extend is sensitive to hyperparameters at inference time when the input context gets longer (32k): its performance varies significantly based on choices such as group size. Small changes in hyperparameters can lead to considerable fluctuations in model performance during inference. This sensitivity can affect the reliability of these methods in practical applications where consistent performance is necessary.
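For intuition about the two hyperparameters swept above, they enter Self-Extend through its relative-position remapping. A rough, simplified sketch of that idea (not the authors' implementation) is shown below.

```python
def self_extend_relative_position(distance, window_size=1024, group_size=64):
    """Simplified sketch of Self-Extend-style position remapping.

    Tokens within `window_size` keep exact relative positions; more distant
    tokens share coarser grouped positions, so the largest relative position
    the model sees stays close to its pre-trained range.
    """
    if distance < window_size:
        return distance
    return window_size + (distance - window_size) // group_size

# Example: at a 32k context, the farthest token maps to roughly
# 1024 + (32768 - 1024) // 64 = 1520, well inside Llama-2's 4k training range.
print(self_extend_relative_position(32768, window_size=1024, group_size=64))
```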
We will add these results to the analysis in the manuscript.
The generalization of findings to longer contexts beyond 32k (e.g., 128k and 1m) is not addressed, which is a significant limitation given the focus on long-context models.
We sincerely thank the reviewer for highlighting the importance of evaluating model performance on contexts longer than 32k tokens.
In our study, we define generalization as the model's ability to perform well across tasks that extend beyond the training context length. Specifically, we have evaluated our models on tasks where the input lengths exceed 32k tokens, such as NIAH, perplexity (PPL), and RULER, with sequences up to 64k tokens. We will revise our writing to highlight this in our manuscript. In our original paper, we found that NTK-Dynamic yields the best performance beyond 32k.
To further evaluate generalization, we evaluated sequences up to 128k tokens with NTK, which in our submission performs best in terms of generalization. Results are shown below.
Table 1: Generalization of NTK beyond 32k on RULER
| Method | 4k | 8k | 16k | 32k | 64k | 128k |
|---|---|---|---|---|---|---|
| NTK-32k | 86.58 | 77.75 | 70.01 | 59.42 | 46.26 | 29.91 |
| NTK-64k | 86.60 | 76.34 | 69.56 | 60.03 | 49.31 | 40.09 |
Table 2: Generalization of NTK beyond 32k on NIAH
| Method | Length | NIAH_S1 | NIAH_S2 | NIAH_S3 | NIAH_M1 | NIAH_M2 | NIAH_M3 | NIAH_MV | NIAH_MQ | VT | CWE | FWE | QA_1 | QA_2 | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| NTK-32k | 128k | 75.00 | 56.00 | 74.00 | 48.00 | 3.00 | 0.00 | 25.75 | 25.75 | 11.00 | 3.30 | 13.00 | 28.00 | 26.00 | 29.91 |
| NTK-64k | 128k | 85.00 | 88.00 | 91.00 | 67.00 | 8.00 | 0.00 | 44.50 | 47.25 | 3.40 | 0.70 | 34.33 | 24.00 | 28.00 | 40.09 |
We found that our conclusion holds at 64k tokens. However, when the context length increased to 128k tokens (4x the fine-tuned length of 32k), we noticed a decrease in performance. This indicates that even for the best-generalizing methods in our controlled setting, generalization becomes weaker when the context length is much larger than the fine-tuned length.
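As background on the NTK-32k / NTK-64k naming used above: these variants enlarge the RoPE base frequency for a target context length instead of interpolating positions directly. A minimal sketch of the commonly used scaling rules follows; the exact constants in our training runs are not reproduced here.

```python
def pi_scaled_position(pos, orig_ctx=4096, target_ctx=32768):
    """Position Interpolation (PI): squeeze positions back into the original range."""
    return pos * orig_ctx / target_ctx

def ntk_scaled_rope_base(base=10000.0, head_dim=128, orig_ctx=4096, target_ctx=32768):
    """NTK-aware scaling: a common rule is base' = base * s ** (d / (d - 2))."""
    s = target_ctx / orig_ctx
    return base * s ** (head_dim / (head_dim - 2))

print(ntk_scaled_rope_base())  # ~8.3e4 for a 4k -> 32k extension with head_dim 128
```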
The insights provided in this paper have been discussed in previous studies, and I did not gain any new takeaways from it.
We appreciate the reviewer's candid feedback regarding the novelty of our insights. While we acknowledge that some of our conclusions align with findings from previous studies, we believe that our work offers several distinct contributions that differentiate it from existing research:
- Controlled Experimental Protocol:
  - Consistent base models, datasets, and metrics ensure fair comparisons.
  - Eliminated confounding variables, enhancing credibility and reproducibility.
- Clear Mathematical Connections:
  - Explicit mathematical relationships between methods unify different approaches.
  - Provided deeper insights into their performance and interrelations.
- Comprehensive Evaluation:
  - Extensive tasks include intrinsic metrics (e.g., perplexity) and extrinsic benchmarks (e.g., LongBench, RULER).
  - Tested various model sizes (e.g., LLaMA-13B, adding LLaMA-70B) for stronger generalizability.
  - Standardized benchmarking framework enables meaningful comparisons and best practice identification.
While prior studies may have explored similar themes, our work differentiates itself through the combination of a controlled experimental setup, mathematical summarization contributions, and a comprehensive evaluation framework. We believe that quantifying these properties across multiple approaches and presenting them within a standardized benchmark adds significant value to the field.
We will revise the manuscript to more clearly highlight these unique contributions. By emphasizing these aspects, we aim to better convey the novelty and significance of our work to the reader.
What is the performance of new models, such as Qwen2.5, in this experiment?
We thank the reviewer for bringing up Qwen-2.5. This model has demonstrated outstanding performance across various benchmarks and is widely recognized as one of the top lightweight models in many scenarios.
Evaluation of Qwen-2.5-7B:
- Context Length Support: Qwen-2.5-7B supports an input context length of up to 128k tokens.
- Extension Method: While there is currently no technical report detailing their context extension method, based on the previous Qwen-2 report, we hypothesize that they utilize a context extension technique similar to NTK-RoPE.
- Experimental Results: We conducted experiments to evaluate Qwen-2.5-7B using our standardized benchmarks. The results are as follows:
Table 1: Performance of Qwen-2.5-7b on Long-Context Tasks
| Method | Perplexity (32k) | LongBench | Needle (64k) | RULER (32k) |
|---|---|---|---|---|
| Qwen-2.5-7b | 2.3154 | 45.01 | 0.871 | 85.21 |
Table 2: Performance and generalization of Qwen-2.5-7b on RULER
| Method | 4k | 8k | 16k | 32k | 64k | 128k |
|---|---|---|---|---|---|---|
| Qwen2.5 | 94.90 | 89.95 | 88.30 | 85.21 | 63.67 | 21.06 |
| Method | Length | NIAH_S1 | NIAH_S2 | NIAH_S3 | NIAH_M1 | NIAH_M2 | NIAH_M3 | NIAH_MV | NIAH_MQ | VT | CWE | FWE | QA_1 | QA_2 | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Qwen2.5 | 4k | 100 | 100 | 100 | 100 | 100 | 100 | 97.5 | 99.75 | 99.8 | 98.3 | 97.33 | 84 | 57 | 94.90 |
| | 8k | 100 | 100 | 100 | 100 | 100 | 98 | 94 | 99.75 | 95.4 | 86.2 | 83 | 64 | 49 | 89.95 |
| | 16k | 100 | 100 | 100 | 100 | 99 | 97 | 94 | 98.75 | 92.2 | 65 | 87 | 64 | 51 | 88.30 |
| | 32k | 100 | 100 | 100 | 99 | 96 | 92 | 93.25 | 97.5 | 88.6 | 58.7 | 85.67 | 56 | 41 | 85.21 |
| | 64k | 100 | 92 | 100 | 67 | 22 | 28 | 82.75 | 86.25 | 78.2 | 10.9 | 83.67 | 54 | 23 | 63.67 |
| | 128k | 98 | 29 | 37 | 18 | 4 | 1 | 26 | 15.75 | 10.4 | 1 | 1.67 | 10 | 22 | 21.06 |
We had the following observations:
- Superior Performance: Qwen-2.5-7B achieves the best performance on our long-context tasks compared to other models of similar size.
- Generalization Ability: The model demonstrates strong generalization to longer contexts beyond its training range, aligning with the trends observed in our study. We hypothesize that this is partially due to more and better continual fine-tuning data and a better training recipe.
In addition to adding the 13B model, we also added the 70B model results, shown below. We observed similar trends across the Phi, LLaMA-7B, and LLaMA-13B base models. This further shows that our analysis and findings can be applied to larger models.
Table 1: Performance of LLaMA-70B Extension Methods
| Method | Perplexity (32k) | LongBench | Needle (64k) | RULER (32k) |
|---|---|---|---|---|
| Llama2-70b-hf (4k) | 2.66 | 34.00 | 14.70 | 93.67 |
| NTK-Frozen | 3.25 | 32.40 | 30.90 | 11.39 |
| Self-Extend | 2.43 | 29.10 | 32.60 | 31.94 |
| PI | 2.26 | 42.44 | 49.80 | 77.98 |
| NTK-32k | 2.25 | 41.51 | 90.50 | 76.97 |
Table 2: Detailed RULER Benchmark Results for LLaMA-70B
| Method | NIAH_S1 | NIAH_S2 | NIAH_S3 | NIAH_M1 | NIAH_M2 | NIAH_M3 | NIAH_MV | NIAH_MQ | VT | CWE | FWE | QA_1 | QA_2 | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Llama2-70b-hf (4k) | 100.00 | 100.00 | 100.00 | 100.00 | 95.00 | 100.00 | 99.50 | 99.75 | 99.80 | 100.00 | 98.67 | 68.00 | 57.00 | 93.67 |
| NTK-Frozen | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 1.40 | 62.00 | 52.67 | 9.00 | 23.00 | 11.39 |
| Self-Extend | 24.00 | 50.00 | 27.00 | 1.00 | 0.00 | 0.00 | 24.50 | 12.00 | 54.60 | 75.40 | 88.67 | 25.00 | 33.00 | 31.94 |
| PI | 99.00 | 100.00 | 98.00 | 97.00 | 53.00 | 28.00 | 92.00 | 93.75 | 92.80 | 92.50 | 78.67 | 38.00 | 51.00 | 77.98 |
| NTK-32k | 100.00 | 100.00 | 92.00 | 87.00 | 44.00 | 19.00 | 90.25 | 96.00 | 93.20 | 97.50 | 93.67 | 37.00 | 51.00 | 76.97 |
Table 3 : Performance of LLaMA-7B, LLaMA-13B, LLaMA-70B Extension Methods
| Method | Perplexity (32k) | LongBench | Needle (64k) | RULER (32k) |
|---|---|---|---|---|
| Llama2-7b-hf (4k) | 3.04 | 32.92 | 8.40 | 80.94 |
| Llama2-13b-hf (4k) | 2.90 | 33.84 | 17.00 | 86.35 |
| Llama2-70b-hf (4k) | 2.66 | 34.00 | 14.70 | 93.67 |
| Llama2-7b-NTK-Frozen | 4.06 | 25.54 | 18.80 | 0.72 |
| Llama2-13b-NTK-Frozen | 3.31 | 31.87 | 43.00 | 2.30 |
| Llama2-70b-NTK-Frozen | 3.25 | 32.40 | 30.90 | 11.39 |
| Llama2-7b-Self-Extend | 2.75 | 33.62 | 25.80 | 29.50 |
| Llama2-13b-Self-Extend | 2.65 | 33.69 | 53.50 | 30.23 |
| Llama2-70b-Self-Extend | 2.43 | 29.10 | 32.60 | 31.94 |
| Llama2-7b-PI | 2.58 | 33.48 | 42.10 | 57.66 |
| Llama2-13b-PI | 2.46 | 37.45 | 45.00 | 55.95 |
| Llama2-70b-PI | 2.26 | 42.44 | 49.80 | 77.98 |
| Llama2-7b-NTK-32k | 2.54 | 35.32 | 83.70 | 59.42 |
| Llama2-13b-NTK-32k | 2.44 | 38.41 | 82.20 | 58.38 |
| Llama2-70b-NTK-32k | 2.25 | 41.51 | 90.50 | 76.97 |
| Llama2-7b-YaRN | 2.59 | 33.45 | 46.70 | 36.95 |
| Llama2-13b-YaRN | 2.46 | 34.03 | 44.20 | 44.79 |
| Llama2-7b-CLEX | 2.55 | 33.48 | 71.10 | 52.17 |
| Llama2-13b-CLEX | 2.43 | 35.89 | 78.90 | 52.76 |
Dear Reviewer 3BXk,
We would like to express our sincere gratitude for your thorough review of our manuscript and for your valuable insights. We have carefully considered your feedback and have made significant revisions to address your concerns.
In particular, we have:
- Expanded our experiments to include larger and more diverse models, such as the LLaMA-13B and Qwen-2.5-7B models, to strengthen the generalizability of our conclusions.
- Conducted a comprehensive hyperparameter sensitivity analysis, exploring how different hyperparameter settings affect the results during both training and inference phases. This includes additional experiments and detailed discussions in the revised manuscript.
- Evaluated the generalization behavior of our methods for contexts longer than 32k tokens, extending our experiments to include context lengths of 64k and 128k tokens. Our findings and conjectures regarding this are included in the updated paper.
- Clarified the novelty and contributions of our work, emphasizing the controlled experimental protocol, clear mathematical connections, and comprehensive evaluation across multiple dimensions.
We kindly ask if you could take a moment to review our updated manuscript. If there is anything further that you would like us to clarify or any additional feedback you wish to provide, please let us know. We are more than willing to provide any additional information or make further revisions as needed.
Thank you once again for your time and thoughtful feedback. Your insights have been instrumental in improving our paper.
Sincerely,
Authors of Submission 8416
Using consistent base models and extension data, the study yielded several insights into long-context behavior. First, it reaffirmed the critical role of perplexity as a general-purpose performance indicator. Second, current approximate attention methods systematically underperform in long-context tasks. Finally, it confirmed that exact fine-tuning based methods are generally effective within their extension range, whereas extrapolation remains challenging.
Strengths
- This work is the first to conduct a fair and comprehensive comparison of different long context extension methods, resulting in several useful conclusions.
Weaknesses
- Based on my experience, increasing the model size in long-context downstream tasks such as LongBench yields some interesting conclusions that are inconsistent with the 7B model. I hope you can add some simple experimental results from the 13B model to discuss this point.
- While enhancing the model's long-context capabilities, we also care about the impact of different methods on short-text performance and the extent of knowledge forgetting. I hope you can add some results from the common Open LLM Leaderboard to discuss this point.
Questions
please see weaknesses
We sincerely thank the reviewer for their insightful comments and valuable suggestions. We have addressed each of your concerns in two separate threads.
Based on my experience, increasing the model size in long-context downstream tasks such as LongBench yields some interesting conclusions that are inconsistent with the 7B model. I hope you can add some simple experimental results from the 13B model to discuss this point.
We have conducted additional experiments using the LLaMA-13B model and are also running experiments on the LLaMA-70B model. Once those results are available, we will update the manuscript.
Table 1: Performance of LLaMA-13B Extension Methods
| Method | Perplexity(32k) | LongBench | Needle(64k) | RULER(32k) |
|---|---|---|---|---|
| Llama2-13b-hf (4k) | 2.90 | 33.84 | 17.00 | 86.35 |
| NTK-Frozen | 3.31 | 31.87 | 43.00 | 2.30 |
| Self-Extend | 2.65 | 33.69 | 53.50 | 30.23 |
| PI | 2.46 | 37.45 | 45.00 | 55.95 |
| NTK-32k | 2.44 | 38.41 | 82.20 | 58.38 |
| YaRN | 2.46 | 34.03 | 44.20 | 44.79 |
| CLEX | 2.43 | 35.89 | 78.90 | 52.76 |
Table 2: Detailed RULER Benchmark Results for LLaMA-13B
| Method | NIAH_S1 | NIAH_S2 | NIAH_S3 | NIAH_M1 | NIAH_M2 | NIAH_M3 | NIAH_MV | NIAH_MQ | VT | CWE | FWE | QA_1 | QA_2 | Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Llama2-13b-hf (4k) | 100.00 | 100.00 | 92.00 | 100.00 | 98.00 | 89.00 | 84.25 | 96.25 | 71.20 | 78.20 | 86.67 | 76.00 | 51.00 | 86.35 |
| NTK-Frozen | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.20 | 28.67 | 1.00 | 0.00 | 2.30 |
| Self-Extend | 65.00 | 63.00 | 76.00 | 15.00 | 1.00 | 0.00 | 25.25 | 11.00 | 32.80 | 17.60 | 51.33 | 14.00 | 21.00 | 30.23 |
| PI | 98.00 | 100.00 | 90.00 | 93.00 | 63.00 | 17.00 | 36.00 | 63.00 | 18.20 | 31.20 | 56.00 | 27.00 | 35.00 | 55.95 |
| NTK-32k | 100.00 | 97.00 | 81.00 | 81.00 | 36.00 | 12.00 | 63.75 | 74.50 | 29.60 | 48.10 | 69.00 | 29.00 | 38.00 | 58.38 |
| YaRN | 99.00 | 96.00 | 65.00 | 64.00 | 19.00 | 2.00 | 19.75 | 45.00 | 32.60 | 29.30 | 49.67 | 24.00 | 37.00 | 44.79 |
| CLEX | 98.00 | 98.00 | 96.00 | 71.00 | 20.00 | 2.00 | 45.25 | 61.50 | 24.20 | 40.20 | 74.67 | 19.00 | 36.00 | 52.76 |
We had the following new observations based on the added experiments and our previous analysis:
- Performance Trends: With the larger LLaMA-13B model, we observe that non-extension methods like NTK-Frozen and Self-Extend show improved performance on intrinsic tasks such as Needle-in-a-Haystack compared to their performance at smaller scales.
- Continual Fine-Tuning Methods: Despite the improvements in non-extension methods, continual fine-tuning methods still outperform them within their extension range.
- Perplexity and Downstream Tasks: The correlation between perplexity and downstream task performance remains consistent, reinforcing our original conclusions.
We will update the manuscript to include these new findings. The additional results provide deeper insights into how model scaling influences long-context capabilities and validate the robustness of our conclusions across different model sizes.
While enhancing the model's long-context capabilities, we also care about the impact of different methods on short-text performance and the extent of knowledge forgetting. I hope you can add some results from the common Open LLM Leaderboard to discuss this point.
We appreciate your suggestion to assess the impact of long-context extensions on short-text performance and knowledge retention. To address this, we evaluated our models on several benchmarks from the Open LLM Leaderboard, focusing on tasks that measure short-text understanding and knowledge.
Table 3: Short-Text Task Performance
| Methods | ARC-c | ARC-e | Hellaswag | MMLU | TruthfulQA | WinoGrande | Average |
|---|---|---|---|---|---|---|---|
| Llama2-7b-base | 52.73 | 81.31 | 78.96 | 42.09 | 38.97 | 74.43 | 61.42 |
| LM-Infinite | 52.56 | 81.36 | 78.95 | 42.09 | 38.96 | 74.11 | 61.34 |
| Self-Extend | 52.56 | 81.31 | 78.94 | 42.07 | 38.97 | 74.43 | 61.38 |
| NTK-Frozen | 52.73 | 81.31 | 78.96 | 42.09 | 38.97 | 74.43 | 61.42 |
| PI | 51.11 | 81.14 | 77.44 | 37.19 | 38.03 | 71.74 | 59.44 |
| NTK-32k | 49.15 | 80.22 | 74.48 | 35.25 | 38.13 | 72.61 | 58.31 |
| NTK-64k | 46.08 | 78.32 | 70.68 | 34.27 | 39.08 | 70.24 | 56.45 |
| YaRN | 53.41 | 81.82 | 78.47 | 41.06 | 38.63 | 74.43 | 61.30 |
| CLEX | 50.60 | 81.27 | 76.06 | 37.54 | 36.10 | 64.72 | 57.72 |
| LongLora | 46.67 | 78.58 | 67.08 | 26.29 | 37.61 | 55.25 | 51.91 |
We had the following observations:
- Performance Degradation: Most long-context extension methods exhibit a slight decrease in performance on short-text tasks compared to the base model.
- Trade-Off Between Long and Short Contexts: The reduction in short-text performance is more pronounced in models using continuous fine-tuning methods. This indicates a potential trade-off between enhancing long-context capabilities and maintaining optimal performance on short-text tasks.
- Alignment with Original Findings: These results align with our observations in Figure 3, where we analyze the average negative log-likelihood across different context positions, suggesting that extending context length can impact performance at shorter contexts.
We will incorporate these findings into the revised manuscript, discussing the implications for knowledge retention and the balance between long-context capabilities and short-text performance.
The additional experiments provide valuable insights into how model scaling and context extension methods affect long and short-text tasks. We believe these enhancements strengthen our conclusions and contribute meaningfully to the field. Please let us know if there are any further concerns or suggestions.
In addition to adding the 13B model, we also added the 70B model results, shown below. We observed similar trends across the Phi, LLaMA-7B, and LLaMA-13B base models. This further shows that our analysis and findings can be applied to larger models.
Table 1: Performance of LLaMA-70B Extension Methods
| Method | Perplexity (32k) | LongBench | Needle (64k) | RULER (32k) |
|---|---|---|---|---|
| Llama2-70b-hf (4k) | 2.66 | 34.00 | 14.70 | 93.67 |
| NTK-Frozen | 3.25 | 32.40 | 30.90 | 11.39 |
| Self-Extend | 2.43 | 29.10 | 32.60 | 31.94 |
| PI | 2.26 | 42.44 | 49.80 | 77.98 |
| NTK-32k | 2.25 | 41.51 | 90.50 | 76.97 |
Table 2: Detailed RULER Benchmark Results for LLaMA-70B
| Method | NIAH_S1 | NIAH_S2 | NIAH_S3 | NIAH_M1 | NIAH_M2 | NIAH_M3 | NIAH_MV | NIAH_MQ | VT | CWE | FWE | QA_1 | QA_2 | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Llama2-70b-hf (4k) | 100.00 | 100.00 | 100.00 | 100.00 | 95.00 | 100.00 | 99.50 | 99.75 | 99.80 | 100.00 | 98.67 | 68.00 | 57.00 | 93.67 |
| NTK-Frozen | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 1.40 | 62.00 | 52.67 | 9.00 | 23.00 | 11.39 |
| Self-Extend | 24.00 | 50.00 | 27.00 | 1.00 | 0.00 | 0.00 | 24.50 | 12.00 | 54.60 | 75.40 | 88.67 | 25.00 | 33.00 | 31.94 |
| PI | 99.00 | 100.00 | 98.00 | 97.00 | 53.00 | 28.00 | 92.00 | 93.75 | 92.80 | 92.50 | 78.67 | 38.00 | 51.00 | 77.98 |
| NTK-32k | 100.00 | 100.00 | 92.00 | 87.00 | 44.00 | 19.00 | 90.25 | 96.00 | 93.20 | 97.50 | 93.67 | 37.00 | 51.00 | 76.97 |
Table 3 : Performance of LLaMA-7B, LLaMA-13B, LLaMA-70B Extension Methods
| Method | Perplexity (32k) | LongBench | Needle (64k) | RULER (32k) |
|---|---|---|---|---|
| Llama2-7b-hf (4k) | 3.04 | 32.92 | 8.40 | 80.94 |
| Llama2-13b-hf (4k) | 2.90 | 33.84 | 17.00 | 86.35 |
| Llama2-70b-hf (4k) | 2.66 | 34.00 | 14.70 | 93.67 |
| Llama2-7b-NTK-Frozen | 4.06 | 25.54 | 18.80 | 0.72 |
| Llama2-13b-NTK-Frozen | 3.31 | 31.87 | 43.00 | 2.30 |
| Llama2-70b-NTK-Frozen | 3.25 | 32.40 | 30.90 | 11.39 |
| Llama2-7b-Self-Extend | 2.75 | 33.62 | 25.80 | 29.50 |
| Llama2-13b-Self-Extend | 2.65 | 33.69 | 53.50 | 30.23 |
| Llama2-70b-Self-Extend | 2.43 | 29.10 | 32.60 | 31.94 |
| Llama2-7b-PI | 2.58 | 33.48 | 42.10 | 57.66 |
| Llama2-13b-PI | 2.46 | 37.45 | 45.00 | 55.95 |
| Llama2-70b-PI | 2.26 | 42.44 | 49.80 | 77.98 |
| Llama2-7b-NTK-32k | 2.54 | 35.32 | 83.70 | 59.42 |
| Llama2-13b-NTK-32k | 2.44 | 38.41 | 82.20 | 58.38 |
| Llama2-70b-NTK-32k | 2.25 | 41.51 | 90.50 | 76.97 |
| Llama2-7b-YaRN | 2.59 | 33.45 | 46.70 | 36.95 |
| Llama2-13b-YaRN | 2.46 | 34.03 | 44.20 | 44.79 |
| Llama2-7b-CLEX | 2.55 | 33.48 | 71.10 | 52.17 |
| Llama2-13b-CLEX | 2.43 | 35.89 | 78.90 | 52.76 |
Dear Reviewer ZKTb,
We would like to express our sincere gratitude for your thorough review of our manuscript and for your valuable insights. We have made significant revisions to our draft to address your concerns and suggestions.
We kindly ask if you could take a moment to review our updates. If there is anything further that you would like us to clarify or any additional experiments you believe would strengthen our work, please let us know. We are more than willing to provide any additional information or make further revisions as needed.
Thank you once again for your time and thoughtful feedback.
Sincerely,
Authors of Submission 8416
The paper evaluates several freeze/fine-tuning long-context methods on several public LLMs. Based on its experiments, it argues that (1) PPL is still an important metric in long-context scenarios, (2) "approximate attention" methods show poor performance (3) "exact fine-tuning based methods" are effective within trained range but cannot further process longer texts well.
Strengths
- The paper evaluates many methods under a controlled setting.
- The paper proposes several might-be-useful conclusions based on their experiments.
Weaknesses
- More evaluation of short texts is required to check whether these methods perform well in keeping the original performance. (for example, NTK by parts may behave better on this than NTK-rope.)
- The so-called "approximate attention" category includes many quite different methods and can be misleading. It is not clearly justified to group the streaming-LLM methods / longlora / landmark into one class. Also, since MinInference-style work shows promising performance now, the conclusion that 'approximate attention is poor' might be misleading.
- self-extend seems to be more like NTK-F than 'approximate attention'.
- The contribution of the work might be limited if the arguments were not sufficiently robust.
Questions
As the (1)-(3) in weakness.
The so-called "approximate attention" method includes many quite different methods and can be misleading. It isn't very clear to include the streaming-LLM methods / longlora / landmark into one class. MinInference-style work shows promising performance now, the conclusion that 'approximate attention is poor' might be misleading.
Thank you for pointing out that the term "approximate attention" in our manuscript encompasses a wide range of methods, which may lead to confusion. We agree that grouping streaming-LLM methods, LongLoRA, and Landmark attention under a single category is not sufficiently precise and could be misleading. In the revised manuscript, we will update the categorization of these methods to more accurately reflect their differences and unique characteristics.
Regarding MinInference-style work [1, 2, 3], we acknowledge that recent developments have shown promising performance in this area. We will include a discussion of these methods in our related work and limitations sections to provide a more comprehensive overview. Additionally, we would like to note that this line of research is orthogonal to our study and can be applied in conjunction with the approaches evaluated in this submission, whereas the main focus of this study is effectiveness, in line with recent long-context studies [4, 5].
We will also revisit our conclusion that "approximate attention is poor" to specify that this observation is specific to our experimental setting, particularly when applying approximate attention methods during fine-tuning. This clarification should prevent any potential misunderstandings about the general effectiveness of approximate attention methods.
self-extend seems to be more like NTK-F than 'approximate attention'.
Thank you for highlighting that the Self-Extend method is more akin to NTK-Frozen methods than to "approximate attention" methods. In the revision, we will reclassify Self-Extend under RoPE-based methods and clearly differentiate it from other approaches. This adjustment will improve the clarity of our methodology section and ensure that each method is appropriately categorized.
The contribution of the work might be limited if the arguments were not sufficiently robust.
We appreciate your concern regarding the robustness of our arguments. To ensure the validity of our conclusions, we have conducted experiments across different model sizes and families within a controlled setting. This comprehensive approach enhances our confidence in the robustness of our findings.
We recognize that some existing research presents slightly contradictory results using similar methods. We hypothesize that these discrepancies may stem from the use of unfair or biased base models, or from exhaustive hyperparameter searches that are not standardized across studies. In the revised manuscript, we will discuss these factors in more detail and outline how our methodology addresses them. This added context should strengthen the reliability of our conclusions and clarify the contributions of our work.
References
[1] Leviathan, Yaniv, Matan Kalman, and Yossi Matias. "Fast inference from transformers via speculative decoding." International Conference on Machine Learning. PMLR, 2023.
[2] Xia, Heming, et al. "Unlocking efficiency in large language model inference: A comprehensive survey of speculative decoding." arXiv preprint arXiv:2401.07851 (2024).
[3] Yang, Nan, et al. "Inference with reference: Lossless acceleration of large language models." arXiv preprint arXiv:2304.04487 (2023).
[4] Gao, Tianyu, et al. "How to train long-context language models (effectively)." arXiv preprint arXiv:2410.02660 (2024).
[5] Hsieh, Cheng-Ping, et al. "RULER: What's the Real Context Size of Your Long-Context Language Models?." arXiv preprint arXiv:2404.06654 (2024).
Dear Reviewer 7saq,
We would like to express our sincere gratitude for your thorough review of our manuscript and for your valuable insights. We have carefully considered your feedback and have made significant revisions to address your concerns.
In particular, we have:
- Conducted additional evaluations on short-text tasks to assess whether the methods maintain the original performance, as you suggested. This includes evaluating the impact on short-text performance and the extent of knowledge retention.
- Clarified the categorization of the "approximate attention" methods to avoid potential misunderstandings. We have reclassified certain methods and provided more detailed explanations to ensure clarity.
- Revised the classification of the Self-Extend method, recognizing that it is more akin to NTK-Frozen methods rather than "approximate attention".
- Strengthened the robustness of our arguments by conducting additional experiments across different model sizes and settings.
We kindly ask if you could take a moment to review our updated manuscript. If there is anything further that you would like us to clarify or any additional experiments you believe would strengthen our work, please let us know. We are more than willing to provide any additional information or make further revisions as needed.
Thank you once again for your time and thoughtful feedback. Your insights have been instrumental in improving our paper.
Sincerely,
Authors of Submission 8416
I've raised my rating to 6 since now I believe it meets ICLR standard.
Thank you for your kind words and for taking the time to review our detailed response. We are pleased that your concerns have been addressed. Thank you again for the time and effort you have dedicated to reviewing our paper and providing valuable feedback. We sincerely appreciate your recognition of our work and will carefully revise the paper based on your and other reviewers’ comments.
More evaluation of short texts is required to check whether these methods perform well in keeping the original performance. (for example, NTK by parts may behave better on this than NTK-rope.)
We thank the reviewer for the suggestion. A similar question was raised by Reviewer ZKTb. To address it, we evaluated our models on several benchmarks from the Open LLM Leaderboard, focusing on short tasks that measure short-text understanding and knowledge.
Table 3: Short-Text Task Performance
| Methods | ARC-c | ARC-e | Hellaswag | MMLU | TruthfulQA | WinoGrande | Average |
|---|---|---|---|---|---|---|---|
| Llama2-7b-base | 52.73 | 81.31 | 78.96 | 42.09 | 38.97 | 74.43 | 61.42 |
| LM-Infinite | 52.56 | 81.36 | 78.95 | 42.09 | 38.96 | 74.11 | 61.34 |
| Self-Extend | 52.56 | 81.31 | 78.94 | 42.07 | 38.97 | 74.43 | 61.38 |
| NTK-Frozen | 52.73 | 81.31 | 78.96 | 42.09 | 38.97 | 74.43 | 61.42 |
| PI | 51.11 | 81.14 | 77.44 | 37.19 | 38.03 | 71.74 | 59.44 |
| NTK-32k | 49.15 | 80.22 | 74.48 | 35.25 | 38.13 | 72.61 | 58.31 |
| NTK-64k | 46.08 | 78.32 | 70.68 | 34.27 | 39.08 | 70.24 | 56.45 |
| YaRN | 53.41 | 81.82 | 78.47 | 41.06 | 38.63 | 74.43 | 61.30 |
| CLEX | 50.60 | 81.27 | 76.06 | 37.54 | 36.10 | 64.72 | 57.72 |
| LongLora | 46.67 | 78.58 | 67.08 | 26.29 | 37.61 | 55.25 | 51.91 |
We had the following observations:
- Performance Degradation: Most long-context extension methods exhibit a slight decrease in performance on short-text tasks compared to the base model. Our discovery, aligning with what you suggested, shows that NTK-Frozen demonstrates better performance on short-text tasks compared to methods like NTK-RoPE.
- Trade-Off Between Long and Short Contexts: The reduction in short-text performance is more pronounced in models using continuous fine-tuning methods. This indicates a potential trade-off between enhancing long-context capabilities and maintaining optimal performance on short-text tasks.
- Alignment with Original Findings: These results align with our observations in Figure 3, where we analyze the average negative log-likelihood across different context positions, suggesting that extending context length can impact performance at shorter contexts. We will incorporate these findings into the revised manuscript, discussing the implications for knowledge retention and the balance between long-context capabilities and short-text performance.
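For reproducibility, short-text scores like those above can be regenerated with the EleutherAI lm-evaluation-harness; a hedged sketch using its v0.4-style Python API follows, where the checkpoint path and task list are assumptions rather than our exact configuration.

```python
import lm_eval

# Illustrative call; "pretrained=..." would point at each extended checkpoint in turn.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=meta-llama/Llama-2-7b-hf,dtype=float16",
    tasks=["arc_challenge", "arc_easy", "hellaswag", "mmlu", "truthfulqa_mc2", "winogrande"],
    batch_size=8,
)
for task, metrics in results["results"].items():
    print(task, metrics)
```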
Dear Reviewers,
We would like to express our sincere gratitude for the constructive feedback on Submission 8416. We have carefully addressed each of your concerns to improve the quality and robustness of our paper; details can be found in our responses to your individual reviews.
Below is a summary of the key actions we have undertaken in response to the reviewers' comments:
- Expansion to Larger and More Diverse Models (Reviewer ZKTb, Reviewer 3BXk)
  - In response to concerns about the limited number of base models, we have conducted additional experiments using LLaMA-13B, LLaMA-70B, and Qwen-2.5-7B as base models. These evaluations confirm that our findings are consistent across larger and more diverse LLMs, strengthening the generalizability of our conclusions.
- Evaluation of Short-Text Performance (Reviewer ZKTb, Reviewer 7saq)
  - To assess the impact of long-context extension methods on short-text performance, we evaluated our models on several benchmarks from the Open LLM Leaderboard. The results indicate that most long-context methods exhibit slight performance degradation on short-text tasks, whereas approximate attention methods are more robust on these tasks.
- Inference Speed Trade-Off (Reviewer j6WF)
  - To address the concern about inference speed trade-offs, we compared inference speed under controlled conditions on the same hardware setup. Our experiments show that approximate attention methods achieve a speedup of approximately 1.5x to 2x over the LLaMA baseline at short context lengths; however, this margin shrinks as the context length grows.
- Hyperparameter Sensitivity Analysis (Reviewer 3BXk)
  - We conducted a comprehensive hyperparameter sweep to analyze the effects on both training efficiency and inference performance. Our findings reveal that approximate attention methods, such as LongLoRA, are highly sensitive to hyperparameter changes, requiring meticulous tuning. In contrast, continual fine-tuning methods like NTK and YaRN demonstrated robustness across various hyperparameter configurations, resulting in more predictable training times and consistent inference performance.
- Generalization to Longer Contexts Beyond 32k (Reviewer 3BXk)
  - Although our initial study focused on context lengths up to 32k tokens in most cases (64k for two particular tasks), we have extended our evaluation to 64k and 128k tokens. The results show that exact fine-tuning methods perform well within their trained context range but encounter difficulties as the context length grows beyond 32k tokens. Approximate attention methods show potential for handling longer contexts, but often at the expense of reduced accuracy.
- Clarification and Reclassification of Methods (Reviewer 7saq, Reviewer j6WF)
  - Method Categorization: We will refine the categorization of context extension methods to avoid ambiguity. Specifically, we will reclassify methods such as Self-Extend under RoPE-based methods instead of grouping them with approximate attention methods (a purely illustrative sketch of the RoPE frequency adjustments behind this family is given after this list). Additionally, we will incorporate a discussion of recent developments in MinInference-style methods, acknowledging their promising performance and clarifying that they are orthogonal to our study.
  - PPL Correlation: We will update the manuscript to include a rank-based correlation measure between perplexity and downstream task performance (see the rank-correlation sketch after this list), addressing the reviewer's concerns and strengthening our claims.
- LM-Infinite Results (Reviewer j6WF)
  - We evaluated LM-Infinite on the Passkey Retrieval task. While it performs well at shorter context lengths, its performance degrades significantly once the context length exceeds its trained length. This aligns with findings from concurrent studies (e.g., InfLLM), suggesting that LM-Infinite attends effectively to nearby tokens but struggles with longer contexts.
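To make the RoPE-based category concrete, below is a minimal, purely illustrative Python sketch of the frequency adjustments that distinguish Position Interpolation from NTK-style scaling; the function names, the 10000 rotary base, and the scale factor of 8 are assumptions for the example, not our implementation.

```python
import torch

def rope_inv_freq(head_dim: int, base: float = 10000.0) -> torch.Tensor:
    # Standard RoPE inverse frequencies for one attention head.
    return 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))

def pi_positions(seq_len: int, scale: float) -> torch.Tensor:
    # Position Interpolation: compress position indices by the extension factor.
    return torch.arange(seq_len).float() / scale

def ntk_inv_freq(head_dim: int, scale: float, base: float = 10000.0) -> torch.Tensor:
    # NTK-aware scaling: enlarge the rotary base so that low-frequency
    # components are stretched while high-frequency (local) ones barely change.
    ntk_base = base * scale ** (head_dim / (head_dim - 2))
    return rope_inv_freq(head_dim, ntk_base)

# Example: extending a 4k-context model to 32k corresponds to scale = 8.
pi_angles  = torch.outer(pi_positions(32_768, 8.0), rope_inv_freq(128))
ntk_angles = torch.outer(torch.arange(32_768).float(), ntk_inv_freq(128, 8.0))
```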
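Likewise, the rank-based measure mentioned under "PPL Correlation" can be computed with Kendall's tau, as in the sketch below; the perplexity and accuracy values are placeholders, not our measured results.

```python
# Placeholder per-method perplexities and downstream scores, used only to
# illustrate the rank-correlation computation; these are not our results.
from scipy.stats import kendalltau

perplexity = [5.2, 5.4, 5.9, 6.8, 8.1, 10.3]
downstream_accuracy = [62.0, 61.5, 58.9, 55.1, 50.2, 41.7]

# Negate perplexity so that "better" points in the same direction for both metrics.
tau, p_value = kendalltau([-p for p in perplexity], downstream_accuracy)
print(f"Kendall's tau = {tau:.3f} (p = {p_value:.3g})")
```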
By implementing these revisions, we believe our submission offers a more thorough and balanced analysis of long-context extension methods in a controlled setting. The enhancements address the initial concerns raised by the reviewers and contribute valuable insights to the field.
Thank you for taking the time to read our rebuttal. We are confident that the improvements made have significantly strengthened the quality and impact of our work.
Best regards,
Authors of submission 8416
Dear Reviewers and Area Chair,
Thank you for your thoughtful feedback on our manuscript. We have uploaded a revised version that incorporates your valuable suggestions. The revision includes an expanded analysis with additional base models, short-context task results, a detailed discussion of generalization capabilities, a revised analysis of perplexity and downstream tasks, a hyperparameter search, and an expanded limitations section. For your convenience, all modifications are highlighted in green in the manuscript.
We welcome your further comments and suggestions to strengthen our work. The manuscript remains open for additional improvements based on your expertise and insights.
Thank you for your continued guidance in improving this research manuscript.
Best regards,
Authors of Submission 8416
The paper presents a controlled study evaluating various methods for extending the context length that Large Language Models (LLMs) can handle during inference. The authors argue that existing evaluations are often inconsistent due to variations in base models, datasets, and methodologies. The paper's primary strength is its attempt to conduct a controlled comparison of different long-context extension methods. While the paper addresses generalization to some extent, its exploration of very long contexts (e.g., 128k tokens and beyond) is limited, which is a significant aspect of long-context modeling. As Reviewer 3BXk notes, some of the findings have been touched upon in previous work; while the controlled study adds value, the paper could have done a better job of highlighting truly novel insights. Reviewer j6WF points out that some conclusions, particularly regarding perplexity, might be overstated or need a more nuanced presentation. I have also listed some issues raised by the reviewers, which were discussed during the rebuttal phase. While the paper has merit as a controlled experimental study, the reviewers raised concerns regarding the novelty of the findings, the generalizability of the conclusions, and certain aspects of the presentation and analysis. The authors are encouraged to prepare a stronger revision by taking the reviewers' comments into account.
Additional comments from the reviewer discussion
The initial submission was critiqued for its limited set of base models (only three). However, this was partially addressed in the rebuttal by adding LLaMA-13B, LLaMA-70B, and Qwen-7B.
There's a notable discrepancy between the paper's results on LM-Infinite and the original LM-Infinite paper's reported performance. While the authors acknowledge this and point to a concurrent study with similar findings, a more thorough investigation might be warranted.
The claim of a strong correlation between perplexity and downstream performance is somewhat weakened when considering only the most advanced methods (perplexity below 6). While this was addressed in the rebuttal with a Kendall's tau analysis, the tone of the claim might still need adjustment.
Reject