A Controlled Study on Long Context Extension and Generalization in LLMs
Using a controlled protocol to systematically study long context extension methods
Abstract
Reviews and Discussion
This paper presents a comprehensive empirical study on long-context extension methods for language models, including 4 methods that adapt the RoPE positional embedding and 4 methods that approximate the attention operations. The paper ensures a fair comparison of these methods and concludes with three key takeaways: (1) contrary to claims in previous works, perplexity and downstream task performance are highly correlated; (2) RoPE-adaptation methods generally work better than attention-approximation methods; (3) Dynamic NTK works best among the compared methods.
Strengths
- The topic of this paper is timely and important. While there is a lot of recent work on long-context extension, the experiments were conducted with different base models and different training data; hence, there is a lack of fair comparison. This work aims to answer this important question.
- The background section (section 3) is mostly well-written and provides a unified view of these various context extension methods.
- The authors conducted extensive experiments and ensured the comparison is conducted in a fair manner.
Weaknesses
- I'm not fully convinced by some of the experiment results and takeaways. See questions below.
- The paper mentions lack of "quantitative rankings of different methodologies" as a motivation of this work. While this paper finds NTK-Dynamic to work best in general, it does not provide a full ranking.
- The presentation and organization of the paper can be improved in various aspects, e.g., using more visualizations instead of tables to summarize the results and highlight main findings; having a table to summarize the key characteristics of the 8 compared methods.
Questions
- Figure 1. I am surprised by the LM-Infinite result, as the authors of LM-Infinite reported a pass rate of about 80% on the Passkey Retrieval task, which is very similar to the needle-in-the-haystack evaluation conducted in Figure 1. However, LM-Infinite fails most cases on NIAH as reported in Figure 1. Could you help explain what potential aspects lead to this gap?
- Line 484 "Perplexity and downstream tasks". I'm not fully convinced by this argument and Figure 4. It seems that the linear trend in Figure 4 is highly dependent on LongLora and Landmark, in a sense that the linear trend is likely to disappear without these two compared methods.
- It seems that when perplexity is below a certain level, e.g., below 6, the perplexity differences between models are small but the downstream performance differences are large. Thus I'm concerned with the claim that perplexity is a "general-purpose performance indicator" as suggested in the abstract; from my understanding, it is only indicative within a certain region.
- Could you please consider adding metrics such as rank correlation to further strengthen the claim?
- Line 505 "Context extension hurts in the short term and gains in the long term". I'm not quite sure what this title means here. What are "short term" and "long term" referring to here?
- Is there any inference speed trade-off between these compared methods? e.g., Are attention approximation methods faster? By how much? For some applications the inference speed may be critical. Providing such information will help users make informed decisions.
Others:
- Line 194: What is the CLEX method here? Currently there is little introduction of it.
- Line 221: "key matrix", are you referring to "key and query matrices"?
Figure 1. I am surprised by the LM-Infinite result, as the authors of LM-Infinite reported a pass rate of about 80% on the Passkey Retrieval task, which is very similar to the needle-in-the-haystack evaluation conducted in Figure 1. However, LM-Infinite fails most cases on NIAH as reported in Figure 1. Could you help explain what potential aspects lead to this gap?
We thank the reviewer for the question. We ran the LM-Infinite passkey retrieval task and found that LM-Infinite indeed achieves good results within the original context length; however, beyond that length, the results degrade considerably. Results are shown in Table 1.
In concurrent work, InfLLM [1], the authors made a similar observation: as the context length grows, the accuracy of LM-Infinite on their passkey retrieval task drops considerably with their Mistral-7B base model. We hypothesize that LM-Infinite is good at attending to closer tokens within the window, hence the decrease as the length grows. However, LM-Infinite is very good at preserving short-context ability.
Table 1: LM-Infinite passkey retrieval results
| Token Len | 2k | 4k | 8k | 16k | 32k |
|---|---|---|---|---|---|
| llama-2-7b-hf (4k) | 1.0 | 0.93 | 0.39 | 0.22 | 0.08 |
Table 2: Results taken from InfLLM[1]
| Token Len | 32k | 64k | 128k |
|---|---|---|---|
| Mistral-7B-Instruct-v0.2 (32k) [1] | 0.30 | 0.17 | 0.00 |
Reference:
Xiao, Chaojun, et al. "Infllm: Training-free long-context extrapolation for llms with an efficient context memory." The Thirty-eighth Annual Conference on Neural Information Processing Systems. 2024.
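For readers unfamiliar with the passkey retrieval setup referenced above, it hides a random key inside long filler text and checks whether the model can recall it. A minimal sketch of prompt construction follows; the filler wording, lengths, and question format are illustrative assumptions, not the exact setup used in LM-Infinite, InfLLM, or our study.

```python
import random

def make_passkey_prompt(n_filler=400, passkey=None):
    """Build an illustrative passkey-retrieval prompt and return (prompt, key)."""
    passkey = passkey if passkey is not None else random.randint(10000, 99999)
    filler = "The grass is green. The sky is blue. The sun is yellow. Here we go. " * 5
    lines = [filler] * n_filler
    # Bury the key roughly in the middle of the context.
    lines.insert(n_filler // 2, f"The pass key is {passkey}. Remember it. {passkey} is the pass key. ")
    prompt = "".join(lines) + "\nWhat is the pass key?"
    return prompt, passkey

prompt, key = make_passkey_prompt()
# A model "passes" a sample if str(key) appears in its generated answer.
```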
Thank you for clarifying!
- "within the length LM-Infinite achieves create results" I'm not sure what this sentence mean here.
- The LM-Infinite paper reported ~80% accuracy at 16k input length; The results you shared suggests a pass rate of 22% at 16k. This is still surprising to me. I think it is necessary to further discuss this huge gap.
- However this is not the main focus of the paper so I'm not going to question this further.
We thank the reviewer for the comments.
We are sorry about the typo in our response. We intended to say "within the length LM-Infinite achieves good results".
We will cite more papers about LM-Infinite results in our paper to further explain the performance gap.
Line 194: What is the CLEX method here? Currently there is little introduction of it.
We thank the reviewer for the suggestion. We have added an explanation of the CLEX method. CLEX achieves near-SOTA results and is considered a relatively recent baseline.
Line 221: "key matrix" refers to "key and query matrices"?
We thank the reviewer for pointing this out, and we will revise as suggested.
Line 505 "Context extension hurts in the short term and gains in the long term". I'm not quite sure what this title means here. What are "short term" and "long term" referring to here?
We thank the reviewer for the question. Both "short term" and "long term" here refer to context length. We will change the wording to "short context" and "long context" to avoid any confusion.
Figure 3, where we analyze the average negative log-likelihood across different context positions, suggests that extending context length can impact performance at shorter contexts. To further validate our hypothesis, we re-evaluated the models on short tasks from the Open LLM Leaderboard.
Table 3: Short-Text Task Performance
| Methods | ARC-c | ARC-e | Hellaswag | MMLU | TruthfulQA | WinoGrande | Average |
|---|---|---|---|---|---|---|---|
| Llama2-7b-base | 0.5273 | 0.8131 | 0.7896 | 0.4209 | 0.3897 | 0.7443 | 0.6142 |
| LM-Infinite | 0.5256 | 0.8136 | 0.7895 | 0.4209 | 0.3896 | 0.7411 | 0.6134 |
| Self-Extend | 0.5256 | 0.8131 | 0.7894 | 0.4207 | 0.3897 | 0.7443 | 0.6138 |
| NTK-Frozen-hf | 0.5273 | 0.8131 | 0.7896 | 0.4209 | 0.3897 | 0.7443 | 0.6142 |
| PI | 0.5111 | 0.8114 | 0.7744 | 0.3719 | 0.3803 | 0.7174 | 0.5944 |
| NTK-32k | 0.4915 | 0.8022 | 0.7448 | 0.3525 | 0.3813 | 0.7261 | 0.5831 |
| NTK-64k | 0.4608 | 0.7832 | 0.7068 | 0.3427 | 0.3908 | 0.7024 | 0.5645 |
| YaRN | 0.5341 | 0.8182 | 0.7847 | 0.4106 | 0.3863 | 0.7443 | 0.6130 |
| CLEX | 0.5060 | 0.8127 | 0.7606 | 0.3754 | 0.3610 | 0.6472 | 0.5772 |
| LongLora | 0.4667 | 0.7858 | 0.6708 | 0.2629 | 0.3761 | 0.5525 | 0.5191 |
We had the following observations:
- Performance Degradation: Most long-context extension methods exhibit a slight decrease in performance on short-text tasks compared to the base model. In particular, NTK-Frozen maintains better performance on short-text tasks than fine-tuned methods like NTK-RoPE.
- Trade-Off Between Long and Short Contexts: The reduction in short-text performance is more pronounced in models using continuous fine-tuning methods. This indicates a potential trade-off between enhancing long-context capabilities and maintaining optimal performance on short-text tasks.
- Alignment with Original Findings: These results align with our observations in Figure 3, where we analyze the average negative log-likelihood across different context positions, suggesting that extending context length can impact performance at shorter contexts.
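To make the Figure 3-style analysis concrete, here is a minimal sketch of how per-position negative log-likelihood can be computed for a causal LM; this is an illustrative reconstruction, not the exact evaluation script used in the paper.

```python
import torch

def nll_by_position(model, input_ids):
    """Negative log-likelihood of each token given its prefix.

    Averaging these values per position (or per position bucket) across many
    documents yields the kind of curve discussed for Figure 3.
    """
    with torch.no_grad():
        logits = model(input_ids).logits                     # [1, seq_len, vocab]
    log_probs = torch.log_softmax(logits[:, :-1].float(), dim=-1)
    targets = input_ids[:, 1:]
    nll = -log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)  # [1, seq_len - 1]
    return nll[0]                                            # NLL at positions 1 .. seq_len - 1
```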
The presentation and organization of the paper can be improved in various aspects, e.g., using more visualizations instead of tables to summarize the results and highlight main findings; having a table to summarize the key characteristics of the 8 compared methods.
We thank the reviewer for the valuable feedback on improving our paper’s presentation and organization. Currently, we have summarized our main findings in Table 1 and included a heatmap and three plots to demonstrate our results.
To enhance clarity, we will add additional visualizations that summarize perplexity trends and final results. Additionally, we will include a new table that outlines the key characteristics of the eight compared methods. These enhancements aim to better highlight our main findings and provide a comprehensive overview of each method’s features.
Thank you for your insightful suggestions.
Is there any inference speed trade-off between these compared methods? e.g., Are attention approximation methods faster? By how much? For some applications the inference speed may be critical. Providing such information will help users make informed decisions.
We thank the reviewer for raising this insightful question. In our experimental study, our primary focus was on evaluating the effectiveness of each method, which aligns with the approach taken in much of the recent literature.
That said, we also conducted inference speed comparisons under controlled conditions using the same hardware setup. As shown in Table 1, approximate attention methods are indeed faster, achieving a speedup of approximately 1.5x to 2x over LLaMA when the context length is short; however, as the context length grows, the margin is no longer significant. We hypothesize that the discrepancy between theoretical FLOPs-based comparisons and the observed speedup arises from differences in hardware characteristics and the CUDA implementations of the respective methods.
Table 1: Efficiency analysis of prefill stage time cost, decoding speed, and memory usage
The prefill time cost represents the time required to generate the first token. The decoding speed (seconds per token) is averaged over 100 token inferences at each sequence length. Memory consumption corresponds to the peak GPU memory usage during inference. All methods, except for LM-Infinite and Landmark, utilize Flash-Attention 2 for enhanced computational efficiency.
| Method | 4k | 8k | 16k | 32k |
|---|---|---|---|---|
| | Prefill (s) / Decode (s) / Mem (GB) | Prefill (s) / Decode (s) / Mem (GB) | Prefill (s) / Decode (s) / Mem (GB) | Prefill (s) / Decode (s) / Mem (GB) |
| Llama2-7b | 1.15 / 0.03 / 17.13 | 1.51 / 0.06 / 21.61 | 2.41 / 0.11 / 30.59 | 4.63 / 0.21 / 48.55 |
| NTK-Frozen | 1.16 / 0.04 / 17.13 | 1.56 / 0.05 / 21.61 | 2.39 / 0.06 / 30.59 | 4.69 / 0.09 / 48.55 |
| PI | 1.15 / 0.03 / 22.05 | 1.54 / 0.03 / 26.54 | 2.43 / 0.05 / 35.51 | 4.74 / 0.08 / 53.47 |
| NTK-32k | 1.17 / 0.04 / 17.11 | 1.56 / 0.04 / 21.60 | 2.42 / 0.06 / 30.58 | 4.75 / 0.09 / 48.53 |
| YaRN | 1.23 / 0.03 / 18.05 | 1.53 / 0.03 / 22.54 | 2.43 / 0.05 / 31.51 | 4.80 / 0.08 / 49.47 |
| CLEX | 1.16 / 0.05 / 17.16 | 6.99 / 0.07 / 21.74 | 7.68 / 0.11 / 30.92 | 10.06 / 0.18 / 49.28 |
| LM-Infinite | 1.56 / 0.05 / 17.23 | 3.34 / 0.07 / 25.47 | 5.82 / 0.11 / 38.60 | 11.58 / 0.18 / 65.61 |
| Self-Extend | 1.24 / 0.05 / 17.23 | 1.63 / 0.07 / 21.81 | 2.63 / 0.13 / 30.98 | 4.97 / 0.22 / 49.32 |
| LongLora | 1.16 / 0.05 / 17.16 | 1.65 / 0.05 / 21.65 | 2.60 / 0.05 / 30.62 | 5.07 / 0.08 / 48.58 |
| Landmark | 8.62 / 0.08 / 18.77 | 17.65 / 0.08 / 22.97 | 36.47 / 0.09 / 31.22 | 77.77 / 0.09 / 47.74 |
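For transparency, the measurement protocol can be sketched roughly as follows; the checkpoint name, input file, and exact synchronization details are illustrative assumptions rather than the authors' actual harness.

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "meta-llama/Llama-2-7b-hf"  # illustrative checkpoint
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.float16, device_map="cuda")

# Truncate an (assumed) long document to the target sequence length, e.g. 32k tokens.
ids = tok(open("long_document.txt").read(), return_tensors="pt").input_ids[:, :32768].to("cuda")

torch.cuda.reset_peak_memory_stats()
torch.cuda.synchronize(); t0 = time.time()
out = model(ids, use_cache=True)                          # prefill: time to first token
torch.cuda.synchronize(); prefill_s = time.time() - t0

past, next_id = out.past_key_values, out.logits[:, -1:].argmax(-1)
torch.cuda.synchronize(); t0 = time.time()
for _ in range(100):                                      # decode speed averaged over 100 tokens
    out = model(next_id, past_key_values=past, use_cache=True)
    past, next_id = out.past_key_values, out.logits[:, -1:].argmax(-1)
torch.cuda.synchronize()
decode_s_per_token = (time.time() - t0) / 100
peak_mem_gb = torch.cuda.max_memory_allocated() / 2**30
print(prefill_s, decode_s_per_token, peak_mem_gb)
```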
Thanks for sharing these results! In future versions of the paper, it would be nice to report them along with the performance metrics (in Table 1 and 2) into a 2D figure, with performance on the Y-axis and speed on the X-axis.
Line 484 "Perplexity and downstream tasks". This argument and Figure 4 do not fully convince me. It seems that the linear trend in Figure 4 is highly dependent on LongLora and Landmark, in the sense that the linear trend is likely to disappear without these two compared methods. It seems that when perplexity is below a certain level, e.g., below 6, the perplexity differences between models are small but the downstream performance differences are large. Thus I'm concerned with the claim that perplexity is a "general-purpose performance indicator" as suggested in the abstract, it can only indicate well within a certain region from my understanding. Please consider adding metrics such as rank correlation to strengthen the claim further.
We thank the reviewer for the question. In our revision, we will make this point clear. Here are the updated results for the correlation using a non-parametric measure, Kendall's tau.
Table 1: Kendall correlation of downstream task performance and PPL
| Task | Kendall's Tau | p-value | Interpretation |
|---|---|---|---|
| Needle | -0.7191 | 0.0041 | Strong negative correlation; statistically significant (p < 0.01). |
| Mshots | -0.4944 | 0.0482 | Moderate negative correlation; borderline significant (p ≈ 0.05). |
| LongB | -0.6136 | 0.0149 | Strong negative correlation; statistically significant (p < 0.05). |
| RULER | -0.7191 | 0.0041 | Strong negative correlation; statistically significant (p < 0.01). |
Key Findings:
- Consistency Across Tasks:
  - The results show a strong and statistically significant negative correlation between PPL and downstream performance for most tasks.
  - This supports the claim that lower PPL values are generally associated with better downstream task performance.
- Task-Specific Observations:
  - The strongest correlations are observed for Needle and RULER, where Kendall's tau indicates a robust alignment between PPL and task performance rankings.
  - For Mshots, the correlation is moderate and statistically weaker, suggesting that PPL's predictive ability may vary slightly depending on the task.
- Impact of Perplexity Range:
  - Even when perplexity values are close (e.g., below 6), PPL rankings remain a reliable indicator of downstream performance. However, the narrower range may amplify the observed performance differences, highlighting the need for nuanced interpretation.
We will add these findings to our revision.
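For reference, the rank correlation above can be reproduced with a few lines of SciPy given per-method perplexities and task scores; the numbers below are illustrative placeholders, not the values from our tables.

```python
from scipy.stats import kendalltau

# Illustrative placeholder values: per-method 32k perplexity and a downstream score.
ppl    = [2.5, 2.6, 2.6, 2.8, 4.1, 6.9]
scores = [59.0, 56.0, 37.0, 30.0, 1.0, 12.0]

tau, p_value = kendalltau(ppl, scores)
print(f"Kendall's tau = {tau:.4f}, p = {p_value:.4f}")  # negative tau: lower PPL, higher score
```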
Thanks for adding the Kendall-Tau metric and providing these new discussions. I'm afraid I'm still concerned with the argument that there is "a strong correlation between perplexity and downstream task performance." Indeed, a strong correlation is present in the current setting when LongLora and Landmark are included. But this conclusion may not be general enough, especially if we focus on the 5 strongest methods that achieve ppl below 6 in Figure 4. To address this concern, could you please consider repeating the analysis in Figure 4 with LongLora and Landmark excluded? Alternatively, you might consider revising the wording of this conclusion and presenting it with less emphasis in the paper?
Dear authors and area chair,
Given the new results and discussion posted by the authors, I have raised my rating to 6. My concerns about the LM-infinite comparison and inference speed trade-off are resolved.
We still have some pending discussion regarding whether there is a strong correlation between ppl and downstream performance. While the conclusion holds when considering all long-context extension methods, the correlation appears weaker when focusing on the strongest or most advanced long-context extension methods. I'm concerned with the broader applicability of the conclusion.
Reviewer j6WF
We thank Reviewer j6WF for the suggestion to further improve the paper.
We have revised our results according to the suggestion of Reviewer j6WF and softened the tone of the findings presented in our paper. We specifically qualify our finding with "to some extent" and, as Reviewer j6WF suggested, note this limitation in our limitations section.
For example, we now write the introduction paragraph as follows:
First, while there have been suggestions that we need new ways to measure performance, our findings show that perplexity does align with how well models perform to some extent on various tasks in our controlled studies. Though some newer attention methods don't show this pattern as clearly, we generally found that when models got better at predicting text, they also got better at most other tasks we tested them on.
We rewrote our discovery paragraph as follows:
While prior work [1, 2] suggests that perplexity may not reliably predict long-range task performance, our analysis in Figure 2 reveals that, to some extent, perplexity might be reliable. We observe a general correlation between perplexity and model performance across tasks. However, we also observed that approximate attention methods, including LongLora and Landmark on RULER, show minor deviations but maintain a roughly linear relationship. We hypothesize that this apparent discrepancy with previous findings may stem from their less controlled experimental conditions and noisier datasets.
References
[1] Sun, Simeng, et al. "Do long-range language models actually use long-range context?." arXiv preprint arXiv:2109.09115 (2021).
[2] An, Chenxin, et al. "L-eval: Instituting standardized evaluation for long context language models." arXiv preprint arXiv:2307.11088 (2023).
This paper systematically evaluates various methods for extending the context length of LLMs, aiming to provide insights into the behavior of long-context models and establish a standardized evaluation framework. It designs a controlled protocol for comparing context extension methods using consistent base models and extension data. The study includes an examination of the performance of different attention mechanisms in long-context tasks, confirms that perplexity remains a relevant performance indicator in longer-context scenarios, and presents findings indicating that exact fine-tuning methods are effective within their extension range, while approximation methods tend to underperform. Additionally, the paper emphasizes the open-sourcing of codebases, models, and checkpoints to promote transparency and facilitate further research.
Strengths
- The paper introduces a novel controlled protocol for evaluating long-context extension methods, addressing a significant gap in the literature regarding the comparison of such techniques.
- The study is comprehensive, utilizing a variety of metrics and tasks to assess model performance. The use of standardized base models and extension data enhances the quality of the comparative analysis.
Weaknesses
- The study is limited to three base models, which may not accurately represent the performance of other, potentially larger models. Expanding the analysis to include a more diverse set of base models could strengthen the conclusions.
- While the paper acknowledges limitations due to fixed hyperparameters, a more in-depth exploration of how different hyperparameter settings might affect the results would be beneficial.
- The generalization of findings to longer contexts beyond 32k (e.g., 128k and 1m) is not addressed, which is a significant limitation given the focus on long-context models.
- The insights provided in this paper have been discussed in previous studies, and I did not gain any new takeaways from it.
Questions
- What is the performance of new models, such as Qwen2.5, in this experiment?
- What are your thoughts on the generalization behavior of these methods for contexts longer than 32k? Are there any preliminary findings or conjectures regarding this?
Details of Ethics Concerns
NA
The study is limited to three base models, which may not accurately represent the performance of other, potentially larger models. Expanding the analysis to include a more diverse set of base models could strengthen the conclusions.
Thank you for highlighting the importance of evaluating larger models. In response, we have conducted additional experiments using the LLaMA-13B model and are also running experiments on the LLaMA-70B model. Once those results are available, we will update the manuscript.
Table 1: Performance of LLaMA-13B Extension Methods
| Method | Perplexity (32k) | LongBench | Needle (64k) | RULER (32k) |
|---|---|---|---|---|
| Llama2-13b-hf (4k) | 2.90 | 33.84 | 0.17 | 86.35 |
| NTK-Frozen | 3.31 | 31.87 | 0.43 | 2.30 |
| Self-Extend | 2.65 | 33.69 | 0.54 | 30.23 |
| PI | 2.46 | 37.45 | 0.45 | 55.95 |
| NTK-32k | 2.44 | 38.41 | 0.82 | 58.38 |
| YaRN | 2.46 | 34.03 | 0.44 | 44.79 |
| CLEX | 2.43 | 35.89 | 0.79 | 52.76 |
Table 2: Detailed RULER Benchmark Results for LLaMA-13B at 32k
| Method | NIAH_S1 | NIAH_S2 | NIAH_S3 | NIAH_M1 | NIAH_M2 | NIAH_M3 | NIAH_MV | NIAH_MQ | VT | CWE | FWE | QA_1 | QA_2 | Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Llama2-13b-hf (4k) | 100 | 100 | 92 | 100 | 98 | 89 | 84.25 | 96.25 | 71.2 | 78.2 | 86.67 | 76 | 51 | 86.35 |
| NTK-Frozen | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.2 | 28.67 | 1 | 0 | 2.30 |
| Self-Extend | 65 | 63 | 76 | 15 | 1 | 0 | 25.25 | 11 | 32.8 | 17.6 | 51.33 | 14 | 21 | 30.23 |
| PI | 98 | 100 | 90 | 93 | 63 | 17 | 36 | 63 | 18.2 | 31.2 | 56 | 27 | 35 | 55.95 |
| NTK-32k | 100 | 97 | 81 | 81 | 36 | 12 | 63.75 | 74.5 | 29.6 | 48.1 | 69 | 29 | 38 | 58.38 |
| YaRN | 99 | 96 | 65 | 64 | 19 | 2 | 19.75 | 45 | 32.6 | 29.3 | 49.67 | 24 | 37 | 44.79 |
| CLEX | 98 | 98 | 96 | 71 | 20 | 2 | 45.25 | 61.5 | 24.2 | 40.2 | 74.67 | 19 | 36 | 52.76 |
We had the following new observations based on the added experiments and our previous analysis:
- Performance Trends: With the larger LLaMA-13B model, we observe that non-extension methods like NTK-Frozen and Self-Extend show improved performance on intrinsic tasks such as Needle-in-a-Haystack compared to their performance at smaller scales.
- Continual Fine-Tuning Methods: Despite the improvements in non-extension methods, continual fine-tuning methods still outperform them within their extension range.
- Perplexity and Downstream Tasks: The correlation between perplexity and downstream task performance remains consistent, reinforcing our original conclusions.
We will update the manuscript to include these new findings. The additional results provide deeper insights into how model scaling influences long-context capabilities and validate the robustness of our conclusions across different model sizes.
While the paper acknowledges limitations due to fixed hyperparameters, a more in-depth exploration of how different hyperparameter settings might affect the results would be beneficial.
Thank you for highlighting the importance of exploring the impact of hyperparameter settings on our results. We added the following empirical studies to address your concerns.
Continual fine-tuning methods
We agree that hyperparameters can significantly influence the performance of different context extension methods, particularly approximate attention methods. We swept standard training hyperparameters, such as batch size and learning rate. Results are shown below.
Table 1: Perplexity Results of LongLora on PG19 and Proof-pile
| Method | Batch Size | Learning Rate | 2k | 4k | 8k | 16k | 32k |
|---|---|---|---|---|---|---|---|
| PG19 | | | | | | | |
| Longlora | 32 | 2e-5 | 12.80 | 11.52 | 10.70 | 10.18 | 9.89 |
| Longlora | 8 | 2e-5 | 8.10 | 7.69 | 7.43 | 7.28 | 7.32 |
| Proof-pile | | | | | | | |
| Longlora | 32 | 2e-5 | 5.97 | 5.10 | 4.58 | 4.27 | 4.13 |
| Longlora | 8 | 2e-5 | 3.33 | 3.01 | 2.80 | 2.67 | 2.61 |
We made the following observations:
- High Sensitivity: Approximate attention methods like LongLoRA are highly sensitive to hyperparameter settings. Small changes in learning rate or training steps led to significant fluctuations in performance.
- Robustness to Hyperparameters: NTK and YaRN methods demonstrated robustness to changes in hyperparameter settings. Their performance remained stable across a wide range of configurations.
- Optimization Challenges: Training times were more predictable for NTK and YaRN and generally shorter because fewer hyperparameter adjustments were needed. For LongLoRA, achieving optimal performance requires careful tuning, which can be computationally intensive and time-consuming.
Inference Time Optimization
Additionally, we experimented with different hyperparameter settings during inference, such as scaling factors and method-specific hyperparameters. Specifically, for Self-Extend, we follow the empirical rule proposed by its authors for selecting the number of neighbor tokens (window size) and the group size, using 32k as the target length. Table 2 below shows the perplexity for different combinations of these hyperparameters.
Table 2: Perplexity Results of Self-Extend with different group and window size
| Method | Window Size | Group Size | 4k | 16k | 32k |
|---|---|---|---|---|---|
| Self-Extend | 512 | 32 | 7.74 | 7.64 | 7.67 |
| | 512 | 64 | 7.77 | 7.72 | 8.43 |
| | 512 | 128 | 7.81 | 7.84 | 9.86 |
| | 1024 | 32 | 7.67 | 7.44 | 7.42 |
| | 1024 | 64 | 7.67 | 7.46 | 7.47 |
| | 1024 | 128 | 7.68 | 7.48 | 7.51 |
| | 2048 | 32 | 7.69 | 7.48 | 11.18 |
| | 2048 | 64 | 7.70 | 7.50 | 8.08 |
| | 2048 | 128 | 7.70 | 7.54 | 10.25 |
Despite not requiring fine-tuning, we found that Self-Extend is sensitive to hyperparameters at inference time when the input context gets longer (32k): its performance varies significantly based on choices such as group size. Small changes in hyperparameters can lead to considerable fluctuations in model performance during inference. This sensitivity can affect the reliability of these methods in practical applications where consistent performance is necessary.
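For intuition about the two hyperparameters swept above, they enter Self-Extend through its relative-position remapping. A rough, simplified sketch of that idea (not the authors' implementation) is shown below.

```python
def self_extend_relative_position(distance, window_size=1024, group_size=64):
    """Simplified sketch of Self-Extend-style position remapping.

    Tokens within `window_size` keep exact relative positions; more distant
    tokens share coarser grouped positions, so the largest relative position
    the model sees stays close to its pre-trained range.
    """
    if distance < window_size:
        return distance
    return window_size + (distance - window_size) // group_size

# Example: at a 32k context, the farthest token maps to roughly
# 1024 + (32768 - 1024) // 64 = 1520, well inside Llama-2's 4k training range.
print(self_extend_relative_position(32768, window_size=1024, group_size=64))
```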
We will add these results to the analysis in the manuscript.
The generalization of findings to longer contexts beyond 32k (e.g., 128k and 1m) is not addressed, which is a significant limitation given the focus on long-context models.
We sincerely thank the reviewer for highlighting the importance of evaluating model performance on contexts longer than 32k tokens.
In our study, we define generalization as the model's ability to perform well across tasks that extend beyond the training context length. Specifically, we have evaluated our models on tasks where the input lengths exceed 32k tokens, such as NIAH, perplexity (PPL), and RULER, with sequences up to 64k tokens. We will revise our writing to highlight this in our manuscript. In our original paper, we found that NTK-Dynamic yields the best performance beyond 32k.
To further evaluate generalization, we evaluated sequences up to 128k tokens with NTK, which in our submission performs best in terms of generalization. Results are shown below.
Table 1: Generalization of NTK beyond 32k on RULER
| Method | 4k | 8k | 16k | 32k | 64k | 128k |
|---|---|---|---|---|---|---|
| NTK-32k | 86.58 | 77.75 | 70.01 | 59.42 | 46.26 | 29.91 |
| NTK-64k | 86.60 | 76.34 | 69.56 | 60.03 | 49.31 | 40.09 |
Table 2: Generalization of NTK beyond 32k on NIAH
| Method | Length | NIAH_S1 | NIAH_S2 | NIAH_S3 | NIAH_M1 | NIAH_M2 | NIAH_M3 | NIAH_MV | NIAH_MQ | VT | CWE | FWE | QA_1 | QA_2 | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| NTK-32k | 128k | 75.00 | 56.00 | 74.00 | 48.00 | 3.00 | 0.00 | 25.75 | 25.75 | 11.00 | 3.30 | 13.00 | 28.00 | 26.00 | 29.91 |
| NTK-64k | 128k | 85.00 | 88.00 | 91.00 | 67.00 | 8.00 | 0.00 | 44.50 | 47.25 | 3.40 | 0.70 | 34.33 | 24.00 | 28.00 | 40.09 |
We found that our conclusion holds at 64k tokens. However, when the context length increased to 128k tokens (4x the fine-tuned length of 32k), we noticed a decrease in performance. This indicates that even for the best-generalizing methods in our controlled setting, generalization becomes weaker when the context length is much larger than the fine-tuned length.
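As background on the NTK-32k / NTK-64k naming used above: these variants enlarge the RoPE base frequency for a target context length instead of interpolating positions directly. A minimal sketch of the commonly used scaling rules follows; the exact constants in our training runs are not reproduced here.

```python
def pi_scaled_position(pos, orig_ctx=4096, target_ctx=32768):
    """Position Interpolation (PI): squeeze positions back into the original range."""
    return pos * orig_ctx / target_ctx

def ntk_scaled_rope_base(base=10000.0, head_dim=128, orig_ctx=4096, target_ctx=32768):
    """NTK-aware scaling: a common rule is base' = base * s ** (d / (d - 2))."""
    s = target_ctx / orig_ctx
    return base * s ** (head_dim / (head_dim - 2))

print(ntk_scaled_rope_base())  # ~8.3e4 for a 4k -> 32k extension with head_dim 128
```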
The insights provided in this paper have been discussed in previous studies, and I did not gain any new takeaways from it.
We appreciate the reviewer's candid feedback regarding the novelty of our insights. While we acknowledge that some of our conclusions align with findings from previous studies, we believe that our work offers several distinct contributions that differentiate it from existing research:
- Controlled Experimental Protocol:
  - Consistent base models, datasets, and metrics ensure fair comparisons.
  - Eliminated confounding variables, enhancing credibility and reproducibility.
- Clear Mathematical Connections:
  - Explicit mathematical relationships between methods unify different approaches.
  - Provided deeper insights into their performance and interrelations.
- Comprehensive Evaluation:
  - Extensive tasks include intrinsic metrics (e.g., perplexity) and extrinsic benchmarks (e.g., LongBench, RULER).
  - Tested various model sizes (e.g., LLaMA-13B, adding LLaMA-70B) for stronger generalizability.
  - Standardized benchmarking framework enables meaningful comparisons and best practice identification.
While prior studies may have explored similar themes, our work differentiates itself through the combination of a controlled experimental setup, mathematical summarization contributions, and a comprehensive evaluation framework. We believe that quantifying these properties across multiple approaches and presenting them within a standardized benchmark adds significant value to the field.
We will revise the manuscript to more clearly highlight these unique contributions. By emphasizing these aspects, we aim to better convey the novelty and significance of our work to the reader.
What is the performance of new models, such as Qwen2.5, in this experiment?
We thank the reviewer for bringing up Qwen-2.5. This model has demonstrated outstanding performance across various benchmarks and is widely recognized as one of the top lightweight models in many scenarios.
Evaluation of Qwen-2.5-7B:
- Context Length Support: Qwen-2.5-7B supports an input context length of up to 128k tokens.
- Extension Method: While there is currently no technical report detailing their context extension method, based on the previous Qwen-2 report, we hypothesize that they utilize a context extension technique similar to NTK-RoPE.
- Experimental Results: We conducted experiments to evaluate Qwen-2.5-7B using our standardized benchmarks. The results are as follows:
Table 1: Performance of Qwen-2.5-7b on Long-Context Tasks
| Method | Perplexity (32k) | LongBench | Needle (64k) | RULER (32k) |
|---|---|---|---|---|
| Qwen-2.5-7b | 2.3154 | 45.01 | 0.871 | 85.21 |
Table 2: Performance and generalization of Qwen-2.5-7b on RULER
| Method | 4k | 8k | 16k | 32k | 64k | 128k |
|---|---|---|---|---|---|---|
| Qwen2.5 | 94.90 | 89.95 | 88.30 | 85.21 | 63.67 | 21.06 |
| Method | Length | NIAH_S1 | NIAH_S2 | NIAH_S3 | NIAH_M1 | NIAH_M2 | NIAH_M3 | NIAH_MV | NIAH_MQ | VT | CWE | FWE | QA_1 | QA_2 | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Qwen2.5 | 4k | 100 | 100 | 100 | 100 | 100 | 100 | 97.5 | 99.75 | 99.8 | 98.3 | 97.33 | 84 | 57 | 94.90 |
| | 8k | 100 | 100 | 100 | 100 | 100 | 98 | 94 | 99.75 | 95.4 | 86.2 | 83 | 64 | 49 | 89.95 |
| | 16k | 100 | 100 | 100 | 100 | 99 | 97 | 94 | 98.75 | 92.2 | 65 | 87 | 64 | 51 | 88.30 |
| | 32k | 100 | 100 | 100 | 99 | 96 | 92 | 93.25 | 97.5 | 88.6 | 58.7 | 85.67 | 56 | 41 | 85.21 |
| | 64k | 100 | 92 | 100 | 67 | 22 | 28 | 82.75 | 86.25 | 78.2 | 10.9 | 83.67 | 54 | 23 | 63.67 |
| | 128k | 98 | 29 | 37 | 18 | 4 | 1 | 26 | 15.75 | 10.4 | 1 | 1.67 | 10 | 22 | 21.06 |
We had the following observations:
- Superior Performance: Qwen-2.5-7B achieves the best performance on our long-context tasks compared to other models of similar size.
- Generalization Ability: The model demonstrates strong generalization to longer contexts beyond its training range, aligning with the trends observed in our study. We hypothesize that this is partially due to more and better continual fine-tuning data and a better training recipe.
In addition to adding the 13B model, we also added the 70B model results, shown below. We observed similar trends across the Phi, LLaMA-7B, and LLaMA-13B base models. This further shows that our analysis and findings can be applied to larger models.
Table 1: Performance of LLaMA-70B Extension Methods
| Method | Perplexity (32k) | LongBench | Needle (64k) | RULER (32k) |
|---|---|---|---|---|
| Llama2-70b-hf (4k) | 2.66 | 34.00 | 14.70 | 93.67 |
| NTK-Frozen | 3.25 | 32.40 | 30.90 | 11.39 |
| Self-Extend | 2.43 | 29.10 | 32.60 | 31.94 |
| PI | 2.26 | 42.44 | 49.80 | 77.98 |
| NTK-32k | 2.25 | 41.51 | 90.50 | 76.97 |
Table 2: Detailed RULER Benchmark Results for LLaMA-70B
| Method | NIAH_S1 | NIAH_S2 | NIAH_S3 | NIAH_M1 | NIAH_M2 | NIAH_M3 | NIAH_MV | NIAH_MQ | VT | CWE | FWE | QA_1 | QA_2 | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Llama2-70b-hf (4k) | 100.00 | 100.00 | 100.00 | 100.00 | 95.00 | 100.00 | 99.50 | 99.75 | 99.80 | 100.00 | 98.67 | 68.00 | 57.00 | 93.67 |
| NTK-Frozen | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 1.40 | 62.00 | 52.67 | 9.00 | 23.00 | 11.39 |
| Self-Extend | 24.00 | 50.00 | 27.00 | 1.00 | 0.00 | 0.00 | 24.50 | 12.00 | 54.60 | 75.40 | 88.67 | 25.00 | 33.00 | 31.94 |
| PI | 99.00 | 100.00 | 98.00 | 97.00 | 53.00 | 28.00 | 92.00 | 93.75 | 92.80 | 92.50 | 78.67 | 38.00 | 51.00 | 77.98 |
| NTK-32k | 100.00 | 100.00 | 92.00 | 87.00 | 44.00 | 19.00 | 90.25 | 96.00 | 93.20 | 97.50 | 93.67 | 37.00 | 51.00 | 76.97 |
Table 3 : Performance of LLaMA-7B, LLaMA-13B, LLaMA-70B Extension Methods
| Method | Perplexity (32k) | LongBench | Needle (64k) | RULER (32k) |
|---|---|---|---|---|
| Llama2-7b-hf (4k) | 3.04 | 32.92 | 8.40 | 80.94 |
| Llama2-13b-hf (4k) | 2.90 | 33.84 | 17.00 | 86.35 |
| Llama2-70b-hf (4k) | 2.66 | 34.00 | 14.70 | 93.67 |
| Llama2-7b-NTK-Frozen | 4.06 | 25.54 | 18.80 | 0.72 |
| Llama2-13b-NTK-Frozen | 3.31 | 31.87 | 43.00 | 2.30 |
| Llama2-70b-NTK-Frozen | 3.25 | 32.40 | 30.90 | 11.39 |
| Llama2-7b-Self-Extend | 2.75 | 33.62 | 25.80 | 29.50 |
| Llama2-13b-Self-Extend | 2.65 | 33.69 | 53.50 | 30.23 |
| Llama2-70b-Self-Extend | 2.43 | 29.10 | 32.60 | 31.94 |
| Llama2-7b-PI | 2.58 | 33.48 | 42.10 | 57.66 |
| Llama2-13b-PI | 2.46 | 37.45 | 45.00 | 55.95 |
| Llama2-70b-PI | 2.26 | 42.44 | 49.80 | 77.98 |
| Llama2-7b-NTK-32k | 2.54 | 35.32 | 83.70 | 59.42 |
| Llama2-13b-NTK-32k | 2.44 | 38.41 | 82.20 | 58.38 |
| Llama2-70b-NTK-32k | 2.25 | 41.51 | 90.50 | 76.97 |
| Llama2-7b-YaRN | 2.59 | 33.45 | 46.70 | 36.95 |
| Llama2-13b-YaRN | 2.46 | 34.03 | 44.20 | 44.79 |
| Llama2-7b-CLEX | 2.55 | 33.48 | 71.10 | 52.17 |
| Llama2-13b-CLEX | 2.43 | 35.89 | 78.90 | 52.76 |
Dear Reviewer 3BXk,
We would like to express our sincere gratitude for your thorough review of our manuscript and for your valuable insights. We have carefully considered your feedback and have made significant revisions to address your concerns.
In particular, we have:
- Expanded our experiments to include larger and more diverse models, such as the LLaMA-13B and Qwen-2.5-7B models, to strengthen the generalizability of our conclusions.
- Conducted a comprehensive hyperparameter sensitivity analysis, exploring how different hyperparameter settings affect the results during both training and inference phases. This includes additional experiments and detailed discussions in the revised manuscript.
- Evaluated the generalization behavior of our methods for contexts longer than 32k tokens, extending our experiments to include context lengths of 64k and 128k tokens. Our findings and conjectures regarding this are included in the updated paper.
- Clarified the novelty and contributions of our work, emphasizing the controlled experimental protocol, clear mathematical connections, and comprehensive evaluation across multiple dimensions.
We kindly ask if you could take a moment to review our updated manuscript. If there is anything further that you would like us to clarify or any additional feedback you wish to provide, please let us know. We are more than willing to provide any additional information or make further revisions as needed.
Thank you once again for your time and thoughtful feedback. Your insights have been instrumental in improving our paper.
Sincerely,
Authors of Submission 8416
Using consistent base models and extension data, the study yielded several insights into long-context behavior. First, it reaffirmed the critical role of perplexity as a general-purpose performance indicator. Second, current approximate attention methods systematically underperform in long-context tasks. Finally, it confirmed that exact fine-tuning based methods are generally effective within their extension range, whereas extrapolation remains challenging.
Strengths
- This work is the first to conduct a fair and comprehensive comparison of different long context extension methods, resulting in several useful conclusions.
Weaknesses
- Based on my experience, increasing the model size in long-context downstream tasks such as LongBench yields some interesting conclusions that are inconsistent with the 7B model. I hope you can add some simple experimental results from the 13B model to discuss this point.
- While enhancing the model's long-context capabilities, we also care about the impact of different methods on short-text performance and the extent of knowledge forgetting. I hope you can add some results from the common Open LLM Leaderboard to discuss this point.
Questions
please see weaknesses
We sincerely thank the reviewer for their insightful comments and valuable suggestions. We have addressed each of your concerns in two separate threads.
Based on my experience, increasing the model size in long-context downstream tasks such as LongBench yields some interesting conclusions that are inconsistent with the 7B model. I hope you can add some simple experimental results from the 13B model to discuss this point.
We have conducted additional experiments using the LLaMA-13B model and are also running experiments on the LLaMA-70B model. Once those results are available, we will update the manuscript.
Table 1: Performance of LLaMA-13B Extension Methods
| Method | Perplexity(32k) | LongBench | Needle(64k) | RULER(32k) |
|---|---|---|---|---|
| Llama2-13b-hf (4k) | 2.90 | 33.84 | 17.00 | 86.35 |
| NTK-Frozen | 3.31 | 31.87 | 43.00 | 2.30 |
| Self-Extend | 2.65 | 33.69 | 53.50 | 30.23 |
| PI | 2.46 | 37.45 | 45.00 | 55.95 |
| NTK-32k | 2.44 | 38.41 | 82.20 | 58.38 |
| YaRN | 2.46 | 34.03 | 44.20 | 44.79 |
| CLEX | 2.43 | 35.89 | 78.90 | 52.76 |
Table 2: Detailed RULER Benchmark Results for LLaMA-13B
| Method | NIAH_S1 | NIAH_S2 | NIAH_S3 | NIAH_M1 | NIAH_M2 | NIAH_M3 | NIAH_MV | NIAH_MQ | VT | CWE | FWE | QA_1 | QA_2 | Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Llama2-13b-hf (4k) | 100.00 | 100.00 | 92.00 | 100.00 | 98.00 | 89.00 | 84.25 | 96.25 | 71.20 | 78.20 | 86.67 | 76.00 | 51.00 | 86.35 |
| NTK-Frozen | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.20 | 28.67 | 1.00 | 0.00 | 2.30 |
| Self-Extend | 65.00 | 63.00 | 76.00 | 15.00 | 1.00 | 0.00 | 25.25 | 11.00 | 32.80 | 17.60 | 51.33 | 14.00 | 21.00 | 30.23 |
| PI | 98.00 | 100.00 | 90.00 | 93.00 | 63.00 | 17.00 | 36.00 | 63.00 | 18.20 | 31.20 | 56.00 | 27.00 | 35.00 | 55.95 |
| NTK-32k | 100.00 | 97.00 | 81.00 | 81.00 | 36.00 | 12.00 | 63.75 | 74.50 | 29.60 | 48.10 | 69.00 | 29.00 | 38.00 | 58.38 |
| YaRN | 99.00 | 96.00 | 65.00 | 64.00 | 19.00 | 2.00 | 19.75 | 45.00 | 32.60 | 29.30 | 49.67 | 24.00 | 37.00 | 44.79 |
| CLEX | 98.00 | 98.00 | 96.00 | 71.00 | 20.00 | 2.00 | 45.25 | 61.50 | 24.20 | 40.20 | 74.67 | 19.00 | 36.00 | 52.76 |
We had the following new observations based on the added experiments and our previous analysis:
- Performance Trends: With the larger LLaMA-13B model, we observe that non-extension methods like NTK-Frozen and Self-Extend show improved performance on intrinsic tasks such as Needle-in-a-Haystack compared to their performance at smaller scales.
- Continual Fine-Tuning Methods: Despite the improvements in non-extension methods, continual fine-tuning methods still outperform them within their extension range.
- Perplexity and Downstream Tasks: The correlation between perplexity and downstream task performance remains consistent, reinforcing our original conclusions.
We will update the manuscript to include these new findings. The additional results provide deeper insights into how model scaling influences long-context capabilities and validate the robustness of our conclusions across different model sizes.
While enhancing the model's long-context capabilities, we also care about the impact of different methods on short-text performance and the extent of knowledge forgetting. I hope you can add some results from the common Open LLM Leaderboard to discuss this point.
We appreciate your suggestion to assess the impact of long-context extensions on short-text performance and knowledge retention. To address this, we evaluated our models on several benchmarks from the Open LLM Leaderboard, focusing on tasks that measure short-text understanding and knowledge.
Table 3: Short-Text Task Performance
| Methods | ARC-c | ARC-e | Hellaswag | MMLU | TruthfulQA | WinoGrande | Average |
|---|---|---|---|---|---|---|---|
| Llama2-7b-base | 52.73 | 81.31 | 78.96 | 42.09 | 38.97 | 74.43 | 61.42 |
| LM-Infinite | 52.56 | 81.36 | 78.95 | 42.09 | 38.96 | 74.11 | 61.34 |
| Self-Extend | 52.56 | 81.31 | 78.94 | 42.07 | 38.97 | 74.43 | 61.38 |
| NTK-Frozen | 52.73 | 81.31 | 78.96 | 42.09 | 38.97 | 74.43 | 61.42 |
| PI | 51.11 | 81.14 | 77.44 | 37.19 | 38.03 | 71.74 | 59.44 |
| NTK-32k | 49.15 | 80.22 | 74.48 | 35.25 | 38.13 | 72.61 | 58.31 |
| NTK-64k | 46.08 | 78.32 | 70.68 | 34.27 | 39.08 | 70.24 | 56.45 |
| YaRN | 53.41 | 81.82 | 78.47 | 41.06 | 38.63 | 74.43 | 61.30 |
| CLEX | 50.60 | 81.27 | 76.06 | 37.54 | 36.10 | 64.72 | 57.72 |
| LongLora | 46.67 | 78.58 | 67.08 | 26.29 | 37.61 | 55.25 | 51.91 |
We had the following observations:
- Performance Degradation: Most long-context extension methods exhibit a slight decrease in performance on short-text tasks compared to the base model.
- Trade-Off Between Long and Short Contexts: The reduction in short-text performance is more pronounced in models using continuous fine-tuning methods. This indicates a potential trade-off between enhancing long-context capabilities and maintaining optimal performance on short-text tasks.
- Alignment with Original Findings: These results align with our observations in Figure 3, where we analyze the average negative log-likelihood across different context positions, suggesting that extending context length can impact performance at shorter contexts.
We will incorporate these findings into the revised manuscript, discussing the implications for knowledge retention and the balance between long-context capabilities and short-text performance.
The additional experiments provide valuable insights into how model scaling and context extension methods affect long and short-text tasks. We believe these enhancements strengthen our conclusions and contribute meaningfully to the field. Please let us know if there are any further concerns or suggestions.
In addition to adding the 13B model, we also added the 70B model results, shown below. We observed similar trends across the Phi, LLaMA-7B, and LLaMA-13B base models. This further shows that our analysis and findings can be applied to larger models.
Table 1: Performance of LLaMA-70B Extension Methods
| Method | Perplexity (32k) | LongBench | Needle (64k) | RULER (32k) |
|---|---|---|---|---|
| Llama2-70b-hf (4k) | 2.66 | 34.00 | 14.70 | 93.67 |
| NTK-Frozen | 3.25 | 32.40 | 30.90 | 11.39 |
| Self-Extend | 2.43 | 29.10 | 32.60 | 31.94 |
| PI | 2.26 | 42.44 | 49.80 | 77.98 |
| NTK-32k | 2.25 | 41.51 | 90.50 | 76.97 |
Table 2: Detailed RULER Benchmark Results for LLaMA-70B
| Method | NIAH_S1 | NIAH_S2 | NIAH_S3 | NIAH_M1 | NIAH_M2 | NIAH_M3 | NIAH_MV | NIAH_MQ | VT | CWE | FWE | QA_1 | QA_2 | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Llama2-70b-hf (4k) | 100.00 | 100.00 | 100.00 | 100.00 | 95.00 | 100.00 | 99.50 | 99.75 | 99.80 | 100.00 | 98.67 | 68.00 | 57.00 | 93.67 |
| NTK-Frozen | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 1.40 | 62.00 | 52.67 | 9.00 | 23.00 | 11.39 |
| Self-Extend | 24.00 | 50.00 | 27.00 | 1.00 | 0.00 | 0.00 | 24.50 | 12.00 | 54.60 | 75.40 | 88.67 | 25.00 | 33.00 | 31.94 |
| PI | 99.00 | 100.00 | 98.00 | 97.00 | 53.00 | 28.00 | 92.00 | 93.75 | 92.80 | 92.50 | 78.67 | 38.00 | 51.00 | 77.98 |
| NTK-32k | 100.00 | 100.00 | 92.00 | 87.00 | 44.00 | 19.00 | 90.25 | 96.00 | 93.20 | 97.50 | 93.67 | 37.00 | 51.00 | 76.97 |
Table 3 : Performance of LLaMA-7B, LLaMA-13B, LLaMA-70B Extension Methods
| Method | Perplexity (32k) | LongBench | Needle (64k) | RULER (32k) |
|---|---|---|---|---|
| Llama2-7b-hf (4k) | 3.04 | 32.92 | 8.40 | 80.94 |
| Llama2-13b-hf (4k) | 2.90 | 33.84 | 17.00 | 86.35 |
| Llama2-70b-hf (4k) | 2.66 | 34.00 | 14.70 | 93.67 |
| Llama2-7b-NTK-Frozen | 4.06 | 25.54 | 18.80 | 0.72 |
| Llama2-13b-NTK-Frozen | 3.31 | 31.87 | 43.00 | 2.30 |
| Llama2-70b-NTK-Frozen | 3.25 | 32.40 | 30.90 | 11.39 |
| Llama2-7b-Self-Extend | 2.75 | 33.62 | 25.80 | 29.50 |
| Llama2-13b-Self-Extend | 2.65 | 33.69 | 53.50 | 30.23 |
| Llama2-70b-Self-Extend | 2.43 | 29.10 | 32.60 | 31.94 |
| Llama2-7b-PI | 2.58 | 33.48 | 42.10 | 57.66 |
| Llama2-13b-PI | 2.46 | 37.45 | 45.00 | 55.95 |
| Llama2-70b-PI | 2.26 | 42.44 | 49.80 | 77.98 |
| Llama2-7b-NTK-32k | 2.54 | 35.32 | 83.70 | 59.42 |
| Llama2-13b-NTK-32k | 2.44 | 38.41 | 82.20 | 58.38 |
| Llama2-70b-NTK-32k | 2.25 | 41.51 | 90.50 | 76.97 |
| Llama2-7b-YaRN | 2.59 | 33.45 | 46.70 | 36.95 |
| Llama2-13b-YaRN | 2.46 | 34.03 | 44.20 | 44.79 |
| Llama2-7b-CLEX | 2.55 | 33.48 | 71.10 | 52.17 |
| Llama2-13b-CLEX | 2.43 | 35.89 | 78.90 | 52.76 |
Dear Reviewer ZKTb,
We would like to express our sincere gratitude for your thorough review of our manuscript and for your valuable insights. We have made significant revisions to our draft to address your concerns and suggestions.
We kindly ask if you could take a moment to review our updates. If there is anything further that you would like us to clarify or any additional experiments you believe would strengthen our work, please let us know. We are more than willing to provide any additional information or make further revisions as needed.
Thank you once again for your time and thoughtful feedback.
Sincerely,
Authors of Submission 8416
The paper evaluates several freeze/fine-tuning long-context methods on several public LLMs. Based on its experiments, it argues that (1) PPL is still an important metric in long-context scenarios, (2) "approximate attention" methods show poor performance (3) "exact fine-tuning based methods" are effective within trained range but cannot further process longer texts well.
Strengths
- The paper evaluates many methods under a controlled setting.
- The paper proposes several might-be-useful conclusions based on their experiments.
Weaknesses
- More evaluation of short texts is required to check whether these methods perform well in keeping the original performance. (for example, NTK by parts may behave better on this than NTK-rope.)
- The so-called "approximate attention" category includes many quite different methods and can be misleading. It is not clearly justified to group the streaming-LLM methods / longlora / landmark into one class. Also, since MinInference-style work shows promising performance now, the conclusion that 'approximate attention is poor' might be misleading.
- self-extend seems to be more like NTK-F than 'approximate attention'.
- The contribution of the work might be limited if the arguments were not sufficiently robust.
Questions
As the (1)-(3) in weakness.
The so-called "approximate attention" method includes many quite different methods and can be misleading. It isn't very clear to include the streaming-LLM methods / longlora / landmark into one class. MinInference-style work shows promising performance now, the conclusion that 'approximate attention is poor' might be misleading.
Thank you for pointing out that the term "approximate attention" in our manuscript encompasses a wide range of methods, which may lead to confusion. We agree that grouping streaming-LLM methods, LongLoRA, and Landmark attention under a single category is not sufficiently precise and could be misleading. In the revised manuscript, we will update the categorization of these methods to more accurately reflect their differences and unique characteristics.
Regarding MinInference-style work [1, 2, 3], we acknowledge that recent developments have shown promising performance in this area. We will include a discussion of these methods in our related work and limitations sections to provide a more comprehensive overview. Additionally, we would like to note that this line of research is orthogonal to our study and can be applied in conjunction with the approaches evaluated in this submission, whereas the main focus of this study is effectiveness, in line with recent long-context studies [4, 5].
We will also revisit our conclusion that "approximate attention is poor" to specify that this observation is specific to our experimental setting, particularly when applying approximate attention methods during fine-tuning. This clarification should prevent any potential misunderstandings about the general effectiveness of approximate attention methods.
self-extend seems to be more like NTK-F than 'approximate attention'.
Thank you for highlighting that the Self-Extend method is more akin to NTK-Frozen methods than to "approximate attention" methods. In the revision, we will reclassify Self-Extend under RoPE-based methods and clearly differentiate it from other approaches. This adjustment will improve the clarity of our methodology section and ensure that each method is appropriately categorized.
The contribution of the work might be limited if the arguments were not sufficiently robust.
We appreciate your concern regarding the robustness of our arguments. To ensure the validity of our conclusions, we have conducted experiments across different model sizes and families within a controlled setting. This comprehensive approach enhances our confidence in the robustness of our findings.
We recognize that some existing research presents slightly contradictory results using similar methods. We hypothesize that these discrepancies may stem from the use of unfair or biased base models, or from exhaustive hyperparameter searches that are not standardized across studies. In the revised manuscript, we will discuss these factors in more detail and outline how our methodology addresses them. This added context should strengthen the reliability of our conclusions and clarify the contributions of our work.
References
[1] Leviathan, Yaniv, Matan Kalman, and Yossi Matias. "Fast inference from transformers via speculative decoding." International Conference on Machine Learning. PMLR, 2023.
[2] Xia, Heming, et al. "Unlocking efficiency in large language model inference: A comprehensive survey of speculative decoding." arXiv preprint arXiv:2401.07851 (2024).
[3] Yang, Nan, et al. "Inference with reference: Lossless acceleration of large language models." arXiv preprint arXiv:2304.04487 (2023).
[4] Gao, Tianyu, et al. "How to train long-context language models (effectively)." arXiv preprint arXiv:2410.02660 (2024).
[5] Hsieh, Cheng-Ping, et al. "RULER: What's the Real Context Size of Your Long-Context Language Models?." arXiv preprint arXiv:2404.06654 (2024).
Dear Reviewer 7saq,
We would like to express our sincere gratitude for your thorough review of our manuscript and for your valuable insights. We have carefully considered your feedback and have made significant revisions to address your concerns.
In particular, we have:
- Conducted additional evaluations on short-text tasks to assess whether the methods maintain the original performance, as you suggested. This includes evaluating the impact on short-text performance and the extent of knowledge retention.
- Clarified the categorization of the "approximate attention" methods to avoid potential misunderstandings. We have reclassified certain methods and provided more detailed explanations to ensure clarity.
- Revised the classification of the Self-Extend method, recognizing that it is more akin to NTK-Frozen methods rather than "approximate attention".
- Strengthened the robustness of our arguments by conducting additional experiments across different model sizes and settings.
We kindly ask if you could take a moment to review our updated manuscript. If there is anything further that you would like us to clarify or any additional experiments you believe would strengthen our work, please let us know. We are more than willing to provide any additional information or make further revisions as needed.
Thank you once again for your time and thoughtful feedback. Your insights have been instrumental in improving our paper.
Sincerely,
Authors of Submission 8416
I've raised my rating to 6 since now I believe it meets ICLR standard.
Thank you for your kind words and for taking the time to review our detailed response. We are pleased that your concerns have been addressed. Thank you again for the time and effort you have dedicated to reviewing our paper and providing valuable feedback. We sincerely appreciate your recognition of our work and will carefully revise the paper based on your and other reviewers’ comments.
More evaluation of short texts is required to check whether these methods perform well in keeping the original performance. (for example, NTK by parts may behave better on this than NTK-rope.)
We thank the reviewer for the suggestion. A similar question was raised by Reviewer ZKTb. To address it, we evaluated our models on several benchmarks from the Open LLM Leaderboard, focusing on short tasks that measure short-text understanding and knowledge.
Table 3: Short-Text Task Performance
| Methods | ARC-c | ARC-e | Hellaswag | MMLU | TruthfulQA | WinoGrande | Average |
|---|---|---|---|---|---|---|---|
| Llama2-7b-base | 52.73 | 81.31 | 78.96 | 42.09 | 38.97 | 74.43 | 61.42 |
| LM-Infinite | 52.56 | 81.36 | 78.95 | 42.09 | 38.96 | 74.11 | 61.34 |
| Self-Extend | 52.56 | 81.31 | 78.94 | 42.07 | 38.97 | 74.43 | 61.38 |
| NTK-Frozen | 52.73 | 81.31 | 78.96 | 42.09 | 38.97 | 74.43 | 61.42 |
| PI | 51.11 | 81.14 | 77.44 | 37.19 | 38.03 | 71.74 | 59.44 |
| NTK-32k | 49.15 | 80.22 | 74.48 | 35.25 | 38.13 | 72.61 | 58.31 |
| NTK-64k | 46.08 | 78.32 | 70.68 | 34.27 | 39.08 | 70.24 | 56.45 |
| YaRN | 53.41 | 81.82 | 78.47 | 41.06 | 38.63 | 74.43 | 61.30 |
| CLEX | 50.60 | 81.27 | 76.06 | 37.54 | 36.10 | 64.72 | 57.72 |
| LongLora | 46.67 | 78.58 | 67.08 | 26.29 | 37.61 | 55.25 | 51.91 |
We had the following observations:
- Performance Degradation: Most long-context extension methods exhibit a slight decrease in performance on short-text tasks compared to the base model. Our discovery, aligning with what you suggested, shows that NTK-Frozen demonstrates better performance on short-text tasks compared to methods like NTK-RoPE.
- Trade-Off Between Long and Short Contexts: The reduction in short-text performance is more pronounced in models using continuous fine-tuning methods. This indicates a potential trade-off between enhancing long-context capabilities and maintaining optimal performance on short-text tasks.
- Alignment with Original Findings: These results align with our observations in Figure 3, where we analyze the average negative log-likelihood across different context positions, suggesting that extending context length can impact performance at shorter contexts. We will incorporate these findings into the revised manuscript, discussing the implications for knowledge retention and the balance between long-context capabilities and short-text performance.
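For reproducibility, short-text scores like those above can be regenerated with the EleutherAI lm-evaluation-harness; a hedged sketch using its v0.4-style Python API follows, where the checkpoint path and task list are assumptions rather than our exact configuration.

```python
import lm_eval

# Illustrative call; "pretrained=..." would point at each extended checkpoint in turn.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=meta-llama/Llama-2-7b-hf,dtype=float16",
    tasks=["arc_challenge", "arc_easy", "hellaswag", "mmlu", "truthfulqa_mc2", "winogrande"],
    batch_size=8,
)
for task, metrics in results["results"].items():
    print(task, metrics)
```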
Dear Reviewers,
We would like to express our sincere gratitude for the constructive feedback on Submission 8416. We have carefully addressed each of your concerns to improve the quality and robustness of our paper; details can be found in our responses to your individual reviews.
Below is a summary of the key actions we have undertaken in response to the reviewers' comments:
- Expansion to Larger and More Diverse Models (Reviewer ZKTb, Reviewer 3BXk)
  - In response to concerns about the limited number of base models, we have conducted additional experiments using LLaMA-13B, LLaMA-70B, and Qwen-2.5-7B as base models. These evaluations confirm that our findings are consistent across larger and more diverse LLMs, strengthening the generalizability of our conclusions.
- Evaluation of Short-Text Performance (Reviewer ZKTb, Reviewer 7saq)
  - To assess the impact of long-context extension methods on short-text performance, we evaluated our models on several benchmarks from the Open LLM Leaderboard. The results indicate that most long-context methods exhibit slight performance degradation on short-text tasks, whereas approximate attention methods are more robust on these tasks.
- Inference Speed Trade-Off (Reviewer j6WF)
  - To address the concern about inference speed trade-offs, we compared inference speed under controlled conditions on the same hardware setup. Our experiments show that approximate attention methods achieve a speedup of approximately 1.5x to 2x over the LLaMA baseline at short context lengths; however, this margin shrinks as the context length grows.
- Hyperparameter Sensitivity Analysis (Reviewer 3BXk)
  - We conducted a comprehensive hyperparameter sweep to analyze the effects on both training efficiency and inference performance. Our findings reveal that approximate attention methods, such as LongLoRA, are highly sensitive to hyperparameter changes, requiring meticulous tuning. In contrast, continual fine-tuning methods like NTK and YaRN demonstrated robustness across various hyperparameter configurations, resulting in more predictable training times and consistent inference performance.
- Generalization to Longer Contexts Beyond 32k (Reviewer 3BXk)
  - Although our initial study focused on context lengths up to 32k tokens in most cases (64k for two particular tasks), we have extended our evaluation to 64k and 128k tokens. The results show that exact fine-tuning methods perform well within their trained context range but encounter difficulties as the context length grows beyond 32k tokens. Approximate attention methods show potential for handling longer contexts, but often at the expense of reduced accuracy.
- Clarification and Reclassification of Methods (Reviewer 7saq, Reviewer j6WF)
  - Method Categorization: We will refine the categorization of context extension methods to avoid ambiguity. Specifically, we will reclassify methods such as Self-Extend under RoPE-based methods instead of grouping them with approximate attention methods (a purely illustrative sketch of the RoPE frequency adjustments behind this family is given after this list). Additionally, we will incorporate a discussion of recent developments in MinInference-style methods, acknowledging their promising performance and clarifying that they are orthogonal to our study.
  - PPL Correlation: We will update the manuscript to include a rank-based correlation measure between perplexity and downstream task performance (see the rank-correlation sketch after this list), addressing the reviewer's concerns and strengthening our claims.
- LM-Infinite Results (Reviewer j6WF)
  - We evaluated LM-Infinite on the Passkey Retrieval task. While it performs well at shorter context lengths, its performance degrades significantly once the context length exceeds its trained length. This aligns with findings from concurrent studies (e.g., InfLLM), suggesting that LM-Infinite attends effectively to nearby tokens but struggles with longer contexts.
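To make the RoPE-based category concrete, below is a minimal, purely illustrative Python sketch of the frequency adjustments that distinguish Position Interpolation from NTK-style scaling; the function names, the 10000 rotary base, and the scale factor of 8 are assumptions for the example, not our implementation.

```python
import torch

def rope_inv_freq(head_dim: int, base: float = 10000.0) -> torch.Tensor:
    # Standard RoPE inverse frequencies for one attention head.
    return 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))

def pi_positions(seq_len: int, scale: float) -> torch.Tensor:
    # Position Interpolation: compress position indices by the extension factor.
    return torch.arange(seq_len).float() / scale

def ntk_inv_freq(head_dim: int, scale: float, base: float = 10000.0) -> torch.Tensor:
    # NTK-aware scaling: enlarge the rotary base so that low-frequency
    # components are stretched while high-frequency (local) ones barely change.
    ntk_base = base * scale ** (head_dim / (head_dim - 2))
    return rope_inv_freq(head_dim, ntk_base)

# Example: extending a 4k-context model to 32k corresponds to scale = 8.
pi_angles  = torch.outer(pi_positions(32_768, 8.0), rope_inv_freq(128))
ntk_angles = torch.outer(torch.arange(32_768).float(), ntk_inv_freq(128, 8.0))
```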
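Likewise, the rank-based measure mentioned under "PPL Correlation" can be computed with Kendall's tau, as in the sketch below; the perplexity and accuracy values are placeholders, not our measured results.

```python
# Placeholder per-method perplexities and downstream scores, used only to
# illustrate the rank-correlation computation; these are not our results.
from scipy.stats import kendalltau

perplexity = [5.2, 5.4, 5.9, 6.8, 8.1, 10.3]
downstream_accuracy = [62.0, 61.5, 58.9, 55.1, 50.2, 41.7]

# Negate perplexity so that "better" points in the same direction for both metrics.
tau, p_value = kendalltau([-p for p in perplexity], downstream_accuracy)
print(f"Kendall's tau = {tau:.3f} (p = {p_value:.3g})")
```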
By implementing these revisions, we believe our submission offers a more thorough and balanced analysis of long-context extension methods in a controlled setting. The enhancements address the initial concerns raised by the reviewers and contribute valuable insights to the field.
Thank you for taking the time to read our rebuttal. We are confident that the improvements made have significantly strengthened the quality and impact of our work.
Best regards,
Authors of submission 8416
Dear Reviewers and Area Chair,
Thank you for your thoughtful feedback on our manuscript. We have uploaded a revised version that incorporates your valuable suggestions. The revision includes an expanded analysis with additional base models, short-context task results, a detailed discussion of generalization capabilities, a revised analysis of perplexity and downstream tasks, a hyperparameter search, and an expanded limitations section. For your convenience, all modifications are highlighted in green in the manuscript.
We welcome your further comments and suggestions to strengthen our work. The manuscript remains open for additional improvements based on your expertise and insights.
Thank you for your continued guidance in improving this research manuscript.
Best regards,
Authors of Submission 8416
The paper presents a controlled study evaluating various methods for extending the context length that Large Language Models (LLMs) can handle during inference. The authors argue that existing evaluations are often inconsistent due to variations in base models, datasets, and methodologies. The paper's primary strength is its attempt to conduct a controlled comparison of different long-context extension methods. While the paper addresses generalization to some extent, its exploration of very long contexts (e.g., 128k tokens and beyond) is limited, which is a significant aspect of long-context modeling. As Reviewer 3BXk notes, some of the findings have been touched upon in previous work; while the controlled study adds value, the paper could have done a better job of highlighting truly novel insights. Reviewer j6WF points out that some conclusions, particularly regarding perplexity, might be overstated or need a more nuanced presentation. I have also listed some issues raised by the reviewers, which were discussed during the rebuttal phase. While the paper has merit as a controlled experimental study, the reviewers raised concerns regarding the novelty of the findings, the generalizability of the conclusions, and certain aspects of the presentation and analysis. The authors are encouraged to prepare a stronger revision by taking the reviewers' comments into account.
Additional comments from the reviewer discussion
The initial submission was critiqued for its limited set of base models (only three). However, this was partially addressed in the rebuttal by adding LLaMA-13B, LLaMA-70B, and Qwen-7B.
There's a notable discrepancy between the paper's results on LM-Infinite and the original LM-Infinite paper's reported performance. While the authors acknowledge this and point to a concurrent study with similar findings, a more thorough investigation might be warranted.
The claim of a strong correlation between perplexity and downstream performance is somewhat weakened when considering only the most advanced methods (perplexity below 6). While this was addressed in the rebuttal with a Kendall's tau analysis, the tone of the claim might still need adjustment.
Reject