Rating: 6.4 / 10 (Poster · 5 reviewers · min 6, max 8, std 0.8)
Individual ratings: 6, 6, 6, 8, 6
Confidence: 3.4 · Correctness: 3.2 · Contribution: 2.6 · Presentation: 2.8
ICLR 2025

APE: Faster and Longer Context-Augmented Generation via Adaptive Parallel Encoding

Submitted: 2024-09-28 · Updated: 2025-03-02

Abstract

Context-augmented generation (CAG) techniques, including RAG and ICL, require the efficient combination of multiple contexts to generate responses to user queries. Directly inputting these contexts as a sequence introduces a considerable computational burden by re-encoding the combined selection of contexts for every request. To address this, we explore the promising potential of parallel encoding to independently pre-compute and cache each context's KV states. This approach enables the direct loading of cached states during inference while accommodating more contexts through position reuse across contexts. However, due to misalignments in attention distribution, directly applying parallel encoding results in a significant performance drop. To enable effective and efficient CAG, we propose Adaptive Parallel Encoding (**APE**), which brings shared prefix, attention temperature, and scaling factor to align the distribution of parallel encoding with sequential encoding. Results on RAG and ICL tasks demonstrate that APE can preserve 98% and 93% sequential encoding performance using the same inputs while outperforming parallel encoding by 3.6% and 7.9%, respectively. It also scales to many-shot CAG, effectively encoding hundreds of contexts in parallel. Efficiency evaluation shows that APE can achieve an end-to-end 4.5$\times$ speedup by reducing 28$\times$ prefilling time for a 128K-length context. The code is available at https://github.com/Infini-AI-Lab/APE.
Keywords
Parallel Encoding; Context-Augmented LLM; Efficient Inference; Length Extrapolation

Reviews and Discussion

Official Review — Rating: 6

APE proposes adaptive parallel encoding of the LLM prompt to speed up the prefill stage without being limited by the pretrained context length. APE adds sink tokens (a shared prefix), splits the context into chunks, and performs attention within each chunk. During decoding, the query attends to every previous KV state without adjusting RoPE, so the positions do not exceed the pretrained context length.

Strengths

  • Nice empirical analysis showing that value states can be merged and that key states are similar across context chunks.
  • Drastically speeds up the prefill stage.

Weaknesses

  1. The methodology might be too similar to fixed-step sparse attention with a sink token. When we draw the attention mask of APE, it is extremely similar to the step attention mechanism. The difference is that APE reuses the RoPE embeddings rather than extending them; however, treating the RoPE indices this way is the same as streaming with fixed-step sparse attention. Therefore, I think the scientific contribution of the methodology is limited.

  2. The performance of the method is mostly supported by empirical results. I do not consider the empirical results themselves a weakness, but I am concerned about the lack of important comparisons with previous techniques for extending the context window and speeding up prefill.

  • Lack of comparison with training-free context extension method (Self-Extend: https://github.com/datamllab/LongLM).

  • Lack of comparison with pretrained long-context LLMs. This might not be a big problem, but we need to know the performance upper limit if the LLM is already trained on long contexts. The LLMs used for the experiments are all short-context (and perhaps somewhat dated) models. I am concerned that using a long-context model such as Qwen2 or Llama 3.1, which supports 128k tokens and was trained on a large-scale GPU cluster, might lead to better performance than APE.

  • Lack of comparison with techniques to speed up the prefill stage.

--- Some minor improvements ---

  • Figures 3 and 4 are quite hard to understand. Can you increase the font size? I also do not think every layer needs to be visualized; can you randomly sample some layers and show only those (e.g., range(0, 32, 4))?
  • In Table 1, there is a typo in the Gemma2 row; maybe the columns were shifted to the left.
  • Tables 2, 4, and 5 might be too large.
  • Table 2's formatting should be improved.

Questions

  1. In the efficiency analysis, the claim of being 976x faster might be a bit of an overclaim because there is no performance (accuracy) evaluation at 512k. Can you add relevant benchmarks? (InfiniteBench: https://github.com/OpenBMB/InfiniteBench/, LOFT: https://github.com/google-deepmind/loft, RULER: https://github.com/NVIDIA/RULER)

  2. Table 5: what is the average context length for APE, and what is the real-world end-to-end latency, including retrieval latency? As far as I understand from the CRAG paper, the context length is quite limited (2k~4k) due to the speed of retrieval.

  3. What do you mean by "query and generation lengths were fixed at 256 tokens" in line 464? Does that mean the chunk size of APE is 256? From the latency numbers, I think 256 is the chunk size, which is NOT the one used in the performance evaluation. Do you have any performance evaluation with a 256 chunk size? If my understanding is correct, the reported latency might differ significantly from real-world scenarios.

Comment

Thank you so much for the insightful and valuable comments! They are very helpful for further improving the clarity and quality of our paper. We'll revise our manuscript in the next version to address all of your concerns.

Q1: The methodology might be too similar to fixed-step sparse attention with a sink token. When we draw the attention mask of APE, it is extremely similar to the step attention mechanism. The difference is that APE reuses the RoPE embeddings rather than extending them; however, treating the RoPE indices this way is the same as streaming with fixed-step sparse attention. Therefore, I think the scientific contribution of the methodology is limited.

A1: Our APE is designed specifically for context-augmented generation, where multiple contexts are stored in an external database. The attention mask in APE has two distinctive characteristics that set it apart. First, in the context processing phase, each context can be encoded independently without attending to other contexts. This design enables pre-caching of context KV cache, improving computational efficiency. Second, when inputting the query, it can attend to all contexts, ensuring access to information across the entire context length.

These architectural choices fundamentally differentiate APE from previous methods that rely on fixed-step sparse attention with sink tokens. Such earlier approaches face two critical limitations: they require computationally expensive real-time context re-encoding, and their restricted attention masks prevent queries from effectively accessing the complete context, resulting in suboptimal performance.

The primary novelty of APE's attention mask lies in enabling a new setting for faster and longer context-augmented generation while maintaining good performance. APE is the first method to successfully achieve this for context-augmented generation, marking a significant advancement in the field.
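To make the mask structure concrete, here is a minimal PyTorch sketch (illustrative only, not the released implementation; the function name and toy sizes are our own) that builds the boolean attention mask described above: each context attends only to the shared prefix and to itself, while query tokens attend to everything before them.

```python
import torch

def parallel_encoding_mask(prefix_len, context_lens, query_len):
    """Boolean attention mask (True = may attend) with the structure described above:
    each context attends causally to the shared prefix and to itself only, while query
    tokens attend to the prefix, every context, and earlier query tokens."""
    total = prefix_len + sum(context_lens) + query_len
    mask = torch.zeros(total, total, dtype=torch.bool)

    # Shared prefix: causal attention within itself.
    mask[:prefix_len, :prefix_len] = torch.tril(
        torch.ones(prefix_len, prefix_len, dtype=torch.bool))

    # Each context: sees the shared prefix plus its own tokens (causally), nothing else,
    # so its KV states can be pre-computed and cached independently.
    start = prefix_len
    for n in context_lens:
        end = start + n
        mask[start:end, :prefix_len] = True
        mask[start:end, start:end] = torch.tril(torch.ones(n, n, dtype=torch.bool))
        start = end

    # Query: attends to everything that precedes it, plus causally to itself.
    q_start = total - query_len
    mask[q_start:, :q_start] = True
    mask[q_start:, q_start:] = torch.tril(torch.ones(query_len, query_len, dtype=torch.bool))
    return mask

# Example: a 2-token prefix, two 3-token contexts, and a 2-token query.
print(parallel_encoding_mask(2, [3, 3], 2).int())
```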


Q2: Lack of comparison with training-free context extension method (Self-Extend: https://github.com/datamllab/LongLM).

A2: First, we need to clarify that APE is primarily designed for context-augmented generation, while long-context capability is a by-product. Here, we added more baselines on the LongBench dataset [1], where we evaluated 11 subsets in total and compared APE with Self-Extend [2]. The results in Table R1 show that APE achieves a 6.61% and 2.15% improvement over the base model and Self-Extend, respectively. Therefore, APE can serve as a SOTA method for length extension.

Table R1: Comparison of APE and Self-Extend on the LongBench dataset.

| Method | NarratQA | Qasper | MultiFQA | GovReport | QMSum | LCC | RepoBench-P | HotpotQA | 2WikiMQA | MuSiQue | MultiNews | Average |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LLaMA-3-8B-Instruct | 19.32 | 32.83 | 43.38 | 27.89 | 22.40 | 53.22 | 38.15 | 44.24 | 21.01 | 20.47 | 23.63 | 31.50 |
| Self-Extend | 24.82 | 37.94 | 50.99 | 30.48 | 23.36 | 58.01 | 41.83 | 51.09 | 24.17 | 28.73 | 24.11 | 35.96 |
| APE | 26.87 | 39.14 | 59.12 | 29.10 | 23.08 | 66.09 | 49.43 | 50.11 | 28.06 | 25.79 | 22.40 | 38.11 |
Comment

Q3: Lack of comparison with pretrained long-context LLMs. This might not be a big problem, but we need to know the performance upper limit if the LLM is already trained on long contexts. The LLMs used for the experiments are all short-context (and perhaps somewhat dated) models. I am concerned that using a long-context model such as Qwen2 or Llama 3.1, which supports 128k tokens and was trained on a large-scale GPU cluster, might lead to better performance than APE.

A3: First, we need to clarify that APE is primarily designed for context-augmented generation, while long-context capability is a by-product. Here, we compared the performance of APE and sequential encoding using LLaMA-3.1-8B-Instruct as the pretrained long-context LLM on the LongBench dataset. Our evaluation in Table R2 reveals that APE exhibits only a marginal performance decrease of 1.67% compared to Sequential Encoding. This minimal difference does not diminish APE's advantages, as achieving the slight performance gain through sequential encoding would require substantial computational overhead during prefilling. Additionally, starting from a pretrained long-context LLM, APE can always extend to much longer contexts and lead to better performance.


Q4: Lack of comparison with techniques to speed up the prefill stage.

A4: Thank you for your suggestion. We additionally add MInference [3] as another baseline for prefilling acceleration. As shown in Table R3, APE achieves up to 28x acceleration in prefilling compared to Sequential Encoding with MInference for 128K-token contexts. Furthermore, APE's prefilling overhead scales linearly, consuming less than 10% of the total inference time, while baseline methods require over 50% as context length increases. APE also demonstrates superior versatility across various settings, whereas MInference introduces additional overhead and slower inference speeds when processing short contexts and large batches.

Table R3: Latency on H100 GPU: [prefilling / total time (ms)]. We compared different strategies with various context lengths and batch sizes.

| Context Length | 4K | 8K | 16K | 32K | 64K | 128K |
|---|---|---|---|---|---|---|
| Sequential (bsz=1) | 132 / 2383 | 276 / 2558 | 635 / 3204 | 1646 / 4713 | 4892 / 8804 | 16635 / 22252 |
| Sequential + MInference (bsz=1) | 964 / 3221 | 1297 / 3609 | 1623 / 4201 | 2206 / 5274 | 3563 / 7458 | 6457 / 12085 |
| APE (bsz=1) | 28 / 2279 | 40 / 2322 | 68 / 2637 | 122 / 3189 | 226 / 4138 | 421 / 6038 |
| Sequential (bsz=4) | 523 / 3129 | 1084 / 4179 | 2558 / 6474 | 6599 / 12055 | 19468 / 28367 | 65928 / 83869 |
| Sequential + MInference (bsz=4) | 3789 / 6373 | 4995 / 8099 | 6300 / 10244 | 8470 / 13944 | 13462 / 22313 | 24199 / 42013 |
| APE (bsz=4) | 69 / 2675 | 94 / 3189 | 150 / 4066 | 298 / 5754 | 430 / 9329 | 858 / 18799 |
Comment

Thank you for the insightful and extensively verified rebuttal. The response clearly resolves all of my concerns. I think I had misunderstood the concept of the whole paper.


However, I think the authors' claim that "existing RAG benchmarks do not fully support 128K context length" is not true, because the LOFT benchmark already supports RAG with context documents of up to 1M tokens. I think the LOFT benchmark is the most important one for APE. LOFT is a perfect benchmark to test APE because (1) the context is a concatenation of thousands of short documents, which is exactly what APE targets; (2) the prompt is shared across queries, enabling the partial prefix caching emphasized by the authors; and (3) the context length is extremely long (1M), so speeding up the prefill stage is extremely important.

I wonder whether we can compare performance on LOFT (though I understand there is no time now...). When I played with that dataset, a 128k evaluation on the RAG-NQ subset took only 10 to 20 minutes if the prompt was cached properly. I hope to see updated results if possible, either during the rebuttal or in the final revision.


I checked the provided code, and it is a little disappointing that the implementation is not very efficient. I expected the codebase to be built on a PagedAttention framework with prefix caching features, such as SGLang. I think the efficiency of the implementation is extremely important since the paper proposes an efficiency mechanism. I encourage the authors to contribute APE to open-source projects such as vLLM and SGLang after the submission.


After reconsidering the main manuscript alongside the updated rebuttal, I want to raise my score from 3 (reject) to 6 (borderline accept). Again, thank you for the extensively verified rebuttal, and I respect the authors for trying their best until the last hours.

Comment

Thank you again for your encouragement and suggestions! We strongly agree that prefix caching is very useful for speeding up computation, and APE enables systems to generalize prefix caching mechanisms, particularly when multiple contexts are available. Therefore, we plan to reach out to the vLLM and SGLang teams and bring APE to these frameworks.

Additionally, we want to further discuss the potential of APE for improving prefix caching mechanisms. For example, a normal prefix cache must compute and store the prefix sequences [c0, c1], [c1, c2], and [c0, c2] three separate times, and the number of combinations explodes exponentially as more contexts are combined. In comparison, APE allows us to cache c0, c1, and c2 independently, so all potential combinations of {c0, c1, c2} can be loaded directly with only linear memory complexity, bringing potential for further memory savings and speedup. As a result, our approach complements and can enhance previous prefix caching mechanisms.
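As an illustration of this point, the following sketch (hypothetical helper names and toy tensor shapes, not taken from the released code) caches each context's KV states once and assembles any requested combination by concatenation, so memory grows linearly in the number of contexts rather than with the number of prefix combinations.

```python
import torch

# Hypothetical per-context KV cache: context_id -> (key, value) tensors of shape
# [num_layers, num_heads, ctx_len, head_dim], pre-computed once with APE.
kv_cache = {}

def cache_context(context_id, key, value):
    """Store a context's KV states once; they never need to be re-encoded."""
    kv_cache[context_id] = (key, value)

def load_combination(context_ids):
    """Assemble the KV states for any subset of cached contexts by concatenating along
    the sequence dimension: memory grows linearly in the number of contexts instead of
    caching every prefix combination separately."""
    keys = torch.cat([kv_cache[c][0] for c in context_ids], dim=2)
    values = torch.cat([kv_cache[c][1] for c in context_ids], dim=2)
    return keys, values

# Toy shapes: 2 layers, 4 heads, 8 tokens per context, head dim 16.
for cid in ["c0", "c1", "c2"]:
    cache_context(cid, torch.randn(2, 4, 8, 16), torch.randn(2, 4, 8, 16))
k, v = load_combination(["c0", "c2"])   # any combination, no recomputation
print(k.shape)  # torch.Size([2, 4, 16, 16])
```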

Comment

Dear Reviewer kKET,

Thank you very much for the increased score and for recommending the LOFT dataset. We will conduct additional experiments using LOFT subsets and include these results in our rebuttal. If time constraints prevent us from completing these experiments before the rebuttal deadline, we commit to providing comprehensive results in the final revision.

Thanks,

The Authors

Comment

Q7: What do you mean by "query and generation lengths were fixed at 256 tokens" in line 464? Does that mean the chunk size of APE is 256? From the latency numbers, I think 256 is the chunk size, which is NOT the one used in the performance evaluation. Do you have any performance evaluation with a 256 chunk size? If my understanding is correct, the reported latency might differ significantly from real-world scenarios.

A7: Here, 256 does not refer to the chunk size. Instead, it means that the query (without context) is at most 256 tokens, and the model generates at most 256 new tokens. With the additional context length ranging from 4K to 512K tokens, this setting simulates real-world scenarios with a long context, short query, and short response, as shown in Table 1 of the original paper. The context itself may be very long and concatenated from multiple chunks, but the size of each chunk should approach the model's maximum sequence length (e.g., 8K tokens for LLaMA-3-8B-Instruct).


References:

[1] Bai, Yushi, et al. "Longbench: A bilingual, multitask benchmark for long context understanding." arXiv preprint arXiv:2308.14508 (2023).

[2] Jin, Hongye, et al. "Llm maybe longlm: Self-extend llm context window without tuning." arXiv preprint arXiv:2401.01325 (2024).

[3] Jiang, Huiqiang, et al. "Minference 1.0: Accelerating pre-filling for long-context llms via dynamic sparse attention." arXiv preprint arXiv:2407.02490 (2024).

[4] Hsieh, Cheng-Ping, et al. "RULER: What's the Real Context Size of Your Long-Context Language Models?." arXiv preprint arXiv:2404.06654 (2024).

[5] Greg Kamradt. Needle in a haystack - pressure testing llms. 2023.

[6] Tang, Jiaming, et al. "Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference." arXiv preprint arXiv:2406.10774 (2024).

[7] Sun, Hanshi, et al. "ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference." arXiv preprint arXiv:2410.21465 (2024).

Comment

Q5: In the efficiency analysis, the claim of being 976x faster might be a bit of an overclaim because there is no performance (accuracy) evaluation at 512k. Can you add relevant benchmarks? (InfiniteBench: https://github.com/OpenBMB/InfiniteBench/, LOFT: https://github.com/google-deepmind/loft, RULER: https://github.com/NVIDIA/RULER)

A5: First, we need to clarify that APE is primarily designed for context-augmented generation. However, as existing RAG benchmarks do not fully support 128K context lengths, we demonstrated APE's effectiveness by evaluating it on the RULER [4] and Needle In A Haystack [5] datasets with 128K context lengths, where the original context is divided into 8 chunks of 16K positions. As shown in Table R4, APE demonstrates only a 5% performance decrease compared to sequential encoding. This decline is primarily observed in tasks requiring inter-context dependencies, specifically the VT and FWE subtasks. Compared to other baselines, APE even outperforms Quest [6] and approaches ShadowKV [7], which use 128K positions with dynamic sparse attention. Therefore, APE should work for longer contexts, especially context-augmented generation without inter-context dependencies. These results, combined with the latency analysis, show the effectiveness and efficiency of APE for extremely long contexts.

Table R4: Experiment results of APE on the RULER dataset.

| Model | N-S1 | N-S2 | N-MK1 | N-MK2 | N-MQ | N-MV | QA-1 | QA-2 | VT | FWE | Average |
|---|---|---|---|---|---|---|---|---|---|---|---|
| LLaMA-3.1-8B-Instruct | 100 | 100 | 98.96 | 91.67 | 98.96 | 95.31 | 82.29 | 47.92 | 68.96 | 71.18 | 85.525 |
| Quest | 100 | 98.96 | 97.92 | 34.38 | 93.49 | 88.54 | 70.83 | 44.79 | 65.63 | 68.4 | 76.294 |
| ShadowKV | 100 | 100 | 100 | 83.33 | 97.92 | 92.19 | 81.25 | 48.96 | 67.08 | 64.93 | 83.566 |
| APE | 100 | 96.88 | 97.92 | 85.42 | 97.92 | 93.75 | 75 | 46.88 | 56.04 | 59.72 | 80.953 |

In Table R5, we further validate that APE can always retrieve the needle from the 128K context on the Needle In A Haystack dataset.

Table R5: Experiment results of APE on the Needle In A Haystack dataset.

| Context Length | 1K | 10K | 19K | 28K | 37K | 46K | 55K | 64K | 73K | 82K | 91K | 100K | 109K | 118K | 128K |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Average Accuracy (%) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |

Additionally, in the updated efficiency analysis Table R3, we only evaluate from 4K to 128K, which matches the maximum length in the above benchmarks. Therefore, the current performance and efficiency results should be consistent.


Q6: Table 5: what is the average context length for APE, and what is the real-world end-to-end latency, including retrieval latency? As far as I understand from the CRAG paper, the context length is quite limited (2k~4k) due to the speed of retrieval.

A6: In our implementation of RAG on the CRAG benchmark, we employ sequential encoding to retrieve the top 10 chunks, each approximately 200 tokens in length. In contrast, when using APE, we retrieve the same number of chunks (top 10), but with significantly larger chunks of approximately 2,000 tokens each. Therefore, this longer context led to the improved performance in Table 5. Notably, despite the larger chunk size, APE demonstrates better latency as shown in Table R6, with improvements in both retrieval and prefilling time. Therefore, the retrieval time is not a bottleneck on this benchmark. This efficiency gain occurs because the increased chunk length reduces the size of the database that needs to be processed, thereby accelerating retrieval.

Table R6: End-to-end latency of APE and sequential encoding on the CRAG benchmark.

| Method | Retrieval Time (ms) | Prefilling Time (ms) | Generation Time (ms) | Total Time (ms) |
|---|---|---|---|---|
| Sequential Encoding | 356 | 152 | 632 | 1140 |
| APE | 263 | 86 | 705 | 1054 |

While the speedup was not significant on CRAG due to its short context length, we conducted additional measurements using the Ruler benchmark to demonstrate acceleration in long-context scenarios. As shown in Table R7, with a 128K context length, prefilling becomes a significant bottleneck for sequential encoding. In this setting, APE achieved a 15 times speedup in inference time.

Table R7: End-to-end latency of APE and sequential encoding on the Ruler benchmark.

| Method | Prefilling Time (ms) | Generation Time (ms) | Total Time (ms) |
|---|---|---|---|
| Sequential Encoding | 18127 | 812 | 18427 |
| APE | 421 | 833 | 1254 |
Comment

Based on your suggestions, we further evaluated our approach on four text retrieval tasks from the LOFT benchmark using LLaMA-3.1-8B-Instruct as our base model, with performance measured by the F1 score between ground truth and model predictions. To simulate realistic scenarios where RAG systems may retrieve suboptimal content, we randomly sampled passages from the corpus as context, enabling analysis of how performance scales with additional context integration (otherwise, a short-context LLM would suffice if the retriever were very accurate, which is not the setting we target). For the 128k version of LOFT, we compared APE against Sequential Encoding using varying context lengths in Table R8.

Table R8. Comparison of APE and Sequential Encoding on four text retrieval tasks from the 128k version of LOFT benchmark using the LLaMA-3.1-8B-Instruct model.

| Method | Inference Time | Arguana | Fever | NQ | SciFact | Average |
|---|---|---|---|---|---|---|
| 8K, Sequential | 847 | 11.15 | 7.78 | 17.78 | 7.74 | 11.11 |
| 16K, Sequential | 1277 | 11.20 | 9.78 | 17.81 | 9.49 | 12.07 |
| 32K, Sequential | 2413 | 12.15 | 10.22 | 15.78 | 9.74 | 11.97 |
| 64K, Sequential | 5870 | 15.34 | 13.12 | 19.64 | 16.12 | 16.06 |
| 128K, Sequential | 18039 | 12.84 | 14.19 | 24.54 | 16.88 | 17.11 |
| 128K, APE | 1825 | 16.32 | 14.70 | 21.91 | 15.72 | 17.16 |

Our analysis yielded several key findings:

  • APE demonstrates superior performance on average compared to all baselines, validating APE’s scalability in processing hundreds/thousands of retrieved texts in parallel.

  • APE achieves performance comparable to the 128K Sequential Encoding baseline while utilizing the same context. Although transferring from sequential encoding to APE would typically be expected to cause performance degradation, we hypothesize that APE recovers this by effectively shortening the distance between the query and all examples. This mechanism helps address the “lost in the middle” problem common in long-context LLMs.

  • APE demonstrates remarkable computational efficiency, with its 128K variant requiring inference times comparable to Sequential Encoding at much shorter context lengths (between 16K and 32K), while achieving approximately 10 times faster processing compared to Sequential Encoding at equivalent context length.

We also compared APE with Sequential Encoding using the LLaMA-3.1-8B-Instruct model on the 1m version of LOFT, as shown in Table R9. While APE improves by 5.26% on average using 8$\times$ more retrieved texts, it requires only 1/3 of the inference time by pre-caching these texts.

Table R9. Comparison of APE and Sequential Encoding on four text retrieval tasks from the 1m version of LOFT benchmark using the LLaMA-3.1-8B-Instruct model.

| Method | Inference Time | Arguana | Fever | NQ | SciFact | Average |
|---|---|---|---|---|---|---|
| 128K, Sequential | 18039 | 11.28 | 8.13 | 17.55 | 7.54 | 11.13 |
| 1M, APE | 5073 | 15.89 | 15.13 | 20.78 | 14.55 | 16.59 |
Official Review — Rating: 6

This paper introduces Adaptive Parallel Encoding, a method to improve the efficiency of large language models when processing multiple external contexts. APE pre-caches key-value states of contexts separately and enables position reuse during inference. The method consists of three key components: a shared prefix to align initial token distributions, a scaling factor to offset increased attention weights, and lower attention temperature to focus on semantically important tokens. The authors demonstrate that APE achieves a 976× speedup for long context generation while maintaining 93% accuracy compared to sequential encoding.

Strengths

  1. The authors provide thorough empirical analysis to understand the behavior of attention mechanisms and KV states.
  2. The method is practical as it requires no additional training and can be implemented with minimal modifications to existing architectures.
  3. Comprehensive evaluation across multiple tasks (ICL, RAG, long-context understanding).

Weaknesses

  1. The performance evaluation is primarily focused on 8k context length, which feels insufficient given that many open-source LLMs now support context lengths of 128k or more. This restricted evaluation scope makes it difficult to assess the method's scalability to longer contexts.
  2. Compared to sequential encoding, APE introduces a non-negligible performance degradation at the 8k length scale, raising concerns about its effectiveness at longer context lengths.

Questions

  1. How does the computational complexity of APE scale with increasing context length compared to sequential encoding?
  2. Why does the method perform particularly poorly on code completion tasks with continuous long contexts?
  3. How sensitive is the method to the choice of hyperparameters (scaling factor and attention temperature) across different models and tasks?
Comment

Thank you for your valuable comments and suggestions. We will carefully revise our paper based on your comments. Our responses to your questions are detailed below. We would greatly appreciate your input on whether our revisions address your concerns.


Q1: The performance evaluation is primarily focused on 8k context length, which feels insufficient given that many open-source LLMs now support context lengths of 128k or more. This restricted evaluation scope makes it difficult to assess the method's scalability to longer contexts.

A1: First, we need to clarify that APE is primarily designed for context-augmented generation. However, as existing RAG benchmarks do not fully support 128K context lengths, we demonstrated APE's effectiveness by evaluating it on the RULER [1] and Needle In A Haystack [2] datasets with 128K context lengths, where the original context is divided into 8 chunks of 16K positions. We also include Quest [3] and ShadowKV [4] as two baselines using dynamic sparse attention and 128K positions. As shown in Table R1, APE demonstrates only a 5% performance decrease compared to Sequential Encoding. This decline is primarily observed in tasks requiring inter-context dependencies, specifically the VT and FWE subtasks. Compared to other baselines, APE even outperforms Quest and approaches ShadowKV despite using only 1/8 of the positions. Therefore, APE should work for longer contexts, especially context-augmented generation without inter-context dependencies.

Table R1: Experiment results of APE on the RULER dataset.

| Model | N-S1 | N-S2 | N-MK1 | N-MK2 | N-MQ | N-MV | QA-1 | QA-2 | VT | FWE | Average |
|---|---|---|---|---|---|---|---|---|---|---|---|
| LLaMA-3.1-8B-Instruct | 100 | 100 | 98.96 | 91.67 | 98.96 | 95.31 | 82.29 | 47.92 | 68.96 | 71.18 | 85.525 |
| Quest | 100 | 98.96 | 97.92 | 34.38 | 93.49 | 88.54 | 70.83 | 44.79 | 65.63 | 68.4 | 76.294 |
| ShadowKV | 100 | 100 | 100 | 83.33 | 97.92 | 92.19 | 81.25 | 48.96 | 67.08 | 64.93 | 83.566 |
| APE | 100 | 96.88 | 97.92 | 85.42 | 97.92 | 93.75 | 75 | 46.88 | 56.04 | 59.72 | 80.953 |

In Table R2, we further validate that APE can always retrieve the needle from the 128K context on the Needle In A Haystack dataset.

Table R2: Experiment results of APE on the Needle In A Haystack dataset.

| Context Length | 1K | 10K | 19K | 28K | 37K | 46K | 55K | 64K | 73K | 82K | 91K | 100K | 109K | 118K | 128K |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Average Accuracy (%) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |

Q2: Compared to sequential encoding, APE introduces a non-negligible performance degradation at the 8k length scale, raising concerns about its effectiveness at longer context lengths.

A2: We admit that APE leads to some performance degradation compared to sequential encoding. However, this minimal difference does not diminish APE's advantages, as achieving the slight performance gain through sequential encoding would require substantial computational overhead during prefilling. The results in Tables R1 and R2 further validate that this performance degradation does not increase when scaling to longer context lengths.

Moreover, the reuse of positions enables APE to leverage more contexts, thereby outperforming sequential encoding in practical applications, as shown in Section 6.3.


Q3: How does the computational complexity of APE scale with increasing context length compared to sequential encoding?

A3: We add an analysis of the computational complexity of APE and Sequential Encoding. Following Section 2.1, let the context-augmented input be

$$\mathcal{S} = \{\underbrace{s_{C_0, 0}, \ldots, s_{C_0, l_0}}_{\text{Context 0}}, \underbrace{s_{C_1, 0}, \ldots, s_{C_1, l_1}}_{\text{Context 1}}, \ldots, \underbrace{s_{C_{N-1}, 0}, \ldots, s_{C_{N-1}, l_{N-1}}}_{\text{Context N-1}}, \underbrace{s_0, \ldots, s_l}_{\text{Query}}\},$$

where $l_0 + \ldots + l_{N-1} \gg l$. We analyze the prefilling time of Sequential Encoding and APE.

  • Sequential Encoding: $O((l_0+\ldots+l_{N-1}+l)^2)$

  • APE with pre-caching: $O((l_0+\ldots+l_{N-1}+l)\cdot l)$

  • APE without pre-caching: $O(\max(l_0^2, \ldots, l_{N-1}^2)+(l_0+\ldots+l_{N-1}+l)\cdot l)$

Therefore, APE achieves linear complexity with respect to total context length when using pre-caching, resulting in significantly improved efficiency. Even without pre-caching, APE maintains an advantage by allowing independent prefilling of each context. In this case, its complexity scales quadratically with the maximum context length, which is more efficient than Sequential Encoding's quadratic scaling with total context length.
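For intuition, the following back-of-the-envelope script plugs hypothetical lengths (eight 16K-token contexts and a 256-token query; not numbers from the paper) into the cost terms above:

```python
# Back-of-the-envelope comparison of the prefill cost terms above, using
# hypothetical lengths: eight 16K-token contexts plus a 256-token query.
ls = [16_000] * 8          # context lengths l_0, ..., l_{N-1}
l = 256                    # query length
total = sum(ls) + l

sequential   = total ** 2                                  # O((l_0+...+l_{N-1}+l)^2)
ape_cached   = total * l                                   # O((l_0+...+l_{N-1}+l) * l)
ape_uncached = max(li ** 2 for li in ls) + total * l       # O(max(l_i^2) + (...) * l)

print(f"sequential vs. APE (pre-cached):  {sequential / ape_cached:.0f}x")
print(f"sequential vs. APE (no caching):  {sequential / ape_uncached:.0f}x")
```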

Comment

Q4: Why does the method perform particularly poorly on code completion tasks with continuous long contexts?

A4: This is because code generation tasks require the LLM to continue the code from the end of the context, so directly applying APE mixes up nearby and distant context. To address this limitation, we preserve the final 500 tokens of the provided code as the initial segment of the query. This modification significantly improves APE's performance on code generation tasks, as demonstrated by the results in Table R3.

Table R3: Comparison of APE and other methods on long-context code generation.

| Method | LCC | RepoBench-P |
|---|---|---|
| LLaMA-3-8B-Instruct | 53.22 | 38.15 |
| + LLMLingua2 | 16.41 | 20.56 |
| + StreamingLLM | 40.02 | 26.16 |
| + Long-context FT | 55.12 | 43.05 |
| + Self-Extend | 58.01 | 41.83 |
| + APE | 66.09 | 49.43 |
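For clarity, the query-split strategy described in A4 can be sketched as follows (hypothetical helper; integer lists stand in for real tokenized code):

```python
def split_for_code_completion(code_tokens, chunk_size, tail_len=500):
    """Sketch of the modification described in A4 (hypothetical helper): keep the final
    `tail_len` tokens of the code as the beginning of the query so that the text next to
    the completion point is encoded sequentially with the query, while earlier code is
    split into independently (parallel-)encoded chunks."""
    context, tail = code_tokens[:-tail_len], code_tokens[-tail_len:]
    chunks = [context[i:i + chunk_size] for i in range(0, len(context), chunk_size)]
    return chunks, tail  # `tail` is prepended to the query and encoded sequentially

# Toy example with integer "tokens".
chunks, query_prefix = split_for_code_completion(list(range(20_000)), chunk_size=8_000)
print(len(chunks), len(query_prefix))  # 3 500
```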

Q5: How sensitive is the method to the choice of hyperparameters (scaling factor and attention temperature) across different models and tasks?

A5: Although APE introduces three hyperparameters (shared prefix, scaling factor, and attention temperature), it is not very sensitive to them. First, the shared prefix is only applied to tasks lacking existing prefixes (e.g., GSM8K and TriviaQA). Our experimental results in Table R4 demonstrate that introducing a shared prefix increases performance by 7.38% on average, while variations in prefix length result in only a 0.4% difference in performance. These findings indicate that while the presence of a shared prefix is beneficial, APE's performance is not significantly sensitive to the prefix length itself.

Table R4: Ablation study of prefix length on GSM8K and TriviaQA using the LLaMA-3-8B-Instruct model.

| Prefix Length | 0 | 10 | 20 | 40 |
|---|---|---|---|---|
| GSM8K | 62.17 | 72.10 | 71.87 | 71.69 |
| TriviaQA | 67.48 | 72.31 | 72.51 | 72.65 |

For the scaling factor $s$ and attention temperature $T$, we evaluated various combinations across four QA tasks from the LongBench dataset, as shown in Table R5. While these parameters are relatively sensitive compared to the prefix length, their optimal values fall within a narrow range. In our experiments, both $s$ and $T$ were selected from {0.7, 0.8, 0.9, 1.0}, making hyperparameter optimization computationally tractable on the validation set. Notably, although LongBench examples contain varying numbers of contexts, APE's hyperparameters generalized robustly across tasks with different context lengths, indicating low sensitivity.

Table R5: Ablation study of the scaling factor s and attention temperature T on four subsets from LongBench using the LLaMA-3-8B-Instruct model.

| ($s$, $T$) | MuSiQue | NarrativeQA | HotpotQA | 2WikiMQA |
|---|---|---|---|---|
| (1.0, 1.0) | 24.13 | 26.93 | 49.71 | 43.09 |
| (1.0, 0.9) | 26.17 | 30.71 | 47.79 | 41.19 |
| (0.9, 0.9) | 24.32 | 27.93 | 49.6 | 38.21 |
| (0.8, 0.8) | 26.19 | 27.01 | 47.32 | 35.41 |
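To clarify where these two hyperparameters act, below is a simplified single-head sketch of one plausible reading of the mechanism (an assumed formulation for illustration only; it omits RoPE, masking, and other details of the actual method): the temperature $T$ rescales the attention logits of context tokens, and the scaling factor $s$ rescales their exponentiated weights before normalization.

```python
import torch

def ape_style_attention(q, k_ctx, v_ctx, k_loc, v_loc, s=0.9, T=0.9):
    """Single-head sketch: context tokens get a temperature T on their logits and a
    scaling factor s on their exponentiated weights; local (prefix/query) tokens are
    left unchanged. This is an assumed formulation for illustration, not the exact code."""
    d = q.shape[-1]
    logits_ctx = (q @ k_ctx.T) / (T * d ** 0.5)   # lower T sharpens context attention
    logits_loc = (q @ k_loc.T) / (d ** 0.5)
    w_ctx = s * torch.exp(logits_ctx)             # s offsets the inflated context weights
    w_loc = torch.exp(logits_loc)
    denom = w_ctx.sum(-1, keepdim=True) + w_loc.sum(-1, keepdim=True)
    return (w_ctx / denom) @ v_ctx + (w_loc / denom) @ v_loc

# Toy example: one query token, 8 context tokens, 4 local tokens, head dim 16.
q = torch.randn(1, 16)
out = ape_style_attention(q, torch.randn(8, 16), torch.randn(8, 16),
                          torch.randn(4, 16), torch.randn(4, 16))
print(out.shape)  # torch.Size([1, 16])
```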

References:

[1] Hsieh, Cheng-Ping, et al. "RULER: What's the Real Context Size of Your Long-Context Language Models?." arXiv preprint arXiv:2404.06654 (2024).

[2] Greg Kamradt. Needle in a haystack - pressure testing llms. 2023.

[3] Tang, Jiaming, et al. "Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference." arXiv preprint arXiv:2406.10774 (2024).

[4] Sun, Hanshi, et al. "ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference." arXiv preprint arXiv:2410.21465 (2024).

Comment

Dear Reviewer onHd,

Based on the suggestions from Reviewer kKET, we further evaluated our approach on four text retrieval tasks from the LOFT benchmark to represent long-context RAG scenarios, which further validates APE's scalability in processing hundreds or thousands of retrieved texts, with a total length of 1M, in parallel. We hope these results, combined with the results on the RULER and Needle-In-A-Haystack datasets, comprehensively show the effectiveness of APE in extreme long-context scenarios.

Here, we use LLaMA-3.1-8B-Instruct as our base model, with performance measured by the F1 score between ground truth and model predictions. To simulate realistic scenarios where RAG systems may retrieve suboptimal content, we randomly sampled passages from the corpus as context, enabling analysis of how performance scales with additional context integration (otherwise, a short-context LLM would suffice if the retriever were very accurate, which is not the setting we target). For the 128k version of LOFT, we compared APE against Sequential Encoding using varying context lengths in Table R9.

Table R9. Comparison of APE and Sequential Encoding on four text retrieval tasks from the 128k version of LOFT benchmark using the LLaMA-3.1-8B-Instruct model.

| Method | Inference Time | Arguana | Fever | NQ | SciFact | Average |
|---|---|---|---|---|---|---|
| 8K, Sequential | 847 | 11.15 | 7.78 | 17.78 | 7.74 | 11.11 |
| 16K, Sequential | 1277 | 11.20 | 9.78 | 17.81 | 9.49 | 12.07 |
| 32K, Sequential | 2413 | 12.15 | 10.22 | 15.78 | 9.74 | 11.97 |
| 64K, Sequential | 5870 | 15.34 | 13.12 | 19.64 | 16.12 | 16.06 |
| 128K, Sequential | 18039 | 12.84 | 14.19 | 24.54 | 16.88 | 17.11 |
| 128K, APE | 1825 | 16.32 | 14.70 | 21.91 | 15.72 | 17.16 |

Our analysis yielded several key findings:

  • APE demonstrates superior performance on average compared to all baselines, validating APE’s scalability in processing hundreds/thousands of retrieved texts in parallel.

  • APE achieves performance comparable to the 128K Sequential Encoding baseline while utilizing the same context. Although transferring from sequential encoding to APE would typically be expected to cause performance degradation, we hypothesize that APE recovers this by effectively shortening the distance between the query and all examples. This mechanism helps address the “lost in the middle” problem common in long-context LLMs.

  • APE demonstrates remarkable computational efficiency, with its 128K variant requiring inference times comparable to Sequential Encoding at much shorter context lengths (between 16K and 32K), while achieving approximately 10 times faster processing compared to Sequential Encoding at equivalent context length.

We also compared APE with Sequential Encoding using the LLaMA-3.1-8B-Instruct model on the 1m version of LOFT, as shown in Table R10. While APE improves by 5.26% on average using 8$\times$ more retrieved texts, it requires only 1/3 of the inference time by pre-caching these texts.

Table R10. Comparison of APE and Sequential Encoding on four text retrieval tasks from the 1m version of LOFT benchmark using the LLaMA-3.1-8B-Instruct model.

| Method | Inference Time | Arguana | Fever | NQ | SciFact | Average |
|---|---|---|---|---|---|---|
| 128K, Sequential | 18039 | 11.28 | 8.13 | 17.55 | 7.54 | 11.13 |
| 1M, APE | 5073 | 15.89 | 15.13 | 20.78 | 14.55 | 16.59 |
Official Review — Rating: 6

The paper introduces a simple and effective modification to parallel decoding that can be used as a training-free drop-in method to substantially increase inference efficiency while retaining most of the accuracy.

Strengths

  • The paper evaluates across many different models with different architectures.
  • The method is training-free, which enables it to be easily tested across many models -- assuming a working implementation.
  • Shows significant efficiency gains and would have a real-world impact on deployed models, for example by enabling pre-caching of contexts.

Weaknesses

  • The paper does not provide an implementation of the method -- this would make reproducibility easier. Would it be possible to provide it?
  • The paper fixes a window size but provides no empirical or theoretical analysis of different window sizes, from either an efficiency or a quality perspective. It would be great to have an empirical analysis.
  • No complexity analysis is written out. This could be particularly relevant if we want to vary the window size above.

Questions

  • Would it be possible to include RULER as an evaluation for long-context tasks? It has become more standard for LC. Showing the impact of APE on needle-in-a-haystack would be interesting as well.
  • Are there tasks where APE fails? For example, does APE still hold when used for HumanEval or RepoBench?
  • What would be the impact of finetuning with examples that had APE applied to them? Would it stop the degradation?
  • Can you provide an analysis of the impact of changing the "window" size you use for parallel decoding?
Comment

In Table R3, we further validate that APE can always retrieve the needle from the 128K context on the Needle In A Haystack benchmark.

Table R3: Experiment results of APE on the Needle In A Haystack benchmark.

| Context Length | 1K | 10K | 19K | 28K | 37K | 46K | 55K | 64K | 73K | 82K | 91K | 100K | 109K | 118K | 128K |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Average Accuracy (%) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |

Q5: Are there tasks where APE fails?

A5: APE fails in scenarios that require dependencies among contexts, particularly when long contexts need to be split into multiple chunks. This limitation is especially evident in tasks requiring long-range dependencies. Our experimental results on the VT and FWE subsets in Table R2 demonstrate significant performance degradation in such scenarios.


Q6: Does APE still hold when used for Humaneval or RepoBench?

A6: Yes, APE still holds for several code generation tasks, as shown in Table R4. The results on LCC and RepoBench-P show that APE outperforms other long-context methods on code generation.

Table R4: Comparison of APE and other methods on long-context code generation.

| Method | LCC | RepoBench-P |
|---|---|---|
| LLaMA-3-8B-Instruct | 53.22 | 38.15 |
| + LLMLingua2 | 16.41 | 20.56 |
| + StreamingLLM | 40.02 | 26.16 |
| + Long-context FT | 55.12 | 43.05 |
| + Self-Extend | 58.01 | 41.83 |
| + APE | 66.09 | 49.43 |

Q7: What would be the impact of finetuning with examples that had APE applied to them? Would it stop the degradation?

A7: The model's performance should improve after fine-tuning for both sequential encoding and APE, as the model encodes the knowledge from the examples into its parameters. However, fine-tuning the model on the fly using the retrieved examples would introduce significant computational overhead during inference, making it impractical.


References:

[1] Dao, Tri, et al. "Flashattention: Fast and memory-efficient exact attention with io-awareness." Advances in Neural Information Processing Systems 35 (2022): 16344-16359.

[2] Hsieh, Cheng-Ping, et al. "RULER: What's the Real Context Size of Your Long-Context Language Models?." arXiv preprint arXiv:2404.06654 (2024).

[3] Greg Kamradt. Needle in a haystack - pressure testing llms. 2023.

[4] Tang, Jiaming, et al. "Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference." arXiv preprint arXiv:2406.10774 (2024).

[5] Sun, Hanshi, et al. "ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference." arXiv preprint arXiv:2410.21465 (2024).

Comment

Dear Reviewer BcRN,

Based on the suggestions from Reviewer kKET, we further evaluated our approach on four text retrieval tasks from the LOFT benchmark to represent long-context RAG scenarios, which further validates APE's scalability in processing hundreds or thousands of retrieved texts, with a total length of 1M, in parallel. We hope these results, combined with the results on the RULER and Needle-In-A-Haystack datasets, comprehensively show the effectiveness of APE in extreme long-context scenarios.

Here, we use LLaMA-3.1-8B-Instruct as our base model, with performance measured by the F1 score between ground truth and model predictions. To simulate realistic scenarios where RAG systems may retrieve suboptimal content, we randomly sampled passages from the corpus as context, enabling analysis of how performance scales with additional context integration (otherwise, a short-context LLM would suffice if the retriever were very accurate, which is not the setting we target). For the 128k version of LOFT, we compared APE against Sequential Encoding using varying context lengths in Table R9.

Table R9. Comparison of APE and Sequential Encoding on four text retrieval tasks from the 128k version of LOFT benchmark using the LLaMA-3.1-8B-Instruct model.

| Method | Inference Time | Arguana | Fever | NQ | SciFact | Average |
|---|---|---|---|---|---|---|
| 8K, Sequential | 847 | 11.15 | 7.78 | 17.78 | 7.74 | 11.11 |
| 16K, Sequential | 1277 | 11.20 | 9.78 | 17.81 | 9.49 | 12.07 |
| 32K, Sequential | 2413 | 12.15 | 10.22 | 15.78 | 9.74 | 11.97 |
| 64K, Sequential | 5870 | 15.34 | 13.12 | 19.64 | 16.12 | 16.06 |
| 128K, Sequential | 18039 | 12.84 | 14.19 | 24.54 | 16.88 | 17.11 |
| 128K, APE | 1825 | 16.32 | 14.70 | 21.91 | 15.72 | 17.16 |

Our analysis yielded several key findings:

  • APE demonstrates superior performance on average compared to all baselines, validating APE’s scalability in processing hundreds/thousands of retrieved texts in parallel.

  • APE achieves performance comparable to the 128K Sequential Encoding baseline while utilizing the same context. Although transferring from sequential encoding to APE would typically be expected to cause performance degradation, we hypothesize that APE recovers this by effectively shortening the distance between the query and all examples. This mechanism helps address the “lost in the middle” problem common in long-context LLMs.

  • APE demonstrates remarkable computational efficiency, with its 128K variant requiring inference times comparable to Sequential Encoding at much shorter context lengths (between 16K and 32K), while achieving approximately 10 times faster processing compared to Sequential Encoding at equivalent context length.

We also compared APE with Sequential Encoding using the LLaMA-3.1-8B-Instruct model on the 1m version of LOFT, as shown in Table R10. While APE improves by 5.26% on average using 8$\times$ more retrieved texts, it requires only 1/3 of the inference time by pre-caching these texts.

Table R10. Comparison of APE and Sequential Encoding on four text retrieval tasks from the 1m version of LOFT benchmark using the LLaMA-3.1-8B-Instruct model.

| Method | Inference Time | Arguana | Fever | NQ | SciFact | Average |
|---|---|---|---|---|---|---|
| 128K, Sequential | 18039 | 11.28 | 8.13 | 17.55 | 7.54 | 11.13 |
| 1M, APE | 5073 | 15.89 | 15.13 | 20.78 | 14.55 | 16.59 |
Comment

Thank you for reviewing our paper and for your valuable feedback. Below, we address your concerns point by point, and we will revise our paper according to your suggestions. We would appreciate it if you could let us know whether your concerns are addressed by our response.


Q1: The paper does not provide an implementation of the method -- this would make reproducibility easier. Would it be possible to provide that?

A1: Our PyTorch implementation of APE is available as a proof of concept at https://anonymous.4open.science/r/APE_Rebuttal-93EB. We are currently developing an efficient implementation based on Flash-attention [1].


Q2: The paper fixes a window size but provides no empirical or theoretical analysis of different window sizes, from either an efficiency or a quality perspective. It would be great to have an empirical analysis. & Can you provide an analysis of the impact of changing the "window" size used for parallel decoding?

A2: APE supports dynamic window sizes that adapt to different contexts. The maximum window size is determined by subtracting the combined length of the prefix, query, and maximum generation length from the model's maximum length. We set the window size to its maximum to support longer contexts.
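For concreteness, the budget computation described above can be written as a one-line helper (hypothetical numbers for an 8K-context model; not values from the paper):

```python
def max_window_size(model_max_len, prefix_len, query_len, max_gen_len):
    """Maximum per-chunk context window, as described above: the model's maximum
    length minus the combined prefix, query, and generation budget."""
    return model_max_len - (prefix_len + query_len + max_gen_len)

# Hypothetical numbers for an 8K-context model with a 256-token query and generation budget.
print(max_window_size(8192, prefix_len=16, query_len=256, max_gen_len=256))  # 7664
```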

Additionally, we conducted experiments to evaluate the impact of window size on model performance. Using the same long-context inputs, we divided the content into chunks with varying window sizes. The results in Table R1 demonstrate that larger window sizes lead to improved accuracies.

Table R1: Ablation study of window size in APE on the LongBench dataset.

| Window Size | MuSiQue | NarrativeQA | HotpotQA | 2WikiMQA |
|---|---|---|---|---|
| 4K | 24.23 | 28.85 | 47.09 | 39.87 |
| 8K | 26.19 | 30.71 | 49.71 | 44.43 |

Q3: No complexity analysis is written out. This could be particularly relevant if we want to vary the window size above.

A3: Thank you for your suggestion. We add an analysis of the computational complexity of APE and Sequential Encoding. Following Section 2.1, let the context-augmented input be

$$\mathcal{S} = \{\underbrace{s_{C_0, 0}, \ldots, s_{C_0, l_0}}_{\text{Context 0}}, \underbrace{s_{C_1, 0}, \ldots, s_{C_1, l_1}}_{\text{Context 1}}, \ldots, \underbrace{s_{C_{N-1}, 0}, \ldots, s_{C_{N-1}, l_{N-1}}}_{\text{Context N-1}}, \underbrace{s_0, \ldots, s_l}_{\text{Query}}\},$$

where $l_0 + \ldots + l_{N-1} \gg l$. We analyze the prefilling time of Sequential Encoding and APE.

  • Sequential Encoding: $O((l_0+\ldots+l_{N-1}+l)^2)$

  • APE with pre-caching: $O((l_0+\ldots+l_{N-1}+l)\cdot l)$

  • APE without pre-caching: $O(\max(l_0^2, \ldots, l_{N-1}^2)+(l_0+\ldots+l_{N-1}+l)\cdot l)$

Therefore, APE achieves linear complexity with respect to total context length when using pre-caching, resulting in significantly improved efficiency. Even without pre-caching, APE maintains an advantage by allowing independent prefilling of each context. In this case, its complexity scales quadratically with the maximum context length, which is more efficient than Sequential Encoding's quadratic scaling with total context length.


Q4: Would it be possible to include RULER as an evaluation for long-context tasks? It has become more standard for LC. Showing the impact of APE on needle-in-a-haystack would be interesting as well.

A4: First, we need to clarify that APE is primarily designed for context-augmented generation, despite its potential for processing long single contexts, as featured in the RULER [2] and Needle In A Haystack [3] benchmarks. Here, we evaluated the LLaMA-3.1-8B-Instruct model on these benchmarks using 128K context lengths, with the context divided into 8 chunks of 16K positions each. We also include Quest [4] and ShadowKV [5] as two baselines using dynamic sparse attention and 128K positions. As shown in Table R2, APE demonstrates only a 5% performance decrease compared to Sequential Encoding. This decline is primarily observed in tasks requiring inter-context dependencies, specifically the VT and FWE subtasks. Compared to other baselines, APE even outperforms Quest and approaches ShadowKV despite using only 1/8 of the positions. These new results further show APE's strength in long-context understanding.

Table R2: Experiment results of APE on the RULER benchmark.

| Model | N-S1 | N-S2 | N-MK1 | N-MK2 | N-MQ | N-MV | QA-1 | QA-2 | VT | FWE | Average |
|---|---|---|---|---|---|---|---|---|---|---|---|
| LLaMA-3.1-8B-Instruct | 100 | 100 | 98.96 | 91.67 | 98.96 | 95.31 | 82.29 | 47.92 | 68.96 | 71.18 | 85.525 |
| Quest | 100 | 98.96 | 97.92 | 34.38 | 93.49 | 88.54 | 70.83 | 44.79 | 65.63 | 68.4 | 76.294 |
| ShadowKV | 100 | 100 | 100 | 83.33 | 97.92 | 92.19 | 81.25 | 48.96 | 67.08 | 64.93 | 83.566 |
| APE | 100 | 96.88 | 97.92 | 85.42 | 97.92 | 93.75 | 75 | 46.88 | 56.04 | 59.72 | 80.953 |
Official Review — Rating: 8

This paper investigates the performance degradation issue in parallel encoding and analyzes the distribution patterns of key and query vectors. The analysis reveals that key vector distributions remain relatively similar and that query vector distributions can be combined. Based on these findings, the authors propose APE (Adaptive Parallel Encoding) to mitigate performance losses in LLMs during parallel encoding. Specifically, APE incorporates three key components: a shared prompt prefix, position-aware dynamic scaling, and attention temperature adjustment. The authors evaluate APE on three LLMs (Llama-3-8B, Llama-2-7B, Gemma-2-9B) across multi-document QA and few-shot ICL tasks. Results demonstrate performance improvements compared to full attention and baseline methods, with latency benchmarks showing significant speedup over full attention.

Strengths

  1. The research focuses on a practical and significant challenge, particularly relevant to real-world LLM applications like RAG.
  2. The paper provides insightful observations and analysis regarding the feasibility of parallel encoding.

Weaknesses

  1. This paper demonstrates limited evidence for performance superiority over baselines across various tasks. Specifically:
  • The evaluation is restricted to only two task categories from LongBench (multi-document QA and few-shot ICL), which is insufficient to demonstrate the method's effectiveness across diverse scenarios.
  • For multi-document QA tasks, the evaluation is conducted on a limited subset of LongBench, with additional testing only on one RAG dataset in Sec 6.3. This narrow scope of evaluation fails to provide comprehensive evidence of the method's effectiveness. Moreover, Table 5 lacks crucial baseline comparisons, making it difficult to assess the relative performance improvements.
  • The evaluation of other task types in Sec 6.2 is inadequate, missing essential baseline comparisons needed for meaningful performance assessment.
  2. This paper also lacks sufficient component analysis and validation.
  • The paper lacks comprehensive ablation studies to isolate and validate the contribution of each proposed component. This makes it impossible to determine whether performance improvements stem from specific modules (e.g., attention temperature adjustment) or their combination.
  • There is insufficient analytical evidence demonstrating how the three proposed components effectively address the challenges identified in the earlier sections of the paper. The causal relationship between the proposed solutions and the observed improvements needs stronger empirical support.
  • The absence of detailed component-wise analysis makes it difficult to justify the necessity and effectiveness of each module in the proposed architecture.

Questions

  1. Formatting issue: in Table 1 (line #413), the Gemma-2-9B row is misaligned.
Comment

Q3: And Table 5 lacks crucial baseline comparisons, making it difficult to assess the relative performance improvements.

A3: Thank you for your suggestions. In Table 5, we evaluate RAG with Sequential Encoding and APE on two different external databases, with an additional comparison to LLM-only generation. In these settings, APE consistently leads to performance improvements. Table 5 therefore serves as one real-world RAG application of APE, where we compare it only with Sequential Encoding.

To compare with more baselines, our evaluation in Tables R1-R6, which covers various methods for both RAG and long-context understanding, provides evidence supporting the superiority of APE.


Q4: This paper lacks sufficient component analysis and validation.

A4: To further validate the effectiveness of each component in APE, we expanded our ablation studies beyond the three ICL tasks presented in Table 3 of the original paper. Our additional analysis included 8 subsets from LongBench and 5 subsets from ChatRAGBench. The results in Table R7 demonstrate incremental improvements with each component: incorporating shared prefix enhanced performance by 1.75%, adding scaling factor yielded a 0.63% improvement, and introducing attention temperature provided an additional 0.5% gain. Similarly, Table R8 shows consistent improvements from each component, with gains of 2.3%, 0.29%, and 0.79% respectively. These findings further validate the value of each component in APE. Additionally, the minor performance gap between APE and sequential encoding in Table R5, combined with these detailed results, shows that the challenges identified in Section 3.2 in the main paper have been addressed by these three components.

Table R7: Ablation study for each component in APE on 8 subsets from LongBench.

| Method | musique | qasper | 2wikimqa | dureader | hotpotqa | narrativeqa | multifieldqa_zh | multifieldqa_en | Average |
|---|---|---|---|---|---|---|---|---|---|
| PCW | 18.82 | 42.59 | 40.99 | 21.57 | 47.09 | 23.29 | 54.40 | 45.05 | 36.73 |
| + shared prefix | 24.13 | 42.32 | 43.09 | 22.73 | 49.71 | 26.93 | 54.26 | 44.65 | 38.48 |
| + scaling factor | 26.17 | 41.19 | 44.43 | 23.13 | 47.79 | 30.71 | 54.07 | 45.41 | 39.11 |
| + attention temperature (APE) | 26.19 | 42.32 | 44.43 | 23.13 | 49.71 | 30.71 | 55.03 | 45.41 | 39.61 |

Table R8: Ablation study for each component in APE on 5 subsets from ChatRAGBench.

| Method | INSCIT | Doc2Dial | TopicCQA | Qrecc | QuAC | Average |
|---|---|---|---|---|---|---|
| PCW | 19.88 | 32.33 | 28.55 | 46.28 | 31.78 | 31.76 |
| + shared prefix | 22.9 | 34.62 | 30.67 | 48.11 | 34.02 | 34.06 |
| + scaling factor | 23.24 | 34.88 | 30.67 | 48.32 | 34.65 | 34.35 |
| + attention temperature (APE) | 23.84 | 34.93 | 33.8 | 48.7 | 34.92 | 35.24 |

References:

[1] Bai, Yushi, et al. "Longbench: A bilingual, multitask benchmark for long context understanding." arXiv preprint arXiv:2308.14508 (2023).

[2] Izacard, Gautier, et al. "Unsupervised dense information retrieval with contrastive learning." arXiv preprint arXiv:2112.09118 (2021).

[3] Liu, Zihan, et al. "Chatqa: Surpassing gpt-4 on conversational qa and rag." The Thirty-eighth Annual Conference on Neural Information Processing Systems. 2024.

[4] Li, Zehan, et al. "Towards general text embeddings with multi-stage contrastive learning." arXiv preprint arXiv:2308.03281 (2023).

[5] Pan, Zhuoshi, et al. "Llmlingua-2: Data distillation for efficient and faithful task-agnostic prompt compression." arXiv preprint arXiv:2403.12968 (2024).

[6] Xiao, Guangxuan, et al. "Efficient streaming language models with attention sinks." arXiv preprint arXiv:2309.17453 (2023).

[7] Jin, Hongye, et al. "Llm maybe longlm: Self-extend llm context window without tuning." arXiv preprint arXiv:2401.01325 (2024).

Comment

Thank you for your response, even though it arrived a bit late. After carefully reviewing the additional experiments provided by the authors and considering the comments from other reviewers, I no longer have any outstanding concerns. I believe this work will be beneficial to the community. Therefore, I have decided to increase my score to 8.

Comment

Dear Reviewer Yenp,

Thanks for reading our response and raising the score! We are happy that our new experiments address most of your concerns, and we will include them in our next revision. Thanks again for your time!

Thanks,

The Authors

Comment

Table R4: RAG experiments for APE and other methods on the LongBench dataset for the Gemma-2-9b-it model.

| Method | musique | qasper | 2wikimqa | dureader | hotpotqa | narrativeqa | multifieldqa_zh | multifieldqa_en | Average |
|---|---|---|---|---|---|---|---|---|---|
| Gemma-2-9b-it | 22.57 | 39.99 | 48.06 | 27.4 | 47.49 | 23.11 | 50.81 | 45.35 | 38.10 |
| C200*5, Sequential | 28.73 | 42.06 | 51.62 | 26.94 | 47.87 | 20.49 | 48.09 | 47.38 | 39.15 |
| C200*10, Sequential | 30.69 | 42.86 | 53.55 | 28.04 | 52.05 | 24.45 | 50.25 | 48.34 | 41.28 |
| C2000*20, PCW | 26.27 | 46.69 | 47.59 | 23.43 | 48.95 | 27.11 | 56.69 | 49.81 | 40.82 |
| C2000*20, APE | 33.38 | 47.72 | 49.49 | 28.43 | 56.62 | 30.41 | 56.52 | 50.84 | 44.18 |
  • RAG on the ChatRAGBench dataset [3] with multiple documents. In this setting, we employed three distinct retrievers: Dragon-multiturn [3], Contriever [2], and GTE-base [4] to retrieve the top-5 relevant contexts from five multi-document subsets of the ChatRAGBench dataset. We then evaluated the performance of Llama3-ChatQA-1.5-8B using these retrieved contexts, comparing sequential encoding against APE. Additionally, we tested APE's performance when using all available contexts. The results, presented in Table R5, demonstrate that shifting from sequential encoding to APE results in a minimal performance degradation of no more than 1%, which we consider acceptable. Notably, this performance gap becomes even smaller with lower-quality retrievers. Furthermore, when directly processing all contexts, APE not only outperforms all sequential encoding baselines but also shows substantial improvements compared to scenarios using low-quality retrievers.

Table R5: RAG experiments for APE and sequential encoding on the ChatRAGBench dataset using different retrievers.

| Method | INSCIT | Doc2Dial | TopicCQA | Qrecc | QuAC | Average |
|---|---|---|---|---|---|---|
| dragon-multiturn, Sequential | 25.42 | 36.27 | 36.1 | 49.01 | 35.12 | 36.38 |
| dragon-multiturn, APE | 23.84 | 34.93 | 33.8 | 48.7 | 34.92 | 35.24 |
| Contriever, Sequential | 19.97 | 23.85 | 30.49 | 46.75 | 26.57 | 29.53 |
| Contriever, APE | 19.88 | 23.28 | 28.84 | 46.28 | 26.8 | 29.02 |
| GTE-base, Sequential | 21.58 | 32.35 | 33.41 | 46.54 | 30.69 | 32.91 |
| GTE-base, APE | 20.85 | 30.99 | 31.92 | 45.83 | 30.35 | 31.99 |
| All contexts, APE | 27.22 | 36.13 | 35.72 | 49.15 | 35.7 | 36.78 |

Q2: The evaluation of other task types in Sec 6.2 is inadequate, missing essential baseline comparisons needed for meaningful performance assessment.

A2: We added more baselines for experimental results on the LongBench dataset, where we evaluated 11 subsets in total and compared APE with LLMLingua2 [5], StreamingLLM [6], Long-context Fine-tuning, and Self-Extend [7]. The results in Table R6 show that APE achieves a 6.61% improvement over the base model. Among other length extension methods, only Self-Extend shows performance gains, though it still lags behind APE by 2.15%.

Table R6: Comparison of APE and other length extension methods on the LongBench dataset.

| Method | NarratQA | Qasper | MultiFQA | GovReport | QMSum | LCC | RepoBench-P | HotpotQA | 2WikiMQA | MuSiQue | MultiNews | Average |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LLaMA-3-8B-Instruct | 19.32 | 32.83 | 43.38 | 27.89 | 22.40 | 53.22 | 38.15 | 44.24 | 21.01 | 20.47 | 23.63 | 31.50 |
| LLMLingua2 | 21.00 | 25.78 | 48.92 | 27.09 | 22.34 | 16.41 | 20.56 | 40.16 | 24.72 | 20.85 | 21.34 | 26.29 |
| StreamingLLM | 16.99 | 28.94 | 11.99 | 25.65 | 19.91 | 40.02 | 26.16 | 32.76 | 20.12 | 17.32 | 21.49 | 23.76 |
| Long-context FT | 14.88 | 21.70 | 47.79 | 32.65 | 24.76 | 55.12 | 43.05 | 15.89 | 10.49 | 8.74 | 24.28 | 27.21 |
| Self-Extend | 24.82 | 37.94 | 50.99 | 30.48 | 23.36 | 58.01 | 41.83 | 51.09 | 24.17 | 28.73 | 24.11 | 35.96 |
| APE | 26.87 | 39.14 | 59.12 | 29.10 | 23.08 | 66.09 | 49.43 | 50.11 | 28.06 | 25.79 | 22.40 | 38.11 |
Comment

Thank you for your constructive comments and suggestions. We respond to your questions below and would appreciate it if you could let us know if our response addresses your concerns.


Q1: For multi-document QA tasks, the evaluation is conducted on a limited subset of LongBench, with additional testing only on one RAG dataset in Sec 6.3. This narrow scope of evaluation fails to provide comprehensive evidence of the method's effectiveness.

A1: Thank you for your comment. We have added more experiments to showcase the effectiveness of our methods in context-augmented generation.

  • RAG on the LongBench dataset [1]. In this setting, we implement a chunking-and-retrieval approach across eight LongBench subsets following the original paper: long contexts are preprocessed into chunks, and Contriever [2] is used to retrieve the top-k most relevant chunks. We conducted experiments using four base models: LLaMA-3-8B-Instruct, LLaMA-3.1-8B-Instruct, Mistral-7B-Instruct-v0.3, and Gemma-2-9b-it. For sequential encoding, we segmented texts into chunks of approximately 200 words. When using APE, we employed larger chunks of 4,000 words, except for Gemma-2-9b-it, which used 2,000-word chunks. For LLaMA-3.1-8B-Instruct, we limited the context length to 8K tokens to simulate a typical RAG setting. We also included comparisons with PCW, a baseline parallel encoding method. As shown in Tables R1, R2, R3, and R4, APE yields average improvements of 5.6% and 4.5% over Sequential Encoding and PCW, respectively. These findings suggest that APE enhances RAG performance by enabling the retrieval of more and longer contexts without compromising inference speed, as pre-cached contexts can be loaded directly. A minimal code sketch of this chunking-and-retrieval setup is provided after Table R3 below.

Table R1: RAG experiments for APE and other methods on the LongBench dataset for the LLaMA-3-8B-Instruct model. Here C200*10 means using Contriever to retrieve the top-10 chunks, with each chunk containing approximately 200 words.

| Method | musique | qasper | 2wikimqa | dureader | hotpotqa | narrativeqa | multifieldqa_zh | multifieldqa_en | Average |
|---|---|---|---|---|---|---|---|---|---|
| LLaMA-3-8B-Instruct | 20.7 | 41.05 | 30.02 | 9.55 | 45.9 | 20.98 | 58.54 | 45.04 | 33.97 |
| C200*10, Sequential | 22.63 | 40.6 | 36.46 | 17.53 | 46.25 | 18.14 | 56.13 | 52.42 | 36.27 |
| C200*20, Sequential | 27.93 | 42.71 | 38.35 | 12.65 | 49.6 | 22.78 | 57.82 | 48.94 | 37.60 |
| C4000*20, PCW | 18.82 | 42.59 | 40.99 | 21.57 | 47.09 | 23.29 | 54.4 | 45.05 | 36.73 |
| C4000*20, APE | 26.19 | 42.32 | 44.43 | 23.13 | 49.71 | 30.71 | 55.03 | 45.41 | 39.62 |

Table R2: RAG experiments for APE and other methods on the LongBench dataset for the LLaMA-3.1-8B-Instruct model.

| Method | musique | qasper | 2wikimqa | dureader | hotpotqa | narrativeqa | multifieldqa_zh | multifieldqa_en | Average |
|---|---|---|---|---|---|---|---|---|---|
| LLaMA-3.1-8B-Instruct | 22.18 | 46.81 | 40.58 | 34.61 | 43.97 | 23.08 | 61.6 | 51.89 | 38.98 |
| C200*10, Sequential | 26.31 | 38.79 | 41.42 | 33.16 | 48.42 | 23.39 | 56.54 | 55.37 | 38.29 |
| C200*20, Sequential | 30.62 | 42.33 | 44.39 | 33.51 | 49.97 | 23.87 | 56.87 | 55.14 | 40.22 |
| C4000*20, PCW | 21.23 | 41.52 | 44.87 | 31.11 | 49.47 | 19.98 | 60.9 | 51.19 | 38.44 |
| C4000*20, APE | 26.88 | 43.03 | 50.11 | 32.1 | 55.41 | 30.5 | 62.02 | 52.51 | 42.86 |

Table R3: RAG experiments for APE and other methods on the LongBench dataset for the Mistral-7B-Instruct-v0.3 model.

| Method | musique | qasper | 2wikimqa | dureader | hotpotqa | narrativeqa | multifieldqa_zh | multifieldqa_en | Average |
|---|---|---|---|---|---|---|---|---|---|
| Mistral-7B-Instruct-v0.3 | 10.05 | 31.08 | 22.12 | 17.68 | 32.09 | 19.68 | 32.03 | 40.38 | 25.64 |
| C200*10, Sequential | 12.33 | 22.8 | 23.65 | 26.86 | 30.47 | 19.01 | 40.64 | 46.26 | 27.75 |
| C200*20, Sequential | 11.58 | 21.98 | 24.44 | 20.8 | 32.79 | 16.06 | 34.43 | 38.4 | 25.06 |
| C4000*20, PCW | 17.58 | 35.57 | 32.97 | 18.7 | 37.05 | 14.1 | 34.69 | 40.14 | 28.85 |
| C4000*20, APE | 20.3 | 36.81 | 34.37 | 21.89 | 42.33 | 20.49 | 40.2 | 44.03 | 32.55 |
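
The sketch referenced above illustrates the chunking-and-retrieval setup: split a long context into ~200-word chunks, embed them with the public `facebook/contriever` checkpoint (mean pooling), and keep the top-k chunks by dot-product relevance. The helper names are ours and this is only a minimal, assumed reconstruction, not the exact evaluation pipeline used for Tables R1–R4.

```python
# Minimal chunking-and-retrieval sketch (assumed reconstruction, not the exact pipeline).
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("facebook/contriever")
encoder = AutoModel.from_pretrained("facebook/contriever")

def mean_pool(last_hidden, attention_mask):
    # Contriever embeddings are obtained by mean pooling over token states.
    mask = attention_mask.unsqueeze(-1).float()
    return (last_hidden * mask).sum(dim=1) / mask.sum(dim=1)

@torch.no_grad()
def embed(texts):
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    out = encoder(**batch)
    return mean_pool(out.last_hidden_state, batch["attention_mask"])

def chunk_words(text, chunk_size=200):
    # Split the long context into chunks of roughly `chunk_size` words.
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]

def retrieve_top_k(query, long_context, k=10, chunk_size=200):
    chunks = chunk_words(long_context, chunk_size)
    scores = embed([query]) @ embed(chunks).T          # dot-product relevance
    top = scores.squeeze(0).topk(min(k, len(chunks))).indices
    return [chunks[int(i)] for i in top]
```

In the APE rows above, the retrieved chunks would then be encoded in parallel (each from its own pre-computed KV cache) rather than concatenated and re-encoded sequentially.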
Review
6

The paper presents Adaptive Parallel Encoding (APE) to improve efficiency and performance in language models handling RAG and ICL tasks. APE overcomes limitations of sequential encoding, which requires re-encoding context data and suffers from context window restrictions, by employing parallel encoding with adaptive adjustments. These adjustments include a shared prefix, scaling factors, and a modified attention temperature, aligning parallel encoding closely with sequential encoding. Experiments show that APE achieves a large speedup on long contexts and slightly outperforms other methods in accuracy for long contexts.

Strengths

  • The method sounds intuitive. APE extends the usable context length for language models, overcoming limitations in the context window and enabling efficient handling of much larger inputs without additional training.
  • The performance is great, especially in latency
  • The analysis in Section 3 looks quite novel and interesting

Weaknesses

  • The performance seems quite sensitive to specific parameters

Questions

  • Can we conclude anything about the relationship between the task and the choice of hyperparameters?
  • The latency improvement in Table 2 would be more interesting if other baselines are included.
Comment

Thank you so much for the insightful and valuable comments! They are very helpful for further improving the quality of our paper. We will revise our manuscript in the next version to address all of your concerns.


Q1: The performance seems quite sensitive to specific parameters.

A1: Although APE introduces three hyperparameters (shared prefix, scaling factor, and attention temperature), it is not very sensitive to them. First, the shared prefix is only applied to tasks lacking existing prefixes (e.g., GSM8K and TriviaQA). Our experimental results in Table R1 demonstrate that introducing this shared prefix increases performance by 7.38% on average, while variations in the prefix length only result in a 0.4% difference in performance. These findings indicate that while the presence of a shared prefix is beneficial, APE's performance is not significantly sensitive to the prefix length.

Table R1: Ablation study of different prefix lengths on GSM8K and TriviaQA using the LLaMA-3-8B-Instruct model.

| Prefix Length | 0 | 10 | 20 | 40 |
|---|---|---|---|---|
| GSM8K | 62.17 | 72.10 | 71.87 | 71.69 |
| TriviaQA | 67.48 | 72.31 | 72.51 | 72.65 |

For the scaling factor s and attention temperature T, we evaluated various combinations across four QA tasks from the LongBench dataset, as shown in Table R2. While these hyperparameters are relatively sensitive compared to the prefix length, their optimal values fall within a narrow range. In our experiments, both s and T were selected from {0.7, 0.8, 0.9, 1.0}, which makes the hyperparameter search on the validation set efficient. Notably, although examples within each LongBench task contain varying numbers of contexts, APE's hyperparameters generalize across examples with different context lengths, indicating low sensitivity. A short sketch of how s and T could enter the attention computation is given after Table R2.

Table R2: Ablation study of the scaling factor s and attention temperature T on four subsets from LongBench using the LLaMA-3-8B-Instruct model.

| (s, T) | MuSiQue | NarrativeQA | HotpotQA | 2WikiMQA |
|---|---|---|---|---|
| (1.0, 1.0) | 24.13 | 26.93 | 49.71 | 43.09 |
| (1.0, 0.9) | 26.17 | 30.71 | 47.79 | 41.19 |
| (0.9, 0.9) | 24.32 | 27.93 | 49.6 | 38.21 |
| (0.8, 0.8) | 26.19 | 27.01 | 47.32 | 35.41 |
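
As referenced above, the following is a hedged sketch of one plausible way a scaling factor s and an attention temperature T could enter the attention over parallel-encoded context tokens. It illustrates the roles the two knobs play in the ablation (T < 1 sharpens context logits, s < 1 damps the context attention mass) and is not necessarily APE's exact formulation.

```python
# Illustrative attention with a context scaling factor s and temperature T
# (a conceptual sketch, not the APE implementation; no numerical stabilization).
import torch

def attention_with_s_and_t(q, k_ctx, v_ctx, k_other, v_other, s=0.9, T=0.9):
    """q: (d,); parallel-encoded context KV: (n_ctx, d); prefix/query KV: (n_other, d)."""
    d = q.shape[-1]
    # Temperature T < 1 sharpens the context logits toward semantically important tokens.
    logits_ctx = (k_ctx @ q) / (T * d ** 0.5)
    logits_other = (k_other @ q) / d ** 0.5
    # Scaling factor s < 1 offsets the inflated attention mass that parallel
    # encoding would otherwise assign to context tokens.
    w_ctx = s * torch.exp(logits_ctx)
    w_other = torch.exp(logits_other)
    denom = w_ctx.sum() + w_other.sum()
    return (w_ctx @ v_ctx + w_other @ v_other) / denom
```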

Q2: Can we conclude anything about the relationship between the task and the choice of hyperparameters?

A2: Different tasks may incorporate different numbers of retrieved texts from external sources. As this number increases, we observe a consistent trend: both the scaling factor s and the attention temperature T decrease, which aligns with our observations in Section 3.2.


Q3: The latency improvement in Table 2 would be more interesting if other baselines are included.

A3: Thank you for your suggestion. We additionally include MInference [1] as another baseline for prefilling acceleration. As shown in Table R3, APE achieves up to 28× faster prefilling compared to Sequential Encoding with MInference for 128K-token contexts. Furthermore, APE's prefilling overhead scales linearly, consuming less than 10% of the total inference time, while baseline methods require over 50% as the context length increases. APE also demonstrates superior versatility across various settings, whereas MInference introduces additional overhead and slower inference when processing short contexts and large batches. A brief timing sketch showing how the [prefilling / total time] split can be measured follows Table R3.

Table R3: Latency on H100 GPU: [prefilling / total time (ms)]. We compared different strategies with various context lengths and batch sizes.

| Method / Context Length | 4K | 8K | 16K | 32K | 64K | 128K |
|---|---|---|---|---|---|---|
| Sequential (bsz=1) | 132/2383 | 276/2558 | 635/3204 | 1646/4713 | 4892/8804 | 16635/22252 |
| Sequential + MInference (bsz=1) | 964/3221 | 1297/3609 | 1623/4201 | 2206/5274 | 3563/7458 | 6457/12085 |
| APE (bsz=1) | 28/2279 | 40/2322 | 68/2637 | 122/3189 | 226/4138 | 421/6038 |
| Sequential (bsz=4) | 523/3129 | 1084/4179 | 2558/6474 | 6599/12055 | 19468/28367 | 65928/83869 |
| Sequential + MInference (bsz=4) | 3789/6373 | 4995/8099 | 6300/10244 | 8470/13944 | 13462/22313 | 24199/42013 |
| APE (bsz=4) | 69/2675 | 94/3189 | 150/4066 | 298/5754 | 430/9329 | 858/18799 |
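
As a rough illustration of how such a [prefilling / total time] split can be measured, the sketch below times a 1-token generation (dominated by prefilling) and a 256-token generation. The model name and the long prompt are placeholders; this is not the benchmarking harness used for Table R3.

```python
# Rough prefill/total timing sketch (placeholder model and prompt; proxy measurement).
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "meta-llama/Meta-Llama-3-8B-Instruct"  # placeholder; any causal LM works
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.bfloat16).to("cuda")
tokenizer = AutoTokenizer.from_pretrained(name)

def timed_generate_ms(ids, n_new):
    torch.cuda.synchronize()
    start = time.perf_counter()
    with torch.no_grad():
        model.generate(ids, max_new_tokens=n_new, do_sample=False)
    torch.cuda.synchronize()
    return 1000 * (time.perf_counter() - start)

long_prompt = " ".join(["context"] * 8000)  # stand-in for a long retrieved context
ids = tokenizer(long_prompt, return_tensors="pt").input_ids.to("cuda")
prefill_ms = timed_generate_ms(ids, 1)      # dominated by prefilling the context
total_ms = timed_generate_ms(ids, 256)      # prefilling plus decoding
print(f"{prefill_ms:.0f} / {total_ms:.0f} ms")
```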

References:

[1] Jiang, Huiqiang, et al. "Minference 1.0: Accelerating pre-filling for long-context llms via dynamic sparse attention." arXiv preprint arXiv:2407.02490 (2024).

Comment

Our PyTorch implementation of APE is available as a proof of concept at https://anonymous.4open.science/r/APE_Rebuttal-93EB.

Comment

We sincerely apologize to all reviewers and Area Chairs for submitting our rebuttal in the final hours of the discussion period. We conducted extensive additional experiments that took longer than anticipated, consuming most of the available time. We would be grateful if you could still review our rebuttal during the discussion/decision period.

We thank reviewers [R1(YT3n), R2(BcRN), R3(Yenp), R4(onHd), R5(kKET)] for their thoughtful and highly supportive feedback! We are glad that the reviewers found the problem practical and significant [R2, R3], the observations and analyses insightful and thorough [R3, R4], the method training-free and deployable [R2, R4], and the experimental results comprehensive and impressive [R4].

While several reviewers [R2, R3, R4, R5] focus on APE's potential for long-context LLMs, and R3 requests additional experiments for RAG and multi-document QA, we would like to first clarify APE's primary contribution. Our method focuses on context-augmented generation involving multiple predetermined contexts from an external dataset, rather than processing single long contexts in real time. Therefore, the unique advantage of APE is its ability to pre-cache these contexts, significantly reducing prefilling time, as shown in Table 2 of the original paper. With the prevalence of context-augmented generation in real-world deployment, we hope that APE can motivate the development of pre-cacheable strategies in practice. APE is therefore completely different from the fixed-step sparse attention mentioned by R5 and from other length extension methods, which require prefilling the contexts on-the-fly in context-augmented generation settings.
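
To make the pre-caching workflow concrete, below is a minimal sketch of encoding each context independently offline and concatenating the cached KV states at query time, assuming the standard Hugging Face cache interface and a placeholder model name. It only shows the caching/loading mechanics of parallel encoding with position reuse; APE's alignment components (shared prefix, scaling factor, attention temperature) are omitted, and this is not the released implementation.

```python
# Per-context KV pre-caching with position reuse (illustration only;
# see the linked repository for the actual APE implementation).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.cache_utils import DynamicCache

name = "meta-llama/Meta-Llama-3-8B-Instruct"  # placeholder base model
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.bfloat16).to("cuda")
tokenizer = AutoTokenizer.from_pretrained(name)

@torch.no_grad()
def precompute_kv(context):
    """Offline: encode one context independently, starting from position 0."""
    ids = tokenizer(context, return_tensors="pt").input_ids.to(model.device)
    out = model(ids, use_cache=True)
    # One (key, value) pair per layer, each of shape (1, n_kv_heads, seq_len, head_dim).
    return [(k.cpu(), v.cpu()) for k, v in out.past_key_values]

def merge_caches(caches):
    """Online: concatenate the selected contexts' cached KV along the sequence axis.
    Because every context was encoded from position 0, positions are reused across contexts."""
    merged = DynamicCache()
    for layer in range(len(caches[0])):
        keys = torch.cat([c[layer][0] for c in caches], dim=2).to(model.device)
        values = torch.cat([c[layer][1] for c in caches], dim=2).to(model.device)
        merged.update(keys, values, layer)
    return merged
```

Generation would then continue by feeding only the query tokens together with the merged cache (e.g., via `past_key_values` in recent `transformers` versions), so the selected contexts are loaded from cache instead of being re-encoded per request.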

To comprehensively address the reviewers' concerns while keeping the original focus, we have expanded our evaluation on both sides, with the detailed results provided in responses to each reviewer:

  • [R3] RAG on the LongBench dataset [1]. In this setting, we preprocess long contexts into chunks and utilize Contriever [2] to retrieve the top-k most relevant chunks on LongBench. On average, APE yields improvements of 5.6% and 4.5% over Sequential Encoding and PCW, respectively. These findings suggest that APE enhances RAG performance by enabling the retrieval of longer, more comprehensive contexts without compromising inference speed, as pre-cached contexts can be loaded directly.

  • [R2, R4, R5] RAG on the ChatRAGBench dataset [3]. In this setting, we use three different retrievers to retrieve the top-5 relevant contexts on the ChatRAGBench dataset. The results demonstrate that APE incurs a performance degradation of no more than 1% compared to sequential encoding using the same contexts, which we consider acceptable. Furthermore, when directly processing all contexts, APE outperforms all sequential encoding baselines.

  • [R2, R4, R5] Long-context understanding on the RULER dataset [5]. In this setting, APE shows only a 5% performance decrease compared to sequential encoding at a 128K context length. This decline is primarily observed in tasks requiring inter-context dependencies, specifically the VT and FWE subtasks. Compared to other baselines, APE even outperforms Quest [7] and approaches ShadowKV [8] despite using only 1/8 of the positions. This indicates that APE can scale to longer contexts.

  • [R2, R4, R5] Long-context understanding on the Needle In A Haystack dataset [6]. In this setting, APE always retrieves the needle from the 128K-length context on the Needle In A Haystack dataset.

  • [R2, R3, R4, R5] Long-context understanding on the LongBench dataset [1]. In this setting, APE demonstrates superior long-context understanding, outperforming LLMLingua2 [9], StreamingLLM [10], long-context fine-tuning, and Self-Extend [11], with an average improvement of 2.15% across 11 subtasks over the strongest of these baselines.

We hope that these new results, combined with the efficiency analysis, can (i) further support the novelty of APE in context-augmented generation as the major contribution, and (ii) provide more evidence for the effectiveness of APE in long-context understanding as a by-product.

Comment

References:

[1] Bai, Yushi, et al. "Longbench: A bilingual, multitask benchmark for long context understanding." arXiv preprint arXiv:2308.14508 (2023).

[2] Izacard, Gautier, et al. "Unsupervised dense information retrieval with contrastive learning." arXiv preprint arXiv:2112.09118 (2021).

[3] Liu, Zihan, et al. "Chatqa: Surpassing gpt-4 on conversational qa and rag." The Thirty-eighth Annual Conference on Neural Information Processing Systems. 2024.

[4] Li, Zehan, et al. "Towards general text embeddings with multi-stage contrastive learning." arXiv preprint arXiv:2308.03281 (2023).

[5] Hsieh, Cheng-Ping, et al. "RULER: What's the Real Context Size of Your Long-Context Language Models?." arXiv preprint arXiv:2404.06654 (2024).

[6] Greg Kamradt. Needle in a haystack - pressure testing llms. 2023.

[7] Tang, Jiaming, et al. "Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference." arXiv preprint arXiv:2406.10774 (2024).

[8] Sun, Hanshi, et al. "ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference." arXiv preprint arXiv:2410.21465 (2024).

[9] Pan, Zhuoshi, et al. "Llmlingua-2: Data distillation for efficient and faithful task-agnostic prompt compression." arXiv preprint arXiv:2403.12968 (2024).

[10] Xiao, Guangxuan, et al. "Efficient streaming language models with attention sinks." arXiv preprint arXiv:2309.17453 (2023).

[11] Jin, Hongye, et al. "Llm maybe longlm: Self-extend llm context window without tuning." arXiv preprint arXiv:2401.01325 (2024).

AC Meta-Review

The paper introduces Adaptive Parallel Encoding (APE), a novel method designed to enhance efficiency and performance in language models when handling multiple external contexts, such as Retrieval-Augmented Generation (RAG) and In-Context Learning (ICL). APE achieves this by pre-caching key-value states of contexts separately and enabling position reuse during inference. The method incorporates three key components: 1) A shared prefix to align initial token distributions, 2) A scaling factor to offset the increase in attention weights, and 3) A lower attention temperature to prioritize semantically important tokens. The authors demonstrate that APE achieves an impressive 976× speedup for long-context generation while maintaining 93% accuracy compared to sequential encoding.

Overall, the reviewers unanimously acknowledge the soundness and significance of the proposed techniques and the substantial efficiency improvements. Given the potential impact of APE in reducing inference costs for large models in RAG and ICL tasks, we recommend accepting this submission. We ask the authors to follow the reviewers' feedback and their promised revisions when preparing the final version.

Additional Comments from Reviewer Discussion

Here I list the main concerns.

  1. The performance seems quite sensitive to specific parameters (Reviewers YT3n, onHd). The authors added extra experiments showing that APE is not very sensitive to its three hyperparameters (shared prefix, scaling factor, and attention temperature).

  2. The latency improvement in Table 2 would be more interesting if other baselines were included (Reviewer YT3n). The authors added MInference as another baseline for prefilling acceleration. As shown in Table R3, APE achieves up to 28× faster prefilling compared to Sequential Encoding with MInference for 128K-token contexts.

  3. Impact of changing the "window" size used for parallel decoding, and the resulting complexity change (Reviewer BcRN). The authors explain that 1) APE supports dynamic window sizes that adapt to different contexts, and 2) they added more experiments showing the performance of different window sizes.

  4. The 128K and 1M context-length LOFT benchmark (Reviewer BcRN). The authors added extra experiments showing the effectiveness of APE under 128K and 1M contexts.

  5. The narrow scope of evaluation fails to provide comprehensive evidence of the method's effectiveness (Reviewers Yenp, onHd). The authors added many extra experiments demonstrating the effectiveness of APE.

These concerns are well resolved in my view.

Final Decision

Accept (Poster)