PaperHub

Overall rating: 6.5/10 · Poster · 4 reviewers
Individual ratings: 8, 6, 6, 6 (min 6, max 8, std 0.9)
Confidence: 3.8 · Correctness: 3.0 · Contribution: 2.8 · Presentation: 2.8

NeurIPS 2024

Loki: Low-rank Keys for Efficient Sparse Attention

Submitted: 2024-05-15 · Updated: 2024-11-06

Abstract

Keywords
Approximate Attention · Low-Dimensional Attention Keys · PCA · Top-K Attention

Reviews and Discussion

Official Review
Rating: 8

This paper proposes a PCA-based attention score approximation for top-k attention: PCA is performed on an offline dataset, and the PCA projection vectors are stored for use at inference time.
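For context, a minimal sketch of what such an offline calibration step might look like (the function name, tensor shapes, and the centering step are illustrative assumptions, not the authors' code): keys are collected per head over a calibration dataset and the top principal directions are stored.

```python
import torch

@torch.no_grad()
def compute_key_pca(calibration_keys: torch.Tensor, d_f: int) -> torch.Tensor:
    """calibration_keys: [num_heads, num_tokens, D] keys gathered by running the
    model over a calibration dataset. Returns per-head projection matrices of
    shape [num_heads, D, d_f] built from the top-d_f principal components."""
    num_heads, _, D = calibration_keys.shape
    projections = torch.empty(num_heads, D, d_f)
    for h in range(num_heads):
        keys = calibration_keys[h]
        keys = keys - keys.mean(dim=0, keepdim=True)        # center before PCA
        # Right singular vectors of the centered key matrix = principal directions.
        _, _, Vh = torch.linalg.svd(keys, full_matrices=False)
        projections[h] = Vh[:d_f].T                          # [D, d_f]
    return projections
```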

Strengths

This work easily builds a QK approximator for top-k attention selection without gradient-based training, using only a simple PCA.

Weaknesses

You have to store the PCA vectors and perform the projection of KV during attention. This projection matrix should be held per threadblock on the GPU to minimize GPU memory accesses, but it might be huge if the hidden size of QKV is large. Until now, most LLMs use a head size of 256 or less, so this should be fine. (NOTE: I need to check the implementation to confirm whether this is correct.)

Questions

  1. The .gz file seems broken. Can you upload the file in another format? I want to look into the source code for details.
  2. Fig. 1 might lead readers to misunderstand that this attention mechanism uses low-rank PCA attention only. Can you indicate in the figure, using a diagram, that the method performs a top-k operation on top of the approximated scores? (I think text alone is not sufficient.)
  3. The top-k operation is usually very slow on GPU, due to synchronization and the exchange of top-k values via global memory. I think a per-kernel latency breakdown should be presented in this paper. (Fig. 6 only shows a single sequence length on the left.)
  4. Some figures are not sized properly (e.g., Fig. 6). Can you adjust the size of the figures to avoid stretching the text?
  5. I think there should be a plot that shows the latency-performance trade-off (ms vs. accuracy). Can you add this plot using some downstream tasks?
  6. In Fig. 4, the downstream tasks evaluated all have quite short sequence lengths. Can you try LongBench (https://github.com/THUDM/LongBench)? Question 5 should be addressed using LongBench rather than short-sequence tasks.

I have a concern about the performance evaluation. The only metrics used are perplexity, HellaSwag, TQA, Winogrande, ARC, and MMLU, which involve relatively short sequences compared to long-context LLMs such as Qwen2 and Phi3. I suggest adding an evaluation on long contexts.

Limitations

PCA has to be performed during the training stage, and the PCA projection is required for top-k attention selection. This may require additional effort to optimize the GPU kernels at various scales, and is sometimes impossible if the device cannot hold the PCA projection matrix in shared memory.

Comment

I found out that the tar.gz file seems to have been forcibly renamed to .gz by OpenReview. Please disregard my question 1; I have just resolved it myself...

If possible, I will look into it during the discussion period. Thank you for your great work, and sorry for my mistake.

Author Response

We thank the reviewer for their feedback and appreciate that they recognize the simplicity of the approach to get a $QK^T$ approximator without any gradient-based training.

Weaknesses

You have to store the PCA vectors and perform the projection of KV during attention. This projection matrix should be held per threadblock on the GPU to minimize GPU memory accesses, but it might be huge if the hidden size of QKV is large. Until now, most LLMs use a head size of 256 or less, so this should be fine. (NOTE: I need to check the implementation to confirm whether this is correct.)

We would like to note here that the current implementation of PCA-TopK is not a fused implementation like Flash Attention. We discuss the possibility of developing a fused method in our response to Reviewer vRA7 (Question 2). In the current implementation, the projection to PCA space happens in a separate kernel. This projection operation is implemented as a standalone matrix multiplication using PyTorch. We do agree with the reviewer that a more efficient implementation would keep the PCA vectors in shared memory to minimize GPU memory accesses, and that is possible since the per-head PCA transformation is only $D \times D$ dimensional, where typically $D = 128$ (sometimes 256) for current LLM models.
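As a concrete illustration of that standalone projection step (tensor names and shapes are assumptions for this sketch, not the paper's code), the per-head reduction amounts to a single batched matmul in PyTorch:

```python
import torch

def project_keys(keys: torch.Tensor, pca: torch.Tensor) -> torch.Tensor:
    # keys: [batch, heads, seq_len, D]; pca: [heads, D, d_f] per-head transforms.
    # Broadcasts the D x d_f transform over batch and sequence, producing
    # reduced keys of shape [batch, heads, seq_len, d_f].
    return torch.einsum("bhsd,hde->bhse", keys, pca)
```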

Questions

The .gz file seems broken. Can you upload the file in another format? I want to look into the source code for details.

Based on the comment submitted, we assume this is resolved for now. If needed, we can submit an anonymized link to our code as an official comment to the ACs following NeurIPS guidelines.

Fig. 1 might lead readers to misunderstand that this attention mechanism uses low-rank PCA attention only. Can you indicate in the figure, using a diagram, that the method performs a top-k operation on top of the approximated scores? (I think text alone is not sufficient.)

We will update the figure in the subsequent version of the paper.

The top-k operation is usually very slow on GPU, due to synchronization and the exchange of top-k values via global memory. I think a per-kernel latency breakdown should be presented in this paper. (Fig. 6 only shows a single sequence length on the left.)

Yes, we agree that the Top-K operation is slow, and while it is possible to fuse the Top-K operation with other matmuls, it still involves synchronization among threads. Currently, we are looking at ways to bypass the Top-K operation itself using a score thresholding mechanism (where the threshold is sampled offline). We leave that to future work.

Figure 2 (Left & Mid) in the rebuttal pdf shows the kernel-wise breakdown for more prompt and generation lengths. It shows that the Top-K operation is almost as costly as the matmuls. A custom (or fused) kernel or a thresholding strategy might help alleviate this cost.
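To illustrate the difference between the two selection strategies mentioned above (purely a sketch of the idea, not the authors' planned design), thresholding replaces the global selection pass with an elementwise comparison, which is easier to fuse:

```python
import torch

def select_topk(approx_scores: torch.Tensor, k: int) -> torch.Tensor:
    # Global top-k over the sequence dimension: requires a selection/sort pass
    # with synchronization when fused into a kernel.
    return approx_scores.topk(k, dim=-1).indices

def select_by_threshold(approx_scores: torch.Tensor, tau: float) -> torch.Tensor:
    # Elementwise comparison against an offline-calibrated threshold tau:
    # no global coordination, but the number of selected tokens varies per query.
    return approx_scores >= tau
```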

Some figures are not sized properly (e.g., Fig. 6). Can you adjust the size of the figures to avoid stretching the text?

We will fix the figures in the subsequent version of the paper.

I think there should be a plot that shows the latency-performance trade-off (ms vs. accuracy). Can you add this plot using some downstream tasks?

Figure 1 (Top-Right) in the rebuttal pdf shows a latency-accuracy trade-off curve evaluated on LongBench [1] for the Llama2-7b-Chat model (sequence length of 4096). We can see that the settings with $k_f = 0.25$ and $k_f = 0.125$ perform much better than vanilla attention. $k_f$ has a larger impact on the performance than $d_f$, which is supported by our theoretical speedup analysis in the paper. Overall, the configurations $(k_f = 0.25, d_f = 0.25)$ and $(k_f = 0.125, d_f = 0.5)$ give a good accuracy-performance trade-off for LongBench tasks as well.

In Fig. 4, the downstream tasks evaluated all have quite short sequence lengths. Can you try LongBench (https://github.com/THUDM/LongBench)? Question 5 should be addressed using LongBench rather than short-sequence tasks.

Figure 1 (Top-Left and Bottom) in the rebuttal pdf shows the performance of PCA-TopK on LongBench [1] tasks for the Llama-2-7B-Chat model. We have included a discussion on the same in the Global Rebuttal.

Limitations

PCA has to be performed during the training stage, and the PCA projection is required for top-k attention selection. This may require additional effort to optimize the GPU kernels at various scales, and is sometimes impossible if the device cannot hold the PCA projection matrix in shared memory.

For clarification, our current method does not require PCA to be performed during the training stage. We compute the PCA transforms by evaluating the trained model over a calibration dataset, storing the generated keys, and then computing the PCA. We agree that an approach that utilizes our low-dimensional observation during training or fine-tuning will be interesting to study and may improve model performance. We also agree that additional effort is required to optimise the GPU kernels further.


References

[1] Bai, Yushi, et al. "Longbench: A bilingual, multitask benchmark for long context understanding." arXiv preprint arXiv:2308.14508 (2023).

Comment

Thank you for the detailed and kindly written rebuttal, and sorry for my late reply. I am happy that most of my concerns are resolved, and I want to increase my rating. I am fairly confident that this paper is of an acceptable grade. I hope the other reviewers will respond soon.

  • I briefly looked into the Triton code; it is simple and worth studying for future researchers. I hope the readability (adding comments, changing variable names from A, B, etc. to something semantically meaningful) will be improved in the public release. I think the modularity of the code is generally good.
    • However, the hook function is implemented in a forward-overriding style, which I do not like... I hope there is a better way, because this kind of approach will break if the transformers framework changes its internal API.
    • Since the PCA projection is in a separate kernel, this should also be improved.
  • I am very grateful for the latency breakdown plot in the PDF. In every other paper that I reviewed for this NeurIPS, no other authors actually made this chart. I truly think this analysis is critical for finding further bottlenecks of the method in future research.
  • I acknowledge that the practical impact of this work is clear: it reduces the memory footprint of the K tokens read, via low-rank projection.
    • I hope the researchers will find a way to do PCA-TopK in a fused way (like Flash Attention, e.g., Flash-PCA-TopK). I think there are two approaches: (1) ML perspective: change the algorithm, e.g., replace top-k with a thresholding function (this paper may be helpful for the thresholding approach [1]) or perform top-k hierarchically [2]; (2) systems perspective: preserve the algorithm as much as possible but change the implementation, e.g., implement the top-k operation using buckets (partitioning may also be possible).
  • I think memory reduction is not very critical for this kind of work, so I respectfully disagree with aGJV's weakness. I understand why the reviewer points to memory reduction, because memory consumption often limits the scale of research. However, I think the memory access footprint is quite underestimated here. The important point of the whole reduction is reducing the K memory read footprint. This means we can effectively reduce the latency and total throughput of K reads, which is the most significant part of attention, so we do not need high-bandwidth memory such as HBM3e and can just use GDDR6, which is much more cost-effective. Moreover, many sparse attention works [2] struggle to reduce the memory throughput of K tokens because a single K token is already quite large (128 * 2 bytes per head, with 32 heads, so usually 8KB, which is 2 pages of VM). So I really appreciate the reduction of the memory read footprint, and this also makes effective KV cache offloading possible, as shown in [2, 3]. I hope the other reviewers can acknowledge this practical benefit. It may be good to show how many memory reads actually happen during PCA-TopK vs. FlashAttention using nsight-compute [4] in future research. I expect that to minimize actual memory reads, we should be careful with the memory addressing and formatting of the KV cache tensor.

Again, thank you for your wonderful work, and I hope this kind of work (improving attention mechanism) continues.

[1] https://arxiv.org/pdf/2107.00910
[2] https://arxiv.org/abs/2406.09827
[3] https://arxiv.org/html/2406.19707v1
[4] https://developer.nvidia.com/nsight-compute

Comment

We thank the reviewer for their extremely positive comments and the corresponding score increase. We greatly appreciate the insightful suggestions and hope to incorporate them in future work to further improve our method.

Official Review
Rating: 6

This paper reveals that key vectors lie in a significantly lower-dimensional space. Inspired by this finding, the authors approximate the computation of the original attention scores using PCA, then select the top-k keys based on the approximate attention scores. Experiments across different models and datasets show that PCA-TopK can achieve speedups of up to 40% with minor reductions in generation quality.

Strengths

  1. The author discovers that the key vectors in multi-head attention lie in a lower-dimensional space, which may inspire future work on Sparse Attention.
  2. The experiments across various models and datasets indicate that PCA-TopK can achieve speed improvements with only minor reductions in generation quality.

Weaknesses

  1. There are some typos (lines 44, 150) in this article, and some figures are unclear with text overlaps (Figures 2, 6).
  2. There are few sparse-attention baselines in the experiments. How does the PCA-TopK method compare to SPAR-Q Attention? What is the trade-off curve between their acceleration ratios and model quality?

Questions

  1. The analysis that key vectors lie in a lower-dimensional space is interesting. Do value vectors and query vectors share similar characteristics? What about the vector after merging multi-head attention?
  2. Given the differences in Rankl@90 across layers, what is the impact of varying the policy depending on the layer?
  3. If post-training with PCA-TopK, will it yield better performance?

Limitations

N/A

Author Response

We thank the reviewer for their feedback and appreciate that they recognize the potential impact of our low-dimensional observation and the strength of our evaluation.

Weaknesses

There are some typos (lines 44, 150) in this article, and some figures are unclear with text overlaps (Figures 2, 6).

We will fix these issues in the subsequent version of the paper.

There are few sparse-attention baselines in the experiments. How does the PCA-TopK method compare to SPAR-Q Attention? What is the trade-off curve between their acceleration ratios and model quality?

SPAR-Q incurs significant overhead as it needs to store the keys in KV-Cache twice – once in a row-major and once in a column-major format. This is required to get performant kernels for their implementation. Our method does not incur such a massive memory overhead (the only overhead we have is of storing the PCA projections, which are comparatively much smaller).

Additionally, our $QK^T$ kernels are more performant than SPAR-Q's, as compared in Appendix E, Figure 13 (in our paper).

We have not evaluated SPAR-Q ourselves. Their paper does not have a trade-off curve between acceleration ratio and effect. The only commonly evaluated dataset they have is TriviaQA [1] where PCA-TopK also achieves good performance (Figure 4 in our paper). Their attention latency results can be found in Figure 8 of their paper while Figure 9 shows end-to-end performance but evaluated on CPU. We will include end-to-end quantitative comparison with their method in subsequent versions of the paper.

Questions

The analysis that key vectors lie in a lower-dimensional space is interesting. Do value vectors and query vectors share similar characteristics? What about the vector after merging multi-head attention?

Figure 3 in the rebuttal pdf shows the dimensionality analysis for query and value vectors for Llama2-7B and Llama3-70B models. It can be seen that while query vectors exhibit low dimensionality similar to key vectors, value vectors have a significantly higher rank. We will include this analysis for more models in the updated supplementary material.

Given the differences in Rankl@90 across layers, what is the impact of varying the policy depending on the layer?

Figure 2 (Right) in the rebuttal pdf shows the evaluation of varying the $d_f$ parameter per layer based on the explained variance of the PCA components for Llama3-8B. We compare two policies: (1) Fixed $d_f$ for all layers (set to either 0.25 or 0.5), (2) Variable $d_f$ per layer. For the variable policy, we select the $d_f$ of a layer based on the explained variance threshold (ranging from 0.5 to 0.8). The compression ratio is defined as the average of $d_f/D$ across all layers, where $D$ is the full dimensionality of the vectors. We can see that using the variable policy does not show any benefit over the fixed policy. This indicates that different layers may require different variance thresholds as well, and therefore, tuning the variable policy is key to getting gains. Our evaluation on Llama2-13B also shows the same trend (plot not included due to space constraints).
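A minimal sketch of how such a variable policy could be implemented (the helper name and inputs are illustrative assumptions, not the authors' code): per layer, pick the smallest $d_f$ whose cumulative explained variance reaches the threshold.

```python
def df_per_layer(explained_variance_per_layer, threshold=0.7):
    """explained_variance_per_layer: one list per layer holding each PCA
    component's explained-variance ratio, sorted in descending order.
    Returns, for every layer, the smallest d_f whose cumulative explained
    variance is at least `threshold`."""
    dfs = []
    for ratios in explained_variance_per_layer:
        cumulative, d_f = 0.0, len(ratios)   # fall back to full dimensionality
        for i, r in enumerate(ratios, start=1):
            cumulative += r
            if cumulative >= threshold:
                d_f = i
                break
        dfs.append(d_f)
    return dfs
```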

If post-training with PCA-TopK, will it yield better performance?

PCA-TopK is a post-training method and only requires the PCA transforms to be calculated on a set of keys generated (in inference mode) over some calibration dataset. If the reviewer is suggesting adding a fine-tuning step for better performance, we think that is a good suggestion. In that paradigm, we also feel it might be more optimal to keep the model fixed and train a good transformation that minimizes the difference between the reduced dimensional and original attention scores.


References

[1] Joshi, Mandar, et al. "Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension." arXiv preprint arXiv:1705.03551 (2017).

Comment

Thanks for the responses! I will keep my rating.

Comment

We thank the reviewer for their response.

Official Review
Rating: 6

This paper proposes a sparse attention mechanism for large language models by leveraging the low-dimensionality of key vectors in the attention block. The approach ranks and selects tokens in the KV-cache based on attention scores computed in the reduced dimensional space, leading to significant speedups in attention computation without major sacrifices in model quality.

Strengths

Empirical Validation: Extensive evaluations demonstrate significant speedups with minimal accuracy degradation.

Theoretical Soundness: The use of low-rank keys is backed by robust theoretical analysis, enhancing the credibility of the approach.

Weaknesses

Implementation Complexity: The method's complexity might pose challenges for achieving the theoretical speedups without specialized knowledge.

Memory Footprint Consideration: While the approach excels in computation, it does not address memory footprint reduction, a limitation when compared to other sparse attention techniques.

Questions

  1. Model Class and Size Performance Variance: How does Loki's performance vary with different model classes and sizes not included in the paper?
  2. Integration with High-Efficiency Attention Mechanisms: How does the PCA-TopK method align with or complement high-efficiency attention mechanisms such as Flash Attention or Ring Attention? Can it be orthogonally integrated with these techniques, or are there specific challenges that need to be addressed?
  3. Comparison with Other Sparsity Strategies: The paper does not compare with attention sparse strategies beyond H2O, such as SnapKV[1]. Could the authors discuss the positioning of PCA-TopK relative to these methods and possibly include comparative analysis in future work?
  4. Figure Clarity: Regarding Figure 6 (left), it appears there might be a compression issue. Could the authors verify the image quality and ensure that it is legible in subsequent versions of the paper?
  5. Information Loss Management: How does PCA-TopK handle the potential loss of information when reducing the dimensionality of key vectors, especially in the context of Rotary Positional Embeddings (RoPE) which increase dimensionality?
  6. Combination with Compression Techniques: How does PCA-TopK interact with other model compression techniques, and is there a combined approach that could yield better results?
  7. Handling of Longer Sequences: How does Loki handle sequence lengths longer than those in the evaluation datasets?
  8. Quantization Integration: Can the PCA-TopK mechanism be integrated with other optimization techniques like quantization?

Limitations

N/A

Author Response

We thank the reviewer for their feedback and appreciate that the reviewer recognizes the strength of our extensive empirical evaluation backed by robust theoretical analysis.

Weaknesses

Implementation Complexity: We agree that achieving speedups with our method requires custom Triton kernels. We will open-source our kernels for wider use. Integrating with techniques like Flash Attention would involve a more complex implementation, as is the case with most modern attention methods.

Memory Footprint: We agree that our method does not reduce memory usage. It is designed as an efficient Top-K token selector without deleting tokens from the KV-Cache (which can incur a large accuracy penalty as seen with H2O). Our method can complement token-eviction methods like H2O by selecting Top-K tokens from a reduced KV-Cache.

Questions

Model Class and Size Performance Variance: How does Loki's performance vary with different model classes and sizes not included in the paper?

Appendix B, Figure 11 and Tables 3 & 4 (in the paper) show perplexity and downstream evaluation of our method on models of different model classes covering dense and MoE models. Further, for each class, we show results for models of different sizes (7B, 8B, 13B, 70B, 8x7B, 8x22B). Appendix A, Figure 7 (in the paper) shows the dimensionality analysis on an even larger set of model classes and sizes.

Integration with High-Efficiency Attention Mechanisms: How does the PCA-TopK method align with or complement high-efficiency attention mechanisms such as Flash Attention or Ring Attention? Can it be orthogonally integrated with these techniques, or are there specific challenges that need to be addressed?

Integrating PCA-TopK with Flash Attention (FA) or Ring Attention is theoretically possible, but performance challenges, especially with the Top-K operation, must be addressed to achieve speedups. While we haven't resolved all integration issues, we have a rough algorithm for FA integration (a code sketch follows the list):

  1. Load the query into shared memory
  2. Additional Loop:
    • Load the first $d_f$ dimensions of the keys into shared memory
    • Compute the approximate scores for each block of keys and update top-k indices
  3. Original FA Loop: Load blocks of top-k keys (based on the indices computed) with full dimensionality and follow the FA algorithm.
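Below is a host-side Python sketch of this control flow for a single head and query, under assumed shapes (the block size, names, and the deferred top-k update are simplifications; a fused Triton kernel would keep the query in shared memory and maintain a running top-k instead of materializing all approximate scores):

```python
import torch

def fused_pca_topk_sketch(q, k_reduced, k_full, v, k_keep, block_size=128):
    """q: [D] query already in the PCA basis; k_reduced: [seq, d_f] reduced keys;
    k_full: [seq, D] full keys; v: [seq, D] values."""
    seq_len, d_f = k_reduced.shape
    approx = torch.empty(seq_len)
    # Additional loop (step 2): stream blocks of d_f-dimensional keys and
    # compute approximate scores.
    for s in range(0, seq_len, block_size):
        approx[s:s + block_size] = k_reduced[s:s + block_size] @ q[:d_f]
    topk = approx.topk(min(k_keep, seq_len)).indices
    # Original FA loop (step 3): exact attention over only the selected full keys.
    scores = (k_full[topk] @ q) / q.shape[0] ** 0.5
    return torch.softmax(scores, dim=0) @ v[topk]
```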

Comparison with Other Sparsity Strategies: The paper does not compare with attention sparse strategies beyond H2O, such as SnapKV[1]. Could the authors discuss the positioning of PCA-TopK relative to these methods and possibly include comparative analysis in future work?

Methods like H2O and SnapKV [1] delete tokens to save KV-Cache memory. In contrast, PCA-TopK selects the Top-K tokens from the entire KV-Cache to reduce memory bandwidth without deleting tokens. Hence, PCA-TopK is orthogonal to H2O/SnapKV. A combined strategy could involve deleting tokens with H2O/SnapKV, and then selecting Top-K tokens from the retained cache. We will include a comparison of standalone PCA-TopK with other methods apart from H2O in future work.

Figure Clarity: Regarding Figure 6 (Left), it appears there might be a compression issue. Could the authors verify the image quality and ensure that it is legible in subsequent versions of the paper?

We will fix Figure 6 (Left) in the subsequent versions of the paper.

Information Loss Management: How does PCA-TopK handle the potential loss of information when reducing the dimensionality of key vectors, especially in the context of Rotary Positional Embeddings (RoPE) which increase dimensionality?

PCA-TopK is an approximate method and there is information loss when reducing the dimensionality. We try out different values of $d_f$ (reduced dimensionality) and find a good tradeoff between compute performance and accuracy. We also pick the better transform between post-rotary and pre-rotary. Figure 2 in the paper shows that RoPE indeed increases the dimensionality, but it is still low enough that settings like $d_f = 0.25$ (25% dimensionality) work well enough for a good compute-accuracy tradeoff (Figures 3 and 6 in the paper).

Combination with Compression Techniques: How does PCA-TopK interact with other model compression techniques, and is there a combined approach that could yield better results?

We discuss how our method can be used with token-eviction methods as a response to Question 3. PCA-TopK can theoretically be used with quantization, as PCA-TopK reduces dimensionality while quantization reduces bits per value. Model pruning [2] reduces model parameters, which is orthogonal to selecting tokens from the KV-Cache. We are unsure if pruning affects the low dimensionality of key vectors and leave that analysis for future work. We have not quantitatively analyzed combining PCA-TopK with other approaches and are unsure of the practical performance.

Handling of Longer Sequences: How does Loki handle sequence lengths longer than those in the evaluation datasets?

Figure 1 in the rebuttal pdf shows the evaluation of PCA-TopK on LongBench [3] long-sequence benchmark for the Llama-2-7b-Chat model. We have included a discussion on the same in the Global Rebuttal.

Quantization Integration: Can the PCA-TopK mechanism be integrated with other optimization techniques like quantization?

We discuss how our method can be used with quantization as a response to Question 6.


References

[1] Li, Yuhong, et al. "Snapkv: Llm knows what you are looking for before generation." arXiv preprint arXiv:2404.14469 (2024).

[2] Han, Song, et al. "Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding." arXiv preprint arXiv:1510.00149 (2015).

[3] Bai, Yushi, et al. "Longbench: A bilingual, multitask benchmark for long context understanding." arXiv preprint arXiv:2308.14508 (2023).

Comment

Thanks for the author's response, and I will increase my rating to 6.

Comment

We thank the reviewer for their response and the score increase.

Official Review
Rating: 6

The paper introduces a method for approximating attention in LLMs, with the benefit of improved inference efficiency. The insight is to focus on the dimensionality of the key vectors computed in the attention block. Principal component analysis reveals that the keys lie in a low-dimensional space. This gives rise to the proposed method, PCA-TopK, which uses PCA to compute approximate attention scores in a reduced dimension, then selects the top-k tokens based on these scores, and computes the equivalent of full attention only for the selected tokens. The method is evaluated on multiple LLMs from the Llama family, as well as Pythia, Mistral, Mixtral, and Phi, on various datasets. The results show comparable performance to full attention with significant speedups.
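To make the described mechanism concrete, here is a minimal single-step decoding sketch under assumed shapes (an illustration of the steps above, not the paper's Triton implementation); $d_f$ is the reduced dimensionality and $k$ the number of selected tokens.

```python
import torch

def pca_topk_attention(q, keys, values, pca, d_f, k):
    """q: [heads, D]; keys, values: [heads, seq, D]; pca: [heads, D, D] per-head
    PCA bases computed offline. One decoding step for a single sequence.
    (In practice the reduced keys would be cached, not recomputed each step.)"""
    D = keys.shape[-1]
    basis = pca[:, :, :d_f]                                   # [heads, D, d_f]
    # 1. Project the query and cached keys into the PCA basis, keeping d_f dims.
    q_red = torch.einsum("hd,hde->he", q, basis)              # [heads, d_f]
    k_red = torch.einsum("hsd,hde->hse", keys, basis)         # [heads, seq, d_f]
    # 2. Approximate attention scores in the reduced space.
    approx = torch.einsum("he,hse->hs", q_red, k_red)         # [heads, seq]
    # 3. Select the top-k tokens per head from the approximate scores.
    idx = approx.topk(k, dim=-1).indices                      # [heads, k]
    gather_idx = idx.unsqueeze(-1).expand(-1, -1, D)          # [heads, k, D]
    k_sel = torch.gather(keys, 1, gather_idx)
    v_sel = torch.gather(values, 1, gather_idx)
    # 4. Exact attention over only the selected keys and values.
    scores = torch.einsum("hd,hkd->hk", q, k_sel) / D ** 0.5
    return torch.einsum("hk,hkd->hd", torch.softmax(scores, dim=-1), v_sel)
```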

Strengths

The study of the low intrinsic dimensionality of attention keys across multiple models and datasets is insightful. Theoretical support is provided through lemmas and proofs. The evaluation includes several models, tasks, and datasets, showing that the approach generalizes. The authors developed optimized Triton kernels for practical implementation.

Weaknesses

The method does not seem to reduce the memory usage.

Questions

For some models the pre-rotary PCA transforms outperform the post-rotary ones, which is intriguing. The authors acknowledge that they do not have a clear explanation; however, if more understanding is developed, it would be great to include it in the paper.

Limitations

yes

Author Response

We thank the reviewer for their feedback and appreciate that the reviewer finds our observation of the low intrinsic dimensionality of keys insightful and supported by theoretical analysis and extensive evaluation.

Weaknesses

The method does not seem to reduce the memory usage

We agree that our method does not reduce memory usage. It is designed as an efficient Top-K token selector without deleting tokens from the KV-Cache (deleting tokens can incur a significant accuracy penalty, as seen with H2O). Our method can complement token-eviction methods like H2O by selecting Top-K tokens from a reduced KV-Cache.

Questions

For some models the pre-rotary PCA transforms outperform the post-rotary ones, which is intriguing. The authors acknowledge that they do not have a clear explanation; however, if more understanding is developed, it would be great to include it in the paper.

As noted, pre-rotary transforms outperforming post-rotary transforms is an intriguing observation for which we currently lack a clear explanation. We still have not been able to develop a concrete understanding of why this is the case. A naive intuition we have is that the PCA transform computed over post-rotary keys captures the distribution of token representations occurring at specific positions in the calibration dataset, whereas during inference the same token can appear at any other position. The pre-rotary transform captures the distribution of tokens with significantly less positional information, so it may generalize better.

While Lemma 4.2 (in the paper) proves that using the post-rotary transform is a good approximation, it does not show that it is the best approximation due to the variational upper bound formulation. Hence, the pre-rotary transform might provide a better approximation.

Author Response (Global Rebuttal)

We thank the reviewers for their valuable feedback.

PCA-TopK's performance on tasks with longer contexts

We ran the LongBench [1] long-sequence benchmark for the Llama2-7B-Chat model with PCA-TopK and compared its performance with Full Attention in Figure 1 of the rebuttal pdf. Figure 1 (Top-Left) shows that PCA-TopK performs well on all LongBench task categories for $(k_f = 0.25, d_f = 0.25)$, with Multidoc-QA having the most significant accuracy drop from ~22% to ~18%. Figure 1 (Bottom) shows a finer-grained evaluation on all LongBench tasks with two configurations: $(k_f = 0.25, d_f = 0.25)$ and $(k_f = 0.125, d_f = 0.5)$. For this particular model, we see that, on average, the post-rotary transform is better than the pre-rotary one, and $(k_f = 0.25, d_f = 0.25)$ is the better configuration. These observations are consistent with the evaluation of the tasks in the paper.

To examine the accuracy vs. performance trade-off, we plot overall LongBench accuracy along with attention latencies (computed in a micro-benchmark) for different configurations of $k_f$ and $d_f$ in Figure 1 (Top-Right). Attention times were computed with a prompt length of 3500 and a generation length of 512 to match LongBench's values for the Llama2-7B-Chat model. Due to the slow cache updates in HuggingFace, we lack an end-to-end framework but plan to integrate our attention method with an inference framework like vLLM soon. Nonetheless, our method shows potential for up to 40% attention speedups with minimal accuracy degradation in long and short-context tasks. We will include other models supported by LongBench in the camera-ready version of the paper, if accepted.

Memory usage

We acknowledge the limitation pointed out by the reviewers that our method does not reduce memory usage. Our method efficiently selects Top-K tokens from the KV-Cache without deleting tokens. As we have demonstrated with H2O, deleting tokens can have significant accuracy penalties. PCA-TopK can be used in conjunction with memory reduction methods such as token-eviction or quantization, and a combined approach may lead to better memory, performance, and accuracy trade-offs.

Dimensionality of query/value vectors and variable $d_f$ policy

We analyzed the dimensionality of query and value vectors (Figure 3 in the rebuttal pdf), finding that while query and key vectors share the low-dimensional observation, value vectors do not. These findings can inspire further research into LLM properties and sparsity exploitation for improved performance. We also experiment with a variable policy for setting $d_f$ per layer instead of a fixed policy across all layers. For the variable policy, we set the $d_f$ of every layer based on an explained variance threshold (varied from 0.5 to 0.8). Figure 2 (Right) in the rebuttal pdf shows the results of this evaluation on Llama3-8B, illustrating that the variable policy is no better than the simple fixed policy. We also evaluate on Llama2-13B and see a similar trend (plot not included in the pdf due to lack of space).


References

[1] Bai, Yushi, et al. "Longbench: A bilingual, multitask benchmark for long context understanding." arXiv preprint arXiv:2308.14508 (2023).

Final Decision

This paper proposes a PCA-based sparse attention mechanism for LLMs by leveraging the low-dimensionality of key vectors in the attention block. The method helps approximate the attention and yields inference speedup. Evaluation results demonstrate the efficacy of this approach for multiple LLMs.

Overall, the method is interesting & authors provide a nice explanation & theoretical analysis of the technique. The reported extensive evaluations also demonstrate the generalizability of this method for different Transformer models, datasets & tasks which is impressive.

The reviewers point out a few suggestions for improvements -- the authors are encouraged to incorporate these to further improve the paper. In particular, the issue (or rather non-improvement) with respect to the memory footprint was raised. One suggestion: even though this is not a focus of the paper, it might be worth carving out some space (maybe in future work or an auxiliary section) to discuss how the compute/memory complexity trade-off of your approach outweighs that of existing memory-reduction approaches, and potential future improvements.