PaperHub
7.0 / 10
Spotlight · 4 reviewers
Min: 7 · Max: 7 · Std: 0.0
Scores: 7, 7, 7, 7
Confidence: 4.0
Correctness: 2.8
Contribution: 3.0
Presentation: 2.8
NeurIPS 2024

MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention

OpenReview | PDF
Submitted: 2024-05-15 · Updated: 2025-01-15

Abstract

Keywords
LLMs Inference · Long-Context LLMs · Dynamic Sparse Attention · Efficient Inference

Reviews and Discussion

Review — Rating: 7

This paper presents a sparse computation technique for the attention mechanism of long-context large language models during the pre-filling stage. Specifically, the technique builds on the authors' observation of three patterns in the attention map: the A-shape pattern, the Vertical-Slash pattern, and the Block-Sparse pattern. In the proposed method, the authors first determine the optimal pattern for each attention head offline and then dynamically decide the hyperparameters for each pattern (e.g., sparse indices) on the fly. It is worth noting that the A-shape pattern is more regular in terms of the distribution of non-zero indices than the Vertical-Slash and Block-Sparse patterns. To provide a low-cost estimate of the non-zero indices in the Vertical-Slash and Block-Sparse patterns, the authors predict these indices from the product of the last query vector with the key matrix and from the product of a mean-pooled query matrix with the key matrix, respectively. Results on three different pre-trained LLMs and four datasets indicate that the proposed method effectively maintains information from long prompts while reducing pre-filling latency compared to current state-of-the-art methods.
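To make the index-estimation step concrete, here is a minimal PyTorch sketch of the idea summarized above (the last query tokens against K for the Vertical-Slash pattern, a block-wise mean-pooled Q against K for the Block-Sparse pattern). The function names, tensor shapes, pooling sizes, and top-k budgets are illustrative assumptions, not the paper's actual kernel configuration.

```python
import torch

def estimate_vertical_slash_indices(q, k, last_q=64, top_v=1000, top_s=200):
    """Estimate important key columns ("vertical") and diagonals ("slash")
    from the attention scores of only the last `last_q` query tokens."""
    n, d = q.shape
    s = (q[-last_q:] @ k.T) / d ** 0.5                       # (last_q, n)
    q_pos = torch.arange(n - last_q, n, device=q.device).unsqueeze(1)
    k_pos = torch.arange(n, device=q.device).unsqueeze(0)
    s = s.masked_fill(k_pos > q_pos, float("-inf"))          # causal mask
    s = torch.softmax(s, dim=-1)
    v_idx = s.mean(dim=0).topk(min(top_v, n)).indices        # key-column importance
    # Diagonal importance, indexed by offset = q_pos - k_pos (masked weights are 0).
    offset = (q_pos - k_pos).clamp(min=0)
    slash = torch.zeros(n, device=q.device, dtype=s.dtype)
    slash.scatter_add_(0, offset.flatten(), s.flatten())
    s_idx = slash.topk(min(top_s, n)).indices
    return v_idx, s_idx

def estimate_block_sparse_indices(q, k, block=64, top_blocks=100):
    """Estimate important (query-block, key-block) pairs using a block-wise
    mean-pooled Q against K, with key scores then pooled per block."""
    n, d = q.shape
    nb = n // block
    q_pool = q[: nb * block].reshape(nb, block, d).mean(dim=1)   # (nb, d)
    scores = (q_pool @ k[: nb * block].T) / d ** 0.5             # (nb, nb*block)
    block_scores = scores.reshape(nb, nb, block).mean(dim=-1)    # (nb, nb)
    flat_idx = block_scores.flatten().topk(min(top_blocks, nb * nb)).indices
    return torch.stack(
        (torch.div(flat_idx, nb, rounding_mode="floor"), flat_idx % nb), dim=1
    )                                                            # (top_blocks, 2)
```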

Strengths

  1. Impressive results: Based on the reported experimental results, the proposed method can reduce the pre-filling latency for a 1M context by ten times without compromising accuracy.

  2. Well-motivated: I appreciate the analysis and high-quality figures in Section 2, which explicitly illustrate the key takeaways for readers and effectively motivate the proposed method.

  3. Comprehensive summary in the related works section.

Weaknesses

  1. How can the authors ensure that the observations on the three LLMs in this paper generalize to other LLMs? Since the entire method is built on the observed attention patterns of these three LLMs, the generalizability of this observation is crucial to the robustness of this work. For example, the block-sparse pattern is the most dynamic among the three. What if there are even more dynamic attention maps in other LLMs with specific inputs? How is the search space determined, and how general is it? The offline Kernel-Aware Optimal Sparse Pattern Search determines the pattern for each head offline, and the pattern does not change during inference regardless of the input. Is it always true that the attention pattern for a specific head remains nearly the same?

  2. Lack of evaluation on larger models: For longer contexts, larger models seem to be a better option to ensure quality. However, the models evaluated in this paper are relatively small (<10B parameters). Thus, it is unclear whether the improvements still hold for larger models.

  3. Lack of discussion on usage for the generation/decoding stage: Pre-filling is important because it determines the first token latency, but generation/decoding is also important because it determines the throughput. It would be beneficial to add some discussion on whether the observations and the proposed method are still applicable for the generation/decoding stage. For example, while the A-shape pattern might still exist, the sparse indices approximation in the vertical-slash pattern might not work for generation/decoding.

Questions

  1. What is the dataset of input data used for the analysis in Section 2.2? How was it chosen?

  2. In Figure 3(c), why does the block-sparse pattern perform so poorly in attention compared to the A-shape and Vertical-Slash patterns? Based on my understanding, the block-sparse pattern is a more general and dynamic case for the A-shape and Vertical-Slash patterns, as shown in Figure 4. Could the block size be an important factor causing this issue?

  3. In Figure 3(c) and the corresponding description, I suggest explicitly mentioning which Top-K method is used. Based on my understanding, the Top-K method involves selecting the top-K tokens for all heads, while the other methods select the top-K tokens for each head. Presenting them together may confuse readers into thinking that Top-K means the top-K for each head.

  4. What does "real FLOPs" mean in Line 137? FLOPs itself is a "conceptual estimation" compared to real measured latency. I believe the authors may want to refer to the FLOPs after discarding the zero values in the attention pattern. If so, please be more specific.

  5. The LLaMA-3-8B-Instruct-262k is linked to LLaMA-3-70B-Instruct-262k in Line 176. Please clarify which model is used here.

  6. There are minor consistency issues, such as StreamLLLM and StreamingLLM.

Limitations

The authors discussed the limitations and social impact of their work.

Author Response
  1. "...generalizable... built on the observation of the attention patterns..."

Thank you for your question. We address it from two angles: the generalization of MInference and the relative stability of dynamic sparse attention patterns across different examples.

For the generalization of MInference,

  • To demonstrate the generalizability of MInference, we tested it on most open-source long-context LLMs, including LLaMA-3-8B/70B-1M, Yi-9B-200K, GLM-4-9B-1M, Phi-3-mini-128K, and Qwen2-7B-128K (see general response PDF). MInference consistently achieves good performance and acceleration across models with different pre-training data, training pipelines, RoPE structures, extended long-context methods, and sizes.
  • Additionally, we observed that three sparse attention patterns, especially the vertical and slash patterns, are not only present in GPT-like LLMs but also in BERT[2], T5 (both Encoder and Decoder), and MLLM[3]. We also found that "induction heads" exhibit a pattern similar to the "vertical and slash" pattern (see [4]).
  • By design, MInference accounts for different dynamic sparse attention patterns, which lends it a certain degree of generalization capability.

[2] SparseBERT: Rethinking the Importance Analysis in Self-Attention, ICML 2021. [3] LOOK-M: Look-Once Optimization in KV Cache for Efficient Multimodal Long-Context Inference, 2024. [4] A Mathematical Framework for Transformer Circuits, 2021.

For the relative stability of dynamic sparse attention patterns across different examples,

  • Visualization: Fig.3(a) shows the visualization of the sparse attention pattern distribution of the same head across different tasks and examples, demonstrating good consistency.
  • Attention Recall: Fig.3(c) shows the average recall rate of different tasks and examples within the searched pattern as the compression rate changes. The searched pattern achieves a higher recall rate across different tasks.
  • Downstream tasks performance: Using a configuration searched from a single example, we achieve performance close to full attention across different scenarios and benchmarks, further indicating the stability of dynamic sparse attention patterns across various tasks and examples.
  1. "What if there are even more dynamic attention maps in other LLMs with specific inputs"

Given the extreme sparsity of attention in long-context scenarios, where the significant attention weights are concentrated in a few blocks, even if the graph differs from the vertical and slash pattern and the A-shape, it can still be covered by the block-sparse pattern. Additionally, our experiments in Question 1 demonstrate the ubiquity of this dynamic sparse pattern in current long-context LLMs and the effective of MInference.

  1. "How is the search space determined, and how general is it?"

The search space is determined by a specific sparsity rate and adjusted according to the actual FLOPs required for different patterns in the kernel. Based on testing, the search space obtained is quite consistent across different tasks and examples.
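As a rough illustration of this head-level search (a sketch under simplifying assumptions, not the paper's actual search space), the snippet below picks, for each head, the candidate pattern whose retained positions recover the most attention mass on a calibration example, with all candidates constructed at a roughly comparable kernel-FLOPs budget; `a_shape_mask` is one example candidate.

```python
import torch

def attention_recall(attn, keep_mask):
    """Fraction of total attention weight covered by the retained positions."""
    return float((attn * keep_mask.float()).sum() / attn.sum())

def a_shape_mask(n, sink=128, window=1024, device="cpu"):
    """Initial ("sink") tokens plus a local window, intersected with causality."""
    i = torch.arange(n, device=device).unsqueeze(1)   # query positions
    j = torch.arange(n, device=device).unsqueeze(0)   # key positions
    return (j <= i) & ((j < sink) | (i - j < window))

def search_head_pattern(attn, candidate_masks):
    """attn: (n, n) dense attention map of one head on a calibration prompt.
    candidate_masks: pattern name -> boolean (n, n) mask at roughly equal FLOPs."""
    recalls = {name: attention_recall(attn, m) for name, m in candidate_masks.items()}
    return max(recalls, key=recalls.get), recalls
```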

  1. "Lack of evaluation on larger models"

Thank you for your suggestion. We have included results for LLaMA-3-70B-1M in the supplementary PDF. MInference achieves performance close to or even better than full attention, especially compared to baselines.

| Methods        | En.Sum | En.QA | En.MC | En.Dia | Zh.QA | Code.Debug | Math.Find | Retr.PassKey | Retr.Number | Retr.KV | Avg. |
|----------------|--------|-------|-------|--------|-------|------------|-----------|--------------|-------------|---------|------|
| LLaMA-3-70B-1M | 20.7   | 10.3  | 84.2  | 9.5    | 14.0  | 33.2       | 61.7      | 97.0         | 100.0       | 34.0    | 46.5 |
| StreamingLLM   | 20.5   | 8.5   | 52.0  | 10.0   | 12.6  | 27.4       | 61.1      | 14.0         | 10.0        | 0.0     | 21.6 |
| InfLLM         | 24.1   | 8.1   | 57.0  | 10.0   | 12.9  | 27.4       | 52.3      | 100.0        | 100.0       | 0.0     | 39.2 |
| MInference     | 20.6   | 10.1  | 83.4  | 10.0   | 14.1  | 34.1       | 61.9      | 100.0        | 100.0       | 39.0    | 47.3 |

Table 1. Performance of different methods with different base models on InfiniteBench.

  1. "Decoding"

Thank you for your suggestion. However, extending to decoding would require significant system work for CPU offloading, which we have earmarked for future work. Nonetheless, our preliminary experiments indicate that the extrapolation generality of these patterns is also very good, as shown in:

| Methods                              | En.Sum | En.QA | En.MC | Zh.QA | Code.Debug | Math.Find | Retr.Number |
|--------------------------------------|--------|-------|-------|-------|------------|-----------|-------------|
| GLM-4-9B-1M                          | 28.3   | 9.7   | 68.6  | 12.1  | 29.4       | 38.9      | 100.0       |
| MInference                           | 28.8   | 9.6   | 68.6  | 12.0  | 30.7       | 39.1      | 100.0       |
| MInference in prefilling and decoding | 28.0   | 9.4   | 67.3  | 11.9  | 31.7       | 39.1      | 100.0       |

Table 3. Performance of different methods on InfiniteBench using GLM-4-9B-1M.

  1. "Dataset used in Sec 2.2"

In Section 2.2, we used data from InfiniteBench to create Fig.3, including summarization, QA, Math, and retrieval tasks. Both (b) and (c) are results averaged after randomly selecting 10 examples from each subset.

  1. "Why does the block-sparse pattern perform so poorly?"

This is mainly because different patterns use different block sizes in the kernel. For example, the Vertical and Slash pattern can use a 1x64 block to process vertical lines instead of 64x64 in block-sparse. If this head pattern is processed using block-sparse, it would result in significant computational waste, leading to a lower attention recall at the same sparsity rate.
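To make this concrete, here is a back-of-the-envelope comparison with made-up numbers (our own illustration, not from the paper): covering the same vertical lines with coarse 64x64 blocks computes up to 64x more entries than 1x64 tiles, so at a fixed kernel-FLOPs budget the coarse variant covers far fewer important entries, which appears as lower attention recall in Fig. 3(c).

```python
n = 128_000        # sequence length (illustrative)
verticals = 1_000  # number of important key columns to cover (illustrative)

# Fine-grained 1x64 tiles: each tile scores 64 queries against one key column,
# so covering the verticals costs roughly one scored entry per (query, column).
entries_fine = verticals * n

# Coarse 64x64 blocks: a block that contains one vertical column still scores
# all 64 key columns in it, so up to 63/64 of the work lands on other entries.
entries_coarse = verticals * n * 64   # worst case: each vertical in its own block

print(entries_coarse / entries_fine)  # ~64x more kernel work for the same coverage
```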

  1. "I suggest explicitly mentioning which Top-K method is used"

We apologize for any confusion caused by our writing. Yes, your understanding is correct; the Top-K mentioned in Fig.3(c) refers to token-level Top-K. We will add all corresponding granularities of TopK and rewrite Section 2.2 for better comprehension in the next version.

  1. "Real FLOPs"

Real FLOPs refer to the FLOP values in the kernel after excluding non-zero calculations. Thank you for your suggestion; we will clarify this meaning in the next version.

  10. Incorrect link and typos

Thank you for your detailed review. We will correct these issues in the next version of our paper.

Comment

Thank you to the authors for the detailed rebuttal!

I truly appreciate your efforts in putting everything together within such a short period.

I have raised my score to 7.

One minor issue: For this statement: "This is mainly because different patterns use different block sizes in the kernel. For example, the Vertical and Slash pattern can use a 1x64 block to process vertical lines instead of 64x64 in block-sparse. If this head pattern is processed using block-sparse, it would result in significant computational waste, leading to a lower attention recall at the same sparsity rate."

--> I understand that the Slash pattern’s block size is smaller than that of block-sparse, i.e., 1x64 vs. 64x64. However, the logical connection between "significant computational waste" and "a lower attention recall at the same sparsity rate" is not entirely clear to me. Specifically, I believe that "significant computational waste" would lead to lower efficiency rather than lower recall.

Comment

Thank you again for recognizing our work and providing very helpful comments.

  1. "...'significant computational waste' would lead to lower efficiency rather than lower recall..."

Yes, you are correct. However, in Fig. 3(c), we defined the horizontal axis as Dense FLOPs / FLOPs in Kernel, which represents the sparsity rate within the kernel. We intended to convey that larger block sizes lead to computational waste, resulting in lower effective recall at the same sparsity rate.

Review — Rating: 7

A key challenge for LLM inference with long context lengths is the time-to-first-token for long prompts. This paper introduces a sparse attention method designed to accelerate prefill with long context lengths. The authors utilize a strategy that incorporates three different types of sparsity (A-shape, vertical-slash, and block-sparse). They calibrate which sparsity pattern to apply to each head, thereby adapting to the varying sparsity patterns across LLM heads. Their method is training-free and obtains significant latency gains in terms of time to first token.

Strengths

  • Time to first token is a significant problem that is under-addressed in existing KV cache compression works, as many applications have long input prompts and short generation lengths
  • Their approach is able to adapt dynamically to different inputs for certain attention patterns (which have different sparsity patterns), as well as to handle differing behaviors across attention heads
  • They provide efficient online kernel implementations for each sparsity pattern (both for constructing the sparsity pattern for the two dynamic patterns as well as for sparse computation)
  • They observe significant prefill speedups attained with minimal accuracy degradation
  • They provide detailed evaluation across a range of long context tasks, as well as ablation for each type of sparsity pattern (and additional analysis showing the distribution of sparsity patterns across attention heads, etc.)

Weaknesses

  • Their approach is heuristic-based (as it is based on fixed types of attention patterns), which may not generalize to new models that are released (some of these patterns could be specific to models that employ particular positional encodings or other model architecture features, which may limit generalizability)
  • The offline kernel search assumes that each head will play the same role regardless of the target task (or alternatively that calibration data will always be available for the target task in order to pre-determine which general attention pattern to use and what budget to allocate to that head)
  • The writing in some sections is poor (e.g., end of Section 2)

Questions

  • Do you have any analysis that shows that the attention head patterns are the same regardless of the input task (or that their search converges to the same solution for the same head across different tasks)?

Limitations

Yes

Author Response
  1. "...which may not generalize..."
  • To demonstrate the generalizability of MInference, we tested it on most open-source long-context LLMs, including LLaMA-3-8B/70B-1M, Yi-9B-200K, GLM-4-9B-1M, Phi-3-mini-128K, and Qwen2-7B-128K (see general response PDF). MInference consistently achieves good performance and acceleration across models with different pre-training data, training pipelines, RoPE structures, extended long-context methods, and sizes.
  • Additionally, we observed that three sparse attention patterns, especially the vertical and slash patterns, are not only present in GPT-like LLMs but also in BERT[2], T5 (both Encoder and Decoder), and MLLM[3]. We also found that "induction heads" exhibit a pattern similar to the "vertical and slash" pattern (see [4]).
  • By design, MInference accounts for different dynamic sparse attention patterns, which lends it a certain degree of generalization capability.

[2] SparseBERT: Rethinking the Importance Analysis in Self-Attention, ICML 2021. [3] LOOK-M: Look-Once Optimization in KV Cache for Efficient Multimodal Long-Context Inference, 2024. [4] A Mathematical Framework for Transformer Circuits, 2021.

  1. "... assumes that each head will play the same role regardless of the target task..."

Indeed, based on our experiments and observations, these dynamic sparse patterns are relatively fixed across different tasks and examples. This is corroborated by our experimental results, where a configuration searched from a single example can achieve performance close to full attention across various scenarios and benchmarks.

  1. "The writing in some sections is poor (e.g., end of Section 2)"

We apologize for any confusion caused by our writing and will rewrite the relevant content, particularly Section 2.2, to improve readability.

  1. "...analysis that shows that the attention head patterns are the same regardless of the input task..."

Thank you for your suggestion. We explain the relative stability of head-level dynamic sparse attention patterns from several perspectives:

  • Visualization: Fig.3(a) shows the visualization of the sparse attention pattern distribution of the same head across different tasks and examples, demonstrating good consistency.
  • Attention Recall: Fig.3(c) shows the average recall rate of different tasks and examples within the searched pattern as the compression rate changes. The searched pattern achieves a higher recall rate across different tasks.
  • Downstream tasks performance: Using a configuration searched from a single example, we achieve performance close to full attention across different scenarios and benchmarks, further indicating the stability of dynamic sparse attention patterns across various tasks and examples. We will rewrite the paper to highlight this point and thank you again for your helpful and insightful comments.
| RULER        | Effective | 4K   | 8K   | 16K  | 32K  | 64K  | 128K | Avg. |
|--------------|-----------|------|------|------|------|------|------|------|
| GLM-4-9B-1M  | 64K       | 93.8 | 91.6 | 89.3 | 87.4 | 85.2 | 80.8 | 88.0 |
| StreamingLLM | 4K        | 93.8 | 66.9 | 58.5 | 51.4 | 45.9 | 39.1 | 59.3 |
| InfLLM       | 8K        | 94.7 | 89.5 | 76.4 | 66.5 | 56.8 | 53.5 | 72.9 |
| MInference   | 64K       | 94.6 | 93.1 | 91.0 | 89.6 | 85.5 | 84.0 | 89.6 |

Table 2(a). Performance of different methods on RULER.

Comment

I appreciate the authors' detailed responses addressing my comments. Their response sufficiently justifies the claims that their approach generalizes to a range of LLMs, as well as that particular heads play a similar role regardless of the task. I will therefore keep my initial score.

Review — Rating: 7

Targeting the time-consuming pre-filling stage for long contexts, this paper proposes an efficient sparse attention mechanism for the prefilling stage, based on the observation of common attention patterns. The proposed method can be integrated into most existing LLMs, such as LLaMA-3 and Yi-9B. It achieves good performance compared to the original model while significantly reducing computation time.

Strengths

1: The motivation is clear and the paper is easy to understand.

2: The proposed method seems simple yet effective.

3: The experiments are comprehensive.

Weaknesses

1: It’s tricky to claim that “we found the three patterns”. Those patterns were found many years ago, dating back to when BERT / Transformer models were proposed. Even the idea of leveraging such attention sparsity has been reinvented many times for different Transformer-based models.

Some other major concerns can be found in the "Question" section.

Questions

  1. Is the slash-vertical pattern distinguishable from lambda-shaped patterns? By my understanding, lambda-shaped patterns are just a special case of the slash-vertical pattern. Perhaps if you replace all lambda-shaped patterns with the slash-vertical pattern, it would still work well.
  2. Could you please do a comparison between the proposed MInference and H2O[1]? From my understanding, without considering the real efficiency, it’s hard for a method relying on attention scores/similarity to obtain significantly better performance than H2O.
  3. From my own experience, when the tasks become more challenging, the performance of efficient attention methods might show quite different effectiveness. For example, many long context methods cannot pass the passkey retrieval test with a 100-digit passkey, while with a 5-digit passkey, most methods work perfectly. Could you please construct experiments on more challenging tasks? Especially on those most recently proposed ones, for example: [2][3].
  4. Also, Llama-3 is much more robust to perturbation and error accumulation. Could you please construct experiments with some weaker LLMs, such as Mistralv0.2, Llama2, and gemma-1.1?

References: [1] H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models [2] KV Cache Compression, But What Must We Give in Return? A Comprehensive Benchmark of Long Context Capable Approaches [3] BABILong: Testing the Limits of LLMs with Long Context Reasoning-in-a-Haystack

Limitations

Yes

Author Response

Thank you for your thorough review, and we apologize for any confusion caused.

  1. "Claim that 'we found the three patterns'"

Thank you for your critique. We will revise the wording accordingly. Indeed, we discussed the importance of sparse attention in related works and how it inspired our research. We want to emphasize our understanding of the dynamic nature of these sparse patterns, which led us to propose the training-free MInference method.

  1. "Lambda is part of vs?"

Yes, the A-shape pattern is a specific type of "vertical and slash" pattern. However, the A-shape head exhibits a more fixed sparsity pattern across different tasks and examples. Due to its static nature, it does not require online sparse index building and allows for more extreme sparse kernel optimization, resulting in better inference acceleration. We differentiate these three patterns based on kernel speedup and spatial locality.

  1. "Compare with H20"

In fact, MInference and H2O optimize two different problems in long-context LLMs inference: prefilling stage latency and KV cache storage, respectively. These methods are orthogonal. Furthermore, we applied MInference with SnapKV in our paper (Table 5 in paper), which is a SoTA KV cache compression method superior to H20. MInference can be used in conjunction with KV cache compression methods to achieve even better performance.

  1. "More challenging tasks"

Thank you for your suggestion. However, I would like to clarify that we have tested on very challenging long-context benchmarks, including RULER, InfiniteBench, Needle In A Haystack, and Language Model, with lengths ranging from 128k to 1M tokens.

  • RULER includes subtasks such as Multi-Needle and Multi-hop Tracing and is one of the most challenging long-context benchmarks, sharing similar tasks with BABILong and a similar benchmark ranking.
  • We have additionally conducted evaluations on the KV cache compression benchmark [5], including LongBench (refer to Table 2(b) below) and Needle In A Haystack (see Figure 8 in the paper and Figure 1 in the general response PDF), showing that MInference does not negatively impact performance.
  • InfiniteBench includes many challenging long-context tasks. For example, the KV retrieval task requires LLMs to recall a 36-bit value from several random 36-bit key-value pairs by retrieving the value corresponding to a random 36-bit key, which is not as simple as the 5-digit passkey task mentioned in the comments.

Overall, our tests have included similar or even more challenging long-context tasks, effectively reflecting the minimal impact of MInference on the capabilities of long-context LLMs.

| LongBench       | SingleDoc | MultiDoc | Summ. | FewShot | Synth. | Code | Avg. |
|-----------------|-----------|----------|-------|---------|--------|------|------|
| LLaMA-3-8B-262K | 33.5      | 28.3     | 29.4  | 66.9    | 43.0   | 42.4 | 40.6 |
| StreamingLLM    | 29.7      | 23.7     | 28.8  | 65.8    | 19.3   | 42.8 | 35.0 |
| InfLLM          | 31.6      | 25.8     | 29.1  | 66.3    | 36.3   | 41.8 | 38.5 |
| MInference      | 34.0      | 28.0     | 29.7  | 66.8    | 42.8   | 42.2 | 40.6 |

Table 2(b). Performance of different methods on LongBench.

[5] KV Cache Compression, But What Must We Give in Return? A Comprehensive Benchmark of Long Context Capable Approaches.

  1. "Weaker models"

Besides testing on the powerful LLaMA-3-8B-1M, we also conducted tests on the Yi-9B-200K model, which has an effective context size of only 8k in RULER, showing that MInference can achieve performance close to Full Attention without changing the effective context size. Moreover, we have supplemented our results with GLM-4-9B-1M, Phi-mini-128K, and Qwen2-7B-128K (see general response PDF). Notably, the latter two models have weaker long-context capabilities, with effective context sizes of 4k and 8k in RULER, respectively. They show a significant performance drop in the Needle In A Haystack task when exceeding 100K tokens, but using MInference, they can achieve close or even better performance. We will include these results in future versions. Additionally, the weak LLMs mentioned by the reviewer have context windows smaller than 32K and cannot be tested on benchmarks like InfiniteBench. We will adapt more long-context LLMs in the future.

Comment

Thank you so much for the response. Most of my concerns are resolved. I have raised my score to 6.

However, after carefully checking all your rebuttals, I have more questions about the effectiveness of MInference:

1: Most long-context tasks are simple evidence-QA tasks, which have much lower information density compared to standard short-context tasks. I'm curious whether MInference can still "maintain accuracy" on standard short-context tasks, such as GSM8K, MMLU, and GPQA.

2: About the slash pattern, I'm curious about the ratio of [computed tokens/blocks] / [not-computed tokens/blocks] on some real tasks, for example KV-retrieval. In other words, how many slashes are there for a real task? Would you mind elaborating on this?

Comment

Thank you for your recognition and detailed, insightful comments, which have been very helpful for our work.

  1. "Minference can still 'maintain accuracy' on standard short context tasks"

Our current understanding is primarily due to the following reasons:

  • The sparsity exhibited in short-context attention is generally lower than in long-context attention, but they still exhibit some degree of sparsity.
  • Attention in different heads exhibits stable sparse patterns. The difficulty or information density of tasks does not change the sparsity of the heads but only alters the specific sparse index positions. As described in "induction heads", multiple layers of attention have differentiated functions, contributing to the complex reasoning capabilities.
  • In our experiments, the sparsity rate used for short context sizes was not high. For LongBench, with an average length of 32k, we used the search space from Tab. 6, equivalent to 1k + 4K A-shape FLOPs, to maintain consistency.
  1. "[computed token/blocks]/[not computed token/blocks]"

Thank you for your question. In Appendix F, we discuss the actual block-level sparsity rates within the kernel for different tasks. The vertical and slash patterns exhibit an actual sparsity rate of 78%-90% around 100k and greater than 95% above 500k. While there are slight variations in actual sparsity rates across different tasks, the overall values fluctuate around these numbers.

PS: Thank you for raising the score. It seems it has not yet been reflected in the system. If possible, could you please assist with this? Once again, thank you for your recognition and suggestions regarding our work.

Comment

Already corrected the score : )

  • About the first question, would you mind showing some supporting results on standard short-context tasks?

  • About the second question, could you please provide this ratio for some real examples, e.g., some from LongBench? I'm really curious about how those slash patterns are distributed across queries. With 100k context, 78% sparsity is not that high. I suspect this is because those slashes are densely distributed over the last few query tokens, and the early-appearing diagonal slashes contribute the major part of the sparsity.

Comment

To request some detailed results on standard benchmarks and clarify my point:

For any efficiency technique, it is more challenging to maintain performance on standard short-context tasks than on long-context tasks: 1: The long-context tasks reported in this paper are basically evidence-QA, barely requiring reasoning abilities. 2: As I stated before, short contexts have much less tolerance to information loss.

(LongBench, using the LLaMA-3 tokenizer, is only ~12k tokens long on average.)

Comment

Thank you for your patience, as the additional experiments took some time to complete, and thank you for your insightful suggestions. Your input has been instrumental in improving the completeness of our paper, and we will include the relevant discussions in the updated version.

  1. "...show some to supporting results on standard short context tasks..."

We add the results in GSM8K with 9-shot[1], which has an average length of 2.8K tokens, as shown in Table 4. The table uses a sparsity equivalent to 128 + 128 A-shape FLOPs, with an average sparsity of 84% in the kernel. Information redundancy is lower in short contexts than in long contexts, making them more sensitive to compression. As can be seen, our method experiences a slight drop in performance but remains close to the results of full attention. Compared to other baselines, there is a clear and uniform improvement. This demonstrates that, in addition to complex retrieval or knowledge-intensive tasks, our method also effectively preserves the reasoning capabilities of LLMs.

| Methods         | GSM8K |
|-----------------|-------|
| LLaMA-3-8B-262K | 63.8  |
| w/o few-shot    | 37.2  |
| StreamingLLM    | 53.2  |
| InfLLM          | 56.4  |
| MInference      | 61.7  |

Table 4. Performance of different methods on GSM8K with 9-shot ICL using LLaMA-3-8B-262K.

[1] Complexity-Based Prompting for Multi-Step Reasoning, ICLR 2023.

However, we would like to emphasize that our method is primarily designed for long-context scenarios. In short contexts, the latency proportion of attention is not high, and the overhead from dynamic sparse index building becomes more significant.

Furthermore, we have already tested our method on extremely challenging long-context LLM benchmarks, especially the RULER benchmark, which includes Multi-hop Tracing and Multi-Needle tasks that require reasoning abilities to complete.

  1. "...provide this ratio for some real examples?"

Apologies for the oversight in my previous response. As illustrated in Fig.12 of our paper, the average block-wise sparsity ratio of "78%-90% around 100k" is 88%, which allows MInference to achieve a 3x-4x speedup in single attention operations and a 1.8x speedup on an end-to-end basis. The variability in sparsity across different inputs mainly stems from the density of the slash lines, which causes differences sparsity in the blocks within the kernel.

Thank you for the correction. The average number of tokens in LongBench is 12k; we mistakenly cited data from InfLLM. We have calculated the average sparsity rate in the kernel for LongBench, which ranges from 55% to 92%, depending on the number of tokens in the task. For example, tasks with longer contexts like narrativeqa have an average sparsity rate above 90%, while shorter-context tasks like lcc are closer to 60%. Moreover, in terms of performance, our method outperforms the other baselines and is closer to the full-attention performance.
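For reference, the block-level sparsity rate quoted above can be read as the fraction of kernel tiles that contain no retained entry and can therefore be skipped; the helper below is a hypothetical PyTorch reference over a boolean keep-mask, not the instrumentation used for Fig. 12 or the LongBench numbers.

```python
import torch

def block_sparsity(keep_mask, block=64):
    """keep_mask: (n, n) boolean attention mask; returns the fraction of
    block x block tiles that contain no retained entry and can be skipped."""
    n = keep_mask.shape[0] // block * block
    tiles = keep_mask[:n, :n].reshape(n // block, block, n // block, block)
    nonempty = tiles.any(dim=3).any(dim=1)     # (n/block, n/block)
    return 1.0 - nonempty.float().mean().item()
```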

Comment

I don’t want to seem too picky, but I believe the most important thing in research is to be sincere, and I am really bothered by the current trend of over-claiming. Your method's contributions are already sufficient, even if it works well only on long texts. No method is perfect.

I have no idea why you keep trying to emphasize that your method works well on short tasks, as if such sparsity could be used across the board. If you do want to show that MInference has minimal impact on all aspects of the model, then more rigorous testing on standard tasks should be conducted.

Back to your newly updated results: thank you for the experiments. However, I must admit I find your GSM8K setting very odd. I have never seen 9-shot; most GSM8K experiments use either 5-shot, as in [1, 2], or 8-shot, as in [3, 4]. The performance of LLaMA-3 also seems like it should be better. Using more shots increases redundancy.

Given the effectiveness on long texts, at this time I will not change my score. But I very much look forward to further responses from the authors addressing my concerns.

[1]QWEN2 TECHNICAL REPORT, https://arxiv.org/pdf/2407.10671 [2]Gemma 2: Improving Open Language Models at a Practical Size, https://arxiv.org/pdf/2408.00118 [3]https://llama.meta.com [4]DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model, https://arxiv.org/pdf/2405.04434

Comment

We are very grateful for your critique and corrections. We fully agree with and acknowledge your exposition of MInference's limitations in short-context scenarios. In fact, we have never claimed that MInference is suitable for short-context scenarios, and we discussed these limitations in Appendix A. We will rewrite this section in the next version of the paper to highlight the discussion of limitations in short-context scenarios.

Here we provide some details to support this content:

  1. MInference cannot achieve any acceleration in short-context scenarios (prompts < 10k tokens), and due to the dynamic index building, it can be 30%-50% slower than FlashAttention-2. Moreover, since the latency proportion of Attention is not high in short-context, even if the single Attention module has a speedup, the end-to-end speedup is minimal.
  2. The sparsity rate of Attention in short-context scenarios is significantly lower than in long-context scenarios. For example, controlling for the same sparsity rate (96.8%) as in Fig.2(b) of the paper, the Attention recall drops from 96.4% at 128K to 89.8% at 4K.
  3. If the sparsity ratio from long-context scenarios is used in short contexts, MInference causes a noticeable performance degradation. For example, as shown in Table 6, the accuracy on qasper (one of the single-document QA tasks in LongBench) drops from 29.64% to 8.04%.
| Methods                            | qasper in LongBench |
|------------------------------------|---------------------|
| LLaMA-3-8B-262K                    | 29.24               |
| MInference using 55% sparsity rate | 29.64               |
| MInference using 85% sparsity rate | 8.04                |

Table 6. Performance of different sparsity rates on qasper in LongBench using LLaMA-3-8B-262K.

"...gsm8k setting being very odd..."

We appreciate your critique and correction. We did not attempt any cherry-picking; we followed the experimental setup for GSM8K as per [1, 2]. We will review the relevant experimental settings, align the 8-shot results, and update them in the next version of the paper.

[1] Complexity-Based Prompting for Multi-Step Reasoning, ICLR 2023.

[2] Chain-of-Thought Hub: Measuring LLMs' Reasoning Performance, Github.

Thank you once again for your efforts in reviewing our work. Your input is very helpful for the further improvement of our research, and we will update the relevant content in the next version of our paper.

Comment

Thanks for the response.

I'm just trying to figure out what optimal sparsity MInference can achieve on standard tasks. It's a more fundamental question related to the Transformer's internal mechanisms. I totally understand that it might be too much for the rebuttal period. Looking forward to more detailed results in the next version.

Comment

Thank you for your patience. After debugging, we found that the performance drop on GSM8K is primarily due to the extended-context model LLaMA-3-8B-262K [1], which loses significant performance on GSM8K (dropping from 78.9 to 63.8). We have now aligned the 8-shot prompt and evaluation script with lm_eval [3], using the original LLaMA-3-8B [2], with an average prompt length of 800 tokens.

We conducted experiments using the same FLOPs as an A-shape pattern with a 32-token local window and 128 global tokens. The actual sparsity rate in the kernel was approximately 65%. We observed that most methods experienced significant performance losses, with MInference showing a 10.5-point accuracy drop compared to full attention. StreamingLLM and InfLLM exhibited even greater drops of 14.6 and 22.3 points, respectively. Additionally, when we further increased the sparsity rate, MInference suffered from severe hallucinations, losing its reasoning capability.

| Methods      | GSM8K |
|--------------|-------|
| LLaMA-3-8B   | 78.9  |
| StreamingLLM | 64.3  |
| InfLLM       | 56.6  |
| MInference   | 68.4  |

Table 5. Performance of different methods on GSM8K with 8-shot ICL using LLaMA-3-8B.

In summary, the current effective methods show considerable performance degradation in short-context scenarios, and further optimization is needed in future work. We will include this analysis in the next version.

Thank you once again for your insightful suggestions and the effort you put into reviewing our paper! Your feedback has been incredibly helpful to us.

[1] https://huggingface.co/gradientai/Llama-3-8B-Instruct-262k [2] https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct [3] https://github.com/EleutherAI/lm-evaluation-harness

Comment

Really appreciate your efforts in conducting those experiments. I have raised my score.

Comment

Thank you deeply for your effort in reviewing our paper and for proposing such helpful suggestions!

Review — Rating: 7

This paper proposes MInference, a method to accelerate the pre-filling stage for long-context LLM generation. The key method leveraged by MInference is dynamic sparsification, which consists of three sparse patterns observed in attention matrices: the A-shape pattern, the Vertical-Slash pattern, and the Block-Sparse pattern. A sparse pattern search method is developed to minimize the sparsification error for each attention. MInference has been tested across various LLMs and benchmarks, demonstrating that it significantly accelerates LLM generation in long-context settings with no performance drop.

Strengths

  • The paper is well-written and well-motivated.
  • Studying how to accelerate LLM generation for long-context settings is of great practical importance.
  • The experimental results of the proposed MInference method are promising.

Weaknesses

  • The experimented model scales are relatively small (e.g., LLaMA-3-8B and Yi-9B) and are limited to dense LLMs (i.e., not MoE).
  • It is not clear how general the observed sparse patterns are across LLMs of various scales.
  • The implementation details on how the proposed MInference interacts with existing CUDA kernels (e.g., flashattention) are not very clear.

Questions

  • How does MInference scale up to LLMs with larger sizes, e.g., 70B? And how does MInference perform for MoE-based LLMs, e.g., Mixtral 8x7B?
  • How general are the observed sparse patterns across LLMs?
  • Is MInference compatible with other inference engines, e.g., vLLM?
  • Are the Vertical-Slash Head and Block-Sparse Head part of CUDA kernels in MInference, or does MInference leverage other implementations of those sparse head kernels?

Limitations

The limitations of the proposed method have been thoroughly discussed. And there is no potential negative societal impact of this work.

Author Response
  1. "For larger models and MoE"

Thank you for your suggestion. We have added experimental results for LLaMA-3-70B-1M, as shown in Table 1, where MInference maintains excellent performance in larger models, significantly surpassing StreamingLLM and InfLLM[1], and nearly matching full attention. Due to resource constraints, we could not provide MoE results during the rebuttal period, but based on our experience, modifications to FFN do not affect the sparse attention patterns. We will include MoE results in future versions.

| Methods        | En.Sum | En.QA | En.MC | En.Dia | Zh.QA | Code.Debug | Math.Find | Retr.PassKey | Retr.Number | Retr.KV | Avg. |
|----------------|--------|-------|-------|--------|-------|------------|-----------|--------------|-------------|---------|------|
| LLaMA-3-70B-1M | 20.7   | 10.3  | 84.2  | 9.5    | 14.0  | 33.2       | 61.7      | 97.0         | 100.0       | 34.0    | 46.5 |
| StreamingLLM   | 20.5   | 8.5   | 52.0  | 10.0   | 12.6  | 27.4       | 61.1      | 14.0         | 10.0        | 0.0     | 21.6 |
| InfLLM         | 24.1   | 8.1   | 57.0  | 10.0   | 12.9  | 27.4       | 52.3      | 100.0        | 100.0       | 0.0     | 39.2 |
| MInference     | 20.6   | 10.1  | 83.4  | 10.0   | 14.1  | 34.1       | 61.9      | 100.0        | 100.0       | 39.0    | 47.3 |

Table 1. Performance of different methods with different base models on InfiniteBench.

[1] InfLLM: Unveiling the Intrinsic Capacity of LLMs for Understanding Extremely Long Sequences with Training-Free Memory, 2024.

  1. "Generality ability across models"

We tested MInference on a variety of long-context LLMs, including LLaMA-3-8B-1M, Yi-9B-200K, GLM-4-9B-1M, Phi-3-mini-128K, and Qwen2-7B-128K, all of which maintained performance close to full attention (see general response PDF). Additionally, we observed that three sparse attention patterns, especially the vertical and slash patterns, are present in BERT[2], T5 (both Encoder and Decoder), and MLLM[3]. We also found that "induction heads" exhibit a pattern similar to the "vertical and slash" pattern (see [4]). In conclusion, we believe MInference is highly generalizable across different structures, sizes, pre-training data, training pipelines, RoPE structures, and extended long-context methods.

[2] SparseBERT: Rethinking the Importance Analysis in Self-Attention, ICML 2021. [3] LOOK-M: Look-Once Optimization in KV Cache for Efficient Multimodal Long-Context Inference, 2024. [4] A Mathematical Framework for Transformer Circuits, 2021.

  1. "Compatible with other inference engines, e.g., vLLM"

Yes, we have implemented MInference within vLLM. Our method only requires replacing the corresponding FlashAttention kernel and does not affect the scheduling of vLLM PageAttention or similar features. We will open-source the implementation after the paper review.

  1. "CUDA Kernel"

We have utilized Dynamic Sparse Compiler PIT, Triton, and FlashAttention to implement CUDA Kernels for Vertical and Slash Head and Block-sparse Head, ensuring transferability and support for long-context LLMs inference across different GPUs and CUDA versions. Details can be found in Appendix C.4. We have submitted the relevant code with our submission and will open-source the implementation after the review.
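As a purely reference-level companion to the kernel description above (the real kernels use PIT/Triton/FlashAttention-style tiling rather than a dense mask), the PyTorch sketch below shows only the semantics a Vertical-Slash head computes; the index arguments are assumed to come from an online estimation step, and always keeping the main diagonal is our simplification to avoid empty rows.

```python
import torch

def vertical_slash_attention_reference(q, k, v, v_idx, s_idx):
    """Dense-masked reference: attention restricted to selected key columns
    (verticals), selected diagonals (slashes), and the causal main diagonal."""
    n, d = q.shape
    i = torch.arange(n, device=q.device).unsqueeze(1)   # query positions
    j = torch.arange(n, device=q.device).unsqueeze(0)   # key positions
    keep = torch.zeros(n, n, dtype=torch.bool, device=q.device)
    keep[:, v_idx] = True                               # vertical lines
    keep |= torch.isin(i - j, s_idx.to(torch.long))     # slash lines (fixed offsets)
    keep |= (i == j)                                    # keep diagonal: no empty rows
    keep &= (j <= i)                                    # causal mask
    scores = (q @ k.T) / d ** 0.5
    scores = scores.masked_fill(~keep, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v
```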

General Author Response

We are grateful for the diligent efforts and insightful comments from the reviewers. Your suggestions have been incredibly valuable to our work. We will address and resolve these issues in our responses and in subsequent versions of our paper. Here we respond to some common questions and have included additional experimental results (see attached PDF), specifically:

  1. "For larger models"

We have supplemented the results for LLaMA-3-70B-1M on InfiniteBench, demonstrating that MInference continues to perform exceptionally well in larger models, outperforming SoTA baselines such as StreamingLLM and InfLLM[1], and matching the performance of full attention.

| Methods        | En.Sum | En.QA | En.MC | En.Dia | Zh.QA | Code.Debug | Math.Find | Retr.PassKey | Retr.Number | Retr.KV | Avg. |
|----------------|--------|-------|-------|--------|-------|------------|-----------|--------------|-------------|---------|------|
| GLM-4-9B-1M    | 28.3   | 9.7   | 68.6  | 39.5   | 12.1  | 29.4       | 38.9      | 100.0        | 100.0       | 41.0    | 46.7 |
| StreamingLLM   | 27.7   | 6.4   | 40.2  | 12.5   | 10.8  | 27.7       | 21.1      | 97.1         | 25.6        | 0.6     | 27.0 |
| InfLLM         | 28.0   | 7.3   | 45.0  | 14.0   | 10.7  | 27.9       | 39.4      | 98.0         | 100.0       | 2.6     | 37.3 |
| MInference     | 28.8   | 9.6   | 68.6  | 38.5   | 12.0  | 30.7       | 39.1      | 100.0        | 100.0       | 43.0    | 47.0 |
| LLaMA-3-70B-1M | 20.7   | 10.3  | 84.2  | 9.5    | 14.0  | 33.2       | 61.7      | 97.0         | 100.0       | 34.0    | 46.5 |
| StreamingLLM   | 20.5   | 8.5   | 52.0  | 10.0   | 12.6  | 27.4       | 61.1      | 14.0         | 10.0        | 0.0     | 21.6 |
| InfLLM         | 24.1   | 8.1   | 57.0  | 10.0   | 12.9  | 27.4       | 52.3      | 100.0        | 100.0       | 0.0     | 39.2 |
| MInference     | 20.6   | 10.1  | 83.4  | 10.0   | 14.1  | 34.1       | 61.9      | 100.0        | 100.0       | 39.0    | 47.3 |

Table 1. Performance of different methods with different base models on InfiniteBench.

[1] InfLLM: Unveiling the Intrinsic Capacity of LLMs for Understanding Extremely Long Sequences with Training-Free Memory, 2024.

  1. "Generality ability across models"
  • To demonstrate the generality of MInference, we tested it on various open-source long-context LLMs, including but not limited to LLaMA-3-8B-1M, Yi-9B-200K, GLM-4-1M, Phi-3-mini-128K, and Qwen2-7B-128K, compared with additional baseline InfLLM. Our method consistently achieves good performance and acceleration across models with different pre-training data, training pipelines, RoPE structures, extended long-context methods, and sizes.
  • We have also included results from LongBench to further substantiate the generality of MInference in scenarios around 32K tokens.
  • Moreover, we observed that three sparse attention patterns, particularly the vertical and slash patterns, are not only present in GPT-like LLMs but also in BERT[2], T5 (both Encoder and Decoder), and MLLM[3]. We also found that "induction heads" exhibit a pattern similar to the "vertical and slash" pattern (see [4]).
  • In summary, we believe MInference exhibits strong generality across various model architectures.

[2] SparseBERT: Rethinking the Importance Analysis in Self-Attention, ICML 2021. [3] LOOK-M: Look-Once Optimization in KV Cache for Efficient Multimodal Long-Context Inference, 2024. [4] A Mathematical Framework for Transformer Circuits, 2021.

| RULER        | Effective | 4K   | 8K   | 16K  | 32K  | 64K  | 128K | Avg. |
|--------------|-----------|------|------|------|------|------|------|------|
| GLM-4-9B-1M  | 64K       | 93.8 | 91.6 | 89.3 | 87.4 | 85.2 | 80.8 | 88.0 |
| StreamingLLM | 4K        | 93.8 | 66.9 | 58.5 | 51.4 | 45.9 | 39.1 | 59.3 |
| InfLLM       | 8K        | 94.7 | 89.5 | 76.4 | 66.5 | 56.8 | 53.5 | 72.9 |
| MInference   | 64K       | 94.6 | 93.1 | 91.0 | 89.6 | 85.5 | 84.0 | 89.6 |

Table 2(a). Performance of different methods on RULER.

| LongBench       | SingleDoc | MultiDoc | Summ. | FewShot | Synth. | Code | Avg. |
|-----------------|-----------|----------|-------|---------|--------|------|------|
| LLaMA-3-8B-262K | 33.5      | 28.3     | 29.4  | 66.9    | 43.0   | 42.4 | 40.6 |
| StreamingLLM    | 29.7      | 23.7     | 28.8  | 65.8    | 19.3   | 42.8 | 35.0 |
| InfLLM          | 31.6      | 25.8     | 29.1  | 66.3    | 36.3   | 41.8 | 38.5 |
| MInference      | 34.0      | 28.0     | 29.7  | 66.8    | 42.8   | 42.2 | 40.6 |

Table 2(b). Performance of different methods on LongBench.

Final Decision

This paper addresses an important problem: improving the efficiency of long-context inference. The solution is solid with good novelty, and the experimental section is well designed to validate the proposed approach.