PaperHub

Rating: 5.0 / 10 — Rejected (3 reviewers; individual ratings 5, 5, 5; min 5, max 5, std 0.0)
Confidence: 3.7 · Correctness: 2.7 · Contribution: 2.7 · Presentation: 2.7
ICLR 2025

Hierarchy-Aided Sparse Attention For Fast LLMs Prefilling Inference

OpenReview · PDF
Submitted: 2024-09-17 · Updated: 2025-02-05
TL;DR

A method for accelerating the pre-filling phase of LLMs using hierarchical attention.

Abstract

Keywords
Long-Context LLM; Pre-Filling Acceleration; Sparse Attention

Reviews & Discussion

Review 1 (Rating: 5)

This paper introduces Hierarchy-Aided Sparse Attention (HASA), a simple and effective approach for improving the efficiency of the pre-filling phase of large language models (LLMs) on long-context inputs. By employing diagonal block sparse attention with specifically designed global attention tokens, the method significantly reduces attention-related floating-point operations (FLOPs) without substantially compromising language modeling performance.
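
For concreteness, here is a minimal sketch of the diagonal block sparse attention pattern discussed in this review, where each token attends only to earlier tokens within its own chunk; the function name, chunk size, and the complexity remark are illustrative assumptions, not details from the paper:

    import torch

    def block_diagonal_causal_mask(seq_len: int, block_size: int) -> torch.Tensor:
        """Boolean mask of shape [seq_len, seq_len]; True means attention is allowed."""
        idx = torch.arange(seq_len)
        same_block = (idx[:, None] // block_size) == (idx[None, :] // block_size)
        causal = idx[:, None] >= idx[None, :]
        return same_block & causal

    # With chunk size B, per-layer attention cost drops roughly from O(L^2) to O(L * B).
    mask = block_diagonal_causal_mask(seq_len=8, block_size=4)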

Strengths

  1. This paper is well-written, with a clear analysis and a step-by-step presentation of the improvements from the original diagonal block sparse attention to Hierarchy-Aided Sparse Attention. The method is technically sound, and the experiments demonstrate its efficiency in the pre-filling phase on long-context inputs.
  2. The experiments are thorough, covering language modeling, few-shot tasks, and long-context tasks, and show the effectiveness of Hierarchy-Aided Sparse Attention.

Weaknesses

  1. Introducing another transformer branch to distill the global information seems computationally heavy. It would be better to include the original diagonal block sparse attention and HASA without the additional transformer branch as baselines, to ablate the contribution of each proposed component to efficiency.
  2. A strong baseline is sliding-window attention plus some global attention at the end; in that setting, information flows between overlapping blocks, in contrast to independent block sparse attention (a minimal sketch of such a mask is given after this list). I would like to see its efficiency and performance compared to HASA.
  3. Why does HASA not include sink attention? As many papers suggest, LLMs tend to assign most of the attention score to the first token (i.e., the sink token). It seems that more effort (e.g., training cost and network design) will be needed to convert full attention to sparse attention when the sink token is dropped.
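
For reference, a minimal sketch of the baseline pattern raised in points 2 and 3: causal sliding-window attention with a small always-visible sink/global prefix, in the spirit of StreamingLLM. The function name and parameter values are assumptions for illustration only:

    import torch

    def window_plus_sink_mask(seq_len: int, window: int, num_sink: int) -> torch.Tensor:
        """Boolean mask of shape [seq_len, seq_len]; True means attention is allowed."""
        idx = torch.arange(seq_len)
        causal = idx[:, None] >= idx[None, :]
        in_window = (idx[:, None] - idx[None, :]) < window   # keep the most recent tokens
        is_sink = idx[None, :] < num_sink                    # first tokens stay visible to all
        return causal & (in_window | is_sink)

    mask = window_plus_sink_mask(seq_len=16, window=4, num_sink=1)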

Questions

Please address the concerns and questions raised in the Weaknesses section.

Review 2 (Rating: 5)

This paper proposes using block sparse attention to reduce prefilling time, together with an additional global-attention branch to compensate for the loss of global information. For global attention, each block is pooled into a representative token, which is passed to a new transformer branch. The paper further applies parameter-efficient fine-tuning to LLMs with this method. The approach improves over full attention on long-context benchmarks and reduces latency.
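
As a rough, assumed illustration of the pooling step summarized above (the tensor shapes, block size, and variable names are placeholders, not the paper's settings):

    import torch

    block_size = 256
    hidden = torch.randn(1, 4096, 768)        # (batch, seq_len, hidden_dim), assumed shapes
    b, L, d = hidden.shape
    # Average-pool each block of tokens into one representative token per block.
    reps = hidden.view(b, L // block_size, block_size, d).mean(dim=2)   # (batch, num_blocks, hidden_dim)
    # `reps` would then be processed by the additional global-attention branch.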

Strengths

  1. This paper proposes a novel way to bridge sparse attention methods and hierarchical attention.

  2. The prefilling phase is time-consuming, and this paper reduces that time while maintaining most of the performance.

Weaknesses

  1. The novelty of this paper is limited; it is largely a combination of existing methods.

  2. The speedup is not significant, and the additional branch requires a second pass through the transformer layers.

Questions

  1. In Figure 2, it seems you are trying to convey information with color; please add a description of each color.

Review 3 (Rating: 5)

This paper introduces HASA (Hierarchy-Aided Sparse Attention) to reduce the quadratic computational complexity of attention. HASA incorporates a specialized transformer branch that combines sparse attention with hierarchical embeddings. Specifically, HASA first separates long texts into a series of chunks and computes local attention within each chunk during prefilling. To establish connections between chunks, the tokens inside each chunk also attend to a special embedding, obtained by average-pooling the token embeddings of that chunk; this special embedding in turn attends to the special embeddings of previous chunks via the specialized branch with a shifted attention mask. Experiments on various benchmarks demonstrate that HASA not only maintains performance but also outperforms baseline methods in certain scenarios.
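
To illustrate the inter-chunk pattern described above, a minimal, assumed sketch of a shifted causal mask over the chunk-level special embeddings; this illustrates the general idea only and is not the authors' implementation:

    import torch

    def shifted_chunk_mask(num_chunks: int) -> torch.Tensor:
        """Boolean mask of shape [num_chunks, num_chunks]; True means attention is allowed."""
        i = torch.arange(num_chunks)
        # The special embedding of chunk i attends only to special embeddings of strictly
        # earlier chunks, so cross-chunk information flows without full token-level attention.
        return i[:, None] > i[None, :]

    mask = shifted_chunk_mask(num_chunks=6)
    # Note: row 0 is all False here; how the first chunk is handled (e.g., attending to
    # itself) is an assumption left open in this sketch.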

Strengths

  • The key idea of leveraging a special embedding to capture dependency among different chunks is interesting in the context of efficient pre-filling.
  • HASA achieves lower end-to-end latency compared to baseline methods.
  • Table 4 shows that HASA achieves superior performance on several tasks from LongBench such as SQA and MQA.

Weaknesses

  1. HASA introduces a specialized branch to model inter-chunk attention, which, while ensuring efficiency, increases GPU memory consumption.
  2. The performance comparison experiments with prior methods are not fully convincing. For example, YaRN-7B was trained on the PG19 dataset, whereas HASA was trained on the SlimPajama dataset in stage 1, followed by stage 2 instruction tuning on a mixed dataset. It is recommended to reimplement YaRN using the datasets utilized in this paper for a fairer comparison.
  3. Compared to training-free methods for efficient long-context inference, such as Streaming LLM and MInference, HASA's adaptability to various LLMs and scenarios may be limited.

Questions

Please see the Weaknesses section.

AC Meta-Review

Summary

This paper proposes a method (Hierarchy-Aided Sparse Attention) to accelerate the pre-filling phase of large language models (LLMs) with long-context inputs. HASA reduces the quadratic complexity of full attention by leveraging block sparse attention and introduces a specialized transformer branch to handle global information across chunks. The method achieves significant speedups in the pre-filling phase while maintaining strong performance on long-context benchmarks.

Strengths

The proposed method combines sparse and hierarchical attention, which provides an efficient mechanism for pre-filling acceleration. The experiments demonstrate robust performance on various long-context tasks, and the paper is clearly written with thorough analysis and insights.

Weaknesses

As noted by reviewers, HASA has limited novelty compared to existing techniques, and the additional transformer branch increases computational overhead and memory usage. The comparison with baseline methods lacks fairness due to differences in training datasets. Some key ablations and baseline comparisons, such as diagonal sparse attention without the additional transformer branch, are missing. Furthermore, concerns regarding the adaptability of HASA across broader scenarios were not adequately addressed.

Additional Comments from the Reviewer Discussion

During the rebuttal period, the reviewers raised the following key points:

  1. Baselines: The reviewers noted that comparisons with YaRN-7B were unfair due to differences in training datasets. Other important baselines are also missing.

  2. Ablation Studies: The reviewers requested additional ablations to isolate the impact of each component of HASA, such as diagonal sparse attention without the transformer branch.

The authors did not provide a rebuttal or any clarification on these issues.

Final Decision

Reject