PaperHub
Score: 6.0/10
Decision: Rejected · 4 reviewers
Ratings: 3, 3, 5, 4 (min 3, max 5, std 0.8)
Average confidence: 3.0
Novelty: 2.3 · Quality: 2.5 · Clarity: 2.3 · Significance: 2.5
NeurIPS 2025

Generalized Neighborhood Attention: Multi-dimensional Sparse Attention at the Speed of Light

OpenReview · PDF
Submitted: 2025-05-10 · Updated: 2025-10-29

Keywords

Sparse attention, neighborhood attention

Reviews and Discussion

Review
Rating: 3

This paper introduces Generalized Neighborhood Attention (GNA), a sparse attention method that adds "stride" to local attention mechanisms. GNA unifies sliding-window and blocked attention, optimizing computational efficiency while maintaining quality. The authors develop NATTENSim, an analytical tool predicting speedup upper bounds, and implement a high-performance kernel for NVIDIA Blackwell GPUs.

Strengths and Weaknesses

Strengths:

  1. The proposed method is training-free and achieves close-to-theoretical speedups on GPUs.

Weaknesses:

  1. The paper is not written clearly. The methodology is only one page. Without any equations or pseudocode, it is not easy to understand the detailed definition of the algorithm or how the kernel is implemented.
  2. The results only cover the Blackwell architecture and therefore cannot be compared against other works' hardware results on Hopper or Ampere. This is fine in itself, but if the authors consider targeting only Hopper or Ampere a limitation of other works, isn't targeting only Blackwell likewise a limitation of this work? If the framework is not limited to Blackwell, could you provide results on Hopper or Ampere?

Questions

  1. Why is a simulator used in the paper? Is it because the kernel code base is hard-coded for certain configurations? Why not simply generate kernels for the different sparse configurations and measure the real speedup instead of developing a simulator? In addition, I do not think the design of the simulator is clearly described.

  2. Does the kernel code base contain backward capability?

Limitations

NA

Final Justification

The authors have resolved some of my confusion in the rebuttal. However, I cannot be sure that the final writing of the paper will be satisfactory, so my rating cannot be positive.

Formatting Issues

NA

Author Response

Thank you for your time and feedback.

We begin by reporting that, since the submission, we have also successfully completed implementations for the Hopper architecture, with throughput comparable to Flash Attention v3, the current state of the art. In addition, we have implemented backward pass kernels for both our Hopper and Blackwell kernels, extending our approach and methodology to the millions of Hopper GPUs still in use and allowing for efficient training.

Due to the formatting guidelines for NeurIPS rebuttal, we cannot directly share detailed results, but our key observation is that our Hopper kernels achieve similar, and sometimes even better performance improvements compared to our Blackwell kernels (in the case of standard NA, or GNA cases that are not fully block sparse).

We can also share that when running GNA against STA on the Hunyuan model, GNA consistently outperforms STA, and achieves the full 11.1X speedup theoretically possible, while STA only reports a 10.45X speedup in their paper.

The paper is not written clearly. The methodology is only one page. Without any equations or pseudocode, it is not easy to understand the detailed definition of the algorithm or how the kernel is implemented.

We apologize, and will make every effort to reorganize the paper to make it more coherent and comprehensible. At the same time, we note that we do describe our methodology clearly, but due to space limitations had to move part of it to the appendix. We clearly cross-reference and explain this in each subsection of the methodology, and did the same for the related work section. We simply could not save any more space without risking moving our key results and visualizations to the appendix and facing similar criticisms.

The results only cover the Blackwell architecture and therefore cannot be compared against other works' hardware results on Hopper or Ampere. This is fine in itself, but if the authors consider targeting only Hopper or Ampere a limitation of other works, isn't targeting only Blackwell likewise a limitation of this work? If the framework is not limited to Blackwell, could you provide results on Hopper or Ampere?

Please see the above note regarding our Hopper implementation. We will be adding those results in the final version of our paper, showing that we achieve even slightly better performance (for cases where we don’t already achieve the maximum speedup possible) on Hopper than Blackwell.

With respect to it being a limitation, we simply do not see it this way, as many papers have proposed approaches limited to specific hardware, and some have been published at NeurIPS. One very well known and established example is Flash Attention 3.

If you’re referring to this sentence from Table 5 caption:

“We do not report STA speedups since their implementation is limited to the Hopper architecture”

This is not a criticism of STA; it merely points out the reason for not including runtimes for STA: comparing the runtimes of two different methods on two different GPU architectures is not informative.

However, as stated, we will remove this in the final version and compare directly against STA, which, as stated above, we also outperform.

Why is a simulator used in the paper? Is it because the kernel code base is hard-coded for certain configurations? Why not simply generate kernels for the different sparse configurations and measure the real speedup instead of developing a simulator? In addition, I do not think the design of the simulator is clearly described.

Could you kindly clarify this? What do you mean specifically by “just generate the kernel”?

Even if we assume kernels have been written and developed for all design choices and attention patterns, which could not be farther from the truth, running these kernels requires: 1. the GPU itself, and 2. more time and careful measurement of runtime. Our analytical model can run on any CPU and in a fraction of the time.

The intention behind the simulator, as described in the paper, is so that we can analytically study the performance levels of different attention patterns, specifically those covered by the GNA framework, without having to first invest time and energy and money into implementing kernels. In addition, our analytical study also inspired some of the design choices in the kernel.

Finally, kernels are implementations. They can vary in quality, efficiency, and coverage. Implementations do not necessarily give away performance implications or even the amount of computation done by a kernel.

Simply put, 100% of the runtime of a kernel is not always spent on pure computation, especially if we mean tensor core / MMAs by that.

We hope this clarifies the motivation, goals, and achievements of NATTENSim.

As for how it is designed, we describe it in detail in lines 577-638.
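For readers of this thread who do not have the appendix at hand, the sketch below illustrates the kind of tile-level accounting such an analytical model can perform, reduced to 1-D: per query tile, count how many KV tiles intersect the union of its queries' windows and compare against dense attention. The window rule, names, and parameters are our own illustration, not NATTENSim's actual implementation.

```python
# Illustrative sketch only (not NATTENSim): tile-level speedup bound in 1-D.
def window_1d(i, seq_len, window, stride):
    # Strided neighborhood window: queries in the same stride group share a
    # window centered on the group (an assumption for this sketch), clamped
    # at the sequence boundaries.
    leader = min((i // stride) * stride + stride // 2, seq_len - 1)
    start = min(max(leader - window // 2, 0), seq_len - window)
    return start, start + window  # KV indices [start, end)

def simulate_speedup(seq_len, window, stride, tile_q, tile_kv):
    visited = 0
    for q0 in range(0, seq_len, tile_q):
        spans = [window_1d(i, seq_len, window, stride)
                 for i in range(q0, min(q0 + tile_q, seq_len))]
        lo = min(s for s, _ in spans)
        hi = max(e for _, e in spans)
        # Tile-granular masking: whole KV tiles are either visited or skipped.
        visited += (hi - 1) // tile_kv - lo // tile_kv + 1
    dense = ((seq_len + tile_q - 1) // tile_q) * ((seq_len + tile_kv - 1) // tile_kv)
    return dense / visited  # upper bound on attention speedup at tile granularity

# 4096 tokens, window 512, stride 512, 128x128 tiles -> prints 8.0 (fully block-sparse).
print(simulate_speedup(4096, 512, 512, 128, 128))
```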

Does the kernel code base contains the backward capability?

No. Kernels are rarely (in our experience never) backward compatible. You cannot run Flash Attention 2 on a Volta GPU. You cannot run Flash Attention 3 on an Ampere GPU. More importantly, in recent NVIDIA architectures, some features such as Tensor Core instructions are also specialized for specific generations and not forward compatible either, meaning you cannot run Flash Attention 3 on a Blackwell GPU.

Comment

The rebuttal has resolved some of my concerns, and I have raised my score to 3. However, at this stage I am still not satisfied with the overall story and evaluation. The contributions of the paper are scattered across multiple small points: the algorithm, the simulator, and kernels for Blackwell GPUs. It would be better if the novelty were highlighted and the writing improved accordingly.

Comment

We thank you for your time and feedback. We will continue to make improvements to the organization of the paper.

However, we would like to push back on the characterization of our individual contributions (algorithm, analytical model and simulator, Hopper and Blackwell kernels) as "small points", and on the claim that they are scattered. Our contributions are clearly highlighted in Sec. 3, and further details about each contribution are presented in their corresponding appendices, which are cross-referenced clearly in Sec. 3 as well. This was inevitable due to the page limit -- but again, we are committed to making every effort to further improve the organization and readability.

Comment

I appreciate the work from the authors. My point is that it would be great if each component were novel and solid enough for a self-contained paper. For example, the algorithm part could discuss what the best sparse attention method is that we can find. For the kernel part, it would be great if it could be a general sparse attention library for Blackwell GPUs, supporting vision and language models, training, etc. As for the simulator, it is not a widely adopted practice on GPUs; if it could be general purpose and applicable to different settings, it might change the ecosystem a lot. However, when everything is placed in one paper, I cannot dive into the details of each part and give a fair assessment (as you said, the pages are limited).

Comment

Thank you.

We note that the individual pieces (methodology, simulator, kernel) together answer very basic questions the community has been asking for a long time: how much can sparse attention speed up my application, exactly? (simulator); is there room for more efficiency? (stride); and how do we realize it? (Hopper and Blackwell kernels).

Our kernels and simulator will also become part of the NATTEN open source project, which is already dedicated to providing fast sparse attention kernels and supports training, inference, and various modalities, all usable within PyTorch with only a few lines of code.

For the simulator, we again note that this is very common and reasonable to do in the space. Just like we can and sometimes want to measure the computational complexity of an algorithm before we implement it, we want to do the same for sparse attention kernels. The design choices in the kernel, programming model of each architecture, and the like make it impossible to conduct a fair comparison of different methods by solely looking at their runtimes.

Our simulator not only provides a better theoretical upper-bound of speedup for the covered methods, it can also help find the best design choice for a specific use case / hardware.

Overall, it is true that there is a lot of content in our paper that many readers may be unfamiliar with. We remain committed to improving that experience in the revision and enabling more readers to learn about the concepts discussed and introduced.

We hope that the fact that our work includes a lot of details and thus requires a lot of content is not considered a shortcoming.

Review
Rating: 3

This paper proposes a method called “Generalized Neighborhood Attention” (GNA), aiming to speed up multidimensional self-attention blocks with local sparsity. GNA extends Neighborhood Attention (NA) by adding a stride on top of it, which makes NA more generalizable. This work is motivated by attention speedup approaches such as sliding window attention and blocked attention, and aims to provide a more generalized implementation of them. It evaluates how much speedup can be achieved by GNA in Cosmos-7B, HunyuanVideo, and FLUX, and shows that GNA can provide 28% to 46% end-to-end speedup.

Strengths and Weaknesses

Strengths:

  • The authors evaluate multiple off-the-shelf models with different architectures using a similar methodology, showing that the method is generalizable.
  • The intuition of providing a more general unification of different attention speedup approaches is nice.

Weaknesses

  • The paper is not very well written. The problems include, but are not limited to:
    • Lack of explanation of the actual methodology: there is no equation or visualization in the body explaining how the striding is done. I have to guess based on my knowledge of how stride is generally done.
    • Many sentences are not well organized. Examples include “but instead rely on a small memory operation, dubbed token permutation, instead of fusing multi-dimensional tiling into the kernel like FNA” => you use two “instead”s in one sentence. “Any setting in which T_KV evenly divides window size,and T_Q evenly divides stride achieves this” -> it would be easier to understand if you said this can be achieved whenever T_KV evenly...
    • The experiment setups for different models look similar: keep the first few diffusion steps, and then apply GNA to the rest. It is better to summarize them in a table.
  • The ablation study is limited. This paper reports large speedups under different experimental setups, but for many of the reported results, the baseline is just self-attention. The results would be more meaningful and convincing if the authors conducted ablation studies using stronger baselines such as sliding window attention.
  • The technical novelty of this work is limited. As mentioned in the first bullet point, I may not fully understand the technical details as there is no detailed explanation. But as far as I understand, this work adds a stride to an existing method and shows it can provide a larger speedup, which is not a very novel idea. The technical contribution could be larger if the authors discussed how to find the optimal strides and provided more scientific insights.

Questions

  • “we selected Q and KV tile shapes (TQ and TKV ) according to the shape of the token layout (feature map size)” => How do you select the tile shapes given a token layout?
  • “beyond this level of sparsity without further training/fine-tuning we cannot maintain visual quality” => How do you measure the quality? Based on human evaluation, or based on scores like those in Table 5? The criteria are only explained for Table 5. Are they the same for the other experiments?
  • Similarly, the first X diffusion steps are kept for many experiments. How was X decided?

Limitations

See the weakness section above.

Final Justification

The authors' rebuttal clarified some issues, but it does not resolve my main concern: just adding a stride to neighborhood attention lacks novelty. One way to address that is to show this is a very general solution, but I do not think the authors provided enough evidence for that. Therefore, I keep my score.

Formatting Issues

No

Author Response

Thank you for your valuable time and feedback on our paper.

We begin by reporting that, since the submission, we have also successfully completed implementations for the Hopper architecture, with throughput comparable to Flash Attention v3, the current state of the art. In addition, we have implemented backward pass kernels for both our Hopper and Blackwell kernels, extending our approach and methodology to the millions of Hopper GPUs still in use and allowing for efficient training.

Due to the formatting guidelines for NeurIPS rebuttal, we cannot directly share detailed results, but our key observation is that our Hopper kernels achieve similar, and sometimes even better performance improvements compared to our Blackwell kernels (in the case of standard NA, or GNA cases that are not fully block sparse).

We can also share that when running GNA against STA on the Hunyuan model, GNA consistently outperforms STA, and achieves the full 11.1X speedup theoretically possible, while STA only reports a 10.45X speedup in their paper.

Regarding the organization of the paper, we are grateful for your suggestions and will make every effort to make it more coherent and understandable. However, we note that we had no choice but to summarize our methodology and move the more detailed descriptions to the extended methodology in the appendix, due to the very strict 9 page limit, and given the number of figures and tables.

Lack of explanation of the actual methodology: there is no equation or visualization in the body explaining how the striding is done. I have to guess based on my knowledge of how stride is generally done.

In fairness, Figure 2 depicts the exact pattern of stride following the same type of visualization as presented in all similar prior works (NAT [1], DiNAT [2], Swin Transformer [3], HaloNet [4]). We would be happy to revise it to make it even more comprehensible in the final version.

Many sentences are not well organized…

Thank you for pointing out those sentences, we will correct them in the final version.

The experiment setups for different models look similar: keep the first few diffusion steps, and then apply GNA to the rest. It is better to summarize them in a table.

We present this information clearly in Tables 2, 3 and 4. These, along with the workload distributions reported in Table 1, are all of the variables affecting the end to end performance. These models run at different diffusion steps (Cosmos at 35 steps, HunyuanVideo at 50 steps, and FLUX at 28 steps), and naturally have different workload statistics, tensor shapes, and therefore sparsity configurations. This was unavoidable.

The ablation study is limited. This paper reports large speedups under different experimental setups, but for many of the reported results, the baseline is just self-attention. The results would be more meaningful and convincing if the authors conducted ablation studies using stronger baselines such as sliding window attention.

We have a sliding window baseline. NA is a sliding window approach. We also have blocked attention as a baseline in some of the experiments, but it is impossible for blocked attention to outperform the block-sparse GNA cases, both in speed and quality.

The technical novelty of this work is limited. As mentioned in the first bullet point, I may not fully understand the technical details as there is no detailed explanation. But as far as I understand, this work adds a stride to an existing method and shows it can provide a larger speedup, which is not a very novel idea. The technical contribution could be larger if the authors discussed how to find the optimal strides and provided more scientific insights.

We respectfully disagree. GNA is a novel methodology that we propose, extending the existing sliding window / neighborhood attention family of attention patterns, and unifying existing and seemingly very different patterns such as NA and blocked attention.
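To make the unification concrete for other readers of this thread, here is a rough 1-D illustration of how a single stride parameter can interpolate between sliding-window and blocked attention. This is our own reading of the pattern shown in Figure 2 and of how stride is typically defined in this family of methods, not the authors' exact formulation; all names are illustrative.

```python
# Rough 1-D illustration (not the paper's exact definition): every group of
# `stride` consecutive queries shares one window, computed neighborhood-attention
# style around the group and clamped at the sequence boundaries.
def gna_window(i, seq_len, window, stride):
    leader = min((i // stride) * stride + stride // 2, seq_len - 1)
    start = min(max(leader - window // 2, 0), seq_len - window)
    return start, start + window  # KV indices [start, end) attended by query i

seq_len, window = 16, 4
for stride in (1, 2, 4):
    print(f"stride={stride}:", [gna_window(i, seq_len, window, stride) for i in range(seq_len)])
# stride=1 recovers per-query sliding windows (standard NA);
# stride=window gives non-overlapping blocks (blocked attention);
# intermediate strides trade window overlap for block-sparsity.
```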

Analytical studies of performance models are very rare in the ML systems space, and yet we propose a first-of-its-kind simulator program for analytically studying NATTEN and GNA, giving useful, informative upper bounds for both use cases and kernel design choices. Future researchers and engineers interested in this or similar methodologies can extend and use this approach to get very realistic estimates of their op-level and end-to-end speedups before spending a single cent or second on writing kernels.

Finally, our Blackwell FNA kernel, along with our new Hopper FNA counterpart, are among the few (to our knowledge the only) static sparse attention implementations for vision that offer the maximum speedup theoretically possible.

“we selected Q and KV tile shapes (TQ and TKV ) according to the shape of the token layout (feature map size)” => How do you select the tile shapes given a token layout?

The choices for tile shapes are limited and partially tied to hardware, and out of the choices available, we can easily find the optimal ones for a given pattern using NATTENSim. However, just choosing tile shapes that evenly divide the input is typically enough to proceed.
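As a toy illustration of this selection step (the divisibility rule is the one quoted by the reviewer above; the candidate tile shapes and the example sizes are hypothetical, not taken from the paper):

```python
# Hypothetical helper: filter candidate (T_Q, T_KV) tile shapes for a 2-D token
# layout, keeping those that evenly divide the feature map, and flagging the
# perfectly block-sparse ones (T_KV divides window and T_Q divides stride in
# every dimension). The candidate list below is illustrative, not exhaustive.
from itertools import product

def divides(tile, extent):
    return all(e % t == 0 for t, e in zip(tile, extent))

def pick_tiles(feature_map, window, stride, candidates):
    picks = []
    for t_q, t_kv in product(candidates, repeat=2):
        if divides(t_q, feature_map) and divides(t_kv, feature_map):
            block_sparse = divides(t_kv, window) and divides(t_q, stride)
            picks.append((t_q, t_kv, block_sparse))
    return picks

# Example: a 48x80 feature map, 24x40 window, 8x8 stride (all made-up numbers).
for t_q, t_kv, ok in pick_tiles((48, 80), (24, 40), (8, 8), [(4, 16), (8, 8), (8, 16), (16, 8)]):
    print(t_q, t_kv, "perfectly block-sparse" if ok else "partial tiles")
```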

“beyond this level of sparsity without further training/fine-tuning we cannot maintain visual quality” => How do you measure the quality? Based on human evaluation, or based on scores like those in Table 5? The criteria are only explained for Table 5. Are they the same for the other experiments? Similarly, the first X diffusion steps are kept for many experiments. How was X decided?

X, and the quality threshold, were decided based on a limited human evaluation: observing various samples and comparing against self-attention and other sparse attention patterns. We tried to minimize X and maximize our returns while keeping quality acceptable.

For HunyuanVideo, as you pointed out, we also present VBench results, and for FLUX we also present GenEval benchmarks, MAN-IQA, and QualiCLIP results as well.

References

[1] Hassani, Ali, et al. "Neighborhood attention transformer." Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2023.

[2] Hassani, Ali, and Humphrey Shi. "Dilated neighborhood attention transformer." arXiv preprint arXiv:2209.15001 (2022).

[3] Liu, Ze, et al. "Swin transformer: Hierarchical vision transformer using shifted windows." Proceedings of the IEEE/CVF international conference on computer vision. 2021.

[4] Vaswani, Ashish, et al. "Scaling local self-attention for parameter efficient visual backbones." Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2021.

Comment
  • Fig 2 makes sense. If my pdf search works well, Fig 2 is not referred to anywhere in the body of the paper. Please consider adding a proper reference so the content is well organized and easy to read.
  • Only using self-attention as baseline => Specifically, Table 6 only uses self-attention as the baseline, right?
  • "X, and the quality was decided based on a limited human evaluation done by observing various samples" => so that's basically a hand tuning hyper parameter?
Comment
  • Thank you; Figure 2 is referenced in Line 64 and Line 138.
  • Not exactly; GNA with stride 1x1 is also considered a baseline, as it is the standard Neighborhood Attention approach. Self-attention and neighborhood attention are both baselines in Table 6 specifically. In other experiments (i.e. Table 5), we also have Blocked Attention and STA.
  • Yes.
Comment

Can you explain why you didn't add other methods in Table 6? Depending on the task, it is sometimes not a surprise that an extension of NA can achieve higher acceleration.

Comment

Could you please clarify what other methods you're referring to?

Also, we report detailed upper bounds for all of our performance measurements. No approach, under the same level of sparsity, can exceed those upper bounds, and in many cases, including the case of Table 6 (FLUX image gen), our implementation already achieves the upper bound of speedup. This means no approach under the same level of sparsity can outperform ours.
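For context, the coarsest (FLOP-wise) form of such an upper bound is simply the reciprocal of the retained fraction of KV interactions; the tile-level bound reported by the simulator is tighter, since it also charges for partially visited tiles. As a purely illustrative calculation (the 91% figure here is an example, not a number taken from the paper):

$$\text{speedup}_{\max} \le \frac{1}{1 - \text{sparsity}}, \qquad \text{e.g. } \text{sparsity} = 0.91 \;\Rightarrow\; \text{speedup}_{\max} \approx 11.1\times$$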

Comment

Like FNA, etc. The main point is that the claim in this paper is strong: no other work can achieve the same level of speedup, so it is important to show that this approach does not sacrifice more quality than other approaches. Given that GNA is basically adding a stride to NA, I think the novelty is limited, so it is more important to show that this is a good practical approach: one that finds a good balance between quality and performance and generalizes to many applications.

Review
Rating: 5

This paper addresses the problem that sparse attention mechanisms often fail to deliver practical wall-clock speedups that match their theoretical reduction in FLOPs. The authors introduce Generalised Neighborhood Attention (GNA), a unifying framework that describes various locality-based sparse attention patterns (sliding window, strided, and blocked) through a single stride parameter. To analyse the performance implications of GNA, they develop a simulator, NATTENSim, which provides realistic speedup upper bounds. Finally, they present an optimised implementation of GNA built on a fused attention kernel for the NVIDIA Blackwell architecture. By plugging their GNA implementation into existing large-scale generative models, they demonstrate significant end-to-end speedups without any model fine-tuning.

Strengths and Weaknesses

Strengths

The paper tackles a highly relevant and difficult challenge in deep learning engineering. The gap between theoretical computational savings and real-world performance is often large and the focus on creating a sparse attention method that is truly "fast" on modern hardware is a valuable contribution.

The conceptualisation of Generalized Neighborhood Attention (GNA) is a key strength. It elegantly unifies several disparate local attention methods under a single, intuitive parameter. The accompanying simulator, NATTENSim, is a good contribution, providing a tool for researchers to compare different sparse attention strategies.

Weaknesses:

The experiments focus exclusively on achieving speedup in existing, pre-trained models without any retraining or fine-tuning. While this is a powerful demonstration of plug-and-play capability, it leaves the accuracy/quality trade-off of different GNA configurations largely unexplored. The stride parameter explicitly trades translational equivariance for efficiency, but the impact of this trade-off on model quality (beyond qualitative samples and some summary benchmarks) is not deeply analysed.

The impressive performance results are demonstrated on the NVIDIA Blackwell architecture. While the design principles are discussed as being general, the concrete, high-performance kernel is specific to this hardware. This limits the immediate reproducibility and applicability for researchers who do not have access to this architecture. Performance on more widely available hardware like Hopper or Ampere is not presented despite being discussed as theoretically improved.

The colours selected for the figures are very tough to separate in some cases! Particularly the dark/light red. Would have been better if orthogonal colour choices were made.

Questions

The trade-off between speed and quality is central to GNA. Have you conducted any experiments where a smaller model is trained from scratch using a GNA configuration with a high stride? It would be valuable to understand how this impacts final model convergence and accuracy compared to training with standard self-attention or NA (stride=1x1).

The stride parameter creates a clear trade-off between computational efficiency and the receptive field overlap between adjacent queries. The results for FLUX (Table 6) show almost no degradation in quality metrics when moving from a 1x1 stride to a 16x16 stride. Is it surprising that this significant reduction in shared local information has such a minimal impact on the quality of a 4K generated image? What does this imply about the level of redundancy in local attention for high-resolution generative tasks? (this may be a misunderstanding on my part)

Limitations

Yes

Final Justification

Raised review score in response to rebuttal addressing my concerns

Formatting Issues

None

Author Response

Thank you for your valuable time and feedback on our paper.

(Regarding weakness 3) Firstly, thank you for giving feedback on the plot colors. We will absolutely change them in the final version of the paper to improve readability.

(Regarding weakness 2) We begin by reporting that, since the submission, we have also successfully completed implementations for the Hopper architecture, with throughput comparable to Flash Attention v3, the current state of the art. In addition, we have implemented backward pass kernels for both our Hopper and Blackwell kernels, extending our approach and methodology to the millions of Hopper GPUs still in use and allowing for efficient training.

Due to the formatting guidelines for NeurIPS rebuttal, we cannot directly share detailed results, but our key observation is that our Hopper kernels achieve similar, and sometimes even better performance improvements compared to our Blackwell kernels (in the case of standard NA, or GNA cases that are not fully block sparse).

We can also share that when running GNA against STA on the Hunyuan model, GNA consistently outperforms STA, and achieves the full 11.1X speedup theoretically possible, while STA only reports a 10.45X speedup in their paper.

As you rightly pointed out, hardware specificity is not atypical, as many past NeurIPS papers were largely about creating architecture-specific optimizations, such as Flash Attention v3 itself. However, we hope that concerns regarding this are minimized as a result of our new Hopper implementation.

With respect to other implementations, we are very optimistic that our observations can be transferred to other hardware architectures as well, as long as specifications such as memory bandwidth and peak performance and the like are comparable. Our implementation overhead, excluding the additional memory operations, is very minimal as illustrated by our results.

(Regarding weakness 1) Regarding training, we simply did not have the training resources (multiple GPUs, training data) to train or fine-tune any of the reported models. Since the only applications where our methodology truly shines are those where attention is a bottleneck, and not those which are more easily trainable with limited resources, this was excluded from our paper.

However, if we were to speculate, training would only help our methodology and the end-to-end returns. Various works going back to the original Vision Transformer paper in 2020 [8] point out that inductive biases can be “learned” if trained with enough data, and subsequent works show that this can hold true in even relatively limited data regimes like ImageNet (DeiT[7]).

The effect of training on sparsity in general, again illustrated by various papers such as STA (included in our paper) is that it likely increases the upper bound for sparsity, meaning instead of introducing ~50% sparsity and only in some heads/layers/diffusion steps, we could introduce more sparsity in more layers and diffusion steps.

Therefore, we expect the reported end-to-end speedups to only increase if any of the models reported were fine-tuned.

The trade-off between speed and quality is central to GNA. Have you conducted any experiments where a smaller model is trained from scratch using a GNA configuration with a high stride? It would be valuable to understand how this impacts final model convergence and accuracy compared to training with standard self-attention or NA (stride=1x1).

We have not, but prior works on neighborhood attention (the NAT[1] and DiNAT[2] papers) have. While stride did not exist at the time, the highest possible stride value makes GNA effectively the same pattern as blocked attention, which Swin Transformer[3] used as its core attention operator. The NAT and DiNAT papers, along with other works such as Hourglass Diffusion[4] and WeatherMesh-3[5], clearly highlight the superior quality of sliding windows over non-overlapping windows (the highest stride).

In addition to those, the HaloNet[6] paper also further discusses the implications of this tradeoff space in detail.

Based on our experiments with off-the-shelf models, our understanding is that the difference will be largely dependent on the model, but again we build on findings from ViT that in many cases where we are not greatly limited by data availability, inductive biases can be largely learned by Transformers.

At the same time, we also argue that in addition to GNA, our implementation still implements and massively accelerates standard sliding window / NA patterns. Together with our simulator and implementations, end users can simply decide the best tradeoff for them. For some applications that are very sensitive to translational equivariance, like weather prediction, smaller strides and exact sliding windows may be required. For others that involve a lot of training data (e.g. video and world models), it may matter less.

The stride parameter creates a clear trade-off between computational efficiency and the receptive field overlap between adjacent queries. The results for FLUX (Table 6) show almost no degradation in quality metrics when moving from a 1x1 stride to a 16x16 stride. Is it surprising that this significant reduction in shared local information has such a minimal impact on the quality of a 4K generated image? What does this imply about the level of redundancy in local attention for high-resolution generative tasks?

Thank you for bringing this up. This changes very visibly when we move to 32x32 stride. Given NeurIPS’s guidelines on rebuttal, we cannot share the exact results, but you can visibly see a “grid effect” appear in the image when stride grows too large. However, we also note that given blocked attention’s success in vision (which in the case of FLUX would be a whopping 80x80 stride), we are very confident that smaller strides (as detected by NATTENSim) can do the same, while offering the same (and possibly better) performance with the help of our implementation.

References

[1] Hassani, Ali, et al. "Neighborhood attention transformer." Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2023.

[2] Hassani, Ali, and Humphrey Shi. "Dilated neighborhood attention transformer." arXiv preprint arXiv:2209.15001 (2022).

[3] Liu, Ze, et al. "Swin transformer: Hierarchical vision transformer using shifted windows." Proceedings of the IEEE/CVF international conference on computer vision. 2021.

[4] Crowson, Katherine, et al. "Scalable high-resolution pixel-space image synthesis with hourglass diffusion transformers." Forty-first International Conference on Machine Learning. 2024.

[5] Du, Haoxing, et al. "WeatherMesh-3: Fast and accurate operational global weather forecasting." arXiv preprint arXiv:2503.22235 (2025).

[6] Vaswani, Ashish, et al. "Scaling local self-attention for parameter efficient visual backbones." Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2021.

[7] Touvron, Hugo, et al. "Training data-efficient image transformers & distillation through attention." International conference on machine learning. PMLR, 2021.

[8] Dosovitskiy, Alexey, et al. "An image is worth 16x16 words: Transformers for image recognition at scale." arXiv preprint arXiv:2010.11929 (2020).

Comment

The authors have satisfied my concerns. Hopefully they will incorporate into the paper the points they were not able to include in the rebuttal. I will be raising my score to 5 (accept).

Comment

We thank you for your valuable time and feedback, and for your recommendation. We will be sure to incorporate the changes discussed, and the results of our new Hopper FNA kernel in the final revision.

Review
Rating: 4

This paper introduces Generalized Neighborhood Attention (GNA), an extension to Neighborhood Attention (NA) that unifies various existing local sparse attention mechanisms, including sliding window, strided sliding window, and blocked attention, through a new "stride" parameter. Experimental results demonstrate that their GNA implementation can fully realize the maximum theoretical speedup in perfectly block-sparse cases and achieves significant end-to-end speedups (28% to 46%) on off-the-shelf generative models like Cosmos-7B, Hunyuan Video, and FLUX, without any fine-tuning. The authors commit to open-sourcing their simulator and Blackwell kernels.

Strengths and Weaknesses

Strengths

  1. Unified Framework. GNA successfully unifies a diverse set of local sparse attention mechanisms (sliding window, strided sliding window, blocked attention) under a single, flexible definition with the introduction of the "stride" parameter. This provides a valuable conceptual framework for the field.

  2. Analytical Simulation Tool (NATTENSim). The development of NATTENSim is a significant contribution. It allows for a fair and analytical comparison of different sparse attention approaches by abstracting away implementation details and focusing on tile-level computation, providing more realistic speedup upper bounds. This addresses a long-standing challenge in evaluating sparse attention.

  3. State-of-the-Art Implementation. The implementation of GNA on the NVIDIA Blackwell architecture using CUTLASS FMHA kernels is a major strength. Demonstrating near-perfect utilization of theoretical FLOP-wise speedup in perfectly block-sparse cases (up to 1.3 PFLOPs/s in FP16) is highly impressive and showcases cutting-edge engineering.

  4. Open-Source Commitment. The commitment to open-sourcing the simulator and Blackwell kernels through the NATTEN project will greatly benefit the research community, enabling reproducibility and further development.

Weaknesses

  1. Token Permutation Overhead. While token permutation is presented as a practical solution to the curse of multi-dimensionality, the paper acknowledges its naive PyTorch implementation only utilizes 1/8th of the memory bandwidth. Although it's stated that this can be minimized by performing it once, it remains a potential bottleneck for certain workloads and could limit the "Speed-of-Light" claim if not further optimized.
  2. Hardware Specificity. The primary implementation and performance evaluation are tied to the NVIDIA Blackwell architecture. While the design choices are claimed to be architecture-agnostic, the direct transferability of the achieved performance to other GPU architectures (e.g., AMD, Intel) is not explicitly demonstrated or discussed in detail. However, this is fully understandable.

Questions

  1. Token Permutation Optimization. Given the acknowledged low memory bandwidth utilization of the current token permutation implementation, what specific future optimizations are planned (e.g., custom CUDA kernels, integration with memory management libraries) to improve this aspect and further close the gap to theoretical speedup?
  2. Generalization to Other Architectures. Could the authors elaborate on the challenges and potential strategies for porting the highly optimized Blackwell kernel implementation to other GPU architectures (e.g., AMD CDNA, Intel Xe)? Are there specific features of Blackwell that are critical to the observed performance gains that might not be available elsewhere?
  3. What are the major strengths of the proposed method over linear attention?

Limitations

See Weaknesses.

Final Justification

My major concerns have been addressed by the rebuttal. I thus keep my positive scores unchanged.

Formatting Issues

n/a

Author Response

Thank you for your valuable time and feedback on our paper.

(Regarding weakness 2) We begin by reporting that, since the submission, we have also successfully completed implementations for the Hopper architecture, with throughput comparable to Flash Attention v3, the current state of the art. In addition, we have implemented backward pass kernels for both our Hopper and Blackwell kernels, extending our approach and methodology to the millions of Hopper GPUs still in use and allowing for efficient training.

Due to the formatting guidelines for NeurIPS rebuttal, we cannot directly share detailed results, but our key observation is that our Hopper kernels achieve similar, and sometimes even better performance improvements compared to our Blackwell kernels (in the case of standard NA, or GNA cases that are not fully block sparse).

We can also share that when running GNA against STA on the Hunyuan model, GNA consistently outperforms STA, and achieves the full 11.1X speedup theoretically possible, while STA only reports a 10.45X speedup in their paper.

As you rightly pointed out, hardware specificity is not atypical, as many past NeurIPS papers were largely about creating architecture-specific optimizations, such as Flash Attention v3 itself. However, we hope that concerns regarding this are minimized as a result of our new Hopper implementation.

With respect to other implementations, we are very optimistic that our observations can be transferred to other hardware architectures as well, as long as specifications such as memory bandwidth and peak performance and the like are comparable. Our implementation overhead, excluding the additional memory operations, is very minimal as illustrated by our results.

(Regarding weakness 1) Regarding token permutation and its overhead, yes – we suspect better DRAM bandwidth utilization is possible with custom copy kernels, and it is not uncommon for specialized efficient kernels written in CuTe or Triton or various other tools to outperform PyTorch baselines, especially when it is not a compute kernel. The original Neighborhood Attention work which kickstarted NATTEN was in fact a perfect example of this (except it was a compute kernel, hence our reliance on CUTLASS).

We also add, as pointed out in the paper and in your review, that token permutation, even when suboptimal, can be (1) further accelerated, and (2) in some cases, including all the use cases in our paper, reduced to happening only once at the beginning of the model and once at the end. The only reason this was not implemented and reported in the paper is that, due to diminishing returns, the gains from such optimizations would be within the margin of error of model inference runtimes. This is exactly why we achieve the theoretical maximum speedup not only at the operation level, but end-to-end, for some use cases like Hunyuan Video.
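For concreteness, here is a minimal PyTorch sketch of the "permute once, un-permute once" idea described above; the tile layout and sizes are made up for illustration and do not reproduce NATTEN's actual token permutation.

```python
# Conceptual sketch (not NATTEN's layout): reorder a flattened HxW token grid so
# each th x tw tile becomes contiguous, run all attention blocks on the permuted
# sequence, and invert the permutation once at the very end.
import torch

def tile_permutation(h, w, th, tw):
    idx = torch.arange(h * w).view(h, w)
    perm = idx.view(h // th, th, w // tw, tw).permute(0, 2, 1, 3).reshape(-1)
    inv = torch.empty_like(perm)
    inv[perm] = torch.arange(h * w)  # inverse permutation
    return perm, inv

h, w, th, tw = 64, 64, 8, 8
perm, inv = tile_permutation(h, w, th, tw)
x = torch.randn(1, h * w, 128)   # [batch, tokens, channels]
x_perm = x[:, perm]              # pay the memory cost once, up front
# ... run every sparse-attention block on x_perm ...
x_out = x_perm[:, inv]           # and undo it once at the end
assert torch.equal(x_out, x)
```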

(Regarding weakness 3) As for strengths over linear attention, first and foremost, neighborhood attention and by extension GNA are forms of self-attention, meaning introducing them into a new model never explicitly requires re-training or fine-tuning, as the static model weights and the rest of the architecture are untouched. By contrast, introducing linear attention into other architectures almost always requires further training. Secondly, unlike linear attention, the degree of sparsity / efficiency gain is completely up to the end user, whereas with linear attention, you either use it or you don't. As a result, applications based on NA/GNA can freely adjust the level of sparsity, receptive field, block-sparsity, and even causal / bi-directional behavior, using the various parameters in this new framework. Different applications have different needs across all of the aforementioned settings, and our framework offers any desired combination and can deliver significant speedups over a directly comparable baseline. Linear attention's changes to the architecture may further complicate apples-to-apples comparisons.

Final Decision

This paper received one acceptance, one weak acceptance, and two weak rejections. Reviewers appreciated the interesting extension of neighborhood attention to generalized neighborhood attentions for acceleration, particularly targeting Nvidia Blackwell kernels. However, major concerns remain regarding writing quality, the specificity of design principles to this hardware, and limited novelty. Although the rebuttal made some efforts to clarify these weaknesses, the AC thinks the paper requires a more substantial effort to be acceptable. Key points include:

  • Implementation Details: As a hardware optimization paper, clear and detailed implementation descriptions are critical. The paper must thoroughly introduce the experimental setup and implementation to ensure reproducibility without relying solely on future code release. These details are essential to justify the novelty of the idea beyond engineering optimization.
  • Quality-Speed Tradeoff: The discussion of the tradeoff between quality and speed is crucial. The results show some degradation in quality, making it difficult to validate the claimed tradeoff convincingly.
  • Writing and Scientific Insight: The writing requires significant improvement, not only in describing implementation but also in providing deeper scientific insights. More insightful conclusions could be drawn from carefully designed ablation studies, such as exploring optimal stride and a detailed analysis of tradeoffs.

The AC recommends rejecting this paper in its current form but encourages the authors to address these issues for a stronger resubmission.