GSPN-2: Efficient Parallel Sequence Modeling
Abstract
Reviews and Discussion
The paper describes GSPN-2, an optimised and engineered version of GSPN-1, a vision architecture that aims to replace attention-based architectures with parallel axis-aligned recurrence scans. The main results in the paper are centred on careful engineering of the algorithm and its implementation around modern GPU kernels, and hardware utilisation is analysed in depth. Experimental results are presented for IN-1K classification and text-to-image generation.
Strengths and Weaknesses
Strengths
- The paper is very much a well-executed engineering exercise. While at first glance it may seem like a poor fit for NeurIPS, I would argue that this kind of paper is important and necessary. Transformers did so well partially due to the (accidental?) alignment between hardware and algorithms. Arriving at alternative high-performing architectures may require careful hardware-aligned engineering, and it is my belief that such work should be encouraged.
- In-depth and insightful analysis of various implementation / algorithmic choices on runtime.
- Clear and well-written. Strikes a good balance between details and high-level motivation.
Major weaknesses
- While excelling at latency analysis (e.g. Fig 3, Tab 1, Fig 4), I find that the paper falls short in empirical evaluation on relevant machine learning tasks. These results should be expanded along several dimensions (more below).
- Ablations in Fig 1 should be expanded to the downstream tasks (e.g. IN-1K, text2image, and ideally more). Right now it is unclear how or if the various latency optimisations affect downstream performance.
- To motivate their work the authors explicitly mention detection and segmentation (lines 22-23) yet do not present such results. It would be extremely interesting and useful for the potential audience of the paper if such experiments could be performed and added to the paper.
- Similarly, the authors mention models like CLIP / SigLIP. Including experiments that train such models with GSPN-2 as the vision encoder would also be extremely useful. In particular, it would demonstrate how this architecture scales with data - something that is completely missing from the current set of experiments.
- Expanding on the data scaling axis - can this behaviour at least be demonstrated on larger image classification tasks? IN-1K is a relatively old and, by modern standards, small benchmark.
- Why were only small models considered in Tab 2? And why are no results on resolution scaling presented?
- The effect of the compressive proxy dimension has not been ablated. How does this setting affect latency and downstream metrics (e.g. accuracy on IN-1K)? The value chosen for the IN-1K experiments (lines 319-320) was in fact very surprising and intuitively seems like a very small number. Could the authors share their thoughts on why this is sufficient?
- Text2Image generation results are not particularly useful without quantitative metrics (e.g. CLIP score, etc) - these should be added.
Minor weaknesses
- The choice of channel dimensions for benchmarking (e.g. in Table 1) was a bit surprising. 128 channels is not a lot compared to a ViT (e.g. ViT-B or ViT-L) architecture. Could the authors provide justification for the choice of these settings?
- The paper emphasises forward-pass benchmarking, although the backward pass is also benchmarked (e.g. Fig 4). This should be emphasised more in the paper; it would also be interesting to see the ablations in Fig 3 extended to the backward pass.
- The authors do not discuss limitations (if any) that aligning the architecture and kernel implementation with hardware design introduces. Can the resulting model run on all accelerators? Would it still be easy to experiment with various architectural modifications in follow-up research?
Questions
See weaknesses section, especially the Major ones.
Limitations
Yes.
Final Justification
The paper (especially empirical validation of the proposed approach) significantly improved during the rebuttal process.
Formatting Issues
No.
We thank the reviewer for the thorough and insightful feedback. Below we respond to each concern and clarify the current scope, design choices, and planned extensions.
Machine Learning Tasks. Our experiments in the main paper focus on image classification (ImageNet-1K) and text-to-image generation (Stable Diffusion XL) to evaluate both discriminative and generative capabilities. In addition, as GSPN-2 is designed for high-resolution spatial modeling, we have extended our experiments to dense prediction tasks across various benchmarks in Appx E. We also provide a preliminary integration of GSPN-2 into SigLIP2 in the response below to assess its ability to scale with large data and its downstream generalization.
Fig. 1 Should Be Expanded to the Downstream Tasks. Yes, we did – the corresponding comparison has been fully explored in the paper: in Tab. 2 of the main paper, we show an extensive comparison with both ViT-based (all with Flash Attention; orange block) and Mamba-based (green block) architectures. In Fig. 1 of Appx B, we provide a comprehensive analysis of accuracy, model size, and throughput for GSPN-2 compared to SoTA architectures (e.g., ConvNet-, Transformer-, and Mamba-based). These results highlight the competitiveness of GSPN-2—beyond speed gains alone—and establish it as a general-purpose architecture.
Detection and Segmentation. We did provide both in Appx E: we have extended GSPN-2 to high-resolution inputs (e.g., 1024x1024) and evaluated it on segmentation, depth estimation, and 3D probing benchmarks such as ADE20K, VOC2012, Probe3D, NYUDv2, and PASCAL. Preliminary results in Tab. 2 indicate comparable or superior performance to transformer-based backbones with much better inference latency. We will move this experiment into the main paper in the final version.
SigLIP. Thank you for the valuable suggestion. We are currently integrating GSPN-2 into vision encoder backbones (e.g., SigLIP2-Base) using a subsample of 1 million samples from the DataComp1B dataset for rapid evaluation, with zero-shot testing on ImageNet. The table below shows a significant improvement in inference speed: 180× faster than Flash Attention at the module level, and 5× faster at the block level once overheads such as the MLP are included.
| BS=32, Resolution=1025 | SigLip (Flash Attention) | SigLip-GSPN (ours) |
|---|---|---|
| Module-level Speed (ms) | 31.36 | 0.174 (180x faster) |
| Block-level Speed (ms) | 63.08 | 12.59 (5× faster) |
| Zero-shot Accuracy | 53.4 | 52.3 |
IN-1K is Relatively Old. We agree that ImageNet-1K is limited in scale. However, it remains a universally adopted and effective benchmark in attention-based literature [15-29], especially for comparing multiple model sizes (e.g., tiny, base, large) within practical GPU resource constraints. For instance, state-of-the-art methods like SigLIP and Radio require substantial GPU resources (multiple weeks on 16×8 GPUs), making them impractical for extensive variant analysis. Additionally, beyond IN-1K, we have successfully validated GSPN-2’s scalability to higher-resolution tasks (e.g., 1024×1024) in segmentation (ADE20K, VOC2012, NYUDv2, PASCAL) and text-to-image generation (SDXL) in Appx E, confirming its effectiveness across diverse and challenging dense-prediction benchmarks.
Only Small Models in Tab. 2. To enforce a fair comparison, we follow the common setting used in previous relevant works (e.g., VMamba, VisionMamba), where models are evaluated from tiny to large size at 224×224 resolution. This choice aligns with the standard benchmarking practice in the field for efficient models. However, we acknowledge the importance of validating performance on high-resolution inputs, which are more practical for real-world applications. To address this, we have extended our experiments to both high-resolution text-to-image generation and dense prediction tasks at 1024×1024 resolution, as presented in Appx E. These experiments better reflect the capabilities of GSPN-2 in handling high-resolution visual data.
Ablation on the Proxy Dimension. We selected a small proxy dimension to demonstrate how aggressively the input can be compressed while still preserving accuracy. We provide an ablation on C_proxy in the table below, analyzing the trade-off between accuracy and throughput. To make a fair comparison, we keep the model size almost constant across different C_proxy values by increasing the number of blocks where necessary. We found that the smallest setting already yields the best trade-off, as shown below.
| C_proxy | Acc | Throughput (img/s) |
|---|---|---|
| 2 | 83.0 | 1544 |
| 4 | 83.0 | 1492 |
| 8 | 83.0 | 1387 |
| 16 | 82.9 | 1293 |
| 32 | 82.8 | 1106 |
As shown, using C_proxy = 2 achieves the best latency–accuracy trade-off, with no measurable loss in accuracy and the highest throughput. This suggests that the proxy dimension serves primarily as a low-rank bottleneck for efficient propagation and does not need to match the full input dimensionality to be effective. We believe the global context captured by the spatial recurrence is preserved even through such a compressed intermediate representation.
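To make the role of the compressive proxy dimension concrete, here is a minimal PyTorch-style sketch (ours; module and parameter names are illustrative, and a simple cumulative scan stands in for the actual GSPN line-scan recurrence) of projecting the channels down to C_proxy, propagating, and projecting back:

```python
import torch
import torch.nn as nn

class ProxyPropagation(nn.Module):
    """Illustrative sketch: run a spatial propagation in a compressed
    C_proxy-dimensional space instead of over all C channels."""
    def __init__(self, channels: int, c_proxy: int = 2):
        super().__init__()
        self.down = nn.Conv2d(channels, c_proxy, kernel_size=1)  # C -> C_proxy
        self.up = nn.Conv2d(c_proxy, channels, kernel_size=1)    # C_proxy -> C
        self.gain = nn.Parameter(torch.ones(channels, 1, 1))     # per-channel modulation

    def propagate(self, z: torch.Tensor) -> torch.Tensor:
        # Placeholder for the GSPN line-scan recurrence; a normalized
        # left-to-right cumulative scan stands in for it here.
        return torch.cumsum(z, dim=-1) / z.shape[-1]

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        z = self.propagate(self.down(x))  # recurrence runs on C_proxy slices only
        return self.gain * self.up(z)     # restore the full channel dimension

x = torch.randn(2, 128, 32, 32)
y = ProxyPropagation(128, c_proxy=2)(x)
print(y.shape)  # torch.Size([2, 128, 32, 32])
```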
Quantitative metrics for T2I generation. We have included CLIP-T scores and FID metrics in Tab 1 in Appx D, along with the quality-runtime trade-off at Fig. 3.
Choice of 128-channel Setting. We chose 128 channels to match prior system benchmarks (e.g., VMamba). However, we agree that larger channel counts would be more realistic, so we update Table 1 with results for larger channel counts (196, 768, and 1152) at appropriate resolutions below.
| Input Size | Batch | Channels | GSPN-1 Throughput (% of peak BW) | GSPN-2 Throughput (% of peak BW) |
|---|---|---|---|---|
| 32×32 | 32 | 196 | 114 GB/s (6.0%) | 1832 GB/s (91.8%) |
| 64×64 | 1 | 768 | 86 GB/s (4.5%) | 1847 GB/s (92.3%) |
| 64×64 | 1 | 1152 | 35 GB/s (2.1%) | 1837 GB/s (92.0%) |
Backward Benchmark. We agree and will highlight backward-pass performance more explicitly in the main text. Note that the backward function is very similar to the forward function in terms of kernel-launch design, and thus the backward speedup mirrors the forward behaviour. We will include this in the revised paper.
Run on Different Accelerators. Yes, our CUDA acceleration design is fully compatible with all CUDA-enabled GPUs, relying only on standard libraries such as ATen and THC. Moreover, our algorithmic components—line-scan recurrence and channel-compressed formulations—can be readily ported to other accelerator backends (e.g., Triton or TileLang). The core idea of kernel fusion can also be adapted to accelerators that provide similar memory hierarchies. However, to achieve optimal performance, hardware-level control over thread blocks and shared memory is beneficial.
Dear Reviewer Da5V,
We sincerely appreciate your valuable review, which has greatly helped us clarify the scope, design choices, and broader applicability of our work.
Following your valuable suggestions, we have revised our paper and submitted a detailed rebuttal. In summary, we have:
- Highlighted experiments beyond ImageNet-1K already included in Appx E, such as segmentation, depth estimation, and 3D probing benchmarks (ADE20K, VOC2012, Probe3D, NYUDv2, PASCAL) with high-resolution inputs (e.g., 1024×1024), showing competitive or superior performance with much better inference latency.
- Integrated GSPN-2 into SigLIP2-Base, demonstrating up to 180× module-level and 5× block-level speedups over Flash Attention, while maintaining comparable zero-shot accuracy.
- Highlighted comprehensive comparisons with both Transformer- and Mamba-based SoTA architectures, already included in Fig. 1 in Appx B, showing GSPN-2’s competitiveness in accuracy, model size, and throughput.
- Added a detailed ablation study on the compressive proxy dimension, showing that the smallest setting (C_proxy = 2) achieves the best latency–accuracy trade-off with no loss in accuracy.
- Highlighted quantitative metrics (CLIP-T, FID) for text-to-image generation tasks already included in Appx D.
- Updated throughput benchmarks for larger channel counts (up to 1152) and clarified backward-pass speedups.
- Discussed portability to other accelerators (e.g., Triton, TileLang) and broader algorithmic generality.
We hope these updates have addressed your concerns. As the discussion period draws to a close, we would greatly appreciate it if you could let us know whether our response has resolved your questions.
Thank you again for your time and consideration.
Sincerely,
Authors of Paper 943
This paper presents GSPN-2, a significant algorithm–system co-design that improves the efficiency of the Generalized Spatial Propagation Network (GSPN) for vision transformers. The authors identify key system-level inefficiencies in the original GSPN implementation, including excessive GPU kernel launches, redundant memory access, and per-channel computation overhead. To address these, they propose a streamlined 2D kernel design with warp-channel pinning and shared memory staging, alongside a model-side enhancement that replaces per-channel propagation weights with a channel-shared variant. Experimental results show that GSPN-2 achieves comparable accuracy to transformer-based models while substantially reducing computational cost. Overall, the work effectively pushes the efficiency frontier for spatial context modeling in vision tasks and offers a compelling solution for high-resolution and long-sequence scenarios.
Strengths and Weaknesses
Strengths:
- The authors provide extensive profiling and comparisons to demonstrate the bottlenecks and improvements, making the conclusions convincing.
- The analysis of hardware performance, such as the L1 cache miss rate and bandwidth throughput, provides further evidence of the effectiveness of the optimization.
- The evaluation results show significant improvements.
Weaknesses:
- The writing is not very easy to follow, and some illustrations are hard to understand, in particular why they appear in a given section and how the proposed solution addresses the corresponding bottleneck.
- The evaluation is not comprehensive, and most of the results focus on the improvement over GSPN-1. As the authors mention many other system optimizations, such as Flash Attention, a direct comparison with it is missing. If it is integrated into the baseline, this should be stated.
- This work is essentially a system implementation of a published paper, while the algorithmic aspect is somewhat weak and unclear, even though this is presented as an algorithm–system co-design.
Questions
- Can you show me how the channel-shared weight mitigates the concurrency bottleneck?
- Is the compressive proxy dimension essentially a low-rank approximation? What is the cost of using this projection method due to the potential error?
- Can you show me the comparison with other SOTA algorithm-system codesign work?
Limitations
This work mostly focuses on system implementation optimization. The question is: can you extend the optimizations (fused kernel, shared weights, streaming concurrency) to other works that do not use your GSPN algorithm?
Formatting Issues
No major issues.
We thank the reviewer for the constructive feedback. We respect your concerns and have taken them seriously. In the following, we provide clarifications and additional context to better address the issues raised and to more clearly communicate the value of our work.
Coverage of Flash Attention Baselines. Flash Attention is already present in every ViT-based baseline we report: all orange blocks in Table 1 and Fig. 1 (Appx B), the SDXL row in Fig. 5 (main paper), and the ViT-L RADIO model in Table 2 (Appx E) are Flash Attention implementations. In the strongest head-to-head test—SDXL vs. GSPN-2 SDXL—we replace each softmax/Flash Attention layer with a single GSPN layer and fine-tune by distillation: image quality is maintained while inference latency drops sharply, as shown below. These results show that our gains are measured against the latest optimized attention baselines.
| Resolution | Full Attention (ms) | FlashAttention (ms) | GSPN-2 (Ours, ms) |
|---|---|---|---|
| 256² | 58.377 | 4.491 | 0.165 |
| 512² | 928.610 | 64.757 | 0.444 |
| 1k² | – | 1034.967 | 1.096 |
| 2k² | – | – | 3.714 |
| 4k² | – | – | 13.533 |
| 8k² | – | – | 51.198 |
| 16k² | – | – | 196.412 |
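For readers who want to reproduce this kind of module-level latency measurement, a rough CUDA-event timing harness (our own illustrative sketch, not the paper's benchmark code; the SelfAttention stand-in module and sizes are assumptions) might look like the following:

```python
import torch
import torch.nn as nn

def time_forward_ms(module: nn.Module, x: torch.Tensor, iters: int = 50, warmup: int = 10) -> float:
    """Mean forward latency in milliseconds, measured with CUDA events.
    Requires a CUDA GPU; intended only as an illustrative harness."""
    module, x = module.cuda().eval(), x.cuda()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    with torch.no_grad():
        for _ in range(warmup):
            module(x)
        torch.cuda.synchronize()
        start.record()
        for _ in range(iters):
            module(x)
        end.record()
        torch.cuda.synchronize()
    return start.elapsed_time(end) / iters

class SelfAttention(nn.Module):
    """Plain softmax self-attention, used here only as a timing stand-in."""
    def __init__(self, dim: int = 128, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        return self.attn(tokens, tokens, tokens, need_weights=False)[0]

# e.g. a 512x512 input with 16x16 patches -> 1024 tokens of width 128
tokens = torch.randn(1, (512 // 16) ** 2, 128)
print(f"{time_forward_ms(SelfAttention(), tokens):.3f} ms per forward")
```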
Comparison with other SOTA algorithm-system codesign work. In addition, we have compared with other algorithm–system co-designs such as VMamba and MambaVision (Fig 1 in Appx B), to highlight differences in computation patterns and practical GPU efficiency. Additionally, Appx B provides a detailed trade-off analysis among accuracy, model size, and throughput for the entire system. We will further clarify them in the revised paper.
Limited Algorithmic Contribution, a System Implementation. We address this question from the following aspects:
- The importance of CUDA optimization. CUDA optimizations have been widely explored for softmax attention (e.g., FlashAttention, TensorCore-based solutions). However, efficient CUDA support for recurrent or linear-propagation methods—despite their promising scalability to unlimited-length contexts (e.g., SSM, Mamba, Linear Attention, and GSPN)—is severely lacking. Our work explicitly targets this important gap.
- Just a system implementation? No. On the contrary, we found that CUDA-level optimizations alone were insufficient: even aggressive kernel tuning left us with a runtime that still grew linearly with the channel and batch dimensions—a fundamental algorithmic bottleneck. To solve this, we introduced a novel GSPN micro-design (Sec. 3.3) that significantly reduces this dependency without sacrificing accuracy, clearly an intrinsic algorithmic advancement.
- Algorithm-system co-design: Unlike typical algorithm-only papers, we tightly fused this new micro-design into a single optimized CUDA kernel—exactly demonstrating an algorithm-system co-design. This integrated solution unlocks the full potential of GSPN in practical scenarios, underscoring that our contribution is both algorithmically substantial and systemically necessary. We appreciate your feedback and will revise the paper to better highlight these algorithmic contributions and explicitly clarify their integration with our system-level optimizations.
Presentation Clarity. Thanks for pointing out the issue. We will clarify the connection between each bottleneck and its corresponding solution, and restructure Section 3 for better flow in the revised paper.
How Channel-shared Weights Mitigate the Concurrency Bottleneck. They mitigate the concurrency bottleneck by reducing the number of independent CUDA blocks needed at kernel launch. In GSPN-1, per-channel weights required independent computation for each (N, C) slice, quickly saturating the GPU's grid dimension and memory queues. In contrast, channel sharing allows one weight matrix per row/column (W_i), enabling:
- Shared memory reuse: once W_i is loaded into shared/L1 cache, it is reused across all C channels within the block, improving cache hit rates and lowering register pressure. See the +2D thread block, +shared memory cache, and +channel-shared gates steps in Fig. 3.
- Fewer parameter fetches per step, accelerating global memory throughput (confirmed by Nsight in Table 1).
This significantly improves parallel efficiency on modern GPUs—especially when combined with 2D blocks.
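To illustrate the difference at the tensor level, the following rough sketch (our notation; the 3-neighbour weight shape and the sizes are assumptions chosen to mirror the description above, not the paper's exact parameterization) contrasts per-channel and channel-shared propagation weights and the resulting number of independent propagation slices:

```python
import torch

N, C, H, W = 8, 768, 64, 64   # illustrative sizes, not the paper's settings

# GSPN-1 style: a separate 3-neighbour weight set per channel (per scan
# column), so every (sample, channel) slice is an independent propagation
# problem and the kernel grid must cover N * C of them.
per_channel_weights = torch.randn(C, W, 3)
independent_slices_v1 = N * C

# GSPN-2 style: a single weight set shared across channels plus cheap
# per-channel gains; one W_i is loaded into shared memory once and reused by
# every channel handled inside the same 2D thread block.
shared_weights = torch.randn(W, 3)
per_channel_gains = torch.randn(C)
independent_slices_v2 = N   # channels fold into the block's second dimension

print(per_channel_weights.numel(),                         # 147456 weight values
      shared_weights.numel() + per_channel_gains.numel())  # 960 values
print(independent_slices_v1, independent_slices_v2)        # 6144 vs 8 slices
```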
Relate Compressive Proxy Dimension to Low-rank Approximation. Yes—conceptually, the proxy dimension is a low-rank projection of the original channel dimension. Specifically, we project the input channels down to the proxy dimension via a lightweight linear layer (Sec. 3.2), apply the GSPN recurrence in this reduced space, and then project back to the full channel count. This reduces the CUDA workload from one propagation slice per input channel to one per proxy channel, mitigating the concurrency bottleneck under large channel or batch dimensions. We empirically found no measurable loss in accuracy across classification and generation tasks. For example, GSPN-2-S (with a 2:1 compression ratio) improves accuracy over GSPN-S at comparable compute (9.2G vs. 9.0G MACs). We also provide an ablation on the proxy dimension in the table below, analyzing the trade-off between accuracy and throughput.
| C_proxy | Acc | Throughput (img/s) |
|---|---|---|
| 2 | 83.0 | 1544 |
| 4 | 83.0 | 1492 |
| 8 | 83.0 | 1387 |
| 16 | 82.9 | 1293 |
| 32 | 82.8 | 1106 |
This suggests that the proxy dimension serves primarily as a low-rank bottleneck for efficient propagation and does not need to match the full input dimensionality to be effective. We believe the global context captured by the spatial recurrence is preserved even through such a compressed intermediate representation.
Dear Reviewer zPFN,
We sincerely appreciate your thoughtful and constructive review, which has greatly helped improve the quality of our work.
Following your valuable suggestions, we have revised our paper and submitted a detailed rebuttal. In summary, we have:
- Provided a detailed, direct comparison to FlashAttn on time per image (ms), at various image resolutions;
- Provided more ablation to the compressive proxy dimension, to analyze the trade-off between accuracy and throughput.
- Provided additional clarifications on why the work is an Algorithm-system Co-design, and How Channel-shared Weight Mitigates the Concurrency Bottleneck.
We hope these updates have addressed your concerns. As the discussion period draws to a close, we would greatly appreciate it if you could let us know whether our response has addressed your questions.
Thank you again for your time and consideration.
Sincerely,
Authors of Paper 943
To address the cost of Transformers for high-resolution images and similar settings, GSPN, a type of state-space model, was proposed. However, the implementation of this method was not fully optimized and did not achieve sufficient speed due to frequent HBM accesses and insufficient kernel fusion. Therefore, in this paper, the authors propose GSPN-2, which addresses these issues and improves the implementation to be CUDA/GPU-friendly, achieving a significant speedup.
Strengths and Weaknesses
Strengths
- This paper is well organized and clearly written.
- The significant speedups achieved by the ingenious implementation of GSPN are clearly demonstrated.
- The technical contributions of this paper are clearly and comprehensively described.
Weaknesses
- The main contribution of this study is that the CUDA-based implementation of GSPN is made more GPU-friendly, resulting in a speedup of up to 84x over the previous implementation. Although the proposed implementation and its effectiveness are described in the text, the implementation code is not publicly available at this time, making reproducibility and verification of technical details difficult; this is a concern in the evaluation of the paper.
- This research appears to focus on implementation optimization, with limited intrinsic extensions to the algorithm itself (e.g., channel-shared weights). Therefore, the value of this paper seems to depend strongly on the effectiveness of the GSPN method; although Figure 1 compares it with a simple SSM model, it is unclear what significance the proposed method has outside the context of GSPN, especially in comparison with other SOTA methods and in terms other than speed.
- Qualitative and quantitative evaluations using image classification and text-to-image generation tasks have been conducted, but considering that the main objective of the proposed method is to “increase inference speed,” the accuracy-vs-speed trade-off shown in the Supplemental Figure 3 analysis should be addressed more clearly in the main experiments. In particular, the relative significance of the speed improvement should be evaluated through a comparison between the proposed method and existing lightweight models.
Questions
- I believe that not only the technical aspects but also the academic contributions should be clarified. Please clarify whether there is any evidence that the proposed method is generally useful beyond the GSPN framework.
- The proposed method is mainly limited to comparisons with GSPN, and there is no comparison with other current SOTA methods, especially those that optimize inference time and model size. Please clarify the advantages of the proposed method when compared to lightweight models.
- Supplemental Figure 3 shows some relationship between speed and accuracy, but the evaluation in the text is somewhat limited. Since the main objective is to improve speed, shouldn't the trade-off analysis with accuracy be presented in more detail in the main experiments? Has such an analysis been performed, including comparisons with lighter models?
Limitations
Yes.
Final Justification
From a technical perspective, this work achieves results that are highly commendable, particularly in terms of computational speedup. Furthermore, the proposed method has potential applicability beyond improving a specific model like GSPN, extending to a broader class of models such as variants of linear attention, which is an important strength. However, within the scope of my understanding, the paper is presented primarily in the context of GSPN, and its generality is not made sufficiently clear. Therefore, my final decision is a weak-leaning accept.
Formatting Issues
No
We sincerely thank the reviewer for the encouraging feedback and thoughtful comments. We truly appreciate your recognition of the contributions and are glad that the strengths of our work came through. Below, we address your questions and suggestions in more detail.
Code Availability. Yes, we will release the code upon acceptance. In the meantime, we ensure reproducibility by (1) providing extensive implementation details of our CUDA kernel design (Sections 3.1–3.4); (2) detailing the block/grid configuration, memory layout, shared-memory strategy, and concurrency management (Figure 2, Figure 3, and Table 1); and (3) describing all architectural modifications, including the proxy compression dimension and channel-shared weights (Sections 3.2–3.3). We believe these elements provide a clear path to reproducing the kernel and model behaviors even before the code release.
Significance outside GSPN. First, improving GSPN is non-trivial: GSPN [1] is already a top-performing attention architecture that breaks the softmax wall with propagation, yet its open-source code runs at <8% GPU utilisation. Our work directly addresses this bottleneck, making GSPN fast enough to be a practical alternative to softmax attention. Similar to recent studies on Mamba, we believe improving GSPN’s efficiency is both necessary and impactful. Moreover, the compressed-channel architecture can reduce the channel-linear cost inside any 2D recurrence. Because the design only assumes a line-scan recurrence and a shared (scalar) affinity map, it can be plugged into other propagation-based models—including SSM, Mamba/MambaVision, 2D RNNs, and LSTMs—to cut memory-bandwidth pressure without changing their update equations. Likewise, the single-kernel fusion pattern—or its Triton/TileLang analogue—can accelerate those models on all CUDA-class GPUs. In short, our contributions generalise: the compressed-channel idea is an algorithmic building block for any linear propagation method, and the fused-kernel blueprint shows how to reach hardware limits, making the work broadly relevant rather than GSPN-specific.
Algorithmic Contribution. Our paper does not merely optimize implementation. We identify that GSPN’s main efficiency bottleneck is the large channel dimension (often thousands) typically used in vision models. To fundamentally address this, we propose a novel channel sharing and compression approach (Sec. 3.3) that significantly reduces memory and computational load—this is a novel micro design, not merely an engineering optimization. Additionally, we integrate this compressed-channel structure directly into a new, highly efficient CUDA kernel. Thus, our method tightly couples algorithmic and system-level improvements to unlock GSPN’s full practical potential.
Comparison with the SOTAs other than Speed. Our experiments already show that GSPN-2 matches or outperforms the strongest non-GSPN backbones in accuracy and parameter count, not just latency. On ImageNet-1K, GSPN-2-T delivers 83.0 % top-1 with 4.2G MACs, beating LocalVMamba-T (82.7 %, 5.7G MACs) and other light ViTs at the same scale. For larger models, GSPN-2-B reaches 84.7 % with fewer MACs than ViT-B and VMamba-B. Beyond classification, we directly replace Flash-Attention with our layer in SDXL and obtain matching text-to-image quality (same FID and CLIP-score) while preserving the 92 × speed-up reported in Fig. 5. On dense prediction, our Appx E shows GSPN-2 backbones achieve 47.9 mIoU on ADE20K and 82.5 mIoU on VOC2012, on par with or better than transformer baselines yet at much lower inference cost. These results demonstrate that the proposed channel-compressed design is competitive with state-of-the-art models in accuracy, compute, and memory, so its value is not confined to speeding up GSPN—it offers a strong alternative backbone in its own right.
Accuracy vs. Speed Trade-off. Thank you for this suggestion. We agree that clearly showing the trade-off between accuracy and speed is important. Besides Fig. 3, we also show a more comprehensive analysis of accuracy, model size, and throughput in Fig. 1 (Appx B). To improve clarity, we will move key results from Appendices B and D into the main paper. We note that in Fig. 1, the trade-off isn’t obvious, as most of our models simultaneously achieve top results in both accuracy and throughput.
Lightweight Models. While Fig. 1 (Appx B) focuses on isolating our core design improvements over various baselines, we have also compared GSPN-2 against several SOTA lightweight backbones in both accuracy and speed in the table below:
| Model | Params (M) | Throughput (img/s) | Top-1 Acc (%) |
|---|---|---|---|
| MobileViT-S (ICLR’22) | 5.6 | ~1030 | 78.4 |
| MobileFormer-294M (CVPR’22) | 11.4 | ~1200 | 77.9 |
| ConvNeXt-XT (CVPR’22) | 7.4 | ~1100 | 77.5 |
| VAN-B0 (CVMJ’23) | 4.1 | ~1250 | 75.4 |
| ParC-Net-S (ECCV’22) | 5.0 | ~1000 | 78.6 |
| GSPN-2-T (Ours) | 24.6 | 1544 | 83.0 |
Despite having a similar compute budget, GSPN-2-T outperforms all listed lightweight models by a wide margin in accuracy—by +4.4% over MobileViT-S and +5.1% over ConvNeXt-XT—while maintaining a comparable number of MACs. Furthermore, GSPN-2-T achieves much higher throughput than transformer and CNN-based models of similar scale, or even lighter scale. This demonstrates that GSPN-2 offers an excellent trade-off between model latency and performance.
Thank you very much for your thoughtful response.
Significance outside GSPN
Likewise, the single-kernel fusion pattern —or its Triton/TileLang analogue—can accelerate those models on all CUDA-class GPUs. In short, our contributions generalise: the compressed-channel idea is an algorithmic building block for any linear propagation method, and the fused-kernel blueprint shows how to reach hardware limits, making the work broadly relevant rather than GSPN-specific.
I believe your point is very important. While this paper is written in the context of improving GSPN, I believe that if it had made its generality clearer from a broader perspective and presented GSPN as one specific example, it would have been more appealing to a wider readership.
Comparison with the SOTAs other than Speed
Our experiments already show that GSPN-2 matches or outperforms the strongest non-GSPN backbones in accuracy and parameter count, not just latency.
Thank you for your additional explanation. My question was whether there was a comparison with the latest SOTA models, such as VMamba shown in Figure 1, that specialize in throughput and lightweight design. However, the various experimental results in the paper clearly demonstrate the promising potential of the proposed method in terms of both performance and efficiency.
Lightweight Models
Your explanation made it clear that this method is capable of achieving both accuracy and throughput when compared to lightweight models up to around 2023.
Based on the contents of the rebuttal, I now have a more accurate and positive understanding of the technical contributions of this paper. The final score will be decided carefully, taking into account discussions with other reviewers.
Thank you for your careful consideration and for acknowledging the technical contributions of the paper. We're glad that the additional context helped clarify the positioning and potential of our approach.
This paper introduces GSPN-2, a joint algorithm-system redesign of the Generalized Spatial Propagation Network (GSPN) aimed at significantly improving its efficiency for high-resolution image and long-video applications. The original GSPN-1 implementation suffered from inefficiencies due to repeated GPU kernel launches, excessive global memory transfers, and redundant per-channel computations. GSPN-2 tackles these bottlenecks by consolidating all propagation steps into a single 2D CUDA kernel, introducing channel-shared propagation weights, and optimizing memory access patterns (shared memory caching, coalesced access). Experimental results demonstrate GSPN-2's impressive speedups while matching or exceeding transformer-level accuracy on image classification and improving semantic consistency in text-to-image synthesis.
Strengths and Weaknesses
Strengths
- Joint Algorithm-System Redesign. The strength lies in the holistic approach, combining algorithmic modifications (channel-shared weights, compressive proxy dimension) with deep kernel-level optimizations (unified kernel, 2D thread blocks, shared memory, coalesced access, CUDA streams). This integrated design is key to the substantial performance gains.
- Impressive Speedups. The demonstrated speedups are significant, making GSPN-2 a highly practical solution for high-resolution vision tasks.
- Detailed Profiling and Ablation. Figure 3, showing the step-by-step optimization impact, is excellent. It clearly quantifies the contribution of each optimization, providing strong evidence for the design choices. Table 1 on memory throughput further reinforces the efficiency claims.
- Comprehensive Evaluation. The evaluation covers both efficiency (profiling across varying resolutions, batch sizes, channels) and task performance (image classification, text-to-image generation), providing a well-rounded assessment.
Weaknesses
- Limited Theoretical Justification for Channel-Agnostic Propagation. While the channel-agnostic propagation is empirically shown to be effective, a deeper theoretical justification or analysis of its impact on feature representation compared to channel-specific weights could strengthen this aspect.
- SM Utilization for Small Workloads: The paper notes that for small batch sizes and channel counts, SM occupancy can drop significantly (as low as 20-30%). While acknowledged as a future optimization area, this is a current limitation for scenarios with low parallelism.
- The text size in Figure 4 could be increased. Currently, it's a little bit difficult to read.
Questions
- Generalization to Other Architectures. GSPN-2 is a redesign of GSPN. How easily can the core optimization principles (unified kernel, channel-shared weights, memory optimizations) be applied to other efficient attention mechanisms or state-space models beyond GSPN?
- Impact on Downstream Tasks. How would GSPN-2 perform on other dense prediction tasks like object detection and segmentation, which often rely heavily on fine-grained spatial context?
Limitations
Lack of Statistical Significance Reporting. The paper does not report error bars or statistical significance for experimental results. This makes it harder to definitively ascertain the robustness of the reported performance gains and accuracy improvements, particularly when differences between models are small.
Final Justification
My major concerns have been addressed by the authors, so I keep my positive score unchanged.
Formatting Issues
n/a
We appreciate the reviewer’s positive evaluation and insightful observations. Your detailed technical feedback is highly valuable, and we have carefully addressed each point below to further clarify and strengthen our contributions.
Justification for Channel-Agnostic Propagation. As briefly stated at line 193, softmax attention already uses one scalar affinity per token pair, shared across all channels; the extra per-channel weight matrices in the original GSPN likely duplicate that information. By replacing them with a single shared propagation weight plus per-channel gains, we (i) align GSPN’s affinity operator with standard attention, (ii) drop the hidden channel-dependent cost, and (iii) keep channel diversity through the gains. Formally, Eq. 5 shows that the full spatial kernel is preserved; sharing simply factorises the original block matrix into a rank-1 (channel-shared) form, a compression that removes redundant parameters while retaining dense global interactions. This is the same low-rank idea behind attention and grouped convolutions, where weight sharing regularises the model and cuts the risk of over-fitting. We will add this analysis to the revised paper.
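As a compact illustration (our notation, intended to mirror the description above rather than reproduce the paper's exact Eq. 5), the channel-shared factorisation can be written as:

```latex
% Per-channel propagation (GSPN-1 style): one affinity per channel c
h_{i,c} \;=\; \sum_{j \in \mathcal{N}(i)} w^{(c)}_{ij}\, h_{j,c}
% Channel-shared variant (GSPN-2 style): a single shared affinity w_{ij},
% with a per-channel gain \lambda_c restoring channel diversity
h_{i,c} \;=\; \lambda_c \sum_{j \in \mathcal{N}(i)} w_{ij}\, h_{j,c}
% i.e. the stack of per-channel weights factorises over the channel index as
w^{(c)}_{ij} \;\approx\; \lambda_c\, w_{ij} \qquad \text{(a rank-1 compression)}
```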
SM Utilization for Small Workloads. This trade-off is a known limitation of large-kernel fused CUDA designs under light parallel workloads. We are actively exploring fine-grained tiling, micro-batching, and block fusion across directions as potential directions to address this in future work. Despite this limitation, GSPN-2 still outperforms GSPN-1 and state-space alternatives in both accuracy and runtime, even under small-scale settings, as shown in Fig. 4 and Tab. 1.
Font size. We will increase the font size and improve the figure in the revised paper.
Generalization to Other Architectures. We view GSPN-2 as a blueprint for applying GPU-aware design to 2D structured attention mechanisms, and anticipate future extensions to SSM, 2D RNN, LSTM, linear attention, RWKV variants, etc. The core optimization strategies in GSPN-2 are highly generalizable to any operation with 2D directional recurrence (e.g., causal convolutions, 2D scan layers / structured RNNs). For channel-sharing and proxy compression, these are modular strategies that can be transplanted to transformer blocks, gated MLPs, and SSMs like Mamba for reducing memory and increasing kernel fusion efficiency. Memory optimization tactics (e.g., shared memory reuse, coalesced access) can also benefit any scan-like operation where partial results are reused over spatial dimensions.
Impact on Downstream Tasks (see Appx E). We agree that evaluating downstream dense prediction tasks is important. In Appx E, we already extend GSPN-2 to evaluate on segmentation, depth estimation, and 3D probing benchmarks such as ADE20K, VOC2012, Probe3D, NYUDv2, and PASCAL, as shown in Tab.2. These experiments confirm that GSPN-2 maintains strong dense prediction performance—comparable or superior to transformer-based backbones—while significantly reducing attention latency, especially for high-resolution inputs (e.g., 1024x1024). We will move these segmentation results into the main paper to clearly highlight GSPN-2’s advantages on dense vision tasks.
Dear Reviewer XSoo,
We sincerely thank you for your positive evaluation and for the insightful technical observations that helped us further clarify and strengthen our contributions.
Following your valuable comments, we have:
- Expanded the justification for channel-agnostic propagation, detailing how replacing per-channel weights with a shared propagation weight and per-channel gains preserves the full spatial kernel while reducing redundancy, aligning with standard attention and low-rank regularization ideas.
- Discussed strategies to improve SM utilization for small workloads, such as fine-grained tiling, micro-batching, and block fusion, while noting that GSPN-2 still outperforms GSPN-1 and state-space alternatives in accuracy and runtime even in small-scale settings.
- Improved figure font size in the revised paper.
- Elaborated on the generalizability of GSPN-2’s GPU-aware design principles to a broad class of architectures with 2D directional recurrence (e.g., SSM, 2D RNN, LSTM, linear attention, RWKV variants), and on the transferability of channel-sharing, proxy compression, and memory optimization tactics.
- Highlighted dense prediction performance (ADE20K, VOC2012, Probe3D, NYUDv2, PASCAL) already included in Appx E, showing that GSPN-2 maintains strong results while greatly reducing attention latency for high-resolution inputs (e.g., 1024×1024), and committed to moving these results into the main paper.
We hope these clarifications address your points fully. As the discussion phase draws to a close, we would be grateful if you could let us know whether our updates resolve your questions.
Sincerely,
Authors of Paper 943
This paper presents GSPN-2, a joint algorithm-system co-design that significantly improves the efficiency of the Generalized Spatial Propagation Network (GSPN) for vision transformers by addressing key implementation bottlenecks such as excessive GPU kernel launches, redundant memory access, and per-channel computation overhead.
While some reviewers initially raised concerns about the clarity of algorithmic contributions, reproducibility, and evaluation breadth, the authors provided compelling rebuttals with extensive additional results, including comparisons with FlashAttention, ablation studies on the proxy dimension, downstream task performance (e.g., segmentation on ADE20K), and integration into models like SigLIP, demonstrating both the generality and practical impact of their optimizations.
The final ratings are borderline accept, accept, borderline reject, and accept. AC has reached out to the BR reviewer but has yet to receive a response. Upon reviewing the BR reviewer’s comments, AC noted that this reviewer expressed less confidence compared to the others. Additionally, the AC aligns with the majority of reviewers in recognizing the work’s merits—specifically its strong systems insights, thorough evaluation, and potential to inform the design of efficient vision models. This alignment has led to a consensus in favor of acceptance.