PaperHub

NeurIPS 2025 · Poster · 4 reviewers
Overall rating: 6.4/10 (scores: 4, 5, 5, 2; min 2, max 5, std 1.2)
Confidence: 4.8
Novelty: 2.5 · Quality: 2.5 · Clarity: 2.8 · Significance: 2.5

Spiking Neural Networks Need High-Frequency Information

OpenReview · PDF
Submitted: 2025-05-01 · Updated: 2025-10-29
TL;DR

The paper reveals that the performance gap between SNNs and ANNs stems not from information loss caused by binary spike activations, but from the intrinsic low-pass filtering of spiking neurons.

Abstract

Keywords
Spiking Neural Networks, Vision Transformers

Reviews and Discussion

Official Review
Rating: 4

The paper argues that the fundamental cause of the performance bottleneck of SNNs in the Transformer architecture is the rapid decay of high-frequency information through the network, which reduces representational ability. To enhance high-frequency information, modules such as Max-Pooling and Depth-Wise Convolution are adopted, boosting the model's performance on tasks such as CIFAR and ImageNet.

Strengths and Weaknesses

Strengths:

1. It provides a fresh frequency-domain perspective on spiking networks, showing that performance degradation is not solely due to binary activations but also to frequency loss.

2. The proposed Max-Former is lightweight, easy to implement, and achieves superior performance and energy efficiency without requiring complex architectural changes or longer time steps.

3. Well-written and well-organized.

Weaknesses:

1.The core idea of this work shares strong similarities with MetaFormer [1], which also posits that a combination of token mixer and channel mixer is sufficient for vision tasks. Additionally, several architectural choices appear closely aligned with those proposed in Spike-Driven Transformer v2 [2]. It would be valuable for the authors to discuss the distinctions between their work and these prior approaches.

2.Furthermore, since this paper emphasizes the importance of high-frequency components in Spiking Transformers, could the authors elaborate on whether this frequency sensitivity also exists in convolutional SNN architectures, such as MS-ResNet [3] and its variant GAC-MS-ResNet [4]?

3.Lastly, could the authors provide details on the training time of Max-Former? As training efficiency remains a key bottleneck in SNNs, it would be insightful to know whether the proposed design contributes to faster or more efficient training.

[1] MetaFormer Baselines for Vision. IEEE TPAMI 2024.

[2] Spike-driven Transformer V2: Meta Spiking Neural Network Architecture Inspiring the Design of Next-generation Neuromorphic Chips. ICLR 2024.

[3] Advancing spiking neural networks toward deep residual learning. IEEE TNNLS 2024.

[4] Gated attention coding for training high-performance and efficient spiking neural networks. AAAI 2024.

Questions

  1. The core idea of this work shares strong similarities with MetaFormer [1], which also posits that a combination of token mixer and channel mixer is sufficient for vision tasks. Additionally, several architectural choices appear closely aligned with those proposed in Spike-Driven Transformer v2 [2]. It would be valuable for the authors to discuss the distinctions between their work and these prior approaches.

  2. Furthermore, since this paper emphasizes the importance of high-frequency components in Spiking Transformers, could the authors elaborate on whether this frequency sensitivity also exists in convolutional SNN architectures, such as MS-ResNet [3] and its variant GAC-MS-ResNet [4]?

  3. Lastly, could the authors provide details on the training time of Max-Former? As training efficiency remains a key bottleneck in SNNs, it would be insightful to know whether the proposed design contributes to faster or more efficient training.

[1] MetaFormer Baselines for Vision. IEEE TPAMI 2024.

[2] Spike-driven Transformer V2: Meta Spiking Neural Network Architecture Inspiring the Design of Next-generation Neuromorphic Chips. ICLR 2024.

[3] Advancing spiking neural networks toward deep residual learning. IEEE TNNLS 2024.

[4] Gated attention coding for training high-performance and efficient spiking neural networks. AAAI 2024.

Limitations

See Questions above.

Justification for Final Rating

Although the authors explain the differences with existing methods, I still think the main idea is similar to the previous ones, i.e., [1] and [2]. Thus, keeping the original score is suitable.

Formatting Issues

n/a

Author Response

Q1: Clarification of the Core Idea and Architectural Design

Thank you for recognizing the novelty of our work. In fact, the design of our Max-Former does not emphasize the concept of token mixer or channel mixer as in [1] or [2]. Instead, we hope that the success of Max-Former can empirically support our core idea, as you pointed out, that the performance degradation of spiking neural networks is not only due to binary activation but also to frequency loss. Therefore, to eliminate the influence of factors such as computational complexity and receptive field, Max-Former only makes minor changes to the existing model [3]. Specifically, we add max-pooling operations in patch embedding and replace early-stage self-attention with lightweight depth-wise convolution. We primarily compare our Max-Former with the self-implemented version of [3] that uses the same training configurations to ensure fair comparison and an identical shortcut scheme to strictly adhere to the binary activation rule.
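To make these two changes concrete, here is a minimal PyTorch sketch of the described operators. It is an illustration under assumptions (module names, channel widths, and the placement of normalization/spiking layers are ours), not the authors' released implementation.

```python
import torch.nn as nn

class MaxPatchEmbed(nn.Module):
    """Convolutional patch embedding with an extra max-pooling step,
    intended to re-inject high-frequency content during downsampling."""
    def __init__(self, in_ch=3, embed_dim=384):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Conv2d(in_ch, embed_dim, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm2d(embed_dim),
        )
        # Max-pooling keeps local extrema, unlike average pooling,
        # which attenuates high frequencies.
        self.pool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)

    def forward(self, x):          # x: (B, 3, H, W)
        return self.pool(self.proj(x))

class DWConvMixer(nn.Module):
    """Lightweight depth-wise convolution used as an early-stage token mixer
    in place of spiking self-attention."""
    def __init__(self, dim=384, kernel_size=3):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size,
                                padding=kernel_size // 2, groups=dim)
        self.bn = nn.BatchNorm2d(dim)

    def forward(self, x):          # x: (B, C, H, W) spike/feature map
        return self.bn(self.dwconv(x))
```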

Our work serves as the theoretical foundation for many existing architectural choices in Spiking Transformers. Specifically, in [4], the authors found that directly applying known practices of MetaFormer does not achieve good results in Spiking Transformers: employing average pooling operators to replace SDSA as the token mixer surprisingly results in substantial performance degradation from 61.0% to 41.2%. Similar phenomena have been discussed in earlier works. For instance, Spikformer V2 [6] discovered that removing the max-pooling operator of Spikformer [5] leads to a substantial performance drop, while adding convolution layers (which act as high-pass filters) in the patch embedding stage significantly improves the performance. Our work reveals the underlying principle behind these architectural designs: spiking transformers need to enhance high-frequency components to alleviate the feature degradation caused by their inherent low-pass activation. We are well aware that there is still much room to be explored, and we hope that our Max-Former can serve as a good starting point for future research.

[1] "MetaFormer Baselines for Vision". IEEE TPAMI 2024.

[2] "Spike-driven Transformer V2: Meta Spiking Neural Network Architecture Inspiring the Design of Next-generation Neuromorphic Chips". ICLR 2024.

[3] "QKFormer: Hierarchical Spiking Transformer using QK Attention". NeurIPS 2024.

[4] "Spike-Driven Transformer". NeurIPS 2023. (rebuttal for reviewer 8FkB, OpenReview)

[5] "Spikformer: When spiking neural network meets transformer". ICLR 2023.

[6] "Spikformer V2: Join the High Accuracy Club on ImageNet with an SNN Ticket". arXiv 2024.

Q2: Experiments on Convolutional SNN Architectures

Yes, we have verified that high-frequency information is also beneficial to convolutional-based spiking neural networks. To demonstrate this, we provide additional experimental results using the MS-ResNet18 backbone with the same training settings as reported in [7]. Our Max-ResNet18 only makes minor modifications to the original MS-ResNet18 architecture, i.e., adding one max pooling operator at the end of the downsampling shortcut module in the second ResNet-Block to enhance high-frequency information.

| Dataset | Architecture | Param (M) | T | Acc. (%) |
|---|---|---|---|---|
| CIFAR-10 | MS-ResNet [7] | 12.50 | 6 | 94.92% |
| CIFAR-10 | GAC-SNN [7] | 12.63 | 6 | 96.46% (+1.54%) |
| CIFAR-10 | Max-ResNet (ours) | 12.50 | 6 | 96.67% (+1.75%) |
| CIFAR-10 | MS-ResNet (ANN) [7] | 12.50 | NA | 96.75% |
| CIFAR-100 | MS-ResNet [7] | 12.54 | 6 | 76.41% |
| CIFAR-100 | GAC-SNN [7] | 12.67 | 6 | 80.45% (+4.04%) |
| CIFAR-100 | Max-ResNet (ours) | 12.54 | 6 | 81.15% (+4.74%) |
| CIFAR-100 | MS-ResNet (ANN) [7] | 12.54 | NA | 80.67% |

The results are quite promising: Max-ResNet consistently outperforms the MS-ResNet baseline with identical model size, and GAC-SNN with more parameters. Notably, our Max-ResNet nearly matches the ANN version of MS-ResNet on CIFAR-10 (96.67% vs 96.75%) and significantly outperforms it on CIFAR-100 (81.15% vs 80.67%, +0.48%). Beyond that, we are preparing to share our analysis and results on more general cases in future work.

[7] “Gated attention coding for training high-performance and efficient spiking neural networks.” AAAI 2024.
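For concreteness, here is a minimal sketch of the shortcut change described above, assuming a standard MS-ResNet-style downsampling shortcut (1×1 strided convolution + BatchNorm) in PyTorch; the kernel sizes and strides are illustrative, not the authors' exact Max-ResNet18 code.

```python
import torch.nn as nn

def downsample_shortcut(in_ch, out_ch):
    """Baseline MS-ResNet-style downsampling shortcut."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=1, stride=2, bias=False),
        nn.BatchNorm2d(out_ch),
    )

def max_downsample_shortcut(in_ch, out_ch):
    """Same shortcut with one max-pooling operator appended at its end.
    Stride 1 keeps output shapes (and the parameter count) identical to the
    baseline; only local maxima are emphasized."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=1, stride=2, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
    )
```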

Q3: Training Efficiency

We also attach importance to training efficiency. Here, we provide an additional table of full training times, assuming 8×H20 GPUs for ImageNet with timestep = 4.

| Architecture | Max-10-384 | Max-10-512 | Max-10-768 |
|---|---|---|---|
| Training time | 37 hours | 50 hours | 84 hours |

For CIFAR-10, CIFAR-100, and CIFAR10-DVS, we use a single A30 GPU with training times of 8h, 8h, and 1.5h, respectively. The training settings are detailed in Appendix A1 (Table 5).

Relative to prior studies, we have provided a detailed comparison of training/inference time and GPU memory usage in Appendix A3. Compared to [3], our Max-Former saves 16.4% training time and 15.3% GPU memory.

Comment

Thanks for the rebuttal. Although you explain the differences with existing methods, I still think the main idea is similar to the previous ones, i.e., [1] and [2]. Thus, keeping the original score is suitable.

Comment

Thank you for taking the time to review our rebuttal. As a side note, we have already included a comparison between our proposed MaxFormer and [2] in Table 3 of the main paper. While our MaxFormer does not require complex architectural changes or longer time steps, it demonstrates clear advantages over Meta-SpikFormer [2], achieving 0.62% higher accuracy (77.2% to 77.82%) while reducing model size by half (31.3M to 16.23M parameters) and decreasing energy consumption/synaptic operations by nearly 7-fold (32.8mJ to 4.89mJ). We noticed there has been a citation issue in Table 3 that inadvertently linked Meta-SpikFormer [2] to [1], which we will correct in the final version.

Throughout the review process, we were fortunate to receive your valuable feedback and support. Hope all is well with you!

[1] "Spike-Driven Transformer". NeurIPS 2023.

[2] "Spike-driven Transformer V2: Meta Spiking Neural Network Architecture Inspiring the Design of Next-generation Neuromorphic Chips". ICLR 2024.

Official Review
Rating: 5

This paper reveals a profound insight about SNNs: spiking neurons act as low-pass filters at the network level, causing critical high-frequency information loss. The authors propose Max-Former, which recovers high-frequency signals through two lightweight operators: max-pooling in patch embeddings and depth-wise convolution replacing early-stage self-attention. This simple solution achieves remarkable results, 82.39% ImageNet accuracy (+7.58% over Spikformer) with 30% less power consumption, plus consistent improvements on CIFAR and neuromorphic datasets. This work challenges the prevailing assumption that binary activation constraints are the main Spiking Transformer limitation, instead identifying frequency domain characteristics as the key bottleneck. It provides theoretical proof and practical solutions that significantly improve performance and energy efficiency while preserving the spike-driven computing paradigm.

Strengths and Weaknesses

Strengths

  1. Motivation
  • This paper challenges the common view that binary activations of spiking neurons in SNNs are the main reason for their performance limitation. The perspective is supported by both a clear theoretical foundation and empirical evidence.
  2. Simple Design with Good Performance
  • The design of Max-Former demonstrates that simply adding high-frequency enhancement operators (such as Max-Pooling and Depth-wise Convolution) can bring significant and consistent performance and energy-efficiency improvements on almost all datasets, which can also inspire practical solutions for unlocking the full potential of Spiking Transformers.
  3. Comprehensive Experiments
  • The authors provide comprehensive experiments in both the main paper and the supplementary material to prove the importance of high-frequency information. Overall, the paper is well-written, clearly presented, and the insights are very interesting.

Weaknesses

  1. Although low-pass filtering in spiking neurons is a common phenomenon, this paper only discusses how to improve spiking Transformers to achieve performance improvements on CV tasks based on this insight. A more general discussion will help to enhance the impact of this paper.

  2. As noted in the limitation section, Max-Former requires manually balancing frequency components, i.e., the high-frequency enhancement strategy needs to be modified for each task rather than adjusted adaptively. However, I agree that Max-Former will inspire future architectural design in the SNN field.

  3. Technical details: There are typos in lines 236-237 of the main paper. Besides, in Section 3.2.3, the inputs to the DWC module appear to be non-spiking signals. A LIF neuron layer is probably missing here.

Questions

Most of my questions are aligned with the weaknesses.

  1. Given that low-pass filtering in spiking neurons is a general phenomenon, would integrating high-frequency enhancement operators also improve other SNN architectures or tasks?

  2. The results in Tables 2-3 and Appendix A4(b) show that QKFormer* and Max-Former with hierarchical designs significantly outperform the other models. Does the hierarchical design itself help preserve high-frequency information?

  3. In Appendix A1.2(5), how is the DWC operation inserted into SSA? The explanation seems to be missing in the article.

Limitations

Yes.

Justification for Final Rating

After considering the author rebuttal, I am increasing my score to 5. The authors addressed my main concern about generalizability by demonstrating their high-frequency enhancement approach works beyond Transformers (showing improvements in convolutional SNNs), validating the broader applicability of their theoretical insight. The paper provides practical solutions that will likely inspire future research in spike transformers.

Formatting Issues

N/A

Author Response

W1 & W2 & Q1: Spiking Neural Networks Need High-Frequency Information

Thank you for appreciating our work. We have investigated this aspect and successfully verified that high-frequency information is also beneficial to convolutional-based spiking neural networks. To demonstrate this, we provide additional experimental results using the MS-ResNet18 backbone with the same experimental settings as reported in [1]. Our Max-ResNet18 only makes minor modifications to the original MS-ResNet18 architecture, i.e., adding a max pooling operator at the end of the downsampling shortcut module in the second ResNet-Block to enhance high-frequency information.

| Dataset | Architecture | Param (M) | T | Acc. (%) |
|---|---|---|---|---|
| CIFAR-10 | MS-ResNet [1] | 12.50 | 6 | 94.92% |
| CIFAR-10 | GAC-SNN [1] | 12.63 | 6 | 96.46% (+1.54%) |
| CIFAR-10 | Max-ResNet (ours) | 12.50 | 6 | 96.67% (+1.75%) |
| CIFAR-10 | MS-ResNet (ANN) [1] | 12.50 | NA | 96.75% |
| CIFAR-100 | MS-ResNet [1] | 12.54 | 6 | 76.41% |
| CIFAR-100 | GAC-SNN [1] | 12.67 | 6 | 80.45% (+4.04%) |
| CIFAR-100 | Max-ResNet (ours) | 12.54 | 6 | 81.15% (+4.74%) |
| CIFAR-100 | MS-ResNet (ANN) [1] | 12.54 | NA | 80.67% |

The results are quite promising: Max-ResNet consistently outperforms the MS-ResNet baseline with identical model size, and GAC-SNN with more parameters. Notably, our Max-ResNet nearly matches the ANN version of MS-ResNet on CIFAR-10 (96.67% vs 96.75%) and significantly outperforms it on CIFAR-100 (81.15% vs 80.67%, +0.48%).

At present, our conclusion is that for static datasets, high-frequency information should be strengthened in the middle layers of the model, while for event datasets, enhancing high-frequency information in the shallow layers of the model will bring greater benefits. This is also supported by Section 4.3 and Appendix A1.2 of our paper. We are preparing to share our analysis and results on more general cases in future work.

[1] “Gated attention coding for training high-performance and efficient spiking neural networks.” AAAI 2024.

Q2: Hierarchical Design Is Not the Fundamental Reason

In fact, this is mainly because QKFormer* adopts max-pooling operators in the patch embedding stage, and, in the hierarchical design, patch embedding is applied multiple times for downsampling. These architectural choices implicitly validate the importance of high-frequency information to the Spiking Transformer, which is also consistent with our experimental results presented in Table 4 of the main paper, where enhancing high-frequency information through patch embedding strategies improves the performance of Max-Former from 81.63% to 82.65%.

Beyond that, our work provides theoretical grounding for many existing architectural choices in Spiking Transformers. For instance, Spikformer V2 [3] discovered that adding convolution layers (which act as high-pass filters) in the patch embedding stage significantly improves the performance, while removing the max-pooling operator in [2] leads to a substantial performance drop. We believe our empirical and theoretical analysis offers valuable insights that will guide future research in spiking neural network architectures.

[2] “Spikformer: When spiking neural network meets transformer.” ICLR 2023

[3] “Spikformer v2: Join the high accuracy club on imagenet with an snn ticket.” arXiv 2024.

Q3: Implementation of SSA with DWC

We insert the DWC module after converting the output of the original SSA module into spikes. We will include detailed explanations in Appendix A1.2 along with implementation code to ensure reproducibility.
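A minimal sketch of that ordering, with the spiking self-attention and LIF modules treated as placeholders supplied by the caller (names and layer placement here are our assumptions, not the paper's exact code):

```python
import torch.nn as nn

class SSAWithDWC(nn.Module):
    """Wraps an existing spiking self-attention block: its output is first
    converted back to spikes, then mixed by a depth-wise convolution."""
    def __init__(self, ssa: nn.Module, lif: nn.Module, dim: int, kernel_size: int = 3):
        super().__init__()
        self.ssa = ssa        # original spiking self-attention module
        self.lif = lif        # spiking neuron layer producing binary outputs
        self.dwconv = nn.Conv2d(dim, dim, kernel_size,
                                padding=kernel_size // 2, groups=dim)
        self.bn = nn.BatchNorm2d(dim)

    def forward(self, x):     # x: (B, C, H, W)
        s = self.lif(self.ssa(x))        # spike-form SSA output
        return self.bn(self.dwconv(s))   # DWC operates on spikes, keeping synaptic ops cheap
```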

W3: Technical details

Thank you for identifying these issues. We will correct all typos and revise the corresponding content in the final version.

Comment

Thank you for your response and additional experiments. My concerns have been addressed. Considering that the paper's idea and method are simple yet effective, and the authors have provided reasonable explanations for the raised issues, I am increasing my score to 5.

Comment

We sincerely appreciate your timely feedback throughout this review process. Your insights are crucial to strengthening our work, and we will carefully incorporate all suggested improvements into our final version.

Official Review
Rating: 5

In this paper, the authors analyze the Spiking Transformer in the frequency domain. Interestingly, they found that LIF tends to pass low-frequency information and block high-frequency information. This phenomenon has been verified using an intuitive example. I am no expert in frequency-domain analysis, but I found the conclusion interesting and valuable for this area. The final resulting model, called MaxFormer, is a straightforward modification of the existing Spiking token-mixer and achieves a significant performance improvement.

Strengths and Weaknesses

Strengths:

  • Strong support of low-pass filter nature in LIF neuron, based on the analysis in Section 3.1.

  • Impressive performance improvement over average-pooling-based baselines. However, I did not verify the code's correctness. Given that the authors have provided the implementation of CIFAR100 models, I will buy the results for now.

Weaknesses:

  • Throughout the paper, the authors keep talking about the difference between LIF and ReLU neurons and claim that the ReLU neuron is more than just a low-pass filter. It would be interesting to see the ANN version of MaxFormer (at least on CIFAR-100). The current table only shows some trivial ANN practices, which do not fully convince the readers.

  • I hope the authors can fully open-source their implementation in rebuttal/after submission, or clarify the reason that only the CIFAR-100 code is provided.

Questions

  1. How do you explain the performance improvement from the pre-spike shortcut or vanilla shortcut with the frequency information?

  2. Except for max-pooling, is there another way to increase the high-frequency information by modifying LIF?

Limitations

Yes

Justification for Final Rating

The authors have addressed my concerns, showing that high-frequency information is needed specifically by SNNs. Based on this result, I believe this work carries enough take-home information, and I encourage its publication at NeurIPS.

Formatting Issues

None

Author Response

W1: Comparison with the ANN baseline

Thank you for your insightful feedback. We agree that it is more appropriate to compare the ANN version and spike-based version of Max-Former under the same architectural configuration and training hyperparameters. However, as the distribution of attention scores in ANNs and SNNs is very different, directly adopting the ANN-version of spiking self-attention, e.g., replacing all spiking neurons with the ReLU activation, will cause gradient explosion. We managed to provide additional experimental results on the ANN version of Max-Former by additionally scaling the q, k, and v matrices in the ReLU-based processing:

| Dataset | Architecture | T | Acc. (%) |
|---|---|---|---|
| CIFAR-10 | Max-Former (ours) | 4 | 97.04% |
| CIFAR-10 | QKFormer* | 4 | 96.84% |
| CIFAR-10 | Max-Former (ANN) | NA | 96.82% |
| CIFAR-100 | Max-Former (ours) | 4 | 82.65% |
| CIFAR-100 | QKFormer* | 4 | 81.57% |
| CIFAR-100 | Max-Former (ANN) | NA | 82.41% |

It is noted that the ANN version of Max-Former consistently underperforms the spiking Max-Former on both datasets (96.82% vs 97.04% on CIFAR-10; 82.41% vs 82.65% on CIFAR-100). Compared to QKFormer*, Max-Former (ANN) is slightly weaker on CIFAR-10 (96.82% vs 96.84%) but stronger on CIFAR-100 (82.41% vs 81.57%, +0.83%).

This can be explained from two perspectives. First, since ANN Transformers do not experience the same high-frequency attenuation problem, the frequency enhancement strategy in Max-Former may be less effective. Second, the spiking self-attention module may not be well-suited for token mixing in ANN Transformers.

W2: Code Implementation

We will fully open source our implementation in the final version since we cannot upload any files during the rebuttal phase. The CIFAR-100 model file we provided actually contains the implementation of all the patch embedding strategies/token mixers used in this article. We will further organize our code to facilitate reproduction.

Q1: Shortcuts in Spiking Neural Networks

For pre-spike shortcut, we found this cannot be explained simply from the frequency domain perspective as discussed in Appendix A4(a) of our paper. The pre-spike QKFormer requires 3x more synaptic operations than its membrane shortcut variant when processing the same images. We feel that a comparison between the two is unfair given the significant difference in computational cost. Therefore, throughout our paper, we perform analysis or experiments under the premise of binary activation transmission. Likewise, our Max-Former is primarily compared with the membrane shortcut variant of QKFormer (denoted as QKFormer*) that adheres to the binary activation rule.

For the vanilla shortcut, as explained in lines 188-190 of our main paper, it can lead to distribution mismatch and thus limit the model performance. Therefore, this shortcut scheme has not been widely adopted in spiking neural networks and is rarely used in recent studies.

Besides, we need to point out that replacing the membrane potential shortcut with the pre-spike shortcut does not always result in performance improvements. As presented in Table 2 and Table 3, compared to QKFormer with pre-spike shortcut, QKFormer* that employs the membrane shortcut achieves higher performance on CIFAR10/CIFAR-100/DVS128, while showing weaker performance on CIFAR10-DVS and ImageNet. The reasons behind this will be a valuable direction for future research.

Q2: LIF for High-Frequency Information

Based on our theoretical analysis presented in lines 119-141 of the main paper, reducing the value of $\beta$ in Eq. 6 can enable LIF neurons to capture more high-frequency information.

To illustrate this more intuitively, we need to note that in our formula, $\beta = 1 - \frac{1}{\tau}$, where $\tau$ represents the time constant of the membrane potential and ranges from $1$ to $+\infty$. Therefore, $\beta$ becomes smaller the closer $\tau$ is to $1$, and a smaller time constant $\tau$ causes the membrane potential to respond to narrower time windows, making the neuron more sensitive to higher-frequency signals. We will make revisions in the final version to better explain this point.
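As a compact illustration of this point, consider a generic discrete-time leaky-integration update (reset ignored); the constants may differ from the paper's Eq. 6:

```latex
% u[t] = \beta\, u[t-1] + x[t], with \beta = 1 - 1/\tau.
\[
  H(\omega) = \frac{1}{1 - \beta e^{-i\omega}},
  \qquad
  |H(\omega)| = \frac{1}{\sqrt{1 - 2\beta\cos\omega + \beta^{2}}}.
\]
% For 0 < \beta < 1, |H(\omega)| peaks at \omega = 0 and decreases monotonically on
% [0, \pi], i.e., low-pass behavior. As \tau \to 1 (so \beta \to 0), the response
% flattens toward |H(\omega)| \equiv 1, so relatively more high-frequency content passes.
```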

Comment

I would like to thank the authors for their detailed explanation and efforts in running additional experiments. The results of ANNs make the paper's claim stronger. Meanwhile, I have read the reviews from other reviewers. To me, this paper has delivered enough insights in the frequency domain and architectural design of spiking transformers. Therefore I will increase my score to 5.

Comment

Thank you for reviewing our response and appreciating our work! Wish you all the best!

Official Review
Rating: 2

Here is a precise summary in bullet points:

  1. Issue Addressed:

    Spiking Transformers underperform compared to traditional deep networks due to information loss, especially of high-frequency signals, from their binary and sparse activations.

  2. Proposed Solution:

    Introduce Max-Former, which enhances high-frequency feature retention via:

    Max-Pooling in patch embedding. Depth-Wise Convolution replacing self-attention.

  3. Evaluation:

    CIFAR-100: Improves accuracy from 76.73% (Avg-Pooling) to 79.12% (Max-Pooling). ImageNet: Achieves 82.39% top-1 accuracy, a +7.58% gain over Spikformer, with comparable model size.

Strengths and Weaknesses

  1. Lack of Theoretical Justification for Low-Frequency Bias in Spiking Neurons: While the paper claims that spiking neurons preferentially propagate low-frequency information, there is no rigorous theoretical or empirical analysis provided to support this assertion. A formal frequency-domain analysis or signal decomposition study would significantly enhance the credibility of this claim.

  2. Insufficient Literature Review on High-Frequency Modeling: The manuscript omits discussion of several relevant works that address high-frequency representation in vision models, such as:

    • SVT (Scattering Vision Transformer) [1], which models spectral components using wavelet-inspired mixing,
    • HiLo Attention [2], which splits frequency bands for efficient attention computation,
    • SpectFormer and WaveViT, which are designed to efficiently encode high-frequency textures.

    Including these would provide valuable context and distinguish the proposed method more clearly from existing approaches.

  3. Lack of Comparison with Contemporary Spectral Mixing Techniques: The proposed Max-Former does not compare or contrast with similar spectral decomposition and frequency-aware channel mixing methods, such as:

    • Monarch Mixing, which exploits structured linear transforms for spectral control,
    • EinFFT used in SiMBA-TS[3], which mixes channels via frequency-domain transformations.

    A detailed comparison is essential to understand how Max-Former is uniquely positioned among these recent advances.

  4. Limited Generalizability Discussion: It remains unclear whether the proposed Max-Former architecture is generalizable to non-spiking architectures or other modalities such as image captioning or multimodal tasks. The paper should explicitly discuss whether the frequency enhancement mechanisms apply only to spiking transformers or more broadly across architectures.

  5. Unsubstantiated Claims About Max-Former’s High-Frequency Behavior: The claim that Max-Former captures high-frequency components more effectively is not backed by sufficient evidence. There is no frequency-domain visualization, filter response plot, or Fourier spectrum analysis to demonstrate that the method indeed enhances high-frequency signals across layers or tasks.

  6. Missing Comparison with Strong Baselines: Tables 2 and 3 lack comparative results with strong recent baselines, particularly: SVT[1], HiLo Attention[2], EinFFT-based SiMBA[3].

    These omissions make it difficult to assess whether Max-Former truly offers a state-of-the-art tradeoff between accuracy, frequency preservation, and efficiency.

  7. Potential Overlap with Existing Work: A recent preprint available at arXiv:2505.18608 appears to tackle a similar problem space of frequency-aware modeling in spiking networks. The authors should clearly articulate how their contributions differ from or improve upon this work, including differences in methodology, theoretical insight, or empirical scope.

[1] Patro, Badri, and Vijay Agneeswaran. "Scattering Vision Transformer: Spectral Mixing Matters." NeurIPS 2024.

[2] "Fast Vision Transformers with HiLo Attention." NeurIPS 2022.

[3] "SiMBA-TS: Simplified Channel Mixing and Mamba for Long-term Time Series Forecasting." ICASSP 2025.

Questions

Refer to the Strengths and Weaknesses section.

Limitations

Refer to the Strengths and Weaknesses section.

Justification for Final Rating

The authors have provided a detailed rebuttal, but unfortunately, many of my original concerns remain unaddressed or insufficiently clarified. So I would like to keep my rating as it is.

Formatting Issues

No

Author Response

Thank you for your time in evaluating our work and providing feedback. However, we cannot fully agree with your opinions.

W1 & W5: Lack of Theoretical Justification and Frequency Domain Analysis

We understand your concerns about the analytical rigor in our work. We share your focus on reliability and have performed extensive experiments to validate our claims. Indeed, our claims are supported by comprehensive empirical evidence (especially our frequency domain analysis) and theoretical proofs throughout the paper.

We propose and preliminarily validate our hypothesis in lines 42-57 of the main paper. The discussion is presented along with several empirical findings. Figure 2 provides a comparative visualization of Fourier spectra, relative log amplitudes, and Grad-CAM activations for images processed by ReLU versus spiking neurons. This shows spiking neurons propagate low-frequency information more than ReLU neurons. To further verify our hypothesis, Figure 1 shows that simply replacing the avg-pooling with max-pooling operators for token mixing in Spiking Transformers can lead to a striking (+2.39%) performance improvement on the CIFAR-100. The results confirm that the rapid dissipation of high-frequency components is one of the factors that cause the modest performance of Spiking Transformers.

In Section 3.1, we use an entire subsection to prove the low-pass filtering nature of spiking neurons. Our analysis starts with an intuitive frequency analysis (Figure 3, lines 111-119), which shows that while ReLU expands frequency bandwidth, spiking neurons exhibit strong high-frequency attenuation. We then present a comprehensive theoretical analysis (lines 119-141) of the frequency-selective properties of commonly used LIF and IF neurons.

The Fourier spectra of spiking neurons, spiking max-pooling, spiking depth-wise convolution, and spiking self-attention are visualized to verify that Max-Former's architectural design helps to restore high-frequency components (Figure 6). We finalize our conclusion with the success of Max-Former, which achieves 82.39% top-1 accuracy on ImageNet (7.58% higher than the Spikformer with a similar model size), and the ablation studies in Section 4.3 and Appendix A1.2, which provide direct evidence for the critical role of high-frequency information in Spiking Transformers.
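As a generic illustration of this kind of frequency-domain diagnostic (Fourier spectra and relative amplitudes of feature maps), the sketch below measures the fraction of spectral energy above a chosen frequency radius; the function name, cutoff, and normalization are our assumptions, not the paper's exact measurement protocol.

```python
import torch

def relative_high_freq_energy(feat: torch.Tensor, cutoff: float = 0.25) -> torch.Tensor:
    """Fraction of spectral energy above a normalized frequency radius `cutoff`.

    feat: feature map of shape (B, C, H, W), e.g., the output of a spiking or ReLU block.
    Returns a scalar tensor in [0, 1]; lower values indicate stronger low-pass behavior.
    """
    spec = torch.fft.fftshift(torch.fft.fft2(feat.float()), dim=(-2, -1))
    power = spec.abs() ** 2
    H, W = feat.shape[-2:]
    fy = torch.fft.fftshift(torch.fft.fftfreq(H)).view(H, 1)
    fx = torch.fft.fftshift(torch.fft.fftfreq(W)).view(1, W)
    radius = torch.sqrt(fy ** 2 + fx ** 2)           # normalized frequency radius, 0 at DC
    high_mask = (radius > cutoff).to(power.dtype)    # 1 for "high-frequency" bins
    return (power * high_mask).sum() / power.sum()
```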

We always hope to support our point of view with sufficient analysis and promote the development of the field of spiking neural networks. We hope our response can address your concerns.

W2&W3&W6: Insufficient Literature Review and Comparison with Baselines

We need to clarify that our work is not related to high-frequency modeling based on spectral decomposition and is not aimed at proposing a better frequency learning method in spiking neural networks. Instead, our Max-Former only adds two lightweight high-frequency operators: additional max-pooling in the patch embedding process and depth-wise convolution to replace early-stage self-attention. We hope to eliminate the impact of factors such as computational complexity and receptive field, thereby better supporting our core argument: the performance degradation of spiking neural networks stems not only from binary activation but also from frequency loss. Therefore, we believe that including the works you mentioned would distract readers from our main contribution.

We also agree that direct frequency learning approaches like Fourier-based or wavelet-based methods, would offer more straightforward solutions, as discussed in our limitation section (Appendix A5). However, the main challenge lies in developing efficient spike-based frequency representations. For instance, existing works on spiking neural networks have attempted to adopt time-to-value mapping for accurate Fourier transforms, but at the expense of high latency [1]. A more recent work in Spiking Transformers proposed leveraging the inherent sparsity and robustness of the wavelet transform to approximate the spiking frequency representation, but relies on uncommon negative spike dynamics [2].

We have already compared with [2] in Table 2. In the final version, we will explicitly highlight frequency learning approaches as a valuable future direction for spiking neural networks and add the results of [2] to the comparison in Table 3.

[1] “Time-coded Spiking Fourier Transform in Neuromorphic Hardware”. In IEEE Transactions on Computers, 2022

[2] “Spiking Wavelet Transformer”. In ECCV, 2024

W4: Limited Generalizability Discussion

This is indeed an interesting question worth exploring. However, the core idea of our work is to emphasize the importance of the low-pass filtering properties of spiking neurons for model design, rather than proposing a universal frequency enhancement strategy that works for all tasks and architectures.

The frequency enhancement strategy in MaxFormer may be less effective for non-spiking Transformers that do not experience the same high-frequency attenuation problem. Here, we provide additional experimental results on CIFAR-10/100:

| Dataset | Architecture | T | Acc. (%) |
|---|---|---|---|
| CIFAR-10 | MaxFormer (ours) | 4 | 97.04% |
| CIFAR-10 | QKFormer* | 4 | 96.84% |
| CIFAR-10 | MaxFormer (ANN) | NA | 96.82% |
| CIFAR-100 | MaxFormer (ours) | 4 | 82.65% |
| CIFAR-100 | QKFormer* | 4 | 81.57% |
| CIFAR-100 | MaxFormer (ANN) | NA | 82.41% |

It is noted that the ANN version of MaxFormer consistently underperforms the spiking MaxFormer on both datasets (96.82% vs 97.04% on CIFAR-10; 82.41% vs 82.65% on CIFAR-100). Compared to QKFormer*, MaxFormer (ANN) is slightly weaker on CIFAR-10 (96.82% vs 96.84%) but stronger on CIFAR-100 (82.41% vs 81.57%, +0.83%).

Regarding other modalities, we need to point out that, for spiking neural networks, the field of image captioning or multimodal processing remains in early development stages, and there is no mature evaluation framework yet. However, we believe that empirical validation across modalities will be valuable future work.

W7: Overlap with Existing Work

Thank you for the reminder. We need to clarify that our work is entirely original. We have gone through the publication and noticed that the cited paper was submitted to arXiv one week after the conference submission deadline. Therefore, we believe a comparison with this work is not necessary.

Comment

Dear Reviewer foAf,

As the author-reviewer discussion period is coming to an end, we would like to know if you have any remaining questions or concerns about our paper. We would be happy to address them before the discussion period concludes~

Best,

Authors

Comment

The authors have provided a detailed rebuttal, but unfortunately, many of my original concerns remain unaddressed or insufficiently clarified.

W1 & W5 – Lack of Theoretical Justification and Frequency Domain Analysis: The authors emphasize that their hypothesis regarding the low-pass filtering nature of spiking neurons is supported by empirical results and some theoretical discussion (e.g., Section 3.1). However, the theoretical analysis remains relatively shallow and largely intuitive, lacking the rigor expected to fully substantiate their central claim. The rebuttal leans heavily on visualizations (e.g., Fourier spectra in Figures 2 and 6), but does not strengthen the theoretical grounding beyond what was already in the paper. The claim that high-frequency attenuation is a key reason for performance degradation in Spiking Transformers still feels speculative, despite the provided ablations.

W2, W3 & W6 – Insufficient Literature Review and Comparison with Baselines: The authors argue that comparisons with frequency-domain learning methods are outside the scope of their work. However, this feels dismissive. Since their central claim revolves around frequency degradation, a deeper engagement with prior frequency-based models—especially recent spiking wavelet and Fourier approaches—is essential. Their claim that including such works would "distract readers" is unconvincing. Moreover, while they mention adding [2] to Table 3, no updated comparisons were shown in the rebuttal.

W4 – Limited Generalizability Discussion: The authors admit that their method may not generalize well to non-spiking Transformers and other modalities. They present additional results on CIFAR-10 and CIFAR-100 with ANN and SNN variants, but the improvements are marginal. Their justification for excluding other modalities (e.g., lack of frameworks) is understandable but ultimately reinforces concerns about limited generalizability.

W7 – Overlap with Existing Work: The authors assert the originality of their work and cite submission timelines to explain the lack of comparison. However, this does not resolve the issue of conceptual overlap. Even if the cited paper appeared on arXiv post-deadline, the similarities should still be acknowledged in a more substantive way.

Comment

Thank you for engaging in our discussion. However, with the greatest respect, we disagree with your points.

W1 & W5: We appreciate your feedback regarding our paper. However, we are confused by a logical contradiction in your review. In your initial comments, you noted that our paper did not provide visualizations, theoretical analysis, or empirical analysis to demonstrate that Max-Former enhances high-frequency information. We highlighted in our response that these results were already presented in the original submission. Then, in your subsequent comment, you seem to contradict your initial position by criticizing our paper for relying too heavily on visualizations. However, the same visualization results cannot be both insufficient and excessive.

Similarly, regarding our analytical contributions, your review first noted a lack of theoretical and empirical analysis, then dismissed our theoretical analysis as "shallow" and empirical analysis as "speculative" without providing specific reasons. We would appreciate any further details on these points to help us address your concerns.

W2 & W3 & W6: Regarding related works, your initial feedback suggested adding comparisons with non-spiking Transformers and Mambas. In our response, we respectfully explained why those comparisons are inappropriate for our work, which focuses on spiking transformers, and emphasized our core contributions within this area. Those references were not mentioned again in your subsequent comments, and the concern instead became that we declined to compare with frequency-domain learning methods.

Instead, in Table 2 of our original submission, we have already compared with Spiking Wavelet Transformer [1], the first and only competitive spiking transformer that adopts frequency-domain learning methods. For your convenience, we provide an additional table with direct comparison results on ImageNet.

| Methods | Architecture | Param (M) | Timestep | Power (mJ) | Acc. (%) |
|---|---|---|---|---|---|
| Spiking Wavelet Transformer [1] | Transformer-6-512 | 21.8 | 4 | 3.87 | 74.84 |
| Spiking Wavelet Transformer [1] | Transformer-8-512 | 27.6 | 4 | 5.08 | 75.43 |
| Max-Former (ours) | Max-10-384 | 16.23 | 4 | 4.89 | 77.82 |
| Max-Former (ours) | Max-10-512 | 28.65 | 1 | 2.50 | 75.47 |
| Max-Former (ours) | Max-10-512 | 28.65 | 4 | 7.49 | 79.86 |

Our Max-Former consistently outperforms the Spiking Wavelet Transformer. Specifically, Max-10-384 achieves 77.82% accuracy with just 16.23 M parameters at timestep=4 (vs. Spiking Wavelet Transformer: 74.8–75.4% with 21.8-27.6 M params and similar power) and still hits 75.47% at timestep=1 using 28.65 M parameters while cutting energy by over 50% (2.5 mJ vs. 5.08 mJ).

[1] Spiking Wavelet Transformer. ECCV 2024

W4: For fair comparison, our experimental setting follows prior studies in the field of spiking transformers/token mixers [1-6], which focus on static image classification on CIFAR-10/CIFAR-100/ImageNet and neuromorphic data classification on CIFAR10-DVS/DVSGesture. We also need to note that the performance gap between ANNs and SNNs has always been a major challenge, and it is a significant result that our work enables SNNs to outperform ANNs.

[1] Spikformer: When spiking neural network meets transformer. ICLR 2023.

[2] Spike-driven Transformer. NeurIPS 2023.

[3] Spiking Wavelet Transformer. ECCV 2024.

[4] QKFormer: Hierarchical Spiking Transformer using QK Attention. NeurIPS 2024.

[5] Spiking Token Mixer: An Event-Driven Friendly Former Structure for Spiking Neural Networks. NeurIPS 2024.

[6] Spiking Transformer with Spatial-Temporal Attention. CVPR 2025.

W7: We do not have an answer for this, and the issue has been reported to the area chair.

Final Decision

The paper views spiking transformers through a frequency‑domain lens: LIF/IF neurons behave as low‑pass filters that dissipate high‑frequency content, contributing to SNN–ANN performance gaps. The authors propose Max‑Former, which restores high‑frequency information via (i) added max‑pooling in patch embedding and (ii) depth‑wise convolution replacing early self‑attention. Results show 82.39% top‑1 on ImageNet—well above Spikformer with similar model size—with consistent gains on CIFAR and neuromorphic datasets. The rebuttal adds ANN variants and a convolutional SNN (Max‑ResNet) showing similar benefits.

Strengths

  • Clear, actionable insight (low‑pass bias) tied to simple architectural changes that keep the spike‑driven paradigm.

  • Strong empirical gains on standard SNN benchmarks (incl. ImageNet) with lightweight changes and good efficiency.

  • Rebuttal meaningfully extends evidence (ANN comparison, Conv‑SNN), plus added training/efficiency details.

Weaknesses

  • Theory is suggestive rather than rigorous; arguments lean on intuition and visuals.

  • Related‑work positioning vs. frequency‑aware models (wavelet/Fourier, spectral mixing) could be deeper; partial overlap with MetaFormer‑style design and Spikformer‑v2 choices remains a concern.

  • Generality beyond vision SNNs is promising but not yet demonstrated.

Reasons for decision

Two reviewers moved to accept (5) after rebuttal, one remained borderline-accept (4), and one stayed negative. I weigh the consistent empirical gains, simplicity, and the take-home conceptual message (mind the frequency loss in SNNs) more heavily than concerns about theoretical tightness or partial overlap with related lines. The added convolutional SNN experiment and ANN comparisons clarify scope: the benefit is specific to SNNs where high-frequency attenuation is salient. Overall, the paper will likely influence practical SNN design and spur deeper theoretical work.

Discussion & rebuttal highlights

  • Authors clarified the frequency-domain evidence (figures/ablations) and provided a derivation for LIF’s low-pass behavior; critics still desired more rigor, but other reviewers found the analysis sufficient.
  • Comparisons & positioning: Authors argued their goal is not spectral learning per se; they added comparisons (e.g., to a spiking wavelet transformer) and discussed why broader spectral baselines might distract from the central claim.
  • Generalization: New results on Max-ResNet indicate gains beyond transformers; ANN vs SNN contrasts show the benefit aligns with the hypothesized SNN frequency issue.