Average rating: 5.0 / 10

Decision: Rejected, 4 reviewers

Lowest rating 3, highest rating 6, standard deviation 1.2

Average confidence: 3.5

ICLR 2024

S-TLLR: STDP-inspired Temporal Local Learning Rule for Spiking Neural Networks

Marco Paul E. Apolinario, Kaushik Roy


Submitted: 2023-09-23; Updated: 2024-02-11

Abstract

Spiking Neural Networks (SNNs) are biologically plausible models that have been identified as potentially apt for deploying energy-efficient intelligence at the edge, particularly for sequential learning tasks. However, training of SNNs poses significant challenges due to the necessity for precise temporal and spatial credit assignments. The back-propagation through time (BPTT) algorithm, whilst the most widely used method for addressing these issues, incurs a high computational cost due to its temporal dependency. In this work, we propose S-TLLR, a novel three-factor temporal local learning rule inspired by the Spike-Timing Dependent Plasticity (STDP) mechanism, aimed at training deep SNNs on event-based learning tasks. Furthermore, S-TLLR is designed to have low memory and time complexities, which are independent of the number of time steps, rendering it suitable for online learning on low-power edge devices. To demonstrate the scalability of our proposed method, we have conducted extensive evaluations on event-based datasets spanning a wide range of applications, such as image and gesture recognition, audio classification, and optical flow estimation. In all the experiments, S-TLLR achieved high accuracy, comparable to BPTT, with a reduction in the number of computations of between $1.1\times$ and $10\times$.

Keywords

Local learning, Spiking Neural Networks, Memory-efficient learning, STDP

Reviews and Discussion

Review

Rating: 5, Confidence: 4, 2023-10-30

This paper proposes an STDP-based learning algorithm that approaches SNN training from a memory-efficiency perspective. The proposed algorithm shows reduced complexity on event-based datasets.

Strengths

The proposed method shows reduced time complexity; it is natural that STDP-based learning requires less memory than BPTT with surrogate gradients. The proposed algorithm seems hardware-friendly, with discrete operations.

Weaknesses

W1: Insufficient experiments: I understand that event-based computer vision tasks are suitable for spiking neural networks, but I think the datasets reported in this paper are not comprehensive enough. In addition to the popular DVS-CIFAR10 and DVS-Gesture, N-Caltech101 and N-CARS are also adopted in prior work [R1] as benchmarks. However, these results are missing from the paper.

[R1] AEGNN: Asynchronous Event-Based Graph Neural Networks, CVPR, 2022.

W2: Since the paper claims that conventional BPTT is memory-expensive, it is important to demonstrate a memory-accuracy comparison between the proposed method and BPTT (e.g., GPU memory).

W3: Some recent papers and SoTA methods are not cited in this paper:

[R2]: Temporal Efficient Training of Spiking Neural Network via Gradient Re-weighting

[R3]: Differentiable Spike: Rethinking Gradient-Descent for Training Spiking Neural Networks, NeurIPS'21

[R4]: Training High-Performance Low-Latency Spiking Neural Networks by Differentiation on Spike Representation, CVPR 2022

W4: The methodology section should be elaborated more. Based on Figure 1, S-TLLR introduces the incoming gradient $\partial L / \partial y$ on top of discrete STDP. What is the theoretical advantage (or intuition) of doing that?

Questions

Please refer to the Weaknesses.

2023-11-23

Answer 1: Thank you for your suggestion. In our revised version, we have incorporated the results for the N-Caltech101 dataset. Regrettably, integrating the N-CARS dataset into our current framework was not feasible at this time. Please note that our initial version already included evaluations beyond event-based vision classification tasks. Specifically, we incorporated assessments in the event-based audio recognition and optical flow domains. These evaluations were deliberately included to explore longer temporal dependencies, such as in audio, and to tackle more intricate spatio-temporal regression through self-supervision, as in optical flow tasks.

Answer 2: We have taken the reviewers' feedback into careful consideration and have supplemented our paper with a more extensive analysis of the computational and memory cost of BPTT in Appendix C. Although the analysis primarily discusses fully connected models, it can be extrapolated to convolutional models as well. Additionally, we have included Figure 3 in Appendix C4, which uses a simple example to illustrate how GPU memory scales linearly with the number of time steps for BPTT, in contrast to the constant memory consumption of S-TLLR. These additions aim to provide a comprehensive understanding of the computational and memory complexities associated with our proposed framework.

Answer 3: We sincerely appreciate the reviewers' valuable input regarding the additional papers. To ensure completeness and accuracy in our comparative analysis, we have incorporated the suggested papers in Table 3. It is essential to note that while these prior works focused on achieving high performance in image classification on static datasets, our emphasis lies in addressing temporal locality and memory-efficient training for Spiking Neural Networks (SNNs). The highlighted papers, while significant, do not cover the pivotal aspects of our proposed approach, which emphasizes temporal dynamics and memory efficiency in SNNs.

Answer 4: STDP, an unsupervised learning algorithm, operates by adjusting synaptic connections based on the relative timing of spiking activity but lacks a global supervisory signal to guide learning toward specific goals. So, we designed S-TLLR to incorporate a mechanism akin to STDP to identify neurons for update based on spiking activity. Then, we use a third factor—the learning signal—to attribute credit or assign blame to the neurons identified for update. This additional supervisory mechanism augments the synaptic plasticity process, enabling goal-oriented learning within the SNN framework.
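The three-factor structure described in Answer 4 can be sketched roughly as follows. This is a minimal NumPy illustration, not the authors' implementation: the function name `three_factor_step`, the order of the trace updates, the default hyperparameter values, and the sign convention of the weight update are all assumptions made for the example. `psi_u` stands for a surrogate (secondary activation) evaluated at the postsynaptic membrane potential.

```python
import numpy as np

def three_factor_step(w, pre_trace, post_trace, x, psi_u, learning_signal,
                      lam_pre=0.5, lam_post=0.5, alpha_post=-0.2, lr=0.1):
    """One time step of a generic three-factor update (illustrative only).

    w:               (n_post, n_pre) weights
    pre_trace:       (n_pre,)  filtered presynaptic spikes up to the previous step
    post_trace:      (n_post,) filtered postsynaptic activity from earlier steps
    x:               (n_pre,)  presynaptic spikes at the current step
    psi_u:           (n_post,) surrogate activation at the current step
    learning_signal: (n_post,) third factor, e.g. an error propagated through layers
    """
    # Non-causal factor: earlier postsynaptic activity paired with current input.
    noncausal = np.outer(post_trace, x)
    # Causal factor: current postsynaptic activity paired with filtered inputs.
    pre_trace = lam_pre * pre_trace + x
    causal = np.outer(psi_u, pre_trace)
    # STDP-like eligibility: marks which synapses are candidates for an update.
    eligibility = causal + alpha_post * noncausal
    # Third factor assigns credit or blame to the eligible synapses.
    w = w - lr * learning_signal[:, None] * eligibility
    # Decay the postsynaptic trace so it is ready for the next step.
    post_trace = lam_post * (post_trace + psi_u)
    return w, pre_trace, post_trace, eligibility

# Usage: 2 inputs, 3 outputs, a single step starting from empty traces.
w, pre, post, e = three_factor_step(
    np.zeros((3, 2)), np.zeros(2), np.zeros(3),
    x=np.array([1.0, 0.0]), psi_u=np.array([1.0, 0.5, 0.0]),
    learning_signal=np.array([1.0, -1.0, 0.0]))
```

Only the traces and the current-step quantities are kept between steps, so the state carried forward does not grow with the sequence length.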

Review

Rating: 6, Confidence: 3, 2023-10-30

This paper introduces a new learning rule to train spiking neural networks. The idea is based on a three-factor structure, but using BPTT and STDP as its components. The STDP-based eligibility trace function scales with $n$ and is temporally local (does not scale with $T$), which is an improvement over existing methods. This method, referred to by the authors as S-TLLR, has an additional non-causal component which scales with $n$, just like the causal component, and therefore does not affect its scaling with space and time. Experiments and benchmarks on numerous datasets reveal the advantage of this non-causal learning component.

I am generally positive about this work in regards to the new proposed method and how it improves training of spiking networks. I hope that authors can clarify any misunderstandings I may have in the weaknesses section and I am very willing to adjust my score in the rebuttal.

Strengths

The theoretical scaling advantage is highly relevant and important to the spiking neural network community. Using an STDP-based eligibility trace function also lends biological plausibility, which is relevant to neuroscience audiences.

Weaknesses

I am doubtful of both main claims, on (1) temporal locality and (2) improvements from non-causal terms.

(1) The temporal locality property of this method is unconvincing. In Figure 1, my naive understanding is that it is possible to simply truncate both BPTT and STDP methods in the same way S-TLLR is truncated using equation (11). In other words, all methods can have temporal locality. The only way to truly claim that the proposed method does not scale with time, is by using both BPTT and STDP (and perhaps even other existing methods) with this truncation and see if S-TLLR learns faster or if other methods fail to learn the objective.

(2) The improvement from non-causal terms is similarly highly confounded by the secondary activation functions in equations (14-17). Suggestions for fair experiments could be:

universally use the same secondary activation function across all tasks, or use all 4 activation functions for all tasks
apply the same activation functions to other methods

To be very clear, I understand that S-TLLR is compared across different values of $\alpha$ within the same secondary activation functions, but it is not clear if this behavior is task- and function-specific. For example, dataset A and secondary function X could give better results with non-zero $\alpha$, while dataset B with secondary function X or dataset A with secondary function Y has better results with $\alpha = 0$.

(3) It is also not clear how the method works in the recurrent neural network task. If I were to incorporate causal recurrent gradients in Figure 1, that would correspond to red lines being drawn from $u[t]$ to $y[t-1]$ (and others), which means most terms will have red and blue lines in parallel.

(4) The recurrent term in equation (1), while correct and making the equation general, simply disappears afterwards, which breaks continuity with all later equations, where the narrative centers on a feedforward network. For example, equation (4) has no recurrent term. This should be stated somewhere in the text, or the term should be removed.

Questions

Should blue terms in Figure 1 also extend beyond $t-2$ (with three dots) just like the red terms?

While theoretical scaling arguments are convincing, there are many factors underlying number of computations. How are the 1.1x, 4x and 10x claims actually made? Was it done by recording the number of floating point operations? More information is needed to substantiate these claims. The actual amount of time taken to train the networks is also an important metric to include as well.

2023-11-23

Answer W1: By temporal locality we mean using only information available at the present. For example, in a sequence with multiple time steps, a temporally local operation at time step $t$ must use only information available at that time step. By this definition, STDP as expressed in equation (6) is a temporally local learning rule at any time step, as it uses only information available at the present, and therefore its memory requirements are independent of the length of the sequence. In contrast, as we discuss in Appendix C, BPTT is not temporally local because errors are backpropagated through time. This implies that the weight updates ($\Delta w$) at any time step depend on future information, as shown in equation (23) and Figure 2a. Note that this happens because of the implicit recurrent connections of the hidden layers (through membrane potential accumulation), not because of the loss function itself. So even if the loss for BPTT is computed as described in equation (11) and only for the last time step ($T_l=T$), the propagation of errors is similar to the one shown in Figure 2a, with the small difference that the connections between the last layer and the loss function disappear; the weight updates at time step $t$ ($\Delta w[t]$) still depend on information at future time steps ($t+1, t+2, \ldots, T$). Therefore, BPTT is not a temporally local learning rule and requires the information of every single time step to compute the synaptic updates. In the case of S-TLLR, as with STDP, we can use trace variables to propagate the required information forward in time and use a learning signal obtained by propagating errors through the layers (not through time). Therefore, the memory requirements of S-TLLR are independent of the sequence length. An analysis of the memory requirements is given in Appendix C.
Additionally, we would like to point out that equation (11) represents the instantaneous error signal obtained at the current time step ($t$), and the term $T_l$ only controls the point from which targets are available and therefore the number of weight updates. So even if the learning signal is available for all time steps ($T_l=0$), the memory requirements of S-TLLR are the same.
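The claim that trace variables carry the required information forward in time can be illustrated with a small sketch. It is a hedged example under stated assumptions: it covers only the exponentially decaying presynaptic trace, not BPTT or the full S-TLLR rule, and the function names and the decay value are hypothetical. It shows that a running trace with O(n) state gives the same result as a weighted sum over the stored spike history, which needs O(T·n) memory.

```python
import numpy as np

def trace_online(spikes, lam):
    """Running trace: keeps one vector of size n, whatever the sequence length T."""
    trace = np.zeros(spikes.shape[1])
    for x_t in spikes:                # single pass, forward in time
        trace = lam * trace + x_t     # only the previous trace and the current input are needed
    return trace

def trace_from_history(spikes, lam):
    """Same quantity computed from the full stored history of shape (T, n)."""
    T = spikes.shape[0]
    weights = lam ** np.arange(T - 1, -1, -1)   # lam^(T-1-t') for t' = 0 .. T-1
    return weights @ spikes

rng = np.random.default_rng(0)
spikes = (rng.random((20, 5)) < 0.3).astype(float)   # T = 20 steps, n = 5 inputs
assert np.allclose(trace_online(spikes, 0.8), trace_from_history(spikes, 0.8))
```

The online version never indexes into past time steps, which is the property that makes the memory footprint independent of $T$.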

Answer W2: In response to the ablation study suggestion, we conducted experiments, presented in Table 5 (Appendix E3). Due to resource constraints, these cover three activation functions with varied $\alpha_{post}$ values. The findings consistently demonstrate superior performance when using non-zero $\alpha_{post}$ compared to $\alpha_{post}=0$.

Answer W3: We have included a new Appendix D explaining the application of our method to recurrent models. Additionally, we updated Figure 1 to keep the discussion general.

Answer W4: We appreciate the feedback and have clarified that the primary text focuses on feed-forward networks, with discussions regarding recurrent models available in the supplementary material (Appendix D). This distinction aims to provide a comprehensive understanding while maintaining clarity within the main text.

Answer Q1: Yes, the blue terms in Figure 1 should extend beyond $t-2$. We have updated Figure 1 to keep the discussion general.

Answer Q2: In Appendix C, we present the analysis used to estimate the improvements, and we have updated the reported improvements accordingly. We deliberately omitted the GPU training times of the models. Our implementation (code) is written to illustrate explicitly how S-TLLR works, prioritizing a clear demonstration over runtime optimization. GPU training times depend on the specific code implementation, which is outside the primary focus of our discussion.

Review

Rating: 6, Confidence: 4, 2023-11-01

This paper introduces S-TLLR, a novel learning rule for Spiking Neural Networks (SNNs) aimed at efficient online learning on resource-constrained edge devices. S-TLLR draws inspiration from Spike-Timing Dependent Plasticity (STDP) and utilizes both causal and non-causal relationships for synaptic weight updates, maintaining constant memory and time complexity. Through extensive experimentation, the authors demonstrate that S-TLLR achieves comparable accuracy to traditional methods like BPTT but with significantly lower computational demands. The paper's contributions are highlighted by the improved generalization and performance of SNNs on a variety of event-based tasks—including image and gesture recognition, audio classification, and optical flow estimation—and the validation of S-TLLR's efficacy across multiple network topologies, marking a step forward in deploying energy-efficient intelligence in real-world applications.

Strengths

S-TLLR is a groundbreaking approach that successfully trains SNNs with high efficiency, addressing the temporal and spatial credit assignment challenge that is inherent in such networks.
By incorporating principles from STDP, S-TLLR aligns closely with biological neural processes, potentially unlocking more natural learning patterns and efficiencies.
S-TLLR successfully integrates both top-down modulation and the local algorithm.
The proposed learning rule maintains constant time and memory complexity, which is a significant advancement for deploying SNNs on edge devices where resources are constrained.

Weaknesses

The complexity was estimated, but the real energy consumption/efficiency has not been calculated or tested.
While BPTT could work on much deeper SNNs, how about S-TLLR? Could it be extended to larger models/datasets?

Questions

Please see the weaknesses:

Could the energy consumption/efficiency be calculated or tested?
While BPTT could work on much deeper SNNs, how about S-TLLR? Could it be extended to larger models/datasets, such as CIFAR100?

2023-11-23

Answer 1: We have expanded our discussion on the number of multiply-accumulate (MAC) operations required by both S-TLLR and BPTT in Appendix C. While the estimation of dynamic energy consumption is typically proportional to the number of MAC operations, during training, the energy consumption associated with memory read and write operations might significantly contribute, particularly for memory-intensive methods like BPTT. Accurately modeling such energy consumption is a complex task beyond the current scope of our research. It's worth noting that S-TLLR was specifically designed as a three-factor learning rule, tailored for potential utilization on neuromorphic hardware like Loihi, which is an avenue we plan to explore in our future research.

Answer 2: Primarily, our focus lies on event-based tasks, as we believe these tasks inherently benefit from utilizing SNNs due to their temporal information, unlike static image classification tasks. While our exploration into SNNs hasn't delved extensively into deeper architectures, this limitation is largely due to the absence of neuromorphic datasets comparable in size to more mainstream vision tasks such as CIFAR100 or ImageNet. However, it's important to note our experiments on N-Caltech101, which contains a similar number of classes as CIFAR100, demonstrating that S-TLLR performs comparably to BPTT. Additionally, our experiments on optical flow using MVSEC, a challenging spatiotemporal regression task, show that S-TLLR can achieve performance similar to BPTT. These results instill confidence that S-TLLR can indeed be extended to deeper models and a broader spectrum of tasks.

Review

Rating: 3, Confidence: 3, 2023-11-02

The paper introduces a new learning rule for Spiking Neural Networks. This rule has low linear memory complexity and quadratic time complexity in terms of number of neurons. Moreover, the proposed learning algorithm incorporates a non-causal learning term, inspired by Spike-Timing-Dependent Plasticity.

Strengths

Evaluation is done on a variety of tasks;
Paper is well-written and easy to follow.

Weaknesses

My main concern is that the method considered in the paper (S-TLLR) is very similar to OTTT [1]:

OTTT has the same learning rule as S-TLLR, except that S-TLLR additionally leverages non-causality and has a few other minor differences. But this non-causal term doesn't help S-TLLR consistently, based on Fig. 2;
S-TLLR has the same time and memory complexity;
S-TLLR doesn’t outperform OTTT.

[1] Mingqing Xiao, Qingyan Meng, Zongpeng Zhang, Di He, and Zhouchen Lin. Online Training Through Time for Spiking Neural Networks, NeurIPS 2022

Questions

Can the authors list all the differences between the OTTT and S-TLLR methods?
In the paper, it is mentioned that OTTT applies learning rules at each forward pass, whereas S-TLLR enforces the learning rule at every fourth forward step. Could the authors test the performance if S-TLLR's learning rule was applied at each forward pass, similar to OTTT?
Can the authors do an ablation study taking the OTTT model as a reference starting point? The study would systematically integrate modifications that transition the model towards the S-TLLR approach.

Details of Ethics Concerns

2023-11-23

Answer 1: Thank you for pointing out the similarities with OTTT [1], which allows for a more in-depth discussion. It's essential to note that while both algorithms share certain aspects, their fundamental motivations differ significantly. OTTT strives to approximate BPTT and is derived from BPTT equations by detaching all recurrent connections from the auto-differentiation graph. In contrast, S-TLLR is based on the theory of three-factor learning rules, utilizing the STDP mechanism to compute eligibility traces and a learning signal derived from backpropagation through layers or random feedback connections (DFA).

Furthermore, the selection of STDP in S-TLLR stems from the observation that the exponential decay of the causal term in STDP is akin to the computation of gradients over time with a decay factor equal to the leak factor of spiking neurons. However, while STDP considers both causal and non-causal terms for synaptic plasticity, focusing solely on causal information might result in a loss of valuable learning information. Hence, S-TLLR proposes a three-factor learning rule that models an STDP-like mechanism for neuron eligibility.

Another significant distinction lies in the target applications; OTTT, as stated by its authors, targets static image classification, while S-TLLR is specifically designed for temporal dependency tasks, deemed more suitable for SNNs. Notably, S-TLLR encompasses a pool of learning rules defined by the selection of hyperparameters (STDP parameters), where OTTT represents a specific case with hyperparameters $\alpha_{post} = 0$ and $\lambda_{pre} = \gamma$ .

In our evaluations (Tables 2, 3, 4, and 5), using non-causal terms consistently yields better performance than using only causal terms (S-TLLR with $\alpha_{post}=0$, which is equivalent to OTTT). Even though we attempted to replicate OTTT's results for DVS CIFAR10 using S-TLLR ($\alpha_{post}=0$) in Table 3, the discrepancy in batch size (we used 48, compared to 128 in [1], due to GPU memory limitations) may have influenced the outcome. Our comparison across various datasets supports the superiority of S-TLLR using non-causal terms over OTTT.

Answer 2: To clarify, we did not apply the learning rule at every fourth forward step, but during the last $T-T_l$ time steps of the sequence. For N-Caltech101, DVS Gesture, and DVS CIFAR10, it is applied during the last 5 time steps, and for optical flow, at the final time step. Following the reviewer's suggestion, we conducted simulations for DVS CIFAR10, presenting the learning signal at every time step of the sequence, as indicated in Table 3. Consistently, S-TLLR with non-causal terms outperforms OTTT (S-TLLR with $\alpha_{post} = 0$) for the same batch size of 48.

Answer 3: As previously discussed, OTTT can be viewed as a specific instance of S-TLLR when $\alpha_{post}=0$ and $\lambda_{pre}=\gamma$ (the leak factor in LIF models). Consequently, the ablation studies comparing OTTT and S-TLLR are covered throughout the work in Tables 2, 3, 4, and 5. Our results consistently demonstrate that S-TLLR outperforms OTTT (S-TLLR with $\alpha_{post}=0$) under the same conditions.
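To make the special-case relationship concrete, the following is a hedged sketch in LaTeX. The thread does not reproduce the paper's eligibility-trace equation, so the general form below is an assumption chosen to match the description in the responses (a causal term with decay $\lambda_{pre}$ plus a non-causal term scaled by $\alpha_{post}$). The symbols $\Psi$ (secondary activation), $u_i$ (membrane potential), $x_j$ (presynaptic spikes), and $\lambda_{post}$ are illustrative names, not confirmed notation.

```latex
% Assumed general shape of the eligibility trace (illustrative, not quoted from the paper)
e_{ij}[t] =
  \underbrace{\Psi(u_i[t]) \sum_{t'=0}^{t} \lambda_{pre}^{\,t-t'}\, x_j[t']}_{\text{causal term}}
  \;+\;
  \alpha_{post}\,
  \underbrace{x_j[t] \sum_{t'=0}^{t-1} \lambda_{post}^{\,t-t'}\, \Psi(u_i[t'])}_{\text{non-causal term}}

% Special case named in the response: \alpha_{post} = 0 and \lambda_{pre} = \gamma
e_{ij}[t] = \Psi(u_i[t]) \sum_{t'=0}^{t} \gamma^{\,t-t'}\, x_j[t']
```

Under this assumed form, setting $\alpha_{post}=0$ removes the non-causal term entirely, and setting $\lambda_{pre}=\gamma$ makes the presynaptic trace decay at the neuron's leak rate, which is the configuration the authors identify with OTTT.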

AC Meta-Review

2023-12-06

This paper describes a learning algorithm for spiking neural networks (SNNs) based on spike-timing-dependent plasticity (STDP) from neuroscience. Briefly, the algorithm, STDP-inspired Temporal Local Learning Rule (S-TLLR), employs a three-factor approach that combines the derivative of the loss with respect to a unit with the pre- and post-synaptic activations of that unit, similar to STDP rules. This approach, akin to some other recent approaches, ignores the temporally and spatially non-local components of the gradient update. As a result, the memory and time complexity of the algorithm is very good. The authors show that this efficient learning algorithm can learn several event-based tasks as well as or better than existing methods.

The reviews for this paper were borderline. The reviewers raised a variety of concerns, with some consistent ones about the actual improvements over existing techniques and the lack of tests of actual practical efficiency. Notably, one reviewer pointed out that this algorithm is very similar in form to a recent algorithm by Xiao et al. (2022), Online Training Through Time (OTTT). In discussion amongst reviewers it was noted that S-TLLR is effectively OTTT with the addition of a non-causal term. As such, S-TLLR is not actually any more efficient than OTTT. As well, the reviewers agreed that the experiments do not show a convincing advance over OTTT. Based on these observations, at the end of discussion, there was a consensus to reject, and the AC followed that advice.

Why not a higher score

The reviewers (and myself) were convinced that this is not actually a big advance, and essentially involves the addition of a non-causal term to OTTT, with little effect.

Why not a lower score

N/A

Final decision: Reject

2024-01-16

Reject