BEEF: Building a BridgE from Event to Frame
A novel event processing framework capable of splitting event streams into frames in an adaptive manner.
Abstract
Reviews and Discussion
This paper proposes a novel pre-processing framework (i.e., BEEF) to split continuous event streams into event slices in an adaptive manner. BEEF mainly adopts an energy-efficient SNN to trigger the slicing times. Technically, a dataset is first split into event slices by the SNN, which is robust to high-speed and low-speed scenarios. Then, the event slices are used to fine-tune the ANN to verify performance on downstream event-based vision tasks. The experiments show that the proposed BEEF achieves SOTA performance in event-based object tracking and event-based object recognition.
Strengths
i) The topic of adaptively splitting event streams using an SNN is very interesting and attractive.
ii) The authors provide sufficient experiments in the main paper and the supplementary material to help readers better understand the main contributions of this work.
iii) The writing is straightforward, clear, and easy to understand.
Weaknesses
i) While fixed windows or a fixed event count may not offer optimal performance for event partitioning pre-processing, they do provide a quick processing option for collaboration with subsequent vision tasks. The authors also adopt the SNN for event stream division, but it is crucial to determine whether this process is time-consuming across different platforms (CPU, GPU) and whether it is suitable for downstream tasks, particularly those requiring low-latency responses for agile robots. Although the authors give an analysis of processing speed, a computational analysis on CPU should also be provided.
ii) The authors have conducted a comparison experiment with a fixed number of events, as shown in Table 3. Nevertheless, it is advisable for the authors to include experiments with a fixed time window. Furthermore, the authors should investigate how various parameters for fixed events or fixed time windows compare to BEEF. Additionally, it would be beneficial for the authors to provide more visual comparison results of event representations.
iii) There are articles exploring adaptive event stream splitting strategies. The author should consider citing some relevant references [1, 2] that utilize hyperparameters for implementation.
[1] EDFLOW: Event driven optical flow camera with keypoint detection and adaptive block matching, IEEE TCSVT 2022.
[2] Asynchronous spatio-temporal memory network for continuous event-based object detection, IEEE TIP 2022.
Questions
See weakness.
Weakness 1 (Computational Analysis in both GPU and CPU)
Q1: While fixed windows or a fixed event count may not offer optimal performance for event partitioning pre-processing, they do provide a quick processing option for collaboration with subsequent vision tasks. The authors also adapt the SNN for event stream division, but it's crucial to determine if this process is time-consuming across different platforms (CPU, GPU) and if it's suitable for downstream tasks, particularly those requiring low-latency responses for agile robots. Although the authors give the analysis of processing speed, it should be given the computational analysis in CPU.
A1: Thank you for your insightful comments! We have conducted a computational analysis of the BEEF framework on both CPU and GPU platforms:
| | Slicing Latency | FPS |
|---|---|---|
| SNN on GPU | 0.009s per img | 111 Hz |
| SNN on CPU | 0.430s per img | 2.3 Hz |
The results indicate that SNN processing of event streams on a GPU offers the advantage of low latency, while there is a noticeable delay when processing on a CPU.
However, it is important to note that SNNs are primarily designed to operate on neuromorphic hardware, where they can leverage their low power consumption and low latency for efficient data processing in scenarios demanding quick responses. There is significant research demonstrating the benefits of SNNs in processing event streams on neuromorphic hardware. For instance, the use of the Loihi hardware for event-based vision tasks [1,2] and the Tianjic chip [3] for robotics applications [4] are notable examples.
We will include these details in our revised manuscript to provide a comprehensive understanding of the processing capabilities of SNNs across different hardware platforms. Thanks for your suggestion!
Reference:
[1] Roy A, Nagaraj M, Liyanagedera C M, et al. Live Demonstration: Real-time Event-based Speed Detection using Spiking Neural Networks. CVPRW 2023.
[2] Viale A, Marchisio A, Martina M, et al. Carsnn: An efficient spiking neural network for event-based autonomous cars on the loihi neuromorphic research processor. IJCNN 2021.
[3] Brain-inspired multimodal hybrid neural network for robot place recognition. Science Robotics 2023
[4] Viale A, Marchisio A, Martina M, et al. LaneSNNs: Spiking Neural Networks for Lane Detection on the Loihi Neuromorphic Processor. IROS 2022.
Weakness 3 (Relevant References)
Q3: There are articles exploring adaptive event stream splitting strategies. The author should consider citing some relevant references [1, 2] that utilize hyperparameters for implementation.
A3: Thank you very much for your suggestion! We acknowledge the importance of incorporating relevant references, particularly those that explore adaptive event stream splitting strategies. We will make sure to include the references you have mentioned [1,2] in the revised version of our manuscript! Thanks!
Reference:
[1] Liu M, Delbruck T. EDFLOW: Event driven optical flow camera with keypoint detection and adaptive block matching. IEEE TCSVT 2022.
[2] Li J, Li J, Zhu L, et al. Asynchronous spatio-temporal memory network for continuous event-based object detection. IEEE TIP 2022.
The author has addressed all my queries through the experimental responses. However, I suggest incorporating these experiments into the supplementary material to make them accessible to a broader audience of readers.
Thank you very much for your valuable suggestions and feedback. We are delighted to have addressed all your queries and appreciate your recognition of our work.
In response to your suggestion, we have submitted a revised manuscript that includes the details of the important experiments. We will submit the final revised version containing all the suggested experiments shortly!
Thank you once again!
Weakness 2 (Comparisons on Different Fixed Slicing Methods)
Q2: The authors have conducted a comparison experiment with a fixed number of times, as shown in Table 3. Nevertheless, it is advisable for the authors to include experiments with a fixed time window. Furthermore, the authors should investigate how various parameters for fixed events or fixed time windows compare to BEEF. Additionally, it would be beneficial for the authors to provide more visual comparison results of event representations.
A2: Thank you very much for your valuable suggestions! In the original experiments presented in Table 3, we used slicing with a fixed event count for comparison. We will include more specific details about this in the revised version of our manuscript. To facilitate a more complete comparison between dynamic slicing and traditional fixed slicing methods, we have supplemented our experiments on the object recognition task. These include slicing based on a fixed number of events and slicing based on a fixed duration. We have also compared three different event representation methods, with the results as follows:
| DVSGesture | Event Frame | Event Spike Tensor | Voxel Grid |
|---|---|---|---|
| Fixed Duration | 93.75% | 93.75% | 88.54% |
| Fixed Event Count | 93.06% | 94.79% | 88.19% |
| BEEF(ours) | 94.79% | 95.49% | 89.24% |
The results show that our dynamic slicing approach, BEEF, outperforms the fixed slicing approach in downstream tasks across different event representations. This highlights the efficacy of BEEF in handling event streams.
Additionally, we plan to include more illustrative figures in the revised version to vividly depict the differences and respective advantages of dynamic and fixed slicing. More visualization results related to the experiments will also be added to the appendix.
Thank you again for your suggestion! Your feedback is instrumental in enhancing the clarity and comprehensiveness of our research.
The authors propose BEEF, a novel event processing framework that can slice event streams in an adaptive manner. To achieve this, BEEF employs an SNN as the event trigger to dynamically determine the time at which the event stream needs to be split, rather than requiring hyper-parameter adjustment as in traditional methods.
Strengths
S1: Papers dealing with spiking-related algorithms should be of interest to the subset of the machine learning community investigating on-the-edge computing algorithms.
S2: The paper is relatively well written.
Weaknesses
W1: I am aware that with event and spiking cameras it is quite popular to convert the event/spike streams into a sort of frame-based representation. However, I have a fundamental objection to this type of approach (which is shared by quite a few of my colleagues around the world, in private conversations at least) as to why these fundamentally asynchronous event stream representations should be converted to a rather synchronous representation, simply to be able to map them into algorithms that were originally developed for synchronous frame-like data. I think a more thorough discussion on this is needed in the paper to better motivate the work.
W2: clarify better what are the alternative methods to which this is being compared? What exactly is meant by "fixed slice" approaches to which this is being compared? Many approaches for producing frame like representations (such as getting the max or union of all events in a time window) result in the introduction of significant amounts of noise. In contrast morphological operands like erosion and dilation can introduce much better quality frames. To what extent is the good performance of the algorithm attributable simply to noisy frame generation in competing approaches?
W3: Unless I missed it, will source code be provided?
Questions
See my questions above. Addressing them would improve the paper's relevance
Details of Ethics Concerns
N/A
Weakness 3 (Code Available)
Q3: unless i missed it, will source code be provided?
A3: Absolutely! We will provide the source code upon acceptance.
Weakness 2 (Clarification and Discussion)
Q2: clarify better what are the alternative methods to which this is being compared? What exactly is meant by "fixed slice" approaches to which this is being compared? Many approaches for producing frame like representations (such as getting the max or union of all events in a time window) result in the introduction of significant amounts of noise. In contrast morphological operands like erosion and dilation can introduce much better quality frames. To what extent is the good performance of the algorithm attributable simply to noisy frame generation in competing approaches?
A2:
Symbol description: let $\mathcal{E}$ denote the total event stream, $\{S_i^{\text{BEEF}}\}$ the list of sub-event streams sliced by BEEF, and $\{S_i^{\text{fix}}\}$ the list of sub-event streams produced by the fixed slicing method.
Thank you for your insightful suggestions! First, let us clarify our dynamic slicing approach, which allows the event stream to be sliced at any timestamp and then converted into frames for downstream tasks. In Table 1, the term "fixed slice" refers to a fixed slicing approach with a constant number of events per sub-event stream, which segments the entire event stream $\mathcal{E}$ into $\{S_i^{\text{fix}}\}$. These sub-event streams are then transformed into representations using the same event representation method (we used Event Frame [1]) and fed into downstream tasks (tracking/recognition) for testing. We will revise the terminology to "slicing with fixed event number" in our updated manuscript to make the experimental settings clearer.
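For concreteness, the following minimal sketch contrasts the two slicing schemes discussed above: slicing with a fixed event count versus grouping events between consecutive SNN output spikes. This is an illustration only, not our implementation; the array and function names (`timestamps`, `spike_times`, etc.) are assumptions.

```python
import numpy as np

def slice_by_fixed_count(timestamps, events_per_slice):
    """Fixed slicing: every sub-event stream contains the same number of events."""
    order = np.argsort(timestamps)
    return [order[i:i + events_per_slice]
            for i in range(0, len(order), events_per_slice)]

def slice_by_spike_times(timestamps, spike_times):
    """Dynamic slicing: group all events whose timestamps fall between
    two consecutive output spikes of the trigger SNN."""
    bounds = np.concatenate(([timestamps.min()], np.sort(spike_times),
                             [timestamps.max() + 1e-9]))
    return [np.flatnonzero((timestamps >= t0) & (timestamps < t1))
            for t0, t1 in zip(bounds[:-1], bounds[1:])]
```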
We greatly appreciate your mention of the potential use of morphological operators. These methods are more commonly found in event stream denoising algorithms or event representation algorithms [2], whereas our paper primarily focuses on the event stream slicing process. The techniques you mentioned, like erosion or dilation, could be integrated with our current approach; for example, denoising the original event stream before applying BEEF's dynamic slicing. We intend to investigate this further in future research. Thank you for this valuable suggestion!
Reference:
[1] Maqueda A I, Loquercio A, Gallego G, et al. Event-based vision meets deep learning on steering prediction for self-driving cars. CVPR 2018.
[2] Baldwin R W, Liu R, Almatrafi M, et al. Time-ordered recent event (TORE) volumes for event cameras. TPAMI 2022.
Weakness 1 (Asynchronous Event and Synchronous Representation)
Q1: I am aware that with event and spiking cameras it is quite popular to convert the event/spike streams into a sort of frame based representation. However I have a fundamental objection with this type of an approach (which is shared by quite a few of my colleagues around the world, in private conversations at least) as to why should these fundamentally asynchronous event streams representations should be converted to a rather synchronous representation, simply to be able to map them into algorithms that were originally developed for synchronous frame like data. I think a more thorough discussion on this is needed in the paper to better motivate the work.
A1: Thank you for your insightful comment. First, I wholeheartedly agree with your statement regarding the conversion of asynchronous event data into synchronous formats, which indeed can undermine the inherent advantages of the asynchronous nature. This is not an optimal representation, and we acknowledge this limitation. However, the reasons for converting asynchronous event data into synchronous representations in current practices can be summarized as follows:
- Why choose synchronous over asynchronous processing? Given that existing GPU hardware architectures and programming models are designed for highly synchronous and parallel tasks, converting asynchronous event stream data into a synchronous representation becomes a necessity for algorithm simulations performed on GPUs. In our paper, we aim to slice the event stream into very fine event cells to represent the original event data as closely as possible (Sec. 4.1), hoping to minimize the errors introduced by the synchronous representation (see the sketch after this list).
- The synchronous simulations conducted on GPUs in our work do not preclude the possibility of future asynchronous implementations on neuromorphic hardware. Both SNNs and event stream data are inherently asynchronous. However, due to hardware constraints, current software simulations are temporarily unable to achieve true asynchronous processing. If deployed on neuromorphic hardware (e.g., Loihi [1], TrueNorth [2]), the asynchronous processing of event streams by SNNs would be extremely low-energy and low-latency [3,4,5]. Therefore, this work lays the theoretical groundwork for future hardware deployment, and efficiently implementing SNN processing of asynchronous event streams in conjunction with ANNs on neuromorphic hardware is one of our future goals.
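As an illustration of the event-cell/frame conversion mentioned in the first point, the sketch below accumulates one event cell into a simple two-channel count frame (one channel per polarity). The function and argument names are ours and only illustrate the general idea, not the exact representation used in the paper.

```python
import numpy as np

def events_to_count_frame(xs, ys, ps, height, width):
    """Accumulate the events of one cell/slice into a (2, H, W) count frame,
    keeping positive and negative polarities in separate channels."""
    frame = np.zeros((2, height, width), dtype=np.float32)
    np.add.at(frame, (ps.astype(np.int64), ys, xs), 1.0)  # ps assumed to be 0/1
    return frame
```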
We hope this explanation addresses your concerns. We are committed to further discussing this topic in our manuscript to provide a comprehensive motivation for our work.
Reference:
[1] Davies M, Srinivasa N, Lin T H, et al. Loihi: A neuromorphic manycore processor with on-chip learning. IEEE Micro 2018.
[2] Akopyan F, Sawada J, Cassidy A, et al. Truenorth: Design and tool flow of a 65 mw 1 million neuron programmable neurosynaptic chip. IEEE TCAD 2015.
[3] Roy A, Nagaraj M, Liyanagedera C M, et al. Live Demonstration: Real-time Event-based Speed Detection using Spiking Neural Networks. In CVPRW 2023.
[4] Viale A, Marchisio A, Martina M, et al. Carsnn: An efficient spiking neural network for event-based autonomous cars on the loihi neuromorphic research processor. In IJCNN 2021.
[5] Yu F, Wu Y, Ma S, et al. Brain-inspired multimodal hybrid neural network for robot place recognition. Science Robotics 2023.
Thank you for your response and for taking the time and effort to review our paper. Although BEEF demonstrates superior performance compared with SOTA methods, we believe that the flow of the original paper may have caused confusion; for instance, most reviewers confused our method with an event representation method. We have noted this and now emphasize in the revised paper that our method is not a representation method but is compatible with any representation method (including asynchronous representations). In addition, thanks to the suggestions made by the reviewers, we have added a number of experiments that have significantly increased the reliability of our paper.
Back to your query: to the best of our knowledge, there is no commonly used SNN framework that can directly handle asynchronous inputs; current frameworks can only handle frame-based inputs. However, we believe that an SNN trained with current frameworks can potentially be utilized to process asynchronous input. To demonstrate this, we quickly added one experiment. When the number of slices increases, the frame-like input gradually becomes more similar to asynchronous input. Therefore, we cut the event stream into up to 1000 slices, where each slice has only ~100 events and almost all pixels are 0 or 1. Considering the total pixel number is 180 x 240 = 43200, this is quite sparse. As shown in the following table, the same SNN fires at similar percentages (i.e., the resulting sub-stream always contains a similar proportion of event points), demonstrating that the SNN is capable of perceiving event information. Indeed, this is still not fully asynchronous input, but we hope it provides some insight here.
| Number of Slices | 30 | 50 | 100 | 200 | 300 | 400 | 500 | 600 | 700 | 800 | 900 | 1000 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Spike Position | 15 | 25 | 51 | 102 | 153 | 213 | 256 | 324 | 387 | 444 | 504 | 560 |
| Percentage of Containing Event | 50.00% | 50.00% | 51.00% | 51.00% | 51.00% | 53.25% | 51.20% | 54.00% | 55.29% | 55.50% | 56.00% | 56.00% |
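For reference, the sparsity claimed above follows directly from the stated numbers:

$$\frac{\sim 100\ \text{events per slice}}{180 \times 240\ \text{pixels}} = \frac{100}{43200} \approx 0.23\%.$$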
I have read the comments by the authors and reviewers. The overall evaluation of the paper seems to be consistent more or less across reviewers. As I indicated in my comments the conversion of asynchronous events to a frame based representation defeats the purpose of event/spiking cameras (in my personal opinion at least). Indicating that this is done because we need to run it on GPUs is not convincing to me, so I keep my ranking unchanged.
This paper proposes an efficient way for event representation. Specifically, they introduce an SNN for adaptive event slicing, which can choose appropriate slicing times considering the events' temporal features and the downstream task. The authors present several losses to further improve the adaptiveness, and a strategy to let the SNN better assist the ANN in an iterative and cooperative manner.
Strengths
- The overall writing of this work is clear and easy to follow.
- The three observations and solutions seem to work well and improve the adaptation for slicing time.
- Using an SNN for event representation is rational considering the similar nature of SNNs and event data.
Weaknesses
- This paper fails to fully review the topic of this work: event representation. As suggested in [1][2], there are several existing event representation strategies, including stacking based on time/event counts, voxel grid, histogram of time surfaces, and event spike tensor, and a recent work introduces neural representation [3]. However, this paper only mentions two of them. In addition, the motivation to consider temporal information is similar to event count integration, which is mentioned by the authors.
- The necessity of a very lightweight SNN is not clear. Since SNN works with ANN cooperatively, SNN has only very limited contribution to the overall computational cost. As implied in Table 2, considering the ANN is the major cost for the process, the contribution and necessity for low energy and fast speed of SNN is reduced.
- The compared methods in the experiment are not sufficient. More event representation/stacking methods should be considered to compare with the proposed methods, including the methods mentioned in [1-3].
- I wonder whether such iterative optimization of the SNN and ANN works better than joint optimization, i.e., regarding the whole process as an end-to-end task and optimizing the SNN loss and downstream task loss together.
- More details about the experimental settings are required. The proposed methods use adaptive slicing time, how to create GT accordingly? And how to compare with fixed-sliced methods that have different timestamps for event frames?
[1] End-to-End Learning of Representations for Asynchronous Event-Based Data, ICCV 2019.
[2] Event-based High Dynamic Range Image and Very High Frame Rate Video Generation using Conditional Generative Adversarial Networks, CVPR 2019.
[3] NEST: Neural Event Stack for Event-based Image Enhancement, ECCV 2022.
Questions
See the weakness above
Weakness 5 (Experiment Details)
Q5: More details about the experimental settings are required. The proposed methods use adaptive slicing time, how to create GT accordingly? And how to compare with fixed-sliced methods that have different timestamps for event frames?
A5: Thank you for your question. Below are more detailed explanations of our experimental settings and the methodology for generating Ground Truth (GT). We will include these details in the appendix of the revised version:
Experimental settings:
- Network Structure: We used a Spiking Neural Network architecture comprising {16C3-IF-AP2-32C3-IF-AP2-64C3-IF-AP2-LNIF-LN-IF}, where IF denotes integrate-and-fire neurons. (Appendix E)
- Training Setup: We adopted the SGD optimizer with an initial learning rate of 1e-4, complemented by a cosine learning rate scheduler. SNN models were trained for 50 epochs with a batch size of 32. (Appendix E)
- Data Processing Setup: For the single object tracking task, we utilized the FE108 dataset, while for object recognition, we used the DVS-Gesture and N-Caltech101 datasets. Each dataset was divided into training, testing, and validation sets. The validation set was used for inference with the ANN model to provide feedback for supervising SNN training. All results in the table represent the ANN's performance on the test set.
- GT Setting: Since the SNN might choose to spike and segment the event stream at any position, the GT at any timestamp is obtained through linear interpolation from the GT provided by the original dataset (a small sketch follows this list).
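The GT interpolation step can be sketched as follows. This is illustrative only; `gt_times` and `gt_boxes` stand for the annotated timestamps and bounding boxes provided by the original dataset, and the function name is hypothetical.

```python
import numpy as np

def interpolate_gt(query_t, gt_times, gt_boxes):
    """Linearly interpolate bounding-box GT at an arbitrary slicing timestamp.

    gt_times: (M,) annotated timestamps from the original dataset (increasing).
    gt_boxes: (M, 4) boxes (x, y, w, h) at those timestamps.
    """
    gt_boxes = np.asarray(gt_boxes, dtype=np.float32)
    return np.array([np.interp(query_t, gt_times, gt_boxes[:, k])
                     for k in range(gt_boxes.shape[1])])
```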
Details of the fixed-slicing method:
Symbol description: let $\mathcal{E}$ denote the total event stream, $\{S_i^{\text{BEEF}}\}$ the list of sub-event streams sliced by BEEF, and $\{S_i^{\text{fix}}\}$ the list of sub-event streams produced by the fixed slicing method.
Suppose an event stream $\mathcal{E}$ (duration $T$) is sliced into $n$ slices $\{S_i^{\text{BEEF}}\}_{i=1}^{n}$ after dynamic slicing. For a fair comparison, the number of sub-event streams generated by the fixed slicing method should also be $n$ (i.e., $\{S_i^{\text{fix}}\}_{i=1}^{n}$), where the duration of each sub-event stream in $\{S_i^{\text{fix}}\}$ is $T/n$. Since, as mentioned earlier, we know the GT at each timestamp, we can likewise obtain the GT corresponding to each sub-event stream after fixed slicing. We can then compare the results of the fixed slicing method with those of our dynamic slicing method.
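In the placeholder notation above, the $i$-th fixed sub-event stream simply covers one equal share of the total duration:

$$S_i^{\text{fix}} = \left\{ (x, y, t, p) \in \mathcal{E} \;\middle|\; \tfrac{(i-1)\,T}{n} \le t < \tfrac{i\,T}{n} \right\}, \qquad i = 1, \dots, n.$$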
Weakness 4 (End-to-end Optimization)
Q4: I wonder whether such iterative optimization of SNN and ANN work better than joint optimization, like we regard the whole process as an end-to-end task and optimize the SNN loss and downstream task loss together.
A4: Thanks for your constructive suggestion! We have explored your idea of training the SNN and ANN from scratch and optimizing them together. However, as shown in the table below, the ResNet18 result is similar to the original one, while ResNet34 even shows a performance degradation:
| DVSGesture | Random Slice | Fixed Slice | BEEF | BEEF (optimize both from scratch) |
|---|---|---|---|---|
| ResNet18 | 93.06% | 93.40% | 93.49% | 93.75% |
| ResNet34 | 95.14% | 93.40% | 96.18% | 92.36% |
Thus, we believe that further in-depth exploration is needed to fully realize the concept you have mentioned, and that the training strategies need to be improved.
Thank you for your interesting comments!
Weakness 3 (Comparisons of Different Event Representations)
Q3: The compared methods in the experiment are not sufficient. More event representation/stacking methods should be considered to compare with the proposed methods, including the methods mentioned in [1-3].
A3: Thanks for your suggestion! Event representation refers to the process of event information extraction performed after the event stream has been sliced into sub-event streams, such that the resulting representation meets the neural network's input requirements. Thus, our dynamic slicing process and event representation methods can be used at the same time.
To validate the effectiveness of our slicing approach, we assess the downstream task performance using three distinct event representation methods, namely Event Frame [1], Event Spike Tensor (EST) [2], and Voxel Grid [3], on the DVSGesture dataset. We measure these against both fixed (slice by fixed duration and fixed event count) and dynamic slicing approaches to provide a comprehensive analysis:
| DVSGesture | Event Frame | Event Spike Tensor | Voxel Grid |
|---|---|---|---|
| Fixed Duration | 93.75% | 93.75% | 88.54% |
| Fixed Event Count | 93.06% | 94.79% | 88.19% |
| BEEF(ours) | 94.79% | 95.49% | 89.24% |
The results show that our dynamic slicing approach, BEEF, outperforms the fixed slicing approach in downstream tasks across different event representations. This highlights the efficacy of BEEF in handling event streams.
Thanks for your suggestion! We will cite the event representation methods you mentioned in the revised paper and try to integrate them with BEEF in the future.
Reference:
[1] Maqueda A I, Loquercio A, Gallego G, et al. Event-based vision meets deep learning on steering prediction for self-driving cars. CVPR 2018.
[2] Gehrig D, Loquercio A, Derpanis K G, et al. End-to-end learning of representations for asynchronous event-based data. ICCV 2019.
[3] Zhu A Z, Yuan L, Chaney K, et al. Unsupervised event-based learning of optical flow, depth, and egomotion. CVPR 2019.
Weakness 2 (Necessity of Using SNN)
Q2: The necessity of a very lightweight SNN is not clear. Since SNN works with ANN cooperatively, SNN has only very limited contribution to the overall computational cost. As implied in Table 2, considering the ANN is the major cost for the process, the contribution and necessity for low energy and fast speed of SNN is reduced.
A2: Thank you for your question. Let us elucidate the necessity of using an SNN.
1. The reason why we choose an SNN as the event slicing trigger is twofold:
- Utilizing SNNs on neuromorphic hardware for processing event streams is low-energy and low-latency [1,2].
- Deployed on neuromorphic hardware, SNNs can process event streams asynchronously [3,4,5], conserving energy when there is no data input, a capability that GPUs, operating synchronously, lack.
Due to the aforementioned reasons, there is a considerable amount of research [6,7,8,9,10] employing Spiking Neural Networks (SNNs) for event data. Although these SNNs are simulated on GPU platforms, the models resulting from such simulations could be deployed on neuromorphic hardware [3,4].
2. The rationale behind our aim for low-energy and fast SNN processing is as follows:
We design the BEEF as a plug-and-play algorithm, intending for the SNN to dynamically slice the event stream without adversely affecting the latency or energy consumption of the main network.
Regarding the comparison of resource consumption between SNNs and ANNs (Table 2), it is crucial to note that our dynamic slicing is performed in real-time, not in an offline manner. By employing a low-energy SNN model as a dynamic event stream slicer, we ensure that the speed does not impede the downstream processing rate and also enhances the overall performance of downstream tasks. This is one of the core motivations of our paper, and we hope it addresses your concerns.
Reference:
[1] Davies M, Srinivasa N, Lin T H, et al. Loihi: A neuromorphic manycore processor with on-chip learning. IEEE Micro 2018.
[2] Akopyan F, Sawada J, Cassidy A, et al. Truenorth: Design and tool flow of a 65 mw 1 million neuron programmable neurosynaptic chip. IEEE TCAD 2015.
[3] Roy A, Nagaraj M, Liyanagedera C M, et al. Live Demonstration: Real-time Event-based Speed Detection using Spiking Neural Networks. In CVPRW 2023.
[4] Viale A, Marchisio A, Martina M, et al. Carsnn: An efficient spiking neural network for event-based autonomous cars on the loihi neuromorphic research processor. In IJCNN 2021.
[5] Yu F, Wu Y, Ma S, et al. Brain-inspired multimodal hybrid neural network for robot place recognition. Science Robotics 2023.
[6] Hagenaars J, Paredes-Vallés F, De Croon G. Self-supervised learning of event-based optical flow with spiking neural networks. NeurIPS 2021.
[7] Yao M, Gao H, Zhao G, et al. Temporal-wise attention spiking neural networks for event streams classification. ICCV 2021.
[8] Zhu L, Wang X, Chang Y, et al. Event-based video reconstruction via potential-assisted spiking neural network. CVPR 2022
[9] Kosta A K, Roy K. Adaptive-spikenet: event-based optical flow estimation using spiking neural networks with learnable neuronal dynamics. ICRA 2023.
[10] Hussaini S, Milford M, Fischer T. Spiking neural networks for visual place recognition via weighted neuronal assignments. RAL 2023.
Weakness 1 (Difference between Event Slicing and Event Representation)
Q1: This paper fails to fully review the topic of this work: event representation. As suggested in [1][2], there are several existing event representation strategies including stacking based on time/event counts, voxel grid, histogram of time surfaces, event spike tensor, and a recent work introduces neural representation [3]. However, this paper only mentions two of them. In addition, the motivation to consider temporal information is similar with event counts integration, which is mentioned by the authors.
A1: Thank you for your recommendations. We understand your concern regarding why our paper does not address a broader range of event representation methods. The reason is that our focus is on event slicing rather than event representation. Converting event stream data into frames/representations involves two steps: Step 1, slicing the event stream into multiple sub-event streams, and Step 2, converting these sub-streams into frames using different event representation methods. There is a significant body of work dedicated to optimizing event representation (Step 2), including the voxel grid, time surface, and other methods you mentioned. However, these do not address the issues arising from fixed slicing (e.g., non-uniform event information in scenarios with changing motion speed). Hence, our paper primarily addresses the first step: event slicing.
After dynamic slicing with BEEF, the events can indeed be transformed into different representational formats using various event representation methods. Although event slicing and representation are different processing steps, either a better slicing method or a better representation method benefits feature extraction with the neural network, thus improving performance. Experiments with different slicing methods and different event representations have been supplemented in Reply 3.
The references you have mentioned will also be included in the updated version of our manuscript, and the confusing description of the motivation in the original article that you pointed out will also be revised shortly!
I hope this explanation addresses your query. Thank you once again!
This paper is about the slicing step in the conversion from events to binned representations that can yield frames for classical image processing. The goal of this paper is to make the event-slicing step adaptive instead of fixed over time as it is now in the majority of the approaches that use slicing/binning/bucketing where events are assigned to slices with slices being constant time length or containing equal numbers of events.
The way it works is that events are fed to a spiking neural network with Leaky Integrate and Fire neurons. The SNN fires more sparsely than the original events. A new slice is created containing all events between the timings of two output spikes.
To control the desired time offset of the slice a membrane potential loss is introduced. Authors give a formal proof for the sufficient conditions. Moreover, a linear assuming loss resolves the dependence between neighboring membrane potentials.
Experiments are conducted on object tracking and gesture/object recognition with impressive results.
Strengths
- The dynamic slicing of events using the output spikes of an SNN.
- The connection between slicing and downstream task expressed in the additional two loss terms determining the hyperparameters of the SNN.
- The theoretical treatment of the sufficient condition of firing at a desired time (given in the appendix).
Weaknesses
- Frame-like inputs to transformers or CNNs, where frames have been derived from events, may be sensitive to slicing. We need a toy experiment to study this hypothesis with a smaller network and different slicing techniques.
- The exposition is really hard to follow. As stated directly after Eq. 4, the slicing is done by grouping together events whose timestamps are between two output spikes of the SNN. Here, an experiment is needed on the statistics of this slicing and why such an approach makes sense.
- Section 4.3.1 has to be elaborated. While the math derivations are sound, it is not clear to the reader why the starting point of the derivations is the desire for the SNN to spike at a specified time. I tried to understand it also through the observations in 4.3.2 but could not.
- The beginner's arena was meant to explain the above but is incomprehensible. What does it mean ``to slice at a specified time step $T^{*}$''?
- It is not clear what purpose the energy computations of the SNN serve when the task will be solved with ultra-consuming GPUs.
- The experimental comparison should be with approaches that are asynchronous end to end like HOTS or HATS or Cannici'19, Perot'20 etc., or approaches like the Event Transformer.
- Table 3: It is not discussed why the transformer tracker performs almost the same or better without BEEF. Why does BEEF not add anything significant when an attention mechanism is used?
- The feedback strategy is learnt during training. I understand that in this sense it is adaptive to the task rather than during inference to the event stream, when the hyperparameters will be fixed.
- It is unclear whether events are treated differently according to their polarity.
- There is a problem with one definition: a symbol described as ``the time of the last spike'' is mentioned but never defined.
- It would be worth listing the latency from event to GPU output for the particular architectures on tracking and recognition. This is much more critical here than the power consumption of the CNN.
Summary: The authors need to explain the slicing method more clearly (possible misreadings are listed above). My main concern is the lack of any experimental analysis or motivation for the particular quite elaborate slicing method. There is no motivation to use an SNN since the slicing is only a minimal energy and latency fraction of a pipeline that uses transformers or regression. There is no comparison with architectures that use other event representations like time surfaces.
Questions
Weaknesses are numbered and should be considered as questions.
Weakness 6 (Comparisons of Different Event Representations)
Q6: The experimental comparison should be with approaches that are asynchronous end to end like HOTS or HATS or Cannici'19, Perot'20 etc. or approaches like the Event Transformer.
A6: Thank you for your suggestion! The methods you mention above are all event representation methods. It is worth noting that our work focuses on the slicing of the event stream rather than on event representation. Event representation refers to the process of event information extraction performed after the event stream has been sliced into sub-event streams, such that the resulting representation meets the neural network's input requirements. Thus, our dynamic slicing process and event representation can be used at the same time; either a better slicing method or a better representation method benefits feature extraction with the neural network, thus improving performance.
To validate the effectiveness of our slicing approach, we supplement the event-based recognition task below. We compare the downstream performance of three different event representation methods (including Event Frame [1], Event Spike Tensor (EST [2]) and Voxel Grid [3]) on the DVSGesture dataset under fixed slicing and dynamic slicing:
| DVSGesture | Event Frame | Event Spike Tensor | Voxel Grid |
|---|---|---|---|
| Fixed Duration | 93.75% | 93.75% | 88.54% |
| Fixed Event Count | 93.06% | 94.79% | 88.19% |
| BEEF(ours) | 94.79% | 95.49% | 89.24% |
The results demonstrate that across different event representation methods, our dynamic slicing approach, BEEF, outperforms fixed slicing methods in downstream tasks. This underscores the efficacy of BEEF in handling event streams.
Thanks for your suggestion! We will cite the event representation methods you mentioned in the revised paper and try to integrate them with BEEF in the future.
Reference:
[1] Maqueda A I, Loquercio A, Gallego G, et al. Event-based vision meets deep learning on steering prediction for self-driving cars. CVPR 2018.
[2] Gehrig D, Loquercio A, Derpanis K G, et al. End-to-end learning of representations for asynchronous event-based data. ICCV 2019.
[3] Zhu A Z, Yuan L, Chaney K, et al. Unsupervised event-based learning of optical flow, depth, and egomotion. CVPR 2019.
Weakness 5 (Energy Computation)
Q5: It is not clear what purpose the energy computations of the SNN serve when the task will be solved with ultra consuming GPUs.
A5: Thank you for your comment. Indeed, you are correct that the implementation of Spiking Neural Networks (SNNs) on GPUs does not currently confer a significant energy advantage. This is largely due to the limitations inherent in current SNN simulation platforms (such as spikingjelly[9] and tonic[10]), which are primarily GPU-based.
However, it is important to note that SNNs demonstrate significant energy efficiency when operated on neuromorphic hardware, such as Loihi [7] and TrueNorth [8]. This advantage is well established and extensively discussed in recent literature [1,2]. The energy calculations in our paper (following [3]) are intended to provide an estimation of the potential energy efficiency of SNNs given future advancements in hardware and algorithms. In addition, comparing the theoretical energy consumption of SNNs with that of traditional Artificial Neural Networks (ANNs) is a commonly adopted practice in this field [4,5,6], aiding the understanding of the energy dynamics of these systems.
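For reference, the estimation commonly adopted in the SNN literature cited above takes the following form; the constants are typical 45 nm CMOS figures and are quoted here only as an illustration, not necessarily the exact numbers behind our Table 2:

$$E_{\text{ANN}} \approx \text{FLOPs} \times E_{\text{MAC}}, \qquad E_{\text{SNN}} \approx \text{SOPs} \times E_{\text{AC}}, \qquad \text{SOPs} = f_r \times T_s \times \text{FLOPs},$$

where $f_r$ is the average firing rate, $T_s$ the number of simulation timesteps, and typically $E_{\text{MAC}} \approx 4.6\,\text{pJ}$ and $E_{\text{AC}} \approx 0.9\,\text{pJ}$.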
Reference:
[1] Yin B, Corradi F, Bohté S M. Accurate online training of dynamical spiking neural networks through Forward Propagation Through Time. Nature Machine Intelligence 2023.
[2] Schuman C D, Kulkarni S R, Parsa M, et al. Opportunities for neuromorphic computing algorithms and applications. Nature Computational Science 2022.
[3] Yao M, Zhao G, Zhang H, et al. Attention spiking neural networks. TPAMI 2023.
[4] Kim S, Park S, Na B, et al. Spiking-yolo: spiking neural network for energy-efficient object detection. AAAI 2020.
[5] Zhou Z, Zhu Y, He C, et al. Spikformer: When spiking neural network meets transformer. ICLR 2023.
[6] Wang Z, Fang Y, Cao J, et al. Masked Spiking Transformer. ICCV 2023.
[7] Davies M, Srinivasa N, Lin T H, et al. Loihi: A neuromorphic manycore processor with on-chip learning. IEEE Micro 2018.
[8] Akopyan F, Sawada J, Cassidy A, et al. Truenorth: Design and tool flow of a 65 mw 1 million neuron programmable neurosynaptic chip. IEEE TCAD 2015.
[9] Fang W, Chen Y, Ding J, et al. SpikingJelly: An open-source machine learning infrastructure platform for spike-based intelligence. Science Advances 2023.
[10] Eshraghian J K, Ward M, Neftci E O, et al. Training spiking neural networks using lessons from deep learning. Proceedings of the IEEE 2023.
Weakness 4 (Explanation of $T^{*}$)
Q4: The beginner's arena was meant to explain the above but is incomprehensible. What does it mean ``to slice at a specified time step $T^{*}$''?
A4: Following up on the explanation in Reply 3, consider the case where the downstream ANN's feedback indicates that the SNN should spike at a specific moment $T^{*}$ to slice the event stream. In this case, our goal is to supervise the SNN to ensure that it spikes precisely at $T^{*}$. The purpose of "The beginner's arena" is to demonstrate that our proposed loss function SPA-CE can effectively guide the SNN to spike at $T^{*}$, in contrast to common loss functions like Cross-Entropy (CE) or Mean Squared Error (MSE), which cannot achieve this.
The term $T^{*}$ used in the text may have caused some confusion. In future revisions of our manuscript, we will standardize this notation to avoid any ambiguity. We hope this clarification helps and apologize for any confusion caused. Thank you for the reminder.
Weakness 3 (Explanation of Our Method)
Q3: 4.3.1 has to be elaborated. While the math derivations are sound, it is not clear to the reader why the starting point of the derivations is the desire for the SNN to spike at a specified time. I tried to understand it also through the observations in 4.3.2 but could not.
A3: Thank you for your query which provides us with an opportunity to clarify our methodology.
Contrary to fixed slicing methods (such as predetermined time intervals or event counts), our approach dynamically determines event stream slicing based on spike occurrences in the SNN, detailed in Figure 1.
In order to supervise the SNN to slice the event stream at the optimal position, we feed the events sliced at the SNN-determined position, as well as the events sliced at its neighboring positions, into the downstream model. The downstream model then returns the loss (e.g., classification loss, tracking loss), and the position corresponding to the minimum loss indicates the optimal slicing point (Eq. 10). Thus, we obtain a position label $T^{*}$, which in turn guides the SNN towards the best slicing strategy through supervised learning.
In summary, when ANN feedback indicates that slicing at position $T^{*}$ can enhance model performance, the SNN discriminator is directed to spike at $T^{*}$. Section 4.3 of our paper delves into formulating the SPA-CE loss function, which guides the SNN to spike at this specified position; a small sketch of the feedback step is given below.
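The sketch below is illustrative only; the names are ours, and `slice_at` and `downstream_loss` are assumed callables rather than functions from our code.

```python
def select_slicing_label(candidate_positions, slice_at, downstream_loss):
    """Pick the candidate slicing position whose event slice yields the lowest
    downstream task loss; this position then serves as the supervision label T*."""
    losses = [downstream_loss(slice_at(pos)) for pos in candidate_positions]
    best_idx = min(range(len(losses)), key=losses.__getitem__)
    return candidate_positions[best_idx]
```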
We hope this response clarifies our approach and methodology. We are committed to improving the manuscript in its revised version for better understanding. Should there be any points needing further clarification, please feel free to reach out.
Weakness 2 (Statistics of Slicing Method)
Q2: The exposition is really hard to follow. As stated directly after eq. 4, the slicing is done by grouping together events whose timestamps are between two output spikes of the SNN. Here, an experiment is needed on the statistics of this slicing and why such an approach makes sense.
A2: To demonstrate the effectiveness of our proposed dynamic slicing method, we provide the following statistical results:
1. Statistics of dynamic slicing (BEEF) vs. fixed slicing.
Symbol description: let $\mathcal{E}$ denote the total event stream, $\{S_i^{\text{BEEF}}\}$ the list of sub-event streams sliced by BEEF, and $\{S_i^{\text{fix}}\}$ the list of sub-event streams produced by the fixed slicing method.
In the tracking task, the average duration of each sub-event stream is 65ms (corresponding to 13 event cells, and the duration of the event stream contained in each event cell is 5ms). The maximum duration of each sub-event stream is 100ms, and the minimum duration is 30ms, while for our comparison of the slicing-by-fixed-time approach, the duration of each sub-event stream is fixed at 75ms. The following are specific statistics:
| Method | Avg Cell Num | Var Cell Num | Avg Duration | Min Duration | 25th Duration | 75th Duration | Max Duration |
|---|---|---|---|---|---|---|---|
| BEEF | 12.99 | 3.96 | ~65ms | 25ms | 50ms | 80ms | 100ms |
| Slice by fixed duration | 15 | 0 | 75ms | // | // | // | // |
We will put the visualization of these statistical results in the appendix of the revised version.
2. Statistics of event density.
We also counted the average density of sub-event streams after fixed slicing and dynamic slicing for comparison.
Symbol description: Each sub-event stream $S_i^{\text{BEEF}}$ or $S_i^{\text{fix}}$ contains several event points $(x, y, t, p)$. We define a count matrix $C$, where $C(x, y)$ represents the number of events at coordinates $(x, y)$. Given a threshold $\tau$, we define the event density of a sub-event stream with respect to $\tau$; the threshold $\tau$ is determined as the $T$-th percentile of the event counts $C(x, y)$.
Density Analysis: We define the density of events as a metric reflecting the amount of event information within sub-event streams. The smaller the fluctuation of the density across sub-streams (the smaller the variance), the more stable the contained event information. The stability of the event density is crucial for scenarios where the distribution of events is not uniform (e.g., scenarios with changing motion speed).
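The count matrix and percentile threshold used in this analysis can be computed as in the sketch below; the exact density formula is omitted here, and the function names are illustrative.

```python
import numpy as np

def count_matrix(xs, ys, height, width):
    """Per-pixel event count C(x, y) for one sub-event stream."""
    C = np.zeros((height, width), dtype=np.int64)
    np.add.at(C, (ys, xs), 1)
    return C

def percentile_threshold(C, T):
    """Threshold tau taken as the T-th percentile of the non-zero event counts."""
    return np.percentile(C[C > 0], T)
```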
We chose $T = 30\%$ and $T = 90\%$ (a lower percentile to observe overall event activity, including less frequent events; a higher one to focus on repetitive or frequent events) to validate the effectiveness of BEEF in dynamically slicing the event stream. Below are the statistics of event density, where the data on the left (left vs. right) are the statistics of BEEF and those on the right are the statistics of fixed slicing. We select 4 classes in the FE108 tracking dataset.
| T=30% | airplane_mul222 | box_hdr | dog | tank_low |
|---|---|---|---|---|
| Mean | 0.9196 vs 0.9074 | 0.9850 vs 0.9850 | 0.9208 vs 0.8984 | 0.9584 vs 0.9575 |
| Var | 0.0151 vs 0.0164 | 0.0038 vs 0.0039 | 0.0144 vs 0.0168 | 0.0096 vs 0.0100 |
| Std | 0.1231 vs 0.1283 | 0.0622 vs 0.0626 | 0.1200 vs 0.1299 | 0.0984 vs 0.1003 |
| T=90% | airplane_mul222 | box_hdr | dog | tank_low |
|---|---|---|---|---|
| Mean | 0.1256 vs 0.1263 | 0.1338 vs 0.1405 | 0.1120 vs 0.1131 | 0.9584 vs 0.9575 |
| Var | 0.0004 vs 0.0004 | 0.0007 vs 0.0053 | 0.0001 vs 0.0001 | 0.0096 vs 0.0100 |
| Std | 0.0215 vs 0.0215 | 0.0267 vs 0.0732 | 0.0073 vs 0.0078 | 0.0984 vs 0.1003 |
The results show that the event stream after BEEF's dynamic slicing has a more stable event density (low variance), which verifies that BEEF has a certain ability to perceive the event information, and ensures that BEEF is robust in different motion scenarios, as shown in Table 3.
Weakness 1 (Slicing Sensitivity)
Q1: Frame-like inputs to transformers or CNNs where frames have been derived from events may be sensitive to slicing. We need a toy experiment to study this hypothesis with a smaller network and different slicing techniques.
A1: Thank you very much for your constructive suggestion! In response, we have conducted a total of 60 experiments with different models to investigate the impact of different slicing techniques and different numbers of slices on downstream task performance, thereby affirming the hypothesis that event streams are sensitive to slicing.
In our experiment, we employed two fixed slicing methods: (1) slicing with a fixed number of events, and (2) slicing with a fixed duration. The column headers in the table below denote the number of resulting event slices. Experimental results are detailed as follows:
| NCaltech101 | Slicing | 2 | 4 | 6 | 8 | 10 | 12 | 14 | 16 | 18 | 20 | 22 | 24 | 26 | 28 | 30 | Mean | Var |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ResNet18 | Fixed Count | 70.96 | 75.26 | 75.39 | 75.30 | 76.09 | 73.95 | 74.09 | 73.80 | 76.40 | 75.39 | 75.45 | 73.60 | 71.94 | 71.01 | 71.17 | 73.98 | 3.33 |
| ResNet18 | Fixed Time | 62.90 | 72.64 | 76.38 | 74.48 | 74.91 | 73.70 | 74.30 | 74.69 | 76.95 | 74.75 | 74.46 | 74.42 | 71.61 | 71.52 | 69.69 | 73.16 | 10.80 |
| ResNet34 | Fixed Count | 72.19 | 75.55 | 76.98 | 78.22 | 77.14 | 77.40 | 76.78 | 76.90 | 78.14 | 77.06 | 76.91 | 74.85 | 74.76 | 76.91 | 73.07 | 76.19 | 2.90 |
| ResNet34 | Fixed Time | 65.42 | 75.92 | 78.29 | 78.20 | 78.48 | 76.22 | 77.76 | 76.57 | 75.94 | 76.80 | 76.61 | 75.91 | 75.11 | 74.76 | 74.19 | 75.74 | 9.15 |
The results indicate significant fluctuations (large variance) in downstream performance depending on the slicing method and the number of slices used. We believe this addition effectively demonstrates the sensitivity of event streams to fixed slicing techniques, confirming our motivation to propose dynamic slicing of event streams. Additionally, the accuracy achieved using the dynamic slicing method (82.54% with ResNet34) surpasses that of any fixed slicing approach (the highest being 78.48%), further substantiating the efficacy of the dynamic method in our study.
Thanks for your suggestion!
Continued from Reply 2.1
3. Statistics of Resulting Slicing Duration.
In order to further verify the stability and effectiveness of the dynamic slicing method, we explore the results of BEEF while changing the number of event cells $N$ in the event recognition task. $N$ indicates that an event stream is divided into $N$ event cells; a larger $N$ implies that the event stream is divided into a more fine-grained event cell sequence that can better represent the raw event stream (as mentioned in Section 4.1).
| $N$ | 15 | 20 | 25 |
|---|---|---|---|
| Avg Cell Num | 2.42 | 3.15 | 4.77 |
| Percentage of Duration | 16.13% | 15.75% | 19.08% |
Calculation Process: Suppose the whole event stream (duration $= T$) is divided into 15 event cells. If the SNN is trained such that each slice contains on average 2.42 event cells, then the sliced sub-event stream contains event data lasting a duration of $2.42/15 \approx 16.13\%$ of $T$.
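The percentages in the table follow directly from the ratio of the average cell number to the total cell number:

$$\frac{2.42}{15} \approx 16.13\%, \qquad \frac{3.15}{20} = 15.75\%, \qquad \frac{4.77}{25} = 19.08\%.$$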
The experimental results show that the percentage of the duration of each sub-event stream relative to the total event stream duration after adaptive slicing is relatively stable, i.e., the fineness of the event cells does not affect the event information contained in each sub-event stream after slicing, which demonstrates the robustness and effectiveness of the dynamic slicing process of BEEF.
Summary
We have taken care to conduct an in-depth statistical analysis of the dynamic slicing process employed by our BEEF algorithm. This analysis, alongside the demonstrated enhancements in downstream task performance as shown in Tables 1 and 3 of our manuscript, reinforces the validity of our approach. We are confident that the slicing mechanism of BEEF, which groups events between two output spikes of the SNN, is both statistically sound and practically effective.
Continued from Reply 12.1.
Summary
This article focuses on dynamic event slicing methods and also proposes a cooperative ANN-SNN paradigm for future deployment on hardware. We are deeply grateful for your advice! We are making every effort to resolve any confusion and will improve sections that could cause misunderstandings in the updated version, such as why we guide the SNN to spike at $T^{*}$ and the difference between event slicing and event representation. If anything is still unclear, we will promptly revise and respond! Thanks!
Reference:
[1] Davies M, Srinivasa N, Lin T H, et al. Loihi: A neuromorphic manycore processor with on-chip learning. IEEE Micro 2018.
[2] Akopyan F, Sawada J, Cassidy A, et al. Truenorth: Design and tool flow of a 65 mw 1 million neuron programmable neurosynaptic chip. IEEE TCAD 2015.
[3] Roy A, Nagaraj M, Liyanagedera C M, et al. Live Demonstration: Real-time Event-based Speed Detection using Spiking Neural Networks. In CVPRW 2023.
[4] Viale A, Marchisio A, Martina M, et al. Carsnn: An efficient spiking neural network for event-based autonomous cars on the loihi neuromorphic research processor. In IJCNN 2021.
[5] Yu F, Wu Y, Ma S, et al. Brain-inspired multimodal hybrid neural network for robot place recognition. Science Robotics 2023.
[6] Hagenaars J, Paredes-Vallés F, De Croon G. Self-supervised learning of event-based optical flow with spiking neural networks. NeurIPS 2021.
[7] Yao M, Gao H, Zhao G, et al. Temporal-wise attention spiking neural networks for event streams classification. ICCV 2021.
[8] Zhu L, Wang X, Chang Y, et al. Event-based video reconstruction via potential-assisted spiking neural network. CVPR 2022
[9] Kosta A K, Roy K. Adaptive-spikenet: event-based optical flow estimation using spiking neural networks with learnable neuronal dynamics. ICRA 2023.
[10] Hussaini S, Milford M, Fischer T. Spiking neural networks for visual place recognition via weighted neuronal assignments. RAL 2023.
Thanks again for all your suggestions and questions! Your valuable suggestions greatly help us to improve the content and quality of our article! Please allow us to revisit and clarify the motivation behind our dynamic event stream slicing algorithm and how we have verified its effectiveness.
1. Motivation for Proposing a Dynamic Event Stream Slicing Algorithm
Let's start by clarifying the process of event-to-frame conversion, which consists of two main steps: Step 1, slice the raw event stream into multiple sub-event streams; and Step 2, convert these sub-event streams into frames using various event representation methods. While many works have focused on optimizing event representation (Step 2) to extract better event information, including time surfaces and EST, they do not address the issues arising from fixed slicing (e.g., non-uniform event information in scenarios with changing motion speed). Although event slicing is a small part of the overall pipeline, it is a critical point: the event stream is very sensitive to slicing, and model performance fluctuates greatly across different slicing methods, as demonstrated by the extensive experiments in Reply 1.
To better address this issue, we introduced the dynamic slicing framework BEEF. Meanwhile, BEEF is guided by downstream task feedback to ensure that the new sub-streams could enhance downstream task performance.
2. Motivation for Using SNN as a Slicing Trigger
The reason why we choose SNN as the event slicing trigger is twofold:
- Utilizing SNNs on neuromorphic hardware for processing event streams is low-energy and low-latency [1,2].
- Deployed on neuromorphic hardware, SNNs can process event streams asynchronously [3,4,5], conserving energy when there is no data input, a capability that GPUs, operating synchronously, lack.
Due to the aforementioned reasons, there is a considerable amount of research [6,7,8,9,10] employing Spiking Neural Networks (SNNs) for event data. Although these SNNs are simulated on GPU platforms, the models resulting from such simulations could be deployed on neuromorphic hardware [3,4].
3. Contribution for Using SNN as a Slicing Trigger
We propose a new cooperative paradigm where SNN acts as an efficient, low-energy data processor to assist the ANN in improving downstream performance. This is a brand-new SNN-ANN cooperation way, paving the way for future event-related implementation on neuromorphic chips.
4. Comparison with Other Event Representations
Although our article focuses on dynamic event stream slicing rather than event representation, we appreciate your suggestion that verifying the effectiveness of BEEF's dynamic slicing with different event representations is necessary. As mentioned earlier, event slicing and event representation are not in conflict, and a fusion of the two may potentially bring further improvements. Experiments with different slicing methods and different event representations have been supplemented in Reply 6.
Weakness 11 (Latency Computation)
Q11: It would be worth listing the latency from event to GPU output for the particular architectures on tracking and recognition. This is much more critical here than the power consumption of the CNN.
A11: I agree; it is of great significance to consider the latency of event processing, especially in practical applications. As documented in Table 2 of our manuscript, the rate achieved by using the SNN to slice the event stream is 111 Hz, which corresponds to processing once every 0.009 seconds. This latency is significantly lower than that of the downstream ANN models in the tracking task, which run at 39 FPS (0.025 s per frame). Consequently, our BEEF model can facilitate real-time dynamic event slicing.
| | Latency | FPS |
|---|---|---|
| SNN | 0.009s per img | 111 Hz |
| ANN | 0.025s per img | 39 Hz |
Weakness 10 (Missing Definition)
Q10: There is a problem with one definition: a symbol described as ``the time of the last spike'' is mentioned but never defined.
A10: My apologies for the confusion caused by the writing error in Section 4.2. The phrase "where denotes the time of the last spike" should indeed be removed as it was not previously defined. Thank you for bringing this to our attention. We will correct this error shortly!
Weakness 9 (Polarity Process)
Q9: It is unclear whether events are treated differently according to their polarity.
A9: We apologize for not discussing the handling of event polarity in the main text. Our approach retains the polarity information of the events: the events of each polarity are accumulated separately (a minimal sketch is given below). Additionally, we have compared our method with other event representation techniques, which are presented in Reply 6. We will clearly articulate these experimental details in the subsequent version of our manuscript.
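A minimal sketch of the polarity-separated accumulation described above, assuming events are given as an (x, y, t, p) array with p in {0, 1}; the helper name and layout are illustrative, not the paper's exact implementation.

```python
# Polarity-separated accumulation: ON and OFF events go into separate channels
# of a 2-channel frame. Assumes an (N, 4) event array (x, y, t, p), p in {0, 1}.
import numpy as np

def accumulate_by_polarity(events: np.ndarray, height: int, width: int):
    frame = np.zeros((2, height, width), dtype=np.float32)
    xs = events[:, 0].astype(int)
    ys = events[:, 1].astype(int)
    ps = events[:, 3].astype(int)          # 0 = OFF, 1 = ON
    np.add.at(frame, (ps, ys, xs), 1.0)    # each polarity accumulated separately
    return frame
```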
Weakness 8 (Learnable Feedback Strategy)
Q8: The feedback strategy is learnt during training. I understand that in this sense it is adaptive to the task rather than during inference to the event stream when the hyperparameters will be fixed.
A8: That's right! The feedback strategy is indeed learned during training. During inference, we therefore use the trained SNN to slice the event stream rather than a fixed set of hyperparameters; the event frames produced by the SNN-triggered slicing are fed directly into the downstream ANN model for testing (a simplified sketch follows).
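A simplified illustration of this inference procedure, not the paper's actual modules: `trained_snn` is a hypothetical callable that returns True when a spike (slice trigger) is emitted for the current event cell, and `downstream_ann` is a hypothetical downstream model.

```python
# Inference-time adaptive slicing: a trained spiking trigger consumes fine event
# cells one by one; when it spikes, the buffered cells are accumulated into a
# frame and passed to the downstream ANN. The callables are hypothetical.
import numpy as np

def adaptive_slicing_inference(event_cells, trained_snn, downstream_ann):
    buffer, outputs = [], []
    for cell in event_cells:                # cell: 2D event-count map of a tiny time bin
        buffer.append(cell)
        if trained_snn(cell):               # spike emitted -> slice boundary
            frame = np.sum(buffer, axis=0)  # accumulate buffered cells into one frame
            outputs.append(downstream_ann(frame))
            buffer = []                     # start the next sub-stream
    return outputs
```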
Weakness 7 (Improvement over baseline)
Q7: Table 3: It is not discussed why the transformer tracker performs almost the same or better without BEEF. Why does BEEF not add anything significant when an attention mechanism is used?
A7: Thanks! In fact, for TransT there are only a few metrics on which BEEF does not outperform the baselines. Here is a specific analysis of the improvements in the TransT results:
| | HDR | | | | LL | | | | FWB | | | | FNB | | | | ALL | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | RSR | OP.50 | OP.75 | RPR | RSR | OP.50 | OP.75 | RPR | RSR | OP.50 | OP.75 | RPR | RSR | OP.50 | OP.75 | RPR | RSR | OP.50 | OP.75 | RPR |
| TransT (1) | 55.9 | 71.0 | 24.6 | 84.5 | 66.8 | 88.9 | 34.3 | 96.5 | 74.1 | 98.6 | 54 | 99.9 | 55.8 | 69.2 | 24.9 | 85.4 | 59.6 | 76.4 | 29 | 88.8 |
| TransT+fixed slice (2) | 51.4 | 67.8 | 11.1 | 81.2 | 63.2 | 80.2 | 28.3 | 89.3 | 41.5 | 28 | 2.5 | 57.7 | 50.6 | 57.9 | 12.7 | 78.9 | 51 | 59 | 12 | 78.8 |
| TransT+BEEF (3) | 57.7 | 75.2 | 28.1 | 82.6 | 70.7 | 93.6 | 42.7 | 99.0 | 74.9 | 97.7 | 61.1 | 98.6 | 58.7 | 75.6 | 29.6 | 84.6 | 62.4 | 81.2 | 34.1 | 88.9 |
| Improvement (3)-(1) | 1.8 | 4.2 | 3.5 | -1.9 | 3.9 | 4.7 | 8.4 | 2.5 | 0.8 | -0.9 | 7.1 | -1.3 | 2.9 | 6.4 | 4.7 | -0.8 | 2.8 | 4.8 | 5.1 | 0.1 |
| Improvement (3)-(2) | 6.3 | 7.4 | 17 | 1.4 | 7.5 | 13.4 | 14.4 | 9.7 | 33.4 | 69.7 | 58.6 | 40.9 | 8.1 | 17.7 | 16.9 | 5.7 | 11.4 | 22.2 | 22.1 | 10.1 |
The results demonstrate that BEEF outperforms the baselines on 16 out of 20 metrics, with a total average improvement of 6.61%. The bolded values in Table 3 of our originally submitted paper indicate the global optimum, which may cause confusion. We will change this in our revised version shortly, thanks!
The paper studies learning event splits using an SNN. The spikes triggered by the SNN are treated as signals for splitting event streams and constructing event frames. The proposed architecture is evaluated on object recognition and single object tracking datasets.
Strengths
- The motivation of the paper is well demonstrated. Fixed event stream slicing methods potentially fail to generalize to different motion scenarios.
- How the paper finds the optimal spike time is interesting.
- The paper shows relative improvements over different baseline methods when using their proposed BEEF framework.
Weaknesses
- The paper claims a fixed event split method fails to generalize. However, the event cell is a discrete 2D representation generated from a fixed event split and is used as the input to the SNN.
- BEEF can be used with ANN-based 3D CNNs/Transformers seamlessly. Event cameras and SNNs are both bio-inspired, but this does not necessarily imply that an SNN is a good fit for event data.
Questions
- Why not experiment with the latest event recognition/single object tracking framework? The latest methods in Tab. 1 and Tab. 3 were published in 2021?
Weakness 3 (Latest Methods)
Q3: Why not experiment with the latest event recognition/single object tracking framework? The latest methods in Tab. 1 and Tab. 3 were published in 2021?
A3: To enhance the credibility and robustness of our results, we have incorporated state-of-the-art models: Swin Transformer (SwinT) and Vision Transformer (ViT), to further validate the efficacy of our BEEF algorithm in event-based recognition tasks:
| Method | Random Slice | Fixed Slice | Ours |
|---|---|---|---|
| SwinT | 88.19 | 89.93 | 91.67(+1.74%) |
| ViT | 87.50 | 85.07 | 88.54(+3.47%) |
Experiment Settings: We choose SwinT-small and ViT-small for comparisons on DVSGesture dataset. Other settings are consistent with the main experiments.
BEEF is designed as a plug-and-play algorithm for dynamic event stream slicing. It is benchmarked against baselines that employ fixed event stream slicing methods. Our approach is versatile and can be applied to any event recognition or single object tracking framework (both classical and the latest).
We appreciate your suggestion and hope that this response adequately addresses your query.
Weakness 2 (Match between SNN and Event)
Q2: BEEF can be used in ANN-based 3D CNN/Transformer seamlessly. Event cameras and SNN are all bio-inspired but do not necessarily imply that SNN is a good fit to event data.
A2: Thank you for your comment! In the original article, we mentioned that ``both SNN and event stream are brain-like inspired, which motivates us to use SNN to process events'', and there is indeed a problem with the logic of this statement. We appreciate you reminding us of this point. We have therefore reorganized the motivation for using an SNN to process events:
- Utilizing SNNs on neuromorphic hardware for processing event streams is low-energy and low-latency [1,2].
- Deployed on neuromorphic hardware, SNNs can process event streams asynchronously [3,4,5], conserving energy when there is no data input—a capability that GPUs, operating synchronously, lack.
Due to the aforementioned reasons, there is a considerable amount of research [6,7,8,9,10] employing Spiking Neural Networks (SNNs) for event data. Although these SNNs are simulated on GPU platforms, the models resulting from such simulations could be deployed on neuromorphic hardware [3,4].
We hope this explanation satisfactorily addresses your concerns and illustrates the rationale behind our methodology.
Reference:
[1] Davies M, Srinivasa N, Lin T H, et al. Loihi: A neuromorphic manycore processor with on-chip learning. IEEE Micro 2018.
[2] Akopyan F, Sawada J, Cassidy A, et al. Truenorth: Design and tool flow of a 65 mw 1 million neuron programmable neurosynaptic chip. IEEE TCAD 2015.
[3] Roy A, Nagaraj M, Liyanagedera C M, et al. Live Demonstration: Real-time Event-based Speed Detection using Spiking Neural Networks. In CVPRW 2023.
[4] Viale A, Marchisio A, Martina M, et al. Carsnn: An efficient spiking neural network for event-based autonomous cars on the loihi neuromorphic research processor. In IJCNN 2021.
[5] Yu F, Wu Y, Ma S, et al. Brain-inspired multimodal hybrid neural network for robot place recognition. Science Robotics 2023.
[6] Hagenaars J, Paredes-Vallés F, De Croon G. Self-supervised learning of event-based optical flow with spiking neural networks. NeurIPS 2021.
[7] Yao M, Gao H, Zhao G, et al. Temporal-wise attention spiking neural networks for event streams classification. ICCV 2021.
[8] Zhu L, Wang X, Chang Y, et al. Event-based video reconstruction via potential-assisted spiking neural network. CVPR 2022.
[9] Kosta A K, Roy K. Adaptive-spikenet: event-based optical flow estimation using spiking neural networks with learnable neuronal dynamics. ICRA 2023.
[10] Hussaini S, Milford M, Fischer T. Spiking neural networks for visual place recognition via weighted neuronal assignments. RAL 2023.
Weakness 1 (Event Cell)
Q1: The paper claims a fixed event split method fails to generalize. However, the event cell is a discrete 2D representation generated from a fixed event split and is used as the input for the SNN.
A1: Thank you for your comment! You are correct in noting that our event cells adopt a 2D format. Ideally, our goal is for the Spiking Neural Network (SNN) to process raw event inputs asynchronously. However, due to the constraint that CNN-based deep learning frameworks can only accept 2D inputs, we also adopted the 2D format, which is a commonly used method for processing events [1,2,3].
In our work, we have endeavored to slice the entire event stream into event cells that are as fine as possible along the timeline, since we want to find optimal and accurate event slices. Each event cell therefore contains very little information, and our SNN processes these event cells and is supervised to slice the event stream adaptively (a minimal binning sketch is given below).
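A hedged sketch of how such fine event cells could be built by accumulating a raw (x, y, t, p) event array over very short time bins; the bin duration and array layout are illustrative assumptions, not the paper's exact settings.

```python
# Build fine event cells: cut the raw stream into very short time bins and
# accumulate each bin into a small 2D map that the SNN consumes step by step.
import numpy as np

def build_event_cells(events: np.ndarray, height: int, width: int,
                      cell_duration_us: float = 1000.0):
    t0, t1 = events[:, 2].min(), events[:, 2].max()
    n_cells = int(np.ceil((t1 - t0) / cell_duration_us)) or 1
    cells = np.zeros((n_cells, height, width), dtype=np.float32)
    idx = np.minimum(((events[:, 2] - t0) // cell_duration_us).astype(int), n_cells - 1)
    np.add.at(cells, (idx, events[:, 1].astype(int), events[:, 0].astype(int)), 1.0)
    return cells    # shape (n_cells, H, W), fed sequentially to the SNN
```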
In the future, we expect to further our research by deploying SNNs on neuromorphic hardware. This would enable the asynchronous processing of event streams using BEEF, potentially overcoming current constraints and enhancing the efficiency of our approach.
Reference:
[1] Zhang J, Yang X, Fu Y, et al. Object tracking by jointly exploiting frame and event domain. ICCV 2021.
[2] Baldwin R W, Liu R, Almatrafi M, et al. Time-ordered recent event (TORE) volumes for event cameras. TPAMI 2022.
[3] Maqueda A I, Loquercio A, Gallego G, et al. Event-based vision meets deep learning on steering prediction for self-driving cars. CVPR 2018.
Thank you for the clarification. I will maintain my rating. Alternative approaches such as representing event data as a graph could potentially address the conflict.
Thanks for your reply. We appreciate your time and effort in reviewing our paper. We understand that it would be easy to confuse our method with a representation method. Current common slicing methods are based on fixed durations or event counts, which can lead to redundancy or a lack of sufficient information; our approach therefore adaptively slices the event stream into sub-streams. Afterwards, the sub-streams can be converted into various event representations, such as frames, graphs, voxels, etc. We choose the frame representation in the ANN part only because it is the most used in the literature. Our adaptive slicing method finds sub-streams with just enough information to avoid redundancy, and it is compatible with any representation method to improve performance. We hope the above explanation resolves the confusion between the slicing method and the representation; more details can be found in ##Response to q3dC (Reply 12.1 Summary 1) or ##Response to YMHz (Reply 1). Moreover, in our paper, we clearly state that event slicing and event representation are different steps (paragraphs 2 & 3 in Section 1) to avoid confusion between these two concepts.
We sincerely thank the reviewers for their thoughtful comments and feedback. We appreciate that all reviewers agreed that the idea of using a spiking neural network (SNN) for dynamic event slicing is interesting and that the theoretical analysis is sound. Below, we address the primary concerns raised by the reviewers:
1. Motivations of Using SNN for Dynamic Event Slicing.
The choice of Spiking Neural Networks (SNNs) for dynamic event slicing is driven by their ability to process event streams asynchronously in a low-energy and low-latency manner when deployed on neuromorphic hardware. For these reasons, substantial work applies SNNs to event data. Although these SNNs are simulated on GPU platforms, the outcomes of these simulations lay the groundwork for their potential deployment on neuromorphic hardware.
2. Distinguishing Between Event Slicing and Event Representation.
Event-to-frame conversion typically entails two key steps: (1) slicing the event stream into multiple sub-event streams (e.g., slicing with a fixed duration or event count), and (2) transforming these sub-streams into frame-like formats using various event representation methods (e.g., voxel grid, time surface, and event spike tensor; a minimal voxel-grid sketch is given below). Our paper primarily addresses the first step, event slicing, aiming to alleviate the issues arising from fixed slicing (e.g., non-uniform events in scenarios with changing motion speed). Although slicing and representation are different processing steps, either a better slicing method or a better representation method benefits feature extraction with the neural network and thus improves performance.
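As an example of step (2), here is a minimal, illustrative voxel-grid conversion of one sub-stream, assuming an (x, y, t, p) event layout and signed polarity; it is a sketch of a common representation choice, not the paper's implementation.

```python
# Convert one sub-stream into a voxel grid with `bins` temporal bins, using
# signed polarity counts. Assumes an (N, 4) event array (x, y, t, p).
import numpy as np

def to_voxel_grid(sub_stream: np.ndarray, height: int, width: int, bins: int = 5):
    xs = sub_stream[:, 0].astype(int)
    ys = sub_stream[:, 1].astype(int)
    ts = sub_stream[:, 2]
    ps = np.where(sub_stream[:, 3] > 0, 1.0, -1.0)          # signed polarity
    t_norm = (ts - ts.min()) / max(float(ts.max() - ts.min()), 1e-9)
    b = np.clip((t_norm * bins).astype(int), 0, bins - 1)   # temporal bin index
    grid = np.zeros((bins, height, width), dtype=np.float32)
    np.add.at(grid, (b, ys, xs), ps)                         # signed counts per bin
    return grid
```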
3. Additional Experiments and Statistical Analysis.
We sincerely appreciate your effort in providing constructive feedback! In response, we have incorporated additional experiments to further demonstrate that our dynamic event slicing method is solid and effective.
- Reviewer 48Pk suggested the need for including the latest experimental models for comparison. In response, we supplemented the classification task with several of the latest models and verified that our approach also delivers improvements.
- Reviewer q3dC argued that there should be experiments proving that event streams are sensitive to slicing. To address this, we conducted more than 60 ablation experiments to show that this sensitivity does exist.
- Reviewer q3dC pointed out the importance of a more comprehensive statistical analysis to demonstrate the effectiveness of our approach. Therefore, we compiled statistical results of dynamic slicing from three perspectives, including event density, duration, and position, reinforcing the soundness and efficacy of our method.
- Reviewers q3dC, YMHz, and erhw recommended comparing our method with various event representation techniques. Thus, we conducted 9 experiments covering 3 event representation methods and 3 different slicing methods, substantiating the superiority of our proposed method.
- Reviewers q3dC, YMHz, and erhw provided suggestions regarding SNN processing latency and GPU deployment. We have therefore provided latency statistics for the SNN on GPU and detailed the distinctions between SNN deployment on GPUs and on neuromorphic hardware.
Thanks for your time and effort in reviewing our paper, your suggestions have greatly helped to enhance the article!
We have uploaded a revised manuscript. To facilitate the reviewers' understanding and identification of the significant modifications made to the paper, all crucial revisions have been highlighted in red within the revised manuscript.
We sincerely appreciate the time and effort everyone has invested in providing valuable suggestions for our work. Your comments have significantly enhanced the quality of our work. We hope our responses have satisfactorily addressed all your queries. If there are any further questions or clarifications needed, please feel free to reach out, and we will respond promptly.
As a gentle reminder, we are nearing the end of the review period. Your timely feedback would be greatly appreciated to ensure the smooth progression of our submission.
Thank you once again for your dedication and support!
Dear reviewers,
The paper received diverging scores. The authors have provided their response to the comments. Could you look through their response and the other reviews and engage in the discussion with the authors? Please consider whether their response changes your assessment of the submission.
Thanks!
AC
This work presents a method for processing event streams through learning-based dynamic slicing, which is achieved with Spiking Neural Networks (SNNs). The SNN is employed as an event trigger to determine the slicing time based on spike generation, such that the event stream can be converted into frames in an adaptive manner. A spiking position-aware loss and a feedback-update training strategy are proposed to promote the learning of such an SNN. The proposed event slicing method is integrated with several event representation methods and is validated on event-based object tracking and recognition tasks. Strengths: This work is well motivated, since event representation with fixed slicing may fail to deal with various motions and potentially fails to generalize to different motion scenarios. The learning-based dynamic slicing method brings performance gains when integrated with several event representation methods on two tasks.
Weaknesses: Several reviewers have concerns about the presentation of this work, which causes much difficulty in understanding the proposed method. There is no comparison between the proposed adaptive slicing with synchronous event representations and asynchronous event representations such as HOTS and HATS. Since the proposed method first slices the entire event stream into very fine event cells, there could be other options for further grouping such event cells other than an SNN. The authors do not give much explanation of the necessity of using an SNN.
Why not a higher score
Several reviewers pointed out that the exposition of the paper was hard to follow. The explanation of the relation between the proposed dynamic event slicing and asynchronous event representation is not clear. The authors also do not give much explanation of the necessity of using an SNN in the dynamic slicing.
Why not a lower score
The recommendation is Reject.
Reject