PaperHub
NeurIPS 2024 · Poster
Overall rating: 5.3/10 from 4 reviewers (scores: 5, 4, 6, 6; min 4, max 6, std 0.8)
Confidence: 2.8 · Correctness: 2.5 · Contribution: 2.3 · Presentation: 2.5

Spatio-Temporal Interactive Learning for Efficient Image Reconstruction of Spiking Cameras

OpenReview · PDF
Submitted: 2024-05-13 · Updated: 2024-11-06

Abstract

Keywords
Spiking camera · Image reconstruction · High-speed motion · Spatio-temporal interaction · Coarse-to-fine

Reviews and Discussion

Review (Rating: 5)

Spiking cameras are sensors that capture high-speed motion by firing continuous binary spike streams asynchronously. Current image reconstruction methods from these spike streams use complex architectures that overlook the collaboration of spatio-temporal information. This paper proposes an efficient spatio-temporal interactive reconstruction network that aligns inter-frame features and filters intra-frame features progressively. The network refines motion fields and target frames scale-by-scale, utilizing a symmetric interactive attention block and a multi-motion field estimation block to enhance interaction capabilities. Experiments on both synthetic and real data show the method's high performance and low model complexity.

Strengths

  1. The tackled problem is relevant to NeurIPS.
  2. The preliminaries in Section 3 are described clearly.
  3. The results show that the proposed approach outperforms related works.
  4. Several ablation studies have been conducted.

Weaknesses

  1. In Section 2, the paragraph describing Spike-Based Image Reconstruction is confusing. In particular, there is confusion when discussing CNNs and SNNs. Please clarify.
  2. In Section 4, it is not clear what the limitations of the related work are or how these issues are addressed in the proposed methodology. It would be useful to describe the proposed methodology in more detail through a detailed top-level algorithm covering all the operations involved.
  3. In Section 5.1, the experiment details are not described in sufficient detail to allow reproducibility. Please describe all the tools used and provide the values of all parameters.

Questions

  1. In Section 1: “Our codes and model weights will be open source.” Is it possible to provide the codes and model weights in the supplementary material for reviewers’ inspection?
  2. The experiments have been conducted only on the SREDS dataset. Can the results be generalized to a larger variety of benchmarks?

Limitations

The limitations have been discussed in Appendix G.

Author Response

Thank you for your precious time and insightful comments. We first list your advice and questions, then give our detailed answers.

W1: A little confusion in Spike-Based Image Reconstruction.

Thank you for your question, which has brought our attention to a potential point of confusion. We would like to clarify two things in Related Works:

(1) “Spike-Based Image Reconstruction” refers to image reconstruction using spiking cameras (a type of neuromorphic camera), which is not related to the spiking neural network (SNN, a type of network architecture).

(2) When introducing deep learning techniques in this part, we first introduce supervised methods, which can then be divided into CNN-based [1, 2] and SNN-based [3]. Then what follows are the self-supervised CNN-based methods [4, 5].

To avoid ambiguity and aid understanding, we will revise the subtitle of Related Works from “Spike-Based Image Reconstruction” to “Spike-to-image Reconstruction” and Lines 109-111 in the manuscript as follows:

Original Lines 109-111: “Furthermore, several self-supervised CNNs have also been developed. However, due to the step-by-step paradigm, the above CNN-based architectures inevitably have higher model complexity, blocking them from mobile and real-time applications.”

Revised Lines: “The above three are supervised methods, whereas several self-supervised CNNs have also been developed. However, due to the step-by-step paradigm, the above CNN-based architectures inevitably have higher model complexity, blocking them from mobile and real-time applications. In contrast, our single-stage model jointly considers temporal motion estimation and spatial intensity recovery, thus facilitating the intrinsic collaboration of spatio-temporal complementary information.”

W2: The limitations of the related work and our improvement.

The limitations of related work:

(1) SSIR [3] (the SNN-based method), though energy-efficient, performs largely below the ideal level. While CNN-based methods [1,2] achieve promising results, the step-by-step paradigm inevitably leads to higher model complexity.

(2) In spike embedding representation, previous methods relied on either explicit or implicit representation alone, and relying on only one of the two makes it impossible to balance interpretability with strong expressiveness.

Our improvement:

(1) Our single-stage architecture targets the previous step-by-step paradigm and addresses temporal motion estimation and spatial intensity recovery jointly, therefore exhibiting excellent performance while maintaining low model complexity (see Table 1 and Figure 1).

(2) We developed a hybrid spike embedding representation (HSER) to offer good certainty and strong expressive capability simultaneously while maintaining low computational cost.

We have provided an overview of the workflow of our method in Section 4.1. Combined with Figure 2, it becomes easy to understand the mechanism and role of each module.
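
For readers following this thread, below is a minimal runnable sketch of how such a single-stage, coarse-to-fine loop can be organized. It is our paraphrase of Section 4.1 and Figure 2, not the actual implementation: every module is a stand-in stub, and names such as `hser`, `estimate_motion`, and `synthesize` are illustrative.

```python
# Illustrative top-level loop of the described single-stage reconstruction.
# All modules are stand-in stubs; only the control flow is the point.
import torch
import torch.nn.functional as F

def hser(spikes):
    """Hybrid spike embedding stub: explicit firing-rate anchor plus a
    crude placeholder for the learned implicit branch."""
    rate = spikes.float().mean(dim=1, keepdim=True)          # TFP-like average
    return torch.cat([rate, torch.tanh(rate)], dim=1)

def build_pyramid(feat, levels):
    feats = [feat]
    for _ in range(levels - 1):
        feats.append(F.avg_pool2d(feats[-1], kernel_size=2))
    return feats[::-1]                                       # coarsest first

def estimate_motion(center, side, prev_flow):
    """Multi-motion-field estimation stub: upsample and refine the coarse flow."""
    b, _, h, w = center.shape
    flow = torch.zeros(b, 2, h, w)
    if prev_flow is not None:
        flow = flow + 2.0 * F.interpolate(prev_flow, size=(h, w),
                                          mode="bilinear", align_corners=False)
    return flow

def warp(feat, flow):
    """Inter-frame feature alignment stub; a real model would use F.grid_sample."""
    return feat

def synthesize(aligned0, center, aligned2):
    """Intra-frame feature filtering stub: fuse aligned context into one frame."""
    return (aligned0 + center + aligned2).mean(dim=1, keepdim=True) / 3.0

def reconstruct(s0, s1, s2, levels=3):
    f0, f1, f2 = (build_pyramid(hser(s), levels) for s in (s0, s1, s2))
    flow10 = flow12 = None
    frame = None
    for l in range(levels):                                  # coarse -> fine
        flow10 = estimate_motion(f1[l], f0[l], flow10)       # refine motion fields
        flow12 = estimate_motion(f1[l], f2[l], flow12)
        a0, a2 = warp(f0[l], flow10), warp(f2[l], flow12)    # align inter-frame features
        frame = synthesize(a0, f1[l], a2)                    # refine target frame
    return frame

# Three sub-streams of 20 binary spike planes each, 64x64, batch of 1.
out = reconstruct(*(torch.randint(0, 2, (1, 20, 64, 64)) for _ in range(3)))
print(out.shape)  # torch.Size([1, 1, 64, 64])
```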

W3: More training details to allow reproducibility.

Due to page limitations, we put more training details and loss functions in Appendices A and B, which are sufficient for reproduction. We share your concern for reproducibility and will move all the training details into the main text.

Q1: Is it possible to provide the codes and model weights in the supplementary material for reviewers’ inspection?

In accordance with NeurIPS requirements, we have sent the anonymous open-source code link to the AC for reviewers’ inspection.

Q2: Experiments only on the SREDS dataset. Can the results be generalized to a larger variety of benchmarks?

Currently, there are two available benchmark datasets for the spike-to-image reconstruction task: SREDS and Spike800.

  • The Spike800 training set was introduced in Spk2ImgNet [1] in 2021. The spatial resolution of images is 400×250. Each scene contains 5 GT images, with 1 GT corresponding to 13 spike planes. There are 240,000 GT images and 13×240,000 = 3,120,000 spike planes.
  • The SREDS training set was introduced in SSIR [3] in 2023. The spatial resolution of images is 1280×720. Each scene contains 24 GT images, with 1 GT corresponding to 20 spike planes. There are 524,160 GT images and 20×524,160 = 10,483,200 spike planes.

Both are synthesized from the REDS dataset [6] using the same simulator as Spk2ImgNet [1]. Considering the higher-resolution images and larger number of training samples, SREDS can be considered an upgraded version of Spike800, so we ultimately chose the latest SREDS dataset.

The commonly used generalization test approach of current spike-to-image reconstruction methods is to train on simulated data and then perform generalization tests on real-world datasets. We ran experiments not only on the synthesized SREDS dataset but also on a variety of real-captured datasets, including “momVidarReal2021”, “recVidarReal2019”, and our newly collected spike data. These cover high-speed camera/object motion scenarios under complex indoor and outdoor conditions and are sufficient to demonstrate the generalization performance of our method. Our method significantly outperforms existing methods in terms of reconstruction accuracy on both synthetic and real datasets.


[1] Spk2ImgNet: Learning to reconstruct dynamic scene from continuous spike stream. In CVPR 2021.

[2] Learning temporal-ordered representation for spike streams based on discrete wavelet transforms. In AAAI 2023.

[3] Spike camera image reconstruction using deep spiking neural networks. In IEEE TCSVT 2023.

[4] Self-supervised mutual learning for dynamic scene reconstruction of spiking camera. In IJCAI 2022.

[5] Self-supervised joint dynamic scene reconstruction and optical flow estimation for spiking camera. In AAAI 2023.

[6] Ntire 2019 challenge on video deblurring and super-resolution: Dataset and study. In CVPRW 2019.

Comment

In light of the other reviews and the authors' rebuttal, I raised my score to 5.

Comment

Thanks for raising your score to borderline accept. Your insightful comments have significantly contributed to refining our manuscript. We will address the aforementioned issues in the final version and release our code and model upon acceptance of the paper. We look forward to sharing our work with the community and believe that it will serve as a useful resource for researchers.

Review (Rating: 4)

This paper proposes a new method for reconstructing images from spiking camera data called STIR (Spatio-Temporal Interactive Reconstruction network).

Strengths

  1. The joint motion-intensity learning architecture is innovative and addresses limitations of previous step-by-step methods.
  2. Faster inference speed compared to many existing methods, with competitive model size and complexity.

Weaknesses

  1. The overall architecture is quite complex, which may make it challenging to implement or adapt.
  2. While the empirical results are strong, there's limited theoretical justification for why this approach works better.

Questions

  1. Can you explain the concept of the "hybrid spike embedding representation" (HSER) module and how it balances interpretability with expressive power?

  2. The paper discusses ablation studies on various components of the model. Which component seemed to have the most significant impact on the model's performance?

  3. The paper mentions that the architecture is flexible and can be scaled. How is this scaling achieved, and what are the trade-offs involved?

  4. Can you explain the significance of using the intermediate feature $F_{t_1}^{L}$ as the query and the temporal contextual features $F_{t_0}^{L}$ and $F_{t_2}^{L}$ as key/value in the attention mechanism?

Limitations

  1. While the method performs well on the tested datasets, its performance on a wider range of real-world scenarios is not fully explored.

  2. Although faster than some existing methods, the approach still requires significant computational resources, which may limit its applicability in some real-time or resource-constrained settings.

Author Response

Thank you for your precious time and insightful comments. We address each concern below.

W1 & Limit2 & Q3: The overall architecture is too complex to implement or adapt. The approach still requires significant computational resources. How is the scaling achieved, and what are the trade-offs involved?

(1) Model Complexity. We understand your concerns, but from a holistic perspective, the model follows a typical encoder-decoder architecture. More importantly, since our motivation is to jointly address temporal motion estimation and spatial intensity recovery in a single-stage manner, we actually simplified the whole reconstruction process. Our method achieves real-time inference on an NVIDIA RTX 3090 GPU with 400×250 inputs (see Fig. 2), which existing CNN-based spike-to-image reconstruction methods cannot. In addition to the FLOPs presented in Table 1 (note that our method has the lowest FLOPs among CNN-based architectures), we further profile the computational complexity on an NVIDIA RTX 3090 GPU with 1280×720 inputs:

| Model | SSIR | Spk2ImgNet | WGSE | Ours |
| --- | --- | --- | --- | --- |
| GPU memory usage (MB) | 10612 | 14956 | 20180 | 9424 |
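
For context, peak-memory figures of this kind can be collected with PyTorch's built-in counters; below is a minimal sketch (not the authors' actual profiling script; `model` and the input tensor are placeholders):

```python
# Sketch of peak-GPU-memory measurement with PyTorch's built-in counters.
import torch

def peak_memory_mb(model, example_input, device="cuda"):
    model = model.to(device).eval()
    example_input = example_input.to(device)
    torch.cuda.reset_peak_memory_stats(device)
    with torch.no_grad():
        model(example_input)          # one forward pass at the target resolution
    return torch.cuda.max_memory_allocated(device) / 1024 ** 2

# e.g., peak_memory_mb(my_model, torch.randn(1, 20, 720, 1280)) for 1280x720 inputs
```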

Hence, we respectfully contend that, compared with other existing methods, our method has already achieved excellent performance with low model complexity and low computational resources. In the future, we will explore ways to enhance the efficiency of our approach on resource-constrained platforms.

(2) How the model adapts. As stated in Section 5.3, the feature pyramid levels, the model capacity (the width multiplier for the feature channels), and the number of multi-motion fields can be adjusted. In the implementation, you only need to change the corresponding parameters to achieve scaling; e.g., with 3 pyramid levels, our method maintains state-of-the-art performance with the smallest number of parameters (0.832M).
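
As a purely hypothetical illustration of these knobs (the names and defaults below are ours, not the released code's API):

```python
# Hypothetical scaling knobs mirroring the three dials named in Section 5.3.
from dataclasses import dataclass

@dataclass
class STIRConfig:
    pyramid_levels: int = 4     # fewer levels -> fewer parameters, faster inference
    width_mult: float = 1.0     # multiplies every feature-channel count
    num_motion_fields: int = 4  # granularity of multi-motion-field estimation

small = STIRConfig(pyramid_levels=3)  # direction of the 0.832M-parameter variant
```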

(3) There is a trade-off between computational complexity and model performance. As Table 3 shows, increasing the model size improves reconstruction quality but leads to more parameters and computations. Our model is easy to scale to fit diverse scenarios. For instance, in scenarios that demand high precision and have abundant computational resources, a larger model is preferred. Conversely, for mobile or real-time applications, a simpler model can be employed.

W2: Limited theoretical justification for why this approach works better.

We appreciate your concern and acknowledge the importance of theoretical analysis. Yet our primary focus was to demonstrate the practical effectiveness of our motivation, i.e., that temporal motion estimation and spatial intensity recovery can be mutually reinforcing. Therefore, in the paper, we prioritized extensive experimental validation to show its feasibility. In particular, the sub-models we use to meet our needs (such as ResNet and cross-attention) are widely tested for their exceptional modeling ability in academic research and have a solid mathematical foundation. We have adapted them to fit our model, and their effectiveness is justified by extensive ablation studies.

Q1: The concept of the "hybrid spike embedding representation" (HSER) module and how it balances interpretability with expressive power?

HSER is composed of two parts: explicit and implicit. Explicit representations using TFP provide an anchor for image reconstruction, since TFP results can be viewed as low-quality reconstructed images with good interpretability. Implicit representations obtained from residual blocks have strong expressive capability to map raw spike data to features. Previous methods relied on only one side without integrating the power of both. Besides, HSER is very lightweight, so it balances interpretability with expressive power in an efficient way. All of this is clearly described in Section 4.2.
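
To make this concrete, here is a sketch of a hybrid embedding in the spirit of HSER. The channel widths, block counts, and fusion by concatenation are our illustrative assumptions, not the paper's actual configuration:

```python
# Sketch of a hybrid spike embedding: explicit TFP branch (windowed firing
# rate) concatenated with implicit features from a small residual stack.
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    def __init__(self, c):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(c, c, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(c, c, 3, padding=1))

    def forward(self, x):
        return x + self.body(x)

class HybridSpikeEmbedding(nn.Module):
    def __init__(self, planes=20, c=32):
        super().__init__()
        self.implicit = nn.Sequential(
            nn.Conv2d(planes, c, 3, padding=1), ResBlock(c), ResBlock(c))

    def forward(self, spikes):                           # (B, planes, H, W), binary
        tfp = spikes.float().mean(dim=1, keepdim=True)   # explicit, interpretable anchor
        return torch.cat([tfp, self.implicit(spikes.float())], dim=1)

emb = HybridSpikeEmbedding()(torch.randint(0, 2, (1, 20, 64, 64)))
print(emb.shape)  # torch.Size([1, 33, 64, 64])
```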

Q2: Which component seemed to have the most significant impact on the model’s performance?

Ablation studies have shown that each module improves reconstruction quality in its own right and is equally important. Yet Table 3 (b) underscores the foundational role of synthesis-based intra-feature filtering since it reconstructs the intermediate intensity frame. Rather than being assembled haphazardly, all modules are designed with an overarching purpose: to address temporal and spatial information simultaneously in an interactive and joint manner. Under the guidance of this idea, we proposed a single-stage architecture and multiple customized modules, all of which synergistically enhanced the performance to the state-of-the-art.

Q4: The significance of using $F_{t_1}^{L}$ as the query and using $F_{t_0}^{L}$ and $F_{t_2}^{L}$ as key/value in the attention mechanism.

The center $F_{t_1}^{L}$ corresponds to the final reconstruction objective, while the sides $F_{t_0}^{L}$ and $F_{t_2}^{L}$ serve as auxiliary components. In doing so, we followed the standard formulation of cross-attention and designed a symmetric interaction strategy to collaboratively incorporate contextual information from both sides into the center, thus improving the overall performance as demonstrated in Table 3(c).
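
A minimal sketch of such a symmetric interaction, assuming standard multi-head cross-attention with the center as query and each temporal side as key/value. The shared attention weights, the fusion step, and all dimensions are our illustrative choices, not the paper's exact block:

```python
# Sketch of symmetric interactive attention: the center feature queries
# both temporal sides, and the two attended results are fused back in.
import torch
import torch.nn as nn

class SymmetricInteractiveAttention(nn.Module):
    def __init__(self, dim=32, heads=4):
        super().__init__()
        # one attention module shared by both sides keeps the interaction symmetric
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, f_t1, f_t0, f_t2):           # (B, N, dim) token sequences
        past, _ = self.attn(f_t1, f_t0, f_t0)      # center attends to t0
        future, _ = self.attn(f_t1, f_t2, f_t2)    # center attends to t2
        return f_t1 + self.fuse(torch.cat([past, future], dim=-1))

b, n, d = 1, 256, 32
f0, f1, f2 = (torch.randn(b, n, d) for _ in range(3))
print(SymmetricInteractiveAttention()(f1, f0, f2).shape)  # torch.Size([1, 256, 32])
```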

Limit1: The performance on a wider range of real-world scenarios.

To the best of our knowledge, we have covered all the publicly available real-world datasets for testing, i.e., “momVidarReal2021” and “recVidarReal2019”. (Note: You might find some other real-world datasets, but their scenarios are either already covered by these two datasets or involve motion that is too slow to effectively demonstrate the model’s performance.) It is also worth mentioning that we tested it on our real-captured spike data, which exhibits excellent results (see Figs. 9 and 10).

Comment

Thanks for your rebuttal, and thank you for showing the computation budget of this work. The paper may need some discussion of why this approach works better, since ResNet and cross-attention are widely applied in state-of-the-art models. If computational efficiency is a strength, I recommend including a comparison among existing methods.

Comment

Thank you very much for your valuable suggestions. We also believe it is important to provide additional discussion on why our method outperforms others, as this further emphasizes our innovation and contributions.

Therefore, we plan to include a Discussion section in the final version as follows:

Discussion

The primary innovation of our single-stage architecture lies in the interactive and joint perspective to handle temporal and spatial information simultaneously. It is this holistic approach that sets our method apart and yields superior performance, as opposed to depending solely on high-performing individual components. Ablation studies in Table 2 and Table 3(c) demonstrate that even when ResNet or Cross Attention is removed or replaced, our model still outperforms existing methods, which further underscores the robustness and effectiveness of our overall design. Rather than being assembled haphazardly, all modules are designed with an overarching purpose and synergistically enhance the performance to the state-of-the-art.

Moreover, we will add a new column for GPU memory usage in Table 1, alongside parameters and FLOPs, to further demonstrate the computational efficiency of our method.

If you have any further comments and questions, please let us know and we are glad to write a follow-up response. Thank you again!

Comment

Thanks for the authors' reply. I will keep my score because there is still a lack of theoretical reasoning to explain the novelty and contributions.

Comment

Thanks for your feedback. We would like to take this opportunity to address your concern and further clarify our innovation and contributions.

Existing learning-based spike-to-image reconstruction methods [1,2] predominantly rely on a two-stage architecture that sequentially cascades temporal motion estimation and spatial intensity recovery. The two-stage design, upon closer examination, also lacks rigorous theoretical justification for its effectiveness. This absence of theoretical analysis of network architectures is a common issue in current methods; however, it does not detract from the empirical results that consistently demonstrate the effectiveness of these approaches. In future work, we plan to incorporate the principles of the spiking camera imaging and leverage interpretability theories in machine learning so as to explore stronger theoretical foundations. If you have any insights or suggestions for improving the theoretical analysis of existing methods, we would be very grateful to hear them.

Our key contribution lies in the paradigm shift from two-stage to single-stage. We recognize that motion estimation and intensity recovery are inherently a "chicken-and-egg" problem—more accurate motion estimation facilitates better image recovery, and vice versa. To address this, we propose to integrate these two independent steps through a joint interactive learning approach. This not only significantly improves reconstruction accuracy (PSNR: 38.79dB vs 37.44dB) but also demonstrates substantial advantages in computational complexity (FLOPs: 0.42T vs 3.93T) and memory usage (Memory: 9424MB vs 20180MB), all compared to WGSE [2].

While ResNet and cross-attention have been modified and integrated into the model, they are not the core contributions of our work, as demonstrated in Tables 2 and 3(c), where our model still outperforms existing methods when they are removed or replaced.

Once again, thank you for your review. We would be grateful if you could consider raising the score accordingly, as we believe in the value of our work and that the paradigm shift will introduce new perspectives and prompt a reevaluation of existing approaches in the community. We will address the aforementioned issues in the final version and release our code and model upon acceptance of the paper.


[1] Spk2ImgNet: Learning to reconstruct dynamic scene from continuous spike stream. In CVPR 2021.

[2] Learning temporal-ordered representation for spike streams based on discrete wavelet transforms. In AAAI 2023.

Review (Rating: 6)

In this paper, authors propose a novel method for reconstructing images from spiking camera representations. The approach involves constructing a spiking embedding representation followed by a complex network of sub-networks, the importance of each respective block being evaluated through ablation studies. Notably, the proposed model achieves state-of-the-art results with relatively few parameters while providing detailed explanations of the intricate methodology (available in supplementary material).

Strengths

The main contribution of this paper lies in its ability to achieve exceptional performance on the task at hand. The importance of each sub-network within the complex network is accurately evaluated through ablation studies, demonstrating the effectiveness of the proposed approach. Furthermore, the detailed explanation of the methodology provided in supplementary material enhances our understanding of the intricate process.

Weaknesses

One limitation of this paper is that it relies on previous models such as ResNet or transformers without providing a mathematical justification for these choices beyond their ability to efficiently solve the task at hand. This may hinder the reproducibility and interpretability of the model, which are crucial aspects in machine learning research.

Questions

It would be beneficial to investigate whether this method can be directly applied to more common event-based cameras found in the market, given analogies between spiking and traditional event-based cameras. This could involve evaluating the proposed method on different benchmarks using data obtained from these cameras.

Limitations

While the model achieves impressive results for the problem statement, it is challenging to determine which information contributes most to success. Further investigation of this point can be done by applying the methodology to various sources of inputs and observing how the model adapts (e.g., autonomous driving vs drone-taken scenes). This will provide valuable insights into the effectiveness of the proposed approach in different contexts.

Author Response

Thank you very much for your precious time and recognition of our work. We first list your advice and questions, then give our detailed answers.

W1: Using previous models without mathematical justification.

(1) Mathematical justification. We understand your concerns about the mathematical foundation of a machine learning model. However, rather than merely copying existing models or relying on them to achieve state-of-the-art performance, we chose them by matching their functionalities with our needs:

  • ResNet excels at robust feature extraction by allowing gradients to flow more easily during backpropagation, and it has been tested in all sorts of tasks for its exceptional modeling ability. Its modularity and flexibility also allow us to integrate it easily into the spike embedding representation module. Combined with explicit representations (TFP), we designed a hybrid spike embedding representation (HSER). After trying different methods, empirical validation in Table 2 not only demonstrated the effectiveness of using ResNet compared with Multi-dilated [1] and HiST [2] representations, but also the effectiveness of the proposed hybrid scheme.

  • Cross-attention improves the model's ability to understand and utilize contextual relationships and facilitates the alignment of features between sequences. In our setting (Section 3.2), we use three non-overlapping spike sub-streams $S_{t_0}^{N}$, $S_{t_1}^{N}$, and $S_{t_2}^{N}$ to reconstruct the intermediate intensity frame $I_{t_1}$. The center is the reconstruction objective, while the sides serve as auxiliary components. In order to better incorporate contextual information from $S_{t_0}^{N}$ and $S_{t_2}^{N}$ into the intermediate time $t_1$, we adopted the idea of cross-attention and presented a symmetric interactive attention block. This symmetric design collaboratively enhances the bilateral correlation between the intermediate feature $F_{t_1}^{L}$ and the temporal contextual features $F_{t_0}^{L}, F_{t_2}^{L}$, and also injects prior motion-intensity guidance into the subsequent interactive decoder, which boosts the model performance as shown in Table 3(c).

(2) Reproducibility. The model structure and experiment details are clearly illustrated and described in the paper, which facilitates reproduction. Moreover, we have sent our anonymous open-source code link to the AC for reviewers’ inspection as requested by NeurIPS.

(3) Interpretability. We have justified above the functionalities of previous models, the needs of our method, and the modifications we make to match the two, which may help in understanding the designs intuitively.

Q1: It would be beneficial to explore whether this method can be applied to event-based cameras.

Thanks a lot for your insightful idea. We have demonstrated in Table 4 that our HSER can better adapt the event-to-image reconstruction model to spike-to-image reconstruction, which is a good indication that the key reconstruction module is transferable by adjusting the frontmost embedding representation. Moreover, the core contribution of this paper lies in the idea of spatial-temporal joint learning. Applying this design philosophy to event-based reconstruction holds significant promise, given analogies between spiking and event cameras.

However, reconstruction for event cameras might pose more challenges and needs further investigation, given the different working mechanisms of event cameras and spiking cameras. Event cameras utilize a differential sampling approach and record changes in light intensity, whereas spiking cameras follow an integral sampling method and thus preserve the absolute value of light intensity. Though this is a slight digression from our current work, we look forward to developing an input-agnostic image reconstruction method that unifies the input types of neuromorphic cameras in the future.
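
To make the contrast concrete, here is a toy, idealized per-pixel model of the two sampling mechanisms (noise-free and entirely illustrative; the thresholds and input signal are arbitrary):

```python
# Toy per-pixel contrast: events fire on log-intensity *changes*;
# spikes come from integrating absolute intensity up to a threshold.
import math

def event_camera(intensities, contrast=0.2):
    """Emit (+1/-1) events whenever log intensity drifts by `contrast`."""
    ref, events = math.log(intensities[0]), []
    for t, i in enumerate(intensities[1:], start=1):
        while abs(math.log(i) - ref) >= contrast:
            polarity = 1 if math.log(i) > ref else -1
            ref += polarity * contrast
            events.append((t, polarity))
    return events

def spiking_camera(intensities, threshold=2.0):
    """Integrate-and-fire: accumulate intensity, spike and reset at threshold."""
    acc, spikes = 0.0, []
    for t, i in enumerate(intensities):
        acc += i
        if acc >= threshold:
            spikes.append(t)
            acc -= threshold
    return spikes

signal = [0.5, 0.5, 1.5, 1.5, 0.7, 0.7]
print(event_camera(signal))    # events only where the intensity changes
print(spiking_camera(signal))  # spike rate tracks the absolute intensity
```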

Limit1: It is challenging to determine which information contributes most to success. Further investigation can be done by applying the methodology to various sources of input.

The key contribution is that we adopted an interactive and joint perspective to address temporal and spatial information simultaneously. It was under the guidance of this idea that we proposed a single-stage architecture and multiple customized modules, all of which synergistically enhanced the performance to the state-of-the-art. Among them, synthesis-based intra-feature filtering acts as the foundation since it is devoted to reconstructing the intermediate intensity frame (as shown in Table 3 (b)), indicating that “spatial” information is essential for intermediate frame reconstruction. In contrast, other modules improve reconstruction quality to varying degrees, e.g., warping-based inter-frame feature alignment helps to aggregate contextual “temporal” information to achieve higher quality reconstruction. So it is crucial to note that all modules are designed with an overarching purpose instead of being assembled haphazardly.

As for further investigation, we greatly appreciate your insight. At present, spike datasets for more scenarios like autonomous driving and drone-taken scenes are not available. In the future, we will take these scenarios into consideration and validate the model's effectiveness and generalization across different contexts.


[1] Unsupervised optical flow estimation with dynamic timing representation for spike camera. In NeurIPS 2023.

[2] Optical flow for spike camera with hierarchical spatial-temporal spike fusion. In AAAI 2024.

Comment

I have read the other reviewers' comments and your rebuttals, and appreciate the effort made to clarify your work. I have raised my score accordingly to 6.

Comment

We are genuinely grateful for your encouraging feedback and the corresponding score increase. Your constructive insights have not only contributed to refining our manuscript but also provided us with valuable direction for our future research. We deeply appreciate your thoughtful consideration and thank you for your valuable time and effort.

Review (Rating: 6)

This paper proposes a new efficient spatio-temporal interactive reconstruction network that enhances image reconstruction by jointly optimizing inter-frame feature alignment and intra-frame feature filtering in a coarse-to-fine approach. The network leverages a hybrid spike embedding representation and introduces novel components like a symmetric interactive attention block and a multi-motion field estimation block to refine the motion fields and target frames progressively. Tested on both synthetic and real-world data, this approach significantly outperforms existing methods.

Strengths

  1. Clarity: The paper is clearly written and easy to read.

  2. Comprehensive experiments: The author conducts extensive experiments and ablation studies to demonstrate the proposed method's effectiveness.

Weaknesses

  1. Generalization: the paper lacks verification of generalization ability under unknown or broader conditions.

Questions

  1. How does the model handle image reconstruction tasks under extreme motion or lighting conditions?

Limitations

Yes

Author Response

Thank you very much for your precious time and recognition of our work. We first list your advice and questions, then give our detailed answers.

W1: Generalization ability under unknown or broader conditions.

The commonly used generalization test approach of current spike-to-image reconstruction methods is to train on simulated data and then perform generalization tests on real-world datasets. We adopted this approach as well, but tested it on a wider range of real-world data, including “momVidarReal2021”, “recVidarReal2019” (also used in [1, 2]), and our real-captured data, which covers high-speed camera/object motion scenarios under complex indoor and outdoor conditions and is sufficient to demonstrate the generalization performance of our method.

As for the more unknown or broader conditions, they remain to be further explored. But we will consider involving more diverse scenarios and building datasets under more challenging conditions to tap into the potential of our models in future work. Thanks for your advice.

Q1: How does the model handle image reconstruction tasks under extreme motion or lighting conditions?

With a sampling rate of 40,000 Hz, the spiking camera is inherently capable of handling ultra-high-speed motion effectively. Even in scenarios that exceed the response speed of the human eye, our model still shows excellent performance. As illustrated in Figure 10 in the Appendix, our method successfully reconstructed the instantaneous process of a water balloon bursting in great detail.

In the Limitations section, we have discussed model performance in extremely low-light scenarios. Empirical experiments have shown that the limited accumulated light intensity often leads to darker images and increased noise. However, this limitation is common to current spike-to-image reconstruction methods [3, 4, 5] and can be further explored in our future work.


[1] Capture the moment: High-speed imaging with spiking cameras through short-term plasticity. In IEEE TPAMI 2023.

[2] Learning temporal-ordered representation for spike streams based on discrete wavelet transforms. In AAAI 2023.

[3] Spk2ImgNet: Learning to reconstruct dynamic scene from continuous spike stream. In CVPR 2021.

[4] Learning temporal-ordered representation for spike streams based on discrete wavelet transforms. In AAAI 2023.

[5] Spike camera image reconstruction using deep spiking neural networks. In IEEE TCSVT 2023.

Comment

We sincerely appreciate your precious time and review. As the deadline for finalizing the reviews approaches, we kindly want to follow up to see whether our previous response clarified your concerns and if there are any further comments. Your insights are invaluable to us, and we would be grateful for your feedback. Thank you once again for your dedication and support.

Comment

Dear Reviewers,

Some reviewers had requested that the authors provide source code for inspection. The authors have provided a source code link: https://privatebin.net/?92502c619a5e2cd1#KQatXxUnLjRFWkAjWVuGCSqpcg1qwp36xXtsbwjy5fo

You can download the code from the attachment using the password: STIR5954

Final Decision

This paper proposes a new method for reconstructing images from spiking camera data, called STIR (Spatio-Temporal Interactive Reconstruction network).

In general, this was a borderline paper. Two of the reviewers were initially leaning towards rejecting the paper. After the rebuttal, one of the reviewers was sufficiently satisfied by the author responses to raise the rating to a borderline accept.

Among the drawbacks of the paper, as indicated by the reviewers, is that this is quite an expensive architecture, which might limit its applicability in various problems. However, the authors argue that it is feasible to reduce the number of layers if someone wants to make it less expensive. It is not clear how the quality of the generated images would be affected if the number of layers were reduced.

Another concern expressed by the reviewers is that it is not clear why this approach performs so well in terms of the quality of the generated images compared to state-of-the-art models, since state-of-the-art models also use ResNet and cross-attention. The authors have indicated that they believe this is due to the fusion of motion and intensity data, and they will include some extra discussion on this in the paper. The authors indicated that they will release all their source code, making the work easily reproducible and making it feasible for others to study and understand the algorithm.

After reading the paper carefully, I agree for the most part that it is well written, and the quality of the results seems superior to the other methods presented, both qualitatively and quantitatively. The algorithm is not terribly novel, and I agree that it is quite expensive, limiting its applicability for various problems. Nevertheless, I felt that the positives outweighed the negatives, so I leaned towards accepting the paper.

The authors need to make sure that their source code is released and that it includes detailed instructions on how to reproduce their results.