High Dynamic Range Imaging with Time-Encoding Spike Camera
Summary
Reviews and Discussion
To solve dynamic range limitations of conventional spike cameras, and the high energy consumption of ML spike cameras, the authors propose a time-encoding based spike camera. This camera transforms counts into a time when an overflow is reached. The authors then propose a reconstruction method for recovering traditional imagery from the TE spikes. The method is compared against ML spike cameras and various alternative reconstruction algorithms.
Strengths and Weaknesses
Strengths
- The paper thoroughly describes how a novel time-encoded spike camera might operate.
- The comparison against an ML spike camera is complete and fair, examining two datasets with many different software reconstruction algorithms.
- The ablation study shows why each part of the proposed reconstruction algorithm is useful.
Weaknesses
- Comparison is limited to spike camera technology. Also, the advantages or motivation of using a TE spike camera instead of any of the other methods described in the related work are missing.
- Evaluation is limited to simulation under idealized settings where the only noise is due to intrinsic shot noise and not any other hardware inconsistencies.
- Many figure & table captions are short, making it harder to understand them and what the reader should take away. For instance, Figure 3 is confusing because an arrow from the first box flows into (c), suggesting this is one large flow diagram, but that is not the case.
Questions
- Over what time interval are spikes passed to the reconstruction network? And how fast does the network recover full frame HDR images?
- What is the motivation or advantage of using a spike camera for HDR imagery over the alternatives? For instance, SPADs and event cameras are equally as fast. Why is recovering a full frame necessary for applications such as autonomous driving where the application is something else like obstacle avoidance?
- What is the final dynamic range of the proposed system?
- What is the average bandwidth / energy requirement of the proposed system versus the ML spike camera system?
Limitations
No method limitations are discussed.
- Some spike camera crops look blurry in comparison to the GT image.
- Only grayscale imagery is recovered, but for consumer applications where recovering a full frame is necessary, color is often important.
Final Justification
The authors answered all my concerns. Although the paper is limited to simulated results, the additional experiments in the rebuttal demonstrate some real world promise. The proposed method is comparable to existing methods, but is an exciting direction for spike camera technology to potentially iterate on. Adding the event camera comparison, discussion of alternative camera technology and the noise experiments to the camera ready submission will greatly enhance this work.
Formatting Issues
n/a
Thank you for your helpful comments, summary of our paper, and affirmation of its performance. The questions and our answers are as follows.
Answers to the weaknesses
- Comparison is limited to spike camera technology. Also, the advantages or motivation of using a TE spike camera instead of any of the other methods described in the related work are missing.
Thank you for your valuable comments.
- Conventional RGB cameras enhance dynamic range by increasing the full-well capacity of pixels and the bit depth of analogue-to-digital conversion. However, due to their reliance on synchronous exposure and relatively long exposure times, they are prone to motion blur, making them unsuitable for high-speed scenarios.
- Event cameras, on the other hand, record changes in light intensity with extremely high readout frequencies, enabling excellent adaptation to high-speed scenes and offering a large dynamic range. Nevertheless, they struggle to capture static objects or scenes with large-scale repetitive textures, often requiring the assistance of other kinds of cameras to reconstruct complete environmental textures.
- Spike cameras accumulate photons and generate a spike when the threshold is reached, allowing them to capture both static and dynamic objects simultaneously. Their high readout frequency—up to 40 kHz—also makes them well-suited for high-speed imaging. However, their main limitation lies in the restricted dynamic range. In this work, we aim to improve the dynamic range of spike cameras in high-speed scenarios by modifying their working mechanism.
- We have included a comparison with one of the SOTA Event-RGB hybrid methods, HDRev [1]. This is the best model we have found so far among the available open-source implementations. We retrained it on our own training dataset. Since HDRev generates colour images, we converted them to grayscale for a fair comparison. The results on the HDM-HDR-2014 dataset are shown in the table below. The proposed method achieves the best result.
| Metric | HDRev (event only) | HDRev (RGB only) | HDRev | Ours |
|---|---|---|---|---|
| PSNR- | 13.40 | 14.35 | 22.47 | 30.86 |
| SSIM- | 0.545 | 0.546 | 0.777 | 0.853 |
The results on the Kalantari13 dataset are shown in the table below. The proposed method achieves the best result for the PSNR- metric, and HDRev achieves the best result for the SSIM- metric.
| Metric | HDRev (event only) | HDRev (RGB only) | HDRev | Ours |
|---|---|---|---|---|
| PSNR- | 15.08 | 12.63 | 28.05 | 33.65 |
| SSIM- | 0.773 | 0.684 | 0.972 | 0.943 |
[1]. Yang, Yixin, et al. "Learning event guided high dynamic range video reconstruction." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023.
- Evaluation is limited to simulation under idealized settings where the only noise is due to intrinsic shot noise and not any other hardware inconsistencies.
Thank you for your insightful comments. Our current focus is on the methodological exploration and validation. In future work, we plan to investigate hardware implementation under this novel mechanism.
- Many figure & table captions are short, making it harder to understand them and what the reader should take away. For instance, Figure 3 is confusing because an arrow from the first box flows into (c), suggesting this is one large flow diagram, but that is not the case.
Thank you for your helpful comments. We sincerely apologise for any confusion this may have caused. The arrow from the first box into (c) indicates that the spike stream also serves as input to the Light Intensity-based Refinement module. We will revise both the layout of the figures and their captions for improved clarity.
Answers to the questions
- Over what time interval are spikes passed to the reconstruction network? And how fast does the network recover full frame HDR images?
Thank you for your valuable questions. We divide the continuous 61-frame spike stream into five overlapping segments: (1–21), (11–31), (21–41), (31–51), and (41–61), which are fed into the network. The proposed method infers an image in 0.005 seconds.
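For concreteness, a minimal sketch of this split, assuming the spike stream is stored as a dense NumPy array of shape (T, H, W); the function name is our illustrative assumption:

```python
import numpy as np

# Minimal sketch of the overlapping split described above; the dense
# (T, H, W) layout and the function name are illustrative assumptions.
def split_spike_stream(stream):
    """Split a 61-frame spike stream into five overlapping 21-frame segments."""
    assert stream.shape[0] == 61
    # 0-indexed starts 0, 10, 20, 30, 40 match frames (1-21) ... (41-61).
    return [stream[s:s + 21] for s in range(0, 41, 10)]
```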
- What is the motivation or advantage of using a spike camera for HDR imagery over the alternatives? For instance, SPADs and event cameras are equally as fast. Why is recovering a full frame necessary for applications such as autonomous driving where the application is something else like obstacle avoidance?
Thank you for your insightful questions.
- SPAD sensors are capable of photon counting and are typically designed for extremely low-light environments. However, they struggle to obtain accurate photon counts when the light intensity is high.
- Event cameras, on the other hand, record changes in light intensity and generally offer a high dynamic range. Nevertheless, they struggle to capture static objects or scenes with large-scale repetitive textures, often requiring the assistance of other kinds of cameras to reconstruct complete environmental textures.
- Spike cameras accumulate photons and generate a spike when the threshold is reached, allowing them to capture both static and dynamic objects simultaneously. Their high readout frequency—up to 40 kHz—also makes them well-suited for high-speed imaging. However, their main limitation lies in the restricted dynamic range. In this work, we aim to improve the dynamic range of spike cameras in high-speed scenarios by modifying their working mechanism.
- For obstacle avoidance scenarios, relying solely on event cameras may lead to failure in detecting texture-similar obstacles, such as a wall ahead. Furthermore, effective obstacle avoidance often requires the detection of lane markings to ensure compliance with traffic rules. Thus, recovering a full-intensity frame is often necessary for such tasks.
- What is the final dynamic range of the proposed system?
Thank you for your valuable questions. The dynamic range of the TE spike camera is closely related to the bit depths of both the SFC (Spiking Firing Counter) and the CCC (Clock Cycle Counter). Assuming both the SFC and CCC are configured with an 8-bit depth, and that no significant motion blur occurs within the duration of 10 spike intervals in high-speed scenes, the TE spike camera can theoretically achieve a dynamic range of up to 116 dB ($20\log_{10}(2^{8} \times 10 \times 2^{8}) \approx 116$ dB).
- What is the average bandwidth / energy requirement of the proposed system versus the ML spike camera system?
Thank you for your insightful questions. The TE spike camera outputs either the SFC or the CCC data. When the TE spike camera employs both an SFC and a CCC with a bit depth of $b$, and the ML spike camera uses an SFC with the same bit depth $b$, the TE spike camera can ensure a bandwidth similar to that of the ML spike camera. Compared with the ML spike camera, the TE spike camera achieves an extended dynamic range at the additional cost of recording time information.
Answers to the limitations
- Some spike camera crops look blurry in comparison to the GT image.
Thank you for your valuable comments. There remains potential for further improvement in handling challenging high-speed motion scenarios. We plan to explore this aspect more thoroughly in future work.
- Only grayscale imagery is recovered, but for consumer applications where recovering a full frame is necessary, color is often important.
Thank you for your helpful comments. In theory, this mechanism is also applicable to color spike cameras. Alternatively, color information can be recovered by integrating an RGB camera. Your suggestion provides a valuable direction for further development.
I appreciate the authors' response to my comments. Further clarity or discussion on alternatives to spike camera technology may be required, as I am not convinced by the current arguments.
- In automotive applications, it is still unclear if full frame reconstruction is necessary. For instance, the authors mention lane following. What is the advantage to first recovering a full frame, then performing the additional analysis, instead of end-to-end performing lane departure analysis with the original signal?
- Event cameras. Indeed event cameras can struggle to recover a scene when there is no motion, but the primary application of this paper is high-framerate HDR imaging; if there is no motion, high frame rates are not required. Moreover, it has been shown that event camera noise characteristics [1] or added camera motion [2] can help recover static parts.
- SPADs. I do not believe SPADs are designed solely for low-light environments (although they do excel in those cases). A few papers have already demonstrated SPAD technology for HDR imaging [3, 4]. More recently, Canon announced a SPAD HDR camera for automotive applications with a reported 150 dB dynamic range [5].
Better motivation for both high-speed HDR imaging, and why spike cameras may be uniquely positioned over alternative technology, would strengthen the paper.
[1] Cao et al. Noise2Image: Noise-Enabled Static Scene Recovery for Event Cameras. Optica 2025.
[2] He et al. Microsaccade-inspired event camera for robotics. Science Robotics 2024.
[3] Sharma et al. Transforming Single Photon Camera Images to Color High Dynamic Range Images. 2024.
[4] Liu et al. Single-Photon Camera Guided Extreme Dynamic Range Imaging. WACV 2022.
[5] Canon Inc. Canon develops High Dynamic Range SPAD sensor with potential to detect subjects even in low-light conditions or environments with strong lighting contrasts thanks to unique technology. 2025.
We sincerely appreciate the reviewer’s insightful question and for kindly providing relevant comparative references, which are very helpful for improving our work.
In automotive applications, it is still unclear if full frame reconstruction is necessary. For instance, the authors mention lane following. What is the advantage to first recovering a full frame, then performing the additional analysis, instead of end-to-end performing lane departure analysis with the original signal?
- We fully agree that in some automotive applications, it is indeed feasible and even advantageous to perform end-to-end analysis directly on the original spike signals without full-frame reconstruction.
- We would like to emphasize that the goal of our work is to extend the dynamic range of spike cameras and provide an image reconstruction framework that recovers high-quality intensity frames from spike streams. This reconstruction enables a broad range of vision-based algorithms that have been extensively developed and optimized on intensity frames, such as semantic segmentation, object detection, and visual SLAM, to be applied more easily and directly.
Event cameras. Indeed event cameras can struggle to recover a scene when there is no motion, but the primary application of this paper is high-framerate HDR imaging; if there is no motion, high frame rates are not required. Moreover, it has been shown that event camera noise characteristics [1] or added camera motion [2] can help recover static parts.
- At the initial stage of autonomous driving, the vehicle remains stationary, making it difficult for event cameras to perceive static obstacles. Reference [1] focuses on low- and moderate-brightness regimes (e.g., room light, outdoor sunset), where photon noise, i.e., random fluctuations in photon arrival, is the dominant source of noise. However, in high-brightness environments (e.g., outdoor daylight), leakage noise events become more prevalent, and these are not modeled in their work. Therefore, the performance of such approaches in HDR scenarios remains uncertain.
- Reference [2] proposes an artificial micro-saccade-enhanced event camera that actively senses static scenes using a rotating wedge prism in front of the event camera. While effective in enabling the perception of static objects, this mechanism may face challenges during the initial driving phase, where both static and fast-moving objects are present. Moreover, the additional mechanical components and the resulting complex data structure may introduce higher energy consumption and an increased computational burden.
SPADs. I do not believe SPADs are designed solely for low-light environments (although they do excel in those cases). A few papers have already demonstrated SPAD technology for HDR imaging [3, 4]. More recently, Canon announced a SPAD HDR camera for automotive applications with a reported 150 dB dynamic range [5].
- SPAD sensors estimate the total number of incident photons during an interval (exposure time + dead time) by counting the number of photons arriving within the exposure time. References [3] and [4] implicitly rely on the assumption that the number of photons arriving during the exposure time does not exceed the maximum countable capacity of the SPAD.
- However, as pointed out in reference [5], when the incident photon count surpasses a certain threshold under high-illuminance conditions, conventional SPADs encounter difficulties in separating individual photon arrivals. This limitation leads to image white-out, where bright regions are saturated. Moreover, such sensors are power-intensive, as each photon count consumes energy independently.
- To address these limitations, reference [5] proposes an alternative strategy: instead of counting all photons, the system records the arrival time of the first photon and then estimates the total number of photons expected to arrive within a defined period. This allows for accurate photon-number estimation without explicitly counting each photon, effectively avoiding white-out and reducing power consumption. The core idea of reference [5] is conceptually aligned with our proposed time-encoding mechanism, which also leverages temporal information to robustly estimate light intensity under extreme brightness conditions (a toy sketch follows this list).
- However, the avalanche multiplication process in SPADs not only amplifies the signal but also inherently amplifies the noise, leading to excessive noise levels that can severely compromise the overall imaging quality, particularly under high-brightness or noisy conditions.
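To make this conceptual alignment concrete, a toy sketch (our illustration, not code from [5]): under Poisson photon arrivals, the first-photon arrival time alone already determines the rate, so brightness can be estimated without counting every photon.

```python
import numpy as np

# Toy illustration (ours, not from [5]): with Poisson arrivals at rate r,
# inter-arrival times are exponential, so first-arrival times suffice to
# estimate brightness without counting each photon.
rng = np.random.default_rng(0)
true_rate = 5e6                                     # photons/s in a bright scene
t_first = rng.exponential(1.0 / true_rate, 10_000)  # first-arrival time samples
rate_estimate = 1.0 / t_first.mean()                # E[t_first] = 1 / r
print(f"true {true_rate:.2e} vs estimated {rate_estimate:.2e} photons/s")
```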
We will include more detailed descriptions and comparisons in the manuscript to better clarify our motivation.
This more detailed discussion and comparison with non-spike camera methods is great and will strengthen the paper.
On the topic of real world experiments, both jHPf and YUQi have similar concerns. Can the author comment on how the system may or may not be robust to hardware inaccuracies (such as potential noise sources)? Perhaps an experiment that simulates the TE system with additional noise would help demonstrate the system is practically viable without having a physical prototype.
We sincerely appreciate the reviewer’s insightful question and for kindly providing suggestions, which are very helpful for improving our work.
On the topic of real world experiments, both jHPf and YUQi have similar concerns. Can the author comment on how the system may or may not be robust to hardware inaccuracies (such as potential noise sources)? Perhaps an experiment that simulates the TE system with additional noise would help demonstrate the system is practically viable without having a physical prototype.
Thank you for your valuable comments. The primary sources of noise include photon shot noise, dark current noise in the circuitry, and quantization noise. Quantization noise mainly arises from two aspects: the accumulated photons at the readout moment may not be an integer multiple of the threshold, and the time required to accumulate a given number of spikes may not be an integer multiple of the clock cycle.
In our previous experimental design, we modeled the Poisson shot noise as follows:
We multiplied each pixel’s normalized intensity value (ranging from 0 to 1) by 60,000 to obtain the average photon count per pixel during each readout interval. Then, we used the numpy.random.poisson function to simulate the randomness of photon arrivals according to Poisson statistics.
We also modeled the quantization noise: the residual photons at the readout moment were preserved, and the time to accumulate a certain number of spikes was rounded down to the nearest integer number of clock cycles.
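A hedged sketch of these two steps (shot noise plus spike-count quantization); the threshold value and function name are our illustrative assumptions, not the authors' released code:

```python
import numpy as np

# Sketch of the simulation described above; threshold and names are assumed.
def simulate_readout(intensity, carry, threshold=256, mean_photons=60_000, rng=None):
    """intensity: HxW array of normalized values in [0, 1]; carry: residual photons."""
    if rng is None:
        rng = np.random.default_rng()
    # Poisson shot noise on the photons arriving within one readout interval.
    photons = rng.poisson(intensity * mean_photons) + carry
    # Quantization: only full threshold crossings fire spikes; sub-threshold
    # residual photons are preserved and carried into the next interval.
    spikes = photons // threshold
    residual = photons % threshold
    return spikes, residual
```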
Furthermore, we simulated the dark current noise. We assumed that the number of electrons induced by dark current, $N_{\mathrm{dark}}$, follows a Gaussian distribution with a mean of 400 and a standard deviation of 50. If the normalized intensity at time $t$ is $I(t)$, the total number of generated electrons is computed as
$$N_e(t) = 0.7 \times \mathrm{Poisson}\big(60000 \cdot I(t)\big) + N_{\mathrm{dark}},$$
where 0.7 represents the photoelectric conversion rate. The results on the HDM-HDR-2014 dataset are shown in the table below.
| Metric | Spk2ImgNet_TE | BSF_TE | Ours |
|---|---|---|---|
| PSNR- | 28.43 | 28.57 | 28.63 |
| SSIM- | 0.810 | 0.798 | 0.810 |
The results on the Kalantari13 dataset are shown in the table below.
| Metric | Spk2ImgNet_TE | BSF_TE | Ours |
|---|---|---|---|
| PSNR- | 29.78 | 31.95 | 33.36 |
| SSIM- | 0.929 | 0.931 | 0.936 |
Additionally, we referred to [1], which recorded the actual dark current behavior of spike cameras. The results showed that the time intervals between dark current-induced spikes roughly follow a Gaussian distribution with a mean of 140 and a standard deviation of 50. Based on this observation, we sample the inter-spike interval of the dark current from this distribution and compute the total number of generated electrons accordingly. To ensure a reasonable electron count caused by dark current, we clip the sampled value to a bounded range. The results on the HDM-HDR-2014 dataset are shown in the table below.
| Metric | Spk2ImgNet_TE | BSF_TE | Ours |
|---|---|---|---|
| PSNR- | 25.73 | 25.81 | 25.82 |
| SSIM- | 0.725 | 0.751 | 0.752 |
The results on the Kalantari13 dataset are shown in the table below.
| Metric | Spk2ImgNet_TE | BSF_TE | Ours |
|---|---|---|---|
| PSNR- | 27.81 | 29.95 | 31.16 |
| SSIM- | 0.897 | 0.900 | 0.904 |
The introduction of dark current noise inevitably degrades the reconstruction quality to some extent. Nevertheless, our reconstruction method still achieves the best performance, particularly on the Kalantari13 dataset.
We sincerely appreciate your comment, which has helped make our system design more comprehensive and reasonable.
[1]. Zhao, Junwei, et al. "Spikingsim: A bio-inspired spiking simulator." 2022 IEEE International Symposium on Circuits and Systems (ISCAS). IEEE, 2022.
I appreciate this further analysis. Based on the random seed analysis performed for Reviewer jHPf, it looks like with noise the proposed method may only be comparable with the baseline methods rather than strictly best. Given the Kalantari13 dataset uses a separate neural network for ground truth labeling, the results on HDM-HDR-2014 are more compelling, which have a much smaller gap (i.e. could be within randomness).
I'll increase my score to an accept, conditional on [a] adding the event camera comparison, [b] discussion of alternative camera technology and [c] adding the noise experiments in the camera ready submission.
We're happy to hear that we've managed to alleviate your concerns. Thanks again for the review, and we will add the three points you mentioned.
The paper introduces time-encoding (TE) spike cameras, which address limitations of existing multi-level (ML) spike cameras. Rather than storing the number of spikes that are fired during each readout interval, TE spike cameras record the time required until a set number of spikes have fired. This is beneficial for scenes with high intensity content, where high bit-depth and power are required to count the large number of spikes, because TE spike cameras can encode HDR content with less bit-depth than their ML counterpart. The authors use a neural network to extract features, perform frame alignment, and refine the reconstruction result.
Strengths and Weaknesses
Strengths
- The paper proposes a novel temporal encoding method that offers clear benefits over conventional ML spiking approaches.
- Key claims and network architecture choices are empirically supported. Simulated experiments provide evidence that TE data encodes more information than ML data. The ablation study provides information about the importance of each network component.
Weaknesses
- The experimental results are performed in simulation only.
Questions
- I would like to see more concrete examples given in your discussion of "fast ultra-HDR" scenes to highlight the importance and impact of TE spiking cameras. From the current discussion, it's not immediately apparent which applications this method would be most beneficial for.
- Are there any challenges from a practicality standpoint in building the proposed TE spiking camera? Are there any tradeoffs (or increase in complexity) for building a TE vs. ML spiking camera?
- Representing spike streams as a large volume is inefficient since (presumably) much of the data in the stream will be zeroes (i.e., no spikes triggered). How does this affect network training / inference at scale? Can you run directly on spike data to save memory / computation?
Limitations
yes
Final Justification
I appreciate the author's answers to my questions and also their responses + additional results given to other reviewers. I will maintain my current rating, as I find the proposed approach and simulation experiments sufficient for acceptance.
Formatting Issues
none
Thank you for your summary of our contributions and strengths. Your comments are helpful to our paper, and our answers are as follows.
Answers to the weaknesses
- The experimental results are performed in simulation only.
Thank you for your insightful comments. Our current focus is on the methodological exploration and validation. In future work, we plan to investigate hardware implementation under this novel mechanism.
Answers to the questions
- I would like to see more concrete examples given in your discussion of "fast ultra-HDR" scenes to highlight the importance and impact of TE spiking cameras. From the current discussion, it's not immediately apparent which applications this method would be most beneficial for.
Thank you for your helpful comments. In certain autonomous driving scenarios, high-speed and high-dynamic-range (HDR) conditions may occur—for example, when a vehicle enters or exits tunnels or underpasses, resulting in drastic changes in illumination. In nighttime environments, most regions may be dark, while others, such as those near streetlights, can be extremely bright. Similarly, extreme weather conditions like heavy rainfall can also create challenging HDR scenes. In future work, we plan to collect more datasets to further explore and investigate these scenarios.
- Are there any challenges from a practicality standpoint in building the proposed TE spiking camera? Are there any tradeoffs (or increase in complexity) for building a TE vs. ML spiking camera?
Thank you for your insightful comments. In practical fabrication, the integration of both the SFC and CCC modules increases the complexity of the manufacturing process. Moreover, timing errors introduced by the CCC can significantly affect the performance of the TE spike camera. Compared with the ML spike camera, the TE spike camera achieves an extended dynamic range at the additional cost of recording time information.
- Representing spike streams as a large volume is inefficient since (presumably) much of the data in the stream will be zeroes (i.e., no spikes triggered). How does this affect network training / inference at scale? Can you run directly on spike data to save memory / computation?
Thank you for your helpful question. The spike stream characterises both the number of spikes and their firing times. Therefore, it is feasible to reconstruct images by focusing only on the fired spikes (i.e., excluding the zeros) along with their corresponding timestamps. This can save memory and computation, especially when much of the data in the stream is zero, although it would require modifications to the network architecture and possibly an additional mapping between spike data and firing times (a sparse layout is sketched below). This is an insightful question, and we plan to explore it in our future work.
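As an illustration (our sketch with hypothetical names, not part of the paper's pipeline), a sparse layout could keep only the fired entries of the spike volume:

```python
import numpy as np

# Sketch of a sparse layout: store only fired entries of a (T, H, W) spike
# volume as (t, y, x) coordinates plus their counts, instead of a mostly
# zero dense tensor; names here are hypothetical.
def to_sparse(spike_volume):
    t, y, x = np.nonzero(spike_volume)     # indices of fired entries only
    coords = np.stack([t, y, x], axis=1)   # (N, 3) coordinate list
    counts = spike_volume[t, y, x]         # (N,) spike counts at those entries
    return coords, counts
```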
Spike cameras offer high readout frequency, but imaging HDR scenes using spike cameras remains an open problem. Prior work uses multi-level (ML) spike cameras to denote multiple spikes fired within the time frame of a single readout to increase dynamic range, while the authors claim that counting a large number of spikes can be costly in ultra-HDR scenes. To address this, the authors propose a time-encoding (TE) scheme which records the time it takes for the spike count in every pixel to reach the maximum, thereby further increasing the dynamic range. They also propose a reconstruction method for the proposed TE spike data. Experimental results revealed that reconstructed TE spike data provides superior quality to reconstructed ML spike data.
Strengths and Weaknesses
Strengths: The problem of interest, imaging HDR scenes with spike cameras, is a relatively new topic and worth investigating. The proposed TE spike camera is well motivated and the design is reasonable. According to the experiments, when the same reconstruction method is applied, the proposed TE spike camera consistently offers higher image quality than the existing ML spike camera. The proposed reconstruction method outperforms the other reconstruction methods investigated.
Weaknesses: It is not entirely clear to me how to understand TE spike cameras with reference to ML spike cameras. Do TE cameras provide higher dynamic ranges at the cost of additional resources, or do they present a better use of the circuitry and balance of bit depths? To image the same scene with a fixed dynamic range using the two types of cameras, how can we compare their energy consumption, bandwidth, latency, reconstruction difficulty etc.? In addition, the operating regimes of the two types of spike cameras in terms of dynamic range or bit depth need to be clarified.
The reconstruction section is somewhat disconnected from the proposal of TE spike cameras. It would be better if the authors could offer more insights into the rationale behind the method design.
According to Fig. 4, the visible performance gain is mainly in motion mitigation rather than dynamic range recovery. This does not align with the focus of this paper as introduced in the previous sections. Furthermore, whether the test data aligns with the ultra-HDR theme is not explained. While the TE results present higher scores and fewer blur effects, the advantages of TE spike cameras over ML ones are not fully verified by the results.
The core contribution of this work is the proposal of a new design of spike cameras, but no prototype camera or real-world experiment is presented to verify the effectiveness and practicality of the proposed device.
Questions
Can the authors comment on the operating regimes of the types of spike cameras, i.e., what dynamic range is considered HDR where ML spike cameras are applicable and what is considered ultra-HDR where TE spike cameras are needed?
What is the bit depth and the dynamic range of images in the dataset used? How does it align with the HDR and ultra-HDR arguments?
In the experiment, ML spike cameras use an 8-bit SFC while TE spike cameras use a 5-bit SFC plus a 4-bit CCC. Is this setting fair?
Limitations
The limitation is only briefly mentioned in a single sentence.
Final Justification
The rebuttal has not adequately addressed my concerns, so I am keeping my rating.
Formatting Issues
N/A
Thank you for your summary of our contributions and strengths. Your comments are helpful to our paper, and our answers are as follows.
Answers to the weaknesses
- It is not entirely clear to me how to understand TE spike cameras with reference to ML spike cameras. Do TE cameras provide higher dynamic ranges at the cost of additional resources, or do they present a better use of the circuitry and balance of bit depths? To image the same scene with a fixed dynamic range using the two types of cameras, how can we compare their energy consumption, bandwidth, latency, reconstruction difficulty etc.? In addition, the operating regimes of the two types of spike cameras in terms of dynamic range or bit depth need to be clarified.
Thank you for your helpful comments.
- A spike camera encodes light intensity by accumulating photons and releasing a spike when a predefined threshold is reached. However, since only one spike can be latched during each readout interval and subsequently read out, the spike camera can only indicate whether the accumulated photon count has reached the threshold by the readout time. It cannot distinguish whether the threshold has been reached multiple times within the same interval.
- The multi-level (ML) spike camera addresses this limitation by incorporating a counter to record the number of spikes, enabling it to represent higher light intensities and thereby extend the dynamic range of spike-based imaging.
- Building upon this, the time-encoding (TE) spike camera introduces temporal information by recording the time required for the accumulated spike count to reach a certain value. By selectively reading either the spike count or the time information, the TE spike camera achieves a further enhancement in dynamic range while maintaining a bandwidth comparable to that of the ML spike camera. Of course, this comes at the cost of additional timing circuitry.
- The dynamic range of the ML spike camera is closely related to the bit depth of the SFC (Spiking Firing Counter). Assuming the bit depth of the SFC is 8 and that no significant motion blur occurs within the duration of 10 spike intervals in high-speed scenes, the ML spike camera can theoretically achieve a dynamic range of up to 68 dB ($20\log_{10}(2^{8} \times 10) \approx 68$ dB).
- The dynamic range of the TE spike camera is closely related to the bit depths of both the SFC (Spiking Firing Counter) and the CCC (Clock Cycle Counter). Assuming both the SFC and CCC are configured with an 8-bit depth, and that no significant motion blur occurs within the duration of 10 spike intervals in high-speed scenes, the TE spike camera can theoretically achieve a dynamic range of up to 116 dB ($20\log_{10}(2^{8} \times 10 \times 2^{8}) \approx 116$ dB).
- The reconstruction section is somewhat disconnected from the proposal of TE spike cameras. It would be better if the authors could offer more insights into the rationale behind the method design.
Thank you for your insightful comments. A basic reconstruction result can be obtained directly based on the following equation:
$$\hat{I} = \frac{N_s \cdot \theta}{\Delta t},$$
where $N_s$ denotes the number of fired spikes, $\theta$ is the threshold, and $\Delta t$ represents the spike interval. This is also the principle of the TFI_TE method used for comparison in our experiments.
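A minimal sketch of this interval-based estimate, assuming spike counts and interval durations have already been decoded from the TE stream; the default threshold value and names are illustrative assumptions:

```python
import numpy as np

# Minimal sketch of the interval-based reconstruction above (the TFI_TE
# principle); the default threshold value is an illustrative assumption.
def tfi_te(num_spikes, interval, theta=256.0):
    """Estimate intensity as accumulated charge (spikes x threshold) per unit time."""
    return num_spikes * theta / np.maximum(interval, 1e-8)
```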
To achieve higher-quality reconstruction, we leverage deep learning techniques to effectively exploit the spatiotemporal features embedded in the TE spike stream. Specifically, we first decode the TE spike stream to obtain the number of spikes generated by each pixel at different readout moments. Given that the spike count is proportional to light intensity, its gradient serves as a useful cue for motion alignment. Motivated by this observation, we extract multi-scale gradient features from the TE spike stream and perform temporal alignment using a similarity-based multi-level pyramid network. Finally, an intensity-based refinement module is employed to fuse spatial features and generate the final reconstruction result.
- According to Fig. 4, the visible performance gain is mainly on motion mitigation instead of dynamic range recovery. This does not align with the focus of this paper as introduced in the previous sections. Furthermore, whether the test data aligns with the ultra-HDR theme is not explained. While the TE results present higher scores and fewer blur effects, the advantages of TE spike cameras over ML ones are not fully verified by the results.
Thank you for your helpful comments.
- The reconstructed images preserve details in both bright and dark regions more effectively. Please refer to Figures 5–10 in the supplementary material for more visual comparisons.
- The HDM-HDR-2014 dataset features a dynamic range of up to 108 dB. The Kalantari13 dataset features a dynamic range of up to 97 dB. These two widely used public HDR datasets are typically regarded as benchmark datasets for ultra-HDR scenarios due to their extreme dynamic range. The images in the HDM-HDR-2014 dataset are organised in OpenEXR format. The images in the Kalantari13 dataset are organised in .hdr format.
- The core contribution of this work is the proposal of a new design of spike cameras, but no prototype camera or real-world experiment is presented to verify the effectiveness and practicality of the proposed device.
Thank you for your valuable comments. Our current focus is on the methodological exploration and validation. In future work, we plan to investigate hardware implementation under this novel mechanism.
Answers to the questions
- Can the authors comment on the operating regimes of the types of spike cameras, i.e., what dynamic range is considered HDR where ML spike cameras are applicable and what is considered ultra-HDR where TE spike cameras are needed?
Thank you for your valuable questions.
- Building upon the ML spike camera, the TE spike camera introduces temporal information by recording the time required for the accumulated spike count to reach a certain value. By selectively reading either the spike count or the time information, the TE spike camera achieves a further enhancement in dynamic range while maintaining a bandwidth comparable to that of the ML spike camera. Of course, this comes at the cost of additional timing circuitry. Therefore, the TE spike camera can be viewed as an extension of the ML spike camera with an enhanced representational modality.
- The dynamic range of the ML spike camera is closely related to the bit depth of the SFC (Spiking Firing Counter). Assuming the bit depth of the SFC is 8 and that no significant motion blur occurs within the duration of 10 spike intervals in high-speed scenes, the ML spike camera can theoretically achieve a dynamic range of up to 68 dB ($20\log_{10}(2^{8} \times 10) \approx 68$ dB).
- The dynamic range of the TE spike camera is closely related to the bit depths of both the SFC (Spiking Firing Counter) and the CCC (Clock Cycle Counter). Assuming both the SFC and CCC are configured with an 8-bit depth, and that no significant motion blur occurs within the duration of 10 spike intervals in high-speed scenes, the TE spike camera can theoretically achieve a dynamic range of up to 116 dB ($20\log_{10}(2^{8} \times 10 \times 2^{8}) \approx 116$ dB).
- What is the bit depth and the dynamic range of images in the dataset used? How does it align with the HDR and ultra-HDR arguments?
Thank you for your insightful questions. The HDM-HDR-2014 dataset features a dynamic range of up to 108 dB. The Kalantari13 dataset features a dynamic range of up to 97 dB. These two widely used public HDR datasets are typically regarded as benchmark datasets for ultra-HDR scenarios due to their extreme dynamic range. The images in the HDM-HDR-2014 dataset are organised in OpenEXR format. The images in the Kalantari13 dataset are organised in .hdr format.
- In the experiment, ML spike cameras use an 8-bit SFC while TE spike cameras use a 5-bit SFC plus a 4-bit CCC. Is this setting fair?
Thank you for your helpful question. For the TE spike camera, it would be more intuitive to compare a configuration using an 8-bit SFC along with an 8-bit CCC against the ML spike camera using an 8-bit SFC, since the TE spike camera outputs either the SFC or the CCC data. This design ensures a bandwidth similar to that of the ML spike camera. Our intention in this experimental setting was to reduce the bandwidth consumption of the TE spike camera. We apologise for any confusion this may have caused.
We would like to know if our response has addressed your concerns. If you have any additional feedback, concerns, or suggestions regarding our manuscript or rebuttal, we would greatly appreciate the opportunity to discuss them further and work on improving the manuscript. Thank you again for the time and effort you dedicated to reviewing this work.
Thank you for your reply. I think my concerns on the fairness of the ML and TE spike camera comparison and on the definition of HDR and ultra-HDR are not fully addressed. To fairly compare ML and TE spike cameras, the cost of the additional circuitry for the CCC should be deducted from the SFC circuitry, instead of only fixing the output bandwidth. On dynamic range, many CMOS cameras today already achieve 12- or 14-bit readout in a single capture; 14 bits is already 84 dB ($20\log_{10}(2^{14}) \approx 84$ dB). Claiming the data used in this work to be ultra-HDR is questionable, and the necessity of the proposed TE spike camera is not clearly explained. In addition, there is no prototype camera or real-world experiment presented, which is critical for such a proposal of a new sensor design. I would have to keep my rating.
Spike cameras count incoming photons and emit a binary "spike" whenever a preset threshold is reached. In very bright regions multiple spikes can occur inside one read-out period, but only a single bit is stored, so the dynamic range collapses. The authors propose a time-encoding (TE) spike camera in which the sensor stops counting after every b spikes (an "overflow") and records when that overflow occurred by sampling a high-frequency periodic timing signal. Storing the cycle index instead of every spike both increases dynamic range and reduces the energy cost caused by rapid firing.
For reconstruction, they introduce a learning-based pipeline that (i) extracts multi-scale gradient features from the spike stream, (ii) aligns temporally adjacent streams with a similarity-based pyramid alignment (SPA) module, and (iii) fuses spatial features via a light-intensity refinement (LIR) module.
On two synthetic HDR benchmarks (HDM-HDR-2014 and Kalantari13), the TE camera plus their network improves PSNR by 3 dB and SSIM by ≈0.05 over the same architecture fed with multi-level (ML) spike data, and beats the prior MambaSpike baseline by 2–6 dB. Ablations show each module contributes to the gain.
Strengths and Weaknesses
Strengths
- Recording the moment of overflow is a clever way to encode high luminance while keeping digital bandwidth constant.
- The use of gradient-aware features and pyramid alignment matches the temporal sparsity of spike data.
- Shows consistent improvement over both hardware (ML) and algorithmic baselines on two datasets.
Weaknesses and Questions
- Lack of real-sensor evidence – all experiments are synthetic. No prototype TE chip or mixed-signal simulation is shown, so claimed energy savings remain hypothetical.
- Limited baseline set – comparison omits recent hybrid-sensor HDR methods that fuse events with frames or ML-based spike counters; fairness of gains is unclear.
- Reproducibility gaps – the authors will release code only "after publication" and provide no error bars. Statistical significance is therefore unknown.
- Hardware feasibility – storing a high-frequency CCC per pixel may raise SRAM and routing pressure; silicon overhead is not analysed.
- Generalisation to real HDR noise – Poisson shot noise and hot pixels are not modelled in the synthetic data. Hence robustness might degrade in practice.
Questions
Suggestions
- Provide at least SPICE or Verilog-A level simulations of pixel, CCC, and timing signal to quantify area/energy overhead versus ML spike counter.
- Include newer event/frame fusion HDR networks as baselines; retrain them on the same datasets to ensure fairness.
- Release training scripts and checkpoints at submission time and add confidence intervals (3 seeds) for key metrics.
- Analyse sensitivity to clock frequency choice f and overflow depth b; a plot of dynamic-range gain vs. energy would guide sensor designers.
- Test reconstruction on real spike-camera HDR sequences (e.g., hybrid SpikeCam + LED-flash dataset) to demonstrate domain robustness.
Limitations
see above in weaknesses
Final Justification
I've raised my score to a 4, based on what I perceive as the addressing of my concerns and to achieve consensus. Thanks all.
Formatting Issues
none
Thank you for your summary of our contributions and strengths, and for providing five specific weaknesses together with constructive suggestions.
- Lack of real-sensor evidence – all experiments are synthetic. No prototype TE chip or mixed-signal simulation is shown, so claimed energy savings remain hypothetical. (Provide at least SPICE or Verilog-A level simulations of pixel, CCC, and timing signal to quantify area/energy overhead versus ML spike counter.)
Thank you for your helpful comments. We simulate the proposed method at the software level, and the simulation results show that imaging under this mechanism is of good quality. The bandwidth can be determined by the bit depths of the SFC and the CCC. Assuming that the bit depth of both the SFC and the CCC is $b$, since the TE spike camera outputs either the SFC or the CCC data, the number of bits read out each time is $b+1$ (overflow flag). At present, we cannot effectively simulate the energy cost.
- Limited baseline set – comparison omits recent hybrid-sensor HDR methods that fuse events with frames or ML-based spike counters; fairness of gains is unclear. Include newer event/frame fusion HDR networks as baselines; retrain them on the same datasets to ensure fairness.
Thank you for your insightful comments. We have included a comparison with one of the SOTA Event-RGB hybrid methods, HDRev [1]. This is the best model we have found so far among the available open-source implementations. We retrained it on our own training dataset. Since HDRev generates colour images, we converted them to grayscale for a fair comparison. The results on the HDM-HDR-2014 dataset are shown in the table below. The proposed method achieves the best result.
| Metric | HDRev (event only) | HDRev (RGB only) | HDRev | Ours |
|---|---|---|---|---|
| PSNR- | 13.40 | 14.35 | 22.47 | 30.86 |
| SSIM- | 0.545 | 0.546 | 0.777 | 0.853 |
The results on the Kalantari13 dataset are shown in the table below. The proposed method achieves the best result for the PSNR- metric, and HDRev achieves the best result for the SSIM- metric.
| Metric | HDRev (event only) | HDRev (RGB only) | HDRev | Ours |
|---|---|---|---|---|
| PSNR- | 15.08 | 12.63 | 28.05 | 33.65 |
| SSIM- | 0.773 | 0.684 | 0.972 | 0.943 |
[1]. Yang, Yixin, et al. "Learning event guided high dynamic range video reconstruction." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023.
- Reproducibility gaps – the authors will release code only "after publication" and provide no error bars. Statistical significance is therefore unknown. (Release training scripts and checkpoints at submission time and add confidence intervals (3 seeds) for key metrics.)
Thank you for your valuable comments.
- Due to the limitations of the rebuttal process, we are unable to release our code at this stage. However, we will pay attention to this point in future submissions. Thank you for your reminder.
- Regarding the error bars, we used three different random seeds to generate our training datasets (note that different seeds affect the shot noise) and trained the proposed method, Spk2ImgNet_TE, and BSF_TE. The testing results on the HDM-HDR-2014 dataset are shown in the table below. The proposed method achieves the best result.
| Metric | seed | Spk2ImgNet_TE | BSF_TE | Ours |
|---|---|---|---|---|
| PSNR- | seed1 | 29.64 | 29.64 | 30.86 |
| PSNR- | seed2 | 29.65 | 29.55 | 30.33 |
| PSNR- | seed3 | 29.23 | 30.22 | 30.42 |
| PSNR- | average | 29.51 | 29.80 | 30.54 |
| PSNR- | variance | 0.04 | 0.09 | 0.05 |
| SSIM- | seed1 | 0.830 | 0.826 | 0.853 |
| SSIM- | seed2 | 0.835 | 0.818 | 0.833 |
| SSIM- | seed3 | 0.819 | 0.833 | 0.838 |
| SSIM- | average | 0.828 | 0.826 | 0.841 |
| SSIM- | variance | 0.000045 | 0.000038 | 0.000072 |
The testing results on the Kalantari13 dataset are shown in the table below. The proposed method achieves the best result.
| Metric | seed | Spk2ImgNet_TE | BSF_TE | Ours |
|---|---|---|---|---|
| PSNR- | seed1 | 30.29 | 31.74 | 33.65 |
| PSNR- | seed2 | 30.45 | 32.48 | 32.78 |
| PSNR- | seed3 | 30.04 | 32.03 | 33.79 |
| PSNR- | average | 30.26 | 32.08 | 33.41 |
| PSNR- | variance | 0.03 | 0.09 | 0.20 |
| SSIM- | seed1 | 0.937 | 0.937 | 0.943 |
| SSIM- | seed2 | 0.938 | 0.939 | 0.942 |
| SSIM- | seed3 | 0.935 | 0.938 | 0.943 |
| SSIM- | average | 0.937 | 0.938 | 0.943 |
| SSIM- | variance | - | - | - |
- Hardware feasibility – storing a high-frequency CCC per pixel may raise SRAM and routing pressure; silicon overhead is not analysed. (Analyse sensitivity to clock frequency choice f and overflow depth b; a plot of dynamic-range gain vs. energy would guide sensor designers.)
Thank you for your insightful comments. To support a larger dynamic range, the bit depth of the SFC or CCC needs to be increased, which will lead to a larger bandwidth and a greater energy cost. The focus of this paper is not on hardware implementation; rather, we mainly explore the feasibility of the solution from the perspective of its working mechanism and principle design, and verify it at the software level. The circuit-level overhead and energy consumption analysis will be further explored by colleagues responsible for circuit design in the future.
- Generalisation to real HDR noise – Poisson shot noise and hot pixels are not modelled in the synthetic data. Hence, robustness might degrade in practice. (Test reconstruction on real spike-camera HDR sequences (e.g., hybrid SpikeCam + LED-flash dataset) to demonstrate domain robustness.)
Thank you for your helpful comments. In our simulations, we modeled Poisson shot noise as follows: we scaled each image pixel's normalized intensity (ranging from 0 to 1) by 60,000 to obtain the average photon count per readout interval for the corresponding spike camera pixel, and then applied the function numpy.random.poisson to simulate the Poisson shot noise in the photon arrival process. Therefore, the proposed method exhibits a certain degree of robustness to Poisson shot noise. Currently, we can only conduct experiments on spike streams simulated from real datasets.
Lack of reproducibility of results / application of code to realistic spiking camera data dampens my enthusiasm. I'll stay with my current score, and defer to whether the authors assuage the concerns of the other reviewers.
We sincerely appreciate your thorough review and valuable time invested in evaluating our work. We fully acknowledge your primary concern regarding the absence of physical camera hardware implementation. While we recognize the importance of circuit-level simulation, we clarify that due to time constraints and interdisciplinary barriers (between computer vision and electrical engineering), such implementation remains beyond the current scope of this study. Nevertheless, in direct response to your constructive suggestion, we have:
- Conducted comprehensive comparisons with state-of-the-art event-RGB hybrid HDR methods
- Implemented rigorous validation through triple random-seed experiments to demonstrate method consistency
- Expanded the discussion of alternative camera technologies (as suggested by Reviewer ndvD)
- Enhanced simulation credibility through detailed modeling of dark current noise characteristics (as suggested by Reviewer ndvD)
Regarding code availability, we respectfully maintain that in computer vision research, open-source implementation typically follows peer-reviewed publication. We regret to inform you that policy restrictions currently prevent us from sharing code links during the rebuttal phase.
We remain fully committed to addressing any additional concerns or suggestions you may have. Your expert feedback would be valuable for further improving our methodological framework and presentation quality. Thank you again for your professional engagement with our work.
This paper introduces a novel time-encoding (TE) spike camera design to extend the dynamic range of conventional spike cameras. Instead of counting all spikes in high-light regions, the TE design records the overflow timing relative to a periodic clock signal, thereby reducing energy cost while maintaining bandwidth efficiency. The authors also propose a reconstruction framework combining similarity-based pyramid alignment and light-intensity refinement modules to recover HDR images from TE spike data. Experiments, conducted on synthetic HDR datasets, demonstrate improvements in PSNR and SSIM compared with both multi-level (ML) spike cameras and state-of-the-art baselines.
The reviews were split. Reviewer n2tC and Reviewer ndvD were positive (ratings around accept), appreciating the novelty and potential impact of the TE design despite the simulation-only validation. Reviewer jHPf was initially skeptical due to the lack of hardware validation and limited baselines but, after rebuttal and discussion, raised the score to borderline accept. Reviewer YUQi remained unconvinced, citing fairness of comparison between ML and TE configurations, questionable “ultra-HDR” framing, and the absence of prototype hardware, ultimately recommending rejection. Thus, the scores range from reject to accept, with opinions clearly divergent.
The rebuttal and discussion were active. The authors added comparisons with event-RGB fusion HDR methods, conducted multi-seed experiments, and simulated noise sources including photon shot noise, dark current, and quantization noise. These additions alleviated some concerns for Reviewers jHPf and ndvD, who acknowledged improved robustness and raised their scores. Reviewer n2tC remained supportive throughout, considering simulation sufficient for a methodological contribution. However, Reviewer YUQi maintained strong reservations, arguing that the HDR datasets used may not justify the “ultra-HDR” claim, that fairness of bit-depth allocation was not addressed, and that a hardware prototype is indispensable for a sensor design paper. The final positions thus remained split but trended more positive after rebuttal.
The AC believes that while the absence of hardware implementation is a limitation, the methodological contribution is significant and provides a new perspective on spike camera design. The simulations, ablation studies, and additional comparisons added in the rebuttal strengthen the case that this is a promising direction for future research. The AC considers this paper worth presenting to the community to foster further exploration and discussion, and therefore recommends acceptance.