Event-based HDR Structured Light
We develop the first HDR 3D measurement framework for event-based structured light systems, enabling accurate 3D reconstruction even under extreme conditions.
Abstract
Reviews and Discussion
This paper proposes an event-based structured light (SL) framework for high dynamic range 3D measurement. The system introduces a multi-contrast speckle coding strategy and a confidence-driven stereo matching scheme using an energy-guided confidence estimation module and a confidence propagation volume. The authors also present a custom event-based SL simulator, a synthetic HDR dataset, and a small real-world dataset for evaluation. Experiments on both synthetic and real-world scenes demonstrate improvements over several traditional and deep stereo matching baselines.
Strengths and Weaknesses
Strengths
- HDR 3D sensing under extreme lighting and reflectance variation is a well-motivated and practical problem.
- The combination of multi-contrast structured light coding with confidence-driven stereo matching is thoughtfully engineered.
- The paper provides comprehensive synthetic and real-world experiments, along with an ablation study.
Weaknesses
- While the paper claims to propose the first HDR 3D event-based structured light system with confidence-driven stereo matching, the multi-contrast coding concept and the cost-volume stereo matching improvements seem to be incremental extensions of other methods: multi-contrast HDR coding appears to be a parameterization of contrast variation over a set of binary speckle projections, a technique seen in prior structured light HDR work (e.g., [7], [8], [9]). If the novelty comes primarily from combining these techniques in this specific application, it could be seen as integration rather than a fundamentally new algorithmic idea.
- Some sections, particularly Section 3.1 on multi-contrast HDR coding, are dense with notation and engineering descriptions without much intuitive explanation or diagrams to aid understanding.
- The authors openly acknowledge (Section 5) that their method does not address occlusions at all — a major shortcoming for a practical 3D reconstruction pipeline. Since occlusions are intrinsic in real-world HDR scenes, any method lacking a strategy to deal with them limits its robustness and applicability, potentially making the system unreliable for complex geometries.
- Although the authors collected real-world data, it is noted in Section 4.5 that the improvements are less pronounced than on synthetic data and the real-world dataset is relatively small (only 15 scenes).
Questions
- Could you clarify in more detail how your multi-contrast HDR coding strategy and confidence-driven stereo matching fundamentally differ from prior event-based structured light systems such as [7], [8], and [9]? Several elements appear conceptually incremental — what makes your approach substantially novel from an algorithmic standpoint?
- Given the explicit limitation regarding occlusions, did you consider incorporating an occlusion detection mechanism or confidence-based filtering of unreliable disparity regions? If so, what challenges prevented its inclusion in this work?
- Your real-world evaluation is based on 15 scenes. Could you describe the diversity of these scenes in terms of reflectance materials, geometry complexity, and lighting conditions? How representative are these of industrial or consumer HDR applications?
- Did you conduct any quantitative analysis on how your method performs at different levels of dynamic range (e.g., 60 dB, 80 dB, 120 dB)? It would be useful to understand if performance degrades gracefully or if there’s a threshold beyond which the method becomes unstable.
Limitations
yes
Final Justification
On clarification, the proposed multi-contrast coding and confidence-driven stereo matching are clearly motivated and technically distinct; the added experimental and qualitative evidence strengthens the case for their novelty and practical relevance. I also acknowledge the authors' plan to include additional visualizations and analysis in the final manuscript to improve clarity and accessibility of the paper. Overall, the paper seems to represent a meaningful advancement for robust HDR 3D reconstruction and thus I support its acceptance.
Formatting Issues
N/A
Response to Reviewer 9sYQ
Thank you for your constructive feedback and for taking the time to evaluate our work. The main concerns are addressed below.
Weakness 1 & Q1: Multi-contrast coding concept and cost-volume stereo matching improvements seem to be incremental extensions of [7][8][9].
Reply: We respectfully disagree. Our work is the first event-based structured light framework specifically designed for HDR 3D reconstruction, whereas [7][8][9] in the manuscript were developed for general-purpose event-based structured light applications. Therefore, [7][8][9] differ fundamentally from our proposed multi-contrast coding and confidence-driven stereo matching in both objective and technical approach. We elaborate on the technical differences below:
Regarding multi-contrast coding, prior works such as [7], [8], and [9] rely on fixed-intensity scanning, which is prone to overexposure and underexposure in HDR scenes, limiting their effectiveness under such challenging conditions. In contrast, we propose a novel multi-contrast coding based on a range-partitioned sensing strategy to effectively image surfaces with varying reflectance. Moreover, none of [7], [8], or [9] adopt speckle-based projection; therefore, their depth encoding principles and reconstruction pipelines differ fundamentally from ours.
Due to the fundamentally different encoding strategy, both the objective and the technical design of our reconstruction algorithm differ significantly from those in [7], [8], and [9]. In our work, we observe inter-frame interference caused by overexposed and underexposed regions in multi-frame fusion, and we propose a universal confidence-driven stereo matching strategy to tackle this problem. Specifically, we estimate a confidence map as the fusion weight for features via energy-guided confidence estimation. Further, we propose the confidence propagation volume, an innovative cost volume that offers both effective suppression of inter-frame interference and strong representation capability. To the best of our knowledge, this algorithmic design has not been investigated in prior work on HDR 3D reconstruction or in [7][8][9].
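For intuition, a minimal sketch of the confidence-weighted feature fusion described above is given below in PyTorch. The function name, the softmax normalization, and the direct use of a raw energy map as the confidence cue are simplifying assumptions for this sketch; it is not the exact implementation of our energy-guided confidence estimation or confidence propagation volume.

```python
import torch

def confidence_weighted_fusion(feats, energies):
    """Fuse per-frame features with per-pixel confidence weights.

    feats:    list of N tensors (B, C, H, W), features of the N speckle frames
    energies: list of N tensors (B, 1, H, W), per-pixel speckle energy/response
    Returns the fused features (B, C, H, W) and the confidences (B, N, H, W).
    """
    # Normalize energies into confidences: well-exposed pixels with a strong
    # speckle response dominate; over-/under-exposed pixels are suppressed.
    conf = torch.softmax(torch.cat(energies, dim=1), dim=1)   # (B, N, H, W)
    stacked = torch.stack(feats, dim=1)                       # (B, N, C, H, W)
    fused = (conf.unsqueeze(2) * stacked).sum(dim=1)          # (B, C, H, W)
    return fused, conf

# Toy usage with three frames of random features.
feats = [torch.randn(1, 32, 64, 80) for _ in range(3)]
energies = [torch.rand(1, 1, 64, 80) for _ in range(3)]
fused, conf = confidence_weighted_fusion(feats, energies)
print(fused.shape, conf.shape)  # (1, 32, 64, 80) and (1, 3, 64, 80)
```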
In summary, our work goes beyond a simple integration of existing ideas and pioneers the first framework for event-based HDR structured light.
Weakness 2: Section 3.1 on multi-contrast HDR coding is dense with notation and engineering descriptions without much intuitive explanation or diagrams to aid understanding.
Reply: Thank you for the suggestion. We have created an illustrative figure to present the design of the multi-contrast coding more clearly. Due to the visualization limitations of the rebuttal format, we will include it in the revised manuscript.
Weakness 3: The authors openly acknowledge (Section 5) that their method does not address occlusions at all — a major shortcoming for a practical 3D reconstruction pipeline. A method lacking a strategy to deal with them limits its robustness and applicability, potentially making the system unreliable for complex geometries.
Reply: Occlusion detection in monocular structured light systems is an ill-posed problem. Specifically, occlusions and low-reflectance regions are fundamentally indistinguishable, as both result in missing or absent speckles in the captured frame. This ambiguity becomes more pronounced in HDR scenes with complex surface properties. Consequently, nearly all existing monocular structured light methods do not attempt to estimate occlusions [1][2][3], and none of the baseline methods used in our comparisons incorporate occlusion estimation. We plan to explore this challenge in future work under a multi-view setting.
Weakness 4: Improvements of real-world data are less pronounced than on synthetic data and the real-world dataset is relatively small (only 15 scenes).
Reply: It is important to note that the real-world dataset is not intended for training, but rather for testing, similar to the famous Middlebury stereo dataset.
As discussed in Lines 281-283, the improvements seem to be marginal due to the sparsity of the ground-truth data, which results from large occlusions and extreme HDR surfaces. Given the limited valid regions in the ground truth, the performance differences appear less significant. In particular, as indicated by the white box in Fig. 6, the ground truth in extreme HDR areas only provides sparse disparity measurements—precisely the regions where our method demonstrates clear advantages. Therefore, in terms of quantitative results, the performance of our method is likely underestimated.
Q2: Did you consider incorporating an occlusion detection mechanism or confidence-based filtering of unreliable disparity regions? If so, what challenges prevented its inclusion in this work?
Reply: We did consider this. We first discuss the issue of occlusions. As detailed in response to weakness 3, occlusion detection or confidence-based filtering in monocular structured light systems is an ill-posed problem. Specifically, occlusions and low-reflectance areas are essentially indistinguishable, as both lead to missing or absent speckle patterns in the captured frames. This ambiguity is further exacerbated in HDR scenes with complex surface characteristics. As a result, almost all existing monocular structured light approaches [1][2] avoid explicit occlusion estimation, and none of the baseline methods included in our comparisons address this issue. We plan to explore this challenge in future work under a multi-view setting.
On the other hand, confidence-based filtering of unreliable disparity is a separate issue from occlusion estimation. This filtering process removes not only occluded regions but also areas with low-quality speckle responses and boundary artifacts. It functions as a post-processing step and is optional in practice. It is important to emphasize that the core contribution of this work lies in proposing a general HDR 3D reconstruction strategy that is compatible with arbitrary stereo networks. Therefore, whether to apply confidence-based filtering depends on the choice of stereo network, and users can flexibly adapt the configuration to suit their specific needs.
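For concreteness, a minimal sketch of such optional confidence-based filtering could look as follows; the threshold value and the use of NaN to mark invalid pixels are illustrative choices rather than a prescribed part of our pipeline.

```python
import numpy as np

def filter_disparity(disparity, confidence, threshold=0.5):
    """Mask out disparities whose confidence falls below a threshold.

    disparity:  (H, W) array of estimated disparities
    confidence: (H, W) array of per-pixel confidences in [0, 1]
    Returns a float copy with unreliable pixels marked as NaN.
    """
    filtered = disparity.astype(np.float64)   # float copy so NaN can be assigned
    filtered[confidence < threshold] = np.nan  # drop unreliable pixels
    return filtered
```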
Q3: Could you describe the diversity of these scenes in terms of reflectance materials, geometry complexity, and lighting conditions? How representative are these of industrial or consumer HDR applications?
Reply: Our real-world evaluation includes 15 scenes selected to cover a wide range of HDR challenges commonly encountered in industrial and consumer applications. These scenes contain objects with diverse reflectance properties, such as common diffuse surfaces, highly reflective metals, low-reflectance matte materials, scattering surfaces, high-contrast textured regions, partially specular objects, and translucent materials. The objects also vary in geometric complexity, including smooth surfaces, regions with abrupt depth changes, and geometric structures ranging from coarse to fine size. The lighting conditions span natural indoor lighting, strong direct illumination, and completely dark environments, simulating realistic high dynamic range scenarios. This diversity ensures that our benchmark is representative of practical use cases such as quality inspection, dental modeling, and vehicle surface reconstruction under real-world HDR conditions.
Q4: Did you conduct any quantitative analysis on how your method performs at different levels of dynamic range (e.g., 60 dB, 80 dB, 120 dB)?
Reply: We conducted an additional experiment to evaluate performance under varying dynamic ranges. The experimental setup is as follows: we first construct a complex geometric scene and keep it fixed throughout the experiment. Then, we sequentially project binary speckle patterns with varying intensities and render the corresponding 32-bit HDR images. For each rendered image, we calculate the dynamic range by extracting its brightest and darkest regions. The overall dynamic range of the scene is then defined as the maximum value among the three projected frames. To vary the scene’s dynamic range, we modify the surface materials to adjust their reflectance properties. The reconstruction results under different dynamic range conditions are summarized as follows:
Table 1: Quantitative evaluation of reconstruction accuracy under varying dynamic ranges.
| Dynamic Range (dB) | 60 | 80 | 100 | 120 | 140 |
|---|---|---|---|---|---|
| EPE | 0.141 | 0.147 | 0.169 | 0.209 | 0.387 |
| Bad 2.0 | 0.005 | 0.005 | 0.007 | 0.009 | 0.019 |
The results show that our method maintains high accuracy across a wide range of dynamic ranges (60–120 dB), with only a modest increase in error as the dynamic range increases. At 140 dB, both high-reflectance and low-reflectance regions experience speckle loss across all three frames, leading to reduced reconstruction accuracy in those areas. Nevertheless, the network’s disparity completion capability enables reliable inference from surrounding regions, resulting in reasonable disparity estimates. Overall, our method consistently delivers stable and accurate reconstruction even under extreme HDR conditions. We will include the results above and corresponding qualitative results in the revised manuscript.
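For reference, the per-frame dynamic-range computation used in this setup can be sketched as follows; the percentile-based selection of the brightest and darkest regions is an illustrative simplification, not the exact region-extraction rule.

```python
import numpy as np

def dynamic_range_db(hdr_image, low_pct=1.0, high_pct=99.0):
    """Estimate the dynamic range (in dB) of a 32-bit HDR rendering.

    Percentiles stand in for the darkest/brightest regions; the exact
    percentile values here are illustrative.
    """
    i_min = max(np.percentile(hdr_image, low_pct), 1e-12)  # avoid log of zero
    i_max = np.percentile(hdr_image, high_pct)
    return 20.0 * np.log10(i_max / i_min)

# Scene-level dynamic range: the maximum over the three projected frames.
# scene_dr = max(dynamic_range_db(img) for img in rendered_frames)
```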
[1] S. Schreiberhuber, J. B. Weibel, T. Patten, et al. GigaDepth: Learning depth from structured light with branching neural networks. In Proc. European Conference on Computer Vision (ECCV), pages 214–229, 2022.
[2] G. Riegler, Y. Liao, S. Donne, et al. Connecting the dots: Learning representations for active monocular depth estimation. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 7624–7633, 2019.
Dear Authors, thank you for your detailed responses. I am satisfied with the responses and my concerns have been addressed. Therefore, I will raise my score to reflect this.
Thank you for your thoughtful review and constructive feedback, which has provided us with valuable inspiration. We are pleased that our response effectively addressed all your concerns and are highly encouraged by your decision to raise the score.
The paper proposes event-based structured light imaging using 3 pairs of projected speckle patterns with adjusted background and foreground intensities. Each image pair varies not only in light intensity but also in speckle content, so the approach is not simple HDR preprocessing. To integrate the information, the authors propose confidence-map-based feature fusion.
Strengths and Weaknesses
Strengths
The paper suggests a new problem setting, HDR event-based SL. It can help to acquire objects with varying reflectance simultaneously.
The approach is not naïve HDR preprocessing at the image acquisition step: the speckle contents vary, which enhances accuracy. Feature-level fusion and confidence maps enable reconstruction from varying speckle contents and further improve performance.
Weaknesses
The paper does not compare with other event-based methods. Section 2.1 points out the problems of existing event-based methods, but the paper does not compare with them at all, and the comparison targets are all general camera-based stereo methods applied to their event images. This comparison is a kind of ablation study to select a base stereo method (in this case, IGEV), not a comparison.
Multiple acquisitions for HDR may not be in line with the main purpose of using event-based methods. The advantage of event cameras is that they can achieve higher frame rates and capture dynamic objects, but this advantage may be diluted when using 3 pairs of images. The paper does not mention the acquisition of dynamic objects or the frame rate of the system.
Questions
What is the frame rate of the entire system? Is it possible to capture dynamic objects like other event-based methods? If not, what are the advantages over SL methods using a general camera?
Limitations
yes
Final Justification
My concerns have been resolved, and my final score goes up to weak accept.
Formatting Issues
No
Response to Reviewer Qbn3
Thank you for your thoughtful review and constructive feedback. We sincerely appreciate the time and effort you have dedicated to evaluating our work. We are also encouraged by your recognition of both our problem setting and the proposed approach. The main concerns are addressed below.
Weakness 1: The paper does not compare with the other event-based methods.
Reply: Thank you for the valuable feedback. Kindly note that we did compare with existing work on event-based structured light. Specifically, BM and SGBM are our re-implementations of Huang et al. [1], a seminal work in the domain of event-based structured light. To improve the quality of the random speckle in HDR scenes, we adopt our imaging strategy described in Lines 108-111. Moreover, to ensure fairness, we implemented three-frame extension versions (BM-3 and SGBM-3) for comparison.
Additional event-based structured light methods were not included, since existing approaches are not specifically designed for HDR 3D reconstruction and thus perform poorly under HDR conditions, as demonstrated in Fig. 1. To further support this point and emphasize the strength of our approach, we implemented a state-of-the-art Graycode-based method [2]. The quantitative results are shown below:
Table 1: Quantitative comparison with Graycode under HDR scenes
| Scene | Method | EPE | Bad 0.5 | Bad 1.0 | Bad 2.0 | Bad 3.0 | Bad 5.0 | D1 |
|---|---|---|---|---|---|---|---|---|
| Scene 1 | Graycode | 43.6115 | 0.7982 | 0.6407 | 0.4766 | 0.4424 | 0.4315 | 0.4182 |
| Scene 1 | Ours | 0.8449 | 0.2727 | 0.0781 | 0.0417 | 0.0379 | 0.0286 | 0.0249 |
| Scene 2 | Graycode | 38.7694 | 0.7380 | 0.5921 | 0.4387 | 0.4068 | 0.3952 | 0.3810 |
| Scene 2 | Ours | 0.7490 | 0.2498 | 0.0732 | 0.0383 | 0.0330 | 0.0276 | 0.0229 |
As demonstrated, [2], which is designed for general-scene reconstruction, fails completely in HDR regions. Our proposed HDR 3D reconstruction framework (only 3 frames) significantly outperforms [2] (requiring 11 frames) in terms of both accuracy and completeness. We will add qualitative and quantitative results in the supplementary material for completeness.
Weakness 2 & Q1 & Q2: The advantage of the high speed of event cameras may be diluted when using 3 pairs of images. The paper does not mention the acquisition of dynamic objects or the frame rate of the system.
Reply: Projecting multiple frames for depth acquisition in event-based structured light is a common practice [2,3], as the depth encoding itself is inherently multi-frame in nature. In our system, we project each set of patterns at 200 Hz. By employing a time-overlapping strategy [2], our system achieves a depth sensing rate of 200 Hz. Even without the time-overlapping strategy, our system achieves a depth sensing rate of 66.6 Hz, surpassing representative event-based structured light methods [3] (50 Hz), [4] (60 Hz), and [5] (60 Hz). This rate fully supports the reconstruction of dynamic scenes.
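As a back-of-envelope check of these rates (assuming $N = 3$ patterns per depth estimate and, for the overlapping case, a sliding window over the last $N$ frames that advances by one frame per projection):

$$
f_{\text{depth}}^{\text{non-overlap}} = \frac{f_{\text{proj}}}{N} = \frac{200\ \text{Hz}}{3} \approx 66.6\ \text{Hz},
\qquad
f_{\text{depth}}^{\text{overlap}} = f_{\text{proj}} = 200\ \text{Hz}.
$$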
In our work, we leverage both the high-speed and high dynamic range (HDR) characteristics of the event camera, enabling dynamic HDR 3D reconstruction. We additionally conducted experiments on dynamic scenes to demonstrate our method’s capability in dynamic HDR reconstruction. Due to the visualization limitations of the rebuttal format, we will present the qualitative results in the revised manuscript.
[1] X. Huang, Y. Zhang, and Z. Xiong. High-speed structured light based 3D scanning using an event camera. Optics Express, 29(22):35864–35876, 2021.
[2] X. Lu, L. Sun, D. Gu, et al. SGE: Structured light system based on Gray code with an event camera. Optics Express, 32(26):46044–46061, 2024.
[3] Y. Li, H. Jiang, C. Xu, et al. Event-driven fringe projection structured light 3D reconstruction based on time–frequency analysis. IEEE Sensors Journal, 24(4):5097–5106, 2024.
[4] N. Matsuda, O. Cossairt, and M. Gupta. MC3D: Motion contrast 3D scanning. In Proc. IEEE Int. Conf. Computational Photography (ICCP), pages 1–10, 2015.
[5] M. Muglikar, G. Gallego, and D. Scaramuzza. ESL: Event-based structured light. In Proc. Int. Conf. 3D Vision (3DV), pages 1165–1174, 2021.
Thank you for your rebuttal and additional experiments. My concerns have been resolved, and my final score will go up.
Thank you for raising the score. We are glad our response adequately addressed your concerns.
Dear Reviewer, Please also note that you must acknowledge having read the reply and enter your final rating and justification before the deadline. Respectfully, your AC
The paper proposes a speckle coding for capturing high dynamic range (HDR) structured light (SL) data with an event camera, allowing for accurate disparity reconstruction of challenging surfaces such as low albedo, specular, and transmissive surfaces. For reconstruction, the paper improves on cost volume stereo matching using two custom modules: an energy guided confidence estimation which is then used to create a confidence propagation volume. The two modules aim to suppress over/under exposed pixels’ contributions to the final reconstruction. Additionally, the paper provides a synthetic dataset with an accompanying event-based SL simulator in Blender.
Strengths and Weaknesses
Strengths:
- Synthetic dataset and simulator provided
- Real world AND synthetic data evaluation
- Confidence-driven stereo matching also improves other baseline models (FastACV and RAFT-Stereo), not only the used model.
Weaknesses:
- Results on real world data seem hard to compare quantitatively as the ground truths are very different from the reconstruction because of the shadows.
Questions
- What is the inference-time compute?
- What if an object moves within the scene?
- Why speckle patterns and not lines of varying intensity?
- What about glitter or objects with small holes (i.e., a similar scale to the speckle)? Fig. 5 has the shiny corn, but the size of the corn kernels is relatively large.
- Could including the negative events aid in the reconstruction, or are they redundant?
- Is the pixel-wise difference of the speckled image and non-speckled image for the simulation actually a good approximation of what happens in the pixel?
- In addition to the GT point cloud alignment, is the GT disparity map also masked in order to compare the GT with the reconstructed disparity in the real-world results?
Limitations
Authors discussed limitations of the work, but did not discuss potential negative societal impacts even in the supplemental material. In fact, the paper checklist says, “there is no societal impact of this work”. If that is true, then it begs the question: “why even research this”? Structured light is a widely used technology (3D scanners, Apple FaceID, etc.), and event cameras are quickly growing in popularity due to their low latency, high temporal resolution, and HDR capabilities (among other advantages). Generating more accurate depth/disparity by combining these systems could lead to better art conservation/digitization, face recognition on darker skin, and better ground truth datasets – just to list a few downstream applications. At the same time, improving downstream applications such as those listed also has the potential to enable art theft (unauthorized use in training or reproduction), racial profiling, and training of weaponized AI.
Final Justification
The author's responses to my questions were satisfactory, hence I am keeping my score the same.
Formatting Issues
NA
Response to Reviewer uctW
Thank you for your thoughtful review and constructive feedback. We appreciate the time and effort you have dedicated to evaluating our work. We are encouraged by your recognition of our contribution of dataset and simulator, the comprehensive evaluations on both synthetic and real-world data, and the generalizability of our method. The main concerns are addressed below.
Weakness 1: Results on real-world data seem hard to compare quantitatively as the ground truths are very different from the reconstruction because of the shadows.
Reply: We use the Photoneo MotionCam 3D [1], a widely used industrial-grade high-precision HDR structured light scanner, to capture the ground truth. The evaluation is conducted only on regions with valid ground truth. The shadowed regions are caused by: (1) Occlusions due to the large baseline between the camera and the projector, which is common in high-precision SL scanners, as a larger baseline typically leads to higher accuracy. (2) Extreme HDR regions that are difficult to reconstruct. It is worth emphasizing that our method is capable of reconstructing regions where even the HDR scanner fails, which highlights a key advantage of our approach.
Q1: What is the inference-time compute?
Reply:
We report the inference time results in Table 1 below:
Table 1: Quantitative results on the synthetic dataset with inference time
| Methods | EPE | Bad 2.0 | Inference Time (s) |
|---|---|---|---|
| BM | 31.4518 | 0.3156 | 0.0030 |
| BM_3 | 18.2216 | 0.2401 | 0.0120 |
| SGBM | 13.3384 | 0.1421 | 0.0060 |
| SGBM_3 | 6.3327 | 0.0918 | 0.0210 |
| CTD | 26.8375 | 0.2638 | 0.0190 |
| GigaDepth | 8.6688 | 0.1135 | 0.0210 |
| FastACV | 0.7112 | 0.0342 | 0.0670 |
| RAFT_Stereo | 0.4136 | 0.0219 | 0.8490 |
| IGEV | 0.3863 | 0.0177 | 0.4200 |
| Ours | 0.2937 | 0.0122 | 0.4080 |
Q2: What if an object moves within the scene?
Reply: The depth sensing rate of the system is 200 Hz, which means our system can produce 200 depth estimates per second, sufficient to support general dynamic scene reconstruction. We additionally conducted experiments on dynamic scenes, which demonstrate our method’s capability and advantage in reconstructing dynamic HDR scenes. Due to the visualization limitations of the rebuttal format, we will include the new results in the revised manuscript.
Q3: Why speckle patterns and not lines of varying intensity?
Reply: Event cameras can only perceive intensity changes and output binary data, and thus can NOT capture grayscale information in lines with varying intensity. Random speckle is a widely adopted depth encoding strategy with high encoding efficiency [2], which is a natural choice in our framework.
Q4: What about glitter or objects with small holes (i.e., a similar scale to the speckle)? Fig. 5 has the shiny corn, but the size of the corn kernels is relatively large.
Reply: As indicated by the real-world results in Fig. 6, our method can reconstruct objects with sizes equal to or slightly smaller than the speckle size, but it is unable to recover pixel-level structures. As an extremely efficient encoding strategy, this limitation is inherently determined by the nature of speckle encoding [2].
Q5: Could including the negative events aid in the reconstruction, or are they redundant?
Reply: It is possible. However, our multi-contrast coding is based on positive event triggering, which is also commonly adopted in many existing methods [3][4]. As a result, negative events are redundant in our setup. In implementation, we adjust the hardware configuration of the event camera to disable the generation of negative events, thereby avoiding interference and improving bandwidth efficiency.
Q6: Is the pixel-wise difference of the speckled image and non-speckled image for the simulation actually a good approximation of what happens in the pixel?
Reply: In implementation, we first compute the difference and then apply the logarithm, strictly following the imaging model of the event camera.
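Written as a sketch of the standard event-generation condition (with $C_1$ the contrast threshold, $I_{\text{on}}$ the intensity with the speckle projected, and $I_{\text{off}}$ the intensity without it; the exact form is shown here for illustration), a positive event is triggered at a pixel when

$$
\log I_{\text{on}} - \log I_{\text{off}} \;=\; \log \frac{I_{\text{on}}}{I_{\text{off}}} \;\geq\; C_1 .
$$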
During simulation, we vary the value of C1 to further increase data diversity. Additionally, we use Blender’s Cycles renderer, which faithfully simulates real-world physical illumination, providing a realistic approximation of pixel-level behavior.
Q7: In addition to the GT point cloud alignment, is the GT disparity map also masked in order to compare the GT with the reconstructed disparity in the real-world results?
Reply: Yes, we perform the evaluation only on ground-truth regions with valid disparity values.
[1] Photoneo. MotionCam-3D M+. [Online]. Available: https://www.photoneo.com/products/motioncam-3d-m-plus/. Accessed: May 13, 2025.
[2] Z. Xiong, Y. Zhang, F. Wu, et al., "Computational depth sensing: Toward high-performance commodity depth cameras," IEEE Signal Processing Magazine, vol. 34, no. 3, pp. 55–68, 2017.
[3] M. Muglikar, G. Gallego, and D. Scaramuzza, "ESL: Event-based structured light," in Proc. IEEE Int. Conf. on 3D Vision (3DV), 2021, pp. 1165–1174.
[4] X. Lu, L. Sun, D. Gu, et al., "SGE: Structured light system based on Gray code with an event camera," Optics Express, vol. 32, no. 26, pp. 46044–46061, 2024.
- Q3: To clarify, I meant to ask about projecting pairs of lines instead of pairs of speckles. Basically, keeping the multiple contrast level selection process but using lines instead of speckle.
I am planning on keeping my score the same.
Thank you for your valuable feedback. We are pleased that most concerns have been addressed and are highly encouraged by your positive recommendation.
Q1: Why projecting pairs of speckles instead of pairs of lines?
Reply: Thank you for the clarification. We adopt random speckle patterns because their spatial randomness ensures unique encoding for each pixel along the epipolar line, which is indispensable for structured light systems. In contrast, line patterns are spatially repetitive and thus cannot provide pixel-wise unique encoding, leading to ambiguity in correspondence and failure in 3D reconstruction.
Dear Reviewer, Thank you for the discussions and considering the rebuttal and other reviews. Please remember to enter your final rating and justification before the deadline. Respectfully, your AC
This paper proposes the first HDR 3D measurement framework tailored for event-based SL systems and constructs the first datasets for this task. Experimental results show that the proposed method has achieved state-of-the-art performance in both synthetic and real-world datasets.
Strengths and Weaknesses
Strengths:
- This paper presents a novel depth estimation framework based on event cameras and structured light, which improves the performance of current depth estimation in HDR scenes.
- The ablation experiments and visualization results of confidence maps demonstrate that the method can distinguish between overexposed and underexposed regions.
- This paper constructs a synthetic dataset and a real-world dataset, enriching the dataset resources in the field of depth estimation.
Weaknesses:
- As shown in Table I, the "multi-contrast HDR coding" plays the most significant role and delivers the greatest performance improvement. However, no theoretical basis or further analysis is provided.
- Why are three pairs of binary speckle patterns adopted? Will increasing the number of patterns enhance the effect?
- In the paper, the projection intensities of the three pairs of patterns are set as [32, 55], [32, 200], and [0, 255]. Is there any experimental or theoretical basis for this?
- Why does the performance of structured light-based methods in Table 1 lag far behind that of stereo matching methods? It is hoped that the authors can provide further explanations.
- The proposed method does demonstrate remarkable performance in the newly established scenarios of this paper, yet these scenarios are specialized constructs tailored for the method. The authors should supplement experiments on a broader range of datasets, separately illustrating the performance of the proposed method in HDR scenarios and non-HDR scenarios.
- The paper lacks a complexity analysis of the proposed method.
Questions
Question:
- Supplement experiments. Refer to weaknesses 4 and 5.
- Provide theoretical explanation or relevant experiments. Refer to weaknesses 1 and 2.
Limitations
yes
Final Justification
While acknowledging the pioneering nature of this work, the proposed method cannot be directly compared with existing approaches on existing datasets because the information they utilize is actually different. Consequently, the potential improvement claimed is difficult to definitively quantify against the existing broader state-of-the-art. Therefore, I will keep my original score (borderline accept) unchanged.
Formatting Issues
Not found
Response to Reviewer VZsb
Thank you for your thoughtful review and constructive feedback. We appreciate the time and effort you have taken to evaluate our work. We are encouraged by your recognition of the novelty of our HDR 3D reconstruction framework and our contribution of datasets to the community. The main concerns are addressed below.
Q1: As shown in Table I, the "multi-contrast HDR coding" plays the most significant role and delivers the greatest performance improvement. However, no theoretical basis and further analyses are provided.
Reply: Kindly note that we have provided the theoretical basis in Lines 113–124. This significant improvement arises from two key aspects:
- The multiple contrast levels proposed in our multi-contrast HDR coding enable imaging across regions with varying reflectance, as shown in Fig. 2(a).
- The variation of the content addresses the issue that a single speckle frame cannot provide a globally unique encoding, thus improving disparity estimation accuracy, as also pointed out in [1]. Moreover, in HDR scenes, projecting several mutually independent random-speckle frames is equivalent to taking multiple independent samples of the surface micro-facet normal distribution, which reduces the likelihood that all frames at a given pixel suffer from event clustering or absence, as quantified briefly after this list.
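As a brief quantification under a simplifying independence assumption (the failure probability $p$ below is purely illustrative, not a measured value): if a single frame loses its speckle at a pixel with probability $p$, the probability that all $N$ independent frames fail there is

$$
P(\text{all } N \text{ frames fail}) = p^{N}, \qquad \text{e.g., } p = 0.3,\ N = 3 \;\Rightarrow\; p^{N} = 0.027 .
$$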
Q2: Why are three pairs of binary speckle patterns adopted? Will increasing the number of patterns enhance the effect?
Reply: Thank you for your insightful question. Increasing the number of patterns N improves reconstruction accuracy but also affects efficiency. To strike a balance between the two in learning-based methods, we conducted comprehensive experiments to determine the optimal value of N. We tested reconstruction accuracy for N ranging from 1 to 7 on planes and spheres, which is a widely adopted approach for precision calibration. The results are shown below:
Table 1: Accuracy under Varying Number of Patterns (RMSE in mm)
| Num of Patterns | Plane | Sphere |
|---|---|---|
| 1 | 1.10 | 1.18 |
| 2 | 0.85 | 0.91 |
| 3 | 0.55 | 0.65 |
| 4 | 0.55 | 0.57 |
| 5 | 0.50 | 0.52 |
| 6 | 0.41 | 0.49 |
| 7 | 0.40 | 0.47 |
The results show that increasing N improves accuracy, but the improvement becomes marginal as N increases. When N=3, we achieve accuracy comparable to using more frames, i.e., N=3 is an elbow point that strikes a balance between accuracy and efficiency. We will add an explanation in the main text and include the experimental results in the supplementary material.
Q3: In the paper, the projection intensities of the three pairs of patterns are set as [32, 55], [32, 200], and [0, 255]. Is there any experimental or theoretical basis for this?
Reply: The projection intensities are selected based on empirical experiments conducted under various scenes. Specifically, we constructed scenes with very low reflectance for selecting the first set of parameters, typical diffuse scenes for the second set, and highly reflective scenes for the third set.
This set of parameters enables reconstruction of scenes ranging from non-HDR to HDR. Additionally, due to the diversity of the synthetic dataset, our method is also robust to parameter adjustments, demonstrating high flexibility.
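For illustration, one way such multi-contrast pattern pairs could be generated is sketched below. The resolution, speckle density, random seed, and the interpretation of each pair as one speckled frame plus one uniform non-speckled frame at the same background level are simplifying assumptions for this sketch, not our exact projection protocol.

```python
import numpy as np

def make_speckle_pair(low, high, shape=(720, 1280), density=0.5, seed=0):
    """Create one pair of frames at a given contrast level.

    The 'pattern' frame projects a random binary speckle at intensity `high`
    on a `low` background; the 'reference' frame is the uniform background.
    """
    rng = np.random.default_rng(seed)
    mask = rng.random(shape) < density
    pattern = np.where(mask, high, low).astype(np.uint8)
    reference = np.full(shape, low, dtype=np.uint8)
    return pattern, reference

# The three contrast levels reported above, one speckle pair per level.
levels = [(32, 55), (32, 200), (0, 255)]
pairs = [make_speckle_pair(lo, hi, seed=i) for i, (lo, hi) in enumerate(levels)]
```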
Q4: Why does the performance of structured light-based methods in Table 1 lag far behind that of stereo matching methods?
Reply: Monocular structured light methods (CTD and GigaDepth) implicitly learn the projected pattern during training and directly regress the depth from the input. This enables faster inference; however, the approach is inherently vulnerable to noise and system perturbations, limiting its robustness in challenging HDR scenes.
Stereo methods benefit from the explicit use of projected patterns, which provide rich structured information as a reference. The pixel-wise search and matching performed directly on the projected frames lead to high accuracy and robustness.
Q5: The authors should supplement experiments on a broader range of datasets, separately illustrating the performance of the proposed method in HDR scenarios and non-HDR scenarios.
Reply: The HDR scenes evaluated in Fig. 6 already include non-HDR objects, such as the bookshelf in Fig. 6 (a) and the fan in Fig. 6 (c), which are nearly diffuse in reflectance.
As suggested, we have added both qualitative and quantitative results for purely non-HDR scenes. The quantitative result is shown below:
Table 2: Quantitative Comparison on Non-HDR Scenes
| Metric | SGBM-3 | GigaDepth | FastACV | IGEV | Ours |
|---|---|---|---|---|---|
| EPE | 3.153 | 4.986 | 0.799 | 0.447 | 0.393 |
| Bad 2.0 | 0.066 | 0.079 | 0.038 | 0.021 | 0.018 |
Due to the visualization limitations of the rebuttal format, we will present the qualitative results in the revised manuscript.
Q6: The paper lacks a complexity analysis of the proposed method.
Reply: We have additionally computed the inference time, and the results are shown below:
Table 3: Quantitative results on the synthetic dataset with inference time
| Methods | EPE | Bad 2.0 | Inference Time (s) |
|---|---|---|---|
| BM | 31.4518 | 0.3156 | 0.0030 |
| BM_3 | 18.2216 | 0.2401 | 0.0120 |
| SGBM | 13.3384 | 0.1421 | 0.0060 |
| SGBM_3 | 6.3327 | 0.0918 | 0.0210 |
| CTD | 26.8375 | 0.2638 | 0.0190 |
| GigaDepth | 8.6688 | 0.1135 | 0.0210 |
| FastACV | 0.7112 | 0.0342 | 0.0670 |
| RAFT_Stereo | 0.4136 | 0.0219 | 0.8490 |
| IGEV | 0.3863 | 0.0177 | 0.4200 |
| Ours | 0.2937 | 0.0122 | 0.4080 |
[1] Q. Tang, C. Liu, Z. Cai, et al., "An improved spatiotemporal correlation method for high-accuracy random speckle 3D reconstruction," Optics and Lasers in Engineering, vol. 110, pp. 54–62, 2018.
The authors' responses have addressed some of my questions, such as Q3 and Q4. However, the following issues remain unresolved:
Regarding Q2, why did the authors not present the computation time?
For Q4, were the experiments conducted on datasets that have been used by other methods?
Additionally, why does the proposed method still significantly outperform other approaches even in non-high dynamic range scenarios?
I believe that although this paper represents a pioneering work, it still lacks sufficiently robust comparative experiments to validate the effectiveness of its methodological innovations.
Thank you for providing further feedback. We are encouraged that our responses have successfully addressed part of your concerns. The remaining ones are further clarified as follows.
Q1: Regarding Q2, why did the authors not present the computation time?
Reply: Thank you for pointing this out. We further report the inference time for completeness:
Table 1: Accuracy under Varying Number of Patterns (RMSE in mm)
| Num of Patterns (N) | Plane | Sphere | Inference Time (s) |
|---|---|---|---|
| 1 | 1.10 | 1.18 | 0.261 |
| 2 | 0.85 | 0.91 | 0.338 |
| 3 | 0.55 | 0.65 | 0.408 |
| 4 | 0.55 | 0.57 | 0.499 |
| 5 | 0.50 | 0.52 | 0.565 |
| 6 | 0.41 | 0.49 | 0.643 |
| 7 | 0.40 | 0.47 | 0.712 |
As can be observed, selecting N=3 strikes a balance between accuracy and efficiency for our method.
Q2: For Q4, were the experiments conducted on datasets that have been used by other methods?
Reply: Unfortunately, existing datasets used by previous methods [1][2] only render one single speckle frame for each view, and thus our multi-frame method cannot be directly applied to them. Kindly note that all learning-based baselines were re-trained on our proposed dataset with a three-frame extension to ensure consistency and fairness in comparison.
Q3: Why does the proposed method still significantly outperform other approaches even in non-high dynamic range scenarios?
Reply: Even in non-HDR scenarios, complex surface geometry can still lead to distortion or loss of random speckles, resulting in varying speckle quality across frames at a given location. In this case, our proposed confidence-driven stereo matching remains effective by dynamically adjusting the weights of the three frames according to their quality, which in turn improves matching accuracy. Furthermore, as detailed in Section 3.2.2, the proposed confidence propagation volume provides enhanced representation capability, which also contributes to improved matching accuracy. These results collectively highlight the generality and superiority of our method across diverse scenarios.
[1] S. Schreiberhuber, J. B. Weibel, T. Patten, et al. GigaDepth: Learning depth from structured light with branching neural networks. In Proc. European Conference on Computer Vision (ECCV), pages 214–229, 2022.
[2] G. Riegler, Y. Liao, S. Donne, et al. Connecting the dots: Learning representations for active monocular depth estimation. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 7624–7633, 2019.
Thanks for the careful rebuttal. I have no further questions and will take your response into account when making the final decision.
We are glad our response adequately addressed your concerns. Thank you again for helping us improve our paper.
Dear Reviewer, Thank you for the discussions and considering the rebuttal and other reviews. Please also note that you must acknowledge having read the reply and enter your final rating and justification before the deadline. Respectfully, your AC
With two borderline accept and two accept ratings, the perception of this work is largely positive. All reviews and rebuttals have been carefully considered, and this AC agrees that this work meets the bar of NeurIPS.