PaperHub
6.2 / 10
Poster · 5 reviewers
Ratings: 5, 6, 6, 6, 8 (min 5, max 8, std dev 1.0)
Confidence: 4.0
Correctness: 3.0
Contribution: 3.0
Presentation: 3.0
NeurIPS 2024

Zero-Shot Event-Intensity Asymmetric Stereo via Visual Prompting from Image Domain

OpenReview · PDF
Submitted: 2024-05-01 · Updated: 2024-11-06
TL;DR

We propose a zero-shot event-intensity asymmetric stereo method that adapts large-scale image domain models by using physics-inspired visual prompting and a monocular cue-guided disparity refinement technique.

Abstract

Keywords
Event cameras, stereo matching, asymmetric stereo, visual prompting, disparity filtering

Reviews and Discussion

Review
Rating: 5

This paper addresses the zero-shot event-intensity asymmetric problem. Given an intensity image and the associated events, where the conventional and event cameras are spatially separated by a baseline, ZEST estimates the disparity map between the two input modalities. The key idea is to convert the input events and intensity image into a uniform intermediate representation before leveraging off-the-shelf stereo-matching models pre-trained on large datasets. The disparity refinement additionally solves a numerical optimization problem to refine the off-the-shelf stereo-matching result with the help of monocular depth estimation models operated on the input image.

Strengths

  1. This paper focuses on an important problem, using zero-shot models to transfer knowledge from conventional images to neuromorphic events. Over the past year or so, the computer vision community has witnessed the transition to extremely large models trained with extremely big data. However, the amount of available event data is limited due to the novelty of the event sensor. Zero-shot and few-shot learning are promising directions that have the potential to significantly advance event-based vision research.

  2. The proposed solution is reasonable, convincing, and technically sound. The representation alignment module exploits the EDI model to convert the input image and events into a uniform differential representation by taking the integral and temporal differences. The disparity refinement models the desired output disparity map as the linearly transformed map predicted by the monocular image-only model.

  3. Experimentally, the proposed ZEST outperforms the baseline approaches by a significant margin despite the fact that ZEST is a zero-shot model and does not require dataset-specific training.

Weaknesses

  1. My biggest concern is the novelty of the proposed model. The proposed stereo-matching module appears to be a straightforward rearrangement of the Event-based Double Integral (EDI) model presented by Pan et al. The disparity refinement module uses traditional numerical techniques to optimize a two-term objective function, involving a residual term and a smoothness term.

  2. From Table 1 and Table 2, it appears that performance improvement is mainly the result of combining off-the-shelf EDI with CREStereo (CR) and DynamicStereo (DS). The last two rows of Table 2 demonstrate that the disparity refinement module, which comprises the majority of technical contributions, leads to a relatively minor performance difference.

  3. It is unclear if a linear transformation is the best way to connect the monocular prediction to the desired binocular disparity map.

Questions

May I ask if connecting D^mono and D^bino by a linear transformation is the standard practice in existing literature?

Limitations

Yes, the authors have adequately addressed the limitations and potential negative societal impact of their work.

Author Response

Thank you for your detailed feedback. We address your concerns below.

W1. Novelty and contributions of the visual prompting module:

While we draw inspiration from the EDI model for event-based image deblurring, our visual prompting module is not a straightforward rearrangement. We re-purpose its core idea to bridge the modality gap between events and frames specifically for stereo matching, which is not the original intention of the EDI model. Our method efficiently tackles challenges of non-negligible duty time between exposures and unknown triggering thresholds, which are crucial aspects of event-intensity asymmetric stereo and not addressed by prior work (L147-L156). This tailored reformulation, enabling the effective use of off-the-shelf stereo models without retraining on events in a zero-shot manner, forms a core novelty of our work and demonstrates "ingenuity in adapting well-established methods from the image domain to the event domain" (Reviewer dhj9 Strength 2). It is potentially "widely applicable for cross-modal tasks" (Reviewer Ej3V Strength 3) and "other asymmetric vision tasks" (Reviewer ggyM Question 3).
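For concreteness, a minimal sketch of this alignment idea is shown below; the function names, the log-intensity formulation, and the zero-mean/unit-std normalization are illustrative assumptions, not the paper's exact Eqs. (4)-(5).

```python
import numpy as np

def temporal_gradient(frame_t0, frame_t1, eps=1e-6):
    # Frame-side prompt: difference of log intensities between two
    # consecutive exposures of the intensity camera.
    f0 = frame_t0.astype(np.float32)
    f1 = frame_t1.astype(np.float32)
    return np.log(f1 + eps) - np.log(f0 + eps)

def temporal_integral(events, shape):
    # Event-side prompt: per-pixel sum of signed polarities over the same
    # time window; the unknown contrast threshold only scales this map.
    integ = np.zeros(shape, dtype=np.float32)
    for x, y, polarity in events:  # polarity in {-1, +1}
        integ[y, x] += polarity
    return integ

def normalize(x, eps=1e-6):
    # Zero-mean / unit-std normalization removes the global scale caused by
    # the unknown threshold, so both prompts live in a comparable range.
    return (x - x.mean()) / (x.std() + eps)

# Both normalized maps are then fed as a "stereo pair" to an off-the-shelf
# frame-based stereo network (e.g., CREStereo or DynamicStereo).
```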

W1&W2. Novelty and contributions of the disparity refinement module:

The disparity refinement module is a novel component of our approach that addresses key limitations inherent to event-intensity asymmetric stereo:

  1. Handling static regions: Event cameras excel at capturing dynamic scenes but suffer from information sparsity in static regions. The refinement module leverages monocular cues to effectively address this, particularly for challenging regions with sparse events or textureless areas (L162-L164).

  2. Addressing unknown scale: Monocular depth models often predict relative depth with an unknown scale. Our refinement module overcomes this by using a linear transformation guided by event density confidence to align the monocular prediction with the stereo disparity map (L170-L172).

This linear transformation-based fusion strategy, which utilizes numerical optimization techniques guided by event density confidence, is specifically tailored to the unique challenges of event-intensity asymmetric stereo, which is one of the novel aspects introduced by our approach.
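As a rough illustration of this fusion strategy, the sketch below fits spatially-varying scale and shift maps with a residual term weighted by an event-density confidence map plus a smoothness term; the loss weights, optimizer, and 2-D tensor shapes are assumptions rather than the exact objective optimized in the paper.

```python
import torch

def refine_disparity(d_bino, d_mono, confidence, iters=500, lr=0.1, lam=1.0):
    # d_bino, d_mono, confidence: 2-D float tensors of the same (H, W) shape.
    # Fit spatially-varying scale k and shift b so that k * d_mono + b stays
    # close to the stereo prediction where event-density confidence is high,
    # while k and b vary smoothly elsewhere.
    k = torch.ones_like(d_mono, requires_grad=True)
    b = torch.zeros_like(d_mono, requires_grad=True)
    opt = torch.optim.Adam([k, b], lr=lr)
    for _ in range(iters):
        opt.zero_grad()
        d_ref = k * d_mono + b
        # Residual term: trust the stereo disparity where events are dense.
        residual = (confidence * (d_ref - d_bino) ** 2).mean()
        # Smoothness term: penalize spatial variation of the scale/shift maps.
        smooth = sum(
            (m[:, 1:] - m[:, :-1]).abs().mean()
            + (m[1:, :] - m[:-1, :]).abs().mean()
            for m in (k, b)
        )
        loss = residual + lam * smooth
        loss.backward()
        opt.step()
    return (k * d_mono + b).detach()
```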

The inclusion of the disparity refinement module brings improvements, particularly in preserving sharp depth boundaries for objects like cars and buildings in challenging scenes with sparse events or textureless areas, as evident in the visual comparisons between Ours-DS and Ours-DS-DA in Figures 6, 7, 12, and 13. The marginal gains observed in the evaluation metrics (Table 2, Rows 7-8) are likely due to the sparsity of the ground truth disparity provided in the DSEC dataset (Figure 5).

We profiled the runtime and found that the disparity refinement module takes about 306.82 ms, accounting for 48.6% of the total inference time (630.36 ms) of the Ours-CR-DA variant. We can reduce this overhead while maintaining comparable performance by decreasing the number of iterations, as shown in Table C in the response to Global - Q2.

W3&Q1. Linear transformation in disparity refinement:

The choice of a linear transformation to connect monocular and binocular predictions is not a standard practice but rather a novel aspect of our approach. It is motivated by the observation that monocular depth models often predict relative depth up to an unknown scale and shift (L170-L172). A linear transformation offers a simple yet effective way to align these predictions with the stereo disparity map, particularly in preserving sharp depth boundaries for objects like cars and buildings in challenging scenes with sparse events or textureless areas. While exploring more complex transformations is an interesting avenue for future work, our experiments demonstrate the effectiveness of this simple approach in achieving good performance.

Comment

Dear Authors,

Thanks for submitting the rebuttal! I really like your response, which addresses my concerns. I think this is a solid paper and deserves publication. As such, I will increase my rating and recommend acceptance.

Sincerely,

Reviewer U212

Comment

Thank you very much for your feedback.

Review
Rating: 6

The authors propose a novel zero-shot framework for hybrid event-intensity stereo matching (i.e., one frame camera and one event camera) using off-the-shelf stereo models without additional training. Given the proposed representation alignment, the authors successfully achieve stereo matching using off-the-shelf networks (e.g., CREStereo) trained on conventional large-scale stereo datasets. Furthermore, the authors propose a monocular disparity refinement based on large monocular networks (e.g., MiDaS): the predicted disparity map from the stereo module is used to scale the relative disparity map to absolute metrics using a local optimization approach. The proposal is compared to three main competitors (two handcrafted and one deep-based) and three adapted competitors (i.e., using some adaptation techniques: ETNet, E2VID, v2e), showing state-of-the-art performance in cross-domain generalization.

Strengths

Novel solution for a known problem: even if the relationship between frame and integrated event stream was already discovered in [19], the proposed representation alignment is a clever way to exploit off-the-shelf stereo networks for the intensity-event stereo task. The authors start from the definition of event triggering (Eq. 1) and find an aligned frame and event representation that sounds theoretically good (Eq. 4 and 5). The limited annotated data availability is a known problem also in other tasks (e.g., in Bartolomei L., et al. "Revisiting depth completion from a stereo matching perspective for cross-domain generalization" 3DV 2024, authors use "visual prompting" to cast stereo matching to the depth completion task to deal with out-of-domain scenarios).

Exhaustive experiments: the proposal is compared with different main competitors (i.e., SHEF, HSM, and DAEI (deep)). To further assess the performance of the proposal, the authors managed to adapt conventional frame-stereo networks (i.e., PSMNet, CREStereo, and DynamicStereo (the latter exploits multiple frames at once)) using two event-to-frame converters (i.e., ETNet, E2VID) and a conventional event-stereo network (i.e., SE-CFF) using a frame-to-event converter (i.e., v2e). As a suggestion, the proposal could also be compared to methods that fuse both frame and event stereo (e.g., Mostafavi M., et al. "Event-intensity stereo: Estimating depth by the best of both worlds." ICCV 2021), for example disabling the left event camera and the right frame camera. The ablation study confirms the benefits of the proposed representation alignment module.

The reading was smooth: The authors wrote the paper linearly and clearly. They expose the problem to the reader and following logical steps they arrive to the proposed solution. The clear figures help the reader understand the proposal and the results. Some minor imperfections can be fixed (see question paragraph).

Weaknesses

Monocular disparity refinement is not fully justified: This large module (rows 421-424) requires additional computation power (GPU, row 416) and yields only marginal gains (Tab. 2, rows 7-8). The computational power is a potential limitation (as highlighted in rows 449-451), and the monocular module could fail not only for wrong disparities from the stereo module (Tab. 2 rows 3-4, rows 266-267) but also for wrong relative disparities from the monocular module itself (e.g., optical illusions). Instead, the authors could have tried other strategies: for example, given the large number of frame stereo datasets, the frame stereo pair can be converted to the proposed representation (i.e., apply Eq. 4 to both images) and then the stereo network can be fine-tuned to better handle the novel event-frame alignment.

Related works are okay, but they could be extended: I suggest extending the related works with two additional related topics: i) "Visual Prompting": since "visual prompting" is the technique where authors took inspiration from, rows 42-45 could be extended as a related works paragraph; ii) "Event-Intensity Stereo Fusion": this task requires two cameras that could capture both events and intensity to estimate a disparity map. An example is the paper previously cited (Mostafavi M., et al. ICCV 2021). Since the suggested topics are relevant but not as important as those shown in the main paper, the authors could extend the related works in the appendix to save space.

"Visual Prompting" term is a bit abused: looking at [1], "visual prompting" refers to the technique of adapting a network for a different task adding learned perturbations in the input. It is true that the frame-based stereo network is used for frame-event stereo matching, however, the representation alignment completely changes the input of the stereo-matching network using a non-learnable transformation.

Questions

Before questions, I resume here the motivations behind my overall rating: the main idea of using the "temporal difference of frames" and "temporal integral of events" (already proposed in [19]) for representation alignment to achieve zero-shot Event-Intensity stereo is a novel contribution to the literature. Even if there are some concerns about the monocular module and the definition of "visual prompting", I believe that the proposed representation alignment is a valid contribution to the community.

Minor comments: 1) Row 153: shouldn't it be Eq. 4 instead of Eq. 3? 2) At the start of Eq. 14, shouldn't it be W^{(0)}_\mathbf{p} instead of W^{(0)}? 3) Fig. 2: Event camera and frame camera are placed in opposite order w.r.t. input column (i.e., "RGB Frames" and "Event Data") 4) Fig. 2: "Event-based filtering" is confidence C (Eq. 12)?

Limitations

The authors have addressed the limitations of the proposal. I suggest the authors insert the limitations of the monocular module, as previously discussed in the weaknesses section.

Author Response

Thank you for your insightful comments and positive feedback on our work. We address your concerns below.

S1. Comparison to more methods:

We appreciate the suggestion to compare with Mostafavi et al. [17]. However, for the slightly different setting where both frames and events are used in both views, no pre-trained models are publicly available. We have, however, included a comparison with the event-based stereo setting from their follow-up CFF [18] using their released checkpoint. Due to the character limit, please refer to the response to EJ3V - W1 for more details about the availability of publicly released code and checkpoints for related works in Table E.

W1. Disparity refinement module:

The inclusion of the disparity refinement module brings improvements, particularly in preserving sharp depth boundaries for objects like cars and buildings in challenging scenes with sparse events or textureless areas, as evident in the visual comparisons between Ours-DS and Ours-DS-DA in Figures 6, 7, 12, and 13. The marginal gains observed in the evaluation metrics (Table 2, Rows 7-8) are likely due to the sparsity of the ground truth disparity provided in the DSEC dataset (Figure 5).

We profiled the runtime and found that the disparity refinement module takes about 306.82 ms, accounting for 48.6% of the total inference time (630.36 ms) of the Ours-CR-DA variant. We can reduce this overhead while maintaining comparable performance by decreasing the number of iterations, as shown in Table C in the response to Global - Q2.

We appreciate the suggestion regarding fine-tuning a stereo network on synthetically transformed frame pairs. While promising, this approach presents challenges in accurately simulating event data from frames and potential domain gaps between synthetic and real data. Our zero-shot method offers a practical and robust solution that can easily integrate future advances in frame-based stereo matching and monocular depth estimation.

W2. Related works:

We will expand the related works section in the final version as suggested to include paragraphs on "Visual Prompting" and "Event-Intensity Stereo Fusion," citing and discussing relevant works, including Mostafavi et al. [17].

W3. Visual prompting terminology:

We agree that our usage differs slightly from [1] and is more aligned with [NeurIPS'23], which also uses a non-learnable transformation as a visual prompt. We will clarify this in the final version, emphasizing that while we don't use learned perturbations, our approach shares the core concept of modifying inputs to adapt pre-trained models without fine-tuning.

[NeurIPS'23] Yang et al., Fine-Grained Visual Prompting.

Q. Minor comments:

Thank you for your careful reading. We will correct all the issues you identified: 1) The reference will be changed to Eq. (4). 2) We will use the correct notation $W^{(0)}_\mathbf{p}$ in Eq. (14). 3) It will be adjusted to align the input order. 4) We will clarify that it refers to $C$ in Eq. (12). We will ensure all references and definitions are accurate in the final version.

Comment

Dear Authors,

Thanks for your response and the global rebuttal.

S1: Thanks for the clarification.

W2, W3 and Q: Thanks, I'm satisfied with your response.

W1

  • Authors: The inclusion of the disparity refinement module brings improvements, particularly in preserving sharp depth boundaries for objects like cars and buildings in challenging scenes with sparse events or textureless areas, as evident in the visual comparisons between Ours-DS and Ours-DS-DA in Figures 6, 7, 12, and 13. Thanks, I share the request of reviewer EJ3V to better quantify this using numerical results.

  • Authors: We can reduce this overhead while maintaining comparable performance by decreasing the number of iterations. Yes, I see from Table C. It seems from Table 2 rows 7-8 that the refinement requires at least 100 iterations to reduce error w.r.t. table 2 row 7.

  • Authors: We appreciate the suggestion regarding fine-tuning a stereo network on synthetically transformed frame pairs. While promising, this approach presents challenges in accurately simulating event data from frames... I apologize, I did not express my idea correctly: given a synthetic sequence stereo dataset (such as VirtualKITTI 2), we can apply Eq. 4 to the left image at time t and the left image at time t+1. At the same moment, we can apply Eq. 4 to the right image at time t and the right image at time t+1. As you can see from Figure B of your global rebuttal, "Temporal Grad. (Left)" and "Temporal Integ. (Right)" are quite similar: my idea is to substitute "Temporal Integ. (Right)" with "Temporal Grad. (Right)", without any event data simulation.

Global PDF Figure A: It seems that those failure cases could be resolved using a global scale and shift instead of your linear transformation. This could help justify better the monocular module. What do you think?

Best regards,

Reviewer PsoF

Comment

Thank you for your kind and constructive feedback. We appreciate your insights and the opportunity to address your questions and suggestions.

Q1. Analysis of improvement of the refinement module in edge and textureless areas

An analysis of the EPE improvements on the interlaken_00_c sequence is shown in the table below. We differentiate between "textureless areas" and "edge areas" according to the values in the left-view temporal gradient images, which are derived from the differences in pixel values between consecutive frames. The results show that the refinement module yields more improvements in static areas. While depth estimation is more challenging for stereo matching between events and frames in these regions, it is generally easier for monocular depth estimation due to the large smooth areas.

Table G. Analysis of improvement of the refinement module in edge and textureless areas

| Area | EPE w/o refinement | EPE w/ refinement | Improvement |
| --- | --- | --- | --- |
| Edge areas | 1.56 | 1.518 | 0.042 |
| Textureless areas | 1.393 | 1.277 | 0.116 |
| Total | 1.484 | 1.408 | 0.075 |
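For reference, a minimal sketch of how such a breakdown could be computed is given below; the threshold on the temporal-gradient magnitude and the mask names are assumptions, since the response does not state the exact criterion.

```python
import numpy as np

def region_epe(pred, gt, temporal_grad, valid, edge_thresh=0.05):
    # pred, gt, temporal_grad: 2-D arrays; valid: boolean mask of pixels with
    # ground-truth disparity. Pixels whose temporal-gradient magnitude exceeds
    # the threshold are treated as "edge areas", the rest as "textureless".
    err = np.abs(pred - gt)
    edge_mask = (np.abs(temporal_grad) > edge_thresh) & valid
    flat_mask = (np.abs(temporal_grad) <= edge_thresh) & valid
    return {
        "edge": err[edge_mask].mean(),
        "textureless": err[flat_mask].mean(),
        "total": err[valid].mean(),
    }
```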

Q2. Finetuning versus refinement

Thank you for the suggestion and the clarification. We have attempted a similar approach prior to exploring the zero-shot approach, motivated by the desire to leverage prior knowledge in the frame domain. However, we found that fine-tuning a large model without encountering catastrophic forgetting posed significant challenges given our computational resources. We will explore this approach more in-depth in future work.

Q3. Global scale versus spatially-varying scales

Thank you for your suggestion. In the examples showcased, a global scale approach could indeed perform better. However, for most cases, spatially-varying scales outperform global scales, as demonstrated in our comparison with the ablation study "DA w/ GT scale", which serves as an upper bound for all global-scale-based methods. Our proposed spatially-varying scale scheme can encompass global scales, and it might be beneficial to design a hybrid method that automatically switches between the two models based on specific criteria. We will explore this in future work.

Comment

Dear authors,

Thanks for your response.

I'm satisfied with the rebuttal and I will raise my rating. As a last thing, I would like to add two more suggestions:

  • as previously pointed out in this rebuttal, authors could highlight that the ZEST framework is future-proof: any advancement in monocular depth estimation or frame stereo matching could further increase the accuracy;

  • as future work, authors could try sparse stereo matching networks and exploit the spatially varying scales, maybe using a strong regularization term.

Best regards,

Reviewer PsoF

Comment

Thank you for your positive feedback. We greatly appreciate your valuable suggestions. We will highlight the potential for accuracy improvements with future advancements in the final version, and explore sparse stereo matching networks and spatially varying scales with strong regularization in our future work.

Review
Rating: 6

This paper proposes a zero-shot framework called ZEST, which employs a representation alignment module as a visual prompt to enable the use of off-the-shelf image-oriented stereo models. To further improve robustness, a cue-guided disparity refinement method is proposed.

By comparing the imaging principle of frames and events, the representation alignment module establishes an explicit intermediate representation that bridges the gap between them.

Furthermore, to enable the correct estimation in textureless regions of the frame or the static regions in the events view, the cue-guided disparity refinement estimates a scale map and a shift map through optimization and then computes the refined disparity map.

Benefiting from the great generalization ability of the foundation model, the ZEST achieves promising disparity estimation results in a zero-shot manner on the DSEC dataset. The effectiveness of the two proposed modules is well proved through ablation studies.

Strengths

  1. The paper writing is good.
  2. The proposed method works in a zero-shot manner, which is valuable for the community.
  3. The paper proposes a potentially widely applicable representation alignment method for cross-modal tasks with events and frames.

Weaknesses

  1. Only one deep-learning-based model (DAEI) is compared and only one benchmark (DSEC) is tested, which weakens the reliability of the proposed ZEST.

  2. Some written mistakes are found. For example, in line 153, the temporal difference map is defined in equation (4), not (3).

  3. The optimization detail of the cue-guided disparity refinement should be included in the main body of the paper, not in the appendix. In addition, as the optimization is utilized during inference, the inference time should be compared and discussed.

Questions

  1. Why is only one benchmark tested? The MVSEC dataset also seems to provide stereo data. Would you consider also providing results on the MVSEC benchmark?

  2. Is there any alternative approach to constructing the scale map and the shift map without utilizing optimization?

  3. The resolutions of the image sensor and the event sensor are usually different; does this influence the estimation result?

Limitations

The authors have already reported the limitations of their work, including the potential lack of ability to capture the intricacies of the modality gap between frames and events, and the heavy computation cost because of the implementation of the foundation model.

However, the computation cost of the optimization in the cue-guided disparity refinement part is not discussed, which is important for the final deployment of the model.

Author Response

Thank you for your thoughtful feedback and positive assessment of our work. We address each point below.

W1. Comparison with more methods:

In the main text, we have compared our approach with several methods (Table 1): the deep learning-based method DAEI [33], after obtaining their code upon request; the traditional methods HSM [13] and SHEF [24], both of which have released code; and the event-based stereo method CFF [18], using their released checkpoint. Notably, DAEI [33] is recognized in the literature for achieving state-of-the-art performance.

In response to the reviewer's suggestion, we conducted a thorough survey of the availability of publicly released code and checkpoints for related works, as summarized in Table E. Unfortunately, to the best of our knowledge, there are currently no publicly available implementations for deep learning-based event-intensity asymmetric stereo matching methods [39, 33, 3]. Even for the slightly different scenario where both frames and events are used in both views, pre-trained models are not publicly available.

We will continue to track the progress of these methods and actively engage with the authors to request access to their code or checkpoints. We are also in the process of reproducing the results from [39] and plan to include them in the final version of our paper. Our commitment is to provide a comprehensive evaluation and comparison as more resources become available.

Table E. Availability of publicly released code and checkpoints for related works.

| Publications | Method | Setting of inputs | Code | Checkpoint |
| --- | --- | --- | --- | --- |
| [13] | HSM | Event-intensity asymmetric stereo | | N/A |
| [24] | SHEF | Event-intensity asymmetric stereo | | N/A |
| [39] | HDES | Event-intensity asymmetric stereo | | |
| [33] | DAEI | Event-intensity asymmetric stereo | | |
| [3] | SAFE-SfM | Event-intensity asymmetric stereo | | |
| [17] | EIS | Event-intensity stereo | | |
| [18] | SE-CFF | Event-intensity stereo | | |
| [18] | SE-CFF | Event-based stereo | | |
| [5] | ADES | Event-based stereo | | |
| [23] | DDES | Event-based stereo | | |
| [37] | TSES | Event-based stereo | | |
| [36] | IDERV | Event-based stereo | | N/A |

W2. Written mistakes:

We apologize for the error. This will be corrected in the final version, along with a thorough review to address any other minor mistakes.

W3&Limit. Disparity refinement:

We agree that the optimization details for cue-guided disparity refinement should be in the main text and will move this information from the appendix in the final version.

Please refer to the response to Global - Q2 for the computational overhead introduced by the refinement module in Table C.

Q1. Evaluation on additional datasets:

The DSEC dataset covers a wide range of scenarios, including various lighting conditions, motion patterns, and scene complexities, as illustrated in Figures 4 and 14. However, we agree with the reviewers that evaluation on more datasets would provide a more comprehensive assessment.

To demonstrate the generalization ability of our approach, we evaluated Ours-CR-DA on sequences from two additional datasets: MVSEC [38] (DAVIS sensor, 2018) and M3ED [CVPRW'23] (Prophesee sensor). The quantitative results are shown in Table F, with the qualitative results in Figure B of the attached PDF. These results demonstrate ZEST's robust generalization across different environments, motion patterns, and sensor characteristics, supporting its applicability in diverse scenarios.

We acknowledge the current limitations in our comparison with other methods on these datasets and the small number of sequences used. The distinct data formats of the new datasets and slow download speeds posed challenges to the timely completion of our experiments, preventing us from providing more extensive evaluations within the given timeframe. In the final version, we will include more sequences and conduct comparisons with additional methods to offer a more thorough evaluation.

[CVPRW'23] Chaney et al., M3ED: Multi-Robot, Multi-Sensor, Multi-Environment Event Dataset.

Table F. Quantitative results of the proposed zero-shot disparity estimation method on additional datasets.

| Dataset | Sequence | Test clip | EPE | RMSE | 3PE | 2PE | 1PE |
| --- | --- | --- | --- | --- | --- | --- | --- |
| MVSEC | indoor_flying1 | 400-900 | 2.737 | 3.295 | 78.444 | 35.869 | 15.218 |
| M3ED | car_urban_day_horse | 120-280 | 2.161 | 3.487 | 60.108 | 31.992 | 19.322 |

Q2. Alternatives for constructing scale and shift maps.

We considered guided filtering-based alternatives without explicit optimization, which apply a linear transformation to respect its structure while maintaining the absolute amplitudes of the edges of the binocular disparity prediction. However, we found that the guided filtering-based method produced inferior results with blurry boundaries, as shown in Figure C of the attached PDF. We agree this is an interesting direction for future work to potentially improve efficiency.
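For illustration, a minimal sketch of such a guided-filtering alternative is shown below, assuming opencv-contrib-python (cv2.ximgproc) is available; the choice of guidance image and the radius/eps parameters are assumptions, not the exact configuration the authors tested.

```python
import cv2
import numpy as np

def guided_filter_refine(d_bino, d_mono, radius=16, eps=1e-3):
    # Filter the binocular disparity using the monocular prediction as the
    # guidance image, so the output follows the monocular structure while
    # keeping the binocular amplitudes (no explicit optimization involved).
    guide = d_mono.astype(np.float32)
    src = d_bino.astype(np.float32)
    return cv2.ximgproc.guidedFilter(guide, src, radius, eps)
```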

Q3. Resolution differences:

Our method assumes input data from both views have the same spatial resolution, aligning with the DSEC dataset. If necessary, different resolutions between image and event sensors can be handled through appropriate resampling.

Comment

The rebuttal helps to solve some concerns. Still, the improvement from the disparity refinement module should be better quantified by offering more detailed numerical results. How much gains are obtained on edges and textureless areas? Such a refinement tends to be time-consuming. Would you consider presenting the computational costs of this process? Besides, regarding the results on other datasets, the rebuttal only shows the scores of the proposed solution but does not compare against a baseline.

Overall, despite these minor issues, given the novelty and applicability of the presented work, the reviewer would like to maintain a rating of weak acceptance.

Comment

Thank you for your valuable and constructive feedback. We appreciate the recognition of the novelty and applicability of our work.

Q1. Analysis of improvement of the refinement module in edge and textureless areas

Thank you for the suggestion. We agree that a more detailed analysis of the disparity refinement module can provide deeper insights. Below is a table detailing the EPE improvements on the interlaken_00_c sequence. We differentiate between "textureless areas" and "edge areas" according to the values in the left-view temporal gradient images, which are derived from the differences in pixel values between consecutive frames. The results show that the refinement module yields more improvements in static areas. While depth estimation is more challenging for stereo matching between events and frames in these regions, it is generally easier for monocular depth estimation due to the large smooth areas.

Table G. Analysis of improvement of the refinement module in edge and textureless areas

| Area | EPE w/o refinement | EPE w/ refinement | Improvement |
| --- | --- | --- | --- |
| Edge areas | 1.56 | 1.518 | 0.042 |
| Textureless areas | 1.393 | 1.277 | 0.116 |
| Total | 1.484 | 1.408 | 0.075 |

Q2. Computational cost of the refinement module

The computational cost of our 500-iteration refinement module is 306.82 ms per image on an RTX 4090 GPU. Please refer to the last row of Table A in the response to Global - Q1 for details. We further explore variants with fewer iterations in Table C in the response to Global - Q2, demonstrating that 100 iterations (taking 70.92 ms) can yield improvements in terms of both EPE and 3PE. These suggest that a balance between accuracy and computational efficiency can be achieved by adjusting iteration counts.

Q3. Baseline comparisons on more datasets

We acknowledge the importance of comparing our proposed solution against baseline methods on different datasets. However, the input stereo data for most sequences exceeds 50 GB, which results in slow download speeds. Additionally, the distinct data formats require significant preprocessing to ensure compatibility with the compared methods. We are actively working on this analysis and commit to including baseline results in the final version of our work.

Thank you once again for your constructive feedback and for your rating of weak acceptance.

Review
Rating: 6

The paper introduces a novel zero-shot framework for event-intensity asymmetric stereo matching. It leverages visual prompts to align frame and event representations and utilizes monocular depth estimation and stereo-matching models pre-trained on diverse image datasets. The key contributions include a visual prompting technique for representation alignment and a monocular cue-guided disparity refinement module. Experiments on the DSEC dataset demonstrate superior performance and generalization ability compared to existing methods.

Strengths

(1) The paper introduces a novel approach to event-intensity asymmetric stereo matching by leveraging visual prompts to align frame and event representations. From my perspective, this technique is quite an improvement in the field, as it addresses the challenge of modality alignment without requiring additional training data.

(2) By utilizing monocular depth estimation and stereo-matching models pre-trained on diverse image datasets, the authors provide a practical solution that capitalizes on existing robust models. This approach demonstrates ingenuity in adapting well-established methods from the image domain to the event domain.

(3) As for the writing, I think this paper is well-written, with clear and comprehensive mathematical formulations. The explanation of the representation alignment and disparity refinement processes is logically sound, providing a solid foundation for the proposed method.

Weaknesses

(1) The experimental evaluation is limited to the DSEC dataset, which raises questions about the generalizability of the results. While the dataset is comprehensive, additional experiments on other datasets would provide a more robust evaluation of the method's generalization capabilities.

(2) The paper lacks a detailed analysis of the computational efficiency and scalability of the proposed method. Understanding the computational requirements is crucial for assessing the feasibility of deploying the method in real-time applications or resource-constrained environments.

(3) The evaluation does not include a broad enough range of baseline methods, particularly those that do not rely on off-the-shelf models. A more diverse set of baselines would provide a clearer picture of the proposed method's relative performance and highlight its unique contributions.

Questions

  1. First about the Scalability: How does the proposed framework scale with the size and complexity of input data? Are there specific optimizations or strategies to enhance its efficiency, particularly for real-time applications?

  2. How robust is the proposed framework to variations in input data quality and resolution? Are there specific scenarios or conditions where the method is likely to fail or significantly underperform? Understanding the robustness of the method is essential for assessing its reliability in real-world applications.

Limitations

(1) The framework's zero-shot setting shows potential, but its ability to generalize to new and unseen environments with different characteristics remains uncertain. The paper does not address how the method handles scenarios with significant deviations from the training data, such as different lighting conditions, motion patterns, or sensor noise levels. The robustness of the approach in extreme conditions, such as rapid scene changes or very low light environments where event data might be sparse or noisy, is not adequately tested.

(2) Computational Complexity: The paper does not discuss the computational overhead introduced by the monocular cue-guided disparity refinement module. The reliance on large pre-trained models could limit the adaptability and scalability of the proposed framework. In scenarios where fine-tuning or customization is necessary for specific tasks or datasets, the method may face practical limitations due to the size and complexity of these models.

Author Response

Thank you for your valuable comments and insightful suggestions. We address each concern below.

W1. Evaluation on more datasets:

The DSEC dataset covers a wide range of scenarios, including various lighting conditions, motion patterns, and scene complexities, as illustrated in Figures 4 and 14. However, we agree with the reviewers that evaluation on more datasets would provide a more comprehensive assessment.

To demonstrate the generalization ability of our approach, we evaluated Ours-CR-DA on sequences from two additional datasets: MVSEC [38] (DAVIS sensor, 2018) and M3ED [CVPRW'23] (Prophesee sensor). The quantitative results are shown in Table F in the response to EJ3V - Q1, with the qualitative results in Figure B of the attached PDF. These results demonstrate ZEST's robust generalization across different environments, motion patterns, and sensor characteristics, supporting its applicability in diverse scenarios.

[CVPRW'23] M3ED: Multi-Robot, Multi-Sensor, Multi-Environment Event Dataset

W2&Q1&Limit2. Computational efficiency and scalability:

Due to the character limit, please refer to the response to Global - Q1 for an analysis to computational efficiency in Tables A and B.

We have also examined the scalability of our method with varying input resolutions. As presented in Table D, GPU memory usage and runtime only increase marginally as the resolution scales up. Notably, the Depth Anything model internally uses a fixed resolution for inference, which keeps its memory usage constant.

Various strategies can be applied to speed up our method. For example, we can run the stereo matching module and the monocular depth estimation module in parallel. Due to the flexibility of our framework, our method can also be accelerated by using more lightweight models. We will further explore lightweight alternatives for the stereo and monocular modules (e.g., Depth-Anything-Small) that could achieve speedups for real-time applications, and we will provide the results and discussions in the final version. Our modular design also allows for further optimizations like model pruning and quantization.

Table D. Performance and computational cost comparison for varying input data sizes.

| Input spatial resolution | Pixels | CRES runtime (ms) | CRES GPU Mem (MB) | DA runtime (ms) | DA GPU Mem (MB) | Refinement runtime (ms) | Refinement GPU Mem (MB) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 240x320 | 1x | 156.59 | 2064 | 81.27 | 3640 | 300.96 | 1688 |
| 480x640 | 4x | 243.55 | 2078 | 80.00 | 3640 | 306.82 | 1736 |
| 720x960 | 9x | 624.11 | 2738 | 80.26 | 3640 | 311.70 | 1808 |

W3. Comparison with more methods:

In the main text, we have compared our approach with several methods (Table 1): the deep learning-based method DAEI [33], after obtaining their code upon request; the traditional methods HSM [13] and SHEF [24], both of which have released code; and the event-based stereo method CFF [18], using their released checkpoint. Notably, DAEI [33] is recognized in the literature for achieving state-of-the-art performance.

In response to the reviewer's suggestion, we conducted a thorough survey of the availability of publicly released code and checkpoints for related works, as summarized in Table E in the response to EJ3V - W1. Unfortunately, to the best of our knowledge, there are currently no publicly available implementations for deep learning-based event-intensity asymmetric stereo matching methods [39, 33, 3]. Even for the slightly different scenario where both frames and events are used in both views, pre-trained models are not publicly available.

We will continue to track the progress of these methods and actively engage with the authors to request access to their code or checkpoints. We are also in the process of reproducing the results from [39] and plan to include them in the final version of our paper. Our commitment is to provide a comprehensive evaluation and comparison as more resources become available.

Q2. Robustness to input data quality:

Our method exhibits robustness to variations in data quality. Figure 4 demonstrates consistent performance under challenging conditions: sparse event inputs in Column 2, rapid scene changes in Column 3, and extreme low-light conditions in Column 4. Figure 14 further showcases robustness in diverse scenarios. These results indicate stable performance even with moderate input quality degradations.

Limit1. Zero-shot generalization.

We recognize the limitations of a purely zero-shot setting and plan to explore techniques for efficient adaptation to new domains with limited data in future work.

Limit2. Computational overhead.

We profiled the runtime and found that the disparity refinement module takes about 306.82 ms, accounting for 48.6% of the total inference time (630.36 ms) of the Ours-CR-DA variant. We can reduce this overhead while maintaining comparable performance by decreasing the number of iterations, as shown in Table C in the response to Global - Q2.

The inclusion of the disparity refinement module brings improvements, particularly in preserving sharp depth boundaries for objects like cars and buildings in challenging scenes with sparse events or textureless areas, as evident in the visual comparisons between Ours-DS and Ours-DS-DA in Figures 6, 7, 12, and 13. The marginal gains observed in the evaluation metrics (Table 2, Rows 7-8) are likely due to the sparsity of the ground truth disparity provided in the DSEC dataset (Figure 5).

Comment

Thank you for the response. I appreciate that you provided an analysis of the computational efficiency and scalability of the proposed method; it seems to offer a proper solution for real-time applications. Considering the applicability of this paper, I am glad to increase my rating to weak acceptance. Good luck.

Comment

Thank you for your thoughtful feedback and for recognizing the method's applicability.

Review
Rating: 8

This paper proposes a novel visual prompting technique for event-intensity asymmetric stereo matching. The key idea is to align event and frame representations using visual prompts, enabling the use of off-the-shelf stereo matching models for event-intensity pairs. The key contributions are: 1) A visual prompting technique to align representations between events and frames, enabling the use of off-the-shelf stereo models without modification; 2) A monocular cue-guided disparity refinement module to improve robustness in regions with few events or textures; 3) Demonstration of superior zero-shot evaluation performance and enhanced generalization compared to existing approaches. The paper presents a significant advancement in bridging the gap between event-based and traditional vision for stereo matching tasks.

Strengths

  1. The visual prompting approach for aligning event and frame representations is quite enlightening. It creatively addresses the domain gap between events and intensity images, by aligning the physical formulation of frames and events using temporal difference and integral. It opens new possibilities for leveraging powerful pre-trained models in event-based vision tasks.
  2. The experimental evaluation is comprehensive and rigorous. The authors demonstrate clear improvements over existing methods on standard benchmarks, validating the effectiveness of their approach. The ablation studies provide valuable insights into the contribution of each component.
  3. The paper is well-structured and clearly explains the methodology and implementation details. The authors provide sufficient information for reproducibility, which is crucial for the research community.
  4. This work has broad impact for the field of event-based vision. By enabling the use of off-the-shelf stereo models for event-intensity pairs, it significantly lowers the barrier for applying advanced stereo matching techniques to event-based data. This approach could potentially be extended to other similar tasks, making it a valuable contribution to the field.

Weaknesses

  1. The paper would benefit from a more detailed analysis of the computational efficiency and resource requirements of the proposed method, especially considering the use of two off-the-shelf models.
  2. Some technical details and abbreviations could be explained more thoroughly to enhance readability for a broader audience.

Questions

  1. Could the authors provide an analysis of the execution time and memory requirements for each variant of the ZEST framework?
  2. How does the ZEST framework address potential unknown event triggering thresholds, and how does this affect the representation alignment?
  3. Have the authors considered applying the visual prompting technique to other asymmetric vision tasks beyond stereo matching? What challenges do they anticipate?
  4. How sensitive is the performance of ZEST to the choice of off-the-shelf stereo matching model? Were significant variations observed when using different base models?
  5. Figure 5 shows very sparse disparity maps in the ground truth. How do you evaluate the metrics given these sparse disparity maps? Do you only consider the pixels with valid disparity values, or is there a specific strategy for handling the sparse nature of the ground truth?
  6. What does "Spatial Integ." mean in Figure 6? In column 4 of Figure 6, the textures of the visual prompts for the two views are not identical, yet the matching results in the rightmost column are surprisingly good. Why does the stereo matching perform well despite these differences?
  7. In Table 2, what does "GT scale" refer to in row 6? How is this ground-truth scale obtained, and what is its significance in the context of the ablation study?

Limitations

The authors have adequately addressed the limitations of their work in Appendix A.3 and discussed potential societal impacts in Section A.4. A more detailed analysis of potential failure cases or challenging scenarios would further strengthen the paper.

Author Response

We appreciate your thorough review and insightful questions. We are pleased that you found our work technically strong, novel, and impactful. We address each point below.

W1&Q1. Computational efficiency analysis:

We profiled the computational complexity of our framework's modules on a machine with an Intel i7-13700K CPU and an NVIDIA RTX 4090 GPU, using an input resolution of 480x640. The performance values were evaluated on the interlaken_00_c sequence of the DSEC dataset unless otherwise specified. A breakdown of the computational cost in each algorithm stage is shown in Table A in the response to Global - Q1, and the total computational costs of each method variant, along with the compared methods, are shown in Table B in the response to Global - Q1.

The Ours-CR-DA variant achieves a runtime of about 630.36 ms per frame with 7454 MB GPU memory usage. The disparity refinement module is the most computationally intensive component, consuming 48.6% of the total runtime. In the Ours-DS-DA variant, the DS model dominates the computational cost, while the DA and refinement modules add minimal overhead. We will explore optimizations to improve efficiency in future work and include this discussion in the final version.

W2. Technical clarity:

We will add explanations for technical terms and concepts to enhance readability and accessibility for a broader audience.

Q2. Unknown event triggering thresholds: Our method manages unknown event-triggering thresholds with a normalization operation (L404-406). The normalization eliminates the threshold $c$ in Eq. (17), bridging temporal event integrations and temporal image gradients. In the final version, we will move this explanation to the main text for clarity.
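As a rough sketch of why a normalization can cancel the threshold (under the standard event-generation assumption; Eq. (17) itself is not reproduced here):

```latex
% The event-side map scales linearly with the unknown threshold c:
%   c\,S(\mathbf{p}), \quad S(\mathbf{p}) = \textstyle\sum_i p_i(\mathbf{p}).
% Any normalization that divides by a statistic which is itself linear in c
% (e.g., zero-mean / unit-std) therefore removes c:
\[
\frac{c\,S(\mathbf{p}) - \operatorname{mean}(c\,S)}{\operatorname{std}(c\,S)}
  = \frac{S(\mathbf{p}) - \operatorname{mean}(S)}{\operatorname{std}(S)},
\]
% leaving a representation directly comparable to the normalized frame-side
% temporal gradient.
```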

Q3. Extending visual prompting: We see potential in extending the visual prompting technique to other tasks, such as optical flow estimation or object detection. Challenges include adapting the prompting mechanism to accommodate different data characteristics and ensuring prompt effectiveness across diverse modalities. For instance, in optical flow estimation, the prompts could encode temporal information about object motion, while for object detection, they could highlight salient features relevant to object recognition.

Q4. Sensitivity to stereo model choice:

As shown in Table 1, the proposed method, when combined with different stereo matching models such as CREStereo (CR) [14] and DynamicStereo (DS) [12], exhibits variations in performance. Overall, variants using DS perform better in terms of EPE and RMSE, while those using CR excel in 3PE and 2PE. Importantly, the proposed method consistently improves accuracy across various stereo matching models, highlighting the robustness of our intermediate representation.

Q5. Evaluation with sparse ground truth: Consistent with previous works on the DSEC dataset, we evaluate metrics using only pixels with valid ground truth disparity values. This approach focuses on meaningful disparities and excludes void or uncertain areas. However, we acknowledge that this evaluation strategy has limitations due to the sparsity of the ground truth. Future work could explore alternative evaluation metrics specifically designed for sparse data.
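A minimal sketch of this masked evaluation is given below; the validity convention (zeros marking invalid pixels) and the nPE definition (fraction of valid pixels with absolute error above n pixels) are common conventions assumed here, not necessarily the exact DSEC tooling.

```python
import numpy as np

def sparse_disparity_metrics(pred, gt):
    # Evaluate only where ground-truth disparity is valid (assumed > 0 here).
    valid = gt > 0
    err = np.abs(pred[valid] - gt[valid])
    return {
        "EPE": err.mean(),
        "RMSE": np.sqrt((err ** 2).mean()),
        "3PE": 100.0 * (err > 3.0).mean(),
        "2PE": 100.0 * (err > 2.0).mean(),
        "1PE": 100.0 * (err > 1.0).mean(),
    }
```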

Q6. "Spatial Integ." and stereo matching: "Spatial Integ." refers to the spatial integral of events, capturing accumulated changes over time in different positions of the sensor corresponding to the same physical position.

The slight differences in textures between views are due to the asynchronous nature of events and manufacturing imperfections. Despite the differences, our method performs well due to the surprising robustness of state-of-the-art stereo models, which are able to exploit high-level feature similarities despite low-level texture inconsistencies.

Q7. "GT scale" in ablation study:

In our method, we model the relationship between the monocular-predicted disparity $D^\text{mono}$ and the stereo-predicted disparity $D^\text{bino}$ as

$$D_{i,j}^{\text{bino}} = k_{i,j} D_{i,j}^{\text{mono}} + b_{i,j},$$

where $k$ and $b$ are spatially-varying scale and shift maps obtained by optimization. However, there also exists a simpler model where $k$ and $b$ are globally consistent:

$$D_{i,j}^{\text{bino}} = k_{\text{global}} D_{i,j}^{\text{mono}} + b_{\text{global}}.$$

In the "GT scale" ablation experiment, we show that the latter model is insufficient. We calculate the optimal $k_{\text{global}}$ and $b_{\text{global}}$ by fitting the ground truth disparities to the relative disparities linearly for each frame. The corresponding result, which is the upper bound of all algorithms using global scales, is worse than the currently proposed method, proving that spatially-varying scales and shifts are essential.

Limit. Failure cases:

Representative failure cases are shown in Figure A of the attached PDF.

  • Row 1 demonstrates the impact of noisy events on the visual prompts. When the events are noisy, the visual difference between the two views increases. The stereo model can robustly manage this inconsistency in most cases, such as in rows 1 and 2 of Figure B, but it may occasionally fail. In row 1 of Figure A, the CR stereo model produced erroneous results. The monocular DA predictions improved the disparity estimation via the refinement module, but the final result was still suboptimal.

  • Row 2 shows the impact of sparse events. The sparse events did not provide sufficient information for stereo matching, which the refinement module could not fully compensate for.

We will expand our discussion of potential failure cases in the final version.

Comment

Thank you for your detailed clarification. I am satisfied with the responses, which have fully addressed my concerns. I really recommend an acceptance of this solid and insightful paper.

Best regards, Reviewer ggyM

Comment

Thank you very much for your positive feedback and for recognizing the novelty and applicability of our work.

Author Response (Global Rebuttal)

We sincerely thank the reviewers for their valuable feedback and insightful suggestions. We appreciate the recognition of the novelty and potential impact of our visual prompting technique, and the advancement our work presents in bridging event-based and traditional frame-based vision for stereo matching tasks. We are committed to thoroughly addressing each aspect of the feedback. Below, we address shared concerns and provide detailed responses to each reviewer's specific points.

Q1. Computational efficiency:

We profiled the computational complexity of our framework's modules on a machine with an Intel i7-13700K CPU and an NVIDIA RTX 4090 GPU, using an input resolution of 480x640. The performance values were evaluated on the interlaken_00_c sequence of the DSEC dataset unless otherwise specified. A breakdown of the computational cost in each algorithm stage is shown in Table A, and the total computational costs of each method variant, along with the compared methods, are shown in Table B.

The Ours-CR-DA variant achieves a runtime of about 630.36 ms per frame with 7454 MB GPU memory usage. The disparity refinement module is the most computationally intensive component, consuming 48.6% of the total runtime. In the Ours-DS-DA variant, the DS model dominates the computational cost, while the DA and refinement modules add minimal overhead. We will explore optimizations to improve efficiency in future work and include this discussion in the final version.

Table A. Computational complexity analysis of each stage.

| Stage | GPU Memory (MB) | Params (M) | Runtime (ms) | Equivalent FPS |
| --- | --- | --- | --- | --- |
| Data preparation | 0 | -- | 39.06 | 25.59 |
| DS | 9224 | 21.47 | 8515.32 | 0.11 |
| CRES | 2078 | 5.43 | 243.55 | 4.11 |
| DA | 3640 | 335.32 | 79.99 | 12.5 |
| MiDaS | 3344 | 344.05 | 31.14 | 32.1 |
| Refinement | 1736 | -- | 306.82 | 3.25 |

Table B. Computational complexity analysis of various methods.

| Method | 3PE | GPU Memory (MB) | Params (M) | Runtime (ms) | Equivalent FPS |
| --- | --- | --- | --- | --- | --- |
| SHEF | 54.37 | 0 | -- | 28944.85 | 0.03 |
| HSM | 33.08 | 766 | -- | 224.85 | 4.44 |
| DAEI | 86.96 | 3238 | 11.25 | 75.15 | 13.3 |
| DS+DA | 15.05 | 14600 | 356.79 | 8902.13 | 0.11 |
| DS+MiDaS | 14.91 | 14304 | 365.52 | 8853.27 | 0.11 |
| CRES+DA | 9.84 | 7454 | 340.75 | 630.36 | 1.58 |
| CRES+MiDaS | 29.26 | 7158 | 349.48 | 581.51 | 1.71 |

Q2. Disparity refinement module:

The runtime profiling shows that the disparity refinement module takes about 306.82 ms, accounting for 48.6% of the total inference time (630.36 ms) of the Ours-CR-DA variant. We can reduce this overhead while maintaining comparable performance by decreasing the number of iterations, as shown in Table C.

The inclusion of the disparity refinement module brings improvements, particularly in preserving sharp depth boundaries for objects like cars and buildings in challenging scenes with sparse events or textureless areas, as evident in the visual comparisons between Ours-DS and Ours-DS-DA in Figures 6, 7, 12, and 13. The marginal gains observed in the evaluation metrics (Table 2, Rows 7-8) are likely due to the sparsity of the ground truth disparity provided in the DSEC dataset (Figure 5).

Table C. Computation cost analysis of the disparity refinement module across different iterations.

| Iterations | EPE | 3PE | Runtime (ms) | Equivalent FPS |
| --- | --- | --- | --- | --- |
| 0 | 1.487 | 7.785 | 4.2 | 238.06 |
| 50 | 1.488 | 8.028 | 42.39 | 23.59 |
| 100 | 1.451 | 7.457 | 70.92 | 14.1 |
| 200 | 1.43 | 7.27 | 127.75 | 7.82 |
| 300 | 1.42 | 7.234 | 188.14 | 5.31 |
| 400 | 1.413 | 7.227 | 247.63 | 4.03 |
| 500 (Ours) | 1.409 | 7.23 | 306.82 | 3.25 |

List of figures in the attached PDF

Figure A. Examples of failure cases for the proposed method, illustrating scenarios with excessive noise and sparse event data that impact the reliability of visual prompts and lead to suboptimal stereo matching results. From left to right: Frame & Ground Truth (Left), Temporal Integration (Right), CR Predictions, DA Predictions, Ours-CR-DA.

Figure B. Comparison of disparity estimation results for real data from the MVSEC and M3ED datasets. From left to right: Frame & Ground Truth (Left), Temporal Gradient (Left), Event (Right), Temporal Integration (Right), Ours-CR-DA.

Figure C. Comparison between the proposed disparity refinement and guided filtering-based alternatives, demonstrating the advantages of our approach in maintaining sharp depth boundaries and handling textureless regions. From left to right: Frame & Ground Truth (Left), DS Predictions, DA Predictions, DS-DA with guided filtering, Ours-DS-DA.

Final Decision

Since the first round, reviewers recognized the value of the proposed method to perform RGB-event stereo starting from an interesting intermediate image representation that enables the deployment of off-the-shelf stereo networks pre-trained on conventional images. However, they also raised concerns and requested clarifications. The main issues regarded the experimental results being limited to the DSEC dataset, as well as the actual effectiveness of the refinement module and its runtime. The authors provided a detailed and persuasive response to reviewers' questions in the rebuttal/discussion, and all reviewers recommended acceptance. Like the reviewers, the AC finds the proposal original and valuable for the community and recommends acceptance. However, a relevant point that needs to be addressed in the final version is an exhaustive evaluation (with baselines) on the MVSEC and M3ED datasets, as requested by Reviewer EJ3V.