PaperHub
6.4/10 · Poster · 4 reviewers
Individual ratings: 4, 4, 3, 5 (min 3, max 5, std 0.7; mean 4.0)
Confidence · Novelty: 2.8 · Quality: 2.8 · Clarity: 3.0 · Significance: 2.3
NeurIPS 2025

Re-coding for Uncertainties: Edge-awareness Semantic Concordance for Resilient Event-RGB Segmentation

OpenReview · PDF
Submitted: 2025-05-11 · Updated: 2025-10-29
TL;DR

A novel multi-modality framework for resilient event-RGB semantic segmentation under extreme conditions.

Abstract

Keywords

event camera, semantic segmentation, event-based segmentation, multi-modality fusion, uncertainty modeling

Reviews and Discussion

Review (Rating: 4)

This paper proposes an "Edge-awareness Semantic Concordance (ESC)" framework for event-RGB semantic segmentation, specifically targeting extreme conditions like low light and fierce camera motion where traditional RGB-only methods struggle due to information loss. To facilitate reliable evaluation in extreme scenarios, the authors introduce two synthetic (DERS-XS, DSEC-Xtrm) and one real-world (DERS-XR) event-RGB semantic segmentation datasets. Experimental results demonstrate that ESC outperforms state-of-the-art methods.

Strengths and Weaknesses

Strengths:

  1. The proposed framework is well-structured with distinct, logical modules (ELR, RC, UO) that each contribute to the overall goal of resilient fusion.
  2. The paper introduces a novel approach to multi-modal fusion by leveraging semantic edges as an intermediate commonality.
  3. The experimental results clearly show improved SOTA performance.

Weaknesses:

  1. Unexplained Learnable Noise Embeddings: The paper states that "learnable noise embeddings" are used to "improve fitting ability and enhance learning stability" within RC and UO. However, it fails to provide a clear explanation of how these embeddings achieve this, or to cite prior work. There is also no corresponding ablation.
  2. Bias towards Extreme RGB Degradation: A significant portion of the evaluation appears to deliberately diminish the contribution of the RGB modality, particularly through the use of extremely low-light frames in both synthetic datasets and real-world data. While the goal is to highlight the event modality's role, such severe consistent RGB degradation might not always reflect the real-world extreme conditions. Testing on existing real-world extreme datasets like DSEC-Night (proposed by CMDA [40]) or collecting a more diverse set of truly challenging real-world scenarios would provide a more comprehensive and less-biased validation.
  3. Limited Efficacy in Non-Extreme Real-World Conditions: The proposed method appears to demonstrate its most significant advantages primarily when the RGB modality is severely degraded. For instance, on the real-world DSEC-Semantic dataset (Table 1), which represents more typical, non-extreme conditions, the performance gain over CMNeXt is only marginal (a 2.01% mIoU improvement).
  4. Limited DERS-XR: While DERS-XR is introduced as a real-world dataset for "extreme conditions", the qualitative examples shown in Figure 14 suggest that the RGB images are severely degraded, appearing almost completely dark. This level of degradation might deliberately diminish the RGB signal. Furthermore, with only 120 frames used for fine-tuning and 120 for testing, the dataset size is significantly small.
  5. High Computational Cost: While Table 5 shows that the proposed method has a comparable number of parameters to CMNeXt, its FLOPs are significantly higher. This substantial increase in computational cost raises questions about the efficiency of the method. It is unclear whether the performance gains observed are primarily due to the architectural innovations or simply a result of a higher computational budget. A critical aspect to assess would be if the method could maintain its superior performance using a smaller backbone to achieve similar FLOPs and inference speed as CMNeXt.

Questions

  1. The authors need to provide a clearer explanation or intuition for how the learnable noise embeddings contribute to improved fitting and stability.
  2. The use of extremely dark RGB inputs may overly emphasize the advantage of the event modality. It’s better to evaluate on more balanced extreme-condition datasets like DSEC-Night to ensure fairer assessment.
  3. The method shows higher FLOPs compared to CMNeXt. Could the authors test the approach with a lighter backbone to better isolate the impact of the proposed components from the increased computational cost?

Limitations

Yes

Final Justification

As my concerns have been resolved, I have decided to increase my score to 4.

Formatting Concerns

No Paper Formatting Concerns

Author Response

We appreciate the reviewer’s careful evaluation, including both the comprehensive review and the thoughtful critiques. The following responses aim to address the raised concerns in detail.


Clarification on Learnable Noise Embeddings (W1, Q1)

Thank you for pointing this out. Here we clearly explain, from both technical and experimental aspects, how learnable noise embeddings are beneficial for improving fitting ability and enhancing learning stability.

Technical explanation:

From an optimization perspective, the learnable noise embeddings serve to enhance the model's fitting capacity and learning stability by introducing additional degrees of freedom into the cross-modality interaction process. This added flexibility allows the model to better adapt to heterogeneous modalities during learning.

From the viewpoint of the cross-attention mechanism, in the absence of noise, the query (Q) tends to attend excessively to its own features in the key (K), thereby suppressing signals from the other modality and impeding effective fusion. Introducing noise embeddings mitigates this issue by perturbing the attention space in a controlled, learnable manner, encouraging richer and more balanced cross-modal interactions.

Conceptually, this mechanism is inspired by prompt-based learning, where learnable prompts are used to guide model behavior. In our case, the noise embeddings act as fixed prompts that help regulate and stabilize the cross-modal attention patterns during training.
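To make this concrete, below is a minimal PyTorch sketch (our illustration with hypothetical names, not the authors' actual module) of how learnable noise embeddings can be appended to the keys and values of a cross-modal attention block, perturbing the attention space in a controlled, learnable way:

```python
import torch
import torch.nn as nn

class NoisyCrossAttention(nn.Module):
    """Cross-attention from one modality (query) to another (key/value),
    with learnable noise embeddings concatenated to the key/value tokens
    so the query cannot attend exclusively to self-similar features."""

    def __init__(self, dim: int, num_heads: int = 4, num_noise_tokens: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Learnable noise embeddings, optimized jointly with the network.
        self.noise = nn.Parameter(torch.randn(1, num_noise_tokens, dim) * 0.02)

    def forward(self, q_feats: torch.Tensor, kv_feats: torch.Tensor) -> torch.Tensor:
        # q_feats: (B, Nq, C) tokens of one modality; kv_feats: (B, Nk, C) of the other.
        noise = self.noise.expand(q_feats.size(0), -1, -1)
        kv = torch.cat([kv_feats, noise], dim=1)  # perturb the attention space
        out, _ = self.attn(q_feats, kv, kv)
        return out

# Example: RGB tokens attending to event tokens plus noise embeddings.
rgb, evt = torch.randn(2, 196, 64), torch.randn(2, 196, 64)
fused = NoisyCrossAttention(dim=64)(rgb, evt)  # (2, 196, 64)
```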

Experimental explanation:

We conduct an ablation study by removing the learnable noise embeddings on three datasets: DERS-XS, DSEC-Semantic, and DSEC-Xtrm. As shown in Table R8, the removal of noise embeddings leads to a 1.05% and 0.81% mIoU drop on DERS-XS and DSEC-Xtrm, respectively, confirming their contribution to improved fitting ability and stability. The performance drop on DSEC-Semantic is minimal (0.18%), which we attribute to its reliance on RGB-based pseudo-labels. As the supervision signal is already biased towards RGB, the model naturally relies less on event modality. In such cases, the role of learnable noise embeddings in facilitating cross-modal interaction becomes less significant.

Table R8: Ablation study on learnable noise embeddings.

| Setting | gACC(%)↑ | mACC(%)↑ | mIoU(%)↑ |
| --- | --- | --- | --- |
| ESC on DERS-XS (w/o noise embeddings) | 93.12 | 73.41 | 66.05 |
| ESC on DERS-XS | 93.27 | 75.26 | 67.10 |
| ESC on DSEC-S. (w/o noise embeddings) | 94.69 | 79.45 | 70.86 |
| ESC on DSEC-S. | 94.85 | 78.61 | 71.04 |
| ESC on DSEC-X. (w/o noise embeddings) | 88.22 | 57.13 | 50.05 |
| ESC on DSEC-X. | 88.18 | 59.45 | 50.87 |

Justification of Challenging RGB and DSEC-Night Evaluation (W2, Q2)

We would like to clarify that the use of dark or degraded RGB inputs is intentional and consistent with our primary motivation: to facilitate resilient semantic segmentation by addressing the core challenge of inferior optimization of heterogeneous event and RGB modalities under modality imbalance and failure. In this context, using challenging RGB conditions and applying spatial occlusions via local masking is not a source of artificial bias, but a deliberate design choice to evaluate fusion capabilities under representative adverse scenarios.

That said, we fully understand the reviewer’s concern about broader real-world applicability. Notably, our method also achieves consistent gains on DSEC-Semantic, a dataset that represents more typical, non-extreme conditions. This demonstrates that the effectiveness of our approach is not limited to extremely degraded RGB scenarios.

In addition, to directly address the reviewer’s suggestion, we include a new evaluation on DSEC-Night, which was originally proposed for unsupervised cross-modality domain adaptation by CMDA. As this dataset only contains 150 annotated frames for testing, we adapt it to a supervised setting by splitting the labeled sequences into training and testing sets: 54 frames from zurich_city_09_a and zurich_city_09_b for training, and 96 frames from zurich_city_09_c/d/e for testing. All models are trained for 100 epochs with the same settings for fair comparison. As shown in Table R9, our method outperforms CMX by 6.55%, CMNeXt by 6.28%, and EISNet by 5.95% in mIoU, demonstrating the strong efficacy of our method in real-world extreme conditions.

Table R9: Quantitative comparisons on DSEC-Night.

| Method | gACC(%)↑ | mACC(%)↑ | mIoU(%)↑ |
| --- | --- | --- | --- |
| CMX | 77.38 | 43.24 | 32.85 |
| CMNeXt | 77.38 | 43.62 | 33.12 |
| EISNet | 77.87 | 44.95 | 33.46 |
| ESC (Ours) | 81.68 | 48.55 | 39.40 |

Lastly, we agree that a more comprehensive evaluation on diverse extreme conditions would be valuable. As part of our ongoing work, we are actively collecting a new, large-scale real-world dataset covering a broader range of challenging scenarios, including underexposure, overexposure, motion blur, and lens flare. Due to the high cost of manual annotation, this dataset could not be completed during the rebuttal period, but we plan to release it in future work.


Efficacy in Non-Extreme Real-World Conditions (W3)

We respectfully disagree with the characterization of the 2.01% mIoU improvement as marginal. The significance of a performance gain should be evaluated in the context of task difficulty, dataset characteristics, and the strength of the baseline. Achieving a 2.01% improvement over a strong baseline like CMNeXt is non-trivial, particularly given that DSEC-Semantic represents non-extreme real-world conditions where existing methods already perform well and leave limited room for further improvement.

For reference, CMNeXt itself reports only 0.2% and 0.6% mIoU improvements on the MFNet and NYU Depth V2 benchmarks, respectively. In this context, our 2.01% improvement is both meaningful and practically relevant. Moreover, in real-world deployment scenarios, even modest increases in mIoU can lead to noticeably improved segmentation quality.


Justification of DERS-XR Dataset (W4)

We would like to clarify that the DERS-XR dataset is constructed from real-world captures under naturally occurring extreme conditions. Our intention is not to deliberately diminish the RGB signal, but rather to faithfully reflect real-world scenarios. Although the RGB images appear severely degraded, they still retain essential semantic cues for segmentation.

Regarding the dataset scale, we acknowledge that DERS-XR contains a limited number of frames. This limitation is primarily due to the high cost of accurate pixel-level annotation under extreme conditions. The original motivation behind DERS-XR is to provide a benchmark for evaluating model performance under real-world extreme conditions with reliable ground truth. Given the scarcity of such labeled data, DERS-XR fulfills this purpose effectively despite its modest size.


Efficiency Performance Trade-off with Lighter Backbones (W5, Q3)

To address the reviewer's concern, we conduct additional comparative experiments on DSEC-Xtrm using smaller backbones (2×MiT-B0), reducing the FLOPs of ESC to be even lower than CMNeXt's. In addition, we measure the end-to-end inference latency of CMNeXt and our ESC (both the standard and reduced variants) on a single NVIDIA GeForce RTX 3090 GPU with a batch size of 1. All latency measurements use a fixed input size of 512 × 512; each measurement averages the execution time over 100 inferences, and we repeat it 3 times for stability.
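For reproducibility, a hypothetical sketch of this latency protocol is given below (the single-tensor input is our simplification, since ESC actually consumes an RGB-event pair, and the warm-up count is our assumption):

```python
import time
import torch

@torch.no_grad()
def measure_latency_ms(model, input_size=(1, 3, 512, 512),
                       n_warmup=10, n_runs=100, device="cuda"):
    """Average end-to-end inference latency in ms: batch size 1,
    fixed 512x512 input, averaged over `n_runs` forward passes."""
    model = model.to(device).eval()
    x = torch.randn(*input_size, device=device)
    for _ in range(n_warmup):            # warm up kernels / allocator
        model(x)
    torch.cuda.synchronize()             # start timing only after GPU is idle
    start = time.perf_counter()
    for _ in range(n_runs):
        model(x)
    torch.cuda.synchronize()             # wait for all queued GPU work
    return (time.perf_counter() - start) / n_runs * 1e3

# Repeating measure_latency_ms(...) three times gives Latency #1-#3 in Table R10-B.
```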

As shown in Table R10-A, the reduced ESC variant still outperforms CMNeXt, achieving 49.06% mIoU vs. 45.16%, despite lower FLOPs (60.658G vs. 62.805G) and significantly fewer parameters (14.184M vs. 58.687M). This suggests that the performance gains stem from architectural design rather than merely an increased computational cost. Furthermore, as shown in Table R10-B, the reduced ESC variant has an average inference latency of 29.37 ms, which is shorter than that of CMNeXt (29.79 ms), demonstrating its potential for more efficient deployment.

The above results indicate that even with lightweight backbones, our model maintains strong performance with higher inference speed, highlighting the effectiveness of our design beyond raw FLOPs. This reflects a favorable trade-off between efficiency and performance, which is essential for practical deployment in real-world systems.

We hope these additional results and analyses adequately address the reviewer’s concern.

Table R10-A: Ablation study on lighter backbones on DSEC-Xtrm.

| Method | gACC(%)↑ | mACC(%)↑ | mIoU(%)↑ | #Params(M) | FLOPs(G) |
| --- | --- | --- | --- | --- | --- |
| CMNeXt | 87.04 | 52.12 | 45.16 | 58.687 | 62.805 |
| ESC (Reduced) | 88.03 | 56.31 | 49.06 | 14.184 | 60.658 |
| ESC (Standard) | 88.18 | 59.45 | 50.87 | 56.875 | 95.086 |

Table R10-B: Inference Latency comparison on lighter backbones.

| Method | Latency #1 (ms) | Latency #2 (ms) | Latency #3 (ms) | Avg. Latency (ms) |
| --- | --- | --- | --- | --- |
| CMNeXt | 29.75 | 29.78 | 29.83 | 29.79 |
| ESC (Reduced) | 29.30 | 29.38 | 29.43 | 29.37 |
| ESC (Standard) | 34.56 | 34.46 | 34.78 | 34.60 |
Comment

Thank you for the rebuttal. My concerns have been addressed.

Comment

Dear Reviewer tfrK,

Thank you very much for your final response and for recognizing that your concerns have been addressed. We truly appreciate your comprehensive review and the thoughtful critiques during the review process. We will carefully incorporate the necessary clarifications and improvements in the final version of the paper.

Sincerely,

The Authors

Review (Rating: 4)

This paper addresses the challenge of semantic segmentation under extreme conditions by leveraging heterogeneous Event-RGB inputs. The authors propose a novel framework called Edge-awareness Semantic Concordance (ESC), which utilizes semantic edges as a modality-bridging representation to align RGB and event features in a unified semantic space. The method consists of three main components: Edge-awareness Latent Re-coding (ELR), Re-coded Consolidation (RC), and Uncertainty Optimization (UO). Additionally, the authors construct three new benchmark datasets (DERS-XS, DERS-XR, and DSEC-Xtrm) designed to evaluate performance and robustness under degraded visual conditions. Experimental results demonstrate that the proposed method achieves state-of-the-art performance across multiple metrics and maintains superior robustness under extreme conditions like simulated occlusion and modality degradation.

Strengths and Weaknesses

Pros:

  1. The framework introduces semantic edge as a unified representation to bridge RGB and event modalities. The motivation is well illustrated in Fig. 2, with visualizations of edge and event distributions showing high spatial correlation.
  2. The modular design is clear and well-motivated: ELR provides semantic consistency supervision, RC fuses refined edge features across modalities, and UO models the confidence of edge features to enable attention-based adaptive fusion. Together, they form a coherent pipeline for robust multi-modal segmentation.
  3. The experimental evaluation is comprehensive: the authors not only validate the method on the existing popular dataset DSEC-Semantic, but also construct three new and more challenging datasets. The experiments include extreme scenarios such as under-exposure and occlusion, providing a thorough validation of the method's applicability.

Cons:

  1. The ELR module uses a VQ-VAE-based edge dictionary to achieve modality alignment. It is unclear whether alternative encoding strategies (e.g., prototype-based clustering) have been considered. Ablation or comparison with other dictionary designs could help validate the generality of the framework.
  2. The recent work ESEG [1] focuses on event-only segmentation using explicit edge semantics. If its detection branch were replaced by RGB input, the pipeline would become highly similar to ESC. A clearer discussion of conceptual and methodological differences between the two works is warranted.
  3. In Appendix B, the DSEC-Semantic event input is built by sampling 50,000 events per sequence rather than using a fixed time window (e.g., 50 ms). The rationale for this design choice is not clearly explained. A discussion of the implications and potential performance trade-offs would strengthen the methodological transparency.
  4. While the paper showcases under-exposure scenarios, ESC is proposed to handle a broader range of extreme conditions. Including examples under over-exposure or motion blur would better illustrate the full scope of robustness.
  5. The authors adopt MiT-B2 for RGB and MiT-B1 for event inputs. Combined with multiple attention operations, the overall FLOPs reach 95G, which may hinder deployment efficiency.

[1] Zhao Y, Lyu G, Li K, et al. ESEG: Event-Based Segmentation Boosted by Explicit Edge-Semantic Guidance. Proceedings of the AAAI Conference on Artificial Intelligence, 2025, 39(10): 10510-10518.

Questions

See the weaknesses (Cons) for detailed suggestions.

Limitations

Yes

Final Justification

I appreciate the authors' response, which has addressed most of my concerns regarding the experiments and the differences between related works.

Formatting Concerns

N/A

Author Response

We thank the reviewer for the constructive feedback. The comments have been very helpful in guiding us to improve the manuscript. Please find our detailed responses below.


Framework Generality and Alternative Dictionary Strategies (W1)

We appreciate the reviewer for the valuable suggestion. In our framework, the VQ-VAE-based edge dictionary not only provides a feasible edge encoding method, but also provides an optimization mechanism for modality edge features, deriving a series of reliable indicators based on confidences and uncertainties to guide the fusion of heterogeneous event and RGB inputs. Alternative encoding strategies, such as prototype-based encoding, would substantially change the current training paradigm; for example, the confidence indicator based on the classification softmax logit could no longer be obtained. As a result, different edge encoding strategies cannot be swapped into the framework losslessly.

Despite this, we still did our best to re-implement our framework with a simple prototype-based edge encoding method. We pre-define 24 edge prototypes, representing 11 categories of semantic boundaries plus one non-boundary category for each modality. The prototypes are updated every epoch as the average feature vector of each category. In the training and testing stages, a nearest-neighbour look-up matches the prototypes to the edge features at each spatial position, and the matched prototypes reconstitute the re-coded edge embeddings $\Gamma^{\mathcal{I}}$ and $\Gamma^{\mathcal{E}}$.
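For clarity, the following sketch illustrates the nearest-neighbour re-coding step described above (the function and variable names are ours; the per-epoch prototype update is omitted):

```python
import torch

def recode_with_prototypes(feats: torch.Tensor, prototypes: torch.Tensor) -> torch.Tensor:
    """Replace each spatial edge feature with its nearest prototype.
    feats:      (B, C, H, W) modality edge features
    prototypes: (K, C) edge prototypes, e.g. K = 24 in our experiment
                (11 semantic-boundary classes + 1 non-boundary, per modality);
                updated every epoch as per-category feature means.
    Returns the re-coded edge embeddings (Gamma) of shape (B, C, H, W)."""
    B, C, H, W = feats.shape
    flat = feats.permute(0, 2, 3, 1).reshape(-1, C)   # (B*H*W, C)
    dists = torch.cdist(flat, prototypes)             # distances to all prototypes
    idx = dists.argmin(dim=1)                         # nearest-neighbour look-up
    return prototypes[idx].reshape(B, H, W, C).permute(0, 3, 1, 2)
```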

As shown in Table R6, the prototype-based edge encoding method drops mIoU by 1.20% on DERS-XS, 0.12% on DSEC-Semantic, and 0.36% on DSEC-Xtrm, compared to the original VQ-VAE-based edge encoding. These results suggest that our framework generalizes across different edge encoding designs, while the VQ-VAE-based edge dictionary remains the carefully considered design choice of our proposed framework.

Table R6: Ablation study on prototype-based edge encoding method and VQ-VAE-based edge encoding method.

| Setting | gACC(%)↑ | mACC(%)↑ | mIoU(%)↑ |
| --- | --- | --- | --- |
| ESC on DERS-XS (prototype) | 92.91 | 74.07 | 65.90 |
| ESC on DERS-XS | 93.27 | 75.26 | 67.10 |
| ESC on DSEC-S. (prototype) | 94.79 | 78.84 | 70.91 |
| ESC on DSEC-S. | 94.85 | 78.61 | 71.04 |
| ESC on DSEC-X. (prototype) | 87.98 | 58.17 | 50.50 |
| ESC on DSEC-X. | 88.18 | 59.45 | 50.87 |

Differences Between ESC and ESEG (W2)

We appreciate the reviewer for suggesting this paper, and we would like to discuss and clarify any discrepancies between ESEG and our ESC.

Both ESEG and our ESC utilize edge information to guide the learning process; however, there are several main discrepancies as follows.

(a) Different tasks. ESEG is a uni-modality method that relies solely on event input, aiming to approximate the segmentation performance typically obtained from RGB data. Our ESC is a multi-modality semantic segmentation method with event and RGB as inputs, which mainly focuses on the inferior optimization issues in modality imbalance and failure situations. For this reason, it is acceptable for ESEG to use RGB-based pseudo-labels, while our ESC needs more accurate labels for assessment.

(b) Different motivations and learning objectives in the utilization of edges. ESEG utilizes edge-semantic messages to explicitly point the model to regions of interest as guidance, so its structure introduces direct edge supervision with different object semantics. Our ESC utilizes edge information as an intermediate commonality for heterogeneous event and RGB, which realigns event and RGB into a unified semantic space and jointly optimizes them based on the confidence and uncertainty indicators derived from latent edge distributions.

(c) Different pipelines, even if ESEG's detection branch were replaced by RGB input. For ESEG, if the dense-semantic branch were replaced by RGB input, the pipeline would reduce to events representing only edge semantics and RGB representing only dense semantics. Our assumption, in contrast, is that event and RGB both contain edge semantic information. Therefore, our pipeline realigns event and RGB into the unified semantic space to jointly optimize their edge semantics, with implicit edge information as a crucial clue.

Although this work differs from ours in many ways, it remains a valuable reference for us. We are grateful to the reviewer for the keen observation, and we will discuss and cite this work in our related work section with the above analysis.


Rationale Behind Event Sampling Strategy (W3)

The DSEC-Semantic event input is built by sampling a fixed number of events per voxel grid rather than a fixed time window, following the setting of ESS [34]. In ESS, the event input is built with 100,000 events per voxel grid. We found that a 50-ms fixed time window or a fixed count of 100,000 events is relatively large, which slows data preprocessing and may lead to an insufficient representation of edge characteristics. After these trade-offs, we decided to use 50,000 events per voxel grid for DSEC-Semantic as our event sampling strategy.
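A simplified sketch of this fixed-count sampling is shown below (the dict-of-arrays event format and the helper name are our assumptions for illustration):

```python
import numpy as np

def events_to_voxel_grid(events: dict, num_bins: int = 5, height: int = 480,
                         width: int = 640, num_events: int = 50_000) -> np.ndarray:
    """Build a temporal voxel grid from the most recent `num_events` events.
    events: dict of numpy arrays with keys "t" (timestamps), "x", "y"
            (integer pixel coordinates), and "p" (polarity in {-1, +1})."""
    t, x, y, p = (events[k][-num_events:] for k in ("t", "x", "y", "p"))
    grid = np.zeros((num_bins, height, width), dtype=np.float32)
    # Normalize the sampled window's timestamps into bin indices [0, num_bins).
    tn = (t - t[0]) / max(t[-1] - t[0], 1) * (num_bins - 1e-6)
    b = tn.astype(np.int64)
    np.add.at(grid, (b, y, x), p)  # scatter-add polarities into temporal bins
    return grid
```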

To further demonstrate the impact of different event sampling strategies, we conduct experiments on DSEC-Semantic with a fixed time window of 50 ms and a fixed number of 100,000 events per voxel grid, compared with the 50,000 events per voxel grid used in the main paper. As shown in Table R7, the different event sampling strategies yield comparable results, with mIoU slightly lower (by 0.20% and 0.19%, respectively) than the fixed count of 50,000 events used in the main paper.

Table R7: Ablation study on different event sampling strategies.

| Setting | gACC(%)↑ | mACC(%)↑ | mIoU(%)↑ |
| --- | --- | --- | --- |
| ESC on DSEC-S. (50 ms) | 94.76 | 78.44 | 70.83 |
| ESC on DSEC-S. (100,000 events) | 94.87 | 78.00 | 70.84 |
| ESC on DSEC-S. (50,000 events) | 94.85 | 78.61 | 71.04 |

Data of Broader Range Extreme Conditions (W4)

Given the specificity of this field and the high cost of precise annotation, there is currently a lack of publicly available event-RGB semantic segmentation benchmarks that feature a broader range of extreme real-world conditions with true labels. Existing datasets such as DDD17 and DSEC-Semantic primarily rely on RGB-based pseudo labels and do not explicitly differentiate between different conditions.

To address this gap, we are currently dedicated to collecting a set of larger-scale, more diverse, and truly challenging real-world scenes data as our future work, which will include challenging scenarios such as underexposure, overexposure, motion blur, and lens flare. Nevertheless, we stress that the datasets employed in our current study are carefully designed and sufficient to demonstrate the core effectiveness and robustness of our method.


Deployment Efficiency and FLOPs Ablation (W5)

We appreciate the reviewer's concerns regarding the deployment efficiency. We address this from both a hardware evaluation and an algorithmic ablation perspective.

Hardware-wise, we report the end-to-end inference latency, throughput, and peak memory usage on NVIDIA GeForce RTX 3090 GPUs and an AMD EPYC 7642 48-Core CPU, with a fixed input resolution of 512 × 512. As shown in Table R4, our framework achieves favorable deployment efficiency on standard hardware.

Table R4: Inference latency, throughput, and peak memory usage of GPU and CPU.

| # | Inference latency | Inference throughput | Peak memory usage (GPU) | Peak memory usage (CPU) |
| --- | --- | --- | --- | --- |
| 1 | 34.56 ms | 84.26 samples/sec | 12169.33 MB | 1903.26 MB |
| 2 | 34.46 ms | 84.17 samples/sec | 12169.33 MB | 1901.64 MB |
| 3 | 34.78 ms | 84.35 samples/sec | 12169.33 MB | 1879.25 MB |
| Avg. | 34.60 ms | 84.26 samples/sec | 12169.33 MB | 1894.72 MB |

Algorithm-wise, we further evaluate our framework with lighter backbones (2×MiT-B0). As shown in Table R10-A, even with FLOPs lower than CMNeXt, the reduced ESC variant achieves better performance on DSEC-Xtrm compared to CMNeXt (49.06% vs. 45.16% mIoU) and a significantly lower parameter count. Moreover, as shown in Table R10-B, the average inference latency of the reduced variant is also slightly lower than CMNeXt (29.37 ms vs. 29.79 ms). These results highlight the effectiveness of backbone scaling in achieving a favorable balance between deployment efficiency and performance.

Table R10-A: Ablation study on lighter backbones on DSEC-Xtrm.

| Method | gACC(%)↑ | mACC(%)↑ | mIoU(%)↑ | #Params(M) | FLOPs(G) |
| --- | --- | --- | --- | --- | --- |
| CMNeXt | 87.04 | 52.12 | 45.16 | 58.687 | 62.805 |
| ESC (Reduced) | 88.03 | 56.31 | 49.06 | 14.184 | 60.658 |
| ESC (Standard) | 88.18 | 59.45 | 50.87 | 56.875 | 95.086 |

Table R10-B: Inference Latency comparison on lighter backbones.

| Method | Latency #1 (ms) | Latency #2 (ms) | Latency #3 (ms) | Avg. Latency (ms) |
| --- | --- | --- | --- | --- |
| CMNeXt | 29.75 | 29.78 | 29.83 | 29.79 |
| ESC (Reduced) | 29.30 | 29.38 | 29.43 | 29.37 |
| ESC (Standard) | 34.56 | 34.46 | 34.78 | 34.60 |
Comment

Dear Reviewer 53vx,

Thank you for your final acknowledgement and for taking the time to review our rebuttal. We appreciate your thoughtful engagement throughout the review process and are glad that our response helped to address your concerns. We will reflect the necessary improvements in the final version.

Sincerely,

The Authors

Review (Rating: 3)

This paper presents ESC that exploits semantic edges to fuse heterogeneous RGB images and event streams for robust semantic segmentation under extreme conditions. ESC first constructs a discrete edge dictionary, then performs ELR to align event- and RGB-derived edge distributions in a shared latent space. Two tailored modules—RC and UO—respectively integrate edge cues into RGB context features and adaptively weight modalities using confidence estimates, yielding resilient predictions when either modality degrades. To enable reliable evaluation, the authors use three benchmark datasets with ground-truth labels in extreme low-light or noisy scenarios.

Strengths and Weaknesses

The introduction of a shared discrete edge dictionary combined with the Edge-aware Latent Re-coding mechanism is novel and well-motivated. The paper backs this design with solid empirical evidence, including an event-edge correlation analysis and ablations isolating each module’s contribution.

Three benchmark datasets further amplify its impact and should facilitate future research in multimodal segmentation.


The study compares ESC mainly with token- or attention-level fusion baselines. Other recent techniques such as cross-modal contrastive pretraining or adaptive modality dropping are not investigated.

The manuscript layout appears misaligned with NeurIPS guidelines: (1) Text embedded in figures is noticeably smaller than the main font, making labels hard to read. (2) Several tables spill over margins or have inconsistent column widths.

[1] SAM-Event-Adapter: Adapting Segment Anything Model for Event-RGB Semantic Segmentation

Questions

NONE

Limitations

NONE

Formatting Concerns

NONE

Author Response

We sincerely appreciate the reviewer’s time and effort in thoroughly reviewing our manuscript. We also appreciate the insightful suggestions and provide detailed responses below.


Other Techniques Investigation and Comparison (W1)

We appreciate the reviewer for highlighting this important research direction. Our ESC framework focuses on the inferior optimization issues of heterogeneous event and RGB in modality imbalance and failure situations. Motivated by this, our work emphasizes developing more robust and resilient fusion strategies under the existing extreme input conditions. Therefore, we center our investigation on recent token-level and attention-level fusion baselines that address similar concerns.

Other techniques suggested by the reviewer, such as cross-modal contrastive pretraining (e.g., Event-Camera-Data-Pre-training [A], SAM-Event-Adapter [B], CM3AE [C]) and adaptive modality dropping (e.g., Missing-Modality-Prediction [D], RAGPT [E]), indeed provide valuable and complementary perspectives in the broader multimodal learning field. However, cross-modal contrastive pretraining approaches primarily aim to acquire informative and effective pretrained backbones, while adaptive modality dropping methods mainly tackle issues of incomplete modality inputs. In contrast, our focus lies in inferior optimization issues of modality imbalance and failure, which are prevalent yet underexplored challenges in real-world scenarios. This distinction reflects different motivations and problem settings, rather than limitations of any particular method.

For completeness of comparison, we include in Table R5 the reported performance of two representative contrastive pretraining methods ([A], [B]) on the DSEC-Semantic dataset. The results are cited directly from the original papers, as [B] does not release its code, and we rely on their published numbers.

In the final version of the paper, we will expand the above discussion and theoretical comparison in the related work section to include these approaches and cite the relevant literature, including [A-E]. We believe that this will help situate our work more clearly within the multimodal fusion landscape and highlight its distinct contributions.

Table R5: Comparison on DSEC-Semantic with two cross-modal contrastive pretraining methods.

| Method | gACC(%)↑ | mACC(%)↑ | mIoU(%)↑ |
| --- | --- | --- | --- |
| Event-Camera-Data-Pre-training [A] | - | - | 59.16 |
| SAM-Event-Adapter [B] | 93.58 | - | 69.77 |
| ESC (Ours) | 94.85 | 78.61 | 71.04 |

Formatting and Layout Issues (W2)

Thanks for your careful review and valuable suggestions. We will carefully adjust the texts embedded in figures for more convenient reading in our final manuscript version. As for tables, we will also trim them neatly in the final version.

It must also be seriously pointed out that we DO NOT violate any NeurIPS style guidelines. No tables spill over margins or have inconsistent column widths.

References

[A] Yan Yang, Liyuan Pan, and Liu Liu. "Event camera data pre-training." ICCV, 2023.

[B] Bowen Yao, Yongjian Deng, Yuhan Liu, Hao Chen, Youfu Li, and Zhen Yang. "Sam-event-adapter: Adapting segment anything model for event-rgb semantic segmentation." IEEE ICRA, 2024.

[C] Wentao Wu, Xiao Wang, Chenglong Li, Bo Jiang, Jin Tang, Bin Luo, Qi Liu. "CM3AE: A Unified RGB Frame and Event-Voxel/-Frame Pre-training Framework." ACM MM, 2025.

[D] Donggeun Kim, and Taesup Kim. "Missing modality prediction for unpaired multimodal learning via joint embedding of unimodal models." ECCV, 2024.

[E] Jian Lang, Zhangtao Cheng, Ting Zhong, and Fan Zhou. "Retrieval-augmented dynamic prompt tuning for incomplete multimodal learning." AAAI, 2025.

Comment

Dear Reviewer BVgg,

Thank you again for your time and effort in reviewing our submission. We hope that our rebuttal has addressed your concerns and clarified the key points. If there are any remaining questions or if further clarification would be helpful, we would be glad to provide additional information.

We truly appreciate your contributions to the review process.

Sincerely,

The Authors

Comment

Thank you for your reply; my question has been resolved. I will support this article.

Comment

Dear Reviewer BVgg,

Thank you very much for your final feedback. We are truly grateful for your support and are glad that our response was able to resolve your concerns. Your engagement during the review process is greatly appreciated, and we will reflect the corresponding revisions in the final version of the manuscript.

Sincerely,

The Authors

Review (Rating: 5)

The paper presents Edge-awareness Semantic Concordance (ESC), a framework for robust event-RGB semantic segmentation under challenging conditions. ESC leverages a shared discrete edge dictionary learned via a VQ-VAE on semantic edge maps to bridge heterogeneous event and RGB features. It introduces three key modules: (1) Edge-awareness Latent Re-coding (ELR), which aligns event/RGB features into a unified edge-categorical space and produces uncertainty indicators. (2) Re-coded Consolidation (RC), which injects re-coded edge embeddings into image features and (3) Uncertainty Optimization (UO), which dynamically fuses modalities based on confidence/uncertainty maps. The authors also construct two synthetic (DERS-XS, DSEC-Xtrm) and one real-world (DERS-XR) extreme condition datasets to evaluate ESC’s performance. Experiments on these benchmarks demonstrate consistent mIoU improvements of ~2–3% over state-of-the-art fusion methods and greater resilience to spatial occlusion.

Strengths and Weaknesses

Strengths:

  1. The discrete edge dictionary learned via VQ-VAE provides an effective bridge between event and RGB feature spaces.
  2. The modular design of ELR, RC, and UO is thoughtfully structured, and the comprehensive ablation studies transparently demonstrate the value of each component.
  3. Consistent mIoU improvements (2–3 %) across both synthetic (DERS-XS, DSEC-Xtrm) and real-world (DERS-XR) benchmarks highlight the approach’s robustness under extreme conditions.
  4. The incorporation of uncertainty maps to guide dynamic fusion of event and RGB modalities enhances performance when data quality varies.

Weaknesses:

  1. Dependence on a VQ-VAE–learned discrete edge dictionary adds a non-trivial pretraining step and raises questions about how well the same dictionary transfers to new domains.
  2. The combination of ELR, RC, and UO modules, while effective, increases architectural complexity and may incur higher runtime and memory overhead in practice.
  3. Although parameters and FLOPs are reported in Table 5, inference latency and throughput on real hardware are not provided, making it difficult to fully assess deployment feasibility.

Questions

  1. How well does the VQ-VAE–learned edge dictionary transfer across domains? For example, if it is trained on one dataset and applied unchanged to another, what is the mIoU change?
  2. How sensitive are your results to key hyperparameters such as dictionary size and uncertainty threshold?
  3. Can you report end-to-end inference latency and peak memory usage for ESC on representative hardware?

Limitations

Yes

Final Justification

The rebuttal addressed all of my earlier concerns with additional experiments and clarifications. I am satisfied with the responses and maintain my 5 (Accept) rating.

Formatting Concerns

None significant

Author Response

We are grateful to the reviewer for the detailed and professional review, which reflects a deep understanding of the topic. We sincerely appreciate the positive comments, valuable concerns, and suggestions on our work. Here is our response to the mentioned weaknesses and questions.


Edge Dictionary Domain Transferability (W1, Q1)

Thank you for the insightful question. Theoretically, the discrete edge dictionary learned by VQ-VAE is expected to have good transferability across datasets. As an intermediate representation, semantic edge exhibits relatively simple and consistent structures, and the latent distributions of semantic edge derived from segmentation labels tend to vary only slightly across different datasets. Therefore, we expect the performance degradation under dictionary transfer settings to be minimal.
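For intuition, the dictionary look-up at the core of ELR is a standard VQ-VAE quantization step, and transferring the dictionary across datasets simply means reusing the learned codebook unchanged. A minimal sketch follows (our simplification, not the full ELR module):

```python
import torch

def vq_lookup(z: torch.Tensor, codebook: torch.Tensor):
    """Quantize latent edge features against a learned discrete dictionary.
    z:        (N, D) latent edge features
    codebook: (K, D) learned edge dictionary, e.g. K = 128 entries
    Returns the quantized vectors and their dictionary indices."""
    dists = torch.cdist(z, codebook)     # (N, K) distances to all entries
    idx = dists.argmin(dim=1)            # nearest dictionary entry
    z_q = codebook[idx]
    # Straight-through estimator: copy gradients from z_q back to z.
    z_q = z + (z_q - z).detach()
    return z_q, idx
```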

Following the reviewer’s suggestion, we conduct cross-domain evaluations by (i) using an edge dictionary pretrained on DSEC to evaluate on DERS-XS, and (ii) using a dictionary pretrained on DERS-XS to evaluate on DSEC-Semantic and DSEC-Xtrm. As shown in Table R1, the performance drops are small in all cases. Specifically, the mIoU on DERS-XS drops by 0.66%, on DSEC-Semantic by 0.10%, and on DSEC-Xtrm by 0.21%, compared to the original non-exchanged dictionary settings. These results suggest that the learned edge dictionary generalizes well across domains, and our method remains robust under moderate domain shifts.

Table R1: Ablation study on edge dictionary exchange settings.

| Settings | gACC(%)↑ | mACC(%)↑ | mIoU(%)↑ |
| --- | --- | --- | --- |
| ESC on DERS-XS (w/ edge dictionary of DSEC) | 93.23 | 74.96 | 66.44 |
| ESC on DERS-XS | 93.27 | 75.26 | 67.10 |
| ESC on DSEC-S. (w/ edge dictionary of DERS-XS) | 94.91 | 78.04 | 70.93 |
| ESC on DSEC-S. | 94.85 | 78.61 | 71.04 |
| ESC on DSEC-X. (w/ edge dictionary of DERS-XS) | 88.57 | 58.00 | 50.65 |
| ESC on DSEC-X. | 88.18 | 59.45 | 50.87 |

Hyperparameter Sensitivity (Q2)

Thank you for raising this important point. We observe that the model's performance varies noticeably with the dictionary size $K$, but remains robust across a reasonable range. As shown in Table 4 of the main paper (reproduced here as Table R2), using too small a dictionary leads to insufficient latent edge representations, while an overly large dictionary introduces ambiguity in modality-specific learning, potentially due to increased difficulty in selecting appropriate dictionary entries for modality-specific edge patterns. Both extremes negatively impact performance. The best results are achieved with a moderate dictionary size (e.g., $K=128$), which offers a good balance between representation capacity and stability in codebook learning.

Table R2: Ablation study on key usage (dictionary size) on DERS-XS.

| Dictionary Size | 16 | 32 | 64 | 128 | 256 | 512 |
| --- | --- | --- | --- | --- | --- | --- |
| Dictionary Usage | 16 | 32 | 64 | 92 | 99 | 97 |
| gACC(%)↑ | 92.66 | 93.12 | 93.03 | 93.27 | 93.19 | 92.93 |
| mACC(%)↑ | 74.59 | 74.77 | 74.90 | 75.26 | 73.91 | 73.65 |
| mIoU(%)↑ | 65.87 | 66.84 | 66.64 | 67.10 | 66.54 | 66.11 |

As for the uncertainty threshold, our framework does not rely on any manually set threshold as a hyperparameter. Instead, the uncertainty indicators are dynamically inferred and represented as continuous values in the range of 0 to 1. These maps serve as soft attention weights, allowing the model to adaptively determine the fusion strength between heterogeneous event and RGB.
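A minimal sketch of such threshold-free, confidence-weighted fusion is given below (our simplification of UO; the names are hypothetical). Zeroing `conf_evt` at inference corresponds to the ablation reported in Table R3 below:

```python
import torch

def confidence_weighted_fusion(f_rgb: torch.Tensor, f_evt: torch.Tensor,
                               conf_rgb: torch.Tensor, conf_evt: torch.Tensor,
                               eps: float = 1e-6) -> torch.Tensor:
    """Fuse modality features with per-pixel confidences in [0, 1] acting
    as soft attention weights, so no manual threshold is required.
    f_*:    (B, C, H, W) modality features
    conf_*: (B, 1, H, W) confidence maps from the latent edge distributions."""
    w = torch.cat([conf_rgb, conf_evt], dim=1)
    w = w / w.sum(dim=1, keepdim=True).clamp_min(eps)  # normalize per pixel
    return w[:, :1] * f_rgb + w[:, 1:] * f_evt
```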

To further examine the role of these learned confidence and uncertainty maps, as well as the contribution of the event modality, we conduct additional experiments on DERS-XS by injecting controlled modifications during the inference phase, specifically by zeroing out the confidences of the event modality. As shown in Table R3, this results in a 1.49% drop in mIoU, indicating a noticeable degradation in performance. This ablation underscores the effectiveness of our dynamically learned uncertainty indicators, demonstrating that the model can adaptively control the fusion strength of each modality based on the input content, without relying on a manually tuned threshold.

Table R3: Ablation study on injecting modifications to confidence and uncertainty indicators.

| Settings | gACC(%)↑ | mACC(%)↑ | mIoU(%)↑ |
| --- | --- | --- | --- |
| ESC (zero conf.) | 92.98 | 73.50 | 65.61 |
| ESC | 93.27 | 75.26 | 67.10 |

End-to-end Inference Latency, Throughput and Peak Memory Usage (W2, W3, Q3)

Certainly, and thank you for the suggestion. Here are the end-to-end inference latency, inference throughput, and peak memory usage of GPU and CPU in Table R4, averaged over three runs for stability. All experiments are performed on a machine equipped with NVIDIA GeForce RTX 3090 GPUs and an AMD EPYC 7642 48-Core CPU. Inference latency is measured on a single GPU with a batch size of 1, while throughput and memory usage are measured on two GPUs with a batch size of 16 per GPU, and only the peak memory usage of the process on GPU 0 is reported. All measurements are conducted with a fixed input resolution of 512 × 512. For inference latency and throughput, each measurement calculates the average execution time over 100 inferences.
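For completeness, here is a hypothetical sketch of the throughput and peak-GPU-memory part of this protocol (single GPU shown, whereas the measurements above use two GPUs; the function name is ours):

```python
import time
import torch

@torch.no_grad()
def measure_throughput_and_memory(model, batch_size=16, size=(3, 512, 512), n_runs=100):
    """Samples/sec and peak GPU memory for batched inference at 512x512."""
    model = model.cuda().eval()
    x = torch.randn(batch_size, *size, device="cuda")
    torch.cuda.reset_peak_memory_stats()
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(n_runs):
        model(x)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    throughput = batch_size * n_runs / elapsed            # samples / sec
    peak_mb = torch.cuda.max_memory_allocated() / 2**20   # peak GPU memory (MB)
    return throughput, peak_mb
```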

These results indicate that ESC achieves competitive inference efficiency and moderate memory usage, suggesting good potential for deployment in practical scenarios. Combined with the reported FLOPs and parameter count in Table 5, this further supports the practical feasibility of our proposed ESC framework.

Table R4: Inference latency, throughput, and peak memory usage of GPU and CPU.

| # | Inference latency | Inference throughput | Peak memory usage (GPU) | Peak memory usage (CPU) |
| --- | --- | --- | --- | --- |
| 1 | 34.56 ms | 84.26 samples/sec | 12169.33 MB | 1903.26 MB |
| 2 | 34.46 ms | 84.17 samples/sec | 12169.33 MB | 1901.64 MB |
| 3 | 34.78 ms | 84.35 samples/sec | 12169.33 MB | 1879.25 MB |
| Avg. | 34.60 ms | 84.26 samples/sec | 12169.33 MB | 1894.72 MB |
Comment

I appreciate the detailed rebuttal. My concerns have been fully addressed.

Comment

Dear Reviewer DoyC,

We are sincerely grateful for your final response and truly appreciate your recognition that all concerns have been fully addressed. Your thoughtful and constructive feedback throughout the review process has been invaluable in improving our work. We will revise and extend the manuscript accordingly to incorporate your valuable suggestions in the final version.

Thank you once again for your time, expertise, and generous engagement with our submission.

Sincerely,

The Authors

Final Decision

This paper tackles the problem of semantic segmentation in extreme conditions using a novel approach that fuses heterogeneous Event and RGB inputs. The authors introduce the Edge-awareness Semantic Concordance (ESC) framework, which uses semantic edges as a bridge to align features from both modalities within a unified semantic space. The framework's key components are the Edge-awareness Latent Re-coding (ELR), Re-coded Consolidation (RC), and Uncertainty Optimization (UO) modules. To facilitate robust evaluation, the authors also contribute three new benchmark datasets specifically designed for degraded visual conditions.

After the rebuttal, the paper received scores of 5, 4, 4, and 3. Importantly, reviewer BVgg, who initially provided the score of 3, did not submit a Final Justification, but explicitly stated in their rebuttal response that they "will support this article." Given this positive shift in sentiment, the paper effectively received an average score of 4.25, which is around the borderline. The ACs recommended acceptance of this paper.

The ACs recommend that the authors integrate their rebuttal into the final version of the paper to further enhance its overall quality. Additionally, the authors should supplement appropriate citations where needed; for example, the discussion of noise embeddings in Section 3.4 should reference the work GeminiFusion.