Availability-aware Sensor Fusion via Unified Canonical Space
Summary
Review and Discussion
This paper presents a new angle on sensor fusion between photometric (e.g., image) and geometric (e.g., ToF, including lidar/radar) modalities -- "availability-aware" sensor fusion, which the authors call ASF.
The method consists of two "sub-modules": (1) a unified canonical projection into a common space for all modalities, which the authors claim eliminates inconsistencies between modalities; and (2) a component that addresses the problem of a missing sensor/modality by learning to estimate missing, or uncertain, modalities.
Quantitative results on the K-Radar dataset are shown, and the proposed method is seen to outperform baseline methods of multi-sensor fusion.
Strengths and Weaknesses
(S1) First of all, the idea is well-motivated, and the problem setting of dealing with missing or damaged sensors is very interesting and relevant to real-world systems such as autonomous vehicles.
(S2) Most parts of the method are described thoroughly with sufficient mathematical rigor, and importantly, are well-motivated. In other words, the authors provide a written justification of most of the design choices they made, e.g., removal of positional embedding and post-feature normalization, among others.
(S3) The ablation and sensitivity studies are thorough and provide quantitative backing for the design choices mentioned in S2.
(W1) The authors claim that they only evaluate performance on the K-Radar dataset because it "is the only dataset with data captured in adverse weather conditions." While I will not go so far as to call this assertion incorrect, it is important to note that "adverse weather conditions" is far from an objective term -- in my opinion, many datasets, including nuScenes, contain at least some data captured in some degree of adverse conditions. Importantly, most multi-modal (particularly image-lidar) 3D object detection works in autonomous driving are benchmarked on the canonical nuScenes dataset, such as UniTR (Wang et al., ICCV 2023) and SparseFusion (Xie et al., ICCV 2023). Can the authors present the results of their proposed sensor fusion for detection framework on this dataset, as has become the standard in this line of work? Besides nuScenes being the standard, it would be extremely valuable to see that the proposed method works well (enough) on more than one dataset. Currently, this is the main reason for the "fair" rating for Significance.
(W2) In the introduction, the authors seem to equate a missing sensor and an uncertain sensor. While these concepts may seem tied together at face value, the concept of uncertainty is not brought up again in the rest of the paper. The authors should clarify whether their method actually addresses the problem of uncertainty, which in itself is studied in a large body of existing works.
(W3) Minor weakness: Figure 3 would be more convincing if ground truth and prediction were overlaid, as it is currently hard to draw conclusions across the many combinations, which, at least to my eye, seem to show very similar performance. A form of error map or plot would be equally helpful. Relying on the captions containing quantitative results to tell the story means that it would have been more efficient to simply present another small table for the two examples shown.
(W4) Very minor weakness: A few too many abbreviations hinder the flow and readability of the paper. The authors could consider removing some of the more unnecessary ones, like autonomous driving (AD) and feature map (FM).
Questions
(Q1) Can the authors present results on the nuScenes dataset, which is canonical for multi-modal multi-sensor 3D object detection? This would greatly help to motivate the method, see W1 for more details. Alternatively, if the authors would like to provide a rebuttal for why the nuScenes dataset would be inappropriate for this setting, I would be open to this argument, but in this case I would still like to see at least one more relevant experiment against existing approaches.
(Q2) Can the authors expand upon the claim that this method deals with uncertainty in certain sensors? Have the authors done the necessary literature search on methods that quantify uncertainty in data, i.e., aleatoric uncertainty?
(Q3) Again, these are minor points, but can the authors provide a plan to improve clarity and convincingness in the aforementioned areas of the paper (figure, readability)?
I am inclined to keep my score of Accept given that my few concerns are addressed.
Limitations
Yes.
Final Justification
While I maintain that the submission would have been more convincing with evaluation on nuScenes (or at least more than just one dataset), I believe the paper, as it currently stands, still warrants acceptance.
Formatting Issues
None.
We appreciate your thorough review and constructive feedback. We address your concerns below:
Q1/W1: Results on the nuScenes dataset
The primary reason for selecting K-Radar is that this dataset contains extreme weather conditions such as heavy snow and sleet, where sensor surfaces become completely occluded (missing/failure), resulting in total loss of camera and LiDAR measurements. These extreme situations are essential for validating ASF's core contribution of maintaining robust performance through dynamic attention redistribution when individual sensors completely fail. This unique characteristic cannot be found in nuScenes or other datasets.
That said, we agree that demonstrating generalizability is valuable, and we are currently experimenting with ASF on the nuScenes dataset. Since nuScenes differs significantly from K-Radar in sensor configuration and ASF requires careful hyperparameter tuning (as shown in Table 4), the adaptation process is ongoing. We will publicly release all experimental code and trained models on the project page once the optimization is complete.
Q2/W2: Clarification on uncertainty handling
Thank you for pointing out this important distinction, and we apologize for the confusion. To clarify, ASF addresses sensor availability rather than uncertainty quantification (e.g., aleatoric uncertainty). We will revise the words "uncertain" and "certain" used in line 64 of the introduction and line 217 of subsection 3.3 to "degraded" and "available".
Q3/W3: Figure 3 improvements
We appreciate your constructive feedback. To improve Figure 3's clarity as you suggested, we plan three enhancements: (1) overlaying ground truth bounding boxes onto prediction results to facilitate direct comparison, (2) inserting clear visual separators between different weather conditions, and (3) revising captions to better emphasize key findings.
Furthermore, acknowledging that Figure 3 currently contains dense information, we will add a minimum of 2 supplementary scenes in the Appendix formatted as 4×2 layouts. This expanded format enables higher-resolution display of individual subfigures, improving detail visibility and addressing your clarity concerns. We will also make interactive visualizations (inference_results.gif from the appendix zip file) available on the project page, allowing readers to observe sensor attention dynamics more clearly.
Q3/W4: Abbreviation reduction
Thank you for your suggestion on readability. We will eliminate unnecessary abbreviations such as "autonomous driving (AD)" and "feature map (FM)", keeping only essential technical abbreviations like ASF, UCP, CASAP, and BEV. This will improve the flow and readability of the paper.
Thank you again for your constructive feedback that helps improve our work's clarity and impact. We will carefully incorporate all suggested revisions to make our paper better.
Thank you for responding to each of my concerns. Regarding the nuScenes dataset, could results (even if preliminary) be provided within the next couple of days? This would be greatly influential on my final decision regarding this paper.
Thank you very much for your continued engagement with our work and for considering our rebuttal responses. We sincerely appreciate your willingness to reconsider based on additional evidence.
As mentioned in our rebuttal, we selected K-Radar because it uniquely provides extreme adverse weather conditions (heavy snow, sleet) where sensor surfaces become completely occluded, resulting in total loss of camera and LiDAR measurements, which are critical scenarios for validating ASF's core contribution of maintaining robust performance when individual sensors completely fail.
However, following your suggestion about nuScenes dataset experiments, we have been actively working on this evaluation as stated in our rebuttal. While we are making our best efforts to expedite the process, we would like to clarify the technical challenges involved:
- Sensor Configuration Differences: nuScenes has significantly different sensor configurations compared to K-Radar. Specifically for LiDAR, nuScenes uses a 32-channel LiDAR with 10-sweep accumulation, while K-Radar uses a 64-channel LiDAR with a single frame. Additionally, nuScenes employs 6 cameras while K-Radar uses a single front camera (or stereo camera). These fundamental differences require additional time for architectural adaptation and validation of each sensor's feature extractor (i.e., sensor-specific encoders).
- Hyperparameter Optimization: As demonstrated in Table 4, ASF is sensitive to hyperparameters, requiring careful tuning across multiple configurations (patch size, channel dimensions, number of heads, etc.) for optimal performance on a new dataset, which necessitates multiple training and validation iterations.
- Computational Resource Constraints: While many SOTA nuScenes fusion networks, including UniTR (Wang et al., ICCV 2023) and Cross Modal Transformer (Yan et al., ICCV 2023), utilized 8 A100 GPUs with 80GB VRAM each (640GB total VRAM), we have been working with a single RTX 3090 GPU (24GB VRAM) for our ASF experiments. Recognizing the importance of nuScenes evaluation for generalization, as you suggested, we are in the process of acquiring and setting up additional GPUs. Meanwhile, despite VRAM limitations preventing SOTA-comparable training settings, we have made progress on the dataset download and code adaptation for nuScenes, and are exploring technical solutions such as gradient accumulation for batch-size expansion (a minimal sketch follows this list).
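As an illustration of that last point, the following is a minimal sketch of gradient accumulation used to emulate a larger effective batch size on a single 24GB GPU; the model interface, data loader, and accumulation step count are placeholders rather than our actual ASF training configuration.

```python
# Minimal sketch of gradient accumulation (placeholder training loop, not the
# actual ASF configuration). Gradients from `accum_steps` mini-batches are
# accumulated before a single optimizer step, emulating a larger batch size.
import torch

def train_one_epoch(model, loader, optimizer, accum_steps=8, device="cuda"):
    model.train()
    optimizer.zero_grad()
    for step, (inputs, targets) in enumerate(loader):
        inputs, targets = inputs.to(device), targets.to(device)
        loss = model(inputs, targets)        # assumes the model returns its loss
        (loss / accum_steps).backward()      # scale so accumulated grads average
        if (step + 1) % accum_steps == 0:    # one update per accum_steps batches
            optimizer.step()
            optimizer.zero_grad()
```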
Given these constraints and the review timeline, and although we are working diligently on preliminary results, it may not be feasible to provide nuScenes results within the next few days. We are therefore committed to releasing nuScenes experimental results, code, and trained models on our project page once optimization is complete. In the meantime, we can confirm the following:
- Upon completion of ASF optimization for nuScenes, we will release results and code on the project page
- Manuscript revisions: (1) Clarification of terminology (from "uncertain" and "certain" to "degraded" and "available") and removal of unnecessary abbreviations, (2) Improved Figure 3 visualization
- Appendix updates with additional experiments: (1) Quantitative analysis of sensor contributions across different object distances, revealing how each modality's importance varies with range, (2) Statistical analysis of attention weight distributions under various weather conditions, and (3) Additional qualitative results visualization
We sincerely appreciate your thorough and insightful review, as well as your constructive feedback.
I greatly appreciate the authors' transparency regarding this matter and understand the line of thinking. Furthermore, I appreciate the efforts to expedite the experiments. While I continue to maintain that the submission would have been more convincing with evaluation on nuScenes (or at least more than just one dataset), I believe the paper, as it currently stands, still warrants acceptance, and I look forward to seeing the results on nuScenes in the next revision/project page. Thank you for your continued discussion.
This paper proposes a pipeline for 3D object detection that exploits the fusion of multimodal sensor data. The pipeline works in three steps: first, sensor-specific pre-trained feature extractors are used for each type of sensor (camera, LiDAR, radar), providing features from all sensors in the same BEV format. The fusion is then done in two stages: one unifies the features from the different sources in a common canonical space, and the other applies a cross-attention model trained on available data, making it give more attention to available sensors; this improves performance by filtering noisy data from bad weather or broken sensors. The patch-based optimization and the single common loss for all modalities allow faster computation. The resulting features can then be used for 3D object detection. The final results show SOTA performance across different weathers and sensor failures against multiple baselines.
Strengths and Weaknesses
Strengths:
- Proposes a method to leverage multimodal sensors for 3D detection that is able to automatically filter noisy data through attention weights.
- SOTA results in difficult situations (bad weather, broken sensors)
- The paper is clear and well-written, with in-depth experiments and analysis
Weaknesses:
- Camera information still provides less reliable performance, as it does not provide geometric information the way LiDAR or radar do.
Questions
How would stereo cameras impact the performance, as they would provide geometric information from the images? Would any representation other than BEV be possible, and what impact would it have on performance? How much of the performance variation can be attributed to the feature extractor of each modality?
Limitations
Yes
Final Justification
My questions were answered, no other concern.
Formatting Issues
No
We appreciate your thoughtful review. We address your questions and concerns below:
Q1/W1: How would stereo cameras impact the performance, as it would provide geometric information from the images?
Thank you for this excellent question. Stereo cameras can indeed enhance performance by providing depth information. We have implemented Chen et al.'s DSGN++ (IEEE T-PAMI 2022) on K-Radar and conducted experiments, achieving improved performance compared to monocular cameras: 26.5% AP3D for the Sedan class and 26.3% AP3D for the Bus or Truck class. These results confirm the value of the geometric information provided by stereo vision and show that it substantially addresses the camera's geometric limitations you mentioned.
However, K-Radar's ZED 2i camera has a short 12cm baseline compared to KITTI's 54cm, limiting depth estimation accuracy for distant objects. Nevertheless, we plan to integrate this stereo-based approach as ASF's camera-specific encoder. All experimental code and weights will be released on the project page upon completion.
Q2: Would any other representation than BEV be possible and what impact would it have on performance?
Thank you for raising this important point. Several representations beyond BEV are indeed possible. (1) Voxel representation can directly represent 3D space preserving height information, but significantly increases memory and computational costs. (2) Polar coordinate representation suits 4D Radar but is less intuitive for downstream tasks (e.g., planning, segmentation). (3) Perspective view fusion benefits camera viewpoint but struggles with occlusion handling.
We chose BEV because it: (1) enables unified 2D representation for all sensor features, (2) directly supports autonomous driving path planning, and (3) provides balance between computational efficiency and performance.
Q3: How much of the performance variation can be attributed to the feature extractor of each modality?
This is a very insightful question. While various feature extractors (termed sensor-specific encoders in our paper) have been developed for each modality, recent sensor fusion research typically fixes well-established extractors to isolate fusion module contributions, e.g., Lang et al.'s PointPillars (CVPR 2019) and Yan et al.'s SECOND (Sensors 2018) for LiDAR-specific encoders. As mentioned in Appendix A.1, we intentionally froze pre-trained encoders to clearly isolate ASF's contribution.
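For illustration only, a minimal sketch of this frozen-encoder setup is shown below; the placeholder modules stand in for the pre-trained sensor-specific encoders and the fusion module, and the code is not taken from our released implementation.

```python
# Minimal sketch of freezing pre-trained sensor-specific encoders so that only
# the fusion module receives gradient updates. All modules are placeholders.
import torch
import torch.nn as nn

camera_encoder = nn.Linear(128, 128)   # stand-in for a BEVDepth-style encoder
lidar_encoder = nn.Linear(128, 128)    # stand-in for PointPillars/SECOND
radar_encoder = nn.Linear(128, 128)    # stand-in for a 4D-radar encoder
fusion_module = nn.MultiheadAttention(embed_dim=128, num_heads=4)

for enc in (camera_encoder, lidar_encoder, radar_encoder):
    for p in enc.parameters():
        p.requires_grad = False        # keep pre-trained weights fixed
    enc.eval()                         # inference mode (matters for dropout/BN in real encoders)

# Only the fusion module's parameters are optimized.
optimizer = torch.optim.AdamW(fusion_module.parameters(), lr=1e-4)
```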
As you correctly note, better sensor-specific encoders would improve overall performance. As mentioned in Q1, we are experimenting with replacing the current monocular BEVDepth with the higher-performing stereo-based DSGN++ for ASF integration. Completed experimental code and weights will be uploaded to the project page to demonstrate how improved sensor-specific encoders impact performance.
Thank you again for your constructive feedback. We will carefully consider all points raised to improve our research, particularly regarding camera geometric information utilization.
Thank you, my questions have been answered.
We sincerely appreciate your time and effort in reviewing this paper and your recognition of our research.
The manuscript proposes a novel approach to sensor fusion designed to enhance robustness and reliability in autonomous vehicle perception by integrating data from multiple sensors (cameras, LiDAR, radar) based on their availability. To this end, the authors propose a cross-attention method and a unified canonical projection. The unified canonical space allows a unified and consistent representation of features from different sensors, while the cross-attention across sensors along patches dynamically adjusts the importance of features from the sensors based on their current availability and reliability. The objective is to prevent adverse effects on the fused data quality when one sensor is unavailable or sends low-quality data. A sensor combination loss is used to optimize the detection performance across sensors. The method is evaluated through experiments on the K-Radar dataset. Results show improvements on object detection performance metrics compared to state-of-the-art methods.
Strengths and Weaknesses
Strengths
The proposed method is technically sound, novel, and very useful to the field, as it tackles a real issue in perception. Sensor fusion drastically improves performance and robustness, but can have negative impacts when some sensors become unavailable or degraded. This work provides a solution to this issue by dynamically changing the contribution of features from each sensor based on cross-attention. This is a clever use of cross-attention that solves a real-world problem. The contribution of the Unified Canonical Space for more consistent feature representation across sensors is valuable, and its usage with the proposed cross-attention method seems to succeed in reducing issues linked to sensor availability. The paper is well written, the equations describing the cross-attention method are clear, and the method is well evaluated on the K-Radar dataset. Visualizations of t-SNE plots and attention maps are an interesting addition. The results of the experiments are also quite impressive, showing significant improvement over the state of the art on a key task: object detection. The code is also already available on GitHub.
Weaknesses
Figure 3, while very useful and containing interesting visualizations, could be made clearer for the reader. It would also be interesting to evaluate the method on different datasets, although I understand adequate datasets are not that numerous.
Questions
How do you plan to address the limitations related to the camera features integration? Particularly in difficult weather conditions.
The method is evaluated on K-Radar. This is enough, but it would be interesting to see evaluations on other datasets or on real-world data.
I would recommend improving the figures a bit; they are very informative but slightly complex. Figure 3 would especially benefit from a simpler organization and more space for better readability.
Limitations
yes
Final Justification
This is a strong contribution, and I already had a favorable view of it. Authors made efforts to answer my minor remarks and provided clear answers. Other reviewers seem to have similar opinions. Therefore, I will not update my rating and leave it at 6.
Formatting Issues
None
We appreciate your thoughtful and positive review. We address your suggestions below:
Q1: How do you plan to address the limitations related to camera features integration? Particularly in difficult weather conditions.
Thank you for this excellent question. Camera vulnerability in adverse weather stems from its fundamental reliance on visible light. Our strategy is to address this inherent limitation in two ways:
First, ASF automatically assigns higher weights to other sensors (particularly 4D Radar) when camera performance degrades. As demonstrated in inference_result_2.gif (in the appendix zip file), when cameras fail due to heavy snow, ASF increases reliance on 4D Radar (blue) to maintain robust performance.
Second, we are exploring stereo-based approaches to enhance camera utilization. We have implemented Chen et al.'s DSGN++ (IEEE T-PAMI 2022) on K-Radar, achieving improved performance (26.5% and 26.3% AP3D for Sedan and Bus or Truck, respectively) compared to the monocular camera-based approach. We plan to integrate this as ASF's camera-specific encoder. However, it's worth noting that K-Radar's ZED 2i camera has a 12cm baseline compared to KITTI's 54cm, limiting improvement potential.
Q2/W2: The method is evaluated on K-Radar. This is enough, but it would be interesting to see evaluations on other datasets or on real-world data.
Thank you for this important suggestion. While K-Radar is unique in providing extreme adverse weather conditions (complete camera/LiDAR failure), we agree that demonstrating generalizability is valuable. We are currently conducting ASF experiments on the nuScenes dataset. Upon completing hyperparameter optimization, we will release results and code on the project page.
Q3/W1: I would recommend to improve a bit the figures. Figure 3 would especially benefit from a simpler organization.
Thank you for this constructive feedback. To address your concern about Figure 3's clarity, we will implement several improvements: (1) overlay GT bounding boxes on prediction results for easier comparison, (2) add clear separation lines between weather conditions, and (3) enhance captions to highlight key findings.
Additionally, recognizing that the current Figure 3 may be too dense with information, we will include 2 additional scenes in the Appendix using a 4×2 format. This larger format allows each subfigure to be displayed at a higher resolution, making details more visible and addressing your concern about clarity. We will also provide interactive visualizations (inference_results.gif in the appendix zip file) on the project page, enabling readers to dynamically observe sensor attention changes with better visual clarity.
Thank you again for your positive evaluation and constructive suggestions. We will incorporate these improvements to enhance our contribution.
Authors have provided a clear rebuttal that addresses the concerns. They discuss two well-thought strategies to handle camera limitations in adverse weather. The ongoing evaluation on nuScenes dataset will provide more insights on generalizability. Detailed planned improvements of Figure 3 and additional illustrations will enhance clarity and readability. Overall, the paper presents a significant contribution to the field.
We greatly appreciate your time and effort in reviewing this paper and your recognition of our work.
This paper proposes a sensor fusion framework (Availability-aware Sensor Fusion) to address the degradation problem in autonomous driving sensors. ASF introduces two components: Unified Canonical Projection (UCP), which projects features from heterogeneous sensors into a unified representation space, and CASAP (Cross-Attention across Sensors Along Patches), which dynamically fuses sensor features based on availability. A Sensor Combination Loss (SCL) is adopted to improve robustness. Experiments on the K-Radar benchmark demonstrate significant improvements in both performance and efficiency over prior works.
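A minimal sketch of this kind of projection into a shared canonical space is shown below; the patchifying convolution, 128-channel width, post-projection LayerNorm, and omission of positional embeddings are illustrative assumptions and not the paper's exact UCP definition.

```python
# Minimal sketch of projecting heterogeneous BEV feature maps into shared
# canonical patch tokens (illustrative, not the paper's exact UCP).
import torch
import torch.nn as nn

class CanonicalProjection(nn.Module):
    """Project one sensor's BEV feature map to shared-space patch tokens."""
    def __init__(self, in_channels, canon_channels=128, patch=4):
        super().__init__()
        # Patchify the BEV map and map sensor-specific channels to a common width.
        self.proj = nn.Conv2d(in_channels, canon_channels,
                              kernel_size=patch, stride=patch)
        self.norm = nn.LayerNorm(canon_channels)   # post-feature normalization

    def forward(self, bev):                        # bev: (B, C_sensor, H, W)
        x = self.proj(bev)                         # (B, C_canon, H/p, W/p)
        x = x.flatten(2).transpose(1, 2)           # (B, P, C_canon) patch tokens
        return self.norm(x)                        # no positional embedding added

# One projection per sensor maps camera/LiDAR/radar BEV maps into the same space.
cam_proj, radar_proj = CanonicalProjection(256), CanonicalProjection(64)
cam_tokens = cam_proj(torch.randn(2, 256, 128, 128))
radar_tokens = radar_proj(torch.randn(2, 64, 128, 128))
print(cam_tokens.shape, radar_tokens.shape)  # both torch.Size([2, 1024, 128])
```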
Strengths and Weaknesses
Strengths
1. ASF achieves state-of-the-art results under multiple weather and sensor failure conditions.
2. Lower memory usage and higher inference FPS than existing methods; CASAP simplifies the standard SCF computation by reducing the cross-attention complexity.
Weaknesses
1. The limitations of camera-based object detection (particularly in adverse weather) remain unaddressed, yet performance in adverse weather is highlighted in the paper.
2. The experimental section primarily consists of straightforward performance tables under different sensor/weather configurations, but lacks in-depth diagnostic or interpretative analysis.
Questions
1. How does CASAP actually infer sensor reliability? Is there any theoretical justification or formal analysis behind its attention mechanism, or is it purely learned empirically?
2. Can the method generalize to datasets without 4D Radar (e.g., nuScenes, Waymo)? If not, how tightly is the approach bound to K-Radar's sensor setup?
3. The influence of camera features remains unclear. Given that the limitation section identifies the camera as a weak modality, how much does it actually contribute to detection performance in normal vs. adverse conditions?
Limitations
yes
Formatting Issues
None
Thank you sincerely for your thoughtful review. We address your concerns below:
Q1: How does CASAP actually infer sensor reliability?
Thank you for this excellent question. CASAP empirically learns sensor reliability through cross-attention mechanisms trained on multi-sensor data. Features projected into the unified canonical space via UCP are optimized during training such that the reference query (Eq. 5) achieves high correlation with features from reliable sensors. During inference, when a sensor is degraded or missing, its features exhibit low correlation with the reference query and consequently receive lower attention weights. This is validated in Fig. 3, where 4D Radar assumes a dominant role when camera/LiDAR fail under adverse weather conditions. To facilitate understanding of this mechanism, we will release dynamic sensor attention map videos (inference_result_2.gif in the appendix zip file) on the project page.
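A minimal sketch of this availability-dependent weighting is shown below; the single-head dot-product formulation and tensor names are illustrative assumptions rather than the exact CASAP implementation.

```python
# Minimal sketch of availability-aware cross-attention over per-sensor patch
# features (illustrative single-head formulation, not the exact CASAP code).
import torch
import torch.nn.functional as F

def availability_aware_fusion(ref_query, sensor_feats):
    """
    ref_query:    (B, P, C)    reference query per BEV patch
    sensor_feats: (B, S, P, C) canonical-space features from S sensors
    returns fused features (B, P, C) and attention weights (B, P, S)
    """
    B, S, P, C = sensor_feats.shape
    # Correlation between the reference query and each sensor's patch feature.
    # Degraded or missing sensors yield features that correlate poorly with the
    # query, so their softmax weight shrinks automatically.
    scores = torch.einsum("bpc,bspc->bps", ref_query, sensor_feats) / C**0.5
    attn = F.softmax(scores, dim=-1)                      # (B, P, S)
    fused = torch.einsum("bps,bspc->bpc", attn, sensor_feats)
    return fused, attn

# Toy usage: 3 sensors (camera, LiDAR, 4D radar), 64 patches, 128 channels.
q = torch.randn(2, 64, 128)
f = torch.randn(2, 3, 64, 128)
fused, attn = availability_aware_fusion(q, f)
print(fused.shape, attn.shape)  # torch.Size([2, 64, 128]) torch.Size([2, 64, 3])
```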
Q2: Can the method generalize to datasets without 4D Radar?
Thank you for raising this important point. ASF's architecture is inherently sensor-agnostic by design. We selected K-Radar as it uniquely provides multi-modal data under extreme adverse weather conditions where cameras and LiDAR experience complete failure due to heavy snow occlusion. Following your valuable suggestion, we are currently conducting experiments with ASF on nuScenes. Given the differences in sensor configurations and training settings between nuScenes and K-Radar, as well as ASF's sensitivity to hyperparameters (as shown in Table 4), the optimization process requires careful hyperparameter tuning across multiple experimental configurations. Upon completion, we will release the training code and model weights for both benchmarks on the project page.
Q3: Camera contribution analysis?
Thank you for this insightful observation. While cameras are inherently vulnerable under adverse weather conditions due to their reliance on visible light, they nonetheless provide crucial contributions for long-range object detection. This is clearly demonstrated in inference_result_1.gif (in the appendix zip file), where red regions (indicating camera attention) dominate in distant object areas (upper-right portion of the video). This phenomenon aligns with findings from Wu et al.'s Virtual Sparse Convolution (CVPR 2023), which achieved 1st place on the KITTI 3D object detection benchmark, where their Figure 2 illustrates that virtual points (derived from camera features) are instrumental in achieving substantial AP improvements for distant objects. We are currently conducting quantitative analyses of distance-based sensor contributions, with results to be included in the revised Appendix in the camera-ready version.
W1: Camera performance in adverse weather
Your observation is indeed accurate. The K-Radar dataset comprises over 60% adverse weather frames, where heavy snow and sleet conditions result in complete occlusion of the sensor surfaces, leading to total loss of camera measurements (as demonstrated in inference_result_2.gif). These extreme conditions yield near-zero AP for camera-only detection, hence our notation of "-" (not available) in Table 3. This reflects the fundamental physical limitations of optical sensors rather than any methodological deficiency. ASF's primary contribution lies in gracefully handling such sensor failures through dynamic attention redistribution, maintaining robust performance despite individual sensor degradation. Comprehensive camera-only performance logs across all weather conditions will be made available on the project page, following the format provided in the supplementary materials (logs_10_seeds).
W2: Diagnostic analysis
We appreciate this constructive feedback. In addition to the t-SNE visualizations (Figs. 1, 5) and attention maps (Fig. 3) currently presented, we are conducting comprehensive diagnostic analyses to provide deeper interpretative insights: (1) quantitative analysis of sensor contributions across different object distances, revealing how each modality's importance varies with range, and (2) statistical analysis of attention weight distributions under various weather conditions, quantifying the dynamic adaptation of sensor reliability. These additional analyses will be incorporated into the revised Appendix in the camera-ready version, offering a more thorough understanding of ASF's adaptive behavior under diverse operational conditions.
Thank you once again for your thorough review. We believe these addressed concerns and planned enhancements will substantially improve the paper's impact and reproducibility.
The paper proposes a multimodal data (cameras, LiDAR, radar signals) fusion method for autonomous vehicle perception. All four reviewers gave positive scores: two borderline accepts, one accept, and one strong accept. The major concerns raised by the reviewers included insufficient details and the lack of experimental results on the nuScenes benchmark. The authors addressed all the concerns during the discussion period. Based on all of this, the decision is to recommend the paper for acceptance.