PaperHub
Overall: 5.8/10, Rejected (5 reviewers; ratings 6, 6, 6, 5, 6; min 5, max 6, std 0.4)
Confidence: 3.8 · Correctness: 3.0 · Contribution: 3.0 · Presentation: 2.8
ICLR 2025

$\alpha$-OCC: Uncertainty-Aware Camera-based 3D Semantic Occupancy Prediction

OpenReview · PDF
Submitted: 2024-09-26 · Updated: 2025-02-05
TL;DR

To address the challenging camera-based 3D Semantic Occupancy Prediction (OCC) problem for autonomous driving, we recognize the problem from a fresh uncertainty quantification (UQ) perspective.

Abstract

In the realm of autonomous vehicle (AV) perception, comprehending 3D scenes is paramount for tasks such as planning and mapping. Camera-based 3D Semantic Occupancy Prediction (OCC) aims to infer scene geometry and semantics from limited observations. While it has gained popularity due to affordability and rich visual cues, existing methods often neglect the inherent uncertainty in models. To address this, we propose an uncertainty-aware camera-based 3D semantic occupancy prediction method ($\alpha$-OCC). Our approach includes an uncertainty propagation framework (Depth-UP) from depth models to enhance geometry completion (up to 11.58% improvement) and semantic segmentation (up to 12.95% improvement) for a variety of OCC models. Additionally, we propose a hierarchical conformal prediction (HCP) method to quantify OCC uncertainty, effectively addressing the high-level class imbalance in OCC datasets. On the geometry level, we present a novel KL-based score function that significantly improves the occupied recall of safety-critical classes (45% improvement) with minimal performance overhead (3.4% reduction). For uncertainty quantification, we demonstrate the ability to achieve smaller prediction set sizes while maintaining a defined coverage guarantee. Compared with baselines, it reduces up to 92% set size. Our contributions represent significant advancements in OCC accuracy and robustness, marking a noteworthy step forward in autonomous perception systems.
Keywords
Uncertainty Propagation · Semantic Occupancy Prediction · Conformal Prediction

Reviews and Discussion

Review (Rating: 6)

This paper proposes an uncertainty-aware camera-based 3D semantic occupancy prediction method, named α-OCC. Specifically, it addresses two key challenges in occupancy prediction: depth estimation uncertainty and high-level class imbalance. To tackle these challenges, it introduces uncertainty propagation (Depth-UP) from depth models to enhance occupancy prediction performance and proposes a novel hierarchical conformal prediction (HCP) method to improve the recall of rare classes. The effectiveness of the proposed approach is validated through experiments on two methods (VoxFormer and OccFormer) and two datasets (SemanticKITTI and KITTI360).

Strengths

  1. The paper is well-structured, clearly identifying two key challenges and outlining corresponding solutions.
  2. Although rare classes have little impact on the performance, they hold significant value for real-world safety.
  3. This work makes the first attempt to propose the uncertainty propagation framework, Depth-UP, to improve OCC performance.

Weaknesses

  1. Considering that the challenges addressed in the paper are reflected across almost all datasets, the experiments conducted solely on the KITTI dataset lack some persuasiveness. In the camera-based occupancy prediction field, more recent methods tend to evaluate on the nuScenes occupancy dataset. Therefore, performance results on nuScenes would be more compelling.
  2. How was the value of alpha determined in the experiments, and how should it be decided in practical applications?

Questions

  1. It seems that the ground truth in the third row of Figure 4 appears to be incorrect. Why is the entire middle row filled with cars?
  2. In the text, does "Hierarchical" in HCP refer to first distinguishing whether it is occupied and then identifying the specific category?
  3. From Figure 2, it seems that the semantic prediction is directly based on the depth prediction as input, which appears somewhat odd.
Comment

We appreciate your constructive comments and feedback. Below we address your concerns and questions point by point. We have highlighted the modified sections in the revised paper for clarity.

W1. Thank you for your suggestion. We are now conducting experiments on the nuScenes dataset and will show the results before the discussion deadline.

W2. Thank you for your valuable question. As mentioned in L512-L516 of our paper, the class-specific error rate $\alpha^y$ is set by multiplying the original class error rate of the OCC model with the scale $\lambda < 1$, to raise the coverage requirement for each OCC model and class in our experiments. In the ablation study (Figure 6), we consider five settings with $\lambda \in \{0.86, 0.89, 0.92, 0.95, 0.98\}$. For the results in Table 2, we use $\lambda = 0.86$.

In practical applications, deciding the error rate $\alpha$ for conformal prediction depends on the specific use case and the trade-off between prediction set size and risk tolerance [1]. In high-risk applications, such as medical diagnosis, lower error rates are often chosen to ensure high coverage and minimize critical errors. For instance, $\alpha = 0.01$ is used to achieve a 99% confidence level. In low-risk applications, such as recommendation systems, a higher error rate, such as 0.1 or 0.2, may be acceptable if smaller prediction sets are preferable.
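For readers unfamiliar with how $\alpha$ enters the procedure, here is a minimal split-conformal sketch in Python. The scores are synthetic and the code is not the authors' implementation; in the paper's setting the class-specific $\alpha^y$ (scaled by $\lambda$ as described above) would be applied per class.

```python
import numpy as np

def conformal_threshold(cal_scores, alpha):
    """Split conformal prediction: return the finite-sample-corrected
    (1 - alpha) quantile of the calibration nonconformity scores."""
    n = len(cal_scores)
    q = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    return np.quantile(cal_scores, q)

# Synthetic calibration scores, e.g. 1 - softmax probability of the true class.
rng = np.random.default_rng(0)
cal_scores = rng.beta(2, 5, size=1000)

for alpha in (0.01, 0.1, 0.2):
    tau = conformal_threshold(cal_scores, alpha)
    # A test-time prediction set keeps every class whose score is <= tau,
    # so a stricter alpha yields a larger tau and larger prediction sets.
    print(f"alpha={alpha:.2f} -> threshold tau={tau:.3f}")
```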

Q1. Thank you for your valuable question. The SemanticKITTI dataset has known annotation defects for dynamic objects, primarily due to inaccuracies in labeling. These defects arise from the use of LiDAR temporal fusion for annotations, which can lead to ghosting effects for dynamic objects. This phenomenon is thoroughly discussed in Figure 2 of the SSCBench paper [2]. In the third row of Figure 4, the car is in motion on the road, which results in incorrect labels filling the entire middle row with cars. We have added a footnote to Figure 4 in the revised paper explaining this problem.

SSCBench[2] has acknowledged this issue, and thus KITTI360 proposed in SSCBench does not suffer from ghosting problems. This may also explain why our Depth-UP enhances VoxFormer significantly more on KITTI360 compared to SemanticKITTI, showing a 1.64 mIoU improvement versus a 1.01 mIoU improvement.

Q2. Yes, that is correct. In HCP, "Hierarchical" refers to first determining whether a voxel is occupied, and if it is, subsequently identifying its specific category.
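To illustrate this two-stage logic in code, here is a minimal sketch with hypothetical thresholds and class names; the paper's actual HCP score functions and calibration procedure are more involved.

```python
def hierarchical_prediction_set(p_occupied, class_probs, tau_occ, tau_cls):
    """Stage 1 (geometry): decide whether the voxel is occupied.
    Stage 2 (semantics): only for occupied voxels, keep every class whose
    probability clears a separately calibrated threshold."""
    if p_occupied < tau_occ:
        return {"empty"}
    keep = {c for c, p in class_probs.items() if p >= tau_cls}
    # Never return an empty set: fall back to the top class.
    return keep or {max(class_probs, key=class_probs.get)}

# Hypothetical voxel with softmax outputs over a few classes.
probs = {"car": 0.55, "person": 0.30, "road": 0.15}
print(hierarchical_prediction_set(0.9, probs, tau_occ=0.5, tau_cls=0.25))
# -> {'car', 'person'}
```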

Q3. Thank you for your valuable question. We think it is common to use depth predictions as input to semantic prediction, as in the works [3,4,5]. As mentioned in L246-L254 of our paper, to utilize the depth uncertainty information in the semantic features, we extract depth features from the concatenated depth mean and standard deviation. These newly acquired depth features are then integrated with the original 2D image features, constituting a novel set of input features $\{\mathbf{F}_I, \mathbf{F}_D\}$. This integration strategy capitalizes on the insights gained from prior depth predictions, improving OCC performance through enhanced semantic understanding.
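As a concrete sketch of the fusion described above, assuming a PyTorch setting; the channel sizes and the small CNN standing in for the ResNet depth-feature backbone are illustrative, not the paper's architecture.

```python
import torch
import torch.nn as nn

class DepthFeatureFusion(nn.Module):
    """Concatenate the predicted depth mean and standard deviation,
    extract depth features F_D, and concatenate them with the image
    features F_I to form the input set {F_I, F_D}."""
    def __init__(self, img_channels=256, depth_channels=64):
        super().__init__()
        self.depth_encoder = nn.Sequential(               # stand-in backbone
            nn.Conv2d(2, depth_channels, 3, padding=1),   # 2 = (mean, std)
            nn.ReLU(),
            nn.Conv2d(depth_channels, depth_channels, 3, padding=1),
        )

    def forward(self, f_img, depth_mean, depth_std):
        f_depth = self.depth_encoder(torch.cat([depth_mean, depth_std], dim=1))
        return torch.cat([f_img, f_depth], dim=1)         # {F_I, F_D}

fusion = DepthFeatureFusion()
out = fusion(torch.randn(1, 256, 48, 160),
             torch.randn(1, 1, 48, 160), torch.rand(1, 1, 48, 160))
print(out.shape)  # torch.Size([1, 320, 48, 160])
```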

References.

  1. Conformal prediction for reliable machine learning: theory, adaptations and applications, Newnes 2014.

  2. SSCBench: Monocular 3D Semantic Scene Completion Benchmark in Street Views, IROS 2024.

  3. Indoor semantic segmentation using depth information, ICLR 2013.

  4. Semantic Scene Completion From a Single Depth Image, CVPR 2017.

  5. Depth-guided Texture Diffusion for Image Semantic Segmentation, arXiv 2024.

Comment

Thank you for the authors' response, which addressed most of my concerns. Regarding Q3, my point is that the current illustration in the figure might give the impression that the predictions are made directly from depth features, rather than using the combination of $\mathbf{F}_I$ and $\mathbf{F}_D$ as described. I will wait to review the nuScenes experiment results before making my final decision.

Comment

Apologies for the misunderstanding. In the left part of Figure 2, the concatenation operation between the depth feature and image feature is present but may not have been easily noticeable. We have revised Figure 2 in the revised paper to emphasize this point more clearly. Additionally, we have updated the caption of Figure 2 to include the statement: "$\mathbf{C}$ denotes the concatenation of the depth feature $\mathbf{F}_D$ and image feature $\mathbf{F}_I$."

Comment

W1. Thank you for your suggestion on conducting experiments on the nuScenes occupancy dataset. We have conducted experiments on the Occ3D-nuScenes dataset [1] (the occupancy version of nuScenes). The Occ3D-nuScenes dataset consists of 1,000 outdoor driving scenes captured using six surround-view cameras. All results are in Subsection A.11 of the revised paper. The OCC model we used here is BEVStereo [2], which serves as a commonly used OCC baseline for the Occ3D-nuScenes dataset. Due to time and computational constraints, both the base BEVStereo model and BEVStereo with our Depth-UP were trained for 12 epochs with a batch size of 20 on a server with 4 Tesla V100 GPUs.

Table 6 presents the mIoU across all classes and the IoU for each individual class for both the base BEVStereo model and BEVStereo enhanced with our Depth-UP on the Occ3D-nuScenes dataset. Depth-UP demonstrates notable improvements over the base OCC model, achieving a 1.61 (8.77%) increase in mIoU. Furthermore, our Depth-UP significantly enhances performance for small, safety-critical classes, including a 9.43 IoU improvement for the motorcycle class and a 1.09 IoU improvement for the person class. These improvements are attributed to the effective integration of uncertainty information from the depth model into the OCC model.

Table 7 compares our HCP method with SCP and CCCP on the Occ3D-nuScenes dataset using the BEVStereo model. The results demonstrate that HCP consistently achieves robust empirical class-conditional coverage while generating smaller prediction sets. Compared to SCP, it reduces the set size by up to 87% and the coverage gap by up to 97%. Similarly, compared to CCCP, it achieves reductions of up to 10% set size and 6% coverage gap. These findings are consistent with experimental results on the SemanticKITTI and KITTI360 datasets, further validating the scalability of our approach.

References.

  1. nuScenes: A multimodal dataset for autonomous driving, CVPR 2020.

  2. BEVStereo: Enhancing depth estimation in multi-view 3d object detection with temporal stereo, AAAI 2023.

Comment

Thank you to the authors for adding the results on nuScenes. Although the baseline performance is relatively lower, it still demonstrates the effectiveness of the proposed method. I encourage the authors to further improve the experimental section on nuScenes. Overall, I am inclined to accept this paper.

Comment

Thank you for your response. We will continue to improve the experimental section on nuScenes. To help this paper receive a higher score and increase the likelihood of acceptance, do you have any additional suggestions on how to further improve it? We would sincerely appreciate your constructive feedback!

Review (Rating: 6)

The paper presents an uncertainty-aware camera-based method for 3D semantic occupancy prediction called $\alpha$-OCC. This approach includes Depth-UP, which propagates depth uncertainty, and HCP, which quantifies the uncertainty of the OCC model; together they enhance the performance of geometry prediction and semantic segmentation, while also improving the recall of rare safety-critical classes. The authors test their methods on the SemanticKITTI and KITTI360 datasets based on VoxFormer and OccFormer.

Strengths

This paper is well-written and shows that ignoring depth uncertainty hinders the geometry and semantics performance of OCC models. The authors conduct extensive experiments to demonstrate the effectiveness of their uncertainty propagation and quantification.

Weaknesses

  1. Table 1 shows improved scores across each metric, while Figure 4 provides better visualizations for the safety-critical classes. In Table 4, the mIoU for the "person" category decreases slightly, and the mIoU for "bicyclist" also declines a bit on the SemanticKITTI dataset. Could the authors explain the relationship between the higher recall and the decreasing mIoU?
  2. The authors conduct experiments mainly on SemanticKITTI and KITTI360. Would the proposed uncertainty propagation method maintain its performance on multi-camera datasets such as nuScenes or Waymo?

Questions

Please see weaknesses.

Comment

We appreciate your constructive comments and feedback. Below we address your concerns and questions point by point. We have highlighted the modified sections in the revised paper for clarity.

W1. Thank you for your valuable question. Upon reviewing every image in the validation set, we identified the reason for the mIoU decrease in the person and bicyclist categories on the SemanticKITTI dataset. These issues primarily stem from annotation defects, particularly for dynamic objects such as persons and bicyclists. The SemanticKITTI dataset generates annotations using LiDAR temporal fusion, which introduces ghosting effects for moving objects. This problem has been documented in Figure 2 of the SSCBench paper [1]. While cars are also affected, most are stationary, so the impact is minimal. However, nearly all persons and bicyclists in the SemanticKITTI validation set are moving, leading to erroneous annotations. SSCBench [1] has acknowledged these issues, and thus KITTI360 proposed in SSCBench does not suffer from ghosting problems. Our Depth-UP shows mIoU improvements in both person and bicyclist categories on KITTI360, aligning with our expectations. This may also explain why our Depth-UP enhances VoxFormer significantly more on KITTI360 compared to SemanticKITTI, showing a 1.64 mIoU improvement versus a 1.01 mIoU improvement. We have added the explanation in the Subsection A.7 of the revised paper.

In the paper, we did not explicitly show the relationship between recall and mIoU for specific categories. However, if you are referring to the relationship between recall and IoU, we illustrate this in Figure 5. To clarify the relationship between higher recall and a decrease in IoU, the explanation is as follows: Recall measures the model's ability to capture true positives relative to false negatives, while IoU also takes false positives into account [1]. High recall indicates that the model effectively identifies true occupied voxels but does not penalize over-predictions (false positives). In contrast, IoU provides a more balanced metric by penalizing both false negatives and false positives. Therefore, when a model over-predicts (generates too many occupied voxels), recall may remain high since true positives are well captured, but IoU will decrease due to the increased number of false positives. This trade-off highlights the differing focuses of the two metrics.

In summary, the slight mIoU drop in certain categories on SemanticKITTI is influenced by annotation quality, and the relationship between higher recall and decreased IoU is due to the trade-off between true positives and false positives.
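To make this trade-off concrete, here is a toy voxel-count example with illustrative numbers only (not results from the paper):

```python
# recall = TP / (TP + FN);  IoU = TP / (TP + FP + FN)
def recall(tp, fn):
    return tp / (tp + fn)

def iou(tp, fp, fn):
    return tp / (tp + fp + fn)

# Conservative model: few false positives.
print(recall(tp=80, fn=20), iou(tp=80, fp=10, fn=20))   # 0.80  0.727
# Over-predicting model: recall rises, but IoU drops with false positives.
print(recall(tp=95, fn=5), iou(tp=95, fp=120, fn=5))    # 0.95  0.432
```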

W2. Thank you for your suggestion. We are now conducting experiments on the nuScenes dataset and will show the results before the discussion deadline.

References.

  1. SSCBench: Monocular 3D Semantic Scene Completion Benchmark in Street Views, IROS 2024.
Comment

Thank you for addressing my concerns. I noticed a minor error in Table 4: the category for "motorcyclist" seems to be missing.

Comment

Thank you for bringing this to our attention. We have updated Table 4 in the revised paper to include the results for the "motorcyclist" category. While KITTI360 does not have the "motorcyclist" category, for SemanticKITTI, all OCC models achieved an IoU of zero for this category, indicating poor performance across the board.

Comment

W2. Thank you for your suggestion on conducting experiments on a multi-camera dataset. We have conducted experiments on the Occ3D-nuScenes dataset [1] (the OCC version of nuScenes), which is a multi-camera dataset. All results are in Subsection A.11 of the revised paper. The Occ3D-nuScenes dataset consists of 1,000 outdoor driving scenes captured using six surround-view cameras. The OCC model we used here is BEVStereo [2], which serves as a commonly used OCC baseline for the Occ3D-nuScenes dataset. Due to time and computational constraints, both the base BEVStereo model and BEVStereo with our Depth-UP were trained for 12 epochs with a batch size of 20 on a server with 4 Tesla V100 GPUs.

Table 6 presents the mIoU across all classes and the IoU for each individual class for both the base BEVStereo model and BEVStereo enhanced with our Depth-UP on the Occ3D-nuScenes dataset. Depth-UP demonstrates notable improvements over the base OCC model, achieving a 1.61 (8.77%) increase in mIoU. Furthermore, our Depth-UP significantly enhances performance for small, safety-critical classes, including a 9.4 IoU improvement for the motorcycle class and a 1.09 IoU improvement for the person class. These improvements are attributed to the effective integration of uncertainty information from the depth model into the OCC model.

Table 7 compares our HCP method with SCP and CCCP on the Occ3D-nuScenes dataset using the BEVStereo model. The results demonstrate that HCP consistently achieves robust empirical class-conditional coverage while generating smaller prediction sets. Compared to SCP, it reduces the set size by up to 87% and the coverage gap by up to 97%. Similarly, compared to CCCP, it achieves reductions of up to 10% set size and 6% coverage gap. These findings are consistent with experimental results on the SemanticKITTI and KITTI360 datasets, further validating the scalability of our approach.

References.

  1. nuScenes: A multimodal dataset for autonomous driving, CVPR 2020.

  2. BEVStereo: Enhancing depth estimation in multi-view 3d object detection with temporal stereo, AAAI 2023.

Comment

Thank you for adding the experiments on the nuScenes dataset. The experiments have demonstrated the effectiveness of the proposed modules in comparison to BEVStereo, particularly in terms of mIoU metrics. Further analysis on metrics such as recall with this dataset could be conducted to evaluate your methods. The reviewer appreciates the authors' responses and prefers to keep the scores.

Comment

Thank you for your response. We will evaluate our methods on nuScenes with recall and precision metrics. To help this paper receive a higher score and increase the likelihood of acceptance, do you have any additional suggestions on how to further improve it? We would sincerely appreciate your constructive feedback!

Review (Rating: 6)

The paper introduces α-OCC, an uncertainty-aware camera-based 3D semantic occupancy prediction method designed for autonomous vehicle (AV) perception systems. The method aims to enhance the understanding of 3D scenes, which is crucial for tasks like planning and mapping. It addresses the limitations of existing methods that often overlook the inherent uncertainty in models. The paper achieves uncertainty modeling by incorporating a depth distribution into the initialization of the occupancy prediction process. The paper concludes that by integrating uncertainty quantification into OCC tasks, significant advancements in accuracy and robustness can be achieved, particularly for rare safety-critical classes, thus reducing potential risks for AVs.

Strengths

  1. The authors propose Depth-UP to improve geometry completion and semantic segmentation by propagating uncertainty from depth models to occupancy prediction models. This framework enhances the performance of various OCC models.
  2. To tackle class imbalance in OCC datasets, the paper introduces HCP, which quantifies the uncertainty of OCC and improves the recall of safety-critical classes like pedestrians and bicyclists.
  3. The paper demonstrates significant improvements in OCC accuracy and robustness. Depth-UP achieves up to 11.58% improvement in geometry completion and 12.95% in semantic segmentation. HCP improves the occupied recall of safety-critical classes by 45% with only a 3.4% reduction in performance.
  4. The paper shows the ability to achieve smaller prediction set sizes while maintaining a defined coverage guarantee, reducing set size by up to 92% compared to baselines.
  5. Extensive experiments on two OCC models (VoxFormer and OccFormer) and two datasets (SemanticKITTI and KITTI360) validate the effectiveness of the proposed α-OCC approach.

Weaknesses

  1. The paper incorporates uncertainty modeling into occupancy prediction. Despite its good performance, the paper does not provide an explicit illustration of what uncertainty means in occupancy prediction, why it is important, and how it could benefit occupancy prediction.
  2. The paper does not compare against state-of-the-art methods on SemanticKITTI and KITTI360.

Questions

  1. Could you please explain what uncertainty means in occupancy prediction, why it is important, and how it could benefit occupancy prediction, as raised in Weakness 1.
  2. Why do you not compare against SOTA methods on the two datasets?
Comment

We appreciate your constructive comments and feedback. Below we address your concerns and questions point by point. We have highlighted the modified sections in the revised paper for clarity.

W1 & Q1. Thank you for pointing this out. As outlined in the contributions part of the introduction, in the depth model, the uncertainty refers to the error in depth estimation. For the occupancy prediction (OCC), uncertainty is defined as the prediction set under a given class coverage guarantee for each voxel. In the introduction, we explain the importance of considering depth uncertainty propagation and OCC uncertainty quantification in Figure 1. Figure 1(a) illustrates the relationship between depth estimation uncertainty and OCC performance: as the uncertainty (error) in depth estimation increases, the accuracy of OCC decreases. This underscores the significant influence of depth uncertainty on OCC performance and motivates our Depth-UP. By incorporating depth uncertainty into both the geometry completion and semantic segmentation components of OCC, Depth-UP enhances overall OCC accuracy. The importance of uncertainty quantification in OCC is further demonstrated in Figure 1(b). Using our HCP to quantify uncertainty enables improved occupied recall for rare classes and the generation of prediction sets for each occupied voxel, reducing the likelihood of potential crashes. This ability to quantify and manage uncertainty in OCC is crucial for enhancing the planning and control processes of safety-critical autonomous systems, ensuring both reliability and safety. We have modified the revised paper in L051 and L082.

W2 & Q2. Thank you for your suggestion. First, it is important to note that VoxFormer [1] and OccFormer [2], which we used in our paper, were recently proposed (in 2023) on SemanticKITTI and KITTI360. They were widely regarded as baselines in OCC at the time we began this work. VoxFormer [1] has received 185 citations and OccFormer [2] has received 121 citations, highlighting their relevance in the field. Second, the main contribution of our paper is to recognize the OCC problem from a fresh uncertainty quantification perspective, as emphasized in the introduction. We do not aim to propose a state-of-the-art OCC model or compare it directly against others. Instead, our focus is on enhancing existing OCC models through uncertainty propagation (via Depth-UP) and achieving better uncertainty quantification under the high class-imbalance challenge in OCC (via HCP). Due to limitations in time and computational resources, we were unable to conduct additional experiments with the latest state-of-the-art OCC models.

References.

  1. Voxformer: Sparse voxel transformer for camera-based 3d semantic scene completion, CVPR 2023.

  2. OccFormer: Dual-path Transformer for Vision-based 3D Semantic Occupancy Prediction, CVPR 2023.

Comment

The reviewer appreciates the clarification from the authors and prefers to keep the score.

Comment

Thank you for your response and for taking the time to review our work.

Comment

Thank you for your response. To help this paper receive a higher score and increase the likelihood of acceptance, do you have any additional suggestions on how to further improve it? We would sincerely appreciate your constructive feedback!

Review (Rating: 5)

This paper proposes to incorporate the uncertainty inherent in models into camera-based semantic occupancy prediction. The presented framework integrates uncertainty propagation (Depth-UP) from depth models to improve occupancy prediction performance in both geometry and semantics. A hierarchical conformal prediction (HCP) method is designed to quantify occupancy uncertainty effectively under high-level class imbalance. Extensive experiments demonstrate the effectiveness of the proposed α-OCC.

Strengths

(1) This paper leverages uncertainty to lift 2D features to 3D space and designs a post-processing module to improve the accuracy of the classification head. The motivation throughout the paper is strong, and the writing is relatively clear, with detailed explanations of the background knowledge and the proposed method.

(2) The paper presents experiments conducted on the SemanticKITTI and KITTI360 datasets, demonstrating improvements in both geometric and semantic predictions. Additionally, it provides relevant performance metrics for uncertainty quantification.

Weaknesses

(1) The change in the number of parameters after adding the module should be provided in the ablation study section.

(2) The proposed Depth-Up and HCP modules appear to be relatively independent. I would like to understand the relationship between them.

(3) Could you provide an evaluation of how the estimated uncertainty varies with distance?

(4) I would like to know whether the uncertainty of the final predicted semantic occupancy has been improved, specifically with respect to ECE-related metrics, as provided in methods like PaSCo[1].

[1] Cao A Q, Dai A, de Charette R. Pasco: Urban 3d panoptic scene completion with uncertainty awareness[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024: 14554-14564.

Questions

(1) The change in the number of parameters after adding the module should be provided in the ablation study section.

(2) The proposed Depth-Up and HCP modules appear to be relatively independent. I would like to understand the relationship between them.

(3) Could you provide an evaluation of how the estimated uncertainty varies with distance?

(4) I would like to know whether the uncertainty of the final predicted semantic occupancy has been improved, specifically with respect to ECE-related metrics, as provided in methods like PaSCo[1].

[1] Cao A Q, Dai A, de Charette R. Pasco: Urban 3d panoptic scene completion with uncertainty awareness[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024: 14554-14564.

Comment

We appreciate your constructive comments and feedback. Below we address your concerns and questions point by point. We have highlighted the modified sections in the revised paper for clarity.

W1 & Q1. Thank you for your suggestion. In the revised paper, we have provided the number of parameters for each model in Table 3. Specifically, adding PGC introduces 0.11M parameters (0.2%) for the additional head that estimates depth uncertainty. Adding PSS increases the parameter count by 11.54M (19.25%), due to the additional head for depth uncertainty estimation and the extra ResNet backbone for depth feature extraction. Consequently, the entire Depth-UP module also results in an increase of 11.54M parameters (19.25%). To reduce the added parameters, we could explore using a simpler backbone for depth feature extraction; however, this might affect overall performance. Investigating alternative feature extraction backbones will be a focus of our future work.

W2 & Q2. Thank you for pointing this out. As highlighted in the introduction, the primary contribution of our work is to recognize the OCC problem from a fresh uncertainty quantification perspective. Both the Depth-UP and HCP methods are designed to tackle the uncertainty in OCC and improve its overall performance. Depth-UP is the uncertainty propagation framework that improves OCC performance, where the uncertainty quantified by direct modeling is utilized in both geometry completion and semantic segmentation. HCP is the uncertainty quantification method that addresses the high-level class imbalance challenge in OCC; at the geometry level, a novel KL-based score function improves the occupied recall of safety-critical classes with little performance overhead. While they may appear relatively independent, they complement each other by demonstrating how uncertainty can influence both the internal mechanisms and the external outputs of an OCC model. Together, they form an integrated pipeline for addressing OCC challenges. Moreover, employing Depth-UP and HCP simultaneously in a model provides significant improvements in both accuracy and uncertainty quantification. As illustrated in Figure 5 of our paper, both modules independently enhance IoU at the same occupied recall for the rare class, and when used in combination, they achieve the best performance. For instance, in the bottom subfigures of Figure 5(a), the Depth-UP & HCP model maintains an IoU above 30 even at an occupied recall of 70 for the person class, while other models achieve, at most, an IoU of around 10. This clearly demonstrates that the joint application of Depth-UP and HCP delivers superior performance.

W3 & Q3. Thank you for your suggestion. We provide evaluations of how the estimated depth uncertainty and OCC uncertainty vary with distance in Figure 10 of the revised paper, and explain the results in detail in Subsection A.9. Figure 10(a) shows the correlation between the estimated standard deviation (uncertainty) of depth and the distance from the camera to the object. The results reveal that the depth uncertainty is highest when the object is very close to the camera. This phenomenon arises because, in stereo vision systems, objects at close range result in minimal disparity between the two images, making depth estimation inherently challenging [1]. The uncertainty reaches its lowest point at approximately 15 meters, beyond which it increases with distance. This trend aligns with the inverse relationship between depth and disparity, as well as the reduced pixel resolution available for objects further from the camera [1]. These observations confirm that the depth uncertainty estimation in our model is consistent with theoretical expectations.
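For reference, the textbook stereo relation behind the far-range part of this trend (a standard derivation, not one taken from the paper), with focal length $f$, baseline $B$, and disparity $d$:

```latex
% Depth from disparity, and first-order propagation of disparity noise:
\[
  z = \frac{fB}{d}
  \qquad\Longrightarrow\qquad
  \sigma_z \approx \left|\frac{\partial z}{\partial d}\right| \sigma_d
           = \frac{z^{2}}{fB}\,\sigma_d ,
\]
% so for roughly constant disparity noise $\sigma_d$, the depth standard
% deviation grows quadratically with distance, matching the increase in
% estimated uncertainty beyond roughly 15 m reported above.
```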

Figure 10(b) presents the relationship between the Expected Calibration Error (ECE) metric of VoxFormer outputs and the distance to the voxels. In this case, we applied the voxel-based ECE computation method described in PaSCo [2], as you mentioned. The results show that the OCC uncertainty is minimized at approximately 15 meters, consistent with the depth uncertainty trend observed in Figure 10(a). When voxels are very close to the camera, the OCC ECE is relatively high, likely due to depth estimation errors. Similarly, when voxels are far from the camera, the OCC ECE increases, attributed to the limited pixel resolution for distant objects.

Notably, the similarity in the shapes of the curves in Figure 10(a) and 10(b) highlights the significant influence of depth uncertainty on OCC performance, as discussed in Section 1. These findings reinforce the importance of utilizing the depth uncertainty in improving final OCC performance.

Comment

W4 & Q4. Thank you for your suggestion. As described in [3] and [4], conformal prediction methods generate set-valued predictions that satisfy a predefined class coverage guarantee, which is different from the deep ensemble approach used in PaSCo [2]. The average set size (AvgSize) is a widely used uncertainty metric for conformal prediction methods [3,4], which we also adopt in our paper. The experimental results in Table 2 and Figure 6 demonstrate that our HCP achieves the best AvgSize under the same CovGap, compared with existing conformal prediction methods such as SCP and CCCP. This highlights the effectiveness of our HCP in improving uncertainty quantification for OCC.

We evaluated the uncertainty performance of our Depth-UP method applied to the VoxFormer and OccFormer models in Table 5 of the revised paper, and we explain the results in detail in Subsection A.10. As you suggested, the voxel ECE and voxel NLL metrics provided in PaSCo [2] are used to evaluate the uncertainty. From the results, it is evident that Depth-UP achieves a modest but consistent reduction in uncertainty across most cases, particularly for the NLL metric. This improvement is noteworthy given that Depth-UP was primarily designed as an uncertainty propagation method to enhance the accuracy of the original OCC models; uncertainty quantification for OCC is not the primary focus of our Depth-UP method.
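For clarity on the two conformal metrics used here, a minimal sketch of how AvgSize and a class-conditional coverage gap could be computed; the helper functions are hypothetical, not the paper's evaluation code.

```python
import numpy as np

def avg_size(pred_sets):
    """AvgSize: mean prediction-set size over all test voxels."""
    return float(np.mean([len(s) for s in pred_sets]))

def coverage_gap(pred_sets, labels, alpha_per_class):
    """CovGap sketch: mean absolute gap between empirical class-conditional
    coverage and the target 1 - alpha_y, averaged over classes."""
    gaps = []
    for y, alpha_y in alpha_per_class.items():
        idx = [i for i, l in enumerate(labels) if l == y]
        if idx:
            cov = np.mean([labels[i] in pred_sets[i] for i in idx])
            gaps.append(abs(cov - (1 - alpha_y)))
    return float(np.mean(gaps))

sets = [{"car"}, {"car", "person"}, {"road"}, {"person"}]
labels = ["car", "person", "road", "car"]
print(avg_size(sets))                                             # 1.25
print(coverage_gap(sets, labels, {c: 0.1 for c in set(labels)}))  # 0.2
```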

References.

  1. D3RoMa: Disparity Diffusion-based Depth Sensing for Material-Agnostic Robotic Manipulation, ECCV Workshop 2024.
  2. PaSCo: Urban 3D Panoptic Scene Completion with Uncertainty Awareness, CVPR 2024.
  3. A Gentle Introduction to Conformal Prediction and Distribution-Free Uncertainty Quantification, arXiv 2021.
  4. Class-Conditional Conformal Prediction with Many Classes, NeurIPS 2023.
Comment

To help this paper receive a higher score and increase the likelihood of acceptance, do you have any additional suggestions on how to further improve it? We would sincerely appreciate your constructive feedback!

Comment

Thank you for the detailed response, which addressed some of my concerns. The significant increase in the number of parameters for depth estimation (+19.25%) raises doubts about whether the improvement in uncertainty estimation is genuine, as the performance gains might primarily stem from better depth estimation. Additionally, the method shows only limited improvement in key uncertainty evaluation metrics such as ECE and NLL. In response to Q2, the authors emphasize that the main contribution is recognizing the OCC problem from a fresh uncertainty quantification perspective. However, in the response to Q4, they state that "Uncertainty quantification for OCC is not the primary focus of our Depth-UP method," which appears somewhat contradictory. Considering these factors, I am inclined to maintain my original score.

Comment

Thank you, but we respectfully disagree with the following comments:

  1. “the performance gains might primarily stem from better depth estimation.” We freeze all parameters of the original depth model in all our experiments (both baselines and our method) and only retrain the additional head for estimating depth uncertainty, denoted as $\hat{\mathbf{\Sigma}}$. Consequently, the depth estimation $\hat{\mathbf{D}}$ remains unchanged between the newly trained model and the original depth model, and the number of parameters of the depth estimation is not changed. Of the 11.54M additional parameters in our Depth-UP, 0.11M come from the additional head for depth uncertainty estimation and 11.43M come from the extra ResNet backbone in the OCC model.
  2. "However, in the response to Q4, they state that "Uncertainty quantification for OCC is not the primary focus of our Depth-UP method," which appears somewhat contradictory. " Our Depth-UP method focuses on propagating the uncertainty of the depth model to the OCC model to enhance its performance (accuracy). Here, "UP" stands for uncertainty propagation, not uncertainty quantification. Therefore, Depth-UP is not an uncertainty quantification method for OCC models. The uncertainty quantification method we proposed for OCC models is HCP.
Review (Rating: 6)

In this paper, the authors focus on improving the camera-based 3D semantic occupancy prediction (OCC) task by introducing geometric uncertainty.

Specifically, an uncertainty propagation network named Depth-UP is used to predict depth values and pixel-wise uncertainty. The predicted uncertainty, together with the depth map and the image features, is used for semantic occupancy prediction.

In order to effectively utilize the uncertainty, a hierarchical conformal prediction (HCP) method is proposed to quantify the OCC uncertainty, which effectively addresses the class imbalance in OCC datasets.

Additionally, a KL-based score function is leveraged to improve the occupied recall of safety-critical classes.

Experiments on public SemanticKITTI and KITTI360 datasets demonstrate the effectiveness of the proposed uncertainty propagation and uncertainty quantification strategies.

Strengths

  1. Interesting uncertainty propagation for geometric completion. Uncertainty exists in almost all computer vision tasks, and it is also very important to the OCC task. The authors' use of a network to predict the uncertainty for each pixel is straightforward, and the strategy of computing the uncertainty of each 3D voxel by accumulating all potential rays through the voxel makes sense. Experiments also show improvements in the geometry completion task.

  2. Useful hierarchical conformal prediction (HCP). The proposed HCP, compared to SCP and CCCP, is specially designed for rare and safety-critical classes from the perspective of both the geometric and semantic levels. A KL-based score function encourages these classes to receive higher predictions during the training process.

  3. Good performance. Experiments on SemanticKITTI and KITTI360 datasets show the proposed uncertainty propagation and uncertainty quantification strategies achieve significant improvements when they are applied to two basic models VoxFormer and OccFormer.

  4. The paper is well-organized and easy to read.

Weaknesses

The proposed method is simple and easy to follow, so I don’t have many concerns about this paper.

In L209, the authors say that the depth prediction model is retrained to allow it to regress both depth values and uncertainties. I am curious about the performance of the newly trained depth model versus the basic depth model. Does the joint training improve the depth prediction?

Also, in the experiments, do the two base models (indicated as Base) use the original depth prediction networks or the newly trained networks with the uncertainty branch removed?

Other minor mistakes

a. L051, (1 + beta%) -> (1 + beta)?

b. L356-357, in Table 1, 55.78 (+2.34) -> 55.78 (+1.60)

Questions

Overall, I think this is an interesting paper. The uncertainty propagation improves the geometry completion performance, and the HCP improves the accuracy of safety-critical classes in class-imbalanced datasets. As I am not an expert in this area, it would also be very important to see other reviewers' comments before making the final decision.

Comment

We appreciate your constructive comments and feedback. Below we address your concerns and questions point by point. We have highlighted the modified sections in the revised paper for clarity.

W1. Thank you for pointing this out. To estimate the uncertainty of the depth model, we freeze all parameters of the original depth model and only retrain the additional head for $\hat{\mathbf{\Sigma}}$. So in terms of the depth estimation $\hat{\mathbf{D}}$, there is no difference between the newly trained depth model and the basic depth model; the joint training does not improve the depth prediction. We have added more explanation in L211 of the revised paper.

W2. Thank you for pointing this out. The base models and the models with our Depth-UP use the same depth estimation model, producing identical depth estimation outputs $\hat{\mathbf{D}}$ for the same input data. The sole difference is that the base model does not include the additional head for $\hat{\mathbf{\Sigma}}$ (the uncertainty branch).
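A minimal PyTorch sketch of this training setup, with hypothetical module names and an assumed backbone that returns intermediate features alongside the depth map; it illustrates the frozen-backbone-plus-trainable-head pattern rather than the authors' exact code.

```python
import torch.nn as nn

class DepthWithUncertainty(nn.Module):
    """Frozen pretrained depth network plus a small trainable head that
    regresses the per-pixel standard deviation Sigma_hat."""
    def __init__(self, pretrained_depth_net, feat_channels=64):
        super().__init__()
        self.depth_net = pretrained_depth_net
        for p in self.depth_net.parameters():
            p.requires_grad = False           # D_hat stays identical to the base model
        self.sigma_head = nn.Sequential(      # only these weights are trained
            nn.Conv2d(feat_channels, 32, 3, padding=1),
            nn.ReLU(),
            nn.Conv2d(32, 1, 1),
            nn.Softplus(),                    # standard deviation must be positive
        )

    def forward(self, x):
        feats, depth = self.depth_net(x)      # assumed (features, D_hat) output
        return depth, self.sigma_head(feats)  # (D_hat, Sigma_hat)
```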

Other Minor Mistakes. Thank you for pointing them out. We have corrected them in L050 and L354-355 of the revised paper.

Comment

Hi authors,

Thanks for addressing my concerns. I don't have any other questions and I keep my rating.

Best, Reviewer MKcP

Comment

Thank you for your response and for taking the time to review our work.

Comment

Thank you for your response. To help this paper receive a higher score and increase the likelihood of acceptance, do you have any additional suggestions on how to further improve it? We would sincerely appreciate your constructive feedback!

Comment

We noticed the feedback from Reviewer MRxz whose question (https://openreview.net/forum?id=sgaMYvGRG5&noteId=UxDtrSbfSl) about our new experiment results might be of interest to other reviewers, although we respectfully disagree with MRxz's speculation and comments. For the sake of constructive rebuttal discussion, we provide our response to all reviewers as follows:

  1. “the performance gains might primarily stem from better depth estimation.” We freeze all parameters of the original depth model in all our experiments (both baselines and our method) and only retrain the additional head for estimating depth uncertainty, denoted as $\hat{\mathbf{\Sigma}}$. Consequently, the depth estimation $\hat{\mathbf{D}}$ remains unchanged between the newly trained model and the original depth model, and the number of parameters of the depth estimation is not changed. Of the 11.54M additional parameters in our Depth-UP, 0.11M come from the additional head for depth uncertainty estimation and 11.43M come from the extra ResNet backbone in the OCC model.
  2. "However, in the response to Q4, they state that "Uncertainty quantification for OCC is not the primary focus of our Depth-UP method," which appears somewhat contradictory. " Our Depth-UP method focuses on propagating the uncertainty of the depth model to the OCC model to enhance its performance (accuracy). Here, "UP" stands for uncertainty propagation, not uncertainty quantification. Therefore, Depth-UP is not an uncertainty quantification method for OCC models. The uncertainty quantification method we proposed for OCC models is HCP.

We appreciate constructive comments and feedback from all reviewers. Thank you for taking the time to review our work.

AC Meta-Review

This paper tackles semantic occupancy estimation from camera views. Given a camera view, the goal is to estimate the corresponding occupancy in a bounded, voxelized space. In this voxelized space, the network estimates the occupancy value for each voxel and a posterior over a (pre)fixed set of semantic classes.

The core contribution of this work is an investigation of how the performance of camera-based semantic occupancy estimation models can be improved by accounting for uncertainty in model predictions. To this end, the work addresses uncertainty estimation on two fronts:

  • To improve occupancy estimation robustness, the paper proposes a Depth-UP (Depth Uncertainty Propagation) module that learns to account for uncertainties in monocular depth estimation modules within the voxelized volume. This approach alone improves occupancy estimation accuracy.
  • To improve semantic segmentation results (specifically, for rarer classes), the paper proposes HCP (Hierarchical Conformal Prediction). Accounting for uncertainty in semantic posterior estimation helps this approach with rarer classes.

The proposed modules are evaluated in conjunction with two state-of-the-art models (VoxFormer, Li et al. (2023b), and OccFormer, Zhang et al. (2023)) on two canonical datasets for semantic occupancy estimation.

The paper received mixed reviews (5, 6, 6, 6, avg. 5.8).

Reviewers appreciate:

  • That the paper addresses uncertainty estimation to improve semantic occupancy prediction; the occupancy estimation task is inherently uncertain (ill-posed), and accounting for uncertainty makes sense.
  • Liked the approach to semantic uncertainty calibration, which utilizes hierarchical conformal prediction to obtain well-calibrated semantic scores on two levels (classes and super-classes).
  • Acknowledge that the paper shows consistent improvements in both occupancy and semantic predictions (on two relevant datasets, SemanticKITTI and KITTI360).
  • Agree that the paper reports good results,
  • Agree that improvements (compared to baselines) are indeed due to the two added modules, and,
  • that the paper is well-written and organized.

However, reviewers also point out several issues with this paper, among those are:

  • Compared to baselines, the number of network parameters changed, and this was not discussed nor ablated;
  • There is a disconnect between the proposed depth uncertainty propagation module (Depth-UP) and the semantic (HCP) module. While reading the paper, reviewers (as well as the AC) did not see a clear connection. It is true that both proposed modules improve two different aspects of the end task (depth uncertainty propagation helps with occupancy estimation, while conformal prediction is helpful for semantic calibration); both can be understood as two independent contributions.
  • As MRxz pointed out, the Expected Calibration Error (ECE) metric reveals that the improvement due to uncertainty estimation is quite limited compared to baselines and suggests that the improvements may originate from a higher number of parameters for depth estimation (+19.25%) compared to baselines.

Overall, I agree with MRxz. It is not entirely clear whether improvements are due to uncertainty quantification (as revealed by the ECE metric). The paper should provide a more transparent discussion of the network changes / additional parameters and provide thorough empirical evidence to support the claims (i.e., that improvements are indeed due to improved uncertainty calibration).

Finally, the author responses solicit reviewers to change their score during the rebuttal or discussion phase (see "Additional Comments On Reviewer Discussion"), which violates the ethical principle of "Be Honest, Trustworthy and Transparent" outlined in the ICLR Code of Ethics, as it undermines the integrity and independence of the review process.

Additional Comments On Reviewer Discussion

Overall, the paper received ratings of 5, 6, 6, 6.

Reviewer MKcP (rating: 6) liked the paper overall and had only minor comments. Specifically, the reviewer inquired whether the newly trained depth model (depth + uncertainty) performs better than the base model due to the joint training. The authors confirmed that the depth model is frozen and that only the uncertainty estimation network is trained. The reviewer retained their rating of 6 (borderline accept) after this explanation. The discussion concluded with the authors stating: "To help this paper receive a higher score and increase the likelihood of acceptance, do you have any additional suggestions on how to further improve it?", which can be perceived as a request for a higher rating. The reviewer did not respond.

Reviewer MRxz (rating: 5, below borderline) questioned the overall story (there is a disconnect between two independent modules), asked about the number of parameters w.r.t. baselines, and asked whether the final predicted semantic occupancy improved w.r.t. ECE metrics, introduced in recent work addressing uncertainty estimation in the context of LiDAR panoptic scene completion [1]. After the response, MRxz remained unconvinced, noted a significant increase in the number of parameters for depth estimation (+19.25%), and commented that the uncertainty quantification improvements revealed by ECE metrics are limited. The reviewer also points to contradictions in the authors' claims (one response claims that the main contribution is tackling this problem from a "fresh uncertainty quantification perspective"; on another occasion, they stated, "Uncertainty quantification for OCC is not the primary focus of our Depth-UP method"). Due to this, MRxz retained a rating of 5.

Reviewer x1WM asked for clarifications regarding uncertainty modeling (why is it important, and how can uncertainty quantification benefit occupancy prediction) and stated that the paper "does not compare against the state of the art methods on SemanticKITTI and KITTI360" (the AC notes that the reviewer did not specify which baselines they had in mind). The response did not fully clarify why the most recent methods were not considered; it rather emphasized that the two baselines reported in the paper are relatively recent (published in 2023) and well-cited. The reviewer concluded with a comment that they are retaining their rating, to which the authors responded with: "To help this paper receive a higher score and increase the likelihood of acceptance, do you have any additional suggestions on how to further improve it?".

Reviewers L8Q4 and a2b2 asked about results on other (multi-camera) datasets, such as nuScenes or Waymo. The authors responded with a short discussion of the results on nuScenes and confirmed that their proposed modules also yield improvements. L8Q4 retained their rating of 6. Reviewer a2b2 commented that even though the overall results on nuScenes are low, the results do confirm the effectiveness of the proposed components, and provided a final rating of 6.

The authors responded to both discussions with: "To help this paper receive a higher score and increase the likelihood of acceptance, do you have any additional suggestions on how to further improve it?".

[1] Cao et al., PaSCo: Urban 3D Panoptic Scene Completion with Uncertainty Awareness. CVPR, 2024.

Final Decision

Reject