OPUS: Occupancy Prediction Using a Sparse Set
Abstract
Reviews and Discussion
The paper presents OPS, a novel framework for occupancy prediction in autonomous driving. It formulates the task as a direct set prediction problem, using a transformer encoder-decoder architecture to predict occupied locations and classes simultaneously. This approach eliminates the need for explicit space modeling or complex sparsification. The main results of the paper demonstrate that the OPS framework outperforms previous state-of-the-art methods in occupancy prediction on the Occ3D-nuScenes dataset. It achieves superior RayIoU scores, a metric designed to address the overestimation issues of the traditional mIoU metric. The paper also includes an ablation study highlighting the effectiveness of the various strategies incorporated in OPS, such as adaptive re-weighting, consistent point sampling, and coarse-to-fine prediction. These strategies contribute to improved performance, particularly in terms of mIoU and RayIoU.
Strengths
- [S1] Clarity: The paper is well-written and easy to follow. The problem formulation, methodology, and experimental results are clearly presented. Overall, the figures are informative and complement the text well.
- [S2] Quality: The proposed OPS framework is thoroughly described, and the experimental results on the Occ3D-nuScenes dataset show the merits of the proposed method in terms of RayIoU. The ablation studies further validate the contribution of each component in the framework.
Weaknesses
- [W1] Technical soundness: The paper proposes to use Chamfer distance as the main objective function, which is known to be sub-optimal. Although several strategies, including focal loss and coarse-to-fine refinement, have been proposed, they make the system overly complicated and the comparison potentially unfair. For example, it is unclear whether the baseline methods could benefit from the proposed strategies as well. Please comment on this in the rebuttal.
- [W2] Limited Evaluation: The experimental evaluation is solely conducted on the Occ3D-nuScenes dataset. While this is a standard benchmark, evaluating the method on additional datasets, such as SemanticKITTI or Waymo Open Dataset, would provide a more comprehensive assessment of its generalizability and robustness.
- Reference: [NewRef1] Occ3D: A Large-Scale 3D Occupancy Prediction Benchmark for Autonomous Driving
- [W3] mIoU Performance:
- [W3.1] Although OPS excels in RayIoU, its performance on the mIoU metric lags behind some dense models. Given that mIoU is a widely used metric in occupancy prediction, addressing this weakness would make the method more appealing to a broader audience. In comparison, RayIoU was only introduced in a recent paper that has not been peer reviewed.
- [W3.2] As the paper focuses on autonomous driving applications, it is unclear whether the lower mIoU performance can cause safety-critical problems for downstream tasks such as behavior prediction and planning. It would be good to discuss this aspect in the rebuttal and the next version of the paper.
- [W3.3] In Table 3, the reference number for FB-OCC is wrong.
- [W4] Lack of Qualitative Analysis: While the paper provides some visualizations, a more thorough qualitative analysis of the predictions would be insightful. Analyzing failure cases, comparing predictions across different classes and distances, and examining the impact of the proposed strategies on the quality of predictions would provide a deeper understanding of the method's strengths and weaknesses.
Questions
Please check the weakness section, especially W1 and W3.
Limitations
Yes, the limitations have been discussed in the paper.
We thank the reviewer for the insightful comments. We have carried out experiments on Occ3D-Waymo, detailed in the global response to all reviewers. Our results underscore the generalizability of our OPS. We have also corrected the reference number for FB-Occ in Tab.3. Below are our responses to other specific concerns:
- [W1] Chamfer distance is sub-optimal. As stated in lines 46-52 and Sec.C in the appendix, the Hungarian algorithm is optimal but not scalable for the occupancy task. Although Chamfer distance is sub-optimal, it is justified by its computational efficiency and the satisfactory outcomes in our experiments.
- [W1] Proposed strategies lead to complicated system. We introduce four strategies in total, as shown in Tab.2. The re-weighted losses do not alter model structures. CPS is computationally similar to the sampling scheme in SparseBEV[1]. The coarse-to-fine approach, meanwhile, streamlines the computation in the initial phases. Therefore, these strategies do not complicate the system, which builds on top of SparseBEV.
- [W1] Unclear if the baseline methods can benefit from the proposed strategies.
- The re-weighted CD loss and CPS are tailored to our set prediction framework and are not directly applicable to other occupancy prediction methods.
- The coarse-to-fine scheme is also adopted by CTF-Occ[4] and SparseOcc[2]. However, each method has its uniquely crafted designs, which are not intended for external use.
- Regarding the classification loss, our main baselines, SparseOcc[2] and FB-Occ[3], incorporate more complex designs than our re-weighted focal loss. SparseOcc, for instance, integrates focal and Dice losses, while FB-Occ sums up four distinct classification losses. We are not sure if their performances can be improved by simply replacing their focal loss with ours, given the composite nature of their loss functions.
- [W1] Proposed strategies lead to potentially unfair comparison. Apart from the strategies tailored exclusively to OPS (the re-weighted CD loss and CPS), our strategies have comparable or even more complex alternatives in SparseOcc. In addition, we humbly believe that the strategies and other components of a single approach should be evaluated together as a whole. Therefore, we respectfully assert that the comparisons are fair.
- [W3.1] mIoU vs. RayIoU.
- Our mIoU is further improved: As detailed in the general response, OPS-L(8f) now achieves 36.14 mIoU, a 3.74 improvement over our prior submission and a 6 mIoU lead over SparseOcc (8f). The gap between dense and sparse methods has been reduced to about 3 mIoU.
- RayIoU also matters: RayIoU, introduced by SparseOcc[2], has gained acceptance through peer review at ECCV and is utilized in recent literature[2,5,6] and a CVPR workshop[7]. As illustrated in their Fig.4, Fig.5 and Sec. 4.1, the mIoU can be hacked by predicting a thicker surface, a common occurrence with dense methods trained on visibility masks. This could account for the higher mIoU observed in dense methods, despite poorer visualizations. In contrast, RayIoU is not vulnerable to that over-estimation.
- [W3.2] Relatively low mIoU performance may cause safety-critical problems. Please refer to our global response.
- [W4] More qualitative analysis. We thank the reviewer for the kind suggestion and have incorporated additional analysis into our draft. Here are the key points:
- Failure cases: As shown in Fig.3 and Fig.8, a common OPS failure mode is the prediction of scattered and discontinuous surfaces at long distances. Another is the presence of holes in the predicted driving surface, a phenomenon also observed in SparseOcc due to its sparsity.
- Impact of proposed strategies: Tab.4 demonstrates the impact of our proposed strategies. Tab.3, Tab.6 and Tab.7 examine other factors that could affect model performance. Fig.4, Fig.5 and Fig.6 uncover some underlying mechanisms of the proposed OPS.
- Predictions across different distances: We report the RayIoU of FB-Occ and OPS at different ranges in the following table. OPS shows a more pronounced advantage in nearby areas than at far distances. This could be attributed to the phenomenon pointed out by SparseOcc: dense approaches tend to overestimate the surfaces, especially in nearby areas.
| Model | overall | 0-20m | 20-40m | >40m |
|---|---|---|---|---|
| FB-Occ | 33.5 | 41.3 | 24.2 | 12.1 |
| OPS-L | 41.2 | 49.10 | 31.15 | 13.73 |
- Predictions across different classes: The per-class RayIoU results are presented as follows. OPS outperforms FB-Occ across all classes. The top 5 most improved classes are driving surfaces, other flats, sidewalks, man-made structures, and vegetation, indicating that the background categories benefit most from our model.
| Method | others | barrier | bicycle | bus | car | c.veh. | motor. | ped. | cone | trailer | truck | surface | flat | sidewalk | terrain | manmade | vege. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| FB-Occ | 5 | 44.9 | 26.2 | 59.7 | 55.1 | 27.9 | 29.1 | 34.3 | 29.6 | 29.1 | 50.5 | 44.4 | 22.4 | 21.5 | 19.5 | 39.3 | 31.1 |
| OPS-L | 10.9 | 46.2 | 29.6 | 65.5 | 58.4 | 29.7 | 31.1 | 35.8 | 33.8 | 34.7 | 52.7 | 68.6 | 37.3 | 35.1 | 37.5 | 50.1 | 43.1 |
[1] SparseBEV: High-performance sparse 3D object detection from multi-camera videos, CVPR23.
[2] Fully sparse 3D panoptic occupancy prediction, ECCV24.
[3] FB-Occ: 3D occupancy prediction based on forward-backward view transformation, arXiv23.
[4] Occ3D: A large-scale 3D occupancy prediction benchmark for autonomous driving, NeurIPS23.
[5] CascadeFlow: 3D Occupancy and Flow Prediction with Cascaded Sparsity Sampling Refinement Framework.
[6] Panoptic-FlashOcc: An Efficient Baseline to Marry Semantic Occupancy with Panoptic via Instance Center, arXiv24.
[7] The Autonomous Grand Challenge at the CVPR 2024 Workshop.
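The trade-off raised under [W1] (an optimal one-to-one matching versus the cheaper Chamfer distance) can be illustrated with a small sketch. This is not the authors' loss: it uses the plain symmetric Chamfer distance rather than the re-weighted variant described in the rebuttal, and it replaces the Hungarian algorithm with a brute-force search over permutations, which reaches the same optimum but is only feasible for a handful of points. The function names and the toy point sets are assumptions made for illustration.

```python
import math
from itertools import permutations


def chamfer_distance(pred, gt):
    """Plain symmetric Chamfer distance: mean nearest-neighbour distance,
    summed over both directions (pred -> gt and gt -> pred)."""
    def one_way(src, dst):
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)
    return one_way(pred, gt) + one_way(gt, pred)


def optimal_matching_cost(pred, gt):
    """Mean cost of the best one-to-one assignment between equal-sized sets.
    Brute force over permutations stands in for the Hungarian algorithm:
    same optimum, but factorial cost, so only usable for tiny sets."""
    assert len(pred) == len(gt)
    best = min(
        sum(math.dist(p, gt[j]) for p, j in zip(pred, perm))
        for perm in permutations(range(len(gt)))
    )
    return best / len(gt)


# Hypothetical toy data: three ground-truth points and predictions shifted by 0.5 in y.
gt_points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
pred_points = [(x, y + 0.5, z) for (x, y, z) in gt_points]

cd = chamfer_distance(pred_points, gt_points)
match = optimal_matching_cost(pred_points, gt_points)
print(f"Chamfer distance: {cd:.3f}, optimal matching cost: {match:.3f}")
```

Each Chamfer direction needs only a nearest-neighbour lookup per point, whereas the assignment must be solved jointly over all points, which is the scalability concern the rebuttal cites for sets with many occupied voxels.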
This paper focuses on the sparsity property in occupancy prediction, given that most voxels are empty. In order to reduce computation costs on empty voxels, this paper introduces a set prediction paradigm to explicitly model sparsity. OPS, the proposed framework, utilizes the encoder-decoder architecture to jointly reconstruct and classify point clouds. Chamfer distance is used as the supervision. In experiments, OPS outperforms state-of-the-art occupancy methods by 4.9 RayIoU.
Strengths
- Good motivation. Using sparsity as the representation of scenes is well motivated.
- Novel architecture. Using the point cloud + set prediction approach for occupancy is novel, elegant, and makes sense.
- Strong performance. The proposed OPS achieves significantly better RayIoU and inference speed compared to the current state-of-the-art. Although the authors mentioned that the current limitation is the need for longer training, it is not a major issue, as most set-prediction methods face this challenge.
Weaknesses
- Incorrect citation. I think the "SparseOcc" mentioned frequently in this paper should be [1], instead of [2].
[1.] Fully sparse 3D panoptic occupancy prediction
[2.] SparseOcc: Rethinking Sparse Latent Representation for Vision-Based Semantic Occupancy Prediction
- Is there a typo in Line 245? The query count for OPS-tiny is only 0.6k, while for OPS-S, OPS-M, and OPS-L, the query counts are as high as 12K, 24K, and 48K respectively.
- If the query count is as high as 48K, self-attention would not be able to handle such a large query size. How did the authors manage to do this? The design details are not elaborated on.
- Since the authors have replaced the basic representation of occupancy with point clouds, they should discuss the differences from point cloud forecasting tasks, such as 4D-Occ [3] and ViDAR [4], especially the ViDAR paper, which also constructs point clouds from visual images. Discussions on [3] and [4] are suggested to be included in the main paper.
[3.] Point Cloud Forecasting as a Proxy for 4D Occupancy Forecasting
[4.] Visual Point Cloud Forecasting enables Scalable Autonomous Driving.
Questions
See Weakness.
Limitations
N/A
We thank the reviewer for the valuable feedback and are honored by the recognition of our motivation, novelty, and performance. Below, please find our responses to the weaknesses:
- Incorrect citation. Thanks for bringing this to our attention. We have thoroughly reviewed the cited works and identified an incorrect citation in line 71, which has been corrected. In the remaining instances (lines 35, 88, 92, 122, 239, 243, 268, 274, and Tab.1), we have confirmed that the correct paper[1] has been cited.
- Typos in line 245. We appreciate the reminder and have made the corrections. The query numbers for OPS-S, OPS-M, and OPS-L are indeed 1.2K, 2.4K, and 4.8K. The small number of queries is a key factor for the fast inference speed of OPS. More detailed configurations of different models can be found in Tab.5 in the appendix.
- Discussions on point cloud forecasting tasks. We have included comparisons between OPS and works[2,3] in the appendix. Regrettably, due to page limits, we were unable to incorporate this discussion into the main text.
[1] Fully sparse 3D panoptic occupancy prediction, ECCV24.
[2] Point cloud forecasting as a proxy for 4d occupancy forecasting. CVPR23.
[3] Visual point cloud forecasting enables scalable autonomous driving, CVPR24.
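The reviewer's concern about 48K queries and the corrected counts in the response can be put in rough numbers. The sketch below assumes plain dense self-attention among queries, in which every query is scored against every other one; the thread does not state which attention variant OPS actually uses, so this is only an order-of-magnitude illustration. The helper name `attention_pairs` is hypothetical.

```python
def attention_pairs(num_queries: int) -> int:
    """Number of query-query scores computed by one dense self-attention layer."""
    return num_queries * num_queries


# Corrected query counts from the rebuttal (1.2K, 2.4K, 4.8K).
corrected = {"OPS-S": 1_200, "OPS-M": 2_400, "OPS-L": 4_800}
# Counts as misprinted in line 245 of the submission (12K, 24K, 48K).
misprinted = {name: count * 10 for name, count in corrected.items()}

for name in corrected:
    small = attention_pairs(corrected[name])
    large = attention_pairs(misprinted[name])
    print(f"{name}: {small:,} pairs vs {large:,} pairs ({large // small}x)")
```

Because the cost grows quadratically, the tenfold typo corresponds to a hundredfold difference in attention scores per layer, which is why the reviewer's question hinges on the corrected figures.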
Thanks for the authors' rebuttal. I will maintain my rating.
We sincerely thank the reviewer for maintaining the positive assessment of this paper. We will include the details mentioned in the rebuttal in the revised version.
This paper considers the problem of occupancy prediction from multi-view images for autonomous driving. One of the main challenges in the occupancy prediction task is the high computational demand incurred by discretizing the 3D space. Traditional methods typically predict the occupancy of each voxel individually and assign it a semantic class. The key idea of this work is to reduce the computational burden by formulating occupancy prediction as a set prediction task. The authors employ the Chamfer distance in the loss term and propose a novel point sampling process, termed Consistent Point Sampling, to make the SparseBEV decoder applicable. In experiments on the Occ3D-nuScenes dataset, their method demonstrates superior performance with regard to the RayIoU metric while being computationally more efficient than the state-of-the-art methods.
Strengths
I believe the proposed general approach is interesting and the work seems well evaluated (to the extent that I can judge). The paper is largely well-written and seems easily understandable for someone working in that space. I also want to applaud the authors for committing to making their code publicly available.
Weaknesses
This topic seems to be traditionally mainly covered in Computer Vision venues. And while the contribution of this work may also be of interest to the ML community, the writing needs to provide more background and context to be more accessible to the NeurIPS community. E.g. the readers at NeurIPS may not be familiar with the broader tasks and some more specialized architectures such as SparseBEV.
The work contains numerous writing and grammar issues. Some examples are mentioned in the minor comments section below. However, the work should undergo a careful proofread in its entirety.
Some of the mathematical notation seems off. E.g. {} is denoted as a set containing two sets while it is treated as a set containing tuples of point locations and semantic classes. Some more minor math notation issues are given in the comments below.
The first bullet point in the contribution list is somewhat broad and focuses more on the properties of the method. It does not mention some of the new key technical ideas underlying the method such as CPS. I would recommend fully rephrasing the contribution list and focusing on what is technically new (or at least for the first time applied in the context of occupancy prediction). Simply mentioning that there are several strategies that are introduced to boost performance makes the reader wonder what these strategies are.
Minor Comments:
- l.3 "discretize 3D environment" -> "discretize the 3D environment"
- l.33/34 "Alternative sparse latent representations has been explored" -> "Alternative sparse latent representations have been explored"
- l. 37 What is meant by "necessitating complex intermediate designs and explicit"? Complex architecture?
- l.42/43 "Our OPS eliminates" should be "Our method eliminates" or "OPS eliminates"
- l.48 "unable to tacke tremendous voxels" maybe something like "unable to handle a very high number of voxels"
- l.69/70 "all our model configurations easily surpass all prior arts" -> "all our model variants easily surpass all prior work".
- l.83 "This task recently becomes a foundational perception task in autonomous driving" -> "This task has recently become a foundational perception task for autonomous driving"
- Consider using instead of for the number of occupied voxels in ground truth. The latter looks like to the power of . Same for all other occurrences of in a superscript.
- If denotes a semantic class, it should not be defined as element of . This is not necessarily wrong but may be confusing as it is likely represented as an integer if not a one-hot vector?
- Using to represent the number of semantic classes may be confusing given that is used to represent the number of occupied ground truth voxels.
Questions
- From the writing, it is not really clear to me, whether some of the competing methods are also set based? I would not see why the Hungarian algorithm would be needed otherwise. If those methods are set-based, the writing might clarify this. From the abstract it seems like competing approaches perform classification of each voxel individually.
- In the right most panel of fig 2, why is "Consistent Point Sampling" written in a green box while everything else is in grey?
- What is meant by a set of learnable queries? Are these simply the queries that would be used in the traditional transformer except that they are now learned?
Limitations
To the extent that I can understand it, the authors have properly addressed the limitations of this work.
We thank the reviewer for constructive suggestions and feedback. We will release our full code and models once the paper is made public. Below are our responses to specific comments:
- Writing issues. We greatly appreciate the meticulous suggestions on grammar and notation, which indeed help improve the paper's quality. We have revised our draft accordingly and will conduct a comprehensive proofreading to ensure clarity and precision.
- Providing more background and context. Thanks for the advice. We will enrich the background descriptions in sections of related works, methodology, and appendix. For specified architectures in SparseBEV[1], we will provide explanations to guide readers to the original paper, delineating which modules are newly designed and which are adapted from SparseBEV.
- Details about competing methods. As mentioned in lines 74-75, OPS is the first set-based approach for occupancy prediction. All baselines perform voxel-wise classification, resulting in predictions that are inherently ordered according to the physical locations of voxels, thus differing from set-based methods whose predictions are unordered. We will provide further clarification on this distinction in Section 4.2.
- Why the Hungarian algorithm would be needed. The primary challenge of a set-based approach lies in associating the unordered prediction set with ground-truths. As detailed in lines 46-52, while the Hungarian algorithm is effective for association in object detection, it is not suitable for occupancy prediction, which motivates the development of OPS.
- Consistent Point Sampling in Fig.2. We highlight the "consistent point sampling" in Fig.2, as it is newly introduced in OPS. Other components, such as "adaptive mixing," are inherited from SparseBEV[1]. We will clarify this in the caption.
- Learnable queries. The learnability of queries in Transformers is contingent upon the specific application. In standard Transformers for tasks such as machine translation, queries are derived from the input data and are not learnable. However, in many Transformer-based models (e.g., SparseBEV[1] and DETR[2]), queries are initialized randomly and are optimized during training. They enable the model to dynamically focus on various aspects of the input data.
[1] Sparsebev: High-performance sparse 3d object detection from multi-camera videos, CVPR23.
[2] End-to-end object detection with transformers, ECCV20.
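The response on learnable queries can be made concrete with a toy sketch: the queries form a parameter table that is created once, independently of any input, and the same table attends over whatever features a given input provides. The code below is a single-head cross-attention in plain Python, not the SparseBEV or OPS decoder; the sizes, the random initialisation, and the names (`learned_queries`, `cross_attention`, `features_a`, `features_b`) are assumptions made for illustration, and no training step is shown.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, features):
    """Each query attends over the input features, used as both keys and values."""
    dim = len(queries[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, f)) / math.sqrt(dim) for f in features]
        weights = softmax(scores)
        outputs.append([sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)])
    return outputs


rng = random.Random(0)
NUM_QUERIES, DIM = 4, 8

# Learnable queries: randomly initialised parameters that do not depend on the input.
# During training they would be updated by gradient descent like any other weight.
learned_queries = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_QUERIES)]

# Two hypothetical inputs with different numbers of feature tokens.
features_a = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(6)]
features_b = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(10)]

out_a = cross_attention(learned_queries, features_a)
out_b = cross_attention(learned_queries, features_b)
print(len(out_a), len(out_b))  # one output vector per query, regardless of input size
```

The number of outputs is fixed by the query table rather than by the input, which is what lets a set-prediction model emit a chosen number of predictions per scene.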
I appreciate the authors' responses and continue to see the work on the accept side, although I am not yet sure I will be able to raise my scores, as some of the suggested changes are hard to judge without a full new review.
We sincerely appreciate the reviewer's time and effort in offering insightful feedback and keeping a positive assessment of this paper. Due to the NeurIPS policy, we are unable to provide the reviewer with our revised version of this paper. We regret the inconvenience, but have indeed incorporated the reviewer's valuable suggestions into our new draft.
This paper introduces OPS, a novel framework that treats occupancy prediction as a set prediction problem. The approach leverages a transformer encoder-decoder architecture and Chamfer distance loss to align predicted and ground-truth points. The model improves performance with strategies like coarse-to-fine learning, consistent point sampling, and adaptive re-weighting. OPS achieves superior RayIoU and faster FPS compared to state-of-the-art methods on the Occ3D-nuScenes dataset. In summary, this paper is technically sound and shows strong and convincing performance on a commonly used dataset.
Strengths
- This paper formulates occupancy prediction as a sparse set prediction problem, which is technically sound and interesting. Indeed, predicting a sparse set introduces several efficiency and memory benefits, as well as a significant performance improvement.
- Several proposed techniques, including coarse-to-fine learning, consistent point sampling, and adaptive re-weighting, are effective and have the potential to generalize to other methods. Also, these techniques are validated by extensive ablation studies on the nuScenes dataset.
- Experiments are carefully designed and adequate ablation studies are provided, though only on the nuScenes split.
Weaknesses
- At a high level, the proposed method is similar to SparseOcc, although implementation details can be different.
- Although this method has strong performance on the Occ3D-nuScenes dataset and several ablation studies are provided, I am still concerned that this method can overfit to this relatively small dataset. Can the authors provide further evidence on the Occ3D-Waymo split?
Questions
- Can the authors provide results on the Waymo split? Or at least can the authors provide a convincing explanation why such experiments cannot be conducted on Waymo? As far as I know, Occ3D-Waymo is on a par with Occ3D-nuScenes in terms of data scale, thus computation should not be a significant concern.
- Can the authors comment on the difference between SparseOcc and OPS? I'd like to understand both the detailed differences and the high-level differences.
Limitations
No potential negative societal impact is detected by the reviewer.
We thank the reviewer for the insightful comments. Please refer to the global response to all reviewers for our experiments on Occ3D-Waymo. In summary, the proposed OPS demonstrates its generality with superior mIoU results and fast inference speed. Below, we discuss the differences between SparseOcc and OPS:
- Perspective on occupancy prediction. The fundamental difference lies in how the occupancy prediction task is viewed. As described in lines 33-37 and 122-130, all previous methods, including SparseOcc, treat occupancy prediction as a standard classification task. OPS, however, pioneers a set prediction viewpoint, offering a novel, elegant, and end-to-end sparsification approach, as pointed out by Reviewer Zbj3.
- Multi-stage vs. end-to-end sparsification procedure. SparseOcc generates sparse occupancy by gradually discarding voxels through multiple stages. The discarding of empty voxels at early stages is irreversible, leading to obvious cumulative errors, as detailed in lines 276-280 and illustrated in Fig.3. Conversely, OPS circumvents complex filtering mechanisms by directly predicting a sparse set, resulting in more coherent outcomes.
- Detailed model design. At the level of model structure, there are also many differences, such as:
- Query number: On Occ3D-nuScenes, SparseOcc requires 32K queries in its final stage. OPS, by comparison, operates with only 0.6K-4.8K queries for occupancy prediction, capitalizing on its flexible nature and contributing to its fast inference.
- Coarse-to-fine procedure: SparseOcc's coarse-to-fine strategy involves progressively filtering empty voxels and subdividing occupied voxels into finer ones. In contrast, OPS interprets coarse-to-fine as the escalation in number of predicted points across stages.
- Sampling process: The feature sampling in SparseOcc is deterministic, with each query anchored to a specific location. In contrast, the query locations are learnable in OPS. Therefore, we propose consistent point sampling to dynamically and efficiently gather features from input images.
- Learning objective: Our learning target encompasses predicting both semantic classes and occupied locations, simultaneously. The latter is a new objective introduced by OPS, achieved through a modified Chamfer distance loss.
In conclusion, we believe that SparseOcc and OPS are markedly different in both fundamental and detailed designs.
Thank you for providing these responses! They address my questions and I'll keep my positive evaluation of this paper.
We sincerely thank the reviewer for the positive evaluation of this paper. We will further improve our revised version based on the reviewer's comments.
We sincerely thank the reviewers for their constructive comments and are honored by their praise regarding our motivation (5U1g, TqRW, Zbj3), novelty (Zbj3), writing (TqRW, iFoi), and experiments (5U1g, Zbj3). We would first like to mention that the performance of OPS has further improved since our last submission, as detailed in the table below. In a nutshell, all OPS(8f) variants achieve substantial boosts in mIoU (+2.64 to +3.90) and RayIoU (+1.77 to +2.50). These improvements result from tuning hyperparameters and fixing post-processing, without altering any network structures. All our code and checkpoints are well organized and will be released once the paper is made public.
| Model | mIoU | RayIoU | New mIoU | New RayIoU |
|---|---|---|---|---|
| OPS-T(8f) | 30.6 | 35.9 | 33.24(+2.64) | 38.40(+2.50) |
| OPS-S(8f) | 31.2 | 37.3 | 34.24(+3.04) | 39.07(+1.77) |
| OPS-M(8f) | 31.7 | 38.0 | 35.60(+3.90) | 40.27(+2.27) |
| OPS-L(8f) | 32.4 | 38.9 | 36.14(+3.74) | 41.17(+2.27) |
Experiments on Occ3D-Waymo. We'd like to clarify the comments about Occ3D-Waymo from reviewers 5U1g and iFoi. Initially, our draft did not evaluate OPS on the Occ3D-Waymo, as it is not commonly used as a standard benchmark for vision-centric approaches. The only vision-based method we found with reported results on this dataset is the Occ3D paper, which evaluates BEVDet, TPVFormer, BEVFormer, and the newly proposed CTF-Occ. During the rebuttal phase, we trained the OPS-L (1f) on 20% of the dataset for a fair comparison with these baselines. Despite not fine-tuning the training configurations, OPS-L already achieves 19.0 mIoU at 8.5 FPS, outperforming the baseline methods. We are grateful for the reviewers' suggestions and will incorporate the results into our revised draft.
| Method | mIoU | RayIoU | FPS | general | vehicle | bicyclist | ped. | sign | tfc. light | pole | cons. cone | bicycle | motorcycle | building | vegetation | tree trunk | road | sidewalk |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| BEVDet | 9.88 | - | - | 0.13 | 13.06 | 2.17 | 10.15 | 7.80 | 5.85 | 4.62 | 0.94 | 1.49 | 0.0 | 7.27 | 10.06 | 2.35 | 48.15 | 34.12 |
| TPVFormer | 16.76 | - | - | 3.89 | 17.86 | 12.03 | 5.67 | 13.64 | 8.49 | 8.90 | 9.95 | 14.79 | 0.32 | 13.82 | 11.44 | 5.8 | 73.3 | 51.49 |
| BEVFormer | 16.76 | - | 4.6 | 3.48 | 17.18 | 13.87 | 5.9 | 13.84 | 2.7 | 9.82 | 12.2 | 13.99 | 0.0 | 13.38 | 11.66 | 6.73 | 74.97 | 51.61 |
| CTF-Occ | 18.73 | - | 2.6 | 6.26 | 28.09 | 14.66 | 8.22 | 15.44 | 10.53 | 11.78 | 13.62 | 16.45 | 0.65 | 18.63 | 17.3 | 8.29 | 67.99 | 42.98 |
| OPS-L | 19.00 | 24.7 | 8.5 | 4.66 | 27.07 | 19.39 | 6.53 | 18.66 | 6.41 | 11.44 | 10.40 | 12.90 | 0.0 | 18.73 | 18.11 | 7.46 | 72.86 | 50.31 |
Safety concerns. Our OPS-L(8f) has achieved a state-of-the-art RayIoU of 41.17, outperforming the previous sparse model SparseOcc[1] by 6.07 and the dense model FB-Occ[2] by 7.7. The mIoU gap between sparse and dense methods is also reduced from 8.5 (in SparseOcc) to 3.0. However, as noted by Reviewer iFoi, the implications of this gap on safety remain ambiguous. This concern is particularly pertinent in the context of autonomous driving, and we would like to clarify this as follows:
- Risks of dense predictions. The biggest issue with dense predictions is the large discrepancy between evaluation metrics and real-world scenarios. As shown in Fig.1 in the attached one-page pdf, evaluation metrics only consider voxels within the camera visibility mask, which is derived from camera parameters and ground truth. The detailed procedure for generating the mask can be found in Occ3D[3]. However, in real-world applications, we can only produce a view mask based on camera intrinsics and extrinsics, which fails to filter out over-estimated voxels. As this example and Fig.3 in our paper show, dense methods can misidentify occupied voxels, even close to the ego vehicle. These errors are overlooked during evaluation but pose significant safety hazards in real-world scenarios. In contrast, OPS suffers much less from this issue, as it does not over-estimate occupancy.
- The depth errors of OPS are much smaller than those of FB-Occ. In Fig.2 in the attached one-page pdf, we compare the depth errors of FB-Occ and OPS along camera rays. OPS demonstrates lower depth errors across all scenes, despite its relatively low mIoU performance. Given the significance of the first occupied voxel for safety, OPS's precision in this regard enhances safety rather than detracting from it. This aligns with Fig. 9 in SparseOcc, which shows that training FB-Occ without the camera visibility mask results in poorer mIoU but lower errors.
In conclusion, while it is necessary to minimize the mIoU gap between sparse and dense methods, our analysis indicates that mIoU might not fully represent potentially hazardous situations. Therefore, it would be more rational to take both mIoU and RayIoU into consideration for the occupancy task.
[1] Fully sparse 3D panoptic occupancy prediction, ECCV24.
[2] FB-Occ: 3D occupancy prediction based on forward-backward view transformation, arXiv23.
[3] Occ3D: A large-scale 3D occupancy prediction benchmark for autonomous driving, NeurIPS23.
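The safety argument above (a masked IoU can overlook over-estimated voxels, while the first occupied voxel along a ray is what matters) can be illustrated on a single toy ray. This is a deliberately simplified one-dimensional sketch and not the actual mIoU or RayIoU computation from any of the cited papers: voxels are integer indices ordered by distance from the camera, the visibility mask is assumed to end at the first ground-truth occupied voxel, and all names and values are hypothetical.

```python
def iou(pred, gt, mask=None):
    """IoU between two sets of voxel indices, optionally restricted to a mask."""
    if mask is not None:
        pred, gt = pred & mask, gt & mask
    union = pred | gt
    return len(pred & gt) / len(union) if union else 1.0


def first_hit(occupied):
    """Index of the first occupied voxel along the ray, or None for an empty ray."""
    return min(occupied) if occupied else None


# Toy ray of 10 voxels; the true surface is a single voxel at index 6.
gt = {6}
# Assumed visibility mask: everything up to and including the first true surface voxel.
visible = set(range(0, 7))

thick_pred = {6, 7, 8, 9}  # correct first hit, but a surface four voxels thick
offset_pred = {7}          # thin surface, one voxel too far

for name, pred in [("thick", thick_pred), ("offset", offset_pred)]:
    depth_error = abs(first_hit(pred) - first_hit(gt))
    print(
        f"{name}: masked IoU={iou(pred, gt, visible):.2f}, "
        f"unmasked IoU={iou(pred, gt):.2f}, first-hit error={depth_error} voxel(s)"
    )
```

In this toy setting the thick prediction scores a perfect masked IoU even though three of its four voxels are spurious, while a one-voxel depth offset drops the masked IoU to zero. This mirrors the rebuttal's point that the two kinds of metric capture different failure modes and are best read together.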
This paper addresses occupancy prediction from a new perspective via a sparse set. Since there are many empty voxels, this work does not explicitly model the full 3D space; instead, it simultaneously predicts the occupied locations and their classes in a streamlined set prediction paradigm.
All reviewers reached a consensus that the manuscript has been greatly improved after the rebuttal. Most concerns raised by the reviewers have been resolved. The AC read the paper, the rebuttal, and the revised experiments in the pdf. Please incorporate all comments in the camera-ready version.