PaperHub
6.4 / 10
Poster · 5 reviewers
Lowest 3 · Highest 5 · Standard deviation 0.6
Individual ratings: 5, 3, 4, 4, 4
Confidence: 4.2
Novelty: 2.8
Quality: 2.6
Clarity: 3.2
Significance: 2.4
NeurIPS 2025

QuadricFormer: Scene as Superquadrics for 3D Semantic Occupancy Prediction

OpenReview · PDF
Submitted: 2025-04-21 · Updated: 2025-10-29

Abstract

Keywords
autonomous driving

Reviews and Discussion

Official Review
Rating: 5

This paper introduces a novel method for 3D occupancy prediction that leverages superquadrics as a more efficient and expressive scene representation compared to traditional voxels or Gaussians. The core motivation is that the greater representative power of superquadrics allows them to model fine-grained scene geometry with fewer primitives, thereby reducing computational requirements. The authors propose a probabilistic superquadric mixture model, where each superquadric defines an occupancy probability distribution. This is integrated into a query-based network, QuadricFormer, which features a pruning-and-splitting module for efficient prediction. The method reports state-of-the-art (SOTA) results on the nuScenes dataset while demonstrating superior computational efficiency.
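
For context, the standard superquadric inside-outside function from the superquadrics literature (the paper's probabilistic occupancy formulation presumably builds on a parameterization of this form) is

$$
f(\mathbf{p}) = \left( \left| \frac{p_x}{a_x} \right|^{2/\epsilon_2} + \left| \frac{p_y}{a_y} \right|^{2/\epsilon_2} \right)^{\epsilon_2/\epsilon_1} + \left| \frac{p_z}{a_z} \right|^{2/\epsilon_1},
$$

where $(a_x, a_y, a_z)$ are per-axis scales and $(\epsilon_1, \epsilon_2)$ are shape exponents: $f(\mathbf{p}) \le 1$ holds inside the primitive, $\epsilon_1 = \epsilon_2 = 1$ recovers an ellipsoid, and exponents approaching $0$ yield near-cuboid shapes, which is the extra expressiveness over Gaussian ellipsoids that the paper exploits.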

Strengths and Weaknesses

Strengths

  • The paper is well-written and effectively illustrated. Figures 1, 2, and 3 are particularly helpful for building intuition, clarifying the differences between primitives, and explaining the method's architecture.

  • While superquadrics have been explored for object-centric and indoor scene modeling, this paper presents a novel and timely application to the challenging domain of 3D occupancy prediction for autonomous driving.

  • The central narrative -- that the superior representational capacity of superquadrics enables more efficient scene modeling by requiring fewer primitives -- is clearly articulated and consistently woven throughout the paper, making the motivation and design choices easy to follow.

Weaknesses and Questions

  • The paper's central argument is that superquadrics are inherently more efficient due to their expressive power. However, the evidence does not fully substantiate this claim. A key concern is that the observed reduction in primitives may stem more from the aggressive pruning-and-splitting module rather than the intrinsic properties of superquadrics. It is unclear if a similar architecture using Gaussians with the same pruning strategy would also achieve a comparable reduction in primitives.

    • Suggestion: To isolate the benefit of superquadrics, please replace the superquadrics in QuadricFormer with Gaussians, keeping the rest of the architecture (including the pruning module, query design, and training strategy) as identical as possible. This would directly test whether the representational choice itself is the primary driver of the efficiency gains.
  • While the qualitative results in Figures 1 and 4 are visually appealing, they do not strongly demonstrate the claimed advantage of superquadrics in capturing fine-grained geometric details. The rendered superquadrics often appear as simple cuboids, quite similar in shape to what Gaussians would produce, just at a larger scale. The "deforming" capability that gives superquadrics their power (e.g., creating boxier or more cylindrical shapes) is not apparent. They do not seem to wrap around object corners or capture complex surfaces any better than their Gaussian counterparts in the visualizations provided. This weakens the core motivation of the paper.

    • Suggestion: Please provide more detailed visualizations that highlight the advantages of superquadrics in capturing fine-grained geometric details.
  • The evaluation is currently limited to the nuScenes dataset. To demonstrate the generalizability and robustness of the proposed method, it is crucial to include results on other standard benchmarks for this task.

    • Suggestion: Please add results on SemanticKITTI and/or SSCBench-KITTI-360, especially given that recent works like GaussianFormer-2 also report on SSCBench.
  • Table 2 presents a confusing picture regarding efficiency. A superquadric contains more parameters than a Gaussian (e.g., shape parameters in addition to pose and scale). Therefore, it is counter-intuitive that a model with N superquadrics would be faster and require less memory than a model with N Gaussians, as the table seems to imply.

    • Suggestion: Please clarify this by augmenting Table 2. It would be very helpful to include columns that distinguish between network parameters and primitive parameters, and to explicitly state the number of primitives used in each configuration being compared. This would help readers understand the true source of the efficiency gains.
  • The comparison with GaussianFormer-2 appears to be against a weaker variant (Ch=128, mIoU=20.02) rather than its best-performing published model (Ch=192, mIoU=20.82), which has a higher mIoU than the proposed method.

    • Suggestion: For a fair and transparent comparison, the paper should include the strongest published version of prior works.
  • Several key hyperparameters are missing. Please specify:

    • The initial number of quadrics before the pruning-and-splitting stage.
    • The value of B, the number of blocks in the decoder (line 209). It is also unclear what a "block" refers to in this context.
    • The definition of "small-scale" and "large-scale" quadrics used for the pruning criteria (line 207).
  • Recent relevant LiDAR-based methods are missing: PaSCo (CVPR'24), SCPNet (CVPR'23), and CAL (ICML'25).

Questions

I already included all questions in the Strengths and Weaknesses section.

Overall, while this paper introduces a compelling and novel approach by leveraging superquadrics for efficient 3D occupancy prediction, its central claim -- that the observed gains stem directly from the superior representational power of superquadrics -- is not yet fully substantiated. The primary concern is that the performance and efficiency improvements may originate from the QuadricFormer architecture and its aggressive pruning module rather than the choice of primitive itself, a suspicion reinforced by qualitative results (Figs. 1 & 4) where the advertised "deforming" capability of superquadrics is not apparent. They often appear as simple, large-scale cuboids rather than adaptively capturing fine-grained geometric details. To resolve this, an ablation in which the authors replace superquadrics with Gaussians within the same architectural framework is essential. The authors should also provide new visualizations that offer close-up, side-by-side comparisons of a geometrically complex object (e.g., a vehicle, a traffic sign, or a building corner) as reconstructed by their method versus a Gaussian-based method. This direct comparison would make any superior detail-capturing ability immediately apparent. Finally, the tables should be adjusted for fairer SOTA comparisons, and the method should be evaluated on an additional dataset such as SSCBench, as in GaussianFormer-2.

Limitations

yes

Final Justification

The authors have convincingly addressed all my concerns, and in my view, have also adequately responded to those raised by other reviewers.

This work introduces the use of superquadrics as a novel representation for 3D occupancy prediction in urban environments, while prior work focused only on indoor scenes. The application to complex urban settings is both non-trivial and novel. The response clearly demonstrates that the observed performance gains are attributable to the use of superquadrics rather than auxiliary architectural components such as the pruning-and-splitting module. Notably, the method does not require depth initialization, which is a significant practical advantage.

The proposed approach achieves performance on par with or better than recent SOTA models such as GaussianFormer-2, while operating under fewer constraints (i.e., no depth init). The improvements on large object classes (e.g., road surfaces) are particularly compelling, as shown in Tab. 1 on nuScenes. As the authors note, and I agree, modeling large flat surfaces with Gaussians is inefficient, whereas superquadrics (e.g., cuboids) offer a much more compact and interpretable representation.

Regarding benchmark selection, I find the evaluation strategy reasonable. The authors follow the standards established by recent leading works (e.g., TPVFormer, SurroundOcc, GaussianFormer). Given the concurrent release of Occ-3D and SurroundOcc, I do not find sufficient justification for requiring both benchmarks, especially when one is already widely accepted in the community.

The clarifications on the pruning-and-splitting module are now clear, and any needed refinements appear minor.

Finally, I disagree with reviewer #uPb2’s claim that the idea lacks impact. Demonstrating that superquadrics can be effectively applied to urban 3D occupancy prediction, with competitive results and practical benefits, is a meaningful contribution. Research impact should not be measured only by numerical performance but also by the novelty, practicality, and potential for future extensions.

In conclusion, the paper presents a novel idea, strong empirical evidence, and sound motivation. I recommend acceptance.

Formatting Concerns

No concern.

Author Response

We thank the reviewer for the constructive comments and positive feedback on our paper. Regarding the concerns of the reviewer GNa2, we provide the following responses.

W&Q1: A key concern is that the observed reduction in primitives may stem more from the aggressive pruning-and-splitting module rather than the intrinsic properties of superquadrics.

  • Thanks for your valuable suggestion! To directly evaluate the advantage of superquadrics, we performed an ablation experiment where we replaced the superquadrics in QuadricFormer (with 1,600 primitives) with Gaussians, while keeping all other aspects of the model exactly the same.

  • The results in the table below show that the Gaussian-based model achieves much lower performance. This is because, without a heavy depth backbone and with a limited number of primitives, the Gaussian-based method cannot model the scene as effectively as superquadrics. In contrast, the superquadric-based approach maintains strong performance and efficiency.

  • These results demonstrate that the core improvements in performance and efficiency stem from the intrinsic representational power of superquadrics, rather than from architectural choices or the pruning strategy alone. We have included these findings in the revised manuscript for greater clarity. Thank you for prompting us to further validate this important point.

| Method | Depth-Initialization | Pruning & Splitting | Latency (ms) | Memory (MB) | IoU | mIoU |
|---|---|---|---|---|---|---|
| Gaussian-based | | | 341 | 3075 | 28.99 | 18.73 |
| Gaussian-based | | | 162 | 2554 | 29.31 | 18.91 |
| Superquadric-based | | | 162 | 2554 | 30.71 | 20.04 |

W&Q2: The rendered superquadrics often appear as simple cuboids, quite similar in shape to what Gaussians would produce, just at a larger scale. They do not seem to wrap around object corners or capture complex surfaces any better than their Gaussian counterparts in the visualizations provided.

  • Thanks for your insightful suggestion. We acknowledge that superquadrics cannot capture precise object boundaries or highly complex surfaces in the occupancy prediction task. This is because the task uses 0.5-meter voxel labels, which only represent approximate object geometry rather than detailed surfaces.

  • However, superquadrics do possess clear advantages over Gaussians in modeling typical scene structures. For example, modeling large flat regions (such as road surfaces) would require many overlapping Gaussian ellipsoids, whereas only a small number of cuboid superquadrics are needed; a rough illustration is given after this list.

  • While images cannot be included in the NeurIPS rebuttal, we have added more detailed visualizations in the revised manuscript to better demonstrate the structural advantages of superquadrics in capturing geometric details within the limitations of current occupancy benchmarks.
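
As a rough, purely illustrative calculation (the footprint numbers below are assumptions for intuition, not measurements from our experiments): covering a flat 50 m × 50 m road patch with Gaussians whose effective footprint is about 4 m × 4 m requires on the order of

$$
\left(\tfrac{50}{4}\right)^2 \approx 156
$$

overlapping ellipsoids, whereas one flat, cuboid-like superquadric (shape exponents close to zero) can span the same patch. This is consistent with the per-class improvements observed on large flat classes such as roads.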

W&Q3: The evaluation is currently limited to the nuScenes dataset. To demonstrate the generalizability and robustness of the proposed method, it is crucial to include results on other standard benchmarks for this task.

  • In response to the reviewer’s concerns about robustness and generalizability of our method, we have conducted additional experiments on the SSCBench-KITTI-360 [1] dataset. The results are summarized in the table below.

  • Compared to previous dense voxel-based and sparse Gaussian-based methods, our method achieves comparable performance with much fewer scene primitives. This highlights the efficiency of our method, as it can represent complex scenes with fewer expressive superquadrics without sacrificing much accuracy.

  • We would like to clarify a key difference with GaussianFormer-2, which employs a heavy depth backbone to initialize Gaussians for improved performance. While this strategy can yield better results, it comes at the cost of significantly reduced efficiency (i.e., high inference latency and memory usage), as evidenced by Table 2 in the paper. In contrast, to preserve the efficiency advantage of our method, we did not incorporate such a heavy depth-initialized module.

  • These results demonstrate both the effectiveness and the efficiency of our approach. We hope this addresses your concerns and further highlights the practical advantages of our method.

  • Due to the limited time during the rebuttal period, we only conducted the experiments with 12800 superquadrics on SSCBench-KITTI-360. We will provide more experiments on Occ3D and SSCBench-KITTI-360 later.

| Method | Input Modality | Number of Primitives | IoU | mIoU |
|---|---|---|---|---|
| LMSCNet [2] | LiDAR | - | 47.53 | 13.65 |
| SSCNet [3] | LiDAR | - | 53.58 | 16.95 |
| MonoScene [4] | Camera | 262144 (128x128x16) | 37.87 | 12.31 |
| VoxFormer [5] | Camera | 262144 (128x128x16) | 38.76 | 11.91 |
| TPVFormer [6] | Camera | 81920 (256x256+256x32x2) | 40.22 | 13.64 |
| OccFormer [7] | Camera | 262144 (128x128x16) | 40.27 | 13.81 |
| GaussianFormer [8] | Camera | 38400 | 35.38 | 12.92 |
| GaussianFormer-2 [9] | Camera | 38400 | 38.37 | 13.90 |
| Ours | Camera | 12800 | 36.81 | 12.86 |

W&Q4: Table 2 presents a confusing picture regarding efficiency. It is counter-intuitive that a model with N superquadrics would be faster and require less memory than a model with N Gaussians, as the table seems to imply.

  • The key reason for the efficiency difference is not the number of parameters per primitive, but rather the architectural choices. Specifically, GaussianFormer-2 relies on a heavy depth backbone (ResNet-101) to initialize the Gaussians, which becomes the main bottleneck for inference speed and memory usage. In contrast, QuadricFormer adopts cost-free random initialization for the superquadrics, which substantially reduces the computational and memory overhead.

  • We appreciate your suggestion, which helped us improve Table 2 to avoid any confusion. In the revised table, we have now indicated whether depth-backbone-based initialization is used and provided model parameters, making it clear where the efficiency gains originate.

W&Q5: The comparison with GaussianFormer-2 appears to be against a weaker variant (Ch=128, mIoU=20.02) rather than its best-performing published model (Ch=192, mIoU=20.82), which has a higher mIoU than the proposed method.

  • Thanks for your suggestion. In Table 1, we only reported the results of GaussianFormer-2 (Ch=128, mIoU=20.02) with a feature dimension of 128 to ensure a fair comparison, since our method is under the same setting. However, we agree that the best results should also be included to provide a more complete picture. In the revised version, we have added the strongest results of GaussianFormer-2 (Ch=192, mIoU=20.82) and clarified this in the manuscript.

  • Notably, even when compared with the best configuration of GaussianFormer-2, our QuadricFormer (with 12,800 superquadrics) still achieves superior performance with clear efficiency advantages. We appreciate your suggestion, which has improved the completeness and transparency of our experimental comparisons.

W&Q6: Several key hyperparameters are missing.

Thanks for your detailed comments and suggestions. We address your questions as follows:

  1. Number of Quadrics before Pruning-and-Splitting: The number of superquadrics remains unchanged before and after the pruning-and-splitting stage, since each pruned superquadric is replaced by a newly added one.
  2. Definition of a Block and Value of B in the Decoder: A "quadric-encoder block" refers to a module that consists of a sparse convolutional layer, a cross-attention layer, and a residual update layer. Our model employs $B=4$ blocks to encode superquadrics.
  3. Small-scale and Large-scale Quadrics: The distinction between small-scale and large-scale superquadrics is determined by the product of their scale values. More details can be found in our general response (GR3).

We appreciate your valuable suggestions. We will include these clarifications in the revised version of our manuscript.

W&Q7: Recent relevant LiDAR-based methods are missing: PaSCo (CVPR'24), SCPNet (CVPR'23), and CAL (ICML'25).

  • Thanks for your suggestion. We will include these works and discuss them in the revised version of our paper.

References

[1] SSCBench: A Large-Scale 3D Semantic Scene Completion Benchmark for Autonomous Driving, IROS 2024.

[2] LMSCNet: Lightweight Multiscale 3D Semantic Completion, 3DV 2020.

[3] Semantic Scene Completion From a Single Depth Image, CVPR 2017.

[4] MonoScene: Monocular 3D Semantic Scene Completion, CVPR 2022.

[5] VoxFormer: Sparse Voxel Transformer for Camera-Based 3D Semantic Scene Completion, CVPR 2023.

[6] Tri-Perspective View for Vision-Based 3D Semantic Occupancy Prediction, CVPR 2023.

[7] OccFormer: Dual-path Transformer for Vision-based 3D Semantic Occupancy Prediction, ICCV 2023.

[8] GaussianFormer: Scene as Gaussians for Vision-Based 3D Semantic Occupancy Prediction, ECCV 2024.

[9] GaussianFormer-2: Probabilistic Gaussian Superposition for Efficient 3D Occupancy Prediction, CVPR 2025.

Comment

Thanks to the authors for their detailed response and the considerable effort put into conducting experiments, especially within such a short timeframe.

Regarding W&Q1: Does the reduction in primitives stem from the pruning-and-splitting module?

I am convinced by the authors' response that the performance gains stem from the use of superquadrics rather than network architecture components like the "aggressive pruning-and-splitting module." Additionally, the fact that superquadrics don't require depth initialization is a significant advantage of the method.

Regarding W&Q2: The rendered superquadrics appear as simple cuboids, similar to Gaussians?

I agree that "modeling large flat regions (such as road surfaces) would require many overlapping Gaussian ellipsoids, whereas only a small number of cuboid superquadrics are needed." The results in Table 2 convincingly show that QuadricFormer performs best for "large flat regions" classes. However, Figure 2a is somewhat misleading.

Regarding W&Q3 (also raised by #N2Bk, #cJXu, #GNa2): Experiments are done only on nuScenes?

The authors' addition of results on SSCBench-KITTI-360 addresses this concern adequately.

Regarding W&Q5 (also shared by #N2Bk, #cJXu): Why not compare against the strongest GaussianFormer-2 variant?

The authors state that "In Table 1, we only reported the results of GaussianFormer-2 (Ch=128, mIoU=20.02) with a feature dimension of 128 to ensure a fair comparison, since our method is under the same setting." Could you elaborate on what "the same setting" means here? Typically, we compare the best against the best. If GaussianFormer-2 has a more advantageous setting, why not place QuadricFormer in the same setting for comparison?

Clarification on pruning-and-splitting module

If the number of superquadrics remains unchanged (since each pruned superquadric is replaced by a newly added one), isn't the name "Pruning-and-splitting" misleading?

Thank you for the clarifications on "Definition of a Block and Value of B in the Decoder" and "Small-scale and Large-scale Quadrics," as well as confirming that relevant works will be included in the revised paper.

My current stand

I find the comments from other reviewers compelling:

  • SOTA claim issues (#cJXu, #GNa2, #N2Bk)
  • Insufficient experiments (#uPb2)
  • Confusing pruning-and-splitting module (#N2Bk)

I would like to see these discussions resolved before making my final decision.

However, I disagree with reviewer #uPb2's assertion that "The idea lacks sufficient impact." Research impact is difficult to judge and cannot be dismissed simply due to lower performance metrics. Demonstrating that an idea works and showing its benefits compared to counterparts, even if not SOTA, is already sufficient contribution.

I currently remain overall positive about the paper.

Comment

We sincerely thank the reviewer for the timely response. We continue our discussions below.

Regarding W&Q1:

Thank you for recognizing the effectiveness of superquadrics in our approach.

Regarding W&Q2:

We agree with your observation. Figure 2a is intended as an illustrative diagram to highlight the advantages of superquadrics over Gaussians in geometry modeling. However, we acknowledge it may be somewhat misleading, as the occupancy prediction task does not require such fine-grained shapes. We will revise Figure 2a to better reflect the task while demonstrating the strengths of superquadrics.

Regarding W&Q3:

Thanks for recognizing our results.

Regarding W&Q5:

By “the same setting”, we mean QuadricFormer uses the same feature dimension of 128 as GaussianFormer-2 (Ch=128, mIoU=20.02). We did not use this higher feature dimension due to its impact on the speed and memory usage shown below. For a best-vs-best comparison, we are running additional experiments with QuadricFormer at 192 dimensions to match the strongest setting of GaussianFormer-2. These are still in progress, and we will update the results once available.

| Method | Dimension of Features | Number of Primitives | IoU | mIoU | Latency (ms) | Memory (MB) |
|---|---|---|---|---|---|---|
| GaussianFormer-2 (Ch=128) | 128 | 12800 | 30.56 | 20.02 | 451 | 4535 |
| GaussianFormer-2 (Ch=192) | 192 | 12800 | 31.74 | 20.82 | 482 | 5839 |
| Ours | 128 | 1600 | 31.22 | 20.12 | 162 | 2556 |

Regarding the pruning-and-splitting module:

The purpose of the pruning-and-splitting module is to enhance modeling accuracy rather than reduce the number of primitives. We remove small-scale superquadrics from empty regions that contribute little to scene modeling, and split large-scale ones in non-empty regions to achieve finer scene representation. We agree that "pruning-and-splitting" may be misleading. Since the core objective is to adjust the spatial distribution of superquadrics, "primitive relocating module" is a more appropriate name.

For the other reviewers' comments, we now provide a short summary of our responses:

SOTA claim issues (#cJXu, #GNa2, #N2Bk):

Sorry for the confusion. We have included a SOTA version (with 12800 superquadrics) in Table 2 but not in Table 1 (only reported results with 1600 superquadrics). We will fix Table 1 in the revised version.

Insufficient experiments (#uPb2):

We did not report results on Occ3D because we want to fairly compare our method with the most related counterparts, i.e., the GaussianFormer series, so we followed their setting and employed SurroundOcc as the main benchmark. We have provided the additional results on SSCBench-KITTI-360 (with updated results shown below). We hope these results are sufficient to demonstrate the benefits of using superquadrics as primitives.

We are pleased to share the new results on SSCBench-KITTI-360. By adjusting the hyperparameters (the loss weights), our method achieves an mIoU of 13.63 and IoU of 38.89 with 12800 superquadrics and without the depth-initialized module. The updated results are shown below:

| Method | Input Modality | Number of Primitives | IoU | mIoU |
|---|---|---|---|---|
| GaussianFormer | Camera | 38400 | 35.38 | 12.92 |
| GaussianFormer-2 (Depth Initialized) | Camera | 38400 | 38.37 | 13.90 |
| Ours | Camera | 12800 | 38.89 | 13.63 |

Confusing pruning-and-splitting module (#N2Bk):

We summarize the implementation and effects of the pruning-and-splitting module.

  • The module improves scene modeling accuracy by re-distributing the superquadrics, rather than reducing their total number. After initial training, we remove superquadrics with small scales (often in empty regions) and replace them by splitting and relocating those from non-empty regions. Further refinement is achieved with extra quadric-encoder blocks.

  • Quantitative experiments with 1600 superquadrics show that this module consistently improves performance with minimal impact on inference latency and memory usage. Excessive pruning reduces the number of valid superquadrics and leads to performance drops. For detailed experimental results, please see our reply to Reviewer N2Bk.

We sincerely appreciate your insightful review. We hope the results we provide are sufficient to demonstrate the benefits. Please let us know if you require additional information. Thank you again for your time and constructive comments.

Comment

Great results! Thank you for the detailed response. I truly appreciate the additional experiments you conducted, especially over the weekend.

Overall, I find that all concerns, including those raised by other reviewers, have been addressed convincingly:

  • The use of superquadrics as a new representation for 3D occupancy prediction is both novel and non-trivial. While previously explored in indoor scenes, this is the first time they are applied to urban environments.

  • The advantages of this method are clear: it matches or even outperforms GaussianFormer-2 (very recent SOTA), while operating under fewer constraints (e.g., no need for depth initialization). The improvements on large object classes are particularly noteworthy and observed on nuScenes. I am not sure about SSCBench-KITTI-360; maybe the authors should report per-class performance on this dataset too. I hope it doesn't require much additional work?

  • Regarding benchmarks, I believe Occ-3D and SurroundOcc were released around the same time, roughly between the CVPR’23 and ICCV’23 deadlines. I'm not convinced that Occ-3D is “more widely adopted,” as suggested by reviewer #uPb2. Evaluating on one benchmark seems sufficient, especially since the authors follow the standard benchmarks used by TPVFormer, SurroundOcc, GaussianFormer, and GaussianFormer-2, arguably the most influential works in this domain.

  • Finally, the pruning-and-splitting module is now much clearer, and I believe it can be revised with minimal effort.

In summary, I’m convinced by the method’s merits and would support accepting the paper (will raise score to 5), unless other reviewers raise any critical points.

Comment

Thanks for your time and effort. Your insightful review and valuable suggestions have significantly improved the clarity and quality of our work.

Official Review
Rating: 3

The paper QuadricFormer proposes a novel and efficient object-centric scene representation for 3D semantic occupancy prediction using superquadrics instead of traditional dense voxels or Gaussian primitives. By leveraging the geometric flexibility of superquadrics, the proposed QuadricFormer framework efficiently models complex scene structures with significantly fewer primitives. The paper introduces a probabilistic superquadric mixture model and designs a pruning-and-splitting module to further enhance representation efficiency. Extensive experiments on the nuScenes dataset demonstrate that QuadricFormer achieves state-of-the-art performance while substantially reducing computational cost and memory usage.

Strengths and Weaknesses

Strengths:

  1. This paper proposes a novel and efficient object-centric scene representation for 3D semantic occupancy prediction based on superquadrics.
  2. The proposed model achieves superior performance while significantly reducing computational cost and memory usage compared to previous methods.

Weaknesses:

  1. The idea lacks sufficient impact.

While this paper provides a valuable attempt to apply superquadrics to the occupancy prediction task, the performance improvement over previous methods, such as GaussianFormer-2, appears to be limited.

Regarding efficiency, memory usage is influenced by various factors, including the backbone architecture, the number of layers, and other design choices. Therefore, the reported efficiency improvements are difficult to attribute solely to the use of superquadric representations. Based on the current information, the comparison in Table 2 does not appear to be entirely fair.

Overall, this paper seems to have made a modest contribution and, at best, would likely receive only a lukewarm response within the community.

  2. Insufficient experiments.

The experiments are conducted only on the nuScenes dataset and primarily compared with outdated methods such as TPVFormer and SurroundOcc. Several recent state-of-the-art approaches, including SparseOcc [1], FB-Occ [2], CVT-Occ [3], and many others, are notably missing from the comparisons. Additionally, evaluations on more widely adopted benchmarks, such as Occ3D-nuScenes and Occ3D-Waymo [4], are essential to convincingly demonstrate the advantages of the proposed method.

Furthermore, SparseOcc [1] has challenged the use of IoU and mIoU as evaluation metrics, advocating for the adoption of RayIoU instead. An update of the evaluation metrics is therefore necessary to ensure fair and meaningful comparisons.

Reference:

[1] Fully Sparse 3D Occupancy Prediction, ECCV 2024.

[2] FB-OCC: 3D Occupancy Prediction based on Forward-Backward View Transformation, arXiv 2023.

[3] CVT-Occ: Cost Volume Temporal Fusion for 3D Occupancy Prediction, ECCV 2024.

[4] Occ3D: A Large-Scale 3D Occupancy Prediction Benchmark for Autonomous Driving, arXiv 2023.

Questions

The main concerns, as outlined in the Weaknesses section, can be summarized as follows:

  1. The proposed idea lacks sufficient impact.

  2. The efficiency comparison requires further clarification.

  3. The experiments are insufficient in terms of benchmark diversity and baseline comparisons.

  4. The evaluation metrics are inadequate and need to be updated.

Limitations

Yes

Final Justification

The authors' rebuttal partially addresses my concerns. However, the proposed method demonstrates only moderate performance on the SSCBench-KITTI-360 benchmark. Since the introduction of the Occ3D benchmark, it has become a widely adopted standard, with over a hundred methods evaluated on it. Additionally, SparseOcc introduces RayIoU, a more principled metric that extends the conventional mIoU. Therefore, comparison experiments with both the Occ3D benchmark and RayIoU are crucial for a comprehensive evaluation. The current experimental results remain insufficient. Given that further experiments are still necessary to validate the soundness and effectiveness of the proposed approach, I maintain my negative rating.

Formatting Concerns

None

Author Response

We thank the reviewer for the constructive comments. Regarding the concerns of the reviewer uPb2, we provide the following responses.

W1 & Q1 & Q2: The proposed idea lacks sufficient impact. The efficiency comparison requires further clarification.

  • On the fairness and validity of efficiency comparison:

    1. We would like to clarify that our efficiency comparison is fair and meaningful. As noted, model efficiency (inference speed and memory usage) can be affected by various factors. To ensure a fair comparison with Gaussian-based methods, we made the model architectures and evaluation settings as consistent as possible.
    2. The only difference is that GaussianFormer-2 requires an additional depth backbone for Gaussian initialization, while our method does not. Instead, we employ a lightweight pruning-and-splitting module to enhance performance.
    3. Since the depth backbone in GaussianFormer-2 is a heavy ResNet-101, it becomes the main bottleneck for inference speed and memory usage. As a result, even with the same number of primitives, our method achieves better performance and efficiency. This highlights the effectiveness and efficiency of our method.
  • On performance gains and broader impact:

    1. We respectfully disagree that the performance improvements over GaussianFormer-2 are limited. While GaussianFormer-2 achieves strong results by leveraging a heavy depth backbone and increasing feature dimensions, this comes at the cost of significant efficiency loss (see Table 2).
    2. Without relying on these additional tricks, QuadricFormer (with 12800 superquadrics) surpasses GaussianFormer-2 in mIoU (21.11 vs 20.82) with superior efficiency (179 ms latency vs 451 ms). This demonstrates that our method not only exceeds the performance of existing baselines but also sets a new standard for efficiency, which is valuable for real-world applications.
  • In summary, our results demonstrate that the proposed superquadric representation leads to meaningful advances in both performance and efficiency, supporting the impact and relevance of our work. We thank you for your suggestions, which have encouraged us to further clarify these points in the revised manuscript.

W2 & Q3 & Q4: The experiments are insufficient in terms of benchmark diversity and baseline comparisons. The evaluation metrics are inadequate and need to be updated.

  • Experiments on additional benchmarks:

    • Thanks for your valuable suggestion. Due to the limited time of the rebuttal period, we were only able to provide additional results on SSCBench-KITTI-360 [1], following the experimental protocol of GaussianFormer-2. The results are summarized in the table below.

    • Compared to previous dense voxel-based and sparse Gaussian-based methods, our method achieves comparable performance with much fewer scene primitives. This highlights our method’s efficiency, as it is able to represent complex scenes with fewer expressive superquadrics without sacrificing much accuracy.

    • We would like to clarify a key difference with GaussianFormer-2, which employs a heavy depth backbone to initialize Gaussians for improved performance. While this strategy can yield better results, it comes at the cost of significantly reduced efficiency (i.e., high inference latency and memory usage), as evidenced by Table 2 in the paper. In contrast, to preserve the efficiency advantage of our method, we did not incorporate such a heavy depth-initialized module.

    • These results demonstrate both the effectiveness and the efficiency of our approach. We hope this addresses your concerns and further highlights the practical advantages of our method.

    • Due to the limited time during the rebuttal period, we only conducted the experiments with 12800 superquadrics on SSCBench-KITTI-360. We will provide more experiments on Occ3D and SSCBench-KITTI-360 later.

    | Method | Input Modality | Number of Primitives | IoU | mIoU |
    |---|---|---|---|---|
    | LMSCNet [2] | LiDAR | - | 47.53 | 13.65 |
    | SSCNet [3] | LiDAR | - | 53.58 | 16.95 |
    | MonoScene [4] | Camera | 262144 (128x128x16) | 37.87 | 12.31 |
    | VoxFormer [5] | Camera | 262144 (128x128x16) | 38.76 | 11.91 |
    | TPVFormer [6] | Camera | 81920 (256x256+256x32x2) | 40.22 | 13.64 |
    | OccFormer [7] | Camera | 262144 (128x128x16) | 40.27 | 13.81 |
    | GaussianFormer [8] | Camera | 38400 | 35.38 | 12.92 |
    | GaussianFormer-2 [9] | Camera | 38400 | 38.37 | 13.90 |
    | Ours | Camera | 12800 | 36.81 | 12.86 |
  • Evaluation metrics:

    • While SparseOcc [10] introduced the RayIoU metric, the majority of recent and leading works on the SSCBench-KITTI-360 continue to report results primarily with mIoU and IoU as evaluation metrics. Therefore, for consistency and comparability with prior art, we continue to evaluate our model using these established metrics.
    • Nevertheless, we acknowledge the value of more comprehensive evaluation. In future work, we will conduct additional experiments on the Occ3D dataset and adopt metrics such as RayIoU to enable a more thorough comparison with more recent occupancy prediction methods.

References

[1] SSCBench: A Large-Scale 3D Semantic Scene Completion Benchmark for Autonomous Driving, IROS 2024.

[2] LMSCNet: Lightweight Multiscale 3D Semantic Completion, 3DV 2020.

[3] Semantic Scene Completion From a Single Depth Image, CVPR 2017.

[4] MonoScene: Monocular 3D Semantic Scene Completion, CVPR 2022.

[5] VoxFormer: Sparse Voxel Transformer for Camera-Based 3D Semantic Scene Completion, CVPR 2023.

[6] Tri-Perspective View for Vision-Based 3D Semantic Occupancy Prediction, CVPR 2023.

[7] OccFormer: Dual-path Transformer for Vision-based 3D Semantic Occupancy Prediction, ICCV 2023.

[8] GaussianFormer: Scene as Gaussians for Vision-Based 3D Semantic Occupancy Prediction, ECCV 2024.

[9] GaussianFormer-2: Probabilistic Gaussian Superposition for Efficient 3D Occupancy Prediction, CVPR 2025.

[10] Fully Sparse 3D Occupancy Prediction, ECCV 2024.

Comment

We thank the reviewer for the timely acknowledgement. If you have any additional questions or need further information, we would be grateful for the opportunity to provide clarification. Again, we appreciate the time and effort you devoted to reviewing our work, which has been greatly helpful in improving our paper.

Comment

We are pleased to share the new results on SSCBench-KITTI-360. Due to time constraints, our initial response only included results from a single run, which was not fully optimized. After tuning the hyperparameters (specifically, the loss weights), our method achieves an mIoU of 13.63 and IoU of 38.89 with 12,800 superquadrics and without the depth-initialized module. The updated results are shown below:

| Method | Input Modality | Number of Primitives | IoU | mIoU |
|---|---|---|---|---|
| GaussianFormer | Camera | 38400 | 35.38 | 12.92 |
| GaussianFormer-2 (Depth Initialized) | Camera | 38400 | 38.37 | 13.90 |
| Ours | Camera | 12800 | 38.89 | 13.63 |
Comment

The authors' rebuttal partially addresses my concerns. However, since the introduction of the Occ3D benchmark, it has become a widely adopted standard, with over a hundred methods evaluated on it. Moreover, SparseOcc introduces RayIoU, a more principled metric that extends the conventional mIoU. The current comparisons overlook a large number of existing published methods, which significantly limits the completeness of the experimental evaluation. These aspects should be incorporated into the revised paper to strengthen its contribution.

Comment

Dear uPb2 (the authors are included in this one),

You mentioned that the current comparisons ignore "a large number of existing published methods". Could you please provide a specific list of papers the authors should have compared to?

Thank you AC

Comment

Dear AC,

I maintain my original opinion primarily due to the limited experimental evaluation. However, I also acknowledge the positive contributions of this paper and the authors’ thorough rebuttal. Therefore, if both the AC and other reviewers lean toward acceptance, I would also support that decision.

Comment

Dear AC,

Since the introduction of the Occ3D benchmark, it has been cited nearly 300 times, and numerous milestone methods have been evaluated on it. Therefore, comparison with these works (a small part of them are as below) is essential to meet the high standards of NeurIPS.

  • SparseOcc: Rethinking Sparse Latent Representation for Vision-based Semantic Occupancy Prediction
  • PanoOcc: Unified Occupancy Representation for Camera-based 3D Panoptic Segmentation
  • FB-OCC: 3D Occupancy Prediction based on Forward-Backward View Transformation
  • RenderOcc: Vision-Centric 3D Occupancy Prediction with 2D Rendering Supervision
  • COTR: Compact Occupancy TRansformer for Vision-based 3D Occupancy Prediction
  • OPUS: Occupancy Prediction Using a Sparse Set
  • OctreeOcc: Efficient and Multi-Granularity Occupancy Prediction Using Octree Queries
  • FlashOcc: Fast and Memory-Efficient Occupancy Prediction via Channel-to-Height Plugin
  • OccSora: 4D Occupancy Generation Models as World Simulators for Autonomous Driving
  • OccGen: Generative Multi-modal 3D Occupancy Prediction for Autonomous Driving
  • CVT-Occ: Cost Volume Temporal Fusion for 3D Occupancy Prediction
  • Fully Sparse 3D Occupancy Prediction
Comment

Thanks to the AC and the reviewer for their valuable comments and feedback.

We would like to clarify that Occ-3D and SurroundOcc were released around the same time and provide additional 3D occupancy annotations to the nuScenes data. Subsequent methods usually chose one of them as the occupancy evaluation benchmark on nuScenes.

Typically, voxel-based 3D occupancy prediction methods targeting improving the model architecture (e.g., SparseOcc, FB-Occ, and CVT-Occ) and occupancy generation methods (e.g., OccSora) usually adopt Occ-3D. On the other hand, methods targeting developing new 3D representations to replace the voxel representation (e.g., GaussianFormer [1], GaussianFormer-2 [2]) usually adopt SurroundOcc.

We followed this convention, and as our main goal is to validate the effectiveness and advantage of superquadrics as a new scene representation, we adopted SurroundOcc to ensure fair comparisons with the most relevant counterparts (e.g., the Gaussian representation in GaussianFormer, GaussianFormer-2). This choice is also supported by Reviewer #VvnD ("the authors follow the standard benchmarks used by TPVFormer, SurroundOcc, GaussianFormer, and GaussianFormer-2, arguably the most influential works in this domain.").

Therefore, even though we agree that additional evaluation on Occ3D is a valuable enhancement, we respectfully disagree that the lack of it is a fundamental problem. We are currently trying our best to adapt our method to Occ3D and will provide the results once we have them. Once again, we appreciate the time and effort the AC and reviewer devoted to reviewing and improving our work. We hope the above response can help address your concerns. We are happy to answer any additional questions you may have.

References

[1] GaussianFormer: Scene as Gaussians for Vision-Based 3D Semantic Occupancy Prediction, ECCV 2024.

[2] GaussianFormer-2: Probabilistic Gaussian Superposition for Efficient 3D Occupancy Prediction, CVPR 2025.

Comment

Since the introduction of the Occ3D benchmark, it has been cited nearly 300 times, and numerous milestone methods have been evaluated on it. Therefore, comparison with these works (a small part of them are as below) is essential to meet the high standards of NeurIPS.

I find this comment a bit biased. Regarding adoption metrics, SurroundOcc actually demonstrates higher adoption:

  • Citations: 323 (SurroundOcc) vs 289 (Occ3D)
  • GitHub stars: 937 (SurroundOcc) vs 486 (Occ3D)
Official Review
Rating: 4

The paper replaces the Gaussians with superquadrics as scene primitives in GaussianFormer to overcome the ellipsoidal shape limitation of Gaussian-based methods. By leveraging a probabilistic superquadric mixture model and a pruning-and-splitting module, it concentrates primitives in occupied regions for better efficiency. Experiments on nuScenes show QuadricFormer achieves good performance with fewer primitives.

Strengths and Weaknesses

Strengths:

  • The motivation to replace Gaussians with a more general and expressive representation is commendable.
  • The presentation of the paper is clear.

Weaknesses:

  • The experimental evaluation in the current version is insufficient. Most existing occupancy prediction methods report results on benchmarks such as OpenOccupancy or Occ3D, which facilitates fair and consistent comparison across approaches. However, the authors only evaluate their model on the dataset introduced by SurroundOcc, limiting the ability of the community to properly assess the model’s effectiveness. I strongly encourage the authors to include additional evaluations to compare with other sparse occupancy prediction methods.
  • The performance claims regarding QuadricFormer are misleading. The authors state in the main results (Tab. 2) that their model achieves state-of-the-art (SOTA) performance. However:
    • In Table 2, QuadricFormer achieves 31.22 IoU and 20.12 mIoU, which are both lower than those of SurroundOcc (31.49 IoU, 20.30 mIoU) and OccFormer (31.39 IoU). Highlighting QuadricFormer as the best-performing model is potentially misleading to readers.
    • Additionally, the reported numbers for GaussianFormer-2 (30.56 IoU, 20.02 mIoU) are not its best-performing results. According to its original paper, the best configuration of GaussianFormer-2 achieves 31.74 IoU and 20.82 mIoU, which outperforms QuadricFormer. The authors should ensure fair and accurate comparisons by reporting the strongest results for all baselines.

Questions

  • Could the authors provide more details on the pruning-and-splitting module? According to Table 4, this component plays a significant role in the model’s performance. However, the current version of the paper lacks sufficient explanation regarding its design and implementation.
  • [minor] In Table 1, the caption mentions “Ch”, but this notation does not appear anywhere in the table itself.
  • [minor] The statements in lines 244–245 do not align with the results reported in Table 1. The authors should revise the text to accurately reflect the experimental outcomes.

Limitations

Yes.

Final Justification

After reading all the responses, I still have concerns about the benchmark used. The paper evaluates the method on the dataset provided by SurroundOcc rather than widely adopted benchmarks such as Occ3D. Although this dataset seems to be chosen by several works such as SurroundOcc, GaussianFormer, and GaussianFormer-2, it appears that these works originate from authors and institutions with close ties. In my view, this phenomenon is detrimental to the community, as it prevents researchers from fairly assessing works in the field.

Overall, the quality of the paper is good; however, my opinion remains quite borderline.

Formatting Concerns

NA

Author Response

We thank the reviewer for the constructive comments and positive feedback on our paper. Regarding the concerns of the reviewer GNa2, we provide the following responses.

W1: The experimental evaluation in the current version is insufficient. I strongly encourage the authors to include additional evaluations to compare with other sparse occupancy prediction methods.

  • Thanks for your valuable suggestion. Due to the limited time of the rebuttal period, we were only able to provide additional results on SSCBench-KITTI-360 [1], following the experimental protocol of GaussianFormer-2. The results are summarized in the table below.

  • Compared to previous dense voxel-based and sparse Gaussian-based methods, our method achieves comparable performance with much fewer scene primitives. This highlights our method’s efficiency, as it is able to represent complex scenes with fewer expressive superquadrics without sacrificing much accuracy.

  • We would like to clarify a key difference with GaussianFormer-2, which employs a heavy depth backbone to initialize Gaussians for improved performance. While this strategy can yield better results, it comes at the cost of significantly reduced efficiency (i.e., high inference latency and memory usage), as evidenced by Table 2 in the paper. In contrast, to preserve the efficiency advantage of our method, we did not incorporate such a heavy depth-initialized module.

  • These results demonstrate both the effectiveness and the efficiency of our approach. We hope this addresses your concerns and further highlights the practical advantages of our method.

  • Due to the limited time during the rebuttal period, we only conducted the experiments with 12800 superquadrics on SSCBench-KITTI-360. We will provide more experiments on Occ3D and SSCBench-KITTI-360 later.

| Method | Input Modality | Number of Primitives | IoU | mIoU |
|---|---|---|---|---|
| LMSCNet [2] | LiDAR | - | 47.53 | 13.65 |
| SSCNet [3] | LiDAR | - | 53.58 | 16.95 |
| MonoScene [4] | Camera | 262144 (128x128x16) | 37.87 | 12.31 |
| VoxFormer [5] | Camera | 262144 (128x128x16) | 38.76 | 11.91 |
| TPVFormer [6] | Camera | 81920 (256x256+256x32x2) | 40.22 | 13.64 |
| OccFormer [7] | Camera | 262144 (128x128x16) | 40.27 | 13.81 |
| GaussianFormer [8] | Camera | 38400 | 35.38 | 12.92 |
| GaussianFormer-2 [9] | Camera | 38400 | 38.37 | 13.90 |
| Ours | Camera | 12800 | 36.81 | 12.86 |

W2.1: QuadricFormer achieves lower IoU and mIoU than SurroundOcc and OccFormer. Highlighting QuadricFormer as the best-performing model is potentially misleading to readers.

  • We apologize for the lack of clarity in Table 1, where we only reported the performance of QuadricFormer with 1600 superquadrics. This presentation may have caused confusion about our SOTA claim. Our model actually can be configured to use different numbers of superquadrics, and using more superquadrics yields better accuracy with only a slight reduction in efficiency. As shown in Table 2, with 6400 and 12800 superquadrics, QuadricFormer achieves mIoUs of 20.79 and 21.11 and IoUs of 31.89 and 32.13, respectively, surpassing all previous methods and achieving SOTA results.

  • Moreover, unlike GaussianFormer-2, we do not use the heavy depth-initialized module, so our model maintains high efficiency even with 6400 or 12800 superquadrics (see Table 2).

  • To avoid confusion, we have updated Table 1 in the revised manuscript to clarify the SOTA claim and clearly indicate the results for different numbers of superquadrics.

W2.2: The reported numbers for GaussianFormer-2 (30.56 IoU, 20.02 mIoU) are not its best-performing results.

  • Thanks for your suggestion. In Table 1, we only reported the results of GaussianFormer-2 (Ch=128, mIoU=20.02) with a feature dimension of 128 to ensure a fair comparison, since our method is under the same setting. However, we agree that the best results should also be included to provide a more complete picture. In the revised version, we have added the strongest results of GaussianFormer-2 (Ch=192, mIoU=20.82) and clarified this in the manuscript.

  • Notably, even when compared with the best configuration of GaussianFormer-2, our QuadricFormer (with 12,800 superquadrics) still achieves superior performance with clear efficiency advantages. We appreciate your suggestion, which has improved the completeness and transparency of our experimental comparisons.

Q1: Could the authors provide more details on the pruning-and-splitting module?

We provide implementation details of the pruning-and-splitting module. To clarify, we take the QuadricFormer with $N$ superquadrics as an example and describe the process as follows:

  1. Initial Training: We first train a QuadricFormer with $B=4$ quadric-encoder blocks and without the pruning-and-splitting module. The model starts from $N$ randomly initialized superquadrics $\mathbf{Q}_{init}$ and predicts adjusted superquadrics $\mathbf{Q}$.

  2. Pruning-and-Splitting Module: During experiments, we observed that some superquadrics in $\mathbf{Q}$ contribute little to scene modeling; these are usually located in empty regions and have very small scales. To address this, we introduce the pruning-and-splitting module:

    • We divide all superquadrics in $\mathbf{Q}$ into two groups based on the product of their scales: the $N_{crop}$ superquadrics $\mathbf{C}$ with the smallest scales and the remaining $N_{valid}$ superquadrics $\mathbf{V}$, where $N = N_{crop} + N_{valid}$.
    • We discard the smaller superquadrics $\mathbf{C}$, as they are most likely to contribute little to scene modeling.
    • We randomly sample $N_{crop}$ superquadrics from $\mathbf{V}$ to form $\mathbf{S}$. The features of $\mathbf{S}$ remain unchanged, and only their positions are slightly adjusted.
  3. Further Refinement: Finally, $\mathbf{S}$ and $\mathbf{V}$ are passed through two additional quadric-encoder blocks to further refine their attributes, resulting in the final superquadrics $\mathbf{Q}_{final}$. At this stage, we load the pretrained model parameters and continue training for 10 more epochs. A code-level sketch of this procedure is given below.
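
Below is a minimal PyTorch-style sketch of the relocation step; it is an illustrative simplification rather than our exact code, and the tensor names (`means`, `scales`, `features`) and the `position_noise` magnitude are placeholder choices:

```python
import torch

def prune_and_split(means, scales, features, n_crop, position_noise=0.1):
    """Relocate the n_crop smallest-scale superquadrics (illustrative sketch).

    means:    (N, 3) superquadric centers
    scales:   (N, 3) per-axis scale parameters
    features: (N, C) superquadric query features
    """
    # Rank superquadrics by the product of their per-axis scales; small
    # products typically correspond to primitives left in empty regions.
    order = torch.argsort(scales.prod(dim=-1))           # ascending
    valid_idx = order[n_crop:]                            # kept (larger) primitives

    # Split: resample n_crop primitives from the kept ones, copying their
    # features and scales and perturbing only the positions.
    sample = valid_idx[torch.randint(len(valid_idx), (n_crop,))]
    new_means = means[sample] + position_noise * torch.randn_like(means[sample])

    # The pruned primitives are replaced one-for-one, so the total number of
    # superquadrics N stays unchanged; extra quadric-encoder blocks then
    # refine the relocated set.
    means = torch.cat([means[valid_idx], new_means], dim=0)
    scales = torch.cat([scales[valid_idx], scales[sample]], dim=0)
    features = torch.cat([features[valid_idx], features[sample]], dim=0)
    return means, scales, features
```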

M1 & M2

  1. Thanks for pointing out this editing error. We have now removed the incorrect "Ch" notation from the caption of Table 1.
  2. We acknowledge that there was a mismatch between the text and results in Table 1. This was because Table 1 did not include model performance under settings of more than 1600 superquadrics, which were included in Table 2. We have updated Table 1 and revised the corresponding statements to ensure full consistency between our description and the reported results.

References

[1] SSCBench: A Large-Scale 3D Semantic Scene Completion Benchmark for Autonomous Driving, IROS 2024.

[2] LMSCNet: Lightweight Multiscale 3D Semantic Completion, 3DV 2020.

[3] Semantic Scene Completion From a Single Depth Image, CVPR 2017.

[4] MonoScene: Monocular 3D Semantic Scene Completion, CVPR 2022.

[5] VoxFormer: Sparse Voxel Transformer for Camera-Based 3D Semantic Scene Completion, CVPR 2023.

[6] Tri-Perspective View for Vision-Based 3D Semantic Occupancy Prediction, CVPR 2023.

[7] OccFormer: Dual-path Transformer for Vision-based 3D Semantic Occupancy Prediction, ICCV 2023.

[8] GaussianFormer: Scene as Gaussians for Vision-Based 3D Semantic Occupancy Prediction, ECCV 2024.

[9] GaussianFormer-2: Probabilistic Gaussian Superposition for Efficient 3D Occupancy Prediction, CVPR 2025.

Comment

We are pleased to share the new results on SSCBench-KITTI-360. Due to time constraints, our initial response only included results from a single run, which was not fully optimized. After tuning the hyperparameters (specifically, the loss weights), our method achieves an mIoU of 13.63 and IoU of 38.89 with 12,800 superquadrics and without the depth-initialized module. The updated results are shown below:

| Method | Input Modality | Number of Primitives | IoU | mIoU |
|---|---|---|---|---|
| GaussianFormer | Camera | 38400 | 35.38 | 12.92 |
| GaussianFormer-2 (Depth Initialized) | Camera | 38400 | 38.37 | 13.90 |
| Ours | Camera | 12800 | 38.89 | 13.63 |
Comment

The authors' response and the updated experiments have mostly alleviated my concerns. I would like to to increase my rating. However, I believe the field's development would be hindered if works remain incomparable due to diverse evaluation benchmarks. For the community's benefit, I strongly recommend the authors to include results on Occ3D in the next version if possible.

Comment

Thanks for your valuable suggestions and positive recognition of our work. We are currently trying our best to conduct experiments on the Occ3D. We will include the results in the updated version to enable a more comprehensive comparison with existing methods. The relevant code will also be released to benefit the community. Thank you again for your thoughtful feedback.

Comment

I disagree with the reviewer. The claim that "works remain incomparable due to diverse evaluation benchmarks" is flawed. Diverse benchmarks actually enhance comparability by revealing method strengths and limitations across multiple perspectives.

Each benchmark captures unique aspects of occupancy prediction - different sensor modalities, environmental conditions, and task formulations. Even benchmarks using the same dataset differ significantly: although Occ3D and SurroundOcc both use nuScenes, their approaches to aggregating 3D occupancy ground-truth are different. Furthermore, SurroundOcc open-sources their ground-truth generation pipeline while Occ3D doesn't provide theirs. This open-source approach not only ensures transparency and reproducibility but also enables other researchers to generate occupancy ground-truth for additional datasets, fostering broader community development rather than restricting progress to a single closed benchmark.

Multiple benchmarks reveal method robustness and identify failure modes that single-benchmark evaluation would miss. The goal should be developing methods that perform well across diverse settings, not excelling on one standardized test.

Of course, I agree that it would be more beneficial if the authors provide results on more benchmarks, however this also requires extensive computational resources that may not be available to all research groups

Comment

Thanks to the reviewer #VvnD for the constructive feedback and understanding. Due to time and computational resource constraints, we were unable to include results on Occ3D during the rebuttal period. In the revised version, we will include experiments on Occ3D to enable more comprehensive comparisons. We will also release our code to support reproducibility and benefit the community.

Official Review
Rating: 4

This paper introduces a novel vision-based 3D semantic scene completion method that employs geometrically expressive superquadrics as scene primitives instead of traditional voxel-based representations. Extensive experiments on the nuScenes dataset demonstrate that this new representation enables a more lightweight 3D SSC framework with improved latency and memory efficiency, while maintaining competitive performance.

Strengths and Weaknesses

Strengths

  • The paper is well-written and the methodology is easy to understand with clear intuition.
  • Applying superquadrics to 3D semantic scene completion is a novel, effective, and highly interesting approach.
  • The experimental results are promising, particularly in terms of efficiency, and support the proposed approach.

Weaknesses

  • The proposed method shows lower IoU and mIoU performance compared to SurroundOcc. However, this gap is acceptable considering the improved efficiency in terms of latency and memory usage.
  • To further demonstrate the robustness of the proposed method, evaluations on more diverse datasets beyond a single benchmark, nuScene, would be beneficial.

问题

  • In Table 1, the authors claim to achieve state-of-the-art performance. However, both the IoU and mIoU scores are lower than those of existing methods. Please clarify this discrepancy.
  • I would like to see results on additional datasets such as SemanticKITTI. This would be necessary to better demonstrate the generalizability and practical effectiveness of the proposed method.
  • In Table 4, the best performance is achieved when the Crop & Split Number is set to 800. Would increasing this number further lead to better performance? Additionally, how does the Crop & Split Number affect inference speed and memory usage?

局限性

yes

最终评判理由

Thank you for your response. However, given the lower performance of the proposed method on the other dataset, I would like to maintain my initial score.

格式问题

There are no formatting concerns.

作者回复

We thank the reviewer for the constructive comments and positive feedback on our paper. Regarding the concerns of the reviewer cJXu, we provide the following responses.

Q1: In Table 1, the authors claim to achieve state-of-the-art performance. However, both the IoU and mIoU scores are lower than those of existing methods.

  • We apologize for the lack of clarity in Table 1, where we only reported the performance of QuadricFormer with 1600 superquadrics. This presentation may have caused confusion about our SOTA claim. Our model actually can be configured to use different numbers of superquadrics, and using more superquadrics yields better accuracy with only a slight reduction in efficiency. As shown in Table 2, with 6400 and 12800 superquadrics, QuadricFormer achieves mIoUs of 20.79 and 21.11 and IoUs of 31.89 and 32.13, respectively, surpassing all previous methods and achieving SOTA results.

  • Moreover, unlike GaussianFormer-2, we do not use the heavy depth-initialized module, so our model maintains high efficiency even with 6400 or 12800 superquadrics (see Table 2).

  • To avoid confusion, we have updated Table 1 in the revised manuscript to clarify the SOTA claim and clearly indicate the results for different numbers of superquadrics.

Q2: I would like to see results on additional datasets such as SemanticKITTI. This would be necessary to better demonstrate the generalizability and practical effectiveness of the proposed method.

  • In response to the reviewer’s concerns about robustness and generalizability of our method, we have conducted additional experiments on the SSCBench-KITTI-360 [1] dataset. The results are summarized in the table below.

  • Compared to previous dense voxel-based and sparse Gaussian-based methods, our method achieves comparable performance with much fewer scene primitives. This highlights our method’s efficiency, as it is able to represent complex scenes with fewer expressive superquadrics without sacrificing much accuracy.

  • We would like to clarify a key difference with GaussianFormer-2, which employs a heavy depth backbone to initialize Gaussians for improved performance. While this strategy can yield better results, it comes at the cost of significantly reduced efficiency (i.e., high inference latency and memory usage), as evidenced by Table 2 in the paper. In contrast, to preserve the efficiency advantage of our method, we did not incorporate such a heavy depth-initialized module.

  • These results demonstrate both the effectiveness and the efficiency of our approach. We hope this addresses your concerns and further highlights the practical advantages of our method.

  • Due to the limited time during the rebuttal period, we only conducted the experiments with 12800 superquadrics on SSCBench-KITTI-360. We will provide more experiments on Occ3D and SSCBench-KITTI-360 later.

| Method | Input Modality | Number of Primitives | IoU | mIoU |
| --- | --- | --- | --- | --- |
| LMSCNet [2] | LiDAR | - | 47.53 | 13.65 |
| SSCNet [3] | LiDAR | - | 53.58 | 16.95 |
| MonoScene [4] | Camera | 262144 (128x128x16) | 37.87 | 12.31 |
| VoxFormer [5] | Camera | 262144 (128x128x16) | 38.76 | 11.91 |
| TPVFormer [6] | Camera | 81920 (256x256+256x32x2) | 40.22 | 13.64 |
| OccFormer [7] | Camera | 262144 (128x128x16) | 40.27 | 13.81 |
| GaussianFormer [8] | Camera | 38400 | 35.38 | 12.92 |
| GaussianFormer-2 [9] | Camera | 38400 | 38.37 | 13.90 |
| Ours | Camera | 12800 | 36.81 | 12.86 |

Q3: Would increasing this number further lead to better performance? Additionally, how does the Crop & Split Number affect inference speed and memory usage?

  • We provide more results to quantitatively analyze the impact of the pruning-and-splitting module on both model performance and efficiency. All experiments used a fixed total number of 1600 superquadrics.

  • The results show that this module clearly improves performance with only a minor efficiency loss. Moreover, changes in $N_{crop}$ have only a minor effect on inference speed and memory usage, allowing the model to achieve better performance while maintaining high efficiency.

  • The best performance is achieved when $N_{crop}=800$ (i.e., the crop & split number in the table). Increasing $N_{crop}$ beyond 800 actually leads to a slight drop in performance. We believe this is because excessive cropping removes too many superquadrics, including some valid ones, which negatively impacts the performance.

| Crop & Split Number | mIoU | IoU | Latency (ms) | Memory (MB) |
| --- | --- | --- | --- | --- |
| 0 | 19.41 | 29.77 | 158 | 2554 |
| 200 | 19.65 | 30.35 | 161 | 2554 |
| 400 | 19.90 | 30.67 | 161 | 2554 |
| 800 | 20.12 | 31.22 | 162 | 2554 |
| 1200 | 19.91 | 30.41 | 164 | 2554 |

References

[1] SSCBench: A Large-Scale 3D Semantic Scene Completion Benchmark for Autonomous Driving, IROS 2024.

[2] LMSCNet: Lightweight Multiscale 3D Semantic Completion, 3DV 2020.

[3] Semantic Scene Completion From a Single Depth Image, CVPR 2017.

[4] MonoScene: Monocular 3D Semantic Scene Completion, CVPR 2022.

[5] VoxFormer: Sparse Voxel Transformer for Camera-Based 3D Semantic Scene Completion, CVPR 2023.

[6] Tri-Perspective View for Vision-Based 3D Semantic Occupancy Prediction, CVPR 2023.

[7] OccFormer: Dual-path Transformer for Vision-based 3D Semantic Occupancy Prediction, ICCV 2023.

[8] GaussianFormer: Scene as Gaussians for Vision-Based 3D Semantic Occupancy Prediction, ECCV 2024.

[9] GaussianFormer-2: Probabilistic Gaussian Superposition for Efficient 3D Occupancy Prediction, CVPR 2025.

评论

We are pleased to share the new results on SSCBench-KITTI-360. Due to time constraints, our initial response only included results from a single run, which was not fully optimized. After tuning the hyperparameters (specifically, the loss weights), our method achieves an mIoU of 13.63 and IoU of 38.89 with 12,800 superquadrics and without the depth-initialized module. The updated results are shown below:

| Method | Input Modality | Number of Primitives | IoU | mIoU |
| --- | --- | --- | --- | --- |
| GaussianFormer | Camera | 38400 | 35.38 | 12.92 |
| GaussianFormer-2 (Depth Initialized) | Camera | 38400 | 38.37 | 13.90 |
| Ours | Camera | 12800 | 38.89 | 13.63 |

评论

Thank you for your thoughtful response. However, even considering the experimental results, a borderline acceptance seems to be the most appropriate evaluation. I wish you the best of luck with your work. Thank you.

审稿意见
4

This work proposes QuadricFormer, a novel 3D occupancy prediction model that uses probabilistic superquadrics instead of Gaussians to efficiently represent complex scene geometries. It achieves state-of-the-art performance on the nuScenes dataset with significantly improved representation efficiency.

优缺点分析

Strengths:

1. The introduction of superquadrics for scene geometry representation is the core innovation of this paper. Compared to previous 3D Gaussian-based representations, superquadrics offer significantly greater geometric expressiveness, allowing complex structures such as cuboids, cylinders, and planes to be modeled with far fewer primitives.

2. The experimental results are impressive, particularly in terms of speed improvement, where the proposed method significantly reduces latency and memory consumption.

3. The paper is logically well-structured and clearly written, making it easy to follow and understand.

Weaknesses:

1. Issue with SOTA Claim in Table 1

While the paper claims state-of-the-art performance, the results in Table 1 do not fully support this assertion. Specifically:

OccFormer reports a higher IoU (31.39 vs. 31.22) than QuadricFormer.

SurroundOcc achieves both higher mIoU (20.30 vs. 20.12) and IoU (31.49 vs. 31.22).

Although the differences are relatively minor, the current claim of SOTA is not strictly accurate based on these metrics. The authors are encouraged to revise the statement to more precisely reflect the strengths of QuadricFormer—such as efficiency-performance trade-off or performance with fewer primitives—to ensure consistency between the claims and the presented results.

2. The evaluation is limited to nuScenes. Results on KITTI-360 or other datasets would help demonstrate the generalization and robustness of the proposed method.

3. Pruning-and-Splitting Module Is Somewhat Confusing. The pruning-and-splitting module is described as a key innovation, yet its implementation is underspecified, and Table 4 contains a puzzlingly high IoU (39.77 for number=0) that contradicts other results. The efficiency gain attributed to this module is not clearly quantified. More clarity, quantitative impact, and validation are needed to justify this component.

问题

see weakness. If these concerns are clarified and the results remain solid, I would be happy to increase my score.

局限性

yes

最终评判理由

The author's clarification in the response addressed my concerns regarding the issues of SOTA comparison, generalization evaluation, and the design of the Pruning-and-Splitting Module. While this method has the potential to serve as a new baseline, the performance improvement is not particularly significant. Therefore, I consider this paper to be at the borderline of acceptance.

格式问题

Not found.

作者回复

We thank the reviewer for the constructive comments and positive feedback on our paper. Regarding the concerns of the reviewer N2Bk, we provide the following responses.

W1: Issue with SOTA Claim in Table 1

  • We apologize for the lack of clarity in Table 1, where we only reported the performance of QuadricFormer with 1600 superquadrics. This presentation may have caused confusion about our SOTA claim. Our model actually can be configured to use different numbers of superquadrics, and using more superquadrics yields better accuracy with only a slight reduction in efficiency. As shown in Table 2, with 6400 and 12800 superquadrics, QuadricFormer achieves mIoUs of 20.79 and 21.11 and IoUs of 31.89 and 32.13, respectively, surpassing all previous methods and achieving SOTA results.

  • Moreover, unlike GaussianFormer-2, we do not use the heavy depth-initialized module, so our model maintains high efficiency even with 6400 or 12800 superquadrics (see Table 2).

  • To avoid confusion, we have updated Table 1 in the revised manuscript to clarify the SOTA claim and clearly indicate the results for different numbers of superquadrics.

W2: The evaluation is limited to nuScenes.

  • In response to the reviewer’s concerns about robustness and generalizability of our method, we have conducted additional experiments on the SSCBench-KITTI-360 [1] dataset. The results are summarized in the table below.

  • Compared to previous dense voxel-based and sparse Gaussian-based methods, our method achieves comparable performance with much fewer scene primitives. This highlights our method’s efficiency, as it is able to represent complex scenes with fewer expressive superquadrics without sacrificing much accuracy.

  • We would like to clarify a key difference with GaussianFormer-2, which employs a heavy depth backbone to initialize Gaussians for improved performance. While this strategy can yield better results, it comes at the cost of significantly reduced efficiency (i.e., high inference latency and memory usage), as evidenced by Table 2 in the paper. In contrast, to preserve the efficiency advantage of our method, we did not incorporate such a heavy depth-initialized module.

  • These results demonstrate both the effectiveness and the efficiency of our approach. We hope this addresses your concerns and further highlights the practical advantages of our method.

  • Due to the limited time during the rebuttal period, we only conducted the experiments with 12800 superquadrics on SSCBench-KITTI-360. We will provide more experiments on Occ3D and SSCBench-KITTI-360 later.

| Method | Input Modality | Number of Primitives | IoU | mIoU |
| --- | --- | --- | --- | --- |
| LMSCNet [2] | LiDAR | - | 47.53 | 13.65 |
| SSCNet [3] | LiDAR | - | 53.58 | 16.95 |
| MonoScene [4] | Camera | 262144 (128x128x16) | 37.87 | 12.31 |
| VoxFormer [5] | Camera | 262144 (128x128x16) | 38.76 | 11.91 |
| TPVFormer [6] | Camera | 81920 (256x256+256x32x2) | 40.22 | 13.64 |
| OccFormer [7] | Camera | 262144 (128x128x16) | 40.27 | 13.81 |
| GaussianFormer [8] | Camera | 38400 | 35.38 | 12.92 |
| GaussianFormer-2 [9] | Camera | 38400 | 38.37 | 13.90 |
| Ours | Camera | 12800 | 36.81 | 12.86 |

W3: Pruning-and-Splitting Module Is Somewhat Confusing.

  • We provide implementation details of the pruning-and-splitting module. To clarify, we take QuadricFormer with $N$ superquadrics as an example and describe the process as follows:

    1. Initial Training: We first train a QuadricFormer with $B=4$ quadric-encoder blocks and without the pruning-and-splitting module. The model starts from $N$ randomly initialized superquadrics $\mathbf{Q}_{init}$ and predicts adjusted superquadrics $\mathbf{Q}$.

    2. Pruning-and-Splitting Module: During experiments, we observed that some superquadrics in $\mathbf{Q}$ contribute little to scene modeling; they are usually located in empty regions and have very small scales. To address this, we introduce the pruning-and-splitting module:

      • We divide all superquadrics in $\mathbf{Q}$ into two groups based on the product of their scales: the $N_{crop}$ superquadrics $\mathbf{C}$ with the smallest scales and the remaining $N_{valid}$ superquadrics $\mathbf{V}$, where $N = N_{crop} + N_{valid}$.
      • We discard the smaller superquadrics $\mathbf{C}$, as they are most likely to contribute little to scene modeling.
      • We randomly sample $N_{crop}$ superquadrics from $\mathbf{V}$ to form $\mathbf{S}$. The features of $\mathbf{S}$ remain unchanged, and only their positions are slightly adjusted (see the illustrative sketch after this list).
    3. Further Refinement: Finally, $\mathbf{S}$ and $\mathbf{V}$ are passed through two additional quadric-encoder blocks to further refine their attributes, resulting in the final superquadrics $\mathbf{Q}_{final}$. At this stage, we load the pretrained model parameters and continue training for 10 more epochs.
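
To make this step more concrete, below is a minimal PyTorch-style sketch of the pruning-and-splitting operation described above. The tensor layout (positions in the first three columns and per-axis scales in the next three), the function name `prune_and_split`, and the `jitter_std` parameter are illustrative assumptions rather than our actual implementation.

```python
import torch

def prune_and_split(quadrics: torch.Tensor, features: torch.Tensor,
                    n_crop: int, jitter_std: float = 0.1):
    """Prune the smallest superquadrics and split (re-sample) from the rest.

    quadrics: (N, D) tensor; columns 0:3 are positions, 3:6 are per-axis scales
              (assumed layout for illustration only).
    features: (N, C) tensor of per-superquadric query features.
    """
    # Rank superquadrics by the product of their scales (a volume proxy).
    volume_proxy = quadrics[:, 3:6].prod(dim=-1)
    order = torch.argsort(volume_proxy)        # ascending: smallest first
    valid_idx = order[n_crop:]                 # drop the n_crop smallest ones

    valid_q, valid_f = quadrics[valid_idx], features[valid_idx]

    # Re-sample n_crop superquadrics from the remaining valid ones to split.
    resample = torch.randint(0, valid_q.shape[0], (n_crop,))
    split_q, split_f = valid_q[resample].clone(), valid_f[resample].clone()

    # Features stay unchanged; only the positions are slightly perturbed.
    split_q[:, :3] += jitter_std * torch.randn_like(split_q[:, :3])

    # The resulting N superquadrics are then refined by two further
    # quadric-encoder blocks (not shown here).
    return torch.cat([valid_q, split_q], dim=0), torch.cat([valid_f, split_f], dim=0)
```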

  • We provide more results to quantitatively analyze the impact of the pruning-and-splitting module on both model performance and efficiency. All experiments used a fixed total number of 1600 superquadrics. The results show that this module clearly improves performance with only a minor efficiency loss. The best performance is achieved when $N_{crop}=800$ (i.e., the crop & split number in the table). Increasing $N_{crop}$ further actually reduces the number of valid superquadrics and leads to a drop in performance. Moreover, changes in $N_{crop}$ have only a minor effect on inference speed and memory usage, allowing the model to achieve better performance while maintaining high efficiency.

| Crop & Split Number | mIoU | IoU | Latency (ms) | Memory (MB) |
| --- | --- | --- | --- | --- |
| 0 | 19.41 | 29.77 | 158 | 2554 |
| 200 | 19.65 | 30.35 | 161 | 2554 |
| 400 | 19.90 | 30.67 | 161 | 2554 |
| 800 | 20.12 | 31.22 | 162 | 2554 |
| 1200 | 19.91 | 30.41 | 164 | 2554 |
  • Regarding the puzzlingly high IoU (39.77 for number=0) in Table 4, we apologize for this typographical error. The correct value is 29.77 (not 39.77), and this typo has been fixed in the revised manuscript.

References

[1] SSCBench: A Large-Scale 3D Semantic Scene Completion Benchmark for Autonomous Driving, IROS 2024.

[2] LMSCNet: Lightweight Multiscale 3D Semantic Completion, 3DV 2020.

[3] Semantic Scene Completion From a Single Depth Image, CVPR 2017.

[4] MonoScene: Monocular 3D Semantic Scene Completion, CVPR 2022.

[5] VoxFormer: Sparse Voxel Transformer for Camera-Based 3D Semantic Scene Completion, CVPR 2023.

[6] Tri-Perspective View for Vision-Based 3D Semantic Occupancy Prediction, CVPR 2023.

[7] OccFormer: Dual-path Transformer for Vision-based 3D Semantic Occupancy Prediction, ICCV 2023.

[8] GaussianFormer: Scene as Gaussians for Vision-Based 3D Semantic Occupancy Prediction, ECCV 2024.

[9] GaussianFormer-2: Probabilistic Gaussian Superposition for Efficient 3D Occupancy Prediction, CVPR 2025.

评论

Thank you for your response. The authors have adequately addressed my concerns. I have decided to raise my score.

最终决定

The paper introduces a method for 3D occupancy prediction using superquadrics as a scene geometry representation in urban environments. It demonstrates that superquadrics can model complex scene structures with significantly fewer primitives compared to traditional dense voxels or Gaussian primitives.

The paper received one accept, three borderline accepts, and one borderline reject. All reviewers noted that (a) the method is novel, and (b) the experiments show superior performance while substantially reducing computational cost and memory usage compared to prior methods. The main weakness, raised by the negative reviewer (uPb2), was the lack of exhaustive comparisons with other methods discussed during the author–reviewer exchange. In response, reviewer VvnD argued during the AC-reviewer discussion: "While I agree with reviewer uPb2 that SOTA results on Occ3D would be valuable, their absence doesn't diminish this paper's scientific contribution. The main contribution is demonstrating superquadrics' benefits over Gaussians and voxels. Achieving SOTA often requires dataset-specific engineering tricks that obscure core methodological contributions. Additionally, I think requiring every paper to beat SOTA is bad as it shifts focus from fundamental insights to incremental optimizations."

The AC agrees with this perspective: the novelty of the method should be valued over incremental SOTA improvements. Accordingly, the AC recommends acceptance. Still, the authors are strongly encouraged to address uPb2’s concerns and carefully consider all reviewer feedback from both the reviews and the author-reviewer discussion when preparing the final version of the paper!