PaperHub
Overall Rating: 5.6 / 10 (Poster; 5 reviewers; min 5, max 7, std 0.8)
Individual ratings: 5, 6, 5, 7, 5
Confidence: 3.2 | Correctness: 3.0 | Contribution: 2.6 | Presentation: 2.6

Abstract

Keywords

Domain Adaptation, Denoising Diffusion Models, 3D Object Detection

Reviews and Discussion

Review (Rating: 5)

In this paper, the authors propose a method that makes use of a point diffusion model for 3D bounding box refinement. The points around proposals are transformed into a normalized box view, and the model denoises them into accurate boxes conditioned on the points near the proposals. The model learns the distribution of points relative to the object's bounding box to refine noisy proposals from detection models for off-the-shelf domain adaptation. The paper conducts extensive domain adaptation experiments on the KITTI, Lyft L5, and Ithaca365 datasets, and presents ablation studies on Context Limit and Shape Weight.
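For intuition about the normalized box view mentioned in this summary, below is a minimal NumPy sketch of such a transform; the function name and interface are hypothetical illustrations, not the authors' actual code.

```python
import numpy as np

def to_normalized_box_view(points, box_center, box_dims, box_yaw):
    """Express LiDAR points relative to a (possibly noisy) proposal box.

    After shifting to the box center, rotating into the box frame, and
    scaling by the box dimensions, a perfectly fitting box maps its points
    into roughly [-0.5, 0.5]^3 regardless of object size, which is what
    makes the representation scale-invariant across domains.
    """
    shifted = points - box_center                      # center at origin
    c, s = np.cos(-box_yaw), np.sin(-box_yaw)
    rot = np.array([[c, -s, 0.0],
                    [s,  c, 0.0],
                    [0.0, 0.0, 1.0]])                  # undo the box heading
    local = shifted @ rot.T
    return local / box_dims                            # normalize by size

# Example: points near a car-sized proposal (4.5 x 1.9 x 1.6 m, yaw 0.3 rad).
pts = np.random.randn(128, 3) * np.array([1.5, 0.7, 0.5]) + np.array([12.0, -3.0, 0.8])
nbv = to_normalized_box_view(pts,
                             box_center=np.array([12.0, -3.0, 0.8]),
                             box_dims=np.array([4.5, 1.9, 1.6]),
                             box_yaw=0.3)
```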

Strengths

Originality: It is a novel approach to 3D bounding box refinement using a point diffusion model in 3D domain adaptation. Quality: The code is provided in the supplementary material. The paper is evaluated on multiple datasets (KITTI, Lyft Level 5, and Ithaca365) and achieves good performance. The ablation studies are also good. Clarity: The derivation of formulas and the visualizations are clear and easy to understand. Significance: A good application of a point diffusion model for 3D bounding box refinement in the 3D domain adaptation area.

Weaknesses

From the experiment results, when the domain adaptation baselines are stronger (by adding OT and SN), the performance gain of DiffuBox is smaller, which weakens the significance of the paper. It is recommended to try recent 3D detectors (like CenterPoint and other SOTA detectors) and SOTA domain adaptation baselines to see if DiffuBox can still improve the performance, which would help to prove the contribution.

Lack of discussion on the model size/runtime. Because the diffusion model brings computational overhead (larger model size and longer runtime), it is better to take these into consideration when comparing performance with other methods, making apples-to-apples comparisons, to prove that the performance gain of DiffuBox is not merely due to larger model size/capacity; this would further enhance the significance of the paper.

The diversity of the qualitative visualization results is limited, and it is suggested to include more diverse categories (like pedestrian, cyclist), size ranges (like rare vehicles), depth ranges, and environments.

Questions

Have you evaluated the robustness of DiffuBox when the sensor data quality varies? For instance, how does DiffuBox perform if the point cloud contains noise?

How does DiffuBox perform when handling objects of different sizes? For example, is there a difference in accuracy between small and large objects?

Limitations

Yes.

Author Response

We thank the reviewer for finding our method novel and of high quality, and for noting our extensive experimentation across datasets. We address the detailed concerns below.

Q1: Comparison to more recent 3D detectors

Thank you for this suggestion, and we are actively working towards including more recent baselines to strengthen our work. While we are currently in the middle of performing these experiments and cannot provide the results before the rebuttal deadline, we will include the updated results during the discussion period.

Q2: Computation and runtime vs performance trade-off analysis

We perform an analysis of the latency and runtime of our method in the common questions section. Additionally, we plot an ablation figure showing the trade-off between the number of diffusion steps and performance in Fig. 1 of the uploaded Rebuttal PDF, and further discuss its implications in the common questions section. We will include these in the final version.

Q3: More diverse visualizations

We have included additional, more diverse visualizations in Fig. 3 of the uploaded Rebuttal PDF. To summarize, we provide additional qualitative results that showcase DiffuBox on the car, cyclist, and pedestrian classes. IoU values of the boxes before and after DiffuBox are included for visualization purposes, showing gains in alignment and better shape estimation after applying our method. Our method is effective across different actor class types. We thank you for this valuable suggestion, and will include these in the final version.

Q4: Robustness of method across object sizes and sensor quality

We have performed additional analysis on objects of different sizes in Fig. 2(b) of the uploaded Rebuttal PDF. Observe that the performance gain is clear across all object sizes. This result arises naturally, since we conduct our diffusion modeling in the normalized box view (NBV), which effectively corrects boxes of all sizes. We believe that this is a benefit of our work, and we will include these results in the final version as well. Regarding sensor set-up robustness, the three datasets we experimented with have significant differences in their sensor setups, and we further report results on nuScenes in the common questions section. Regarding different sensor modalities, please refer to our response to reviewer YzBS Q3.

Comment

Thank you for the detailed rebuttal and the efforts to address the concerns raised. I appreciate the additional experiments and analyses you're conducting, particularly the upcoming comparisons with more recent 3D detectors, which will be crucial for evaluating DiffuBox's effectiveness. The runtime and performance trade-off analysis, along with the added diverse visualizations and robustness tests, significantly strengthen the paper. I maintain my rating.

Comment

Dear Reviewer 7LbH,

Thank you again for your time and constructive feedback during the reviewing process! We are happy to have addressed your concerns, and we thank you for suggesting the additional experiments that help strengthen our work. We would like to point you to the additional experimental results in the general response section on more recent detectors, CenterPoint [1] and DSVT [2], which further demonstrate the value of our method. As the end of the discussion period is approaching, we would like to know whether our responses have properly addressed your remaining issues. Please feel free to let us know if there are any additional clarifications we can provide!

Best Regards,

Authors

[1] Yin, Tianwei, Xingyi Zhou, and Philipp Krahenbuhl. "Center-based 3d object detection and tracking." Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2021.

[2] Wang, Haiyang, et al. "Dsvt: Dynamic sparse voxel transformer with rotated sets." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023.

Review (Rating: 6)

The paper introduces a diffusion-based box refinement method aimed at enhancing the robustness of 3D object detection and localization across diverse sensor setups or geographic locations. DiffuBox utilizes a domain-agnostic diffusion model, conditioned on LiDAR points around a coarse bounding box, to refine the box’s location, size, and orientation. Evaluated under various domain adaptation settings, the method has shown significant improvements in localization accuracy across different datasets, object classes, and detectors.

Strengths

  1. The paper is well-organized, providing a clear structure and offering a comprehensive analysis of existing experimental results.
  2. The paper provides a detailed analysis of the domain gap across different datasets.
  3. Using diffusion models to address domain adaptation in point cloud 3D object detection is a good idea.

Weaknesses

  1. The paper lacks experimental results on larger and more diverse datasets such as nuScenes and Waymo, as seen in similar studies ST3D[52], SN[47], etc.
  2. Given the iterative nature and complexity of diffusion processes, utilizing diffusion models solely for bounding box refinement may introduce disproportionately high computational costs and increase training difficulty. Discussing how to assess the trade-off between performance gains and increased computational demands in the paper is essential.
  3. The discussion and experimental validation of domain differences caused by varying sensor setups across different datasets are very limited in the paper. However, this remains a significant issue between datasets.

As highlighted in the above weaknesses, the paper still lacks key experiments and discussions necessary to fully validate the method's effectiveness. Addressing these limitations would result in a more comprehensive and robust paper.

Questions

Minor weaknesses that did not determine the final rating: "groun truth" in Fig. 1 is misspelled, etc.

Limitations

Yes.

Author Response

We thank the reviewer for their valuable feedback! Below, we address additional concerns:

Q1: Request for results on additional dataset

We are happy to provide additional experimental results on the nuScenes dataset, and report the results in the common questions section. Our results confirm DiffuBox's effectiveness, even on this new dataset.

Q2: Computational expense vs performance trade-off analysis

We investigate the trade-off between performance and computational cost by varying the number of diffusion steps, and plot the mAP against per-box refinement time (corresponding to different numbers of diffusion steps). The figure is included in Fig. 1 of the uploaded Rebuttal PDF, and additional discussion is provided in the common questions section.

Q3: Additional discussion about varying sensor set-ups

The three datasets we experimented with have significant differences in the sensor setup, and we further report results on nuScenes in the common questions section. We detail the sensor set-up for different datasets in the table below:

| Dataset | LiDAR Type | Beam Angles |
|---|---|---|
| KITTI | 1 x 64-beam | [-24°, 4°] |
| Lyft | 1 x 40- or 64-beam + 2 x 40-beam | [-29°, 5°] |
| Ithaca365 | 1 x 128-beam | [-11.25°, 11.25°] |
| nuScenes | 1 x 32-beam | [-30°, 10°] |

We believe that the diversity of sensor set-ups across these datasets helps justify our method's robustness to sensor distributions. We acknowledge that we do not evaluate on other LiDAR modalities such as solid-state LiDAR sensors, and leave that to future work. We will include this discussion in the final version.

Comment

Thanks for the authors' responses, which have addressed my concerns in general. I keep my original rating.

Comment

Dear Reviewer YzBS,

Thank you again for your time and constructive feedback! We are happy to have addressed the concerns you had and are grateful for the suggestions you provided during the review process. Please feel free to let us know if there are any additional clarifications we can provide!

Best Regards,

Authors

Review (Rating: 5)

The article presents DiffuBox, a novel method to refine 3D object detection using a point diffusion model. This approach addresses the challenges posed by domain shift, where 3D object detectors trained in one geographic region or sensor setup may not perform well in different settings. DiffuBox uses a domain-agnostic diffusion model conditioned on LiDAR points around a coarse bounding box to refine the box's location, size, and orientation. This model operates on object scale-invariant data, transforming LiDAR points into a normalized box view relative to the bounding box, thus eliminating the shape priors from the source domain. The paper demonstrates significant improvements in mean Average Precision (mAP) across various datasets and object classes, especially in near-range detections where more LiDAR points are present.

Strengths

  • The use of a diffusion-based model for refining 3D object detection is novel and addresses significant limitations of current domain adaptation methods.
  • The method shows substantial improvements in mAP across different datasets and object classes, highlighting its effectiveness.
  • The paper provides a thorough explanation of the problem of 3D object detection, the proposed solution, and the underlying theory, which enhances understanding and reproducibility.
  • DiffuBox is shown to improve the performance of various detectors and across different domain adaptation methods, demonstrating its broad applicability.
  • The approach has significant implications for improving the reliability of 3D object detection in autonomous driving and robotics, which is a critical application area.

Weaknesses

  • Limited Discussion on False Negatives. The method focuses on refining existing bounding boxes but does not address the issue of false negatives, which could be an important aspect of overall detection performance.
  • The reason why the diffusion model can refine the output bounding boxes should be further pointed out.
  • More recent SOTA methods should be used for comparison.

Questions

Please see weaknesses.

Limitations

Please see weaknesses.

Author Response

We thank the reviewer for finding our method novel and effective, and address their concerns below:

Q1: Additional discussion on false negatives

Thank you for this suggestion! We include an analysis plot of Recall vs. IoU threshold in Fig. 2(a) of the uploaded Rebuttal PDF. DiffuBox is able to improve IoU for mislocalized detections and reduce false negatives that arise from the match IoU being lower than the threshold, which forms a major cause of reduced cross-domain performance [1]. However, one limitation is that ours is inherently a refinement method, and we cannot recover false negatives due to completely missed detections. We will include these results and discussion in the final version as well.
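As a toy illustration of the mechanism described above (the IoU values here are made up, not taken from the paper): refinement raises per-detection IoU, so more detections survive a given matching threshold and recall improves.

```python
import numpy as np

# Hypothetical best-match IoUs for five detections, before and after refinement.
ious_before = np.array([0.45, 0.55, 0.62, 0.71, 0.30])
ious_after  = np.array([0.58, 0.66, 0.74, 0.80, 0.41])

for thr in (0.5, 0.7):
    r_before = (ious_before >= thr).mean()
    r_after = (ious_after >= thr).mean()
    print(f"IoU@{thr}: recall {r_before:.1f} -> {r_after:.1f}")
# At IoU 0.7, recall doubles (0.2 -> 0.4) purely from better localization.
```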

Q2: Intuition behind the quality of our refinement results

Our method can be thought of as learning a relevant shape prior, as well as correcting the localization. Diffusion models excel at object-level shape refinement [2, 3] due to their iterative nature and ability to approximate the score function. Viewing the points as part of a distribution within the bounding box, the diffusion model is able to guide the box towards better localization and shape. We will further elaborate on why the diffusion model can refine the output bounding boxes in the final version.
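For concreteness, the refinement described above can be sketched as iterated score-guided updates of the box parameters; `score_fn` below stands in for a trained diffusion network, and the whole interface is a hypothetical simplification, not the authors' implementation.

```python
import numpy as np

def refine_box(box, points, score_fn, n_steps=8, step_size=0.05):
    """box packs (cx, cy, cz, l, w, h, yaw). `score_fn(points, box)` is
    assumed to approximate the score (gradient of the log-density) of the
    box parameters given the nearby LiDAR points."""
    box = np.asarray(box, dtype=float)
    for _ in range(n_steps):
        # One denoising step: nudge the box toward a higher-likelihood
        # configuration of points-relative-to-box.
        box = box + step_size * score_fn(points, box)
    return box

# Toy usage: a dummy score that pulls the box center toward the point centroid.
pts = np.random.randn(64, 3) + np.array([10.0, 2.0, 0.5])
dummy_score = lambda p, b: np.concatenate([p.mean(axis=0) - b[:3], np.zeros(4)])
refined = refine_box([9.0, 1.0, 0.0, 4.5, 1.9, 1.6, 0.1], pts, dummy_score)
```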

Q3: More recent methods as baselines

Thank you for your valuable suggestion, and we believe an additional baseline will strengthen our results. We are actively performing experiments with more recent detectors. Unfortunately, we could not obtain the results before the rebuttal deadline, and will provide updates during the reviewer-author discussion period.

[1] Wang, Yan, et al. "Train in germany, test in the usa: Making 3d object detectors generalize." CVPR 2020.

[2] Zhou, Linqi, Yilun Du, and Jiajun Wu. "3d shape generation and completion through point-voxel diffusion." Proceedings of the IEEE/CVF international conference on computer vision. 2021.

[3] Vahdat, Arash, et al. "Lion: Latent point diffusion models for 3d shape generation." Advances in Neural Information Processing Systems 35 (2022): 10021-10039.

Comment

Dear Reviewer DshE,

Thank you again for your time and constructive feedback during the reviewing process! We would like to point you to the additional experimental results in the general response section on more recent detectors, CenterPoint [1] and DSVT [2], which further demonstrate the value of our method. As the end of the discussion period is approaching, we would like to know whether our responses have properly addressed your remaining issues. Please feel free to let us know if there are any additional clarifications we can provide!

Best Regards,

Authors

[1] Yin, Tianwei, Xingyi Zhou, and Philipp Krahenbuhl. "Center-based 3d object detection and tracking." Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2021.

[2] Wang, Haiyang, et al. "Dsvt: Dynamic sparse voxel transformer with rotated sets." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023.

Review (Rating: 7)

A novel diffusion-based method for refining 3D object detection bounding boxes to address domain adaptation issues. Being domain-agnostic, the approach leverages the consistency of point distributions relative to bounding boxes across different domains, improving robustness.

Strengths

  • Uses a diffusion model to fix incorrectly positioned bounding boxes to fit the correct point distribution, even across domains.
  • Can be used as a post-processing step with various existing 3D object detection methods to enhance their performance under domain shifts.
  • Provides strong quantitative improvements in mAP across multiple datasets and detectors, showcasing its effectiveness.
  • Can be integrated with various detection models without requiring retraining, making it versatile and easy to adopt.

Weaknesses

I didn't find any major issue in this paper; I think the approach of using a diffusion model to refine box locations is interesting.

Questions

Would be great if the authors included an image of the overall architecture of the model; for now it's unclear to me how all the components connect as a whole. Have the authors tried the approach on bigger datasets like nuScenes or Waymo? Since the paper is about 3D object detection for autonomous driving, the authors should include an ablation on the model's latency and the contribution of each module to the final performance.

Limitations

None

Author Response

We thank the reviewer for finding our method interesting and robust, and for pointing out the benefits of our domain-agnostic method. We further address their concerns below:

Q1: Overall architecture description of the method

We thank the reviewer for pointing out this possible point of confusion, and have included an algorithmic description of the overall training and inference workflow in Alg. 1 of the Rebuttal PDF. We hope this clarifies the overall DiffuBox method, and we will include it in the final write-up.

Q2: Evaluation on bigger dataset

Thank you for the suggestion. We are happy to report DiffuBox's strong performance on a larger dataset. We conducted additional experiments on nuScenes and report the performance in the common questions section.

Q3: Latency vs. mAP

We perform an ablation on the number of denoising steps and measure the corresponding performance and latency. The results are reported in Fig. 1 of the Rebuttal PDF and further discussed in the common questions section.

Comment

Dear Reviewer uVLB,

Thank you again for your time and constructive feedback during the reviewing process! We sincerely appreciate your insightful comments that can help us improve our work and are highly encouraged by your positive feedback! Please feel free to let us know if there are any additional clarifications we can provide!

Best Regards,

Authors

Review (Rating: 5)

This paper proposes a diffusion model-based box refinement module to enhance detection. The experiments are conducted in several settings.

Strengths

  • Using a diffusion model to refine boxes sounds interesting.
  • The proposed module has been proven effective by a series of experiments.

Weaknesses

  • The idea is simple. Refining boxes is not innovative, although the authors use the recently popular diffusion model to achieve this goal.
  • The proposed diffusion-guided box refinement module can be seen as a new stage for detection. However, only the older PointPillars, SECOND, and PV-RCNN are used as baselines, without newer works.
  • I'm worried about the motivation of the article: refining the box doesn't seem to be necessarily related to domain adaptation detection. The authors may have written it this way to avoid comparisons with SOTA detection models.

Questions

Please refer to the above weaknesses.

Limitations

Please refer to the above weaknesses.

Author Response

Thank you so much for the insightful comments. Your valuable suggestions are very helpful for further strengthening our paper.

Q1: The idea is simple. Refining boxes is not innovative, although the authors use the recently popular diffusion model to achieve this goal.

Thank you for your comments regarding the novelty of our work. Domain adaptation for 3D object detection is indeed a challenging task. Existing methods [1, 2] typically require finetuning the model based on target dataset statistics or pseudo labels. In contrast, our model achieves zero-shot adaptation without any finetuning, outperforming these baselines. We deliberately kept our algorithm simple to highlight our key insight: that the diffusion model is domain-agnostic and can effectively refine box size and location.

Q2: The proposed diffusion-guided box refinement module can be seen as a new stage for detection. However, only the older PointPillars, SECOND, and PV-RCNN are used as baselines, without newer works.

Thank you for the valuable suggestions. We believe adding more advanced baselines will strengthen our paper. We are currently working on incorporating more recent baselines with our proposed diffusion model. Unfortunately, we could not obtain the results before the rebuttal deadline, but we will provide updates during the reviewer-author discussion period.

Q3: The authors may have written it in this way to avoid comparisons with SOTA detection models.

Thank you for the suggestion to apply our model directly to the detection task. Existing works [1] demonstrate significant challenges in domain adaptation for 3D object detection, particularly in terms of shape normalization and localization accuracy. We focus on domain adaptation because the diffusion model naturally addresses these issues by learning normalized box shapes that are disentangled from size, and by correcting mislocalized boxes. Nevertheless, we tried our method in the in-domain evaluation and only observed a minor improvement (mAP from 77.77 to 79.53 for KITTI in-domain). Such a result is expected: a model trained on an in-domain dataset inherits biases from that dataset, giving it an advantage when evaluated in-domain. Conversely, our denoising model aims to remove these biases, enabling better performance in out-domain evaluations. Without the learned bias, in-domain performance does not necessarily improve.

[1] Wang, Yan, et al. "Train in germany, test in the usa: Making 3d object detectors generalize." CVPR 2020.

[2] Yang, Jihan, et al. "St3d: Self-training for unsupervised domain adaptation on 3d object detection." CVPR 2021.

Comment

Dear Reviewer spFc,

Thank you again for your time and constructive feedback during the reviewing process! We would like to point you to the additional experimental results in the general response section on more recent detectors, CenterPoint [1] and DSVT [2], which further demonstrate the value of our method. As the end of the discussion period is approaching, we would like to know whether our responses have properly addressed your remaining issues. Please feel free to let us know if there are any additional clarifications we can provide!

Best Regards,

Authors

[1] Yin, Tianwei, Xingyi Zhou, and Philipp Krahenbuhl. "Center-based 3d object detection and tracking." Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2021.

[2] Wang, Haiyang, et al. "Dsvt: Dynamic sparse voxel transformer with rotated sets." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023.

Author Response

We thank all reviewers for their time and effort and are encouraged by the positive feedback. We are very excited that all reviewers are generally happy with our work and find our method "novel" and "interesting" (spFc, DshE), and that all reviewers found our work to be an "effective solution" towards addressing the challenge presented by domain adaptation. We are happy that reviewers found our work to be "well-organized" and to provide a "detailed analysis" of the problem and our proposed solution (YzBS, 7LbH). Below, we address the common questions raised. We will respond to individual concerns and questions under each reviewer comment.

Computational Cost Benchmarking

A few reviewers mentioned wishing to see the computational cost of our method. We conducted multiple DiffuBox experiments with varying numbers of denoising steps, ranging from 0 to 14. On average, one denoising iteration (which can be parallelized across bounding boxes) takes 0.09 seconds on an Nvidia A6000 GPU. In general, we see that the majority of the performance is already reached with 8 diffusion steps, and it saturates at around 14 steps. We have included a detailed breakdown in Fig. 1 of the Rebuttal PDF, plotting the number of diffusion steps against performance. We thank the reviewers for proposing this analysis and will include it in the final version. Please see the Rebuttal PDF for details.
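As a back-of-the-envelope reading of these numbers (assuming only the ~0.09 s/iteration figure quoted above):

```python
per_step_s = 0.09  # seconds per denoising iteration on an A6000 (from above)
for steps in (4, 8, 14):
    print(f"{steps:>2} steps -> ~{steps * per_step_s:.2f} s per refinement pass")
# 8 steps, where most of the gain is already reached, costs roughly 0.72 s;
# the saturation point at 14 steps costs roughly 1.26 s.
```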

Experimental Results on the nuScenes Dataset

We have included additional results on a large, real-world dataset, nuScenes, at the request of a few reviewers. We report the performance of direct adaptation vs. direct adaptation + DiffuBox on nuScenes below:

BEV Performance:

| Method | 0-30m | 30-50m | All Range |
|---|---|---|---|
| Direct | 44.78 | 0.70 | 15.86 |
| Direct+DiffuBox | 58.07 | 1.06 | 20.70 |

3D Performance:

| Method | 0-30m | 30-50m | All Range |
|---|---|---|---|
| Direct | 14.82 | 0.00 | 4.66 |
| Direct+DiffuBox | 22.77 | 0.00 | 7.40 |

DiffuBox consistently performs strongly, even on the nuScenes dataset, further suggesting the effectiveness of our method. We thank the reviewers for this suggestion and will include these strong results in the final version of our work.

Comment

To complement the results obtained with CenterPoint, we conducted additional KITTI→Lyft experiments using DSVT as the base object detector. The results, presented in the table below, further validate the generalizability of DiffuBox and its capacity to enhance domain adaptation performance across various object detection frameworks and object classes.

| Category | Method | BEV 0-30m | BEV 30-50m | BEV 50-80m | BEV 0-80m | 3D 0-30m | 3D 30-50m | 3D 50-80m | 3D 0-80m |
|---|---|---|---|---|---|---|---|---|---|
| Car@0.7 | DSVT (Direct) | 68.93 | 47.49 | 11.32 | 41.77 | 33.72 | 11.74 | 1.42 | 15.67 |
| | DSVT (Direct)+DiffuBox | 89.01 | 63.41 | 17.50 | 55.27 | 65.22 | 36.31 | 5.02 | 35.61 |
| | DSVT (OT) | 71.85 | 42.93 | 13.18 | 43.05 | 15.66 | 4.47 | 0.31 | 8.06 |
| | DSVT (OT)+DiffuBox | 90.21 | 63.50 | 17.91 | 55.61 | 56.84 | 31.19 | 4.63 | 31.12 |
| Cyclist@0.5 | DSVT (Direct) | 38.46 | 2.04 | 0.00 | 19.82 | 30.16 | 1.56 | 0.00 | 15.65 |
| | DSVT (Direct)+DiffuBox | 54.41 | 3.19 | 0.00 | 28.01 | 47.57 | 1.94 | 0.00 | 24.06 |
| | DSVT (OT) | 43.43 | 2.18 | 0.00 | 22.30 | 14.65 | 1.23 | 0.00 | 8.01 |
| | DSVT (OT)+DiffuBox | 55.26 | 3.04 | 0.00 | 28.75 | 38.12 | 1.89 | 0.00 | 19.52 |
| Pedestrian@0.5 | DSVT (Direct) | 16.42 | 6.48 | 0.47 | 8.09 | 11.49 | 4.41 | 0.10 | 5.32 |
| | DSVT (Direct)+DiffuBox | 27.89 | 8.13 | 1.05 | 12.42 | 23.55 | 6.49 | 0.53 | 10.28 |
| | DSVT (OT) | 20.89 | 7.47 | 1.06 | 10.65 | 14.83 | 5.46 | 0.33 | 6.84 |
| | DSVT (OT)+DiffuBox | 27.68 | 7.84 | 1.25 | 12.18 | 25.67 | 6.66 | 0.68 | 10.64 |

Comment

Encouraged by the insightful feedback from reviewers spFc, DshE, and 7LbH, we have supplemented Tables 1, 3, and 4 with additional KITTI→Lyft experiments which employ CenterPoint as the base object detector. Our experiment results demonstrate consistent and significant localization improvements achieved by DiffuBox across all object classes.

| Category | Method | BEV 0-30m | BEV 30-50m | BEV 50-80m | BEV 0-80m | 3D 0-30m | 3D 30-50m | 3D 50-80m | 3D 0-80m |
|---|---|---|---|---|---|---|---|---|---|
| Car@0.7 | CenterPoint (Direct) | 74.91 | 36.64 | 2.47 | 36.23 | 28.63 | 4.05 | 0.15 | 10.29 |
| | CenterPoint (Direct)+DiffuBox | 90.25 | 58.65 | 6.95 | 51.38 | 71.02 | 34.91 | 1.48 | 34.40 |
| | CenterPoint (OT) | 82.81 | 51.02 | 6.26 | 46.10 | 25.34 | 12.35 | 0.52 | 13.34 |
| | CenterPoint (OT)+DiffuBox | 91.79 | 59.24 | 7.45 | 51.88 | 63.01 | 32.53 | 1.24 | 30.95 |
| Cyclist@0.5 | CenterPoint (Direct) | 23.34 | 0.84 | 0.01 | 8.19 | 16.53 | 0.34 | 0.00 | 5.42 |
| | CenterPoint (Direct)+DiffuBox | 42.22 | 1.49 | 0.02 | 15.22 | 31.74 | 0.58 | 0.01 | 11.28 |
| | CenterPoint (OT) | 27.40 | 0.95 | 0.01 | 9.73 | 7.32 | 0.05 | 0.01 | 1.98 |
| | CenterPoint (OT)+DiffuBox | 42.22 | 1.54 | 0.02 | 15.23 | 31.35 | 0.67 | 0.01 | 11.20 |
| Pedestrian@0.5 | CenterPoint (Direct) | 0.85 | 1.22 | 0.02 | 0.49 | 0.41 | 0.47 | 0.01 | 0.19 |
| | CenterPoint (Direct)+DiffuBox | 5.58 | 2.06 | 0.03 | 1.78 | 3.99 | 1.75 | 0.02 | 1.20 |
| | CenterPoint (OT) | 1.88 | 2.02 | 0.05 | 0.87 | 0.61 | 1.22 | 0.01 | 0.37 |
| | CenterPoint (OT)+DiffuBox | 5.78 | 2.32 | 0.03 | 1.78 | 4.04 | 1.90 | 0.02 | 1.16 |
Final Decision

This paper introduces DiffuBox, a novel method for refining 3D object detection bounding boxes using a diffusion model, demonstrating improvements in localisation accuracy across multiple datasets and detectors. Reviews were mixed, with five reviewers providing a range of ratings from borderline accept to clear accept. Positive comments highlighted the originality of the approach, its effectiveness in domain adaptation, and the strong performance across different datasets. However, some reviewers raised concerns about the method's novelty, the adequacy of baseline comparisons, and the computational cost. The authors addressed these concerns effectively in their responses, providing additional experiments, comparisons with recent detectors, and analyses of computational trade-offs. Given the positive feedback and the authors' satisfactory responses to critiques, the AC recommends accepting the paper conditional on the inclusion of the promised additional experiments, comparisons, and clarifications. This decision has been approved by the Senior AC. Congratulations!