PaperHub

NeurIPS 2024 · Poster · 4 reviewers
Overall rating: 5.3 / 10 (min 4, max 7, std 1.1) · Individual ratings: 5, 4, 5, 7
Confidence: 3.0 · Correctness: 3.0 · Contribution: 3.0 · Presentation: 2.8

ZOPP: A Framework of Zero-shot Offboard Panoptic Perception for Autonomous Driving

OpenReview · PDF
Submitted: 2024-05-14 · Updated: 2025-01-16

Keywords
Autonomous Driving · Offboard Perception · Zero-shot

Reviews and Discussion

Review (Rating: 5)

The paper introduces a novel framework, termed Multi-modal Zero-shot Offboard Panoptic Perception (ZOPP), specifically designed for autonomous driving applications. This innovative approach integrates zero-shot recognition capabilities with 3D representations generated from point cloud data, enhancing the model's ability to interpret complex driving environments without the need for extensive labeled datasets.

Strengths

  1. The proposed work by the authors is significant and unique.
  2. The proposed ZOPP approach is more capable of handling diverse types of data compared to any existing model, making it more robust.

Weaknesses

  1. The manuscript is severely deficient in addressing the issues arising from labelling constraints and offers no substantive solutions within the proposed model to mitigate these problems.
  2. The authors have egregiously failed to provide any details regarding the computational resources utilized, including the computational cost and the necessary hardware specifications, leaving the readers in the dark about the feasibility and scalability of the proposed approach.
  3. The manuscript is conspicuously lacking in structure, offering an incoherent presentation of the proposed work, which undermines the clarity and comprehensibility of the research.
  4. Despite the purported significance of the proposed work, the evaluation is lamentably limited, providing insufficient empirical evidence to substantiate the claimed advancements and benefits.
  5. The manuscript is grossly inadequate in offering detailed information about the training and testing procedures, thereby failing to provide the essential methodological transparency required for replication and validation of the results.

Questions

  1. The authors mention that data labeling is a crucial factor and a primary hurdle in training within existing work. Could the authors elaborate on how they propose to counter this issue or if they have developed any specific labeling mechanism to address this challenge?
  2. How have the authors handled data from multi-view cameras? Specifically, how do they manage instances of the same object appearing in different views?
  3. (With respect to Fig. 3), how does the proposed convolution filtering operation for removing background 3D points from foreground pixels account for varying disparity values at the upper edges of foreground objects? What specific kernel design is employed to optimize this process?
  4. In the context of disparity occlusion handling around the upper edges of foreground objects (as seen in Fig. 3), what is the mathematical formulation for projecting 3D background points into the pixel regions of foreground objects? Additionally, how does the proposed convolution filtering algorithm differentiate between true foreground and occluded background points at a sub-pixel level?
  5. What is the likelihood of the model's success in real-world applications? Have the authors conducted any real-world testing, and if so, what were the results?

Limitations

  1. The abstract is grossly inadequate for the proposed topic.
  2. The manuscript fails to deliver a coherent and sequential exposition of the proposed work.
  3. The authors have entirely neglected to address the computational cost associated with the work or necessary for deployment.
  4. The manuscript is deficient in presenting a comparative analysis with the visual results of existing works.
Author Response

Response 4.1 Data labeling issue. (Weakness 1 and Question 1)

Previous literature focuses on generating high-quality 3D detection results as auto labels in an offboard perception fashion. However, these methods still need high-quality human labels from the AD dataset as a prerequisite for training the whole pipeline. This is precisely the issue of data auto-labeling we aim to address.

Therefore, we propose a unified framework with a compact structure to support various perception tasks of AD scenes in an offboard manner, without any requirement for human labels from the AD dataset. To be specific, none of the proposed modules or stages of our method relies on human labels to generate the corresponding perception results. Naturally, we can leverage ZOPP as a cold-start paradigm for existing auto-labeling methods.

Response 4.2 Computational resources and hardware specifications (Weakness 2)

We have introduced the corresponding computational resources in Sec. C (Implementation Details) of Appendix, please kindly refer to it.

The entire pipeline of our proposed method is a lightweight and compact framework that does not require extensive computational resources. Only the point completion module and the reconstruction module need network training. We use 1 NVIDIA A100 GPU to train the point completion network, and 4 NVIDIA A100 GPUs to accelerate the reconstruction with multi-processing.

Response 4.3 Writing structure and incoherent presentation (Weakness 3)

We will improve the presentation and polish the structure in the revised version.

Response 4.4 Insufficient evidence (Weakness 4)

Could you please kindly specify which aspect of the content "insufficient evidence" refers to?

Response 4.5 Training and testing procedures (Weakness 5)

We have presented the detailed training and testing procedures, hyperparameters, and settings for each module in Sec. C (Implementation Details) of the Appendix; please kindly refer to it.

Response 4.6 Instance management across multi-view cameras (Question 2)

In Section 3.1 of the main paper, we introduce a multi-view mask tracking module to manage object instances across multiple views. Specifically, we design a simple yet effective similarity cost that combines appearance and location similarities to facilitate object association. The appearance similarity between objects is obtained by computing the cosine distance between their visual features. The location similarity is derived by concatenating the images of all viewpoints in a panoramic order, then normalizing the pixel distance along the horizontal axis for each object pair. Objects with high similarity scores are associated under the same instance ID, and all object instances are then managed by their unique IDs in the subsequent stages of our approach.
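
To make this association cost concrete, below is a minimal, illustrative sketch under our own assumptions (the function name, feature shapes, wrap-around handling, and the weighting factor `w_loc` are hypothetical and not taken from the authors' implementation):

```python
import numpy as np

def association_similarity(feat_a, feat_b, x_a, x_b, pano_width, w_loc=0.5):
    """Appearance + location similarity for multi-view mask association (illustrative).

    feat_a, feat_b : (N, D), (M, D) visual feature vectors of detected masks.
    x_a, x_b       : (N,), (M,) horizontal pixel centers in the panoramically
                     concatenated multi-view image.
    pano_width     : width of the concatenated panorama, used for normalization.
    w_loc          : assumed weight balancing location vs. appearance terms.
    """
    # Appearance term: cosine similarity (1 - cosine distance) of visual features.
    fa = feat_a / np.linalg.norm(feat_a, axis=1, keepdims=True)
    fb = feat_b / np.linalg.norm(feat_b, axis=1, keepdims=True)
    sim_app = fa @ fb.T                                   # (N, M), in [-1, 1]

    # Location term: normalized horizontal pixel distance in the panorama.
    dist = np.abs(x_a[:, None] - x_b[None, :])
    dist = np.minimum(dist, pano_width - dist)            # assumed wrap-around for 360-degree views
    sim_loc = 1.0 - dist / (pano_width / 2.0)             # (N, M), in [0, 1]

    return (1.0 - w_loc) * sim_app + w_loc * sim_loc
```

Pairs with high similarity (e.g., after greedy or Hungarian matching) would then share the same instance ID, which is carried through the later stages.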

Response 4.7 Removing background 3D points at the upper edges of foreground objects (Question 3)

Firstly, since LiDAR sensors are typically mounted higher than the multi-view cameras on autonomous vehicles [1,2,3], the parallax occlusion issue arises around the upper edges of foreground objects, rather than at their center or bottom parts.

Secondly, we designed an empirical distance threshold θ to determine whether the projected background points should be filtered out. Regardless of the disparity differences among these projected background points, as long as they exceed the threshold, we filter them out.

Response 4.8 Specific kernel design of parallax occlusion filtering (Question 3)

The design of the kernel is influenced by the sensor configuration, specifically the number of beams and the rotation frequency of the LiDAR, which respectively determine the vertical and horizontal densities (resolutions) of the point cloud projected onto the 2D image plane. Therefore, if the LiDAR has a large number of beams and a high rotation frequency, we use a smaller kernel size to handle the dense projected points and improve the accuracy of the filtering operation. Conversely, if the density of projected points is lower, we can increase the kernel size and the step size along the vertical and horizontal directions, improving the algorithm's efficiency without compromising filtering accuracy.
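
A minimal sketch of such a window-based parallax occlusion filter is given below, assuming per-point depths, a binary foreground mask, and an empirical threshold θ; the function name, window handling, and default values are our assumptions for illustration, not the authors' actual kernel implementation:

```python
import numpy as np

def filter_occluded_points(u, v, depth, fg_mask, kernel=(5, 5), theta=2.0):
    """Remove projected background LiDAR points that fall on foreground pixels (illustrative).

    u, v    : (N,) integer pixel coordinates of projected LiDAR points.
    depth   : (N,) depth of each projected point (meters).
    fg_mask : (H, W) boolean foreground (object) mask.
    kernel  : local window (rows, cols); smaller for denser LiDAR, larger for sparser.
    theta   : empirical depth-difference threshold; points farther than the local
              minimum depth by more than theta are treated as occluded background.
    """
    keep = np.ones(len(u), dtype=bool)
    h, w = fg_mask.shape
    kh, kw = kernel[0] // 2, kernel[1] // 2

    # Rasterize the nearest depth per pixel as a reference surface.
    depth_map = np.full((h, w), np.inf)
    np.minimum.at(depth_map, (v, u), depth)

    for i in range(len(u)):
        if not fg_mask[v[i], u[i]]:
            continue  # only pixels inside foreground masks are checked
        r0, r1 = max(v[i] - kh, 0), min(v[i] + kh + 1, h)
        c0, c1 = max(u[i] - kw, 0), min(u[i] + kw + 1, w)
        local_min = depth_map[r0:r1, c0:c1].min()
        if depth[i] - local_min > theta:
            keep[i] = False  # farther than the local foreground surface: filter out
    return keep
```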

Response 4.9 Mathematical formulation of projection (Question 4)

The process of projecting all 3D point clouds onto a 2D image plane follows the same projection formula, as discussed in Sec. 3.2.1. There is no separate projection formula for background points; rather, they appear in the region of foreground objects in the 2D image due to the parallax occlusion problem mentioned above. This issue is precisely what our proposed method aims to address.
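
For completeness, a standard LiDAR-to-camera projection of this kind (written in our own notation; the exact formulation in Sec. 3.2.1 may differ) maps a LiDAR-frame point $\mathbf{p} = (x, y, z)^\top$ to pixel coordinates $(u, v)$ via the extrinsic rotation $\mathbf{R}$, translation $\mathbf{t}$, and camera intrinsics $\mathbf{K}$:

$$
\mathbf{p}_c = \mathbf{R}\,\mathbf{p} + \mathbf{t}, \qquad
d \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = \mathbf{K}\,\mathbf{p}_c,
$$

where $d$ is the depth along the camera axis. Both foreground and background points pass through this same mapping; a background point simply lands on a foreground object's pixels when its projected $(u, v)$ falls inside that object's mask despite having a larger $d$.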

Response 4.10 Sub-pixel level results (Question 4)

Currently, we do not support sub-pixel level calculations. All 3D point cloud projections onto the 2D image are rounded to the nearest integer pixel coordinates to obtain pixel-level semantic and instance ID information. Additionally, the semantic and instance mask results generated in the previous stage of our method are also at the pixel level.

Response 4.11 Likelihood of applying in real-world applications (Question 5)

In our experiments, we use WOD as the primary benchmark to assess the effectiveness of our method. WOD is one of the large-scale autonomous driving datasets collected from real-world commercial vehicles. In addition to evaluating the main results across various perception tasks, we also perform auto-labeling applications. As shown in Sec. D.4 of the Appendix, we generate 3D boxes on 5% of the training set as auto labels to train the onboard detection model. This experiment demonstrates that our method can generate comparable auto labels and serve as an efficient cold-start approach for existing perception models.

Comment

In response to the feedback provided, the authors have agreed to revise the manuscript and address several of its current limitations. However, certain critical aspects of the manuscript still require closer examination and further refinement. Notably, significant limitations remain in areas such as the evaluation phase, and more supporting results will be required to strengthen the proposed model's usability.

Comment

We sincerely appreciate your efforts and valuable suggestions during the review and response periods. We are wondering whether you have any more specific questions or suggestions after our response. Your willingness to accept our paper is truly appreciated.

Review (Rating: 4)

Offboard perception creates 3D labels for autonomous driving scenes. Current methods are limited and don't match human recognition levels. The authors developed a new framework called Zero-shot Offboard Panoptic Perception (ZOPP), which combines advanced recognition technologies with 3D point cloud data. ZOPP is a pioneering approach in automatic labeling for driving scenes. They validated its effectiveness on the Waymo dataset and downstream applications with good performance.

Strengths

  1. Using foundation models to generate labels for autonomous driving is a meaningful task; it can help drive a great advance in the realm of autonomous driving.
  2. Experiments are extensive with good results.

Weaknesses

  1. The proposed work mainly combines off-the-shelf modules and uses basic mathematical operations to support the alignment, which does not look very innovative by NeurIPS standards, although the validation results are competitive.
  2. The presentation needs to be improved; sometimes it is hard to tell what the authors are referring to.

Questions

refer to weaknesses

Limitations

refer to weaknesses

Author Response

We are grateful to you for recognizing our efforts in addressing your concerns during the reviewing process.

Response 3.1 Lack of novelty (Weakness 1)

Perception and understanding play a vital role in current data-driven autonomous driving. Previous literature focuses on alleviating the burden of human labor and the cost of labeling. We found several challenges in this field:

  • Only 3D object detection task is supported to generate auto labels in an offboard manner.
  • Still require huge amounts of data with high-quality human-labels.
  • Lack the capabilities of open-set and zero-shot settings.

Therefore, we tackle these challenges by proposing a unified framework for various perception tasks in an offboard manner without requiring human labels in AD scenes. Although our method incorporates several existing foundation models, previous research has not explored these aspects to address the practical needs of auto labeling in AD. To the best of our knowledge, we are the first to propose this kind of framework.

Response 3.2 Confusing presentation (Weakness 2)

We will revise the writing carefully in the next version.

Comment

I have read the rebuttal

Comment

We sincerely appreciate your efforts during the review and responses. We are wondering whether our response has addressed your concerns and do you have any more suggestions or questions. Moreover, if you find our response satisfactory, we kindly invite you to consider the possibility of improving your rating.

Review (Rating: 5)

ZOPP proposes an offboard auto-annotation method to achieve lidar 3D detection as well as the occupancy label without any annotation data. The whole pipeline ensembles several models including the SAM-track and point cloud completion model. By using some post-processing to complete the

Strengths

  1. The idea is straightforward.
  2. The writing is clear.

Weaknesses

Lack of novelty. It ensembles several SOTA methods but lacks in its contribution.

Questions

See weaknesses.

Limitations

See weaknesses.

Author Response

We are thankful to you for raising such important concerns and questions about our work; we highly appreciate your efforts during the review process.

Response 2.1 Performance of [0, 30]m on Waymo dataset (Weakness 1)

We follow the experimental setting in prior zero-shot 3D detection work [1] to ensure consistency with other methods, reporting the performance on [0, 30]m of the Waymo Open Dataset in Table 2 of the main paper. Additionally, results for full distance ranges are presented in Table 1 of the main paper and Tables 6 and 8 of the Appendix. Please kindly refer to them.

Furthermore, we evaluate the performance of several methods, PV-RCNN, Voxel-RCNN, DetZero and SAM3D [1], across different distance ranges and levels of occlusion. The distance ranges include the following segments: [0, 30]m, [30, 50]m, [50, +inf)m. The occlusion levels are categorized into three grades based on the extent to which the object is obscured in the image perspective.

| Method | Training Data | Total L1 | Total L2 | 0-30m L1 | 0-30m L2 | 30-50m L1 | 30-50m L2 | 50m+ L1 | 50m+ L2 |
|---|---|---|---|---|---|---|---|---|---|
| VoxelRCNN | train set | 74.24 | 65.91 | 88.85 | 87.52 | 73.09 | 66.71 | 52.75 | 40.43 |
| PVRCNN | train set | 74.31 | 65.89 | 89.52 | 88.19 | 72.74 | 66.24 | 52.38 | 40.12 |
| DetZero | train set | 89.49 | 83.34 | 96.64 | 95.90 | 88.84 | 84.37 | 78.32 | 66.77 |
| SAM3D | - | 6.90 | 5.88 | 19.51 | 19.05 | 0.029 | 0.026 | 0.0 | 0.0 |
| ZOPP (ours) | - | 37.56 | 35.61 | 42.31 | 41.16 | 35.14 | 33.86 | 29.89 | 28.67 |

From the table, as the distance increases, the performance of all methods decreases. Specifically, VoxelRCNN decreases (L1 AP) by 17.74% and 40.63% on [30, 50]m and [50, +inf)m compared to [0, 30]m, PVRCNN decreases by 18.74% and 41.48%, and our method decreases by 16.94% and 29.35%. The reason is that our method can utilize the entire temporal information in the point cloud sequence through our mask tracking module to generate unique object IDs. Therefore, we can overcome the influence of distance compared to other onboard methods, especially at farther ranges.
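
For clarity, the relative decreases quoted above are computed directly from the L1 AP columns of the table; for example, for VoxelRCNN:

$$
1 - \frac{73.09}{88.85} \approx 17.74\%, \qquad 1 - \frac{52.75}{88.85} \approx 40.63\%,
$$

and analogously for PVRCNN and ZOPP.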

Reference: [1] Dingyuan Zhang et al. SAM3D: Zero-shot 3D object detection via segment anything model. arXiv preprint, 2023.

Response 2.2 Performance of occlusions. (Weakness 1)

We report the overall performance and the performance on occluded objects on the Waymo Open Dataset to compare with other methods. The occlusion levels are defined based on whether the objects are obscured in the image perspective, as provided by WOD.

| Method | Training Data | All | Occlusion |
|---|---|---|---|
| VoxelRCNN | train set | 74.24 | 58.39 |
| PVRCNN | train set | 74.31 | 58.47 |
| SAM3D | - | 6.90 | 4.74 |
| ZOPP (ours) | - | 37.56 | 33.42 |

As we can see, VoxelRCNN and PVRCNN decrease by 21.35% and 21.32%, SAM3D decreases by 31.30%, while our method decreases by only 11.02%. Our method can overcome the influence of occlusion by leveraging temporal context with our mask tracking module.

Response 2.3 Failure pattern analysis (Weakness 1)

We have briefly summarized some representative challenging scenarios in Sec. 6 (Limitations and Broader Impacts) of the main paper. Firstly, our method may fail to effectively distinguish similar object categories (e.g., construction vehicle, truck, trailer) and some uncommon object categories (e.g., tree trunk, lane marker) with the foundation models (Grounding-DINO). Since this is the first stage of our entire method, subsequent stages will then lack the corresponding perception outputs, such as 3D segmentation and occupancy prediction. Secondly, neural rendering methods may encounter numerous challenges in street-view scenes, constrained by practical factors (adverse weather conditions, sensor imaging issues) such as camera overexposure. In these scenarios, where it is impossible to generate geometrically plausible 3D reconstructions, our occupancy decoding will fail.

Please kindly refer to Fig. 2 of our global response PDF file to see the visualization.

Response 2.4 Lack of novelty (Weakness 2)

Please kindly refer to Response 3.1 for Reviewer Swqu.

Comment

We sincerely appreciate your valuable and helpful suggestions. We are wondering whether you have any more suggestions or questions after our response. Specifically, do you have any questions about the performance comparison based on distances and occlusions? Moreover, if you find our response satisfactory, we kindly invite you to consider the possibility of improving your rating.

Comment

Thank you for the detailed explanation! It resolves my concerns on its performance.

Comment

We are grateful to you for recognizing our efforts to address your concerns during the response process. Your feedback has been instrumental in enhancing the quality of our work, especially for the comprehensive comparisons based on the distances and occlusions. We look forward to continuing to meet your expectations in the final version of our paper.

Review (Rating: 7)

This paper introduces ZOPP, a framework for zero-shot panoptic perception of autonomous driving scenes. Leveraging image foundation models, ZOPP is able to perform zero-shot 3D object detection, 3D semantic segmentation, 3D panoptic segmentation, and 3D occupancy prediction, the first zero-shot model of its kind. Experiments on the Waymo dataset achieve strong performance.

Strengths

  1. The paper is well-written and easy to follow.
  2. ZOPP, to the best of my knowledge, is the first zero-shot panoptic perception model for autonomous driving constituting a significant novelty.
  3. Experiments and ablation studies are extensive.
  4. A successful zero-shot panoptic perception framework is highly useful for autolabeling driving scenes, giving this paper a high likelihood for significant impact.

Weaknesses

The most significant weakness is that zero-shot performance still significantly lags behind models trained using human-annotated labels. However, as this is the first work of its kind, this is acceptable.

Questions

Do the authors have any quantitative results for their point completion module? This would be interesting to see.

Limitations

The authors adequately address limitations and broader impacts.

Author Response

We sincerely appreciate your positive acknowledgment of our work. We are pleased to provide the supplemental responses.

Response 1.1 Performance gap between our method and fully supervised methods (Weakness 1)

Yes. Indeed, our zero-shot method still exhibits a notable gap compared to fully supervised training methods when applied to datasets with abundant annotations. However, in scenarios where annotated data is scarce, our approach leverages autonomously generated perceptual outputs and can recognize objects whose classes were never labeled before. For instance, it can output 3D detection boxes for traffic signs and traffic lights, which are visualized in Fig. 1 of our global response PDF file.

Moreover, our research demonstrates the great potential of integrating foundation models into the field of autonomous driving. This integration is poised to advance traditional tasks substantially. Looking forward, as foundation models continue to enhance their performance, we believe they can be seamlessly integrated into our framework. Through continuous performance iteration and optimization, we anticipate further enhancing the effectiveness of our approach.

Response 1.2 Quantitative results of point completion module (Question 1)

We have supplemented quantitative results for the point cloud completion module. On our experimental set, there are around 4106 object samples with complete point clouds (obtained by merging all the point clouds of each object across the entire scene sequence). We then sampled partial clouds from them and used these as input to generate 4096 points as the completion. The L1 Chamfer distance performance is summarized below, showing the effectiveness of our point completion module. As an additional reference, please kindly refer to Table 8 of the Appendix in our submission for the performance differences in 3D box interpretation before and after point cloud completion.
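
For reference, the L1 Chamfer distance reported in the table below typically follows the standard bidirectional formulation (our notation; the exact norm and any scaling factor used in the paper may differ):

$$
\mathrm{CD}(\hat{P}, P) = \frac{1}{|\hat{P}|} \sum_{\hat{x} \in \hat{P}} \min_{y \in P} \lVert \hat{x} - y \rVert
\;+\; \frac{1}{|P|} \sum_{y \in P} \min_{\hat{x} \in \hat{P}} \lVert y - \hat{x} \rVert,
$$

where $\hat{P}$ is the 4096-point completion and $P$ the merged ground-truth point cloud of the object.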

| | Average | Vehicle | Pedestrian | Cyclist |
|---|---|---|---|---|
| Samples | 4106 | 2891 | 1215 | 102 |
| CD | 7.19 | 5.35 | 6.39 | 9.84 |

Comment

I have read the author's rebuttal. Their response adequately addresses the questions I raised, and given the promise I believe this paper holds, I keep my rating as is.

Comment

We deeply value your meticulous review and are pleased that our responses have effectively addressed your questions and concerns. Your willingness to accept our paper is truly appreciated. We extend our profound appreciation for your insightful questions and invaluable suggestions, which undoubtedly contribute to elevating our manuscript's scholarly caliber.

Author Response

We sincerely thank all reviewers for their valuable comments and suggestions. We are sincerely grateful to the reviewers for dedicating their time and effort to review our work. We are delighted to see reviewers commenting on our paper with "significant novelty", "significant impact", "extensive and competitive experimental results", and "straightforward idea".

In this rebuttal, we try our best to address the reviewers' concerns. We summarize the important and common concerns in the following:

Lack of novelty. (Reviewer aSzs, Reviewer Swqu)

Perception and understanding play a vital role in current data-driven autonomous driving (AD). Previous literature [1,2] focuses on alleviating the burden of human labor and the cost of labeling. We found several challenges in this field:

  • Only 3D object detection task is supported to generate auto labels in an offboard manner.
  • Still require huge amounts of data with high-quality human labels.
  • Lack the capabilities of open-set and zero-shot settings to generalize to new scenarios and datasets.

Therefore, we tackle these challenges by proposing a unified framework for various perception tasks in an offboard manner without the requirements of human labels in AD scenes. Although our method incorporates several off-the-shelf foundation models, previous research has not explored these aspects to address the practical needs of auto labeling in AD. To the best of our knowledge, we are the first to establish such kind of work.

Performance of [0, 30]m on Waymo and more comparisons of different distances and occlusions. (Reviewer aSzs)

We follow the experiment setting in prior zero-shot 3D detection work [1] to ensure consistency with other methods, to report the performance of [0, 30]m on Waymo open dataset in Table 2 of the main paper. Additionally, results for full distances are presented in Table 1 of the main paper, Table 6 and Table 8 of Appendix (in our submission). Please kindly refer to them.

Furthermore, we evaluate the performance of several methods, PV-RCNN, Voxel-RCNN, DetZero, and SAM3D, across different distance ranges and occlusions. The distance ranges include the following segments: [0, 30]m, [30, 50]m, [50, +inf)m. The occlusion levels are defined based on whether the objects are obscured in the image perspective, as provided by WOD.

Specifically, for different distances, VoxelRCNN decreases (L1 AP) by 17.74% and 40.63% on [30, 50]m and [50, +inf)m compared to [0, 30]m, PVRCNN decreases by 18.74% and 41.48%, and our method decreases by 16.94% and 29.35%. For the occlusion part, VoxelRCNN and PVRCNN decrease by 21.35% and 21.32%, while our method decreases by 11.02%. The reason is that our method can utilize the entire temporal information in the point cloud sequence through our mask tracking module to generate unique object IDs. Therefore, we can overcome the influence of distance and occlusion compared to other onboard methods, especially at farther ranges and in occluded regions.

Data labeling issues and real-world applications. (Reviewer 96Yj)

Previous literature focuses on generating very high-quality 3D detection results as auto labels in an offboard perception fashion. However, these methods still need high-quality human labels from the AD dataset as a prerequisite for training the whole pipeline. This creates a "chicken-or-egg" problem for auto labeling: when faced with a new dataset, how can it be automatically annotated with these methods? This is precisely the issue of data auto labeling we aim to address.

Naturally, we can leverage ZOPP as a cold-start paradigm for existing auto-labeling methods. For example, the detected 3D bounding boxes and segmentation results can serve as labels to train other onboard perception models. We also provide the corresponding experiment in Sec. D.4 of the Appendix (please kindly refer to it), which demonstrates that our method can generate comparable auto labels and serve as an efficient cold-start approach for existing perception models.

Confusing writing structure and presentation. (Reviewer Swqu, Reviewer 96Yj)

We will improve the presentation and polish the structure in the revised version.

Failure pattern and visualization. (Reviewer aSzs)

We have briefly summarized some representative challenging scenarios in Sec. 6 (Limitations and Broader Impacts) of the main paper; we discuss the failure patterns in detail below. Firstly, our method may fail to effectively distinguish similar object categories (e.g., construction vehicle, truck, trailer) and some uncommon object categories (e.g., tree trunk, lane marker) with the foundation models (Grounding-DINO). Since this is the first stage of our entire method, subsequent stages will then lack the corresponding perception outputs, such as 3D segmentation and occupancy prediction. Looking forward, as foundation models continue to improve, we believe they can be flexibly integrated into our framework. Secondly, neural rendering methods may encounter numerous challenges in street-view scenes, constrained by practical factors (adverse weather conditions, sensor imaging issues) such as camera overexposure. In these scenarios, where it is impossible to generate geometrically plausible 3D reconstructions, our occupancy decoding will fail.

We also show the visualization of such failure cases in our global response PDF file. Please kindly refer to them.

Please refer to the following rebuttals for other specific concerns and more details. We are looking forward to your further reply and discussion.

Final Decision

This paper received divergent reviews. The setting and capabilities of the proposed model are novel, though the components are not new. That said, given the new problem setting and the good zero-shot results, this AC believes the paper is worth accepting.