Point-PRC: A Prompt Learning Based Regulation Framework for Generalizable Point Cloud Analysis
Reviews and Discussion
In this work, the authors propose a regularization method for the prompt learning of generalizable point cloud analysis, which can strengthen the performances of learned representations on downstream 3D task while keeping its generalizability. The regularization consists of three components: mutual agreement constraint, text diversity constraint, and model ensemble constraint, which is a plug-and-play method for existing 3D large multi-modal models. Moreover, this work also includes new benchmarks for the evaluation of 3D point cloud domain generalization. Results on the proposed benchmark confirm the effectiveness of the proposed regularization method.
Strengths
- The whole framework is simple but effective;
- The writing is good and easy to follow;
- The construction of new benchmarks may be beneficial to the community.
Weaknesses
My major concern about this work is its novelty, as prompt tuning has been well studied in other areas, e.g., text-to-image generation. The proposed regularization constraint in Eq. 2 is somewhat similar to the preservation loss term proposed in [1], while the other terms improve robustness by a straightforward averaging operation. I am not sure the novelty is sufficient.
[1] DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation
Questions
- The role of the text features in the Text Diversity Constraint is not so clear. Would they be used for the regularization in Eq. 2, or just for inference?
- Some details are not intuitively presented, such as the basic framework of existing multi-modal prompt learning. It would be better to add some diagrams to present the existing works and your improvements;
- How sensitive is the Model Ensemble Constraint to its hyperparameters? Besides, I am also not so sure about the necessity of such ensembling over all epochs. Why can we not just select the best checkpoint through a validation set?
- What are the advantages of doing prompt learning for domain generalization over fine-tuning methods such as Low-Rank Adaptation?
Limitations
Yes.
Thanks for your insightful comments. Below we address your concerns one by one. Further questions are welcome and we are happy to respond.
Q1: Major concern on novelty
We understand your concern regarding the novelty, and we answer this question in the global response section; kindly refer to that part. We think the preservation loss in DreamBooth is not very relevant to Eq. 2 in our work. Moreover, text-to-image generation is a different topic from point cloud analysis.
Q2: The relationship between the text features and the Text Diversity Constraint is not clear
Yes, the text features would be used in Eq. (2). Since LLMs generate multiple text descriptions for each point cloud category, we integrate these text features into Eq. (2).
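For concreteness, below is a minimal sketch of how such an integration could look. The names `text_encoder` and `llm_descriptions`, and the simple averaging strategy, are our illustrative assumptions rather than the exact implementation in the paper:

```python
import torch

def build_class_text_features(llm_descriptions, text_encoder):
    """Aggregate multiple LLM-generated descriptions per class into a single
    text feature, e.g. for use inside the regularization of Eq. (2).

    llm_descriptions: dict mapping class name -> list of description strings
    text_encoder:     callable mapping a list of strings -> (n, d) tensor
    """
    class_features = {}
    for cls, descriptions in llm_descriptions.items():
        feats = text_encoder(descriptions)                 # (n, d), one row per description
        feats = feats / feats.norm(dim=-1, keepdim=True)   # unit-normalize each description
        mean_feat = feats.mean(dim=0)                      # integrate the diverse views
        class_features[cls] = mean_feat / mean_feat.norm()
    return class_features
```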
Q3: The basic framework of multi-modal prompt learning and inclusion of new diagrams
We have added diagrams to explain the multi-modal framework and the proposed method in our work; see Figures 1 and 2 in the uploaded PDF file.
Figure 1 highlights our research motivation, distinguishes our work from existing methods, and demonstrates superior 3DDG ability on unseen new classes together with better performance on base classes.
Figure 2 illustrates the overall pipeline of the proposed approach and showcases how we incorporate the regulation constraints in the pipeline.
Q4: The sensitivity of the Model Ensemble Constraint to its hyperparameters
We add a sensitivity analysis of the model ensemble constraint to its hyperparameters, the mean μ and variance σ², and report the results in sub-tables (a) and (b) below. As observed, increasing μ gives more weight to the models from later epochs and improves the base class accuracy while compromising generalization on unseen new classes. In general, the changes are not very sharp; similar observations hold for the variance σ².
Table r3-1. The sensitivity of the Model Ensemble Constraint to its hyperparameters. Here the mean μ and variance σ² parameterize a Gaussian distribution. ULIP-2 is deployed as the 3D foundation model and the experiments are conducted on the base-to-new benchmark.
In sub-table (a), the variable is μ, with σ² = 1 fixed.
| Metric | μ = 7 | μ = 9 | μ = 11 | μ = 13 | μ = 15 |
|---|---|---|---|---|---|
| Base | 72.71 | 72.84 | 73.17 | 73.25 | 73.67 |
| New | 74.63 | 74.41 | 74.45 | 74.40 | 74.27 |
| HM | 73.66 | 73.62 | 73.80 | 73.82 | 73.97 |
In sub-table (b), the variable is σ², with μ = 15 fixed.
| Metric | σ² = 25 | σ² = 16 | σ² = 9 | σ² = 4 | σ² = 1 |
|---|---|---|---|---|---|
| Base | 72.51 | 72.77 | 72.83 | 73.29 | 73.67 |
| New | 74.95 | 74.89 | 74.66 | 74.41 | 74.27 |
| HM | 73.71 | 73.81 | 73.73 | 73.85 | 73.97 |
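To make the roles of μ and σ² concrete, here is a minimal sketch of Gaussian-weighted checkpoint ensembling. The function and variable names are ours, and we assume the per-epoch prompt parameters are combined with normalized Gaussian weights:

```python
import math

def gaussian_ensemble(prompt_checkpoints, mu=15.0, sigma2=1.0):
    """Combine prompt parameters saved at each epoch with Gaussian weights.

    prompt_checkpoints: list of per-epoch parameter tensors (index = epoch - 1)
    mu:     Gaussian mean; a larger mu shifts weight toward later epochs
    sigma2: Gaussian variance; a larger sigma2 spreads weight more evenly
    """
    weights = [math.exp(-(epoch - mu) ** 2 / (2.0 * sigma2))
               for epoch in range(1, len(prompt_checkpoints) + 1)]
    total = sum(weights)
    weights = [w / total for w in weights]                  # normalize to sum to 1
    return sum(w * ckpt for w, ckpt in zip(weights, prompt_checkpoints))
```

Under this weighting, a mean of μ = 15 over 20 training epochs places most of the mass on late checkpoints, while a larger σ² mixes in earlier checkpoints more evenly.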
Selecting the best checkpoint through a validation set is a common practice. In theory, this greedy strategy favors the highest performance on the downstream tasks, which means the small number of learnable prompts/parameters is well adapted to those tasks. It is equivalent to our framework without the Model Ensemble Constraint (MEC).
However, purely optimizing the small number of learnable prompts toward target tasks will inevitably hinder the generalization ability of the large 3D models, as we analyzed in the paper.
We also provide an ablation study on this question. As the following table indicates, the method without MEC has slightly lower accuracy on new classes (75.59% vs. 76.10%) and a lower harmonic mean. However, when the MAC and TDC factors are removed, the role of MEC becomes prominent: it raises the overall performance remarkably, especially on unseen new classes (+5.28 absolute points).
Table r3-2. Ablation study of the framework with and without the model ensemble constraint. The results are averaged over 5 datasets. MAC: mutual agreement constraint; TDC: text diversity constraint; MEC: model ensemble constraint; HM: harmonic mean of the Base and New class accuracies.
| MAC | TDC | MEC | Base | New | HM |
|---|---|---|---|---|---|
| x | x | x | 77.91 | 67.91 | 72.57 |
| x | x | √ | 82.42 | 73.19 | 77.53 |
| √ | √ | x | 83.30 | 75.59 | 79.26 |
| √ | √ | √ | 83.18 | 76.10 | 79.48 |
Q5: The advantages of prompt learning over low-rank adaptation
Prompt tuning and low-rank adaptation (LoRA) are orthogonal techniques for parameter-efficient fine-tuning. We prefer prompt tuning over LoRA since it does not change the architecture or parameters of the 3D foundation models. In contrast, LoRA changes the architecture of the foundation models by introducing low-rank matrices, which might not be desirable in practice since foundation models are usually precious and hard to obtain.
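To illustrate this point, here is a schematic sketch (class names and shapes are hypothetical, not the actual implementation): prompt tuning only concatenates a few learnable tokens to the frozen encoder's input sequence, whereas LoRA replaces linear layers inside the model with modified ones:

```python
import torch
import torch.nn as nn

# Prompt tuning: the frozen encoder is untouched; only `prompts` is learnable.
class PromptedEncoder(nn.Module):
    def __init__(self, frozen_encoder, num_prompts=8, dim=512):
        super().__init__()
        self.encoder = frozen_encoder                      # weights stay frozen
        self.prompts = nn.Parameter(torch.randn(num_prompts, dim) * 0.02)

    def forward(self, tokens):                             # tokens: (batch, seq, dim)
        p = self.prompts.unsqueeze(0).expand(tokens.size(0), -1, -1)
        return self.encoder(torch.cat([p, tokens], dim=1)) # prepend prompt tokens

# LoRA: a linear layer inside the model is replaced, i.e. the architecture changes.
class LoRALinear(nn.Module):
    def __init__(self, frozen_linear, rank=4):
        super().__init__()
        self.base = frozen_linear                          # original weights, frozen
        self.A = nn.Parameter(torch.randn(frozen_linear.in_features, rank) * 0.02)
        self.B = nn.Parameter(torch.zeros(rank, frozen_linear.out_features))

    def forward(self, x):
        return self.base(x) + x @ self.A @ self.B          # add low-rank update
```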
Thanks for the careful rebuttal of the authors. I apologize for the misunderstanding about the relations between Eq.2 and Dreambooth preservation loss after I check the paper. The response has well resolved my concerns. So I decide to raise my rating to weak accept.
Thank you very much for your careful consideration and positive feedback on our rebuttal. We greatly appreciate your willingness to re-evaluate our work and your understanding regarding the clarification of the relationship between Eq. 2 and the Dreambooth preservation loss.
We are also pleased to hear that your concerns have been resolved. Thank you again for your time and effort in reviewing our submission.
Kind regards,
Authors of Submission 1657
This paper investigates the 3D domain generalization (3DDG) ability of large 3D models using prompt learning. The authors utilize parameter-efficient prompt tuning to boost the performance of 3D point cloud recognition models. The paper observes that while prompt tuning improves downstream task performance, it often reduces the generalization ability of the models. Thus, they introduce a comprehensive framework to maintain good generalization by allowing learnable prompts to interact actively with the pre-trained general knowledge in large 3D models. This framework imposes three explicit regulation constraints on the prompt learning trajectory, maximizing mutual agreement between task-specific predictions and task-agnostic knowledge. They also develop three new benchmarks to evaluate 3D domain generalization: base-to-new class generalization, cross-dataset generalization, and few-shot generalization.
Strengths
- The newly created benchmarks provide a more holistic evaluation of 3D domain generalization, addressing real-world challenges such as transferring to unseen classes and handling corrupted data.
- This paper achieves consistent improvements in generalization ability across various large 3D models and benchmarks, demonstrating its effectiveness.
- The use of lightweight prompt tuning makes the framework computationally efficient, reducing the need for extensive retraining of large models.
Weaknesses
- According to the paper's introduction, there is already substantial work on domain adaptation and domain generalization for 3D point clouds, covering both object-level data and real scanned radar data. Many advanced methods also go beyond PointNet and the ModelNet dataset. Consequently, the authors need to provide a more rigorous and detailed motivation for their study.
- The right part of eq.(1) needs a more detailed explanation to describe its components clearly.
- The method proposed in this paper seems overly simplistic and lacks novelty. Beyond the three general constraints mentioned, are there any specific designs for integrating LLMs with 3D point cloud multimodal learning?
- This paper does not introduce any additional designs for domain adaptation. Although it proposes a new benchmark, it essentially applies transfer learning.
Questions
See the weaknesses section.
Limitations
This paper lacks an analysis of its limitations. The effectiveness of the text diversity constraint relies on the quality and relevance of the text descriptions, which may vary depending on the source (LLMs or manual templates).
Thanks for your valuable feedback. Below we address your concerns one by one. Follow-up questions are welcome if something remains unclear.
Q1: More rigorous motivation of this study is needed
Thanks for your comments. We need to clarify the following points.
As we stated in the introduction (Line 36), there are only a few methods discussing domain adaptation (PointDAN [45]) and domain generalization (MetaSets [19], PDG [58]) for 3D point clouds, rather than substantial works.
To make this clearer, we added Figure 1 in the uploaded one-page PDF to explain our motivation. In summary, our motivation is to enhance the performance of large 3D models on downstream 3D recognition tasks while simultaneously maintaining good 3DDG ability. Previous works focus on downstream tasks but fail to consider generalization on unseen data and lack relevant evaluations. Their limitations include:
- Generalization among a fixed set of categories: our approach based on large 3D models can conduct open-set recognition and generalize to any unseen class.
- Limited scale and scope: the large 3D models we investigate usually have hundreds of millions of parameters (e.g., ULIP-2 with 100M+ params), and the 3D encoders in previous works cannot match this scale. Our method is evaluated on up to 215 classes while previous methods are only tested on up to 11 classes (Sim-to-Real).
- Compromised generalization ability: previous methods leveraging large 3D models focus on downstream performance via lightweight adaptation (e.g., PPT [52], IDPT [68], DAPT [82], Point-PEFT [53]), but this strategy compromises generalization ability.
In short, our work offers a pioneering investigation of the 3DDG ability of large 3D models and presents a simple yet effective solution. We believe the new benchmarks will benefit researchers and drive future advancements in the field.
Q2: Further explanation of the right part of Eq. (1)
The general idea of the right part of Eq. (1) is that we try to find the optimal point and text prompt parameters $\theta^{*}$ that minimize the expected cross-entropy loss over the ground-truth data distribution $p(x, y)$. Let's dive into the components of this equation.
- $\theta^{*}$: these are the optimal point and text prompt parameters that we are trying to find.
- $\arg\min_{\theta}$: this notation indicates that we are looking for the arguments (in this case, the parameters $\theta$) that minimize the following expression.
- $\mathbb{E}_{(x, y) \sim p(x, y)}$: this denotes the expected value over the ground-truth data distribution $p(x, y)$, where $x$ represents the input point cloud and $y$ represents its real class label.
- $\mathcal{L}_{CE}(\hat{y}, y)$: this is the cross-entropy loss function, where $\hat{y}$ represents the predicted class distribution and $y$ is the true class label.
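In code, this objective corresponds to a standard prompt-tuning loop. The sketch below uses hypothetical names (`model(points, prompts)` is assumed to return class logits), and only the prompt parameters receive gradients:

```python
import torch
import torch.nn.functional as F

def tune_prompts(model, loader, prompts, epochs=20, lr=1e-3):
    """Minimize the empirical version of Eq. (1) over learnable prompts.

    prompts: list of tensors with requires_grad=True (point and text prompts);
    the large 3D model itself stays frozen throughout.
    """
    optimizer = torch.optim.SGD(prompts, lr=lr)
    for _ in range(epochs):
        for points, labels in loader:                  # samples from p(x, y)
            logits = model(points, prompts)            # predicted class distribution
            loss = F.cross_entropy(logits, labels)     # L_CE(y_hat, y)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return prompts
```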
Q3: The method seems simple and lacks novelty
We answer this question in the global response section; kindly refer to that part.
In addition, when it comes to leveraging LLMs for prompt learning, we mainly regard them as powerful tools to produce diverse text descriptions for the point clouds. The specific designs are simple but effective: we customize three types of instructions to the LLMs, and the details of the design are visualized in Figure 3 of the uploaded one-page PDF:
- Question Answering
- Caption Generation
- Making Sentences
After that, we encode these responses with the text encoder of the large 3D multi-modal models to obtain the representations of different 3D categories.
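For illustration, the three instruction types could be issued to an LLM along these lines; the template strings below are hypothetical paraphrases, not the exact instructions used in the paper:

```python
# Hypothetical instruction templates for the three types of LLM queries.
INSTRUCTIONS = {
    "question_answering": "What does a point cloud of a {category} look like?",
    "caption_generation": "Write a short caption describing a 3D object of a {category}.",
    "making_sentences":   "Make a sentence that mentions the 3D shape of a {category}.",
}

def generate_descriptions(llm, category, per_instruction=3):
    """Query an LLM (a callable: prompt string -> response string) with each
    instruction type to collect diverse descriptions of one category."""
    descriptions = []
    for template in INSTRUCTIONS.values():
        prompt = template.format(category=category)
        descriptions += [llm(prompt) for _ in range(per_instruction)]
    return descriptions
```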
Q4: Lack of specific designs for domain adaptation
Thanks for your comments. Our work mainly investigates the 3D domain generalization (3DDG) ability of large 3D models, instead of domain adaptation (Line 1 of the main paper).
There are two key differences between domain adaptation (DA) and domain generalization (DG). First, DA methods can access target domain data during training while DG methods cannot. Second, DA methods aim to minimize performance drop when transferring to a specific known target domain while DG methods try to ensure generalization and robustness to unseen domains.
The proposed regulation constraint framework is designed to boost task-specific performance while simultaneously maintaining generalization ability. On the base-to-new benchmark, the models conduct lightweight prompt learning and are directly tested on unseen new classes. Similarly, on the cross-dataset benchmark, the models learned from the source domain are directly tested on target domains.
I appreciate the author’s response and apologize for the delayed reply. However, I still have some concerns of this paper:
- I recognize that this paper is the first to introduce LLM to 3D DG. However, considering that in terms of prompt learning and DA/DG technical contributions, I believe it still lacks novelty.
- There are indeed many 3D DA/DG methods, especially for LiDAR data in autonomous driving. While they don’t focus on object-level point clouds, their ideas are still valuable. So, please consider improving the literature review.
- I'm very clear about the difference between DA and DG. My main concern is that this paper lacks a specific design to address domain gaps, whether at the category or dataset level. As the authors claim, ‘the models learned from the source domain are directly tested on target domains’, which seems more like transfer learning.
Sorry again for my late reply. I have also carefully read the comments from other reviewers and recognize the contribution of this paper to introducing LLM to 3D DG. I am willing to raise my rating if the authors can address my remaining concerns or if other reviewers lean toward acceptance.
Dear Reviewer UDVe,
We hope you’ve had a pleasant weekend. We wanted to thank you for the detailed feedback you provided on our submission. Your insights have been very valuable, and we have carefully considered your comments in our rebuttal.
If there are any additional concerns or points you would like to discuss further, we would be more than happy to clarify or provide further information. Your guidance is greatly appreciated, and we would welcome the opportunity to address any unresolved issues.
Thank you again for your time and consideration.
Best regards,
Authors of Submission 1657
Below we attach the related works on 3D DA/DG for LiDAR data in autonomous driving. We will summarize them and discuss their relations to our work in the revision. In general, our work focuses on object-level recognition whereas these methods deal with scene-level 3D tasks, such as semantic segmentation and object detection.
- 3DDA methods
  - Saleh et al. Domain Adaptation for Vehicle Detection from Bird's Eye View LiDAR Point Cloud Data. ICCV 2019
  - Xu et al. SPG: Unsupervised Domain Adaptation for 3D Object Detection via Semantic Point Generation. ICCV 2021
  - Yi et al. Complete & Label: A Domain Adaptation Approach to Semantic Segmentation of LiDAR Point Clouds. CVPR 2021
  - Zhao et al. ePointDA: An End-to-End Simulation-to-Real Domain Adaptation Framework for LiDAR Point Cloud Segmentation. AAAI 2021
  - Achituve et al. Self-Supervised Learning for Domain Adaptation on Point Clouds. WACV 2021
  - Jiang et al. LiDARNet: A Boundary-Aware Domain Adaptation Model for Point Cloud Semantic Segmentation. ICRA 2021
  - Shen et al. Domain Adaptation on Point Clouds via Geometry-Aware Implicits. CVPR 2022
  - Yang et al. No-Reference Point Cloud Quality Assessment via Domain Adaptation. CVPR 2022
  - Liang et al. Point Cloud Domain Adaptation via Masked Local 3D Structure Prediction. ECCV 2022
  - Wang et al. SSDA3D: Semi-supervised Domain Adaptation for 3D Object Detection from Point Cloud. AAAI 2023
  - Saltori et al. Compositional Semantic Mix for Domain Adaptation in Point Cloud Segmentation. TPAMI 2023
  - Katageri et al. Synergizing Contrastive Learning and Optimal Transport for 3D Point Cloud Domain Adaptation. WACV 2024
- 3DDG methods
  - Wu et al. SqueezeSegV2: Improved Model Structure and Unsupervised Domain Adaptation for Road-Object Segmentation from a LiDAR Point Cloud. ICRA 2019
  - Robey et al. Model-Based Domain Generalization. NeurIPS 2021
  - Lehner et al. 3D-VField: Adversarial Augmentation of Point Clouds for Domain Generalization in 3D Object Detection. CVPR 2022
  - Sanchez et al. Domain Generalization of 3D Semantic Segmentation in Autonomous Driving. ICCV 2023
  - Kim et al. Single Domain Generalization for LiDAR Semantic Segmentation. CVPR 2023
  - Qu et al. Modality-Agnostic Debiasing for Single Domain Generalization. CVPR 2023
  - Xiao et al. 3D Semantic Segmentation in the Wild: Learning Generalized Models for Adverse-Condition Point Clouds. CVPR 2023
  - Li et al. BEV-DG: Cross-Modal Learning under Bird's-Eye View for Domain Generalization of 3D Semantic Segmentation. ICCV 2023
  - Wang et al. Towards Domain Generalization for Multi-View 3D Object Detection in Bird-Eye-View. CVPR 2023
  - Guo et al. An Accurate Outlier Rejection Network With Higher Generalization Ability for Point Cloud Registration. RAL 2023
  - He et al. Domain Generalization-Aware Uncertainty Introspective Learning for 3D Point Clouds Segmentation. MM 2024
  - Eskandar. An Empirical Study of the Generalization Ability of Lidar 3D Object Detectors to Unseen Domains. CVPR 2024
  - Sanchez et al. ParisLuco3D: A High-Quality Target Dataset for Domain Generalization of LiDAR Perception. RAL 2024
  - Jiang et al. DG-PIC: Domain Generalized Point-In-Context Learning for Point Cloud Understanding. ECCV 2024
Thank you for your detailed responses. Most of my concerns are addressed and I will raise my rating to Borderline Accept. Please carefully revise the paper's motivation/contribution to explicitly state how domain shifts are addressed, and improve the literature review section as well.
Dear Reviewer UDVe,
Thank you very much for your thoughtful reconsideration of our submission and for raising the rating. We are especially grateful for your detailed suggestions regarding the motivation and contribution sections, as well as your advice on improving the literature review.
We will carefully revise the paper to explicitly state how domain shifts are addressed and to enhance the clarity of our contributions. Your guidance is invaluable, and we appreciate the time and effort you’ve dedicated to helping us improve our work.
Thank you again for your support.
Best regards,
Authors of Submission 1657
Dear Reviewer UDVe,
Thank you for taking the time to respond to our rebuttal, especially given this busy period. We appreciate your feedback and the opportunity to further clarify the points you find unclear. Below we will answer these questions point by point.
Q1: Concern on technical contributions
Thanks for your comments. We understand your concern regarding the novelty. In terms of technical contributions, we have the following points to explain.
- Previous DA/DG methods based on relatively small models for 3D point cloud recognition design a common feature space between the source and target domains or adapt the meta-learning framework to handle domain shifts. They provide very valuable insights and solutions that advance the development of 3D DA/DG.
- Recent works, especially large 3D models (e.g., ULIP, ULIP-2, Uni3D), demonstrate much better zero-shot recognition performance across a wide range of target tasks. The results imply that domain gaps can also be narrowed more effectively by pre-training large 3D models on large-scale datasets (e.g., ULIP-2 is pre-trained on million-scale pointcloud-text-image triplets). You can treat these as different technical routes/stages for solving the 3D DA/DG problem.
- This work is built on the latter route, and the idea of improving the 3DDG ability is to exploit the power of large models. To this end,
  - (1) we design an active interaction strategy to align with the pre-trained knowledge in large 3D models,
  - (2) we deploy LLMs as powerful interfaces to produce high-quality descriptions for various point clouds,
  - (3) we synthesize the opinions from different learning stages with Gaussian-weighted voting.
- These three components are then effectively incorporated into a unified regulation framework to handle the category/dataset shifts between seen and unseen domains.
- Finally, we verify the effectiveness of the distinct components through ablation experiments and validate the proposed framework on multiple large 3D models and multiple benchmarks to reflect the boosted generalization ability and robustness.
- We also want to note that our framework shows a promising model-agnostic attribute, which implies that the 3DDG gains will grow with the increasing abilities of large 3D models and LLMs.
- For the prompt learning part, our design is distinguished from previous works in that we conduct multi-modal prompt learning (on both the text and 3D branches) while previous works conduct prompt tuning on a single modality. For instance, PPT [52] only tunes text prompts, while IDPT [68], DAPT [82], and Point-PEFT [53] tune prompts only in 3D. At the beginning of this project, we compared these two solutions and found that our strategy achieves better generalization than the single-modal alternative on our benchmarks.
Q2: Improving literature review
Thanks for your suggestions. We appreciate the reviewer clarifying that there are many 3D DA/DG methods for LiDAR data in autonomous driving.
After receiving the feedback, we made every effort to find related papers (attached in another reply due to the character limit). Many of them focus on semantic segmentation, object detection, and registration, and we will read them carefully. There is no doubt that these 3D DA/DG methods are valuable to the field, and we will reflect them in our revised related work section by:
- (1) summarizing these works according to the methodologies they propose;
- (2) explaining their differences from and relations to our work.
Q3: Lacks a specific design to address domain gaps
Thanks for your comments. This question is similar to Q1 (concern on technical contributions) and we answer it in the corresponding section; kindly refer to that part. Regarding the question about transfer learning, we offer the following points.
First, we agree that transfer learning and DG methods share many similarities. They are both techniques used to improve model performance on tasks where data is limited or unavailable.
Second, they approach this problem from different angles. Let's illustrate the difference through specific examples.
- In transfer learning, a model is typically pre-trained on a large dataset (source domain) and then fine-tuned on a smaller, task-specific dataset (target domain). For example, a model pre-trained on the ImageNet dataset (which contains millions of images across 1,000 classes) may be fine-tuned on a smaller dataset of medical images to classify different types of skin diseases. In this way, the useful features learned from ImageNet can be transferred to the target domain with relatively little additional training.
- In domain generalization, the model is trained only on the source domain and directly tested on unseen target domains, without fine-tuning on the target domain data. In our base-to-new class and cross-dataset settings, we do not fine-tune the prompts on new classes or target datasets.
This paper investigates the 3D domain generalization (3DDG) capability of large 3D models based on prompt learning. The authors propose a comprehensive regulation framework that employs lightweight prompt learning to improve both task-specific performance and domain generalization ability. The framework consists of three main components: mutual agreement constraint, text diversity constraint, and model ensemble constraint. Additionally, the authors introduce three new 3DDG evaluation benchmarks: base-to-new, cross-dataset, and few-shot generalization benchmarks. Experimental results demonstrate that the proposed method significantly enhances model generalization while improving specific task performance.
Strengths
1 The paper demonstrates significant originality by being the first to address the 3DDG problem for large multi-modal 3D models and proposing a novel regulation framework with innovative constraint mechanisms.
2 The quality of the research is evident in its comprehensive experimental design and the significant improvements shown across multiple benchmarks and models.
3 The work's significance lies in addressing the critical issue of domain generalization in 3D point cloud analysis, with potentially broad impact on related fields. Furthermore, the introduction of new benchmarks provides valuable tools for future research in 3DDG.
Weaknesses
1 There's no detailed comparison of training time between the proposed method and baseline approaches or full fine-tuning of large 3D models.
2 Limited validation across diverse point cloud tasks and real-world scenarios.
Questions
1 What specific advantages do the newly proposed benchmarks have compared to previous DG methods?
2 How do you view the scalability of this method on larger datasets or more complex 3D tasks?
Limitations
With a more detailed discussion of limitations, the authors could provide a more balanced view of their work, demonstrating scientific rigor and offering valuable insights for researchers looking to build upon or apply their method. This would significantly strengthen the paper and its contribution to the field.
Thanks for your valuable feedback. We address your concerns point by point. Feel free to ask follow-up questions if something remains unclear.
Q1: Training time comparison between baselines and our method.
As requested, we have added a comparison of the training time between our method and the baselines. The results are shown in the following table. The proposed method consumes a similar amount of time per epoch as the baseline, with a slight increase due to the inclusion of our framework.
Table r1-1. Running time comparison of the strong baseline ULIP-2 and the proposed approach. We conduct prompt learning based on ULIP-2 for 20 epochs on the base-to-new class benchmark, and the experiments are run three times with different seeds. The settings are consistent with those in the main paper. Time is counted in seconds for all 20 epochs using an RTX 4090.
| Method | Seed | MN40 | S-PB_T50_RS | S-OBJ_BG | S-OBJ_ONLY | SNV2 | Avg. |
|---|---|---|---|---|---|---|---|
| ULIP-2 | 1 | 132 | 106 | 48 | 53 | 307 | 129.2 |
| ULIP-2 | 2 | 132 | 106 | 48 | 53 | 305 | 128.8 |
| ULIP-2 | 3 | 133 | 108 | 48 | 51 | 305 | 129.6 |
| +RC (Ours) | 1 | 159 | 112 | 60 | 60 | 344 | 147.0 |
| +RC (Ours) | 2 | 159 | 114 | 60 | 59 | 345 | 147.4 |
| +RC (Ours) | 3 | 159 | 113 | 59 | 60 | 345 | 147.2 |
The number of learnable parameters in our framework is 16,896, while fully fine-tuning ULIP-2 involves 82.3M learnable parameters (in the text and 3D encoders alone). According to the reported details of ULIP-2, pre-training on Objaverse [9] utilizes 8 A100 GPUs and takes 1.5 days, so fully fine-tuning ULIP-2 is also expensive.
[9] Deitke et al. Objaverse: A Universe of Annotated 3D Objects. CVPR 2023
Q2: Limited validation across diverse point cloud tasks and real-world scenarios.
We would like to point out that we have already considered multiple real-world scenarios. In particular, there are four datasets in our new benchmarks collected from real-world scenarios, including the three variants of ScanObjectNN [55] and Omni3D [60]. ScanObjectNN is widely used in the community and Omni3D is a recently released dataset that contains a large vocabulary of 3D objects. Both of them pose great challenges to existing point cloud recognition methods, according to the results in Table 1 and Table 2 in the main paper.
Our work mainly focuses on the recognition task, since the large 3D models (e.g., ULIP, ULIP-2) are pre-trained with a contrastive objective (similar to CLIP) and excel at globally aligning 3D objects with text descriptions containing class names.
Other 3D tasks like object detection and segmentation deal with scene-level point cloud data. They follow different paradigms and methodologies compared to object-level recognition. We leave exploring the 3DDG ability of our framework on these tasks to future work.
[55] Revisiting Point Cloud Classification: A New Benchmark Dataset and Classification Model on Real-World Data. ICCV 2019
[60] Wu et al. OmniObject3D: Large-Vocabulary 3D Object Dataset for Realistic Perception, Reconstruction and Generation. CVPR 2023
Q3: The advantages of the new benchmarks compared to existing ones.
First, our benchmarks provide new evaluation dimensions for 3DDG methods. These dimensions are key indicators of domain generalization ability but are absent from existing benchmarks. Specifically,
- The domain generalization evaluation in the existing PointDA and Sim-to-Real benchmarks focuses on the categories shared between the source and target domains, without considering unseen new classes. We think this is a critical limitation, especially when evaluating the generalization ability of large 3D models that can conduct open-set recognition. In contrast, our base-to-new and cross-dataset benchmarks provide evaluations on both seen and unseen data.
- Previous benchmarks fail to evaluate generalization ability when the target domain contains data corruptions, which are common in 3D point cloud analysis. In contrast, our cross-dataset benchmark introduces this kind of evaluation to measure a model's robustness against common data corruptions.
- The few-shot benchmark inspects model generalization under extremely low-data regimes (e.g., 1-shot learning), a kind of evaluation not covered by previous benchmarks.
Second, our benchmarks are more diverse and challenging. There are only 10 classes in PointDA and 11 classes in Sim-to-Real. Our newly created benchmarks contain 7 different datasets and up to 216 point cloud categories, which will drive future research.
Q4: The scalability of this method on larger datasets
During the rebuttal, we further tested our framework on a larger dataset named Objaverse-Lvis, and the results are promising. This dataset is a subset of the recently released Objaverse and serves only as a test set (target domain). Objaverse-Lvis contains 46,205 point clouds across 1,156 classes, and some classes have only a single object, posing great challenges to existing point cloud recognition methods. In the experiments, we select the representative ULIP and ULIP-2 as baselines and compare them with the same models equipped with our regulation framework.
The results in the following table verify that the proposed approach can bring considerable gains (+3.27 absolute points for ULIP-2) on such a larger and more challenging dataset.
| Method | Source (ShapeNetV2) | Target (Objaverse-Lvis) |
|---|---|---|
| ULIP | 87.33 (0.95) | 0.83 (0.05) |
| +RC(Ours) | 90.43 (0.86) | 1.10 (0.08) |
| ULIP-2 | 76.70 (1.37) | 14.80 (0.22) |
| +RC(Ours) | 76.70 (1.59) | 18.07 (0.49) |
The authors' response addressed my concerns, so I have changed my rating to weak accept.
Dear Reviewer kpdQ,
We wanted to express our sincere gratitude for your thoughtful re-evaluation of our submission. Your willingness to reconsider the rating and provide us with constructive feedback is truly appreciated.
We are pleased to hear that our response helped address your concerns, and we value the time and effort you’ve dedicated to reviewing our work. Your insights have been instrumental in improving the clarity and quality of our submission.
Thank you again for your careful consideration and for raising your rating.
Best regards,
Authors of Submission 1657
Dear Reviewer kpdQ,
I hope this message finds you well. I wanted to express my gratitude for the time and effort you’ve invested in reviewing our submission. I understand that this is a busy period, and I sincerely appreciate your attention to and recognition of our work.
If there are any further clarifications or questions regarding our rebuttal, we are more than happy to provide additional information.
Thank you again for your valuable feedback and consideration.
Best regards,
Authors of Submission 1657
First of all, we sincerely thank all reviewers and ACs for reviewing our paper and providing valuable comments. There is no doubt that these suggestions and feedback are very valuable for refining the paper. We are encouraged by the positive comments from the reviewers: "significant originality by being the first ..." (Reviewer kpdQ), "The whole framework is simple but effective" (Reviewer d557), "consistent improvements in generalization ability across various large 3D models and benchmarks" (Reviewer UDVe). Moreover, all reviewers recognize that our constructed benchmarks for 3D domain generalization are valuable to the community.
In the following, we will give a global response to the common concern on novelty of this work.
Q1: The method proposed in this paper seems simple and lacks novelty.
A1: We agree that our method is simple, but effective. The novelty of our work lies in the fact that we are the first to investigate the 3D domain generalization capability of large 3D models, and that we present a simple yet effective regulation framework to address this critical issue. The originality is highly appraised by Reviewer kpdQ.
Moreover, we construct three new benchmarks that provide new evaluation dimensions for 3D domain generalization (3DDG), including generalization to unseen new classes, robustness to corrupted data, and few-shot generalization. These are vital indicators for measuring 3DDG ability in real-world scenarios but were ignored by previous works. The merits of our constructed benchmarks are recognized by all three reviewers: "... new benchmarks provide valuable tools for future research in 3DDG" (Reviewer kpdQ); "... newly created benchmarks provide a more holistic evaluation of 3D domain generalization" (Reviewer UDVe); "... new benchmarks may be beneficial to the community" (Reviewer d557).
The paper initially received mixed ratings. After the rebuttal and discussion phases, all reviewers leaned toward accepting the paper, with two ratings of Weak Accept and one of Borderline Accept. After thoroughly considering the paper, the reviews, and the subsequent discussions, the AC recommends acceptance of this paper.
The AC suggests that the authors should incorporate the reviewers' suggestions and the additional content from the rebuttal and discussion phases into the final version of the paper. This includes related works on 3D domain adaptation/generalization for LiDAR data in autonomous driving, a comparison of training times, scalability considerations, sensitivity analysis, and the inclusion of figures presented in the rebuttal PDF.