Mining and Transferring Feature-Geometry Coherence for Unsupervised Point Cloud Registration
A novel unsupervised point cloud registration method for large-scale outdoor scenes.
Abstract
Reviews and Discussion
The paper introduces a new unsupervised outdoor point cloud registration method called INTEGER, which dynamically integrates high-level contextual information and low-level geometric information to generate reliable pseudo-labels, addressing the poor performance in complex outdoor environments of previous methods that rely solely on geometric cues. By incorporating the Feature-Geometry Coherence Mining module, Anchor-Based Contrastive Learning, and the Mixed-Density Student model, the method not only performs well in scenarios with significant variations in data density but also demonstrates its efficiency and generalizability on standard datasets such as KITTI and nuScenes. Overall, INTEGER significantly improves accuracy and generalizability, especially in handling complex and distant outdoor scenes, showing advantages that traditional methods struggle to match.
Strengths
- Originality: The INTEGER method demonstrates originality by innovatively combining high-level contextual and low-level geometric information for unsupervised point cloud registration. It creatively addresses challenges in outdoor environments, where previous methods relying solely on geometric data often fail.
- Quality: The paper showcases high-quality research through detailed methodological execution and robust evaluations on standard benchmarks like KITTI and nuScenes. The proposed mixed-density student model, which learns density-invariant features, ensures robust performance across diverse scenarios.
- Clarity: The manuscript is clearly written and well-organized, effectively communicating its core ideas, methodologies, and results. It uses figures and tables effectively to aid understanding, making it accessible to readers familiar with point cloud processing.
- Significance: The work is highly significant, offering potentially transformative impact in autonomous driving and robotics by enhancing unsupervised point cloud registration in outdoor scenes. Its competitive performance against supervised methods highlights its practical relevance and potential to reduce reliance on labeled data.
Weaknesses
- Dependency on Initial Conditions: The INTEGER method's performance heavily relies on the quality of the initial teacher model. A poorly initialized teacher could propagate errors and inefficiencies through the learning process. The paper could explore alternative strategies for more robust initialization or provide a more detailed analysis of how initial conditions affect overall performance. This would give readers a clearer understanding of the method's robustness and potential limitations in real-world scenarios.
- Comparison with State-of-the-Art Methods: While the paper presents comparisons with existing methods, the selection seems limited to those closely aligned with the proposed framework. Including a broader range of state-of-the-art methods, especially recent unsupervised approaches that use different paradigms, could validate the strengths and uniqueness of INTEGER more convincingly. This comparison could also be extended to discuss how INTEGER performs relative to these methods in terms of computational efficiency and scalability.
- Discussion on Failure Cases and Limitations: The paper could benefit from a more thorough discussion of scenarios where INTEGER might underperform, such as extremely sparse point clouds or highly noisy environments. Understanding the method's limitations and potential failure cases would help set realistic expectations and could guide future research toward these specific challenges.
- Generalizability Across Different Datasets: While the paper tests INTEGER on KITTI and nuScenes, additional experiments on datasets from different domains or with different characteristics could demonstrate the method's adaptability and robustness. Exploring performance on datasets with varying densities and noise levels would provide a more comprehensive evaluation across diverse real-world conditions.
Questions
- Robustness of Initial Teacher Model: Could you elaborate on the strategies used for initializing the teacher model? Given INTEGER's dependence on the quality of the initial teacher, understanding its robustness in adverse conditions or with suboptimal initialization would be informative. Are there specific conditions under which the initial model tends to fail, and how does this affect overall registration accuracy?
- Broader Comparisons: The paper currently compares INTEGER with a select group of methods. Could you include comparisons with additional state-of-the-art unsupervised methods that employ fundamentally different approaches? This would help clarify the unique advantages or potential trade-offs of INTEGER. Additionally, insights into the computational efficiency and scalability of INTEGER compared to both supervised and unsupervised counterparts would be beneficial.
Limitations
The manuscript discusses some technical limitations related to the initialization of the teacher model and the conditions under which the method may underperform. However, these discussions could be expanded to include:
- Scalability: How does INTEGER scale with increasingly large datasets or more complex environments? A discussion on computational resources and runtime would give a clearer scope of applicability.
- Robustness: A more detailed exploration of robustness against different types of noise or outliers in point cloud data would be beneficial. Given the application domains of point cloud registration, such as autonomous driving and robotics, understanding these aspects is crucial.
We first would like to thank the reviewer for giving us valuable comments.
Q1: Robustness of Initial Teacher Model: the strategies used for teacher initialization and its robustness in adverse conditions or with suboptimal initialization.
A: The initialization strategy is detailed in Sec 3.2. We first split an input point cloud into two partially overlapping parts, apply a random rigid transformation to one part, and use that transformation as the ground truth. To mimic real-world LiDAR scans, we employ a periodic sampling technique. We also apply commonly used data augmentations, such as random scaling and noise. Visualizations of the generated pairs are provided in Sec. A.5 (L594-600).
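For a concrete picture, here is a minimal NumPy sketch of this pair-generation scheme. All function names and parameter values are our illustrative assumptions (with `overlap > 0.5` assumed), and the stride-based subsampling is a simplified stand-in for the periodic sampling technique described in Sec 3.2:

```python
import numpy as np

def random_rigid_transform(rng, max_angle_deg=45.0, max_trans=2.0):
    """Sample a random rotation about the z-axis and a random translation."""
    theta = np.deg2rad(rng.uniform(-max_angle_deg, max_angle_deg))
    c, s = np.cos(theta), np.sin(theta)
    R = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    t = rng.uniform(-max_trans, max_trans, size=3)
    return R, t

def make_synthetic_pair(points, overlap=0.7, seed=0):
    """Split one scan (N, 3) into two partially overlapping parts and apply
    a random rigid transform to one part; that transform is the ground truth."""
    rng = np.random.default_rng(seed)
    # Split along a random horizontal direction, keeping an overlap band.
    d = rng.normal(size=2)
    d /= np.linalg.norm(d)
    proj = points[:, :2] @ d
    lo, hi = np.quantile(proj, [1.0 - overlap, overlap])
    src = points[proj <= hi].copy()
    tgt = points[proj >= lo].copy()
    # Periodic subsampling to roughly mimic LiDAR beam sparsity
    # (a simplified stand-in for the paper's periodic sampling technique).
    tgt = tgt[::2]
    R, t = random_rigid_transform(rng)
    tgt = tgt @ R.T + t  # (R, t) serves as the pseudo ground truth
    # Common augmentations: random scaling and jitter noise on the source.
    src = src * rng.uniform(0.95, 1.05) + rng.normal(scale=0.01, size=src.shape)
    return src, tgt, (R, t)
```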
The performance of INTEGER indeed relies on the initial teacher. However, INTEGER remains robust even with suboptimal initialization. Our ablation study (Sec 4.3) shows that INTEGER maintains high-quality pseudo-labels, even with EYOC-style initialization (achieving 71.9% IR compared to EYOC's 53.2% IR). This superior robustness may be attributed to the proposed FCEM module, which dynamically adapts the teacher, enhancing it before mining pseudo-labels.
To further demonstrate the robustness, we have compared (1) a Fully-trained teacher, (2) an Under-trained teacher, and (3) a Randomly-initialized teacher. We measured the Inlier Ratio of the teacher in the first epoch to assess the quality of pseudo-labels:
| | F | U | R |
|---|---|---|---|
| IR@1st Epoch | 81.2% | 72.1% | 34.9% |
We observed only a minor performance drop when the teacher wasn't fully trained, and the randomly-initialized teacher performed only slightly worse than the well-trained EYOC. In this very extreme case, FCEM's adaptation process ensures that the teacher still performs surprisingly well, highlighting both the superiority of the FCEM design and the robustness of our method. We would also like to highlight that after training for several epochs, the teacher's IR improves significantly, with minimal negative impact on the student's convergence.
Indeed, more complex registration networks, such as Predator, struggle slightly with the current initialization strategy. During training, they show suboptimal performance at the beginning of teacher-student training but eventually catch up after more training epochs, highlighting the effectiveness of the proposed FCEM module. We speculate that these more complex networks may overfit the synthetic data. In future work, we will devise a better strategy to address this issue.
Regarding adverse conditions, we plan to investigate common adverse conditions in autonomous driving scenarios, such as extreme weather, in our future work. This is a critical aspect of real-world applications that requires thorough investigation.
Q2: Broader Comparisons: Additional SOTA unsupervised methods that employ fundamentally different approaches.
A: Our INTEGER focuses on outdoor scenes; only a few unsupervised methods, namely EYOC and RIENet, are applicable in this context. We have included both in our comparison (see Sec 4.1, Table 1). Among these unsupervised methods, RIENet takes a fundamentally different approach. Our INTEGER not only outperforms EYOC, but also surpasses RIENet by a large margin.
Q3: Insights into computational efficiency and scalability of INTEGER compared to both supervised and unsupervised counterparts.
A: As mentioned in Limitations (Sec. A.1), our method is slightly slower at obtaining pseudo-labels than EYOC because of the proposed iterative method used in the FGC-Branch of the FCEM module. However, with only 33% more time cost on KITTI, our INTEGER produces much more accurate pseudo-labels with 28.1% higher IR (81.3% vs. EYOC's 53.2%) and considerable improvements in RR% for very distant pairs (54.2% vs. EYOC's 52.3%). We show additional results for supervised baselines on the nuScenes dataset and generalizability tests in part 2 of the general response, which show that our method has better scalability than SOTA supervised and unsupervised methods. Please refer to it for details.
Q4: Scalability with increasingly large datasets or more complex environments and the discussion on computational resources and runtime; Generalizability Across Different Datasets.
A: Our method targets autonomous driving scenarios, so we tested it on KITTI and nuScenes, standard datasets in this field. These datasets vary in density, occlusion, and noise levels, enabling us to evaluate INTEGER's scalability and robustness.
To further demonstrate the scalability of INTEGER, we have also evaluated the performance of INTEGER on ETH, an outdoor dataset with rural and forest scenes, which are primarily unstructured and complex, differing from the urban environments of KITTI and nuScenes datasets. Moreover, we also conducted additional experiments in indoor scenes as requested by other reviewers. Please refer to part 2 and 3 of the general response for results. These results also show the superior generalizability of INTEGER.
Q5: Robustness against different types of noise or outliers.
A: In outdoor scenes, registration methods are often challenged by extremely low overlap and density variation, which introduces noise in correspondences, especially for distant pairs. Our method demonstrates superior robustness against these challenges, as shown in Sec 4.1, Table 1.
Q6: Discussion on Failure Cases and Limitations.
A: The robustness of the proposed INTEGER against different types of noise and outliers, including the noise introduced by extreme weather, requires further investigation. These situations are particularly relevant for autonomous driving. We will discuss these limitations and future work in the conclusion.
Dear Reviewer 2VKD,
We sincerely thank you for your precious time and efforts in reviewing our paper. We greatly appreciate your insightful and detailed feedback, and we have carefully addressed the concerns in the rebuttal. Please let us know if any aspects remain unclear, and we are happy to provide further clarification.
We are looking forward to your reply.
Best regards,
The authors of submission 1909
The authors' answer here is convincing and resolves some of my previous questions. Although only a few unsupervised methods are compared, it can be seen that the proposed method has advantages over them. If possible, it would be better to compare more unsupervised methods. Although the proposed method lags behind other methods in some aspects, it shows a good effect overall, which is surprising. I keep my original score.
Thanks for continuing to recognize our work. Your suggestions on understanding the robustness of the teacher initialization and the scalability across various datasets are very helpful; these experiments effectively highlight the superior robustness and generalizability of our work. Additionally, your suggestion to add more unsupervised baselines enables a more comprehensive comparison between our work and existing methods, and we are very grateful for that. Although we have found that some existing indoor unsupervised methods fail to work in outdoor scenes, we will conduct these experiments and include the results in the experiment section. We sincerely appreciate your thorough review, and we will incorporate your suggestions into the final version.
This paper introduces an unsupervised framework for point cloud registration that generates reliable pseudo-correspondences using both low-level geometric and high-level contextual information. It employs a widely used teacher-student architecture and proposes Anchor-Based Contrastive Learning to facilitate robust feature space development through contrastive learning with anchors. Additionally, the paper introduces a Mixed-Density Student approach to learn density-invariant features, effectively addressing challenges associated with density variations and low overlap in outdoor scenarios.
Strengths
This paper focuses on registering point clouds without relying on pose prior datasets to supervise model training. The observation that in the feature space, points of latent new inlier correspondences tend to cluster around respective positive anchors summarizing features of existing inliers is particularly interesting.
The FCEM adaptation of the teacher model to a data-specific teacher for the current mini-batch is intriguing. However, the procedure described in the manuscript is complex and challenging to follow. It seems that the teacher model is first updated using the current mini-batch dataset, then correspondences are selected to train the student, followed by an update of the teacher weights using Exponential Moving Average (EMA). This sequence raises questions about the efficiency and effectiveness of such a dynamic updating scheme. The authors need to provide a clearer explanation of this process.
Weaknesses
The first limitation is that some parts of the framework appear remarkably similar to the established EYOC method, calling into question the novelty of this contribution. Besides, the integration of different levels (low and high) of information for mining pseudo-labels is highlighted as a distinguishing contribution, yet the manuscript lacks a clear explanation of the specific mechanisms. If Spatial Compatibility Filtering is employed, it should be noted that this technique is already a component of EYOC.
Questions
This paper introduces an unsupervised framework for point cloud registration that integrates both low-level geometric and high-level contextual information to generate reliable pseudo-labels. While the concept is promising, the manuscript could benefit from additional detail and clarification in several areas:
- Similarity to Existing Methods: Some parts of the framework appear remarkably similar to the established EYOC method, thereby calling into question the novelty of this contribution.
- Using low- and high-level information for mining pseudo-labels is highlighted, yet the manuscript lacks a clear explanation of the specific mechanisms. If Spatial Compatibility Filtering is employed, it should be noted that this technique is already a component of EYOC.
- The manuscript mentions pairs with distances [d1, d2], but it does not explain the impact of these distances on registration. A detailed explanation would enhance understanding of how they challenge registration.
- Dataset Differences and Overlap Ratios: The experimental section discusses the use of KITTI and nuScenes datasets. It is important to delineate the differences between these datasets, especially concerning their overlap ratios. Providing specific overlap ratios for these and other datasets would facilitate a more nuanced discussion of the framework's applicability and robustness across different scenarios.
- Performance on Various Datasets: The framework's performance is evaluated on outdoor datasets like KITTI and nuScenes, but there is no discussion on its efficacy in indoor settings, such as with the 3DLoMatch dataset. Insights into performance in diverse environments would be particularly valuable, given the unique challenges posed by indoor scenarios.
- Training Time: The manuscript omits details regarding the training duration. Including this information is essential for assessing the practicality of deploying this framework in real-world applications.
- Comparative Analysis with EYOC: If EYOC also utilizes synthetic pretraining, a direct comparison with the proposed method under similar conditions is necessary. Such a comparative analysis would provide a fairer assessment of the proposed method's strengths and help highlight any genuine advancements.
- Architecture Details for Predator: The Predator architecture employs cross-attention and a dual-branch structure, but the manuscript lacks clarity on how the network is divided into teacher and student components. This is non-trivial, and more detailed explanations are required to fully understand its implementation and functionality.
- Generalization to Indoor Environments: There is a need to assess how networks trained on outdoor datasets perform in indoor point cloud registration tasks. Evaluating the generalization capabilities of the framework would give insights into its versatility and effectiveness across different application settings.
Limitations
The authors adequately addressed the limitations
We would like to thank the reviewer for the valuable comments. For concerns about the efficiency and effectiveness of FCEM, please refer to the "6. Efficiency and effectiveness of FCEM" part in the general response.
Q1: Similarity to EYOC.
A: Our method significantly differs from EYOC in several key aspects: (1) Novel Insight: We introduce a new perspective by demonstrating that inlier matches are closer to positive anchors than to negative ones, which summarize existing inlier and outlier features, respectively. (2) Pseudo-Labeling with Multilevel Information: Our FCEM module dynamically adapts the teacher and integrates low-level geometric and high-level contextual information, whereas EYOC only uses low-level geometric cues. (3) Robust Knowledge Transfer: We propose ABCont for effective and robust teacher-student knowledge transfer, which is not present in EYOC. (4) Density Invariance: We introduce MDS to explicitly address density variation in outdoor scenarios, a concern overlooked by EYOC. Experiments show that our method outperforms EYOC on two datasets by a considerable margin.
Q2: Clarity on pseudo-labeling mechanisms and the use of Spatial Compatibility Filtering(SCF) in relation to EYOC.
A: We would like to clarify a potential misunderstanding. While SCF in the GSA-Branch appears similar to EYOC's in terms of rejecting outliers, our intention differs completely from EYOC's: we use it to improve the teacher for the current data batch, whereas EYOC uses it to extract pseudo-labels for the student. Moreover, experiments (Sec 4.2, L272-285) show that SCF can be implemented with pose estimators other than the SC2-PCR estimator used by EYOC.
Moreover, we do not rely solely on SCF to incorporate both low- and high-level information; this is achieved through the entire FCEM, particularly the FGC-Branch (Sec 3.3, L186-216). We iteratively identify latent inliers using feature-space anchors and exclude outliers via SCF. These contributions are novel and not found in EYOC. In our manuscript (L174-175), we cited EYOC to credit their work. We will add a detailed discussion of the differences from EYOC in the revised version.
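To make the iterative mining loop concrete, below is a heavily simplified NumPy sketch. All names are our illustrative choices rather than the paper's API, and `spatial_compatibility_filter` is a hypothetical placeholder for an SCF-style outlier-rejection step returning a boolean mask:

```python
import numpy as np

def mine_pseudo_labels(feat_src, feat_tgt, matches, seed_inliers,
                       spatial_compatibility_filter, n_iters=3, tau=0.5):
    """Iteratively grow an inlier set: summarize current inliers into a
    positive anchor (high-level cue), pull in matches close to it in
    feature space, then reject geometric outliers (low-level cue)."""
    # Correspondence-level features: concatenated, normalized point features.
    f = np.concatenate([feat_src[matches[:, 0]], feat_tgt[matches[:, 1]]], axis=1)
    f /= np.linalg.norm(f, axis=1, keepdims=True)
    inliers = np.asarray(seed_inliers)
    for _ in range(n_iters):
        # Positive anchor: mean feature of current inliers, re-normalized.
        anchor = f[inliers].mean(axis=0)
        anchor /= np.linalg.norm(anchor)
        # Latent inliers: correspondences close to the anchor in feature space.
        candidates = np.nonzero(np.linalg.norm(f - anchor, axis=1) < tau)[0]
        # SCF-style geometric check removes spatially incompatible pairs
        # (assumed to return a boolean mask over the candidates).
        keep = spatial_compatibility_filter(matches[candidates])
        inliers = candidates[keep]
    return inliers
```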
Q3: Impact of distances [d1, d2] on registration.
A: Our evaluation follows EYOC's existing protocol, focusing on pairs within distances [d1, d2], consistent with our method's design for autonomous driving data. The differences between frames, indicated by the distance between them, are crucial for demonstrating robustness and generalizability in real-world applications like collaborative perception. Increased distances mainly present two challenges: (1) Overlap Ratio: increasing distances reduce overlap ratios, complicating the search for reliable correspondences. (2) Density Variation: increasing distances often lead to significant point density variations, challenging feature matching without density-invariant features.
Q4: Dataset Differences and Overlap Ratios.
A: Fig. 2 in the PDF attachment of the general response illustrates the overlap ratios w.r.t. distances between frames in the KITTI and nuScenes datasets, indicating that increasing distances reduce overlap ratios. Furthermore, KITTI and nuScenes differ in LiDAR resolution (64 LiDAR beams in KITTI and 32 in nuScenes), resulting in denser point cloud data in the KITTI dataset. The sparsity of nuScenes point cloud data makes registration more challenging, especially for feature extraction.
Q5: Performance on Various Datasets.
Q9: Generalization to Indoor Environments.
A: We address Q5 and Q9 together here. We conducted additional experiments in indoor (3DMatch/3DLoMatch datasets) and non-urban environments (ETH dataset). Please refer to parts 2 and 3 of the general response for details.
Q6: Lack of Training Time Details.
A: For most experimental results with FCGF, the full training procedure takes approximately 76 hours. For the additional results with Predator presented in the Appendix, the training procedure takes about 140 hours. The inference efficiency is the same as that of the registration network (0.16s for FCGF and 0.30s for Predator per pair) because our method does not introduce any test-time modules. Please refer to part 5 of the general response for details.
Q7: Direct Comparison with EYOC if it also uses synthetic pretraining.
A: EYOC doesn't use synthetic pretraining. Instead, it assumes minimal transformation between two consecutive LiDAR frames, using the identity transformation as the ground truth for the teacher's pretraining. In contrast, our method employs synthetic pretraining to initialize the teacher, as detailed in Sec 3.2 (L149-156). Table 4 of the general response shows that EYOC's assumption does not hold in real-world data: it may introduce significant noise, leading to a suboptimal initial teacher for EYOC.
For a direct comparison, we evaluate both methods under EYOC's approximation-based initialization (App.) and under synthetic pretraining (Syn.):
| IR (%) | App. | Syn. |
|---|---|---|
| EYOC | 53.2 | 54.0 |
| Ours | 71.9 | 81.2 |
Our method consistently outperforms EYOC even when initialized in the same way as EYOC. While synthetic pretraining benefits EYOC, the improvement is limited.
Q8: Clarification on Predator architecture and its use in the framework.
A: Predator doesn't employ a dual-branch structure. It consists of a single-branch KPConv encoder-decoder structure with an overlap-attention module in between, and we do not divide the architecture into teacher and student components. Regarding Predator's results in the Appendix, we use the entire Predator model described in Predator's paper, pre-training it fully as the teacher, and then supervising a student with the same architecture. We will clarify this in the revised version.
I appreciate the authors' clarifications.
After reading the rebuttal, I still have the following questions:
It seems we may have different understandings of the dual-branch structure. Let me restate the concept as described in the original paper: PREDATOR is a two-stream encoder-decoder network, producing point-level features for both the source and target point clouds. How the final loss is calculated remains unclear to me. Did you calculate the loss from the two streams?
Again, I thank the authors for their work and time.
Dear Reviewer SEpR,
We sincerely thank you for your precious time and efforts in reviewing our paper. We greatly appreciate your insightful and detailed feedback, and we have carefully addressed the concerns in the rebuttal. Please let us know if any aspects remain unclear, and we are happy to provide further clarification.
We are looking forward to your reply.
Best regards,
The authors of submission 1909
Dear Reviewer SEpR,
We sincerely thank you for your precious time. We are happy to see that our work related to PREDATOR has been understood.
We would like to clarify a potential misunderstanding of the two parts of our loss, namely the anchor-based auxiliary loss $\mathcal{L}_{\text{aux}}$ and the registration loss $\mathcal{L}_{\text{reg}}$. We proposed to increase the feature similarity to positive anchors from the teacher and decrease the feature similarity to negative anchors from the teacher. It is not Circle Loss, although they may share some similarities in their mathematical expressions. The detailed definitions of these anchors are given in L106-117 of our manuscript. Overall, $\mathcal{L}_{\text{aux}}$ is designed to regularize the feature space of the student with the teacher's extracted features. $\mathcal{L}_{\text{reg}}$, in PREDATOR's case, involves the Circle Loss. However, it cannot be removed because it is the primary supervision signal for supervising the student with the pseudo-labels from the teacher. In Table 3 of our ablation study, we remove $\mathcal{L}_{\text{aux}}$ and find a performance drop. However, with $\mathcal{L}_{\text{reg}}$ and our more accurate pseudo-labels compared with existing methods, we are still able to outperform existing SOTAs (our 83.5% mRR vs. EYOC's 83.2%), which demonstrates the importance of $\mathcal{L}_{\text{reg}}$. For the student, indeed, we need to train both encoders and decoders, as well as the overlap-attention module in between. As you said, the loss passes through both encoders and decoders, so it can adequately supervise the learning of all components of the student. During teacher-student training, we prevent gradients from being propagated to the teacher, as is typically done in works involving distillation.
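As a concrete, unofficial illustration of such an anchor-based regularizer, here is a minimal PyTorch sketch. The InfoNCE-style formulation, the temperature, and all names are our assumptions for illustration, not the paper's exact loss:

```python
import torch
import torch.nn.functional as F

def anchor_auxiliary_loss(student_feats, pos_anchors, neg_anchors, temp=0.1):
    """Pull student features of pseudo-inliers toward their positive anchors
    and push them away from negative anchors (both produced by the teacher).

    student_feats: (N, D) features of pseudo-inlier points from the student.
    pos_anchors:   (K, D) positive anchors from the teacher.
    neg_anchors:   (M, D) negative anchors from the teacher.
    """
    f = F.normalize(student_feats, dim=1)
    pos = F.normalize(pos_anchors, dim=1).detach()  # no gradient into teacher
    neg = F.normalize(neg_anchors, dim=1).detach()
    # Each point is associated with its most similar positive anchor.
    sim_pos = (f @ pos.t()).max(dim=1).values / temp        # (N,)
    sim_neg = (f @ neg.t()) / temp                          # (N, M)
    logits = torch.cat([sim_pos.unsqueeze(1), sim_neg], dim=1)
    # InfoNCE-style objective: index 0 (the positive anchor) is the target.
    labels = torch.zeros(len(f), dtype=torch.long, device=f.device)
    return F.cross_entropy(logits, labels)
```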
We sincerely appreciate your thorough review. Indeed, it is crucial to understand how gradients are back-propagated and their effects on different modules of the network. We will conduct additional experiments and incorporate your suggestions into the final version.
Best regards,
The authors of submission 1909
Thank you for your clarification. If the paper is accepted, I strongly suggest releasing the code, as it would help readers understand the method more clearly.
I understand that $\mathcal{L}_{\text{aux}}$ is used to increase feature similarity to positive anchors from the teacher and decrease feature similarity to negative anchors. However, it still functions as a circle loss, as stated in PREDATOR and other papers like GeoTransformer. Are the hyperparameters in this loss, such as the margins, set the same as in PREDATOR? (Perhaps I missed this detail.)
Regarding Table 3, which network did you use? Was it not FCGF? I ask this because you mentioned in your response that the loss ablation study cannot remove the regularization loss. In your supplementary material, you provide results for both FCGF and PREDATOR, which has confused me further.
Dear Reviewer,
Thank you for your reply. Following our tradition, we will release the code if the paper is published. We would also like to clarify a potential misunderstanding about the ablation study. As stated in our previous reply, we have indeed removed the regularization loss $\mathcal{L}_{\text{aux}}$ in our ablation study (see row "w/o ABCont" in Table 3), and because the registration loss $\mathcal{L}_{\text{reg}}$ is the primary source of the student's supervision, we cannot remove it.
Best regards,
The authors of submission 1909
Dear Reviewer SEpR,
We sincerely appreciate the time and effort the reviewer has dedicated to evaluating the work. As the reviewer-author discussion phase is coming to the end, we hope that our responses and additional experiments have effectively addressed the concerns raised. We respectfully request that the reviewer re-evaluate our work and kindly reconsider the ratings. Thank you for the continued support and understanding.
Best regards,
The authors of submission 1909
Hi,
Thank you for your replies and the well-prepared rebuttal. I almost fully understand what you did in the method section regarding PREDATOR. I just have a small question about the PREDATOR part, though it does not affect my final score.
In this case, both $\mathcal{L}_{\text{aux}}$ and $\mathcal{L}_{\text{reg}}$ are implemented as circle losses. For the distillation part, the gradient backpropagation only passes through one decoder and two encoders for either the source-stream or the target-stream term. However, when combined, they will cover both encoders and decoders. The loss $\mathcal{L}_{\text{reg}}$ passes through both encoders and decoders, so it seems that the $\mathcal{L}_{\text{aux}}$ part might be redundant and could potentially be removed.
Best,
We sincerely thank you for your precious time and efforts. We are happy that we have addressed most of your concerns. The reviewer is still concerned about the loss calculation for both the source and target point clouds. Indeed, as stated in the PREDATOR paper, PREDATOR is a two-stream encoder-decoder network, which produces point-level features for both the source and target point clouds. We provide more details about the loss for the source and target point clouds as follows:
As elaborated in Sec. 3.4 (L236-238) of our manuscript, the overall training loss for the student is a weighted combination of the losses introduced by ABCont (see Sec. 3.1), defined as $\mathcal{L} = \mathcal{L}_{\text{reg}} + \lambda\,\mathcal{L}_{\text{aux}}$.
- For the auxiliary loss $\mathcal{L}_{\text{aux}}$ proposed by us, as stated in L142-143 of the manuscript, we calculate it from the two streams (i.e., for both source and target point clouds): $\mathcal{L}_{\text{aux}} = \tfrac{1}{2}\left(\mathcal{L}_{\text{aux}}^{\text{src}} + \mathcal{L}_{\text{aux}}^{\text{tgt}}\right)$.
- For the registration loss $\mathcal{L}_{\text{reg}}$, as stated in L137 of the manuscript, we directly adopt the loss of the registration networks, such as PREDATOR and FCGF used in our experiments. Therefore, for experiments involving PREDATOR, we calculate the registration loss in the same way as in their paper: the loss is computed from the two streams (i.e., for both source and target point clouds) and averaged to form the total registration loss.
In summary, we calculate the loss from the two streams in the experiments involving PREDATOR during teacher-student training. As for synthetic pretraining, we also directly adopt the loss of the registration network, and thus calculate the loss from the two streams (i.e., for both source and target point clouds) for PREDATOR.
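For clarity, the combination described above amounts to the following small Python sketch; the per-stream averaging follows our stated reading of the rebuttal, and `lam` is our own naming for the weighting hyperparameter:

```python
def student_total_loss(reg_src, reg_tgt, aux_src, aux_tgt, lam=1.0):
    """Total student loss: the registration network's own loss and the
    anchor-based auxiliary loss, each averaged over the two streams,
    combined with a weighting factor."""
    l_reg = 0.5 * (reg_src + reg_tgt)  # PREDATOR/FCGF loss per stream
    l_aux = 0.5 * (aux_src + aux_tgt)  # anchor-based auxiliary loss per stream
    return l_reg + lam * l_aux
```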
We hope this further explanation would address your questions on the loss calculation. Thank you again for your constructive suggestions. We sincerely appreciate your thorough review, and we will incorporate your suggestions into the final version, including further clarification on loss calculation.
This submission proposes an unsupervised framework for point cloud registration. The key contribution is a two-stage training scheme, which first trains a teacher network on synthetic data that extract features in a density-invariant manner, and then trains a student network with pseudo label produced by the teacher net.
Strengths
- The submission is in general well structured. As is typical for this kind of work, it contains quite a few notations and architecture design details, which are presented in a relatively clear manner.
- The key insight, namely, "points of latent new inlier correspondences tend to cluster around respective positive anchors..." seems interesting and fresh to me (though I am not an expert in this area).
- The experimental performance looks competitive.
Weaknesses
- The key insight is presented more as an empirical observation; is there any chance it can be justified in a more concrete way, even on some toy examples?
- I didn't find a report on the training cost and inference efficiency in the submission.
Questions
In KITTI and nuScenes, as the data are scanned with a limited degree of viewpoint freedom, can such a method be applied to cases with significant rigid transformations? For example, in robotics, one may have to handle objects in varying poses randomly sampled in SE(3). If it is doable, could you please provide some insight/estimation on the cost of the pre-training stage?
Minor points: L161, missing space.
Limitations
The limitation discussion looks good to me.
We first would like to thank the reviewer for giving us valuable comments.
Q1: The key insight is presented more as an empirical observation; can it be justified in a more concrete way, even on some toy examples?
A: Thanks for your valuable comments. Fig.1 in our manuscript provides qualitative results concerning our key insight. We also conduct an additional simple quantitative experiment. For a sample pair from KITTI, we report the IR% of pseudo-labels at the first iteration in the FGC-Branch under different thresholds based on the feature-space distance to positive anchors in the following table. Features are normalized before calculating distances.
| Threshold | 0.1 | 0.5 | 0.8 |
|---|---|---|---|
| IR% | 98.2 | 88.3 | 85.1 |
The table shows that most inliers are close to positive anchors in feature space, and outliers are likely far away from positive anchors. This result conforms to our key insight.
Notably, despite the high IR% of correspondences very close to anchors at the first iteration, there are too few correspondences to supervise the student, and some outliers still exist. This fact calls for our iterative design of FCEM to include potential inliers and update the anchors iteratively.
We will refine the presentation of our key insight by (1) providing detailed quantitative results that cover more pairs from both the KITTI and the nuScenes datasets and (2) adding some toy examples to Fig.1 in the revised version.
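For reference, a measurement of this kind can be reproduced with a few lines of NumPy, assuming correspondence features, positive anchor features, and a ground-truth inlier mask are available (all names below are our illustrative choices):

```python
import numpy as np

def inlier_ratio_vs_threshold(corr_feats, pos_anchors, is_inlier,
                              thresholds=(0.1, 0.5, 0.8)):
    """IR% among correspondences whose feature-space distance to the nearest
    positive anchor is below each threshold (features are normalized first)."""
    f = corr_feats / np.linalg.norm(corr_feats, axis=1, keepdims=True)
    a = pos_anchors / np.linalg.norm(pos_anchors, axis=1, keepdims=True)
    # Distance from every correspondence feature to its nearest anchor.
    d = np.linalg.norm(f[:, None, :] - a[None, :, :], axis=2).min(axis=1)
    # Fraction of true inliers among correspondences passing each threshold.
    return {t: 100.0 * np.mean(is_inlier[d < t]) for t in thresholds}
```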
Q2: Missing report on training cost and inference efficiency in the submission.
A: As an unsupervised registration framework, our method's training cost and inference efficiency are highly dependent on the choice of registration network. For most experimental results with FCGF, the full training procedure takes approximately 76 hours. For the additional results with Predator presented in the Appendix, the training procedure takes about 140 hours. The inference efficiency is the same as that of the registration network (0.16s for FCGF and 0.30s for Predator per pair) because our method does not introduce any test-time modules. We will add these details in the revised version.
Q3: Does INTEGER apply to cases with significant rigid transformations, for example, object-level cases in robotics?
A: Although our work focuses on outdoor scene-level data rather than object-level data, we are happy to demonstrate the effectiveness of our method in scenarios involving significant rigid transformations. Specifically, we augment the rotation around the specified axis by a randomly sampled angle between [a1, a2]°.
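Concretely, such an augmentation can be implemented as in the following minimal NumPy/SciPy sketch (the function name and defaults are our own illustrative choices):

```python
import numpy as np
from scipy.spatial.transform import Rotation

def augment_rotation(points, axis="z", angle_range_deg=(-180.0, 180.0), seed=None):
    """Rotate a point cloud (N, 3) about the given axis by an angle sampled
    uniformly from [a1, a2] degrees; returns the rotated cloud and R."""
    rng = np.random.default_rng(seed)
    angle = rng.uniform(*angle_range_deg)
    R = Rotation.from_euler(axis, angle, degrees=True).as_matrix()
    return points @ R.T, R
```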
- Results with Augmented Z-axis Rotation on KITTI: a transformation type present in the training set.
In outdoor settings, rotation around the Z-axis is the most common transformation. However, significant rotations as large as 180° are unseen by the model during training, which makes them a challenge.
Compared to the supervised FCGF, our method has shown superior performance in terms of robustness against significant rigid transformations. Note that we follow the common practice of previous works to compute RRE and RTE only for pairs that are successfully registered. Therefore, RRE and RTE are not necessarily large even when RR% is very low. The results are as follows:
Ours:
| | [-15,15]° | [-45,45]° | [-90,90]° | [-180,180]° |
|---|---|---|---|---|
| RRE(°) | 0.25 | 0.28 | 0.28 | 0.33 |
| RTE(m) | 0.13 | 0.14 | 0.14 | 0.16 |
| RR(%) | 99.1 | 99.1 | 98.6 | 79.6 |
FCGF (supervised) trained on KITTI:
| | [-15,15]° | [-45,45]° | [-90,90]° | [-180,180]° |
|---|---|---|---|---|
| RRE(°) | 0.36 | 0.38 | 0.39 | 0.44 |
| RTE(m) | 0.46 | 0.42 | 0.43 | 0.25 |
| RR(%) | 85.9 | 83.0 | 80.1 | 77.2 |
- Results with Augmented X-axis Rotation on KITTI: a transformation type absent from the training set.
In outdoor settings, the rotation around the X-axis is almost absent from the dataset. Such transformations are thus considered a challenge for registration models, both in terms of generalization and robustness. In this case, we have observed that our method outperforms the supervised FCGF by a very large margin (+18.9 RR% at most), demonstrating the superior generalizability of our method. The results are as follows:
Ours:
| | [-15,15]° | [-45,45]° | [-90,90]° | [-180,180]° |
|---|---|---|---|---|
| RRE(°) | 0.49 | 0.51 | 0.84 | 1.28 |
| RTE(m) | 0.15 | 0.28 | 0.25 | 0.30 |
| RR(%) | 98.5 | 96.6 | 89.3 | 58.7 |
FCGF (supervised) trained on KITTI:
| | [-15,15]° | [-45,45]° | [-90,90]° | [-180,180]° |
|---|---|---|---|---|
| RRE(°) | 0.91 | 0.93 | 1.01 | 1.23 |
| RTE(m) | 0.49 | 0.52 | 0.51 | 0.48 |
| RR(%) | 87.3 | 80.1 | 74.8 | 39.8 |
Overall, these results indicate that our method can effectively handle cases with significant rigid transformations, even when such transformations are not adequately presented in the training dataset.
We acknowledge the importance of handling object-level data, such as ModelNet40. However, we are unable to conduct experiments on object-level datasets due to the lack of point cloud sequences in these datasets. Our method relies on progressive training (see Sec. 3) and is designed to train with the point cloud sequences provided in outdoor datasets. We will consider this in our future work.
Q4: Insight/estimation on the cost of the pre-training stage.
A: For the KITTI dataset, we train the teacher for 30 epochs in the pretraining stage, which takes roughly the same time as teacher-student training. However, the time cost for a single epoch is longer because there are more point cloud pairs (we generate a synthetic pair for every frame in the training set). We will investigate the necessity and devise more efficient sampling strategies in our future work.
Q5: Minor points: L161, missing space.
A: Thanks for pointing it out. We will correct it in the revised version.
Thank you for the reply, I do not have further questions.
Thanks for your time and effort to read our response. We are more than happy to be able to address your questions!
Dear Reviewer FNCk,
We sincerely thank you for your precious time and efforts in reviewing our paper. We greatly appreciate your insightful and detailed feedback, and we have carefully addressed the concerns in the rebuttal. Please let us know if any aspects remain unclear, and we are happy to provide further clarification.
We are looking forward to your reply.
Best regards,
The authors of submission 1909
This paper focuses on unsupervised point cloud registration in 3D computer vision. To tackle the problem, it leverages the observation that in the feature space, points of latent new inlier correspondences tend to cluster around respective positive anchors. Based on that, this paper proposes a novel unsupervised registration method termed INTEGER to incorporate high-level contextual information for reliable pseudo-label mining. Extensive experiments were conducted on outdoor benchmarks including KITTI and nuScenes, which demonstrates the effectiveness of the proposed method.
Strengths
- This paper is well-written and well-organized;
- The observation w.r.t. the new inliers and positive anchors seems interesting and correct;
- The use of teacher-student networks is rational, and the design of Feature-Geometry Coherence Mining is novel.
- The experimental results are strong on outdoor scenes.
Weaknesses
- There are some problems with the figures. For example, Fig. 1 is not very easy to understand. As a teaser, it does not demonstrate the core idea of this paper very clearly. Fig. 2 and Fig. 3 include too much content and make it hard to focus.
Questions
- In Tab. 1, why are there fewer baseline methods on nuScenes than on KITTI, as well as in the generalization test from KITTI to nuScenes?
- Although the title indicates that this paper focuses on outdoor scenes, the proposed method could also work on indoor scenes. Why are indoor scenes excluded? Could some experiments be added to further demonstrate the superiority of the proposed method?
Limitations
Limitations have been discussed.
We first would like to thank the reviewer for giving us valuable comments.
W1: Problems with the figures. For example, Fig. 1 is not very easy to understand. Fig. 2 and Fig. 3 include too much content and make it hard to focus.
A: Thanks for pointing out the figure issues. A simplified version of Fig. 2 is shown in the PDF attachment of the general response. In Fig. 1, we visualize the feature space by reducing the feature dimension to 2 with t-SNE. Here, we provide an explanation (a small reproduction sketch of the embedding step follows the list):
- Contour lines and colors in between denote the feature-space distance from a point to its nearest anchor.
- The "o" and "x" markers denote points forming inlier and outlier correspondence, respectively.
- From Fig.1, we can observe that almost all points close to positive anchors in the feature space belong to inlier correspondences.. The observation conforms to our key insight.
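For completeness, the joint 2-D embedding underlying such a visualization can be produced roughly as follows (scikit-learn; names and defaults are our illustrative assumptions):

```python
import numpy as np
from sklearn.manifold import TSNE

def embed_features_2d(point_feats, anchor_feats, seed=0):
    """Jointly embed point features and anchor features into 2-D with t-SNE,
    so feature-space proximity to anchors can be drawn as in Fig. 1."""
    feats = np.concatenate([point_feats, anchor_feats], axis=0)
    emb = TSNE(n_components=2, random_state=seed).fit_transform(feats)
    n = len(point_feats)
    return emb[:n], emb[n:]  # 2-D coordinates of points and anchors
```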
We will further improve the clarity of these figures in the revised version. Thank you for bringing this to our attention.
Q2: Reasons for fewer baseline methods presented on nuScenes compared to those on KITTI, as well as the generalization test from KITTI to nuScenes.
A: We excluded some supervised baselines (e.g., GeoTransformer, CoFiNet, and Predator) in the experimental results on nuScenes and the generalization tests for two main reasons:
- We aimed to align our experimental setup with previous research, such as EYOC, which similarly omits these baselines.
- The omitted baselines are included in KITTI's results to demonstrate that state-of-the-art supervised methods fail when the overlap ratio is very low and density variation is huge. Since KITTI and nuScenes are both autonomous driving datasets and share similar challenges, we believe including these baselines may not significantly add to the completeness of our experiments.
We recognize the value of a comprehensive evaluation and have provided additional baselines in part 1 of the general response. Overall, these supervised SOTA methods still fail to perform on par with unsupervised counterparts with distant point cloud pairs. These experimental results will also be added to the revised version of our paper for clarity.
Q3: Why are indoor scenes excluded? Can additional experiments further demonstrate the superiority of the proposed method?
A: We chose to focus on outdoor scenes for two reasons: (1) A rich set of unsupervised indoor registration methods has been proposed, and their performance is already on par with supervised indoor methods. Still, they fail to tackle outdoor challenges due to their reliance on RGB-D data, high-overlap assumptions, and LiDAR data's sparsity in outdoor scenes. We aimed to extend the success of unsupervised methods to outdoor scenes, explicitly addressing the unique challenges of these scenarios, such as the varying point density of the same object in different scans, as often encountered in autonomous driving. Given the very distinct applications and sensor characteristics (RGB-D cameras vs. LiDARs), which may hinder a "unified method" from excelling in both indoor and outdoor environments, we only present the results on outdoor datasets in our paper. (2) Our method relies on progressive training (see Sec. 3) and is designed to train with point cloud sequences that are commonly provided in outdoor autonomous driving datasets.
We acknowledge the significance of indoor scenes and have conducted additional experiments to explore our method's potential in these environments. Specifically, we tested generalization from the outdoor KITTI dataset to the indoor 3DMatch/3DLoMatch dataset, demonstrating that our framework, despite being unsupervised, outperforms the supervised FCGF by a significant margin (26.4% RR vs. FCGF's 19.7% RR). Please refer to part 2 of the general response for details.
Dear Reviewer Bwh6,
We sincerely thank you for your precious time and efforts in reviewing our paper. We greatly appreciate your insightful and detailed feedback, and we have carefully addressed the concerns in the rebuttal. Please let us know if any aspects remain unclear, and we are happy to provide further clarification.
We are looking forward to your reply.
Best regards,
The authors of submission 1909
Thanks for answering my questions! I am still leaning towards accepting this manuscript.
Thanks for your time and effort to read our response. We are happy to be able to address your concerns, and we are more than grateful for your accept recommendation.
We sincerely thank all reviewers for their careful, valuable, and insightful reviews. The reviewers acknowledge our work for its good writing (Bwh6, FNCk, 2VKD), novel design (Bwh6, SEpR), and competitive performance (Bwh6, FNCk, 2VKD). We are particularly delighted to see that all reviewers recognize the novelty and correctness of the insight and key observation behind our method. We have conducted additional experiments to address the comments; the results are included in the PDF attachment, where we describe the new experiments and discuss the results most reviewers requested.
- More Supervised Baselines.
We excluded some supervised baselines for the nuScenes dataset and the generalizability test because:
- We aimed to align our experimental setup with previous work, such as EYOC, which similarly omits these baselines.
- The omitted baselines are included in KITTI's results to show that SOTA supervised methods fail with very low overlap and huge density variation. Since KITTI and nuScenes are both autonomous driving datasets and share similar challenges, we believe including these baselines may not significantly add to the completeness of our experiments.
We have included more supervised baselines in Table 1 of the PDF attachment. Similar to KITTI's results, these supervised SOTA methods still fail to perform on par with unsupervised counterparts with distant point cloud pairs (within [40,50]m; see Table 1). This highlights the need for an unsupervised outdoor registration method with strong generalizability to handle distant point cloud pairs, which is crucial to real-world applications. We will include these results in the revised version for clarity.
- Reasons for Outdoor Scenes. Generalizability to Indoor Scenes.
In our paper, we chose to focus on outdoor scenes because: (1) A rich set of unsupervised indoor registration methods has been proposed, and they already perform on par with supervised indoor methods. Still, they fail to tackle outdoor challenges due to their reliance on RGB-D data, high-overlap assumptions, and LiDAR data's sparsity in outdoor scenes. In this paper, we aimed to extend the success of unsupervised methods to outdoor scenes, explicitly addressing the unique challenges of these scenarios, such as the varying point density of the same object in different scans, as often encountered in autonomous driving. Given the very distinct applications and sensor characteristics (RGB-D cameras vs. LiDARs), which may hinder a "unified method" from excelling in both indoor and outdoor environments, we only present the results on outdoor datasets in our paper. (2) Our method relies on progressive training (see Sec. 3) and is designed to train with point cloud sequences that are commonly provided in outdoor autonomous driving datasets.
However, we acknowledge the importance of indoor settings and provide results on 3DMatch/3DLoMatch in Table 3 of the PDF attachment. Following SpinNet (CVPR 2021) and BUFFER (CVPR 2023), we report RR% for generalizability tests using weights trained on KITTI.
- ETH Dataset: Generalizability to Non-Urban Outdoor Environments.
We further demonstrate the scalability of INTEGER by evaluating it on the ETH dataset, an outdoor dataset with rural and forest scenes that are primarily unstructured and complex, differing from the urban environments of the KITTI and nuScenes datasets. Because ETH is too small for training from scratch, we directly evaluate performance using weights trained on the KITTI dataset. The results are shown in Table 2 of the PDF attachment.
Overall, our method surpasses supervised FCGF in RR% by a large margin (58.06% vs. 39.55%), demonstrating its superior generalizability. Moreover, our method also outperforms unsupervised EYOC by a considerable margin.
- Overlap w.r.t. Distance of KITTI and nuScenes.
Our evaluation follows EYOC's existing protocol, focusing on pairs within distances [d1, d2], consistent with our method's design for autonomous driving data. The differences between frames, indicated by the distance between them, are crucial for demonstrating robustness and generalizability in real-world applications like collaborative perception. Fig. 2 in the PDF attachment illustrates the overlap ratios w.r.t. distances between frames in the KITTI and nuScenes datasets, indicating that increasing distances reduce overlap ratios.
- Training Cost and Computational Efficiency.
As an unsupervised training framework, INTEGER's training cost is highly dependent on the choice of the registration network. For most experimental results with FCGF, the entire training takes ~76 hours. For results with Predator in the Appendix, the training takes ~140 hours.
The inference efficiency is the same as that of the registration network because our method does not introduce any test-time modules. We list the inference efficiency of FCGF and Predator on KITTI here for convenience:
| Method | Time(s) |
|---|---|
| FCGF | 0.16 |
| Predator | 0.30 |
We will add the results to the revised version.
- Efficiency and Effectiveness of FCEM.
We have demonstrated the effectiveness of FCEM in Sec 4.3, Table 3, where the ablation study validates both branches of FCEM.
As mentioned in Limitations (Sec. A.1), our method is slightly slower at obtaining pseudo-labels than EYOC because of the proposed iterative method used in the FGC-Branch of the FCEM module. On average, our method takes ~12 minutes per training epoch on KITTI (excluding validation time), while EYOC takes ~9 minutes. However, with only 33% more time cost on KITTI, our INTEGER produces much more accurate pseudo-labels with 28.1% higher IR (81.3% vs. EYOC's 53.2%) and considerable improvements in RR for very distant pairs (54.2% vs. EYOC's 52.3%). Therefore, our FCEM is efficient with respect to the trade-off between time and performance.
The paper focuses on unsupervised point cloud registration. To tackle the problem, it leverages the observation that, in the feature space, points of latent new inlier correspondences tend to cluster around respective positive anchors. Based on that, the paper proposes a new unsupervised registration method termed INTEGER to incorporate high-level contextual information for reliable pseudo-label mining. Extensive experiments were conducted on outdoor benchmarks including KITTI and nuScenes, demonstrating the effectiveness of the proposed method. The final recommendations from the reviewers were 1× Weak Accept, 2× Borderline Accept, and 1× Borderline Reject.