A Consistency-Aware Spot-Guided Transformer for Versatile and Hierarchical Point Cloud Registration
We propose a novel consistency-aware spot-guided Transformer for versatile and hierarchical point cloud registration, achieving state-of-the-art accuracy, efficiency, and robustness on both outdoor and indoor benchmarks.
Abstract
Reviews and Discussion
This work proposes a coarse-to-fine matching approach for point cloud registration, where a consistency-aware spot-guided transformer is introduced to ensure that the coarse matches are geometrically consistent and located around the overlap regions. The method is compared with SOTA PCR methods on three benchmarks, and the experimental results validate its superior performance in terms of registration accuracy, success rate, and time cost.
Strengths
- Extensive experiments and SOTA results. The proposed method is evaluated on three benchmarks, including two outdoor and one indoor, where challenging cases such as low-overlap data are also included. It achieves SOTA performance in terms of registration accuracy, success rate, and time cost.
- The idea of incorporating geometric consistency to increase the distinctiveness of the learned features is novel. Pure 3D descriptor methods, such as SpinNet and YOHO, only consider rotation equivariance. With the help of geometric consistency, the proposed method shows superior feature matching performance in Table 3.
- The writing is clear and easy to follow.
Weaknesses
I don't see any significant weakness in this work. Some minor comments are below:
- In line 123, "third level" and "fourth level" need to be explained further or illustrated in Figure 1.
- Please clearly state the proposed contribution or the difference with respect to existing work. For instance, the "spot", cross-attention, and self-attention are also discussed in DiffusionPCR and GeoTransformer, and geometric consistency is widely applied by many PCR methods, such as PointDSC, RANSAC, and Spectral Matching [1].
- Even though additional experiments are not encouraged during this process, I would recommend that the generalizability of the proposed method be discussed. I can imagine that the proposed method should have better generalizability than existing correspondence-based methods, thanks to the geometric consistency. But how does it compare to descriptor-based methods (SpinNet, YOHO) with RANSAC/PointDSC? Trained on an indoor dataset and tested on outdoor cases?
[1] A spectral technique for correspondence problems using pairwise constraints. ICCV 2005.
Questions
In Table 3, the performance of CAST+RANSAC is better than CAST, which means the proposed coarse-to-fine matching approach is not strictly geometrically consistent. This raises a question: instead of introducing geometric consistency into the feature matching process, why not make it a separate, subsequent process, just like RANSAC or Spectral Matching?
Limitations
I didn’t see any limitation.
Thank you for your constructive review and valuable suggestions. Below, we offer detailed responses to each of your comments and questions. If there are any points where our answers don't fully address your concerns, please let us know, and we will respond as quickly as possible.
- Weakness 1: In line 123, “third level” and “fourth level” need to be explained more or illustrated in the Figure 1.
For feature extraction based on KPConv, we down-sample the raw point clouds with a series of voxel sizes before each encoder layer, resulting in the first-, second-, third-, and fourth-level points. Since the decoder feature maps of these points have progressively smaller resolutions relative to the input point cloud (each level is a further down-sampled subset of the previous one), we will introduce dedicated notations for the points at each level and their features to make our manuscript easier to understand. We have carefully revised the writing and figures of Section 3 (Method) and released this version as a general response; we would appreciate it if you could read our revised overview part.
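To make the level hierarchy concrete, here is a minimal NumPy sketch of multi-level voxel down-sampling. It is only an illustration of the idea, not the paper's KPConv pipeline; the base voxel size and the doubling schedule are our assumptions.

```python
import numpy as np

def voxel_downsample(points, voxel_size):
    """Keep one representative point (the centroid) per occupied voxel."""
    keys = np.floor(points / voxel_size).astype(np.int64)
    # Group points by voxel index and average them.
    _, inverse = np.unique(keys, axis=0, return_inverse=True)
    inverse = inverse.reshape(-1)
    n_voxels = inverse.max() + 1
    sums = np.zeros((n_voxels, 3))
    counts = np.zeros(n_voxels)
    np.add.at(sums, inverse, points)
    np.add.at(counts, inverse, 1)
    return sums / counts[:, None]

# Hypothetical hierarchy: the voxel size doubles before each encoder layer,
# yielding the first- to fourth-level points (successively sparser sets).
base_voxel = 0.3  # initial voxel size; dataset-dependent in practice
points = np.random.rand(10000, 3) * 50.0
levels = []
current = points
for l in range(4):
    current = voxel_downsample(current, base_voxel * (2 ** l))
    levels.append(current)

for l, pts in enumerate(levels, start=1):
    print(f"level {l}: {len(pts)} points")
```

Each level contains no more points than the previous one, which is what makes coarse (deep-level) matching cheap and fine (shallow-level) matching dense.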
- Weakness 2: Please clearly state the proposed contribution or difference with respect to existing work.
Existing work widely adopts self-attention and cross-attention for coarse matching. However, CoFiNet and GeoTransformer adopt global cross-attention for feature aggregation, which inevitably attends to similar yet irrelevant areas, resulting in misleading feature aggregation and inconsistent correspondences. DiffusionPCR follows an iterative matching scheme that explicitly selects overlapped regions to attend to, leading to much longer inference time. Additionally, these methods focus on matching among very coarse nodes without considering geometric consistency, which is not tight enough for fine matching. Instead, we focus on feature aggregation among semi-dense features, leveraging both local and global geometric consistency to tackle the sparsity and looseness of coarse matching. To be specific, our consistency-aware self-attention only attends to salient nodes sampled from a global compatibility graph, while our spot-guided cross-attention only attends to nodes selected based on local consistency, i.e., the spot.
As for geometric consistency, existing work only uses it to search consistent correspondences (outlier rejection) from a given correspondence set for robust pose estimation, such as PointDSC, spectral matching, etc. However, our method leverages geometric consistency in feature aggregation during coarse matching, instead of outlier rejection. For fine matching, we adopt the geometric consistency for inlier prediction, which is similar to PointDSC but without time-consuming hypothesis-and-verification pipelines. Therefore, our attention-based modules and the way using geometric consistency are very different from existing work.
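The common pattern behind both attention modules described above, attending only to a per-query subset of keys, can be sketched in a few lines of NumPy. This is a simplified single-head illustration under our own notation, not the authors' implementation (which uses learned projections and the paper's specific selection criteria).

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def restricted_attention(q, k, v, attend_idx):
    """Attention where query i only attends to the keys in attend_idx[i].

    q: (N, d) queries; k, v: (M, d) keys/values;
    attend_idx: (N, s) per-query key indices, e.g. the "spot" of each node
    for cross-attention, or salient nodes sampled from a compatibility
    graph for self-attention.
    """
    d = q.shape[-1]
    k_sel = k[attend_idx]             # (N, s, d) gathered keys
    v_sel = v[attend_idx]             # (N, s, d) gathered values
    scores = np.einsum('nd,nsd->ns', q, k_sel) / np.sqrt(d)
    w = softmax(scores, axis=-1)      # (N, s) attention weights
    return np.einsum('ns,nsd->nd', w, v_sel)

# Toy example: 6 query nodes, 8 key nodes, each query attends to 3 keys.
rng = np.random.default_rng(0)
q = rng.normal(size=(6, 16))
k = rng.normal(size=(8, 16))
v = rng.normal(size=(8, 16))
spots = rng.integers(0, 8, size=(6, 3))
out = restricted_attention(q, k, v, spots)
print(out.shape)  # (6, 16)
```

The point of the restriction is that feature aggregation never sees the "similar yet irrelevant" regions at all; they are excluded before the softmax rather than merely down-weighted.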
- Weakness 3: I would recommend that the generalizability of the proposed method be discussed.
We present our results on generalizability and discuss the potential in realistic applications in part 3 of the author rebuttal. Please refer to it; thank you for your suggestion!
Although our method trained on 3DMatch fails to deploy on ETH due to an out-of-memory error, we evaluate the generalizability of different methods when trained on KITTI (outdoor) and tested on ETH (outdoor). Note that these datasets use a Velodyne-64 3D LiDAR and a Hokuyo 2D LiDAR, respectively, leading to very different appearances of the point clouds, hence we believe this is a solid way to show the generalizability of different methods. Our method achieves satisfactory accuracy and robustness, showcasing better generalizability than GeoTransformer thanks to the consistency. However, it falls behind SpinNet and BUFFER using RANSAC, mainly owing to the point-wise FPN backbone and the lightweight yet learnable fine matching. Furthermore, we conduct an unsupervised domain adaptation experiment, indicating that our model easily adapts to an unseen domain and achieves robust and accurate performance after a short period of unsupervised tuning (only 20 min for an epoch!). For applications, we believe generalizing or quickly adapting from one outdoor LiDAR dataset to another is a more realistic setting than generalizing from an RGB-D camera dataset (3DMatch) to a LiDAR dataset (ETH), as many papers do.
- Q: Instead of introducing geometric consistency into the feature matching process, why not make it a separate and subsequent process?
A: Our overall design, using a complex coarse matching module and a lightweight fine matching module, is motivated by the efficiency, accuracy, and scalability requirements of real-time applications such as LiDAR odometry and SLAM. In a LiDAR odometry system, it is a consensus that real-time frame-to-map registration is essential for accuracy, rather than frame-to-frame registration as we do in this task. However, all existing coarse-to-fine methods fail to achieve real-time performance. We believe the key is to use only a lightweight fine matching process, because coarse matching is not necessary in odometry with a small pose deviation. Existing hypothesis-and-verification pipelines such as RANSAC and PointDSC are robust but time-consuming. Hence, we propose a lightweight fine matching module allowing independent deployment without coarse matching to meet these requirements. Nevertheless, coarse matching is essential for place recognition and global re-localization in SLAM, which have no real-time requirement, hence we introduce consistency into it to make the whole system robust to large pose deviations.
The author's detailed explanation and additional experiment results are appreciated and address most of my concerns. I will maintain my original rating for the manuscript.
Dear reviewer,
Thank you again for dedicating time to review our paper! Thank you so much for your appreciation of our explanation and additional experimental results!
Sincerely yours,
Authors
This paper introduces a consistency-aware, spot-guided Transformer, adapted from 2D image matching techniques, to minimize interference from irrelevant areas. It incorporates a consistency-aware self-attention module that enhances matching capabilities by using edge length constraints to filter correspondences. Additionally, a lightweight fine matching module is designed for both sparse keypoints and dense features, enabling accurate transformation estimation. The method performs well on outdoor datasets; however, its effectiveness diminishes in scenarios with low overlap.
Strengths
The motivation to make learning-based registration more efficient and scalable for real-time applications is both interesting and important. I appreciate the introduction of the edge equality constraint into the registration pipeline, which guides the correspondence search in a more efficient manner compared to COTReg [1].
Weaknesses
The method section of this paper is challenging to follow. It would be beneficial to introduce each method sequentially, aligned with the flow depicted in the figure—for example, starting with self-attention, followed by cross-attention, and then linear cross-attention.
The captions of Figure 1 should clearly label the coarse and fine matching modules, as they are not marked in the current figures, making it difficult to understand. Furthermore, it's unclear which network layer corresponds to "the third level" mentioned in line 123—is it part of the encoder or decoder? It is also hard to distinguish between spot attention and other forms in Figure 1.
The methodology for obtaining overlap scores is not adequately explained and is not illustrated in Figure 1.
The performance on datasets with low overlap seems subpar, which is concerning since handling cases of low partial overlap is crucial.
The authors claim that existing methods are not efficient or scalable for real-time applications like odometry in robotics; however, only time comparisons are provided in Table 3, and these do not clearly demonstrate an advantage. Additionally, there is no assessment of performance on large-scale datasets. Therefore, the authors should provide more comprehensive evaluations, including comparisons of other metrics such as FLOPs.
This paper lacks visualizations for the spot-guided attention, which are essential for demonstrating the method's effectiveness.
The manuscript lacks critical references such as COTReg [1] by Mei et al., which discusses geometric consistency and should be cited. Comparisons with state-of-the-art methods cited in references [2] and [3] are also necessary to validate the proposed method's efficacy.
[1] Mei, Guofeng, et al. "COTReg: Coupled optimal transport based point cloud registration." arXiv preprint arXiv:2112.14381 (2021).
[2] Zhang, Yifei, et al. "FastMAC: Stochastic Spectral Sampling of Correspondence Graph." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024.
[3] Huang, Tianyu, et al. "Scalable 3D registration via truncated entry-wise absolute residuals." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024.
Questions
The paper does not specify whether the loss supervises the overlap scores; is such supervision not required?
The term "partial tokens" mentioned in line 180 needs clarification. Details on computing the coarse matching for spot-guided attention are also needed.
Suggestion:
- Ensure consistency in font sizes across Tables 3 and 4.
- The terminology in line 123 could be improved by using "superpoints" or "nodes" instead of "patches," as the latter suggests a collection of points rather than individual points or nodes.
Limitations
Yes, the authors have adequately addressed the limitations.
Thank you for your constructive review and valuable suggestions. Below, we offer detailed responses to each of your comments and questions. If there are any points where our answers don't fully address your concerns, please let us know, and we will respond as quickly as possible.
- Weakness 1-3 about writing.
We have carefully revised the writing and figures of the Method section in the general comments, and we would appreciate it if you could read it. Specifically, we adjust Figure 1 to clearly label the coarse and fine matching modules and include more details such as the prediction of overlap scores. We explicitly describe our pipeline in both Figure 1 and an extra Architecture subsection, in the same order. We believe our revised version can address most of your concerns.
- Weakness 4 about performance in low overlapping scenes.
We update our results on 3DLoMatch and discuss the performance of our method on 3DLoMatch in part 1 of the author rebuttal. Please refer to it; thank you!
- Weakness 5 about our claim of potential in large-scale applications.
The FLOPs of our method are significantly lower than those of GeoTransformer and RoITr, indicating its efficiency. Although the FLOPs of some other methods are lower, their runtime is much longer. We believe the runtime on an RTX 3080 Ti is a more direct measure than FLOPs, which only account for the deep models.
| Method | FLOPs (G) |
|---|---|
| CoFiNet | 23.46 |
| Predator | 36.89 |
| GeoTransformer | 271.23 |
| RoITr | 422.14 |
| CAST | 102.16 |
By the way, "large-scale applications" in our manuscript may be misleading; thank you for pointing it out! Instead of large-scale datasets, it mainly refers to SLAM in our context, which requires frame-to-map registration for accuracy instead of the frame-to-frame registration we perform in this task. Compared with our task, frame-to-map registration is large-scale. Here we would like to clarify why our design has greater potential in this application.
At present, existing coarse-to-fine methods fail to achieve real-time performance in SLAM. Moreover, their fine matching is tightly coupled with coarse matching, which requires node-based partitioning and highly time-consuming optimal transport. Therefore, they are not scalable to SLAM. We believe the key is to use only a lightweight fine matching process, because coarse matching is not necessary in odometry with a small pose deviation. Hence, we design a sparse-to-dense fine matching pipeline that allows independent deployment without coarse matching to meet these requirements, and which can efficiently establish virtual correspondences for any point and its neighbors in the other point cloud. Moreover, we validate its feasibility using a learnable inlier classifier instead of existing time-consuming pose estimators (RANSAC, MAC, etc.). Nevertheless, coarse matching is essential for place recognition and global re-localization in SLAM, which have no real-time requirement, hence we introduce consistency into it to make the whole system robust to large pose deviations.
- Weakness 6: lacks visualizations for the spot-guided attention.
We add visualizations of the vanilla global cross-attention and our spot-guided cross-attention for nodes from both salient and non-salient areas, as shown in Figure 2 in the supplementary PDF, which demonstrates its ability to select instructive areas to attend to according to local consistency, while vanilla cross-attention may be attracted to many irrelevant areas.
- Weakness 7: The manuscript lacks critical references. Comparisons with state-of-the-art methods cited in references [2] and [3] are also necessary.
Thank you for pointing it out. We add the mentioned reference to our revised version in the general comments. Comparisons with these SOTA registration methods are shown in Table 1 in the author response, which further validates the efficacy of our method. For fairness, all of these methods use the FCGF descriptor rather than coarse-to-fine matching. Rather than focusing on feature extraction or matching as the baselines in our manuscript do, these methods focus on how to search a consensus set for robust pose estimation, using a pipeline more complex than our sparse-to-dense fine matching or the LGR in GeoTransformer. We believe this is why existing papers only compare these registration methods with each other using the same descriptors or correspondences.
- Q1: The paper does not specify if the loss supervises the overlap scores, it does not require it?
A: Our coarse matching loss in Equation (12), Line 233, simultaneously supervises the coarse matching and the overlap scores, since the final matching score is the product of the feature similarities and the overlap scores, and we also include negative log-likelihood terms for overlap scores whose labels are "0". Besides, we add an ablation study removing this overlap head, as shown in Table 2 in the supplementary PDF, which demonstrates its importance.
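The joint supervision described above can be illustrated with a toy sketch. The function names and exact loss form below are our own simplification, not the paper's Equation (12); the point is only that multiplying similarities by overlap scores lets one NLL term train both, with extra NLL terms for non-overlapping nodes.

```python
import numpy as np

def matching_loss(similarity, overlap_p, overlap_q, gt_pairs, neg_p, neg_q):
    """Toy sketch: final score = similarity * overlap, supervised jointly.

    similarity: (N, M) matching probabilities in (0, 1);
    overlap_p, overlap_q: (N,)/(M,) predicted overlap scores in (0, 1);
    gt_pairs: list of (i, j) ground-truth coarse correspondences;
    neg_p, neg_q: indices of nodes known to lie outside the overlap.
    """
    eps = 1e-9
    score = similarity * overlap_p[:, None] * overlap_q[None, :]
    # NLL on positive matches: pushes similarity AND overlap scores up.
    pos = -np.mean([np.log(score[i, j] + eps) for i, j in gt_pairs])
    # NLL on overlap scores whose label is 0 (non-overlapping nodes).
    neg = -np.mean(np.log(1.0 - np.concatenate([overlap_p[neg_p],
                                                overlap_q[neg_q]]) + eps))
    return pos + neg

# Toy check: correct overlap predictions should yield a lower loss.
sim = np.full((3, 3), 0.1)
np.fill_diagonal(sim, 0.9)
good = matching_loss(sim, np.array([0.95, 0.95, 0.05]),
                     np.array([0.95, 0.95, 0.05]),
                     [(0, 0), (1, 1)], [2], [2])
bad = matching_loss(sim, np.array([0.05, 0.05, 0.95]),
                    np.array([0.95, 0.95, 0.05]),
                    [(0, 0), (1, 1)], [2], [2])
print(good, bad)
```

Because the overlap scores multiply into every matching score, gradients from matched pairs reach the overlap head even without a dedicated positive overlap label.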
- Q2: The term "partial tokens" needs clarification. Details on computing the coarse matching for spot-guided attention are also needed.
A: "Partial tokens" means that instead of attending to all nodes, our attention only selects some of them to attend to. To be specific, the consistency-aware self-attention only attends to salient nodes sampled from a global compatibility graph, while the spot-guided cross-attention only attends to nodes selected based on local consistency, i.e., the spot. We have carefully revised the writing and figures of the Method section in the general comments and supplementary PDF, which clearly provide the details of how to match the semi-dense nodes and evaluate their consistency in the Architecture subsection.
- Suggestion
Thank you for pointing these out! We will ensure consistency in font sizes across Tables 3 and 4. We also replace the terminology "patches" in line 123 and similar cases with "nodes", which is more appropriate.
I appreciate the authors' clarifications. My concerns have been addressed.
Dear reviewer,
Thank you again for dedicating time to review our paper and rebuttal! Thank you so much for your appreciation of our clarifications and revisions, and we sincerely look forward to your reevaluation of the rating at your convenience during the discussion period!
Sincerely yours,
Authors
This manuscript mainly focuses on learning-based feature matching for point cloud registration. The authors propose a consistency-aware spot-guided transformer and a lightweight fine matching module. Experiments on both indoor and outdoor benchmarks prove the effectiveness of the designs. However, this article requires significant revisions and improvements in the narrative of the method introduction. It needs to provide a clearer sequence of the entire process, the inputs and outputs of each module, and the specific implementations of the key designs.
Strengths
- It is reasonable to improve the attention architecture by utilizing geometric consistency constraints.
- Effective registration result improvements have been achieved on both indoor and outdoor datasets.
Weaknesses
- Figure 1 should be presented with greater clarity and precision. Currently, the coarse matching module and fine matching module are not labeled in the figure, making it difficult to correspond the depicted process with the title description.
- In section 3.2 of the Methods part, the definition of “spot” by the authors is not clear. “Spot” seems to be a key concept in this section, yet it is only briefly described in line 145. Is “spot” referring to a patch? How does it guide the cross-attention?
- The logic in the Methods section is not sufficiently clear, requiring repeated and careful reading. The inputs and outputs of each module are not explicitly stated. For instance, what are the inputs and outputs of the Sparse-to-Dense Fine Matching module in section 3.3? This module is also not represented in the flowchart.
- What is the sequence of the Consistency-Aware Self-Attention and Spot-Guided Cross-Attention in the process? Figure 1 shows that self-attention is performed first, so it is recommended to follow the same sequence in the detailed introduction.
- In Formula 4, is the pre-defined threshold σc set differently for datasets of varying sizes?
- How generalizable is the method proposed in this paper? If trained on the 3DMatch dataset, how would the results be when tested on outdoor datasets such as ETH?
- The method designed in this paper seems reasonable and effective. It is suggested that the authors make the code open-source for others to learn from.
Questions
Please see Weaknesses.
Limitations
Please see Weaknesses.
Thank you for your constructive review and valuable suggestions. Below, we offer detailed responses to each of your comments and questions. If there are any points where our answers don't fully address your concerns, please let us know, and we will respond as quickly as possible.
- Weaknesses 1, 3, and 4 about writing.
Thank you for your valuable suggestions. We have carefully revised the writing and figures of the Method section and released them in the supplementary PDF of the author rebuttal and in the general comments, respectively; we would appreciate it if you could read this version.
- Specifically, we adjust Figure 1 to clearly label the coarse and fine matching modules as well as their submodules, and include more details to achieve greater clarity and precision.
- The inputs and outputs of each module are explicitly stated in our revised version and depicted in the revised Figure 1. For instance, the inputs of the sparse matching are the coarse correspondences and the semi-dense nodes with features from the FPN, while the outputs of the single-head attention and the compatibility graph embedding (submodules) are the virtual correspondences and their confidence weights, respectively, which are also the outputs of sparse matching. These virtual correspondences with weights, along with the dense features, are then the inputs of dense matching, which estimates the pose for coarse alignment and further local matching, and outputs the final pose estimate.
- Finally, we have added an extra Architecture subsection to introduce the pipeline and kept the writing in the same order as the figure; for instance, the consistency-aware self-attention comes before the spot-guided cross-attention. The architecture of our coarse matching module is designed as a sequence of blocks for attention-based multi-scale feature aggregation. Each block takes both semi-dense features and coarse features as inputs; we first feed the coarse features into a self-attention module and a cross-attention module, and then the coarse and semi-dense features are fused into each other. Finally, the semi-dense features are fed into a consistency-aware self-attention module and then a spot-guided cross-attention module at the end of each block. We believe our revised version can address most of your concerns about writing and presentation.
- Weakness 2 about the description of spot.
We have carefully revised this part, and you can check it in the supplementary PDF (Figure 2) of the author rebuttal and general comments. Here we would like to briefly describe the definition and effect of the spot. Before spot-guided cross-attention, we compute a coarse correspondence set by matching. For each node with a correspondence in this set, we select a subset of seeds from its neighborhood and construct a region of interest, namely its spot, from the correspondences of the node itself and of its seeds. The seed selection keeps only the neighbors with reliable correspondences, and we propose a consistency-aware matching confidence criterion to rank the neighbors. Effect: as shown in Figure 2 of the supplementary PDF, global cross-attention tends to aggregate features from wrong regions under the disturbance of many similar yet irrelevant regions, leading to false correspondences. Our formulation of the spot is inspired by local consistency: the correspondences of adjacent 3D points remain close to each other. Hence, the spots defined above are likely to cover the true correspondences of the query nodes, providing guidance for feature aggregation without interference from irrelevant areas.
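A minimal sketch of this spot construction, in our own (hypothetical) notation rather than the paper's: given each node's nearest neighbors, its coarse correspondence, and a per-node confidence score standing in for the consistency-aware criterion, the spot gathers the correspondences of the node and of its most reliable neighbors.

```python
import numpy as np

def build_spots(nn_idx, corr, confidence, n_seeds=4):
    """Toy sketch of spot construction (names are ours, not the paper's).

    nn_idx: (N, K) indices of each source node's K nearest neighbors;
    corr: (N,) index of each node's coarse correspondence in the target,
          or -1 if the node is unmatched;
    confidence: (N,) consistency-aware matching confidence per node.
    Returns a dict: node index -> target-side region of interest (its spot).
    """
    spots = {}
    for i in range(len(corr)):
        if corr[i] < 0:
            continue
        # Keep only neighbors that themselves have reliable correspondences,
        # ranked by the confidence criterion.
        valid = [j for j in nn_idx[i] if corr[j] >= 0]
        seeds = sorted(valid, key=lambda j: -confidence[j])[:n_seeds]
        # Local consistency: correspondences of adjacent nodes stay close,
        # so the spot gathers the correspondences of the node and its seeds.
        spots[i] = sorted({int(corr[i])} | {int(corr[j]) for j in seeds})
    return spots

# Toy example with 5 source nodes, each having 2 nearest neighbors.
nn_idx = np.array([[1, 2], [0, 2], [0, 3], [2, 4], [3, 0]])
corr = np.array([5, 6, -1, 7, 8])     # node 2 is unmatched
conf = np.array([0.9, 0.8, 0.1, 0.7, 0.6])
spots = build_spots(nn_idx, corr, conf)
print(spots)
```

The resulting spots would then serve as the per-query key sets of the spot-guided cross-attention, so each node only aggregates features from its own region of interest.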
- Weakness 5: In Formula 4, is the pre-defined threshold set differently for datasets of varying sizes?
Yes, the threshold for evaluating consistency should be set differently for datasets of varying sizes. For simplicity, we set it to 6 times the initial voxel size used for down-sampling.
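The edge-length check that this threshold gates can be sketched as follows. This is the standard rigidity-based compatibility test (the distance between two source points should match the distance between their correspondences); the function name is ours, and the threshold choice follows the authors' reply.

```python
import numpy as np

def compatibility_graph(src, dst, sigma_c):
    """Adjacency matrix of pairwise rigidity checks between correspondences.

    src, dst: (N, 3) matched points (correspondence i is src[i] <-> dst[i]);
    sigma_c: edge-length difference threshold, e.g. 6x the initial
    voxel size as in the authors' reply.
    """
    d_src = np.linalg.norm(src[:, None] - src[None, :], axis=-1)
    d_dst = np.linalg.norm(dst[:, None] - dst[None, :], axis=-1)
    compat = np.abs(d_src - d_dst) < sigma_c
    np.fill_diagonal(compat, False)
    return compat

# Toy check: inliers of a rigid motion are mutually compatible,
# while a corrupted correspondence is compatible with none of them.
rng = np.random.default_rng(0)
src = rng.uniform(0, 10, size=(5, 3))
theta = 0.5
R = np.array([[np.cos(theta), -np.sin(theta), 0],
              [np.sin(theta),  np.cos(theta), 0],
              [0, 0, 1]])
dst = src @ R.T + np.array([1.0, 2.0, 3.0])
dst[4] += 100.0                      # break the last correspondence
compat = compatibility_graph(src, dst, sigma_c=0.5)
print(compat)
```

Scaling `sigma_c` with the voxel size makes the tolerance track the point density of each dataset, which is why a fixed multiple of the initial voxel size is a reasonable simplification.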
- Weakness 6: How generalizable is the method proposed in this paper? If trained on the 3DMatch dataset, how would the results be when tested on outdoor datasets such as ETH?
We present our results on generalizability and discuss the potential in realistic applications in part 3 of the author rebuttal. Although our method trained on 3DMatch fails to deploy on ETH due to an out-of-memory error, we evaluate the generalizability of different methods when trained on KITTI and tested on ETH. Note that these datasets use a Velodyne-64 3D LiDAR and a Hokuyo 2D LiDAR, respectively, leading to very different appearances of the point clouds, hence we believe this is a solid way to show the generalizability of different methods. Our method showcases better generalizability than GeoTransformer thanks to the consistency, but it falls behind SpinNet and BUFFER using RANSAC, mainly owing to the point-wise FPN backbone and the lightweight yet learnable fine matching. Moreover, we conduct an unsupervised domain adaptation (UDA) experiment, indicating that our model easily adapts to an unseen domain and achieves robust and accurate performance after a short period of unsupervised tuning (only 20 min for an epoch!). For applications, we believe generalizing or quickly adapting from one outdoor LiDAR dataset to another is a more realistic setting than generalizing from an RGB-D camera dataset (3DMatch) to a LiDAR dataset (ETH), as many papers do.
- Weakness 7: The method designed in this paper seems reasonable and effective. It is suggested that the authors make the code open-source for others to learn from.
Thank you for your suggestion! Our source code was submitted as supplementary material before the rebuttal, and we will make it open source once the paper is accepted.
After reviewing the authors' revisions and considering the rebuttal and feedback from other reviewers, I find that while the authors have generally addressed my initial concerns, there remains no compelling reason to adjust my initial evaluation either upward or downward. Therefore, I will maintain my original rating for the manuscript.
Dear reviewer,
Thank you again for dedicating time to review our paper! Thank you so much for your appreciation of our rebuttal and revisions!
Sincerely yours,
Authors
This paper focuses on feature matching for point cloud registration. To this end, it aims to improve the effectiveness of the coarse-to-fine matching mechanism by designing a consistency-aware spot-guided Transformer (CAST). More specifically, the proposed method incorporates a spot-guided cross-attention module to avoid interfering with irrelevant areas, and a consistency-aware self-attention module to enhance matching capabilities. Extensive experiments on both indoor and outdoor datasets are conducted to validate the proposed model.
Strengths
- The overall writing is satisfactory;
- The experiments on outdoor benchmarks look good.
Weaknesses
- One major issue is the novelty of this paper. The proposed method appears to roughly follow the pipeline of GeoTransformer, but introduces outlier rejection techniques and more attention layers. Moreover, it is also necessary to report the size of the proposed model (in terms of the number of parameters).
- The second major concern lies in the performance on the indoor benchmarks, 3DMatch and 3DLoMatch. As aforementioned, with many modules added to GeoTransformer, the proposed method still fails to outperform it. The advantages of the proposed method in indoor scenarios should be better demonstrated.
Questions
See weaknesses.
Limitations
Limitations have been included as a part of the Appendix.
Thank you for your constructive review and valuable suggestions. Below, we offer detailed responses to each of your comments and questions. If there are any points where our answers don't fully address your concerns, please let us know, and we will respond as quickly as possible.
- Weakness 1: Novelty
Our method roughly follows the pioneering coarse-to-fine matching framework CoFiNet (NeurIPS 2021), but there are significant improvements beyond performance. We believe it is a novel way of incorporating both local and global consistency into feature aggregation, rather than into outlier removal as in existing registration methods such as SC2-PCR and PointDSC. It is also important that our method makes learning-based registration more efficient and scalable for real-time applications such as odometry and SLAM.
Ever since the publication of CoFiNet (NeurIPS 2021), many works have roughly followed its pipeline and revised some of the modules, including GeoTransformer (CVPR 2022), OIF-Net (NeurIPS 2022), PEAL (CVPR 2023), etc. These works widely adopt self-attention and cross-attention for coarse matching; however, CoFiNet and GeoTransformer adopt global cross-attention for feature aggregation, which inevitably attends to similar yet irrelevant areas, resulting in misleading feature aggregation and inconsistent correspondences. Additionally, these methods focus on matching among very coarse nodes without considering geometric consistency, which is not tight enough for fine matching. Instead, we focus on feature aggregation among semi-dense features, leveraging both local and global geometric consistency to tackle the sparsity and looseness of coarse matching. To be specific, our consistency-aware self-attention only attends to salient nodes sampled from a global compatibility graph, while our spot-guided cross-attention only attends to nodes selected based on local consistency. We believe this is a novel way to introduce geometric consistency into feature aggregation rather than into outlier rejection, as in PointDSC and SC2-PCR.
More importantly, our overall design, using a complex coarse matching module and a very lightweight fine matching module, is motivated by the efficiency, accuracy, and scalability requirements of real-time applications such as LiDAR odometry and SLAM. In a LiDAR odometry system, it is a consensus that real-time frame-to-map registration is essential for accuracy, rather than frame-to-frame registration as in this task. However, all existing coarse-to-fine methods fail to achieve real-time performance. Moreover, their fine matching is tightly coupled with coarse matching, which requires node-based partitioning and highly time-consuming optimal transport for patch-to-patch correspondences. Therefore, existing methods are not scalable to SLAM. We believe the key is to use only a lightweight fine matching process, because coarse matching is not necessary in odometry with a small pose deviation. Hence, we design a sparse-to-dense fine matching pipeline that allows independent deployment without coarse matching to meet these requirements, and which can efficiently establish virtual correspondences for any point and its neighbors in the other point cloud. Moreover, we validate the feasibility of this scheme using a learnable inlier classifier instead of existing time-consuming pose estimators (RANSAC, PointDSC, MAC, etc.). Nevertheless, coarse matching is essential for place recognition and global re-localization in SLAM, which have no real-time requirement, hence we introduce consistency into it to make the whole system robust to large pose deviations.
Finally, extensive experiments validate the accuracy, robustness, and efficiency of our method on both indoor and outdoor scenes. It is noteworthy that our model easily adapts to an unseen domain and achieves robust and accurate performance after a short period of unsupervised tuning (only 20 min per epoch!), which is detailed in part 3 of the author rebuttal. The model sizes of popular methods are also reported as follows:
| Method | Model Size (M) |
|---|---|
| REGTR | 11.85 |
| CoFiNet | 5.48 |
| Predator | 7.42 |
| GeoTransformer | 9.83 |
| RoITr | 10.10 |
| CAST | 8.55 |
- Weakness 2: Performance on indoor datasets
We have updated our results on 3DLoMatch and discuss the performance of our method in low-overlap scenes in part 1 of the author rebuttal. Please refer to it; thank you for your comment!
Note that on the indoor benchmark 3DMatch, our method achieves the highest RR of 95.2%. On 3DLoMatch, it achieves 75.1% RR, surpassing all descriptors and all non-iterative correspondence-based methods except OIF-Net. As for DiffusionPCR and PEAL, which show higher RR than the others, they iteratively apply a variant of GeoTransformer with overlap priors for multi-step matching, which is extremely time-consuming (10x our runtime), and PEAL additionally uses information from 2D images. Notably, our method achieves this superior performance with the lowest time consumption.
The authors' response addresses my concerns well and I will increase my score.
Dear reviewer,
Thank you again for dedicating time to reviewing our paper, and for your appreciation of our rebuttal and the reevaluation of your score!
Sincerely yours,
Authors
Overview
Given two partially overlapped point clouds $X$ and $Y$, the point cloud registration problem can be formulated as solving the optimal rigid transformation between them by minimizing the weighted sum of point-to-point errors over a predicted correspondence set $\mathcal{C}=\{(x_i, y_i)\}$ with a confidence weight $w_i$ for each correspondence $(x_i, y_i)$:
$$R^*, t^* = \underset{R,\, t}{\arg\min} \sum_{(x_i,y_i)\in\mathcal{C}} w_i \left\Vert Rx_i + t - y_i \right\Vert_2^2, (1)$$
where $R$ and $t$ are the rotation and the translation between $X$ and $Y$, respectively.
As depicted in Figure 1, CAST follows a coarse-to-fine feature matching and registration architecture, including a feature pyramid network, a consistency-aware spot-guided attention-based coarse matching module, and a sparse-to-dense fine matching module. We first utilize a KPConv-based fully convolutional network [1] to extract multi-scale features. We denote the decoder feature maps at resolution $1/k$ as $F^{1/k}$, where $F^{1/4}$ and $F^{1/8}$ correspond to the semi-dense nodes and the coarse nodes down-sampled from the input point clouds, respectively. For coarse matching, we first adopt an efficient linear cross-attention [2] module to enhance the coarse features $F^{1/8}$. Then both the semi-dense features $F^{1/4}$ and the coarse features $F^{1/8}$ are fed into a consistency-aware spot-guided attention-based coarse matching module to improve feature distinctiveness. The similarity matrix between the enhanced semi-dense features is computed as the inner product $S = \hat{F}_X(\hat{F}_Y)^\mathsf{T}$. Furthermore, we feed the enhanced features into a point-wise MLP to predict overlap scores, which encode the likelihood of a node having a correspondence. We perform dual-softmax on $S$ to obtain the final matching scores:
$$P_{ij} = o^X_i\, o^Y_j \underset{k\in \{1,\cdots,M'\}}{\text{softmax}}(S_{kj})_i \underset{k\in \{1,\cdots,N'\}}{\text{softmax}}(S_{ik})_j, (2)$$
where $o^X_i$ and $o^Y_j$ are the predicted overlap scores of the $i$-th node of $X$ and the $j$-th node of $Y$, respectively. We use the mutual nearest neighbor scheme to select confident coarse correspondences. For highly efficient fine matching, we extract a keypoint from the neighborhood of each semi-dense node in $X$, and predict its virtual correspondence based on the neighborhood of the corresponding node in $Y$. Then we utilize compatibility graph embedding to predict the confidence of these keypoint correspondences as the weights in Eq.1 for initial pose estimation. Finally, a lightweight local matching module for the dense points of $X$ and $Y$ predicts dense correspondences to refine the pose.
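The dual-softmax scoring with overlap weighting and the mutual-nearest-neighbor selection described above can be sketched in numpy as follows (a simplified illustration with our own function and variable names, not the released implementation):

```python
import numpy as np

def _softmax(a, axis):
    # numerically stable softmax along one axis
    e = np.exp(a - a.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def coarse_match(feat_x, feat_y, overlap_x, overlap_y):
    """Dual-softmax matching scores weighted by overlap scores,
    followed by mutual-nearest-neighbor correspondence selection.

    feat_x: (M, D) and feat_y: (N, D) node features;
    overlap_x: (M,) and overlap_y: (N,) predicted overlap likelihoods.
    """
    sim = feat_x @ feat_y.T                      # inner-product similarity S
    scores = (overlap_x[:, None] * overlap_y[None, :]
              * _softmax(sim, axis=0) * _softmax(sim, axis=1))
    # keep (i, j) iff j is the best column for row i AND i is the best row for column j
    row_best = scores.argmax(axis=1)
    col_best = scores.argmax(axis=0)
    matches = [(i, j) for i, j in enumerate(row_best) if col_best[j] == i]
    return scores, matches
```

With distinctive features, every node finds its partner and the mutual check filters ambiguous pairs; with indistinct features, few mutual pairs survive, which is exactly the confidence behavior the coarse matcher relies on.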
Consistency-Aware Spot-Guided Attention
To tackle the sparsity and looseness of coarse matching, we focus on feature aggregation among semi-dense features leveraging both local and global geometric consistency. To be specific, the self-attention only attends to salient nodes sampled from a global compatibility graph, while the cross-attention only attends to nodes sampled based on local consistency, which are referred to as consistency-aware self-attention and spot-guided cross-attention, respectively.
Preliminaries. Transformers stacking alternate self-attention and cross-attention layers have showcased advanced performance in coarse feature matching. When $D$-dimensional features $F_A$ attend to $F_B$, the output of vanilla attention is formulated as:
$$\hat{F}_A = \text{softmax}\left( \frac{1}{\sqrt{D}}F_AW_Q(F_BW_K)^\mathsf{T}\right)F_BW_V, (3)$$
where $W_Q,W_K,W_V$ are learnable linear transformations to generate queries, keys, and values. When $F_A,F_B$, related to coordinates $P_A,P_B$, are from the same point cloud, it becomes self-attention, which requires positional encoding to embed spatial information. To encode the 3D relative positions, we equip the rotary positional embedding [3] $\tilde{R}(\cdot)$ with learnable weights $b_1,\cdots,b_{D/2}\in \mathbb{R}^{1\times 3}$:
$$\tilde{R}(p) = \begin{bmatrix} R(b_1p) & & \mathbf{0}\\ & \ddots &\\ \mathbf{0} & & R(b_{D/2}p) \end{bmatrix}, \quad R(\theta) = \begin{bmatrix} \cos{\theta} & -\sin{\theta} \\ \sin{\theta} & \cos{\theta} \end{bmatrix},\quad \forall p\in\mathbb{R}^3. (4)$$
When applied to vanilla self-attention, the output is formulated as:
$$\hat{F}_A = \text{softmax}\left( \frac{1}{\sqrt{D}}F_AW_Q\tilde{R}(P_A)(F_BW_K\tilde{R}(P_B))^\mathsf{T}\right)F_BW_V. (5)$$

**Spot-Guided Cross-Attention.** As shown in Figure 2, global cross-attention tends to aggregate features from wrong regions under the disturbance of many similar yet irrelevant regions, leading to false correspondences. Inspired by local consistency, i.e., the correspondences of adjacent 3D points remain close to each other, we design the spot-guided cross-attention depicted in Figure 2. For each node, we select a subset of seeds from its neighborhood and construct a region of interest, namely its spot, from the neighborhoods of the seeds' correspondences. Seed selection keeps only those neighbors with reliable correspondences. We propose a consistency-aware matching confidence criterion to rank the neighbors, formulated as the sum of the matching score and the normalized generalized degree in the compatibility graph. This criterion incorporates feature similarity and geometric consistency to properly measure the reliability of correspondences for seed selection. Finally, semi-dense features attend to their spots for feature aggregation according to Eq.3. Under the guarantee of local consistency, the spots are likely to cover the true correspondences, providing guidance for feature aggregation without interference from irrelevant areas.
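A minimal numpy sketch of the 3D rotary embedding in Eq.4 may help: each consecutive feature pair $(f_{2k}, f_{2k+1})$ is rotated by the angle $b_k p$, so relative positions enter the attention logits of Eq.5 while feature norms are preserved. Function and variable names here are illustrative, not from the released code:

```python
import numpy as np

def rotary_embed_3d(feats, coords, b):
    """Apply the block-diagonal 3D rotary positional embedding of Eq.4.

    feats:  (N, D) features, D even.
    coords: (N, 3) point coordinates p.
    b:      (D//2, 3) learnable projection vectors b_1..b_{D/2}.
    """
    theta = coords @ b.T                   # (N, D//2) rotation angles b_k . p
    cos, sin = np.cos(theta), np.sin(theta)
    f = feats.reshape(len(feats), -1, 2)   # (N, D//2, 2) consecutive feature pairs
    out = np.empty_like(f)
    out[..., 0] = cos * f[..., 0] - sin * f[..., 1]   # 2x2 rotation per pair
    out[..., 1] = sin * f[..., 0] + cos * f[..., 1]
    return out.reshape(feats.shape)
```

Because each block $R(b_k p)$ is orthogonal, the embedding changes only the phase of the query-key inner product, which is what makes the attention logits depend on relative positions $P_A - P_B$.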
Sparse-to-Dense Fine Matching
Given a coarse correspondence set selected as mutual nearest neighbors from the final coarse matching scores (Eq.2), we propose a lightweight sparse-to-dense fine matching module for hierarchical pose estimation without optimal transport, maintaining scalability and efficiency. For sparse matching, we first adopt k-nearest neighbor (kNN) search centered on the semi-dense nodes to group patches of dense points from $X$ and $Y$, then we use an attentive keypoint detector [7] to predict a repeatable keypoint with a descriptor from each patch. Each keypoint $x_i$ of point cloud $X$ is assigned to its nearest node, and each node $y_j^S$ with a correspondence in $Y$ groups a patch of keypoints $\mathcal{P}(y_j^S)$ via kNN. Then, a keypoint $x_i$ assigned to the node corresponding to $y_j^S$ will correspond to the patch $\mathcal{P}(y_j^S)$, forming a keypoint-to-patch correspondence. Finally, we utilize a single-head attention layer for each keypoint-to-patch correspondence to predict virtual correspondences for the keypoints. Denoting the descriptor of $x_i$ as $d_i^X$, the virtual correspondence $\hat{y}_i$ with feature $\hat{d}_i^Y$ is predicted from the keypoints in $\mathcal{P}(y_j^S)$:
$$\hat{y}_i = \sum _{j=1}^k \underset{j\in\{1,\cdots,k\}}{\text{softmax}}\left(d^X_i\overline{W}_Q(d^Y _{i_j}\overline{W}_K)^\mathsf{T}\right) y _{i_j}, \quad \hat{d}_i^Y = \sum _{j=1}^k \underset{j\in\{1,\cdots,k\}}{\text{softmax}}\left(d^X_i\overline{W}_Q(d^Y _{i_j}\overline{W}_K)^\mathsf{T}\right) d^Y _{i_j}, (9)$$
where $d^Y_{i_1},\cdots,d^Y_{i_k}$ are the descriptors of $y_{i_1},\cdots,y_{i_k}$ with top-k descriptor similarity in $\mathcal{P}(y_j^S)$, and $\overline{W}_Q$ and $\overline{W}_K$ are learnable weights. Inspired by PointDSC [5], we construct a compatibility graph $B$ (Eq.8) of the sparse keypoint correspondences $\{(x_i,\hat{y}_i)\}$ for spatial consistency filtering via compatibility graph embedding:
$$E^{(l+1)}=\text{softmax}\left(\frac{1}{\sqrt{D_e}} E^{(l)}W_Q^{(l)}(E^{(l)}W_K^{(l)})^\mathsf{T} \odot B \right)E^{(l)}W_V^{(l)}, \quad E^{(0)}_i = \text{MLP}([ x_i,d^X_i, \hat{y}_i, \hat{d}^Y_i ]), (10)$$
where $E^{(l)}$ is the correspondence-wise embedding of the $l$-th layer with learnable weights $W_Q^{(l)},W_K^{(l)},W_V^{(l)}$. Finally, the embedding is fed into an MLP to classify whether a correspondence is an inlier. The predicted inlier confidences serve as the weights of the keypoint correspondences for pose estimation formulated as Eq.1, which can be solved analytically by the weighted Kabsch algorithm [8]. It is noteworthy that this process is very lightweight and scales to large-scale registration tasks.
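The geometric compatibility of Eq.8 and the weighted Kabsch solver for Eq.1 can both be sketched compactly in numpy (a simplified illustration under standard definitions, not the paper's implementation):

```python
import numpy as np

def compatibility(x, y, sigma_c):
    """Pairwise geometric compatibility of correspondences (x_i, y_i), as in Eq.8."""
    dx = np.linalg.norm(x[:, None] - x[None, :], axis=-1)   # pairwise lengths in X
    dy = np.linalg.norm(y[:, None] - y[None, :], axis=-1)   # pairwise lengths in Y
    d = np.abs(dx - dy)                                     # length differences d_ij
    return np.clip(1.0 - d ** 2 / sigma_c ** 2, 0.0, None)  # hinge [.]^+

def weighted_kabsch(x, y, w):
    """Closed-form solution of Eq.1: R, t minimizing sum_i w_i ||R x_i + t - y_i||^2."""
    w = w / w.sum()
    cx, cy = w @ x, w @ y                                   # weighted centroids
    H = (x - cx).T @ np.diag(w) @ (y - cy)                  # weighted cross-covariance
    U, _, Vt = np.linalg.svd(H)
    S = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])  # avoid reflections
    R = Vt.T @ S @ U.T
    t = cy - R @ cx
    return R, t
```

The row (or column) sums of the compatibility matrix give the generalized degrees used for consistency ranking, and the inlier confidences predicted by the classifier play the role of `w` in `weighted_kabsch`.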
After aligning the two point clouds based on sparse matching, we refine the transformation based on dense matching. We still utilize local attention (Eq.9) to predict the correspondence of each dense point from its neighbors in the other point cloud within a radius, and we simply set the confidence weight of a correspondence according to its distance. By solving Eq.1 again with both sparse and dense correspondences, we can achieve more accurate pose estimation efficiently.
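For a single query keypoint, the local attention of Eq.9 reduces to a softmax-weighted average over its $k$ candidate neighbors; a minimal numpy sketch (identifiers are hypothetical and the projections default to identity for clarity):

```python
import numpy as np

def virtual_correspondence(d_x, cand_pts, cand_desc, wq=None, wk=None):
    """Single-head local attention of Eq.9: predict a virtual correspondence
    for one query descriptor from k candidate points and descriptors.

    d_x:       (D,)   query keypoint descriptor.
    cand_pts:  (k, 3) candidate points y_{i_1..i_k}.
    cand_desc: (k, D) their descriptors.
    wq, wk:    (D, D) learnable projections (identity if omitted).
    """
    D = d_x.shape[0]
    wq = np.eye(D) if wq is None else wq
    wk = np.eye(D) if wk is None else wk
    logits = (d_x @ wq) @ (cand_desc @ wk).T   # (k,) attention logits
    a = np.exp(logits - logits.max())
    a /= a.sum()                               # softmax over the k candidates
    y_hat = a @ cand_pts                       # virtual point  (Eq.9, left)
    d_hat = a @ cand_desc                      # virtual feature (Eq.9, right)
    return y_hat, d_hat
```

Because the output is a convex combination of candidate points, the virtual correspondence always lies inside the candidate patch, which keeps the subsequent pose refinement well conditioned.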
- [7] Fan Lu et al. "RSKDD-Net: Random sample-based keypoint detector and descriptor." Advances in Neural Information Processing Systems (NeurIPS), 2020.
- [8] Wolfgang Kabsch. "A solution for the best rotation to relate two sets of vectors." Acta Crystallographica Section A: Crystal Physics, Diffraction, Theoretical and General Crystallography, 1976.
Architecture. As both spot-guided cross-attention and consistency-aware self-attention are sparse attention mechanisms lacking abundant global context, we propose to enhance the semi-dense features via multi-scale feature fusion with the coarse features. Hence, our coarse matching module is designed as a sequence of blocks for attention-based multi-scale feature aggregation. For each block taking both semi-dense features $F^{1/4}$ and coarse features $F^{1/8}$ as inputs, we first feed $F^{1/8}$ into a self-attention module (Eq.5) and a cross-attention module (Eq.3). Then $F^{1/4}$ and $F^{1/8}$ are fused into each other via nearest up-sampling and distance-based interpolated down-sampling [4]:
$$\hat{F}^{1/4} = F^{1/4} + \text{MLP}(\text{Nearest Up-sampling}(F^{1/8})), \quad \hat{F}^{1/8} = F^{1/8} + \text{MLP}(\text{Interpolated Down-sampling}(F^{1/4})). (6)$$
Finally, $\hat{F}^{1/4}$ is fed into a consistency-aware self-attention module and a spot-guided cross-attention module at the end of each block. Before these sparse attention modules, we need to match the semi-dense features and evaluate the geometric consistency as a clue to select sparse yet instructive tokens. Given semi-dense features $\hat{F}_X^{(l)},\hat{F}_Y^{(l)}$ in the $l$-th block, the matching score is formulated as:
$$P_{ij}^{(l)} = \underset{k\in \{1,\cdots,M'\}}{\text{softmax}}(S _{kj}^{(l)})_i \underset{k\in \{1,\cdots,N'\}}{\text{softmax}}(S _{ik}^{(l)})_j, \quad S^{(l)}=\hat{F}_X^{(l)} (\hat{F}_Y^{(l)})^{\mathsf{T}}. (7)$$
Then the correspondence of each node can be obtained as the node from the other point cloud with the highest matching score, forming a correspondence set $C^{(l)}$. An insight about the consistency among correspondences is that the distance between two points is invariant under a rigid transformation. Hence, geometric compatibility is adopted as a simple yet effective measure of consistency [5,6], based on the length difference between pairwise line segments. Given a pre-defined threshold $\sigma_c$, the pair-wise geometric compatibility of two correspondences $(x_i^S, y_i^S)$ and $(x_j^S, y_j^S)$ is formulated as:
$$\beta_{ij}=\left[1 - d_{ij}^2 / \sigma_c^2 \right]^+, \quad d_{ij}=\left| \Vert x_i^S - x_j^S\Vert_2 - \Vert y_i^S - y_j^S\Vert_2\right|. (8)$$
The compatibility matrix $B_c=[\beta_{ij}]$ is also the adjacency matrix of a weighted undirected graph known as the compatibility graph, where each vertex is a correspondence pair and each edge weight is the compatibility between two correspondences. Intuitively, we adopt the generalized degree of a correspondence in the graph as a measure of global consistency, which quantifies the connectivity of a vertex as the sum of the edge weights connected to it.

**Consistency-Aware Self-Attention.** Intuitively, the correspondences of less salient nodes can be effectively located based on their geometric relationships to the salient ones. Hence, compared with global self-attention that attends to all nodes, attending only to salient nodes is more efficient and effective for encoding the geometric context for matching. We propose consistency-aware self-attention, which samples sparse salient nodes to be attended to based on both geometric consistency and feature similarity. Given the correspondence set $C^{(l)}$ with a compatibility graph, we perform two-stage sampling by ranking the generalized degrees and the matching scores, respectively. The first stage, graph sampling using generalized degrees, obtains sufficient consistent correspondences as proposals. The second stage, sampling based on matching scores, further obtains sparse salient nodes from these proposals. Finally, the semi-dense features $\hat{F}_X^{(l)},\hat{F}_Y^{(l)}$ attend only to the features of salient nodes from the same point cloud for feature aggregation according to Eq.5.

- [1] Hugues Thomas *et al.* "KPConv: Flexible and deformable convolution for point clouds." Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2019.
- [2] Angelos Katharopoulos *et al.* "Transformers are RNNs: Fast autoregressive transformers with linear attention." International Conference on Machine Learning (ICML), 2020.
- [3] Jianlin Su *et al.* "RoFormer: Enhanced transformer with rotary position embedding." Neurocomputing, 2024.
- [4] Charles R. Qi *et al.* "PointNet++: Deep hierarchical feature learning on point sets in a metric space." Advances in Neural Information Processing Systems (NeurIPS), 2017.
- [5] Xuyang Bai *et al.* "PointDSC: Robust point cloud registration using deep spatial consistency." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
- [6] Guofeng Mei *et al.* "COTReg: Coupled optimal transport based point cloud registration." arXiv preprint arXiv:2112.14381, 2021.

Dear reviewers,
Thank you for dedicating time to reviewing our paper. We thank reviewers L6BY, LKCi, and UeYt for appreciating our idea of using consistency in feature aggregation, and reviewer UeYt for pointing out that we make learning-based registration more efficient and scalable for real-time applications. We thank all reviewers for highlighting our superiority on outdoor scenes, especially reviewers LKCi and L6BY, who also highlighted our superiority on indoor scenes, and we thank reviewers L6BY and D9ok for their compliments on the writing. Meanwhile, reviewers LKCi and UeYt raised concerns about some details of the writing, while reviewers UeYt and D9ok raised concerns about the performance on indoor low-overlap scenes. Generalizability tests were suggested by reviewers LKCi and L6BY. Here we address these common issues:
1. Performance in low overlapping cases
We update the evaluation results in Table 1 of the supplementary pdf, as we noticed that the calculation of RR in our original version differed from other methods: the original manuscript reports RR over the whole dataset, whereas other methods' code averages RR over the eight sequences. Hence we adopt the same evaluation protocol for fairness.
On 3DMatch, our method achieves the highest RR of 95.2%. On 3DLoMatch, it achieves 75.1% RR, surpassing all descriptors and all non-iterative correspondence-based methods except OIF-Net. As CAST typically detects about 1000 keypoints and establishes fewer than 250 keypoint correspondences on 3DLoMatch after consistency filtering, it is fair to compare with other methods using only 250 points. Our method outperforms OIF-Net using 1000, 500, or 250 points, indicating its efficacy in low-overlap scenes. Though PEAL and DiffusionPCR show higher RR on 3DLoMatch, they iteratively apply a variant of GeoTransformer with overlap priors, which is extremely time-consuming (10x our runtime), and PEAL additionally uses priors from 2D images. Notably, our method achieves this superior performance with the lowest time consumption.
As suggested by reviewer UeYt, we compare CAST with SOTA registration methods using the FCGF descriptor, as cited in FastMAC and TEAR (CVPR'24), to further validate its efficacy. For fairness, we use their definition of RR, i.e., the fraction of point cloud pairs with RTE < 30 cm and RRE < 15°. As reported in Table 1, our method achieves the highest RR and the lowest registration errors, demonstrating its robustness and accuracy.
Table 1: Comparison on 3DM (3DMatch) and 3DLM (3DLoMatch).
| Method | RR (%) 3DM | RTE (cm) 3DM | RRE (°) 3DM | RR (%) 3DLM | RTE (cm) 3DLM | RRE (°) 3DLM |
|---|---|---|---|---|---|---|
| RANSAC-4M | 91.44 | 8.38 | 2.69 | 10.44 | 15.14 | 6.91 |
| TEASER++ | 85.77 | 8.66 | 2.73 | 46.76 | 12.89 | 4.12 |
| SC-PCR | 93.16 | 6.51 | 2.09 | 58.73 | 10.44 | 3.80 |
| DGR | 88.85 | 7.02 | 2.28 | 43.80 | 10.82 | 4.17 |
| PointDSC | 91.87 | 6.54 | 2.10 | 56.20 | 10.48 | 3.87 |
| MAC | 93.72 | 6.54 | 2.02 | 59.85 | 9.75 | 3.50 |
| FastMAC | 92.67 | 6.47 | 2.00 | 58.23 | 10.81 | 3.80 |
| CAST | 96.48 | 5.64 | 1.71 | 76.13 | 8.47 | 2.75 |
2. Revision of the manuscript and Figure 1
We have carefully revised the writing and figures of the Method section and released them in the general comments and the supplementary pdf below; we would appreciate it if you could read this version. We believe it addresses most reviewers' concerns about the writing.
3. Generalizability
For a fair comparison with existing methods trained on 3DMatch (indoor) and evaluated on ETH (outdoor), we would need to use the same very small voxel size during down-sampling, but our model with multi-scale feature aggregation runs out of memory on an RTX 3090. This cannot be solved even when we scale the point clouds to 1/8 of their original size. Therefore, we only evaluate generalizability when trained on KITTI (outdoor) and tested on ETH in Table 2, where RR on ETH denotes the fraction of point cloud pairs with RTE < 0.3 m and RRE < 2°. Our method achieves satisfactory accuracy and robustness, showcasing better generalizability than GeoTransformer thanks to consistency, but it falls behind SpinNet and BUFFER with RANSAC.
Table 2
| Method | RTE (cm) | RRE (°) | RR (%) |
|---|---|---|---|
| FCGF | 6.13 | 0.80 | 39.55 |
| Predator | 7.88 | 0.87 | 71.95 |
| SpinNet | 3.63 | 0.62 | 99.44 |
| GeoTransformer | 8.01 | 0.89 | 93.55 |
| BUFFER | 3.85 | 0.57 | 99.86 |
| CAST | 6.86 | 0.66 | 97.05 |
| CAST + UDA 1epoch | 5.29 | 0.58 | 99.44 |
| CAST + UDA 2epoch | 4.96 | 0.57 | 99.86 |
Patch-wise descriptors such as SpinNet and BUFFER generalize well thanks to their local characteristics, which are inherently more robust than point-wise features extracted from an FPN, as in FCGF, Predator, and coarse-to-fine methods. Moreover, we use a lightweight learnable pipeline for fine matching without robust pose estimators, which is unlikely to achieve high robustness in lower-inlier-ratio cases. Hence, we believe the subpar generalizability is mainly due to the backbone and fine matching. Nevertheless, coarse-to-fine methods significantly outperform descriptors when trained and evaluated on the same dataset.
Furthermore, we conduct an unsupervised domain adaptation experiment, mixing the KITTI point clouds with ground-truth poses and the ETH point clouds without ground-truth poses for fine-tuning. For ETH, we augment a point cloud via random rotation and cropping and learn to align it with its pre-augmentation version. The results indicate that our model easily adapts to an unseen domain and achieves robust and accurate performance after a short period of unsupervised tuning (only 20 min per epoch!). For applications, we believe generalizing or quickly adapting from one outdoor LiDAR dataset to another is a more realistic setting than generalizing from an RGB-D camera dataset (3DMatch) to a LiDAR dataset (ETH), as done in many papers.
This paper received mixed ratings of two Borderline Accepts, one Reject, and one Accept from four reviewers. During the rebuttal phase, Reviewer UeYt (who suggested Reject) stated that their concerns had been addressed but did not change the rating. After reading the reviewers' comments, the authors' responses, and the discussions, the AC believes the work can be accepted. Meanwhile, the authors are encouraged to improve the writing of this paper and include the important information provided during the rebuttal in the camera-ready version.