3D Focusing-and-Matching Network for Multi-Instance Point Cloud Registration
Abstract
Reviews and Discussion
This paper proposes FMNet, an end-to-end deep learning approach for multi-instance point cloud registration. The key novelty is an attention-weighted feature matching module that can adaptively focus on reliable point correspondences during matching. Unlike traditional two-step methods, FMNet jointly learns feature extraction and matching within a unified network using an attention mechanism. The attention module computes matching probabilities between point features through scaled dot-product, allowing the network to dynamically highlight meaningful matches. Integrated with geometric and cycle consistency losses, FMNet achieves state-of-the-art registration performance on multiple datasets, demonstrating robust handling of noise, occlusions and point density variations. This work pioneers applying attention for solving the point cloud registration problem in a unified deep learning framework.
Strengths
- The paper proposes an attention-weighted feature matching module that can adaptively focus on important point pairs, making matching more robust. This innovative design surpasses many previous hand-crafted or geometry-constrained feature matching strategies.
- The proposed FMNet end-to-end framework is well-designed, with clear mathematical explanations for the attention module. Extensive evaluations on public datasets and robustness analysis against noise/missing data provide strong empirical evidence.
Weaknesses
- The authors' contributions are not summarized as a list of points, which leaves them rather unclear.
- Even though the network is clearly described in the article, I think it is overly structured, which is why I find it less than innovative.
- In contrast to the previous work PointCLM, which uses density information for clustering, it seems to me that this article does not effectively extract and use density information or other cues for clustering. I think this information is more important for multi-instance alignment.
- Compared to previous work PointCLM and ECC, the article seems to be less theoretical, with a large number of formulas used to describe the structure of the network.
- I think this method of finding the center point first and then doing multi-instance point cloud registration is time consuming.
Questions
What is the approximate percentage of time spent finding the center point and the registration, respectively?
Limitations
Yes, the authors fully address the limitations of their proposed method.
We appreciate your diligent review and valuable feedback to help improve our paper.
Q1: The contribution and innovation of the method.
A1: Our contributions lie in three aspects:
(1) Our primary contribution does not lie in the network architecture but rather in proposing a new pipeline to address the multi-instance point cloud registration problem. Existing methods (such as PointCLM and MIRETR) mainly learn correspondences between one CAD model and multiple objects (one-to-many paradigm), while our method decomposes the one-to-many paradigm into multiple pair-wise point cloud registrations (multiple one-to-one paradigm) by first detecting the object centers and then learning the matching between the CAD model and each object proposal.
(2) Our new pipeline is simple yet powerful, achieving the new state-of-the-art on both Scan2CAD and ROBI datasets. Especially on the challenging ROBI dataset, our method significantly outperforms the previous SOTA MIRETR by about 7% in terms of MR, MP, and MF.
(3) The progressive decomposition approach of transforming multi-instance point cloud registration into multiple pair-wise registrations, as proposed in our paper, also holds significant insights for other tasks, such as multi-target tracking and map construction.
Q2: Reasons for not using density information (such as PointCLM) for clustering.
A2: PointCLM uses the feature density learned through contrastive learning to filter out matching pairs that do not belong to an instance before clustering. In contrast, our method decomposes the one-to-many paradigm into multiple one-to-one registrations by first detecting the object centers and then learning the matching between the CAD model and each object proposal. In the first (detection) stage, we filter out many background points by detecting the object centers and generating object proposals. In the second (matching) stage, instance mask learning and overlap mask learning further filter out noisy points. To verify this, we compute the inlier ratio on the ROBI dataset, i.e., the ratio of correspondences falling on objects to the total number of correspondences. The following table shows the results:
| Methods | Inlier Ratio (%) |
|---|---|
| PointCLM | 36.27% |
| MIRETR | 56.59% |
| 3DFMNet (ours) | 59.84% |
It can be observed that our method achieves the highest inlier ratio, which further demonstrates that our multiple one-to-one paradigm effectively improves the inlier ratio.
Q3: The article seems to be less theoretical, with a large number of formulas used to describe the structure of the network.
A3: As mentioned in A1, our contribution primarily lies in proposing a new paradigm for solving the multi-instance point cloud registration problem. Our method is simple yet powerful, achieving new state-of-the-art performance on both the Scan2CAD and ROBI datasets.
Q4: I think this method of finding the center point first and then doing multi-instance point cloud registration is time consuming.
A4: Generally, two-stage methods are more time-consuming than one-stage methods. In Table 2 of the main paper, we compare our method (3DFMNet) with existing methods (such as PointCLM and MIRETR):
| Methods | Type | Total Time (s) |
|---|---|---|
| PointCLM | Two-Stage | 0.63s |
| MIRETR | One-Stage | 0.40s |
| 3DFMNet (ours) | Two-Stage | 0.54s |
Our two-stage 3DFMNet (0.54s) is slower than the one-stage MIRETR (0.40s) but faster than the two-stage PointCLM (0.63s). Although the methods differ in inference time, they are all of the same order. One possible strategy for reducing the inference time of our two-stage method is to run the multiple pair-wise registrations in parallel. In the future, we will further consider reducing the inference time of our method.
Q5: The time spent on center point detection and on registration.
A5: We supplement the detailed inference time of each stage in the following table:
| Focusing Model Time (first stage) | Matching Model Time (second stage) | Total Time |
|---|---|---|
| 0.145s | 0.405s | 0.540s |
It can be observed that the first stage (Focusing Model) takes up only a small share of the time, while the second stage (Matching Model) takes up the larger share. Compared with the 0.400s inference time of MIRETR, our total inference time of 0.540s is slightly higher.
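The reviewer asked for approximate percentages, which can be derived from the per-stage times in the table above. The following is only a sketch: the function name is made up for illustration, and because the two stage times sum to 0.550s rather than the reported 0.540s total, the shares are normalized by the sum of the stage times (an assumption).

```python
def stage_shares(stage_times):
    """Return each stage's share of the summed stage time, in percent."""
    total = sum(stage_times.values())
    return {name: 100.0 * t / total for name, t in stage_times.items()}


# Per-stage inference times from the table above, in seconds.
times = {"focusing": 0.145, "matching": 0.405}
shares = stage_shares(times)

for name, pct in shares.items():
    print(f"{name}: {pct:.1f}%")
```

Under this normalization, the focusing stage accounts for roughly a quarter of the time and the matching stage for roughly three quarters.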
Dear Reviewer,
This is a gentle reminder to please review the rebuttal provided by the authors. Your feedback is crucial to the decision-making process. Please consider updating your score after reading the rebuttal.
Thank you for your help with the NeurIPS!
Best, Your AC
The discussion period will end in approximately 5 hours. We sincerely hope that you could review our rebuttal and respond accordingly. Our rebuttal offers comprehensive explanations to the questions you raised, and we believe that our work shows a new yet powerful pipeline and high performance in multi-instance point cloud registration. We have also provided the discussion of density information and inference time analysis. We are confident that our responses will alter your perception of this paper. We look forward to your reply to our rebuttal. Once again, thank you for your dedication and efforts in the review process.
This paper introduces a novel focusing-and-matching technique for addressing the multi-instance point cloud registration challenge. Instead of fitting multiple models from a set of incorrect correspondences, this method initially detects potential instance regions and subsequently performs standard pairwise point cloud registration. The approach was tested on two benchmarks, where it outperformed existing state-of-the-art methods in terms of recall and precision metrics.
Strengths
- The primary contribution of this paper is the establishment of a new approach to solving the multi-instance registration task. Unlike previous methods that rely on multi-model fitting (using RANSAC-like methods) from a set of spurious correspondences, which are computationally expensive and unreliable due to the large matching space, this method uses learned scene priors to narrow down the matching space within individual regions of interest before performing standard pairwise point cloud registration.
- The writing is clear and easy to follow. In particular, the methodology section is well-structured. The concept of each variable is connected and explained fluently.
Weaknesses
- Certain details in Section 3.2 need clarification. Specifically, when the first module predicts K object centers but there are only (K-2) ground truth instances, how does the pair-wise registration model manage these two falsely detected objects? From my understanding, it appears that the model only predicts the transformation parameters without providing additional confidence scores.
- The metrics require explanation or references. For MR, MP, and MF, how is a “registered instance” defined? What criteria are used to consider a predicted pose successful, such as RMSE, chamfer distance, relative translation error, or relative rotation error? Additionally, what are the thresholds?
- More experimental results related to the 3D focusing module are needed, as this module determines the number of pairwise registrations performed, which sets the upper bound for the number of “successfully registered instances.” Results such as the number of detected objects, correctly detected objects, and wrongly detected objects are necessary, as these significantly impact the metrics, including MR, MP, and MF.
Questions
I don't have any questions.
Limitations
The limitation I see is the performance of the 3D focusing module, especially its generalizability to unseen scenes. Once it detects wrong object proposals, the subsequent pairwise registration module will fail to register the instance as well.
We thank the reviewer for the diligent comments to improve the paper.
Q1: How does the pair-wise registration model manage these falsely detected objects?
A1: Our method is a two-stage approach, so we analyze the falsely detected objects in both stages.
(1) For the 3D multi-object focusing module (first stage), we compute the mean recall and mean precision of the detected object centers. Specifically, we use the distance between the predicted center and the ground-truth center, judged against a threshold defined in terms of the instance radius, as the criterion for a successful detection by the 3D multi-object focusing module. The results are shown in the following table:
| Datasets | Mean Recall (%) | Mean Precision (%) |
|---|---|---|
| Scan2CAD | 98.14% | 98.85% |
| ROBI | 80.30% | 99.99% |
On Scan2CAD and ROBI, it can be observed that our method achieves very high precision (98.85% and 99.99%), which means that our method can successfully detect the object centers.
(2) For the 3D dual-masking instance matching module (second stage), instance mask learning and overlap mask learning further filter out the falsely detected objects. Specifically, we study the influence of falsely detected objects on the Scan2CAD dataset. According to the above table, 1.05% of objects are falsely detected, i.e., there are 25 falsely detected objects on the Scan2CAD dataset. After instance mask learning and overlap mask learning, 22 of these cases are filtered out. Since instance mask learning and overlap mask learning cannot establish reliable matching points between a falsely detected object and the CAD model, few matching points are obtained for falsely detected objects, so SVD cannot solve for the relative pose. Therefore, falsely detected objects only slightly affect our final results.
Q2: Details of the metrics.
A2: We follow the settings used in MIRETR and previous work, which determine whether an instance is correctly registered based on the relative translation error (RTE) and relative rotation error (RRE) [1][2]. Specifically, we consider a registration successful when both RTE and RRE fall below their respective thresholds. Following existing methods, such as MIRETR, the voxel sizes for the Scan2CAD and ROBI datasets are set to 0.025m and 0.0015m, respectively.
[1] Yuan M, Li Z, Jin Q, et al. PointCLM: A Contrastive Learning-based Framework for Multi-instance Point Cloud Registration. In ECCV, 2022.
[2] Yu Z, Qin Z, Zheng L, et al. Learning Instance-Aware Correspondences for Robust Multi-Instance Point Cloud Registration in Cluttered Scenes. In CVPR, 2024.
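The RTE/RRE criterion named in A2 can be sketched with the standard definitions of the two errors. This is an illustrative sketch only: the function names are hypothetical, and the threshold values passed in the example are placeholders, since the actual thresholds are not stated in this response.

```python
import math


def rotation_error_deg(R_pred, R_gt):
    """Relative rotation error: angle of R_gt^T @ R_pred, in degrees."""
    # trace(R_gt^T R_pred) equals the sum of element-wise products.
    trace = sum(R_gt[i][j] * R_pred[i][j] for i in range(3) for j in range(3))
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))  # guard rounding
    return math.degrees(math.acos(cos_angle))


def translation_error(t_pred, t_gt):
    """Relative translation error: Euclidean distance between translations."""
    return math.dist(t_pred, t_gt)


def is_registered(R_pred, t_pred, R_gt, t_gt, rre_thresh_deg, rte_thresh):
    """An instance counts as registered when both errors are within bounds."""
    return (rotation_error_deg(R_pred, R_gt) <= rre_thresh_deg
            and translation_error(t_pred, t_gt) <= rte_thresh)


# Example: prediction is off by a 10-degree rotation about z and 3 cm in z.
a = math.radians(10.0)
R_pred = [[math.cos(a), -math.sin(a), 0.0],
          [math.sin(a), math.cos(a), 0.0],
          [0.0, 0.0, 1.0]]
R_gt = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
t_pred, t_gt = (0.0, 0.0, 0.03), (0.0, 0.0, 0.0)

# Placeholder thresholds (hypothetical, not taken from the paper).
ok_loose = is_registered(R_pred, t_pred, R_gt, t_gt, 15.0, 0.1)
ok_strict = is_registered(R_pred, t_pred, R_gt, t_gt, 5.0, 0.1)
```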
Q3: Experimental results related to the 3D multi-object focusing module are needed.
A3: To evaluate the 3D multi-object focusing module, we compute the mean recall (MR), mean precision (MP), and root mean square error (RMSE) of the detected object centers. Specifically, we use the distance between the predicted center and the ground-truth center, judged against a threshold defined in terms of the instance radius, as the criterion for a successful detection by the 3D multi-object focusing module. The results are shown in the following table:
| Datasets | MR (%) | MP (%) | RMSE (m) |
|---|---|---|---|
| Scan2CAD | 98.14% | 98.85% | 0.0814m |
| ROBI | 80.30% | 99.99% | 0.0065m |
It can be observed that our 3D multi-object focusing module has good results in terms of MR, MP, and RMSE, which ensures that our method can successfully detect the object centers in the first stage.
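One way such center-detection metrics could be computed is sketched below. Several details are assumptions rather than the authors' protocol: predictions are matched greedily to the nearest unmatched ground-truth center, the hit threshold is `ratio * radius` with `ratio` a hypothetical parameter (the exact criterion is not given in this response), and RMSE is taken over matched pairs only.

```python
import math


def center_detection_metrics(pred_centers, gt_centers, gt_radii, ratio):
    """Return (precision, recall, rmse) for predicted object centers.

    A prediction is a hit when it lies within ratio * radius of a
    ground-truth center that has not been matched yet.
    """
    matched_gt = set()
    sq_errors = []
    for p in pred_centers:
        best, best_d = None, float("inf")
        for k, (g, r) in enumerate(zip(gt_centers, gt_radii)):
            if k in matched_gt:
                continue
            d = math.dist(p, g)
            if d <= ratio * r and d < best_d:
                best, best_d = k, d
        if best is not None:
            matched_gt.add(best)
            sq_errors.append(best_d ** 2)
    hits = len(sq_errors)
    precision = hits / len(pred_centers) if pred_centers else 0.0
    recall = hits / len(gt_centers) if gt_centers else 0.0
    rmse = math.sqrt(sum(sq_errors) / hits) if hits else float("nan")
    return precision, recall, rmse


# Toy example: two of three predictions land near a ground-truth center.
pred = [(0.0, 0.0, 0.05), (1.0, 0.0, 0.0), (5.0, 5.0, 5.0)]
gt = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.1), (3.0, 3.0, 3.0)]
radii = [0.5, 0.5, 0.5]
precision, recall, rmse = center_detection_metrics(pred, gt, radii, ratio=0.5)
```

In a full evaluation these per-scene values would be averaged over all scenes to obtain mean precision and mean recall.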
Q4: The generalizability of the 3D multi-object focusing module to the unseen scenes.
A4: For unseen scenes, we follow MIRETR and use the ShapeNet dataset [1] (55 categories in total) to conduct experiments on unseen semantic categories. Specifically, we use CAD models from the first 30 categories of ShapeNet for training and the remaining 25 categories for testing to assess generalization to new categories. To prevent class imbalance, we follow MIRETR and randomly sample up to 500 models per category. Each point cloud pair consists of a randomly selected CAD model and a scene model created by applying 4–16 random poses. This results in 8,634 pairs for training, 900 for validation, and 7,683 for testing. The results on the test set are shown in the following table:
| Methods | MR (%) | MP (%) | MF (%) |
|---|---|---|---|
| MIRETR | 94.95% | 93.94% | 94.44% |
| 3DFMNet (ours) | 95.12% | 94.23% | 94.67% |
It can be observed that our method shows good generalizability to unseen scenes compared to the previous SOTA MIRETR. We will add these results to the revised manuscript.
[1] Chang A X, Funkhouser T, Guibas L, et al. ShapeNet: An Information-Rich 3D Model Repository. arXiv preprint arXiv:1512.03012, 2015.
Dear Reviewer,
This is a gentle reminder to please review the rebuttal provided by the authors. Your feedback is crucial to the decision-making process. Please consider updating your score after reading the rebuttal.
Thank you for your help with the NeurIPS!
Best, Your AC
The rebuttal from the authors addressed most of my questions. Additional experiment results show good support for the proposed method. The potential improvement for this work is predicting additional pair-wise confidence scores or using pair-wise registration accuracy as the indicator to handle the falsely detected objects. Overall, I would like to update the score to "Weak Accept".
For multi-instance point cloud registration, the authors propose a 3D focusing-and-matching network that learns multiple pair-wise point cloud registrations. Specifically, a 3D multi-object focusing module is proposed to locate the center of each object and generate object proposals. In addition, a 3D dual-masking instance matching module is introduced to estimate the pose between the model point cloud and each object proposal. Extensive experiments on two popular datasets, Scan2CAD and ROBI, show that the proposed method achieves new state-of-the-art performance on the multi-instance point cloud registration task.
Strengths
- From the experimental results, it can be seen that decomposing multi-instance registration into multiple pair-wise registrations is very simple and effective.
- The method achieves new state-of-the-art on Scan2CAD and ROBI. Especially in the challenging ROBI dataset, the proposed method is significantly better than the previous SOTA (+9%).
- It is very convincing that the authors analyzed the upper bound of the method.
Weaknesses
- Since the overall architecture is a two-stage structure for multi-instance point cloud registration, its inference speed is lower than that of the one-stage MIRETR (0.54s vs. 0.40s per scene in Table 2). Why is the inference speed slightly lower than MIRETR's? Considering the two-stage strategy, the authors should conduct a detailed analysis.
- The authors did not provide detailed training and testing strategies for its two-stage approach.
- In Table 1, it can be observed that the proposed method did not achieve SOTA in terms of MR on the Scan2CAD dataset. What are the possible reasons behind this phenomenon?
- In Figure 4, it can be observed that both MIRETR and the proposed method 3DFMNet cannot successfully match all parts on the ROBI dataset, especially on dense scenes with lots of parts.
- As mentioned in the limitations, the proposed method is a two-stage method, and its inference speed is slightly lower than MIRETR's. It would be better to discuss a potential plan to solve this issue.
Questions
Please refer to weaknesses.
Limitations
The authors did not discuss the computational complexity of the method.
We thank the reviewer for the detailed comments to improve the paper.
Q1: Detailed analysis of why the inference speed is slightly lower than MIRETR.
A1: In the main paper, we have measured the total inference time of our method (0.540s) and MIRETR (0.400s) per scene on the ROBI dataset. Since our method is a two-stage method, we supplement the detailed inference time of each stage in the following table:
| Focusing Model Time (first stage) | Matching Model Time (second stage) | Total Time |
|---|---|---|
| 0.145s | 0.405s | 0.540s |
It can be observed that the first stage (Focusing Model) takes up only a small share of the time, while the second stage (Matching Model) takes up the larger share. Compared with the 0.400s inference time of MIRETR, our inference time of 0.540s is slightly higher.
Q2: Details of training and testing strategies.
A2: During training, we use the ground truth center as supervision to train the 3D multi-object focusing module and use the point cloud around the ground truth center as the training data to train the matching network. During testing, we use the center predicted by the 3D multi-object focusing module and its surrounding point cloud as the input to the 3D dual-masking instance matching module to regress the final pose.
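The way the proposal source switches between training and testing can be sketched as follows. This is only an illustration under stated assumptions: the response says the "point cloud around" a center is used but not how it is selected, so the ball crop, the `radius` parameter, and both function names are hypothetical.

```python
import math


def crop_proposal(scene_points, center, radius):
    """Keep the scene points lying within `radius` of `center`."""
    return [p for p in scene_points if math.dist(p, center) <= radius]


def build_proposals(scene_points, centers, radius):
    """Build one object proposal per center.

    Training would pass ground-truth centers; testing would pass the
    centers predicted by the focusing module.
    """
    return [crop_proposal(scene_points, c, radius) for c in centers]


# Toy scene with two small clusters and one far-away background point.
scene = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0),
         (1.0, 1.0, 1.0), (1.05, 1.0, 1.0),
         (5.0, 5.0, 5.0)]
centers = [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0)]  # ground truth or predicted
proposals = build_proposals(scene, centers, radius=0.2)
```

Each proposal would then be paired with the CAD model and passed to the matching stage, which is what turns the one-to-many problem into several one-to-one registrations.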
Q3: Possible reasons behind the phenomenon of MR not SOTA on the Scan2CAD.
A3: In Table 1 of the main paper, it can be seen that on the Scan2CAD dataset, our method improves the mean precision (MP) by about 3% while reducing the mean recall (MR) by about 1%. On Scan2CAD, this may be because some objects in the scene are too close to each other and relatively small compared to the scene, which causes multiple objects to be detected as a single object and leads to a decrease in MR. To jointly evaluate MP and MR, we also adopt the mean F1 score (MF), which is the harmonic mean of MP and MR. In Table 1 of the main paper, it can be observed that our method improves MF by about 1.3%, which further demonstrates the effectiveness of our method.
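As a quick check of the harmonic-mean relationship, the sketch below recomputes the F1 score from the ROBI precision and recall values reported in this discussion. Applying the formula to dataset-level averages is an assumption; if the paper averages per-scene F1 scores, the result may differ slightly from the reported MF.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same units as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


# ROBI values reported in the authors' responses, in percent.
mf_ours = f1_score(50.61, 46.81)    # 3DFMNet: MP, MR
mf_miretr = f1_score(41.19, 38.51)  # MIRETR: MP, MR

print(f"3DFMNet MF ~ {mf_ours:.2f}%, MIRETR MF ~ {mf_miretr:.2f}%")
```

Both recomputed values agree with the reported MF figures (48.63% and 39.80%) to within rounding.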
Q4: Analyze the reasons why current methods cannot achieve good results on the ROBI dataset.
A4: Due to the fact that the ROBI dataset is generated from monocular images, occlusions and other factors can cause the instance point clouds to be quite incomplete, making it more challenging. This is the main reason that the performance of all methods is not ideal. Nonetheless, our method significantly outperforms the previous SOTA MIRETR by about 7% in terms of MR, MP, and MF.
Q5: Potential plan to solve the problem of inference time.
A5: In the future, we plan to use multiple GPUs for parallel computation to address this issue. We believe that the time taken to process a single matching proposal will be the upper limit of our optimization.
Q6: Discuss the computational complexity.
A6: We report the inference time and the number of parameters on the ROBI dataset in the following table:
| Methods | Inference Time | Parameters (MB) | MR (%) | MP (%) | MF (%) |
|---|---|---|---|---|---|
| MIRETR | 0.400s | 11.31M | 38.51% | 41.19% | 39.80% |
| 3DFMNet (ours) | 0.540s | 21.15M | 46.81% (+8.3%) | 50.61% (+9.42%) | 48.63% (+8.83%) |
It can be seen that our running time is slightly higher than MIRETR's. Although the number of parameters is about twice that of MIRETR, the performance is improved substantially, by about 8% in terms of MR, MP, and MF on the ROBI dataset.
Dear Reviewer,
This is a gentle reminder to please review the rebuttal provided by the authors. Your feedback is crucial to the decision-making process. Please consider updating your score after reading the rebuttal.
Thank you for your help with the NeurIPS!
Best, Your AC
The rebuttal from the authors addressed most of my questions. Since the findings in this paper are interesting, the interpretations are rational, and the experimental results are convincing to me, I would like to keep the score at "Weak Accept".
This paper introduces an interesting approach for multiple pair-wise point cloud registration. The reviewers appreciated the proposed idea and its impressive registration accuracy. Minor concerns regarding computation overhead, missing details, and additional analysis were raised during the review phase. The authors provided solid feedback to relieve the concerns, and all reviewers appreciate the idea of predicting matching confidence scores or accuracy to handle wrongly detected objects, which considers time consumption. As a result, reviewer 8vwC gave 'weak acceptance', reviewer uLBB gave 'weak acceptance', and reviewer RCTj upgraded the score to 'borderline acceptance'. AC notes that all reviewers reached a clear consensus, and the idea and the proposed approach have several merits in the field.