PaperHub

Overall rating: 6.0 / 10 (Spotlight), 4 reviewers
Individual ratings: 5, 6, 6, 7 (min 5, max 7, std 0.7)
Confidence: 3.8 · Soundness: 2.8 · Contribution: 2.3 · Presentation: 2.8

NeurIPS 2024

SA3DIP: Segment Any 3D Instance with Potential 3D Priors

OpenReview | PDF
Submitted: 2024-05-15 · Updated: 2024-11-06

Keywords
Open Vocabulary, 3D Instance Segmentation, Foundation Model, 3D Detection

Reviews and Discussion

Review (Rating: 5)

The paper "SA3DIP: Segment Any 3D Instance with Potential 3D Priors" presents a novel method for 3D instance segmentation by incorporating geometric and textural priors to generate 3D primitives. It addresses the limitations of existing methods that rely heavily on 2D foundation models, leading to under-segmentation and over-segmentation issues. The proposed approach includes a 3D detector for refining the segmentation process and introduces a revised version of the ScanNetV2 dataset, ScanNetV2-INS, with enhanced ground truth annotations. Experimental results on various datasets demonstrate the effectiveness and robustness of the SA3DIP method.

Strengths

The strength of the paper lies in its innovative approach to integrating both geometric and textural priors for 3D instance segmentation, which significantly reduces initial errors and enhances the overall segmentation quality. The introduction of a 3D detector for refining segmentation further strengthens the method by addressing over-segmentation issues. Additionally, the revised ScanNetV2-INS dataset provides a more accurate benchmark for evaluating 3D segmentation models, contributing valuable data to the research community. The experimental results across multiple challenging datasets convincingly demonstrate the robustness and effectiveness of the proposed method.

Weaknesses

Despite its strengths, the paper has certain weaknesses. Firstly, it lacks sufficient innovation, as the framework closely resembles SAI3D [1], with the primary difference being the addition of a 3D detector. The entire Scene Graph Construction part is exactly the same as SAI3D. Furthermore, as shown in the ablation study, the overall performance improvement of the model heavily depends on the pre-trained 3D detector, which diminishes the originality and contribution of this paper.

[1] Yin, Yingda, et al. "Sai3d: Segment any instance in 3d scenes." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024.

Questions

  1. In Section 3.1, the author mentions 'histogram vectors,' but the definition of these vectors is unclear in the article. It is not specified how these vectors are derived or what their dimensions are. Additionally, it is unclear why they can effectively represent the features of each superpoint for calculating the affinity score. Clarification on these points is necessary to understand their role and significance in the methodology.
  2. In Table 1, SAMPro3D is the method that is numerically closest to the proposed one. Why is there no visual comparison in Figure 4?

Limitations

The authors have addressed the limitations of their work. They highlight that using only 3D priors results in short execution times but can lead to an excessive number of superpoints, complicating the merging process. Furthermore, they acknowledge that the affinity matrix based on 2D masking relies heavily on the accuracy of 2D foundational segmenters and suggest that a more robust merging algorithm or better utilization of various 2D foundational models could be promising future directions.

Author Response

We thank the reviewer for the constructive and detailed feedback.

W: Innovation of the proposed approach
In Fig. 1 of our paper, we demonstrate the existing problem of previous methods, which build on under-segmented 3D primitives and carry the over-segmented 2D masks into the final 3D instance segmentation. We believe our method is problem-oriented, trying to alleviate these two major defects.
We therefore design two modules accordingly:

  1. Complementary primitives generation module to generate more accurate and finer-grained 3D primitives to avoid any error accumulation.
  2. Introducing the 3D space prior to provide an instance-aware constraint, implemented with a 3D detector.

The combination of both modules works well to solve the aforementioned problems.

W: Overall performance gain analysis
It is true that, from the ablation study, the performance gain seems to depend heavily on the 3D detector. However, we believe the gain from the complementary primitives module is not as minor as it looks. We think this is related to the definition of the AP metric (the ratio of correctly identified instances to the total number of identified instances), which is more in favor of under-segmentation than over-segmentation, since the former introduces relatively fewer false instances. We randomly chose two scenes (scene0011_00 & scene0644_00) and conducted random 10%, 20%, 30% over/under-segmentation tests based on their GT instances. Results are averaged over three experiments and shown in the table below. APs with subscript O stand for over-segmentation results, and U for under-segmentation.

|              | mAP_O | AP25_O | AP50_O | mAP_U | AP25_U | AP50_U |
|--------------|-------|--------|--------|-------|--------|--------|
| Scene0011_00 |       |        |        |       |        |        |
| 10%          | 87.2  | 87.2   | 98.1   | 92.4  | 96.2   | 96.2   |
| 20%          | 71.8  | 71.8   | 98.1   | 84.5  | 92.3   | 92.3   |
| 30%          | 62.3  | 62.3   | 98.1   | 85.4  | 92.3   | 92.3   |
| Scene0644_00 |       |        |        |       |        |        |
| 10%          | 88.1  | 88.1   | 96.7   | 97.6  | 98.8   | 98.8   |
| 20%          | 73.6  | 73.6   | 98.3   | 86.9  | 92.9   | 92.9   |
| 30%          | 60.3  | 60.3   | 98.3   | 69.8  | 82.1   | 82.1   |

It can be observed that at every percentage, under-segmentation yields higher APs, due to its higher precision and fewer false positives, while over-segmentation yields lower precision and higher recall. This is consistent with our results, which produce finer-grained primitives (20% more on average, shown in the table below) but slightly lower APs. We show the number of instances in comparison with SAI3D in the table below.

|       | Primitives (Min) | Primitives (Max) | Primitives (Avg) | Final result (Min) | Final result (Max) | Final result (Avg) |
|-------|------------------|------------------|------------------|--------------------|--------------------|--------------------|
| SAI3D | 159              | 3905             | 1068             | 6                  | 258                | 59                 |
| Ours  | 243              | 3989             | 1272             | 5                  | 159                | 45                 |

Upon a closer look at the ablation study in our paper, adding only our complementary 3D superpoint primitives module even slightly deters the performance, due to the AP metric. However, our final results after the whole pipeline (after adding the 3D space prior) reverse this minor drop and produce an extra gain of around 1% compared to adding only 3D detection. The instance counts decrease as well. The superior performance and decreased instance number indicate that our results achieve both high recall and high precision. This proves that the combination of finer-grained (and slightly over-segmented) primitives and instance-aware refinement (merging those over-segmented primitives) provides a thorough solution.
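For concreteness, here is a minimal sketch of how such a perturbation test can be implemented (the exact split/merge rules and the function name perturb_instances are illustrative assumptions; only the perturbation percentages are specified above):

```python
import numpy as np

rng = np.random.default_rng(0)

def perturb_instances(gt: np.ndarray, ratio: float, mode: str) -> np.ndarray:
    """Perturb ground-truth instance labels for the over/under-segmentation test.

    gt:    (N,) ground-truth instance id per point.
    mode:  "over"  -- split `ratio` of the instances into two random halves;
           "under" -- absorb `ratio` of the instances into another random instance.
    """
    pred = gt.copy()
    ids = np.unique(gt)
    chosen = rng.choice(ids, size=max(1, int(ratio * len(ids))), replace=False)
    next_id = ids.max() + 1
    for inst in chosen:
        pts = np.flatnonzero(gt == inst)
        if mode == "over":
            half = rng.choice(pts, size=len(pts) // 2, replace=False)
            pred[half] = next_id          # one GT instance becomes two predictions
            next_id += 1
        else:
            other = rng.choice(ids[ids != inst])
            pred[pts] = other             # two GT instances collapse into one
    return pred
```

The perturbed labels can then be scored against the untouched GT with the standard AP evaluation.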

Q1: Histogram vectors clarification
A histogram vector represents the distribution of 2D mask ids covered by the projection of a 3D primitive. It reflects which 2D masks correspond to the current 3D primitive. The histogram vectors are then used to calculate the affinity (similarity) matrix for each pair of 3D primitives, indicating the likelihood that two primitives belong to the same object (the same 2D mask).
In the implementation, assume that a scene has M 2D images and N 3D points. For the m-th image, we project the N points into the image using the corresponding pose and camera intrinsics. We record the 2D mask label for every 3D point according to the pixel it is projected onto (0 for points invisible in the image). After this we obtain an (N, M) matrix indicating the 2D mask label of every point in every 2D view.
Next, we load all K 3D primitives in the scene. Again for the m-th image, we record the 2D mask labels each primitive covers according to the (N, M) matrix obtained earlier, since multiple labels may be covered by a single primitive due to ambiguity or inaccuracy in the 2D masks. We maintain a normalized matrix of size (K, V) for each view, where V is the number of 2D masks in the view. This (K, V) matrix holds the histogram vectors of all primitives in the m-th view.

|              | 2D mask #1 | 2D mask #2 | ... | 2D mask #V |
|--------------|------------|------------|-----|------------|
| Primitive #1 | 0          | 0.5        | ... | 0          |
| Primitive #2 | 0.3        | 0.6        | ... | 0          |
| ...          | ...        | ...        | ... | ...        |
| Primitive #K | 0.1        | 0          | ... | 0.6        |

In the end, we use the (K, V) matrix for cosine similarity calculation to obtain the (K, K) affinity matrix, which represents the likelihood that each pair of primitives belongs to the same object.
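To make this concrete, here is a minimal NumPy sketch of the per-view histogram computation and the cosine-similarity affinity described above (function names and the exact normalization are illustrative choices, not verbatim from the released code):

```python
import numpy as np

def view_histograms(point_labels: np.ndarray, prim_ids: np.ndarray, K: int) -> np.ndarray:
    """Histogram vectors for one view.

    point_labels: (N,) 2D mask id per 3D point in this view (0 = invisible).
    prim_ids:     (N,) primitive index in [0, K) per point.
    Returns a row-normalized (K, V) matrix, V = number of 2D masks in the view.
    """
    V = int(point_labels.max())
    hist = np.zeros((K, V), dtype=np.float64)
    visible = point_labels > 0
    np.add.at(hist, (prim_ids[visible], point_labels[visible] - 1), 1.0)
    row_sums = hist.sum(axis=1, keepdims=True)
    return np.divide(hist, row_sums, out=np.zeros_like(hist), where=row_sums > 0)

def affinity_from_histograms(hist: np.ndarray) -> np.ndarray:
    """Cosine similarity between histogram vectors -> (K, K) affinity matrix."""
    norms = np.linalg.norm(hist, axis=1, keepdims=True)
    unit = np.divide(hist, norms, out=np.zeros_like(hist), where=norms > 0)
    return unit @ unit.T
```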

Q2: Visual comparison of SAMPro3D
We showcase the comparison in Fig. 3 of the global rebuttal PDF. Our method clearly alleviates the over-segmented instances that appear in SAMPro3D, which is consistent with our initial purpose of introducing the 3D instance-aware space prior.

Comment

Dear reviewer cYA9,

Thank you very much for reviewing our paper and giving us some good questions. We have tried our best to answer all of them according to the comments. In particular, we conducted more ablation experiments on 3D priors and a visual comparison with SAMPro3D.

As half of the discussion time has already passed, we would be very grateful if you could respond to our rebuttal and offer us an opportunity to further engage with your concerns and address any additional questions you might have!

Thank you for your time and feedback!

Best,

Authors

Comment

Thank you for the clear explanation of histogram vectors and the visual comparison of SAMPro3D. Your response effectively answered my questions.

Review (Rating: 6)

The paper proposes a pipeline to perform open-vocabulary 3D instance segmentation of scenes, incorporating geometric and RGB information. The method is based on constructing a superpoint graph, which is then refined by SAM and a 3D detector (V-DETR). The paper also provides an enhanced version of ScanNetV2, correcting and extending the annotations. The method outperforms other SAM-based baselines.

Strengths

Given the popularity of ScanNetV2, releasing a more curated and fine-grained annotation seems a useful contribution. The proposed method is a careful combination of components, in a cascade of steps that seems effective.

Weaknesses

  • As a non-expert in 3D scene segmentation, I find it difficult to understand the novelty of the proposed approach, given that other works also rely on SAM. From my understanding, the key difference seems to be the exploitation of a 3D prior by incorporating the 3D detector, producing the main impact on instance awareness through its constraint. If this is the case, I am unsure about the significance of the technical contribution, and I would ask for further clarification on this.
  • The paper does not provide enough documentation or analysis of the new annotations for ScanNetV2, which is critical since this is one of the core contributions of the paper. Figure 3 is insufficient to understand the quantitative statistics of the performed effort. On the new dataset, the methods perform worse, which could suggest that the new labels are more difficult/detailed, but more evidence is required to confirm this. To prove the dataset's usefulness, I would suggest comparing methods trained on ScanNetV2 and ScanNetV2-INS and incorporating more statistics into the paper (e.g., the difference in the number of categories, how many instances for each of these, ...).

Questions

Adding on the observations reported in the previous section:

  1. The method requires posed images. Is this a requirement also for the competitors?
  2. Images are often difficult to parse: e.g., Figure 2 contains many small images with several colors, and its flow is not linear. I would suggest providing a more schematic overview with larger figures and including this detailed version in the appendix.

Limitations

The paper briefly discusses the proposed method's main limitation, but not the dataset's.

Author Response

We thank the reviewer for the constructive and detailed feedback.

W1: Novelty of the proposed approach
It is true that other methods, including ours, rely on SAM, and this heavy reliance on 2D segmentation results is exactly what we were trying to avoid by introducing the 3D prior. In Fig. 1 of our paper, we demonstrate the existing problem of previous methods (like SAI3D and SAMPro3D), which inherit the over-segmented 2D masks into the final 3D instance segmentation.
Thus, our method is problem-oriented, trying to alleviate two major defects existing in the pipelines of the current works through 3D priors.

  1. The under-segmented 3D primitives and subsequent error accumulation
  2. The part-level over-segmentation tendency of the 2D foundation segmenter.

We therefore design two modules accordingly. First, we use the complementary primitives generation module to generate more accurate and finer-grained 3D primitives and avoid error accumulation. Second, we introduce the 3D space prior to provide an instance-aware constraint, which we choose to implement with a 3D detector. The combination of both modules works well to solve the aforementioned problems.

W2/Limitation: Documentation or analysis of the new annotations for ScanNetV2
We would like to clarify that all methods of this type (such as SAMPro3D, SAI3D, ours, and so on) are zero-shot, which means they utilize the generalization capability of 2D foundation models and do not depend on training. Thus, the ScanNetV2-INS dataset we propose is intended purely for evaluation, since the output instance segmentation results of these methods do not change when the ground-truth file is switched. We elaborate on the details of the ScanNetV2-INS dataset as follows:

  1. Meaning: There are two major deficiencies in the original ScanNetV2 dataset: missing instances and incomplete instance masks. To address these, we assigned new class-agnostic labels to the missing instances and re-labelled the incomplete instance masks.

  2. Tasks: Following recent methods that utilize 2D foundation models to perform zero-shot 3D class-agnostic instance segmentation, our ScanNetV2-INS dataset also focuses on this specific task and is for evaluation only.

  3. Format: We generate a txt file (e.g., sceneid.txt) for each scene in the original ScanNetV2 validation set. The file contains N separate lines corresponding to the N points in the scene, where the number on each line is computed as: Number = instance id (the n-th object in the scene) + 1000 * semantic label id (see scannetv2-labels.combined.tsv). Newly labelled instances were assigned a semantic label id of 41, since we focus on class-agnostic segmentation. A reading sketch is given after this list.

  4. Limitation: Our dataset focuses on the evaluation of 3D class-agnostic instance segmentation, due to the recent trend of utilizing 2D foundation segmenters for zero-shot 3D class-agnostic instance segmentation and the high expense of annotating 3D data.

  5. More statistics: In Fig. 3 of our paper, we show how many scenes contain more than (10, 20, ..., 100) instances in ScanNetV2 and ScanNetV2-INS; our proposed dataset clearly provides a larger number of scenes with more instances. We give more statistics below. The first table lists the instance counts of the original ScanNetV2 and of ScanNetV2-INS; it can be seen that ScanNetV2-INS incorporates more instances. The second table shows the number of instances whose point counts fall within specified ranges; ScanNetV2-INS features more small objects, which requires finer-grained instance perception from the model. The statistics in these two tables could, to some extent, explain why ScanNetV2-INS is more challenging than ScanNetV2.

| Instance Count | Min | Max | Avg | Total |
|----------------|-----|-----|-----|-------|
| ScanNetV2      | 2   | 47  | 14  | 4364  |
| ScanNetV2-INS  | 2   | 54  | 17  | 5596  |

| Point # Per Instance | <500 | 500-1000 | 1000-2000 | 2000-5000 | 5000-10000 | >10000 |
|----------------------|------|----------|-----------|-----------|------------|--------|
| ScanNetV2            | 252  | 452      | 1119      | 1690      | 567        | 284    |
| ScanNetV2-INS        | 692  | 748      | 1366      | 1873      | 626        | 291    |
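As a concrete illustration of the file format in item 3, a minimal reader might look like this (the file name in the usage example is hypothetical):

```python
import numpy as np

def load_scene_labels(path: str):
    """Read a ScanNetV2-INS ground-truth txt file (one encoded number per point).

    Each number is: instance_id + 1000 * semantic_label_id, so the two
    components are recovered by integer division and modulo.
    """
    encoded = np.loadtxt(path, dtype=np.int64)
    semantic_ids = encoded // 1000   # 41 for the newly labelled instances
    instance_ids = encoded % 1000
    return semantic_ids, instance_ids

# e.g., sem, inst = load_scene_labels("scene0011_00.txt")
```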

Q1: Requirement of the posed images
Yes, posed images are also required by the competitors. As clarified in W2/Limitation above, methods of this type (such as SAMPro3D, SAI3D, ours, and so on) are zero-shot, meaning they utilize the generalization capability of 2D foundation models; to date there is no 3D foundation model, due to the limited amount of labelled 3D data. The correspondence between 2D and 3D space is the focus of these methods, so posed images are essential as a bridge between 2D foundation models and 3D space. We will include this discussion in the final version of our paper.

Q2: Further figure clarification
In general, the whole pipeline shown in Fig. 2 of our paper can be separated into three parts. Step A: 3D primitive generation exploiting both geometric and textural priors. Step B: scene graph construction, where the primitives serve as nodes and the affinity matrix of 3D primitives, guided by 2D masks from the 2D segmenters, provides the edge weights. Step C: region growing and instance-aware refinement on the constructed scene graph. We provide a more straightforward version with larger images in Fig. 1 of the global rebuttal PDF.
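As a rough illustration of Steps B and C, the merging over the scene graph can be thought of as a thresholded union of primitives by affinity. The toy sketch below omits the progressive threshold schedule and the instance-aware refinement of the actual method:

```python
import numpy as np

def merge_primitives(affinity: np.ndarray, tau: float = 0.9) -> np.ndarray:
    """Union-find merge of primitives whose pairwise affinity exceeds tau.

    affinity: (K, K) symmetric matrix from the 2D-mask-guided histograms.
    Returns (K,) instance labels in [0, num_instances).
    """
    K = affinity.shape[0]
    parent = np.arange(K)

    def find(x: int) -> int:
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path compression
            x = parent[x]
        return x

    for i in range(K):
        for j in range(i + 1, K):
            if affinity[i, j] > tau:
                parent[find(i)] = find(j)
    roots = np.array([find(i) for i in range(K)])
    _, labels = np.unique(roots, return_inverse=True)
    return labels
```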

Comment

I thank the authors for clarifying their contributions and method. I think showing a significant increase in instances with few points is a reasonable measure to suggest that the new dataset has, at least to some extent, additional complexity.

I find this sentence a bit confusing: "Thus, the ScanNetV2-INS dataset we propose is simply of evaluation use since their output instance segmentation results do not change when switching the ground truth file." From my understanding, ScanNetV2-INS provides more fine-grained annotations to evaluate the methods. These could also be used to train (or fine-tune) models. Hence, I am unsure why the authors suggest that this dataset is designed for evaluation purposes only.

Thanks again to the authors for their time and clarifications.

Comment

Dear Reviewer 3tui,

Thanks so much for your quick response and feedback, and we are really happy to hear that we were able to address some of your concerns!

For the sentence you mentioned: the original ScanNetV2 dataset holds a large number of scenes (1202 in the train split, 312 in the validation split, and 100 in the test split). Recent methods, such as SAM3D, SAMPro3D, SAI3D, and ours, follow a zero-shot pipeline that lifts 2D segmentation results from SAM into 3D scenes. Therefore, these methods do not require a training process or the scenes in the training split, and can be (and in practice are) run directly on the validation set of the 3D dataset to compare 3D instance segmentation metrics.

Thus, the ScanNetV2-INS dataset we propose consists only of the revised versions of the 312 validation scenes. It is certainly also possible to use our dataset for a normal model that requires training, by replacing the original val set with ours, since the format of our ground-truth labels is consistent with the original. But the initial motivation for proposing the dataset is to provide a fairer comparison between the aforementioned training-free methods (thus on the validation set only). This is why we suggest that the new dataset is for evaluation use, in the context of these training-free methods.

Once again, thank you for your quick response and additional feedback! In the remaining discussion period, we would be very glad to address any additional questions that may arise, or any clarifications needed.

Authors

Comment

Dear reviewer 3tui,

As the discussion phase ends today, after which we will not be able to further clarify potential additional concerns, we would be very grateful if you could respond to our further comment and offer us an opportunity to address any additional questions you might have!

Thank you for your time and feedback!

Best,

Authors

Comment

Dear Authors,

Thank you for your clarifications. I do not have further questions.

Best.

Comment

Dear reviewer 3tui,

Thanks for your quick response and all the feedback! We are glad to hear that your questions have been addressed, and we will further explore 3D vision related problems in the future.

Best,

Authors

Review (Rating: 6)

This study introduces SA3DIP, a novel 3D instance segmentation model based on SAM. SA3DIP leverages texture priors from point cloud color channels to generate complementary primitives and incorporates 3D spatial priors when merging 2D masks by integrating a 3D detector. These enhancements enable SA3DIP to generate superior superpoints and mitigate the over-segmentation problem found in previous SAM-based 3D instance segmentation methods.

Strengths

(1) Using SAM to extract 2D masks from RGB-D frames and merge them into a final 3D segmentation result is common in 3D OV segmentation methods. However, few previous methods are geometry-aware during the merging process, highlighting the significance of SA3DIP's incorporation of 3D priors.

(2) Constraints from 3D spatial priors substantially improve performance on both the ScanNetV2 and ScanNetV2-INS datasets.

(3) The low quality of ScanNetV2's ground-truth segmentation has been a persistent problem. A 3D segmentation dataset with more accurate ground truth, like ScanNetV2-INS, is in high demand.

Weaknesses

(1) Using RGB values alone as the texture prior is not robust enough, due to their susceptibility to variations caused by lighting conditions, shadows, reflections, and object materials.

(2) The ablation study shows that the performance gain from Complementary 3D superpoint primitives is not significant compared to other modules in SA3DIP.

(3) This paper reports only the class-agnostic instance segmentation results of SA3DIP. However, the previous benchmark method (SAI3D [1]) also uses semantic instance segmentation as a typical evaluation metric.

[1] Yin, Y., Liu, Y., Xiao, Y., Cohen-Or, D., Huang, J. and Chen, B., 2024. Sai3d: Segment any instance in 3d scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 3292-3302).

Questions

More comprehensive experiments are needed to validate the effectiveness of the Complementary Primitives Generation module. Additionally, the authors should further discuss the motivations for using color value similarities as texture priors.

Limitations

The authors adequately addressed the limitations in section 4.4.

Author Response

We thank the reviewer for the constructive and detailed feedback.

W1/Q: RGB values as texture prior are not robust enough / Motivations for using color values / More experiments
Our motivation comes from the observation that distinct instances with similar normals often exhibit different colors. Meanwhile, the commonly used 3D indoor datasets provide only xyz and rgb attributes for their 3D points. Thus, we opt to explore RGB values as the texture prior.
It is true, as also noted by other reviewers, that a texture prior such as RGB values is not robust enough when used alone. We have conducted further experiments on the priors and their weights, shown in Tab. 1 of the global rebuttal PDF. In our approach we assign less weight to the textural prior, so as to exploit it while minimizing its negative impact. For ScanNetV2, we set the geometric weight (W_n) to 0.96 and the texture weight (W_c) to 0.04 (Line 214). Under this setting, the complementary primitives generation module yields a good initial state for the subsequent merging and refinement. We have conducted more experiments on Matterport3D (table below) and on the Replica dataset (Tab. 2 of the global rebuttal PDF).

| Matterport3D     | W_n | W_c | 3D Space Prior | mAP  | AP25 | AP50 |
|------------------|-----|-----|----------------|------|------|------|
| OpenMask3D       | /   | /   | /              | 15.3 | 28.3 | 43.3 |
| OVIR-3D          | /   | /   | /              | 6.6  | 15.6 | 28.3 |
| SAM3D            | 1   | 0   | /              | 10.1 | 19.4 | 36.1 |
| SAI3D            | 1   | 0   | /              | 18.9 | 35.6 | 56.5 |
| Ablation of ours |     |     |                |      |      |      |
| Ours #1          | 1   | 0   | yes            | 19.8 | 36.6 | 56.2 |
| Ours #2          | 0.9 | 0.1 | /              | 18.1 | 35.7 | 62.3 |
| Ours #3          | 0.9 | 0.1 | yes            | 20.6 | 38.3 | 61.0 |

It can be seen in the last rows of the three tables that our performance on all three datasets exceeds that obtained using only the geometry prior. We will include this discussion in the final version of our paper.

W2: Non-significant gain of complementary 3D superpoint primitives module
In our approach, we designed the complementary 3D superpoint primitives module to alleviate the existing problem of under-segmented 3D primitives based only on geometry, and the subsequent error accumulation.
The non-significant gain of this module is related to the definition of the AP metric (the ratio of correctly identified instances to the total number of identified instances). This metric is more in favor of under-segmentation than over-segmentation, since the former introduces relatively fewer false instances. To further illustrate this, we randomly chose two scenes (scene0011_00 & scene0644_00) and conducted random 10%, 20%, 30% over/under-segmentation tests based on their GT instances. Results are averaged over three experiments and shown below. APs with subscript O stand for over-segmentation results, and U for under-segmentation.

|              | mAP_O | AP25_O | AP50_O | mAP_U | AP25_U | AP50_U |
|--------------|-------|--------|--------|-------|--------|--------|
| Scene0011_00 |       |        |        |       |        |        |
| 10%          | 87.2  | 87.2   | 98.1   | 92.4  | 96.2   | 96.2   |
| 20%          | 71.8  | 71.8   | 98.1   | 84.5  | 92.3   | 92.3   |
| 30%          | 62.3  | 62.3   | 98.1   | 85.4  | 92.3   | 92.3   |
| Scene0644_00 |       |        |        |       |        |        |
| 10%          | 88.1  | 88.1   | 96.7   | 97.6  | 98.8   | 98.8   |
| 20%          | 73.6  | 73.6   | 98.3   | 86.9  | 92.9   | 92.9   |
| 30%          | 60.3  | 60.3   | 98.3   | 69.8  | 82.1   | 82.1   |

It can be observed that at every percentage, under-segmentation yields higher APs, due to its higher precision and fewer false positives, while over-segmentation yields lower precision and higher recall. This is consistent with our results, which produce finer-grained primitives when exploiting both geometry and texture (20% more on average, shown in the table below) but slightly lower APs. We show the number of instances in comparison with SAI3D in the table below.

|       | Primitives (Min) | Primitives (Max) | Primitives (Avg) | Final result (Min) | Final result (Max) | Final result (Avg) |
|-------|------------------|------------------|------------------|--------------------|--------------------|--------------------|
| SAI3D | 159              | 3905             | 1068             | 6                  | 258                | 59                 |
| Ours  | 243              | 3989             | 1272             | 5                  | 159                | 45                 |

However, our final results after the whole pipeline (after adding the 3D space prior) reverse this minor drop and produce an extra gain of around 1% compared to adding only 3D detection. The instance counts decrease as well. The superior performance and decreased instance number indicate that our results achieve both high recall and high precision. This proves that the combination of finer-grained (and slightly over-segmented) primitives and instance-aware refinement during the merging process (merging those over-segmented primitives) provides a thorough solution.

W3: Semantic instance segmentation
We have conducted semantic instance segmentation on the ScanNet200 dataset, better aligning our evaluation with previous benchmark methods (such as SAI3D). The results are provided in the following table. Our method clearly outperforms the other methods on AP, head, and tail.

| Method           | AP   | AP50 | AP25 | Head (AP) | Common (AP) | Tail (AP) |
|------------------|------|------|------|-----------|-------------|-----------|
| Closed mask      |      |      |      |           |             |           |
| OpenMask3D       | 15.4 | 19.9 | 23.1 | 17.1      | 14.1        | 14.9      |
| Open-vocab. mask |      |      |      |           |             |           |
| OVIR-3D          | 9.3  | 18.7 | 25.0 | 9.8       | 9.4         | 8.5       |
| SAM3D            | 9.8  | 15.2 | 20.7 | 9.2       | 8.3         | 12.3      |
| SAI3D            | 12.7 | 18.8 | 24.1 | 12.1      | 10.4        | 16.2      |
| Ours             | 13.5 | 20.4 | 24.8 | 14.9      | 11.6        | 16.9      |

Comment

Dear reviewer sUBY,

Thank you very much for reviewing our paper and giving us some good questions. We have tried our best to answer all the questions according to the comments.

As half of the discussion time has already passed, we would be very grateful if you could respond to our rebuttal and offer us an opportunity to further engage with your concerns and address any additional questions you might have!

Thank you for your time and feedback!

Best,

Authors

Comment

Thank you for the comprehensive feedback and the additional experiments provided by the authors; they have addressed most of my concerns effectively. However, I am still unclear about why W_n must be set much higher than W_c across all benchmark settings. According to Table 1 in the attached PDF, it appears that increasing W_c to 0.6 results in a consistent performance decline across all metrics. I would be inclined to improve my rating if the authors could explain this properly. Thank you in advance for taking the time to clarify this issue.

Comment

Dear reviewer sUBY,

Thanks so much for your feedback, and we are really happy to hear that we were able to address some of your concerns!

As for your question concerning the weights of normal and color, we have conducted further experiments analyzing the counts of 3D primitives under different weights.

The following tables show the 3D primitive counts for a single scene (scene0011_00) and for all 312 scenes of the ScanNetV2 validation set under different weights. It can be observed that as W_n decreases and W_c increases, the number of 3D primitives first increases and then decreases. The initial increase is due to the introduction of color, whereby previously under-segmented primitives are separated, just as we intended. The later decrease is due to the fact that the texture prior is greatly affected by lighting conditions, reflections, shadows, and sensor noise; it blurs the boundaries of objects that could be distinguished by the original normal information, resulting in another trend toward under-segmentation of primitives.

Scene0011_00:

| W_n  | W_c  | 3D primitives count |
|------|------|---------------------|
| 1    | 0    | 1620                |
| 0.99 | 0.01 | 1725                |
| 0.98 | 0.02 | 1779                |
| 0.97 | 0.03 | 1781                |
| 0.96 | 0.04 | 1786                |
| 0.95 | 0.05 | 1785                |
| 0.94 | 0.06 | 1781                |
| 0.93 | 0.07 | 1773                |
| 0.92 | 0.08 | 1764                |
| 0.91 | 0.09 | 1770                |
| 0.90 | 0.10 | 1767                |
| 0.85 | 0.15 | 1772                |
| 0.80 | 0.20 | 1751                |
| 0.75 | 0.25 | 1719                |
| 0.70 | 0.30 | 1725                |
| 0.60 | 0.40 | 1698                |
| 0.50 | 0.50 | 1637                |
| 0.40 | 0.60 | 1623                |
| 0.30 | 0.70 | 1554                |
| 0.20 | 0.80 | 1432                |
| 0.10 | 0.90 | 1242                |
| 0    | 1    | 386                 |
All 312 scenes of the ScanNetV2 validation set:

| W_n  | W_c  | Min | Max  | Mean |
|------|------|-----|------|------|
| 1    | 0    | 159 | 3905 | 1068 |
| 0.99 | 0.01 | 217 | 3941 | 1208 |
| 0.98 | 0.02 | 231 | 3989 | 1257 |
| 0.97 | 0.03 | 240 | 3997 | 1271 |
| 0.96 | 0.04 | 243 | 3989 | 1272 |
| 0.95 | 0.05 | 241 | 3998 | 1270 |
| 0.94 | 0.06 | 245 | 3994 | 1264 |
| 0.93 | 0.07 | 244 | 4000 | 1258 |
| 0.92 | 0.08 | 247 | 3989 | 1251 |
| 0.91 | 0.09 | 247 | 3979 | 1243 |
| 0.90 | 0.10 | 246 | 3986 | 1234 |
| 0.85 | 0.15 | 241 | 3968 | 1199 |
| 0.80 | 0.20 | 228 | 3943 | 1172 |
| 0.75 | 0.25 | 220 | 3907 | 1150 |
| 0.70 | 0.30 | 210 | 3877 | 1134 |
| 0.60 | 0.40 | 191 | 3849 | 1103 |
| 0.50 | 0.50 | 185 | 3903 | 1079 |
| 0.40 | 0.60 | 177 | 3893 | 1051 |
| 0.30 | 0.70 | 163 | 3809 | 1012 |
| 0.20 | 0.80 | 157 | 3596 | 947  |
| 0.10 | 0.90 | 125 | 2951 | 809  |
| 0    | 1    | 12  | 1566 | 328  |

In Fig. 2.A of our paper (Complementary Primitives Generation), we also showcase the visualization of primitives under the configurations (W_c = 1, W_n = 0) and (W_c = 0, W_n = 1). It is clear that the primitives in both settings feature under-segmentation (the green segment in the middle image, and the purple segment in the right image), while combining them with a greater weight for W_n and a smaller weight for W_c, shown in the left image, gives a better initial set of primitives.

Therefore, it is necessary to assign a much higher weight to W_n and introduce a limited awareness of color information, to avoid the under-segmentation trend at both ends.

We hope our analysis eases your concern. Once again, thank you for your response and additional feedback! In the remaining discussion period, we would be very glad to address any additional questions that may arise, or provide any clarification needed.

Best,

Authors

Comment

Thanks for the author's quick response. This answer makes sense to me. I have raised my score to weak accept.

Review (Rating: 7)

The paper introduces SA3DIP, a novel method for 3D instance segmentation that leverages both geometric and textural priors to enhance the accuracy of segmentation tasks. The goal is to improve open-world 3D instance segmentation by addressing the limitations of current methods, which often result in under-segmentation and over-segmentation due to the limited use of 3D priors.

SA3DIP integrates both geometric and textural priors to generate finer-grained 3D primitives, reducing initial errors that accumulate in subsequent processes. It incorporates constraints from a 3D detector during the merging process to rectify over-segmented instances, maintaining the integrity of objects in 3D space. It also introduces a revised version of the ScanNetV2 dataset, termed ScanNetV2-INS, with enhanced ground-truth labels for more accurate and fair evaluations of 3D class-agnostic instance segmentation methods. Finally, extensive experiments on the ScanNetV2, ScanNetV2-INS, and ScanNet++ datasets demonstrate the effectiveness and robustness of SA3DIP, achieving significant improvements in segmentation accuracy over existing methods.

Strengths

Enhanced 3D Instance Segmentation Pipeline:

  • The SA3DIP pipeline incorporates both geometric and color priors to generate complementary 3D primitives.
  • Introduces a 3D detector to provide additional constraints during the merging process, addressing over-segmentation issues.

ScanNetV2-INS Dataset:

  • A revised version of the ScanNetV2 dataset with improved annotations, providing a more accurate benchmark for evaluating 3D instance segmentation methods.
  • Rectifies incomplete annotations and incorporates additional instances to better reflect real-world scenarios.

Robust Performance:

  • Demonstrated superior performance in 3D instance segmentation through extensive experiments on multiple datasets.
  • Achieved competitive results, significantly outperforming existing methods in terms of mAP (mean Average Precision), AP50, and AP25 scores.

The SA3DIP method addresses the limitations of previous approaches by fully exploiting the potential of 3D priors, leading to more accurate and reliable 3D instance segmentation results. The improvement of the existing dataset is cleverly done, and the overall architecture is sound. The dataset will be valuable to the community for future work.

Weaknesses

  1. Obfuscation of Feature Contributions:

    • The contributions of individual features to the metric improvements are not clearly delineated. Ablation studies could be performed more meticulously to attribute the contribution of each feature individually.
  2. Super Primitives Definition:

    • A more precise and comprehensive definition of super primitives could be provided to enhance understanding and reproducibility.
  3. Progressive Region Refinement Examples:

    • Including examples of progressive region refinement could illustrate the process and its effectiveness more clearly.
  4. 2D to 3D Space Integration:

    • While the method backprojects 3D space metrics into 2D space, the potential of lifting 2D space into the 3D object space was not fully explored. This approach could be considered and discussed to justify the design choices made.

These points highlight areas where the methodology can be further refined and expanded to provide clearer insights and potentially improve performance.

Questions

What are the main assumptions made in the workflow, and what is the rationale for the key design choices?

Limitations

It is not clear what level of demarcation of 3D segmentation is considered. Can it work for subparts within a scene, such as the parts of a chair? The level of detail seems to be one of the main deficiencies. Additionally, it is not clear what the computational complexity of the pipeline is. Only one scene is shown here. What metrics would capture how much improvement there is across the dataset? An estimate over a randomly chosen subset would be good.

Author Response

We thank the reviewer for the constructive and detailed feedback.

W1: Ablation studies on each feature individually
We have conducted ablation studies on each feature we exploit (i.e., geometry, texture, and the 3D space prior). The results are provided in the following tables. APs with subscript 1 stand for the ScanNetV2 dataset, and subscript 2 for ScanNetV2-INS. We assigned several weights to geometry and texture to test their contributions. Specifically, we include the configuration with W_n = 0.4 and W_c = 0.6, which yields a similar number of 3D primitives to the primitives used in SAM3D, SAI3D, and others, for a fair comparison. The experiments show that the configuration with W_n = 0.96 and W_c = 0.04 suits our approach best. A further experiment on Matterport3D is shown below as well, and experiments on Replica are in Tab. 2 of the global rebuttal PDF.

| Geometry (W_n) | Texture (W_c) | 3D Space Prior | mAP_1 | AP25_1 | AP50_1 | mAP_2 | AP25_2 | AP50_2 |
|----------------|---------------|----------------|-------|--------|--------|-------|--------|--------|
| 1              | 0             | no             | 30.8  | 50.5   | 70.6   | 28.9  | 49.2   | 69.7   |
| 0              | 1             | no             | 10.4  | 18.1   | 32.5   | 9.5   | 17.0   | 31.1   |
| 0.4            | 0.6           | no             | 27.3  | 47.4   | 69.8   | 25.6  | 46.3   | 69.4   |
| 0.96           | 0.04          | no             | 29.3  | 49.2   | 70.5   | 27.4  | 48.3   | 70.4   |
| 1              | 0             | yes            | 40.8  | 63.6   | 80.7   | 35.9  | 57.8   | 75.4   |
| 0              | 1             | yes            | 12.7  | 22.1   | 37.2   | 11.0  | 19.7   | 34.1   |
| 0.4            | 0.6           | yes            | 39.1  | 62.7   | 80.2   | 33.5  | 56.3   | 75.0   |
| 0.96           | 0.04          | yes            | 41.6  | 64.6   | 81.3   | 36.1  | 58.6   | 76.3   |

| Matterport3D     | W_n | W_c | 3D Space Prior | mAP  | AP25 | AP50 |
|------------------|-----|-----|----------------|------|------|------|
| OpenMask3D       | /   | /   | /              | 15.3 | 28.3 | 43.3 |
| OVIR-3D          | /   | /   | /              | 6.6  | 15.6 | 28.3 |
| SAM3D            | 1   | 0   | /              | 10.1 | 19.4 | 36.1 |
| SAI3D            | 1   | 0   | /              | 18.9 | 35.6 | 56.5 |
| Ablation of ours |     |     |                |      |      |      |
| Ours #1          | 1   | 0   | yes            | 19.8 | 36.6 | 56.2 |
| Ours #2          | 0.9 | 0.1 | /              | 18.1 | 35.7 | 62.3 |
| Ours #3          | 0.9 | 0.1 | yes            | 20.6 | 38.3 | 61.0 |

It can be observed that the texture prior is not robust enough when used alone, due to the influence of shadows, reflections, and so on. Thus, we adopt the complementary primitives module, exploiting both texture and geometry. A greater weight for geometry and a smaller one for texture yield a good initial state for the subsequent merging and refinement. It can be seen in the last rows that, with the 3D space prior, the performance exceeds that obtained using only the geometry prior.

W2: Super Primitives Definition
In the context of our paper, 3D superpoints/primitives refer to clusters of 3D points whose members exhibit homogeneity in certain attributes. The purpose of generating primitives rather than operating on raw points is to introduce prior knowledge (such as the geometry and texture attributes we use) and to reduce the computational complexity of the subsequent process. A toy sketch of this grouping notion is given below.
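The following sketch groups points by a weighted normal-plus-color dissimilarity over a k-NN graph; it illustrates the notion only and is not the exact graph construction used in the paper (parameter values and thresholds are illustrative):

```python
import numpy as np
from scipy.spatial import cKDTree

def grow_primitives(xyz, normals, rgb, w_n=0.96, w_c=0.04, k=16, thresh=0.1):
    """Toy superpoint growing over a k-NN graph.

    Neighbors are merged when the weighted dissimilarity
        w_n * (1 - |n_i . n_j|) + w_c * ||c_i - c_j||
    is below `thresh` (rgb assumed normalized to [0, 1]).
    Returns (N,) primitive id per point.
    """
    tree = cKDTree(xyz)
    _, knn = tree.query(xyz, k=k + 1)        # column 0 is the point itself
    label = np.full(len(xyz), -1, dtype=int)
    cur = 0
    for seed in range(len(xyz)):
        if label[seed] != -1:
            continue
        label[seed] = cur
        stack = [seed]
        while stack:
            i = stack.pop()
            for j in knn[i, 1:]:
                if label[j] == -1:
                    d = w_n * (1.0 - abs(normals[i] @ normals[j])) \
                        + w_c * np.linalg.norm(rgb[i] - rgb[j])
                    if d < thresh:
                        label[j] = cur
                        stack.append(j)
        cur += 1
    return label
```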

W3: Progressive Region Refinement Examples
In Fig. 2.C of the paper we provide one visual example that demonstrates the effectiveness of our refinement process. We showcase a detailed visualization of each stage of the region growing and refinement process at the bottom of Fig. 1 of the global rebuttal PDF.

W4: 2D to 3D Space Integration
Due to inaccuracies in camera poses, the point clouds generated from posed RGB-D images are not perfectly aligned in 3D space. This misalignment can introduce noise when lifting 2D results into 3D space, as seen in methods like SAM3D. In contrast, projecting mesh-sampled 3D point clouds into 2D space suffers less from such inaccuracies. We showcase an example of SAM3D, which uses only 2D-to-3D projection, in Fig. 2 of the global rebuttal PDF. It is clear that the 2D-to-3D projection introduces a lot of noise, which explains why recent methods, including ours, tend to use 3D-to-2D projection.
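For reference, the 3D-to-2D projection used throughout (one pose and intrinsics per view) can be sketched as follows; a pinhole model and camera-to-world poses are assumed here:

```python
import numpy as np

def project_to_view(xyz_world: np.ndarray, pose_c2w: np.ndarray, K: np.ndarray):
    """Project 3D points into a posed image (pinhole model).

    xyz_world: (N, 3) points; pose_c2w: (4, 4) camera-to-world; K: (3, 3) intrinsics.
    Returns pixel coords (u, v) and a mask of points in front of the camera.
    """
    w2c = np.linalg.inv(pose_c2w)
    xyz_h = np.concatenate([xyz_world, np.ones((len(xyz_world), 1))], axis=1)
    cam = (w2c @ xyz_h.T).T[:, :3]
    in_front = cam[:, 2] > 1e-6
    uvw = (K @ cam.T).T
    u = uvw[:, 0] / np.clip(uvw[:, 2], 1e-6, None)
    v = uvw[:, 1] / np.clip(uvw[:, 2], 1e-6, None)
    return u, v, in_front
```

In practice one would additionally clip (u, v) to the image bounds and use the depth map to discard occluded points before recording mask labels.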

Q: Main assumptions & reasons for key design choices
In general, we have made two major assumptions in our pipeline:

  1. At the Complementary primitives generation stage, it is assumed that points belong to the same semantic or instance labels share similar inherent attributes (geometry, texture and so on).
  2. At the instance-aware refinement stage, it is assumed that points inside a detected 3D bounding box have high confidence of belonging to the same instance (a toy sketch follows this list).
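A toy sketch of how assumption 2 can be turned into a refinement step (axis-aligned boxes, the box format, and the threshold are illustrative simplifications, not the exact procedure in the paper):

```python
import numpy as np

def box_refine(xyz: np.ndarray, labels: np.ndarray, boxes: np.ndarray,
               inside_ratio: float = 0.8) -> np.ndarray:
    """Merge predicted instances whose points mostly fall inside one detected box.

    xyz:    (N, 3) points; labels: (N,) predicted instance id per point.
    boxes:  (B, 6) axis-aligned boxes as (xmin, ymin, zmin, xmax, ymax, zmax).
    """
    labels = labels.copy()
    for box in boxes:
        lo, hi = box[:3], box[3:]
        inside = np.all((xyz >= lo) & (xyz <= hi), axis=1)
        ids, counts = np.unique(labels[inside], return_counts=True)
        # instances with >= inside_ratio of their points inside this box
        to_merge = [i for i, c in zip(ids, counts)
                    if c >= inside_ratio * np.sum(labels == i)]
        if len(to_merge) > 1:
            labels[np.isin(labels, to_merge)] = to_merge[0]
    return labels
```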

The reasons for the key design choices are elaborated as follows:

  1. Attributes and weights for primitive generation: This choice is based on the observation that distinct instances with similar normals often exhibit different colors, which we believe is a promising remedy for under-segmented primitives. We then assign a greater weight to the geometric prior and a smaller weight to the texture prior, as explained in W1, due to the variations introduced by lighting conditions, shadows, and so on.
  2. SAM variant for 2D segmentation: We use the instance-level segmentation function of Semantic-SAM for the ScanNetV2 dataset, since not many of its scenes contain a large number of small objects, and we opt for SAM-HQ on the ScanNet++ dataset due to its high-resolution scenes with detailed objects.

Comment

Dear reviewer oUe8,

Thank you very much for reviewing our paper and giving us some good questions. We have tried our best to answer all the questions according to the comments.

As half of the discussion time has already passed, we would be very grateful if you could respond to our rebuttal and offer us an opportunity to further engage with your concerns and address any additional questions you might have!

Thank you for your time and feedback!

Best,

Authors

Author Response

We thank all the reviewers for their valuable feedback and appreciate their detailed suggestions. We reply to each reviewer's questions and concerns in the individual responses, and we have added tables and figures in the attached rebuttal PDF, which we reference and explain in the responses.

We would like to emphasize that our work is problem-oriented, motivated by the defects commonly present in SAM3D, SAMPro3D, SAI3D, and so on. We therefore design two modules accordingly:

  1. Complementary primitives generation module to generate more accurate and finer-grained 3D primitives to avoid any error accumulation.
  2. Introducing the 3D space prior to provide an instance-aware constraint, implemented with a 3D detector. The visualization in Fig. 4 of our paper demonstrates that both purposes have been achieved and that the problems can indeed be alleviated by our approach.

Here, we also would like to provide an overview of the material in the attached document:

Figure 1 – Redrawn schematic overview of the proposed pipeline, in response to Reviewer 3tui, who found the original Fig. 2 in our paper difficult to follow. The top blocks give a simplified flow of our method, below which is the detailed version with images. We also show a full progressive merging and refinement process in the yellow dashed box at the bottom, in response to Reviewer oUe8, who requested progressive region refinement examples.

Figure 2 – Visual examples from SAM3D, in response to the discussion with Reviewer oUe8 about the potential of 2D to 3D space integration.

Figure 3 – Visual comparison with SAMPro3D, in response to Reviewer cYA9. This visualization, along with Fig. 4 in our paper, fully demonstrates the effectiveness of our approach at alleviating both the error accumulation caused by under-segmented 3D primitives and the over-segmented 3D instances caused by knowledge transferred from the part-level masks of the 2D foundation segmentation model.

Table 1 – Detailed ablation study on the priors we explore. We hope this eases the concern raised by most reviewers about the effectiveness of the texture prior and the complementary primitives generation module.

Table 2 – More experiments and ablations on the Replica dataset. These further prove the robustness and generalization capability of our approach.

Final Decision

This paper received all positive scores (Accept, Weak Accept, Weak Accept, and Borderline Accept) from 4 reviewers. The reviewers are mainly satisfied with the authors' responses during the rebuttal phase. The authors are encouraged to include these important clarifications in the final paper when preparing the camera-ready version.