A Unified Framework for 3D Scene Understanding
Abstract
Reviews and Discussion
The paper proposes a unified 3D segmentation framework covering six 3D segmentation tasks. It enhances performance by building inter-task connections. The model achieves state-of-the-art performance on individual tasks even when compared to models specialized for those tasks.
Strengths
The proposed model unifies six different 3D segmentation tasks in one framework, with no modules specialized for a specific task. The authors build explicit inter-task relations to further improve the performance of individual tasks. Detailed experiments and comparisons are reported.
Weaknesses
- The writing needs improvement. Some notation and the writing flow are confusing. For example, K_v in line 155 is not introduced. The multiplication in equation (3) is not clear (is it a matrix multiplication?). Positive and negative samples are not introduced in section 3.2 (line 206). mask_pos in equation (6) is not introduced.
- There is no explanation of how to compare models/design choices when the evaluation metrics for different tasks are not consistent. For example, if model A is better at interactive segmentation while model B is better at referring segmentation, how should they be compared?
- Without the extra fine-tuning trick, the proposed model performs worse than the previous SOTA OneFormer3D [16] on the SS and IS tasks (Table 2). According to Tables I and IV in the appendix, the performance of the proposed method is questionable.
- The authors empirically find that interactive segmentation is the best task for mask predictions (lines 57-58) but provide no analysis of the reason. Only one value is reported in Table 1, which cannot validate that interactive segmentation has superior performance.
Questions
- As mentioned in the weaknesses, how do you compare models/design choices when the evaluation metrics for different tasks are not consistent? For example, in Table 4(a), how would you compare line 2 and line 4? Which one is better?
- The design choice of the Mask Decoder Layer is weird. Why do you choose to do a cross-attention first, followed by a self-attention? Also, is there duplicate information when doing the cross-attention between q and f_i, since q_u in q is obtained from f_i? Why not do the self-attention for q_u and the cross-attention with q_p and q_t?
- Table 6 shows that unifying the tasks hurts the performance, so what is the point of unifying all the tasks together? Adding new tasks (OVS, Referring, Interactive) hurts PS, SS, IS. Would these new tasks help PS, SS, IS?
- As mentioned in the weaknesses, Table 1 cannot support the claim that interactive segmentation is superior in mask prediction. Also, is there any intuition as to why this specific segmentation task is superior?
Limitations
The paper addresses its limitations and claims there is no societal impact.
- To Question 2: “The design choice of Mask Decoder Layer is weird…”
Reply: Good question! Here's a detailed discussion: 1) Regarding the order of cross/self-attention, please note that the features f_i represent the entire scene. Cross-attention can effectively evaluate the relationship between the sampled queries q and the scene features f_i, while self-attention evaluates the pairwise importance among the queries in q. By applying cross-attention first, a global perspective is incorporated into q, helping self-attention make further adjustments (see the experiments below; an illustrative sketch is also given after the references). We would also like to point out that this order is a common operation widely used in other representative works [1,2]. 2) Regarding the mentioned duplicate information, we argue that since q_u is randomly selected from f_i, incorporating global perspectives in cross-attention is advantageous. The presence of 'duplicate information' is not an issue, as such overlap is typical for query, key, and value in 2D or 3D transformer methods [3,4]. 3) As mentioned in 1) and 2), we reiterate that a global perspective is essential, as demonstrated by the results in the second table below, which show that simultaneously incorporating a global perspective into q_u, q_p, and q_t significantly enhances performance, further supporting our argument.
| Setting | PS | SS | IS | Inter. Seg. | Ref. Seg. | OVS |
|---|---|---|---|---|---|---|
| Self-attention first | 70.2 | 75.7 | 58.0 | 52.9 | 29.0 | 19.2 |
| Cross-attention first | 71.3(+1.1) | 76.9(+1.2) | 59.3(+1.3) | 54.5(+1.6) | 29.6(+0.6) | 19.7(+0.5) |
| Setting | PS | SS | IS | Inter. Seg. | Ref. Seg. | OVS |
|---|---|---|---|---|---|---|
| w/o cross-attention for q_u | 66.8 | 73.3 | 55.1 | 48.6 | 25.9 | 11.6 |
| w/ cross-attention for q_u | 71.3(+4.5) | 76.9(+3.6) | 59.3(+4.2) | 54.5(+5.9) | 29.6(+3.7) | 19.7(+8.1) |
[1] UniVS: Unified and Universal Video Segmentation with Prompts as Queries. CVPR 24.
[2] Flamingo: a Visual Language Model for Few-Shot Learning. NeurIPS 22.
[3] DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection. ICLR 23.
[4] CenterFormer: Center-based Transformer for 3D Object Detection. ECCV 22.
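For concreteness, here is a minimal PyTorch sketch of the cross-then-self attention ordering discussed above. The module and tensor names (`MaskDecoderLayer`, `queries`, `scene_feats`) are illustrative placeholders and not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class MaskDecoderLayer(nn.Module):
    """Illustrative decoder layer: cross-attention to scene features first,
    then self-attention among the queries (names are placeholders)."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, queries, scene_feats, attn_mask=None):
        # 1) cross-attention: inject a global view of the scene into every query
        x, _ = self.cross_attn(queries, scene_feats, scene_feats, attn_mask=attn_mask)
        queries = self.norm1(queries + x)
        # 2) self-attention: re-weigh the pairwise importance among the queries
        x, _ = self.self_attn(queries, queries, queries)
        return self.norm2(queries + x)

layer = MaskDecoderLayer()
q = torch.randn(2, 100, 256)   # unified queries (e.g., point/text/scene prompts)
f = torch.randn(2, 4096, 256)  # scene / superpoint features
out = layer(q, f)              # (2, 100, 256)
```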
- To Question 3: “Table 6 shows that unifying the tasks are hurting…then what is the point of unifying all the tasks together…”
Reply: We first propose a simple baseline to unify six 3D segmentation tasks. However, we observe challenges in multi-task joint optimization. We would like to clarify that, instead of presenting our final results, the goal of Table 6 is to show that multi-task unification is a challenging topic in which introducing new tasks might hurt the performance of existing tasks. To relieve these impacts, we propose UniSeg3D, which builds inter-task associations to jointly optimize the associated tasks. Experiments demonstrate that our UniSeg3D not only supports six tasks but also surpasses currently specialized SOTA approaches on all tasks, verifying the motivation of our method. Considering that this is the first work to successfully unify six 3D segmentation tasks, we believe it will be valuable and interesting for people in this area.
Thanks for providing feedback and taking the time to review our work!
- To Weakness 1: “The writing needs improvement...”
Reply: Thank you for carefully reading this paper. We apologize for any confusion regarding notations. Specifically, K_v refers to the length of the class-name vocabulary; the multiplication in equation (3) is indeed a matrix multiplication; positive and negative samples denote samples successfully matched or mis-matched with labels, respectively; and mask_pos in equation (6) represents the mask predictions from the positive samples. We will carefully revise these notations in the revised version.
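As an illustration of the clarified notation, the sketch below (our own toy example, not the paper's exact equation (3)) shows a matrix multiplication between query features and a class-name embedding table of length K_v, yielding per-query classification logits.

```python
import torch

K_v = 200                              # length of the class-name vocabulary
queries = torch.randn(100, 256)        # 100 query features
text_embeds = torch.randn(K_v, 256)    # one embedding per class name

# matrix multiplication: (100, 256) @ (256, K_v) -> (100, K_v) logits
logits = queries @ text_embeds.t()
probs = logits.softmax(dim=-1)         # per-query distribution over K_v class names
```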
- To Weakness 2 & Question 1: “…models/design choice…different tasks are not consistent…model A is better…model B…” / “…compare models/design choice…table 4(a)…compare line 2 and line 4…”
Reply: UniSeg3D is the first model unifying six 3D segmentation tasks. OneFormer3D is the only existing unified model for 3D segmentation, covering just three tasks. To compare comprehensively and demonstrate our model's superiority, we first evaluate UniSeg3D on OneFormer3D's three tasks and select the best checkpoint. This checkpoint is then used to test the interactive, referring, and open-vocabulary tasks. Remarkably, without task-specific checkpoint selection, UniSeg3D outperforms the current specialized SOTA methods. Additionally, different fully converged checkpoints show minimal differences in performance, making both viable. In practical applications, checkpoints can be chosen based on the preferred tasks while maintaining good performance across the other tasks.
For Table 4(a), lines 2 and 4 both represent our method. We empirically find that, with the fine-tuning trick, the setting of line 4 performs better than that of line 2 (see the table below). Thus, we employ the setting of line 4 for our method. Note that the fine-tuning trick improves performance without extra inference costs.
| Setting | Panoptic Segmentation (PS) | Semantic Segmentation (SS) | Instance Segmentation (IS) | Interactive Segmentation (Inter. Seg.) | Referring Segmentation (Ref. Seg.) | Open-vocabulary Segmentation (OVS) |
|---|---|---|---|---|---|---|
| Tab.4(a) line 2 w/ fine-tuning | 71.3 | 76.4 | 59.1 | 54.2 | 29.6 | 19.7 |
| Tab.4(a) line 4 w/ fine-tuning | 71.3(+0.0) | 76.9(+0.5) | 59.3(+0.2) | 54.5(+0.3) | 29.6(+0.0) | 19.7(+0.0) |
- To Weakness 3: “Without the extra fine-tunning trick… Table I and IV in appendix…questionable.”
Reply: Thank you! Even without the trick, UniSeg3D achieves highly competitive performance against OneFormer3D while supporting three additional tasks. It consistently outperforms specialized SOTA methods in interactive, referring, and open-vocabulary segmentation before fine-tuning. We argue that achieving full convergence while optimizing six tasks simultaneously is challenging. Hence, we introduce a simple trick requiring only a few epochs of fine-tuning, allowing UniSeg3D to surpass all current specialized SOTA methods across the six tasks. Importantly, this fine-tuning trick does not add any additional inference cost, making it a practical solution.
| Method | PS | SS | IS | Inter. Seg. | Ref. Seg. | OVS |
|---|---|---|---|---|---|---|
| AGILE3D (ICLR 24) | - | - | - | 53.5 | - | - |
| X-RefSeg3D (AAAI 24) | - | - | - | - | 25.5 | - |
| Open3DIS (CVPR 24) | - | - | - | - | - | 19.0 |
| OneFormer3D (CVPR 24) | 71.2 | 76.6 | 59.3 | - | - | - |
| UniSeg3D w/o fine tuning trick | 71.3(+0.1) | 76.3(-0.3) | 59.1(-0.2) | 54.1(+0.6) | 29.5(+4.0) | 19.6(+0.6) |
Besides, regarding Tables I and IV in the appendix, we would like to clarify two points: 1) For the instance segmentation task, mAP is a more comprehensive metric than mAP_50 and mAP_25. UniSeg3D achieves the same mAP as OneFormer3D but shows notable improvement under stricter metrics (see the table below), indicating more precise mask generation; 2) For the referring segmentation task, mIoU is more comprehensive, and UniSeg3D outperforms the SOTA method X-RefSeg3D by 4.1 mIoU, proving UniSeg3D's superior performance on this task.
| Method | mAP | mAP |
|---|---|---|
| OneFormer3D (CVPR 24) | 41.0 | 29.3 |
| UniSeg3D (Ours) | 42.9(+1.9) | 30.9(+1.6) |
- To Weakness 4 & Question 4: “…interactive segmentation is the best task…no analysis….” / “…table 1 could not support the claim that interactive segmentation is superior in mask prediction…”
Reply: Good suggestion! Visual prompts are crucial for interactive segmentation, providing precise object location priors that guide focus toward foreground targets and reduce background noise. Interactive segmentation can be considered an extension of instance segmentation with the addition of strong object location priors. Consequently, it excels in mask prediction, achieving 76.0 mIoU, 7.9 higher than the 68.1 mIoU of instance segmentation. We will add this discussion and comparison to Table 1 in the next version for clarity.
| Task | mIoU |
|---|---|
| IS (w/o visual prompt) | 68.1 |
| Inter. Seg. (w/ visual prompt) | 76.0(+7.9) |
For questions 2 & 3, please see the reply presented in the “comment” part.
Thanks for the detailed rebuttal. The authors have addressed most of my concerns.
Considering the response from the authors and the reviews from other reviewers, I will change my rating to borderline accept. Please clarify the notations and include the explanations from the rebuttal in the revised version.
We appreciate your thought-provoking reviews and are pleased to see your positive decision. We will carefully revise the notations and add detailed explanations in the revised version. Thank you once again for your positive rating.
This work proposes UniSeg3D, a framework to unify 3D point cloud segmentation tasks. Compared to previous work that unifies 3 tasks, UniSeg3D additionally incorporates interactive segmentation, text-referring segmentation, and open-vocabulary segmentation. In total, six tasks are unified in a single Transformer decoder architecture, and techniques such as knowledge distillation, contrastive learning, and two-stage fine-tuning are applied to boost performance. Consequently, UniSeg3D achieves comparable or superior performance against existing state-of-the-art baselines.
Strengths
- Novel architecture for incorporating six segmentation tasks under one single model, which appears to be effective and flexible for extending to additional tasks.
- Solid experiments. The authors conducted experiments on 3 datasets and various ablation studies to demonstrate the effectiveness of the proposed method.
Weaknesses
- Marginal performance gain. While the referring and interactive segmentation tasks enjoy noticeable improvements in UniSeg3D, the performance on generic segmentation tasks seems to be on par with or worse than training alone (see Table 1). This leads to questions about the motivation for the proposed unification. Further justification, such as the standard deviation of the performance, could make the results more convincing.
- Additionally, in Table 6, unifying the additional 3 tasks turns out to hurt the performance of generic segmentation tasks. How do the authors justify their motivation?
- Some minor issues in writing and notations. See the Questions below.
Questions
- Equation 3: The current notation is for cross products while it looks like dot products are intended.
- Table 1: where does Table 1 come from?
- Notations: the symbols in Equation 4 are not the same as those in Figure 2, and there are no corresponding notations in Figure 3. Consider using different notation or adding corresponding legends in Figures 2 and 3 for better clarity.
- Section 3.3: the motivation of the inter-task association is not clearly stated; how exactly can the tasks benefit each other? For example, how can referring segmentation benefit interactive segmentation?
Limitations
The authors have discussed their limitations in Section 5. Given the paper is titled A Unified Framework for 3D Scene Understanding, the reviewer would like to point out that point cloud segmentation is one aspect of 3D scene understanding, so it would also be worth discussing how such a unified segmentation framework can potentially help other 3D understanding tasks.
Thanks for providing feedback and taking the time to review our work!
- To Weakness 1 & 2: “Marginal performance gain...” / “Additionally, in Table 6, unifying…”
Reply: Good question! Unifying the tasks into a single model saves computation and benefits real-world applications, which has become a trend in the community [1][2]. In this paper, we aim to achieve six 3D segmentation tasks within a single model in one inference pass. We first propose a simple baseline unifying these six tasks and find that it cannot achieve ideal performance; we argue the reasons lie in two aspects: the challenge of creating unified representations and the difficulty of multi-task joint optimization for these tasks. We would like to clarify that the goal of Table 6 is to show that multi-task unification is a challenging topic, e.g., introducing new tasks might hurt the performance of existing tasks, rather than to present our final results. To relieve these impacts, we propose UniSeg3D, which uses queries to unify the representations and employs knowledge distillation with ranking-based contrastive learning to jointly optimize the associated tasks. Extensive experiments demonstrate that our method not only supports six tasks but also surpasses currently specialized SOTA approaches on all tasks. Besides, following your suggestion, we provide the standard deviation of UniSeg3D by training it 3 times. As shown in the table below, we observe only slight standard deviations on the six tasks (a small computation sketch follows the table).
| Times | Panoptic Segmentation (PS) | Semantic Segmentation (SS) | Instance Segmentation (IS) | Interactive Segmentation (Inter. Seg.) | Referring Segmentation (Ref. Seg.) | Open-vocabulary Segmentation (OVS) |
|---|---|---|---|---|---|---|
| 1 | 71.3 | 76.9 | 59.3 | 54.5 | 29.6 | 19.7 |
| 2 | 71.4 | 76.6 | 59.2 | 54.3 | 29.4 | 19.8 |
| 3 | 71.2 | 76.8 | 59.2 | 54.6 | 29.5 | 19.7 |
| Overall | 71.30±0.10 | 76.77±0.15 | 59.23±0.06 | 54.47±0.15 | 29.50±0.10 | 19.73±0.06 |
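For reference, the reported mean±std values are consistent with the sample mean and standard deviation over the three runs; the snippet below reproduces the PS column (assumed computation).

```python
import statistics

ps_runs = [71.3, 71.4, 71.2]       # PS over the three training runs
mean = statistics.mean(ps_runs)    # 71.30
std = statistics.stdev(ps_runs)    # sample standard deviation, ~0.10
print(f"{mean:.2f}±{std:.2f}")     # 71.30±0.10
```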
[1] UniGS: Unified Representation for Image Generation and Segmentation. CVPR 24.
[2] UniVS: Unified and Universal Video Segmentation with Prompts as Queries. CVPR 24.
- To Weakness 3: “Some minor issues…”
Reply: Thank you for pointing out these minor issues. We will carefully revise them in the next version.
- To Question 1: “Equation 3: The current notation…”
Reply: Sorry for this unclear statement. Actually, they are indeed cross products. We will add descriptions in the revised version.
- To Question 2: “Table 1: where does…”
Reply: Thanks! To make full use of the knowledge contained in superior 3D scene understanding tasks, we propose knowledge distillation to transfer knowledge from superior tasks to the other tasks, which improves multi-task performance without extra inference costs. The key to knowledge distillation is to use the task that predicts segmentation masks of the best quality to guide the other tasks, i.e., using a teacher to guide the students. Since visual prompts provide location priors indicating precise object positions, the interactive task naturally performs best in mask prediction. To verify this, we provide quantitative results (mIoU) in Table 1 of the manuscript. Essentially, the main difference between instance and interactive segmentation is the absence or presence of the visual prompt. Here, we further compare instance segmentation with interactive segmentation. As shown in the table below, interactive segmentation significantly outperforms instance segmentation by 7.9 mIoU, indicating the superior quality of the masks predicted by the interactive segmentation task. We will add this discussion to make Table 1 clearer in the revised version.
| Task | mIoU |
|---|---|
| Instance segmentation (w/o visual prompt) | 68.1 |
| Interactive segmentation (w/ visual prompt) | 76.0(+7.9) |
- To Question 3: “Notations: the symbols in Equation 4…”
Reply: Good suggestion! Following your suggestion, we now add the corresponding legends in Figure 3 for better clarity, which can be found in Figure-R 3 of the attached PDF.
- To Question 4: “Section 3.3: the motivation…”
Reply: Sorry for the unclear statement. In this paper, we propose the knowledge distillation loss to implement the inter-task association, i.e., to transfer 3D scene understanding knowledge from the superior task (a good teacher) to the rest of the tasks. Therefore, considering that the visual prompts provide strong object location priors (see the reply to Question 2), we employ interactive segmentation as the teacher. This teacher provides auxiliary guidance, which improves multi-task performance without extra inference costs. We would like to clarify that we employ interactive segmentation, rather than referring segmentation, as the teacher to benefit the other tasks (including referring segmentation). The experimental results below show that, under this inter-task association, we achieve considerable improvements compared with the baseline (an illustrative loss sketch follows the table).
| Method | PS | SS | IS | Inter. Seg. | Ref. Seg. | OVS |
|---|---|---|---|---|---|---|
| Baseline | 70.4 | 76.2 | 58.0 | 54.5 | 29.1 | 19.7 |
| Baseline w/ knowledge distillation loss | 71.2 (+0.8) | 76.2(+0.0) | 59.3 (+1.3) | 56.6 (+2.1) | 29.2 (+0.1) | 19.6 (-0.1) |
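To make the inter-task association concrete, here is a minimal sketch of a mask-level distillation loss in the spirit described above: detached interactive-segmentation (teacher) mask predictions guide another task's (student) masks through a soft dice objective. This is our own illustration under assumed shapes, not the paper's exact knowledge distillation loss.

```python
import torch

def dice_distillation_loss(student_masks, teacher_masks, eps=1.0):
    """Illustrative mask distillation: soft dice between student mask logits and
    detached teacher mask probabilities; shapes are (num_masks, num_points)."""
    s = student_masks.sigmoid()
    t = teacher_masks.sigmoid().detach()   # teacher only provides guidance
    inter = (s * t).sum(-1)
    union = s.sum(-1) + t.sum(-1)
    return (1 - (2 * inter + eps) / (union + eps)).mean()

teacher = torch.randn(10, 4096)                        # interactive-seg mask logits
student = torch.randn(10, 4096, requires_grad=True)    # e.g., instance-seg mask logits
loss = dice_distillation_loss(student, teacher)
loss.backward()
```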
- To Limitation:
Reply: Titling 3D segmentation research as 3D scene understanding has been a common practice in this community [1][2]. We will explore the relationships between 3D segmentation and other 3D scene understanding tasks in the future. Besides, UniSeg3D can provide pixel-level semantic information, which might serve as a potentially reliable input for other 3D understanding tasks.
[1] OpenScene: 3D Scene Understanding With Open Vocabularies. CVPR 23.
[2] Exploring Data-Efficient 3D Scene Understanding With Contrastive Scene Contexts. CVPR 21.
Thank you for addressing my questions and concerns. I appreciate the added experiments and the detailed explanation. I am happy to keep my original rating.
We sincerely thank the reviewer for the constructive feedback and support. Your comments are valuable for us in improving the quality of this work. We will incorporate your suggestions in the revision.
This paper proposes UniSeg3D, a unified framework for six 3D segmentation tasks that achieves SOTA results on the six tasks. The authors propose to use knowledge distillation and ranking-based contrastive learning to enhance inter-task knowledge sharing and the overall performance. Extensive experiments are done to prove that UniSeg3D is powerful. Comprehensive ablation studies are performed to prove the effectiveness of the design.
Strengths
- UniSeg3D is the first framework that unifies six tasks in 3D segmentation, and the comprehensive experimental results prove the effectiveness of the design.
- Interactive segmentation-guided training is insightful. The analysis of its impact on the inter-task formulation is comprehensive and enlightens future directions.
- Well-written manuscript with illustrative figures.
Weaknesses
- Since the feature for the visual prompt is sampled from the superpoints, the quality of the visual prompt significantly influences the overall performance of the model.
- The experiments are conducted only on ScanNet-based datasets; as a unified model, the authors should provide more experiments on different datasets to validate the method, as done in previous works [1][2].
- Unlike previous works [2][3], UniSeg3D directly learns the relation between text and 3D without any 2D supervision or guidance. It is a concern how "open" this framework is in the OVS task. Some visualized experiments on open-set text queries, as in [2][3][4], could answer the question.
[1] Zhu, Ziyu, et al. "3D-VisTA: Pre-trained transformer for 3D vision and text alignment." ICCV 2023.
[2] Nguyen, Phuc, et al. "Open3DIS: Open-vocabulary 3D instance segmentation with 2D mask guidance." CVPR 2024.
[3] Takmaz, Ayça, et al. "OpenMask3D: Open-vocabulary 3D instance segmentation." arXiv preprint arXiv:2306.13631 (2023).
[4] Peng, Songyou, et al. "OpenScene: 3D scene understanding with open vocabularies." CVPR 2023.
Questions
Please refer to the weaknesses.
Limitations
The authors have adequately addressed the limitations and potential negative societal impact of their work.
Thank you for your valuable feedback!
- To Weakness 1: “Since the feature for visual prompt is sampled from the superpoints, the quality of the visual prompt significantly influence the overall performance of the model.”
Reply: Thanks. For the interactive segmentation task, it is intuitive that different visual prompts yield different results. To investigate this influence, we study three variants: random prompts, center prompts, and the prompts provided by the SOTA method AGILE3D [1]; the performance is shown in the table below (an illustrative sampling sketch follows the reference). We can find that, under the fair setting, i.e., using the prompts from AGILE3D, the proposed UniSeg3D achieves a notable performance improvement over the SOTA. For the other tasks, due to the attention masks in the cross-attention layers, the visual prompt does not influence their performance. We will add this discussion in the revised version.
| Method | Used sampling criteria | AP | AP_50 | AP_25 |
|---|---|---|---|---|
| AGILE3D (ICLR 24) [1] | AGILE3D (ICLR 24) [1] | 53.5 | 75.6 | 91.3 |
| UniSeg3D | AGILE3D (ICLR 24) [1] | 54.5 | 79.4 | 93.2 |
| UniSeg3D | Center | 56.6 | 82.1 | 94.9 |
| UniSeg3D | Random | 51.3 | 75.2 | 89.6 |
[1] AGILE3D: Attention Guided Interactive Multi-object 3D Segmentation. ICLR 2024.
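For clarity, a small sketch of the "Center" and "Random" sampling criteria compared above is given below; the function name `sample_click` and the use of a ground-truth instance mask are our own assumptions, not AGILE3D's or the paper's exact procedure.

```python
import numpy as np

def sample_click(points, instance_mask, criterion="center", rng=None):
    """Illustrative visual-prompt sampling for one ground-truth instance.
    points: (N, 3) xyz; instance_mask: (N,) bool; returns the clicked point index."""
    rng = rng or np.random.default_rng(0)
    idx = np.flatnonzero(instance_mask)
    if criterion == "random":
        return rng.choice(idx)                    # "Random" row in the table above
    centroid = points[idx].mean(axis=0)           # "Center" row: point nearest
    dists = np.linalg.norm(points[idx] - centroid, axis=1)  # to the instance centroid
    return idx[np.argmin(dists)]

pts = np.random.rand(1000, 3)
mask = np.zeros(1000, dtype=bool); mask[:50] = True
print(sample_click(pts, mask, "center"), sample_click(pts, mask, "random"))
```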
- To Weakness 2: “The experiments are only conducted on ScanNet-based datasets, as a unified model, the authors should provide more experiments on different datasets to validate the method as done in previous works[1][2].”
Reply: Good suggestion! To the best of our knowledge, ScanNet is the only publicly available dataset supporting six tasks. Thus, to explore a unified framework supporting six tasks in a single model, we mainly conduct experiments on the ScanNet-based datasets in the manuscript.
We note that work [1] mainly focuses on the visual grounding task, i.e., a 3D object detection task, while this paper focuses on various segmentation tasks. The second work [2] explores the open-vocabulary segmentation task, one of our supported tasks, on the Replica dataset [3]. Hence, we follow work [2] to conduct zero-shot open-vocabulary segmentation on the Replica [3] dataset. The table below demonstrates that UniSeg3D surpasses the current SOTA approach Open3DIS.
| Method | AP | AP_25 |
|---|---|---|
| OpenScene (CVPR 23) | 10.9 | 17.3 |
| OpenMask3D (NeurIPS 23) | 13.1 | 24.2 |
| Open3DIS (CVPR 24) [2] | 18.5 | 28.2 |
| UniSeg3D (Ours) | 19.1 | 29.2 |
As a supplement, we conduct experiments on the S3DIS dataset. Note that the S3DIS dataset does not provide text expressions, i.e., it cannot support the referring segmentation task. Here, we evaluate the panoptic, semantic, instance, and interactive segmentation tasks on S3DIS. We can see that UniSeg3D outperforms the current specialized SOTA approaches, especially on the challenging panoptic segmentation task, verifying the effectiveness of our method. We will include these new comparisons in the revised version.
| Method | Panoptic Segmentation (PQ) | Semantic Segmentation (mIoU) | Instance Segmentation (mAP) | Interactive Segmentation (AP) |
|---|---|---|---|---|
| PointNeXt-XL (NeurIPS 22) | - | 70.5 | - | - |
| PointTransformerV2 (NeurIPS 22) | - | 71.6 | - | - |
| PBNet (ICCV 23) | - | - | 53.5 | - |
| Mask3D (ICRA 23) | - | - | 57.8 | - |
| OneFormer3D (CVPR 24) | 62.2 | 72.4 | 58.7 | - |
| UniSeg3D (Ours) | 65.7 | 72.5 | 58.7 | 29.9 |
[1] 3d-vista: Pre-trained transformer for 3d vision and text alignment. ICCV 23.
[2] Open3dis: Open-vocabulary 3d instance segmentation with 2d mask guidance. CVPR 24.
[3] The Replica dataset: A digital replica of indoor spaces. arXiv 19.
- To Weakness 3: “Unlike previous works[2][3], UniSeg3D directly learn the relation between text and 3D without any 2D supervision or guidance. It is a concern that how "open" is this framework in OVS task. Some visualized experiments on open-set text queries as in [2][3][4] could answer the question.”
Reply: Thanks for recognizing the distinctiveness of our work! The proposed UniSeg3D directly learns relations between text and 3D, mitigating the dependency on 2D supervision. Previous works [2][3][4] mainly employ two experimental settings to evaluate how “open” they are, i.e., using random open-set text prompts on the ScanNet200 dataset and cross-dataset evaluation. Following your suggestion, we visualize the open-vocabulary segmentation results under these two settings; please see Figure-R 1 and Figure-R 2 in the attached PDF for details. As shown, UniSeg3D demonstrates its applicability in the following aspects:
(1) support open-set classes which are not included in the training data (illustrated in Figure-R 1);
(2) support attribute descriptions such as affordance and color (illustrated in Figure-R 1);
(3) support cross-dataset point cloud data (illustrated in Figure-R 2).
Given the above three capabilities, we believe UniSeg3D presents a desirable degree of “openness.”
Thank you for your detailed responses, which have made the work more comprehensive and addressed my concerns regarding the open-vocabulary attribute. I hope the author can add the additional visualization results to the revised version of the paper. Considering the impact of this work, I'll maintain my initial score.
Thank you for your valuable suggestions. We appreciate your acknowledging that our rebuttal addressed the concerns. We will add these visualizations to the revised version to present the effectiveness of our work more comprehensively.
For the first time, this work proposes a unified model for several point cloud segmentation tasks, including panoptic, semantic, instance, OV, interactive, and referring segmentation. This work uses the typical query-based transformer perception architecture with the proposed knowledge distillation losses for superior performance, outperforming previous methods, which shows the effectiveness of the designed model.
Strengths
- This work addresses an important problem of 3D scene perception with a unified model and achieves great results.
- The proposed model is simple and effective.
- The overall writing is fluent and clear.
Weaknesses
- The main weakness of this work is that the proposed architecture is widely used for multi-modal perception tasks, limiting the model's novelty. However, this is acceptable as long as the model's performance is indeed great, as simple and effective models are usually similar.
- The paper claims knowledge distillation losses as one of its contributions. However, according to the ablations, adding these losses only makes marginal changes.
Still, I think this work makes a solid contribution to the field (if the code is open-sourced with reproducible results) and should be accepted.
Questions
Some implementations of the method are unclear.
- The detailed process of producing super points.
- The criteria for sampling the corresponding points from the clicks.
- For the referring segmentation, how to decide the target segmentation mask from many text tokens?
- For L182-186, I am confused: isn't formula 5 already satisfied by using formula 4?
Limitations
See above.
Thanks for providing feedback and taking the time to review our work! We promise that the training/inference codes, logs, and checkpoints will be released.
Weaknesses:
- To Weakness 1: “The main weakness of this work is that the proposed architecture is widely used for multi-modal perception tasks, limiting the model’s novelty. However, this is acceptable as long as the model’s performance is indeed great, as simple and effective models are usually similar.”
Reply: Thanks. This paper aims to establish a simple method to unify various 3D understanding tasks. Thus, we keep the architecture as simple as possible so that it can serve as a solid baseline for future unified research. We also appreciate that the reviewer agrees the simplicity of the proposed architecture is acceptable.
- To Weakness 2: “The paper claims knowledge distillation losses as one of its contributions. However, according to the ablations, adding these losses only makes marginal changes.”
Reply: Thanks. We first construct a unified baseline that achieves highly competitive performance on six different tasks. Based on this baseline, we further propose the knowledge distillation losses, which not only maintain consistent SOTA performance on the referring segmentation and open-vocabulary segmentation tasks but also improve the panoptic, instance, and interactive segmentation tasks by a notable 0.8 PQ, 1.3 mAP, and 2.1 AP, respectively. To fully explore their effectiveness, we choose the representative instance segmentation task and discuss the results under stricter metrics, as shown below. We can see that the proposed knowledge distillation losses bring a more noticeable performance gain (+2.3) under the strictest reported metric compared with the baseline.
| Method | mAP | mAP | mAP |
|---|---|---|---|
| Baseline | 58.0 | 76.2 | 54.5 |
| Baseline w/ knowledge distillation losses | 59.3(+1.3) | 76.9(+0.7) | 56.8(+2.3) |
Questions:
Sorry for the unclear details. We give detailed explanations for each question in the following content.
- To Question 1: “The detailed process of producing super points.”
Reply: The superpoints are produced in an unsupervised manner, and they are formed by grouping points with locally similar geometric structures. Specifically, our procedures for producing superpoints can be divided into three steps:
Step 1: generate handcrafted local geometric features of the 3D points;
Step 2: employ the farthest point sampling (FPS) algorithm in the coordinate space of the point clouds to obtain the initial superpoint centers;
Step 3: construct an association map between the points and the superpoint centers.
We will add an introduction of this detailed superpoint-producing process in the revised version (an illustrative sketch of the three steps is given below).
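The sketch below walks through the three steps with toy geometric features (xyz plus normals); it is our own illustration under these assumptions, not the exact superpoint-generation implementation.

```python
import numpy as np

def build_superpoints(points, normals, num_superpoints=512):
    """Illustrative sketch of the three steps above:
    1) handcrafted local geometric features, 2) FPS for initial centers,
    3) associate every point with its nearest center."""
    feats = np.concatenate([points, normals], axis=1)     # step 1 (toy features)

    # step 2: farthest point sampling in the coordinate space
    centers = [0]
    dist = np.full(len(points), np.inf)
    for _ in range(num_superpoints - 1):
        dist = np.minimum(dist, np.linalg.norm(points - points[centers[-1]], axis=1))
        centers.append(int(dist.argmax()))

    # step 3: association map point -> superpoint center (nearest in feature space)
    d = np.linalg.norm(feats[:, None, :] - feats[None, centers, :], axis=-1)
    return d.argmin(axis=1)                               # (N,) superpoint ids

pts = np.random.rand(2048, 3); nrm = np.random.rand(2048, 3)
sp_ids = build_superpoints(pts, nrm, num_superpoints=64)
```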
- To Question 2: “The criteria for sampling the corresponding points from the clicks.”
Reply: We sample the 3D points nearest to the clicks as the corresponding points. In practice, clicks are made on 2D projections of the 3D scenes, so each click is initially represented as a 2D coordinate in the corresponding projection. When the 2D projections are mapped back to the original 3D scenes, the clicks are concurrently mapped back to concrete 3D coordinates in the scenes. Afterward, we sample the points nearest to these 3D coordinates as the corresponding points (see the minimal sketch below).
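A minimal sketch of the final nearest-point lookup is given below, assuming the click has already been mapped back to a 3D coordinate; names are illustrative.

```python
import numpy as np

def click_to_point(click_xyz, scene_points):
    """Illustrative nearest-point lookup: the click, already mapped back from the
    2D projection to a 3D coordinate, is snapped to the closest scene point."""
    d = np.linalg.norm(scene_points - click_xyz, axis=1)
    return int(d.argmin())              # index of the sampled corresponding point

scene = np.random.rand(10000, 3)
click = np.array([0.5, 0.5, 0.5])       # 3D coordinate recovered from the 2D click
idx = click_to_point(click, scene)
```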
- To Question 3: “For the referring segmentation, how to decide the target segmentation mask from many text tokens?”
Reply: Each textual expression refers to a specific object segmentation mask, and the text tokens are individually encoded from their respective textual expressions. We feed the text tokens into the mask decoder and generate the respective mask predictions. Therefore, the predicted target segmentation masks naturally correspond to the input text tokens in a one-to-one manner: with many text tokens, each target segmentation mask is assigned to its respective input text token (a minimal sketch follows).
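The one-to-one correspondence can be sketched as follows; the simple dot-product mask head shown here is an assumption for illustration, not the exact mask decoder.

```python
import torch

def refer_masks(text_tokens, point_feats):
    """Illustrative one-to-one decoding: each encoded text expression acts as a
    query and yields exactly one mask over the scene points."""
    # (num_expressions, dim) @ (dim, num_points) -> (num_expressions, num_points)
    return (text_tokens @ point_feats.t()).sigmoid() > 0.5

text_tokens = torch.randn(3, 256)       # three referring expressions
point_feats = torch.randn(4096, 256)    # per-point (or per-superpoint) features
masks = refer_masks(text_tokens, point_feats)   # masks[i] belongs to expression i
```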
- To Question 4: “For L182-186, I am confused that isn't formula-5 already satisfied by using formula-4?”
Reply: Sorry for the confusing statement. Formula 5 is not automatically satisfied by using formula 4: formula 5 imposes an additional, piecewise-defined condition on the input parameters of formula 4, which formula 4 alone does not guarantee. We will clarify the involved notation in the revised version.
My concerns have all been addressed, and I would like to thank the authors for their time. Considering the impacts of this work, I will keep my original score.
Thank you for your recognition of our responses and for identifying the impacts of our work. We value your comments and will carefully organize the codes and checkpoints for release.
Dear Reviewers,
We sincerely thank the reviewers for their thoughtful comments and feedback. We are encouraged that the reviewers appreciate the simple and novel architecture and insightful modules of UniSeg3D, the solid experiments of the proposed method, and the well-written paper. We provide detailed responses to each reviewer and promise to incorporate all feedback in the revised version.
Best regards,
Paper 1782 Authors
Dear Reviewers,
We sincerely appreciate your time and effort in reviewing our paper. We hope our explanations have addressed your concerns. As we are in the discussion phase, we welcome any additional comments or questions regarding our response or the main paper. If further clarification is needed, please do not hesitate to mention it, and we will promptly address your inquiries. We look forward to receiving your feedback.
Best regards,
Paper 1782 Authors
This paper presents a multi-purpose framework for 3D scene segmentation, allowing panoptic, semantic, instance, and other types of segmentation tasks by a single model. The reviewers acknowledge that unifying all these tasks is novel, and note the simplicity of the model, the clarity of the writing, and the strong results. The paper underwent improvements during the rebuttal phase, adding experiments on S3DIS and adding more baselines. Reviewer 3AFf provided constructive criticism on the writing, which the authors are encouraged to make use of. Given these positive reviews, the AC’s decision is to accept. Congratulations to the authors on the strong submission! One minor suggestion (by the AC) is to make the title more unique, considering (as mentioned in the paper) this is not the only “unified framework for 3D scene understanding".