PaperHub
Score: 6.8 / 10
Poster · 4 reviewers
Ratings: 4, 5, 4, 4 (min 4, max 5, std 0.4)
Confidence: 3.5
Novelty: 3.3 · Quality: 3.0 · Clarity: 3.0 · Significance: 2.8
NeurIPS 2025

COS3D: Collaborative Open-Vocabulary 3D Segmentation

OpenReview · PDF
Submitted: 2025-04-14 · Updated: 2025-10-29
TL;DR

We present COS3D, a new collaborative prompt-segmentation framework that effectively integrates complementary language and segmentation cues throughout the entire pipeline.

Abstract

Keywords
Open-vocabulary 3D Segmentation, segmentation, language

Reviews and Discussion

Review
Rating: 4

This paper proposes COS3D, a new open-vocabulary 3D segmentation technique that combines language and segmentation cues. COS3D utilizes a "collaborative field" consisting of an instance field (encoding object boundaries) and a language field (aligned with textual cues). A two-step framework is introduced: first training a discriminative instance field and then projecting these instance feature maps to language feature maps. At inference time, a text prompt is used for initial segmentation, which is then improved with an adaptive prompt strategy. Extensive experiments exhibit large improvements on standard benchmarks, evidencing both efficacy and efficiency.

Strengths and Weaknesses

Strengths:

Quality: The methods are technically solid and well-justified. Experiments make many interesting comparisons and clearly exhibit better performance relative to strong baselines. Ablations are extensive and insightful.

Clarity: In general, writing and structure are clear. Key ideas—collaborative field, two-stage training, adaptive inference—are explicated sensibly and supported by useful visualizations.

Significance: The approach is highly relevant to open-vocabulary 3D segmentation and demonstrates significant improvement over existing efforts. Useful applications in robotics and hierarchical segmentation enhance its utility.

Novelty: The collaborative strategy, specifically the instance-language mapping and adaptive prompt tuning, is an original contribution.

Weaknesses:

Although the authors claim open-vocabulary capability, evaluation primarily utilizes fixed object classes. To make stronger assertions, performance would need to be demonstrated on broader queries such as attributes, materials, or abstract concepts.

The algorithm is tailored to 3D Gaussian Splatting scenes; generalization to other 3D representations is neither obvious nor demonstrated, which limits applicability.

Adaptive refinement utilizes hyperparameters (thresholds) whose sensitivity is unexplored, raising questions about practical robustness.

Questions

  1. How sensitive is COS3D’s adaptive refinement to the thresholds employed? Can the authors give an analysis or advice on how to choose these parameters?

  2. Could the authors present quantitative results on non-category queries such as materials like "metal," and abstract concepts like "small" to better substantiate the open-vocabulary claim?

  3. Kernel regression is reported to be superior to its MLP counterpart in all cases. Could the authors elaborate on practical trade-offs, such as scalability and computational complexity, for these implementations?

Limitations

yes

Final Justification

Thanks for addressing my questions; I will keep my scores.

Formatting Concerns

N/A

Author Response

We thank the reviewer for appreciating our work (e.g., original contribution, proposing a technically solid and well-justified method, conducting extensive experiments, and maintaining a clear structure) and for providing insightful comments (e.g., non-category queries and the trade-off between kernel regression and the MLP counterpart). We are pleased to address the issues raised in the review.

Q1: Non-category queries.
A1: Thank you for the insightful suggestion. Note that the standard benchmarks (e.g., LeRF, ScanNet) commonly used in existing open-vocabulary baselines do not include evaluations for non-category queries. But we have provided some qualitative examples with non-category queries (e.g., materials: fleece, affordance: drinkable, abstract: music) in Supp. Fig. 1 and Supp. Fig. 4. These results demonstrate the capability of our approach to understand non-category queries to some extent. Both the quantitative comparisons on standard evaluations and the qualitative examples with non-category queries consistently highlight the effectiveness of our proposed COS3D. We agree that building a benchmark for non-category queries and designing comprehensive quantitative evaluations is an interesting and valuable direction, which we leave for future work.

Q2: 3D Gaussian Splatting (3DGS) representations.
A2: Thanks. Our COS3D adopts 3DGS as the 3D representation with the following considerations. First, 3DGS has become a widely used and popular representation for real 3D scenes to support efficient rendering. Moreover, it has also been adopted by recent baselines [1, 2, 3, 4, 5] for the open-vocabulary 3D segmentation task.

  • Generalization to other 3D representations: We believe our methods have the potential to generalize to other 3D representations, such as point clouds. For instance, recent advancements (e.g., VGGT [6]) can connect the original 2D image with 3D representations (e.g., point (cloud) maps), enabling the use of 2D foundation models to provide open-vocabulary segmentation for 3D point clouds. We leave further explorations in future work.

We will include these discussions regarding further generalization in the revised version.

Q3: Hyperparameter analysis in adaptive refinement.
A3: Thank you for your questions. There are two hyperparameters in the adaptive refinement algorithm: the region expansion threshold $\mathcal{T}$ and the widely used relevance threshold $\tau$.

  • The region expansion threshold $\mathcal{T}$: Instead of using a fixed value for this hyperparameter, we implement a hyperparameter generation strategy as described in Algorithm 1 (see Supp. Page 2). In this approach, the final expansion threshold is determined based on the calculated mean segmentation feature distances. Additionally, we conducted experiments with different expansion thresholds by applying perturbations (ranging from -20% to 20%) to the mean feature distances. The results, shown in the table below, demonstrate that COS3D achieves stable performance within a reasonable range of region threshold values and consistently outperforms the latest SOTA method (InstanceGaussian [5]).

Table 1. Quantitative comparisons of different region expansion threshold $\mathcal{T}$ values. We conduct the experiments under different expansion thresholds by adding a moderate perturbation (from -20% to 20%) to the mean feature distances.

| Perturbation | mIoU ↑ | mAcc ↑ |
| --- | --- | --- |
| -20% | 49.96 | 72.44 |
| -10% | 49.90 | 72.44 |
| 0.0 (default) | 50.76 | 72.08 |
| +10% | 49.68 | 71.28 |
| +20% | 49.87 | 72.18 |
| InstanceGaussian [5] | 45.30 | 58.44 |
  • The relevance threshold $\tau$: This hyperparameter has been widely used and proven to be relatively robust in existing methods [1, 2, 3, 4, 5]. Therefore, we simply set it to a fixed value of 0.5 throughout our experiments (both thresholds are illustrated in the sketch below).
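
For intuition, the short sketch below illustrates how the two thresholds could interact at inference: the relevance threshold $\tau$ selects an initial region from the language field, and a data-driven expansion threshold $\mathcal{T}$ then grows the region in the instance field. This is only a minimal illustration under our own assumptions; the array layouts, the cosine-similarity step, and the mean-distance rule are stand-ins, and the authors' actual procedure is Algorithm 1 in the supplement.

```python
import numpy as np

def adaptive_segmentation(lang_feats, inst_feats, text_feat, tau=0.5):
    """Minimal sketch of the two-threshold inference step (not the authors' code).

    lang_feats: (N, Dl) per-Gaussian language features (CLIP-aligned)
    inst_feats: (N, Di) per-Gaussian instance features
    text_feat:  (Dl,)   CLIP text embedding of the query
    """
    # 1) Relevance map from the language field (cosine similarity), thresholded by tau.
    rel = lang_feats @ text_feat
    rel = rel / (np.linalg.norm(lang_feats, axis=1) * np.linalg.norm(text_feat) + 1e-8)
    seed = rel > tau                       # initial language-driven segmentation
    if not seed.any():                     # nothing passes the relevance threshold
        return seed

    # 2) Expansion threshold T derived from mean instance-feature distances
    #    (an illustrative stand-in for Algorithm 1, not the exact rule).
    center = inst_feats[seed].mean(axis=0)
    dists = np.linalg.norm(inst_feats - center, axis=1)
    T = dists[seed].mean()

    # 3) Refine the mask in the instance field using the data-driven threshold.
    return dists < T
```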

Q4: Practical analysis between kernel regression and its MLP counterpart.
A4: This is a great question. First, both kernel regression and MLP are approaches used to implement our instance-to-language mapping function. The improved performance achieved by these two strategies demonstrates the effectiveness of the mapping design.

We selected these two implementations primarily because of their distinct theoretical properties. Specifically, kernel regression is a traditional fitting technique that provides strong interpretability (e.g., transparent kernel functions) and intuitive locality. In contrast, the MLP counterpart can handle flexible data patterns (both local and global) and requires only the storage of model parameters after training, making it more compact.

In practice, we observed a trade-off between accuracy and memory efficiency. On the one hand, kernel regression achieves better accuracy compared to the MLP counterpart. As discussed in the ablation section of the main paper, we attribute this to the discriminative instance features chosen as the source domain, which make the mapping process inherently an easy regression task. In such cases, the traditional kernel regression method is well-suited.

On the other hand, the MLP counterpart, consisting of small layers, offers better GPU memory efficiency. Specifically, the computational complexity of kernel regression depends on the size of the training data, resulting in higher memory requirements. To illustrate this, we report the following table comparing the accuracy and GPU memory usage of kernel regression and MLP, using an identical input batch size of 100K.

Table 2. Trade-off between accuracy and memory efficiency on LeRF.

| Type | mIoU ↑ | mAcc ↑ | GPU Memory ↓ |
| --- | --- | --- | --- |
| MLP | 49.75 | 70.60 | 819M |
| Kernel regression | 50.76 | 72.08 | 6863M |
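
To make the trade-off concrete, here is a rough sketch of what a kernel-regression implementation of the instance-to-language mapping could look like (our own illustration, not the authors' code: the Gaussian kernel form, feature shapes, and the MLP comment are assumptions). The quadratic pairwise-weight matrix is what drives the memory gap reported above.

```python
import numpy as np

def kernel_regression_map(query_inst, train_inst, train_lang, sigma=0.1):
    """Nadaraya-Watson regression from instance features to language features.

    query_inst: (Q, Di) instance features to map
    train_inst: (N, Di) instance features with paired language supervision
    train_lang: (N, Dl) target language features
    Memory scales with Q x N (the pairwise weight matrix), which explains the
    larger GPU footprint; there is no optimization step, hence the stability.
    """
    d2 = ((query_inst[:, None, :] - train_inst[None, :, :]) ** 2).sum(-1)  # (Q, N)
    w = np.exp(-d2 / (2.0 * sigma ** 2))                                   # Gaussian kernel
    w = w / (w.sum(axis=1, keepdims=True) + 1e-8)
    return w @ train_lang                                                  # (Q, Dl)

# A hypothetical MLP counterpart would instead fit a small network, e.g.
# nn.Sequential(nn.Linear(Di, 256), nn.ReLU(), nn.Linear(256, Dl)),
# to regress train_lang from train_inst and store only its weights.
```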

[1] LangSplat: 3D Language Gaussian Splatting. CVPR 2024.
[2] LEGaussians: Language Embedded 3D Gaussians for Open-Vocabulary Scene Understanding. CVPR 2024.
[3] OpenGaussian: Towards Point-Level 3D Gaussian-based Open Vocabulary Understanding. NeurIPS 2024.
[4] Dr. Splat: Directly Referring 3D Gaussian Splatting via Direct Language Embedding Registration. CVPR 2025.
[5] InstanceGaussian: Appearance-Semantic Joint Gaussian Representation for 3D Instance-Level Perception. CVPR 2025.
[6] VGGT: Visual Geometry Grounded Transformer. CVPR 2025.

Comment

Thanks for addressing my questions; I will keep my scores.

Comment

Dear Reviewer EzJB,

Thank you for your feedback on our rebuttal. We are glad that our response has addressed your questions. We greatly appreciate your kind support for our work!

Sincerely,
The Authors

Review
Rating: 5

The paper COS3D: Collaborative Open-Vocabulary 3D Segmentation proposes a novel take on the problem of 3D segmentation by using both an instance field and a language field, which are connected through a learned mapping (MLP or kernel). In addition, the authors propose an inference strategy to improve segmentation quality via adaptive language-to-instance prompt refinement. Finally, the method is evaluated across a set of 3D segmentation benchmarks.

Strengths and Weaknesses

Strengths:

  • The proposed technical novelties are interesting and effective
  • The concepts of creating instance and language fields, taken together as the collaborative field, is sound and seems to work well
  • I appreciate the technical discussion parts after each of the methods subsections
  • The mechanism for adaptive prompt refinement makes sense and further improves the overall method
  • The method is somewhat ablated

Weaknesses:

  • I’m completely missing ablations on how robust the approach is when using VLMs other than CLIP or segmentation models other than SAM. Since the method relies heavily on these off-the-shelf models, I think ablating 2 other choices per model would be appropriate. For VLMs, I would suggest SigLIP [1] or BLIP [2]. For the segmentation model, SAM2 [3] and Semantic SAM [4] at different granularities would be interesting.
  • L. 161 states „First, we learn a segmentation-aware instance field supervised by the 2D SAM segmentation“, but in the following section Stage 1 paragraph, there is no mention of how SAM is used/prompted to generate the 2D instance masks. The authors should be more clear in this section.
  • Several typos in Figure 2 where the authors write filed instead of field
  • Same in Figure 4, the authors write „Ous“ instead of „Ours“

[1] Zhai, Xiaohua, et al. "Sigmoid loss for language image pre-training." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023.
[2] Li, Junnan, et al. "BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation." International Conference on Machine Learning. PMLR, 2022.
[3] Ravi, Nikhila, et al. "SAM 2: Segment anything in images and videos." arXiv preprint arXiv:2408.00714 (2024).
[4] Li, Feng, et al. "Segment and Recognize Anything at Any Granularity." European Conference on Computer Vision. Cham: Springer Nature Switzerland, 2024.

Questions

My main remaining issue is the lack of ablations for the 2D foundation models. How does the performance change if you use other VLMs / Segmentation Models, for example the ones cited above? Additionally, how is SAM used/prompted to generate the 2D instance masks?

Limitations

yes

Final Justification

The authors have addressed my concerns sufficiently. Therefore, I believe the paper is ready for acceptance and raise my score.

Formatting Concerns

None

Author Response

We thank the reviewer for their helpful comments and suggestions (e.g., ablation on different VLMs), which have helped enhance the quality of our paper. We are pleased to address the issues raised in the review.

Q1: Ablation on different VLMs.
A1: Thank you for your suggestions. We utilize CLIP and SAM as the default 2D language and segmentation models in our experiments, following common practice in recent baselines [1, 2, 3, 4, 5], to ensure that the improvements are attributed to the proposed collaborative fields design.

However, our methods can adopt other foundation models. Based on your suggested alternatives, we conducted an ablation study comparing different 2D foundation models (e.g., CLIP [6] vs. SigLIP [7], SAM [8] vs. SAM2 [9] & Semantic SAM [10]). The results, shown in the following table, demonstrate that our framework is compatible with different 2D foundation models. Furthermore, we empirically observed that using more advanced models (e.g., SAM2 for segmentation, SigLIP for language) can lead to performance improvements. We sincerely appreciate your insightful suggestions, which contributed to these valuable observations!

We will include ablation results on different 2D foundation models in the revised version of our paper.

Table 1. Ablation on 2D foundation models on LeRF.

| Model | Segmentation | Language | mIoU ↑ | mAcc ↑ |
| --- | --- | --- | --- | --- |
| COS3D (default) | SAM [8] | CLIP [6] | 50.76 | 72.08 |
| COS3D | SAM [8] | SigLIP [7] | 51.08 | 73.79 |
| COS3D | SAM2 [9] | CLIP [6] | 51.94 | 75.05 |
| COS3D | Semantic SAM [10] | CLIP [6] | 49.93 | 70.94 |

Q2: Procedure for SAM instance mask generation.
A2: Thanks. To generate the 2D instance masks from SAM, we follow the automatic mask generation strategies commonly used in prior works [1, 2, 3, 4, 5]. Specifically, we create a grid of point prompts across the image and obtain the final segmentation results using the same hyperparameters as in OpenGaussian [3]. Further details can be found in the supplementary materials (Supp. L26–L28). We will clarify this procedure in the Sec. 3.2 paragraph ("Stage 1: instance field learning") in the revised version.
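
For readers unfamiliar with this step, automatic mask generation with a grid of point prompts typically looks like the snippet below. This is only a sketch using the public `segment_anything` package; the checkpoint path, image path, and `points_per_side` value are placeholders rather than the hyperparameters actually used (which follow OpenGaussian).

```python
import cv2
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

# Load a SAM backbone (checkpoint path is a placeholder).
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth").to("cuda")

# A regular grid of point prompts across the image; points_per_side controls
# the prompt density (the value shown is illustrative, not the paper's setting).
mask_generator = SamAutomaticMaskGenerator(sam, points_per_side=32)

image = cv2.cvtColor(cv2.imread("frame.png"), cv2.COLOR_BGR2RGB)
masks = mask_generator.generate(image)  # list of dicts with "segmentation", "area", ...
```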

Q3: Typos.
A3: Thank you for pointing out the typos in Fig. 2 and Fig. 4. We will update them in the revised version.

[1] LangSplat: 3D Language Gaussian Splatting. CVPR 2024.
[2] LEGaussians: Language Embedded 3D Gaussians for Open-Vocabulary Scene Understanding. CVPR 2024.
[3] OpenGaussian: Towards Point-Level 3D Gaussian-based Open Vocabulary Understanding. NeurIPS 2024.
[4] Dr. Splat: Directly Referring 3D Gaussian Splatting via Direct Language Embedding Registration. CVPR 2025.
[5] InstanceGaussian: Appearance-Semantic Joint Gaussian Representation for 3D Instance-Level Perception. CVPR 2025.
[6] Learning transferable visual models from natural language supervision. ICML 2021.
[7] Sigmoid Loss for Language Image Pre-Training. ICCV 2023.
[8] Segment Anything. ICCV 2023.
[9] SAM 2: Segment Anything in Images and Videos. ICLR 2025.
[10] Semantic-SAM: Segment and Recognize Anything at Any Granularity. ECCV 2024.

Comment

I would like to thank the authors for addressing my concerns. In my opinion, the paper is in a good state and ready for acceptance.

Comment

Dear Reviewer yYMN,

Thank you for your feedback on our rebuttal. We are pleased that our response has addressed your questions. We greatly appreciate your kind support for our work!

Sincerely,
The Authors

Review
Rating: 4

This paper proposes COS3D, a collaborative prompt-based framework for open-vocabulary 3D segmentation using 3D Gaussian Splatting. It introduces a collaborative field comprising an instance field and a language field. During training, an instance-to-language mapping is used to construct the language field efficiently. During inference, an adaptive language-to-instance prompt refinement is applied to improve segmentation quality. The method outperforms previous approaches on LeRF and ScanNetv2, demonstrating strong performance and practical applicability.

Strengths and Weaknesses

Strengths:

  1. The idea of constructing a collaborative field comprising both language and instance features is quite novel and addresses the limitations of prior approaches that rely solely on either.
  2. The proposed method achieves strong performance improvements over state-of-the-art on LeRF and ScanNetv2 benchmarks in both mIoU and mAcc.
  3. The paper is well-organized and includes informative visualizations and ablation studies.

Weaknesses:

  1. The authors mention that the proposed method lacks reasoning ability. While Qwen is used in the supplementary materials, the main method still relies on CLIP for language features. It would be valuable to explore whether replacing CLIP with more advanced language models like GPT or Qwen could enhance the reasoning capacity of the language field. I am particularly curious whether such substitutions could lead to performance gains.
  2. While both MLP and kernel regression versions are implemented, the kernel regression appears hand-crafted with hyperparameters fixed. Moreover, the authors do not sufficiently analyze when or why one version might fail or perform better in the experiment section.
  3. The training time comparison in Table 5 is insufficient. For example, one of the baseline methods, Instance Gaussian, lacks reported training time, and the training time of other competing methods in Table 1 and 2 is not reported either. This provides little support to COS3D’s claimed superior training efficiency.

Questions

  1. The authors state that the proposed method lacks reasoning ability. Have the authors considered replacing CLIP with more advanced language models such as Qwen or GPT in the main pipeline, beyond the experiments in the supplementary material? Would such replacement lead to improved reasoning capability and segmentation performance?
  2. The paper provides two implementation versions but lacks deeper analysis in the experiment section. Could the authors provide deeper insights into when each version is expected to perform better?
  3. The kernel regression approach fixes the hyperparameter σ. Could the authors provide an ablation study on σ to validate its influence on performance?
  4. The training time comparison in Table 5 is not sufficient, as only one method reports its training time. Could the authors include training time for most baselines to better support the claim of COS3D’s superior training efficiency?
  5. There is a minor typo in line 318: Table 5 is incorrectly referred to as “Table 4.3” in the main text.

Limitations

Yes

Final Justification

The rebuttal clears my doubts, and I will keep my rating.

Formatting Concerns

No major formatting issues.

Author Response

We are encouraged to see that the reviewer finds the proposed method novel, effective in addressing the limitations, achieving strong performance improvements, and well-organized. We sincerely thank the reviewer for their constructive comments (e.g., discussions on extensions and additional analysis), which help enhance the quality of our paper. We are pleased to address the issues raised in the review.

Q1: Potential extension to reasoning ability.
A1: Thanks for your question. This work currently focuses on the open-vocabulary 3D segmentation task, and we plan to explore reasoning ability in our future work, as noted in our limitations. Moreover, incorporating reasoning abilities using advanced LLM models is a promising direction. Specifically, it is technically feasible to replace the CLIP [1] feature with the vision encoder feature from a V-LLM model, such as Qwen2.5-VL [2]. However, how to effectively integrate the learned 3D Qwen feature fields with the LLM backbone to handle reasoning queries and enhance segmentation quality remains an open question, as the relevance between the 3D Qwen feature fields and the text queries cannot be directly modeled using feature distance (e.g., cosine similarity) as in CLIP-style models (e.g., CLIP [1], SigLIP [3]). One possible solution is to render the 3D Qwen feature field into a 2D feature map based on a reference view, allowing it to be fed into the LLM backbone along with the reasoning queries. Designing comprehensive solutions that enable interaction between 3D LLM-style feature fields and reasoning queries is an interesting research problem. We will provide more discussion on this future work in the revised version.

In our supplementary robotic demo, we incorporate reasoning ability to some extent using a two-stage strategy: first, the Qwen-VL model parses complex commands into short queries; then, our COS3D model predicts the 3D segmentation results based on the queries.

Q2: Analysis between kernel regression and the MLP counterpart.
A2: Thanks for your interesting question. Both kernel regression and the MLP counterpart are means to achieve our instance-to-language feature mapping function, and the improved performance achieved by these two strategies consistently demonstrates the effectiveness of our feature mapping design.

Moreover, we provide a discussion from the perspectives of accuracy and memory efficiency:

  • Accuracy: We observe that the kernel regression version tends to achieve better performance than the MLP counterpart in our experiments. We attribute this to the selection of discriminative instance features as the source domain, which makes the mapping learning process inherently an easy regression task, for which the traditional kernel regression method is particularly well-suited. Moreover, kernel regression benefits from its training-free property, bypassing the unstable optimization process required by the MLP counterpart.

  • Memory efficiency: On the other hand, the MLP counterpart, consisting of small layers, offers better GPU memory efficiency. Specifically, the computational complexity of kernel regression depends on the size of the training data, resulting in higher memory requirements.

To illustrate this, we report the following table comparing the accuracy and GPU memory usage of kernel regression and MLP, using an identical input batch size of 100K.

Table 1. Trade-off between accuracy and memory efficiency on LeRF.

| Type | mIoU ↑ | mAcc ↑ | GPU Memory ↓ |
| --- | --- | --- | --- |
| MLP | 49.75 | 70.60 | 819M |
| Kernel regression | 50.76 | 72.08 | 6863M |

We will incorporate more discussion in the revised version.

Q3: Hyperparameter analysis in kernel regression.
A3: Thanks. For kernel regression, we set this hyperparameter to 0.1 across all experiments, and it performs well without requiring additional tuning. Additionally, we conducted an ablation study on $\sigma$, with the results presented in the following table. Overall, the results remain stable under moderate changes to $\sigma$, and our method consistently outperforms the latest state-of-the-art approach, InstanceGaussian [4].

Table 2. Quantitative comparisons for different $\sigma$ values in kernel regression.

| $\sigma$ | mIoU ↑ | mAcc ↑ |
| --- | --- | --- |
| 0.05 | 51.00 | 71.38 |
| 0.08 | 50.55 | 70.59 |
| 0.10 (default) | 50.76 | 72.08 |
| 0.15 | 50.55 | 71.23 |
| 0.20 | 50.58 | 71.25 |
| InstanceGaussian [4] | 45.30 | 58.44 |
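
For reference, if one assumes a standard Gaussian-kernel Nadaraya-Watson estimator (our reading of "kernel regression", not a quotation of the paper's equations), $\sigma$ acts as the kernel bandwidth:

$$
\hat{\mathbf{y}}(\mathbf{x}) \;=\; \frac{\sum_i K_\sigma(\mathbf{x}, \mathbf{x}_i)\,\mathbf{y}_i}{\sum_i K_\sigma(\mathbf{x}, \mathbf{x}_i)},
\qquad
K_\sigma(\mathbf{x}, \mathbf{x}_i) \;=\; \exp\!\left(-\frac{\lVert \mathbf{x} - \mathbf{x}_i \rVert^2}{2\sigma^2}\right),
$$

where $\mathbf{x}$ is an instance feature, $\mathbf{y}_i$ its paired language feature, and a smaller $\sigma$ yields a more local estimate, consistent with the mild sensitivity observed in the table above.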

Q4: Training times.
A4: Thank you. We originally plotted the training time of most baselines in the main paper Fig. 1(d) and included partial information in the main paper Tab. 5 due to space constraints. For clarity, we have organized the complete information into the table below. We will update the full table in the supplementary materials.

Table 3. Training efficiency analysis on LeRF. We report mIoU.

| Method | Training time ↓ | mIoU ↑ |
| --- | --- | --- |
| LangSplat [5] | 240 min | 9.66 |
| LEGaussians [6] | 240 min | 16.21 |
| Dr. Splat [7] | 10 min | 43.58 |
| OpenGaussian [8] | 60 min | 38.36 |
| InstanceGaussian [4] | - | 45.30 |
| Ours (3K for instance) | 8 min | 50.16 |
| Ours (6K for instance) | 15 min | 50.24 |
| Ours (default) | 50 min | 50.76 |

Q5: Typo.
A5: Thank you for pointing out the typo in L318. We will correct this in the revised version.

[1] Learning transferable visual models from natural language supervision. ICML 2021.
[2] Qwen2.5-VL Technical Report. ArXiv.
[3] Sigmoid Loss for Language Image Pre-Training. ICCV 2023.
[4] InstanceGaussian: Appearance-Semantic Joint Gaussian Representation for 3D Instance-Level Perception. CVPR 2025.
[5] LangSplat: 3D Language Gaussian Splatting. CVPR 2024.
[6] LEGaussians: Language Embedded 3D Gaussians for Open-Vocabulary Scene Understanding. CVPR 2024.
[7] Dr. Splat: Directly Referring 3D Gaussian Splatting via Direct Language Embedding Registration. CVPR 2025.
[8] OpenGaussian: Towards Point-Level 3D Gaussian-based Open Vocabulary Understanding. NeurIPS 2024.

Comment

I appreciate the authors’ detailed response to my comments. The rebuttal has clarified my concerns, and I would keep my positive rating.

Comment

Dear Reviewer etuw,

Thank you for your feedback on our rebuttal. We are pleased that our response has addressed your questions. We sincerely appreciate your kind support for our work!

Sincerely,
The Authors

Review
Rating: 4

This paper proposes COS3D, a framework for open-vocabulary 3D segmentation over reconstructed Gaussian scenes. The method builds a collaborative representation consisting of two aligned fields: an instance field that captures segmentation features trained from 2D instance masks (e.g., via SAM), and a language field that encodes CLIP-aligned features via a two-stage instance-to-language mapping. During inference, a text or image prompt produces a relevance map in the language field, which is adaptively refined using the instance field to generate accurate object-level 3D segmentations. COS3D outperforms prior methods on LeRF and ScanNet V2 benchmarks and supports flexible queries and applications such as hierarchical segmentation and robotic grasping.

Strengths and Weaknesses

Strengths:

  1. COS3D introduces a dual-field architecture with an explicit mapping from instance to language space, enabling effective fusion of geometry-aware segmentation cues and CLIP-aligned semantics. This separation allows targeted supervision and mutual refinement between fields.
  2. The two-stage training procedure decouples segmentation learning and language alignment, resulting in better mIoU and faster convergence than joint or parallel alternatives (e.g., 50.76 vs. 43.84 mIoU; 50 min vs. 165 min training). This makes COS3D practical for scalable deployment.
  3. COS3D achieves state-of-the-art results on both synthetic (LeRF) and real-world (ScanNet V2) datasets, and generalizes to multiple query modalities (text, image) and tasks (e.g., robotics). Inference is fast (~0.22s/query), enabling interactive or embodied use.

Weaknesses:

  1. COS3D relies on precomputed 3D Gaussian scenes and high-quality 2D masks from SAM. It is not directly applicable to raw point clouds, online reconstruction, or streaming sensor input, which may restrict use in dynamic or large-scale outdoor environments.
  2. The system handles only one object query at a time and cannot support relational, spatial, or multi-object reasoning (e.g., “the chair next to the table”). The authors acknowledge this but do not provide architectural insights into potential extensions.
  3. Due to unavailability of code, methods like InstanceGaussian [1] and Dr.Splat [2] are not included in ScanNet V2 evaluations. While understandable, this omission limits clarity on how COS3D ranks among the latest Gaussian-based 3D segmentation methods.

Reference:

  1. Haijie Li et al. InstanceGaussian: Appearance-Semantic Joint Gaussian Representation for 3D Instance-Level Perception. CVPR 2025.
  2. Jun-Seong Kim et al. Dr. Splat: Directly Referring 3D Gaussian Splatting via Direct Language Embedding Registration. CVPR2025.

Questions

  1. Could COS3D be extended to handle gradually reconstructed 3D scenes, where the collaborative fields are updated incrementally as new observations arrive?
  2. How would the current architecture support relational or multi-object queries, such as “all mugs on the shelf” or “the red chair next to the table”? Are such extensions feasible without major redesign?
  3. Have the authors conducted any indirect comparisons or qualitative assessments against recent methods like InstanceGaussian or Dr.Splat, given their similar use of Gaussian scene representations?

Limitations

yes

Final Justification

The rebuttal has addressed my concerns regarding the choice of 3DGS over raw point clouds, the potential for online reconstruction, and the lack of comparisons with InstanceGaussian and Dr. Splat. The additional results on ScanNetV2 clarify the performance advantage over these baselines. I therefore maintain my positive recommendation.

Formatting Concerns

no

Author Response

We thank the reviewer for appreciating our work (e.g., achieving state-of-the-art results, being fast, making COS3D practical for scalable deployment, supporting flexible queries and applications) and for their inspiring and thoughtful comments, which have helped us improve our paper. We are pleased to address the issues raised in the review.

Q1: …Gaussian splatting…not directly applicable to raw point clouds, online reconstruction ...
A1:

  • Raw point clouds: Thanks. Our COS3D adopts the 3D Gaussian Splatting (3DGS) representation reconstructed from multi-view image inputs, with the following considerations. First, our 3DGS model can be derived from multi-view images, which are easy to acquire, widely available, and offer higher resolution along with rich textural information. In contrast, raw point clouds typically require specialized equipment for collection and are more expensive. Second, compared to raw point clouds, the 3D Gaussian representation enables the utilization of 2D foundation models such as CLIP and SAM, which contain rich world-level knowledge; thus, the 3DGS representation improves our model's capability for open-vocabulary segmentation. Besides, 3D Gaussian splatting is the 3D representation commonly used in most recent open-vocabulary 3D segmentation (OV3DS) approaches [1, 2, 3, 4, 5].

  • Extension to online reconstruction: Thank you. Following recent approaches [1, 2, 3, 4, 5], we evaluate our model on standard offline benchmarks. Incorporating an online setting is an interesting direction. To do so, one possible solution is to integrate our COS3D with existing online Gaussian reconstruction methods [6, 7, 8, 9] to update our collaborative fields incrementally. We acknowledge that efficient online OV3DS requires additional research efforts, and we leave this for future work.

Q2: Discussion on potential extension to relational or multi-object queries.
A2: Thanks. This work currently focuses on the OV3DS task, and the proposed collaborative framework (COS3D) significantly surpasses previous state-of-the-art methods. We plan to explore relational reasoning in future work, as noted in our limitations. While extending our framework to support relational reasoning remains an open question, COS3D provides a solid foundation. For instance, the recent study [10] demonstrates that reasoning 3D segmentation (e.g., querying "the red chair next to the table") can be achieved to some extent by employing an LLM as an agent in combination with an OV3DS approach as a tool. We will provide more discussion on the potential extension in the revised version.

Q3: More comparisons with InstanceGaussian and Dr. Splat on ScanNetv2.
A3: Thank you for your reminder. Since InstanceGaussian and Dr. Splat were not open-sourced by the submission deadline, we did not include their results on ScanNetv2. They have recently been open-sourced, so we include more comparisons here. Considering the inconsistent evaluation protocols (as mentioned in the main paper, L282–L284) and inputs (i.e., segmentation masks) in these baselines, we ran the official code of InstanceGaussian and Dr. Splat on ScanNetv2 using the same input and evaluation protocol as ours, following OpenGaussian, to ensure fair comparisons. The results presented in the following table demonstrate that our method outperforms InstanceGaussian and Dr. Splat on all metrics.

Table 1. Comparison with the latest baselines on ScanNetv2. Results marked * are sourced from Table 2 in our main paper.

| Method | 19 classes (mIoU ↑ / mAcc ↑) | 15 classes (mIoU ↑ / mAcc ↑) | 10 classes (mIoU ↑ / mAcc ↑) |
| --- | --- | --- | --- |
| InstanceGaussian [5] | 30.21 / 47.88 | 33.77 / 51.63 | 42.32 / 59.78 |
| Dr. Splat [4] | 23.35 / 40.17 | 30.06 / 49.13 | 38.44 / 55.33 |
| Ours* | 32.47 / 49.05 | 35.95 / 54.35 | 44.32 / 63.66 |
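
As a side note on the metrics used throughout these tables, a generic way to compute mIoU and mAcc over predicted vs. ground-truth label maps is sketched below (a standard formulation we assume for illustration, not the benchmark's exact evaluation script).

```python
import numpy as np

def miou_macc(pred, gt, num_classes):
    """Mean IoU and mean per-class accuracy over integer label maps."""
    ious, accs = [], []
    for c in range(num_classes):
        p, g = pred == c, gt == c
        if g.sum() == 0:
            continue  # skip classes absent from the ground truth
        inter = np.logical_and(p, g).sum()
        union = np.logical_or(p, g).sum()
        ious.append(inter / union)
        accs.append(inter / g.sum())
    return float(np.mean(ious)), float(np.mean(accs))
```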

[1] LangSplat: 3D Language Gaussian Splatting. CVPR 2024.
[2] LEGaussians: Language Embedded 3D Gaussians for Open-Vocabulary Scene Understanding. CVPR 2024.
[3] OpenGaussian: Towards Point-Level 3D Gaussian-based Open Vocabulary Understanding. NeurIPS 2024.
[4] Dr. Splat: Directly Referring 3D Gaussian Splatting via Direct Language Embedding Registration. CVPR 2025.
[5] InstanceGaussian: Appearance-Semantic Joint Gaussian Representation for 3D Instance-Level Perception. CVPR 2025.
[6] Gaussian Splatting SLAM. CVPR 2024.
[7] Gaussian-SLAM: Photo-realistic Dense SLAM with Gaussian Splatting. ArXiv.
[8] RGB-Only Gaussian Splatting SLAM for Unbounded Outdoor Scenes. ICRA 2025.
[9] Outdoor Monocular SLAM with Global Scale-Consistent 3D Gaussian Pointmaps. ICCV 2025.
[10] LLM-Grounder: Open-Vocabulary 3D Visual Grounding with Large Language Model as an Agent. ICRA 2024.

Comment

I appreciate the authors’ detailed response to my comments. The rebuttal has clarified my concerns, and I have no further questions.

Comment

Dear Reviewer S9Jm,

Thank you for your feedback on our rebuttal. We are glad that our response has addressed your questions. We sincerely appreciate your kind support for our work!

Sincerely,
The Authors

Final Decision

The paper proposes a novel open-vocabulary segmentation method for radiance fields that relies on two aligned fields: an instance field centered on identifying instances and a language field. The paper received a positive assessment from all reviewers, who highlighted the novelty and effectiveness of the method. Therefore, I recommend accepting the paper. However, several concerns and improvements were suggested during the reviewer-author discussion that should be addressed in the final version of the paper, such as additional comparisons, extended analysis and discussion of the MLP vs. kernel implementations, hyperparameter analysis, training times, and ablation studies on different VLMs.