PaperHub

Score: 7.3 / 10 · Spotlight · 4 reviewers
Ratings: 4, 4, 4, 6 (min 4, max 6, std 0.9) · Confidence: 3.8
Novelty: 2.5 · Quality: 3.0 · Clarity: 3.3 · Significance: 3.0

NeurIPS 2025

OpenWorldSAM: Extending SAM2 for Universal Image Segmentation with Language Prompts

OpenReview · PDF
Submitted: 2025-05-12 · Updated: 2025-10-29

Abstract

Keywords
Image segmentation · Segment anything model · Vision-language model

Reviews and Discussion

Review (Rating: 4)

This paper introduces OpenWorldSAM for Open-Vocabulary Segmentation tasks. OpenWorldSAM extends the SAM2 to open-vocabulary scenarios by integrating multi-modal embeddings extracted from a lightweight vision-language model (VLM). The experimental results show that the proposed method achieves consistent gains on several benchmarks.

Strengths and Weaknesses

Strengths:

  1. OpenWorldSAM supports a diverse range of prompts.
  2. This paper designs tie-breaker embeddings to address the multi-object issue.
  3. The training is efficient.

Weaknesses:

  1. OpenWorldSAM inherits the shortcomings of SAM2 in multi-scale ambiguous segmentation, i.e., predicting several masks for a single prompt.
  2. The inference speed may be slow.
  3. In Table 1, using Oracle-Prompts for OpenWorldSAM and Global-Matching prompts for other methods is misleading.

Questions

Questions:

  1. What about the Global-Matching prompts for OpenWorldSAM?
  2. How much does the performance decrease when replacing SAM2 with SAM or a light-weight SAM2 backbone?
  3. How does OpenWorldSAM work when both text and point/box prompts are given?
  4. Many open-vocabulary object detection methods with SAM can do open-vocabulary segmentation, including GroundedSAM [1][2] and VL-SAM [3], and they typically conduct experiments on more challenging datasets, such as LVIS, in a zero-shot manner. So, what about the performance of OpenWorldSAM on LVIS?

[1] Liu, Shilong, et al. "Grounding DINO: Marrying DINO with grounded pre-training for open-set object detection." ECCV 2024.

[2] Ren, Tianhe, et al. "Grounded SAM: Assembling open-world models for diverse visual tasks." arXiv preprint arXiv:2401.14159 (2024).

[3] Lin, Zhiwei, Yongtao Wang, and Zhi Tang. "Training-free open-ended object detection and segmentation via attention as prompts." NeurIPS 2024.

Limitations

This paper inherits the shortcomings of SAM2, and the inference speed may be slow for real-world applications.

Final Justification

The rebuttal addresses most of my concerns. I want to keep my original rating.

Formatting Issues

no

Author Response

We sincerely thank Reviewer 4 for the positive feedback and valuable insights, especially appreciating your recognition of OpenWorldSAM’s diverse prompting capabilities, our novel tie-breaker design for handling multi-object scenarios, and the efficiency of our training method.

Below, we address your constructive concerns and questions:

1. Multi-scale ambiguous segmentation inherited from SAM2

Concern: OpenWorldSAM inherits SAM2’s issue of predicting multiple masks for single prompts in multi-scale ambiguous scenarios.

Response: Indeed, our approach builds upon SAM2, and thus inherits its architectural characteristics. However, our method specifically addresses instance-level ambiguity, not scale-level ambiguity. Unlike visual prompts (points/boxes), text prompts typically introduce instance ambiguity rather than scale ambiguity (see Figure 3).

The reviewer correctly pointed out that by default, SAM outputs 3 valid masks given one query prompt (multimask_output=True). Importantly, we disable SAM's default multimask_output setting during training, ensuring the model predicts only one mask per query (1 text prompt → K=20 queries → 20 mask predictions, instead of 20×3=60 masks). This encourages the model to predict exactly what the text prompt asks for. Thus, OpenWorldSAM does not inherit the multi-scale ambiguity directly.
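
For reference, the sketch below shows how this ambiguity switch is toggled in SAM2's public image-predictor API; the config/checkpoint paths, example image, and point prompt are placeholders, and OpenWorldSAM feeds its own query embeddings to the mask decoder rather than point prompts:

```python
import numpy as np
from PIL import Image
from sam2.build_sam import build_sam2
from sam2.sam2_image_predictor import SAM2ImagePredictor

# Placeholder config/checkpoint paths and image.
predictor = SAM2ImagePredictor(build_sam2("sam2_hiera_l.yaml", "sam2_hiera_large.pt"))
predictor.set_image(np.array(Image.open("example.jpg").convert("RGB")))

# multimask_output=False -> exactly one mask per query, so K=20 queries derived
# from one text prompt yield 20 masks instead of 20 x 3 = 60.
masks, scores, _ = predictor.predict(
    point_coords=np.array([[512, 512]]),  # illustrative click; not used by OpenWorldSAM
    point_labels=np.array([1]),
    multimask_output=False,
)
print(masks.shape)  # (1, H, W): a single mask for this single prompt
```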

Additionally, future improvements to SAM2 will naturally enhance OpenWorldSAM, given our modular, plug-and-play architecture.

2. Inference speed concern

Concern: Reviewer points out the potential slowdown in inference speed.

Response: We conducted detailed profiling to quantify the impact of adding the VLM and our adapter modules to SAM. Below, we present inference timing breakdowns for processing a single 1024×1024 image on an NVIDIA A5000 GPU, averaged over five independent runs:

Single Text Prompt (20 queries):

| Module | Time (ms) | Percentage | Category |
|---|---|---|---|
| sam_backbone_feature_prep | 329.83 | 71.6% | SAM |
| prompt_tokenization | 0.43 | 0.1% | Non-SAM |
| beit3_forward | 70.84 | 15.4% | Non-SAM |
| mlp_projection_layer | 6.68 | 1.4% | Non-SAM |
| prepare_batched_tie_breaker_tokens | 0.13 | 0.0% | Non-SAM |
| cross_attention | 8.45 | 1.8% | Non-SAM |
| sam_prompt_encoder | 0.11 | 0.0% | SAM |
| sam_mask_decoder | 43.41 | 9.4% | SAM |
| postprocessing | 0.68 | 0.1% | Non-SAM |
| TOTAL TIME | 460.69 | 100.0% | |

Summary:

  • SAM modules total time: 373.35 ms (81.0%)
  • Non-SAM modules total time: 87.21 ms (18.9%)
  • Non-SAM overhead: 87.21 ms

Six Text Prompts (20×6=120 queries):

| Module | Time (ms) | Percentage | Category |
|---|---|---|---|
| sam_backbone_feature_prep | 334.42 | 48.6% | SAM |
| prompt_tokenization | 1.02 | 0.1% | Non-SAM |
| beit3_forward | 123.73 | 18.0% | Non-SAM |
| mlp_projection_layer | 4.48 | 0.6% | Non-SAM |
| prepare_batched_tie_breaker_tokens | 0.20 | 0.0% | Non-SAM |
| cross_attention_layers | 18.17 | 2.6% | Non-SAM |
| sam_prompt_encoder | 0.12 | 0.0% | SAM |
| sam_mask_decoder | 205.18 | 29.8% | SAM |
| postprocessing | 1.06 | 0.2% | Non-SAM |
| TOTAL TIME | 688.50 | 100.0% | |

Summary:

  • SAM modules total time: 539.72 ms (78.4%)
  • Non-SAM modules total time: 148.65 ms (21.6%)
  • Non-SAM overhead: 148.65 ms

Takeaway:

  1. The profiling results show that adding the VLM and adapter modules results in only a moderate increase in inference time (approximately 19–22% overhead). Most computational cost remains within SAM’s backbone and mask decoder.

  2. Mask Decoder scaling: sam_mask_decoder cost grows almost linearly with (K × P).

    • Going from 1→20 queries (same prompt) adds ~41 ms.
    • Going from 1 prompt→6 prompts (120 queries) adds a further ~162 ms.

    Note that one text prompt mimics user clicks 20 times on an image. If automatic mask generation is desired without user intervention, SAM’s built-in auto-mask generator uses a dense 32×32 grid of point prompts, incurring significantly higher costs compared to our text-based prompting approach.

  3. Overall overhead: Relative to one vanilla SAM2 call, our pipeline is approximately 39% slower for a single prompt (332 → 461 ms). However, it becomes approximately 3× more efficient when handling three or more prompts, as the backbone and VLM overhead are amortized.

Thus, our enhancements introduce manageable overhead, maintaining practical usability in real-world applications.
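
For readers who want to reproduce this kind of breakdown, a minimal per-module timing sketch is shown below (assuming a PyTorch pipeline; the commented stage names are illustrative stand-ins, not the released code):

```python
import time
from collections import defaultdict

import torch

timings = defaultdict(float)
runs = 5  # average over five independent runs, as in the tables above

def timed(name, fn, *args, **kwargs):
    """Run fn and accumulate its wall-clock time (ms), synchronizing the GPU
    so that asynchronous CUDA kernels are fully counted."""
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.perf_counter()
    out = fn(*args, **kwargs)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    timings[name] += (time.perf_counter() - start) * 1000.0
    return out

# Illustrative usage inside the pipeline (stage callables are placeholders):
# feats = timed("sam_backbone_feature_prep", sam_image_encoder, image)
# emb   = timed("beit3_forward", beit3, image, text_tokens)
# masks = timed("sam_mask_decoder", mask_decoder, feats, queries)

total = sum(timings.values())
for name, ms in sorted(timings.items(), key=lambda kv: -kv[1]):
    print(f"{name:40s} {ms / runs:8.2f} ms  {100 * ms / max(total, 1e-9):5.1f}%")
```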

3. Evaluation protocol clarity in Table 1

Concern: Using Oracle-Prompts for OpenWorldSAM but Global-Matching prompts for other methods in Table 1 may be misleading.

Response: We appreciate this clarification. Table 2 already provides both protocols explicitly for X-Decoder, but we agree that clearer labeling and alignment are necessary. We will explicitly state the evaluation modes ("Oracle-Prompt" for OpenWorldSAM, "Global-Matching" for other methods) in Table 1 in the revised manuscript to ensure an accurate, fair comparison. Importantly, OpenWorldSAM remains superior even under aligned evaluation conditions in Table 2.

Additional Questions:

1. Global-Matching prompts for OpenWorldSAM

OpenWorldSAM, like SAM, is specifically trained for interactive, prompt-conditioned segmentation (Oracle-Prompt mode). It is inherently unsuitable for Global-Matching evaluation, which involves prompts that may not correspond to objects in the image, causing issues as negative sampling was not incorporated in SAM’s original training recipe. Thus, evaluating OpenWorldSAM with global prompts would not accurately reflect its intended usage or strengths.

2. Performance with alternative SAM backbones

Due to time and computational constraints, we could not perform additional training with SAM or lighter SAM2 variants. Nonetheless, since our approach is modular, retraining our lightweight adapter to align with other backbones (e.g., SAM2-small/base) is straightforward. Given the minimal performance gaps reported among SAM2 variants in the table below (taken from SAM2's GitHub repo), we expect our method's performance to remain robust across different backbone scales:

| Model | Size (M) | SA-V test (J&F) | MOSE val (J&F) | LVOS v2 (J&F) |
|---|---|---|---|---|
| sam2_hiera_tiny | 38.9 | 75.0 | 70.9 | 75.3 |
| sam2_hiera_small | 46 | 74.9 | 71.5 | 76.4 |
| sam2_hiera_base_plus | 80.8 | 74.7 | 72.8 | 75.8 |
| sam2_hiera_large | 224.4 | 76.0 | 74.6 | 79.8 |

3. Combining text and point/box prompts

Currently, OpenWorldSAM does not simultaneously support text and point/box prompts. Our goal was precisely to reduce or eliminate user intervention by using positional tie-breakers that emulate spatial clicks. However, incorporating dual-modality prompts is an interesting direction we intend to explore in future work.

4. Performance on challenging datasets like LVIS

We agree that evaluating on challenging large-scale datasets like LVIS-1203 would further validate our method’s scalability and robustness. Given our strong performance on similarly challenging datasets (e.g., ADE-857), we anticipate competitive generalization to LVIS. Furthermore, incorporating strategies like Grounding DINO's large-scale pretraining (datasets: Objects365, OpenImages, GoldG, Cap4M, COCO, RefCOCO) could significantly enhance our model's capabilities. Similarly, VL-SAM’s preprocessing stage, in which a 17B-parameter multimodal reasoning model (CogVLM-17B) is asked to list all the objects in the image, could help transition OpenWorldSAM from oracle-prompt conditions toward fully automated open-vocabulary segmentation. We plan to explore these promising avenues in future research.

We deeply appreciate your valuable suggestions and insights, which will significantly enhance the final manuscript.

Comment

Thanks for the rebuttal, which addressed my concerns. I will keep my original rating.

Review (Rating: 4)

This paper extends the SAM2 framework for open-vocabulary semantic segmentation by integrating multi-modal embeddings from a Vision-Language Model (VLM). The authors propose four key design principles, i.e., Unified Prompting, Efficiency, Instance Awareness, and Generalization, to enhance the method from multiple perspectives. Experimental results demonstrate the superior performance of the proposed approach.

Strengths and Weaknesses

Strengths:

  • The paper is well-prepared and easy to follow.
  • The proposed framework appears effective, achieving improved accuracy compared to baseline methods.
  • Inspired by DETR, the authors propose a novel and effective tie-breaker module for querying multiple instances from target images.

Weaknesses:

  • Tables 3 and 4 suggest that replacing the foundational model from CLIP/Vicuna with BEiT significantly improves segmentation. It is unclear whether the state-of-the-art results are primarily attributable to the BEiT model or the proposed methodology. A justification from the authors on this point is needed.
  • A two-stage inference method is proposed. However, according to Table 1, the results with two-stage inference do not achieve better accuracy compared to the single-stage OpenWorldSAM. Furthermore, Figure 1 in the Appendix provides only two comparative images, which is insufficient to convincingly explain the contradiction between lower evaluation metrics and seemingly better visualization effects.
  • Given that one of the paper's claimed merits is its multi-instance segmentation capability, it is strongly recommended to conduct a dedicated multi-instance segmentation evaluation to more robustly demonstrate its superiority.
  • In Table 5, E5 appears only slightly better than E3. It would be beneficial to see a comparison of these two settings on other evaluation datasets. If this outcome remains consistent, confirming the effectiveness of the cross-attention module, the authors should discuss the possibility of further simplifying the cross-attention module to reduce training parameters.

Questions

Please refer to the weaknesses for the specific questions. Minor: In Table 1, the value "16.9" for OpenWorldSAM should be in bold.

Limitations

Yes

Final Justification

I have carefully read the authors' responses and other reviewers' comments. I believe that most of my concerns have been addressed in the rebuttal. I will therefore maintain my initial rating.

Formatting Issues

None

Author Response

We sincerely thank Reviewer 3 for the positive and insightful comments, especially appreciating your recognition of our clear exposition, the effectiveness of our proposed tie-breaker module inspired by DETR, and the improved accuracy achieved by OpenWorldSAM over baseline methods.

Below, we address your constructive questions and concerns:

1. Attribution of state-of-the-art results to BEiT-3 vs. proposed methods.

Concern:

Reviewer points out that replacing CLIP/Vicuna with BEiT-3 significantly improves performance, raising a question about the relative contributions of BEiT-3 versus our proposed modules.

Response:

Short answer — both are required.

We fully concur with the reviewer that BEiT-3 primarily contributes semantic reasoning. Our ablation study (Table 4) indicates this, where replacing BEiT-3 with late-fusion CLIP drastically decreases performance. The early fusion strategy of BEiT-3 indeed provides richer semantics closer to SAM's features. Early fusion lets every Transformer layer mix visual and textual evidence, yielding finer localization that late‑fusion CLIP cannot retrieve from a global text embedding.

However, Table 5 already shows that BEiT‑3 alone is not sufficient: if we remove tie‑breakers the model collapses. Tie‑breakers compensate for SAM’s weak semantics by spatially diversifying otherwise identical semantic queries. With language semantics present, removing the positional perturbations causes near‑complete failure: ADE‑150 AP plunges from 17.1 → 1.0 and mIoU from 56.8 → 1.2 (Table 5, E3 vs E1). Tie-breakers are thus essential in enabling multi-instance segmentation. Therefore, both the powerful BEiT-3 encoder and our novel tie-breaker strategy jointly enable our state-of-the-art performance.

2. Clarifying two-stage inference results and visualization discrepancy.

Concern:

The reviewer correctly notes that despite improved visual mask boundaries in our two-stage inference, Table 1 shows no substantial quantitative improvement, and the provided visualization in the appendix is limited.

Response:

Indeed, quantitative metrics (mIoU, PQ, AP) show negligible improvements primarily because:

  1. Two‑stage inference only refines mask contours; IoU‑style metrics saturate once coarse localization is accurate, so small contour tweaks seldom raise mIoU/PQ/AP.
  2. Amplified errors on hard examples. On incorrectly localized masks from stage 1, refinement anchored to incorrect regions can further degrade metrics.

As the rebuttal submission does not allow visuals, we will add some failure examples to the revised Appendix.

To substantiate the visual improvement despite stable metrics, we conducted a user-study. We ran a blind preference test on 50 random image pairs from ADE‑857 val:

  • Participants: 10 graduate volunteers familiar with vision tasks.
  • Task & options: For each of the 50 pairs, respondents answered a single forced‑choice question: “Which mask looks more accurate?” Choices: A, B, No discernible difference. Average completion time ≈ 15 min.
  • Result: Across 10 × 50 = 500 judgements, Two-stage preferred: 43.6% (218/500); One-stage preferred: 37.2% (186/500); No difference: 19.2% (96/500). Excluding “No-difference”, Two-stage wins 218/(218+186) = 54.0%.
  • Takeaway: Although the gain is modest, humans still favor the refined masks more often than the originals.

Given that the two-stage inference serves as an optional, low-cost post-processing step, users can conveniently enable or disable it based on their preference.

3. Dedicated multi-instance segmentation evaluation.

Concern:

Given multi-instance segmentation is a core claim, Reviewer 3 recommends a dedicated evaluation.

Response:

We agree on the importance of explicitly evaluating multi-instance segmentation. Notably, our primary results already include extensive multi-instance evaluation through panoptic and instance segmentation benchmarks (ADE20K-150, ScanNet-20). Specifically, OpenWorldSAM achieves new state-of-the-art panoptic (combining instance-level and semantic-level) segmentation performance on ADE20K-150 (35.2 PQ) and strongly outperforms baselines on instance-level metrics (16.9 AP on ADE20K-150). These results directly validate our multi-instance capabilities. Nonetheless, we will further emphasize these findings explicitly in the final manuscript to highlight this strength.

4. Ablation comparison between E3 and E5 and simplifying cross-attention.

Concern:

The reviewer observes that E5 (full three-layer cross-attention, 4.5M parameters) performs only slightly better than E3 (no cross-attention, 1.7M parameters) on ADE-150, recommending additional datasets comparison and exploring simpler cross-attention architectures.

Response:

Thanks for this valuable suggestion. We performed additional comparative experiments across multiple datasets, summarized below:

| Setting | Params (M) | ADE-150 (PQ/AP/mIoU) | ADE-857 (mIoU) | PC-59 (mIoU) | PC-459 (mIoU) | VOC-20 (mIoU) | SUN-37 (mIoU) | SCAN-40 (mIoU) |
|---|---|---|---|---|---|---|---|---|
| E5 (3-layer) | 4.5 | 35.2 / 16.9 / 60.4 | 33.1 | 73.7 | 47.5 | 98.0 | 67.7 | 55.6 |
| E3 (no cross-attn) | 1.7 | 35.1 / 17.1 / 56.8 | 32.2 | 70.4 | 44.2 | 97.3 | 63.6 | 53.8 |
| E5.1 (1-layer cross-attn) | 2.4 | 35.1 / 16.8 / 59.0 | 32.8 | 72.6 | 46.3 | 97.5 | 66.4 | 54.0 |

We observe consistently higher accuracy with E5 (three-layer cross-attention) across datasets, confirming the importance of multi-layer cross-attention. However, a single-layer variant (E5.1) significantly narrows the gap with fewer parameters (2.4M vs. 4.5M), suggesting a practical compromise between parameter count and accuracy. We will include this detailed analysis in the revised Appendix and explicitly discuss potential simplified variants as a recommended direction for practitioners seeking more parameter-efficient adaptations.

Minor formatting suggestion:

Thank you for noting the formatting oversight in Table 1 ("16.9" should be bold). We will correct this in the final manuscript.

We greatly appreciate your thorough and constructive feedback and will incorporate all suggestions to improve our manuscript clarity and robustness.

Comment

Thank you for your detailed responses. The additional experimental results and analyses you provided have largely addressed my concerns.

Review (Rating: 4)

This paper presents OpenWorldSAM, a unified interface designed to support a range of open-vocabulary segmentation tasks. It introduces an efficient language adapter that enables parameter-efficient fine-tuning while maintaining the model's generalization and interactivity. Additionally, a tie-breaking mechanism is proposed to generate instance-level segmentation masks. To further enhance multi-object localization, the paper incorporates a cross-attention soft prompting strategy.

Strengths and Weaknesses

Strengths:

The core contribution of this paper is an efficient, learnable positional tie-breaker that enables SAM to segment multiple object instances rather than a single one, significantly improving recall in complex, multi-object scenes.

The introduction of a soft prompting Transformer block facilitates alignment between language and image features via cross-attention.

Both components—tie-breaker and soft prompting—are shown to positively impact performance. In particular, the tie-breaker demonstrates strong results on instance segmentation tasks, where images typically contain multiple instances per class.

Weaknesses:

As acknowledged by the authors (lines 293–295), the SAM encoder’s feature space offers limited semantic richness. This raises concerns about the effectiveness of the proposed cross-attention mechanism and tie-breaker in accurately aligning language and image features. Based on my prior experience, SAM’s image features often struggle to distinguish fine-grained semantics. Clarification is needed on how the tie-breaker achieves robust performance under this limitation.

While the visual-text embedding may provide coarse spatial cues, much of the semantic reasoning likely stems from BEiT-3, not direct alignment with language inputs. In this context, the operational role and limitations of the tie-breaker should be examined in more detail, especially since it is a central contribution.

One of the key justifications for using BEiT over CLIP is to enable early multimodal fusion. However, the paper would benefit from a deeper discussion of this design choice, especially regarding the EVF-SAM setup. A comparison with SAM4MLLM would also be valuable, as it presents a relevant and potentially competing approach.

Questions

See the weaknesses above.

Limitations

Yes

Final Justification

Thanks for the detailed response; the rebuttal addresses most of my concerns. I keep my initial rating.

Formatting Issues

No

Author Response

We sincerely thank Reviewer 2 for the positive feedback, particularly appreciating your recognition of our contributions such as the efficient learnable positional tie-breaker mechanism and the soft prompting Transformer block, significantly improving SAM2’s open-vocabulary segmentation performance in complex, multi-object scenes.

Below, we address the reviewer’s constructive questions and concerns:

1. Semantic richness of SAM’s encoder and the role of cross-attention and tie-breakers

Concern: SAM features have limited semantic richness, raising concerns about the effectiveness of the proposed cross-attention and tie-breaker mechanisms.

Response:

We fully agree with the reviewer that SAM’s encoder alone possesses limited semantic richness. The BEiT-3 encoder indeed plays a critical role by providing robust semantic alignment. The early fusion strategy of BEiT-3 provides richer semantics closer to SAM's features.

However, we argue that our OpenWorldSAM is a general plug-in architecture that aligns SAM’s embedding space with the VLM’s embedding space while keeping all heavy backbones frozen. Although we instantiate the framework with SAM2 and BEiT-3 in this paper, neither component is required by design; alternative interactive decoders or vision-language encoders can be swapped in with only minor re-training of the adapter (please refer to Appendix E.1 and Table 5). Appendix Table 5 surveys representative VLM encoders that could replace BEiT-3 in OpenWorldSAM with approximately 5M adapter parameters. For example, early-fusion encoders like BEiT-3 (e.g., VLMo, OFA, Florence-2) require zero architectural change beyond projector resizing and are therefore the most promising immediate swaps. Dual-encoders (CLIP family) need a shallow cross-attention adapter to overcome missing image context.

Of course, with better VLMs whose embedding spaces are closer to SAM’s encoder space, our model will perform better.

On the role of cross-attention. Cross-attention injects BEiT-3 semantics into SAM’s high-resolution grid. Our proposed 3-layer soft-prompting Transformer (Fig. 4c) performs cross-attention over the high-resolution 64×64 feature map used by SAM2 for mask decoding. This strategy provides every query direct access to BEiT-3’s fused semantics at pixel-level granularity (~16K tokens), rather than relying solely on the coarse global [CLS] embedding discussed in Line 133. Empirically, removing the cross-attention block (while keeping other components intact) degrades performance: ADE-150 mIoU drops by 1.5 points, and ADE-857 mIoU by 0.9 points (Table 5, E3 vs. E5).

On the role of Tie-breakers. Tie-breakers address SAM’s limited semantics by spatially diversifying identical semantic queries. When BEiT-3’s semantics are present, disabling positional tie-breakers leads to a severe performance collapse. Specifically, ADE-150 AP drops from 17.1% to 1.0%, and mIoU plummets from 56.8% to just 1.2% (Table 5, E3 vs. E1). This highlights the critical role of tie-breakers in differentiating multiple instances spatially, especially in the presence of weaker intrinsic semantics from SAM.

2. Operational role and limitations of tie-breakers relative to BEiT-3

Concern: Semantic reasoning primarily comes from BEiT-3 rather than direct language-image alignment, raising questions about the operational role and limitations of tie-breakers.

Response:

Operational role of tie-breakers: BEiT-3 provides robust class-level semantic cues ("what"), whereas tie-breakers supply instance-level spatial cues ("where/how many"). During training, Hungarian matching compels each tie-breaker vector to specialize in spatially distinct regions, thereby enforcing diversity and preventing redundancy.
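
A minimal sketch of the one-to-one assignment step, using SciPy's Hungarian solver on a toy cost matrix (in the actual training loss the costs combine classification and mask terms; the random costs here are placeholders):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Toy cost matrix: rows = K predicted queries, columns = ground-truth instances.
rng = np.random.default_rng(0)
K, num_gt = 5, 3
cost = rng.random((K, num_gt))

# Hungarian matching assigns each ground-truth instance to exactly one query,
# so two queries can never "win" the same object; this one-to-one constraint
# is what pushes tie-breakers to specialize on distinct spatial modes.
query_idx, gt_idx = linear_sum_assignment(cost)
for q, g in zip(query_idx, gt_idx):
    print(f"query {q} <- instance {g} (cost {cost[q, g]:.3f})")
# Queries left unmatched are supervised as "no object".
```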

Limitations: Tie-breakers are fundamentally positional and thus cannot rectify semantic misclassifications arising from BEiT-3. If BEiT-3 misinterprets a semantic concept, tie-breakers cannot correct such semantic errors. Their effectiveness is contingent on BEiT-3’s correct semantic interpretation.

3. Justification for BEiT vs. CLIP and comparisons with EVF-SAM and SAM4MLLM

Concern: The paper would benefit from a deeper discussion of the choice of BEiT-3 over CLIP (especially regarding EVF-SAM) and a comparison to SAM4MLLM.

Response:

We appreciate this suggestion and further clarify below:

Relation to EVF-SAM: In the submitted manuscript, we (1) gave a shout out to EVF-SAM in Related Work L108 (“Inspired by the success of EVF-SAM, we enhance SAM further into the domain of open-vocabulary segmentation…”), (2) identified the problem of EVF-SAM in Figure 3c (“Naive solution [EVF-SAM]: A single language query for ‘zebra’ causes SAM2 to segment only the most salient instance.”), and (3) gave credits to EVF-SAM’s preliminary experiments on replacing CLIP with BEiT-3 in our ablations L292 (“…deep semantic integration is critical for aligning language concepts with visual regions, echoing findings by EVF-SAM”).

Given that EVF-SAM’s primary innovation lies in utilizing BEiT-3, we believe that our multiple citations throughout the submitted manuscript sufficiently acknowledge EVF-SAM's contributions, avoiding redundant discussion on the choice of BEiT-3 over CLIP.

While EVF-SAM also leverages BEiT-3, it is strictly limited to single-instance queries. As illustrated in Figure 3, EVF-SAM segments only the most salient instance when given category-level prompts, ignoring other instances. Our design (tie-breakers plus cross-attention) significantly advances EVF-SAM’s single-instance capability into a robust multi-instance, multi-task model, which is clearly demonstrated by our detailed ablations (Table 5).

Comparison with SAM4MLLM [Chen et al., 2024]: Despite employing a powerful 7B-parameter Qwen-VL backbone with ~29M LoRA-tuned parameters, SAM4MLLM achieves only a minor improvement (+0.5 cIoU) over OpenWorldSAM (74.5% vs. 74.0%) on RefCOCOg. Moreover, SAM4MLLM relies on a two-stage pipeline ("LLM-dialogue → point prompts → SAM"), incurring higher inference latency and failing to demonstrate multi-instance segmentation capabilities or results on challenging multi-instance benchmarks (ADE20K, PASCAL, ScanNet). In contrast, our lightweight 4.5M-parameter adapter with early vision-language fusion via BEiT-3 and positional tie-breakers consistently achieves state-of-the-art results across multiple open-vocabulary segmentation tasks and datasets.

We deeply appreciate your valuable suggestions and will incorporate all feedback into the manuscript.

Comment

Thanks for the detailed response; the rebuttal addresses most of my concerns. I keep my initial rating.

Review (Rating: 6)

The paper presents OpenWorldSAM, extending SAM2 for open-vocabulary multi-instance segmentation in a resource-efficient manner. The methodology includes: (1) jointly encoding image and text prompts, together with multilevel aggregation of input for richer information, and aiding with reasoning over ambiguity and unseen classes. (2) a novel tie breaking mechanism to equip the model with awareness of different object instances. Experiments show that OpenWorldSAM outperforms SOTAs across multiple benchmarks.

Strengths and Weaknesses

Strengths

  • OpenWorldSAM consistently performs strongly relative to prior arts, across a broad range of evaluation settings. Apart from comprehensive quantitative comparison, the authors also qualitatively analyze the resulting segmentation masks.
  • Unified architecture while being very lightweight and parameter efficient.
  • The tie-breaker solution is elegant and effectively addresses the multi-instance problem.
  • The introduced Oracle-Prompts, an alternative evaluation protocol, addresses the ambiguity issue for better performance assessment. Through this, they also identify deeper characteristics into existing models.
  • Design choice of the model is well validated through detailed ablations.
  • The upfront honesty of the model’s limited performance on outdoor datasets is appreciated.
  • The discussion on two-stage inference for segmentation map refinement is also appreciated.

Weaknesses:

  • Line 154: “ In practice K = 20 covers >99% images in COCO; for larger scenes K can be increased trivially.”, though the paper doesn’t showcase the effect of varying the number of tie-breaker tokens (K), either during training or post-training.
  • Line 178: “During training, the Hungarian matching loss naturally encourages each t_i to specialize in different spatial regions.” While graceful; without explicit encouragement, this could lead to problems (e.g. query collapse, forgetting, or overfitting) when finetuning to small downstream datasets with low instance counts, which isn’t studied in this work.

Questions

  1. Figure 4 illustrates inputting two object queries concurrently (giraffe and rock) into the model; while the textual explanation only refers to a single query (e.g. line 130: “a category name“), maybe for simplicity in writing. How is it actually done in the main experiments? What is the performance difference between feeding each query separately vs. multiple queries at once? If multiple queries, how is positional information (non tie-breaker) incorporated into the soft-prompting transformer module?
  2. What does soft-prompting in soft-prompting transformer module mean? Is there any prefix used during the training? We acknowledge the difficulty in keeping up with the rapid pace of the field. Do the authors plan to discuss concurrent works (e.g. [1,2])?
  3. What’s the relationship between tie-breaker tokens and learned positional encodings (e.g. [3])?

[1] https://arxiv.org/abs/2501.04001

[2] https://arxiv.org/pdf/2411.14723

[3] https://arxiv.org/abs/1810.04805

Limitations

yes

Final Justification

The paper provides strong performance against past and concurrent baselines. The methods are elegant and properly justified.

Formatting Issues

None.

Author Response

We thank Reviewer 1 for the strong endorsement of our paper and insightful feedback! We are encouraged that you found our method effective, efficient, and well validated. We address reviewer comments below and will incorporate all feedback.

1. Effect of varying tie-breaker tokens (K)

Thanks for highlighting this point. To clarify, setting K=20 means for each prompt (e.g., a category name), our model can identify up to 20 distinct objects. For crowded scenes containing more than 20 objects per category, increasing K is straightforward and advisable.

In practice, COCO images typically contain a moderate number of distinct categories and instances (the original COCO paper reports "on average, our dataset contains 3.5 categories and 7.7 instances per image." [1]). The chosen value should match or exceed the maximum expected number of objects per category. For reference, DETR [2] used 100 total queries, aligning roughly with the maximum number of objects per image. Our choice (K=20) results, on average, in approximately 70 queries per image (20 queries × 3.5 categories), providing ample coverage for typical scenes.

Further, Zhang et al. (2022) [3] observed that increasing queries initially improved Average Precision (AP), but then plateaued or even slightly declined when queries became excessive, indicating redundancy in higher query counts. However, recall does improve with more queries, since more detection slots increase the chance to find each object.

We conducted additional ablation experiments varying K, with models pretrained on COCO and evaluated on ADE20K instance segmentation:

| Metric | K=10 | K=20 | K=30 |
|---|---|---|---|
| Average Precision (AP) | 14.2 | 16.9 | 16.5 |
| Average Recall@100 (AR) | 21.6 | 28.8 | 29.4 |

Observations. (1) Increasing K from 10 → 20 improves recall and AP; beyond 20, gains saturate, mirroring the behavior reported for DETR-style object queries; (2) Average Recall with a maximum of 100 detections per image (AR@100) improves when increasing K from 10 → 20 → 30; (3) K=20 is optimal for balancing precision and recall on standard datasets.

We'll include these findings clearly in the supplementary materials.

2. Query collapse, forgetting, or overfitting on small datasets

We appreciate this concern and provide a detailed analysis below:

A. Query collapse:

During fine-tuning on RefCOCOg, each text prompt corresponds to exactly one target object, making it effectively a small-instance-count dataset scenario. We found our model generalizes robustly: cIoU scores demonstrate strong performance. We explicitly measured query collapse—the fraction of images where multiple tie-breakers predicted identical masks (IoU > 0.9). Results show minimal query collapse (<1%), even when fine-tuning on the smaller RefCOCOg-Mini (2k images):

| Setting | Frozen encoders? | Train images | Avg. inst./img | Collapse? |
|---|---|---|---|---|
| RefCOCOg (full) | Yes | 26 k | 2.1 | 0.3 % |
| RefCOCOg (mini) | Yes | 2 k | 2.2 | 0.7 % |
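
A minimal sketch of how this collapse criterion could be computed per image (boolean mask tensors and the IoU > 0.9 threshold as defined above; shapes are illustrative):

```python
import torch

def has_collapse(masks: torch.Tensor, iou_thresh: float = 0.9) -> bool:
    """masks: (K, H, W) boolean predictions from the K tie-breaker queries of
    one image. Returns True if any two distinct queries predict near-identical
    masks (pairwise IoU > iou_thresh)."""
    K = masks.shape[0]
    flat = masks.reshape(K, -1).float()
    inter = flat @ flat.t()                      # pairwise intersection areas
    area = flat.sum(dim=1)
    union = area[:, None] + area[None, :] - inter
    iou = inter / union.clamp(min=1)             # avoid division by zero on empty masks
    iou.fill_diagonal_(0)                        # ignore each mask's self-IoU
    return bool((iou > iou_thresh).any())

# Collapse rate over a dataset = fraction of images where has_collapse() is True:
# rate = sum(has_collapse(m) for m in per_image_masks) / len(per_image_masks)
```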

B. Catastrophic forgetting:

To mitigate forgetting and small-dataset overfitting, we follow established adapter-style fine-tuning strategies. PSF‑SAM [4] observes that fully fine‑tuning SAM on a few hundred images “frequently results in catastrophic forgetting”, and proposes freezing most weights plus a lightweight adapter to preserve upstream skills. For OpenWorldSAM, all heavy encoders remain frozen, and only our lightweight 4.5M-parameter adapter is trained. This approach empirically preserves COCO-level zero-shot capabilities while adapting effectively to RefCOCOg.
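
A minimal sketch of this adapter-only training setup (attribute names and the learning rate are placeholders, not the exact released code):

```python
import torch

def adapter_only_optimizer(model: torch.nn.Module) -> torch.optim.Optimizer:
    """Freeze every parameter, then re-enable gradients only for the lightweight
    adapter (tie-breakers, projector, cross-attention layers)."""
    for p in model.parameters():
        p.requires_grad = False
    for p in model.adapter.parameters():   # 'adapter' is an illustrative attribute name
        p.requires_grad = True

    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    print(f"trainable parameters: {trainable / 1e6:.1f} M")  # ~4.5 M in our setup

    # Only trainable parameters are handed to the optimizer; lr is a placeholder.
    return torch.optim.AdamW(
        (p for p in model.parameters() if p.requires_grad), lr=1e-4
    )
```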

Furthermore, we assessed the zero-shot performance on ADE20K following fine-tuning. After RefCOCOg‑Mini fine‑tuning, ADE‑150 mIoU drops by only 4.2 pts (60.4 → 56.2), indicating minimal loss of zero‑shot capability.

| Setting | ADE-150 (PQ, AP, mIoU) | ADE-857 (mIoU) |
|---|---|---|
| COCO pretraining (original setting in paper) | 35.2, 16.9, 60.4 | 33.1 |
| COCO pretraining + RefCOCOg (full) fine-tune | 35.7, 16.5, 56.6 | 32.2 |
| COCO pretraining + RefCOCOg (mini) fine-tune | 35.2, 16.3, 56.2 | 32.0 |

C. Overfitting:

We explicitly examined potential overfitting on RefCOCOg-Mini (2k images), finding no evidence of such:

| Setting | Train Images | cIoU (Validation Set) | cIoU (Training Set) |
|---|---|---|---|
| RefCOCOg (full) | 26k | 74.0 | 75.5 |
| RefCOCOg (mini) | 2k | 73.4 | 77.5 |

This comprehensive analysis shows that our Hungarian matching objective implicitly enforces query diversity through one-to-one assignment, while the cross-attention layers (lines 155–166) further guide queries towards distinct visual regions. We will clarify this implicit enforcement of diversity in L178 and provide detailed experimental results in supplementary materials.

Thank you again for your insightful questions; we will incorporate these clarifications into the revised manuscript.

References

[1] Lin et al. "Microsoft coco: Common objects in context." ECCV 2014.

[2] Carion et al. "End-to-end object detection with transformers." ECCV, 2020.

[3] Zhang et al. "What are expected queries in end-to-end object detection?" arXiv preprint arXiv:2206.01232 (2022).

[4] Wang et al. "An efficient fine tuning strategy of segment anything model for polyp segmentation." Scientific Reports 15.1 (2025): 14088.

Additional Questions

1. Clarification on processing multiple text prompts per forward pass

Our implementation supports an arbitrary number of text prompts per image, processed simultaneously in a single forward pass. Specifically, during the forward pass:

  • We first tokenize all text prompts for each image simultaneously and repeat the visual tokens from BEiT-3 for each prompt, generating visual-text embeddings in one BEiT-3 forward pass (e.g., embeddings for "giraffe" and "rock" concurrently).
  • Tie-breaker embeddings are replicated for each prompt independently, and the resulting queries are concatenated (e.g., concat[queries for "giraffe", queries for "rock"]). This combined stack is then processed through SAM’s prompt and mask decoders simultaneously. Thus, Figure 4(b) accurately represents this concurrent processing.
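
A minimal sketch of this query construction (tensor shapes, the shared tie-breaker parameter, and the omitted projection layer are illustrative):

```python
import torch

K, d = 20, 256                                         # tie-breakers per prompt, query dim
tie_breakers = torch.nn.Parameter(torch.randn(K, d))   # learned, shared across prompts

def build_queries(prompt_embeds: torch.Tensor) -> torch.Tensor:
    """prompt_embeds: (P, d) fused vision-language [CLS] embeddings from BEiT-3,
    one per text prompt (e.g. "giraffe", "rock"), already projected to the query
    dimension. Each prompt embedding is replicated K times, offset by a distinct
    positional tie-breaker, and all prompts are stacked for a single decoder pass."""
    queries = prompt_embeds[:, None, :] + tie_breakers[None, :, :]   # (P, K, d)
    return queries.reshape(-1, d)                                    # (P * K, d)

# Two prompts -> 2 * 20 = 40 queries decoded concurrently.
print(build_queries(torch.randn(2, d)).shape)   # torch.Size([40, 256])
```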

Evaluation setup in Tables 1–5: For comparability with previous studies (each image is seen in one forward pass), all text prompts per image were processed simultaneously in one forward pass in the reported evaluations.

Impact of batching prompts on accuracy: We empirically tested batching prompts versus separate processing on ADE20K-150:

| Prompt mode | Forward-passes / image | mIoU |
|---|---|---|
| Single-prompt per pass | = #labels | 60.0 (-0.4) |
| All prompts together (default) | 1 | 60.4 |

The minor difference (0.4 mIoU) indicates that batching prompts neither significantly harms nor benefits segmentation quality. The reason is that BEiT-3 already provides a distinct CLS token for each prompt, and tie-breakers are independently replicated per prompt, minimizing cross-interference. The slight improvement from batching arises from the self-attention layers in our soft-prompting Transformer module, promoting query diversity.

How positional information is handled when multiple prompts are batched. Positional tie‑breakers are added independently to every prompt before the prompts are stacked. No extra “prompt‑id” encoding is introduced or needed; Hungarian matching during training ensures that the K queries attached to each prompt specialize on distinct spatial modes for that prompt only.

2. “What exactly is ‘soft‑prompting’ in the soft‑prompting transformer? Is there a trained prefix?”

“Soft‑prompting” here means that each language‑aware query qᵢ is refined on‑the‑fly by cross‑attention with the frozen SAM2 level‑3 feature map (Fig. 4c). No static prefix or extra tokens are prepended as in LLM prompt‑tuning; instead the three Transformer layers are entirely learnable and act as an adapter that lets text semantics “look at” high‑resolution vision tokens before the mask decoder sees them.
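
A minimal sketch of one such layer (dimensions, head count, and layer composition are illustrative; the adapter stacks three of these):

```python
import torch
import torch.nn as nn

class SoftPromptLayer(nn.Module):
    """Language-aware queries attend over the frozen SAM2 feature map so they can
    'look at' high-resolution vision tokens before reaching the mask decoder."""
    def __init__(self, d: int = 256, heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))
        self.n1, self.n2, self.n3 = nn.LayerNorm(d), nn.LayerNorm(d), nn.LayerNorm(d)

    def forward(self, queries, image_tokens):
        # queries: (B, P*K, d); image_tokens: (B, H*W, d) flattened SAM2 features.
        q = self.n1(queries + self.self_attn(queries, queries, queries)[0])
        q = self.n2(q + self.cross_attn(q, image_tokens, image_tokens)[0])
        return self.n3(q + self.ffn(q))

layer = SoftPromptLayer()
out = layer(torch.randn(1, 40, 256), torch.randn(1, 64 * 64, 256))
print(out.shape)  # torch.Size([1, 40, 256])
```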

3. Relationship between tie‑breaker tokens and positional encodings?

Tie-breaker tokens serve as instance-diversity embeddings, fundamentally different from standard positional encodings (learned or sinusoidal) that encode absolute pixel coordinates. While traditional positional encodings map pixels to unique spatial coordinates, our tie-breaker embeddings globally nudge identical semantic queries toward distinct latent spatial subspaces. During training, the Hungarian loss encourages each tie-breaker token to represent distinct, data-driven spatial modes (e.g., differentiating multiple "zebras" in Figure 3). The SAM2 image encoder still provides its standard 2‑D positional encodings, so spatial locality is preserved; tie‑breakers complement these by resolving instance ambiguity rather than spatial positioning itself.

4. Related work discussion

We appreciate the suggestions and will include references to the mentioned concurrent works in the revised Related Work section (§2). Additionally, we will explicitly compare these works quantitatively in Tables 1 and 3:

| Model | ADE-150 mIoU | RefCOCOg |
|---|---|---|
| OpenWorldSAM (902M) | 60.4 | 74.0 |
| ESC-Net (451.3M) | 41.8 | - |
| Sa2VA-1B (1B) | - | 72.3 |
| Sa2VA-8B (8B) | - | 74.1 |

We sincerely thank the reviewer again for their detailed and constructive feedback, which will greatly enhance the clarity and comprehensiveness of our manuscript.

Comment

Thank you for the detailed rebuttal. I think the "soft-prompt" here might need more clarifications in the updated manuscript since it's a bit non-standard. The results against concurrent work (ESC-Net and Sa2VA) are impressive.

Comment

We thank the reviewer for acknowledging our detailed rebuttal and for highlighting the importance of clarifying the "soft-prompt" terminology in our manuscript. We will explicitly address this in our updated manuscript. We also appreciate the positive feedback regarding our results compared to concurrent works.

Comment

Dear Reviewers,

As the author-reviewer discussion period will end soon (Aug 6, 11:59 PM AoE), please take a moment to read the authors’ responses and post a reply - either to acknowledge their clarifications or to raise any remaining concerns.

Thank you for your time and contributions to the review process.

Best regards,

AC

Final Decision

All four reviewers gave positive ratings (3 borderline accepts and 1 strong accept), and the rebuttal successfully addressed their concerns. In summary, this paper proposes OpenWorldSAM, a method extending SAM2 for open-vocabulary segmentation via language prompts. Strengths acknowledged across reviews include its unified and lightweight architecture, tie-breaker mechanism for multi-instance segmentation, and strong zero-shot generalization across diverse benchmarks. Reviewers acknowledged the author’s thorough responses and kept their ratings after rebuttal. Given the technical contributions and solid experimental results, the AC recommends acceptance.