PaperHub

Rating: 6.0 / 10 · Poster · 4 reviewers
Individual ratings: 4, 7, 6, 7 (min 4, max 7, std 1.2)
Confidence: 3.5 · Correctness: 3.3 · Contribution: 3.3 · Presentation: 2.8
NeurIPS 2024

Uncertainty-aware Fine-tuning of Segmentation Foundation Models

OpenReview · PDF
Submitted: 2024-05-13 · Updated: 2024-11-06
TL;DR

We introduce the Segmentation with Uncertainty Model (SUM), which enhances the accuracy of segmentation foundation models by incorporating an uncertainty-aware training loss and prompt sampling based on the estimated uncertainty of pseudo-labels.

Abstract

The Segment Anything Model (SAM) is a large-scale foundation model that has revolutionized segmentation methodology. Despite its impressive generalization ability, the segmentation accuracy of SAM on images with intricate structures is often unsatisfactory. Recent works have proposed lightweight fine-tuning using high-quality annotated data to improve accuracy on such images. However, here we provide extensive empirical evidence that this strategy leads to forgetting how to "segment anything": these models lose the original generalization abilities of SAM, in the sense that they perform worse for segmentation tasks not represented in the annotated fine-tuning set. To improve performance without forgetting, we introduce a novel framework that combines high-quality annotated data with a large unlabeled dataset. The framework relies on two methodological innovations. First, we quantify the uncertainty in the SAM pseudo labels associated with the unlabeled data and leverage it to perform uncertainty-aware fine-tuning. Second, we encode the type of segmentation task associated with each training example using a "task prompt" to reduce ambiguity. We evaluated the proposed Segmentation with Uncertainty Model (SUM) on a diverse test set consisting of 14 public benchmarks, where it achieves state-of-the-art results. Notably, our method consistently surpasses SAM by 3-6 points in mean IoU and 4-7 points in mean boundary IoU across point-prompt interactive segmentation rounds. Code is available at https://github.com/Kangningthu/SUM
Keywords
Segmentation foundation model

Reviews and Discussion

Review (Rating: 4)

This paper introduces the Segmentation with Uncertainty Model (SUM) that combines high-quality annotated data with a large unlabeled dataset to improve performance without forgetting. First, the authors quantify the uncertainty in the SAM pseudo labels associated with the unlabeled data and leverage it to perform uncertainty-aware fine-tuning. Second, the task prompt encodes the type of segmentation task associated with each training example to reduce ambiguity. The proposed method is evaluated on different test sets consisting of 14 public benchmarks.

Strengths

  1. The motivation of the paper, which aims to improve the accuracy of SAM without affecting generalization, is interesting.
  2. The paper verifies the effectiveness of the proposed method through a large number of experiments.

Weaknesses

1. The proposed uncertainty-aware fine-tuning method has limited innovation. Uncertainty-aware fine-tuning strategies have been widely used and proven to effectively purify pseudo-labels [1, 2]. The proposed method aims to improve the initial prediction of SAM. Although the framework has many complex modules, it does not reveal insightful views and lacks key theoretical analysis.

2. The method section lacks a theoretical introduction. The purely textual presentation reads like a technical document. More forms of exposition, such as formulas or figures, should be used to help readers understand the principle of the method.

3. The presentation of the method is unclear and not very readable. For example, the first sentence of Section 3.2 states that "SUM applies the same prompt-sampling strategy as SAM for the human-annotated data during interactive training". In that case, should this module be placed in the Preliminary section instead of a separate section in the method section? This makes the proposed method confusing to follow.

[1] Yuxi Wang, Junran Peng, and Zhaoxiang Zhang. Uncertainty-aware pseudo label refinery for domain adaptive semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9092–9101, 2021.

[2] Zhedong Zheng and Yi Yang. Rectifying pseudo label learning via uncertainty estimation for domain adaptive semantic segmentation. International Journal of Computer Vision, 129(4):1106–1120, 2021.

Questions

1. The Related Work section should be divided into subheadings. Summarizing the related work by category would improve the readability of the paper. In addition, emphasizing the insightful advantages of the proposed method in each part of the related work would lead to a better understanding of the proposed method.

2. The introduction of uncertainty-aware segmentation methods is relatively broad; a more detailed analysis of the related work should be provided.

Limitations

See the weaknesses above.

Author Response

We thank the reviewer for the thoughtful feedback. We respond in detail below.

Q1 Difference from previous uncertainty-aware segmentation methods

Our approach differs fundamentally from previous approaches in both 1) the generation of uncertainty maps and 2) the utilization of these uncertainty maps, as detailed below. We will clarify this in the revised manuscript.

The uncertainty-aware fine-tuning strategies in [1][3] generate uncertainty from model prediction logit values, while [2][4] use prediction differences between different heads. They are designed for domain-adaptive or semi-supervised semantic segmentation. In contrast, our proposed strategy is designed for interactive binary segmentation, which has not been tackled by previous methods.

Uncertainty map generation: Our main insight is to utilize external supervision to detect and correct systematic biases that accumulate during training. This design logic is fundamentally different from relying on the model's own self-training, as most previous methods do, including [1] and [2]. SAM undergoes several rounds of self-training, leading to the accumulation of pseudo-label errors and overfitting to them. In our evaluation, the model frequently predicts erroneous regions with high confidence, and different heads concur on these incorrect areas. Therefore, traditional methods for generating uncertainty maps, including [1][2][3][4], do not capture this uncertainty effectively.

Utilization of the uncertainty map: Our method is the first to introduce uncertainty-aware prompt sampling for training interactive segmentation models, representing a novel contribution to the field. In contrast, the semantic segmentation tasks addressed in [1] and [2] do not involve prompt sampling. Additionally, we have tailored the uncertainty-aware focal and Dice losses to train the binary segmentation foundation model; neither is proposed in [1] or [2]. We will report this in Section 2.

Quantitative comparison with previous methods: As reported in Table 1 of the paper, the proposed SUM method outperforms existing methods based on uncertainty quantification, including [1] and [2]: "SUM Confidence-Uncer" corresponds to [1], and "SUM Discrepancy-Uncer" corresponds to [2]. Table 1 also provides a comparison with U2PL [3], an uncertainty-aware semi-supervised segmentation method that uses uncertainty maps in the feature space via contrastive learning; it also performs worse than SUM (see also Figure 5).

Q2 Key insights for module design

We will modify the introduction and method sections to clearly explain the three main novel components of our framework:

  1. A module for uncertainty map generation, which is trained by external supervision to correct the systematic bias in the foundation model training. The module accurately quantifies the uncertainty in pseudo-labels, generalizing effectively across different tasks. We will also mention novel design considerations, including data-pair filtering, training mask generation, and model tuning design (now in Appendix Section D) in the main paper.
  2. An uncertainty-aware cost function, which leverages the uncertainty map.
  3. A strategy for uncertainty-aware prompt sampling during training, also leveraging the uncertainty map.
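To make the uncertainty-aware cost function (component 2) concrete, here is a minimal sketch, written by us for illustration rather than taken from the paper: focal and Dice losses on pseudo-labeled pixels are down-weighted according to the uncertainty map. The hard threshold `u_thresh` and the exact masking scheme are assumptions; the actual weighting in the paper may differ.

```python
import numpy as np

def uncertainty_weighted_losses(logits, pseudo_label, uncertainty,
                                gamma=2.0, u_thresh=0.5, eps=1e-6):
    """Illustrative uncertainty-aware focal + Dice losses.

    Pixels whose pseudo-label uncertainty exceeds `u_thresh` are masked
    out of both losses (an assumed weighting scheme).
    """
    p = 1.0 / (1.0 + np.exp(-logits))            # sigmoid probabilities
    w = (uncertainty < u_thresh).astype(float)   # per-pixel confidence weights

    # Focal loss, averaged over confident pixels only.
    pt = np.where(pseudo_label == 1, p, 1.0 - p)
    focal = -((1.0 - pt) ** gamma) * np.log(pt + eps)
    focal = (w * focal).sum() / (w.sum() + eps)

    # Dice loss restricted to confident pixels.
    inter = (w * p * pseudo_label).sum()
    denom = (w * p).sum() + (w * pseudo_label).sum()
    dice = 1.0 - (2.0 * inter + eps) / (denom + eps)
    return focal, dice
```

Any soft weighting (e.g., multiplying by 1 − uncertainty instead of thresholding) fits the same template.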

We will highlight these contributions more clearly in the introduction, comparing them in more detail to existing uncertainty-aware segmentation methods and modifying the paper structure, as explained below. We will also provide a more mathematical formulation of the uncertainty map, as suggested by the reviewer, and highlight our contributions more clearly in the figures (see the provided modified figures).

Q3 Discussion about other uncertainty-aware methods in related work

We will expand the related work section by providing more detailed explanations of relevant techniques. Previous uncertainty quantification approaches such as [1][2][3][4] are not designed for training interactive foundation models, and our experiments validate that they fail to generate effective uncertainty maps in our setting. Additionally, the way these methods utilize uncertainty maps differs from ours.

[1][2][4] are designed for domain-adaptive semantic segmentation, while [3] is for semi-supervised semantic segmentation. The concept of uncertain classes explored in [1] is not applicable to binary segmentation scenarios. Our strategy, however, is tailored for interactive binary segmentation. Please also see Q1.

We have already conducted comprehensive experiments and provided a detailed discussion of related work in Appendix E.1 and E.2, respectively. We will further expand the related work section to include these additional insights.

Q4 Paper organization.

  • Related Work Section: We will reorganize the Related Work section into subheadings, summarizing related work in distinct categories to improve readability and highlight the novel aspects of our method.
  • Method Section - Theoretical Introduction: We will add formulas and figures to the Method section to clarify the principles behind our approach.
  • Method Section - Presentation: We will move the description of the prompt-sampling strategy to the Preliminary section and add figures to enhance clarity and readability.

[1] Wang et al. Uncertainty-aware pseudo label refinery for domain adaptive semantic segmentation. ICCV 2021.

[2] Zheng et al. Rectifying pseudo label learning via uncertainty estimation for domain adaptive semantic segmentation. IJCV 2021.

[3] Wang et al. Semi-supervised semantic segmentation using unreliable pseudo-labels. CVPR 2022.

[4] Wu et al. UPL-SFDA: Uncertainty-aware pseudo label guided source-free domain adaptation for medical image segmentation. IEEE Transactions on Medical Imaging 2023.

Comment

I've reviewed the authors' responses. The additional explanation makes the novelty of the proposed method clear. The use of external supervision to detect and correct systematic biases accumulated during training makes sense. The discussion of related uncertainty methods highlights the insightful advantages of the proposed method. Therefore, I will increase the score in the final rating. In addition, will the code be made public?

Comment

We appreciate your constructive feedback. We are glad that our explanation makes the novelty of the proposed method clear!

Regarding the public release of the code, we are committed to making our research accessible to the community. Upon acceptance of our paper, we will make a project website and make the code available, with the hope that it will benefit other researchers and the broader community.

Review (Rating: 7)

The paper introduces a novel framework for enhancing the accuracy of the Segment Anything Model (SAM) while maintaining its generalization capabilities. SAM is a foundational model for interactive binary segmentation, but it struggles with segmenting intricate structures accurately. Fine-tuning SAM with high-quality annotated data often leads to overfitting, degrading its generalization abilities.

The proposed framework, called Segmentation with Uncertainty Model (SUM), addresses these challenges by combining high-quality annotated data with a large unlabeled dataset. Key innovations include:

  • Uncertainty-aware fine-tuning: The framework quantifies uncertainty in SAM's pseudo labels and incorporates it into the fine-tuning process to improve segmentation accuracy without losing generalization capabilities.
  • Task prompts: SUM uses task prompts to specify the segmentation task for each training example, reducing ambiguity and improving the model's performance on diverse tasks.
  • Uncertainty-aware prompt sampling: This technique avoids misleading prompt locations by focusing on regions with high confidence.

Experiments demonstrate that SUM consistently outperforms SAM and other fine-tuning strategies across various datasets, achieving significant improvements in mean Intersection over Union (mIoU) and mean boundary IoU (mBIoU).

Strengths

Improved Accuracy:

SUM significantly enhances the segmentation accuracy of SAM, particularly for complex structures. The uncertainty-aware fine-tuning process focuses on regions with high confidence, leading to more precise segmentation results.

Maintained Generalization:

Unlike traditional fine-tuning methods that can lead to overfitting, SUM maintains SAM’s generalization abilities. This is achieved by effectively combining high-quality annotated data with a large, diverse set of unlabeled data.

Versatility and Flexibility:

The use of task prompts allows SUM to handle various segmentation tasks, including salient-object, entity, and part segmentation. This flexibility makes SUM suitable for a wide range of applications.

Efficiency in Handling Uncertainty:

By incorporating uncertainty maps and an uncertainty-aware loss function, SUM effectively manages the noise and inaccuracies in pseudo labels. This leads to more reliable training and improved overall performance.

Robust Performance Across Datasets:

SUM demonstrates robust performance on multiple public benchmarks and internal datasets, consistently outperforming existing methods in mIoU and mBIoU. This robustness underscores the model’s effectiveness in diverse settings.

Innovative Use of Uncertainty Maps:

The generation and utilization of uncertainty maps to guide the training process is an innovative approach. It not only enhances segmentation accuracy but also improves the quality of the pseudo labels used in training.

Weaknesses

Training Complexity:

The SUM framework introduces significant complexity into the training process. Incorporating uncertainty maps, task prompts, and uncertainty-aware prompt sampling requires additional computation and fine-tuning, which may be challenging and time-consuming to implement.

Dependency on Initial Model Performance:

The effectiveness of the SUM framework heavily depends on the initial performance of the SAM model. If SAM's initial pseudo labels are highly inaccurate, the overall performance of SUM could be compromised, as the refinement process might not fully correct these inaccuracies.

Computational Overhead:

The additional steps involved in generating and utilizing uncertainty maps, as well as the iterative prompt sampling, add computational overhead. This could be a barrier for practical deployment in resource-constrained environments.

Evaluation on Specific Datasets:

The paper primarily evaluates SUM on a limited set of datasets focused on specific segmentation tasks. While the results are promising, the generalizability of the framework to other datasets and segmentation tasks remains uncertain. Further validation on a broader range of datasets is necessary to confirm its robustness.

Questions

See the weaknesses above.

Limitations

See the weaknesses above.

Author Response

We thank the reviewer for the thoughtful feedback. It will help us improve the manuscript. We respond in detail below.

Q1 Training complexity and Q3 computational overhead

We acknowledge the reviewer's concern about the additional training phase for obtaining uncertainty maps in the SUM framework. While the framework adds steps, they remain manageable within practical constraints, as detailed below. We will mention this in the paper and add a subsection to the appendix with a more detailed explanation.

Training the Uncertainty Map Generation Module: Training the uncertainty map generation module is relatively efficient on the human-labeled samples. It involves tuning a small number of parameters and can be completed within 4 hours using 8 A100 GPUs.

Generating uncertainty maps for unlabeled images: The computational overhead introduced by the SUM framework is minimal. The process of generating pseudo labels and uncertainty maps for unlabeled data reuses the existing SAM encoder, thereby sharing the primary computational workload. Beyond pseudo-label generation, the additional time required to produce refined masks using a lightweight decoder is negligible compared to the time taken by the image encoder: the ViT-H image encoder in SAM has 632M parameters, while the prompt-based decoder has only 3.87M. Per reference [1], on an RTX 4090 the decoder adds only approximately 2% more computation time compared to the encoder.

Fine-Tuning the SAM Model: Once the uncertainty map is generated, fine-tuning the SAM model with it is similar to the standard training used for SAM.

  • Iterative Point Prompt Sampling: This method is used in standard SAM training as well, and replacing the original uniform sampling with weighted (uncertainty-aware) sampling results in a negligible training burden. Sampling a point from a 1024x1024 candidate pool using both methods can be completed within 0.006-0.009 seconds (average of 1000 runs) on the CPU of a MacBook.

  • Task Prompt: The task prompt, a learnable single vector, is combined via element-wise addition with the embeddings from the SAM image encoder and is used only in the first round of interactive segmentation. The element-wise addition of two tensors is relatively fast.

  • Uncertainty-aware Loss Computation: This involves thresholding and a weighted loss computation, which requires a similar running time to the original loss.
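The weighted point sampling described above can be sketched as follows. This is our own illustration under an assumed weighting scheme (weights proportional to 1 − uncertainty within the error region), not the exact implementation:

```python
import numpy as np

def sample_prompt_point(error_region, uncertainty, rng=None):
    """Sample one prompt point from an error region, down-weighting
    uncertain pixels (assumed weighting: 1 - uncertainty).

    error_region: (H, W) bool mask of candidate pixels.
    uncertainty:  (H, W) float map in [0, 1].
    Returns (row, col) of the sampled point.
    """
    rng = np.random.default_rng() if rng is None else rng
    weights = error_region * (1.0 - uncertainty)  # favor confident pixels
    if weights.sum() == 0:                        # degenerate case: uniform
        weights = error_region.astype(float)
    p = (weights / weights.sum()).ravel()
    idx = rng.choice(weights.size, p=p)
    return np.unravel_index(idx, weights.shape)
```

Uniform sampling, as in standard SAM training, is the special case where the uncertainty map is identically zero.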

Inference Phase: Uncertainty maps are utilized only during training. Once our model is trained, inference operates similarly to the SAM model, without additional computational burden. This results in negligible computational overhead.

Q2 Dependence on the initial model performance

As shown in Figure 9, although the quality of refined results depends on the initial model performance, the gains are positive for most examples. We provide Figure 5 in the rebuttal PDF to demonstrate that even when the initial SAM input quality is suboptimal, our mask refinement module still enhances the input and improves performance.

Q4 Evaluation set

We organize our evaluation tasks according to the hierarchical level of granularity, covering various levels (part, entity, multiple instances). Each task is evaluated using several diverse datasets designed to encompass a wide range of images and subtasks.

For instance, our part segmentation evaluation utilizes five diverse datasets: Fashionpedia, Fashionpedia Subpart, Paco, Multi-Human Parsing, and Portrait. The first two include a comprehensive ontology of fashion elements and different levels of part granularity. Paco covers 75 object classes and the last two focus on different human-specific part segmentation subtasks.

That said, we acknowledge that evaluating a broader range of segmentation tasks would further strengthen the robustness of our proposed methods. To address this, we have extended our evaluation by testing SUM and SAM on additional image types. For reproducibility, SUM is fine-tuned on the public dataset FT-Medium.

We selected 7 datasets from the evaluation sets used in SAM to complement our existing 14 public evaluation sets. Additionally, we have included part of a synthetic dataset, GTAV [2]. These additional evaluation sets encompass various image types, e.g., driving, synthetic, egocentric, irregular shapes, paintings, underwater animals, drones, and underwater trash.

The mIoU comparison results, reported in the following tables, confirm that SUM consistently outperforms SAM. We appreciate the reviewer’s suggestion and will include these additional results in the Appendix of the final version.

| Dataset    | Image type        | Method | Round 1 | Round 3 | Round 5 | Round 9 |
|------------|-------------------|--------|---------|---------|---------|---------|
| Cityscapes | Driving           | SAM    | 44.1    | 50.9    | 58.6    | 64.6    |
|            |                   | SUM    | 46.4    | 57.3    | 62.5    | 67.1    |
| EgoHOS     | Egocentric        | SAM    | 77.9    | 85.4    | 90.2    | 91.9    |
|            |                   | SUM    | 79.0    | 90.0    | 92.3    | 93.5    |
| DRAM       | Paintings         | SAM    | 68.4    | 81.3    | 87.9    | 91.1    |
|            |                   | SUM    | 73.6    | 85.8    | 89.2    | 91.4    |
| ishape     | Irregular shapes  | SAM    | 41.4    | 60.4    | 75.7    | 84.6    |
|            |                   | SUM    | 74.9    | 87.3    | 92.6    | 94.4    |
| GTAV       | Synthetic         | SAM    | 44.8    | 47.0    | 52.2    | 57.0    |
|            |                   | SUM    | 45.7    | 52.8    | 56.7    | 59.7    |
| NDD20      | Underwater animal | SAM    | 86.2    | 88.3    | 90.2    | 91.3    |
|            |                   | SUM    | 87.9    | 91.3    | 92.2    | 93.1    |
| TrashCan   | Underwater        | SAM    | 63.3    | 69.1    | 76.7    | 82.6    |
|            |                   | SUM    | 64.9    | 76.8    | 82.0    | 86.1    |
| IBD        | Drones            | SAM    | 78.8    | 84.5    | 91.2    | 93.6    |
|            |                   | SUM    | 78.4    | 87.6    | 91.4    | 93.9    |

[1] Song et al. SAM-Lightening: A Lightweight Segment Anything Model with Dilated Flash Attention to Achieve 30 times Acceleration. arXiv preprint arXiv:2403.09195, 2024.

[2] Richter et al. Playing for data: Ground truth from computer games. ECCV 2016.

Review (Rating: 6)

In this paper, the authors proposed the Segmentation with Uncertainty Model (SUM) which combines high-quality annotated data with a large unlabeled dataset. This novel framework improves the performance of the large-scale foundation model without forgetting.

Strengths

Paper clarity. The paper is overall well-written and structured.

Good results. The method improves over the original SAM method and achieves SoTA results.

Adequate appendix. The quantitative and qualitative results in supplementary material support the paper's idea and concept.

Weaknesses

Although the paper is overall well-written and structured, the figures are confusing.

  1. The legend in Figure 1 looks confusing to me.

  2. Figure 2 lacks emphasis on the key parts.

  3. Figure 3 misses the uncertainty map.

Besides, I would suggest the authors use bold to highlight the context they want to emphasize in the Intro Sec.

Questions

Please see the weaknesses parts and the following questions.

  1. Can the authors provide the uncertainty map mathematically?

  2. Is the uncertainty map similar to the attention in the transformer?

Limitations

The authors adequately addressed their work's limitations in the appendix.

Author Response

We thank the reviewer for the thoughtful feedback. We respond in detail below.

Q1 Clarification on the figures

We appreciate the reviewer’s suggestion to make the figures clearer. We have included updated figures in the rebuttal PDF (see Figure 1, 2, 3 in the rebuttal PDF) and will incorporate them into the final version of our paper.

In Figure 1, we have allocated individual legends to each corresponding sub-figure to clarify the relationship between the figures and their respective legends.

In Figure 2, we highlight the novel components of our proposed framework, making them easier to identify.

In Figure 3, we have emphasized the uncertainty map and toned down the colors to avoid distraction.

Q2 Highlight the context in intro

We appreciate the reviewer's suggestion. We will highlight key concepts in the introduction and method sections in our revised draft.

Q3 Definition of the uncertainty map

We will add the following mathematical definition of the uncertainty map to Section 3.3. The uncertainty map is obtained from the absolute difference between the original SAM prediction and the refined output of the Mask-refinement Module. Both the sigmoid-transformed probabilities of the SAM logits and the refined prediction take values in the range $[0, 1]$ and share the same spatial dimensions. The uncertainty map is computed as their per-pixel absolute difference. Let $u_i$ denote the uncertainty value of the $i$-th pixel; then

$$u_i = \left| \sigma(\mathbf{S}_i) - \sigma(\mathbf{R}_i) \right|$$

where $\sigma$ is the sigmoid function, $\mathbf{S}$ denotes the SAM logits, and $\mathbf{R}$ denotes the refined prediction logits. This yields values between 0 (no difference, indicating low uncertainty) and 1 (large difference, indicating high uncertainty).
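In code, this definition amounts to a single vectorized operation. A straightforward NumPy sketch, assuming both inputs are logit maps of the same shape:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def uncertainty_map(sam_logits, refined_logits):
    """Per-pixel uncertainty u_i = |sigmoid(S_i) - sigmoid(R_i)|.

    Both inputs are (H, W) logit maps; the output lies in [0, 1],
    with large values marking pixels where the refinement module
    disagrees with SAM.
    """
    return np.abs(sigmoid(sam_logits) - sigmoid(refined_logits))
```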

Q4 The similarity of uncertainty map with respect to the attention in the transformer

The attention mechanism in transformers serves multiple crucial functions, such as capturing long-range dependencies and assigning importance weights to tokens. Similarly, the uncertainty map in our method assigns importance to regions of the pseudo labels, which is somewhat analogous in spirit to attention. However, there are important differences.

The uncertainty map is specifically used to guide the training process of the segmentation model, modifying the training cost function and the prompt sampling process. In contrast, attention is a component of the function implemented by transformers, which enables the model to weigh the relevance of different tokens within the sequence, facilitating better context understanding and representation. It does not modify the cost function (or the prompt sampling process).

Comment

I appreciate the author's detailed response about the figures and definition of the uncertainty map. I will keep my original score. It would be better if the authors could make the code available.

Comment

We appreciate the reviewer’s valuable feedback. To enhance the accessibility of our research to the community, we will create a project website. Upon acceptance, we will make the inference code and the novel components of the SUM training code publicly available. We are currently in the process of obtaining approval from our organization for the release of the full training code.

Review (Rating: 7)

This paper proposes the Segment with Uncertainty Model (SUM), a fine-tuning framework for existing foundational segmentation models like SAM. Specifically, SUM consists of two main components: an uncertainty-aware training pipeline and the task prompt concept to reduce ambiguity.

Technically, the uncertainty-aware training pipeline comprises three distinct techniques. First, the uncertainty map generation module is fine-tuned from SAM using filtered high-quality images from human-annotated datasets. With uncertainty maps generated from this module, an uncertainty-aware prompt sampling strategy is proposed to increase the probability of selecting prompts from high-confidence regions. Finally, SUM is trained in a semi-supervised manner, and the uncertainty-aware focal and Dice losses are applied to the unlabeled branch. Additionally, the task prompt concept is introduced to differentiate data from various sources, reducing ambiguity related to the different segmentation types of interest across datasets.

The experimental evaluation in this paper assessed the performance of the proposed strategy on various datasets. The results indicate that the proposed SUM achieved superior performance across multiple benchmarks compared to the other methods.

Strengths

  • The paper provides a clear explanation and detailed analysis of the proposed method. Also, it is good that even the supplementary material was faithfully written, allowing me to check most of the things I was curious about while reading the main text.
  • The ideas behind uncertainty-based training strategy are considered intuitive and novel.
  • The experimental analysis is comprehensive. In addition to the evaluation of multiple benchmarks across various datasets, the ablation studies and quantitative results analyses presented in the supplementary materials further enhance credibility.
  • The discussion on semi-supervised baselines makes the approach by which SUM utilizes unlabeled data more convincing.

Weaknesses

There are a few concerns about this paper:

  • Compared to SAM fine-tuning methods, SUM requires an additional training phase to obtain uncertainty maps. This introduces extra training burden.
  • SUM demonstrates outstanding performance on numerous benchmarks. However, as shown in Table 6, the performance improvement is not substantial with the increase in the number of parameters of the backbone. Does this indicate that SUM may face challenges in terms of scaling up the model?

Questions

I thank the author for their detailed experiments and analysis. Nevertheless, there are a few questions I wonder about.

  • The uncertainty-based strategy is mainly applied to unlabeled data. What would happen if a similar strategy were applied to the labeled data as well?
  • It appears that the task prompt is used to distinguish the type of input images. In Table 2, the SUM Continuous TP setting employs the task prompt during inference as well. I am curious why the performance is better without using the task prompt during inference and how the performance would be affected by using different task prompts.

Limitations

yes

Author Response

We thank the reviewer for their encouraging feedback. We believe it will improve the manuscript. We respond in detail below.

Q1 Extra training burden

The reviewer is correct that the additional training phase for obtaining uncertainty maps in the SUM framework adds some computational overhead, but it is manageable (relative to the large size of the backbone model). It involves tuning a small number of parameters and can be completed within 4 hours using 8 A100 GPUs. We will mention this in the revised version of the paper.

Q2 Scaling up with parameter numbers

The reviewer is correct that the performance gains eventually saturate when scaling the backbone, but this already occurs in the original SAM framework, as was noted in the SAM paper [1] (arXiv version, page 12): “ViT-H improves substantially over ViT-B, but has only marginal gains over ViT-L. Further image encoder scaling does not appear fruitful at this time.” Our results in Table 6 align with this observation, showing similar performance trends.

We reorganized Table 6 to highlight the boundary IoU performance of the original SAM model, confirming that our results are consistent with the conclusions of the SAM paper:

| Backbone | Metric | Points | Salient Object (SAM) | Entity (SAM) | Part (SAM) |
|----------|--------|--------|----------------------|--------------|------------|
| ViT-B    | mBIoU  | 1      | 57.7                 | 64.4         | 48.0       |
| ViT-L    | mBIoU  | 1      | 67.4                 | 69.5         | 48.8       |
| ViT-H    | mBIoU  | 1      | 68.3                 | 70.1         | 48.3       |

Our SUM framework consistently improves SAM performance across different tasks and backbones. The gains of SUM with respect to SAM are shown in the following table:

| Backbone | Metric | Points | Gain on Salient Object | Gain on Entity | Gain on Part |
|----------|--------|--------|------------------------|----------------|--------------|
| ViT-B    | mBIoU  | 1      | 8.9                    | 2.3            | 2.3          |
| ViT-L    | mBIoU  | 1      | 7.5                    | 1.6            | 0.2          |
| ViT-H    | mBIoU  | 1      | 7.3                    | 3.2            | 2.9          |
| ViT-B    | mBIoU  | 6      | 2.7                    | 2.4            | 1.7          |
| ViT-L    | mBIoU  | 6      | 3.9                    | 2.9            | 1.3          |
| ViT-H    | mBIoU  | 6      | 4.4                    | 3.8            | 2.6          |

In summary, while the SAM framework provides limited gains from scaling the backbone, our SUM framework consistently enhances performance across various tasks and backbones.

Q3 Applying uncertainty-aware training for the labeled data

We designed the uncertainty-aware training strategy to mitigate the influence of noisy annotations in pseudo-labels during model training, as our labeled data are of high quality. However, the reviewer's suggestion is very intriguing. In scenarios where the quality of the available annotations is not assured, our uncertainty quantification module could be applied to these labels to enhance training. We will mention this in Section 5. In the rebuttal PDF, we provide a proof of concept that the uncertainty map can be applied to human-labeled data. The uncertainty map corresponding to an image from the Cityscapes [2] training set with a coarse mask is provided in Figure 4 of the rebuttal PDF. This map accurately highlights the boundary regions where the human annotations tend to be less precise.

Q4 Continuous TP in SUM

This is a good point. SUM only adds the task prompt in the first round (i.e. single prompt) during training and inference. Our results indeed indicate that adding the task prompt in all rounds (i.e. both single prompt and multi-prompt scenario) during the interactive training and inference may be counterproductive. A possible explanation is that the task prompt provides an implicit prior to the model regarding the desired output mask. When only one prompt is provided, this prior is useful and improves performance. However, when multiple prompts are provided, this already sufficiently constrains the desired mask, in a way that may slightly contradict the prior associated with the task prompt for some images. We will mention this in the revised manuscript, and point out that correctly balancing different user prompts is an important topic for future research.

In order to illustrate the effect of providing different task prompts, we report the results of applying SUM (FT-Medium) to the salient object segmentation task (on the same evaluation sets as the main paper) for three different round-1 task prompts. The results show that task prompt 1 enables the model to achieve the best performance in round 1, which makes sense since task prompt 1 is associated with the salient object task. However, for later rounds, the difference in performance is very small.

| Task           | Task prompt | Round 1 | Round 3 | Round 5 |
|----------------|-------------|---------|---------|---------|
| Salient Object | 0           | 77.7    | 90.4    | 93.3    |
| Salient Object | 1           | 85.2    | 91.6    | 93.5    |
| Salient Object | 2           | 81.9    | 91.1    | 93.5    |

[1] Kirillov et al. Segment Anything. arXiv preprint arXiv:2304.02643 (2023).

[2] Cordts et al. The cityscapes dataset for semantic urban scene understanding. CVPR. 2016.

Comment

I appreciate the authors' detailed response. Most of my concerns have been addressed. I will maintain my original score, but I would consider raising it if the authors can ensure the code is made public, particularly the training phase of SUM.

Comment

Thank you for your response. We're pleased that we addressed your concern. With respect to the code, we will make the inference code and the novel components of the SUM training code publicly available upon acceptance. We are currently in the process of obtaining approval from our organization for the release of the full training code.

Author Response

Response to all reviewers:

We thank the reviewers for their thoughtful comments and are encouraged by their positive feedback. We appreciate the recognition of our paper's soundness, contributions, and presentation.

Positive Feedback:

  • Soundness: Excellent (Reviewer NsqR), Good (Reviewers cnn3, HNZW, MMBZ)
  • Contribution: Excellent (Reviewers NsqR, cnn3), Good (Reviewer HNZW)
  • Presentation: Excellent (Reviewer NsqR), Good (Reviewer HNZW)

Novelty and Methodology:

  • We are pleased that the reviewers found our uncertainty-based training strategy intuitive and novel (Reviewer NsqR), and appreciated the innovative use of uncertainty maps (Reviewer HNZW), efficiency in handling uncertainty (Reviewer HNZW), and the versatility and flexibility of task prompts (Reviewer HNZW).
  • The use of unlabeled data was found convincing (Reviewer NsqR), and the overall framework was seen as novel (Reviewers cnn3, HNZW) and well-motivated (Reviewer MMBZ).

Experimental Validation:

  • Reviewers noted the comprehensiveness of our experiments (Reviewer NsqR), the outstanding performance validated on numerous benchmarks (Reviewer NsqR), credible ablation studies (Reviewer NsqR), and significant improvements and robust performance across datasets (Reviewer HNZW).
  • The effectiveness in a large number of experiments (Reviewer MMBZ) and achieving state-of-the-art results (Reviewer cnn3) were also highlighted.

Writing and Presentation:

  • We appreciate the positive feedback on the clarity and detailed analysis of our paper with faithfully written appendix (Reviewer NsqR), as well as its well-written and structured nature with adequate appendices (Reviewer cnn3).

We address the comments of the reviewers in detail in our responses below. Specifically, we elaborate on the additional computational cost of the proposed approach, provide improved figures, and report results on 8 additional evaluation sets to further validate the robustness and generalization of the approach. The additional experiments and clarifications will be added to the revised version of the paper.

We provide all figures mentioned in the rebuttal in the rebuttal PDF.

Comment

Dear reviewers,

Since the authors have provided their responses, please read them, reply in the discussion, and discuss points of disagreement if necessary by Aug 13.

Best regards,

AC

Comment

We sincerely appreciate the reviewers' time and their constructive feedback.

We would like to highlight that, based on the discussion:

  • We have successfully addressed most of the concerns raised by Reviewer NsqR and Reviewer cnn3.
  • Reviewer MMBZ has acknowledged that our explanations clarify the novelty of SUM and highlight the insightful advantages of our method. Reviewer MMBZ has also agreed to raise their initial score of 4.

We also want to emphasize that we agree that making the code accessible to the community is important and are working towards this.

If there are any remaining ambiguities or if you have further questions, please do not hesitate to reach out. We would be more than happy to provide any additional clarifications.

Thanks again!

Comment

As an update regarding code release, we are confident in securing approval for the full release of the code, given that it is based on public implementations and does not contain proprietary information. In the meantime, upon acceptance, we will publish a website for the paper, where we will release the inference code and the novel components of the SUM training code.

Final Decision

Most of the reviewers land on the positive side. Only one reviewer (Reviewer MMBZ) gave "Borderline reject", but that reviewer stated they would increase their score and probably forgot to do so in the final rating. Therefore, the AC suggests acceptance. The authors should include the discussion and new results from the rebuttal in the final version. In particular, they should release the code as they promised in the rebuttal.