PaperHub
Rating: 6.4 / 10 · Poster · 4 reviewers
Scores: 4, 4, 4, 4 (min 4, max 4, std 0.0) · Confidence: 3.5
Novelty: 2.3 · Quality: 2.5 · Clarity: 3.0 · Significance: 2.3
NeurIPS 2025

Gate to the Vessel: Residual Experts Restore What SAM Overlooks

OpenReview · PDF
Submitted: 2025-05-06 · Updated: 2025-10-29

Abstract

Keywords
Medical Image Segmentation; Foundation Models; Residual Learning; Sparse Expert Modules

Reviews and Discussion

Review (Rating: 4)

This paper introduces FineSAM++, a structure-aware sparse expert framework designed to enhance the performance of foundation segmentation models like SAM on medical images, particularly for fine-grained structures such as vessels. FineSAM++ employs a confidence-driven soft Routing Module to dynamically identify uncertain regions and selectively activate a lightweight Residual Expert for local refinement, enabling efficient correction without full retraining.

Strengths and Weaknesses

Strengths:

  1. The idea of the proposed method is simple and straightforward.
  2. The proposed method achieves state-of-the-art performance on five publicly available vascular segmentation datasets.

Weaknesses:

  1. While SAM demonstrates strong generalization across various diseases and imaging modalities, it would be helpful to clarify why the proposed MoE framework is applied only to vessel segmentation. Extending the approach to other structures or modalities could further validate its generalizability.
  2. It is recommended to provide a comparison of the number of trainable parameters and the inference FLOPs between FineSAM++ and baseline methods, to better illustrate the computational efficiency of the proposed framework.
  3. The paper would benefit from additional ablation studies or analysis of hyperparameters, such as the impact of δ in Equation 3, to help understand their influence on performance.
  4. A more detailed description of the method, including the design of g_θ in Equation 2 and the architecture of the expert modules, would improve the clarity and reproducibility of the work.
  5. It is suggested to evaluate the effectiveness of FineSAM++ on multi-class segmentation and different disease segmentation tasks to further demonstrate its versatility and robustness.

Questions

See above

Limitations

It is recommended to provide a discussion of the societal impact.

Final Justification

As the authors have addressed some of the previous concerns, I have increased my score for acceptance.

Formatting Issues

No

Author Response

We thank the reviewer for their valuable and insightful review. Below we address the main concerns:

W1, 5 “Extending the approach to other structures or modalities to validate its generalizability.”:

We agree with the concern regarding the importance of validating the generalizability of our proposed framework beyond vessel segmentation. In response, we have extended our evaluation to multi-class segmentation on the Synapse Multi-Organ CT dataset, which includes eight anatomically diverse abdominal organs. Following [1, 8, 9], we split the dataset into 18 training samples and 12 test samples and apply the corresponding preprocessing and data augmentation.

We present the quantitative results in Tab.XII. As shown, our method achieves the highest average Dice score (87.97%) and the lowest Hausdorff Distance (HD) of 7.89 among all compared methods, outperforming strong baselines such as H-SAM and nnU-Net. Notably, our method maintains high segmentation accuracy on challenging small organs like the pancreas, demonstrating that FineSAM++ preserves its segmentation strength and robustness even when extended to more delicate anatomical structures. We will include this supplementary analysis in the final submission to better showcase the generalization capability of our approach and further support the robustness of the method.

Table XII. Comparison to state-of-the-art models on the Synapse multi-organ CT dataset.

| Method | Spleen | Right Kidney | Left Kidney | Gallbladder | Liver | Stomach | Aorta | Pancreas | Mean Dice (%) | HD |
|---|---|---|---|---|---|---|---|---|---|---|
| TransUnet [1] | 87.23 | 63.13 | 81.87 | 77.02 | 94.08 | 55.86 | 85.08 | 75.62 | 77.48 | 31.69 |
| SwinUnet [2] | 85.47 | 66.53 | 83.28 | 79.61 | 94.29 | 56.58 | 90.66 | 76.60 | 79.13 | 21.55 |
| TransDeepLab [3] | 86.04 | 69.16 | 84.08 | 79.88 | 93.53 | 61.19 | 89.00 | 78.40 | 80.16 | 21.25 |
| DAE-Former [4] | 88.96 | 72.30 | 86.08 | 80.88 | 94.98 | 65.12 | 91.94 | 79.19 | 82.43 | 17.46 |
| MERIT [5] | 92.01 | 84.85 | 87.79 | 74.40 | 95.26 | 85.38 | 87.71 | 71.81 | 84.90 | 13.22 |
| nnUnet [10] | 91.68 | 88.46 | 83.68 | 70.82 | 97.13 | 83.34 | 93.04 | 81.50 | 87.33 | 10.78 |
| AutoSAM [6] | 80.54 | 80.02 | 79.64 | 1.37 | 89.24 | 61.14 | 82.56 | 44.22 | 62.08 | 27.56 |
| SAM Adapter [7] | 83.68 | 79.00 | 79.02 | 57.49 | 92.67 | 69.48 | 77.93 | 43.07 | 72.80 | 33.08 |
| SAMed [8] | 87.77 | 69.11 | 80.45 | 79.95 | 94.80 | 72.17 | 88.72 | 82.06 | 81.88 | 20.64 |
| H-SAM [9] | 93.34 | 89.93 | 91.88 | 73.49 | 95.72 | 87.10 | 89.38 | 71.11 | 86.49 | 8.18 |
| Ours | 94.25 | 91.53 | 93.21 | 71.23 | 96.89 | 90.83 | 92.52 | 82.23 | 87.97 | 7.89 |

W2 “Parameter efficiency”:

Thanks for your valuable comment. To address the concern regarding the efficiency of FineSAM++ in terms of learnable parameters among SAM-based segmentation methods, we conducted a detailed comparison. Specifically, we computed the number of parameters, FLOPs (GMACs), and inference latency under a standard input resolution of (1, 3, 1024, 1024). The results are summarized in Tab. X below:

Table X. Comparison of model size, computational cost, latency, and segmentation accuracy (Dice score) across segmentation methods using an input of size (1, 3, 1024, 1024).

| Model | Params (M) | GMACs (G) | Latency (ms) | Dice |
|---|---|---|---|---|
| UNet | 34.5 | 34.08 | 1.10 | 0.7787 |
| nnUnet | 126.2 | 1864.9 | 37.4 | 0.8220 |
| SAM Adapter | 104.3 | 400.1 | 127.8 | 0.4498 |
| H-SAM | 111.3 | 370.6 | 124.8 | 0.6622 |
| AutoSAM | 135.29 | 774.16 | 166.22 | 0.6603 |
| SAMed | 92.2 | 370.5 | 117.1 | 0.6170 |
| SAM | 93.7 | 372.0 | 116.33 | / |
| Ours (FineSAM++) | 94.4 | 376.8 | 117.6 | 0.8231 |

As shown in Tab. X, FineSAM++ introduces only 1.4M additional learnable parameters on top of the SAM backbone. Compared to other SAM-based segmentation methods, FineSAM++ demonstrates significantly higher parameter efficiency. Although its total parameter count is higher than that of lightweight models like UNet, FineSAM++ achieves the highest Dice score (0.8231). These results indicate that FineSAM++ strikes an excellent balance between parameter efficiency and segmentation performance. We will include this comparison in the revised version.

W3 “Ablation study on the threshold δ in the Gating module”:

Thanks for pointing this out. To assess the sensitivity of the Gating module to the pre-defined error threshold δ, we conducted an ablation study on the DRIVE dataset by varying δ ∈ {0.3, 0.4, 0.5, 0.6, 0.7}. The results are summarized in Tab. XI below:

Table XI. Ablation results for the threshold δ in the Gating module.

| δ | Dice | ACC | AUC | SE | SP |
|---|---|---|---|---|---|
| 0.3 | 0.8124 | 0.9712 | 0.9812 | 0.8432 | 0.9601 |
| 0.4 | 0.8187 | 0.9755 | 0.9846 | 0.8410 | 0.9732 |
| 0.5 | 0.8231 | 0.9790 | 0.9870 | 0.8366 | 0.9834 |
| 0.6 | 0.8180 | 0.9767 | 0.9854 | 0.8204 | 0.9807 |
| 0.7 | 0.8129 | 0.9735 | 0.9822 | 0.8083 | 0.9784 |

As shown, although a few individual metrics peak at other thresholds, δ = 0.5 achieves the best overall performance across all metrics. Intuitively, δ = 0.5 provides a meaningful routing threshold:

  • If δ is too low, almost all pixels are considered uncertain, resulting in unnecessary refinement and loss of gating sparsity.
  • If δ is too high, only a few pixels are routed, leaving the refinement module underutilized.

Thus, δ = 0.5 strikes a balance, activating residual experts in genuinely ambiguous regions while maintaining efficiency. We will include this explanation and the ablation results in the final version.
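The thresholding behavior described in the two bullets above can be sketched in a few lines of pure Python (our own illustrative sketch, not the authors' implementation; the function and argument names are assumptions):

```python
def routing_mask(uncertainty, delta=0.5):
    # Pixels whose estimated error/uncertainty exceeds delta are routed
    # to the residual experts; the rest keep the coarse SAM prediction.
    return [[1 if u > delta else 0 for u in row] for row in uncertainty]

# A low delta flags almost every pixel (little sparsity); a high delta
# flags only a few, underutilizing the refinement module.
mask = routing_mask([[0.9, 0.3], [0.45, 0.7]], delta=0.5)
# mask == [[1, 0], [0, 1]]
```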

W4 “A more detailed description of the method, including the design of g_θ in Equation 2 and the architecture of the expert modules”:

We thank the reviewer for emphasizing the importance of methodological clarity. We will revise the manuscript to provide a more detailed description.

1. Gating Module g_θ (Eq. 2):

g_θ is implemented as a lightweight CNN without downsampling. It contains three convolutional layers with kernel size 3×3, each followed by BatchNorm and ReLU activation. It produces two outputs through separate heads:

• Uncertainty map: A 1×1 convolution followed by a sigmoid activation outputs a spatial confidence map.

• Routing weights: The feature map is globally pooled and passed through an MLP to produce per-image expert routing weights.

2. Expert Modules:

Each expert is a compact 3-level U-Net, where each level contains 2 down-sampling and 2 up-sampling blocks. All experts share architecture but are independently trained to specialize in different error patterns. The final prediction is a weighted sum of expert outputs, gated by the spatial uncertainty and routing weights. We will update the final paper and appendix to support reproducibility.
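The fusion described above (coarse mask plus residual corrections, gated by uncertainty and routing weights) could be sketched roughly as follows. This is a shape-agnostic, pure-Python illustration under our own assumptions; names like `fuse_experts` are not from the paper:

```python
def fuse_experts(coarse, expert_residuals, routing_weights, uncertainty, delta=0.5):
    """Final prediction = coarse probability map plus a weighted sum of
    expert residuals, applied only at pixels flagged as uncertain.

    coarse:           H x W list of floats (coarse SAM probabilities)
    expert_residuals: list of K maps, each H x W (residual corrections)
    routing_weights:  K per-image weights from the gating MLP
    uncertainty:      H x W map from the gating CNN's sigmoid head
    """
    h, w = len(coarse), len(coarse[0])
    out = [row[:] for row in coarse]
    for i in range(h):
        for j in range(w):
            if uncertainty[i][j] > delta:  # gate: refine only uncertain pixels
                correction = sum(wk * ek[i][j]
                                 for wk, ek in zip(routing_weights, expert_residuals))
                out[i][j] = min(1.0, max(0.0, out[i][j] + correction))
    return out
```

The key design point mirrored here is sparsity: confident pixels pass through untouched, so the experts spend capacity only where SAM is likely wrong.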

L1 “Societal impact discussion”:

We appreciate the reviewer’s suggestion to address societal impact. FineSAM++ targets localized failures in medical images, especially in fine-grained regions like vessels with blurred boundaries, by introducing a structure-aware, sparsely activated expert mechanism that improves segmentation fidelity with minimal overhead. This design offers two key benefits: (1) improved reliability of automated tools to support clinical decision-making, and (2) a scalable, resource-efficient adaptation strategy that facilitates deployment in low-resource healthcare settings where retraining large models is impractical.

References

[1] Chen J., et al (2021). Transunet: Transformers make strong encoders for medical image segmentation.

[2] Cao H., et al (2022). Swin-unet: Unet-like pure transformer for medical image segmentation.

[3] Azad R., et al (2022). Transdeeplab: Convolution-free transformer-based deeplab v3+ for medical image segmentation.

[4] Azad R., et al (2023). Dae-former: Dual attention-guided efficient transformer for medical image segmentation.

[5] Rahman M M., et al (2024). Multi-scale hierarchical vision transformer with cascaded attention decoding for medical image segmentation.

[6] Hu X., et al (2023). How to efficiently adapt large segmentation model (sam) to medical images.

[7] Chen T., et al (2023). Sam fails to segment anything?–sam-adapter: Adapting sam in underperformed scenes: Camouflage, shadow, and more.

[8] Zhang K., et al (2023). Customized segment anything model for medical image segmentation.

[9] Cheng Z., et al (2024). Unleashing the potential of sam for medical adaptation via hierarchical decoding.

[10] Isensee F., et al (2021). nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation.

Comment

Thank you for your response. I am pleased to see that my concerns have been addressed. I am willing to increase my score accordingly.

Comment

We are grateful for your thoughtful reconsideration and for acknowledging the clarifications we provided. Your constructive comments have been valuable in improving the paper.

Review (Rating: 4)

Segmentation foundation models like SAM demonstrate strong generalization on natural images but tend to underperform on medical images, which often involve fine-grained structures such as blood vessels. This paper introduces FineSAM++, a sparse expert framework designed to refine SAM's output. Specifically, it employs a confidence-driven soft routing module to identify uncertain regions and selectively activates a mixture of residual experts to correct structural errors.

Strengths and Weaknesses

Strengths:

The proposed framework is well-motivated. While SAM is powerful, it struggles to capture fine-grained, high-resolution details—especially in medical images. Instead of retraining the entire model, this work proposes a more efficient approach: obtain an initial coarse mask, identify uncertain regions, and apply residual corrections through specialized experts. These diverse Residual Experts are trained to address different types of errors, such as disconnections, missing thin vessels, or noisy boundaries.

The paper includes thorough ablation studies that demonstrate:

  • the effectiveness of using multiple specialized Residual Experts over a single expert, and

  • the benefit of the proposed gating module for selective refinement.

A comprehensive evaluation is conducted across multiple datasets with a wide range of baselines, showing that the proposed method delivers competitive performance.

Weaknesses:

While it is intuitive to perform refinement on top of coarse masks from a foundation model, it would be important to acknowledge and analyze potential failure cases. In my own experience, refinement modules can sometimes overfit to thin structures, leading to false positives and degrading the initial mask quality. Did the authors observe any such failure modes where refinement worsened the output?

The paper lacks an ablation on the progressive optimization strategy with dynamic weighting. It would be helpful to compare this approach with a simpler stage-wise training setup, where each correction module is trained independently. A discussion on the specific advantages of the proposed strategy is needed.

The Residual Experts appear to be relatively independent from the base model. Have the authors explored lightweight, orthogonal integration methods such as adapters? This could offer insights into whether the refinement process is generalizable.

Questions

See weaknesses.

Limitations

Yes

Final Justification

The response from the authors resolves most of my concerns. I will maintain my initial recommendation to accept this paper.

Formatting Issues

No

Author Response

We thank the reviewer for their valuable and insightful review. Below we address the main concerns:

W1 “While it is intuitive to perform refinement on top of coarse masks from a foundation model, it would be important to acknowledge and analyze potential failure cases. In my own experience, refinement modules can sometimes overfit to thin structures, leading to false positives and degrading the initial mask quality. Did the authors observe any such failure modes where refinement worsened the output?”:

We thank the reviewer for the insightful comment. We agree that refinement modules can sometimes overfit to thin structures, potentially leading to false positives or degraded mask quality—especially in regions with weak boundaries or ambiguous textures. This is an important and realistic concern that we have also observed in our experiments.

To address this, we adopt a Degrade strategy in FineSAM++. Instead of feeding all experts the same coarse mask, we introduce randomized degradations:

$\hat{y}_{\mathrm{SAM}}^{(j)} = \mathrm{Degrade}(\hat{y}_{\mathrm{SAM}}, \eta_j)$

where Degrade(·) applies random masking, noise injection, or occlusion perturbations to diversify the coarse-mask inputs. This encourages specialization across experts and improves robustness to structural uncertainty. We conducted an ablation experiment to evaluate the effectiveness of the degradation mechanism. The results are summarized in Table XV.

Table XV. Ablation study of the Degrade strategy for multi-expert training.

| Strategy | Dice | ACC | AUC | SE | SP |
|---|---|---|---|---|---|
| No Degrade | 0.8154 | 0.9735 | 0.9813 | 0.8281 | 0.9720 |
| Degrade | 0.8231 | 0.9790 | 0.9870 | 0.8366 | 0.9834 |

The Degrade strategy consistently improved robustness, reduced false positives, and encouraged specialization across experts. We will include these results and visual examples in the revised submission and supplemental material.
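As a rough illustration of the randomized degradation described above, the following minimal sketch perturbs a coarse probability map. This is our own pure-Python approximation; the perturbation types and strengths (e.g., the 0.05 noise scale) are assumptions, not the authors' exact settings:

```python
import random

def degrade(mask, eta, seed=0):
    """Perturb a coarse probability map: with probability eta a pixel is
    zeroed out (random masking); otherwise small Gaussian noise is added
    (noise injection). Values are clipped back to [0, 1]."""
    rng = random.Random(seed)
    out = []
    for row in mask:
        new_row = []
        for v in row:
            if rng.random() < eta:
                new_row.append(0.0)                        # random masking
            else:
                noisy = v + rng.gauss(0.0, 0.05)           # noise injection
                new_row.append(min(1.0, max(0.0, noisy)))  # clip to [0, 1]
        out.append(new_row)
    return out
```

Feeding each expert a differently degraded copy (a different η_j per expert) is what drives the specialization discussed above.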

W2 “The ablation of the progressive optimization strategy”:

Thank you for the insightful suggestion. We agree that analyzing the effect of the progressive optimization strategy with dynamic weighting is important. To this end, we compared three strategies:

1) Naïve Joint Training: end-to-end optimization with uniform loss weights.

2) Stage-wise (Independent): freeze the coarse module and train the refinement module separately.

3) Ours (Progressive): progressive training with dynamic uncertainty-based loss weighting.

The results are summarized in Table XVI. Our method consistently outperforms both baselines. The progressive setup encourages interaction between modules and focuses learning on uncertain regions, which improves refinement without overfitting. We will include these results and discussion in the revised paper. Thank you again for the helpful suggestion.

Table XVI. Ablation study on training strategies.

| Strategy | Dice | ACC | AUC | SE | SP |
|---|---|---|---|---|---|
| Naïve Joint Training | 0.7984 | 0.9641 | 0.9745 | 0.8123 | 0.9632 |
| Stage-wise (Independent) | 0.8117 | 0.9722 | 0.9810 | 0.8289 | 0.9766 |
| Ours (Progressive) | 0.8231 | 0.9790 | 0.9870 | 0.8366 | 0.9834 |

W3 “The Residual Experts appear to be relatively independent from the base model. Have the authors explored lightweight, orthogonal integration methods such as adapters? This could offer insights into whether the refinement process is generalizable”:

We thank the reviewer for the insightful comment. We have indeed explored adapter-based refinement as a baseline, inspired by recent methods such as SAM-Adapter [1]. While adapters provide a parameter-efficient integration mechanism, we found them suboptimal for our task: they struggle to correct fine-grained structures such as thin vessels, bifurcations, and topological continuity.

As shown in Tab. X and Fig. 4 (in the manuscript), adapter-based methods often fail to recover fragmented branches or subtle boundaries, especially in high-precision vessel segmentation. These limitations motivated us to explore a mechanism that activates refinement paths only when needed—allowing the model to focus capacity on challenging regions while maintaining overall efficiency.

Table X. Comparison with SAM-Adapter across vessel datasets.

| Dataset | Method | Dice | ACC | AUC | SE | SP | C | A | L | clDice |
|---|---|---|---|---|---|---|---|---|---|---|
| DRIVE | SAM-Adapter | 0.4498 | 0.9311 | 0.9204 | 0.7577 | 0.9377 | 0.993 | 0.412 | 0.321 | 0.488 |
| DRIVE | Ours | 0.8231 | 0.9790 | 0.9870 | 0.8366 | 0.9834 | 0.998 | 0.848 | 0.865 | 0.832 |
| DCA1 | SAM-Adapter | 0.7583 | 0.9727 | 0.9408 | 0.7882 | 0.9836 | 0.998 | 0.785 | 0.812 | 0.800 |
| DCA1 | Ours | 0.8127 | 0.9775 | 0.9931 | 0.8479 | 0.9872 | 0.997 | 0.903 | 0.877 | 0.865 |
| CHUAC | SAM-Adapter | 0.7636 | 0.9784 | 0.9359 | 0.7583 | 0.9902 | 0.998 | 0.759 | 0.768 | 0.750 |
| CHUAC | Ours | 0.7768 | 0.9807 | 0.9951 | 0.7567 | 0.9932 | 0.998 | 0.787 | 0.795 | 0.770 |
| ROSE | SAM-Adapter | 0.6316 | 0.8578 | 0.8451 | 0.6503 | 0.9801 | 0.992 | 0.683 | 0.657 | 0.660 |
| ROSE | Ours | 0.8220 | 0.9483 | 0.9827 | 0.9485 | 0.9823 | 0.997 | 0.819 | 0.839 | 0.805 |

Mixture-of-Experts (MoE) architectures are well-suited for modeling spatially sparse and heterogeneous error patterns [2–4], which are characteristic of SAM’s failure cases in medical imaging. Our confidence-guided routing module enables spatially adaptive specialization by activating lightweight Residual Experts only in uncertain regions. Unlike bottlenecked adapter layers, these CNN-based experts retain local structural priors critical for topology-aware refinement.

Moreover, while SAM-Adapter [1] focuses on global adaptation in natural scenes (e.g., camouflage, shadows), our method is designed for localized, structure-aware correction—essential in clinical settings where boundary precision and topological integrity are crucial. FineSAM++ consistently outperforms both adapter-based and task-specific baselines across five vessel benchmarks, demonstrating the effectiveness and generalizability of this sparse, modular correction paradigm.

References

[1] Chen T., et al (2023). Sam-adapter: Adapting segment anything in underperformed scenes.

[2] Jacobs R A., et al (1991). Adaptive mixtures of local experts[J]. Neural computation.

[3] Fedus W., et al (2022). Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity.

[4] Hwang C., et al (2023). Tutel: Adaptive mixture-of-experts at scale.

Comment

Thanks for the detailed response, which resolves most of my concerns. I will maintain my initial recommendation to accept this paper.

Comment

Thank you for your follow-up and for taking the time to carefully consider our responses. We truly appreciate your thoughtful engagement with our work. We're glad to hear that we were able to address most of your concerns, and we sincerely thank you for maintaining your initial recommendation to accept the paper.

Please don’t hesitate to let us know if there are any remaining points you'd like us to further clarify. We are committed to making the final version of the paper as clear and rigorous as possible.

Review (Rating: 4)

The authors build on SAM to produce more coherent vessel segmentations. They introduce two modules: 1. A Routing module to identify uncertain regions. 2. A Residual Expert model (lightweight) that fixes the structures identified by the routing model. They evaluate their method on 5 vessel datasets on many vessel/tube specific metrics including CLDice and connectivity.

Strengths and Weaknesses

Strengths

  • the authors evaluate on various datasets and modalities
  • a comprehensive set of metrics is used to assess both accuracy and topology of the produced segmentations.
  • the authors compare against a comprehensive set of baselines, including specialized models and UNets
  • the authors ran an ablation study to understand the significance of each of their module.

Weaknesses.

Sparse related work section.

  • I would have expected more work on uncertainty evaluation, specialised models and foundation models.
  • No mention of the vessel specific literature.

Confusion in the experiments.

  • Different baselines are used for different datasets. Why is this the case? Can the author elaborate? I would guess that this limits the strength of the paper.
  • Why are the UNet results so variable depending on the dataset?
  • Did you also run data augmentations for the baselines? It seems to me that some important baselines are missing:
  • SAM models specifically designed for fine structures such as Segment anything in high quality [1]
  • nnUNet, which are a standard network for medical image segmentation [2]
  • Foundation model for tubular structures

[1]: Ke, Lei, et al. "Segment anything in high quality." Advances in Neural Information Processing Systems 36 (2023): 29914-29934.
[2]: Isensee, Fabian, et al. "nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation." Nature methods 18.2 (2021): 203-211.

Questions

  • Why did the authors choose different baselines per dataset? It seems to weaken their evaluation and makes performance less comparable.
  • What is the number of parameters with their suggested changes and how does it compare to other baselines (especially SAM and UNets)?
  • Why not mention nnUNets, and models specialized for tubular structures?
  • Were the augmentations applied to all baselines as well?
  • Would it be possible to see standard deviations to assess the significance of the results?

Limitations

The authors mentioned the added complexity and computation time of their method in the supplemental section. I think it would be great to quantify this numerically both in terms of parameters and in terms of training time (for both phases).

Final Justification

I appreciate the authors' effort to address all my concerns. The marginal improvements over nnUNet seem a bit concerning (some are insignificant). If the main claim is latency, this should be emphasized more in the paper. The improvements in latency are interesting, but should be clarified in the paper.

Formatting Issues

N/A

Author Response

We thank the reviewer for their valuable and insightful review. Below we address the main concerns:

W1 “Sparse related work section”:

We appreciate the reviewer’s observation regarding the breadth of the related work. We would like to clarify that our Related Work section (Sec. 2) will include dedicated subsections that cover foundation models, uncertainty modeling, specialized approaches, and vessel-specific segmentation literature:

1) Foundation Models & Uncertainty Modeling: We will expand Section 2.1 to cover additional efforts on adapting foundation models for medical image segmentation, including LoRA-based fine-tuning [1], prompt-driven adaptation [2], and SAM-Adapter methods [3]. To address the reviewer’s interest in uncertainty and boundary quality, we will add HQ-SAM [4], which improves boundary precision via high-quality priors and edge-aware refinement. We will also discuss topology-aware uncertainty modeling [5,6], which enhances consistency by quantifying structural uncertainty. These additions better contextualize the potential and limitations of foundation models in segmenting fine-grained medical structures like vessels.

2) Specialized Models for Vessel Segmentation: We will expand Section 2.3 to include recent progress in both traditional and foundation model-based vessel segmentation. Beyond CNN and Transformer methods [7,8], we will add vesselFM [9], a foundation model for universal 3D vessel segmentation, and a retinal disease foundation model [10], which supports generalizable vessel-pathology detection. These works underscore the increasing relevance of foundation models in vascular imaging for both segmentation and clinical insight.

Q1, 3, 4, 5 & W2.1, 2.3, 2.4, 2.5, 2.6 “Baseline selection, data augmentations, missing important baselines and standard deviations”:

We thank the reviewer for raising this important point. For each dataset, we selected baselines that are both highly cited on Google Scholar and representative of state-of-the-art performance, aiming to reflect the most influential methods specific to each task. Importantly, we included several SAM-based methods (SAM-Adapter, H-SAM, AutoSAM, SAMed) across all datasets, along with a vessel-specific baseline (Gupta et al. [5]). We agree that a unified suite of baselines would improve consistency, and we will revise our experiments to add some important baselines:

  1. HQ-SAM [4]: A SAM variant specifically designed for fine structural segmentation.
  2. nnUNet [11]: A standard network for medical image segmentation.
  3. RETFound (modified) [10]: A foundation model for vessel disease detection. We adapt it for segmentation by replacing the classification head with a lightweight decoder.

Furthermore, to ensure fairness, all methods were trained with the same data augmentation strategies, following [5], which includes random rotation, flipping, elastic deformation, and brightness/contrast jittering.

As shown in Table XII and Table 1, HQ-SAM improves over other SAM-based models (e.g., H-SAM, SAM-Adapter, SAMed) via lightweight token-level refinement. However, its frozen encoder and global adaptation lack domain-specific priors and struggle with fine-grained structures such as thin vessels and bifurcations. RETFound performs better structurally due to pretraining, while nnUNet remains stable due to its self-configuring pipeline and strong inductive bias in medical segmentation. Our method achieves balanced and robust performance across datasets, especially on Dice and AUC, critical for fine vessel structures. A more detailed analysis will be given in the revised version.

To better evaluate the robustness of our method, we report the mean ± standard deviation in Table XII. Minor revisions will be made in the final version to include these statistics for completeness.

Table XII. Quantitative comparison on DRIVE, DCAI, CHUAC and ROSE datasets.

| Dataset | Method | Dice | ACC | AUC | SE | SP |
|---|---|---|---|---|---|---|
| DRIVE | HQ-SAM | 0.7978 | 0.9697 | 0.8824 | 0.8033 | 0.9824 |
| DRIVE | nnUnet | 0.8220 | 0.9698 | 0.8940 | 0.8019 | 0.9862 |
| DRIVE | RETFound | 0.8020 | 0.9649 | 0.8830 | 0.7796 | 0.9821 |
| DRIVE | Ours | 0.8231 ± 0.014 | 0.979 ± 0.003 | 0.987 ± 0.023 | 0.8366 ± 0.048 | 0.9834 ± 0.004 |
| DCAI | HQ-SAM | 0.788 | 0.977 | 0.889 | 0.789 | 0.988 |
| DCAI | nnUnet | 0.8045 | 0.9584 | 0.9903 | 0.8264 | 0.9879 |
| DCAI | RETFound | 0.7938 | 0.9681 | 0.9911 | 0.8853 | 0.9861 |
| DCAI | Ours | 0.8127 | 0.9775 | 0.9931 | 0.8479 | 0.9872 |
| CHUAC | HQ-SAM | 0.705 | 0.894 | 0.812 | 0.676 | 0.948 |
| CHUAC | nnUnet | 0.7814 | 0.9776 | 0.8842 | 0.7788 | 0.9896 |
| CHUAC | RETFound | 0.7636 | 0.9604 | 0.9904 | 0.7325 | 0.9906 |
| CHUAC | Ours | 0.7768 ± 0.045 | 0.9807 ± 0.005 | 0.9951 ± 0.032 | 0.7567 ± 0.065 | 0.9932 ± 0.003 |
| ROSE | HQ-SAM | 0.752 | 0.9609 | 0.9904 | 0.794 | 0.9887 |
| ROSE | nnUnet | 0.827 | 0.947 | 0.931 | 0.865 | 0.994 |
| ROSE | RETFound | 0.7126 | 0.9197 | 0.9337 | 0.8563 | 0.9193 |
| ROSE | Ours | 0.822 ± 0.047 | 0.9483 ± 0.027 | 0.9827 ± 0.046 | 0.9485 ± 0.047 | 0.9823 ± 0.094 |

W2.2 “Why are the UNet results so variable depending on the dataset?”:

Thank you for the question. The performance variability of UNet reflects both dataset characteristics and inherent model limitations. UNet relies heavily on low-level intensity cues and lacks global context modeling, making it sensitive to modality shifts and low-contrast scenarios. For instance, it performs well on DRIVE (high-contrast RGB) but struggles on CHUAC (grayscale, noisy angiography). This highlights UNet’s limited robustness across diverse medical domains.

Q2 & L1 “Parameter Count Comparison”:

We thank the reviewers for their comments. To better illustrate the computational properties of our method, we compare its performance against SAM-based methods, UNets, and nnUNet. While training time is a relevant consideration, it is not the primary bottleneck or design objective of our approach. FineSAM++ is built upon a frozen backbone and introduces two lightweight adaptation stages, both of which converge efficiently in practice. Accordingly, we focus on parameter efficiency, inference cost, and segmentation accuracy. The results are summarized in Table X below:

Table X. Comparison of model size, computational cost, latency, and segmentation accuracy (Dice score) across segmentation methods using an input of size (1, 3, 1024, 1024).

| Model | Params (M) | GMACs (G) | Latency (ms) | Dice |
|---|---|---|---|---|
| UNet | 34.5 | 34.08 | 1.10 | 0.7787 |
| nnUnet | 126.2 | 1864.9 | 37.4 | 0.8220 |
| SAM Adapter | 104.3 | 400.1 | 127.8 | 0.4498 |
| H-SAM | 111.3 | 370.6 | 124.8 | 0.6622 |
| AutoSAM | 135.29 | 774.16 | 166.22 | 0.6603 |
| SAMed | 92.2 | 370.5 | 117.1 | 0.6170 |
| SAM | 93.7 | 372.0 | 116.33 | / |
| Ours (FineSAM++) | 94.4 | 376.8 | 117.6 | 0.8231 |

As shown above, FineSAM++ achieves a Dice score of 0.8231, the highest among all SAM-based methods, while introducing only 1.4M additional learnable parameters. Compared to UNet (1.1 ms latency) and SAM (116.33 ms latency), our method maintains a comparable inference time (117.6 ms) and significantly improves performance over Unet. In contrast to nnUNet, which requires heavy computation (1864.9G FLOPs), FineSAM++ achieves similar accuracy (0.8231 vs. 0.8220) with over 4× fewer parameters and over 30× faster inference.

References:

[1] Cheng Z., et al (2024). Unleashing the potential of sam for medical adaptation via hierarchical decoding.

[2] Chen Z., et al (2025). UN-SAM: Domain-adaptive self-prompt segmentation for universal nuclei images.

[3] Wu J., et al (2025). Medical sam adapter: Adapting segment anything model for medical image segmentation.

[4] Ke L., et al (2023). Segment anything in high quality.

[5] Gupta S., et al (2023). Topology-aware uncertainty for image segmentation.

[6] Huang J., et al (2024). Representing topological self-similarity using fractal feature maps for accurate segmentation of tubular structures.

[7] Qi Y., et al (2023). Dynamic snake convolution based on topological geometric constraints for tubular structure segmentation.

[8] Mou L., et al (2024). CS-Net: Channel and spatial attention network for curvilinear structure segmentation.

[9] Wittmann B., et al (2025). vesselFM: A Foundation Model for Universal 3D Blood Vessel Segmentation.

[10] Zhou Y., et al (2023). A foundation model for generalizable disease detection from retinal images.

[11] Isensee F., et al (2021). nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation.

Comment

I appreciate the authors’ in-depth response and their effort to address my concerns and add in-depth related work. Do the authors have access to the standard deviation for the baselines? Also, it looks like nnUNet and FineSAM++ are very close. Do they have any way to validate the statistical significance of their results? I am also not sure I understand the claims about faster inference times and number of parameters (4× and 30×). They seem inaccurate? Also, how are the Dice scores computed and aggregated in Table X?

Comment

We appreciate the reviewer’s feedback and address the concerns as follows:

“1. Standard deviation”:

We have calculated the standard deviation for baseline methods. This addition allows for more reliable comparisons in terms of both performance and stability (See Tab.XII).

Table XII. Quantitative comparison.

| Dataset | Method | Dice | ACC | AUC | SE | SP |
|---|---|---|---|---|---|---|
| DRIVE | HQ-SAM | 0.7978 ± 0.027 | 0.9697 ± 0.009 | 0.8824 ± 0.021 | 0.8033 ± 0.030 | 0.9824 ± 0.011 |
| DRIVE | nnUNet | 0.8220 ± 0.017 | 0.9698 ± 0.006 | 0.8940 ± 0.014 | 0.8019 ± 0.025 | 0.9862 ± 0.008 |
| DRIVE | RETFound | 0.8020 ± 0.022 | 0.9649 ± 0.008 | 0.8830 ± 0.018 | 0.7796 ± 0.028 | 0.9821 ± 0.010 |
| DRIVE | Ours | 0.8231 ± 0.014 | 0.9790 ± 0.003 | 0.9870 ± 0.023 | 0.8366 ± 0.048 | 0.9834 ± 0.004 |
| DCAI | HQ-SAM | 0.7880 ± 0.026 | 0.9770 ± 0.006 | 0.8890 ± 0.018 | 0.7890 ± 0.025 | 0.9880 ± 0.006 |
| DCAI | nnUNet | 0.8045 ± 0.015 | 0.9584 ± 0.005 | 0.9903 ± 0.006 | 0.8264 ± 0.021 | 0.9879 ± 0.004 |
| DCAI | RETFound | 0.7938 ± 0.021 | 0.9681 ± 0.007 | 0.9911 ± 0.005 | 0.8853 ± 0.019 | 0.9861 ± 0.007 |
| DCAI | Ours | 0.8127 ± 0.012 | 0.9775 ± 0.003 | 0.9931 ± 0.004 | 0.8479 ± 0.028 | 0.9872 ± 0.005 |
| CHUAC | HQ-SAM | 0.7050 ± 0.030 | 0.8940 ± 0.015 | 0.8120 ± 0.025 | 0.6760 ± 0.035 | 0.9480 ± 0.010 |
| CHUAC | nnUNet | 0.7814 ± 0.020 | 0.9776 ± 0.004 | 0.8842 ± 0.017 | 0.7788 ± 0.032 | 0.9896 ± 0.003 |
| CHUAC | RETFound | 0.7636 ± 0.028 | 0.9604 ± 0.008 | 0.9904 ± 0.006 | 0.7325 ± 0.045 | 0.9906 ± 0.004 |
| CHUAC | Ours | 0.7768 ± 0.045 | 0.9807 ± 0.005 | 0.9951 ± 0.032 | 0.7567 ± 0.065 | 0.9932 ± 0.003 |
| ROSE | HQ-SAM | 0.7520 ± 0.028 | 0.9609 ± 0.006 | 0.9904 ± 0.005 | 0.7940 ± 0.021 | 0.9887 ± 0.006 |
| ROSE | nnUNet | 0.8270 ± 0.018 | 0.9470 ± 0.006 | 0.9310 ± 0.012 | 0.8650 ± 0.020 | 0.9940 ± 0.005 |
| ROSE | RETFound | 0.7126 ± 0.032 | 0.9197 ± 0.009 | 0.9337 ± 0.009 | 0.8563 ± 0.033 | 0.9193 ± 0.007 |
| ROSE | Ours | 0.8220 ± 0.047 | 0.9483 ± 0.027 | 0.9827 ± 0.046 | 0.9485 ± 0.047 | 0.9823 ± 0.094 |

“2.Statistical significance”:

We performed paired t-tests to assess statistical significance (Tab.XIII); most improvements are statistically significant (p < 0.05). Our method demonstrates superior overall performance in segmentation accuracy, consistency (lower standard deviation), and statistical significance.

Table XIII. Statistical significance.

| Dataset | Method | Dice | ACC | AUC | SE | SP |
| --- | --- | --- | --- | --- | --- | --- |
| DRIVE | HQ-SAM | 9.86E-03 | 3.99E-04 | 2.47E-12 | 1.55E-01 | 1.32E-01 |
| DRIVE | nnUNet | 5.50E-02 | 9.79E-06 | 2.44E-12 | 3.62E-03 | 2.08E-02 |
| DRIVE | RETFound | 1.38E-02 | 9.61E-07 | 1.38E-13 | 5.60E-03 | 5.96E-01 |
| DCAI | HQ-SAM | 4.53E-07 | 9.97E-01 | 1.07E-21 | 3.28E-06 | 2.57E-01 |
| DCAI | nnUNet | 8.79E-04 | 1.35E-14 | 2.10E-02 | 6.22E-03 | 1.14E-02 |
| DCAI | RETFound | 1.24E-05 | 1.80E-05 | 4.79E-01 | 1.93E-06 | 1.70E-02 |
| CHUAC | HQ-SAM | 1.04E-01 | 8.28E-08 | 1.71E-04 | 9.80E-03 | 6.41E-05 |
| CHUAC | nnUNet | 6.96E-01 | 2.50E-01 | 2.50E-03 | 2.77E-01 | 4.46E-01 |
| CHUAC | RETFound | 3.23E-01 | 7.69E-04 | 2.23E-01 | 3.12E-01 | 4.02E-01 |
| ROSE | HQ-SAM | 1.97E-03 | 3.20E-01 | 1.25E-01 | 1.43E-06 | 7.49E-01 |
| ROSE | nnUNet | 7.03E-01 | 3.13E-01 | 1.65E-01 | 3.02E-04 | 8.90E-01 |
| ROSE | RETFound | 1.64E-04 | 1.51E-03 | 3.11E-01 | 6.43E-06 | 4.93E-02 |
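For concreteness, the paired t-statistic underlying these tests can be computed as in the following stdlib-only sketch. This is our own illustration on hypothetical per-image Dice scores, not the code used for the paper; in practice a library routine such as scipy.stats.ttest_rel would also return the p-value directly.

```python
import math

def paired_t_stat(scores_a, scores_b):
    """t statistic for paired samples: t = mean(d) / (sd(d) / sqrt(n)),
    where d are the per-image differences between the two methods."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean_d = sum(diffs) / n
    # Sample variance of the differences (Bessel-corrected).
    var_d = sum((d - mean_d) ** 2 for d in diffs) / (n - 1)
    return mean_d / math.sqrt(var_d / n)

# Hypothetical per-image Dice scores for two methods on the same test images.
t = paired_t_stat([0.80, 0.82, 0.81, 0.79], [0.78, 0.80, 0.80, 0.77])
```

The p-value is then obtained from the t distribution with n − 1 degrees of freedom (e.g., via scipy.stats); pairing per image is what makes the test sensitive to small but consistent differences such as those between nnUNet and FineSAM++.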

“3.Examination of Tab.X”:

We address the reviewer’s concerns on Tab.X below.

a.Parameter and Latency:

Compared to nnUNet (1864.9 GMACs), FineSAM++ is ~4.9× more computationally efficient (376.8 GMACs). The originally reported “30×” was a typographical error based on preliminary theoretical FLOP ratios across model scales; we will correct this in the revised manuscript.

b.Dice:

The Dice scores in Tab.X are averaged over all test images in each dataset. While our method has higher latency than nnUNet, it outperforms SAM-based baselines under a similar parameter budget.
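For clarity, this per-dataset aggregation can be sketched as follows. This is a minimal illustration on flattened binary masks; the function names are ours, not the authors’.

```python
def dice(pred, gt, eps=1e-7):
    """Dice coefficient between two flat binary masks (lists of 0/1).
    eps guards against division by zero when both masks are empty."""
    inter = sum(p * g for p, g in zip(pred, gt))
    return (2.0 * inter + eps) / (sum(pred) + sum(gt) + eps)

def mean_dice(preds, gts):
    """Average the per-image Dice over all test images of a dataset,
    as done for the values reported in Tab.X."""
    return sum(dice(p, g) for p, g in zip(preds, gts)) / len(preds)
```

Averaging per image (rather than pooling all pixels) weights every test image equally, which is the convention implied by the per-dataset means above.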

Review
4

This paper presents a structure-aware sparse expert framework, named FineSAM++, which enhances foundation segmentation models. The proposed method has a gating module that identifies structurally uncertain regions and activates a Residual Expert to correct local structural inconsistencies of vessel segmentation. The proposed method has been demonstrated on five vessel segmentation datasets, and the results show superior performance compared to the existing state-of-the-art segmentation methods.

Strengths and Weaknesses

Strengths

  • The proposed framework, with its gating module and residual expert, effectively refines the vessel segmentation masks produced by the SAM model.
  • The proposed method trains only part of the network, which enables efficient adaptation of the large model.
  • Experimental results using five public datasets demonstrate the effectiveness of the proposed method in local correction of the segmentation mask.

Weaknesses

  • There is no comparison of the number of learnable parameters between the proposed method and existing SAM-based methods.
  • The proposed gating module requires a pre-defined error threshold (δ), which may make the model sensitive to the threshold value.
  • There is no study of the hyperparameters used in the gating module and residual experts, such as the normalized routing weights (I, {w_j}) and the threshold (δ).

Questions

Please refer to the above weaknesses.

  • Is the proposed method efficient in terms of the number of learnable parameters compared with other SAM-based segmentation methods?
  • How is the error threshold (δ) in the gating module defined? Is the threshold the same for all five public datasets?
  • A hyperparameter study would strengthen the contribution of the proposed method.

Limitations

yes (in supplementary material)

Final Justification

After reviewing the rebuttal by the authors and other reviewers' comments, I keep my rating. The authors have addressed most of my concerns, and the SAM-based proposed method would be helpful in vessel segmentation masks of various medical images.

Formatting Concerns

There are no concerns about the paper formatting.

Author Response

We thank the reviewer for their valuable and insightful comments. We address the main concerns below:

Q1&W1 “Parameter efficiency vs. SAM-based method”:

Thanks for your valuable comment. To address your concern regarding the efficiency of FineSAM++ in terms of learnable parameters among SAM-based segmentation methods, we have conducted a detailed comparison. Specifically, we computed the number of parameters, FLOPs (GMACs), and inference latency under a standard input resolution of (1, 3, 1024, 1024). The results are summarized in Tab.X below:

Table X. Comparison of model size, computational cost, latency, and segmentation accuracy (Dice score) across segmentation methods using an input of size (1, 3, 1024, 1024).

| Model | Params (M) | GMACs (G) | Latency (ms) | Dice |
| --- | --- | --- | --- | --- |
| UNet | 34.5 | 34.0 | 81.10 | 0.7787 |
| nnUNet | 126.2 | 1864.9 | 37.4 | 0.8220 |
| SAM Adapter | 104.3 | 400.1 | 127.8 | 0.4498 |
| H-SAM | 111.3 | 370.6 | 124.8 | 0.6622 |
| AutoSAM | 135.29 | 774.16 | 166.22 | 0.6603 |
| SAMed | 92.2 | 370.5 | 117.1 | 0.6170 |
| SAM | 93.7 | 372.0 | 116.33 | / |
| Ours (FineSAM++) | 94.4 | 376.8 | 117.6 | 0.8231 |

As shown in Tab.X, FineSAM++ introduces only 1.4M additional learnable parameters on top of the SAM backbone. Compared to other SAM-based segmentation methods such as SAM Adapter (104.3M), H-SAM (111.3M), and AutoSAM (135.29M), FineSAM++ demonstrates significantly higher parameter efficiency. Although its total parameter count is higher than that of lightweight models such as UNet, FineSAM++ achieves the highest Dice score (0.8231). These results indicate that FineSAM++ strikes an excellent balance between parameter efficiency and segmentation performance. We will include this comparison in the revised version.

Q2&W2 “Ablation study on threshold δ in the Gating module”:

Thank you for pointing this out. To assess the sensitivity of the Gating module to the pre-defined error threshold δ, we conducted an ablation study on the DRIVE dataset by varying δ ∈ {0.3, 0.4, 0.5, 0.6, 0.7}. The results are summarized in Tab.XI below:

Table XI. Ablation results for the threshold δ in the Gating module.

| δ | Dice | ACC | AUC | SE | SP |
| --- | --- | --- | --- | --- | --- |
| 0.3 | 0.8124 | 0.9712 | 0.9812 | 0.8432 | 0.9601 |
| 0.4 | 0.8187 | 0.9755 | 0.9846 | 0.8410 | 0.9732 |
| 0.5 | 0.8231 | 0.9790 | 0.9870 | 0.8366 | 0.9834 |
| 0.6 | 0.8180 | 0.9767 | 0.9854 | 0.8204 | 0.9807 |
| 0.7 | 0.8129 | 0.9735 | 0.9822 | 0.8083 | 0.9784 |

As shown, while some individual metrics (e.g., SE at δ = 0.3) are slightly higher, δ = 0.5 achieves the best overall performance across all metrics. Overall, δ = 0.5 provides a meaningful routing threshold:

  • If δ is too low, almost all pixels are considered uncertain, resulting in unnecessary refinement and a loss of gating sparsity.
  • If δ is too high, only a few pixels are routed, leaving the refinement module underutilized.

Thus, δ = 0.5 strikes a balance, activating residual experts in genuinely ambiguous regions while maintaining efficiency. We therefore adopt the fixed threshold δ = 0.5 for all five public datasets for consistency. We will include this explanation and the ablation results in the final version.

Q3&W3 “There is no study on the hyperparameters used in the gating module and residual experts, such as normalized routing weights {w_j}. A hyperparameter study would strengthen the contribution”:

Thank you for the helpful suggestion. We address the reviewer's concerns as follows:

1. Routing weights {w_j} are dynamically learned.

The weights {w_j}_{j=1}^{J} are not manually defined hyperparameters. They are predicted dynamically by the Gating module g_θ, which outputs both the uncertainty mask m ∈ [0, 1]^{H×W} and the expert routing weights {w_j}, based on the input x and the coarse prediction ŷ_SAM.

2. Softmax normalization ensures stability.

To enable robust and balanced routing, we apply softmax normalization to the predicted weights across experts at each spatial location, so that ∑_{j=1}^{J} w_j = 1. This prevents expert collapse and allows smooth specialization without hard routing decisions.
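As a minimal illustration of this normalization (our own sketch with hypothetical logit values, not the released implementation), the expert logits at one spatial location are mapped to weights that sum to 1:

```python
import math

def normalize_routing(logits):
    """Softmax over the J expert logits at a single spatial location."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for J = 4 residual experts at one pixel.
w = normalize_routing([2.0, 1.0, 0.1, -1.0])
```

Because the weights are soft rather than one-hot, several experts can contribute at a location, which is what avoids hard routing decisions and expert collapse.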

3. Explanation of the indicator function \mathbb{I}(\cdot).

The Gating module is trained using a binary supervision mask g_t defined as:

g_t(i) = \mathbb{I}\left(|\hat{y}_{\text{SAM}}(i) - y(i)| > \delta\right)

where \mathbb{I}(\cdot) denotes the standard indicator function, returning 1 if the condition holds and 0 otherwise. This provides supervision for pixels with high prediction error, helping the module identify uncertain regions.
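The supervision mask g_t reduces to a simple per-pixel comparison; a minimal sketch (our own illustrative code, with the spatial dimensions flattened and hypothetical values) is:

```python
def gating_target(pred_sam, gt, delta=0.5):
    """Binary supervision mask g_t: 1 where SAM's absolute prediction
    error exceeds delta, else 0 (per pixel)."""
    return [1.0 if abs(p - y) > delta else 0.0 for p, y in zip(pred_sam, gt)]

# SAM probabilities vs. binary ground truth (hypothetical values):
# errors are 0.9, 0.2, 0.4, so only the first pixel exceeds delta = 0.5.
mask = gating_target([0.9, 0.2, 0.6], [0.0, 0.0, 1.0])  # -> [1.0, 0.0, 0.0]
```

Only pixels flagged by this mask supervise the gating decision, so the Residual Experts are steered toward regions where the coarse SAM prediction is genuinely wrong.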

4. On the number of residual experts J.

As shown in Tab.3 of our manuscript, we performed an ablation study with J ∈ {1, 2, 4, 6}. While J = 6 achieved the highest accuracy, we chose J = 4 in the final model to strike a better trade-off between segmentation performance and model complexity. This reduces the number of parameters and the computational cost while still maintaining strong performance.

Comment

After reviewing the rebuttal by the authors and other reviewers' comments, I keep my rating. The authors have addressed most of my concerns, and the SAM-based proposed method would be helpful in vessel segmentation masks of various medical images.

Comment

We thank the reviewer for their time and constructive feedback. We are pleased to hear that most of the reviewer’s concerns have been addressed and that they recognize the potential of our SAM-based method for vessel segmentation across various medical imaging modalities. We appreciate the reviewer’s consideration.

Final Decision

The paper introduces FineSAM++, a structure-aware, sparse Mixture-of-Experts add-on for SAM that routes only uncertain, topology-critical regions (e.g., fine vessels) to lightweight Residual Experts for refinement, leaving confident regions untouched. The paper demonstrates a clear, substantive improvement on an important application with a simple, interpretable mechanism that is broadly useful for adapting foundation models in clinical settings. All reviewers gave positive ratings, and the meta-reviewer agrees with their final recommendation.