PaperHub
Rating: 7.8/10 (Poster)
4 reviewers (min 4, max 5, std 0.4); individual ratings: 5, 4, 5, 5
Confidence: 4.3
Novelty: 2.8 · Quality: 3.5 · Clarity: 3.3 · Significance: 3.5
NeurIPS 2025

GaRA-SAM: Robustifying Segment Anything Model with Gated-Rank Adaptation

OpenReview | PDF
Submitted: 2025-04-09 · Updated: 2025-10-29

Abstract

Keywords
Robustness · Segment Anything Model · LoRA

Reviews and Discussion

Review (Rating: 5)

The paper introduces GaRA-SAM, a novel approach to enhancing the robustness of the Segment Anything Model (SAM) under diverse input degradations such as noise, blur, and adverse weather conditions. The core innovation is the Gated-Rank Adaptation (GaRA) module, which integrates lightweight, input-adaptive adapters into SAM's frozen architecture. Unlike prior methods such as RobustSAM, which require paired clean-corrupted images for training, GaRA dynamically adjusts the effective rank of its adapters via a hierarchical gating mechanism. This enables fine-grained adaptation to specific corruptions while preserving SAM's zero-shot generalization capabilities.
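To make the mechanism concrete, below is a minimal sketch of an input-gated low-rank adapter in PyTorch. It is illustrative only, not the authors' implementation; the class name `GatedRankAdapter`, the pooling-based gate, and all shapes are assumptions.

```python
# Minimal sketch of an input-gated low-rank adapter (illustrative only, not the
# authors' code); `GatedRankAdapter`, shapes, and the pooling-based gate are assumptions.
import torch
import torch.nn as nn

class GatedRankAdapter(nn.Module):
    def __init__(self, dim: int, max_rank: int = 16):
        super().__init__()
        self.down = nn.Parameter(torch.randn(max_rank, dim) * 0.01)  # rank-1 "A" rows
        self.up = nn.Parameter(torch.zeros(dim, max_rank))           # rank-1 "B" columns
        self.gate = nn.Sequential(nn.Linear(dim, max_rank), nn.Sigmoid())  # per-input gate

    def forward(self, x: torch.Tensor, frozen_out: torch.Tensor) -> torch.Tensor:
        # x, frozen_out: (batch, tokens, dim); frozen_out comes from the frozen SAM layer.
        g = self.gate(x.mean(dim=1))                  # (batch, max_rank): soft on/off per component
        delta = (x @ self.down.t()) * g.unsqueeze(1)  # keep only the selected rank-1 components
        return frozen_out + delta @ self.up.t()       # add the input-adaptive low-rank update
```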

Strengths and Weaknesses

Strengths:

Comprehensive benchmarking: GaRA-SAM is evaluated across six datasets, including synthetic (LVIS, MSRA-10K) and real-world (BDD+LIS, ACDC) corruptions. It consistently outperforms baselines like RobustSAM, HQ-SAM, and restoration-based methods (AirNet, URIE) by 3.4–21.3% IoU.

Mechanistic validation: Ablation studies dissect the impact of hierarchical gating (removing it reduces performance by 3.6% IoU) and rank-space selection (unified rank spaces underperform by 1.4–4.3% IoU).

Real-world applicability: Training on unpaired real data (BDD+LIS) closes the sim-to-real gap, achieving 89.7% IoU on BDD+LIS and 88.8% IoU on ACDC.

Weaknesses:

Statistical reporting: Results are reported from single runs without error bars or significance tests, limiting insights into variance.

Questions

  1. While GaRA-SAM demonstrates clear improvements over RobustSAM across synthetic and real-world benchmarks, the paper does not address whether the performance gaps stem solely from input-adaptive rank modulation or other architectural differences (e.g., hierarchical gating vs. RobustSAM’s anti-degradation modules). A controlled ablation isolating the impact of dynamic rank selection versus gating mechanisms would strengthen causal claims.

  2. [minor] GaRA-SAM’s training on BDD+LIS (autonomous driving data) raises questions about its applicability to non-driving scenarios (e.g., medical or aerial imagery). The paper does not validate cross-domain robustness beyond adverse-weather use cases. The reviewer understands the difficulty of the experiments.

Limitations

See Questions.

Paper Formatting Concerns

N/A

Author Response

We appreciate your insightful feedback and constructive suggestions, which help improve our paper substantially. We will address all the comments and include additional experiments in the revision. Please find our responses to the comments below. All ablation studies were conducted using the ViT-B backbone and evaluated on LVIS with point prompts.


Weakness 1. Statistical reporting

Thank you for the comment. We agree that reporting statistical variation is important for ensuring reliability. Accordingly, we repeated the main experiments three times with different random seeds using the ViT-B backbone and evaluated the models using point prompts. As shown in the table below, our method consistently achieves strong performance with low standard deviation across four datasets.

| Dataset | Degraded IoU | Degraded PA | Degraded Dice | Clear IoU | Clear PA | Clear Dice |
|---|:---:|:---:|:---:|:---:|:---:|:---:|
| LVIS | 76.7 ± 0.3 | 94.2 ± 0.2 | 85.5 ± 0.2 | 82.0 ± 0.8 | 95.4 ± 0.1 | 89.1 ± 0.6 |
| MSRA | 89.3 ± 0.2 | 97.4 ± 0.1 | 93.9 ± 0.2 | 91.3 ± 0.1 | 97.9 ± 0.1 | 95.1 ± 0.1 |
| COCO | 76.9 ± 0.5 | 94.4 ± 0.2 | 85.6 ± 0.3 | 80.5 ± 1.0 | 95.2 ± 0.3 | 88.0 ± 0.7 |
| SN | 77.3 ± 0.5 | 99.5 ± 0.0 | 86.2 ± 0.4 | 82.6 ± 0.4 | 99.7 ± 0.0 | 89.9 ± 0.3 |

We will revise all tables to reflect the statistical results.


Question 1. Impact of dynamic (rank-1) selection and rank space gating

Thank you for the insightful suggestion. To isolate the contributions of dynamic (rank-1) component selection and rank space gating, we conducted the following ablation studies.

1. Isolating the effect of dynamic (rank-1) component selection

We tested variants using only dynamic (rank-1) component selection within a unified rank space. Enabling the dynamic selection improves performance when using a unified rank space with a small maximum rank (i.e., 16), highlighting the benefit of adaptively choosing components per input. However, using a unified rank space with a large maximum rank (i.e., 256) did not lead to better results. We attribute this to the limited diversity in active components when using a unified space, which can restrict the ability to adjust representational capacity according to each degraded input, thus hindering input-adaptive modulation. This motivated our design choice to split the rank space, which led to improved performance.

| Variant | (Rank-1) Component Selection | Rank Space Gating | IoU |
|---|:---:|:---:|:---:|
| w/o (rank-1) component selection (max rank 16) | ✗ | ✗ | 75.0 |
| w/ (rank-1) component selection (max rank 16) | ✓ | ✗ | 76.2 |
| w/o (rank-1) component selection (max rank 256) | ✗ | ✗ | 73.4 |
| w/ (rank-1) component selection (max rank 256) | ✓ | ✗ | 73.3 |
| GaRA-SAM (max rank 256) | ✓ | ✓ | 77.0 |

2. Isolating the effect of rank space gating

We also experimented with rank space gating alone with fixed rank settings (e.g., rank 16 or rank 128), disabling dynamic (rank-1) component selection. The results demonstrate that rank space gating alone significantly improves performance. We believe this is because it enables each rank space to specialize in handling different degraded inputs, allowing the gating to select the most suitable representational capacity for each. Furthermore, enabling (rank-1) component selection on top further boosts the performance, highlighting its complementary benefit.

| Variant | (Rank-1) Component Selection | Rank Space Gating | IoU |
|---|:---:|:---:|:---:|
| w/o rank space gating | ✗ | ✗ | 73.4 |
| w/ rank space gating | ✗ | ✓ | 76.0 |
| GaRA-SAM | ✓ | ✓ | 77.0 |

These findings suggest that both dynamic (rank-1) component selection and rank space gating mechanisms are crucial and complement each other. We will include these analyses in our revised manuscript.


Question 2. Generalization beyond driving scenarios

We appreciate your important question regarding cross-domain applicability. While BDD consists of autonomous driving scenes, LIS contains diverse low-light scenes across both indoor and outdoor environments. We will clarify this in the revision.

To further evaluate the generalization capability of GaRA-SAM beyond driving scenarios, we additionally tested it on the ISIC-2016 medical segmentation dataset [1]. Using box prompts, GaRA-SAM showed notable improvements over SAM:

| Model | IoU | Dice | AP |
|---|:---:|:---:|:---:|
| SAM | 70.6 | 80.7 | 76.8 |
| GaRA-SAM | 78.7 | 86.2 | 81.7 |

These results indicate that GaRA-SAM maintains strong performance even in domains with very different visual characteristics. We will include this result in the revision.

[1] Skin lesion analysis toward melanoma detection: A challenge at the 2017 International symposium on biomedical imaging (ISBI), hosted by the international skin imaging collaboration (ISIC), arXiv 2016

Review (Rating: 4)

The paper addresses the robustness problem of the Segment Anything Model (SAM) under various image degradations. The authors propose Gated-Rank Adaptation (GaRA), an adaptive and lightweight modification based on Low-Rank Adaptation (LoRA). GaRA dynamically adjusts the effective rank of the adapters integrated into the frozen layers of SAM according to the specific degradation type present in the input image. This gating mechanism adaptively selects between lower-rank and higher-rank components, enabling fine-grained, input-aware robustification without requiring paired clean-degraded training images. Experimental evaluations demonstrate state-of-the-art results on multiple robust segmentation benchmarks.

Strengths and Weaknesses

Strengths:

  1. The writing is clear and the presentation is well-structured.

  2. The proposed method is intuitive and straightforward.

  3. Experimental evaluation is comprehensive, and the method surpasses baseline approaches across multiple datasets.

Weaknesses:

  1. My primary concern relates to methodological innovation. The proposed GaRA method employs a gating mechanism to dynamically activate low-rank or high-rank LoRA adapters. However, similar adaptive rank-selection concepts have already been extensively explored in related work such as NoRA, AdaLoRA, and ElaLoRA, raising questions regarding the novelty of this approach.

  2. The sequence of tables is somewhat confusing: Table 2 is presented first in Section 4.2, while Table 1 is referenced later in Section 4.3. This inconsistency affects readability and clarity.

  3. There seems to be an issue with the third row of Figure 4, as the input image appears missing or unclear. Clarification or correction is needed.

Questions

  1. Would fine-tuning the query, key, and value parameters directly with a low learning rate or replacing LoRA with traditional adapters [a] also enhance the model’s robustness similarly?

[a] Houlsby et al. Parameter-efficient transfer learning for NLP. ICML 2019.

  2. If LoRA were applied to fine-tune the decoder instead of the encoder, would this also result in improved robustness?

  3. Can GaRA be merged into the SAM weights to reduce model size further? The paper mentions (line 100) that GaRA selects and combines appropriate (rank-1) components dynamically during inference, suggesting it may not be mergeable into the original SAM model weights like other static LoRA methods.

  4. Will the code and trained models be fully open-sourced?

Limitations

See Weaknesses and Questions.

Final Justification

All of my concerns have been addressed during the rebuttal period. Considering the paper’s comprehensive experimental evaluation, I am assigning a final score of 4 (borderline accept).

Paper Formatting Concerns

No Paper Formatting Concerns

Author Response

We appreciate your insightful feedback and constructive suggestions, which help improve our paper substantially. We will address all the comments and include additional experiments in the revision. Please find our responses to the comments below. All ablation studies were conducted using the ViT-B backbone and evaluated on LVIS with point prompts.


Weakness 1. Methodological innovation

Thank you for the comment. We would like to clarify that GaRA introduces a key difference from existing methods such as NoRA [1], AdaLoRA [2], and ElaLoRA [3] by enabling input-adaptive rank modulation at inference time.

Specifically, NoRA employs a nested LoRA structure to improve parameter efficiency but uses a fixed rank throughout both training and inference. AdaLoRA and ElaLoRA adjust ranks during training to optimize parameter-budget allocation, but their rank configurations remain fixed at inference time and do not adapt according to the input.

In contrast, GaRA-SAM is motivated by the observation in Figure 2(b) of the paper that optimal LoRA ranks vary significantly across inputs. GaRA is specifically designed to leverage this variability by dynamically selecting and combining (rank-1) components per input. This enables effective input-adaptive rank modulation even at inference time.

We will revise the paper to more clearly highlight this distinction.

[1] Nora: Nested low-rank adaptation for efficient fine-tuning large models, arXiv 2024.

[2] Adaptive budget allocation for parameter-efficient fine-tuning, ICLR 2023.

[3] Elalora: Elastic & learnable low-rank adaptation for efficient model fine-tuning, arXiv 2025.


Weakness 2. Table order confusion

We apologize for the confusion. We will reorganize the tables in the revision to match the flow of sections, improving overall readability.


Weakness 3. Visibility of Figure 4

The third row in Figure 4 shows a real nighttime image under extreme low-light conditions. The RGB pixel intensities range from 0 to 34 with a mean of 4.49, indicating severely limited illumination in the scene. To enhance visibility, we will include a brightness-enhanced version and clarify this explanation in the revision.


Question 1-1. Fine-tuning query, key, and value projections

We conducted additional experiments directly fine-tuning query, key, and value projection layers. Although this fine-tuning strategy improved robustness compared to the pretrained SAM, it significantly underperformed compared to GaRA-SAM, as shown in the table below.

| Method | IoU |
|---|:---:|
| QKV Fine-tuning | 73.1 |
| GaRA-SAM | 77.0 |

Question 1-2. Fine-tuning the traditional adapter

We fine-tuned the traditional adapter [1] as an alternative to LoRA. The results below showed limited robustness improvement compared to LoRA. We suspect this is due to the nonlinear bottleneck of the traditional adapter, which can distort features and lead to unstable adaptation under corruption. In contrast, LoRA applies linear updates directly to the weights, enabling more stable adaptation.

| Method | IoU |
|---|:---:|
| Traditional adapter [1] | 63.0 |
| LoRA | 73.4 |

[1] Parameter-efficient transfer learning for NLP, PMLR 2019.


Question 2. Applying LoRA to the decoder

We applied LoRA to the decoder instead of the encoder, but this resulted in a significant performance drop. We attribute this to the encoder continuing to generate degraded representations under corrupted inputs, which causes a mismatch with the adapted decoder parameters. In contrast, applying LoRA to the encoder directly mitigates feature degradation, leading to substantially better performance:

| Method | IoU |
|---|:---:|
| LoRA on decoder | 29.6 |
| LoRA on encoder | 73.4 |

Question 3. Mergeability of GaRA into the weights of SAM

GaRA cannot be merged statically into SAM since its gating decisions are dynamic and input-dependent, i.e., different (rank-1) components are activated for different inputs during inference. Nevertheless, GaRA remains efficient, as it introduces only minimal overhead while keeping the backbone frozen. We will clarify this point in the revision.
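To make this point concrete, the following sketch (generic LoRA notation, not code from the paper) contrasts a statically mergeable LoRA update with an input-gated update whose effective weight depends on the gate output g(x) and therefore cannot be folded into SAM's weights once and for all.

```python
# Why static LoRA merges but input-gated adaptation does not (illustrative sketch).
import torch

d, r = 64, 8
W = torch.randn(d, d)                        # frozen SAM weight
B, A = torch.randn(d, r), torch.randn(r, d)

# Static LoRA: one update applies to every input, so it can be folded offline.
W_merged = W + B @ A                         # y = W_merged @ x for all x

# Gated adaptation: the effective weight W + B @ diag(g(x)) @ A changes per input,
# so there is no single merged matrix; the gate must run at inference time.
def gated_forward(x: torch.Tensor, g: torch.Tensor) -> torch.Tensor:
    # g = g(x) in [0, 1]^r is produced by the gating module for this specific input
    return W @ x + B @ (g * (A @ x))
```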


Question 4. Code availability

We fully agree that open-sourcing is essential for reproducibility and future research. We will release the full implementation, including model weights and training/evaluation scripts.

Comment

Thank you for the rebuttal, which has resolved all of my concerns.

One additional suggestion: When you open-source GaRA, wrapping its core functionality into a simple, one-line function call would make it convenient for others to use and build upon.

Given the quality of the work and the response, I will raise my score to 4.

Comment

We are glad our response has addressed all of your concerns! In the official release, we will include a one-line function call and clear documentation to support easy adoption. Thank you again for your constructive feedback and support of our work.

Review (Rating: 5)

The paper proposes to use gated-rank adaptation to robustify SAM (segment anything model) against image corruptions, while maintaining the inherent generalization capabilities of such models. The core technique proposed is to freeze the parameters of a pre-trained SAM model and to introduce gated-rank adaptation modules that learn to adjust the effective rank of its weight matrix based on the current input. The authors show that this technique maintains the generalization capabilities of the model while significantly improving robustness to several synthetic and real-world image corruptions — especially when trained directly on real-world corrupted data.

Strengths and Weaknesses

Strengths

  • Robustifying SAM models is crucial for real-world applications, as unexpected shifts in the data distribution due to image corruptions from manifold sources can significantly degrade segmentation performance. For example, for self-driving vehicles, adverse weather conditions, low light, sensor degradation, dirt, or lighting changes can significantly impact segmentation performance. The paper addresses this crucial problem.
  • The proposed solution is lightweight, allowing existing pre-trained SAM models to be reused. This saves significant cost and avoids retraining from scratch.
  • The proposed method can be trained on real-world corrupted data without paired clean examples. Compared to prior work, this avoids the synthetic-to-real domain shift, leading to significantly improved robustness.
  • The paper is well presented and follows a logical structure. It includes ablations and experimental evidence for assumptions and intuitions presented.

Weaknesses

  • A discussion of the reasons for this particular design of the GaRA modules would improve the paper. Why was this split between low and high rank adapters chosen? Why not just one of them? Or a more general approach that allows the gate to choose from any dimension? While the technical implementation is clear from section 3.3, the reasoning, as well as the advantages and drawbacks remain elusive.
  • The paper does not properly address its limitations. L294 has two sentences on it, but only raises a single point — the manual engineering of the GaRAs. It would be helpful to go into more details on why the authors chose this design compared to more general and flexible options. Additionally, other limitations should be addressed such as no comparison to training a SAM model with corrupted data from scratch, only computing single runs without std or statistical significance values, etc.
  • Some of the tables and figures are very crammed, which makes reading them difficult. For example, Table 6, 7, and 8, and Figures 5 and 6 barely have any separation.
  • The authors provide no code to evaluate the implementation or build upon their work.

Questions

  • Why do you use two GaRA gates for high- and low-dimensional spaces — instead of a single one that can freely adapt any rank?
  • L217: why r_L = 16 and r_H = 256? How were these values determined?
  • L222: can you please clarify what you mean by “seen datasets”? I assume this means performance of a held-out test set from the same data distribution as the training data — as opposed to testing on a distribution from a different dataset?
  • Table 8: Can you please provide more details on the performance evaluation? What types of GPUs do you refer to? How many are used for inference? How many fixed (i.e., non-learnable) parameters are there? And how much training time is required for fine-tuning with GaRA-SAM vs. RobustSAM in total (e.g., in GPU-hours)?

Limitations

The paper does not properly address its limitations. L294 has two sentences on it, but only raises a single point — the manual engineering of the GaRAs. It would be helpful to go into more details on why the authors chose this design compared to more general and flexible options. Additionally, other limitations should be addressed such as no comparison to training a SAM model with corrupted data from scratch, only computing single runs without std or statistical significance values, etc.

Final Justification

The authors addressed most of my concerns. The only remaining point is that the method's limitations could be addressed more thoroughly. However, this is a minor concern, outweighed by the method's good performance and evaluation. The other reviewers share my positive evaluation and do not raise any additional concerns. I therefore keep my recommendation to accept.

Paper Formatting Concerns

Space around tables and figures is sometimes too tight, making the entire layout crammed.

Author Response

We appreciate your insightful feedback and constructive suggestions, which help improve our paper substantially. We will address all the comments and include additional experiments in the revision. Please find our responses to the comments below. All ablation studies were conducted using the ViT-B backbone and evaluated on LVIS with point prompts.


Weakness 1 & Question 1. Justification for rank space separation

We sincerely appreciate this insightful comment. The rank space separation is motivated by our empirical observations that, in a unified rank space, the model exhibits limited diversity in (rank-1) component activation across different degraded inputs. Specifically, the model tends to activate components within a narrow range (133-190; std: 7.7), limiting its ability to adjust representational capacity according to each degraded input. This is potentially suboptimal because different degraded inputs require different representational capacities (Figure 2(b) of the paper).

By separating the rank space into lower-rank and higher-rank regions, we allow the model to activate a more diverse range of components (5-181; std: 75.8), yielding significant performance improvement.

| Variant | Rank Space Separation | (Rank-1) Component Selection | IoU | # of Active (Rank-1) Components (Mean ± Std) | Range |
|---|:---:|:---:|:---:|:---:|:---:|
| Unified Rank Space | ✗ | ✓ | 73.3 | 161.8 ± 7.7 | 133 ~ 190 |
| Separated Rank Space (GaRA-SAM) | ✓ | ✓ | 77.0 | 85.7 ± 75.8 | 5 ~ 181 |

We conjecture that this is because separating the rank space into two exclusive sets, each specialized for degraded inputs demanding different representational capacities (Figure 2(b) of the paper), helps mitigate potential conflicts during (rank-1) component selection and promotes more diverse usage of components.

We will clarify this motivation and supporting evidence in the revised manuscript.
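For illustration, a sketch of how such a two-level design could look is given below; the module name `TwoLevelGaRA`, the pooling-based gates, and the soft gating are assumptions for exposition, not the released implementation.

```python
# Sketch of hierarchical gating over separated rank spaces (illustrative only).
import torch
import torch.nn as nn

class TwoLevelGaRA(nn.Module):
    def __init__(self, dim: int, r_low: int = 16, r_high: int = 256):
        super().__init__()
        self.A_low = nn.Parameter(torch.randn(r_low, dim) * 0.01)
        self.B_low = nn.Parameter(torch.zeros(dim, r_low))
        self.A_high = nn.Parameter(torch.randn(r_high, dim) * 0.01)
        self.B_high = nn.Parameter(torch.zeros(dim, r_high))
        self.space_gate = nn.Linear(dim, 2)              # level 1: lower- vs higher-rank space
        self.comp_gate = nn.Linear(dim, r_low + r_high)  # level 2: rank-1 components inside it
        self.r_low = r_low

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim); returns the additive low-rank update.
        ctx = x.mean(dim=1)                                  # per-image context for gating
        s = torch.softmax(self.space_gate(ctx), dim=-1)      # (batch, 2)
        g = torch.sigmoid(self.comp_gate(ctx))               # (batch, r_low + r_high)
        g_low, g_high = g[:, :self.r_low], g[:, self.r_low:]
        d_low = ((x @ self.A_low.t()) * g_low.unsqueeze(1)) @ self.B_low.t()
        d_high = ((x @ self.A_high.t()) * g_high.unsqueeze(1)) @ self.B_high.t()
        return s[:, 0, None, None] * d_low + s[:, 1, None, None] * d_high
```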


Weakness 2-1. Expanded discussion on limitations

We sincerely appreciate the thoughtful comment. We will expand our limitations section, clearly detailing the manual design choices and their potential constraints. We plan to revise the section as follows:

Limitations. While GaRA adaptively determines the appropriate rank for each rank space, several design choices, such as the split between lower- and higher-rank spaces, the predefined maximum ranks, and the fixed 2-level gating hierarchy, are set manually and thus limit flexibility. More generalized automated designs are conceptually desirable but practically challenging due to unstable optimization. Developing automated methods remains an important future direction.


Weakness 2-2. Addressing experimental limitations

Thank you for raising these important concerns. We conducted additional experiments to address the experimental limitations.

  1. Training SAM from scratch: We trained SAM from scratch using corrupted data, which resulted in significantly lower performance compared to GaRA-SAM. This performance gap highlights the importance of leveraging the strong generalization ability of SAM, which originates from large-scale pretraining. We will include these results in the revised manuscript.

| Model | IoU |
|---|:---:|
| SAM trained from scratch | 57.3 |
| GaRA-SAM | 77.0 |

  2. Statistical significance of results: We repeated the main experiments three times with different random seeds using the ViT-B backbone and evaluated the models using point prompts. As shown in the table below, our method consistently achieves strong performance with low standard deviation across four datasets. We will revise all tables to reflect the statistical results.

| Dataset | Degraded IoU | Degraded PA | Degraded Dice | Clear IoU | Clear PA | Clear Dice |
|---|:---:|:---:|:---:|:---:|:---:|:---:|
| LVIS | 76.7 ± 0.3 | 94.2 ± 0.2 | 85.5 ± 0.2 | 82.0 ± 0.8 | 95.4 ± 0.1 | 89.1 ± 0.6 |
| MSRA | 89.3 ± 0.2 | 97.4 ± 0.1 | 93.9 ± 0.2 | 91.3 ± 0.1 | 97.9 ± 0.1 | 95.1 ± 0.1 |
| COCO | 76.9 ± 0.5 | 94.4 ± 0.2 | 85.6 ± 0.3 | 80.5 ± 1.0 | 95.2 ± 0.3 | 88.0 ± 0.7 |
| SN | 77.3 ± 0.5 | 99.5 ± 0.0 | 86.2 ± 0.4 | 82.6 ± 0.4 | 99.7 ± 0.0 | 89.9 ± 0.3 |

Weakness 3. Figure and table readability

Thank you for your helpful comment. We will improve readability in the revision by adjusting figure spacing, table layouts, and font sizes.


Weakness 4. Code availability

We fully agree that open-sourcing is essential for reproducibility and future research. We will release the full implementation, including model weights and training/evaluation scripts.


Question 2. Why r_L = 16 and r_H = 256?

These values were selected based on ablation studies across various rank configurations. The combination of {16, 256} consistently shows the best performance:

| Rank Configuration | IoU |
|---|:---:|
| {8, 256} | 76.3 |
| {16, 128} | 76.9 |
| {16, 256} (Ours) | 77.0 |
| {16, 512} | 76.6 |
| {16, 1024} | 76.0 |

We will include these results in the revised version.


Question 3. Clarification on “seen datasets”

Thank you for pointing this out. The term "seen datasets" refers to test splits from the same distributions as the training data. We will clarify this explanation in the revision.


Question 4. Details on performance evaluation

All experiments were conducted using four NVIDIA RTX A6000 GPUs for training and one A6000 GPU for inference. The number of non-learnable parameters was consistent across methods and amounted to 1,250M parameters. Regarding training time, GaRA-SAM required 25 hours, while RobustSAM required 33 hours under identical settings.

Comment

We are glad our response has addressed your concerns! We sincerely appreciate your positive evaluation and will include the additional results and discussions in the revision. Thank you again for your constructive feedback and support of our work.

Comment

Thank you for the clarifications and additional ablations. They address most of my concerns. The other reviewers seem to share my positive evaluation - I will keep my recommendation to accept.

Review (Rating: 5)

The paper "GaRA-SAM: Robustifying Segment Anything Model with Gated-Rank Adaptation" proposes GaRA to enhance SAM's robustness against input degradations. GaRA introduces lightweight adapters into SAM's intermediate layers, dynamically adjusting the effective rank of weight matrices based on input via a learned gating module. GaRA-SAM achieves state-of-the-art performance on robust segmentation benchmarks, significantly surpassing previous methods.

Strengths and Weaknesses

Strengths:

  1. Innovative methodology: The introduction of GaRA, which dynamically adjusts the rank of adapters based on the input, is a novel approach to enhancing model robustness. It effectively addresses the limitations of fixed-rank LoRA and of existing robustification techniques such as attaching an image restoration module to the front of SAM.

  2. State-of-the-art results: GaRA-SAM achieves significant performance improvements on multiple robust segmentation benchmarks, particularly on real-world corrupted datasets such as ACDC. This demonstrates its practical effectiveness and superiority over previous methods.

  3. Practicality and efficiency: GaRA-SAM maintains parameter efficiency and adheres to standard training protocols, making it easy to integrate into existing workflows. It also enables training on real-world degraded images without requiring clean references, which is a major advantage for practical applications.

  4. GaRA is easy to understand and follow.

Weaknesses:

  1. While achieving good performance, GaRA, as a robust image encoder, needs an extra 60% of GPU memory (from 3.36 GB for the original SAM to 5.39 GB for GaRA), which may be a heavy load for resource-constrained devices.

Questions

  1. GaRA needs an extra 60% of GPU memory (from 3.36 GB for the original SAM to 5.39 GB for GaRA) and 343M extra LoRA parameters, which may be a heavy load for resource-constrained devices. I wonder whether GaRA would outperform a bigger SAM with 1250M+343M parameters co-trained on clean pairs and degraded pairs, although I understand that this strategy holds significant practical value due to its flexibility.

  2. How can we supervise the training process to ensure GaRA chooses the best gating path?

  3. Regarding the rank space selection ablation in Table 7: why choose ranks 16 and 256 only? It is hoped that more results without rank gating can be provided to fully demonstrate the necessity of rank space selection. Similarly, why are only two levels used for hierarchical gating, and why not go deeper with more levels?

Limitations

The article mentions the potential drawbacks of the maximum rank number.

Final Justification

Considering the novelty and conciseness of the GaRA-SAM method, along with the relatively comprehensive experimental validation, its excellent performance, and the additional key experiments provided in the subsequent rebuttal, it is evident that GaRA is a high-quality article. Therefore, I am inclined to accept this paper for formal publication.

Paper Formatting Concerns

No obvious errors were found

Author Response

We appreciate your insightful feedback and constructive suggestions, which help improve our paper substantially. We will address all the comments and include additional experiments in the revision. Please find our responses to the comments below. All ablation studies were conducted using the ViT-B backbone and evaluated on LVIS with point prompts.


Weakness 1 & Question 1-1. Memory overhead of GaRA-SAM

Thank you for highlighting this important point. Memory efficiency is indeed an important direction for future research, particularly for deployment on edge devices. Although GaRA-SAM introduces 2 GB of additional memory for inference, we note that its total memory usage (5.39 GB) is comparable to RobustSAM (5.41 GB). Additionally, this footprint remains within the capability of widely used modern edge devices such as Jetson Xavier NX and AGX Orin, making GaRA-SAM practical for real-world applications.


Question 1-2. Comparison with a larger SAM

Thank you for the constructive suggestion. To assess this, we designed a larger variant of SAM by inserting two additional 4-layer MLP blocks immediately after the image encoder and before the mask decoder, matching GaRA-SAM’s parameter count. Despite its increased capacity and full fine-tuning, this larger model showed inferior performance compared to GaRA-SAM. We attribute this to the loss of the strong generalization capability of the pretrained SAM due to the full fine-tuning. On the other hand, GaRA-SAM effectively preserves the capability by learning the lightweight, input-adaptive modules only while freezing the pretrained SAM.

| Model | IoU |
|---|:---:|
| Fine-tuned Larger SAM | 72.9 |
| GaRA-SAM | 77.0 |

This comparative analysis will be included in the revision.


Question 2. How is the gating path supervised?

The gating modules of GaRA-SAM are not directly supervised, but they are trained indirectly through the segmentation loss alone, following established practices in prior work [1,2]. In other words, no direct supervision is provided to explicitly guide which gating paths the modules should take. To empirically demonstrate that the gating modules are learned to provide desirable gating paths even with such indirect supervision, we compared GaRA-SAM and its variant with random gating. The results reported below indicate that GaRA-SAM significantly outperforms the random gating variant, supporting the effectiveness of our training strategy.

| Gating Strategy | IoU |
|---|:---:|
| GaRA-SAM w/ random gating | 73.9 |
| GaRA-SAM | 77.0 |

These findings will be elaborated on in the revised manuscript.

[1] Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer, ICLR 2017

[2] Categorical Reparameterization with Gumbel-Softmax, ICLR 2017
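To illustrate the indirect supervision described above, the toy sketch below shows how a Gumbel-Softmax relaxation [2] lets a discrete gating decision receive gradients from the downstream loss alone; all names and values are placeholders, not the authors' training code.

```python
# Toy sketch: a discrete gate trained only through the downstream loss via Gumbel-Softmax.
import torch
import torch.nn.functional as F

logits = torch.randn(4, 2, requires_grad=True)        # placeholder per-image rank-space logits
gate = F.gumbel_softmax(logits, tau=1.0, hard=True)   # (4, 2): one-hot forward, soft backward
adapter_out = gate[:, 0] * 1.0 + gate[:, 1] * 2.0     # stand-in for the gated adapter output
loss = adapter_out.mean()                             # stand-in for the segmentation loss
loss.backward()                                       # gradients reach `logits`; no gate labels needed
print(logits.grad is not None)                        # True
```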


Question 3-1. Why are ranks 16 and 256 used?

We conducted comprehensive ablation studies on various rank configurations. As summarized below, the combination of {16, 256} consistently achieves the best performance.

| Rank Configuration | IoU |
|---|:---:|
| {8, 256} | 76.3 |
| {16, 128} | 76.9 |
| {16, 256} (Ours) | 77.0 |
| {16, 512} | 76.6 |
| {16, 1024} | 76.0 |

Thank you for the comment. We will include these results in the revised manuscript.


Question 3-2. Isolating the effect of rank space selection

Thank you for the valuable suggestion. To isolate the effect of rank space selection, we experimented with rank space selection alone with fixed rank settings (i.e., rank 16 or rank 128), disabling (rank-1) component selection. The results demonstrate that rank space separation alone significantly improves performance. We believe this is because it enables each rank space to specialize in handling different degraded inputs, allowing the gating to select the most suitable representational capacity for each. Furthermore, enabling (rank-1) component selection on top further boosts the performance, highlighting its complementary benefit.

| Variant | Rank Space Selection | (Rank-1) Component Selection | IoU |
|---|:---:|:---:|:---:|
| w/o Rank Space Selection | ✗ | ✗ | 73.4 |
| w/ Rank Space Selection | ✓ | ✗ | 76.0 |
| w/ Rank Space Selection + (rank-1) Component Selection (GaRA-SAM) | ✓ | ✓ | 77.0 |

These results will be included in the revision.


Question 3-3. Impact of the depth of hierarchical gating

Thank you for the suggestion. We conducted additional experiments to assess deeper hierarchical gating structures (i.e., 3-level and 4-level). Our findings reveal that increasing the depth introduces optimization challenges due to more complex gating decisions, ultimately degrading performance. Hence, we conclude that the proposed 2-level hierarchy offers both strong performance and architectural simplicity.

| Hierarchy Depth | IoU |
|---|:---:|
| 2-level (Ours) | 77.0 |
| 3-level | 76.4 |
| 4-level | 74.0 |

Comment

Most of the questions I raised were effectively addressed by the author through experiments, resolving the majority of my doubts. This also implies that the revised manuscript will undergo significant modifications. Some remaining issues may only be satisfactorily resolved in future research. Nevertheless, I still consider this article to be of good quality.

Comment

We are glad that our responses have addressed the majority of your concerns! We sincerely appreciate your positive evaluation and will include the additional results and discussions in the revision. Thank you again for your constructive feedback and support of our work.

Final Decision

The paper proposes GaRA-SAM, a method to improve the robustness of SAM (Segment Anything) when images are degraded. GaRA-SAM introduces a Gated Rank Adaptation module that adjusts the rank of an adapter dynamically based on the input. The reviewers initially praised the paper's strong results and practical relevance, but raised concerns about memory overhead, the rationale for the gating design, limited discussion of limitations, and unclear statistical reporting.

During the rebuttal period, the authors reported new ablations (e.g., rank configurations, gating depth), comparisons to larger SAMs and alternative adaptations, repeated runs with variance reporting, and promised to extend their discussion of limitations in the final version. They also committed to open-sourcing the code.

After rebuttal and discussion, the reviewers agree that most concerns were convincingly addressed, with only minor limitations left. Three reviewers (feAp, uBVp, Ljhg) recommend accept, highlighting GaRA-SAM’s strong empirical results, practicality, and significance despite some minor weaknesses (memory overhead, limitations discussion, statistical reporting). One reviewer (N9R9) remains a bit more cautious, rating it borderline accept, mainly due to concerns about novelty and presentation, but still acknowledging solid experiments and overall contribution.

Given the constructive interactions during the rebuttal phase and the generally very positive view after discussion, I think this paper makes a valuable contribution to the NeurIPS community and am in favor of accepting it.