InfoSAM: Fine-Tuning the Segment Anything Model from An Information-Theoretic Perspective
InfoSAM improves SAM's performance on specialized tasks by using an information-theoretic approach to distill and preserve domain-invariant knowledge during fine-tuning.
Abstract
Reviews and Discussion
The work identifies a problem with PEFT for SAM: the breakdown of domain-invariant relations encoded during pre-training. It proposes InfoSAM, a model that minimizes a lower bound on the mutual information between the encoder and the decoder during PEFT. It does so in a Rényi-entropy sense, without having to explicitly estimate the two distributions. The work is evaluated via a number of standard experiments.
Questions for Authors
Claims and Evidence
The idea that there exist domain-invariant relations, e.g., edge information, between the pre-training data (the teacher model) and the downstream task (the student model) is rather important. I could not find literature in the work that substantiates it, nor a similar mention at the beginning of the benchmark works. I may be willing to put it down to my own ignorance, but any science paper should make an effort to make its central premise traceable.
Methods and Evaluation Criteria
Both do. The entropy hack is sublime, and appropriate going by the literature. Evaluation is very out-of-the-box.
Theoretical Claims
Proofs are left to backing literature.
Experimental Design and Analysis
No surprises.
Supplementary Material
The supplementary material at the end; it plays a limited role.
Relation to Existing Literature
Read together with the following question.
Missing Important References
The claim that a loss of information happens needs to be buttressed.
Other Strengths and Weaknesses
Other Comments or Suggestions
We sincerely appreciate your thoughtful feedback. In the following sections, we will provide a detailed response to each of your comments.
Q1: Literature review of domain-invariant information.
The concept of domain-invariant information was first introduced in prior works on domain adaptive segmentation (DAS), which explored cross-domain invariant features such as edge and structural information (Hoffman et al., 2018). DAS aims to learn domain-invariant representations across multiple domains and follows two main approaches: (1) extraction and refinement of domain-invariant features, where methods like feature disentanglement (Chang et al., 2019) or analysis (Xu et al., 2022) decompose images into domain-invariant (e.g., shapes, edges) and domain-specific (e.g., textures, colors) components, aiming to enhance the former while suppressing the latter; (2) GAN-based domain-invariant feature generation, which employs adversarial training to align domains at different levels: image (Li et al., 2022), feature (Ma et al., 2024), and output (Huang et al., 2022). For example, GLGAN (Ma et al., 2024) integrates multi-scale global and local features to improve cross-domain transferability in remote sensing.
With the introduction of SAM, this domain-invariant concept has gained further attention. SAM's large-scale segmentation pretraining on 11 million images inherently encodes cross-domain commonalities, enabling strong zero-shot generalization. Recent works leverage these universal visual patterns for downstream tasks (Li et al., 2024; Peng et al., 2024). However, these methods rely on complex designs or external data to learn representations. In contrast, we focus on preserving the domain-invariant information in pre-trained SAM for fine-tuning.
Thank you again for the thoughtful review; we will add this discussion to the related work section of the revised paper.
Q2: Loss and preservation of information.
Many recent studies leverage SAM's pretrained capabilities for downstream tasks via fine-tuning. However, when the fine-tuning data distribution is narrow, the model tends to overfit to task-specific local features (Wang et al., 2024). We argue that this is mainly because task-specific optimization overrides or suppresses the domain-invariant features learned during pre-training.
To substantiate this assumption, we have conducted experiments in Sec. 5.4 to illustrate that the extracted relation works (see Tab. 5) and is domain-invariant (see Tab. 6).
- Tab. 5: Extracted relations boost other distillation methods (e.g., TinySAM) by 1.7%–5.2% IoU, indicating the preserved information's effectiveness.
- Tab. 6: Applying an RM trained on one domain to a completely different domain preserves its effectiveness, suggesting that these transferable relations are domain-invariant and beneficial for fine-tuning.
We further explore the nature of domain-invariant information. We employ relations to represent domain-invariant information, which serves as an implicit yet generalizable characterization that may inherently encode various domain-agnostic properties. Here, we showcase and evaluate structural edge information using the Boundary F1 Score (BFS) (Peng et al., 2023). As shown in Fig. 2 (https://anonymous.4open.science/r/InfoSAM-7D61/README.md), InfoSAM with the relation module outperforms other fine-tuning baselines in boundary preservation, demonstrating that this implicit relational encoding effectively extracts richer structural edge features.
Boundary F1 Score comparisons on leaf dataset (threshold=3):
| Method | BFS (↑) |
|---|---|
| SAM | 39.0 ± 0.16 |
| HQSAM | 63.7 ± 0.65 |
| SU-SAM | 75.1 ± 0.69 |
| ConvLoRA-SAM | 71.5 ± 0.56 |
| InfoSAM (Ours) | 76.4 ± 0.29 |
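For reference, below is a minimal sketch of how a boundary F1 score of this kind can be computed from a predicted and a ground-truth mask with a pixel-distance tolerance (3, matching the table); the boundary extraction and matching rule are our assumptions for illustration and may differ from the exact protocol in Peng et al. (2023).

```python
import numpy as np
from scipy import ndimage

def boundary_f1(pred_mask: np.ndarray, gt_mask: np.ndarray, tol: int = 3) -> float:
    """Boundary F1: a boundary pixel counts as matched if a boundary pixel
    of the other mask lies within `tol` pixels (Euclidean distance)."""
    def boundary(mask):
        # One-pixel-wide boundary: mask XOR its eroded version.
        mask = mask.astype(bool)
        return mask ^ ndimage.binary_erosion(mask)

    pred_b, gt_b = boundary(pred_mask), boundary(gt_mask)
    if not pred_b.any() or not gt_b.any():
        return 0.0

    # Distance from every pixel to the nearest boundary pixel of each mask.
    dist_to_gt = ndimage.distance_transform_edt(~gt_b)
    dist_to_pred = ndimage.distance_transform_edt(~pred_b)

    precision = (dist_to_gt[pred_b] <= tol).mean()   # predicted boundary near GT boundary
    recall = (dist_to_pred[gt_b] <= tol).mean()      # GT boundary near predicted boundary
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```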
References:
- Hoffman et al. Cycada: Cycle-consistent adversarial domain adaptation. ICML, 2018.
- Chang et al. All about Structure: Adapting Structural Information across Domains for Boosting Semantic Segmentation. CVPR, 2019.
- Xu et al. DIRL: Domain-Invariant Representation Learning for Generalizable Semantic Segmentation. AAAI, 2022.
- Li et al. A stepwise domain adaptive segmentation network with covariate shift alleviation for remote sensing imagery. TGRS, 2022.
- Ma et al. Decomposition-based Unsupervised Domain Adaptation for Remote Sensing Image Semantic Segmentation. TGRS, 2024.
- Huang et al. MLAN: Multi-level adversarial network for domain adaptive semantic segmentation. PR, 2022.
- Li et al. Domain-invariant Representation Learning via Segment Anything Model for Blood Cell Classification. arXiv, 2024.
- Peng et al. Learning to Adapt SAM for Segmenting Cross-domain Point Clouds. ECCV, 2024.
- Zhang et al. Learning Shape-Invariant Representation for Generalizable Semantic Segmentation. TIP, 2023.
- Wang et al. SAMCL: Empowering SAM to Continually Learn from Dynamic Domains. arXiv, 2024.
In this paper, the authors focus on parameter-efficient fine-tuning of the Segment Anything Model (SAM) from an information-theoretic perspective and propose InfoSAM. Specifically, InfoSAM aims to mine the domain-invariant relations encoded in the pre-trained model and designs a new knowledge distillation framework with two new training objectives, i.e., an intra-SAM relation loss and an inter-SAM relation loss. By preserving domain-invariant relations in the pre-trained model and maximizing mutual information between teacher and student models, InfoSAM achieves better segmentation on various downstream segmentation tasks.
Questions for Authors
An ablation study on the parameter count of the relation module would be helpful, e.g., the number of attention layers and different architectures. A thorough analysis could support the claim from the information-theoretic perspective.
Claims and Evidence
The formulation regarding intra-SAM and inter-SAM relations is correct. The theoretical analysis is sufficient and convincing.
Methods and Evaluation Criteria
The proposed method is intuitive and effective. The evaluation criteria are widely used in segmentation tasks.
Theoretical Claims
The formulation regarding intra-SAM and inter-SAM relations is correct.
Experimental Design and Analysis
The experimental design and analysis are sound. The experimental results are promising.
Supplementary Material
The reviewer has read the supplementary material. The useful parts include: pseudo code of InfoSAM, derivation of information-theoretic losses, and additional experimental results. All these parts improve the quality of the manuscript.
Relation to Existing Literature
The proposed method may inspire future knowledge distillation work and other improved parameter-efficient fine-tuning methods for better visual models.
Missing Important References
N/A.
Other Strengths and Weaknesses
Weakness: I noticed that both relation modules in the teacher and student SAM are optimized; how do the authors ensure that the proposed relation modules will not fall into a trivial solution?
Other Comments or Suggestions
Eq. (14) may include a typo; I think it should be $\lambda_1 \mathcal{L}_r + \lambda_2 \mathcal{L}_d$.
We sincerely appreciate your thorough review of our paper and the valuable insights you provided. In the following sections, we will provide a detailed response to each of your comments.
Q1: Risk of trivial solutions in the relation module (RM).
We clarify that the teacher's RM and student's RM share identical parameters, as described in the Problem Formulation in Sec. 4.2 (lines 185–187).
Moreover, our proposed loss function includes several regularization terms, which are elaborated in Sec. 4.2 after Eq. (11) and Eq. (13). These terms explicitly promote diversity in the feature distribution and prevent it from converging to trivial solutions.
To further verify the effectiveness of these regularization terms, we conducted an ablation study to assess their impact both qualitatively (through the visualization of relation maps) and quantitatively (through performance on downstream tasks). Both results indicate that the proposed loss with regularization terms effectively extracts domain-invariant features, rather than domain-specific noise, thereby enhancing downstream performance and alleviating the problem of trivial solutions.
- Visualization: We visualize the relation maps and their corresponding statistical distributions evolving from early to late epochs. As shown in Fig. 1 (https://anonymous.4open.science/r/InfoSAM-7D61/README.md), without the regularization terms, the distribution of relation maps becomes increasingly narrow during training, and the domain-invariant information captured by the relation maps becomes less distinct. In contrast, the RM trained with regularization terms maintains a broad relation distribution and a more representative relation map.
- Performance: The regularization terms benefit our method by improving performance, as demonstrated by a 1.0% and 1.8% increase in IoU on the Leaf and Road datasets, respectively.
| Method | Agriculture IoU (Leaf) | Remote Sensing IoU (Road) |
|---|---|---|
| w/o RT | 74.6 ± 0.12 | 59.6 ± 0.69 |
| w/ RT | 75.6 ± 0.27 | 61.4 ± 0.30 |
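For illustration only, the snippet below sketches a generic diversity-promoting regularizer of the kind discussed above: an entropy bonus on the relation map that discourages collapse to a narrow, near one-hot distribution. It is not the paper's actual regularization terms from Eq. (11) and Eq. (13); the function name and weighting are hypothetical.

```python
import torch

def relation_entropy_bonus(rel_map: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Diversity-promoting term: rewards a broad relation distribution.

    rel_map: (B, N_mask, N_img), rows sum to 1 (e.g., softmax attention weights).
    Returns a scalar to be added to the training loss with a small weight;
    minimizing it maximizes the average row-wise entropy, preventing the
    relation map from collapsing onto a single token (a trivial solution).
    """
    entropy = -(rel_map * (rel_map + eps).log()).sum(dim=-1)  # (B, N_mask)
    return -entropy.mean()

# Hypothetical usage: total_loss = task_loss + lambda_reg * relation_entropy_bonus(rel_map)
```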
Q2: Typographical error in Eq. (14).
Thank you for your careful review. Eq. (14) should indeed be $\mathcal{L}_{info} = \lambda_1 \mathcal{L}_r + \lambda_2 \mathcal{L}_d$. We will correct this typographical error in the revised paper.
Q3: Ablation study of various relation modules.
We conducted an analysis to compare different model architectures and to explore the number of attention layers for the relation module (RM). We compare a direct dot product, a linear layer, stacked attention layers, and our proposed RM across multiple experiments on two distinct domains.
The experimental results show that: (1) the attention-based RM outperforms the other architecture designs. This indicates that the attention mechanism effectively assesses the correlations between the input features (i.e., image and mask features), adaptively filtering and enhancing useful information (e.g., edge details) while reducing redundancy. (2) Stacking an appropriate number of attention layers (e.g., three) in the RM can be beneficial for capturing key information; however, stacking too many (e.g., five layers) increases training difficulty and risks overfitting. In a nutshell, the current RM design is a trade-off between performance and computational overhead, and it effectively captures the relationships between image and mask features.
Ablation study of various RM:
| Method | Agriculture IoU (Leaf) | Remote Sensing IoU (Road) |
|---|---|---|
| Dot Product | 75.2 ± 0.35 | 61.0 ± 0.04 |
| Linear | 74.9 ± 0.51 | 59.3 ± 0.58 |
| Attn-5 | 75.4 ± 0.22 | 61.4 ± 0.12 |
| Attn-3 | 75.4 ± 0.40 | 61.7 ± 0.06 |
| Attn-1 (ours) | 75.6 ± 0.27 | 61.4 ± 0.30 |
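To make the Attn-1 configuration concrete, here is a minimal, hypothetical sketch of a single-cross-attention-layer relation module that maps encoder image features and decoder mask features to a relation map; the dimensions, projections, and names are our assumptions and do not reproduce the paper's implementation.

```python
import torch
import torch.nn as nn

class RelationModule(nn.Module):
    """Single-cross-attention-layer sketch: mask tokens (queries) attend to
    image tokens (keys); the attention weights act as the relation map."""

    def __init__(self, img_dim: int = 256, mask_dim: int = 256, embed_dim: int = 128):
        super().__init__()
        self.q_proj = nn.Linear(mask_dim, embed_dim)  # queries from decoder mask features
        self.k_proj = nn.Linear(img_dim, embed_dim)   # keys from encoder image features
        self.scale = embed_dim ** -0.5

    def forward(self, img_feat: torch.Tensor, mask_feat: torch.Tensor) -> torch.Tensor:
        # img_feat: (B, N_img, img_dim), mask_feat: (B, N_mask, mask_dim)
        q = self.q_proj(mask_feat)                    # (B, N_mask, E)
        k = self.k_proj(img_feat)                     # (B, N_img, E)
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        return attn                                   # relation map: (B, N_mask, N_img)

# Teacher and student pass their own features through the *same* module,
# consistent with the shared-parameter design mentioned in the response above.
rm = RelationModule()
relation_map = rm(torch.randn(2, 64 * 64, 256), torch.randn(2, 4, 256))
```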
Thanks for your rebuttal. The additional results have addressed my concerns. I believe the quality of the manuscript will be improved after revision. I will raise my score.
Thank you for your feedback! We are delighted that our response has addressed your concerns and appreciate your acknowledgment of the additional results. All the discussions and experiments will be added to the revised paper.
This paper proposes InfoSAM, a new SAM fine-tuning framework that (1) compresses the domain pseudo-invariant information and (2) maximizes mutual information between a pre-trained teacher and a fine-tuned student model. Experiments across diverse datasets demonstrate that InfoSAM significantly enhances segmentation performance compared to traditional parameter-efficient fine-tuning and distillation methods.
Questions for Authors
- In lines 248–250, Equation (14) is written as $\mathcal{L}_{info} = \lambda_1 \mathcal{L}_{ce} + \lambda_2 \mathcal{L}_{info}$. However, $\mathcal{L}_{ce}$ has not been explicitly defined earlier, and $\mathcal{L}_{info}$ appears on both sides of the equation. I think it should be $\mathcal{L}_r$ and $\mathcal{L}_d$?
- In lines 258–260, α = 2 is chosen because it simplifies the computation (using the Frobenius norm); are there other theoretical or practical reasons for selecting this value?
Claims and Evidence
Yes. The paper proposes two key mutual-information losses: the relation compression loss $\mathcal{L}_r$ and the distillation loss $\mathcal{L}_d$. The effectiveness of their combination is demonstrated by the performance of SAM and SAM2 on downstream tasks across different domains. The effectiveness of each loss is also verified by an ablation study.
Methods and Evaluation Criteria
The proposed methods make sense for the task, and the evaluation is thorough.
Theoretical Claims
Yes, I have the following questions for the authors:
- In lines 248–250, Equation (14) is written as $\mathcal{L}_{info} = \lambda_1 \mathcal{L}_{ce} + \lambda_2 \mathcal{L}_{info}$. However, $\mathcal{L}_{ce}$ has not been explicitly defined earlier, and $\mathcal{L}_{info}$ appears on both sides of the equation. I think it should be $\mathcal{L}_r$ and $\mathcal{L}_d$?
- In lines 258–260, α = 2 is chosen because it simplifies the computation (using the Frobenius norm); are there other theoretical or practical reasons for selecting this value?
Experimental Design and Analysis
Yes, the experimental design is sound.
Supplementary Material
Yes, the authors provide the code in the supplementary material.
Relation to Existing Literature
The paper’s contributions are twofold.
- Different from other parameter-efficient fine-tuning (PEFT) methods (SAMAdapter (Chen et al., 2023), Conv-LoRA (Zhong et al., 2024)), it introduces a novel information-theoretic framework that leverages mutual information to preserve critical domain-invariant features.
- Different from traditional knowledge distillation techniques (TinySAM (Shu et al., 2023), MobileSAM (Zhang et al., 2023)), this paper considers the inter-module relationships within the teacher SAM, thereby enabling a more effective transfer of structural knowledge.
Missing Important References
None.
Other Strengths and Weaknesses
Strengths:
- The paper creatively combines parameter-efficient fine-tuning with an information-theoretic framework, introducing mutual information-based losses that focus on preserving domain-invariant inter-module relationships within SAM. This represents an innovative twist on traditional knowledge distillation approaches.
Weaknesses:
- Insufficient justification for $\alpha = 2$
- Unclear definition of $\mathcal{L}_{info}$
Other Comments or Suggestions
None
We appreciate your detailed and valuable review of our paper. We will address each of your comments thoroughly in the following sections.
Q1: Insufficient justification for $\alpha = 2$.
The core reasons for choosing $\alpha = 2$ in the matrix-based Rényi's $\alpha$-entropy are as follows:
(1) The primary practical motivations are computational efficiency and alignment with prior works. By setting $\alpha = 2$, we enable direct computation of the matrix-based Rényi entropy through Frobenius norm operations (Eq. 11), eliminating the need for eigenvalue decomposition. This reduces the time complexity from $\mathcal{O}(n^3)$ to $\mathcal{O}(n^2)$ (where $n$ denotes the number of samples) (Dong et al., 2023), substantially reducing computational costs while maintaining theoretical rigor, which is particularly advantageous for high-dimensional data analysis (Yu et al., 2019). Additionally, prior research (Miles et al., 2023) has successfully applied Rényi entropy with $\alpha = 2$ to segmentation tasks; to align with established practice in this field, we adopt $\alpha = 2$.
(2) For theoretical reasons, if the application requires emphasis on the tails of the distribution (rare events) or on multiple modalities (distributions with multiple peaks), $\alpha$ should be less than 2, possibly approaching 1 from above. If the goal is to highlight the dominant mode (the most probable region), $\alpha$ should be greater than 2 to emphasize central tendencies. $\alpha = 2$ provides neutral weighting (Yu et al., 2019). Moreover, the Frobenius norm's differentiable and strongly convex properties guarantee rapid convergence in gradient-based optimization algorithms (Boyd, 2004).
(3) Furthermore, we conducted an analysis to evaluate the performance of different $\alpha$ values. Following prior work (Yu et al., 2019), we also include a value of $\alpha$ close to 1, which asymptotically approaches Shannon entropy. The results indicate that $\alpha = 2$ achieves the highest verification accuracy while reducing computational overhead by an order of magnitude. This computational gain stems from its exclusive reliance on Frobenius norm operations (Eq. 11), whereas the other $\alpha$ values require eigenvalue decompositions, which are computationally more expensive.
Experiments with different $\alpha$ values in the matrix-based Rényi entropy:
| $\alpha$ | Agriculture IoU (Leaf) | Remote Sensing IoU (Road) | Computation Time (ms) |
|---|---|---|---|
|  | 75.3 ± 0.31 | 60.6 ± 0.12 | 32.1 ± 30.7 |
| 2 (ours) | 75.6 ± 0.27 | 61.4 ± 0.30 | 1.2 ± 0.3 |
|  | 75.2 ± 0.30 | 61.2 ± 0.06 | 35.4 ± 31.2 |
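To make the complexity argument in (1) concrete, the sketch below contrasts the general matrix-based Rényi $\alpha$-entropy, which requires an eigendecomposition, with the $\alpha = 2$ special case computed from a Frobenius norm; the RBF kernel and trace normalization follow the standard construction of Yu et al. (2019), and the helper names are ours rather than the paper's Eq. (11).

```python
import numpy as np

def gram_matrix(X: np.ndarray, sigma: float = 1.0) -> np.ndarray:
    """Trace-normalized RBF Gram matrix (standard construction in Yu et al., 2019)."""
    sq = np.sum(X**2, axis=1)
    dists = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    K = np.exp(-dists / (2.0 * sigma**2))
    return K / np.trace(K)

def renyi_entropy(A: np.ndarray, alpha: float) -> float:
    """General matrix-based Rényi alpha-entropy: needs an O(n^3) eigendecomposition."""
    eigvals = np.clip(np.linalg.eigvalsh(A), 0.0, None)
    return float(np.log2(np.sum(eigvals**alpha)) / (1.0 - alpha))

def renyi_entropy_alpha2(A: np.ndarray) -> float:
    """alpha = 2 shortcut: the sum of squared eigenvalues equals ||A||_F^2, an O(n^2) operation."""
    return float(-np.log2(np.linalg.norm(A, 'fro') ** 2))

X = np.random.randn(128, 16)          # 128 samples with 16-dimensional features
A = gram_matrix(X)
assert np.isclose(renyi_entropy(A, 2.0), renyi_entropy_alpha2(A))
```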
Q2: Unclear definition of $\mathcal{L}_{info}$.
Thank you for your careful review. Equation (14) should indeed be $\mathcal{L}_{info} = \lambda_1 \mathcal{L}_r + \lambda_2 \mathcal{L}_d$. We will correct this typographical error in the revised paper.
References:
- Dong et al. Optimal Randomized Approximations for Matrix-based Rényi's Entropy. TIT, 2023.
- Yu et al. Multivariate Extension of Matrix-Based Rényi's α-Order Entropy Functional. TPAMI, 2019.
- Miles et al. MobileVOS: Real-Time Video Object Segmentation Contrastive Learning meets Knowledge Distillation. CVPR, 2023.
- Boyd. Convex Optimization. Cambridge University Press, 2004.
Thanks for your rebuttal. The additional results have addressed my concern. I will keep the 'accept' recommendation.
Thank you for your feedback! We are delighted that our response has addressed your concerns and appreciate your acknowledgment of the additional results. All the discussions and experiments will be added to the revised paper.
The reviewers generally agree that InfoSAM is a novel and effective approach for fine-tuning SAM models. It leverages information-theoretic losses to preserve domain-invariant knowledge, showing significant performance improvements. The paper is recommended for acceptance.