Mitigating Parameter Interference in Model Merging via Sharpness-Aware Fine-Tuning
We connect sharpness-aware minimization (SAM) to model merging by focusing on its ability to find wider minima, and show that SAM can enhance the multi-task performance of the merged model.
Abstract
Reviews and Discussion
This paper proposes leveraging Sharpness-Aware Minimization (SAM) during fine-tuning to enhance the performance of model merging. The authors find that SAM effectively reduces parameter interference, addressing a core challenge in model merging. Given SAM's inherent capacity to improve generalization, the paper shows that it can be a viable and effective optimization method for model merging.
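For readers unfamiliar with SAM, the sketch below shows one standard two-step SAM update (following Foret et al.) in PyTorch; `model`, `loss_fn`, `base_optimizer`, and `rho` are illustrative names and are not taken from the paper's code.

```python
import torch

def sam_step(model, loss_fn, x, y, base_optimizer, rho=0.05):
    # 1) Gradient at the current weights.
    loss = loss_fn(model(x), y)
    loss.backward()

    # 2) Move to the approximate worst-case point within an L2 ball of radius rho.
    with torch.no_grad():
        grad_norm = torch.norm(torch.stack(
            [p.grad.norm(p=2) for p in model.parameters() if p.grad is not None]))
        eps = {}
        for p in model.parameters():
            if p.grad is None:
                continue
            e = rho * p.grad / (grad_norm + 1e-12)
            p.add_(e)          # perturb the weights
            eps[p] = e
    model.zero_grad()

    # 3) Gradient at the perturbed weights, then undo the perturbation and update.
    loss_fn(model(x), y).backward()
    with torch.no_grad():
        for p, e in eps.items():
            p.sub_(e)
    base_optimizer.step()      # apply the sharpness-aware gradient
    base_optimizer.zero_grad()
    return loss.item()
```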
Strengths
This paper presents an interesting perspective on improving model merging by using SAM as the optimizer. The authors show convincing evidence that SAM can reduce parameter interference, which has not been deeply explored. The experimental results also show improvements from SAM in many scenarios.
Weaknesses
- The experiments in Table 2 rely on SGD as the base optimizer, whereas Table 1 indicates that FTTS and FTLO significantly outperform SGD. It would be beneficial to show results combining FTTS, FTLO, and TIES merging to better understand the method's upper performance bounds.
- The experimental scope is limited to vision tasks, and expanding the evaluation to include NLP tasks would strengthen the demonstration of the method’s applicability.
Questions
Please refer to the weaknesses highlighted above
This paper proposes utilizing Sharpness-Aware Minimization (SAM) during the finetuning of pre-trained models to achieve better generalization and weight disentanglement for model merging. The central hypothesis is that SAM leads to flatter minima, which in turn reduces interference between task-specific models and enhances the performance of the merged model. The authors demonstrate its effectiveness through several experiments.
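As background on the merging step the review refers to, here is a hedged sketch of plain task-arithmetic merging; `pretrained_sd`, `finetuned_sds`, and the scaling coefficient `lam` are hypothetical placeholders, and stronger merging methods such as TIES add further steps (e.g., trimming and sign election).

```python
import torch

def merge_task_arithmetic(pretrained_sd, finetuned_sds, lam=0.3):
    """Add scaled task vectors (finetuned minus pretrained) to the pre-trained weights."""
    merged = {k: v.clone() for k, v in pretrained_sd.items()}
    for ft_sd in finetuned_sds:
        for k, v in merged.items():
            if v.dtype.is_floating_point:   # skip integer buffers such as BN counters
                v += lam * (ft_sd[k] - pretrained_sd[k])
    return merged

# Hypothetical usage:
# merged_sd = merge_task_arithmetic(torch.load("pretrained.pt"),
#                                   [torch.load(f"task_{i}.pt") for i in range(8)])
# model.load_state_dict(merged_sd)
```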
Strengths
- The description of the background is very clear, making the motivation of this work reasonable.
- The paper shows consistent performance improvements in merged models.
- They demonstrate the effectiveness of this method through weight disentanglement visualization and cross-task linearity (CTL) analysis.
Weaknesses
- The contributions seem insufficient.
- The experimental results are limited: the paper focuses on vision tasks, and results on other domains are unclear.
- The performance gains in several results are relatively small.
Questions
- The resource cost of this method should be reported. Does this method increase the training cost compared with the baseline?
The paper presents a method that aims to reduce interference during the merging of multiple models. This is achieved by optimizing the performance gap between the merged model and each individually finetuned model, as well as optimizing per-task losses. To minimize these objectives, the authors incorporate Sharpness-Aware Minimization (SAM) during the finetuning process. This approach not only helps reduce parameter interference but also enhances the generalization of finetuned models. Empirical results suggest that SAM facilitates weight disentanglement and improves cross-task linearity. Additionally, the final model's performance improves across different merging methods, which demonstrates that the proposed method is orthogonal to them.
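A rough sketch of the kind of cross-task linearity check mentioned here, assuming CTL is assessed by comparing the features of a weight-interpolated model against the interpolation of the two models' features; `model_fn` (a callable that builds a model from a state dict) and the other names are hypothetical.

```python
import torch

def ctl_gap(model_fn, sd_a, sd_b, x, alpha=0.5):
    """Relative gap between the interpolated model's features and the interpolated features."""
    # Features of the model built from interpolated weights.
    sd_mix = {k: (1 - alpha) * sd_a[k] + alpha * sd_b[k] for k in sd_a}
    f_mix = model_fn(sd_mix)(x)

    # Interpolation of the two models' features.
    f_bar = (1 - alpha) * model_fn(sd_a)(x) + alpha * model_fn(sd_b)(x)

    # A small value indicates that cross-task linearity approximately holds.
    return (torch.norm(f_mix - f_bar) / (torch.norm(f_bar) + 1e-12)).item()
```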
Strengths
The paper is well-structured and articulate, making it easy to follow. It presents an intriguing finding that employing SAM can significantly narrow the performance gap between a merged model and task-specific models. The experiments conducted across various merging methods effectively illustrate the effectiveness of the proposed approach.
Weaknesses
The connection between the objective of SAM and Equation (7) is relatively loose, and it is unclear how SAM can help minimize the objective in Equation (7). Thus, further ablation studies are needed to justify the choice of SAM. Otherwise, it seems that any optimization technique that targets flat minima could potentially enhance the performance of the merged model by steering the parameters towards regions where interpolation between different parameters does not increase the loss.
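The reviewer's point about interpolation can be probed directly by measuring the loss along the segment between two fine-tuned checkpoints; the sketch below assumes hypothetical `build_model` and `evaluate(model, loader)` helpers that return a fresh model and an average loss, respectively.

```python
import torch

def loss_along_path(build_model, sd_a, sd_b, evaluate, loader, steps=11):
    """Loss at evenly spaced points on the line segment between two checkpoints."""
    losses = []
    for t in torch.linspace(0.0, 1.0, steps).tolist():
        sd_t = {k: (1 - t) * sd_a[k] + t * sd_b[k] for k in sd_a}
        model = build_model()
        model.load_state_dict(sd_t)
        losses.append(evaluate(model, loader))
    # A flat, low-interference region shows no bump above the endpoint losses.
    return losses
```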
Questions
- Could you clarify how the parameters in Equations (10) and (12) are defined?
- I am wondering how other flat-minima techniques (e.g., SWA [1], RWP [2]) can help bridge the performance gap between merged and individually finetuned models.
- It would be beneficial to discuss the related work [3], which aims to find a common low and flat loss region across per-task objectives.
[1] Izmailov, Pavel, et al. "Averaging weights leads to wider optima and better generalization." arXiv preprint arXiv:1803.05407 (2018).
[2] Li, Tao, et al. "Efficient generalization improvement guided by random weight perturbation." arXiv preprint arXiv:2211.11489 (2022).
[3] Phan, Hoang, et al. "Improving multi-task learning via seeking task-based flat regions." arXiv preprint arXiv:2211.13723 (2022).
This paper presents a model merging method that fine-tunes each task-specific model from the pre-trained model using SAM, aiming to reduce parameter interference and improve task-specific performance. A comprehensive empirical analysis of weight disentanglement, along with experiments, demonstrates the effectiveness of the proposed method.
Strengths
- This paper is well-written and well-organized. I enjoyed reading it.
- The analysis in Section 5 is comprehensive, although primarily from an empirical perspective.
- The results are promising for some merging methods.
- The method is versatile and can be applied to existing merging approaches.
Weaknesses
- Confusing motivation: For Eq.(6), the proposed method still needs the dataset of each task and fine-tunes on each task. Therefore, if we have all task-specific datasets, why not directly perform joint training?
- Using "sharpness-aware" is misleading for the goal of addressing parameter interference. Although Eq.(7) takes a similar form to the SAM objective in Eq.(2) (recalled after this list for reference), the perturbations in Eq.(7) are not derived from the goal of "sharpness-awareness." In my opinion, there is no link between "sharpness-aware" and "interference reduction."
- When optimizing Eq.(6), we optimize one task-specific model while keeping the other models frozen. Afterwards, when optimizing the next task-specific model, which version of the previously optimized models should we use? If we use the latest version, which task-specific model should we optimize first? Should we consider the forgetting issue for this sequential optimization?
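For reference, the standard SAM objective and its first-order perturbation (presumably the Eq.(2) contrasted with Eq.(7) above) are shown below; the notation is the usual one from the SAM literature and may differ from the paper's.

```latex
\min_{\theta} \; \max_{\|\epsilon\|_2 \le \rho} \; \mathcal{L}(\theta + \epsilon),
\qquad
\hat{\epsilon}(\theta) \approx \rho \, \frac{\nabla_{\theta} \mathcal{L}(\theta)}{\|\nabla_{\theta} \mathcal{L}(\theta)\|_2}
```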
Questions
- Could you interpret the meaning of the perturbations in Eq.(7)? This could provide readers with more intuitive explanations.
Details Of Ethics Concerns
NA
The paper proposes a method to perform model merging by changing the loss function during training so that the subsequent merging process becomes easier. The paper leverages Sharpness-Aware Minimization (SAM) during fine-tuning to reduce parameter interference and improve task-specific performance. SAM promotes flatter minima, enhancing generalization and facilitating weight disentanglement, which addresses key challenges in merging multiple models. Reviewers agree that the paper is well written and easy to follow, and that the proposed method is novel. The initial weaknesses concerned the explanation of the SAM loss as well as the evaluations. The authors presented extensive rebuttals that reviewers found useful. Hence, based on the final ratings, the paper can be accepted to ICLR.
Additional Comments On Reviewer Discussion
The authors presented thorough rebuttals, and all reviewers actively participated in the rebuttal process.
Accept (Poster)