/10

Poster4 位审稿人

最低2最高4标准差0.7

ICML 2025

Understanding Overadaptation in Supervised Fine-Tuning: The Role of Ensemble Methods

Yifan HAO,Xingyuan Pan,Hanning Zhang,Chenlu Ye,Rui Pan,Tong Zhang

OpenReview PDF

提交: 2025-01-24更新: 2025-07-24

TL;DR

This paper mainly focuses on understanding the benefits of ensemble in supervised fine-tuning.

摘要

关键词

model ensemblefine-tuningmulti-task

评审与讨论

审稿意见

评分: 42025-03-08

The paper theoretically investigates the phenomenon observed in prior work that ensembling pre-trained and fine-tuned weights of foundation models outperforms dedicated fine-tuning strategies. The authors first verify this phenomenon via empirical results for instruction fine-tuning LLMs. Then the authors provide a theoretical framework to compare the excess risk between different estimators that represent vanilla or regularized fine-tuning, as well as ensembles. It is found that ensemble estimators attain a favorable balance between bias and variance error terms and hence improve performance while mitigating forgetting.

UPDATE

The authors have addressed all my concerns during the rebuttal period, hence I increased my score and recommend acceptance.

给作者的问题

Is there any particular reason why the authors jump back and forth between the terms "overadaptation" and "overfitting"? It would be easier to follow if one term would be used consistently.
Are the different points for ensemble in Figure 2 and 3 for different values of $\tau$ ? If that is the case it would be good to mention that at least in the caption or have respective values for $\tau$ at each dot of the ensemble curve.
Are the different dots for fine-tune methods in Figure 2 and 3 for different seeds or different regularizers? Those markers should be labelled accordingly.

论据与证据

The authors say that their theoretical results provide guidance for enhancing the performance of fine-tuning strategies. I appreciate the empircal validation of Theorem 5.1, but this does not resemble a real-world fine-tuning setting as in Section 3. Can the authors be a bit more specific what the claimed guidance for enhancing performance of fine-tuning strategies is?

方法与评估标准

The methods employed for empirical evaluation and evaluation criteria are sound.

理论论述

The main merit of this work is of theoretical nature.

实验设计与分析

What is the variance for the different fine-tuning methods on MT-Bench? The difference in the scores seem marginal, are they significant? Furthermore, does the same trend persist for the remaining two evaluation sets? It is important to establish whether this phenomenon persists across several datases to determine its significancy.

补充材料

I have read over the appendix, but have not rigorously checked the proofs.

与现有文献的关系

The paper positions itself by analysing a practical phenomenon observed during forming ensembles for fine-tuning of large models, which has been observed in prior work. Hence, the major contribution of the work is of theoretical nature, which is valid and relevant to foster theoretical understanding.

遗漏的重要参考文献

To the best of my knowledge the most relevant works have been discussed.

其他优缺点

Strengths

The work is well motivated and well-written for the most part.

The theoretcal contrbution seems rigorous and I enjoyed the empirical validation of Theorem 5.1.

Weaknesses

One weakness in my view is the lack of guidance for fine-tuning strategies (see Claims and Evidence).

Another one is a more thorough empirical verification of the observed phenomenon.

其他意见或建议

The relation between early-stopping and ridge regularization was not clear to me from the empirical sections. The first time this connection is established and backed with references is in line 270 (right), to enhance clarity, I suggest elaborating no this relation earlier in the empirical part which is also serves as justification for the choice of regularizer.

作者回复

2025-04-01

We would like to extend our sincere appreciation to the reviewer for all the constructive and positive feedbacks on our contribution. Additional experiments and explanations have been provided to address the mentioned concerns.

Guidance for enhancing performance of fine-tuning strategies. Some theoretical insights have been successfully applied to empirical experiments and provide guidelines for hyperparameter search. For example, the theory predicts that the value of $\tau$ should be large, and the value of $\lambda$ should be small, while in practice, the optimal $\tau$ is in [0.4, 0.8], with the optimal $\lambda$ in [1e-3, 1e-2]. This provides concrete empirical evidence and verifies the theoretical prediction.

The variance for the different fine-tuning methods on MT-Bench. Additional multi-trial experiments on Norm-Penalty with Ensemble (https://anonymous.4open.science/r/ICML2025-rebuttal-forgetting-1F68/README.md ) show that the standard deviation on MT-Bench is ~0.06, which is sufficiently small for results in Table 1. Furthermore, an accuracy gap of 1.0 is generally considered non-trivial for MMLU and Commonsense-QA for large language models. For example, in the llama-3.1’s post, llama-3.1-8B claims superiority over Gemma-2-9B-it with a MMLU accuracy of 73.0 v.s. 72.3.

More thorough empirical verification. Thanks for the constructive suggestion! We have included further experimental results on LoRA (rank=32) (https://anonymous.4open.science/r/ICML2025-rebuttal-forgetting-1F68/README.md ) to justify the empirical benefits of the ensemble method, where consistent performance improvements have been observed across all benchmarks.

Relationship between early-stopping and ridge regularization. Many thanks for the helpful feedback, we will add a remark in section 3.2 to elaborate the relationship between early-stopping and the ridge regularization.

Explanation for the usage of overadaptation and overfitting. We apologize for the confusing usage of these two terms. In our work, "overfitting" and "overadaptation" refer to the same concept—training the model until the training loss becomes very small during the fine-tuning process. To maintain consistency, we will revise the main text and use the term "overadaptation" throughout.

Are the different points for ensemble in Figure 2 and 3 for different values of $\tau$ ? If that is the case it would be good to mention that at least in the caption or have respective values for $\tau$ at each dot of the ensemble curve. In Figures 2 and 3 (https://anonymous.4open.science/r/ICML2025-rebuttal-forgetting-1F68/README.md ), the different points for ensemble models are from different $\tau$ values and the fine-tuned models are trained with different seeds. We have modified Figure 2 and 3 to make the presentation more clear.

Are the different dots for fine-tune methods in Figure 2 and 3 for different seeds or different regularizers? Those markers should be labelled accordingly. For the dots of fine-tune methods, we adopt both Norm-Penalty and DiffNorm-Penalty to fine-tune. We also use different seeds to repeat the experiments to provide more results. They are all plotted in Figures 2 and 3 (https://anonymous.4open.science/r/ICML2025-rebuttal-forgetting-1F68/README.md ).

审稿人评论

2025-04-03

Thank you for clarifying my concerns, I recommend to add a short paragraph discussing the practical implications. Other than that, I believe this is good work, hence I reside with my original rating.

作者评论

2025-04-03

Many thanks for your reply and helpful suggestion. We will integrate the discussion into the paper.

审稿意见

评分: 32025-03-10

The paper theoretically analyzes why WiSE-FT or other regularization methods work. The authors also show that WiSE-FT is better than vanilla fine-tuning with regularization.

给作者的问题

Could you elaborate on what noisy training data in L370 is?

Can you guarantee Dolly is not used for training the Llama 3 model? Considering both release dates, I suppose yes, but I'm not 100% sure.

[Post Rebuttal] Thank you for the clarification. I keep my rating.

论据与证据

The claim is generally supported by theoretical analysis, but it is difficult to find reasons for showing LLM experiment results. The analysis itself is not designed specifically for LLM tasks. The LLM experiments make me question whether the theory is not applicable to the standard image classification task (Taori et al., 2020) used in previous studies the authors cited.

Secondly, the authors use a linear regression task to show why ensemble methods (e.g., WiSE-FT). The authors need to justify why this simplification is valid (e.g., whether a proof holds for cross-entropy loss). I acknowledge that Kumar et al. (2022) also used a linear regression task, but that does not mean that this work does not need to be justified.

Finally, Theorem 5.1 has conditions for each statement. I wonder how easily this condition is met in practice.

方法与评估标准

The authors chose reasonable benchmark and evaluation criteria for measuring performance drop in the robust fine-tuning of LLM.

理论论述

I did not fully check the correctness of proofs in supplementary materials. Hence, my confidence score can be 1-2.

实验设计与分析

The authors chose reasonable benchmarks and training for measuring performance drops in the robust fine-tuning of LLM. However, it is difficult to find the source of lines between ensemble methods in Figures 2 and 3. Considering the trajectories of WiSE-FT, it can have different shapes. The linear interpolation using five data points can show readers how ensemble results can be changed.

补充材料

I read supplementary materials but did not thoroughly review each proof.

与现有文献的关系

The theoretical analysis can help understand robust fine-tuning.

遗漏的重要参考文献

I do not have any concerns regarding the reference.

其他优缺点

N/A

其他意见或建议

It is good to show pre-trained model’s performance in each experiment.

In L 330, Regular. -> regularization or just Reg. Regular is confusing.

In L 363, Task1, because -> Task 1 because

作者回复

2025-04-01

Many thanks for all the constructive suggestions.Additional experiments and explanations have been provided as follows.

The connection between theoretical and empirical results. Our experiment on LLM and theory match in showing the benefits of ensemble in fine-tuning.As fine-tuned parameters are close to pretrained points, we can adopt a linear setting with NTK explanation. Specifically, we can approximate the neural network $f(x, \alpha)$ as $f(x, \alpha_0) + \nabla_\alpha f(x, \alpha_0)^T (\alpha - \alpha_0)$ . Comparing this with our linear setting $y = x^T \theta$ , the features $\nabla_\alpha f(x, \alpha_0)$ can be interpreted as the input x in the linear model, and the trainable parameter $\alpha - \alpha_0$ as $\theta$ . Since $f(x, \alpha_0)$ remains unchanged during training, its effect can be disregarded in this simplification. Such simplifications can provide intuitive insights and have been widely used in prior works [1,2].

The high-dimensional assumption on x characterizes the features $\nabla_\alpha f(x, \alpha_0)$ in overparameterized models, i.e, there are several large eigenvalues and many small eigenvalues. It can be validated by the eigenvalue distribution of Hessian matrix in [3] [4].

Considering the loss function, to provide theoretical insights into empirical observations,we analyze the MSE loss, which intuitively captures the impact of bias and variance. From a practical standpoint, a well-performing model should exhibit a small gap between training and test loss.In such cases, the test prediction y’ is expected to be close to the training label y, given that the corresponding test input x' is similar to the training input x.Under this scenario, the behavior of MSE aligns with that of other loss functions,making this simplification reasonable and informative. Though bridging the gap between empirical results and theory still remains a challenging endeavor, this work represents a first step toward developing a theoretical framework for understanding this phenomenon. We aim to explore more comprehensive explanations in future works.

Conditions in Theorem 5.1. In Theorem 5.1,to achieve good performance on both the fine-tuning and pretraining tasks, we prefer to set the regularization coefficient within $0 \le \lambda \le \lambda’$ and the ensemble coefficient within $\tau’(\lambda) \le \tau \le 1$ . Generally, we have $\lambda' = O(1/n)$ , where n is the sample size. This suggests that the regularization penalty should not be too large, as excessive regularization may dominate the primary MSE loss. Meanwhile, $\tau'(\lambda)$ is a constant between 0 and 1 (often greater than 0.5), implying that the ensemble coefficient $\tau$ should not be too small; otherwise, the information learned during fine-tuning may be lost. To validate these theoretical insights, we conduct several simulations (see Figures 4 and 5 in Appendix B). The results confirm that a small penalty coefficient $\lambda$ and a large ensemble weight $\tau$ consistently lead to improved model performance.

Performance on pre-trained model, figure 2 and 3. We have improved the figures accordingly and highlighted the checkpoint source of all ensembles in the figures (https://anonymous.4open.science/r/ICML2025-rebuttal-forgetting-1F68/README.md ). Additionally, interpolation weights of 0.0 and 1.0 are also included to cover the two extremes of the pure pre-trained and fine-tuned model respectively. Results show that the pre-trained model demonstrates moderate performance in commonsense reasoning, but shows poor performance in downstream tasks.

Typos in L330, L363. We appreciate the valuable feedback and will correct these terms in the main text.

Noisy training data in L370. It refers to samples that contain noise. Let x denote the input and y denote the output, the relationship is $y = x^T \theta + \epsilon$ , where $\epsilon$ represents a noise variable.

Dolly. Yes, Dolly is not used for training the Llama-3-8B base model. According to the released huggingface model page, the llama-3-8B base model’s knowledge cutoff is March 2023, meaning the pre-training data utilized is until March 2023. Meanwhile, Dolly was released in 2023.4. Additionally, Meta explicitly states that their pretraining data comes from publicly available sources (https://ai.meta.com/blog/meta-llama-3/), so the possibility that they use Dolly from private sources is also excluded.

[1] Benign, tempered, or catastrophic: A taxonomy of overfitting.

[2] Fine-tuning can distort pretrained features and underperform out-of-distribution.

[3] Pyhessian: Neural networks through the lens of the hessian.

[4] Why transformers need adam: A hessian perspective.

审稿人评论

2025-04-05

I appreciate the authors for their responses. One of my concerns was not answered. Could you please explain why the linear regression task represents deep neural networks in various tasks? The response regarding The connection between theoretical and empirical results also directly assumes linear regression without justification.

Secondly, the authors use a linear regression task to show why ensemble methods (e.g., WiSE-FT). The authors need to justify why this simplification is valid (e.g., whether a proof holds for cross-entropy loss). I acknowledge that Kumar et al. (2022) also used a linear regression task, but that does not mean that this work does not need to be justified.

The connection between theoretical and empirical results As mentioned in the original comment, it is difficult to understand why authors specifically chose a new task not showing a standard existing setting. I still believe that it is beneficial to use both tasks, showing that the proposed method is also applicable to the existing standard benchmark. However, I acknowledge that the paper should not be degraded only because of this part.

Performance on pre-trained model, figure 2 and 3 As shown in updated figures, the performance trajectories are not simple piecewise linear. I recommend updating the figures accordingly. However, it is still difficult to know the meaning of the lines between $\tau =0.4$ and $\tau=0.8$ in the updated Figure 2 and between $\tau =0, 0.4, 0.8$ in the updated Figure 3. Please specify them as well.

Compared with the previous figures, several Ensemble methods datapoints are missing in the new figures. For example, in the original Figure 2, five ensemble datapoints are approximately (5.675, 75.5), (5.75, 75.6), (5.785, 75.45), (5.81, 75.0), and (5.875, 74.7). I originally thought that these data points are generated with various $\tau$ . However, it seems like some of the datapoints [e.g., (5.675, 75.5), (5.81, 75.0)] are missing. Are they one of $\tau = 0.1, 0.3, 0.5, 0.7, 0.8$ that the authors did not draw in the updated figure? If so, please add all data points [0.1, 0.2, ..., 0.9]. In particular, $\tau=0.5$ is important since this is the recommended hyperparameter by WiSE-FT (Section 6).

Noisy training data in L370: Is there any assumption regarding $\epsilon$ ? Please specify more details in the paper.

作者评论

2025-04-08

Thank you for your constructive feedback and insightful suggestions. Your positive remarks about our work are truly encouraging. We are grateful for your assistance in identifying potential concerns in our manuscript, which have helped us improve our paper, and we have diligently addressed these in our responses as follows:

The connection between theoretical and empirical results We chose the linear regression task because it offers a more intuitive illustration of the benefits of ensemble, particularly in terms of its clear decomposition of bias and variance in excess risks. However, we acknowledge that this simplification, while helpful for theoretical analysis, introduces a gap between our theoretical findings and practical results. We agree with the reviewer that developing more explicit theoretical insights for realistic tasks is both valuable and necessary, and we consider this an important direction for future work.

To illustrate the relevance of our linear regression analysis to practical settings, we provide a simple example demonstrating its consistency with Bayesian optimal prediction in language tasks. Consider a dataset of n samples $\{ x_i, y_i \}_{i=1}^n$ , where $x_i \in R^p$ represents a prompt and $y_i$ its corresponding response. We assume a finite state setting in which $x_i \in \{ e_1, \dots, e_p \}$ , and our goal is to predict y’ for a new prompt x’ in the test dataset.

In the linear setup, we model $y = x^T \theta$ . With the training samples, the least square estimator of $\theta$ is given by $\hat{\theta} = (X^TX)^{-1} X^T Y,$ where $X \in R^{n \times p}$ and $Y \in R^{n \times 1}$ are formed from the training samples. For each $k \in [p]$ , the k-th component of $\hat{\theta}$ can be expressed as $\hat{\theta} (k) = \frac{1}{\sum_{i=1}^n 1(x_i = e_k) } \sum_{i=1}^n y_i 1(x_i = e_k),$ which shows that the prediction on y’ in linear regression is: $\hat{y’} = \hat{\theta}^T x’ = \frac{1}{\sum_{i=1}^n 1(x_i = x’) } \sum_{i=1}^n y_i 1(x_i = x’).$ This result aligns with the empirical Bayesian optimal prediction for $y'$ , supporting the relevance of the linear regression task to more complex tasks.

Performance on pre-trained model, Figures 2 and 3 Regarding the meaning of the lines, we have included additional data points for $\tau \in [0.1, 0.2, \dots, 0.9]$ . As shown in the updated figures (https://anonymous.4open.science/r/ICML2025-rebuttal-forgetting-1F68/README.md), there is a clear upward trend in MT-Bench scores.

The curves are not piecewise linear due to the inherent variance in MT-Bench evaluations by LLM judges. Although the standard deviation is relatively small ( $\sim0.06$ ), it can still cause slight deviations of data points from their actual positions of means.

Regarding the missing points, they correspond to the same $\tau = 0.8$ but with different choices of $\lambda$ . We omitted them to improve the clarity of the figures. Specifically, we identify those missing points as:

Figure 2
- [5.675, 75.5]: $\lambda=10^{-2}, \tau=0.8$
- [5.780, 75.4]: $\lambda=10^{-3}, \tau=0.8$
- [5.810, 75.0]: $\lambda=2\times10^{-3}, \tau=0.8$
Figure 3
- [5.675, 64.7]: $\lambda=10^{-2}, \tau=0.8$
- [5.780, 64.3]: $\lambda=10^{-3}, \tau=0.8$
- [5.810, 64.2]: $\lambda=2\times10^{-3}, \tau=0.8$

Noisy training data in L370 In our theoretical analysis, we assume that the data noise $\epsilon$ is independent of $x$ and has zero mean. Specifically, in Condition 2, we assume the data noise is non-negligible. This assumption plays a crucial role in illustrating the negative impact of overadaptation during fine-tuning. More precisely:

Date noise is not quite small, meaning that overfitting to such noise during fine-tuning introduces a variance term in the excess risk.
Data noise is not quite large, so it does not overwhelm the meaningful signal in the fine-tuning task. As a result, fine-tuning remains effective for extracting useful task-specific knowledge. And insufficient fine-tuning can lead to a bias term in the fine-tuning risk.

Then the benefit of ensemble is its ability to achieve a better balance between bias and variance, thereby improving performance. This condition is important in our analysis, and we will include a more detailed discussion in the main paper to clarify its implications.

Your help in identifying potential concerns in our manuscript has been invaluable, and we have diligently addressed these in our response. Since the discussion period deadline is approaching, we would greatly appreciate any further comments you may have on our response. We are eager to resolve any additional concerns you might have. We fully understand your busy schedule and are genuinely thankful for the time you have dedicated to helping us enhance our work. We look forward to receiving any additional feedback you may offer.

审稿意见

评分: 22025-03-15

This paper focuses on theoretically and empirically studying task specific fine-tuning of large (over-parameterized) models. It aims to provide theoretical underpinnings for empirical observations such as task specific fine-tuning leads to a performance degradation/forgetting on pre-training tasks, an early stopping or another form of regularization is needed to achieve good fine-tuning performance, and an ensemble of fine-tuned model with pre-trained model leads to a superior performance on fine-tuning data set as well as results in alleviated forgetting on pre-training tasks. Authors present theoretical results, in a linear regression setting, on how regularizers are effective in reducing excess mean squared error on fine-tuning task and model ensemble improves it even further, and how regularization and model ensemble also provide improvement in excess mean squared error on pre-training task thereby leading to alleviation in forgetting. These results are also interpreted in terms of bias-variance tradeoff and how the tradeoff is improved by regularization and model ensemble.

给作者的问题

None

论据与证据

I find the following claims to be weak:

Claim in Section 3 that empirical results with ensemble of fine-tuned and pre-trained model show magical performance of ensemble as compared to fine-tuned model only. In these results it seems the baseline (epoch 0) performance to indicate how well the pre-trained model performs by itself on the tasks covered in the paper. Similarly, when the fine-tuned model is interpolated with the pre-trained model, please include interpolation weights of 0.0 and 1.0 to cover those two extremes.
The claim that the theoretical analysis presented in the paper helps to explain the observed empirical results is also weak. The theoretical analysis is conducted for a linear regression scenario with potentially strong assumptions on data distributions. How that extends to the practical case of large language models and empirical data distributions is unclear. The theoretical analysis is nice in itself (I did not validate its correctness, taking the results on their face value), so perhaps it should be presented by itself and not as a justification of empirical results with large models?

方法与评估标准

Yes

理论论述

Did not verify correctness of theoretical claims

实验设计与分析

As mentioned above, experimental results are lacking in a couple of respects:

Baseline results that show how well the pre-trained model performs by itself on the tasks covered in the paper
When the fine-tuned model is interpolated with the pre-trained model to create the ensemble, interpolation weights of 0.0 and 1.0 are missing
Which model epoch(s) are used for Table 1 results? And for the ensembles used in this table, what were the ensemble weights?

补充材料

No supplementary material provided

与现有文献的关系

The theoretical analysis and results on impact of model ensemble on excess mean square error are novel.

遗漏的重要参考文献

None that I am aware of

其他优缺点

Nothing beyond what’s covered in comments above

其他意见或建议

None

作者回复

2025-04-01

We would like to extend our sincere appreciation to the reviewer for all the constructive and insightful suggestions. Additional experiments and explanations have been provided to address the mentioned concerns.

The connection between theoretical and empirical results. Our experiment on LLM and theory match in showing the benefits of ensemble in fine-tuning. As fine-tuned parameters are close to pretrained points, we can adopt a linear setting with NTK explanation. Specifically, we can approximate the neural network $f(x, \alpha)$ as $f(x, \alpha_0) + \nabla_\alpha f(x, \alpha_0)^T (\alpha - \alpha_0)$ . Comparing this with our linear setting $y = x^T \theta$ , the features $\nabla_\alpha f(x, \alpha_0)$ can be interpreted as the input x in the linear model, and the trainable parameter $\alpha - \alpha_0$ as $\theta$ . Since $f(x, \alpha_0)$ remains unchanged during training, its effect can be disregarded in this simplification. Such simplifications can provide intuitive insights and have been widely used in prior works [1,2].

Regarding our assumptions on x, the high-dimensional setting characterizes the “features” $\nabla_\alpha f(x, \alpha_0)$ in overparameterized neural networks. We can further validate this assumption by analyzing the eigenvalue distribution of the Hessian matrix in practical models using PyHessian [3] and variants of Lanczos algorithms [4], which confirms the presence of several “large” eigenvalues and many “small” eigenvalues.

Additionally, the sparse structure observed in fine-tuning tasks reflects the nature of knowledge specialization across different inputs. While pretraining involves diverse inputs encompassing broad knowledge, fine-tuning is performed on specific tasks with a narrower scope, leading to a “sparse” structure in our theoretical formulation. Though bridging the gap between empirical observations and theoretical results still remains a challenging endeavor, this work represents a first step toward developing a theoretical framework for understanding this phenomenon. We aim to explore more comprehensive explanations in future research.

Explanation on Table 1. Thank you for your insightful questions. For Llama-3-8B, Qwen2-7B, and Gemma-2-9B, we all train for 1 epoch. Regarding the ensemble methods, we adopt the weight of $1-\tau=0.2$ for the pre-trained model, and $\tau=0.8$ for the fine-tuned model. These experimental details will be included to later versions of the paper to improve clarity.

[1] Mallinar, Neil, et al. "Benign, tempered, or catastrophic: A taxonomy of overfitting." arXiv preprint arXiv:2207.06569 (2022).

[2] Kumar, Ananya, et al. "Fine-tuning can distort pretrained features and underperform out-of-distribution." arXiv preprint arXiv:2202.10054 (2022).

[3] Yao, Zhewei, et al. "Pyhessian: Neural networks through the lens of the hessian." 2020 IEEE international conference on big data (Big data). IEEE, 2020.

[4] Zhang, Yushun, et al. "Why transformers need adam: A hessian perspective." Advances in Neural Information Processing Systems 37 (2024): 131786-131823.

审稿意见

评分: 32025-03-18

This paper studies overadaptation in supervised fine-tuning (SFT). It builds on prior work that observes that ensembling a pretrained model with its fine-tuned variant improves performance. Within an over-parameterized linear regression setup, the paper theoretically shows that this effect is due to the ensemble's ability to manage bias and variance.

给作者的问题

Can your analysis be extended to make similar statements about (1) ensembles of multiple fine-tuned models or (2) weight-space ensembles? Such ensembles work well in practice (e.g., Model Soups); weight-space ensembles, in particular, are very useful because they don't increase parameter count/latency.
How does the overadaptation phenomenon relate to catastrophic forgetting? The paper mentions catastrophic forgetting but doesn't fully distinguish between overadaptation and catastrophic forgetting.
How does the sparsity assumption in Condition 2 relate to real-world fine-tuning scenarios? The theoretical setup assumes Task 2 (fine-tuning) has a "sparse" structure with many zero eigenvalues compared to Task 1 (pre-training). While this makes sense intuitively, is there a way you could measure this with LLMs?
Given that your theoretical analysis is based on linear models, which aspects of your findings (if any) do you think might not directly transfer to non-linear neural networks? Are there specific mechanisms in non-linear networks that might affect the benefits of ensembling?
How does your ensembling approach compare to parameter-efficient fine-tuning methods (e.g., LoRA, adapters) in terms of mitigating overadaptation? Could ensembling be combined with these approaches for additional benefits?
For LLMs, SFT is primarily an "intermediate stage" for post-training. Do you think your findings have valuable implications for RLHF/RLVR?

论据与证据

The theoretical claims are supported by proofs. The theory considers a simplified linear setting while the paper's overarching story is about large neural networks. The paper provides some empirical evidence with LLM benchmarks.

方法与评估标准

The choice of benchmarks is appropriate: MT-Bench (instruction-following), Commonsense-QA (general knowledge), and MMLU (multi-task understanding). The linear regression framework provides a controlled theoretical environment for studying the bias-variance trade-off.

理论论述

I skimmed the statements and proofs and did not find any specific issues. I am not an expert here.

实验设计与分析

I checked the soundness of the empirical evaluation setting.

补充材料

I skimmed the proof.

与现有文献的关系

While the authors present these findings as novel (“surprising,” “intriguing”), similar phenomena—such as improved fine-tuning performance through model averaging—have already been observed and extensively studied in prior work. Section 3 reproduces the phenomenon on LLM benchmarks, which may be considered a contribution but isn't surprising.

The main novelty here lies in the theoretical contribution (Section 5), offering an explicit formalization of how ensemble methods manage the bias-variance trade-off within an over-parameterized linear setting.

遗漏的重要参考文献

其他优缺点

The theory seems to be the main contribution of the paper. The experiments in section 3 feels disconnected from this core contribution. The theory could equally serve as a justification for any of the similar prior results on e.g. ImageNet models.
The paper provides a convincing theoretical justification for understanding why ensembling works.

其他意见或建议

Figure 1 takes a lot of space but isn't informative.
In place of Figures 2 and 3, I'd suggest drawing the interpolation-style plots in e.g., Fig 1 of "Robust fine-tuning of zero-shot models ". Currently it's unclear which ensembles are constructed from which fine-tuned checkpoints.

作者回复

2025-04-01

Thanks for the constructive suggestions! Additional experiments and explanations have been provided as follows.

The connection between theoretical and empirical results. Our experiment on LLM and theory match in showing the benefits of ensemble in fine-tuning. As fine-tuned parameters are close to pretrained points, we can adopt a linear setting with NTK explanation. Specifically, we can approximate the neural network $f(x, \alpha)$ as $f(x, \alpha_0) + \nabla_\alpha f(x, \alpha_0)^T (\alpha - \alpha_0)$ . Comparing this with our linear setting $y = x^T \theta$ , the features $\nabla_\alpha f(x, \alpha_0)$ can be interpreted as the input x in the linear model, and the trainable parameter $\alpha - \alpha_0$ as $\theta$ . Since $f(x, \alpha_0)$ remains unchanged during training, its effect can be disregarded in this simplification. Such simplifications can provide intuitive insights and have been widely used in prior works [1,2].

The high-dimensional assumption on x characterizes the “features” $\nabla_\alpha f(x, \alpha_0)$ in large models, i.e, there are several large eigenvalues and many small eigenvalues. It can be validated by the eigenvalue distribution of Hessian matrix in [3] [4]. Though bridging the gap between empirical results and theory is still challenging, this work takes a first step toward a theoretical framework, and we aim to develop more comprehensive explanations in future works.

Figure 1. Thanks for the suggestions! We will adjust its size accordingly. Figure 1 illustrates the effects of overadaptation in fine-tuning, showing that model performance degrades with additional training epochs. This phenomenon highlights the importance of mitigating overadaptation to enhance model performance.

Figure 2, 3. We have improved the figures accordingly and highlighted the checkpoint source of all ensembles (https://anonymous.4open.science/r/ICML2025-rebuttal-forgetting-1F68/README.md ). Additionally, interpolation weights of 0.0 and 1.0 are also included to cover the two extremes of the pure pre-trained and fine-tuned model respectively.

Extension to other ensembles. Weight-space and output-space ensembles are indistinguishable in linear models. Our theoretical results can be directly extended to ensembles of multiple fine-tuned models with properly chosen weights, under the assumption of same-distributed K fine-tuning datasets. The model obtained via model soup has the potential to reduce variance in fine-tuning risk, as well as both bias and variance in pretraining risk. Extending to those directions is interesting and can be future works.

Discussion on overadaptation and catastrophic forgetting. Overadaptation means overfitting on training samples during fine-tuning, leading to degraded performance on fine-tuning tasks. Forgetting occurs when fine-tuning on specific tasks reduces performance on pretraining tasks, effectively forgetting prior knowledge. As they are closely related, overadaptation can contribute to both fine-tuning performance drop and forgetting.

Sparsity assumption. The sparse structure observed in fine-tuning tasks reflects the nature of knowledge specialization across different inputs. While pretraining involves diverse inputs encompassing broad knowledge,fine-tuning is performed on specific tasks with a narrower scope,leading to a sparse structure. For the measurement, the eigenvalue distribution of the Hessian matrix can be analyzed in LLMs using PyHessian[3] and variants of Lanczos algorithms[4].

Linear and nonlinear models. One concern is that the simplification from nonlinear model to linear model is reasonable only in NTK regime. Our results may not be directly applicable out of NTK. In practical models, the choice of ensemble method,such as output-space average or parameter-space average (indistinguishable in linear models),impacts model performance. Another important factor is the use of input-dependent weights (router) in ensemble. Different strategies for designing routers can impact performance.

Extension to LoRA. Further experimental results on LoRA (https://anonymous.4open.science/r/ICML2025-rebuttal-forgetting-1F68/README.md ) show that:

1. Indeed, LoRA can mitigate overadaptation as well but tends to forget more in certain benchmarks, such as Commonsense-QA in comparison with DiffNorm-Penalty.
1. As predicted by the reviewer, the combination of LoRA and ensemble yields further performance improvement in all benchmarks.

Implication for RLHF/RLVR. Yes, these findings can be extended to RLHF, as observed in [5]. And we believe similar phenomena also exists in RLVR and will include relevant discussions.

[1] Benign, tempered, or catastrophic: A taxonomy of overfitting.

[2] Fine-tuning can distort pretrained features and underperform out-of-distribution.

[3] Pyhessian: Neural networks through the lens of the hessian.

[4] Why transformers need adam: A hessian perspective.

[5] Mitigating the alignment tax of rlhf.

最终决定Accept (poster)

2025-05-01

This paper explores the effectiveness of ensembling pretrained and fine-tuned language models to combat the issue of catastrophic forgetting in SFT. The authors demonstrate that ensembling helps retain general knowledge in language models. More notably, they uncover an overadaptation effect where the ensemble model not only preserves general capabilities but also outperforms the fine-tuned model on its own target domain. To explain this, the paper presents a theoretical analysis centered on the bias-variance trade-off, showing that ensembling strikes a more effective balance than traditional regularization techniques. The analysis, grounded in overparameterized linear models, is supported by empirical results that highlight the benefits of interpolating between pretrained and fine-tuned weights.

The paper received a positive recommendation for acceptance. One reviewer praised the significance and novelty of the findings, emphasizing their relevance to the broader community. A critical point raised by a negative reviewer was addressed in the authors’ rebuttal, which clarified the misunderstanding. The paper demonstrates known behaviour in vision for language models, which may not be as significant. In the final recommendation, I support acceptance, with a suggestion that the authors incorporate these clarifications, including issues raised by the reviewers, into the camera-ready version to ensure clarity and completeness.