Eq. (1), the main proposed theoretical framework in this work, has a severe technical flaw.
- Specifically, the authors analyze to discuss the performance of existing model merging methods.
- However, all the existing model merging methods mentioned in this work such as DARE [1], TIES-Merging [2], BitDelta [3], and so on, applied the edited delta parameter to the rather than .
- Therefore, the desired analysis should be conducted on the loss term such as or rather than the current form () to make any claims on the final downstream performance of existing merging methods.
Limited contributions
- Although the authors provide some generalization of existing methods, e.g., multiplies a magnitude hyperparameter in DARE framework, the novelty and innovativeness of these generalizations are too limited and the implications are also not surprising and uninformative. It seems like just reporting a result from engineering. Presenting more rigorous generalizations and providing profound implications from the proposed unified framework will improve the quality of this work significantly.
Unreasonable experiment setup
- Regarding the EXPO [4] method, the authors claim that the relative effectiveness of extrapolation and interpolation depends on the dataset, which shows the performances of interpolation and extrapolation over some NLP downstream tasks.
- However, the motivation of EXPO is focused on the alignment for enhancing the instruction-following capability of large language models, and the authors should conduct the experiment about EXPO on that kind of benchmark such as AlpacaEval 2.0 adopted in the EXPO paper.
Bad presentation and validity of claims
- The quality of some presentations is not good enough for the purpose of publication. For example, see Figure 3. It is much better to omit the pre-train models' performance here to highlight the more important parts -- a comparison between varying values.
- Moreover, although the authors make an argument based on some bar plots (Figure 1, Figure 4, Figure 7) they state the differences have some trend, and the absolute scale is too small among the comparison participants, which raises concerns about the statistical significance.

Reference

Language Models are Super Mario: Absorbing Abilities from Homologous Models as a Free Lunch, Yu et al. 2024
TIES-Merging: Resolving Interference When Merging Models, Yadav et al. 2023
BitDelta: Your Fine-Tune May Only Be Worth One Bit, Liu et al. 2024
Weak-to-Strong Extrapolation Expedites Alignment, Zheng et al. 2024