Collective Model Intelligence Requires Compatible Specialization
Model merging is bottlenecked by incompatible features
Abstract
Reviews and Discussion
This paper explores certain phenomena related to feature merging within the MoE (Mixture of Experts) architecture. The authors have conducted validation experiments on the GPT-2 model using mathematical and code datasets, arriving at two primary conclusions: 1) as the expert models are trained, the similarity among them decreases, which degrades merging performance in the MoE architecture; 2) granting the router more freedom (enabling it to route more MLPs) within the MoE architecture improves the performance of the MoE model.
Strengths
- Investigating the impact of similarity among expert models in MoE on merging performance is a valuable direction.
- The concept of cross-layer routing is intriguing and holds potential for developing new methodologies.
Weaknesses
- Novelty: The conclusions drawn are not particularly surprising, as similar and more in-depth discussions in MoE research [1,2,3] already exist, especially concerning the form of the router. Additionally, their discussions often relate to practical models or applications, whereas this paper merely performs validation, leaving the content somewhat lacking. I strongly encourage the authors to advance their research further, particularly with the cross-layer router.
- Scope of the paper: The experiments conducted by the authors are based solely on the MoE architecture for GPT, focusing on feature merging. There is a significant gap between this and another major area of model merging: parameter merging. Moreover, feature merging encompasses many methods not related to MoE. Whether the conclusions drawn from MoE can be generalized to methods across the entire field of model merging is debatable. I suggest the authors clearly specify their focus on MoE in the abstract and introduction to avoid potential misunderstandings.
- Experimental settings: The use of a relatively small GPT-2 model trained on mathematical and code tasks with SFT, and using validation loss as a metric, is not a common setting. Given the small scale, code and mathematical tasks are particularly challenging for the GPT-2 model, especially with SFT training, which raises concerns about potentially adverse effects on the experimental results. I recommend the authors consider using more traditional, simpler tasks or a larger-scale model.
- Typos: There is a conspicuous error in Figure 3.
[1] Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer
[2] GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding
[3] Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
Questions
- CKA (Centered Kernel Alignment) does not appear to be a commonly used metric for measuring feature distance. How does it differ from the more commonly used metrics, such as cosine similarity and PCA-based methods?
- As training progresses, the disparities between expert models become more pronounced. However, this not only poses the drawback of increasingly difficult communication among experts but also offers the benefit of further enhancing the experts' specialized capabilities. How do the authors perceive this balance?
Thank you for your detailed feedback and constructive comments. Your insights are greatly appreciated and will help us refine and improve the scope and presentation of our work. Below, we address each of your points:
Weaknesses
We acknowledge that our work builds on existing discussions in MoE research. Our focus is on the application of MoE methods to the problem setting of model merging. Could the reviewer please elaborate on "leaving the content somewhat lacking"? We would also argue that feature merging is related to MoE because, in an MoE model, the router dynamically averages the features of different models. It is related to the activation interpolation method: instead of a fixed coefficient that weights the features across both models, in an MoE setting we have a router function that predicts these weights.
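To make this concrete, below is a minimal, hypothetical sketch of the distinction (our own illustration for this response, not the exact implementation in the paper; the class and parameter names are assumptions):

```python
import torch
import torch.nn as nn

class RoutedMerge(nn.Module):
    """Hypothetical sketch: merge the outputs of two expert MLPs per token.

    With a fixed alpha this reduces to activation interpolation; with the
    learned linear router, the mixing weights are predicted from the input.
    """

    def __init__(self, d_model, fixed_alpha=None):
        super().__init__()
        self.fixed_alpha = fixed_alpha
        self.router = nn.Linear(d_model, 2)  # linear router over two experts

    def forward(self, h, expert_a, expert_b):
        out_a, out_b = expert_a(h), expert_b(h)
        if self.fixed_alpha is not None:  # static activation interpolation
            return self.fixed_alpha * out_a + (1 - self.fixed_alpha) * out_b
        w = torch.softmax(self.router(h), dim=-1)  # per-token routing weights
        return w[..., 0:1] * out_a + w[..., 1:2] * out_b
```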
In addition, we recognize the limitations of relying solely on cross-entropy (CE) loss to evaluate the merged model, particularly in the context of language modeling tasks. Due to the use of models with fewer than a billion parameters, we adopted CE loss as our evaluation metric. However, we agree on the importance of scaling to larger models. In support of this, we reference recent research highlighting a decline in merging performance as larger models are trained for more steps: https://arxiv.org/abs/2411.03055. We believe our analysis offers valuable insights that help explain these observations.
Questions
1. CKA vs. Commonly Used Metrics (Cosine Similarity, PCA-Based Methods)
CKA is a metric specifically designed to compare the similarity of feature representations across neural networks. Unlike cosine similarity, which measures pairwise alignment of individual vectors, CKA evaluates the similarity of entire feature spaces. We also tested mutual k-NN and reached similar conclusions.
CKA offers advantages such as:
- Invariant Properties: It is invariant to isotropic scaling, biases, and orthogonal transformations, making it robust to differences in network architecture or activation scaling.
- Global Alignment: It captures global alignment of representations rather than local similarities.
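For concreteness, here is a minimal sketch of linear CKA (following Kornblith et al., 2019; variable names are ours), where `X` and `Y` are feature matrices collected from the two models on the same inputs:

```python
import numpy as np

def linear_cka(X: np.ndarray, Y: np.ndarray) -> float:
    """Linear CKA between two feature matrices of shape (n_samples, dim).

    Rows must correspond to the same inputs fed to both models.
    """
    X = X - X.mean(axis=0)  # center each feature dimension
    Y = Y - Y.mean(axis=0)
    # ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F); invariant to orthogonal
    # transforms and isotropic scaling, unlike plain cosine similarity.
    hsic = np.linalg.norm(Y.T @ X, ord="fro") ** 2
    return hsic / (np.linalg.norm(X.T @ X, ord="fro")
                   * np.linalg.norm(Y.T @ Y, ord="fro"))
```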
2. Trade-offs Between Specialization and Communication
You raise an important point regarding the balance between specialization and communication. Increased specialization enhances the performance of individual experts on their respective tasks (e.g., math and coding) but may make communication and feature merging (e.g., on a joint task such as GSM-Hard) more challenging due to diverging representations. Our findings emphasize that representational compatibility is crucial for effective merging without compromising the benefits of specialization.
We hope these revisions and additional analyses address your concerns and strengthen the manuscript. Thank you again for your valuable feedback and for the opportunity to improve our work.
Thanks for the authors' response. However, most of my concerns have not been addressed. Let me clarify my main concerns once again:
- Scope: MoE (Mixture of Experts) can indeed be considered a type of model merging method (although this perspective is unusual), but conclusions drawn from experiments focusing solely on MoE cannot be directly generalized to the entire model merging field. Furthermore, the paper's introduction and abstract do not even mention MoE, yet they draw conclusions about "Model Merging" while the rest of the paper only conducts experiments on MoE.
- Experimental Setting: Using GPT-2 for experiments is not the issue. But why choose math and coding tasks, which are very challenging for GPT-2, with the result that only CE loss can be used as the evaluation metric? Why not select more common, simpler tasks like GLUE and NLI?
- Novelty: The conclusions the authors reached regarding MoE are not surprising. Furthermore, the authors did not propose any practical methods.
Thanks for the response again.
Thank you for the reviewer’s clarification. The conclusions we derive regarding the tradeoff between compatibility and specialization during fine-tuning were demonstrated using both activation interpolation and Mixture of Experts (MoE) methods. Additionally, our experiments identified MoE as the strongest merging method, which is why we focused on it in our study.
We agree that math and coding tasks are challenging for GPT-2. However, through our experiments, we found that these domains yielded strong merging performance on GSM-Hard. In contrast, attempts to merge models trained on biology and question-answering tasks did not perform as well. Our approach required three datasets: two to train the base skills and a third to evaluate the combination of these skills. We are open to exploring additional tasks, such as those from the GLUE and NLI benchmarks, as part of our future work.
Lastly, could you point us to any works that draw similar conclusions? We would be happy to include them in our related works section.
Thank you.
Dear Authors,
Thanks for the authors' response. Here are my responses and advice.
- Like Mixture of Experts, activation interpolation can be considered a type of model merging method, but neither can represent the entire field of model merging. I suggest that the authors explore more model merging methods or clearly clarify the scope in the abstract and introduction.
- I still hope that the authors can choose more general combinations of backbones and tasks, such as Llama and code-math tasks, which would greatly enhance the persuasiveness.
- The conclusions drawn in the article are indeed insufficient for ICLR, especially in the absence of a specific method. However, I find the cross-layer router proposed by the authors quite interesting and a promising direction for future research.
I appreciate the efforts of the authors, but I regret to say that I think this work still has several significant shortcomings that cannot be overlooked.
I sincerely hope that my suggestions will be of assistance.
Reviewer ibVT
This paper studies model merging with a focus on compatible specialization. The paper points out that as models specialise, the similarity of their internal representations, as measured by centered kernel alignment (CKA), diminishes, making it harder to merge them with other specialised models. The paper investigates model merging via dynamic routing across different layers but reports limitations of such methods as well.
Strengths
- The analyses conducted in the paper, involving multiple tasks and representation similarity, are insightful and motivate future work.
Weaknesses
- The experimental design of the study is somewhat weak:
  - The only routing function R studied in this work is a linear transformation, which may explain the plateau reported in Fig. 5.
  - The paper reports a plateau in Fig. 5 using only 3 points (standard, 2 layers, and 3 layers), but it is unconvincing to report a plateau using only 3 points.
  - Results are reported in validation cross-entropy loss, which is a disadvantage, as that may not necessarily translate to merging of expertise; one or more well-defined benchmarks should be chosen that require multiple specializations, to allow more meaningful and comparable results.
Questions
- Why are results reported in validation cross-entropy loss?
- In response to "Hence, having models directly being representationally compatible both internally and across models appears critical if we want feature space merging. There have been works to improve layer wise compatibility within models (Jiang et al., 2024b), yet we still have no clear way to ensure representations are compatible across models as they specialize during the finetuning process.", what are the trade-offs? Would this compatibility perhaps be at the cost of individual specialisation performance?
- Why are math and coding chosen? One may argue that both tasks have more in common than other tasks.
- Can you list the number of parameters in addition to performance in Table 1? Am I correct in stating that the fine-tuning method has less than half of the parameters of the merging method?
Thank you for your detailed feedback and constructive questions. We appreciate your insights, which will help us refine and improve the quality of our work. Below, we address each point raised:
Weaknesses
1. Limited Routing Function (R) Design and Loss Metric
This is a good point. We ran experiments with routing 4 experts as well. In addition, we tried an MLP for the router and did not notice significant performance gains. We also agree that the CE loss may not be ideal; however, given the size of the models we used, this is the most applicable metric. We would like to point out that similar findings have been observed in larger models (see Fig. 2 in https://arxiv.org/abs/2411.03055).
Questions
The reviewer raises an important point about the potential trade-offs between representational compatibility and specialization performance. Increasing compatibility across models may indeed constrain specialization, as models may need to retain shared structures or representations. This trade-off can limit the extent to which individual models can adapt to their specialized tasks.
In addition, we chose math and coding because we found a task (GSM-Hard) that requires both math and coding skills. Their shared commonalities also make it easier to merge these tasks. Finally, the fine-tuning method only trains the MLP layers, which are much larger than the routing function (just a linear transformation).
We hope these planned revisions and additional experiments address your concerns and strengthen the manuscript. Thank you again for your valuable feedback and for the opportunity to improve our work.
Previous works use parameter interpolation for merging models from two domains, which may fail if the CKA (Centered Kernel Alignment) similarity between the two models is too low. This paper proposes using a Mixture of Experts (MoE) approach that routes between two models as a new model merging strategy.
Strengths
The proposed MoE-based approach introduces a fresh perspective to the challenge of model merging by routing between domain-specific models. This strategy could address issues with prior methods that rely on simple parameter interpolation, particularly when models are poorly aligned in terms of CKA.
Weaknesses
- The paper only uses cross-entropy (CE) loss to evaluate the merged model, which may be insufficient to capture the model’s true performance, particularly in language modeling tasks. CE loss alone may not fully reflect the merged model’s language modeling and generation quality. Following guidelines from prior work ([1]), it would be beneficial to include additional metrics that measure both the intrinsic quality of the model (e.g., perplexity and accuracy) and the quality of generated text (e.g., diversity and coherence).
- Since the MoE approach increases the model’s parameter count (potentially doubling it compared to individual models), the observed reduction in loss may be due in part to this increase in model size rather than the efficacy of the MoE merging itself. Interpolation of parameters, by contrast, maintains the original model size, which could lead to an uneven comparison. It would be more informative to compare models of similar sizes to isolate the effectiveness of the MoE merging strategy.
- While the paper claims that a larger CKA divergence between models degrades the performance of interpolation-based merging, it lacks experimental results to support this claim. Testing model pairs with varying CKA levels (e.g., comparing interpolation results between a GPT-math and GPT-code model versus two similar GPT-code models) would strengthen the case for MoE merging as a preferable alternative. For example, additional experiments can be conducted to verify that when $\mathrm{CKA}(A_1, B_1) < \mathrm{CKA}(A_2, B_2)$, then $\mathcal{L}\big(\mathrm{merge}(A_1, B_1)\big) > \mathcal{L}\big(\mathrm{merge}(A_2, B_2)\big)$.
[1] Su, Y., Lan, T., Wang, Y., Yogatama, D., Kong, L., & Collier, N. (2022). A contrastive framework for neural text generation. Advances in Neural Information Processing Systems, 35, 21548-21561.
Questions
I'm curious what the CKA pattern would look like for an MoE-merged model compared to an interpolation-based merged model.
Specifically: How do CKA values between the MoE model and the original models (A or B) differ from those seen in interpolation-based merging? Further analysis on these patterns could reveal additional insights into the advantages of MoE merging over parameter interpolation, especially in cross-domain contexts.
Ethics Concerns Details
No Ethics concerns needed.
Thank you for your detailed and insightful feedback. We greatly value your suggestions, which will help us improve the quality and scope of our work. Below, we address each point of concern:
Weaknesses
1. Evaluation Metrics Beyond Cross-Entropy (CE) Loss
We acknowledge the limitations of using CE loss alone to evaluate the merged model, particularly for language modeling tasks. We used models with fewer than a billion parameters and consequently used CE loss as our metric. We do agree that scaling to larger models is important. To that end, we found a recent work that also observes a decline in merging performance for larger models as they are trained for more steps: https://arxiv.org/abs/2411.03055. We believe that our work provides insightful analysis that can be used to explain these observations.
2. Impact of Increased Parameter Count in MoE Approaches
We operate in the paradigm where we are provided a set of specialized fine-tuned models and want to explore the best merging strategy given these models. In this problem setting, we do not change the size of the models, so the underlying expert parameter count stays the same across merging methods.
3. Experimental Validation of CKA’s Role in Performance Degradation
We are not clear about the reviewer's question. We do show pairs of models' CKA similarity and their merging performance over the fine-tuning steps. In addition, what is the task used to calculate the loss in the proposed experiment?
Questions
1. CKA Patterns for MoE vs. Interpolation-Based Merging
The CKA pattern does not change based on the merging method; the CKA metric measures how similar two models' representations are. Could the reviewer please elaborate on the question?
We are grateful for your valuable feedback and hope these planned revisions and additional experiments address your concerns. Thank you again for the opportunity to improve our work.
Evaluation Metrics Beyond Cross-Entropy (CE) Loss
My point is that we should evaluate the model using additional metrics to better reflect its generation capabilities. This comprehensive evaluation approach is feasible regardless of model size.
Impact of Increased Parameter Count in MoE Approaches
My concern is that Params(MoE Merging Model) > Params(Each Model before Merging) = Params(Interpolation Merging Model). The MoE model is larger than the original models and the interpolation-merged model. Please clarify if this is not the case.
Experimental Validation of CKA’s Role in Performance Degradation
CKA can be computed both between different models and within a single model by comparing its layers. Here, I mean the CKA within a model across layers. In experiments, extract the layerwise features of the MoE merging model as $\{h_i\}_{i=1}^{L}$, then compute $\mathrm{CKA}(h_i, h_j)$ for any $i \neq j$. Then we get a pairwise CKA matrix.
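In code, the suggested analysis might look like the following sketch (an illustrative rendering of this proposal; the helper and variable names are assumptions):

```python
import numpy as np

def linear_cka(X, Y):
    # Linear CKA between two (n_samples, dim) feature matrices.
    X, Y = X - X.mean(axis=0), Y - Y.mean(axis=0)
    return (np.linalg.norm(Y.T @ X) ** 2
            / (np.linalg.norm(X.T @ X) * np.linalg.norm(Y.T @ Y)))

def layerwise_cka_matrix(features):
    # features: list of (n_samples, d_i) arrays, one per layer, same inputs.
    L = len(features)
    M = np.eye(L)  # CKA of a layer with itself is 1
    for i in range(L):
        for j in range(i + 1, L):
            M[i, j] = M[j, i] = linear_cka(features[i], features[j])
    return M  # M[i, j] approximates CKA(h_i, h_j)
```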
Since most of the concerns are not addressed, and no experiments are added, I would keep the score.
In this paper, the central argument arises as the authors highlight how fine-tuning models may hinder their parameter-wise combination (i.e., model merging). The authors present empirical evidence (via a CKA similarity measurement) of how specializing models via transfer learning causes their representations to diverge; hence, feature averaging algorithms will tend to be suboptimal for merging models in this setting.
The authors claim that it is crucial for models to communicate their (layer-wise) learned knowledge; they go on to state that feature averaging and representational alignment methods do not address this problem. As an alternative, they propose a mixture of experts (MoE) mechanism, focusing entirely on adequate routing. Overall, given two (pretrained) models, their method proposes to treat certain layers as experts and route them, instead of combining them with any interpolation coefficient. This method is, allegedly, flexible to support merging different architectures.
Experiments show that while the proposed routing-based merging approach does improve upon basic feature averaging methods, it still faces substantial limitations. Specifically, the authors observe that as models become more specialized through fine-tuning on distinct tasks (e.g., math vs. coding), the merged model’s performance on a cross-domain adaptation task initially improves but subsequently declines as representational divergence increases. The experiments also demonstrate that routing methods with more degrees of freedom—such as multi-layer and cross-layer routing—yield better performance than single-layer static interpolation, particularly when the models share similar internal structures. However, even these sophisticated routing techniques struggle when the internal layer representations of the models are highly dissimilar, ultimately reaching a performance plateau.
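As a rough illustration of the multi-layer/cross-layer routing described above, here is a hypothetical sketch (our own rendering under stated assumptions, not the paper's code; class and argument names are invented for illustration):

```python
import torch
import torch.nn as nn

class CrossLayerRouter(nn.Module):
    """Hypothetical sketch of cross-layer routing: at a given layer, a linear
    router mixes candidate MLPs drawn from nearby layers of both specialist
    models, instead of interpolating a single aligned pair of layers."""

    def __init__(self, d_model, candidate_mlps):
        super().__init__()
        # e.g., MLPs from layers l-1, l, l+1 of model A and of model B
        self.candidates = nn.ModuleList(candidate_mlps)
        self.router = nn.Linear(d_model, len(candidate_mlps))

    def forward(self, h):
        w = torch.softmax(self.router(h), dim=-1)                     # (B, T, C)
        outs = torch.stack([m(h) for m in self.candidates], dim=-1)   # (B, T, d, C)
        return (outs * w.unsqueeze(-2)).sum(dim=-1)                   # (B, T, d)
```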
Strengths
- Insightful Identification of Compatibility Issues: The paper highlights that representational divergence during fine-tuning can hinder model merging efforts, which is a fundamental issue when merging any kind of machine learning models; hence the paper is well motivated.
- The paper is generally well-written and offers useful analogies for the reader to ground his or her intuitions correctly.
- The reader can smoothly go over the paper, as academic language develops every concept with eloquence.
- Experiments are conducted on both, in-domain and out-of-distribution settings, which are typical scenarios to assess knowledge transferability between models.
Weaknesses
- There is no discussion of time and space complexity caveats.
- Arguing that models require compatible specialization is central to the purposes of this paper. Yet the concept remains ill-defined after the methodology section, which brings certain issues such as:
- no direct/intuitive connection between routing mechanisms and a neural model's ability to communicate its knowledge;
- lack of literature review on past methods that could, at least, somehow contribute to enhancing compatible specialization;
- While Figure 6 showcases how layer incompatibility increases between more distant layers, no concrete actions were taken to improve the experimental decisions; in particular, only conjectures were made to handle similar relative positions between models, without explicitly modifying their method.
- There are no experimental ablations against past model merging techniques, rather than just comparing the routing with feature averaging.
Questions
- The issue of merging task-specialized pretrained models has been explored before; therefore, it will be interesting to compare and contrast the presented compatible specialization with:
- $L^2$-SP: this method represents a preliminary fine-tuning-based approach, where regularization is added to retain pretrained knowledge: https://arxiv.org/abs/1802.01483;
- The question of whether two models are compatible, with each other, has already been studied exhaustively, through the lens of linear mode connectivity (LMC). For instance, see: https://arxiv.org/abs/2110.06296.
- Section 2.4 introduces CKA, yet it seems that a bunch of equations were just thrown at the reader without the needed context. I suggest revising this section by explaining step-by-step what CKA is doing.
- What is the difference between activation interpolation (see Section 2.2) and averaging weights with varying coefficients?
Ethics Concerns Details
No ethics concerns.
Thank you for your detailed and constructive feedback on our submission. We appreciate your acknowledgment of the strengths in our work, including the identification of compatibility issues, our experiments in diverse settings, and the clarity of our writing. Below, we address your concerns:
Weaknesses
1. Ill-defined Concept of Compatible Specialization
We recognize the importance of clearly defining compatible specialization and its connection to model communication. We will provide a more intuitive explanation, emphasizing how models must be specialized while maintaining compatibility for effective communication and collaboration.
Interactions in a latent space are non-trivial, as demonstrated by our analysis of existing model merging methods. These methods can be viewed as attempts to allow multiple models to share their capabilities. Among these, routing-based methods emerged as the most competent in our experiments. Routing enables dynamic weighting of features across models, offering a flexible approach to facilitate knowledge sharing. We will clarify this in the manuscript and emphasize how routing provides a foundation for achieving compatible specialization.
Furthermore, we acknowledge the need for a more comprehensive literature review; we have included the key works in model merging.
2. Conjectures Without Experimental Validation
While Figure 6 highlights the challenges of layer incompatibility, we agree that empirical actions to address these issues are underexplored. In our work, we experimented with incorporating pretraining data during fine-tuning to mitigate representational divergence, but this did not yield significant improvements. We recognize this as an open and important research direction and will include this discussion in the revised manuscript to acknowledge the limitations and motivate future exploration.
3. No Experimental Ablations Against Past Model Merging Techniques
Our focus was on merging techniques that do not require intermediate communication during fine-tuning. As such, we compared feature averaging, parameter averaging, and routing-based methods. While parameter merging is computationally efficient, our results highlight the additional capacity introduced by routing, making it the strongest merging method among those explored.
Questions
1. Comparison with Fine-Tuning-Based Methods
In our experiments, instead of directly regularizing fine-tuned weights to remain close to the base model, we incorporated pretraining data during fine-tuning and applied weight decay. However, we acknowledge the importance of exploring additional regularization methods to maintain compatibility with the base model. This is a critical and valuable direction, and we will expand on this in the discussion of future work.
2. Clarification of Section CKA
Thank you for pointing out the need for a clearer explanation. We revised the section and believe that the references provide a clear and detailed description of CKA. In addition, we tested mutual k-NN and found similar results.
3. Activation Interpolation vs. Weight Averaging
Activation interpolation (Section 2.2) operates on layer outputs, dynamically combining them based on routing decisions or a fixed weighting coefficient. In contrast, weight averaging with varying coefficients directly combines model parameters. We will expand this distinction in the manuscript to clarify their respective roles and mechanisms. We also reran some of our analysis using mutual k-NN, as done in https://arxiv.org/abs/2405.07987, and found that we reach the same conclusion as with CKA.
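To illustrate the distinction in code (a minimal sketch; the function names are ours, not from the paper):

```python
import torch

def average_weights(state_a, state_b, alpha=0.5):
    # Parameter-space merging: combines the weights themselves, yielding a
    # single model of the original size.
    return {k: alpha * state_a[k] + (1 - alpha) * state_b[k] for k in state_a}

def interpolate_activations(layer_a, layer_b, h, alpha=0.5):
    # Feature-space merging: both layers are evaluated and their outputs are
    # blended; a router would replace the fixed alpha with predicted weights.
    return alpha * layer_a(h) + (1 - alpha) * layer_b(h)
```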
4. Complexity
We acknowledge that routing methods require more memory and FLOPS compared to simpler parameter averaging methods. We will include a discussion of these trade-offs in the revised manuscript, providing additional context for the computational cost of routing-based approaches.
We hope these clarifications and revisions address your concerns and strengthen the manuscript. Thank you again for your thoughtful feedback and for giving us the opportunity to improve our work.
This paper investigates the challenges of merging specialized neural models, focusing on issues of representational similarity and compatibility. It highlights that common merging techniques, such as feature averaging, often struggle to produce effective merged models as the individual models become more specialized. This specialization leads to divergent feature representations, which limit the models' ability to combine effectively. The authors propose exploring dynamic routing-based strategies, where features are merged across multiple layers instead of using static, layer-by-layer averaging. Their empirical analysis, including the use of centered kernel alignment (CKA) metrics, underscores the limitations of current merging methods and emphasizes the need for new approaches that prioritize compatible specialization. This work signals the importance of inter-model compatibility in the development of effective model merging techniques.
Strengths
- Originality: This paper introduces a fresh perspective on model merging by emphasizing the importance of compatible specialization rather than direct feature merging. The application of centered kernel alignment (CKA) in this context is a novel approach that strengthens the paper’s analysis and arguments.
- Clarity: The paper is well-organized, with clear sections and helpful diagrams (e.g., Figures 1 and 4) that effectively illustrate the merging processes and highlight the limitations of current methods.
- Significance: By addressing the challenges of model merging and outlining future directions, this work provides valuable insights for advancing collective model intelligence. This is increasingly significant as fine-tuned, domain-specific models continue to proliferate. The paper’s emphasis on both computational efficiency and compatibility offers a meaningful contribution to the field.
Weaknesses
- Quality: The paper’s experimental scope is limited, and expanding it to include a wider range of tasks, such as those in SuperGLUE [1], would strengthen its findings. Additionally, comparisons with other benchmark Mixture-of-Experts (MoE) model merging methods, such as those presented in recent works [2, 3, 4, and 5], would help contextualize the contributions and insights of this paper within the broader field.
- Experimental Validation of CKA Insights: The paper uses CKA analysis to examine representational similarity (e.g., in Figures 2 and 6). However, recent works have raised concerns about the reliability of CKA in drawing conclusions based on its similarity values, suggesting that it can sometimes produce misleading results in neural network similarity analyses [6, 7, 8]. Given that CKA’s benefits depend on observing specific conditions (discussed in aforementioned works), further investigation into its validity in this context, or the inclusion of additional similarity metrics, would lend greater robustness to the findings.
- Findings of Section 4.3: Section 4.3 begins with the statement:
“When the performance in Figure 5 plateaus, we find that this is correlated with representational similarity between layers within a model and across models.”
However, the analysis in Figure 6 does not conclusively support a correlation between the performance plateaus observed in Figure 5 and representational similarity among adjacent layers. As Figure 6 shows only a single snapshot (likely for the "Routing with 3 Layers" setting; please clarify if that is not the case, as it is not mentioned in the text), observing CKA changes from "Standard Routing" to "Routing with 2 Layers" to "Routing with 3 Layers" would be necessary to conclude a correlation. If CKA values indeed increase between nearby layers across these configurations, then a correlation might be established. The current results as they stand only reconfirm previous observations that adjacent layers exhibit high CKA similarity, as shown in earlier studies [9 and 10].
- Scope of the Vision for Collective Intelligence: Although the paper’s proposal for collective intelligence in Section 5.2 (Routing Models in Input and Output Spaces) is intriguing, its practical impact is limited by a lack of concrete steps or preliminary experiments to advance this vision.
References
- Wang, A., et al. SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems
- Tang, A., et al. Merging Multi-Task Models via Weight-Ensembling Mixture of Experts
- Zheng, Y., et al. Layer-Wise Representation Fusion for Compositional Generalization
- D’Ascoli, S., et al. The False Promise of the Similarity Hypothesis in Neural Networks
- Kim, M., et al. Pitfalls of Measuring Neural Network Similarity with Centered Kernel Alignment
- Doimo, D., et al. Interpreting CKA as Similarity in Representations: The Importance of Calibration
- Raghu, M., et al. Do Vision Transformers See Like Convolutional Neural Networks?
- Kornblith, S., et al. Similarity of Neural Network Representations Revisited
Questions
- Practical Implications of Compatible Specialization: Beyond the discussions provided in the paper, how do the authors envision compatible specialization being achieved in practice? Are there potential architectures or training methods that might enhance inter-model compatibility?
- Routing Complexity Trade-offs: The results in Figure 5 suggest diminishing returns as routing complexity increases. Could the authors elaborate on potential architectural modifications or efficiency improvements to manage the trade-offs of increased routing complexity?
Minor Suggestions
- Line 343 (Figure 4 caption): The caption appears incomplete. Consider rephrasing to finish the sentence describing the complexity of routing to different layers.
- Line 411: It is written "we find that is is correlated". There is one too many "is" here.
- Consider Additional Citations: Adding citations to studies that critique the use of CKA in representational analysis would provide a balanced view and strengthen the experimental conclusions.
Thank you for your thoughtful and constructive feedback. We appreciate your detailed insights, which will help us strengthen our work. Below, we address your concerns:
Weaknesses
- Experimental Scope and Comparisons
We agree that expanding the experimental scope to include a broader range of tasks would strengthen the claims. We chose our tasks carefully to ensure we are testing novel generalization to an unseen task that requires knowledge from both base tasks.
Additionally, we appreciate your suggestion to compare our approach with recent benchmark Mixture of Experts (MoE) methods. We will reference and contextualize these works in the revised manuscript.
- Experimental Validation of CKA Insights
We acknowledge the ongoing discussion regarding the reliability of CKA for representational similarity analysis, as highlighted in recent works. In response, we will clarify the limitations of CKA in our manuscript and discuss the specific conditions under which we employed it. We also tested some of our analysis using mutual k-NN, as done in https://arxiv.org/abs/2405.07987, and obtained similar results.
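For reference, here is a minimal sketch of a mutual k-NN alignment metric in the spirit of https://arxiv.org/abs/2405.07987 (a simplified rendering under our own assumptions, not the exact implementation from that work):

```python
import numpy as np

def mutual_knn(X, Y, k=10):
    """Average overlap of k-nearest-neighbor sets computed in two feature
    spaces over the same samples; higher means more aligned representations."""
    def knn(Z):
        d = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)  # pairwise sq. dists
        np.fill_diagonal(d, np.inf)                         # exclude self
        return np.argsort(d, axis=1)[:, :k]
    nx, ny = knn(X), knn(Y)
    return float(np.mean([len(set(a) & set(b)) / k for a, b in zip(nx, ny)]))
```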
- Findings of Section 4.3
Thank you for pointing out the need for additional evidence to support the correlation between performance plateaus and representational similarity. To clarify, we find that as we route to more positionally distant layers across different models, the gains obtained plateau, signifying that layers too far apart are not beneficial for merging. This is correlated with the representational similarity across different layers. On the raised point that "observing CKA changes from 'Standard Routing' to 'Routing with 2 Layers' to 'Routing with 3 Layers' would be necessary to conclude a correlation": the CKA metric does not change regardless of which merging method we use, because it only measures how similar the models' representations are before they are merged.
- Scope of the Vision for Collective Intelligence
We appreciate your recognition of the potential impact of our proposal in Section 5.2. We agree that concrete steps or preliminary experiments are necessary to demonstrate the feasibility of routing models in input and output spaces. We wrote this section to outline a possible research direction based on our conclusions; it will require follow-up experimentation.
Questions
- Practical Implications of Compatible Specialization: Currently, there are works that obtain better compatibility through intermediate merging (https://arxiv.org/pdf/2411.03055); however, we do not know of a clear direction for improving compatibility without intermediate merging or communication during the fine-tuning process.
- Routing Complexity Trade-offs: Routing to more layers yields diminishing benefits. We hypothesize that a potential change could be to use more recurrence in the network, so that there is more compatibility across different positions in the network.
We are grateful for your valuable feedback and hope these revisions and clarifications will address your concerns. Thank you again for the opportunity to improve our work.
Thank you for your response and for addressing some of my concerns. While I appreciate the clarifications and the additional context provided, I find that several critical issues remain insufficiently addressed. Below, I outline my remaining concerns and the reasoning behind maintaining my current evaluation:
- Clarification on CKA Analysis: The authors clarified that the CKA values presented are pre-merging and not post-merging. However, this distinction was not explicitly clear in the original manuscript. Given that the observation of adjacent layers having high CKA similarity is well-documented in prior work, this analysis lacks novelty. Observing how representational similarity changes post-merging could have provided more actionable insights, as it would directly relate to the performance dynamics observed in the routing methods. Without this, the analysis primarily reiterates known phenomena without leveraging them to advance understanding or propose innovative strategies.
- Experimental Scope and Comparisons: While the authors acknowledge the importance of expanding the experimental scope and comparing against more recent MoE benchmarks, these aspects are still missing from the current work. Incorporating tasks like those in SuperGLUE or benchmarking against newer MoE methods would better contextualize the contribution and provide a stronger foundation for evaluating the proposed methods.
- Findings of Section 4.3: The authors’ explanation that CKA values do not change across different merging methods because they are measured pre-merging further highlights the limitations of this analysis. If the CKA metric remains constant regardless of the merging method, it undermines its utility in correlating representational similarity with the performance plateaus observed in Figure 5. This calls into question the strength of the conclusions drawn in Section 4.3. Without additional evidence, such as tracking representational dynamics post-merging, the claimed correlation remains speculative.
In summary, while the paper explores an interesting problem and provides valuable insights, it suffers from a lack of novelty in some key analyses, limited experimental scope, and insufficient evidence to support certain claims. The concerns regarding CKA reliability, the narrow range of evaluated tasks, and the speculative nature of some sections leave notable gaps in the work’s impact. I appreciate the authors' efforts to address these points, but I must regretfully maintain my current rating.
Response to Reviewer Feedback
Strengths
The reviewers highlighted several strengths of our work:
1. Insightful Identification of Compatibility Issues
- The paper effectively identifies the challenges of representational divergence during model merging and emphasizes the importance of "compatible specialization" for effective communication between specialized models.
2. Clear Writing and Accessibility
- Reviewers appreciated the clarity of the writing and the use of analogies, which make the concepts more accessible to readers.
3. Focused Experiments on Representational Compatibility
- The experiments were recognized for exploring compatibility in MoE-based GPT models across in-domain and cross-domain tasks, particularly math and coding.
Concerns Raised by Reviewers
1. Generalizability and Scalability
- Reviewers raised concerns about the use of small-scale GPT-2 models fine-tuned on challenging tasks like math and coding, suggesting that this experimental setup may limit the generalizability and scalability of the findings.
We agree with these statements and would like to scale up experiments in the future. We would like to point out that a similar finding has been observed in larger models (see Fig. 2 in https://arxiv.org/abs/2411.03055).
2. Evaluation Metrics
- The reliance on validation cross-entropy loss as the sole evaluation metric was questioned. Reviewers recommended incorporating additional benchmarks and metrics, such as task-specific accuracy, perplexity, and generation quality (e.g., diversity and coherence), to better assess the merging of expertise.
We agree with this point as well. We focused on finding datasets that were compatible, such as math and coding. Larger models would likely allow for testing more complex combinations of domains, such as medical and question answering tasks.
3. CKA as a Metric
- While the use of CKA was appreciated, some reviewers pointed out its limitations compared to more commonly used metrics, such as cosine similarity or PCA-based methods. They suggested additional validation of CKA's applicability in the study.
This is a good point raised by the reviewers; we also tested mutual k-NN, as used in https://arxiv.org/abs/2405.07987, and found similar results.
4. Trade-offs Between Specialization and Compatibility
- Reviewers emphasized the importance of exploring the trade-offs between ensuring representational compatibility and maintaining specialization, which was not deeply addressed in the current version of the paper.
We agree that while this work shows that the current paradigm of model merging has not achieved compatible specialization, we did not propose a way to mitigate this. Fine-tuning with the pre-training data did not help either. This is an important area for future work.
We updated our paper and attached it, along with extra figures showing our mutual k-NN experiment and routing with 4 layers.
Reviewers appreciated the paper's emphasis on the importance of compatible specialization rather than direct feature merging. They raised concerns about the scope and generalizability of the experiments (e.g., using only small-scale GPT-2 models on math and coding tasks), the reliance solely on validation cross-entropy loss as a performance metric, and the absence of comparisons with a broader range of model merging techniques. They also questioned the use of CKA as the sole similarity measure and pointed out that the findings, while interesting, lack concrete steps toward practical implementations or more comprehensive experimental evaluations. Overall, the reviewers see value in the direction but remain unconvinced that the current work fully justifies its claims or provides robust evidence.
Additional Comments from Reviewer Discussion
The authors acknowledged that the current experimental validation is limited. They added some experiments with additional metrics besides CKA, but it is not clear whether these resolve the reviewers' concerns about similarity metrics and their reliability in this setting.
Reject