PaperHub
Average Rating: 6.7 / 10
Decision: Rejected · 3 reviewers
Ratings: 8, 6, 6 (lowest 6, highest 8, standard deviation 0.9)
Confidence: 3.3 · Correctness: 2.7 · Contribution: 2.3 · Presentation: 3.0
ICLR 2025

Structured-Initialization Learning

OpenReview · PDF
Submitted: 2024-09-27 · Updated: 2025-02-05

Abstract

Keywords: Efficient Learning

Reviews and Discussion

Official Review
Rating: 8

This paper proposes a novel way to combine the weights of multiple pre-trained LLMs into a smaller model. The problem it addresses is important, as it offers the opportunity to train a smaller specialized model faster rather than training it from scratch.

The paper presents a set of theorems supporting its claims, which offers theoretical guarantees.

Strengths

  • This paper supports its claims with both theoretical and empirical results. The proposed theorems are proved in the appendix, and the proofs look correct (without an in-depth review).

  • This paper addresses the important problem of LLM reuse, which previously lacked efficient solutions. The proposed method appears to be better than naively averaging weights.

  • The paper marks different theorems and labels in color, which helps readers quickly locate and navigate them.

Weaknesses

  • The field of knowledge distillation is highly related and should be discussed in the related work section.

  • The use of colored blocks seems too extensive, which is not common in research papers, though this is not a major weakness.

Questions

  • Why is there a spike at γ = 0.5?

  • Are you assuming all other network elements are the same, e.g., activation functions and vocabulary size?

  • It would be better to add a discussion of knowledge distillation.

Comment

Weaknesses:

W1: The field of knowledge distillation should be discussed in the related work section, which is highly related; and it would be better to add a discussion of knowledge distillation.

Our Response:

We appreciate the reviewer highlighting the relevance of knowledge distillation to our work. In response, we have expanded the related work in Appendix G on page 32 to include a detailed discussion of knowledge distillation.

Comparison Between Knowledge Distillation and SAIL:

  • Knowledge Distillation: This technique involves training a smaller student model to replicate the behavior of a larger teacher model by mimicking its output logits. The primary focus is on transferring output-space knowledge to achieve model compression and efficiency.
  • SAIL (Our Approach): Unlike knowledge distillation, SAIL focuses on parameter-space knowledge transfer. It directly transforms and integrates parameters from multiple pre-trained models to initialize a new model. This method leverages the collective knowledge embedded in the parameters, rather than relying solely on output alignment.

Complementary Nature:

While both approaches aim to transfer knowledge from existing models, they operate at different levels. Moreover, knowledge distillation can be utilized as a subsequent optimization step following SAIL, where the merged parameters are further fine-tuned or re-trained to align with the target task's specific requirements.
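For illustration, here is a minimal PyTorch-style sketch of how a standard output-space distillation step could follow a SAIL-style merged initialization. The temperature `T`, mixing weight `alpha`, and the `student`/`teacher`/`optimizer` objects are illustrative assumptions, not part of the paper:

```python
import torch
import torch.nn.functional as F

def distillation_step(student, teacher, batch, optimizer, T=2.0, alpha=0.5):
    """One optimization step mixing a task loss with output-space distillation.

    Illustrative sketch: `student` is assumed to be initialized with SAIL-merged
    parameters, `teacher` is one of the original pre-trained models.
    """
    inputs, labels = batch
    student_logits = student(inputs)
    with torch.no_grad():
        teacher_logits = teacher(inputs)

    # Standard task loss on the target data.
    task_loss = F.cross_entropy(student_logits, labels)

    # KL divergence between temperature-softened output distributions.
    kd_loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)

    loss = (1 - alpha) * task_loss + alpha * kd_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```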

W2: The use of colored blocks seems too extensive, which is not common in research papers, though not a major weakness.

Our Response:

We appreciate your comment regarding the colored blocks. While we aimed to enhance readability and highlight key conclusions for readers, we acknowledge your point about academic convention. We will consider adopting a more standard formatting approach in future revisions.

Questions:

Q1: Why is there a spike at γ=0.5?

Our Response:

We acknowledge the reviewer's observation regarding the spike at γ = 0.5 in the validation loss curves depicted in Figure 2b. This spike is indicative of the complex dynamics involved in merging model parameters and can be attributed to the following factors:

Explanation for the Spike at γ = 0.5:

  1. Interference Between Divergent Representations:
    • At γ = 0.5, the parameters from the two pre-trained models are weighted equally. If the models have learned significantly different or even conflicting representations due to their training on distinct datasets (as is the case with D_1 and D_2), the direct average can result in parameter interference.
    • This interference can temporarily degrade the model's performance, manifesting as an increase in validation loss.
  2. Lack of Alignment in Parameter Space:
    • Equal weighting does not account for the relative relevance or compatibility of each model's parameters with respect to the target dataset D_t.
    • Without proper alignment, the superposition of parameters may not produce a coherent initial model, leading to suboptimal performance at γ = 0.5 (a toy illustration of this interference effect follows below).
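To make the interference intuition concrete, here is a toy NumPy illustration (not taken from the paper): a one-dimensional non-convex loss with two minima at θ = ±1 has zero loss at either "pre-trained" solution, but the equal-weight merge at γ = 0.5 lands on the barrier between them.

```python
import numpy as np

def loss(theta):
    """Toy non-convex loss with two minima at theta = -1 and theta = +1."""
    return (theta ** 2 - 1.0) ** 2

theta_1, theta_2 = -1.0, 1.0  # two "pre-trained" solutions
for gamma in np.linspace(0.0, 1.0, 9):
    merged = gamma * theta_1 + (1.0 - gamma) * theta_2
    print(f"gamma={gamma:.3f}  merged={merged:+.3f}  loss={loss(merged):.3f}")
# The loss is 0 at gamma = 0 and gamma = 1 but peaks at 1.0 for gamma = 0.5,
# mirroring the interference spike seen when two divergent models are averaged.
```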

Q2: Are you assuming all other network elements are the same, e.g., activation functions, vocabulary size?

Our Response:

Yes. In our experiments there is a fixed target model architecture with its corresponding activation function and vocabulary size; we directly inherit these settings from the target model rather than converting them from the source models.

Comment

Dear Reviewer,

As the deadline approaches, we kindly request your response to our feedback at your earliest convenience. Thank you for your time and consideration.

Best regards

Comment

Dear Reviewer,

We sincerely appreciate your detailed review of our manuscript. Given the limited time remaining in the discussion phase, we would welcome the opportunity to promptly address your concerns and discuss any additional questions you may have to ensure our responses are thorough and satisfactory.

We look forward to your timely feedback.

Best regards, Authors

Official Review
Rating: 6

This paper proposes Structured-Initialization Learning (SAIL), a method to accelerate training for large models by reusing parameters from pre-trained models. The approach includes transforming parameters to fit the target model and integrating them to form a better starting point for training, reducing the need for random initialization.

Strengths

  1. The paper provides a solid theoretical analysis, effectively demonstrating how the proposed Proximal Parameter initialization leads to faster convergence. The authors present well-structured convergence theorems, lending strong support to the efficacy of SAIL in reducing training time and improving efficiency.
  2. The method is tested on both NLP and computer vision tasks, demonstrating applicability across different domains and model architectures.

Weaknesses

  1. Limited novelty in leveraging pre-trained models for initialization: the proposed method of reusing parameters from pre-trained models to accelerate the training of new models is similar to existing work [1-2].
  2. The motivation and system design in Figure 1 claim to use pre-trained models such as LLMs; however, the actual experiments train models from scratch on small-scale datasets in a controlled setup. It would be good to actually use pre-trained models from Hugging Face to create SAIL.
  3. The explanation of how the parameter transformation is conducted is not clear enough: the authors mention random projection and learnable methods, but a detailed investigation with experiments on which method is better is not included.

References

[1] Initializing Models with Larger Ones. ICLR 2024.

[2] Scaling Smart: Accelerating Large Language Model Pre-training with Small Model Initialization.

Questions

  1. Could the authors provide the training curves of the NLP experiments, as shown for the CV experiments? This would help verify whether the proved fast convergence also applies to transformer training.

Comment
  1. Integration of Multiple Pre-trained Models Beyond Single-Model Scaling:
    • SAIL is designed to integrate knowledge from multiple pre-trained models, potentially trained on different datasets or tasks, to form a more comprehensive initialization.
    • Xu et al. (2023) focus on transferring weights from a single larger model to a smaller one via weight selection. Their method does not generalize to integrating multiple models.
    • Samragh et al. (2024) scale up a single small model through HyperCloning, which involves duplicating weights symmetrically but does not consider merging different models.

References:

[1] Xu Z, Chen Y, Vishniakov K, et al. Initializing models with larger ones[C]//The Twelfth International Conference on Learning Representations. 2023.

[2] Samragh M, Mirzadeh I, Vahid K A, et al. Scaling smart: Accelerating large language model pre-training with small model initialization[J]. arXiv preprint arXiv:2409.12903, 2024.

W2: The motivation and system design of Figure 1 claims to use a pre-trained model such as LLM, but experiments are conducted by training the model from scratch on small-scale datasets in a controlled setup. It would be good to actually use the pre-trained models in Hugging Face to create the SAIL.

Our Response:

We appreciate the reviewer’s suggestion. In our initial submission, we focused on controlled experiments to isolate and understand the effects of SAIL. To enhance the robustness of our work, we have now incorporated experiments using pre-trained models from Hugging Face, specifically the OLMo model. The detailed results are presented in Appendix I on page 34.

W3: The explanation of how the parameter transformation is conducted is not clear enough; the authors mentioned random projection and learnable methods, but a detailed investigation and experiments on which method is better are not included.

Our Response:

Thank you for highlighting the need for a clearer explanation of our parameter transformation process.

In our work, we utilize a learnable linear projection method to map weights from pre-trained models to the target model architecture. We have updated the manuscript to include a comprehensive explanation of the learnable linear projection method in Appendix F on page 30. Additionally, we have conducted experiments comparing learnable linear projection with random projection, demonstrating the superiority of the learnable approach in terms of convergence speed and final performance.
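For concreteness, the following is a minimal PyTorch sketch of the two projection styles, assuming the transformation acts as a linear map applied to each source weight matrix along its output and input dimensions; the exact parameterization and training objective in Appendix F may differ.

```python
import torch
import torch.nn as nn

def random_projection(w_src, d_out_tgt, d_in_tgt):
    """Map a source weight matrix to the target shape with fixed Gaussian maps."""
    d_out_src, d_in_src = w_src.shape
    p_out = torch.randn(d_out_tgt, d_out_src) / d_out_src ** 0.5
    p_in = torch.randn(d_in_tgt, d_in_src) / d_in_src ** 0.5
    return p_out @ w_src @ p_in.T  # shape: (d_out_tgt, d_in_tgt)

class LearnableProjection(nn.Module):
    """Same mapping, but with the projection matrices treated as parameters,
    to be optimized (e.g., against a reconstruction or proxy loss) before merging."""

    def __init__(self, d_out_src, d_in_src, d_out_tgt, d_in_tgt):
        super().__init__()
        self.p_out = nn.Parameter(torch.randn(d_out_tgt, d_out_src) / d_out_src ** 0.5)
        self.p_in = nn.Parameter(torch.randn(d_in_tgt, d_in_src) / d_in_src ** 0.5)

    def forward(self, w_src):
        return self.p_out @ w_src @ self.p_in.T

# Illustrative usage: map a 512x512 source weight matrix to a 768x768 target.
w_src = torch.randn(512, 512)
w_tgt_random = random_projection(w_src, d_out_tgt=768, d_in_tgt=768)
w_tgt_learned = LearnableProjection(512, 512, 768, 768)(w_src)
```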

Questions:

Q1: Could the authors provide the training curves of the NLP experiments as shown in the CV experiments? It helps to verify if the proved fast convergence is also applicable to transformer training.

Our Response:

Thank you for this insightful suggestion. We have included the training curves for the NLP experiments in Figure 7 on page 37 of the revised manuscript. These curves display both the training loss and validation perplexity across epochs for models initialized with SAIL compared to those with random initialization.

Additionally, we have conducted a comprehensive comparative analysis of SAIL against weight transformation and linear model merging approaches, with detailed results presented in Tables 1, 2, and 3 of the General Response section. The comparative metrics demonstrate superior performance across multiple evaluation criteria. These empirical findings corroborate that the rapid convergence properties initially observed in our computer vision (CV) experiments generalize effectively to transformer-based Natural Language Processing (NLP) architectures.

Comment

I want to thank the authors for their rebuttal, which addressed my concerns and provided more solid evidence. I have increased my score from 5 to 6.

Comment

We greatly appreciate your feedback and consideration. We would like to share some new results that we have recently obtained, which are detailed in the general comments part. We hope these additional results will be of interest and provide further insights for you.

Comment

Weaknesses:

W1: Limited novelty in leveraging pre-trained models for initialization; similarity to existing work [1][2].

Our Response:

We appreciate the reviewer’s feedback regarding the novelty of our approach in leveraging pre-trained models for initialization and the perceived similarity to existing works [1][2]. We would like to clarify the distinct contributions and theoretical advancements that Structured-Initialization Learning (SAIL) introduces, setting it apart from prior methods, and we will expand our discussion of [1] and [2] in the manuscript. We note that [2] is concurrent work with ours and has not been open-sourced.

Distinction from Existing Work:

  1. Comprehensive Parameter Transformation and Architecture Alignment:
    • SAIL introduces a dual-faceted parameter transformation technique that adjusts both the width (number of neurons per layer) and depth (number of layers) of pre-trained models to match the architecture of the target model. This allows for the integration of multiple pre-trained models with varying architectures and sizes, enabling a seamless amalgamation of diverse knowledge bases.

    • In contrast, Xu et al. (2023) [1] propose a method called Weight Selection, which primarily focuses on initializing smaller models by selecting subsets of weights from a larger pre-trained model. Their approach is limited to models within the same family and requires similar architectures, emphasizing downscaling rather than flexible architectural alignment.

    • Samragh et al. (2024) [2] introduce HyperCloning, a method that expands a small pre-trained model to a larger one through weight replication and symmetric initialization. Their technique involves duplicating neurons to match the larger model's dimensions but does not account for architectural differences beyond scaling width dimensions.

    • In SAIL, we formulate the parameter transformation as a linear mapping T_i for each pre-trained model's parameters θ_i:

      \tilde{\theta}_i = T_i(\theta_i) \in \mathbb{R}^d

      where d is the dimensionality of the target model's parameter space.

    • This transformation ensures that the parameters from different pre-trained models are projected into a common space, allowing for seamless integration regardless of their original architectures.

  2. Optimal Parameter Integration with Theoretical Guarantees:
    • SAIL defines the Proximal Parameter θ^P as an optimal linear combination of transformed parameters:

      \theta^P = \sum_{i=1}^K \gamma_i^* \tilde{\theta}_i

      where γ_i^* are the combination coefficients optimized to minimize the distance between θ^P and the target model's optimal parameters θ^*.

    • Our Theorem 3 provides explicit solutions for γ_i^* by solving:

      \gamma^* = \arg\min_{\gamma} \left\| \sum_{i=1}^K \gamma_i \tilde{\theta}_i - \theta^* \right\|^2

      subject to \sum_{i=1}^K \gamma_i = 1 and \gamma_i \geq 0 (a minimal sketch of this constrained optimization appears after this list).

    • The coefficients γ_i^* are determined based on the statistical distances (e.g., total variation distance) between the distributions of the pre-trained models and the target data, providing a theoretically optimal integration of knowledge.

    • In contrast, Xu et al. (2023) and Samragh et al. (2024) do not formulate an optimization problem for parameter integration. Their methods lack theoretical guarantees regarding the optimality of parameter initialization and convergence speed.

  3. Theoretical Convergence Analysis:
    • SAIL offers a rigorous theoretical analysis demonstrating that initializing with the Proximal Parameter θ^P leads to faster convergence to the optimal parameters θ^* compared to random initialization.

    • Theorem 2 in our work shows that the suboptimality after T iterations satisfies:

      J(\theta^{(T)}) - J(\theta^*) \leq (1 - \eta \mu)^T \left( J(\theta^P) - J(\theta^*) \right)

      where η is the learning rate and μ is the strong convexity parameter of the loss function J(θ).

    • This result indicates that by minimizing the initial suboptimality J(θ^P) - J(θ^*) through optimal parameter integration, SAIL accelerates the convergence of the training process.

    • Existing works do not provide such a convergence analysis or theoretical foundations for their initialization methods.
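For illustration, here is a minimal NumPy/SciPy sketch of the constrained least-squares problem from item 2 above; the SLSQP simplex solver is an illustrative choice, and the reference vector stands in for θ*, which is unknown in practice and must be approximated.

```python
import numpy as np
from scipy.optimize import minimize

def optimal_gamma(transformed_params, theta_star):
    """Solve min_gamma || sum_i gamma_i * theta_tilde_i - theta_star ||^2
    subject to sum(gamma) = 1 and gamma >= 0.

    `transformed_params`: list of K flattened parameter vectors theta_tilde_i.
    `theta_star`: flattened reference parameters (a proxy in practice).
    """
    Theta = np.stack(transformed_params, axis=0)  # (K, d)
    K = Theta.shape[0]

    def objective(gamma):
        residual = gamma @ Theta - theta_star
        return float(residual @ residual)

    constraints = [{"type": "eq", "fun": lambda g: g.sum() - 1.0}]
    bounds = [(0.0, 1.0)] * K
    result = minimize(objective, x0=np.full(K, 1.0 / K),
                      bounds=bounds, constraints=constraints, method="SLSQP")
    return result.x

# Illustrative usage with synthetic vectors.
rng = np.random.default_rng(0)
params = [rng.normal(size=128) for _ in range(3)]
gamma_star = optimal_gamma(params, theta_star=rng.normal(size=128))
print(gamma_star, gamma_star.sum())  # non-negative coefficients summing to 1
```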

Official Review
Rating: 6

This paper presents SAIL, a method to accelerate training by leveraging knowledge from pre-trained models using a structured initialization approach. It introduces a parameter transformation technique to adapt pre-trained model dimensions to the target architecture, alongside a proximal parameter integration strategy. Theoretical guarantees show the benefits of using transformed pre-trained parameters for faster convergence compared to random initialization as well as the guidance on obtaining the optimal parameters in the integration strategy. Experimental results across NLP and computer vision tasks confirm that SAIL reduces training time while improving model performance, supporting the efficacy and broad applicability of structured initialization for efficient large model training.

Strengths

  1. This paper offers strong motivation for addressing the challenges of efficient model initialization by harnessing pre-trained models, making a compelling case for its approach.
  2. Rigorous theoretical analysis and effective visualizations are used throughout, solidly supporting the paper’s claims and enhancing interpretability.
  3. Extensive experiments demonstrate that SAIL significantly outperforms random initialization, highlighting its effectiveness in reducing training time and improving model performance.

Weaknesses

  1. A major concern is the limited comparison with related methods. This paper’s approach aligns closely with areas like model reuse/expansion and model merging, both mentioned in the related work. Model expansion combined with proximal parameter integration or parameter transformation with model merging could potentially address the problem posed here. A comparison with methods from these areas would provide a more comprehensive evaluation of SAIL’s efficiency.
  2. The paper lacks discussion on the gap between its linear model theory and real-world application, which is crucial to understanding SAIL's limitations. For instance, it would be beneficial to clarify the practical computation of γ* and address situations where the required data isn't available, a common issue with open-source models that often lack access to the original training data.

Questions

  1. How do the authors justify the claim in lines 268-269 based on Theorem 2?
Comment

Weaknesses:

W1: Limited comparison with related methods in model reuse/expansion and model merging.

Our Response:

We appreciate your feedback regarding the need for a more thorough comparison with existing methods in model reuse, expansion, and merging, and we welcome the opportunity to elaborate on how our approach, SAIL, distinguishes itself from current techniques.

Distinguishing components of SAIL:

  • Flexibility: Traditional model reuse and expansion methods typically initialize larger models from single smaller ones through weight replication or knowledge distillation. In contrast, SAIL offers a versatile framework that integrates multiple pre-trained models of diverse architectures and sizes.
  • Parameter Transformation: SAIL employs parameter transformation techniques across both width and depth dimensions. While prior works have explored unidirectional parameter transfer strategies - either initializing larger models from smaller ones [3] or extracting subset parameters from larger models to initialize smaller ones [4] - these approaches are inherently constrained to single-direction transformations. In contrast, SAIL introduces a bidirectional parameter transformation framework that enables flexible and seamless transitions between models of varying scales.
  • Optimal Linear Merging: Our method introduces a systematic approach for the optimal linear merging of different models by calculating combination coefficients γ*. This process, grounded in our theoretical framework, allows for the integration of knowledge from heterogeneous models, a capability that existing methods lack.

Experiments:

  • Benchmark & Evaluation Metrics: We have expanded our experimental evaluation to include comparisons with prominent methods such as LIGO [1] and Model Soup [2], focusing on model expansion and merging. These methods were assessed using a comprehensive suite of natural language understanding and reasoning benchmarks to ensure robust performance analysis.
  • Results: As presented in General Response Tables 2 & 3, SAIL demonstrates superior performance and greater flexibility compared to the evaluated existing approaches.

References:

[1] Wang P, Panda R, Hennigen L T, et al. Learning to grow pretrained models for efficient transformer training[J]. arXiv preprint arXiv:2303.00980, 2023.

[2] Wortsman M, Ilharco G, Gadre S Y, et al. Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time[C]//International conference on machine learning. PMLR, 2022: 23965-23998.

[3] Du W, Luo T, Qiu Z, et al. Stacking Your Transformers: A Closer Look at Model Growth for Efficient LLM Pre-Training[J]. arXiv preprint arXiv:2405.15319, 2024.

[4] Xu Z, Chen Y, Vishniakov K, et al. Initializing models with larger ones[C]//The Twelfth International Conference on Learning Representations. 2023.

W2: The paper lacks discussion of the gap between its linear-model theory and real-world application, including the practical computation of γ* and situations where the required data isn't available.

Our Response:

We appreciate your feedback regarding the theoretical aspects of our work and their practical implications. In practical applications, computing the optimal weights γ* based on total variation distances between data distributions can be challenging. To address this, we have explored alternative methods for estimating γ*:

Alternative Estimation Methods for γ:

To address these challenges, we propose practical estimation methods that align with our theoretical framework:

1. Implicit Methods Using Model Outputs or Predictions:

Even without direct access to the original datasets, we can estimate the distances between models by leveraging their outputs on a shared dataset or synthetic data. Let \mathcal{V} be a validation set accessible during model deployment. We define empirical distances based on model outputs:

  • Empirical Distance Between Models:

    \hat{D}_{ij} = \frac{1}{|\mathcal{V}|} \sum_{x \in \mathcal{V}} \| f_{\theta_i}(x) - f_{\theta_j}(x) \|^2

  • Empirical Distance Between Models and Target:

    \hat{D}_{i D^\ast} = \frac{1}{|\mathcal{V}|} \sum_{x \in \mathcal{V}} \| f_{\theta_i}(x) - y_x^\ast \|^2,

    where f_{\theta_i}(x) is the output of model θ_i for input x, and y_x^\ast is the target output (if available).

Estimating γ*:

Using these empirical distances, we construct an estimated matrix \hat{H}:

\hat{H}_{ij} = \hat{D}_{i D^\ast}^2 + \hat{D}_{j D^\ast}^2 - \hat{D}_{ij}^2,

and compute:

\hat{\gamma}^\ast = \frac{\hat{H}^{-1} \mathbf{e}}{\mathbf{e}^\top \hat{H}^{-1} \mathbf{e}}.
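A minimal NumPy sketch of this estimation procedure, following the formulas above; the array shapes and the assumption that the estimated matrix is invertible are illustrative, and in practice the outputs would come from running the models on the shared validation set.

```python
import numpy as np

def estimate_gamma(model_outputs, target_outputs):
    """Estimate merge coefficients from model outputs on a shared validation set.

    `model_outputs`: list of K arrays of shape (|V|, out_dim), one per model.
    `target_outputs`: array of shape (|V|, out_dim) with reference outputs.
    """
    K = len(model_outputs)

    # Empirical distances between models and to the target.
    D_model = np.zeros((K, K))
    D_target = np.zeros(K)
    for i in range(K):
        D_target[i] = np.mean(np.sum((model_outputs[i] - target_outputs) ** 2, axis=1))
        for j in range(K):
            D_model[i, j] = np.mean(np.sum((model_outputs[i] - model_outputs[j]) ** 2, axis=1))

    # Build H and solve gamma_hat = H^{-1} e / (e^T H^{-1} e).
    H = D_target[:, None] ** 2 + D_target[None, :] ** 2 - D_model ** 2
    e = np.ones(K)
    H_inv_e = np.linalg.solve(H, e)  # assumes H is invertible
    return H_inv_e / (e @ H_inv_e)
```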

Comment

I would like to thank the authors for their comprehensive explanations. Most of my concerns have been addressed, and I sincerely hope that these discussions will be included in the revised paper. However, I still have additional comments on the following two aspects:

Firstly, I understand that it is challenging to directly analyze non-linear neural networks in practice. As such, providing insights from linear models is acceptable. However, it is important that the authors provide direct discussion or empirical verification (rather than indirect justification from final performance, which can be influenced by many other factors) to demonstrate how effective these insights from linear models are when applied to non-linear models.

Secondly, regarding the claim "Theorem 2 demonstrates that initializing with the proximal parameter θ_P leads to faster convergence compared to random initialization," I believe it should be stated more cautiously. The claim should indicate that the proximal parameter is likely to achieve faster convergence, rather than stating it will definitely do so. This qualification is necessary because there is no proof that the proximal parameter strictly outperforms random initialization by reducing the difference between θ_P and θ^* by a guaranteed margin. Additionally, since the upper bound can be loose, the theoretical guarantee may not translate directly to practical performance. Therefore, the claim should be expressed with appropriate uncertainty rather than absolute certainty.

Comment

Our Response to First Point of Comment:

Thank you for highlighting the importance of bridging the insights from linear models to non-linear neural networks. We would like to clarify that we have already conducted empirical experiments to validate the key theoretical results of SAIL in Section 4.4 and examined the training dynamics through changes in validation accuracy during training in Section 4.5.

To better address this concern, we have incorporated a new section in Appendix K on page 43 of the revised paper. This section presents an empirical validation of our theoretical findings using a non-linear model (a multi-layer perceptron with ReLU activations) on the Spirals dataset, a synthetic dataset known for its non-linear decision boundaries.

In response to your comments, we examine the detailed training dynamics of our SAIL in this section, focusing on the convergence rate of the loss, among other aspects.

Conclusion of Our Empirical Verification:

Experimental Setup. Parameters of two MLP models, θ_1 and θ_2, were independently pre-trained on Spirals datasets D_1 and D_2, respectively. Utilizing SAIL, we transformed and merged these parameters to obtain the proximal parameter θ^P, which was then used to initialize a new model for training on D^*.

Baselines. Recall that our SAIL calculates the merge ratio, denoted γ*, to integrate the parameters θ_1 and θ_2. For comparison, we manually design nine equidistant values for the merge ratio γ within the range [-1, 2]. Additionally, a randomly initialized model serves as a fundamental baseline.

Key Results:

  1. Faster Convergence:
    • The SAIL-initialized model θ^P consistently achieves lower training loss at each epoch compared to the randomly initialized model. The training loss curve for SAIL-initialized models drops more rapidly. Additionally, our optimal γ* approximately aligns with the best-performing γ values in loss reduction.
  2. Stable Optimization Trajectory:
    • The gradient norm for the SAIL-initialized model remains consistently lower during the initial epochs, indicating more stable and efficient updates. The gradient norms of SAIL-initialized models are significantly reduced compared to those with random initialization and those across different γ values.
  3. Parameter Proximity:
    • The proximal parameter θ^P constructed through SAIL is closer to the optimal parameters θ* than random initialization. This proximity is quantitatively supported by the inequality \| \theta^{\mathrm{P}} - \theta^\star \|^2 \leq \alpha \| \theta_{\text{Random}} - \theta^\star \|^2 with α < 1, demonstrating that θ^P resides closer to θ* in parameter space.
  4. Optimal γ Selection:
    • Our experiments systematically explored the impact of the merge ratio γ on training performance. Empirically, the optimal γ* not only minimized the training loss and maximized accuracy but also maintained the lowest gradient norms. This empirical optimum aligns closely with our theoretical predictions (a small sketch of the proximity and gradient-norm diagnostics follows below).
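For reference, a small PyTorch sketch of the two diagnostics used above (parameter proximity and gradient norm); the model and batch objects are placeholders, and θ* is approximated by a fully trained reference model, so this is an illustrative measurement utility rather than the paper's code.

```python
import torch

def flatten_params(model):
    """Concatenate all parameters of a model into a single vector."""
    return torch.cat([p.detach().reshape(-1) for p in model.parameters()])

def proximity_ratio(model_sail, model_random, model_star):
    """Ratio ||theta_P - theta*||^2 / ||theta_rand - theta*||^2 (below 1 means closer)."""
    theta_p, theta_r, theta_s = map(flatten_params, (model_sail, model_random, model_star))
    return ((theta_p - theta_s).norm() ** 2 / (theta_r - theta_s).norm() ** 2).item()

def gradient_norm(model, loss_fn, batch):
    """Global gradient norm for one batch, used to compare update stability."""
    inputs, targets = batch
    model.zero_grad()
    loss_fn(model(inputs), targets).backward()
    grads = [p.grad.reshape(-1) for p in model.parameters() if p.grad is not None]
    return torch.cat(grads).norm().item()
```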
Comment

2. Model State-Based Estimation:

We can exploit internal representations to estimate model similarities:

  • Activation-Based Distance:

    For each model θ_i and input x, let A_{\theta_i}(x) denote the activation at a particular layer. We define:

    \hat{D}_{ij} = \frac{1}{|\mathcal{V}|} \sum_{x \in \mathcal{V}} \| A_{\theta_i}(x) - A_{\theta_j}(x) \|^2.

3. Supervised Fine-Tuning (SFT) or Direct Preference Optimization (DPO):

During SFT or DPO, we can incorporate the estimation of γ into the training objective:

  • Modified Loss Function:

    L(\theta) = L_{\text{task}}(\theta) + \lambda \sum_{i=1}^n \gamma_i \| \theta - \theta_i \|^2,

    where L_{\text{task}} is the task-specific loss and λ is a regularization parameter.

Optimization Strategy:

  • The coefficients γ_i can be treated as learnable parameters, optimized jointly with θ.
  • Constraints γ_i ≥ 0 and \sum_{i=1}^n \gamma_i = 1 can be enforced using projection methods or by parameterizing γ with a softmax over unconstrained variables η_i: \gamma_i = e^{\eta_i} / \sum_{j=1}^n e^{\eta_j}. (A minimal sketch of this parameterization follows below.)
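A minimal PyTorch sketch of this softmax parameterization combined with the regularized objective from point 3; the regularization weight `lam`, the flattening of θ into a single vector, and the joint optimization of η with the model parameters are illustrative assumptions.

```python
import torch
import torch.nn as nn

class GammaRegularizedLoss(nn.Module):
    """Task loss plus gamma-weighted proximity terms, with gamma given by a
    softmax over unconstrained eta (so gamma_i >= 0 and sum_i gamma_i = 1)."""

    def __init__(self, task_loss, pretrained_params, lam=1e-3):
        super().__init__()
        self.task_loss = task_loss
        # Flattened parameter vectors of the pre-trained models (kept fixed).
        self.register_buffer("thetas", torch.stack(pretrained_params))  # (n, d)
        self.eta = nn.Parameter(torch.zeros(self.thetas.shape[0]))      # learnable
        self.lam = lam

    def forward(self, outputs, targets, theta_flat):
        gamma = torch.softmax(self.eta, dim=0)  # simplex constraint by construction
        proximity = (gamma * ((theta_flat - self.thetas) ** 2).sum(dim=1)).sum()
        return self.task_loss(outputs, targets) + self.lam * proximity
```

In training, η would simply be added to the optimizer alongside the model parameters, so γ is learned jointly with θ.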

Questions:

Q1: How do the authors justify the claim in lines 268-269 based on Theorem 2?

Our Response:

We offer a comprehensive analysis in Appendix C on page 22. We first provide an intuitive overview of how proximal parameter initialization θ^P reduces the initial distance to the optimal parameter θ*, thereby lowering the suboptimality of the loss function as indicated by Theorem 2. Specifically, we establish that:

\mathcal{J}(\theta^P) - \mathcal{J}(\theta^\star) \leq \frac{L}{2} \| \theta^P - \theta^\star \|_2^2

This shows that initializing with θ^P controls the initial loss gap through the parameter distance \| \theta^P - \theta^\star \|_2^2.

Following the intuitive explanation, we present a detailed proof that begins by bounding the parameter distances using Theorem 3, which provides a probabilistic guarantee that pre-trained parameters are closer to θ* than random initialization:

\Pr\left( \| \theta_i - \theta^\star \|_2^2 \leq \alpha \| \theta_{\text{rand}} - \theta^\star \|_2^2 \right) \geq 1 - O\left( \frac{\tau^2 + \beta}{\alpha} \right)

By selecting an appropriate α, we ensure that \| \theta^P - \theta^\star \|_2^2 is sufficiently smaller than \| \theta_{\text{rand}} - \theta^\star \|_2^2. We then relate these parameter bounds to the loss function's suboptimality using the smoothness and strong convexity properties, ultimately showing that:

\rho = \frac{\mathcal{J}(\theta^P) - \mathcal{J}(\theta^\star)}{\mathcal{J}(\theta_{\text{rand}}) - \mathcal{J}(\theta^\star)} \leq \frac{1}{2}

This ratio shows that the initial suboptimality under θ^P is at most half of that under random initialization, which in turn supports faster convergence of gradient descent.
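Combining Theorem 2 with this ratio makes the claim explicit: under the stated smoothness and strong-convexity assumptions, and on the high-probability event of Theorem 3,

\mathcal{J}(\theta^{(T)}) - \mathcal{J}(\theta^\star) \leq (1 - \eta\mu)^T \left( \mathcal{J}(\theta^P) - \mathcal{J}(\theta^\star) \right) \leq \frac{1}{2} (1 - \eta\mu)^T \left( \mathcal{J}(\theta_{\text{rand}}) - \mathcal{J}(\theta^\star) \right),

so, for small ημ, driving this upper bound below a fixed target requires roughly ln 2 / (ημ) fewer iterations than the corresponding bound starting from random initialization.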

Comment

Our Response to Second Point of Comment:

Thank you for your valuable feedback. We completely agree with your point regarding the cautious phrasing of our claim. As you suggested, the statement about faster convergence with proximal parameter initialization should indeed be expressed with more uncertainty. Specifically, as stated in Theorem 2, the advantage of initializing with the proximal parameter θ^P in terms of convergence speed is likely to occur, but it holds with high probability rather than as a guaranteed result. To address this, we revised the main text (lines 267 to 273) to clearly indicate that the convergence advantage of θ^P is probabilistic, and we provided an explanation to this effect. In Appendix C, we further provide a mathematical correction to our previous proof to explicitly account for the probabilistic nature of this result.

Regarding your second point, we would like to clarify that:

  • While we can assert, given the convexity assumptions and the specific choice of α in Theorem 1, that the initialization is likely advantageous in terms of the loss with high probability, there are limitations in maintaining this theoretical advantage throughout the entire iteration process. We cannot guarantee a consistent lower bound on the loss function after T iterations. The upper bound provided earlier is useful for the initial stages but may not remain tight as iterations increase. We have clarified this in Appendix C, discussing the potential reduction of the initial advantage over time.
  • Nonetheless, our method has demonstrated superior performance in various practical scenarios. Specifically, it significantly accelerates training of neural networks across different domains, such as natural language processing (see Section 4.4 on page 8 and Appendix I on page 36), image recognition (see Section 4.5 on page 10 and Appendix H on page 35), and image generation (see Appendix J on page 40).

We sincerely appreciate the reviewer's insightful suggestions and valuable feedback on both the empirical verification and theoretical aspects of our article. Your constructive comments have greatly contributed to enhancing the quality and clarity of our work. Thank you again for your thoughtful and detailed review. Should you have any further questions or require additional clarification, please do not hesitate to ask.

Comment

Dear Reviewer,

As the deadline approaches, we kindly request your response to our feedback at your earliest convenience. Thank you for your time and consideration.

Best regards

Comment

Dear Reviewer,

We sincerely appreciate your thorough review of our manuscript. While we have already undertaken new empirical investigations and theoretical refinements based on your feedback, we would welcome the opportunity to engage in further discussion during the remaining time in the discussion phase. Our aim is to ensure that our revisions fully address your concerns.

Best regards, Authors

Comment

General Response

We have conducted extensive experiments to evaluate the performance of our proposed method, SAIL, compared to various baseline methods. Due to space constraints, we provide a summary here and include detailed settings and results in Appendix I on page 34.

Comparison Methods:

  • Training from Scratch: Models initialized randomly and trained on the target tasks.
  • Existing Methods: We compared SAIL with methods such as LIGO (a weight transformation method) and Model Soup (a linear model merging method).

Benchmarks:

We evaluated our approach on a diverse set of NLP benchmarks to assess its performance across various linguistic and reasoning tasks:

  • PIQA: Assesses physical commonsense reasoning.
  • HellaSwag: Challenges models with complex multiple-choice questions requiring robust inference.
  • Winogrande: Focuses on pronoun resolution for contextual understanding.
  • SciQ: Tests comprehension of scientific texts.
  • ARC-Easy: Presents grade-school level science questions for factual knowledge application.
  • COPA: Measures causal reasoning by selecting plausible alternatives.

Overview of Results:

  • Table 1: Compares the accuracy of models trained from scratch versus models initialized with SAIL across various NLP benchmarks.
  • Table 2: Compares SAIL with LIGO, highlighting the effectiveness of our parameter transformation method.
  • Table 3: Presents a comparison between Uniform Soup, Greedy Soup, and SAIL, demonstrating that SAIL's γ* yields better performance.

Note: The best performance for each dataset is highlighted in bold.


Table 1: Training from Scratch vs. SAIL (Accuracy) 

| Dataset | Training from Scratch (%) | SAIL (Ours) (%) |
|---|---|---|
| PIQA | 51.96 | 61.92 |
| HellaSwag | 24.87 | 34.48 |
| Winogrande | 51.14 | 52.96 |
| SciQ | 22.10 | 70.30 |
| ARC-Easy | 27.54 | 42.63 |
| COPA | 58.00 | 63.00 |

Table 2:  LIGO vs. SAIL (Accuracy)

| Dataset | LIGO (%) | SAIL (Ours) (%) |
|---|---|---|
| PIQA | 52.29 | 61.92 |
| HellaSwag | 25.33 | 34.48 |
| Winogrande | 50.20 | 52.96 |
| SciQ | 23.90 | 70.30 |
| ARC-Easy | 29.65 | 42.63 |
| COPA | 55.00 | 63.00 |

Table 3: Uniform Soup vs. Greedy Soup vs. SAIL (Accuracy)

| Dataset | Uniform Soup (%) | Greedy Soup (%) | SAIL (Ours) (%) |
|---|---|---|---|
| PIQA | 54.80 | 57.73 | 61.92 |
| HellaSwag | 25.25 | 27.65 | 34.48 |
| Winogrande | 50.51 | 52.09 | 52.96 |
| SciQ | 51.90 | 59.70 | 70.30 |
| ARC-Easy | 27.19 | 34.91 | 42.63 |
| COPA | 57.00 | 51.00 | 63.00 |

Note: The best performance for each dataset is highlighted in bold. For perplexity, lower is better (indicated by ↓).

Comment

Dear Reviewers,

Based on your valuable feedback, we have expanded our experiments to enhance the training efficiency of diffusion models.

Specifically, building on the theoretical foundations of our Structured-Initialization Learning (SAIL) approach, we have discovered that a slightly modified version of SAIL can effectively leverage representation models (e.g., DINOv2 [1]) to significantly accelerate the training of the SiT diffusion model [2], a state-of-the-art generative model.

For benchmarking, we conducted experiments comparing our method to the recently proposed state-of-the-art diffusion model acceleration method, REPA [3]. This method employs a pre-trained representation model, DINOv2, to accelerate diffusion model training by 17.5× through representation alignment, garnering significant attention in the generative model community. Detailed experimental setups are provided in Appendix J on page 39.

Table: Comparison of the performance of our SAIL with the baseline SiT-B/2 model and SiT-B/2 augmented with REPA across various training iterations for ImageNet 256×256 generation.

| Model | #Params | Iter. | FID↓ | IS↑ | Prec.↑ |
|---|---|---|---|---|---|
| SiT-B/2 | 130M | 400K | 33.0 | 43.7 | 0.53 |
| REPA | 130M | 50K | 78.2 | 17.1 | 0.33 |
| SAIL (ours) | 130M | 50K | 67.6 | 20.5 | 0.34 |
| REPA | 130M | 100K | 49.5 | 27.5 | 0.46 |
| SAIL (ours) | 130M | 100K | 35.9 | 45.1 | 0.53 |
| REPA | 130M | 200K | 33.2 | 43.7 | 0.54 |
| SAIL (ours) | 130M | 200K | 19.78 | 81.9 | 0.64 |
| REPA | 130M | 400K | 24.4 | 59.9 | 0.59 |
| SAIL (ours) | 130M | 400K | 12.16 | 119.4 | 0.70 |

The experimental results in the table demonstrate that our SAIL significantly outperforms both REPA and SiT.

[1] Oquab, Maxime, et al. "Dinov2: Learning robust visual features without supervision." arXiv preprint arXiv:2304.07193 (2023).

[2] Ma, Nanye, et al. "Sit: Exploring flow and diffusion-based generative models with scalable interpolant transformers." arXiv preprint arXiv:2401.08740 (2024).

[3] Yu, Sihyun, et al. "Representation alignment for generation: Training diffusion transformers is easier than you think." arXiv preprint arXiv:2410.06940 (2024).

AC Meta-Review

This submission discusses the resource challenges of developing and deploying LLMs. To address this, the authors introduce SAIL, a method that accelerates training by leveraging publicly available pre-trained models. SAIL combines a parameter transformation technique for aligning pre-trained model parameters with target architectures and a proximal parameter integration strategy for efficient model initialization, significantly reducing training time and resource usage while maintaining or improving performance on downstream tasks. However, most experiments are conducted on small-scale benchmarks, and it is very hard to argue that nanoGPT is an LLM. We see some mismatch between the claims and the experimental validation.

Most reviewers acknowledge the motivation, the importance of the targeted problem, and the theoretical analysis. However, even the most positive reviewer seems mainly attracted by the LLM story ("This paper addresses an important problem of LLM reusing, which lacks efficient solutions before. The proposed method seems to be better than naive averaging weights.") and does not offer an in-depth analysis. One of the reviewers raised concerns about limited novelty. In our view, the major limitation of this submission lies in the insufficient large-scale experimental validation and the mismatch between what is claimed and what is validated. The models that "have revolutionized natural language processing" are ones with at least billions of parameters.

Meanwhile, more comparisons with merging literature/methods (e.g., https://github.com/arcee-ai/mergekit) could greatly strengthen this submission, since merging serves as a key component in the proposed framework.

Lastly, in the current shape, both the abstract and introduction emphasize LLMs. However, more than 50% of the experiments and findings are on toy vision datasets like CIFAR. Reorganizing the overall story and toning down the emphasis on LLMs could improve this submission.

Additional Comments on Reviewer Discussion

The authors did a great job during the rebuttal. The paper could be further enhanced by incorporating all of the discussion and providing more experimental validation with large-scale models, which would match the claims.

Final Decision

Reject