S$^{2}$FT: Efficient, Scalable and Generalizable LLM Fine-tuning by Structured Sparsity
Abstract
Reviews and Discussion
This paper addresses the fact that current parameter-efficient fine-tuning methods cannot simultaneously achieve high-quality training, efficient training, and scalable LLM serving. It therefore proposes a family of structured sparse fine-tuning methods for LLMs that achieve state-of-the-art fine-tuning performance, training efficiency, and inference scalability at the same time. S2FT accomplishes this through "sparse selection and dense computation": it selects only a few heads in the MHA module and a few channels in the FFN module of each Transformer block, then rearranges the weight matrices of the coupled structures in the LLM so that the selected components are connected, yielding multiple compact, dense, trainable weight submatrices. S2FT only needs to update these submatrices for parameter-efficient fine-tuning. The paper emphasizes that although current methods such as LoRA and DoRA reduce memory, they underperform full fine-tuning on large language models, while the unstructured nature of SFT requires sparse matrix operations, so serving scalability and training efficiency cannot be guaranteed. This is the main motivation for proposing S2FT. In addition, the paper establishes an interface between S2FT and LoRA to support joint or separate computation paradigms.
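To make the mechanism concrete, here is a minimal PyTorch sketch of the "sparse selection and dense computation" idea on a single coupled FFN pair. The dimensions, function names, and the gradient-hook shortcut are illustrative assumptions rather than the authors' implementation; in particular, the paper's partial back-propagation computes gradients only for the selected dense block, whereas this sketch simply masks a full gradient for clarity.

```python
import torch
import torch.nn as nn

# Toy dimensions; real models use e.g. d_ff = 11008 for LLaMA-7B.
d_model, d_ff, n_selected = 512, 2048, 64

up = nn.Linear(d_model, d_ff, bias=False)    # rows of up.weight = intermediate channels
down = nn.Linear(d_ff, d_model, bias=False)  # columns of down.weight = the same channels

# Pick a few intermediate channels and build a permutation that moves them to the front.
selected = torch.randperm(d_ff)[:n_selected]
rest_mask = torch.ones(d_ff, dtype=torch.bool)
rest_mask[selected] = False
perm = torch.cat([selected, torch.arange(d_ff)[rest_mask]])

# Co-permute the coupled pair so the selected channels form one contiguous dense block;
# because the same permutation is applied to both matrices, the FFN function is unchanged.
with torch.no_grad():
    up.weight.copy_(up.weight[perm, :])      # permute rows of W_up
    down.weight.copy_(down.weight[:, perm])  # permute matching columns of W_down

# Stand-in for partial back-propagation: zero out gradients outside the dense block.
def keep_leading_rows(grad):
    out = torch.zeros_like(grad)
    out[:n_selected, :] = grad[:n_selected, :]
    return out

def keep_leading_cols(grad):
    out = torch.zeros_like(grad)
    out[:, :n_selected] = grad[:, :n_selected]
    return out

up.weight.register_hook(keep_leading_rows)
down.weight.register_hook(keep_leading_cols)

x = torch.randn(4, d_model)
down(torch.relu(up(x))).sum().backward()  # only the leading dense blocks receive gradients
```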
Strengths
The idea of this paper is novel and valuable. It starts from three problems that current LLM fine-tuning may encounter, namely training quality, training efficiency, and serving scalability, proposes a structured sparse fine-tuning method to accelerate large language model fine-tuning, and provides corresponding mathematical proofs. The method is effective for both in-distribution and out-of-distribution training. The highlights of this paper are as follows:
- The mathematical proofs are complete and the proof ideas are clear; both the in-distribution and out-of-distribution analyses are thorough. The conclusions follow clearly from the theoretical analysis, and the mathematical foundation is solid.
- The analysis is thorough. The literature review assesses the advantages and disadvantages of existing research, the method section is concise and clear, a complete theoretical proof is provided, and the experiments also analyze the advantages in training efficiency and serving scalability.
- The simplicity and clarity of the method are a highlight of this paper. Compared with more complicated fine-tuning methods, this paper only needs to select and rearrange the weight matrices of some layers for updating, and develops a partial back-propagation algorithm that requires very few lines of code. Simplicity is a major strength of this paper.
Weaknesses
The advantages of the S2FT method are prominent, and it provides a useful reference for efficient, scalable, and high-quality fine-tuning of large language models. However, the authors could consider the following points for improvement or further explanation:
- The notation in the proofs could be better organized. Readers may find the symbols somewhat messy when reading the full proofs, although the overall argument is complete and sound.
- There are few figures for the method; only the right panel of Figure 2 illustrates it, and the overall picture is not clear enough. In addition, does this method select the same layers for every model, or would it be better to select model-specific layers?
- This paper fixes the U matrix of the low-rank decomposition, which is a good attempt. However, for different datasets, will the data distribution affect the form of the U matrix and its basis?
- In the experimental results, there are several cases where the proposed method scores lower than previous methods. Could the authors analyze each task more thoroughly, since the tables show that some tasks degrade noticeably?
Questions
- The notation in the proofs could be better organized. Readers may find the symbols somewhat messy when reading the full proofs, although the overall argument is complete and sound.
- There are few figures for the method; only the right panel of Figure 2 illustrates it, and the overall picture is not clear enough. In addition, does this method select the same layers for every model, or would it be better to select model-specific layers?
- This paper fixes the U matrix of the low-rank decomposition, which is a good attempt. However, for different datasets, will the data distribution affect the form of the U matrix and its basis?
- In the experimental results, there are several cases where the proposed method scores lower than previous methods. Could the authors analyze each task more thoroughly, since the tables show that some tasks degrade noticeably?
Limitations
The paper discusses some of its limitations. For example, although various model architectures exhibit coupled structures, it does not explore extending S2FT to other architectures. In terms of deployment, although the work verifies the feasibility of scalable serving, a practical, scalable serving system is still missing. In addition, when using this method, is it necessary to consider the fairness or importance of each layer? For example, in a large language model the first layers may contain more information, and some information in the middle layers could be appropriately discarded. Second, whether the method is truly quality-oriented needs consideration, because S2FT does not achieve the best performance on every task in the experiments. Finally, the paper also mentions that a future optimization direction may target networks with residual dependencies, and how to apply S2FT to this type of network needs further consideration. Since this work focuses on PEFT, it reduces GPU computing resource consumption; the method therefore has the potential to positively impact the computing environment by minimizing the resources required for fine-tuning LLMs.
Thank you for your valuable comments and suggestions. We will carefully revise our paper based on your comments. Our responses to your questions are detailed below. We would greatly appreciate your input on whether our revisions address your concerns.
Q1: The symbols in the proof process of this paper can be slightly organized.
A1: Thank you for pointing out this issue. In the camera-ready version of our work, we will simplify the symbols and add additional notations before our proof to improve readability.
Q2: There are few figures for the method in this paper, and the overall picture is not clear enough.
A2: Thank you for your suggestion. We will include more figures to illustrate our selection and permutation strategy specific to transformer architectures in detail. The current Figure 2 serves as an abstract demonstration of each coupled structure in a standard Transformer model. Additionally, we will ensure that our figures are clearer in the camera-ready version.
Q3: Does this method extract the same layer for each different model?
A3: Our method employs the same extraction strategy for all models discussed in our paper. Specifically, we always extract all feed-forward network (FFN) layers for Transformer models, with the trainable parameters uniformly distributed across the Up, Down, and Gate Projection Layers. This approach is based on our findings in Figure 4, which show that fine-tuning FFN layers results in better performance compared to fine-tuning Attention layers under the same budget. To demonstrate the effectiveness of our uniform allocation strategy among FFN layers, we have included an additional ablation study in Table R1.
In Table R1, we have added an ablation study regarding layerwise allocation of trainable parameters. This study includes the following design patterns:
- Increasing: the number of trainable parameters in every layer gradually increases (or remains the same);
- Uniform: the number of trainable parameters in every layer is the same;
- Decreasing: the number of trainable parameters in every layer gradually decreases;
- Random One: only one randomly selected layer has trainable parameters.
The results show that maintaining a uniform distribution of trainable parameters among different layers leads to the best performance. We will include these results in the camera-ready version of our paper if accepted.
Table R1: Performance of different layer allocation strategies on commonsense reasoning tasks.
| Allocation Strategy | Average Accuracy |
|---|---|
| Increasing | 81.2 |
| Uniform | 81.8 |
| Decreasing | 80.4 |
| Random One | 79.9 |
While our allocation strategy is both model- and task-independent, there is great potential to further improve performance by selecting trainable parameters based on model weights or downstream tasks. We leave this problem for future research.
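For illustration, the allocation patterns compared in Table R1 could be generated by a simple helper like the hypothetical one below; the function name and the linear increasing/decreasing schedules are assumptions for exposition, not the authors' code.

```python
# Hypothetical helper for the layer-wise budget patterns in Table R1: split a total
# channel budget across layers as increasing, uniform, or decreasing counts
# (uniform is the strategy used in the paper).
def allocate_channels(total: int, num_layers: int, pattern: str = "uniform"):
    if pattern == "uniform":
        weights = [1.0] * num_layers
    elif pattern == "increasing":
        weights = [i + 1 for i in range(num_layers)]
    elif pattern == "decreasing":
        weights = [num_layers - i for i in range(num_layers)]
    else:
        raise ValueError(f"unknown pattern: {pattern}")
    s = sum(weights)
    budgets = [int(round(total * w / s)) for w in weights]
    budgets[-1] += total - sum(budgets)  # fix rounding drift so budgets sum to `total`
    return budgets

print(allocate_channels(total=1024, num_layers=32, pattern="uniform"))  # 32 channels per layer
```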
Q4: Will the data distribution affect the form of the U matrix and its basis?
A4: Thank you for your question. Our work employs a random selection strategy for channel selection in Section 5.3, which is task-independent. To demonstrate how different U matrices affect performance on various tasks, we used three different random seeds for channel selection. The results for both commonsense reasoning and math reasoning tasks are presented in Table R2. Our findings indicate that the optimal form of the U matrix varies across different tasks, suggesting that the "best U matrices" should be task-specific. We have left this direction for future research.
Table R2: Ablation study of channel selection strategies on the commonsense reasoning and math reasoning tasks.
| Selection Strategy | Commonsense | Math |
|---|---|---|
| Seed: 0 | 82.2 | 68.6 |
| Seed: 42 | 81.5 | 70.1 |
| Seed: 1234 | 81.0 | 69.4 |
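As an illustration of this seed-controlled selection, the hypothetical snippet below draws a fixed set of FFN channels from a seeded generator; the same index set would then be shared by the coupled Up/Gate/Down projections. The function name and sizes are assumptions, not the authors' code.

```python
import torch

# Seed-controlled random channel selection, shared across the coupled projections
# so that the same intermediate channels are trained in each of them.
def select_channels(d_ff: int, n_selected: int, seed: int) -> torch.Tensor:
    g = torch.Generator().manual_seed(seed)
    return torch.randperm(d_ff, generator=g)[:n_selected]

channels = select_channels(d_ff=11008, n_selected=128, seed=42)
# the same `channels` index set would be applied to W_up rows, W_gate rows, and W_down columns
```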
Q5: Can the authors analyze each task more fully?
A5: We would like to clarify that in Sections 5.1 and 5.2, we train the base model on a single dataset and evaluate it across several sub-tasks. Therefore, the average performance should be the primary focus, as our method achieves significant improvements in both settings. These results sufficiently demonstrate the superiority and robustness of S²FT. It is expected that our method may not always perform the best on every sub-task, as different sub-tasks affect each other. Nonetheless, S²FT consistently achieves the best or second-best performance across sub-tasks, which cannot be characterized as "obviously degraded."
To further address your concerns, we have provided more in-depth analyses of the results in Table 1 and Table 2 of the original paper. For the commonsense reasoning tasks in Table 1, S²FT achieves the best performance in most sub-tasks, while maintaining second-best performance in the remaining ones. These results demonstrate the superiority and robustness of S²FT in memorizing common knowledge.
Table 2 further verifies this conclusion, showing that S²FT outperforms LoRA, especially in memorization tasks such as Writing or Humanities within instruction-following scenarios. It also excels in tasks requiring pre-trained knowledge, such as those in STEM. This indicates that S²FT's performance improvements are primarily due to its enhanced ability to memorize new information and retain pre-trained knowledge. For other tasks, such as Reasoning and Math, which focus on multi-hop reasoning with limited knowledge, the performance of LoRA and S²FT is similar, and it is common for LoRA to slightly outperform S²FT on some of them. Consequently, SFT-based methods are more effective for knowledge-rich tasks. In comparisons among SFT-based methods, LISA performs slightly better than S²FT in some tasks. This is because LISA updates many more parameters than S²FT, resulting in similar behavior and better performance in certain tasks.
Thank you so much for your dedicated review of our paper. We recognize the significant time and effort involved in your review, and we greatly appreciate it. With only 2 days remaining before the conclusion of the discussion phase, we wish to extend a respectful request for your feedback about our responses. Thank you!
The authors' rebuttal mostly addresses my concerns. I will change the score.
The paper introduces a structured pruning method for LLMs. The main idea is to permute the rows and columns of the weight matrices and select a submatrix during the fine-tuning process. The authors show that the proposed technique outperforms previous techniques in terms of accuracy and efficiency.
Strengths
- The paper is well written. The authors gave a good summary of existing parameter-efficient fine-tuning (PEFT) methods. While the proposed technique itself is not entirely new (it can be considered as a special form of structured sparse training), the authors described the technique clearly.
- Theoretical analysis is provided for generalization performance.
- Experiments are comprehensive. The authors tested their technique for two types of tasks (commonsense reasoning and instruction-following) with five fine-tuning categories on more than 10 datasets.
Weaknesses
Novelty in the pruning technique itself is limited. It is basically a structured sparse training method. Similar structured pruning methods have been proposed on non-LLM models previously (e.g., [27, 24]).
Questions
Will the authors open-source the code?
Limitations
Yes.
Thank you for your constructive comments and suggestions. We respond to your questions below and would appreciate it if you could let us know if our response addresses your concerns.
Q1: Novelty in the pruning technique itself.
A1: Thank you for your comment. We acknowledge that structured sparsity is commonly used in model pruning. However, we are the first to adopt structured sparsity for the parameter-efficient fine-tuning (PEFT) of large language models (LLMs). Therefore, our method is a gradient selection technique rather than a pruning technique. Motivated by the challenges of practical efficiency and scalability in previous SFT-based methods, we use coupled structures for flexible and fine-grained gradient selection, introducing a completely new gradient selection strategy. This idea is both novel and effective in this research line.
Before the era of LLMs, methods like Diff pruning [1] and Fish Mask [2] mainly focused on unstructured selective fine-tuning, as model sizes were not very large. These methods used a binary mask to enable sparse gradient updates during training, which led to large memory footprints and time costs. In the era of LLMs, researchers additionally prioritize the memory efficiency of PEFT methods during training and scalable serving ability during inference, leading to the popularity of LoRA. SFT-based methods like LISA [3] only enable layer-wise selection and have limitations in serving scalability. In comparison, our method addresses these efficiency bottlenecks and revitalizes SFT-based approaches, surpassing LoRA in both performance and efficiency. Given the current research trends in PEFT methods, our approach is novel in this area.
Q2: Will the authors open-source the code?
A2: We strongly agree that open-sourcing the code is critical for reproducibility and for supporting future research in this area. We will therefore definitely make our code publicly available and our results easy to reproduce upon acceptance.
References:
[1] Guo D, Rush A M, Kim Y. Parameter-efficient transfer learning with diff pruning[J]. arXiv preprint arXiv:2012.07463, 2020.
[2] Sung Y L, Nair V, Raffel C A. Training neural networks with fixed sparse masks[J]. Advances in Neural Information Processing Systems, 2021, 34: 24193-24205.
[3] Pan R, Liu X, Diao S, et al. LISA: Layerwise Importance Sampling for Memory-Efficient Large Language Model Fine-Tuning[J]. arXiv preprint arXiv:2403.17919, 2024.
To facilitate reproduction, we have provided our code at https://anonymous.4open.science/r/S2FT_Rebuttal-7B17 for reviewer’s verification and will make it publicly available upon paper acceptance.
This repository contains the training and inference code necessary to fine-tune a LLaMA-7B model on commonsense reasoning tasks. We hope this addresses your reproduction concerns.
Thank you so much for your dedicated review of our paper. We recognize the significant time and effort involved in your review, and we greatly appreciate it. With only 2 days remaining before the conclusion of the discussion phase, we wish to extend a respectful request for your feedback about our responses. Thank you!
Current PEFT methods for LLMs fail to achieve high quality, efficient training, and scalable serving simultaneously. To overcome this, the authors developed Structured Sparse Fine-Tuning (S²FT), which excels in all three areas. S²FT improves generalization by selecting a few heads in the Multi-Head Attention (MHA) and channels in the Feed-Forward Network (FFN) modules for each Transformer block. It forms dense, trainable submatrices by co-permuting weight matrices, preventing overfitting and forgetting. S²FT achieves state-of-the-art performance and reduces fine-tuning memory usage by up to 3 times and increases throughput by 1.5-2.7 times.
Strengths
- This paper is well-written and organized. The proposed method is technically sound.
- Efficient tuning of LLMs is an important topic.
- The performance of the proposed method is very promising.
Weaknesses
- The authors claim that S2FT prevents overfitting and forgetting. However, except for the toy experiments in Section 2, I did not see any experimental proof. For example, the authors only give MT-Bench results after fine-tuning on Alpaca GPT-4; what about the overfitting and forgetting issues?
- Experiments are only conducted on relatively small models.
- The name SFT may cause confusion, as it already carries the meaning of supervised fine-tuning.
- The experiment section is not sufficient to prove the effectiveness of the proposed method, especially on some downstream tasks.
Questions
- The authors include some ablations on the trainable parameter allocation within a block; what about the allocation across different layers?
- Can this method be combined with quantization, just like QLoRA?
Limitations
Yes
Thank you for reviewing our paper and for your valuable feedback. Below, we address your concerns point by point and we will revise our paper according to your suggestions. We would appreciate it if you could let us know whether your concerns are addressed by our response.
Q1: Empirical results on the overfitting and forgetting issues.
A1: To verify that S²FT can prevent overfitting and forgetting, we evaluate its performance in an out-of-distribution (OOD) scenario using MT-Bench, as shown in Table 2 of the original paper. Additionally, we conduct an experiment on arithmetic reasoning tasks, demonstrating that S²FT significantly outperforms LoRA in both near OOD and far OOD settings.
In Table 2, we present the results of training the model on the Alpaca GPT-4 dataset and evaluating it on the MT-Bench benchmark, representing an OOD setting. Our method significantly outperforms both LoRA and Full FT in this task, demonstrating that it leads to much less forgetting and better generalization to new datasets.
To further address your concern, we trained LLaMA-7B/13B on the Math10K dataset and evaluated its performance on three in-distribution (ID) tasks (GSM8K, AQuA, MAWPS), four near OOD tasks (SVAMP, MultiArith, AddSub, SingleEq), and one far OOD task (MMLU). The results are shown in Table R1.
Table R1: Performance comparison between S²FT and LoRA for LLaMA-7B/13B on the arithmetic reasoning tasks and the MMLU benchmark.
| Method | Average (ID) | Average (Near OOD) | MMLU (Far OOD) |
|---|---|---|---|
| LoRA (LLaMA-7B) | 45.1 | 78.7 | 27.8 |
| S²FT (LLaMA-7B) | 46.6 | 81.7 | 33.1 |
| LoRA (LLaMA-13B) | 49.9 | 81.6 | 36.3 |
| S²FT (LLaMA-13B) | 50.5 | 84.2 | 42.2 |
According to these results, S²FT yields improvements of 5.3%, 3.0%, and 1.4% over LoRA for LLaMA-7B in the far OOD, near OOD, and ID settings, respectively. For LLaMA-13B, the corresponding improvements are 5.9%, 2.6%, and 0.6%. As the distribution difference between the training data and the test data increases, the performance gap between S²FT and LoRA widens, demonstrating that our method is effective in preventing overfitting and forgetting. We will include these results in the camera-ready version of our paper if accepted.
Q2: Experiments on larger models.
A2: Thank you for your suggestion. Following LISA, we have added experimental results for LLaMA2-70B on MT-Bench and GSM8K in Table R2. The results show that S²FT outperforms other PEFT methods on larger models, providing strong evidence for S²FT's scalability in large-scale training scenarios. We will include these results in the camera-ready version of our paper if accepted.
Table R2: Performance comparison between different methods for LLaMA2-70B on MT-Bench and GSM8K.
| Method | MT-Bench | GSM8K |
|---|---|---|
| Vanilla | 5.19 | 54.8 |
| LoRA | 6.10 | 59.4 |
| LISA | 6.72 | 61.1 |
| Full FT | 6.25 | 67.1 |
| S²FT | 6.91 | 64.7 |
Q3: The name SFT may incur some confusions as it already has the meaning of supervised fine-tuning.
A3: Thank you for your suggestion. We will replace "SFT" with "sparse FT" in the camera-ready version to avoid any confusion.
Q4: The experiment section is not sufficient.
A4: In the original paper, we included experiments with five different base models and more than ten tasks, covering both commonsense reasoning and instruction-following. We have also added new experiments in Tables R1 and R2. We hope these findings provide a more comprehensive picture of our method, and we will include them in the camera-ready version of our paper if accepted.
In Table R1, we further present experimental results on arithmetic reasoning tasks, demonstrating the effectiveness of our method across various downstream tasks. The results in the near OOD and far OOD settings further highlight its ability to address overfitting and forgetting. Additionally, in Table R2, our experiments on LLaMA2-70B showcase S²FT's scalability in large-scale training scenarios.
Q5: What about the allocation in different layers?
A5: Thank you for pointing out this issue. In Section 5.3, we maintain a uniform allocation of parameters across different layers. To further address your question, we have added an ablation study on layer-wise allocation in Table R3. This study includes the following design patterns: (i) Increasing: the number of trainable parameters in every layer gradually increases (or remains the same); (ii) Uniform: the number of trainable parameters in every layer is the same; and (iii) Decreasing: the number of trainable parameters in every layer gradually decreases. The results in Table R3 indicate that maintaining a uniform distribution of trainable parameters across different layers leads to the best performance. A more detailed analysis in this direction is left for future research.
Table R3: Performance of different allocation strategies on commonsense reasoning tasks for LLaMA-7B.
| Allocation Strategy | Average Accuracy |
|---|---|
| Increasing | 81.2 |
| Uniform | 81.8 |
| Decreasing | 80.4 |
Q6: Can this method be combined with quantization, just like QLoRA?
A6: Thank you for highlighting the potential of combining S²FT with quantization. Our method can indeed be integrated with quantization, similar to QLoRA, by using mixed-precision storage. Once the trainable parameters are determined, we keep them in their original precision while quantizing the remaining parameters to low bits. This enables quantized PEFT, similar to QLoRA. Because the trainable parameters form small, dense submatrices after permutation, the storage remains relatively hardware-efficient even with mixed precision. We plan to conduct more experiments in this direction in the future.
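To sketch this mixed-precision idea (purely illustrative: the absmax quantizer below is a simple stand-in for NF4-style schemes and is not the authors' or QLoRA's implementation, and all names are hypothetical):

```python
import torch

# Keep the selected trainable block in 16-bit; store the frozen remainder in int8.
def absmax_quantize(w: torch.Tensor):
    scale = w.abs().max().clamp(min=1e-8) / 127.0
    return torch.round(w / scale).to(torch.int8), scale

def absmax_dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return (q.float() * scale).to(torch.float16)

d_model, d_ff, n_selected = 512, 2048, 64
w_down = torch.randn(d_model, d_ff, dtype=torch.float16)

trainable_block = w_down[:, :n_selected].clone().requires_grad_(True)     # stays in 16-bit
frozen_q, frozen_scale = absmax_quantize(w_down[:, n_selected:].float())  # stored in int8

# At forward time, dequantize the frozen part and concatenate with the trainable block.
w_full = torch.cat([trainable_block, absmax_dequantize(frozen_q, frozen_scale)], dim=1)
```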
Thank you so much for your dedicated review of our paper. We recognize the significant time and effort involved in your review, and we greatly appreciate it. With only 2 days remaining before the conclusion of the discussion phase, we wish to extend a respectful request for your feedback about our responses. Thank you!
Thanks for the authors' response. I have no more questions right now. I choose not to change the scores.
This paper introduces a new family of methods called Structured Sparse Fine-Tuning (S²FT) for large language models (LLMs). S²FT aims to achieve state-of-the-art fine-tuning performance, training efficiency, and inference scalability simultaneously. The method selects a few heads in the multi-head attention (MHA) module and a few channels in the feed-forward network (FFN) module of each Transformer block, then co-permutes the weight matrices to connect these selected components. This results in multiple compact, dense, trainable weight submatrices that are updated during fine-tuning. The approach prevents overfitting and forgetting, delivers superior performance on benchmarks, and improves memory and throughput efficiency compared to full fine-tuning and existing parameter-efficient fine-tuning (PEFT) methods.
Strengths
This paper introduces a method combining structured sparsity with fine-tuning, enhancing both efficiency and performance. The method proposed can reduce memory costs and improve throughput. The method demonstrates strong generalization capabilities. The approach allows for scalable batched serving of multiple fine-tuned models without additional inference overhead. The experiments are comprehensive.
Weaknesses
While the results look impressive, the reviewer is not fully convinced by the theoretical motivations. For example, the paper relies on the assumption that structured sparsity can effectively represent the necessary model adaptations, which is hard to verify in realistic settings.
Although the benchmarks are comprehensive, most of them do not include SFT or other SFT-based methods. Since S²FT is in some sense an SFT-based method, it would be necessary to include such comparisons to understand whether the performance improvements come from the "structure" (i.e., the S²) or from SFT itself. Based on the current benchmarks, it is difficult to draw such conclusions.
While the paper includes some implementation details, aiding reproducibility, the code is not provided, making it hard to verify or reproduce the results, especially for a paper with mostly empirical results.
Questions
Could the authors elaborate on the scaling of trainable parameters, as well as the space and time complexity, compared to other SFT-based or LoRA-based methods?
Could the authors elaborate on how the hyperparameters are chosen, and provide a more comprehensive list of hyperparameters used in this paper, the procedure to choose them, and the discarded hyperparameters?
Right now, the paper mainly discusses how to apply S²FT to attention-based LLMs. The reviewer wonders how to identify the important weights and apply this method to other models.
Most of the experiments were performed on LLaMA and LLaMA2. The reviewer wonders if there is any difficulty in applying the method to LLaMA3.
Is it possible that different datasets/tasks could result in a different set of "important weights"? The reviewer wonders if the authors explored this possibility.
The reviewer wonders if the author could release the source code into an anonymous repository for review purposes.
In the NeurIPS checklist, the author claims that their results include confidence intervals. Could the authors make some comments on the confidence intervals for the tables in the paper?
Limitations
The limitations and broader impacts are addressed.
Thank you so much for the insightful and valuable comments! They are very helpful for further improving the clarity and quality of our paper. We'll revise our manuscript in the camera-ready version to address all of your concerns.
Q1: How can we verify the model adaptation ability in real-world settings?
A1: To verify the model's adaptation ability in real-world settings, we evaluate its performance in an out-of-distribution (OOD) scenario, as shown in Table 2 of the original paper. Our method, S²FT, significantly outperforms both LoRA and Full FT by a large margin in this task, demonstrating its superior adaptation ability. Additionally, we conduct an experiment on arithmetic reasoning tasks to demonstrate that S²FT significantly outperforms LoRA in OOD settings (see the "Empirical results for generalization tasks" part of our rebuttal to all reviewers).
Q2: Ablation study between SFT and S²FT.
A2: Thank you for your suggestion. Our results in Table R2 show that SFT itself primarily drives the performance improvement, while our structure enhances efficiency and scalability, as discussed in Section 6 of the original paper.
Table R2: Performance comparison between SFT and S²FT on the commonsense reasoning tasks.
| Model | SFT | S²FT |
|---|---|---|
| LLaMA-7B | 81.2 | 81.8 |
| LLaMA2-7B | 83.7 | 83.0 |
| LLaMA3-8B | 87.2 | 86.6 |
Q3: Reproduction of the paper.
A3: We strongly agree that open-sourcing the code is critical for reproducibility and for supporting future research in this area. We will therefore definitely make our code publicly available and our results easy to reproduce upon acceptance.
Q4: How are the hyperparameters chosen?
A4: Thank you for highlighting the importance of detailing our hyperparameter configuration. We have included a more comprehensive list of the hyperparameters used in this paper in Tables R4 and R5. Most hyperparameters follow previous work [2, 3], and we only tune the learning rates.
Table R4: Hyperparameter configurations of S²FT for LLaMA-7B/13B, LLaMA2-7B, and LLaMA3-8B on the commonsense reasoning tasks.
| Hyperparameters (S²FT) | LLaMA-7B | LLaMA-13B | LLaMA2-7B | LLaMA3-8B |
|---|---|---|---|---|
| LR | 1e-6 | 1e-6 | 1e-6 | 1e-6 |
| LR Scheduler | Linear | Linear | Linear | Linear |
| Optimizer | AdamW | AdamW | AdamW | AdamW |
| Batch size | 16 | 16 | 16 | 16 |
| Warmup Steps | 100 | 100 | 100 | 100 |
| Epochs | 3 | 3 | 3 | 3 |
| Where | Up,Down,Gate | Up,Down,Gate | Up,Down,Gate | Up,Down,Gate |
Table R5: Hyperparameter configurations of S²FT for LLaMA2-7B and Mistral-7B on the instruction-following task.
| Hyperparameters (S²FT) | Mistral-7B | LLaMA2-7B |
|---|---|---|
| LR | 2e-5 | 1e-5 |
| LR Scheduler | Cosine | Cosine |
| Optimizer | AdamW | AdamW |
| Batch size | 4 | 4 |
| Warmup Steps | 100 | 100 |
| Epochs | 1 | 1 |
| Where | Up,Down,Gate | Up,Down,Gate |
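For readers who want to reuse these settings, the LLaMA-7B column of Table R4 could be expressed as a plain configuration dictionary like the one below; the field names (e.g., target_modules) are hypothetical and only mirror the table, not the authors' training script.

```python
# Illustrative config mirroring Table R4 (LLaMA-7B, commonsense reasoning).
llama7b_commonsense_config = {
    "learning_rate": 1e-6,
    "lr_scheduler": "linear",
    "optimizer": "adamw",
    "batch_size": 16,
    "warmup_steps": 100,
    "num_epochs": 3,
    "target_modules": ["up_proj", "down_proj", "gate_proj"],
}
```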
Q5: How to identify the important weights?
A5: In Section 5.2, we discuss our weight selection strategy. For a standard transformer architecture, we first uniformly allocate the trainable parameters across different transformer layers. Next, we freeze all attention modules and evenly assign parameters to the Up, Gate, and Down projection modules. Since these three modules represent a coupled structure within the FFN module, we randomly select the same channels to update for each module. In Table 3, we also introduce an alternative activation-based selection strategy, which resulted in inferior performance. We leave the exploration of more advanced metrics to identify important weights as a topic for future research.
Q6: How can S²FT be applied to different models?
A6: Thank you for your interest in applying S²FT across different model architectures. It is indeed versatile and can be applied to Transformers, CNNs, RNNs, and GNNs.
As detailed in Section 3.2 of our work, S²FT can be applied to the Multi-Head Attention and Feed-Forward Network modules of the standard Transformer architecture, which covers most current LLMs and diffusion models. Additionally, coupled structures like the one in Figure 3(a) also exist in CNNs, RNNs, and GNNs, as shown in [1]. Therefore, S²FT is effective across various model architectures.
Q7: Most of the experiments were performed on LLaMA and LLaMA2. The reviewer wonders if there is any difficulty in applying the method to LLaMA3.
A7: In Table 1, we present the results of LLaMA3-8B on commonsense reasoning tasks, following DoRA and ReFT. S²FT achieves an average improvement of 1.4%, demonstrating its applicability to LLaMA3. For other experiments, we primarily focus on LLaMA and LLaMA2 to ensure a fair comparison, as these base models are commonly used by our baseline methods.
Q8: Could the authors make some comments on the confidence intervals for the tables in the paper?
A8: Thank you for your suggestion. We have added the full results with confidence intervals in the Appendix, and included comments in the tables in the main paper, which will be visible in the camera-ready version.
Q9: Could different datasets/tasks result in a different set of "important weights"?
A9: Please see the "The effect of selecting different trainable parameters on different datasets/tasks" part of our rebuttal to all reviewers.
References:
[1] Fang G, Ma X, Song M, et al. Depgraph: Towards any structural pruning[C]//Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2023: 16091-16101.
Thanks a lot for taking the time and effort to answer my questions. I would like to keep my recommendation for acceptance of the paper, and am considering raising the score in the next 2 days.
Thank you very much for your response. Please don't hesitate to reach out if you have any further questions. We greatly appreciate your willingness to consider raising the score.
To facilitate reproduction, we have provided our code at https://anonymous.4open.science/r/S2FT_Rebuttal-7B17 for reviewer’s verification and will make it publicly available upon paper acceptance.
This repository contains the training and inference code necessary to fine-tune a LLaMA-7B model on commonsense reasoning tasks. We hope this addresses your reproduction concerns.
We thank reviewers [R1(CFWk), R2(8C1L), R3(d9g6), R4(Cfys)] for their thoughtful and highly supportive feedback! We are glad that the reviewers found the problem significant and interesting [R2], the observations and theoretical analysis insightful and highly valuable [R3, R4], the method novel, simple, and effective [R2, R4], the presentation easy to follow [R2, R3], and the experimental results comprehensive and impressive [R1, R2, R3].
We have updated the paper to incorporate constructive suggestions, which will show in the camera-ready version. We summarize the major changes:
- [R1, R3] Empirical results for generalization tasks: In Table R1, we trained LLaMA-7B/13B on the Math10K dataset and evaluated performance on three in-distribution (ID) tasks (GSM8K, AQuA, MAWPS), four near OOD tasks (SVAMP, MultiArith, AddSub, SingleEq), and one far OOD task (MMLU). As the distribution difference between the training data and the test data increases, the performance gap between S²FT and LoRA widens, demonstrating that our method generalizes well by preventing overfitting and forgetting.
Table R1: Performance comparison between S²FT and LoRA for LLaMA-7B/13B on the arithmetic reasoning tasks and the MMLU benchmark.
| Method | GSM8K | AQuA | MAWPS | SVAMP | MultiArith | AddSub | SingleEq | Average (ID) | Average (Near OOD) | MMLU (Far OOD) |
|---|---|---|---|---|---|---|---|---|---|---|
| LoRA (LLaMA-7B) | 37.5 | 18.9 | 79 | 52.1 | 95 | 83.3 | 84.4 | 45.1 | 78.7 | 27.8 |
| S²FT (LLaMA-7B) | 35.8 | 22 | 81.9 | 57.1 | 93.3 | 87.3 | 89.2 | 46.6 | 81.7 | 33.1 |
| LoRA (LLaMA-13B) | 47.5 | 18.5 | 83.6 | 54.6 | 94.8 | 87.3 | 89.8 | 49.9 | 81.6 | 36.3 |
| S²FT (LLaMA-13B) | 45.8 | 21.7 | 84 | 63 | 95 | 87.3 | 91.5 | 50.5 | 84.2 | 42.2 |
- [R2] Experimental results for large-scale language models: We have added experimental results for LLaMA2-70B on MT-Bench and GSM8K in Table R2. The results show that S²FT outperforms other PEFT methods on larger models, providing strong evidence for S²FT's scalability in large-scale training scenarios. We will include these results in the camera-ready version of our paper if accepted.
Table R2: Performance comparison between different methods for LLaMA2-70B on MT-Bench and GSM8K.
| Method | MT-Bench | GSM8K |
|---|---|---|
| Vanilla | 5.19 | 54.8 |
| LoRA | 6.10 | 59.4 |
| LISA | 6.72 | 61.1 |
| Full FT | 6.25 | 67.1 |
| S²FT | 6.91 | 64.7 |
- [R2, R4] Ablation study for the layer-wise allocation strategy: For the allocation among different layers, we maintain a uniform allocation strategy, meaning we assign trainable parameters to all layers equally. Our study in Table R3 includes four design patterns: (i) Increasing: the number of trainable parameters in every layer gradually increases (or remains the same); (ii) Uniform: the number of trainable parameters in every layer is the same; (iii) Decreasing: the number of trainable parameters in every layer gradually decreases; and (iv) Random One: only one randomly selected layer has trainable parameters. The results show that maintaining a uniform distribution of trainable parameters among different layers leads to the best performance.
Table R3: Performance of different layerwise allocation strategies on commonsense reasoning tasks for LLaMA-7B.
| Allocation Strategy | Avg. Accuracy |
|---|---|
| Increasing | 81.2 |
| Uniform | 81.8 |
| Decreasing | 80.4 |
| Random One | 79.9 |
- [R1, R4] The effect of selecting different trainable parameters on different datasets/tasks: As detailed in Section 5.3, we use a random selection strategy for channel selection. We apply the same three random seeds to determine the selection strategies for both the commonsense reasoning and math reasoning tasks. The results, shown in Table R4, indicate that the best channel selection strategy varies across tasks. Therefore, the "important trainable parameters" differ for each task.
Table R4: Ablation study of channel selection strategies on different tasks.
| Selection Strategy | Commonsense | Math |
|---|---|---|
| Seed: 0 | 82.2 | 68.6 |
| Seed: 42 | 81.5 | 70.1 |
| Seed: 1234 | 81.0 | 69.4 |
To facilitate reproduction, we have provided our code at https://anonymous.4open.science/r/S2FT_Rebuttal-7B17 for reviewer’s verification and will make it publicly available upon paper acceptance.
This repository contains the training and inference code necessary to fine-tune a LLaMA-7B model on commonsense reasoning tasks. We hope this addresses the reproduction concerns raised by Reviewer CFWk and Reviewer d9g6.
The reviewers generally found the paper well written, the proposed method technically sound and easy to understand, the work well motivated, and the claims supported by comprehensive experimental results. Reviewers' concerns during the rebuttal centred on the theoretical analysis, the confusing naming of the method, the novelty of the pruning technique itself, the complex notation used in the theoretical analysis, and the lack of explanatory figures. Thanks to the significant discussion and efforts of the authors and reviewers, most of these concerns appear to have been addressed, and the reviewers are unanimous in accepting the paper.