Sparse High Rank Adapters
Abstract
Reviews and Discussion
This work proposes a new PEFT method, SHiRA, which finetunes 1-2% of pretrained model weights. The authors demonstrate that the resulting sparse adapter weights can be combined in multi-adapter settings with less concept loss than LoRA as the sparse adapters are mostly orthogonal. Further, the authors empirically demonstrate that the scatter operation used in SHiRA has a lower latency than fusing LoRA weights at inference time for model embedding dimensions > 1024.
Strengths
- This is a timely work, as the multi-adapter setting offers great promise for memory-constrained devices such as mobile phones.
- The proposed method outperforms LoRA across a diverse set of tasks and models.
- The paper is presented well, with clear figures, tables, and discussion.
Weaknesses
- The primary weakness is the relatively modest novelty of this work in the context of Diff Pruning [1] and SFT [2]. Both of these prior works follow a very similar strategy as SHiRA, namely, fine tuning a very small, sparse set of the original pretrained parameters. In my view, the most unique aspect of SHiRA is the emphasis on composability in the multi-adapter setting and related analysis. A comparison between the performance differences between SFT-AG and SHiRA would help establish the potential benefits / drawbacks of using dynamic mask selection (SFT) vs. static (SHiRA). In any case, [2] is a very relevant work which should be discussed in the related work section.
- The primary motivation for this work is low latency adapter switching, providing end-to-end profiling would be much more convincing than focusing only on the scatter-op as in Appendix B. Perhaps Appendix J was originally intended for this analysis? It is unclear how significant the overall effect of the latency reductions in Appendix B are in the context of online or batched inference on an edge device. Does SHiRA provide any latency benefit for smaller embedding dimensions such as those used in StableDiffusion?
- One of the main benefits of LoRA is the reduced memory footprint for fine-tuning with adaptive optimizers such as Adam. Because the momentum buffers typically require full float precision, LoRA greatly reduces the memory overhead required for fine-tuning. In contrast, based on Appendix C.3, it appears that SHiRA must materialize the full gradient buffers for the pretrained weights, which are selectively set to zero for frozen parameters with a post-gradient-accumulation hook. I believe SHiRA could be reparameterized such that only the sparse adapter parameters track gradients, so Adam would not allocate buffers for every entry of the pretrained weight matrices (a toy sketch of this idea appears after the reference list below). Acknowledging the difference in memory footprint between LoRA and SHiRA during training, and suggesting how this overhead may be avoided, would help extend SHiRA to low-memory PEFT settings.
- The pruning criterion studied for mask initialization could be expanded to include some additional, more modern criteria designed specifically for transformer based models such as Movement Pruning [3] and Wanda [4]. Further discussion on the surprisingly high performance of the SHiRA-Rand mask would be beneficial as well.
- An analysis on how to distribute the sparse fine-tuned parameters across layers was not discussed. It would be particularly interesting to explore how to allocate the fine-tuning parameter budget across various modules in the network. For instance, how does the uniform strategy showcased here compare with OWL [5]?
[1] D. Guo, A. M. Rush, and Y. Kim, “Parameter-Efficient Transfer Learning with Diff Pruning.” arXiv, Jun. 09, 2021. doi: 10.48550/arXiv.2012.07463.
[2] A. Ansell, I. Vulić, H. Sterz, A. Korhonen, and E. M. Ponti, “Scaling Sparse Fine-Tuning to Large Language Models.” arXiv, Jan. 29, 2024. doi: 10.48550/arXiv.2401.16405.
[3] V. Sanh, T. Wolf, and A. Rush, “Movement Pruning: Adaptive Sparsity by Fine-Tuning,” in Advances in Neural Information Processing Systems, Curran Associates, Inc., 2020, pp. 20378–20389
[4] M. Sun, Z. Liu, A. Bair, and J. Z. Kolter, “A Simple and Effective Pruning Approach for Large Language Models.” arXiv, Jun. 20, 2023. doi: 10.48550/arXiv.2306.11695.
[5] L. Yin et al., “Outlier Weighed Layerwise Sparsity (OWL): A Missing Secret Sauce for Pruning LLMs to High Sparsity.” arXiv, Oct. 08, 2023. doi: 10.48550/arXiv.2310.05175.
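To make the reparameterization suggestion above concrete, here is a minimal, hypothetical PyTorch sketch (sizes, names, and the random mask are illustrative only and not taken from the paper): when only a k-element sparse delta requires gradients, Adam allocates its exp_avg/exp_avg_sq buffers for k values rather than for the full weight matrix.

```python
import torch
import torch.nn as nn

d = 2048
k = int(0.01 * d * d)                                      # ~1% trainable entries (illustrative)

W = nn.Parameter(torch.randn(d, d), requires_grad=False)   # frozen pretrained weight
idx = torch.randperm(d * d)[:k]                            # static sparse mask (random here)
delta = nn.Parameter(torch.zeros(k))                       # the only tensor that tracks gradients

opt = torch.optim.Adam([delta], lr=1e-4)                   # optimizer state covers k values only

x = torch.randn(8, d)
w_eff = W.detach().flatten().scatter_add(0, idx, delta).view(d, d)  # W plus sparse delta
loss = (x @ w_eff).pow(2).mean()                           # dummy loss for illustration
loss.backward()
opt.step()

adam_state = sum(t.numel() for s in opt.state.values()
                 for t in s.values() if torch.is_tensor(t))
print(adam_state, "Adam state elements vs roughly", 2 * d * d, "for a dense trainable weight")
```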
Questions
- What are the key differences between SHiRA and SFT? How does SFT compare with SHiRA in terms of generalization performance?
- What are the end-to-end latency profiles of LoRA vs. SHiRA when performing both online and batched inference? I would be interested in the profiles for both the single-adapter case and the multi-adapter setting. How does the latency scale as the number of adapters is increased in the multi-adapter setting for both methods? Based on Figure 7, it does not appear that SHiRA would benefit StableDiffusion 1.5 since it has a much smaller embedding dimension than Llama (320 for the UNet vs. 4096); does SHiRA provide a latency benefit at this small embedding dimension?
- What is the memory overhead required for fine-tuning with SHiRA vs. LoRA as currently implemented? Can SHiRA be engineered to be more memory efficient than currently implemented, and if so, how?
- What is the effect of the dataset order in the SHiRA-WM-Non-Overlap setting? Is there an appreciable difference in performance for the dataset that is fine-tuned on the top 1% of weights vs. those datasets trained with lower magnitude weights?
- The performance of SHiRA-WM-Overlap was surprising; I speculate that this may be due to the three datasets being relatively similar (QA-based reasoning). Does SHiRA-WM-Overlap maintain its high performance when fine-tuned on very different datasets? For example, combining adapters trained for multiple-choice QA reasoning and in-context retrieval-augmented QA (TriviaQA, for example). Another way to perform this analysis could be to compare the L2 distance between the pretrained weights and the individually trained single adapters vs. the distance between the single adapters. Are the single adapters trained on different datasets more similar to each other than to the pretrained weights?
- How are the number of trainable parameters determined? While performance remains competitive at 1%, it would be worthwhile to examine the trade-off between latency and generalization performance of smaller or larger adapters.
- Did the authors experiment with different layerwise distributions of the fine-tuning parameter budget? For instance, only fine tuning the MHA modules or only the MLPs? Another interesting approach could be to focus the fine-tuning budget on earlier blocks as recent works [6] have found that these early blocks have a disproportionate impact on the model output.
- In Appendix B.3, the SNIP code example determines the mask by selecting the topk elements in a gradient_dict member variable of the SFTTrainer class. Does this dict contain the products of the gradient and weight magnitudes?
Suggestions:
- Adding a discussion of PERP [7] to related work. While PERP aims at restoring LLM performance post pruning, it is related to this work in that it also finds that a very small portion of the pretrained parameters (<1% in some cases) can be used to fine-tune the network.
- Add citation for PEFT library on L70
- The typical LoRA fusion formulation is $W' = W_0 + BA$. I suggest revising to match the original paper unless the authors prefer to explicitly define the shapes of the matrices B and A (the standard formulation is written out after the references below).
- The SHiRA-Rand baseline in Table 10 appears to be very strong. I note that random selection has been established as a robust baseline in the dynamic sparse training literature. Adding a discussion of the performance of the random mask and potentially expanding the main paper’s results to include the random mask would be of interest to the reader.
[6] S. He, G. Sun, Z. Shen, and A. Li, “What Matters in Transformers? Not All Attention is Needed.” arXiv, Jul. 07, 2024. doi: 10.48550/arXiv.2406.15786.
[7] M. Zimmer, M. Andoni, C. Spiegel, and S. Pokutta, “PERP: Rethinking the Prune-Retrain Paradigm in the Era of LLMs.” arXiv, Dec. 23, 2023. doi: 10.48550/arXiv.2312.15230.
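For reference, the formulation from the original LoRA paper that the suggestion above refers to can be written as:

```latex
W' = W_0 + \Delta W = W_0 + \frac{\alpha}{r} B A,
\qquad B \in \mathbb{R}^{d \times r},\; A \in \mathbb{R}^{r \times k},\; r \ll \min(d, k).
```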
Limitations
Adequately addressed; however, adding commentary on the memory required for fine-tuning would help readers understand whether this method is applicable to memory-constrained fine-tuning setups.
Continued from Author Rebuttal…
Q5 [L2 distance analysis for SHiRA overlap: Are adapters more similar to each other than to base model weights?] Table 4 (and Section 5.3.2 in the submitted paper) presents results for unstructured SHiRA-WM masks. Fig. S2 (rebuttal PDF) shows AWOM and AWOR for unstructured SHiRA masks such as SHiRA-WM overlap and non-overlap masks. As evident, the number of zeros is very similar for both the overlap and non-overlap cases. This suggests that the relative orthogonality properties of SHiRA-overlap and non-overlap would be similar, which explains why SHiRA-overlap performs well. Hence, for unstructured masks, overlap and non-overlap adapters have similar orthogonality properties. Since overlap adapters are trained on the top-1% weight magnitudes, they tend to achieve slightly higher single-adapter accuracy.
Table S5 (rebuttal PDF) shows the L2 analysis for the adapters trained in Table 4 (submitted paper). We compute the L2 distance between each adapter and the original pretrained weights (all adapters train the top 1% of weights in the overlap setting) as well as the L2 distance between each pair of adapters. Clearly, each adapter is closer to the pretrained weights than to the other adapters. This demonstrates that the tasks are sufficiently different. We hypothesize that the main reason for the good performance of SHiRA-WM-overlap is its orthogonality properties, as shown in Fig. S2 (rebuttal PDF).
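As a concrete illustration of this analysis, a minimal sketch of the L2 comparison could look as follows (the helper and dictionary names are hypothetical; this is not the script used to produce Table S5):

```python
import torch

def l2_distance(sd_a: dict, sd_b: dict) -> float:
    """Total L2 distance between two state dicts with matching keys."""
    return torch.sqrt(sum((sd_a[k].float() - sd_b[k].float()).pow(2).sum() for k in sd_a)).item()

# Assumed inputs: base_sd holds the pretrained weights, adapter_sds maps task name to the
# fused weights of each single-task SHiRA adapter, restricted to the adapted layers.
# dist_to_base = {name: l2_distance(sd, base_sd) for name, sd in adapter_sds.items()}
# dist_between = l2_distance(adapter_sds["boolq"], adapter_sds["arc_easy"])
# If dist_to_base < dist_between for each adapter, the adapters are closer to the pretrained
# weights than to each other, i.e., the tasks pull the weights in sufficiently different directions.
```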
Q6 [#trainable params] The number of trainable parameters is purely a user-defined choice and depends on the desired adapter size. For all experiments, adapter sizes are kept close to LoRA's for a fair comparison.
Q8 [gradient-dict?] Yes, that dictionary contains the products of the gradient and weight magnitudes.
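For readers who want the gist without the official code, a hedged sketch of such a gradient-times-weight top-k mask is shown below (illustrative only, not the authors' SFTTrainer implementation):

```python
import torch

def snip_style_mask(model: torch.nn.Module, fraction: float = 0.01) -> dict:
    """Select the top `fraction` of weights per layer by |weight * grad|.
    Assumes loss.backward() has already been called on a calibration batch."""
    masks = {}
    for name, p in model.named_parameters():
        if p.grad is None or p.dim() < 2:
            continue                                   # skip biases / norms / frozen params
        score = (p.detach() * p.grad.detach()).abs()   # SNIP-style saliency
        k = max(1, int(fraction * score.numel()))
        thresh = score.flatten().topk(k).values.min()
        masks[name] = score >= thresh                  # boolean mask of trainable entries
    return masks
```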
Thanks for the remaining suggestions. We will incorporate the rest of them in the next version of the paper.
With these new results, the paper has improved significantly. We would greatly appreciate it if the reviewer could increase the rating of our work.
I thank the authors for their detailed rebuttal, clarifications, and additional analysis. I agree that SFT can be considered contemporaneous work. My other concerns regarding memory overhead and latency were adequately addressed by the rebuttal figures and discussion.
Based on the above, I've elected to increase my initial score.
Thank you so much for increasing the rating of our work and for the very detailed feedback. It genuinely improved the quality of our work.
We thank the reviewer for the constructive feedback and for appreciating the strengths of our work. Below we address the concerns:
1: Thanks for pointing out these related works. SFT is an excellent (and concurrent) work that aims to scale sparse finetuning to LLMs using dynamic masks. In contrast, SHiRA uses a static mask. We have now created a memory- and latency-efficient implementation for SHiRA based on the PEFT library. Our implementation uses the scatter-op during training (see section B in the common response). A key difference between SFT and SHiRA is that dynamic-mask-based SFT requires users to install custom sparse linear layer kernels (i.e., “linear-sd” in the official SFT code). This can sometimes make it non-trivial to run SFT, particularly in different environments (e.g., if PEFT or CUDA versions are dramatically different). On the other hand, our static-mask-based SHiRA PEFT implementation does not require any custom sparse kernel installations and uses the pure PyTorch scatter-op during training. Hence, an important benefit of SHiRA is its ease of training/inference deployment. Other differences between SHiRA and SFT include our detailed analysis of multi-adapter fusion properties, including the impact of sparsity on orthogonality between adapters, which were not discussed in the SFT paper. Also, SHiRA demonstrates its effectiveness on both vision and language tasks, whereas SFT only discusses language tasks.
We wanted to use SFT code to compare SFT vs. SHiRA on commonsense reasoning tasks which were not included in the SFT paper. Unfortunately, during the short rebuttal process, we were not able to complete these experiments due to certain environment issues on our end. We will include a head-to-head comparison in the next version of the paper. Nevertheless, SFT seems to be the very first work that scales partial finetuning to LLMs. We will cite this concurrent work (released nearly 3 months before the conference deadline), highlight its clear importance, and discuss all these differences in the related work section.
2: Please see section A in common response.
3: Please see section B in common response.
4: Given a mask, SHiRA can be easily extended to any pruning strategy. Designing an optimal mask is an important future direction. We will cite Movement Pruning/Wanda and discuss them in a future work section. The scope of our current work is only to demonstrate that even the most basic pruning criteria outperform LoRA. With our PEFT implementation, we have now provided a way to perform partial finetuning at lower training memory costs than LoRA.
The rank of a random sparse matrix has been studied extensively using random-graph-theoretic and combinatorial techniques [P1, P2]. Based on the results discussed in [P1, P2], the rank of the SHiRA-Rand adapter should be high. Empirically, we did observe that our SHiRA-Rand adapters were high rank, which might explain their high accuracy.
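A quick empirical check of this point (with illustrative sizes, not the paper's adapters) can be done in a few lines:

```python
import torch

torch.manual_seed(0)
d, density = 1024, 0.01
mask = torch.rand(d, d) < density            # ~1% random sparsity
delta = torch.randn(d, d) * mask             # random sparse "adapter" update
print(torch.linalg.matrix_rank(delta).item(), "out of", d)
# With ~10 nonzeros per row on average, such matrices are typically close to full rank,
# in line with the random-matrix results in [P1, P2].
```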
5: SHiRA makes it easy for users to perform sparse finetuning without having to discover “the best parameters” to train. This is one of the biggest advantages of LoRA: users just specify a rank and then train the model. In a similar spirit, SHiRA users would just need to provide an application-dependent “adapter size”. Indeed, the layerwise distribution of these parameters is of interest but beyond the scope of the current study. We will discuss OWL and reference [6] suggested by the reviewer in the future work section. A similar analogy is how AdaLoRA [P3] equipped LoRA with layerwise ranks (instead of a constant rank for all layers) and SoRA [P4] explored adaptive ranks. Similar follow-up studies can be conducted for SHiRA as well.
[P1] On the rank of a random binary matrix. https://arxiv.org/abs/1806.04988
[P2] The rank of sparse random matrices. https://arxiv.org/pdf/1906.05757
[P3] AdaLoRA: Adaptive Budget Allocation for Parameter-Efficient Fine-Tuning (ICLR 2023). https://arxiv.org/pdf/2303.10512
[P4] Sparse Low-rank Adaptation of Pre-trained Language Models (EMNLP 2023). https://arxiv.org/abs/2311.11696
Other questions: Most questions are addressed above. For remaining questions, please see below:
Q2: [Single adapter vs. multi-adapter latency profiles and effect of small embedding dimensions]. In the unfused mode for LoRA, inference latency will keep increasing with the number of adapters. For SHiRA, inference always happens in fused mode, so inference latency will always equal the base model latency once the fusion is complete (which, as shown in Table S4, is much more efficient than LoRA fusion). For smaller embedding dimensions, we have shown significant latency benefits (for adapter fusion) on SDXL (2.6B params in SDXL vs. 7B params in LLaMA2).
On fusion speed for smaller weight dimensions: We have plotted the data from Fig. 7 (Appendix B in the submitted paper) in semi-log-y format in Fig. S3. Clearly, we see 16x to 13x improvements in adapter loading using scatter-op for SHiRA. Hence, scatter-op is significantly faster than LoRA fusion even at smaller embedding dimensions. For realistic networks like SDXL and LLaMA2, end-to-end switching times are provided in Table S4 in rebuttal PDF. Therefore, we still see significant speed up (4.68x-5.71x) in adapter switching for real networks.
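To make the comparison reproducible, the following is a toy CPU microbenchmark of the two switching paths on random tensors (sizes are illustrative; absolute timings will differ from Table S4 and Fig. S3):

```python
import time
import torch

d, r = 4096, 64
k = int(0.01 * d * d)                         # ~1% sparse entries for the SHiRA-style update
W = torch.randn(d, d)

# LoRA-style fusion: W <- W + B @ A (a dense d x r by r x d matmul plus a dense add)
B, A = torch.randn(d, r), torch.randn(r, d)
t0 = time.perf_counter()
W.add_(B @ A)
t1 = time.perf_counter()

# SHiRA-style switch: overwrite only the ~1% affected entries with a scatter op
idx = torch.randperm(d * d)[:k]
vals = torch.randn(k)
t2 = time.perf_counter()
W.view(-1).scatter_(0, idx, vals)
t3 = time.perf_counter()

print(f"dense fuse: {t1 - t0:.4f}s   scatter update: {t3 - t2:.4f}s")
```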
Q4 [Dataset order for SHiRA non-overlap]. Yes, the accuracy of single adapters changes slightly when we train them on the top 1%, top 1-2%, or top 2-3% of weights. This can be inferred from Table 4 (submitted paper): ARC-Easy and BoolQ lose slight accuracy in the non-overlap case compared to their typical top-1% training checkpoints. ARC-Easy was trained on the top 2-3% of parameters for the non-overlap case and loses accuracy from 77.57% to 75.97%. BoolQ was trained on the top 1-2% and loses accuracy from 78.07% to 76.94%. This is highly application dependent and highlights why robust masks for SHiRA are an important direction for future research.
Remaining minor concerns are in the official comment below due to lack of space.
This paper presents a PEFT method, which applies gradient masks for downstream adaptation. It claims three main contributions: rapid adapter switching, lower concept loss, and higher rank. Experiments are conducted in the area of LVMs and LLMs. The proposed method presents SOTA performance and adapter switching efficiency.
Strengths
- Leaving the LoRA framework may indeed bring additional benefits, such as adapter switching and high rank, etc. The motivation of this article is intuitively reasonable.
- This paper is easy to understand.
- It is reasonable to use classic (initialization) pruning methods such as SNIP for adaptation.
Weaknesses
- This article highlights too many contributions. I respectfully admit that there are many requirements to be met in the PEFT field, such as performance, efficiency, resource costs, adapter switching, etc. However, the premise for proposing a method that is good at adapter switching is that it must perform well under a single-adapter setup and on tasks in a variety of fields. Compared with tasks such as image generation and LLaMA, more basic tasks may better reflect the strength of a PEFT method. Therefore, I suggest that the authors refer to methods like VeRA and LoRA to perform experiments. In short, GLUE and image classification tasks are suggested to be added.
- Efficiency is also a very necessary property for PEFT: can the authors provide the time per epoch and peak GPU memory usage on the LLaMA-2-7B model, or maybe larger models? This article repeatedly emphasizes that the proposed method has very few parameters, but this does not mean that it can be more efficient than LoRA, because the number of parameters has no absolute relationship with GPU cost and training time. To save trouble, you can provide only the efficiency comparison between the proposed method and LoRA on the LLaMA-2-7B model under a single adapter (when they achieve similar performance).
- I think the high-rank claim is a strong one. Many works show that higher rank may be better, but that does not necessarily mean that the higher the rank, the better the effect; this varies depending on the model and data. For example, [1] shows that if the rank is too high, LoRA's performance may deteriorate. This is intuitively reasonable (my personal view): a large number of parameters does mean strong expressiveness, but it may lead to a more complex optimization process, making it difficult to converge to its peak capability. Therefore, high rank is an extremely strong conclusion, which needs to be verified by comprehensive experiments. I do not recommend that the authors continue to claim this contribution unless there are more detailed experimental results for support.
- Minor: Why not consider SynFlow as one of your strategies? It seems that with SynFlow you would not need to initialize the mask separately for each dataset.
[1] Increasing Model Capacity for Free: A Simple Strategy for Parameter Efficient Fine-tuning, ICLR 2024.
Questions
N.A.
Limitations
N.A.
We thank the reviewer for the constructive feedback and for appreciating the strengths of our work and its motivation. Below we address the concerns listed in the weaknesses section:
1: Thank you for noticing that we have a lot of contributions in our paper. To summarize our main contribution, we highlight that SHiRA – when used with even the most basic pruning metrics (such as weight- or gradient-magnitude, SNIP, structured masks, etc.) – significantly outperforms LoRA on a variety of large-scale tasks in both large vision and large language domains. Additionally, it brings the rapid switching benefits for deployment (by changing only 1-2% params) and encourages multi-adapter fusion (by having better orthogonality properties). As evident from Figs. 1, 4, 6 and Tables 1-4 (main paper), various types of SHiRA masks significantly outperform LoRA. Based on our current study with our selected types of masks, if we were to recommend one technique, we would recommend SHiRA-SNIP which performs consistently well across both vision and language problems.
We further conducted more experiments on image classification and GLUE tasks using SHiRA-WM (since weight magnitude is the easiest mask to use). For image classification, we finetune Vision Transformer (ViT) using LoRA and SHiRA for four common transfer learning datasets, namely, CIFAR-10, CIFAR-100, Food101, and Describable Textures Dataset (DTD). Both methods have comparable #parameters around 300K. As shown in Table S1 (rebuttal pdf), we outperform LoRA on all image classification tasks.
For GLUE, we use the code released by SoRA [P1], which relies on dynamically adjusting the ranks of the adapters. In Table S2 (rebuttal PDF), we report accuracy on four common GLUE tasks: QNLI, CoLA, SST-2, and MRPC. Accuracy numbers for LoRA and SoRA are taken directly from the SoRA paper since we use its official code to run the SHiRA experiments. As evident, with a nearly 2x smaller adapter, SHiRA outperforms LoRA by 1.1% accuracy on average. Further, SHiRA achieves similar accuracy to SoRA while being 30% smaller in adapter size. Indeed, SoRA cannot enable rapid switching like SHiRA. Therefore, we again demonstrate that a simple approach like SHiRA-WM outperforms LoRA and its advanced variants with similar or significantly better accuracy while providing additional deployment benefits.
2: Please see section B in common response.
3: We agree with the reviewer that arbitrarily increasing the rank need not benefit the task at hand. It is important to note that recent studies [P3, P4] have identified a performance gap of LoRA when compared with full-model fine-tuning. Therefore, techniques for higher-rank adaptations are suggested in [P3, P4], with insights gleaned from various downstream tasks. Further, [P2] suggests that the rank of the update for fine-tuning a model is related to the size of the model and to how it is trained. Importantly, our experiments with SHiRA result in higher-rank updates without explicit assumptions on the rank. Hence, we do not need to explicitly set a rank for SHiRA (unlike LoRA). However, we completely agree with the reviewer that claims on high rank warrant further studies. Therefore, we will discuss all these related works and adjust our claims accordingly.
4 (Minor): As mentioned before, our objective was to just use a few standard weight importance metrics from pruning literature for determining masks and to see if they outperform LoRA (which they do). Synflow is an unsigned version of SNIP (see Eq. 1 in [P5]). We agree with the reviewer that looking at the unsigned SNIP value to rank important weights could be interesting. We leave this ablation to future work. We expect it to work well because Synflow will select at least a subset of the same weights that were selected by SNIP.
With these new results, the paper has improved significantly. We would greatly appreciate it if the reviewer could increase the rating of our work.
[P1] Sparse Low-rank Adaptation of Pre-trained Language Models (EMNLP 2023). https://arxiv.org/abs/2311.11696
[P2] Intrinsic dimensionality explains the effectiveness of language model fine-tuning. https://arxiv.org/abs/2012.13255
[P3] MoRA: High-Rank Updating for Parameter-Efficient Fine-Tuning. https://arxiv.org/pdf/2405.12130
[P4] PeriodicLoRA: Breaking the Low-Rank Bottleneck in LoRA Optimization. https://arxiv.org/pdf/2402.16141
[P5] Zero-Cost Proxies for Lightweight NAS (ICLR 2021). https://arxiv.org/pdf/2101.08134
Thanks for the authors' rebuttal; I believe it resolves most of my concerns. However, I still have one comment, which is a continuation of weakness 1. As a PEFT method applicable to multiple fields (even if it is only applicable to image generation), it should have relatively concentrated and easy-to-understand conclusions. For example, LoRA tells users the approximate relationship between rank and performance. As for SHiRA, as a user, what mask strategy should I adopt? The authors do not seem to give specific guidance or conclusions.
In addition, a very personal view: why do the authors hide the random mask in most of the experiments? Judging from the results in Table 10, the random strategy is not bad. I never think that this strategy should not be shown and analyzed just because it sounds simple. On the contrary, I think that if the authors' core argument were centered around the following, the insight and contribution of this article would be more meaningful: even the simplest random gradient mask is an extremely effective means of PEFT. For the ML community, simple and effective methods have always been respected and encouraged, especially for PEFT, a field that is highly relevant to engineering applications. Therefore, for my concern, the authors do not need to add additional experiments. Can the authors give a simple conclusion, for example, in which field, which mask strategy is more effective?
We are very grateful to the reviewer for their feedback on our rebuttal. We are happy to see that our rebuttal addressed most of their concerns. We completely agree that it is important to have clear conclusions for any study. To address this final concern, we propose to include the following separate “Discussion” section in the paper before the “Conclusion” section, where we will summarize our key findings.
==================================
Discussion
To summarize our main contribution, we highlight that SHiRA – when used with even the most basic pruning metrics (such as weight- or gradient-magnitude, SNIP, structured masks, etc.) – significantly outperforms LoRA on a variety of large-scale tasks in both the large vision and large language domains. For LVM style transfer applications, we found that SHiRA-Struct is the most effective masking technique due to its special orthogonality properties that aid multi-adapter fusion. However, SHiRA-SNIP and SHiRA-Grad are not far behind and achieve performance competitive with SHiRA-Struct. On the LLM commonsense reasoning side, SHiRA-SNIP is the best strategy among the masking techniques we have considered in this work. Specifically, as discussed in the main paper, SHiRA-Struct did not achieve good results on the more complex commonsense reasoning tasks since it is a combination of a rank-1 adapter and a highly sparse diagonal adapter. SHiRA-Grad on LLMs is about 0.8% lower in accuracy than SHiRA-SNIP (76.6% vs. 77.4% average accuracy on commonsense reasoning for LLaMA-1). Therefore, in conclusion, for the applications/fields and the masking techniques considered in this paper, SHiRA-SNIP works well across both language and vision domains. Hence, we recommend SHiRA-SNIP as one of the strongest candidates for sparse finetuning.
==================================
As for the SHiRA-Random, we completely agree with the reviewer that the random mask is in fact a strong baseline (Reviewer AaSE also pointed this out). The main reason we could not include it in the main paper was the space limitation during the initial submission. We wanted to include it in the main paper Table 1 to present HPS scores for SHiRA-Rand. However, that would have also required us to put the generated images of the SHiRA-Rand baseline in the main paper. We really did not have the space to include it at the initial submission time. For the next version of the paper, we will move some of these results from the Appendices to the main paper.
About the final but important comment by the reviewer: "On the contrary, I think that if the author's core argument is centered around the following, I think the insight and contribution of this article will be more meaningful: Even the simplest random gradient mask is an extremely effective means of PEFT. For the ML community, simple and effective methods have always been respected and encouraged, especially for PEFT, a field that is highly relevant to engineering applications."
We completely agree with the reviewer on this statement (as we also mentioned in our rebuttal previously). We have also included it in the new “Discussion” section as specified above. Further, we will modify the abstract, introduction, and conclusion sections accordingly to make it clearer that this is one of the fundamental contributions of our work.
We really appreciate the reviewer’s detailed feedback. It has helped improve our work significantly. Please let us know if you have additional concerns. We would be grateful if you could please raise the rating of our work. Thank you.
Most of my concerns are addressed, and it is obvious that the authors made enough effort in the rebuttal. Thus I increase my score to 5. Overall, from my personal view, the biggest pros and cons, in simple words, are as follows:
Pros: the proposed method has almost all the properties of a good PEFT method (it would be better to consider more general tasks to verify effectiveness).
Cons: there seems to be no clear insight proposed. From my view, deeper insight and more discussion of the random-mask approach would be a significant contribution and a very attractive element for others to follow this work.
Thank you so much for all the feedback and for increasing our rating. As a result, the quality of our paper has significantly improved. And yes, we agree that the properties of the random mask would be a very interesting follow up work.
The authors propose a new type of adapter, SHiRA, for parameter-efficient fine-tuning. SHiRA selects a subset of parameters for update and thus enables both rapid adapter switching and multi-adapter fusion, while traditional methods like LoRA can't have it both ways. Experiments based on Stable Diffusion and LLaMA show its effectiveness.
Strengths
- The writing is good and clear.
- SHiRA significantly reduces the fusion time of LoRA-type adapters.
Weaknesses
- The technical novelty is limited. SHiRA could be seen as a type of partial fine-tuning work, which has already been well explored in past years. This paper just applies a similar approach to larger models than previous works, but there is no evidence that previous works like [35,27,3,32,8] can't be used for current large models.
- Another weakness is the value of the solution. As the authors state, this paper aims to solve the problem that traditional methods like LoRA can't rapidly fuse weights when we need to switch between multiple adapters frequently. However, the fusion time might be far less than the inference time, making the objective less meaningful. In addition, there are not many application scenarios with such a requirement.
Questions
The authors are encouraged to further demonstrate the novelty and value of their proposed method in the rebuttal.
Limitations
Yes
We thank the reviewer for the constructive feedback and for appreciating the strengths of our work. Below we address the concerns listed in the weaknesses section:
- Please refer to response (B) in the common response section. In summary, existing partial finetuning techniques enable gradients for the entire base model weight tensors, which makes them impractical for large genAI models. In contrast, with our new SHiRA PEFT implementation, we require 16% lower GPU memory than LoRA for training.
- Practical Importance: Real-time adapter switching is a highly practical problem and is being heavily considered by the industry for on-device deployment of genAI models. For instance, please see reference [P1] that documents the efforts from one of the leading companies in this direction. Section A (common response) further provides concrete data to support the rapid switching problem for on-device applications. Please refer to section A in the common response for more details.
We summarize our key innovations as follows: (1) rapid switching; (2) natural multi-adapter fusion properties of sparse adapters, which demonstrate superior performance to LoRA multi-adapter fusion across various language and vision tasks; (3) a PEFT implementation for sparse finetuning approaches (see section B in the common response); (4) showing that even the simplest SHiRA masking techniques from the pruning literature (e.g., weight magnitude, etc.) significantly outperform LoRA on many diverse tasks. Hence, even without the rapid switching motivation, with our new PEFT implementation, we have demonstrated that sparse finetuning (SHiRA) can now do everything that LoRA and its variants can do, with added deployment benefits. For more details, we request the reviewer to refer to responses (A), (B), and (C) in the common response section.
With these new results, the paper has improved significantly. We would greatly appreciate it if the reviewer could increase the rating of our work.
[P1] Apple Intelligence Foundation Language Models. https://arxiv.org/pdf/2407.21075
Thank you for providing the rebuttal. After reading other reviews and rebuttals, I tend to maintain my initial recommendation.
Low Rank Adaptation (LoRA) is a crucial technique for fine-tuning LLMs and LVMs. This paper addresses two limitations of LoRA: 1. inference overhead while enabling rapid adapter switching; 2. concept loss with multiple adapters. The paper proposes Sparse High Rank Adapter (SHiRA), which directly trains a small portion of the model’s weights while keeping the other weights frozen. SHiRA has the following advantages compared to LoRA: 1. no inference overhead, 2. rapid adapter switching, and 3. less concept loss. Experiments on LVMs and LLMs validate its performance.
Strengths
- Efficient fine-tuning techniques for large models are a significant topic in the current field of deep learning.
- Well structured and easy to follow.
- The proposed technique is simple and sound.
- Extensive experiments demonstrate the effectiveness of SHiRA.
Weaknesses
My main concern lies in the novelty of the approach. Directly fine-tuning a small subset of parameters is a straightforward idea and should have been widely used even before the introduction of LoRA. I don’t see the unique innovation in the proposed method, which makes its superior performance surprising to me.
Questions
See "Weaknesses".
Limitations
The authors provide a discussion on limitations.
We thank the reviewer for recognizing the strengths and contributions of our work to the domain of parameter-efficient finetuning. We are happy to see that the reviewer finds the effectiveness of SHiRA surprising. In fact, this is precisely our point: LoRA is a well-established PEFT method, yet the simplest and most natural way of efficient finetuning outperforms LoRA and its advanced variants. We wanted to highlight this finding and establish SHiRA as a strong baseline for future adapter methods.
We summarize our key innovations as follows: (1) rapid switching; (2) natural multi-adapter fusion properties of sparse adapters, which demonstrate superior performance to LoRA multi-adapter fusion across various language and vision tasks; (3) a PEFT implementation for sparse finetuning approaches (see section B in the common response); (4) showing that even the simplest SHiRA masking techniques from the pruning literature (e.g., weight magnitude, etc.) significantly outperform LoRA on many diverse tasks. Hence, even without the rapid switching motivation, with our new PEFT implementation, we have demonstrated that sparse finetuning (SHiRA) can now do everything that LoRA and its variants can do, with added deployment benefits. For more details, we request the reviewer to refer to responses (A), (B), and (C) in the common response section.
With these new results and discussions added, we believe our paper has significantly improved in establishing the benefits of SHiRA in the domain of parameter-efficient finetuning. We would like to thank the reviewer for all their suggestions. We would greatly appreciate it if the reviewer could increase the rating of our work.
Thank you for the rebuttal. Considering all the comments and the author’s replies, I’ll keep the score.
The paper introduces Sparse High Rank Adapter (SHiRA), a novel method to address the limitations of Low Rank Adaptation (LoRA) in some settings. SHiRA aims to minimize inference overhead, facilitate rapid adapter switching, and reduce concept loss when using multiple adapters. By training only 1-2% of the base model weights, SHiRA maintains high sparsity, enabling efficient on-device deployment. The paper provides both theoretical insights and empirical evidence to support the effectiveness of SHiRA, showcasing its superiority over LoRA in various experiments on large vision and language models.
Strengths
- SHiRA introduces a novel PEFT method that focuses on high sparsity and selective training of base model weights.
- SHiRA significantly reduces the memory and latency overhead associated with adapter switching, making it highly suitable for mobile and edge device deployment. Additionally, it will not introduce any inference overhead.
- By enabling multi-adapter fusion without significant interference, SHiRA addresses a critical limitation of LoRA, thereby preserving the integrity of concurrent adapters.
- SHiRA effectively resolves a major drawback of LoRA by allowing multiple adapters to function together seamlessly.
- The method is tested on various large models, including language and vision tasks.
- The paper provides solid theoretical foundations for the high sparsity and high rank properties of SHiRA.
Weaknesses
- Lack of baseline: From Table 2, we observe that SHiRA works better than LoRA but performs worse than DoRA. Therefore, I was wondering whether the authors could provide some comparison with DoRA on vision tasks.
- Lack of end-to-end efficiency analysis: In the Appendix, the authors provide some evidence of the efficiency of the sparse adapter. However, they do not provide end-to-end task-switching times for LoRA, DoRA, and SHiRA. I think it would be better to include this result in the main paper.
- Limited applications: Although this method emphasizes its advantages in rapid adapter switching and multi-adapter fusion, its practical applications remain questionable. As noted in Appendix B, the method primarily accelerates scenarios with a hidden dimension of 8192, potentially reaching the I/O bottleneck. However, for other settings, the fuse operation may not experience significant slowdowns. For multi-adapter fusion, SHiRA performs better than LoRA-based methods; however, as it still leads to about a 4% performance drop, it may be difficult to apply in real-world settings.
Questions
See weaknesses.
Limitations
See weaknesses.
We thank the reviewer for the constructive feedback and for appreciating the strengths of our work. Below we address the concerns listed in the weaknesses section:
- As discussed in our submitted paper, SHiRA is orthogonal to advanced LoRA variants, e.g., DoRA, and can be efficiently combined with them to improve the expressive power of the SHiRA adapter. Table 2 presents these orthogonality results with respect to DoRA. Specifically, Table 2 shows that SHiRA-WM, when combined with DoRA, improves the performance of the baseline SHiRA-WM by 0.8% accuracy on commonsense benchmarks. Moreover, this SHiRA-WM-DoRA adapter still changes only 1% of the parameters in the base model. Hence, we see that SHiRA is clearly orthogonal to DoRA and can be applied on top of it while preserving the rapid switching benefits. Note that the absolute performance of base SHiRA is very close to DoRA (Table 3 of the submitted paper) while bringing the additional deployment efficiencies.
Further, we provide a qualitative comparison between LoRA, SHiRA, and DoRA on our DreamBooth setup. As shown in Fig. S1, SHiRA produces images of similar quality to LoRA and DoRA, with the added benefit of rapid adapter switching. While the DoRA images also look impressive, the SHiRA dog image looks more diverse than both LoRA and DoRA. Moreover, the canvas image for SHiRA has more of a “canvas” effect, while the outputs of LoRA and DoRA are smoothed out.
- We have included the analysis in section A of the common response. We thank the reviewer for their suggestion for this new analysis; we will add this new data in the main paper.
- We address this concern in three steps:
Practical Importance:
Real-time adapter switching is a highly practical problem and is being heavily considered by the industry for on-device deployment of genAI models. For instance, please see reference [P1] that documents the efforts from one of the leading companies in this direction. Section A (common response) further provides concrete data to support the rapid switching problem for on-device applications.
Speedup in Switching Time:
We have plotted the data from Fig. 7 (Appendix B in the submitted paper) in semi-log-y format in Fig. S3. Clearly, we see 16x to 13x improvements in adapter loading using the scatter-op for SHiRA. Hence, the scatter-op is significantly faster than LoRA fusion even at smaller embedding dimensions. For realistic networks like SDXL and LLaMA2, end-to-end switching times are provided in Table S4 in the rebuttal PDF. Therefore, we still see significant speedup (4.68x-5.71x) in adapter switching for real networks.
Multi-Adapter Fusion:
Compared to LoRA, which suffers 11% degradation upon naïve multi-adapter fusion, SHiRA only degrades by 4%. Also, many works [7, 32, 24] in the literature have shown that naïve merging of LoRA adapters leads to significant concept loss, and hence non-trivial postprocessing is required for effective merging. In this work, we show that naïve SHiRA adapter merging leads to significantly less concept loss and produces better results in both language and vision domains. Finally, note that multi-adapter fusion is a very significant practical use case. Since SHiRA naturally improves multi-adapter fusion properties, it provides a solution to an important real-world problem.
With these new results, the paper has improved significantly. We would greatly appreciate it if the reviewer could increase the rating of our work.
[P1] Apple Intelligence Foundation Language Models. https://arxiv.org/pdf/2407.21075
Thank you for your rebuttal. I will maintain the score.
We thank all reviewers for the constructive feedback and for appreciating SHiRA’s many strengths. Overall, reviewers found that: (1) SHiRA has significant benefits for mobile/edge deployment due to reduced memory and latency for adapter switching (Reviewers K5NY, 2XqP, ZsNM, Xu4N, AaSE); (2) SHiRA addresses a critical limitation of LoRA by significantly improving multi-adapter fusion (K5NY, 2XqP, AaSE); (3) our extensive experiments validate the effectiveness of SHiRA across a wide range of vision and language tasks (K5NY, 2XqP, XU4N, AaSE); (4) our motivation is intuitive (XU4N) and the paper is easy to follow (2XqP, ZsNM, XU4N).
Below, we summarize the new experiments and address common concerns related to (A) Key contributions and value of rapid switching; (B) memory costs of SHiRA during training with a new PEFT implementation and novelty w.r.t. existing partial (sparse) finetuning methods; (C) new experiments on GLUE and Image Classification tasks.
A. Key Contributions and New Data to Support Rapid Switching (K5NY, ZsNM, AaSE)
Key contributions of this work are: (1) rapid switching; (2) natural multi-adapter fusion properties of sparse adapters; (3) PEFT implementation for sparse finetuning approaches (see section B below); (4) show that even the simplest of SHiRA masking techniques significantly outperform LoRA on many diverse tasks. Hence, even without the rapid switching motivation, with our new PEFT implementation, we have demonstrated that sparse finetuning (SHiRA) can now perform everything that LoRA and its variants can do, with the added deployment benefits. Therefore, our contribution must also be seen as establishing sparse finetuning as a strong adapter baseline for future LoRA works.
We now provide more data to support the rapid switching motivation. While end-to-end inference latency is an important part of deployment, another essential factor is user experience. Specifically, user experience entails the time to get the first output, which includes the switching time (i.e., the time to get the adapter inference-ready, a critical pre-inference optimization) as well as the inference time. Users want to switch adapters quickly and flexibly on phones; long adapter fusion times severely degrade the user experience and must be minimized (irrespective of the actual inference time). This need for quick switching among adapters is also highlighted in recent popular industry publications and is in heavy use in the real world (e.g., see Apple’s paper [P1]).
In Table S4 (rebuttal PDF), we present end-to-end switching times for prevalent LVMs and LLMs: SDXL and LLaMA2. Notably, even for a smaller model like SDXL (2.6B params compared to 7B params in LLaMA2), SHiRA achieves a 4.68x faster switching time (0.77s vs. 3.6s), while for LLaMA2, with larger tensor dimensions, SHiRA attains a 5.71x speedup (4.93s vs. 28.15s) on a consumer-grade CPU. Note that fusing LoRA adapters for LLaMA2 on a CPU takes 28.15s (nearly half a minute). Indeed, waiting half a minute for the adapter to switch/fuse is quite substantial and hampers the user experience significantly. In contrast, SHiRA can get the adapter ready for inference within 4.93s, thus significantly improving the user experience. Note that once the adapters are fused, the inference time on the hardware is equal for both LoRA and SHiRA. Moreover, as discussed in reference [1] in our paper, for the unfused LoRA case (which can enable rapid switching), the inference latency can be up to 30% higher, which is not the case with SHiRA.
Finally, this pre-inference user experience optimization has not been a major focus in adapters research. Therefore, another contribution of our work is that we are bringing this new problem into the research community. Such things are very important for practical GenAI deployment. Other similar examples include time to first token for LLM inference, etc.
B. PEFT Implementation of SHiRA and Novelty wrt Partial (Sparse) Finetuning (K5NY, 2XqP, ZsNM, Xu4N, AaSE)
To address SHiRA’s training memory costs, we created a memory- and latency-efficient PEFT-based implementation for SHiRA. As discussed in Appendix B (submitted paper), the scatter_op can be utilized to manage sparse weight updates during inference. Given that SHiRA only finetunes a small subset of the pretrained model weights, we adopt a similar scatter_op-based approach for training. This allows us to retain only the sparse training parameters in the optimizer, thereby significantly reducing the peak GPU memory utilization during training. As shown in Table S3, SHiRA not only trains at almost the same speed as LoRA, but also consumes 16% lower peak GPU memory. Compared to other variants like DoRA, SHiRA training consumes significantly lower peak GPU memory (40% lower) and trains much faster (SHiRA is about 36% faster than DoRA). All memory data was collected using the “psutil” utility within the “Transformers.Trainer” training loop for LLaMA2-7B.
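For illustration, a minimal pure-PyTorch sketch of this training approach is shown below; it is a simplification, not the actual PEFT integration (module and parameter names are hypothetical). Only the sparse `values` tensor is trainable, so the optimizer keeps state for roughly 1% of the weights, while the dense weight is rebuilt on the fly with a scatter op in the forward pass.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ShiraLinearSketch(nn.Module):
    """Wraps a frozen nn.Linear and trains a static sparse set of weight entries."""
    def __init__(self, base: nn.Linear, mask: torch.Tensor):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                           # freeze W (and bias)
        flat_idx = mask.flatten().nonzero(as_tuple=True)[0]   # static mask -> flat indices
        self.register_buffer("idx", flat_idx)
        # Additive sparse delta at the masked positions; equivalent to directly
        # finetuning those entries of W.
        self.values = nn.Parameter(torch.zeros(flat_idx.numel()))

    def forward(self, x):
        w = (self.base.weight.detach().flatten()
             .scatter_add(0, self.idx, self.values)           # scatter trainable values into W
             .view_as(self.base.weight))
        return F.linear(x, w, self.base.bias)

# Usage sketch: wrap a layer with a weight-magnitude mask, then pass only
# module.values to the optimizer so the Adam state stays small.
layer = nn.Linear(4096, 4096)
mask = layer.weight.abs() >= layer.weight.abs().flatten().kthvalue(
    int(0.99 * layer.weight.numel())).values                  # keep the top ~1% by magnitude
shira_layer = ShiraLinearSketch(layer, mask)
opt = torch.optim.Adam([shira_layer.values], lr=1e-4)
```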
Finally, partial finetuning techniques proposed in the pre-LoRA era [35,27,3,32,8] do not have such memory-efficient implementations, which makes them impractical for large generative models. This is because, without the PEFT implementation (e.g., using only gradient masking), the training memory cost is high since gradients are enabled for the whole weight tensor (and not just the 1% of parameters). Therefore, SHiRA significantly outperforms prior partial finetuning techniques in training memory costs and is highly practical for modern LVM and LLM adaptation tasks. This is a clear difference between SHiRA and prior partial (sparse) finetuning techniques.
C. More experiments on Image Classification, GLUE tasks (XU4N)
Tables S1, S2 (rebuttal) show new experiments on image classification and GLUE where we again show SHiRA’s superior performance to other low rank methods. Please see our response (point 1) to Reviewer XU4N for more details.
[P1] Apple Intelligence Foundation Language Models. https://arxiv.org/pdf/2407.21075
Dear Reviewers,
The author-reviewer discussion period will end within two days (Aug 13). Please respond to/acknowledge the author rebuttal and indicate whether it addressed your concerns. Your help is greatly appreciated.
Best, Area Chair
Reviewers agree the paper is leaning toward acceptance. The paper presents a new method, SHiRA, whose utility is clearly motivated - rapid switching of adapter weights across different tasks, with a "fused mode" on mobile devices. Experiments on both vision and language generation are conducted, and the method is comparable to or better than existing methods. Reviewers agree the paper is presented clearly and is easy to follow. Congratulations to the authors on the paper's acceptance.