Bi-Share LoRA: Enhancing the Parameter Efficiency of LoRA with Intra-Layer and Inter-Layer Sharing
Abstract
Reviews and Discussion
-
The paper introduces Bi-Share LoRA, an efficient fine-tuning method for large language models (LLMs) designed to reduce memory and latency issues when deploying multiple LoRA models simultaneously. By combining intra-layer and inter-layer parameter sharing with local parameters, Bi-Share LoRA captures both local and global information more effectively.
-
The method uses three parallel sub-LoRAs and transformation techniques to manage shared parameters of different shapes. Experiments on Llama models (7B, 8B, and 13B) show that Bi-Share LoRA, with 44.59% fewer parameters than standard LoRA, achieves improved performance, outperforming standard LoRA by 0.33% on commonsense reasoning and 2.08% on MMLU benchmarks.
Strengths
-
The paper is clearly written and logically structured.
-
The paper leverages the observation of parameter redundancy and sharing across model layers to propose three distinct LoRA modules: Intra-Layer Module, Inter-Layer Module, and Local Module. The overall approach is well-motivated.
-
The authors validate their method using LLaMA models and multiple datasets, including the Commonsense Reasoning benchmark.
Weaknesses
-
The approach of leveraging inter-layer redundancy to design shared LoRA modules is not sufficiently novel, as several existing works have already explored this concept, such as “ShareLoRA: Parameter Efficient and Robust Large Language Model Fine-tuning via Shared Low-Rank Adaptation” and “Tied-LoRA: Enhancing Parameter Efficiency of LoRA with Weight Tying”. These ideas and techniques are already quite prevalent.
-
Using only the Alpaca dataset for fine-tuning is insufficient to demonstrate the method’s generalizability. Other datasets, such as FLAN, CoT, or domain-specific instruction fine-tuning datasets, should be used for instruction tuning to further validate the algorithm.
-
Although the paper proposes a parameter-efficient version of LoRA, it does not provide details on the overall training time for instruction fine-tuning or the inference time using the proposed method.
-
The authors did not conduct experiments on additional models, such as Mistral or Qwen, to demonstrate the robustness of their algorithm.
Questions
-
Can the authors conduct experiments on other instruction fine-tuning datasets?
-
Can the authors provide the speedup ratio during the instruction fine-tuning and inference processes?
-
Could the authors also test their method on models other than LLaMA?
Details of Ethics Concerns
None
We sincerely thank you for your detailed feedback and constructive suggestions. We have carefully considered every comment and conducted additional experiments and analyses where necessary. Below, we address each comment individually (W1 corresponds to Weakness 1, and Q1 denotes Question 1) and provide clarifications, updates, and results that demonstrate how the concerns have been addressed.
-
Answer to W1: You are correct that parameter sharing has been extensively studied to reduce the number of trainable parameters and improve parameter efficiency. Existing works such as ShareLoRA and Tied-LoRA are excellent examples that apply parameter sharing to reduce LoRA’s parameter count. However, these methods primarily focus on sharing parameters across layers with identical module shapes, such as QKV modules. These approaches, while effective, lack flexibility and generalizability when dealing with modules of differing shapes, such as MLP modules. In our preliminary experiments, we observed that modules with different shapes also exhibit high similarity in information entropy, indicating significant potential for parameter sharing; sharing parameters across such modules could further reduce redundancy. However, implementing this in practice presents challenges, particularly for transformer architectures like LLaMA 3 that employ techniques such as GQA (Grouped Query Attention), which leads to inconsistent parameter shapes within the same attention block and between attention and MLP blocks, making it difficult to share parameters using a simple, uniform approach. Our method addresses this limitation by introducing a generalized parameter-sharing mechanism that incorporates shape transformation, allowing shared parameters to adapt to modules with varying shapes and enabling more flexible and universal sharing strategies. Furthermore, we consider both intra-layer and inter-layer sharing to reduce redundancy at multiple levels. As an extension of LoRA, our method degenerates into standard LoRA by setting the rank of the shared parameters to zero. It also allows configurable intra-layer and inter-layer sharing strategies, providing greater flexibility to enhance parameter efficiency.
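To make this structure concrete, below is a minimal PyTorch sketch of a linear layer with three parallel sub-LoRAs (local, intra-layer shared, inter-layer shared). The class, parameter, and function names are hypothetical, only the slice-sharing transformation is shown, and the sketch is an illustration of the described design rather than the paper's implementation.

```python
import torch
import torch.nn as nn

def slice_share(shared_A, shared_B, d_in, d_out):
    """Slice a shared low-rank pair down to the target module's shape
    (one of the three transformations described; the other two are omitted)."""
    return shared_B[:d_out, :], shared_A[:, :d_in]

class BiShareLoRALinear(nn.Module):
    def __init__(self, base, r_local, shared_intra, shared_inter):
        super().__init__()
        self.base = base                              # frozen pre-trained linear
        for p in self.base.parameters():
            p.requires_grad_(False)
        d_out, d_in = base.out_features, base.in_features
        # local sub-LoRA, identical to standard LoRA
        self.A_local = nn.Parameter(torch.randn(r_local, d_in) * 0.01)
        self.B_local = nn.Parameter(torch.zeros(d_out, r_local))
        # shared sub-LoRAs: (A, B) pairs owned outside this module
        self.shared_intra = shared_intra              # shared within the layer
        self.shared_inter = shared_inter              # shared across layers

    def forward(self, x):
        d_in, d_out = self.base.in_features, self.base.out_features
        out = self.base(x) + x @ self.A_local.T @ self.B_local.T
        for shared_A, shared_B in (self.shared_intra, self.shared_inter):
            B, A = slice_share(shared_A, shared_B, d_in, d_out)
            out = out + x @ A.T @ B.T
        return out

# usage: shared pairs are sized to the largest module so slicing always fits
d_max, r_s = 4096, 8
shared_intra = (nn.Parameter(torch.randn(r_s, d_max) * 0.01),
                nn.Parameter(torch.zeros(d_max, r_s)))
shared_inter = (nn.Parameter(torch.randn(r_s, d_max) * 0.01),
                nn.Parameter(torch.zeros(d_max, r_s)))
layer = BiShareLoRALinear(nn.Linear(4096, 1024), r_local=4,
                          shared_intra=shared_intra, shared_inter=shared_inter)
y = layer(torch.randn(2, 4096))                       # -> shape (2, 1024)
```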
-
Answer to W2 & Q1: You are absolutely correct that testing with different fine-tuning datasets is crucial to validate the generalizability of the method. To address this, we conducted additional experiments using the FLAN_v2 dataset to fine-tune the LLaMA 3-8B model and evaluated its performance on the CommonSense task. The results demonstrate that our method remains effective when applied to a new dataset. Detailed results are provided in the appendix.
| Method | openbookqa | arc_challenge | hellaswag | arc_easy | piqa | winogrande | boolq | siqa | avg |
|---|---|---|---|---|---|---|---|---|---|
| LoRA | 45.40 | 53.41 | 79.20 | 80.43 | 79.82 | 74.35 | 83.27 | 47.39 | 67.91 |
| VeRA | 44.80 | 53.92 | 79.14 | 79.92 | 79.54 | 72.69 | 80.95 | 47.03 | 67.25 |
| VB-LoRA | 44.00 | 54.01 | 78.72 | 80.26 | 78.62 | 74.66 | 81.19 | 46.01 | 67.18 |
| Bi-Share-LoRA (SS) | 46.60 | 56.23 | 79.00 | 81.14 | 80.79 | 74.27 | 83.52 | 48.77 | 68.79 |
| Bi-Share-LoRA (GT) | 45.20 | 55.38 | 79.22 | 81.99 | 81.28 | 74.98 | 83.21 | 48.52 | 68.72 |
| Bi-Share-LoRA (KE) | 44.40 | 55.03 | 79.11 | 82.41 | 81.18 | 74.90 | 84.10 | 49.54 | 68.83 |
- Answer to W3 & Q2: We conducted an additional inference experiment using the Llama 1-7B model on the Alpaca dataset with a batch size of 2, measuring both memory usage and inference time. Our results indicate that the majority of memory usage is occupied by the pre-trained model itself, while the memory required for LoRA parameters is relatively small. Bi-Share LoRA further reduces memory usage, which becomes especially beneficial when deploying a large number of LoRA parameters simultaneously.
| Method | Inference Memory (MB) |
|---|---|
| LoRA | 12,914.16 |
| Bi-Share-LoRA | 12,874.62 |

Regarding inference speed, we observed no significant improvement when using Bi-Share LoRA compared to standard LoRA. At this stage, our primary focus is on achieving higher parameter efficiency by reducing parameter size rather than on optimizing inference speed. By reducing the number of LoRA parameters, more LoRA adapters can be kept on each GPU in a multi-task deployment environment, which lowers latency by reducing the number of adapter swaps between GPU and CPU memory. The following table shows the memory usage of our method and LoRA when serving different numbers of adapters; see Appendix A.11 for details.
Memory usage when serving different numbers of LoRAs (MB)

| Method | 300 | 600 | 900 | 1200 | 1500 | 2000 | 2500 | 3000 | 3500 | 3800 |
|---|---|---|---|---|---|---|---|---|---|---|
| LoRA | 28964 | 45014 | 61064 | 77114 | OOM | OOM | OOM | OOM | OOM | OOM |
| Bi-Share-LoRA | 18105 | 23337 | 28568 | 33799 | 39030 | 47749 | 56468 | 65187 | 73905 | 79137 |

However, in future work, we plan to explore strategies to improve inference performance, potentially by integrating system-level optimizations for model inference.
- Answer to W4 & Q3: You are absolutely right that exploring the applicability of our method to other base models is essential. To address this, we conducted additional experiments using the Qwen-2.5-7B model. The model was fine-tuned on the Alpaca dataset and evaluated on the CommonSense task. The results demonstrate that our method performs well on the Qwen-2.5-7B model, further supporting its robustness and generalizability. Details of these experiments can be found in the appendix.
| Method | openbookqa | arc_challenge | hellaswag | arc_easy | piqa | winogrande | boolq | siqa | avg |
|---|---|---|---|---|---|---|---|---|---|
| LoRA | 44.00 | 47.27 | 71.34 | 76.68 | 77.09 | 67.48 | 84.83 | 49.54 | 64.78 |
| VeRA | 48.20 | 51.45 | 79.54 | 79.59 | 78.45 | 68.82 | 85.90 | 50.36 | 67.79 |
| VB-LoRA | 48.20 | 52.30 | 79.57 | 79.38 | 78.45 | 70.24 | 85.69 | 50.56 | 68.05 |
| Bi-Share-LoRA (SS) | 45.60 | 55.55 | 79.09 | 80.98 | 79.16 | 70.17 | 86.39 | 51.02 | 68.49 |
| Bi-Share-LoRA (GT) | 45.00 | 51.45 | 73.96 | 80.22 | 78.18 | 70.24 | 85.47 | 50.67 | 66.90 |
| Bi-Share-LoRA (KE) | 47.00 | 53.67 | 77.68 | 79.88 | 79.82 | 69.85 | 86.18 | 50.15 | 68.03 |
Dear Reviewer csMi,
Thank you for taking the time to review our submission. We appreciate the feedback you’ve provided, which has been invaluable in improving our work.
We noticed that there hasn’t been a response yet to the rebuttal we submitted. We would greatly value any further comments or clarifications you might have on our responses, as your input is critical for the discussion phase and the improvement of our paper.
If there are any specific areas where we could provide additional clarification or address concerns in more detail, please do not hesitate to let us know. We remain committed to addressing any outstanding issues.
Thank you again for your time and effort.
Best regards,
The Authors of Submission 697
Thank you for the authors' response. However, my main concern remains the lack of innovation and the incremental nature of this work. I believe that optimizing LoRA parameters is trivial, and the proposed technique may not bring substantial practical impact. Why would practitioners choose the compressed LoRA parameters over directly using LoRA? Are these parameters significantly memory-intensive? Furthermore, considering that many previous works have already made contributions to compressing LoRA parameters, I am concerned about the novelty of this work. Therefore, I have decided to maintain my current score.
Thank you for taking the time to provide detailed feedback. We appreciate your concerns regarding the novelty and practical impact of our work and would like to further clarify our contributions:
1. Practical Requirement
While LoRA adapters typically require minimal parameter storage and can be merged with the base model to avoid additional inference latency, this approach becomes impractical in multi-task serving scenarios. In such cases, a single inference system must support numerous downstream tasks, making it infeasible to load separate large models for each task due to their massive parameter sizes. Instead, a more practical solution is to decouple the base model and LoRA adapters, loading only the required LoRA parameters for specific tasks as needed.
However, as foundational model sizes continue to grow and the number of downstream tasks increases, the memory footprint of LoRA adapters becomes non-negligible. Even in systems like S-LoRA [1], where LoRA adapters are dynamically loaded into GPU memory during inference and offloaded to main memory when idle, frequent adapter switching can introduce significant latency, especially with diverse and unpredictable user requests. In such cases, retaining commonly used LoRA adapters in GPU memory becomes critical to reducing switching overhead.
Given the limited capacity of GPU memory, reducing the size of individual LoRA adapters is essential to accommodate more adapters and further mitigate switching delays. Additionally, the communication bandwidth between GPU and main memory can become a bottleneck when transferring large adapter parameters, further highlighting the need for parameter-efficient designs. Reducing the size of LoRA adapters is thus both a practical and meaningful research direction.
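To illustrate this serving argument, the following toy sketch (hypothetical class and variable names, not S-LoRA's actual code) shows a GPU-resident adapter cache with LRU eviction: the smaller each adapter's footprint, the more task adapters stay within the GPU budget and the fewer host-device transfers are triggered.

```python
from collections import OrderedDict
import torch

class AdapterCache:
    def __init__(self, gpu_budget_bytes):
        self.budget = gpu_budget_bytes
        self.used = 0
        self.cache = OrderedDict()            # task_id -> dict of GPU tensors

    def get(self, task_id, cpu_adapter):
        if task_id in self.cache:             # hit: no host->device transfer needed
            self.cache.move_to_end(task_id)
            return self.cache[task_id]
        size = sum(t.numel() * t.element_size() for t in cpu_adapter.values())
        # evict least-recently-used adapters until the new one fits
        while self.used + size > self.budget and self.cache:
            _, evicted = self.cache.popitem(last=False)
            self.used -= sum(t.numel() * t.element_size() for t in evicted.values())
        gpu_adapter = {k: t.to("cuda", non_blocking=True) for k, t in cpu_adapter.items()}
        self.cache[task_id] = gpu_adapter
        self.used += size
        return gpu_adapter
```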
2. Existing Work
In terms of inference system optimization, prior works have sought to compress standard LoRA parameters in multi-task serving scenarios. For example, [2] proposes decomposing trained LoRA parameters to extract common components that remain in GPU memory, with task-specific components offloaded to main memory and dynamically loaded during inference. However, this approach depends on the decomposition of trained parameters, which can result in reconstruction losses when tasks differ significantly, thereby degrading inference performance. Alternatively, other works have focused on optimizing LoRA parameters themselves. Methods such as VeRA [3], ShareLoRA [4], and Tied-LoRA [5] have introduced parameter-sharing mechanisms to significantly reduce parameter sizes while retaining well-trained performance. However, these methods primarily share parameters within modules of identical shapes (e.g., QKV matrices within attention layers), overlooking the potential for sharing across modules of different shapes (e.g., QKV in attention layers vs. up/down matrices in FFN layers). This limitation restricts their flexibility and generalizability.
3. Our Solution
Our preliminary experiments reveal that modules with different shapes can also exhibit high similarity, suggesting potential for cross-module parameter sharing. To address the challenges associated with sharing parameters across differently shaped modules, we propose three distinct shape transformation strategies, enabling shared parameters to adapt seamlessly to diverse module shapes. Furthermore, we account for parameter redundancy both within and across layers, providing a more flexible and universal parameter-sharing framework.
An additional advantage of our method lies in the decomposition of ΔW into a combination of shared and local parameters. Our analysis shows that the rank of the resulting ΔW closely approximates the sum of the ranks of the individual components. By increasing the rank of shared parameters while reducing the rank of local parameters, our approach achieves a higher-rank ΔW, enhancing the information capacity and ultimately improving fine-tuning performance.
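As a quick numerical sanity check of this rank argument (our own illustration, not taken from the paper): for random low-rank factors, the rank of the sum is almost surely the sum of the individual ranks, so combining shared and local components yields a higher-rank update than either component alone.

```python
import numpy as np

d_out, d_in, r1, r2 = 512, 512, 8, 8
B1, A1 = np.random.randn(d_out, r1), np.random.randn(r1, d_in)   # e.g. local component
B2, A2 = np.random.randn(d_out, r2), np.random.randn(r2, d_in)   # e.g. shared component
delta_w = B1 @ A1 + B2 @ A2
print(np.linalg.matrix_rank(delta_w))   # prints 16 = r1 + r2 (with probability 1)
```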
Moreover, as a superset of standard LoRA, our method offers flexibility in configuring sharing strategies. It can be tailored to use only intra-layer or inter-layer sharing to reduce redundancy or revert to standard LoRA by setting local parameters without sharing. This adaptability allows practitioners to balance parameter efficiency and model performance based on specific requirements.
We sincerely thank you for taking the time to review our response. We hope these clarifications address your concerns and highlight the novelty and practical value of our work.
[1] S-LoRA: Serving Thousands of Concurrent LoRA Adapters
[2] Compress then Serve: Serving Thousands of LoRA Adapters with Little Overhead
[3] VeRA: Vector-based Random Matrix Adaptation
[4] ShareLoRA: Parameter Efficient and Robust Large Language Model Fine-tuning via Shared Low-Rank Adaptation
[5] Tied-LoRA: Enhancing Parameter Efficiency of LoRA with Weight Tying
Thank you for your response. Considering the concerns about the novelty of this work and the incremental nature of the contribution compared to previous studies, I have decided to maintain my current score and do not recommend this submission for presentation at the ICLR conference.
This paper proposes a novel Bi-Share LoRA method to enable different LoRA modules within a model to have both shared intra-layer and inter-layer parameters. Additionally, this paper presents three shape transformation methods, including Slice Sharing, Gate Transformation, and Kronecker Extension, to tackle the challenge of adapting shared parameters to all modules with different shapes. Results on commonsense reasoning and MMLU benchmarks show Bi-Share LoRA achieves significant parameter savings of about 50% while maintaining or even improving the model’s performance compared to standard LoRA and other existing methods.
Strengths
-
The proposed method, Bi-Share LoRA, along with its three shape transformation methods, is both novel and promising. The idea is well-motivated with supporting visualizations, such as average similarity within and across layers. The shape transformation methods are logical and intuitive.
-
This paper is well-organized and easy to read.
-
This method is evaluated across multiple large language models on various tasks including commonsense reasoning and MMLU, making the results convincing.
-
Some analysis of results is provided to improve interpretability of Bi-Share LoRA.
Weaknesses
-
Some details in the section on the Kronecker extension shape transformation lack clarity. Specifically, it is unclear how the method handles cases where m/k and n/k are not integers. Additionally, the value of k used in the experiments is not specified. In the pseudocode for Kronecker Extension, it states k = d_in mod r, but this appears to be incorrect. Did the authors mean k = d_in // r instead? This section would benefit from revision for improved clarity.
-
While the paper is generally well organized, numerous small typos and formatting issues detract from its quality. Examples include typographical errors such as ‘frozon’ and ‘Traget’ in Figure 2, and a formatting issue in the caption of a table: ‘(University, 2023).’ A thorough proofreading is needed to address these minor errors in paper writing.
-
The code of Bi-Share LoRA is not shared.
Questions
Please refer to the points outlined in the weaknesses section.
We sincerely thank you for your detailed feedback and constructive suggestions. We have carefully considered every comment and conducted additional experiments and analyses where necessary. Below, we address each comment individually (W1 corresponds to Weakness 1, and Q1 denotes Question 1) and provide clarifications, updates, and results that demonstrate how the concerns have been addressed.
-
Answer to W1: The parameter k is a hyperparameter, and through ablation studies we found that setting k = 256 achieves the best trade-off between performance and parameter efficiency. Notably, for the models we evaluated, all parameter dimensions are divisible by 256. To handle edge cases where m/k or n/k are not integers, we have implemented an exception-handling mechanism that alerts users when such cases occur. Additionally, we acknowledge the error in the pseudocode, where k = d_in mod r was incorrectly written; it should indeed be k = d_in // r. We have corrected this error and appreciate the reviewer’s thorough attention to detail.
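For illustration, here is a minimal sketch of a Kronecker-style extension under the assumption that a shared factor of shape (r, d_in / k) is expanded by a small per-module factor of length k via a Kronecker product; the function and variable names are hypothetical, and the paper's exact formulation may differ. It also shows the divisibility check motivating the exception handling mentioned above.

```python
import torch

def kronecker_extend(shared_A, per_module_u, d_in):
    """Expand a shared factor to the (r, d_in) shape a target module needs."""
    r, base_cols = shared_A.shape
    k = per_module_u.shape[-1]
    if base_cols * k != d_in:
        raise ValueError(f"{base_cols} * {k} != {d_in}: choose k so shapes divide evenly")
    return torch.kron(shared_A, per_module_u)        # -> (r, d_in)

shared_A = torch.randn(8, 4096 // 256)               # factor shared across modules
per_module_u = torch.randn(1, 256)                   # cheap per-module factor (k = 256)
A = kronecker_extend(shared_A, per_module_u, d_in=4096)
print(A.shape)                                        # torch.Size([8, 4096])
```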
-
Answer to W2: Thank you for pointing this out. We appreciate your attention to detail, as typos and formatting issues can indeed mislead readers and affect the overall quality of the paper. We have conducted a thorough proofreading of the entire manuscript and corrected these errors to ensure clarity and professionalism.
-
Answer to W3: We have included the training and evaluation code in the supplementary materials for this submission. Additionally, we plan to submit a pull request to the PEFT library to integrate our implementation, making it more accessible for future use.
Thank you for the response. I will keep my score.
Thank you very much for your thoughtful feedback. We sincerely appreciate your insightful suggestions, which have contributed significantly to improving our work. Your support is invaluable, and we are grateful for the time and effort you have put into reviewing our submission.
The paper presents a method to improve the memory efficiency of Low-Rank Adaptation (LoRA) in large language models by introducing Bi-Share LoRA, which integrates local, intra-layer, and inter-layer parameter sharing. This approach captures both local and global information, effectively reducing redundancy and cutting down trainable parameters without compromising performance. The authors also propose three shape transformation techniques—Slice Sharing, Gate Transformation, and Kronecker Extension—to ensure compatibility across varying parameter shapes within model layers. Experimental results demonstrate that Bi-Share LoRA achieves a 56.4% reduction in parameters, underscoring its efficiency and adaptability in multi-task environments.
Strengths
-
The paper introduces a novel combination of intra-layer and inter-layer parameter sharing within the LoRA framework, significantly enhancing memory efficiency for large-scale language models by capturing both local and global information.
-
The structure and explanations are clear, with useful visuals that clarify complex processes, though a bit more simplification on the shape transformation techniques would make these sections more accessible.
-
The experimental results are solid, with Bi-Share LoRA consistently cutting parameter use by 56.4% without sacrificing performance.
Weaknesses
-
The approach lacks a precise mechanism to identify redundancies across layers, relying instead on generalized intra- and inter-layer sharing. This can lead to unnecessary parameter updates in layers where redundancy is minimal.
-
The experiments are primarily conducted on LLaMA models without exploring performance across diverse tasks or modalities.
-
The paper could benefit from more detailed ablations, particularly on the impacts of intra-layer versus inter-layer sharing under different rank settings.
-
There is a lack of thorough analysis of training and inference costs associated with the proposed parameter-sharing methods.
Questions
-
Could the authors elaborate on any criteria or heuristic used for determining layer redundancy in the intra- and inter-layer sharing approach?
-
Given the primary focus on LLaMA models, are there plans to extend evaluations across different domains or task types?
-
Could the authors conduct more in-depth ablation studies to separate the contributions of intra-layer versus inter-layer sharing, especially under varying rank settings?
-
Could the authors provide further analysis on computational trade-offs, such as training and inference time or memory usage?
We sincerely thank you for your detailed feedback and constructive suggestions. We have carefully considered every comment and conducted additional experiments and analyses where necessary. Below, we address each comment individually (W1 corresponds to Weakness 1, and Q1 denotes Question 1) and provide clarifications, updates, and results that demonstrate how the concerns have been addressed.
-
Answer to W1: You raise an important point. Introducing a mechanism that allocates different sharing strategies based on the degree of similarity across layers could indeed reduce the performance degradation caused by parameter sharing. However, LoRA fine-tuning typically involves very low-rank matrices, which inherently contain limited information; as a result, the parameters tend to exhibit a high degree of similarity. In our work, we used mutual information to measure the similarity across modules and layers and found that the parameters consistently exhibit high redundancy. Therefore, our method assumes that all parameters are redundant to some extent, justifying the use of parameter sharing for all modules to reduce the model size. In future work, we will explore using the similarity of pre-trained parameters to dynamically determine which LoRA parameters to share.
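For reference, one simple way to quantify such redundancy is a histogram-based mutual-information score between flattened weight matrices; the sketch below (hypothetical helper, not necessarily the estimator used in the paper) illustrates the idea.

```python
import numpy as np

def mutual_information(w1, w2, bins=64):
    n = min(w1.size, w2.size)
    x, y = w1.flatten()[:n], w2.flatten()[:n]
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)              # marginal of x
    py = pxy.sum(axis=0, keepdims=True)              # marginal of y
    nz = pxy > 0                                      # avoid log(0)
    return float((pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])).sum())

a = np.random.randn(64, 64)
print(mutual_information(a, a))                       # large: a matrix against itself
print(mutual_information(a, np.random.randn(64, 64))) # near zero: unrelated matrices
```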
-
Answer to W2: We have supplemented our experiments to address this concern. Specifically, we conducted additional tests using the Qwen model, fine-tuned on the Alpaca dataset, and evaluated its performance on the CommonSense task. The results demonstrate that our method performs well on the Qwen model, as detailed in the Appendix A.6.
| Method | openbookqa | arc_challenge | hellaswag | arc_easy | piqa | winogrande | boolq | siqa | avg |
|---|---|---|---|---|---|---|---|---|---|
| LoRA | 44.00 | 47.27 | 71.34 | 76.68 | 77.09 | 67.48 | 84.83 | 49.54 | 64.78 |
| VeRA | 48.20 | 51.45 | 79.54 | 79.59 | 78.45 | 68.82 | 85.90 | 50.36 | 67.79 |
| VB-LoRA | 48.20 | 52.30 | 79.57 | 79.38 | 78.45 | 70.24 | 85.69 | 50.56 | 68.05 |
| Bi-Share-LoRA (SS) | 45.60 | 55.55 | 79.09 | 80.98 | 79.16 | 70.17 | 86.39 | 51.02 | 68.49 |
| Bi-Share-LoRA (GT) | 45.00 | 51.45 | 73.96 | 80.22 | 78.18 | 70.24 | 85.47 | 50.67 | 66.90 |
| Bi-Share-LoRA (KE) | 47.00 | 53.67 | 77.68 | 79.88 | 79.82 | 69.85 | 86.18 | 50.15 | 68.03 |
Furthermore, we fine-tuned the LLaMA 3-8B model using the FLAN_v2 dataset and evaluated it on the CommonSense task. The results show that our method consistently outperforms the baselines, further validating its effectiveness. Detailed results can be found in the Appendix A.7.
| Method | openbookqa | arc_challenge | hellaswag | arc_easy | piqa | winogrande | boolq | siqa | avg |
|---|---|---|---|---|---|---|---|---|---|
| LoRA | 45.40 | 53.41 | 79.20 | 80.43 | 79.82 | 74.35 | 83.27 | 47.39 | 67.91 |
| VeRA | 44.80 | 53.92 | 79.14 | 79.92 | 79.54 | 72.69 | 80.95 | 47.03 | 67.25 |
| VB-LoRA | 44.00 | 54.01 | 78.72 | 80.26 | 78.62 | 74.66 | 81.19 | 46.01 | 67.18 |
| Bi-Share-LoRA (SS) | 46.60 | 56.23 | 79.00 | 81.14 | 80.79 | 74.27 | 83.52 | 48.77 | 68.79 |
| Bi-Share-LoRA (GT) | 45.20 | 55.38 | 79.22 | 81.99 | 81.28 | 74.98 | 83.21 | 48.52 | 68.72 |
| Bi-Share-LoRA (KE) | 44.40 | 55.03 | 79.11 | 82.41 | 81.18 | 74.90 | 84.10 | 49.54 | 68.83 |
- Answer to W3: Thank you for the suggestion. In Appendix A.10, we have initially discussed the impact of adjusting one rank value while keeping the other fixed on fine-tuning performance. However, we recognize this is insufficient to fully explore the effects of different sharing strategies. To address this, we conducted additional experiments to analyze the independent impacts of intra-layer and inter-layer sharing under varying rank settings. These experiments compare performance when only intra-layer or inter-layer sharing is applied. The results, detailed in the Appendix A.9, show that intra-layer sharing achieves better performance than inter-layer sharing. However, inter-layer sharing is significantly more parameter-efficient, requiring fewer parameters to achieve comparable results. This trade-off highlights the complementary benefits of combining both strategies.
| Method | Ranks | openbookqa | arc_challenge | hellaswag | arc_easy | piqa | winogrande | boolq | siqa | avg |
|---|---|---|---|---|---|---|---|---|---|---|
| *adjust intra* | | | | | | | | | | |
| Bi-Share-LoRA (SS) | 0,8,0 | 45.00 | 47.70 | 77.32 | 76.64 | 80.25 | 70.01 | 78.10 | 48.41 | 65.43 |
| Bi-Share-LoRA (SS) | 0,16,0 | 45.20 | 47.10 | 77.31 | 77.02 | 80.14 | 70.09 | 78.01 | 48.82 | 65.46 |
| Bi-Share-LoRA (SS) | 0,32,0 | 45.40 | 47.35 | 77.18 | 76.47 | 80.09 | 69.69 | 78.50 | 49.18 | 65.48 |
| Bi-Share-LoRA (GT) | 0,8,0 | 43.20 | 45.99 | 76.52 | 75.63 | 79.11 | 69.85 | 74.98 | 46.16 | 63.93 |
| Bi-Share-LoRA (GT) | 0,16,0 | 43.80 | 45.48 | 75.71 | 75.55 | 79.38 | 69.53 | 74.16 | 45.70 | 63.66 |
| Bi-Share-LoRA (GT) | 0,32,0 | 43.60 | 44.88 | 76.03 | 75.34 | 79.00 | 70.01 | 73.00 | 46.11 | 63.50 |
| Bi-Share-LoRA (KE) | 0,8,0 | 45.40 | 48.04 | 77.16 | 76.35 | 80.20 | 70.72 | 75.90 | 46.78 | 65.07 |
| Bi-Share-LoRA (KE) | 0,16,0 | 45.80 | 46.93 | 77.48 | 76.73 | 80.03 | 70.56 | 75.93 | 47.44 | 65.11 |
| Bi-Share-LoRA (KE) | 0,32,0 | 45.20 | 47.10 | 77.53 | 76.39 | 79.82 | 70.24 | 75.96 | 47.13 | 64.92 |
| *adjust inter* | | | | | | | | | | |
| Bi-Share-LoRA (SS) | 0,0,8 | 44.60 | 48.29 | 77.23 | 77.02 | 80.20 | 69.85 | 77.74 | 48.31 | 65.40 |
| Bi-Share-LoRA (SS) | 0,0,16 | 44.40 | 48.72 | 77.35 | 76.64 | 80.09 | 70.32 | 78.23 | 47.95 | 65.46 |
| Bi-Share-LoRA (SS) | 0,0,32 | 44.40 | 47.78 | 77.36 | 76.94 | 80.03 | 70.80 | 78.75 | 48.21 | 65.53 |
| Bi-Share-LoRA (GT) | 0,0,8 | 44.40 | 45.56 | 76.41 | 75.67 | 79.38 | 69.85 | 75.84 | 45.75 | 64.11 |
| Bi-Share-LoRA (GT) | 0,0,16 | 44.00 | 46.33 | 76.87 | 76.14 | 79.11 | 69.30 | 74.40 | 46.93 | 64.13 |
| Bi-Share-LoRA (GT) | 0,0,32 | 44.40 | 44.97 | 76.00 | 72.73 | 79.11 | 69.38 | 75.11 | 45.75 | 63.43 |
| Bi-Share-LoRA (KE) | 0,0,8 | 43.80 | 45.65 | 75.70 | 75.63 | 79.38 | 70.01 | 75.66 | 46.47 | 64.04 |
| Bi-Share-LoRA (KE) | 0,0,16 | 45.80 | 46.93 | 76.59 | 76.77 | 79.16 | 70.64 | 76.57 | 47.75 | 65.03 |
| Bi-Share-LoRA (KE) | 0,0,32 | 46.00 | 45.56 | 76.17 | 76.30 | 79.16 | 70.01 | 76.79 | 47.80 | 64.72 |
- Answer to W4 & Q4: We conducted an additional inference experiment using the Llama 1-7B model on the Alpaca dataset with a batch size of 2, measuring both memory usage and inference time. Our results indicate that the majority of memory usage is occupied by the pre-trained model itself, while the memory required for LoRA parameters is relatively small. Bi-Share LoRA further reduces memory usage, which becomes especially beneficial when deploying a large number of LoRA parameters simultaneously.
| Method | Inference Memory (MB) |
|---|---|
| LoRA | 12,914.16 |
| Bi-Share-LoRA | 12,874.62 |

Regarding inference speed, we observed no significant improvement when using Bi-Share LoRA compared to standard LoRA. At this stage, our primary focus is on achieving higher parameter efficiency by reducing parameter size rather than on optimizing inference speed. By reducing the number of LoRA parameters, more LoRA adapters can be kept on each GPU in a multi-task deployment environment, which lowers latency by reducing the number of adapter swaps between GPU and CPU memory. The following table shows the memory usage of our method and LoRA when serving different numbers of adapters; see Appendix A.11 for details.
Memory usage when serving different numbers of LoRAs (MB)

| Method | 300 | 600 | 900 | 1200 | 1500 | 2000 | 2500 | 3000 | 3500 | 3800 |
|---|---|---|---|---|---|---|---|---|---|---|
| LoRA | 28964 | 45014 | 61064 | 77114 | OOM | OOM | OOM | OOM | OOM | OOM |
| Bi-Share-LoRA | 18105 | 23337 | 28568 | 33799 | 39030 | 47749 | 56468 | 65187 | 73905 | 79137 |

However, in future work, we plan to explore strategies to improve inference performance, potentially by integrating system-level optimizations for model inference.
-
Answer to Q1: Currently, we use mutual information to measure the similarity of LoRA parameters after fine-tuning. Our analysis shows that the parameters exhibit high similarity (over 90% in terms of information entropy) both within and across layers. Based on this observation, we conclude that LoRA parameters generally have significant redundancy, which motivates our decision to share all parameters. That said, developing a more granular similarity metric or heuristic to determine the extent of parameter sharing prior to fine-tuning could further improve parameter efficiency. This is an important direction we plan to explore in future work.
-
Answer to Q2: Our current experiments are conducted on the LLaMA model within the context of instruction learning tasks. To provide additional validation, we also performed experiments on the Qwen2 model under the same experimental settings, with the results and analysis provided in the appendix. In future work, we plan to extend our evaluations to multimodal models, such as LLaVA, to further demonstrate the effectiveness of our approach. Additionally, we intend to explore other tasks, such as evaluating our method on the GLUE benchmark, to assess its performance across diverse scenarios.
The rebuttal has addressed the reviewer's concerns. The reviewer thinks that such an effective approach benefits the community.
Thank you very much for your thoughtful feedback. We sincerely appreciate your insightful suggestions, which have contributed significantly to improving our work. Your support is invaluable, and we are grateful for the time and effort you have put into reviewing our submission.
This paper introduces the bishare LoRA, which uses the intra-layer and inter-layer sharing to reduce the number of parameters of the standard LoRA method, which keeps competitive or better performance on commonsense reasoning & MMLU. Overall, this paper is well-written and easy to follow. This method seems promising to reduce the number of parameters while keeping competitive performance. Although this paper could be interesting to a certain group of researchers, the experiments require more baselines/ablations to demonstrate the effectiveness of this method. Also, the gains between VeRA and Bi-Share LoRA is not significant enough but VeRA requires much less parameters.
Strengths
This paper is well-written and easy to follow. The method is compared with standard LoRA and achieves significant gains with fewer parameters. The paper conducts multiple ablation experiments, which illustrate various aspects of the method. It also studies three different dimension-transformation methods, which could provide insights for other research requiring such techniques.
Weaknesses
- This method uses both inter- and intra-layer sharing but does not include baselines that use only intra- or only inter-layer sharing. These experiments would be quite useful to better understand which parts provide more gains.
- For LLaMA 1, the performance gains are higher and more consistent than for LLaMA 3, especially relative to VeRA, which uses a much smaller number of parameters. This raises the question of whether this method is still useful for more powerful LLMs.
- One of the motivations of this paper is that when serving multiple LoRAs, memory is important. However, this paper does not provide an analysis of the comparison of multiple Lora servings.
Questions
- What would be the performance if only use inter- or intra-layer sharing?
- Why is the performance gain on LLaMA 3 less significant than on LLaMA 1?
- With the new method, how many more LoRAs can be served?
- Answer to W2 & Q2: This is an insightful question. As the performance of foundational pre-trained models improves, the performance gains from fine-tuning with LoRA tend to diminish. Specifically, we used the Alpaca dataset for fine-tuning, which is a general-purpose instruction dataset covering diverse tasks. For more powerful models like LLaMA 3, it is likely that much of the knowledge in the Alpaca dataset has already been learned during pre-training, resulting in smaller gains compared to less capable models like LLaMA 1. In such cases, fewer additional parameters are required during fine-tuning to achieve strong performance. To investigate this, we conducted a simple experiment using Bi-Share-LoRA, introducing even fewer parameters than VeRA, and observed improved performance.
| Method | Ranks | Params | Param ratio | openbookqa | arc_challenge | hellaswag | arc_easy | piqa | winogrande | boolq | siqa | avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| VeRA | 64 | 796,672 | 0.0099% | 0.4500 | 0.5401 | 0.7927 | 0.8051 | 0.8123 | 0.7332 | 0.8107 | 0.4734 | 0.7096 |
| Bi-Share-LoRA (KE) | 0,0,16 | 22,016 | 0.0003% | 0.4540 | 0.5452 | 0.7938 | 0.8068 | 0.8090 | 0.7380 | 0.8003 | 0.4688 | 0.7088 |
| Bi-Share-LoRA (KE) | 0,0,32 | 30,208 | 0.0004% | 0.4560 | 0.5572 | 0.7973 | 0.8131 | 0.8161 | 0.7214 | 0.8153 | 0.4724 | 0.7132 |
| Bi-Share-LoRA (KE) | 0,0,64 | 46,592 | 0.0006% | 0.4520 | 0.5520 | 0.7970 | 0.8043 | 0.8118 | 0.7395 | 0.8141 | 0.4724 | 0.7130 |
| Bi-Share-LoRA (GE) | 0,0,128 | 73,360 | 0.0010% | 0.4500 | 0.5367 | 0.7940 | 0.8013 | 0.8074 | 0.7364 | 0.8211 | 0.4703 | 0.7096 |
- Answer to W3 & Q3: We have added a comparison experiment to analyze memory usage in multi-LoRA deployment. Specifically, we implemented an inference service using a LLaMA-7B model on an A100 80G GPU. While the model continuously performed inference tasks, we incrementally added new LoRA parameters to memory, recording GPU memory usage after every 100 added LoRA parameters. The memory usage for every 300 LoRA parameters is summarized in the table, with detailed results available in the Appendix.
Memory usage when serving different numbers of LoRAs (MB)

| Method | 300 | 600 | 900 | 1200 | 1500 | 2000 | 2500 | 3000 | 3500 | 3800 |
|---|---|---|---|---|---|---|---|---|---|---|
| LoRA | 28964 | 45014 | 61064 | 77114 | OOM | OOM | OOM | OOM | OOM | OOM |
| Bi-Share-LoRA | 18105 | 23337 | 28568 | 33799 | 39030 | 47749 | 56468 | 65187 | 73905 | 79137 |

The results show that our method reduces memory usage by approximately 60% compared to standard LoRA. Additionally, we tested the maximum deployment capacity and found that our method can load up to 3,878 LoRA adapters into GPU memory, whereas standard LoRA can load only 1,207. These findings highlight the efficiency of our approach in multi-LoRA serving scenarios.
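The measurement loop described above can be sketched roughly as follows (hypothetical loader and variable names; it assumes the adapters' state dicts are already available on the CPU and that the base model keeps serving in the background).

```python
import torch

def record_memory_while_loading(adapters, step=100):
    """Move adapters to the GPU one by one and log memory after every `step` adapters."""
    resident = []                                 # keep references so tensors stay resident
    readings = []
    for i, adapter_state in enumerate(adapters, start=1):
        resident.append({k: v.to("cuda") for k, v in adapter_state.items()})
        if i % step == 0:
            torch.cuda.synchronize()
            readings.append((i, torch.cuda.memory_allocated() / 2**20))   # MB
    return readings
```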
The rebuttal has addressed most of my concerns. I will increase my scores
Thank you very much for your thoughtful feedback. We sincerely appreciate your insightful suggestions, which have contributed significantly to improving our work. Your support is invaluable, and we are grateful for the time and effort you have put into reviewing our submission.
We sincerely thank you for your detailed feedback and constructive suggestions. We have carefully considered every comment and conducted additional experiments and analyses where necessary. Below, we address each comment individually (W1 corresponds to Weakness 1, and Q1 denotes Question 1) and provide clarifications, updates, and results that demonstrate how the concerns have been addressed.
- Answer to W1 & Q1: Thank you for this insightful suggestion. In Section 3, our preliminary experiments indicated that neither intra-layer sharing nor inter-layer sharing alone achieves optimal performance. Additionally, the Contribution Analysis section highlights that combining both strategies yields better results. However, further exploration of individual sharing strategies is indeed valuable for understanding their respective contributions.
To address this, we have added comparison experiments to the paper, evaluating the performance impact of using only intra-layer sharing or only inter-layer sharing. The results, detailed in the Appendix A.9, show that intra-layer sharing alone achieves better performance than inter-layer sharing alone. However, inter-layer sharing is significantly more parameter-efficient, as it requires fewer parameters to achieve comparable results. This trade-off highlights the complementary strengths of the two strategies.
| Method | Ranks | openbookqa | arc_challenge | hellaswag | arc_easy | piqa | winogrande | boolq | siqa | avg |
|---|---|---|---|---|---|---|---|---|---|---|
| *adjust intra* | | | | | | | | | | |
| Bi-Share-LoRA (SS) | 0,8,0 | 45.00 | 47.70 | 77.32 | 76.64 | 80.25 | 70.01 | 78.10 | 48.41 | 65.43 |
| Bi-Share-LoRA (SS) | 0,16,0 | 45.20 | 47.10 | 77.31 | 77.02 | 80.14 | 70.09 | 78.01 | 48.82 | 65.46 |
| Bi-Share-LoRA (SS) | 0,32,0 | 45.40 | 47.35 | 77.18 | 76.47 | 80.09 | 69.69 | 78.50 | 49.18 | 65.48 |
| Bi-Share-LoRA (GT) | 0,8,0 | 43.20 | 45.99 | 76.52 | 75.63 | 79.11 | 69.85 | 74.98 | 46.16 | 63.93 |
| Bi-Share-LoRA (GT) | 0,16,0 | 43.80 | 45.48 | 75.71 | 75.55 | 79.38 | 69.53 | 74.16 | 45.70 | 63.66 |
| Bi-Share-LoRA (GT) | 0,32,0 | 43.60 | 44.88 | 76.03 | 75.34 | 79.00 | 70.01 | 73.00 | 46.11 | 63.50 |
| Bi-Share-LoRA (KE) | 0,8,0 | 45.40 | 48.04 | 77.16 | 76.35 | 80.20 | 70.72 | 75.90 | 46.78 | 65.07 |
| Bi-Share-LoRA (KE) | 0,16,0 | 45.80 | 46.93 | 77.48 | 76.73 | 80.03 | 70.56 | 75.93 | 47.44 | 65.11 |
| Bi-Share-LoRA (KE) | 0,32,0 | 45.20 | 47.10 | 77.53 | 76.39 | 79.82 | 70.24 | 75.96 | 47.13 | 64.92 |
| *adjust inter* | | | | | | | | | | |
| Bi-Share-LoRA (SS) | 0,0,8 | 44.60 | 48.29 | 77.23 | 77.02 | 80.20 | 69.85 | 77.74 | 48.31 | 65.40 |
| Bi-Share-LoRA (SS) | 0,0,16 | 44.40 | 48.72 | 77.35 | 76.64 | 80.09 | 70.32 | 78.23 | 47.95 | 65.46 |
| Bi-Share-LoRA (SS) | 0,0,32 | 44.40 | 47.78 | 77.36 | 76.94 | 80.03 | 70.80 | 78.75 | 48.21 | 65.53 |
| Bi-Share-LoRA (GT) | 0,0,8 | 44.40 | 45.56 | 76.41 | 75.67 | 79.38 | 69.85 | 75.84 | 45.75 | 64.11 |
| Bi-Share-LoRA (GT) | 0,0,16 | 44.00 | 46.33 | 76.87 | 76.14 | 79.11 | 69.30 | 74.40 | 46.93 | 64.13 |
| Bi-Share-LoRA (GT) | 0,0,32 | 44.40 | 44.97 | 76.00 | 72.73 | 79.11 | 69.38 | 75.11 | 45.75 | 63.43 |
| Bi-Share-LoRA (KE) | 0,0,8 | 43.80 | 45.65 | 75.70 | 75.63 | 79.38 | 70.01 | 75.66 | 46.47 | 64.04 |
| Bi-Share-LoRA (KE) | 0,0,16 | 45.80 | 46.93 | 76.59 | 76.77 | 79.16 | 70.64 | 76.57 | 47.75 | 65.03 |
| Bi-Share-LoRA (KE) | 0,0,32 | 46.00 | 45.56 | 76.17 | 76.30 | 79.16 | 70.01 | 76.79 | 47.80 | 64.72 |
Claims and Findings:
- This paper presents an enhancement to the LoRA method for fine-tuning large language models (LLMs), utilizing intra-layer and inter-layer sharing to reduce the number of parameters while maintaining competitive performance on commonsense reasoning and MMLU tasks.
Strengths:
- The proposed method is sound.
- Compared to previous works that adopt a similar sharing concept, the proposed method offers greater flexibility in configuring sharing strategies.
Weaknesses:
- As Reviewer csMi correctly pointed out, there are already many existing works that use sharing strategies to compress LoRA parameters. The proposed method lacks sufficient novelty compared to previous works such as ShareLoRA or Tied-LoRA. However, ShareLoRA was not even cited in the paper.
- There is a lack of empirical comparisons to closely related methods such as ShareLoRA or Tied-LoRA.
- The practical value of the proposed method is limited. It only shows a significant memory advantage over standard LoRA when hundreds of tasks are served simultaneously.
Reasons for Decision:
- Based on the identified weaknesses.
Additional Comments on Reviewer Discussion
The authors provided additional ablation studies and experiments during the rebuttal, which led Reviewer hfue to raise their score from 5 to 6 after the rebuttal. Overall, the AC agrees with Reviewer csMi that this paper falls short in novelty. While the proposed method is sound, it is somewhat incremental. Additionally, it does not compare with the most related methods ShareLoRA or Tied-LoRA. This paper could benefit from another round of revision to address these issues.
Reject