PaperHub

Average rating: 6.5/10 (Oral) — 4 reviewers; min 5, max 8, std 1.1
Individual ratings: 7, 5, 6, 8
Confidence: 4.3 | Correctness: 3.0 | Contribution: 3.0 | Presentation: 2.3

NeurIPS 2024

HydraLoRA: An Asymmetric LoRA Architecture for Efficient Fine-Tuning

OpenReview | PDF
Submitted: 2024-05-02 · Updated: 2024-11-06


Keywords
Large Language Models, Efficient Fine-Tuning, Asymmetric Structure

Reviews and Discussion

Review
Rating: 7

This paper proposes two improvements to LoRA geared towards heterogeneous corpora, on which LoRA underperforms full fine-tuning. First, it proposes training a number of smaller LoRA heads (A_i, B_i) (LoRA-Split) rather than a single head, which improves performance while preserving the overall number of parameters. Second, the paper proposes an improvement over LoRA-Split, called HydraLoRA, which reduces the number of parameters by sharing the 'A' matrix across domains while allowing the B_i's to vary across domains. This variant uses a Mixture-of-Experts strategy for training and inference and improves performance over LoRA with fewer parameters.
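(For concreteness, a minimal PyTorch sketch of a HydraLoRA-style layer along these lines is shown below; the class name, the gate design, and the scaling are illustrative assumptions rather than the authors' implementation. In this reading, only the shared A, the B heads, and the gate are trained, while the base weight stays frozen.)

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HydraLoRALayer(nn.Module):
    """Illustrative HydraLoRA-style adapter: one shared A, several B heads, soft MoE gate."""
    def __init__(self, d_in, d_out, rank=8, num_b=3, alpha=16.0):
        super().__init__()
        self.base = nn.Linear(d_in, d_out, bias=False)          # frozen pretrained weight W0
        self.base.weight.requires_grad_(False)
        self.A = nn.Parameter(torch.randn(rank, d_in) * 0.01)   # shared down-projection
        self.B = nn.Parameter(torch.zeros(num_b, d_out, rank))  # per-head up-projections, zero init
        self.gate = nn.Linear(d_in, num_b, bias=False)          # router producing head weights
        self.scaling = alpha / rank

    def forward(self, x):                      # x: (batch, d_in)
        h = self.base(x)                       # frozen base output
        z = F.linear(x, self.A)                # shared low-rank projection, (batch, rank)
        w = F.softmax(self.gate(x), dim=-1)    # (batch, num_b) mixing weights
        # weighted sum of the B heads applied to the shared projection
        delta = torch.einsum('bn,nor,br->bo', w, self.B, z)
        return h + self.scaling * delta
```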

Strengths

  • Proposes a new method called HydraLoRA that improves performance over LoRA on heterogeneous corpora with fewer parameters (originality).
  • HydraLoRA does not require domain expertise either at training or inference time (originality).
  • HydraLoRA improves training speed by around 2x relative to LoRA (significance).
  • Reports ablations showing which components matter in the final model (quality).
  • The proposed method is likely to be employed by researchers working on datasets that exhibit heterogeneity (significance).
  • The paper presents an observation based on a t-SNE analysis whereby the 'A' matrices from LoRA heads are similar across domains while the 'B' matrices vary. This is a really useful form of visualization that could be used by researchers working with LoRA (significance).

Weaknesses

  • The proposed method still underperforms full fine-tuning.
  • It looks like inference with HydraLoRA routes each example to all experts (i.e., all B matrices) and then computes a weighted average. The paper does not provide an ablation where only one of the B matrices (the argmax of the gating score) is used at inference time, which might further reduce inference cost.
  • There are some details in the paper which are not clear. See questions below.

Thanks to the authors for addressing many of these issues in the rebuttal.

Questions

  • L54: 'autonomous' -> 'automatic'
  • Fig 2: How is corpus heterogeneity measured?
  • Table 1: How is LoRA-Split trained? Is each LoRA head trained on examples from a specific domain? If so, what are the domains? Are these domains naturally occurring in the corpus, or were they inferred by k-means clustering?
  • Table 3: What is the performance of LoRA-Split?
  • L193: 'With equivalent parameters (rank=16), …' - this is unclear since Table 2 reports performance of HydraLoRA with rank=8.
  • Figure 7: what is the x-axis?
  • L259: How does the variant without MoE work?

Limitations

Yes

Author Response

Thank you for your thoughtful review and insightful comments. We hereby address your concerns below:

W1: HydraLoRA still underperforms full fine-tuning.

  • HydraLoRA is more efficient. HydraLoRA offers low training overhead, allowing LLMs to adapt to specific domain tasks more efficiently than Full Fine-Tuning (FFT). Although HydraLoRA may not match FFT in model performance, as depicted in Figure 2, FFT adjusts all parameters (as shown in Table 2, FFT tunes roughly 800 times more parameters than HydraLoRA), which better captures downstream task features but also incurs substantial costs that may be prohibitive for end-users. FFT also cannot construct efficient multi-head structures like LoRA. Meanwhile, compared to other PEFT methods, HydraLoRA minimizes this performance gap with FFT, as shown in Table 2.

  • HydraLoRA is more robust and adaptive. As the downstream tasks dynamically evolve, the overhead of re-running the FFT process is significant. However, HydraLoRA easily adapts to the changes due to its plug-and-play and asymmetric architecture.

W2: Ablation using only one of the B matrices at inference time.

Thanks for your constructive comment. We add more experiments with the same setting as Table 3, to explore how the number of experts (B matrices) during the HydraLoRA inference pipeline influences performance. As shown in the table below, we find that an increase in the number of B matrices generally leads to enhanced performance in downstream tasks. In practice, user requests may belong to different tasks, while a single request potentially involves mixed tasks. This improvement can be attributed to the expanded configuration space afforded by additional LoRA modules, which allows for a more fine-grained and tailored adaptation to the diverse and mixed-task inputs encountered in the benchmark.

Methods | Base | Top-1 | Top-3 | HydraLoRA
BBH     | 31.6 | 35.4  | 38.6  | 41.5

Table: Sensitivity analysis of the number of B matrices. ‘Base’ means vanilla Llama2-7B, Top-1 means selecting the highest-ranked (top-1) B matrix, and Top-3 means selecting three highest-ranked (top-3) B matrices.
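A hypothetical sketch of how such a Top-k selection could be implemented over the router logits (keep the k highest-scoring B heads and renormalize their weights); this is our reading of the ablation, not the authors' code:

```python
import torch

def topk_gate(logits, k):
    """Keep the k largest gate logits per example; softmax gives zero weight to the rest."""
    topv, topi = logits.topk(k, dim=-1)
    masked = torch.full_like(logits, float('-inf')).scatter_(-1, topi, topv)
    return torch.softmax(masked, dim=-1)
```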

Q1: Typos.

Thanks for pointing them out. We will correct all the typos in the updated version.

Q2: How is corpus heterogeneity measured?

Heterogeneity signifies the diversity within the dataset. To visualize this diversity, we compute the similarity between task embeddings for different tasks. We include an example heatmap figure in the overall rebuttal PDF.
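One plausible way to produce such a heatmap is sketched below; the sentence encoder (`encode`) and the per-task text lists are placeholders, since the rebuttal does not specify them:

```python
import numpy as np

def similarity_heatmap(task_texts, encode):
    """Cosine-similarity matrix between mean task embeddings.

    task_texts: list of lists of strings, one list per task.
    encode: any sentence encoder mapping a list of strings to an (n, dim) array.
    Higher off-diagonal values indicate less heterogeneity between tasks.
    """
    embs = [encode(texts).mean(axis=0) for texts in task_texts]  # one mean vector per task
    E = np.stack(embs)
    E = E / np.linalg.norm(E, axis=1, keepdims=True)
    return E @ E.T  # (num_tasks, num_tasks) similarity matrix
```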

Q3: How does LoRA-Split in Table 1 classify the data?

To simulate real-world scenarios, we assume we cannot know in advance the domains of the data that require fine-tuning. Therefore, LoRA-Split, a baseline we proposed, performs k-means clustering on the data and then fine-tunes a separate LoRA head for each cluster. This approach underscores the importance of exploiting asymmetry in HydraLoRA.
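A minimal sketch of such a pipeline, assuming example embeddings have already been computed with some encoder (the encoder choice and k are placeholders, not necessarily the authors' exact setup):

```python
import numpy as np
from sklearn.cluster import KMeans

def split_corpus(embeddings, k):
    """Cluster example embeddings with k-means and return per-cluster index lists.

    Each returned index list can then be used to fine-tune one LoRA head
    (or, in HydraLoRA, to initialize one B matrix).
    """
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(embeddings)
    return [np.where(labels == c)[0] for c in range(k)]
```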

Q4: The performance of LoRA-Split in Table 3.

For a single dataset, no existing studies have discussed multi-LoRA fine-tuning methods, prompting us to introduce the LoRA-Split variant. In contrast, Table 3 focuses on multi-task scenarios, where numerous methods [1,2,3,4] already exist. Therefore, we directly compare our approach with established LoRA MoE methods [1,2].

Q5: L193 "With equivalent parameters (rank=16)"

In Table 2, HydraLoRA (r=8) means that each A/B matrix has a rank of 8, yet the total parameter count is equivalent to a single LoRA module with a rank of 16, due to the multiple B matrices. Meanwhile, HydraLoRA demonstrates superior performance, further highlighting its efficiency.
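For concreteness, this equivalence can be checked with a quick parameter count (a sketch assuming square d x d weights, rank 8, and 3 B matrices per module; the exact head count per setting is an assumption on our part, and router parameters are ignored as they are comparatively tiny):

```python
def lora_params(d, r):
    # one LoRA pair: A is (r x d), B is (d x r)
    return d * r + r * d

def hydralora_params(d, r, num_b):
    # one shared A (r x d) plus num_b task-specific B matrices (d x r each); router ignored
    return d * r + num_b * r * d

d = 4096  # e.g. the hidden size of Llama2-7B
assert hydralora_params(d, r=8, num_b=3) == lora_params(d, r=16)  # both give 131072
```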

Q6: What is the x-axis in Figure 7?

Figure 7 displays the dataset classification results for different methods, with the x-axis representing the repeated experiment runs; we use a 15-fold experiment, as mentioned on line 274, to provide more representative results.

References:

[1] Pushing mixture of experts to the limit: Extremely parameter efficient moe for instruction tuning. ICLR 2024.

[2] Lorahub: Efficient cross-task generalization via dynamic lora composition. COLM 2024.

[3] Mixture of LoRA Experts, ICLR 2024.

[4] When MOE Meets LLMs: Parameter Efficient Fine-tuning for Multi-task Medical Applications, SIGIR 2024.

Comment

Thanks for your clarifications.

Comment

Thank you for your time and positive consideration of our rebuttal. We appreciate it and are glad it helped clarify concerns and enhance the quality of the paper.

We would be grateful if you would consider raising your final rating to a higher score.

Review
Rating: 5

The paper presents HydraLoRA, an innovative and asymmetric Low-Rank Adaptation (LoRA) framework designed to enhance the efficiency of fine-tuning Large Language Models (LLMs) for specific tasks. The authors identify inefficiencies in the original LoRA approach, particularly its underperformance in complex domains, and propose HydraLoRA to address these issues.

Strengths

Improved Efficiency: The framework requires no domain expertise and outperforms other Parameter-Efficient Fine-Tuning (PEFT) methods, including those that use domain knowledge during training and inference.

Generalizability: The framework shows robust generalization across unseen tasks without relying on prior task-specific knowledge, making it a versatile solution for adapting LLMs to various domains.

Resource Optimization: HydraLoRA is designed to be parameter-efficient, which not only improves performance but also reduces the computational resources required for training and deployment of LLMs.

Weaknesses

HydraLoRA is more computationally intensive than conventional Parameter-Efficient Fine-Tuning (PEFT) methods due to the use of multiple adapter copies.

HydraLoRA requires more training iterations, which can be 1 to 2 times more than typical PEFT methods, affecting the environmental footprint of model training.

The study primarily examines LoRA and does not test additional configurations like prompt-tuning and adapter layers, limiting the scope of the findings.

The method's practical effectiveness in real-world applications outside the experimental setup is not discussed.

Questions

  1. How does the asymmetric structure of HydraLoRA impact the interpretability of the model, and can the authors provide insights into how different components of the model contribute to the final predictions?

  2. The paper uses k-means for initialization. How sensitive are the results to the choice of initialization method, and how does this impact the overall performance?

Limitations

  1. The paper uses k-means for initialization, but it is not clear how sensitive the model's performance is to the choice of initialization method.

  2. The use of multiple adapter copies in HydraLoRA leads to higher training costs compared to conventional PEFT methods.

  3. The asymmetric structure of HydraLoRA may introduce complexity in terms of model interpretability.

Author Response

Thank you for your thoughtful review and insightful comments. We hereby address your concerns below:

W1 & Limitation 2: HydraLoRA uses multiple adapter copies.

The reason for multiple "B" modules is that, in practice, downstream tasks are often complex and multi-task. Traditional PEFT methods typically focus on optimizing for a single task. Tuning a single LoRA to serve all tasks without considering the task differences can lead to reduced performance.

Current approaches [1,2,3,4] often train multiple LoRA modules for multiple tasks, similarly overlooking the task synergies. In contrast, HydraLoRA, by sharing the "A" module and training distinct "B" modules, couples task synergies and task differences, leading to superior performance. Moreover, by sharing the "A" module, HydraLoRA significantly reduces the parameter count to just 11.5% of that of existing methods [2], as detailed in Table 3. Nonetheless, the additional computation introduced is negligible compared to the parameters of LLMs themselves, as shown in Table 2, where HydraLoRA accounts for only 0.124% of the total parameters.

W2: Environmental footprint of model training.

While the vanilla LoRA method incurs higher computational overhead compared to other PEFT approaches, it also delivers significant performance improvements. HydraLoRA, an adaptation of LoRA, enhances downstream task performance with the same parameter settings (rank=16), as demonstrated in Table 2.

Moreover, as Figure 5 illustrates, HydraLoRA cuts energy consumption by 50% compared to Split-LoRA, which uses multiple LoRA modules. This underscores HydraLoRA's efficiency and its eco-friendly nature. Additionally, the carbon footprint of fine-tuning LoRA is effectively negligible when contrasted with full-parameter tuning, highlighting its environmental and computational benefits [5].

W3: Focus primarily on LoRA, without testing other PEFT configurations.

Our core focus is on better understanding and analyzing the LoRA structure (line 35). We first perform a thorough analysis of the LoRA structure, showing that the asymmetry (Figure 3) is primarily due to the different initialization methods of the A and B matrices. However, this characteristic may not be directly transferable to other PEFT methods. We appreciate your suggestion and will consider how similar explorations might be applied to other PEFT techniques.

W4: Effectiveness outside the experimental setup is not discussed.

We have validated HydraLoRA on representative single-domain datasets in General, Medical, Law, Math, and Code (lines 166-175), as well as on the multi-domain dataset Flanv2, which covers 10 distinct task clusters (lines 175-178), effectively simulating common scenarios. We hope this addresses the reviewer's question and we are willing to answer more questions about the setup.

Q1 & Limitation 3: Interpretability of HydraLoRA's asymmetric structure.

Our analysis of the LoRA module breakdown (Figure 3) revealed asymmetrical properties of the A-B modules: post-training, the A module shows similarities across tasks, whereas the B module exhibits distinct differences. This observation aligns with the synergies and differences encountered in downstream multi-task learning with LLMs. Consequently, we have refined the existing LoRA structure and introduced the HydraLoRA asymmetric architecture (Figure 1.C). In this design, the A module captures the commonalities of knowledge, while the B module captures specific characteristics. We hope this addresses the reviewer's question. Could the reviewer please clarify what is meant by “interpretability of the model”? We apologize for any confusion.
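As a rough illustration of how this breakdown could be reproduced (a sketch assuming per-task adapter matrices have already been extracted from trained LoRA checkpoints; the loading step is omitted and the hyperparameters are placeholders):

```python
import numpy as np
from sklearn.manifold import TSNE

def project_adapters(adapter_matrices, perplexity=5):
    """2-D t-SNE projection of flattened adapter matrices (one per task/submodule).

    Tight overlap across tasks suggests shared knowledge (as observed for A),
    while well-separated points suggest task-specific features (as observed for B).
    perplexity must be smaller than the number of matrices supplied.
    """
    X = np.stack([np.asarray(m).reshape(-1) for m in adapter_matrices])
    return TSNE(n_components=2, perplexity=perplexity, random_state=0).fit_transform(X)
```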

Q2 & Limitation 1: Sensitivity to k-means initialization.

As discussed in Section 4.5, we find that the number of clusters k is NOT a sensitive parameter for HydraLoRA: a wide range of reasonable values of k performs decently well in all settings in our experiments (Figure 8). We also compare k-means with sophisticated hyperparameter search approaches and find that k-means is simple but effective (Figure 7).

References:

[1] Mixture of LoRA Experts, ICLR 2024.

[2] Pushing mixture of experts to the limit: Extremely parameter efficient moe for instruction tuning. ICLR 2024.

[3] Lorahub: Efficient cross-task generalization via dynamic lora composition, COLM 2024.

[4] When MOE Meets LLMs: Parameter Efficient Fine-tuning for Multi-task Medical Applications, SIGIR 2024.

[5] Carbon Footprint of LLM Fine Tuning — A Case Study.

Comment

Thanks for the response. I have updated the rating.

Comment

Thank you very much for raising your score! We appreciate it and are glad it helped clarify concerns and enhance the quality of the paper.

Review
Rating: 6

The paper introduces HydraLoRA, a PEFT (Parameter-Efficient Fine-Tuning) architecture designed to improve the efficiency and performance of fine-tuning large language models (LLMs). HydraLoRA's main contribution lies in its asymmetric structure, which employs a shared matrix (A) for commonalities across tasks and multiple distinct matrices (B) for task-specific adaptations. The paper claims that this approach mitigates task interference and enhances parameter efficiency without requiring domain expertise.

Strengths

  • The idea of an asymmetric LoRA architecture that splits the parameter matrices into shared and task-specific components is a somewhat novel approach aimed at addressing the inefficiencies in traditional symmetric PEFT methods.
  • The paper includes a variety of experiments across different domains, including general language tasks, medical, legal, mathematical reasoning, and code generation. This wide scope provides a robust evaluation of HydraLoRA's potential benefits.
  • HydraLoRA is compared with several existing PEFT methods such as Prompt Tuning, P-Tuning, Prefix Tuning, and AdaLoRA, providing a comprehensive view of its performance relative to state-of-the-art techniques.

Weaknesses

  • Many sections of the paper are vague and lack sufficient detail. For example, the exact observations of how the shared matrix (A) and distinct matrices (B) interact and are optimized are not clearly explained. This makes it difficult to fully understand the proposed method. For example, lines 97 to 105 explain Figure 3, but the center and right subfigures are confusing to read: the center subfigure shows that the A matrices have fewer clusters and the heads are more distinct, but the text says the opposite (B is more distinct), and the right subfigure shows that B is more clustered and not easily distinguishable. The workflow section 3.2 is scattered and difficult to follow. Key components of HydraLoRA, such as the structure of the matrices and the routing mechanism, are not described cohesively. The figures provided do not effectively clarify these components.

  • The idea of using MoE and LoRA adapters to implement multiple B matrices is very similar to Mixture of LoRA Experts (https://openreview.net/forum?id=uWvKBCYh4S, ICLR 2024), but not discussed and compared. The difference is probably the rank size selection.

  • The empirical results are incremental: Table 2 shows that most results are within 1% improvement over LoRA. E.g., compared with LoRA-Split or LoRA (r=32), HydraLoRA does use half the trainable parameters, but it is unclear how much inference efficiency it gains.

Questions

  • what are the inference speed gains compared to other PEFT methods?
  • what is the actual training overhead compared to other PEFT methods?

Limitations

The authors discuss that HydraLoRA is computationally demanding, primarily due to the necessity of fine-tuning large-scale language models. It incurs a higher training expenditure than conventional PEFT methods, attributed to the employment of multiple adapter copies.

Author Response

Thank you for your thoughtful review and insightful comments. We hereby address your concerns below:

W1: Clarify the asymmetric structure and workflow.

  • Asymmetric structure: Figure 3 presents the post-fine-tuning characteristics of the LoRA modules within Llama-7B across four different tasks, analyzing the same submodules. Figure 3a displays the total 4x128 submodules across the four LoRAs. Figure 3b illustrates the breakdown of the A matrices (even-numbered): the same submodules (same index) overlap significantly, making them indistinguishable. Conversely, Figure 3c shows the B matrix breakdown (odd-numbered), where, after training on different tasks, the same submodules demonstrate distinct differences, facilitating clear differentiation. This analysis substantiates the HydraLoRA approach of sharing the "A" module and training distinct "B" modules to couple the synergies and differences across tasks.

  • Workflow. For Section 3.2, HydraLoRA fine-tuning first categorizes the dataset to initialize the number of B matrices, essentially constructing the asymmetric structure. Subsequently, these B matrices serve as the experts of an MoE (Eq. 3; see the sketch below). We'd like to clarify that HydraLoRA goes beyond a simple MoE for PEFT approaches. Our core focus is on better understanding and analyzing the LoRA structure (line 35), which delivers superior model performance while maintaining the efficiency benefits of a reduced parameter footprint.
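Read literally, the routing described here amounts to the following (our rendering based on this description; the notation of Eq. 3 in the paper may differ): with a frozen base weight W_0, a shared down-projection A, heads B_1, ..., B_N, and a router W_g, the adapted output for an input x is

```latex
\omega = \operatorname{softmax}(W_g x), \qquad
h = W_0 x + \frac{\alpha}{r} \sum_{i=1}^{N} \omega_i \, B_i A x .
```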

W2: Novelty of HydraLoRA & comparison with the LoRA MoE work MOLE [1].

  • Novelty. We'd like to clarify that HydraLoRA represents an asymmetric architecture enhancement of the vanilla LoRA, while existing LoRA MoE approaches [1,2,3,4] serve as LoRA frameworks for multi-tasks. MoE plays a secondary role in HydraLoRA. We leverage it as a method to aggregate these asymmetric B-matrix modules. Thus, HydraLoRA architecture can be seamlessly adapted into existing enhancements to the LoRA MoE framework, further extending its capabilities and effectiveness.

  • Comparison with MOLE [1]. As the reviewer mentioned, we had noticed the LoRA MoE work MOLE [1], but it is NOT open-sourced (https://github.com/yushuiwx/MoLE/issues). To be fair, in Table 3 we compare HydraLoRA with other similar LoRA MoE works from ICLR 2024 [2] and COLM 2024 [3]. Meanwhile, we have attempted to reproduce MOLE [1], although this is not a guaranteed fair comparison. The results are as follows: MOLE underperforms [2, 3], and HydraLoRA still achieves better performance, which further demonstrates the strong adaptability and efficiency of HydraLoRA.

Llama2-7B | Base | Lorahub [3] | LoRA MoE [2] | MOLE [1] | HydraLoRA
BBH       | 31.6 | 39.7        | 40.3         | 37.4     | 41.5

W3: Results are incremental. HydraLoRA achieves superior performance on downstream tasks with fewer parameters. Specifically,

  • Compared with LoRA (r=8), HydraLoRA (r=8) demonstrates a performance gain of over 5%, as shown in Table 2;
  • Compared with strategies that employ multiple LoRAs directly for Mixture of Experts (Table 3) and LoRA (r=32) (Table 2), HydraLoRA enhances efficiency by sharing the "A" module to capture task synergies and training distinct "B" modules to recognize task differences. Consequently, HydraLoRA reduces parameters by about 88.5% compared to existing methods [2].

Q1: Comparison of inference speed.

For inference, the speed is primarily influenced by the base model. Since the parameters of PEFT modules constitute a small fraction of the total model parameters (ranging from 0.001% to 0.248% as shown in Table 2), the inference latency differences among various PEFT methods are minimal.

The following presents the latency and energy consumption during inference using Llama2-7B with different PEFT methods, evaluated on the WikiText2 dataset using a single NVIDIA A40 GPU. The results show nearly equal energy consumption and latency, but HydraLoRA exhibits the highest model performance.

Method           | Latency (s)   | Energy (Wh)   | MMLU (%)
LLaMA2-7B        | 90.21         | 72.72         | 38.88
+Prompt Tuning   | 91.78 (+1.57) | 73.53 (+0.81) | 39.91 (+1.03)
+P-Tuning        | 91.3 (+1.13)  | 73.87 (+1.15) | 41.11 (+2.23)
+Prefix Tuning   | 92.52 (+2.31) | 74.21 (+1.49) | 41.78 (+2.90)
+LoRA (r=8)      | 92.28 (+2.07) | 73.95 (+1.23) | 43.22 (+4.34)
+HydraLoRA (r=8) | 92.86 (+2.65) | 74.25 (+1.53) | 47.22 (+8.34)

Q2: Comparison of training overhead.

  • Compared with LoRA variants and LoRA MoE methods. HydraLoRA not only enhances performance under the same parameter budget as LoRA variants (rank=16), as shown in Table 2, but also demonstrates substantial parameter reductions compared with LoRA MoE methods (reducing parameters by 88.5% compared to [2] and 72.5% compared to [3], as shown in Table 3). Moreover, as Figure 5 illustrates, HydraLoRA cuts energy consumption by 50% compared to Split-LoRA, which uses multiple LoRA modules. This underscores HydraLoRA's efficiency and system-friendliness.

  • Compared with other PEFT methods. While the vanilla LoRA method involves a higher computational overhead than other PEFT strategies, it offers significant performance gains, as shown in Table 2. However, LoRA's carbon footprint is negligible compared to full-parameter tuning, emphasizing its environmental and computational advantages [4]. Meanwhile, fine-tuning is a one-time cost, whereas inference overhead is the crucial factor. As noted earlier, HydraLoRA boosts performance with minimal additional overhead.

References:

[1] Mixture of LoRA Experts, ICLR 2024.

[2] Pushing mixture of experts to the limit: Extremely parameter efficient moe for instruction tuning. ICLR 2024.

[3] Lorahub: Efficient cross-task generalization via dynamic lora composition. COLM 2024.

[4] Carbon Footprint of LLM Fine Tuning — A Case Study.

Comment

I believe a fair comparison is with LoRA (r=16 or r=32); that's why the improvements are incremental, since LoRA (r=8) uses only half of the parameters that HydraLoRA uses for fine-tuning. Or compare under the same inference latency/compute budget.

Comment

Dear Reviewer 6P2g,

Thank you for your feedback! From Table 2, we can observe that:

  • Compared to LoRA with Rank=16, HydraLoRA with the same parameters improves performance by up to 2.61% and 2.05% on average.
  • Compared to LoRA with Rank=32, HydraLoRA uses only half the parameters, while improving performance by up to 1.60% and 1.29% on average.

Such a performance improvement is sufficiently significant. For example,

  • DoRA [1] improves the performance of LoRA with the same parameters by only 0.84% to 0.88% (Table 2 of its paper).
  • AdaLoRA [2] improves the performance of LoRA with the same parameters by only 0.71% to 0.97% (Table 1 of its paper).
  • MOELoRA [3] improves the performance of LoRA with the same parameters by only 0.66% to 0.98% (Table 2 of its paper).

Therefore, we can be confident that HydraLoRA's improvement is not incremental.

Papers      | DoRA [1]    | AdaLoRA [2] | MOELoRA [3] | HydraLoRA vs. Rank=16 | HydraLoRA vs. Rank=32
Improvement | 0.84%-0.88% | 0.71%-0.97% | 0.66%-0.98% | 2.61%                 | 1.60%

Table: Absolute value of performance improvement of different papers.

If our responses address your concerns, we would be grateful if you would consider raising your final rating to a higher score.

References:

[1] DoRA: Enhancing Parameter-Efficient Fine-Tuning with Dynamic Rank Distribution. ACL 2024.

[2] AdaLoRA: Adaptive Budget Allocation for Parameter-Efficient Fine-Tuning. ICLR 2023.

[3] When MOE Meets LLMs: Parameter Efficient Fine-tuning for Multi-task Medical Applications, SIGIR 2024.

Sincerely,

Authors

Comment

Compared to recent adapter work, HydraLoRA indeed shows better improvements. Thanks for explaining this in detail. But I do hope the authors can take W1 and W2 into the revision carefully. I will adjust my rating accordingly.

Comment

Thank you very much for raising your score! We appreciate it and are glad it helped clarify concerns and enhance the quality of the paper.

Review
Rating: 8

This paper tackles the challenge of efficiently adapting large language models to new tasks. The authors highlight the limitations of current techniques like LoRA, which, while parameter-efficient, struggle with diverse data.

Through a series of experiments, they discover that using multiple, task-specific LoRA modules improves performance but introduces redundancy. They further observe that within these multiple modules, certain parameters consistently learn common knowledge while others specialize in individual tasks.

Based on these findings, they introduce HydraLoRA which utilizes an asymmetric LoRA structure. A single, shared matrix captures the common knowledge identified in their analysis, while multiple smaller matrices, one per task, handle specialized adaptations. This design maximizes learning from diverse data while minimizing redundancy.

Rather than depending on pre-defined task information, HydraLoRA employs a Mixture-of-Experts approach to dynamically route data during training and combine expert outputs during inference.

Experimental results across multiple benchmarks demonstrate HydraLoRA consistently outperforming other efficient fine-tuning methods, including those using MoE. The authors further emphasize HydraLoRA's practical advantages by analyzing its energy consumption and latency.

Strengths

Motivation and Design:

  • The paper excels at connecting its experimental findings to the proposed architecture. Specifically:
    • The authors use t-SNE visualizations to analyze the parameter distributions of LoRA modules trained on different data subsets. This approach reveals a clear pattern: the "A" matrices of these modules tend to converge, indicating common knowledge acquisition, while the "B" matrices remain distinct, suggesting they specialize in task-specific features. This key finding highlights the inherent asymmetric nature of knowledge representation within LoRA and provides the foundation for HydraLoRA's design.
    • Building upon this insight, the authors demonstrate that splitting a single LoRA into multiple, smaller ones, each trained on a different data subset (LoRA-Split), leads to significant performance improvements. This is evident in tasks like MMLU, Medical, and Law, where LoRA-Split consistently outperforms a single, large LoRA with the same parameter budget. These results suggest that intrinsic dataset differences can hinder the performance of a monolithic LoRA, and splitting helps mitigate this by allowing for specialized adaptation to those inherent data variations.

Evaluation:

  • Comparisons against a wide spectrum of PEFT methods, from traditional techniques like Prompt Tuning and P-tuning to more recent ones like AdaLoRA and (IA)3, provide a comprehensive picture of HydraLoRA's effectiveness.
  • Significant Improvement over LoRA MoE: The direct comparison with LoRA MoE is a key strength in my opinion. While both methods utilize MoE, HydraLoRA consistently demonstrates superior performance. This highlights the effectiveness of HydraLoRA's shared "A" matrix in capturing common knowledge and its advantage over using entirely separate LoRA modules. These gains are evident in both accuracy improvements and reduced parameter count, as shown in the BBH benchmark results.
  • Thorough Ablations: The authors present extensive ablation studies to capture the impact of various components. For example, comparing HydraLoRA to a variant with uniform expert weights ("w/o Gate") demonstrates the crucial role of the gating mechanism in selectively applying expert knowledge. This level of detail, presented across multiple benchmarks, strengthens the paper's conclusions and provides a deeper understanding of HydraLoRA's inner workings.

Weaknesses

  • While the shared "A" matrix in HydraLoRA appears effective for the tested benchmarks, the paper could benefit from exploring potential limitations of this design choice. Investigating performance on datasets with very different domains or tasks, where the notion of shared knowledge might be less applicable, would strengthen the claims about its generalizability.

  • The paper would be more convincing with a comparison against a LoRA-Split baseline that uses existing domain knowledge. For example, on a multi-task dataset, directly comparing HydraLoRA against splitting LoRAs by task labels would provide valuable insights into the trade-offs between automatic routing and a more informed, but potentially manual, approach.

  • The paper covers a wide variety of necessary aspects, but the presentation could be more streamlined and easy to read. For example, placing the comparison with MoE-based methods and the discussion about the shared "A" matrix's advantages earlier in the paper would have made this paper more appealing to readers. This would also emphasize HydraLoRA's unique strengths more effectively.

  • A deeper analysis of the MoE router's behavior would have been really interesting. Exploring aspects like its complexity, influence on overall latency, and potential routing biases could provide a more complete picture of its role within HydraLoRA.

  • It's surprising that the authors mention the increased training iterations required by HydraLoRA (1-2 times more than typical PEFT) only within the limitations section. It would have been interesting to explore this nuance further or at least call it out in one of the main sections.

Questions

  • The shared "A" matrix effectively captures common knowledge in your experiments. However, how would HydraLoRA perform on datasets with more disparate domains or tasks where this notion of shared knowledge might be weaker or less well-defined?

  • Did you experiment with other routing techniques, such as top-k routing, during your exploration of HydraLoRA's design? If so, could you elaborate on the performance implications of these different routing strategies and what led you to choose your current approach?

Limitations

Yes

Author Response

Thank you for your thoughtful review and insightful comments. We hereby address your concerns below:

W1 & Q1: Exploring potential limitations of this design.

Thanks for the insightful question. The limitations may primarily stem from the training data. Particularly, in multi-task, extreme conditions such as contaminated or adversarial data can severely impair performance due to aggregation. The heterogeneity between tasks—differences in language, task type, and domain—means that the shared knowledge might be weaker or less noteworthy. Importantly, this challenge is not unique to HydraLoRA but is common across all multi-task frameworks. Robustness enhancement (such as data sanitization, robust aggregation, and anomaly detection) and privacy-enhancing technologies (like homomorphic encryption, differential privacy, and blockchain) may be potential solutions.

W2: Automatic routing vs. manually predefined tasks.

In Table 3, the LoRA MoE baselines [1,2] utilize existing domain knowledge (manual) to train multiple LoRA units, whereas HydraLoRA employs automatic routing. The results indicate that HydraLoRA uses fewer parameters and performs better in downstream tasks. This suggests potential coupling relationships between tasks, aligning closely with real-world conditions where we cannot anticipate the domains needing fine-tuning. Moreover, Section 4.5 reveals that the number of clusters K is not a sensitive parameter for HydraLoRA. It demonstrates the efficiency and robustness of HydraLoRA.

W3: More streamlined presentation.

Thanks for your constructive comment. We will revise the paper based on your suggestions in the updated version.

W4 & Q2: More MoE discussion.

Thanks for your constructive comment. We add more experiments with the same setting as Table 3, to explore how the number of experts (B matrices) during the HydraLoRA inference pipeline influences performance. As shown in the table below, we find that an increase in the number of B matrices generally leads to enhanced performance in downstream tasks.

In practice, user requests may belong to different tasks, while a single request potentially involves mixed tasks. This improvement can be attributed to the expanded configuration space afforded by additional LoRA modules, which allows for a more fine-grained and tailored adaptation to the diverse and mixed-task inputs encountered in the benchmark.

Methods | Base | Top-1 | Top-3 | HydraLoRA
BBH     | 31.6 | 35.4  | 38.6  | 41.5

Table: Sensitivity analysis of the number of B matrices. "Base" means vanilla Llama2-7B, "Top-1" means selecting the highest-ranked (top-1) B matrix, and "Top-3" means selecting three highest-ranked (top-3) B matrices.

W5: Overhead compared with other PEFT methods.

While the vanilla LoRA method incurs higher computational overhead compared to other PEFT approaches, it also delivers significant performance improvements. HydraLoRA, an adaptation of LoRA, enhances downstream task performance with the same parameter settings (rank=16), as demonstrated in Table 2.

Moreover, as Figure 5 illustrates, HydraLoRA cuts energy consumption by 50% compared to Split-LoRA, which uses multiple LoRA modules. This underscores HydraLoRA's efficiency and eco-friendliness. However, LoRA's carbon footprint is negligible compared to full-parameter tuning, emphasizing its environmental and computational advantages [3]. Meanwhile, fine-tuning is a one-time cost, whereas inference overhead is the crucial factor. As the table below shows, HydraLoRA boosts performance with minimal additional overhead.

Method           | Latency (s)   | Energy (Wh)   | MMLU (%)
LLaMA2-7B        | 90.21         | 72.72         | 38.88
+Prompt Tuning   | 91.78 (+1.57) | 73.53 (+0.81) | 39.91 (+1.03)
+P-Tuning        | 91.3 (+1.13)  | 73.87 (+1.15) | 41.11 (+2.23)
+Prefix Tuning   | 92.52 (+2.31) | 74.21 (+1.49) | 41.78 (+2.90)
+LoRA (r=8)      | 92.28 (+2.07) | 73.95 (+1.23) | 43.22 (+4.34)
+HydraLoRA (r=8) | 92.86 (+2.65) | 74.25 (+1.53) | 47.22 (+8.34)

Table: latency and energy consumption during inference using Llama2-7B with different PEFT methods, evaluated on the WikiText2 dataset using a single NVIDIA A40 GPU.

References:

[1] Pushing mixture of experts to the limit: Extremely parameter efficient moe for instruction tuning. ICLR 2024.

[2] Lorahub: Efficient cross-task generalization via dynamic lora composition. COLM 2024.

[3] Carbon Footprint of LLM Fine Tuning — A Case Study.

Author Response

Dear PCs, SAC, AC, and Reviewers:

We sincerely appreciate your thoughtful reviews and insightful comments, and we have tried our best to address your concerns one by one in the corresponding rebuttal sections. If our responses address your concerns, we would be grateful if you could consider raising your final rating to a higher score.

Attached is a PDF containing the task embedding similarity heatmap, supplementing Question 2 posed by Reviewer vvtw.

Wishing you all the best,

Sincerely,

Authors

Final Decision

This paper presents an architectural improvement to LoRA that brings both compute efficiency and quality improvements. Basically, the authors first split the LoRA heads into multiple heads while keeping the number of parameters the same, and then, within the split heads, they use the same A matrix for the multiple heads, thus sharing more parameters while introducing an MoE architecture. The experimental setup is robust and some nice qualitative analysis is provided for the results.