LoraHub: Efficient Cross-Task Generalization via Dynamic LoRA Composition
We present LoraHub, a framework for easily combining multiple LoRA modules to achieve performance comparable to few-shot in-context learning.
Abstract
Reviews and Discussion
The paper tackles the task of employing LoRA (parameter-efficient low-rank adapters) for few-shot learning of unseen tasks: when only a few samples are available for a new task, training a new LoRA module on it may not work well. Instead, the paper considers the setting where several LoRA modules are available, pretrained on a set of upstream tasks. Then, given a few samples for a new task, the proposed LoraHub aims to learn the ideal "meta-weights" such that the agglomerated LoRA module yields the best performance on the new few-shot task. (Note that this requires all upstream LoRA modules to have the same rank.)
To optimize the weights, LoraHub employs Shiwa, a gradient-free method based on combinatorial optimization, and uses it to minimize the loss on the given few-shot samples, with an additional regularization term on the weights.
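In code, the setup the reviewer describes looks roughly like the sketch below. It is a minimal illustration, not the authors' implementation: the adapters, the "ideal" update, and the few-shot loss are toy numpy stand-ins, and it assumes a recent nevergrad version where the Shiwa optimizer is available in the optimizer registry. The ±1.5 bound on the weights follows the discussion later in this thread.

```python
import numpy as np
import nevergrad as ng  # gradient-free optimization library

rng = np.random.default_rng(0)

# Toy stand-ins: N upstream LoRA modules, each a pair of low-rank factors.
# All modules must share the same rank r so their factors can be summed.
N, d, r = 20, 64, 4
loras = [(rng.normal(size=(d, r)), rng.normal(size=(r, d))) for _ in range(N)]
target = rng.normal(size=(d, d))  # pretend "ideal" weight update for the new task

def compose(w):
    # Element-wise weighted sum of the two low-rank factors, then the merged update.
    A_hat = sum(wi * A for wi, (A, _) in zip(w, loras))
    B_hat = sum(wi * B for wi, (_, B) in zip(w, loras))
    return A_hat @ B_hat

def objective(w, alpha=0.05):
    # Stand-in for the model's loss on the few-shot examples,
    # plus an L1 regularization term on the combination weights.
    return float(np.mean((compose(w) - target) ** 2) + alpha * np.abs(w).sum())

# Only N scalars are optimized, so a gradient-free search is affordable.
weights = ng.p.Array(shape=(N,)).set_bounds(-1.5, 1.5)  # bound from the paper
optimizer = ng.optimizers.registry["Shiwa"](parametrization=weights, budget=200)
best_w = optimizer.minimize(objective).value
print("few-shot objective after search:", objective(best_w))
```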
The proposed method is then evaluated using a Flan-T5 backbone on the Big-Bench Hard benchmark. The upstream tasks are the 200 tasks originally used to instruct Flan-T5; in practice, however, a random subset of 20 tasks is used in each run, so that only 20 weights need to be tuned per run. Overall, LoraHub performs almost on par with in-context learning, with the advantage that it requires shorter input prompts and hence fewer tokens to process (essentially the same prompt length as zero-shot learning).
Strengths
- Interesting idea and motivation: I find the core idea of the paper very interesting, with potential applications to fields such as multi-task learning or continual learning. While there are similar concurrent ideas mixing mixture-of-experts with LoRA, I found the results of the paper and its focus on few-shot multi-task learning novel and insightful.
- Clear writing: Overall the paper is clearly written, easy to read and understand.
- Detailed experimental analysis: I found the experimental analysis (Section 5 in the paper) quite interesting; it surfaces useful properties as well as limitations of the proposed method.
Weaknesses
- Computational cost of optimization: Unlike in-context learning, LoraHub does not need to process additional tokens, hence a reduced inference cost. However, it also adds the cost of optimizing the combination weights on the input few-shot samples, in particular when many upstream tasks are available. It would be interesting to discuss the trade-off between these two costs; e.g., if we have some few-shot samples but only want to solve the associated task once, it might be more practical to use in-context learning rather than the optimization pipeline of LoraHub.
- Optimization for many tasks and robustness of LoraHub: It's not clear to me how the optimization method scales to a higher number of upstream LoRA modules, either in terms of cost (see previous point) or performance. In Figure 4, we see that increasing the number of LoRA modules does not always improve performance but strongly affects the variance of the outputs. This suggests that the optimization procedure is noisy and/or does not converge well. As a consequence, the number of candidate upstream LoRA modules must be carefully selected (and the optimum even seems to be task-dependent from Figure 4), which introduces an additional hyperparameter. This can be an important limitation for real-life applications.
Questions
- On the topic of optimizing for many tasks, I am wondering if the authors have considered alternative techniques that might be more robust to optimization noise (beyond the prefiltering strategy mentioned in Section 7): e.g., a hierarchical approach (optimize for multiple random subsets of 20 LoRA candidates, then learn weights for these agglomerated modules) or a curriculum-like approach (gradually drop candidates during optimization if they are consistently given very low weights)?
- Question/suggestion on Table 1: it is not clear to me whether Table 1 reports results averaged over 5 random seeds for all methods or only for LoraHub, and whether the different random seeds affect the choice of the few-shot query samples or only the optimization (e.g., initialization and the library of LoRA modules). Maybe a more complete evaluation would report the average (avg) and best (best) performance for all methods (or even some form of significance test) to understand how robust the other methods are to the random seed.
- Figure 3 and variance: I think Figure 3 would be much more convincing with error bars or some notion of variance. The figure's aim is to illustrate that "LoRA with few samples does not work as well as LoraHub's few-shot learning"; however, the results are only available for 3 tasks, for which the assumption only holds up to 20 few-shot samples; therefore it's not clear how the insight generalizes to other few-shot settings.
We sincerely appreciate the reviewer's examination of our paper and the insightful, detailed comments provided. The constructive feedback has been invaluable in refining our work, and we have incorporated them into our revisions.
Weakness 1: Computational Cost of Optimization
Thank you for your insightful comment regarding the trade-off between the costs of in-context learning and LoraHub. We truly appreciate your observation, and we have addressed this trade-off in our discussion below.
LoraHub reduces inference cost by eliminating the need to process additional prompt tokens, resulting in a noticeable reduction in overall inference expenses. However, it indeed introduces an inherent one-time cost during the Adapt stage, which requires extra inference steps. This creates a trade-off between the in-context learning approach and LoraHub, with the decision typically hinging on the nature of the situation.
For one-time ad-hoc tasks, the in-context learning approach is likely more pragmatic due to LoraHub's additional Adapt-stage costs. In such scenarios, where immediate, single-use solutions are preferred, the simplicity of in-context learning can outweigh the potential savings offered by LoraHub. Conversely, for recurring or similar tasks, LoraHub emerges as a compelling option: despite the initial fixed cost of the Adapt stage, it handles repetitive tasks, often occurring thousands of times, with a much lower per-query cost, establishing it as a practical choice in such scenarios.
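To make this trade-off concrete, here is a back-of-envelope sketch. The per-example token counts are the averages reported in our tables below; the Adapt-stage budget of 40 gradient-free trials over the 5 few-shot examples is an assumed, illustrative figure, not a measured cost.

```python
# Back-of-envelope cost comparison between ICL and LoraHub (illustrative).
ICL_TOKENS = 597.8        # avg prompt tokens per example with ICL (from our tables)
LORAHUB_TOKENS = 111.6    # avg prompt tokens per example with LoraHub
ADAPT_COST = 40 * 5 * LORAHUB_TOKENS  # one-time: assumed trials x examples x tokens

def total_tokens(n_queries: int) -> dict:
    return {
        "icl": ICL_TOKENS * n_queries,
        "lorahub": ADAPT_COST + LORAHUB_TOKENS * n_queries,
    }

# Break-even at ADAPT_COST / (597.8 - 111.6) ~= 46 queries: below that, ICL is
# cheaper (the one-time ad-hoc case above); beyond it, LoraHub costs less.
for n in (10, 100, 10_000):
    print(n, total_tokens(n))
```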
In summary, our intention is not to replace in-context learning with LoraHub, but to present LoraHub as a complementary strategy with performance-efficiency trade-offs. We appreciate your valuable feedback, and in response, we have revised the paper to thoroughly discuss these trade-offs in Section 4.3. Thank you again for bringing this to our attention.
Weakness 2 / Question 1: Optimizing for Many Tasks and the Robustness of Optimization
We are grateful for the thoughtful feedback provided by the reviewer, especially regarding alternative methods to address variance in the optimization process. We highly appreciate the suggestions for exploring a hierarchical or curriculum-like strategy to enhance robustness beyond the prefiltering approach.
While these suggestions are intriguing, we would like to clarify that, based on our in-depth analysis, the primary source of variance in our current method is not related to gradient-free optimization algorithms but rather associated with the LoRA candidate modules. In other words, once the candidates are determined, random seeds have minimal impact on the final performance. Hence, we posit that the observed instability primarily arises from the inherent challenge of balancing the quantity and quality of the LoRA module candidates. This hypothesis receives partial support from our analysis in Section 5, where attempts to increase the number of LoRA modules led to a noticeable rise in model performance volatility.
In response to this observation, we explored a straightforward approach to the selection of LoRA module candidates. Specifically, we first identified the LoRA module candidates with the lowest loss on the few-shot examples, with the goal of minimizing introduced variance and fostering more consistent performance. This selection process, referred to as LoraHub (Filter) in the table below, contributed a slight improvement in average performance compared to the original LoraHub. Below we show a brief summary of the results; the detailed results can be found in Appendix D.
| Task | LoraHub | LoraHub (Filter) |
|---|---|---|
| Boolean Expressions | 55.5 | 60.0 |
| Causal Judgement | 54.3 | 52.9 |
| Date Understanding | 32.9 | 33.3 |
| Disambiguation | 45.2 | 62.7 |
| Dyck Languages | 1.0 | 0.0 |
| ... | | |
| Tracking Shuffled Objects (three objects) | 29.0 | 32.7 |
| Web of Lies | 53.0 | 46.0 |
| Word Sorting | 1.1 | 1.3 |
| Avg Performance Per Task | 34.7 | 35.4 |
However, the filtering method also introduces additional inference overhead, as the losses of all LoRA module candidates must be computed. We have also discussed the variance in Appendix G. We appreciate your insights and remain open to further discussion on refining our methodology. We intend to explore additional strategies in future investigations.
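For concreteness, a minimal sketch of this loss-based filtering step follows; the candidate pool, the scoring function, and the cutoff k are illustrative stand-ins rather than our exact implementation. Note that with k = 1 the same procedure degenerates to retrieving the single best module.

```python
def filter_candidates(candidates, few_shot_loss, k=20):
    """Keep the k LoRA modules with the lowest loss on the few-shot examples.

    candidates:    list of LoRA modules (any representation)
    few_shot_loss: callable scoring a single module on the few-shot set
                   (a stand-in for one forward pass per candidate module)
    """
    scored = sorted(candidates, key=few_shot_loss)  # one extra pass per module
    return scored[:k]

# Toy scores standing in for real few-shot losses:
losses = {"lora_a": 2.3, "lora_b": 0.9, "lora_c": 1.4}
print(filter_candidates(list(losses), losses.get, k=2))  # ['lora_b', 'lora_c']
```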
Question 2: Suggestion on Table 1
Thank you for your constructive feedback regarding Table 1. We have thoroughly reviewed your suggestions, as well as those from other reviewers, and made the necessary adjustments to improve the comprehensiveness of our experimental results. In response, we conducted additional experiments by randomly sampling three sets of five examples each and applying various few-shot methods to assess their robustness. Following your suggestion, we report both the average (avg) and best (best) performance across the multiple runs. Below is a summary of the results obtained from these experiments:
| Task | ICL (best) | IA3 (best) | LoRA (best) | FFT (best) | LoraHub (best) | ICL (avg) | IA3 (avg) | LoRA (avg) | FFT (avg) | LoraHub (avg) |
|---|---|---|---|---|---|---|---|---|---|---|
| Boolean Expressions | 62.7 | 58.0 | 60.7 | 65.3 | 60.7 | 59.6 | 56.2 | 56.0 | 62.2 | 55.5 |
| Causal Judgement | 59.8 | 62.1 | 57.5 | 60.9 | 63.2 | 59.4 | 60.2 | 55.6 | 57.5 | 54.3 |
| Date Understanding | 21.3 | 20.7 | 40.7 | 67.3 | 45.3 | 20.4 | 20.0 | 35.8 | 59.3 | 32.9 |
| Disambiguation | 69.3 | 0.0 | 68.7 | 70.7 | 68.0 | 69.1 | 0.0 | 68.0 | 68.2 | 45.2 |
| Dyck Languages | 2.0 | 4.7 | 25.3 | 33.3 | 2.7 | 0.9 | 4.2 | 22.2 | 19.5 | 1.0 |
| Formal Fallacies | 59.3 | 52.0 | 56.7 | 56.0 | 59.3 | 55.3 | 51.5 | 53.6 | 54.0 | 52.8 |
| Geometric Shapes | 20.0 | 15.3 | 28.7 | 39.3 | 18.7 | 19.6 | 14.7 | 24.0 | 31.1 | 7.4 |
| Hyperbaton | 72.7 | 49.3 | 57.3 | 82.0 | 72.7 | 71.8 | 49.3 | 55.3 | 77.3 | 62.8 |
| ... | | | | | | | | | | |
| Tracking Shuffled Objects (seven objects) | 6.7 | 6.7 | 12.0 | 10.0 | 15.3 | 6.7 | 6.7 | 10.0 | 9.8 | 7.7 |
| Tracking Shuffled Objects (three objects) | 31.3 | 30.7 | 32.0 | 36.0 | 31.3 | 31.1 | 30.7 | 30.9 | 32.0 | 29.2 |
| Web of Lies | 54.0 | 54.7 | 55.3 | 54.0 | 57.3 | 53.8 | 54.2 | 52.7 | 48.2 | 50.1 |
| Word Sorting | 0.7 | 1.3 | 5.3 | 6.0 | 1.3 | 0.5 | 1.3 | 4.9 | 4.9 | 1.1 |
| Average Performance Per Task | 38.4 | 32.1 | 40.9 | 46.2 | 41.2 | 37.3 | 31.6 | 37.7 | 42.1 | 34.7 |
| Avg Tokens Per Example | 597.8 | 111.6 | 111.6 | 111.6 | 111.6 | 597.8 | 111.6 | 111.6 | 111.6 | 111.6 |
| Gradient-based Training | No | Yes | Yes | Yes | No | No | Yes | Yes | Yes | No |
As shown in the table above, all methods exhibit some variance when the few-shot examples change. On average, LoraHub demonstrates a reasonable performance-efficiency trade-off, showcasing solid performance with a notable reduction in tokens. We have further updated Table 1 and introduced Table 3 (Appendix A) to incorporate these results. These tables present both the average and the best performance across different runs for each method. We hope these additional results contribute to a more comprehensive understanding of the methods' performance, effectively addressing your concerns.
Question 3: Figure 3 and Variance
In response to your insightful feedback, we have conducted a thorough analysis of various methods, including LoRA tuning (LoRA), full fine-tuning (FFT), and IA3 fine-tuning (IA3), as suggested by Reviewer 8M9N. Surprisingly, upon averaging the performance of LoRA tuning, our conclusions differ from the observations presented in Figure 3. Essentially, our new experimental results, presented in the updated Table 1 in our response to Question 2, suggest that the previously highlighted instability of LoRA tuning, especially in scenarios with extremely limited examples, may have been overstated. As it stands, the overall performance of LoRA tuning appears strong.
In light of these findings, we have taken the step of removing statements about the instability of LoRA tuning in Section 2, Section 4.2 and Section 5. Additionally, we have temporarily removed Figure 3 from the paper to prevent the dissemination of potentially inaccurate conclusions. In the final version, we are committed to enhancing the comprehensiveness of our presentation by expanding Figure 3 to encompass all tasks in BBH with varying numbers of examples. This extension aims to provide a more nuanced and thorough representation of our research findings. Your feedback has been invaluable in refining the precision and accuracy of our paper.
Dear Reviewer ZnG8,
We sincerely appreciate your thorough review and the valuable suggestions and comments you provided for our paper. We have carefully considered each point and have addressed them in detail in our rebuttal.
As the Author-Review Discussion period is drawing to a close with only two days remaining, we would like to ensure that all your concerns have been adequately addressed. If there are any questions or unresolved issues, we are eager to provide further clarification or make necessary revisions.
Best regards,
The Authors
Dear authors, thank you for your rebuttal and for addressing my questions! Thanks for extending the results of Table 1 and making a stronger comparison to the baselines. It is also very encouraging to see that introducing a selection criterion for the candidate LoRA modules can further boost the performance of LoraHub.
We sincerely appreciate your thoughtful review of our paper and your valuable feedback. We are delighted that you found our revisions, especially the extension of results in Table 1 and the enhanced comparison to the baselines, to be beneficial. Your positive comments are particularly encouraging.
We hope that our responses adequately address your concerns. If you have any additional suggestions or questions, please do not hesitate to let us know. Once again, we appreciate your time and effort in reviewing our work.
Best Regards,
The Authors
The paper introduces LoRAHub, a method designed to enhance performance on unseen tasks by re-utilizing trained LoRA (Low-Rank Adaptation) parameters across different tasks. LoRAHub operates by weighting each of these parameter sets, with weights learned from a minimal set of examples on the unseen task, thereby outperforming zero-shot baselines while achieving results comparable to in-context learning.
Strengths
- The paper presents a novel approach to leveraging previously learned LoRA parameters to improve performance on unseen tasks.
- It demonstrates competitive performance compared to in-context learning while outperforming zero-shot baselines, showcasing the potential of the method.
- It would be beneficial if the authors could release the fine-tuned LoRA weights to the community.
Weaknesses
- The choice of FLAN-T5-Large as the base model is questionable, as a model pre-trained on unsupervised text might have been more relevant for fine-tuning / LoRA fine-tuning on upstream tasks from the FLAN collection.
- The paper lacks clarity in explaining the rationale behind maintaining the same rank for the composed LoRA module and could benefit from exploring higher-rank matrices when composing.
- The selection of 20 LoRAs for unseen tasks seems arbitrary and might limit the method's performance. An iterative procedure or a justification for this selection would have been beneficial.
- The absence of certain baselines, such as the average performance of the top 5 upstream tasks on BBH, leaves gaps in the evaluation.
- The baseline corresponding to the retrieval of a trained LoRA, given a handful of examples from an unseen task, is missing. For example, see https://arxiv.org/abs/2302.03202
Questions
- Can you elaborate on the interpretation of negative coefficients for the LoRA weights?
- Why was the decision made to maintain the same rank for the composed LoRA module? Have higher-rank matrices been explored, and if so, what were the findings?
- Is the selection of 20 LoRAs for unseen tasks fixed, or is there an iterative procedure for this selection? How does this choice impact the method's performance on unseen tasks?
- I believe the strength of the method lies in cases where only a handful of examples from the unseen task are available. If, instead, it is beneficial to use a larger number of examples, it would make sense to compare against methods like IA3 [https://arxiv.org/abs/2205.05638], which fine-tunes efficiently on few-shot examples from unseen tasks.
- Why was FLAN-T5-Large chosen as the base model over a model pre-trained on unsupervised text? Wouldn't it be strange to fine-tune on FLAN tasks using LoRA on an already FLAN multitask-trained backbone model?
- How does constraining the absolute value of the LoRA weights to not exceed 1.5 relate to the method's performance in Section 4.2, and is there a particular significance to this threshold?
- It would be beneficial to include parameter-efficient fine-tuning and traditional fine-tuning performance in Table 1, especially in the setup with a limited number of examples.
- Is In-Context Learning (ICL) performed on the same base language model, or is it conducted using larger decoder-only LLMs? What are the implications of this choice for the comparison of results?
Question 4: The IA3 Baseline
Thank you for your valuable suggestion. We have incorporated your feedback by updating Table 1 to provide a more focused comparison between our method and IA3, a representative efficient few-shot fine-tuning method. The experimental results are shown below.
| Task | Zero-shot | ICL (avg) | IA3 (avg) | LoraHub (avg) |
|---|---|---|---|---|
| Boolean Expressions | 54.0 | 59.6 | 56.2 | 55.5 |
| Causal Judgement | 57.5 | 59.4 | 60.2 | 54.3 |
| Date Understanding | 15.3 | 20.4 | 20.0 | 32.9 |
| Disambiguation | 0.0 | 69.1 | 0.0 | 45.2 |
| Dyck Languages | 1.3 | 0.9 | 4.2 | 1.0 |
| Formal Fallacies | 51.3 | 55.3 | 51.5 | 52.8 |
| Geometric Shapes | 6.7 | 19.6 | 14.7 | 7.4 |
| Hyperbaton | 6.7 | 71.8 | 49.3 | 62.8 |
| Logical Deduction (five objects) | 21.3 | 39.1 | 32.7 | 36.1 |
| Logical Deduction (seven objects) | 12.7 | 40.7 | 33.8 | 36.8 |
| Logical Deduction (three objects) | 0.0 | 51.6 | 8.5 | 45.7 |
| Movie Recommendation | 62.7 | 55.8 | 61.8 | 55.3 |
| Multistep Arithmetic | 0.7 | 0.7 | 0.7 | 0.4 |
| Navigate | 47.3 | 45.3 | 46.2 | 47.1 |
| Object Counting | 34.7 | 32.4 | 35.1 | 33.7 |
| Penguins in a Table | 43.5 | 41.3 | 45.0 | 35.9 |
| Reasoning about Colored Objects | 32.0 | 40.2 | 40.7 | 40.0 |
| Ruin Names | 23.3 | 19.3 | 24.4 | 24.4 |
| Salient Translation Error Detection | 37.3 | 47.3 | 37.1 | 36.0 |
| Snarks | 50.0 | 54.2 | 53.9 | 56.9 |
| Sports Understanding | 56.0 | 54.7 | 55.1 | 56.7 |
| Temporal Sequences | 16.7 | 25.1 | 18.2 | 18.2 |
| Tracking Shuffled Objects (five objects) | 12.0 | 12.0 | 12.0 | 12.3 |
| Tracking Shuffled Objects (seven objects) | 6.7 | 6.7 | 6.7 | 7.7 |
| Tracking Shuffled Objects (three objects) | 24.7 | 31.1 | 30.7 | 29.2 |
| Web of Lies | 54.0 | 53.8 | 54.2 | 50.1 |
| Word Sorting | 1.3 | 0.5 | 1.3 | 1.1 |
| Avg Performance Per Task | 27.0 | 37.3 | 31.6 | 34.7 |
| Avg Tokens Per Example | 111.6 | 597.8 | 111.6 | 111.6 |
| Gradient-based Training | No | No | Yes | No |
The experimental results indicate that our method exhibits competitive, and even superior, average performance compared to IA3. We appreciate your insightful suggestion, which has led to a more targeted evaluation of our approach in comparison to IA3.
Question 6: The Weight Threshold
Thank you for your question! In fact, the final LoRA combination weights all end up with absolute value below 1. The choice of 1.5 mainly accommodates the implementation of the gradient-free optimization method, providing a good initial step size. This hyperparameter plays a role similar to the learning rate in gradient-based training: choosing other values may increase the number of steps required for convergence (it is used to adjust the step size).
Weakness 5: The LoRA Retrieval Baseline
Thank you for pointing out the absence of a baseline corresponding to the retrieval of trained LoRAs given a handful of examples from unseen tasks. We sincerely appreciate your sharing of the related work (https://arxiv.org/abs/2302.03202), and we acknowledge its interesting and novel methods.
In response to your feedback, we have included a discussion of this work in Section 5 ("Does composing LoRA modules extend beyond the single module's benefits?") in the revised paper. We also conducted an experiment involving the LoRA retrieval method as a baseline. Notably, to ensure a fair comparison, we designed a LoRA retrieval mechanism based on the loss derived from the few-shot examples. Specifically, we ranked all LoRA module candidates by their loss on the few-shot examples and evaluated the best candidate on the test set of the unseen task. The performance comparison is shown below.
| Task | LoRA Retrieval | LoraHub (avg) | LoraHub (best) |
|---|---|---|---|
| Boolean Expressions | 53.3 | 55.5 | 60.7 |
| Causal Judgement | 63.2 | 54.3 | 63.2 |
| Date Understanding | 20.7 | 32.9 | 45.3 |
| Disambiguation | 0.0 | 45.2 | 68.0 |
| Dyck Languages | 0.7 | 1.0 | 2.7 |
| Formal Fallacies | 51.3 | 52.8 | 59.3 |
| Geometric Shapes | 9.3 | 7.4 | 18.7 |
| Hyperbaton | 57.3 | 62.8 | 72.7 |
| Logical Deduction (five objects) | 26.0 | 36.1 | 40.0 |
| Logical Deduction (seven objects) | 36.8 | 36.8 | 46.0 |
| Logical Deduction (three objects) | 10.0 | 45.7 | 52.7 |
| Movie Recommendation | 63.3 | 55.3 | 62.0 |
| Multistep Arithmetic Two | 1.3 | 0.4 | 1.3 |
| Navigate | 49.3 | 47.1 | 51.3 |
| Object Counting | 37.3 | 33.7 | 36.7 |
| Penguins in a Table | 41.3 | 35.9 | 47.8 |
| Reasoning about Colored Objects | 40.0 | 40.0 | 44.7 |
| Ruin Names | 21.3 | 24.4 | 28.7 |
| Salient Translation Error Detection | 46.0 | 36.0 | 42.7 |
| Snarks | 50.0 | 56.9 | 61.5 |
| Sports Understanding | 54.7 | 56.7 | 62.7 |
| Temporal Sequences | 17.3 | 18.2 | 21.3 |
| Tracking Shuffled Objects (five objects) | 12.0 | 12.3 | 16.7 |
| Tracking Shuffled Objects (seven objects) | 6.7 | 7.7 | 15.3 |
| Tracking Shuffled Objects (three objects) | 30.7 | 29.2 | 31.3 |
| Web of Lies | 55.3 | 50.1 | 57.3 |
| Word Sorting | 0.7 | 1.1 | 1.3 |
| Average Performance | 31.7 | 34.7 | 41.2 |
As shown in the table above, the performance of LoRA retrieval is notably strong, positioning it as a solid baseline. However, it still falls short of LoraHub, underscoring the necessity and effectiveness of LoRA module composition.
We extend our sincere gratitude to the reviewer for providing insightful perspectives that have significantly enriched the comprehensiveness of our evaluation.
Question 1: The Negative Coefficients
We apologize if our paper's description did not accurately convey our approach. In the composition $\hat{m} = w_1 m_1 + w_2 m_2 + \cdots + w_N m_N$, when a weight parameter $w_i$ assumes a negative value, the operation remains consistent: the corresponding LoRA parameters are subtracted at the specified positions. For instance, with $w = (1.2, -0.4)$, the composed module is simply $\hat{m} = 1.2\,m_1 - 0.4\,m_2$. No additional adjustments are made for negative coefficients. We hope this clarification provides a clearer understanding of the process.
Weakness 4: Absence of Baselines
Thank you for your valuable suggestion. We acknowledge the importance of incorporating such a baseline and, in response, have included new experimental results in our study. Concretely, we evaluated the performance on BBH tasks of the top 5 generally useful LoRA modules presented in Table 2 (more details on these LoRA modules can be found in the paper). The additional experimental results are outlined below.
| Task | WIQA: Last | RACE: Right | WIQA: First | AdversarialQA: BiDAF | WebQuestions: Answer |
|---|---|---|---|---|---|
| Boolean Expressions | 52.7 | 58.0 | 52.7 | 54.7 | 53.3 |
| Causal Judgement | 55.2 | 63.2 | 55.2 | 57.5 | 57.5 |
| Date Understanding | 17.3 | 19.3 | 17.3 | 16.7 | 15.3 |
| Disambiguation | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Dyck Languages | 0.7 | 0.7 | 0.7 | 1.3 | 1.3 |
| Formal Fallacies | 51.3 | 51.3 | 51.3 | 51.3 | 51.3 |
| Geometric Shapes | 8.0 | 13.3 | 8.0 | 6.7 | 7.3 |
| Hyperbaton | 16.7 | 44.0 | 16.7 | 1.3 | 6.0 |
| Logical Deduction (five objects) | 23.3 | 28.0 | 23.3 | 19.3 | 20.7 |
| Logical Deduction (seven objects) | 22.0 | 26.0 | 22.0 | 10.7 | 12.0 |
| Logical Deduction (three objects) | 0.7 | 9.3 | 0.7 | 0.0 | 0.0 |
| Movie Recommendation | 63.3 | 62.7 | 63.3 | 56.7 | 63.3 |
| Multistep Arithmetic | 0.7 | 0.7 | 0.7 | 0.7 | 0.7 |
| Navigate | 47.3 | 50.0 | 47.3 | 47.3 | 47.3 |
| Object Counting | 34.7 | 34.0 | 34.7 | 35.3 | 35.3 |
| Penguins in a Table | 45.7 | 41.3 | 45.7 | 39.1 | 43.5 |
| Reasoning about Colored Objects | 40.0 | 37.3 | 40.0 | 31.3 | 30.7 |
| Ruin Names | 22.0 | 21.3 | 22.0 | 17.3 | 22.7 |
| Salient Translation Error Detection | 36.7 | 34.7 | 36.7 | 32.7 | 37.3 |
| Snarks | 52.6 | 55.1 | 52.6 | 47.4 | 52.6 |
| Sports Understanding | 56.0 | 58.7 | 56.0 | 55.3 | 55.3 |
| Temporal Sequences | 16.7 | 17.3 | 16.7 | 12.7 | 17.3 |
| Tracking Shuffled Objects (five objects) | 12.0 | 12.0 | 12.0 | 10.7 | 12.0 |
| Tracking Shuffled Objects (seven objects) | 6.7 | 6.7 | 6.7 | 6.7 | 6.7 |
| Tracking Shuffled Objects (three objects) | 20.7 | 30.7 | 20.7 | 10.7 | 25.3 |
| Web of Lies | 54.7 | 54.0 | 54.7 | 54.0 | 54.0 |
| Word Sorting | 1.3 | 1.3 | 1.3 | 1.3 | 1.3 |
| Avg Performance per Task | 28.1 | 30.8 | 28.1 | 25.1 | 27.0 |
| Δ over zero-shot FLAN-T5-large | 1.1 | 3.8 | 1.1 | -1.9 | 0.0 |
Notably, the findings indicate that, in most cases, these top LoRA modules perform similarly to the original FLAN-T5-large, and only one module, "RACE: Right", significantly surpasses it. However, it's important to highlight that even this individual performance doesn't reach the level achieved by LoraHub. These results reinforce the conclusion that module composition can indeed enhance overall performance. We appreciate your insightful suggestion, as it has enabled us to more effectively demonstrate the efficacy and necessity of our approach. Your perspective is invaluable to our work.
Thank you for acknowledging our efforts and providing constructive feedback! We are glad to respond to all your concerns below.
Weakness 1 / Question 5: The Choice of FLAN-T5
Thanks for your feedback. We would like to clarify that the selection of FLAN-T5-large as our main base model was a strategic choice based on several considerations, which we explain in detail below.
Firstly, the choice of FLAN-T5 as our base model is grounded in the pursuit of language models with exceptional few-shot capabilities, making it a robust baseline for our study. FLAN-T5 not only excels in both zero-shot and few-shot scenarios but also boasts strong problem-solving capabilities, positioning it as a formidable candidate for our consideration.
Secondly, the choice of FLAN-T5 as our primary base model is crucial to ensuring a fair comparison with zero-shot learning. By utilizing LoRA modules trained on the FLAN collection, our intention is to insulate our method's performance from the influence of additional datasets. Given that the FLAN collection is already seen during FLAN-T5's training, we can reasonably assert that the true efficacy lies in the composition of LoRA modules rather than the introduction of extra datasets. Selecting an unsupervised model, such as T5, as the base would present an unfair advantage for our method, as it would indirectly leverage datasets not encountered during T5's pre-training.
Lastly, we genuinely appreciate and understand your concern. To address it comprehensively, we have also provided an analysis of the performance of our method with T5 as the base model in both Section 5 (Can LoraHub work well on non-instruction-tuned models?) and Appendix B. These sections explicitly showcase the efficacy of our method when applied to models trained exclusively on unsupervised text, underscoring the generality of our approach.
Weakness 2 / Question 2: LoRA Rank Exploration
Thanks for your question! In fact, we have investigated the impact of different ranks in Section 5 (Will the rank of LoRA modules impact the performance of LoraHub learning?) and Appendix B. Our analysis indicates that, for FLAN-T5, the choice of rank has minimal impact; for T5, however, it still exerts some influence. Empirically, in comparison to rank values of 4 or 64, a rank of 16 consistently demonstrates superior performance across different runs, in terms of both average and best performance.
| Task | Rank 4 (avg) | Rank 4 (best) | Rank 16 (avg) | Rank 16 (best) | Rank 64 (avg) | Rank 64 (best) |
|---|---|---|---|---|---|---|
| Average Performance Per Task | 16.1 | 24.2 | 20.8 | 30.7 | 14.8 | 21.4 |
We hope that this exploration addresses your concerns and brings satisfaction. We anticipate that future research endeavors will contribute to a deeper understanding of the factors influencing these effects.
Weakness 3 / Question 3: LoRA Module Selection
Thank you for pointing this out! In our initial experiment, the selection of the LoRA module candidates is conducted randomly, serving as the primary source of randomness in our method.
Acknowledging the need for improvement in this aspect, we explored a straightforward filtering approach, referred to as LoraHub (Filter) in the table below, for the selection of LoRA module candidates. Specifically, we first identified the LoRA module candidates with the lowest loss on the few-shot examples, with the goal of minimizing introduced variance and fostering more consistent performance. The experimental results demonstrate that this approach contributes a slight improvement in average performance. These results have been included in Appendix D for further reference. We hope the updated experiment addresses your concerns.
| Task | LoraHub | LoraHub (Filter) |
|---|---|---|
| Boolean Expressions | 55.5 | 60.0 |
| Causal Judgement | 54.3 | 52.9 |
| Date Understanding | 32.9 | 33.3 |
| Disambiguation | 45.2 | 62.7 |
| Dyck Languages | 1.0 | 0.0 |
| ... | | |
| Tracking Shuffled Objects (three objects) | 29.0 | 32.7 |
| Web of Lies | 53.0 | 46.0 |
| Word Sorting | 1.1 | 1.3 |
| Avg Performance Per Task | 34.7 | 35.4 |
Question 7: Including LoRA and Fine-tuning Results in Table 1
Thank you for your constructive feedback regarding Table 1. We have thoroughly reviewed your suggestions, as well as those from other reviewers, and made the necessary adjustments to improve the comprehensiveness of our experimental results. In response to your suggestion, we have conducted additional experiments on LoRA tuning (LoRA), full fine-tuning (FFT), and report their average (avg) and best (best) performance across three different runs. Below is a summary of the results obtained from these experiments:
| Task | ICL (best) | IA3 (best) | LoRA (best) | FFT (best) | LoraHub (best) | ICL (avg) | IA3 (avg) | LoRA (avg) | FFT (avg) | LoraHub (avg) |
|---|---|---|---|---|---|---|---|---|---|---|
| Boolean Expressions | 62.7 | 58.0 | 60.7 | 65.3 | 60.7 | 59.6 | 56.2 | 56.0 | 62.2 | 55.5 |
| Causal Judgement | 59.8 | 62.1 | 57.5 | 60.9 | 63.2 | 59.4 | 60.2 | 55.6 | 57.5 | 54.3 |
| Date Understanding | 21.3 | 20.7 | 40.7 | 67.3 | 45.3 | 20.4 | 20.0 | 35.8 | 59.3 | 32.9 |
| Disambiguation | 69.3 | 0.0 | 68.7 | 70.7 | 68.0 | 69.1 | 0.0 | 68.0 | 68.2 | 45.2 |
| Dyck Languages | 2.0 | 4.7 | 25.3 | 33.3 | 2.7 | 0.9 | 4.2 | 22.2 | 19.5 | 1.0 |
| Formal Fallacies | 59.3 | 52.0 | 56.7 | 56.0 | 59.3 | 55.3 | 51.5 | 53.6 | 54.0 | 52.8 |
| Geometric Shapes | 20.0 | 15.3 | 28.7 | 39.3 | 18.7 | 19.6 | 14.7 | 24.0 | 31.1 | 7.4 |
| Hyperbaton | 72.7 | 49.3 | 57.3 | 82.0 | 72.7 | 71.8 | 49.3 | 55.3 | 77.3 | 62.8 |
| ... | | | | | | | | | | |
| Tracking Shuffled Objects (seven objects) | 6.7 | 6.7 | 12.0 | 10.0 | 15.3 | 6.7 | 6.7 | 10.0 | 9.8 | 7.7 |
| Tracking Shuffled Objects (three objects) | 31.3 | 30.7 | 32.0 | 36.0 | 31.3 | 31.1 | 30.7 | 30.9 | 32.0 | 29.2 |
| Web of Lies | 54.0 | 54.7 | 55.3 | 54.0 | 57.3 | 53.8 | 54.2 | 52.7 | 48.2 | 50.1 |
| Word Sorting | 0.7 | 1.3 | 5.3 | 6.0 | 1.3 | 0.5 | 1.3 | 4.9 | 4.9 | 1.1 |
| Average Performance Per Task | 38.4 | 32.1 | 40.9 | 46.2 | 41.2 | 37.3 | 31.6 | 37.7 | 42.1 | 34.7 |
| Avg Tokens Per Example | 597.8 | 111.6 | 111.6 | 111.6 | 111.6 | 597.8 | 111.6 | 111.6 | 111.6 | 111.6 |
| Gradient-based Training | No | Yes | Yes | Yes | No | No | Yes | Yes | Yes | No |
Question 8: In-Context Learning Comparison
Thanks for your question! The in-context learning (ICL) results are derived from Flan-T5-large to ensure a fair and consistent comparison. Given that Flan-T5-large has been exposed to the same dataset utilized in training LoRA candidates, this setup establishes a fair basis for assessing the performance of both ICL and LoraHub on the BBH task. To reflect this decision clearly, we have revised the descriptions in the caption of Table 1 accordingly.
If we select T5 LM as the backbone, I agree that the baseline might not be very strong. However, we could incorporate retrieval baselines. This could involve selecting a dataset from the FLAN collection that performs best across various held-out datasets and choosing the most effective dataset from the collection for each held-out dataset for a more effective comparison. I believe this approach is appropriate because, in a general LoRAhub setting, it wouldn't be reasonable to assume a multitask backbone. The strategy would then involve adding LoRA to all held-in datasets, and for held-out datasets, learning the weights using a few examples from each held-out dataset. I understand obtaining results for this setting in a short time frame might be challenging. However, could you provide your thoughts on whether this represents the future direction of LoraHub?
Thank you for running experiments with different ranks. It helps justify the choice of rank=16.
It's encouraging to see that the LoRAhub filter offers a slight improvement over the original method. Thank you for conducting these experiments.
Thank you also for experimenting with choosing the best dataset for all held-out datasets. The results show that RACE: Right outperforms the Flan-T5 Large multitask model by 3.8 points, indicating the drawbacks of multitasking a model. Training LoRA for each dataset and learning to weigh them for held-out datasets seems to be a more beneficial approach.
Regarding the retrieval baseline, I’d like to point out that the original retrieval method involves creating an expert library with keys as encoded examples using a sentence encoder, and values as LoRA weights. For held-out datasets, it operates by taking the inner product of keys, which are encoded representations of a few examples, and retrieves a LoRA weight by majority. Could you please modify the baseline accordingly?
Thank you for clarifying the aspects of negative coefficients, the ICL baseline, and weight thresholds. Are there any experiments where not including a weight threshold negatively impacts the method?
Finally, regarding the PEFT baselines, I just want to confirm: did you train IA3 and LoRA by initializing them randomly and then training only with the 5-10 few-shot examples from the held-out dataset?
I noticed that the average performance on BBH using FLAN-T5-large in Table 1 is higher than the FLAN-T5-xl results in Table 6. I am under the impression that larger models usually perform better. Could you please double-check these results?
Thank you for clarifying the aspects of negative coefficients, the ICL baseline, and weight thresholds. Are there any experiments where not including a weight threshold negatively impacts the method?
Thanks for your question! In response, we conducted an ablation study by removing the threshold. Our observations revealed that this adjustment had minimal impact on the majority of tasks, underscoring the robustness of the gradient-free optimization algorithm in most scenarios. The algorithm efficiently identified reasonable ranges without specific upper and lower bounds.
However, three tasks, namely Date Understanding, Disambiguation, and Hyperbaton, exhibited notable effects. The resulting performance decline led to an average decrease of 1.2 points compared to the version with the threshold. This highlights the significance of establishing a reasonable threshold to mitigate extreme scenarios. A summary of the experimental results is outlined below, with additional details provided in Appendix H.
| Task | With Threshold | Without Threshold |
|---|---|---|
| Boolean Expressions | 55.5 | 54.0 |
| Causal Judgement | 54.3 | 54.8 |
| Date Understanding | 32.9 | 17.7 |
| Disambiguation | 45.2 | 40.6 |
| ... | ... | ... |
| Temporal Sequences | 18.2 | 16.7 |
| Tracking Shuffled Objects (five objects) | 12.3 | 12.3 |
| Tracking Shuffled Objects (seven objects) | 7.7 | 8.5 |
| Tracking Shuffled Objects (three objects) | 29.2 | 29.8 |
| Web of Lies | 50.1 | 50.3 |
| Word Sorting | 1.1 | 1.3 |
| Average Performance | 34.7 | 33.5 |
Finally, regarding the PEFT baselines, I just want to confirm: did you train IA3 and LoRA by initializing them randomly and then training only with the 5-10 few-shot examples from the held-out dataset?
Yes, we trained IA3 and LoRA by initializing them randomly and using only 5 examples, aligning with the same examples employed in LoraHub and In-Context Learning.
I noticed that the average performance on BBH using FLAN-T5-large in Table 1 is higher than the FLAN-T5-xl results in Table 6. I am under the impression that larger models usually perform better. Could you please double-check these results?
Thank you for bringing attention to the discrepancy between the FLAN-T5-large and FLAN-T5-xl results. We have carefully re-evaluated the numbers and can confirm that the reported results for FLAN-T5-xl are accurate. Nevertheless, we appreciate your concern and have conducted a thorough investigation of the results. Based on our observations, FLAN-T5-xl exhibits less robust output formatting (at least on BBH), particularly in zero-shot settings on tasks like logical_deduction and tracking_shuffled_objects. This impacts the exact-match metric, resulting in lower zero-shot performance. For instance, in cases where the expected output should be "(b)", FLAN-T5-xl outputs "b. large rectangular Turkish knife", despite being prompted to output only the option itself. We are committed to addressing this issue by implementing more robust post-processing techniques in the final version. Your careful observation has prompted us to enhance the reliability and accuracy of our experimental evaluations.
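As an illustration of the kind of post-processing we have in mind (a heuristic sketch, not our final implementation), the option letter can be recovered from free-form output before computing exact match:

```python
import re

def extract_option(output: str) -> str:
    """Heuristic sketch: recover a multiple-choice option letter from free-form
    model output, so 'b. large rectangular Turkish knife' is scored as '(b)'."""
    text = output.strip().lower()
    # Either the output starts with a letter followed by '.' or ')',
    # or it contains a parenthesized option somewhere in the text.
    m = re.match(r"\(?([a-z])[.)]", text) or re.search(r"\(([a-z])\)", text)
    return f"({m.group(1)})" if m else text

assert extract_option("b. large rectangular Turkish knife") == "(b)"
assert extract_option("(b)") == "(b)"
assert extract_option("The answer is (b).") == "(b)"
```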
Finally, we hope you will take into consideration an increase in your scores based on the aforementioned improvements. As we work on revising the paper to address your concerns, please feel free to let us know if there's any clarification or additional information we can provide.
I genuinely appreciate the authors' efforts in experimenting with the suggested baselines and believe that the framework established by this paper will benefit the community. However, while the proposed method offers better inference benefits compared to ICL, it does not show any performance advantages over ICL and even underperforms compared to LoRA tuning. The latter has a similar inference speed to the proposed method, with perhaps a slight overhead in training, but this is only a one-time requirement. Therefore, I am keeping my score. I wish the authors the best of luck in continuing to develop in this direction.
We appreciate your time and effort in reviewing our paper, and we are grateful for your thoughtful comments and feedback. Your acknowledgment of our efforts in experimenting with the suggested baselines is encouraging, and we are pleased to hear that you see the potential benefits our framework can bring to the community.
We respect your perspective on our paper, and we appreciate the opportunity to address your concerns in the rebuttal. We believe that our work offers unique merits that extend beyond performance, as we conveyed to Reviewer pXQC: We believe our work opens an interesting research direction in composing LoRA modules for unseen task generalization. While our current prototype still requires few-shot examples, we are excited by the possibility of automating the module composition process without any example. Our vision is to democratize artificial intelligence. Specifically, we aim to establish a LoRA platform where users can seamlessly share and access well-trained LoRA modules for application in new tasks. LoRA providers can freely share or sell their modules on the platform without compromising data privacy. Users can leverage powerful LoRA modules contributed by others, utilizing algorithms for automated distribution, with the process requiring only CPU capability.
We hope that, by elucidating our broader vision and the potential implications of LoRAHub, we can convey the significance of our contributions to the community. Thank you once again for your detailed and insightful feedback, and we welcome any further suggestions or comments you may have.
Best Regards,
The Authors
We sincerely appreciate your acknowledgment of the significance of our recent experiment and are grateful for your encouraging remarks. Your thoughtful and constructive feedback is highly valued, serving as a guiding force to improve the quality and direction of our work. We genuinely thank you for dedicating your time and expertise to provide insightful comments that will undoubtedly contribute to refining our research. In response to your inquiries, we have provided detailed responses below.
If we select T5 LM as the backbone, I agree that the baseline might not be very strong. However, we could incorporate retrieval baselines. This could involve selecting a dataset from the FLAN collection that performs best across various held-out datasets and choosing the most effective dataset from the collection for each held-out dataset for a more effective comparison. I believe this approach is appropriate because, in a general LoRAhub setting, it wouldn't be reasonable to assume a multitask backbone. The strategy would then involve adding LoRA to all held-in datasets, and for held-out datasets, learning the weights using a few examples from each held-out dataset. I understand obtaining results for this setting in a short time frame might be challenging. However, could you provide your thoughts on whether this represents the future direction of LoraHub?
Thank you for your insightful suggestions regarding the improvement of our experimental setup. We highly value your recommendation to incorporate a retrieval baseline from the FLAN collection, as it aligns more closely with practical applications and establishes a reasonable baseline for comparative analysis. Constructing LoraHub on base language models, as you proposed, is a promising approach that not only boosts its generality but also reduces the risk of potential overfitting of language models to specific tasks.
Although time constraints limit our capacity to conduct experiments on both the retrieval baseline and construct LoraHub using other base language models beyond T5-large, we fully recognize that exploring these avenues is essential for enhancing the overall value of our paper. Therefore, we are committed to incorporating experiments featuring Llama 2 as the base model for LoraHub in the final version of our paper. The addition will enable us to compare its effectiveness with the current FLAN-T5, providing valuable insights for the entire research community. Thank you again for your thoughtful input.
Regarding the retrieval baseline, I’d like to point out that the original retrieval method involves creating an expert library with keys as encoded examples using a sentence encoder, and values as LoRA weights. For held-out datasets, it operates by taking the inner product of keys, which are encoded representations of a few examples, and retrieves a LoRA weight by majority. Could you please modify the baseline accordingly?
Thank you for your thoughtful feedback and corrections. Following your suggestions, we conducted experiments utilizing the sentence-transformers/all-MiniLM-L6-v2 as the sentence encoder for similarity measurement to retrieve LoRA module candidates. Unfortunately, the results fell short of expectations, with an average performance across all tasks in BBH reaching 28.3. This slightly surpassed the zero-shot setting but lagged behind the loss-based LoRA retrieval baseline.
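For reference, a simplified sketch of this encoder-based retrieval follows; the expert-library layout, the key construction, and the helper names are illustrative rather than our exact implementation.

```python
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

def build_key(task_examples):
    # One key per upstream task: mean of its example embeddings (illustrative).
    emb = encoder.encode(task_examples, normalize_embeddings=True)
    return emb.mean(axis=0)

def retrieve(few_shot_examples, library):
    # library: {task_name: (key_vector, lora_module)} -- hypothetical layout.
    queries = encoder.encode(few_shot_examples, normalize_embeddings=True)
    votes = []
    for q in queries:  # each few-shot example votes for its nearest expert
        best = max(library, key=lambda name: float(q @ library[name][0]))
        votes.append(best)
    winner = max(set(votes), key=votes.count)  # majority vote over the examples
    return library[winner][1]
```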
In analyzing these unexpected results, we find that it is often challenging to identify relevant LoRA experts based on prompt similarity alone, especially when no upstream task is closely related to the evaluation task. Notably, we also observed some positive outcomes, where tasks like Movie Recommendation successfully identified relevant LoRA experts (e.g., Duorc ParaphraseRC: Movie Director).
Nevertheless, it is crucial to acknowledge that the limited time may have affected the completeness of our baseline implementation, and we recognize the potential for shortcomings in its current state. We are committed to conducting a thorough review and addressing any issues in the final version. Additionally, we are exploring alternative retrieval systems to better align with the concepts presented in the referenced paper. Please let us know if you have any concerns!
This paper proposes a method to improve the generalizability of LLMs to new tasks. The proposed method can be summarized as below:
- Train several models on a set of downstream tasks (one model per task). This is done in a parameter-efficient way using the LoRA method, resulting in a set of LoRA modules.
- Randomly choose a subset of the LoRA modules.
- Learn a set of weights to combine these LoRA modules using a few (e.g., 5) samples from a new task not seen in step 1. The weights are learned using gradient-free optimization.
- Use the learned weights to generate a combination of the LoRA modules and combine it with the base LLM for inference.
The result is a method that can be used to improve generalizability when the number of samples available for a new task is very low.
Strengths
- The paper presents an interesting way to fine-tune LLMs on tasks where the number of training examples might be very low.
- The proposed method uses gradient-free optimization to minimize resource requirements, since the number of parameters being learned is very low.
- The method outperforms zero-shot deployment of the base LLM.
Weaknesses
The main weakness of the paper is its performance compared to in-context learning (ICL), as highlighted in Table 1. The authors acknowledge this in the paper but justify it by noting that their method uses fewer tokens than in-context learning. However, I feel that ICL is a very straightforward and easy way to improve generalizability, and that the problem the authors are addressing is a minor one. Further, the performance of the proposed method also weakens the paper. From a practitioner's perspective, I feel that the proposed method will be less appealing than just using ICL.
Questions
- Can the authors provide more insight into the impact of the reduced number of tokens required by the proposed method? What does that mean for a practitioner in the end? Especially considering that the proposed method first requires training several LoRA modules, the burden on a practitioner may actually be higher.
We deeply appreciate your thoughtful comments on our paper. Your insightful questions have proven invaluable, helping us refine and articulate the positioning and motivation of our work. To provide a clear overview, we outline three key reasons highlighting the significance of our paper below.
Reason 1. Inference Efficiency
While we acknowledge the effectiveness of in-context learning (ICL) for generalization, we wish to emphasize that our proposed LoraHub is not intended to replace ICL but rather to offer a complementary strategy with performance-efficiency trade-offs. An essential benefit of LoraHub, highlighted in our paper, is its focus on improving inference efficiency by utilizing fewer prompt tokens per example. As illustrated in Table 1, LoraHub demonstrates a significant advantage with a reduced token requirement per example — 111.6 tokens compared to the 597.8 tokens needed for ICL.
Inference efficiency is paramount for real-world deployments, particularly when dealing with large numbers of inference examples. With fewer tokens per example during inference, our method significantly reduces computational overhead and enables faster responses. This commitment to optimizing inference efficiency aligns with a broader research trend, as evidenced by several recent studies actively exploring approaches to reduce the number of input tokens [1,2,3,4,5,6]. For example, as stated in [1], in-context learning can be inefficient because it makes the input prompt much longer, consuming valuable space in the context window and leading to larger computational costs.
[1]. Wangchunshu Zhou, Yuchen Eleanor Jiang, Ryan Cotterell, Mrinmaya Sachan, Efficient Prompting via Dynamic In-Context Learning. ArXiv abs/2305.11170 (2023)
[2]. Tao Ge, Jing Hu, Lei Wang, Xun Wang, Si-Qing Chen, Furu Wei, In-context Autoencoder for Context Compression in a Large Language Model. ArXiv abs/2307.06945 (2023)
[3]. Alexis Chevalier, Alexander Wettig, Anirudh Ajith, Danqi Chen, Adapting Language Models to Compress Contexts. ArXiv abs/2305.14788 (2023)
[4]. Huiqiang Jiang, Qianhui Wu, Xufang Luo, Dongsheng Li, Chin-Yew Lin, Yuqing Yang, Lili Qiu, LongLLMLingua: Accelerating and Enhancing LLMs in Long Context Scenarios via Prompt Compression. ArXiv abs/2310.06839 (2023)
[5]. Yucheng Li, Bo Dong, Chenghua Lin, Frank Guerin, Compressing Context to Enhance Inference Efficiency of Large Language Models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
[6]. Huiqiang Jiang, Qianhui Wu, Chin-Yew Lin, Yuqing Yang, Lili Qiu, LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
Reason 2. The Potential of LoraHub
While acknowledging that our approach currently lags behind ICL in terms of average performance, we posit that the inferior performance is primarily an artifact of our implementation rather than an inherent limitation of LoraHub. We believe the observed gap can be narrowed through the development of better module composition algorithms. This conviction finds support in the new experimental results (below) on the challenging Big-Bench Hard benchmark.
The new experimental results directly address the suggestion from Reviewer ZnG8, which prompted us to report the best performance of ICL. In response, we conducted a meticulous examination, collecting three distinct sets of few-shot examples to ensure a comprehensive evaluation covering both average (Table 1) and best (Table 3) performances across all tasks for the various methods. The best performances of ICL and LoraHub are shown below.
| Task | ICL (best) | LoraHub (best) |
|---|---|---|
| Boolean Expressions | 62.7 | 60.7 |
| Causal Judgement | 59.8 | 63.2 |
| Date Understanding | 21.3 | 45.3 |
| Disambiguation | 69.3 | 68.0 |
| Dyck Languages | 2.0 | 2.7 |
| Formal Fallacies | 59.3 | 59.3 |
| Geometric Shapes | 20.0 | 18.7 |
| Hyperbaton | 72.7 | 72.7 |
| Logical Deduction (five objects) | 39.3 | 40.0 |
| Logical Deduction (seven objects) | 42.0 | 46.0 |
| Logical Deduction (three objects) | 52.7 | 52.7 |
| ... | ||
| Tracking Shuffled Objects (five objects) | 12.0 | 16.7 |
| Tracking Shuffled Objects (seven objects) | 6.7 | 15.3 |
| Tracking Shuffled Objects (three objects) | 31.3 | 31.3 |
| Web of Lies | 54.0 | 57.3 |
| Word Sorting | 0.7 | 1.3 |
| Avg Performance Per Task | 38.4 | 41.2 |
| Avg Tokens Per Example | 597.8 | 111.6 |
Notably, our findings indicate that LoraHub's best performance surpasses that of ICL on 18 tasks, highlighting its promising potential for future development. This underscores the untapped capabilities of LoraHub, emphasizing its capacity to excel in challenging tasks.
Reason 3. New Insights
More broadly, we believe our work opens an interesting research direction in composing LoRA modules for unseen task generalization. While our current prototype still requires few-shot examples, we are excited by the possibility of automating the module composition process without any example. There is great potential in finding the right assemblies of specialized modules to attain broad capabilities.
Our vision is to democratize artificial intelligence. Specifically, we aim to establish a LoRA platform where users can seamlessly share and access well-trained LoRA modules for application in new tasks. LoRA providers can freely share or sell their modules on the platform without compromising data privacy. Users can leverage powerful LoRA modules contributed by others, utilizing algorithms for automated distribution, with the process requiring only CPU capability. The training of LoRA modules in our experiments serves as a simulation of LoRA providers within the platform.
Summary
Thank you again for the constructive feedback. We hope the above clarification alleviates your concerns and provides more insights into our method. We have revised our paper to more clearly position the trade-offs and potential of our approach compared to ICL. Please let us know if you have any other suggestions for improving our messaging and the impact of this work.
Dear Reviewer pXQC,
We appreciate the time and effort you dedicated to reviewing our paper. Your feedback is invaluable to us, and we genuinely thank you for your insightful comments.
We want to assure you that we took your feedback seriously and made a concerted effort to address the highlighted issues to the best of our ability. We conducted additional experiments, and refined our arguments to enhance the overall quality of the paper. We are open to any further suggestions or specific points you feel require additional attention.
Thank you once again for your time and consideration. We look forward to the opportunity to discuss any remaining concerns.
Best Regards,
The Authors
I thank the authors for the detailed response. I appreciate the clarification regarding why the reduced number of input tokens is an important contribution and I acknowledge the same.
I have raised my score from 3 to 5.
Thank you for carefully reviewing our response. We are delighted to hear that the clarification on the significance of the reduced number of input tokens resonated with you, and we sincerely appreciate your acknowledgment. Your decision to raise the score from 3 to 5 is truly encouraging, and we are grateful for your reassessment.
If you have any further comments or suggestions, please feel free to share them.
Best Regards,
The Authors
We really appreciate all reviewers for their careful reviews and constructive comments. In response to their valuable feedback, we have submitted an updated version of the paper. This revision incorporates several additional experimental results, addressing the reviewers' comments comprehensively. We believe that the latest draft effectively addresses the queries raised and enhances the overall quality of the paper. We look forward to further feedback and discussions from reviewers.
- We conducted a thorough comparison between LoraHub and representative gradient-based training methods, including LoRA tuning, IA3 tuning, and full fine-tuning, in few-shot scenarios (Reviewers ZnG8 and 8M9N).
- We sampled different few-shot examples and evaluated all few-shot methods multiple times, presenting both the average and the best performance (Reviewer ZnG8).
- We introduced two new baselines: one reporting the performance of the top five upstream LoRA modules, and the other using the LoRA retrieval method. These additions underscore the significance of module composition, the principle behind LoraHub (Reviewer 8M9N).
- We introduced results for a pre-filter method that slightly improves the performance of LoraHub (Reviewers ZnG8 and 8M9N).
- We carefully revised the paper's description concerning the comparison of our method with ICL (Reviewers pXQC and ZnG8).
- We have addressed all weaknesses and questions raised by the reviewers (Reviewers pXQC, ZnG8, and 8M9N).
- We have highlighted all modified content throughout the paper for ease of reference.
Dear Reviewers,
If you have already responded to the authors' rebuttal, thank you! If not, please take some time to read their responses and acknowledge them by replying to the comment. Please also update your score, if applicable.
Thank you for your continued engagement in discussion.
Cheers, Your AC!
The paper proposes a framework for combining many pre-trained LoRA adapters to achieve good performance on new, unseen tasks using only a few examples. It starts with a weighted sum of the existing LoRA adapters and optimizes the weights using a gradient-free method, avoiding the computational overhead of gradient-based learning at the scale of the network parameters. As a result, it can perform few-shot adaptation while using fewer tokens at inference than in-context learning, and hence achieves a speedup. Experiments on the Big-Bench Hard benchmark show it performs well compared to baselines, though it still underperforms in-context learning.
Strengths:
- The idea of combining specialized LoRA modules is interesting and has potential for further improvement. It opens many avenues for future work on multi-task learning and continual learning.
- It offers efficient few-shot learning without needing to process long prompts (as compared to in-context learning) and reduces computational costs.
- It shows promising performance, doing reasonably well to shrink the gap to in-context learning.
Weaknesses:
- There's still a meaningful gap to in-context learning performance.
- The optimization method for weighting LoRA adapters is not trivial, which can be considered a drawback in contrast to the simplicity and generality of in-context learning.
- More analysis is needed on how the approach scales as the number of candidate LoRA modules increases.
What might be missing:
- Showing interaction and positive (or negative) transfer between tasks, and demonstrating cases where LoraHub clearly outperforms in-context learning or baseline LoRA, rather than just achieving comparable performance.
- Reducing the computational overhead of the optimization process to improve the trade-off versus in-context learning.
- More extensive experiments on model performance as the number and diversity of LoRA modules increase; especially on task interactions and knowledge transfer, demonstrating the effectiveness of multiple or auxiliary tasks.
Why not a higher score
LoraHub learning underperforms, especially when compared to the simplicity and effectiveness of ICL for few-shot learning.
Why not a lower score
N/A
Reject