Text-to-LoRA: Instant Transformer Adaption
T2L is a hypernetwork that generates task-specific LoRA adapters given a short task description.
Abstract
Reviews and Discussion
Summary of Contributions:
The paper introduces Text-to-LoRA (T2L), a hypernetwork model designed to adapt Large Language Models (LLMs) on the fly based on natural language descriptions of target tasks. T2L aims to overcome the limitations of traditional fine-tuning by constructing Low-Rank Adaptation (LoRA) parameters in a single forward pass. The authors train T2L on a suite of pre-trained LoRA adapters and demonstrate that the reconstructed LoRA instances achieve comparable performance to task-specific adapters. Furthermore, the paper claims that T2L can compress multiple LoRA instances and generalize to unseen tasks zero-shot using natural language descriptions. The work proposes a step towards democratizing the specialization of foundation models with reduced computational requirements.
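To make the mechanism concrete, here is a minimal toy sketch of the hypernetwork idea (PyTorch). The dimensions, MLP structure, and per-module query embedding are illustrative assumptions, not the paper's actual S/M/L architectures:

```python
# Toy sketch: a small MLP maps a task-description embedding (plus a learned
# per-module query) to low-rank LoRA factors A and B for one target weight
# matrix, which can then be merged as W + B @ A.
import torch
import torch.nn as nn

class ToyLoRAHypernetwork(nn.Module):
    def __init__(self, emb_dim=1024, hidden=512, target_dim=4096, rank=8, n_modules=4):
        super().__init__()
        # One learned query per target module (e.g., per attention projection).
        self.module_queries = nn.Embedding(n_modules, hidden)
        self.encoder = nn.Sequential(nn.Linear(emb_dim, hidden), nn.ReLU())
        # Heads that emit the flattened low-rank factors.
        self.head_A = nn.Linear(hidden, rank * target_dim)
        self.head_B = nn.Linear(hidden, target_dim * rank)
        self.rank, self.target_dim = rank, target_dim

    def forward(self, task_emb, module_idx):
        h = self.encoder(task_emb) + self.module_queries(module_idx)
        A = self.head_A(h).view(self.rank, self.target_dim)   # (r, d_in)
        B = self.head_B(h).view(self.target_dim, self.rank)   # (d_out, r)
        return A, B

hyper = ToyLoRAHypernetwork()
task_emb = torch.randn(1024)              # stand-in for an encoded task description
A, B = hyper(task_emb, torch.tensor(0))
delta_W = B @ A                           # rank-8 weight update for one module
```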
- Literature:
- The paper adequately places its work within the context of current research on foundation model adaptation, parameter-efficient fine-tuning (specifically LoRA), and hypernetworks.
- Relevant works on LoRA compression and combination are also cited. The discussion of related hypernetwork approaches for LLM adaptation (Section 6) highlights the novelty of T2L's use of natural language instructions for zero-shot generalization.
- The citations appear to be appropriate, covering key papers in the relevant areas.
Overall Assessment: The paper presents an interesting and novel approach (T2L) for on-the-fly adaptation of LLMs using natural language task descriptions. The empirical results demonstrate the potential of this method for compressing LoRAs and achieving zero-shot generalization to unseen tasks, outperforming certain baselines. The ability to adapt different LLM architectures is also a strength. However, the current state of T2L still faces significant challenges. The gap in zero-shot performance compared to task-specific LoRAs, the generalization failure of the reconstruction training scheme, and the dependence on high-quality generated task descriptions are major limitations that need to be addressed in future work.
Questions for Authors
- Authors found that a T2L trained via reconstruction fails to generalize to unseen tasks (Section 5.4). Given the analysis in Appendix D and Figure 5 suggesting that LoRAs of similar tasks are not necessarily close in the weight space, what are your hypotheses or planned future research directions to bridge this gap and potentially enable better generalization for reconstruction-trained T2L?
Claims and Evidence
Claims Supported by Evidence:
- T2L can reconstruct pre-trained LoRAs and match their performance on corresponding test sets. Table 1 shows that T2L, trained via reconstruction loss on 9 benchmark tasks, can fully recover and even outperform the benchmark-specific LoRA adapters (highlighted in green). This is demonstrated using both one-hot and natural language task embeddings.
- T2L can compress hundreds of LoRA adapters. Figure 3 shows that T2L can be trained on an increasing number of tasks (up to 479). While there is a performance drop as the reconstruction error increases, the architectures maintain a significant portion of the oracle's performance even at higher error rates.
- T2L can generate useful LoRA adapters for unseen tasks using natural language descriptions (zero-shot generalization). Table 2 demonstrates that T2L trained with supervised fine-tuning (SFT) on the Super Natural Instruction (SNI) dataset outperforms a multi-task LoRA baseline on 10 unseen benchmark tasks. The bold numbers indicate improvements over the multi-task LoRA. Furthermore, visualization in Figure 4 shows a clear clustering of T2L activations based on the benchmark tasks, suggesting task-specific adaptation.
- Different T2L architectures (L, M, S) offer complexity-performance trade-offs. The paper presents results for all three architectures in various experiments, showing that the larger models (L and M) generally achieve better performance but have more parameters, while the smaller model (S) is more parameter-efficient.
- SFT training of T2L generally leads to better zero-shot generalization compared to reconstruction training. Table 6 shows a clear performance gap between T2L instances trained via reconstruction and SFT, with SFT achieving higher average benchmark performance. The authors attribute this to the potential issue of similar tasks having non-clustered LoRA adapters in the weight space, which reconstruction training struggles with.
- Task descriptions aligned with the target task are crucial for generating effective LoRAs. Table 5 shows that reconstruction-trained T2L performs significantly better when provided with training or evaluation descriptions (aligned) compared to random strings or training descriptions from other benchmarks (unaligned).
Potential Areas Where Claims Might Need Further Scrutiny or Are Acknowledged Limitations:
- Zero-shot performance still does not fully reach that of task-specific LoRAs. While T2L shows promising zero-shot generalization, Table 2 indicates a performance gap compared to the oracle task-specific LoRAs. The authors acknowledge this as a significant challenge.
- The reliance on high-quality generated task descriptions. The discussion section mentions that the experiments rely on GPT-4o mini generated descriptions, and performance might degrade if users provide lower-quality descriptions in real-world scenarios.
- The nature of generalization in reconstruction training. Appendix D is mentioned as containing compelling evidence why reconstruction-trained T2L cannot generalize. Section 5.4 further elaborates on the limitations due to the potential lack of clustering of similar task LoRAs in the weight space. Figure 5 shows no correlation between the cosine similarity of adapter weights and task embedding similarity.
- The choice of LoRA as the sole output space. The limitations section mentions that there might be more efficient ways to modulate LLMs given a text description than generating LoRA adapters. For example, system prompts can already steer LLMs into behaviors such as role-playing without any weight updates.
- Scalability to larger base models. The potential for T2L trained on smaller base models to transfer effectively to larger models within the same architecture class is noted as an open area for exploration.
Methods and Evaluation Criteria
Strengths of the Proposed Methods and Evaluation:
- Relevance to the Problem: The core idea of T2L, a hypernetwork generating LoRA adapters from task descriptions, directly addresses the challenge of adapting LLMs for new tasks without extensive fine-tuning and data curation. This is a significant step towards democratizing specialization of foundation models.
- Parameter-Efficient Adaptation: Utilizing LoRA for adaptation is a well-established parameter-efficient fine-tuning technique, making the generated adapters lightweight and easier to integrate.
- Zero-Shot Generalization Focus: The emphasis on zero-shot LoRA generation for unseen tasks is crucial for the stated goal of "instant" adaptation, as it aims to bypass the need for task-specific datasets.
- Comprehensive Evaluation Benchmarks: The paper uses a diverse set of 10 widely used benchmarks covering a variety of LLM capabilities, such as reasoning (ARC), math (GSM8K), science (ARC, OpenBookQA), coding (HumanEval, MBPP), and general knowledge (BoolQ, HellaSwag, PIQA, Winogrande). This broad coverage provides a good initial assessment of T2L's generalization abilities across different types of tasks.
- Comparison to Relevant Baselines: Evaluating against task-specific LoRAs, multi-task LoRA, and other zero-shot methods like Arrow Routing provides a context for understanding T2L's performance relative to existing approaches.
- Ablation Studies: The paper includes several ablation studies that examine the impact of different T2L architectures, training schemes (reconstruction vs. SFT), task embedding models, and task descriptions. These studies offer valuable insights into the factors influencing T2L's performance and robustness.
In conclusion, the proposed methods and evaluation criteria in the "Text-to-LoRA" paper provide a strong foundation for demonstrating the potential of language-based instant adaptation of LLMs. The use of relevant techniques (hypernetworks, LoRA), a diverse set of benchmarks, and thorough ablation studies supports the main claims of the paper. However, a more critical perspective highlights potential limitations regarding the representativeness of benchmarks for all real-world adaptation needs, the reliance on high-quality task descriptions, the scope of LoRA adaptation, and the practical implications of "instant" adaptation. Future work could explore the applicability of T2L to a broader range of tasks and foundation models, investigate alternative adaptation techniques, and consider more nuanced evaluation metrics relevant to diverse adaptation scenarios.
Theoretical Claims
The paper does not contain any explicit theoretical claims accompanied by mathematical proofs.
Experimental Design and Analysis
1. LoRA Compression Experiment (Section 4.1 and Table 1, Figure 3):
- Design: This experiment aims to determine if T2L can recover the performance of task-specific LoRAs through reconstruction training. It trains task-specific LoRAs (oracles) on benchmark tasks and then trains T2L to distill these LoRAs using either one-hot or natural language task embeddings. The performance of the reconstructed LoRAs is then compared to the oracle LoRAs on the respective benchmark test sets. Figure 3 further explores the impact of increasing the number of training tasks on the reconstruction error and relative performance.
- Soundness/Validity:
- The use of benchmark-specific LoRAs as "ground truth" for distillation is a reasonable approach to assess T2L's ability to compress and reproduce the functionality of individual adapters.
- Comparing performance with both one-hot and natural language embeddings helps to understand the importance of semantic task representation for reconstruction.
- The analysis of performance with increasing reconstruction error (Figure 3) provides insights into the lossy compression capabilities of T2L.
- Potential Issues:
- The experiment in Table 1 indirectly sees the benchmark tasks during training, as T2L learns to distill benchmark-specific LoRAs. This makes it less of a pure "zero-shot" scenario for these specific benchmarks in this particular experiment, although the goal here is compression, not generalization to completely unseen tasks.
- The hypothesis that the performance gain over oracle LoRAs on some benchmarks (PIQA, WG) comes from lossy compression acting as regularization is an interesting interpretation but could benefit from further investigation to confirm this mechanism.
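To make the distillation setup of this experiment concrete, here is a self-contained toy sketch of the reconstruction objective. The MSE loss, flattened parameterization, and training settings are assumptions for exposition, not the paper's exact recipe:

```python
# Toy reconstruction training: the hypernetwork learns to regress the
# weights of pre-trained task-specific ("oracle") LoRA adapters.
import torch
import torch.nn as nn

rank, d, emb_dim = 4, 512, 64
# Tiny stand-in hypernetwork: task embedding -> flattened (A, B) LoRA factors.
hyper = nn.Sequential(nn.Linear(emb_dim, 256), nn.ReLU(), nn.Linear(256, 2 * rank * d))

# Flattened factors of 9 pre-trained oracle adapters (random stand-ins here).
oracle_loras = {t: torch.randn(2 * rank * d) for t in range(9)}
task_embs = {t: torch.randn(emb_dim) for t in range(9)}  # description embeddings

opt = torch.optim.Adam(hyper.parameters(), lr=1e-3)
for step in range(900):
    t = step % 9
    pred = hyper(task_embs[t])
    loss = (pred - oracle_loras[t]).pow(2).mean()  # match the oracle adapter's weights
    opt.zero_grad(); loss.backward(); opt.step()
```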
2. Zero-Shot LoRA Generation Experiment (Section 4.2 and Table 2):
- Design: This experiment investigates T2L's ability to generate useful LoRA adapters for unseen tasks. T2L is trained with Supervised Fine-Tuning (SFT) on 479 SNI tasks using natural language task descriptions. The generated LoRAs are then evaluated on 10 held-out benchmark tasks for which T2L has not seen specific LoRAs during training. The performance is compared against a base model, prepending task descriptions, 3-shot ICL, average LoRA, multi-task LoRA, and Arrow Routing.
- Soundness/Validity:
- The use of held-out benchmark tasks is crucial for assessing zero-shot generalization.
- The comparison with a comprehensive set of baselines provides a good context for evaluating the effectiveness of T2L-generated LoRAs.
- Averaging performance over three different instances of task descriptions for each benchmark helps to account for the variability in description quality and their impact on T2L.
- Potential Issues:
- As we discussed previously, the representativeness of the chosen benchmarks for all possible "unseen tasks" might be a limitation.
- The performance comparison with Arrow Routing is noted to be indirect due to differences in the set of LoRA adapters and training tasks, as well as potential differences in evaluation prompts. This makes direct conclusions about superiority challenging.
3. Ablation Studies (Section 5):
- Increasing Training Compute (Section 5.1 and Table 3): This study examines the scalability of T2L by varying the number of training tasks while proportionally scaling the training budget. The results suggest that increasing training data generally improves zero-shot performance. This design seems sound for investigating the impact of scale.
- Task Embedding Models (Section 5.2 and Table 4): This ablation compares the zero-shot performance of T2L trained using two different task embedding models (gte-large-en-v1.5 and Mistral-7B-Instruct). The comparable performance across models suggests robustness to the embedding method. This is a well-designed experiment to test the dependency on a specific embedding technique.
- Varying Task Descriptions (Section 5.3 and Table 5): This experiment investigates the impact of different types of input task descriptions (train, eval, random, train (random)) on the performance of reconstruction-trained T2L. The significant performance drop with unaligned descriptions highlights the importance of task description relevance. This ablation effectively demonstrates the sensitivity of the method to the quality and alignment of the input.
- Training Schemes (Section 5.4 and Table 6): This ablation directly compares the zero-shot performance of T2L trained via LoRA reconstruction versus Supervised Fine-Tuning (SFT) with roughly equal training time. The significantly better performance of SFT-trained T2L supports the authors' hypothesis about the limitations of reconstruction for generalization. This is a crucial experiment for understanding the optimal training strategy for T2L's zero-shot capabilities.
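The contrast between the two training schemes can be sketched in code. Under SFT, the generated LoRA is plugged into a frozen base model and the hypernetwork is trained end-to-end on the downstream loss, rather than on weight reconstruction. Everything below is a toy stand-in under stated assumptions, not the paper's implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d, rank, emb_dim, vocab = 64, 4, 32, 100
base = nn.Linear(d, d, bias=False)       # stand-in for a frozen base-model projection
head = nn.Linear(d, vocab, bias=False)   # stand-in for the frozen LM head
for p in list(base.parameters()) + list(head.parameters()):
    p.requires_grad_(False)

hyper = nn.Sequential(nn.Linear(emb_dim, 64), nn.ReLU(), nn.Linear(64, 2 * rank * d))
opt = torch.optim.Adam(hyper.parameters(), lr=1e-3)

task_emb = torch.randn(emb_dim)
x = torch.randn(16, d)                   # stand-in token activations
y = torch.randint(0, vocab, (16,))       # supervised targets from the SFT dataset

for _ in range(100):
    flat = hyper(task_emb)
    A = flat[: rank * d].view(rank, d)
    B = flat[rank * d:].view(d, rank)
    h = base(x) + x @ A.t() @ B.t()      # LoRA-modulated forward pass
    loss = F.cross_entropy(head(h), y)   # gradients flow only into the hypernetwork
    opt.zero_grad(); loss.backward(); opt.step()
```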
Overall Considerations:
- The paper generally employs standard experimental practices in the field, including the use of benchmark datasets, comparison with relevant baselines, and ablation studies.
- The authors are transparent about certain limitations, such as the indirect comparison with Arrow Routing and the potential impact of real-world task descriptions.
- The reliance on generated task descriptions is a key aspect of the approach. While the use of a powerful LLM for generation aims to ensure quality, the potential for variability and misalignment in real-world scenarios remains a factor to consider. The ablation on varying task descriptions (Section 5.3) directly addresses this.
- The scope of evaluation primarily focuses on NLP benchmarks. While diverse within that domain, the applicability and evaluation of T2L for tasks beyond these benchmarks (as also noted in our previous discussion) would provide a more comprehensive understanding of its capabilities.
Supplementary Material
No
Relation to Existing Literature
- Parameter-Efficient Fine-Tuning (PEFT): The paper directly addresses the limitations of traditional fine-tuning, which requires expensive and lengthy training. T2L leverages Low-Rank Adaptation (LoRA), a prominent PEFT technique, but innovates by generating LoRA adapters on the fly. This contrasts with the typical approach of training separate LoRA adapters for each downstream task.
- Hypernetworks for Neural Network Adaptation: T2L falls under the umbrella of hypernetworks, which are neural networks that generate parameters for other networks. The paper builds upon the idea of using hypernetworks for adaptation but distinguishes itself by using natural language task descriptions as input to the hypernetwork. Prior work in this area often used learned task identifiers or relied on more constrained input formats. T2L's ability to generate task-specific LoRAs from natural language represents a significant step towards more flexible and user-friendly adaptation.
- Zero-Shot Learning: T2L's capability to generate effective LoRA adapters for completely unseen tasks based solely on their natural language descriptions connects to the field of zero-shot learning. While previous work explored zero-shot adaptation using hypernetworks in limited contexts (e.g., English dialects), T2L demonstrates task-wise zero-shot generalization across a diverse set of NLP benchmarks.
- Meta-Learning: The training of T2L, especially through supervised fine-tuning on a distribution of downstream tasks, can be seen as a form of meta-learning. T2L learns a general adaptation mechanism from a variety of tasks, enabling it to quickly adapt to new tasks at inference time. This aligns with the goal of meta-learning to learn how to learn.
- Prompt Engineering and In-Context Learning (ICL): The paper compares T2L against baselines like prepending task descriptions and few-shot ICL. While these methods also aim to adapt LLM behavior, T2L offers a different approach by directly modifying the model's parameters through generated LoRA adapters, potentially providing more control and efficiency. The comparison highlights the strengths of T2L in scenarios where in-context examples might be limited or costly.
- Compression of Adapters: The paper also investigates the ability of T2L to compress pre-trained LoRA adapters. This relates to the growing body of work on efficiently serving and deploying large numbers of adapters. T2L offers a method to implicitly compress multiple adapters into a single hypernetwork, allowing for on-demand generation of task-specific parameters.
In summary, T2L advances the state of the art by combining the efficiency of LoRA with the flexibility of hypernetworks and the generalizability of natural language instructions to achieve instant, zero-shot adaptation of large language models. It moves beyond task-specific fine-tuning and learned task identifiers, offering a more intuitive and broadly applicable approach to specializing foundation models.
Missing Essential References
None
Other Strengths and Weaknesses
Pros and Cons:
- Pros:
- Enables on-the-fly adaptation of LLMs based on natural language task descriptions.
- Offers a computationally inexpensive way to generate task-specific LoRA adapters (single forward pass).
- Demonstrates the potential for zero-shot generalization to unseen tasks.
- Can compress hundreds of pre-trained LoRA adapters.
- Outperforms multi-task LoRA and Arrow Routing in zero-shot evaluation.
- Shows robustness to different task embedding models.
- Generated LoRAs for semantically similar tasks cluster together in the activation space of T2L (SFT-trained).
- Applicable to different base LLM architectures (Mistral, Llama, Gemma).
- Cons:
- Zero-shot performance does not yet match that of task-specific LoRA adapters.
- Reconstruction-trained T2L fails to generalize to unseen tasks.
- Performance depends on the quality and alignment of the natural language task descriptions.
- Relies on generated task descriptions for training and evaluation, which might not reflect real-world user inputs.
- Compression of LoRAs is lossy, leading to a potential drop in performance compared to the original adapters, although the paper shows recovery and even outperformance in some cases potentially due to regularization.
- Limited to LoRA as the output space, and other potentially more efficient modulation techniques are not explored.
Other Comments or Suggestions
- Minor Concerns:
- The quality of writing could be improved in certain sections for better clarity and conciseness.
- Further exploration of the computational overhead of T2L itself (hypernetwork size and inference time) compared to storing and using individual LoRA adapters could be beneficial.
Detailed Evaluation:
- Novelty, Relevance, and Significance:
- The idea of using a hypernetwork to directly generate LoRA adapters from task descriptions is novel. This approach potentially offers a more efficient and flexible way to adapt LLMs compared to fine-tuning individual LoRA adapters for each task.
- The problem of efficiently adapting foundation models to specific tasks is highly relevant. Traditional fine-tuning is computationally expensive and requires task-specific datasets and hyperparameter tuning. T2L's language-based adaptation with minimal compute could be beneficial for practitioners.
- The claimed ability to zero-shot generalize to unseen tasks is significant. If validated robustly, this could significantly reduce the need for task-specific data and training. The compression of multiple LoRAs into a single hypernetwork also presents a practical advantage.
- However, the level of novelty in hypernetwork architectures for adaptation might be incremental rather than revolutionary, as the paper itself acknowledges related works. The core novelty lies in applying this technique specifically to generate LoRA adapters from natural language task descriptions for zero-shot generalization.
- Soundness:
- The paper presents empirical evidence through various experiments. The reconstruction training results (Table 1) suggest that T2L can indeed recover the performance of trained LoRAs on seen tasks.
- The zero-shot LoRA generation results (Table 2) show that SFT-trained T2L improves over a multi-task LoRA baseline, indicating that it can generate useful adapters for unseen tasks. The comparison with Arrow Routing also provides some context.
- However, the paper also acknowledges that T2L does not fully bridge the performance gap with task-specific LoRAs in a zero-shot manner. This is a crucial limitation that needs to be considered when evaluating the soundness of the claims about achieving task-specific performance.
- The finding that reconstruction-trained T2L fails to generalize to unseen tasks (Section 5.4) raises questions about the underlying adaptation mechanisms and the effectiveness of distillation in this context. The explanation regarding the lack of clustering of similar LoRAs in the weight space (Appendix D and Figure 5) provides a plausible reason but highlights a fundamental challenge.
- The ablation studies, such as the impact of task embedding models (Table 4) and varying task descriptions (Table 5), add to the soundness by exploring the robustness of the approach. The scaling experiments (Table 3 and Figure 1) also provide insights into the relationship between training data and performance.
- The use of generated task descriptions (Appendix K and Figure 7) is interesting but also introduces a dependency on the quality and consistency of these generated descriptions, which might not always be guaranteed in real-world scenarios.
- Quality of Writing/Presentation:
- The paper is generally well-structured and clearly presents the proposed method and experimental results. The use of figures and tables aids in understanding the concepts and findings.
- The conceptual overview of T2L training (Figure 1) and the architectural variations (Figure 2) are helpful.
- However, some parts of the paper, especially the appendices detailing the architectures and hyperparameters, could be more concisely integrated into the main body or presented more clearly for better readability.
- While the language is mostly professional, there are instances where more precise phrasing could be used to avoid overstating the capabilities of T2L, especially regarding its zero-shot performance compared to task-specific LoRAs.
- Zero-shot performance does not yet match that of task-specific LoRA adapters.
- Reconstruction-trained T2L fails to generalize to unseen tasks.
- Performance depends on the quality and alignment of the natural language task descriptions.
- Relies on generated task descriptions for training and evaluation, which might not reflect real-world user inputs.
- Compression of LoRAs is lossy, leading to a potential drop in performance compared to the original adapters, although the paper shows recovery and even outperformance in some cases potentially due to regularization.
- Limited to LoRA as the output space, and other potentially more efficient modulation techniques are not explored.
We explicitly acknowledge these concerns as limitations of the current implementation of T2L and discuss them thoroughly in the paper. To further address these concerns, we provide additional clarifications here.
Zero-shot performance does not yet match that of task-specific LoRA adapters.
Task-specific LoRAs are the performance ceiling. While T2L does not fully reach this ceiling, it gets very close on multiple tasks. We think of T2L as a first step towards efficient test-time adaptation of LLMs, providing compelling evidence that hypernetworks can effectively modulate modern frontier LLMs. We refer the reviewer to our response to reviewer sP2K for a potential performance improvement of T2L.
Performance depends on the quality and alignment of the natural language task descriptions.
We provide a concrete failure case analysis in table 5 in section 5.3 and our response to reviewer sP2K. However, we believe that using an LLM for adjusting the description alignment could effectively sidestep the main failure case of T2L.
Compression of LoRAs is lossy, leading to a potential drop in performance compared to the original adapters, although the paper shows recovery and even outperformance in some cases potentially due to regularization.
We agree with the reviewer that lossy compression is not a strict disadvantage of T2L as we show in the experiments that lossy compression of task-specific LoRAs can improve performance in some cases potentially due to regularization.
Limited to LoRA as the output space, and other potentially more efficient modulation techniques are not explored.
We agree that other potentially more efficient modulation techniques might exist. Since our focus in this work is on generating LoRA due to its widespread use, we leave this investigation for future work.
We hope that the reviewer's concerns are adequately addressed by our response and the discussion provided in the paper.
Authors found that a T2L trained via reconstruction fails to generalize to unseen tasks (Section 5.4). Given the analysis in Appendix D and Figure 5 suggesting that LoRAs of similar tasks are not necessarily close in the weight space, what are your hypotheses or planned future research directions to bridge this gap and potentially enable better generalization for reconstruction-trained T2L?
As mentioned by the reviewer, we explained our hypothesis in Appendix D that LoRAs of similar tasks are not necessarily close in the weight space, leading to memorization or overfitting by the reconstruction-trained hypernetwork. A potential fix for this problem could be replacing the reconstruction loss. A strong candidate would be contrastive learning based on the similarity of tasks in the semantic space as opposed to the weight space. Contrastive learning would yield latent representations that cluster similar tasks together, potentially leading to better generalization. We leave this investigation for future work.
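For concreteness, one standard instantiation of this idea is an InfoNCE objective over paired task descriptions. This is a hypothetical sketch; the response above does not commit to a specific contrastive loss:

```python
# z[i] and z_pos[i] are latent representations of two different descriptions
# of the same task, so the loss pulls similar tasks together in the latent
# space rather than in the LoRA weight space.
import torch
import torch.nn.functional as F

def info_nce(z, z_pos, temperature=0.1):
    z = F.normalize(z, dim=-1)
    z_pos = F.normalize(z_pos, dim=-1)
    logits = z @ z_pos.t() / temperature   # pairwise cosine similarities
    labels = torch.arange(z.size(0))       # matching pairs lie on the diagonal
    return F.cross_entropy(logits, labels)

loss = info_nce(torch.randn(32, 128), torch.randn(32, 128))
```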
The authors explore whether it is possible to generate a set of LoRA parameters from a task embedding (T2L). This allows many pre-trained LoRAs to be compressed and could potentially generalize to new unseen tasks. They show it is possible to generalize to unseen tasks in this way. They analyze the generated LoRAs and show that they tend to cluster when generated by a network trained end-to-end on the seen tasks.
Questions for Authors
See weaknesses.
Claims and Evidence
- Claim: T2L can efficiently encode hundreds of LoRA adapters. This is effectively shown in Figure 3. Interestingly, while performance decreases with a reconstruction loss (where task-specific LoRAs are reconstructed), it increases with end-to-end tuning with an SFT objective (Table 3).
- Claim: T2L can generalize to new unseen tasks. Table 2 shows this, though with so many training tasks, it is hard to tell whether the benchmark tasks are OOD. In fact, Figure 5 suggests that task similarity is an important factor for high performance, so the method may struggle on truly OOD tasks.
- Claim: T2L clusters parameters in a meaningful way. This is shown in Figure 4, demonstrating that different text specifications of tasks do cluster together in hypernetwork space.
Methods and Evaluation Criteria
Yes, both methods and evaluation make sense.
Theoretical Claims
There are no theoretical claims.
Experimental Design and Analysis
Soundness and validity are strong.
Supplementary Material
I reviewed the entire supplementary material (Appendix).
Relation to Existing Literature
This builds on prior work showing that LoRA is an effective PEFT method and that hypernetworks can effectively generate neural network parameters.
Missing Essential References
HyperDreamBooth [1] is an interesting paper to cite; it generates LoRA parameters for personalization of generative models.
[1] Ruiz, Nataniel, et al. "Hyperdreambooth: Hypernetworks for fast personalization of text-to-image models." Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2024.
Other Strengths and Weaknesses
Strengths:
It is a novel and interesting method!
Weaknesses:
There is little ablation of the task embedder. Do better reasoning models generalize better?
There is little analysis of OOD tasks or failure cases: does the hypernetwork fail in some cases?
The gains over Hyperdecoders are marginal, at least on these benchmark tasks. Can the authors demonstrate an instance where the text steerability directly changes the output generation in a meaningful way, to motivate the difference (qualitative examples are fine)?
Other Comments or Suggestions
An interesting direction! I'd like to see future work look into OOD generalization of the hypernetwork; does it generate very bad parameters in parts of the task distribution it has not seen?
Also, qualitative examples would give the reader a sense of the properties of the model, beyond just benchmark performance, and would be appreciated.
“HyperDreambooth[1] is an intersting paper to cite”
We thank the reviewer for bringing the paper to our attention and will include this prior work in the camera-ready version.
“Little ablation of the task embedder? Do better reasoning models generalize better?”
We agree with the reviewer that, in general, a better representation of the task description, e.g., one from a reasoning model, results in better performance. However, in our simplified implementation, we use a fixed-length vector to represent the task description (as opposed to a sequence of activations) and do not expect significant improvement from better embedder models.
We hypothesize that such a compact representation is restrictive, limiting the potential improvement from better representations. We suspect that using a more powerful architecture, as opposed to a simple feedforward one, would allow the hypernetwork to fully utilize better representations. For example, a Perceiver architecture could aggregate information from the token activations directly. We leave this investigation for future work.
Still, thanks to this simplified representation, we show in Table 4 (Section 5.2) that T2L is largely invariant to the embedding model used.
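As a minimal illustration of what such a fixed-length representation looks like, consider mean pooling over encoder token activations (the pooling assumed here is for exposition; the exact pooling used with gte-large-en-v1.5 may differ):

```python
import torch

token_acts = torch.randn(48, 1024)   # stand-in activations for a 48-token description
task_emb = token_acts.mean(dim=0)    # a single 1024-d vector, regardless of length
```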
“Little analysis of OOD tasks, or failure cases”
Due to the response length limit, we refer the reviewer to our response to reviewer 2rJt for the clarification of the OOD tasks.
Regarding the failure cases, we include some failure cases in the experiments as follows:
- Unaligned task descriptions (Table 5 in Section 5.3)
- Random strings as task descriptions (Table 5 in Section 5.3)
Overall, we show that using task descriptions that do not correspond to the task will reduce the performance of T2L significantly. We believe that using an LLM for adjusting the description alignment could effectively sidestep the main failure case of T2L.
Additionally, we provide a new qualitative result here to show another failure case: Low-quality descriptions.
Input
Toulouse has twice as many sheep as Charleston. Charleston has 4 times as many sheep as Seattle. How many sheep do Toulouse, Charleston, and Seattle have together if Seattle has 20 sheep?
Base model response (incorrect)
Great question! Let's break it down step by step:
1. We know that Seattle has 20 sheep.
2. Charleston has 4 times as many sheep as Seattle, so Charleston has 20 * 4 = <<20*4=80>>80 sheep.
3. Toulouse has twice as many sheep as Charleston, so Toulouse has 80 * 2 = <<80*2=160>>160 sheep.
4. To find the total number of sheep in all three cities, we add the number of sheep in each city: 160 (Toulouse) + 80 (Charleston) + 20 (Seattle) = <<160+80+20=240>>240 sheep.
So, Toulouse, Charleston, and Seattle together have 240 sheep.
Generated LoRA w/ (low-quality) description=``solve this task please’’ (incorrect)
Charleston has 4 times as many sheep as Seattle, and Seattle has 20 sheep, so Charleston has 20 * 4 = 80 sheep.
Toulouse has twice as many sheep as Charleston, so Toulouse has 80 * 2 = 160 sheep.
Toulouse, Charleston, and Seattle have sheep together: 160 + 80 + 20 = 240 sheep.
Generated LoRA w/ description=``This task challenges your problem-solving abilities through mathematical reasoning. You must carefully read each scenario and systematically work through the data to compute the final outcome.’’ (correct)
Toulouse has twice as many sheep as Charleston, so Toulouse has 2 * Charleston = 2 * 4 = 8 times as many sheep as Seattle.
Seattle has 20 sheep, so Toulouse has 8 * 20 = 160 sheep.
Charleston has 4 times as many sheep as Seattle, so Charleston has 4 * 20 = 80 sheep.
Toulouse, Charleston, and Seattle have sheep together = 160 (Toulouse) + 80 (Charleston) + 20 (Seattle) = 260 sheep.
Generated LoRA w/ description=``Use your programming skill to think algorithmically, logically and systematically. Carefully read each scenario and systematically work through the data to compute the final outcome. Use your math skill to calculate correctly.’’ (correct)
let me check the relationships between the number of sheep in each city:
1. Toulouse has twice as many sheep as Charleston.
2. Charleston has 4 times as many sheep as Seattle.
From point 2, we know that Charleston has 4 * 20 = 80 sheep. From point 1, Toulouse has 2 * 80 = 160 sheep.
So, Toulouse, Charleston, and Seattle have a total of 160 + 80 + 20 = 260 sheep.
“Can the authors demonstrate an instance where the text steerability directly changes the output generation in a meaningful way, to motivate the difference (qualitative examples are fine)?”
We provide an example of steerability of T2L in the response to the previous question. Note that the last two examples are both correct but have different answer styles thanks to different descriptions. We remark that Hyperdecoders cannot exhibit such steerability as it uses the problem instance as the input to the hypernetwork.
Thank you to the authors. I am satisfied with the rebuttal and upgrade my score.
This paper proposes the T2L architecture and training methods to generate task-specific LoRA parameters from task embeddings. The authors claim that their approach enhances zero-shot performance by enabling on-the-fly adaptation through a single forward pass of a pretrained hypernetwork.
Questions for Authors
Please refer to my questions for authors in the Experimental Designs or Analyses section and the Other Strengths and Weaknesses section.
Claims and Evidence
Yes, the authors conducted multiple experiments to support their claims regarding the capability of zero-shot on-the-fly adaptation.
Methods and Evaluation Criteria
There is no explicit problem. However, I would like the authors' comments on some of my concerns, which are described in the Experimental Designs or Analyses section and the Other Strengths and Weaknesses section.
Theoretical Claims
The paper primarily relies on empirical evidence to support its claims. While additional theoretical analysis could strengthen the work, the assumptions and heuristics behind their approach and architecture design appear reasonable.
Experimental Design and Analysis
For this approach to be practical, it should be able to handle out-of-distribution tasks that the model has never encountered or experienced anything similar to during pre-training. While the authors removed 10 tasks from a set of 500 to prevent data contamination in the evaluation benchmark datasets, I am curious whether the training datasets contain tasks that are not identical to, but similar to, those in the evaluation benchmarks. It would be helpful if the authors could provide specific examples demonstrating their approach's effectiveness on challenging OOD tasks, supported by detailed evidence.
Supplementary Material
I reviewed the supplementary material overall, focusing particularly on the details of the architecture and dataset.
Relation to Existing Literature
While existing literature and concurrent work share a similar motivation, as the authors acknowledge in the related work section, I believe the proposed approach is meaningful, demonstrating the potential effectiveness of LoRA-based on-the-fly adaptation through extensive experiments.
Missing Essential References
I think the authors included the necessary references to support their claims.
Other Strengths and Weaknesses
HyperLoRA requires an additional initial cost to generate LoRA parameters using a hypernetwork from task embeddings and merge them into the base model. I believe the advantage of HyperLoRA over in-context learning (ICL) in terms of computational efficiency becomes more significant as the model repeatedly performs a given task on more samples. This is because HyperLoRA requires less computation for each inference due to the absence of a need for few-shot samples. Regarding this point, it would be helpful if the authors included a cost analysis with specific numbers, detailing how much initial cost is required for instant adaptation and how many inferences are needed to offset this cost. Additionally, in terms of effectiveness, I am curious whether using K-shot ICL (e.g., K > 3, such as K = 32 or higher) could enable ICL to match or even outperform HyperLoRA. In that case, would the cost-effectiveness of HyperLoRA justify its use over K-shot ICL? Providing specific numerical comparisons in this analysis would be helpful.
Other Comments or Suggestions
N/A
“I am curious whether the training datasets contain tasks that are not identical to, but similar to, those in the evaluation benchmarks.”
We confirm that some test and training tasks are similar in that they are mostly multiple-choice question-answering tasks. Also, there are similar and overlapping domains between the two splits. For example, the ARC benchmarks are similar to SNI task #47 (see https://instructions.apps.allenai.org/#explore).
However, some benchmarks are very different from the training distribution, e.g., MBPP and HumanEval as the training tasks do not contain any code generation task. The closest tasks are SNI task #688 and #956. Therefore, we think that MBPP and HumanEval serve as a good representation of OOD tasks. We will update the text in the camera-ready version to clarify this point. We will also list all the training tasks in the appendices for better clarification.
Although the benchmarks are arguably not fully OOD relative to the training tasks, we emphasize that one of the advantages of T2L is its ability to efficiently and cheaply adapt at test-time.
“HyperLoRA requires less computation for each inference due to the absence of a need for few-shot samples. Regarding this point, it would be helpful if the authors included a cost analysis”
We fully agree with the reviewer that one of the main advantages of T2L is its efficiency. To emphasize T2L’s efficiency, we provide a FLOP analysis on a representative scenario.
Let S be the sequence length, H the hidden size, and L the number of layers of a Transformer-based LLM. We use the following equations for computing the matrix-multiplication (GEMM) FLOPs [1]:
FLOPs for Self-Attention (per layer): 8 x S x H^2 + 4 x H x S^2
FLOPs for FFN (per layer): 16 x S x H^2
Per Transformer Block Total FLOPs: 24 x S x H^2 + 4 x H x S^2
Setup for comparison
- 3-shot ICL examples are approximately 256 tokens long
- Question instances are approximately 64 tokens long
- Task descriptions are approximately 48 tokens long
- We consider one question instance as the main input
- We only consider input tokens for the FLOPs calculation
- We use Mistral-7B-Instruct-v0.2 as the base model (S = 256 + 64 (3-shot ICL + question instance), H = 4096, L = 32)
- When the base model is used with T2L, we do not include 3-shot ICL (S = 64 (question instance), H = 4096, L = 32)
- We use gte-large-en-v1.5 as the task description encoder (S = 48 (task description), H = 1024, L = 24)
- We use the M hypernetwork architecture detailed in Appendix F
T2L per instance FLOPs
gte-large-en-v1.5: FLOPs = 24 x (24 x 48 x 1024 ^ 2 + 4 x 1024 x 48 ^ 2) = 0.029 TFLOPs/instance
Hypernetwork (M): FLOPs = 2 x 1024 x 64 + 4 x 4 x 128 x 512 + 128 x 4096 x 8 = 0.000005 TFLOPs/instance
Base LLM w/o ICL: FLOPs = 32 x (24 x 64 x 4096 ^ 2 + 4 x 4096 x 64 ^ 2) = 0.827 TFLOPs/instance
Total FLOPs = 0.029 + 0.000005 + 0.827 = 0.856005 TFLOPs/instance
Base LLM with 3-shot ICL
Total FLOPs = 32 x (24 x (256 + 64) x 4096 ^ 2 + 4 x (4096) x (256 + 64) ^ 2) = 4.177 TFLOPs/instance
Based on this calculation, we can see that the adaptation cost of T2L is significantly cheaper than 3-shot ICL—more than 4x FLOPs reduction, saving compute within the first question instance. We will include this ad-hoc analysis in the appendices for the camera-ready version.
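For readers who want to reproduce the arithmetic, a few lines of Python suffice (a sketch under the stated assumptions; the tiny hypernetwork term, roughly 5e-6 TFLOPs, is omitted as negligible):

```python
def block_flops(S, H):
    # Attention + FFN GEMMs per transformer layer: 24*S*H^2 + 4*H*S^2
    return 24 * S * H**2 + 4 * H * S**2

def model_flops(S, H, L):
    return L * block_flops(S, H)

TERA = 1e12
encoder = model_flops(S=48, H=1024, L=24) / TERA           # gte-large-en-v1.5
base_no_icl = model_flops(S=64, H=4096, L=32) / TERA       # question instance only
base_3shot = model_flops(S=256 + 64, H=4096, L=32) / TERA  # 3-shot ICL + question

print(f"T2L path:   {encoder + base_no_icl:.3f} TFLOPs/instance")  # ~0.856
print(f"3-shot ICL: {base_3shot:.3f} TFLOPs/instance")             # ~4.177
```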
“I am curious whether using K-shot ICL (e.g., K > 3, such as K = 32 or higher) could enable ICL to match or even outperform HyperLoRA.”
It has been shown that many-shot ICL [2] consistently outperforms few-shot ICL, to the extent that it can substitute for full fine-tuning in some tasks. Thus, we believe that many-shot ICL could match T2L performance on some tasks. While many-shot ICL allows for scaling inference-time compute, it is computationally expensive and depends on the context length and memory capacity of the device. These constraints limit the deployability of the LLM on consumer hardware.
In contrast, T2L takes an alternate approach, which amortizes the compute budget (e.g., finetuning or ICL) into the training of T2L before generating task-specific LoRAs cheaply at inference time. Furthermore, generated LoRAs can be merged into the base model and quantized to further reduce the memory requirement and improve inference speed.
We provide an ad-hoc analysis comparing few-shot ICL against T2L in the response to the previous question. T2L reduces the FLOPs required for each question instance significantly (more than 4x FLOPs reduction). The reduction would be even more dramatic if we compared many-shot (e.g., 32-shot) ICL to T2L.
[1] Korthikanti et al. "Reducing activation recomputation in large transformer models." MLSys 2023
[2] Agarwal et al. "Many-shot in-context learning." NeurIPS 2024
I appreciate the authors’ efforts in providing a response, and most of my concerns have been addressed. Accordingly, I will increase my score.
This paper introduces Text-to-LoRA (T2L), an interesting hypernetwork approach for generating task-specific LoRA adapters from natural language descriptions, enabling on-the-fly adaptation of LLMs. Reviewers found the method promising for its potential efficiency gains over ICL and its ability to compress adapters and generalize zero-shot to unseen tasks, outperforming several baselines. The authors provided helpful clarifications and additional analyses (FLOPs comparison, qualitative examples) during the rebuttal phase, which led two reviewers (2rJt, sP2K) to raise their scores. While acknowledging that limitations remain, particularly the performance gap to task-specific LoRAs and the dependency on description quality, the AC leans towards acceptance given the novelty, strong empirical results on benchmarks, and constructive author response outlining future work.
Strengths:
- Novelty and Potential (sP2K, Phww): The core T2L concept of generating LoRAs from text descriptions via a hypernetwork is novel and presents a promising direction for efficient model adaptation.
- On-the-fly Adaptation & Efficiency (2rJt, Phww): T2L allows for rapid adaptation with a single forward pass, demonstrated to be significantly more computationally efficient per inference instance than few-shot ICL based on FLOPs analysis provided in the rebuttal.
- Zero-Shot Generalization Capability (2rJt, sP2K, Phww): The SFT-trained T2L shows promising zero-shot generalization to unseen tasks using natural language, outperforming relevant baselines like multi-task LoRA.
- Effective LoRA Compression (sP2K, Phww): The method demonstrates the ability to compress hundreds of LoRA adapters into a single hypernetwork, albeit sometimes with lossy compression (which occasionally acted as beneficial regularization).
- Comprehensive Evaluation (Phww): The paper includes solid empirical validation with diverse benchmarks, relevant baselines, and informative ablation studies (architecture, training scheme, embeddings).
Concerns:
- Zero-Shot Performance Gap vs. Oracle (Phww): T2L does not yet match the performance of task-specific LoRAs on unseen tasks. Authors acknowledged this limitation, positioning T2L as a first step and noting performance is close on some tasks. Reviewer Phww acknowledged.
- Dependence on Task Description Quality (sP2K, Phww): Performance relies heavily on high-quality, well-aligned task descriptions; initial analysis of failures/steerability was limited. Authors provided new qualitative examples showing impact of description quality and steerability, pointed to existing ablations (Table 5), and suggested LLMs could align user inputs. Reviewer sP2K accepted; Reviewer Phww acknowledged.
- OOD Generalization Robustness (2rJt, sP2K): Initial concern about whether evaluation tasks were truly OOD from the large training set. Authors clarified similarity but identified code-gen tasks as OOD examples, promising paper updates. Reviewers 2rJt and sP2K accepted.
- Reconstruction Training Fails to Generalize (Phww): T2L trained via reconstruction loss does not generalize well to unseen tasks, unlike SFT training. Authors explained hypothesis (lack of weight-space clustering for similar tasks) and proposed contrastive learning on semantic similarity as future work. Reviewer Phww acknowledged.
- Cost/Efficiency Justification vs. Many-Shot ICL (2rJt): Initial request for quantitative comparison and justification against potentially high-performing many-shot ICL. Authors provided FLOPs analysis vs. 3-shot ICL showing >4x gain, arguing many-shot ICL would be even costlier. Reviewer 2rJt accepted.