PaperHub
Score: 7.0 / 10
Poster · 3 reviewers (ratings: 4, 3, 4; min 3, max 4, std dev 0.5)
ICML 2025

LoRA-Gen: Specializing Large Language Model via Online LoRA Generation

OpenReview · PDF
Submitted: 2025-01-22 · Updated: 2025-07-24

Abstract

Keywords
Large Language Model · LoRA · Specializing Model

Reviews and Discussion

Official Review (Rating: 4)

The paper proposes LoRA-Gen, a framework for specializing language models (LMs) on edge devices by generating task-specific LoRA parameters using a cloud-side model. The core idea involves leveraging a large cloud-based LM to generate meta tokens from task descriptions, which dynamically assemble LoRA parameters from a pool of experts. These parameters are merged into the edge-side model via reparameterization, reducing input context length and enabling efficient inference. Key claims include:

  • Training-free specialization for unseen tasks via single-turn inference on system prompts.
  • 2.1x speedup with TinyLLaMA-1.1B and 10.1x compression ratio with Gemma-2B on agent tasks.
  • Superior performance over LoRA, LoRA-MoE, and MixLoRA across eight commonsense reasoning benchmarks.
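To make the mechanism in this summary concrete, the following is a minimal, hedged sketch (not the authors' code): gating weights derived from the cloud model's meta tokens combine a pool of LoRA experts, and the resulting low-rank update is folded into a frozen edge-side weight via reparameterization. All shapes and names here (d_model, rank, n_experts, router) are illustrative assumptions.

```python
# Minimal illustrative sketch of the LoRA-Gen idea described above (assumed shapes/names).
import torch

d_model, rank, n_experts = 512, 8, 8

# LoRA expert pool: each expert i holds a low-rank pair (A_i, B_i).
A = torch.randn(n_experts, rank, d_model) * 0.01   # down-projections
B = torch.randn(n_experts, d_model, rank) * 0.01   # up-projections
router = torch.nn.Linear(d_model, n_experts)       # hypothetical gating head

def generate_lora_update(meta_token: torch.Tensor) -> torch.Tensor:
    """Assemble a task-specific low-rank update from the expert pool.

    `meta_token` stands in for the cloud model's output for one system prompt;
    in LoRA-Gen it would come from a single cloud-side inference pass.
    """
    gates = torch.softmax(router(meta_token), dim=-1)      # (n_experts,)
    # Weighted combination of expert updates: sum_i g_i * (B_i @ A_i)
    return torch.einsum("e,eor,erd->od", gates, B, A)      # (d_model, d_model)

# Reparameterization: fold the generated update into the frozen edge weight,
# so edge-side inference runs on a plain dense matrix with no extra context.
W_edge = torch.randn(d_model, d_model)
meta_token = torch.randn(d_model)                          # placeholder meta token
W_specialized = W_edge + generate_lora_update(meta_token)
```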

Questions for Authors

  1. How does the computational cost of cloud-side LoRA generation compare to edge-side efficiency gains?
  2. What task characteristics make LoRA-Gen most effective (e.g., prompt complexity, task diversity)?
  3. What are the primary failure modes of LoRA-Gen, and how might they be addressed?

Claims and Evidence

The claims are largely supported by extensive experiments on commonsense reasoning (e.g., ARC-c, OpenBookQA) and agent benchmarks (GPT4Tools). Results demonstrate clear improvements in accuracy, latency, and compression. However:

  • The training-free generalization claim requires broader validation across diverse tasks.
  • There is limited analysis of task-specific strengths and weaknesses (e.g., why LoRA-Gen excels on certain tasks but not others).
  • The claim of knowledge transfer needs deeper exploration (e.g., how cloud-to-edge knowledge injection works mechanistically).

Methods and Evaluation Criteria

The methodology is well-designed, combining LoRA, MoE, and reparameterization to address edge-side efficiency. Evaluation metrics (accuracy, latency, compression ratio) are appropriate. The use of harmonic mean accuracy for multi-task evaluation and task-specific benchmarks (e.g., GPT4Tools) strengthens validity.

Theoretical Claims

No formal theoretical proofs are provided. The work is empirically driven, with ablation studies validating design choices (e.g., meta tokens vs. direct LoRA generation, routing strategies).

Experimental Design and Analysis

Experiments are comprehensive, comparing LoRA-Gen against strong baselines (LoRA, LoRA-MoE) across multiple models (TinyLLaMA, Gemma-2B). Key gaps:

  • Hyperparameter sensitivity: limited discussion of the impact of the expert count (n=8) or the auxiliary loss coefficient (α=0.01).
  • Cloud model dependency: no analysis of how varying the cloud-side model (e.g., larger or smaller LMs) affects performance.

Supplementary Material

not provided

Relation to Existing Literature

The paper is well-situated within the broader literature on parameter-efficient fine-tuning of large language models. It builds upon recent advances in LoRA and MoE techniques and addresses the critical need for efficient and effective model specialization. The authors provide a good overview of related work, including various parameter-efficient fine-tuning methods and context compression techniques. However, the paper could benefit from a more detailed discussion on how LoRA-Gen compares to other non-LoRA methods, such as adapter-based approaches or prompt tuning.

Missing Important References

The work builds on LoRA (Hu et al., 2021), MoE (Jacobs et al., 1991), and context compression (e.g., AutoCompressors). However, it omits critical recent works:

  • VeRA (Kopiczko et al., 2024): a parameter-efficient method using shared low-rank layers.
  • DoRA (Liu et al., 2024b): decomposes weight updates into magnitude/direction components.

Other Strengths and Weaknesses

Strengths:

  • The proposed framework is innovative and addresses a significant challenge in deploying specialized language models on edge devices.
  • The results demonstrate clear improvements in both performance and efficiency, making the method highly practical for real-world applications.
  • The idea of generating task-specific LoRA parameters using a cloud model is novel and leverages the strengths of both cloud and edge computing.

Weaknesses:

  • The dependency on a cloud-side model for generating LoRA parameters may limit the applicability of the method in scenarios with limited network connectivity or strict privacy requirements.
  • The generalization ability of LoRA-Gen to unseen tasks needs further validation on a broader range of tasks and domains.
  • The paper could benefit from a more detailed discussion on the limitations of the method and potential areas for improvement.

Other Comments or Suggestions

  • Clarify the computational overhead of cloud-side LoRA generation.
  • Discuss limitations (e.g., reliance on cloud infrastructure, task scope).
  • Fix typos: page 6, "utilyze" → "utilize"; unify "LoRAMoE"/"LoRA-MoE".

Author Response

Q1: The training-free generalization claim requires broader validation across diverse tasks.

Ans: As shown in Tables 2 and 3 of the manuscript and Table 10 in the response to reviewer HcLz, we have mainly validated our method in the fields of mathematics, commonsense reasoning, science, daily life, and tool usage. In the future, we will expand to more fields to further verify our method.

Q2: How cloud-to-edge knowledge injection works mechanistically.

Ans: As shown in Table 12 in the response to reviewer TR5Y, when we do not utilize the online LLM to select experts, the overall performance decreases, further highlighting the importance of the online large model in converting instructions into internal knowledge to guide edge-side models.

Q3: Hyperparameter sensitivity.

Ans: Our exploration of different hyperparameter settings, shown in Table 5 of the manuscript, revealed that adding more experts to the LoRA expert pool does not inherently improve results. We speculate that this is due to a mismatch between the number of experts and the scale of the available training data, resulting in poorly tuned expert assignments. The auxiliary loss coefficient was chosen via a hyperparameter search.
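As an illustration of how such an auxiliary term is usually attached to the training objective, here is a hedged sketch using a common Switch-Transformer-style load-balancing loss; the numbers (n = 8 experts, α = 0.01) come from the review above, but the exact form of LoRA-Gen's auxiliary loss is an assumption here, not the paper's code.

```python
# Illustrative sketch only: a standard load-balancing auxiliary loss for MoE routing,
# added to the task loss with a small coefficient (assumed form, not LoRA-Gen's code).
import torch
import torch.nn.functional as F

def load_balancing_loss(gate_probs: torch.Tensor) -> torch.Tensor:
    """gate_probs: (batch, n_experts) softmax routing probabilities."""
    n_experts = gate_probs.shape[-1]
    # Fraction of samples routed (by argmax) to each expert.
    assignment = F.one_hot(gate_probs.argmax(dim=-1), n_experts).float().mean(dim=0)
    # Mean routing probability assigned to each expert.
    mean_prob = gate_probs.mean(dim=0)
    # Both quantities are encouraged to stay close to the uniform 1/n_experts.
    return n_experts * torch.sum(assignment * mean_prob)

alpha = 0.01
gate_probs = torch.softmax(torch.randn(16, 8), dim=-1)   # toy batch, 8 experts
task_loss = torch.tensor(0.0)                            # placeholder main objective
total_loss = task_loss + alpha * load_balancing_loss(gate_probs)
```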

Q4: No analysis of how varying the cloud-side model (larger/smaller LMs) affects performance.

Ans: We have conducted an experiment to show that the LLM with stronger reasoning ability brings better results, as shown in Table 13 in the response to reviewer TR5Y.

Q5: The paper could benefit from a more detailed discussion of how LoRA-Gen compares to other non-LoRA methods, such as prompt tuning.

Ans: We conducted an experiment comparing the performance of LoRA-Gen and prompt tuning, as shown in Table 15.

              | Prompt-tuning | LoRA-Gen
ARC Challenge | 43.2          | 44.3

Table 15. Performance Comparison.

Q6: Reliance on cloud infrastructure and the limitations of the method and potential areas for improvement.

Ans: Our method supports diverse application scenarios, such as tool calls, personalized virtual assistants, offline intelligent systems, IoT device control, and tasks requiring long system prompts. However, the current paradigm needs a predefined pair of cloud-side and edge-side LMs. A model-agnostic framework remains an open question for future work.

Q7: Clarify the computational overhead of cloud-side LoRA generation.

Ans: After specifying the task description or system prompt, the cloud model only needs to perform one inference to complete the generation of LoRA parameters. For specific FLOPs, Memory, and Latency in the training and inference stages, please refer to Table 4 in the supplementary material.

Q8: Fix typos: Page 6, "utilyze" → "utilize"; unify "LoRAMoE"/"LoRA-MoE".

Ans: Thanks for your valuable advice. We will refine this in the revised paper.

Q9: How does the computational cost of cloud-side LoRA generation compare to edge-side efficiency gains?

Ans: In fact, in the deployment and inference phase, the cloud model only incurs the computational overhead of a single inference to specialize for the given task; the main computational overhead comes from the training phase. As shown in Table 15 in the response to reviewer TR5Y, the 13B model brings some gains over the 7B model.

Q10: What task characteristics make LoRA-Gen most effective (e.g., prompt complexity, task diversity)?

Ans: Our method is most beneficial in scenarios that feature a fixed, specialized system prompt and varying user inputs. In such cases, the cloud-side large model generates customized LoRA weights through a one-time inference over the system prompt and supplies these weights to the edge-side small model.
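The amortization pattern this answer describes can be sketched as follows; the function names (cloud_generate_weights, edge_infer) are placeholders rather than the paper's API, and the bodies are stubs that only illustrate the control flow.

```python
# Hedged sketch of the deployment flow: one cloud-side pass per fixed system prompt,
# then many edge-side requests that no longer carry that prompt.

def cloud_generate_weights(system_prompt: str) -> str:
    # In LoRA-Gen this would be a single cloud-LLM inference that produces meta
    # tokens and, from them, LoRA weights merged into the edge model.
    return f"specialized-weights-for:{system_prompt[:24]}"

def edge_infer(user_input: str, weights: str) -> str:
    # Edge-side inference runs on the specialized weights; the long system prompt
    # is no longer part of the input context, which shortens every request.
    return f"answer({user_input}) with {weights}"

weights = cloud_generate_weights("You are a tool-calling assistant ...")  # one-time cost
for query in ["book a flight", "check the weather", "summarize this email"]:
    print(edge_infer(query, weights))  # repeated, prompt-free edge inference
```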

Q11: What are the primary failure modes of LoRA-Gen, and how might they be addressed?

Ans: The major limitation of the current paradigm is that it needs a predefined pair of cloud-side and edge-side LMs; using a model pair different from the one used during training is the primary failure mode. A possible solution is to train multiple models in cascade, which we will continue to explore in future work.

Reviewer Comment

Thank you for your response. It addresses most of my concerns. I'll keep my rating.

Official Review (Rating: 3)

The authors automate the creation of task-specific LoRAs by leveraging a larger-scale LLM fine-tuned to generate LoRA parameters. The larger-scale LLM is prompted with a system prompt specifying the task, and the resulting LoRA parameters are applied to a smaller-scale edge model. The authors find these generated LoRA parameters are effective for seen tasks and for unseen commonsense reasoning tasks. They also find the approach effective for tool usage relative to LoRA alone.

After the Rebuttal

Thank you for providing the detailed answers. I am therefore keeping my score. I'd strongly recommend incorporating the training data details in a clear place in the main paper, since this is a key piece of understanding the system.

Questions for Authors

See weaknesses.

Claims and Evidence

  • LoRA-Gen is faster than alternatives: timing experiments with shorter context. Issue: it's not clear to me that context is needed for trained (non-base) models, especially on seen tasks.

  • LoRA-Gen generalizes better to new tasks: experiments show some generalization capability to new tasks (possibly due to few-shot examples).

  • LoRA-Gen is training-free: indeed, it does not need training for new tasks. However, it'd be interesting to see how a small amount of training would help.

  • LoRA-Gen allows knowledge transfer from the large model to the small model.

It is unclear how much of the large base LLM gets transferred into the generated LoRA params versus how much is learned during the general LoRA-generator training phase. Maybe ablate different LoRA generators?

Methods and Evaluation Criteria

Proposed methods and evaluation criteria do make sense.

Theoretical Claims

There are no theoretical claims.

Experimental Design and Analysis

I checked the soundness of the designs. The experimental design is sound.

Supplementary Material

I reviewed all of the supplementary material.

Relation to Existing Literature

This work builds on the prior ideas of PEFT, mixture-of-experts, and context compression. It takes the findings that LoRA is a highly effective PEFT methodology and that LLMs can compress prompts, and instead represents prompts as parameters.

Missing Important References

HyperDreamBooth [1] is an interesting paper to cite; it generates LoRA parameters for personalization of generative models.

[1] Ruiz, Nataniel, et al. "Hyperdreambooth: Hypernetworks for fast personalization of text-to-image models." Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2024.

Other Strengths and Weaknesses

Strengths: Clarity and writing are high on average (for a minor writing weakness, see below).

Weakness: The LoRA generator and LoRA expert pool need training, yet testing on GPT4Tools is said to not need training (red x, second column, Table 3). Then how is it trained? Does it generalize from the commonsense reasoning task training?

How important is the LoRA generator's base LLM? Would a stronger reasoning model be better, since the cost is being amortized anyway?

This type of meta-training really requires a 'dataset of datasets'. However, the current setup is more of a proof of concept because the number of seen tasks is quite low. Are there scaling properties?

Other Comments or Suggestions

Overall, an interesting direction! I'd like to see this scaled up, along with an experiment on the importance of the cloud-side LM. It almost feels like an embedding-type model (maybe encoder-decoder?) would make more sense.

Author Response

Q1: It's not clear to me that context is needed for trained (non-base models), especially on seen tasks.

Ans: Sincerely sorry for the confusion. We aim to utilize the online large language model and the online LoRA expert pool to convert user instructions, system prompts, or contextual content into tailored LoRA parameters, improving the performance of the small model. Therefore, during training, we provide a few-shot instruction to the online LLM to guide the model in acquiring this functionality.

Q2: It'd be interesting to see how a small amount of training would help.

Ans: As shown in Table 12, a small amount of training leads to a small increase on some tasks.

                        | Hellaswag | Winogrande | PIQA
Training-free           | 49.1      | 67.4       | 76.9
Training with few cases | 48.8      | 67.5       | 77.1

Table 12: Performance with Few-Shot Training.

Q3: Ablate different LoRA generators.

Ans: As shown in Table 13, we conduct an ablation study on the average accuracy and harmonic mean of the LLM-based LoRA generator and the general LoRA generator across 8 datasets, using TinyLLaMA-1.1B as the small LLM. The results demonstrate that the LLM-based LoRA generator outperforms the general LoRA generator. Additionally, as presented in Table 2 of the manuscript, the performance of LoRA-Gen surpasses that of several other LoRA-based MoE methods.

                         | Average Score | Harmonic Mean
General LoRA Generator   | 54.0          | 48.2
LLM-based LoRA Generator | 55.1          | 49.8

Table 13. Ablation of Different LoRA Generators.

Q4: How is the training-free setting on GPT4Tools trained?

Ans: As mentioned in Section 1.4 of the supplementary material, we process the Alpaca dataset through GPT-4, resulting in a filtered and abstracted set of 37,658 training samples. We then use this data for pre-training to obtain the instruction-aware LoRA generator and LoRA expert pool. We directly use this checkpoint to perform generalization tests on the GPT4Tools test set, corresponding to the 4th row (the "without training" setting).

Q5: How important is the LoRA generator base LLM?

Ans: We conducted an experiment to verify the impact of the LoRA generator. As shown in Table 14, LLaMA3-8B (with the strongest reasoning ability) does bring better results, and increasing the size of the LoRA generator also improves performance to a certain extent.

           | ARC Challenge | OpenBookQA
LLaMA2-7B  | 41.3          | 28.6
LLaMA2-13B | 42.6          | 29.0
LLaMA3-8B  | 43.2          | 31.0

Table 14. Ablation of Different LLM-based LoRA Generators.

Q6: Are there scaling properties?

Ans: In this work, we offer an initial verification of the advantages of this paradigm. Moving forward, we will expand the dataset size to explore more effective and resilient training paradigms.

Reviewer Comment

Hi, I thank the author for their response. I am leaning towards keeping my score, but I have two important clarifying questions:

Q1: You write:

Sincerely sorry for the confusion. We aim to utilize the online large language model and the online LoRA Expert Pool to convert user instructions, system prompts, or contextual content into tailored LoRA parameters, improving the performance of the small model. Therefore, during training, we will provide a few-shot instruction to the online LLM to guide the model in acquiring this functionality.

I mean regarding Figure 2(a), the Vanilla LoRA Paradigm. Do you have insights or ablations regarding performance without a system prompt?

Q2: You write As mentioned in Section 1.4 of the supplementary material, we process the Alpaca dataset through GPT-4, resulting in a filtered and abstracted set of 37,658 training samples. Then we use this part of the data for pre-training to obtain the instruction-aware LoRA generator and LoRA expert pool. We directly utilize the checkpoint to perform generalization tests on the GPT4Tools test set, corresponding to the 4th row "without training setting".

Does this mean you train with Alpaca for tool use? Or with commonsense data (ARC/OPQA/HellaSwag etc.)? Or both? How about for the commonsense eval?

In general, could you clarify what training data is used for what experiments?

Thank you.

Author Comment

We sincerely thank you for the precious review time and comments.

Q1: Our method requires specific content to be provided to the online LoRA generator. As shown in Table 16, we evaluate the performance of vanilla LoRA without a system prompt. Our LoRA-Gen utilizes a general system prompt to produce tailored LoRA weights.

                  | Vanilla LoRA | LoRA-Gen
ARC-c (Seen Task) | 32.94        | 34.73

Table 16: Performance comparison.

Q2: (a) Commonsense scenario evaluation, shown in Table 2: all methods are trained with the training sets of the seen tasks as well as Alpaca. (b) Tool usage scenario evaluation, shown in Table 3: all methods except lines 1 and 4 are trained with Alpaca and the training set of GPT4Tools; line 4 is trained only with Alpaca.

Official Review (Rating: 4)

This paper introduces LoRA-Gen, a framework that enhances domain-specific task performance for small edge-side models by leveraging a large cloud-side model to generate LoRA parameters based on task descriptions. Utilizing reparameterization, LoRA-Gen merges these parameters into edge-side models, enabling flexible specialization without specialized training. This approach improves inference efficiency by reducing input context length and facilitates knowledge transfer.
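For concreteness, the merge step this summary refers to can be written generically as follows; this is one plausible notation for the mechanism, not necessarily the paper's exact formulation.

```latex
W' = W_0 + \Delta W, \qquad \Delta W = \sum_{i=1}^{n} g_i \, B_i A_i ,
```

where W_0 is a frozen edge-side weight matrix, (A_i, B_i) are the low-rank pairs in the LoRA expert pool, and g_i are gating weights derived from the cloud model's meta tokens. Once the sum is folded into W_0, edge-side inference uses a single dense matrix and no longer needs the task description in its input context.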

Questions for Authors

N/A

Claims and Evidence

Yes.

Methods and Evaluation Criteria

The evaluation in this paper is constrained in both scope and settings:

  1. The study focuses on classification, question answering, sentence completion, and fill-in-the-blank tasks, labeling them as "Reasoning Tasks." However, it does not include reasoning-specific datasets such as mathematical reasoning, multi-hop reasoning, or commonsense reasoning, which are critical for validating such claims.
  2. The paper lacks experiments on training-free unseen tasks, which are necessary to demonstrate the method's generalizability. Expanding the evaluation with broader experiments or more in-depth analysis would strengthen the validity of the proposed approach.

Theoretical Claims

I have reviewed the formulas within this paper.

Experimental Design and Analysis

See "Methods And Evaluation Criteria."

Supplementary Material

Checked the entire appendix.

Relation to Existing Literature

The paper combines and builds upon concepts from parameter-efficient adaptation, mixture-of-experts, context compression, and knowledge transfer. It introduces an online generation and reparameterization mechanism designed to balance efficiency and effectiveness, offering a solution to the inherent trade-offs between these two aspects.

Missing Important References

N/A

Other Strengths and Weaknesses

The paper does not provide a detailed discussion on the specific experts included in the LoRA expert pool. A more thorough explanation of these experts would help clarify their relevance to the tested datasets and better illustrate the method's generalizability across different tasks.

Other Comments or Suggestions

N/A

Author Response

Q1: The evaluation does not include reasoning-specific datasets such as mathematical reasoning.

Ans: Thanks for the great advice; we further evaluated our method on the mathematical reasoning dataset GSM8K. As shown in Table 10, LoRA-Gen brings gains. Due to time constraints, we will collect more types of tasks for evaluation in the revised version.

         | GSM8K (mathematical reasoning) | OpenBookQA (multi-hop reasoning) | ARC Challenge (commonsense reasoning)
Baseline | 62.4                           | 31.2                             | 43.3
LoRA-Gen | 64.2                           | 33.4                             | 44.3

Table 10: Diverse Tasks Performance.

Q2: The paper lacks experiments on training-free unseen tasks.

Ans: In this work, we mainly explored generalization capabilities in mainstream reasoning and agent scenarios. We set Hellaswag, Winogrande, and PIQA as generalization test datasets, which do not appear during training, to verify the training-free unseen-task capability of our method, as shown in Table 2 of the manuscript. We further verified the generalization performance of our method by testing on the unseen tools in the GPT4Tools dataset, as shown in Table 3 of the manuscript. We will further explore the performance of LoRA-Gen in a wider range of scenarios in future work.

Q3: The paper does not provide a detailed discussion on the specific experts included in the LoRA expert pool.

Ans: We analyzed the weight distribution across different experts when processing various tasks, as shown in Table 11. For instance, expert 5 is frequently activated for agent tasks but rarely for SIQA tasks, demonstrating that the online LoRA Pool implicitly assigns knowledge to different experts.

          | Expert 1 | Expert 2 | Expert 3 | Expert 4 | Expert 5 | Expert 6 | Expert 7 | Expert 8
ARC Easy  | 0.0981   | 0.5938   | 0.3027   | 0.0149   | 0.4082   | 0.3223   | 0.2051   | 0.0518
SIQA      | 0.0107   | 0.3945   | 0.4277   | 0.4297   | 0.0378   | 0.2715   | 0.3906   | 0.0361
GPT4Tools | 0.2041   | 0.6875   | 0.2070   | 0.0026   | 0.5586   | 0.2617   | 0.0415   | 0.0361

Table 11: Expert Weight Distribution.

Reviewer Comment

Thank you for your detailed response. I have increased my score accordingly.

Final Decision

This paper proposes LoRA-Gen, an innovative framework that utilizes a large cloud-based model to generate task-specific LoRA parameters for smaller edge models, aiming to enhance performance and efficiency without requiring task-specific training on the edge device. The method merges generated parameters via reparameterization, reducing context length and enabling faster inference. Experimental results demonstrate promising improvements in accuracy, latency, and compression on commonsense reasoning and agent tasks compared to baselines like standard LoRA. The authors effectively addressed most reviewer concerns during the rebuttal phase with additional experiments and clarifications.

Strengths:

  • Novelty and Practicality: Addresses the significant challenge of deploying specialized LMs on resource-constrained edge devices using a novel cloud-edge synergy (All Reviewers).
  • Efficiency Gains: Demonstrates clear improvements in inference speed and context compression, crucial for edge applications (7PAF, TR5Y).
  • Strong Empirical Results: Shows superior performance over relevant baselines (LoRA, LoRA-MoE) on several benchmarks (All Reviewers).
  • Clarity: The paper is generally well-written and the proposed method is clearly explained (TR5Y).

Concerns and how they were addressed:

  • Generalization Scope: While authors provided additional data (GSM8K) and pointed to existing unseen task results, the validation of training-free generalization could still be broader across more diverse task types (HcLz, 7PAF). Reviewers HcLz and 7PAF were mostly satisfied by the rebuttal.
  • Training Data Clarity: Details regarding the training data composition (e.g., Alpaca pre-training) and its specific use across different experimental settings (commonsense vs. tool use) were clarified in the rebuttal but should be more explicitly integrated into the main paper for better reproducibility and understanding (TR5Y). Reviewer TR5Y maintained their score requesting this improvement.
  • Cloud Dependency & Scalability: The method relies on a cloud model and specific cloud-edge model pairs, limiting offline use and potentially requiring significant meta-training data ('dataset of datasets') for robust scaling, which remains future work (7PAF, TR5Y). Authors acknowledged this limitation; reviewers accepted the clarification.
  • Comparison Breadth: While compared effectively to LoRA variants and prompt tuning (added in rebuttal), comparison to other recent PEFT methods (e.g., VeRA, DoRA mentioned by 7PAF) is missing. Reviewer 7PAF accepted the rebuttal despite this omission.
  • Hyperparameter Sensitivity: Limited exploration of hyperparameters like the number of experts and their interaction with training data scale (7PAF). Authors provided justification based on existing experiments (Table 5); Reviewer 7PAF accepted this.