PaperHub
Average rating: 3.7 / 10 (withdrawn)
3 reviewers · min 3, max 5, std. dev. 0.9
Ratings: 3, 5, 3 · Average confidence: 3.0
ICLR 2024

Parameter-Efficient Fine-Tuning via Partially Decomposable Loss Analysis and Sharing

Submitted: 2023-09-19 · Updated: 2024-03-26
TL;DR

We introduce an efficient fine-tuning procedure that further optimizes the celebrated LoRA framework. We also provide theoretical guarantee for a range of loss functions.

Abstract

Keywords
Fine-tuning, efficient training

Reviews and Discussion

Official Review
Rating: 3

This paper proposes a framework named Multi-LoRA, which enables parameter sharing across different parallel fine-tuning tasks. The authors provide theoretical analysis to justify this framework.

Strengths

  1. The paper is well-written and easy to read.

  2. The idea of parameter sharing or parameter transfer between tasks within the PEFT framework is valid and interesting.

Weaknesses

  1. The motivation for the studied problem is unclear. I doubt whether there are real-world scenarios in which we need to fine-tune LLMs on multiple tasks in parallel. Moreover, the experimental results show that saving 50% of the parameters (~1M) comes at the cost of reduced performance compared to LoRA. Is this worth it?

  2. Some important related works are missing. More related work should be discussed, such as parameter-efficient fine-tuning and multi-task learning.

  3. The experiments are inadequate.

    • (1) The paper does not include necessary analyses such as ablation studies and sensitivity analysis to demonstrate the effectiveness of the method.
    • (2) The paper does not compare the training time of the different methods, which may be an important factor in the problem setting proposed by the authors.
    • (3) Sequentially fine-tuning the tasks with parameter (LoRA_A) sharing should also be studied and compared with the proposed method thoroughly.
  4. The paper has no conclusion section.

  5. The relationship between the theory and the effectiveness of the method is unclear to me. The paper devotes a lot of space to proving some simple properties. Why do properties of the global loss (Lipschitz continuity, smoothness, convexity) explain the performance of Multi-LoRA?

  6. Why did you freeze the LoRA_A layers instead of fine-tuning them?

  7. Some statements about LoRA are wrong, such as “LoRA improves inference time”.

Questions

Please see the weaknesses section. No additional questions.

Official Review
Rating: 5

The paper presents a model, Multi-LoRA, aimed at decreasing the number of trainable parameters when employing LoRA for parallel fine-tuning. Specifically, all tasks share a global, fixed parameter A, along with a trainable task-specific parameter B. This strategy significantly reduces the number of trainable parameters: for k tasks, the parameter count can be reduced from O(kdr + kmr) to O(dr + kmr). The authors provide theoretical guarantees for model convergence. Empirical experiments are performed on RoBERTa and GPT-2 for natural language understanding and generation tasks, respectively.
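The sharing scheme summarized above can be sketched in a few lines. This is an illustrative reconstruction, not the authors' code: the names (`W0`, `A`, `B`) and dimensions are assumptions, and only the task-specific `B_k` factors are treated as trainable, matching the stated O(dr + kmr) count.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (not taken from the paper)
d_in, d_out, rank, num_tasks = 64, 32, 4, 3

# Frozen pretrained weight W0 and a single low-rank factor A
# shared across all tasks (also frozen, per the reviewers' summary)
W0 = rng.normal(size=(d_out, d_in))
A = rng.normal(scale=0.01, size=(rank, d_in))

# Trainable task-specific factors B_k, zero-initialized as in LoRA
B = [np.zeros((d_out, rank)) for _ in range(num_tasks)]

def forward(x, task_id):
    """Apply the adapted weight W0 + B_k A for one task."""
    delta = B[task_id] @ A          # rank-r update of shape (d_out, d_in)
    return x @ (W0 + delta).T

x = rng.normal(size=(2, d_in))
print(forward(x, 1).shape)          # (2, 32)

# Trainable parameters are only the B_k matrices: k * d_out * r,
# versus k * (d_in + d_out) * r if each task had its own A and B.
print(sum(b.size for b in B))       # 384
```

Under this sketch, the A factor contributes its O(dr) parameters once rather than once per task, which is where the roughly 50% saving discussed by the reviewers comes from when d ≈ m.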

Strengths

  1. This paper tackles a compelling problem: reducing the number of trainable parameters for parallel fine-tuning. The proposed Multi-LoRA method is technically sound. The authors provide a detailed theoretical proof of model convergence. The empirical results underscore the effectiveness of the method. Notably, the number of trainable parameters decreases from 2.4M to 1.3M for eight Natural Language Understanding (NLU) tasks and from 1.1M to 0.7M for three generative tasks.
  2. The methodology is straightforward, and the experimental settings are detailed.

Weaknesses

  1. The performance of Multi-LoRA is not as strong as that of LoRA. For example, the average scores drop from 87.2 to 85.1 for understanding tasks and from 56.7 to 53.7 for generation tasks. While parameter reduction is significant, performance is often a more critical factor.

  2. The paper does not include a comparison with multi-task learning (MTL). Both settings involve training multiple tasks simultaneously, yet no MTL methods, such as AdapterFusion[1], are compared in the experiments.

  3. The paper lacks experiments on recent Large Language Models (LLMs) like Llama2 and does not provide an analysis of convergence speed.

[1]. AdapterFusion: Non-Destructive Task Composition for Transfer Learning

Questions

  1. How would the model’s performance be affected if parameter A were allowed to be trainable? Could this modification potentially enhance the model’s performance?

  2. Are the RoBERTa and GPT models trained using fp32 precision? If so, what would be the impact on the models if mixed-precision training, such as fp16, were used? This question is particularly relevant given that recent Large Language Models (LLMs) commonly employ fp16 precision for training.

Official Review
Rating: 3

This paper designs a framework that reduces the parameter count even more than LoRA, in addition to enabling parameter sharing among various parallel fine-tuning tasks. When the volume of parallel fine-tuning tasks increases, the framework slashes the parameter count by nearly half in comparison to LoRA. Additionally, the authors provide theoretical evidence explaining the effectiveness of this approach—and, by extension, that of LoRA—for a wide array of loss functions. The effectiveness of the proposed method is empirically confirmed on multiple benchmark models and datasets, showcasing a substantial decrease in parameter count while maintaining performance comparable to that of LoRA.

Strengths

Originality: The Multi-LoRA framework is a novel approach to fine-tuning LLMs that takes into account shared structure between tasks, which is an important consideration in many real-world applications.

Quality: The theoretical analysis of the method is well-presented and provides insights into the properties of the method.

Clarity: The paper is well-organized and easy to follow.

Significance: The Multi-LoRA framework and the proposed method have the potential to improve the efficiency and effectiveness of fine-tuning LLMs, which is an important consideration for many real-world applications.

Weaknesses

This paper can be significantly improved by more thorough experiments in several aspects:

  1. Experiments on more and larger LLMs
  2. More metrics beyond GLUE
  3. The proposed Multi-LoRA appears to achieve inferior performance on all tasks, which makes it hard to validate the effectiveness of the proposed method.

Questions

See the weaknesses section.