Prompt Tuning Strikes Back: Customizing Foundation Models with Low-Rank Prompt Adaptation
Low-Rank Prompt Adaptation: a soft-prompting approach for effective and efficient customization of large foundation models.
Abstract
Reviews and Discussion
This paper proposes a new prompt-tuning-based approach called Low-Rank Prompt Adaptation (LOPA), which performs comparably to state-of-the-art PEFT methods without the need for a server-based adapter. LOPA generates soft prompts by balancing task-specific information shared across instances against customization for each instance, and it uses a low-rank decomposition for parameter efficiency. The effectiveness of the proposed method is validated on multiple natural language understanding datasets.
Strengths
- The authors consider a novel perspective: achieving fine-tuning for downstream tasks without manipulating the foundation model.
- The method is simple, effective, and easy to implement.
- The writing is clear and easy to understand.
Weaknesses
- The comparison with PEFT methods only considered LoRA; other representative methods such as Adapter-tuning, P-Tuning v2, etc., were not taken into account. Additionally, methods mentioned in the related work, such as LPT and SPT, that may compete with LoPA were not compared in the experimental tables.
- The ablation experiments are not comprehensive enough. The cost savings and performance sacrifices of using low-rank decomposition have not been discussed.
- Would it be more accurate to replace "Foundation Models" in the title with "Language Models"? Although the proposed method appears to be a general approach, it was only validated on natural language datasets.
- For the analysis of method principles, such as the offset subspace induced by LOPA, could some quantitative/visual verification be provided to demonstrate the changes brought about by the introduction of LOPA?
Questions
See the weaknesses above.
Limitations
See the weaknesses above.
“The comparison with the PEFT methods only considered LoRA, other representative methods such as Adapter-tuning, P-Tuningv2, etc., were not taken into account. Additionally, methods mentioned in the related work such as LPT and SPT that may compete with LoPA were not compared in the experimental tables.”
Response: Thank you for this feedback. We chose LoRA as a representative baseline because it requires storage of user-specific parameters on the server for LLM personalization. Additionally, we focused on soft-prompting methods (e.g., PT, IDPG) that enable model customization on the user side, without necessitating server-side modifications. Soft-prompting methods like P-Tuning v2, LPT, and SPT insert prefix vectors within intermediate layers of the transformer network, requiring server-side changes for every user query.
To address your concern, we showcase a comparison with P-tuning v2 (Liu et al., 2021), Prefix-tuning (Li et al., 2021), and a recent parameter-efficient baseline, DePT (Shi et al., 2023). We observe that while DePT and P-tuning v2 reduce parameters, they also incur a significant performance drop (~23 and ~16 points average drop, respectively) compared to LOPA. On the other hand, Prefix-tuning performs similarly to LOPA but does so at the cost of 16x more parameters.
| Approach | Params | RTE | MRPC | SST-2 | QNLI | Average |
| -------- | ------ | --- | ---- | ----- | ---- | ------- |
| LOPA | 1.6M | 83.39 | 91.09 | 95.99 | 93.74 | 91.05 |
| DePT | 10.2K | 53.79 | 72.97 | 89.68 | 57.09 | 68.38 |
| P-tuning v2 | 0.49M | 53.43 | 70.18 | 89.91 | 85.21 | 74.68 |
| Prefix-tuning | 25.75M | 82.67 | 90.86 | 93.80 | 94.98 | 90.58 |

We will include this comparison and more task results in the paper.
“The ablation experiments are not comprehensive enough. The cost savings and performance sacrifices of using low-rank decomposition have not been discussed.”
Response:
- Ablation study - cost vs. performance trade-off: In the paper, Figure 4 studies this cost-performance trade-off as a function of rank: bar plots show the training cost in terms of the number of trainable parameters, and line plots show performance. The table below collects the additional ablation results referenced in the next two points.

| Approach | Params | RTE | MRPC | SST-2 |
| -------- | ------ | --- | ---- | ----- |
| LOPA (r=4) | 1.60M | 83.39 | 91.09 | 95.99 |
| LOPA_add (r=4) | 1.60M | 64.26 | 75.17 | 93.34 |
| IDPG+PHM (n=8) | 0.37M | 67.14 | 76.12 | 95.07 |
| IDPG+PHM (n=16) | 0.20M | 65.34 | 76.68 | 94.61 |
| IDPG+PHM (n=32) | 0.17M | 68.23 | 74.99 | 94.72 |
- Ablation study - PHM Layers vs. Low-Rank Decomposition: We compare the low-rank decomposition in LOPA with Parameterized Hypercomplex Multiplication (PHM) layers, implemented for the IDPG baseline as an alternative route to parameter efficiency, with n denoting the hyper-parameter balancing parameter complexity against the extent of factorisation in the Kronecker product. Our study on three NLU tasks revealed that while PHM layers reduce the parameter count, they also result in a significant performance drop (~15 points on RTE and MRPC), likely due to the structural constraints of Kronecker factorisation limiting expressiveness (Zhang et al., 2021).
- Ablation study - Non-linear Composition of Z: We also compared LOPA with LOPA_add, an additive approach for composing Z. The non-linear composition in LOPA, expressed as Z = Z_S ∘ g(Z_I), outperformed LOPA_add, indicating the importance of the non-linear interaction for the observed performance gains (a minimal sketch of this composition follows the list).
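For concreteness, here is a minimal PyTorch sketch of this composition. It is simplified for illustration: the choice of sigmoid for the gate g, the pooled-feature input, and all layer sizes are placeholder assumptions, not the exact implementation.

```python
import torch
import torch.nn as nn

class LOPAPrompt(nn.Module):
    """Sketch: soft prompt Z = Z_S o g(Z_I), where the instance-specific
    component Z_I is built from two low-rank factors of rank r."""

    def __init__(self, d_model: int, m_tokens: int, r: int, h_enc: int):
        super().__init__()
        # Task-specific component, shared across all instances.
        self.z_s = nn.Parameter(torch.randn(m_tokens, d_model))
        # Instance-encoder heads predicting the two low-rank factors.
        self.to_u = nn.Linear(h_enc, m_tokens * r)   # ~h*m*r parameters
        self.to_v = nn.Linear(h_enc, d_model * r)    # ~h*d*r parameters
        self.m, self.d, self.r = m_tokens, d_model, r

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, h_enc) pooled features of the input instance.
        b = feats.size(0)
        u = self.to_u(feats).view(b, self.m, self.r)
        v = self.to_v(feats).view(b, self.r, self.d)
        z_i = u @ v                            # instance-specific, rank <= r
        return self.z_s * torch.sigmoid(z_i)   # Z_S o g(Z_I), Hadamard product
```

In contrast, LOPA_add would replace the last line with `self.z_s + z_i`, removing the multiplicative interaction between the two components.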
"Would it be more accurate to replace "Foundation Models" in the title with "Language Models"? Although the proposed method appears to be a general approach, it was only validated on natural language datasets.”
Response: As the method is general, our preference is to use the term 'foundation model,' but the reviewer correctly points out that the method was only evaluated with LLMs. We are happy to follow the reviewer's advice here.
That said, the proposed approach has been validated on both natural language and code-generation datasets; refer to Table 2 for the evaluation on the MBPP and CruxEval datasets.
“For the analysis of method principles, such as the offset subspace induced by LOPA, could some quantitative/visual verification be provided to demonstrate the changes brought about by the introduction of LOPA?”
Response: This is an excellent suggestion, especially the development of an appropriate visualization to illustrate how LOPA functions. We plan to explore this idea and hope to add illustrative visualizations to an appendix.
References:
Zhang, Aston, et al. "Beyond fully-connected layers with quaternions: Parameterization of hypercomplex multiplications with 1/n parameters." arXiv preprint arXiv:2102.08597 (2021).
Shi, Zhengxiang, and Aldo Lipani. "DePT: Decomposed prompt tuning for parameter-efficient fine-tuning." arXiv preprint arXiv:2309.05173 (2023).
Liu, Xiao, et al. "P-tuning v2: Prompt tuning can be comparable to fine-tuning universally across scales and tasks." arXiv preprint arXiv:2110.07602 (2021).
Li, Xiang Lisa, and Percy Liang. "Prefix-tuning: Optimizing continuous prompts for generation." arXiv preprint arXiv:2101.00190 (2021).
Thank you for the response. After reading the rebuttal and the other reviewers' comments, my concerns have been addressed.
The paper introduces Low-Rank Prompt Adaptation (LOPA), an instance-aware prompt tuning-based approach. LOPA constructs soft prompts from a task-specific component (shared across samples) and an instance-specific component (unique to each sample), combining them using a gating function. It employs a low-rank decomposition of the instance-specific component to enhance parameter efficiency. Unlike Parameter-Efficient Fine-Tuning (PEFT) methods like LoRA, LOPA does not need to store adapter-like modules for each task as it is based on prompt tuning. The paper evaluates LOPA on natural language understanding and code tasks to demonstrate its effectiveness.
Strengths
- Parameter Efficiency: More efficient than traditional PEFT methods like LoRA.
- No Server-Side Changes: Once trained, LOPA's soft prompts can be used as input prefixes without additional server-side computational cost.
Weaknesses
The major weakness of the paper is its lack of novelty.
- The proposed approach is quite similar to the IDPG method referenced in [36]. The IDPG approach also uses both instance-specific and task-specific prompts. While it is true that in IDPG, updates to Z_S and Z_I are independent of each other, this issue can be addressed by adding a non-linearity after the second layer in their prompt generator network. Additionally, to reduce the parametric complexity of Z_I, the authors have used a low-rank decomposition of Z_I, similar to LoRA. Similarly, the IDPG paper uses Parameterized Hypercomplex Multiplication (PHM) Layers to reduce the complexity of the prompt generator. There is no analysis in the paper comparing the novelty and benefits of low-rank decomposition, as in LoRA, to PHM Layers.
- Additionally, the performance of LoPA is inferior to LoRA on 6 out of 7 datasets in Table 1.
- Missing important experimental details: The paper does not indicate how many epochs each of the methods was trained for.
Questions and Suggestions:
a. Line 102: The paper states, "IDPG [36], which emphasizes an instance-specific prompt." Also, on Line 156: "existing instance-specific approaches [36]." In general, IDPG uses both instance-specific and task-specific prompts. Therefore, referring to it solely as an instance-specific approach may not be correct.
b. Lines 138-140: "However, encoding a matrix of size d×m can be expensive." Why not use a linear layer of dimension n×m as an encoding function f? This might not be as expensive.
c. How different and efficient is the low-rank decomposition of Z_I compared to the Parameterized Hypercomplex Multiplication (PHM) Layers proposed in the IDPG paper (Section 3.2.1)? PHM layers also optimize the prompt generator network.
d. There is no discussion about the convergence of the proposed method, as prompt tuning is known for its slower convergence. It would be good to show the convergence of the proposed method.
e. Line 122: should be .
f. Line 197: "Evaluation.For" -> space is missing.
Questions
Please refer to the weaknesses section.
Limitations
NA
"The proposed approach is quite similar ... decomposition, as in LoRA, to PHM Layers." "How different and efficient is the ...also optimize the prompt generator network."
Response: Thank you for the detailed comparison. Here are the key distinctions and considerations:
- Encoding of Z_S and Z_I: The encoding of Z_S and Z_I differs fundamentally between IDPG and LOPA. In IDPG, Z_S is the bias term of the last layer of the prompt generator, so the prompt is constructed additively as Z = Z_S + Z_I. In LOPA, Z is constructed as Z = Z_S ∘ g(Z_I), making Z_S and Z_I co-dependent through the gating function g. We experimentally found this non-linear interaction to be crucial for the performance gains observed with LOPA; it is absent in existing soft-prompt-based learning approaches like IDPG, PT, etc.
- PHM Layers vs. Low-Rank Decomposition: We acknowledge that PHM layers can reduce parameter complexity, similar to the low-rank decomposition in LOPA. We have carried out an ablation study using PHM in IDPG to construct the soft prompt, with n representing the hyper-parameter balancing parameter complexity against the extent of factorisation in the Kronecker product (see the PHM sketch after this list). Our ablation study on three NLU tasks shows that while PHM layers reduce parameters, they also lead to a significant performance drop (~15 pts on RTE and MRPC) compared to LOPA. This drop may be due to the structural constraints that Kronecker factorisation imposes on PHM layers, which could limit expressiveness (Zhang et al., 2021).

| Approach | Params | RTE | MRPC | SST-2 |
| -------- | ------ | --- | ---- | ----- |
| LOPA (r=4) | 1.60M | 83.39 | 91.09 | 95.99 |
| IDPG+FC | 2.89M | 77.26 | 78.60 | 95.30 |
| IDPG+PHM (n=8) | 0.37M | 67.14 | 76.12 | 95.07 |
| IDPG+PHM (n=16) | 0.20M | 65.34 | 76.68 | 94.61 |
| IDPG+PHM (n=32) | 0.17M | 68.23 | 74.99 | 94.72 |
- Future Work: We want to point out that PHM can be used in conjunction with the low-rank decomposition in LOPA to reduce trainable parameters further. See below for a comparison of trainable parameter complexity. We can observe that, whether using PHM or FC layers, LOPA remains more parameter efficient by a factor of dm/(r(d+m)) (refer to Sect. 3.3 for notations). This is an interesting experiment that we leave for future work.
  - IDPG + FC: O(h·d·m)
  - LOPA + FC: O(h·r·(d+m))
  - IDPG + PHM: O(h·d·m/n)
  - LOPA + PHM: O(h·r·(d+m)/n)
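For reference, here is a minimal sketch of a PHM linear layer in the sense of Zhang et al. (2021). This is a generic illustration of the technique (bias and initialization details omitted), not the exact IDPG implementation:

```python
import torch
import torch.nn as nn

class PHMLinear(nn.Module):
    """Parameterized Hypercomplex Multiplication layer:
    W = sum_i A_i kron B_i, cutting parameters from in_dim*out_dim
    to roughly in_dim*out_dim/n (plus n^3 for the small A factors)."""

    def __init__(self, in_dim: int, out_dim: int, n: int):
        super().__init__()
        assert in_dim % n == 0 and out_dim % n == 0
        self.a = nn.Parameter(torch.randn(n, n, n))                     # n x (n x n)
        self.b = nn.Parameter(torch.randn(n, out_dim // n, in_dim // n))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Build the (out_dim x in_dim) weight on the fly from Kronecker factors.
        w = sum(torch.kron(self.a[i], self.b[i]) for i in range(self.a.size(0)))
        return x @ w.T
```

The A_i factors are tiny (n×n), so almost all parameters live in the B_i blocks, giving the roughly 1/n reduction; the structural coupling imposed by the Kronecker product is the expressiveness constraint discussed above.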
"Additionally, the performance of LoPA is inferior to LoRA on 6 out of 7 datasets in Table 1."
Response: This is true. However, the performance difference in Table 1 is marginal (< 1% in 5 of 6 tasks where LoRA is better). Given the additional 18 cases in Table 2, it is even less clear that there is any meaningful advantage to LoRA in terms of accuracy, as LOPA outperformed LoRA in 11 out of 24 cases. Given the other advantages of LOPA (parameter efficiency, no need for deployment on the server), we argue that the method has significant value.
“Missing important experimental details: The paper does not indicate how many epochs each of the methods was trained.”
Response: Thank you for pointing this out. Here are the training details. In NLU Tasks, FFT and LoRA were trained for 10 epochs, while prompt-tuning approaches were trained for 20 epochs. In MBPP, all methods were trained for 10 epochs across all foundation model (FM) backbones. In CruxEval Tasks, for FM backbones under 7B, PEFT approaches were trained for 20 epochs, while larger FMs (≥7B) were trained for 10 epochs. FFT on CruxEval tasks for FM backbones under 7B was trained for 5 epochs. We will include these details in the final manuscript.
“Lines 138-140: "However, encoding a matrix of size d×m can be expensive." Why not use a linear layer of dimension n×m as an encoding function f? This might not be as expensive.”
Response: The IDPG baseline considered in the paper indeed uses a linear layer with output dimension d·m as the encoding function f. As a result, it is more computationally expensive than LOPA's encoding of Z_I. Consider the following: if the input features have dimension h, the parameter complexity of f in IDPG is O(h·d·m). In comparison, LOPA uses two low-rank factors encoded with linear layers of output sizes d·r and m·r, reducing the parameter complexity to O(h·r·(d+m)), i.e., an improvement by a factor of dm/(r(d+m)).
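As a quick sanity check of this arithmetic, the two encoder sizes can be compared directly; the values below are illustrative, not our exact configuration:

```python
# Parameter count of the prompt encoder f under the two designs
# (h, d, m, r follow the notation above; sizes are illustrative only).
h, d, m, r = 768, 1024, 10, 4

idpg_full = h * d * m                 # one linear head emitting the d x m prompt
lopa_lowrank = h * d * r + h * m * r  # two heads emitting the rank-r factors

print(idpg_full, lopa_lowrank, idpg_full / lopa_lowrank)
# 7864320 3176448 ~2.48x -- the ratio equals d*m / (r*(d+m))
```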
“There is no discussion about the convergence of the proposed method, as prompt tuning is known for its slower convergence. It will good to show convergence of proposed method.”
Response: Thank you for highlighting this aspect. Faster convergence is indeed a key benefit of LOPA. The enclosed PDF contains plots comparing the training loss and performance on NLU tasks (QQP, QNLI, MNLI) for Prompt Tuning (PT), IDPG, and LOPA. The results show that instance-dependent methods like IDPG and LOPA converge faster than traditional prompt tuning. Moreover, LOPA converges faster and achieves higher accuracy or F1 scores than IDPG. We appreciate the suggestion and will include this analysis in the paper.
References:
Zhang, Aston, et al. "Beyond fully-connected layers with quaternions: Parameterization of hypercomplex multiplications with 1/n parameters." arXiv preprint arXiv:2102.08597 (2021).
Thank you for your detailed response.
I have a question: Is it not correct that by adding a non-linearity after the second layer in IDPG and using the Hadamard product instead of addition between Z_S and Z_I, we can achieve the same effect as LOPA?
There are three differences. First, as the reviewer suggests, use a non-linearity and the Hadamard product. Second (and most crucially), modify the prompt only at the input, not after each transformer block, so no special server-side computation is required for the adaptation (IDPG performs computation at every layer). Finally, drop the server-side classifier head used by IDPG and use the transformer output directly.
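To make the second difference concrete, here is a minimal sketch of the client-side step; the helper name and shapes are illustrative, not part of the paper:

```python
import torch

def prepend_soft_prompt(z: torch.Tensor, token_embeds: torch.Tensor) -> torch.Tensor:
    """Client-side: attach the finished soft prompt Z in front of the input
    embeddings. The server then runs a completely unmodified forward pass --
    no per-layer prefixes (unlike IDPG / P-tuning v2) and no extra classifier head."""
    # z: (batch, m_tokens, d_model); token_embeds: (batch, seq_len, d_model)
    return torch.cat([z, token_embeds], dim=1)
```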
Thanks for the response.
The paper introduces Low-Rank Prompt Adaptation (LOPA), a novel parameter-efficient fine-tuning (PEFT) approach that improves soft prompt tuning, delivering performance on par with LoRA and full fine-tuning methods. LOPA addresses scalability issues of traditional PEFT methods by using a low-rank decomposition for instance-specific soft prompts. This technique combines task-specific and instance-specific prompts through a gating function, offering a balance between customization and parameter efficiency. The paper demonstrates LOPA's effectiveness across various natural language understanding and code generation tasks, positioning it as a competitive alternative to adapter-based methods.
Strengths
- The paper is well-written.
- The proposed method is simple and the authors demonstrate its effectiveness across various natural language understanding and code generation tasks.
- The authors provide a thorough analysis to understand the relative importance of several aspects of their approach.
Weaknesses
I felt there were several obvious questions left unexplored, noted below, which raise concerns regarding the significance of the paper's contributions.
- The authors only experimented with a rather small model, i.e., 355M RoBERTa, for classification tasks, while trying much larger models (up to 8B) for code generation tasks. This raises concerns about whether the proposed method works with larger models on classification tasks.
- The authors focused solely on a few classification tasks and two code generation tasks. This raises concerns about the proposed approach's effectiveness on other tasks, like open-ended generation, where prompt tuning often underperforms (An et al., 2022).
- Finally, I am concerned about the practical adoption of the proposed approach, since it is unclear whether it generally performs better than LoRA.
References:
An et al., 2022: https://arxiv.org/pdf/2203.03131
Questions
LOPA constructs the soft prompt as Z = Z_S ∘ g(Z_I). Did you try an additive approach by concatenating the two vectors instead?
Limitations
The author discussed several limitations of their approaches, including the effectiveness of LOPA on practical tasks, the assumed positioning of the learned soft prompt, and the need for further exploration of LOPA as a conditional auto-encoder.
“The authors only experimented with a rather small model, i.e., 355M RoBERTa, for classification tasks while trying much larger models (up to 8B) for code generation tasks, which raises concerns about whether the proposed method works with larger models for classification tasks.”
Response: Thank you for the observation. For natural language tasks, it is established that prompt tuning becomes competitive and comparable to fine-tuning with large models (>1B) (Lester et al., 2021; Liu et al., 2021). Therefore, similar to other recent works (Liu et al., 2021; Wu et al., 2022; Zhu et al., 2023), our focus is improving prompt tuning efficacy for medium-sized models (100M to 1B). Conversely, we experimented with much larger models (up to 8B) for code generation tasks, as the impact of prompt tuning methods in this area has not been extensively studied.
“The authors focused solely on a few classification tasks and two code generation tasks. This raises concerns about the proposed approach's effectiveness for other tasks, like open-ended generation, where prompt tuning often underperforms (An et al., 2022).”
Response: To address this concern, we conducted experiments on the standard E2E and WebNLG benchmarks for open-ended natural language generation. We followed the hyper-parameter setup of Hu et al. (2021) and fine-tuned GPT2-medium with LoRA, Prompt Tuning (PT), and our approach.
E2E:

| Approach | BLEU | NIST | METEOR | ROUGE-L | CIDEr |
| -------- | ---- | ---- | ------ | ------- | ----- |
| LoRA | 68.78 | 8.81 | 46.52 | 71.36 | 2.49 |
| PT (m=100) | 32.98 | 0.65 | 27.54 | 57.04 | 0.76 |
| Ours | 65.85 | 8.39 | 43.10 | 68.65 | 2.27 |

WebNLG:

| Approach | BLEU-U | BLEU-S | BLEU-A | TER-U | TER-S | TER-A |
| -------- | ------ | ------ | ------ | ----- | ----- | ----- |
| LoRA | 46.89 | 63.27 | 55.85 | 0.45 | 0.33 | 0.39 |
| PT (m=100) | 29.59 | 31.98 | 30.94 | 0.54 | 0.54 | 0.54 |
| Ours | 44.78 | 55.46 | 50.65 | 0.44 | 0.37 | 0.40 |

The results show that while standard prompt tuning underperforms, our method significantly outperforms PT and closely matches LoRA's performance on both benchmarks. For WebNLG, we further report results across seen (S), unseen (U), and all (A) categories. Our approach demonstrates strong extrapolation performance on unseen WebNLG categories (see BLEU-U and TER-U), indicating its ability to handle diverse domains in the data without server-side personalization of the foundation model. This suggests that our method is also effective in open-ended generation scenarios. We will include this benchmark comparison and more baseline results in the paper.
“Finally, I am concerned about the practical adoption of the proposed approach since it is unclear whether the proposed approach performs better than LoRA generally.”
Response: True, we find no significant difference between LoRA and LOPA in terms of accuracy. However, LOPA has two key advantages. First, all other things being equal, a purely prompt-based method (such as LOPA) is preferable to one that requires integrating an adapter with the model at the server (such as LoRA). LOPA allows the model to be specialized at the client (or via middleware), without modification at the server, and it does not require that any use-case-specific parameters be stored on the server, which can be costly, especially if the number of specializations is large. Second, we find that LOPA is more parameter-efficient than LoRA.
“LOPA constructs the soft prompt as Z = Z_S ∘ g(Z_I). Did you try an additive approach by concatenating the two vectors instead?”
Response: We experimentally found that the non-linear composition of Z via g is crucial for the performance gains observed with LOPA. See the following ablation study on a subset of NLU tasks, where LOPA_add, which opts for the additive approach, underperforms.
| Approach | Params | RTE | MRPC | SST-2 |
| -------- | ------ | --- | ---- | ----- |
| LOPA (r=4) | 1.60M | 83.39 | 91.09 | 95.99 |
| LOPA_add (r=4) | 1.60M | 64.26 | 75.17 | 93.34 |

We appreciate your feedback and will include these comparisons and numbers on the remaining tasks in the final manuscript.
References:
Lester, Brian, Rami Al-Rfou, and Noah Constant. "The power of scale for parameter-efficient prompt tuning." arXiv preprint arXiv:2104.08691 (2021).
Liu, Xiao, et al. "P-tuning v2: Prompt tuning can be comparable to fine-tuning universally across scales and tasks." arXiv preprint arXiv:2110.07602 (2021).
Wu, Zhuofeng, et al. "IDPG: An instance-dependent prompt generation method." arXiv preprint arXiv:2204.04497 (2022).
Zhu, Wei, and Ming Tan. "SPT: Learning to selectively insert prompts for better prompt tuning." Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. 2023.
Hu, Edward J., et al. "LoRA: Low-rank adaptation of large language models." arXiv preprint arXiv:2106.09685 (2021).
Thank you for the insightful comments. We enclose the convergence plots of the prompt-tuning based baselines and the proposed approach on a subset of NLU tasks.
Dear reviewers,
Could you please check the authors' rebuttal and see if some of your concerns have been addressed? We are approaching the end of the rebuttal phase.
Thanks!
Best, AC
This paper proposes Low-Rank Prompt Adaptation (LOPA), which achieves competitive results compared with state-of-the-art PEFT approaches and full fine-tuning while being more parameter-efficient and not requiring server-based adaptors (drawbacks of PEFT methods like LoRA). Reviewers agree that the paper is well-written and easy to understand, and proposes a novel approach. The contribution of an effective prompt tuning approach is exciting! Most concerns (e.g., experiments on more tasks) in the original reviews have been addressed in the rebuttal. The authors should follow the discussions to improve the paper.