Concentrate Attention: Towards Domain-Generalizable Prompt Optimization for Language Models
This paper explores the domain generalization ability of prompts in PLMs, discovering that prompts with higher "Concentration" are more generalizable, leading to new optimization methods that outperform existing techniques.
Abstract
Reviews and Discussion
This paper studies prompt optimization (PO) methods for finetuning language models. While previous methods mainly concern in-domain performance, this paper brings awareness to the domain generalization issue present in existing PO methods, under the setting where the target domain is unknown. Two empirical findings that link the domain generality of a prompt to its behavior in the attention map are presented for this setting. Building on these findings, the paper proposes new objectives that account for domain generalizability in both soft and hard prompt optimization settings. An empirical study on BERT-size transformers and standard NLP datasets shows that the proposed method achieves fairly consistent and substantial gains over vanilla baselines.
Strengths
- The paper presents a pioneering study of the domain generalization issue of prompt optimization for PLM finetuning under the more practical setting where the target domain is unknown.
- The findings connecting the generalization ability of prompts to the attention patterns are interesting and of value to the community.
- Translating the findings into loss objectives poses nontrivial technical challenges, and the paper presents neat solutions to these problems.
- Empirical gains are fairly substantial and consistent.
Weaknesses
- The main findings that motivate the proposed method are empirical. The paper can benefit from addressing the intuitions behind why such prompts are more generalizable.
- The models of choice are mainly small-scale transformers rather than LLMs. It is unclear whether the findings on the attention patterns generalize to bigger models. The paper could benefit from further verifying them on open-source LLMs.
- Presentation: the in-context citation formats are incorrect.
Questions
Both NLI and sentiment analysis are classification tasks; I wonder if the proposed method generalizes to tasks beyond classification?
Limitations
Yes
We sincerely thank the reviewer for the constructive comments and feedback. We appreciate that our work is considered “pioneering”, “interesting”, and “neat”. We hope our response can address your concerns.
Q1: More intuitive explanation
We thank the reviewer for the valuable comment. Our intuition is that more concentration on prompts leads to less attention on inputs. Thus, domain shifts (changes of the domain that inputs are sampled from) have a smaller negative effect, because the model pays more attention to the knowledge-intensive prompt than to the constantly changing inputs.
Moreover, it is also interesting to investigate why prompts optimized by our method are more generalizable from the perspective of calibration. Calibration refers to the model's ability to output class probabilities that match the likelihood of its predictions being correct. A well-calibrated model exhibits better domain generalization ability, as claimed in [1]:
In this paper we highlight a novel connection between multi-domain calibration and OOD generalization, arguing that such calibration can be viewed as an invariant representation.
We use the common Expected Calibration Error (ECE) [2] metric to evaluate calibration:

$$\mathrm{ECE}=\sum_{m=1}^{M}\frac{|B_m|}{n}\,\Big|\mathrm{acc}(B_m)-\mathrm{conf}(B_m)\Big|,\qquad \mathrm{acc}(B_m)=\frac{1}{|B_m|}\sum_{i\in B_m}\mathbb{1}\big[\hat{y}_i=y_i\big],\qquad \mathrm{conf}(B_m)=\frac{1}{|B_m|}\sum_{i\in B_m}\hat{p}_i$$

Here $n$ is the number of samples and $M$ is the number of confidence bins; each sample $i$ has a gold label $y_i$, a predicted label $\hat{y}_i$, and a predicted probability $\hat{p}_i$, and $B_m$ is the set of samples whose confidence falls into the $m$-th bin. A lower ECE value indicates better calibration. As shown in Table 1, we compare prompt calibration when training with the vanilla method and with our method. We find that our method significantly improves model calibration compared with the vanilla method. We infer that good calibration acts as an important factor in avoiding significant inductive bias in the domain generalization setting, which explains the performance improvement brought by our method.
| Method | S+M→C | C+M→S | S+C→M | Q+R→W | W+R→Q | Q+W→R |
|---|---|---|---|---|---|---|
| Vanilla Prefix | 0.145 | 0.160 | 0.104 | 0.297 | 0.193 | 0.159 |
| Prefix w both | 0.128 | 0.095 | 0.072 | 0.206 | 0.162 | 0.127 |
Table 1. Expected Calibration Error of vanilla Prefix Tuning and Prefix Tuning with our method (first three columns: Sentiment; last three: NLI); smaller is better.
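For concreteness, the following is a minimal sketch of the ECE computation above, assuming equal-width confidence bins; the function and variable names are our own illustration, not the exact evaluation code used here:

```python
import numpy as np

def expected_calibration_error(probs, preds, labels, n_bins=10):
    """Sketch of ECE with equal-width bins over the confidence range (0, 1].

    probs  -- predicted probability of the predicted class, shape (n,)
    preds  -- predicted labels, shape (n,)
    labels -- gold labels, shape (n,)
    """
    probs, preds, labels = map(np.asarray, (probs, preds, labels))
    n = len(probs)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (probs > lo) & (probs <= hi)  # samples falling into bin B_m
        if not mask.any():
            continue
        acc = (preds[mask] == labels[mask]).mean()   # acc(B_m)
        conf = probs[mask].mean()                    # conf(B_m)
        ece += (mask.sum() / n) * abs(acc - conf)    # weighted |acc - conf|
    return ece
```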
Q2: Extension to other tasks
We would like to thank the reviewer for the valuable comment. We are glad to share that our method can be successfully applied to mainstream Large Language Models and to question-answering tasks. We validate its effectiveness on the Llama-2-7b-chat, Vicuna-7b-v1.5, and Alpaca-7b-wdiff models for improving the domain generalization ability of prompts on question answering. We evaluate our method on the ROC, SCT, and COPA datasets from the TRAM Benchmark [3] (referred to as R, S, and C for simplicity), covering multiple-choice questions in reading comprehension and commonsense reasoning. The results are shown in Table 2.
Experimental results show that our method significantly improves the performance of large models on question-answering tasks across multiple domain generalization settings. For instance, on the Llama-7b model, our method improves the average accuracy of soft prompt generalization and hard prompt generalization by 1.91% and 2.36%, respectively; similar improvements are observed for the Vicuna-7b and Alpaca-7b models, ranging from 1.55% to 2.05% and from 1.78% to 1.99%, respectively.
| Model | Method | S+C→R | C+R→S | R+S→C | Avg Gap |
|---|---|---|---|---|---|
| Llama-2-7b-chat | Vanilla Prefix | 62.32±2.15 | 66.30±2.30 | 73.15±2.53 | - |
| | Prefix with both | 63.70±1.96 | 68.47±0.97 | 75.32±1.09 | +1.91 |
| | Vanilla IC | 63.13±1.25 | 65.50±1.98 | 77.59±1.14 | - |
| | IC with both | 65.13±1.03 | 68.33±2.13 | 79.83±0.88 | +2.36 |
| vicuna-7b-v1.5 | Vanilla Prefix | 67.72±1.79 | 81.09±2.17 | 88.97±2.64 | - |
| | Prefix with both | 68.75±1.04 | 83.93±1.79 | 89.76±2.60 | +1.55 |
| | Vanilla IC | 68.37±2.24 | 83.23±4.12 | 90.98±1.99 | - |
| | IC with both | 69.67±1.58 | 85.50±5.06 | 93.39±1.23 | +2.05 |
| alpaca-7b-wdiff | Vanilla Prefix | 61.52±3.79 | 70.03±2.88 | 87.91±2.73 | - |
| | Prefix with both | 63.89±2.93 | 72.15±2.07 | 89.58±2.81 | +1.78 |
| | Vanilla IC | 60.81±1.14 | 69.11±2.46 | 89.66±2.37 | - |
| | IC with both | 63.16±1.56 | 70.57±1.95 | 91.19±2.00 | +1.99 |
Table 2. Performance comparison of LLMs on multiple-choice task accuracy under MFDG settings.
In our global response PDF, we illustrate the Concentration Strength Distribution of prompts in the In-Context Demo format for three 7B-sized language models (Llama, Vicuna, Alpaca) across three different tasks (SA, NLI, MCQA). A common observation is that concentration strength is stronger in deep layers than in shallow layers. More specifically, compared to smaller models (RoBERTa-large), concentration phenomena occur earlier in larger models. Inductively, we conclude that this phenomenon of higher concentration in deep layers occurs independently of model size, task, or prompt.
Q3: In-context Citation Errors
We would like to thank the reviewer for reading our work carefully. We sincerely apologize for the in-context citation errors in our paper. We will check and revise these errors in the next version.
[1] Wald, Yoav, et al. "On calibration and out-of-domain generalization." Advances in neural information processing systems 34 (2021): 2215-2227.
[2] Naeini, Mahdi Pakdaman, Gregory Cooper, and Milos Hauskrecht. "Obtaining well calibrated probabilities using bayesian binning." Proceedings of the AAAI conference on artificial intelligence. Vol. 29. No. 1. 2015.
[3] Wang, Yuqing, and Yun Zhao. "Tram: Benchmarking temporal reasoning for large language models." arXiv preprint arXiv:2310.00835 (2023).
We sincerely appreciate your valuable feedback and insightful discussion! We hope our response has been helpful to you. As the discussion period is drawing to a close, we warmly welcome any further questions from the reviewer. We would be delighted to provide additional clarification!
Thank you for your response. The newly added experiments on motivation, LLMs, and new tasks help strengthen the paper. I have improved my rating accordingly.
We sincerely thank the reviewer for the constructive discussions and positive feedback. We will optimize our work in detail based on these suggestions and incorporate the experiments you mentioned into our main paper.
This paper focuses on improving the domain generalization of prompt tuning methods on LLMs. Specifically, this work claims that the concentration strength and concentration fluctuation of a candidate soft or hard prompt may indicate its generalization ability on new domains. By demonstrating the performance of various prompts with their concentration strength and fluctuation, the authors show that higher concentration strength and lower fluctuation may bring better prompt domain generalization. As a result, this paper proposes new objectives in soft and hard prompt tuning based on these observations. The experimental results show that the obtained prompts achieve better performance on new domains in NLI and sentiment classification tasks.
Strengths
- The paper is well organized, with very clear problem and methodology formulation, making the proposed approach easy to follow.
- The observations on prompt attention concentration and its potential correlation with domain generalization are interesting and may encourage further research on this topic.
- The proposed objectives for soft and hard prompting methods are generalizable and are compatible with most recent prompting algorithms.
Weaknesses
The main weakness of this work is the limited range of task types in the experiments: only sentiment classification and NLI are considered. It would be much better if more evidence or results on broader task types were provided.
Questions
Have you ever evaluated the effect of prompt attention concentration on any (recent) generative LLMs?
Limitations
The authors have adequately addressed the limitations of this work.
We sincerely thank the reviewer for the constructive comments and feedback. We appreciate that our work is considered “well organized”, “interesting”, and “generalizable”. We hope our response can address your concerns.
Q1: Extension to other tasks
We would like to thank the reviewer for drawing our attention to extending our method to broader applications. We are glad to share that our method can be successfully applied to mainstream Large Language Models and to question-answering tasks. We validate its effectiveness on the Llama-2-7b-chat, Vicuna-7b-v1.5, and Alpaca-7b-wdiff models for improving the domain generalization ability of prompts on question answering. We evaluate our method on the ROC, SCT, and COPA datasets from the TRAM Benchmark [1] (referred to as R, S, and C for simplicity), covering multiple-choice questions in reading comprehension and commonsense reasoning. The results are shown in Table 1.
Experimental results show that our method significantly improves the performance of large models on question-answering tasks across multiple domain generalization settings. For instance, on the Llama-7b model, our method improves the average accuracy of soft prompt generalization and hard prompt generalization by 1.91% and 2.36%, respectively; similar improvements are observed for the Vicuna-7b and Alpaca-7b models, ranging from 1.55% to 2.05% and from 1.78% to 1.99%, respectively.
| Model | Method | S+C→R | C+R→S | R+S→C | Avg Gap |
|---|---|---|---|---|---|
| Llama-2-7b-chat | Vanilla Prefix | 62.32±2.15 | 66.30±2.30 | 73.15±2.53 | - |
| | Prefix with both | 63.70±1.96 | 68.47±0.97 | 75.32±1.09 | +1.91 |
| | Vanilla IC | 63.13±1.25 | 65.50±1.98 | 77.59±1.14 | - |
| | IC with both | 65.13±1.03 | 68.33±2.13 | 79.83±0.88 | +2.36 |
| vicuna-7b-v1.5 | Vanilla Prefix | 67.72±1.79 | 81.09±2.17 | 88.97±2.64 | - |
| | Prefix with both | 68.75±1.04 | 83.93±1.79 | 89.76±2.60 | +1.55 |
| | Vanilla IC | 68.37±2.24 | 83.23±4.12 | 90.98±1.99 | - |
| | IC with both | 69.67±1.58 | 85.50±5.06 | 93.39±1.23 | +2.05 |
| alpaca-7b-wdiff | Vanilla Prefix | 61.52±3.79 | 70.03±2.88 | 87.91±2.73 | - |
| | Prefix with both | 63.89±2.93 | 72.15±2.07 | 89.58±2.81 | +1.78 |
| | Vanilla IC | 60.81±1.14 | 69.11±2.46 | 89.66±2.37 | - |
| | IC with both | 63.16±1.56 | 70.57±1.95 | 91.19±2.00 | +1.99 |
Table 1. Performance comparison of LLMs on multiple-choice tasks under MFDG settings. The last column shows the average gap between test performance on vanilla methods and our methods. Results are averages from 3 runs using different random seeds.
Additionally, we would like to discuss why our method works well for large generative language models. In our global response PDF, we present the Concentration Strength Distribution of prompts using In-Context Demos across three 7B-sized language models (Llama, Vicuna, Alpaca) on three different tasks (SA, NLI, QA). We observe that all three LLMs exhibit stronger concentration strength in deeper layers than in shallower layers across prompts for different tasks. Additionally, we find that this phenomenon occurs earlier in larger models (7B) than in smaller models like RoBERTa-large. We speculate that this behavior is related to the alignment stage of large models, where Supervised Fine-Tuning exposes them to a large number of prompts.
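As a rough indication of how such a per-layer concentration curve can be obtained, the following sketch extracts the attention mass that the final decoding token places on the prompt tokens at each layer, using HuggingFace transformers. The checkpoint id, prompt text, and head-averaging choice are our assumptions, not the exact measurement code:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-chat-hf"  # assumed checkpoint id
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, output_attentions=True, attn_implementation="eager")
model.eval()

demo = "Review: the plot was gripping. Sentiment: positive.\n"  # in-context demo
query = "Review: a dull, lifeless film. Sentiment:"
inputs = tok(demo + query, return_tensors="pt")
n_prompt = len(tok(demo)["input_ids"])  # positions belonging to the demo prompt

with torch.no_grad():
    out = model(**inputs)

# out.attentions: one (batch, heads, seq, seq) tensor per layer
for layer, att in enumerate(out.attentions):
    # attention from the last (decoding) token, averaged over heads,
    # summed over the prompt positions
    strength = att[0, :, -1, :n_prompt].mean(dim=0).sum().item()
    print(f"layer {layer:2d}: concentration strength = {strength:.3f}")
```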
[1] Wang, Yuqing, and Yun Zhao. "Tram: Benchmarking temporal reasoning for large language models." arXiv preprint arXiv:2310.00835 (2023).
We sincerely appreciate your valuable feedback and insightful discussion! We hope our response has been helpful to you. As the discussion period is drawing to a close, we warmly welcome any further questions from the reviewer. We would be delighted to provide additional clarification!
This paper studies the problem of prompt optimization for domain generalization. Through a pilot experiment, they find that the domain generalization capability is tied to the attention concentration in later layers of the network. Based on this finding, the authors design a set of regularizers to improve both soft and hard prompt optimization procedures. Empirically, the method is tested on sentiment and NLI tasks. Results demonstrate a reduced generalization gap and improved performance with the added losses over several prompt optimization methods.
Strengths
- The exploration of prompt optimization is a highly important avenue for research, as prompt optimization remains an efficient way to fine-tune language models. At the same time, it is important to consider the robustness of such approaches and how to improve it.
- The paper is overall well-written and flows nicely. It highlights the core properties of the findings and main results in the introduction, constructs an experiment to validate them in Section 3, subsequently introduces the regularizer, and demonstrates results across two tasks.
- The proposed approach is simple to implement and can be applied irrespective of the architecture, as it is only an adjustment of the loss function, thus having general applicability.
- The authors have conducted thorough ablation studies of the proposed approach in the Appendices, particularly around stability, visualizations, and initializations.
Weaknesses
- My biggest concerns with this paper are around the experimental results. The performance improvements in Tables 1 and 2 are small, covering only 1-2% improvements over baselines. The paper also does not make comparisons to existing approaches for domain adaptation of prompts, such as https://arxiv.org/pdf/2207.07087, https://arxiv.org/pdf/2210.02952, and https://arxiv.org/pdf/2305.13954.
- Further, the paper only explores limited settings, including a single architecture, task, and setting. Exploring other modalities (such as vision transformers), additional tasks and settings, or other architectures would help establish that the proposed approach is more general and extends beyond the two tasks in this work.
- Some minor typos, such as "hypnosis" at line 284; also, some hyperparameters needed for replication seem to be missing from the main text, including the value of lambda for the regularizers and how it is selected.
Questions
Results for the proposed approach primarily suggest that the phenomenon happens in later parts of the network. In practice, it is not clear without running the experiment which layers this will be most prevalent in, and whether it is task/prompt dependent. For larger networks, would this happen earlier; for shallow networks, would this not happen at all? Results here may be of interest in relation to the calibration of shallow models and the lack thereof for deeper models.
Limitations
- The proposed approach is only evaluated in fine-tuning settings. This limits applicability to many settings where LLMs are evaluated, such as in-context learning or zero-shot.
- The proposed approach is evaluated on the Roberta model, whereas there are a number of language models that could be investigated, including T5 encoder-decoder models and GPT decoder-only models. Do we expect different behaviors based on the architectures and attention?
- The proposed approach is limited only to NLP applications; however, it could be applied to other variants such as vision transformers, with robustness considerations. Would we expect similar trends in other applications?
Q3: Spelling errors and missing hyperparameters:
We sincerely appreciate your careful reading.
- Spelling errors: We apologize for any ambiguities and typographical errors in our paper. We will incorporate your suggestions in the revision to ensure these errors do not recur.
- Missing hyperparameters: We provide the specific values of lambda in Appendix B.4 (Table 5 in our paper). We use a validation set of the same size as the training set to select lambda for the different regularizers. We apologize for the omission and will emphasize this information in the next version.
Q4: Concentration distribution on larger models:
We thank the reviewer for the valuable comments. In Appendix C.2, we present the Concentration Strength Distribution of different prompts in the Roberta-large model (355M). In our global response PDF, we illustrate the Concentration Strength Distribution of prompts in the In-Context Demo format for three 7B-sized language models (Llama, Vicuna, Alpaca) across three different tasks (SA, NLI, QA). A common observation is that concentration strength is stronger in deep layers than in shallow layers. More specifically, compared to smaller models (Roberta-large), concentration phenomena occur earlier in larger models and remain high in deep layers. Inductively, we conclude that this phenomenon of higher concentration in deep layers occurs independently of model size, task, or prompt.
Q5: Evaluation other than fine-tuning
We would like to thank the reviewer for the valuable comment. We would like to clarify that in-context learning is one of our baselines for discrete prompt optimization. In our main experiments (Table 1 in our paper), the In-Context Demo method provides examples as the prompt, which is a classic form of existing in-context learning methods (line 497).
We sincerely thank the reviewer for the constructive comments and feedback. We appreciate that our work is considered “important”, “well-written”, and “thorough”. We hope our response can address your concerns.
Q1: Improvement and additional baselines
Improvement: We will try to address the reviewer's concerns about performance improvement in the following points.
- The objective of this work is to enhance domain generalization in current prompt optimization methods, not to design a new framework. Our method acts as a plug-and-play module compatible with mainstream prompt optimization methods (see the sketch after this list). As shown in Table 1, it improves performance for 3 soft and 3 hard prompt optimization methods.
- Our method modifies the training objective or filter-match strategy within the original framework, so performance is limited by the original model structure or initial prompt set. For example, the quality of the candidate prompt set affects the performance of hard prompt optimization (as shown in Table 1 in our paper). Nonetheless, our method consistently enhances domain generalization with prompts from the same candidate set, demonstrating its effectiveness.
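As a rough illustration of this plug-and-play design, here is a minimal sketch of how a concentration-style regularizer could be attached to an existing prompt optimization loss. The function name, the `attn_to_prompt` tensor layout, and the lambda values are our own illustrative assumptions, not the paper's exact objective:

```python
import torch

def regularized_loss(task_loss: torch.Tensor,
                     attn_to_prompt: torch.Tensor,
                     lam_strength: float = 0.1,
                     lam_fluct: float = 0.1) -> torch.Tensor:
    """Hypothetical sketch: add concentration terms to any task loss.

    attn_to_prompt -- (batch, layers) attention mass that each input's
                      decoding token places on the prompt tokens.
    """
    strength = attn_to_prompt.mean()                 # reward high concentration
    fluctuation = attn_to_prompt.var(dim=0).mean()   # penalize instability across inputs
    return task_loss - lam_strength * strength + lam_fluct * fluctuation
```

Because the extra terms only touch the loss, a sketch like this can in principle be dropped into Prompt Tuning, Prefix Tuning, or P-Tuning v2 without changing the underlying architecture.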
Additional Baselines: Following the reviewer's suggestion, we compare our method with the suggested domain adaptation methods, OPTIMA [1] and GPO [2], and with two PEFT methods similar to [3], IA3 and LoRA. The results are shown in Table 1. We find that DP2O with both, the best performer among prompt optimization methods augmented with our optimization strategy, achieves accuracy improvements of 3.88%, 3.54%, 4.46%, 1.77%, 0.55%, and 3.62% over the second-best method across the six experimental settings.
Although our method still performs best, we would like to highlight that it improves model performance in the domain generalization setting, where no information from the target domain is available during training, whereas target-domain data is accessible to domain adaptation methods like OPTIMA and GPO. The Introduction of our paper (lines 35-39) discusses the shortcomings of domain adaptation methods, the Preliminary section (lines 87-95) introduces the basic definition of the domain generalization problem, and the Related Work section (lines 472-480) contrasts the two types of problems.
| Method | S+M→C | C+M→S | S+C→M | Q+R→W | W+R→Q | Q+W→R |
|---|---|---|---|---|---|---|
| IA3 | 75.64±1.77 | 72.94±2.15 | 65.33±1.52 | 41.32±1.09 | 52.40±1.79 | 51.93±1.65 |
| LoRA | 79.52±2.32 | 80.42±1.89 | 71.76±2.00 | 47.26±2.68 | 53.16±1.73 | 52.19±2.09 |
| GPO | 82.57±1.73 | *89.32±2.09* | *83.39±0.88* | *54.65±0.71* | *54.77±2.20* | 55.90±2.65 |
| OPTIMA | *85.75±2.77* | 85.01±4.4 | 80.61±4.46 | 53.33±3.26 | 54.19±2.70 | *57.65±5.17* |
| DP2O with both | **89.63±0.52** | **92.87±0.33** | **87.85±0.47** | **56.42±0.36** | **55.32±0.33** | **61.27±0.81** |
Table 1. Performance comparison on text classification task accuracy under MFDG settings (first three columns: Sentiment; last three: NLI). Bold indicates the best result for each column, and italics indicate the second-best. Results are averages from 3 runs using different random seeds.
Q2: Extension to other architectures, tasks and modalities
We are glad to share that our method can be successfully applied to different architectures, such as large decoder-only models (Llama-2-7b-chat, Vicuna-7b-v1.5, and Alpaca-7b-wdiff), and to additional tasks such as question answering. Specifically, we evaluate our method on the ROC, SCT, and COPA datasets from the TRAM Benchmark [4] (referred to as R, S, and C for simplicity), covering multiple-choice questions in reading comprehension and commonsense reasoning. The results are shown in Table 2.
Experimental results show that our method significantly improves the performance of large decoder-only models on question-answering tasks across multiple domain generalization settings. For instance, on the Llama-7b model, our method improves the average accuracy of soft prompt generalization and hard prompt generalization by 1.91% and 2.36%, respectively. Similar improvements are observed for the Vicuna-7b and Alpaca-7b models, ranging from 1.55% to 2.05% and from 1.78% to 1.99%, respectively.
We would like to clarify that our method is designed for language models, as stated in the title of this paper. Incorporating other modalities is beyond the scope of this research. We believe exploring the possibility of generalizing our method to tasks in other modalities is interesting, and we will pursue it in future work.
| Model | Method | S+C→R | C+R→S | R+S→C |
|---|---|---|---|---|
| Llama-2-7b-chat | Vanilla Prefix | 62.32±2.15 | 66.30±2.30 | 73.15±2.53 |
| | Prefix with both | 63.70±1.96 | 68.47±0.97 | 75.32±1.09 |
| | Vanilla IC | 63.13±1.25 | 65.50±1.98 | 77.59±1.14 |
| | IC with both | 65.13±1.03 | 68.33±2.13 | 79.83±0.88 |
| vicuna-7b-v1.5 | Vanilla Prefix | 67.72±1.79 | 81.09±2.17 | 88.97±2.64 |
| | Prefix with both | 68.75±1.04 | 83.93±1.79 | 89.76±2.60 |
| | Vanilla IC | 68.37±2.24 | 83.23±4.12 | 90.98±1.99 |
| | IC with both | 69.67±1.58 | 85.50±5.06 | 93.39±1.23 |
| alpaca-7b-wdiff | Vanilla Prefix | 61.52±3.79 | 70.03±2.88 | 87.91±2.73 |
| | Prefix with both | 63.89±2.93 | 72.15±2.07 | 89.58±2.81 |
| | Vanilla IC | 60.81±1.14 | 69.11±2.46 | 89.66±2.37 |
| | IC with both | 63.16±1.56 | 70.57±1.95 | 91.19±2.00 |
Table 2. Performance comparison of decoder-only LLMs on multiple-choice task accuracy under MFDG settings.
[1] Guo et al. "Improving the sample efficiency of prompt tuning with domain adaptation."
[2] Li et al. "Robust prompt optimization for large language models against distribution shifts."
[3] Tam et al. "Parameter-efficient prompt tuning makes generalized and calibrated neural text retrievers."
[4] Wang et al. "Tram: Benchmarking temporal reasoning for large language models."
We sincerely appreciate your valuable feedback and insightful discussion! We hope our response has been helpful to you. As the discussion period is drawing to a close, we warmly welcome any further questions from the reviewer. We would be delighted to provide additional clarification!
Thank you for the comments and addressing many of my concerns. My primary concerns with the paper were (1) generalizability across models and tasks, and (2) comparisons with existing benchmarks. My concerns with (1) have been addressed with the inclusion of new QA tasks and larger + more recent 7B decoder-only experiments. I believe these would be important to have in the main paper. Regarding (2) I thank the authors for including these experiments, and understand that the proposed approach follows a different line of work from the referenced works. I will increase my score to reflect the additional experiments that have improved the paper.
We sincerely thank the reviewer for the constructive discussions and positive feedback. We will optimize our work in detail based on these suggestions and incorporate the experiments you mentioned into the main paper.
This paper investigates the domain generalization ability of prompts for pretrained language models (PLMs). The paper finds that prompts that receive higher attention weights from deeper PLM layers and those with stable attention distributions generalize better across domains. The authors introduce a novel objective called "Concentration" which implements a "lookback" attention from the current decoding token to prompt tokens, aiming to enhance both soft and hard prompt optimization methods. Their experiments demonstrate significant improvements in multi-source domain generalization accuracy—1.42% for soft prompts and 2.16% for hard prompts—while maintaining robust in-domain performance. These findings offer valuable insights into creating domain-generalizable prompts.
Strengths
- The paper is well-written.
- The authors started with an initial analysis to inform a novel training objective, which is both insightful and methodologically sound.
- The proposed method is simple and the authors demonstrate its effectiveness across several classification tasks.
Weaknesses
I felt there were several obvious questions left unexplored, noted below, which raise concerns regarding the significance of the paper's contributions.
- The authors only experimented with a rather small model, i.e., the 355M RoBERTa, which raises concerns about whether the proposed method works with larger model sizes.
- The authors focused solely on classification tasks (sentiment classification and natural language inference). This raises concerns about the proposed approach's applicability and effectiveness for other tasks, like open-ended generation.
- Finally, the improvements were shown over rather weak baselines. For example, prompt tuning, particularly with small models and limited training data, is a rather weak approach. I also felt that the authors compared their method against a weak implementation of this baseline, using only 5 soft prompt tokens and a learning rate of 1e-5. For reference, the original prompt tuning paper used 100 prompt tokens and a learning rate of 0.3, which they found to be critical for prompt tuning's strong performance and faster convergence. These differences raise concerns about the significance of the proposed method's improvements.
Questions
Please see the Weaknesses above.
Limitations
The authors discussed several limitations of their approach, including the limited variety of prompts, the focus on a few-shot setting, the restriction of discrete prompt optimization to the input level, and the inapplicability of their methods to generation tasks.
We sincerely thank the reviewer for the constructive comments and feedback. We appreciate that our work is considered “well-written”, “novel”, and “insightful”. We hope our response can address your concerns.
Q1: Applicability to larger models and other tasks:
We would like to thank the reviewer for drawing our attention to extending our method to broader applications. We are glad to share that our method can be successfully applied to mainstream Large Language Models and to question-answering tasks. We validate its effectiveness on the Llama-2-7b-chat, Vicuna-7b-v1.5, and Alpaca-7b-wdiff models for improving the domain generalization ability of prompts on question answering. We evaluate our method on the ROC, SCT, and COPA datasets from the TRAM Benchmark [1] (referred to as R, S, and C for simplicity), covering multiple-choice questions in reading comprehension and commonsense reasoning. The results are shown in Table 1.
Experimental results show that our method significantly improves the performance of large models on question-answering tasks across multiple domain generalization settings. For instance, on the Llama-7b model, our method improves the average accuracy of soft prompt generalization and hard prompt generalization by 1.91% and 2.36%, respectively; similar improvements are observed for the Vicuna-7b and Alpaca-7b models, ranging from 1.55% to 2.05% and from 1.78% to 1.99%, respectively.
Our research primarily addresses the analysis and optimization of attention patterns from current decoding tokens to prompt tokens (lines 121-122). Thus, long sequence generation (e.g., open-ended generation) is beyond the scope of this research, as we acknowledge in Section 7 (lines 323-325). We will explore the possibility of improving domain generalizability on more generative tasks in the future.
| Model | Method | S+C→R | C+R→S | R+S→C | Avg Gap |
|---|---|---|---|---|---|
| Llama-2-7b-chat | Vanilla Prefix | 62.32±2.15 | 66.30±2.30 | 73.15±2.53 | - |
| | Prefix with both | 63.70±1.96 | 68.47±0.97 | 75.32±1.09 | +1.91 |
| | Vanilla IC | 63.13±1.25 | 65.50±1.98 | 77.59±1.14 | - |
| | IC with both | 65.13±1.03 | 68.33±2.13 | 79.83±0.88 | +2.36 |
| vicuna-7b-v1.5 | Vanilla Prefix | 67.72±1.79 | 81.09±2.17 | 88.97±2.64 | - |
| | Prefix with both | 68.75±1.04 | 83.93±1.79 | 89.76±2.60 | +1.55 |
| | Vanilla IC | 68.37±2.24 | 83.23±4.12 | 90.98±1.99 | - |
| | IC with both | 69.67±1.58 | 85.50±5.06 | 93.39±1.23 | +2.05 |
| alpaca-7b-wdiff | Vanilla Prefix | 61.52±3.79 | 70.03±2.88 | 87.91±2.73 | - |
| | Prefix with both | 63.89±2.93 | 72.15±2.07 | 89.58±2.81 | +1.78 |
| | Vanilla IC | 60.81±1.14 | 69.11±2.46 | 89.66±2.37 | - |
| | IC with both | 63.16±1.56 | 70.57±1.95 | 91.19±2.00 | +1.99 |
Table 1. Performance comparison of LLMs on multiple-choice tasks under MFDG settings. The last column shows the average gap between test performance on vanilla methods and our methods. Results are averages from 3 runs using different random seeds.
Q2: Baselines and Hyperparameter selection
We thank the reviewer for the valuable comment. For starters, we would like to clarify that the objective of this work is not to design a brand-new framework but to improve the domain generalization ability of current prompt optimization methods, as stated in lines 73-75:
With the principle of concentration §3, we propose two algorithms that could piggyback upon popular prompt optimization methods for both hard and soft prompts to improve the domain generalization ability of prompts.
Thus, prompt tuning, as one of the most popular prompt optimization methods, serves as a good baseline to demonstrate that our proposed objective improves domain generalization within the prompt tuning framework itself. Beyond prompt tuning, we also apply our proposed objective to several stronger baselines (Prefix Tuning and P-Tuning v2 for soft prompts; In-Context Demo and DP2O for hard prompts), and it consistently improves the domain generalization ability of prompts across all experimental settings (Table 1 in our paper).
Additionally, we would like to address the reviewer's concern about hyperparameter selection. The suggested setting from [2] (100 prompt tokens and a learning rate of 0.3) is based on full training data for the T5 model. In resource-limited scenarios, such settings may lead to severe overfitting.
We also conduct experiments with the suggested hyperparameters using the T5-base model. The results are in Table 2. We find that more prompt tokens and a larger learning rate actually degrade the performance of prompt tuning in the few-shot setting. We would like to clarify that our hyperparameter setting is similar to [3], where a learning rate of 1e-5 is used to train an XLM-RoBERTa-base model in the few-shot setting with 4 soft prompt tokens.
| Method | S+M→C | C+M→S | S+C→M | Q+R→W | W+R→Q | Q+W→R |
|---|---|---|---|---|---|---|
| Vanilla PT | 55.70±0.98 | 52.37±2.05 | 52.67±1.55 | 41.57±0.93 | 51.83±0.79 | 51.70±2.25 |
| PT with both | 57.17±0.99 | 54.80±1.77 | 53.35±1.73 | 43.72±1.33 | 53.46±1.53 | 53.79±1.47 |
Table 2. Performance comparison of soft prompt learning on classification task accuracy for the T5-base model (first three columns: Sentiment; last three: NLI). Results are averages from 3 runs using different random seeds.
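For reference, a minimal sketch of the few-shot soft-prompt configuration discussed above (a handful of trainable prompt embeddings with a small learning rate); the embedding width and initialization scale are illustrative assumptions, not the exact training code:

```python
import torch

n_prompt_tokens, hidden_size = 5, 1024                  # e.g., RoBERTa-large width
soft_prompt = torch.nn.Parameter(
    torch.randn(n_prompt_tokens, hidden_size) * 0.02)   # assumed init scale
optimizer = torch.optim.AdamW([soft_prompt], lr=1e-5)   # vs. lr=0.3 in [2]
# At each step, soft_prompt is prepended to the input embeddings before the
# frozen PLM forward pass; only soft_prompt receives gradient updates.
```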
[1] Wang, Yuqing, and Yun Zhao. "Tram: Benchmarking temporal reasoning for large language models." arXiv preprint arXiv:2310.00835 (2023).
[2] Lester, Brian, Rami Al-Rfou, and Noah Constant. "The power of scale for parameter-efficient prompt tuning." arXiv preprint arXiv:2104.08691 (2021).
[3] Zhao, Mengjie, and Hinrich Schütze. "Discrete and soft prompting for multilingual models." arXiv preprint arXiv:2109.03630 (2021).
We sincerely appreciate your valuable feedback and insightful discussion! We hope our response has been helpful to you. As the discussion period is drawing to a close, we warmly welcome any further questions from the reviewer. We would be delighted to provide additional clarification!
Global Response to All Reviewers
We illustrate the Concentration Strength Distribution of prompts in the In-Context Demo format for three 7B-sized language models (Llama, Vicuna, Alpaca) across three different tasks (SA, NLI, QA). A common observation is that concentration strength is stronger in deeper layers than in shallower layers. More specifically, compared to smaller models (RoBERTa-large), concentration phenomena occur earlier in larger models and persist through the deep layers.
This paper offers a detailed study of the domain generalization of learned prompts for PLMs. The paper received a consensus of ratings after the discussion period. All reviewers found that their concerns were sufficiently addressed during the discussion and recommend acceptance. The AC shares the majority opinion. It is recommended that the authors integrate the additional experiments and discussions into the final paper, e.g., results on more tasks and architectures, and comparisons with other domain adaptation methods.