Learning from Contrastive Prompts: Automated Optimization and Adaptation
A new method to optimize and adapt prompts using contrastive learning.
Abstract
Reviews and Discussion
This paper proposes Learning from Contrastive Prompts (LCP), a framework for automatic prompt optimization in LLMs. LCP improves prior prompt optimization methods by taking inspiration from contrastive learning and incorporating both good and bad prompts in the meta-prompt to generate new prompts. Moreover, the paper considers the task of prompt adaptation to accommodate different LLMs. The proposed method is evaluated on BigBench-Hard and shows an advantage over prior methods.
Strengths
- The main idea of learning from contrastive prompts is simple and natural while subsuming prompt learning strategies in prior work.
- Empirical results are generally good, with sufficient ablations.
- The considered prompt adaptation problem is interesting and may inspire further investigations.
Weaknesses
- While the authors have provided discussions in the main text, I still feel the results on prompt adaptation, especially model version/family adaptation, are not very satisfactory: merely balancing the strengths of the source and target models is often not ideal when it is possible to achieve the best of both worlds. I am also curious about whether this result is general or an artifact of using the same model for generation and evaluation (see Questions for details).
- The main technical contributions of the paper are of an engineering nature and might not stand the test of time. I think it would be better if the authors could also explore whether their main idea can be applied in a broader context apart from prompt optimization, such as in more general optimization tasks like those in the OPRO paper. Also, while the paper compares with prior methods in automatic prompt generation, it does not compare with baselines in relevant research areas such as prefix/prompt tuning (e.g., [1] and its follow-ups), which shares a similar objective to prompt generation.
[1] Prefix-tuning: Optimizing continuous prompts for generation, 2021.
Questions
- What if we use different models for prompt generation and evaluation in the prompt adaptation setting (like in the ablative experiments)? For example, if the source model is stronger, can we use it to generate prompts for the weaker target model? Will this change the "balancing effect" observed in the prompt adaptation experiments?
- What is the main advantage of using LLMs to automatically generate prompts compared to prompt tuning by gradient descent (e.g., in terms of generalization/adaptation ability or applicability)?
Details of Ethics Concerns
None
This paper proposes Learning from Contrastive Prompts (LCP), an automatic prompt optimization technique. The key designs of LCP include 1) consistency and diversity injection (using repeated generation with a high temperature to produce summaries of wrong predictions) and 2) history integration, where the prompts from the previous iteration are included in the meta-prompt. Another key difference is that instead of relying only on the top-K prompts from the trajectory as in OPRO, LCP additionally takes advantage of the bottom-K prompts -- the authors define a new meta-prompt to allow learning from the worst-performing prompts too. The authors also tested LCP on less conventional setups such as transferability across models and languages. LCP is then experimentally validated on various BBH tasks and XCOPA to show effectiveness over baselines like OPRO and AutoHint.
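For concreteness (and since I ask for a pseudocode box below), here is my own rough sketch of one LCP iteration as I understand it from the paper; every name is hypothetical, and this is a sketch rather than the authors' implementation:

```python
# Rough sketch of one LCP iteration as I read the paper; all names are my own
# (hypothetical), not the authors' code. `generate` is the optimizer LLM and
# `score` evaluates a prompt on the labeled training split.
from typing import Callable, List, Tuple

def lcp_step(
    generate: Callable[[str, float], str],   # (meta_prompt, temperature) -> new candidate prompt
    score: Callable[[str], float],           # prompt -> accuracy on the training split
    history: List[Tuple[str, float]],        # (prompt, score) pairs from earlier iterations
    k: int = 3,
    n_candidates: int = 8,
    temperature: float = 1.0,
) -> List[Tuple[str, float]]:
    ranked = sorted(history, key=lambda ps: ps[1], reverse=True)
    good, bad = ranked[:k], ranked[-k:]      # top-K and bottom-K prompts
    meta_prompt = (
        "These prompts scored well:\n"
        + "\n".join(f"- {p} (score {s:.2f})" for p, s in good)
        + "\n\nThese prompts scored poorly:\n"
        + "\n".join(f"- {p} (score {s:.2f})" for p, s in bad)
        + "\n\nWrite a new prompt that keeps what the good prompts do right "
          "and avoids what the bad prompts do wrong."
    )
    # Repeated high-temperature sampling corresponds to the "consistency and diversity injection" step.
    candidates = [generate(meta_prompt, temperature) for _ in range(n_candidates)]
    # History integration: new candidates are appended, so later iterations see past prompts and scores.
    return history + [(c, score(c)) for c in candidates]
```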
Strengths
- The paper addresses an important problem of automatic prompt optimization, which could potentially alleviate much of the human cost of engineering prompts.
- The method is reasonable and straightforward to implement (e.g., it only requires different prompt designs and thus can be applied to any LLM with ease), and draws inspiration from well-known techniques such as contrastive learning.
- The cross-model and cross-lingual setups are important and, to the best of my knowledge, under-studied -- this work is amongst the first in automatic prompt optimization to consider these setups. The adaptation strategies, like focusing on the samples where the models disagree with each other, can also be useful for future work.
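As a toy illustration of that adaptation strategy (entirely my own sketch with hypothetical names, not the authors' code), filtering the training samples down to the ones where the two models disagree could look like:

```python
# Toy illustration of the adaptation idea mentioned above: keep only the
# training samples where the source and target models disagree, so that the
# error summaries focus on behaviour that does not transfer between models.
from typing import Callable, List, Tuple

def disagreement_samples(
    samples: List[Tuple[str, str]],          # (input, gold answer) pairs
    answer_source: Callable[[str], str],     # source model under the current prompt
    answer_target: Callable[[str], str],     # target model under the current prompt
) -> List[Tuple[str, str]]:
    return [(x, y) for x, y in samples if answer_source(x) != answer_target(x)]
```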
Weaknesses
- Presentation: I think the paper could be significantly improved with a better presentation, especially the methodology section. The way the paper is currently written essentially assumes prior knowledge of works like AutoHint and OPRO. For example, the proposed LCP generates prompts in a similar way to AutoHint, by summarizing and learning from the error cases, and LCP adopts an iterative optimization framework similar to OPRO, which treats an optimizer LLM as a black box that generates new, better prompts conditioned on past prompts and performances. However, if the reader is not sufficiently knowledgeable about these two previous works, it could be difficult to understand the overall flow of LCP -- the only current aid to high-level understanding is Fig 1, which in my opinion is not very intuitive. I believe even if the paper develops upon prior works, it should still be self-contained; some kind of pseudocode algorithm box and a better presentation of the overall structure of LCP, even if it is largely the same as previous works, could be helpful for this purpose.
- Related to the above, I do think that the novelty of the proposed approach is rather questionable: the "consistency and diversity injection" is essentially repeated generation with a high temperature -- this is essentially self-consistency [1]. I think OPRO also uses a high temperature with repeated sampling to generate multiple candidates per iteration (Sec 5.1 in the OPRO paper), albeit with a different meta-prompt. The "history integration" part is again present in OPRO (in fact, it is the key difference between OPRO and earlier works), so again I'm uncertain about its novelty except that it is now applied to the AutoHint prompt. It seems to me that only the "contrastive meta-prompt" part is significantly different, but again, I'd be curious to understand the benefit of the proposed implicit learning (i.e., giving the LLM the bottom-K prompts and asking it to figure out how to improve) vs. a more explicit approach (directly presenting the LLM with the worse-performing prompts and asking it to critique them) like ProTeGi [2]; a rough paraphrase of the two styles is sketched after this list.
- Baselines: another major concern of mine is that the authors exclusively compared against AutoHint and OPRO. However, to my knowledge these are not the strongest baselines out there. Some works [7] even show that OPRO can be worse than APE, one of the first works utilizing LLMs for automatic prompt engineering -- I'd be curious to know the performance of the proposed method against stronger baselines like ProTeGi [2] (and, related to it, TextGrad [3]), which, as I mentioned above, explicitly ask the LLM to reflect upon the errors and to improve the prompt generation. Another line of work is demonstration optimization like DSPy [4] -- according to recent works like [5, 6], on BBH tasks optimizing demonstrations would be much more effective than optimizing instructions only, which is still what LCP does at the moment, and I'd be curious to see some qualitative and quantitative discussions (note that both [4] and [5] only optimize demonstrations bootstrapped from the model and do not require anything more than what LCP currently requires; these approaches merely reuse and select from the labeled training set that LCP already requires for few-shot prompting).
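To be concrete about the implicit vs. explicit distinction, the two styles could look roughly like the following (my own paraphrase, not the actual meta-prompts from either paper):

```python
# My own rough paraphrase (not the actual prompts from either paper) of the
# two meta-prompt styles contrasted above.
IMPLICIT_CONTRASTIVE = (
    "Good prompts:\n{top_k_prompts}\n\n"
    "Bad prompts:\n{bottom_k_prompts}\n\n"
    "Write a new prompt that improves on the good prompts and avoids the "
    "failure modes of the bad prompts."          # the LLM must infer what went wrong
)

EXPLICIT_CRITIQUE = (
    "Prompt:\n{bad_prompt}\n\n"
    "Examples it got wrong:\n{error_cases}\n\n"
    "First explain why this prompt fails on these examples, then write a "
    "revised prompt that fixes those failures."  # the LLM is asked to critique explicitly
)
```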
References
[1] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., ... & Zhou, D. (2022). Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171.
[2] Pryzant, R., Iter, D., Li, J., Lee, Y. T., Zhu, C., & Zeng, M. (2023). Automatic prompt optimization with "gradient descent" and beam search. EMNLP.
[3] Hou, B., Jia, J., Zhang, Y., Zhang, G., Zhang, Y., Liu, S., & Chang, S. (2022). Textgrad: Advancing robustness evaluation in nlp by gradient-driven optimization. arXiv preprint arXiv:2212.09254.
[4] Khattab, O., Singhvi, A., Maheshwari, P., Zhang, Z., Santhanam, K., Vardhamanan, S., ... & Potts, C. (2024). Dspy: Compiling declarative language model calls into self-improving pipelines. ICLR
[5] Wan, X., Sun, R., Nakhost, H., & Arik, S. O. (2024). Teach Better or Show Smarter? On Instructions and Exemplars in Automatic Prompt Optimization. NeurIPS.
[6] Opsahl-Ong, K., Ryan, M. J., Purtell, J., Broman, D., Potts, C., Zaharia, M., & Khattab, O. (2024). Optimizing Instructions and Demonstrations for Multi-Stage Language Model Programs. arXiv preprint arXiv:2406.11695.
[7] Ma, R., Wang, X., Zhou, X., Li, J., Du, N., Gui, T., ... & Huang, X. (2024). Are Large Language Models Good Prompt Optimizers?. arXiv preprint arXiv:2402.02101.
Questions
Please address my concerns in "Weaknesses" above.
This paper introduces Learning from Contrastive Prompts (LCP), a framework for automatically optimizing prompts of LLMs based on labeled samples, with a prompt optimizer that can be the same or a different LLM. Compared to existing automatic prompt engineering techniques, LCP provides the prompt optimizer with both high-performing and low-performing prompts instead of just negative or positive prompts, similar to contrastive learning. Concretely, LCP first samples a diverse set of candidate prompts from a seed prompt at higher temperatures. The candidates are evaluated on the samples and ranked, and the top- and bottom-scoring prompts are used as examples for the next round of generation. Prompts from past iterations are also carried over to new rounds. In addition, the paper considers a new setting called prompt adaptation, where the goal is to modify an existing prompt for different model versions, types, and languages. Experimental results on Big Bench and XCOPA show that LCP outperforms or performs comparably to existing methods.
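My reading of the adaptation setting, as a rough self-contained sketch (all names are my own and hypothetical, not the authors' code): the source prompt seeds the history, and every candidate is scored in the target setting (target model or target-language data).

```python
# Rough sketch of the prompt-adaptation setting as I read it: start from the
# prompt tuned for the source setting and re-run a top-K / bottom-K
# optimization loop, scoring every candidate on the target model or
# target-language data. Hypothetical names throughout.
from typing import Callable, List, Tuple

def adapt_prompt(
    source_prompt: str,
    propose: Callable[[List[Tuple[str, float]], List[Tuple[str, float]]], str],  # optimizer LLM given (good, bad) prompts
    score_on_target: Callable[[str], float],                                     # evaluates a prompt in the target setting
    n_iterations: int = 5,
    n_candidates: int = 8,
    k: int = 3,
) -> str:
    history: List[Tuple[str, float]] = [(source_prompt, score_on_target(source_prompt))]
    for _ in range(n_iterations):
        ranked = sorted(history, key=lambda ps: ps[1], reverse=True)
        good, bad = ranked[:k], ranked[-k:]
        candidates = [propose(good, bad) for _ in range(n_candidates)]
        history += [(c, score_on_target(c)) for c in candidates]
    return max(history, key=lambda ps: ps[1])[0]   # the "Best" prompt; "Last" would be history[-1][0]
```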
Strengths
LCP is a fairly straightforward and intuitive prompt optimization method. Using both good and bad prompts for prompt optimization is a quite natural idea, and the prompting involved is fairly minimal. Empirical verification and ablations are thorough and convincing. Overall, I think LCP could be a useful automatic prompt engineering technique. Admittedly, it is hard to keep up with the literature on prompt engineering nowadays, but I don't believe this exact variation of prompt optimization has been proposed before.
Weaknesses
- The paper introduces a new task, prompt adaptation, which aims to adapt existing prompts for one model to another model or to a different language. While this task is fairly reasonable, it is not clear to me from reading the paper whether LCP accomplishes the goal. It seems like the performance could vary widely between the Last and Best prompts and also between different transfer settings (Table 2). In Table 3, it seems like query translation is a better approach for cross-language applications. I think the paper could spell out the take-away from these experiments better and provide a heuristic guideline for when one should use LCP.
- Regarding the sentence "our LCP adaptation framework creates a balance between strengths of source and target models": I am not sure how I should interpret this sentence or how it relates to the empirical results. I think more explanation would be helpful.
- There is also no failure analysis about when LCP might fail or is not suitable.
- The paper could also benefit from having a pseudocode block that illustrates the process clearly.
Questions
- How are the win rates calculated exactly? How is the comparison done? It would be good to include this in the paper (let me know if I missed it).
- There is a prior work [1] that uses prompt optimization in a boosting framework, where the models are prompted with examples that they get wrong. I think this is related to presenting bad prompts, and perhaps discussing this connection could be useful.
Reference
[1] Manikandan, H., Jiang, Y., & Kolter, J. Z. (2023). Language models are weak learners. NeurIPS.
The authors propose an engineering technique for prompt tuning that leverages an LLM's reasoning and generation capabilities. To tune prompts using the LLM, they start by creating candidate prompts through a meta-prompt, adjusting parameters like temperature to introduce diversity, and repeating this process multiple times.
Once a set of candidate prompts is generated, they evaluate each prompt based on its inference performance on a training set, assigning scores accordingly. The top-K prompts are labeled as "good" prompts, and the bottom-K as "bad" prompts. A meta-prompt, such as "There are some good prompts and bad prompts; generate a new prompt based on them," then instructs the LLM to generate improved prompts by drawing insights from both sets.
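As I understand the scoring step, it amounts to something like the following hypothetical sketch (my own names; the paper's exact evaluation code is not shown): each candidate prompt is prepended to every training question and the predictions are checked against the gold answers.

```python
# Hypothetical sketch of the per-prompt scoring described above (my own
# names, not the authors' code): accuracy of the task LLM on the training
# set when the candidate prompt is prepended to each question.
from typing import Callable, List, Tuple

def prompt_accuracy(
    prompt: str,
    train_set: List[Tuple[str, str]],    # (question, gold answer) pairs
    answer: Callable[[str], str],        # task LLM: full input text -> predicted answer
) -> float:
    correct = sum(
        1
        for question, gold in train_set
        if answer(f"{prompt}\n\n{question}").strip() == gold.strip()
    )
    return correct / max(len(train_set), 1)
```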
Strengths
- Adaptation across models and languages. Adaptability is a valuable property these days. They show adaptation ability across different models and languages on challenging benchmarks.
- Experiments. They conducted experiments on the Big-Bench Hard dataset, a recognized benchmark for difficult tasks. Their method achieves strong performance on this benchmark.
Weaknesses
- Limited theoretical insight: The proposed method is primarily an engineering technique, and the paper does not provide substantial theoretical insight or lessons, limiting its contribution to a deeper understanding of prompt tuning.
Questions
- Understanding Figure 2: The explanation of Figure 2 is unclear to me. Could you provide a different example to help clarify its purpose and interpretation?
- Clarification on Terminology in Section 2.3: What is the difference between the terms “Cross-model version” and “Cross-model” as used in Section 2.3? A detailed distinction would be helpful.
- Comparison with Baseline (OPRO): How does the proposed method differ from the baseline OPRO?
The paper proposes a new prompt optimization method which is inspired by contrastive learning. The method makes use of both correct and incorrect samples in prompt optimization. The paper also proposes a method for achieving prompt adaptation across different models and languages.
The reviewers generally agreed that the proposed method is straightforward and intuitive, that the ability to achieve adaptability is novel and interesting, and that the experimental results in general look convincing.
However, during the initial review, rebuttal and discussion phases, multiple reviewers expressed concerns about the technical novelty of the paper: the different components of the method are mostly renamed versions of those from previous methods, and the "contrastive learning" concept adopted here is more of an analogy than a real contrastive learning technique. In addition, some reviewers commented that the adaptation performance is not very robust or convincing. Moreover, during the rebuttal, Reviewer ZqAQ discovered some potential issues regarding the fairness of the comparisons (i.e., some reported numbers are lower than in the previous works), which also casts doubt on the empirical contributions.
Given the above, I think the paper still has room for improvement as discussed above, and hence recommend rejection.
Additional Comments on Reviewer Discussion
During the rebuttal and discussion, the reviewers engaged in discussions regarding some important aspects of the paper, such as the empirical comparisons (mostly the fairness of the comparisons) and the technical novelty of the paper.
Reject