Prompt Optimization with EASE? Efficient Ordering-aware Automated Selection of Exemplars
We propose an efficient automated exemplar selection method that uses a neural bandit algorithm to optimize the set of exemplars for in-context learning while accounting for exemplar ordering.
Abstract
Reviews and Discussion
The paper proposes EASE, an exemplar selection algorithm for in-context learning of large language models (LLMs). It takes into account the order of the exemplars and offers the possibility to jointly optimize the instruction and the exemplars. A neural network is iteratively trained on the embeddings of the current exemplar sequence to predict the score of the next exemplar. The search space is reduced using a technique based on optimal transport. The paper also empirically shows that the effect of exemplar selection is not the same for all tasks.
Strengths
- A wide range of baseline methods and tasks are covered in the experiments.
- The paper provides insights on the impact of exemplar selection on some specific tasks.
Weaknesses
- The performance of EASE, which requires training a model for each task, on Instruction Induction (II) is very close to the random search baseline for most tasks.
- Experiments are mainly conducted on gpt-3.5-turbo-1106, an API that might include unknown pre- and post-processing. Moreover, the availability and behaviour of these APIs are not guaranteed, hindering reproducibility. Consequently, the algorithm should also be validated on open-weight models (not just when doing progressive finetuning) in addition to models only available through APIs.
Minor issues: Concerning the presentation, although the main contribution is clearly stated, the paper is structured in a way that makes it a bit hard to read. For example, the main paper lacks a dedicated related work section that would better contextualize the proposed algorithm.
Questions
- “Out-of-distribution” tasks are defined as tasks on which the LLM is not already well trained. How are these tasks selected without access to the training data?
- What is the effect of the number of exemplars k on performance (using a range instead of a fixed value as in D.5)?
Limitations
There is a section on limitations in the main paper and a section on broader social impact in the appendix.
We thank the reviewer for taking the time to review our paper. We are glad to know that our paper has provided insights into the impact of exemplar selection through a wide range of baseline comparisons.
We would like to address the specific concerns and questions raised by the reviewer below:
[W1] Performance on II is very close to random search baseline
You are right that the random baseline performs well in Tab. 1 of Instruction Induction (II) tasks. It is not unexpected that random is a strong baseline: in particular, we highlight in the second para. of Sec. 4.1 that such tasks might not be suitable for evaluating the effectiveness of exemplar selection. This is likely due to data contamination, where the benchmark data have been extensively used to train the target model (i.e., GPT-3.5 in our paper). Therefore, the target model easily identifies the task through the exemplars (including those selected by the random baseline) and utilizes its expert knowledge on the task to answer. This explains why the choice of in-context exemplars has minimal impact on performance in contaminated tasks. To better evaluate the effectiveness of exemplar selection, we propose three classes of tasks in Sec. 4.3 and see significant advantages of EASE in them. We attribute the performance gain to the fact that EASE balances exploration and exploitation to be more effective under budget constraints. The above observation aligns with our insights from progressive finetuning (Sec. 4.2): as the model knows more about the task, the effect of in-context exemplars diminishes.
[W2] Validate on open-weight models for reproducibility
As suggested by the reviewer, we conduct additional experiments with open-weight models. Due to the limited time during rebuttal, we use a representative (Meta's newest) model Llama-3.1-8B-Instruct as the target model. We present the results in Tab. A5 of rebuttal PDF. Importantly, the results are consistent with the original conclusions we drew for the black-box models in the paper. The model is also much smaller than GPT-3.5, hence making it easier for others to deploy and reproduce the results. We hope this further improves the reproducibility of our results.
That said, we still advise practitioners to use the black-box models for comparison, which can be more affordable especially in academic settings (pay-as-you-go APIs vs. hosting a model on servers). We note that the black-box model is also fair for comparison because all baseline methods essentially undergo the same pre- and post-processing.
[Q1] “Out-of-distribution” tasks selection
We agree with the reviewer that we cannot confirm whether a task is already well-trained without access to the training data. We clarify that "out-of-distribution" tasks are defined loosely here (hence in inverted commas as mentioned in Lines 276-277 of our paper). If this is not a suitable name, we will gladly adopt another name that the reviewer deems fit; let us know!
We select these tasks with the following practical intuitions:
(1) If a task contains novel rules such that the model has to extract the underlying relationships among the provided in-context exemplars and directly use the relationship for test-time inferences, it is unlikely that these tasks appear in the training dataset (Rule-based tasks in Lines 281-303).
(2) If a task is constructed through random label remapping to be against the model's existing knowledge (e.g., "good" sentiments are mapped to "bad" sentiments), it is unlikely such tasks appear in the training dataset (Remapped label tasks in Lines 304-314).
(3) If excessive noise is added to existing real datasets, it is likely to present a distributional shift from the original real distribution to one that the model has not seen before (Noisy tasks in Lines 315-322).
To provide more empirical evidence, we refer the reviewer to Tab. A6 & A7 of rebuttal PDF. We performed random exemplar in-context prompting and discovered that it achieved an average performance of 17.6% for "out-of-distribution" tasks, which contrasts with the average performance of 64.7% for II benchmark tasks. This demonstrates that the model is likely to have less knowledge about our defined "out-of-distribution" tasks. Additionally, we can also see that the performance gain of EASE is higher for "out-of-distribution" tasks.
[Q2] Using a range instead of a fixed value of k as in D.5
If we understand correctly, you refer to the setting of allowing a range instead of a fixed value of k to be selected. That is, with the maximum set to 50, we allow the number of exemplars in the prompt to be any integer from 1 to 50. We present the additional results in Tab. A6 of rebuttal PDF. Firstly, EASE continues to consistently outperform the strongest baseline Best-of-N. Secondly, we also observe that the prompts with the best performance typically have a large number of exemplars, i.e., close to the maximum allowed. Thirdly, using a larger number of exemplars gives better performance than a smaller one. Therefore, including more exemplars in the prompt usually gives higher performance. However, this comes at the expense of a higher query cost at test time because more tokens are used in the prompt.
We thank the reviewer for pointing us to this interesting insight which we will add to the revised paper.
Minor issues: Related works and presentation
We thank the reviewer for pointing out ways to improve the presentation of the paper. With the additional page upon acceptance, we can certainly move the dedicated related work section (now in App. A) back to the main paper to ensure a better flow. We also welcome other ways to improve the structure of presentation in our paper.
We hope our clarifications above have helped to improve your opinion of our work.
Thank you for your answer to the questions. The method was validated using the open weight model Llama-3.1-8B-Instruct which improves the reproducibility of the paper. I disagree with the stated advantages of closed APIs. There are also cloud providers that offer inexpensive pay-as-you-go access to open weight models. The authors also run experiments on the effect of k. The new results helped in answering some of the questions and improved the quality of the paper. However, I have some remarks regarding the performance on the “Out-of-distribution” tasks which is one of the main contributions of the paper (since the method is as good as the baselines on regular tasks even though it introduces overhead).
The proposed approach works better on some specific tasks labeled as “Out-of-distribution”. The task selection mainly relies on practical intuitions. Is it possible that the baselines perform worse on these tasks because of some other factors unrelated to model knowledge? This could be tested if we had access to the training data (with models like OLMo, Pythia) or by using methods that more rigorously assess the model's knowledge.
We are happy that our additional experiments helped to address the questions!
We also thank the reviewer for suggesting the use of OLMo and Pythia to verify our intuitions about model knowledge.
Upon checking OLMo's published data source, Dolma, it is likely that Instruction Induction (II) data has been included in the training data (via Common Crawl or The Stack). Therefore, we perform additional experiments across different checkpoints (at 41k, 130k, and 410k training steps) of the recent OLMo-7B-0424-hf model, which released checkpoints over more than 400k steps of training. The results are presented below.
| Task | OLMo_41k Best-of-N | OLMo_41k EASE | OLMo_130k Best-of-N | OLMo_130k EASE | OLMo_410k Best-of-N | OLMo_410k EASE |
|---|---|---|---|---|---|---|
| object_counting | 20.0 ± 2.9 | 28.3 ± 1.7 | 25.0 ± 2.9 | 38.3 ± 1.7 | 45.0 ± 2.9 | 46.7 ± 1.7 |
| sentence_similarity | 25.0 ± 0.0 | 26.7 ± 1.7 | 30.0 ± 2.9 | 31.7 ± 1.7 | 41.7 ± 1.7 | 41.7 ± 1.7 |
| orthography_starts_with | 21.7 ± 1.7 | 23.3 ± 1.7 | 21.7 ± 1.7 | 28.3 ± 1.7 | 26.7 ± 1.7 | 31.7 ± 1.7 |
| translation_en-fr | 21.7 ± 1.7 | 23.3 ± 1.7 | 38.3 ± 1.7 | 45.0 ± 2.9 | 35.0 ± 0.0 | 40.0 ± 0.0 |
The conclusions are consistent with Figure 1 of the main paper.
- When the training just started (i.e., at 41k steps), the model might not be capable enough to carry out effective in-context learning.
- As the training progresses (i.e., at 130k steps), we observe the best exemplar selection effectiveness. At this point, the model is capable of learning underlying relationships among the in-context exemplars, and yet to be well-trained on the specific task.
- As the model converges (i.e., at 410k steps), the gain from exemplar selection using our EASE diminishes as the model becomes well-trained on the dataset of the respective tasks.
We also tried the rule-based tasks and remapped label tasks on OLMo-7B-0424-hf. However, the in-context learning performance is always 0% for these more difficult tasks, so comparisons are not meaningful. We also look forward to similar community efforts (like OLMo and Pythia) to open-source larger and more capable models with intermediate checkpoints in the future.
We thank the reviewer again for the insightful comments that helped improve our paper.
The controlled experiment on the open-source model shows that EASE outperforms the baseline during the earlier phase of training. The gap gets smaller as the number of steps increases. As you mention, this highlights the in-context learning abilities of the model. An undertrained model benefits more from the exemplar selection method. However, the task is present in the training set. It seems to be more about the evolution of the in-context learning abilities of the model as evaluated on a dataset seen during training than about the performance on out-of-distribution tasks. It is an informative experiment nonetheless.
We appreciate the reviewer's active discussion and constructive feedback. Unfortunately, we do not have access to an open-source model with checkpoints that is sufficiently capable of performing the challenging out-of-distribution tasks in context: even the most recent model, OLMo-7B-0424-hf, demonstrates essentially 0% accuracy on OOD tasks. Nevertheless, we are happy to hear that the experiments above are informative from other perspectives.
The paper introduces EASE (Efficient ordering-aware Automated Selection of Exemplars), a new approach to boost in-context learning (ICL) in large language models (LLMs). EASE optimizes the selection and ordering of input-label exemplars without needing model fine-tuning or test-time retrieval. EASE trains a neural network to predict prompt performance using hidden embeddings from a pre-trained language model. Using the network as the scorer, it then employs a bandit algorithm to efficiently search for the best ordered exemplars. Additionally, EASE can jointly optimize exemplars and instructions. On Instruction Induction (II) benchmark tasks and four novel out-of-distribution tasks, EASE outperforms several basic baselines, especially when the LLM has limited knowledge of the task.
Strengths
- The problem addressed is significant, as finding a fixed set of examples and instructions that generalize well to testing time could be a valuable technique.
- The proposed method is well-founded, incorporating a novel component that uses optimal transport to reduce the search space.
- The performance improvement in the out-of-distribution (OOD) setting is clear.
Weaknesses
- The main framework is quite similar to prior work [1], with only minor differences in details.
- The literature review is inadequate, failing to cover other prompt optimization approaches, including instructions and/or example ordering (see below). Including a comprehensive related work section is crucial to highlight the novelty of the approach and justify its preference over others.
- Due to the lack of discussion on prominent related works, it is unclear why the authors did not include them in the baseline set. The current baseline set is weak, relying only on heuristic methods. It would be better to incorporate proper optimization-based methods using RL[2,3], GA [4], or even LLM [5,6] as optimizers.
- The current presentation of the results is unconvincing. For instance, there is no comparison in terms of running time or efficiency, areas where the paper claims to excel. The second-best baseline, Best-of-N, shows competitive results in real benchmarks. If N increases, its performance might surpass EASE. Without efficiency comparison between methods, it is hard to determine if EASE is useful.
- Another major concern is the method seems to only work well with synthetic OOD settings. More experiments with real datasets are preferred.
[1] Xiaoqiang Lin, Zhaoxuan Wu, Zhongxiang Dai, Wenyang Hu, Yao Shu, See-Kiong Ng, Patrick Jaillet, and Bryan Kian Hsiang Low. Use your INSTINCT: Instruction optimization using neural bandits coupled with transformers. In NeurIPS Workshop on Instruction Tuning and Instruction Following, 2023.
[2] Mingkai Deng, Jianyu Wang, Cheng-Ping Hsieh, Yihan Wang, Han Guo, Tianmin Shu, Meng Song, Eric Xing, and Zhiting Hu. 2022. Rlprompt: Optimizing discrete text prompts with reinforcement learning. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 3369–3391.
[3] Tianjun Zhang, Xuezhi Wang, Denny Zhou, Dale Schuurmans, and Joseph E Gonzalez. 2022. Tempera: Test-time prompt editing via reinforcement learning. In The Eleventh International Conference on Learning Representations
[4] Archiki Prasad, Peter Hase, Xiang Zhou, and Mohit Bansal. 2023. Grips: Gradient-free, editbased instruction search for prompting large language models. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pages 3827–3846.
[5] Yongchao Zhou, Andrei Ioan Muresanu, Ziwen Han, Keiran Paster, Silviu Pitis, Harris Chan, and Jimmy Ba. 2022. Large language models are human-level prompt engineers. In The Eleventh International Conference on Learning Representations.
[6] Chengrun Yang, Xuezhi Wang, Yifeng Lu, Hanxiao Liu, Quoc V Le, Denny Zhou, and Xinyun Chen. 2023. Large language models as optimizers. arXiv preprint arXiv:2309.03409.
Questions
- What is the size of the dataset D? If D is small, a brute-force search could be feasible. Typically, LLMs are most beneficial when there is little to no data for downstream tasks, so a small D would be more realistic.
- How was the NN’s uncertainty computed?
- What method was used to tune the hyperparameters of NeuralUCB?
- Please consider more real data such as those in the Tempera paper [3].
Limitations
The authors could also mention the cost of hyperparameter tuning
We thank the reviewer for taking the time to review our paper and acknowledging that we address a significant problem with a well-founded and novel method.
We would like to address your specific questions below:
[W1] Similar to [1]
In EASE, we face the challenge of an exploding permutational search space of exemplars that is distinct from [1]. A direct application of NeuralUCB would not work in practice for exemplar selection without our search space reduction technique through OT. It is also new to formulate the problem as one of optimization in the space of exemplar sequence embedding, which entirely removes the reliance on a white-box LLM as in [1]. The new formulation also enables the natural incorporation of instructions and hence the joint optimization of instructions and exemplars. Thus, the distinct solutions of EASE come from the unique challenges in exemplar selection. We also refer to the explanation of our distinction from [1] in Lines 159-165.
[W2] The literature review is inadequate
To clarify, we have a more extensive related works section in App. A, which expands on the ones discussed in Introduction. We will add references to RLPrompt, TEMPERA & GrIPS in our revised paper. We will have an additional page to include related works in the main paper upon acceptance.
[W3] Other related works
We thank the reviewer for the references. As long as an instruction optimization method can handle the permutational search space of exemplars, it can serve as a baseline. While these works are valuable references to add to our related works (we will do so in the revised paper), it is non-trivial to adapt them to our problem setting, as discussed below.
LLMs: APE [5] and OPRO [6] rely on LLM generation for the instruction search space and hence are not designed for exemplar selection; it is non-trivial to adapt them for this purpose. Nevertheless, we implemented a workaround in the same spirit: we ask the LLM to directly select exemplars for us by explicitly providing it with the search space. In App. D.9, we demonstrated that EASE performs much better than asking LLMs to select exemplars.
RL: The discrete optimization of a fixed-length prompt (i.e., no exemplars) in RLPrompt cannot be easily extended to our setting of exemplar optimization. TEMPERA focuses on query-dependent (vs. our query-agnostic) prompts which we argue to carry privacy risks of data exposure in Introduction paragraph 2. The formulation of TEMPERA also only supports classification tasks. So, we cannot adapt these methods to compare with EASE.
GA: GrIPS only focuses on iteratively editing the instruction and does not optimize for exemplars. Specifically, it either uses a fixed set of exemplars or performs a simple heuristic random search (Sec. 5.2 & 5.8 of GrIPS). In our paper, we have in fact demonstrated the superior performance of EASE against random search in Best-of-N.
[W4] Efficiency
To clarify, EASE excels in query efficiency instead of time efficiency (Lines 85-90). While EASE can take longer to perform the optimization, the savings in LLM API calls or black-box LLM queries can be substantial. So, we perform a fair comparison with the baselines in all tables using a fixed target LLM call/query budget.
To address the question of the performance for a large N, we perform additional experiments with a query budget of 500. The results are shown in Tab. A1 of rebuttal PDF. Compared to the original Tab. 2 of the paper, we see that the performance gain of EASE is still significant even though all baselines (Evo, Best-of-N, and EASE) improve with the larger budget.
[W5][Q4] More real datasets
We highlight that EASE also works better than baselines in real datasets: In Tab. 1, EASE achieves top performance for 17 tasks while Best-of-N tops only 11.
To make the case more convincing, we conducted experiments on a number of additional real benchmarks. We present results on the benchmarks used in the TEMPERA paper in Tab. A4 of rebuttal PDF, including MR, Yelp P., CR, MNLI & MRPC. Note that these tasks are too simple to distinguish the effectiveness of EASE because all methods achieve above 80% accuracy on them, leaving little room for further improvement through in-context exemplars alone.
So, we refer the reviewer to more complicated benchmarks on real tasks with reasoning chains in Tab. A2 of rebuttal PDF, which we just experimented with during the rebuttal period. We hope that the good performance of EASE on MATH, GSM8K & AQuA-RAT with CoT reasoning chains makes the baseline comparisons in our paper more convincing.
Questions
[Q1] Brute-force search
We note that the search space grows exponentially with the size of D. We use a pool of 100 exemplars in most experiments. However, even with only 10 candidates, a brute-force approach is still not feasible because permuting 5 exemplars out of 10 candidates already yields 30,240 possibilities. It can incur a considerable monetary cost to query the black-box API and obtain the best exemplar sequence.
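For a quick, back-of-the-envelope illustration of how fast this search space grows (a hypothetical sketch, not code from the paper):

```python
from math import perm

# Number of ordered selections of k exemplars from a pool of n candidates: n!/(n-k)!
for n in (10, 100):
    for k in (3, 5):
        print(f"n={n}, k={k}: {perm(n, k):,} ordered exemplar sequences")
# n=10, k=5 already gives 30,240 sequences; n=100, k=5 exceeds 9 billion,
# far beyond any reasonable black-box API query budget.
```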
[Q2] Uncertainty for NN
Due to length restrictions, we will answer this separately in an official comment below.
[Q3] NeuralUCB hypers
In order to show robustness and generalizability across all possible tasks, the same set of hyperparameters should perform well on all tasks. So, we fix the architecture of the NN, the learning rate, the weight decay, and the number of training iterations, and do not tune them. We also used the same exploration parameter throughout all experiments, chosen after searching over four candidate values.
We hope our clarifications above have helped to improve your opinion of our work.
Dear Reviewer NbA9,
Thank you for taking the time to review our paper!
As the discussion period is concluding in less than a day, we hope to hear from you whether our rebuttal has sufficiently addressed your questions and concerns. Especially, we hope that the clarifications on our distinctions from [1] and related works are satisfactory. Additionally, we hope that the additional experiments we conducted on real datasets (MR, Yelp P., CR, MNLI & MRPC) and datasets with CoT reasoning chains (MATH, GSM8K & AQuA-RAT) have made our papers more convincing.
We are more than happy to answer any further questions during the remaining discussion period.
Best regards,
Authors of Paper 16009
The paper introduces EASE, a method for optimizing ICL in LLMs by selecting and ordering exemplars efficiently. Unlike retrieval-based methods that incur extra computation and privacy risks, EASE uses a neural bandit algorithm and optimal transport techniques to find high-quality ordered exemplars without test-time computation. It also extends to jointly optimize exemplars and instructions for improved performance. Empirical evaluations demonstrate EASE's superiority over existing methods across various tasks, especially where the LLM lacks task-specific knowledge. EASE proves robust in noisy data scenarios, offering a comprehensive solution for enhancing ICL in practical applications.
Strengths
- The introduction of the EASE algorithm, which combines neural bandit optimization and optimal transport, is a unique approach to exemplar selection.
- The paper includes extensive empirical evaluations, comparing EASE with a comprehensive suite of baseline methods across various tasks. The consistent outperformance of EASE underscores its effectiveness.
- The paper provides clear explanations of the methodology, including the use of neural bandit algorithms and optimal transport, making the complex concepts accessible.
Weaknesses
- The requirement for on-the-fly computation of embeddings for ordered exemplar sequences is identified as a potential computational bottleneck.
- EASE relies on the availability of a suitable validation set, which may not always be readily available in certain scenarios. This requirement could limit the method's applicability in some contexts.
- The joint optimization of exemplars and instructions, while beneficial, adds complexity to the method. The paper could provide more detailed guidance on effectively implementing this joint optimization in practice.
Questions
- Can you provide more insights or potential solutions to mitigate the computational bottleneck caused by the on-the-fly computation of embeddings for ordered exemplar sequences?
- While combining EASE with retrieval-based methods for large exemplar sets has shown better performance, could you elaborate on any potential limitations or challenges of this approach?
Limitations
- The need for a suitable validation set is recognized as a limitation, as such sets may not always be readily available. This could restrict the method's applicability in certain contexts where validation data is scarce or difficult to obtain.
- The selection of exemplars could inadvertently reinforce biases present in the training data. The authors could address how EASE mitigates bias and ensures fairness in exemplar selection, especially in sensitive applications.
We thank the reviewer for taking the time to review our paper. We are glad to hear that our approach is unique, the empirical evaluations are extensive on comprehensive baselines, and our explanations are clear.
For the specific concerns and questions raised, we will address them below:
[W1] On-the-fly computation of embeddings as a potential computational bottleneck
[Q1] Insights or potential solutions to mitigate the computational bottleneck
In addition to the optimal transport (OT) introduced in EASE to reduce the on-the-fly computation of embeddings required, we can potentially eliminate the on-the-fly embedding entirely by adopting the average embedding (AvgEmb) of individual exemplars. However, there is no free lunch: such computational simplification comes at the cost of performance due to the loss of order information. We performed ablation studies in App. D.6 to demonstrate that simply averaging the embeddings of all exemplars using AvgEmb results in worse performance. Though not as good, AvgEmb still achieves decent, competitive performance. Practitioners can balance this trade-off and select the most suitable method.
While the above approaches either use pre-trained models for on-the-fly sequence embedding or discard order information altogether, there is a potential alternative: one can try concatenating the individual exemplar embeddings as the NN input such that the order information is captured in the concatenated embedding. However, this method may require more training data to fit the NN due to a more complex mapping (from the higher-dimensional input space). Also, this method may not scale well when selecting a large number of exemplars.
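To make the three embedding strategies above concrete, here is a minimal sketch; the encoder choice and helper names are ours for illustration and may differ from the paper's implementation:

```python
import numpy as np
from sentence_transformers import SentenceTransformer  # stand-in encoder (assumption)

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # hypothetical embedding model

def sequence_embedding(exemplars: list[str]) -> np.ndarray:
    """Ordering-aware: embed the whole exemplar sequence on the fly."""
    return encoder.encode("\n".join(exemplars))

def avg_embedding(exemplars: list[str]) -> np.ndarray:
    """AvgEmb: average per-exemplar embeddings; cheap to precompute but order-agnostic."""
    return np.mean(encoder.encode(exemplars), axis=0)

def concat_embedding(exemplars: list[str]) -> np.ndarray:
    """Alternative: concatenate per-exemplar embeddings so order is kept,
    at the cost of a higher-dimensional NN input."""
    return encoder.encode(exemplars).reshape(-1)
```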
[W2] EASE's reliance on the availability of a suitable validation set as a limitation
[L1] Especially in contexts where validation data is scarce or difficult to obtain
We clarify that using a validation set is common practice, as in [4, 43, 14]. The validation set can be as small as 20 data samples and still provide meaningful evaluations of the prompts to guide the optimization process. We used 20 validation samples for all experiments. The manual creation of such a small validation set should not be very difficult or expensive.
Moreover, in situations where a suitable validation set is simply not possible to obtain, it is possible to revert to human feedback/responses, thanks to the recent work of Lin et al. (2024). We can replace the numerical validation score with binary human feedback (i.e., asking a human to select a preferred prompt according to the responses/outputs of two prompt candidates). This greatly relaxes the requirement of a labeled validation set and only requires preference feedback from an interactive human user.
[W3] More detailed guidance on effectively implementing this joint optimization in practice.
Sure! We will elaborate more details of the joint optimization in the revised paper. We will provide some rough explanations here:
According to Line 5 of Algorithm 1, the instructions augment the search space by a factor equal to the number of candidate instructions. In practice, the most straightforward implementation without increasing the computational complexity is to reduce the number of exemplar sequences sampled per iteration by the same factor. More generally, one can randomly sample a few instructions to pair with the sampled exemplar sequences. This implementation is simple yet effective, where the number of sampled instructions controls the trade-off of focusing more on instructions vs. exemplars.
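A minimal sketch of this sampling scheme, with hypothetical helper names (the paper's Algorithm 1 remains the authoritative description):

```python
import random

def sample_joint_candidates(exemplar_sequences, instructions, num_candidates, num_instructions):
    """Pair sampled exemplar sequences with a few randomly chosen instructions.

    num_instructions controls how much of the fixed per-iteration budget is spent
    exploring instructions versus exemplar orderings.
    """
    chosen = random.sample(instructions, k=min(num_instructions, len(instructions)))
    candidates = []
    for _ in range(num_candidates):
        seq = random.choice(exemplar_sequences)
        instr = random.choice(chosen)
        candidates.append((instr, seq))  # each pair is then scored by the neural bandit
    return candidates
```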
[Q2] While combining EASE with retrieval-based methods for large exemplar sets has shown better performance, could you elaborate on any potential limitations or challenges of this approach?
We thank the reviewer for recognizing our effort in using retrieval-based methods to tackle large exemplar sets. We would like to elaborate on some limitations:
(1) The filtering through retrieval-based methods completely eliminates the consideration of a large subset of exemplars in the later optimization stage. This may result in important exemplars being left out and never explored again in our automatic optimization.
(2) The cosine similarity retrieval places a strong bias on preferring exemplars that are similar (in the embedding space) to the validation set, which may not yield the best performance in practice. This bias is dependent on the retrieval model and the retrieval metric used which therefore need to be carefully selected.
These limitations come with the simplification of the search space for practical efficiency reasons. We will add a section to discuss the above limitations and challenges for future research and improvements.
[L2] The selection of exemplars could inadvertently reinforce biases present in the training data. The authors could address how EASE mitigates bias and ensures fairness in exemplar selection, especially in sensitive applications.
Fairness is an important aspect to consider as the reviewer pointed out. We would like to refer the reviewer to App. B where we emphasize the responsible deployment of our method, including integrating safety measures and ethical constraints in the objective metric. In sensitive applications, we suggest integrating the performance objective metric with fairness such as demographic parity, equal opportunity, equalized odds, etc. to mitigate bias in exemplar selection. The integration can be done with an adjustable hyperparameter controlling the extent of fairness to be enforced, which depends on the critical and sensitive nature of the application.
We hope our clarifications above have helped to improve your opinion of our work.
References
(Lin et al., 2024) Prompt Optimization with Human Feedback. arXiv.
I appreciate the comprehensive rebuttal provided by the authors, which addresses the concerns and questions raised in my initial review. I will raise my rating from 5 to 6, reflecting a higher confidence in the paper's contribution and impact.
Dear Reviewer FzJo,
Thank you for taking the time to review our paper!
As the discussion period is concluding in less than a day, we hope to hear from you whether our rebuttal has sufficiently addressed your questions and concerns. We have provided additional insights and proposed potential solutions to the computational bottleneck. We also justified the use of our validation dataset, while offering an alternative approach to remove the validation dataset in future works. We also hope that our detailed guidance on joint optimization, along with our discussion of the limitations helped to improve the quality of the paper.
We are more than happy to answer any further questions during the remaining discussion period.
Best regards,
Authors of Paper 16009
The authors propose EASE, a method for optimizing the selection of few-shot examples for prompting black-box LLMs. EASE is an iterative algorithm that combines NeuralUCB and Optimal Transport. In iteration t, EASE trains a neural network to map embeddings of strings (of few-shot examples) to their average score on a validation set. Then, they sample a subset of possible combinations of few-shot examples, filter them with optimal transport, and use NeuralUCB to select a set that maximizes their acquisition function. The selected set is evaluated and added to the pool of (embedding, score) pairs. The authors evaluate on a set of existing tasks and a few synthetic tasks they created. Overall, the proposed method outperforms the baselines nearly across the board.
Strengths
- The paper overall is easy to follow.
- The results are very strong and the range of baselines is satisfactory.
- The hypothesis and synthetic task evaluation is quite valuable and insightful.
- The method underlying EASE is likely to spur future approaches in this space.
Weaknesses
- The method does not really justify why we should expect off-the-shelf embeddings (into a single unlearned vector, no less) of few-shot example strings to encode enough information for NeuralUCB to work effectively. In general, off-the-shelf encoders may or may not highlight the important aspects of the prompt; they're not trained to do that. The authors discuss that embeddings will be sensitive to the ordering. This is true, but in what way do we know that such sensitivity could be leveraged effectively from a frozen, task-misaligned semantic encoder?
- A lot of the discussion regarding "retrieval-based methods" is just forced, and the paper doesn't even need it. The work shows that it's empirically more effective. That's sufficient. You don't need to argue that retrieval-based methods will incur "significant additional computational overheads especially when the number of test queries is large". That's plainly false: retrieval in the way discussed in this paper (e.g., cosine similarity) can standardly be done from billion-scale collections in 2-3 milliseconds in practice, far faster than the LLM invocation.
- Many experimental details don't seem to have been controlled for: what's the effect of the LM choice? what's the effect of the precise budget? of the training set size? of the validation set size? How large are these currently in the first place?
- The tasks selected are arguably overly simplistic and generally this means that the few-shot examples considered are all rather trivial. To my understanding, there are no reasoning chains or open-ended generation in any of the labels. There's no argument on why we should expect a method that works for simple direct classification-type labels to generalize to far more open-ended tasks. Such tasks exist plentifully; it's unclear why the authors do not consider any of them.
- The authors say: "Our proposed Evo and Best-of-N are the most competitive baselines with best performance in 9 and 11 tasks, respectively." This seems to suggest that the authors believe they are the first to consider such approaches. While I think the current baselines are satisfactory, both Evo and Best-of-N approaches are components in numerous existing methods (which may involve other components), e.g. see PhaseEvo "Unified In-Context Prompt Optimization" for the former and DSPy's BootstrapFewShot with RandomSearch for the latter.
Questions
- The authors simultaneously find best-of-N to be highly effective as a baseline and yet argue for the importance of Optimal Transport in their method. Can we gain more insight on what OT ultimately selects under the hood? Intuition is lacking.
- What's the difference between Tables 7 and 8? What about ablations on real tasks?
- The authors say: "filter the large pool of data exemplars to eliminate those that are less relevant to the task. To this end, we propose to first use retrieval-based methods to select the exemplars that are more relevant to the task, and then run our EASE using this refined smaller set of exemplars. Specifically, we use a cosine similarity retriever and perform exemplar selection on D with a size n as large as 100000." What does this mean? What are examples that are "more relevant" to a task? What's the precise retrieval formulation here, i.e., what's the query and what is the corpus of documents?
Limitations
N/A
We thank the reviewer for taking the time to review our paper and acknowledging the strong results, valuable insights, and possibility of future extension.
We will address your concerns below:
[W1] About "off-the-shelf embeddings"
We justify in the paper that off-the-shelf encoders are "commonly used for downstream clustering, semantic search, and classification". We regard prompt optimization as a similar downstream application that can leverage the advantages of such pre-trained encoders.
As for order sensitivity, we have a dedicated section that discusses the benefit of the ordering-aware embedding through ablation studies in App. D.6. Therefore, the sensitivity to ordering is leveraged effectively in EASE to produce better performance as compared to the order-agnostic AvgEmb.
However, we do observe differences across embedding models: see Tab. 16 in App. D.8. Following the reviewer's comments, a worthwhile future direction is to develop or finetune embedding models specifically for prompt optimization, such that they capture the important aspects of the prompt in the latent space.
[W2] Retrieval is fast
We thank the reviewer for pointing out the comparatively negligible computational overhead for retrieval. Considering the fast retrieval in practice, we would instead comment that test time computation is not needed for our method. We also wish to highlight that the decreased privacy risks and the enhanced empirical effectiveness continue to be the advantages of our method.
[W3] Experimental details
We refer the reviewer to App. C for all implementation details. We reproduce them here for ease of reference.
LM choice: Default is gpt-3.5-turbo-1106. In App. D.7, we also show that EASE is generally useful for different black-box models, including GPT-4-V, GPT-4-Turbo, and Gemini Pro.
Budget size: Default is 165 rounds of validation evaluation (same as [14]). We conduct additional experiments by increasing the budget to 500 and show the results in Tab. A1 of rebuttal PDF. Increasing the budget improves performance.
Training set size: Default is 100 data points. In App. D.5, we increase the pool to 1000 data points and show that EASE is able to select a large number of exemplars from this larger training data pool.
Validation data: Default is 20 validation data exemplars. Using a larger size may reduce performance variance and reduce "overfitting" (i.e., finding exemplars and instructions that work well only on the validation set), while having a higher cost.
[W4] Reasoning chains or open-ended generation
Reasoning chains: As suggested by the reviewer, we conduct additional experiments for tasks with reasoning chains, including MATH (Hendrycks et al., 2021), GSM8K (Cobbe et al., 2021), and AQuA-RAT (Ling et al., 2017). From Tab. A2 of rebuttal PDF, EASE works well for these tasks.
Open-ended generation: Tasks like auto_categorization, word_sorting, LR, and LP-variant are beyond simple direct classification-type labels. For example, auto_categorization requires outputting an open-ended sentence that categorizes the inputs well; LP-variant does open-ended sentence translation following a set of nontrivial rules.
Therefore, these tasks demonstrate the generalizability of EASE on various tasks of different levels of difficulty and response types.
[W5] Reference for Evo and Best-of-N
We thank the reviewer for suggesting references for Evo and Best-of-N. We will add connections to PhaseEvo for our Evo baseline, and connections to RandomSearch implemented in DSPy for our Best-of-N baseline. These references will serve as stronger support for the validity of the baselines that we compare with, and hence demonstrate the effectiveness of EASE.
Questions
[Q1] Insight on OT
Intuitively, OT selects exemplar candidates such that they are on average close to the embeddings of the validation exemplars, as measured by cosine similarity. This intuition is derived from the specific definitions of the embedding space, the cost function, and the discrete measure in Lines 179-188.
The advantages of OT are two-fold: (1) OT allows efficient filtering of less useful exemplars due to its computational efficiency, and (2) OT operates entirely on exemplar samples and their embeddings without needing to query the target LLM.
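A simplified proxy of this OT-based filtering, assuming precomputed embeddings (the paper's cost function and discrete measures define the exact criterion; this sketch only captures the average-cosine-closeness intuition):

```python
import numpy as np

def filter_candidates(candidate_embs, val_embs, keep_ratio=0.2):
    """Keep the candidate exemplar sequences whose embeddings are, on average,
    closest (by cosine similarity) to the validation exemplar embeddings."""
    cand = candidate_embs / np.linalg.norm(candidate_embs, axis=1, keepdims=True)
    val = val_embs / np.linalg.norm(val_embs, axis=1, keepdims=True)
    scores = (cand @ val.T).mean(axis=1)     # average cosine similarity per candidate
    keep = max(1, int(len(scores) * keep_ratio))
    return np.argsort(scores)[::-1][:keep]   # indices of retained candidates
```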
If the reviewer is referring to using Best-of-N in place of OT in our EASE, it would incur many more API calls to the target LLM. Alternatively, completely removing OT also degrades performance, as suggested by the ablation study in Tab. 7 (NeuralUCB vs. EASE).
[Q2] Tables 7 and 8
Tab. 7 presents the validation performance and Tab. 8 presents the test performance. We add experiments for real tasks that involve open-ended sentence answers and present the results in Tab. A3 of rebuttal PDF. The results are consistent with the original paper in that using NeuralUCB and OT together contributes to the success of EASE.
[Q3] Retrieval formulation
- Query: validation data exemplars
- Corpus: training set exemplars in D
- Relevance: an embedding model with a cosine-similarity metric
- Procedure: for each validation data exemplar, we retrieve the most similar/relevant training set exemplars from D and combine them to form a smaller set of exemplars (smaller than D).

Therefore, we have retrieved a small subset of exemplars from D that are similar (i.e., relevant) to the validation set. Then, we proceed with EASE using this reduced subset for efficiency. We will make this clearer in the revised paper.
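A minimal sketch of this pre-filtering step under our stated assumptions (precomputed embeddings; top_k is a hypothetical parameter name):

```python
import numpy as np

def prefilter_pool(train_embs, val_embs, top_k=100):
    """For each validation exemplar, retrieve its top_k most cosine-similar training
    exemplars; the union of retrieved indices forms the reduced pool passed to EASE."""
    train = train_embs / np.linalg.norm(train_embs, axis=1, keepdims=True)
    val = val_embs / np.linalg.norm(val_embs, axis=1, keepdims=True)
    sims = val @ train.T                   # (num_val, num_train) cosine similarities
    selected = set()
    for row in sims:
        selected.update(np.argsort(row)[::-1][:top_k].tolist())
    return sorted(selected)                # indices into the original pool D
```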
We hope our clarifications above have helped to improve your opinion of our work.
References
(Hendrycks et al., 2021) Measuring mathematical problem solving with the math dataset. NeurIPS.
(Cobbe et al., 2021) Training verifiers to solve math word problems. arXiv.
(Ling et al., 2017) Program induction by rationale generation: Learning to solve and explain algebraic word problems. ACL.
Thanks for the response. The retrieval formulation is a bit strange. Aren't the training and validation sets, both, from the same distribution? What's the intuition that justifies using each validation example to retrieve a training example, besides possibly overfitting to the validation set?
We thank the reviewer for the prompt reply! The retrieval method is mainly implemented as an independent pre-filtering step, serving as a promising extension of our EASE to handle a large set of exemplars. The intuition is to prefer exemplars that are similar to the validation set in order to achieve high performance. At the same time, it encourages diversity by selecting a subset that contains data from the possibly different subgroups represented by different validation data exemplars. Specifically, for each validation data exemplar, we retrieve the most relevant training data, so the combined dataset contains data from different data subgroups.
We usually assume that the validation set and the test set follow the same distribution in the learning setting. By minimizing the validation error, we also minimize the test error. This in fact follows the common practice in classical machine learning, where we typically use the validation set to select the best-performing models or hyper-parameters.
We hope that our clarification helps!
I'm unable to grasp the intuition in this specific part, but otherwise my assessment of the work is positive. I will raise my score from 6 to 7 but I expect the authors to iterate on the intuition, theory, or qualitative empirical observations from this retrieval element.
We are happy that our response generally improved your opinion about our work! For the specific part on retrieval, we will definitely iterate further to improve the clarity and the presentation of the intuition and results (e.g., we may include a diagram to visually illustrate the retrieval idea).
Global response
We sincerely appreciate the efforts of all our dedicated reviewers. The constructive feedback in the reviews significantly enhanced the quality of our paper. We are very grateful!
In response to the specific questions raised by the reviewers and to address potential weaknesses, we have provided a detailed reply to each of the reviewers. In these responses, we use short forms such as "[Q1]" to denote the first Question, "[W1]" to refer to the first Weakness, and "[L1]" to refer to the first limitation.
In this global response to all reviewers, we have attached a PDF file with tables containing additional results to support our paper. Please refer to the individual responses for detailed explanations, and we provide a summary here:
- [Reviewer u1fC and Reviewer NbA9] We increased the query budget to 500 iterations to study the effect of the query budget on performance.
- [Reviewer u1fC and Reviewer NbA9] We showed the effectiveness of EASE on real-world open-ended generation tasks with reasoning chains.
- [Reviewer u1fC] We extended the ablation studies on the necessity of both OT and NeuralUCB done in App. D.3 to additional open-ended real tasks.
- [Reviewer NbA9] We performed experiments on additional benchmarks used in the TEMPERA paper.
- [Reviewer 1tAm] We provided additional results on Meta's newest open-weight model Llama-3.1-8B-Instruct to ensure better reproducibility.
- [Reviewer 1tAm] We adopted the new experimental setting suggested by Reviewer 1tAm to allow a varying number of exemplars to be selected. The experiments demonstrated the superiority of EASE and provided insights into the number of exemplars to include in the prompt.
This work advances research on automatic prompt optimization in two ways: (1) As highlighted by Reviewer FzJo and Reviewer NbA9, our "EASE algorithm, which combines neural bandit optimization and optimal transport, is a unique approach to exemplar selection". This ensures not only query efficiency through neural bandits, but also computational efficiency by "incorporating a novel component that uses optimal transport to reduce the search space"; (2) As pointed out by Reviewer FzJo, Reviewer NbA9, and Reviewer 1tAm, our EASE "takes into account the order of the exemplars" and "jointly optimizes exemplars and instructions for improved performance", which provides researchers and practitioners a fully automated pipeline for prompt optimization.
Our work can significantly impact the field of prompt optimization by bringing principled optimization and search methods to complement the existing heuristic and retrieval-based approaches. We also hope to draw researchers' attention to the ordered nature of exemplars in the input sequence and the joint effect from the interactions between exemplars and instructions in the prompt. We share similar thoughts with Reviewer u1fC that "the method underlying EASE is likely to spur future approaches in this space".
Again, we would like to extend our thanks to the reviewers for their constructive feedback and valuable insights.
Best regards,
Authors of Paper 16009
Responding to Reviewer NbA9:
Following [14] and [42], the uncertainty of the NN is the Gaussian process posterior standard deviation when the empirical neural tangent kernel (NTK) $k(x, x') = \nabla_\theta f(x; \theta)^\top \nabla_\theta f(x'; \theta)$ is used as the kernel, where $x$ is the embedding and $\theta$ is the parameter of the NN $f$. Specifically, define $Z_t = \lambda I + \sum_{i=1}^{t} \nabla_\theta f(x_i; \theta) \nabla_\theta f(x_i; \theta)^\top$, where $\{x_i\}_{i=1}^{t}$ is the history of embeddings up to step $t$. Then, the uncertainty is calculated according to $\sigma_t(x) = \sqrt{\nabla_\theta f(x; \theta)^\top Z_t^{-1} \nabla_\theta f(x; \theta)}$.
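A minimal, unscaled sketch of this computation in PyTorch (hypothetical helper; real implementations typically approximate Z_t, e.g., with its diagonal, for efficiency, and the exact normalization follows [14, 42]):

```python
import torch

def ntk_uncertainty(model, x, history, lam=1.0):
    """Posterior std. dev. of candidate embedding x under the empirical-NTK GP view.

    model:   score-prediction NN f(.; theta) mapping an embedding to a scalar score
    x:       embedding (1-D tensor) of the candidate exemplar sequence
    history: list of embeddings of previously evaluated sequences
    """
    def grad_features(inp):
        model.zero_grad()
        model(inp.unsqueeze(0)).squeeze().backward()
        return torch.cat([p.grad.flatten() for p in model.parameters()])

    num_params = sum(p.numel() for p in model.parameters())
    Z = lam * torch.eye(num_params)
    for h in history:                     # Z_t = lam*I + sum_i g(x_i) g(x_i)^T
        g = grad_features(h)
        Z += torch.outer(g, g)
    g_x = grad_features(x)
    return torch.sqrt(g_x @ torch.linalg.solve(Z, g_x))   # sigma_t(x)
```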
The reviewers agreed that this is a well-written ("easy to follow", "clear explanations") paper addressing a "significant" problem with an "insightful" and "unique approach". The "extensive empirical evaluations" cover a "wide range of baseline methods and tasks" to make the performance improvement "clear". Given the unanimous agreement, I recommend this paper for acceptance.