Hyperband-based Bayesian Optimization for Black-box Prompt Selection
A novel Hyperband-based Bayesian Optimization method that performs sample- and query-efficient structural-aware prompt selection for large language models, outperforming state-of-the-art methods across multiple benchmarks.
Summary
Reviews and Discussion
This paper assumes a candidate instruction set and a candidate example set and aims to optimize prompts by automatically selecting the best instruction + example combination. The method combines Bayesian Optimization with Hyperband, where the main contribution compared to previous work is using Hyperband to discard poor-performing prompts early, reducing computational cost.
Questions for the Authors
As I discussed above.
Claims and Evidence
The paper makes two main claims:
- Modeling the structural information of instructions and examples separately and using a Gaussian Process for prompt selection optimization (which they call a structural-aware deep kernel GP) improves performance. → The paper provides dimensionality-reduction comparison experiments (t-SNE visualizations) showing that the DK-GP aligns prompt performance better in the low-dimensional space. They also ran ablation experiments demonstrating that the structural-aware deep kernel GP outperforms a GP that does not model the structure.
- HbBoPs improves efficiency by reducing computational cost through Hyperband's early-stopping mechanism. → The experimental results are solid, and the fact that Hyperband improves efficiency is also expected.
Methods and Evaluation Criteria
The setup assumes that we have a large candidate instruction set and a candidate example set, which already includes the high-quality choices we are looking for. Then, BO is used to find the best combination efficiently. This application scenario feels somewhat limited because, if a human were writing prompts, the number of candidates wouldn't be that large. Essentially, we would first need some method to generate a large set of instructions and examples to choose from, but the quality of this set itself is unknown. In Section 3.2 (Structural-aware Deep Kernel), the feature extractor is trained separately on the instruction set and the example set. If we later realize that both sets are of poor quality and want to update them, then the previous training becomes wasted and difficult to adjust.
When running the experiments, HbBoPs does achieve a lower error rate with fewer LLM calls, but I am curious why the metric here is the number of LLM calls. Typically, we measure time and tokens instead. Also, the cost does not seem to include the feature extractor's computational overhead; in general, compared methods should account for the total cost.
Theoretical Claims
No theoretical claims.
Experimental Design and Analysis
The experiments corresponding to the main claims of the paper are conducted clearly, and the results are well-presented.
However, the paper does not explicitly compare different n-shot settings and instead fixes 5-shot throughout the experiments. I am curious how factors such as candidate set size, quality, and the n-shot choice affect the method's performance and convergence. As mentioned earlier, this method seems to be designed specifically for scenarios where we already have a large, high-quality candidate set, since the candidate set is fixed and cannot be updated in the current framework. This raises concerns about its applicability to situations where prompt refinement is iterative. Additionally, the paper does not explain why 10 dimensions were chosen for the dimensionality reduction, and many hyperparameters feel somewhat arbitrarily set without justification.
Supplementary Material
No supplementary material provided
Relation to Existing Literature
Overall, this work feels somewhat narrow in scope, focusing on a specific setup where a high-quality, fixed candidate set is assumed, without addressing broader prompt optimisation techniques that involve iterative refinement or generation of new candidates.
Essential Missing References
No
Other Strengths and Weaknesses
The paper is easy to follow.
Other Comments or Suggestions
It would be useful to discuss whether and how the approach could handle updates to the candidate set. Maybe compare with baselines that can handle updates, to make this work more convincing.
Ethics Review Concerns
No
We thank the reviewer for their constructive review. We are pleased that the reviewer finds our method efficient, acknowledges the strength of our experimental results, and appreciates the clarity of the paper. Below, we respond to the concerns raised in the review.
1. The setup assumes that we have a large candidate instruction set and candidate example set, which already includes the high-quality choices we are looking for.
Our method is designed for the static black-box prompt selection setting, as discussed in Section 6, which is distinct from iterative prompt optimization frameworks. That said, while our experiments use a fixed candidate set to ensure fair comparison across methods, our approach does not require a fixed candidate set. It can be readily integrated into dynamic or iterative pipelines in which new prompts are generated - e.g., via evolutionary strategies (Fernando et al., 2024, ICML) - and added to the candidate set.
2. This application scenario feels somewhat limited because if a human were writing prompts, the number of candidates wouldn't be that large.
Even in human-in-the-loop scenarios, combining a small number of candidate instructions with few-shot exemplar selections (different examples and their orderings) quickly leads to a combinatorially large search space. Therefore, moderately sized initial sets can produce rich candidate pools that benefit from principled selection.
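As a back-of-the-envelope illustration of this combinatorial growth (with hypothetical pool sizes, not the paper's experimental setup):

```python
from math import perm, comb

# Hypothetical pool sizes, purely for illustration.
n_instructions = 5   # candidate instructions
n_examples = 20      # pool of few-shot exemplars
k_shot = 5           # exemplars included per prompt

# If exemplar order matters, the number of candidate prompts is already huge.
ordered = n_instructions * perm(n_examples, k_shot)    # 5 * 1,860,480 = 9,302,400
# Even ignoring order, the pool remains far too large for exhaustive evaluation.
unordered = n_instructions * comb(n_examples, k_shot)  # 5 * 15,504 = 77,520

print(ordered, unordered)
```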
3. The feature extractor is trained separately for the instruction set and example set. If we later realize that both sets are of poor quality and want to update them, then the previous training becomes wasted and difficult to adjust.
We acknowledge that changing the candidate set mid-optimization requires embedding new prompts. However, prior evaluations are not wasted. We retain all prior evaluations and use them to learn the relationship between latent prompt representations and performance. If newly added candidates are drawn from a different distribution and improve upon earlier ones, the GP will reflect this in its updated posterior, enabling our method to adapt.
4. LLM calls used as cost metric
Overall wall clock time for the entire pipeline is less informative in the context of prompt selection, as prompt evaluations can be parallelized via asynchronous API calls. In the case of black-box LLMs accessed via APIs, cost is naturally tied to the actual monetary API cost. While token count could be used, API pricing distinguishes between input and output tokens (e.g., cost per 1000 input/output tokens), and these rates vary across LLMs. Aggregating results across different models using token-based cost introduces additional complexity. In contrast, the number of LLM calls offers a clean, model-agnostic measure of cost and it was positively noted by reviewers vVkq, rbkb and NvHi that our method reduces this cost.
5. Cost of feature extractor
We observe an overhead (on an M1 processor) of approximately 1-4 seconds per BO proposal, including training the deep kernel GP and selecting the next candidate prompt. This is similar to the overhead induced by EASE or TRIPLE-GSE. We consider this overhead minimal in practice relative to the latency and monetary cost incurred by the multiple LLM queries required for prompt evaluation. Moreover, the overhead of the encoder is small, since the encoders have parameter counts in the range of millions.
6. Fixed 5-shot setting
We conducted preliminary investigations into the effect of the number of few-shot examples. Results indicated that increasing the number of examples improves the performance of prompts - similar to Wan et al. (2024, NeurIPS). EASE (Wu et al., 2024, NeurIPS), a competitor, uses a 5-shot setting in its experiments, which we mimicked. From a methodological standpoint, we expect that increasing the number of few-shot examples further amplifies the benefits of our method. Competitors, in contrast, are not designed to disentangle the contributions of the instructions and few-shot exemplars.
7. Hyperparameters of the deep kernel
The latent dimensionality of 10 was motivated by widely observed limitations in the BO literature, where vanilla GPs tend to perform poorly in moderate- to high-dimensional input spaces. Therefore, we selected 10 as a conservative upper bound. With respect to the architecture and hyperparameters of the feature extractor, we acknowledge that these choices were not exhaustively tuned. However, we adopted widely accepted defaults: feedforward MLPs with ReLU activations, trained using AdamW with common hyperparameters.
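To make these defaults concrete, the following is a minimal sketch of such a feature extractor with separate instruction and exemplar branches combined by a joint MLP, as described in the paper; the encoder embedding size, hidden width, and learning rate are illustrative assumptions rather than values reported in the paper:

```python
import torch
import torch.nn as nn

EMB_DIM = 768     # assumed frozen-encoder embedding size (e.g., BERT base)
HIDDEN = 64       # assumed hidden width, illustrative only
LATENT_DIM = 10   # latent dimensionality chosen as a conservative upper bound

class StructuralFeatureExtractor(nn.Module):
    """Small trainable MLP mapping frozen encoder embeddings of the
    instruction and the few-shot exemplars to a low-dimensional latent."""
    def __init__(self):
        super().__init__()
        self.instruction_net = nn.Sequential(nn.Linear(EMB_DIM, HIDDEN), nn.ReLU())
        self.exemplar_net = nn.Sequential(nn.Linear(EMB_DIM, HIDDEN), nn.ReLU())
        self.joint_net = nn.Sequential(
            nn.Linear(2 * HIDDEN, HIDDEN), nn.ReLU(),
            nn.Linear(HIDDEN, LATENT_DIM),
        )

    def forward(self, instr_emb, exemplar_emb):
        h = torch.cat([self.instruction_net(instr_emb),
                       self.exemplar_net(exemplar_emb)], dim=-1)
        return self.joint_net(h)

extractor = StructuralFeatureExtractor()
optimizer = torch.optim.AdamW(extractor.parameters(), lr=1e-3)  # common defaults
```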
We hope that our responses have adequately addressed the concerns raised by the reviewer, and we remain happy to clarify any follow-up questions.
This paper makes two main contributions to prompt optimization in the context of maximizing total scores over datasets (e.g., GSM8K):
- Introducing a deep kernel GP, i.e., projecting high-dimensional prompt embeddings into a lower dimension before passing the vectors to the kernel
- Introducing the standard Hyperband technique for multi-fidelity scheduling (i.e., avoiding evaluation on the full validation set of GSM8K)
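For readers unfamiliar with the scheduler, below is a minimal sketch of the generic Hyperband bracket schedule, where "budget" would correspond to the number of validation instances a prompt is evaluated on; the maximum budget and halving rate are illustrative assumptions, not the paper's configuration:

```python
import math

def hyperband_schedule(max_budget, eta=3):
    """Return, for each Hyperband bracket, the successive-halving rounds
    as (number of configurations, budget per configuration) pairs."""
    s_max = 0
    while eta ** (s_max + 1) <= max_budget:  # s_max = floor(log_eta(max_budget))
        s_max += 1
    brackets = []
    for s in range(s_max, -1, -1):
        n = math.ceil((s_max + 1) * eta ** s / (s + 1))  # initial configurations
        r = max_budget / eta ** s                        # initial budget each
        rounds = [(math.floor(n / eta ** i), r * eta ** i) for i in range(s + 1)]
        brackets.append(rounds)
    return brackets

# e.g., at most 81 validation instances per prompt, keeping the top third each round:
for rounds in hyperband_schedule(81, eta=3):
    print(rounds)
```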
Experiments show this method outperforms previous prompt optimization methods (TRIPLE-GSE, TRIPLE-SH, MIPROv2, EASE), and ablations analyze the effects of using different base LLMs (Claude, LLaMA, Mistral) and different embedders (BERT, MPNet, DistillRoBERTa).
Additionally, Appendix A contains important ablations on understanding the behavior of the projected embeddings (i.e., whether they align with the objective function).
Questions for the Authors
N/A - paper is overall solid.
Claims and Evidence
Yes, the experiments are fairly comprehensive. I was most interested in making sure the projected embeddings on validation settings made sense (i.e. new objective landscape is smooth). It's possible that the high parameter-count of the embedding projector could've led to overfitting when performing GP logprob maximization, but apparently not.
Methods and Evaluation Criteria
Yes, the method makes complete intuitive sense and matches standard intuitions from Bayesian Optimization (e.g. making sure the Gaussian Process works well for regression) + using Hyperband for multifidelity optimization.
Theoretical Claims
N/A - no theoretical claims.
Experimental Design and Analysis
As mentioned earlier, I just wanted to make sure that the new deep kernel-GP provided accurate regression results, and according to Appendix A's visualizations, this confirmed it.
Everything else (i.e. optimization performance) follows directly.
Supplementary Material
Appendix A mostly.
Relation to Existing Literature
The deep kernel GP idea using LLM embeddings might be much more broadly applicable than just prompt optimization. For example, since inputs are strings, one can also perhaps use such techniques to speed up evolutionary code searches (e.g. [1]) as well, or even generally, any optimization problem involving mapping a string to a score. Recently, it has been discovered [2] that even embedding tabular inputs leads to embeddings which have good objective landscape properties.
[1] (Romera-Paredes, 2023) Mathematical discoveries from program search with large language models.
[2] (Tang, 2024) Understanding LLM Embeddings for Regression
Essential Missing References
Since prompt optimization is a form of bandit optimization, there have also been other works combining embeddings, GPs, and Hyperband for regression (albeit in other optimization settings) that the authors should consider citing:
[1] (Rankovic, 2023) BoChemian: Large language model embeddings for Bayesian optimization of chemical reactions.
[2] (Falkner, 2017) Combining Hyperband and Bayesian Optimization.
[3] (Kristiadi, 2024) A Sober Look at LLMs for Material Discovery: Are They Actually Good for Bayesian Optimization Over Molecules?
[4] (Tang, 2024): Understanding LLM Embeddings for Regression.
[5] (Nguyen, 2024): Predicting from Strings: Language Model Embeddings for Bayesian Optimization
Other Strengths and Weaknesses
As mentioned before, I suspect this deep kernel GP idea using LLM embeddings will be much more broadly applicable to many different problems (any objective mapping a string to a score) than just prompt optimization. It would be worth investigating this in future work.
This is especially true if one can pretrain the GP over a massive amount of offline data first for better warm-starting.
Other Comments or Suggestions
- Just making sure (forgive me if I missed it) - in terms of amount of data, is the deep kernel GP retrained online at every optimization iteration only on the current optimization task's trials, or is it pretrained from offline data before the start of optimization?
- Why EI for acquisition, as opposed to UCB? EI can be quite flat, making acquisition optimization difficult sometimes, unless you use tricks like log-EI [1].
[1] (Ament, 2023) Unexpected Improvements to Expected Improvement for Bayesian Optimization.
We thank the reviewer for their constructive review. We are pleased that the reviewer recognizes our structure-aware deep kernel GP as a novel and meaningful contribution and that the integration with Hyperband is viewed as sound and well-justified. We also appreciate their acknowledgment of the comprehensiveness of our experiments and ablation studies. Below, we respond to weaknesses, feedback and questions raised by the reviewer:
1. Possibility of Overfitting in Deep Kernel GP
Indeed, prior work, such as Ober et al. (2021, UAI) has highlighted scenarios where deep kernel models may overfit. However, in our setting, this risk is mitigated. The encoder component (e.g., BERT) is a large, frozen, pre-trained language model. The only trainable component in our deep kernel GP consists of the comparably small feature extractor applied on top of the encoder, with approximately 100k trainable parameters. This is significantly smaller than typical deep kernel learning applications, where models with 5-10x more trainable parameters are used. Additionally, when optimizing the GP marginal likelihood, the log-determinant term in the objective acts as a form of complexity regularization, penalizing overly flexible kernel functions.
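For reference, the objective being maximized is the standard GP log marginal likelihood (textbook form, not notation from the paper):

$$\log p(\mathbf{y} \mid X, \theta) = -\tfrac{1}{2}\,\mathbf{y}^{\top}\left(K_{\theta} + \sigma^{2} I\right)^{-1}\mathbf{y} - \tfrac{1}{2}\log\det\left(K_{\theta} + \sigma^{2} I\right) - \tfrac{n}{2}\log 2\pi,$$

where the $\log\det$ term grows with kernel flexibility and thus provides the complexity penalty referred to above.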
2. EI as acquisition function
We chose EI as our acquisition function since it is widely regarded as the de facto standard in BO. While UCB is a popular alternative, it requires the specification or tuning of a mean-variance trade-off parameter, which introduces an additional hyperparameter whose optimal value can be problem-dependent. The reviewer is correct in referring to Ament et al. (2024, NeurIPS), who discuss challenges associated with optimizing EI using gradient-based methods. These difficulties are primarily due to EI being flat in large regions of the input space and often numerically close to zero. This results in first- or second-order methods having serious challenges during optimization, which can be mitigated by optimizing the EI on log scale (and performing other numerical tricks for robust computation of the EI). However, these challenges do not arise in our setting. In particular, we do not rely on gradient-based optimization of the acquisition function. Instead, since our candidate set of prompts is discrete, we evaluate the EI exhaustively over all candidates and select the argmax directly. As such, issues of flatness, vanishing gradients, or numerical instability in gradient-based optimization are not applicable in our case.
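As a minimal sketch of this exhaustive, gradient-free acquisition step (the posterior values and incumbent below are placeholders, not outputs of the actual surrogate):

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, best_f, minimize=True):
    """Closed-form EI per candidate from GP posterior mean `mu` and
    standard deviation `sigma`; `best_f` is the incumbent value."""
    improvement = (best_f - mu) if minimize else (mu - best_f)
    sigma = np.maximum(sigma, 1e-12)  # guard against zero predictive variance
    z = improvement / sigma
    return improvement * norm.cdf(z) + sigma * norm.pdf(z)

# GP posterior over all remaining candidate prompts (placeholder numbers).
mu = np.array([0.31, 0.28, 0.35, 0.27])   # predicted validation error
sigma = np.array([0.05, 0.08, 0.02, 0.09])
best_f = 0.30                             # best observed validation error so far

ei = expected_improvement(mu, sigma, best_f, minimize=True)
next_prompt = int(np.argmax(ei))          # propose the argmax-EI candidate
```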
3. Is the deep kernel GP trained only on the current task’s trials or pre-trained on offline data?
The deep kernel GP is trained entirely online, using only the observations collected during the current optimization run. We do not pre-train the deep kernel feature extractor on offline data before the task. That said, we agree that exploring pre-training strategies for the deep kernel - particularly leveraging ideas from transfer learning - represents a promising direction for future research!
4. Related Work
We thank the reviewer for pointing us to relevant work that focuses on obtaining meaningful representations of non-standard input to make BO applicable in more complex domains. We appreciate these suggestions and will ensure that these works are included and appropriately discussed in the camera-ready version of the paper. Concerning the directly related work of Falkner et al. (2018, ICML) on combining Hyperband with BO, we would like to clarify that this reference is already cited and discussed in Section 3.4 of our paper.
We thank the reviewer again for their feedback and we remain happy to clarify any follow-up questions.
The paper introduces HbBoPs, a novel method for optimizing prompt selection in large language models (LLMs) in black-box settings. The method combines a structural-aware deep kernel Gaussian Process with Hyperband, a multi-fidelity scheduler, to efficiently select prompts. The approach is designed to handle large, combinatorial search spaces and high evaluation costs associated with black-box LLMs. Extensive experiments demonstrate that HbBoPs outperforms state-of-the-art methods in both performance and efficiency.
Questions for the Authors
How does the choice of encoder model affect the performance of HbBoPs in practice? Can the method be extended to handle other components of prompts, such as output guidance or formatting constraints?
Claims and Evidence
Claim: HbBoPs improves query-efficiency by adaptively allocating resources across different fidelity levels.
Evidence: The paper presents experimental results across ten benchmarks and three LLMs, showing that HbBoPs outperforms existing methods in terms of both performance and efficiency.
Claim: The structural-aware deep kernel GP enhances the surrogate model's ability to predict prompt performance.
Evidence: The paper includes an ablation study demonstrating the effectiveness of the structural-aware deep kernel GP compared to non-structural-aware models.
Methods and Evaluation Criteria
The paper uses a combination of Gaussian Processes and Hyperband for prompt selection. The structural-aware deep kernel GP is used to learn a low-dimensional prompt representation, while Hyperband governs the number of validation instances for prompt evaluation.
The methods are evaluated based on their ability to identify well-performing prompts with a limited number of LLM calls. The performance is measured using validation and test errors across various tasks and LLMs.
Theoretical Claims
The paper claims that the structural-aware deep kernel GP can effectively use structural and semantic differences in prompts to improve prediction accuracy. It also claims that Hyperband can efficiently explore the search space by terminating poor-performing prompts early.
Experimental Design and Analysis
The experiments are conducted on ten tasks commonly used for LLM evaluation, using three different LLMs. The paper compares HbBoPs against several baselines and state-of-the-art methods, using a total budget of 25 full-fidelity evaluations for each task.
Supplementary Material
N/A
Relation to Existing Literature
N/A
Essential Missing References
n/a
Other Strengths and Weaknesses
The method is highly efficient in terms of both sample and query efficiency. The integration of Hyperband with a structural-aware GP is novel and well-justified.
The method's performance depends on the choice of encoder model for embeddings, although the paper claims robustness to this choice.
Other Comments or Suggestions
In line 51, the authors claim that EASE does not consider the joint optimization of instructions and exemplars, which is not true. Please check the EASE paper carefully again.
We thank the reviewer for their constructive feedback and for recognizing the novelty and efficiency of our proposed method. While we appreciate the reviewer’s feedback, we note that some contributions of the paper may not have been fully recognized. Below, we respond to the two concerns raised in the review.
1. In line 51, the authors claim that EASE does not consider the joint optimization of instruction and exemplars, which is not true. Please check the EASE paper carefully again.
EASE can be used for joint instruction and few-shot exemplar selection by treating the prompt as a single block of text. We ran EASE in exactly this way, as intended by the authors. However, in contrast to our method, EASE does not make use of the structural information of the prompt being composed of different building blocks. Therefore, compared to our method, EASE is "not explicitly designed for the problem", which is what we stated in line 51. In our experiments, we observed that this structural unawareness leads to lower optimization performance compared to our approach. Still, to prevent misunderstanding, we will rewrite line 51 as: "EASE can be used for joint instruction and few-shot exemplar selection by treating the composed prompt as a block of text".
2. The method's performance depends on the choice of encoder model for embeddings, although the paper claims robustness to this choice. How does the choice of encoder model affect the performance of HbBoPs in practice?
To assess the impact of the encoder model on the performance of our method, we conducted an ablation study using three distinct encoder models: BERT, MPNet, and DistillRoBERTa. These experiments were carried out across all benchmark tasks and LLMs, and the results are reported in Section 5.2 of the paper. As shown in Table 3, the anytime validation and test performance remains stable across the different encoders, suggesting that our method is robust to the choice of encoder. We further confirm this observation with a statistical analysis provided in Appendix E.3. This evidence supports our claim that the overall effectiveness of our method is not sensitive to the particular encoder. We note that this robustness was also recognized as a positive aspect by reviewer NvHi.
We hope that our responses above have adequately addressed the two concerns raised by the reviewer responsible for the comparatively low rating.
The reviewer further had the following question:
Can the method be extended to handle other components of prompts, such as output guidance or formatting constraints?
HbBoPs is sufficiently flexible to be extended to handle additional prompt components, such as output guidance or formatting constraints. One natural extension would be to introduce categorical hyperparameters that encode the presence or type of such components - for example, binary indicators for the inclusion of output guidance or formatting constraints, or categorical variables specifying formatting styles (e.g., none, JSON, XML, Markdown). This would expand the search space beyond instructions and few-shot exemplars along these dimensions. Incorporating these into our framework involves augmenting the input representation to the deep kernel GP. Specifically, the additional parameters, after suitable encoding, can be concatenated with the learned representations of the instructions and few-shot exemplars. For example, let $\mathbf{z}$ be the low-dimensional output of our feature extractor operating on the embeddings of an instruction and its few-shot exemplars. We can then concatenate additional hyperparameters $\mathbf{h}$ that indicate or encode the inclusion of output guidance or formatting constraints to obtain $[\mathbf{z}, \mathbf{h}]$, which makes up the features used within the GP. The surrogate model will then jointly learn to predict performance based on the prompt structure, including these new components. We view this as a promising direction for future work.
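As a minimal sketch of this augmentation (component names, encodings, and dimensions are illustrative assumptions):

```python
import torch

FORMAT_STYLES = ["none", "json", "xml", "markdown"]

def augment_features(z, has_output_guidance, format_style):
    """Concatenate the learned latent prompt representation `z` with encodings
    of additional prompt components before passing the result to the GP."""
    guidance = torch.tensor([float(has_output_guidance)])  # binary indicator
    fmt = torch.zeros(len(FORMAT_STYLES))
    fmt[FORMAT_STYLES.index(format_style)] = 1.0            # one-hot style
    return torch.cat([z, guidance, fmt], dim=-1)

z = torch.randn(10)  # latent output of the feature extractor (placeholder)
x = augment_features(z, has_output_guidance=True, format_style="json")
# x = [z, guidance indicator, one-hot formatting style] is the GP input.
```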
Finally, we observed that the reviewer listed the design of our structural-aware deep kernel GP and the incorporation of Hyperband as a multi-fidelity scheduler under Theoretical Claims. We would like to clarify that our work does not propose novel theoretical contributions related to Hyperband or BO. As such, we believe that any critique in this regard is not applicable.
We thank the reviewer again and are happy to clarify any further questions.
Thank you for your response. I have raised my score accordingly.
The paper uses hyperband, combined with a Gaussian process surrogate model with deep kernel learning, to optimize the prompt for black-box LLMs. The method optimizes both the instruction and exemplars, and specifically aims to optimize the efficiency in terms of the total number of LLM calls. Extensive experiments and comparisons with previous baseline methods demonstrate the efficacy of the proposed method.
Questions for the Authors
Please see my individual comments above.
Claims and Evidence
The claims made in the paper are supported by clear and convincing empirical evidence. Extensive experiments and comparisons with previous baseline methods demonstrate the efficacy of the proposed method. The experimental results clearly show that the proposed method performs better than previous full-fidelity and multi-fidelity methods. Additionally, the ablation studies are well-designed and reveal the importance of different components of the proposed method, further supporting the claims.
Methods and Evaluation Criteria
The proposed methods and evaluation criteria are reasonable for the studied problem. Applying hyperband to prompt optimization is a natural and suitable idea. The paper specifically considers minimizing the total number of LLM calls on the validation set, which is an important yet often overlooked aspect in previous works on prompt optimization. This focus on efficiency (in terms of LLM calls) is commendable and aligns well with the goal in practice. The structural-aware deep kernel design is very interesting. It separately embeds the instruction and exemplars before combining them via another MLP.
Theoretical Claims
The paper does not have theoretical claims.
Experimental Design and Analysis
The experimental designs and analyses appear sound and valid. The experimental results are a strength of the paper, showing the superiority of the proposed method over previous approaches. The ablation studies are well-designed, revealing the importance of different components such as the structural-aware deep kernel design. One potential drawback is that the search space in the experiments may be a little too small, as the space contains only 5 instructions (although the entire space has 250 input prompts).
Supplementary Material
I briefly checked the additional results in the appendix.
Relation to Existing Literature
The paper contributes to the existing broader scientific literature, particularly in the areas of prompt optimization for black-box LLMs. The use of hyperband for prompt optimization is a natural and effective idea.
Essential Missing References
I'm not aware of any essential missing reference.
Other Strengths and Weaknesses
Strengths:
- The design of the structural-aware deep kernel is elegant and innovative.
- The experimental results are strong and clearly demonstrate the efficacy of the proposed method.
- The focus on minimizing the number of LLM calls on the validation set is a commendable and practical aspect.
Weaknesses:
- The search space in the experiments may be a little too small, as it only contains five instructions.
Other Comments or Suggestions
An interesting (but optional) further experiment would be to optimize the prompt for the most powerful LLMs available nowadays (such as reasoning models like DeepSeek R1), to see whether the proposed method can improve the performance of these strong models.
We thank the reviewer for their constructive review. We appreciate the reviewer’s recognition of the extensiveness of our experiments, as well as their assessment that our claims are supported by clear and convincing empirical evidence. Moreover, we are happy to hear that they find the structural-aware deep kernel to be elegant. We are also pleased that the design and insightfulness of our ablation studies were positively noted. Lastly, we thank the reviewer for highlighting that our method is highly efficient in reducing the number of LLM calls required to identify a well-performing prompt - a practical aspect that we believe is crucial, yet was not explicitly recognized by all reviewers. Below, we respond to weaknesses, feedback and questions raised by the reviewer:
1. Limited instruction search space
While our experiments consider a fixed set of 5 instructions, the total prompt search space spans 250 candidates due to the combinatorial selection of instructions and few-shot exemplars. This results in a search space comparable in size to those used in prior work concerned with black-box prompt selection, such as Shi et al. (2024, NeurIPS).
From a methodological standpoint, our approach is expected to scale effectively with larger instruction spaces. This is due to the structural-aware surrogate model, which encodes instructions and exemplars separately. Thus, increasing the instruction set size would likely further amplify the benefits of our modeling choices rather than hinder them.
2. Interesting optional experiments on the most powerful LLMs, such as DeepSeek R1
While applying our method to cutting-edge models such as DeepSeek R1 is indeed a promising direction for future work, we believe that our current experimental setup already provides strong evidence for the method’s applicability across diverse LLMs. In particular, we evaluate our approach on a representative and heterogeneous set of models, including Claude 3 Haiku, LLaMA 3 8B Instruct, and Mistral 7B Instruct. The consistent improvements achieved by our method across these models suggest that it is robust to model-specific variations and well-suited for deployment in real-world scenarios.
We thank the reviewer again for their feedback and we remain happy to clarify any follow-up questions.
This paper tackles the challenge of selecting optimal prompts (combinations of instructions and few-shot examples) for LLMs in black-box settings. The authors propose HbBoPs, which combines 1) a structural-aware deep kernel Gaussian Process (GP) that handles instructions and exemplars separately and 2) Hyperband, a multi-fidelity optimization strategy to allocate evaluation resources.
Overall, the reviewers are positive about the work. All reviewers agree that 1) the claims are well-supported by clear and convincing empirical evidence and the gain in performance/efficiency is significant; 2) the methodology is sound and intuitive; and 3) the evaluation is comprehensive. On the flip side, there are some concerns about the rather small search space (e.g., the fixed number of examples) and the fact that performance may depend on the encoder model used for embeddings. Overall, either the authors were able to address these satisfactorily, or (in my opinion) the concerns are relatively minor when weighed against the contributions of the paper. I therefore concur with the reviewers and recommend acceptance of this paper.