PaperHub
Overall rating: 5.5/10 (Spotlight, 4 reviewers)
Individual ratings: 3, 5, 8, 6 (min 3, max 8, std 1.8)
Confidence: 3.5
Correctness: 2.3
Contribution: 2.3
Presentation: 2.8
NeurIPS 2024

Localized Zeroth-Order Prompt Optimization

OpenReview · PDF
Submitted: 2024-05-15 · Updated: 2024-11-06
TL;DR

We propose a principled local optimization method to optimize the discrete prompts for black-box LLMs that outperforms all baseline methods in performance and efficiency.

Abstract

Keywords
Prompt Optimization, Large Language Models, LLM, Instruction Optimization

Reviews and Discussion

Review
Rating: 3

This paper focuses on the prompt optimization task. The authors first propose two insights: (1) Instead of pursuing the global optimum, this paper claims that local optima are usually prevalent and well-performing. (2) The input domain for prompt optimization affects the choice of local optima. Inspired by these two observations, this paper proposes a zeroth-order optimization method that incorporates a Neural Tangent Kernel-based derived Gaussian process to search for local optima. This method achieves competitive results on benchmarking datasets.

Strengths

  1. I like the analysis in Section 3. The two insights are well supported by the provided studies, and the motivation and reasoning for decision-making in this paper are explained in an informative way.

  2. Compared to methods that aim to find the global optimum, incorporating NTK-based Gaussian processes into prompt optimization should theoretically be much faster.

  3. The input domain transformation process leads to a dense numerical feature space for prompts, making the optimization problem easier.

Weaknesses

  1. This paper does not discuss the following recent prompt optimization methods:

[1] Zekun Li, Baolin Peng, Pengcheng He, Michel Galley, Jianfeng Gao, and Xifeng Yan. Guiding large language models via directional stimulus prompting. Advances in Neural Information Processing Systems, 36, 2024.

[2] Hao Sun, Alihan Hüyük, and Mihaela van der Schaar. Query-dependent prompt evaluation and optimization with offline inverse rl. In The Twelfth International Conference on Learning Representations, 2023.

[3] Xinyuan Wang, Chenxi Li, Zhen Wang, Fan Bai, Haotian Luo, Jiayou Zhang, Nebojsa Jojic, Eric Xing, and Zhiting Hu. Promptagent: Strategic planning with language models enables expert-level prompt optimization. In The Twelfth International Conference on Learning Representations, 2024.

  2. For those compared methods, ZOPO did not consistently show advantages in Table 1 and Table 3 in Appendix D.2.

  3. This paper emphasized efficiency. However, in Figure 5, Figure 10, and Appendix D.2, I cannot observe an obvious advantage in efficiency when compared with other methods. In addition, we can observe some results that are contradictory to the paper's claim. In many scenarios in Figure 10, ZOPO does not show obvious advantages when the query number is small, which is contradictory to the 'query-efficient prompt' claim.

Questions

I cannot see a consistent advantage of ZOPO in many figures and tables. Can you explain this part? Thanks.

Limitations

They listed the limitation and claimed to solve it in future work.

Author Response

We thank Reviewer TMXX for taking the time to review our paper and appreciate the reviewer's feedback. We would like to provide the following response to address the concerns and hope it can improve your opinion of our work.


[W1] This paper does not discuss the following recent prompt optimization methods...

In fact, we have already covered a wide range of representative and recent related works [3, 5, 8, 10, 17, 22, 29, 43, 44] on prompt optimization in our main paper. We thank you for pointing out these additional related works, and we will discuss them in our revised paper.

[W2] For those compared methods, ZOPO did not consistently show advantages in Table 1 and Table 3 in Appendix D.2. [Q1] I cannot see a consistent advantage of ZOPO in many figures and tables. Can you explain this part? Thanks.

If the consistent advantages you mentioned mean that the proposed method should achieve the best performance across the majority of tasks, then our ZOPO has indeed demonstrated this consistent advantage. In fact, ZOPO achieves the best performance on the largest number of tasks among all baselines in Table 1, leading on 14 out of 20 tasks vs. 8 out of 20 for the second-best method, INSTINCT [17]. The commonly used performance profile matrix [7] (defined in Eq. 10 of Appendix C.1) shown in Figure 1 also supports this consistent advantage of our ZOPO.

However, if by consistent advantages you mean achieving the best performance on every single task, we believe it is nearly impossible for any single algorithm to achieve such "consistent advantages", as suggested by the no free lunch theorems in various fields [R1, R2].

Overall, while ZOPO may not dominate every individual task (no baseline method does either), its superior average performance and higher frequency of achieving top results in our experiments already underscore its effectiveness in practice. We believe this is sufficient to evidence the clear advantage of ZOPO across a broad spectrum of tasks.

References

[R1] Wolpert, D. H., & Macready, W. G. (1997). No free lunch theorems for optimization. IEEE transactions on evolutionary computation, 1(1), 67-82.

[R2] Wolpert, D. H. (1996). The lack of a priori distinctions between learning algorithms. Neural computation, 8(7), 1341-1390.

[W3] This paper emphasized efficiency. However, in Figure 5, Figure 10, and Appendix D.2, I cannot observe an obvious advantage in efficiency when compared with other methods. In addition, we can observe some results that are contradictory to the paper's claim. In many scenarios in Figure 10, ZOPO does not show obvious advantages when the query number is small, which is contradictory to the 'query-efficient prompt' claim.

We acknowledge that our method does not achieve the highest efficiency in every individual task. However, it consistently ranks among the top three most efficient methods across a wide range of tasks, while the efficiency of other methods varies considerably across tasks, as shown in Figure 5 and Figure 10. These results reasonably evidence that our ZOPO has generally better query efficiency, as claimed in our main paper (refer to line 281). We will add this clarification in our revised manuscript.


We hope our clarifications have addressed your concerns and improved your opinion of our work. We are happy to provide any further clarification during the discussion period.

Comment

Dear Reviewer TMXX,

Thank you for taking the time to review our paper and for your valuable feedback. We have provided clarifications above to address your concerns. We sincerely hope our clarifications could increase your opinion of our work.

If you have any more questions or need more details, we are happy to answer them promptly within the discussion period.

Best,

Authors

Review
Rating: 5

The paper titled "Localized Zeroth-Order Prompt Optimization" proposes a novel algorithm called ZOPO (Localized Zeroth-Order Prompt Optimization) aimed at enhancing the efficiency of prompt optimization in large language models (LLMs). The authors argue that local optima, as opposed to global optima, are more prevalent and can be more effectively targeted for prompt optimization. They introduce a combination of Neural Tangent Kernel (NTK) and Gaussian processes within a zeroth-order optimization framework to improve query efficiency and optimization performance.

Strengths

  1. The thorough empirical study conducted provides a detailed comparison between local and global optima, highlighting the potential advantages of targeting local optima.

  2. The ZOPO algorithm is well-designed, incorporating NTK-based Gaussian processes to enhance the optimization process, which shows promise in improving query efficiency.

Weaknesses

The proposed ZOPO algorithm is complex and might be challenging to implement for practitioners who are not deeply versed in NTK and Gaussian processes. This limits the accessibility and practical utility of the proposed method.

Questions

How do the proposed prompt-tuning approaches compare to the fully fine-tuning ZO approaches such as MeZO? It would be better to justify the settings that require prompt optimization.

Limitations

NA

Author Response

We are grateful to Reviewer TQjw for the constructive feedback and for positively recognizing that our empirical study is thorough and our proposed algorithm is well-designed. We will incorporate the suggested discussion into our revised work. We respond below to their concerns and hope our responses can improve the reviewer's opinion of our work.


The proposed ZOPO algorithm is complex and might be challenging to implement for practitioners who are not deeply versed in NTK and Gaussian processes. This limits the accessibility and practical utility of the proposed method.

We would like to clarify that our proposed ZOPO algorithm is in fact quite straightforward. Specifically, ZOPO has only two major components: GP-NTK in learner_diag.py (52 lines implementing its core idea of computing the empirical NTK and fitting a GP to the query history), and the zeroth-order optimization in optimization.py (about 3 lines implementing its core idea of gradient estimation from the derived GP), both included in the supplementary material we have provided. Moreover, since we have released the code for ZOPO, it becomes less challenging for practitioners without deep expertise in NTK and Gaussian processes to utilize and integrate our method into their problems of interest, which we believe will greatly benefit the accessibility and practical utility of ZOPO.
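To give an intuition of how little machinery is involved, below is a minimal, self-contained sketch of these two ideas. This is not the released implementation: the 2-layer MLP, its input and hidden widths, the function names, and the noise level are illustrative assumptions.

import torch

# Hypothetical 2-layer MLP; only its parameter gradients are used as NTK features.
mlp = torch.nn.Sequential(torch.nn.Linear(16, 64), torch.nn.ReLU(), torch.nn.Linear(64, 1))
params = list(mlp.parameters())

def ntk_features(z):
    # phi(z): gradient of the scalar network output w.r.t. the parameters, flattened.
    out = mlp(z).squeeze()
    grads = torch.autograd.grad(out, params)
    return torch.cat([g.reshape(-1) for g in grads])

def empirical_ntk(Z1, Z2):
    # k(z, z') = <phi(z), phi(z')>, computed pairwise over two sets of embeddings.
    J1 = torch.stack([ntk_features(z) for z in Z1])
    J2 = torch.stack([ntk_features(z) for z in Z2])
    return J1 @ J2.T

def gp_posterior_mean(z_query, Z_hist, y_hist, noise=1e-3):
    # Standard GP regression mean with the empirical NTK as kernel, fitted on the
    # query history (prompt embeddings Z_hist and their observed scores y_hist).
    K = empirical_ntk(Z_hist, Z_hist) + noise * torch.eye(len(Z_hist))
    k_star = empirical_ntk(z_query.unsqueeze(0), Z_hist)
    return k_star @ torch.linalg.solve(K, y_hist)

In ZOPO, a GP derived from this kernel then supplies gradient estimates (Sec. 4.2) rather than a global acquisition function.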

How do the proposed prompt-tuning approaches compare to the fully fine-tuning ZO approaches such as MeZO? It would be better to justify the settings that require prompt optimization.

To clarify, the fully fine-tuning ZO approaches (e.g., MeZO) and prompt optimization ZO approaches (i.e., ZOPO) are designed for different contexts/settings.

  • MeZO: it utilizes ZO approach to reduce the memory footprint when fine-tuning the model parameters of white-box LLMs (e.g., LLaMA) on downstream tasks, as backpropagation typically requires a prohibitively large amount of memory.
  • ZOPO: In contrast, this method is tailored for scenarios where the LLMs (e.g., ChatGPT) are treated as black box systems, where direct fine-tuning of model parameters is not feasible. Therefore, prompt optimization becomes a better choice for adapting black-box LLMs to downstream tasks by only tweaking the text inputs for these tasks.

We will add a detailed discussion comparing MeZO and ZOPO in our revised version.


We appreciate the reviewer's valuable input and hope our answers address your concerns and improve your opinion of our work. Thank you!

Comment

Dear Reviewer TQjw,

Thank you for taking the time to review our paper and for your valuable questions. We have provided clarifications above to respond to your questions and we hope we have increased your opinion of our work.

If you have any more questions or need more details, we are happy to answer them promptly within the discussion period.

Best,

Authors

Review
Rating: 8

The paper proposes multiple contributions:

  1. Establishes a new visualization technique for the objective landscapes of blackbox functions over prompts. This is done by converting the high dimensional embeddings of strings into 2D (via t-SNE), and visualizing the landscape in 3D. Using this, the paper finds several patterns:
  • There is a correlation between the smoothness of the landscape and the strength of the prompt generator.
  • Much of the landscape is filled with local minima.
  2. Proposes a new Bayesian Optimization-like algorithm, with the following setup:
  • The regressor + uncertainty estimator is an NTK-GP using an MLP
  • The acquisition maximization is a gradient descent in the embedding space, with a projection back into the original prompt space.

Strengths

  • The proposed visualization method is simple yet surprisingly very insightful. I believe this might become a very important tool for any string-based blackbox optimization to assess the landscape.

  • The Bayesian optimization-like algorithm makes intuitive sense (bar the weaknesses, see below). This paper is well-written and is straightforward to read.

  • The conducted experiments are rigorous and comprehensive over numerous tasks with multiple baselines. Ablation studies are also relevant and insightful.

Weaknesses

  • Section 4.2 is not well-motivated. I understand that the idea is to construct an acquisition function expressing explore/exploit tradeoffs, and the most natural regressor to use is a Gaussian process, leading to the idea of using the NTK kernel. But this may seem overly complicated. For instance, why not use a simpler regressor / uncertainty estimator, like an ensemble of MLPs?

    • Since the NTK requires computing dot-products of gradients, this makes it tricky to use for larger models (which have much longer gradients as feature vectors).
  • (Small) I would tone down the statement that Bayesian Optimization (or more generally, regressor guided search) will do poorly in local-optima situations simply because it was designed to search for global optima. There are several previous works over traditional black-box optimization showing Bayesian Optimization remains competitive even with multiple local optima. Furthermore, one could argue that the paper is essentially a Bayesian Optimization technique given the gradient ascent over essentially an explore-exploit acquisition.

Questions

Please address the main questions in the weaknesses above.

These are more for clarification:

  1. How does $h^{-1}$ (i.e., mapping an embedding back into some text) work? L165 mentions storing $(z, v)$ for constructing this inverse mapping - does this mean there are already a lot of candidate prompts pre-generated forming an embedding set $Z$, and for a new $z$, we simply perform randomized rounding or projection to the nearest legitimate prompt in $Z$?

  2. Eq 6: What happens if we fully attempt to argmax the acquisition (like regular Bayesian Optimization), rather than just move by a gradient step? I understand that this gradient step may be motivated by the landscape being full of local optima, but there may be missed gains here. Is this explained in L220-L230?

Limitations

Section 7 discusses some limitations, but it may be worth considering the raised weaknesses above.

Author Response

We are highly encouraged by Reviewer GMgT's positive and constructive feedback! We appreciate that the reviewer positively recognizes that our visualization method is insightful and could be a very important tool for studying the black-box prompt optimization landscape, our designed algorithm is intuitive, the paper is well-written and straightforward, and our experiments are rigorous and comprehensive. We would like to address the comments as follows.


Firstly, we would like to clarify that our ZOPO is in fact a gradient ascent-like algorithm rather than a Bayesian optimization-like algorithm. More specifically, in standard Bayesian optimization, a Gaussian process is applied to construct the acquisition function (i.e., a surrogate of the original objective function built from the GP mean and covariance) to trade off exploitation and exploration for global optimization. In contrast, ZOPO makes use of a Gaussian process derived (i.e., Eq. 3) from a standard Gaussian process to estimate the gradient of the original objective function with the GP mean (see line 188), as well as to measure the uncertainty of this gradient estimation with the GP covariance (which is later used in our local exploration for more accurate gradient estimation in Sec. 4.3), for a local optimization, i.e., the gradient ascent in our Eq. 6. We would like to refer you to [35] for a detailed comparison between zeroth-order optimization with a derived GP and standard Bayesian optimization.
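To make this distinction concrete, here is a minimal sketch of such a gradient-ascent step (illustrative only, not the paper's code: the RBF kernel is a differentiable stand-in for the NTK-derived kernel, and the function names, step size, and noise level are our assumptions):

import torch

def rbf(Z1, Z2, lengthscale=1.0):
    # Differentiable stand-in kernel; ZOPO instead derives its kernel from the empirical NTK.
    d2 = ((Z1.unsqueeze(1) - Z2.unsqueeze(0)) ** 2).sum(-1)
    return torch.exp(-d2 / (2 * lengthscale ** 2))

def estimated_gradient(z, Z_hist, y_hist, noise=1e-3):
    # Gradient estimate: differentiate the GP posterior mean
    # m(z) = k(z, Z_hist) (K + noise * I)^{-1} y_hist with respect to z.
    z = z.clone().requires_grad_(True)
    K = rbf(Z_hist, Z_hist) + noise * torch.eye(len(Z_hist))
    mean = rbf(z.unsqueeze(0), Z_hist) @ torch.linalg.solve(K, y_hist)
    return torch.autograd.grad(mean.sum(), z)[0]

def ascent_step(z, Z_cand, Z_hist, y_hist, lr=0.1):
    # One gradient-ascent update (cf. Eq. 6), followed by projection back onto
    # the finite set of candidate embeddings Z_cand.
    z_new = z + lr * estimated_gradient(z, Z_hist, y_hist)
    nearest = ((z_new - Z_cand) ** 2).sum(-1).argmin()
    return Z_cand[nearest]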

[W1] Section 4.2 is not well-motivated

Thank you for pointing out this interesting question. As we have clarified above, the idea in our ZOPO is in fact to leverage a derived Gaussian process to estimate the gradient and then apply gradient ascent to maximize the objective function. However, as mentioned in line 189 of our main paper, the underlying objective $\widetilde{F}$ is complex and highly related to transformers, i.e., it is computed based on the inference of transformers. As a result, standard kernels may not have a powerful enough representation ability to approximate this objective function and hence cannot provide a good gradient estimation for the underlying objective function (as supported in Table 11). To leverage the powerful representation ability of deep neural networks for our prompt optimization, we can either train an ensemble of MLPs (the method you mentioned), or apply the empirical NTK to avoid this training process while maintaining a good approximation to the predictions of neural networks and hence preserving their compelling representation ability. Of note, the effectiveness of the empirical NTK has in fact been widely evidenced from both theoretical and empirical perspectives [2, 15, 33, 34]. Although the empirical NTK requires computing dot products of gradients, it is still more computationally efficient (as no training is required) than training an ensemble of MLPs, especially since we typically use a small neural network, e.g., a 2-layer MLP in our implementation, to compute this empirical NTK in practice. In light of the effectiveness and efficiency of NTK, we therefore choose to apply NTK in this paper for our ZOPO. We will add these discussions to our revised paper.

[W2] Statement on Bayesian Optimization

Thank you for your valuable feedback. We acknowledge that our previous statement about Bayesian Optimization performing poorly in local-optima situations could be overly broad. We will revise this statement in our revised version. Below, we clarify that our ZOPO is different from Bayesian Optimization algorithms.

While we acknowledge the existence of previous works on local Bayesian Optimization, our method, ZOPO, is fundamentally a gradient ascent-like algorithm, which is inherently more suited for local optimization tasks. Similar to [35], our ZOPO does not construct the acquisition function like Bayesian optimization. Instead, ZOPO will apply the derived GP mean to estimate gradient (line 187-188) and derived GP covariance for more queries to better estimate the gradient (Sec. 4.3), which emphasizes exploitation when updating the next queries rather than the exploration-exploitation trade-off in Bayesian Optimization. Importantly, our empirical results in Sec. 5 show that our local optimization algorithm ZOPO generally outperforms the global optimization (i.e., Bayesian optimization) algorithms, such as InstructZero and INSTINCT, in the context of prompt optimization, which therefore indicates the advantages of local optimization in this specific setting.

[Q1] Implementation of $h^{-1}$

Yes, your understanding is correct. To clarify, the inverse mapping is built on a finite set. Specifically, we pre-generate a finite set of unique prompt candidates (i.e., $\mathcal{V}=\{v\}$), that is, each text prompt $v$ in this set is unique. With these unique prompt candidates, the complex embedding model will usually produce the corresponding unique embedding vectors $\mathcal{Z} = \{z\}$. Our mapping $h$ is then defined on these two finite sets $\mathcal{V}$ and $\mathcal{Z}$, i.e., $h: \mathcal{V} \rightarrow \mathcal{Z}$, which therefore leads to the one-to-one mapping. In practice, we only perform the search in the finite set $\mathcal{Z}$ (as shown in Equation 2), and all the gradient-updated points are projected back into $\mathcal{Z}$, which then finds a unique $v \in \mathcal{V}$ in the natural language space (as shown in lines 201-203).
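As a toy illustration of this finite one-to-one mapping and the projection step (the prompt strings below are hypothetical, and the random vectors stand in for the embeddings that the chosen embedding model would actually produce):

import numpy as np

# Hypothetical pre-generated prompt candidates V; random vectors stand in for
# the real embeddings Z produced by the embedding model.
prompts = ["Sort the given animals alphabetically.", "List the animals in order."]
Z = np.random.randn(len(prompts), 768)

def h_inverse(z_updated):
    # Project a gradient-updated point back onto the finite set Z, then return the
    # unique prompt text paired with the nearest candidate embedding.
    nearest = np.linalg.norm(Z - z_updated, axis=1).argmin()
    return prompts[nearest]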

[Q2] Eq. 6

We clarify that Eq. 6 is not an acquisition function as in BO, and taking the argmax would require global modeling like traditional BO, which may not be better than our local modeling. Besides, L220-L230 states that we need more queries to improve the gradient estimation.


Thanks for your insightful suggestions. We will incorporate these discussions above into our revised version. We hope our clarification will address your concerns and improve your opinion of our work.

Comment

Thanks for clarifying on the NTK procedure. I now understand that the NTK-GP is used for gradient estimation to perform local gradient updates instead. I recommend simplifying the writing to make this clear.

Is it possible to explain in more detail how you produced the landscape images, or to provide code? I actually tried this method on my own data, but was unable to get such smooth landscapes. Did you apply some form of local smoothing before rendering the plot?

As for other reviewer scores: It seems that other reviewers gave lower scores, primarily due to raising the issue of more comparisons. I am not bullish on requiring so many comparisons myself (there are probably 10+ different prompt tuning algorithms out there at this point anyways) as long as the method itself makes sense and is clean, so I will keep my current score.

Comment

Thank you so much for your positive feedback and thoughtful suggestions. We sincerely appreciate your recognition and support.


We are glad to hear that our clarification on the NTK procedure was helpful. We will simplify the writing to make this aspect clearer in the revision.


Regarding the smooth landscape visualizations, we are pleased that you found them compelling. Below is a detailed explanation of the process with the code snippet used for generating these plots:

  1. We initialized the space (based on 300 randomly sampled prompt candidates) for each task and extracted the embeddings and their corresponding values.
  2. We then performed a t-SNE transformation to reduce the embeddings to two dimensions (X and Y).
  3. We used griddata from SciPy to linearly interpolate the Z values on a grid. This allowed us to create a smoother surface from these scatter points before rendering.

Here is the code snippet for generating the landscape visualizations:

import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
from scipy.interpolate import griddata

fig = plt.figure(figsize=(6,2), dpi=500)
tasks = ['taxonomy_animal', 'cause_and_effect', 'informal_to_formal']
for i, task in enumerate(tasks):
    # load_init_space returns, for each task, the prompt candidates' embeddings
    # ('data') and their evaluated function values ('function_value').
    emb_space = load_init_space(task)

    ax = fig.add_subplot(1,len(tasks), i+1, projection='3d')
    data = np.array([emb_space[emb]['data'].tolist() for emb in emb_space])
    Z = np.array([np.asarray(emb_space[emb]['function_value']).item() for emb in emb_space])

    # We first perform t-SNE on the embedding representation
    tsne = TSNE(n_components=2)
    transformed_data = tsne.fit_transform(data)
    # Store the 2D t-SNE feature as x and y.
    X, Y = transformed_data[:, 0], transformed_data[:, 1]

    xi = np.linspace(X.min(), X.max(), 60)
    yi = np.linspace(Y.min(), Y.max(), 60)
    xi, yi = np.meshgrid(xi, yi)

    # Linearly interpolate Z values on the grid
    zi = griddata((X, Y), Z, (xi, yi), method='linear')

    my_cmap = plt.get_cmap('YlOrRd')
    surf = ax.plot_surface(xi, yi, zi, cmap=my_cmap,
                           edgecolor='none', antialiased=True,
                           linewidth=0, rstride=1, cstride=1,
                           vmin=0, vmax=1)
    ax.view_init(azim=20)
    plt.setp(ax.get_xticklabels(), visible=False)
    plt.setp(ax.get_yticklabels(), visible=False)
    plt.setp(ax.get_zticklabels(), visible=False)
    ax.set_axis_off()

plt.show()
Review
Rating: 6

This paper addresses prompt optimization for a black-box API LLM. The paper empirically investigates the objective function landscape of the prompt optimization problem and derives two insights: (I) local optima are usually prevalent and well-performing, and (II) the choice of the input domain affects the identification of well-performing local optima. Based on these insights, a novel local prompt optimization algorithm based on NTK-GP is proposed. Empirical comparisons have been conducted to show the performance differences between the proposed and baseline approaches.

Strengths

  • a novel and efficient prompt optimization algorithm targeted for a black-box API LLM

  • promising performance over baseline approaches

  • analysis and visualization of the objective function landscape using t-SNE

Weaknesses

  • The clarity of the algorithm could be improved (see the question part)

  • The validity of the derived insights is not sufficiently high (see the question part)

Questions

L 118. “We then investigate the function surface (i.e., accuracy landscape) using two different embeddings for prompt candidates in Fig. 4 (more details in Appx. D.1.2) where the embeddings are mapped into a 2-dimensional domain using the t-SNE for better visualization.” It is not clear how the local optima in the 2D t-SNE space are related to the local optimality of the objective function in the original space. Because the relation is not clear, I am not sure whether the insight derived in this paper is valid or not.

It is not clear how Section 4.1 is related to Insight (II). What is the novelty of this part? This question is also related to the next question.

L165. “We store (z, v) … for constructing the one-to-one inverse mapping.” Because h is a mapping from a discrete space to a continuous space, there does not exist a bijection theoretically. Therefore, it is not clear what the authors mean by this sentence. It is also not clear how this goal is achieved. Please clarify this point.

L197. “theta_0 is the initialized parameters …“ Is it trained? If so, how and when?

Table 1. I couldn’t find the explanation of ZOPO_{GPT}. What is the difference between ZOPO and ZOPO_{GPT}?

The performance of the baselines and the proposed approaches is compared only up to 200 queries. It would be interesting to see how it changes if more queries are allowed.

Limitations

A limitation has been addressed in the conclusion.

Author Response

We thank Reviewer bDUw for recognizing that our algorithm is novel and efficient, and its performance is promising. We would like to address your concerns below and hope our response will improve your opinion of our work.


[Q1] Validity of our insight

Thank you for your insightful comment. We address your concern below:

  1. The t-SNE is well-suited for visualizing the landscape because it generally preserves the local structure of original data [R1], including the relative positions of local optima.

  2. To further support this claim, we present an additional result in Figure[R] 1 of the rebuttal PDF to show that our derived insight is indeed valid.

    • We first identify the local optima in the original space by comparing $\widetilde{F}(z)$ of each point $z$ with those of the k-nearest neighbors (k=10) around $z$ (a minimal sketch of this criterion is given after this list).
    • We then apply the same t-SNE transformation to these identified local optima points and highlight them in red with an "x" marker in the 2D space.

    This visualization demonstrates that the local optima identified in the original space generally correspond well with those "peak values" in the 2D t-SNE space, confirming that our derived insights are indeed valid.
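For concreteness, here is a minimal sketch of this nearest-neighbor criterion (the function name and the exact tie-breaking rule are our illustrative assumptions; Z holds the original-space embeddings and F their function values):

import numpy as np
from sklearn.neighbors import NearestNeighbors

def find_local_optima(Z, F, k=10):
    # Mark a point as a local optimum if its function value is at least as large as
    # those of its k nearest neighbors in the original embedding space.
    nbrs = NearestNeighbors(n_neighbors=k + 1).fit(Z)  # +1: each point is its own nearest neighbor
    _, idx = nbrs.kneighbors(Z)
    return np.array([F[i] >= F[idx[i, 1:]].max() for i in range(len(Z))])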

We will include this clarification and the updated visualization to enhance our manuscript.

[R1] Van der Maaten, L., & Hinton, G. (2008). Visualizing data using t-SNE. Journal of machine learning research, 9(11).


[Q2] Section 4.1 and its novelty

Recall that Insight (II) emphasizes the importance of both the generation and representation of prompt candidates. Motivated by Insight (II), we are the first to integrate both the generation and representation of prompts within a unified problem formulation (refer to our Eq. 2), which then leads to a more general domain transformation for prompt optimization in our Section 4.1 (one of the novel contributions of our paper). This is in contrast to previous works such as APE and INSTINCT, which focus solely on either generation or representation, but not both. Our domain transformation instead allows for improved prompt optimization by leveraging not only the remarkable generation ability of any type of LLM (white/black-box, like ChatGPT) but also the impressive representation ability of existing embedding models (refer to our introduction and Sec. 4.1), which has been widely supported by our empirical results in Sec. 5. For example, as shown in Table 1, our approach, denoted as $\text{ZOPO}_{\text{GPT}}$, achieves promising performance on many complex tasks. We believe this contribution can inspire the field and may benefit future research.

[Q3] One-to-one mapping

To clarify, the mapping is in fact built on a finite set. Specifically, we pre-generate a finite set of unique prompt candidates (i.e., $\mathcal{V}=\{v\}$), that is, each text prompt $v$ in this set is unique. With these unique prompt candidates, the complex embedding model will usually produce the corresponding unique embedding vectors $\mathcal{Z} = \{z\}$. Our mapping $h$ is then defined on these two finite sets $\mathcal{V}$ and $\mathcal{Z}$, i.e., $h: \mathcal{V} \rightarrow \mathcal{Z}$, which therefore leads to the one-to-one mapping. In practice, we only perform the search in the finite set $\mathcal{Z}$ (as shown in Eq. 2), and all the gradient-updated points are projected back into $\mathcal{Z}$, which then finds a unique $v \in \mathcal{V}$ (as shown in lines 201-203).

We will include a detailed clarification of this point in our revised manuscript.


[Q4] theta_0

No, theta_0 is not trained. We use the empirical NTK based on a random initialization of the network parameters to avoid the training process while maintaining a good approximation to the predictions of neural networks, hence preserving their compelling representation ability for the GP regression in our ZOPO. Of note, the effectiveness of the empirical NTK has been widely evidenced from both theoretical and empirical perspectives [2, 15, 33, 34], and is further supported by the compelling performance achieved by our ZOPO in Sec. 5.

[Q5] ZOPO_{GPT} in Table 1

The explanation of $\text{ZOPO}_{\text{GPT}}$ can be found in the third paragraph of Section 5.1. For your convenience, we summarize it as a more straightforward comparison below:

  • ZOPO: we use the Vicuna-13B model for both prompt generation and representation (specifically, the last token embedding). This choice was made to ensure a fair comparison against existing baselines such as InstructZero [3] and INSTINCT [17].
  • $\text{ZOPO}_{\text{GPT}}$: Here, we utilize GPT-3.5 for prompt generation and SBERT for the embedding representation, inspired by our Insight (II). This approach leverages the superior generation ability of GPT-3.5, resulting in significantly higher accuracy on challenging tasks like second_word_letter and sentence_similarity. This demonstrates that our method is capable of performing numerical optimization on ChatGPT-generated prompts.

We hope this clarifies the distinctions between the two variants and highlights the strengths of $\text{ZOPO}_{\text{GPT}}$.

[Q6] More queries

We would like to first clarify that our work follows the same query setting as previous baselines [3, 17, 44] (up to 200 queries), primarily for a fair comparison. Regarding your interest in results with more queries, we additionally performed experiments on 4 GLUE tasks (due to the limited budget and time during the rebuttal period), extending the number of queries to 1000. The experimental results, presented in Table[R] 1 of the rebuttal PDF, indicate that our proposed method ZOPO continues to achieve better or comparable results even in this query-rich setting compared with the 165-query results in Table 4.


With our elaboration and additional results, we hope our response has addressed your concerns and improved your opinion of our work. We are happy to provide more clarifications if needed.

Comment

The things are now clearer and the response is satisfactory. Thanks.

Comment

Thank you very much for the prompt reply! We are happy to hear that our response is satisfactory.

Do let us know if you have any further questions. We would be glad to address them and sincerely hope that our clarifications can improve your opinion of our work.

Author Response

Global Response

We sincerely appreciate the insightful feedback provided by the reviewers, which has significantly contributed to enhancing the quality of our paper. We hope we have addressed all questions raised by the reviewers, providing our clarifications and additional results. In this global response, we have attached an Author Rebuttal PDF file with the table and figure as additional results to support our response. Below, we summarize the strengths of our paper as highlighted by the reviewers:


The reviewers have positively recognized several aspects of our work:

  • The experiments conducted are thorough, rigorous, and comprehensive, demonstrating the promising empirical performance of ZOPO. The ablation studies are relevant and insightful (Reviewer bDUw, GMgT, TQjw).
  • Our designed algorithm is intuitive, novel, efficient, and well-designed (Reviewer bDUw, GMgT, TQjw, TMXX).
  • Our empirical analysis in Section 3 is thorough and well-supported (Reviewer TQjw, TMXX).
  • The visualization method we proposed is insightful and could serve as a crucial tool for studying the black-box prompt optimization landscape (Reviewer GMgT).
  • The proposed input domain transformation simplifies the optimization problem (Reviewer TMXX).
  • The paper is well-written and straightforward, with clear explanations of the motivation and reasoning (Reviewer GMgT, TMXX).

We would like to express our gratitude once again to the reviewers for their constructive feedback. We hope that our responses and clarifications have further improved your opinion of our work.

Best regards,

The Authors

Final Decision

There were no major outstanding concerns following the author rebuttal.

Reviewers were unanimously impressed by the visualization techniques, since they may be broadly applicable across LLM-based string-optimization tasks. The result showing the coincidence of local optima in the original and projected space is quite satisfying. This insight about the structure of the optimization space, instead of simply treating it as an end-to-end black-box as is done in prior work, is a major contribution of this paper.

For example, its ability to characterize the smoothness vs. ruggedness of a prompt generation scheme can help diagnose optimization pathologies, and should lead to more rapid development of effective implementations, such as the one introduced in the paper: ZOPO.

I generally agree with reviewer ideas that ZOPO itself may not necessarily become the most common optimizer in practice, since the method is (1) somewhat more mathematically complicated than simpler approaches that stay completely in text space (e.g. OPRO), and (2) text-only methods often do perfectly fine (and it may be easier for practitioners to intuitively interpret their natural language behavior). However, the experimental results convincingly show that if one were forced to select a method today, without knowing the task a priori, ZOPO would be a good choice, and likely the optimal one.

Overall, I expect the paper to have a meaningful impact in the space of automated prompt optimization.