Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation
We use language models to write code that recursively improves its ability to improve code.
Abstract
Reviews and Discussion
This paper proposes using GPT models to self-improve an improver program that improves the solution code for five different problems, given a utility function for each problem. The aggregated utilities measure the performance of the solutions to each problem and serve as the signal for the improver to improve itself. The improver code improves itself a few times and is then used to improve the solution code. The results demonstrate that GPT-4 proposes and implements self-improvement strategies such as beam/tree search, genetic algorithms, and simulated annealing. Results using GPT-4 show a diminishing increase in aggregated (meta-)utility within four iterations, whereas results using GPT-3.5 show a decreasing aggregated utility.
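For concreteness, here is a minimal sketch of the recursive loop described above. This is illustrative pseudocode with hypothetical names, not the authors' implementation: in the real system the improver is a source-code string rewritten by the language model, whereas here it is represented as a plain callable for brevity.

```python
# Illustrative sketch (hypothetical names, not the authors' code).
# An "improver" is a callable: improver(utility, solution, lm) -> improved solution.

def meta_utility(improver, tasks, lm):
    """Average downstream utility of the solutions an improver produces across tasks."""
    scores = []
    for task_utility, seed_solution in tasks:
        improved_solution = improver(task_utility, seed_solution, lm)
        scores.append(task_utility(improved_solution))
    return sum(scores) / len(scores)

def self_improve(seed_improver, tasks, lm, iterations=4):
    """Recursively apply the current improver to itself, scored by the meta-utility."""
    improver = seed_improver
    for _ in range(iterations):
        improver = improver(lambda cand: meta_utility(cand, tasks, lm), improver, lm)
    return improver
```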
Strengths
The paper is well-written, and the problem formulation and method are precise. At a high level, the formulation is elegant, functional, and novel in the context of GPTs. The paper clearly distinguishes between improving a solution and improving an improver (scaffolding) that improves a solution. The work addresses safety by adding sandbox safety measures.
Weaknesses
The improver's signal for improvement is a single real value: the aggregation of the utilities of solutions to different problems. Moreover, the utilities of the solutions to each problem may be on different scales, yet they are linearly combined.
- Different tasks have utilities with different values and scales.
- Aggregating utilities of different tasks to a single number is a minimal input to the improver and may be an insufficient signal for improvement (just one number in each iteration and previous improver code).
- Tasks may not be representative, and their selection is unclear.
- Each utility instantiates data for its problem; for example, Maxcut instantiates 3 random graphs with 300 nodes each. It's unclear how the data and instances are selected. Taking points 3 and 4 together: how many tasks and data instances, and of what types, are required?
- Figure 4a is missing iterations beyond T=4.
- The approach assumes many solution variants are generated for each task, and it's unclear how much of the "improvement" is due to increasing the number of variants.
- It's unclear how much of the "improvement" is due to GPT-4 being a generic language model that may improve each selected task rather than generic recursive self-improvement.
- The paper is well written, and the presented formulation is elegant at a high level; however, it is missing key details of the self-improvement.
Questions
- How sensitive are the results to changes in the prompts?
- How does going wider (range of tasks and data) affect going deeper (iterations of recursive self-improvement)?
- Will the link to the repo become available?
Thank you for your insightful feedback and constructive points. We sincerely appreciate the care you put into this review.
The improver's signal for improvement is a real value, which is the aggregation of utilities of solutions to different problems… Different tasks have utilities with different values and scales.
Great point. We updated the paper so that all utilities are normalized to a [0, 1] scale (more generally, one could use any bounded range). This is an issue faced by any performance benchmark that measures performance across diverse tasks.
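To illustrate the normalization described here, a minimal sketch, assuming each task's raw utility has known finite lower and upper bounds; the function names are ours, not the paper's:

```python
def normalize_utility(raw_utility, lower_bound, upper_bound):
    """Wrap a raw task utility so its values are clipped and rescaled to [0, 1],
    allowing utilities from tasks on different scales to be linearly combined."""
    def normalized(solution):
        value = raw_utility(solution)
        value = min(max(value, lower_bound), upper_bound)  # clip to known bounds
        return (value - lower_bound) / (upper_bound - lower_bound)
    return normalized
```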
Aggregating utilities ... may be an insufficient signal for improvement (just one number in each iteration and previous improver code).
Indeed, this is a fundamental challenge in all multi-task learning (i.e., task weighting). However, under some weak assumptions, and with the range of possible utilities constrained to be between 0 and 1 (rather than all real numbers), we have updated the appendix with a discussion showing some theoretical bounds on the expected generalization when the tasks are independent samples from the task distribution.
Tasks may not be representative, and their selection is unclear.
We discuss our reasoning for selecting the tasks in their respective sections of "Experiments and Results." We will clarify in the revision that we select less-well-known tasks for evaluation because, for better-known tasks like efficient SAT solving, modern language models like GPT-4 have almost certainly trained on many implementations (for example, searching GitHub for "SAT solver" returns over 3000 repositories, while "parity with noise" returns only 6) and hence are more likely to be overfit to them. It's true that we don't cover the full space of all challenging algorithmic problems, and additional tasks could improve the representativeness of our evaluation; we are happy to evaluate more transfer tasks if there are any particular ones you'd be interested in seeing.
It's unclear how the data and instances are selected. 3 and 4 together: how many tasks, data instances, and their types are required?
This is a great point and an important part of the experimental design. We’ve incorporated this into the paper in supplementary experiment details. In short, we selected parameters such that the problem was approachable (i.e., the base improver could at least sometimes improve performance over the base performance) but non-trivial (the model should not immediately get perfect performance).
Figure 4a is missing iterations beyond T=4.
Due to the number of GPT-4 calls necessary per iteration (roughly three thousand calls per iteration per run), we only performed 4 iterations on the aggregated runs. We'll highlight the significant cost of running this algorithm in the limitations, and we have added a discussion of budgets in the section where we introduce STOP, as well as a more formal discussion in the appendix. However, in the context of framing STOP as a pre-optimization technique, we note that this may ultimately save compute.
It's unclear how much of the "improvement" is due to [improving] each selected task rather than generic recursive self-improvement… How does going wider (range of tasks and data) affect going deeper (iterations of recursive self-improvement)?
Thank you for this critical question: if we view this from the perspective of trying to find the best possible solution for a particular downstream task, many of our decisions are likely inefficient. For many tasks, if we simply repurposed all of the language model and utility calls used in self-improvement, and instead generated more attempts at that particular problem, then the resulting solution would perform at least as well as one generated by a self-improved improver. Given this, we've extended the problem statement and added a discussion of STOP as a “pre-optimization” technique, and some corresponding generalization bounds in Appendix A.2. We also note that, quantitatively, our transfer results empirically suggest some generalization. Lastly, we’ve evaluated two additional baselines to contextualize the results, namely a chain-of-thought baseline, similar to the current seed improver but with only one attempted improvement, and a greedy iterative improver (i.e., make the maximum permissible calls, select the best improvement, repeat as the budget allows). Across ten runs, the CoT baseline has a meta-utility of 57.7%±3.0% when it doesn’t error (49.6%±3.5% with errors), while the greedy iterative improver scores 64.2%±0.9%.
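For concreteness, here is a sketch of what we mean by the greedy iterative baseline. The helper, prompt wording, budgets, and sampling details are illustrative assumptions rather than our exact implementation; `lm` is assumed to be a text-in, text-out callable.

```python
def propose_improvement(lm, solution):
    # Hypothetical helper: prompt the language model for a revised solution.
    return lm("Improve the following program and return only code:\n" + solution)

def greedy_iterative_improver(utility, solution, lm, rounds=3, samples_per_round=8):
    """Each round, sample several candidate improvements with the language model
    and greedily keep the highest-utility one; repeat while the budget allows."""
    best_solution, best_score = solution, utility(solution)
    for _ in range(rounds):
        for _ in range(samples_per_round):
            candidate = propose_improvement(lm, best_solution)
            score = utility(candidate)
            if score > best_score:
                best_solution, best_score = candidate, score
    return best_solution
```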
How sensitive are the results to changes in the prompts?
Good point! We do touch on some specific considerations in the paragraphs labeled "Designing the seed improver" and "Describing the utility." However, a systematic evaluation of prompt sensitivity is left to future work, and we have added a note about this in the limitations.
Will the link to the repo become available?
Yes, we will release the code publicly!
The paper proposes a method that iteratively (meta-)optimises a program. Initially, this program implements a simple optimisation heuristic: it accesses an LLM via an API and prompts the LLM for changes to the program that yield higher utility. The utility measures the performance of the program on some downstream optimisation task like 3SAT or Maxcut. The program is replaced with the output after running the program with its own code as input. The proposed method is tested with GPT-4 and GPT-3.5-turbo. Some of the programs found through this procedure mimic well-known optimisation algorithms like genetic algorithms or simulated annealing.
Strengths
Prompting the LLM for better methods to prompt an LLM is a promising direction given the success of handcrafted prompt strategies like chain-of-thought.
Weaknesses
It is unclear what the connection between the input program (initial_solution) and the output program (best_solution) is. Is the seed program more helpful than simply prompting the LLM for a program that maximizes some utility (possibly with example programs that demonstrate the task, similar to the few-shot setting for LLMs)? I understand that the goal here is to design a method that can recursively improve itself, but the significance of the results seems questionable to me if it is possible to attain the same results with a conceptually simple prompt.
The experimental results are limited to GPT-4 and GPT-3.5-turbo, which are both closed source and AFAIK can change over time. This means that the results will likely not be reproducible. I'd suggest running some additional experiments with open source models that offer the possibility of reproducing the results.
The experiments are restricted to simple problem settings (albeit hard optimization problems) like LPN, 3SAT, or Maxcut. This is quite a restricted scope when compared to claims such as "... a method in which code that applies a language model to improve arbitrary solutions is applied recursively to improve itself." from the introduction. I would suggest toning down the writing, especially in the introduction, as it currently suggests a level of generality that is not supported by the experiments.
Considering that the STOP method manages to find well-known optimization algorithms like genetic algorithms or simulated annealing, I get the impression that all that is really happening here is that the LLM is prompted to return some known optimisation heuristic and adapt it to use LLMs. The recursive nature of STOP does not appear to be important here. Can the authors provide suitable ablations or baselines to show that recurrence is indeed important? This is also related to my first point.
Questions
Why is it important to choose a less-well-known task as the meta optimization objective? Is the expectation here that GPT-4 has seen less training data on this topic?
How were the qualitative examples chosen? Always the final solution? Did the authors go through all generated programs manually and pick the ones that recognisably correspond to a known optimisation algorithm?
Thank you for your very detailed comments and suggestions. We feel that your feedback has been incredibly helpful in strengthening the rigor and clarity of our work. We've tried to address all of your comments and questions in the revised text, but please let us know if any of our responses are unclear or if you have any additional questions or concerns.
It is unclear what the connection between the input program (initial_solution) and the output program (best_solution) is. Is the seed program more helpful than simply prompting the LLM for a program that maximizes some utility (possibly with example programs that demonstrate the task, similar to the few-shot setting for LLMs)? I understand that the goal here is to design a method that can recursively improve itself, but the significance of the results seems questionable to me if it is possible to attain the same results with a conceptually simple prompt.
Excellent points, and we strongly agree that this is a really important point to make clear! First, we've revised the problem statement to emphasize the improver formulation and further highlight its equivalence to the maximization formulation in the Appendix subsection "Analysis of equivalent maximization formulation," where we also discuss how one can think of maximization recursively. However, note that in practice, an initial solution provided to the improver may be unlikely for even a good maximizer to generate, allowing for what is essentially a "warm start." As a result, we do not see a maximizer as necessarily a simpler solution. That said, we believe that better understanding the optimization dynamics of maximizers, and how these dynamics differ from those of improvers, would be valuable future research.
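As a rough sketch of the relationship between the two formulations (the notation below is ours and may not exactly match the paper's): an improver maps a utility and an initial solution to an improved solution, its meta-utility is its expected downstream utility over tasks, and a maximizer can be viewed as an improver with the initial solution fixed.

```latex
% Sketch; notation ours, not necessarily the paper's.
\hat{u}(I) = \mathbb{E}_{(u,\, s)\,\sim\, \mathcal{D}}\!\left[\, u\!\left(I(u, s)\right) \right]
  \quad \text{(meta-utility of an improver } I\text{)}

I_{t+1} = I_t\!\left(\hat{u},\, I_t\right)
  \quad \text{(recursive self-improvement step)}

M_I(u) := I(u, s_0)
  \quad \text{(maximizer induced by an improver, with a fixed seed solution } s_0\text{)}
```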
The experimental results are limited to GPT-4 and GPT-3.5-turbo, which are both closed source and AFAIK can change over time. This means that the results will likely not be reproducible. I'd suggest running some additional experiments with open source models that offer the possibility of reproducing the results.
Thanks for pointing this out! The OpenAI API in fact offers timestamped models to facilitate reproducibility. We have added the following sentence to the first paragraph of the experiments section (Sec. 5): "Timestamped versions of the two models we utilize, gpt-4-0314 and gpt-3.5-turbo-0301, are available within the OpenAI API, to facilitate reproducibility." We have also added a sentence to our limitations section addressing this limitation. Regarding open-source models, given that the best open-source models have coding performance at or below the level of GPT-3.5 on most benchmarks [1], it is unlikely that any of today's open-source models will give improvements using STOP. We strongly suspect that using open-source models such as Mistral [2] would not result in improvement; however, if you believe it would strengthen the results of this paper, we would be happy to run it.
The experiments are restricted to simple problem settings (albeit hard optimization problems) like LPN, 3SAT, or Maxcut… I would suggest toning down the writing, especially in the introduction, as it currently suggests a level of generality that is not supported by the experiments.
Thank you again for this excellent suggestion. We agree and have made several changes throughout the introduction to emphasize the more constrained scope of the approach in the paper. Here are some of the changes we’ve made to the introduction:
- Added "within a defined scope" to the 2nd paragraph
- "demonstrate improvement" => "indicate potential improvements … within the specific contexts of our experiments."
- "The broader concept of RSI dates back at least half a century, formalized by [Good, 1996] and later by [Schmidhuber, 2003], but our work represents a more modest and specific application of these ideas."
- Contribution (b) "demonstrating that this system…" => "(b) providing a case study where a system”
- "Further explorations in \cref{sec:circumvent} measure the rate at which the model attempts to disable a sandbox flag, providing early but intriguing findings in this area."
Why is it important to choose a less-well-known task as the meta optimization objective? Is the expectation here that GPT-4 has seen less training data on this topic?
Your intuition is exactly right. We will clarify in the revision that we select less-well-known tasks for evaluation because for better-known tasks like efficient SAT solving, modern language models like GPT-4 have almost certainly trained on many implementations (for example, if you search for “SAT solver” on GitHub over 3000 SAT repositories are returned, but only 6 are returned when you search for “parity with noise”) and hence are more likely to be overfit to those problems.
How were the qualitative examples chosen?
We performed a manual inspection on a random subset of the improvements from a large set of samples generated from the seed improver as well as those proposed by the model across the iterations throughout our experiments (e.g., in generating the meta-utility evaluation curves). However, we’ve now added an additional qualitative discussion of the trajectories corresponding to the iterative improvements to the Appendix, in the subsection “Improvements across Iterations.” If you believe it would be valuable, we would also be happy to select a large (e.g., 100-item) random sample and manually categorize the types of improvements observed, in order to give a more quantitative description of the types of improvements proposed by the model.
I get the impression that all that is really happening here is that the LLM is prompted to return some known optimisation heuristic and adapt it to use LLMs. The recursive nature of STOP does not appear to be important here. Can the authors provide suitable ablations or baselines to show that recurrence is indeed important?
There seem to be three closely connected points tied together here.
First, there is the question of the impact of the iterative nature of STOP. We hope that the previous response as well as the main results in the paper have addressed this first point. We’ve also evaluated two additional baselines to help contextualize the results, namely a chain-of-thought baseline which is similar to the current seed improver but with only one attempted improvement, and a greedy iterative improver (i.e., make the maximum permissible calls, select the best improvement, repeat as the budget allows). Across ten runs, the chain-of-thought baseline has a meta-utility of 57.7%±3.0% when it doesn’t error (49.6%±3.5% if we include errors), while the greedy iterative improver scores 64.2%±0.9%.
Then, there is a more precise question about the value of recurrence in particular, as opposed to other iterative strategies. However, this second point is deeper and certainly warrants further investigation. We chose to define the next improver as the application of the current improver to itself, but we could have just as easily defined it in terms of either 1) the seed improver applied to the current improver or 2) the current improver applied to the seed improver. This was a decision we made after careful consideration, in the interest of studying self-improving code generation. However, we acknowledge (and strongly agree) that understanding how this affects the learning dynamics of the algorithm would be valuable future research.
Finally, there is the question of the novelty of the proposed algorithms. Whether or not the fundamental optimization techniques are novel, we would suggest that the automatic search for, and application of, existing optimization ideas to language model optimization and recursive self-improvement is novel. Many existing scaffolding innovations are also existing optimization techniques applied to LM-based optimization, such as Tree of Thoughts and Parsel. Part of the challenge comes from understanding which aspects of the optimization algorithm map onto which elements of the problem. For example, we found that STOP generated many genetic algorithms; however, only in a small subset of these generated genetic algorithms did it use the language model to perform the crossover and mutation. In many others, it performed mutations by randomly changing characters and crossover by randomly concatenating two randomly-truncated solution strings (see the sketch below). Moreover, almost any new optimization algorithm can be described with reference to existing optimization algorithms, so it is ambiguous when an optimization algorithm is "different enough" to be considered new as opposed to simply a variant of an existing approach. However, we believe that this is an extremely important discussion to have, and it was previously missing from the paper. We have added Appendix K ("On the Novelty of Improvements") to incorporate this discussion.
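To make the character-level mutation and string crossover we observed concrete, here is a minimal illustrative sketch; the generated programs varied, and these exact functions are hypothetical reconstructions rather than verbatim model output.

```python
import random
import string

def mutate(program: str, rate: float = 0.01) -> str:
    """Character-level mutation: replace a small random fraction of characters."""
    chars = list(program)
    for i in range(len(chars)):
        if random.random() < rate:
            chars[i] = random.choice(string.printable)
    return "".join(chars)

def crossover(parent_a: str, parent_b: str) -> str:
    """Crossover by concatenating two randomly truncated solution strings."""
    cut_a = random.randint(0, len(parent_a))
    cut_b = random.randint(0, len(parent_b))
    return parent_a[:cut_a] + parent_b[cut_b:]
```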
[1] “Llama 2: Open foundation and fine-tuned chat models,” Touvron et al. 2023
[2] “Mistral 7B,” Jiang et al. 2023
Thanks for the clarifications. The responses have addressed most of my concerns. Given the extensive analysis and that the topic is both interesting and relevant, I revise my initial negative stance to a weak accept. The main concern that remains is that the proposed algorithm is only shown to work with GPT-4 for which we do not know how it was trained. Thus, the results presented here might not generally apply to other LLMs (even ones with similar scale or capability), and for example require specific pre-training data or training techniques.
The paper introduces the Self-Taught Optimizer (STOP), a framework that uses a language model to recursively improve code scaffolding generation in a meta-learning paradigm. The paper demonstrates that, across a variety of algorithmic tasks, STOP generates better code scaffolding over successive iterations, which brings better performance of the downstream code. Moreover, several meta-heuristics are discovered during the meta-optimization process. The paper further investigates the potential misuse of language models, examining how the model circumvents safety measures and engages in reward hacking.
Strengths
- originality: this paper is the first to propose a meta-learning framework for optimizing code scaffolding, aiming at better performance on downstream tasks.
- significance: this paper takes a new and important perspective on revealing the power and potential misuse of large language models by querying the model to optimize the meta-heuristic for solving code tasks. Though the current framework is not general enough to cover all tasks, the observations about reward hacking and sandbox circumvention are of great interest to the AI alignment community.
- clarity: the paper is easy to follow, with a high-level schematic.
Weaknesses
- Missing comparison with two types of baselines. The first is human-designed prompt structures such as Chain-of-Thought and Program of Thoughts: which is better, the prompt structure found by their meta-learning paradigm or these hand-crafted ones? The second is heuristics for coding, such as genetic algorithms. My question is: given a downstream task, can STOP find better meta-heuristics than the common ones?
Questions
- Is there any creativity in the meta-heuristics found by the language model, such as a combination of genetic algorithms and beam search? It would be of great interest to the community if the language model can find new or more task-specific meta-heuristics that bring better performance than human-crafted ones.
Thank you for your suggestions and encouraging comments!
Missing comparison with two types of baselines. The first is human-designed prompt structures such as Chain-of-Thought and Program of Thoughts: which is better, the prompt structure found by their meta-learning paradigm or these hand-crafted ones?
Great suggestions! First, we'd like to note that our current seed improver can be thought of as leveraging chain-of-thought prompting, since our prompt requests that the model first propose an idea for an improvement before implementing the change to the code. However, we think that a purer chain-of-thought baseline, which simply attempts to improve the code once without calling the utility function, is a useful addition, so we've run this to help contextualize the results. In addition, as another reference, we've added a simple greedy iterative prompting baseline (i.e., make the maximum permissible calls, select the best improvement, repeat as the budget allows). Across ten runs, the chain-of-thought baseline has a meta-utility of 57.7%±3.0% when it doesn't error (49.6%±3.5% if we include errors), while the greedy iterative improver scores 64.2%±0.9%. These baselines help provide a scale by which we can understand the magnitude of the improvements.
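As a reference for what the purer chain-of-thought baseline looks like, a minimal sketch; the prompt wording and post-processing are illustrative rather than our exact prompt, and `lm` is assumed to be a text-in, text-out callable.

```python
def chain_of_thought_baseline(solution_source, lm):
    """Single improvement attempt: ask the model to reason about one change and
    return the revised program, with no utility calls and no iteration."""
    prompt = (
        "Here is a program:\n" + solution_source + "\n\n"
        "First, briefly propose one idea for improving it. "
        "Then output the complete improved program in a fenced code block."
    )
    response = lm(prompt)
    # Hypothetical post-processing: keep only the fenced code block, if present.
    fence = "`" * 3
    if fence in response:
        response = response.split(fence)[1]
        if response.startswith("python"):
            response = response[len("python"):].lstrip("\n")
    return response
```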
My question is: given a downstream task, can STOP find better meta-heuristics than the common ones? … Is there any creativity in the meta-heuristics found by the language model, such as a combination of genetic algorithms and beam search? It would be of great interest to the community if the language model can find new or more task-specific meta-heuristics that bring better performance than human-crafted ones.
We'll preface this response with a caveat: whether a proposed idea is creative or novel is ultimately always going to be a subjective judgment and is, to some extent, tied to the training data of the model; this is further complicated by the fact that, for the models we used, we do not have access to the details of their training data. With that being said, we felt that some of the strategies proposed by the model indeed appeared creative and substantially different from the techniques that we had observed. For example, the simulated annealing approach seemed clever in that it implicitly recognized that the underlying global optimization task may require non-monotonic improvement. At the same time, it is not the first time that simulated annealing has been used to optimize a difficult NLP task (e.g., [1]). Similarly, the choice to improve code by attempting to improve one function at a time, rather than revising the whole program at once, also seemed creative, but the idea of decomposing a problem into parts and attempting to solve them individually is certainly not new in this space.
Whether or not the fundamental optimization techniques are novel, we would suggest that the automatic search for, and application of, existing optimization ideas to language model optimization and recursive self-improvement is novel. Many existing scaffolding innovations are also existing optimization techniques applied to LM-based optimization, such as Tree of Thoughts and Parsel. Part of the challenge comes from understanding which aspects of the optimization algorithm map onto which elements of the problem. For example, we found that STOP generated many genetic algorithms; however, only in a small subset of these generated genetic algorithms did it use the language model to perform the crossover and mutation. In many others, it performed mutations by randomly changing characters and crossover by randomly concatenating two randomly-truncated solution strings. Moreover, almost any new optimization algorithm can be described with reference to existing optimization algorithms, so it is ambiguous when an optimization algorithm is "different enough" to be considered new as opposed to simply a variant of an existing approach. However, we believe that this is an extremely important discussion to have, and it was previously missing from the paper. We have added Appendix K ("On the Novelty of Improvements") to incorporate this discussion.
[1] “Unsupervised Paraphrasing by Simulated Annealing” Liu et al., 2019
This paper proposes a meta-learning algorithm in the context of code generation with large language models (LLMs). The algorithm has two components: an outer algorithm or meta-algorithm (the self-improver algorithm) and an inner algorithm (the improver algorithm). The inner algorithm (the improver) optimizes code for downstream tasks by prompting an LLM. The proposed meta-algorithm maintains a single inner algorithm at each iteration instead of a population. In each iteration, the meta-algorithm (the self-improver) passes along the optimization goal and constraints, such as limits on runtime and model queries. The meta-algorithm measures the performance of each improver on the downstream task and returns the best solution (improver) based on a meta-utility function.
Strengths
The key strengths of this research are, first and foremost, the demonstration of a proof-of-concept for using LLMs for self-improvement and meta-learning.
Second, a strength of the paper is that LLMs can optimize code that itself includes calls to the model. The approaches demonstrated by the LLM are similar to evolutionary algorithms, without any explicit exposure in the training data.
Third, the impact of this research is profound, because it demonstrates that LLMs are able to improve themselves, in contrast to the Reinforcement Learning from Human Feedback approach. This research has the potential to have a significant influence in areas where the input data is continually changing, such as education and health.
Fourth, the self-improving behavior can be attributed to emerging capabilities in LLM [1], where "...these tasks are not explicitly included in pre-training" (See Section 5 of Reference [1]) and can be observed only on sufficiently large models. Future researchers can explore more in this direction.
References
- Wei, J., Tay, Y., Bommasani, R., Raffel, C., Zoph, B., Borgeaud, S., Yogatama, D., Bosma, M., Zhou, D., Metzler, D., and Chi, E. H. Emergent abilities of large language models. arXiv preprint arXiv:2206.07682, 2022.
Weaknesses
The base LLM has a large impact on STOP's performance (Section 5.3).
Questions
Do the authors have thoughts on how their proposed solution can be used in meta-learning applications where the utility function may not be describable in natural language?
Thank you for your encouraging feedback and useful questions!
The base LLM has a large impact on STOP's performance (Section 5.3).
Thank you for raising this great point! On one hand, this is not particularly surprising given the consistent advantage that GPT-4 has over other models in prior work [1] and is in line with prior results about emergent properties from scaling [2]. With that said, the fact that these results currently require a large closed language model is indeed a limitation worth highlighting - we have added this to the limitations section. We have also extended Section 5.3 with a more detailed figure.
[1] “GPT-4 Technical Report” OpenAI 2023
[2] “Emergent Abilities of Large Language Models” Wei et al. 2022
Do the authors have thoughts on how their proposed solution can be used in meta-learning applications where the utility function may not be describable in natural language?
This is also a great question. Although we primarily use utility functions described in code and qualitatively find that these are preferable to those in pure natural language, recent work [3] indicates that for certain classes of problems, having users describe their preferences is less effective than having the model simply query the user. Some utilities can also be described by example. Exploring the effectiveness of a method like STOP on this class of problems is a valuable direction for future research.
[3] “Eliciting Human Preferences with Language Models” Li et al 2023
Thanks for your responses to my questions.
This paper proposed the Self-Taught Optimizer (STOP), a nested, iterative method in which a language model improves an "improver," a program that attempts to improve program solutions to certain tasks (e.g., learning parity with noise, 3-SAT), and the improver is also applied to improve itself. Experiments are conducted with GPT-4 and GPT-3.5 models, and the results show that GPT-4 is able to come up with methods such as beam search and genetic algorithms for the improver. On 5 different algorithmic tasks, STOP was shown to be effective in improving the original solutions.
Strengths
S1: The proposed method is interesting and the idea of having a formulation where the improver can improve over itself given a utility function is very cool.
S2: A clear discussion of the limitations and concerns for STOP is presented, which is very helpful.
S3: A number of (i.e., five) tasks are considered in the experiments, and the proposed methods yield non-trivial improvements on all of them.
Weaknesses
W1: My main concern about this work is that it is missing some of the important results needed to understand how well the proposed method works. More concretely,
- How much improvement each iteration makes. For example, it was mentioned that the improvements may not be monotonic (which implies non-greedy, global optimization); such improvement curves would be very interesting to look at, yet only one example is shown in Fig 4.
- Quantitatively, the differences between the results with GPT-4 and GPT-3.5. Section 5.3 only briefly mentions some pitfalls of GPT-3.5 that are not observed for GPT-4, but a concrete comparison would be helpful in understanding how applicable STOP is to other models.
W2: The writing of the paper could be improved. While I appreciate the authors explaining many of the design choices and giving alternative solutions that are not eventually part of the proposed framework (e.g., all those "one can/may ..."), it inadvertently breaks the flow of the paper, making it quite hard to follow sometimes.
W3: I am not entirely convinced about the "novelty" of the improver program generated by GPT-4. Though the authors attempt to compare the proposed methods to some recent research that happened after the knowledge cutoff time, the ideas behind those improver programs (e.g., beam search, genetic algorithm, etc) are not new at all.
W4: Missing related work. A couple more related works on (self-)improving code generation with LLMs could have been mentioned, for example [1] and [2].
[1] https://arxiv.org/abs/2302.07867
[2] https://arxiv.org/abs/2304.05128
Questions
Q1: Is the Maximizer formulation used in the final proposed pipeline? If so, is the notation consistent with the M used in Section 5.1?
Q2: Are there any baseline methods (even heuristics) that you can compare the proposed method to? For example, iterative prompting without using a program as an improver?
Q3: Can you comment on how difficult is it for the model to generalize to the improver solutions, given that it must have seen things like beam search, genetic algorithms, etc during training?
Thank you for the very helpful and constructive comments!
My main concern about this work is missing some of the important results for us to understand how well the proposed method works… Quantitatively, the differences between the results with GPT-4 and GPT-3.5… Are there any baseline methods (even heuristics) that you can compare the proposed method to?
Thank you for the suggestions! We have now added a new chart to help visualize the quantitative performance differences between GPT-4 and GPT-3.5 in Figure 4. We’ve also evaluated two additional baselines to help contextualize the results, namely a chain-of-thought baseline which is similar to the current seed improver but with only one attempted improvement, and a greedy iterative improver (i.e., make the maximum permissible calls, select the best improvement, repeat as the budget allows). Across ten runs, the chain-of-thought baseline has a meta-utility of 57.7%±3.0% when it doesn’t error (49.6%±3.5% if we include errors), while the greedy iterative improver scores 64.2%±0.9%. We’ve added this detail to the paper.
The writing of the paper could be improved… (e.g., all those "one can/may ...")
Thank you for this recommendation. We agree completely, and in the updated pdf have removed the references to alternative but ultimately unused design choices or, where appropriate, replaced them with their corresponding implications. For example, when we discuss the maximizer formulation, we now point to its useful theoretical implications in the appendix.
I am not entirely convinced about the "novelty" of the improver program generated by GPT-4… the ideas behind those improver programs (e.g., beam search, genetic algorithm, etc) are not new at all.
While the fundamental optimization techniques may or may not be novel, the automatic search for, and application of, existing optimization ideas to language model optimization and recursive self-improvement is novel. Many existing scaffolding innovations are also existing optimization techniques applied to LM-based optimization, such as Tree of Thoughts and Parsel. Part of the challenge comes from understanding which aspects of the optimization algorithm map onto which elements of the problem. For example, we found that STOP generated many genetic algorithms; however, only in a small subset of these generated genetic algorithms did it use the language model to perform the crossover and mutation. In many others, it performed mutations by randomly changing characters and crossover by randomly concatenating two randomly-truncated solution strings. Moreover, almost any new optimization algorithm can be described with reference to existing optimization algorithms, so it is ambiguous when an optimization algorithm is "different enough" to be considered new as opposed to simply a variant of an existing approach. However, we believe that this is an extremely important discussion to have, and it was previously missing from the paper. We have added Appendix K ("On the Novelty of Improvements") to incorporate this discussion.
A couple of more related works for (self-)improving code generation with LLMs could have been mentioned
Thank you! We have discussed these references in the revision.
Is the Maximizer formulation used in the final proposed pipeline? If so, is the notation consistent with the M used in section 5.1?
Great point, thank you! In our revision, we have added a new more-theoretical section to the Appendix that uses the maximizer formulation. We have clarified in the revision (Appendix subsection “Analysis of equivalent maximization formulation,” paragraph 2) the equivalence of the maximizer formulation with the improver formulation, and show how it is useful for our extended theoretical analysis throughout the Appendix section “Theoretical Analysis.”
Can you comment on how difficult is it for the model to generalize to the improver solutions, given that it must have seen things like beam search, genetic algorithms, etc during training?
In practice, while the underlying algorithms are often things that the model has seen, many scaffolds that use well-known meta-optimization algorithms for language model solution optimization have been broadly adopted as non-trivial generalizations (for example, the now-widely-referenced “tree of thoughts” paper). We also add a theoretical discussion to elaborate on when we expect this to approach the true optimal meta-utility.
Thanks for the response and additional results.
Though Fig 4 seems to suggest that the proposed method only works for stronger models (e.g., GPT-4), it helps a lot in understanding how much STOP improves over each iteration. Moreover, another way to look at this might be that the recursive self-improving ability might be emergent for model capabilities at GPT-4 level, so I do not necessarily think it is a negative result.
Overall, I like the proposed method, and self-improving of LLMs is one of the domains of great importance. The additional discussions and improved writing are also helpful in understanding this work. I am updating my score accordingly.
We thank all the reviewers for their positive and constructive feedback. We are delighted that the reviewers found the self-taught optimizer perspective “new and important” (Reviewer X5AY), the impact “profound” (yFD6), the formulation “elegant, functional, and novel” (UnQ9), the work “very cool” (443q), and the presentation excellent (yFD6 and UnQ9) and “easy to follow” (X5AY). We are grateful for the numerous suggestions and believe they have significantly strengthened the work. We have revised the PDF accordingly and responded to each point raised individually in the comments below.
As the discussion period wraps up, we wanted to take this chance to highlight the main points raised and how we addressed them:
- Baselines and curves. We responded to comments about additional hand-crafted baselines by introducing more baselines and ablations (443q, X5AY, VZPP, UnQ9), including a chain-of-thought-inspired baseline. We also added a new evaluation of GPT-3.5 performance across iterations.
- Creativity of GPT-4. We addressed questions about the creativity of GPT-4's improvements (443q, X5AY, VZPP) with examples and a discussion of novelty in a new Appendix K (“On the Novelty of Improvements”).
- Generalization and theory. To questions about the generalization properties of STOP and the value of the alternative maximizer formulation (443q, UnQ9), we added Appendix A ("Theoretical Analysis").
- Additional limitations. We extended our limitations section to cover, e.g., the dependence on the base model and the high cost of GPT-4 (yFD6, VZPP, UnQ9). Lastly, we revised the writing to clarify the constrained experimental scope (VZPP).
We again thank the reviewers for their comments and suggestions and hope we addressed them in our responses and revisions. If there are any remaining questions (or if they've all been resolved), please let us know – we are more than happy to clarify.
In this paper, the authors introduced a new approach to iteratively solve complex tasks via LLMs and recursively self-improved scaffolding programs. The approach includes a seed improver that improves an initial program, using a specific utility function and calling the LLM multiple times to obtain the best solution.
Most reviewers appreciated the novelty of the approach of having an improver systematically improve itself using a utility function. The reviewers also appreciated the good discussion of concerns and the implied impacts of the current approach. However, some major concerns in the paper are not fully resolved: experimental results limited to only closed-source models (GPT-4 and GPT-3.5), and applications limited to a quite small set of specific tasks. Hence, it is quite difficult to judge the generalization of the current results and insights. I highly recommend the authors review the feedback from the reviewers, especially on issues related to experiments (additional results on base LLMs, downstream tasks, and baseline results).
Why not a higher score
The overall approach is quite novel, with a reasonable formulation of using an improver for self-improvement with LLMs on optimization tasks. However, the experiments are still limited to only GPT models and a small set of downstream tasks. It is therefore hard to verify the generalization of the current results and insights beyond a proof-of-concept.
Why not a lower score
N/A
Reject