FunBO: Discovering Acquisition Functions for Bayesian Optimization with FunSearch
We propose FunBO, an LLM-based method that can be used to learn new acquisition functions for Bayesian Optimization that perform well across a variety of experimental settings.
Abstract
Reviews and Discussion
This paper extends the LLM-based FunSearch algorithm by proposing a novel framework, FunBO, for discovering acquisition functions in Bayesian Optimization (BO). FunBO's main pipeline is based on an iterative optimization scheme to incrementally improve a baseline acquisition function. At each iteration, the algorithm proposes new acquisition functions, evaluates them through a complete BO loop, and records their performance in a database, ultimately selecting the best function for usage. FunBO is tested on standard BO benchmarks, specific out-of-distribution (OOD) tasks, and in a few-shot adaptation scenario.
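For concreteness, the overall loop can be sketched roughly as follows; the names (`funbo_search`, `score_af`, `llm_propose`) and signatures are illustrative placeholders based on the summary above, not the paper's actual implementation.

```python
# Rough sketch of the FunBO-style evolutionary loop described above.
# `score_af` is assumed to run a full BO loop on every training objective and
# return a scalar fitness; `llm_propose` is assumed to ask an LLM to mutate
# parent AF source code into a new candidate AF.
from typing import Callable, Dict, List


def funbo_search(
    initial_af: str,
    score_af: Callable[[str], float],
    llm_propose: Callable[[List[str]], str],
    n_iterations: int = 100,
    n_parents: int = 2,
) -> str:
    database: Dict[str, float] = {initial_af: score_af(initial_af)}
    for _ in range(n_iterations):
        # Use the currently best-scoring AFs as in-context examples for the LLM.
        parents = sorted(database, key=database.get, reverse=True)[:n_parents]
        candidate = llm_propose(parents)  # new AF as Python source code
        if candidate not in database:
            database[candidate] = score_af(candidate)  # full BO evaluation
    # Return the best acquisition function found so far.
    return max(database, key=database.get)
```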
Strengths
Explainable Acquisition Functions: in the context of meta-learning acquisition functions with neural networks, FunBO provides a pipeline for creating interpretable acquisition functions.
Exhaustive Benchmarking across Domains: The paper thoroughly benchmarks FunBO across different application settings. Notably, the generated acquisition functions may also work out of the box for arbitrary objective functions, potentially addressing generalization issues common in meta-learning frameworks.
Weaknesses
Restrictive Acquisition Function Input: Unlike previous approaches, this framework only uses the model’s marginal predictive distribution (predictive mean and variance) as inputs for the acquisition function. This restriction may limit its applicability compared to more complex but empirically powerful acquisition functions (e.g., Entropy Search, Knowledge Gradient).
Limited Comparison with Recent Meta-BO Work: While the paper includes comparisons with MetaBO [1], adding comparisons with more recent meta-acquisition functions (e.g., [2]) would provide a clearer picture of FunBO’s empirical performance.
Lack of Theoretical Insight: As with most LLM-based models, FunBO primarily relies on an empirical pipeline, without providing theoretical insights into the underlying methodology. It remains unclear to what extent this method will be useful in practice.
Computationally Intensive Training: Similar to other RL-based meta-acquisition function frameworks, FunBO’s iterative learning scheme requires a full BO loop evaluation at each iteration, which is computationally cumbersome.
[1] Volpp, M., Fröhlich, L. P., Fischer, K., Doerr, A., Falkner, S., Hutter, F., & Daniel, C. (2019). Meta-learning acquisition functions for transfer learning in Bayesian optimization. arXiv preprint arXiv:1904.02642.
[2] Maraval, A., Zimmer, M., Grosnit, A., & Bou Ammar, H. (2024). End-to-end meta-Bayesian optimisation with transformer neural processes. Advances in Neural Information Processing Systems, 36.
Questions
Surprisingly good random search on OOD-Bench: one surprising result is the strong performance of random search, suggesting it performs well across a wide range of problems. This raises concerns about the experiment's credibility. Could the authors elaborate on why random search appears so effective in this context? This essentially suggests that BO is not useful for many of these problems.
Given an expensive black-box function $f: \mathcal{X} \to \mathbb{R}$, where $\mathcal{X} \subset \mathbb{R}^d$, Bayesian optimization (BO) is a class of algorithms for finding the global minimum $x^* \in \arg\min_{x \in \mathcal{X}} f(x)$. Typically, the two main design choices in BO are a Gaussian process (GP) that is fit to the past function observations $\{(x_i, f(x_i))\}$, and an acquisition function (AF) that quantifies the benefit of a new sample point $x$. These choices are usually ad hoc or based on expert knowledge.
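For readers less familiar with BO, these two design choices look roughly as follows in code; this is a generic textbook-style sketch using scikit-learn, not code from the paper.

```python
# Generic illustration of the two BO design choices (GP surrogate + AF) for
# minimisation; not code from the paper under review.
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF


def expected_improvement(mu, sigma, incumbent):
    """EI for minimisation from the predictive mean/std and the best value so far."""
    sigma = np.maximum(sigma, 1e-12)
    z = (incumbent - mu) / sigma
    return (incumbent - mu) * norm.cdf(z) + sigma * norm.pdf(z)


def bo_step(f, X_obs, y_obs, candidates):
    """Fit the GP to past observations and query the AF maximiser."""
    gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.2), normalize_y=True)
    gp.fit(X_obs, y_obs)
    mu, sigma = gp.predict(candidates, return_std=True)
    x_next = candidates[np.argmax(expected_improvement(mu, sigma, y_obs.min()))]
    return x_next, f(x_next)
```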
In the case where we have a history of similar objective functions, we may use this history to help make these BO design choices for us.
This work proposes FunBO, a pipeline that takes as input a GP model and a set of objective functions, and uses a genetic algorithm (GA) to search the space of acquisition functions and return the best one found. A GA requires specifying a mutation operator and a fitness function. This work uses the Python source-code generation ability of a large language model as the mutation operator: it takes the source code of existing AFs and generates new AFs. The fitness of an AF is determined by executing BO with the GP model on all the objective functions and measuring the quality of the solutions and the number of iterations needed for the BO to converge.
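A rough sketch of such a fitness evaluation; the helper `run_bo` and the exact scoring rule are assumptions for illustration, not the paper's definition.

```python
# Rough sketch of the fitness function described above: run BO with a candidate
# AF on every training objective and reward fast, accurate convergence.
import numpy as np


def fitness(acquisition_fn, objectives, run_bo, budget=30):
    """Average score of a candidate AF over the training objectives.

    `run_bo(objective, acquisition_fn, budget)` is assumed to execute a full BO
    loop (GP fitting + AF maximisation at each step) and return the best value
    found after each iteration; `objectives` pairs each function with its known
    minimum (available for benchmark functions).
    """
    scores = []
    for objective, f_min in objectives:
        best_so_far = np.asarray(run_bo(objective, acquisition_fn, budget))
        regret = best_so_far - f_min
        # Penalise both the final regret (solution quality) and the average
        # regret along the run (a proxy for how quickly BO converged).
        scores.append(-regret[-1] - regret.mean())
    return float(np.mean(scores))
```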
The GA + LLM combination was first proposed in FunSearch, and this paper adapts that framework to acquisition functions and BO.
Experiments are performed and compared to MetaBO, which also learns an optimal AF from past functions via a different procedure.
Strengths
- Intuitive: at a high level the method seems very simple.
- Well written: the paper is clear and easy to follow.
Weaknesses
My concerns are mostly minor and arguably subjective.
- Discretization: the learnt AFs take as input a collection of points; does this not suffer from the curse of dimensionality? Typically AFs are continuously optimized by gradient ascent (see the sketch after this list).
My following points essentially concern the messaging of what is being learnt.
- Structure of past functions: I personally feel the paper should more clearly and explicitly make the distinction between meta-learning a surrogate model and meta-learning an AF. Given an empirical distribution over functions, if they all have the same peak or similar shapes, e.g. linearity, unimodality, or periodicity (structure that could be captured by, e.g., a method like Neural Processes), FunBO does not learn or exploit any such information. At multiple points the paper uses the phrase "structure of past functions", and I feel this is misleading and risks accidentally blurring the boundary between surrogate model and AF. A sentence to clarify this boundary would help, especially for a broader non-expert audience, e.g.:
“Note that a meta-learned surrogate model explicitly learns structure and features from past functions (e.g. minima in the same region of space, unimodality, linearity), and such common structure can be leveraged when fitting a surrogate model to a new function of the same class. FunBO does not itself explicitly learn any structure from past functions; rather, it takes as inputs an existing surrogate modelling method and a history of functions, and outputs the best possible AF. In the experiments in this work we use the most popular choice of surrogate modelling method, a GP with an RBF kernel.”
- Scope of learnable AFs: the AFs only take the predictive mean, predictive variance, and incumbent as arguments. Hence expensive, exotic methods like Entropy Search and KG that require the covariance cannot be learnt (again, see the sketch after this list).
- Why does the AF vary for different function classes? I realise this is out of the scope of this work, and the authors may argue this is an empirical observation from past works. Assume we have two different sets of training objectives and we learn two different acquisition functions from them; then, given one set of predictive means, variances, and incumbent, if we pass these into each acquisition function, why should we get a different recommendation back? Particularly since the input locations are not arguments of the AF and the points are order invariant, the AF has no knowledge of where the points are or where the peak may be. (Personally, my only intuition is that if the GP is a bad model fit for one function class, the AF may increase exploration, as that is all it can do. In which case FunBO is learning AFs that proportionally compensate for the quality of the GP model fit. Similarly, the past works/BO community is a crowd-sourced genetic algorithm making the same mistakes, and this is not specific to FunBO.)
- Conditioned on the surrogate model: following from the above, as the FunBO inputs are a surrogate modelling method and a set of functions, the output AF is conditioned on the choice of surrogate model. I may have missed it, but this should also be explicitly stated. One possible corollary is that changing the surrogate model (new kernel or prior mean function) would also require re-running FunBO to find a new AF (e.g., as above, with a better-suited GP, FunBO can find an AF that trades off exploration and exploitation appropriately).
- Sensitivity for out-of-distribution: in the limitations, the authors state that FunBO was run multiple times to get the out-of-distribution results. I thank the authors for the honesty and transparency; however, I am a little confused, as this wording makes it sound like cherry picking. Is the average of all repeated runs not reported?
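To make the discretization and AF-signature points above concrete, a learnt AF in this framework has roughly the interface below and is maximised over a finite candidate set; the exact signature is my assumption based on the paper's description.

```python
# Illustration of the restricted AF interface and the grid-based maximisation
# discussed in the list above. The signature is an assumption: the learnt AF
# only sees per-point predictive mean/std and the incumbent, never the inputs x.
import numpy as np


def learnt_af(mean: np.ndarray, std: np.ndarray, incumbent: float) -> np.ndarray:
    """A candidate AF: scores depend only on marginal predictions (minimisation)."""
    return (incumbent - mean) + 2.0 * std  # e.g. a simple UCB-like rule


def select_next(gp, candidates: np.ndarray, incumbent: float) -> np.ndarray:
    # The candidate set has to cover the input space, so its size grows
    # exponentially with the dimension (the curse of dimensionality raised
    # above), unlike gradient-based AF maximisation.
    mean, std = gp.predict(candidates, return_std=True)
    return candidates[np.argmax(learnt_af(mean, std, incumbent))]
```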
Questions
- Given two AFs trained on different function classes, if we pass them the same arguments, what is the intuition for why they should differ in their output? Is it that they are compensating for the quality of the model fit?
- the underlying assumption of finding an AF that works out of distribution suggests that the underlying distribution does not matter? There is one and only one true best AF?
- how sensitive is FunBO to the choice of GP model?
- can the authors clarify how the out of distribution results were tested?
Thank you for the thorough response. Most of my points were minor and somewhat subjective, and the responses seem reasonable and, thankfully, contain no surprises.
Regarding the minor point on why two AFs may differ, I am not sure I fully agree with the example, though I think I may have convinced myself with another example.
"An AF trained to optimize functions belonging to this class would have implicitly learned about this pattern and will incentivize exploring the boundary of the input space."
I personally don't see how this is possible. For example, given one discretization of points, the AF could happen to pick one on the boundary; theoretically, one could replace the chosen point with another point away from the boundary that has the same mean and variance, and the AF would pick the new non-boundary point instead (likewise, UCB and EI have no spatial awareness).
Intuitively, the AF takes as input the incumbent and the means and variances. We could visualize this on the scalar output real number line as one data point and a collection of bell curves, and the job of the AF is to pick one bell curve to sample from; it only ever gets to see the scalar output space.
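A tiny numerical illustration of this point, using a standard UCB rule; it is purely illustrative.

```python
# An AF that only sees per-point (mean, std) assigns identical scores no matter
# where the candidates sit in input space: swapping a boundary point for an
# interior point with the same marginal prediction changes nothing.
import numpy as np

means = np.array([0.3, 0.1, 0.5])
stds = np.array([0.2, 0.4, 0.1])
x_boundary = np.array([[0.0], [1.0], [0.5]])  # candidate locations (never seen by the AF)
x_interior = np.array([[0.4], [0.6], [0.5]])  # different locations, same predictions

ucb = means + 2.0 * stds                      # maximisation form of UCB
best = np.argmax(ucb)
# The same index is chosen for both candidate sets, because the AF never saw
# the locations; only the surrogate model could encode spatial structure.
print(x_boundary[best], x_interior[best])
```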
For a positive example, if the functions in the training set all happen to have a common global optimal value and we want to use this to help future optimization runs, then instead of tediously and manually modifying EI/UCB (e.g. truncating Gaussian tails), we could use FunBO to find the empirically best AF.
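For concreteness, one such manual modification might look like the rough sketch below, where `f_star` denotes the shared optimal value, assumed known; this is my hypothetical, not something from the paper.

```python
# EI for minimisation with the improvement capped at incumbent - f_star, so the
# AF stops rewarding hypothetical improvements below the value the function
# class is known to bottom out at. Purely illustrative.
import numpy as np
from scipy.stats import norm


def truncated_ei(mean, std, incumbent, f_star):
    std = np.maximum(std, 1e-12)
    a = (incumbent - mean) / std
    b = (f_star - mean) / std
    # E[min(max(incumbent - Y, 0), incumbent - f_star)] for Y ~ N(mean, std^2)
    # equals E[max(incumbent - Y, 0)] - E[max(f_star - Y, 0)]:
    ei_incumbent = (incumbent - mean) * norm.cdf(a) + std * norm.pdf(a)
    ei_fstar = (f_star - mean) * norm.cdf(b) + std * norm.pdf(b)
    return ei_incumbent - ei_fstar
```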
For a negative example, hypothetically, as a BO practitioner, if one knew that boundaries are the best place to search, and we were using a GP with an RBF kernel and a standard gradient-ascent AF optimizer, and we had to manually customize EI or UCB to exploit boundaries by comparing only means, variances, and the incumbent, I am not sure how one could do this under the same constraints as FunBO. Alternatively, if we had direct access to the input locations, one could add a weighting to artificially amplify boundary regions.
The learnt AF only has access to the scalar output space and therefore can only ever exploit information contained in the output values. However, any information regarding the inputs (e.g. boundaries) is lost; the projection from multi-dimensional inputs to a scalar output is a lossy, non-invertible transformation.
So I now agree that FunBO can learn AFs and some function classes may have structure in their scalar output that can be exploited. Hence different function classes may have different optimal AFs that make different recommendations. On the other hand, input structure is unfortunately lost and not exploitable.
Would the authors agree with these examples and this interpretation?
I apologize for the lengthy and late reply.
If the authors do agree with the above, the paper should be updated accordingly:
- Input-output function structure (unimodality, peaks in certain locations, peaks near boundaries) cannot be exploited.
- However, output structure (e.g. the set of values is bounded/truncated, the range of values has empty regions, or the distribution of values has heavy tails) can be exploited.
The paper proposes the design of novel Bayesian optimization (BayesOpt) acquisition functions by using an LLM to generate the code for those acquisition functions. As the code for the acquisition function is returned, this works towards improving interpretability for BayesOpt methods. Experiments show that the designed acquisition functions lead to better optimization performance than that of traditional acquisition functions and of meta-learned ones.
Strengths
The application of FunSearch to designing BayesOpt acquisition functions is to my knowledge novel. The writing is clear and easy to follow. The experiments are convincing in showing that the generated acquisition functions perform better than included baselines.
Weaknesses
I have two main concerns about the approach the paper takes. First, it seems the settings in which the proposed method can be used are where repeated optimization runs can be conducted, so that the LLM can gain information about the search space and generate well-performing acquisition functions. At least, we would need access to objective functions drawn from the same distribution as the one we are targeting. If I'm dealing with a one-off optimization problem, this seems to exclude the proposed method, which raises questions about the usability of the method.
I also find the authors' argument for the desirability of the generated acquisition functions being interpretable not quite convincing. To me, the general-purpose acquisition functions are quite interpretable, including the ones the authors flagged for being "hard to implement and expensive to evaluate" (entropy search and knowledge gradient). This is because the motivations for the acquisition functions are clear (e.g., reducing entropy of the objective function's optimizer/optimum or increasing the expected one-step posterior mean optimizer), so even if the functional forms of these functions are complicated, I still view them as interpretable. The proposed method, on the other hand, limits itself to acquisition functions whose inputs include the predictive mean and standard deviation, which only offers simple ways to address the exploration–exploitation tradeoff and potentially sacrifices performance. On line 399, is that functional form really that interpretable? Say the acquisition function is not doing well; could interpretability help us modify the functional form in a way that increases performance, for example?
The authors could consider including entropy search policies, knowledge gradient, as well as portfolio-style policies that automatically adjust the trade-off parameter of UCB as baselines.
Questions
- Lines 202–214 sound like we still require a training set of functions from the target distribution, as opposed to what is said on line 192. Do you have any comments on how to gauge a sufficient size for the training set given a specific optimization problem?
- Lines 225–230: why sample only 2?
- Perhaps other choices for the scoring function (e.g., negative cumulative regret) are possible. Do you have any results from ablation studies on this?
- Fig. 3 (right panel) seems too busy. The authors could consider adopting the style of e.g. [1] which shows queries as ticks at the bottom of the plots?
- I'm not so sure Fig. 3 demonstrates the benefits of your method well. Can we see the associated GP predictions? EI seems to be "stuck" at the global optimum, which is not a bad thing. Which value of the trade-off parameter is UCB using? Can't the over-exploitation be fixed by increasing it? This motivates including policies that automatically adjust the trade-off parameter as baselines.
[1] Garnett, Bayesian Optimization, Cambridge University Press 2023.
This paper introduces FunBO, a method based on FunSearch that leverages large language models to automatically design acquisition functions (AFs) in computer code. FunBO takes an initial AF as input and learns a better one by iteratively modifying the initial one based on a limited number of evaluations from a set of objective functions. FunBO formulates the learning of AFs as an algorithm discovery problem, enabling it to explore the space of AFs systematically. Extensive experiments show the superiority of FunBO.
Strengths
- The idea of applying an LLM to help generate AFs is interesting and novel.
- The authors perform extensive computational experiments, demonstrating that AFs generated by the proposed FunBO perform better than general-purpose AFs and are comparable to function-specific AFs.
Weaknesses
- The presentation could be improved and made more consistent. For example, the algorithm in Figure 1 has some parts that are inconsistent with the accompanying diagram in Figure 1. The placement of figures could also be more carefully considered; for example, Figure 1 is at the top of page 3 but is first mentioned at the end of page 4.
- Algorithm pseudocode would be easier to read than the original code.
- Figures 1, 4, 5, 6, and 7 can be more professional.
Questions
- The authors claim that "this is the first work exploring AFs represented in computer code"; are there any works that explore AFs represented in other forms? If so, could you provide a detailed literature review of them?
- It is mentioned that this work focuses on Python programs; can the proposed FunBO be applied to design AFs in other languages?
- Could you clarify the difficulties or challenges of learning AFs written in computer code, and how FunBO deals with them?
- What does it mean by “h(Left)” in the Setup of Figure 1’s algorithm?
In this paper, the authors propose a method to leverage LLMs to programmatically adapt and improve acquisition functions (AFs) in Bayesian optimization (BO). Notably, the proposed method seeds an LLM with a Python implementation of a standard Expected Improvement function and iteratively samples proposed values from a suite of objective functions (not necessarily of the same domain), runs an LLM-generated acquisition function, and scores the output. This method expands previous work, notably FunSearch (Romera-Paredes et al., 2023) and MetaBO (Volpp et al., 2020), by allowing the objective functions (that one wants to optimize) to have differing input dimensions, and by allowing more diverse input types for the generated acquisition functions.
Romera-Paredes et al., "Mathematical Discoveries from Program Search with Large-Language Models", 2023.
Volpp et al., "Meta-learning acquisition functions for transfer learning in Bayesian Optimization", 2020.
Strengths
In general, this paper is well-written and proposes an interesting idea that is supported by a suite of experiments. I am not well-acquainted with the state of this research area, but the motivation and proposed method in this paper are sound to me. By ensuring the output of the LLM is an annotated program, the method maintains the interpretability of the outputted AF. Additionally, the proposed method seems to leverage the "in-context learning" ability of LLMs to shape AFs that interpolate between generic all-purpose AFs (which may be generally suboptimal) and a specifically tuned AF (which may be ungeneralizable beyond the specific case) by training on a suite of objective functions, such as the various optimization test functions in the paper. Without access to a generalist learner, proposing a performant (and interpretable) new AF just by looking at a collection of test functions seems like a hard problem, thus the direction of this paper might lead to reduced development time for downstream compute/evaluation-restricted tasks.
The numerical experiments seem to support that the method is actually sensible (at least on the function classes tested), showing that the iterative scoring/refinement scheme can actually retrieve optimal solutions in relatively few iterations compared to out-of-the-box standard AFs, which can converge slowly, or may not converge at all.
Weaknesses
I have a couple questions (detailed in the respective section), which once clarified may solve some of the listed weaknesses.
The main numerical examples in this paper are low-dimensional optimization test functions and hyperparameter tuning of a small-scale classification problem. I am wondering how this method can be scaled up efficiently to larger-scale problems (or real-life ones), where function evaluations are even more expensive; even ignoring the prompting cost, the evaluation loop is performed on each function of the provided class, which introduces a significant fixed cost to the algorithm. For example, in the context of hyperparameter tuning, it is not clear from the experiments of this paper whether the proposed method can scale to large-scale, say, computer vision or robot learning tasks, where there may be far more moving parts (dimensions) to the BO problem than tested in this work. Some more evidence demonstrating scale transfer, a la mu-parameterization (Yang et al., 2022), showing that FunBO trained on smaller-scale problems can actually generate AFs that zero-shot transfer to the larger-scale problem at hand, would be very enlightening on this front.
Another related issue that I do not see addressed in the paper is whether (or to what extent) the proposed method is scale-sensitive. In many settings (e.g. robot simulations), the scales of the various parameters of the problem vary wildly and are to some extent arbitrary, depending on the choice of unit. However, because an LLM is used to generate the AF, it is not clear to me how sensitive the proposed method is to these possible variations of scale, which are thus a possible source of covariate shift between the training and test tasks.
Yang et al. "Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer", 2022.
Questions
In addition to looking for clarification on the weaknesses above, I have the following questions:
- Can this be extended to generating optimization algorithms? At a high level, instead of determining a procedure to select the next point only from sampled objective values, one can envision using a similar scheme to generate code for a descent step. If one proposes a class of building blocks, e.g. momentum, regularization, modular norms (Large et al., 2024), can this scheme be extended to generate optimizer code?
- How well does the proposed method perform on (higher-dimensional) landscapes that are potentially very non-smooth and sensitive to collapse? For example, does the proposed scheme show promise for settings such as adversarial training or robot reinforcement learning that are otherwise challenging to tune manually?
Large et al., "Scalable Optimization in the Modular Norm", 2024.
This paper considers the problem of discovering acquisition functions for Bayesian Optimization using LLMs. Since LLMs can generate code proposals, the key idea behind the proposed approach is to wrap search around them. The paper employs the recently developed FunSearch for this purpose. Experiments comparing FunBO with some existing acquisition functions and Meta BO methods show positive results.
This is an interesting research direction. The approach is simple (an application of FunSearch to the BO setting) and the paper is well-written and generally well done. The reviewers are mostly leaning towards weak accept. They asked very good questions and the author rebuttal answered most of them. I also took the note to the AC into account in my decision-making.
Based on my own reading of the paper, I feel the paper is borderline and could be improved some more on the following aspects to increase the impact of the paper.
- Ideally, we want to discover acquisition functions that generalize to a set of test tasks. The current evaluation using OOD-Bench doesn't seem comprehensive enough to stress-test the limits of the method. I suggest adding more test tasks (different input spaces including continuous, discrete, and hybrid; and more variety of functions). This is also related to the point reviewer Lcw4 was making about output structure vs. input-output function structure. My point is that the paper could have done more experiments even on the output structure aspect. This is important to understand when we can use this method in practice (safely).
- There were previous attempts at discovering acquisition functions using Reinforcement Learning. See this NeurIPS-2021 paper for example: https://openreview.net/forum?id=MSr3u_FCRW The paper will be strengthened by comparing with representative approaches from this line of work to understand their pros and cons w.r.t FunBO.
- It will be good to add some recent acquisition functions such as entropy search for comparison.
- Optionally, it will be great to see how FunBO discovers AFs for different BO settings (e.g., multi-objective, multi-fidelity, and constrained optimization).
Therefore, I'm recommending to reject this paper and strongly encourage the authors to improve the paper based on the reviewers' feedback for resubmission.
Additional Comments from Reviewer Discussion
This is covered in the meta review.
Reject