FunBO: Discovering Acquisition Functions for Bayesian Optimization with FunSearch
We propose FunBO, an LLM-based method that can be used to learn new acquisition functions for Bayesian Optimization that perform well across a variety of experimental settings.
Abstract
Reviews and Discussion
This paper introduces FunBO, which builds on the FunSearch framework to discover novel acquisition functions (AFs) for Bayesian optimization (BO). FunBO iteratively modifies an initial acquisition function to improve BO performance. The discovered AFs are provided as explicit code snippets, enhancing interpretability and facilitating direct deployment without additional overhead. The method evaluates candidate AFs based on how accurately and efficiently they identify the global optimum of a set of objective functions.
The authors show that FunBO discovers AFs that generalize well both within specific function classes and across different types of functions. These AFs outperform established alternatives like EI and UCB, while achieving competitive performance against specialized AFs designed for specific function types.
Questions for Authors
- How robust are the discovered AFs to changes in the surrogate model (e.g., different kernel functions or non-GP models)?
- Can you provide any interpretation why the discovered AFs that you showed in the appendix perform better than traditional AFs on the corresponding task?
I think this work is super interesting and I really like the idea, I am willing to increase my score if some of my concerns/questions are properly addressed.
Claims and Evidence
The paper's primary claim is that FunBO can discover explicit, code-based acquisition functions that outperform traditional options like EI and UCB. The experimental results do support this claim across various settings - the discovered AFs generally perform better than standard acquisition functions in most of the tests.
However, the evidence to me is not strong enough. While the authors show performance improvements over traditional AFs, the margins of improvement vary considerably across different test functions, and in some cases appear modest.
The paper makes a strong case for interpretability - providing the actual code for all discovered AFs is compelling evidence for this aspect. But regarding performance claims, a more comprehensive comparison against state-of-the-art methods would have strengthened the paper's position. The authors acknowledge high variance in the quality of discovered AFs, which raises questions about the consistency and reliability of the approach without multiple runs.
Methods and Evaluation Criteria
The evaluation method is generally sound and appropriate in BO:
- The authors use standard benchmark functions and synthetic GP functions that are common in the BO literature.
- The use of normalized average simple regret as the primary metric is standard.
- The experimental design carefully controls for confounding factors by maintaining consistent GP hyperparameters, evaluation grids, and initial designs across all methods.
However, I think the major limitation is the selection of baselines. The authors only compare against classical AFs (EI, UCB, PofI) and some recent meta-learning approaches (MetaBO, FSAF), omitting comparisons with several important recent advances:
- More advanced AFs such as entropy-based methods (e.g., Max-Value Entropy Search), which have shown superior performance in many settings.
- End-to-end methods that jointly learn surrogate models and acquisition strategies.
This gap makes it difficult to assess how FunBO compares to the current state-of-the-art beyond traditional methods. While the focus on comparing against interpretable baselines is understandable given FunBO's emphasis on interpretability, including at least some of these more advanced methods would have provided a more complete picture of FunBO's competitive standing in the broader landscape of modern BO techniques.
Theoretical Claims
The paper does not make formal theoretical claims requiring proof verification.
Experimental Design and Analysis
The experimental design appears methodologically sound.
Supplementary Material
I have reviewed the supplementary material, including a detailed discussion of related work, code implementations for all FunBO components, and experimental details. The only thing I note is that while the experimental setup is detailed, the authors do not explicitly mention the total number of repetitions used to calculate the statistics presented in their results. Please correct me if I missed it.
Relation to Prior Literature
FunBO potentially addresses challenges in scientific optimization domains where traditional AFs struggle. This approach could particularly benefit high-dimensional problems, multi-objective optimization scenarios, or constrained optimization settings that are prevalent in scientific applications like drug discovery, materials science, and engineering design.
Missing Essential References
The literature review appears comprehensive, covering classical AFs, meta-learning approaches, and LLM-augmented BO methods. The authors also acknowledge concurrent work by Yao et al. (2024) on representing AFs in code.
I would only suggest the authors include several other meta-learning BO papers [1-3] and one LLM-based BO paper [4].
[1] Song et al. "Reinforced in-context black-box optimization."
[2] Müller et al. "Pfns4bo: In-context learning for bayesian optimization."
[3] Yang et al. "MONGOOSE: Path-wise smooth Bayesian optimisation via meta-learning."
[4] Song et al. "Position: Leverage foundational models for black-box optimization."
Other Strengths and Weaknesses
- The computational cost of FunBO is prohibitively high (48 hours of training), which severely limits its practical utility. This investment only makes sense when repeatedly optimizing similar functions, but in most real applications, trying several standard AFs would be much faster than running FunBO.
- While the authors emphasize interpretability as a key advantage, they don't actually interpret any of their discovered AFs. I would have expected at least one example where they explain why a particular code structure performs better than traditional AFs on a specific task. Without such analysis, the claimed interpretability benefit remains largely theoretical.
- The current implementation seems limited to discrete evaluation points (Sobol grid). It's unclear how the approach would extend to true continuous optimization of acquisition functions, which might be important for some applications.
- As the authors acknowledge, there's high variance in the quality of discovered AFs, requiring multiple FunBO runs to find good solutions. This inconsistency further increases the already substantial computational demands.
Other Comments or Suggestions
- There appears to be a duplication in the related work section - the paragraph on "LLMs and black-box optimization" is repeated almost verbatim on page 12.
- Figure 3's top plot doesn't effectively demonstrate FunBO's advantages, as the exploration trajectories for all methods overlap significantly. A more visually distinct example would better highlight the differences in exploration strategies.
- It would be helpful to see an ablation study on the impact of different components of FunBO (e.g., evaluation metric design, island model parameters).
Thanks for your in-depth review of our work and valuable feedback.
Missing Baselines Our work focuses (footnote 2) on AFs that can be evaluated in closed form given the GP posterior parameters, making them fast to evaluate and easily deployable. This is a deliberate choice to avoid the complexities of approximations such as Monte-Carlo sampling (needed, e.g., for entropy- or knowledge-gradient-based methods), which would make both the evaluation of candidate AFs within FunBO computationally intractable and the generation of corresponding code by the LLM significantly more difficult.
As we restricted FunBO's search space to focus on discovering AFs within the class of closed-form deterministic functions, we ensured a fairer comparison by contrasting the discovered AFs against appropriate baselines like EI/UCB/PoI. In terms of end-to-end methods, we view FunBO as complementary. As noted in our Related Work section, Chen et al. (2022) found that even their transformer-based model benefited from using a standard AF (EI) for action selection. This suggests that AFs discovered by FunBO could potentially be integrated into such end-to-end frameworks in the future to further boost performance.
Computational Cost & Variance We acknowledge that the computational cost of FunBO is significant. Note however that FunBO is primarily intended as an offline procedure to discover high-quality AFs. The computational expense is incurred during this one-time search phase. Once discovered, the resulting code-based AF is typically very cheap to evaluate during the actual online BO task. The idea is that FunBO invests computation upfront to find an AF that significantly improves sample efficiency later.
While variance in the quality of the discovered AFs might further increase the computational cost, it is crucial to distinguish the variance of the search process from the value of its outcome. Once found, a successful AF discovered by FunBO has an intrinsic value that is independent of the stochastic process that generated it. The AF represents a novel optimization strategy whose performance can be evaluated and used independently of the original search.
Discrete AF Optimization Within our experimental evaluation, the optimization of the AF was performed over a discrete Sobol grid to ensure fair comparison across all AFs. However, the discovered AFs are not fundamentally restricted to discrete optimization. As long as the surrogate model is differentiable (like a GP with RBF kernel) and the generated code includes differentiable functions, the AF can be optimized using continuous methods, similarly to EI or UCB. For instance, we could rewrite the AF in Fig 2 to take the input location x, return the corresponding scalar acquisition value and optimize it via L-BFGS.
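For illustration, below is a minimal sketch of such continuous maximization, assuming an sklearn-style GP surrogate whose predict(X, return_std=True) returns the posterior mean and standard deviation, and a box given as a NumPy array of per-dimension bounds; the EI here merely stands in for any closed-form AF of the kind FunBO discovers.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def expected_improvement(x, gp, incumbent):
    """Closed-form EI at a single input x (stand-in for any discovered closed-form AF)."""
    mean, std = gp.predict(np.atleast_2d(x), return_std=True)  # sklearn-style GP (assumed)
    std = np.maximum(std, 1e-12)
    z = (incumbent - mean) / std
    return (incumbent - mean) * norm.cdf(z) + std * norm.pdf(z)

def maximize_af(gp, incumbent, bounds, n_restarts=10, seed=0):
    """Multi-start L-BFGS-B maximization of a closed-form AF over a box.
    bounds: NumPy array of shape (d, 2) with per-dimension (lower, upper)."""
    rng = np.random.default_rng(seed)
    best_x, best_val = None, -np.inf
    for _ in range(n_restarts):
        x0 = rng.uniform(bounds[:, 0], bounds[:, 1])
        res = minimize(lambda x: -expected_improvement(x, gp, incumbent).item(),
                       x0, bounds=bounds, method="L-BFGS-B")
        if -res.fun > best_val:
            best_x, best_val = res.x, -res.fun
    return best_x
```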
Robustness to Surrogate Model Our current work focused on discovering AFs conditioned on a specific surrogate model, as stated in Sec 4, to isolate the AF's contribution. The performance of the discovered AFs might indeed depend on this choice. However, the FunBO methodology itself is flexible. By simply modifying the run_bo implementation to use a different surrogate model, FunBO can be used to discover AFs tailored to the specific model's properties.
Interpretation of Discovered AFs Note that we included an AF interpretation on line 287 and commented on the Goldstein-Price AF on line 359. As an additional example, the AF found for GPs-ID can be simplified to be written as (EI ** 2) / (1 + (z / beta)**2 * sqrt(var))**2. This function calculates the standard EI, squares it, and then divides it by a penalty term that increases with both the standardized improvement z = (incumbent - mean) / std and the uncertainty sqrt(var). The squaring of the EI non-linearly amplifies regions with high EI values relative to those with low or moderate EI. It increases the "peakiness" of the acquisition function, leading to stronger exploitation of the most promising point(s). The term (1 + (z/beta)**2 * sqrt(var))**2 penalizes points more heavily if they have a very high expected improvement (z is large and positive) and/or high uncertainty (sqrt(var) is large). This might act as a regularizer against over-optimism or excessive jumps into highly uncertain areas, even if they look very promising according to EI. It's a novel way of balancing exploration and exploitation discovered by FunBO.
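For readability, this simplified expression can be transcribed as a small Python function. This is an illustrative transcription only, assuming minimization, the standard closed-form EI, and the predictive_mean/predictive_var array inputs used in our AF signatures; beta is the trade-off parameter mentioned above.

```python
import numpy as np
from scipy.stats import norm

def gps_id_af(predictive_mean, predictive_var, incumbent, beta=1.0):
    """Simplified GPs-ID AF: squared EI divided by a penalty that grows with the
    standardized improvement z and the posterior uncertainty sqrt(var)."""
    std = np.sqrt(np.maximum(predictive_var, 1e-12))
    z = (incumbent - predictive_mean) / std                               # standardized improvement
    ei = (incumbent - predictive_mean) * norm.cdf(z) + std * norm.pdf(z)  # standard EI
    penalty = (1.0 + (z / beta) ** 2 * std) ** 2
    return ei ** 2 / penalty
```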
Number of Repetitions The mean and standard deviation shown in the regret plots represent the performance variation across the set of test functions, not across multiple independent runs of the BO algorithm on a single function instance. For instance, for OOD-Bench, the average/std dev is over the 9 distinct test functions. We will clarify this point in the captions.
The authors propose FunBO, a novel method for designing novel acquisition functions (AFs) for BO using LLMs. FunBO aims to design new AFs that perform well across a variety of objective functions. FunBO takes a target set of objective functions and some initial standard AF. Then, FunBO iteratively modifies the initial AF to improve the resulting BO performance on the set of objective functions. The result is a novel AF designed specifically to perform well across the given set of objective functions. The novel AF is output directly in code format, making the designed AF both easy to use and easy to interpret. The authors provide an extensive empirical analysis of FunBO, comparing BO performance with AFs designed by FunBO against performance with both standard general-purpose AFs and AFs customized specifically for the given objective function type (function-specific AFs). Results show that AFs designed by FunBO outperform general-purpose AFs and perform comparably with the function-specific AFs. Additionally, results show that AFs designed by FunBO perform well even on objective functions that were outside of the FunBO training distribution, demonstrating that the AFs designed by FunBO generalize well to other objective functions. Empirical results include both standard benchmark objective functions and hyperparameter optimization tasks.
Update after rebuttal
As I stated in my comment below, the authors sufficiently answered all of my questions in their rebuttal, and I maintain that this work should be accepted for all of the reasons stated in my initial review.
Questions for Authors
Question 1: Have the authors considered applying FunBO to design AFs for additional real optimization problems outside of hyperparameter optimization? While hyperparameter optimization is one of the most common applications of BO, there is much recent work applying BO to various interesting problems in robotics, materials science, database tuning, drug design, nuclear reactor stability, etc. It would be cool to see if FunBO could be applied to design AFs to improve BO for some of these other kinds of practical applications.
Question 2: I wonder if additional information specific to the particular problem setting could also be provided to FunBO to further improve the AF design. For example, adding problem descriptive information such as “this is a problem in hyperparameter tuning” or “this is a problem in robotics” could be optionally provided to the LLM to allow FunBO to incorporate this information in the design of the AF. Is this something the authors have considered? Do the authors think this would be straightforward to do, and do you think that it could further improve performance of FunBO?
Claims and Evidence
Yes.
Methods and Evaluation Criteria
Yes.
Theoretical Claims
There are no theoretical claims or proofs in this paper as far as I am aware.
Experimental Design and Analysis
Yes, I checked all experimental design setups and experimental results provided in the paper. All are valid as far as I am aware.
Supplementary Material
Yes, I reviewed all supplementary material in the appendix.
Relation to Prior Literature
Designing AFs for BO is a relevant problem because choosing or designing a good acquisition for a specific objective function can have a significant impact on optimization performance. Additionally, different AFs can be better or worse choices for different objective functions. New methods for designing good AFs that perform well across a variety of objective functions will be of interest to the broader scientific community, and especially to the Bayesian optimization research community.
Missing Essential References
There are no essential references missing as far as I am aware.
Other Strengths and Weaknesses
Strengths:
1: Improving accessibility of BO methods for practitioners: Designing AFs to maximize BO performance is a difficult problem that often requires advanced knowledge about Bayesian optimization. FunBO presents a way for practitioners in other fields (i.e. the natural sciences), who may want to apply BO methods to their specific problems, to design AFs without this advanced knowledge. Additionally, the fact that FunBO outputs designed AFs directly in python code makes this extremely user friendly (it’s as simple as copying and pasting the designed AF into the user’s .py file).
2: Compelling empirical results: The empirical results provided are compelling and clearly demonstrate the author’s claims that 1) AFs designed by FunBO outperform general purpose AFs, 2) AFs designed by FunBO perform comparably even when compared to function-specific AFs, and 3) AFs designed by FunBO generalize well to objective functions outside of the training distribution.
Weaknesses:
A minor weakness of this paper is that the novel methodological contribution is on the smaller side. This is because FunBO builds on ideas from existing work (FunSearch) and applies these ideas to the new application of designing AFs. That being said, this is a very minor weakness because significant work was done to apply these ideas to this new application setting of designing AFs, and strong empirical results demonstrate that FunBO makes an impactful contribution to the literature. I therefore still recommend that this paper be accepted.
Other Comments or Suggestions
Typos:
Line 44: “In other to avoid” should be “In order to avoid”
Line 319: “that performs well across function classes” should be “that perform well across function classes”
Line 356: “FunBO is able reach” should be “FunBO is able to reach”
Thanks for your feedback and for highlighting the strengths of our work. We appreciate the constructive comments and recommendation for acceptance. Regarding the specific questions:
Application to Other Real-World Problems This is an excellent point. Our primary goal in this initial work was to establish the FunBO methodology itself, demonstrating its feasibility for discovering novel AFs from code and rigorously evaluating their performance and generalization capabilities using well-understood application domains (standard benchmarks and simple HPO problems). Exploring diverse real-world applications would require careful consideration of the specific problem characteristics and potentially curating relevant sets G representative of the target domain. We agree this is a significant and exciting direction, and we view it as important future work, building upon the foundation laid in this paper.
Incorporating Problem-Specific Information Currently, FunBO provides context to the LLM through the structure of the initial AF, the function docstrings and the prompt structure including prior examples. Building on this, there are indeed several ways the prompt context could be further enriched:
- Adding Performance Data: We could go beyond the implicit ranking in the prompt and explicitly include the numerical scores attained by the included AFs. This might provide the LLM with a clearer signal about the magnitude of improvement to aim for.
- Adding Self-Critique: Another avenue, inspired by self-improvement techniques, could involve incorporating automatically generated or human-written critique/analysis of why the provided example AFs perform as they do directly into the prompt. This could potentially help the LLM focus on refining successful strategies or avoiding pitfalls.
- Adding Problem Descriptions: Incorporating explicit textual descriptions of the problem domain (e.g., "tuning hyperparameters for a binary classification model" or "optimizing a robotic grasp"), as proposed.
From an implementation perspective, these can be easily incorporated into the prompt structure used by FunBO. It is plausible that providing richer information through any of these additions could allow the LLM to better leverage its internal knowledge base, potentially leading to further improvements in the discovered AFs. The exploration of the impact of richer prompt contexts (including scores, critique, or domain descriptions) is an interesting direction for future investigation, as mentioned in our Future Work section (Section 5).
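For concreteness, here is a hypothetical sketch of how such context could be threaded into the prompt assembly; the template, argument names, and version-numbering convention are illustrative, not FunBO's actual prompt format.

```python
def build_prompt(example_afs, scores=None, critiques=None, problem_description=None):
    """Assemble an LLM prompt from prior AF code, optionally enriched with scores,
    critiques, and a textual problem description (illustrative template only)."""
    parts = []
    if problem_description:
        parts.append(f"# Problem context: {problem_description}")
    for i, af_code in enumerate(example_afs):
        header = f"# Acquisition function v{i}"
        if scores is not None:
            header += f" (score: {scores[i]:.3f})"
        parts.append(header + "\n" + af_code)
        if critiques is not None and critiques[i]:
            parts.append(f"# Critique of v{i}: {critiques[i]}")
    parts.append(f"# Write an improved acquisition_function_v{len(example_afs)}.")
    return "\n\n".join(parts)
```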
Thank you for answering both of my questions. I maintain that this work should be accepted.
The authors present FunBO, an algorithm that leverages an LLM to propose novel acquisition functions (AFs) for Bayesian optimization. The authors evaluate FunBO under three settings: in-distribution (ID), out-of-distribution (OOD), and few-shot. FunBO relies on the availability of a large set of related tasks (ID) or unrelated tasks (OOD) with which to evaluate new AF candidates. Overall, although the method is interesting, I have some concerns about the empirical evaluation of FunBO. First, the experiments are run using fixed hyperparameters for the GP surrogate and without acquisition function maximization, which doesn't reflect the real-world application of Bayesian optimization. Second, it would be nice to see FunBO evaluated on more real-world black-box optimization problems such as molecule generation [13] or on a standard hyperparameter optimization benchmark [11]. Third, the fact that random search outperforms other AFs on standard optimization benchmarks in the authors' experiments seems to contradict the results of previous studies and raises some questions about the empirical evaluation, which are not directly answerable because the authors have not provided their source code. Nonetheless, I will be willing to revise my score if the authors can address the concerns raised.
Post-Rebuttal Update
Unfortunately, the authors did not have sufficient time to reply to my final comment and supply either the code or the full details of their additional experiments such that they can be reproduced. My primary concern hence remains that the empirical results are somewhat suspicious given that e.g. the performance of random search against standard AFs is at odds with prior literature. During the AC-reviewer discussion phase I will try my best to independently verify the results in BoTorch. If the authors could supply the mathematical form of the FunBO acquisition function they evaluated as a confidential comment to the AC to be disseminated to me that would be great.
Questions for Authors
-
The authors state, "FunBO explores a significantly more complex function space where programs take as inputs multiple arrays.". Can the authors give an example of what they mean by "multiple arrays"?
-
The authors state that, "When no validation functions are used (), the AF with the highest average performance on is returned.". When would the case arise where no validation functions are available?
-
The authors state in footnote 6, "We explored using a random selection of initial points as an alternative to EI. However, this approach did not yield good results as using a random selection was incentivizing the generation of functions with a stochastic output, for which convergence results are not reproducible.". What do the authors mean by this? Can the authors present these negative results?
-
For the plot in Figure 2, how many random seeds were taken? Why is std/2 used in place of the standard error?
-
In Figure 9, the legend labels are very difficult to read. It seems as though random search is the next best performing acquisition function relative to FunBO? I see the authors later mention the strong performance of random search, but the explanation isn't particularly convincing given results reported for the same optimization benchmarks in other papers; e.g., in [8] the authors find that random search is the worst-performing AF on the Michalewicz function in Figure 3. Additionally, in [9] Figure 6, the authors find that random search is the worst performing AF on the Hartmann 6D function (in the parallel BO setting with a batch size q=4). Even if the experimental settings are slightly different, the overall trend is still very surprising. Additionally, how many random seeds were taken? Why is the half standard deviation used for errorbars in place of the standard error?
-
In the abstract the authors state, "we propose FunBO, an LLM-based method that can be used to learn new AFs written in computer code by leveraging access to a limited number of evaluations for a set of objective functions.". What is meant by this? The number of evaluations seems quite large when aggregated across the set of related functions G required to evaluate the efficacy of the proposed AF.
-
In Figure 5 why isn't random search shown in the first two panels?
Claims and Evidence
In terms of the claims made in the paper the authors state:
"Across all objective functions, αFunBO leads to a convergence performance that outperform general purpose AFs (Fig. 4)."
It seems as though MetaBO is stronger than FunBO on Hartmann?
In the experiment section the authors claim in relation to the performance of random search that,
"This is due to random search performing competitively on functions with numerous local optima, which are generally harder to optimize."
This does not seem to be borne out in the literature. I discuss this in more detail below. In the Related Work section, the authors state,
"This confirms the continued importance of AFs as crucial components in BO, even when combined with transformer-based approaches, and highlights the importance of a method such as FunBO that can be seamlessly integrated with these newer architectures, potentially leading to further improvements in performance."
Retraining a transformer architecture is likely to be very expensive. Will such an integration not fall foul of the limitations highlighted in the authors' conclusion regarding the expense of running FunBO?
Methods and Evaluation Criteria
-
It looks as though Equation 1 is underspecified. From the authors' description in the paragraph below, it is implied that the number of steps tau required to identify the optimum is recorded (otherwise, at a fixed budget T, the second term would always be 1), yet this isn't indicated in Equation 1.
-
The authors state that they removed the local maximization of AFs from the MetaBO baseline. Why was this done? Presumably because the authors omit AF maximization in FunBO?
-
In their experiments the authors state,
"To isolate the effect of using different AFs and eliminate confounding factors related to AF maximization or surrogate models, we maximized all AFs on a fixed Sobol grid (of size NSG) over each function’s input space."
Surely the relevant comparison is whether FunBO can yield better acquisition functions in real-world problems where acquisition function maximization and hyperparameter tuning of the surrogate model would take place? It is colloquially accepted that the quality of the GP surrogate is a more important factor than the choice of acquisition function [10, 12] in real-world BO performance and so it would make sense to run an empirical evaluation where GP hyperparameter optimization and AF maximization were present. One option would be to take a SOTA BO algorithm such as HEBO [10] and evaluate whether FunBO could improve its performance on the Bayesmark benchmark [11].
Theoretical Claims
Not applicable.
Experimental Design and Analysis
-
The method hinges on the assumption that the related functions G are much cheaper to evaluate relative to f?
-
Why did the authors choose an RBF kernel in place of a Matern 3/2 kernel, as is commonly adopted for standard optimization problems?
-
OOD-Bench seems to be the most practically relevant benchmark. For this experiment it would be nice (but expensive) to see the contents of G_tr and G_v rotated with optimization of GP hyperparameters.
-
Although MetaBO uses fixed GP hypers and the authors may have been concerned with a direct comparison, I reiterate that I believe a set of experiments that would support the real-world utility of FunBO would feature optimization of the GP hypers and maximization of the AF.
-
The authors use two example AFs in the prompt. How was this number chosen? Was there a sensitivity analysis on the number of AFs to include? Do the authors think the number of examples affects performance?
Supplementary Material
- In Appendix C, the precise sampling scheme should be provided. For example, how are AFs sampled from within an island based on their score and length? In the absence of code, this information would be needed to reproduce the paper; even if it is included in the original FunSearch paper, it would help if it were repeated here.
Relation to Prior Literature
-
It may be worth briefly discussing connections between FunSearch and OPRO [7]. While FunSearch maintains scores for each heuristic, OPRO maintains scores for each prompt.
-
How does the authors' approach differ from symbolic regression approaches using genetic programming? Did the authors consider this?
Missing Essential References
-
When introducing Bayesian optimization, it may be worth citing the originating papers for the method [2, 3] as discussed in [4].
-
Reference [5] should be cited when introducing Expected Improvement (EI) as discussed in [4].
-
For probability of improvement, the originating work is [2] and for knowledge gradient the originating work is [5] as discussed in [4].
-
In the Related Work, it would be worth the authors mentioning the references [14, 15] which both leverage LLM embeddings for black-box optimization.
-
InstructZero was accepted at ICML 2024.
REFERENCES
[1] Benjamins, C., Raponi, E., Jankovic, A., Doerr, C. and Lindauer, M., 2023, December. Self-Adjusting Weighted Expected Improvement for Bayesian Optimization. In International Conference on Automated Machine Learning (pp. 6-1). PMLR.
[2] Kushner, HJ., A Versatile Stochastic Model of a Function of Unknown and Time Varying Form. Journal of Mathematical Analysis and Applications 5(1):150–167. 1962.
[3] Kushner HJ., A New Method of Locating the Maximum Point of an Arbitrary Multipeak Curve in the Presence of Noise. Journal of Basic Engineering 86(1):97–106. 1964.
[4] Garnett, R., Bayesian optimization. Cambridge University Press. 2023.
[5] Šaltenis, V.R., One Method of Multiextremum Optimization. Avtomatika i Vychislitel’naya Tekhnika (Automatic Control and Computer Sciences) 5(3):33–38. 1971.
[6] Lyu, W., Yang, F., Yan, C., Zhou, D. and Zeng, X., 2018, July. Batch Bayesian optimization via multi-objective acquisition ensemble for automated analog circuit design. In International conference on machine learning (pp. 3306-3314). PMLR.
[7] Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J. and Yang, Y., Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers. In The Twelfth International Conference on Learning Representations.
[8] Daulton, S., Ament, S., Eriksson, D., Balandat, M. and Bakshy, E., 2023, December. Unexpected improvements to expected improvement for Bayesian optimization. In Proceedings of the 37th International Conference on Neural Information Processing Systems (pp. 20577-20612).
[9] Balandat, M., Karrer, B., Jiang, D., Daulton, S., Letham, B., Wilson, A.G. and Bakshy, E., 2020. BoTorch: A framework for efficient Monte-Carlo Bayesian optimization. Advances in neural information processing systems, 33, pp.21524-21538.
[10] Cowen-Rivers, A.I., Lyu, W., Tutunov, R., Wang, Z., Grosnit, A., Griffiths, R.R., Maraval, A.M., Jianye, H., Wang, J., Peters, J. and Bou-Ammar, H., 2022. HEBO: Pushing the limits of sample-efficient hyper-parameter optimisation. Journal of Artificial Intelligence Research, 74, pp.1269-1349.
[11] Turner, R., Eriksson, D., McCourt, M., Kiili, J., Laaksonen, E., Xu, Z. and Guyon, I., 2021, August. Bayesian optimization is superior to random search for machine learning hyperparameter tuning: Analysis of the black-box optimization challenge 2020. In NeurIPS 2020 Competition and Demonstration Track (pp. 3-26). PMLR.
[12] Shahriari, B., Swersky, K., Wang, Z., Adams, R.P. and De Freitas, N., 2015. Taking the human out of the loop: A review of Bayesian optimization. Proceedings of the IEEE, 104(1), pp.148-175.
[13] Gao, W., Fu, T., Sun, J. and Coley, C., 2022. Sample efficiency matters: a benchmark for practical molecular optimization. Advances in Neural Information Processing Systems, 35, pp.21342-21357.
[14] Kristiadi, A., Strieth-Kalthoff, F., Skreta, M., Poupart, P., Aspuru-Guzik, A. and Pleiss, G., 2024, July. A Sober Look at LLMs for Material Discovery: Are They Actually Good for Bayesian Optimization Over Molecules?. In International Conference on Machine Learning (pp. 25603-25622). PMLR.
[15] Rankovic, B., Schwaller, P., BoChemian: Large language model embeddings for Bayesian optimization of chemical reactions. In: NeurIPS 2023 Workshop on Adaptive Experimental Design and Active Learning in the Real World. 2023.
Other Strengths and Weaknesses
-
To my mind the experiments performed by the authors although extensive do not contain enough real-world examples. A hyperparameter optimization problem with two hyperparameters e.g. the SVM experiments is not particularly compelling. It would be more interesting to see the method applied to a setting such as molecule generation [13] for example or the Bayesmark benchmark [11]. If these problems are prohibitive from a cost perspective it would be nice to see this limitation explicitly mentioned in relation to these applications (instead of abstractly as is currently the case in the conclusion).
-
It would be great if the code could be released. If there is a delay on industrial approval for code release this will hopefully have been addressed in the time between submission and the rebuttal phase.
Other Comments or Suggestions
-
There are some missing capitalizations in the references e.g. "Bayesian".
-
There are some missing journal/conference indications for arXiv papers in the references e.g. [1] was published at the AutoML conference.
-
Typo in the introduction, "In other to avoid".
-
When listing approaches to mitigate reliance on a single acquisition function in the introduction it would also be worth mentioning ensemble approaches such as MACE [6] where the Pareto front over different acquisition functions is computed.
-
In the preliminaries section, it would be worth articulating the purpose of the auxiliary objective functions g_j. It would also be worth articulating any assumptions about the relation between these auxiliary functions g and the objective to be optimized f. The authors later state that, "The learned AFs are trained on the set G, whose functions are assumed to be drawn from the same distribution or function class associated to f." For clarity it would be worth including this assumption when the set G is first introduced.
-
PI would probably be a less cumbersome acronym than PofI?
-
Typo in Figure 6 docstring, "finds its optimum"
-
Typo in Table 1 and Table 2, the likelihood noise parameter sigma (1e-5) should not have a subscript f.
-
Typo section 3, "an algorithm discovery problem".
-
The clarity of the description of the terms in Equation 1 could be improved. For example, one quantity denotes the known optimal value of the objective rather than the optimum itself, which would be the corresponding input location. The authors describe another term as the optimal input value at a given step. What does this mean? The optimal value according to the surrogate model at that step? The authors also describe the "found optimal input value". I think this would be better described as the model's estimate of the optimal input value or similar.
-
In Figure 2, it would help if scores were indicated for each acquisition function to highlight the ascending order in which they appear in the prompt.
-
In the definition of regret on page 6, it is somewhat confusing to use y-style notation for the true optimum. Typically, f refers to the noiseless value of the objective and y refers to a noisy value of the objective.
-
On page 6, the authors state, "with less than 25 trials". This should be "with fewer than 25 trials". "Fewer" is typically used with countable nouns such as trials, oranges, ICML submissions, whereas "less" is used with uncountable nouns such as time, money, and water.
-
In Figure 8, the variable percentage_steps_needed is not defined.
-
Line 746, Figure 7, is there a typo? found_min = initial_min_y = float(np.min(model.Y)).
-
Line 762, Figure 7, comment is broken.
-
Line 764, Figure 7, if found_min == true_min. Should there not be some tolerance factor here if this is real Python code?
-
Line 793: the expression used is a slightly strange way of defining the RBF kernel if the quantity involved is the design matrix. Why not use the standard notation?
-
Can the authors explain line 948 in Figure 12, "predictive_var [(shape−10)//2] *= dim".
Ethics Review Concerns
Not applicable.
Thanks for your in-depth review of our work and valuable feedback.
Transformer Cost We agree that re-training large transformer architectures is expensive. Our statement was intended to mean that a pre-trained transformer surrogate model could benefit from using a cheap-to-evaluate, FunBO-discovered AF instead of standard AFs (like EI, as used in Chen et al. 2022), rather than implying FunBO would be involved in the transformer training itself.
AF/HP optimization We removed the local maximization step from the MetaBO baseline to ensure a consistent comparison framework where all AFs are optimized over the same Sobol grid. This was done to isolate the performance differences attributable purely to the AF's selection logic, removing confounding effects from the quality of optimization routines. For the same reason, we used fixed GP hyperparameters. We aimed for a controlled comparison of the AFs' inherent selection strategies and believe this approach leads to an evaluation that is sound and appropriate as recognized by Reviewer FFHS (“The experimental design carefully controls for confounding factors by maintaining consistent GP hyperparameters …”). We will clarify that this might deviate from a fully realistic deployment scenario.
G cheaper than f FunBO can run even if the functions in G are expensive, but the offline discovery phase becomes proportionally costly. The practicality of FunBO relies on the idea that the functions in G are indeed cheaper (e.g., lower-fidelity simulations, see response to Reviewer zkmL) and that the large offline cost is amortized by finding an AF that saves numerous evaluations of the truly expensive target function f during online deployment.
Questions:
- This refers to the AF signature, where predictive_mean and predictive_var are NumPy arrays, as opposed to simpler functions found by FunSearch that might only take scalar or tuple inputs.
- This occurs if G is small and/or if the user decides to use all available functions for training to maximize the training signal, e.g. in few-shot settings.
- Using a completely random initial AF encourages the LLM to generate AF code that itself includes stochastic elements (e.g., various calls to np.random). This makes the resulting BO loop non-deterministic, making scoring inconsistent and reproducibility difficult.
- The mean and standard deviation shown in the regret plots represent the performance variation across the set of test functions (e.g., the 9 distinct test functions for OOD-Bench), not across multiple independent runs of the BO algorithm on a single function instance. We used std/2 for visualization purposes.
- Thanks for pointing us to these references. Notice that the figures you mention average the convergence results over multiple random initializations. In our case we are averaging the convergence results over multiple, completely different, functions. Plotting the performance on the individual functions reveals that known AFs outperform random search in most cases: they perform better on lower-dimensional functions and are only competitive with random search on functions with numerous local optima.
- This refers to the limited number of functions available in G. We will rephrase this to be “by leveraging access to a number of evaluations for a limited set of objective functions.”
- We omitted them to avoid cluttering the figure, but we are happy to add them back to the manuscript.
Other comments:
- For Fig 4 (right panel, HM3), MetaBO is indeed outperforming FunBO. However, the statement you quote refers to "general purpose AFs" (e.g. EI), which are indeed underperforming compared to FunBO.
- This is a typo: tau represents the number of BO trials required by the algorithm using the candidate AF to identify the true minimum value, and it is recorded for each sampled AF. The T appearing in the second term of Equation 1 should also read tau.
- The sampling scheme details are given in Appendix C and they are kept fixed with respect to FunSearch. The code for the latter is available on GitHub making the algorithm reproducible.
- As specified on line 312, all experiments are conducted using FunSearch with default hyperparameters, including the number of examples in the prompt.
- Thank you for providing this helpful list of suggested references. We will incorporate them in the revised manuscript and add a related discussion. Regarding symbolic regression, these methods also search for mathematical expressions, often represented as trees and evolved using genetic operators. FunBO differs by operating directly on code representations, leveraging an LLM for generation/mutation, and using full BO performance as the fitness metric. While related in the goal of finding functional forms, the search-space representation and operators are different.
- Transformer Cost: Do the authors mean to imply that the FunBO-discovered AFs in this work could be applied out-of-the-box to transformer-based surrogates?
- AF/HP optimization: The MetaBO paper uses fixed GP hyperparameters and so I appreciate that using fixed GP hyperparameters in FunBO is the fairest means of comparison. However, the practical utility of FunBO would be in improving real-world BO performance which would involve GP hyperparameter and AF optimization. I retain the opinion that the true test of FunBO's utility would be to see it improve upon SOTA on a real-world benchmark such as molecule generation or Bayesmark (or any other suitable real-world task) by discovering a performant acquisition function.
- G Cheaper than f: Many thanks to the authors for clarifying the perceived use-case of FunBO in multifidelity settings.
- Performance of Random Search: Many thanks to the authors for clarifying that the averaging occurs across tasks. In the individual plots the fact that random search is performant on Branin still contrasts with the results in the literature.
- Sampling Scheme in Appendix C: I was indeed referring to the sampling scheme in appendix C in my initial review. The full details of the scheme should be provided within the current paper.
- Comparison with Symbolic Regression Approaches: It would be quite interesting to see a direct comparison of FunBO and symbolic regression in future work.
- Remaining Questions: I understand that space constraints may have prevented the authors from addressing all questions in my initial review. I am more than happy to engage in discussion on the remaining points.
In summary, I will be very happy to increase my score if a) experiments could be added demonstrating the advantages of FunBO together with GP hyperparameter and AF optimization on just a single example problem. b) the discrepancy in the performance of random search vs. standard AFs on Branin can be explained. c) an explanation can be provided as for why the source code has not been released.
Thanks for your additional response.
Transformer Cost: Yes
Sampling Scheme in Appendix C: We are happy to include this
Performance of Random Search: See b) below
a) To address this point we have tested an AF found by FunBO using the standard BO pipeline available in BoTorch. In particular, we have taken the AF found by FunBO for OOD-Bench and tested it on the test functions for OOD-Bench using the code in BoTorch tutorial 1 and tutorial 2. Note that this evaluation uses all standard settings in terms of optimization of the AFs, optimization of the GP hyperparameters, random selection of initial points, etc. For each test function we have run the algorithm with 10 different random initial designs and plotted the results averaging over the runs (for these plots we used the plot formatting of BoTorch given in the linked Colabs; this is also meant to address concerns regarding plotting).
Unfortunately we cannot share the code at this stage. However, we have added all additional convergence plots at this anonymous link. Notice how, across all functions, FunBO performs comparably to or better than EI and UCB. We are happy to repeat all evaluations with this BoTorch pipeline, add the convergence plots with GP hyperparameter optimization and AF optimization, and share the corresponding code.
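For concreteness, the sketch below shows one way a closed-form, code-based AF could be wrapped for this standard BoTorch pipeline; the class name is ours and the EI-like score inside is only a placeholder for the discovered AF, assuming a minimization problem modeled directly by the GP.

```python
import torch
from botorch.acquisition.analytic import AnalyticAcquisitionFunction
from botorch.optim import optimize_acqf
from botorch.utils.transforms import t_batch_mode_transform
from torch.distributions import Normal

class ClosedFormAF(AnalyticAcquisitionFunction):
    """Wraps a closed-form AF of the GP posterior mean/std and the incumbent value."""

    def __init__(self, model, best_f):
        super().__init__(model=model)
        self.register_buffer("best_f", torch.as_tensor(best_f, dtype=torch.double))

    @t_batch_mode_transform(expected_q=1)
    def forward(self, X):
        posterior = self.model.posterior(X)
        mean = posterior.mean.squeeze(-1).squeeze(-1)
        sigma = posterior.variance.clamp_min(1e-12).sqrt().squeeze(-1).squeeze(-1)
        z = (self.best_f - mean) / sigma
        normal = Normal(torch.zeros_like(z), torch.ones_like(z))
        # Placeholder EI-style score; the discovered AF's closed-form expression goes here.
        return sigma * (z * normal.cdf(z) + normal.log_prob(z).exp())

# Usage with BoTorch's standard optimizer (model: fitted GP, bounds: 2 x d tensor):
# candidate, _ = optimize_acqf(ClosedFormAF(model, best_f=train_Y.min()),
#                              bounds=bounds, q=1, num_restarts=10, raw_samples=256)
```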
b) The convergence plots given at the anonymous link also include random search. Note how random search, as expected, underperforms on all test functions. However, there are functions on which the performance of random search is comparable to EI (Sphere) or FunBO (Hartmann 3d). Removing the AF grid optimization, plotting the best objective value found against the number of observations rather than the regret, averaging over random initial designs, and disaggregating the results to show performance by test function leads to the expected low performance of random search. We are happy to repeat the analysis using these settings for the remaining experiments in the manuscript.
c) As previously mentioned, all code needed to run the algorithm (together with the code in the FunSearch GitHub repository) has been given in the appendix. However, if more convenient, we are happy to open source the repo itself.
The paper proposes FunBO, a method to learn novel acquisition functions to increase the efficiency of Bayesian optimisation. In particular, the acquisition function is represented as a Python program and FunSearch is adopted to search over the space of programs. The authors evaluate their approach in both in-distribution settings, i.e., where the target function is within the same class as the given training functions, and out-of-distribution settings. It is shown that FunBO outperforms the considered baselines.
Questions for Authors
-
I am curious about the intrinsic motivation for representing AFs with programs. What are the real benefits behind it? Representing them as programs leads to the requirement of doing a hard discrete search, thus making the overall procedure even more expensive. Overall, I understand that it comes from the fact that modern LLMs have become good priors for code generation and can generate "executable" programs, thus giving a good start for such a search, but besides that, what are the benefits compared to representing the AF via a neural network?
-
As you also mention in the limitations, FunBO might be really strong in the situations where running full BO loop is very cheap/fast to provide the feedback to an LLM, then FunBO can iterate quickly. What are the interesting practical use-cases for such setting?
Claims and Evidence
The experiments convincingly show the advantages of the proposed FunBO method.
Methods and Evaluation Criteria
FunBO is evaluated on a diverse range of functions, covering both in-distribution and out-of-distribution settings.
Theoretical Claims
N/A
Experimental Design and Analysis
Seems valid.
Supplementary Material
I briefly checked the supplementary material.
Relation to Prior Literature
I think the problem of learning problem-specific acquisition functions is an important one. In general, Bayesian optimisation arises in many applications, thus many fields could benefit from FunBO, which improves the efficiency of the optimisation.
Missing Essential References
N/A
Other Strengths and Weaknesses
N/A
Other Comments or Suggestions
N/A
Thanks for your feedback and for highlighting the strengths of our work. We hope the responses below clarify the motivation and practical considerations for FunBO.
Benefits of Representing AFs as Programs Representing AFs as programs offers the following advantages:
- Interpretability: As highlighted in our paper, code-based AFs are inherently interpretable. It is possible to directly inspect, analyze, and understand the logic behind the optimization strategy employed by the discovered AF. This contrasts with neural network based AFs, for which it is difficult to understand why they perform well or how they balance exploration/exploitation.
- Deployability and Simplicity: AFs discovered by FunBO are outputted as concise code snippets (e.g., Python functions). As mentioned in Section 1, these can be readily integrated into existing BO libraries and workflows with minimal overhead.
- Leveraging Foundational Knowledge: LLMs trained on vast code and text have knowledge about code but also about existing BO strategies. FunBO aims at leveraging this embedded knowledge to efficiently explore the program space and construct effective heuristics.
We have further clarified the benefits of the code representation by adding text elaborating on these advantages (Interpretability, Deployability, and Leveraging Foundational Knowledge) around line 92 in Section 1.
Practical Use Cases In the limitations section we mentioned that FunBO's evaluation cost might be high during the discovery phase, as it involves running a full BO loop for each candidate AF on the functions in G. However, there are scenarios where the evaluation cost for the functions in G might be relatively cheap. For instance:
- Simulations/Models at Lower Fidelity: G could consist of faster, lower-fidelity simulators or simplified analytical models related to the expensive target problem f.
- Surrogate Models: G could include cheap surrogate models learned from previous related tasks.
- ML Hyperparameter Optimization: G can represent the performance of an ML model on different (potentially smaller) datasets, where training/evaluation might be faster than on the final, large target dataset.
Note that the computational cost incurred during this one-time search phase is compensated during the online deployment phase where the resulting discovered AF reduces the number of costly function evaluations needed.
We have expanded the discussion on the computational cost within the Limitations paragraph in Section 5 and gave the examples listed above as cases where the auxiliary functions used during the search phase might be relatively inexpensive.
This paper uses LLMs to directly propose routines that can be used as acquisition functions for Bayesian optimization. Overall, I think this is a pretty creative use of an LLM's ability to generate small amounts of path critical code, and the acquisition functions designed by FunBO seem to be competitive with or outperform both general purpose and purpose built ones in some fairly interesting empirical results. I am generally not concerned about the authors' promise to address the last few remaining concerns that weren't addressed already in the author feedback (e.g., specific acquisition functions discovered), and I believe these are relatively minor additions to the manuscript.