PaperHub
Average rating: 5.5 / 10 · Poster · 4 reviewers
Ratings: 7 / 5 / 7 / 3 (min 3, max 7, std. dev. 1.7)
Avg. confidence: 3.8 · Correctness: 2.8 · Contribution: 2.5 · Presentation: 3.0
NeurIPS 2024

Minimizing UCB: a Better Local Search Strategy in Local Bayesian Optimization

OpenReview · PDF
Submitted: 2024-05-16 · Updated: 2024-11-06
TL;DR

We propose two local Bayesian optimization algorithms to show that minimizing the UCB is a better strategy than gradient descent for local search with a Gaussian process surrogate.

Abstract

Keywords
Bayesian Optimization · Local Optimization · High Dimensional · Minimizing Upper Confidence Bound

Reviews and Discussion

Review (Rating: 7)

The paper presents a novel method for local Bayesian Optimisation. Instead of relying on estimates of the gradient of the black-box function like earlier methods, the algorithm proposed in the paper minimises an upper bound on the objective function. The paper shows a connection between the proposed method and gradient descent, whose update rule can also be thought of as minimising a certain upper bound on the function. The authors study the convergence of the proposed algorithm theoretically and empirically, and show that it delivers improvements over existing baselines. Additionally, the authors propose a variant of the algorithm with an improved exploration strategy.
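The gradient-descent-as-upper-bound-minimisation view mentioned here can be checked with a small numeric sketch (our own illustration, not from the paper; the diagonal quadratic objective and the grid check are assumptions): for an $L$-smooth function, the gradient step $x_t - \nabla f(x_t)/L$ is exactly the minimizer of the quadratic upper bound $f(x_t) + \nabla f(x_t)^\top (x - x_t) + \frac{L}{2}\|x - x_t\|^2$.

```python
import numpy as np

# f(x) = x^T A x is L-smooth with L = 2 * (largest eigenvalue of A)
A = np.diag([1.0, 3.0])
f = lambda x: float(x @ A @ x)
grad = lambda x: 2.0 * A @ x
L = 6.0

x_t = np.array([1.0, -1.0])
g = grad(x_t)

# quadratic upper bound around x_t: q(x) = f(x_t) + g^T (x - x_t) + (L/2)||x - x_t||^2
q = lambda x: f(x_t) + g @ (x - x_t) + 0.5 * L * np.dot(x - x_t, x - x_t)

# its closed-form minimizer is exactly the gradient step with stepsize 1/L
x_gd = x_t - g / L

# brute-force check on a grid that x_gd minimizes q
grid = np.linspace(-2.0, 2.0, 201)
cands = np.stack(np.meshgrid(grid, grid), axis=-1).reshape(-1, 2)
x_num = cands[np.argmin([q(x) for x in cands])]
```

Completing the square shows $q(x) = q(x_\text{gd}) + \frac{L}{2}\|x - x_\text{gd}\|^2$, so the grid argmin coincides (up to grid resolution) with the gradient step, and $q$ upper-bounds $f$ at the new point.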

Strengths

The paper is generally well-written and I found the connection between MinUCB and Gradient Descent very appealing and nicely presented. Minimising UCB seems to be an interesting alternative to performing gradient steps, as it does not require the specification of the stepsize. Empirical results show that at least on a number of selected tasks, the algorithm delivers a non-negligible improvement in performance. Summing up the theoretical and empirical results of the paper, I believe their contribution is sufficient and I recommend accepting the paper.

Weaknesses

  1. I believe Theorem 1 could use a proof sketch; currently, just by reading the main body, it is impossible to understand where the result comes from.

  2. While the authors acknowledge the existence of TuRBO and ARS, they do not empirically compare with them. It would be nice to see a comparison with those baselines, particularly with TuRBO, which is very popular.

  3. In the related work, the authors mention the methods using additive models in high-dimensional Bayesian Optimisation but only cite relatively old papers (the newest one is from 2017). Since then, additive methods have seen much development. To make the section more up-to-date, it would be great if the authors would cite more recent work, such as [1] or [2].

[1] Ziomek, Juliusz Krzysztof, and Haitham Bou Ammar. "Are random decompositions all we need in high dimensional Bayesian optimisation?." International Conference on Machine Learning. PMLR, 2023.

[2] Han, Eric, Ishank Arora, and Jonathan Scarlett. "High-dimensional Bayesian optimization via tree-structured additive models." Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 35. No. 9. 2021.

Questions

  1. In the paper, the authors explain that the UCB is most likely to be “small” around already queried points and thus conclude the algorithm is mostly local, but potentially allows it to make bigger steps if needed. I wonder what would happen if we explicitly constrained the algorithm to be local, e.g. via the trust region strategy of TuRBO? Would it be possible for the authors to conduct a simple ablation study on at least one of the problems? This could help gauge whether the improvement delivered by the algorithm comes from better local optimisation or from occasionally switching to a more global search.

Limitations

The authors explicitly state the lack of convergence guarantee for LA-MinUCB as a limitation.

Author Response

Weakness 1: I believe Theorem 1 could use a proof sketch...

Response: We apologize for not including a proof sketch in the paper. The core tool we use is to prove Lipschitz properties of the functions derived from the Gaussian process, i.e. the mean function $\mu(x)$ and the standard deviation function $\sigma(x)$. The difficulty is that $\sigma(x)$ is not Lipschitz continuous under some conditions, so we develop tools to establish a similar property for $\sigma(x)$ with a controllable error. We then use the Lipschitz properties of these functions to build the following inequality: $$f(x_{t+1}) \le \min_{x\in\mathcal{X}} \mu(x) + \beta_t\sigma(x) \le f(\hat{x}^{t+1}) + e_t < f(x_t) + e_t,$$ where $\hat{x}^{t+1} = x_t - \eta_t\nabla\mu(x_t)$, the stepsize $\eta_t$ decreases at a logarithmic rate with $t$, and the error term $e_t$ decreases to 0. Here $\hat{x}^{t+1}$ is the gradient descent point starting from $x_t$. This inequality directly yields the final result, i.e. the gradient convergence result.

Weakness 2: While the authors acknowledge the existence of TuRBO and ARS, they do not empirically compare with them. It would be nice to see a comparison with those baselines, particularly with TuRBO, which is very popular.

Response: We are sorry that our paper did not compare with these methods. We have added comparison results with TuRBO in the experiments; please refer to Figs. 1, 2, and 3 in the attached PDF for details.

Weakness 3: ...only cite relatively old papers ...

Response: We apologize for not citing the latest works. Relevant content will be added in future versions of the paper.

Question 1: ...I wonder what would happen if we explicitly constrained the algorithm to be local, e.g. via the trust region strategy of TuRBO? ...

Response: Thank you for your comment. We have added this type of experiment in Fig. 4 of the PDF attached to the global rebuttal. We apply the trust region constraint to MinUCB and find that adding the trust region may actually lead to a worse result. We can explain this from two aspects. On the one hand, the trust region does not adapt in a timely manner: at the beginning its radius is small, which limits the search range, while when approaching local optima, minimizing UCB rarely exceeds the trust region anyway. On the other hand, MinUCB itself can be explained from a trust region perspective, as discussed in the third point of the global rebuttal. So overall, since our method already embodies the trust region idea, it behaves locally and achieves good results.

Comment

Thank you for your rebuttal. I remain positive about the paper.

Review (Rating: 5)

This paper proposes and analyzes a new local Bayesian optimisation scheme. The main idea is to replace the update of the current solution by minimising the upper confidence bound of the GP estimate (in the minimisation setting). The strategy is shown to obtain similar convergence rates as previous works. In addition, a "lookahead" variant of the algorithm is introduced. Finally, the proposed strategies are evaluated on synthetic benchmarks and RL tasks, showing competitive performance.

Strengths

Originality: The proposed algorithm is a variant of a prior approach (GIBO), which updates the iterate by minimising the UCB of the function instead of a local quadratic upper bound. The individual ideas are not necessarily novel (e.g. pessimistic estimates are quite commonly used in bandit/RL algorithms), but the combination with the local Bayesian optimisation scheme appears to be novel.

Quality:

  • The obtained bounds are slightly worse than prior works, but on the upside do not require the knowledge of a Lipschitz bound.
  • The experiments show good performance on the synthetic benchmarks and RL tasks

Clarity:

  • The algorithm and the motivation are clearly presented and overall the paper is well written, but it would still benefit from polishing. Some details could be explained better, as outlined below.

Significance: Local Bayesian optimization schemes have been a very successful approach to zero-order (noisy) black box optimisation, and the theoretical analysis of these algorithms is still lacking, although the groundwork has been laid in the prior work by Wu et al (2024). The paper therefore contributes to this landscape and the results are relevant at least for the Bayesian optimisation community and may inform better optimisation algorithms more broadly, or encourage further improved bounds.

Weaknesses

My main concerns are the following:

  • The minimization of the UCB score in the step update appears to be global in nature. First, the range of the min/argmin operations is never specified (e.g. in Algorithm 2, eqs. (4), (6), etc.), which is confusing to the reader. Second, the authors claim that the acquisition step leads to a "local" update (e.g. in lines 186-187; the claim in line 215 is not formally shown). What does it mean to be "local"? How is this formally proved?
  • While I agree with the general picture in Figure 1, and even if this is somewhat reflecting the general case, I believe one can construct kernels/examples where this "local" property is violated and the algorithm technically jumps to a different part of the domain (e.g. for linear kernel with near-zero slope, where observation data causes a sign switch and the step update moves to a different vertex of the domain).
  • Third, how can we efficiently find the minimizer of the UCB? We know that in general the UCB is not a convex function, so finding the argmin is as hard as in global Bayesian optimization without additional algorithmic constraints.
  • The discussion of prior works is missing several works, e.g. the LineBO papers [1,2], which also provide local convergence guarantees based on a gradient descent scheme and were shown in combination with trust regions.
  • The experiments lack several baselines (e.g. TuRBO or LineBO).
  • The introduction of the Lookahead strategy in 7 is a bit out of the blue and not well connected to the rest of the paper, and it does not come with theoretical guarantees. How is this related to Algorithm 1?
  • In general, the paper should better highlight the technical contributions and challenges related to the main result, and give more formal/precise definitions of the problem setting and results.

Minor remarks:

  • The abstract uses acronyms that have not been introduced (BO, GIBO).
  • The claim in line 28 that dimension 20 is a critical limit for Bayesian optimisation is somewhat arbitrary, even in lower dimensions convergence can be slow if the function is non-smooth etc.
  • Equation (2) and the surrounding discussion does not make sense to me: The right-hand side of (2) does not depend on x, and can therefore not used as acquisition function or to minimise over x.
  • The type setting of equation (1) looks odd - try using \langle and \rangle for the inner product.
  • Although not wrong, the discussion of UCB in line 164 is slightly misleading, as the classical UCB approach is used in the maximization setting, and the corresponding quantity in the context of this paper would be the lower confidence bound (LCB).
  • 233: gamma is not defined
  • 241: What is n ?
  • 187: "view" -> "viewed".

[1] Kirschner, J., Mutny, M., Hiller, N., Ischebeck, R., & Krause, A. (2019, May). Adaptive and safe Bayesian optimization in high dimensions via one-dimensional subspaces. In International Conference on Machine Learning (pp. 3429-3438). PMLR.

[2] Kirschner, J., Mutný, M., Krause, A., Coello de Portugal, J., Hiller, N., & Snuverink, J. (2022). Tuning particle accelerators with safety constraints using Bayesian optimization. Physical Review Accelerators and Beams, 25(6), 062802.

Questions

Can you define what it means to be "local" and how the update step satisfies such a property?

Limitations

The paper briefly discusses limitations, mainly the missing analysis of the lookahead approach.

Author Response

We thank the reviewer for the constructive feedback. The suggestions and questions are briefly responded below.

Weakness 1 and Question 1: The minimization of the UCB score in the step update appears to be global in nature... The range of min /argmin operations is never specified... What does it mean to be "local"?

Response: Thank you for your comment. In this paper, we would like to emphasize that the term 'local' does not refer to limiting the optimization within a small range, as in TuRBO, but rather to the algorithm exhibiting local behavior. Although the minimization is performed globally on the UCB, according to our explanation in the first point of the global rebuttal, the standard deviation term $\beta\sigma(x)$ actually acts as an additional penalty on the search, which limits the search to be near the current data set.

In our view, another main difference between global and local algorithms is whether the algorithm attempts to conduct global exploration. Local methods, such as approximate gradient descent methods like GIBO or MPD, or the well-known TuRBO, basically use greedy strategies to find a point with a lower function value than the previous iterate. They do not attempt to explore the entire space to find a possible global optimum. This strategy ensures that each step makes a descent and quickly converges to a local optimum. This is because when we have a sequence $\{f(x_1), \ldots, f(x_n), \ldots\}$ with $f(x_{i+1}) < f(x_i)$, it can be mathematically proven that this sequence converges, and at the same time $\{x_i, i = 1, \ldots\}$ (or at least a subsequence of it) converges to a point $x^*$. Under these algorithms, this $x^*$ is likely to be a local optimum or saddle point of the function. According to the first explanation of MinUCB in the global rebuttal, we can say that MinUCB also uses this greedy strategy and concentrates only on the descent of the function value, which ensures that MinUCB converges to local optima efficiently. MinUCB can also be explained through a trust region view; see the third point of the global rebuttal.
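To make the locality argument concrete, the following minimal sketch (our own illustration, not from the paper; the toy data, lengthscale, and $\beta$ value are assumptions) fits a 1-D GP to three observations and shows that the minimizer of $\mu(x) + \beta\sigma(x)$ lands near the queried points, because $\sigma$ saturates at the prior scale far from the data:

```python
import numpy as np

def rbf(a, b, ls=0.3):
    # squared-exponential kernel on 1-D inputs
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / ls) ** 2)

# toy observations clustered near x = 0.5 on f(x) = (x - 0.3)^2
X = np.array([0.4, 0.5, 0.6])
y = (X - 0.3) ** 2

K = rbf(X, X) + 1e-6 * np.eye(len(X))
Kinv = np.linalg.inv(K)

grid = np.linspace(-2.0, 3.0, 1001)
Ks = rbf(grid, X)
mu = Ks @ Kinv @ y                                   # posterior mean (zero prior mean)
var = 1.0 - np.einsum('ij,jk,ik->i', Ks, Kinv, Ks)   # posterior variance
sigma = np.sqrt(np.clip(var, 0.0, None))

beta = 2.0
x_next = grid[np.argmin(mu + beta * sigma)]
# far from the data sigma saturates near the prior scale of 1, so the
# beta * sigma penalty dominates there and the UCB minimizer stays local
```

Here `x_next` falls within the neighbourhood of the observed points, which matches the penalty interpretation of $\beta\sigma(x)$ given in the response above.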

Weakness 2: ...One can construct kernels/examples where this "local" property is violated...e.g. for linear kernel...

Response: Thank you for your comment. We believe that in the example you mention, i.e. under a linear kernel, it is indeed possible for the algorithms to take a large search step. However, in this example the global optimum and the local optimum coincide (if the optimum is unique). When some local information about the Gaussian process is known, such as the current function value being higher than the GP prior mean, or the gradient norm being large nearby, MinUCB and LA-MinUCB will take a relatively large step, since the local information indicates that the local optimum is some distance from the current point. However, this is not a global search: it only looks for a point better than the current one using the information in the Gaussian process. This greedy strategy quickly approaches the local optimum of the current region, which is its advantage over other methods, as it better utilizes the information in the Gaussian process.

Weakness 3: How can we efficiently find the minimizer of the UCB?

Response: In our numerical experiments, we only used the built-in optimizer in BoTorch without any special range restrictions, and we observed no significant numerical instability when minimizing the UCB. If the dimensionality is very high, we believe the result of the previous iteration can be used as a starting point to find a good local optimum of the UCB, which at least achieves a better result than the previous step.
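As a rough illustration of this warm-starting idea (our own sketch, not the authors' implementation; the helper `minimize_ucb`, the restart count, the toy objective, and the step size are all hypothetical), one can seed a multistart finite-difference descent with the previous iterate:

```python
import numpy as np

def minimize_ucb(ucb, x_prev, bounds=(-2.0, 2.0), n_restarts=8,
                 steps=200, lr=0.05, h=1e-5, seed=0):
    """Warm-started multistart minimization of a (possibly nonconvex) UCB.

    One run starts at the previous iterate x_prev, the rest at random
    points; each is refined by finite-difference gradient descent and the
    best candidate wins.
    """
    rng = np.random.default_rng(seed)
    d = x_prev.size
    starts = [np.asarray(x_prev, float)]
    starts += [rng.uniform(*bounds, d) for _ in range(n_restarts - 1)]
    best_x, best_v = None, np.inf
    eye = np.eye(d)
    for x0 in starts:
        x = x0.copy()
        for _ in range(steps):
            # central-difference gradient estimate of the UCB surface
            g = np.array([(ucb(x + h * eye[i]) - ucb(x - h * eye[i])) / (2 * h)
                          for i in range(d)])
            x = np.clip(x - lr * g, *bounds)
        v = ucb(x)
        if v < best_v:
            best_x, best_v = x, v
    return best_x, best_v

# toy nonconvex UCB surface with its minimum near x = 0.95
ucb = lambda x: float((x[0] - 1.0) ** 2 + 0.3 * np.sin(5.0 * x[0]))
x_star, v_star = minimize_ucb(ucb, x_prev=np.array([0.0]))
```

In practice one would replace the toy `ucb` with the GP posterior $\mu + \beta\sigma$ and the finite-difference loop with a gradient-based optimizer such as the one built into BoTorch; the point of the sketch is only the warm start at `x_prev` plus random restarts.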

Weakness 4: ...Missing several works, e.g. the LineBO papers... lack several baselines (e.g. TuRBO or LineBO)

Response: We are very sorry for not including an introduction to the LineBO line of work. We will include it in a future version of the paper. We have added a comparison with TuRBO in Figs. 1, 2, and 3 of the PDF attached to the rebuttal, but due to time constraints we are unfortunately unable to present comparison results with LineBO in this rebuttal. We will add the relevant experiments in a future version of the paper.

Weakness 5: The introduction of the Lookahead strategy...not well connected to the remaining paper...

Response: Thanks for your comment. Since the main contribution of our paper is to build the relationship between gradient descent and minimizing UCB, and MinUCB still partially depends on the gradient (for the local exploration), a natural question is whether it is possible to develop an algorithm that better utilizes the concept of UCB. Thus we apply a lookahead strategy to build LA-MinUCB, and the experimental results are very competitive. Proving the convergence of LA-MinUCB is an important direction for future research.

Weakness 6: In general, the paper should better highlight the technical contributions and challenges related to the main result...

Response: We are sorry for not clearly describing our contributions and challenges in the paper. We think the main contribution of this paper is precisely to build the relationship between minimizing UCB and gradient descent under the Gaussian process surrogate, and to show that minimizing UCB brings extra efficiency in local search. We also develop the algorithms MinUCB and LA-MinUCB to exploit this idea. The theoretical proof for MinUCB is a big challenge and is technically non-trivial; we have developed several tools to prove the convergence of this algorithm.

Weakness 7: Minor remarks

Response: We apologize for some of the inaccurate statements and will improve them in future versions of the paper. $\gamma$ is a coefficient in the Matérn kernel, and $n$ is the number of samples.

Comment

I'd like to thank the authors for their response and for clarifying several issues.

I appreciate that TuRBO was added as a baseline. The proposed approach remains competitive in the reported experiments.

It is also clearer now what "local" means, even though the notion is not fully defined formally. As far as I understand, the goal is to be "local" in the kernel metric, and "efficient" only refers to sample efficiency, although empirically there are likely also computational benefits. I think this should be made clear in the paper.

One weakness remains: the lookahead strategy (which performs best in the experiments) was not analysed and is not clearly motivated from the main algorithm.

This paper still has potential for improvements and therefore remains borderline, however I will raise my final score given the response by the authors.

Review (Rating: 7)

The paper presents an extension of a local Bayesian optimization strategy, specifically targeting methods based on gradient information (GIBO). GIBO-type algorithms operate through two stages: an exploitation step and an exploration step. The paper introduces two novel algorithms within this framework.

In GIBO, the exploitation step is performed using gradient descent. The authors identify that gradient descent can be viewed as minimizing a quadratic upper bound. They further note that minimizing the tighter upper confidence bound (UCB) can exploit more information from the GP surrogate model. Consequently, they replace the gradient descent step in GIBO with UCB minimization, resulting in a new algorithm named MinUCB. The paper provides a convergence analysis for MinUCB.

The second algorithm modifies the exploration phase of GIBO to optimally query points using a look-ahead strategy, aligning with the new exploitation approach. Both algorithms are empirically evaluated on some established benchmarks.

Strengths

  • The paper presents a clear, well-motivated, and relevant contribution to the field of local BO.
  • The related work is thoroughly discussed, situating the paper's contribution within the context of existing research.
  • The theoretical analysis is interesting and significantly strengthens the algorithmic contribution.
  • The proposed algorithms improve the state of the art for GIBO-style algorithms.

Weaknesses

Main aspects

Locality: In contrast to GIBO and MPD, the exploitation step is standard UCB. Therefore $x_{t+1}$ does not need to be in the local region around $x_t$. In fact, this will only be the case if the algorithm is initialized in a region that is lower than the mean. I hypothesize that the "exploitation" step of MinUCB will explore globally until it finds a "promising region". I don't see this as a fundamental problem with the algorithm; it may even allow MinUCB to switch between local and global exploration, which could be beneficial and avoid getting stuck in local optima. However, the paper does not discuss this behavior or explain why we should expect MinUCB to exhibit local-only exploration. Including such a discussion and an empirical investigation of how local MinUCB actually is could significantly improve the paper.

Empirical evaluation: The empirical evaluation is limited, missing important details, and is generally not well executed.

  • The algorithm is evaluated on a subset of the problems in Nguyen et al. and does not introduce any new benchmarks. The low number of problems makes it hard to assess how efficient the algorithm really is, especially since the Hopper experiments seem problematic due to state normalization. No method is able to "solve" the hopper task, and the achieved reward is very low.
  • There are unexplained differences in the results on the synthetic benchmarks compared to those reported by Nguyen et al., where MPD and GIBO have similar performance after 500 evaluations. In the new results, GIBO outperforms MPD.
  • There are important details missing:
    1. Batch sizes
    2. Hyperpriors and hyperparameters
  • The code is not available for review, which hinders the evaluation of the experiments at the time of review.
  • The meaning of the error bars changes between plots.
  • A sensitivity study is missing. How sensitive is the algorithm to the choice of the newly introduced hyperparameter, namely the $\beta$ in the UCB?

Limitations: The paper does not discuss the limitations of the proposed method and its evaluation.

Minor

  • In Theorem 1, $n$ is never introduced.
  • The paper often cites arXiv preprints when peer-reviewed versions of the articles are available.
  • There are some grammar and spelling mistakes/inaccuracies:
    • "..solve the high dimensional black-box.. problem" should be plural: "..solve high dimensional black-box.. problems" (there are many such problems).
    • "Gradient descent" should be lower case.
    • "large estimate to ensure the convergence" should be: "large estimate to ensure convergence" (we are talking about general convergence behavior).
  • The typesetting of "L-smooth" is off, perhaps due to being typed in math mode.
  • Some words/abbreviations are typeset in math mode (e.g., in equation 3, "UCB"; in equation 5, "trace").
  • Equation 1 is hard to parse due to the nested inequality and scalar product symbols.
  • Figure 3: "LA-MinUCB has consistently optimal performance." In what sense is the performance optimal? This claim seems premature and needs clarification.

Questions

  1. In line 54 and later in the evaluation the authors claim "MPD ... can exhibits numerical instability.." but do not follow up on this claim. What numerical instabilities were observed?
  2. Why limit the experiments to a subset of the problems in Nguyen et al.?
  3. Why choose constant batch sizes and beta when the theory predicts choices with guaranteed convergence? Why deviate from the theoretically obtained values?
  4. Why are the results on the synthetic benchmarks significantly different from those reported in Nguyen et al.?
  5. How were the batch sizes chosen?

Limitations

The limitations of the algorithm are not discussed and need to be addressed. For instance, the paper derives convergence for specific choices of beta and batch sizes, but chooses different values in the empirical results. This discrepancy raises the question of whether there is a gap between theory and practice. Discussing these limitations and the rationale behind the chosen parameters in the experiments would provide a more comprehensive understanding of the algorithm's performance and applicability.

Author Response

We thank the reviewer for the constructive feedback. The suggestions and questions are briefly responded below.

Weakness 1: Locality: I hypothesize that the "exploitation" step of MinUCB will explore globally until it finds a "promising region"... switch between local and global exploration

Response: Thank you for your comments and suggestions. We acknowledge that when the initial point's value is higher than the prior mean, there may be a larger step size in the first one or two exploitation steps. However, this is a fast local search rather than global exploration. As mentioned in points 2 and 3 of the global rebuttal, the local exploitation in each step only tries to find a point with a lower value than the previous one, which does not involve global exploration. For the global-local balance, it is possible to use a multistart strategy, similar to TuRBO. However, we believe the main contribution of this paper is to propose the relationship between minimizing UCB and gradient descent under a Gaussian process surrogate and to give the corresponding theory, so we do not emphasize the global-local balance in this paper.

Weakness 2 and Question 2,5: Empirical evaluation: details, no new benchmarks, error bars, open source code

Response: Thank you for your comments. In our experimental settings, we basically used the same settings as GIBO and MPD. For example, the hyperprior is a uniform distribution on an interval, and the hyperparameters are set the same as in GIBO and MPD. For MinUCB, the batch size for local exploration is the data dimension $d$. For LA-MinUCB, we set the batch size to $d$ in the synthetic experiments, and empirically find that a batch size of $0.2d$ on the reinforcement learning tasks gives good results.

We apologize for the lack of experiments in the original submission. We have added other real world objectives mentioned in MPD [1]. The experimental results can be seen in Fig 3 in the pdf attached to the rebuttal.

The different error bars are mainly due to the significant differences between sampled functions in the synthetic data; plotting them directly results in large variances at different points, so we applied scaling for better visual effect. We are sorry for not making the code public, because we were afraid it might leak author information, but we guarantee that the experimental results are accurate, and we will make the code public after the acceptance of the paper.

Weakness 3: Lack of sensitivity study

Response: We are sorry for the lack of sensitivity experiments, which will be included in a future version; the PDF for this rebuttal does not have enough space for the results. Overall, $\beta$ controls the convergence speed and the final convergence result. When $\beta$ is relatively small, convergence is fast, but the final result may be slightly weaker than with a large $\beta$. This is quite similar to gradient descent with a large or small stepsize: a small $\beta$ means the local exploitation is more aggressive but may lose some accuracy, which can cause the algorithm to jump around the local optimum.

Weakness 4 and Question 1,4: The difference between our experimental result and Nguyen. et al[1]

Response: Thank you for your comments. When we ran experiments based on Nguyen's code, we found that MPD is very likely to get stuck at a specific point, and the subsequent search results may even get worse. This phenomenon is particularly significant in the synthetic experiments. The MPD results in our experiments are actually very similar to those of Nguyen et al., in that both achieve good results after the first batch of points. However, GIBO performed better in subsequent searches and is robust across different experiments. We speculate that this may be due to randomness or other unknown settings.

Question 3: Why choose constant batch sizes and beta

Response: The reason for selecting a fixed batch size and $\beta$ is to compare with previous methods. The previous methods used fixed parameters, so we also use fixed parameters for a fair comparison. Although it is theoretically necessary to increase the batch size and $\beta$ to ensure convergence, fixed parameters can in practice achieve good results (although there remains a small gap between the result and the true local optimum).

[1] Quan Nguyen, Kaiwen Wu, Jacob Gardner, and Roman Garnett. Local Bayesian optimization via maximizing probability of descent. Advances in Neural Information Processing Systems, 35:13190–13202, 2022.

Comment

We extend our heartfelt gratitude for the time and effort you have dedicated during the review process. We would greatly appreciate it if you could examine the comments we have returned. We are eagerly anticipating your feedback and are looking forward to the possibility of engaging in further discussion with you.

Review (Rating: 3)

The author proposed a local Bayesian optimization method using UCB to drive its iterates. The UCB step replaces the gradient descent step that is typically in such methods.

Strengths

The comparison of gradient descent to UCB is very interesting. The author included analysis as well, so there is enough content in the paper. The assumptions on kernels are interesting and not straightforward. Clearly, effort has been made to establish convergence, though I think some of the theorems in the appendix would be better labeled lemmas.

Weaknesses

The biggest concern I have is that the algorithm is too similar to Bayesian optimization using UCB. The author tried to add more technical development, such as the look-ahead algorithm. But the main difference between the proposed algorithm and GIBO is that the next iterate is found via UCB. In doing so, it's hard to argue the proposed algorithm is a gradient-based method anymore. And the proposed algorithm becomes too familiar.

Questions

  1. Please proofread the paper to check grammar and spelling. For example, line 151, 161.
  2. Line 171 "more likely to be lower than" should be "more likely to have a smaller value than".
  3. Line 269 "minimize the minimum point?"
  4. The gradient descent methods have a rich literature on how to choose the step size; it is not always $1/L$. The author should clarify that point.

Limitations

Yes.

Author Response

We thank the reviewer for the constructive feedback. The suggestions and questions are briefly responded below.

Weakness: The biggest concern I have is that the algorithm is too similar to Bayesian optimization using UCB. The author tried to add more technical development, such as the look-ahead algorithm. But the main difference between the proposed algorithm and GIBO is that the next iterate is found via UCB. In doing so, it's hard to argue the proposed algorithm is a gradient-based method anymore. And the proposed algorithm becomes too familiar.

Response: Thanks for your comment. In our work, we focus on a 'local' algorithm and show that the acquisition function of minimizing UCB provides an efficient local search approach. This is in contrast to the traditional minimization of the LCB (in the minimization context), where the focus is to directly balance the trade-off between exploration and exploitation for global optimization.

We achieve this by first building the relationship between minimizing UCB and gradient descent under the Gaussian process surrogate, and showing that minimizing UCB from this view brings extra efficiency in local search. This provides an interesting idea: if we replace gradient descent with minimizing UCB, we may obtain alternative, efficient local BO algorithms. That is why we propose our two algorithms, MinUCB and LA-MinUCB, to illustrate this idea. As you mentioned, the proposed algorithm is not a gradient-based method anymore, but this is actually our final goal. The point selection strategy in our algorithm behaves similarly to gradient descent. The coefficient $\beta$ in the UCB controls the search area: a larger $\beta$ forces the algorithm to search in a smaller area, which is similar to the stepsize in gradient descent. Our algorithm design uses the ideas of previous approximate gradient algorithms, and we apply minimizing UCB to better utilize the Gaussian process, which allows our method to integrate the advantages of both aspects. This makes our methods look similar to UCB, but in fact they are fundamentally different, in that we transform the UCB concept from a global one into a local one. The experimental results also indicate that this combination brings significant performance improvements, demonstrating the effectiveness of our idea.

Questions 1, 2, 3: grammar and spelling

Response: Thank you for correcting the spelling and grammar errors in the paper; we will make all of these revisions in the next version.

Question 4: Gradient descent methods have a rich literature on how to choose the step size; it is not always $1/L$. The authors should clarify that point.

Response: Thank you for your suggestion. In this paper, we mainly want to explain the relationship between minimizing the UCB and gradient descent. Some concepts may not have been expressed clearly, and we will revise the relevant descriptions in the next version of the paper.

Comment

We extend our heartfelt gratitude for the time and effort you have dedicated during the review process. We would greatly appreciate it if you could examine our responses. We look forward to your feedback and to the possibility of further discussion with you.

Comment

Dear author,

Thank you for the rebuttal. Sorry for the late reply.

I understand your motivation. But algorithm-wise, it seems to me your algorithm is still UCB BO. Your rebuttal did not add any information that would refute this, in my opinion. Correct me if I am wrong. Even down to the choice of $\beta$ for the convergence analysis. I understand how you have come to UCB from the gradient descent algorithms. But you end up with an existing algorithm, albeit with some tweaks. I appreciate a lot of the extra work you put in, but to me this is clearly an upgrade to UCB. Only now your convergence has actually been weakened to a gradient-optimality (KKT) condition, which does not even guarantee a local minimum. I really appreciate the KKT condition, in accordance with the "local" emphasis. But in my humble opinion, going from a global algorithm (UCB) to a local algorithm is not necessarily an upgrade. I know you are just comparing to existing gradient descent type methods. But you have to compare to UCB because that's what you ultimately proposed.

Correct me if I am wrong.

Thank you.

Comment

I forgot to mention that in your Theorem 1, $\delta$ appears out of nowhere. You should define it first. I assume the theorem is valid with probability $\geq 1-\delta$? Please correct if so. Thank you!

Comment

Dear Reviewer,

Thanks for your fast response and for further clarifying your concern. We appreciate your view and perspective on the UCB, which is quite relevant to our topic. We will answer your questions from the following three aspects:

Why do we focus on a local approach?

In this paper, we want to emphasize that our focus is on the local approach rather than on searching for the global optimum. The reason is that finding the global optimum is nearly impossible in the high-dimensional case, the so-called curse of dimensionality, and typical global algorithms perform poorly on high-dimensional problems. Take traditional UCB BO algorithms as an example: they try to balance exploration and exploitation by maximizing the UCB (or minimizing the LCB). In high-dimensional problems, this 'exploration' may become 'over-exploration'. Traditional UCB BO spends almost all of its points learning every uncertain part of the function, so the algorithm approaches the global optimum very slowly. This is also reflected in [1], where the regret bound of UCB BO grows exponentially with the data dimension $d$. Traditional UCB BO also performs poorly in high-dimensional experiments. Following your latest suggestion, we applied traditional UCB BO to our synthetic experiments and found that it can hardly achieve any good value there. Similar experimental results can be found in Wu et al. [2].

On the contrary, local approaches, like approximate gradient methods [2,3] or trust region methods [4], only search within a small area, which effectively reduces over-exploration. Besides, a local optimum can be found relatively quickly, and exponential sample complexity is only necessary if one wishes to enumerate all local optima [2]. Although local approaches only target local optima, this type of method still outperforms traditional global methods in high dimensions [2,3,4]. That is why we mainly focus on local optimization algorithms for high-dimensional problems. As 'global' and 'local' are two completely different concepts, global and local methods cannot be equated.

How do we transform the traditional UCB concept into a local one?

Following the statements above, we look at the local approach (focused on local optimization), and specifically methods based on approximate gradient descent, as they have proven effective in local BO. Our objective is to improve the efficiency of this type of gradient-based approach. We first observed that there is a relationship between minimizing the UCB and gradient descent under the Gaussian process surrogate, and showed that minimizing the UCB brings extra efficiency to local search. Minimizing the UCB is a rather conservative strategy that focuses only on local exploitation (please refer to viewpoint 1 of the Global Rebuttal); it is a new idea and is completely different from the concept in the traditional UCB BO method (which still performs exploration). We replace gradient descent with UCB minimization to obtain this extra efficiency, and thus propose our two algorithms, MinUCB and LA-MinUCB, to illustrate the idea.

It should also be noted that our MinUCB achieves a convergence rate that is polynomial in the data dimension $d$; this result cannot be obtained with traditional UCB BO, because traditional UCB BO converges only to the global optimum and its regret bound grows exponentially with $d$. That is also why our method performs very well in high dimensions, and why our results far exceed those of the original UCB BO in these cases.

How can global search be done in the high-dimensional case?

As traditional BO methods like UCB BO fail in the high-dimensional case, a local search with adaptive restarts (similar to TuRBO [4]) or a mixed global and local search built on our local algorithm can be much more efficient. According to the results in [2], local search with multiple starting points can effectively approximate the global optimal value without a large gap, and is very competitive with global methods.

[1] Srinivas, Niranjan, et al. "Gaussian Process Optimization in the Bandit Setting: No Regret and Experimental Design." arXiv preprint arXiv:0912.3995 (2009).

[2] Wu, Kaiwen, et al. "The Behavior and Convergence of Local Bayesian Optimization." Advances in Neural Information Processing Systems 36 (2024).

[3] Müller, Sarah, Alexander von Rohr, and Sebastian Trimpe. "Local Policy Search with Bayesian Optimization." Advances in Neural Information Processing Systems 34 (2021): 20708-20720.

[4] Eriksson, David, et al. "Scalable Global Optimization via Local Bayesian Optimization." Advances in Neural Information Processing Systems 32 (2019).

Author Response

We thank the reviewers for their constructive comments and suggestions, which have greatly helped us improve the paper. We would like to give a clearer explanation of two aspects: what local behaviour means, and why our algorithms are local.

In this paper, we would like to emphasize that the term 'local' does not refer to limiting optimization to a small range or a specific interval, as in TuRBO, but rather to the algorithm exhibiting local behavior. More specifically, the algorithm only searches points around the current dataset, without doing global exploration.

The locality of minimizing the UCB: Compared with minimizing the LCB, minimizing the UCB is a very conservative strategy that places an additional penalty on exploration. Specifically, the LCB is small when $\mu(x)$ is small or $\beta\sigma(x)$ is large, while the UCB is only small when both $\mu(x)$ and $\beta\sigma(x)$ are small. Consider a point far away from the current dataset: since there has been no exploration around it, $\beta\sigma(x)$ will be very large, so the minimum of the UCB cannot be attained there. The meaning of minimizing the UCB is therefore that it searches towards points with smaller $\mu(x)$, while $\beta\sigma(x)$ controls the search distance, forcing the search to stay around the current dataset; this makes minimizing the UCB a local strategy.
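This asymmetry between the two bounds can be checked numerically. The sketch below is our own illustration (a hand-rolled RBF Gaussian process on a toy objective; the data locations, length-scale, and $\beta$ are arbitrary choices): the minimizer of the UCB stays next to the observed points, while the minimizer of the LCB runs off to the unexplored region where $\sigma(x)$ is large:

```python
import numpy as np

def rbf(A, B, ls=0.5):
    # Squared-exponential kernel between two sets of 1-D points.
    return np.exp(-0.5 * (A[:, None] - B[None, :]) ** 2 / ls ** 2)

def gp_posterior(X, y, Xs, noise=1e-6):
    # Exact zero-mean GP posterior mean and standard deviation on grid Xs.
    K = rbf(X, X) + noise * np.eye(len(X))
    Ks = rbf(Xs, X)
    mu = Ks @ np.linalg.solve(K, y)
    var = 1.0 - np.sum(Ks * np.linalg.solve(K, Ks.T).T, axis=1)
    return mu, np.sqrt(np.clip(var, 0.0, None))

X = np.array([-0.5, 0.0, 0.6])          # all observations cluster near 0
y = X ** 2 + 1.0                         # toy objective values (all positive)
Xs = np.linspace(-3.0, 3.0, 601)
mu, sd = gp_posterior(X, y, Xs)

beta = 2.0
x_ucb = Xs[np.argmin(mu + beta * sd)]    # minimize UCB: pure exploitation
x_lcb = Xs[np.argmin(mu - beta * sd)]    # minimize LCB: rewards uncertainty
nearest = lambda x: np.min(np.abs(X - x))
print(nearest(x_ucb), nearest(x_lcb))
```

Far from the data, $\mu \approx 0$ and $\sigma \approx 1$, so the LCB there is about $-\beta$ (attractive), while the UCB is about $+\beta$ (repulsive), which is exactly the locality argument above.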

Why is MinUCB (Algorithm 1) local? (gradient descent view):

Our MinUCB can be viewed as an enhanced version of gradient descent with a larger descent in each step. If we take $x_t$ to be the result of the $t$-th local exploitation (step move, line 8 in Alg 1), then what we actually proved in Theorem 1 is the following inequality:

$$f(x_{t+1}) \le \min_{x\in\mathcal{X}} \mu(x) + \beta_t \sigma(x) \le f(\hat{x}^{t+1}) + e_t < f(x_t) + e_t$$

where $\hat{x}^{t+1} = x_t - \eta_t \nabla\mu(x_t)$, the step size $\eta_t$ decreases at a logarithmic rate in $t$, and the error term $e_t$ from the proof decreases to 0. Here $\hat{x}^{t+1}$ is the gradient-descent point starting from $x_t$. The above inequality shows that MinUCB also adopts a greedy strategy: it only tries to find a point that is better than the previous one. This reflects the locality of MinUCB, as it focuses only on a reliable improvement in each step, instead of exploring for a potential global optimum as minimizing the LCB does. This greedy strategy also guarantees that MinUCB converges to a local optimum at a polynomial rate.
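The middle inequality, that the UCB minimizer scores at least as well under the UCB as the gradient-descent point $\hat{x}^{t+1}$, can be illustrated numerically. The sketch below is our own toy setup (the objective, step size, finite-difference gradient, and grid are illustrative choices, not the paper's implementation):

```python
import numpy as np

def rbf(A, B, ls=0.5):
    # Squared-exponential kernel between two sets of 1-D points.
    return np.exp(-0.5 * (A[:, None] - B[None, :]) ** 2 / ls ** 2)

def gp_posterior(X, y, Xs, noise=1e-6):
    # Exact zero-mean GP posterior mean and standard deviation at points Xs.
    K = rbf(X, X) + noise * np.eye(len(X))
    Ks = rbf(Xs, X)
    mu = Ks @ np.linalg.solve(K, y)
    var = 1.0 - np.sum(Ks * np.linalg.solve(K, Ks.T).T, axis=1)
    return mu, np.sqrt(np.clip(var, 0.0, None))

X = np.array([0.4, 0.8, 1.2])            # observations around the incumbent
y = X ** 2 + 1.0                          # toy objective x^2 + 1
x_t, eta, beta = 0.8, 0.1, 2.0            # incumbent, step size, UCB weight

# Finite-difference gradient of the posterior mean at x_t.
h = 1e-4
mu_pair, _ = gp_posterior(X, y, np.array([x_t - h, x_t + h]))
grad = (mu_pair[1] - mu_pair[0]) / (2 * h)
x_gd = x_t - eta * grad                   # plain gradient-descent candidate

Xs = np.linspace(-1.0, 2.0, 601)
mu, sd = gp_posterior(X, y, Xs)
ucb = mu + beta * sd                      # UCB over the whole grid
mu_gd, sd_gd = gp_posterior(X, y, np.array([x_gd]))

# Whatever the gradient step achieves under the UCB, min-UCB matches or beats.
print(np.min(ucb), mu_gd[0] + beta * sd_gd[0])
```

This is the sense in which MinUCB is an "enhanced" gradient descent on the surrogate: the exhaustive minimization can only do better than any single gradient step.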

Why is MinUCB (Algorithm 1) local? (trust region view):

The trust region, as defined in TuRBO [1], is a hyperrectangle centered at the best solution found so far, which shrinks after too many consecutive "failures" and expands after many consecutive "successes". The adjustment of the trust region tries to find an area in which a search result $x$ satisfies $f(x)<f(x_t)$ with a certain probability, where $x_t$ is the current best point, so that the proportion of "successes" and "failures" can be controlled. From this view, we can construct a variant of the trust region through an upper-confidence-bound view: $$UCBTR_{x_t} \coloneqq \{x \mid \mu(x)+\beta\sigma(x)<f(x_t)\}$$ This region is more conservative than the traditional trust region, as any point in this area has a value lower than $f(x_t)$ with probability larger than $p$, and $\beta$ controls this probability $p$. Therefore, the local exploitation (step move, line 8 in Alg 1) can be viewed as searching for the reliable minimum point in this trust region: $$\min_{x\in\mathcal{X}} \mu(x)+\beta\sigma(x) \le \mu(x_t)+\beta\sigma(x_t) \approx f(x_t)$$ where the approximation on the right is guaranteed through local exploration (sampling, line 4 in Alg 1); this local exploration step learns more local information around $x_t$, which helps to expand or shrink the area $UCBTR_{x_t}$. In this view, MinUCB can be treated as a variant of a trust region method like TuRBO, where we replace the trust region with a probability subset. This probability subset is accurate and reliable, which brings additional efficiency in local search.
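The set $UCBTR_{x_t}$ and the effect of local exploration on it can be sketched numerically. The toy example below is our own illustration (the descending objective, $\beta$, and sample locations are arbitrary choices): a candidate point starts outside the region because $\sigma$ there is large, and sampling near it collapses $\sigma$ and pulls it into the region:

```python
import numpy as np

def rbf(A, B, ls=0.5):
    # Squared-exponential kernel between two sets of 1-D points.
    return np.exp(-0.5 * (A[:, None] - B[None, :]) ** 2 / ls ** 2)

def ucb(X, y, Xs, beta=2.5, noise=1e-6):
    # UCB mu + beta*sigma of a zero-mean GP on grid Xs.
    # beta=2.5 > f(x_t) keeps far-away, fully uncertain points out of the set.
    K = rbf(X, X) + noise * np.eye(len(X))
    Ks = rbf(Xs, X)
    mu = Ks @ np.linalg.solve(K, y)
    var = 1.0 - np.sum(Ks * np.linalg.solve(K, Ks.T).T, axis=1)
    return mu + beta * np.sqrt(np.clip(var, 0.0, None))

f = lambda x: 2.0 - x                     # toy descending objective
X = np.array([-1.0, -0.5, 0.0])           # sampled so far; incumbent x_t = 0
y = f(X)
f_t = f(0.0)                              # incumbent value, = 2.0
Xs = np.linspace(-1.5, 2.0, 701)

# UCB trust region: points whose UCB already beats the incumbent value.
before = ucb(X, y, Xs) < f_t

# One step of local exploration: observe a nearby point at x = 0.5.
X2, y2 = np.append(X, 0.5), np.append(y, f(0.5))
after = ucb(X2, y2, Xs) < f_t
print(before.sum(), after.sum())
```

Before sampling, the candidate at $x=0.5$ is excluded because $\beta\sigma$ dominates; after sampling, $\mu(0.5)+\beta\sigma(0.5) \approx f(0.5) < f(x_t)$, so the region now contains it.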

Why is LA-MinUCB (Algorithm 2) efficient?: Compared with MinUCB, LA-MinUCB adopts a greedier strategy in local exploration. In addition to learning local information, the local exploration step (line 3 in Alg 2) aims at obtaining a better reliable minimum point in the next local exploitation through a look-ahead strategy. This can produce a larger decrease of the function value than MinUCB in each step, which is also reflected in the numerical experiments.
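One cheap way to sketch such a look-ahead step (a simplification for illustration, not the paper's exact LA-MinUCB acquisition) exploits the fact that the GP posterior variance update does not depend on the observed value: fantasizing the new observation at its posterior mean leaves $\mu$ unchanged, so each candidate exploration point can be scored by the min-UCB it would yield after the variance update alone. All names and choices below (grid, kernel, $\beta$) are our own assumptions:

```python
import numpy as np

def rbf(A, B, ls=0.5):
    # Squared-exponential kernel between two sets of 1-D points.
    return np.exp(-0.5 * (A[:, None] - B[None, :]) ** 2 / ls ** 2)

def gp_posterior(X, y, Xs, noise=1e-6):
    # Zero-mean GP posterior mean and full covariance on grid Xs.
    K = rbf(X, X) + noise * np.eye(len(X))
    Ks = rbf(Xs, X)
    mu = Ks @ np.linalg.solve(K, y)
    C = rbf(Xs, Xs) - Ks @ np.linalg.solve(K, Ks.T)
    return mu, C

X = np.array([-0.5, 0.0, 0.6])
y = X ** 2 + 1.0
Xs = np.linspace(-2.0, 2.0, 201)
beta = 2.0
mu, C = gp_posterior(X, y, Xs)
s2 = np.clip(np.diag(C), 0.0, None)
current = np.min(mu + beta * np.sqrt(s2))      # min-UCB before exploring

def lookahead(j, noise=1e-6):
    # Fantasized variance after observing candidate j (rank-one update);
    # the mean is unchanged when the fantasy value equals the posterior mean.
    s2_new = np.clip(s2 - C[:, j] ** 2 / (C[j, j] + noise), 0.0, None)
    return np.min(mu + beta * np.sqrt(s2_new))

la = np.array([lookahead(j) for j in range(len(Xs))])
j_best = int(np.argmin(la))                    # greedy look-ahead choice
print(Xs[j_best], current, la[j_best])
```

Since the variance can only shrink, the look-ahead min-UCB never exceeds the current one; the greedy choice picks the exploration point whose uncertainty reduction lowers it the most, which is the spirit of the look-ahead exploration step.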

Final Decision

I am willing to recommend acceptance for this paper over the single reject. However, I do think there are a few ways, beyond what has already been discussed in review, in which the paper could be improved:

  • First, to address the main criticism of the remaining negative review, why not simply run an off-the-shelf implementation of UCB, e.g. via a simple BoTorch BO loop or similar? The difference would presumably become extremely clear and immediately resolve this issue.

  • Second, regarding reviewer 9Krt's final point, I would strongly recommend that the authors add an ablation study of $\beta$. This should be straightforward given the relative cheapness of these baselines.