Density Ratio Estimation-based Bayesian Optimization with Semi-Supervised Learning
Abstract
Reviews and Discussion
In this article, the authors consider using a classifier to perform Bayesian optimization rather than regression models. To avoid issues with overconfident predictors, the proposition is to rely on unobserved inputs (unlabeled data) and propagate the labels to this set before selecting the next query point among them. This selection is based on maximizing the probability of belonging to the class with the lowest values. Unlabeled data can be generated by sampling. A benchmark is provided against some related methods.
update after rebuttal
I thank the authors for their response. They answered my comments, with a focus on DRE-based BO approaches. Yet a more general discussion of regression vs classification, plus comparisons to state-of-the-art BO, would be beneficial to enlarge the discussion.
Questions for Authors
- Can you state mathematically the class of optimization problem considered early on?
- Can you provide running times for the benchmark experiments?
Claims and Evidence
Given observation values, the regression problem is supposedly simpler than a classification one obtained by removing function values above or below a threshold. Why this would work better in some cases should be better explained.
Methods and Evaluation Criteria
The choice of synthetic test functions is not explained. Detailed results based on features of the considered functions would be welcome; see for instance Diouane, Y., Picheny, V., Riche, R. L., & Perrotolo, A. S. D. (2023). TREGO: a trust-region framework for efficient global optimization. Journal of Global Optimization, 86(1), 1-23.
Theoretical Claims
Only a preliminary analysis is given.
Experimental Design and Analysis
The analysis is mostly valid. It is hard to conclude anything from Figure 5 without more repetitions.
Supplementary Material
Yes, most sections.
Relation to Existing Literature
The paper connects BO with label propagation and unsupervised learning.
Missing Essential References
Considering ranks only is studied, for instance by Picheny, V., Vakili, S., & Artemev, A. (2019). Ordinal bayesian optimisation. arXiv preprint arXiv:1912.02493.
GPs for classification are available as well, see Williams, C. K., & Rasmussen, C. E. (2006). Gaussian processes for machine learning (Vol. 2, No. 3, p. 4). Cambridge, MA: MIT press.
Other Strengths and Weaknesses
The empirical comparison does not include state-of-the-art BO methods, such as:
- Eriksson, D., Pearce, M., Gardner, J., Turner, R. D., & Poloczek, M. (2019). Scalable global optimization via local Bayesian optimization. Advances in neural information processing systems, 32.
- Cowen-Rivers, A. I., Lyu, W., Tutunov, R., Wang, Z., Grosnit, A., Griffiths, R. R., ... & Bou-Ammar, H. (2022). Hebo: Pushing the limits of sample-efficient hyper-parameter optimisation. Journal of Artificial Intelligence Research, 74, 1269-1349.
- Ament, S., Daulton, S., Eriksson, D., Balandat, M., & Bakshy, E. (2023). Unexpected improvements to expected improvement for bayesian optimization. Advances in Neural Information Processing Systems, 36, 20577-20612.
Other Comments or Suggestions
Figures 1 and 8: problem with captions
update after rebuttal
Sorry, I did not catch the small difference between subcaptions
We thank the reviewer for the constructive feedback to improve our work. We answer the questions and concerns in the following.
Given observation values, the regression problem is supposedly simpler than a classification (skipped)
We think that this question is regarding the general advantages of DRE-based Bayesian optimization (BO). We want to address it from three different perspectives.
Firstly, one of the most popular surrogate models in regression-based BO is a Gaussian process (GP) [1]. The computational complexity of GP regression is $\mathcal{O}(n^3)$, where $n$ is the number of query points. As a result, DRE-based BO can be more efficient in terms of computational complexity than GP-based BO.
Secondly, classification can be viewed as a form of bounded regression. For example, using a logistic function, the output of a classification model is bounded in $(0, 1)$. Importantly, bounded regression usually makes model training easier than the unbounded regression-based formulation.
Lastly, the studies [2, 3] show that DRE-based BO is equivalent to regression-based BO under specific conditions; it has been discussed in Section 3. Consequently, this DRE-based strategy is built on the aforementioned advantages while preserving the abilities of the regression-based method.
We will clarify these in the final paper.
[1] Garnett, R. Bayesian Optimization. Cambridge University Press, 2023.
[2] Tiao, L. C., et al. BORE: Bayesian optimization by density-ratio estimation. ICML, 2021.
[3] Song, J., et al. A general recipe for likelihood-free Bayesian optimization. ICML, 2022.
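For illustration only, a minimal sketch of a BORE-style classifier acquisition is given below; the toy objective, classifier choice, and settings are assumptions for this sketch, not our actual implementation. Observed values are thresholded at the $\gamma$-quantile, a classifier is fit to the resulting binary labels, and the predicted probability of the low-value class is maximized as the acquisition function.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

def f(x):
    # toy 1-D minimization objective, assumed for illustration only
    return np.sin(3.0 * x) + 0.1 * x ** 2

# labeled query points observed so far
X = rng.uniform(-3.0, 3.0, size=(20, 1))
y = f(X).ravel()

# BORE-style labels: class 1 if the observed value is below the gamma-quantile
gamma = 0.25
tau = np.quantile(y, gamma)
z = (y <= tau).astype(int)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, z)

# acquisition: predicted probability of the low-value class, maximized over candidates
candidates = rng.uniform(-3.0, 3.0, size=(1000, 1))
acq = clf.predict_proba(candidates)[:, 1]
x_next = candidates[np.argmax(acq)]
```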
The choice of synthetic test functions is not explained. Can you state mathematically the class of optimization problem considered early on?
Following the criteria described in [4], Beale, Branin, and Six-Hump Camel belong to 4) multi-modal with adequate global structure, which is often considered as the main target problem of BO [4, 5]. Bukin 6 seems to belong to 5) multi-modal with weak global structure.
We have provided the mathematical forms of these benchmarks in Section E.2. Also, the details of the real-world benchmarks tested in this work are described in Sections E.3, E.4, and E.5.
[4] Diouane, Y., et al. TREGO: a trust-region framework for efficient global optimization. J. Glob. Optim., 2023.
[5] Jones, D. R., et al. Efficient global optimization of expensive black-box functions. J. Glob. Optim., 1998.
It is hard to conclude anything from Figure 5 without more repetitions.
We would argue that most BO algorithms can generally solve Tabular Benchmarks without significant difficulty. However, our methods still show competitive or better performance compared to the baselines.
Considering ranks only is studied (skipped) GPs for classification are available as well
We will cite them by adding sentences like these:
Ordinal Bayesian optimization [6] proposes a novel Bayesian optimization framework that relies on variable ordering in a latent space to handle ill-conditioned or discontinuous objectives.
GP classification [7] can be used in DRE-based Bayesian optimization.
[6] Picheny, V., et al. Ordinal Bayesian optimisation. arXiv preprint arXiv:1912.02493, 2019.
[7] Rasmussen, C. E. and Williams, C. K. I. Gaussian Processes for Machine Learning. MIT Press, 2006.
The empirical comparison does not include state-of-the-art BO methods
We appreciate your constructive suggestions. However, because our work focuses on addressing the over-exploitation problem in DRE-based BO, the main competitors of our methods are other DRE-based BO approaches. We think that adding such BO methods does not change the core message of our work. Nevertheless, we will cite and discuss those methods in the final version.
Figure 1 and 8: problem with captions
Could you please specify the problem of the captions of Figures 1 and 8? We will fix this problem in the final version if you let us know.
Can you provide running times for the benchmark experiments?
We provide running times for the benchmark experiments. Because our methods are defined with similarity matrices and transition probabilities over all $l + u$ points, where $l$ and $u$ are the numbers of labeled and unlabeled points (see Eqs. 7, 8, and 9), they are relatively slower than the baselines. However, as discussed in our paper, our methods can mitigate the over-exploitation problem using both labeled and unlabeled points.
Furthermore, enhancing the efficiency of DRE-based BO with semi-supervised learning is left for future work.
| Method | Time (mean ± std.) |
|---|---|
| RF, BORE | 351.2 ± 51.7 |
| RF, LFBO | 392.0 ± 65.0 |
| GB, BORE | 17.5 ± 2.0 |
| GB, LFBO | 20.4 ± 3.2 |
| XGBoost, BORE | 59.2 ± 27.7 |
| XGBoost, LFBO | 58.9 ± 24.9 |
| MLP, BORE | 197.6 ± 78.1 |
| MLP, LFBO | 239.6 ± 89.6 |
| DRE-BO-SSL (LP) | 3850.5 ± 2314.8 |
| DRE-BO-SSL (LS) | 820.8 ± 193.0 |
This paper observes the overconfident, i.e., over-exploiting, behavior of supervised learning methods in DRE-BO. With unlabeled points available, semi-supervised learning is combined with DRE-BO to tackle this issue. The proposed method, DRE-BO-SSL, is empirically evaluated in two cases, unlabeled point sampling and a fixed-size pool, and shows promising performance.
update after rebuttal
My questions are answered by the authors. A revised version with those clarifications will improve this paper.
Questions for Authors
I asked all questions above and my major concerns are stated in weaknesses.
Claims and Evidence
I am sort of confused about the relationship between the experiments and the motivation. DRE-BO-SSL is motivated by the overconfidence of DRE-BO-SL, giving the impression that the performance of DRE-BO-SL is limited by the lack of sufficient exploration, which makes those SL-based methods keep improving the current best solution while missing the real global optima in uncertain areas. However, the experiments show that DRE-BO-SSL has better optimization performance than the other baselines, which is definitely good for the BO community but somehow mismatches the initial motivation.
Methods and Evaluation Criteria
Extending the SL part to SSL in DRE-BO seems quite natural to me. The evaluation criteria on optimization performance are sufficient, while evaluations of the improvements in exploration are missing.
Theoretical Claims
This paper contributes in a methodology and empirical way and there are no theoretical claims to check.
Experimental Design and Analysis
This paper conducts thorough experiments, applying two SSL techniques, LP and LS, to scenarios with/without unlabeled points on various benchmarks. The effects of different hyperparameters are also discussed and tested in the supplementary materials.
Supplementary Material
I reviewed all supplementary materials.
Relation to Existing Literature
This paper introduces semi-supervised learning to DRE-based BO, which could be seen as a combination of two literatures.
Missing Essential References
From my perspective, this paper already included essential references.
Other Strengths and Weaknesses
Strengths
- This paper combines DRE-BO with SSL, which is novel.
- This paper is well-written and easy to follow.
- Experiments are thoroughly conducted.
Weaknesses
- Like I mentioned before, the motivation is not well treated in the experiments. A better regret curve does not mean the method explores the search space better.
- Unclear choice of hyperparameters. Although the authors reported results with different values of hyperparameters, those results are inconsistent across benchmarks. For example, Figure 11 cannot tell which number of unlabeled points should be chosen; Figure 12 does not give clear guidance about which sampling strategy should be used. Figure 14 suggests to me that choosing a subset of size 10 is enough, which is counter-intuitive.
Other Comments or Suggestions
- For the illustrative example in Figure 1, it indeed shows that BORE and LFBO focus on the area around the current best, while DRE-BO-SSL explores well. However, it makes sense to spend several iterations exploiting a promising area and then exploring the uncertain domain. From this example it is unclear to me whether BORE and LFBO will get stuck at that local optimum or will explore other areas later. Would it be more informative if the authors plotted similar figures at iterations 10, 20, 30, etc. to see their behavior?
- For Tabular Benchmarks and NATS-Bench, the decision variables are integers, as mentioned in Appendix E. The optimization of Eq. (4) in a discrete search space is not mentioned in the paper.
- Could the authors please check all citations to make sure they are properly ordered? e.g., line 14, line 157, etc.
We thank the reviewer for the constructive feedback to improve our work. We answer the questions and concerns in the following.
I am sort of confused about the relationship between experiments and the motivation. DRE-BO-SSL is motivated by the overconfidence of DRE-BO-SL, giving the impression that the performance of DRE-BO-SL is limited because of the lack of sufficient exploration, which makes those SL-based methods keep improving current best solution while miss real global optima in uncertain area. (skipped)
Like I mentioned before, the motivation is not well-treated in experiments. A better regret curve does not mean the method explores the search space better.
The evaluation criteria on optimization performance are sufficient, while the evaluations on improvements towards exploration are missing.
An empirical regret analysis on the benchmarks used in our work supports our motivation, as improved performance in terms of regret indicates that the Bayesian optimization algorithm effectively balances exploration and exploitation, ultimately leading to lower regret values. If the algorithm fails to explore effectively, improvement in regret values cannot be achieved. Therefore, our performance evaluation aligns with our initial motivation.
A similar discussion can be found in Section 10.1 of Roman Garnett's book [1]. We would like to highlight one paragraph from the reference [1].
Regret is an unavoidable consequence of decision making under uncertainty. Without foreknowledge of the global optimum, we must of course spend some time searching for it, and even what may be optimal actions in the face of uncertainty may seem disappointing in retrospect. However, such actions are necessary in order to learn about the environment and inform future decisions. This reasoning gives rise to the classic tension between exploration and exploitation in policy design: although exploration may not yield immediate progress, it enables future success, and if we are careful, reduces future regret. Of course, exploration alone is not sufficient to realize a compelling optimization strategy, as we must also exploit what we have learned and adapt our behavior accordingly. An ideal algorithm thus explores efficiently enough that its regret can at least be limited in the long run.
[1] Garnett, R. Bayesian Optimization. Cambridge University Press, 2023.
Unclear about the choice of hyperparameters. Although the authors reported results with different values of hyperparameters, those results are inconsistent over benchmarks. For example, Figure 11 could not tell which number of unlabeled points should be chosen Figure 12 does not give a clear guidance about which sampling strategies should be used. Figure 14 seems to me that choosing a subset with size 10 is enough, which is counter-intuitive.
We think that this is a natural characteristic of black-box optimization, in which the hyperparameters depend on specific optimization problems. If a particular method or configuration consistently dominates alternatives, black-box optimization is not challenging. In this work, no trials are conducted before evaluating a black-box objective, so optimal hyperparameters are not used. It is important to note that our methods with default configurations outperform the baseline approaches despite not using optimal hyperparameters.
For the illustration example in Figure 1, it indeed shows that BORE and LFBO focus on area around the best current, and DRE-BO-SSL could explore well. However, it makes sense to spend several iterations to exploit a promising area and then explore uncertain domain. From this example it is unclear to me that if BORE and LFBO will get stuck at that local optimum or will explore other areas later. Is it more informative if the authors could plot similar figures at iterations 10, 20, 30, etc. to see their performance?
The resulting performance comparison between the baseline methods and our methods (Figures 1 and 8) is shown in Figure 2b. Ultimately, our methods outperform the baselines by effectively balancing exploration and exploitation.
We will clarify this in the final manuscript.
For Tabular benchmarks and NATS-Bench, the decision variables are integers as mentioned in Appendix E. The optimization of Eq. (4) in discrete search space is not mentioned in the paper.
Following the previous work [2], we transform continuous variables to discrete variables before evaluating them. We have briefly mentioned this in Section E.3, but we will clarify it further in the final version.
[2] Garrido-Merchan, E. C. and Hernandez-Lobato, D. Dealing with categorical and integer-valued variables in Bayesian optimization with Gaussian processes. Neurocomputing, 2020.
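For illustration, a minimal sketch of this transformation is given below; the variable bounds are assumptions for the sketch rather than our exact benchmark configuration. A continuous candidate proposed by the acquisition optimizer is rounded onto the integer grid before the benchmark is evaluated.

```python
import numpy as np

# assumed integer bounds for two tabular-benchmark-style decision variables
lower = np.array([0, 0])
upper = np.array([5, 9])

def to_discrete(x_cont):
    """Round a continuous candidate onto the integer grid before evaluating it."""
    return np.clip(np.rint(x_cont), lower, upper).astype(int)

x_cont = np.array([2.7, 8.2])  # candidate proposed by the continuous acquisition optimizer
x_eval = to_discrete(x_cont)   # array([3, 8]) is what the benchmark actually sees
```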
Could the authors please check all citations to make sure they are properly ordered?
Thank you for pointing this out. We will fix them in the final paper.
This paper proposes an algorithm for black-box optimisation which is inspired by Bayesian Optimisation by density-Ratio Estimation (BORE, Tiao et al., 2021), a BO algorithm that transforms the usual BO regression problem into classification, and combines it with semi-supervised learning to better balance the exploration-exploitation trade-off. The paper starts by showing that BORE and likelihood-free Bayesian optimisation (LFBO) algorithms in general tend to over-exploit due to their sole reliance on a small amount of labelled training data. It is then shown that a semi-supervised learning approach allows an algorithm to better explore the search space by making use of unlabelled query points and their inherent uncertainty to promote exploration. Two semi-supervised learning algorithms for BO are then proposed based on label propagation (Zhu & Ghahramani, 2002) and label spreading (Zhou et al., 2003) techniques. Empirical results demonstrate favourable performance when compared to BORE, LFBO and traditional BO baselines.
Questions for Authors
- What is denoting in Eq. 3? The derivative of ?
- It seems that the algorithm is mainly using randomly sampled data points from the search space to tune the RBF hyper-parameter in Eq. 5 via maximum entropy. It is not very clear, from a theoretical standpoint, what the extra machinery derived from the unlabelled data is doing for, e.g., the model's epistemic uncertainty or other factors that might promote exploration.
- Why would a simple uniform query distribution on the domain not be suitable for this work? I'm trying to understand why the truncated normal distribution was used, a distribution which might, e.g., concentrate the unlabelled data around the centre of the domain.
Claims and Evidence
One of the main claims of the paper is that BO algorithms based on density-ratio estimation (DRE) are inherently biased towards over-exploitation. Arguments for this claim are mainly presented via illustrations and intuitive arguments with some mild theoretical justification, though no formal proof is provided. In addition, it is claimed that semi-supervised learning helps remediate this problem by reducing over-confidence in classifiers. Illustrations, intuitive arguments and empirical performance improvements are presented as evidence for this claim.
Methods and Evaluation Criteria
One of my main concerns with this paper in its current form is that, to me, semi-supervised learning would only make sense if the data had a meaningful underlying distribution. However, in BO, the "data distribution" is basically determined by the algorithm itself, at least in continuous search spaces. The algorithm gets around some of these issues by randomly sampling unlabelled points from the domain according to a user-defined query-point distribution. However, imposing a distribution over the query points could be problematic in itself for both theoretical (e.g., it imposes additional structure on the problem which may or may not be beneficial) and practical (e.g., how can one define a suitable data distribution for arbitrary domains) reasons. Despite the paper discussing a few of these aspects, there is not much emphasis on the fact that the data in BO does not intrinsically follow a given underlying distribution. So the semi-supervised learning problem "as currently defined" seems ill posed.
Theoretical Claims
No theoretical results are provided with the paper.
Experimental Design and Analysis
The experimental evaluation is quite thorough, involving both synthetic and real-data problems, a reasonable number of representative baselines, and ablation studies.
Supplementary Material
I briefly checked the appendix, especially sections A, C, D and J.
Relation to Existing Literature
The paper potentially advances research in Bayesian optimisation algorithms, which are an essential component of automated machine learning, hyper-parameter tuning and adaptive experimental design frameworks. Specifically, the paper focuses on likelihood-free methods for BO, which avoid defining an explicit prior distribution over the objective function, relying instead on a direct estimator for the algorithm's acquisition function. The algorithms presented in the paper potentially mitigate the over-exploitation problem in traditional approaches in this area by incorporating traditional semi-supervised learning techniques.
Missing Essential References
Nothing else significantly relevant that I could think of.
Other Strengths and Weaknesses
- Assumption 4.1 (cluster assumption): I think this type of assumption is usually applicable to the case of i.i.d. data, not making sense in BO settings where the data is not i.i.d., being actively selected and dependent on all the previous algorithmic choices.
Other Comments or Suggestions
- It is not explicitly defined whether the algorithm is defined for maximisation or minimisation problems, though it is implicitly defined for minimisation due to the density in the numerator of Eq. 1.
- In-text citations should be explicitly mentioned. For example, "While the work (Bergstra et al., 2011) utilizes..." (line 177-178) would be better stated as "While the work of Bergstra et al. (2011) utilizes...", or "While Bergstra et al. (2011) utilizes...". A similar issue is found in line 254.
- In line 192, "as an acquisition function; it is simply denoted as...", the equation that follows is not simply a definition, but the result of a mathematical derivation in Tiao et al. (2021). So saying it's simply denoted as its result does not make sense.
- Implementation details, such as the use of multi-start L-BFGS-B for the optimisation of the acquisition function, can be moved to the appendix.
- Propagated labels are introduced in Eq. 6 before a proper definition.
- Line 74: "in order that it is possible that a cluster assumption (Seeger, 2000) is assumed" should be revised.
We thank the reviewer for the constructive feedback to improve our work. We answer the questions and concerns in the following.
One of my main concerns with this paper in its current form is that, to me, it seems that semi-supervised learning would only make sense if the data had a meaningful underlying distribution. (skipped)
It's not very clear, from a theoretical standpoint, what the extra machinery derived with the unlabelled data is doing for (skipped)
First of all, our algorithm with fixed-size pools does not encounter this issue. As described in Section 1:
To make use of semi-supervised classifiers, we take into account two distinct scenarios with unlabeled point sampling and with a predefined fixed-size pool,
we assume two scenarios. More importantly, real-world applications such as Tabular Benchmarks, NATS-Bench, and the 64D minimum multi-digit MNIST search have access to fixed-size pools. Thus, the "data distribution" determined by the algorithm is defined based on the geometric information of possible query point candidates, where the candidates are from the true distribution. This information helps understand the search spaces of the problems we are tackling. In this context, the entropy (Eq. 6) is computed using the propagated labels of the unlabeled points from fixed-size pools, rather than from randomly-sampled points.
For the concern that the semi-supervised learning problem with randomly-sampled points is ill-posed, unlabeled points help shape the model’s understanding of the search space. This guides exploration by highlighting unexplored regions, even without labels. From an empirical perspective, we think that the unlabeled data can promote more informed and robust exploration. In addition, we would argue that our framework with semi-supervised learning can be understood from a regularization standpoint; eventually it can mitigate the over-exploitation problem.
We will clarify it in the revised version.
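For illustration only, the following minimal sketch shows how propagated labels on unlabeled points can yield an entropy-style uncertainty signal; it uses scikit-learn's LabelSpreading as a stand-in for Eqs. 6-9, and the kernel settings and data are toy assumptions rather than our implementation.

```python
import numpy as np
from sklearn.semi_supervised import LabelSpreading

rng = np.random.default_rng(0)

# a few labeled points: class 1 = "below the threshold", class 0 = otherwise
X_labeled = np.array([[0.1, 0.2], [0.2, 0.8], [0.8, 0.3], [0.9, 0.9]])
z_labeled = np.array([1, 1, 0, 0])

# unlabeled points, e.g. from a fixed-size pool; the label -1 marks them as unlabeled
X_pool = rng.uniform(0.0, 1.0, size=(50, 2))
X = np.vstack([X_labeled, X_pool])
z = np.concatenate([z_labeled, -np.ones(len(X_pool), dtype=int)])

model = LabelSpreading(kernel="rbf", gamma=20.0, alpha=0.2).fit(X, z)

# class probabilities on the pool after spreading, and an entropy-style uncertainty
probs = model.label_distributions_[len(X_labeled):]
entropy = -np.sum(probs * np.log(np.clip(probs, 1e-12, None)), axis=1)
```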
I think this type of assumption is usually applicable to the case of i.i.d. data, not making sense in BO settings (skipped)
We agree with your point. We have already discussed this issue in Section D. We would like to highlight our discussion here:
However, these theoretical results cannot be directly applied in our sequential problem because this work assumes that both labeled and unlabeled data points are independent and identically distributed. Nevertheless, these results can hint a theoretical guarantee on better performance of semi-supervised classifiers with unlabeled data points. Further analysis for the circumstances of Bayesian optimization is left for future work.
This topic will be left for future work.
It is not explicitly defined whether the algorithm is defined for maximisation or minimisation problems
The algorithm is defined for minimization problems in our work. We will clarify it in the revision.
In-text citations should be explicitly mentioned.
Thank you for pointing these out. We will fix them in the final version.
In line 192, "as an acquisition function; it is simply denoted as...", the equation that follows is not simply a definition (skipped)
It is a good point. We will clarify this in the final version by mentioning that it is the derivation of Tiao et al. (2021).
Implementation details (skipped) can be moved to the appendix.
We will move it to the supplementary material.
Propagated labels are introduced in Eq. 6 before a proper definition.
Thank you for pointing this out. Eq. 6 and its corresponding description will be moved to the end of Section 4.1.
Line 74: "in order that it is possible that a cluster assumption (Seeger, 2000) is assumed" should be revised.
It will be revised like this:
to allow for the possibility of adopting a cluster assumption (Seeger, 2000).
What is denoting in Eq. 3? The derivative of ?
Yes, it is the derivative of . We will clarify it in the revision.
Why would a simple uniform query distribution on the domain not be suitable for this work? I'm trying to understand why the truncated normal distribution was used (skipped)
This question is closely related to the boundary issue of Bayesian optimization; see Section 4.4.1 of the Ph.D. thesis of Kevin Swersky [1]. A similar issue is discussed in the previous work [2], which proposes a Bayesian optimization approach with cylindrical kernels concentrating on a region near the center of the search space. The recent work [3] also mentions this boundary issue in Bayesian optimization.
We will clarify it in the final manuscript by extending our discussion presented in Section G and Figure 12.
[1] Swersky, K. Improving Bayesian optimization for machine learning using expert priors. University of Toronto, 2017.
[2] Oh, C., et al. BOCK: Bayesian optimization with cylindrical kernels. ICML, 2018.
[3] Hvarfner, C., et al. Vanilla Bayesian Optimization Performs Great in High Dimensions. ICML, 2024.
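To make the contrast concrete, here is a small illustrative sketch comparing uniform sampling with a truncated normal concentrated near the center of a unit hypercube; the dimensionality, pool size, and scale are assumptions for the sketch, not our exact configuration.

```python
import numpy as np
from scipy.stats import truncnorm

rng = np.random.default_rng(0)
dim, n_unlabeled = 2, 100  # assumed dimensionality and pool size

# uniform sampling over the unit hypercube [0, 1]^dim
uniform_points = rng.uniform(0.0, 1.0, size=(n_unlabeled, dim))

# truncated normal centered at 0.5, truncated to [0, 1]; scale 0.2 is an assumption
loc, scale = 0.5, 0.2
a, b = (0.0 - loc) / scale, (1.0 - loc) / scale  # standardized truncation bounds
truncnorm_points = truncnorm.rvs(a, b, loc=loc, scale=scale,
                                 size=(n_unlabeled, dim), random_state=0)
```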
This paper proposes a semi-supervised extension of density ratio estimation-based Bayesian Optimization (DRE-BO), aiming to address over-exploitation in supervised classifier-based approaches by incorporating unlabeled points through label propagation and spreading techniques. The approach is evaluated on both synthetic and real-world black-box optimization tasks.
The reviewers agree the paper is well-written and that the use of semi-supervised learning within DRE-BO is novel and empirically promising. Key concerns raised include the theoretical justification for the proposed setting (particularly the validity of semi-supervised learning in BO, where data distributions are induced by the algorithm) and the limited analysis of exploration beyond regret curves. Additional concerns involve hyperparameter sensitivity and missing comparisons to broader BO baselines.
The authors provided a thorough and thoughtful rebuttal, clarifying the setup (especially regarding fixed-size pools), empirical justifications for their choices, and additional insights on regret-based exploration evaluation. Reviewers acknowledged these clarifications and maintained weak accept scores, suggesting the paper makes a useful methodological contribution despite some limitations. Therefore, I recommend acceptance.